memory error #82

Open
aCoalBall opened this issue Oct 29, 2023 · 13 comments
@aCoalBall

Hi there,

I'm using the pod5 Python API to load signals into another file, but I hit this error when I iterate through the pod5 reader.

  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/extract.py", line 88, in load_pod5_signals_and_save
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in signal
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in <listcomp>
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 380, in _find_signal_row_index
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 1070, in _get_signal_batch
  File "pyarrow/ipc.pxi", line 974, in pyarrow.lib._RecordBatchFileReader.get_batch
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 590, in read
MemoryError

I allocated 256 GB of memory for this task. As far as I can tell, I only load the signal for one read at a time and never hold all the signals from the pod5 file in memory.

Here is the code:

    pod5_reads = pod5_reader.reads(selection=read_ids)
    for pod5_read in pod5_reads:
        read_id = str(pod5_read.read_id)
        if read_id in bam_info_map:
            read, cpgs = bam_info_map[read_id]
            if len(cpgs) > 0:
                cpg_chunks = generate_training_data(read, cpgs, pod5_read.signal)
                for chunk in cpg_chunks:
                    chunk = np.asanyarray(chunk, dtype=object)
                    np.save(f, chunk, allow_pickle=True)
@HalfPhoton
Collaborator

HalfPhoton commented Oct 29, 2023

Hi @aCoalBall ,

Can you show some more of the code - specifically how you're handling the pod5_reader?

Also, are you processing many files, or some very large files?

@aCoalBall
Author

Hi HalfPhoton,

I create the pod5_reader with:

pod5_reader = pod5.Reader(pod5_path) # pod5_path is the str of path

And I am processing a single 141 GB pod5 file.

@aCoalBall
Author

By the way, it doesn't seem to be caused by running out of memory. I tried allocating different amounts of memory, but it always shut down at the same position (around the 300,000th read).

@0x55555555
Collaborator

Interesting.

@aCoalBall can you confirm it still crashes if you don't do your downstream processing?

I wouldn't expect the code above to retain the signals in memory unless you hold them in your training data.

So, this should work:

    for pod5_read in pod5_reads:
        read_id = str(pod5_read.read_id)
        if read_id in bam_info_map:
            read, cpgs = bam_info_map[read_id]
            if len(cpgs) > 0:
                pass

Can you confirm? Otherwise, could you provide the complete code so we can investigate further?

  • George

@aCoalBall
Author


@jorj1988
Hi George,

I tried what you suggested; the following code runs fine.

    for pod5_read in pod5_reads:
        read_id = str(pod5_read.read_id)
        if read_id in bam_info_map:
            read, cpgs = bam_info_map[read_id]
            if len(cpgs) > 0:
                pass

However, as soon as I access pod5_read.signal (e.g. signal = pod5_read.signal or x = type(pod5_read.signal)), it raises the memory error.

    for pod5_read in pod5_reads:
        read_id = str(pod5_read.read_id)
        if read_id in bam_info_map:
            read, cpgs = bam_info_map[read_id]
            if len(cpgs) > 0:
                signal = pod5_read.signal

Traceback (most recent call last):
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/prepare_data.py", line 67, in <module>
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/prepare_data.py", line 52, in main
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/projects/myNanoporeProject/extract/extract.py", line 95, in load_pod5_signals_and_save
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in signal
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 284, in <listcomp>
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 380, in _find_signal_row_index
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 1070, in _get_signal_batch
  File "pyarrow/ipc.pxi", line 974, in pyarrow.lib._RecordBatchFileReader.get_batch
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "/rshare1/ZETTAI_path_WA_slash_home_KARA/home/coalball/venvs/methbert2_venv/lib/python3.10/site-packages/pod5/reader.py", line 590, in read
MemoryError

@0x55555555
Collaborator

OK, thanks.

What is your environment like? What OS are you on, and how much virtual and physical memory is available?

Thanks,

  • George

@aCoalBall
Author


Hi @jorj1988

Here is some basic info about the system.

[Screenshot: system information, 2023-10-30]

@0x55555555
Collaborator

0x55555555 commented Oct 30, 2023

And can you confirm if you have a virtual memory limit on the system?

e.g.:

george@host:~$ ulimit -v
unlimited
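
For reference, the same limit can be checked from inside the Python process itself using the standard library's resource module (Unix only); this is a quick sketch, not part of the pod5 API:

```python
import resource

# RLIMIT_AS is the address-space (virtual memory) limit that `ulimit -v`
# controls; resource reports it in bytes, while `ulimit -v` prints KiB.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if soft == resource.RLIM_INFINITY:
    print("virtual memory: unlimited")
else:
    print(f"virtual memory soft limit: {soft / 2**30:.1f} GiB")
```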

@aCoalBall
Author

@jorj1988

Yes, there is a virtual memory limit:

[coalball@gc066 ~]$ ulimit -v
134217728
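
Since `ulimit -v` reports KiB, that limit works out to 128 GiB, which is smaller than the 141 GB pod5 file. A quick sanity check of the arithmetic (file size assumed to be decimal gigabytes):

```python
# `ulimit -v` reports the address-space limit in KiB.
limit_bytes = 134217728 * 1024      # 128 GiB of address space
file_bytes = 141 * 10**9            # the 141 GB pod5 file

print(f"limit: {limit_bytes / 2**30:.0f} GiB, file: {file_bytes / 2**30:.1f} GiB")
# If the reader memory-maps the whole file, the mapping alone would need
# more address space than the limit allows, before any other allocation.
assert file_bytes > limit_bytes
```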

@HalfPhoton
Collaborator

Hi @aCoalBall ,

Would you be able to test the following code to ascertain whether this is a virtual memory issue?

pod5_reader = pod5.Reader(pod5_path)
pod5_reader._signal_handle._reader = None
pod5_reader._signal_handle._reader = pod5_reader._signal_handle._open_without_mmap()
for pod5_read in pod5_reads:
    # remaining code here...

@aCoalBall
Author

Hi @HalfPhoton ,
I tried but the error is still there...

@0x55555555
Collaborator

Hi @aCoalBall,

I've been running the code below on a system similar to yours and haven't seen a crash yet. Can you confirm this does crash for you?

My input file is 1.4 TB, with 64 GB of physical memory and a virtual memory limit set to the same as yours (134217728).

import pod5
import sys

print("Open file")
pod5_reader = pod5.Reader(sys.argv[1])
print("Opened file")


pod5_reads = pod5_reader.reads()
for i, pod5_read in enumerate(pod5_reads):
    signal = pod5_read.signal
    if i % 10000 == 0:
        print(f"at read {i}")

I have taken out several bits of your script to put this example together... maybe we need to add some of them back to make it crash?

Thanks,

  • George

@aCoalBall
Author


Hi @jorj1988 ,

Actually, even the simplest iteration triggers the error:

import pod5

pod5_path = '/home/coalball/projects/pod5/output.pod5'
pod5_reader = pod5.Reader(pod5_path)
pod5_reader._signal_handle._reader = None
pod5_reader._signal_handle._reader = pod5_reader._signal_handle._open_without_mmap()
for read in pod5_reader.reads():
    read.signal

I'm running this task on an HPC cluster, but I don't know how it is configured. I'm trying to switch to a different compute platform now.
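
If the cluster permits it, one thing worth trying is raising the soft address-space limit from inside the job before opening the reader. This is a sketch under the assumption that the hard limit is higher than the soft one; many schedulers enforce the hard limit, in which case this won't help:

```python
import resource

# Raise the soft virtual-memory (address-space) limit to the hard limit.
# This cannot exceed whatever hard limit the scheduler or administrator
# has enforced; if soft == hard already, it is a no-op.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (hard, hard))
```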
