
Reading single chunk takes 10x longer than remfile #74

Open · rly opened this issue May 27, 2024 · 5 comments
Labels: category: bug (errors in the code or code behavior)

Comments

@rly (Contributor) commented May 27, 2024

Using remfile as below:

import remfile
import h5py
import pynwb
import timeit

# URL to HDF5 NWB file
s3_url = "https://dandiarchive.s3.amazonaws.com/blobs/fec/8a6/fec8a690-2ece-4437-8877-8a002ff8bd8a"
byte_stream = remfile.File(url=s3_url)
file = h5py.File(name=byte_stream)
io = pynwb.NWBHDF5IO(file=file)
nwbfile = io.read()
data_to_slice = nwbfile.acquisition["ElectricalSeriesAp"].data

start = timeit.default_timer()
data_to_slice[0:10,0:384]
end = timeit.default_timer()
print(end - start)

Takes 0.2 seconds on my laptop.

Using lindi as below:

import lindi
import pynwb
import timeit

# URL to LINDI JSON of NWB file
s3_url = "https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/914/6aa/9146aa46-9c01-45be-9d2a-693e6a7bb778"
client = lindi.LindiH5pyFile.from_lindi_file(url_or_path=s3_url)
io = pynwb.NWBHDF5IO(file=client)
nwbfile = io.read()
data_to_slice = nwbfile.acquisition["ElectricalSeriesAp"].data

start = timeit.default_timer()
data_to_slice[0:10,0:384]
end = timeit.default_timer()
print(end - start)

Takes 2.4 seconds on my laptop.

The data chunk size is (13653, 384) with no compression. Nothing stands out in the LINDI JSON. I'm not sure if I am doing something wrong or if there is an inefficiency somewhere in the system.

I'll start looking into it. @magland, do you have any ideas about what might be going on?

rly added the "category: bug" label May 27, 2024
@magland (Collaborator) commented May 27, 2024

@rly

I think what's going on here: h5py can read partial chunks, which is possible in this case because there is no compression, whereas lindi/zarr is set up to always read entire chunks.

According to the lindi.json file, the chunk size is [13653, 384].

Maybe this is a zarr limitation/constraint/feature?
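
For a rough sense of scale, here is a back-of-envelope sketch. It assumes 2-byte int16 samples, which is typical for Neuropixels ap-band data; the actual dtype is recorded in the file.

itemsize = 2                  # assumed int16; verify against the NWB file
chunk_shape = (13653, 384)
slice_shape = (10, 384)

full_chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize  # ~10.5 MB
partial_bytes = slice_shape[0] * slice_shape[1] * itemsize     # 7680 bytes

# With no compression and C-order storage, rows 0:10 of the chunk are a
# contiguous byte prefix, so h5py can fetch just those ~7.7 kB, while a
# whole-chunk read transfers ~10.5 MB (~1365x more data).
print(full_chunk_bytes / partial_bytes)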

@rly (Contributor, Author) commented May 27, 2024

Ah, that makes sense. After changing the slice size to equal the chunk size, lindi now takes only ~2x as long as remfile. Inspecting the execution, it looks like zarr requests the key acquisition/ElectricalSeriesAp/data/0.0 twice. I'm trying to figure out why.
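
One hypothetical way to confirm the duplicate fetch is to wrap the store in a counting proxy. This is an illustrative sketch, assuming the wrapped store can be substituted wherever zarr expects a MutableMapping-style store; CountingStore is not part of lindi.

from collections import Counter
from collections.abc import MutableMapping

class CountingStore(MutableMapping):
    """Proxy that counts how often each key is fetched from a store."""

    def __init__(self, store):
        self._store = store
        self.get_counts = Counter()

    def __getitem__(self, key):
        self.get_counts[key] += 1  # record every fetch, then delegate
        return self._store[key]

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)

After slicing, get_counts would show a count of 2 for acquisition/ElectricalSeriesAp/data/0.0 if the chunk really is requested twice.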

But also in digging through the Zarr code, I found that Zarr might be able to support partial reads:
https://github.com/zarr-developers/zarr-python/blob/b1f4c509abaee1cb8dec18e3a973e1199226011a/src/zarr/v2/core.py#L2054-L2095

Right now, execution goes through the else branch because get_partial_values is not an attribute of LindiReferenceFileSystemStore.
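
For reference, here is a sketch of what such a method might look like on the store. This is hypothetical: it assumes zarr's (key, (start, length)) key-ranges convention and a _resolve_reference helper that maps a key to the (url, offset, size) triple in the reference file system; neither is part of the actual LindiReferenceFileSystemStore API, and the exact hook and signature should be checked against the zarr version in use.

import requests

def get_partial_values(self, key_ranges):
    # key_ranges: iterable of (key, (start, length)) pairs, where length
    # may be None to mean "through the end of the value". Negative starts
    # and other corner cases of the store spec are omitted here.
    results = []
    for key, (start, length) in key_ranges:
        url, offset, size = self._resolve_reference(key)  # hypothetical helper
        begin = offset + (start or 0)
        end = offset + size - 1 if length is None else begin + length - 1
        r = requests.get(url, headers={"Range": f"bytes={begin}-{end}"})
        r.raise_for_status()
        results.append(r.content)
    return results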

@magland (Collaborator) commented May 27, 2024

Ah. It will be good to figure out whether the duplicate request can be avoided... and/or whether we should implement some caching for this type of situation.

Do you think we should set the get_partial_values attribute somehow?

@rly (Contributor, Author) commented May 28, 2024

> Do you think we should set the get_partial_values attribute somehow?

Yeah, I think that would be nice, but not urgent. For most large reads, I think it would not make a big difference, because the read will consist mostly of full chunks plus a partial chunk at each boundary. And most big datasets are compressed, in which case chunks generally have to be fetched whole to be decoded anyway.
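
As a quick illustration (rows_read is an arbitrary example size):

import math

chunk_rows = 13653
rows_read = 100_000

# Only the chunk at the slice boundary is read partially; everything
# before it is a full-chunk read either way.
chunks_touched = math.ceil(rows_read / chunk_rows)  # 8 chunks touched
full_chunks = rows_read // chunk_rows               # 7 of them read in full
print(chunks_touched, full_chunks)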

If you have time, it would be great if you can take a look but no pressure. Otherwise, I'll try to take a look at it next week.

@magland (Collaborator) commented May 28, 2024

Makes sense. I'm not going to work on it right now.
