Split large unchunked contiguous HDF5 datasets into smaller chunks #84

Open
wants to merge 12 commits into main

Conversation

@magland (Collaborator) commented Jul 9, 2024

This is a third attempt at #82 and #83.

This solves the following important problem. Some of the large (raw ephys) datasets on DANDI are uncompressed and stored contiguously as a single very large block within the NWB file. For example, 000935. In fact, the default in HDF5 is uncompressed, contiguous (unchunked) storage. This is a problem for Zarr with remote files because (as far as I can see) there is no such thing as a partial read of a Zarr chunk. So if I try to randomly access a time segment in one of these arrays, the entire chunk (i.e., the entire dataset) needs to be downloaded.

This PR modifies LindiH5ZarrStore so that such datasets are effectively split into smaller chunks within the reference file system.
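
For illustration, here is a rough sketch of the idea (not the PR's actual code; the function name, chunk-key format, and reference layout are simplified assumptions): a contiguous dataset's single byte range is subdivided along the first dimension into per-chunk byte-range references.

```python
import numpy as np
import h5py

def contiguous_chunk_refs(h5_path, dataset_name, rows_per_chunk):
    refs = {}
    with h5py.File(h5_path, "r") as f:
        dset = f[dataset_name]
        # Byte offset of the data block in the file (available for contiguous storage)
        offset = dset.id.get_offset()
        row_nbytes = dset.dtype.itemsize * int(np.prod(dset.shape[1:], dtype=np.int64))
        n_rows = dset.shape[0]
        for i, start_row in enumerate(range(0, n_rows, rows_per_chunk)):
            n = min(rows_per_chunk, n_rows - start_row)
            # Zarr-style chunk key, chunked along dim 0 only, e.g. "0.0", "1.0", ...
            key = dataset_name.strip("/") + "/" + ".".join([str(i)] + ["0"] * (dset.ndim - 1))
            # Reference-file-system style entry: [file, byte offset, byte length]
            refs[key] = [h5_path, offset + start_row * row_nbytes, n * row_nbytes]
    return refs
```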

@codecov-commenter commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 80.97%. Comparing base (731fcb5) to head (3b9f2e9).
Report is 9 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| lindi/LindiH5ZarrStore/LindiH5ZarrStore.py | 85.03% | 19 Missing ⚠️ |
| lindi/LindiH5ZarrStore/_util.py | 75.00% | 1 Missing ⚠️ |
| lindi/LindiH5pyFile/LindiH5pyFile.py | 75.00% | 1 Missing ⚠️ |
| ...ndi/LindiH5pyFile/LindiReferenceFileSystemStore.py | 97.43% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
+ Coverage   79.43%   80.97%   +1.54%     
==========================================
  Files          30       30              
  Lines        2256     2423     +167     
==========================================
+ Hits         1792     1962     +170     
+ Misses        464      461       -3     


@magland (Collaborator, Author) commented Jul 10, 2024

Ran into a difficulty that required a hacky solution in this commit: a6ebaac

The problem is with the final chunk of the subdivided contiguous dataset. Unless the chunk size divides evenly into the total shape (rarely the case), the final chunk in the HDF5 file will be smaller than the chunk size reported in .zarray. That's fine for HDF5. But for Zarr, all the chunk files need to be the same (full) size; the data beyond the array bounds in the final chunk is simply discarded on read. This is a problem for the reference file system, since there is no way to specify that a particular byte range (in the HDF5 file) needs to be padded with extra zeros. After thinking about it for a while, I concluded that we either need to

(a) Expand the capabilities of the reference file system to specify this extra padding (clumsy and unnatural)
(b) Apply the following hack (which is what I did):

For the special Zarr store that is backed by the reference file system, automatically detect when the chunk data is smaller than expected based on the .zarray metadata, and in that case pad the chunk with the necessary zeros (a sketch of the idea follows below).
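
To make the padding idea concrete, here is a minimal sketch (not the code in this PR; the helper name and placement are hypothetical) of what the zero-padding looks like when a store returns a short final chunk:

```python
import numpy as np

def pad_chunk_bytes(chunk_bytes: bytes, chunk_shape, dtype) -> bytes:
    # Number of bytes Zarr expects for a full chunk, per the .zarray metadata
    expected_nbytes = int(np.prod(chunk_shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    if len(chunk_bytes) < expected_nbytes:
        # Final chunk of the contiguous HDF5 dataset is short: pad with zeros
        chunk_bytes = chunk_bytes + b"\x00" * (expected_nbytes - len(chunk_bytes))
    return chunk_bytes
```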

I needed to implement this for both of the custom Zarr stores: LindiReferenceFileSystemStore and LindiH5ZarrStore.

It's hard to remember what the difference between these is, so I'll summarize:

LindiReferenceFileSystemStore: This is when you have a .lindi.json file (reference file system) and you want to represent it as LINDI Zarr and then wrap that in LindiH5pyFile.

LindiH5ZarrStore: This is when you have an HDF5 file (e.g., a remote file on DANDI) and you want to represent it as LINDI Zarr, usually for the purpose of generating a .lindi.json file.

@magland requested a review from rly on July 10, 2024 at 12:00
@bendichter

I ran into this problem also. If I understand your solution, you are doing the same thing I did, which is to chunk contiguous arrays along the 0th dimension. So if your array is (time × channel), you are imposing artificial chunking along time and not along channels. This solution does work, but it can lead to problems when you need to index the other dimensions. E.g., if you want to access all time points for a given channel, you'll need to read the entire dataset for all channels, and there is no way to adjust the chunking to make this more efficient, since this solution only allows chunking along the 0th dimension.

I think a long-term solution might be to build contiguous array support into Zarr directly. It can be expressed easily in Zarr (chunks=null, filters=null), and it should be pretty easy to implement since it's really just a simple memmap.
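
As a rough illustration of the "it's really just a simple memmap" point (only a sketch of the proposed idea, assuming a local uncompressed contiguous dataset whose byte offset in the file is already known, e.g. from h5py's dset.id.get_offset(); this is not an existing Zarr feature):

```python
import numpy as np

def read_contiguous_slice(path, byte_offset, shape, dtype, sl):
    # Map the contiguous, uncompressed block directly, with no chunking involved
    a = np.memmap(path, mode="r", dtype=dtype, offset=byte_offset, shape=shape)
    # Copy just the requested slice into memory
    return np.array(a[sl])
```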

@magland (Collaborator, Author) commented Jul 17, 2024

@bendichter
Just want to be clear: are you proposing modifying Zarr? So this would be something external to lindi? And once that's in place, we would no longer need the hack in this PR?

Also, see my comment above about what I needed to do for the final chunk (so it wasn't completely straightforward, unfortunately).

@bendichter

Yes, I am suggesting we try to fix this on the Zarr side. I think it would be worth discussing with them and seeing if they would be open to this feature.

@magland (Collaborator, Author) commented Jul 17, 2024

Focusing on the case of a remote file, where each read of a byte range requires a separate HTTP request: I believe your suggestion only applies when you are loading a single time frame and a subset of channels (assume the first dimension is time and the second dimension is channels). If you are trying to load a subset of time frames and a subset of channels, then you can't get around the fact that the data are one contiguous block, because it is inefficient to make a separate request for each time frame; it's usually still best to load the data for all the channels and then slice locally. Does this make sense?
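
To spell out the byte-range arithmetic behind this point (a hypothetical helper, assuming a C-ordered time × channel array starting at a known byte offset in the remote file): a slice of time frames across all channels maps to a single contiguous byte range, i.e. one HTTP range request, whereas selecting a subset of channels within that slice would require many small requests.

```python
def time_slice_byte_range(byte_offset, n_channels, itemsize, t_start, t_stop):
    # One row of the (time x channel) array occupies n_channels * itemsize bytes
    row_nbytes = n_channels * itemsize
    start = byte_offset + t_start * row_nbytes
    stop = byte_offset + t_stop * row_nbytes
    # Fetch [start, stop) with a single HTTP range request, then slice channels locally
    return start, stop
```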

@bendichter

OK, I see your point. When accessing data on disk you can use strides, but when accessing remote data the HTTP request needs to cover a contiguous byte range, so you'll need to download that data and bring it into RAM regardless. You might be able to do some caching so you don't need to hold the entire contiguous dataset in RAM, but you'll still be forced to download the entire dataset, which is the expensive part anyway. You could effectively download strides by sending multiple requests, but then you'd be sending too many requests.

There would still be two benefits of supporting direct reads of non-chunked data in remote settings. I'm not sure either of them is a game-changer, but I thought they were worth mentioning:

  1. When requesting small ranges, you don't need to obey arbitrary chunk boundaries. With artificial chunking, if you chunk the 0th dimension by 100 and then want to read items [99:101, ...], you'll need to request the entire first and second chunks. With a direct read, you could request just the range of interest, which would be much less data (see the worked example after this list). You can think of cases where this would save a lot of bandwidth, but I'm not sure how often those cases would come up in practice.
  2. When requesting large ranges, you can set the request size on the fly. Some systems might want many smaller range requests while others might want fewer, larger requests. When you impose chunks on the fsspec file, you lose the flexibility to choose this on the fly. If you simply say the data is contiguous, the system can decide how it wants to request the data, and it does not need to be limited to multiples of the chunk size.
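
A quick worked example of benefit 1 (the numbers are illustrative, not from the PR): consider a (time × 384 channel) float32 array with artificial chunking of 100 rows along time.

```python
rows_per_chunk, n_channels, itemsize = 100, 384, 4

# Reading rows [99:101] via chunks touches two full chunks...
chunked_bytes = 2 * rows_per_chunk * n_channels * itemsize  # 307,200 bytes

# ...whereas a direct byte-range read touches only the two rows of interest
direct_bytes = 2 * n_channels * itemsize                    # 3,072 bytes
```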
