(feat): experimental `read_backed` method for `zarr` + `hdf5` via `read_dispatched` #947

ilan-gold · 2023-03-08T14:22:59Z

Creates a new experimental `AnnDataBacked` class and a helper `read_backed` method for lazily reading on-disk zarr and hdf5 datasets, with a focus on usage over the internet. The priority is fast metadata reads and quick data fetching rather than optimized analysis performance.

Fixes #951 as well
Probably fixes #981 if an "experimental" PR can do that.

The main highlights of this PR in `experimental.read_backed`:

- xarray for all dataframes
- lazy loading by default for all elements (dense arrays, sparse arrays, dataframes) except AwkwardArrays
- fully backed out of the box: works with both h5ad and zarr, and with both on-disk and remote storage (changing subelements should work if you want them to be, e.g., numpy arrays, although this isn't tested because the focus here is really on reading)
- a simple `to_memory` function on the `AnnDataBacked` class that brings whatever you want into memory for use in scanpy, with optional exclude keys that speed up the download by restricting it to exactly what you need
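To illustrate the `to_memory`-with-exclude-keys idea, here is a minimal sketch in plain Python. It is not the PR's `AnnDataBacked` implementation: `LazyElement`, `ToyBacked`, the `exclude` parameter name, and the element keys are all hypothetical stand-ins chosen for the example.

```python
from typing import Any, Callable, Dict, Iterable


class LazyElement:
    """Hypothetical stand-in for an element that lives on disk or remotely."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self.load_count = 0  # how many times the data was actually fetched

    def load(self) -> Any:
        self.load_count += 1
        return self._loader()


class ToyBacked:
    """Hypothetical container; only mimics the shape of a backed object."""

    def __init__(self, elements: Dict[str, LazyElement]) -> None:
        self.elements = elements

    def to_memory(self, exclude: Iterable[str] = ()) -> Dict[str, Any]:
        # Fetch every element except the excluded keys, which are never touched.
        skipped = set(exclude)
        return {
            key: elem.load()
            for key, elem in self.elements.items()
            if key not in skipped
        }


backed = ToyBacked(
    {
        "X": LazyElement(lambda: [[1, 0], [0, 2]]),
        "obs": LazyElement(lambda: {"cell_type": ["a", "b"]}),
        "obsm/X_pca": LazyElement(lambda: [[0.1], [0.2]]),
    }
)

# Nothing has been fetched yet; only the non-excluded keys are loaded here.
in_memory = backed.to_memory(exclude=["obsm/X_pca"])
```

The excluded element keeps a `load_count` of zero, which is the point of the option: data you do not ask for is never downloaded.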

Outside of this PR but included in the overall work:

- views of views for backed sparse matrices (and more targeted/efficient reading)
- a common abstract AnnData class that begins to define a sensible contract for new classes to build upon
- backed `X` for zarr

To fix error reporting, I've put the code that catches errors during IO in a decorator on top of the `read_elem` method. Since the decorator is sometimes used on functions, I modified it to handle the signature of both a method and a function. What's weird is that the decorator is sometimes passed the arguments of a method for something that is named like a method but is actually a function. So that still needs to be fixed.
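One way to make such a decorator indifferent to whether it wraps a function or a method is to look for the element among the positional arguments instead of assuming its position. The sketch below assumes that approach; `Elem`, `report_key_on_error`, and `Reader` are hypothetical names for illustration and are not the code in this PR.

```python
import functools
from dataclasses import dataclass


@dataclass
class Elem:
    """Hypothetical stand-in for an h5py/zarr node with a path-like name."""

    name: str


def report_key_on_error(func):
    """Re-raise any error with the name of the element that was being read."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Works for f(elem, ...) and for method(self, elem, ...): pick the
        # first positional argument that is an element, wherever it sits.
        elem = next((a for a in args if isinstance(a, Elem)), None)
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            key = elem.name if elem is not None else "<unknown>"
            raise RuntimeError(f"Error raised while reading key {key!r}") from exc

    return wrapper


@report_key_on_error
def read_plain(elem):
    raise ValueError("corrupt chunk")


class Reader:
    @report_key_on_error
    def read_elem(self, elem):
        raise ValueError("corrupt chunk")


def error_message(call, *args):
    """Run `call` and return the message of the RuntimeError it raises, if any."""
    try:
        call(*args)
    except RuntimeError as err:
        return str(err)
    return None


func_msg = error_message(read_plain, Elem("/obs"))
method_msg = error_message(Reader().read_elem, Elem("/X"))
```

Because the wrapper never inspects `args[0]` specifically, the "named like a method but is a function" case goes through the same path as the other two.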

ilan-gold · 2023-08-04T10:47:59Z

Before this can be merged/reviewed, there are several blocking PRs:
#949
#765
ivirshup#4

ilan-gold · 2023-08-11T11:49:03Z

Open question: if we want out-of-core indices on load (i.e., `obs_names`) and we decide to make `obs` lazily generated, we need to figure out:

- Is e.g. `obs` immutable or mutable? (`cached_property` or just `property`)
- If mutable, how do we do caching?

See pydata/xarray#1650 for the "real" way forward. This just mitigates the initial load of indices when calling `read_backed`, because creating an xarray object entails loading the coords. Ideally, that just wouldn't happen. Making our xarray objects lazily generated (on top of already being "lazily loaded") is a way to circumvent this, i.e., users only pay for the indices when accessing e.g. `obs`.
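The `property` versus `cached_property` trade-off can be sketched with the standard library alone. `LazyAxis` and its `fetch_index` callable are hypothetical; the only real API used is `functools.cached_property`, whose cached value can be overwritten by assignment or cleared with `del`.

```python
from functools import cached_property


class LazyAxis:
    """Hypothetical holder that only pays for the index when `obs` is accessed."""

    def __init__(self, fetch_index):
        self._fetch_index = fetch_index  # hypothetical callable hitting the store
        self.fetches = 0

    def _load(self):
        self.fetches += 1
        return self._fetch_index()

    @property
    def obs_uncached(self):
        # Plain property: every access goes back to the store.
        return self._load()

    @cached_property
    def obs_cached(self):
        # Cached: the first access fetches, later accesses reuse the result.
        return self._load()


axis = LazyAxis(lambda: ["cell_0", "cell_1"])  # construction fetches nothing

axis.obs_uncached
axis.obs_uncached
uncached_fetches = axis.fetches  # two accesses, two fetches

axis.obs_cached
axis.obs_cached
cached_fetches = axis.fetches - uncached_fetches  # two accesses, one fetch

# If `obs` is treated as mutable, the cache has to be invalidated explicitly.
del axis.obs_cached
axis.obs_cached  # fetched again after invalidation
```

This leaves the caching question from the list above open: with a mutable `obs`, something has to decide when to run that `del`.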

ilan-gold · 2023-11-21T12:25:54Z

Note to self: we need the `sparse_dataset` changes (I think) because they allow for getting a representation of the matrix without actually reading it into memory. That is, for `X` in the current AnnData object, the following happens:

anndata/anndata/_core/anndata.py

Lines 667 to 685 in 3e340e1

    
    def X(self) -> np.ndarray | sparse.spmatrix | ArrayView | None:
        """Data matrix of shape :attr:`n_obs` × :attr:`n_vars`."""
        if self.isbacked:
            if not self.file.is_open:
                self.file.open()
            X = self.file["X"]
            if isinstance(X, h5py.Group):
                X = sparse_dataset(X)
            # This is so that we can index into a backed dense dataset with
            # indices that aren’t strictly increasing
            if self.is_view:
                X = _subset(X, (self._oidx, self._vidx))
        elif self.is_view and self._adata_ref.X is None:
            X = None
        elif self.is_view:
            X = as_view(
                _subset(self._adata_ref.X, (self._oidx, self._vidx)),
                ElementRef(self, "X"),
            )

This is problematic because it means that accessing `X` immediately reads it into memory. This is why I moved the indexing onto the `BaseCompressedSparseDataset` class here.
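The general idea of keeping indexing on the dataset object, so that subsetting only composes indices and data is read on explicit request, can be sketched as follows. This is a toy row-wise model, not the `BaseCompressedSparseDataset` code; `LazyRows`, `read_row`, and `store` are hypothetical.

```python
class LazyRows:
    """Hypothetical lazy row view: indexing composes indices, nothing is read."""

    def __init__(self, read_row, rows):
        self._read_row = read_row  # callable that fetches one row from the store
        self._rows = list(rows)    # row numbers in the underlying store

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        # Resolve the request against this view's rows, so a view of a view
        # still points straight at the underlying store.
        if isinstance(idx, slice):
            picked = self._rows[idx]
        else:
            picked = [self._rows[i] for i in idx]
        return LazyRows(self._read_row, picked)

    def to_memory(self):
        # Only now are the selected rows actually fetched.
        return [self._read_row(i) for i in self._rows]


reads = []  # records which rows were fetched from the "store"
store = {0: [1, 0], 1: [0, 2], 2: [3, 0], 3: [0, 4]}


def read_row(i):
    reads.append(i)
    return store[i]


X = LazyRows(read_row, range(4))
view = X[1:][[2, 0]]         # view of a view: store rows 3 and 1
reads_before = list(reads)   # still empty, indexing read nothing
dense = view.to_memory()     # fetches exactly the two selected rows
```

Only the rows selected by the final view are fetched, in the order the view asks for them, which is the more targeted reading the comment is after.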

ilan-gold · 2024-08-26T15:36:40Z

No need as #1247 seems like it will be the way to go.

ivirshup and others added 30 commits April 29, 2022 17:38

Start backed sparse support for zarr

2f73576

Merge branch 'master' into zarr-sparse-array

df160f0

Merge branch 'master' into zarr-sparse-array

7983291

Fix sparse_to_dense

a5e0311

Merge branch 'master' into zarr-sparse-array

b28448c

[pre-commit.ci] auto fixes from pre-commit.com hooks

5e3cb02

Start write_dispatched

3ee693c

(wip): remote reading via new AxisArrays and AnnData object

7e0825a

(chore): rename

0b87230

(chore): venv to .gitignore

f2de515

(fix): concatenation test

7bc0f76

Revert changes to some backwards compat tests

7a12515

Fixes after merge

c3a5e07

Clean up error reporting + remove commented out code

f22660d

(wip): semi-working demo?

93b8778

(chore): compat for old index key

6d32d8e

(chore): only use backed

3cf7036

(feat): add custom to_df method

d99dd56

(feat): get dataframe access working properly

83aa3ab

(chore): remove TODO

66c86fe

(chore): write up to-do's

fca6fe5

(chore): add head method

2f31f91

(chore): add better check for to_df

2fad98e

(feat): categorical zarr array.

c07e71a

[pre-commit.ci] auto fixes from pre-commit.com hooks

a785310

(feat): add categorical array to the read_remote

85a9006

(chore): remove todo

568241a

(chore): remove commented out parts

a5bd7dc

(chore): remove more unused methods

c1b090c

flying-sheep assigned ilan-gold Jul 27, 2023

ilan-gold added 3 commits July 30, 2023 21:45

(feat): obsm/varm xr.Dataset

703812c

(chore): refactor ZarrArray subset function

8ce994b

Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched

e480700

ilan-gold force-pushed the ig/read_remote_dispatched branch from 4d60586 to e480700 Compare July 30, 2023 19:59

Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched

cee7f6d

ilan-gold force-pushed the ig/read_remote_dispatched branch from da5144d to cee7f6d Compare July 30, 2023 20:06

ilan-gold added 3 commits August 1, 2023 09:40

Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched

b41a9a5

(fix): backed for experimental merge.py

6fe7016

(fix): pyproject.toml missing comma

c3f6935

ilan-gold force-pushed the ig/read_remote_dispatched branch from cd3737e to c3f6935 Compare August 1, 2023 07:50

[pre-commit.ci] auto fixes from pre-commit.com hooks

deced7c

Zethson reviewed Aug 3, 2023

ilan-gold and others added 7 commits August 8, 2023 11:31

Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched

8617bd1

[pre-commit.ci] auto fixes from pre-commit.com hooks

48a134b

(chore): remove pre-commit deps

2806a9f

Merge branch 'ig/read_remote_dispatched' of github.com:scverse/anndat…

e35603a

…a into ig/read_remote_dispatched

(fix): don't let ruff change == for DataFrame to is

516f984

(chore): move xarray to test deps

60b0ae6

(style): change folder structure

9d53307

Zethson added the skip-gpu-ci label Aug 11, 2023

ilan-gold force-pushed the ig/read_remote_dispatched branch from be37320 to 9d53307 Compare September 4, 2023 13:05

Merge branch 'ig/refactor_base_class' into ig/read_remote_dispatched

3a428f4

Neah-Ko mentioned this pull request Nov 8, 2023

Patch AnnData.__sizeof__() for backed datasets #1230

Merged

3 tasks

ilan-gold mentioned this pull request Nov 30, 2023

(feat): xarray with experimental backed reading #1247

Open

3 tasks

ilan-gold closed this Aug 26, 2024
