Dask dataframe support #823

Draft: wants to merge 5 commits into base: main

rahulbshrestha
Contributor

@rahulbshrestha rahulbshrestha commented Sep 28, 2022

This PR introduces support for Dask dataframes in anndata.

TODOs:

  • Indexing
  • Writing / Reading
  • Concatenation
  • assert_equal for tests
  • adata.to_memory() / adata.copy()

Related PR (Dask array support): #813
Contributors: @rahulbshrestha @syelman

@codecov

codecov bot commented Sep 28, 2022

Codecov Report

Merging #823 (af84fdd) into master (919d34c) will decrease coverage by 0.15%.
The diff coverage is 57.14%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #823      +/-   ##
==========================================
- Coverage   83.49%   83.33%   -0.16%     
==========================================
  Files          34       32       -2     
  Lines        4441     4333     -108     
==========================================
- Hits         3708     3611      -97     
+ Misses        733      722      -11     
Impacted Files                Coverage Δ
anndata/compat/__init__.py    85.96% <28.57%> (-2.45%) ⬇️
anndata/tests/helpers.py      95.12% <85.71%> (-0.34%) ⬇️
anndata/_core/merge.py        93.71% <0.00%> (-0.28%) ⬇️
anndata/__init__.py
anndata/utils.py

@ivirshup
Member

ivirshup commented Oct 4, 2022

So, I've looked into the length issue a bit. It looks like there is still no way to include information on the number of rows in a dask dataframe. This is tracked in multiple places in the dask repo, but this issue looks like the most recent: dask/dask#5633

It's possible we can do something clever to work around this, like persisting the index of the dataframe and doing length checks there. Alternatively, we could skip length checks on dask dataframes until we try to compute, and raise an error then.

@ryan-williams, any chance you have thoughts here? Is it best to just wait on dask some more?

@ivirshup
Member

ivirshup commented Oct 6, 2022

Here is a gist with some code for reading a dataframe saved in AnnData into a dask DataFrame.

@rahulbshrestha rahulbshrestha marked this pull request as draft October 13, 2022 13:50
@ilan-gold
Contributor

@ivirshup I've got a branch with your gist. I can start an issue for this, but so far what I see is that:

  1. calling len(df) when df is a dask dataframe loads the whole dataframe into memory
  2. the index has no is_unique attribute

Both seem manageable as PRs into dask (if they're actually issues), but I just figured I'd document this somewhere.
