Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster filtering on sparse matrices #2772

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

gjhuizing
Copy link

@gjhuizing gjhuizing commented Nov 30, 2023

  • Closes # no existing issue
  • Tests included or not required because: No new tests
  • Release notes not necessary because: I did not write release notes

Hi :)

I am proposing a change that speeds up filter_cells (x1000 speedup) and filter_genes (x2 speedup) for CSR sparse matrices. On my personal machine for 1M cells, sc.pp.filter_cells(adata, min_genes=xx) runs in 1ms instead of 10s currently. The speedup should be even stronger on sparser modalities like ATAC.

In spirit, this simply replaces np.sum(X > 0, axis=axis) with X.getnnz(axis=axis), which is much more efficient. But the axis argument in getnnz in csr_array may be deprecated. I think it should still be fine with csr_matrix, but since I don't know for sure I manually implemented it for the CSR case as in scipy/scipy#19405 .

What do you think?

Regarding getnnz: Of course it would be nicer to be able to write .getnnz(axis=axis), which extends beyond CSR to other sparse matrices. Can we assume that we're getting sparse matrices and not sparse arrays ?

Pinging @dschult from the Scipy issue liked above, who mentioned:

I'm pretty sure that a reasonable and commonly occuring use-case would be enough to make the developers include this feature somehow.

(edited because I confused csr_array and csr_matrix)

Copy link

codecov bot commented Nov 30, 2023

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.47%. Comparing base (bc349b9) to head (a0670c7).
Report is 143 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2772      +/-   ##
==========================================
+ Coverage   72.46%   72.47%   +0.01%     
==========================================
  Files         111      111              
  Lines       12418    12430      +12     
==========================================
+ Hits         8999     9009      +10     
- Misses       3419     3421       +2     
Files Coverage Δ
scanpy/preprocessing/_simple.py 82.29% <88.88%> (+0.02%) ⬆️

@gjhuizing gjhuizing mentioned this pull request Nov 30, 2023
3 tasks
@gjhuizing
Copy link
Author

gjhuizing commented Nov 30, 2023

This made me realize .getnnz is used elsewhere in the code, opened an issue #2773

Copy link

@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe to make this work for csr_array, you just need to avoid isspmatrix_csr (which rules out the *_array objects). more below.

Comment on lines +155 to +156
elif isspmatrix_csr(X):
number_per_cell = np.diff(X.indptr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this will work for csr_array (and csr_matrix of course) if you change this elif clause:

Suggested change
elif isspmatrix_csr(X):
number_per_cell = np.diff(X.indptr)
elif issparse(X) and X.format == 'csr':
number_per_cell = np.diff(X.indptr)

number_per_gene = np.sum(X, axis=0)
if issparse(X):
number_per_gene = number_per_gene.A1
elif isspmatrix_csr(X):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly here:

Suggested change
elif isspmatrix_csr(X):
elif issparse(X) and X.format == 'csr':

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants