
Speedup (~20x) of scanpy.pp.regress_out function using Linear Least Square method. #3110

Open

ashish615 wants to merge 5 commits into main

Conversation

@ashish615
Contributor

ashish615 commented Jun 18, 2024

  • Closes #
  • Tests included or not required because:
  • Release notes not necessary because:

Hi,
We are submitting a PR that speeds up the regress_out function. For non-categorical data, we find the coefficients with linear regression (linear least squares) rather than a GLM; a minimal sketch of the idea is below.
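
For context, here is a sketch of the least-squares idea only (not the PR's exact code; regress_out_lstsq is a hypothetical helper, and the regressor matrix A is assumed to already include an intercept column):

import numpy as np

def regress_out_lstsq(X: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Remove the linear contribution of the regressors A from X.

    X: dense expression matrix, shape (n_cells, n_genes)
    A: regressor matrix, shape (n_cells, n_regressors)
    """
    # A single least-squares solve covers all genes at once, which is
    # what makes this so much faster than fitting a GLM per gene.
    coeff, *_ = np.linalg.lstsq(A, X, rcond=None)  # (n_regressors, n_genes)
    return X - A @ coeff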

           Time (s)
Original   297.00
Updated     14.91
Speedup     19.91×

Experiment setup: AWS r7i.24xlarge. Benchmark script:

import os
import time
import warnings

import numpy as np
import pandas as pd
import scanpy as sc
import wget
from sklearn.cluster import KMeans

warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading input file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad', input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking
ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs (sc.settings is the documented way to configure this)
sc.settings.n_jobs = os.cpu_count()

start = time.time()
tr = time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
print(adata.shape)
print("Total read time : %s" % (time.time()-tr))



tr = time.time()
# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
print(adata.shape)

sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter and normalize time : %s" % (time.time()-tr))


tr = time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time()-tr))


# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")

# Retain marker gene expression
for marker in markers:
    adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()

# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

# Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
n_counts = np.array(adata.X.sum(axis=1))
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts

ts = time.time()
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time()-ts))


codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.39%. Comparing base (ad657ed) to head (5e4df8c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3110      +/-   ##
==========================================
+ Coverage   76.31%   76.39%   +0.08%     
==========================================
  Files         109      109              
  Lines       12513    12526      +13     
==========================================
+ Hits         9549     9569      +20     
+ Misses       2964     2957       -7     
Files Coverage Δ
src/scanpy/preprocessing/_simple.py 88.10% <100.00%> (+0.15%) ⬆️

... and 1 file with indirect coverage changes

src/scanpy/preprocessing/_simple.py: 6 review threads (outdated, resolved)
@flying-sheep
Member

Thanks for keeping them coming! It would be helpful if you added comments documenting why your newly introduced branch is faster.


scverse-benchmark bot commented Jun 18, 2024

Benchmark changes

Change  Before [ad657ed]  After [3e3ca9d]  Ratio  Benchmark (Parameter)
-       404M              339M             0.84   preprocessing_log.peakmem_pca('pbmc68k_reduced')
-       1.61±0.01s        11.6±0.4ms       0.01   preprocessing_log.time_regress_out('pbmc68k_reduced')

Comparison: https://github.com/scverse/scanpy/compare/ad657edfb52e9957b9a93b3a16fc8a87852f3f09..3e3ca9dcb2e77e72b75095ff895cb55aeb7f98bc
Last changed: Wed, 19 Jun 2024 06:43:50 +0000

More details: https://github.com/scverse/scanpy/pull/3110/checks?check_run_id=26405245612

@flying-sheep
Member

ooh, this time the benchmark shows really nicely how much faster it is!

Ratio: 0.01 Benchmark: preprocessing_log.time_regress_out('pbmc68k_reduced')

@ashish615
Contributor Author

> ooh, this time the benchmark shows really nicely how much faster it is!

Looks like preprocessing_log.time_regress_out('pbmc68k_reduced') regresses out variables that are not in the dataset: it should call regress_out with ['n_counts', 'percent_mito'] instead of ["total_counts", "pct_counts_mt"]. Since it fails for both commits, both report the same time.

@flying-sheep
Member

flying-sheep left a comment

looks good from my side, do you have opinions, @Intron7?

@Intron7
Member

Intron7 commented Jun 20, 2024

It looks really good. I would love to have a closer look next week.

@ashish615
Contributor Author

> ooh, this time the benchmark shows really nicely how much faster it is!

Looks like preprocessing_log.time_regress_out('pbmc68k_reduced') regresses out variables that are not in the dataset: it should call regress_out with ['n_counts', 'percent_mito'] instead of ["total_counts", "pct_counts_mt"]. Since it fails for both commits, both report the same time.

@flying-sheep, can you look at this benchmark test?

@flying-sheep
Member

flying-sheep commented Jun 24, 2024

Hi, everything is OK with the benchmarks; regress_out would fail if called with variables that don't exist.

The reason these are named differently is here:

mapper = dict(
    percent_mito="pct_counts_mt",
    n_counts="total_counts",
)
adata.obs.rename(columns=mapper, inplace=True)

I did that to be able to run the benchmarks on multiple data sets with the same code.

@Intron7
Member

Intron7 commented Jul 3, 2024

The to_dense function only works for CSR matrices. I think we need another kernel that handles CSC, or we could just .T the resulting array if it's CSC. I also get performance warnings about the matmul in the numba kernel. I'll investigate this further.

@ashish615
Contributor Author

ashish615 commented Jul 3, 2024

This will work for a CSC matrix:

import numba
import numpy as np
from numpy.typing import NDArray
from typing import TypeVar

DT = TypeVar("DT", bound=np.generic)  # dtype type variable, as for the CSR kernel

@numba.njit(cache=True, parallel=True)
def to_dense_csc(
    shape: tuple[int, int],
    indptr: NDArray[np.integer],
    indices: NDArray[np.integer],
    data: NDArray[DT],
) -> NDArray[DT]:
    """\
    Numba kernel equivalent of scipy's .toarray() for CSC input
    """
    X = np.empty(shape, dtype=data.dtype)

    # Parallelize over columns: each CSC column's nonzeros are contiguous.
    for c in numba.prange(shape[1]):
        X[:, c] = 0
        for i in range(indptr[c], indptr[c + 1]):
            X[indices[i], c] = data[i]
    return X
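
A quick way to sanity-check the kernel against scipy (illustrative usage only, not part of the PR):

import numpy as np
from scipy import sparse

X = sparse.random(1000, 200, density=0.05, format="csc", dtype=np.float32)
dense = to_dense_csc(X.shape, X.indptr, X.indices, X.data)
assert np.allclose(dense, X.toarray())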

@Intron7
Member

Intron7 commented Jul 3, 2024

That kernel technically works but is way slower than the default scipy operation due to inefficient memory layout.
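
A cheaper alternative for CSC input, following the earlier suggestion to just .T the result: reuse the row-major CSR kernel on the transpose. This is only a sketch and assumes the PR's to_dense kernel takes (shape, indptr, indices, data) like the CSC kernel above:

# A CSC matrix is the CSR representation of its transpose, so densify
# with the row-major kernel and transpose back (.T is a view, no copy).
dense = to_dense((X.shape[1], X.shape[0]), X.indptr, X.indices, X.data).T
# Note: the result is then column-major; downstream code expecting
# C-contiguous data may still need a copy.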

Comment on lines +646 to +651
@numba.njit(cache=True, parallel=True)
def get_resid(
    data: np.ndarray,
    regressor: np.ndarray,
    coeff: np.ndarray,
) -> np.ndarray:
Contributor

Is the DT type variable not important here, as above?

Comment on lines 762 to +767
regres = regressors
tasks.append(tuple((data_chunk, regres, variable_is_categorical)))

from joblib import Parallel, delayed
res = None
if not variable_is_categorical:
    A = regres.to_numpy()
Contributor

@ilan-gold commented Jul 4, 2024

I think the code before this should be refactored as well, so that the condition covers these new methods (i.e. at https://github.com/scverse/scanpy/pull/3110/files#diff-f9e0bdcdfc04622c421f0a5788bbba6ee0303750580d2915caba3239d799322fR759). We can check regressors.to_numpy() for a non-zero determinant before iterating through the chunk list. From what I can tell, if not variable_is_categorical and the determinant is non-zero, we don't need to build the list at all. That would also make things cleaner, because we'd need fewer checks: we could then use a boolean to decide whether to move into the next clause (which would include chunk creation), instead of checking whether res is None.

Does this make sense? A rough sketch of the suggested control flow is below.
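
The following is only a hypothetical sketch of that control flow; the names (X, regressors, variable_is_categorical) are assumed from the PR's diff, and the determinant is checked on the normal-equations matrix A.T @ A, since the regressor matrix itself isn't square:

import numpy as np

A = regressors.to_numpy()
use_lstsq = (
    not variable_is_categorical
    and np.linalg.det(A.T @ A) != 0  # least-squares problem is well posed
)

if use_lstsq:
    # Single solve for all genes; no chunk/task list needed at all.
    coeff, *_ = np.linalg.lstsq(A, X, rcond=None)
    res = X - A @ coeff
else:
    # Fall back to the chunked per-gene path (chunk creation happens here).
    ...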
