Integration with nf-core pipelines #34

grst · 2023-06-14T11:55:30Z

I am an active contributor to the nf-core project and have worked on the scRNA-seq and spatialtranscriptomics pipelines in the past. For both pipelines, we are considering integrating checkatlas to generate MultiQC reports (see nf-core/scrnaseq#80 and nf-core/spatialvi#40).

From what I understand, the checkatlas architecture is rather complex, consisting of:

- a Python library that takes an h5ad object and computes various QC metrics
- a Nextflow workflow that executes the different parts of the Python library via CLI wrappers. The Nextflow workflow itself is wrapped in another Python CLI script.
- a MultiQC module that reads the outputs of this workflow to generate a report
- an R script to convert Seurat to h5ad.

To integrate checkatlas into one of our pipelines, we need to define a Nextflow module that takes h5ad files as input and generates files that can be ingested by a downstream MultiQC process. In addition, we need a standalone container that includes all required dependencies (see also #25).

While it would be entirely possible to create a container that holds the Python dependencies, nextflow+java and the R dependencies, it seems a bit convoluted to run a Nextflow workflow that starts a docker container that runs a Python script that runs a Nextflow workflow that runs another Python script. It is also suboptimal in terms of resource management, because the checkatlas-nextflow running inside the container cannot make use of the cluster/cloud scheduler that the "outer" Nextflow pipeline was configured to run with.

From our perspective, it would be better to separate the python library from the nextflow workflow in checkatlas. That way we could have a lightweight container for the python part, and build a "checkatlas" nextflow (sub)workflow that can be integrated in both pipelines. If necessary, conversion from Seurat to h5ad would run in a separate process with a separate container -- avoiding manual installation of R packages (mitigating issues like #24). In general, I think it is best to have nextflow as the outermost layer, to let it handle all dependencies and take advantage of its flexible resource management (local vs. hpc vs cloud).
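To make the proposed split a bit more concrete, here is a minimal sketch of what a workflow-agnostic Python step could look like: a pure function that computes metrics, a writer that emits a tab-separated file for a downstream report step, and a thin CLI that an outer Nextflow process could call. Everything here is hypothetical and only for illustration: the function names, the flags, the metric names and the TSV layout are assumptions, not the checkatlas API, and a plain list of per-cell counts stands in for an h5ad object so that the example needs only the standard library.

```python
import argparse
import csv
import statistics
from pathlib import Path


def summarize_counts(counts_per_cell):
    """Compute a few toy QC metrics from per-cell total counts.

    Hypothetical stand-in for the library part; a real implementation
    would read these values from an h5ad file instead of a list.
    """
    return {
        "n_cells": len(counts_per_cell),
        "median_counts": statistics.median(counts_per_cell),
        "min_counts": min(counts_per_cell),
        "max_counts": max(counts_per_cell),
    }


def write_metrics_tsv(sample, metrics, out_path):
    """Write one header row and one data row as TSV (assumed layout)."""
    out_path = Path(out_path)
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample", *metrics.keys()])
        writer.writerow([sample, *metrics.values()])
    return out_path


def main(argv=None):
    """Thin CLI wrapper: no workflow logic, just inputs -> one output file."""
    parser = argparse.ArgumentParser(description="Toy QC step (hypothetical CLI)")
    parser.add_argument("--sample", required=True)
    parser.add_argument("--counts", required=True,
                        help="comma-separated per-cell counts (stand-in for an h5ad path)")
    parser.add_argument("--out", required=True)
    args = parser.parse_args(argv)

    counts = [int(value) for value in args.counts.split(",")]
    write_metrics_tsv(args.sample, summarize_counts(counts), args.out)
```

Calling `main(["--sample", "s1", "--counts", "10,20,30,40", "--out", "s1_qc.tsv"])` writes a single small file. Because the step knows nothing about scheduling or containers, the outer pipeline stays free to decide where and how often it runs.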

Let me know what you think!

Cheers,
Gregor

CC @fasterius @cavenel (nf-core/spatialtranscriptomics), @fmalmeida (nf-core/scrnaseq)


drbecavin · 2023-06-14T14:47:12Z

Dear Gregor,

Thanks for your interest in checkatlas! I would love to make it compatible with your two nf-core pipelines. Integrating spatial data is actually a planned development, as we are using this type of data more and more in the lab.

You are totally right! Nextflow should sit on top of the checkatlas pipeline. I first developed checkatlas as a stand-alone Python program and only added Nextflow in the last few months. Not being a Nextflow expert, I implemented it the fastest way I knew.

Sadly, I do not have any grant to fund this project, so I am alone on the development. If you feel that checkatlas would be a great addition to your pipelines, I would love to work with you to make it "nf-core" compatible. Some help would be more than welcome!

DM me: becavin AT ipmc DOT cnrs DOT fr

grst · 2023-06-22T08:23:41Z

Hi @drbecavin,

I have now had a chance to play around with checkatlas a bit more. I now understand that the Nextflow part is optional and that checkatlas also runs quite well without it.

To move forward, we mainly need two things:

- A container, preferably lightweight and without the Nextflow and R dependencies. As mentioned in #25 (Checkatlas conda, docker, singularity definition file), I recommend using bioconda for that, as it will give you conda, docker and singularity in one go. I can help with creating the bioconda recipe.
- Full MultiQC integration, i.e. getting MultiQC/MultiQC#1713 (New module: Checkatlas) merged.

Once we have that, I would create a module on nf-core/modules that can be integrated into different pipelines.

drbecavin · 2023-06-22T14:04:57Z

Hello @grst

The Nextflow part is essentially there for when you want to calculate extensive metrics for classification, annotation and dimensionality reduction. In these cases it takes a lot of time and one needs to parallelise across the datasets.

OK, next week I will work on:

- The bioconda definition
- Fixing the problems in the MultiQC PR

I can also start a draft Nextflow pipeline for the whole checkatlas run (replacing my current nextflow.nf), which you would be able to use and improve.

I'll keep you posted.
Thanks!

grst · 2023-06-22T14:08:10Z

> I can also start a draft nextflow pipeline for the whole checkatlas run (replacing my current nextflow.nf). Which you would be able to use and improve.

In that case, consider starting from the nf-core template. It may seem a bit overwhelming in the beginning, but the community has figured out a lot of things that make a pipeline run seamlessly across different setups. It also makes it easier to publish it as an nf-core pipeline later.

drbecavin · 2023-06-22T14:11:44Z

Alright! I will do that!

grst added the "enhancement" (New feature or request) and "question" (Further information is requested) labels on Jun 14, 2023

cavenel mentioned this issue on Jan 24, 2024 in nf-core/spatialvi#39 (Add SpatialData as output; closed)
