Integration with nf-core pipelines #34

grst opened this issue Jun 14, 2023 · 5 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@grst

grst commented Jun 14, 2023

Hi @drbecavin,

I am an active contributor to the nf-core project and have been working on the scRNA-seq and spatial transcriptomics pipelines in the past. For both pipelines, we are considering integrating checkatlas to generate MultiQC reports (see nf-core/scrnaseq#80 and nf-core/spatialvi#40).

From what I understood, the checkatlas architecture is rather complex, consisting of

  • a python library that takes an h5ad object and computes various QC metrics
  • a nextflow workflow that executes the different parts of the python library via CLI wrappers. The Nextflow workflow itself is wrapped in another Python CLI script.
  • a MultiQC module that reads the outputs of this workflow to generate a report
  • an R script to convert Seurat to h5ad.

To integrate checkatlas in one of our pipelines, we need to define a nextflow module that takes h5ad files as input, and generates files that can be ingested by a downstream MultiQC process. In addition we need a standalone container including all required dependencies (see also #25).

While it would be entirely possible to create a container that holds the Python dependencies, nextflow+java, and the R dependencies, it seems a bit convoluted to run a nextflow workflow that starts a docker container that runs a python script that runs a nextflow workflow that runs another python script. It's also suboptimal in terms of resource management, because the checkatlas-nextflow running in the container cannot make use of the cluster/cloud scheduler the "outer" nextflow pipeline was configured to run with.

From our perspective, it would be better to separate the python library from the nextflow workflow in checkatlas. That way we could have a lightweight container for the python part, and build a "checkatlas" nextflow (sub)workflow that can be integrated in both pipelines. If necessary, conversion from Seurat to h5ad would run in a separate process with a separate container -- avoiding manual installation of R packages (mitigating issues like #24). In general, I think it is best to have nextflow as the outermost layer, to let it handle all dependencies and take advantage of its flexible resource management (local vs. hpc vs cloud).
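To illustrate the separation proposed above, here is a minimal sketch of a "library plus thin CLI" layout that an outer nextflow process could call once per dataset. Everything in it is hypothetical: `compute_qc_metrics`, `write_metrics_tsv`, and the `--out` flag are illustrative names, not the checkatlas API, and a JSON list of per-cell records stands in for the h5ad input so the example only needs the standard library.

```python
import argparse
import csv
import json
from pathlib import Path


def compute_qc_metrics(cells):
    """Library layer (hypothetical): a pure function with no workflow logic.

    `cells` is a list of dicts with a "total_counts" key, used here as a
    stand-in for an h5ad object.
    """
    n_cells = len(cells)
    total_counts = sum(cell["total_counts"] for cell in cells)
    mean_counts = total_counts / n_cells if n_cells else 0.0
    return {"n_cells": n_cells, "mean_counts": mean_counts}


def write_metrics_tsv(metrics, out_path):
    """Write metrics as a two-column TSV that a reporting step could read."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["metric", "value"])
        for name, value in metrics.items():
            writer.writerow([name, value])


def main(argv=None):
    """CLI layer (hypothetical): one input file in, one metrics file out."""
    parser = argparse.ArgumentParser(description="Compute QC metrics for one dataset")
    parser.add_argument("input", help="path to the input file (JSON stand-in for h5ad)")
    parser.add_argument("--out", required=True, help="path of the TSV file to write")
    args = parser.parse_args(argv)

    cells = json.loads(Path(args.input).read_text())
    write_metrics_tsv(compute_qc_metrics(cells), args.out)
    return 0
```

Saved as a script, `main()` would be called from an `if __name__ == "__main__":` guard. Because the CLI processes a single file and spawns nothing itself, scheduling and parallelisation stay with whichever workflow engine invokes it.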

Let me know what you think!

Cheers,
Gregor

CC @fasterius @cavenel (nf-core/spatialtranscriptomics), @fmalmeida (nf-core/scrnaseq)

@grst grst added enhancement New feature or request question Further information is requested labels Jun 14, 2023
@drbecavin
Member

Dear Gregor,

Thanks for your interest in checkatlas! I would love to make it compatible with your two nf-core pipelines. Integrating spatial data is actually a planned development, as we are using this type of data more and more in the lab.

You are totally right! Nextflow should be on top of the checkatlas pipeline. I first developed checkatlas as a stand-alone python program and then added nextflow in the last few months. Not being a nextflow expert, I implemented it the fastest way I knew.

Sadly, I do not have any grant to fund this project, so I am alone on the development. If you feel that checkatlas would be a great addition to your pipelines, I would love to interact with you and make it "nf-core" compatible. Some help would be more than useful!

DM me : becavin AT ipmc DOT cnrs DOT fr

@grst
Author

grst commented Jun 22, 2023

Hi @drbecavin,

I have now had a chance to play around with checkatlas a bit more. I now understand that the nextflow part is optional and that it also runs quite well without it.

For moving forward, we mainly need two things:

Once we have that, I would create a module on nf-core/modules that can be integrated into different pipelines.

@drbecavin
Member

drbecavin commented Jun 22, 2023

Hello @grst

The nextflow part is essentially there for when you want to calculate extensive metrics for classification, annotation and dim reduction. In these cases it takes a lot of time and one needs to parallelise over the datasets.

Ok, next week I will work on :

  • The bioconda definition
  • Fix problem for MultiQC PR

I can also start a draft nextflow pipeline for the whole checkatlas run (replacing my current nextflow.nf), which you would be able to use and improve.

I'll keep you posted.
thanks !

@grst
Author

grst commented Jun 22, 2023

> I can also start a draft nextflow pipeline for the whole checkatlas run (replacing my current nextflow.nf), which you would be able to use and improve.

In that case, consider starting from the nf-core template. It may seem a bit overwhelming in the beginning, but the community has figured out a lot of things that make a pipeline run seamlessly across different setups. It also makes it easier to publish it as an nf-core pipeline later.

@drbecavin
Member

Alright ! I will do that !
