Asaru Sim Documentation

AsaruSim is an automated Nextflow workflow for simulating 10x single-cell Nanopore reads. It is intended for benchmarking and optimizing single-cell Nanopore long-read data processing pipelines. Full documentation is available here.

Prerequisites

Before starting, ensure the following tools are installed and properly set up on your system:

Nextflow: A workflow engine for complex data pipelines. Installation guide for Nextflow.
Docker or Singularity: Containers for packaging necessary software, ensuring reproducibility. Docker installation guide, Singularity installation guide.
Git: Required to clone the workflow repository. Git installation guide.
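
A quick way to confirm that the prerequisites are on the PATH is a small shell check such as the sketch below. The `check_tool` helper is a hypothetical name introduced for illustration, not part of AsaruSim; it only tests that each command can be found, not that its version is suitable.

```shell
# check_tool is an illustrative helper, not part of AsaruSim.
# It prints whether a command is on PATH and returns non-zero if it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

check_tool nextflow || true
check_tool git || true
# Only one container engine is needed, so stop at the first one found.
check_tool docker || check_tool singularity || echo "install Docker or Singularity"
```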

Installation

Clone the AsaruSim GitHub repository:

git clone https://github.com/alihamraoui/AsaruSim.git
cd AsaruSim

Configuration

Customize runs by editing the nextflow.config file and/or specifying parameters at the command line.
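
Rather than editing nextflow.config in place, one option is to keep run-specific settings in a separate file and pass it to Nextflow with `-c`. The sketch below writes such a file; the name `my_run.config` is hypothetical, and the snippet assumes the workflow reads its settings from the standard Nextflow `params` scope. The parameter names and values are taken from the tables and examples in this README.

```shell
# Write an override file; my_run.config is an illustrative name.
# Assumes AsaruSim reads these settings from the standard `params` scope.
cat > my_run.config <<'EOF'
params {
    matrix        = 'dataset/sub_pbmc_matrice.csv'
    transcriptome = 'dataset/Homo_sapiens.GRCh38.cdna.all.fa'
    features      = 'gene_name'
    outdir        = 'results'
    threads       = 4
}
EOF

# Launch with the overrides (left commented so the snippet has no side effects):
# nextflow run main.nf -c my_run.config
echo "wrote my_run.config"
```

Parameters given on the command line, for example `--threads 8`, take precedence over values set in config files.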

Pipeline Input Parameters

Here are the primary input parameters for configuring the workflow:

Parameter	Description	Default Value
`matrix`	Path to the count matrix csv file (required)	`test_data/matrix.csv`
`bc_counts`	Path to the barcode count file	`test_data/test_bc.csv`
`transcriptome`	Path to the reference transcriptome file (required)	`test_data/transcriptome.fa`
`features`	Feature identifier used in the count matrix (e.g. `transcript_id` or `gene_name`)	`transcript_id`
`gtf`	Path to the transcriptome annotation .gtf file	`null`
`cell_types_annotation`	Path to cell type annotation .csv file	`null`

Error/Qscore Parameters

Configuration for the error model:

Parameter	Description	Default Value
`trained_model`	Badread pre-trained error/Qscore model name	`nanopore2023`
`badread_identity`	Comma-separated values for Badread identity parameters	`"98,2,99"`
`error_model`	Custom error model file (optional)	`null`
`qscore_model`	Custom Q-score model file (optional)	`null`
`build_model`	Whether to build your own error/Q-score model	`false`
`fastq_model`	Real reads (.fastq) used as a reference to train the error model (optional)	`false`
`ref_genome`	Reference genome .fasta file (optional)	`false`
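
Because `badread_identity` is passed as a single comma-separated string, a malformed value is easy to miss. The sketch below checks only the shape of the string. It assumes the value must be exactly three numbers, which is inferred from the default `"98,2,99"` rather than stated in this README, and `valid_identity` is a hypothetical helper, not part of AsaruSim or Badread.

```shell
# valid_identity is an illustrative helper, not part of AsaruSim or Badread.
# It accepts a string only if it is exactly three comma-separated numbers,
# matching the shape of the default "98,2,99".
valid_identity() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?(,[0-9]+(\.[0-9]+)?){2}$'
}

for value in "98,2,99" "98,2" "98;2;99"; do
    if valid_identity "$value"; then
        echo "ok: $value"
    else
        echo "rejected: $value"
    fi
done
```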

Additional Parameters

Parameter	Description	Default Value
`amp`	Amplification factor	`1`
`outdir`	Output directory for results	`"results"`
`projectName`	Name of the project	`"test_project"`

Run Parameters

Configuration for running the workflow:

Parameter	Description	Default Value
`threads`	Number of threads to use	`4`
`container`	Docker container for the workflow	`'hamraouii/wf-SLSim'`
`docker.runOptions`	Docker run options to use	`'-u $(id -u):$(id -g)'`
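
The default `docker.runOptions` value relies on shell command substitution: `$(id -u)` and `$(id -g)` expand to the numeric user and group IDs of whoever launches the workflow, so the container runs as that user instead of root. The snippet below simply prints what the option expands to on the current machine; the variable name `run_opts` is illustrative.

```shell
# Show what the default docker.runOptions string expands to for the current user.
run_opts="-u $(id -u):$(id -g)"
echo "$run_opts"
```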

Usage

Users can choose among four ways to simulate template reads:

use a real count matrix
estimate parameters from a real count matrix to simulate a synthetic count matrix
specify the input parameters themselves
use a combination of the above options

AsaruSim uses SPARSim to simulate count matrices. For more information about synthetic count matrices, please read the SPARSim documentation.

EXAMPLES

Sample data

A demonstration dataset for this workflow is available on Zenodo (DOI: 10.5281/zenodo.12731408). This dataset is a subsample from a Nanopore run of the 10X 5k human PBMCs.

The human GRCh38 reference transcriptome, GTF annotation, and FASTA reference genome can be downloaded from Ensembl.

BASIC WORKFLOW

 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/genes.gtf

WITH PCR AMPLIFICATION

 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf \
                      --pcr_cycles 2 \
                      --pcr_dup_rate 0.7 \
                      --pcr_error_rate 0.00003
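
To get a feel for the PCR settings, the sketch below estimates the expected number of copies per starting molecule. It assumes a simple model in which every molecule is duplicated with probability `pcr_dup_rate` in each cycle, giving an expected (1 + rate)^cycles copies. This model is an assumption made for illustration and may differ from what AsaruSim implements; `expected_copies` is a hypothetical helper.

```shell
# Expected copies per starting molecule under the ASSUMED model (1 + rate)^cycles.
# expected_copies is an illustrative helper, not part of AsaruSim.
expected_copies() {
    awk -v cycles="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", (1 + rate) ^ cycles }'
}

expected_copies 2 0.7   # cycles and duplication rate from the command above
```

Under that assumption, 2 cycles at a duplication rate of 0.7 give about 2.89 copies per molecule.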

WITH SIMULATED CELL TYPE COUNTS

 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf \
                      --sim_celltypes true \
                      --cell_types_annotation dataset/sub_pbmc_cell_type.csv

USING A SPARSIM PRESET MATRIX (e.g. Chu et al. 10X Genomics datasets)

nextflow run main.nf --matrix Chu_param_preset \
                     --transcriptome datasets/Homo_sapiens.GRCh38.cdna.all.fa \
                     --features gene_name \
                     --gtf datasets/Homo_sapiens.GRCh38.112.gtf

WITH PERSONALIZED ERROR MODEL

nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                     --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                     --features gene_name \
                     --gtf dataset/GRCh38-2020-A-genes.gtf \
                     --build_model true \
                     --fastq_model dataset/sub_pbmc_reads.fq \
                     --ref_genome dataset/GRCh38-2020-A-genome.fa
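
Building a custom model needs both the real reads and the reference genome, so it can be worth checking that the files exist before launching a long run. The `require_file` helper below is hypothetical and only tests that each path is a non-empty file; it does not validate the file contents.

```shell
# require_file is an illustrative helper, not part of AsaruSim.
# It returns non-zero when the path is missing or empty.
require_file() {
    if [ -s "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}

# Paths taken from the command above; adjust them to your own data.
require_file dataset/sub_pbmc_reads.fq || true
require_file dataset/GRCh38-2020-A-genome.fa || true
```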

COMPLETE WORKFLOW

 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf \
                      --sim_celltypes true \
                      --cell_types_annotation dataset/sub_pbmc_cell_type.csv \
                      --build_model true \
                      --fastq_model dataset/sub_pbmc_reads.fq \
                      --ref_genome dataset/GRCh38-2020-A-genome.fa \
                      --pcr_cycles 2 \
                      --pcr_dup_rate 0.7 \
                      --pcr_error_rate 0.00003

Results

After execution, results are written to the directory given by --outdir. They include the simulated Nanopore reads (.fastq), log files, and a QC report.
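
A quick sanity check on the output is to count the simulated reads. The sketch below assumes an uncompressed FASTQ with the usual four lines per record. `count_fastq_reads` is a hypothetical helper, and the file names inside the output directory are not specified here, so pass whichever .fastq file the run produced.

```shell
# count_fastq_reads is an illustrative helper, not part of AsaruSim.
# Assumes an uncompressed FASTQ with exactly four lines per record.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Tiny demo file with two records, used here only to show the call.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo"
count_fastq_reads "$demo"
rm -f "$demo"
```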

Cleaning Up

To clean up temporary files generated by Nextflow:

nextflow clean -f

Workflow

Acknowledgements

We would like to express our gratitude to Youyupei for the development of SLSim, which has been helpful to the AsaruSim workflow.
Additionally, our thanks go to the teams behind Badread and SPARSim, whose tools are integral to the AsaruSim workflow.

Support and Contributions

For support, please open an issue in the repository's "Issues" section. Contributions via Pull Requests are welcome. Follow the contribution guidelines specified in CONTRIBUTING.md.

License

AsaruSim is distributed under a specific license. Check the LICENSE file in the GitHub repository for details.
