Analysis pipeline for RNA-seq data

This pipeline is designed to analyze RNA-seq data for use with MD-ALL, our B-ALL subtype prediction pipeline. It outputs gene read counts, mutations, fusions, and chromosome-level copy number variations (gains/losses) derived from the RNA-seq data. You can install all the packages yourself or use the Singularity containers we have created, which include all the essential software. We highly recommend the Singularity containers, as some of the software can be challenging to install correctly. This pipeline is still under active development, and new analyses for B-ALL subtype prediction will be released.

The pipeline is written in Snakemake.

The workflow of this pipeline:

For Singularity container users

Please download the Singularity containers and the reference database.

Configuration

To configure the pipeline, edit the run_rnaseq.smk file.

Parameters:

‘ref_fa’, the FASTA file of the human reference genome GRCh38. Users need to download it.

‘gtf’, the GTF annotation file for the reference genome. Users need to download it.

‘bed_DUX4’, the BED file of DUX4 genes. This file is used when patching read counts for DUX4 genes. It is already included in the 0.ref directory.

‘ref_star’, the reference directory used by STAR for alignment. Users will get it after the installation of STAR.

‘ref_fusioncatcher’, the reference directory used by FusionCatcher to call gene fusions. Users will get it after the installation of FusionCatcher.

‘ref_cicero’, the reference directory used by Cicero to call gene fusions. Users will get it after the installation of Cicero.

‘ref_RNApeg_flat’, the refFlat file used by RNApeg. It is already included in the 0.ref directory.

‘cores_star’, ‘cores_samtoolsSort’, ‘cores_fusioncatcher’, ‘cores_RNApeg’ and ‘cores_cicero’ are the number of threads used by the corresponding software.

‘dir_in’, the directory of input FASTQ files. Currently, only gzip-compressed paired-end FASTQ files are supported. The file names must follow the pattern {sample}.R1.fq.gz and {sample}.R2.fq.gz. For example, if the sample ID is COH000456_D1, the FASTQ files must be named COH000456_D1.R1.fq.gz and COH000456_D1.R2.fq.gz.

‘dir_out’, the output directory. Results will be stored in sub-directories within this folder, each named according to the respective sample ID.

‘samplelist’, the list of sample IDs to be processed. The corresponding FASTQ files must be stored in the ‘dir_in’ directory.
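
Because the input files must follow a strict naming pattern, it can help to check them before launching a run. The following is a minimal sketch and not part of the pipeline: the helper names expected_fastqs and find_missing_fastqs are hypothetical, and the input directory and sample IDs are passed in as plain Python values rather than read from run_rnaseq.smk.

```python
from pathlib import Path


def expected_fastqs(dir_in, sample):
    """Return the R1/R2 paths that the naming pattern implies for one sample."""
    base = Path(dir_in)
    return [base / f"{sample}.R1.fq.gz", base / f"{sample}.R2.fq.gz"]


def find_missing_fastqs(dir_in, samples):
    """Map each sample ID to the expected FASTQ file names that do not exist."""
    missing = {}
    for sample in samples:
        absent = [p.name for p in expected_fastqs(dir_in, sample) if not p.is_file()]
        if absent:
            missing[sample] = absent
    return missing


if __name__ == "__main__":
    # Placeholder values; substitute your own input directory and sample IDs.
    problems = find_missing_fastqs("fastq", ["COH000456_D1"])
    for sample, files in problems.items():
        print(f"{sample}: missing {', '.join(files)}")
```

An empty result means every listed sample has both of its paired-end files in place.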

Pipeline running

Before running, make sure Singularity is correctly installed and loaded.

singularity exec --bind /full_path_to_ref/ref \
/full_path_to/app_gulab_rnaseq-20231121_haswell.sif  \
snakemake -s run_rnaseq.smk --rerun-incomplete -p -j16 --keep-going
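
When the pipeline is launched from a wrapper script, the command above can be assembled programmatically. This sketch is an illustration only: build_pipeline_command is a hypothetical helper, the paths are the same placeholders as in the command above, and the optional dry_run switch appends Snakemake's -n flag, which lists the planned jobs without running them.

```python
import shlex


def build_pipeline_command(ref_dir, container, jobs=16, dry_run=False):
    """Assemble the singularity/snakemake call as an argument list."""
    cmd = [
        "singularity", "exec", "--bind", ref_dir,
        container,
        "snakemake", "-s", "run_rnaseq.smk",
        "--rerun-incomplete", "-p", f"-j{jobs}", "--keep-going",
    ]
    if dry_run:
        cmd.append("-n")  # Snakemake dry run: show planned jobs only
    return cmd


if __name__ == "__main__":
    cmd = build_pipeline_command(
        "/full_path_to_ref/ref",
        "/full_path_to/app_gulab_rnaseq-20231121_haswell.sif",
        dry_run=True,
    )
    # Print the command for inspection; to execute it instead, pass the
    # list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

A dry run is a cheap way to confirm that the configuration and input files are picked up as intended before committing to a full run.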

Because Cicero-based fusion calling can be very slow, it is optional, and we provide a separate Singularity container for it. To run fusion calling with the Cicero Singularity container, use the following command:

singularity exec --bind /full_path_to_Cicero_ref/ref \
/full_path_to/cicero_0.3.0p2.sif \
Cicero.sh -n 8 -b input.bam -g GRCh38_no_alt -r ref -j input.junctions -s 2 -c 10 -o output
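
To run Cicero over several samples, the same command can be generated once per sample. This is a sketch under stated assumptions: build_cicero_command is a hypothetical helper, the flag values are copied from the command above, and the per-sample BAM, junctions, and output names are an assumed layout that you should adapt to where your files actually live.

```python
import shlex


def build_cicero_command(ref_bind, container, bam, junctions, out_dir, cores=8):
    """Assemble the Cicero container call for one sample as an argument list."""
    return [
        "singularity", "exec", "--bind", ref_bind,
        container,
        "Cicero.sh", "-n", str(cores), "-b", bam, "-g", "GRCh38_no_alt",
        "-r", "ref", "-j", junctions, "-s", "2", "-c", "10", "-o", out_dir,
    ]


if __name__ == "__main__":
    # Assumed layout (not defined by this README): one BAM and one
    # junctions file per sample, both named after the sample ID.
    for sample in ["COH000456_D1"]:
        cmd = build_cicero_command(
            "/full_path_to_Cicero_ref/ref",
            "/full_path_to/cicero_0.3.0p2.sif",
            bam=f"{sample}.bam",
            junctions=f"{sample}.junctions",
            out_dir=f"{sample}_cicero",
        )
        print(shlex.join(cmd))
```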