Skip to content

Latest commit

 

History

History
executable file
·
268 lines (222 loc) · 19 KB

README.md

File metadata and controls

executable file
·
268 lines (222 loc) · 19 KB

🧬 bigr_long-reads_bulk

This pipeline is built to analyse bulk long-reads DNA-seq data from PromethION (Oxford Nanopore Technologies). It can perform basecalling and methylation calling, Differential Methylated Regions analysis, germline and somatic variants identification (SNV, SV, long CNV) and phasing.

POD5 directory or BAM files (aligned BAM) are accepted as input.

Table of Contents

  1. Pipeline Overview
  2. Installation
  3. Usage
  4. Pipeline Description

📋 Pipeline Overview

Workflow Overview

This pipeline is designed to perform :

  • Preprocessing: Basecalling and Methylation calling
  • Alignment
  • Quality control
  • Differential Methylated Regions (DMR) analysis between alls samples and/or between two conditions
  • Germline and Somatic Single Nucleotide Variants (SNV) identification
  • Germline and Somatic Structural Variants (SV) identification
  • Long Copy Number Variations (CNV) identification
  • Phasing

Go to pipeline detailed description to see details about each pipeline step and watch the rules order on the rulegraph.


⚙️ Installation

Gustave Roussy users

The pipeline and environments are already installed on the Flamingo cluster of Gustave Roussy.
It is localised at /mnt/beegfs/pipelines/bigr_long-reads_bulk/<VERSION>. You can check Usage section.

External users

1️⃣ Download pipeline

cd /path_to_pipeline_installation_dir/
VERSION="2.0.0"
git clone https://github.com/gustaveroussy/bigr_long-reads_bulk.git ${VERSION}

2️⃣ Download Singularity images

Download all singularity images from Zenodo in the directory /path_to_pipeline_installation_dir/${VERSION}/envs/singularity/ of the cloned repository.

3️⃣ Install Snakemake environment

source /mnt/beegfs/software/miniconda/24.3.0/etc/profile.d/conda.sh
conda env create -f /path_to_pipeline_installation_dir/${VERSION}/envs/conda/snakemake.yaml --prefix=/path_to_pipeline_installation_dir/${VERSION}/envs/compiled_conda/snakemake -y

You are now ready to use the pipeline!


📖 Usage

Input files

You need to create 2 mandatory input files: a configuration file and a design file. You can also add an optional file containing a list of genes of interest.

1️⃣ Configuration file

3 models of configuration files are available in config/ directory, depending on the genome of reference used during the analysis. You can copy/paste and modify any model config file to fit your needs:

Warning

We recommand using an Ensembl reference genome. Otherwise, please submit an issue if using another database reference fasta file triggers pipeline errors.

There are various fields of interest in the config file:

  • input_format: accepts pod5 (POD5 directory) or bam (aligned BAM file).

  • basecalling_mode: accepts basic or methylation. Choose if you wish to perform basecalling only or methylation calling (5mCG and 5hmCG in Super Accuracy = SUP).

  • variant_calling_mode: accepts germline or somatic. Choose the variant calling mode if you want to perform variant calling.

  • steps: The configuration file allows you to choose which step(s) will be executed by the pipeline by setting any step name to true or false.

steps:
  basecalling: true
  alignment: true
  differential_methylation_sample: true
  differential_methylation_condition: true
  snv_calling: true
  sv_calling: true
  phasing: true
  cnv_calling: true

differential_methylation_sample: DMR analysis between every pair of samples extracted from the design file. differential_methylation_condition: DMR analysis between two conditions/groups extracted from the design file.

  • spectre: Using a blacklist file is recommanded by Spectre documentation for long CNV calling, GRCh38 blacklist file provided by Spectre can be given to the config file as following:
spectre:
  blacklist: "/mnt/beegfs/database/bioinfo/bigr_long-reads_bulk/REFERENCES/Spectre/blacklist/grch38_renamed_blacklist_spectre.bed"

Warning

  • As we recommand using Ensembl references, the chromosomes in the given GRCh38 blacklist file are named "1", ..., "22", "X", "Y" and not "chr1", ..., "chr22", "chrX", "chrY". You can copy/paste this file and change chromosome names if needed by the reference file you have chosen.
  • You can download your own blacklist on UCSC table browser as explained below.

How to get your own blacklist

First go to the UCSC table browser. Select your organism information. For mouse (GRCm38), choose:

  • Clade: "Mammal" / Genome: "Mouse" / Assembly: "Dev. 2011 (GRCm38/mm10)
  • Group: "Mapping and Sequencing" / Track: "All Gaps"
  • Region: "Genome" / Output format: "BED - browser extensible data"
  • Output filename: "mm10_all_gaps_ucsc.bed"
  • File type returned "Plain text"

Then click on "Get output", uncheck "Include custom track header" and check Create one BED record per: "Whole Gene". Finally, click on "get BED".

If you are using an Ensembl reference, do not forget to change chromosome names sed 's/chr//g' mm10_all_gaps_ucsc.bed mm10_all_gaps_ucsc_renamed.bed.

It's done! Give the absolute path to this bed file in the spectre blacklist field in the config file.

You can change various fields in the config.yaml file such as tools parameters (example: you can select the specific filters you want to apply after SNV calling).

2️⃣ Design file

It must be a comma separated file (.csv where comma is ",") and its path should be given in the config.yaml file. Field names are the following and depend on the analysis you wish to perform:

  • sample_id (required): the sample name of you sample (it could be different that your fastq files).
  • path_file (required): absolute path to the BAM file or to the POD5 directory.
  • methyl_group (optional): name of the group/condition corresponding to the sample, it will be used only for DMR analysis. You can only use 2 groups/conditions at the moment, named case and control. It is required only if you set differential_methylation_condition step as true.
  • somatic_ctrl (optional): sample id corresponding to the normal sample. It is required only if you set the variant_calling_mode as somatic.
  • cnv_cancer (optional): boolean to identify the sample as normal or cancer sample, for CNV analysis. It is required only if you set cnv_calling step as true.

Example with POD5 directory input and all columns given:

sample_id,path_file,methyl_group,somatic_ctrl,cnv_cancer
sample1_Tumor,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample1_Tumor/,case,sample1_Normal,TRUE
sample1_Normal,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample1_Normal/,control,,FALSE
sample2_Tumor,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample2_Tumor/,case,sample2_Normal,TRUE
sample2_Normal,/mnt/beegfs/scratch/n_rabearivelo/test_pipeline/methylation/data_input/sample2_Normal/,control,,FALSE

Here, every step available in the config file can be executed and both variant analysis modes can be used.

Other design files examples

POD5 directory as input and the only required columns:

sample_id,path_file
sample1_KO,/path/to/data_input/sample1/pod5_dir/
sample2_WT,/path/to/data_input/sample2/pod5_dir/

Here, every step available in the config file can be executed, except differential_methylation_condition step without methyl_group column and long CNV calling in cnv_calling step without cnv_cancer column. No somatic variant calling mode can be used without somatic_ctrl column.

BAM file as input:

sample_id,path_file,methyl_group
sample1_KO,/path/to/data_input/sample1.bam,case
sample2_WT,/path/to/data_input/sample2.bam,control

Here, every step available in the config file can be executed, except long CNV calling in cnv_calling step without cnv_cancer column. No somatic variant calling mode can be used without somatic_ctrl column.

Note

  • sample_id should not contain special characters, spaces or 'vs'.
  • each aligned BAM file used as input must be sorted and have its bai index in the same directory than them.
  • if you need to concatenate POD5 files for one sample, you only need to put them all in the same input directory that you give in the design file.

3️⃣ Genes of interest file (optional)

You can create a .txt file containing a list of genes of interest for your analysis, its path will be given in the config.yaml file. It should have one gene name per line. The gene name should come from HUGO Gene Nomenclature Committee (HGNC) gene symbol. It can be used by some tools to plot supplementary graphs specifically for these genes (maftools lollipop plots).

TP53
ATM
BARD1
BRCA1
BRCA2
CHEK2
NF1

Run the pipeline

You need snakemake and singularity. For GR users, as they are already installed on Flamingo (snakemake via conda and singularity via module load) just follow the example below.
Don't forget to change the version of the pipeline and the path to your configuration file.

Script example:

#!/bin/bash
#using: sbatch run.sh
#SBATCH --job-name=LR_analysis
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=250M
#SBATCH --partition=longq

source /mnt/beegfs/software/miniconda/24.3.0/etc/profile.d/conda.sh
conda activate /mnt/beegfs/pipelines/bigr_long-reads_bulk/<version>/envs/conda/snakemake
module load singularity

LR_pipeline="/mnt/beegfs/pipelines/bigr_long-reads_bulk/<version>/"

snakemake --profile ${LR_pipeline}/profiles/slurm \
          -s ${LR_pipeline}/Snakefile \
          --configfile /path_to_my_configuration_file/config.yaml

🔎 Pipeline detailed description

Basecalling & Methylation calling

Step Tool Description
Basecalling Dorado Perform basecalling (SUP)
Methylation calling Dorado Perform methylation calling for 5mCG and 5hmCG modifications (SUP)

Alignment

Step Tool Description
Filter python script Filter to keep only reads with quality score >Q10 and length >200
Alignment Dorado Align sample reads to the reference genome, remove secondary alignments
Concatenate BAM samtools As POD5 files were treated by batch, BAM concatenation is more convenient for next steps
Sort, index samtools Sort and index the just generated BAM files

Quality Control

Step Tool Description
BAM QC NanoPlot, Qualimap Generate general QC metrics and graphs regarding a BAM file
BAM QC Mosdepth Calculate the depth of coverage of a BAM file
BAM to FASTQ samtools Convert BAM file to FASTQ file for supplementary QC
FASTQ QC FastQC Perform QC on FASTQ file
FASTQ QC FastQ-Screen Perform library quality check on FASTQ file
QC report MultiQC Aggregates BAM and FASTQ QC results to generate a final QC report for all samples

Differential Methylation Analysis

Step Tool Description
DMR by sample GeneDMRs (R) Differentially methylated regions analyses (Alu, Transcripts, CpG) for and between all samples
DMR by condition GeneDMRs (R) Differentially methylated regions analyses (Alu, Transcripts, CpG) between two conditions

Germline variant calling

Step Tool Description
SNV calling Clair3, PEPPER-Margin-DeepVariant Call Single-Nucleotide Variants
SNV annotation SnpEff, SnpSift Annotate Single-Nucleotide Variants using ClinVar and/or dbNSFP
SNV filtering SnpSift Filter depending on the parameters entered in the config file
SNV graphs vcf2maf, maftools (R) Convert VCF to MAF file and generate general graphs by sample and graphs by gene if given in the config file
SV calling Sniffles2, cuteSV Call Structural Variants
SV annotation AnnotSV Annotate SV
SV graphs Sniffles2_plot Generate general graphs by sample
Long CNV calling Spectre Call long CNV >100kb

Somatic variant calling

Step Tool Description
SNV calling ClairS Call somatic Single-Nucleotide Variants
SNV annotation SnpEff, SnpSift Annotate Single-Nucleotide Variants using ClinVar and/or dbNSFP
SNV filtering SnpSift Filter depending on the parameters entered in the config file
SNV graphs vcf2maf, maftools (R) Convert VCF to MAF file and generate general graphs by sample and graphs by gene if given in the config file
SV calling nanomonsv Call somatic Structural Variants
SV annotation AnnotSV Annotate SV

Phasing

Step Tool Description
Phasing WhatsHap Germline SNV phasing
Haplotagging WhatsHap Assign Haplotype 1 or Haplotype 2 to the reads covering at least two heterozygous variants, generating a new BAM file, useful for IGV.

This pipeline is still under active development to provide the most complete and accurate analysis experience.

🗨️ Feel free to submit issues either to report any bug or to suggest pipeline enhancements.