Snakemake file for handling concatenated Illumina .fastq.gz files to generate aligned sequence

varunchakrakodi/genome_alignment

There are three scripts provided here.

1. smkalign.py

This is a Snakemake file that processes .fastq.gz files and aligns them against a reference sequence. For this to work, concatenate the R1 and R2 reads for each sample and place the files in a single folder. Alignment uses bwa-mem2, so the reference sequence must be indexed beforehand. A config.yaml file is also required (a template is provided).

Command: It is a good idea to generate the DAG (Directed Acyclic Graph) before executing the Snakemake workflow, to check that everything is in place. This step is optional, and it also gives a visualisation of the expected process.

snakemake --snakefile smkalign.py --configfile config.yaml --dag | dot -Tpng > dag.png

To execute the workflow (replace n with the number of CPU cores to use):

snakemake --snakefile smkalign.py --configfile config.yaml --cores n

Depending on your system, the filesystem may cache file status, and there can be a delay between a job finishing and its output files becoming visible. Snakemake may then fail prematurely, wrongly assuming that the files were not generated. A latency wait gives the filesystem time to update and make the new files visible. In such cases, add --latency-wait followed by the number of seconds to wait (for example, --latency-wait 60) to the command.

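$PATH = "/path/to/working_dir"

Here $PATH is a placeholder for the working directory, not the shell's PATH environment variable. A results folder ($PATH/results) containing all the outputs will be created in the working directory for verification and further processing if needed.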

Unfortunately, samtools consensus calling writes the same header, >reference, to every fasta file. This causes problems in downstream processing if you combine the files with cat *.fasta > Samples.fasta, because every sequence then carries an identical name.

It is therefore recommended to change the header in each .fasta file before further processing, using a command of the following form (replace file with the sample name):

sed -i 's/^>.*/>file/' file.fasta
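To apply the renaming to every file in one go, the command above can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the function name rename_fasta_headers and the demo file names are hypothetical, and it assumes that each file's base name is the desired sample name and contains no characters that are special to sed (such as / or &). It writes through a temporary file instead of using sed -i, since the -i option behaves differently between GNU and BSD sed.

```shell
# Hypothetical helper (not part of the repository): set the header of every
# .fasta file in a directory to the file's base name, e.g. sampleA.fasta -> >sampleA
rename_fasta_headers() {
    dir="$1"
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue              # nothing to do if no .fasta files match
        name=$(basename "$f" .fasta)
        # Rewrite via a temporary file, then move it back over the original
        sed "s/^>.*/>${name}/" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Small demonstration with two made-up consensus files
demo_dir=$(mktemp -d)
printf '>reference\nACGT\n' > "$demo_dir/sampleA.fasta"
printf '>reference\nTTGA\n' > "$demo_dir/sampleB.fasta"

rename_fasta_headers "$demo_dir"

# The files can now be combined without header clashes
combined=$(mktemp)
cat "$demo_dir"/*.fasta > "$combined"
cat "$combined"
```

Like the original command, this replaces every header line in a file, so a file holding more than one sequence would end up with repeated names.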

Dependencies:

  1. Snakemake (https://snakemake.readthedocs.io/en/stable/)
  2. Cutadapt (https://github.com/marcelm/cutadapt)
  3. samtools (https://github.com/samtools/samtools/releases/)
  4. bwa-mem2 (https://github.com/bwa-mem2/bwa-mem2)
  5. Graphviz (https://graphviz.org/), required only if generating the DAG image
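Before running the workflow, it can help to confirm that these tools are on the search path. The sketch below is an illustration only: the function name missing_tools is hypothetical, and it assumes the tools are installed under their usual executable names (dot is the Graphviz program used in the DAG command).

```shell
# Hypothetical helper (not part of the repository): print the name of each
# listed tool that cannot be found on the search path
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools snakemake cutadapt samtools bwa-mem2 dot)
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All listed tools were found."
fi
```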

The alignment (the .bam file together with its .bai index) can be viewed using the Tablet alignment viewer (https://ics.hutton.ac.uk/tablet/)

2. alignstats.sh

Sometimes it is useful to take a closer look at the alignment statistics. This bash script reads the .bam and file_depth.txt files and generates an alignment_info.csv file containing the number of reads, aligned coverage (%), minimum depth, maximum depth and average depth.
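To show how such depth figures can be derived, here is a minimal sketch. It is not the code of alignstats.sh. It assumes the depth file is tab-separated with three columns (reference name, position, depth), as written by samtools depth, and that every reference position is listed, so that the share of positions with non-zero depth can stand in for coverage. The function name depth_summary and the demo values are hypothetical.

```shell
# Hypothetical helper (not the repository's alignstats.sh): summarise a
# tab-separated depth file with columns: reference name, position, depth.
# Prints: coverage_pct,min_depth,max_depth,avg_depth
depth_summary() {
    awk -F'\t' '
        {
            d = $3 + 0                       # force numeric interpretation
            if (NR == 1) { min = d; max = d }
            if (d < min) min = d
            if (d > max) max = d
            if (d > 0) covered++             # position has at least one read
            sum += d
        }
        END {
            if (NR == 0) { print "0.00,0,0,0.00"; exit }
            printf "%.2f,%d,%d,%.2f\n", 100 * covered / NR, min, max, sum / NR
        }' "$1"
}

# Small demonstration with four made-up positions
demo_depth=$(mktemp)
printf 'ref\t1\t0\nref\t2\t10\nref\t3\t20\nref\t4\t30\n' > "$demo_depth"

summary=$(depth_summary "$demo_depth")
echo "coverage_pct,min_depth,max_depth,avg_depth"
echo "$summary"
```

For the demo input, three of the four positions are covered, so the summary line is 75.00,0,30,15.00. A read count would have to come from the .bam file itself, for example with samtools view -c.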

Command: ./alignstats.sh $PATH/results

3. smkscrub.py

Sometimes it is useful to scrub host reads before aligning the reads to the target reference sequence. This is especially useful when samples contain a complex mixture of microbes. This Snakemake file processes the .fastq.gz files against a host reference sequence. Once the scrubbed reads are obtained, they can be processed further using smkalign.py

Scrubbing uses minimap2, which must be installed on the system if it is not already available (https://github.com/lh3/minimap2)

Rules of usage are similar to smkalign.py

Command: snakemake --snakefile smkscrub.py --configfile config.yaml --cores

Note: The pipelines work with Illumina short reads only (they use bwa-mem2, and cutadapt with Illumina adaptors).
