There are three scripts provided here.
1. smkalign.py
This is a Snakemake file that can be used to process the .fastq.gz files and align against a reference sequence. For this to work, concat R1 and R2 reads for each sample and put them in a folder. The method uses bwa-mem2 based alignment. Hence, the reference sequence needs to be indexed. Also requires a config.yaml file (Template is provided).
Command: It is a good idea to run DAG (Directed Acyclic Graph) before executing the snakemake program to ensure everything is in place. (This step is optional). Also gives a visualisation of the expected process.
snakemake --snakefile smkalign.py --configfile config.yaml --dag | dot -Tpng > dag.png
For executing snakemake
snakemake --snakefile smkalign.py --configfile config.yaml --cores n
Depending on the system configuration you are using, filesystems might cache the status of files. Further, there can be a delay between the completion of a job and the visibility of its output files. This can cause Snakemake to prematurely fail, thereby wrongly assuming the files were not generated. Adding a latency wait gives the filesystem time to update and make the new files visible. In such cases Use --latency-wait sec
$PATH = "/path/to/working_dir"
$PATH/results containing all the outputs will be created in the working directory for verification and further processing if needed.
Unfortunately, samtools consensus calling creates a header as >reference in each fasta file. This leads to issues in downstream processing if you run cat *.fasta > Samples.fasta
Hence it is recommended to change headers in all .fasta files using the following command structure before further processing.
sed -i 's/^>.*/>file/' file.fasta
Dependencies:
- Snakemake (https://snakemake.readthedocs.io/en/stable/)
- Cutadapt ( https://github.com/marcelm/cutadapt )
- samtools (https://github.com/samtools/samtools/releases/)
- bwa-mem2 (https://github.com/bwa-mem2/bwa-mem2)
- If running DAG, requires Graphviz (https://github.com/graphp/graphviz)
.bai file can be viewed using Tablet Alignment viewer (https://ics.hutton.ac.uk/tablet/)
2. alignstats.sh
Sometimes it is useful to have a deeper look into the Alignment statistics. The bash script looks into .bam and file_depth.txt files and generates alignment_info.csv file containing data on Number of Reads, Aligned Coverage (%), Minimum Depth, Maximum Depth and Average Depth
Command: ./alignstats.sh $PATH/results
3. smkscrub.py
Sometimes it is useful to scrub the host reads before aligning the reads to the target reference sequence. This is especially useful if dealing with a complex mixture of microbes in samples. This is a Snakemake file that can be used to process the .fastq.gz files against a host reference sequence. Once the scrubbed reads are obtained, it can be further processed using smkalign.py
scrubbing uses minimap2 and needs to be installed in system, if not available (https://github.com/lh3/minimap2)
Rules of usage are similar to smkalign.py
Command: snakemake --snakefile smkscrub.py --configfile config.yaml --cores
Note: The pipelines are for working with Illumina Short-reads only (Uses bwa-mem2 and cutadapt with Illumina adaptors).