AU_BIOL-7180_SPR20_GroupProject

This GitHub will be dedicated to organizing the efforts made by students learning biological scripting to analyzing real datasets for a class group project. Raw read data can be found in the following directory in scratch and in the shared box folder: /scratch/AU_BIOL-7180_GrpProject/samples_Salmonella_muenster

Background

Salmonella Gram-Negative Bacilli

Reference Genomes and Assemblies Used

Traditional NCBI Refernce: Salmonella enterica CVM N51250 v1.0 (SER) Salmonella enterica subsp. enterica serovar Muenster (enterobacteria)
Ray Assembly of WT-2012 (scaffolds): SER_ray
Ray Assembly of WT-2012 (contig): SER_rayc
SPAdes Assembly of WT-2012: SER_spades

Reference	Genome Assembly Size
Salmonella enterica CVM N51250 v1.0	4.7 Mbp
SER_ray	4.8 Mbp
SER_rayc	4.8 Mbp
SER_spades	4.8 Mbp

General Workflow

Obtain raw reads using SRA Toolkit fastqDump.
FastQC was utilized for basic quality control metrics of raw MiSeq sequencing data.See Resulting Output
Analyzed data for trimming (Step skipped)
BWA for aligining reads to reference genomes and assemblies. BWA was used to do the following:

a. Create an index for the reference genomes (script 4)

b. Align the samples we have to both reference genomes (script 5) INPUT FILE: .fastq.gz OUTPUT FILE: .aln.sam

c. Convert the aligned files to an extension the BWA could use to sort and merge files based on reference genome aligned to (script 6) INPUT FILE: .aln.sam OUTPUT FILE: .sorted.merged.bam
Generate phylogeny of samples using CSIPhylogeny. Tree edit with Mega X.

SNP Analysis from Vcf files

Mutect2 is a vcf generating somatic variant caller. HaplotypeCaller is a took of GATK that calls variands by default with diploid ploidy, but can have ploidy specifed. And FreeBayes is a variant caller commonly ustilized for varaint calling in haploid samples. Here, we opted to do a comparision of the variant calling quality for these three popular variant calling tools. Summary stats of the vcf files genertaed using bcftools stats and plot-vcfstats. The stats are represented below.

Supplementary code to generate stats file(s):

module load bcftools
module load samtools

#Run bcftools stats to genertae stats on vcf file to be used in python to genrate plot. This is the input for plot-vcfstats
bcftools stats WT_2012_ref_assemblyc.unfiltered.vcf > ./plot_aw/WT_2012_ref_assemblyc.unfiltered.vchk

#Run plot-vcfstats under bcftools. Be sure to have python3-matplotlib installed as this is a dependency.
##For matplotlib run : sudo apt-get install python3-matplotlib
plot-vcfstats WT_2012_ref_assemblyc.unfiltered.vchk -p ./unflit_assembly_plot/

Files Generated By plot-vcfstats:

counts_by_af.indels.dat  depth.0.dat  depth.0.png   indels.0.pdf  plot-vcfstats.log  substitutions.0.pdf  tstv_by_af.0.dat    tstv_by_qual.0.pdf
counts_by_af.snps.dat    depth.0.pdf  indels.0.dat  indels.0.png  plot.py            substitutions.0.png  tstv_by_qual.0.dat  tstv_by_qual.0.png

Run python3 on polt.py script:

#Use python3 to plot data generated in plot.py script and produce png graph files.
python3 plot.py

Table-1

Reference	indels	snps	transitions	transversions	mnp¹	other	mas²	ts/tv³	Variant Caller
Salmonella enterica CVM N51250 v1.0	49	2017	1073	944	436	9	0	1.136653	FreeBayes
WT_2012_Ray_Scaffold	18	1367	434	933	77	5	9	0.465166	FreeBayes
WT_2012_Ray_Contigs	18	1351	437	914	77	5	9	0.478118	FreeBayes
Salmonella enterica CVM N51250 v1.0	239	1966	1221	745	0	7828	2007	1.638926	HaplotypeCaller
WT_2012_Ray_Scaffold	27	553	140	413	0	5467	467	0.338983	HaplotypeCaller
WT_2012_Ray_Contigs	24	553	139	414	0	5442	466	0.335749	HaplotypeCaller
WT_2012_Ray_Contigs	614	1062	448	614	53	0	10	0.729642	Mutect2

Table-1 Notes

1-Multi Nucleotide Polymorphism (mnp)

2-Multiallelic sites (mas)

3-Transition to transversion raio (ts/tv)

Rplots of SNPs Detected

Example Script for generating R plots

library(ggplot2)
p1<-ggplot(data=INSERT_DATAFILE_HERE, aes(x=Reference, y=other)) + geom_bar(stat="identity",  width=0.5, fill="black")+ coord_flip()
p2<-ggplot(data=INSERT_DATAFILE_HERE, aes(x=Reference, y=other)) + geom_bar(stat="identity",  width=0.5, fill="black")+ coord_flip()+ ylim(NA, 40)
gridExtra::grid.arrange(p7, top= "INSERT_TITLE_HERE")
gridExtra::grid.arrange(p6, top= "INSERT_TITLE_HERE")

SNPs and Indels Detected Using FreeBayes and Mutect2	SNPs and Indels Detected Using HaplotypeCaller and Mutect2

Transitions and Transversions Detected Using FreeBayes and Mutect2	Transitions and Transversions Detected Using HaplotypeCaller and Mutect2

Other SNPS Detected Using FreeBayes and Mutect2	Other SNPs Detected Using HaplotypeCaller and Mutect2

fastq2vcf tool

Within the TestFile directory and on the main GitHub page is the script fastq2vcf.sh which can be run in the ASC and submitted to the queue system. This script is an automated version of our pipeline and will perform all steps with the exception of trimming fastq files. The script is designed to download fastq files using SRA toolkit's fastq-dump will ultimately gernerate files needed for viewing data on the Integrative Genome Viewer (IGV), VCF files based on user input for ploidy, and stats files of the VCF (ends in .vchk). All files except index files, dictionary files, and other reference file dependent files are organized into directories as part of a final cleanup step. Within the script, users must define thre variables, "ref=", "input_file=", and "ploidy=". Please note this is a skeleton script and users have the ability to edit tool arguments to best fir their data. The .log file generated displays the timings of when each step of the pipeline was completed.

Note: No space should be added between the "=" and user information.

User-Defined Variables

ref= basename of fasta file (Example: for Salmonella.fasta, put Salmonella)

input_file= name of file with list of SRR number to be downloaded with fastq-dump (Example: sample_list.txt)

ploidy= ploidy of organism from samples (Example: for Salmonella.fasta, put 1)

Note about ploidy- If the defined ploidy =1, FreeBayes will be called. Any other ploidy will call HaplotypeCaller.

Sample datasets are provided in the TestFile folder.

IMPORTANT To download an entire set of reads from the SRA, remove the -X # argument. The script's runtime may increase as a result.

Salmonella.fasta (haploid)

sample_list.txt (haploid-Bacterial isolates from this project)

covid-19_MT039887.1.fasta (Negative control)

sample_list2.txt (diploid-Human)

Dependencies

bwa/0.7.12

samtools/1.3.1

picard/1.79

gatk/4.1.4.0

freebayes/1.0.2

sra

fastqc/0.10.1

Authors are listed below.

Ashley Williams (email: [email protected])

Shelby Osburn (email: [email protected])

Steven Kitchens (email: [email protected])

Name		Name	Last commit message	Last commit date
Latest commit History 124 Commits
Completed_scripts		Completed_scripts
SPAdes_Assemblies_Scripts_Analysis		SPAdes_Assemblies_Scripts_Analysis
SRR10740739_assembly		SRR10740739_assembly
Scripts_old		Scripts_old
TestFile		TestFile
VCF_Stats_SNP		VCF_Stats_SNP
Working_Scripts_AW		Working_Scripts_AW
fastQC		fastQC
Final_Pipeline.png		Final_Pipeline.png
README.md		README.md
Rplot_FB_SNP_Indels.png		Rplot_FB_SNP_Indels.png
Rplot_Other_SNP_Hap.png		Rplot_Other_SNP_Hap.png
Rplot_SNP_Other_FB.png		Rplot_SNP_Other_FB.png
Rplot_Snp_Indel_Hap.png		Rplot_Snp_Indel_Hap.png
Rplot_Ts_Tv_Hap.png		Rplot_Ts_Tv_Hap.png
Rplot_Tv_Ts_FreeBayes.png		Rplot_Tv_Ts_FreeBayes.png
Sal_Muenster_SNP_Tree.svg		Sal_Muenster_SNP_Tree.svg
Workflow.png		Workflow.png
fastq2vcf.sh		fastq2vcf.sh
snp_tree.main_tree.svg		snp_tree.main_tree.svg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AU_BIOL-7180_SPR20_GroupProject

Background

Reference Genomes and Assemblies Used

General Workflow

SNP Analysis from Vcf files

Table-1

Table-1 Notes

Rplots of SNPs Detected

Example Script for generating R plots

fastq2vcf tool

User-Defined Variables

Sample datasets are provided in the TestFile folder.

Dependencies

Authors are listed below.

About

Releases

Packages

Contributors 3

Languages

asw0049/AU_BIOL-7180_SPR20_GroupProject

Folders and files

Latest commit

History

Repository files navigation

AU_BIOL-7180_SPR20_GroupProject

Background

Reference Genomes and Assemblies Used

General Workflow

SNP Analysis from Vcf files

Table-1

Table-1 Notes

Rplots of SNPs Detected

Example Script for generating R plots

fastq2vcf tool

User-Defined Variables

Sample datasets are provided in the TestFile folder.

Dependencies

Authors are listed below.

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages