Computational Biology 483 - Mini Project

This repository contains Python code from an LUC Computational Biology project.

Software Requirements

Linux/Unix
Python3
Biopython
Kallisto
Bowtie2
SPAdes

Packages Imported for Python

os
csv
BioPython (Entrez and SeqIO)

Packages Imported for R

sleuth
dplyr

Run the Pipeline:

To run this script, first copy the following line of code to clone this repository to your local machine:

git clone https://github.com/rmormando/MiniProject_RitaMormando.git

Then use this line to run through the python script:

python3 final_minicode.py

Installation Location:

https://github.com/rmormando/MiniProject_RitaMormando.git

Using the Test Data:

You can run through the script using the given SRR files included in the python code. However, if you want to save time and run through this program quickly use the test samples that are included on this repository.

The test samples contain the first 10000 lines of the paired-end fastq files.

Files included in Repo:

final_minicode.py

the whole pipeline which is composed of multiple functions that give each outputs

Sleuth_Rscript.R

R script which outputs a text file which identifies the diffrence between 2 timepoints in the expressed genes

MiniProject.log

a log file that has all of the information from the data ran with the script

Output Files included in Repo:

HCMV_index.idx

index created to be used by kallisto

EF999921_CDS.fasta

FASTA file with CDS of EF999921

myBlastResults.csv

CSV file that has contains the results after run through BLAST

longest_contig.fasta

FASTA file that contains the longest contig found while running SPAdes

out_sleuth.txt

TXT file used by R and then written to the log file

You will also need to download all of the sequences from NCBI's nucleotide database for the entire Betaherpesvirinae subfamily. This file is about 151MB long (too large to include in the repo) but is necessary for BLAST to work.

You can follow this link (https://www.ncbi.nlm.nih.gov/nuccore/?term=txid10357%5BOrganism%3Aexp%5D) or search this (txid10357[Organism:exp]) in the nucleotide database to bring you to NCBI's entire collection of Betaherpesvirinae sumbfamily sequences.

You can download them all by clicking on the 'Send To' link and select for these parameters which will be necessary for the code to function correctly: Complete Record, File, FASTA, Sequence Length

Assignment Questions:

We would like to compare HCMV transcriptomes 2- and 6-days post-infection (dpi). First, retrieve the following transcriptomes from two patient donors from SRA and convert to paired-end fastq files. You can use wget (by constructing the path based on the SRR numbers for each of these samples). Donor 1 (2dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896360 Donor 1 (6dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896363 Donor 3 (2dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896374 Donor 3 (6dpi): https://www.ncbi.nlm.nih.gov/sra/SRX2896375
We will quantify TPM in each sample using kallisto, but first we need to build a transcriptome index for HCMV (NCBI accession EF999921). Use Biopython to retrieve and generate the appropriate input and then build the index with kallisto. You will need to extract the CDS features from the GenBank format. Write the following to your log file (replace # with the number of coding sequences in the HCMV genome): The HCMV genome (EF999921) has # CDS.
Quantify the TPM of each CDS in each transcriptome using kallisto and use these results as input to find differentially expressed genes between the two timepoints (2pi and 6dpi) using the R package sleuth. Write the following details for each significant transcript (FDR < 0.05) to your log file, include a header row, and tab-delimit each item: target_id test_stat pval qval
It has been proposed that HCMV disease and pathogenesis may be related to the genetic diversity of the virus (Renzette et al. https://www.ncbi.nlm.nih.gov/pubmed/25154343/). Which publicly available strains are most similar to these patient samples? To compare to other strains, we will assemble these transcriptome reads. We don’t expect assembly to produce the entire genome, but enough to be useful in BLAST. Virus sequencing experiments often include host DNAs. It is difficult to isolate the RNA of just the virus (as it only transcribes during infection of the host cell). Before assembly, let’s make sure our reads map to HCMV. Using Bowtie2, create an index for HCMV (NCBI accession EF999921). Next, save the reads that map to the HCMV index for use in assembly. Write to your log file the number of reads in each transcriptome before and after the Bowtie2 mapping. For instance, if I was looking at the Donor 1 (2dpi) sample, I would write to the log (numbers here are arbitrary): Donor 1 (2dpi) had 230000 read pairs before Bowtie2 filtering and 100000 read pairs after.
Using the Bowtie2 output reads, assemble all four transcriptomes together to produce 1 assembly via SPAdes. Write the SPAdes command used to the log file.
Write Python code to calculate the number of contigs with a length > 1000 and write the # to the log file as follows (replace # with the correct integer): There are # contigs > 1000 bp in the assembly.
Write Python code to calculate the length of the assembly (the total number of bp in all of the contigs > 1000 bp in length) and write this # to the log file as follows (replace # with the correct integer): There are # bp in the assembly.
Write Python code to retrieve the longest contig from your SPAdes assembly. Use the longest contig as blast+ input to query the nr nucleotide database limited to members of the Betaherpesvirinae subfamily. You will need to make a local database of just sequences from the Betaherpesvirinae subfamily. Identify the top 10 hits. For the top 10 hits, write the following to your log file: Subject accession, Percent identity, Alignment length, Start of alignment in query, End of alignment in query, Start of alignment in subject, End of alignment in subject, Bit score, E-value, and Subject Title. Include the following header row in the log file, followed by the top 10 hits, and tab-delimit each item: sacc pident length qstart qend sstart send bitscore evalue stitle

Author

Rita Mormando

Bioinformatics Program, Loyola University Chicago

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
EF999921_CDS.fasta		EF999921_CDS.fasta
HCMV_index.idx		HCMV_index.idx
MiniProject.log		MiniProject.log
README.md		README.md
Sleuth_Rscript.R		Sleuth_Rscript.R
final_minicode.py		final_minicode.py
longest_contig.fasta		longest_contig.fasta
myBlastResults.csv		myBlastResults.csv
testAGAIN_SRR5660030.1_1.fastq		testAGAIN_SRR5660030.1_1.fastq
testAGAIN_SRR5660030.1_2.fastq		testAGAIN_SRR5660030.1_2.fastq
testAGAIN_SRR5660033.1_1.fastq		testAGAIN_SRR5660033.1_1.fastq
testAGAIN_SRR5660033.1_2.fastq		testAGAIN_SRR5660033.1_2.fastq
testAGAIN_SRR5660044.1_1.fastq		testAGAIN_SRR5660044.1_1.fastq
testAGAIN_SRR5660044.1_2.fastq		testAGAIN_SRR5660044.1_2.fastq
testAGAIN_SRR5660045.1_1.fastq		testAGAIN_SRR5660045.1_1.fastq
testAGAIN_SRR5660045.1_2.fastq		testAGAIN_SRR5660045.1_2.fastq

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Computational Biology 483 - Mini Project

Software Requirements

Packages Imported for Python

Packages Imported for R

Run the Pipeline:

Installation Location:

Using the Test Data:

Files included in Repo:

Output Files included in Repo:

Assignment Questions:

Author

About

Releases

Packages

Languages

rmormando/MiniProject_RitaMormando

Folders and files

Latest commit

History

Repository files navigation

Computational Biology 483 - Mini Project

Software Requirements

Packages Imported for Python

Packages Imported for R

Run the Pipeline:

Installation Location:

Using the Test Data:

Files included in Repo:

Output Files included in Repo:

Assignment Questions:

Author

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages