Code we have created to navigate miscellaneous bioinformatic challenges.

NewtonLabUWM/Misc_Bioinformatics

Miscellaneous bioinformatics

Navigating common challenges in microbial ecology.

Multiplexing

Protocol. How to sort barcoded Illumina reads into individual FASTQ files. The easy way to taxonomically identify microbial isolates! Includes a program (demultiplexFASTQ.py) and a four-sample dataset.
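To illustrate the idea of sorting reads by barcode, here is a minimal sketch. It is not the code in demultiplexFASTQ.py. It assumes each read carries its barcode inline at the 5' end of the sequence, and the names `read_fastq`, `demultiplex`, and the barcode-to-sample dictionary are hypothetical.

```python
from collections import defaultdict
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = record
        yield header, seq, qual


def demultiplex(lines, barcodes):
    """Group reads by the barcode found at the start of each sequence.

    barcodes maps barcode sequence -> sample name (assumed layout).
    The barcode is trimmed from the sequence and quality strings.
    Reads without a matching barcode go to "unassigned", untrimmed.
    """
    by_sample = defaultdict(list)
    for header, seq, qual in read_fastq(lines):
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                by_sample[sample].append((header, seq[n:], qual[n:]))
                break
        else:
            by_sample["unassigned"].append((header, seq, qual))
    return dict(by_sample)
```

Each sample's list could then be written to its own FASTQ file. Exact prefix matching is the simplest choice; tolerating barcode mismatches would need a different comparison.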

exactMatching

Protocol. We commonly want to find exact matches between sequences in two FASTA files. When files are large, we don't always need or want the robust BLAST algorithm. This is a Perl program that is fast, light, and easy.
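The repository's program is written in Perl. As a rough illustration of the same idea in Python, the sketch below indexes one file's sequences in a dictionary and looks up each sequence from the other file. The function names are hypothetical, and case-insensitive comparison is an assumption made here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())  # assumption: ignore case
    if header is not None:
        yield header, "".join(chunks)


def exact_matches(query_lines, reference_lines):
    """Return (query_header, reference_header) pairs with identical sequences."""
    index = {}
    for header, seq in read_fasta(reference_lines):
        index.setdefault(seq, []).append(header)
    matches = []
    for header, seq in read_fasta(query_lines):
        for ref_header in index.get(seq, []):
            matches.append((header, ref_header))
    return matches
```

Because each lookup is a hash-table access, the work grows roughly linearly with the total number of sequences, which is why this approach stays light on large files.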

Cutadapt

Protocol. How to trim primer sequences from reads generated by Illumina. Plus several common targets:

  1. microbial V3-V4 16S rRNA
  2. microbial V4 16S rRNA
  3. bacterial V4-V5 16S rRNA
  4. microbial V1-V9 16S rRNA
  5. microbial V1-ITS 16S rRNA
  6. fungal 18S rRNA

DADA2

  1. filterAndTrim_bigData.R. At the filterAndTrim step, process groups of samples one at a time instead of all samples simultaneously. Saves time and computing power, and avoids crashes and headaches.

  2. merge_ASV_tables.R. Helpful when you have many ASV tables from DADA2 and want to merge them by unique FASTA sequences.
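The merging step in the second item can be sketched as follows. This is an illustration in Python, not the logic of merge_ASV_tables.R. It assumes each ASV table is held as a dictionary mapping sequence to per-sample read counts, and the function name is hypothetical.

```python
def merge_asv_tables(tables):
    """Merge ASV tables keyed by unique sequence.

    Each table maps ASV sequence -> {sample name: read count}.
    Returns (sorted sample names, merged table). Every sequence gets a
    count for every sample; missing combinations are filled with 0.
    Counts for the same sequence and sample across tables are summed.
    """
    merged = {}
    samples = set()
    for table in tables:
        for seq, counts in table.items():
            row = merged.setdefault(seq, {})
            for sample, count in counts.items():
                row[sample] = row.get(sample, 0) + count
                samples.add(sample)
    sample_names = sorted(samples)
    filled = {
        seq: {s: row.get(s, 0) for s in sample_names}
        for seq, row in merged.items()
    }
    return sample_names, filled
```

Keying on the sequence itself, rather than on an ASV label, is what lets tables from separate runs line up even when their labels differ.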

NCBI

  1. removeLineBreakFASTA.sh. Contigs downloaded from NCBI contain line breaks at 800 bp. This script removes them.

  2. downloadMultipleSRA_series.sh. Download multiple files from Sequence Read Archive. Use when you're interested in runs that are named as a series of numbers, which is typical for BioProjects (e.g., runs in project PRJNA597057 range from SRR10755563 to SRR10755886).

  3. downloadMultipleSRA_text.sh. Download multiple files from Sequence Read Archive. Use when you're interested in runs that are not named in a series. Create a text file called "runs.txt" with all desired runs.

  4. ncbiTaxDB_scrape.sh. With a list of NCBI IDs, scrape the taxonomy database webpage associated with each ID, keeping only taxonomy paths (Kingdom, Phylum, etc.) in the resulting file.

  5. ncbiAssemblyDB_scrape.sh. Same idea, but here we scrape the NCBI assembly database for associated BioSamples.
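Two of the simpler tasks in this list, joining wrapped sequence lines and enumerating a series of run accessions, can be sketched in a few lines of Python. The repository's versions are shell scripts. The function names below are hypothetical, and the sketch assumes both accessions in a series share the same prefix. It only builds the accession list and does not download anything.

```python
def unwrap_fasta(lines):
    """Return FASTA lines with each record's sequence joined onto one line."""
    out, chunks = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:
                out.append("".join(chunks))
                chunks = []
            out.append(line)
        else:
            chunks.append(line)
    if chunks:
        out.append("".join(chunks))
    return out


def sra_series(first, last):
    """List run accessions from first to last, inclusive.

    Assumes both accessions share one prefix (e.g. "SRR").
    """
    prefix = first.rstrip("0123456789")
    width = len(first) - len(prefix)
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]
```

For example, `sra_series("SRR10755563", "SRR10755886")` yields every accession in the PRJNA597057 range mentioned above, which could then be passed to a download tool one at a time.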

navigateFASTQ-A

  1. catFASTQ.sh. Concatenate FASTQ files with identical names. Its original purpose was to combine files from two sequencing runs (on full and nano Illumina flow cells) on the same samples.

  2. calculateRPKM.py. Count the number of bases in a FASTA file and convert to reads per kilobase per million mapped reads (RPKM), a metric used in metatranscriptomics.

  3. subsetFASTQ.sh. Subset a large FASTQ into smaller ones. This was helpful when learning error rates on a large dataset in DADA2.

  4. fastaToCSV.sh. Have a FASTA file? Want to work with it in Excel or R? Use this. The result is a spreadsheet with two columns, "Headers" and "FASTA."
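For reference, the RPKM metric named in the second item follows the standard definition: reads mapped to a feature, divided by the feature length in kilobases and by the total mapped reads in millions. The sketch below shows that arithmetic only. It is not the code in calculateRPKM.py, and the function and parameter names are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)
```

As a worked example, 100 reads on a 2,000 bp gene in a library of 1,000,000 mapped reads gives 100 / (2 kb × 1 million) = 50 RPKM.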

toolsAndPipelines

  1. rgiFASTA.sh. Mine antibiotic resistance genes (ARGs) from FASTAs in a directory with CARD's Resistance Gene Identifier (RGI).

  2. deepARG_organize.R. Load and organize results from the deepARG online tool.

  3. metaxa2_[fastq/fasta].sh. Assess taxonomy in assembled or unassembled metagenomes with Metaxa2.

  4. integronFinder.sh. Mine integron sequences from contigs with Integron Finder.

  5. mobileOG-db.sh. Mine mobile genetic elements from the mobileOG database.
