A very basic script for analyzing ERCC controls in RNA-seq data
Input:
- directory containing one or more fastq.gz files.
- Tested with concatenated Illumina HiSeq output
- SampleReport.csv, which describes the samples found in the fastq.gz files
- Sorry, this is a custom format from Illuminati
- ERCC bowtie indexes built under a subdirectory called ‘data’
- mix 1 and mix 2 expected values in a file called ‘expected.txt’ inside data
Output:
A sub-directory in the starting directory containing the following:
- Alignments to the ERCC control sequences
- Basic RPKM outputs for each sample
- Graphs displaying dynamic range of samples compared to mix 1 and mix 2
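For reference, the standard RPKM formula can be sketched in Ruby as below. This is a minimal sketch, not the script's own code: the method name `rpkm` and the example counts are made up, and whether spikein_run.rb normalizes by all reads or only by reads aligned to the ERCC controls is not stated here.

```ruby
# Reads Per Kilobase of transcript per Million mapped reads.
#   read_count   - reads aligned to one ERCC control sequence
#   length_bp    - length of that sequence in bases
#   total_mapped - total aligned reads used for normalization (assumption:
#                  the choice of denominator is up to the caller)
def rpkm(read_count, length_bp, total_mapped)
  return 0.0 if length_bp.zero? || total_mapped.zero?
  (read_count * 1.0e9) / (length_bp * total_mapped)
end

# Hypothetical counts, for illustration only.
puts rpkm(500, 1000, 2_000_000)
```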
Install:
- git clone into directory of your choice
- create ‘data’ subdirectory
- here are the needed files inside data:
ERCC92.1.ebwt ERCC92.2.ebwt ERCC92.3.ebwt ERCC92.4.ebwt ERCC92.fa ERCC92.fa.fai expected.txt
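A quick way to confirm the data directory is complete before running is sketched below. It only checks for the file names listed above; the helper name `missing_data_files` is hypothetical, and the sketch assumes it is saved next to the data directory.

```ruby
# File names taken from the list above.
REQUIRED_DATA_FILES = %w[
  ERCC92.1.ebwt ERCC92.2.ebwt ERCC92.3.ebwt ERCC92.4.ebwt
  ERCC92.fa ERCC92.fa.fai expected.txt
]

# Returns the names of required files that are missing from data_dir.
def missing_data_files(data_dir)
  REQUIRED_DATA_FILES.reject { |name| File.exist?(File.join(data_dir, name)) }
end

# Assumption: the data directory sits beside this file.
missing = missing_data_files(File.join(File.dirname(__FILE__), "data"))
if missing.empty?
  puts "data directory looks complete"
else
  puts "missing from data: #{missing.join(', ')}"
end
```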
expected.txt has the following format:
- Re-sort ID
- ERCC ID
- subgroup
- concentration in Mix 1 (atmol/ul)
- concentration in Mix 2 (atmol/ul)
- expected fold-change ratio
- log2(fold-change)
- expected dCt
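Reading this file could look roughly like the sketch below. It assumes expected.txt is tab-delimited with a single header line and that the columns appear in the order given above; none of that is confirmed here. `ExpectedRow`, `parse_expected` and `log2_ratio` are hypothetical names, and the sample row is made up.

```ruby
# Assumption: tab-delimited text, one header line, columns in the order
# given in the README.
ExpectedRow = Struct.new(:ercc_id, :subgroup, :mix1, :mix2)

def parse_expected(text)
  lines = text.lines.map(&:chomp).reject(&:empty?)
  lines.drop(1).map do |line|            # drop(1) skips the header line
    f = line.split("\t")
    ExpectedRow.new(f[1], f[2], f[3].to_f, f[4].to_f)
  end
end

# log2 of the Mix 1 to Mix 2 concentration ratio for one control.
def log2_ratio(row)
  Math.log2(row.mix1 / row.mix2)
end

# Made-up row, for illustration only (not real ERCC values).
sample = "Re-sort ID\tERCC ID\tsubgroup\tmix1\tmix2\tratio\tlog2\tdCt\n" \
         "1\tERCC-EXAMPLE\tA\t400.0\t100.0\t4\t2\t0\n"
rows = parse_expected(sample)
puts "#{rows.first.ercc_id}: log2(Mix 1 / Mix 2) = #{log2_ratio(rows.first)}"
```

For the real file, something like `parse_expected(File.read("data/expected.txt"))` would be the starting point, if the assumptions above hold.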
Run:
./spikein_run.rb /path/to/fastq.gz/files
SampleReport.csv has the following fields:
output,lane,sample name,illumina index,custom barcode,read,reference,total reads,pass filter reads,pass filter percent,align percent,type,read length
Only output, lane, sample name, and illumina index should be necessary.
output should be the name of the fastq file. sample name, along with lane, is used to name the spikein analysis output.
Example:
output,lane,sample name,illumina index,custom barcode,read,reference,total reads,pass filter reads,pass filter percent,align percent,type,read length
test.fastq,1,Illumina_1ug_1,ACAGTG,,1,mm9,38088422,38088422,100.00,69.60,single,51
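To show how the four required fields might be pulled out of this file, here is a minimal sketch using Ruby's standard csv library on the example above. `read_samples` and `output_label` are hypothetical helpers, and the `<sample name>_<lane>` label is only a guess at a naming scheme; the script's real convention may differ.

```ruby
require 'csv'

# Header and row copied from the example above.
report = <<CSV_TEXT
output,lane,sample name,illumina index,custom barcode,read,reference,total reads,pass filter reads,pass filter percent,align percent,type,read length
test.fastq,1,Illumina_1ug_1,ACAGTG,,1,mm9,38088422,38088422,100.00,69.60,single,51
CSV_TEXT

# Keep only the four fields the README says are needed.
def read_samples(csv_text)
  CSV.parse(csv_text, :headers => true).map do |row|
    {
      :fastq => row["output"],
      :lane  => row["lane"],
      :name  => row["sample name"],
      :index => row["illumina index"]
    }
  end
end

# Hypothetical naming scheme: "<sample name>_<lane>".
def output_label(sample)
  "#{sample[:name]}_#{sample[:lane]}"
end

samples = read_samples(report)
samples.each { |s| puts "#{s[:fastq]} -> #{output_label(s)}" }
```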
Enjoy!
Only tested on Mac OS X 10.7 and Linux (CentOS 6).
Requirements:
- ruby 1.9
- R
- samtools
- bowtie
- zcat
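A small pre-flight check for the external tools listed above might look like the following sketch. `on_path?` is a hypothetical helper that only looks for an executable file with the given name on PATH; it does not check versions.

```ruby
# External tools from the list above (ruby itself is already running).
REQUIRED_TOOLS = %w[R samtools bowtie zcat]

# True if an executable file called name exists in one of the PATH directories.
def on_path?(name, path = ENV["PATH"].to_s)
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

missing = REQUIRED_TOOLS.reject { |tool| on_path?(tool) }
puts missing.empty? ? "all tools found" : "not found on PATH: #{missing.join(', ')}"
```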