Illuminati

Automated Illumina Pipeline

                  
                                                            `-.`'.-'
                                                         `-.        .-'.
                                                      `-.    -./\.-    .-'
                                                          -.  /_|\  .-
                                                      `-.   `/____\'   .-'.
                                                   `-.    -./.-""-.\.-      '
                                                      `-.  /< (()) >\  .-'
                                                    -   .`/__`-..-'__\'   .-
                                                  ,...`-./___|____|___\.-'.,.
                                                     ,-'   ,` . . ',   `-,
                                                  ,-'                     `-,

About

Illuminati is a set of tools geared towards automating portions of Illumina’s secondary analysis pipeline. This application is custom-tailored for a specific research institution and is not architected for general use. This may change in the future, and others dealing with Illumina software could still find it interesting.

Note: we will refer to this pipeline as the ‘primary’ analysis pipeline, even though Illumina documentation refers to it as the ‘secondary’ analysis pipeline.

Quickstart

Let’s say you have Illuminati installed here (like we do for some reason):

/n/ngs/tools/pilluminati

and that you have a cluster with a head node named ngs-cluster.

You get a notification that Flowcell XYZ is done and ready for primary analysis.

First, we generate a run script:

ssh ngs-cluster
cd /n/ngs/tools/pilluminati

./bin/startup_run.rb XYZ

where XYZ is the flowcell ID for the flowcell you are running.
This will create an XYZ.sh run script in your run folder.

Our run folder is in:

/n/ngs/runs

So far, startup_run.rb has done nothing except create this run script. The run script is what will drive the primary analysis pipeline.

We can look over the run script before executing it:

less /n/ngs/runs/XYZ.sh

We see a lot of comments – mostly data output from LIMS. We also see a series of commands that will be executed. Look it over.

Most important are the comments that come after the /n/ngs/tools/pilluminati/bin/config_maker.rb call. These comments represent what will go into our config.txt and SampleSheet.csv files. If something looks wrong, we can run this command manually beforehand, tweak the results, and comment it out of the run script.
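
For orientation, here is the rough shape of those two files. This is a hedged example – the lane, sample, index, and genome values are made up, and the layouts follow CASAVA 1.8’s conventions rather than being copied from a real run:

# SampleSheet.csv (CASAVA 1.8 column layout; values illustrative)
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
XYZ,1,Sample_A,mm9,ATCACG,,N,,,Project_A

# config.txt (GERALD-style alignment config; values illustrative)
1:ANALYSIS eland_extended
1:ELAND_GENOME /n/data/genomes/mm9
1:USE_BASES Y*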

Once you are satisfied, all you have to do is run the run script…

/n/ngs/runs/XYZ.sh

Make sure to do that on the cluster head node (ngs-cluster). It will kick off the primary analysis pipeline, which includes demultiplexing, alignment, and post-processing. Emails will be sent to the pipeline administrator during the run. When it is done, files should be in their project directories. LIMS should be updated with the QA/QC data. Everything should be beautiful.

What it does

General overview of the execution of the analysis pipeline this tool performs:

  • Create configuration files
    • SampleSheet.csv and config.txt are created using data from custom LIMS system
  • Convert raw unaligned reads to Fastq format
  • Demultiplex indexed lanes
  • Perform alignment to reference genome
  • Aggregate and rename unaligned reads
  • Remove reads that do not pass filter
  • Split custom barcoded lanes
  • Analyze unaligned reads using fastqc
  • Aggregate and rename export files
  • Create Sample_Report for name / lane / sample mapping
  • Distribute data to project directories
  • Distribute stats and quality control analysis to qcdata directory

Requirements

  • CASAVA 1.8.x – this tool is primarily a wrapper around the CASAVA pipeline software to help automate it. This tool is designed to deal specifically with CASAVA 1.8, and would fail completely with an older version. CASAVA binaries should be located at CASAVA_PATH (see lib/illuminati/constants.rb).
    • Currently using CASAVA 1.8.1
  • fastqc – we need fastqc available from the command line. Tested with fastqc version 0.9.2.
    • This is run by the fastqc.pl script – located in the scripts directory.
  • ImageMagick – the fastqc.pl script also uses ImageMagick’s convert tool to process the images created by fastqc. Expected to be on the path.
  • fastx_barcode_splitter.pl – needs to be available from the command line.
  • lims_data.pl – the most troublesome external dependency. lims_data.pl should sit in the SCRIPT_PATH directory. It is a perl script that connects to our LIMS system and provides flowcell and distribution information back to Illuminati. Future work will be to eliminate or reduce this external dependency.
    • NOTE – There is a crude work-around for not needing this LIMS connection. See External Data below.
  • Custom Directory Structure – By default, Illuminati expects a certain directory structure to be in place in order to run correctly. lib/illuminati/config.rb describes the default path structure, with lib/illuminati/constants.rb providing some description of how these paths are used. The paths and other options used by Illuminati can be configured by creating a config file and setting the ENV variable ILLUMINATI_CONFIG to point to that path. See assets/example.config.yaml for a starting point, and the sketch below this list.
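
As a minimal sketch of such a config file (the key names here are assumptions for illustration only – the real ones live in lib/illuminati/config.rb and assets/example.config.yaml):

# illuminati.config.yaml – key names are hypothetical
:flowcell_path_base: /n/ngs/runs
:casava_path: /usr/local/casava-1.8.1/bin
:script_path: /n/ngs/tools/pilluminati/scripts

$ export ILLUMINATI_CONFIG=/n/ngs/etc/illuminati.config.yaml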

Outsourced Orders

Frequently, we deal with flowcells processed by outsourcing centers, with the resulting data returned to us. Each outsourcing center has its own way of doing things, which results in its own tweaks and gotchas in the returned data.

Illuminati can handle outsourced data, but it does not attempt to deal with all the nuances of each particular facility. It expects that the outsourced flowcell directory has the Unaligned and Aligned sub-directories set up as if they came out of a flowcell run.

If this is satisfied, then the bin/post_run command can be used to process outsourced orders using the -o flag.
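
For example, a manual invocation might look like this (the flowcell ID and argument order are illustrative; run bin/post_run -h for the real options):

$ ./bin/post_run 639AXXPY -o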

Using Illuminati

To use Illuminati, you need a few things in place:

  • Flowcell data is in the LIMS system, and the data provided by lims_data.pl is correct.

Once things are in order, running Illuminati should be a three step process:

Step 1: Run startup_run.rb

Run the bin/startup_run.rb script passing in the flowcell id:

$ cd /solexa/bin/illuminati
$ ./bin/startup_run.rb 639AXXPY

This should generate an admin script in /solexa/runs/. The admin script is named <flowcell_id>.sh and is meant to be executed to start off the main Illuminati process. So in this example, the admin script would be named 639AXXPY.sh.

Step 2: Review Admin Script

Have a look at the admin script to make sure things look good:

$ cd /solexa/runs
$ less 639AXXPY.sh

First, you should see output from the LIMS system about the data for each of the flowcell’s lanes. This is the data that will be used to create the SampleSheet.csv and config.txt, so it’s important that it’s right.

Then comes the SampleSheet.csv output.

Next, you will see the output that will go into the config.txt file. Check this over and ensure the output doesn’t include errors about missing genomes or other missing information.

Finally, the admin script contains the shell commands that will kick off CASAVA using the config files it generates. You can see that there is a POST_RUN_COMMAND in the make command that will automate the next step: alignment. However, if everything goes as planned (like it always does), then you won’t really need to deal with anything but this file.
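
As a rough sketch, the tail end of an admin script looks something like this (paths, flags, and the job count are illustrative, not copied from a generated script):

# hypothetical excerpt from 639AXXPY.sh
cd /solexa/runs/639AXXPY
/solexa/bin/illuminati/bin/config_maker.rb 639AXXPY   # writes SampleSheet.csv and config.txt
configureBclToFastq.pl --input-dir Data/Intensities/BaseCalls \
  --output-dir Unaligned --sample-sheet SampleSheet.csv
cd Unaligned
nohup make -j 8 POST_RUN_COMMAND="/solexa/bin/illuminati/bin/align_runner.rb 639AXXPY" &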

Step 3: Run Admin Script

If the admin script looks good, and you are ready to roll, then go!

$ ./639AXXPY.sh

This will actually create the config.txt and SampleSheet.csv files with the output described in the admin script, then start up CASAVA’s BCL Converter.

What Illuminati Does Next

Ok, so the config files CASAVA needs are automatically generated and the BCL converter starts up, but what else does Illuminati do?

Demultiplexing

Well, the BCL converter in CASAVA 1.8 now performs the demultiplexing step automatically as well. So, if your lanes are multiplexed, you get demultiplexing for free!

Alignment

After the BCL converter finishes, the POST_RUN_COMMAND we saw in the make command kicks off the next step in the process: alignment. Technically, we didn’t need to create the config.txt file until this step, but getting it done up front makes running the alignment process pretty simple.

The alignment portion of the CASAVA pipeline is started by the bin/align_runner.rb script. It’s a little wordy in there, but ultimately we are just calling three commands. First, CASAVA’s configureAlignment.pl script is run without the --make flag, so it just does a test run. The output from this test run is analyzed by the align_runner. If there is a problem (specifically, if “ERROR” is found in the output), then Illuminati shuts down and emails you to get things going again. Usually, this means there is a bug in the config.txt file. If there isn’t an error, then we run configureAlignment.pl for reals, creating the Aligned directory. Then we run make inside that directory to start CASAVA’s ELAND aligner.
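
In Ruby, that control flow reduces to something like this minimal sketch (notify_admin and the exact configureAlignment.pl arguments are hypothetical; the three commands and the “ERROR” check mirror what the script actually does):

# sketch of align_runner's logic; helper names and flags are illustrative
output = `configureAlignment.pl config.txt 2>&1`     # test run: no --make flag
if output.include?("ERROR")
  notify_admin("alignment config failed", output)    # email, then stop here
else
  system("configureAlignment.pl config.txt --make")  # for reals: creates Aligned/
  system("cd Aligned && make -j 8")                  # start CASAVA's ELAND aligner
end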

Post Run

The alignment’s make command contains another POST_RUN_COMMAND which starts up the final part of the primary analysis pipeline: the post run.

The post run functionality is contained in bin/post_run. It’s an elaborate maze of fun, but hopefully the method names at least hint at what is going on. Briefly, this script does the following:

  • use cat to combine the multiple fastq.gz files that CASAVA makes into one for each lane / sample.
  • use zcat and the bin/fastq_filter.rb script to filter out reads that don’t pass filter (see the sketch after this list).
  • split lanes with custom barcodes using fastx_barcode_splitter.pl.
  • distribute fastq.gz files to their project directories.
  • run fastqc on the filtered fastq.gz files and distribute the results.
  • create Sample_Report.csv to summarize the sample data.
  • distribute the relevant stats files to the project directories.
  • combine export files using cat and distribute these files.
  • distribute relevant qa / qc and stats files to the appropriate qcdata directory.
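
The pass-filter step leans on the CASAVA 1.8 fastq header, where the flag after the read number is Y for reads that failed filtering. A minimal sketch of such a filter (the field positions come from the CASAVA 1.8 header format; everything else is illustrative, not the actual fastq_filter.rb):

# keep only reads whose CASAVA 1.8 filter flag is "N" (i.e. passed filter)
# header format: @instr:run:flowcell:lane:tile:x:y read:is_filtered:control:index
while (header = STDIN.gets)
  seq, plus, qual = STDIN.gets, STDIN.gets, STDIN.gets
  flag = header.split(" ")[1].split(":")[1]   # "Y" means the read failed filter
  print(header, seq, plus, qual) if flag == "N"
end

Wired into a pipeline, that is roughly zcat in.fastq.gz | ruby filter.rb | gzip > out.fastq.gz, which is the zcat usage mentioned above.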

The post_run tool has a lot of options if you want to run it manually, for outsourced orders, or to repeat a portion of the run.

Use bin/post_run -h to get a list of options to use.

Email & Logging

You also get emails along the way. Check out the EMAIL_LIST in lib/illuminati/constants.rb to see who gets emailed.

A log file is also created for each flowcell in /solexa/runs/log. This will be used for future dashboard-style awesome-ness.

align_runner.rb and post_run also generate log files during their execution. If post_run completes, it will produce a shell script with all the commands it ran, so that, theoretically, you could run it again if you wanted.

CASAVA’s make output for the bcl conversion and align steps are captured in their own make.out files.

External Data

If you want to avoid using the lims_data.pl script or other external LIMS systems, you can do so by creating an external_data.yml file and placing it in the flowcell’s root directory.

Essentially, the external_data.yml contains all the required information about the samples on a run. As described in ExternalDataBase.rb, these required fields are:

:lane           String name of the lane ("1" - "8").
:name           Sample name.
:genome         Code for the genome used for the lane. Should correlate to a folder name in the genomes dir.
:protocol       Should be either "eland_extended" or "eland_pair".
:barcode_type   Should be :illumina, :custom, or :none.
:barcode        If :barcode_type is not :none, this provides the six-base barcode sequence.

These are all under the :samples: section of the YAML file. It also contains information about how the files are distributed, in the :distributions: section:

:lane:    Lane number
:path:    Path to distribute the data for this lane

These sections will probably be combined in the future to allow for distributions on a per-sample level.

Example external_data.yml file:

--- 
:samples: 
- :lane: "1"
  :name: 9.5dpc_HoxB1_input
  :genome: mm9
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
- :lane: "2"
  :name: HoxB1_3FMS_myc_M2Flag_IP
  :genome: mm9
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
- :lane: "3"
  :name: HoxB1_3FMS_myc_myc_IP
  :genome: mm9
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
- :lane: "4"
  :name: HoxB1_3FMS_myc_input
  :genome: mm9
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
- :lane: "5"
  :name: HoxB1_HF_M2Flag_IP
  :genome: mm9
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
- :lane: "6"
  :name: HoxB1_HF_M2Flag_input
  :genome: mm9
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
- :lane: "7"
  :name: Pbx_input
  :genome: mm9
  :protocol: eland_extended
  :bases: Y*
  :barcode_type: :none
  :barcode: ""
- :lane: "8"
  :genome: phiX
  :name: Phi X
  :protocol: eland_extended
  :barcode_type: :none
  :barcode: ""
:distributions: 
- :lane: 1
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
- :lane: 2
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
- :lane: 3
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
- :lane: 4
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
- :lane: 5
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
- :lane: 6
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
- :lane: 7
  :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX

See the specs for ExternalDataYml for details.

CASAVA 1.8 Directory Structure

CASAVA 1.8 has a pretty different structure for where it puts data now. Plus, the post_run script creates some custom directories to do its job. Here’s an overview of the directories relevant to the primary analysis pipeline:

_ FLOWCELL_BASE_DIR
|
|___ Data
|    |
|    |___ Intensities
|
|___ Unaligned
|    |
|    |___ all
|    |
|    |___ filter
|
|___ Aligned
     |
     |___ all

The big change is that you don’t go digging in Data/Intensities/GERALD_* directories anymore for your sequence / alignment files.

Instead, there are two separate directories right off the flowcell’s base directory called Unaligned and Aligned that are now used to store that data (technically they are user configurable, but these are the defaults, and that’s what we’ll use).

Unaligned contains your fastq.gz files. Illuminati adds two additional sub-directories – all and filter. all contains the concatenated, unfiltered fastq.gz files with the correct naming structure. filter contains the same files, but with the reads that didn’t pass filter removed.

Aligned, as you might guess, contains the export.txt.gz files. Illuminati again adds the all directory to store files using the naming convention we want to use.
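
To make the aggregation step concrete: CASAVA 1.8 writes each sample’s reads as numbered chunks (files like <Sample>_<Index>_L001_R1_001.fastq.gz), and since concatenated gzip streams are themselves valid gzip, the post run can simply cat them together. A hedged example – the project, sample, and output names here are made up:

cat Unaligned/Project_A/Sample_X/Sample_X_ATCACG_L001_R1_*.fastq.gz \
    > Unaligned/all/s_1_1_sequence.fastq.gz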

Testing Illuminati

Some effort has been put into writing tests to ensure components of Illuminati are functional and don’t break when new features are added. These tests are currently incomplete, but they are a decent start. If you are on the particular machine this software was meant to run on, you can simply run autotest to check all the tests:

$ cd /solexa/bin/illuminati
$ autotest

Everything should pass, or something is broken. If you are on another machine, some of the tests can be run using rspec; some will fail due to their file structure and database access requirements. Future work includes making the test suite configurable enough to run on other machines.
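
For example, from a checkout on another machine (expect the failures described above):

$ cd illuminati
$ rspec spec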

Contributing

If you’d like to hack on Illuminati, start by forking the repo on GitHub:

https://github.com/vlandham/illuminati

To get all of the dependencies, run bundle install first. The best way to get
your changes merged back into core is as follows:

  • Clone down your fork
  • Create a thoughtfully named topic branch to contain your change
  • Hack away
  • Add tests and make sure at least the new features pass by running rspec
  • If you are adding new functionality, document it in the README
  • Do not change the version number, I will do that on my end
  • If necessary, rebase your commits into logical chunks, without errors
  • Push the branch up to GitHub
  • Send a pull request to the vlandham/illuminati project.
