Skip to content

NCI-CGR/TypeSeq2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TypeSeq2

NCI CGR laboratory HPV typing analysis workflows and R package

TypeSeq2 is an R package that includes

  • several helper functions for working with TypeSeq data
  • contains a "make" based pipeline for processing Ion or Illumina runs
  • contains a docker build file that includes all the dependencies inside a single container

We recommend running the pipeline inside the docker container cgrlab/typeseq2:v2.2.0 as it contains all the required dependencies in the correct locations.

The workflow manager we use is drake https://github.com/ropensci/drake

This package supports the Ion Torrent platforms.

The only requirement for this workflow is either docker or singularity

Ion Torrent Plugin

We also include a wrapper for the Ion Torrent server that can be uploaded via the provided zip file. The prerequisite for running the Ion Torrent Plugin successfully is to install singularity on the server ahead of time.

Download and add hpv-typing plugin zip file via torrent server gui

The latest plugin package can be found at here.

Tools utilized in the Ion Torrent Workflows

Insights about this plugin

Many issues have been posted and addressed in the related issue page. Some of the common issues are to be highlighted as below.

1. Demultiplexing and the related issues

In TypeSeq2 assay, samples are dual barcoded to increase productivity. The standard Ion Torrent pipeline processes 5`-barcodes but not 3`-barcodes. The demultiplexing step is a key component of this TypeSeq2 plugin: it processes the raw bam files and demultiplex them by the 3`-barcode. Please note that those "raw" bam files are generated by the standard IonTorrent pipeline, having already been demultiplexed by the 5`-barcode.

In particular, the input raw bam files are IonXpress_0XX_rawlib.bam, where XX is the ID of 5`-barcode, range from 49 to 96. The output bam after this step is something like: AXXPYY_sorted.bam, where YY is the ID of 3`-barcode, range from 01 to 48.

Within this step, adam/spark is employed to make the demultiplexing in the programming language Scala (see inst/methylation/demux_3prime_barcode_adam.scala). Briefly, it processes each alignment in the bam files with the following steps:

  • Parse information in the files barcodes.csv and .manifest.csv.
  • Filtering alignment
    • full length and mapq >4
  • Obtain 3`-barcode ID if there is perfect match and replace read group name
    • For example, RG:Z:MRFF6.IonXpress_080 => RG:Z:A80P06.MRFF6.IonXpress_080
  • Finally, save all the modified alignment records into one file demux_reads.bam.

Then, samtools are used to split demux_reads.bam by new read group names, and further sort each of them as AXXPYY_sorted.bam.

The whole demultiplexing process is complicated and high cost in terms of CPU and RAM. As the code were developed several years before, some libraries are not supported any more and we have to rely on the containerization to keep it functional.

There is one issue in this step, which is hard to fix. In the scalar code, the sample names are expected to be matched with the ID of their 5`-barcode, like, 49,50,...,96.

  • Problematic scalar code
withColumn("recordGroupSample", concat(lit("A"), $"recordGroupSample")).
  filter($"recordGroupSample" === $"BC1").
  • The read group example in a properly named bam file. :bookmark: Note that SM:80 is recognized by the scalar code.
@RG     ID:MRFF6.IonXpress_080  CN:OptimusFry/S5XL-0040 DS:chef 
DT:2022-06-27T10:35:58-0400     
FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATC
GATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGC
ATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACG
TCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCT
ACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATG
TACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGA KS:TCAGTCAGACCAGGTGAT
PG:tmap PL:IONTORRENT   PU:s5/540/Q2FV82/18/DAIC01728/IonXpress_080     SM:80

If the samples are not named in this way, it will caused issues as reported before:

It is ideal to pull 5`-barcode ID from ID:MRFF6.IonXpress_080 rather than from the hard-coded SM:80. However, the adam library used in this plugin is outdated, so that it is difficult to find the related documents/references about how to make such changes properly.

Noticeably, dual barcode seems supported by TorrentSuite now (see https://assets.thermofisher.com/TFS-Assets/LSG/manuals/MAN0017972_031419_TorrentSuite_5_12_UG_.pdf), which is likely to be a better solution. The feature has been explored by Sarah at certain extension (see https://github.com/NCI-CGR/TypeSeqHPV_issues/issues/84). A successful adoption of this solution will reduce the computation cost of this plugin dramatically.

2. TypeSeq2 Plugin package

The plugin package is a zip file, with the a typical folder structure as below:

└── TypeSeq2-dev
    ├── instance.html
    ├── launch.sh
    ├── plan.html
    ├── pluginMedia
    │   └── configs
    │       ├── TS2_config.csv
    │       ├── TS2_plugin-barcodes_v1.csv
    │       ├── TypeSeq2_BED_v1.bed
    │       ├── TypeSeq2_Batch-controls_v1.csv
    │       ├── TypeSeq2_Grouping-defs_v1.csv
    │       ├── TypeSeq2_Hotspot_v1.vcf
    │       ├── TypeSeq2_Internal-Controls_v1.csv
    │       ├── TypeSeq2_Ion_Ref.fasta
    │       ├── TypeSeq2_Lineage-defs_v1.csv
    │       ├── TypeSeq2_Overall-qc-defs_v1.csv
    │       ├── TypeSeq2_PN-criteria_v1.csv
    │       ├── TypeSeq2_Parameters_v1.json
    │       └── TypeSeq2_Scaling_v1.csv
    └── pluginsettings.json
  • The plugin name is actually defined by the folder name within the zip file, that is, TypeSeq2-dev in this example.
  • The script launch.sh is the core component of the package (see an example below), playing several important functions:
    • Define the version of the package.
    • Prepare some settings for singularity.
    • Prepare settings for the docker run: docker://cgrlab/typeseq2:dev_v2.1.7
      • The whole TypeSeq2 pipeline is an R script, which is containerized within docker://cgrlab/typeseq2:dev_v2.1.7.
      • --bind /results/plugins/TypeSeq2-dev/pluginMedia:/user_files
        • /results/plugins/TypeSeq2-dev/pluginMedia is the default pluginMedia folder after the plugin installation on the IonTorrent server. It should be mounted as /user_files to the docker container to run the plugin properly.
        • 📝 Note: Please remember to update the folder name /results/plugins/TypeSeq2-dev/pluginMedia accordingly if the plugin name is changed from TypeSeq2-dev.
  • The file instance.html defines the HTML interface to run the plugin.

Some details about IonTorrent plugin and its development can is available here.

  • An example of launch.sh
#!/bin/bash

set -x
# TypeSeq2 HPV
VERSION="2.1.7"
#autorundisable
echo Pipeline version $VERSION

ln ../../*.bam ./


export SINGULARITY_TMPDIR=/tmp/singularity_${USER}_tmp
mkdir -p $SINGULARITY_TMPDIR
export SINGULARITY_CACHEDIR=/tmp/singularity_${USER}_cache
mkdir -p $SINGULARITY_CACHEDIR


### transfer some metrics files
mkdir raw_metrics

cp ../../raw_peak_signal raw_metrics
cp ../../basecaller_results/TFStats.json raw_metrics
cp ../../sigproc_results/analysis.bfmask.stats raw_metrics
cp ../../ionstats_alignment.json raw_metrics
cp ../../basecaller_results/datasets_basecaller.json raw_metrics
cp ../../basecaller_results/BaseCaller.json raw_metrics
cp ../../expMeta.dat raw_metrics 

mkdir tmp

### Note to update the pluginMedia folder name if the package name is changed
# /results/plugins/TypeSeq2-dev/pluginMedia

singularity exec  --pwd /mnt --bind $(pwd):/mnt --bind $(pwd)/tmp:/tmp --bind /results/plugins/TypeSeq2-dev/pluginMedia:/user_files docker://cgrlab/typeseq2:dev_v2.1.7 \
        Rscript /TypeSeq2/workflows/TypeSeq2.R \
        --is_torrent_server yes \
        --is_clinical yes \
        --cores 22 \
        --ram 24G \
        --tvc_cores 4

rm *rawlib.bam

💡 Tip: The TypeSeq2 plugin package is maintained at https://github.com/NCI-CGR/IonTorrent_plugins/tree/main/TypeSeq2 . User may need test new config settings by themselves. It is better to build a new plugin package based on the existing package, with the following steps:

  1. Download the plugin zip file from https://github.com/NCI-CGR/IonTorrent_plugins.
  2. Unzip the downloaded zip file.
  3. Change the configure files under pluginMedia.
    • Make sure to revise TS2_config.csv properly.
  4. Change the package name to a unique name (so that it will not messed up with other users).
  • Rename the root folder name to the new name in the unzipped files.
  • Revise launch.sh
    • Change the version, for example, VERSION="0.0.1".
      • You many need to increase version number to overwrite previous installation with the same plugin name.
      • Or you choose to uninstall the previous installation first and reinstall with the save version number.
    • Update the binding folder accordingly
      • --bind /results/plugins/[new_plugin_name]/pluginMedia:/user_files
  1. Zip the package and the modified plugin is ready for use.

Dev and Prd version

TypeSeq2-dev is a development version, which differs from the product version TypeSeq2:

  • Package name: TypeSeq2-dev VS TypeSeq2.
  • Version: Version number of TypeSeq2-dev typically is no less than that of TypeSeq2.
  • Docker image: cgrlab/typeseq2:dev_V2.1.7 VS cgrlab/typeseq2:V2.1.4
    • Most important of all, TypeSeq2-dev uses a different dockerfile from TypeSeq2.
      • The former incorporates the latest change in the main branch of https://github.com/NCI-CGR/TypeSeq2. Therefore, the docker image cgrlab/typeseq2:dev_V2.1.7 is not stable. The dynamic docker image has one advantage in the development: developers just need rebuild the docker image (with the same tag) so as to change the R functions defined in the installed TypeSeq2-dev (without reinstalling or upgrading the package). Contrastingly, the production version TypeSeq2 has a stable docker image, not affected by the github commits.
      • The dockerfile for the production version is linked to the stable release of https://github.com/NCI-CGR/TypeSeq2, so that it is stable. Once a new stable version is released, the docker images for the older development versions can be deleted from the docker hub.

3. Common plugin errors.

3.a. Exit code 143.

The demultiplexing step uses spark and requires lot of memory. Each plugin job is submited as a SGE job to the queue plugin.q in the IonTorrent server. By default, the plugin.q queue has 48G memory assigned. The job will be killed if the memory usage exceeds 48G. There are 3 likely solutions available to deal with this memory overflow:

  • Rerun the plugin job as the RAM requirement is usually dynamic for the same run.
  • Change the default maximum memory setting of plugin.q to 80G.
  • Consider to use dual barcodes (re Modify the 3' barcode demux to save time - TS 3' BC read tagging #84). Sarah and I had tested some before. If it works, the demultiplexing step will not be needed any more and the run time will be reduced dramatically.