Skip to content

TranscriptClean Tutorial

Dana Wyman edited this page Sep 24, 2019 · 6 revisions

User note: If using the chromosome 1 reference genome fasta file provided in the example_files directory, please run the following command to unzip it first:

gunzip chr1.fa.gz
Due to GitHub’s limit on the size of individual files, we were unable to provide an unzipped copy from the start.

Basic correction

At its most basic, TranscriptClean can be run using an input SAM file of transcripts and the reference genome fasta file that was used to map them. Using the example files provided with TranscriptClean for GM12878 chromosome 1, the command is

python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --outprefix /my/path/outprefix
By default, this command will correct indels and mismatches using the reference sequence. Deletions ≤ 5 basepairs in size will be corrected by filling in the reference bases that are missing, and insertions ≤ 5 basepairs will be corrected by removing the extra bases. Mismatches are changed to the reference base. The 5 basepair threshold for correction was chosen based on the distribution of indel size in PacBio human cell line data. If you prefer to change this, you can specify your own threshold using the --maxLenIndel option.
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --maxLenIndel 10 \
                          --outprefix /my/path/outprefix

Indel and mismatch correction can be disabled using the --correctIndels and --correctMismatches options, respectively.

No indel correction:

python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --correctIndels false --outprefix /my/path/outprefix

No mismatch correction:

python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --correctMismatches false \
                          --outprefix /my/path/outprefix

Variant-aware correction

Correcting all mismatches and indels in the transcripts is not an ideal approach, because some of them are likely to be variants. In particular, many mismatches in the sequence may be single nucleotide polymorphisms (SNPs), or more rarely, RNA editing events. To address this, you can run TranscriptClean in variant-aware mode by providing a VCF file. In this mode, mismatches and indels that exactly match a variant in the VCF file will retain their original sequence. The match must be exact- if a mismatch is in the same place as a known SNP but does not match the known alleles, it will be treated as a sequencing error. One downside of this is that we are not yet able to address special cases where there is an indel variant in the sequence disrupted by a sequencing error. This is something we hope to improve in the future.

python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --variants example_files/GM12878_chr1.vcf.gz \
                          --outprefix /my/path/outprefix

Noncanonical splice junction correction

To enable correction of noncanonical splice junctions, you must supply an annotation file of high-confidence splice sites. At this time, the required format matches the splice junction file that the STAR read aligner (https://github.com/alexdobin/STAR) outputs when mapping short RNA-seq reads to the genome. For each splice junction, the file lists its genomic location, strand, intron motif, whether or not the junction is in the provided GTF transcriptome annotation (i.e. GENCODE v24), and the amount of read support it has. If short reads in fastq format are available for the sample that you are working with, the most straightforward way to get a splice junction file is to run STAR. We use the following command with parameters recommended by the ENCODE consortium:

STAR \
--runThreadN 4 \
--genomeDir STAR_hg38_ENCODE \
 --readFilesIn illumina_1.fastq illumina_2.fastq \
--sjdbGTFfile gencode.v24.annotation.gtf \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outSAMattributes NH HI NM MD jM jI \
--outSAMtype SAM

Once you have a splice junction annotation in the appropriate format, run TranscriptClean with the following command to perform splice junction correction.

python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --spliceJns example_files/GM12878_SJs_chr1.tab \
                          --outprefix /my/path/outprefix

For each noncanonical splice site, TranscriptClean will find the nearest annotated splice site in the file you provided. By default, if the nearest splice site is ≤ 5 bp away, the original splice site will be modified to match the annotated site. If you prefer to use a different distance threshold, you may change it using the --maxSJOffset option.

python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
                          --genome chr1.fa \
                          --spliceJns example_files/GM12878_SJs_chr1.tab \
                          --maxSJOffset 10 
                          --outprefix /my/path/outprefix