
decomposing and subsetting vcfs

Brent Pedersen edited this page Jul 9, 2019 · 3 revisions

slivar expects VCFs to be decomposed so that multi-allelic variants are split into separate variants. This should result in a single variant (and a single alternate allele) per line.

In order to do this correctly, the VCF may need to be adjusted so that the AD field which indicates the (A)llelic (D)epths will be decomposed properly. In older versions of GATK, the AD header will appear as:

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

Number=. declares an unknown number of values, which leaves tools no way to know how to split the AD values when decomposing the variant. Instead, the header should be changed to:

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

Number=R (one value per allele, including the reference) was only added to the VCF spec in version 4.2, but both bcftools and vt will correctly decompose the AD field once the header is adjusted to the line above.
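As a quick sanity check, the header substitution can be tried on the single header line before running it on a whole file. This is only a sketch: `fix_ad_header` is a hypothetical helper name, and the input is the older GATK header line shown above.

```shell
# Hypothetical helper: rewrite the AD header from Number=. to Number=R.
# Reads VCF text on stdin and writes the adjusted text to stdout.
fix_ad_header() {
  sed -e 's/ID=AD,Number=\./ID=AD,Number=R/'
}

# The AD header as emitted by older versions of GATK.
old_header='##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">'

new_header=$(printf '%s\n' "$old_header" | fix_ad_header)
printf '%s\n' "$new_header"
```

Only the AD line is matched by the pattern, so other FORMAT or INFO headers that use Number=. pass through unchanged.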

The decomposition can be accomplished in a stream of commands with:

zcat $gatk_vcf \
  | sed -e 's/ID=AD,Number=\./ID=AD,Number=R/' \
  | bcftools norm -m - -w 10000 -f $fasta -O b -o $clean_bcf
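One way to confirm that the decomposition worked is to count records that still list more than one alternate allele. The sketch below assumes uncompressed VCF text on stdin; `count_multiallelic` is a hypothetical helper name, and the input is a toy example rather than real data.

```shell
# Hypothetical helper: count records whose ALT column (field 5) still
# lists more than one allele. Reads uncompressed VCF text on stdin.
count_multiallelic() {
  awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }'
}

# Toy input, truncated to the first five columns for brevity.
# The first record is multi-allelic (C,G); the second is not.
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tC,G\n1\t200\t.\tT\tG\n' \
  | count_multiallelic
```

On real output, something like `bcftools view "$clean_bcf" | count_multiallelic` would be expected to print 0 after a successful decomposition.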

It is also important to change the header before subsetting by sample (e.g. with bcftools view -s).
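A minimal sketch of that ordering, assuming the same tools as the pipeline above: the header is fixed first, then samples are selected, then the variants are decomposed. The function name `decompose_subset` and its argument order are hypothetical, chosen only for illustration.

```shell
# Hypothetical wrapper; the name and argument order are illustrative.
# usage: decompose_subset <gatk.vcf.gz> <sample1,sample2,...> <ref.fa> <out.bcf>
decompose_subset() {
  # $1: input VCF, $2: comma-separated sample names,
  # $3: reference fasta, $4: output BCF
  zcat "$1" \
    | sed -e 's/ID=AD,Number=\./ID=AD,Number=R/' \
    | bcftools view -s "$2" -O u \
    | bcftools norm -m - -w 10000 -f "$3" -O b -o "$4"
}
```

It could then be called as, for example, `decompose_subset "$gatk_vcf" "kid,mom,dad" "$fasta" "$clean_bcf"`, where the sample names are placeholders.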

The default JavaScript functions in the slivar repo rely heavily on correct values in the AD field to calculate allele balance, so this is a critical step.
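To illustrate why AD matters, allele balance is commonly computed as the alternate depth divided by the total of reference and alternate depths. The sketch below assumes a decomposed AD value of the form "ref,alt"; `allele_balance` is a hypothetical helper, not slivar's own code, and slivar's functions may differ in detail.

```shell
# Hypothetical helper: allele balance from a decomposed AD value "ref,alt".
# Prints alt / (ref + alt), or "nan" when the total depth is zero.
allele_balance() {
  printf '%s\n' "$1" | awk -F',' '{
    total = $1 + $2
    if (total == 0) { print "nan" } else { printf "%.2f\n", $2 / total }
  }'
}

allele_balance "12,8"   # prints 0.40
```

If AD is not decomposed along with the variant, the second value may no longer belong to the alternate allele on that line, and a calculation like this one would give a misleading result.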
