Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNV VCFs could be improved #6167

Open
tfenne opened this issue Sep 17, 2019 · 9 comments
Open

CNV VCFs could be improved #6167

tfenne opened this issue Sep 17, 2019 · 9 comments

Comments

@tfenne
Copy link
Contributor

tfenne commented Sep 17, 2019

Feature request

Tool(s) or class(es) involved

GermlineCNVCaller / PostprocessGermlineCNVCalls

Description

The VCF produced by the germline CNV calling workflow could be nicer. TBH the VCF output feels like a bit of an afterthought compared to the other outputs. This seems common for CNV callers, but I was hoping the VCF produced by the GATK would be more complete.

Things that would make the VCF easier to use/interpret with downstream tooling:

  1. ##contig lines in the header. PostprocessGermlineCNVCalls takes in a sequence dictionary, and I was surprised that isn't used in generating the VCF.
  2. Every record in the VCF has both <DUP> and <DEL> as alts even though the VCF is single-sample and a given sample can only be duplicated or deleted. This makes quick text-searching of the VCF difficult and means one has to parse the genotypes to determine if the record represents a duplication or deletion in the sample.
  3. QUAL is . for all events. There are various quality scores in the FORMAT/GENOTYPE fields. It would be nice if either the preferred one of those or some other quality measure could be emitted into the QUAL field.
  4. There's nothing in the VCF that gives any indication of the observed read depth or correlated measures, only the called CN. Having either the mean depth over the segment or the raw & denoised copy ratios for the segment etc. would be helpful for manual review.
@samuelklee
Copy link
Contributor

samuelklee commented Sep 18, 2019

Thanks for the suggestions @tfenne. I think we have considered 2 in the past, but just haven’t gotten around to implementing it yet. As for 4, output of the denoised copy ratios was recently added, but users may have to do a bit of work to grab those that overlap a particular interval.

As you allude to, conventions for CNV VCFs are not as settled—more than happy to take suggestions like these from the community.

@ldgauthier
Copy link
Contributor

For manual review, @asmirnov239 was working on normalized coverage plots, right?

@ldgauthier
Copy link
Contributor

@tfenne I've been putting in some work cleaning up the gCNV VCFs and I have fixes for your requests 1-3. For 4, do you know what you'd like in the VCF? As you know, most of the events are going to have multiple targets, and each target has different coverage, so are you thinking mean? Median? I was also contemplating putting more info in the intervals VCF since things like target coverage make more sense at the interval level.

@fgvieira
Copy link

Dear all,

I am a bit confused why GATK uses [0,1,2] for the GT files, even though VCF specifications clearly state that the GT field is encoded as allele values separated by either of / or |. They even say that for diploid calls examples could be 0/1, 1 | 0, or 1/2, etc.

As it is right now, if I read a VCF from GATK CNV germline pipeline through bcftools, the GT field is changed to -65:

1       17345376        CNV_1_17345376_161326630        N       <DUP>   101.19  .    END=161326630   GT:CN:NP:QA:QS:QSE:QSS  -65:3:13:11:101:3:18

1       161332119       CNV_1_161332119_161332223       N       <DEL>   3.19    .    END=161332223   GT:CN:NP:QA:QS:QSE:QSS  -65:1:1:3:3:3:3

1       193091331       CNV_1_193091331_241683022       N       <DUP>   268.21  .    END=241683022   GT:CN:NP:QA:QS:QSE:QSS  -65:3:27:34:268:36:3

2       96919546        CNV_2_96919546_96931119         N       .       62.93   .    END=96931119    GT:CN:NP:QA:QS:QSE:QSS  -65:2:3:38:63:38:63

3       10183532        CNV_3_10183532_69928534         N       .       469.93  .    END=69928534    GT:CN:NP:QA:QS:QSE:QSS  -65:2:22:31:470:19:75

3       69986973        CNV_3_69986973_70014399         N       <DUP>   10.12   .    END=70014399    GT:CN:NP:QA:QS:QSE:QSS  -65:3:8:4:10:4:10

Any reason to not use the standaed GT format?

I have also noticed that GATK outputs some non-variable SVs to the VCF without any ALT allele. Why not remove them if they are actually not SVs, if GT=0 and CN=2?

thanks,

@ldgauthier
Copy link
Contributor

@fgvieira would diploid no-call genotypes work in your pipeline? The example in the VCF spec (http://samtools.github.io/hts-specs/VCFv4.2.pdf page 11) has no-call diploid genotypes, GQ=0 with copy number (CN) and copy number quality (CNQ) specified. These aren't allelic calls so we can't say how many copies are on each haplotype. I think the haploid genotype calls were supposed to be for convenience. @cwhelan what do we do (or should we do) for depth-only CNV calls in the WGS SV pipeline?

We output CN2 calls so that they can be interpreted in the context of a larger cohort to calculate site frequency or to assess carrier status for family studies. Right now the "joint calling" for exome copy number is under active development, but we leave the reference calls for ad hoc joint analysis.

@cwhelan
Copy link
Member

cwhelan commented Apr 30, 2020

In the WGS SV pipeline, for deletions and duplications that the pipeline believes to be biallelic we do the following:

  • ALT: <DEL> or <DUP>
  • SVTYPE: DEL or DUP
  • GT: 0/0 or 0/1 or 1/1

We currently report depth based copy number and quality for these variants in custom format fields RD_CN and RD_GQ if they are available; we could possibly move those values to the standard CN and CNQ, but there is some complexity in how to handle events detected by paired end and split reads without good read depth support; ie those under 1kb or so depending on our depth binning size and the coverage. Our depth genotyping module makes estimates of copy number for these sites but sometimes these can be very inaccurate so at the moment we prefer not to report total copy number in those fields. Probably what we should do is fill in CN with 0, 1, or 2 based on the genotype we emitted and set CNQ to the value we computed for GQ.

For multiallelic CNVs (i.e. sites where our model is not sure that the variant is bi-allelic) we write:

  • ALT: <CNV>
  • SVTYPE: CNV
  • GT and GQ: .
  • CN and CNQ: estimate of total (diploid/unphased) copy number and quality of the depth evidence.

I think there are some tradeoffs in completely characterizing the evidence for and quality of each call and enabling easy searching across the whole VCF without having to parse and understand the entire record.

Older versions of our pipeline used to put the diploid copy number of the event into the GT field, I think similarly to what's being described above. This is incorrect VCF -- GT values should be indices into the allele list for the variant, and should be a list of length equal to the ploidy.

My view is that if you can confidently infer the alleles present at the site in the sample set you should use a GT value of the form 0/1, and if you don't know or aren't interested in trying to infer them you should use CN for total copy number and CNQ for the quality. CNF is also available in the spec for fractional estimates of copy number; for example Genome STRiP uses it to report the raw depth signal, which is useful for QC and identifying mosaic events.

@fgvieira
Copy link

fgvieira commented May 4, 2020

I like @cwhelan suiggestions, where we'd have:

  • ALT: <DEL> or <DUP> (<CNV> if not known)
  • SVTYPE: DEL or DUP (CNV if not known)
  • GT: 0/0 or 0/1 or 1/1 (. or ./. if not known)

@samuelklee
Copy link
Contributor

We should apply the changes made to the segments VCF in #6352 to the intervals VCF to keep them consistent. See https://gatk.broadinstitute.org/hc/en-us/community/posts/360071515912-GATK4-1-6-0-gCNV-inconsistent-CNV-calls-in-intervals-and-segments-vcfs

@samuelklee
Copy link
Contributor

samuelklee commented Oct 29, 2020

Stumbled back on this when looking at #6924. @ldgauthier @mwalker174 was the above comment addressed? Might be good to verify the gCNV tutorial is still consistent or update it at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants