Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Middle ground between --dont-trim-active-regions true and --dont-trim-active-regions false #5791

Open
tfenne opened this issue Mar 12, 2019 · 3 comments
Assignees

Comments

@tfenne
Copy link
Contributor

tfenne commented Mar 12, 2019

Feature request

Tool(s) or class(es) involved

HaplotypeCaller (latest version from master as of 3/12/2019)

Description

I'm running into situations where the HaplotypeCaller does the wrong thing when genotyping, when it is allowed to trim active regions. A good example is a fairly high frequency variant at HG19:chr11 -6411935 aka HG38:chr11:6390705. I'm looking at this data in the 24 samples that Broad did deep PCR-free 2x250 WGS sequencing on for the 1000G project. In these samples there are a pair of variants at that locus, one is a SNP and the other is a 6bp deletion. It's right on the boundary of a simple sequence repeat.

What appears to happen is that when the active region is trimmed, sometimes the deletion allele is lost. From looking at the gVCFs created and also at the output with --debug it's clear the allele is discovered, but then when genotyping is done on the trimmed active region the allele disappears. Here's an example pair of calls where the only different in HC invocation was that the first one used --dont-trim-active-regions true:

chr11   6411935 rs3838786       TGCTGGC CGCTGGC,T,<NON_REF>     4029.06 .       DB;DP=118;ExcessHet=3.0103;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;RAW_MQandDP=424800,118;REF_BASES=ATGGGCCTGGTGCTGGCGCTG    GT:AD:DP:F1R2:F2R1:GQ:PL:SB     1/2:0,62,40,0:102:0,31,23,0:0,31,17,0:99:4046,1646,1982,2435,0,2437,4113,1933,2560,4431:0,0,54,48

and the second one didn't:

chr11   6411935 rs3838786       TGCTGGC T,CGCTGGC,<NON_REF>     2308.64 .       BaseQRankSum=-1.312;ClippingRankSum=0.877;DB;DP=119;ExcessHet=3.0103;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQRankSum=0.000;RAW_MQandDP=428400,119;REF_BASES=ATGGGCCTGGTGCTGGCGCTG;ReadPosRankSum=0.255      GT:AD:DP:F1R2:F2R1:GQ:PL:SB     0/2:7,0,65,0:72:1,0,34,0:6,0,31,0:99:2316,2364,2996,0,269,1897,2506,2977,1274,3798:1,6,34,31

Note how in the second case, there are two alts in the gVCF, but only one of them has depth!

The only way to recover these cases is to run with --dont-trim-active-regions, but that make the HC run approximately 5 times slower, which is obviously not ideal.

What I'd like to suggest is that the HC have some automated way to detect when this kind of error is likely to happen or has happened, and work around it. My suggestion(s) would be:

  1. I think this really only happens in repetitive regions. I wonder if it would be possible to have the HC automatically trim active regions when assembly at kmer size 10 works, and disable it when it has to escalate to a higher kmer size?

  2. Trim the active region, but retain the untrimmed active region also. Genotype using the trimmed region. If any allele receives count=0, re-genotype using the untrimmed regions.

My thought here is that I think not trimming the active regions really only makes a difference at a small fraction of sites, on the order of 1/1000, but to rescue those sites we have to pay a 5x performance penalty at every site. It would be great if trimming could be auto-disabled at only those sites that are problematic, so we could have our cake and eat it too.

@droazen
Copy link
Collaborator

droazen commented Mar 13, 2019

@davidbenjamin Your thoughts on this idea?

@davidbenjamin
Copy link
Contributor

I like Tim's suggestion #1. It would probably be best to generalize it to finding some principled relationship between the kmer size, the trimming, and perhaps the padding.

@davidbenjamin
Copy link
Contributor

@tfenne I have a PR: #6358 that tries to make more sense of all the windows and padding in the GATK. Could you try this jar: gs://broad-dsde-methods-davidben/gatk-builds/window.jar and see if the deletion gets its depth back?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants