Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable --only-output-calls-starting-in-intervals for SelectVariants #6339

Open
lbergelson opened this issue Dec 20, 2019 · 9 comments
Open

Comments

@lbergelson
Copy link
Member

There isn't a good way to subset a vcf into disjoint chunks using gatk right now. We should enable --only-output-calls-starting-in-intervals for SelectVariants.

Requested in #6335

@lbergelson
Copy link
Member Author

I think we can generally enable this by pushing the option up to VariantWalker / GATKTool and integrating it with the createVCFWriter method.

It can optionally return a writer wrapped in a decorator that only outputs sites within the given intervals.

We might want to rename the option in that case to something like "only-output-variants-starting-in-intervals" so it's clear that it only effects variant outputs. Or make it work with generated bamWriters too...

@lbergelson lbergelson self-assigned this Jan 6, 2020
@lbergelson lbergelson added this to the GATK-Triaged-Issues milestone Jan 6, 2020
@tfenne
Copy link
Contributor

tfenne commented Jan 9, 2020

Jumping in here to add that I'd very much like the opposite behavior in other tools. My understanding is that HaplotypeCaller and GenotypeGVCFs interpret intervals the other way - only emitting variants that start within the --intervals given. @yfarjoun pointed out to me that this is desirable when e.g. scatter/gathering WGS samples. But it would be nice for capture data to be able to cause the HC to emit all variants that overlap any of the intervals given.

For example if you are capturing a gene panel, it is reasonable to want to see all deletions that delete one or more bases of any exon, regardless of whether the start of the deletion is within the exon. Right now this requires padding the intervals by an estimated "max deletion length" to be safe, which then causes more variants that are totally outside the intervals to also be emitted.

Might I instead suggest that an argument be introduced like:

--variant-interval-matching [STARTS_WITHIN|CONTAINED|OVERLAPS]

that would then have different defaults based on the historical behavior of various tools?

@lbergelson
Copy link
Member Author

@tfenne The gatk3 behavior when using intervals was that it only loaded data that started within the interval. We change that in gatk4 so queries get all overlapping data. This means that for GenoytpeGVCFs at least you should be getting all upstream deletions that overlap. If that's NOT the case then it's a bug we should fix. The side effect was getting duplicate calls in adjacent shards which is why we added --only-output-calls-starting-in-interval.

The HaplotypeCaller behavior is probably different and I don't know for sure what it is.

HaplotypeCaller leading deletion behavior

It seems like there are 3 possible outcomes for a deletion that starts outside the interval in HaplotypeCaller. @davidbenjamin Do you know which we do? James said you were re-writing related code recently. A toggle to change between behavior 1 and 2 would be idea like you're suggesting @tfenne. I'm a bit afraid that we do 3 and emit nothing in these cases though.

@lbergelson
Copy link
Member Author

I did a bit of testing with the following variant and haplotype caller.

20   10187114    .    CAACCTCATTCTTTTGCAAATG C

As far as I can tell you only get a result if the interval contains the entirety of the deletion. We don't output any part of a leading deletion or of a deletion that continues off the edge of the interval. This seems wrong.

@davidbenjamin
Copy link
Contributor

@lbergelson I am not familiar with the details.

@tfenne
Copy link
Contributor

tfenne commented Jan 10, 2020

@lbergelson I agree that your result is undesirable. In fact, it's hard to imagine when you'd want it to work that way. I think folks need to be able to specify:

  1. "starts within" the interval so that scatter/gather can work and not generate redundant variants, since a variant can only start in one of a set of non-overlapping intervals
  2. "overlaps with" so that when calling a subset of the genome (e.g. for capture experiments) we can output all variants that involve our regions of interest
  3. "contained within" because I suspect there are times you want (e.g. in SelectVariants) to be able to find only variants wholly contained in the region you are interested in

@lbergelson
Copy link
Member Author

@tfenne I don't think the behavior i'm seeing at is by design, I think it's a result of a poor interaction between multiple systems in HaplotypeCaller that each make reasonable assumptions but end up removing the wrong things.

I'm happy to move from a boolean to an enum with 3 options although I suspect there won't be a ton of uses of the "CONTAINS" option.

In any case, adding this output filter option to haplotype caller won't fix the problem there, that will need additional work. It looks like a bug to me though so we should fix it...

@ldgauthier
Copy link
Contributor

One of our collaborators ran into this with ApplyVQSR as well. A deletion that happens to span a shard boundary gets output in each shard. So that's a +1 for engine-level argument, or else ApplyVQSR also needs a fix.

@droazen droazen removed this from the GATK-Triaged-Issues milestone Jun 22, 2020
@jamesemery
Copy link
Collaborator

jamesemery commented Aug 30, 2021

@lbergelson I just tested with one of our variants in the test suite for HaplotyeCaller (specifically 20 10098219 . TATATATA T,<NON_REF>) and its definately the case that HaplotypeCaller will still output that variant even if you place the -L interval boundary inside the variant (like say at 10098221) but it will fail to call that same deletion if you make the start of your interval anything but the first base of that deletion.

The HaplotypeCaller has its own filtering at Haplotype->Allele->Genotyper layer that restricts it to only calling alleles that START within the active region (excluding padding) which makes it impossible to output deletions starting before the provided intervals even if they overlap into the specified region. The HaplotypeCaller will however output deletions that go past the end of the provided -L intervals so long as they begin within the intervals in question. The HaplotypeCaller is hopelessly fickle and everything within a few thousand bases of the edge of the calling intervals is subject to start-position/calling failures and any given site may or may not disappear to assembly failures when close to the edge. Adjusting these settings at the VCF writer level could prevent the latter case from occurring (by filtering them with some strict setting to prevent deletions from overhanging the end of the -L region) but it cannot ever allow the former case to exist (that is to say it will never force a deletion from a few bases before the interval to be called by HaplotypeCaller since it will never send that allele to the genotyper under any circumstance).

I would say that your branch adds an entirely new and subtle way for users to get confused about what the HaplotypeCaller is actually doing under the hood and why they did or didn't see a particular variant so we should tread carefully.

@jamesemery jamesemery mentioned this issue Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants