[Feature Request] Mutect2/HaplotypeCaller "--skip-assembly" option #7064

AlijahArcher · 2021-01-30T16:33:22Z

Feature request

Tool(s) or class(es) involved

Mutect/HaplotypeCaller

Description

(Background)
I've spent a lot of time working with Mutect2 in the past year (I've built a whole workflow centered around this tool). But, while I recognize that the internal reassembly feature leads to "best-in-class" results in terms of calling variants, for my purposes it generally just creates headaches since it makes interpreting (certain) calls and verifying (certain) base-level behaviors/expectations very difficult (even when looking at the bamout and assembly logs). Moreover, while we know our alignment process isn't perfect, we think it's appropriate for our purposes, and we would gladly accept the loss of a few calls to be able to have more control over the expected behaviors. With that, I purpose a "--skip-assembly" flag that would cause the Mutect2/HaplotypeCaller engine to use the original alignment information to determine the haplotypes.

All that said, I imagine this could be a niche feature request, so I've spent some time digging through the source code trying to see if there could be a quick fix that could be made available to whatever group of developers would want this. It seemed like there could be another conditional branched added here (

gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingAssembler.java

Line 159 in 9ff3f8b

if (generateSeqGraph) {

) to build a resultSet based on a non-assembly based approach. However, I'm not certain how using the original alignment information would affect the statistics employed for genotyping the candidate haplotypes, so I'm starting to back off implementing a custom fix and hoping the experts can help (or at least explain to me why this feature is not currently possible OR if there is a way that I can access this behavior that I'm missing).

Thank you for the consideration.

(TL;DR)
Introduce a --skip-assembly option that would cause the Mutect2/HaplotypeCaller engine to use the original alignment information to determine the haplotypes.

The text was updated successfully, but these errors were encountered:

davidbenjamin · 2021-02-01T16:48:02Z

@AlijahArcher We will need to clarify exactly what you intend by skipping the assembly. Here are some possibilities, with my initial thoughts:

"skip assembly" = "trust the alignments completely and use them directly for variant calling": if your aligner is good this is reasonable though not ideal. If this is what you want you might as well use samtools for variant calling.
"skip assembly" = "every unique pattern of variants seen in your reads defines a haplotype": the problem is that every sequencing error generates a new haplotype, so you need some way to cull bad haplotypes. Also, reads might only cover part of a haplotype so you need a way to sew them together.
"skip assembly" = "find all variants in your read alignments and let every combination thereof define a haplotype": if I recall correctly this is FreeBayes. I would call this a quick-and-dirty assembly rather than skipping assembly entirely.
"skip assembly" = "avoid haplotypes altogether and genotype variants directly": as you are aware, this is not possible within HC and M2.

Hopefully that focuses the conversation somewhat on the two main questions: how do you generate haplotypes, and how do you refine the set of haplotypes to a few good candidates? What did you have in mind?

AlijahArcher · 2021-02-01T18:24:39Z

@davidbenjamin Thanks for taking the time to look into this and focusing the conversation further. You bring up good points, I think options 1 and 3 are the most appealing, so I'll lean into those:

Option 1 (i.e. trusting the alignments completely)

Yeah, I recognize that this puts us closer in theory to the realms of a positional-based variant caller; however, I'm quite invested in the pre-/post- processing implementations I've got built around Mutect2 (not to mention the benefit of simultaneous processing of the normal and tumor samples, pre-filtering, model-based filtering, etc). If Mutect would employ this naive approach to haplotype calling, I suppose it would end up looking like the "Platypus" caller, which again might be suited for our needs, but potentially makes option 3 more appealing.

Option 3 ( i.e. Quick-and-dirty ("FreeBayes-ian") assembly:

This is interesting and would seem to solve my problems (I believe?) by creating a Haplotype-based, somatic variant caller with the Mutect perks/processing steps/output formats. Again, though, I could see the generation of many candidate haplotypes if things are really messy; however, could you not use a simple "supporting reads"-based approach for haplotype selection. That would make the likelihood calculations fairly straight-forward. It would undeniably be less-sophisticated than the current De Bruijn Graph/Smith Waterman realignment-based approach but could be better for folks that want more control of the expected behaviors of the tool.

Option 5 (Disable realignment portion of assembly):

I'm going to go out on a limb with this one (feel free to shut this line of thought down quick if I'm really off-base). I've never been able to fully understand the code in the findBestPaths method (

gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingAssembler.java

Line 307 in 9ff3f8b

    
           List<Haplotype> findBestPaths(final Collection<T> graphs, final Map<T, AssemblyResult> assemblyResultByGraph,

) and I've had troubles figuring out the details of realignment from the official docs. It could be this part of the assembly process that causes me the most troubles in my pipeline, since this is where the original alignment information can really get disregarded as Mutect2 looks for better understandings of the input alignments. Admittedly, I find this feature to be really neat (particularly for the big ugly INDELs), but again the lose of the original alignment information has been troubling in certain cases. Could there be a potential approach to disabling realignment?

davidbenjamin · 2021-02-04T20:53:00Z

findBestPaths extracts haplotype sequences from the assembly graph, so if you have assembly you have to have findBestPaths. This is not exactly where original alignments vanish since presumably the variant haplotypes contain the right bases. Rather, it is later, when we align the assembled haplotypes to the reference with the Smith-Waterman algorithm, that a variant may be represented in a way that does not match the original reads.

If variant representation is the issue, that is not a matter of read realignment (the realignments are only used in annotations and are not part of variant calling) but of haplotype alignments. They're not realignments because the input bam has no concept of haplotypes, of course.

Do I understand correctly that your issue is with the representation of variants, and that Mutect2 outputs representations that, though equivalent in terms of base sequence, don't match the reads' CIGAR strings?

AlijahArcher · 2021-02-04T21:54:03Z

I see. Yes, I think this is closer to the issue at hand -- for instance, we'll have variants that are called, then when we go back to the original alignments there will be no indication of alternate observations at the loci which the variants were called (and vice versa, we'll have original alignments that suggest a variant, but won't get called). In the worst cases scenarios, the haplotype alignments seem ridiculous compared to the original reads (e.g. my GATK forum post: https://gatk.broadinstitute.org/hc/en-us/community/posts/360075749392-Bamout-haplotypes-are-much-worse-than-bamout-tumor-reads-would-suggest).

So, right, it's not the realignment of the reads, it's the alignment of the assembled haplotypes that complicates things. Thinking about it in this context though, I see we might be at an impasse (or back at square one with considering the assembly process).

My true fear is that we'll lose what appear to be definite-somatic variants in the original alignments because of the internal assembly + haplotype alignment engine. But I will say, it does seem like we see these problems for lower-coverage areas/variants, so I might be exaggerating the issue at hand. Maybe the solution for testing our expectations is to use higher depths to avoid bad assemblies and to apply some post-processing to the calls. (And wait and see if there are truly troubling cases of "obvious" variants getting lost)

AlijahArcher · 2021-02-04T22:10:33Z

Admittedly, I think I was being led to believe that the original reads themselves were being realigned somewhere in this process (I think I got this notion from the case study in the Chipster Tutorials video from last year (https://youtu.be/U85HP37IYSo?t=558)). These new insights help me understand my own case study on the GATK forums (linked on my previous comment) too.

davidbenjamin · 2021-02-25T15:24:24Z

Since the problem seems to be representation I'm going to close this issue, but I encourage you to continue submitting specific cases as we might eventually want to improve upon our Smith Waterman haplotype alignments.

AlijahArcher · 2021-02-25T15:25:12Z

Sounds good, thanks again @davidbenjamin

fleharty assigned davidbenjamin Jan 31, 2021

davidbenjamin closed this as completed Feb 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Mutect2/HaplotypeCaller "--skip-assembly" option #7064

[Feature Request] Mutect2/HaplotypeCaller "--skip-assembly" option #7064

AlijahArcher commented Jan 30, 2021

davidbenjamin commented Feb 1, 2021

AlijahArcher commented Feb 1, 2021 •

edited

Loading

davidbenjamin commented Feb 4, 2021

AlijahArcher commented Feb 4, 2021

AlijahArcher commented Feb 4, 2021

davidbenjamin commented Feb 25, 2021

AlijahArcher commented Feb 25, 2021

[Feature Request] Mutect2/HaplotypeCaller "--skip-assembly" option #7064

[Feature Request] Mutect2/HaplotypeCaller "--skip-assembly" option #7064

Comments

AlijahArcher commented Jan 30, 2021

Feature request

Tool(s) or class(es) involved

Description

davidbenjamin commented Feb 1, 2021

AlijahArcher commented Feb 1, 2021 • edited Loading

davidbenjamin commented Feb 4, 2021

AlijahArcher commented Feb 4, 2021

AlijahArcher commented Feb 4, 2021

davidbenjamin commented Feb 25, 2021

AlijahArcher commented Feb 25, 2021

AlijahArcher commented Feb 1, 2021 •

edited

Loading