Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Mutect2/HaplotypeCaller "--skip-assembly" option #7064

Closed
AlijahArcher opened this issue Jan 30, 2021 · 7 comments
Closed
Assignees

Comments

@AlijahArcher
Copy link

Feature request

Tool(s) or class(es) involved

Mutect/HaplotypeCaller

Description

(Background)
I've spent a lot of time working with Mutect2 in the past year (I've built a whole workflow centered around this tool). But, while I recognize that the internal reassembly feature leads to "best-in-class" results in terms of calling variants, for my purposes it generally just creates headaches since it makes interpreting (certain) calls and verifying (certain) base-level behaviors/expectations very difficult (even when looking at the bamout and assembly logs). Moreover, while we know our alignment process isn't perfect, we think it's appropriate for our purposes, and we would gladly accept the loss of a few calls to be able to have more control over the expected behaviors. With that, I purpose a "--skip-assembly" flag that would cause the Mutect2/HaplotypeCaller engine to use the original alignment information to determine the haplotypes.

All that said, I imagine this could be a niche feature request, so I've spent some time digging through the source code trying to see if there could be a quick fix that could be made available to whatever group of developers would want this. It seemed like there could be another conditional branched added here (

) to build a resultSet based on a non-assembly based approach. However, I'm not certain how using the original alignment information would affect the statistics employed for genotyping the candidate haplotypes, so I'm starting to back off implementing a custom fix and hoping the experts can help (or at least explain to me why this feature is not currently possible OR if there is a way that I can access this behavior that I'm missing).

Thank you for the consideration.

(TL;DR)
Introduce a --skip-assembly option that would cause the Mutect2/HaplotypeCaller engine to use the original alignment information to determine the haplotypes.

@davidbenjamin
Copy link
Contributor

@AlijahArcher We will need to clarify exactly what you intend by skipping the assembly. Here are some possibilities, with my initial thoughts:

  • "skip assembly" = "trust the alignments completely and use them directly for variant calling": if your aligner is good this is reasonable though not ideal. If this is what you want you might as well use samtools for variant calling.
  • "skip assembly" = "every unique pattern of variants seen in your reads defines a haplotype": the problem is that every sequencing error generates a new haplotype, so you need some way to cull bad haplotypes. Also, reads might only cover part of a haplotype so you need a way to sew them together.
  • "skip assembly" = "find all variants in your read alignments and let every combination thereof define a haplotype": if I recall correctly this is FreeBayes. I would call this a quick-and-dirty assembly rather than skipping assembly entirely.
  • "skip assembly" = "avoid haplotypes altogether and genotype variants directly": as you are aware, this is not possible within HC and M2.

Hopefully that focuses the conversation somewhat on the two main questions: how do you generate haplotypes, and how do you refine the set of haplotypes to a few good candidates? What did you have in mind?

@AlijahArcher
Copy link
Author

AlijahArcher commented Feb 1, 2021

@davidbenjamin Thanks for taking the time to look into this and focusing the conversation further. You bring up good points, I think options 1 and 3 are the most appealing, so I'll lean into those:

Option 1 (i.e. trusting the alignments completely)

Yeah, I recognize that this puts us closer in theory to the realms of a positional-based variant caller; however, I'm quite invested in the pre-/post- processing implementations I've got built around Mutect2 (not to mention the benefit of simultaneous processing of the normal and tumor samples, pre-filtering, model-based filtering, etc). If Mutect would employ this naive approach to haplotype calling, I suppose it would end up looking like the "Platypus" caller, which again might be suited for our needs, but potentially makes option 3 more appealing.

Option 3 ( i.e. Quick-and-dirty ("FreeBayes-ian") assembly:

This is interesting and would seem to solve my problems (I believe?) by creating a Haplotype-based, somatic variant caller with the Mutect perks/processing steps/output formats. Again, though, I could see the generation of many candidate haplotypes if things are really messy; however, could you not use a simple "supporting reads"-based approach for haplotype selection. That would make the likelihood calculations fairly straight-forward. It would undeniably be less-sophisticated than the current De Bruijn Graph/Smith Waterman realignment-based approach but could be better for folks that want more control of the expected behaviors of the tool.

Option 5 (Disable realignment portion of assembly):

I'm going to go out on a limb with this one (feel free to shut this line of thought down quick if I'm really off-base). I've never been able to fully understand the code in the findBestPaths method (

List<Haplotype> findBestPaths(final Collection<T> graphs, final Map<T, AssemblyResult> assemblyResultByGraph,
) and I've had troubles figuring out the details of realignment from the official docs. It could be this part of the assembly process that causes me the most troubles in my pipeline, since this is where the original alignment information can really get disregarded as Mutect2 looks for better understandings of the input alignments. Admittedly, I find this feature to be really neat (particularly for the big ugly INDELs), but again the lose of the original alignment information has been troubling in certain cases. Could there be a potential approach to disabling realignment?

@davidbenjamin
Copy link
Contributor

findBestPaths extracts haplotype sequences from the assembly graph, so if you have assembly you have to have findBestPaths. This is not exactly where original alignments vanish since presumably the variant haplotypes contain the right bases. Rather, it is later, when we align the assembled haplotypes to the reference with the Smith-Waterman algorithm, that a variant may be represented in a way that does not match the original reads.

If variant representation is the issue, that is not a matter of read realignment (the realignments are only used in annotations and are not part of variant calling) but of haplotype alignments. They're not realignments because the input bam has no concept of haplotypes, of course.

Do I understand correctly that your issue is with the representation of variants, and that Mutect2 outputs representations that, though equivalent in terms of base sequence, don't match the reads' CIGAR strings?

@AlijahArcher
Copy link
Author

I see. Yes, I think this is closer to the issue at hand -- for instance, we'll have variants that are called, then when we go back to the original alignments there will be no indication of alternate observations at the loci which the variants were called (and vice versa, we'll have original alignments that suggest a variant, but won't get called). In the worst cases scenarios, the haplotype alignments seem ridiculous compared to the original reads (e.g. my GATK forum post: https://gatk.broadinstitute.org/hc/en-us/community/posts/360075749392-Bamout-haplotypes-are-much-worse-than-bamout-tumor-reads-would-suggest).

So, right, it's not the realignment of the reads, it's the alignment of the assembled haplotypes that complicates things. Thinking about it in this context though, I see we might be at an impasse (or back at square one with considering the assembly process).

My true fear is that we'll lose what appear to be definite-somatic variants in the original alignments because of the internal assembly + haplotype alignment engine. But I will say, it does seem like we see these problems for lower-coverage areas/variants, so I might be exaggerating the issue at hand. Maybe the solution for testing our expectations is to use higher depths to avoid bad assemblies and to apply some post-processing to the calls. (And wait and see if there are truly troubling cases of "obvious" variants getting lost)

@AlijahArcher
Copy link
Author

Admittedly, I think I was being led to believe that the original reads themselves were being realigned somewhere in this process (I think I got this notion from the case study in the Chipster Tutorials video from last year (https://youtu.be/U85HP37IYSo?t=558)). These new insights help me understand my own case study on the GATK forums (linked on my previous comment) too.

@davidbenjamin
Copy link
Contributor

Since the problem seems to be representation I'm going to close this issue, but I encourage you to continue submitting specific cases as we might eventually want to improve upon our Smith Waterman haplotype alignments.

@AlijahArcher
Copy link
Author

Sounds good, thanks again @davidbenjamin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants