Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom ploidy regions in HaplotypeCaller #8464

Closed
wants to merge 8 commits into from

Conversation

rickymagner
Copy link
Contributor

Summary

This PR adds a new flag to HaplotypeCaller called --ploidy-regions which allows the user to input a .bed or .interval_list with "name" column equal to a positive integer for the ploidy to use when calling variants in that region. The main use case is for calling haploid variants outside the PAR for XY individuals as required by the VCF spec, but this provides a much more flexible interface for other similar niche applications, like genotyping individuals with other known aneuploidies. The global -ploidy flag will still provide the background default (or the built-in ploidy of 2 for humans), but the user input value will supersede these in overlapping regions. Note that the overlap is checked against the active region, meaning variants near the boundary of the --ploidy-regions file may end up with GT fields having ploidy slightly differently than expected, for example if your custom region overlaps a given active region but the variant ends up being written to a location outside that interval. In this case the ploidy from the user input would be used rather than any other default.

Implementation Details

The key idea is to allow HaplotypeCallerEngine to initialize multiple genotyping engines based on the --ploidy-regions input. The intervals are first parsed to check for positive integer ploidy values, and then used to create hashmaps of ploidy -> genotyper. The engine uses two types of genotypers: one for active region determination and one for doing the actual genotyping. Both admit a ploidy paramter passed via hcArgs. This PR modifies the HaplotypeCallerArgumentCollection class to include a method for creating copies of this object with differing ploidy amounts. These then get fed to the constructors of the appropriate genotyper classes, which are organized into two hashmaps. In every situation where one of these genotypers is used, we instead begin the scope by calling a "get local genotyper" method that performs the logic of checking whether the region of interest overlaps any of the user-provided regions, and then selects the appropriate localEngine genotyper for the task, ensuring the user-provided ploidy supersedes any other defaults.

A Note on Dependency

The flexibility of using either .bed or .interval_list files to specify this information depends on this PR in htsjdk being made into a full release, and then bumping the dependency of GATK. The code in this PR would not compile until this happens.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. Overall this approach is a very light touch in terms of code which is good. Some comments and one efficiency concern that should be easy to correct

@@ -946,6 +946,29 @@ public void testHaplotypeCallerRemoveAltAlleleBasedOnHaptypeScores() throws IOEx
}
}

@Test(dataProvider="HaplotypeCallerTestInputs")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests where where the ployidy regions are invalid for various reasons (check some in that are on the wrong reference, some that don't have the right names etc...) and assert that you are getting user exceptions in all of those cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some tests to check for strings instead of integers for ploidy values, and checking for negative values for ploidy. I'm not sure how to check against a reference mismatch since this supports bed files, which don't have any sort of sequence dictionary. Do you have any ideas? I suppose during construction I could check against the VCF header, but that only protects against differently named reference sequences, while the lengths could be different. (E.g. a hg38 <-> T2T mixup wouldn't be detectable with just a bed file)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the references mismatch subtly then we are hosed... but if somebody uses an HG19 bedfile + an HG38 reference this should at least warn the user because it should be easy to detect.... theoretically you could also check against the sequence dictionary that the intervals actually fall within the contig as well... Just asserting the names mismatch is probably strong enough protection

* @param region Current region of interest, e.g. AssemblyRegion
* @return Genotyping engine with correct ploidy setting for given region
*/
protected HaplotypeCallerGenotypingEngine getLocalGenotypingEngine(Locatable region) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a somewhat ignorable comment but you could extract out a method here so these two are necessarily in lock-step and there is no risk of somebody breaking one but not the other...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'd like them to be covered by just one method, but unfortunately the return different types. I tried returning just the parent class and using a HashMap input that maps to the parent class, but Java doesn't seem to like this. Do you have any other suggestions, or do you think it's fine to leave as is?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull out a method "getPloyidyToUseAtTHISSite()" or something similar and have both of these methods call that one that has your adjudication logic and then return whatever map value is needed.

// Private constructor for use in copyWithNewPloidy
// Allows you to make new instances of hcArgs with synced fields except for genotyper ploidy
private HaplotypeCallerArgumentCollection(int ploidy) {
this.standardArgs.genotypeArgs.samplePloidy = ploidy;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works? This looks terrifying...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swear I found this idea somewhere online and I can't find where anymore... Do you have any suggestions on doing this in a better way?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that both of us are weary of this option... if you keep this confusing copy constructor (that i'm really not sure works at all this is horrifying...) you are going to need to check in some unit tests where you call this some times on a on-default argument collection (probably also non-default sub-argument collections) and demonstrate that everything is identical after calling this except for ployidy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole point was so they would be in-sync if something changes in one? that is easily demonstrated with a test...

@rickymagner
Copy link
Contributor Author

Thanks @jamesemery and @davidbenjamin for your helpful comments. I think I either addressed most of them or had some follow-up questions about ideas for getting the last few changes in.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you need to rebase this to get the tests working. A few medium sized comments... mostly this looks good.


// Should fail due to negative integer values in name column of bed
@Test(dataProvider="HaplotypeCallerTestInputs", expectedExceptions = UserException.class)
public void testNonPositiveCustomPloidyRegions(final String inputFileName, final String referenceFileName) throws Exception {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in keeping with my comment above... add a test here where you mis-map a bedfile with the reference as that seems detectable at the overlap detector construction phase.

// Private constructor for use in copyWithNewPloidy
// Allows you to make new instances of hcArgs with synced fields except for genotyper ploidy
private HaplotypeCallerArgumentCollection(int ploidy) {
this.standardArgs.genotypeArgs.samplePloidy = ploidy;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that both of us are weary of this option... if you keep this confusing copy constructor (that i'm really not sure works at all this is horrifying...) you are going to need to check in some unit tests where you call this some times on a on-default argument collection (probably also non-default sub-argument collections) and demonstrate that everything is identical after calling this except for ployidy.

// Private constructor for use in copyWithNewPloidy
// Allows you to make new instances of hcArgs with synced fields except for genotyper ploidy
private HaplotypeCallerArgumentCollection(int ploidy) {
this.standardArgs.genotypeArgs.samplePloidy = ploidy;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole point was so they would be in-sync if something changes in one? that is easily demonstrated with a test...

defaultGenotypingEngine.setAnnotationEngine(annotationEngine);

// Create other custom genotyping engines if user provided ploidyRegions
for (final int ploidy : this.allCustomPloidies) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so currently this code maintains two separate genotypers for ployidy 2 if it gets specified?

* @param region Current region of interest, e.g. AssemblyRegion
* @return Genotyping engine with correct ploidy setting for given region
*/
protected HaplotypeCallerGenotypingEngine getLocalGenotypingEngine(Locatable region) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pull out a method "getPloyidyToUseAtTHISSite()" or something similar and have both of these methods call that one that has your adjudication logic and then return whatever map value is needed.

@@ -0,0 +1,3 @@
20 10000000 10010000 2
20 10020000 10030000 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put into the test file a demonstration of the rule "overlapping ployidy defaults to the highest option?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a method for getting current ploidy to trim down that part of the code. Still need to add the test

@rickymagner
Copy link
Contributor Author

rickymagner commented Nov 28, 2023

Just pushed some small changes based on our conversation. I think the last things to add now are 3 tests:

  • Check a user exception is thrown if contig names in the ploidy regions don't match the bam sequence dictionary (e.g. chr1 vs 1 naming)
  • Add a test that when ploidies given overlap, the highest one is actually used
  • Add a test about the private constructor deep-copy Java magic to make sure that non-ploidy fields are actually in sync when the copy is made (and move that part of the code to the bottom of constructor)

I also need to rebase to get the right htsjdk version, and other changes that might've been made to the code, which are hopefully easy to resolve...

@rickymagner rickymagner force-pushed the tkay_PARploidy branch 2 times, most recently from 7b70607 to b5486cc Compare November 29, 2023 19:43
Finish changes to HaplotypeCaller to add ploidy-regions flag allowing variable ploidy

Change IntervalFileFeature interface to NamedFeature for htsjdk dependency

Address comments for draft PR

Make a few changes and refactors based on comments
@gatk-bot
Copy link

gatk-bot commented Dec 1, 2023

Github actions tests reported job failures from actions build 7063736836
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 7063736836.2 logs

* Introduced a new class MulitPloidyGenotyperCache which keeps track of multiple GenotypingEngines that have different ploidy values.  
* minor refactoring in related classes
* removed an unused type parameter from GenotypingEngine
@gatk-bot
Copy link

gatk-bot commented Dec 7, 2023

Github actions tests reported job failures from actions build 7131188447
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 17.0.6+10 7131188447.10 logs
variantcalling 17.0.6+10 7131188447.2 logs
integration 17.0.6+10 7131188447.0 logs

@gatk-bot
Copy link

gatk-bot commented Dec 7, 2023

Github actions tests reported job failures from actions build 7131774248
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 7131774248.2 logs

@gatk-bot
Copy link

gatk-bot commented Dec 7, 2023

Github actions tests reported job failures from actions build 7132143535
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 7132143535.2 logs

@gatk-bot
Copy link

gatk-bot commented Dec 8, 2023

Github actions tests reported job failures from actions build 7143620989
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 7143620989.2 logs

@droazen
Copy link
Collaborator

droazen commented Dec 12, 2023

The changes here were merged in #8609

@droazen droazen closed this Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants