Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert AF calculation bug and yet another reblocking fix #7670

Merged
merged 10 commits into from
Mar 8, 2022

Conversation

ldgauthier
Copy link
Contributor

The warp pipeline tests caught some cases that we apparently didn't have in our integration tests, but now we do!

@ldgauthier ldgauthier force-pushed the ldg_yetAnotherReblockingFix branch 2 times, most recently from 6f61408 to 483ec41 Compare February 11, 2022 21:47
@@ -133,7 +133,7 @@ public VariantContext calculateGenotypes(final VariantContext vc, final Genotype
if (maxAltAlleles < vc.getAlternateAlleles().size()) {
final List<Allele> allelesToKeep = AlleleSubsettingUtils.calculateMostLikelyAlleles(vc, defaultPloidy, maxAltAlleles);
final GenotypesContext reducedGenotypes = allelesToKeep.size() == 1 ? GATKVariantContextUtils.subsetToRefOnly(vc, defaultPloidy) :
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false);
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These go with reverting the boolean variable to force PLs to be output

@@ -94,7 +94,7 @@ public static GenotypesContext subsetAlleles(final GenotypesContext originalGs,
newLog10GQ = -0.1*g.getGQ();
}

final boolean useNewLikelihoods = !suppressUninformativePLs ||
final boolean useNewLikelihoods = forceEmitUninformativePLs ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old conditional wasn't right -- I trusted the IntelliJ refactoring too much. This is also a more logical name.

@@ -77,7 +77,7 @@ public static AlleleFrequencyCalculator makeCalculator(final DragstrParams drags
final double[] log10Likelihoods;
if (g.hasLikelihoods()) {
log10Likelihoods = g.getLikelihoods().getAsVector();
} else if (g.isHomRef()) {
} else if ( g.isHomRef() || g.isNoCall()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking out the no-call shouldn't have happened in 4.2.5.0, but it did because of poor test coverage. Added new multi-allelic tests to GGVCFs.

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier Back to you with my comments

@@ -55,7 +55,7 @@ public static GenotypesContext subsetAlleles(final GenotypesContext originalGs,
final GenotypePriorCalculator gpc,
final GenotypeAssignmentMethod assignmentMethod,
final int depth,
final boolean suppressUninformativePLs) {
final boolean forceEmitUninformativePLs) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check all usages of this method to make sure that the booleans have been flipped when necessary, since this arg now has the opposite meaning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this out instead. The reason it was still there was because I didn't want to change the SelectVariants integration test expected outputs, but that's not a good reason. I want to prefer explicitness and consistency over space savings or whatever the motivation was for dropping PLs sometimes.

@@ -77,7 +77,7 @@ public static AlleleFrequencyCalculator makeCalculator(final DragstrParams drags
final double[] log10Likelihoods;
if (g.hasLikelihoods()) {
log10Likelihoods = g.getLikelihoods().getAsVector();
} else if (g.isHomRef()) {
} else if ( g.isHomRef() || g.isNoCall()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a brief comment here documenting the reason why hom refs and no calls are handled in the same case here, to discourage our future selves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just too dangerous. One of the integration tests was doing the wrong thing, so who knows what QUAL came out. Instead I still prohibit no-calls here but call genotypes in the internal subsetting so we should always have calls when we have the right information.

@@ -1033,7 +1033,7 @@ private VariantContext subsetRecord(final VariantContext vc, final boolean prese
newGC = sub.getNAlleles() == vc.getNAlleles() ? subGenotypesWithOldAlleles :
AlleleSubsettingUtils.subsetAlleles(subGenotypesWithOldAlleles, 0, vc.getAlleles(),
sub.getAlleles(), null, GenotypeAssignmentMethod.DO_NOT_ASSIGN_GENOTYPES,
vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false);
vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the force emit arg is critical here too, add a comment to that effect

@@ -143,7 +143,7 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v
* @return true if we have enough info for AF calculation
*/
public static boolean genotypeIsUsableForAFCalculation(Genotype g) {
return g.hasLikelihoods() || (g.isHomRef() && g.hasGQ() && 2 == g.getPloidy());
return g.hasLikelihoods() || g.hasGQ() || g.getAlleles().stream().anyMatch(a -> a.isCalled() && a.isNonReference() && !a.isSymbolic());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this condition got reverted to its previous state too -- can you add a brief comment here explaining the logic for posterity?

@@ -1687,7 +1687,7 @@ public static void splitASFilters(VariantContext vc, List<Map<String, Object>> a
genotypeAssignmentMethodUsed != GenotypeAssignmentMethod.SET_TO_NO_CALL)
AlleleSubsettingUtils.addInfoFieldAnnotations(vc, builder, keepOriginalChrCounts);

builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, null, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0), true));
builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, null, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0), false));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force emit is false here -- is that correct?

//23 highly multi-allelic sites across 54 1000G exomes to test allele subsetting and QUAL calculation
{getTestFile("multiallelicQualRegression.vcf "),
getTestFile("multiallelicQualRegression.expected.vcf"),
NO_EXTRA_ARGS, hg38Reference}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've confirmed that this new test case fails with 4.2.5.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- it fails like the Hindenburg.

" -do-qual-approx --floor-blocks -GQB 20 -GQB 30 -GQB 40 " +
" --" + StandardArgumentDefinitions.ADD_OUTPUT_VCF_COMMANDLINE + " false",
Arrays.asList(getToolTestDataDir() + "prodWesOutput.g.vcf"));
spec3.executeTest("testWithExactComparison", this);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this test method should use a DataProvider to feed it the three command lines, instead of embedding the three cases within one method like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@droazen droazen self-assigned this Feb 15, 2022
@ldgauthier
Copy link
Contributor Author

@droazen there might be some new test failures (which I will address if/when they arise), but this is ready for re-review.

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier Back to you -- just needs a few additional comments and then I think we should be good, provided the WARP and GATK tests pass

@@ -133,7 +133,7 @@ public VariantContext calculateGenotypes(final VariantContext vc, final Genotype
if (maxAltAlleles < vc.getAlternateAlleles().size()) {
final List<Allele> allelesToKeep = AlleleSubsettingUtils.calculateMostLikelyAlleles(vc, defaultPloidy, maxAltAlleles);
final GenotypesContext reducedGenotypes = allelesToKeep.size() == 1 ? GATKVariantContextUtils.subsetToRefOnly(vc, defaultPloidy) :
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false);
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment documenting why we want to use GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL here

@@ -391,11 +391,12 @@ boolean isVcCoveredByDeletion(final VariantContext vc) {
* {@link GenotypeLikelihoods#MAX_DIPLOID_ALT_ALLELES_THAT_CAN_BE_GENOTYPED}.
*/
protected final boolean cannotBeGenotyped(final VariantContext vc) {
// protect against too many alternate alleles that we can't even run AF on:
//
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fill in or delete empty comment

} else if (g.isHomRef()) {
} else if ( g.isHomRef()) {
//no-call with no PLs seems risky, but there are a few places in the QUAL/AF code where we subset alleles,
// but then leave the genotypes as no-calls
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems misplaced/confusing, since no-calls with no PLs will now get handled by the else block below, not this block, and will always produce the Genotype does not contain likelihoods necessary to calculate posteriors exception, right?

else {
    throw new IllegalStateException("Genotype " + g + " does not contain likelihoods necessary to calculate posteriors.");
}

Recommend adding a comment to the else block below explaining clearly why we want to always throw an IllegalStateException here for no calls with no PLs.

@@ -568,8 +566,7 @@ VariantContext cleanUpHighQualityVariant(final VariantContext variant) {
if(allelesNeedSubsetting && !keepAllAlts) {
newAlleleSetUntrimmed.removeAll(allelesToDrop);
final GenotypesContext gc = AlleleSubsettingUtils.subsetAlleles(variant.getGenotypes(), genotype.getPloidy(), variant.getAlleles(),
newAlleleSetUntrimmed, null, GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN,
variant.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), true);
newAlleleSetUntrimmed, null, GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment documenting why we're using GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN here instead of the GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL we're using everywhere else.

@@ -139,6 +139,7 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v
/**
* Do we have (or can we infer) likelihoods necessary for allele frequency calculation?
* Some reblocked and/or DRAGEN GVCFs omit likelihoods for ref blocks, but we can estimate them
* If GenomicsDB max alt threshold is too low, variants may also be missimg PLs -- reject those
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missimg -> missing

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier Back to you with a couple of final comments

@@ -133,7 +133,7 @@ public VariantContext calculateGenotypes(final VariantContext vc, final Genotype
if (maxAltAlleles < vc.getAlternateAlleles().size()) {
final List<Allele> allelesToKeep = AlleleSubsettingUtils.calculateMostLikelyAlleles(vc, defaultPloidy, maxAltAlleles);
final GenotypesContext reducedGenotypes = allelesToKeep.size() == 1 ? GATKVariantContextUtils.subsetToRefOnly(vc, defaultPloidy) :
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.SET_TO_NO_CALL, vc.getAttributeAsInt(VCFConstants.DEPTH_KEY, 0), false);
AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(), defaultPloidy, vc.getAlleles(), allelesToKeep, gpc, GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming that this should be hardcoded to BEST_MATCH_TO_ORIGINAL rather than respecting configuration.genotypeArgs.genotypeAssignmentMethod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, with no PLs no-calls are just going to cause problems, so keep any 0/0 genotypes as such. I added a comment to that extent.

} else if (g.isHomRef()) {
} else if ( g.isHomRef()) {
//no-call with no PLs seems risky, but there are a few places in the QUAL/AF code where we subset alleles,
// but then leave the genotypes as no-calls
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment still seems confusing/misplaced. It talks about handling no-calls in the case that now handles hom-ref (no-calls now fall through to the else block below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -139,6 +139,7 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v
/**
* Do we have (or can we infer) likelihoods necessary for allele frequency calculation?
* Some reblocked and/or DRAGEN GVCFs omit likelihoods for ref blocks, but we can estimate them
* If GenomicsDB max alt threshold is too low, variants may also be missimg PLs -- reject those
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missimg -> missing

Is this comment still relevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed typo. Yes, still relevant. I clarified that here variants means non-reference genotypes, where we can't estimate PLs.

@Test
public void testGDBMaxAltsGreaterThanGGVCFsMaxAlts() throws IOException {

}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fill in empty test method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized this case was redundant with testMaxAltsToCombineInGenomicsDB. Deleted and added comment to testMaxAlts

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Merge when tests pass @ldgauthier

@ldgauthier ldgauthier merged commit 2d2cca2 into master Mar 8, 2022
@ldgauthier ldgauthier deleted the ldg_yetAnotherReblockingFix branch March 8, 2022 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants