Adding argument to GenotypeGVCFs to keep only RAW_GT_COUNT #7996

Merged: 13 commits, Oct 24, 2022
@@ -98,6 +98,8 @@ public final class GenotypeGVCFs extends VariantLocusWalker {
public static final String ALL_SITES_SHORT_NAME = "all-sites";
public static final String KEEP_COMBINED_LONG_NAME = "keep-combined-raw-annotations";
public static final String KEEP_COMBINED_SHORT_NAME = "keep-combined";
public static final String KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME = "keep-specific-raw-annotation";
public static final String KEEP_SPECIFIED_RAW_ANNOTATION_SHORT_NAME = "keep-raw";
public static final String FORCE_OUTPUT_INTERVALS_NAME = "force-output-intervals";

@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
@@ -132,19 +134,25 @@ public final class GenotypeGVCFs extends VariantLocusWalker {
doc = "LOD threshold to emit variant to VCF.")
protected double tlodThreshold = 3.5; //allow for some lower quality variants


/**
* Margin of error in allele fraction to consider a somatic variant homoplasmic, i.e. if there is less than a 0.1% reference allele fraction, those reads are likely errors
*/
@Argument(fullName=CombineGVCFs.ALLELE_FRACTION_DELTA_LONG_NAME, doc = "Margin of error in allele fraction to consider a somatic variant homoplasmic")
protected double afTolerance = 1e-3; //based on Q30 as a "good" base quality score

/**
* If specified, keep the combined raw annotations (e.g. AS_SB_TABLE) after genotyping. This is applicable to Allele-Specific annotations
* If specified, keep all the combined raw annotations (e.g. AS_SB_TABLE) after genotyping. This is applicable to Allele-Specific annotations
*/
@Argument(fullName=KEEP_COMBINED_LONG_NAME, shortName = KEEP_COMBINED_SHORT_NAME, doc = "If specified, keep the combined raw annotations")
protected boolean keepCombined = false;

/**
* Keep only the specific combined raw annotations specified (removing the other raw annotations if keep-combined-raw-annotations is not set).
*/
@Argument(fullName= KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME, shortName = KEEP_SPECIFIED_RAW_ANNOTATION_SHORT_NAME, optional = true,
Contributor @samuelklee, Sep 8, 2022

Should we perhaps use mutex to ensure we don't use both -keep-raw and -keep-combined simultaneously?

Do you have an idea of the usefulness of -keep-combined (i.e., the convenience of being able to keep all combined raw annotations, even if we don't know what those annotations might be) and where it might be currently used (e.g., WARP WDLs, etc.)? How much of a pain would it be to require the use of -keep-raw in all cases---presumably one would always be able to know which annotations one wants to keep?

In any case, -keep-raw and -keep-combined (and the corresponding long names) don't really give the sense of -keep and -keep-all that is implemented here. Assuming we have the freedom to muck with the name of -keep-combined, do you think it might be worth changing? And should we also change the short/long names of -keep-raw to somehow include combined (as you note in the Javadoc here)?

A general comment: as a relative neophyte to this annotation framework, I will say that I find the lack of clear documentation/definitions for terms like "combined" and "raw" makes it hard to be sure what these arguments (or even the tool) are doing, especially from the doc strings and tool docs alone. Given that you have more experience with this framework, do you feel the same, or do you think things are clear as is?

Happy to hash all of this out further offline/elsewhere, seems a bit thorny!

Contributor Author

I don't know whether -keep-combined is currently used anywhere; I assume it's mostly useful for debugging and development. I would rather not break backwards compatibility, though, so I'd prefer to leave it in and not rename it.

I can add more documentation for these arguments to make this clearer. I agree that the names don't represent well what the arguments do.
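
For illustration, the mutual exclusion discussed in this thread could be enforced with a fail-fast check before traversal starts. This is a minimal sketch in plain Java, not the GATK implementation: the class `KeepArgumentCheck` and its `validate` method are hypothetical names, and the argument names in the message are taken from the diff above. An argument-parsing framework may also offer a declarative way to express the same constraint.

```java
import java.util.List;

/** Hypothetical helper, for illustration only; not part of GATK. */
final class KeepArgumentCheck {
    private KeepArgumentCheck() {}

    /**
     * Rejects the combination of "keep all combined raw annotations"
     * with a non-empty list of specific raw annotations to keep.
     */
    static void validate(final boolean keepCombined, final List<String> keepSpecified) {
        if (keepSpecified == null) {
            throw new IllegalArgumentException("keepSpecified must not be null");
        }
        if (keepCombined && !keepSpecified.isEmpty()) {
            throw new IllegalArgumentException(
                    "--keep-combined-raw-annotations and --keep-specific-raw-annotation cannot be used together");
        }
    }

    public static void main(final String[] args) {
        validate(false, List.of("RAW_GT_COUNT")); // accepted: only specific keys requested
        validate(true, List.of());                // accepted: keep everything, no specific keys
        try {
            validate(true, List.of("RAW_GT_COUNT")); // rejected: both requested
        } catch (final IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Checking once at startup keeps the conflict out of the annotation engine, so downstream code can assume at most one of the two modes is active.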

doc="Keep only the specific combined raw annotations specified (removing the other raw annotations).")
protected List<String> keepSpecifiedRawAnnotations = new ArrayList<>();
Contributor @samuelklee, Sep 8, 2022

Depending on what you do with -keep-combined above, you may want to add additional argument documentation/validation/sanitization here. E.g., add to the doc string that duplicate values will be ignored, and ensure that downstream code enforces/respects this as early as possible.
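
As a rough sketch of the sanitization suggested here, duplicate keys could be dropped once, as early as possible, while preserving the order in which they were requested. The class `RawAnnotationKeys` and its `sanitize` method are hypothetical names used only for this example; rejecting blank values is an added assumption, not something the PR specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, for illustration only; not part of GATK. */
final class RawAnnotationKeys {
    private RawAnnotationKeys() {}

    /**
     * Returns the requested keys with duplicates removed, keeping first-seen order.
     * Blank or null entries are rejected (an assumption made for this sketch).
     */
    static List<String> sanitize(final List<String> requested) {
        final Set<String> unique = new LinkedHashSet<>(); // remembers insertion order
        for (final String key : requested) {
            if (key == null || key.trim().isEmpty()) {
                throw new IllegalArgumentException("Raw annotation key must not be blank");
            }
            unique.add(key.trim()); // a repeated key is silently ignored
        }
        return Collections.unmodifiableList(new ArrayList<>(unique));
    }
}
```

Calling something like this where the argument is first read means the constructor and `finalizeAnnotations` never see repeated keys.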


@ArgumentCollection
private GenotypeCalculationArgumentCollection genotypeArgs = new GenotypeCalculationArgumentCollection();

@@ -262,7 +270,7 @@ public void onTraversalStart() {
Collections.emptyList();

Collection<Annotation> variantAnnotations = makeVariantAnnotations();
annotationEngine = new VariantAnnotatorEngine(variantAnnotations, dbsnp.dbsnp, Collections.emptyList(), false, keepCombined);
annotationEngine = new VariantAnnotatorEngine(variantAnnotations, dbsnp.dbsnp, Collections.emptyList(), false, keepCombined, keepSpecifiedRawAnnotations);

merger = new ReferenceConfidenceVariantContextMerger(annotationEngine, getHeaderForVariants(), somaticInput, false, true);

@@ -4,12 +4,14 @@
import htsjdk.variant.vcf.*;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.engine.FeatureContext;
import org.broadinstitute.hellbender.engine.FeatureDataSource;
import org.broadinstitute.hellbender.engine.FeatureInput;
import org.broadinstitute.hellbender.engine.ReferenceContext;
import org.broadinstitute.hellbender.exceptions.GATKException;
import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs;
import org.broadinstitute.hellbender.tools.walkers.annotator.allelespecific.ReducibleAnnotation;
import org.broadinstitute.hellbender.tools.walkers.annotator.allelespecific.ReducibleAnnotationData;
import org.broadinstitute.hellbender.utils.Utils;
@@ -43,6 +45,7 @@ public final class VariantAnnotatorEngine {
private boolean expressionAlleleConcordance;
private final boolean useRawAnnotations;
private final boolean keepRawCombinedAnnotations;
private final List<String> rawAnnotationsToKeep;

private final static Logger logger = LogManager.getLogger(VariantAnnotatorEngine.class);
private final static OneShotLogger jumboAnnotationsLogger = new OneShotLogger(VariantAnnotatorEngine.class);
@@ -59,17 +62,20 @@ public final class VariantAnnotatorEngine {
* @param useRaw When this is set to true, the annotation engine will call {@link ReducibleAnnotation#annotateRawData(ReferenceContext, VariantContext, AlleleLikelihoods)}
* on annotations that extend {@link ReducibleAnnotation}, instead of {@link InfoFieldAnnotation#annotate(ReferenceContext, VariantContext, AlleleLikelihoods)},
* @param keepCombined If true, retain the combined raw annotation values instead of removing them after finalizing
* @param rawAnnotationsToKeep List of raw annotations to keep even when others are removed
*/
public VariantAnnotatorEngine(final Collection<Annotation> annotationList,
final FeatureInput<VariantContext> dbSNPInput,
final List<FeatureInput<VariantContext>> featureInputs,
final boolean useRaw,
boolean keepCombined){
boolean keepCombined,
final List<String> rawAnnotationsToKeep){
Utils.nonNull(featureInputs, "comparisonFeatureInputs is null");
infoAnnotations = new ArrayList<>();
genotypeAnnotations = new ArrayList<>();
jumboInfoAnnotations = new ArrayList<>();
jumboGenotypeAnnotations = new ArrayList<>();
final List<String> variantAnnotationKeys = new ArrayList<>();
for (Annotation annot : annotationList) {
if (annot instanceof InfoFieldAnnotation) {
infoAnnotations.add((InfoFieldAnnotation) annot);
@@ -82,11 +88,18 @@ public VariantAnnotatorEngine(final Collection<Annotation> annotationList,
} else {
throw new GATKException.ShouldNeverReachHereException("Unexpected annotation type: " + annot.getClass().getName());
}
variantAnnotationKeys.addAll(((VariantAnnotation) annot).getKeyNames());
}
variantOverlapAnnotator = initializeOverlapAnnotator(dbSNPInput, featureInputs);
reducibleKeys = new LinkedHashSet<>();
useRawAnnotations = useRaw;
keepRawCombinedAnnotations = keepCombined;
for (String rawAnnot : rawAnnotationsToKeep) {
if (!variantAnnotationKeys.contains(rawAnnot)) {
throw new UserException("Requested --" + GenotypeGVCFs.KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME + ": " + rawAnnot + " is not available. Add requested annotation with --" + StandardArgumentDefinitions.ANNOTATION_LONG_NAME + ".");
Contributor

It's confusing that the command line looks something like: --keep-specific-raw-annotation: RAW_GT_COUNT --annotation RawGtCount (since the latter is derived from the Java class name). Is this pattern something that could be cleaned up at this stage, or has that ship long sailed?

And this exception message, while helpful, doesn't actually tell you what value you need to supply to the annotation argument. I'm not sure how trivial it would be to look that up and output it.

Contributor Author

This is finally resolved! I got help from @cmnbroad to add an argument that uses the GATKAnnotationPlugin, but even so I think I did not implement this in the cleanest way. @cmnbroad I'd appreciate if you'd take a look at this branch again to make sure I didn't miss anything.
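
To illustrate the reviewer's point about the exception message, a message could name the annotation that produces the requested key. The sketch below assumes a lookup from annotation name to the keys it produces; the class `MissingRawAnnotationMessage`, the `describe` method, and the shape of that lookup are all hypothetical and do not reflect how this PR resolved the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, for illustration only; not part of GATK. */
final class MissingRawAnnotationMessage {
    private MissingRawAnnotationMessage() {}

    /**
     * @param requestedKey     the raw key the user asked to keep, e.g. RAW_GT_COUNT
     * @param keysByAnnotation assumed lookup: annotation name as given on the
     *                         command line -> keys that annotation produces
     */
    static String describe(final String requestedKey, final Map<String, List<String>> keysByAnnotation) {
        final List<String> providers = new ArrayList<>();
        for (final Map.Entry<String, List<String>> entry : keysByAnnotation.entrySet()) {
            if (entry.getValue().contains(requestedKey)) {
                providers.add(entry.getKey());
            }
        }
        if (providers.isEmpty()) {
            return "Requested raw annotation " + requestedKey + " is not produced by any known annotation.";
        }
        // Spell out the exact value(s) to pass, so the user does not have to guess the class name.
        return "Requested raw annotation " + requestedKey + " is not available. Add it with --annotation "
                + String.join(" or --annotation ", providers) + ".";
    }
}
```

With a lookup containing `RawGtCount -> [RAW_GT_COUNT]`, the message for `RAW_GT_COUNT` would point the user at `--annotation RawGtCount`, which is the pairing mentioned in the comment above.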

}
}
this.rawAnnotationsToKeep = rawAnnotationsToKeep;
for (InfoFieldAnnotation annot : infoAnnotations) {
if (annot instanceof ReducibleAnnotation) {
for (final String rawKey : ((ReducibleAnnotation) annot).getRawKeyNames()) {
@@ -96,6 +109,14 @@ public VariantAnnotatorEngine(final Collection<Annotation> annotationList,
}
}

public VariantAnnotatorEngine(final Collection<Annotation> annotationList,
final FeatureInput<VariantContext> dbSNPInput,
final List<FeatureInput<VariantContext>> featureInputs,
final boolean useRaw,
boolean keepCombined){
this(annotationList, dbSNPInput, featureInputs, useRaw, keepCombined, Collections.emptyList());
}

private VariantOverlapAnnotator initializeOverlapAnnotator(final FeatureInput<VariantContext> dbSNPInput, final List<FeatureInput<VariantContext>> featureInputs) {
final Map<FeatureInput<VariantContext>, String> overlaps = new LinkedHashMap<>();
for ( final FeatureInput<VariantContext> fi : featureInputs) {
@@ -253,6 +274,14 @@ public Map<String, Object> combineAnnotations(final List<Allele> allelesList, Ma
public VariantContext finalizeAnnotations(VariantContext vc, VariantContext originalVC) {
final Map<String, Object> variantAnnotations = new LinkedHashMap<>(vc.getAttributes());

//save annotations that have been requested to be kept
final Map<String, Object> savedRawAnnotations = new LinkedHashMap<>();
for(String rawAnnot : rawAnnotationsToKeep) {
Contributor

Same comment about final here. Note that it's actually used in other loops in this method...up to you if you want to make it consistent everywhere.

if (variantAnnotations.containsKey(rawAnnot)) {
savedRawAnnotations.put(rawAnnot, variantAnnotations.get(rawAnnot));
}
}

// go through all the requested info annotationTypes
for (final InfoFieldAnnotation annotationType : infoAnnotations) {
if (annotationType instanceof ReducibleAnnotation) {
@@ -280,6 +309,8 @@ public VariantContext finalizeAnnotations(VariantContext vc, VariantContext orig
variantAnnotations.remove(GATKVCFConstants.VARIANT_DEPTH_KEY);
variantAnnotations.remove(GATKVCFConstants.RAW_GENOTYPE_COUNT_KEY);
}
//add back raw annotations that have specifically been requested to keep
variantAnnotations.putAll(savedRawAnnotations);

// generate a new annotated VC
final VariantContextBuilder builder = new VariantContextBuilder(vc).attributes(variantAnnotations);
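
The save, remove, and restore pattern used in `finalizeAnnotations` above can be shown in isolation with plain maps. This is a simplified sketch, not the GATK code: the class `KeepRequestedRawKeys`, the method `finalizeAttributes`, and the explicit `rawKeysToRemove` parameter are hypothetical stand-ins for the engine's internal state.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, for illustration only; not part of GATK. */
final class KeepRequestedRawKeys {
    private KeepRequestedRawKeys() {}

    static Map<String, Object> finalizeAttributes(final Map<String, Object> attributes,
                                                  final Set<String> rawKeysToRemove,
                                                  final List<String> rawKeysToKeep) {
        final Map<String, Object> result = new LinkedHashMap<>(attributes); // work on a copy

        // 1. Save the values the caller asked to keep, if they are present.
        final Map<String, Object> saved = new LinkedHashMap<>();
        for (final String key : rawKeysToKeep) {
            if (result.containsKey(key)) {
                saved.put(key, result.get(key));
            }
        }

        // 2. Remove every raw key, as happens after finalizing.
        result.keySet().removeAll(rawKeysToRemove);

        // 3. Put the saved values back. Restored keys end up at the end of the map.
        result.putAll(saved);
        return result;
    }
}
```

Saving first means the removal logic does not need to know about the keep list at all, at the cost of the restored keys moving to the end of the attribute map.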
@@ -43,8 +43,6 @@
import java.security.NoSuchAlgorithmException;
import java.util.*;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.IntUnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

@@ -930,7 +928,8 @@ public void testRawGtCountAnnotation() {
args.addReference(b37_reference_20_21)
.addVCF(reblockedGVCF)
.addOutput(output)
.add("keep-combined-raw-annotations", true)
Contributor

Note that this replaces the previous test in favor of this new one. Perhaps we should consider just adding a new test, so that we cover both 1) keeping all combined annotations, and 2) keeping only the specified raw annotations?

Contributor Author

Since your review, and based on other comments, we've made these two arguments (keep all combined and keep only specified) mutually exclusive. So I'm going to leave this test as is.

Contributor

Sorry, just to be clear, you mean you'll revert this test to its original behavior, right? In that case, will you add another test for the keep-specified behavior?

Contributor Author

Setting --keep-combined-raw-annotations to true while --keep-specified-raw-annotations is non-empty is no longer allowed because of the mutex, so I don't think I am reverting this test to the original behavior. Does that make sense? This test sets --keep-specified-raw-annotations to RawGtCount.

Contributor @samuelklee, Sep 30, 2022

Sorry, by "original behavior" I mean the current behavior of the test as it is in master. This test sets --keep-combined-raw-annotations true and hence additionally retains RAW_MQandDP.

In contrast, you modify the test in your branch so that only RAW_GT_COUNT is retained and we check that RAW_MQandDP is not. This means the code path that is responsible for retaining RAW_MQandDP is no longer run in this test, at least (and is perhaps not run for this particular input test file in any test at all). Even though the original test does not explicitly check for any particular RAW_MQandDP output, this is still an effective drop in test coverage.

Probably not anything to sweat about, but that's all that I wanted to point out in my original comment. I'll leave it up to you to decide if --keep-combined-raw-annotations true is sufficiently covered by the other tests so that we don't have to worry about replacing the original test here.

Contributor Author

Oh! Sorry, I see what you're saying now. I think I originally added this test as a way of testing RawGtCount rather than --keep-combined-raw-annotations true, so I overlooked that it did actually test the latter too. I do think the other tests sufficiently cover --keep-combined-raw-annotations true, but I very much see how that is not clear from this change.

.add(GenotypeGVCFs.KEEP_COMBINED_LONG_NAME, false)
.add(GenotypeGVCFs.KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME, GATKVCFConstants.RAW_GENOTYPE_COUNT_KEY)
.add("A", "RawGtCount");
runCommandLine(args);

@@ -942,6 +941,7 @@ public void testRawGtCountAnnotation() {
Assert.assertEquals(rawGtCount.get(0), ".");
Contributor

What's going on with the hom-ref count? I see the TODO in the RawGtCount class but I don't understand the fundamental limitation.

Incidentally, perhaps we can go ahead and expand the header-line documentation of this annotation to explicitly give the hom-ref/het/hom-var order. Maybe this is obvious to everyone else, but why not document it for dummies like me?

Contributor Author

The issue is that GenomicsDB combines RawGtCount as a sum. Reference blocks carry no RawGtCount annotation, so the hom-ref count always ends up being 0 in this case. It also seems that ExcessHet isn't actually using RawGtCount as I expected: it recalculates the counts in GenotypeUtils.computeDiploidGenotypeCounts, which doesn't seem to use RawGtCount, so I'm not sure why RawGtCount is the rawKey associated with ExcessHet. This whole thing could probably be cleaned up so that we use the same method ExcessHet does to count het and hom-var genotypes at a site, but I'm not sure why we didn't do that in the first place.

Also I added the documentation to the header line.
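
A toy model may help show why summing undercounts hom-ref genotypes. In the sketch below, each input contributes a `{homRef, het, homVar}` triple, and an input that is a reference block contributes nothing because it carries no annotation. The class `GenotypeCountSum` and the use of `null` for a missing annotation are assumptions made for this example; it models only the summing and says nothing about how the real output encodes the first entry (the test above expects "." there).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model, for illustration only; not part of GATK or GenomicsDB. */
final class GenotypeCountSum {
    private GenotypeCountSum() {}

    /**
     * Sums per-input {homRef, het, homVar} counts.
     * A null element stands for an input with no annotation (e.g. a reference block).
     */
    static int[] sum(final List<int[]> perInputCounts) {
        final int[] total = new int[3];
        for (final int[] counts : perInputCounts) {
            if (counts == null) {
                continue; // nothing to add: the hom-ref sample is never counted
            }
            for (int i = 0; i < total.length; i++) {
                total[i] += counts[i];
            }
        }
        return total;
    }

    public static void main(final String[] args) {
        // One hom-ref sample stored as a reference block, plus two het samples.
        final List<int[]> inputs = Arrays.asList(null, new int[] {0, 1, 0}, new int[] {0, 1, 0});
        System.out.println(Arrays.toString(sum(inputs))); // prints [0, 2, 0]
    }
}
```

The true hom-ref count in this toy case is 1, but the summed value is 0, which matches the limitation described in the comment above.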

Assert.assertEquals(rawGtCount.get(1), "2");
Assert.assertEquals(rawGtCount.get(2), "0");
Assert.assertFalse(vc.getAttributes().containsKey(GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY));
}

}