Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for FORMAT/FT VQSLod Filtering and cohort-wide LowQual filter #7248

Merged
merged 8 commits into from
May 18, 2021

Conversation

kcibul
Copy link
Contributor

@kcibul kcibul commented May 11, 2021

The overarching goal of this PR is to reduce or eliminate the effect of cohort size on the filtering of variants for a specific sample. As an example this means the filtering for the genotypes for a GIAB sample should be the same whether you make a VCF of the full cohort and then subset to the GIAB sample (expensive) or you just make a callset with just the GIAB sample. This is good for users since their results won't get "better" with more samples that they don't care about in their VCF.

  • calculate and store LowQual filter as a part of Filter Set creation
  • use LowQual filter from filter set rather than recalculating it from QUALapprox at extract time
  • flag (default is true) to perform VQSLod filtering at the sample/genotype level

Before/After results showing minimal impact are at:
https://docs.google.com/spreadsheets/d/1LUrssKHBCwIzbA_9M3b01Ul0urMbOOmv4Z703dHwiyg/edit#gid=398306713

@gatk-bot
Copy link

gatk-bot commented May 11, 2021

Travis reported job failures from build 34151
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 34151.13 logs
unit openjdk8 34151.3 logs

@kcibul kcibul marked this pull request as ready for review May 14, 2021 15:34
@kcibul kcibul changed the title Kc ft Support for FORMAT/FT VQSLod Filtering and cohort-wide LowQual filter May 14, 2021
@@ -19,33 +19,30 @@

parts = line.split("\t")

if ("ExcessHet" in parts[6]):
# strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably just because I am too new here, but seeing an example in the comments would be really helpful, esp for the transformation that starts on line 37 where we seem to be putting "new" data into parts[6]?

@@ -19,33 +19,30 @@

parts = line.split("\t")

if ("ExcessHet" in parts[6]):
# strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be sure to remember to reindex any files you run this on...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep -- this is only used from a benchmarking pipeline (see scripts/variantstore/tieout/GIAB_TIEOUT.md) and it runs tabix to generate the index

if (parts[6] == "PASS" or parts[6] == "."):
parts[6] = ft
else:
parts[6] = parts[6] + "," + ft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will always be true:
if ft != "PASS" or ft != ".":
if think you want if ft != "PASS" and ft != ".":

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworked this, and re-ran to make it didn't change the tieout results (which it did not)

// the genotype is passed, nothing to do here as non-filtered is the default
} else {
// get the minimum (worst) vqslod for all SNP non-Yay sites, and apply the filter
Optional<Double> snpMin =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are actually getting the max. if that is what we want, update the comment and variable name for this and indels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch -- updating

Copy link
Contributor

@ahaessly ahaessly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just 2 comments that need some fixes

Copy link
Member

@mmorgantaylor mmorgantaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good some comments mostly for explanation

@@ -1,7 +1,6 @@
package org.broadinstitute.hellbender.tools.variantdb.nextgen;

import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFHeaderLine;
import htsjdk.variant.vcf.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this standard for java? I was taught always be explicit with imports

@@ -89,6 +89,13 @@
)
private boolean disableGnarlyGenotyper = true;

@Argument(
fullName = "vqslod-filter-genotypes",
doc = "Should VQSLOD filtering be applied at the genotype level",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add what the alternative is? i.e., if this is false, filtering will be applied at.... site level?


// get the minimum (worst) vqslod for all INDEL non-Yay sites
Optional<Double> indelMin =
nonRefAlleles.stream().filter(a -> a.length() != ref.length()).map(a -> remappedVqsLodMap.get(a)).filter(Objects::nonNull).max(Double::compareTo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this calls .max but we're getting the minimum? is that a mistake or is the logic reversed because we're filtering? if this is a straightforward java thing, ignore me, but a comment about this logic might be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated variables names/comment. fixed

} else {
// get the minimum (worst) vqslod for all SNP non-Yay sites, and apply the filter
Optional<Double> snpMin =
nonRefAlleles.stream().filter(a -> a.length() == ref.length()).map(a -> remappedVqsLodMap.get(a)).filter(Objects::nonNull).max(Double::compareTo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same question of min vs max as below for the indelMin)
but also - is len(a) == len(ref) a reliable SNP identifier? what if e.g. ref is ACGC and allele is AGGG ? (if this is How It's Done and beyond the scope of this PR, fine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is somewhat "how it's done"... what you describe is an MNP but would actually be represented as two SNPs (one at position 2, and one at 4). It's possible it could be even uglier, but in that case we still would use the SNP model to threshold VQSR

@kcibul kcibul merged commit c3dd67f into ah_var_store May 18, 2021
@kcibul kcibul deleted the kc_ft branch May 18, 2021 23:54
This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants