Support for FORMAT/FT VQSLod Filtering and cohort-wide LowQual filter #7248

kcibul · 2021-05-11T20:48:21Z

The overarching goal of this PR is to reduce or eliminate the effect of cohort size on how variants are filtered for a specific sample. For example, the filtering of genotypes for a GIAB sample should be the same whether you build a VCF of the full cohort and then subset to the GIAB sample (expensive) or build a callset containing only the GIAB sample. This is good for users, since their results won't get "better" just because their VCF includes more samples that they don't care about.

Calculate and store the LowQual filter as part of Filter Set creation
Use the LowQual filter from the filter set rather than recalculating it from QUALapprox at extract time
Add a flag (default is true) to perform VQSLod filtering at the sample/genotype level

Before/After results showing minimal impact are at:
https://docs.google.com/spreadsheets/d/1LUrssKHBCwIzbA_9M3b01Ul0urMbOOmv4Z703dHwiyg/edit#gid=398306713

gatk-bot · 2021-05-11T21:28:38Z

Travis reported job failures from build 34151
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
unit	openjdk11	34151.13	logs
unit	openjdk8	34151.3	logs

RoriCremer · 2021-05-17T13:49:52Z

scripts/variantstore/tieout/add_max_as_vqslod.py

@@ -19,33 +19,30 @@

        parts = line.split("\t")

-        if ("ExcessHet" in parts[6]):
+        # strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves


probably just because I am too new here, but seeing an example in the comments would be really helpful, esp for the transformation that starts on line 37 where we seem to be putting "new" data into parts[6]?

lbergelson · 2021-05-17T20:13:15Z

scripts/variantstore/tieout/add_max_as_vqslod.py

@@ -19,33 +19,30 @@

        parts = line.split("\t")

-        if ("ExcessHet" in parts[6]):
+        # strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves


Be sure to remember to reindex any files you run this on...

Yep -- this is only used from a benchmarking pipeline (see scripts/variantstore/tieout/GIAB_TIEOUT.md) and it runs tabix to generate the index

ahaessly · 2021-05-18T13:19:46Z

scripts/variantstore/tieout/add_max_as_vqslod.py

+                if (parts[6] == "PASS" or parts[6] == "."):
+                    parts[6] = ft
+                else:
+                    parts[6] = parts[6] + "," + ft


this will always be true:
if ft != "PASS" or ft != ".":
I think you want if ft != "PASS" and ft != ".":

great catch!

reworked this, and re-ran to make sure it didn't change the tieout results (which it did not)
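
A minimal sketch of the pitfall discussed in this thread, for illustration only. The helper names (`always_true`, `is_real_filter`, `merge_ft_into_filter`) and the filter value `SOME_FT` are hypothetical and are not taken from add_max_as_vqslod.py. For any single value of `ft`, the test `ft != "PASS" or ft != "."` holds, because no string equals both values at once; `and` expresses the intended check. The comma join simply mirrors the diff quoted above.

```python
def always_true(ft):
    # Buggy form: no string equals both "PASS" and ".", so one side always holds.
    return ft != "PASS" or ft != "."


def is_real_filter(ft):
    # Intended form: true only when ft is neither "PASS" nor ".".
    return ft != "PASS" and ft != "."


def merge_ft_into_filter(site_filter, ft):
    """Hypothetical helper: fold a genotype-level FT value into the site FILTER value."""
    if not is_real_filter(ft):
        # FT carries no filter, so the site-level value is left untouched.
        return site_filter
    if site_filter == "PASS" or site_filter == ".":
        # No existing site filter: the FT value replaces the placeholder.
        return ft
    # Existing site filter: append, using the same comma join as the quoted diff.
    return site_filter + "," + ft


if __name__ == "__main__":
    for ft in ("PASS", ".", "SOME_FT"):
        print(ft, always_true(ft), is_real_filter(ft))
    print(merge_ft_into_filter("ExcessHet", "SOME_FT"))
```

The point of separating `is_real_filter` from the merge is that the two conditions in the thread (one on `ft`, one on `parts[6]`) are easy to confuse; only the check on `ft` needs `and`.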

ahaessly · 2021-05-18T16:01:07Z

src/main/java/org/broadinstitute/hellbender/tools/variantdb/nextgen/ExtractCohortEngine.java

+                    // the genotype is passed, nothing to do here as non-filtered is the default
+                } else {
+                    // get the minimum (worst) vqslod for all SNP non-Yay sites, and apply the filter
+                    Optional<Double> snpMin =


we are actually getting the max. if that is what we want, update the comment and variable name for this and indels

good catch -- updating

ahaessly

just 2 comments that need some fixes

mmorgantaylor

looks good, some comments mostly for explanation

mmorgantaylor · 2021-05-18T14:23:36Z

src/main/java/org/broadinstitute/hellbender/tools/variantdb/nextgen/ExtractCohort.java

@@ -1,7 +1,6 @@
 package org.broadinstitute.hellbender.tools.variantdb.nextgen;

-import htsjdk.variant.vcf.VCFHeader;
-import htsjdk.variant.vcf.VCFHeaderLine;
+import htsjdk.variant.vcf.*;


is this standard for java? I was taught always be explicit with imports

mmorgantaylor · 2021-05-18T14:24:22Z

src/main/java/org/broadinstitute/hellbender/tools/variantdb/nextgen/ExtractCohort.java

@@ -89,6 +89,13 @@
    )
    private boolean disableGnarlyGenotyper = true;

+    @Argument(
+            fullName = "vqslod-filter-genotypes",
+            doc = "Should VQSLOD filtering be applied at the genotype level",


can you add what the alternative is? i.e., if this is false, filtering will be applied at.... site level?

mmorgantaylor · 2021-05-18T14:39:01Z

src/main/java/org/broadinstitute/hellbender/tools/variantdb/nextgen/ExtractCohortEngine.java

+
+                    // get the minimum (worst) vqslod for all INDEL non-Yay sites
+                    Optional<Double> indelMin =
+                            nonRefAlleles.stream().filter(a -> a.length() != ref.length()).map(a -> remappedVqsLodMap.get(a)).filter(Objects::nonNull).max(Double::compareTo);


this calls .max but we're getting the minimum? is that a mistake or is the logic reversed because we're filtering? if this is a straightforward java thing, ignore me, but a comment about this logic might be helpful

outdated variable names/comment. fixed

mmorgantaylor · 2021-05-18T14:42:31Z

src/main/java/org/broadinstitute/hellbender/tools/variantdb/nextgen/ExtractCohortEngine.java

+                } else {
+                    // get the minimum (worst) vqslod for all SNP non-Yay sites, and apply the filter
+                    Optional<Double> snpMin =
+                            nonRefAlleles.stream().filter(a -> a.length() == ref.length()).map(a -> remappedVqsLodMap.get(a)).filter(Objects::nonNull).max(Double::compareTo);


(same question of min vs max as below for the indelMin)
but also - is len(a) == len(ref) a reliable SNP identifier? what if e.g. ref is ACGC and allele is AGGG ? (if this is How It's Done and beyond the scope of this PR, fine)

yeah this is somewhat "how it's done"... what you describe is an MNP but would actually be represented as two SNPs (one at position 2, and one at 4). It's possible it could be even uglier, but in that case we still would use the SNP model to threshold VQSR
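
The selection discussed in the two threads above can be sketched in Python. This is a rough illustration, not the GATK implementation: the function name `max_vqslod_by_type` is hypothetical, and a plain dict stands in for `remappedVqsLodMap`. It mirrors the Java stream logic quoted in the diffs: alternate alleles with the same length as the reference are scored against the SNP model, all others against the INDEL model, alleles without a score are skipped, and the maximum is taken within each class.

```python
def max_vqslod_by_type(ref, non_ref_alleles, vqslod_by_allele):
    """Return (snp_max, indel_max); each is None when no allele of that type has a score.

    ref              -- reference allele string
    non_ref_alleles  -- alternate allele strings for one genotype
    vqslod_by_allele -- dict of allele string -> VQSLOD (stand-in for remappedVqsLodMap)
    """
    snp_scores = []
    indel_scores = []
    for allele in non_ref_alleles:
        score = vqslod_by_allele.get(allele)
        if score is None:
            # Same effect as filter(Objects::nonNull) in the Java stream.
            continue
        if len(allele) == len(ref):
            # Same length as ref: treated as a SNP (this includes MNP-like alleles).
            snp_scores.append(score)
        else:
            indel_scores.append(score)
    return max(snp_scores, default=None), max(indel_scores, default=None)


if __name__ == "__main__":
    scores = {"C": -1.5, "G": 2.0, "AT": 0.3}
    print(max_vqslod_by_type("A", ["C", "G", "AT", "T"], scores))
```

Under this length-based rule, the reviewer's example (ref `ACGC`, allele `AGGG`) lands in the SNP class, which matches the reply that such a case would still be thresholded with the SNP model.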

kcibul added 3 commits May 11, 2021 16:41

first pass at FT

2e11c63

added FT support to python script, annotate-only mode for CohortExtract

7470061

WIP

5a93d31

kcibul added 3 commits May 13, 2021 13:50

support for LowQual from site filter set info

8da93fe

fixed bug with alleles

f76ad59

PR cleanup

6b5346b

kcibul marked this pull request as ready for review May 14, 2021 15:34

kcibul changed the title ~~Kc ft~~ Support for FORMAT/FT VQSLod Filtering and cohort-wide LowQual filter May 14, 2021

kcibul requested review from ahaessly and mmorgantaylor May 14, 2021 15:38

RoriCremer reviewed May 17, 2021

lbergelson reviewed May 17, 2021

ahaessly reviewed May 18, 2021

ahaessly approved these changes May 18, 2021

addressing PR comments, adding logging for Read API calls

ba78a64

mmorgantaylor approved these changes May 18, 2021

more PR comments

082c0e0

kcibul merged commit c3dd67f into ah_var_store May 18, 2021

kcibul deleted the kc_ft branch May 18, 2021 23:54

This was referenced Mar 17, 2023

lb merge gvs branch #8248

Closed

testing something, please ignore #8251

Closed
