Added a new suite of tools for variant filtering based on site-level annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <[email protected]>
samuelklee and meganshand committed Aug 9, 2022
1 parent e818963 commit 05a7634
Showing 204 changed files with 4,964 additions and 5 deletions.
8 changes: 7 additions & 1 deletion .github/workflows/gatk-tests.yml
@@ -291,7 +291,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL' ]
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_VCF_SITE_LEVEL_FILTERING_WDL' ]
continue-on-error: true
name: WDL test ${{ matrix.wdlTest }} on cromwell
steps:
@@ -349,3 +349,9 @@ jobs:
run: |
echo "Running CNN WDL";
bash scripts/cnn_variant_cromwell_tests/run_cnn_variant_wdl.sh;
- name: "VCF_SITE_LEVEL_FILTERING_WDL_TEST"
if: ${{ matrix.wdlTest == 'RUN_VCF_SITE_LEVEL_FILTERING_WDL' }}
run: |
echo "Running VCF Site Level Filtering WDL";
bash scripts/vcf_site_level_filtering_cromwell_tests/run_vcf_site_level_filtering_wdl.sh;
1 change: 1 addition & 0 deletions build.gradle
@@ -285,6 +285,7 @@ dependencies {

implementation 'org.apache.commons:commons-lang3:3.5'
implementation 'org.apache.commons:commons-math3:3.5'
implementation 'org.hipparchus:hipparchus-stat:2.0'
implementation 'org.apache.commons:commons-collections4:4.1'
implementation 'org.apache.commons:commons-vfs2:2.0'
implementation 'org.apache.commons:commons-configuration2:2.4'
1 change: 1 addition & 0 deletions scripts/gatkcondaenv.yml.template
@@ -42,6 +42,7 @@ dependencies:
- conda-forge::matplotlib=3.2.1
- conda-forge::pandas=1.0.3
- conda-forge::typing_extensions=4.1.1 # see https://github.com/broadinstitute/gatk/issues/7800 and linked PRs
- conda-forge::dill=0.3.4 # used for pickling lambdas in TrainVariantAnnotationsModel

# core R dependencies; these should only be used for plotting and do not take precedence over core python dependencies!
- r-base=3.6.2
9 changes: 9 additions & 0 deletions scripts/vcf_site_level_filtering_cromwell_tests/README.md
@@ -0,0 +1,9 @@
# Filtering Automated Tests for WDL

**This directory is for GATK devs only**

This directory contains scripts for running the Variant Site Level Filtering WDL tests in the automated Travis build environment.

Please note that this only tests whether the WDL will complete successfully.

The test data are a "plumbing test" that uses a small portion of a 10-sample callset.
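
For local debugging, the same plumbing run can in principle be driven by hand rather than through CI. A minimal sketch, assuming a GATK checkout, an inputs JSON in which the __GATK_DOCKER__ placeholder and the runner-specific /home/runner paths have already been adjusted, and a Cromwell jar pointed to by CROMWELL_JAR (none of this is part of the committed test script):

# Sketch only: run the filtering WDL by hand instead of through the CI script below.
cd gatk/scripts/vcf_site_level_filtering_wdl
java -jar "$CROMWELL_JAR" run JointVcfFiltering.wdl \
  -i ../vcf_site_level_filtering_cromwell_tests/vcf_site_level_filtering_travis.json
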
38 changes: 38 additions & 0 deletions scripts/vcf_site_level_filtering_cromwell_tests/run_vcf_site_level_filtering_wdl.sh
@@ -0,0 +1,38 @@
#!/bin/bash -l
set -e
#cd into the directory of the script in order to use relative paths
script_path=$( cd "$(dirname "${BASH_SOURCE}")" ; pwd -P )
cd "$script_path"

WORKING_DIR=/home/runner/work/gatk

set -e
echo "Building docker image for VCF Site Level Filtering WDL tests (skipping unit tests)..."

#assume Dockerfile is in root
echo "Building docker without running unit tests... ========="
cd $WORKING_DIR/gatk

# IMPORTANT: This code is duplicated in the CNV and M2 WDL tests.
if [ ! -z "$CI_PULL_REQUEST" ]; then
HASH_TO_USE=FETCH_HEAD
sudo bash build_docker.sh -e ${HASH_TO_USE} -s -u -d $PWD/temp_staging/ -t ${CI_PULL_REQUEST};
echo "using fetch head:"$HASH_TO_USE
else
HASH_TO_USE=${CI_COMMIT}
sudo bash build_docker.sh -e ${HASH_TO_USE} -s -u -d $PWD/temp_staging/;
echo "using travis commit:"$HASH_TO_USE
fi
echo "Docker build done =========="

cd $WORKING_DIR/gatk/scripts/
sed -r "s/__GATK_DOCKER__/broadinstitute\/gatk\:$HASH_TO_USE/g" vcf_site_level_filtering_cromwell_tests/vcf_site_level_filtering_travis.json >$WORKING_DIR/vcf_site_level_filtering_travis.json
echo "JSON FILES (modified) ======="
cat $WORKING_DIR/vcf_site_level_filtering_travis.json
echo "=================="


echo "Running Filtering WDL through cromwell"
ln -fs $WORKING_DIR/gatk/scripts/vcf_site_level_filtering_wdl/JointVcfFiltering.wdl
cd $WORKING_DIR/gatk/scripts/vcf_site_level_filtering_wdl/
java -jar $CROMWELL_JAR run JointVcfFiltering.wdl -i $WORKING_DIR/vcf_site_level_filtering_travis.json
13 changes: 13 additions & 0 deletions scripts/vcf_site_level_filtering_cromwell_tests/vcf_site_level_filtering_travis.json
@@ -0,0 +1,13 @@
{
"JointVcfFiltering.gatk_docker": "__GATK_DOCKER__",
"JointVcfFiltering.vcf": ["/home/runner/work/gatk/gatk/src/test/resources/large/filteringJointVcf/test_10_samples.22.avg.vcf.gz",
"/home/runner/work/gatk/gatk/src/test/resources/large/filteringJointVcf/test_10_samples.23.avg.vcf.gz"],
"JointVcfFiltering.vcf_index": ["/home/runner/work/gatk/gatk/src/test/resources/large/filteringJointVcf/test_10_samples.22.avg.vcf.gz.tbi",
"/home/runner/work/gatk/gatk/src/test/resources/large/filteringJointVcf/test_10_samples.23.avg.vcf.gz.tbi"],
"JointVcfFiltering.sites_only_vcf": "/home/runner/work/gatk/gatk/src/test/resources/large/filteringJointVcf/test_10_samples.sites_only.vcf.gz",
"JointVcfFiltering.sites_only_vcf_index": "/home/runner/work/gatk/gatk/src/test/resources/large/filteringJointVcf/test_10_samples.sites_only.vcf.gz.tbi",
"JointVcfFiltering.basename": "test_10_samples",
"JointVcfFiltering.snp_annotations": "-A ReadPosRankSum -A FS -A SOR -A QD -A AVERAGE_TREE_SCORE -A AVERAGE_ASSEMBLED_HAPS -A AVERAGE_FILTERED_HAPS",
"JointVcfFiltering.indel_annotations": "-A MQRankSum -A ReadPosRankSum -A FS -A SOR -A QD -A AVERAGE_TREE_SCORE",
"JointVcfFiltering.model_backend": "PYTHON_IFOREST"
}
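
Because the committed JSON still contains the __GATK_DOCKER__ placeholder and runner-specific absolute paths, it is easiest to sanity-check it against the workflow before any run. A sketch using Cromwell's womtool (the womtool.jar location is an assumption, not part of this commit):

# Sketch: confirm the inputs JSON matches the workflow's declared inputs.
java -jar womtool.jar validate \
  scripts/vcf_site_level_filtering_wdl/JointVcfFiltering.wdl \
  --inputs scripts/vcf_site_level_filtering_cromwell_tests/vcf_site_level_filtering_travis.json
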
273 changes: 273 additions & 0 deletions scripts/vcf_site_level_filtering_wdl/JointVcfFiltering.wdl
@@ -0,0 +1,273 @@
version 1.0

# This is a workflow for filtering a joint callset VCF using INFO level annotations (so filtering is at the site level).
# Note that the input VCFs here may be sharded by genomic position, which may be helpful for large cohorts. The workflow
# will output the same number of shards as are input.
# This portion of the filtering pipeline will assign a SCORE INFO field annotation to each site, but does not yet apply
# the filtering threshold to the final VCF.
workflow JointVcfFiltering {
input {
Array[File] vcf
Array[File] vcf_index
File sites_only_vcf
File sites_only_vcf_index
String basename

String model_backend
File? training_python_script
File? scoring_python_script
File? hyperparameters_json

String gatk_docker
File? extract_interval_list
File? score_interval_list

String snp_annotations
String indel_annotations
File? gatk_override

String snp_resource_args = "--resource:hapmap,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz --resource:omni,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz --resource:1000G,training=true,calibration=false gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz"
String indel_resource_args = "--resource:mills,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
}

parameter_meta {
vcf: "An array of input VCFs that are one callset sharded by genomic region."
sites_only_vcf: "The full VCF callset without any genotype or sample level information."
basename: "Desired output file basename."
}

call ExtractVariantAnnotations as ExtractVariantAnnotationsSNPs {
input:
input_vcf = sites_only_vcf,
input_vcf_index = sites_only_vcf_index,
mode = "SNP",
annotations = snp_annotations,
resource_args = snp_resource_args,
basename = basename,
interval_list = extract_interval_list,
gatk_override = gatk_override,
gatk_docker = gatk_docker
}

call ExtractVariantAnnotations as ExtractVariantAnnotationsINDELs {
input:
input_vcf = sites_only_vcf,
input_vcf_index = sites_only_vcf_index,
mode = "INDEL",
annotations = indel_annotations,
resource_args = indel_resource_args,
basename = basename,
interval_list = extract_interval_list,
gatk_override = gatk_override,
gatk_docker = gatk_docker
}

call TrainVariantAnnotationModel as TrainVariantAnnotationModelSNPs {
input:
annots = ExtractVariantAnnotationsSNPs.annots,
basename = basename,
mode = "snp",
model_backend = model_backend,
python_script = training_python_script,
hyperparameters_json = hyperparameters_json,
gatk_override = gatk_override,
gatk_docker = gatk_docker
}

call TrainVariantAnnotationModel as TrainVariantAnnotationModelINDELs {
input:
annots = ExtractVariantAnnotationsINDELs.annots,
basename = basename,
mode = "indel",
model_backend = model_backend,
python_script = training_python_script,
hyperparameters_json = hyperparameters_json,
gatk_override = gatk_override,
gatk_docker = gatk_docker
}

scatter(idx in range(length(vcf))) {
call ScoreVariantAnnotations as ScoreVariantAnnotationsSNPs {
input:
vcf = vcf[idx],
vcf_index = vcf_index[idx],
basename = basename,
mode = "SNP",
model_backend = model_backend,
python_script = scoring_python_script,
annotations = snp_annotations,
extracted_training_vcf = ExtractVariantAnnotationsSNPs.extracted_training_vcf,
extracted_training_vcf_index = ExtractVariantAnnotationsSNPs.extracted_training_vcf_index,
interval_list = score_interval_list,
model_files = TrainVariantAnnotationModelSNPs.outputs,
resource_args = snp_resource_args,
gatk_override = gatk_override,
gatk_docker = gatk_docker
}

call ScoreVariantAnnotations as ScoreVariantAnnotationsINDELs {
input:
vcf = ScoreVariantAnnotationsSNPs.output_vcf,
vcf_index = ScoreVariantAnnotationsSNPs.output_vcf_index,
basename = basename,
mode = "INDEL",
model_backend = model_backend,
python_script = scoring_python_script,
annotations = indel_annotations,
extracted_training_vcf = ExtractVariantAnnotationsINDELs.extracted_training_vcf,
extracted_training_vcf_index = ExtractVariantAnnotationsINDELs.extracted_training_vcf_index,
interval_list = score_interval_list,
model_files = TrainVariantAnnotationModelINDELs.outputs,
resource_args = indel_resource_args,
gatk_override = gatk_override,
gatk_docker = gatk_docker
}
}

output {
Array[File] variant_filtered_vcf = ScoreVariantAnnotationsINDELs.output_vcf
Array[File] variant_filtered_vcf_index = ScoreVariantAnnotationsINDELs.output_vcf_index
}

}

task ExtractVariantAnnotations {
input {
String gatk_docker
File? gatk_override
File input_vcf
File input_vcf_index
String basename
String mode
String annotations
String resource_args
File? interval_list

Int memory_mb = 14000
Int command_mem = memory_mb - 1000
}
Int disk_size = ceil(size(input_vcf, "GB") + 50)
command {
set -e
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx~{command_mem}m" \
ExtractVariantAnnotations \
-V ~{input_vcf} \
-O ~{basename}.~{mode} \
~{annotations} \
~{"-L " + interval_list} \
--mode ~{mode} \
~{resource_args}
}
output {
File annots = "~{basename}.~{mode}.annot.hdf5"
File extracted_training_vcf = "~{basename}.~{mode}.vcf.gz"
File extracted_training_vcf_index = "~{basename}.~{mode}.vcf.gz.tbi"
Array[File] outputs = glob("~{basename}.~{mode}.*")
}
runtime {
docker: gatk_docker
disks: "local-disk " + disk_size + " LOCAL"
memory: memory_mb + " MiB"
}
}

task TrainVariantAnnotationModel {
input {
String gatk_docker
File? gatk_override
File annots
String basename
String mode
String model_backend
File? python_script
File? hyperparameters_json

Int memory_mb = 14000
Int command_mem = memory_mb - 1000
}
Int disk_size = ceil(size(annots, "GB") + 100)
command <<<
set -e

export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

mode=$(echo "~{mode}" | awk '{print toupper($0)}')

gatk --java-options "-Xmx~{command_mem}m" \
TrainVariantAnnotationsModel \
--annotations-hdf5 ~{annots} \
-O ~{basename} \
--model-backend ~{model_backend} \
~{"--python-script " + python_script} \
~{"--hyperparameters-json " + hyperparameters_json} \
--mode $mode

>>>
output {
Array[File] outputs = glob("~{basename}.~{mode}.*")
}
runtime {
docker: gatk_docker
disks: "local-disk " + disk_size + " LOCAL"
memory: memory_mb + " MiB"
}
}

task ScoreVariantAnnotations {
input {
String gatk_docker
File? gatk_override
File vcf
File vcf_index
String basename
String mode
String model_backend
File? python_script
String annotations
String resource_args
File extracted_training_vcf
File extracted_training_vcf_index
File? interval_list
Array[File] model_files

Int memory_mb = 16000
Int command_mem = memory_mb - 1000
}
Int disk_size = ceil(size(vcf, "GB") * 2 + 50)

command {
set -e

ln -s ~{sep=" . && ln -s " model_files} .

export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx~{command_mem}m" \
ScoreVariantAnnotations \
~{"-L " + interval_list} \
-V ~{vcf} \
-O ~{basename}.~{mode} \
--model-backend ~{model_backend} \
~{"--python-script " + python_script} \
--model-prefix ~{basename} \
~{annotations} \
--mode ~{mode} \
--resource:extracted,extracted=true ~{extracted_training_vcf} \
~{resource_args}
}
output {
File scores = "~{basename}.~{mode}.scores.hdf5"
File annots = "~{basename}.~{mode}.annot.hdf5"
File output_vcf = "~{basename}.~{mode}.vcf.gz"
File output_vcf_index = "~{basename}.~{mode}.vcf.gz.tbi"
}
runtime {
docker: gatk_docker
disks: "local-disk " + disk_size + " LOCAL"
memory: memory_mb + " MiB"
}
}
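
As the workflow's header comment notes, this pipeline stops at annotating each site with a SCORE INFO field; converting scores into FILTER statuses happens downstream. A sketch of one possible downstream step, applied to the scored test output named above (the SCORE_CUTOFF value is a hypothetical placeholder, and the production pipeline may instead rely on the calibration-sensitivity options of ScoreVariantAnnotations):

# Sketch: hard-filter one scored shard on the SCORE annotation emitted by ScoreVariantAnnotations.
SCORE_CUTOFF=0.0   # placeholder threshold, not a recommendation
gatk VariantFiltration \
  -V test_10_samples.INDEL.vcf.gz \
  --filter-expression "SCORE < ${SCORE_CUTOFF}" \
  --filter-name "LowScore" \
  -O test_10_samples.score_filtered.vcf.gz
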

@@ -68,7 +68,7 @@
* to TSV format. Using HDF5 files with {@link CreateReadCountPanelOfNormals}
* can decrease runtime by reducing time spent on IO, so this is the default output format.
* The HDF5 format contains information in the paths defined in {@link HDF5SimpleCountCollection}. HDF5 files may be viewed using
* <a href="https://support.hdfgroup.org/products/java/hdfview/">hdfview</a> or loaded in python using
* <a href="https://support.hdfgroup.org/products/java/hdfview/">hdfview</a> or loaded in Python using
* <a href="http://www.pytables.org/">PyTables</a> or <a href="http://www.h5py.org/">h5py</a>.
* The TSV format has a SAM-style header containing a read group sample name, a sequence dictionary, a row specifying the column headers contained in
* {@link SimpleCountCollection.SimpleCountTableColumn}, and the corresponding entry rows.
@@ -85,7 +85,7 @@
* Panel-of-normals file.
* This is an HDF5 file containing the panel data in the paths defined in {@link HDF5SVDReadCountPanelOfNormals}.
* HDF5 files may be viewed using <a href="https://support.hdfgroup.org/products/java/hdfview/">hdfview</a>
* or loaded in python using <a href="http://www.pytables.org/">PyTables</a> or <a href="http://www.h5py.org/">h5py</a>.
* or loaded in Python using <a href="http://www.pytables.org/">PyTables</a> or <a href="http://www.h5py.org/">h5py</a>.
* </li>
* </ul>
*
