Pipeline | Date Updated | Documentation Authors | Questions or Feedback |
---|---|---|---|
AzureJointGenotyping | February, 2024 | Kaylee Mathews & Megan Shand | Please file issues in GitHub |
The AzureJointGenotyping workflow is an open-source, cloud-optimized pipeline that implements joint variant calling and filtering using using GATK and Microsoft Azure. The pipeline calls the Variant Extract-Train-Score (VETS) subworkflow to score variant annotations.
The AzureJointGenotyping pipeline can be configured to run using one of the following GATK joint genotyping methods:
- GenotypeGVCFs (default method) performs joint genotyping on GVCF files stored in GenomicsDB and pre-called with HaplotypeCaller.
- GnarlyGenotyper performs scalable, “quick and dirty” joint genotyping on a set of GVCF files stored in GenomicsDB and pre-called with HaplotypeCaller.
The pipeline takes in a sample map file listing GVCF files produced by HaplotypeCaller or DRAGEN version 3.7.8 in GVCF mode and creates a filtered VCF file (with index) containing genotypes for all samples present in the input VCF files. All sites that are present in the input VCF file are retained. Filtered sites are annotated as such in the FILTER field. If you are new to VCF files, see the file type specification.
Note The pipeline is adapted from the WARP JointGenotyping workflow, but is not subject to the same testing requirements as WARP pipelines.
The AzureJointGenotyping pipeline can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.
The AzureJointGenotyping workflow is adapted from the WARP JointGenotyping workflow and requires all the same inputs with the following exceptions:
Parameter name | Description | Type |
---|---|---|
targets_interval_list | Required input. | File |
snp_recalibration_tranche_values | Removed from pipeline. | Array[String] |
indel_recalibration_tranche_values | Removed from pipeline. | Array[String] |
indel_recalibration_annotation_values | Removed from pipeline. | Array[String] |
vqsr_snp_filter_level | Removed from pipeline. | Float |
vqsr_indel_filter_level | Removed from pipeline. | Float |
snp_vqsr_downsampleFactor | Removed from pipeline. | Int |
use_allele_specific_annotations | Removed from pipeline. | Boolean |
run_vets | Removed from pipeline. | Boolean |
scatter_cross_check_fingerprints | Removed from pipeline; scattering during fingerprinting is determined by cross_check_fingerprint_scatter_partition . |
Boolean |
cross_check_fingerprint_scatter_partition | Optional integer specifying the number of samples to include in each partition for scattering during fingerprinting; recommended value is “1000”; fingerprinting will be performed without scattering if no value is passed to the pipeline. | Int |
sample_name_map | Path to file containing the sample names (first column; example: “NA12878”) and the Azure cloud path of the individual GVCF files (second column; example: “az://sc-74cc28aa-fa7c-4712-8b3e-7eb784790bec@lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/path_to_file.NA12878.vcf.gz”). | String |
Overall, the AzureJointGenotyping workflow:
- Splits the input interval list and imports GVCF files.
- Performs joint genotyping using GATK GenotypeGVCFs (default) or GnarlyGenotyper.
- Creates single site-specific VCF and index files.
- Creates and applies a variant filtering model using VETS.
- Collects variant calling metrics.
- Checks fingerprints.
The AzureJointGenotyping workflow imports individual “tasks,” also written in WDL script.
The AzureJointGenotyping workflow is adapted from the WARP JointGenotyping workflow and calls all the same tasks with the following exceptions:
Task | Tool | Software | Description |
---|---|---|---|
CheckSamplesUniqueAndMakeFofn as CheckSamplesUniqueAndMakeFofn | bash | bash | Renamed from CheckSamplesUnique ; checks that there are more than 50 unique samples in sample_name_map and generates necessary sample map files for Azure. |
JointVcfFiltering as TrainAndApplyVETS | ExtractVariantAnnotations, TrainVariantAnnotationsModel, ScoreVariantAnnotations | GATK | Default method for variant filtering; calls the JointVcfFiltering.wdl subworkflow to extract variant-level annotations, trains a model for variant scoring, and scores variants. |
IndelsVariantRecalibrator | VariantRecalibrator | GATK | Removed from the pipeline. |
SNPsVariantRecalibratorCreateModel | VariantRecalibrator | GATK | Removed from the pipeline. |
SNPsVariantRecalibrator as SNPsVariantRecalibratorScattered | VariantRecalibrator | GATK | Removed from the pipeline. |
GatherTranches as SNPGatherTranches | GatherTranches | GATK | Removed from the pipeline. |
SNPsVariantRecalibrator as SNPsVariantRecalibratorClassic | VariantRecalibrator | GATK | Removed from the pipeline. |
ApplyRecalibration | ApplyVQSR | GATK | Removed from the pipeline. |
GetFingerprintingIntervalIndices | IntervalListTools | GATK | If cross_check_fingerprint_scatter_partition is defined, gets and sorts indices for fingerprint intervals; otherwise the task is skipped. |
GatherVcfs as GatherFingerprintingVcfs | GatherVcfsCloud | GATK | If cross_check_fingerprint_scatter_partition is defined, compiles the fingerprint VCF files; otherwise the task is skipped. |
SelectFingerprintSiteVariants | SelectVariants | GATK | If cross_check_fingerprint_scatter_partition is defined, selects variants from the fingerprint VCF file; otherwise the task is skipped. |
PartitionSampleNameMap | bash | bash | Removed from the pipeline. |
CrossCheckFingerprint as CrossCheckFingerprintsScattered | CrosscheckFingerprints | GATK | If cross_check_fingerprint_scatter_partition is defined, checks fingerprints for the VCFs in the scattered partitions and produces a metrics file; otherwise the task is skipped. |
GatherPicardMetrics as GatherFingerprintingMetrics | bash | bash | If cross_check_fingerprint_scatter_partition is defined, combines the fingerprint metrics files into a single metrics file; otherwise the task is skipped. |
CrossCheckFingerprint as CrossCheckFingerprintSolo | CrosscheckFingerprints | GATK | If cross_check_fingerprint_scatter_partition is not defined, checks fingerprints for the single VCF file and produces a metrics file; otherwise the task is skipped. |
The AzureJointGenotyping workflow is adapted from the WARP JointGenotyping workflow and outputs all the same files.