Changes needed for scaling up and running in Terra (#1)
* testing dockerhub for terra

* dockers

* workaround for disk bug

* making cpu input for CrosscheckFingerprint

* trying fixed vm size

* new machine type

* some changes

* trying something

* trying different family

* another vm

* switching all VMs to E

* removing family size

* testing gendb from gatk

updating gatk docker

adding sas token

fixing header vcf

header index

* adding max stream size env variable

* trying new branch from genDB

* changing to Azure dockers

* adding maxRetries

* switching back to DockerHub for call caching

* testing gather

* fixing

* streaming gather of vcfs

* more testing of gather

* fixing fingerprint issues

* new version of GenomicsDB

* fixing crosscheck fingerprints

* comment

* cleanup

* typo

* sas token encoding/decoding

* updating docker

* typo

* sas

* fixing scattered fingerprinting

* don't localize selectVariants for Fingerprinting

* fixing partitioning

* trying new sas feature on import only

* debugging

* removing debug print line

* reverting back for now

* forgot a change

* changing input array to fofn

* fixing header_vcf

* splitting fofn reading

* sas environment variable

* debugging

* trying sas for gatk

* adding fofns

* fixing quotes

* quotes?

* update gatk for gather

* testing new jar

* question mark?

* fixing fingerprinting

* fix

* typo

* trying new jar for fingerprinting

* no more max retries

* removing temp inputs

* removing debugging outputs from fingerprinting

* adding back retries

* more retries

* more maxRetries

* switching to gatk 4.5.0.0

* removing test wdl

* untested - addressing comments

* changing default scatter_mode to necessary option for 15k samples

* updating example inputs

* fixed typo

* fixing regex

* swapped url order

* fixing index files

* trying more sed to fix index

* fixing sed

* clean up sed in fingerprinting now that we generate gvcf list from sample_map
meganshand committed Feb 2, 2024
1 parent 9bc3741 commit 08f2209
Showing 5 changed files with 202 additions and 168 deletions.
2 changes: 2 additions & 0 deletions AzureJointGenotyping.22samples.DataTable.tsv
@@ -0,0 +1,2 @@
sample_set_id axiomPoly_resource_vcf axiomPoly_resource_vcf_index dbsnp_vcf dbsnp_vcf_index eval_interval_list haplotype_database hapmap_resource_vcf hapmap_resource_vcf_index huge_disk indel_filter_level indel_recalibration_annotation_values indel_recalibration_tranche_values large_disk medium_disk mills_resource_vcf mills_resource_vcf_index omni_resource_vcf omni_resource_vcf_index one_thousand_genomes_resource_vcf one_thousand_genomes_resource_vcf_index ref_dict ref_fasta ref_fasta_index sample_map_file scatter_mode small_disk snp_filter_level snp_recalibration_annotation_values snp_recalibration_tranche_values snp_vqsr_downsampleFactor targets_interval_list top_level_scatter_count unpadded_intervals_file
Public_Streaming https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/exome_evaluation_regions.v1.interval_list https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Homo_sapiens_assembly38.haplotype_database.txt https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/hapmap_3.3.hg38.vcf.gz https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/hapmap_3.3.hg38.vcf.gz.tbi 2000 95 "[""AS_FS"",""AS_ReadPosRankSum"",""AS_MQRankSum"",""AS_QD"",""AS_SOR""]" "[""100.0"",""99.95"",""99.9"",""99.5"",""99.0"",""97.0"",""96.0"",""95.0"",""94.0"",""93.5"",""93.0"",""92.0"",""91.0"",""90.0""]" 500 200 https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/1000G_omni2.5.hg38.vcf.gz https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Homo_sapiens_assembly38.dict https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Homo_sapiens_assembly38.fasta https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/Homo_sapiens_assembly38.fasta.fai https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/PublicSampleStreamingMap.txt 100 99.7 "[""AS_QD"",""AS_MQRankSum"",""AS_ReadPosRankSum"",""AS_FS"",""AS_MQ"",""AS_SOR""]" "[""100.0"",""99.95"",""99.9"",""99.8"",""99.7"",""99.6"",""99.5"",""99.4"",""99.3"",""99.0"",""98.0"",""97.0"",""90.0""]" 10 https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list https://lzb25a77f5eadb0fa72a2ae7.blob.core.windows.net/sc-097653a7-ddba-49b0-95a0-3ee6b00ac217/inputs/mirror_datasetpublicbroadref/hg38/v0/exome_calling_regions.v1.interval_list
38 changes: 0 additions & 38 deletions AzureJointGenotyping.22samples.inputs.json

This file was deleted.

84 changes: 44 additions & 40 deletions AzureJointGenotyping.wdl
@@ -63,7 +63,8 @@ workflow JointGenotyping {
Boolean use_gnarly_genotyper = false
Boolean use_allele_specific_annotations = true
Boolean cross_check_fingerprints = true
Boolean scatter_cross_check_fingerprints = false
# If CrossCheckFingerprints should be scattered, how many gvcfs per shard? Typically set to 1000; leave unset to run a single unscattered CrossCheckFingerprints.
Int? cross_check_fingerprint_scatter_partition
}

Boolean allele_specific_annotations = !use_gnarly_genotyper && use_allele_specific_annotations
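The new optional input replaces the scatter_cross_check_fingerprints flag: setting it (typically to 1000) both enables scattering and sets the shard size. For orientation, the two branches it drives later in this file look like this, with the bodies elided:

if (defined(cross_check_fingerprint_scatter_partition)) {
  # scattered fingerprinting: one CrossCheckFingerprintsScattered call per partition of gvcfs
}
if (!defined(cross_check_fingerprint_scatter_partition)) {
  # unscattered: a single CrossCheckFingerprintSolo call over the whole callset
}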
@@ -73,8 +74,6 @@ workflow JointGenotyping {

Array[Array[String]] sample_name_map_lines_t = transpose(sample_name_map_lines)
Array[String] sample_names_from_map = sample_name_map_lines_t[0]
Array[File] gvcf_paths_from_map = sample_name_map_lines_t[1]
Array[File] gvcf_index_paths_from_map = sample_name_map_lines_t[2]

# Make a 2.5:1 interval number to samples in callset ratio interval list.
# We allow overriding the behavior by specifying the desired number of vcfs
@@ -90,11 +89,11 @@ workflow JointGenotyping {
Int unbounded_scatter_count = select_first([top_level_scatter_count, round(unbounded_scatter_count_scale_factor * num_gvcfs)])
Int scatter_count = if unbounded_scatter_count > 2 then unbounded_scatter_count else 2 #I think weird things happen if scatterCount is 1 -- IntervalListTools is noop?
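# Worked example of the sizing above (illustrative only, assuming the 2.5:1 ratio
# described in the comment; the actual unbounded_scatter_count_scale_factor is set
# earlier in this file and may differ):
#   num_gvcfs = 22  ->  unbounded_scatter_count = round(2.5 * 22) = 55
#   scatter_count  = 55 (greater than 2, so the minimum-of-2 clamp does not apply)
# Supplying top_level_scatter_count overrides the computed value via select_first.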
#call Tasks.CheckSamplesUnique {
# input:
# sample_name_map = sample_name_map,
# sample_num_threshold = 10
#}
call Tasks.CheckSamplesUniqueAndMakeFofn as CheckSamplesUniqueAndMakeFofn {
input:
sample_name_map = sample_name_map,
sample_num_threshold = 5
}

call Tasks.SplitIntervalList {
input:
@@ -117,9 +116,10 @@ workflow JointGenotyping {
# the Hellbender (GATK engine) team!
call Tasks.ImportGVCFs {
input:
sample_names = sample_names_from_map,
gvcf_files = gvcf_paths_from_map,
gvcf_index_files = gvcf_index_paths_from_map,
sample_name_map = sample_name_map,
# Need to provide an example header in order to stream from Azure, so use the first gvcf.
header_vcf = CheckSamplesUniqueAndMakeFofn.header_vcf,
header_vcf_index = CheckSamplesUniqueAndMakeFofn.header_vcf_index,
interval = unpadded_intervals[idx],
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
@@ -153,15 +153,13 @@ workflow JointGenotyping {
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
ref_dict = ref_dict,
dbsnp_vcf = dbsnp_vcf,
dbsnp_vcf = dbsnp_vcf
}
}

Array[File] gnarly_gvcfs = GnarlyGenotyper.output_vcf

call Tasks.GatherVcfs as TotallyRadicalGatherVcfs {
input:
input_vcfs = gnarly_gvcfs,
input_vcf_fofn = write_lines(GnarlyGenotyper.output_vcf),
output_vcf_name = callset_name + "." + idx + ".gnarly.vcf.gz",
disk_size = large_disk
}
@@ -196,9 +194,10 @@ workflow JointGenotyping {
}
}

#TODO: I suspect having write_lines in the input here is breaking call caching
call Tasks.GatherVcfs as SitesOnlyGatherVcf {
input:
input_vcfs = HardFilterAndMakeSitesOnlyVcf.sites_only_vcf,
input_vcf_fofn = write_lines(HardFilterAndMakeSitesOnlyVcf.sites_only_vcf),
output_vcf_name = callset_name + ".sites_only.vcf.gz",
disk_size = medium_disk
}
@@ -336,9 +335,10 @@ workflow JointGenotyping {
# For small callsets we can gather the VCF shards and then collect metrics on it.
# HUGE disk was failing in Azure...
if (is_small_callset) {

call Tasks.GatherVcfs as FinalGatherVcf {
input:
input_vcfs = ApplyRecalibration.recalibrated_vcf,
input_vcf_fofn = write_lines(ApplyRecalibration.recalibrated_vcf),
output_vcf_name = callset_name + ".vcf.gz",
disk_size = large_disk
}
@@ -369,7 +369,7 @@ workflow JointGenotyping {

# CrossCheckFingerprints takes forever on large callsets.
# We scatter over the input GVCFs to make things faster.
if (scatter_cross_check_fingerprints) {
if (defined(cross_check_fingerprint_scatter_partition)) {
call Tasks.GetFingerprintingIntervalIndices {
input:
unpadded_intervals = unpadded_intervals,
@@ -384,37 +384,41 @@ workflow JointGenotyping {

call Tasks.GatherVcfs as GatherFingerprintingVcfs {
input:
input_vcfs = vcfs_to_fingerprint,
input_vcf_fofn = write_lines(vcfs_to_fingerprint),
output_vcf_name = callset_name + ".gathered.fingerprinting.vcf.gz",
disk_size = medium_disk
}

call Tasks.SelectFingerprintSiteVariants {
input:
input_vcf = GatherFingerprintingVcfs.output_vcf,
input_vcf_index = GatherFingerprintingVcfs.output_vcf_index,
base_output_name = callset_name + ".fingerprinting",
haplotype_database = haplotype_database,
disk_size = medium_disk
}

call Tasks.PartitionSampleNameMap {
input:
sample_name_map = sample_name_map,
line_limit = 1000
}

scatter (idx in range(length(PartitionSampleNameMap.partitions))) {
# Partition the gvcfs into groups of cross_check_fingerprint_scatter_partition, with any remainder going into the last partition
# Subsetting happens in the CrossCheckFingerprints task
Array[Int] partitions = range((num_gvcfs+cross_check_fingerprint_scatter_partition)/cross_check_fingerprint_scatter_partition)

Array[File] files_in_partition = read_lines(PartitionSampleNameMap.partitions[idx])
scatter (idx in range(length(partitions))) {
Int parition_scaled = (partitions[idx] + 1) * cross_check_fingerprint_scatter_partition

call Tasks.CrossCheckFingerprint as CrossCheckFingerprintsScattered {
input:
gvcf_paths = files_in_partition,
vcf_paths = vcfs_to_fingerprint,
sample_name_map = sample_name_map,
gvcf_paths_fofn = CheckSamplesUniqueAndMakeFofn.gvcf_paths_fofn,
gvcf_index_paths_fofn = CheckSamplesUniqueAndMakeFofn.gvcf_index_paths_fofn,
vcf_paths_fofn = write_lines([SelectFingerprintSiteVariants.output_vcf]),
vcf_index_paths_fofn = write_lines([SelectFingerprintSiteVariants.output_vcf_index]),
sample_names_from_map_fofn = write_lines(sample_names_from_map),
partition_index = parition_scaled,
partition_ammount = cross_check_fingerprint_scatter_partition,
gvcf_paths_length = num_gvcfs,
haplotype_database = haplotype_database,
output_base_name = callset_name + "." + idx,
scattered = true
scattered = true,
disk = small_disk
}
}

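A worked example of the partitioning arithmetic above (WDL integer division truncates), assuming num_gvcfs = 2500 and cross_check_fingerprint_scatter_partition = 1000; the actual subsetting of each shard happens inside the CrossCheckFingerprint task, which is not shown in this diff:

# partitions = range((2500 + 1000) / 1000) = range(3) = [0, 1, 2]
# idx 0: parition_scaled = (0 + 1) * 1000 = 1000
# idx 1: parition_scaled = (1 + 1) * 1000 = 2000
# idx 2: parition_scaled = (2 + 1) * 1000 = 3000
# Each shard receives parition_scaled as partition_index, 1000 as partition_ammount,
# and gvcf_paths_length = 2500, so the last shard covers only the 500-gvcf remainder.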
@@ -426,19 +430,19 @@ workflow JointGenotyping {
}
}

if (!scatter_cross_check_fingerprints) {

scatter (line in sample_name_map_lines) {
File gvcf_paths = line[1]
}
if (!defined(cross_check_fingerprint_scatter_partition)) {

call Tasks.CrossCheckFingerprint as CrossCheckFingerprintSolo {
input:
gvcf_paths = gvcf_paths,
vcf_paths = ApplyRecalibration.recalibrated_vcf,
sample_name_map = sample_name_map,
gvcf_paths_fofn = CheckSamplesUniqueAndMakeFofn.gvcf_paths_fofn,
gvcf_index_paths_fofn = CheckSamplesUniqueAndMakeFofn.gvcf_index_paths_fofn,
vcf_paths_fofn = write_lines(ApplyRecalibration.recalibrated_vcf),
vcf_index_paths_fofn = write_lines(ApplyRecalibration.recalibrated_vcf_index),
sample_names_from_map_fofn = write_lines(sample_names_from_map),
gvcf_paths_length = num_gvcfs,
haplotype_database = haplotype_database,
output_base_name = callset_name
output_base_name = callset_name,
disk = small_disk
}
}
