Changes needed for scaling up and running in Terra #1

Merged
meganshand merged 74 commits into main on Feb 2, 2024

Conversation

meganshand (Collaborator)

Two scale tests have passed with this pipeline: 1) a small genomic region of the full 15k samples, and 2) 10 samples scattered 2250 ways.

I'll find someone to review in January, but wanted to checkpoint here before I make the scientific updates we want for BGE data.

@kcibul (Collaborator) left a comment:

Also sent you a slack!

call Tasks.GatherVcfs as TotallyRadicalGatherVcfs {
input:
input_vcfs = gnarly_gvcfs,
input_vcf_fofn = write_lines(GnarlyGenotyper.output_vcf),
kcibul (Collaborator):

This seems to be a lot of the changes -- moving from tasks taking an Array[File] to a fofn produced by write_lines. I'm guessing this is an Azure-ism?

meganshand (Author):

Hmm, some of these should no longer be necessary. I cleaned some of this up.

For the ones that remain, I think we needed these to be FOFNs for two reasons: 1) localization_optional isn't implemented yet in Azure, so the inputs need to be either FOFNs or Array[String], and, more importantly, 2) the SAS token environment variable provided by Cromwell is based on where the File input is located. The FOFN File needs to have the same SAS token as the rest of the inputs, and by providing the task a FOFN rather than an Array[String] we tell Cromwell where to grab the SAS token from.
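For reference, here's a minimal sketch of the FOFN pattern (the task and names below are illustrative, not the actual pipeline code, and it assumes the write_lines FOFN lands in the same storage account as the data it lists):

```wdl
version 1.0

workflow FofnSketch {
  input {
    Array[String] vcf_paths
  }
  call GatherFromFofn {
    input:
      # Passing a FOFN (a File) instead of Array[String] lets Cromwell pick
      # the SAS token based on where this File lives.
      input_vcf_fofn = write_lines(vcf_paths)
  }
}

task GatherFromFofn {
  input {
    File input_vcf_fofn
  }
  command <<<
    set -e
    # The listed VCFs are read by path at run time; they are never localized.
    while read -r vcf; do
      echo "would gather: ${vcf}"
    done < ~{input_vcf_fofn}
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}
```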

@@ -196,9 +221,10 @@ workflow JointGenotyping {
}
}

#TODO: I suspect having write_lines in the input here is breaking call caching
kcibul (Collaborator):

Does call caching work now on Azure? Does it work the same as in GCP (conceptually?) in terms of how it establishes identity?

meganshand (Author):

It does! It works the same as GCP as far as I can tell. The suspicion in this TODO is that write_lines makes a new temp file each time, but I didn't investigate further. It could definitely be something else that caused call caching to break on this task the one time I happened to notice.

Also, at the moment call caching only works with Docker Hub (rather than Azure Container Registry), but it's mostly been consistent and working for me.
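If write_lines is indeed the culprit, one possible workaround (just a sketch, not something I've tested) would be to hoist it into a single workflow-level declaration so every consumer sees the same File value instead of a fresh temp file per call input:

```wdl
# Evaluate write_lines once and reuse the resulting File everywhere.
File gnarly_fofn = write_lines(GnarlyGenotyper.output_vcf)

call Tasks.GatherVcfs as TotallyRadicalGatherVcfs {
  input:
    input_vcf_fofn = gnarly_fofn
}
```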

@@ -369,7 +396,7 @@ workflow JointGenotyping {

# CrossCheckFingerprints takes forever on large callsets.
# We scatter over the input GVCFs to make things faster.
if (scatter_cross_check_fingerprints) {
if (defined(cross_check_fingerprint_scatter_partition)) {
kcibul (Collaborator):

I'm really curious about (a) what crosscheck fingerprints is doing for us here (are you comparing to external truth? Do you expect no samples to match? are samples on > 1 lane?) and then also (b) the reasoning for the scaling approach here.

meganshand (Author):

I answered this in Slack too, but the check here is to make sure that joint genotyping itself didn't swap samples, so we're checking the input GVCFs against the output VCF. This might be overkill, especially since we need to scatter it to work on a large number of samples. It might be better to spot check some random samples rather than truly check that every single sample wasn't swapped by our pipeline.

@@ -30,7 +30,7 @@ task CheckSamplesUnique {
runtime {
memory: "1 GiB"
disk: "10 GB"
docker: "gcscromwellacr.azurecr.io/us.gcr.io/broad-gotc-prod/python:2.7"
docker: "mshand/genomicsinthecloud:broad-gotc-prod_python_2.7"
kcibul (Collaborator):

How does this work in Azure -- is this your personal dockerhub image at the moment?

meganshand (Author):

It is. This is probably a problem. I'm using dockerhub since I wanted call caching, but for "production" this should get cleaned up. I'll leave it as is for now, but make a note that we'll need to update to an official place at some point.

--reader-threads 5 \
--merge-input-intervals \
--consolidate
--consolidate \
kcibul (Collaborator):

looks like some new GATK features -- can I read about these somewhere?

meganshand (Author):

Yes, all of these were added to get Azure streaming to work in GenomicsDB. I'm not super familiar with the details, but the PR from Louis is here: broadinstitute/gatk#8438

@@ -122,6 +125,8 @@ task ImportGVCFs {
cpu: 4
disk: disk_size + " GB"
docker: gatk_docker
azureSasEnvironmentVariable: "AZURE_STORAGE_SAS_TOKEN"
maxRetries: 1
kcibul (Collaborator):

what's the philosophy on retries in Azure? Unnecessary?

meganshand (Author):

Unfortunately this is still necessary for wide scattering tasks. It seems that we get some transient errors when kicking off many tasks at once. Typically one retry is enough to get all the shards through, especially since the errors end up spacing out when the tasks get kicked off.

--ignore-safety-checks \
--gather-type BLOCK \
--input ~{sep=" --input " input_vcfs} \
--input "~{sep="?$AZURE_STORAGE_SAS_TOKEN\" --input \"" input_vcfs}?$AZURE_STORAGE_SAS_TOKEN" \
kcibul (Collaborator):

Oooh, I see now. Are you handed a list of blob storage paths, and then Cromwell sets an environment variable for you (AZURE_STORAGE_SAS_TOKEN), which you then use to construct essentially signed URLs on the command line?

meganshand (Author):

Exactly :) GATK needs the signed URL in the command line in general, but GenomicsDB is a special case that uses the environment variable directly.
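To make the interpolation concrete, here's roughly how that sep expression renders with two made-up inputs; bash then expands $AZURE_STORAGE_SAS_TOKEN into the actual token, giving a signed URL per VCF:

```wdl
# With input_vcfs = ["https://acct.blob.core.windows.net/c/a.vcf.gz",
#                    "https://acct.blob.core.windows.net/c/b.vcf.gz"]
# the expression
#   --input "~{sep="?$AZURE_STORAGE_SAS_TOKEN\" --input \"" input_vcfs}?$AZURE_STORAGE_SAS_TOKEN"
# renders as (one line):
#   --input "https://acct.blob.core.windows.net/c/a.vcf.gz?$AZURE_STORAGE_SAS_TOKEN" --input "https://acct.blob.core.windows.net/c/b.vcf.gz?$AZURE_STORAGE_SAS_TOKEN"
```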

--ignore-safety-checks \
--gather-type BLOCK \
--input ~{sep=" --input " input_vcfs} \
--input "~{sep="?$AZURE_STORAGE_SAS_TOKEN\" --input \"" input_vcfs}?$AZURE_STORAGE_SAS_TOKEN" \
kcibul (Collaborator):

Are you worried about the length of this command line? I remember running into some problems when the command line got too long (when concatenating lots and lots of paths)... but maybe that's no longer an issue?

meganshand (Author):

That's a good point. I didn't run into this at the 15k sample scale since this is gathering over the number of shards (~2250), but I can easily see running into this as we scale up. I'll fix this.
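One possible way to keep the argument list off the command line entirely (a sketch only, assuming GATK's standard --arguments_file option and the FOFN input discussed above; output_vcf_name is a hypothetical input):

```wdl
command <<<
  set -euo pipefail

  # Build the --input arguments in a file instead of on the command line.
  while read -r vcf; do
    echo "--input ${vcf}?${AZURE_STORAGE_SAS_TOKEN}"
  done < ~{input_vcf_fofn} > gather_args.txt

  gatk GatherVcfsCloud \
    --ignore-safety-checks \
    --gather-type BLOCK \
    --arguments_file gather_args.txt \
    --output ~{output_vcf_name}
>>>
```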

}

#TODO: Make SelectVariants able to stream from https by including a VCF index input in addition to the vcf itself, for now localize
kcibul (Collaborator):

it looks like you're including the input vcf index now... is that enough to resolve this TODO (and stream the VCF instead of localizing)? Does this optimization even help on Azure?

meganshand (Author):

No, for now we're just localizing the input_vcf. I think this TODO actually belongs in GATK; SelectVariants needs a separate argument for the VCF index: broadinstitute/gatk#8568

I'll make this comment clearer.

# Handle partitioning if provided
Int partition_start = if defined(partition_index) then partition_index - partition_ammount + 1 else 1
Int partition_end = if defined(partition_index) && partition_index < gvcf_paths_length then partition_index else gvcf_paths_length
Int num_gvcfs = partition_end - partition_start + 1
Int cpu = if num_gvcfs < 32 then num_gvcfs else 32
# Compute memory to use based on the CPU count, following the pattern of
# 3.75GiB / cpu used by GCP's pricing: https://cloud.google.com/compute/pricing
Int memMb = round(cpu * 3.75 * 1024)
kcibul (Collaborator):

Google-ism? But why not just use all the memory on the machine minus some fixed amount for overhead? With this calculation you might ask Java for more memory than you have and start paging, etc.

meganshand (Author):

We do end up requesting this amount of memory for the machine, but you could end up with a larger machine than you requested, right? So it would be more optimal to set this based on the actual machine size, but the java_mem will still always be slightly lower than the requested memory.
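For illustration, the pattern in question looks roughly like this (hypothetical task; the key point is that the Java heap stays a notch below the requested memory so JVM and OS overhead fit on whatever machine actually gets provisioned):

```wdl
task PartitionedCrosscheckSketch {
  input {
    Int num_gvcfs
  }
  Int cpu = if num_gvcfs < 32 then num_gvcfs else 32
  # 3.75 GiB per CPU mirrors GCP's standard machine shapes.
  Int memMb = round(cpu * 3.75 * 1024)
  # Leave ~1 GiB of headroom under the memory request for the JVM and OS.
  Int java_mem_mb = memMb - 1024

  command <<<
    echo "would run: gatk --java-options '-Xmx~{java_mem_mb}m' CrosscheckFingerprints ..."
  >>>
  runtime {
    cpu: cpu
    memory: memMb + " MiB"
    docker: "ubuntu:20.04"
  }
}
```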

@meganshand (Author) left a comment:

I've addressed some of these comments, but my changes haven't been tested yet. The workflows team is trying to get my workspace back into a running state, and once I have that I'll run these changes to make sure I didn't break anything.

--genomicsdb-workspace-path ~{workspace_dir_name} \
--batch-size ~{batch_size} \
-L ~{interval} \
-V ~{sep=' -V ' gvcf_files} \
--sample-name-map ~{sample_name_map} \
meganshand (Author):

Yes, exactly. Unfortunately GenomicsDB uses its own az:// file paths, whereas the rest of GATK is opting to use HTTPS paths from Azure with the SAS tokens included in the paths themselves. So for this pipeline we end up passing around multiple FOFNs, some with az:// paths and others with https paths.

I cleaned this up so that the FOFNs are generated from the az:// paths; the initial setup code is a bit cleaner now and the user only needs to provide the one sample map with az:// paths.
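For what it's worth, a sketch of what generating the https FOFN from the az:// sample map could look like (the single-storage-account assumption, the account-name input, and the az:// to https mapping are illustrative, not the pipeline's actual setup code):

```wdl
task MakeHttpsFofn {
  input {
    File az_sample_map      # lines like: sample<TAB>az://container/path/to.g.vcf.gz
    String storage_account  # e.g. "myaccount"
  }
  command <<<
    set -euo pipefail
    # Rewrite az://<container>/<blob> into
    # https://<account>.blob.core.windows.net/<container>/<blob>
    cut -f2 ~{az_sample_map} \
      | sed "s#^az://#https://~{storage_account}.blob.core.windows.net/#" \
      > https_paths.fofn
  >>>
  output {
    File https_fofn = "https_paths.fofn"
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```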

meganshand merged commit 08f2209 into main on Feb 2, 2024