Changes needed for scaling up and running in Terra #1

Merged
meganshand merged 74 commits into main on Feb 2, 2024

Conversation

meganshand (Collaborator)

Two scale tests have passed with this pipeline: 1) a small genomic region of the full 15k samples, and 2) 10 samples scattered 2250 ways.

I'll find someone to review in January, but wanted to checkpoint here before I make the scientific updates we want for BGE data.

@kcibul (Collaborator) left a comment:

Also sent you a slack!

call Tasks.GatherVcfs as TotallyRadicalGatherVcfs {
input:
input_vcfs = gnarly_gvcfs,
input_vcf_fofn = write_lines(GnarlyGenotyper.output_vcf),
kcibul (Collaborator):

This seems to be a lot of the changes -- moving from tasks taking an Array[File] to a fofn produced by write_lines. I'm guessing this is an Azure-ism?

meganshand (Author):

Hmm, some of these should no longer be necessary. I cleaned some of this up.

For the ones that remain, I think we needed these to be FOFNs for two reasons: 1) localization_optional isn't implemented yet in Azure, so the inputs need to be either FOFNs or Array[String], and, more importantly, 2) the SAS token environment variable provided by Cromwell is based on where the File input is located. The FOFN File needs to have the same SAS token as the rest of the inputs, and by providing the task a FOFN rather than an Array[String] we tell Cromwell where to grab the SAS token from.
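For reference, here's a minimal sketch of the FOFN pattern (the task and names below are illustrative, not the actual pipeline code, and it assumes the write_lines FOFN lands in the same storage account as the data it lists):

```wdl
version 1.0

workflow FofnSketch {
  input {
    Array[String] vcf_paths
  }
  call GatherFromFofn {
    input:
      # Passing a FOFN (a File) instead of Array[String] lets Cromwell pick
      # the SAS token based on where this File lives.
      input_vcf_fofn = write_lines(vcf_paths)
  }
}

task GatherFromFofn {
  input {
    File input_vcf_fofn
  }
  command <<<
    set -e
    # The listed VCFs are read by path at run time; they are never localized.
    while read -r vcf; do
      echo "would gather: ${vcf}"
    done < ~{input_vcf_fofn}
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}
```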

@@ -196,9 +221,10 @@ workflow JointGenotyping {
}
}

#TODO: I suspect having write_lines in the input here is breaking call caching
kcibul (Collaborator):

Does call caching work now on Azure? Does it work the same as in GCP (conceptually?) in terms of how it establishes identity?

meganshand (Author):

It does! It works the same as GCP as far as I can tell. The suspicion in this TODO is that write_lines makes a new temp file each time, but I didn't investigate further. It could definitely be something else that caused call caching to break on this task the one time I happened to notice.

Also, at the moment call caching only works with Docker Hub (rather than Azure Container Registry), but it's mostly been consistent and working for me.
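If write_lines is indeed the culprit, one possible workaround (just a sketch, not something I've tested) would be to hoist it into a single workflow-level declaration so every consumer sees the same File value instead of a fresh temp file per call input:

```wdl
# Evaluate write_lines once and reuse the resulting File everywhere.
File gnarly_fofn = write_lines(GnarlyGenotyper.output_vcf)

call Tasks.GatherVcfs as TotallyRadicalGatherVcfs {
  input:
    input_vcf_fofn = gnarly_fofn
}
```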

@@ -369,7 +396,7 @@ workflow JointGenotyping {

# CrossCheckFingerprints takes forever on large callsets.
# We scatter over the input GVCFs to make things faster.
if (scatter_cross_check_fingerprints) {
if (defined(cross_check_fingerprint_scatter_partition)) {
kcibul (Collaborator):

I'm really curious about (a) what crosscheck fingerprints is doing for us here (are you comparing to external truth? Do you expect no samples to match? are samples on > 1 lane?) and then also (b) the reasoning for the scaling approach here.

meganshand (Author):

I answered this in Slack too, but the check here is to make sure that joint genotyping itself didn't swap samples, so we're checking the input GVCFs against the output VCF. This might be overkill, especially since we need to scatter it to work on a large number of samples. It might be better to spot check some random samples rather than truly check that every single sample wasn't swapped by our pipeline.

@@ -30,7 +30,7 @@ task CheckSamplesUnique {
runtime {
memory: "1 GiB"
disk: "10 GB"
docker: "gcscromwellacr.azurecr.io/us.gcr.io/broad-gotc-prod/python:2.7"
docker: "mshand/genomicsinthecloud:broad-gotc-prod_python_2.7"
kcibul (Collaborator):

How does this work in Azure -- is this your personal dockerhub image at the moment?

meganshand (Author):

It is. This is probably a problem. I'm using dockerhub since I wanted call caching, but for "production" this should get cleaned up. I'll leave it as is for now, but make a note that we'll need to update to an official place at some point.

--reader-threads 5 \
--merge-input-intervals \
--consolidate
--consolidate \
kcibul (Collaborator):

looks like some new GATK features -- can I read about these somewhere?

meganshand (Author):

Yes, all of these were added to get Azure streaming to work in GenomicsDB. I'm not super familiar with the details, but the PR from Louis is here: broadinstitute/gatk#8438

@@ -122,6 +125,8 @@ task ImportGVCFs {
cpu: 4
disk: disk_size + " GB"
docker: gatk_docker
azureSasEnvironmentVariable: "AZURE_STORAGE_SAS_TOKEN"
maxRetries: 1
kcibul (Collaborator):

what's the philosophy on retries in Azure? Unnecessary?

meganshand (Author):

Unfortunately this is still necessary for wide scattering tasks. It seems that we get some transient errors when kicking off many tasks at once. Typically one retry is enough to get all the shards through, especially since the errors end up spacing out when the tasks get kicked off.

--ignore-safety-checks \
--gather-type BLOCK \
--input ~{sep=" --input " input_vcfs} \
--input "~{sep="?$AZURE_STORAGE_SAS_TOKEN\" --input \"" input_vcfs}?$AZURE_STORAGE_SAS_TOKEN" \
kcibul (Collaborator):

Oooh, I see now. Are you handed a list of blob storage paths, and then Cromwell sets an environment variable for you (AZURE_STORAGE_SAS_TOKEN), which you then use to construct essentially signed URLs on the command line?

meganshand (Author):

Exactly :) GATK needs the signed URL in the command line in general, but GenomicsDB is a special case that uses the environment variable directly.
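To make the interpolation concrete, here's roughly how that sep expression renders with two made-up inputs; bash then expands $AZURE_STORAGE_SAS_TOKEN into the actual token, giving a signed URL per VCF:

```wdl
# With input_vcfs = ["https://acct.blob.core.windows.net/c/a.vcf.gz",
#                    "https://acct.blob.core.windows.net/c/b.vcf.gz"]
# the expression
#   --input "~{sep="?$AZURE_STORAGE_SAS_TOKEN\" --input \"" input_vcfs}?$AZURE_STORAGE_SAS_TOKEN"
# renders as (one line):
#   --input "https://acct.blob.core.windows.net/c/a.vcf.gz?$AZURE_STORAGE_SAS_TOKEN" --input "https://acct.blob.core.windows.net/c/b.vcf.gz?$AZURE_STORAGE_SAS_TOKEN"
```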

--ignore-safety-checks \
--gather-type BLOCK \
--input ~{sep=" --input " input_vcfs} \
--input "~{sep="?$AZURE_STORAGE_SAS_TOKEN\" --input \"" input_vcfs}?$AZURE_STORAGE_SAS_TOKEN" \
kcibul (Collaborator):

Are you worried about the length of this command line? I remember running into some problems when the command line got too long (when concatenating lots and lots of paths)... but maybe that's no longer an issue?

meganshand (Author):

That's a good point. I didn't run into this at the 15k sample scale since this is gathering over the number of shards (~2250), but I can easily see running into this as we scale up. I'll fix this.
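One possible way to keep the argument list off the command line entirely (a sketch only, assuming GATK's standard --arguments_file option and the FOFN input discussed above; output_vcf_name is a hypothetical input):

```wdl
command <<<
  set -euo pipefail

  # Build the --input arguments in a file instead of on the command line.
  while read -r vcf; do
    echo "--input ${vcf}?${AZURE_STORAGE_SAS_TOKEN}"
  done < ~{input_vcf_fofn} > gather_args.txt

  gatk GatherVcfsCloud \
    --ignore-safety-checks \
    --gather-type BLOCK \
    --arguments_file gather_args.txt \
    --output ~{output_vcf_name}
>>>
```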

}

#TODO: Make SelectVariants able to stream from https by including a VCF index input in addition to the vcf itself, for now localize
kcibul (Collaborator):

it looks like you're including the input vcf index now... is that enough to resolve this TODO (and stream the VCF instead of localizing)? Does this optimization even help on Azure?

meganshand (Author):

No, for now we're just localizing the input_vcf. I think this TODO actually belongs in GATK; SelectVariants needs a separate argument for the VCF index: broadinstitute/gatk#8568

I'll make this comment clearer.

# Handle partitioning if provided
Int partition_start = if defined(partition_index) then partition_index - partition_ammount + 1 else 1
Int partition_end = if defined(partition_index) && partition_index < gvcf_paths_length then partition_index else gvcf_paths_length
Int num_gvcfs = partition_end - partition_start + 1
Int cpu = if num_gvcfs < 32 then num_gvcfs else 32
# Compute memory to use based on the CPU count, following the pattern of
# 3.75GiB / cpu used by GCP's pricing: https://cloud.google.com/compute/pricing
Int memMb = round(cpu * 3.75 * 1024)
kcibul (Collaborator):

Google-ism? But why not just use all the memory on the machine minus some fixed amount for overhead? With this calculation you might ask Java for more memory than you have and start paging, etc.

meganshand (Author):

We do end up requesting this amount of memory for the machine, but you could end up with a larger machine than you requested, right? So it would be more optimal to set this based on the actual machine size, but the java_mem will still always be slightly lower than the requested memory.
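For illustration, the pattern in question looks roughly like this (hypothetical task; the key point is that the Java heap stays a notch below the requested memory so JVM and OS overhead fit on whatever machine actually gets provisioned):

```wdl
task PartitionedCrosscheckSketch {
  input {
    Int num_gvcfs
  }
  Int cpu = if num_gvcfs < 32 then num_gvcfs else 32
  # 3.75 GiB per CPU mirrors GCP's standard machine shapes.
  Int memMb = round(cpu * 3.75 * 1024)
  # Leave ~1 GiB of headroom under the memory request for the JVM and OS.
  Int java_mem_mb = memMb - 1024

  command <<<
    echo "would run: gatk --java-options '-Xmx~{java_mem_mb}m' CrosscheckFingerprints ..."
  >>>
  runtime {
    cpu: cpu
    memory: memMb + " MiB"
    docker: "ubuntu:20.04"
  }
}
```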

@meganshand (Author) left a comment:

I've addressed some of these comments, but my changes haven't been tested yet. The workflows team is trying to get my workspace back into a running state, and once I have that I'll run these changes to make sure I didn't break anything.

--genomicsdb-workspace-path ~{workspace_dir_name} \
--batch-size ~{batch_size} \
-L ~{interval} \
-V ~{sep=' -V ' gvcf_files} \
--sample-name-map ~{sample_name_map} \
meganshand (Author):

Yes, exactly. Unfortunately GenomicsDB uses its own az:// file paths, whereas the rest of GATK is opting to use HTTPS paths from Azure with the SAS tokens included in the paths themselves. So for this pipeline we end up passing around multiple FOFNs, some with az:// paths and others with https paths.

I cleaned this up so that the FOFNs are generated from the az:// paths; the initial setup code is a bit cleaner now and the user only needs to provide the one sample map with az:// paths.
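For what it's worth, a sketch of what generating the https FOFN from the az:// sample map could look like (the single-storage-account assumption, the account-name input, and the az:// to https mapping are illustrative, not the pipeline's actual setup code):

```wdl
task MakeHttpsFofn {
  input {
    File az_sample_map      # lines like: sample<TAB>az://container/path/to.g.vcf.gz
    String storage_account  # e.g. "myaccount"
  }
  command <<<
    set -euo pipefail
    # Rewrite az://<container>/<blob> into
    # https://<account>.blob.core.windows.net/<container>/<blob>
    cut -f2 ~{az_sample_map} \
      | sed "s#^az://#https://~{storage_account}.blob.core.windows.net/#" \
      > https_paths.fofn
  >>>
  output {
    File https_fofn = "https_paths.fofn"
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```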

meganshand merged commit 08f2209 into main on Feb 2, 2024