
AoU GVS Cohort Extract wdl #7242

Merged: 37 commits from songe/combine-prepare-extract into ah_var_store, Jun 11, 2021
Conversation

@ericsong commented May 6, 2021

Our production extraction WDL, which calls:

  1. Cohort Table setup task
  2. GvsPrepareCallset
  3. GvsExtractCallset

I had to make a few small changes to GvsPrepare/Extract but I tried to keep them as minimal as possible.
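For orientation, a rough structural sketch of the wrapper is below. This is illustrative only, not the file under review: it mirrors the three steps listed above, the workflow name comes from later in this conversation, and the real sub-workflows take more inputs than are shown here.

version 1.0

# Structural sketch only -- input and call names are taken from the discussion in
# this PR; the actual sub-workflows have additional required inputs.
import "GvsPrepareCallset.wdl" as GvsPrepareCallset
import "GvsExtractCallset.wdl" as GvsExtractCallset

workflow GvsExtractCohortFromSampleNames {
  input {
    File cohort_sample_names            # one sample name per line
    String gvs_project
    String gvs_dataset
    String query_project
    String extraction_uuid
    String output_file_base_name
    String? output_gcs_dir              # optional destination bucket for the final VCFs
  }

  # 1. cohort table setup: load the requested sample names into BigQuery (task omitted here)

  # 2. build the cohort extract table
  call GvsPrepareCallset.GvsPrepareCallset {
    input:
      destination_cohort_table_prefix = extraction_uuid,
      sample_names_to_extract         = cohort_sample_names,
      data_project                    = query_project,
      default_dataset                 = gvs_dataset
  }

  # 3. run the sharded extract against the prepared table and (optionally) copy the VCFs out
  call GvsExtractCallset.GvsExtractCallset {
    input:
      output_file_base_name = output_file_base_name,
      query_project         = query_project,
      output_gcs_dir        = output_gcs_dir
  }
}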

@kcibul (Contributor) left a comment

LGTM -- a few questions, and I think we should change the name of the top-level script to something that makes it clear this is for the AoU use case (thus cutting off the need to make it generally/generically configurable, etc.).

@@ -0,0 +1,135 @@
version 1.0
Contributor

Can we give this a more distinct name from GvsExtractCallset? I don't have any great suggestions with this little coffee in me... GvsEndToEndCohortExtract? @mmorgantaylor @ahaessly any ideas?

Contributor

It looks to be a little AoU-specific; maybe we can use that for the name, "GvsAoU..."

Comment

Setting aside the comments in the WDL that reference AoU specifically.. shouldn't there be alignment on having a general purpose extraction workflow to run all of the extraction steps together, rather than as multiple separate workflows? This is my understanding of the main need that we have (but perhaps there are some other subtleties here). Is it possible to instead generalize anything else here that might be too specific (e.g. make certain tasks skippable/idempotent or generalize naming if needed)?

Contributor

I agree that having a wrapper WDL for the prepare/extract step is a good thing to have. It's not at the top of our priority list (compared to making sure that it scales, doesn't cost a fortune, etc) but it's a "good thing" for sure.

The reason I suggested naming it AoU specifically was so that we wouldn't hold this up in order to make it generally more useful outside AoU. If y'all are up for that, that would be great!

There are a few things in here that seem to be AoU specific, for example:

  • requiring a destination bucket to copy the results
  • requiring an extraction uuid
  • not using filter_set (though maybe this is just temporary)

The other thing that I think needs to be figured out is the lifecycle of the tables created by the Prepare step. The temp tables have a TTL, so that's probably ok, but the main table used by Extract probably should be cleaned up. In addition, creating the extract table in Prepare is 90% of the cost of this pipeline. If that succeeds but one of the Extract scatters fails, you have to rerun the whole pipeline. At small scale you might not care, but at larger scale you definitely don't want to do this. We need to think through how best to use Cromwell call caching, or disable it and do some form of short-circuiting in the task if the output already exists, or something else entirely.
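To make the short-circuit idea concrete, one hypothetical shape for it (not part of this PR; output_gcs_dir comes from the workflow, everything else is illustrative) is a task that probes for its own output before doing the expensive work:

version 1.0

# Hypothetical sketch only: skip the expensive extract for a shard whose VCF has
# already been delivered to the output bucket on a previous attempt.
task ExtractShardWithShortCircuit {
  input {
    String output_gcs_dir      # assumed to have no trailing slash
    String output_vcf_name
  }
  command <<<
    set -e
    # `gsutil -q stat` exits 0 only if the object already exists
    if gsutil -q stat "~{output_gcs_dir}/~{output_vcf_name}"; then
      echo "Output already present in the bucket; skipping the extract for this shard."
      gsutil cp "~{output_gcs_dir}/~{output_vcf_name}" "~{output_vcf_name}"
    else
      # ... run the real extract here, then deliver the result ...
      touch "~{output_vcf_name}"    # placeholder for the real extract command
      gsutil cp "~{output_vcf_name}" "~{output_gcs_dir}/"
    fi
  >>>
  output {
    File output_vcf = output_vcf_name
  }
  runtime {
    docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
  }
}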

Author

I'm also a little hesitant to name it something AoU-specific because my intent is for this WDL to be generalized enough for your use cases as well, such that we're both working on and using the same WDL. That way, the work done to improve things like table lifecycle and Cromwell call caching will automatically be picked up on our side when we update the WDL snapshot.

Given that, all those things you mentioned are definitely changes we can make. Though, I might try to put those changes in a follow up PR so I can wrap this one up soon.

Author

I renamed it to GvsExtractCohortFromSampleNames so it looks a little less like GvsExtractCallset and better captures our intent for the WDL.

Contributor

I'm all for making this more generally useful if you are -- I'll go make some comments along these lines then. What I want to avoid is it being really only usable by AoU, yet not saying it's primarily for AoU.


output {
String fq_cohort_extract_table = read_string("fq_cohort_extract_table.txt")
Contributor

My WDL is rusty, but do you have to go through a file here? Can you just do something like

String fq_cohort_extract_table = "~{fq_destination_dataset}.~{destination_cohort_table_name}"

or something similar?

Author

that works! thanks for the suggestion

Author

Actually, I ended up backtracking on this because I added some more bash interpolation logic to generate the output string. Unless you know of a way to directly map a bash env variable to a Cromwell output?
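For reference, the usual way to do that is what the code above already does: write the value to a file inside the command block and read it back in the outputs. A minimal sketch, with hypothetical task and input names:

version 1.0

# Minimal sketch: surface a bash-computed value as a task output by writing it to a
# file in the command block and reading it back with read_string().
task EmitCohortTableName {
  input {
    String fq_destination_dataset
    String destination_cohort_table_name
  }
  command <<<
    set -e
    # arbitrary bash logic can build the value...
    FQ_TABLE="~{fq_destination_dataset}.~{destination_cohort_table_name}"
    # ...and it is handed back to Cromwell through a file
    echo "${FQ_TABLE}" > fq_cohort_extract_table.txt
  >>>
  output {
    String fq_cohort_extract_table = read_string("fq_cohort_extract_table.txt")
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}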

@@ -36,6 +36,10 @@ workflow GvsPrepareCallset {
docker = docker_final
}

output {
Contributor

good improvement here


String output_file_base_name

String? fq_filter_set_table
Contributor

In the AoU case, I'm thinking this would always be required

data_project = "", # unused if fq_filter_set_* args are given or filtering is off
query_project = query_project,
default_dataset = "", # unused if fq_filter_set_* args are given or filtering is off
filter_set_name = "", # unused if fq_filter_set_* args are given or filtering is off
Contributor

Does AoU want unfiltered data? That would be very surprising.

Author

I don't think so. I hadn't gotten around to adding the filter steps yet, which is the main reason these are missing. Do you know what the process for that will be? Do we just point the filter set arguments to tables that you generate, or is it something that is created on a per-extract basis?

@ericsong (Author)

@kcibul @ahaessly @mmorgantaylor do you know the best way to upload multi-file WDLs to Agora? I can't import the current WDL as-is because the imports refer to relative local files, which Agora can't resolve. My workaround has been to rewrite the import paths so that they point to raw GitHub files. I've been doing this manually for testing, but it won't work with our automated GitHub -> Agora method creation tool (without some code changes to read the WDL and rewrite the imports).

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ -n "${OUTPUT_GCS_DIR}" ]; then
Contributor

Since these files are also task outputs, if someone runs with output_gcs_dir they'll have two copies of the data, one in the cromwell delocalized results of this task and another in the output bucket? Do you have ideas about how to handle that?

Author

By the cromwell delocalized results, do you mean the files in the "execution directory" (as Terra calls it)? In my mind, I had always thought of those files as "temporary" as in they're usually created within a TTL folder. I'm realizing now that this may be an incorrect assumption that I carried over from my time working with the long reads methods team.

I feel like that might be OK though? Users of the WDL that do not want the duplication can drop the output_gcs_dir argument. Users that want the output_gcs_dir argument likely only care about the VCF output files so they can clean up the execution directory as needed.

String gvs_extraction_destination_dataset
String gvs_extraction_temp_tables_dataset
String extraction_uuid
String output_gcs_dir
Contributor

should be optional


call GvsPrepareCallset.GvsPrepareCallset {
input:
destination_cohort_table_name = extraction_uuid,
Contributor

What happens if the same dataset is used for gvs_extraction_destination_dataset and gvs_extraction_cohorts_dataset? In both cases the table name is just the UUID so I think things will error out. If that's the case maybe use the uuid as a suffix instead of the sole table name

FROM
\`~{gvs_dataset}.sample_info\`
WHERE
sample_name IN " > create_cohort.sql
Contributor

What's the limit for how many samples can be queried this way? In other parts of GVS we found that performance drops off after > ~1000 entries. I don't think performance is a concern here, but there are limitations on query size, etc. If there's a limit here, would be good to know what it is

Contributor

FWIW... looks like the total query size limit is 1 MB so with a wild guess at 20 bytes per sample name that's 50k samples. Seems reasonable

Author

Thanks for checking on that for me. Agree that it should work but it was good to point out. I might be able to load this from a file which probably has better performance and scalability.
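A hypothetical sketch of the load-from-a-file idea is below; the table and dataset names here are illustrative, not taken from this PR. The point is to stage the names into a BigQuery table and JOIN against sample_info rather than inlining them into the query text.

version 1.0

# Hypothetical sketch: stage the sample names into a BigQuery table with `bq load`,
# then resolve the cohort with a JOIN instead of building a large IN (...) list.
task LoadCohortSampleNames {
  input {
    File sample_names_file        # one sample name per line
    String query_project
    String gvs_dataset            # "project.dataset" holding sample_info
    String temp_dataset_name      # dataset (in query_project) for the staged names -- illustrative
    String extraction_uuid
  }
  command <<<
    set -e
    TEMP_TABLE="~{temp_dataset_name}.sample_names_~{extraction_uuid}"

    # load the plain-text list as a one-column table
    bq load --project_id="~{query_project}" --source_format=CSV \
      "${TEMP_TABLE}" "~{sample_names_file}" "sample_name:STRING"

    # resolve the cohort by joining, rather than interpolating names into the SQL
    bq query --project_id="~{query_project}" --use_legacy_sql=false \
      --destination_table="~{temp_dataset_name}.cohort_~{extraction_uuid}" \
      "SELECT si.* FROM \`~{gvs_dataset}.sample_info\` si
       JOIN \`${TEMP_TABLE}\` n USING (sample_name)"
  >>>
  output {
    String cohort_table = "~{temp_dataset_name}.cohort_~{extraction_uuid}"
  }
  runtime {
    docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
  }
}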

@kcibul (Contributor) commented May 19, 2021

@ericsong on the Agora question -- we've been using Dockstore since it pulls right from GitHub and these are public WDLs. You'll need to add this WDL to the .dockstore.yml file, and if you want your branch available you'll add that as well (it should be fairly obvious when you look at the file). I haven't tried imports, but I think they work well -- better than Agora.
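For reference, a .dockstore.yml entry for the new workflow would look roughly like the sketch below; the descriptor path and branch filter are assumptions, not taken from this PR.

# Illustrative .dockstore.yml entry only -- the actual path and branches in the repo may differ.
version: 1.2
workflows:
  - name: GvsExtractCohortFromSampleNames
    subclass: WDL
    primaryDescriptorPath: /scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl
    filters:
      branches:
        - ah_var_store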

@kcibul (Contributor) commented May 19, 2021

Thinking out loud... if PrepareCallset could take a file of sample names as an input, would we no longer need this step of making the sample table? ExtractCohort can already take that -- Eric added it a while back, right?

@ericsong (Author) commented May 23, 2021

thanks for the review @kcibul. I made some changes accordingly.

re: PrepareCallset file of sample names. That would be nice! It would make this workflow simpler and it also simplifies the access requirements for PrepareCallset.

re: Dockstore. We actually ruled this out because Terra says that the definition of a method configuration can change automatically if it's updated in Dockstore. That can be useful, but it adds a security risk, since a compromised Dockstore could change the definition of the production AoU extraction WDL, which runs with highly elevated permissions. We already have a script that creates method configurations from GitHub, so I can probably add something a little hacky to resolve relative imports to the raw GitHub files they refer to.

@ahaessly (Contributor) left a comment

This looks good. Are you going to try to incorporate Kris's changes?

@ericsong (Author)

yep! I should have that wrapped up soon. testing the updated WDL now

@gatk-bot

Travis reported job failures from build 34321
Failures in the following jobs:

Test type: integration; JDK: openjdk11; Job ID: 34321.12 (logs)

@ericsong (Author)

@kcibul @ahaessly I made another round of changes to incorporate the new Prepare/Extract

@ahaessly (Contributor) left a comment

a few changes for defaulting values


Boolean do_not_filter_override
String? fq_filter_set_info_table
String? fq_filter_set_site_table
String? fq_filter_set_tranches_table
Contributor

Anything that is assigned a default value in the outer workflow should not be optional here.
Also, filter_set_name should not be optional

Boolean do_not_filter_override = false
String? filter_set_name
Contributor

Filter set name should not be optional. This will be named something that indicates which release of the filtering model the extract should use.

Author

should it be set even if filtering is not going to be used? for example, if do_not_filter_override is true, the argument will never be used

Contributor

oh, i didn't catch that. It's fine to be optional.

String output_file_base_name

Boolean do_not_filter_override = false
String? filter_set_name
Contributor

should not be optional. or if optional, should default to a valid filter set name

String? filter_set_name
String fq_filter_set_info_table = "~{gvs_project}.~{gvs_dataset}.filter_set_info"
String fq_filter_set_site_table = "~{gvs_project}.~{gvs_dataset}.filter_set_sites"
String fq_filter_set_tranches_table = "~{gvs_project}.~{gvs_dataset}.filter_set_tranches"
Contributor

I don't think you need to set these values or pass them to the workflow since they are defaulted in the workflow. If you want to allow the caller to set them, you should make them optional and then you would pass them into the workflow.

fq_petvet_dataset = "~{gvs_project}.~{gvs_dataset}",
fq_sample_mapping_table = "~{gvs_project}.~{gvs_dataset}.sample_info",
fq_temp_table_dataset = fq_gvs_extraction_temp_tables_dataset,
fq_destination_dataset = fq_gvs_extraction_destination_dataset
Contributor

maybe make these parameters in the definition of the task optional, since they are assigned defaults and then you won't have to pass the hard coded values

Author

So these do have defaults in GvsPrepareCallset, but the defaults don't work for me, which is why I overrode them here. Is that what you're suggesting?

Contributor

ah, well then maybe add a comment that you are explicitly overriding the default values

destination_cohort_table_prefix = extraction_uuid,
sample_names_to_extract = cohort_sample_names,
data_project = query_project,
default_dataset = gvs_dataset, # unused if fq_* args are given
Contributor

even though it's unused here, it still needs to be passed into the task

String gvs_dataset
String fq_gvs_extraction_cohorts_dataset
String fq_gvs_extraction_destination_dataset
String fq_gvs_extraction_temp_tables_dataset
Contributor

You can make these 2 optional since they are defaulted in the called workflow.
Also, I don't see fq_gvs_extraction_cohorts_dataset used anywhere. Should it be removed?

Author

good catch. removed.

regarding the two optionals though, do you mean the arguments on lines 16 and 17? those are passed directly into Prepare

Contributor

Yes. If you make fq_gvs_extraction_destination_dataset and fq_gvs_extraction_temp_tables_dataset optional (and still pass them into the prepare), then the caller does not have to supply the input. If an optional that has not been set is passed to a workflow/task, it is treated as None (and the default value will therefore be used). The only reason not to make them optional is if you know the defaults won't work for you.

@ahaessly (Contributor) left a comment

👍

set -e
if [ ~{has_service_account_file} = 'true' ]; then
gcloud auth activate-service-account --key-file='~{service_account_json}'
fi

LASTMODIFIED=$(bq show --location=US --format=json ~{dataset_table} | python3 -c "import sys, json; print(json.load(sys.stdin)['lastModifiedTime']);")
# bq needs the project name to be separated from the dataset by a colon
DATASET_TABLE_COLON=$(echo ~{dataset_table} | sed 's/\./:/')
Contributor

I have a branch where I fixed this a different way, by just taking the table name prefix and using dataset.tablename. I also had an issue where the show command was returning a bunch of text prepended to the result, which I fixed by creating a .bigqueryrc. I will have a PR for this soon and will tag you to review it.
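For reference, a sketch of that workaround as it would sit in the command block above (assuming the prepended text is bq's one-time first-run initialization banner, which goes to stdout when no config file exists; the bq show line is the one quoted above):

# Sketch of the .bigqueryrc workaround described above: create the config file up
# front so the first `bq` invocation in a fresh container doesn't print its
# initialization text into the JSON we want to parse.
echo "" > ~/.bigqueryrc
LASTMODIFIED=$(bq show --location=US --format=json ~{dataset_table} | python3 -c "import sys, json; print(json.load(sys.stdin)['lastModifiedTime']);")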

@ericsong merged commit b8b5f39 into ah_var_store on Jun 11, 2021
@ericsong deleted the songe/combine-prepare-extract branch on June 11, 2021 18:02