
AoU GVS Cohort Extract wdl #7242

Merged: 37 commits from songe/combine-prepare-extract into ah_var_store, Jun 11, 2021
Conversation

@ericsong commented May 6, 2021

Our production extraction WDL, which calls:

  1. Cohort Table setup task
  2. GvsPrepareCallset
  3. GvsExtractCallset

I had to make a few small changes to GvsPrepare/Extract but I tried to keep them as minimal as possible.
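For orientation, a rough structural sketch of the wrapper is below. This is illustrative only, not the file under review: it mirrors the three steps listed above, the workflow name comes from later in this conversation, and the real sub-workflows take more inputs than are shown here.

version 1.0

# Structural sketch only -- input and call names are taken from the discussion in
# this PR; the actual sub-workflows have additional required inputs.
import "GvsPrepareCallset.wdl" as GvsPrepareCallset
import "GvsExtractCallset.wdl" as GvsExtractCallset

workflow GvsExtractCohortFromSampleNames {
  input {
    File cohort_sample_names            # one sample name per line
    String gvs_project
    String gvs_dataset
    String query_project
    String extraction_uuid
    String output_file_base_name
    String? output_gcs_dir              # optional destination bucket for the final VCFs
  }

  # 1. cohort table setup: load the requested sample names into BigQuery (task omitted here)

  # 2. build the cohort extract table
  call GvsPrepareCallset.GvsPrepareCallset {
    input:
      destination_cohort_table_prefix = extraction_uuid,
      sample_names_to_extract         = cohort_sample_names,
      data_project                    = query_project,
      default_dataset                 = gvs_dataset
  }

  # 3. run the sharded extract against the prepared table and (optionally) copy the VCFs out
  call GvsExtractCallset.GvsExtractCallset {
    input:
      output_file_base_name = output_file_base_name,
      query_project         = query_project,
      output_gcs_dir        = output_gcs_dir
  }
}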

@kcibul (Contributor) left a comment

LGTM -- a few questions, and I think we should change the name of the top-level script to something that makes it clear this is for the AoU use case (thus cutting off the need to make it generally/generically configurable, etc.).

@@ -0,0 +1,135 @@
version 1.0
Contributor

Can we give this a more distinct name from GvsExtractCallset? I don't have any great suggestions with this little coffee in me... GvsEndToEndCohortExtract? @mmorgantaylor @ahaessly any ideas?

Contributor

It looks to be a little AoU-specific; maybe we can use that for the name, "GvsAoU..."

Comment

Setting aside the comments in the WDL that reference AoU specifically.. shouldn't there be alignment on having a general purpose extraction workflow to run all of the extraction steps together, rather than as multiple separate workflows? This is my understanding of the main need that we have (but perhaps there are some other subtleties here). Is it possible to instead generalize anything else here that might be too specific (e.g. make certain tasks skippable/idempotent or generalize naming if needed)?

Contributor

I agree that having a wrapper WDL for the prepare/extract step is a good thing to have. It's not at the top of our priority list (compared to making sure that it scales, doesn't cost a fortune, etc) but it's a "good thing" for sure.

The reason I suggested naming it AoU specifically was so that we wouldn't hold this up in order to make it generally more useful outside AoU. If y'all are up for that, that would be great!

There are a few things in here that seem to be AoU specific, for example:

  • requiring a destination bucket to copy the results
  • requiring an extraction uuid
  • not using filter_set (though maybe this is just temporary)

The other thing that I think needs to be figured out is the lifecycle of the tables created by the Prepare step. The temp tables have a TTL, so that's probably ok, but the main table used by Extract probably should be cleaned up. In addition, creating the extract table in Prepare is 90% of the cost of this pipeline. If that succeeds but one of the Extract scatters fails, you have to rerun the whole pipeline. At small scale you might not care, but at larger scale you definitely don't want to do this. We need to think through how best to use Cromwell call caching, or disable it and do some form of short-circuiting in the task if the output already exists, or something else entirely.
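To make the short-circuit idea concrete, one hypothetical shape for it (not part of this PR; output_gcs_dir comes from the workflow, everything else is illustrative) is a task that probes for its own output before doing the expensive work:

version 1.0

# Hypothetical sketch only: skip the expensive extract for a shard whose VCF has
# already been delivered to the output bucket on a previous attempt.
task ExtractShardWithShortCircuit {
  input {
    String output_gcs_dir      # assumed to have no trailing slash
    String output_vcf_name
  }
  command <<<
    set -e
    # `gsutil -q stat` exits 0 only if the object already exists
    if gsutil -q stat "~{output_gcs_dir}/~{output_vcf_name}"; then
      echo "Output already present in the bucket; skipping the extract for this shard."
      gsutil cp "~{output_gcs_dir}/~{output_vcf_name}" "~{output_vcf_name}"
    else
      # ... run the real extract here, then deliver the result ...
      touch "~{output_vcf_name}"    # placeholder for the real extract command
      gsutil cp "~{output_vcf_name}" "~{output_gcs_dir}/"
    fi
  >>>
  output {
    File output_vcf = output_vcf_name
  }
  runtime {
    docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
  }
}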

Author

I'm also a little hesitant to name it something AoU-specific because my intent is for this WDL to be generalized enough for your use cases as well, such that we're both working on and using the same WDL. That way, the work done to improve things like table lifecycle and Cromwell call caching will automatically be picked up on our side when we update the WDL snapshot.

Given that, all those things you mentioned are definitely changes we can make. Though, I might try to put those changes in a follow up PR so I can wrap this one up soon.

Author

I renamed it to GvsExtractCohortFromSampleNames so it looks a little less like GvsExtractCallset and better captures our intent for the WDL.

Contributor

I'm all for making this more generally useful if you are -- I'll go make some comments along these lines then. What I want to avoid is it being really only usable by AoU, yet not saying it's primarily for AoU.


output {
String fq_cohort_extract_table = read_string("fq_cohort_extract_table.txt")
Contributor

My WDL is rusty, but do you have to go through a file here? Can you just do something like

String fq_cohort_extract_table = "~{fq_destination_dataset}.~{destination_cohort_table_name}"

or something similar?

Author

that works! thanks for the suggestion

Author

Actually, I ended up backtracking on this because I added some more bash interpolation logic to generate the output string. Unless you know of a way to directly map a bash env variable to a Cromwell output?
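For reference, the usual way to do that is what the code above already does: write the value to a file inside the command block and read it back in the outputs. A minimal sketch, with hypothetical task and input names:

version 1.0

# Minimal sketch: surface a bash-computed value as a task output by writing it to a
# file in the command block and reading it back with read_string().
task EmitCohortTableName {
  input {
    String fq_destination_dataset
    String destination_cohort_table_name
  }
  command <<<
    set -e
    # arbitrary bash logic can build the value...
    FQ_TABLE="~{fq_destination_dataset}.~{destination_cohort_table_name}"
    # ...and it is handed back to Cromwell through a file
    echo "${FQ_TABLE}" > fq_cohort_extract_table.txt
  >>>
  output {
    String fq_cohort_extract_table = read_string("fq_cohort_extract_table.txt")
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}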

@@ -36,6 +36,10 @@ workflow GvsPrepareCallset {
docker = docker_final
}

output {
Contributor

good improvement here


String output_file_base_name

String? fq_filter_set_table
Contributor

In the AoU case, I'm thinking this would always be required

data_project = "", # unused if fq_filter_set_* args are given or filtering is off
query_project = query_project,
default_dataset = "", # unused if fq_filter_set_* args are given or filtering is off
filter_set_name = "", # unused if fq_filter_set_* args are given or filtering is off
Contributor

Does AoU want unfiltered data? That would be very surprising.

Author

I don't think so. I hadn't gotten around to adding the filter steps yet, which is the main reason these are missing. Do you know what the process for that will be? Do we just point the filter set arguments to tables that you generate, or is it something that is created on a per-extract basis?

@ericsong (Author)

@kcibul @ahaessly @mmorgantaylor do you know the best way to upload multi-file WDLs to Agora? I can't import the current WDL as-is because the imports refer to relative local files, which Agora can't resolve. My workaround has been to rewrite the import paths so that they point to raw GitHub files. I've been doing this manually for testing, but it won't work with our automated GitHub -> Agora method creation tool (without some code changes to read the WDL and rewrite the imports).

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ -n "${OUTPUT_GCS_DIR}" ]; then
Contributor

Since these files are also task outputs, if someone runs with output_gcs_dir they'll have two copies of the data, one in the cromwell delocalized results of this task and another in the output bucket? Do you have ideas about how to handle that?

Author

By the cromwell delocalized results, do you mean the files in the "execution directory" (as Terra calls it)? In my mind, I had always thought of those files as "temporary" as in they're usually created within a TTL folder. I'm realizing now that this may be an incorrect assumption that I carried over from my time working with the long reads methods team.

I feel like that might be OK though? Users of the WDL that do not want the duplication can drop the output_gcs_dir argument. Users that want the output_gcs_dir argument likely only care about the VCF output files so they can clean up the execution directory as needed.

String gvs_extraction_destination_dataset
String gvs_extraction_temp_tables_dataset
String extraction_uuid
String output_gcs_dir
Contributor

should be optional


call GvsPrepareCallset.GvsPrepareCallset {
input:
destination_cohort_table_name = extraction_uuid,
Contributor

What happens if the same dataset is used for gvs_extraction_destination_dataset and gvs_extraction_cohorts_dataset? In both cases the table name is just the UUID so I think things will error out. If that's the case maybe use the uuid as a suffix instead of the sole table name

FROM
\`~{gvs_dataset}.sample_info\`
WHERE
sample_name IN " > create_cohort.sql
Contributor

What's the limit for how many samples can be queried this way? In other parts of GVS we found that performance drops off after > ~1000 entries. I don't think performance is a concern here, but there are limitations on query size, etc. If there's a limit here, would be good to know what it is

Contributor

FWIW... looks like the total query size limit is 1 MB so with a wild guess at 20 bytes per sample name that's 50k samples. Seems reasonable

Author

Thanks for checking on that for me. Agree that it should work but it was good to point out. I might be able to load this from a file which probably has better performance and scalability.
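A hypothetical sketch of the load-from-a-file idea is below; the table and dataset names here are illustrative, not taken from this PR. The point is to stage the names into a BigQuery table and JOIN against sample_info rather than inlining them into the query text.

version 1.0

# Hypothetical sketch: stage the sample names into a BigQuery table with `bq load`,
# then resolve the cohort with a JOIN instead of building a large IN (...) list.
task LoadCohortSampleNames {
  input {
    File sample_names_file        # one sample name per line
    String query_project
    String gvs_dataset            # "project.dataset" holding sample_info
    String temp_dataset_name      # dataset (in query_project) for the staged names -- illustrative
    String extraction_uuid
  }
  command <<<
    set -e
    TEMP_TABLE="~{temp_dataset_name}.sample_names_~{extraction_uuid}"

    # load the plain-text list as a one-column table
    bq load --project_id="~{query_project}" --source_format=CSV \
      "${TEMP_TABLE}" "~{sample_names_file}" "sample_name:STRING"

    # resolve the cohort by joining, rather than interpolating names into the SQL
    bq query --project_id="~{query_project}" --use_legacy_sql=false \
      --destination_table="~{temp_dataset_name}.cohort_~{extraction_uuid}" \
      "SELECT si.* FROM \`~{gvs_dataset}.sample_info\` si
       JOIN \`${TEMP_TABLE}\` n USING (sample_name)"
  >>>
  output {
    String cohort_table = "~{temp_dataset_name}.cohort_~{extraction_uuid}"
  }
  runtime {
    docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
  }
}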

@kcibul (Contributor) commented May 19, 2021

@ericsong on the Agora question -- we've been using Dockstore since it pulls right from GitHub and these are public WDLs. You'll need to add this WDL to the .dockstore.yml file, and if you want your branch available you'll add that as well (it should be fairly obvious when you look at the file). I haven't tried imports, but I think they work well -- better than Agora.
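For reference, a .dockstore.yml entry for the new workflow would look roughly like the sketch below; the descriptor path and branch filter are assumptions, not taken from this PR.

# Illustrative .dockstore.yml entry only -- the actual path and branches in the repo may differ.
version: 1.2
workflows:
  - name: GvsExtractCohortFromSampleNames
    subclass: WDL
    primaryDescriptorPath: /scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl
    filters:
      branches:
        - ah_var_store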

@kcibul (Contributor) commented May 19, 2021

Thinking out loud... if PrepareCallset could take a file of sample names as an input, would we no longer need this step of making the sample table? ExtractCohort can already take that -- Eric added it a while back, right?

@ericsong (Author) commented May 23, 2021

thanks for the review @kcibul. I made some changes accordingly.

re: PrepareCallset file of sample names. That would be nice! It would make this workflow simpler and it also simplifies the access requirements for PrepareCallset.

re: Dockstore. We actually ruled this out because Terra says that the definition of a method configuration can change automatically if it's updated in Dockstore. That can be useful, but it adds a security risk, since a compromised Dockstore could change the definition of the production AoU extraction WDL, which runs with highly elevated permissions. We already have a script that creates method configurations from GitHub, so I can probably add something a little hacky to resolve relative imports to the raw GitHub files they refer to.

@ahaessly (Contributor) left a comment

This looks good. Are you going to try to incorporate Kris's changes?

@ericsong (Author)

yep! I should have that wrapped up soon. testing the updated WDL now

@gatk-bot

Travis reported job failures from build 34321
Failures in the following jobs:

Test type: integration; JDK: openjdk11; Job ID: 34321.12 (logs)

@ericsong (Author)

@kcibul @ahaessly I made another round of changes to incorporate the new Prepare/Extract

@ahaessly (Contributor) left a comment

a few changes for defaulting values


Boolean do_not_filter_override
String? fq_filter_set_info_table
String? fq_filter_set_site_table
String? fq_filter_set_tranches_table
Contributor

Anything that is assigned a default value in the outer workflow should not be optional here.
Also, filter_set_name should not be optional

Boolean do_not_filter_override = false
String? filter_set_name
Contributor

Filter set name should not be optional. This will be named something that indicates which release of the filtering model the extract should use.

Author

should it be set even if filtering is not going to be used? for example, if do_not_filter_override is true, the argument will never be used

Contributor

oh, i didn't catch that. It's fine to be optional.

String output_file_base_name

Boolean do_not_filter_override = false
String? filter_set_name
Contributor

should not be optional. or if optional, should default to a valid filter set name

String? filter_set_name
String fq_filter_set_info_table = "~{gvs_project}.~{gvs_dataset}.filter_set_info"
String fq_filter_set_site_table = "~{gvs_project}.~{gvs_dataset}.filter_set_sites"
String fq_filter_set_tranches_table = "~{gvs_project}.~{gvs_dataset}.filter_set_tranches"
Contributor

I don't think you need to set these values or pass them to the workflow since they are defaulted in the workflow. If you want to allow the caller to set them, you should make them optional and then you would pass them into the workflow.

fq_petvet_dataset = "~{gvs_project}.~{gvs_dataset}",
fq_sample_mapping_table = "~{gvs_project}.~{gvs_dataset}.sample_info",
fq_temp_table_dataset = fq_gvs_extraction_temp_tables_dataset,
fq_destination_dataset = fq_gvs_extraction_destination_dataset
Contributor

maybe make these parameters in the definition of the task optional, since they are assigned defaults and then you won't have to pass the hard coded values

Author

So these do have defaults in GvsPrepareCallset, but the defaults don't work for me, which is why I overrode them here. Is that what you're suggesting?

Contributor

ah, well then maybe add a comment that you are explicitly overriding the default values

destination_cohort_table_prefix = extraction_uuid,
sample_names_to_extract = cohort_sample_names,
data_project = query_project,
default_dataset = gvs_dataset, # unused if fq_* args are given
Contributor

even though it's unused here, it still needs to be passed into the task

String gvs_dataset
String fq_gvs_extraction_cohorts_dataset
String fq_gvs_extraction_destination_dataset
String fq_gvs_extraction_temp_tables_dataset
Contributor

You can make these 2 optional since they are defaulted in the called workflow.
Also, I don't see fq_gvs_extraction_cohorts_dataset used anywhere. Should it be removed?

Author

good catch. removed.

regarding the two optionals though, do you mean the arguments on lines 16 and 17? those are passed directly into Prepare

Contributor

Yes. If you make fq_gvs_extraction_destination_dataset and fq_gvs_extraction_temp_tables_dataset optional (and still pass them into the prepare), then the caller does not have to supply the input. If an optional that has not been set is passed to a workflow/task, it is treated as None (and the default value will therefore be used). The only reason not to make them optional is if you know the defaults won't work for you.

@ahaessly (Contributor) left a comment

👍

set -e
if [ ~{has_service_account_file} = 'true' ]; then
gcloud auth activate-service-account --key-file='~{service_account_json}'
fi

LASTMODIFIED=$(bq show --location=US --format=json ~{dataset_table} | python3 -c "import sys, json; print(json.load(sys.stdin)['lastModifiedTime']);")
# bq needs the project name to be separated from the dataset by a colon
DATASET_TABLE_COLON=$(echo ~{dataset_table} | sed 's/\./:/')
Contributor

I have a branch where I fixed this a different way, by just taking the table name prefix and using dataset.tablename. I also had an issue where the show command was returning a bunch of text prepended to the result, which I fixed by creating a .bigqueryrc. I will have a PR for this soon and will tag you to review it.
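For reference, a sketch of that workaround as it would sit in the command block above (assuming the prepended text is bq's one-time first-run initialization banner, which goes to stdout when no config file exists; the bq show line is the one quoted above):

# Sketch of the .bigqueryrc workaround described above: create the config file up
# front so the first `bq` invocation in a fresh container doesn't print its
# initialization text into the JSON we want to parse.
echo "" > ~/.bigqueryrc
LASTMODIFIED=$(bq show --location=US --format=json ~{dataset_table} | python3 -c "import sys, json; print(json.load(sys.stdin)['lastModifiedTime']);")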

@ericsong merged commit b8b5f39 into ah_var_store on Jun 11, 2021
@ericsong deleted the songe/combine-prepare-extract branch on June 11, 2021 18:02