
#299 - Sample list ease of use for cohort extracts #7272

Merged: 3 commits from kc_prepare_samples merged into ah_var_store on May 26, 2021

Conversation

@kcibul (Contributor) commented May 24, 2021

A collection of changes to enhance reliability and ease-of-use. Users no longer have to make a table containing the sample names to extract in GvsPrepareCallset (which was painful) and they don't have to re-supply that same table when rendering the VCF in GvsExtractCallset (which was error prone).

GvsPrepareCallset now takes a file of sample names as a parameter, as well as an export table prefix.

The main sample_info table is then subset to the sample names in the supplied file and stored in the table {export_prefix}__SAMPLES. The export table is still created, and is now named {export_prefix}__DATA.

GvsExtractCallset now only needs to take this export prefix and is able to get the sample list and data it needs from these tables.

@ericsong -- does this fit the AoU use case well?
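To make the naming convention concrete, here is a minimal sketch of how an export prefix maps onto the two tables described above (the prefix and project.dataset values below are hypothetical; only the __SAMPLES / __DATA suffixes come from this PR):

```python
# Hypothetical prefix and destination dataset, for illustration only.
export_prefix = "aou_cohort_v1"
destination_dataset = "my-project.cohort_extracts"

samples_table = f"{destination_dataset}.{export_prefix}__SAMPLES"  # subset of sample_info for this cohort
data_table = f"{destination_dataset}.{export_prefix}__DATA"        # the cohort extract data itself

print(samples_table)  # my-project.cohort_extracts.aou_cohort_v1__SAMPLES
print(data_table)     # my-project.cohort_extracts.aou_cohort_v1__DATA
```

GvsExtractCallset can then derive both table names from the prefix alone.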

 # query_labels is string that looks like 'key1=val1, key2=val2'
-if len(query_labels) != 0:
+if query_labels is not None and len(query_labels) != 0:

ooooh thank you!

 global client
 # this is where a set of labels are being created for the cohort extract
 query_labels_map = {}
-query_labels_map["id"]= {output_table_prefix}
 query_labels_map["gvs_tool_name"]= f"gvs_prepare_callset"
+query_labels_map["id"]= output_table_prefix

thank you!!!
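For context on how a labels map like the one in the diff above is typically used, here is a minimal sketch assuming the google-cloud-bigquery client library (the label values are illustrative, and this is not the PR's exact code):

```python
from google.cloud import bigquery

# Illustrative labels, mirroring the shape of the map built in the diff above.
query_labels_map = {"id": "aou_cohort_v1", "gvs_tool_name": "gvs_prepare_callset"}

# Labels on the client's default query job config are attached to every query it runs,
# which makes the tool's jobs easy to find and cost-attribute in BigQuery's job history.
default_config = bigquery.QueryJobConfig(labels=query_labels_map)
client = bigquery.Client(default_query_job_config=default_config)  # assumes application default credentials

# Any subsequent client.query(...) call now carries these labels.
```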


if not (bool(re.match(r"[a-z0-9_-]+$", key)) & bool(re.match(r"[a-z0-9_-]+$", value))):
    raise ValueError(f"label key or value did not pass validation--format should be 'key1=val1, key2=val2'")

I want to add testing for all of this, but am procrastinating on it because it seems like this will become Java at some point soon, but maybe that's not true....
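As a self-contained illustration of the parsing and validation the snippets above imply, here is a sketch of the idea (not the PR's exact helper; the function name is made up):

```python
import re

def parse_query_labels(query_labels):
    """Parse 'key1=val1, key2=val2' into a dict, validating each key and value.

    Illustrative sketch: the regex from the PR allows only lowercase letters,
    digits, underscores and dashes in label keys and values.
    """
    labels = {}
    if query_labels is None or len(query_labels) == 0:
        return labels
    for pair in query_labels.split(","):
        key, _, value = pair.strip().partition("=")
        if not (re.fullmatch(r"[a-z0-9_-]+", key) and re.fullmatch(r"[a-z0-9_-]+", value)):
            raise ValueError("label key or value did not pass validation--format should be 'key1=val1, key2=val2'")
        labels[key] = value
    return labels

print(parse_query_labels("id=aou_cohort_v1, gvs_tool_name=gvs_prepare_callset"))
# {'id': 'aou_cohort_v1', 'gvs_tool_name': 'gvs_prepare_callset'}
```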

table_id = f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"

job_labels = client._default_query_job_config.labels
job_labels["gvs_query_name"] = "load-sample-names"

note to self: "gvs_query_name" should prob be a const
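Building on the note above, a small sketch of layering a per-query label on top of the tool-wide defaults (the constant name is hypothetical, following the "should prob be a const" note; this is not the PR's exact code):

```python
from google.cloud import bigquery

GVS_QUERY_NAME_LABEL = "gvs_query_name"  # hypothetical constant, per the note above

# Tool-wide default labels (illustrative values).
default_config = bigquery.QueryJobConfig(labels={"gvs_tool_name": "gvs_prepare_callset"})

# Copy the defaults and tag this particular job so it stands out in the job history.
job_labels = dict(default_config.labels)
job_labels[GVS_QUERY_NAME_LABEL] = "load-sample-names"
load_config = bigquery.LoadJobConfig(labels=job_labels)

print(load_config.labels)
# {'gvs_tool_name': 'gvs_prepare_callset', 'gvs_query_name': 'load-sample-names'}
```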

@ericsong left a comment

I had some comments around the prefix but they're non-blocking; the changes look good and def work for AoU 👍.

sql = f"select m.sample_id from `{fq_cohort_sample_names}` c JOIN `{fq_sample_mapping_table}` m ON (m.sample_name = c.sample_name)"

def load_sample_names(sample_names_to_extract, fq_temp_table_dataset):
    schema = [ bigquery.SchemaField("sample_name", "STRING", mode="REQUIRED") ]
    table_id = f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"


nit: table_id -> fq_sample_table?
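To show the whole shape of such a loader, here is a minimal sketch of loading a file of sample names into a single-column table, using the fq_sample_table naming suggested in the nit above (assumes the google-cloud-bigquery library; it is an illustration, not the PR's actual load_sample_names):

```python
from google.cloud import bigquery

def load_sample_names(sample_names_file, fq_sample_table):
    """Load a newline-delimited file of sample names into a one-column BigQuery table.

    Illustrative sketch only; assumes application default credentials.
    """
    client = bigquery.Client()
    schema = [bigquery.SchemaField("sample_name", "STRING", mode="REQUIRED")]
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        source_format=bigquery.SourceFormat.CSV,                  # one sample name per line
        write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,  # fail if the table already has data
    )
    with open(sample_names_file, "rb") as fh:
        job = client.load_table_from_file(fh, fq_sample_table, job_config=job_config)
    job.result()  # wait for the load to finish
    return fq_sample_table
```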

@@ -13,8 +13,7 @@ workflow GvsExtractCallset {
     File reference_index
     File reference_dict
 
-    String fq_samples_to_extract_table
-    String fq_cohort_extract_table
+    String fq_cohort_extract_table_prefix


hm, I'm wondering if the previous way of providing the fq_tables explicitly is more flexible, since it ties the implementation a little less to PrepareCallset. Now, the fq_samples_to_extract_table and fq_cohort_extract_table would need to be part of the same dataset even though they technically don't have to be.

That said, this works just fine for AoU. Just thinking about some other use cases.


if __name__ == '__main__':
    parser = argparse.ArgumentParser(allow_abbrev=False, description='Extract a cohort from BigQuery Variant Store ')

    parser.add_argument('--fq_petvet_dataset',type=str, help='project.dataset location of pet/vet data', required=True)
    parser.add_argument('--fq_temp_table_dataset',type=str, help='project.dataset location where results should be stored', required=True)
    parser.add_argument('--fq_destination_dataset',type=str, help='project.dataset location where results should be stored', required=True)
    parser.add_argument('--destination_table',type=str, help='destination table', required=True)
    parser.add_argument('--fq_cohort_sample_names',type=str, help='FQN of cohort table to extract, contains "sample_name" column', required=True)
    parser.add_argument('--destination_cohort_table_prefix',type=str, help='prefix for destination tables', required=True)


similar comment to above - passing around a prefix works well if PrepareCallset and ExtractCallset are only meant to be used with each other, but from the perspective of someone who just wants to run PrepareCallset, they wouldn't know the fq names of the generated tables without reading the implementation and seeing that the __DATA and __SAMPLE suffixes are being added.

In that regard, I think the explicit fq table names are nice, but a comment in the argument's help section will also suffice. It could also be good to give an example of a valid prefix. It's not immediately clear if the prefix should contain just the table name prefix or the project and dataset as well.

@kcibul (Author) replied:

Good idea about the comments, I'll take a look at that. I wish there were some other way to create collections of related tables, and we could give that name instead. Datasets would work for that, but they're much heavier-weight, and permissions-wise they're a higher bar than tables.
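On the prefix question above: reading the argument list, the prefix appears to be just the table-name prefix, with the project and dataset supplied separately via --fq_destination_dataset. A tiny sketch of how a hypothetical invocation would parse (argument values are made up, and only two of the arguments are reproduced):

```python
import argparse

# Minimal re-creation of two of the arguments above, to show how the pieces combine.
parser = argparse.ArgumentParser(allow_abbrev=False)
parser.add_argument('--fq_destination_dataset', type=str, required=True)           # project.dataset
parser.add_argument('--destination_cohort_table_prefix', type=str, required=True)  # bare table-name prefix

args = parser.parse_args([
    '--fq_destination_dataset', 'my-project.cohort_extracts',
    '--destination_cohort_table_prefix', 'aou_cohort_v1',
])
print(f"{args.fq_destination_dataset}.{args.destination_cohort_table_prefix}__DATA")
# my-project.cohort_extracts.aou_cohort_v1__DATA
```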

@ahaessly (Contributor) left a comment

LGTM

@kcibul merged commit f09214c into ah_var_store on May 26, 2021
@kcibul deleted the kc_prepare_samples branch on May 26, 2021 at 14:48