serial inserts for scaling prepare, factored out sample name #7288

kcibul · 2021-06-04T18:57:58Z

Two primary sets of changes

Split the combined "CREATE TABLE AS... SELECT... join PET + VET" statement into 3 separate statements: CREATE, INSERT vet, INSERT pet.
To keep our shuffle down, we no longer join against the sample mapping table on sample_id at query time. ExtractCohort already has the id -> name mapping, so we just needed to use it (this should also reduce costs slightly).
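
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the two-column schema, the SELECT lists, and the helpers `build_statements`, `run_serially`, and `attach_sample_names` are all hypothetical, and `run_query` stands in for whatever submits a statement to BigQuery and waits for it to finish.

```python
from typing import Callable, Dict, List, Tuple


def build_statements(dest_table: str, pet_table: str, vet_table: str) -> List[Tuple[str, str]]:
    """Return (label, sql) pairs in the order they should run.

    The schema and SELECT lists are placeholders; the real tables have more columns.
    """
    create_table_sql = f"""
        CREATE TABLE `{dest_table}` (
            location  INT64,
            sample_id INT64
        )"""
    insert_vet_sql = f"""
        INSERT INTO `{dest_table}` (location, sample_id)
        SELECT location, sample_id FROM `{vet_table}`"""
    insert_pet_sql = f"""
        INSERT INTO `{dest_table}` (location, sample_id)
        SELECT location, sample_id FROM `{pet_table}`"""
    return [
        ("create", create_table_sql),
        ("insert vet", insert_vet_sql),
        ("insert pet", insert_pet_sql),
    ]


def run_serially(statements: List[Tuple[str, str]],
                 run_query: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run each statement to completion before starting the next.

    `run_query` is assumed to block until the job finishes and return its job id.
    """
    job_ids = []
    for label, sql in statements:
        job_ids.append((label, run_query(sql)))
    return job_ids


def attach_sample_names(rows: List[dict], sample_names: Dict[int, str]) -> List[dict]:
    """Resolve sample_id -> sample name on the client instead of in a SQL join."""
    return [{**row, "sample_name": sample_names[row["sample_id"]]} for row in rows]
```

The point of the sketch is only the shape: three statements executed one after another, and the name lookup done from an in-memory mapping rather than a join at query time.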

Testing

Tested on the GVS tieout set. As expected, the only difference in the cohort extract tables is that we no longer see mis-joined VET information at * sites (a nice side benefit). Otherwise the tables tie out exactly in SQL.

In addition, I ran a full GIAB tieout before and after, and the results are identical.

kcibul · 2021-06-04T19:06:28Z

scripts/variantstore/wdl/extract/create_cohort_extract_data_table.py

-            `{fq_sample_mapping_table}` s ON (new_pet.sample_id = s.sample_id))
-      """
+        (
+              location      INT64,


fix formatting

kcibul · 2021-06-04T19:07:28Z

scripts/variantstore/wdl/extract/create_cohort_extract_data_table.py

@@ -297,7 +327,7 @@ def make_extract_table(fq_pet_vet_dataset,
    #Default QueryJobConfig will be merged into job configs passed in
    #but if a specific default config is being updated (eg labels), new config must be added
    #to the client._default_query_job_config that already exists
-    default_config = QueryJobConfig(labels=query_labels_map, priority="INTERACTIVE", use_query_cache=False)
+    default_config = QueryJobConfig(labels=query_labels_map, priority="INTERACTIVE", use_query_cache=True)


I don't know why this was false before, probably for testing/benchmarking

ahaessly · 2021-06-07T11:34:36Z

scripts/variantstore/wdl/extract/create_cohort_extract_data_table.py


-  cohort_extract_final_query_job.result()
-  JOB_IDS.add((f"insert final cohort table {fq_destination_table_data}", cohort_extract_final_query_job.job_id))
+  sql = f"""


Is it more or less error-prone in Python to reuse the same variable name, sql?
Should we instead use create_table_sql, etc.?
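
One way to look at this question: reusing a name is harmless when each statement is submitted right after it is assigned, but it can bite if anything reads the variable later, because Python closures capture the variable rather than its value at that moment. The sketch below is a hypothetical illustration of that pitfall and is not taken from the script; the function names and SQL strings are made up.

```python
def statements_with_reused_name():
    """Reuse one name, sql, and read it only after both assignments."""
    deferred = []
    sql = "CREATE TABLE t (location INT64)"
    deferred.append(lambda: sql)  # captures the variable, not its current value
    sql = "INSERT INTO t SELECT location FROM vet"
    deferred.append(lambda: sql)
    # Both callables now see the last value assigned to sql.
    return [get() for get in deferred]


def statements_with_distinct_names():
    """Give each statement its own name so deferred reads stay correct."""
    deferred = []
    create_table_sql = "CREATE TABLE t (location INT64)"
    deferred.append(lambda: create_table_sql)
    insert_vet_sql = "INSERT INTO t SELECT location FROM vet"
    deferred.append(lambda: insert_vet_sql)
    return [get() for get in deferred]
```

With the reused name, both deferred reads return the INSERT statement; with distinct names, each returns its own statement. Distinct names also make log messages and job labels easier to match to the statement that produced them.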

kcibul added 3 commits June 4, 2021 15:00

serial inserts for scaling prepare, factored out sample name

18ccd5a

baseline

fd54796

updated docker image

c291bf3

kcibul force-pushed the kc_scale_prepare branch from 5104f74 to c291bf3 Compare June 4, 2021 19:04

kcibul commented Jun 4, 2021

fixed avro tests, minor comments

a955464

kcibul requested a review from ahaessly June 4, 2021 20:08

ahaessly approved these changes Jun 7, 2021

kcibul merged commit 7325330 into ah_var_store Jun 7, 2021

kcibul deleted the kc_scale_prepare branch June 7, 2021 19:36

This was referenced Mar 17, 2023

lb merge gvs branch #8248

Closed

testing something, please ignore #8251

Closed
