serial inserts for scaling prepare, factored out sample name #7288

Merged: 4 commits merged into ah_var_store from kc_scale_prepare on Jun 7, 2021

@kcibul kcibul commented Jun 4, 2021

Two primary sets of changes:

  1. Split the combined "CREATE TABLE AS... SELECT... join PET + VET" statement into 3 separate statements: CREATE, INSERT vet, INSERT pet.
  2. To keep our shuffle down, we no longer join on sample_id at query time to pull in the sample name. We already have the id -> name mapping in ExtractCohort; we just needed to use it (this should also reduce costs slightly).
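
For readers skimming the description, the two changes above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `build_prepare_statements` and `attach_sample_names` are hypothetical names, the two-column schema is an assumed subset of the real extract table, and the real INSERT statements select from PET and VET with more logic than a plain SELECT.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def build_prepare_statements(dest: str, vet: str, pet: str) -> List[Tuple[str, str]]:
    """Return (label, sql) pairs to run serially: CREATE, INSERT vet, INSERT pet.

    Hypothetical helper; the column list is an assumed subset of the real schema.
    """
    columns = "location, sample_id"
    return [
        ("create", f"CREATE TABLE `{dest}` (location INT64, sample_id INT64)"),
        ("insert vet", f"INSERT INTO `{dest}` ({columns}) SELECT {columns} FROM `{vet}`"),
        ("insert pet", f"INSERT INTO `{dest}` ({columns}) SELECT {columns} FROM `{pet}`"),
    ]


def attach_sample_names(rows: Iterable[dict], id_to_name: Dict[int, str]) -> Iterator[dict]:
    """Resolve sample_id to a name in memory instead of joining at query time."""
    for row in rows:
        # The lookup replaces the JOIN against the sample mapping table.
        yield {**row, "sample_name": id_to_name[row["sample_id"]]}
```

The point of the second function is that the extract table only needs to carry sample_id; whoever reads the rows can attach the name from a mapping it already holds.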

Testing

Tested on the GVS tieout set. As expected, the only difference in the cohort extract tables is that we no longer see mis-joined VET information at * sites (a nice side benefit). Otherwise the tables tie out exactly in SQL.

In addition, I ran a full GIAB tieout before and after, and the results are identical.

`{fq_sample_mapping_table}` s ON (new_pet.sample_id = s.sample_id))
"""
(
location INT64,
Contributor Author

fix formatting

@@ -297,7 +327,7 @@ def make_extract_table(fq_pet_vet_dataset,
#Default QueryJobConfig will be merged into job configs passed in
#but if a specific default config is being updated (eg labels), new config must be added
#to the client._default_query_job_config that already exists
-default_config = QueryJobConfig(labels=query_labels_map, priority="INTERACTIVE", use_query_cache=False)
+default_config = QueryJobConfig(labels=query_labels_map, priority="INTERACTIVE", use_query_cache=True)
Contributor Author

I don't know why this was false before; probably for testing/benchmarking.

@kcibul kcibul requested a review from ahaessly June 4, 2021 20:08

cohort_extract_final_query_job.result()
JOB_IDS.add((f"insert final cohort table {fq_destination_table_data}", cohort_extract_final_query_job.job_id))
sql = f"""
Contributor

Is it more or less error-prone in Python to reuse the same variable name, sql?
Should we instead use create_table_sql, etc.?
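
One way to read this suggestion, as a rough sketch rather than the code under review: give each statement its own name and submit them through a single loop, so a stale `sql` value cannot be resubmitted by accident. The names `run_named_statements` and `fake_run_query` are hypothetical, the SQL strings are placeholders, and the stub stands in for a real BigQuery client call that would return a job ID.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_named_statements(
    statements: Dict[str, str], run_query: Callable[[str], str]
) -> Set[Tuple[str, str]]:
    """Run statements in insertion order and collect (label, job_id) pairs."""
    job_ids: Set[Tuple[str, str]] = set()
    for label, statement in statements.items():
        job_ids.add((label, run_query(statement)))
    return job_ids


# Distinct names instead of one reused `sql` variable (placeholder SQL).
create_table_sql = "CREATE TABLE `dest` (location INT64)"
insert_vet_sql = "INSERT INTO `dest` (location) SELECT location FROM `vet`"
insert_pet_sql = "INSERT INTO `dest` (location) SELECT location FROM `pet`"

statements = {
    "create final cohort table": create_table_sql,
    "insert vet": insert_vet_sql,
    "insert pet": insert_pet_sql,
}

# Stub runner: records what was submitted and returns a made-up job ID.
submitted: List[str] = []


def fake_run_query(statement: str) -> str:
    submitted.append(statement)
    return f"job-{len(submitted)}"


job_ids = run_named_statements(statements, fake_run_query)
```

Reusing `sql` is legal Python and works as long as each assignment is followed immediately by its own submit; the distinct names mainly help when statements are reordered or logged later.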

@kcibul kcibul merged commit 7325330 into ah_var_store Jun 7, 2021
@kcibul kcibul deleted the kc_scale_prepare branch June 7, 2021 19:36
This was referenced Mar 17, 2023