filter on gvs_ids for workflow #7428

Merged
merged 2 commits into from
Aug 25, 2021

Contributor

@ahaessly ahaessly commented Aug 20, 2021

Only set is_loaded to true for the sample IDs being processed in the workflow.
VS-176
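
To illustrate the filter described above: the updated query in this PR restricts partition_id to the workflow's sample IDs by expanding an array into a quoted IN list. The following is a minimal shell sketch of that expansion, not the PR's actual code. It assumes the IDs are available one per line in a file; the helper name `build_in_list` and the file `gvs_ids_example` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a file of sample IDs (one per line) into a
# quoted, comma-separated list suitable for a SQL IN (...) clause.
build_in_list() {
  local ids_file="$1"
  # Drop blank lines, wrap each ID in single quotes, then join with commas.
  grep -v '^[[:space:]]*$' "$ids_file" | sed "s/.*/'&'/" | paste -s -d, -
}

# Example with made-up IDs.
printf '1\n2\n3\n' > gvs_ids_example
in_list="$(build_in_list gvs_ids_example)"
echo "partition_id IN (${in_list})"
```

Partition IDs are compared as strings here, which is why each ID is quoted before being joined.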

@@ -323,6 +325,7 @@ task GetMaxTableIdLegacy {
}
output {
Int max_table_id = read_int(stdout())
File gvs_ids = "gvs_ids"
Contributor

is there any benefit to making gvs_ids a var?

Contributor Author

@ahaessly ahaessly Aug 20, 2021

I'm not sure what you mean by a var.
Do you mean declare
String gvs_id_file = "gvs_ids"
and then the output would be ~{gvs_id_file}?
If so, I'm not sure of the trade-offs between the two. Any opinions?

Contributor

I think what you have works great. I was just "thinking out loud".

@@ -838,7 +843,7 @@ task AddIsLoadedColumn {

# set is_loaded to true if there is a corresponding pet table partition with rows for that sample_id
bq --location=US --project_id=~{project_id} query --format=csv --use_legacy_sql=false \
- "UPDATE ~{dataset_name}.sample_info SET is_loaded = true WHERE sample_id IN (SELECT CAST(partition_id AS INT64) from ~{dataset_name}.INFORMATION_SCHEMA.PARTITIONS WHERE partition_id != '__UNPARTITIONED__' AND total_logical_bytes > 0 AND table_name LIKE \"pet_%\")"
+ "UPDATE ~{dataset_name}.sample_info SET is_loaded = true WHERE sample_id IN (SELECT CAST(partition_id AS INT64) from ~{dataset_name}.INFORMATION_SCHEMA.PARTITIONS WHERE partition_id in ('~{sep="\',\'" gvs_id_array}') AND total_logical_bytes > 0 AND table_name LIKE \"pet_%\")"
Contributor

This made me wonder if there would be a benefit to checking and validating that there are no samples with a partition_id of '__UNPARTITIONED__'.

Contributor Author

I think that could be a good check, but I'm wondering where and when to do it. I guess we could do it here and fail if there is data in the unpartitioned partition.

Contributor

Yeah, I'm not sure where the check could go. Maybe just print out a warning in the meantime? I don't think it's necessary for this PR, though, and it would be fine as a follow-on ticket.
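
The warning discussed in this thread could look roughly like the shell sketch below. It is not part of this PR and has not been run against a real dataset. The function names are hypothetical, and the query simply reuses the columns and `bq` flags that already appear in the diff above.

```shell
#!/usr/bin/env bash
# Hypothetical follow-on check, not part of this PR.

# Count pet_* tables that hold data in the __UNPARTITIONED__ partition.
# Needs the bq CLI and credentials; arguments are project ID and dataset name.
count_unpartitioned_tables() {
  local project_id="$1" dataset_name="$2"
  bq --location=US --project_id="${project_id}" query --format=csv --use_legacy_sql=false \
    "SELECT COUNT(*) FROM ${dataset_name}.INFORMATION_SCHEMA.PARTITIONS WHERE partition_id = '__UNPARTITIONED__' AND total_logical_bytes > 0 AND table_name LIKE 'pet_%'" \
    | tail -n 1   # CSV output has a header row; keep only the value
}

# Print a warning to stderr when the count is non-zero; never fail the task.
warn_if_unpartitioned() {
  local count="$1"
  if [ "${count}" -gt 0 ]; then
    echo "WARNING: ${count} pet table(s) have data in __UNPARTITIONED__" >&2
  fi
  return 0
}

# Demo with literal counts (no bq call): silent for 0, warns for 2.
warn_if_unpartitioned 0
warn_if_unpartitioned 2
```

In a task, the two would be combined as `warn_if_unpartitioned "$(count_unpartitioned_tables "$project" "$dataset")"`, where `project` and `dataset` are placeholder variables. Changing `return 0` to a non-zero exit would turn the warning into the hard failure suggested earlier in the thread.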

@ahaessly ahaessly merged commit 386a310 into ah_var_store Aug 25, 2021
@ahaessly ahaessly deleted the ah_set_is_loaded_specific branch August 25, 2021 21:09
ahaessly added a commit that referenced this pull request Aug 27, 2021
* filter on gvs_ids for workflow
* update for legacy sample_map
This was referenced Mar 17, 2023
2 participants