
#299 - Sample list ease of use for cohort extracts #7272

Merged: 3 commits from kc_prepare_samples merged into ah_var_store on May 26, 2021

Conversation

@kcibul (Contributor) commented May 24, 2021

A collection of changes to enhance reliability and ease-of-use. Users no longer have to make a table containing the sample names to extract in GvsPrepareCallset (which was painful) and they don't have to re-supply that same table when rendering the VCF in GvsExtractCallset (which was error prone).

GvsPrepareCallset now takes a file of sample names as a parameter, as well as an export table prefix.

The main sample_info table is then subset to the sample names in the supplied file and stored in the table {export_prefix}__SAMPLES. The export table is still created, and is now named {export_prefix}__DATA.

GvsExtractCallset now only needs to take this export prefix and is able to get the sample list and data it needs from these tables.

@ericsong -- does this fit the AoU use case well?
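To make the naming convention concrete, here is a minimal sketch of how an export prefix maps onto the two tables described above (the prefix and project.dataset values below are hypothetical; only the __SAMPLES / __DATA suffixes come from this PR):

```python
# Hypothetical prefix and destination dataset, for illustration only.
export_prefix = "aou_cohort_v1"
destination_dataset = "my-project.cohort_extracts"

samples_table = f"{destination_dataset}.{export_prefix}__SAMPLES"  # subset of sample_info for this cohort
data_table = f"{destination_dataset}.{export_prefix}__DATA"        # the cohort extract data itself

print(samples_table)  # my-project.cohort_extracts.aou_cohort_v1__SAMPLES
print(data_table)     # my-project.cohort_extracts.aou_cohort_v1__DATA
```

GvsExtractCallset can then derive both table names from the prefix alone.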

 # query_labels is string that looks like 'key1=val1, key2=val2'
-if len(query_labels) != 0:
+if query_labels is not None and len(query_labels) != 0:

ooooh thank you!

 global client
 # this is where a set of labels are being created for the cohort extract
 query_labels_map = {}
-query_labels_map["id"]= {output_table_prefix}
 query_labels_map["gvs_tool_name"]= f"gvs_prepare_callset"
+query_labels_map["id"]= output_table_prefix

thank you!!!
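For context on how a labels map like the one in the diff above is typically used, here is a minimal sketch assuming the google-cloud-bigquery client library (the label values are illustrative, and this is not the PR's exact code):

```python
from google.cloud import bigquery

# Illustrative labels, mirroring the shape of the map built in the diff above.
query_labels_map = {"id": "aou_cohort_v1", "gvs_tool_name": "gvs_prepare_callset"}

# Labels on the client's default query job config are attached to every query it runs,
# which makes the tool's jobs easy to find and cost-attribute in BigQuery's job history.
default_config = bigquery.QueryJobConfig(labels=query_labels_map)
client = bigquery.Client(default_query_job_config=default_config)  # assumes application default credentials

# Any subsequent client.query(...) call now carries these labels.
```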


if not (bool(re.match(r"[a-z0-9_-]+$", key)) & bool(re.match(r"[a-z0-9_-]+$", value))):
    raise ValueError(f"label key or value did not pass validation--format should be 'key1=val1, key2=val2'")

I want to add testing for all of this, but am procrastinating on it because it seems like this will become Java at some point soon, but maybe that's not true....
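As a self-contained illustration of the parsing and validation the snippets above imply, here is a sketch of the idea (not the PR's exact helper; the function name is made up):

```python
import re

def parse_query_labels(query_labels):
    """Parse 'key1=val1, key2=val2' into a dict, validating each key and value.

    Illustrative sketch: the regex from the PR allows only lowercase letters,
    digits, underscores and dashes in label keys and values.
    """
    labels = {}
    if query_labels is None or len(query_labels) == 0:
        return labels
    for pair in query_labels.split(","):
        key, _, value = pair.strip().partition("=")
        if not (re.fullmatch(r"[a-z0-9_-]+", key) and re.fullmatch(r"[a-z0-9_-]+", value)):
            raise ValueError("label key or value did not pass validation--format should be 'key1=val1, key2=val2'")
        labels[key] = value
    return labels

print(parse_query_labels("id=aou_cohort_v1, gvs_tool_name=gvs_prepare_callset"))
# {'id': 'aou_cohort_v1', 'gvs_tool_name': 'gvs_prepare_callset'}
```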

table_id = f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"

job_labels = client._default_query_job_config.labels
job_labels["gvs_query_name"] = "load-sample-names"

note to self: "gvs_query_name" should prob be a const
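Building on the note above, a small sketch of layering a per-query label on top of the tool-wide defaults (the constant name is hypothetical, following the "should prob be a const" note; this is not the PR's exact code):

```python
from google.cloud import bigquery

GVS_QUERY_NAME_LABEL = "gvs_query_name"  # hypothetical constant, per the note above

# Tool-wide default labels (illustrative values).
default_config = bigquery.QueryJobConfig(labels={"gvs_tool_name": "gvs_prepare_callset"})

# Copy the defaults and tag this particular job so it stands out in the job history.
job_labels = dict(default_config.labels)
job_labels[GVS_QUERY_NAME_LABEL] = "load-sample-names"
load_config = bigquery.LoadJobConfig(labels=job_labels)

print(load_config.labels)
# {'gvs_tool_name': 'gvs_prepare_callset', 'gvs_query_name': 'load-sample-names'}
```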

@ericsong left a comment

I had some comments around the prefix but they're non-blocking; the changes look good and def work for AoU 👍.

sql = f"select m.sample_id from `{fq_cohort_sample_names}` c JOIN `{fq_sample_mapping_table}` m ON (m.sample_name = c.sample_name)"

def load_sample_names(sample_names_to_extract, fq_temp_table_dataset):
    schema = [ bigquery.SchemaField("sample_name", "STRING", mode="REQUIRED") ]
    table_id = f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"


nit: table_id -> fq_sample_table?
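To show the whole shape of such a loader, here is a minimal sketch of loading a file of sample names into a single-column table, using the fq_sample_table naming suggested in the nit above (assumes the google-cloud-bigquery library; it is an illustration, not the PR's actual load_sample_names):

```python
from google.cloud import bigquery

def load_sample_names(sample_names_file, fq_sample_table):
    """Load a newline-delimited file of sample names into a one-column BigQuery table.

    Illustrative sketch only; assumes application default credentials.
    """
    client = bigquery.Client()
    schema = [bigquery.SchemaField("sample_name", "STRING", mode="REQUIRED")]
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        source_format=bigquery.SourceFormat.CSV,                  # one sample name per line
        write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,  # fail if the table already has data
    )
    with open(sample_names_file, "rb") as fh:
        job = client.load_table_from_file(fh, fq_sample_table, job_config=job_config)
    job.result()  # wait for the load to finish
    return fq_sample_table
```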

@@ -13,8 +13,7 @@ workflow GvsExtractCallset {
     File reference_index
     File reference_dict
 
-    String fq_samples_to_extract_table
-    String fq_cohort_extract_table
+    String fq_cohort_extract_table_prefix


hm, I'm wondering if the previous way of providing the fq_tables explicitly is more flexible, since it ties the implementation a little less to PrepareCallset. Now, the fq_samples_to_extract_table and fq_cohort_extract_table would need to be part of the same dataset even though they technically don't have to be.

That said, this works just fine for AoU. Just thinking about some other use cases.


if __name__ == '__main__':
    parser = argparse.ArgumentParser(allow_abbrev=False, description='Extract a cohort from BigQuery Variant Store ')

    parser.add_argument('--fq_petvet_dataset',type=str, help='project.dataset location of pet/vet data', required=True)
    parser.add_argument('--fq_temp_table_dataset',type=str, help='project.dataset location where results should be stored', required=True)
    parser.add_argument('--fq_destination_dataset',type=str, help='project.dataset location where results should be stored', required=True)
    parser.add_argument('--destination_table',type=str, help='destination table', required=True)
    parser.add_argument('--fq_cohort_sample_names',type=str, help='FQN of cohort table to extract, contains "sample_name" column', required=True)
    parser.add_argument('--destination_cohort_table_prefix',type=str, help='prefix for destination tables', required=True)


similar comment to above - passing around a prefix works well if PrepareCallset and ExtractCallset are only meant to be used with each other, but from the perspective of someone who just wants to run PrepareCallset, they wouldn't know the fq names of the generated tables without reading the implementation and seeing that the __DATA and __SAMPLE suffixes are being added.

In that regard, I think the explicit fq table names are nice, but a comment in the argument's help section will also suffice. It could also be good to give an example of a valid prefix. It's not immediately clear if the prefix should contain just the table name prefix or the project and dataset as well.

@kcibul (Author) replied:

Good idea about the comments, I'll take a look at that. I wish there were some other way to create collections of related tables, and we could give that name instead. Datasets would work for that, but they're much heavier-weight, and permissions-wise they're a higher bar than tables.
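On the prefix question above: reading the argument list, the prefix appears to be just the table-name prefix, with the project and dataset supplied separately via --fq_destination_dataset. A tiny sketch of how a hypothetical invocation would parse (argument values are made up, and only two of the arguments are reproduced):

```python
import argparse

# Minimal re-creation of two of the arguments above, to show how the pieces combine.
parser = argparse.ArgumentParser(allow_abbrev=False)
parser.add_argument('--fq_destination_dataset', type=str, required=True)           # project.dataset
parser.add_argument('--destination_cohort_table_prefix', type=str, required=True)  # bare table-name prefix

args = parser.parse_args([
    '--fq_destination_dataset', 'my-project.cohort_extracts',
    '--destination_cohort_table_prefix', 'aou_cohort_v1',
])
print(f"{args.fq_destination_dataset}.{args.destination_cohort_table_prefix}__DATA")
# my-project.cohort_extracts.aou_cohort_v1__DATA
```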

@ahaessly (Contributor) left a comment

LGTM

@kcibul merged commit f09214c into ah_var_store on May 26, 2021
@kcibul deleted the kc_prepare_samples branch on May 26, 2021 at 14:48