
Consolidate various docs for AoU callset generation into one to rule them all [VS-553] #7971

Merged: 12 commits into `ah_var_store` on Aug 3, 2022

Conversation


@rsasch rsasch commented Aug 2, 2022

No description provided.

@rsasch rsasch requested review from gbggrant and mcovarr August 2, 2022 20:17

codecov bot commented Aug 2, 2022

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@3e62331).
The diff coverage is n/a.

@@               Coverage Diff                @@
##             ah_var_store     #7971   +/-   ##
================================================
  Coverage                ?   86.247%           
  Complexity              ?     35200           
================================================
  Files                   ?      2173           
  Lines                   ?    165016           
  Branches                ?     17793           
================================================
  Hits                    ?    142321           
  Misses                  ?     16368           
  Partials                ?      6327           

- To optimize the GVS internal queries, each sample must be assigned a unique and consecutive integer ID. Running the `GvsAssignIds` workflow will create a unique GVS ID for each sample (`sample_id`) and update the BQ `sample_info` table (creating it if it doesn't exist). This workflow also creates the BQ `vet_*`, `ref_ranges_*` and `cost_observability` tables needed for the generated sample IDs.
- Run at the `sample set` level ("Step 1" in workflow submission) with a sample set of all the new samples to be included in the callset (created by the "Fetch WGS metadata for samples from list" notebook mentioned above).
- You will want to set the `external_sample_names` input based on the column in the workspace Data table, e.g. "this.samples.research_id".
- If new controls are being added, they need to be done in a separate run, with the `samples_are_controls` input set to "true" (the referenced Data columns may also be different, e.g. "this.control_samples.control_sample_id" instead of "this.samples.research_id").
Collaborator:

Slightly confused as to what this means: may the external_sample_name be pulled from the control_sample_id?

Author (@rsasch):

The external_sample_name workflow input is an array of strings, usually grabbed from the Terra Data store in the workspace. The columns in that store can be called anything, and in the past, control samples and participant samples were in different tables with different column names.
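To make that concrete, here is a hypothetical sketch (plain Python, with made-up table contents) of how the `external_sample_names` array is effectively built from whichever column the workspace Data table uses; the table and column names follow the examples in the doc, but the helper itself is invented for illustration:

```python
# Hypothetical sketch: assembling the external_sample_names workflow input
# from rows of a Terra-style data table. The tables ("samples" vs.
# "control_samples") and columns ("research_id" vs. "control_sample_id")
# mirror the doc's examples; the row contents are made up.
def external_sample_names(table_rows, column):
    """Pull one column out of a list of row dicts, preserving order."""
    return [row[column] for row in table_rows]

samples = [{"research_id": "RID-001"}, {"research_id": "RID-002"}]
control_samples = [{"control_sample_id": "CTRL-A"}]

# Participant run: this.samples.research_id
participant_names = external_sample_names(samples, "research_id")
# Separate controls run: this.control_samples.control_sample_id
control_names = external_sample_names(control_samples, "control_sample_id")
```

The point is only that the workflow input is a flat array of strings; which table and column it comes from can differ between participant and control runs.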

scripts/variantstore/AOU_DELIVERABLES.md (resolved)
- **TBD VDS Extract WDL/notebook/??**
- Run the "Fetch WGS metadata for samples from list" notebook after you have placed the file listing the new samples to ingest in a GCS location that the notebook (running with your @pmi-ops account) can access. This will copy the samples from the workspace where they were reblocked into this callset workspace.
- Set the `sample_list_file_path` variable in that notebook to the path of that file.
- Run the "now that the data have been copied, you can make sample sets if you wish" step if you want to automatically break up the new samples into smaller sample sets. Set the `SUBSET_SIZE` and `set_name` variables to customize.
Collaborator:

Under what circumstances would we want to do this?

Author (@rsasch):

If you don't want to throw all 100K (or whatever number) samples at GvsImportGenomes at once.
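The sample-set splitting described in the notebook step can be sketched roughly like this; the `subset_size` and `set_name` parameters mirror the notebook's `SUBSET_SIZE` and `set_name` variables, but the helper and its `<set_name>_<i>` naming scheme are assumptions for illustration:

```python
# Hypothetical sketch of the "make sample sets" step: splitting a large
# sample list into smaller sets of subset_size, named "<set_name>_<i>".
def make_sample_sets(sample_ids, subset_size, set_name):
    return {
        f"{set_name}_{i // subset_size}": sample_ids[i:i + subset_size]
        for i in range(0, len(sample_ids), subset_size)
    }

# 10 samples in sets of 4 -> aou_0 (4 samples), aou_1 (4), aou_2 (2)
sets = make_sample_sets([f"s{n}" for n in range(10)],
                        subset_size=4, set_name="aou")
```

Each resulting set can then be submitted to `GvsImportGenomes` separately instead of loading the whole callset in one submission.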

scripts/variantstore/AOU_DELIVERABLES.md (outdated, resolved)
- To optimize the GVS internal queries, each sample must be assigned a unique and consecutive integer ID. Running the `GvsAssignIds` workflow will create a unique GVS ID for each sample (`sample_id`) and update the BQ `sample_info` table (creating it if it doesn't exist). This workflow also creates the BQ `vet_*`, `ref_ranges_*` and `cost_observability` tables needed for the generated sample IDs.
- Run at the `sample set` level ("Step 1" in workflow submission) with a sample set of all the new samples to be included in the callset (created by the "Fetch WGS metadata for samples from list" notebook mentioned above).
- You will want to set the `external_sample_names` input based on the column in the workspace Data table, e.g. "this.samples.research_id".
- If new controls are being added, they need to be done in a separate run, with the `samples_are_controls` input set to "true" (the referenced Data columns may also be different, e.g. "this.control_samples.control_sample_id" instead of "this.samples.research_id").
Collaborator:

How will we know if new controls are being added?

Author (@rsasch), Aug 3, 2022:

In the past, Lee/AoU has let us know that we should add additional (or different) controls to a callset.
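The unique-and-consecutive ID scheme that `GvsAssignIds` relies on can be illustrated with a minimal sketch; the `assign_ids` helper and its signature are invented here, and the real workflow records the IDs in the BQ `sample_info` table rather than returning a dict:

```python
# Hypothetical sketch of the GvsAssignIds idea: each new sample gets a
# unique, consecutive integer sample_id, continuing from the highest id
# already assigned, so internal queries can range-scan over sample_id.
def assign_ids(existing_max_id, new_sample_names):
    return {
        name: existing_max_id + offset + 1
        for offset, name in enumerate(new_sample_names)
    }

# Two samples already exist (ids 1 and 2); three new samples arrive.
ids = assign_ids(existing_max_id=2, new_sample_names=["sA", "sB", "sC"])
```

This is why controls added in a separate run still get IDs from the same sequence: the assignment always continues from the current maximum.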

- Run at the `sample set` level ("Step 1" in workflow submission). You can either run this on a sample set of all the samples and rely on the workflow logic to break it up into batches (or manually set the `load_data_batch_size` input), or run it on the smaller sample sets created by the "Fetch WGS metadata for samples from list" notebook mentioned above.
- You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the columns in the workspace Data table, e.g. "this.samples.research_id", "this.samples.reblocked_gvcf_v2" and "this.samples.reblocked_gvcf_index_v2".
3. `GvsWithdrawSamples` workflow
- Run if there are any samples to withdraw from the last callset.
Collaborator:

How do we know if there are samples to withdraw?

Author (@rsasch):

We compare the sample list for the callset we are creating (which we get from Lee/AoU) with the samples already in the database.
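That comparison is essentially a set difference; a minimal sketch (hypothetical helper, made-up sample names):

```python
# Hypothetical sketch of the withdrawal check: samples already in the
# database but absent from the new callset's sample list are the
# candidates to pass to GvsWithdrawSamples.
def samples_to_withdraw(in_database, new_callset_list):
    return sorted(set(in_database) - set(new_callset_list))

existing = ["s1", "s2", "s3", "s4"]     # already loaded into GVS
new_list = ["s1", "s3", "s4", "s5"]     # s5 is new; s2 was dropped
withdraw = samples_to_withdraw(existing, new_list)  # -> ["s2"]
```

If the difference is empty, the `GvsWithdrawSamples` step can be skipped for that callset.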

@rsasch rsasch requested a review from mcovarr August 3, 2022 14:09
@mcovarr (Collaborator) left a comment:

one nomenclatural thing otherwise lgtm 👍

scripts/variantstore/AOU_DELIVERABLES.md (outdated, resolved)
@rsasch rsasch merged commit 798d4e8 into ah_var_store Aug 3, 2022
@rsasch rsasch deleted the rsa_vs553_aou_docs branch August 3, 2022 18:04
This was referenced Mar 17, 2023
3 participants