VS-263 notes on ingest and beyond #7618

Merged
merged 5 commits into from
Mar 23, 2022

Conversation

RoriCremer
Contributor

@RoriCremer RoriCremer commented Dec 21, 2021

These are some of my notes from our discussions over the last week. This is not a final draft, but I wanted to get it into your hands.
brickbats welcome

notes for things to add:
sample sets need to be run with this.sample_set_id

There's not a good way to make sample sets from the UI. Let's ask Morgan what her process is for making them outside the UI. Is there a script?

@RoriCremer RoriCremer changed the title notes on ingest and beyond VS-263 notes on ingest and beyond Jan 28, 2022
scripts/variantstore/AOU_GVS_WORKSPACE.md
**Note:**
Samples that will be batched and loaded together must be put into a sample_set ahead of time, otherwise their loading may cause conflicts.
This workflow must be done piecemeal if over 4000 samples are to be loaded as only 4000 samples can be loaded in at a time. The best way to do this currently is to create sample_sets of 4000 samples each.
The workflow can then be run once for each sample_set. if the same sample_set is inadvertantly run twice the workflow will detect that the samples already exist in the system and the second duplicate workflow will fail.

Suggested change
The workflow can then be run once for each sample_set. if the same sample_set is inadvertantly run twice the workflow will detect that the samples already exist in the system and the second duplicate workflow will fail.
The workflow can then be run once for each sample_set. If the same sample_set is inadvertently run twice the workflow will detect that the samples already exist in the system and the workflow will fail.
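The batching described in the note above (at most 4000 samples per run, one sample_set per run) could be prepared along these lines. This is a minimal sketch, not part of the PR: the function name `chunk_samples` and the sample ID format are hypothetical, and how the resulting groups are turned into Terra sample_sets is left open.

```python
from typing import Iterable, List

# Per-run load limit stated in the note above.
MAX_SAMPLES_PER_SET = 4000


def chunk_samples(sample_ids: Iterable[str],
                  size: int = MAX_SAMPLES_PER_SET) -> List[List[str]]:
    """Split sample IDs into consecutive groups of at most `size` samples.

    Hypothetical helper: each returned group is meant to become one sample_set.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    ids = list(sample_ids)
    # A sample must appear only once, so that no sample is loaded twice.
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate sample IDs in input")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Example with made-up IDs: 9500 samples become three groups.
batches = chunk_samples(f"sample_{i}" for i in range(9500))
batch_sizes = [len(batch) for batch in batches]
```

Rejecting duplicates up front is a design choice in this sketch: it keeps every sample in exactly one group, which matches the requirement that samples loaded together be placed in a sample_set ahead of time.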

Comment on lines 76 to 80
If any of the imports have failed on a single sample, check that all of the other samples have been loaded in that sample_set during that workflow. Sometimes a sample will fail during loading while there are still samples in the queue waiting for loading to begin. Because of the failure, these samples will not be loaded at all.
Keep track of the samples that have not been loaded whether because they failed, or because they were in the queue when another sample failed. They will need to be added later.

Once all sample_sets have been run, if there have been any failures, collect all non-loaded samples together in a new sample_set and load that in.
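The bookkeeping described above (track every sample that was not loaded, then gather them into one new sample_set) can be sketched as a simple difference between what was submitted and what was loaded. The function name `samples_to_reload` and the in-memory inputs are assumptions for illustration; the doc does not say how the lists of submitted and loaded samples are obtained.

```python
from typing import Dict, List, Set


def samples_to_reload(sample_sets: Dict[str, List[str]],
                      loaded: Set[str]) -> List[str]:
    """Return every submitted sample that is not in `loaded`, in submission order.

    This covers both samples that failed and samples that were still queued
    when another sample failed, since neither kind ends up in `loaded`.
    """
    remaining: List[str] = []
    seen: Set[str] = set()
    for members in sample_sets.values():
        for sample_id in members:
            if sample_id not in loaded and sample_id not in seen:
                seen.add(sample_id)
                remaining.append(sample_id)
    return remaining


# Example with made-up IDs: "b" failed and "e" was never started.
submitted = {"set_1": ["a", "b", "c"], "set_2": ["d", "e"]}
retry_set = samples_to_reload(submitted, loaded={"a", "c", "d"})
```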

It probably makes sense to just add a query to check for samples that are in sample_sets but never made it to the sample_load_status table to capture the ones that need to be put into a new sample_set. That way the user doesn't have to comb through past runs.

Contributor Author

@RoriCremer RoriCremer Mar 4, 2022

that we would do as part of the import genomes WDL? should I make a ticket for this?

Like the other queries that you include in this doc, you could include a sample query that lists all the samples that are in the sample_info table but not in the sample_load_status, which means the loading was never kicked off. The user could use this instead of having to keep track during ingest in order to figure out which samples those were.
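A query of the kind suggested here might look roughly like the following. It is a sketch under stated assumptions: the table names `sample_info` and `sample_load_status` come from the comment above, but the join column `sample_id`, the `sample_name` column, and the project and dataset placeholders are assumptions and should be checked against the actual schema. The query text is built in Python only so that the example is self-contained.

```python
def build_unloaded_samples_query(project: str, dataset: str) -> str:
    """Build a query listing samples with no row in sample_load_status.

    Assumes both tables share a `sample_id` column (not confirmed by the PR).
    """
    info = f"`{project}.{dataset}.sample_info`"
    status = f"`{project}.{dataset}.sample_load_status`"
    return (
        "SELECT si.sample_id, si.sample_name\n"
        f"FROM {info} AS si\n"
        f"LEFT JOIN {status} AS sls\n"
        "  ON si.sample_id = sls.sample_id\n"
        # Rows with no match on the right side were never kicked off.
        "WHERE sls.sample_id IS NULL\n"
        "ORDER BY si.sample_id"
    )


# Placeholder project and dataset names, for illustration only.
query = build_unloaded_samples_query("my-project", "my_dataset")
print(query)
```

The anti-join (a `LEFT JOIN` filtered on a `NULL` right-hand key) returns exactly the samples present in `sample_info` with no matching load-status row, which is the set the comment describes as never having been kicked off.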

scripts/variantstore/AOU_GVS_WORKSPACE.md
@@ -101,30 +130,34 @@ This is done by running the `GvsCreateFilterSet` workflow with the following par

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`

Sometimes this workflow will fail because the Gaussians have not converged. Don't panic! It can happen to anyone's data!
The first step in this case is to lower the number of Gaussians for the failed step (there are two possible steps: model creation for the SNPs and model creation for the InDels).

Please include the specific input names for the Gaussian values to change so that the user can fill them out easily.

Comment on lines 135 to 136
You can then kick off the workflow again. If that still does not work, or you would prefer not to change the number of Gaussians, then you can remove a column from the model creation.
TODO: How do you remove a column from the model creation?

I don't know if we want to include this in the docs, since we don't have a real process for it.

Contributor Author

like completely skip that a column could be removed? or specifically the TODO?

We don't have a real process for how to pick which column (if that's the same as annotation) to remove.

@RoriCremer RoriCremer merged commit 3b4c5ba into ah_var_store Mar 23, 2022
@RoriCremer RoriCremer deleted the rc-more-notes branch March 23, 2022 20:14
This was referenced Mar 17, 2023
2 participants