VS-263 notes on ingest and beyond #7618
**Note:**
Samples that will be batched and loaded together must be put into a sample_set ahead of time, otherwise their loading may cause conflicts.
This workflow must be done piecemeal if over 4000 samples are to be loaded as only 4000 samples can be loaded in at a time. The best way to do this currently is to create sample_sets of 4000 samples each.
The workflow can then be run once for each sample_set. if the same sample_set is inadvertantly run twice the workflow will detect that the samples already exist in the system and the second duplicate workflow will fail.
Suggested change:
- The workflow can then be run once for each sample_set. if the same sample_set is inadvertantly run twice the workflow will detect that the samples already exist in the system and the second duplicate workflow will fail.
+ The workflow can then be run once for each sample_set. If the same sample_set is inadvertently run twice the workflow will detect that the samples already exist in the system and the workflow will fail.
If any of the imports have failed on a single sample, check that all of the other samples have been loaded in that sample_set during that workflow. Sometimes a sample will fail during loading while there are still samples in the queue waiting for loading to begin. Because of the failure, these samples will not be loaded at all.
Keep track of the samples that have not been loaded whether because they failed, or because they were in the queue when another sample failed. They will need to be added later.

Once all sample_sets have been run, if there have been any failures, collect all non-loaded samples together in a new sample_set and load that in.
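The batching described in the note (sample_sets of at most 4000 samples, then one follow-up set for anything that did not load) could be sketched roughly as below. This is only an illustration: the function names, the `load_batch` prefix, and the idea of holding sample IDs in plain Python lists are assumptions, not part of the workflow or of Terra.

```python
from typing import Dict, Iterable, List

# Per-run load limit stated in the note above.
MAX_SAMPLES_PER_SET = 4000


def build_sample_sets(sample_ids: Iterable[str], prefix: str = "load_batch") -> Dict[str, List[str]]:
    """Split sample IDs into named sets of at most MAX_SAMPLES_PER_SET each.

    The set names (prefix plus a counter) are hypothetical; use whatever
    naming your workspace already follows.
    """
    # De-duplicate while preserving order, so that no sample lands in two sets.
    unique_ids = list(dict.fromkeys(sample_ids))
    sets: Dict[str, List[str]] = {}
    for start in range(0, len(unique_ids), MAX_SAMPLES_PER_SET):
        name = f"{prefix}_{start // MAX_SAMPLES_PER_SET + 1:03d}"
        sets[name] = unique_ids[start:start + MAX_SAMPLES_PER_SET]
    return sets


def collect_unloaded(sample_sets: Dict[str, List[str]], loaded_ids: Iterable[str]) -> List[str]:
    """Return every sample that appears in a set but is not known to be loaded."""
    loaded = set(loaded_ids)
    return [s for ids in sample_sets.values() for s in ids if s not in loaded]
```

After all sets have been run, the output of `collect_unloaded` can be passed back through `build_sample_sets` (with a different prefix) to form the follow-up sample_set, which keeps that set under the same 4000-sample limit.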
It probably makes sense to just add a query to check for samples that are in sample_sets but never made it to the `sample_load_status` table to capture the ones that need to be put into a new sample_set. That way the user doesn't have to comb through past runs.
Is that something we would do as part of the import genomes WDL? Should I make a ticket for this?
Like the other queries that you include in this doc, you could include a sample query that lists all the samples that are in the `sample_info` table but not in the `sample_load_status` table, which means the loading was never kicked off. The user could use this instead of having to keep track during ingest in order to figure out which samples those were.
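For reference, the kind of query suggested here might look something like the sketch below: an anti-join that keeps rows from `sample_info` with no matching row in `sample_load_status`. The column names (`sample_id`, `sample_name`, `status`) and the status values are assumptions for illustration and may not match the real schema. SQLite is used only so the example runs on its own; the query would need adapting to wherever the real tables live.

```python
import sqlite3

# Anti-join: samples registered in sample_info that have no row at all in
# sample_load_status, i.e. loading was never kicked off for them.
# Column names are assumed, not taken from the actual schema.
MISSING_SAMPLES_SQL = """
SELECT si.sample_name
FROM sample_info AS si
LEFT JOIN sample_load_status AS sls
  ON sls.sample_id = si.sample_id
WHERE sls.sample_id IS NULL
ORDER BY si.sample_name
"""


def find_never_loaded(conn: sqlite3.Connection) -> list:
    """Return the names of samples with no load-status row."""
    return [row[0] for row in conn.execute(MISSING_SAMPLES_SQL)]


def demo() -> list:
    """Build a tiny in-memory stand-in for the two tables and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sample_info (sample_id INTEGER PRIMARY KEY, sample_name TEXT);
        CREATE TABLE sample_load_status (sample_id INTEGER, status TEXT);
        INSERT INTO sample_info VALUES (1, 'NA001'), (2, 'NA002'), (3, 'NA003');
        -- Hypothetical status values; sample 2 never started loading.
        INSERT INTO sample_load_status VALUES (1, 'STARTED'), (1, 'FINISHED'), (3, 'STARTED');
        """
    )
    try:
        return find_never_loaded(conn)
    finally:
        conn.close()
```

Note that this only finds samples whose loading never started. Samples that started and then failed would still have a status row, so they would need a separate check on whatever status values the table actually records.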
@@ -101,30 +130,34 @@ This is done by running the `GvsCreateFilterSet` workflow with the following par

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`

Sometimes this workflow will fail because the Gaussians have not converged. Don't panic! It can happen to anyone's data!
The first step in this case will be to lower the number of Gaussians for the failed step (there are two possible steps: model creation for the SNPs and model creation for the InDels).
Please include the specific input names for the Gaussian values to change so that the user can fill them out easily.
You can then kick off the workflow again. If that still does not work, or you would prefer not to change the number of Gaussians, then you can remove a column from the model creation.
TODO: How do you remove a column from the model creation?
I don't know if we want to include this in the docs, since we don't have a real process for it.
Do you mean completely skip mentioning that a column could be removed, or specifically the TODO?
We don't have a real process for how to pick which column (if that's the same as annotation) to remove.
these are some of my notes from our discussions over the last week---this is not a final draft, but I wanted to get this into your hands
brickbats welcome
notes for things to add:
sample sets need to be run with this.sample_set_id
there's not a good way to make sample sets from the UI---let's ask Morgan about what her process is to make them outside the UI--is there a script?