-
Notifications
You must be signed in to change notification settings - Fork 587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vs-1379 adding ploidy support to reference data #8857
Conversation
…to have the arguments passed through so it works in the WDLs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first pass
src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java
Show resolved
Hide resolved
public static final String SAMPLE_ID = "sample_id"; | ||
public static final String GENOTYPE = "genotype"; | ||
public static final String PLOIDY = "ploidy"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a ploidy col if we have a genotype col?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually an artifact of how the table is created, at this point. First we scan alt-allele to create the table, and part of that contains the genotype that we sampled from alt allele that we will use to infer ref ploidy as well as an empty column for what the final ploidy will be. Then we do a second pass over our table and calculate the correct ploidy based on the genotype column.
It means that the genotype column is actually vestigial once we have the ploidy column, but it didn't seem worth the effort at this point to remove it altogether. I DO think that the moment we change our code to create the table and calculate this stuff from the beginning during ingest, we'll want to trim out all references to that column. The current sql is definitely not going to be the final sql!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks for the explanation, I was hurting my brain trying to figure out what genotype meant here...
And while we're on the subject is there a ticket for making the ploidy table automatically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll create that ticket now so we don't forget it
src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java
Show resolved
Hide resolved
public static final String SAMPLE_ID = "sample_id"; | ||
public static final String GENOTYPE = "genotype"; | ||
public static final String PLOIDY = "ploidy"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks for the explanation, I was hurting my brain trying to figure out what genotype meant here...
And while we're on the subject is there a ticket for making the ploidy table automatically?
* checkpointing here to switch branches * locally working first pass at adding in the ploidy info. Still needs to have the arguments passed through so it works in the WDLs * Propagating changes up through the wdl * Stupid WDL substitution mistake * On a roll with WDL today wheeeeee * Cleaning up slightly * PR feedback * PR feedback v2: Ploidy Boogaloo
* Vs-1379 adding ploidy support to reference data (#8857) * checkpointing here to switch branches * locally working first pass at adding in the ploidy info. Still needs to have the arguments passed through so it works in the WDLs * Propagating changes up through the wdl * Stupid WDL substitution mistake * On a roll with WDL today wheeeeee * Cleaning up slightly * PR feedback * PR feedback v2: Ploidy Boogaloo * updating GATK docker for latest ploidy changes
This PR allows the extract process to read ploidy information from an optional table and use it when writing out reference data. This code does NOT create that table. In the absence of such data, it will do nothing and behave like before (assuming a ploidy of 2 at all sites and expanding the reference data accordingly).
Quickstart extract WITHOUT ploidy table specified:https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/fcc47f3f-080c-41f3-9847-0dd1487ef39c
Quickstart extract WITH ploidy table specified: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/2d608711-758f-47d2-ab54-ae825293e4a9
Successful integration run for verifying backwards compatibility: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Integration/job_history/d30c9db9-bdeb-4ff7-a236-3d3078258d06
The ploidy table used was based on quickstart data, but had data for samples 5, 9, and 10 manually updated to haploid. This will produce a INCORRECT vcf, inasmuch as it will reflect a mismatch in ploidy between the variant and ref data. But it allows us to see that, when the table is specified, it does in fact use it for writing out the ref data.
As expected, shards 0-21 are identical with the only changes being on shards 22 and 23, and with diffs of this form: