Vs-1379 adding ploidy support to reference data #8857

Merged
merged 8 commits into ah_var_store from VS-1379-ploidy
Jun 17, 2024

Conversation

koncheto-broad
@koncheto-broad koncheto-broad commented Jun 3, 2024

This PR allows the extract process to read ploidy information from an optional table and use it when writing out reference data. This code does NOT create that table. In the absence of such data, it will do nothing and behave like before (assuming a ploidy of 2 at all sites and expanding the reference data accordingly).
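A minimal sketch of the fallback behavior described above, not the code from this PR. The class name, the method names, and the `"contig:sampleId"` key format are all assumptions made for illustration; only the default ploidy of 2 comes from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: every name here is illustrative and not taken from the PR.
final class PloidyLookupSketch {
    // The description states that extract assumes a ploidy of 2 when no table is given.
    static final int DEFAULT_PLOIDY = 2;

    // Assumption: entries are keyed by "contig:sampleId".
    private final Map<String, Integer> ploidyByKey = new HashMap<>();

    // A null table models "no ploidy table specified".
    PloidyLookupSketch(Map<String, Integer> table) {
        if (table != null) {
            ploidyByKey.putAll(table);
        }
    }

    // Returns the recorded ploidy, or the default when no entry exists.
    int ploidyFor(String contig, long sampleId) {
        return ploidyByKey.getOrDefault(contig + ":" + sampleId, DEFAULT_PLOIDY);
    }

    // Builds a hom-ref genotype string with one "0" per chromosome copy.
    static String referenceGenotype(int ploidy) {
        return String.join("/", Collections.nCopies(ploidy, "0"));
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("chrX:5", 1); // sample 5 marked haploid on chrX

        PloidyLookupSketch withTable = new PloidyLookupSketch(table);
        PloidyLookupSketch withoutTable = new PloidyLookupSketch(null);

        System.out.println(referenceGenotype(withTable.ploidyFor("chrX", 5L)));
        System.out.println(referenceGenotype(withoutTable.ploidyFor("chrX", 5L)));
    }
}
```

With no table, every lookup falls through to the default, which matches the backwards-compatible behavior the description promises.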

Quickstart extract WITHOUT ploidy table specified: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/fcc47f3f-080c-41f3-9847-0dd1487ef39c

Quickstart extract WITH ploidy table specified: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/2d608711-758f-47d2-ab54-ae825293e4a9

Successful integration run for verifying backwards compatibility: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Integration/job_history/d30c9db9-bdeb-4ff7-a236-3d3078258d06

The ploidy table used was based on quickstart data, but with the entries for samples 5, 9, and 10 manually updated to haploid. This produces an INCORRECT VCF, inasmuch as it reflects a mismatch in ploidy between the variant and ref data. But it lets us see that, when the table is specified, it is in fact used for writing out the ref data.

As expected, shards 0-21 are identical with the only changes being on shards 22 and 23, and with diffs of this form:

```
< chrX	2800975	.	C	CA	.	.	AC=2;AF=0.250;AN=8;AS_QUALapprox=0|108;CALIBRATION_SENSITIVITY=0.9621;QUALapprox=81;SCORE=-0.5449	GT:AD:GQ:RGQ	./.	./.	./.	./.	0/0:.:30	./.	0/0:.:30	0/1:8,3:27:27	./.	0/1:6,5:80:81
---
> chrX	2800975	.	C	CA	.	.	AC=2;AF=0.286;AN=7;AS_QUALapprox=0|108;CALIBRATION_SENSITIVITY=0.9621;QUALapprox=81;SCORE=-0.5449	GT:AD:GQ:RGQ	./.	./.	./.	./.	0/0:.:30	./.	0:.:30	0/1:8,3:27:27	./.	0/1:6,5:80:81
25122c25122
< chrX	2805509	.	C	T	.	.	AC=1;AF=0.100;AN=10;AS_QUALapprox=0|360;CALIBRATION_SENSITIVITY=0.8769;QUALapprox=360;SCORE=-0.4865	GT:AD:GQ:RGQ	0/0:.:30	./.	./.	./.	./.	0/0:.:20	0/0:.:30	0/1:14,13:99:360	0/0:.:30	./.
---
> chrX	2805509	.	C	T	.	.	AC=1;AF=0.125;AN=8;AS_QUALapprox=0|360;CALIBRATION_SENSITIVITY=0.8769;QUALapprox=360;SCORE=-0.4865	GT:AD:GQ:RGQ	0:.:30	./.	./.	./.	./.	0/0:.:20	0:.:30	0/1:14,13:99:360	0/0:.:30	./.
25211,25212c25211,25212
< chrX	2822963	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.083;AN=12;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	./.	./.	0/0:.:20	0/0:.:20	./.	0/0:.:20	0/0:.:20	./.	0/1:7,1:20:0|1:2822963_G_T:20	0/0:.:30
< chrX	2822965	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.167;AN=6;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	0/0:.:30	./.	./.	./.	./.	./.	./.	./.	0/1:6,1:20:0|1:2822963_G_T:20	0/0:.:30
---
> chrX	2822963	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.091;AN=11;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	./.	./.	0/0:.:20	0/0:.:20	./.	0/0:.:20	0:.:20	./.	0/1:7,1:20:0|1:2822963_G_T:20	0/0:.:30
> chrX	2822965	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.200;AN=5;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	0:.:30	./.	./.	./.	./.	./.	./.	./.	0/1:6,1:20:0|1:2822963_G_T:20	0/0:.:30
```
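The AN and AF shifts in these diffs follow directly from the genotype columns: each haploid ref call contributes one allele to AN instead of two. The sketch below recomputes AC, AN, and AF for the first chrX record (position 2800975) before and after the change. It is an illustrative recalculation with hypothetical class and method names, not the code the extract uses, and it assumes a biallelic site where any non-`0` allele counts toward AC.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: recomputes AC/AN/AF from VCF sample columns for a biallelic site.
final class AlleleNumberSketch {
    // Returns {AC, AN}: AN counts every called allele, AC counts the non-reference ones.
    static int[] countAlleles(List<String> sampleColumns) {
        int ac = 0;
        int an = 0;
        for (String column : sampleColumns) {
            String gt = column.split(":")[0];          // GT is the first subfield
            for (String allele : gt.split("[/|]")) {   // unphased or phased separator
                if (allele.equals(".")) {
                    continue;                          // no-call alleles are not counted
                }
                an++;
                if (!allele.equals("0")) {
                    ac++;
                }
            }
        }
        return new int[] {ac, an};
    }

    // Formats AF to three decimals, as in the records above.
    static String alleleFrequency(int ac, int an) {
        return String.format(Locale.ROOT, "%.3f", (double) ac / an);
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("./.", "./.", "./.", "./.", "0/0:.:30", "./.",
                "0/0:.:30", "0/1:8,3:27:27", "./.", "0/1:6,5:80:81");
        List<String> after = Arrays.asList("./.", "./.", "./.", "./.", "0/0:.:30", "./.",
                "0:.:30", "0/1:8,3:27:27", "./.", "0/1:6,5:80:81");

        int[] b = countAlleles(before);
        int[] a = countAlleles(after);
        System.out.println("before: AC=" + b[0] + " AN=" + b[1] + " AF=" + alleleFrequency(b[0], b[1]));
        System.out.println("after:  AC=" + a[0] + " AN=" + a[1] + " AF=" + alleleFrequency(a[0], a[1]));
    }
}
```

One sample switching from `0/0` to `0` drops AN from 8 to 7 while AC stays at 2, so AF moves from 0.250 to 0.286, matching the record.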

@koncheto-broad koncheto-broad marked this pull request as ready for review June 13, 2024 13:09
@mcovarr mcovarr self-requested a review June 13, 2024 19:13
@mcovarr mcovarr self-assigned this Jun 13, 2024
Collaborator

@mcovarr mcovarr left a comment

first pass

public static final String SAMPLE_ID = "sample_id";
public static final String GENOTYPE = "genotype";
public static final String PLOIDY = "ploidy";

Contributor

why do we need a ploidy col if we have a genotype col?

Author

That's actually an artifact of how the table is currently created. First we scan alt-allele to create the table. Each row contains the genotype we sampled from alt-allele, which we use to infer ref ploidy, plus an empty column for the final ploidy. Then we do a second pass over the table and calculate the correct ploidy from the genotype column.

That means the genotype column is vestigial once we have the ploidy column, but it didn't seem worth the effort at this point to remove it altogether. I DO think that the moment we change our code to create the table and calculate this during ingest, we'll want to trim out all references to that column. The current SQL is definitely not going to be the final SQL!
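As a rough illustration of that second pass, the sketch below infers ploidy by counting the alleles in a sampled genotype. The real calculation lives in SQL that is not shown in this conversation; the class name, method name, and the assumption that the genotype column holds a VCF-style GT string such as `0/1` or `1` are all hypothetical.

```java
// Hypothetical sketch: infers ploidy from a VCF-style GT string by counting alleles.
final class PloidyFromGenotypeSketch {
    // Assumption: the genotype column holds values such as "0/1", "0|1", or "1".
    static int ploidyOf(String genotype) {
        if (genotype == null || genotype.isEmpty()) {
            throw new IllegalArgumentException("genotype must be non-empty");
        }
        // One allele per chromosome copy, separated by "/" (unphased) or "|" (phased).
        return genotype.split("[/|]").length;
    }

    public static void main(String[] args) {
        System.out.println(ploidyOf("0/1")); // diploid call
        System.out.println(ploidyOf("1"));   // haploid call
    }
}
```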

Collaborator

Ok thanks for the explanation, I was hurting my brain trying to figure out what genotype meant here...

And while we're on the subject, is there a ticket for creating the ploidy table automatically?

Author

I'll create that ticket now so we don't forget it

@koncheto-broad koncheto-broad merged commit d28b342 into ah_var_store Jun 17, 2024
16 of 17 checks passed
@koncheto-broad koncheto-broad deleted the VS-1379-ploidy branch June 17, 2024 16:19
koncheto-broad added a commit that referenced this pull request Jun 18, 2024
* checkpointing here to switch branches

* locally working first pass at adding in the ploidy info.  Still needs to have the arguments passed through so it works in the WDLs

* Propagating changes up through the wdl

* Stupid WDL substitution mistake

* On a roll with WDL today wheeeeee

* Cleaning up slightly

* PR feedback

* PR feedback v2: Ploidy Boogaloo
koncheto-broad added a commit that referenced this pull request Jun 18, 2024
* Vs-1379 adding ploidy support to reference data (#8857)

* checkpointing here to switch branches

* locally working first pass at adding in the ploidy info.  Still needs to have the arguments passed through so it works in the WDLs

* Propagating changes up through the wdl

* Stupid WDL substitution mistake

* On a roll with WDL today wheeeeee

* Cleaning up slightly

* PR feedback

* PR feedback v2: Ploidy Boogaloo

* updating GATK docker for latest ploidy changes
4 participants