Vs-1379 adding ploidy support to reference data #8857

koncheto-broad · 2024-06-03T18:33:42Z

This PR allows the extract process to read ploidy information from an optional table and use it when writing out reference data. This code does NOT create that table. If no such table is specified, the extract behaves as before (assuming a ploidy of 2 at all sites and expanding the reference data accordingly).
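To illustrate the fallback described above, here is a minimal sketch, not the code in this PR. The class name `RefPloidySketch`, its methods, and the idea of keying the lookup on sample id alone are all assumptions made for illustration; the real table may carry more dimensions (for example, location).

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch, not the actual ExtractCohortEngine code.
final class RefPloidySketch {
    // Fallback described in the PR: diploid everywhere when no table is given.
    static final int DEFAULT_PLOIDY = 2;

    // Assumed shape: sample_id -> ploidy. An empty map stands for "no table specified".
    private final Map<Long, Integer> ploidyBySample;

    RefPloidySketch(Map<Long, Integer> ploidyBySample) {
        this.ploidyBySample = ploidyBySample;
    }

    // Ploidy from the table if present, otherwise the diploid default.
    int ploidyFor(long sampleId) {
        return ploidyBySample.getOrDefault(sampleId, DEFAULT_PLOIDY);
    }

    // Hom-ref GT string with one "0" per chromosome copy: "0/0" or "0".
    String homRefGenotype(long sampleId) {
        return String.join("/", Collections.nCopies(ploidyFor(sampleId), "0"));
    }

    public static void main(String[] args) {
        RefPloidySketch withTable = new RefPloidySketch(Map.of(5L, 1));
        System.out.println(withTable.homRefGenotype(5L)); // sample listed as haploid
        System.out.println(withTable.homRefGenotype(6L)); // not listed, falls back to diploid
    }
}
```

Treating "no table" as an empty map keeps a single code path for both cases, which matches the backwards-compatibility goal stated above.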

Quickstart extract WITHOUT ploidy table specified: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/fcc47f3f-080c-41f3-9847-0dd1487ef39c

Quickstart extract WITH ploidy table specified: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/2d608711-758f-47d2-ab54-ae825293e4a9

Successful integration run for verifying backwards compatibility: https://app.terra.bio/#workspaces/gvs-dev/GVS%20Integration/job_history/d30c9db9-bdeb-4ff7-a236-3d3078258d06

The ploidy table used was based on quickstart data, but had data for samples 5, 9, and 10 manually updated to haploid. This will produce an INCORRECT VCF, inasmuch as it will reflect a mismatch in ploidy between the variant and ref data. But it allows us to see that, when the table is specified, the extract does in fact use it for writing out the ref data.

As expected, shards 0-21 are identical; the only changes are on shards 22 and 23, with diffs of this form:

```
< chrX	2800975	.	C	CA	.	.	AC=2;AF=0.250;AN=8;AS_QUALapprox=0|108;CALIBRATION_SENSITIVITY=0.9621;QUALapprox=81;SCORE=-0.5449	GT:AD:GQ:RGQ	./.	./.	./.	./.	0/0:.:30	./.	0/0:.:30	0/1:8,3:27:27	./.	0/1:6,5:80:81
---
> chrX	2800975	.	C	CA	.	.	AC=2;AF=0.286;AN=7;AS_QUALapprox=0|108;CALIBRATION_SENSITIVITY=0.9621;QUALapprox=81;SCORE=-0.5449	GT:AD:GQ:RGQ	./.	./.	./.	./.	0/0:.:30	./.	0:.:30	0/1:8,3:27:27	./.	0/1:6,5:80:81
25122c25122
< chrX	2805509	.	C	T	.	.	AC=1;AF=0.100;AN=10;AS_QUALapprox=0|360;CALIBRATION_SENSITIVITY=0.8769;QUALapprox=360;SCORE=-0.4865	GT:AD:GQ:RGQ	0/0:.:30	./.	./.	./.	./.	0/0:.:20	0/0:.:30	0/1:14,13:99:360	0/0:.:30	./.
---
> chrX	2805509	.	C	T	.	.	AC=1;AF=0.125;AN=8;AS_QUALapprox=0|360;CALIBRATION_SENSITIVITY=0.8769;QUALapprox=360;SCORE=-0.4865	GT:AD:GQ:RGQ	0:.:30	./.	./.	./.	./.	0/0:.:20	0:.:30	0/1:14,13:99:360	0/0:.:30	./.
25211,25212c25211,25212
< chrX	2822963	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.083;AN=12;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	./.	./.	0/0:.:20	0/0:.:20	./.	0/0:.:20	0/0:.:20	./.	0/1:7,1:20:0|1:2822963_G_T:20	0/0:.:30
< chrX	2822965	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.167;AN=6;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	0/0:.:30	./.	./.	./.	./.	./.	./.	./.	0/1:6,1:20:0|1:2822963_G_T:20	0/0:.:30
---
> chrX	2822963	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.091;AN=11;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	./.	./.	0/0:.:20	0/0:.:20	./.	0/0:.:20	0:.:20	./.	0/1:7,1:20:0|1:2822963_G_T:20	0/0:.:30
> chrX	2822965	.	G	T	.	LowQual;NO_HQ_GENOTYPES	AC=1;AF=0.200;AN=5;AS_QUALapprox=0|20;CALIBRATION_SENSITIVITY=.;QUALapprox=20;SCORE=.	GT:AD:GQ:PGT:PID:RGQ	0:.:30	./.	./.	./.	./.	./.	./.	./.	0/1:6,1:20:0|1:2822963_G_T:20	0/0:.:30
```
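The AN and AF shifts in these diffs follow directly from the genotype columns: each haploid ref call contributes one allele instead of two, so AN drops while AC stays the same. A small sketch of that arithmetic is below. The class and method names are hypothetical, and this is not how GATK itself computes these annotations.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper that recomputes AN and AF from per-sample VCF fields.
final class AlleleNumberSketch {
    // Counts called (non-missing) alleles across the GT subfield of each sample.
    static int alleleNumber(List<String> sampleFields) {
        int an = 0;
        for (String field : sampleFields) {
            String gt = field.split(":", 2)[0];      // GT is the first subfield
            for (String allele : gt.split("[/|]")) { // unphased or phased separator
                if (!allele.equals(".")) {
                    an++;
                }
            }
        }
        return an;
    }

    // AF = AC / AN, formatted with three decimals as in the records above.
    static String alleleFrequency(int ac, int an) {
        return String.format(Locale.ROOT, "%.3f", (double) ac / an);
    }

    public static void main(String[] args) {
        // Sample columns of the chrX:2800975 record after the ploidy table is applied.
        List<String> after = List.of("./.", "./.", "./.", "./.", "0/0:.:30", "./.",
                "0:.:30", "0/1:8,3:27:27", "./.", "0/1:6,5:80:81");
        int an = alleleNumber(after);
        System.out.println("AN=" + an + ";AF=" + alleleFrequency(2, an));
    }
}
```

For the first record, the diploid columns give AN=8 and AF=2/8=0.250; switching one `0/0` to `0` gives AN=7 and AF=2/7, which rounds to 0.286, matching the diff.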

mcovarr

first pass

RoriCremer · 2024-06-17T15:23:44Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java

+    public static final String SAMPLE_ID = "sample_id";
+    public static final String GENOTYPE = "genotype";
+    public static final String PLOIDY = "ploidy";
+


why do we need a ploidy col if we have a genotype col?

That's actually an artifact of how the table is created at this point. First we scan alt-allele to create the table. Each row holds the genotype we sampled from alt allele, which we will use to infer ref ploidy, as well as an empty column for what the final ploidy will be. Then we do a second pass over our table and calculate the correct ploidy based on the genotype column.

It means that the genotype column is actually vestigial once we have the ploidy column, but it didn't seem worth the effort at this point to remove it altogether. I DO think that the moment we change our code to create the table and calculate this stuff from the beginning during ingest, we'll want to trim out all references to that column. The current SQL is definitely not going to be the final SQL!
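For readers trying to picture the two passes, here is a rough sketch of the second one. It is written in Java rather than the SQL actually used, and the `Row` type, the method names, and the assumption that the sampled genotype is stored as a GT-style string such as `0/1` or `1` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the second pass; the real logic lives in SQL.
final class PloidySecondPassSketch {
    // One table row: the first pass fills genotype and leaves ploidy empty.
    static final class Row {
        final long sampleId;
        final String genotype; // assumed GT-style string, e.g. "0/1" or "1"
        Integer ploidy;        // null until the second pass runs

        Row(long sampleId, String genotype) {
            this.sampleId = sampleId;
            this.genotype = genotype;
        }
    }

    // Ploidy is taken to be the number of alleles in the sampled genotype.
    static int inferPloidy(String genotype) {
        return genotype.split("[/|]").length;
    }

    // Second pass: derive the ploidy column from the genotype column.
    static void fillPloidy(List<Row> rows) {
        for (Row row : rows) {
            row.ploidy = inferPloidy(row.genotype);
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1L, "0/1"));
        rows.add(new Row(5L, "1"));
        fillPloidy(rows);
        for (Row row : rows) {
            System.out.println(row.sampleId + " -> ploidy " + row.ploidy);
        }
    }
}
```

Once `ploidy` is populated, nothing downstream needs `genotype`, which is why the column is vestigial.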

Ok thanks for the explanation, I was hurting my brain trying to figure out what genotype meant here...

And while we're on the subject is there a ticket for making the ploidy table automatically?

I'll create that ticket now so we don't forget it

mcovarr · 2024-06-17T15:56:45Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java

+    public static final String SAMPLE_ID = "sample_id";
+    public static final String GENOTYPE = "genotype";
+    public static final String PLOIDY = "ploidy";
+


Ok thanks for the explanation, I was hurting my brain trying to figure out what genotype meant here...

And while we're on the subject is there a ticket for making the ploidy table automatically?

* checkpointing here to switch branches
* locally working first pass at adding in the ploidy info. Still needs to have the arguments passed through so it works in the WDLs
* Propagating changes up through the wdl
* Stupid WDL substitution mistake
* On a roll with WDL today wheeeeee
* Cleaning up slightly
* PR feedback
* PR feedback v2: Ploidy Boogaloo

* Vs-1379 adding ploidy support to reference data (#8857)
* checkpointing here to switch branches
* locally working first pass at adding in the ploidy info. Still needs to have the arguments passed through so it works in the WDLs
* Propagating changes up through the wdl
* Stupid WDL substitution mistake
* On a roll with WDL today wheeeeee
* Cleaning up slightly
* PR feedback
* PR feedback v2: Ploidy Boogaloo
* updating GATK docker for latest ploidy changes

koncheto-broad added 6 commits May 29, 2024 13:05

checkpointing here to switch branches

f73c0b4

locally working first pass at adding in the ploidy info. Still needs …

2fb59b7

…to have the arguments passed through so it works in the WDLs

Propagating changes up through the wdl

03b2045

Stupid WDL substitution mistake

21cc2b4

On a roll with WDL today wheeeeee

b4b7bcc

Cleaning up slightly

52b66b9

koncheto-broad marked this pull request as ready for review June 13, 2024 13:09

mcovarr self-requested a review June 13, 2024 19:13

mcovarr self-assigned this Jun 13, 2024

mcovarr reviewed Jun 13, 2024

PR feedback

37008db

RoriCremer reviewed Jun 17, 2024

RoriCremer approved these changes Jun 17, 2024

gbggrant approved these changes Jun 17, 2024

src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java (outdated, resolved)

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java (resolved)

PR feedback v2: Ploidy Boogaloo

5cc306d

mcovarr approved these changes Jun 17, 2024

koncheto-broad merged commit d28b342 into ah_var_store Jun 17, 2024
16 of 17 checks passed

koncheto-broad deleted the VS-1379-ploidy branch June 17, 2024 16:19
