Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VS-775 vat validation shards #8175

Merged
merged 17 commits into from
Apr 10, 2023
Merged

Conversation

RoriCremer
Copy link
Contributor

@RoriCremer RoriCremer commented Jan 26, 2023

Add additional validation around duplicated rows in the VAT
duplicate_AN_or_AC_values

This has a successful run (except for one failure that is because it's being run on way less data)
https://job-manager.dsde-prod.broadinstitute.org/jobs/07ddde58-ac0d-4229-9f96-d093f5c11682
The failed test is:
SpotCheckForAAChangeAndExonNumberConsistency

Perhaps we want to update this to not run this test if there are less than 10k samples?
Yes we do:
Here's the ticket for that:
https://broadworkbench.atlassian.net/browse/VS-878

@codecov
Copy link

codecov bot commented Jan 26, 2023

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@8217073). Click here to learn what that means.
The diff coverage is n/a.

Additional details and impacted files
@@               Coverage Diff                @@
##             ah_var_store     #8175   +/-   ##
================================================
  Coverage                ?   85.880%           
  Complexity              ?     35515           
================================================
  Files                   ?      2194           
  Lines                   ?    167029           
  Branches                ?     18006           
================================================
  Hits                    ?    143444           
  Misses                  ?     17204           
  Partials                ?      6381           

Copy link
Collaborator

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task 'DuplicateAnnotations' is failing for me (when run on my QuickStart) with:
/cromwell_root/script: line 76: syntax error near unexpected token )' /cromwell_root/script: line 76: ) > "$out94c3fb6c" 2> "$err94c3fb6c"'

Maybe you're not escaping the table string?

@mcovarr
Copy link
Collaborator

mcovarr commented Feb 6, 2023

@gbggrant do you have a link to your failed run of DuplicateAnnotations? The query escapes look okay to me and nothing else jumps out as being wrong but something clearly is...

@gbggrant
Copy link
Collaborator

gbggrant commented Feb 6, 2023

@RoriCremer RoriCremer force-pushed the rc-vs-774-vat-validation-shards branch from f88464f to 3786d6c Compare February 9, 2023 18:25
@RoriCremer RoriCremer force-pushed the rc-vs-774-vat-validation-shards branch from a9228af to 2d69f15 Compare April 3, 2023 18:28
Comment on lines 844 to 847
bq query --nouse_legacy_sql --project_id=~{query_project_id} --format=csv 'SELECT * from
(SELECT contig, position, gvs_all_an, COUNT(DISTINCT gvs_all_an) AS an_count FROM `~{fq_vat_table}`
group by contig, position, gvs_all_an)
where an_count >1' > bq_an_output.csv
Copy link
Collaborator

@mcovarr mcovarr Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to simplify this and the query below with a HAVING (i.e. without a nested query) as was done previously.

Suggested change
bq query --nouse_legacy_sql --project_id=~{query_project_id} --format=csv 'SELECT * from
(SELECT contig, position, gvs_all_an, COUNT(DISTINCT gvs_all_an) AS an_count FROM `~{fq_vat_table}`
group by contig, position, gvs_all_an)
where an_count >1' > bq_an_output.csv
bq query --nouse_legacy_sql --project_id=~{query_project_id} --format=csv '
SELECT contig, position, gvs_all_an, COUNT(DISTINCT gvs_all_an) AS an_count
FROM `~{fq_vat_table}`
GROUP BY contig, position, gvs_all_an
HAVING an_count > 1
' > bq_an_output.csv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to make sure that for multiple rows with the same VID, there aren't multiple values of AN (and below the AC!)

Comment on lines 849 to 852
bq query --nouse_legacy_sql --project_id=~{query_project_id} --format=csv 'SELECT * from
(SELECT contig, position, vid, gvs_all_ac, COUNT(DISTINCT gvs_all_ac) AS ac_count FROM `~{fq_vat_table}`
group by contig, position, vid, gvs_all_ac)
where ac_count >1' > bq_ac_output.csv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bq query --nouse_legacy_sql --project_id=~{query_project_id} --format=csv 'SELECT * from
(SELECT contig, position, vid, gvs_all_ac, COUNT(DISTINCT gvs_all_ac) AS ac_count FROM `~{fq_vat_table}`
group by contig, position, vid, gvs_all_ac)
where ac_count >1' > bq_ac_output.csv
bq query --nouse_legacy_sql --project_id=~{query_project_id} --format=csv '
SELECT contig, position, gvs_all_ac, COUNT(DISTINCT gvs_all_ac) AS ac_count
FROM `~{fq_vat_table}`
GROUP BY contig, position, gvs_all_ac
HAVING ac_count > 1
' > bq_ac_output.csv

@@ -129,7 +136,7 @@ workflow GvsValidateVat {
SubpopulationMax.pass,
SubpopulationAlleleCount.pass,
SubpopulationAlleleNumber.pass,
ClinvarSignificance.pass,
DuplicateAnnotations.pass,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to replace ClinvarSignificance with DuplicateAnnotations? If so, you should get rid of the task itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet---I want to check with Lee about how to handle this:

https://broadworkbench.atlassian.net/browse/VS-635

Copy link
Collaborator

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave a comment explaining that ClinvarSignificance has been removed temporarily

@RoriCremer RoriCremer merged commit e2b2e5c into ah_var_store Apr 10, 2023
@RoriCremer RoriCremer deleted the rc-vs-774-vat-validation-shards branch April 10, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants