Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performed a round of ablation on new annotation-based filtering tools. #8131

Merged
merged 2 commits into from
Sep 22, 2023

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented Dec 14, 2022

  • Removed positive-negative training from TrainVariantAnnotationsModel along with associated integration and WDL tests.
  • Added ability to run positive-unlabeled training by passing unlabeled annotations to a custom python backend (although no example backend or tests were added).
  • Cleaned up some WDL arguments to allow distinct training and scoring python scripts.
  • Removed the useAlleleSpecificAnnotations argument; we instead infer whether to run in allele-specific mode from the VCF header.

@samuelklee samuelklee marked this pull request as draft December 14, 2022 20:54
@gatk-bot

This comment was marked as outdated.

@gatk-bot

This comment was marked as outdated.

@codecov
Copy link

codecov bot commented Dec 14, 2022

Codecov Report

Merging #8131 (2268ee6) into master (c9bf941) will increase coverage by 0.264%.
The diff coverage is 56.604%.

❗ Current head 2268ee6 differs from pull request most recent head d1dbe69. Consider uploading reports for the commit d1dbe69 to get more accurate results

@@               Coverage Diff               @@
##              master     #8131       +/-   ##
===============================================
+ Coverage     86.362%   86.626%   +0.264%     
+ Complexity     39551     38919      -632     
===============================================
  Files           2362      2336       -26     
  Lines         186121    182603     -3518     
  Branches       20305     20062      -243     
===============================================
- Hits          160738    158181     -2557     
+ Misses         18236     17379      -857     
+ Partials        7147      7043      -104     
Files Changed Coverage
...lkers/vqsr/scalable/ExtractVariantAnnotations.java ø
...walkers/vqsr/scalable/ScoreVariantAnnotations.java 0.000%
.../scalable/data/LabeledVariantAnnotationsDatum.java ø
...scalable/modeling/BGMMVariantAnnotationsModel.java ø
...sr/scalable/modeling/VariantAnnotationsScorer.java ø
...able/ExtractVariantAnnotationsIntegrationTest.java ø
...rs/vqsr/scalable/TrainVariantAnnotationsModel.java 37.037%
...alable/modeling/PythonVariantAnnotationsModel.java 66.667%
...e/TrainVariantAnnotationsModelIntegrationTest.java 77.778%
...vqsr/scalable/LabeledVariantAnnotationsWalker.java 100.000%
... and 2 more

@samuelklee
Copy link
Contributor Author

samuelklee commented Feb 2, 2023

@koncheto-broad this is one of the VQSR-lite PRs you will want to eventually rebase on. It's still awaiting review (I was waiting until the dust from updating to Java 17 in #8035 settles), but if anyone from your team wants to take a first crack, feel free! Not too many code changes, so hopefully it should be pretty manageable.

Just so it's all in one place: your #8157 GVS branch is currently rebased on #7954, which contains the "serial SNP-then-indel" version of the Joint Genotyping WDL (written by Megan for Ultima) and the Java code for the tools. Some minor updates were made to the Java code in #8049 and the WDL was rewritten by me to do SNPs and indels in a single pass in #8074. (EDIT: I was originally confused here, the WDL that was replaced in this PR simply ran SNPs and indels separately, rather than serially—thanks to George for correcting me here.)

The PR here makes relatively minor updates to both the Java code and the WDL and might require very minor updates to GVS code or JSON configurations. And finally, the larger PR at #8132 adds a Pure Java BGMM backend. As we discussed during my mobbing presentation, this is provided merely as a convenience for those users that might not be able to control their python environment (hopefully a small number, these days!), so getting it merged is probably less urgent and should not affect any GVS work.

@samuelklee
Copy link
Contributor Author

Oh, also note that there might be some variable-name references in the Javadocs for the VQSR-lite tools that are not rendered properly in online docs; see #8146 for more context. However, if you're just looking at the Javadocs via IntelliJ, everything should look fine.

@koncheto-broad
Copy link

Ooh, thanks for the heads up Sam! It looks like, after George is done and we have done some head to head comparisons, this is definitely one of the PRs we'll need to yank over

…n and fixed other minor documentation issues.
@samuelklee
Copy link
Contributor Author

@meganshand I think this PR should go in the same GATK release needed for your broadinstitute/warp#1076 PR. Would you mind reviewing?

There are a couple of very minor argument changes that will bubble up to the WDL level---namely, the changes to the arguments for specifying the python scripts and allele-specific mode outlined in the bullets above.

@koncheto-broad @jonn-smith @droazen @MonicaDou24 might also want to take note.

@samuelklee samuelklee marked this pull request as ready for review September 21, 2023 13:51
@meganshand meganshand self-assigned this Sep 21, 2023
Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A few very minor questions.

@@ -121,34 +121,33 @@
*
* <ul>
* <li>
* Scored VCF file and index. The VCF will not be gzipped if the {@value DO_NOT_GZIP_VCF_OUTPUT_LONG_NAME}
* Scored VCF file and index. The VCF will not be gzipped if the "--do-not-gzip-vcf-output"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming there's a reason that all of the links were replaced in the documentation and that wasn't a mistake, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see #8146 and discussion linked above. I left the @link tags since they are more parseable, even when not rendered. Unfortunately, I think removing these tags will make the docs harder to maintain, but it seems like that issue is a wontfix at this point! :(

@samuelklee
Copy link
Contributor Author

Thanks for the speedy review, @meganshand!

@samuelklee samuelklee merged commit 7058e0c into master Sep 22, 2023
20 checks passed
@samuelklee samuelklee deleted the sl_lite_ablation branch September 22, 2023 17:01
rickymagner pushed a commit that referenced this pull request Nov 28, 2023
#8131)

* Performed a round of ablation on new annotation-based filtering tools.

* Removed Javadoc tags unsupported by Barclay in VETS tool documentation and fixed other minor documentation issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants