Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BGE settings for Reblocking and JointCalling pipelines #1076

Merged
merged 18 commits into from
Feb 6, 2024
Merged

Conversation

meganshand
Copy link
Contributor

Description

This is a draft of how we'd like to run joint calling on BGE with a few changes to the current pipeline:

  • Use an updated variant calling interval list (bge v1.1 calling intervals)
  • Reblocking
    • Use an updated version of ReblockGVCFs to work with DRAGEN input
    • Added argument to strip “PRI” annotation: --format-annotations-to-remove PRI
    • Add argument to move site level filters to genotype level instead: --add-site-filters-to-genotype
    • No need to do extra blocking at this step since the block sizes are now set in the DRAGEN/single sample pipeline
  • Filtering
    • Allow for the option to use the newer VETS pipeline rather than VQSR in JointGenotyping
    • Hard filter sites that land outside of the targeted regions (but within the exome variant calling interval list)
    • Train the VETS model only over the targeted regions of the BGE (where there's expected coverage)
    • Provide the SCORE annotation calculated by the VETS model for all variants, but don't use it as an applied filter so that users can choose their own cutoff values

Making two paths in the JointCalling pipeline for two different filtering strategies was a bit messy, so I'm happy to hear if there are any recommendations on how to clean that up. I also tried to run the tests that already exist for these workflows, but ran into errors that seemed unrelated.

@jessicaway For now I left out any changes to the ByChromosome workflows and we can cross that bridge when we come to it.


Checklist

If you can answer "yes" to the following items, please add a checkmark next to the appropriate checklist item(s) and notify our WARP documentation team by tagging either @ekiernan or @kayleemathews in a comment on this PR.

  • Did you add inputs, outputs, or tasks to a workflow?
  • Did you modify, delete or move: file paths, file names, input names, output names, or task names?
  • If you made a changelog update, did you update the pipeline version number?

@github-actions
Copy link

github-actions bot commented Sep 8, 2023

Remember to squash merge!

2 similar comments
@github-actions
Copy link

Remember to squash merge!

@github-actions
Copy link

Remember to squash merge!

@meganshand
Copy link
Contributor Author

@jessicaway @samuelklee This is ready for review. As discussed in the meeting this morning the two remaining issues are:

  • We'll need to update the version of GATK once it's been released
  • Test files still need to be copied into the appropriate bucket for all the tests to pass here

I don't think either of these should stop an initial review though. Thanks!

@meganshand
Copy link
Contributor Author

@ekiernan This PR will need some documentation updates for the BGE pipeline. I also submitted a request through the portal. Happy to work with you on this.

@samuelklee
Copy link
Contributor

Great, thanks @meganshand! Should be able to get an initial review back to you in the next couple days.

Would be great to also get broadinstitute/gatk#8131 in before the next GATK release. Let's chat about whether that necessitates any changes here---I don't think it should, but that PR has been on the back burner for a bit, so I'll have to double check.

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @meganshand, just a few minor comments/questions from me!


if (run_vets) {
String allele_specific_extra_args = if allele_specific_annotations then " --use-allele-specific-annotations " else ""
String resource_args = " --resource:hapmap,training=true,calibration=true " + hapmap_resource_vcf +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to expose this and demote the default resources from required arguments at some point, but this is fine for now.

if (num_gvcfs > snps_variant_recalibration_threshold) {
call Tasks.SNPsVariantRecalibratorCreateModel {

if (run_vets) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we'd have separate VQSR and VETS subworkflows, and the latter would be a direct import of the JointVcfFiltering WDL in the GATK repo. Either we'll get there in the future or we'll remove VQSR, let's see which happens first!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I can take a stab at pulling out VQSR into it's own subworkflow if that's helpful, but it sounds like now might not be the right time if we're close to deprecating VQSR completely.

@@ -1,3 +1,9 @@
# 2.1.6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to update some of these changelogs now that #1081 is in?

Copy link
Contributor Author

@meganshand meganshand Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did this to myself :(

I'll fix this when I have to resolve conflicts.

"JointGenotyping.one_thousand_genomes_resource_vcf": "gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
"JointGenotyping.one_thousand_genomes_resource_vcf_index": "gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi",

"JointGenotyping.snp_recalibration_annotation_values": ["QD", "MQRankSum", "ReadPosRankSum", "FS", "MQ", "SOR", "DP"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these not allele-specific annotations, when use_allele_specific_annotations = true below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo good catch, I think dragen might not be able to generate allele specific annotations, or at the very least we're not currently set up to do that for BGE so we might have to run without allele specific annotations for now. I changed use_allele_specific_annotations to false for now, but this will also be a non-issue when we move to the new version of GATK/VETS (based on your other PR)

"JointGenotyping.callset_name": "bge_joint_genotyping_plumbing",
"JointGenotyping.run_vets": true,

"JointGenotyping.unpadded_intervals_file": "",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were unpadded_intervals_file, eval_interval_list, and targets_interval_list intentionally left empty for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I need to work with @jessicaway on copying the input files (the Twist targets interval list and the sample files) into the test bucket get these two new tests to run.

@@ -273,6 +273,7 @@ task HardFilterAndMakeSitesOnlyVcf {

String variant_filtered_vcf_filename
String sites_only_vcf_filename
File? targets_interval_list # filters out all variants outside of the targets interval list for targetted sequencing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targetted -> targeted.

I'm also unclear about whether such changes to tasks need to be reflected in a changelog somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in a line in the JointGenotyping and UltimaJointGenotyping changelogs (which I think are the only pipelines that uses this task):

Added option to hard filter sites outside of provided interval list

Let me know if there's a clearer way to put that.

Copy link

Remember to squash merge!

Copy link

Remember to squash merge!

@meganshand
Copy link
Contributor Author

The failing smart tests here are due to a change in the VCF header when updating to GATK 4.5.0.0:

< ##INFO=<ID=RAW_GT_COUNT,Number=3,Type=Integer,Description="Counts of genotypes w.r.t. the reference allele in the following order: 0/0, 0/*, */*, i.e. all alts lumped together; for use in calculating excess heterozygosity">
---
> ##INFO=<ID=RAW_GT_COUNT,Number=3,Type=Integer,Description="Counts of genotypes w.r.t. the reference allele: 0/0, 0/*, */*, i.e. all alts lumped together; for use in calculating excess heterozygosity">

This change seems harmless to me, but I'm not sure who needs to sign off on this change. The PR with the rest of the GATK updates to 4.5.0.0 is still in progress, but should be coming soon.

Copy link

Remember to squash merge!

Copy link

Remember to squash merge!

@meganshand
Copy link
Contributor Author

Failing Scientific Tests that can be updated:

  • JointGenotyping: BGE test is failing because the truth data is out of data. This can safely be updated as these differences are expected.
  • ReblockGVCFs: Tests are failing because documentation on a header line changed (see comment above). This can safely be updated.
  • VariantCalling: Same header line change as ReblockGVCFs. This can be updated.

Copy link

github-actions bot commented Feb 5, 2024

Remember to squash merge!

Copy link

github-actions bot commented Feb 5, 2024

Remember to squash merge!

@nikellepetrillo
Copy link
Contributor

@nikellepetrillo nikellepetrillo merged commit 45928e3 into develop Feb 6, 2024
4 of 6 checks passed
@meganshand meganshand deleted the ms_bge branch March 26, 2024 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants