Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VS-1335 bgzip on extract ah var store edition #8809

Merged
merged 12 commits into from
May 8, 2024

Conversation

gbggrant
Copy link
Collaborator

@gbggrant gbggrant commented May 3, 2024

Optionally extract VCFs in bgzipped compression format.

Integration test - Here

Here is an example extract using bgzip
Here is an example extract not using bgzip (so, just '.gz')

@gbggrant gbggrant requested review from mcovarr and rsasch May 3, 2024 14:05
@@ -20,6 +20,7 @@ workflow GvsExtractCallset {
Int? scatter_count
Int? extract_memory_override_gib
Int? disk_override
Boolean bgzip_output_vcfs = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ticket says this should be off by default, although if we're running this WDL singly we're probably doing an AoU "small callset" where we actually do want this on...

So I'm not sure if we should leave this as you have it, or default to false and update the (currently nonexistent) AoU small callset docs to specify that this should be set to true?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That's what I was thinking. That is, have it off by default, but enable it here for the likely AoU small call sets case. I agree that it should be off by default.

@@ -31,6 +31,7 @@ workflow GvsExtractCohortFromSampleNames {
String? output_gcs_dir
# set to "NONE" if all the reference data was loaded into GVS in GvsImportGenomes
String drop_state = "NONE"
Boolean bgzip_output_vcfs = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same concerns here as per my earlier comment

@@ -75,6 +75,7 @@ workflow GvsQuickstartIntegration {
# necessarily the same as the branch name selected in Terra for the integration `GvsQuickstartIntegration` workflow,
# though in practice likely they are the same.
if (run_hail_integration) {
# This workflow will be the *closest* to AoU test.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pretty sure I know what you mean but I am struggling to parse this sentence

@@ -5,6 +5,8 @@ import "GvsQuickstartHailIntegration.wdl" as QuickstartHailIntegration
import "GvsJointVariantCalling.wdl" as JointVariantCalling
import "GvsUtils.wdl" as Utils

# This is a comment!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While factually correct, we might want to get rid of this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't like well commented code?

@@ -92,6 +92,7 @@
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN / VCF extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- For `GvsExtractCallset` only, set the `bgzip_output_vcfs` input to `true` to generate VCFs that are compressed using bgzip.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this means. Are there other single-step workflows that have this input? I think "If you want to output VCFs that are compressed using bgzip, set the GvsExtractCallset bgzip_output_vcfs input to true."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is is under the section that covers both GvsExtractCallset and GvsExtractCallsetPgenMerged so I think this is saying that these remarks apply only to the former.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I meant. I will update the documentation to make it clearer.

Copy link

@rsasch rsasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change/clarification, plus I agree with Miguel's comments about comments.

@@ -92,6 +92,7 @@
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN / VCF extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- For `GvsExtractCallset` only, set the `bgzip_output_vcfs` input to `true` to generate VCFs that are compressed using bgzip.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is is under the section that covers both GvsExtractCallset and GvsExtractCallsetPgenMerged so I think this is saying that these remarks apply only to the former.

Pass extension override to gzip so it can handle bgzipped files.
Copy link

@rsasch rsasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Miguel doesn't care that his comments about comments weren't addressed, I guess I don't either.

@gbggrant gbggrant merged commit bb957c7 into ah_var_store May 8, 2024
7 of 13 checks passed
@gbggrant gbggrant deleted the gg_VS-1335_BgzipOnExtractAhVarStoreEdition branch May 8, 2024 10:49
gbggrant added a commit that referenced this pull request May 8, 2024
* Optionally extract to bgz format.
* Set bgzipping to be off (everywhere) by default.
* Update assert_identical_outputs to handle bgzipped outputs.
gbggrant added a commit that referenced this pull request May 8, 2024
* Optionally extract to bgz format.
* Set bgzipping to be off (everywhere) by default.
* Update assert_identical_outputs to handle bgzipped outputs.
RoriCremer pushed a commit that referenced this pull request Jun 10, 2024
* Optionally extract to bgz format.
* Set bgzipping to be off (everywhere) by default.
* Update assert_identical_outputs to handle bgzipped outputs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants