VS-1368 The tarball is too damn big #8829

Merged
gbggrant merged 6 commits into ah_var_store from gg_VS-1368_TarballIsTooBig on May 13, 2024

gbggrant
Collaborator

This PR does 2 things to address the size of the tarball:

  1. Compresses it.
  2. Strips out all of the 'non-standard' contigs from the interval list headers.
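A minimal sketch of the two steps above, not the code from this PR (the commit notes say the actual change greps the interval_list files inside the WDL). It assumes the Picard-style interval_list layout, where contigs appear as `@SQ` header lines carrying an `SN:` field, and it assumes "standard" means chr1 through chr22 plus chrX, chrY and chrM. The function names are hypothetical. Interval records are passed through untouched, since the description only mentions the headers.

```python
import os
import re
import tarfile

# Assumption: "standard" contigs are chr1-chr22, chrX, chrY and chrM.
STANDARD_CONTIG = re.compile(r"^chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def sequence_name(header_line):
    """Return the SN: value of an @SQ header line, or None if absent."""
    for field in header_line.rstrip("\n").split("\t")[1:]:
        if field.startswith("SN:"):
            return field[3:]
    return None


def strip_nonstandard_contigs(lines):
    """Drop @SQ header lines whose contig is not in the standard set.

    Other header lines and the interval records pass through unchanged.
    """
    kept = []
    for line in lines:
        if line.startswith("@SQ"):
            name = sequence_name(line)
            if name is None or not STANDARD_CONTIG.match(name):
                continue
        kept.append(line)
    return kept


def write_compressed_tarball(paths, out_path):
    """Write the given files into a gzip-compressed tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        for path in paths:
            # Store each file under its base name only.
            tar.add(path, arcname=os.path.basename(path))
```

Anchoring the pattern at both ends matters here: an unanchored match on `chr1` would also keep names such as `chr1_KI270706v1_random`.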

Example run here.

Collaborator

@mcovarr mcovarr left a comment

OOC what is the squish factor?

scripts/variantstore/wdl/GvsUtils.wdl (outdated, resolved)
@gbggrant
Collaborator Author

It reduces the size of just the header lines in the interval list from 581689 bytes to 3976 bytes. So the headers end up at about 0.0068 of their original size (roughly 0.7%, or about a 146x reduction).
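For reference, the arithmetic behind that figure can be checked with a short sketch. The byte counts are the ones quoted in the comment; the variable names are illustrative only.

```python
# Header sizes quoted in the comment, in bytes.
before_bytes = 581689
after_bytes = 3976

# Fraction of the original header size that remains.
remaining_fraction = after_bytes / before_bytes

# How many times smaller the stripped headers are.
reduction_factor = before_bytes / after_bytes

print(f"remaining fraction: {remaining_fraction:.4f}")  # 0.0068
print(f"reduction factor: {reduction_factor:.0f}x")     # 146x
```

Note that this covers only the interval list headers, not the overall tarball size.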


@koncheto-broad koncheto-broad left a comment

I still feel that a whitelist vs blacklist approach would be cleaner, as it would allow us to avoid the grep regex filtering. It would let us keep what we want by doing it in a way that is genomically meaningful--specifying the intervals that we want to keep and tossing everything that isn't in them by intersecting our whitelist with the intervals we are passed--instead of relying on string manipulation. But I'm willing to give this a thumb as long as we ensure that we go back and do it in a cleaner way later. We want to get something workable in place sooner rather than later, after all.
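A rough sketch of what the interval-based alternative described above could look like, under stated assumptions: intervals are `(contig, start, end)` tuples with 1-based inclusive coordinates, and the list of spans to keep is a plain dictionary from contig name to spans. The function name and data shapes are hypothetical; a real pipeline would more likely delegate this to an existing interval tool than hand-roll it.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def intersect_with_whitelist(
    intervals: List[Interval],
    whitelist: Dict[str, List[Tuple[int, int]]],
) -> List[Interval]:
    """Keep only the parts of each interval that overlap a wanted span.

    Intervals on contigs absent from the whitelist are dropped entirely,
    so no name-based filtering of contigs is needed.
    """
    kept = []
    for contig, start, end in intervals:
        for w_start, w_end in whitelist.get(contig, []):
            lo = max(start, w_start)
            hi = min(end, w_end)
            if lo <= hi:  # non-empty overlap
                kept.append((contig, lo, hi))
    return kept


# Illustrative data only; coordinates are made up.
wanted = {"chr1": [(1, 1000)]}
passed_in = [
    ("chr1", 500, 1500),
    ("chr1", 2000, 3000),
    ("chrUn_KI270302v1", 1, 100),
]
result = intersect_with_whitelist(passed_in, wanted)
```

Here `result` holds the single clipped interval `("chr1", 500, 1000)`: the second chr1 interval falls outside the wanted span and the unplaced contig is not listed at all.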

Also, I'd probably be compelled to give it a thumb anyway just for the name of the PR.

@gbggrant gbggrant merged commit 527be7a into ah_var_store May 13, 2024
17 checks passed
@gbggrant gbggrant deleted the gg_VS-1368_TarballIsTooBig branch May 13, 2024 18:18
gbggrant added a commit that referenced this pull request May 13, 2024
* Compressing the tarball saves a bit.
* Remove unused contigs from interval_list files by grepping.
---------

Co-authored-by: Miguel Covarrubias <[email protected]>
RoriCremer pushed a commit that referenced this pull request Jun 10, 2024
* Compressing the tarball saves a bit.
* Remove unused contigs from interval_list files by grepping.
---------

Co-authored-by: Miguel Covarrubias <[email protected]>
3 participants