clustering vqsr tables by location #7656

Merged
merged 1 commit into from
Jan 31, 2022

@kcibul kcibul commented Jan 29, 2022

I was bugged by the fact that @rsasch calculated that the latest callset read 1957 TB of data via the Read API, when we knew the core extract tables were only ~100 TB. It turns out that 90% of that scanning was reading the various unpartitioned, unclustered VQSR-related tables. Each shard read the entire table (~80 GB), and multiplied by 20k shards that comes to about 1600 TB of reads. That cost represents ~20% of the total cost of making the 100k callset, so eliminating it is a big cost saving.
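
As a rough check of that arithmetic, here is a minimal sketch using only the figures quoted above. The decimal conversion (1 TB = 1000 GB) and the variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the read volume quoted above.
TABLE_SIZE_GB = 80     # approximate size of the table each shard read in full
NUM_SHARDS = 20_000    # "20k shards"
GB_PER_TB = 1_000      # assumption: decimal units

# Every shard scans the whole table, so the reads scale with the shard count.
vqsr_reads_tb = TABLE_SIZE_GB * NUM_SHARDS / GB_PER_TB
print(vqsr_reads_tb)  # 1600.0
```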

Clustering and partitioning these tables allows each shard to just consume the data it needs.
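
The effect can be illustrated with a toy model. This is not BigQuery's actual storage layout and not the code in this PR. It only assumes that rows are sorted by location and stored in blocks with known minimum and maximum values, so a reader can skip blocks outside its range. All names and sizes below are hypothetical.

```python
def make_blocks(locations, block_size):
    """Sort rows by location and split them into fixed-size blocks."""
    rows = sorted(locations)
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def rows_scanned_clustered(blocks, lo, hi):
    """Count rows in blocks whose [min, max] range overlaps [lo, hi]."""
    return sum(len(b) for b in blocks if b[0] <= hi and b[-1] >= lo)


def rows_scanned_unclustered(locations):
    """Without ordering, any block may hold matching rows, so all are read."""
    return len(locations)


# Hypothetical table: 100,000 rows, one per location, in arbitrary order.
locations = list(range(100_000))[::-1]
blocks = make_blocks(locations, block_size=1_000)

# Hypothetical shard covering locations 15,000 through 19,999.
clustered = rows_scanned_clustered(blocks, 15_000, 19_999)
unclustered = rows_scanned_unclustered(locations)
print(clustered, unclustered)  # 5000 100000
```

In this model the shard reads 5 of the 100 blocks instead of all of them, which is the kind of reduction the change is aiming for.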

@rsasch rsasch left a comment

good catch 👍🏻

@kcibul kcibul merged commit 0ca6abb into ah_var_store Jan 31, 2022
@kcibul kcibul deleted the kc_cluster_vqsr branch January 31, 2022 15:39
This was referenced Mar 17, 2023