clustering vqsr tables by location #7656
Merged
I was bugged by the fact that @rsasch calculated that the latest callset read 1957 TB of data via the Read API, when we knew the core extract tables were only ~100 TB. It turns out that ~90% of that scanning was reading the various un-partitioned, un-clustered VQSR-related tables: each shard read an entire ~80 GB table, and multiplied across 20k shards that comes to about 1600 TB of reads. That represents ~20% of the total cost of making the 100k callset, so this is a big cost savings.
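The scan-volume estimate above can be checked with a quick back-of-envelope calculation (the 80 GB table size and 20k shard count are the approximate figures from the description):

```python
# Back-of-envelope check of the redundant scan volume described above.
table_size_gb = 80      # approximate size of each un-clustered VQSR-related table
num_shards = 20_000     # extract shards, each reading the full table

total_read_tb = table_size_gb * num_shards / 1000  # GB -> TB
print(total_read_tb)  # 1600.0 TB, matching the ~1600 TB of reads cited above
```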
Partitioning and clustering these tables by location allows each shard to read only the data it needs.
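As a rough illustration of the technique, the fix amounts to rebuilding each VQSR table with BigQuery's `PARTITION BY RANGE_BUCKET` and `CLUSTER BY` on the `location` column. This is a hedged sketch, not the PR's actual code: the table names, the `location` column, and the bucket boundaries are all assumptions for illustration.

```python
# Hypothetical sketch: generate DDL that copies a VQSR table into a
# partitioned + clustered layout so each extract shard scans only the
# partitions covering its genomic interval. Names and ranges are assumed.
def clustered_copy_ddl(src_table: str, dst_table: str) -> str:
    """Build a BigQuery DDL statement partitioning and clustering by location."""
    return (
        f"CREATE TABLE `{dst_table}`\n"
        # RANGE_BUCKET partitions integer `location` values into fixed-width
        # buckets; the bounds here are placeholder values, not the PR's.
        "PARTITION BY RANGE_BUCKET(location, "
        "GENERATE_ARRAY(0, 26000000000000, 6500000000))\n"
        "CLUSTER BY location\n"
        f"AS SELECT * FROM `{src_table}`"
    )

ddl = clustered_copy_ddl("dataset.vqsr_table", "dataset.vqsr_table_clustered")
print(ddl)
```

With this layout, a shard's query filtered on a `location` range prunes to the relevant partitions instead of scanning the whole ~80 GB table.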