clustering vqsr tables by location #7656
Merged
I was bugged by the fact that @rsasch calculated that the latest callset read 1957 TB of data via the Read API, when we knew the core extract tables were only ~100 TB. It turns out that ~90% of that scanning was reading the various un-partitioned, un-clustered VQSR-related tables: each shard read an entire ~80 GB table, and multiplied across 20k shards that comes to about 1600 TB of reads. That represents ~20% of the total cost of making the 100k callset, so this is a big cost savings.
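The scan-volume estimate above can be checked with a quick back-of-envelope calculation (the 80 GB table size and 20k shard count are the approximate figures from the description):

```python
# Back-of-envelope check of the redundant scan volume described above.
table_size_gb = 80      # approximate size of each un-clustered VQSR-related table
num_shards = 20_000     # extract shards, each reading the full table

total_read_tb = table_size_gb * num_shards / 1000  # GB -> TB
print(total_read_tb)  # 1600.0 TB, matching the ~1600 TB of reads cited above
```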
Partitioning and clustering these tables by location allows each shard to read only the data it needs.
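As a rough illustration of the technique, the fix amounts to rebuilding each VQSR table with BigQuery's `PARTITION BY RANGE_BUCKET` and `CLUSTER BY` on the `location` column. This is a hedged sketch, not the PR's actual code: the table names, the `location` column, and the bucket boundaries are all assumptions for illustration.

```python
# Hypothetical sketch: generate DDL that copies a VQSR table into a
# partitioned + clustered layout so each extract shard scans only the
# partitions covering its genomic interval. Names and ranges are assumed.
def clustered_copy_ddl(src_table: str, dst_table: str) -> str:
    """Build a BigQuery DDL statement partitioning and clustering by location."""
    return (
        f"CREATE TABLE `{dst_table}`\n"
        # RANGE_BUCKET partitions integer `location` values into fixed-width
        # buckets; the bounds here are placeholder values, not the PR's.
        "PARTITION BY RANGE_BUCKET(location, "
        "GENERATE_ARRAY(0, 26000000000000, 6500000000))\n"
        "CLUSTER BY location\n"
        f"AS SELECT * FROM `{src_table}`"
    )

ddl = clustered_copy_ddl("dataset.vqsr_table", "dataset.vqsr_table_clustered")
print(ddl)
```

With this layout, a shard's query filtered on a `location` range prunes to the relevant partitions instead of scanning the whole ~80 GB table.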