Batched Avro export [VS-630] #8020

Merged on Sep 20, 2022 (4 commits)

@mcovarr (Collaborator) commented Sep 15, 2022

This PR batches Avro exports to address the scalability failings of unbatched Avro exports.

codecov bot commented Sep 15, 2022

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@0ef7433).
The diff coverage is n/a.

Additional details and impacted files
@@               Coverage Diff                @@
##             ah_var_store     #8020   +/-   ##
================================================
  Coverage                ?   86.244%           
  Complexity              ?     35197           
================================================
  Files                   ?      2173           
  Lines                   ?    165004           
  Branches                ?     17792           
================================================
  Hits                    ?    142306           
  Misses                  ?     16372           
  Partials                ?      6326           

@rsasch rsasch left a comment

bq query --nouse_legacy_sql --project_id=~{project_id} "
EXPORT DATA OPTIONS(
uri='${avro_prefix}/vets/vet_${str_table_index}/vet_${str_table_index}_*.avro', format='AVRO', compression='SNAPPY') AS
SELECT location, sample_id, ref, REPLACE(alt,',<NON_REF>','') alt, call_GT as GT, call_AD as AD, call_GQ as GQ, cast(SPLIT(call_pl,',')[OFFSET(0)] as int64) as RGQ

I don't think we need to extract PLs; it doesn't look like Tim uses them. Let's ask on Wednesday. (Fine to include for now, I just happened to notice it in this PR.)
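
For readers following this thread, the two string expressions in the quoted SELECT can be approximated in Python as below. This is only an illustrative sketch: the function names and sample values are hypothetical, and BigQuery NULL handling is ignored.

```python
def strip_non_ref(alt: str) -> str:
    # Mirrors REPLACE(alt, ',<NON_REF>', ''): removes every occurrence
    # of the literal ',<NON_REF>' from the alt string.
    return alt.replace(",<NON_REF>", "")


def first_pl_as_int(call_pl: str) -> int:
    # Mirrors CAST(SPLIT(call_pl, ',')[OFFSET(0)] AS INT64): takes the
    # first comma-separated field of call_pl and converts it to an integer.
    return int(call_pl.split(",")[0])


# Hypothetical input values, chosen only for illustration.
print(strip_non_ref("T,<NON_REF>"))
print(first_pl_as_int("0,30,450"))
```

Only the first field of `call_pl` is read by the query, which is the value it exports under the name `RGQ`.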

# appropriate partition, the outer '+ 1' is to iterate over the correct number of partitions.
scatter (i in range(((CountSamples.num_samples - 1) / 4000) + 1)) {
Int num_samples = CountSamples.num_samples
Int num_superpartitions = if (num_samples % 4000 == 0) then num_samples / 4000 else (num_samples / 4000 + 1)

A comment on this math would be helpful.
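
As one possible reading of the math in question, both expressions in the quoted snippet appear to compute the ceiling of `num_samples / 4000`. The Python sketch below compares them; it is not the WDL from the PR, the function and constant names are hypothetical, and it only covers `num_samples >= 1`.

```python
SAMPLES_PER_SUPERPARTITION = 4000  # the constant used in the quoted WDL


def superpartitions_conditional(num_samples: int) -> int:
    # Mirrors: if (n % 4000 == 0) then n / 4000 else (n / 4000 + 1)
    if num_samples % SAMPLES_PER_SUPERPARTITION == 0:
        return num_samples // SAMPLES_PER_SUPERPARTITION
    return num_samples // SAMPLES_PER_SUPERPARTITION + 1


def superpartitions_closed_form(num_samples: int) -> int:
    # Mirrors the scatter expression: ((n - 1) / 4000) + 1
    return (num_samples - 1) // SAMPLES_PER_SUPERPARTITION + 1


# Check values on either side of the superpartition boundaries.
for n in (1, 3999, 4000, 4001, 8000, 8001):
    assert superpartitions_conditional(n) == superpartitions_closed_form(n)
    print(n, superpartitions_conditional(n))
```

For positive sample counts the two forms agree: an exact multiple of 4000 needs no extra superpartition, and any remainder adds one. With zero samples they could differ if WDL integer division truncates toward zero (an assumption here), since the conditional form would yield 0 and the closed form 1.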

@mcovarr mcovarr merged commit ae56b88 into ah_var_store Sep 20, 2022
@mcovarr mcovarr deleted the vs_630_batched_avro_export branch September 20, 2022 20:10
This was referenced Mar 17, 2023