Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rightsize import batches [VS-486] #7925

Merged
merged 5 commits into from
Jul 6, 2022

Conversation

mcovarr
Copy link
Collaborator

@mcovarr mcovarr commented Jun 29, 2022

Automatically choose appropriate import batch, preemptible, and max retry values for sample sets up to 20K. This makes 20K samples our effective threshold for where we feel comfortable with the current stability / performance of our import code.

else if (num_samples <= 100) then 1
else if (num_samples <= 5000) then 5
else if (num_samples <= 20000) then 10
else 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps out of scope for this pr, but two suggestions:

  1. let's maybe do this as a rounded fraction based on the max # of cromwell vms we feel comfortable running (rather than a step function)
  2. do we want to throw an error or maybe just a warning if someone is trying to run a callset with > 20k samples and they haven't explicitly set a load_data_batch_size?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bwahaha this is why I want to scale test with exomes)

@codecov
Copy link

codecov bot commented Jun 29, 2022

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@874d615). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff                @@
##             ah_var_store     #7925   +/-   ##
================================================
  Coverage                ?   86.288%           
  Complexity              ?     35189           
================================================
  Files                   ?      2170           
  Lines                   ?    164888           
  Branches                ?     17786           
================================================
  Hits                    ?    142278           
  Misses                  ?     16288           
  Partials                ?      6322           

@mcovarr mcovarr force-pushed the vs_486_rightsize_import_batches branch from 4cc3c7d to 9636547 Compare June 30, 2022 23:45
@mcovarr mcovarr changed the title Vs 486 rightsize import batches Rightsize import batches [VS-486] Jul 5, 2022
@mcovarr mcovarr marked this pull request as ready for review July 5, 2022 13:33
@mcovarr
Copy link
Collaborator Author

mcovarr commented Jul 5, 2022

Successful run of GvsJointVariantCalling for this branch. Note the 10 import shards with 1 sample each that were created by default for this sample set of size 10.

Int max_auto_batch_size = 20000

if ((num_samples > max_auto_batch_size) && !(defined(load_data_batch_size))) {
call Utils.TerminateWorkflow as DieDueToTooManySamplesWithoutExplicitLoadDataBatchSize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love this

# At least 1, per limits above not more than 20.
Int effective_load_data_batch_size = if (defined(load_data_batch_size)) then select_first([load_data_batch_size])
else if num_samples < 1000 then 1
else num_samples / 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to floor / ceil or round this in some way? Or does the Int type do that for us? Could it get us into trouble?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be flooring for us.

Int effective_load_data_maxretries = if (defined(load_data_maxretries_override)) then select_first([load_data_maxretries_override])
else if (effective_load_data_batch_size < 12) then 3
else effective_load_data_batch_size / 4

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if you are already changing things---I would appreciate a little more prose on why this is being done this way

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok, I'll try to come up with something... 😅

@mcovarr mcovarr force-pushed the vs_486_rightsize_import_batches branch from 9636547 to 33ab249 Compare July 6, 2022 20:09
@mcovarr mcovarr merged commit 3e60f0d into ah_var_store Jul 6, 2022
@mcovarr mcovarr deleted the vs_486_rightsize_import_batches branch July 6, 2022 21:29
This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants