Batching of samples for create import TSVs #7382

Merged 3 commits on Jul 30, 2021

@kcibul kcibul commented Jul 29, 2021

Processing an exome takes ~1 minute, so most of each task's time goes to spinning up a VM, pulling Docker images, etc., which is not cost efficient. This PR adds a batch_size parameter so that each task processes that many samples as a unit. The default is 1, which preserves the current behavior; for exomes I have set it to 20 and seen the cost to ingest drop dramatically.
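To make the amortization argument concrete, here is a rough sketch of the arithmetic. The ~1 minute per exome comes from the description above; the 5-minute fixed overhead for VM startup and image pulls is purely an assumed, illustrative number, and `overhead_fraction` is a hypothetical helper, not something in this PR.

```python
def overhead_fraction(batch_size, per_sample_min=1.0, overhead_min=5.0):
    """Fraction of one task's wall-clock time spent on fixed startup overhead.

    per_sample_min: ~1 minute per exome (from the PR description).
    overhead_min:   ASSUMED value for VM spin-up plus docker pull; not measured.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    work_min = batch_size * per_sample_min
    return overhead_min / (overhead_min + work_min)


if __name__ == "__main__":
    # Compare the default (1 sample per task) with the exome setting (20).
    for batch_size in (1, 20):
        print(batch_size, round(overhead_fraction(batch_size), 2))
```

Under these assumed numbers, overhead falls from roughly 83% of task time at batch_size 1 to 20% at batch_size 20; the real figures depend on actual startup times.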

The GitHub PR makes it look like a lot has changed but really the changes are:

  • a new parameter
  • a new task to turn the Array[File] of VCFs into a set of FOFNs (file-of-file-names), similar to how we split up intervals
  • a loop in the actual Create TSV task over the files in each FOFN. For SA mode we copy down each file; for non-SA mode we rely on the fact that localization is optional and read them directly anyway
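
The batching idea in the list above can be sketched in Python. This is only an illustration of the technique, not the WDL in this PR: the function names, the `batch_00000.fofn` naming scheme, and the `handle_sample` callback are all hypothetical.

```python
import os
import tempfile


def write_fofns(vcf_paths, batch_size, out_dir):
    """Split a list of VCF paths into FOFNs holding at most batch_size paths each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fofns = []
    for start in range(0, len(vcf_paths), batch_size):
        batch = vcf_paths[start:start + batch_size]
        # Hypothetical naming scheme; zero-padded so the files sort in batch order.
        fofn_path = os.path.join(out_dir, f"batch_{start // batch_size:05d}.fofn")
        with open(fofn_path, "w") as fh:
            fh.write("\n".join(batch) + "\n")
        fofns.append(fofn_path)
    return fofns


def read_fofn(fofn_path):
    """Return the non-empty lines of a FOFN as a list of paths."""
    with open(fofn_path) as fh:
        return [line.strip() for line in fh if line.strip()]


def process_batch(fofn_path, handle_sample):
    """Loop over every file named in one FOFN, as a single task would."""
    for vcf_path in read_fofn(fofn_path):
        handle_sample(vcf_path)


if __name__ == "__main__":
    # Made-up bucket paths, for illustration only.
    vcfs = [f"gs://example-bucket/sample_{i}.vcf.gz" for i in range(45)]
    with tempfile.TemporaryDirectory() as tmp:
        for fofn in write_fofns(vcfs, 20, tmp):
            process_batch(fofn, lambda path: None)
            print(os.path.basename(fofn), len(read_fofn(fofn)))
```

With batch_size left at 1, every FOFN holds a single path, which corresponds to the one-sample-per-task behavior described as the default.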

@kcibul kcibul marked this pull request as ready for review July 30, 2021 15:22
@kcibul kcibul requested a review from ahaessly July 30, 2021 15:22

@ahaessly ahaessly left a comment

👍

@ahaessly ahaessly merged commit b7e06b9 into ah_var_store Jul 30, 2021
@ahaessly ahaessly deleted the kc_batch_tsv branch July 30, 2021 20:08
This was referenced Mar 17, 2023