
Rc vs 778 spike import limit #8197

Merged · 57 commits merged into ah_var_store from rc-VS-778-spike-import-limit on Mar 27, 2023
Conversation

@RoriCremer (Contributor) commented Feb 10, 2023

The goal of this PR is to adjust the ingest in two ways:

  1. Update the ingest to loop through all samples, not just the first 10k.
  2. Make the ingest far more efficient in a few ways:
    • Remove the files that are downloaded to each VM once they are no longer needed, so the VMs do not carry the extra weight around.
    • Check that the samples in the FOFNs (files of filenames) have not already been ingested, so no additional work is spent processing those samples.

There is still work to do to make the bulk ingest process significantly more user-friendly.
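The skip-already-ingested check described above can be sketched as follows. This is a minimal Python sketch, not the actual ingest code; the function and argument names are hypothetical.

```python
def filter_uningested(fofn_samples, ingested_ids):
    """Return only the samples from the FOFN that have not been ingested yet.

    fofn_samples: sample ids read from the file-of-filenames (FOFN), in order.
    ingested_ids: set of sample ids already present in the destination tables.
    (Both names are hypothetical, chosen for illustration.)
    """
    return [s for s in fofn_samples if s not in ingested_ids]

# Two of four samples were already loaded, so only the other two need work.
remaining = filter_uningested(["s1", "s2", "s3", "s4"], {"s2", "s4"})
```

Filtering before scheduling any work is what keeps the re-run cheap: already-loaded samples cost one set lookup instead of a full download-and-import cycle.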

@RoriCremer RoriCremer marked this pull request as ready for review February 21, 2023 04:23
koncheto-broad and others added 23 commits February 22, 2023 21:58
1. We do NOT want to assume that the sample ids we want are in the name field. Pass that through as a parameter.
2. We want to explicitly pause every 500 samples, since that is our page size. This slows our requests down enough to avoid spamming the backend server and hitting 503 errors, although it also slows the rate at which we can write the files if the dataset is very large. That shouldn't be a concern: as long as it doesn't cause errors, it is still a hands-off process.
3. We want to account for heterogeneous data. In AoU Delta, for instance, the control samples keep their vcf and vcf_index data in a different field. This would cause the whole ingest to fail if we weren't accounting for it explicitly; we now generate an errors.txt file holding each row for which we couldn't find the correct columns, so they can be examined later.
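Points 2 and 3 above can be sketched together in Python. This is an illustrative sketch, not the repository's script: the function names, the `vcf`/`vcf_index` field names on each row, and the `fetch_page` callback are all hypothetical stand-ins for the real paging and column-lookup logic.

```python
import time

PAGE_SIZE = 500  # the page size mentioned above (constant name is hypothetical)

def write_sample_files(samples, fetch_page, pause_seconds=1.0):
    """Fetch sample rows one page at a time, pausing between pages so the
    backend is not hit fast enough to return 503 errors.

    Rows missing the expected vcf columns (e.g. heterogeneous control
    samples) are collected separately, mirroring the errors.txt idea."""
    rows, errors = [], []
    for start in range(0, len(samples), PAGE_SIZE):
        page = samples[start:start + PAGE_SIZE]
        for row in fetch_page(page):
            if "vcf" not in row or "vcf_index" not in row:
                errors.append(row)  # would be written to errors.txt for review
            else:
                rows.append(row)
        time.sleep(pause_seconds)  # explicit pause after every page of samples
    return rows, errors

# Hypothetical demo: control samples keep their vcf data elsewhere, so the
# expected columns are absent from their rows.
def _demo_fetch(page):
    return [{"name": s, "vcf": s + ".vcf", "vcf_index": s + ".vcf.idx"}
            if not s.startswith("control") else {"name": s}
            for s in page]

ok_rows, bad_rows = write_sample_files(["s1", "s2", "control1"], _demo_fetch,
                                       pause_seconds=0.0)
```

The design choice is that a malformed row degrades into an entry in the error list instead of aborting the whole run, which matters when one odd sample shouldn't sink a many-thousand-sample ingest.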
… data table and being slightly more informative in the output of the python script
@codecov bot commented Feb 23, 2023

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@4a1c203).
The diff coverage is n/a.

Additional details and impacted files
@@               Coverage Diff                @@
##             ah_var_store     #8197   +/-   ##
================================================
  Coverage                ?   83.979%           
  Complexity              ?     34803           
================================================
  Files                   ?      2194           
  Lines                   ?    167039           
  Branches                ?     18005           
================================================
  Hits                    ?    140278           
  Misses                  ?     20534           
  Partials                ?      6227           

@gatk-bot commented Mar 7, 2023

GitHub Actions tests reported job failures from actions build 4358183003.
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| cloud | 8 | 4358183003.10 | logs |

@rsasch left a comment

first pass of comments

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsPrepareBulkImport.wdl (outdated, resolved)
@mcovarr (Collaborator) left a comment

first pass

scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsImportGenomes.wdl (outdated, resolved)
scripts/variantstore/wdl/GvsPrepareBulkImport.wdl (outdated, resolved)

>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:2023-1-20-FOFN"
A collaborator left a comment on the lines above:
ISO 8601 nit: names should look like YYYY-MM-DD, i.e. always two digits for month and day. That way they sort chronologically and lexically at the same time. 🙂
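The point of the nit is easy to demonstrate with plain Python (the tag values here are made up for illustration):

```python
# Zero-padded ISO 8601 names sort the same way lexically and chronologically.
padded = ["2023-01-20-FOFN", "2023-02-03-FOFN", "2023-11-01-FOFN"]
assert sorted(padded) == padded  # lexical order already matches date order

# Without padding, lexical order diverges from date order: February ("2023-2")
# sorts AFTER November ("2023-11") because '2' > '1' at the same position.
unpadded = ["2023-2-03-FOFN", "2023-11-01-FOFN"]
assert sorted(unpadded) == ["2023-11-01-FOFN", "2023-2-03-FOFN"]
```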

@RoriCremer RoriCremer merged commit c0535f2 into ah_var_store Mar 27, 2023
@RoriCremer RoriCremer deleted the rc-VS-778-spike-import-limit branch March 27, 2023 14:20
Labels: none yet
Projects: none yet
5 participants