
updates to ImportGenomes and LoadBigQueryData #7112

Merged: 21 commits merged into ah_var_store from mmt_vcf_inputs on Mar 2, 2021

Conversation

@mmorgantaylor (Member) commented on Feb 26, 2021:

Changes in this PR:

  • resolves specops issue #247 ("make 100% sure that allele ordering is preserved in htsjdk"): ImportGenomes.wdl now takes an Array[File] from the data table as its vcf input
  • refactors LoadBigQueryData.wdl back into ImportGenomes
  • returns an error if the bq load step fails (the workflow was previously succeeding silently when this step failed); see the sketch after this list
  • checks for existing tables with bq show rather than the csv file - this should still be safe against a race condition because of @ericsong's refactoring to prevent the CreateTables step from being scattered
  • runs CreateTables at the start (it no longer waits for CreateImportTsvs)
  • does NOT use a preemptible VM for the LoadTables step, to minimize (though not eliminate) the possibility of loading a duplicate set of data (see specops issue #248, "investigate replacing R plotting with javascript plotting", for further discussion)
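For context, a minimal WDL 1.0 sketch of what the load step described above could look like; the task name, inputs, and docker image are illustrative placeholders rather than the exact code in this PR. It checks for an existing table with bq show, creates the table only if missing, lets a failed bq load fail the task, and runs non-preemptibly per the last bullet.

version 1.0

# illustrative sketch only, not the exact task from this PR
task LoadTables {
  input {
    String project_id
    String dataset_name
    String table_name
    File schema
    File tsv_file
  }

  command <<<
    set -e  # any failing command, including bq load, now fails the task

    # create the table only if it does not already exist (checked with bq show
    # instead of the csv file)
    if ! bq --project_id=~{project_id} show ~{dataset_name}.~{table_name} > /dev/null 2>&1; then
      bq --project_id=~{project_id} mk --table ~{dataset_name}.~{table_name} ~{schema}
    fi

    # load the TSV; a non-zero exit status propagates and fails the workflow
    bq --project_id=~{project_id} load --source_format=CSV -F "\t" \
      ~{dataset_name}.~{table_name} ~{tsv_file} ~{schema}
  >>>

  runtime {
    docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:slim"
    preemptible: 0  # never run on a preemptible VM, so a preempted-and-retried load cannot insert duplicate rows
  }
}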

Testing:

  • these changes were tested in Terra, and the BQ outputs were checked and verified

@mmorgantaylor (Member Author) commented:

closing - refactoring to one WDL

@mmorgantaylor reopened this on Feb 26, 2021
@kcibul (Contributor) left a comment:

I've got a bunch of questions, but I think they mostly stemmed from the previous structure of Load and Create tables... We should clean that up, but we can also split that out of this PR and make it a different ticket (or part of the productionization ticket).

String numbered
String partitioned
String uuid
Array[String] tsv_creation_done
Contributor commented:

what does this do?

@mmorgantaylor (Member Author) replied:

Requiring this as an input ensures that CreateImportTsvs runs before CreateTables can start; it has been removed here and added to LoadTables instead (see the sketch below).
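To make that ordering trick concrete, here is a stripped-down WDL 1.0 sketch of how passing one task's outputs as an otherwise unused input forces Cromwell to finish every scattered CreateImportTsvs shard before the downstream task starts; the workflow name and task bodies are placeholders, not the real commands from this workflow.

version 1.0

workflow OrderingSketch {
  input {
    Array[File] input_vcfs
  }

  scatter (vcf in input_vcfs) {
    call CreateImportTsvs { input: input_vcf = vcf }
  }

  call LoadTables {
    input:
      # LoadTables never reads this array; depending on it means every
      # CreateImportTsvs shard must finish before LoadTables can start
      tsv_creation_done = CreateImportTsvs.done
  }
}

task CreateImportTsvs {
  input {
    File input_vcf
  }
  command <<<
    echo "write import TSVs for ~{input_vcf}"
  >>>
  output {
    String done = "true"
  }
  runtime { docker: "ubuntu:20.04" }
}

task LoadTables {
  input {
    Array[String] tsv_creation_done
  }
  command <<<
    echo "load TSVs into BigQuery"
  >>>
  runtime { docker: "ubuntu:20.04" }
}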

schema = metadata_schema,
superpartitioned = "false",
partitioned = "false",
uuid = "",
@mmorgantaylor (Member Author) commented:

@kcibul @ahaessly should we include uuid as an optional input to the WDL, or rip this option (to prepend table names with a uuid) out altogether? It is currently just hardcoded to an empty string.

Contributor replied:

I haven't used the UUID piece before; I think it was from earlier testing, but now I would just create a new dataset instead of tables with a prefix. Remove it? (@ahaessly wdyt?)

Contributor replied:

This was definitely used for automated integration testing; I think Megan added it. If we wanted to add a uuid to the dataset, I think we would need to create that dataset outside of this wdl, but we should be able to do that in the test itself. Assuming we are not running that integration test, I would say let's go ahead and remove it.

@mmorgantaylor (Member Author) replied:

ok, let's keep it for now until we decide what we're doing with integration testing.
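For reference, if uuid were kept as an optional input, a defaulted input in WDL 1.0 could look roughly like the sketch below; the prefixing logic is an assumption about how the uuid is applied to table names, not code from this PR.

version 1.0

task CreateTables {
  input {
    String dataset_name
    String table_base_name
    String uuid = ""  # defaulted, so callers can simply omit it
  }

  # prepend the uuid to the table name only when one was supplied
  # (an assumed naming convention, for illustration only)
  String table_name = if uuid != "" then uuid + "_" + table_base_name else table_base_name

  command <<<
    echo "would create ~{dataset_name}.~{table_name}"
  >>>

  output {
    String created_table = table_name
  }

  runtime { docker: "ubuntu:20.04" }
}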

input {
String project_id
String dataset_name
String storage_location
Contributor commented:

I don't think this is used anymore?

@mmorgantaylor (Member Author) replied:

nice catch!

@ahaessly (Contributor) left a comment:

👍

@mmorgantaylor merged commit b1f6753 into ah_var_store on Mar 2, 2021
@mmorgantaylor deleted the mmt_vcf_inputs branch on Mar 2, 2021 at 17:09
kcibul pushed a commit that referenced this pull request Mar 9, 2021
* revert input_vcfs to array[file], add this to sample inputs json

* add this branch to dockstore

* remove this branch from dockstore

* add LoadBigQueryData to dockstore, modify check for existing tables, load from github

* exit with error if bq load fails

* use relative path to import LoadBigQueryData.wdl

* refactor ImportGenomes to contain BQ table creation and loading

* remove for_testing_only

* docker -> docker_final

* last wdl fix please

* remove #done

* add back done - end of for loop

* remove LoadBigQueryData wdl

* ensure tsv creation before making bq tables

* run CreateTables concurrently, clean up old code, LoadTable not preemptible, rename numbered to superpartitioned

* pad table id to 3 digits

* fix padded table id

* fix padded logic again

* fix range for table_id

* remove unused import

* remove feature branch from dockstore.yml
mmorgantaylor added a commit that referenced this pull request Apr 6, 2021
mmorgantaylor added a commit that referenced this pull request Apr 6, 2021