
Separate bigquery table creation and data loading in LoadData #7056

Merged: 3 commits into ah_var_store on Jan 28, 2021

Conversation

ericsong (author) commented:

  • I sneaked in another change: I now pass in a single file containing a list of input_vcfs instead of an array of input_vcfs. I made this change because Terra couldn't save my inputs when I passed in 700 samples. (A sketch of the pattern follows this list.)
  • Most of the logic was moved into CreateTables, including the determination of which files to load. It would have been cleaner to move all of the file-loading logic into LoadTable, but the current approach cuts down on the number of gsutil ls calls and, more importantly, only spins up a shard if there are files to load.
  • I pushed the logic into a separate workflow because I wanted to refactor it as two tasks, and I couldn't find a way for a Task to call another Task without wrapping it in a workflow.
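
For reference, a minimal WDL sketch of the file-of-paths pattern from the first bullet (the workflow and input names here are illustrative, not the PR's actual names):

version 1.0

workflow LoadDataSketch {
  input {
    # A single text file with one VCF path per line, replacing an
    # Array[File] input_vcfs declaration; Terra only has to save one
    # path even when the manifest lists 700+ samples.
    File input_vcfs_file
  }

  # read_lines (WDL stdlib) turns the manifest back into the array
  # that downstream tasks expect.
  Array[String] input_vcfs = read_lines(input_vcfs_file)

  output {
    Array[String] vcfs = input_vcfs
  }
}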

@gatk-bot commented:

Travis reported job failures from build 32656
Failures in the following jobs:

Test Type | JDK      | Job ID  | Logs
conda     | openjdk8 | 32656.5 | logs

@@ -141,7 +142,9 @@ task CreateImportTsvs {

meta {
description: "Creates a tsv file for import into BigQuery"
volatile: true
Contributor:

ooh is this the thing that turns off call caching for a single task

Author:

yeah. I was running into bugs because of call caching. None of these tasks are actually cacheable, since they write to a GCS directory. That said, it would be possible to refactor CreateImportTsvs to separate the import-file generation from the upload, and cache the first part.
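
For reference, a minimal sketch of the feature being discussed: setting volatile: true in a task's meta section tells Cromwell never to call-cache that task (the task name and body below are illustrative):

version 1.0

task VolatileSketch {
  meta {
    # Cromwell never reuses cached results for a volatile task, so
    # side effects like writes to a GCS directory always re-run.
    volatile: true
  }
  command <<<
    echo "side-effecting work that must not be call-cached"
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}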

Contributor:

neat. good to hear this feature has been released

ahaessly (Contributor) left a review:

looks good. all my comments can be done in a subsequent PR. thanks for fixing this!!


DIR="~{storage_location}/~{datatype}_tsvs/"

for TABLE_ID in $(seq 1 ~{max_table_id}); do
Contributor:

I think this is good for now, but we should change getMaxTableId to a getTableIdsForSamples that returns just the list of tables we need for the current samples, and then loop through those (rough sketch below). I'll add a feature request for this.
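
A rough bash sketch of that suggested shape, assuming a hypothetical manifest produced by the proposed getTableIdsForSamples (neither the utility nor the file name exists yet; this is the feature being requested):

# Hypothetical: table_ids_for_samples.txt lists only the table ids the
# current samples actually need, one per line, replacing seq 1 ~{max_table_id}.
for TABLE_ID in $(cat table_ids_for_samples.txt); do
  # Every id in this loop is known to have files, so the NUM_FILES
  # guard further down would no longer be needed.
  echo "processing table ${TABLE_ID}"
done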

PARTITION_STRING="--range_partitioning=$PARTITION_FIELD,$PARTITION_START,$PARTITION_END,$PARTITION_STEP"
fi

# we are loading ONLY one table, specified by table_id
Contributor:

change this comment to say "creating" so it's not misleading

Author:

I just removed it, since it's no longer technically true now that it runs in a for loop.
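
For context, the PARTITION_STRING assembled above feeds BigQuery's range-partitioning option when a table is created; a minimal sketch of such a creation call (the dataset, table, and schema names are placeholders, not the PR's actual values):

# --range_partitioning takes field,start,end,interval.
PARTITION_STRING="--range_partitioning=$PARTITION_FIELD,$PARTITION_START,$PARTITION_END,$PARTITION_STEP"

# Placeholder dataset/table/schema; the real values come from workflow inputs.
bq mk --table $PARTITION_STRING my_dataset.my_table schema.json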

PREFIX="~{uuid}_"
fi

if [ $NUM_FILES -gt 0 ]; then
Contributor:

I think if we change getMaxTableId to just return the ids of the tables we need to create, we won't need this check.


echo "$TABLE,$DIR,$FILES" >> table_dir_files.csv
else
echo "no ${FILES} files to process"
Contributor:

then this message could be removed as well.
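
For context, a minimal WDL sketch of how the table_dir_files.csv built above can drive the per-table shards, so a load shard only spins up when there are files to load (the task and field names are illustrative, not the PR's actual ones):

version 1.0

workflow LoadTablesSketch {
  input {
    # table_dir_files.csv as written by the loop above: one
    # "table,dir,files" row per table that actually has data.
    File table_dir_files_csv
  }

  # One shard per row; tables with nothing to load never make it
  # into the CSV, so no idle shards are created.
  scatter (row in read_lines(table_dir_files_csv)) {
    call LoadTable { input: csv_row = row }
  }
}

task LoadTable {
  input {
    String csv_row
  }
  command <<<
    # Split the row back into its fields (illustrative only).
    IFS=',' read -r TABLE DIR FILES <<< "~{csv_row}"
    echo "would load ${FILES} from ${DIR} into ${TABLE}"
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}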

@ericsong ericsong merged commit 75f0bd8 into ah_var_store Jan 28, 2021
@ericsong ericsong deleted the songe/separate-create-load-tasks branch January 28, 2021 18:48
@gatk-bot commented:

Travis reported job failures from build 32668
Failures in the following jobs:

Test Type | JDK      | Job ID  | Logs
conda     | openjdk8 | 32668.5 | logs

Commits referencing this pull request (each carrying the same messages: "separate table creation from loading", "add a comment to CreateTables", "remove old comment"):

  • kcibul (Jan 29, 2021)
  • kcibul (Feb 1, 2021)
  • Marianie-Simeon (Feb 16, 2021)
  • kcibul (Mar 9, 2021)
  • mmorgantaylor (Apr 6, 2021)
  • mmorgantaylor (Apr 6, 2021)