
Separate bigquery table creation and data loading in LoadData #7056

Merged: 3 commits into ah_var_store on Jan 28, 2021

Conversation

ericsong (author) commented:

  • I sneaked in another change: I now pass in a single file containing a list of input_vcfs instead of an array of input_vcfs. I made this change because Terra couldn't save my inputs when I passed in 700 samples. (A sketch of the pattern follows this list.)
  • Most of the logic was moved into CreateTables, including the determination of which files to load. It would have been cleaner to move all of the file-loading logic into LoadTable, but the current approach cuts down on the number of gsutil ls calls and, more importantly, only spins up a shard if there are files to load.
  • I pushed the logic into a separate workflow because I wanted to refactor it as two tasks, and I couldn't find a way for a Task to call another Task without wrapping it in a workflow.
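
For reference, a minimal WDL sketch of the file-of-paths pattern from the first bullet (the workflow and input names here are illustrative, not the PR's actual names):

version 1.0

workflow LoadDataSketch {
  input {
    # A single text file with one VCF path per line, replacing an
    # Array[File] input_vcfs declaration; Terra only has to save one
    # path even when the manifest lists 700+ samples.
    File input_vcfs_file
  }

  # read_lines (WDL stdlib) turns the manifest back into the array
  # that downstream tasks expect.
  Array[String] input_vcfs = read_lines(input_vcfs_file)

  output {
    Array[String] vcfs = input_vcfs
  }
}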

@gatk-bot commented:

Travis reported job failures from build 32656
Failures in the following jobs:

Test Type | JDK      | Job ID  | Logs
conda     | openjdk8 | 32656.5 | logs

@@ -141,7 +142,9 @@ task CreateImportTsvs {

meta {
description: "Creates a tsv file for import into BigQuery"
volatile: true
Contributor:

ooh is this the thing that turns off call caching for a single task

Author:

yeah. I was running into bugs because of call caching. None of these tasks are actually cacheable, since they write to a GCS directory. That said, it would be possible to refactor CreateImportTsvs to separate the import-file generation from the upload, and cache the first part.
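
For reference, a minimal sketch of the feature being discussed: setting volatile: true in a task's meta section tells Cromwell never to call-cache that task (the task name and body below are illustrative):

version 1.0

task VolatileSketch {
  meta {
    # Cromwell never reuses cached results for a volatile task, so
    # side effects like writes to a GCS directory always re-run.
    volatile: true
  }
  command <<<
    echo "side-effecting work that must not be call-cached"
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}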

Contributor:

neat. good to hear this feature has been released

ahaessly (Contributor) left a review:

looks good. all my comments can be done in a subsequent PR. thanks for fixing this!!


DIR="~{storage_location}/~{datatype}_tsvs/"

for TABLE_ID in $(seq 1 ~{max_table_id}); do
Contributor:

I think this is good for now, but we should change getMaxTableId to a getTableIdsForSamples that returns just the list of tables we need for the current samples, and then loop through those (rough sketch below). I'll add a feature request for this.
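
A rough bash sketch of that suggested shape, assuming a hypothetical manifest produced by the proposed getTableIdsForSamples (neither the utility nor the file name exists yet; this is the feature being requested):

# Hypothetical: table_ids_for_samples.txt lists only the table ids the
# current samples actually need, one per line, replacing seq 1 ~{max_table_id}.
for TABLE_ID in $(cat table_ids_for_samples.txt); do
  # Every id in this loop is known to have files, so the NUM_FILES
  # guard further down would no longer be needed.
  echo "processing table ${TABLE_ID}"
done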

PARTITION_STRING="--range_partitioning=$PARTITION_FIELD,$PARTITION_START,$PARTITION_END,$PARTITION_STEP"
fi

# we are loading ONLY one table, specified by table_id
Contributor:

change this comment to say "creating" so it's not misleading

Author:

I just removed it, since it's no longer technically true now that it runs in a for loop.
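
For context, the PARTITION_STRING assembled above feeds BigQuery's range-partitioning option when a table is created; a minimal sketch of such a creation call (the dataset, table, and schema names are placeholders, not the PR's actual values):

# --range_partitioning takes field,start,end,interval.
PARTITION_STRING="--range_partitioning=$PARTITION_FIELD,$PARTITION_START,$PARTITION_END,$PARTITION_STEP"

# Placeholder dataset/table/schema; the real values come from workflow inputs.
bq mk --table $PARTITION_STRING my_dataset.my_table schema.json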

PREFIX="~{uuid}_"
fi

if [ $NUM_FILES -gt 0 ]; then
Contributor:

I think if we change getMaxTableId to just return the ids of the tables we need to create, we won't need this check.


echo "$TABLE,$DIR,$FILES" >> table_dir_files.csv
else
echo "no ${FILES} files to process"
Contributor:

then this message could be removed as well.
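
For context, a minimal WDL sketch of how the table_dir_files.csv built above can drive the per-table shards, so a load shard only spins up when there are files to load (the task and field names are illustrative, not the PR's actual ones):

version 1.0

workflow LoadTablesSketch {
  input {
    # table_dir_files.csv as written by the loop above: one
    # "table,dir,files" row per table that actually has data.
    File table_dir_files_csv
  }

  # One shard per row; tables with nothing to load never make it
  # into the CSV, so no idle shards are created.
  scatter (row in read_lines(table_dir_files_csv)) {
    call LoadTable { input: csv_row = row }
  }
}

task LoadTable {
  input {
    String csv_row
  }
  command <<<
    # Split the row back into its fields (illustrative only).
    IFS=',' read -r TABLE DIR FILES <<< "~{csv_row}"
    echo "would load ${FILES} from ${DIR} into ${TABLE}"
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}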

@ericsong ericsong merged commit 75f0bd8 into ah_var_store Jan 28, 2021
@ericsong ericsong deleted the songe/separate-create-load-tasks branch January 28, 2021 18:48
@gatk-bot commented:

Travis reported job failures from build 32668
Failures in the following jobs:

Test Type | JDK      | Job ID  | Logs
conda     | openjdk8 | 32668.5 | logs

Commits referencing this pull request (each carrying the same messages: "separate table creation from loading", "add a comment to CreateTables", "remove old comment"):

  • kcibul (Jan 29, 2021)
  • kcibul (Feb 1, 2021)
  • Marianie-Simeon (Feb 16, 2021)
  • kcibul (Mar 9, 2021)
  • mmorgantaylor (Apr 6, 2021)
  • mmorgantaylor (Apr 6, 2021)