Separate bigquery table creation and data loading in LoadData #7056
Conversation
@@ -141,7 +142,9 @@ task CreateImportTsvs {

    meta {
        description: "Creates a tsv file for import into BigQuery"
        volatile: true
Ooh, is this the thing that turns off call caching for a single task?
Yeah. I was running into bugs because of call caching. None of these tasks are actually cacheable, since they write to a GCS directory. That said, it would be possible to refactor CreateImportTsvs to separate the import-file generation from the upload, and cache the first part.
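For context, a minimal sketch of the Cromwell `volatile` meta hint being discussed. The task body here is a placeholder, not the real CreateImportTsvs command:

```wdl
task CreateImportTsvs {
  command <<<
    # real task writes TSVs to a GCS directory as a side effect
    echo "placeholder"
  >>>
  meta {
    description: "Creates a tsv file for import into BigQuery"
    # volatile: true tells Cromwell to never call-cache this task,
    # which matters because its GCS writes are a side effect that
    # call caching would silently skip
    volatile: true
  }
}
```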
Neat. Good to hear this feature has been released.
Looks good. All my comments can be addressed in a subsequent PR. Thanks for fixing this!
DIR="~{storage_location}/~{datatype}_tsvs/"

for TABLE_ID in $(seq 1 ~{max_table_id}); do
I think this is good for now, but we should change getMaxTableId to a getTableIdsForSamples that returns just the list of tables we need for the current samples, and then loop through those. I'll add a feature request for this.
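A sketch of the suggested change: instead of iterating over every id up to `max_table_id`, iterate only over the ids actually needed. `TABLE_IDS` stands in for the output of the proposed (not yet existing) getTableIdsForSamples; the values are made up for illustration:

```shell
# Hypothetical output of getTableIdsForSamples: only the tables the
# current samples actually touch, rather than 1..max_table_id.
TABLE_IDS="001 004 007"

for TABLE_ID in $TABLE_IDS; do
  echo "processing table ${TABLE_ID}"
done
```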
    PARTITION_STRING="--range_partitioning=$PARTITION_FIELD,$PARTITION_START,$PARTITION_END,$PARTITION_STEP"
fi

# we are loading ONLY one table, specified by table_id
Change this comment to "creating" so it's not misleading.
I just removed it, since it's no longer technically true now that I'm running it in a for loop.
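For reference, a sketch of how the assembled flag would be passed to `bq mk`. The `PARTITION_*` values here are illustrative, not from the diff; `bq`'s `--range_partitioning` flag takes `column,start,end,interval`:

```shell
# Assumed values for illustration; the real ones come from workflow inputs.
PARTITION_FIELD="sample_id"
PARTITION_START=1
PARTITION_END=4000
PARTITION_STEP=4000

# Assemble the flag exactly as the diff does.
PARTITION_STRING="--range_partitioning=$PARTITION_FIELD,$PARTITION_START,$PARTITION_END,$PARTITION_STEP"
echo "$PARTITION_STRING"
# It would then be passed through to table creation, e.g.:
#   bq mk --table $PARTITION_STRING "${dataset_name}.${TABLE}" schema.json
```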
    PREFIX="~{uuid}_"
fi

if [ $NUM_FILES -gt 0 ]; then
I think if we change getMaxTableId to return just the ids of the tables we need to create, we won't need this check.
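A sketch of the guard being discussed, with a local directory standing in for the GCS listing (the real code would count the output of `gsutil ls` against the TSV directory; the paths and filenames here are made up):

```shell
# Local stand-in for the GCS tsv directory.
mkdir -p /tmp/tsv_demo
touch /tmp/tsv_demo/sample_001.tsv /tmp/tsv_demo/sample_002.tsv

# In the workflow this count comes from gsutil ls; here a local glob.
NUM_FILES=$(ls /tmp/tsv_demo/*.tsv 2>/dev/null | wc -l)

if [ "$NUM_FILES" -gt 0 ]; then
  echo "found $NUM_FILES files to load"
else
  echo "no files to process"
fi
```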
    echo "$TABLE,$DIR,$FILES" >> table_dir_files.csv
else
    echo "no ${FILES} files to process"
Then this message could be removed as well.
* separate table creation from loading
* add a comment to CreateTables
* remove old comment
Table creation now happens in CreateTables, including the determination of which files to load. It would have been cleaner to move all of the file-loading logic into LoadTable, but the current approach cuts down on the number of `gsutil ls` calls and, more importantly, only spins up a shard if there are files to load.
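The CreateTables-to-LoadTable handoff described above can be sketched as follows: CreateTables appends one `table,dir,file-pattern` row per non-empty table, and each LoadTable shard consumes a row. The table names, bucket, and patterns here are illustrative, not from the PR:

```shell
# CreateTables side: write a row only for tables that have files,
# so no LoadTable shard is spun up for an empty table.
echo "pet_001,gs://bucket/pet_tsvs/,pet_001_*" >  table_dir_files.csv
echo "vet_001,gs://bucket/vet_tsvs/,vet_001_*" >> table_dir_files.csv

# LoadTable side: each shard parses its row and loads the files.
while IFS=, read -r TABLE DIR FILES; do
  echo "would load ${DIR}${FILES} into ${TABLE}"
done < table_dir_files.csv
```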