#224 Import WDL: handle 15 TB/table/day import limit #7167
Conversation
Looks good, nice bashing! I made a comment about how to simplify the code (for our future selves).
@@ -201,8 +201,10 @@ task SetLock {
LOCKFILE="LOCKFILE"
HAS_LOCKFILE=$(gsutil ls "${DIR}${LOCKFILE}" | wc -l)
if [ $HAS_LOCKFILE -gt 0 ]; then
echo "ERROR: lock file in place. Check whether another run of ImportGenomes with this output directory is in progress or a previous run had an error. If you would like to proceed, run `gsutil rm ${DIR}${LOCKFILE}` and re-run the workflow." 1>&2
exit 1
echo "ERROR: lock file in place. Check whether another run of ImportGenomes with this output directory is in progress or a previous run had an error.
Is this supposed to be a multi-line string (i.e., it spans multiple lines)? Just checking that it works and wasn't the victim of some code auto-formatting.
It is - I had to reorganize it a bit because the backticks were executing gsutil rm ${DIR}${LOCKFILE} within the echo statement and removing the LOCKFILE, when we just wanted an echo statement with the command expanded for a user to copy-paste from the log file.
It does work :)
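The quoting issue described above is easy to reproduce outside the workflow. A minimal sketch (the message text and the gs://bucket/LOCKFILE path are placeholders, not the actual task code): inside double quotes, backticks are command substitution and execute; inside single quotes they stay literal, so the user can copy-paste the command from the log.

```shell
# Build the error message in single quotes so the backticked command is
# NOT executed; it is printed literally for the user to copy-paste.
# (gs://bucket/LOCKFILE is a placeholder path, not the real one.)
safe_msg='ERROR: lock file in place. To proceed, run `gsutil rm gs://bucket/LOCKFILE` and re-run the workflow.'
printf '%s\n' "$safe_msg" 1>&2
```

Had the message been in double quotes, the shell would have run the gsutil rm at echo time, which is exactly the bug the reorganization fixed.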
gsutil -m mv $DIR$FILES ${DIR}done/
# get list of pet files and their byte sizes
echo "Getting load file sizes(bytes) and path to each file."
gsutil du "${DIR}${FILES}" | tr " " "\t" | tr -s "\t" > ~{datatype}_du.txt
Love the use of tr! Might be worth a comment above explaining what this is doing, since we don't write a ton of bash.
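For future readers, such a comment plus a toy run might look like this; the canned printf stands in for real gsutil du output (the sizes and paths are made up):

```shell
# gsutil du prints "<bytes>   <gs://path>" separated by runs of spaces;
# convert spaces to tabs and squeeze repeated tabs so each line becomes
# a clean two-column (bytes<TAB>path) record that awk can split on.
printf '12345      gs://bucket/pet_001.tsv\n678    gs://bucket/pet_002.tsv\n' \
  | tr " " "\t" | tr -s "\t" > du_demo.txt
cat du_demo.txt
```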
# get total file size in bytes
echo "Calculating total files' size(bytes)."
TOTAL_FILE_SIZE=$(awk '{print $1}' OFS="\t" ~{datatype}_du.txt| paste -sd+ - | bc)
FWIW, you can also do the sum with:
TOTAL_FILE_SIZE=$( awk '{sum+=$1;} END{print sum;}' ~{datatype}_du.txt)
I can't recall exactly why now, but I had that in there first and then swapped it for some reason...
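Both summing idioms from this thread give the same result; here is a quick side-by-side on made-up sizes (this assumes paste and bc are available, as the original pipeline already does):

```shell
# Fake two-column (bytes<TAB>name) input standing in for the du file.
printf '10\tfileA\n20\tfileB\n30\tfileC\n' > sizes_demo.txt

# Original approach: join column 1 with "+" and hand the expression to bc.
SUM_BC=$(awk '{print $1}' sizes_demo.txt | paste -sd+ - | bc)

# Suggested alternative: let awk accumulate the sum itself.
SUM_AWK=$(awk '{sum+=$1} END{print sum}' sizes_demo.txt)

echo "bc: $SUM_BC  awk: $SUM_AWK"
```

The awk version avoids two extra processes and has no issue with very long "+"-joined expressions, which may matter with many thousands of files.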
# get number of iterations to loop through file - round up to get full set of files
num_sets=$(((TOTAL_FILE_SIZE+16492674416639)/16492674416640))
Why is this 16000000000000 below but something else here?
Good catch. I wanted to set the TB limit to something a bit lower than the max allowed, just to be on the safe side, and didn't swap it here.
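The num_sets arithmetic is integer ceiling division: adding (limit - 1) to the total before dividing rounds up, so a partial final chunk still gets its own set. A toy check with a hypothetical 100-byte limit:

```shell
LIMIT=100
# (TOTAL + LIMIT - 1) / LIMIT rounds up in integer arithmetic:
# totals of 1..100 bytes need 1 set, 101..200 need 2, and so on.
for TOTAL in 1 100 101 250; do
  num_sets=$(((TOTAL + LIMIT - 1) / LIMIT))
  echo "$TOTAL bytes -> $num_sets set(s)"
done
```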
for set in $(seq 1 $num_sets)
do
# write set of data totaling 16000000000000 bytes to file labeled by set #
awk '{s+=$1}{print $1"\t"$2"\t"s}' ~{datatype}_du.txt | awk '$3 < 16000000000000 {print $1"\t"$2}' > "${set}"_files_to_load.txt
Thinking out loud here... but maybe this would be easier to do/maintain by creating a file outside of this loop with a column for the "set". Awk has access to decent math functions; I bet it could be done there. Then this loop could just grep that file for the set name being loaded:
awk '{s+=$1}{print $1"\t"$2"\t"s"\t" "set"(1+int(s / 16000000000000))}' ~{datatype}_du.txt
This adds a new column "setX" where X is the 1-based set of ~16 TB of files. You could do this outside the loop, and then the loop could possibly be
for set in $(cat file | cut -f4 | sort | uniq)
to loop over the unique values of sets. Then you don't have to do the subtracting, temp-file stuff, etc.
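A runnable sketch of this suggestion, scaled down to a 100-byte limit with fabricated du output so the result can be eyeballed. One small variant from the awk shown above: using (s-1)/limit instead of s/limit keeps a file that lands exactly on the boundary in the current set.

```shell
# Fake "bytes<TAB>path" input standing in for ~{datatype}_du.txt.
printf '40\tgs://bucket/a.tsv\n70\tgs://bucket/b.tsv\n30\tgs://bucket/c.tsv\n' > du_sets.txt

# Tag each line with the cumulative size and a 1-based "setN" label,
# starting a new set whenever the running total crosses the limit.
awk -v limit=100 '{s+=$1; print $1"\t"$2"\t"s"\tset"(1+int((s-1)/limit))}' du_sets.txt > sets.txt

# Loop over the distinct labels, pulling out that set's file paths.
for set in $(cut -f4 sets.txt | sort -u); do
  echo "files in ${set}:"
  awk -F'\t' -v set="$set" '$4==set {print $2}' sets.txt
done
```

With this toy data, a.tsv (cumulative 40) lands in set1 and b.tsv/c.tsv (cumulative 110 and 140) land in set2, so no per-set temp files or subtraction are needed.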
Brilliant, that's fantastic. Implementing that now.
@@ -0,0 +1,154 @@
version 1.0
Why is there this whole extra file?
A testing file snuck in there. Removed it.
* add bash code to chunk files at 15tb limit
* update dockstore.yml
* fix yml formatting
* fill in importLoadTable inputs json
* take out variable in comment
* remove unnecessary files in task
* printf for scientific notation
* scientific notation
* increase memory to check if memory issue
* remove gsutil copy log
* fix gsutil cp destination directory
* decrease memory after testing for increased memory fails
* testing bq job id
* test sed pattern matching
* take out mv command for testing purposes
* update move to done files command to within each set
* troubleshoot bq load command
* update project for inputs json
* revert parsing of job id
* test
* write load status to separate file
* sed file to get job ID
* test2
* fix bq wait status
* separate wait status command into pieces
* reorganize output block
* move wait command into main loop
* test taking out wait command entirely
* add task just for bq wait testing
* set project for bq wait
* put everything back now that wait works with project
* re-commit with updated comments and cleaned up echo commands
* take out call to test task
* mv files from original dir to set dirs instead of copy
* troubleshoot wait_status
* change back to cp for testing in copying sets step
* take out output file for bq wait
* add back input file to wait statement
* add in more tmp files in output
* try to capture success in bq wait
* reorg output file for while loop
* fix sed command
* clean up for final test - mv files from original to sets
* update full import genomes wdl
* take out un-used inputs in load table task
* put back in the variables that control calls in order
* fixing lockfile exists comment
* add quotes to escape backtick in echo
* take out hard coded pet table in gsutil du
* single quote echo to block expression expansion in lockfile
* multi line echo comment for lockfile
* fix missing quotes
* set dockstore yml back to what it should be for ah_var_store branch
* remove testing wdl and json files
* set single file with all sum and sets
* fix sed variable format
* testing pattern matching
* fix reading wrong input file in loop
* adding output files
* test
* fix quotes on awk statement
* awk
* clean up comments and print statements
* edit comments
Chunk the full list of .tsv files ready to load to bq into sets that are below the 15 TB limit set on each bq load. From the original datatype_tsvs directory, each set is moved to its own directory, and when the load is complete, the data is moved into a done directory within each set.
Assuming pet tsvs and 1 set, at the start:
gs://bucket/pet_tsvs/pet_001_*
At end:
gs://bucket/pet_tsvs/set_1/done/pet_001_*
The output file, bq_final_job_statuses.txt, contains the following columns (and example data):

What should be the best user experience in case of FAIL?