
#224 Import WDL: handle 15 TB/table/day import limit #7167

Merged
merged 64 commits into from
Apr 2, 2021

Conversation

schaluva

Chunk the full list of .tsv files ready to load into BigQuery into sets smaller than the 15 TB limit imposed on each bq load. From the original
datatype_tsvs directory, each set is moved into its own directory, and when the load is complete, the data is moved into a done directory within each set.

Assuming pet tsvs and 1 set, at the start:
gs://bucket/pet_tsvs/pet_001_*

At end:
gs://bucket/pet_tsvs/set_1/done/pet_001_*

--
The output file, bq_final_job_statuses.txt, contains the following columns (and example data):

  1. bq load job ID: bqjob_r2715fbcab1fd0e44_00000178708f0abe_1
  2. set number:
  3. path to set data: gs://fc-13e1680e-eb3d-4102-975a-be0142ee9618/full_15tb_test_2/pet_tsvs/set_1/
  4. status of the bq load: SUCCESS/FAIL
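On the FAIL question, one option (a minimal sketch, not part of the PR; the sample rows below are fabricated to match the column layout above) is to surface the failed set numbers so a user knows exactly which sets to re-run:

```shell
# Sketch: scan bq_final_job_statuses.txt for failed loads, assuming the
# four tab-separated columns described above. Sample rows are fabricated.
STATUS_FILE=bq_final_job_statuses.txt
printf 'bqjob_r1\t1\tgs://bucket/pet_tsvs/set_1/\tSUCCESS\n' >  "$STATUS_FILE"
printf 'bqjob_r2\t2\tgs://bucket/pet_tsvs/set_2/\tFAIL\n'    >> "$STATUS_FILE"

# report any sets whose load failed so the user can re-run just those
FAILED_SETS=$(awk -F'\t' '$4 == "FAIL" {print $2}' "$STATUS_FILE")
if [ -n "$FAILED_SETS" ]; then
  echo "Sets to re-run: $FAILED_SETS"
fi
```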

What should be the best user experience in case of FAIL?

@gatk-bot

gatk-bot commented Mar 26, 2021

Travis reported job failures from build 33370
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33370.1 logs
cloud openjdk11 33370.14 logs

@gatk-bot

gatk-bot commented Mar 26, 2021

Travis reported job failures from build 33372
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33372.1 logs
cloud openjdk11 33372.14 logs

Contributor

@kcibul kcibul left a comment


Looks good, nice bashing! I made a comment about how to simplify the code (for our future selves).

@@ -201,8 +201,10 @@ task SetLock {
LOCKFILE="LOCKFILE"
HAS_LOCKFILE=$(gsutil ls "${DIR}${LOCKFILE}" | wc -l)
if [ $HAS_LOCKFILE -gt 0 ]; then
echo "ERROR: lock file in place. Check whether another run of ImportGenomes with this output directory is in progress or a previous run had an error. If you would like to proceed, run `gsutil rm ${DIR}${LOCKFILE}` and re-run the workflow." 1>&2
exit 1
echo "ERROR: lock file in place. Check whether another run of ImportGenomes with this output directory is in progress or a previous run had an error.
Contributor


Is this supposed to be a multi-line string (i.e. it spans multiple lines)? Just checking that it works and wasn't the victim of some code auto-formatting.

Author


It is - had to reorganize it a bit because the backticks were executing the gsutil rm ${DIR}${LOCKFILE} within the echo statement and removing the LOCKFILE, when we just wanted an echo statement with the command expanded for a user to copy/paste from the log file.

It does work :)

gsutil -m mv $DIR$FILES ${DIR}done/
# get list of pet files and their byte sizes
echo "Getting load file sizes (bytes) and path to each file."
gsutil du "${DIR}${FILES}" | tr " " "\t" | tr -s "\t" > ~{datatype}_du.txt
Contributor


Love the use of tr! Might be worth a comment above explaining what this is doing, since we don't do a ton of bash.
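For future readers, a small illustration of what that tr pipeline does (the printf line stands in for real `gsutil du` output, which prints a size, a run of spaces, then the object path):

```shell
# `gsutil du` prints "<size>   <gs://path>" separated by a run of spaces.
# tr " " "\t" turns every space into a tab; tr -s "\t" squeezes the run
# of tabs down to a single tab, giving clean two-column TSV output.
printf '123456    gs://bucket/pet_tsvs/pet_001_a.tsv\n' \
  | tr " " "\t" \
  | tr -s "\t"
# yields: 123456<TAB>gs://bucket/pet_tsvs/pet_001_a.tsv
```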


# get total file size in bytes
echo "Calculating total files' size(bytes)."
TOTAL_FILE_SIZE=$(awk '{print $1}' OFS="\t" ~{datatype}_du.txt| paste -sd+ - | bc)
Contributor


FWIW you can also do sum with....

TOTAL_FILE_SIZE=$( awk '{sum+=$1;} END{print sum;}' ~{datatype}_du.txt)

Author


I can't recall exactly why now, but I had that in there first and then swapped it for some reason..

TOTAL_FILE_SIZE=$(awk '{print $1}' OFS="\t" ~{datatype}_du.txt| paste -sd+ - | bc)

# get number of iterations to loop through file - round up to get full set of files
num_sets=$(((TOTAL_FILE_SIZE+16492674416639)/16492674416640))
Contributor


Why is this 16000000000000 below but something else here?

Author


Good catch. I wanted to set the TB limit to something a bit lower than the max allowed just to be on the safe side, and didn't swap it here.
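For reference, 16492674416640 bytes is exactly 15 TiB (15 × 2^40), while 16000000000000 sits below it. The ceiling-division idiom from the diff, sketched with illustrative totals (the values here are made up for the example):

```shell
# ceiling division: smallest number of sets such that no set exceeds LIMIT
LIMIT=16000000000000            # conservative cap under the 15 TiB value
TOTAL_FILE_SIZE=40000000000000  # illustrative total: 40 TB of tsv data
num_sets=$(( (TOTAL_FILE_SIZE + LIMIT - 1) / LIMIT ))
echo "$num_sets"   # 40 TB at a 16 TB cap rounds up to 3 sets
```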

for set in $(seq 1 $num_sets)
do
# write set of data totaling 16000000000000 bytes to file labeled by set #
awk '{s+=$1}{print $1"\t"$2"\t"s}' ~{datatype}_du.txt | awk '$3 < 16000000000000 {print $1"\t"$2}' > "${set}"_files_to_load.txt
Contributor


Thinking out loud here... but maybe this would be easier to do/maintain by creating a file outside of this loop with a column for the "set". Awk has access to decent math functions, I bet it could be done there.

Then this loop could just grep that file for the set name being loaded

awk '{s+=$1}{print $1"\t"$2"\t"s"\t" "set"(1+int(s / 16000000000000))}' ~{datatype}_du.txt

This adds a new column "setX" where X is the 1-based set of ~16 TB of files. You could do this outside the loop, and then the loop could possibly be

for set in $(cat file | cut -f4 | sort | uniq)

to loop over the unique values of sets.

Then you don't have to do this subtracting, temp file stuff, etc.
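A runnable sketch of the suggested approach, with the limit shrunk to 250 bytes and fabricated file sizes so the sample data spans two sets:

```shell
# assign each file to a set via a running sum, as suggested above;
# LIMIT and the input rows are illustrative, not the real 16 TB cap
LIMIT=250
printf '100\ta.tsv\n100\tb.tsv\n100\tc.tsv\n' > pet_du.txt

# emit size, path, running sum, and a 1-based "setX" label per file
awk -v limit="$LIMIT" \
  '{s+=$1; print $1"\t"$2"\t"s"\tset"(1+int(s/limit))}' pet_du.txt > sets.txt

# loop over the unique set labels instead of computing set boundaries
for set in $(cut -f4 sets.txt | sort -u); do
  echo "loading $set"
done
```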

Author


Brilliant, that's fantastic. Implementing that now.

@@ -0,0 +1,154 @@
version 1.0
Contributor


Why is there this whole extra file?

Author


A testing file snuck in there. Removed it.

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33428
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33428.1 logs
cloud openjdk11 33428.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33433
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33433.1 logs
cloud openjdk11 33433.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33435
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33435.1 logs
cloud openjdk11 33435.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33437
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk11 33437.14 logs
cloud openjdk8 33437.1 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33439
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33439.1 logs
cloud openjdk11 33439.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33443
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33443.1 logs
cloud openjdk11 33443.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33447
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33447.1 logs
cloud openjdk11 33447.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33454
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33454.1 logs
cloud openjdk11 33454.14 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33456
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk11 33456.14 logs
cloud openjdk8 33456.1 logs

@gatk-bot

gatk-bot commented Mar 31, 2021

Travis reported job failures from build 33458
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 33458.1 logs
cloud openjdk11 33458.14 logs

@schaluva schaluva merged commit f4cede7 into ah_var_store Apr 2, 2021
@schaluva schaluva deleted the bq_15tb_limit branch April 2, 2021 15:42
mmorgantaylor pushed a commit that referenced this pull request Apr 6, 2021
* add bash code to chunk files at 15tb limit

* update dockstore.yml

* fix yml formatting

* fill in importLoadTable inputs json

* take out variable in comment

* remove unnecessary files in task

* printf for scientific notation

* scientific notation

* increase memory to check if memory issue

* remove gsutil copy log

* fix gsutil cp destination directory

* decrease memory after testing for increased memory fails

* testing bq job id

* test sed pattern matching

* take out mv command for testing purposes

* update move to done files command to within each set

* troubleshoot bq load command

* update project for inputs json

* revert parsing of job id

* test

* write load status to separate file

* sed file to get job ID

* test2

* fix bq wait status

* separate wait status command into pieces

* reorganize output block

* move wait command into main loop

* test taking out wait command entirely

* add task jsut for bq wait testing

* set project for bq wait

* put everything back now that wait works with project

* re commit with updated comments and cleaned up echo commands

* take out call to test task

* mv files from original dir to set dirs instead of copy

* trouble shoot wait_status

* change back to cp for testing in copying sets step

* take out output file for bq wait

* add back input file to wait statement

* add in more tmp files in output

* try to capture success in bq wait

* reorg output file for while loop

* fix sed command

* clean up for final test - mv files from original to sets

* update full import genomes wdl

* take out un-used inputs in load table task

* put back in the variables that control calls in order

* fixing lockfile exists comment

* add quotes to escape backtick in echo

* take out hard coded pet table in gsutil du

* single quote echo to block expression expansion in lockfile

* multi line echo comment for lockfile

* fix missing quotes

* set dockstore yml back to what it should be for ah_var_store branch

* remove testing wdl and json files

* set single file with all sum and sets

* fix sed variable format

* testing pattern matching

* fix reading wrong input file in loop

* adding output files

* test

* fix quotes on awk statement

* awk

* clean up comments and print statements

* edit comments