#224 Import WDL: handle 15 TB/table/day import limit #7167
Conversation
Looks good, nice bashing! I made a comment about how to simplify the code (for our future selves).
@@ -201,8 +201,10 @@ task SetLock {
LOCKFILE="LOCKFILE"
HAS_LOCKFILE=$(gsutil ls "${DIR}${LOCKFILE}" | wc -l)
if [ $HAS_LOCKFILE -gt 0 ]; then
echo "ERROR: lock file in place. Check whether another run of ImportGenomes with this output directory is in progress or a previous run had an error. If you would like to proceed, run `gsutil rm ${DIR}${LOCKFILE}` and re-run the workflow." 1>&2
exit 1
echo "ERROR: lock file in place. Check whether another run of ImportGenomes with this output directory is in progress or a previous run had an error.
Is this supposed to be a multi-line string (i.e., it spans multiple lines)? Just checking that it works and wasn't the victim of some code auto-formatting.
It is - I had to reorganize it a bit because the backticks were executing gsutil rm ${DIR}${LOCKFILE} within the echo statement and removing the LOCKFILE, when we just wanted an echo statement with the command expanded for a user to copy-paste from the log file.
It does work :)
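The quoting issue described above is easy to reproduce outside the workflow. A minimal sketch (the message text and the gs://bucket/LOCKFILE path are placeholders, not the actual task code): inside double quotes, backticks are command substitution and execute; inside single quotes they stay literal, so the user can copy-paste the command from the log.

```shell
# Build the error message in single quotes so the backticked command is
# NOT executed; it is printed literally for the user to copy-paste.
# (gs://bucket/LOCKFILE is a placeholder path, not the real one.)
safe_msg='ERROR: lock file in place. To proceed, run `gsutil rm gs://bucket/LOCKFILE` and re-run the workflow.'
printf '%s\n' "$safe_msg" 1>&2
```

Had the message been in double quotes, the shell would have run the gsutil rm at echo time, which is exactly the bug the reorganization fixed.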
gsutil -m mv $DIR$FILES ${DIR}done/
# get list of pet files and their byte sizes
echo "Getting load file sizes(bytes) and path to each file."
gsutil du "${DIR}${FILES}" | tr " " "\t" | tr -s "\t" > ~{datatype}_du.txt
Love the use of tr! Might be worth a comment above explaining what this is doing, since we don't write a ton of bash.
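For future readers, such a comment plus a toy run might look like this; the canned printf stands in for real gsutil du output (the sizes and paths are made up):

```shell
# gsutil du prints "<bytes>   <gs://path>" separated by runs of spaces;
# convert spaces to tabs and squeeze repeated tabs so each line becomes
# a clean two-column (bytes<TAB>path) record that awk can split on.
printf '12345      gs://bucket/pet_001.tsv\n678    gs://bucket/pet_002.tsv\n' \
  | tr " " "\t" | tr -s "\t" > du_demo.txt
cat du_demo.txt
```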
# get total file size in bytes
echo "Calculating total files' size(bytes)."
TOTAL_FILE_SIZE=$(awk '{print $1}' OFS="\t" ~{datatype}_du.txt| paste -sd+ - | bc)
FWIW, you can also do the sum with:
TOTAL_FILE_SIZE=$( awk '{sum+=$1;} END{print sum;}' ~{datatype}_du.txt)
I can't recall exactly why now, but I had that in there first and then swapped it for some reason...
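Both summing idioms from this thread give the same result; here is a quick side-by-side on made-up sizes (this assumes paste and bc are available, as the original pipeline already does):

```shell
# Fake two-column (bytes<TAB>name) input standing in for the du file.
printf '10\tfileA\n20\tfileB\n30\tfileC\n' > sizes_demo.txt

# Original approach: join column 1 with "+" and hand the expression to bc.
SUM_BC=$(awk '{print $1}' sizes_demo.txt | paste -sd+ - | bc)

# Suggested alternative: let awk accumulate the sum itself.
SUM_AWK=$(awk '{sum+=$1} END{print sum}' sizes_demo.txt)

echo "bc: $SUM_BC  awk: $SUM_AWK"
```

The awk version avoids two extra processes and has no issue with very long "+"-joined expressions, which may matter with many thousands of files.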
# get number of iterations to loop through file - round up to get full set of files
num_sets=$(((TOTAL_FILE_SIZE+16492674416639)/16492674416640))
Why is this 16000000000000 below but something else here?
Good catch. I wanted to set the TB limit to something a bit lower than the max allowed, just to be on the safe side, and didn't swap it here.
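The num_sets arithmetic is integer ceiling division: adding (limit - 1) to the total before dividing rounds up, so a partial final chunk still gets its own set. A toy check with a hypothetical 100-byte limit:

```shell
LIMIT=100
# (TOTAL + LIMIT - 1) / LIMIT rounds up in integer arithmetic:
# totals of 1..100 bytes need 1 set, 101..200 need 2, and so on.
for TOTAL in 1 100 101 250; do
  num_sets=$(((TOTAL + LIMIT - 1) / LIMIT))
  echo "$TOTAL bytes -> $num_sets set(s)"
done
```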
for set in $(seq 1 $num_sets)
do
# write set of data totaling 16000000000000 bytes to file labeled by set #
awk '{s+=$1}{print $1"\t"$2"\t"s}' ~{datatype}_du.txt | awk '$3 < 16000000000000 {print $1"\t"$2}' > "${set}"_files_to_load.txt
Thinking out loud here... but maybe this would be easier to do/maintain by creating a file outside of this loop with a column for the "set". Awk has access to decent math functions; I bet it could be done there. Then this loop could just grep that file for the set name being loaded:
awk '{s+=$1}{print $1"\t"$2"\t"s"\t" "set"(1+int(s / 16000000000000))}' ~{datatype}_du.txt
This adds a new column "setX" where X is the 1-based set of ~16 TB of files. You could do this outside the loop, and then the loop could possibly be
for set in $(cat file | cut -f4 | sort | uniq)
to loop over the unique values of sets. Then you don't have to do the subtracting, temp-file stuff, etc.
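A runnable sketch of this suggestion, scaled down to a 100-byte limit with fabricated du output so the result can be eyeballed. One small variant from the awk shown above: using (s-1)/limit instead of s/limit keeps a file that lands exactly on the boundary in the current set.

```shell
# Fake "bytes<TAB>path" input standing in for ~{datatype}_du.txt.
printf '40\tgs://bucket/a.tsv\n70\tgs://bucket/b.tsv\n30\tgs://bucket/c.tsv\n' > du_sets.txt

# Tag each line with the cumulative size and a 1-based "setN" label,
# starting a new set whenever the running total crosses the limit.
awk -v limit=100 '{s+=$1; print $1"\t"$2"\t"s"\tset"(1+int((s-1)/limit))}' du_sets.txt > sets.txt

# Loop over the distinct labels, pulling out that set's file paths.
for set in $(cut -f4 sets.txt | sort -u); do
  echo "files in ${set}:"
  awk -F'\t' -v set="$set" '$4==set {print $2}' sets.txt
done
```

With this toy data, a.tsv (cumulative 40) lands in set1 and b.tsv/c.tsv (cumulative 110 and 140) land in set2, so no per-set temp files or subtraction are needed.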
Brilliant, that's fantastic. Implementing that now.
@@ -0,0 +1,154 @@
version 1.0
Why is there this whole extra file?
A testing file snuck in there. Removed it.
* add bash code to chunk files at 15tb limit
* update dockstore.yml
* fix yml formatting
* fill in importLoadTable inputs json
* take out variable in comment
* remove unnecessary files in task
* printf for scientific notation
* scientific notation
* increase memory to check if memory issue
* remove gsutil copy log
* fix gsutil cp destination directory
* decrease memory after testing for increased memory fails
* testing bq job id
* test sed pattern matching
* take out mv command for testing purposes
* update move to done files command to within each set
* troubleshoot bq load command
* update project for inputs json
* revert parsing of job id
* test
* write load status to separate file
* sed file to get job ID
* test2
* fix bq wait status
* separate wait status command into pieces
* reorganize output block
* move wait command into main loop
* test taking out wait command entirely
* add task just for bq wait testing
* set project for bq wait
* put everything back now that wait works with project
* re-commit with updated comments and cleaned up echo commands
* take out call to test task
* mv files from original dir to set dirs instead of copy
* troubleshoot wait_status
* change back to cp for testing in copying sets step
* take out output file for bq wait
* add back input file to wait statement
* add in more tmp files in output
* try to capture success in bq wait
* reorg output file for while loop
* fix sed command
* clean up for final test - mv files from original to sets
* update full import genomes wdl
* take out un-used inputs in load table task
* put back in the variables that control calls in order
* fixing lockfile exists comment
* add quotes to escape backtick in echo
* take out hard coded pet table in gsutil du
* single quote echo to block expression expansion in lockfile
* multi line echo comment for lockfile
* fix missing quotes
* set dockstore yml back to what it should be for ah_var_store branch
* remove testing wdl and json files
* set single file with all sum and sets
* fix sed variable format
* testing pattern matching
* fix reading wrong input file in loop
* adding output files
* test
* fix quotes on awk statement
* awk
* clean up comments and print statements
* edit comments
Chunk the full list of .tsv files ready to load to bq into sets that are below the 15 TB limit set on each bq load. From the original datatype_tsvs directory, each set is moved to its own directory, and when the load is complete, the data is moved into a done directory within each set.
Assuming pet tsvs and 1 set, at the start:
gs://bucket/pet_tsvs/pet_001_*
At end:
gs://bucket/pet_tsvs/set_1/done/pet_001_*
The output file, bq_final_job_statuses.txt, contains the following columns (and example data):

What should be the best user experience in case of FAIL?