Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add manifest summary file to GvsExtractCallset #7457

Merged
merged 11 commits into from
Sep 9, 2021

Conversation

ericsong
Copy link

@ericsong ericsong commented Sep 7, 2021

The manifest file describes the files produced by the GvsExtractCallset task and is uploaded to GCS once all the shards have finished running.

The existence of the manifest.txt file can be used to determine if the extraction is complete or not by just using gsutil command and not going through Cromwell/Terra.

Here's a snippet of what that looks like.

interval_number, vcf_file_location, vcf_file_bytes, vcf_index_location, vcf_index_bytes
0,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_0.vcf.gz,879403,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_0.vcf.gz.tbi,1682
1,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_1.vcf.gz,871855,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_1.vcf.gz.tbi,1670
2,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_2.vcf.gz,710617,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_2.vcf.gz.tbi,1629
3,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_3.vcf.gz,715565,gs://fc-secure-765db3ba-3e2d-432e-89eb-9efda7430f93/genomic-extractions/66f225d3-35cf-4ff7-a4ec-3d887a86ea8b/vcfs/interval_3.vcf.gz.tbi,1645
...

Copy link

@calbach calbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. We should definitely check with spec ops for their thoughts; I think this should be generally useful but there may be other things they want to see or not see here.

@@ -188,16 +195,30 @@ task ExtractTask {
~{true='--emit-pls' false='' emit_pls} \
${FILTERING_ARGS}

du -b ~{output_file} | cut -f1 > vcf_bytes.txt
du -b ~{output_file}.tbi | cut -f1 > vcf_index_bytes.txt
INTERVAL_NUMBER=$(echo ~{output_file} | grep -oEi '[0-9]+\.vcf\.gz' | cut -d'.' -f1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is complicated and also breaks layers of abstraction by assuming a specific filename format. If you need this, I would add a new input parameter like interval_index


# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ -n "${OUTPUT_GCS_DIR}" ]; then
gsutil cp ~{output_file} ${OUTPUT_GCS_DIR}/
gsutil cp ~{output_file}.tbi ${OUTPUT_GCS_DIR}/
OUTPUT_FILE_DEST=${OUTPUT_GCS_DIR}/~{output_file}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, existing issue: a lot of these variable replacements should probably be quoted. This prevents potential word splitting bugs, an possibly security vulnerabilities.


# Parent Task will collect manifest lines and create a joined file
# Currently, the schema is `[interval_number], [output_file_location], [output_file_size_bytes], [output_file_index_location], [output_file_size_bytes]`
echo ${INTERVAL_NUMBER},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sold on this column's necessity. What are your thoughts? Is it needed in order to produce an ordered/merged manifest, i.e. for the sort -n later?

If we do need it, then I would use more generic terminology here, e.g. shard_index or output_index. I think the fact that this corresponds to a set of intervals currently is not relevant to or usable by an end consumer

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with dropping it. I mostly added it for the sort functionality but also didn't think it would hurt to add.

task CreateManifest {

input {
Array[String] manifest_lines
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these ordered? If so, I suppose this task could also be responsible for numbering the rows.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I wanted them to be ordered which is why I run it through sort -n. It's also why I have the index as the first column.. I can see if I can try something smarter by using a Map in WDL.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a way to make this work with the built in WDL functions.. I can package the interval_index and manifest line with their Pair object but I don't see a reasonable way to actually print the values out into the command block. Given that, I'm going to pass the index through as part of the manifest line like I was doing before but I am at least passing IN the index as part of the scatter call instead of parsing it from the filename.

@gatk-bot
Copy link

gatk-bot commented Sep 9, 2021

Travis reported job failures from build 35897
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 35897.12 logs

Copy link
Contributor

@kcibul kcibul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable to me -- question about the non-output gcs bucket path, and agree with CHs comments code/style wise


# Parent Task will collect manifest lines and create a joined file
# Currently, the schema is `[interval_number], [output_file_location], [output_file_size_bytes], [output_file_index_location], [output_file_size_bytes]`
echo ~{interval_index},${OUTPUT_FILE_DEST},${OUTPUT_FILE_BYTES},${OUTPUT_FILE_INDEX_DEST},${OUTPUT_FILE_INDEX_BYTES} >> manifest.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in the case where there is not an OUTPUT_GCS_DIR theOUTPUT_FILE_DEST is just going to be the filename on local filesystem not in the cromwell execution gcs bucket… is that what you intend?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I could try to add the cromwell execution gcs bucket though, didn't know if that was possible. Do you know how I can get the current execution directory from a cromwell task?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kcibul going to merge this for now so I can get this in but I can follow up

@ericsong ericsong merged commit d14eb3c into ah_var_store Sep 9, 2021
@ericsong ericsong deleted the songe/add-manifest branch September 9, 2021 16:22
RoriCremer pushed a commit that referenced this pull request Sep 20, 2021
* add manifest file

* proper sorting

* fix regex

* better comment

* remove debug lines

* upload manifest to GCS

* pass through interval number

* write as map

* wdl 1.1

* use old method for sort

* add quotes
This was referenced Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants