Callset statistics [VS-560] #8018

Merged
merged 22 commits into from
Sep 27, 2022

@mcovarr mcovarr commented Sep 14, 2022

Successful Quickstart run here; this has not yet been run on larger datasets.

codecov bot commented Sep 14, 2022

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@3b74d0a).
The diff coverage is n/a.

Additional details and impacted files
@@               Coverage Diff                @@
##             ah_var_store     #8018   +/-   ##
================================================
  Coverage                ?   86.226%           
  Complexity              ?     35201           
================================================
  Files                   ?      2173           
  Lines                   ?    165004           
  Branches                ?     17792           
================================================
  Hits                    ?    142277           
  Misses                  ?     16393           
  Partials                ?      6334           

@gbggrant gbggrant left a comment

Where do the outputs go? I don't see anything in the ~{extract_prefix}_statistics table that got created in my test run.

Also, I kind of think a text file as output would be very useful for analysis/reporting.

exit 1
fi

# Schemas extracted programmatically: https://stackoverflow.com/a/66987934

cool!

singleton,
pass_qc
)
SELECT "~{filter_set_name}" filter_set_name,

I know that none of the explanations are in the code that you are looking at to write this wdl, but I think getting Lee to add some context about what is being calculated would be really helpful. I'm fine with that being a future ticket

@gbggrant

So, I ran it on a previously run set and it failed because there were already rows in the database tables.
I then dropped the tables and it failed again, this time because the tables were gone. Presumably we need a volatile=true on the CreateTables task (and to make it find or create)?
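
One way to read the "find or create" suggestion, as a minimal sketch: check for the table with `bq show` and only call `bq mk` when it is missing. The helper name `ensure_table` and its three arguments are hypothetical and not taken from this PR. The call-caching half of the suggestion is assumed here to mean a `volatile: true` entry in the task's `meta` section, which is a Cromwell setting and is not shown below.

```shell
# Hypothetical helper (not from this PR): create a BigQuery table only if it is missing.
# Arguments: project id, dataset.table, path to a JSON schema file.
ensure_table() {
  local project_id="$1" table="$2" schema_file="$3"
  # `bq show` exits non-zero when the table does not exist.
  if bq --project_id="$project_id" show "$table" > /dev/null 2>&1; then
    echo "found existing table $table"
  else
    bq --project_id="$project_id" mk --table "$table" "$schema_file"
    echo "created table $table"
  fi
}
```

Because the existence check sits inside an `if`, it does not trip `set -o errexit` when the table is absent. Note that this only covers the missing-table failure; rows left over from an earlier run would still need to be handled separately.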

@gbggrant

Reran it here and it succeeded, data looks good as far as I can tell.

@rsasch rsasch left a comment

As far as I can tell, this workflow creates the statistics table but does not output the contents into a TSV or CSV, which is what we deliver along with the callset. Would it be possible to add an export to TSV to a specified GCS location to the CollectStatistics task (or a new one)?
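
A rough sketch of what such an export could look like, using `bq extract` with a tab delimiter. The helper name `export_table_tsv`, the table name, and the GCS URI are placeholders for illustration, and this is not necessarily the approach the PR takes.

```shell
# Hypothetical helper (not from this PR): export a BigQuery table to GCS as TSV.
# Arguments: project id, dataset.table, destination gs:// URI.
export_table_tsv() {
  local project_id="$1" table="$2" gcs_uri="$3"
  # CSV format with a tab field delimiter yields a TSV file.
  bq --project_id="$project_id" extract \
    --destination_format=CSV \
    --field_delimiter=tab \
    "$table" "$gcs_uri"
}
```

For example, `export_table_tsv my-project my_dataset.my_prefix_statistics gs://my-bucket/callset_stats.tsv`, where all three values are made up.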

rsasch commented Sep 21, 2022

On the good news front, I compared an export of the "statistics_table" to the callset stats file I generated for Beta and they matched! 👍🏻 (if you're curious, the run is https://app.terra.bio/#workspaces/allofus-drc-wgs-dev/AoU_DRC_WGS_12-6-21_beta_ingest/job_history/45a7764c-9f8f-49e3-b1f6-2bf28ac16b4b)
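
A comparison like this can be repeated on later runs with a small order-insensitive check. The sketch below assumes both files are plain delimited text with identical formatting; the helper name `same_rows` is hypothetical, and this is not necessarily how the comparison above was done.

```shell
# Hypothetical helper: exit 0 when two delimited files contain the same rows,
# regardless of row order; otherwise print the differing rows and exit non-zero.
same_rows() {
  tmp_a=$(mktemp) || return 2
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
  sort "$1" > "$tmp_a"
  sort "$2" > "$tmp_b"
  diff "$tmp_a" "$tmp_b"
  status=$?
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}
```

Sorting both sides first means a difference in row order alone is not reported as a mismatch.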

mcovarr commented Sep 22, 2022

now with export to CSV

@mcovarr mcovarr requested a review from rsasch September 22, 2022 23:23

mcovarr commented Sep 23, 2022

command <<<
set -o errexit -o nounset -o xtrace -o pipefail

bq query --nouse_legacy_sql --project_id=~{project_id} --format=csv '

You probably need to include --max_rows set to the number of samples; otherwise the file will be limited to 100 rows (see https://stackoverflow.com/questions/34215311/how-bq-query-can-get-10000-rows).
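
A minimal sketch of the suggestion, with the row cap passed explicitly. The helper name `query_to_csv` and its arguments are hypothetical; in the workflow the cap would presumably come from the sample count.

```shell
# Hypothetical helper (not from this PR): run a query and print the result as CSV
# with an explicit row cap, since `bq query` returns only 100 rows by default.
# Arguments: project id, maximum number of rows, SQL text.
query_to_csv() {
  local project_id="$1" max_rows="$2" sql="$3"
  bq --project_id="$project_id" --format=csv query \
    --nouse_legacy_sql \
    --max_rows="$max_rows" \
    "$sql"
}
```

Usage would be along the lines of `query_to_csv my-project 5000 'SELECT * FROM my_dataset.my_table' > stats.csv`, where the project, row count, and table name are made up.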

@mcovarr mcovarr merged commit 953f68c into ah_var_store Sep 27, 2022
@mcovarr mcovarr deleted the vs_560_callset_stats branch September 27, 2022 17:03
This was referenced Mar 17, 2023
4 participants