Standardizing performance metrics output file #15

Open
jzook opened this issue Feb 25, 2016 · 4 comments

jzook (Contributor) commented Feb 25, 2016

Since we're getting close to outputting performance metrics, I thought we should have a discussion about the output format, so that we can exchange outputs from different comparison workflows run on different systems.

@pkrusche proposed what I assume is the current hap.py output format in https://github.com/ga4gh/benchmarking-tools/blob/master/doc/ref-impl/outputs.md, and @goranrakocevic discussed Seven Bridges' benchmarking output schema in a presentation a few weeks ago: https://drive.google.com/open?id=0B29EEcQ2PgqjRE91LV9yUTNSRHAtTC1sSTJsUWw5TUszZ0Fj. We've also defined performance metrics for Comparison Methods with different stringencies in https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceMetricsDefinitions.md and stratification methods in https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceStratification.md.

In looking at our current outputs file, I think we might want to expand it a bit to incorporate some of the ideas from @goranrakocevic and from our performance metrics and stratification methods documents.

If we want a single flat file, I think it may be useful to have more columns in addition to the Metrics columns:
Test call set (md5?)
Benchmark call set (md5?)
Zygosity
Variant type
Stratification bed
ROC field
ROC threshold
Comparison method (stringency of match from our spec)

For the Metrics columns, I'd suggest we take the definitions and names from the spec in https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceMetricsDefinitions.md. I also think we should add columns for 95% confidence intervals for the metrics.
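To make the column proposal concrete, here is a rough, purely hypothetical sketch of how one row of such a flat file could be written. The column names, the choice of a Wilson score interval for the 95% CI, and all of the numbers are placeholders for discussion, not a proposed schema; the metric names would ultimately come from the definitions document above.

```python
import csv
import math
import sys


def wilson_ci(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion such as recall."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (center - margin, center + margin)


# Hypothetical column names mirroring the list above; nothing here is agreed on.
columns = [
    "test_callset_md5", "benchmark_callset_md5", "zygosity", "variant_type",
    "stratification_bed", "roc_field", "roc_threshold", "comparison_method",
    "TRUTH.TP", "QUERY.TP", "FP", "FN",
    "recall", "recall_ci95_lower", "recall_ci95_upper",
]

# Made-up counts for a single stratum (het SNPs in an exome BED, genotype-level match).
truth_tp, query_tp, fp, fn = 3000, 3010, 40, 60
recall = truth_tp / (truth_tp + fn)
ci_lo, ci_hi = wilson_ci(truth_tp, truth_tp + fn)

writer = csv.writer(sys.stdout)
writer.writerow(columns)
writer.writerow([
    "md5-of-test-vcf", "md5-of-benchmark-vcf", "het", "snp", "exome.bed",
    "QUAL", "30", "genotype_match",
    truth_tp, query_tp, fp, fn,
    "%.4f" % recall, "%.4f" % ci_lo, "%.4f" % ci_hi,
])
```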

A couple of questions:
Should any of these fields be moved to the header, with a separate file output for each distinct value?
How should we report results when multiple stratification bed files are intersected?
I think it may also be useful to have some standardized values for some of these fields (e.g., snp, indel, complex, etc. for Variant type; see the strawman sketch below). Do others agree?
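As a purely illustrative strawman for what "standardized values" could mean for the Variant type field (none of these names are agreed on), the field could be restricted to an enumerated set along these lines:

```python
from enum import Enum


class VariantType(Enum):
    """Strawman controlled vocabulary for the "Variant type" column."""
    SNP = "snp"
    INDEL = "indel"
    COMPLEX = "complex"
    # ...plus whatever additional categories the group agrees on.
```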

pkrusche (Member) commented:

Here are my slides from the last benchmarking call related to this issue:

https://docs.google.com/presentation/d/1VCguvdhaSJI0z7Vbn_oyBYdoYsMzqMyjlTIroHoLBks/edit?usp=sharing

Also, the proposed output format in there is now supported by hap.py 0.3.0 and is documented here:

https://github.com/Illumina/hap.py/blob/dev-0.3/doc/happy.md

It's not necessarily final and comments are welcome.

pkrusche (Member) commented:

Also, here are some comments w.r.t. the differences between hap.py and the metrics definitions document:

  • hap.py outputs TRUTH.TP and QUERY.TP. Since (VCF-based) counts can differ depending on the representation (for example, if the truth set represents two adjacent SNVs as a single MNP record while the query reports them as two separate SNV records, a match gives TRUTH.TP = 1 but QUERY.TP = 2), I think we should consider adding this to the performance metrics document.
  • "Recall" vs. "Sensitivity" -- I can rename this easily in hap.py, but would prefer not to. Is there a strong preference for using "Sensitivity"?
  • hap.py / qfy.py outputs an "FP.gt" count to quantify genotype mismatches. This is part of the default / summary table output.

RebeccaTruty commented:

+1 to adding QUERY.TP to the output.

jzook (Contributor, Author) commented May 6, 2016

+1 to adding QUERY.TP to the output as well.

My preference for "sensitivity" vs. "recall" is based on my understanding that it is the standard terminology for clinical labs, but it would be great for people from clinical labs to confirm whether this is true.

As I've thought more about GT errors, I'm wondering whether we should combine Comparison Methods #2 and #3 (where allele concordance or genotype concordance is required), since we report GT errors as a separate statistic. What do others think?
