Records in intermediate VCF format #13

Open
bioinformed opened this issue Jan 26, 2016 · 3 comments

Comments

@bioinformed

Apologies for jumping into this discussion late. My question: what do the records in the intermediate VCF format look like? For example, given these two inputs:

Truth

1 10 AT GT,AC 0/1

Query

1 10 A G 0/1
1 11 T C 0/1

What does the intermediate output look like for this matching case?

And for this non-matching case:

Truth

1 10 AT GT,AC 0/1

Query

1 10 A G 1/1
1 11 T C 1/1

@bioinformed
Author

@pkrusche: More questions. For some reason I thought hap.py and xcmp already implemented the consensus intermediate format. Here is the current xcmp output from hap.py for my first example above (using valid ref coordinates):

##fileformat=VCFv4.1
##...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TRUTH QUERY
1 50011 . AT GT 1000 . gtt1=gt_het;type=FN;kind=missing;ctype=hap:match;HapMatch GT 0/1 ./.
1 50011 . A G 1000 . gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch GT ./. 0/1
1 50012 . T T 1000 . gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch GT ./. 0/1

Why is type equal to FN or FP in any of the records? The superloci match, which to me implies that type should equal TP in all records. Is this not the case in the consensus intermediate format?
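For reference, here is a minimal sketch that splits the INFO column of the three records above into key/value pairs, so it is easy to see which key carries which value. The helper name `parse_info` is hypothetical and not part of hap.py or xcmp; the INFO strings are copied from the output above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# INFO columns copied verbatim from the three xcmp records above.
infos = [
    "gtt1=gt_het;type=FN;kind=missing;ctype=hap:match;HapMatch",
    "gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch",
    "gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch",
]

parsed = [parse_info(info) for info in infos]
for fields in parsed:
    print(fields["type"], fields["kind"], fields["ctype"])
```

In these records the FN/FP values sit under the `type` key, `kind` is `missing` throughout, and `ctype` is `hap:match`.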

@pkrusche
Member

@bioinformed: regarding hap.py / xcmp, they will implement the new intermediate format soon, probably in February. The format started out similar to what hap.py currently writes, but it changed during the discussion.

@pkrusche
Member

In the matching case, if the comparison tool chooses not to split any input variants, I guess the only way to output the result is to print the records as they were and pad the other sample's column with "." genotypes. The BD values below assume strict GT comparison:

CHROM POS REF ALT    FORMAT      T          Q
1     10  AT  GT,AC  GT:BK:BD    0/1:gm:TP  .:gm:TP
1     10  A   G      GT:BK:BD    .:gm:TP    0/1:gm:TP
1     11  T   C      GT:BK:BD    .:gm:TP    0/1:gm:TP 
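The padding approach described above could be sketched roughly as follows. This is only an illustration, not how hap.py or any other comparison tool implements it: `Record` and `pad_records` are hypothetical names, and the BK/BD values are passed in rather than computed.

```python
from typing import NamedTuple


class Record(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    gt: str


def pad_records(truth, query, bk, bd):
    """Emit truth and query records unchanged, padding the other sample with '.'."""
    rows = []
    for source, records in (("T", truth), ("Q", query)):
        for r in records:
            present = f"{r.gt}:{bk}:{bd}"
            missing = f".:{bk}:{bd}"
            t, q = (present, missing) if source == "T" else (missing, present)
            rows.append((r.chrom, r.pos, r.ref, r.alt, "GT:BK:BD", t, q))
    # Stable sort by position keeps truth rows ahead of query rows at the same POS.
    rows.sort(key=lambda row: row[1])
    return rows


truth = [Record("1", 10, "AT", "GT,AC", "0/1")]
query = [Record("1", 10, "A", "G", "0/1"), Record("1", 11, "T", "C", "0/1")]

rows = pad_records(truth, query, bk="gm", bd="TP")
for row in rows:
    print("\t".join(str(field) for field in row))
```

With the matching-case inputs this reproduces the three rows of the table above; the BK and BD labels still have to come from the actual comparison step.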

For the mismatch case, it would depend on whether we want to require the comparison tool to be able to pick up a possible allele match. If so, it would probably output this:

CHROM POS REF ALT    FORMAT      T          Q
1     10  AT  GT,AC  GT:BK:BD    0/1:am:FP  .:am:FP
1     10  A   G      GT:BK:BD    .:am:FP    1/1:am:FP
1     11  T   C      GT:BK:BD    .:am:FP    1/1:am:FP 

This gives another corner case, by the way, if the GTs are the other way around:

CHROM POS REF ALT    FORMAT      T          Q
1     10  AT  GT,AC  GT:BK:BD    1/1:lm:FP  .:lm:FP
1     10  A   G      GT:BK:BD    .:lm:FP    0/1:lm:FP
1     11  T   C      GT:BK:BD    .:lm:FP    0/1:lm:FP 

The reason I would go for lm instead of am here is that there is a way to phase the query calls that makes the alleles mismatch, by putting the SNPs onto different haplotypes.

Does this look reasonable?
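To make the dependence on phasing concrete, here is a small sketch that enumerates the possible phasings of the two unphased het query SNPs and prints the resulting haplotypes over positions 10-11. It assumes alleles are compared as plain haplotype strings over the truth record's REF span, which is an assumption for illustration and not necessarily how a comparison tool would do it; `apply_snps` and `phasings` are hypothetical helpers.

```python
from itertools import product


def apply_snps(ref, start, snps):
    """Return ref with the given (pos, alt_base) SNPs substituted."""
    seq = list(ref)
    for pos, alt in snps:
        seq[pos - start] = alt
    return "".join(seq)


def phasings(ref, start, het_snps):
    """Enumerate the distinct haplotype pairs for a set of unphased het SNPs."""
    seen = set()
    for assignment in product((0, 1), repeat=len(het_snps)):
        hap_snps = ([], [])
        for which, snp in zip(assignment, het_snps):
            hap_snps[which].append(snp)
        # Sort the pair so that swapping the two haplotypes is not counted twice.
        seen.add(tuple(sorted(apply_snps(ref, start, s) for s in hap_snps)))
    return sorted(seen)


ref, start = "AT", 10                 # truth REF allele and its POS
query_hets = [(10, "G"), (11, "C")]   # query SNPs, both 0/1
truth_alts = {"GT", "AC"}             # truth ALT alleles

result = phasings(ref, start, query_hets)
for pair in result:
    shared = sorted(set(pair) & truth_alts)
    print(pair, "haplotypes also listed as truth ALTs:", shared)
```

The two phasings give different haplotype pairs, so whether the query haplotypes coincide with the truth ALT alleles depends on how the unphased calls are phased.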
