Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help subsetting samples #134

Open
oyhel opened this issue Sep 17, 2019 · 2 comments
Open

Help subsetting samples #134

oyhel opened this issue Sep 17, 2019 · 2 comments

Comments

@oyhel
Copy link

oyhel commented Sep 17, 2019

  • vcfpy version:
    0.12.1
  • Python version:
    3.6.3
  • Operating System:
    ubuntu 18.04

Description

I need to subset samples from a large VCF with approximately 100.000 samples. As the extraction is the only part of the workflow outside python it would be very nice to be able to use vcfpy for this. I read the docs and also tried to deduce how the package works from the source. However, I am coming up short and would very much appreciate some help understanding how the subsetting of samples works.

What I Did

import vcfpy
reader = vcfpy.Reader.from_path()

subset_samples = sel = ['sample123', 'sample124']
reader = vcfpy.Reader.from_path('myfile.vcf.gz', parsed_samples=subset_samples)

reader.parsed_samples returns the samples chosen to be parsed in the input:
['sample123', 'sample124']

When iterating over the records in the reader object:
for record in reader: record

I get:

<vcfpy.record.UnparsedCall object at 0x7f065fd30f98>

So I am assuming that the parser respects my list of samples to be parsed. However I am struggling with getting the calls for only these samples.

My goal is to extract all markers, but only for a few samples.

Any help would be very much appreciated!

@kylec
Copy link

kylec commented Jan 15, 2020

I have the same problem. I set parsed_samples for a subset of samples like you did but reader.calls still return calls for all the samples.

I got around it by finding the index of my samples.

sample_index = vcf_reader.header.samples.name_to_idx[subset_samples]
for row in reader:
  row.calls[sample_index]

@holtgrewe
Copy link
Member

Hi, I think that you are looking for Record.call_for_sample.

for record in reader:
    for sample in sel:
        call = reader.call_for_sample[sample]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants