Pysam UnicodeDecodeError when loading with tabixed VCF #139

Open
dgomezpere opened this issue Sep 14, 2020 · 5 comments

@dgomezpere

  • vcfpy version: 0.13.2
  • Python version: 3.6.9 64bit [GCC 8.4.0]
  • Operating System: Linux 4.15.0 1093 oem x86_64 with Ubuntu 18.04 bionic

Description

When I fetch variants by contig ID, I get the following UnicodeDecodeError, which suggests a problem parsing the tabix-indexed file. The issue may come from pysam, but I would like to know whether you have had previous reports of it.

What I Did

  • Index the VCF file with tabix:
$ tabix -p vcf <vcf_filepath>
  • Fetch records by contig with vcfpy:
import vcfpy

reader = vcfpy.Reader.from_path(path=DATA['annot_vcf'], tabix_path=DATA['annot_vcf']+'.tbi')
for record in reader.fetch('chr1'):
    [...]

Traceback Error

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-046818a3e579> in <module>
      3 variant_records = []
      4 sample_records = []
----> 5 for record in reader.fetch('chr1'):
      6     if record.CHROM in wanted_chroms:
      7         ALT = record.ALT[0].value

/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py in __next__(self)
    171         """
    172         if self.tabix_iter:
--> 173             return self.parser.parse_line(str(next(self.tabix_iter)))
    174         else:
    175             result = self.parser.parse_next_record()

pysam/libctabix.pyx in pysam.libctabix.TabixIterator.__next__()

pysam/libcutils.pyx in pysam.libcutils.charptr_to_str()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2821: ordinal not in range(128)

@holtgrewe
Member

Interesting, what is your locale setting? C? What happens if you set export LC_ALL=en_US.UTF-8 or similar?

@dgomezpere
Author

Hi @holtgrewe !!
My locale settings are already in en_US.UTF-8:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
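
The shell locale and the encodings that Python actually picks up can differ (for example inside a notebook kernel started from another environment), so it may be worth checking from within the same Python session. This is only a minimal sketch using the standard library; `python_encodings` is a hypothetical helper name, not part of vcfpy or pysam.

```python
import locale
import sys


def python_encodings():
    """Return the encodings this Python process picked up from its environment."""
    return {
        # Encoding Python uses by default for text files
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of standard output (may be None if stdout was replaced)
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in python_encodings().items():
        print(f"{name}: {value}")
```

If these report UTF-8 and the error still names the 'ascii' codec, the locale is probably not what is driving the decoding.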

@dgomezpere
Author

Any other ideas about the issue, @holtgrewe?
Thanks in advance!!

@holtgrewe
Member

It looks like you have non-ASCII Unicode in your VCF file and pysam is stumbling over it...
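
The byte 0xc3 in the traceback is the lead byte of a two-byte UTF-8 sequence (characters such as é or ñ), which fits this explanation. To find where such bytes sit in the file, something like the following could help. It is a rough sketch, assuming the VCF is bgzip-compressed (BGZF files are readable with Python's `gzip` module); `find_non_ascii_lines` is a hypothetical helper, not part of vcfpy or pysam.

```python
import gzip


def find_non_ascii_lines(path, limit=10):
    """Return up to `limit` tuples (line_number, byte_offset, text) for lines
    in a gzip/bgzip-compressed file that contain a non-ASCII byte."""
    hits = []
    with gzip.open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            # Offset of the first byte outside the 7-bit ASCII range, if any
            offset = next((i for i, b in enumerate(raw) if b > 0x7F), None)
            if offset is None:
                continue
            # Decode leniently so the offending line can be printed
            text = raw.decode("utf-8", errors="replace").rstrip("\n")
            hits.append((line_number, offset, text))
            if len(hits) >= limit:
                break
    return hits
```

Calling `find_non_ascii_lines(DATA['annot_vcf'])` would then list the first few offending lines, which are often annotation fields in INFO containing accented names.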

@holtgrewe
Member

Hm, I don't remember why I was using pysam instead of pytabix. I don't know whether pytabix is more robust... Hm, one could try to replace the tabix part of pysam with pytabix in vcfpy...
