A sample column value of 0|0 is not being parsed correctly #176

Open
JarvisVon opened this issue Jun 24, 2024 · 0 comments · May be fixed by #178

JarvisVon commented Jun 24, 2024

  • vcfpy version: 0.13.8
  • Python version: 3.12.1
  • Operating System: macOS 14.0

Description

A sample column value of 0|0 is not parsed correctly when FORMAT is "GT" and no GT FieldInfo is specified in the header.

Using a file with 11 columns where FORMAT="GT", "SAMPLE1" = "0|0", and "SAMPLE2" = "1|0", the parsed genotype picks up stray list syntax and the reader fails with:

ValueError: invalid literal for int() with base 10: "['0"
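
A hypothetical minimal file matching that description could be generated as in the sketch below. The chromosome, position, alleles, and the `write_minimal_vcf` helper are made up for illustration and are not the reporter's data. The only properties taken from the report are the 11 columns, FORMAT "GT", the two sample values, and the absence of a `##FORMAT` header line for GT. Whether vcfpy fails on exactly this file has not been verified here.

```python
import tempfile
from pathlib import Path

# Hypothetical reproducer: 11 tab-separated columns, FORMAT is "GT",
# and deliberately no "##FORMAT=<ID=GT,...>" line in the header.
HEADER_COLUMNS = [
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
    "FORMAT", "SAMPLE1", "SAMPLE2",
]
RECORD = ["1", "100", ".", "A", "T", ".", "PASS", ".", "GT", "0|0", "1|0"]


def write_minimal_vcf(directory):
    """Write the illustrative VCF into `directory` and return its path."""
    path = Path(directory) / "repro.vcf"
    lines = [
        "##fileformat=VCFv4.2",
        "\t".join(HEADER_COLUMNS),
        "\t".join(RECORD),
    ]
    path.write_text("\n".join(lines) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    vcf_path = write_minimal_vcf(tmp)
    columns = vcf_path.read_text().splitlines()[-1].split("\t")
    print(len(columns), columns[8:])
```

Passing the returned path to `vcfpy.Reader.from_path` and iterating over the reader would be the step that, per the report, raises the error.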

What I Did

>>> import vcfpy
>>> path = '/path/to/file.vcf'
>>> reader = vcfpy.Reader.from_path(path)
>>> for record in reader:
...     pass  # do work

Stack Trace:

python3.12/site-packages/vcfpy/header.py:413: FieldInfoNotFound: FORMAT GT not found using String/'.' instead
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.12/site-packages/vcfpy/reader.py", line 175, in __next__
    result = self.parser.parse_next_record()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/vcfpy/parser.py", line 804, in parse_next_record
    return self.parse_line(self._read_next_line())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/vcfpy/parser.py", line 795, in parse_line
    return self._record_parser.parse_line(line)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/vcfpy/parser.py", line 467, in parse_line
    calls = self._handle_calls(alts, format_, arr[8], arr)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/vcfpy/parser.py", line 481, in _handle_calls
    call = record.Call(sample, data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/vcfpy/record.py", line 236, in __init__
    self._genotype_updated()
  File "python3.12/site-packages/vcfpy/record.py", line 259, in _genotype_updated
    self.gt_alleles.append(int(allele))
                           ^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: "['0"

The issue appears to be in the function parse_field_value in vcfpy/parser.py. Its default behavior is to split the value on ',' and return a list of the converted values. In the test data set there is no ',' to split on, so a list of length 1 is returned. That value is then used in _genotype_updated in vcfpy/record.py, in the loop for allele in ALLELE_DELIM.split(str(self.data["GT"])):. Because self.data["GT"] is a one-element list rather than a string, str() turns it into the text "['0|0']", and the regex split then yields pieces such as "['0" that still carry the list syntax. Passing the first piece to int() raises the ValueError above.
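
The mechanism can be reproduced without vcfpy. The sketch below assumes a delimiter pattern equivalent to vcfpy's ALLELE_DELIM, one that matches "|" or "/". The pattern and the `split_alleles` helper are stand-ins written for this illustration, not code copied from the library.

```python
import re

# Assumption: stand-in for vcfpy's ALLELE_DELIM, matching "|" or "/".
ALLELE_DELIM = re.compile(r"[|/]")


def split_alleles(gt_value):
    """Mimic the split step: stringify the value, then split on the delimiter."""
    return ALLELE_DELIM.split(str(gt_value))


# A plain string splits into clean allele tokens.
as_string = split_alleles("0|0")

# A one-element list is stringified to "['0|0']" first, so the tokens
# keep the bracket and quote characters from the list's repr.
as_list = split_alleles(["0|0"])

print(as_string)
print(as_list)

try:
    int(as_list[0])
except ValueError as exc:
    print(exc)
```

The first token of the list case is the same `"['0"` that appears in the traceback.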

Two fixes seem possible: parser.parse_field_value could return a single string instead of a list when the result has length 1, or (probably the safer change) record._genotype_updated could check whether self.data["GT"] is a list instead of assuming it is a string.
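
A minimal sketch of the second option follows, written as standalone functions and not as a patch to vcfpy. `normalize_gt` and `gt_alleles` are hypothetical names, the delimiter pattern is an assumed equivalent of ALLELE_DELIM, and the "." handling is a simplification of what the real method may do.

```python
import re

# Assumption: stand-in for vcfpy's ALLELE_DELIM, matching "|" or "/".
ALLELE_DELIM = re.compile(r"[|/]")


def normalize_gt(value):
    """Unwrap a one-element list so the genotype is always a plain string."""
    if isinstance(value, list):
        if len(value) != 1:
            raise ValueError(f"expected exactly one GT value, got {value!r}")
        return value[0]
    return value


def gt_alleles(value):
    """Split a genotype into integer alleles; '.' becomes None."""
    alleles = []
    for allele in ALLELE_DELIM.split(str(normalize_gt(value))):
        alleles.append(None if allele == "." else int(allele))
    return alleles


print(gt_alleles(["0|0"]))
print(gt_alleles("1|0"))
```

Guarding at the point of use leaves parse_field_value's return type unchanged for every other field, which is presumably why this is the lower-risk of the two changes.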

@JarvisVon JarvisVon changed the title A sample column value of 0|0, is not being parsed correctly A sample column value of 0|0 is not being parsed correctly Jun 24, 2024