Duplicate Nodes with Special Characters in Name #368

Open
gp201 opened this issue Mar 4, 2024 · 1 comment

gp201 commented Mar 4, 2024

Description

When node names contain certain special characters, a duplicate node is created.

Steps to Reproduce

1. usher -t tree.nwk -v aligned.vcf -o tree.pb (duplicate observed in final_tree.nh)
2. matUtils extract -i tree.pb -C lineagePaths.txt -j auspice_tree.json -S samplePaths.txt (duplicate observed in auspice_tree.json)

Expected Behavior

The phylogenetic tree should not contain duplicate nodes.

Actual Behavior

The node 'hRSV/A/Germany/22-02516/2021' is present twice in the tree.

Additional Information

Files to reproduce the bug: bug_example.zip. Run run.sh to generate the relevant files.

Environment

Conda
Usher: 0.6.3

Please let me know whether this is a genuine error or an oversight on my end. Thank you.

@AngieHinrichs
Contributor

Hi @gp201, that is a great little test case. It appears that usher's Newick parsing does not handle quoting. So it appears to usher that there are two distinct sequences: 'hRSV/A/Germany/22-02516/2021' in tree.nwk (treating the quotes as part of the name) with no substitutions relative to the reference, and a different sequence hRSV/A/Germany/22-02516/2021 in aligned.vcf (no quotes).
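To see which names are affected before running usher, one rough option is to list the labels in the Newick file that still carry quote characters. The sketch below is only an illustration: the helper names newick_labels and quoted_labels are hypothetical, and the label splitting is deliberately naive (it assumes no label contains the structural characters ( ) , ; or a colon, and it will also report internal node labels if the tree has any).

```shell
#!/usr/bin/env bash

# Hypothetical helper: print one Newick label per line, read from stdin.
# Naive approach: split on the structural characters, then drop branch lengths.
newick_labels() {
    tr '(),;' '\n\n\n\n' | sed -E 's/:.*$//' | grep -v '^[[:space:]]*$'
}

# Hypothetical helper: keep only labels that contain a single or double quote.
quoted_labels() {
    newick_labels | grep "['\"]"
}

# Inline example using the name from this issue; prints the quoted label.
printf "%s\n" "('hRSV/A/Germany/22-02516/2021':0.1,B:0.2);" | quoted_labels

# On the real input this would be:  quoted_labels < tree.nwk
```

Any label printed here is one that usher would treat as different from the unquoted name in the VCF.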

It would be better for usher's Newick parsing to recognize quoting, but in the meantime there is a straightforward workaround: strip all quote characters from your input Newick. (Also make sure before creating Newick and VCF that none of the input sequence names contain any characters with special meaning in Newick like [():,;].) Here is an example command that would do that:

sed -re "s/['\"]//g;" tree.nwk > tree.noQuotes.nwk
usher -t tree.noQuotes.nwk -v aligned.vcf -o tree.pb
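
The check on sequence names mentioned above could look something like the following sketch. It assumes a standard uncompressed VCF in which sample names are the tab-separated columns from the tenth onward on the #CHROM header line; the helper names vcf_sample_names and names_with_newick_specials are hypothetical and not part of usher or matUtils.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print one VCF sample name per line, read from stdin.
# Sample names are columns 10 onward of the #CHROM header line.
vcf_sample_names() {
    grep -m1 '^#CHROM' | cut -f10- | tr '\t' '\n'
}

# Hypothetical helper: keep names containing characters that are special
# in Newick ( [ ] ( ) : , ; ) or a quote character.
names_with_newick_specials() {
    grep -E "[][():,;'\"]"
}

# Inline example with a made-up header line; prints only the problematic name.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tgood/name\tbad(name)\n' \
    | vcf_sample_names | names_with_newick_specials

# On the real input this would be:
#   vcf_sample_names < aligned.vcf | names_with_newick_specials
```

If this prints nothing for aligned.vcf, the names should be safe to use unquoted in the Newick file.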
