Too many sequences in an output tree #358

imdanique · 2023-11-06T13:23:19Z

Hi, thanks for developing Usher.

I'm encountering an issue. My goal is to find the 10 nearest neighbors for each sequence in a FASTA file and build a tree with input sequences + neighbors. Here's what I did:

   #Alignment
    mafft --thread 16  --keeplength --addfragments all_seq.fasta ref.fa > aligned_seq.fa

    # Ensuring that first seq is ref
    head -1 aligned_seq.fa

    # Downloads the problematic sites in ref genome
    wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > problematic_sites_sarsCov2.vcf

    # Converts fasta to vcf with correction for the problematic sites
    faToVcf -includeNoAltN -maskSites=problematic_sites_sarsCov2.vcf aligned_seq.fa aligned_seq.vcf

    # Downloads the latest global lineages
    wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz" 

    # Runs usher for lineage assignment
    usher -i public-latest.all.masked.pb \
    -v aligned_seq.vcf -k 10  -T 16 -d result

Then, I tried to calculate all IDs in output subtrees:

from Bio import Phylo
import glob
# Initialize an empty list to store all the trees
trees = []

# Loop through all the files containing the subtrees
for file in glob.glob("result/*.nh"):
    tree = Phylo.read(file, "newick")
    # Extract sequence IDs from the tree and add them to the list
    sequence_ids.extend([leaf.name for leaf in tree.get_terminals()])

# Open a text file to write the sequence IDs
with open('result/sequence_ids.txt', 'w') as f:
    for id in sequence_ids:
        f.write(id + '\n')

Calculated all unique entries:

sort usher_results_k10/sequence_ids.txt | uniq | grep -v "node" | wc -l
>  2595922

In my input I had 1909 sequences. I expected a maximum of 19,090 output sequences (10 x input sequences), but I ended up with 2,595,922 sequences in 1152 output trees. How do I modify the Usher command to provide only 10 neighbors? Any insights on what went wrong would be greatly appreciated.

AngieHinrichs · 2023-11-07T16:12:25Z

Hi @imdanique. In addition to the -k output files subtree-*.nh, usher also makes an output file for the whole tree, final-tree.nh, and the glob("result/*.nh") in your Python script might be picking that up as well. Try changing that to glob("result/subtree-*.nh") to match only the subtree output files, and let us know if that doesn't fix it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too many sequences in an output tree #358

Too many sequences in an output tree #358

imdanique commented Nov 6, 2023

AngieHinrichs commented Nov 7, 2023

Too many sequences in an output tree #358

Too many sequences in an output tree #358

Comments

imdanique commented Nov 6, 2023

AngieHinrichs commented Nov 7, 2023