Different number of Sequences in Tree and Fasta #356

Open
wtporter opened this issue Oct 26, 2023 · 2 comments

Comments

@wtporter

wtporter commented Oct 26, 2023

Hi,
I compared the public-latest.all.fa sequences with the nwk tree and the tsv metadata file, and there appears to be a discrepancy in the sample counts. The .fa file contains ~6.6 million sequences, while the tsv and the tree have ~8.3 million. Is the fasta reduced to just unique sequences, or is there an issue preventing all ~8.3 million sequences from being written to the fasta?
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/
Thanks for this great resource!

@AngieHinrichs
Contributor

Thanks for pointing out the discrepancy, @wtporter. I will look into it.

@AngieHinrichs
Contributor

I've been adding new public sequences from the daily build to the public MSA, but that misses quite a few sequences over time. Sometimes a new sequence is available from GISAID earlier than from a public repository like GenBank, so the GISAID version of the sequence is aligned to the reference and added to the tree. Later, when the public version becomes available, it is renamed in the tree instead of being aligned and added, so it never makes it into the public MSA. So I needed to round up 1.7 million missing sequences, align them, and add them to the MSA.

I have done that for the 2023-10-30 tree:

http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2023/10/30/

Over time, the daily additions to the MSA will again gradually fall behind the tree. Let me know if you need another update in the future.
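
For anyone who wants to check how far the fasta has drifted from the metadata before asking for an update, here is a minimal sketch in Python. It is not the script used to build these files. It assumes the name in each fasta header (everything after `>`) matches a metadata column called `strain`; both the matching rule and the column name are assumptions to verify against the actual downloads, and the helper names are made up for illustration.

```python
import csv


def fasta_names(lines):
    """Yield the sequence name from each FASTA header line (the text after '>')."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def metadata_names(lines, column="strain"):
    """Yield the value of `column` from each row of a tab-separated file with a header row.

    The column name "strain" is an assumption; adjust it to the real header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield row[column]


def missing_from_fasta(fasta_lines, metadata_lines, column="strain"):
    """Return the sorted metadata names that have no matching FASTA record."""
    in_fasta = set(fasta_names(fasta_lines))
    in_metadata = set(metadata_names(metadata_lines, column))
    return sorted(in_metadata - in_fasta)


# Tiny in-memory example with made-up names, not real data.
sample_fasta = [
    ">A|1|2020-01-01\n", "ACGT\n",
    ">B|2|2020-01-02\n", "ACGT\n",
]
sample_metadata = [
    "strain\tdate\n",
    "A|1|2020-01-01\t2020-01-01\n",
    "B|2|2020-01-02\t2020-01-02\n",
    "C|3|2020-01-03\t2020-01-03\n",
]
missing = missing_from_fasta(sample_fasta, sample_metadata)
print(len(missing), missing)
```

For the real files, any iterable of text lines works as input, for example `open(path)` for plain text, or `gzip.open(path, "rt")` or `lzma.open(path, "rt")` if the download is compressed. Holding several million names in Python sets takes a fair amount of memory, so this is meant as a quick check rather than something tuned for scale.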


2 participants