duplicate in master db #1

Open
thsyd opened this issue Feb 14, 2019 · 3 comments
thsyd commented Feb 14, 2019

Hi,
the record
>IS1004_IS200/IS605_IS200
and its associated sequence are duplicated in https://raw.githubusercontent.com/thanhleviet/ISfinder-sequences/master/IS.fna
(the header is on lines 13 and 15, the sequence on lines 14 and 16).
Is it probably the same for the fsa and csv versions?

>IS1004_IS200/IS605_IS200
TGTCATCCCTAAACCACCGCTTTTAGCGGTGGTGATTGTCCCTAGGGGCTTTTGCCCGAAAATGCGCCCATGTTAGAAGACAAACTCTTATTCACCATAAGTAAGAGGATTCAAATAACATGGGCGACTACAGAAGTTCATCACACGTCTATTGGCGTTGCAAATATCATATCGTTTGGACACCAAAATTTCGTTTTAAGATCTTAAAAGGTAATGTTGCCAAAGAGCTAAATCGTTCGATCTACATTCTTTGTAATATGAAAGATTGTGAAGTTTTGGAACTCAATGTTCAGCCAGATCATGTCCACTTAGTTGCGATAATTCCGCCCAAAGTATCGATTTCGACGTTGATGGGAGTTTTAAAGGGTAGGAGTGCAATTAGGCTATTCAACAAGTTTCCACATATCAGGAAAAAGTTATGGGGAAATCATTTTTGGGCGCGAGGCTATTTTGTGGATACGGTAGGTGTAAATGAAGAAATCATTAGACGGTATGTACGGCATCAAGACAAAAAAGAGCTTGAGCAAGAGCAGCAGTTAGAGTTATTGAGAGACTAACAGCGTCGTGGCCCCCTTTTAGGGGGCTTATATTAAAACCGCCTTCTAAGAAGGCGGATTTTT
>IS1004_IS200/IS605_IS200
TGTCATCCCTAAACCACCGCTTTTAGCGGTGGTGATTGTCCCTAGGGGCTTTTGCCCGAAAATGCGCCCATGTTAGAAGACAAACTCTTATTCACCATAAGTAAGAGGATTCAAATAACATGGGCGACTACAGAAGTTCATCACACGTCTATTGGCGTTGCAAATATCATATCGTTTGGACACCAAAATTTCGTTTTAAGATCTTAAAAGGTAATGTTGCCAAAGAGCTAAATCGTTCGATCTACATTCTTTGTAATATGAAAGATTGTGAAGTTTTGGAACTCAATGTTCAGCCAGATCATGTCCACTTAGTTGCGATAATTCCGCCCAAAGTATCGATTTCGACGTTGATGGGAGTTTTAAAGGGTAGGAGTGCAATTAGGCTATTCAACAAGTTTCCACATATCAGGAAAAAGTTATGGGGAAATCATTTTTGGGCGCGAGGCTATTTTGTGGATACGGTAGGTGTAAATGAAGAAATCATTAGACGGTATGTACGGCATCAAGACAAAAAAGAGCTTGAGCAAGAGCAGCAGTTAGAGTTATTGAGAGACTAACAGCGTCGTGGCCCCCTTTTAGGGGGCTTATATTAAAACCGCCTTCTAAGAAGGCGGATTTTT
thsyd commented Feb 14, 2019

Actually,
using https://github.com/b-brankovics/fasta_tools/blob/master/bin/fasta_unique I find that
the report of sequences present in multiple copies has 767 lines, and 5291 of the 5685 records remain in the deduplicated output.

usage of fasta_unique

fasta_unique input.fas >unique.fas 2>unique.tab

where input.fas = input file in FASTA format
unique.fas = output file (a file containing the unique sequences)
unique.tab = tab-separated file of the stderr warnings about sequences that occur multiple times

code:

wget https://raw.githubusercontent.com/thanhleviet/ISfinder-sequences/master/IS.fna
perl fasta_unique.perl IS.fna > IS_fasta_unique.perl.fna 2>IS_fasta_unique.perl.unique.tsv

grep "^>" IS.fna | wc -l
5685

grep "^>" IS_fasta_unique.perl.fna | wc -l
5291

wc -l IS_fasta_unique.perl.unique.tsv
767 IS_fasta_unique.perl.unique.tsv
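
The counts above can be cross-checked without fasta_unique. The following is a minimal Python sketch, not the tool's own method: it assumes duplicates are defined as exact, case-insensitive matches of the full sequence string, which may differ from the rules fasta_unique applies, so the numbers need not agree. The function names `read_fasta` and `duplicate_report` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_report(lines):
    """Group headers by identical sequence.

    Assumption: two records are duplicates when their sequences match
    exactly after upper-casing. Returns (total records, distinct
    sequences, {sequence: [headers]} for sequences seen more than once).
    """
    groups = defaultdict(list)
    total = 0
    for header, seq in read_fasta(lines):
        groups[seq.upper()].append(header)
        total += 1
    repeated = {s: h for s, h in groups.items() if len(h) > 1}
    return total, len(groups), repeated


# Example use on the downloaded file (file name taken from the commands above):
#   with open("IS.fna") as handle:
#       total, distinct, repeated = duplicate_report(handle)
#   print(total, distinct, len(repeated))
```

Here `total` corresponds to the `grep "^>" | wc -l` count on the input, `distinct` to the number of records one would expect in a deduplicated file, and `len(repeated)` to the number of sequences that occur in more than one copy.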

Output files (.txt added for compatibility with GitHub):

IS_fasta_unique.perl.fna.txt
IS_fasta_unique.perl.unique.tsv.txt

@tianmao233666

Thank you very much!

zzlzef commented Feb 17, 2020

good man!
