
This repository extracts protein features using the SPMap tool.


gozsari/SPMap_Tool


SPMap: Subsequence-based feature map for protein function classification

SPMap uses the information carried by the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are then clustered to obtain a representative feature space mapping. The mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification.
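To make the idea of a "distribution of subsequences over clusters" concrete, here is a deliberately simplified sketch. It is not the SPMap implementation: the overlapping sliding window, the Hamming-distance assignment, and the hard-coded cluster representatives in `centers` are all assumptions made for illustration, and the clustering step itself is skipped. The tool's actual clustering and scoring may differ.

```python
from collections import Counter


def subsequences(seq, k=5):
    """Return all overlapping length-k windows of seq (assumed windowing scheme)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def feature_vector(seq, centers, k=5):
    """Fraction of the sequence's subsequences assigned to each cluster center.

    Each subsequence goes to the nearest center by Hamming distance
    (ties go to the lower index). This stands in for the real scoring.
    """
    subs = subsequences(seq, k)
    counts = Counter(
        min(range(len(centers)), key=lambda j: hamming(s, centers[j]))
        for s in subs
    )
    total = len(subs) or 1  # avoid division by zero for very short sequences
    return [counts[j] / total for j in range(len(centers))]


# Hypothetical cluster representatives; a real profile comes from training data.
centers = ["MKTAY", "LLGAV"]
vector = feature_vector("MKTAYLLGAV", centers)
print(vector)
```

The output has one entry per cluster, so every protein maps to a fixed-length vector regardless of its sequence length, which is what allows a standard classifier to be trained on it.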


Fig. 1. SPMap flow diagram. (A) Subsequence profile map construction: subsequences of the proteins in the positive training set are clustered to construct the subsequence profile map. (B) Classification: the constructed profile map is used to find the feature space representation of the protein sequence to be classified.

SPMap tool:

SPMap is a sequence-based feature extraction tool that relies on subsequence profiles obtained from training data (in FASTA format).

  1. Users should first form a profile from the respective training dataset.
  2. Users can then extract protein features using the profile(s).
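Both steps start from FASTA input. As a minimal sketch of that first stage, the function below reads FASTA records and skips sequences shorter than a minimum length, in the spirit of the `minSeqLen` argument. The function name `read_fasta` and the example records are hypothetical, and this is not the parser used by `runSPMap.py`.

```python
def read_fasta(lines, min_seq_len=20):
    """Yield (header, sequence) pairs, skipping sequences shorter than min_seq_len."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                seq = "".join(chunks)
                if len(seq) >= min_seq_len:
                    yield header, seq
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    # Emit the final record.
    if header is not None:
        seq = "".join(chunks)
        if len(seq) >= min_seq_len:
            yield header, seq


# Made-up records: p1 has 22 residues, p2 has only 3 and is dropped.
example = [">p1", "MKTAYIAKQRQ", "ISFVKSHFSRQ", ">p2", "MKT"]
records = list(read_fasta(example))
print(records)
```

To read a real file, an open file handle can be passed instead of the list, for example `read_fasta(open(path))`.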

How to use:

 python runSPMap.py --generateProfile True --path 'input_folder' --fastaFile_P CYT_pos.fasta --minSeqLen 20 --subSeqLen 5 --fastaFile_O CYT_golden_positive.fasta

Table 1: SPMap tool's arguments

| Argument | Description | Values |
| --- | --- | --- |
| generateProfile | Whether to generate the profile file; set to True if the profile file needs to be generated | True or False |
| path | Path to the FASTA file directory | default: "input_folder" |
| fastaFile_P | FASTA file name used to construct profiles (FASTA file of the training data) | default: "CYT_pos.fasta" |
| minSeqLen | Protein sequences shorter than this value are not considered | default: 20 |
| profileFile | Name of the profile file to be generated | default: "CYT_pos_profile.txt" |
| subSeqLen | Length of the subsequences | default: 5 |
| fastaFile_O | FASTA file name whose features will be extracted | default: "CYT_golden_positive.fasta" |
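For readers who want to wrap or script the tool, the arguments in Table 1 could be mirrored with `argparse` roughly as follows. This is a hypothetical sketch and not the parser in `runSPMap.py`: the `str2bool` helper and the default chosen for `generateProfile` are assumptions, since the table lists no default for that argument.

```python
import argparse


def str2bool(value):
    """Interpret 'True'/'False' style strings, since bool('False') is True."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Build a parser whose names and defaults follow Table 1."""
    parser = argparse.ArgumentParser(description="Hypothetical mirror of Table 1")
    # Assumption: Table 1 gives no default here, so False is chosen arbitrarily.
    parser.add_argument("--generateProfile", type=str2bool, default=False)
    parser.add_argument("--path", default="input_folder")
    parser.add_argument("--fastaFile_P", default="CYT_pos.fasta")
    parser.add_argument("--minSeqLen", type=int, default=20)
    parser.add_argument("--profileFile", default="CYT_pos_profile.txt")
    parser.add_argument("--subSeqLen", type=int, default=5)
    parser.add_argument("--fastaFile_O", default="CYT_golden_positive.fasta")
    return parser


args = build_parser().parse_args(["--generateProfile", "False", "--subSeqLen", "7"])
print(args)
```

Parsing the flag through `str2bool` avoids a common pitfall: with `type=bool`, the string `"False"` would be treated as true because it is non-empty.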

References

Sarac, O. S., Gürsoy-Yüzügüllü, Ö., Cetin-Atalay, R., & Atalay, V. (2008). Subsequence-based feature map for protein function classification. Computational biology and chemistry, 32(2), 122-130.

Our studies that used SPMap:

  1. Özsarı, G., Rifaioglu, A. S., Atakan, A., Doğan, T., Martin, M. J., Çetin Atalay, R., & Atalay, V. (2022). SLPred: a multi-view subcellular localization prediction tool for multi-location human proteins. Bioinformatics.
  2. Rifaioglu, A. S., Doğan, T., Martin, M. J., Cetin-Atalay, R., & Atalay, V. (2019). DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Scientific reports, 9(1), 1-16.
  3. Dalkiran, A., Rifaioglu, A. S., Martin, M. J., Cetin-Atalay, R., Atalay, V., & Doğan, T. (2018). ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC bioinformatics, 19(1), 1-13.
  4. Rifaioglu, A. S., Doğan, T., Saraç, Ö. S., Ersahin, T., Saidi, R., Atalay, M. V., ... & Cetin‐Atalay, R. (2018). Large‐scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins: Structure, Function, and Bioinformatics, 86(2), 135-151.
