nickcafferry/PSSpred


Copyright © Wei MEI, MLMS™—all rights reserved. 🀤

A simple neural network training algorithm for accurate protein secondary structure prediction (PSSpred)! See documentation for more details.

PSSpred (Protein Secondary Structure prediction) is a simple neural network training algorithm for accurate protein secondary structure prediction. It first collects multiple sequence alignments using PSI-BLAST. Amino-acid frequency and log-odds data with Henikoff weights are then used, separately, to train secondary structure predictors based on the Rumelhart error backpropagation method. The final secondary structure prediction is a combination of 7 neural network predictors built from different profile data and parameters. The program is freely downloadable on this page.
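To make the "Henikoff weights" mentioned above more concrete, here is a minimal sketch of the standard position-based sequence weighting scheme (Henikoff and Henikoff, 1994) applied to a toy alignment. It is an illustration only, not PSSpred's own code: the function names, the toy alignment, and the choice to count a gap as an ordinary symbol are all assumptions made for this example.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for equal-length aligned sequences.

    Each column adds 1 / (k * n) to a sequence's weight, where k is the
    number of distinct symbols in the column and n is how many sequences
    share this sequence's symbol there. Weights are normalised to sum to 1.
    Assumption for this sketch: a gap ('-') is counted like any other symbol.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_frequencies(alignment, weights, col):
    """Weighted amino-acid frequencies in one column (gaps skipped)."""
    freqs = {}
    for seq, w in zip(alignment, weights):
        if seq[col] != "-":
            freqs[seq[col]] = freqs.get(seq[col], 0.0) + w
    return freqs


# Hypothetical toy alignment: two identical sequences and one divergent one.
toy_alignment = ["MKVL", "MKVL", "MRIL"]
toy_weights = henikoff_weights(toy_alignment)
print(toy_weights)
print(weighted_frequencies(toy_alignment, toy_weights, 1))
```

The divergent third sequence receives the largest weight, so redundant sequences in the alignment contribute less to the resulting frequency profile.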

We have a community chat at Gitter. Feel free to ask us anything there. We have a very welcoming and helpful community.

Installation

No installation is needed!

Simply fork this project and, in your own repository, edit the file `seq.fasta` (file path: src/PSSpred_v4/seq.fasta), which must be in `FASTA format`. The GitHub workflow then produces the outputs in about 8 minutes, and you can download them via the artifacts link. The output contains two result files: `seq.dat` (the PSSpred prediction in I-TASSER format) and `seq.dat.ss` (the original confidence file). If you want to collect more results, edit the GitHub workflow file PSSPred.yml:

Github-Actions

name: PSSpred

on:
  push:
    branches:
      - master

jobs:
  build_docs_and_deploy:
    runs-on: ubuntu-latest
    name: running PSSpred

    steps:
    - name: Checkout
      uses: actions/checkout@master

    - name: running perl
      run: |
         echo "Initializing the program....................."

         echo "---------------------------------------------"
         cd ../
         mkdir output
         echo "output file already created!"

         echo "---------------------------------------------"
         cd PSSpred/
         cd src/
         mkdir nr
         cd nr/
         wget -O nr.tar.gz https://zhanggroup.org/PSSpred/nr.tar.gz
         tar -xvf nr.tar.gz
         echo "nr.tar.gz already unpacked!"
         echo "Show the path of this file: "
         pwd

         cd ../
         cd PSSpred_v4/
         ./PSSpred.pl seq.fasta
         cp seq.dat /home/runner/work/PSSpred/output/
         cp seq.dat.ss /home/runner/work/PSSpred/output/
         cp blast.out /home/runner/work/PSSpred/output/
         cd /home/runner/work/PSSpred/output/
         ls
         pwd

    - uses: actions/upload-artifact@v2
      with:
        name: output results
        path: /home/runner/work/PSSpred/output/

Not familiar with `FASTA format`? Don't panic, this project is very user-friendly. You can type the following protein sequence:

MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQLELGAMNKAFRKDIAAKYKELGYQG

into `seq_1.txt` and upload it to the directory src/PSSpred_v4/. Wait about 8 minutes (check the Appveyor build status: pending, failing, or passing), then download the output files when the job is done.

Appveyor

image: Ubuntu

install:
    - sh: cd src/
    - sh: mkdir nr
    - sh: cd nr/
    - sh: wget -O nr.tar.gz https://zhanggroup.org/PSSpred/nr.tar.gz
    - sh: tar -xvf nr.tar.gz
    - sh: cd ../PSSpred_v4/
    - sh: ./PSSpred.pl seq_1.txt
    - sh: pwd

# Skip project specific build phase.
build: off

test_script:
    - "ls"
    - "pwd"

artifacts:
  - path: src\PSSpred_v4\seq.dat
    name: seq.dat

  - path: src\PSSpred_v4\seq.dat.ss
    name: seq.dat.ss

  - path: src\PSSpred_v4\protein.fasta
    name: protein.fasta

If you prefer CircleCI to Appveyor, that works too. Just edit `seq_2.txt` (file path: src/PSSpred_v4/seq_2.txt) and commit; for example, you can place a protein sequence in it and generate the secondary structure prediction yourself. If you upload an input file with a different name, change `./PSSpred.pl seq_2.txt` to `./PSSpred.pl XXX.txt` by editing the following `config.yml` file.

CircleCI (file path: .circleci/config.yml)

version: 2

jobs:
  build: # name of your job
    machine: # executor type
      image: ubuntu-1604:201903-01 # # recommended linux image - includes Ubuntu 16.04, docker 18.09.3, docker-compose 1.23.1

    steps:
      - checkout
      - run: |
            cd src/
            mkdir nr
            cd nr/
            wget -O nr.tar.gz https://zhanggroup.org/PSSpred/nr.tar.gz
            tar -zxvf nr.tar.gz
            echo "nr.tar.gz already unpacked!"
            echo "Show the path of this file:"
            pwd
            cd ../
            cd PSSpred_v4/
            ./PSSpred.pl seq_2.txt
            ls

      - store_artifacts:
          path: src/PSSpred_v4/seq.dat
          destination: seq.dat

      - store_artifacts:
          path: src/PSSpred_v4/seq.dat.ss
          destination: seq.dat.ss

      - store_artifacts:
          path: src/PSSpred_v4/protein.fasta
          destination: protein.fasta

Download

To get the git version do

$ git clone https://github.com/nickcafferry/PSSpred.git

Or simply download the repository using the official GitHub CLI

$ gh repo clone nickcafferry/PSSpred

You can also click here to download PSSpred package version 4, and v3, v2, v1. Also, you can download the whole package by clicking source code.zip or source code.tar.gz.

Usage

Simply edit the file `seq.fasta`, `seq_1.txt`, or `seq_2.txt`. Alternatively, upload your own sequence file and change the workflow file (PSSPred.yml, appveyor.yml, or config.yml) correspondingly.

About Protein Sequence

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:

  • lower-case letters are accepted and are mapped into upper-case;
  • a single hyphen or dash can be used to represent a gap of indeterminate length;
  • in amino acid sequences, U and * are acceptable letters (see below).
  • any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

The nucleic acid codes are:

A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                            -  gap of indeterminate length

The accepted amino acid codes are:

A ALA alanine                         P PRO proline
B ASX aspartate or asparagine         Q GLN glutamine
C CYS cysteine                        R ARG arginine
D ASP aspartate                       S SER serine
E GLU glutamate                       T THR threonine
F PHE phenylalanine                   U     selenocysteine
G GLY glycine                         V VAL valine
H HIS histidine                       W TRP tryptophan
I ILE isoleucine                      Y TYR tyrosine
K LYS lysine                          Z GLX glutamate or glutamine
L LEU leucine                         X     any
M MET methionine                      *     translation stop
N ASN asparagine                      -     gap of indeterminate length
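The rules above can be sketched as a small pre-check to run on a sequence before committing it. This is a hedged illustration, not part of PSSpred: the function name `normalise_protein_sequence` is hypothetical, the accepted letters are copied from the amino acid table above, and digits are removed (one of the two options the rules allow) rather than replaced by X.

```python
# Letters taken from the accepted amino acid codes listed above.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def normalise_protein_sequence(raw):
    """Upper-case a raw protein sequence and drop whitespace and digits.

    Raises ValueError on any character that is not an accepted code.
    """
    cleaned = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue  # line breaks, spacing, and position numbers are dropped
        ch = ch.upper()  # lower-case letters are mapped to upper-case
        if ch not in AMINO_ACID_CODES:
            raise ValueError(f"unexpected character {ch!r} in sequence")
        cleaned.append(ch)
    return "".join(cleaned)


print(normalise_protein_sequence("1 mvlsegewql\n11 vlhvwakvea"))
```

A sequence copied from a database page with position numbers and line breaks comes out as one clean upper-case string, while a letter outside the table (for example J or O) is rejected early instead of failing later in the workflow run.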

Notes

  • seq.txt is a FASTA file in the current directory (the only input file). If you are familiar with FASTA format, you can always use it.

  • output files:

    seq.dat
    seq.dat.ss
    
  • PSSpred.pl consists of three steps:

    a. prepare and run PSI-BLAST
    b. prepare mtx, pssm.txt, profw, freqccw, freqccwG
    c. run PSSpred and generate output files
    

Example input file

Input file: seq_1.txt (src/PSSpred_v4/seq_1.txt)

MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLS
EARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAP
HGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRK
VLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQEN
WNTKHSSGVTRELMRELNGG

Snapshot of seq.dat

 1   MET    1    9 # column 1: residue number, in sequence order
 2   GLU    1    9 # column 2: three-letter amino acid code (see `About Protein Sequence` for more details)
 3   SER    1    8 # column 3: secondary structure code: 1<->coil, 2<->helix, 4<->strand
 4   LEU    1    8 # column 4: confidence score: 1-9
 5   VAL    1    8
 6   PRO    1    8
 7   GLY    1    8
 8   PHE    1    7
 9   ASN    1    6
10   GLU    1    3
11   LYS    1    1
12   THR    4    3
13   HIS    4    6
14   VAL    4    8
15   GLN    4    9
16   LEU    4    9
17   SER    4    8
18   LEU    4    6
19   PRO    4    5
20   VAL    4    5
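For downstream scripting, a file like this can be read with a few lines of Python. The sketch below is an assumption-laden illustration rather than an official parser: the function name is hypothetical, and the code-to-state mapping (1 = coil, 2 = helix, 4 = strand) is inferred by comparing this snapshot with the seq.dat.ss snapshot of the same protein, where residues 1-11 are labelled C and residues 12-20 are labelled E.

```python
# Assumed mapping, inferred from the two snapshots of the same protein.
SS_CODES = {1: "C", 2: "H", 4: "E"}


def parse_seq_dat(text):
    """Parse seq.dat text into (index, residue, state, confidence) tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore trailing annotations
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        index, residue, code, confidence = fields[:4]
        records.append((int(index), residue, SS_CODES[int(code)], int(confidence)))
    return records


sample = """\
 1   MET    1    9
10   GLU    1    3
12   THR    4    3
"""
records = parse_seq_dat(sample)
for record in records:
    print(record)
```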

Snapshot of seq.dat.ss

   180   coil  helix  beta   # 180: the sequence length (total number of residues)
                             # coil, helix, beta: the three secondary structure states
 1 M C  0.958  0.024  0.012  # column 1: residue number
 2 E C  0.900  0.043  0.046  # column 2: residue of the input sequence (one-letter code)
 3 S C  0.871  0.072  0.061  # column 3: the most probable secondary structure (C-coil, H-helix, E-strand)
 4 L C  0.872  0.064  0.067  # columns 4-6: probabilities of coil, helix, and beta, in header order
 5 V C  0.891  0.053  0.062
 6 P C  0.902  0.042  0.061
 7 G C  0.886  0.046  0.070
 8 F C  0.808  0.086  0.096
 9 N C  0.715  0.124  0.154
10 E C  0.620  0.124  0.272
11 K C  0.546  0.053  0.416
12 T E  0.364  0.013  0.636
13 H E  0.220  0.007  0.782
14 V E  0.105  0.005  0.902
15 Q E  0.069  0.004  0.936
16 L E  0.076  0.005  0.928
17 S E  0.112  0.005  0.895
18 L E  0.204  0.005  0.800
19 P E  0.230  0.008  0.760
20 V E  0.229  0.012  0.760
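The confidence file can be loaded in a similar way. The following is a minimal sketch under the assumption that the layout is exactly as in the snapshot (a header line with the length and three state names, then one row per residue); the function names are hypothetical and not part of PSSpred.

```python
def parse_seq_dat_ss(text):
    """Parse seq.dat.ss text into (length, labels, rows).

    Each row is (index, residue, state, probs), where probs maps the
    header labels (coil, helix, beta) to their probabilities.
    """
    # Drop trailing '#' annotations and blank lines before splitting fields.
    table = [line.split("#", 1)[0].split() for line in text.splitlines()]
    table = [fields for fields in table if fields]
    header, body = table[0], table[1:]
    length = int(header[0])
    labels = header[1:4]
    rows = []
    for fields in body:
        probs = dict(zip(labels, map(float, fields[3:6])))
        rows.append((int(fields[0]), fields[1], fields[2], probs))
    return length, labels, rows


def most_probable(probs):
    """Return the label with the highest probability."""
    return max(probs, key=probs.get)


sample = """\
   180   coil  helix  beta
 1 M C  0.958  0.024  0.012
12 T E  0.364  0.013  0.636
"""
length, labels, rows = parse_seq_dat_ss(sample)
for index, residue, state, probs in rows:
    print(index, residue, state, most_probable(probs))
```

In these two rows the highest-probability label agrees with the one-letter state in the third column (coil for C, beta for E).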

FASTA format

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

An example sequence in FASTA format is:

>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
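If you only have a raw sequence, a short script can wrap it in FASTA format and read it back. This is a sketch for convenience, not something shipped with PSSpred: the function names, the description `my_protein`, and the 70-character line width are assumptions chosen to respect the "shorter than 80 characters" recommendation above.

```python
def to_fasta(description, sequence, width=70):
    """Format one sequence as FASTA text with lines of at most `width` letters."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + description] + chunks) + "\n"


def read_fasta(text):
    """Read FASTA text into a list of (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# First 80 residues of the example input shown earlier in this README.
sequence = (
    "MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLS"
    "EARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAP"
)
fasta_text = to_fasta("my_protein", sequence)
print(fasta_text)
```

The printed text can be pasted into the input file, and `read_fasta` recovers the original description and sequence from it.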

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to the MIT License (MIT LIC), declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit the Code of Conduct.

Reference

Renxiang Yan, Dong Xu, Jianyi Yang, Sara Walker, Yang Zhang. A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Scientific Reports, 3: 2619 (2013).
