CarlosHorro edited this page Apr 17, 2019 · 4 revisions

The input can be:

Genetic Variants

SNP rsId list:

The file contains one rsID, as defined in dbSNP[1], on each row. The list must be ordered by chromosome and base pair (bp) and must not contain duplicates. All rsIDs must exist in the human assembly GRCh37.p13.

Command line argument: match-rsids

Example:

rs187174427
rs182321900
rs566371895
rs375798137
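
The format rules above can be checked before running a search. The following is a minimal sketch and not part of PathwayMatcher; the function name `check_rsid_list` is hypothetical. It only verifies identifier syntax and uniqueness. Checking the chromosome and base pair order, or presence in GRCh37.p13, would require a dbSNP lookup and is not attempted here.

```python
import re

# An rsID is the prefix "rs" followed by digits.
RSID_PATTERN = re.compile(r"rs\d+")


def check_rsid_list(lines):
    """Return (line_number, message) tuples for problems in an rsID list."""
    problems = []
    seen = set()
    for number, raw in enumerate(lines, start=1):
        rsid = raw.strip()
        if not rsid:
            continue  # ignore blank lines
        if not RSID_PATTERN.fullmatch(rsid):
            problems.append((number, f"not an rsID: {rsid!r}"))
        elif rsid in seen:
            problems.append((number, f"duplicate: {rsid}"))
        seen.add(rsid)
    return problems
```

An empty result means that no syntax or duplicate problems were found in the list.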

Chromosome and base pair

Genetic variants can also be represented using the chromosome and the base pair numbers. The input should be sorted by chromosome number and then by base pair.

Command line argument: match-chrbp

Example:

1 210827406
2 14370
2 17330
10 1110696
18 1230237
20 1234567
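
An unsorted list can be put in the required order with a few lines of code. This is a sketch under assumptions: the helper names `chrbp_key` and `sort_chrbp` are hypothetical, and placing non-numeric chromosome names (such as X or Y) after the numeric ones is an assumption, since the text above only mentions chromosome numbers.

```python
def chrbp_key(line):
    """Sort key for a 'chromosome base_pair' line."""
    chrom, bp = line.split()
    if chrom.isdigit():
        # Numeric chromosomes are compared as numbers, so 2 comes before 10.
        return (0, int(chrom), "", int(bp))
    # Assumption: non-numeric names (e.g. X, Y) follow the numeric ones.
    return (1, 0, chrom, int(bp))


def sort_chrbp(lines):
    """Return the non-blank lines sorted by chromosome, then by base pair."""
    return sorted((line.strip() for line in lines if line.strip()), key=chrbp_key)
```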

Variant Call Format Specification (VCF)

The input follows the Variant Call Format (VCF) specification[2], version 4.3. The data lines may also contain only the first four columns: CHROM, POS, ID, REF.

A missing value is represented by a ".". The values for the columns CHROM, POS and REF are mandatory; only the ID column can have missing values. The data records do not need to be ordered by chromosome and base pair. The search only takes into account the Single Nucleotide Polymorphisms present in the human assembly GRCh37.p13.

Command line argument: match-vcf

Example:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 210827406 NA T
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
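
To illustrate which parts of a VCF file matter for this input type, the sketch below extracts the first four columns from the data lines. It is not PathwayMatcher's own parser; the function name `read_vcf_variants` is hypothetical. It splits on any whitespace, which works for tab-delimited files because the first four fields contain no spaces.

```python
def read_vcf_variants(lines):
    """Yield (chrom, pos, id_or_None, ref) for each VCF data line."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information (##) and the header (#CHROM)
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"expected at least CHROM, POS, ID, REF: {line!r}")
        chrom, pos, variant_id, ref = fields[:4]
        if "." in (chrom, pos, ref):
            raise ValueError(f"CHROM, POS and REF are mandatory: {line!r}")
        # Only the ID column may hold the missing-value marker ".".
        yield chrom, int(pos), (None if variant_id == "." else variant_id), ref
```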

Genes

File with one gene name on each line. Gene names follow the HUGO gene nomenclature[3].

Command line argument: match-genes

Example:

CFTR
TGFB1
FCGR2A
DCTN4
SCNN1B
SCNN1G
SCNN1A
TNFRSF1A
CLCA4
STX1A
CXCL8

Proteins

UniProt Accession list

File with one UniProt accession[5] on each line.

Command line argument: match-uniprot

Example:

P00519
P31749
P11274
P22681
P22681
P16220
P46109
P27361
Q9UQC2
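
A list can be screened for malformed accessions before it is used as input. The sketch below relies on the accession pattern published in the UniProt help pages; the function name `invalid_accessions` is hypothetical and the check is not part of PathwayMatcher. It tests the syntax only, not whether the accession exists in UniProt.

```python
import re

# Accession number pattern as documented by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def invalid_accessions(lines):
    """Return the non-blank entries that do not look like UniProt accessions."""
    entries = (line.strip() for line in lines)
    return [e for e in entries if e and not UNIPROT_ACCESSION.fullmatch(e)]
```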

Ensembl identifier list

File with one Ensembl identifier[6] on each line.

Command line argument: match-ensembl

Example:

ENSG00000101076
ENSG00000106633
ENSP00000223366
ENSP00000312987
ENSP00000315180
ENSP00000379142
ENSP00000384247
ENSP00000396216

Proteoforms

A proteoform defines a specific state of a protein. It is composed of the protein's UniProt accession, its isoform and a set of post-translational modifications (PTMs). This narrows down the search for reactions and pathways to those where the protein participates in that proteoform state.

Since there is no universally accepted standard for representing proteoforms, we use a simple custom representation, shown in the following image. It comprises a section for the accession with an isoform and a section for the PTM set. Only the protein accession is mandatory; the isoform and the PTMs are optional. When no isoform is specified, the default UniProt[5] sequence, also known as isoform 1, is used.

Image of the proteoform simple format explanation.

Each PTM is specified using a modification identifier and a site, separated by ':' (colon). For example: "00046:133". The identifier is a five-digit id from the PSI-MOD Protein Modification Ontology[7]; in the example, 00046 is O-phospho-L-serine. The site is a number specifying the position of the modified amino acid in the canonical protein sequence as defined by UniProt.

The input file contains one line for each proteoform.

To map the proteoforms to reactions and pathways, it is necessary to decide whether the proteoforms in the input are equivalent to the proteoforms annotated in the Reactome database. You can read more about these criteria here.

Command line argument: match-proteoforms

Each line must follow any of these patterns:

  • A single protein with no modifications
P00519
  • A protein with one PTM. The two fields are separated by a ';'
P16220;00046:133
  • A protein and a set of PTMs. The protein and the PTM set are separated by ';' and the PTMs by ','. The PTMs can appear in any order
P62753;00000:235,00000:236,00000:240

In case the PTM type is not known, the modification id used is "00000". For example: "00000:245".


Example:

P00519;00046:245,00048:412
P31749;00047:473,00047:308
P11274;00187:177
P22681;00098:774
P22681
P16220;00046:133
P46109;01192:207
P27361;00047:202,00048:204
Q9UQC2;00000:452
Q15759;00048:182,00047:180
O15530;00048:241
P62753;00048:235,00049:236,00126:240
P12931;00048:419
P40763;00046:705,00046:727
P42229;00048:694
  • Note: It is common to write the identifiers of the post-translational modifications with the prefix "MOD:" before the five digits of the ontology term. For convenience, the identifier can also be written without the prefix, using only the five digits. PathwayMatcher supports both notations:
P00519;MOD:00046:245,MOD:00048:412
P31749;MOD:00047:473,MOD:00047:308
P11274;MOD:00187:177
P22681;MOD:00098:774
P22681
P16220;MOD:00046:133
P46109;MOD:01192:207
P27361;MOD:00047:202,MOD:00048:204
Q9UQC2;MOD:00000:452
Q15759;MOD:00048:182,MOD:00047:180
O15530;MOD:00048:241
P62753;MOD:00048:235,MOD:00049:236,MOD:00126:240
P12931;MOD:00048:419
P40763;MOD:00046:705,MOD:00046:727
P42229;MOD:00048:694
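
The line format described in this section can be parsed as sketched below. This is an illustration, not PathwayMatcher's own parser; the function name `parse_proteoform` is hypothetical. It accepts PTM identifiers with or without the "MOD:" prefix, sorts the PTMs so that their order in the input does not matter, and keeps the first field as written, without interpreting any isoform it may contain.

```python
def parse_proteoform(line):
    """Split a proteoform line into (protein, sorted list of (mod_id, site))."""
    protein, _, ptm_part = line.strip().partition(";")
    ptms = []
    for ptm in filter(None, ptm_part.split(",")):
        if ptm.startswith("MOD:"):
            ptm = ptm[len("MOD:"):]  # the prefix is optional
        mod_id, site = ptm.split(":")
        if len(mod_id) != 5 or not mod_id.isdigit():
            raise ValueError(f"modification id must have five digits: {mod_id!r}")
        ptms.append((mod_id, int(site)))
    return protein, sorted(ptms)
```

Because the PTMs are sorted, two lines that list the same modifications in a different order, or with a different notation, yield the same result.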

Note: One goal of PathwayMatcher is to serve as a proof-of-concept tool for pathway search and analysis with proteoforms. The selected proteoform format is therefore provisional, and other proteoform formats may be implemented on request.

Peptides

Simple list

File with one peptide sequence per line.

Command line argument: match-peptides. It is also necessary to set a -f argument for the fasta file with the proteins that serve as search space for the peptides.

Note: for a better handling of the protein inference problem[4], it is recommended to build proteins from the given peptide list.

Example:

VGENHLVKVA
MSDVAIVKEG
GSPGKARPGT
HHLSPHPPGT
HHLSPHPPGT
QNKTLIEELKALKDLYCHKSD
MSSARFDSSDRSAWYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVL
LTEYVATRWYRAPEIMLNSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINMKARNYLQSLPSKTKVAWAKLFPKSD
LPKPSRHNTEFRDSTYDLPRSLASHGHTKG
EALAHAYFSQ
SQELRPEAKN
MKLNISFPATGCQKLIEVDD
MGSNKSKPKDASQRRRSLEPAENVHGAGGG
WDQVAEVLSWQFSSTTKRGLSIEQLTTLAEKLLGPGVNYSGCQITWAKFC
CVMEYHQATGTLSAHFRNMSLKRIKRADRRGAESVTEEKF
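
The role of the fasta file as search space can be illustrated with a naive substring search. The sketch below is only an assumption about the general idea, not a description of how PathwayMatcher matches peptides internally; the function names `read_fasta` and `proteins_containing` are hypothetical.

```python
def read_fasta(lines):
    """Return {header: sequence} from the lines of a FASTA file."""
    proteins = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            header = line[1:]
            proteins[header] = ""
        elif line and header is not None:
            proteins[header] += line  # sequences may span several lines
    return proteins


def proteins_containing(peptide, proteins):
    """Return the headers of all proteins whose sequence contains the peptide."""
    return [name for name, sequence in proteins.items() if peptide in sequence]
```

A peptide that occurs in several proteins returns several headers, which is the situation the protein inference problem[4] refers to.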

Peptide List with PTM types and sites

Each line of the file corresponds to a single peptide with post-translational modifications. It has two fields: the peptide sequence and a set of PTMs. Each PTM consists of a MOD type and a site number. The site of the modification is given in peptide sequence coordinates, starting at 1.

Command line argument: match-modified-peptides

Example:

KDGATMKTFC  
KDGATMKTFC;MOD:00048:7
QFSYSASGTA;MOD:00048:2
LTEYVATRWY;MOD:031878:3,
QCEGEEDTEYMTPSSRPLRPLDTSQSSRACDCDQQIDSCTYEAMYNIQSQAPSITESSTFGEGNLAAAHANTGPEESENEDDGYDVPKPPVPAVLARRTL;MOD:00046:9,MOD:00047:40,MOD:00047:83

The pathway search finds the candidate proteins containing each peptide. If several peptides map to the same protein, all the PTM sites of those peptides are grouped. The results then show the proteins in the substates where they contain at least one of the PTM sites.
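
Since the PTM sites in the input are relative to the peptide, they have to be translated to protein coordinates once the peptide is located in a protein. The sketch below shows that arithmetic under simplifying assumptions: the function name `protein_sites` is hypothetical, and only the first occurrence of the peptide in the protein is considered.

```python
def protein_sites(peptide, peptide_sites, protein_sequence):
    """Convert 1-based peptide sites to 1-based protein sites.

    Returns an empty list when the peptide is not found in the protein.
    """
    start = protein_sequence.find(peptide)  # 0-based index of the first occurrence
    if start < 0:
        return []
    # A 0-based offset plus a 1-based peptide site gives a 1-based protein site.
    return [start + site for site in peptide_sites]
```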

Note: for better handling of the protein inference problem[4], it is recommended to build proteoforms from the modified peptide list.

References

[1] dbSNP
[2] VCF v4.3
[3] Gray et al., Genenames.org: the HGNC resources in 2015. NAR (2015)
[4] Nesvizhskii and Aebersold, Interpretation of Shotgun Proteomic Data The Protein Inference Problem. MCP (2005)
[5] The UniProt Consortium, UniProt: the universal protein knowledgebase. NAR (2017)
[6] Ensembl
[7] Montecchi-Palazzi et al., The PSI-MOD community standard for representation of protein modification data. Nature Biotechnology (2008)