Tutorial (Exome scale)

Collin Tokheim edited this page Aug 3, 2024 · 3 revisions

HotMAPS Pipeline

This tutorial shows you how to set up protein structures and run HotMAPS on mutations that were previously mapped to protein structures. You do not need MySQL for this tutorial. To run your own mutations through HotMAPS, as covered in a [subsequent tutorial](Advance tutorial), you will need to load the MuPIT MySQL database (see [here](MySQL database)).

Initial Setup

First, download the Protein Data Bank (PDB) structures from ftp://ftp.wwpdb.org/pub/pdb/ and the theoretical protein structure models from https://salilab.org/modbase-download/projects/genomes/H_sapiens/2013/. You will need both the Ensembl and RefSeq theoretical protein structure models (H_sapiens_2013.tar.xz and ModBase_H_sapiens_2013_refseq.tar.xz, respectively). We advise reading the download instructions for PDB structures, available here. One command to download the structures is below:

$ rsync -rlpt -v -z --delete --port=33444 rsync.rcsb.org::ftp_data/ ./my_pdb_data_dir

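This creates the directory ./my_pdb_data_dir containing all the needed PDB structures. Because of the --delete flag, rsync removes any files in that directory that are not on the server, so use a directory dedicated to this mirror. The download is large, so expect it to take a while and make sure enough disk space is available.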
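The theoretical models are distributed as xz-compressed tar archives and must be extracted before config.txt can point at them. The following is a minimal sketch, assuming both archives sit in the current directory; the target directory `./my_modbase_dir` and the helper name `extract_models` are hypothetical and not part of HotMAPS.

```shell
#!/usr/bin/env bash
# Sketch: extract the two ModBase archives named in this tutorial.
# MODBASE_DIR is a hypothetical location; use the path you plan to put in config.txt.
MODBASE_DIR="${MODBASE_DIR:-./my_modbase_dir}"

extract_models() {
    # $1 = .tar.xz archive, $2 = destination directory
    local archive="$1" dest="$2"
    if [ ! -f "$archive" ]; then
        echo "missing archive: $archive" >&2
        return 1
    fi
    # -J selects xz decompression; -C extracts into the destination directory
    mkdir -p "$dest" && tar -xJf "$archive" -C "$dest"
}

for archive in H_sapiens_2013.tar.xz ModBase_H_sapiens_2013_refseq.tar.xz; do
    extract_models "$archive" "$MODBASE_DIR" || echo "skipped $archive" >&2
done
```

The directory layout inside each archive is not described here, so inspect the extracted contents before filling in the sub-directory paths in config.txt.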

Next, extract the theoretical models from their compressed format and update config.txt to point to the directories where you saved the structure files. This involves changing the base directories modbase_dir and pdb_dir, along with the matching sub-directory paths refseq_homology, ensembl_homology, biological_assembly and non_biological_assembly, to reflect your custom location for the protein structures. Additionally, download the mutations file, the protein structure annotation file, and the annotations for the CRAVAT reference transcript available here. Place all three files in a sub-directory called "data". Assuming you are already in the HotMAPS directory:

$ mkdir -p data
$ cd data
$ wget "https://www.dropbox.com/scl/fi/jk1repun20wachbps2zii/mutations.txt.gz?rlkey=udp4k9f9siiuykqauj1m3b9xr&st=sof5rmoz&dl=1" -O mutations.txt.gz
$ gunzip mutations.txt.gz
$ wget "https://www.dropbox.com/scl/fi/0lklxk9h8fkzzwrz9ge1x/pdb_info.txt.gz?rlkey=uadoiylkv1pcuaed267q751eu&st=n1ds6yhq&dl=1" -O pdb_info.txt.gz
$ gunzip pdb_info.txt.gz
$ wget "https://www.dropbox.com/scl/fi/r2k3q0p2e4hmu2t6vqtzk/mupit_annotations.tar.gz?rlkey=6hj49sp4wlw97o1qxgtf2te9h&st=shsuxqvr&dl=1" -O mupit_annotations.tar.gz
$ tar xvzf mupit_annotations.tar.gz
$ cd ..
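A failed or redirected download can leave an empty file or an HTML page in place of the data. The sketch below is one way to sanity-check the two decompressed files before continuing; the helper name `check_data_files` is hypothetical, and the contents of the mupit_annotations archive are not checked because they are not listed here.

```shell
#!/usr/bin/env bash
# Sketch: confirm the decompressed downloads exist, are non-empty,
# and do not look like an HTML error page.
check_data_files() {
    # $1 = directory holding the downloads (defaults to "data")
    local dir="${1:-data}" status=0 f
    for f in mutations.txt pdb_info.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        elif head -c 512 "$dir/$f" | grep -qi '<html'; then
            echo "looks like an HTML page, not data: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

check_data_files data || echo "re-check the downloads in data/" >&2
```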

Once config.txt points to the location of the downloaded protein structure files, one additional step is needed to annotate those protein structures.

$ make annotateStructures
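If the annotation step cannot find the structures, a quick check is to list the path settings currently in config.txt. This sketch only greps for the setting names mentioned above and makes no assumption about the file's exact syntax; the helper name `show_structure_paths` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the config.txt lines (with line numbers) that set
# the structure directories named in this tutorial.
show_structure_paths() {
    # $1 = path to the config file (defaults to config.txt)
    local config="${1:-config.txt}"
    if [ ! -f "$config" ]; then
        echo "no such file: $config" >&2
        return 1
    fi
    # Match a setting name at the start of a line, followed by any separator character
    grep -nE '^[[:space:]]*(modbase_dir|pdb_dir|refseq_homology|ensembl_homology|biological_assembly|non_biological_assembly)[^[:alnum:]_]' "$config"
}

show_structure_paths config.txt || echo "could not list structure paths" >&2
```

Each printed path should exist on disk and contain the structures downloaded during the initial setup.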

Running 3D HotMAPS

To run the code in parallel using Sun Grid Engine (SGE), execute the following make command:

$ make OUTPUT_DIR=myoutput_dir runParallelHotspot

To run the code normally (no parallelization), execute:

$ make OUTPUT_DIR=myoutput_dir runNormalHotspot

Here, myoutput_dir is the output directory (default: output/all_pdb_run).

If you ran the normal version rather than the parallel one, skip this next step, because the merged file has already been produced. To merge the output from the parallel runs, use the following make command:

$ make OUTPUT_DIR=myoutput_dir mergeHotspotFiles
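The choice between the two run modes, and whether a merge is needed, can be summarized in a small helper. This is a sketch only: the function name `hotspot_commands` is hypothetical, and detecting SGE by looking for `qsub` on the PATH is an assumption. It prints the make commands rather than running them, since whether the parallel target waits for the submitted jobs to finish is not stated here.

```shell
#!/usr/bin/env bash
# Sketch: print the make commands for the chosen run mode.
hotspot_commands() {
    # $1 = output directory, $2 = "parallel" or "normal"
    local out="$1" mode="$2"
    if [ "$mode" = "parallel" ]; then
        echo "make OUTPUT_DIR=$out runParallelHotspot"
        # Only the parallel run needs an explicit merge afterwards
        echo "make OUTPUT_DIR=$out mergeHotspotFiles"
    else
        echo "make OUTPUT_DIR=$out runNormalHotspot"
    fi
}

# Assumption: SGE is available if its qsub command is on the PATH
if command -v qsub >/dev/null 2>&1; then
    mode=parallel
else
    mode=normal
fi

hotspot_commands "${OUTPUT_DIR:-output/all_pdb_run}" "$mode"
```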

Next, the p-values need to be adjusted for multiple hypothesis testing. This step needs the CRAVAT reference transcript files noted in the Initial Setup section, which were saved in the "data" sub-directory (pass their location as the MUPIT_ANNOTATION_DIR parameter in the make command).

$ make multipleTestCorrect OUTPUT_DIR=myoutput_dir MUPIT_ANNOTATION_DIR=annotation_dir Q_VALUE=myqvalue

myqvalue is the q-value threshold for the False Discovery Rate (FDR) correction (default: .01). The next step is to group significant residues into regions. If you are interested in regions on the actual PDB protein structure, use the following command:

$ make findHotregionStruct OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir

As before, myoutput_dir is the output directory and myqvalue is the q-value (default: .01). Similarly, regions can be constructed for each gene using the reference transcript selected by CRAVAT for each mutation:

$ make findHotregionGene OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
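The three post-processing targets take the same parameters, so it can help to set them once and reuse them. The sketch below does that and validates the q-value first. The names `RUNNER`, `valid_q_value` and `post_process` are hypothetical, and `annotation_dir` is the same placeholder used in the commands above. By default the script only prints the commands; set `RUNNER=make` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: run (or print) the multiple-testing and hot-region steps
# with one shared set of parameters.
OUTPUT_DIR="${OUTPUT_DIR:-output/all_pdb_run}"
MUPIT_ANNOTATION_DIR="${MUPIT_ANNOTATION_DIR:-annotation_dir}"  # placeholder path
Q_VALUE="${Q_VALUE:-.01}"
RUNNER="${RUNNER:-echo make}"  # dry run by default; RUNNER=make executes

valid_q_value() {
    # Succeeds only for a plain decimal number in the interval (0, 1]
    awk -v q="$1" 'BEGIN { exit !((q ~ /^[0-9]*\.?[0-9]+$/) && (q + 0 > 0) && (q + 0 <= 1)) }'
}

post_process() {
    local target
    if ! valid_q_value "$Q_VALUE"; then
        echo "Q_VALUE must be in (0, 1]: $Q_VALUE" >&2
        return 1
    fi
    for target in multipleTestCorrect findHotregionStruct findHotregionGene; do
        # $RUNNER is intentionally unquoted so "echo make" splits into two words
        $RUNNER "$target" OUTPUT_DIR="$OUTPUT_DIR" \
            MUPIT_ANNOTATION_DIR="$MUPIT_ANNOTATION_DIR" Q_VALUE="$Q_VALUE" || return 1
    done
}

post_process
```

Using the same Q_VALUE for all three targets keeps the hot regions consistent with the significance threshold applied during multiple-testing correction.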