Demystifying

This repository contains code for analyzing molecular simulations data, mainly using machine learning methods.

Dependencies

Python 2.7
Scikit-learn with its standard dependencies (numpy, scipy etc.)
MDTraj (only for preprocessing)
biopandas (only for postprocessing)

We are working on upgrading the project to Python 3 and on enabling installation of dependencies via package managers such as conda and pip.

Using the code

As a standalone library

Include the modules library in your Python path or import it directly in your Python project. The example code below shows typical usage.
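As a minimal sketch of the first option, the snippet below adds a checkout of this repository to the Python path at runtime. The location `/path/to/repository` is a placeholder, not a real path; replace it with wherever you cloned the code.

```python
import os
import sys

# Placeholder location of the cloned repository; replace with your own path.
REPO_DIR = os.path.abspath("/path/to/repository")

# Make the repository root importable so that `from modules import ...` resolves.
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)
```

Setting the PYTHONPATH environment variable to the same directory before starting Python has the same effect.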

Example code

from modules import feature_extraction as fe, visualization
# ...
# Load your data samples (input features) and labels (cluster indices) here
# ...

# Create a feature extractor. All extractors implement the same methods, but in this demo we use a Random Forest 
extractor = fe.RandomForestFeatureExtractor(samples, labels, classifier_kwargs={'n_estimators': 1000})
extractor.extract_features()

# Postprocess: average the importance per feature into importance per residue
# and highlight important residues on a protein structure
postprocessor = extractor.postprocessing(working_dir="output/", pdb_file="input/protein.pdb")
postprocessor.average()
postprocessor.evaluate_performance()
postprocessor.persist()

# Visualize the importance per residue with standard functionality
visualization.visualize([[postprocessor]],
                        show_importance=True,
                        outfile="output/importance_per_residue.png")
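The example above assumes that `samples` and `labels` are already loaded. As a sketch of what such input might look like, the snippet below builds synthetic data with NumPy. The layout (one row per simulation frame, one column per feature, one integer cluster index per frame) is an assumption based on the description above, not a documented requirement of the library, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the data is reproducible

n_frames, n_features = 200, 10

# Random values in [0, 1) standing in for e.g. inter-residue distances
samples = rng.rand(n_frames, n_features)

# Two artificial states: the first half of the frames is labeled 0, the rest 1
labels = np.repeat([0, 1], n_frames // 2)

# Make feature 3 informative by shifting it in state 1
samples[labels == 1, 3] += 1.0
```

With real simulations, the features would instead come from a preprocessing step and the labels from clustering or prior knowledge of the states.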

Analyzing biological systems

The biological systems discussed in the paper, the beta2 adrenergic receptor, the voltage sensor domain (VSD) and calmodulin (CaM), each come with an independent run file (run_beta2.py, run_VSD.py and run_CaM.py). These can be used as templates for other systems.

Input data can be downloaded here.

Benchmarking with a toy model

Run run_benchmarks.py to reproduce the benchmarks discussed in the paper. This can be useful for testing different hyperparameter setups as well as for improving one's understanding of how the different methods work.

run_toy_model.py contains a demo of how to launch single instances of the toy model. This script is currently not maintained.

Citing this work

Please cite the code (DOI to come) and/or our paper (DOI to come).

Support

Please open an issue or contact oliver.fleetwood (at) gmail.com if you have any questions or comments about the code.

Checklist for interpreting molecular simulations with machine learning

1. Identify the problem to investigate.
2. Decide if you should use supervised or unsupervised machine learning (or both).
   a. The best choice depends on what data is available and the problem at hand.
   b. If you choose unsupervised learning, consider also clustering the simulation frames to label them and then performing supervised learning.
3. Select a set of features and scale them.
   a. For many processes, protein internal coordinates are adequate. To reduce the number of features, consider filtering distances with a cutoff.
   b. Consider other features that can be expressed as a function of internal coordinates and that you suspect to be important for the process of interest (dihedral angles, cavity or pore hydration, ion or ligand binding, etc.).
4. Choose a set of ML methods to derive feature importance.
   a. To quickly get a clear importance profile with little noise, consider random forests (RF) or the Kullback-Leibler divergence (KL) for supervised learning. RF may perform better for noisy data.
   b. For unsupervised learning, consider principal component analysis (PCA), which is relatively robust when conducted on internal coordinates.
   c. To find all important features, including those requiring nonlinear transformations of input features, also use neural-network-based approaches such as a multilayer perceptron (MLP). This may come at the cost of more peaks in the importance distribution.
   d. Decide if you seek the average importance across the entire dataset (all methods), the importance per state (KL, a set of binary RF classifiers or MLP), or the importance per single configuration (MLP, restricted Boltzmann machine (RBM), autoencoder (AE)).
   e. Choose a set of hyperparameters that gives a reasonable trade-off between performance and model prediction accuracy.
5. Ensure that the selected methods and hyperparameter choice perform well under cross-validation.
6. Average the importance per feature over many iterations.
7. Check that the distribution of importance has distinguishable peaks.
8. To select low-dimensional, interpretable CVs for plotting and enhanced sampling, inspect the top-ranked features.
9. For a holistic view, average the importance per residue or atom and visualize the projection on the 3D system.
10. If necessary, iterate over steps 3-9 with different features, ML methods and hyperparameters.
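Several items of the checklist (feature scaling, cross-validation, averaging importance over iterations and averaging per residue) can be illustrated with a short sketch. It uses scikit-learn directly rather than this repository's extractors, runs on synthetic data, and relies on a made-up `feature_to_residues` map in which every feature is a distance between two residues. Treat it as an illustration under those assumptions, not as the implementation used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data: 300 frames, 6 features, 2 states
n_frames, n_features = 300, 6
labels = np.repeat([0, 1], n_frames // 2)
samples = rng.rand(n_frames, n_features)
samples[labels == 1, 2] += 1.0  # only feature 2 separates the states

# Scale every feature to the range [0, 1]
scaled = MinMaxScaler().fit_transform(samples)

# Check that the classifier generalizes under 5-fold cross-validation
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    scaled, labels, cv=5)

# Train several forests with different seeds and average the
# per-feature importance over the iterations
n_iterations = 5
importances = np.zeros((n_iterations, n_features))
for i in range(n_iterations):
    forest = RandomForestClassifier(n_estimators=50, random_state=i)
    forest.fit(scaled, labels)
    importances[i] = forest.feature_importances_
mean_importance = importances.mean(axis=0)

# Average the importance per residue. The map below is hypothetical:
# feature k is treated as a distance between the two listed residues.
feature_to_residues = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3), (1, 2)]
n_residues = 4
residue_sum = np.zeros(n_residues)
residue_count = np.zeros(n_residues)
for feature_index, residues in enumerate(feature_to_residues):
    for residue in residues:
        residue_sum[residue] += mean_importance[feature_index]
        residue_count[residue] += 1
residue_importance = residue_sum / residue_count
```

In this toy setup the importance profile should peak at feature 2, and the two residues that define it should rank highest after averaging. With real data, a flat profile or poor cross-validation scores would suggest revisiting the features or hyperparameters.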

