zwang-bioinformatics/scHiGex


scHiGex: predicting single-cell gene expression based on single-cell Hi-C data

Architecture

This repository includes code and a pre-trained model of scHiGex for single-cell gene expression prediction.

Instructions

Python Environment

The code was tested on Python 3.10.4. The main conda environment is defined in env/environment.yml; a separate environment for dnabert2 is defined in env/environment_dnabert2.yml.

Dataset

The dataset used for training is from the HiRES experiment. The dataset is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223917.

Files to be placed in the assets directory are as follows:

Training

To train the scHiGex model from scratch for mm10:

  • Download and place the required files in the assets directory.
  • Run the Python scripts in the scripts directory in the order given by the numeric prefixes of their file names. These scripts generate the data files required for training the model.
  • Run ./train.sh to train the model.
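The "run in numeric order" step above can be sketched as a small helper. This is only an illustration, not part of the repository: the function names are hypothetical, and it assumes that each script is meant to be launched from inside the scripts directory. It sorts by the parsed numeric prefix rather than alphabetically, so that a prefix such as 10 would come after 2.

```python
import re
import subprocess
import sys
from pathlib import Path


def numeric_prefix(name: str) -> tuple[int, ...]:
    """Return the leading dotted number of a file name as a tuple of ints.

    "1.2_generate_metadata.py" -> (1, 2); names without a numeric prefix -> ().
    """
    match = re.match(r"\d+(?:\.\d+)*", name)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def ordered_scripts(names: list[str]) -> list[str]:
    """Keep only numbered .py files and sort them by their numeric prefix."""
    numbered = [n for n in names if n.endswith(".py") and numeric_prefix(n)]
    return sorted(numbered, key=numeric_prefix)


def run_all(scripts_dir: str = "scripts") -> None:
    """Run each numbered script in order, stopping at the first failure.

    Assumption: the scripts resolve their paths relative to scripts_dir,
    so each one is started with that directory as its working directory.
    """
    names = [p.name for p in Path(scripts_dir).iterdir()]
    for name in ordered_scripts(names):
        print(f"Running {name}")
        subprocess.run([sys.executable, name], cwd=scripts_dir, check=True)
```

Calling `run_all()` from the repository root would then execute the numbered scripts one after another; `check=True` makes a failing script abort the sequence instead of letting later steps run on incomplete data.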

Prediction

To predict gene expression levels using the trained model for mm10:

  • Download and place the required files in the assets directory (apart from the .pairs files, since no training is involved).

  • Run the following Python scripts inside the scripts directory (the goal is to create the chromosome definitions inside the scripts directory):

    • 1.1_run_gtfparse.py
    • 1.2_generate_metadata.py
  • Place the .pairs files in the predict directory:

    • Put the Hi-C .pairs files for which you want to predict gene expression inside the directory predict/pairs/, grouped into one subdirectory per cell type. At least 20 .pairs files per cell type are required to create the meta-cell.
    • Example:
      • predict/pairs/
        • cell_type_1/
          • cell_type_1_1.pairs
          • cell_type_1_2.pairs
          • ...
        • cell_type_2/
          • cell_type_2_1.pairs
          • cell_type_2_2.pairs
          • ...
        • ...
  • Run python 1.data_prep.py to generate the required data files for prediction.

  • Run python 2.predict.py to predict gene expression levels.

  • The predicted gene expression levels are saved in the predict directory as predictions.csv.
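The directory layout and the 20-file minimum described above can be checked before running 1.data_prep.py. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes that input files are uncompressed and have names ending in .pairs (a file such as x.pairs.gz would not be counted).

```python
from pathlib import Path

MIN_PAIRS_PER_CELL_TYPE = 20  # minimum stated in the instructions above


def count_pairs_files(pairs_root: str = "predict/pairs") -> dict[str, int]:
    """Map each cell-type subdirectory to the number of .pairs files it holds."""
    root = Path(pairs_root)
    return {
        cell_dir.name: sum(
            1 for f in cell_dir.iterdir() if f.is_file() and f.suffix == ".pairs"
        )
        for cell_dir in sorted(root.iterdir())
        if cell_dir.is_dir()
    }


def too_small(
    counts: dict[str, int], minimum: int = MIN_PAIRS_PER_CELL_TYPE
) -> list[str]:
    """Return the cell types that have fewer than `minimum` .pairs files."""
    return [name for name, n in counts.items() if n < minimum]
```

For example, `too_small(count_pairs_files())` run from the repository root returns the names of the cell-type directories that still need more .pairs files; an empty list means every cell type meets the minimum.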

If you want to use your own model trained with the scHiGex architecture, you need to point to the right model file and node_embeddings.


The scripts were designed to be compatible with the HiRES data used in the experiment. The code can be modified to suit the user's own data or purpose.

Citation

Please cite the following paper:

@article{scHiGex,
  title={scHiGex: predicting single-cell gene expression based on single-cell Hi-C data},
  author={Bishal Shrestha and Andrew Jordan Siciliano and Hao Zhu and Tong Liu and Zheng Wang},
  journal={},
  year={2024}
}