Running SHEPHERD on your own data

These instructions assume that you are using the SHEPHERD models that we provide.

Steps

  1. Create conda environment
  2. Download data from Harvard Dataverse
  3. Set up configuration file
  4. Download model checkpoint
  5. Preprocess your own patient data
    • (Optional for patients-like-me identification) If you would like to compare your patient cohort to an external cohort of patients (e.g., simulated patients), combine the jsonlines files of your own patient cohort and the external patient cohort.
  6. Update MY_TEST_DATA in project_config.py
  7. Generate the shortest-path calculations for your data using the flag --only_test_data
  8. Update MY_SPL_DATA and MY_SPL_INDEX_DATA in project_config.py
  9. Run predict.py to generate predictions for your patients
    • Make sure that the run type and checkpoint are aligned (e.g., use --run_type causal_gene_discovery with --best_ckpt checkpoints/causal_gene_discovery)
    • Make sure that the patient data flag is set to your own dataset (i.e., --patient_data my_data)
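
The optional part of step 5 (combining the jsonlines files of your own cohort and an external cohort) can be sketched as below. This is a minimal illustration, not part of the SHEPHERD codebase: the function name `combine_jsonl` and the file names in the usage comment are hypothetical, and the sketch makes no assumption about the fields inside each patient record. It only checks that every line is valid JSON before writing it to the combined file.

```python
import json


def combine_jsonl(input_paths, output_path):
    """Concatenate several jsonlines files into one file.

    Hypothetical helper: each non-blank line is parsed to confirm it is
    valid JSON, then re-serialized as one record per line.
    Returns the number of records written.
    """
    n_written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line_no, line in enumerate(src, start=1):
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError as err:
                        raise ValueError(
                            f"{path}:{line_no} is not valid JSON"
                        ) from err
                    out.write(json.dumps(record) + "\n")
                    n_written += 1
    return n_written


# Example usage (file names are placeholders, not files shipped with SHEPHERD):
# combine_jsonl(["my_cohort.jsonl", "external_cohort.jsonl"], "combined.jsonl")
```

The combined file is then the one to reference in MY_TEST_DATA. The sketch does not check for duplicate patient IDs across the two cohorts, so it may be worth confirming that IDs are unique before running the later steps.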

Results

The output of predict.py consists of:

  • Dataframe of scores for each patient (scores.csv)
    • For causal gene discovery: Each gene in a patient's list of candidate genes is scored. The columns of the table are: patient ID, identifier of the candidate gene, similarity score, and binary correct label.
    • For patients-like-me identification: All patients in the input jsonlines file are scored. The columns of the table are: patient ID, identifier of the candidate patient, similarity score, and binary correct label. Note that if you would like to compare only a subset of the patients, you can subset the scores of those patients and re-normalize.
    • For novel disease characterization: Either all diseases in the knowledge graph or all Orphanet diseases are scored. The columns of the table are: patient ID, identifier of the candidate disease (MONDO or Orphanet name), similarity score, and binary correct label.
  • Phenotype attention (phenotype_attn.csv)
  • Patient embeddings (phenotype_embeddings.pth)
  • (Only for novel disease characterization) Disease embeddings (disease_embeddings.pth)
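
For the patients-like-me case, subsetting the scores and re-normalizing could look like the sketch below. Several things here are assumptions rather than facts about SHEPHERD: the column names `patient_id`, `candidate_id`, and `score` are placeholders (check the header of your scores.csv), the helper `subset_and_renormalize` is hypothetical, and "re-normalize" is read as rescaling the remaining non-negative scores so that they sum to 1 for each query patient. If the scores were produced by a different normalization, adapt the rescaling step accordingly.

```python
from collections import defaultdict


def subset_and_renormalize(rows, keep_candidates,
                           patient_col="patient_id",
                           candidate_col="candidate_id",
                           score_col="score"):
    """Keep only selected candidates and rescale scores per query patient.

    rows: list of dicts, e.g. as produced by csv.DictReader.
    keep_candidates: set of candidate identifiers to retain.
    Column names are assumed placeholders, not SHEPHERD's actual headers.
    """
    # Copy the retained rows so the input is left unmodified.
    kept = [dict(r) for r in rows if r[candidate_col] in keep_candidates]

    # Sum the retained scores for each query patient.
    totals = defaultdict(float)
    for r in kept:
        totals[r[patient_col]] += float(r[score_col])

    # Divide each score by its patient's total (assumes non-negative scores).
    for r in kept:
        total = totals[r[patient_col]]
        r[score_col] = float(r[score_col]) / total if total else 0.0
    return kept


# Example usage with csv.DictReader (path and IDs are placeholders):
# import csv
# with open("scores.csv", newline="") as f:
#     rows = list(csv.DictReader(f))
# subset = subset_and_renormalize(rows, {"patient_17", "patient_42"})
```

Rescaling per query patient keeps the scores comparable within one patient's ranking; the relative order of the retained candidates is unchanged.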