Multiomics EHR Risk KP

Alexandra Ralevski edited this page Mar 7, 2024 · 23 revisions

also known as EHR Clinical KP

The Electronic Health Record (EHR) Clinical KP is created and maintained by the Multiomics Provider team from the Institute for Systems Biology in Seattle, WA. This KP provides a knowledge graph pointing from risk factors to a variety of health outcomes (diseases, phenotypes, medication exposure). We use data from over 28 million EHRs to train a large collection of interpretable machine learning models, which are integrated into a single large knowledge graph. The edges of the graph are generated by running ~300 logistic regression models for clinical conditions. The features of each model (including age, sex, medical conditions, and medications) serve as nodes, and the fitted models give their associations with the disease outcome.

Data

The data consist of over 28 million EHRs from Providence Health Systems and Affiliates (PHSA), which cares for patients through 51 hospitals and 1085 clinics across seven states in the US: Alaska, California, Montana, New Mexico, Oregon, Texas, and Washington.

The Graph

EHR KP

Clinical Risk Predicates

  • associated_with_increased_likelihood_of
  • associated_with_decreased_likelihood_of

The EHR knowledge graph includes results from 152 multivariable logistic regression models, which cover 152 conditions, 335 medications, 115 lab measurements, and 5 demographic features. Log odds ratios quantify the associations between concepts. The AUROC for each model is provided, along with the 95% confidence interval and p-value for each association.

Features are binary (0/1) indicators of whether or not they are present in a person's medical history. Laboratory features are coded as high/low relative to the reference range at the time the result was entered into the EHR. A specification of (1,0) or (0,1) indicates that the lab result was high or low, respectively, while "normal" (as defined by the reference ranges) or the absence of a lab result is mapped to (0,0). Laboratory values that were split into high or low were then mapped from LOINC codes to HPO phenotypes. Demographic features include age groups (0-17, 18-49, 50-74, and 75+ years old), sex (Female = 0), and ethnic group (Hispanic or Latino = 1).
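The lab encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the treatment of values exactly on a reference bound as normal, and the example reference range are hypothetical and are not taken from the KP's actual pipeline.

```python
from typing import Optional, Tuple


def encode_lab(value: Optional[float], ref_low: float, ref_high: float) -> Tuple[int, int]:
    """Return the (high, low) indicator pair for a single lab result.

    (1, 0) = above the reference range, (0, 1) = below it,
    (0, 0) = normal or no result recorded.
    Assumption: values equal to a reference bound count as normal.
    """
    if value is None:  # no lab result in the record
        return (0, 0)
    if value > ref_high:
        return (1, 0)
    if value < ref_low:
        return (0, 1)
    return (0, 0)


def age_group(age_years: int) -> str:
    """Map an age to one of the four age groups listed in the text."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years <= 17:
        return "0-17"
    if age_years <= 49:
        return "18-49"
    if age_years <= 74:
        return "50-74"
    return "75+"


# Hypothetical reference range of 3.5 to 5.0 (units omitted)
print(encode_lab(6.1, 3.5, 5.0))   # high
print(encode_lab(2.9, 3.5, 5.0))   # low
print(encode_lab(None, 3.5, 5.0))  # missing result
print(age_group(52))
```

Note that a missing result and a normal result are indistinguishable under this encoding, which matches the (0,0) mapping described above.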

Graph Properties

  • Disease nodes use Monarch Disease Ontology (MONDO) or Human Phenotype Ontology (HPO) identifiers, depending on the nature of the disease.
  • Medication nodes use CHEMBL or CHEBI identifiers, depending on the nature of the medication.
  • Laboratory results use the LOINC2HPO tool to map LOINC codes to HPO identifiers.
  • Edge predicates are "associated_with_increased_likelihood_of" if the coefficient is positive and "associated_with_decreased_likelihood_of" if the coefficient is negative.

Example edge (interpretation): EHR KP shows that rosuvastatin is associated with an increased likelihood of chronic ischemic heart disease, with a log odds ratio of 3.4278 and a p-value < 0.001 (N = 51200) in a cohort of patients from PHSA.
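As a rough illustration of how a coefficient maps to a predicate and to an odds ratio, the following Python sketch applies the sign rule from the graph properties to the example edge. The function names are hypothetical, and the handling of a coefficient of exactly zero is an assumption, since the text does not say how such a case is treated.

```python
import math

INCREASED = "associated_with_increased_likelihood_of"
DECREASED = "associated_with_decreased_likelihood_of"


def predicate_for(log_odds_ratio: float) -> str:
    """Choose the edge predicate from the sign of the coefficient."""
    if log_odds_ratio > 0:
        return INCREASED
    if log_odds_ratio < 0:
        return DECREASED
    # Assumption: a zero coefficient carries no direction, so no edge is emitted.
    raise ValueError("a coefficient of zero has no direction")


def odds_ratio(log_odds_ratio: float) -> float:
    """Convert a log odds ratio to an odds ratio by exponentiating."""
    return math.exp(log_odds_ratio)


# The rosuvastatin example edge from the text
coef = 3.4278
print(predicate_for(coef))
print(round(odds_ratio(coef), 1))
```

For the example edge, the log odds ratio of 3.4278 corresponds to an odds ratio of roughly 30.8.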

Clinical KP use cases

  • Supporting_study_metadata: In the future, this will refer to a PubMed paper.
  • Supporting_study_method_type: multivariable logistic regression.
  • Supporting_study_performance metrics: AUROC
  • Supporting_study_size: different per entity: currently expressed as order_of_magnitude to avoid differential privacy attacks
  • Supporting_study_cohort_inclusion_exclusion_criteria: age < 18
  • Supporting_study_date_range: 2022-2022 (contemporaneous); 2020-2022 (future prediction)
  • Supporting_study_context: Providence St Joseph Health System, seven states in the western United States
  • Object_variable_state: present (contemporaneous), new onset (future prediction)
  • Temporal_interval_qualifier: 2 years

Resources

Modes of Access

This BioThings API does not comply with the TRAPI standard. However, in collaboration with the Exploring Agent team, this BioThings API is accessible as a TRAPI KP service through:

In these situations, BioThings Explorer acts as a “TRAPI wrapper/transformer” that queries the BioThings APIs and processes their responses into knowledge graphs that follow the TRAPI and biolink-model standards (for node categories, edge predicates, etc.).

Knowledge Sources Accessed

  • Providence Health Service EHRs

Using the KP

Our KP can be used in a variety of interesting ways, but we suggest the following uses with example queries:

  • Hypothesis Generation for Drug Repurposing--The greatest strength of our KP is as a hypothesis generator, especially as a first step in moving from clinical to biological reasoning. While our KP does not directly contain biological information, other KPs within Translator include predicates that relate genes, proteins, and the like to certain drugs or diseases. For example, our clinical risk KP can be queried with a 1-hop query to identify which drugs or diseases may decrease the risk of another disease, or with a multi-hop query to see whether there are biological entities related to those drugs or diseases.

  • Interpretation of Lab Results--The inclusion of labs in model B allows the user to search for possible causes of unexpected lab results. Given a surprising lab result, one can construct a 1-hop query on our KP to see which diseases/medications have that particular lab result as a risk factor. If the patient is on a medication or has a disease returned by the query, it can offer a possible explanation for the result. If not, one can use the query results to further investigate the cause of the result, either within Translator or through an external source. NOTE: We do not recommend using our KP directly for clinical diagnosis. It should serve only as a pointer to possible avenues that a clinician can investigate further. For more info, see the "Building The KP" section below.

  • Identification of Potential Unique Comorbidities--The disease-to-disease predicates allow one to identify common and not-so-common comorbidities. However, these too must be interpreted with caution. One drawback of the Logistic Regression models (see below) is that they can assign large weights to uncommon disease combinations. This is largely due to the scarcity of data points with unique comorbidity combinations. For instance, in an internal test performed by our group, we found that a disease affecting an organ present in males was returned as a high-risk factor for a disease affecting an organ present in females. This is clearly a very interesting combination of comorbidities, since these organs are very rarely present simultaneously in a single individual. However, there have been a few documented cases of patients who exhibit both. The large weight given to the male-associated feature was due to a large imbalance between patients who exhibited both the feature and the female-associated outcome and those who had only the former. Identifying such unique or contradictory results can reveal unique comorbidities that may have very interesting underlying mechanisms worthy of further investigation.
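The 1-hop queries mentioned in the use cases above can be sketched as a TRAPI query graph built in Python. This is a minimal sketch under assumptions: the helper name `one_hop_query`, the choice of `biolink:ChemicalEntity` as the subject category, and the example disease CURIE are illustrative only, and the endpoint to which the message would be posted is deliberately not shown.

```python
import json


def one_hop_query(disease_curie: str) -> dict:
    """Build a TRAPI message asking which chemical entities are associated
    with a decreased likelihood of the given disease (hypothetical helper)."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # Unpinned subject node; the category is an assumption
                    "n0": {"categories": ["biolink:ChemicalEntity"]},
                    # Pinned object node: the disease of interest
                    "n1": {
                        "ids": [disease_curie],
                        "categories": ["biolink:Disease"],
                    },
                },
                "edges": {
                    "e0": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": [
                            "biolink:associated_with_decreased_likelihood_of"
                        ],
                    }
                },
            }
        }
    }


# Illustrative disease identifier; substitute the MONDO or HPO CURIE of interest
query = one_hop_query("MONDO:0005148")
print(json.dumps(query, indent=2))
```

Swapping the predicate for `biolink:associated_with_increased_likelihood_of`, or pinning the subject node instead of the object, gives the other 1-hop variants described above, such as asking which outcomes have a given lab phenotype as a risk factor.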

Building The KP

The Models

Logistic Regression (LR) models fit a probability distribution over the binary outcomes as a sigmoid function, and the coefficients are found by minimizing the cross entropy or, equivalently, maximizing the log-likelihood of the data. Our EHR KP uses these coefficients as edge weights with the edge predicate "associated_with_[increased/decreased]_likelihood_of". The sign of the coefficient determines whether the risk is increasing or decreasing, based on whether the sigmoid is increasing or decreasing along that particular feature direction. Namely, positive coefficients are associated with increased risk (due to the minus sign in the exponential in the denominator of the sigmoid), while negative coefficients are associated with decreased risk.
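For reference, the standard logistic regression formulation behind this description can be written as follows. The notation (features x_j, coefficients \beta_j, intercept \beta_0, outcomes y_i) is assumed here for illustration and is not taken from the KP's own documentation.

```latex
% Sigmoid model for the probability of the outcome
p(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}},
\qquad z = \beta_0 + \sum_{j} \beta_j x_j

% Log-likelihood maximized during fitting (its negative is the cross entropy)
\ell(\beta) = \sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],
\qquad p_i = p(y = 1 \mid x^{(i)})

% Log odds are linear in the features, so each coefficient is a log odds ratio
\log \frac{p}{1 - p} = \beta_0 + \sum_{j} \beta_j x_j
```

Because of the minus sign in the exponent, a positive \beta_j makes the sigmoid increase as x_j increases. For a binary feature, switching x_j from 0 to 1 multiplies the odds of the outcome by e^{\beta_j}, holding the other features fixed.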

Although the model coefficients determine how strong the associations between nodes are, we use the AUROC (Area Under the Receiver Operating Characteristic Curve) to measure the performance of each multivariable logistic regression model. AUROC is a performance measure for binary classification models that summarizes performance across all classification thresholds. It is the area under the curve obtained by plotting the True Positive Rate against the False Positive Rate as the threshold varies, and it ranges from 0 to 1.
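One way to build intuition for AUROC is its rank-based interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. It is an illustrative implementation, not the code used to evaluate the KP's models, and the toy labels and scores are made up.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Compute AUROC as the fraction of (positive, negative) pairs in which
    the positive example is scored higher; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 negatives and 2 positives with predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of examples, so at real cohort sizes a sorting-based implementation from an established library would be preferable.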
