I predicted the pCHEMBL values, AlogP values, Molecular Weight and number of Lipinski's Rule of 5 Violations of a biomolecule by end-to-end training of multiple pre-trained Language models on Dopamine D2 active compounds sourced from the CHeMBL database.
• pCHEMBL represents the negative logarithm (base 10) of the standard values, pro-
viding a more balanced and standardized representation of potency across various values. It is a standardized version of the Standard Value, measuring the
molecule's bioactivity.
• AlogP measures a molecule’s lipophilicity or affinity to lipids/fats versus
water. This property is crucial as it significantly influences a drug’s pharmacokinetics,
impacting its absorption, distribution, metabolism, and excretion within the body.
Compounds with balanced AlogP values are more likely to be absorbed efficiently and
exhibit favourable pharmacological characteristics.
• Molecular Weight is a crucial factor in drug discovery and biopharma. It is also a
factor considered in Lipinski’s Rule of Five.
• RO5 Violations The number of Lipinski’s Rule of Five violations. Lipinski’s rule
of five is a widely used rule of thumb in medicinal chemistry to evaluate drug likeness
or oral drugs.
The implemented models are:
• RoBERTa randomly initialized, 125 million parameters
• RoBERTa pre-trained, 125 million parameters
• ChemBERTa pre-trained on PubChem 1M, 85 million parameters
• ChemBERTa pre-trained on 10M ZINC database, 3.5 million parameters
• ChemGPT pre-trained on PubChem10M Smile strings, 1.2 billion parameters
Use the main.ipynb file for end-to-end training and use_pretrained.ipynb for freezing the pre-trained language model part and only training the final linear layers for regression. The chosen models can be changed in the second cell.