Missing Information

Missing information is commonplace in chemical file formats and line notations. In many cases this information is implicit in the representation, but recovering it is not always easy and requires assumptions which may not be true. Examples of missing information are the lack of bonds in XYZ files and the removed double bond locations in aromatic ring systems.

Element and Isotope information

When reading files, the format in one way or another carries implicit information that some algorithms need. Element and isotope information is a key example. Typically, the element symbol is provided in the file, but not the mass number or the implied isotope. You would need to consult the format specification to learn which properties are implicitly meant. The idea here is that information about elements and isotopes is well standardized by organizations such as IUPAC. Such default element and isotope properties are exposed in the CDK by the classes Elements and Isotopes.

Elements

The Elements class provides information about an element's atomic number, symbol, periodic table group and period, covalent radius, van der Waals radius, and Pauling electronegativity:

ElementsDemo

For example, for lithium this gives:

ElementsDemo

Isotopes

Similarly, there is the Isotopes class to help you look up isotope information. For example, you can get all isotopes for an element or just the major isotope (a full list of isotopes is available from Appendix B):

HydrogenIsotopes

For hydrogen this gives:

HydrogenIsotopes

This class is also used by the getMajorIsotopeMass method in the MolecularFormulaManipulator class to calculate the monoisotopic mass of a molecule:

MonoisotopicMass

The output for ethanol looks like:

MonoisotopicMass
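The arithmetic behind a monoisotopic mass can be sketched in plain Java without the CDK: multiply each element count by the mass of that element's most abundant isotope and sum the results. The class name `MonoisotopicMassSketch`, its method, and the small hard-coded mass table are assumptions made for this illustration; they are not the CDK API, which looks the masses up via the Isotopes class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is not the CDK implementation.
final class MonoisotopicMassSketch {

    // Masses (in u) of the most abundant isotopes 1H, 12C, 14N and 16O,
    // rounded to eight decimals. Hard-coded here as an assumption.
    static final Map<String, Double> MAJOR_ISOTOPE_MASS = Map.of(
        "H", 1.00782503,
        "C", 12.0,
        "N", 14.00307400,
        "O", 15.99491462
    );

    // Sums (element count * major isotope mass) over the whole formula.
    static double monoisotopicMass(Map<String, Integer> formula) {
        double mass = 0.0;
        for (Map.Entry<String, Integer> entry : formula.entrySet()) {
            Double isotopeMass = MAJOR_ISOTOPE_MASS.get(entry.getKey());
            if (isotopeMass == null) {
                throw new IllegalArgumentException("No mass known for " + entry.getKey());
            }
            mass += entry.getValue() * isotopeMass;
        }
        return mass;
    }

    public static void main(String[] args) {
        // Ethanol, C2H6O
        Map<String, Integer> ethanol = new LinkedHashMap<>();
        ethanol.put("C", 2);
        ethanol.put("H", 6);
        ethanol.put("O", 1);
        System.out.println(monoisotopicMass(ethanol));
    }
}
```

For ethanol the sum is 2 × 12 + 6 × 1.00782503 + 15.99491462, which comes to about 46.0419 u.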

Reconnecting Atoms

XYZ files do not have bond information, and may look like:

code/data/methane.xyz

Fortunately, we can reasonably assume bonds to have a certain length, and we know reasonably well how many connections an atom can have at most. Then, using the 3D coordinate information available from the XYZ file, an algorithm can deduce how the atoms must be bonded. The RebondTool does exactly that. It does so efficiently too, using a binary space partitioning tree, which allows it to scale to protein-sized molecules.

Now, the algorithm does need to know what reasonable bond lengths are. For this we can use the Jmol list of covalent radii, configuring the atoms accordingly:

CovalentRadii

which configures and prints the atoms' radii:

CovalentRadii

Then the RebondTool can be used to rebond the atoms:

RebondToolDemo

The number of bonds it found is reported in the last line:

RebondToolDemo
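The distance criterion behind rebonding can be illustrated with a naive, plain-Java sketch: two atoms are considered bonded when their distance is no larger than the sum of their covalent radii plus a tolerance. This is not the RebondTool implementation. It compares all atom pairs instead of using a space partitioning tree, and the class name, the methane coordinates, the radii (0.77 Å for carbon, 0.32 Å for hydrogen), and the threshold values are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; quadratic in the number of atoms.
final class RebondSketch {

    static final class Atom {
        final String symbol;
        final double x, y, z;
        final double covalentRadius; // in Angstrom, assumed value

        Atom(String symbol, double x, double y, double z, double covalentRadius) {
            this.symbol = symbol;
            this.x = x; this.y = y; this.z = z;
            this.covalentRadius = covalentRadius;
        }
    }

    // Returns index pairs {i, j} of atoms whose distance lies between
    // minBondDistance and (radius_i + radius_j + bondTolerance).
    static List<int[]> rebond(List<Atom> atoms, double minBondDistance, double bondTolerance) {
        List<int[]> bonds = new ArrayList<>();
        for (int i = 0; i < atoms.size(); i++) {
            for (int j = i + 1; j < atoms.size(); j++) {
                Atom a = atoms.get(i);
                Atom b = atoms.get(j);
                double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
                double distance = Math.sqrt(dx * dx + dy * dy + dz * dz);
                double maxDistance = a.covalentRadius + b.covalentRadius + bondTolerance;
                if (distance > minBondDistance && distance <= maxDistance) {
                    bonds.add(new int[] {i, j});
                }
            }
        }
        return bonds;
    }

    // Idealized methane: carbon at the origin, hydrogens about 1.09 Angstrom away.
    static List<Atom> methane() {
        List<Atom> atoms = new ArrayList<>();
        atoms.add(new Atom("C", 0.0, 0.0, 0.0, 0.77));
        atoms.add(new Atom("H", 0.629, 0.629, 0.629, 0.32));
        atoms.add(new Atom("H", -0.629, -0.629, 0.629, 0.32));
        atoms.add(new Atom("H", -0.629, 0.629, -0.629, 0.32));
        atoms.add(new Atom("H", 0.629, -0.629, -0.629, 0.32));
        return atoms;
    }

    public static void main(String[] args) {
        List<int[]> bonds = rebond(methane(), 0.5, 0.5);
        System.out.println("Bonds found: " + bonds.size());
    }
}
```

With these values the four C–H pairs fall inside the threshold of 1.59 Å, while the H–H pairs, about 1.78 Å apart, exceed their threshold of 1.14 Å, so four bonds are found.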

Missing Bond Orders

There are several reasons why bond orders are missing from an input structure. For example, you may be reading an XYZ file and have just performed a rebonding as outlined in the previous section. Or, you may be reading SMILES strings with aromatic organic subset atoms, such as c1ccccc1. Or, you may be reading an MDL molfile that uses the query bond order 4 to indicate an aromatic bond.

The latter two situations are, in fact, very common in cheminformatics. Before CDK 1.4.11 we had the DeduceBondSystemTool to find the location of double bonds in such delocalized electron systems. Release 1.4.11 introduced a new tool, the FixBondOrdersTool class, which does a better job, and faster too. Both classes only look for double bond positions in rings, but that covers many common use cases.

The method requires atom types to be perceived first, which already happens when reading SMILES. For example, for pyrrole:

FixPyrroleBondOrders

This results in the image given in Figure pyrrole.

![](code/generated/FixPyrroleBondOrders.png)
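
Placing double bonds in a delocalized ring can be viewed as a matching problem: every atom that needs a double bond must be paired with exactly one neighbor that also needs one. The following plain-Java sketch solves this by backtracking. It is an illustration of the idea, not the algorithm used by FixBondOrdersTool; the class name and the `needsDouble` flags are assumptions, and deciding which atoms need a double bond is exactly what atom type perception provides in the CDK.

```java
import java.util.Arrays;

// Illustrative sketch only; not the CDK implementation.
final class KekulizeSketch {

    // Returns partner[i] = index of the atom double-bonded to atom i,
    // or -1 if atom i gets no double bond. Returns null if no valid
    // assignment exists.
    static int[] assignDoubleBonds(int atomCount, int[][] bonds, boolean[] needsDouble) {
        int[] partner = new int[atomCount];
        Arrays.fill(partner, -1);
        return match(0, bonds, needsDouble, partner) ? partner : null;
    }

    private static boolean match(int atom, int[][] bonds, boolean[] needsDouble, int[] partner) {
        // Skip atoms that need no double bond or already have one.
        while (atom < partner.length && (!needsDouble[atom] || partner[atom] != -1)) {
            atom++;
        }
        if (atom == partner.length) {
            return true; // every atom is satisfied
        }
        for (int[] bond : bonds) {
            int other = bond[0] == atom ? bond[1] : (bond[1] == atom ? bond[0] : -1);
            if (other < 0 || !needsDouble[other] || partner[other] != -1) {
                continue;
            }
            partner[atom] = other;
            partner[other] = atom;
            if (match(atom + 1, bonds, needsDouble, partner)) {
                return true;
            }
            partner[atom] = -1; // undo and try the next neighbor
            partner[other] = -1;
        }
        return false;
    }

    public static void main(String[] args) {
        // Pyrrole ring, atoms 0..4 with the nitrogen at index 3.
        int[][] ring = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 0}};
        boolean[] needsDouble = {true, true, true, false, true};
        System.out.println(Arrays.toString(assignDoubleBonds(5, ring, needsDouble)));
    }
}
```

For the pyrrole ring in `main`, the first attempt (a double bond between atoms 0 and 1) leaves atom 2 without a partner, so the search backtracks and ends with double bonds between atoms 0 and 4 and between atoms 1 and 2. The nitrogen keeps its hydrogen and takes no double bond.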

Missing Hydrogens

The CDKHydrogenAdder class can be used to add missing hydrogens. The algorithm itself adds implicit hydrogens (see Section hydrogens), but we will see how these can be converted into explicit hydrogens. The hydrogen adding algorithm expects, however, that CDK atom types are already perceived (see Section atomtypePerception).

Implicit Hydrogens

Hydrogens that are not vertices in the molecular graph are called implicit hydrogens. They are merely a property of the atom to which they are connected. If these values are not given, which is common in, for example, SMILES, they can be (re)calculated with:

MissingHydrogens

which reports:

MissingHydrogens
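As a rough illustration of what such a calculation involves, the plain-Java sketch below derives an implicit hydrogen count as a default valence minus the sum of the bond orders already present on the atom. This is a deliberate simplification and not what the CDKHydrogenAdder does: the CDK relies on the perceived atom type, which also accounts for charge and hybridization. The class name and the three-element valence table are assumptions for this example.

```java
import java.util.Map;

// Illustrative sketch only; ignores charges, radicals and hypervalency.
final class ImplicitHydrogenSketch {

    // Default valences for neutral atoms, assumed for this example.
    static final Map<String, Integer> DEFAULT_VALENCE = Map.of("C", 4, "N", 3, "O", 2);

    // Implicit hydrogens = default valence minus the bond orders already used.
    static int implicitHydrogens(String symbol, int bondOrderSum) {
        Integer valence = DEFAULT_VALENCE.get(symbol);
        if (valence == null) {
            throw new IllegalArgumentException("No default valence for " + symbol);
        }
        return Math.max(0, valence - bondOrderSum);
    }

    public static void main(String[] args) {
        // A lone carbon, as in the SMILES "C" for methane.
        System.out.println(implicitHydrogens("C", 0));
    }
}
```

A lone carbon gets four implicit hydrogens, while the oxygen in ethanol, which already has one single bond, gets one.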

Explicit Hydrogens

These implicit hydrogens can be converted into explicit hydrogens using the following code:

ExplicitHydrogens

which reports for the running methane example:

ExplicitHydrogens
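Conceptually, the conversion turns a per-atom count into real vertices and edges of the molecular graph. The plain-Java sketch below shows that bookkeeping on a minimal, hypothetical graph structure; the class and its fields are assumptions for this illustration and do not mirror the CDK data model.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; a minimal stand-in for a molecular graph.
final class ExplicitHydrogenSketch {

    final List<String> symbols = new ArrayList<>();
    final List<Integer> implicitH = new ArrayList<>();
    final List<int[]> bonds = new ArrayList<>();

    // Adds an atom and returns its index.
    int addAtom(String symbol, int implicitHydrogens) {
        symbols.add(symbol);
        implicitH.add(implicitHydrogens);
        return symbols.size() - 1;
    }

    // Replaces every implicit hydrogen by an explicit H atom plus a bond.
    void makeHydrogensExplicit() {
        int originalAtomCount = symbols.size(); // do not revisit newly added H atoms
        for (int i = 0; i < originalAtomCount; i++) {
            int count = implicitH.get(i);
            for (int h = 0; h < count; h++) {
                int hydrogen = addAtom("H", 0);
                bonds.add(new int[] {i, hydrogen});
            }
            implicitH.set(i, 0);
        }
    }

    public static void main(String[] args) {
        ExplicitHydrogenSketch methane = new ExplicitHydrogenSketch();
        methane.addAtom("C", 4);
        methane.makeHydrogensExplicit();
        System.out.println(methane.symbols.size() + " atoms, " + methane.bonds.size() + " bonds");
    }
}
```

For methane, the single carbon with four implicit hydrogens becomes a graph of five atoms and four bonds, and the carbon's implicit hydrogen count drops to zero.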

2D Coordinates

Another bit of information missing from the input is often 2D coordinates. To generate 2D coordinates, the StructureDiagramGenerator can be used:

Layout

which will generate the coordinates, starting with an initial direction:

Layout

Unknown Molecular Formula

Mass spectrometry (MS) is a technology where the experiment yields monoisotopic masses for molecules. To analyze these further, it is common to convert them to molecular formulae. The MassToFormulaTool has functionality to determine these missing formulae. Miguel Rojas-Chertó developed this code for use in metabolomics [Q27134827]. Basic usage looks like:

MissingMF

This will create a long list of possible molecular formulae. It is important to realize that it looks only at which molecular formulae are possible with respect to the corresponding mass. This means that it will include chemically unlikely molecular formulae:

MissingMF

This is overcome by setting restrictions. For example, we can put restrictions on the number of elements we allow in the matched formulae:

MissingMFRestrictions

Now the list looks more chemically sensible:

MissingMFRestrictions
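The effect of such restrictions can be reproduced with a brute-force, plain-Java sketch that enumerates C, H, N and O counts within given ranges and keeps every combination whose monoisotopic mass falls within a tolerance of the target. This is not the MassToFormulaTool algorithm; the class name, the hard-coded isotope masses, and the chosen ranges are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; exhaustive enumeration over four elements.
final class MassToFormulaSketch {

    static final String[] SYMBOLS = {"C", "H", "N", "O"};
    // Major isotope masses in u (12C, 1H, 14N, 16O), rounded; assumed values.
    static final double[] MASSES = {12.0, 1.00782503, 14.00307400, 15.99491462};

    // minCounts and maxCounts give the allowed atom count range per element,
    // in the order of SYMBOLS.
    static List<String> formulae(double targetMass, double tolerance,
                                 int[] minCounts, int[] maxCounts) {
        List<String> hits = new ArrayList<>();
        search(0, new int[SYMBOLS.length], 0.0, targetMass, tolerance, minCounts, maxCounts, hits);
        return hits;
    }

    private static void search(int element, int[] counts, double mass, double targetMass,
                               double tolerance, int[] minCounts, int[] maxCounts,
                               List<String> hits) {
        if (mass > targetMass + tolerance) {
            return; // adding more atoms can only increase the mass
        }
        if (element == SYMBOLS.length) {
            if (Math.abs(mass - targetMass) <= tolerance) {
                hits.add(format(counts));
            }
            return;
        }
        for (int n = minCounts[element]; n <= maxCounts[element]; n++) {
            counts[element] = n;
            search(element + 1, counts, mass + n * MASSES[element], targetMass, tolerance,
                   minCounts, maxCounts, hits);
        }
        counts[element] = 0;
    }

    private static String format(int[] counts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                sb.append(SYMBOLS[i]);
                if (counts[i] > 1) sb.append(counts[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] min = {0, 0, 0, 0};
        System.out.println(formulae(46.0419, 0.02, min, new int[] {10, 20, 5, 5}));
        System.out.println(formulae(46.0419, 0.02, min, new int[] {10, 20, 0, 5}));
    }
}
```

For a target of 46.0419 u with a tolerance of 0.02 u, the unrestricted search returns four formulae, including the chemically unlikely H4N3. Disallowing nitrogen leaves only C2H6O, the formula of ethanol.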

Of course, this is a long way from actual chemical structures. An open source structure generator has been a long-standing holy grail, and the CDK-based MAYGEN addresses this gap [Q109827109], though Surge, which is also open source, is a good bit faster [Q113585012].

References