Octet rule in the problem of structure database autogeneration

It is now widely accepted that preliminary screening of the desired properties of chemical compounds is the most efficient way for most researchers to search for new functional materials. It saves time and expense dramatically by eliminating unsuitable candidates at early stages of the process. Most such pre-screening approaches are based on QSPR analysis, which relies on regression models, genetic algorithms, neural networks, deep learning, etc. However, selecting the set of compounds to choose from (especially in unexplored regions of chemical space) remains a non-trivial task. In this respect, there is growing interest in the generation of virtual libraries of chemical systems.

For virtual compounds (i.e., those that have not been synthesized yet), it is possible to calculate various physicochemical properties such as refractive indices, dissociation constants, boiling and melting points, lipophilicity, and electronic, mass, NMR, and infrared spectra. Databases that accumulate these structures and their characteristics provide an easy way to search, systematize, and identify compounds, and they open vast opportunities for machine learning in chemistry. However, such approaches require immense computational resources to perform the calculations and to store the values of the desired properties.

Several projects are trying to solve the problem of virtual compound generation. Among the most promising are GDB [link], ChEMBL [link] and OMG [link]. Recently, a research group from the Fraunhofer Institute for Algorithms and Scientific Computing proposed an algorithm based on the octet rule [link] that selects structures from the full set of generated graphs according to the following chemical considerations. The authors define ‘valid chemistry’ as molecules in which all electrons are located in complete electronic shells, i.e., each atom has the correct number of electrons and the molecule as a whole is electroneutral:

\sum_{i}(\nu_{i}+b_{i}-8)=0

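where \nu_{i} is the number of valence electrons of the ith atom and b_{i} is the number of bonds in which the ith atom is involved. Each term \nu_{i}+b_{i}-8 is then the formal charge of the ith atom with a completed octet, so the condition only demands that the formal charges cancel over the whole molecule. The octet rule thus implicitly accommodates charged Lewis structures, because it does not require every atom to be neutral by itself.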
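The condition can be tried out on a few small graphs. The following is a minimal sketch, not the authors' implementation. It assumes that \nu_{i} is read as the valence-electron count and b_{i} as the bond count with bond orders included (an assumption about the notation). Hydrogen atoms are left out, since their shell holds two electrons and the text does not say how they are treated. The names `VALENCE_ELECTRONS`, `formal_charges` and `is_electroneutral` are hypothetical.

```python
# Valence-electron counts for the second-row elements discussed in the text.
VALENCE_ELECTRONS = {"C": 4, "N": 5, "O": 6, "F": 7}


def formal_charges(atoms, bonds):
    """Return nu_i + b_i - 8 for every atom.

    atoms: list of element symbols, e.g. ["C", "O"]
    bonds: list of (i, j, order) tuples; a double bond counts as two bonds.
    Assumption: hydrogens are not handled (their shell holds 2 electrons).
    """
    bond_count = [0] * len(atoms)
    for i, j, order in bonds:
        bond_count[i] += order
        bond_count[j] += order
    return [VALENCE_ELECTRONS[el] + b - 8 for el, b in zip(atoms, bond_count)]


def is_electroneutral(atoms, bonds):
    """True when the per-atom terms sum to zero over the whole graph."""
    return sum(formal_charges(atoms, bonds)) == 0


# Carbon monoxide, C#O: both atoms carry a charge, yet the sum vanishes.
print(formal_charges(["C", "O"], [(0, 1, 3)]))     # [-1, 1]
print(is_electroneutral(["C", "O"], [(0, 1, 3)]))  # True

# Carbon dioxide, O=C=O: every atom is neutral on its own.
print(formal_charges(["O", "C", "O"], [(0, 1, 2), (1, 2, 2)]))  # [0, 0, 0]

# A bare O=C-O skeleton (no hydrogens, no charge) fails the test.
print(is_electroneutral(["O", "C", "O"], [(0, 1, 2), (1, 2, 1)]))  # False
```

Carbon monoxide is the instructive case: neither atom is neutral, but the graph as a whole passes, which is exactly the freedom the summed condition grants.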

Such an algorithm has a number of advantages because it does not rely on predefined sets of atom valences derived from empirical data. It offers a kind of ab initio approach to the generation of virtual compounds.

The stages of the generation are shown in the following figure:

[figure: Generation scheme]

The decision on whether two compounds are mesomeric or tautomeric to each other is based on the values of specially developed “keys”. The first key serves to eliminate automorphic structures, the second to eliminate mesomeric ones, and the last to exclude tautomers. However, the identification of tautomers remains questionable, because the authors appear to have misread the IUPAC definition:

… ethenol is tautomeric to ethanal and 1-hexene is tautomeric to 2-hexene, regardless of the interconversion time-scale. Also, we thus define the zwitterionic forms of amino acids to be tautomers of the nonionic forms and tautomerism to be a transitive relation: if both pairs 1-hexene/2-hexene and 2-hexene/3-hexene are tautomers, then also 1-hexene and 3-hexene are tautomers …

The ethenol/ethanal pair is indeed a well-known example of keto-enol tautomerism: the two compounds are in equilibrium, and the ratio [keto]/[enol] is near 10^7. Hex-1-ene and hex-2-ene, by contrast, are two separate compounds that do not transform into each other under ordinary conditions, and the same is true of the hex-1-ene/hex-3-ene pair. Consequently, as the example of these alkene isomers shows, such a selection rule will systematically misclassify distinct alkenes, alkynes, etc. as tautomers. Moreover, the question of optical isomers and conformers, which might also be removed or incorrectly classified, remains open (the authors have not discussed these problems). This limitation for isomers that differ in the arrangement of double or triple bonds would affect the general concept of the selection procedure.
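The consequence of declaring the relation transitive can be illustrated with a small union-find sketch. This is not the authors' key-based procedure. The function name `tautomer_classes` is hypothetical, and the input pairs are simply the ones named in the quotation. Everything linked through a chain of pairs ends up in one class, of which a deduplicating generator would keep a single representative.

```python
def tautomer_classes(compounds, pairs):
    """Group compounds into classes under the transitive closure of `pairs`."""
    parent = {c: c for c in compounds}

    def find(c):
        # Follow parent links to the class representative (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for c in compounds:
        classes.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in classes.values())


compounds = ["ethenol", "ethanal", "hex-1-ene", "hex-2-ene", "hex-3-ene"]
pairs = [
    ("ethenol", "ethanal"),
    ("hex-1-ene", "hex-2-ene"),
    ("hex-2-ene", "hex-3-ene"),
]

for members in tautomer_classes(compounds, pairs):
    print(members)
# ['ethanal', 'ethenol']
# ['hex-1-ene', 'hex-2-ene', 'hex-3-ene']
```

Under this rule the three hexenes collapse into a single class, so two of these three distinct compounds would disappear from the generated library.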

To validate the generated virtual compounds, the authors performed DFT calculations with the def2-TZVP basis set for a set of 96 structures with three non-hydrogen atoms. First, the choice of basis set is questionable, considering that the systems are built only of C, N, O and F atoms, whereas this basis set was developed with heavier atoms in mind. Secondly, it is well known that a single ab initio calculation with only one basis set is not reliable and in certain cases may give an incorrect result. Finally, no information is given about the initial guess for the geometry optimization. The initial geometries were probably built from tabulated bond lengths, valence angles and dihedral angles. This means that the starting structures may have unnecessary degeneracy (for example, excess symmetry), which can leave the optimization at a problematic stationary point rather than a true minimum. To exclude such unstable solutions, it is necessary to check the Hessian for negative eigenvalues (imaginary vibrational frequencies).
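The Hessian test can be illustrated on a toy two-dimensional surface. This is only a sketch of the idea, not a quantum-chemical calculation. The function `model_energy` is a hypothetical double-well model and not a molecular potential energy surface, and the helper names are likewise illustrative. The symmetric point (0, 0) has zero gradient, so an optimizer started there may stop immediately, yet one Hessian eigenvalue is negative.

```python
import math


def model_energy(x, y):
    """Toy surface (not a molecular PES): harmonic in x, double well along y."""
    return x**2 + (y**2 - 1.0) ** 2


def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference Hessian of f at (x, y) as (fxx, fxy, fyy)."""
    f0 = f(x, y)
    fxx = (f(x + h, y) - 2.0 * f0 + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2.0 * f0 + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h**2)
    return fxx, fxy, fyy


def hessian_eigenvalues(f, x, y):
    """Eigenvalues of the symmetric 2x2 Hessian, in ascending order."""
    fxx, fxy, fyy = hessian_2d(f, x, y)
    mean = 0.5 * (fxx + fyy)
    radius = math.hypot(0.5 * (fxx - fyy), fxy)
    return mean - radius, mean + radius


def is_minimum(f, x, y, tol=1e-6):
    """A stationary point is a minimum only if every eigenvalue is positive."""
    return all(ev > tol for ev in hessian_eigenvalues(f, x, y))


# Symmetric stationary point: one negative eigenvalue, i.e. a saddle point.
print(is_minimum(model_energy, 0.0, 0.0))  # False

# One of the two true minima, reached only after breaking the symmetry.
print(is_minimum(model_energy, 0.0, 1.0))  # True
```

For a real molecule the Hessian has dimension 3N, and the six eigenvalues (five for linear molecules) belonging to overall translation and rotation are zero and have to be set aside before the test is applied.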

In summary, I believe that up-to-date semi-empirical approaches are sufficient for the purpose of primary validation. These methods are computationally inexpensive and suitable for large numbers of molecules, even those containing several dozen non-hydrogen atoms. The proposed generation algorithm, in contrast to earlier ones, has an important advantage: it does not include any predefined empirical considerations and relies only on chemical logic. However, it seems to work well only for second-row elements. Generalizing the algorithm to d-elements is not obvious, especially when it comes to metal complexes.