Machine Learning Algorithms for Prediction of Biological Activity and Chemical Properties
The focus of this work was to establish quantitative structure-activity relationships (QSAR; potency of allosteric modulators) and quantitative structure-property relationships (QSPR; carbon chemical shifts) by means of machine learning for molecules with known structure and known activity or property. The resulting models were then employed to predict the biological activity and carbon chemical shifts of molecules whose structure is known but whose activity or chemical shifts are not. All described algorithms were implemented in the BioChemistry Library (BCL), an in-house object-oriented library providing functionality to manipulate small molecules and proteins. After an introduction to the field of QSAR/QSPR with specific consideration of machine learning and descriptor development, the application of these principles to identify positive allosteric modulators (PAMs) of metabotropic glutamate receptor subtype 5 (mGlu5) in a database of commercially available compounds is described. This external database was enriched for active compounds by a factor of 30 compared to the original high-throughput screen. The newly developed methods were then extended to promote scaffold hopping in the search for PAMs of metabotropic glutamate receptor subtype 4 (mGlu4). Introducing the scaffold-hopping approach lowered the enrichment of an external database from 22 to 8. Refinement of the described methods led to the discovery of two compounds representing a new scaffold of negative allosteric modulators (NAMs) of mGlu5. The publicly available data of the NMRShiftDB can be used to train a machine learning model that predicts 13C chemical shifts with a mean absolute error (MAE) of 2.95 ppm and a root-mean-square deviation (RMSD) of 3.95 ppm. For a subset of 12 natural products, an MAE of 3.29 ppm (RMSD of 4.50 ppm) was determined, demonstrating the ability of the methods to predict the 13C chemical shifts of newly discovered natural products.
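The two error measures quoted for the chemical shift predictions can be illustrated with a minimal sketch. It assumes the standard definitions of mean absolute error and root-mean-square deviation; the function names and the shift values are hypothetical, chosen only for illustration, and are neither part of the BCL nor taken from the NMRShiftDB.

```python
import math


def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over all carbon atoms (in ppm)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def root_mean_square_deviation(predicted, observed):
    """Square root of the mean squared prediction error (in ppm)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Hypothetical 13C shifts in ppm, made up for this example.
predicted_shifts = [128.0, 77.0, 21.0, 170.0]
observed_shifts = [130.0, 75.0, 20.0, 174.0]

mae = mean_absolute_error(predicted_shifts, observed_shifts)
rmsd = root_mean_square_deviation(predicted_shifts, observed_shifts)
print(f"MAE  = {mae:.2f} ppm")
print(f"RMSD = {rmsd:.2f} ppm")
```

Because the RMSD squares each error before averaging, it weights large deviations more heavily and is never smaller than the MAE, which is consistent with the pairs of values reported here.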
The successful introduction of configurational and conformational descriptors was shown by an improved MAE of 2.84 ppm on the independent data set. This dissertation presents a novel approach for ligand-based virtual high-throughput screening that requires no a priori knowledge of the target protein. The fragment-based encoding methodology can be easily transferred to other drug targets. The necessity of selecting drug-target-specific descriptor subsets to improve predictive value was demonstrated. Training the model on a publicly available database makes it possible to provide a web server for carbon spectra prediction at www.meilerlab.org. The input for this model is based on a newly developed spherical atom environment code. Solvent, temperature, and computed partial charges were introduced as additional descriptors to improve accuracy.
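The enrichment values quoted in this abstract can be read with the help of a short sketch. It assumes the common definition of enrichment as the hit rate among the selected compounds divided by a baseline hit rate (here taken to be that of the original high-throughput screen); the exact definition used in this work may differ. The function name and all counts below are hypothetical.

```python
def enrichment(actives_selected, n_selected, actives_baseline, n_baseline):
    """Ratio of the hit rate in a selected subset to a baseline hit rate.

    Assumed definition, for illustration only:
        (actives_selected / n_selected) / (actives_baseline / n_baseline)
    """
    if n_selected <= 0 or n_baseline <= 0:
        raise ValueError("compound counts must be positive")
    if actives_baseline <= 0:
        raise ValueError("baseline must contain at least one active")
    hit_rate_selected = actives_selected / n_selected
    hit_rate_baseline = actives_baseline / n_baseline
    return hit_rate_selected / hit_rate_baseline


# Hypothetical counts: a screen with a 1% hit rate, and a model-selected
# set of 100 compounds of which 30 turn out to be active.
value = enrichment(actives_selected=30, n_selected=100,
                   actives_baseline=100, n_baseline=10_000)
print(f"enrichment = {value:.1f}")
```

Under this definition an enrichment of 1 means the selection performs no better than the baseline screen, so values such as 30, 22, or 8 all indicate a selection that is markedly richer in active compounds.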