Unsupervised ML in Drug Discovery (I): Intro

Drug discovery is a challenging process, primarily due to the vast combinatorial chemical space. Commonly cited estimates suggest that the number of small organic molecules that could theoretically exist with a molecular weight up to around 500 Da is on the order of 10^60. This immense chemical space presents both an opportunity and a challenge: it holds the potential for novel and potent drug candidates, but exploring it efficiently is a daunting task, and traditional approaches such as high-throughput screening (HTS) can only cover a limited portion of it. This has driven growing interest in using unsupervised machine learning to navigate and narrow the chemical space in the early stages of drug discovery. These techniques can help identify promising regions of the chemical space, prioritize compounds with desirable properties, and guide the selection of diverse, representative subsets for experimental validation. In this article, we will explore how unsupervised techniques can be applied in the earlier stages of drug discovery to tackle the combinatorial problem and accelerate the identification of potential drug candidates.

Early-Stage Screening: The Combinatorial Problem

To illustrate the combinatorial problem in early-stage drug discovery, let’s consider an example. Suppose we have a chemical library containing just four types of building blocks: carbon (C), nitrogen (N), oxygen (O), and hydrogen (H). If we limit the size of the molecules to a maximum of 10 atoms, the number of possible unique structures is already in the millions. Now, if we expand the library to include more atom types and allow for larger molecules, the number of possible combinations explodes exponentially. This is the essence of the combinatorial problem in drug discovery.
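
To make this growth concrete, here is a toy Python sketch that counts linear atom sequences over that four-letter alphabet. It deliberately ignores valence rules, bonding, and graph isomorphism, so the figures convey the growth rate rather than counts of valid molecules:

```python
# Toy illustration of combinatorial growth: count linear atom sequences
# over a four-letter alphabet. Real molecules are graphs constrained by
# valence, so these numbers are a growth-rate intuition, not counts of
# chemically valid structures.
ATOMS = ["C", "N", "O", "H"]

def sequence_count(max_length: int) -> int:
    """Number of atom sequences of length 1..max_length."""
    return sum(len(ATOMS) ** n for n in range(1, max_length + 1))

for length in (5, 10, 15):
    print(f"up to {length:>2} atoms: {sequence_count(length):,} sequences")
# up to  5 atoms: 1,364 sequences
# up to 10 atoms: 1,398,100 sequences
# up to 15 atoms: 1,431,655,764 sequences
```

Going from 10 to 15 atoms already multiplies the count roughly a thousandfold; adding more atom types steepens the curve further.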

High-throughput screening (HTS) is a widely used approach for identifying active compounds in large chemical libraries. In HTS, a large number of compounds are tested against a specific biological target using automated assays. The goal is to identify “hits” – compounds that show the desired activity against the target. However, even with advanced robotics and miniaturization, HTS can only screen a fraction of the available chemical space. This is where virtual high-throughput screening comes into play.

Virtual High-Throughput Screening (vHTS) is a computational method that screens large libraries of compounds in silico (via computer simulations) to identify candidates likely to bind a specific biological target with high affinity. It is typically employed before physical screening to select a subset of compounds from the larger chemical space, prioritizing those more likely to have desired properties such as drug-likeness, bioavailability, and potential activity against the target. Effectively, we trade some predictive accuracy for a faster, broader ‘funneling’ step that quickly narrows the field of candidates. This trade-off is strategically valuable: it accelerates the identification process, allows quicker iteration and refinement based on initial screening outcomes, and concentrates effort on the most promising candidates early in the process.
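
As a minimal sketch of this funneling idea, the snippet below applies a Lipinski rule-of-five drug-likeness filter using RDKit. The SMILES strings are illustrative placeholders, and a real vHTS pipeline would layer docking or pharmacophore scoring on top of cheap filters like this:

```python
# Minimal vHTS-style "funnel": keep only compounds that pass a Lipinski
# rule-of-five drug-likeness filter. Requires RDKit (pip install rdkit).
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES -> reject
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

# Illustrative library: ethanol, benzene, aspirin
library = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
shortlist = [s for s in library if passes_rule_of_five(s)]
print(f"{len(shortlist)}/{len(library)} compounds pass the filter")
```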

Unsupervised Machine Learning: Narrowing the Space

We need a way to reduce this enormous complexity to a manageable set of compounds that can go into HTS. Enter unsupervised machine learning. Unlike supervised algorithms, which require a labeled dataset, these techniques find relationships and patterns within the data on their own. Several unsupervised learning algorithms are commonly used in drug discovery:

  • t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are dimensionality reduction techniques that can visualize high-dimensional data in a lower-dimensional space, typically 2D or 3D. These methods preserve the local structure of the data, so similar compounds end up clustered together. In vHTS, t-SNE and UMAP can be used to visualize the chemical space and identify regions of interest or diversity; compounds that cluster together in the plot are likely to have similar properties (see the t-SNE sketch after this list).
  • Variational Autoencoders (VAEs) are a type of generative model that learns a compressed representation (latent space) of the input data. VAEs consist of an encoder network that maps the input data to the latent space and a decoder network that reconstructs the original data from the latent representation. In vHTS, VAEs can be trained on a large dataset of compounds to learn a continuous latent space. By exploring the latent space, researchers can identify regions that correspond to compounds with desirable properties and select them for further testing (a minimal VAE sketch follows the list).
  • Self-Organizing Maps (SOMs) are a type of neural network that learns a low-dimensional, discretized representation of the input space. SOMs consist of a grid of neurons that adapt to the input data through a competitive learning process. In vHTS, SOMs can be used to cluster compounds based on their structural or physicochemical properties. Each neuron in the SOM represents a group of similar compounds, and neighboring neurons represent related groups. By visualizing the SOM, researchers can identify clusters of compounds with distinct properties and select representative compounds from each cluster for HTS (see the SOM sketch below).
  • Generative Adversarial Networks (GANs) and Deep Belief Networks (DBNs) are deep learning architectures that can generate new data points similar to the training data. GANs consist of two neural networks, a generator and a discriminator, that compete against each other. The generator aims to create realistic data points, while the discriminator tries to distinguish between real and generated data. In vHTS, GANs can be trained on a dataset of known active compounds to generate novel compounds with similar properties. These generated compounds can then be prioritized for synthesis and testing (a skeletal GAN training loop is sketched below). DBNs, on the other hand, are composed of multiple layers of restricted Boltzmann machines (RBMs) that learn a hierarchical representation of the input data. DBNs can be used to extract meaningful features from compound data and generate new compounds by sampling from the learned probability distribution.
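
To ground these methods, the sketches below show each one in miniature. First, dimensionality reduction: this snippet embeds Morgan fingerprints with scikit-learn’s t-SNE (RDKit assumed; the SMILES list and perplexity are toy choices, and UMAP from the umap-learn package is a drop-in alternative):

```python
# Sketch: embed Morgan fingerprints with t-SNE to visualize chemical space.
# Assumes RDKit and scikit-learn are installed; the molecules are toy picks.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

smiles_list = ["CCO", "CCN", "c1ccccc1", "c1ccncc1", "CC(=O)O", "CCC(=O)O"]

def fingerprint(smiles: str, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fingerprint(s) for s in smiles_list], dtype=float)
# Perplexity must stay below the sample count; it is tiny here only
# because the toy set is tiny.
coords = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(X)
for smi, (x, y) in zip(smiles_list, coords):
    print(f"{smi:>22}: ({x:8.2f}, {y:8.2f})")
```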
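
Next, a minimal variational autoencoder over fingerprint bit-vectors, written in PyTorch. The layer sizes, latent dimension, and random stand-in data are illustrative assumptions, not a recommended architecture:

```python
# Minimal VAE sketch over fingerprint bit-vectors (PyTorch assumed).
import torch
import torch.nn as nn

class FingerprintVAE(nn.Module):
    def __init__(self, n_bits: int = 2048, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bits, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_bits), nn.Sigmoid(),  # per-bit probabilities
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term + KL divergence to the standard-normal prior
    bce = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld

model = FingerprintVAE()
x = torch.randint(0, 2, (8, 2048)).float()  # stand-in for real fingerprints
recon, mu, logvar = model(x)
print(vae_loss(recon, x, mu, logvar).item())
```

After training, nearby points in the 32-dimensional latent space decode to similar fingerprints, which is what makes latent-space exploration useful for candidate selection.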
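
For self-organizing maps, the third-party MiniSom package (pip install minisom) offers a compact implementation; the grid size and training length here are arbitrary illustrative choices:

```python
# Sketch: cluster fingerprint-like vectors on a 5x5 self-organizing map.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 128)).astype(float)  # stand-in fingerprints

som = MiniSom(x=5, y=5, input_len=128, sigma=1.0, learning_rate=0.5,
              random_seed=0)
som.train_random(X, num_iteration=500)

# Map each compound to its best-matching neuron; each occupied cell is a
# cluster from which a representative could be sent to HTS.
cells = {}
for i, v in enumerate(X):
    cells.setdefault(som.winner(v), []).append(i)
print({cell: len(members) for cell, members in sorted(cells.items())})
```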
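
Finally, a skeletal GAN training loop in PyTorch over fingerprint-like vectors. Real molecular GANs typically operate on SMILES strings or molecular graphs, so treat this purely as a shape-of-the-algorithm sketch:

```python
# Skeletal GAN: a generator proposes pseudo-fingerprints, a discriminator
# scores them; sizes and the stand-in training data are illustrative.
import torch
import torch.nn as nn

N_BITS, NOISE_DIM, BATCH = 512, 64, 32

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, N_BITS), nn.Sigmoid(),  # pseudo-fingerprint in [0, 1]
)
discriminator = nn.Sequential(
    nn.Linear(N_BITS, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),       # estimated P(real)
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randint(0, 2, (BATCH, N_BITS)).float()  # stand-in for actives
for step in range(100):
    # Discriminator update: push real -> 1, generated -> 0
    fake = generator(torch.randn(BATCH, NOISE_DIM))
    d_loss = (bce(discriminator(real), torch.ones(BATCH, 1))
              + bce(discriminator(fake.detach()), torch.zeros(BATCH, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator output 1
    fake = generator(torch.randn(BATCH, NOISE_DIM))
    g_loss = bce(discriminator(fake), torch.ones(BATCH, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(f"final d_loss={d_loss.item():.3f}, g_loss={g_loss.item():.3f}")
```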


These are just a few examples of unsupervised machine learning techniques used in drug discovery. Other methods, such as clustering algorithms (e.g., k-means, hierarchical clustering) and principal component analysis (PCA), are also commonly employed to explore and navigate the chemical space. By leveraging these techniques, researchers can effectively narrow down the vast combinatorial space to a manageable set of compounds, which are often organized into chemical libraries that serve as a stepping stone for further exploration and testing.
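
As a sketch of that narrowing step, the snippet below clusters a descriptor matrix with scikit-learn’s k-means and keeps the compound nearest each centroid as a cluster representative; the random matrix is a stand-in for real fingerprint or descriptor data:

```python
# Sketch: reduce 1,000 compounds to 20 cluster representatives via k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
X = rng.random((1000, 64))  # 1,000 compounds x 64 descriptors (stand-in)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
# Index of the compound nearest each centroid: one representative per cluster
representatives, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
print(sorted(representatives.tolist()))  # 20 compounds to carry forward
```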

We help Biotech Startups with their AI

Neuralgap helps Biotech Startups bring experimental AI into reality. If you have an idea but need a team to rapidly iterate or build out the algorithm, we are here.
