Structure-Based Screening and ML Scoring in Drug Discovery

Computational methods have become indispensable tools in modern drug discovery. The goal of Structure-Based Virtual Screening (SBVS) and Machine Learning (ML)-based scoring is to accelerate the discovery process by efficiently identifying promising lead compounds that bind strongly and selectively to a target protein, reducing the time and cost of experimental screening and validation. The two approaches are complementary: SBVS enables the rapid exploration of vast chemical spaces by virtually “docking” large libraries of compounds into the binding site of a target protein, while ML scoring refines and re-scores the docked poses to identify the most promising candidates. In this article, we explore the principles behind these techniques, their advantages, and how they are being applied to the challenge of finding new drug candidates in an increasingly complex and data-rich landscape.

Structure-Based Virtual Screening: Navigating the Vast Chemical Space

Structure-Based Virtual Screening (SBVS) is a powerful computational approach that allows researchers to navigate the vast chemical space in search of potential drug candidates. By leveraging the 3D structure of a target protein, SBVS enables the virtual “docking” of large libraries of compounds into the protein’s binding site to evaluate their potential to form favorable interactions. This process involves assessing the structural complementarity between the ligand and the protein, as well as estimating the strength of their intermolecular interactions.

A typical SBVS workflow begins with the preparation of the target protein structure, often obtained through experimental techniques such as X-ray crystallography or NMR spectroscopy. The protein structure is then optimized for docking by adding hydrogen atoms, removing water molecules, and defining the binding site region. Next, a large library of small molecules, known as ligands, is prepared by generating 3D conformations and assigning appropriate protonation states and partial charges. In recent years, the size of these libraries has exploded, with leading commercial sources like Enamine REAL, WuXi GalaXi, and Otava CHEMriya now offering a combined total of over 50 billion compounds, a staggering 2000-fold increase in just a decade.

The core of SBVS lies in the docking process, where each ligand from the library is systematically placed into the protein’s binding site, and its orientation and conformation are optimized to achieve the best fit. Docking algorithms employ various search strategies, such as exhaustive search, genetic algorithms, or Monte Carlo methods, to explore the conformational space of the ligand within the binding site. The quality of the ligand pose is evaluated using a scoring function that estimates the binding affinity between the ligand and the protein. Scoring functions can be based on physical principles, such as force fields or empirical energy terms, or they can be knowledge-based, derived from statistical analysis of known protein-ligand complexes.
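To make the search-and-score loop concrete, here is a minimal sketch of a Metropolis Monte Carlo pose search. It is a toy under loud assumptions, not a real docking engine: the ligand is reduced to a single point, `POCKET_CENTER` is a hypothetical binding-site centre, and `toy_score` is an invented stand-in for a scoring function (lower is better).

```python
import math
import random

# Hypothetical pocket centre; a real docking program would work from the prepared protein structure.
POCKET_CENTER = (1.0, -2.0, 0.5)

def toy_score(pose):
    """Invented stand-in for a scoring function: squared distance to the pocket centre (lower is better)."""
    return sum((p - c) ** 2 for p, c in zip(pose, POCKET_CENTER))

def monte_carlo_search(start, steps=5000, step_size=0.3, temperature=0.5, seed=0):
    """Metropolis Monte Carlo: perturb the pose at random, always accept improvements,
    and accept worse poses with probability exp(-delta / temperature)."""
    rng = random.Random(seed)
    current, current_score = start, toy_score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = tuple(x + rng.uniform(-step_size, step_size) for x in current)
        candidate_score = toy_score(candidate)
        delta = candidate_score - current_score
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

start_pose = (5.0, 5.0, 5.0)
best_pose, best_score = monte_carlo_search(start_pose)
print(best_pose, best_score)
```

Real docking programs search over translations, rotations, and torsion angles rather than a single point, but the accept/reject logic follows the same pattern.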

However, the accuracy of these scoring functions remains a significant challenge, with physics-based methods often being too computationally expensive for large-scale screening and empirical methods struggling to capture the full complexity of protein-ligand interactions. Despite these limitations, SBVS has proven to be a valuable tool in drug discovery, enabling the rapid screening of millions of compounds, filtering out unlikely candidates, and prioritizing those that are most likely to bind to the target protein. This approach significantly reduces the number of compounds that need to be synthesized and tested experimentally, saving time and resources in the early stages of drug discovery. Recent studies have demonstrated hit rates as high as 33% using advanced techniques like V-SYNTHES and 39% with Chemical Space Docking, compared to hit rates of around 15% for standard virtual screening approaches, showcasing the potential of SBVS in identifying promising drug candidates.
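The practical meaning of these hit rates is easier to see with a little arithmetic. The sketch below uses only the rates quoted above; the target of 10 expected hits is a hypothetical figure chosen for illustration, and the function names are our own.

```python
import math

# Hit rates as quoted in the text above.
REPORTED_HIT_RATES = {
    "standard virtual screening": 0.15,
    "V-SYNTHES": 0.33,
    "Chemical Space Docking": 0.39,
}

def fold_improvement(hit_rate, baseline):
    """How many times higher a hit rate is than the baseline."""
    return hit_rate / baseline

def compounds_needed(expected_hits, hit_rate):
    """Number of compounds to test so that the expected number of hits reaches expected_hits."""
    return math.ceil(expected_hits / hit_rate)

baseline = REPORTED_HIT_RATES["standard virtual screening"]
for method, rate in REPORTED_HIT_RATES.items():
    fold = fold_improvement(rate, baseline)
    needed = compounds_needed(10, rate)  # 10 expected hits is an illustrative target
    print(f"{method}: {fold:.1f}x baseline, about {needed} compounds for 10 expected hits")
```

Under these assumptions, moving from a 15% to a 39% hit rate cuts the number of compounds to synthesize and assay from 67 to 26 for the same expected number of hits.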

ML Scoring Post-SBVS: Refining the Chemical Space for Lead Discovery

Once SBVS has been completed, ML-based scoring is the next phase of the drug discovery pipeline. After the initial docking process in SBVS, where large libraries of compounds are screened and their poses optimized in the protein’s binding site, ML scoring can be used to refine and re-score the docked poses. This approach helps to filter out false positives that may have been scored favorably by traditional scoring functions but are less likely to bind strongly to the target in reality.
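The re-scoring step can be sketched as a simple re-ranking pass. Everything here is hypothetical: `DockedPose`, the two interaction descriptors, and `toy_model` (with invented weights) stand in for whatever features and trained model a real pipeline would use.

```python
from dataclasses import dataclass

@dataclass
class DockedPose:
    ligand_id: str
    docking_score: float  # from the docking program; more negative = better (assumed convention)
    features: tuple       # hypothetical descriptors: (hydrogen bonds, buried fraction)

def rescore(poses, ml_model, top_k):
    """Re-rank docked poses by ML-predicted affinity (higher = stronger) and keep the best top_k."""
    ranked = sorted(poses, key=lambda pose: ml_model(pose.features), reverse=True)
    return ranked[:top_k]

def toy_model(features):
    """Placeholder for a trained model; the weights are invented for illustration."""
    h_bonds, buried_fraction = features
    return 0.5 * h_bonds + 4.0 * buried_fraction

poses = [
    DockedPose("lig_A", -9.8, (1, 0.40)),
    DockedPose("lig_B", -8.1, (4, 0.85)),
    DockedPose("lig_C", -7.5, (3, 0.70)),
]
shortlist = rescore(poses, toy_model, top_k=2)
print([pose.ligand_id for pose in shortlist])  # ['lig_B', 'lig_C']
```

In this made-up example, `lig_A` has the best docking score but is dropped after re-scoring, which is the kind of false positive the re-scoring step is meant to catch.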

ML-based scoring functions have shown significant performance improvements over traditional methods. Studies have demonstrated that they can achieve moderately strong correlations between predicted and experimental binding affinities, with Pearson correlation coefficients (Rp) of about 0.55 and 0.60 for experimental and computer-generated databases, respectively. This is a substantial improvement over classic scoring functions, which in some cases achieve Rp values of only around 0.2. ML-based scoring functions have also been reported to run several orders of magnitude faster than traditional scoring functions. This combination of improved accuracy and speed makes ML scoring a powerful tool for refining SBVS results and identifying the most promising lead compounds.
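For readers unfamiliar with the metric, Rp is the ordinary Pearson correlation between predicted and measured affinities. A minimal sketch follows; the affinity values are made up for illustration and do not come from any study.

```python
import math

def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences of numbers."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    covariance = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(var_p * var_m)

# Made-up affinities (e.g. pKd units) for five hypothetical complexes.
predicted = [6.1, 7.4, 5.2, 8.0, 6.8]
measured = [5.9, 7.9, 5.5, 7.2, 6.0]
print(round(pearson_r(predicted, measured), 2))
```

An Rp of 1.0 means predictions track measurements perfectly in a linear sense, while a value near 0 means no linear relationship, which puts the gap between roughly 0.2 and 0.55 to 0.60 in context.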

Different types of ML scoring mechanisms have been explored, each with its own strengths and limitations. One crucial distinction is between horizontal and vertical testing protocols. Horizontal tests, where proteins may be present in both training and test sets (albeit with different ligands), tend to yield overly optimistic results due to similarities between training and test data. Vertical tests, where test proteins are entirely excluded from the training set, provide a more realistic assessment of performance in real-world scenarios. Recent studies have also explored the potential of per-target scoring functions, which are trained on computer-generated databases specific to individual proteins. These per-target models have shown promising results, achieving average Rp scores of 0.44-0.52, compared to 0.30 for universal scoring functions in vertical tests. This suggests that tailoring ML scoring functions to specific targets could significantly enhance their predictive power.
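The difference between the two protocols comes down to how the data is split. The sketch below assumes each complex is a `(protein_id, ligand_id)` pair; the IDs and function names are hypothetical.

```python
import random

def horizontal_split(complexes, test_fraction=0.25, seed=0):
    """Random split over complexes: the same protein may appear in both train and test."""
    rng = random.Random(seed)
    shuffled = list(complexes)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def vertical_split(complexes, test_proteins):
    """Hold out whole proteins: no test protein is ever seen during training."""
    train = [c for c in complexes if c[0] not in test_proteins]
    test = [c for c in complexes if c[0] in test_proteins]
    return train, test

# Hypothetical (protein_id, ligand_id) pairs.
complexes = [
    ("kinase_A", "lig1"), ("kinase_A", "lig2"), ("kinase_A", "lig3"),
    ("protease_B", "lig4"), ("protease_B", "lig5"),
    ("gpcr_C", "lig6"), ("gpcr_C", "lig7"), ("gpcr_C", "lig8"),
]

h_train, h_test = horizontal_split(complexes)
v_train, v_test = vertical_split(complexes, test_proteins={"gpcr_C"})

shared = {p for p, _ in v_train} & {p for p, _ in v_test}
print("proteins shared between vertical train and test sets:", shared)
```

In the vertical split the shared set is always empty by construction, whereas the horizontal split gives no such guarantee, which is why it tends to overstate performance on unseen targets.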

Despite these advances in ML-based scoring, several challenges remain. One significant issue is the need for explicit protein-ligand interaction data, which can be computationally expensive and time-consuming to generate. This requirement often creates a bottleneck in the scoring process, limiting the number of compounds that can be evaluated. Additionally, the performance of ML scoring functions can be highly dependent on the quality and diversity of the training data. Biases in experimental databases or limitations in computer-generated structures can lead to suboptimal performance when applied to novel targets or chemical spaces. Furthermore, the interpretability of complex ML models remains a challenge, making it difficult to gain insights into the underlying factors driving binding affinity predictions.

In conclusion, the combination of Structure-Based Virtual Screening (SBVS) and Machine Learning (ML) scoring represents a powerful approach to navigate the vastness of chemical space and accelerate the drug discovery process. By leveraging the 3D structure of target proteins and the ability of ML to refine and re-score docked poses, researchers can efficiently identify promising lead compounds while minimizing the time and resources required for experimental validation. As the accessibility of ultra-large compound libraries continues to grow and ML algorithms become increasingly sophisticated, this integrated approach holds immense potential for revolutionizing the way we discover new therapeutics and address the pressing health challenges of our time.

Building ML Scoring Functions

While machine learning (ML) algorithms have demonstrated potential in enhancing SBVS, their application faces significant hurdles, primarily due to imbalanced datasets and uneven representation across chemical space libraries. These algorithms are highly sensitive to class imbalance: underrepresented interactions can skew a model’s ability to predict binding affinities accurately. Despite notable advances, such as synthetic data generation and approaches like MetaBalance, predicting how ML models adjust their confidence in binding affinities remains a challenge. This difficulty is compounded by the chemical diversity and structural complexity inherent in vast compound libraries, where varying molecular architectures introduce different levels of interaction difficulty. Temporal factors, such as evolving binding site dynamics, add a further layer of complexity to making reliable predictions.

At Neuralgap, we are addressing these challenges by systematically investigating the sensitivity of various ML algorithms to data imbalances, including the effects of over- and undersampling on model accuracy. Our focus also extends to analyzing how binding affinity confidence decays as dataset size decreases, particularly in relation to diverse and complex chemical structures. By refining these models and better understanding their behavior under varying data conditions, we aim to improve the overall robustness and accuracy of ML-based scoring in SBVS.
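As a point of reference, the two generic resampling strategies mentioned above can be sketched in a few lines. This is plain random over- and undersampling, shown only to illustrate the idea; it is not a description of Neuralgap's method, and the binder/non-binder labels are hypothetical.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority examples until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority class."""
    if len(minority) > len(majority):
        raise ValueError("expected the majority class to be at least as large as the minority")
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)), list(minority)

# Hypothetical imbalanced screen: many non-binders, few binders.
non_binders = [f"decoy_{i}" for i in range(95)]
binders = [f"active_{i}" for i in range(5)]

over_major, over_minor = random_oversample(non_binders, binders)
under_major, under_minor = random_undersample(non_binders, binders)
print(len(over_major), len(over_minor))    # 95 95
print(len(under_major), len(under_minor))  # 5 5
```

Oversampling keeps all the data but repeats minority examples, which can encourage overfitting to them; undersampling avoids repetition but discards most of the majority class.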

Neuralgap Genesys helps accelerate virtual screening

Neuralgap Genesys is a multimodal, transformer-based model that uses our proprietary multi-unimodal architecture to leverage dynamic receptor-ligand data and predict bioactivity.