Quantitative Structure-Activity Relationship (QSAR) modeling, a cornerstone of modern drug discovery, has long relied on linear models to predict drug activity from molecular structure. While this approach has its benefits, neural networks, particularly graph-based architectures such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), are showing promise at capturing complex, non-linear relationships within molecular data. By learning directly from raw molecular structures, neural networks can reveal subtle patterns and interactions that simpler models miss. In this article, we look at how Graph Neural Networks (GNNs) represent a powerful tool in QSAR analysis.
A typical QSAR dataset is a matrix of n samples (compounds) by m variables (descriptors):

| Compound | X₁ | X₂ | … | Xₘ |
|---|---|---|---|---|
| 1 | X₁₁ | X₁₂ | … | X₁ₘ |
| 2 | X₂₁ | X₂₂ | … | X₂ₘ |
| … | … | … | … | … |
| n | Xₙ₁ | Xₙ₂ | … | Xₙₘ |
This layout makes clear how modern QSAR modeling relies on capturing the interrelationships between chemical structure and activity. Technically, these can be modeled quite easily as multivariate linear regressions, known as linear QSAR models. These models assume a straightforward linear correlation between molecular descriptors (such as physicochemical properties, topological indices, and electronic parameters) and biological activity, offering simplicity, interpretability, and computational efficiency. The basic form of these models is Y = a₀ + a₁X₁ + a₂X₂ + … + aₙXₙ, where Y represents biological activity and X₁, X₂, etc. are molecular descriptors. Because of this simplicity, they remain widely used in early-stage drug discovery and toxicity prediction. However, they face significant challenges in capturing the full complexity of structure-activity relationships, which tend to be nonlinear in nature.
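As an illustrative sketch, a linear QSAR model of this form can be fitted by ordinary least squares. The descriptor values and activities below are hypothetical numbers invented for the example, not real compound data:

```python
# Hypothetical training data: (X1, X2) descriptor pairs (e.g., logP and a
# scaled molecular weight) with made-up activity values Y (e.g., pIC50).
compounds = [
    ((1.2, 3.5), 5.0),
    ((2.5, 1.4), 6.1),
    ((0.8, 2.9), 4.6),
    ((3.1, 2.2), 6.9),
    ((1.9, 4.0), 5.6),
]

def fit_linear_qsar(data):
    """Fit Y = a0 + a1*X1 + a2*X2 by solving the normal equations."""
    # Design matrix with an intercept column.
    X = [[1.0, x1, x2] for (x1, x2), _ in data]
    y = [act for _, act in data]
    n = 3
    # Normal equations: (X^T X) a = X^T y
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        a[i] = (b[i] - sum(A[i][j] * a[j] for j in range(i + 1, n))) / A[i][i]
    return a

def predict(a, x1, x2):
    """Evaluate Y = a0 + a1*X1 + a2*X2 for a new compound."""
    return a[0] + a[1] * x1 + a[2] * x2

coef = fit_linear_qsar(compounds)
```

The same fit would normally be done with a library (e.g., scikit-learn's `LinearRegression`); the point here is just how little machinery the linear model needs, which is exactly why it is both cheap and limited.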
However, if you want to go deeper, you need models that account for more complexity. Enter GNNs. GNNs represent a flagship advancement in QSAR modeling because, unlike traditional neural networks that operate on fixed-size inputs, they are designed to process and learn from graph-structured data directly: atoms become nodes, bonds become edges, and the network propagates information along the molecular graph itself. To examine this, let's take a look at a typical GNN architecture:
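The core of that architecture is message passing. Below is a deliberately simplified, pure-Python sketch of one GCN-style layer: each atom (node) averages features over itself and its bonded neighbours, applies a weight and a ReLU nonlinearity, and a mean-pooling readout turns node features into a molecule-level value. The graph, feature values, and scalar weight are all hypothetical:

```python
# Toy molecular graph of 4 atoms as an adjacency list (hypothetical molecule:
# atom 1 is bonded to atoms 0, 2, and 3).
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
# One scalar feature per atom (a stand-in for a real atom-feature vector).
features = {0: 1.0, 1: 0.5, 2: -0.3, 3: 2.0}

def gcn_layer(adj, feats, weight=0.7, bias=0.1):
    """One graph-convolution step: each node averages itself and its
    neighbours, applies a (scalar) weight and bias, then a ReLU."""
    out = {}
    for node in feats:
        neigh = adj[node] + [node]                 # add a self-loop
        agg = sum(feats[n] for n in neigh) / len(neigh)
        out[node] = max(0.0, weight * agg + bias)  # ReLU nonlinearity
    return out

h1 = gcn_layer(adjacency, features)
# Graph-level readout (mean pooling) gives a molecule embedding, which a
# final layer would map to the predicted activity.
readout = sum(h1.values()) / len(h1)
```

In a real model the scalar weight is a learned weight matrix, features are vectors, and several such layers are stacked so information propagates beyond immediate neighbours; libraries like PyTorch Geometric package exactly this pattern.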
Interesting fact: GNNs are increasingly used wherever linkages and relationships need to be explicitly mapped. One of their earliest applications was social networking. In social networks, nodes can represent individuals and edges can represent relationships or interactions. GNNs can effectively model the complex dependencies and patterns within these networks, such as predicting friendships, detecting communities, and even identifying influential users!
Let's examine what GNNs bring to drug discovery directly, in a few bullet points:

- They learn directly from raw molecular graphs, so no fixed-size, hand-engineered descriptor vector is required.
- They capture the complex, non-linear structure-activity relationships that linear QSAR models tend to miss.
- They handle molecules of varying size and connectivity natively, since the graph itself is the input.
- Attention-based variants (GATs) can indicate which atoms or substructures contribute most to a prediction, aiding interpretability.

These attributes are cementing GNNs as the go-to model in modern drug discovery.
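To make the attention idea behind GATs concrete, here is a toy sketch: a node aggregates its neighbours' features weighted by softmax-normalised attention scores, rather than uniformly as in the plain convolutional case. The atom labels, feature values, and attention logits below are invented for illustration (in a real GAT the logits come from a learned scoring function over node-feature pairs):

```python
import math

# Hypothetical neighbourhood of one atom: per-neighbour features and
# raw attention logits (made-up numbers standing in for learned scores).
neighbor_features = {"C1": 0.9, "N2": 1.4, "O3": 0.2}
attention_logits = {"C1": 0.5, "N2": 2.0, "O3": -1.0}

def softmax(scores):
    """Normalise raw scores into attention weights that sum to 1."""
    exps = {k: math.exp(v) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

alphas = softmax(attention_logits)
# Attention-weighted aggregation: the node's updated feature is a convex
# combination of its neighbours' features.
h_new = sum(alphas[k] * neighbor_features[k] for k in alphas)
```

Because the weights sum to one, the update is a weighted average, and inspecting the largest weights tells you which neighbouring atoms the model considered most relevant.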
Despite advances in computational techniques, QSAR modeling and drug discovery continue to face significant challenges. Beyond the obvious computational cost of running large-scale QSAR models, many of these challenges cannot be solved directly with AI or machine learning: they revolve primarily around data quality and reproducibility.
The unreliability of published data on potential drug targets has become a critical problem in QSAR modeling and drug discovery. This issue is part of a broader trend of increasing retractions and misconduct in biomedical research. A comprehensive study analyzing 2,069 biomedical papers retracted between 2000 and 2021 from European institutions reveals alarming statistics: the overall retraction rate rose from 10.7 per 100,000 publications in 2000 to 44.8 per 100,000 in 2020, and research misconduct accounted for a staggering 67% of these retractions, the most common reasons being duplication followed by unspecified research misconduct. The resulting lack of reproducibility not only wastes resources but also misdirects future research efforts, potentially derailing drug discovery projects and delaying the development of new therapeutic agents. These problems are further exacerbated by the use of non-curated datasets, which introduce multiple errors into QSAR modeling: wrong structures, misprints in compound names, incorrectly calculated descriptors, and the inclusion of duplicates, mixtures, and salts.
Alleviating these problems will require much more than algorithms; it will require a comprehensive and systematic approach to data curation and validation. This approach should involve a multi-step workflow that progressively reduces the error rate while maintaining a substantial dataset size. Key steps in this process include rigorous chemical curation, duplicate analysis, assessment of intra- and inter-lab experimental variability, exclusion of unreliable data sources, and detection of activity cliffs. Furthermore, the development and validation of QSAR models should adhere to established guidelines, such as the OECD Principles. These principles emphasize the need for models to have a defined endpoint, an unambiguous algorithm, a clear domain of applicability, and appropriate measures of goodness-of-fit, robustness, and predictivity. Importantly, there should be a mechanistic interpretation of the model where possible, and the data used for modeling should be carefully curated. By implementing these stringent data curation protocols and adhering to established modeling principles, the field can significantly improve the reliability of QSAR models.
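A few of these curation steps can be sketched in code. The snippet below is a minimal illustration over a hypothetical list of (SMILES, activity) records; a real pipeline would use a cheminformatics toolkit such as RDKit for structure standardisation. It relies only on the fact that '.' separates components in SMILES notation, so multi-component entries (salts and mixtures) can be filtered out, and it flags structures whose replicate measurements disagree too much for manual review:

```python
# Hypothetical raw dataset of (SMILES, activity) records for illustration.
raw_records = [
    ("CCO", 5.2),
    ("CCO", 5.2),            # exact duplicate record (treated as a replicate)
    ("CC(=O)O.[Na+]", 6.0),  # salt/mixture ('.' separates components): dropped
    ("c1ccccc1", 4.8),
    ("CCO", 7.9),            # same structure, strongly conflicting activity
]

def curate(records, max_spread=1.0):
    """Drop multi-component entries, merge replicate measurements, and flag
    structures whose measurements disagree by more than `max_spread`."""
    by_structure = {}
    for smiles, activity in records:
        if "." in smiles:            # multi-component entry (salt or mixture)
            continue
        by_structure.setdefault(smiles, []).append(activity)
    curated, flagged = [], []
    for smiles, acts in by_structure.items():
        if max(acts) - min(acts) > max_spread:
            flagged.append(smiles)   # inconsistent replicates: manual review
        else:
            curated.append((smiles, sum(acts) / len(acts)))
    return curated, flagged

clean, review = curate(raw_records)
```

Note that string equality is a weak duplicate test: the same molecule can have many valid SMILES strings, which is why real workflows canonicalise structures first, and why steps like inter-lab variability assessment and activity-cliff detection need chemical context, not just bookkeeping.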
Neuralgap helps Biotech Startups bring experimental AI into reality. If you have an idea but need a team to rapidly iterate or to build out the algorithm, we are here.
©2023. Neuralgap.io