Transformers in Biotech (I) - How it Started

Transformer models have emerged as a powerful tool across many domains of AI, particularly natural language processing and, more recently, bioinformatics. Transformers were originally introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., and rely on a mechanism called “self-attention” to process input data, allowing them to handle sequential data, capture long-range dependencies, and manage large datasets efficiently. This naturally makes them a great fit for bioinformatics applications, especially sequence analysis and structure prediction.

In essence, transformers work by paying attention to different parts of the input data simultaneously, which is a departure from the traditional recurrent neural networks (RNNs) that process data sequentially. This parallel processing capability not only speeds up training but also enables transformers to achieve superior performance on many complex tasks. Their architecture consists of an encoder and a decoder, both made up of multiple layers of self-attention and feedforward neural networks.
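The self-attention step described above can be sketched in a few lines of NumPy. This is a minimal single-head version with toy dimensions and random weights, not a faithful reproduction of any particular model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # every position scores every other position
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # weighted mix of values, plus the attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                       # 6 tokens with 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)                   # (6, 8) (6, 6)
```

Note that every token is updated from all other tokens in one matrix multiplication, which is exactly the parallelism that lets transformers outrun RNNs in training.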


In this article series, we will examine transformer variants across the spectrum of bioinformatics, in three parts:

  1. Structural Biology Transformers for analyzing amino acid sequences and predicting 3D protein structures
    • Models to look at include AlphaFold and OmegaFold
  2. NLP Transformers for biomedical text mining, disease detection, risk factor extraction, and clinical decision support
    • Models to look at include BioBERT, InferBERT, and BioGPT
  3. Medical Imaging Vision Transformers for medical image segmentation, predicting tissue regions and structures, and enhancing image resolution for better diagnosis
    • Models to look at include HisToGene


But first, let’s have a brief run-through of how we got here.

First Uses of Transformers in Bioinformatics

Transformers were proposed as a potential solution to long-sequence problems in bioinformatics soon after their introduction, but it was in 2019 that research articles applying them began to appear. Papers such as “Predicting multiple cancer phenotypes based on somatic genomic alterations via the genomic impact transformer” by Yifeng Tao, Chunhui Cai, William W. Cohen, and Xinghua Lu, and “Novel transformer networks for improved sequence labeling in genomics” by Jim Clauwaert and Willem Waegeman, both revised and published towards late 2019, are classic examples. These papers discussed the benefits of applying multi-headed attention to long sequences, introduced transformer architectures for whole-genome sequence labeling tasks, and reported state-of-the-art performance with unbiased benchmarking. This research interest was further fueled by the promising results of OpenAI’s GPT-2, released in stages over the course of 2019.

One of the first successful implementations of transformers was in early 2021 with the introduction of DNABERT. This model adapted the BERT architecture to understand and process DNA sequences by pre-training on the human reference genome using k-mer tokenization. DNABERT demonstrated significant improvements in tasks such as DNA sequence classification and motif prediction, paving the way for more sophisticated analyses in genomics using transformer models.
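The k-mer tokenization idea is simple enough to sketch directly. The snippet below splits a sequence into overlapping k-mers with stride 1, the scheme DNABERT uses, so that adjacent tokens share k−1 bases of context:

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA sequence into k-mer tokens.

    With stride=1 the k-mers overlap (DNABERT-style); a stride of k
    would instead yield non-overlapping tokens.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCGTAC", k=6)
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

Each k-mer then plays the role a word or subword plays in NLP: it is mapped to an ID in a 4^k-entry vocabulary before being embedded.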

Meanwhile, research rapidly became more nuanced and insightful. For example, in March 2021 the paper “TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding” by Yue Cao and Yang Shen was published in Bioinformatics; it highlights the use of transformers for protein function annotation, leveraging a joint sequence-label embedding to improve functional insights from high-throughput sequence data. Naturally, explainability started becoming a prominent requirement. Studies such as “Explainability in transformer models for functional genomics” by Jim Clauwaert, Gerben Menschaert, and Willem Waegeman, published in Briefings in Bioinformatics, showed the potential for unveiling the decision-making process of trained models. That work also emphasized the need for new strategies to benefit from automated learning methods by revealing the decision-making process, specifically focusing on the automatic selection of relevant nucleotide motifs from DNA sequences.

Transformer Architecture in Genomics: A Closer Look

To better understand why transformers are particularly well-suited for genomic data analysis, let’s examine a specific implementation in detail. We’ll narrow our focus to a paper titled “Transformer-based genomic prediction method fused with knowledge-guided module”. This case study will allow us to highlight the key features that make transformer architectures so powerful in the field of genomics. By breaking down the components of this transformer-based genomic prediction method, we can illustrate how these models adapt to the unique challenges of biological sequences.

Implementing Transformers for Genomic Data: A Case Study in Sequence Analysis

The implementation of an “LLM-like” transformer model – but for genomic data – leverages the transformer architecture’s ability to handle both sequential and non-sequential data. Key to this process is tokenization: sequential data, like DNA sequences, are converted into k-mers, while non-sequential data, such as single-cell RNA-seq, use gene IDs or expression values as tokens. Crucially, positional encoding is applied to these tokens to preserve the sequential information in genomic data. This is especially important for understanding the relative positions of genetic elements, which can significantly impact function.

The transformer models employ masked language modeling (MLM) during pre-training, where tokens are masked and the model predicts them using the context of the surrounding tokens. This enables the model to capture complex patterns and relationships within genomic data.

The self-attention mechanism facilitates the model’s ability to attend to long-range dependencies, crucial for understanding genomic sequences that can span thousands of base pairs. By using multi-head attention, the models can jointly focus on different subspaces of the input, enhancing their predictive power. The final layers, including add-and-norm layers and fully-connected layers, refine these representations, enabling the model to make accurate predictions on functional regions, disease-causing Single Nucleotide Polymorphisms (SNPs), and gene expression levels.
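The MLM masking step can be sketched as follows. The 15% mask rate and the `[MASK]` token name follow BERT conventions; real pipelines differ in detail (e.g. sometimes substituting a random token instead of the mask):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Randomly hide a fraction of tokens for masked language modeling.

    Returns the corrupted sequence plus a position -> original-token map;
    during pre-training the model is scored on recovering those originals.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok          # label the model must predict back
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

kmers = ["ATGCGT", "TGCGTA", "GCGTAC", "CGTACG", "GTACGA"]
masked, targets = mask_tokens(kmers, p=0.3)
```

Because each masked k-mer must be inferred from flanking k-mers on both sides, the objective pushes the model to learn contextual dependencies across the sequence rather than memorizing individual tokens.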

Explainability in Genomic Transformers: Examples from Recent Research

Explainability is crucial in bioinformatics, as researchers need to understand the decision-making process that leads to specific outputs. In a recent application of transformers to genomic data, researchers employed two key methods to enhance model interpretability:
  • Layer-wise Relevance Propagation (LRP): This technique decomposes the model’s prediction, assigning importance scores to individual input features. It works by redistributing the prediction score backward through the network layers. In the context of genomic data, LRP can highlight which parts of a DNA sequence or which genes contribute most significantly to a particular prediction, such as disease risk or gene function.
  • Heatmap-based Shapley Additive explanations (HSPA): HSPA uses Shapley values to create heat maps that visually represent each token’s contribution to the model’s prediction. In genomic applications, this could mean visualizing the importance of specific nucleotides, genes, or genomic regions. By calculating the Shapley value for each feature, HSPA provides a comprehensive view of how different genomic elements interact and contribute to the overall prediction.
These explainability methods, while not exclusive to transformer models, are particularly valuable in genomic applications. They allow researchers to identify influential genomic features, ensuring that the model is not only accurate but also transparent and interpretable. This approach enhances the utility of transformer models in genomics, enabling researchers to trust and verify the model’s outputs based on a clear understanding of feature importance. It’s worth noting that while these methods were effective in this particular study, the field of explainable AI in genomics is rapidly evolving. Other techniques, such as attention visualization or feature attribution methods, may also be applicable depending on the specific genomic task and data type.
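To make the Shapley-value idea concrete, here is an exact brute-force computation for a tiny model. The `effects` dictionary is purely hypothetical, standing in for per-feature contributions such as per-SNP effects; practical tools approximate Shapley values rather than enumerating every coalition, which is only feasible for a handful of features:

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n):
    """Exact Shapley values over n features by enumerating all coalitions.

    value_fn maps a frozenset of feature indices to a model output; each
    feature's value is its weighted average marginal contribution.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value_fn(frozenset(S) | {i}) - value_fn(frozenset(S)))
    return phi

# Hypothetical toy "model": the prediction is the sum of present features' effects.
effects = {0: 0.5, 1: -0.2, 2: 0.8}
v = lambda S: sum(effects[j] for j in S)
phi = shapley_values(v, 3)
print(phi)  # for an additive model, each Shapley value equals that feature's effect
```

Plotting these per-token values along a sequence is what produces the heat maps described above: each nucleotide, gene, or region is colored by its contribution to the prediction.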

Challenges in Applying Transformers to Biological Sequences

One of the primary challenges in applying a large transformer to genomic data is the significant difference in input dimensions compared to traditional language models. Genomic data often consists of extensive sequences, such as DNA, which can span tens of thousands of base pairs. This contrasts with the relatively shorter and more contextually rich text data used in NLP. For instance, while NLP tasks might involve tokenizing sentences into words or subwords, genomic sequences require tokenization into k-mers, where ‘k’ represents the length of nucleotide sequences. This results in much longer input sequences, necessitating models that can handle high-dimensional data efficiently.

Additionally, non-sequential genomic data, such as single-cell RNA sequencing, introduces further complexity. Unlike the structured nature of text, these datasets involve various features that must be appropriately tokenized and represented. For example, gene expression data can be tokenized using gene IDs or expression values, requiring sophisticated preprocessing steps to ensure meaningful input to the model.

The key intuition in employing transformers for genomic sequences is their capacity to attend to long-range interactions within DNA sequences, which can span tens of thousands of base pairs. However, transformers still experience computational memory constraints. Direct base-to-base attention over very long-range genomic sequences (such as millions of base pairs) has not yet been achieved using transformers. Instead, attention is often applied to features extracted by dilated convolutions before transformer modules, thereby extending the range of self-attention calculations, at the cost of losing pairwise scores of relevance across a whole sequence.

In our next article we will be going in depth into Sequencing Transformers. Stay tuned!
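A back-of-the-envelope calculation shows why direct base-to-base attention breaks down at genomic scale: the attention score matrix grows quadratically with sequence length. The 128 bp-per-token bin size below is the figure used by Enformer-style models and is given only for illustration:

```python
def attn_memory_gib(seq_len, bytes_per_score=4):
    """Memory for one full self-attention score matrix (seq_len x seq_len floats)."""
    return seq_len ** 2 * bytes_per_score / 2 ** 30

# Per-base attention over long genomic windows quickly becomes infeasible:
for length in (1_000, 100_000, 1_000_000):
    print(f"{length:>9} bp: {attn_memory_gib(length):10.4f} GiB per head")

# Downsampling first (e.g. pooling/dilated convolutions compressing ~128 bp
# into one token) shrinks the score matrix by a factor of 128**2:
print(f"1 Mb at 128 bp/token: {attn_memory_gib(1_000_000 // 128):.4f} GiB per head")
```

A single head over 1 Mb of raw bases would need thousands of GiB for its score matrix alone, whereas the same window at 128 bp per token fits comfortably in memory, which is exactly the trade-off the convolution-then-attention designs accept.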

We help Biotech Startups with their AI

Neuralgap helps Biotech Startups bring experimental AI into reality. If you have an idea - but need a team to rapidly iterate or to build out the algorithm - we are here.
