
Language models are now also being applied in the natural sciences. In chemistry, they are used, for instance, to predict new biologically active compounds. These chemical language models (CLMs) require extensive training. However, they do not necessarily acquire knowledge of biochemical relationships in the process. Instead, they draw conclusions based on similarities and statistical correlations – as shown by a recent study conducted by researchers at the Lamarr Institute for Machine Learning and Artificial Intelligence and the University of Bonn. The results have now been published in the journal Patterns.
Large language models are often astonishingly good at what they do, whether proving mathematical theorems, composing music, or crafting advertising slogans. But how do they arrive at their results? Do they actually understand what constitutes a symphony or a good joke? It is not easy to answer that question. “All language models are a black box,” emphasizes Prof. Dr. Jürgen Bajorath, who leads the research area Life Sciences and Health at the Lamarr Institute. “It’s difficult to look inside their heads, metaphorically speaking.”
Together with his doctoral student Jannik P. Roth, Bajorath, who is also a cheminformatics scientist at the b-it and the University of Bonn, attempted exactly that. Their work focuses on a special form of AI algorithm: transformer-based chemical language models. These models operate similarly to ChatGPT, Google Gemini, or Elon Musk’s Grok, which are trained on vast amounts of text data to learn how to generate coherent sentences. CLMs, by contrast, are usually trained on much smaller datasets. They acquire their knowledge from molecular representations and relationships, such as so-called SMILES strings: sequences of letters and symbols that represent molecules and their structures.
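For readers unfamiliar with the notation, a short illustration may help; it is not taken from the study. In a SMILES string, atoms and bonds are written as a single line of text, which cheminformatics toolkits such as the open-source RDKit can parse back into a molecular structure. A minimal Python sketch, using aspirin as an example molecule:

from rdkit import Chem  # open-source cheminformatics toolkit

# A SMILES string encodes a molecule as a line of letters and symbols;
# aspirin serves here as a simple, well-known example (not from the study).
smiles = "CC(=O)Oc1ccccc1C(=O)O"

mol = Chem.MolFromSmiles(smiles)   # parse the text into a molecule object
print(mol.GetNumAtoms())           # 13 heavy (non-hydrogen) atoms
print(Chem.MolToSmiles(mol))       # the toolkit's canonical form of the same string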

Systematic manipulation of training data
In pharmaceutical research, scientists often aim to identify substances that can inhibit specific enzymes or block receptors. CLMs can be used to predict active molecules based on the amino acid sequences of target proteins. “We used sequence-based molecular design as a test system to better understand how transformers arrive at their predictions,” explains Roth. “After the training phase, if you introduce a new enzyme to such a model, it may produce a compound that can inhibit it. But does that mean the AI has learned the biochemical principles behind such inhibition?”
CLMs are trained using pairs of amino acid sequences of target proteins and their respective known active compounds. To explore this question, the researchers systematically manipulated the training data. “For example, we initially trained the model with only specific families of enzymes and their inhibitors,” says Bajorath. “When we then used a new enzyme from the same family for testing, the algorithm indeed suggested a plausible inhibitor.” However, the results changed when the researchers used an enzyme from a different family – one performing a different function in the body. In that case, the CLM failed to predict any meaningful active compounds.
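The logic of this probe can be sketched in a few lines of Python. The enzyme families, protein sequences, and SMILES strings below are made-up placeholders rather than data from the study; the sketch only illustrates how training pairs can be restricted to one enzyme family and how test proteins can then be drawn from inside or outside that family.

# Illustrative sketch only: placeholder families, sequences, and inhibitors.
# Each training example pairs a target protein's amino acid sequence with
# the SMILES string of a known active compound.
pairs = [
    {"family": "kinase",   "sequence": "MKTAYIAKQRQISFVK", "inhibitor": "CC(=O)Nc1ccc(O)cc1"},
    {"family": "kinase",   "sequence": "MKSAYLAKQRQLSFVK", "inhibitor": "CC(C)Cc1ccc(cc1)C(C)C(=O)O"},
    {"family": "protease", "sequence": "MALWMRLLPLLALLAL", "inhibitor": "OC(=O)c1ccccc1O"},
]

# Restrict training to a single enzyme family ...
train = [p for p in pairs if p["family"] == "kinase"]
print(len(train), "training pairs from the held-in family")

# ... then query the trained model with a new enzyme from the same family
# (where the study saw plausible suggestions) and with one from a different
# family (where it did not).
test_same_family  = {"family": "kinase",   "sequence": "MKTAYIVKQRQISFIK"}
test_other_family = {"family": "protease", "sequence": "MALWIRLLPLLSLLAL"}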
Statistical rule of thumb instead of biochemical understanding
“This suggests that the model has not learned generally applicable chemical principles – that is, how enzyme inhibition typically works on a chemical level,” explains Bajorath. Instead, the model’s predictions are based purely on statistical correlations and data patterns. If a new enzyme resembles one it encountered during training, the model assumes that a similar inhibitor might also be effective. “Such a statistically driven rule of thumb is not necessarily a bad thing,” adds Bajorath. “It can help identify new applications for existing active substances.”
However, the models in this study showed no biochemical understanding even when assessing similarity. They classified enzymes (or receptors and other proteins) as similar if 50–60 percent of their amino acid sequences matched, and then suggested similar inhibitors. The researchers could randomize or scramble the remaining parts of the sequences without affecting the outcome. Yet often, only specific parts of an enzyme are crucial for its function: a single amino acid change in one of these key regions can render it inactive, while other regions are less important. “During training, the models did not learn to distinguish between functionally relevant and irrelevant sequence segments,” Bajorath points out.
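Both probes mentioned here, comparing sequences and scrambling the parts outside a presumed key region, can be illustrated with a short, self-contained Python sketch. The sequences, the key region, and the naive position-by-position identity measure are simplifying assumptions for illustration; they are not the procedure or the data used in the paper.

import random

def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical amino acids, compared position by position
    (a naive measure that assumes both sequences have the same length)."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

def scramble_outside(seq: str, key_region: range, seed: int = 0) -> str:
    """Shuffle all residues outside key_region while leaving that region intact."""
    rng = random.Random(seed)
    outside = [seq[i] for i in range(len(seq)) if i not in key_region]
    rng.shuffle(outside)
    it = iter(outside)
    return "".join(aa if i in key_region else next(it) for i, aa in enumerate(seq))

# Toy example with made-up 10-residue sequences (not data from the study):
a = "MKTAYIAKQR"
b = "MKTVYIGKHR"
print(sequence_identity(a, b))                       # 0.7, i.e. 7 of 10 positions match
print(scramble_outside(a, key_region=range(3, 6)))   # positions 3-5 preserved, rest shuffled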
Models merely repeat learned patterns
The study’s findings clearly demonstrate that, at least for this test system, transformer CLMs trained for sequence-based compound design lack deeper chemical understanding. In essence, they merely reproduce, with slight variations, what they have previously “heard” in similar contexts. “This does not mean they are unsuitable for drug research,” emphasizes Bajorath, who is also a member of the Transdisciplinary Research Area (TRA) Modelling at the University of Bonn. “It is quite possible that they suggest compounds that actually inhibit enzymes or block receptors. But this is certainly not because they understand chemistry – rather, because they identify similarities in text-based molecular representations and statistical correlations that remain hidden from us. This does not discredit their results, but they should not be overinterpreted.”
Publication
Jannik P. Roth and Jürgen Bajorath: Unraveling learning characteristics of transformer models for molecular design. Patterns (2025). DOI: https://doi.org/10.1016/j.patter.2025.101392; full text: https://www.cell.com/patterns/fulltext/S2666-3899(25)00240-5