Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model

Generating potent compounds for evolving analogue series (AS) is a key challenge in medicinal chemistry. The versatility of chemical language models (CLMs) makes it possible to formulate this challenge as an off-the-beaten-path prediction task. In this work, we have devised a coding and tokenization scheme for evolving AS with multiple substitution sites (multi-site AS) and implemented a bidirectional transformer to predict new potent analogues for such series. Scientific foundations of this approach are discussed and, as a benchmark, the transformer model is compared to a recurrent neural network (RNN) for the prediction of analogues of AS with single substitution sites. Furthermore, the transformer is shown to successfully predict potent analogues with varying R-group combinations for multi-site AS having activity against many different targets. Prediction of R-group combinations for extending AS with potent compounds represents a novel approach for compound optimization.

Citation information

Chen, Hengwei; Yoshimori, Atsushi; Bajorath, Jürgen: Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model, RSC Medicinal Chemistry, 2024, vol. 15, no. 7, pp. 2527-2537, July, RSC, https://pubs.rsc.org/en/content/articlelanding/2024/md/d4md00423j, Chen.etal.2024b