Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model
Generating potent compounds for evolving analogue series (AS) is a key challenge in medicinal chemistry. The versatility of chemical language models (CLMs) makes it possible to formulate this challenge as an off-the-beaten-path prediction task. In this work, we have devised a coding and tokenization scheme for evolving AS with multiple substitution sites (multi-site AS) and implemented a bidirectional transformer to predict new potent analogues for such series. Scientific foundations of this approach are discussed and, as a benchmark, the transformer model is compared to a recurrent neural network (RNN) for the prediction of analogues of AS with single substitution sites. Furthermore, the transformer is shown to successfully predict potent analogues with varying R-group combinations for multi-site AS having activity against many different targets. Prediction of R-group combinations for extending AS with potent compounds represents a novel approach for compound optimization.
- Published in: RSC Medicinal Chemistry
- Type: Article
- Authors: Hengwei Chen, Atsushi Yoshimori, Jürgen Bajorath
- Year: 2024
- Source: https://pubs.rsc.org/en/content/articlelanding/2024/md/d4md00423j
Citation information
Chen, H.; Yoshimori, A.; Bajorath, J. Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model. RSC Medicinal Chemistry, 2024, 15(7), 2527-2537. https://pubs.rsc.org/en/content/articlelanding/2024/md/d4md00423j
@Article{Chen.etal.2024b,
author={Chen, Hengwei and Yoshimori, Atsushi and Bajorath, Jürgen},
title={Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model},
journal={{RSC} Medicinal Chemistry},
volume={15},
number={7},
pages={2527--2537},
month={July},
publisher={{RSC}},
url={https://pubs.rsc.org/en/content/articlelanding/2024/md/d4md00423j},
year={2024},
abstract={Generating potent compounds for evolving analogue series ({AS}) is a key challenge in medicinal chemistry. The versatility of chemical language models ({CLMs}) makes it possible to formulate this challenge as an off-the-beaten-path prediction task. In this work, we have devised a coding and tokenization scheme for evolving {AS} with multiple substitution sites (multi-site {AS}) and implemented a...}}