Minimal data requirements for accurate compound activity prediction using machine learning methods of different complexity

Data sparseness is expected to negatively impact machine learning (ML) and especially deep learning (DL). In the era of DL, much emphasis is put on learning from large datasets. In cheminformatics, compound classification represents a suitable task for exploring characteristics of ML/DL. Here, we determine minimal data requirements for learning using activity-based compound classification as an exemplary application. Systematic predictions using message-passing neural networks, random forests, and simple controls are carried out on more than 100 activity classes using learning sets of increasing size. Contrary to expectations, predictive models are already obtained on the basis of “ultra-small” training sets containing only a few active compounds. These findings are rationalized by explainable ML, which reveals features that determine predictions. Accurate predictions often only depend on a few molecular features that might be detected even in ultra-small training sets. The findings revise current assumptions about data requirements for effective ML/DL.

  • Published in: Cell Reports Physical Science
  • Type: Article
  • Authors: Siemers, Friederike Maite; Feldmann, Christian; Bajorath, Jürgen
  • Year: 2022

Citation information

Siemers, Friederike Maite; Feldmann, Christian; Bajorath, Jürgen: Minimal data requirements for accurate compound activity prediction using machine learning methods of different complexity, Cell Reports Physical Science, 2022, 3, 101113, https://www.sciencedirect.com/science/article/pii/S2666386422004155?via=ihub

Associated Lamarr Researchers

Prof. Dr. Jürgen Bajorath

Area Chair Life Sciences