Very Fast Streaming Submodular Function Maximization

Data summarization has become a valuable tool in understanding even terabytes of data. Due to their compelling theoretical properties, submodular functions have been in the focus of summarization algorithms. These algorithms offer worst-case approximations guarantees to the expense of higher computation and memory requirements. However, many practical applications do not fall under this worst-case, but are usually much more well-behaved. In this paper, we propose a new submodular function maximization algorithm called ThreeSieves, which ignores the worst-case, but delivers a good solution in high probability. It selects the most informative items from a data-stream on the fly and maintains a provable performance on a fixed memory budget. In an extensive evaluation, we compare our method against 6 other methods on 8 different datasets with and without concept drift. We show that our algorithm outperforms current state-of-the-art algorithms and, at the same time, uses fewer resources. Last, we highlight a real-world use-case of our algorithm for data summarization in gamma-ray astronomy.

  • Published in:
    ECML PKDD 2021: Machine Learning and Knowledge Discovery in Databases. Research Track European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)
  • Type:
    Inproceedings
  • Authors:
    S. Buschjäger, P.-J. Honysz, L. Pfahler, K. Morik
  • Year:
    2021

Citation information

S. Buschjäger, P.-J. Honysz, L. Pfahler, K. Morik: Very Fast Streaming Submodular Function Maximization, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), ECML PKDD 2021: Machine Learning and Knowledge Discovery in Databases. Research Track, 2021, https://doi.org/10.1007/978-3-030-86523-8_10, Buschjaeger.etal.2021,