Optimal Probabilistic Classification in Active Class Selection

Authors: M. Bunse, D. Weichert, A. Kister, K. Morik
Journal: 2020 IEEE International Conference on Data Mining (ICDM)
Year: 2020

Citation information

M. Bunse, D. Weichert, A. Kister, K. Morik,
2020 IEEE International Conference on Data Mining (ICDM),
2020,
942-947,
https://doi.ieeecomputersociety.org/10.1109/ICDM50108.2020.00106

The goal of active class selection (ACS) is to optimize the class proportions in newly acquired data, so that a classifier trained on that data exhibits maximum performance during its deployment. This paper provides an information-theoretic examination of the problem, resulting in an upper bound on the classifier’s error. This upper bound shows that the more data is acquired, the better the class proportions that occur during deployment perform; other class proportions can outperform these natural proportions at the beginning of data acquisition, but natural proportions certainly yield optimal probabilistic classifiers in the limit. Put differently, and perhaps surprisingly, the more data is acquired, the less beneficial ACS strategies become. Our bound further reveals that the degree to which non-natural class proportions are worthwhile depends on the correlation between the features and the class label. Experiments on standard ACS data sets quantify these effects and also show that the conclusions drawn from our analysis carry over to non-probabilistic classifiers.
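
To make the notion of "class proportions in newly acquired data" concrete, the following is a minimal sketch of how an acquisition budget might be split across classes under two candidate proportion vectors. It is not taken from the paper: the function name `allocate_budget`, the class names, and the example proportions are all hypothetical, and the largest-remainder rounding is simply one reasonable way to turn fractional shares into whole instance counts.

```python
from typing import Dict


def allocate_budget(proportions: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split an acquisition budget into per-class instance counts.

    Hypothetical helper, not from the paper. Proportions are normalized,
    and fractional shares are rounded with the largest-remainder method
    so that the counts always sum to the budget.
    """
    total = sum(proportions.values())
    if total <= 0 or budget < 0:
        raise ValueError("need positive proportions and a non-negative budget")

    # Ideal (possibly fractional) number of instances per class.
    shares = {c: budget * p / total for c, p in proportions.items()}
    # Start from the rounded-down counts.
    counts = {c: int(s) for c, s in shares.items()}
    leftover = budget - sum(counts.values())

    # Give the remaining instances to the classes with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda c: shares[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts


# Assumed example values: "natural" stands for the deployment proportions,
# "uniform" for one possible non-natural alternative.
natural = {"pos": 0.1, "neg": 0.9}
uniform = {"pos": 0.5, "neg": 0.5}

print(allocate_budget(natural, 50))
print(allocate_budget(uniform, 50))
```

In an ACS setting, each per-class count would then be passed to whatever data generator produces instances of the requested class. According to the abstract, an allocation such as `uniform` can outperform `natural` early in data acquisition, while the natural proportions become optimal for probabilistic classifiers as the budget grows.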