Fooling Perturbation-Based Explainability Methods

Explanations are used to promote trust in machine learning models. If these explanations can be arbitrarily manipulated, users may trust a model on the basis of its explanations even though it makes biased predictions on real data. Such manipulation is possible for several explainability methods, including perturbation-based methods like LIME and KernelSHAP. We show that three methods, Anchors, LORE, and EXPLAN, are vulnerable to the same scaffolding attack that is effective against LIME and KernelSHAP. Although the scaffolding attack is designed to target specific explainability methods, we show that KernelSHAP and especially LIME are also vulnerable to most attacks targeting other methods. We found that Anchors is the explainability method most resistant to attacks that target other methods. Additionally, we propose a fooling heuristic that quantifies the degree of fooling, enabling objective comparison between different attack results.
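The general idea behind a scaffolding attack can be illustrated with a toy sketch. This is not the authors' implementation: the feature indices, the two toy models, the nearest-neighbour out-of-distribution test (standing in for a trained detector), and the Gaussian perturbation are all assumptions chosen for brevity. The sketch only shows the routing logic: inputs that look like real data get the biased prediction, while off-distribution inputs, such as those a perturbation-based explainer generates, get an innocuous one.

```python
import math
import random

# Hypothetical feature layout for this toy example.
SENSITIVE, UNRELATED = 0, 1

def biased_model(x):
    """Decides purely on the sensitive feature (the behaviour to hide)."""
    return 1 if x[SENSITIVE] > 0.5 else 0

def innocuous_model(x):
    """Decides purely on an unrelated feature (the behaviour to show)."""
    return 1 if x[UNRELATED] > 0.5 else 0

def make_ood_detector(real_data, radius):
    """Assumed stand-in for a learned detector: a point counts as
    out-of-distribution if it is farther than `radius` from every real sample."""
    def is_ood(x):
        return min(math.dist(x, r) for r in real_data) > radius
    return is_ood

def make_scaffold(is_ood):
    """Route off-distribution inputs to the innocuous model, the rest to the biased one."""
    def predict(x):
        return innocuous_model(x) if is_ood(x) else biased_model(x)
    return predict

# Toy "real" data: four points on a binary grid.
REAL_DATA = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
is_ood = make_ood_detector(REAL_DATA, radius=0.3)
scaffold = make_scaffold(is_ood)

def perturb(x, rng, scale=0.5):
    """Assumed explainer-style perturbation: independent Gaussian noise per feature."""
    return tuple(v + rng.gauss(0.0, scale) for v in x)

rng = random.Random(0)
samples = [perturb((1.0, 0.0), rng) for _ in range(200)]
ood_fraction = sum(is_ood(s) for s in samples) / len(samples)
print(f"fraction of perturbed samples routed to the innocuous model: {ood_fraction:.2f}")
```

On the real grid points the scaffold behaves exactly like the biased model. An explainer that fits its surrogate mostly on off-distribution perturbations would instead mainly observe the innocuous model.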

Published in:
Trustworthy Artificial Intelligence Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)
Type:
Inproceedings
Authors:
R. Wilking, M. Jakobs, K. Morik
Year:
2022

Citation information

R. Wilking, M. Jakobs, K. Morik: Fooling Perturbation-Based Explainability Methods, Trustworthy Artificial Intelligence Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2022.
