Fooling Perturbation-Based Explainability Methods

Author: R. L. Wilking, M. Jakobs, K. Morik
Journal: Trustworthy Artificial Intelligence Workshop at ECML PKDD
Year: 2022

Citation information

R. L. Wilking, M. Jakobs, K. Morik, "Fooling Perturbation-Based Explainability Methods," Trustworthy Artificial Intelligence Workshop at ECML PKDD, 2022.

Explanations are used to promote trust in machine learning models. If these explanations can be arbitrarily manipulated, users may trust a model based on its explanations even though it makes biased predictions on real data. Such manipulation is possible for several explainability methods, including perturbation-based methods like LIME and KernelSHAP. We show that three further methods, Anchors, LORE, and EXPLAN, are vulnerable to the same scaffolding attack that is effective against LIME and KernelSHAP. Although the scaffolding attack is designed to target a specific explainability method, we show that KernelSHAP, and especially LIME, are also vulnerable to most attacks targeting other methods. Among the methods we evaluated, Anchors is the most resistant to attacks targeting other methods. Additionally, we propose a fooling heuristic that quantifies the degree of fooling, enabling an objective comparison of different attack results.
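The core idea behind the scaffolding attack is that perturbation-based explainers query the model on synthetic, out-of-distribution samples; an adversarial "scaffold" can detect such samples and answer them with an innocuous model while serving a biased model on real inputs. The following is a minimal illustrative sketch of this idea, not the authors' implementation: `biased_model`, `innocuous_model`, and the crude z-score OOD check are hypothetical stand-ins for the trained components used in the actual attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular data: feature 0 plays the role of a "sensitive" attribute.
X = rng.normal(0.0, 1.0, size=(200, 3))

def biased_model(x):
    # Deployed behaviour: decision depends solely on the sensitive feature.
    return (x[:, 0] > 0).astype(int)

def innocuous_model(x):
    # Behaviour shown to the explainer: depends on an unrelated feature.
    return (x[:, 2] > 0).astype(int)

def is_perturbed(x, reference, threshold=4.0):
    # Crude OOD detector (stand-in for the classifier trained in the attack):
    # flags points whose worst per-feature z-score exceeds the threshold.
    z = np.abs((x - reference.mean(axis=0)) / reference.std(axis=0))
    return z.max(axis=1) > threshold

def scaffolded_model(x, reference=X):
    # Route explainer-generated (OOD) queries to the innocuous model,
    # real in-distribution inputs to the biased one.
    ood = is_perturbed(x, reference)
    out = biased_model(x)
    out[ood] = innocuous_model(x)[ood]
    return out
```

On real data the scaffold reproduces the biased model, while heavily perturbed queries of the kind a perturbation-based explainer generates receive the innocuous model's answers, so the resulting explanation hides the sensitive feature.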