As we learned in the previous blog post, Machine Learning software is susceptible to attacks through artificially manipulated input data that cause the model to make errors. Such unforeseen, intentionally induced failures must be defended against, especially in applications that involve humans and affect human lives. The reliability of an AI system is therefore a primary requirement, and several techniques have been developed to achieve so-called “adversarial robustness”.
What can we do against adversarial examples?
Since “adversarial attacks” have become a research area of their own, defenses against them have evolved as well. As with any kind of defense, it is important to know beforehand what we want to protect the system against. If we know what kinds of attacks to expect, or at least what attackers are aiming for, we can construct a system that has been thoroughly checked for robustness in that regard. In this way, we prepare the system for “known unknowns”. The bigger problem lies with the so-called “unknown unknowns”. This term became popular after a news briefing by US Secretary of Defense Donald Rumsfeld and suggests that the primary danger lies in the attacks we do not know about and do not expect.
Due to the long history of adversarial examples, we already know quite a bit about what to expect: Is the goal to steal the model and its parameters, or to impair its functionality? In the latter case, attackers may try to alter or “poison” the training data in order to plant backdoors or later craft inputs that force the model into errors (known as “poisoning attacks”). Are attackers able to obtain the architecture, parameters, and gradients of the model, or do they have to construct a substitute for them? All of these questions help us build an attack model that prepares us for adversarial attacks.
Additionally, we should always keep in mind the effort required to make the model robust against adversarial attacks. If the system is not publicly accessible and is intended for internal use only (meaning inputs come only from reliable, responsible employees), attacks are very unlikely in practice, and it may not be necessary to harden the model against them at all. Time and effort could be better invested in strengthening network security to prevent unauthorized access to the model in the first place.
Another important question is whether an attacker can benefit in any way from spending computational resources on an attack. In the case of spam filters, the benefit is clear: the attacker gains the opportunity to sneak in spam messages and make money. In other cases, manipulating the output gains the attacker little, for example in age-recognition features of entertainment or gaming apps, where all the attacker achieves is a false prediction.
Approaches to protect against adversarial examples
There are now a variety of protection techniques for Machine Learning models. Each of them is suitable for specific attack scenarios and may be weak against other attacks at the same time.
The simplest approach is to track and intercept attacks targeting the system and to update, i.e. retrain, it each time. A similar approach exists in software development: attackers, or specially hired specialists, attempt to find vulnerabilities, which are then fixed in a new, corrected release that protects against them. Alternatively, we can try to protect and hide all information about the model (architecture, parameters, gradients) so that attackers have little to work with when crafting an attack. Yet another way is to attempt to develop a universally robust system, which is much more difficult because of the many unknowns that need to be considered.
Below, we present three high-level approaches for a more robust system:
Adversarial training: As explained in the previous article, adversarial examples are generated to maximize the model’s loss for the respective altered input. The idea of this specialized training approach is to turn the usual loss-minimization problem into a minimax problem. This works as follows: first, we maximize the training loss by perturbing the inputs; in effect, this inner step generates candidate adversarial examples. Then the network parameters are chosen to minimize this worst-case loss. This defense targets “epsilon-perturbation adversarial examples”, meaning examples that stay very close to the originals. For it to work, we must assume that real-world data is smooth, i.e. that small changes should not drastically change the prediction, which generally holds for image or text data. Such regularized training, in its various forms, is the most popular method for creating robust models. An intuitive explanation is that this kind of training produces a smoother loss surface, thereby reducing the chance that a nearby example with high error exists.
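The following sketch illustrates the minimax idea in PyTorch: the inner maximization is approximated with a single FGSM step, and the outer minimization updates the parameters on the perturbed inputs. The model, the epsilon value, and the [0, 1] clamping range (typical for image data) are assumptions made for illustration, not part of the original approach described above.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    # Inner maximization: approximate the worst-case perturbation inside the
    # epsilon ball with a single gradient-sign (FGSM) step.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Clamping to [0, 1] assumes image-like inputs; adjust for other data.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    # Outer minimization: update the parameters on the worst-case inputs.
    x_adv = fgsm_perturb(model, x, y, eps)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```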
Additionally, we can directly show the model bad examples so that it learns to recognize such inputs: the normal training data is mixed with generated adversarial examples that keep their correct labels. Training on this mixture yields models that are robust to similar adversarial examples. However, generating the maximum-loss example is computationally expensive, and adversarial inputs generated in a different way can still overcome this defense.
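A minimal sketch of such mixed training follows, reusing the fgsm_perturb helper from the previous snippet; the equal weighting of clean and adversarial loss is an assumed choice for illustration, not a prescription.

```python
def mixed_training_step(model, optimizer, x, y, eps=0.03, adv_weight=0.5):
    # Reuses the fgsm_perturb helper from the previous sketch.
    x_adv = fgsm_perturb(model, x, y, eps)
    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(x), y)        # normal training data
    adv_loss = F.cross_entropy(model(x_adv), y)      # adversarial examples, correct labels
    loss = (1.0 - adv_weight) * clean_loss + adv_weight * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```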
Formal verification of a model: Ideally, one would like a formal proof that the model cannot be tricked by an attacker. For deep neural networks, this has proven difficult. One approach is to translate a neural network into a logically interpretable model and verify the robustness of that model instead. Another approach, inspired by software security, is to test the model against adversarial examples on the training data and some additional reference data. For the most advanced deep learning models, which are the ones that would most need such verification, this is very computationally intensive; the proposed methods are currently only practical for shallow models, i.e. neural networks with only one or two hidden layers.
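As one concrete flavor of such certification (not necessarily the methods referenced above), interval bound propagation pushes an epsilon-box around the input through the network and checks whether the predicted class can still be overtaken in the worst case. The tiny two-layer ReLU network and its random weights below are purely illustrative assumptions.

```python
import numpy as np

# A tiny two-layer ReLU classifier with hypothetical random weights,
# used only to illustrate the certification procedure.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(3, 16)), np.zeros(3)

def interval_affine(lo, hi, W, b):
    # Propagate the box [lo, hi] through the affine layer W x + b.
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def certify(x, eps):
    # Returns True if no perturbation with infinity norm <= eps can change the
    # predicted class (sound but incomplete: False does not prove an attack exists).
    lo, hi = interval_affine(x - eps, x + eps, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)   # ReLU is monotone
    lo, hi = interval_affine(lo, hi, W2, b2)
    pred = np.argmax(W2 @ np.maximum(W1 @ x + b1, 0) + b2)
    # Certified if the worst-case logit of the predicted class still beats
    # the best-case logit of every other class.
    return bool(lo[pred] > np.delete(hi, pred).max())

print(certify(rng.normal(size=8), eps=0.01))
```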
Pre-identification: Another large group of protection approaches deals with identifying adversarial inputs before an error occurs. This task is closely related to outlier detection, out-of-distribution detection, and uncertainty estimation: all of them aim to identify inputs that are unusual for our model and could lead to unpredictable behavior. This can be done in various ways, either by modifying the inputs so that adversarial examples become benign, or by placing a detector in front of the model that rejects adversarial examples. An interesting technique lets the model itself classify adversarial examples (or all out-of-distribution examples) into a separate NULL class, thus preventing errors on them. The main problem here is posed by epsilon-perturbation examples, as they are almost indistinguishable from natural ones. Overall, these approaches are very context-specific, and many of them have been shown not to generalize well, so they are generally considered weaker than the adversarial training approach.
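One of the simplest possible detectors of this kind rejects any input whose softmax confidence falls below a threshold. This is only a sketch of the idea: the threshold value is an assumption, and such confidence-based rejection is known to fail against carefully crafted high-confidence adversarial examples, in line with the epsilon-perturbation caveat above.

```python
import torch
import torch.nn.functional as F

def predict_or_reject(model, x, threshold=0.9):
    # A detector placed in front of the model: inputs whose maximum softmax
    # confidence falls below the (assumed) threshold are rejected instead of
    # being classified; -1 acts as a stand-in for the NULL / reject class.
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    confidence, label = probs.max(dim=-1)
    return torch.where(confidence >= threshold, label, torch.full_like(label, -1))
```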
Is there a universal guide for protecting ML models?
To describe a universal protection mechanism, we would have to describe all the “unknown unknowns”, which is impossible by definition. So, all we can do is:
- protect the training data used, to avoid ending up with a model that contains backdoors for attackers;
- protect all information about the model, such as the training data and algorithm, the architecture, and its parameters;
- create a model that accounts for all “known unknowns” and retrain it reactively when new attacks are described.
Furthermore, we should have as much knowledge as possible about the task and the domain: sometimes a small change in the input can indeed legitimately change the output drastically. Research in this area is very active, and practical formal verification methods may well emerge in time. Until then, we must ensure that all critical application areas are protected as well as possible.