What are adversarial examples?

To use Machine Learning in everyday life, one must develop software that implements and runs the Machine Learning model. This opens up opportunities for attackers with malicious intent to use various cyberattacks to break the ML software and disrupt its operation. This raises the question: is there a vulnerability that only Machine Learning programs are susceptible to? As it turns out, yes! There are ways to attack the functionality of the Machine Learning model itself, essentially by finding inputs that cause the model to behave unexpectedly and incorrectly. Such inputs are called “adversarial examples”. This blog post delves into what adversarial examples are and how they work.

The beginning: the attackers

By definition, an adversary is an attacker, an antagonist who seeks to obstruct one’s work or render it dysfunctional. In computer science and software development, the security of a system and of the programs running within it is studied in the research field of cybersecurity. The development and widespread use of software based on Machine Learning have opened up new opportunities for such attackers: by manipulating inputs to the Machine Learning process, the behavior of the algorithm can be disrupted. For example, spam detection systems can be attacked by carefully falsifying incoming training examples; over time, this leads to the model no longer recognizing spam. Or consider a Machine Learning model designed to protect a server containing company data from intrusion attacks. The goal of an attacker would be to interact with the server in such a way that their attack goes unnoticed.

At DEF CON (the largest conference on cybersecurity), the Chinese search engine company Baidu presented a “Stealth T-Shirt” as part of its software package for cyberattacks. The pattern, generated from the gradients of an object recognition model, allows a person holding it in front of the camera to become “invisible” to this model.

The problem of adversarial Machine Learning has been studied since around 2010, when it was shown that classical Machine Learning methods such as linear models, support vector machines, and random forests are vulnerable to adversarial examples.

The development and growing popularity of highly non-convex neural networks for deep learning raised hopes that such models would be more resistant to attackers. However, in 2014, Szegedy et al. demonstrated a surprising property of neural networks: they made incorrect predictions when the input was changed only slightly, using a specific, carefully chosen method. Inputs altered in this way are now known as “adversarial examples”. The experiments were conducted on visual classifiers. The puzzling aspect was that the added noise barely changed anything for a person viewing the image, yet the network would give a wrong answer and be very confident about it. This caught the attention of deep learning practitioners and sparked a popular line of research on adversarial examples based on small perturbations.

What are adversarial examples?

An adversarial example is a specially manipulated input for a Machine Learning algorithm that intentionally misleads it into a misclassification. As mentioned above, the input is changed only slightly, using a specific, carefully chosen method. But what does it mean to change an input in a way that is not recognizable to a human observer?

In recent years, many different algorithms have been proposed for generating such manipulated inputs, each with its own special characteristics. To understand how one of the most well-known algorithms works, we must remember that a neural network is trained to minimize the loss incurred on the training inputs. Minimization is done through gradient descent, meaning the parameters of the neural network are moved step by step towards values for which the prediction loss is as small as possible.
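Written out, and using the notation L(x, w) from the figure further below (the learning rate η is added here purely for illustration), training amounts to the following:

```latex
% Training: find weights w that make the loss on the N training examples as small as possible
w^{*} = \arg\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} L(x_i, w)

% One gradient-descent step with learning rate \eta: move the weights against the gradient
w \leftarrow w - \eta \, \nabla_{w} L(x, w)
```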

Now imagine that we have reached a parameter configuration that leads to a small loss on the training data. We can then select a region around one of our examples that is so small that all inputs within it look identical to a human observer. And since the loss depends on both the parameters of the network and the input example, we can use the gradient with respect to the input to modify the example so that the loss becomes as large as possible while not leaving this region of imperceptible change. This way, it is possible to generate an adversarial example for every input and every neural network.
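One well-known algorithm that follows exactly this recipe is the fast gradient sign method (FGSM), which takes a single signed gradient step with respect to the input. Below is a minimal sketch assuming PyTorch and a pretrained image classifier called model; the function name fgsm_attack and the budget epsilon are illustrative choices, not the API of any particular library.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon):
    """Sketch of the fast gradient sign method (FGSM): nudge the input x so
    that the loss grows, while keeping every pixel change below epsilon."""
    x_adv = x.clone().detach().requires_grad_(True)

    # Forward pass: loss of the current prediction with respect to the true label
    loss = F.cross_entropy(model(x_adv), label)

    # Backward pass: gradient of the loss with respect to the *input*, not the weights
    loss.backward()

    # One step in the direction that increases the loss, bounded by epsilon per pixel
    x_adv = x_adv + epsilon * x_adv.grad.sign()

    # Keep the result a valid image (here we assume pixel values in [0, 1])
    return x_adv.clamp(0.0, 1.0).detach()

# Illustrative usage: x is a batch of images, y the correct class labels
# x_adv = fgsm_attack(model, x, y, epsilon=0.03)
# print(model(x).argmax(1))      # original predictions
# print(model(x_adv).argmax(1))  # predictions on the adversarial examples
```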

© ML2R
First, the figure schematically shows the training process, in which the weights are modified with gradient descent in order to reach a minimum of the loss L(x, w). Next, it schematically shows that, once the weights are at this minimum, the input x can be changed so that the loss increases, while the change remains smaller than epsilon.

These examples are also called “small-perturbation, large-loss” examples. Yet the reason for their existence is not fully understood: we can construct them with specific algorithms, but we cannot explain why they exist. When training a Machine Learning model, we aim for the model to make no errors, at least on the training data and on inputs similar to the training data. But now we know that an unlimited number of such misleading examples can be generated. There are several theories about this, but none that are universally accepted, mainly because all the evidence is empirical and could therefore be overturned in the future. One of the most devastating types of attacks on Machine Learning exploits the ability of deep neural networks to learn features that we as humans cannot perceive but that are nevertheless useful for solving the task. If an attacker forces the neural network not to rely on those features, because they can be disturbed to cause errors, the overall performance of the ML model drops.

An especially interesting characteristic of adversarial examples is their empirically demonstrated transferability between different neural networks. This means that adversarial noise generated for a specific input and a specific deep learning model often also works as adversarial noise for another input and another model. This phenomenon is extremely dangerous. Imagine you are the developer of an AI system whose core is a deep learning model, and you want to secure this system against attacks. You use all available cybersecurity tools to prevent outsiders from accessing the details of the model. Nonetheless, the adversary will still be able to observe your model at work, gather enough information, train their own substitute model, create adversarial examples for this substitute, and then break your highly protected model. Such black-box attacks are very worrying in the context of applications critical to human life, such as systems in autonomous vehicles.
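To make the black-box scenario concrete, here is a minimal sketch, again assuming PyTorch and reusing the illustrative fgsm_attack helper from the sketch above: adversarial examples are crafted against a locally trained substitute model and then simply fed to the protected target model, which the attacker can only query for predictions.

```python
import torch

def transfer_attack_success(substitute, target, x, y, epsilon=0.03):
    """Sketch of a transfer-based black-box attack: craft adversarial examples
    with full access to a substitute model trained by the attacker, then feed
    them to the protected target model, which can only be queried."""
    # White-box step: use the substitute's gradients (fgsm_attack from the sketch above)
    x_adv = fgsm_attack(substitute, x, y, epsilon)

    # Black-box step: the attacker only sees the target's predictions, never its gradients
    with torch.no_grad():
        clean_pred = target(x).argmax(dim=1)
        adv_pred = target(x_adv).argmax(dim=1)

    # Fraction of originally correct inputs on which the attack transfers to the target
    fooled = (clean_pred == y) & (adv_pred != y)
    return fooled.float().mean().item()
```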

The relevance of security for Machine Learning

In a modern world where more and more tasks are being automated, software security is critical. The security of software based on Machine Learning is especially important, as it is used in very sensitive areas. Adversarial attacks must therefore be considered in every AI system. That’s why ML2R (now the Lamarr Institute) is researching Trustworthy AI, focusing among other things on the security of AI systems. The next question is how to protect against adversarial attacks. We’ll introduce basic defense strategies in our next blog post.

Linara Adilova

Linara Adilova is a research associate at Fraunhofer IAIS in the field of “KI Absicherung” (AI security), with a research focus on the theoretical foundations of deep learning. Her project work focuses on the production of autonomous driving systems.
