The rapid increase in the performance and in the range of application areas of Machine Learning methods, especially neural networks, is driving the spread of Artificial Intelligence (AI) applications across nearly all economic sectors and areas of life. This is accompanied by a professionalization of the development and operation of AI solutions, which increasingly brings these methods into small and medium-sized enterprises, government agencies, and public institutions, as well as into all areas of private and public infrastructure (energy and data networks, hospitals, etc.).
According to contemporary standards of professional software development, such as the V-model, systematic testing of functionality and security is an integral part of the development and delivery process and should occur throughout an application’s entire lifecycle — whether AI components are included or not. This is especially necessary where their use involves high or critical risks. Additionally, agile development processes with their fast, iterative updates and automated delivery, as well as highly dynamic deployment environments, require the use of easily automatable testing procedures.
Compared to traditional software, testing AI applications presents a number of specific challenges that established testing procedures have not yet adequately addressed. In this blog post, we introduce some of these challenges.
Modern software development is complex and highly automated
For Machine Learning methods in particular, but also for traditional software, testing based solely on manually created functional tests is often insufficient to ensure an acceptable level of functionality and security. Moreover, software today is often developed in many short cycles, integrated into packages, and delivered automatically. Because of this high dynamism, many applications are never completely finished and therefore never fully tested. Effective testing procedures must account for this and be both practical and economical.
For instance, formal verification, i.e., mathematically proving the functionality of a software module, is typically very time-consuming and costly. In practice, it is suitable only for deterministic programs of manageable complexity and for scenarios that are both static and highly safety-critical (e.g., space missions). Often, such a correctness proof is desirable but not feasible, for instance when the space of possible inputs is practically infinite, that is, when according to current knowledge it will foreseeably never be possible to test all inputs, even with all potentially available human resources. An example is software that processes image content.
Even if an application is in principle suitable for formal verification, corresponding approaches do not yet exist for all Machine Learning methods. For the large proportion of applications for which formal verification is not an option, the need for assurance described above must therefore be met by alternative testing and verification methods.
Specific challenges of AI models
The functionality of an AI model cannot be adequately tested through manually created unit or integration tests (i.e., tests of elementary functional units or of the interaction of several functional units that compare the computed output with the expected output for a given input), as is usual for traditional software. Instead, testing is done in a data-driven manner, embedded in the training process. A standard procedure for testing supervised learning methods involves splitting the input data into training, validation, and test datasets. The model is trained on the training dataset, and its performance on the validation data is continuously monitored as a criterion for the current, actual “learning success”. Finally, the model is evaluated on the withheld test dataset to check how well it has learned to respond correctly to entirely new inputs, i.e., how well it generalizes the relationships to be learned.
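A minimal sketch of this split-based procedure, using scikit-learn and a small toy dataset purely as illustrative stand-ins, could look like this:

```python
# Minimal sketch of data-driven testing with a train/validation/test split.
# Dataset, model, and metric are placeholders for illustration only.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# Hold out a test set that is never touched during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split off a validation set used to monitor learning progress.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = MLPClassifier(max_iter=300, random_state=0)
model.fit(X_train, y_train)

# Validation accuracy as a proxy for the current "learning success".
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
# Final check on the withheld test data: how well does the model generalize?
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```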
Quality management of training data
However, this procedure is unsuitable for detecting missing important examples in an input dataset. For example, there are scenarios where critical known use cases of a model are not or only weakly represented in the training data or can only be generated at high cost or by simulation. An example would be the representation of “real” traffic accidents for training autonomous vehicles.
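A first, simple safeguard is to check explicitly whether known critical cases are represented in the training data at all. The following sketch is purely illustrative; the class names and the coverage threshold are hypothetical:

```python
# Sketch: flag critical classes that are missing or underrepresented in the
# training data (class names and threshold are hypothetical).
from collections import Counter

def check_coverage(labels, critical_classes, min_fraction=0.01):
    counts = Counter(labels)
    total = len(labels)
    for cls in critical_classes:
        fraction = counts.get(cls, 0) / total
        if fraction < min_fraction:
            print(f"WARNING: class '{cls}' covers only {fraction:.2%} of the training data")

# Hypothetical usage for a road-traffic dataset:
labels = ["car"] * 9500 + ["truck"] * 480 + ["accident_scene"] * 20
check_coverage(labels, critical_classes=["accident_scene", "emergency_vehicle"])
```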
Even if all important types of permissible examples are included, the training and testing procedure described above cannot prevent statistical biases between the collected input data and the relationships to be learned from being carried over into the model; a simple distribution check is sketched after the list below. Such biases can arise in various ways, for instance, if:
- the data collection is poorly implemented or too difficult/expensive.
- certain properties can only be measured or estimated indirectly.
- biases and prejudices of the people involved in the collection process are reflected in the data.
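One simple way to surface such biases is to compare the distribution of a relevant attribute in the collected data against a known reference distribution; the counts and reference shares below are made up purely for illustration:

```python
# Sketch: compare the group distribution in the collected data with a known
# reference distribution to spot possible sampling bias (numbers are made up).
import numpy as np
from scipy.stats import chisquare

observed = np.array([820, 150, 30])            # counts per group in the dataset
reference_shares = np.array([0.6, 0.3, 0.1])   # shares in the reference population
expected = reference_shares * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.05:
    print(f"Possible sampling bias: chi-square p-value = {p_value:.2e}")
```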
High dynamism of model and environment
Machine Learning methods can also allow models to continue learning during operation, which can cause them to deviate significantly from a previously tested version. Neural networks, for example, can be prone to catastrophic interference, where a newly learned example quickly and completely “overwrites” previously learned examples. It is also known that some algorithms, especially neural networks, can react sensitively to even the smallest variations in their input. This is illustrated in two articles from the ML blog, which discuss adversarial examples and attacks as well as potential countermeasures.
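To illustrate this sensitivity, the following sketch applies a single FGSM-style perturbation step (a standard way to construct adversarial examples); the model and the input are toy placeholders, not a real application:

```python
# Sketch of a single FGSM-style perturbation (PyTorch; model and input are toys).
import torch
import torch.nn as nn

def fgsm_example(model, x, y, epsilon=0.01):
    """Return an input perturbed in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # A tiny step along the sign of the gradient can already flip the prediction.
    return (x + epsilon * x.grad.sign()).detach()

# Toy model and random "image" for demonstration:
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
y = torch.tensor([3])
x_adv = fgsm_example(model, x, y)
print("original prediction:", model(x).argmax().item(),
      "| perturbed prediction:", model(x_adv).argmax().item())
```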
Furthermore, in changing deployment environments, an AI model may encounter inputs in operation that are valid but were not or only weakly represented during training. For example, a neural network for recognizing road users may suddenly face a completely new vehicle category (e.g., e-scooters) due to changes in legal regulations. Ideally, a testing procedure can anticipate a model’s behavior in response to such deviations by generating new, plausible inputs itself.
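One way to approximate this is to generate plausible variations of existing inputs through simple, label-preserving transformations and check whether the prediction stays stable. The transformations and the dummy classifier in this sketch are illustrative assumptions:

```python
# Sketch: generate plausible input variations and check prediction stability.
import numpy as np
from scipy.ndimage import rotate

def prediction_is_stable(predict, image):
    """Apply simple, label-preserving transformations and compare predictions."""
    reference = predict(image)
    variants = [
        np.clip(image * 1.2, 0, 1),             # brighter
        np.clip(image * 0.8, 0, 1),             # darker
        rotate(image, angle=5, reshape=False),  # slightly rotated
    ]
    return all(predict(v) == reference for v in variants)

# Dummy classifier standing in for a real model:
dummy_predict = lambda img: int(img.mean() > 0.5)
print(prediction_is_stable(dummy_predict, np.random.rand(32, 32)))
```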
Non-determinism in Machine Learning
Another difficulty arises from the non-deterministic properties of Machine Learning methods. Two AI systems trained in the same way can behave differently even with identical input data and a static problem (environment) if the training process is non-deterministic, whether due to explicitly added randomness (e.g., shuffling of the training data) or randomness intrinsic to the computation (e.g., in parallel execution).
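The following sketch illustrates this with scikit-learn: two classifiers trained on identical data, differing only in the seed that controls shuffling, can end up disagreeing on individual test inputs:

```python
# Sketch: identical training data, different shuffling seeds, diverging models.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model_a = SGDClassifier(random_state=1).fit(X_train, y_train)
model_b = SGDClassifier(random_state=2).fit(X_train, y_train)

disagreements = (model_a.predict(X_test) != model_b.predict(X_test)).sum()
print(f"models disagree on {disagreements} of {len(X_test)} test inputs")
```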
Additionally, generative models can exhibit intrinsically non-deterministic behavior: random elements intentionally added during output generation can cause identical inputs to produce different outputs. For example, the language processing model GPT-3 typically generates texts that differ to a greater or lesser extent for identical inputs. Although this problem is not limited to Machine Learning methods, it gains particular relevance in the AI context due to the proliferation of generative models. It also significantly increases the testing effort because, in principle, a whole distribution of outputs must be collected and tested for each possible input.
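A sketch of such a distribution-level test is shown below; the `generate` function is a hypothetical stand-in for a call to a stochastic model (e.g., a language model sampled with non-zero temperature), and the property being checked is just an example:

```python
# Sketch: test a property over a distribution of outputs from a stochastic generator.
import random

def generate(prompt):
    # Hypothetical stochastic generator; replace with a real model call.
    return prompt + " " + random.choice(["yes", "no", "maybe"])

def test_output_property(prompt, property_check, n_samples=100, min_pass_rate=0.95):
    outputs = [generate(prompt) for _ in range(n_samples)]
    pass_rate = sum(property_check(o) for o in outputs) / n_samples
    return pass_rate >= min_pass_rate, pass_rate

ok, rate = test_output_property(
    "Is the sky blue?",
    property_check=lambda text: len(text) < 200,  # e.g., a length constraint
)
print(f"property satisfied for {rate:.0%} of sampled outputs")
```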
Conclusion
Being able to effectively and efficiently test AI models is becoming increasingly important in light of their growing prevalence, complexity, and criticality. At the same time, testing such systems involves specific challenges that, like the learning methods themselves, are the subject of ongoing research. In the next blog post, we will introduce a promising testing approach that addresses some of these problems.