How much data does Artificial Intelligence need?


Methods from the fields of Artificial Intelligence and Machine Learning are generally believed to require a lot of data. But what does that mean specifically? In training sessions or conversations with users, we often encounter these questions: What constitutes “a lot” of data? How much data do I need at least for a good model? What factors influence the amount of data? In this article, we aim to discuss this topic to provide guidance in assessing one’s own challenges and to give insight into how data scientists approach various application problems.

Why does a Machine Learning model need so much data?

First, let’s explain why a Machine Learning model needs a lot of data for training. One reason lies in the so-called “Curse of Dimensionality”: the volume of the input space, and with it the amount of data required to cover it, grows exponentially with the number of input variables.

An example of this is buying a new coffee machine with a multitude of binary switches, whose manual has been lost. How do you find out which switch positions make tasty coffee and which settings should be avoided? With a single switch, you can simply try it out: you would need to drink two cups of coffee and compare them. With two switches, there would be four cups for all on/off combinations, and with three switches, eight cups. The number of trials thus grows as \(2^{\text{number of switches}}\). With 20 switches, this already amounts to 1,048,576 cups of coffee. If the switches do not only have binary settings but \(N\) settings each, the number of trials becomes \(N^{\text{number of switches}}\). For five switches with five settings each, you would therefore need \(5^5 = 3125\) trials to find out which setting makes the best coffee.
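To make this growth tangible, here is a minimal Python sketch (the helper required_trials is ours, purely for illustration) that reproduces the numbers above:

```python
# Counting all switch combinations: settings_per_switch ** number_of_switches.
def required_trials(settings_per_switch: int, number_of_switches: int) -> int:
    """Cups of coffee needed to try every switch combination exhaustively."""
    return settings_per_switch ** number_of_switches

print(required_trials(2, 1))   # 1 binary switch    -> 2 cups
print(required_trials(2, 3))   # 3 binary switches  -> 8 cups
print(required_trials(2, 20))  # 20 binary switches -> 1,048,576 cups
print(required_trials(5, 5))   # 5 switches with 5 settings each -> 3,125 cups
```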

© ML2R
An illustration of the Curse of Dimensionality: the number of required data points \(N\) for different numbers of dimensions \(d\) and discretizations \(p\).

Now, if we want to develop an ML model for coffee taste as a function of the switches, we need experimental data of the form switch setting → coffee taste. If the model only had to be very simple and perfectly accurate on known settings, we could build one that simply looks up in the experimental data which taste results from the current switch setting. However, the strength of Machine Learning lies in generalization: based on learned examples, the model can draw conclusions about new states that were not part of the learning process.

In our case, this would mean that we only need a portion of the data, and the model infers from it how the coffee tastes for a switch combination it has not seen. This raises the question: how many data points do I need then? To stay with the example of 5 switches (= 5 dimensions) with 5 settings each (= 5 discretizations): how many of my 3,125 points are needed to create a good model of coffee taste as a function of the switch positions?
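As a sketch of what such a generalization experiment could look like (the taste function below is completely made up for illustration, and a random forest stands in for “some ML model”): train on a small fraction of the 3,125 grid points and evaluate on the combinations the model has never seen.

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical ground truth: an invented "taste" score, for illustration only.
def taste(setting: np.ndarray) -> float:
    return np.sin(setting[0]) + 0.1 * setting[1:].sum()

# Full grid: 5 switches with 5 settings each -> 5**5 = 3,125 combinations.
grid = np.array(list(itertools.product(range(5), repeat=5)), dtype=float)
y = np.array([taste(s) for s in grid])

# Train on only 10% of the grid and check how well the model generalizes
# to the remaining, unseen switch combinations.
X_train, X_test, y_train, y_test = train_test_split(
    grid, y, train_size=0.1, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("R^2 on unseen combinations:", model.score(X_test, y_test))
```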

How much information does an ML model need?

To get a sense of how much data a model (e.g., coffee taste depending on switch positions) needs, let’s first look at an example with only one dimension.

If we knew how the process we want to model (the “Ground Truth”) behaves, the task would be easier:

© ML2R
Two extreme behaviors of the Ground Truth and the necessary sampling. Left: a very simple Ground Truth, two data points are sufficient to estimate it. Right: a very complicated Ground Truth, many data points are needed to capture only part of its behavior.

The image shows two extreme examples of how a process could behave as a function of a single factor. For a linear relationship (left image), we would only need to measure twice; for a very uneven relationship (right image), very often. From this we can conclude: the less smooth the behavior of the process, the more data is needed. In signal processing, this idea is formalized in the Nyquist-Shannon sampling theorem. It states that a continuous signal (our process) with maximum frequency \(f_{\max}\) must be sampled at a rate greater than \(2 f_{\max}\) to be reconstructed exactly. And this reconstruction of the signal or process is exactly the goal of the ML model: it should predict the taste of the coffee from the data for a given switch setting.

If we now want to learn not just a one-dimensional but a multi-dimensional relationship, the Curse of Dimensionality tells us that we need exponentially more data. In the simplest case, the number of data points grows as the number of sample points per dimension raised to the number of dimensions, \(N = (\text{points per dimension})^{\text{dimensions}}\).
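A minimal numpy sketch of this sampling intuition, with simple linear interpolation standing in for the ML model (the signal and sample counts are chosen purely for illustration):

```python
import numpy as np

# A fast-oscillating "process" (a sine with f_max = 5 over one second),
# sampled below and above the Nyquist rate of 2 * f_max = 10 samples per second.
f_max = 5.0
t_fine = np.linspace(0.0, 1.0, 1000)              # dense grid standing in for the true process
process = np.sin(2 * np.pi * f_max * t_fine)

for n_samples in (6, 100):                        # 6 is below the Nyquist rate, 100 well above
    t_s = np.linspace(0.0, 1.0, n_samples)
    y_s = np.sin(2 * np.pi * f_max * t_s)
    reconstruction = np.interp(t_fine, t_s, y_s)  # linear interpolation as a stand-in "model"
    print(f"{n_samples:3d} samples -> max reconstruction error "
          f"{np.max(np.abs(reconstruction - process)):.2f}")

# In d dimensions the same per-axis resolution multiplies up: N = samples_per_axis ** d.
print("Points for the same resolution in 3 dimensions:", 100 ** 3)
```

With too few samples, the model sees an apparently flat signal and misses the oscillation entirely; above the Nyquist rate, the reconstruction error becomes small.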

In reality, the sampling frequency is unknown, and the assumptions underlying the theorem are rarely all fulfilled. Still, the theorem provides a scientific basis for the intuition: where a lot happens in the process, a lot of data is needed. So if the coffee taste varies strongly with switch A but is linear in the other four switches (no matter how A is positioned), a lot of data is needed along dimension A and only little along the other dimensions. In terms of points, we would be at \(2^4 \cdot 5^1 = 80\) here, instead of the 3,125 points above.
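The per-dimension budgets simply multiply, which a one-liner makes explicit (the split into “easy” and “hard” switches is of course only illustrative):

```python
import math

# Four "easy" switches that only need their two endpoints, and one "hard"
# switch A for which all 5 settings have to be sampled.
points_per_dimension = [2, 2, 2, 2, 5]
print(math.prod(points_per_dimension))  # 80 points instead of the full 5**5 = 3,125 grid
```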

What role does the model type play?

So far, we’ve talked about the data and the process – but what about the difference between Neural Networks, which typically require more data, and simple linear models, which require less data?

The answer lies in the kinds of processes the models can represent. A linear model will always express only a linear relationship, no matter how much data is available. With an increasing amount of data, its parameters (its slope and its offset – the value the model predicts for an input of 0) will change only slightly; thus, only a limited number of data points is required. A Neural Network, in contrast, is a universal approximator: it can describe very complex relationships, and with an increasing amount of data its parameters will keep changing to capture more detail. But if a Neural Network sees only very few data points, it too will only find a simple, possibly linear relationship.
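To see this behavior in practice, here is a small scikit-learn sketch (the ground truth and model sizes are arbitrary choices for illustration): a linear model and a small Neural Network are fitted to increasing amounts of data from a non-linear process and evaluated on a dense test grid.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def ground_truth(x):
    # A made-up, non-linear process used only for illustration.
    return np.sin(3 * x) + 0.5 * x

x_test = np.linspace(-2.0, 2.0, 400).reshape(-1, 1)
y_test = ground_truth(x_test).ravel()

for n in (5, 50, 500):
    x_train = rng.uniform(-2.0, 2.0, size=(n, 1))
    y_train = ground_truth(x_train).ravel()

    linear = LinearRegression().fit(x_train, y_train)
    network = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000,
                           random_state=0).fit(x_train, y_train)

    print(f"n={n:4d}  linear R^2: {linear.score(x_test, y_test):.2f}  "
          f"network R^2: {network.score(x_test, y_test):.2f}")
```

With very few points, the two models describe the process about equally well; only with more data does the network’s additional flexibility pay off, while the linear fit barely changes.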

In practice, this means that the choice of model depends on the number of available data points. The more data is available, the more complex the relationship I can express, and the more complex the model can be. However, the number of necessary data points remains unaffected by this: it depends solely on the process.

A good ML model doesn’t necessarily need a large amount of data, but precisely the data that contains all information about the underlying process. The number of data points is determined by the Curse of Dimensionality and the process to be modeled – a minimum is necessary to build a valid model. A very simple process will require only a few points, a very complex one will require many. To weigh this up, Data Scientists and domain experts depend on each other. But, as is often the case, there is no simple answer to the question of how many data points are needed.

Dorina Weichert

Dorina Weichert is a research associate and doctoral candidate at Fraunhofer IAIS in Sankt Augustin. She works and researches in the field of Bayesian optimization and design of experiments. She enjoys working with Gaussian processes, since they efficiently combine a small amount of data with prior knowledge to create a trustworthy model.
