Learning From Data Streams: Foundations of Stream Learning

© Illustration generated using ChatGPT by OpenAI

In most machine learning scenarios, we start from the same premise: We have a dataset – say, a table of images and their labels – and we can look at it as often as we want. We can shuffle it, split it, normalize it, train a model, evaluate it, retrain it… all at leisure. This is what we might call batch machine learning. But what happens when data never stops coming in? When you don’t have the luxury of going through your data multiple times because it’s constantly being generated in real time? Welcome to data stream mining – also known as stream learning, online machine learning, or incremental learning.

From Batches to Streams

Figure 1: Batch-Learning vs. Stream Learning © Dr. Sebastian Buschjäger / Lamarr Institute

In typical (batch) machine learning, you start with a fixed dataset. For example, you might train a model to detect cats and dogs in images, recognize handwritten digits, or generate text to answer queries. You can repeatedly access this data, tweak your model, and evaluate it as often as needed. Once you’re done, you deploy the model – and that’s it. In stream learning, data arrives continuously. You don’t have all the data at once, and you may only be able to look at each data point once before it’s gone. Models need to learn and adapt on the fly, often under tight computational and memory constraints. While this may sound unnecessarily complicated, there are plenty of real-world examples of stream mining:

  • Example 1 – Real-Time Sensor Data: Imagine a temperature sensor inside a car engine. It measures temperature hundreds of times per second. If the sensor detects overheating, the cooling system must react immediately, and waiting to analyze the entire day’s data later isn’t an option. Here, data is produced faster than it can be stored or reprocessed. We need algorithms and models that can make quick, reliable decisions as data arrives – often running on small, resource-constrained systems.
  • Example 2 – Keeping Models Up to Date: Now think of large language models (LLMs) like ChatGPT. After about six months, the knowledge of the world captured in their original training data is already outdated because new events have happened. Retraining the whole model from scratch would require enormous compute resources. A better approach is to continuously update the model with new information – a form of incremental, lifelong learning.

Both examples highlight why stream learning is becoming more important: data grows faster than our ability to store and retrain on it. Those readers familiar with time-series forecasting will already see some overlap between both research directions because both scenarios deal with evolving data over time. However, forecasting typically focuses on predicting future numeric values (like tomorrow’s temperature), often using regression models designed for one-dimensional time series. Stream learning, on the other hand, is broader: it can involve tabular data, text, or images, and can handle tasks like classification, clustering, and anomaly detection – not just regression. This continuous, one-pass setting is the core idea behind real-time machine learning and online learning systems, where models must adapt immediately as new data arrives.

A (Gentle) Formal View

Let’s make the distinction between batch and stream learning a bit more formal to better understand the challenges of stream data mining. In batch supervised learning, we assume that the data comes from a single, stable distribution $\Theta$, from which we collect $N$ samples into a dataset $D = \{(x_i, y_i) \sim \Theta \mid i = 1, \dots, N\}$. This dataset can then be used to train our model $f : X \to Y$ by minimizing a loss function:

$\arg\min_f \, \mathbb{E}_{x,y \sim \Theta}[\ell(f(x), y)]$

This means we’re trying to find the function $f$ that performs best on average for this one fixed data distribution. In stream learning, however, the data distribution itself can change over time. We can imagine a sequence of distributions $(\Theta_1, \dots, \Theta_T)$, where $T$ could be very large or even infinite. Our data now looks like this: $D = \{(x_i, y_i, t_i) \mid (x_i, y_i) \sim \Theta_{t_i},\ i = 1, \dots\}$, where $t_i$ is the point in time at which we collected sample $i$. Now we try to find a model $f$ that performs well across time:

$\arg\min_f \sum_{t=1}^{T} \mathbb{E}_{x,y \sim \Theta_t}[\ell(f(x), y)]$

This shift from a single, fixed dataset to an evolving sequence of distributions is what makes data stream mining and online ML fundamentally different from traditional machine learning. In other words: instead of learning from one world (i.e. one distribution) that never changes, the model must keep up with a world that evolves. That’s what makes stream learning so challenging. 
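To see what “learning across time” can look like in practice, here is a minimal sketch (not from the article; all names and numbers are illustrative) of one-pass stochastic gradient descent on a linear model with squared loss. Each sample triggers a single update and is then discarded:

```python
# Minimal sketch: one-pass online learning with stochastic gradient
# descent on a linear model f(x) = w . x under squared loss.
# All names and numbers here are illustrative, not from the article.

def sgd_step(w, x, y, lr=0.01):
    """Update weights w on a single sample (x, y), then discard it."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - y  # gradient of 0.5 * (pred - y)**2 w.r.t. pred
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

# A stream whose distribution drifts: y = 2x early, y = -2x later.
stream = [([x / 10], 2 * x / 10) for x in range(100)] + \
         [([x / 10], -2 * x / 10) for x in range(100)]

w = [0.0]
for x, y in stream:
    w = sgd_step(w, x, y)  # each sample is seen exactly once

print(w[0])  # close to -2: the model tracked the drifted concept
```

Because every sample immediately updates the weights, the model drifts toward whichever distribution is currently generating the data, without ever storing past samples.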

Important Concepts

This simple idea of going from one fixed set of data points to data points that may change over time has a large impact on how we can approach learning in this scenario. 

The “Test-Then-Train” Principle

In batch learning, we usually split our dataset into train, validation, and test sets. But in streaming scenarios, data just keeps coming, so there’s no clear boundary between training and testing. Instead, stream learning uses the test-then-train protocol:

  1. When a new data point arrives, the model first makes a prediction.
  2. Once the true label becomes available, we measure the error and update the model using that information.
Figure 2: Formula for the “Test-Then-Train” principle. © Dr. Sebastian Buschjäger / Lamarr Institute
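The two steps above can be sketched as a simple loop. The running-mean model below is a hypothetical stand-in for a real incremental learner; the protocol itself is the point:

```python
# Sketch of the test-then-train (prequential) protocol. The model is
# a hypothetical stand-in that predicts the mean of all labels so far.

class RunningMeanPredictor:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predict(self, x):
        # Cold start: we must answer even before seeing any label.
        return self.total / self.count if self.count else 0.0

    def learn(self, x, y):
        self.total += y
        self.count += 1

model = RunningMeanPredictor()
errors = []
for x, y in [(None, v) for v in [10, 12, 11, 13, 12]]:
    y_hat = model.predict(x)  # 1) test: predict before the label is known
    errors.append(abs(y_hat - y))
    model.learn(x, y)         # 2) train: update once the label arrives

mae = sum(errors) / len(errors)
print(mae)  # → 2.9 (the cold-start error at t=0 dominates)
```

Every prediction is made on data the model has not yet trained on, so the accumulated error is an honest estimate of live performance.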

This has two important consequences:

  • Cold start: The model must make predictions even before it has seen any data! Early predictions will likely be poor, but this phase is part of the learning process. This evaluation strategy is standard in online classification, where models must often provide predictions under strict latency constraints.
  • Simple baselines: A surprisingly strong baseline in many tasks is the last-value predictor, which simply predicts the most recently observed value. Consider, for example, the daily temperature: while sudden jumps are possible and the weather changes over the seasons, temperatures tend to be similar on consecutive days. Hence, a simple yet effective baseline is to predict the previous value in the data stream.
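Evaluated under the same test-then-train protocol, the last-value baseline looks like this (a minimal sketch; the temperature values are made up):

```python
# Sketch: the last-value (naive) baseline under test-then-train.
# The temperature values are made up for illustration.

def prequential_mae_last_value(stream):
    """Predict the previous observation; score before updating."""
    last = None
    total_err, n = 0.0, 0
    for y in stream:
        y_hat = last if last is not None else 0.0  # cold-start guess
        total_err += abs(y_hat - y)
        n += 1
        last = y  # "training" is just remembering the newest value
    return total_err / n

daily_temps = [14.0, 14.5, 14.2, 15.0, 15.1, 14.8]
mae = prequential_mae_last_value(daily_temps)
print(round(mae, 2))  # → 2.67, most of it from the cold-start guess
```

Any learned model that cannot beat this one-line baseline on a slowly changing series is not actually extracting useful structure from the stream.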

Concept Drift

Since the data distribution changes over time, models must adapt to these so-called concept drifts – a central challenge in stream learning, online machine learning, and real-time analytics. A concept drift is a change in the data-generating distribution and can occur in the data $x$ itself, in the corresponding labels $y$, or in both at once. For example, when a sensor in a car engine suddenly reports much higher values than expected, there might be a drift in the data space. When, on the other hand, a new car engine is released that can withstand higher temperatures, the set of possible labels might change. These drifts can occur suddenly (an abrupt event) or gradually (a slow trend). Detecting and adapting to drift is one of the central research challenges in data stream mining. Handling drift effectively is what allows incremental learning algorithms to remain accurate even as the underlying data evolves.
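As an illustration, a deliberately naive drift detector can compare the mean of a rolling recent window against a frozen reference window. Real systems use dedicated detectors such as ADWIN or the Page-Hinkley test; the window size and threshold below are arbitrary illustrative choices:

```python
# Sketch: a deliberately naive drift detector that compares the mean
# of a rolling recent window against a frozen reference window.
# Real systems use detectors such as ADWIN or the Page-Hinkley test;
# window size and threshold here are arbitrary illustrative choices.
from collections import deque

def detect_drift(stream, window=20, threshold=3.0):
    """Return time steps at which the recent mean departs from the reference."""
    reference = deque(maxlen=window)
    recent = deque(maxlen=window)
    alarms = []
    for t, value in enumerate(stream):
        recent.append(value)
        if len(reference) < window:
            reference.append(value)  # still building the frozen baseline
            continue
        ref_mean = sum(reference) / window
        rec_mean = sum(recent) / window
        if abs(rec_mean - ref_mean) > threshold:
            alarms.append(t)  # possible concept drift
    return alarms

# A sensor that jumps from ~20 to ~30 halfway through: abrupt drift.
readings = [20.0] * 50 + [30.0] * 50
alarms = detect_drift(readings)
print(alarms[0])  # → 56: the alarm fires shortly after the jump at t=50
```

Note the inherent trade-off: a larger window smooths out noise but delays detection, while a smaller one reacts faster but raises more false alarms.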

Seasonality

Not all changes are completely unexpected because some patterns repeat. These recurring concepts are known as seasonal effects. Think again of temperature data: the daily temperature naturally cycles through winter, spring, summer, and fall. A well-designed stream model should remember that such cycles will come back even after months of absence. These recurring patterns are also essential for online time-series learning, where models must balance short-term adaptation with long-term memory, but play a similar role in stream mining and online learning in general.
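One way to exploit such recurring patterns is a seasonal-naive baseline that predicts the value observed exactly one cycle ago. A minimal sketch (the class name and the monthly temperatures are made up):

```python
# Sketch: a seasonal-naive baseline that predicts the value observed
# exactly one cycle ago. The monthly temperatures are made up, and
# the class name is illustrative.
from collections import deque

class SeasonalNaive:
    def __init__(self, season_length):
        self.history = deque(maxlen=season_length)

    def predict(self):
        if len(self.history) < self.history.maxlen:
            # Before a full cycle is seen, fall back to the last value.
            return self.history[-1] if self.history else 0.0
        return self.history[0]  # the value exactly one season ago

    def learn(self, y):
        self.history.append(y)

# Monthly temperatures repeating over two years (season length 12).
temps = [2, 3, 7, 11, 15, 19, 21, 20, 16, 11, 6, 3] * 2
model = SeasonalNaive(season_length=12)
errs = []
for y in temps:
    errs.append(abs(model.predict() - y))  # test ...
    model.learn(y)                         # ... then train
print(sum(errs[12:]))  # → 0: the second year is predicted perfectly
```

The bounded deque is the “long-term memory” here: it retains exactly one season, so the cycle is remembered even though each observation is processed only once.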

Wrapping Up

Data stream mining is all about dealing with a never-ending flow of information: learning continuously, reacting quickly, and adapting to change. Whether it’s a self-driving car processing sensor data, a fraud detection system monitoring transactions, or a model that keeps itself up-to-date with the latest news, the principles are the same:

  • see each data point once
  • learn as you go
  • and stay adaptive to an ever-changing world.

Ultimately, stream learning forms the backbone of modern real-time machine learning applications, enabling continuous adaptation without full retraining.

In upcoming posts, we’ll explore how algorithms manage to do this and what makes some approaches more robust than others. So stay tuned for a stream of new input, and don’t worry about the math. It’ll start making more sense once we see these concepts in action!


FAQ: Data Stream Mining, Online Learning & Concept Drift

What is data stream mining?
Data stream mining is the process of applying machine learning to continuously arriving data. Unlike batch learning, the data typically cannot be stored or revisited, so models learn incrementally and in real time.

How does stream learning differ from online machine learning?
They are essentially the same: both refer to models that update continuously as new data comes in. “Stream learning” emphasizes the data source (a stream), while “online machine learning” emphasizes the learning procedure.

What is concept drift?
Concept drift refers to a change in the statistical properties of the data stream over time. Drift can affect features, labels, or both – and models must adapt quickly to maintain accuracy.

Why can’t we simply retrain models on all data?
In high-velocity or infinite data streams, storing all data is impossible. Retraining is too slow and compute-intensive. Incremental learning algorithms update the model in milliseconds as data arrives.

Where is online or streaming ML used in practice?
Typical use cases include sensor monitoring, fraud detection, adaptive recommendation systems and real-time forecasting.


Sebastian Buschjäger, 21 January 2026


Dr. Sebastian Buschjäger

Dr. Sebastian Buschjäger has been a postdoctoral researcher and the Coordinator of Resource-aware Machine Learning at the Lamarr Institute for Machine Learning and Artificial Intelligence in Dortmund since 2023. As scientific coordinator for this area, he aligns common research interests in resource-aware machine learning and TinyML across all four locations, supporting individual research efforts and shaping the institute’s research direction in this area. Sebastian studied […]
