CapyMOA: Efficient Machine Learning for Evolving Data Streams

Van-Gogh-inspired digital painting in wide format showing the CapyMOA capybara logo centered within a circular blue frame, surrounded by swirling blue and golden data streams under a glowing night sky, with a distant city skyline and a small humanoid robot observing the scene.
© Illustration generated using ChatGPT by OpenAI

Machine learning is no longer confined to static datasets. Many modern problems — from energy forecasting and fraud detection to sensor analytics and cybersecurity — involve continuous streams of data. These streams arrive instance by instance, evolve over time, and often undergo concept drift, making traditional batch learning unsuitable.

This post is for ML practitioners and applied researchers working with evolving data streams who need high-throughput, adaptive learning – and want MOA-level performance in Python without sacrificing usability.

It is the final article in our Stream Mining series and builds on the foundations introduced in the earlier posts “Learning From Data Streams: Foundations of Stream Learning”, “Green Online Learning with Heterogeneous Ensembles” and “Streaming Gradient Boosting: Pushing Online Learning Beyond its Limits”, focusing on how these concepts come together in practice.

To address these challenges and many more, CapyMOA was developed. CapyMOA is an open-source Python library for stream learning and online continual learning (OCL). First released publicly in May 2024, CapyMOA aims to bring together the efficiency of MOA, the accessibility of Python, and the flexibility demanded by modern streaming applications.

MOA was the first widely adopted open-source library for data stream learning. Originally implemented in Java around 2007, it continues to be actively used and maintained.

Why CapyMOA?

Existing tools strike different compromises:

  • MOA (Java) is extremely efficient but less accessible to Python-centred workflows.
  • River is user-friendly but written entirely in Python and struggles with high-throughput streams.
  • Scikit-Multiflow bridged some gaps but is no longer actively developed.

For practitioners, this often leads to a trade-off between performance, usability, and long-term maintainability. CapyMOA bridges these worlds by focusing on three core pillars:

1. Efficiency

CapyMOA combines native Python implementations with access to MOA’s optimized Java algorithms through a modern Python API. Whenever suitable, it leverages MOA’s mature implementations via JPype with minimal overhead, while also providing its own growing ecosystem of stream learning methods. This gives users MOA-level performance where available, without writing a single line of Java.

Benchmark comparing CapyMOA against other data stream libraries
Figure 1: Benchmark comparing CapyMOA against other data stream libraries. © 2024 CapyMOA Developers. Source: CapyMOA GitHub repository.

Why this matters in practice:

Efficiency is critical in data stream learning, where models must process continuously arriving data under time and resource constraints. Faster implementations allow researchers and practitioners to work with longer and more realistic streams, run a larger number of experiments, and reduce computational costs. This makes large-scale evaluation and deployment more feasible, especially in resource-constrained environments. CapyMOA delivers this efficiency while remaining fully integrated within Python-based workflows.

2. Interoperability

CapyMOA integrates seamlessly with:

  • MOA algorithms (wrapped automatically)
  • scikit-learn models
  • PyTorch components

This allows hybrid approaches that combine traditional online models with deep learning — especially useful for Online Continual Learning. This interoperability is key for applied ML teams that already rely on scikit-learn or PyTorch and want to extend existing pipelines to streaming scenarios instead of rebuilding everything from scratch.

Capybara chilling.
Figure 2: Capybara chilling. © 2026 CapyMOA Developers. Source: CapyMOA GitHub repository

3. Accessibility

The API is designed to be intuitive for general users while maintaining the flexibility required by advanced users, such as researchers and developers. This is reflected in features such as:

Accessibility here does not mean oversimplification. It means reducing boilerplate for straightforward tasks—like rapidly benchmarking drift detectors—while retaining full control for advanced workflows, such as constructing pipelines with online preprocessing, multiple drift detectors, and custom metrics.

Learning from Data Streams with CapyMOA

Most stream-learning pipelines follow the test-then-train loop:

  1. Fetch the next instance
  2. Make a prediction
  3. Train the learner
  4. Update metrics

CapyMOA abstracts this into prequential evaluation functions, reducing boilerplate and standardising evaluation across algorithms. This abstraction ensures consistent evaluation across models and experiments.

Example: Hoeffding Tree on the Electricity Dataset

from capymoa.datasets import Electricity
from capymoa.classifier import HoeffdingTree
from capymoa.evaluation import prequential_evaluation

stream = Electricity()
ht = HoeffdingTree(schema=stream.get_schema(), grace_period=50)

results = prequential_evaluation(stream=stream, learner=ht, window_size=4500)

print(results.cumulative.accuracy())

Windowed metrics are automatically stored in a Pandas DataFrame and can be visualised with:

from capymoa.evaluation.visualization import plot_windowed_results
plot_windowed_results(results, metric="accuracy")

Example: Online Bagging on the Electricity Dataset

Of course, it is still possible to control the loop and evaluate explicitly as in the example below.

from capymoa.datasets import Electricity
from capymoa.evaluation import ClassificationEvaluator
from capymoa.classifier import OnlineBagging

elec_stream = Electricity()
ob_learner = OnlineBagging(schema=elec_stream.get_schema(), ensemble_size=5)
ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())

for instance in elec_stream:
    prediction = ob_learner.predict(instance)
    ob_learner.train(instance)
    ob_evaluator.update(instance.y_index, prediction)

print(ob_evaluator.accuracy())

This flexibility allows users to trade abstraction for fine-grained control whenever required. For more, have a look in the CapyMOA notebook.

Schemas and Instances: Data Consistency by Design

Unlike frameworks that rely on Python dictionaries, CapyMOA enforces a Schema for every stream. This avoids silent errors, ensures stable long-term experiments, and maintains compatibility across Python and Java components. Whether you load a CSV or ARFF file, or rely on synthetic generators, every instance carries structured metadata, making interoperability reliable and automatic across the entire pipeline.

In long-running stream experiments, small inconsistencies can silently invalidate results. For example, in a dictionary-based instance representation, the absence of a feature can be ambiguous: it may indicate a missing value, or that the feature no longer exists. With explicit Schema tracking, this distinction becomes clear — the system can detect whether the feature set itself has changed or whether a value is simply missing. This allows the learner (or other components in the pipeline) to adapt appropriately to schema evolution. Schema enforcement therefore makes such changes explicit and reproducible, which is a critical advantage in both research and production settings.

Concept Drift: A First-Class Citizen

CapyMOA provides a complete system for working with drift:

  • Drift simulation using DriftStream, AbruptDrift, and GradualDrift
  • Integration with evaluation and visualisation tools
  • Native support for classic and modern drift detectors

Example: building a stream with one abrupt and one gradual drift:

from capymoa.stream.drift import DriftStream, AbruptDrift, GradualDrift
from capymoa.stream.generator import SEA
from capymoa.evaluation.visualization import plot_windowed_results

stream_sea2drift = DriftStream(
    stream=[
        SEA(function=1),
        AbruptDrift(position=5000),
        SEA(function=3),
        GradualDrift(position=10000, width=2000),
        # GradualDrift(start=9000, end=12000),
        SEA(function=1),
    ]
)

OB = OnlineBagging(schema=stream_sea2drift.get_schema(), ensemble_size=10)

results_sea2drift_OB = prequential_evaluation(
    stream=stream_sea2drift, learner=OB, window_size=100, max_instances=15000
)

print(
    f"The definition of the DriftStream is accessible through the object:\n {stream_sea2drift}"
)
plot_windowed_results(results_sea2drift_OB, metric="accuracy")

Plots automatically show the drift positions, helping researchers diagnose when and why models struggle. This makes drift not just detectable, but interpretable — a crucial step toward trustworthy adaptive systems.

CapyMOA accuracy over time for Online Bagging.
Figure 3: Accuracy over time for Online Bagging. © 2026 CapyMOA Developers. Source: CapyMOA Website

For more, have a look in the CapyMOA notebook.

Pipelines: Modular, Flexible, and Ready for AutoML

Streaming pipelines are notoriously tricky because transformations must maintain state across time. CapyMOA introduces a flexible pipeline system that supports:

  • Feature scalers
  • Drift detectors
  • Incremental learners
  • Custom components via PipelineElement
  • Multi-detector pipelines (e.g., ABCD + ADWIN)

This design enables experimentation with complex adaptive pipelines while preserving correctness over time. It also lays the groundwork for AutoML for data streams, dynamically selecting and tuning pipelines as the data changes.

Example: Applying normalization and adding noise

from capymoa.stream.preprocessing import MOATransformer
from capymoa.stream.preprocessing import ClassifierPipeline
from moa.streams.filters import AddNoiseFilter, NormalisationFilter

elec_stream = Electricity()

# Creating the transformers
normalisation_transformer = MOATransformer(
    schema=elec_stream.get_schema(), moa_filter=NormalisationFilter()
)
add_noise_transformer = MOATransformer(
    schema=normalisation_transformer.get_schema(), moa_filter=AddNoiseFilter()
)

# Creating a learner
ob_learner = OnlineBagging(schema=add_noise_transformer.get_schema(), ensemble_size=5)

# Creating and populating the pipeline
pipeline = (
    ClassifierPipeline()
    .add_transformer(normalisation_transformer)
    .add_transformer(add_noise_transformer)
    .add_classifier(ob_learner)
)

# Creating the evaluator
ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())

while elec_stream.has_more_instances():
    instance = elec_stream.next_instance()
    prediction = pipeline.predict(instance)
    ob_evaluator.update(instance.y_index, prediction)
    pipeline.train(instance)

ob_evaluator.accuracy()

For more, have a look in the CapyMOA notebook here and here.

Online Continual Learning (OCL): Beyond Classic Stream Learning

Traditional stream learning handles concept drift but usually assumes supervised labels arrive promptly. OCL, however, tackles:

  • Incremental tasks
  • Domain shifts
  • Sparse or delayed labels
  • Memory constraints

CapyMOA integrates OCL natively, allowing both classic online models and neural-based continual learners to operate on the same evaluation framework.

This unified treatment of stream learning and continual learning is still rare and positions CapyMOA at the intersection of two rapidly growing research areas – bridging stream learning and online continual learning as one of the first fully integrated frameworks.

Example: executing Experience Replay

from capymoa.ann import Perceptron
from capymoa.classifier import Finetune
from capymoa.ocl.strategy import ExperienceReplay
from capymoa.ocl.datasets import TinySplitMNIST
from capymoa.ocl.evaluation import ocl_train_eval_loop
import torch
_ = torch.manual_seed(0)
scenario = TinySplitMNIST()
model = Perceptron(scenario.schema)
learner = ExperienceReplay(Finetune(scenario.schema, model))
results = ocl_train_eval_loop(
    learner,
    scenario.train_loaders(32),
    scenario.test_loaders(32),
)
print(f"{results.accuracy_final*100:.1f}%")

For more, have a look in the CapyMOA notebook.

Conclusion

CapyMOA aims to be the modern foundation for research and practice in adaptive machine learning. By balancing efficiency, interoperability, and usability, it provides a platform where:

  • Researchers can build and test new algorithms quickly.
  • Practitioners can use robust, high-performance tools for real-time data.
  • Students can learn stream learning concepts through clear, simple examples.

As CapyMOA evolves, upcoming work includes deeper OCL integration, improved drift diagnostics, enhanced deep learning pipelines, and expanded support for clustering and anomaly detection.

CapyMOA is open-source and welcomes contributions:
Website: https://capymoa.org
GitHub: https://github.com/adaptive-machine-learning/CapyMOA
Tutorials: https://capymoa.org/tutorials.html
Discord: https://discord.gg/spd2gQJGAb

Dr. Heitor Murilo Gomes

Heitor Murilo Gomes is a Senior Lecturer in Artificial Intelligence at Victoria University of Wellington, where he leads the Adaptive AI Lab, and serves as Chairperson of the New Zealand AI Olympiad (NZAIO). Before joining VUW, he was a Senior Research Fellow and Co-Director of the AI Institute at the University of Waikato. His research spans both fundamental and applied machine learning for data streams, including ensemble learning, concept drift […]

More blog posts