Scaling laws of emergent multilingualism in vision-language models

Symbolic illustration of scaling laws and emergent multilingual capabilities in vision-language models

How and under what conditions emergent capabilities arise in vision-language models is a central question in current AI research. The study “Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation” describes empirical scaling laws that explain when models can caption images in new languages even though no image-text data is available for those languages.

The work shows that multilingual image captioning can arise as a form of conditional emergence: it results from combining image captioning data in one language with purely text-based multilingual translation data. How the training setup is scaled is crucial here: model size, the amount of translation data, and linguistic diversity together determine whether this ability appears and how stable it is.
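To make the idea of an empirical scaling law more concrete, the following minimal sketch fits a simple power law of the form error = a * size^b to a handful of points. Everything in it is an assumption for illustration: the functional form, the function name `fit_power_law`, and the synthetic data points are hypothetical and are not taken from the study, which may use a different formulation.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**b in log-log space.

    Hypothetical helper for illustration; not the study's method.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares on the log-transformed points.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent of the power law
    a = math.exp(mean_y - b * mean_x)  # multiplicative constant
    return a, b

# Synthetic, made-up points generated from error = 50 * size**-0.25.
# These are NOT results from the study.
sizes = [1e8, 4e8, 1.6e9, 6.4e9]
errors = [50.0 * s ** -0.25 for s in sizes]

a, b = fit_power_law(sizes, errors)
print(f"a={a:.2f}, b={b:.3f}")
```

A fit of this kind only covers a single factor. A relationship such as the one described above would need additional terms for the amount of translation data and the number of languages, which this sketch deliberately leaves out.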

The results are particularly relevant to current research debates, for example on generalization across modalities and languages, low-resource approaches to global AI, and more efficient alternatives to costly multimodal data collection.

The models are evaluated on established benchmarks such as Multi30K, COCO Karpathy, XM3600, and CoMMuTE, where they demonstrate consistent transfer performance in multimodal language tasks. The study thus illustrates that emergent multimodal capabilities do not arise by chance, but are closely linked to clearly describable scaling effects.

Prof. Dr. Sven Behnke, Principal Investigator and Area Chair Embodied AI at the Lamarr Institute for Machine Learning and Artificial Intelligence, is a contributing author of the study, which addresses key questions about emergence, generalization, and scaling in complex AI systems. The results will be presented at the AAAI Conference on Artificial Intelligence by Julian Spravil, Research Engineer at the Lamarr partner institution Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS.
