Progress Report: Towards European LLMs
We present two multilingual LLMs designed to embrace Europe’s linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models’ development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.
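The custom multilingual tokenizer is a central ingredient of the approach. As a rough illustration only, and not the authors' actual pipeline, a byte-level BPE tokenizer covering many EU languages could be trained with the Hugging Face tokenizers library roughly as follows; the file paths, language list, vocabulary size, and special tokens below are all assumptions for the sketch.

```python
# Illustrative sketch only: trains a byte-level BPE tokenizer on a
# multilingual corpus. Paths, vocab size, and special tokens are
# hypothetical and not taken from the paper.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Hypothetical per-language text files (all 24 EU languages in practice).
corpus_files = [f"data/{lang}.txt" for lang in ["en", "de", "fr", "es", "it", "pl"]]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=250_000,  # assumed; a larger vocab reduces token fertility for non-English text
    special_tokens=["<unk>", "<s>", "</s>"],
)

tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("eu24_tokenizer.json")
```

A byte-level vocabulary avoids out-of-vocabulary tokens across the EU languages' different scripts and diacritics; the tokenizer design actually used by the models is described in the paper itself.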
- Published in: arXiv
- Type: Inproceedings
- Authors: Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Ebert, Jan and Weber, Alexander Arno and Rutmann, Richard and Jain, Charvi and Lübbering, Max and Steinigen, Daniel and Leveling, Johannes and others
- Year: 2024
Citation information
Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Ebert, Jan and Weber, Alexander Arno and Rutmann, Richard and Jain, Charvi and Lübbering, Max and Steinigen, Daniel and Leveling, Johannes and others: Progress Report: Towards European LLMs, arXiv, 2024.
@Inproceedings{Ali.etal.2024b,
author={Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Ebert, Jan and Weber, Alexander Arno and Rutmann, Richard and Jain, Charvi and Lübbering, Max and Steinigen, Daniel and Leveling, Johannes and others},
title={Progress Report: Towards European LLMs},
booktitle={arXiv},
year={2024},
abstract={We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models'...}}