Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and using a custom multilingual tokenizer, our models address the limitations of existing LLMs, which focus predominantly on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their results on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.

  • Published in:
    arXiv preprint
  • Type:
    Article
  • Authors:
    Ali, Mehdi; Fromm, Michael; Thellmann, Klaudia; Ebert, Jan; Weber, Alexander Arno; Rutmann, Richard; Jain, Charvi; Lübbering, Max; Steinigen, Daniel; Leveling, Johannes; Klug, Katrin; Schulze Buschhoff, Jasper; Jurkschat, Lena; Abdelwahab, Hammam; Stein, Benny Jörg; Sylla, Karl-Heinz; Denisov, Pavel; Brandizzi, Nicolo'; Saleem, Qasid; Bhowmick, Anirban; Helmer, Lennard; John, Chelsea; Ortiz Suarez, Pedro; Ostendorff, Malte; Jude, Alex; Manjunath, Lalith; Weinbach, Samuel; Penke, Carolin; Filatov, Oleg; Asaadi, Shima; Barth, Fabio; Sifa, Rafet; Küch, Fabian; Herten, Andreas; Jäkel, René; Rehm, Georg; Kesselheim, Stefan; Köhler, Joachim; Flores-Herr, Nicolas
  • Year:
    2024
  • Source:
    https://arxiv.org/abs/2410.03730

Citation information

Ali, Mehdi; Fromm, Michael; Thellmann, Klaudia; Ebert, Jan; Weber, Alexander Arno; Rutmann, Richard; Jain, Charvi; Lübbering, Max; Steinigen, Daniel; Leveling, Johannes; Klug, Katrin; Schulze Buschhoff, Jasper; Jurkschat, Lena; Abdelwahab, Hammam; Stein, Benny Jörg; Sylla, Karl-Heinz; Denisov, Pavel; Brandizzi, Nicolo'; Saleem, Qasid; Bhowmick, Anirban; Helmer, Lennard; John, Chelsea; Ortiz Suarez, Pedro; Ostendorff, Malte; Jude, Alex; Manjunath, Lalith; Weinbach, Samuel; Penke, Carolin; Filatov, Oleg; Asaadi, Shima; Barth, Fabio; Sifa, Rafet; Küch, Fabian; Herten, Andreas; Jäkel, René; Rehm, Georg; Kesselheim, Stefan; Köhler, Joachim; Flores-Herr, Nicolas: Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs, arXiv preprint, 2024, https://arxiv.org/abs/2410.03730, Ali.etal.2024c.

Associated Lamarr researchers

Prof. Dr. Rafet Sifa

Principal Investigator, Hybrid ML

Dr. Joachim Köhler

Principal Investigator, NLP

Dr. Mehdi Ali

Lead Scientist Foundation Models, NLP

Dr. Max Lübbering

Scientist, NLP