Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions? 


Introduction to Multilingual Instruction-Tuning 

In the rapidly evolving landscape of large language models (LLMs), instruction-tuning on multilingual data has emerged as a key area of focus. This study, conducted within the OpenGPT-X project – which aims to develop a European multilingual LLM tailored to the needs of businesses and research – explores the impact of multilingual instruction-tuning on polyglot model performance, addressing three main challenges:

  1. Data Availability: Where can we get multilingual instruction-tuning data? 
  2. Data Composition: How to instruction-tune a multilingually pre-trained model? 
  3. Model Evaluation: How to evaluate these models effectively?  

This led us to the key problem statement: 

Given a multilingual model – that is, a model that was pre-trained extensively, with a large fraction of non-English documents – what language composition of conversations works best to enable the model to follow instructions across languages?

Figure 1: Given a multilingual pre-trained model, is instruction-tuning in one language enough? © Fraunhofer IAIS

Creating Multilingual Datasets for Instruction-Tuning 

With this central question in mind, the requirement for datasets across the targeted languages became evident. While the Bactrian-X dataset offered a great quasi-parallel resource – leveraging GPT-3.5’s responses to translated instructions – the LIMA paper, highly discussed at the time, suggested with its Superficial Alignment Hypothesis that a small but high-quality instruction-tuning dataset suffices for a model to learn to follow instructions. The authors ensured the high quality of LIMA by manually curating the dataset. Thus, we created and published Lima-X by translating the instructions and answers to obtain a parallel dataset for English, German, Spanish, French, and Italian. For a fair comparison, we also down-sampled Bactrian-X, leading to the dataset Bactrian-X-small.

Figure 2: Multilingual datasets of different nature (synthetic / human curated) and size (1K / 64K). © Fraunhofer IAIS
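The down-sampling step mentioned above is conceptually simple. Below is a minimal sketch, assuming the per-language Bactrian-X records are already loaded as instruction/response dictionaries; the dummy records and the target size of 1,030 samples (taken from the setup described below) are for illustration only.

```python
import random

def downsample(records: list[dict], n: int = 1030, seed: int = 42) -> list[dict]:
    """Draw a fixed-size random subset so that synthetic and curated data match in size."""
    rng = random.Random(seed)
    return rng.sample(records, n)

# Dummy stand-ins for one per-language Bactrian-X split (illustration only).
bactrian_de = [{"instruction": f"instruction {i}", "output": f"answer {i}"} for i in range(64000)]
bactrian_de_small = downsample(bactrian_de)
print(len(bactrian_de_small))  # 1030
```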

Experimental Setup and Training Strategy 

Given these parallel and quasi-parallel datasets, we carefully designed an experimental setup to compare the effectiveness of different dataset compositions for multilingual training. The selected datasets allow us to investigate several aspects:

  • Impact of Dataset Nature: Comparing the same sized manually curated dataset (Lima-X) versus the synthetically generated dataset (Bactrian-X-small). 
  • Impact of Dataset Size: With Bactrian-X-small containing 1,030 samples and the full Bactrian-X dataset comprising 64,000 samples, we can assess how increasing the volume of data drawn from the same distribution impacts cross-linguistic performance.
  • Impact of Language Composition: Assessing monolingual datasets versus multilingual, semantically parallel dataset compositions like ENDEFRITES (short for English (EN), German (DE), French (FR), Italian (IT) and Spanish (ES)) and semantically non-overlapping compositions like ENDEFRITES-sampled, which ensure a fair comparison with the monolingual datasets.
  • Impact of Predominant Language (English): Investigating the influence of including or omitting English in the datasets (comparing ENDEFRITES with DEFRITES and ENDEFRITES-sampled with DEFRITES-sampled), as English is the dominant language during pre-training.
  • Impact of Model Size: The experiments were conducted using two model checkpoints: a 7B parameter OpenGPT-X checkpoint trained on 1 trillion tokens, and the larger 8x7B Mixtral model. 
Figure 3: The Training of 45 Model Variations Based on Dataset Compositions: Monolingual vs. Multilingual (x-fold fully parallel (ENDEFRITES, DEFRITES)/ same size semantically equal (ENDEFRITES-sampled, DEFRITES-sampled)). © Fraunhofer IAIS

We train each model on each dataset and dataset composition shown above, except for Mixtral-8x7B-Bactrian-X, as the available compute resources did not suffice to train the large model on the large dataset. This results in 45 different models to compare, as the short sketch below illustrates.
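The count of 45 can be reconstructed from the compositions listed above. The enumeration below is our own back-of-the-envelope sketch, assuming each of the nine language compositions exists for every dataset variant; the names match those used in the text.

```python
from itertools import product

base_models = ["24EU-7B", "Mixtral-8x7B"]
dataset_variants = ["Bactrian-X", "Bactrian-X-small", "Lima-X"]

# Five monolingual compositions plus four multilingual ones.
compositions = ["EN", "DE", "FR", "IT", "ES",
                "ENDEFRITES", "DEFRITES", "ENDEFRITES-sampled", "DEFRITES-sampled"]

configurations = [
    (model, dataset, comp)
    for model, dataset, comp in product(base_models, dataset_variants, compositions)
    # The full Bactrian-X dataset was too large to fine-tune Mixtral-8x7B on.
    if not (model == "Mixtral-8x7B" and dataset == "Bactrian-X")
]

print(len(configurations))  # 2 * 3 * 9 - 9 = 45
```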

Evaluation Methodology 

Figure 4: Human curated MT-Bench-X for English, German, French, Spanish and Italian. © Fraunhofer IAIS

Addressing the challenge of effective evaluation in a multilingual context, we adopted the de facto standard at the time, MT-Bench, and extended it to create MT-Bench-X. The original dataset consists of 80 two-turn questions, with ten examples in each of eight categories:

  1. Math 
  2. Coding 
  3. Reasoning 
  4. Extraction 
  5. Writing 
  6. Roleplay 
  7. STEM
  8. Humanities

To extend this dataset to multiple languages, we added 320 additional examples by translating the questions and references from English to German, French, Italian, and Spanish. This formed the new evaluation benchmark, MT-Bench-X, specifically designed for multilingual instruction-tuning. Given the importance of proper evaluation, and recognizing that human evaluation is the gold standard, we manually checked and curated the translations. Additionally, we conducted a human evaluation study for German to correlate the scores of MT-Bench-DE with human judgment scores.  
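As a rough sketch of how such a benchmark extension can be assembled, consider the snippet below; the translate() placeholder and the field names are illustrative, and the actual translations were generated and then manually checked and curated.

```python
TARGET_LANGUAGES = ["de", "fr", "it", "es"]

def translate(text: str, lang: str) -> str:
    # Placeholder for the actual (and manually curated) translation step.
    return f"[{lang}] {text}"

def build_mt_bench_x(english_questions: list[dict]) -> dict[str, list[dict]]:
    """Derive one benchmark split per target language from the English questions."""
    benchmark = {"en": english_questions}
    for lang in TARGET_LANGUAGES:
        benchmark[lang] = [
            {**q, "language": lang,
             "turns": [translate(t, lang) for t in q["turns"]]}
            for q in english_questions
        ]
    return benchmark

# 80 English two-turn questions yield 4 * 80 = 320 translated examples.
questions = [{"question_id": i, "category": "writing", "turns": ["turn 1", "turn 2"]}
             for i in range(80)]
mt_bench_x = build_mt_bench_x(questions)
print(sum(len(v) for lang, v in mt_bench_x.items() if lang != "en"))  # 320
```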

The MT-Bench family of benchmarks uses the concept of LLM-as-a-judge, i.e., asking a powerful LLM to evaluate the open-ended answers generated by the model under evaluation. For complex questions, the correct answer is also provided to the LLM-as-a-judge.
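To make this concrete, here is a minimal, MT-Bench-style single-answer grading sketch; the prompt wording and the query_judge() stub are our own simplifications, not the exact judge prompts used in the study.

```python
import re

def query_judge(prompt: str) -> str:
    """Stub for a call to a strong judge model such as GPT-4; returns a canned verdict here."""
    return "The answer is mostly correct and well structured. Rating: [[7]]"

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the quality of the assistant's answer "
    "to the user question below on a scale from 1 to 10.\n"
    "{reference_block}"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}\n\n"
    "State your verdict strictly in the format: Rating: [[score]]"
)

def judge_answer(question: str, answer: str, reference: str | None = None) -> float:
    """Score one open-ended answer with the LLM-as-a-judge pattern."""
    reference_block = f"[Reference Answer]\n{reference}\n\n" if reference else ""
    verdict = query_judge(JUDGE_TEMPLATE.format(
        reference_block=reference_block, question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else float("nan")

print(judge_answer("What is 2 + 2?", "4", reference="4"))  # 7.0 with the canned verdict
```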

Figure 5: Automatic evaluation with GPT-4-as-a-judge utilizing the MT-Bench-X benchmark. © Fraunhofer IAIS

Results and Analysis 

We evaluated all model variants trained on the different language mixtures in all languages supported by MT-Bench-X, which led to the following results.  

Figure 6: Monolingual performance on MT-Bench-X of the 7B model. © Fraunhofer IAIS
Figure 7: Monolingual performance on MT-Bench-X of the 8x7B model. © Fraunhofer IAIS

Monolingual Evaluation

The results in the figures above show that models trained on monolingual datasets in languages like German (Bactrian-DE), Italian (Bactrian-IT), and Spanish (Bactrian-ES) underperform compared to the multilingual Bactrian-ENDEFRITES model. At the same time, Bactrian-EN stands out, likely due to the higher volume of English data in its pre-training. Interestingly, Bactrian-FR outperforms Bactrian-ENDEFRITES on the French benchmark. Overall, no clear advantage is observed for models tuned on the fully parallel mixed-language data in monolingual evaluation. Bactrian-X-small shows a similar pattern, but at a lower level than Bactrian-X.

Mixtral-8x7B models, being larger, consistently achieve higher scores than the 24EU-7B models stemming from the OpenGPT-X checkpoint. While instruction-tuning on cross-lingual datasets tends to improve monolingual performance for some models, this effect remains inconsistent across languages when using mixed-language fine-tuning strategies like DEFRITES or ENDEFRITES. Overall, there is no clear pattern in monolingual performance.

Cross-lingual Evaluation 

When comparing model performance across evaluation languages by the average MT-Bench-X scores, a clear pattern emerges: cross-lingual instruction-tuning takes the lead. If we examine this trend more closely in the following figures, where we subtract the monolingual performance of models trained on monolingual data compositions from the cross-lingual performance of models trained on multilingual data, we see that cross-lingual tuning outperforms monolingual approaches in most cases.

Figure 8: Cross-lingual improvement of multilingual models compared to monolingual models on MT-Bench-X for the 7B model size. © Fraunhofer IAIS
Figure 9: Cross-lingual improvement of multilingual models compared to monolingual models on MT-Bench-X for the 8x7B model size. © Fraunhofer IAIS

The figures above compare the performance of models trained on parallel language mixtures versus monolingual datasets. Each bar represents the percentage improvement of a multilingual instruction-tuned model ((EN)DEFRITES, (EN)DEFRITES-sampled) over a monolingual model (EN, DE, FR, IT, ES) on the respective dataset variant (BX: Bactrian-X, BXs: Bactrian-X-small, LX: Lima-X).
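The reported percentages appear to express score differences relative to the 10-point MT-Bench scale (the 9.9% gain mentioned further below corresponds to 0.99 points); this is our reading, sketched with made-up scores.

```python
MT_BENCH_MAX_SCORE = 10.0  # MT-Bench ratings range from 1 to 10

def improvement_pct(multilingual_score: float, monolingual_score: float) -> float:
    """Express the score difference as a share of the 10-point scale."""
    return (multilingual_score - monolingual_score) / MT_BENCH_MAX_SCORE * 100

# Illustrative (made-up) numbers: a 0.99-point gain equals 9.9 %.
print(round(improvement_pct(6.49, 5.50), 1))  # 9.9
```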

Key Observations

  1. Multilingual Tuning Enhances Performance: For 24EU-7B-based models, multilingual tuning with ENDEFRITES-sampled boosts the performance of Lima-X and Bactrian-X-small, but not of Bactrian-X. Models trained on DEFRITES-sampled data perform worse than their monolingual counterparts, probably due to smaller sample sizes. Mixtral-8x7B models show consistent gains from parallel instruction-tuning, with few exceptions.
  2. Dataset Size Matters: Larger datasets, like full-sized ENDEFRITES and DEFRITES, improve performance across languages, particularly for Bactrian-X and Lima-X.  
  3. Inclusion of English is Beneficial: Including the dominant pre-training language (ENDEFRITES vs. DEFRITES) enhances tuning outcomes. Down-sampled variants underperform, likely due to lower signal-to-noise ratios in Bactrian-X.  
  4. Model Size Influences Results: Larger models like Mixtral-8x7B consistently achieve higher scores, with few exceptions.

Mixtral-8x7B’s larger size likely compensates for noise in smaller datasets like Lima-X, with performance gains of up to 9.9% (0.99 points) for Lima-ENDEFRITES versus Lima-IT.

Overall, fine-tuning on parallel data improves multilingual instruction-following for mid- to large-sized models, outperforming monolingual training. 

Synthetic vs. Curated Training Datasets 

To isolate the impact of the dataset nature, we down-sampled the synthetic Bactrian-X datasets to match the size of the human-curated Lima-X datasets, creating Bactrian-X-small. As shown in Figure 10, the synthetic Bactrian-X datasets outperform Lima-X at both dataset scales. A clear trend also emerges when comparing Lima-X with models tuned on Bactrian-X-small in Figure 9: cross-lingual performance improves with parallel, Bactrian-X-based instruction-tuning. These findings highlight the advantages of using synthetic, semantically parallel datasets for multilingual tuning.

Challenging the Superficial Alignment Hypothesis  

We explored the Superficial Alignment Hypothesis, which suggests that only a few examples per task are needed to teach a model to follow instructions. However, our findings challenge this notion. The following figure shows that:

  1. Mid-Sized Models Require Extensive Data: At the same model scale, models tuned on Bactrian-X-small consistently outperform those tuned on the curated Lima-X datasets.
  2. Larger Models Show Different Trends: With the larger Mixtral-8x7B model, we see high performance across both synthetic and curated data, implying that the Superficial Alignment Hypothesis may hold better for larger models or more advanced pre-training.
Figure 10: Comparison of the cross-lingual performance on MT-Bench-X of different model sizes. © Fraunhofer IAIS

Human Evaluation Insights 

We extended the findings of Zheng et al. (2023) for German and analyzed the similarities and differences between human evaluation and evaluation with GPT-4 as a judge. We identified disparities between human evaluations and those generated by GPT-4 in multilingual chat scenarios. More details can be found in our paper.
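A typical way to quantify such (dis)agreement is a rank correlation over paired scores. The sketch below, with made-up values, shows the idea; it is not the exact analysis from the paper.

```python
from scipy.stats import spearmanr

# Made-up paired ratings of the same German answers: one set from human
# annotators, one from GPT-4-as-a-judge (illustrative values only).
human_scores = [7, 4, 9, 6, 8, 3, 5, 7]
gpt4_scores = [8, 5, 9, 5, 7, 4, 6, 6]

rho, p_value = spearmanr(human_scores, gpt4_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```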

Key Takeaways 

  • Parallel Datasets Enhance Cross-Lingual Performance: Our findings highlight the benefits of instruction-tuning on parallel datasets, showing improvements of up to 9.9% in cross-lingual instruction-following ability compared to monolingual corpora. 
  • Synthetic Data Outperforms Curated Data: Down-sampling the synthetic Bactrian-X datasets to the size of the curated Lima-X datasets showed that Bactrian-X consistently outperforms Lima-X, highlighting the clear advantage of synthetic, semantically parallel datasets for improving cross-lingual performance in multilingual tuning.
  • Superficial Alignment Hypothesis Challenged: Our results show that extensive instruction-tuning datasets are necessary for mid-sized multilingual models but not necessarily for larger models, thus challenging the generalizability of the Superficial Alignment Hypothesis. 
  • New Resources Published: We publish the novel multilingual training resource Lima-X and the multilingual evaluation resource MT-Bench-X, contributing valuable tools to the NLP community. 

Future Work & Limitations 

While our study provides valuable insights into multilingual instruction-tuning of large language models (LLMs), it has certain limitations. We focused on exploring multilingual instruction-tuning techniques rather than pushing for state-of-the-art performance. Additionally, our research was confined to Germanic and Italo-Western languages, leaving the generalizability to other language families untested. 

Future work should aim to extend instruction-tuning methodologies across a broader range of languages to enhance the global performance of multilingual LLMs. Investigating multilingual multi-turn datasets could offer deeper insights into complex instruction-following capabilities. Moreover, improving the cost-efficiency and accuracy of automatic multilingual evaluation methods will significantly benefit the Natural Language Processing (NLP) community by making advanced evaluation more accessible. 

Alexander Weber

Alexander Weber is a PhD candidate at Fraunhofer IAIS, where he works on developing language models. He is particularly interested in advancing these models to make them as useful and easily integrable into customers’ or everyday applications as possible. His academic journey began with his thesis on generative language models for evaluating generated recipes, as part of his Computer Science degree at the Technical University of Dresden.
