How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English→German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
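The abstract names majority voting and the metrics MCC, F1-ERR and F1-NOT without spelling them out. The following is a minimal sketch of how these could be computed for binary ERR/NOT labels; it is not the authors' code. The label strings, the function names, and the tie-breaking rule (ties resolve to ERR) are assumptions made for illustration.

```python
import math
from collections import Counter


def majority_vote(votes, tie_label="ERR"):
    """Return the most frequent label among repeated predictions.

    Assumption: binary labels "ERR"/"NOT"; ties fall back to tie_label.
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else tie_label


def confusion(gold, pred, positive):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for g, p in zip(gold, pred):
        if p == positive:
            if g == positive:
                tp += 1
            else:
                fp += 1
        else:
            if g == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def mcc(gold, pred, positive="ERR"):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion(gold, pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1(gold, pred, positive):
    """F1 for one class: F1-ERR with positive="ERR", F1-NOT with "NOT"."""
    tp, _, fp, fn = confusion(gold, pred, positive)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


if __name__ == "__main__":
    # Three sampled predictions per segment (hypothetical data).
    sampled = [["ERR", "NOT", "ERR"], ["NOT", "NOT", "ERR"]]
    pred = [majority_vote(v) for v in sampled]
    gold = ["ERR", "NOT"]
    print(pred, mcc(gold, pred), f1(gold, pred, "ERR"), f1(gold, pred, "NOT"))
```

Reporting F1 separately per class alongside MCC matters when the ERR and NOT classes are imbalanced, since a single accuracy figure can hide poor detection of the rarer class.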
- Published in: arXiv
- Type: Article
- Authors: Chopra, Muskaan; Sparrenberg, Lorenz; Khanna, Sarthak; Sifa, Rafet
- Year: 2025
- Source: http://arxiv.org/abs/2511.09748
Citation information
Chopra, Muskaan; Sparrenberg, Lorenz; Khanna, Sarthak; Sifa, Rafet: How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation, arXiv:2511.09748, November 2025, http://arxiv.org/abs/2511.09748
@Article{Chopra.etal.2025a,
author={Chopra, Muskaan and Sparrenberg, Lorenz and Khanna, Sarthak and Sifa, Rafet},
title={How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation},
journal={arXiv},
number={{arXiv}:2511.09748},
month={November},
publisher={{arXiv}},
url={http://arxiv.org/abs/2511.09748},
year={2025},
abstract={Large Language Models ({LLMs}) excel at evaluating machine translation ({MT}), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection ({CED}), we benchmark sub-2B models ({LFM}2-350M, Qwen-3-0.6B/1.7B,...}}