In-Training Defenses against Emergent Misalignment in Language Models
Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this inadvertently gives attackers access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) ℓ2 distance in feature space, (iii) projection onto a safe subspace (SafeLoRA), and (iv) interleaving a small number of safe training examples from a general instruction-tuning dataset. We first evaluate the methods’ effect on emergent misalignment across four malicious, EMA-inducing tasks, and then assess their impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
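As a rough illustration of intervention (i), the sketch below adds a KL-divergence penalty toward a frozen, aligned reference model to the standard fine-tuning loss. This is a minimal sketch, not the authors' code: it assumes Hugging Face-style causal-LM outputs (.loss, .logits), and the coefficient kl_coeff and the function name regularized_loss are illustrative choices rather than values from the paper. The ℓ2 variant (ii) would analogously penalize the distance between the two models' hidden states instead of their output distributions.

import torch
import torch.nn.functional as F

def regularized_loss(model, ref_model, batch, kl_coeff=0.1):
    # Task loss: standard causal-LM cross-entropy on the fine-tuning data.
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
    task_loss = outputs.loss

    # The safe reference model stays frozen; only its logits are needed.
    with torch.no_grad():
        ref_logits = ref_model(input_ids=batch["input_ids"],
                               attention_mask=batch["attention_mask"]).logits

    # Token-level KL(p_model || p_ref), averaged over non-padding positions.
    log_p = F.log_softmax(outputs.logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    kl_per_token = (log_p.exp() * (log_p - log_q)).sum(dim=-1)   # [batch, seq]
    mask = (batch["labels"] != -100).float()
    kl = (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)

    # Pull the fine-tuned model back toward the safe reference distribution.
    return task_loss + kl_coeff * kl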
- Published in: arXiv
- Type: Article
- Year: 2025
- Source: https://arxiv.org/abs/2508.06249
Citation information: In-Training Defenses against Emergent Misalignment in Language Models, arXiv, 2025, https://arxiv.org/abs/2508.06249, kaczer.etal.2025a
@Article{kaczer.etal.2025a,
  author   = {Kaczer, David and Jorgenvaag, Magnus and Vetter, Clemens and Flek, Lucie and Mai, Florian},
  title    = {In-Training Defenses against Emergent Misalignment in Language Models},
  journal  = {arXiv},
  url      = {https://arxiv.org/abs/2508.06249},
  year     = {2025},
  abstract = {Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a...}
}