In-Training Defenses against Emergent Misalignment in Language Models

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) ℓ2 distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving a small number of safe training examples from a general instruct-tuning dataset. We first evaluate each method’s effect on emergent misalignment across four malicious, EMA-inducing tasks. Second, we assess the methods’ impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
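
To make interventions (i) and (iv) from the abstract more concrete, the following is a minimal sketch in plain Python and not the authors' implementation. The function names, the penalty weight `lam`, the mixing ratio `safe_fraction`, and the direction of the KL term (fine-tuned model relative to the reference) are all assumptions chosen for illustration; the abstract does not specify them.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard next-token loss for a single position."""
    return -math.log(softmax(logits)[target])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kl_regularized_loss(student_logits, reference_logits, target, lam=0.1):
    """Task loss plus a penalty for drifting away from a safe reference model.

    `lam` (hypothetical name) trades off task fit against staying close to
    the reference. The KL direction used here is an assumption.
    """
    task = cross_entropy(student_logits, target)
    penalty = kl_divergence(softmax(student_logits), softmax(reference_logits))
    return task + lam * penalty


def interleave(task_examples, safe_examples, safe_fraction=0.1, seed=0):
    """Mix a small number of safe examples into the fine-tuning set.

    `safe_fraction` (hypothetical name) is the number of safe examples
    added, expressed relative to the size of the task set.
    """
    rng = random.Random(seed)
    n_safe = min(round(safe_fraction * len(task_examples)), len(safe_examples))
    mixed = list(task_examples) + rng.sample(list(safe_examples), n_safe)
    rng.shuffle(mixed)
    return mixed
```

When the fine-tuned and reference logits coincide, the penalty vanishes and the loss reduces to the ordinary task loss; the penalty grows as the fine-tuned distribution moves away from the reference. In a real training setup the same idea would be applied per token over batched tensors in a deep-learning framework.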

  • Published in:
    arXiv
  • Type:
    Article
  • Authors:
    Kaczer, David; Jorgenvaag, Magnus; Vetter, Clemens; Flek, Lucie; Mai, Florian
  • Year:
    2025
  • Source:
    https://arxiv.org/abs/2508.06249

Citation information

Kaczer, David; Jorgenvaag, Magnus; Vetter, Clemens; Flek, Lucie; Mai, Florian: In-Training Defenses against Emergent Misalignment in Language Models, arXiv, 2025, https://arxiv.org/abs/2508.06249

Associated Lamarr Researchers

Prof. Dr. Lucie Flek

Area Chair NLP

Dr. Florian Mai

Scientific Coordinator NLP