A New Aligned Simple German Corpus

“Leichte Sprache”, the German counterpart to Simple English, is a regulated language that aims to make accessible complex written language that would otherwise remain inaccessible to different groups of people. We present a new sentence-aligned monolingual corpus pairing Simple German with German. It comprises multiple document-aligned sources, which we aligned using automatic sentence-alignment methods. We evaluate our alignments on a manually labelled subset of aligned documents. The quality of our sentence alignments, as measured by F1-score, surpasses that of previous work. We publish the dataset under CC BY-SA and the accompanying code under the MIT license.

  • Published in:
    Annual Meeting of the Association for Computational Linguistics
  • Type:
    Inproceedings
  • Authors:
    Toborek, Vanessa; Busch, Moritz; Boßert, Malte; Bauckhage, Christian; Welke, Pascal
  • Year:
    2023

Citation information

Toborek, Vanessa; Busch, Moritz; Boßert, Malte; Bauckhage, Christian; Welke, Pascal: A New Aligned Simple German Corpus, Annual Meeting of the Association for Computational Linguistics, 2023, https://aclanthology.org/2023.acl-long.638/, Toborek.etal.2023a

Associated Lamarr Researchers

Vanessa Toborek

Author

Prof. Dr. Christian Bauckhage

Director

Pascal Welke

Author