ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset

Despite Arabic being one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge with a novel approach to data collection that draws primarily on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the ArDia dataset. To the best of our knowledge, ArDia is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain two transformer-based models, ArDiaBERT and ArDiaGPT. Because research on Arabic models is limited, we also present a comprehensive study of Arabic dialect identification using the ArDia dataset.

  • Published in:
    Proceedings of the International AAAI Conference on Web and Social Media
  • Type:
    Article
  • Authors:
    Elsafty, Hossam; Abdou, Bouthaina; Deußer, Tobias; Pielka, Maren; Bauckhage, Christian; Sifa, Rafet
  • Year:
    2025
  • Source:
    https://ojs.aaai.org/index.php/ICWSM/article/view/35944

Citation information

Elsafty, Hossam; Abdou, Bouthaina; Deußer, Tobias; Pielka, Maren; Bauckhage, Christian; Sifa, Rafet: ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset, Proceedings of the International AAAI Conference on Web and Social Media, 2025, 19, 2413–2422, June, https://ojs.aaai.org/index.php/ICWSM/article/view/35944