ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset
Despite Arabic being one of the most widely spoken languages, there is a scarcity of available dialectal Arabic data. In this paper, we address this challenge by proposing a novel approach to data collection through the main use of video captions from TikTok, and other resources such as dictionaries and articles, resulting in the creation of the ArDia dataset. To the best of our knowledge, the ArDia dataset is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain transformer-based models, ArDiaBERT and ArDiaGPT. Due to a lack of research on the Arabic models, we present a comprehensive study of Arabic dialect identification using the ArDia dataset on the dialect identification task.
- Published in: Proceedings of the International AAAI Conference on Web and Social Media
- Type: Article
- Authors: Hossam Elsafty, Bouthaina Abdou, Tobias Deußer, Maren Pielka, Christian Bauckhage, Rafet Sifa
- Year: 2025
- Source: https://ojs.aaai.org/index.php/ICWSM/article/view/35944
Citation information
Hossam Elsafty, Bouthaina Abdou, Tobias Deußer, Maren Pielka, Christian Bauckhage, Rafet Sifa: ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset, Proceedings of the International AAAI Conference on Web and Social Media, 19, 2413–2422, June 2025, https://ojs.aaai.org/index.php/ICWSM/article/view/35944
@Article{Elsafty.etal.2025a,
author={Elsafty, Hossam and Abdou, Bouthaina and Deußer, Tobias and Pielka, Maren and Bauckhage, Christian and Sifa, Rafet},
title={{ArDia}: Improving Arabic Dialectal Language Classification Using a Novel Dataset},
journal={Proceedings of the International {AAAI} Conference on Web and Social Media},
volume={19},
pages={2413--2422},
month={June},
url={https://ojs.aaai.org/index.php/ICWSM/article/view/35944},
year={2025},
abstract={Despite Arabic being one of the most widely spoken languages, there is a scarcity of available dialectal Arabic data. In this paper, we address this challenge by proposing a novel approach to data collection through the main use of video captions from {TikTok}, and other resources such as dictionaries and articles, resulting in the creation of the {ArDia} dataset. To the best of our knowledge,...}}