{"id":32271,"date":"2026-01-21T17:01:36","date_gmt":"2026-01-21T17:01:36","guid":{"rendered":"https:\/\/lamarr-institute.org\/publication\/ardia-improving-arabic-dialectal-language-classification-using-a-novel-dataset\/"},"modified":"2026-01-21T17:19:45","modified_gmt":"2026-01-21T17:19:45","slug":"ardia-improving-arabic-dialectal-language-classification-using-a-novel-dataset","status":"publish","type":"publication","link":"https:\/\/lamarr-institute.org\/de\/publication\/ardia-improving-arabic-dialectal-language-classification-using-a-novel-dataset\/","title":{"rendered":"ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset"},"content":{"rendered":"<p>Although Arabic is one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge by proposing a novel approach to data collection that relies mainly on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the creation of the ArDia dataset. To the best of our knowledge, the ArDia dataset is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain transformer-based models, ArDiaBERT and ArDiaGPT. Given the lack of research on Arabic models, we present a comprehensive study of Arabic dialect identification using the ArDia dataset.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Although Arabic is one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge by proposing a novel approach to data collection that relies mainly on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the creation of the ArDia dataset. 
To the best of our knowledge, the ArDia dataset is the [&hellip;]<\/p>\n","protected":false},"author":12,"featured_media":0,"template":"","meta":{"_acf_changed":false,"footnotes":""},"publication-type":[30],"class_list":["post-32271","publication","type-publication","status-publish","hentry","publication-type-article"],"acf":[],"publishpress_future_workflow_manual_trigger":{"enabledWorkflows":[]},"_links":{"self":[{"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/publication\/32271","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/publication"}],"about":[{"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/types\/publication"}],"author":[{"embeddable":true,"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/users\/12"}],"version-history":[{"count":0,"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/publication\/32271\/revisions"}],"wp:attachment":[{"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/media?parent=32271"}],"wp:term":[{"taxonomy":"publication-type","embeddable":true,"href":"https:\/\/lamarr-institute.org\/de\/wp-json\/wp\/v2\/publication-type?post=32271"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}