Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language, meaning that there are few data sets available, which leads to limited availability of NLP methods. To overcome this limitation, we create a dedicated data set from publicly available resources. Subsequently, transformer-based machine learning models are trained and evaluated. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches when we apply linguistically informed pre-training methods such as Named Entity Recognition (NER). To our knowledge, this is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.
- Published in: 2023 IEEE Symposium Series on Computational Intelligence (SSCI)
- Type: Inproceedings
- Authors: Majd Saad Al Deen, Mohammad; Pielka, Maren; Hees, Jörn; Soulef Abdou, Bouthaina; Sifa, Rafet
- Year: 2023
- Source: https://ieeexplore.ieee.org/document/10371891
Citation information
Majd Saad Al Deen, Mohammad; Pielka, Maren; Hees, Jörn; Soulef Abdou, Bouthaina; Sifa, Rafet: Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training, 2023 IEEE Symposium Series on Computational Intelligence (SSCI), 2023, https://ieeexplore.ieee.org/document/10371891
@Inproceedings{MajdSaadAlDeen.etal.2023a,
author={Majd Saad Al Deen, Mohammad and Pielka, Maren and Hees, Jörn and Soulef Abdou, Bouthaina and Sifa, Rafet},
title={Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training},
booktitle={2023 IEEE Symposium Series on Computational Intelligence (SSCI)},
url={https://ieeexplore.ieee.org/document/10371891},
year={2023},
abstract={This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language, meaning that there are few data sets available, which leads to limited availability of NLP methods. To overcome this limitation, we create a...}}