---
license: apache-2.0
datasets:
- Abdelrahman-Rezk/Arabic_Dialect_Identification
language:
- ar
metrics:
- accuracy
- f1
base_model:
- UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
---

This is a fine-tuned version of the MARBERTv2 transformer for Arabic dialect classification using the QADI dataset. The model classifies input text into one of 5 Arabic regional dialects: Gulf, Egyptian, Levantine, Mesopotamian (Iraqi), and Maghrebi.

For the 18-dialect version, click here: **https://huggingface.co/oahmedd/MARBERTv2-Finetuned-on-QADI-dataset**

**Dataset:** QADI (Qatar Arabic Dialect Identification)
- ~440,000 tweets covering 18 Arabic dialects
- https://huggingface.co/datasets/Abdelrahman-Rezk/Arabic_Dialect_Identification (by Abdelrahman Rezk et al.)
- We introduced a new column, `arabic_region`, which groups the 18 dialects into 5 Arabic regions: Gulf, Egyptian, Levantine, Mesopotamian (Iraqi), and Maghrebi.

**Evaluation:** The model achieves 85% accuracy and 85% F1-score overall.

This table shows the model's per-class F1 performance after grouping the 18 Arabic dialects into 5 major regional classes:

| Arabic Region | F1 Score |
|---------------|----------|
| Gulf          | 89.0     |
| Egyptian      | 83.9     |
| Levantine     | 83.3     |
| Iraqi         | 70.6     |
| Maghrebi      | 81.7     |

**Usage:**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
```

**For Dialect Classification Inference:**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Label order must match the order used during fine-tuning
DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]

model = AutoModelForSequenceClassification.from_pretrained(
    "oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI"
)
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
model.eval()

text = "ازيك يصاحبي عامل ايه، ايه الاخبار"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.inference_mode():
    logits = model(**inputs).logits

prediction = torch.argmax(logits, dim=1).item()
predicted_dialect = DIALECT_LABELS[prediction]
print(f"Predicted Dialect: {predicted_dialect}")
```

**Citation:** If you find this helpful, please cite our work.

```
@article{essameldin2025arabic,
  title={Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis},
  author={Essameldin, Omar A and Elbeih, Ali O and Gomaa, Wael H and Elsersy, Wael F},
  journal={arXiv preprint arXiv:2506.19753},
  year={2025}
}
```
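**Tip:** If you need confidence scores for all five dialects rather than just the top label, you can apply a softmax to the logits. The sketch below uses the same label order as the inference example; the `logits` tensor here is a dummy stand-in for `model(**inputs).logits`:

```python
import torch

DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]

def dialect_probabilities(logits: torch.Tensor) -> dict:
    """Map a (1, 5) logits tensor to a {dialect: probability} dict."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {label: p.item() for label, p in zip(DIALECT_LABELS, probs)}

# Dummy logits standing in for model(**inputs).logits
logits = torch.tensor([[2.0, 4.0, 1.0, 0.5, 0.5]])
scores = dialect_probabilities(logits)
top_dialect = max(scores, key=scores.get)
print(top_dialect, round(scores[top_dialect], 3))
```

The probabilities sum to 1, so they can be reported directly as per-dialect confidence scores alongside the predicted label.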