---
license: apache-2.0
datasets:
- Abdelrahman-Rezk/Arabic_Dialect_Identification
language:
- ar
metrics:
- accuracy
- f1
base_model:
- UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
---

This is a fine-tuned version of the MARBERTv2 transformer for Arabic dialect classification using the QADI dataset. The model classifies input text into one of 5 Arabic regional dialects: Gulf, Egyptian, Levantine, Mesopotamian (Iraqi), and Maghrebi.

For the 18-dialect version, click here: **https://huggingface.co/oahmedd/MARBERTv2-Finetuned-on-QADI-dataset**

**Dataset:** QADI (Qatar Arabic Dialect Identification)
- ~440,000 tweets covering 18 Arabic dialects
- https://huggingface.co/datasets/Abdelrahman-Rezk/Arabic_Dialect_Identification (by Abdelrahman Rezk et al.)
- We introduced a new column, `arabic_region`, which groups the 18 dialects into 5 Arabic regions: Gulf, Egyptian, Levantine, Mesopotamian (Iraqi), and Maghrebi.

**Evaluation:** The model achieves 85% accuracy and 85% F1-score overall.

This table shows the model's per-class F1 performance after grouping the 18 Arabic dialects into 5 major regional classes:

| Arabic Region | F1 Score |
|---------------|----------|
| Gulf          | 89.0     |
| Egyptian      | 83.9     |
| Levantine     | 83.3     |
| Iraqi         | 70.6     |
| Maghrebi      | 81.7     |

**Usage:**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
```

**For Dialect Classification Inference:**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Label order must match the order used during fine-tuning
DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]

model = AutoModelForSequenceClassification.from_pretrained(
    "oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI"
)
tokenizer = AutoTokenizer.from_pretrained("oahmedd/MARBERTv2-Finetuned-on-5-Dialects-QADI")
model.eval()

text = "ازيك يصاحبي عامل ايه، ايه الاخبار"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.inference_mode():
    logits = model(**inputs).logits

prediction = torch.argmax(logits, dim=1).item()
predicted_dialect = DIALECT_LABELS[prediction]
print(f"Predicted Dialect: {predicted_dialect}")
```

**Citation:** If you find this helpful, please cite our work.

```
@article{essameldin2025arabic,
  title={Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis},
  author={Essameldin, Omar A and Elbeih, Ali O and Gomaa, Wael H and Elsersy, Wael F},
  journal={arXiv preprint arXiv:2506.19753},
  year={2025}
}
```
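**Tip:** If you need confidence scores for all five dialects rather than just the top label, you can apply a softmax to the logits. The sketch below uses the same label order as the inference example; the `logits` tensor here is a dummy stand-in for `model(**inputs).logits`:

```python
import torch

DIALECT_LABELS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]

def dialect_probabilities(logits: torch.Tensor) -> dict:
    """Map a (1, 5) logits tensor to a {dialect: probability} dict."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {label: p.item() for label, p in zip(DIALECT_LABELS, probs)}

# Dummy logits standing in for model(**inputs).logits
logits = torch.tensor([[2.0, 4.0, 1.0, 0.5, 0.5]])
scores = dialect_probabilities(logits)
top_dialect = max(scores, key=scores.get)
print(top_dialect, round(scores[top_dialect], 3))
```

The probabilities sum to 1, so they can be reported directly as per-dialect confidence scores alongside the predicted label.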