## 🗺️ Tunisian NLP Leaderboard Roadmap
### 📌 Phase 1: Dataset Acquisition & Preparation
#### 1. **Sentiment Analysis**
* **Existing Dataset**: **TUNIZI**
* **Description**: A large dataset containing 100,000 Tunisian Arabizi comments annotated as positive, negative, or neutral.
* **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
* **Usage**: Use this dataset to evaluate models on sentiment classification.
#### 2. **Named Entity Recognition (NER)**
* **Existing Dataset**: **ArabNER**
* **Description**: A comprehensive Arabic NER corpus that can be adapted for Tunisian dialects.
* **Source**: [ResearchGate](https://www.researchgate.net/publication/374279027_Named_Entity_Recognition_of_Tunisian_Arabic_Using_the_Bi-LSTM-CRF_Model)
* **Usage**: Fine-tune models on this dataset to assess their ability to recognize entities in Tunisian Arabic text.
#### 3. **Corpus Coverage**
* **Existing Dataset**: **Tunisian Dialect Corpus**
* **Description**: A sizable collection of Tunisian dialect texts, useful for assessing vocabulary coverage.
* **Source**: [Hugging Face](https://huggingface.co/collections/tunis-ai/arabic-datasets-66344cf0df31dc81eb1dcf55)
* **Usage**: Evaluate models' coverage of the Tunisian dialect vocabulary using this corpus.
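
One way to make "vocabulary coverage" concrete is the share of corpus tokens that appear in a model's vocabulary. The sketch below assumes whitespace tokenization and a plain Python set as the vocabulary; both are simplifications, since a real encoder uses a subword tokenizer, where coverage is more often measured as the share of tokens not mapped to the unknown token. The function name `vocabulary_coverage` and the sample strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def vocabulary_coverage(texts: Iterable[str], vocab: Set[str]) -> Dict[str, float]:
    """Return token-level and type-level coverage (in %) of `texts` by `vocab`.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return {"token_coverage": 0.0, "type_coverage": 0.0}
    # Token coverage weights each word by its frequency in the corpus.
    covered_tokens = sum(n for tok, n in counts.items() if tok in vocab)
    # Type coverage counts each distinct word once.
    covered_types = sum(1 for tok in counts if tok in vocab)
    return {
        "token_coverage": 100.0 * covered_tokens / total_tokens,
        "type_coverage": 100.0 * covered_types / len(counts),
    }


if __name__ == "__main__":
    sample = ["barcha 3asslema", "3asslema labes"]  # made-up Arabizi snippets
    print(vocabulary_coverage(sample, {"3asslema", "labes"}))
```

Reporting both numbers is a design choice: token coverage reflects what a model sees in running text, while type coverage exposes gaps in the long tail of rare dialect words.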
#### 4. **Arabizi Robustness**
* **Existing Dataset**: **TUNIZI**
* **Description**: Because TUNIZI is written in Arabizi (Tunisian Arabic in Latin script), it can also be used to evaluate models' robustness to this writing style.
* **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
* **Usage**: Assess models' robustness to Arabizi by evaluating their performance on this dataset.
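
When a robustness evaluation mixes inputs in several scripts, it can help to bucket texts by script before scoring. The following is a minimal sketch, assuming that a simple character-level heuristic is good enough; `latin_ratio`, `split_by_script`, and the 0.5 threshold are hypothetical choices, not part of any dataset tooling. Note that it detects script, not language: French or English text also lands in the Latin bucket, and the digits Arabizi uses as letters (such as `3` or `7`) are ignored.

```python
import unicodedata
from typing import Dict, Iterable, List


def latin_ratio(text: str) -> float:
    """Share of alphabetic characters in `text` that belong to the Latin script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names start with the script name, e.g. "LATIN SMALL LETTER A".
    latin = sum(1 for c in letters if unicodedata.name(c, "").startswith("LATIN"))
    return latin / len(letters)


def split_by_script(texts: Iterable[str], threshold: float = 0.5) -> Dict[str, List[str]]:
    """Bucket texts into mostly-Latin ("latin") and everything else ("other")."""
    buckets: Dict[str, List[str]] = {"latin": [], "other": []}
    for text in texts:
        key = "latin" if latin_ratio(text) >= threshold else "other"
        buckets[key].append(text)
    return buckets
```

Scoring a model separately on each bucket then shows how much performance changes between Arabic-script and Latin-script inputs.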
#### 5. **Code-Switching**
* **Existing Dataset**: **TunSwitch**
* **Description**: A dataset of code-switched Tunisian Arabic speech, valuable for training and evaluating models on code-switching tasks.
* **Source**: [Zenodo](https://zenodo.org/records/8342762)
* **Usage**: Evaluate models' ability to handle code-switching between Tunisian Arabic and other languages using this dataset.
#### 6. **Typo Robustness**
* **Existing Dataset**: **TUNIZI**
* **Description**: TUNIZI's informal, user-written comments contain typographical variations, making the dataset suitable for evaluating models' tolerance to typos.
* **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
* **Usage**: Assess models' robustness to typographical errors by evaluating their performance on this dataset.
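
Naturally occurring typos cannot be controlled, so one possible complement (an assumption here, not something the roadmap specifies) is to inject synthetic typos at a known rate and compare scores before and after. The sketch below uses only the standard library; `inject_typos`, its three perturbation types, and the default rate are hypothetical.

```python
import random


def inject_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Perturb characters by swapping with the next one, deleting, or duplicating.

    `rate` is the per-character probability of a perturbation.
    Spaces are never modified, so word boundaries are preserved.
    """
    rng = random.Random(seed)  # seeded so the perturbed set is reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c != " " and rng.random() < rate:
            op = rng.choice(("swap", "delete", "duplicate"))
            if op == "swap" and i + 1 < len(chars) and chars[i + 1] != " ":
                out.extend([chars[i + 1], c])
                i += 2
                continue
            if op == "delete":
                i += 1
                continue
            # "duplicate", or a swap that is not possible at this position
            out.extend([c, c])
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)
```

Fixing the seed keeps the perturbed evaluation set identical across models, so score differences reflect the models rather than the noise.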
#### 7. **Zero-Shot Transfer**
* **Existing Dataset**: **TUNIZI**
* **Description**: TUNIZI can be used to test models' ability to generalize to a task they were not explicitly trained on.
* **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
* **Usage**: Evaluate zero-shot transfer by scoring models on this dataset without first fine-tuning them on it.
#### 8. **Domain Shift**
* **Existing Dataset**: **TUNIZI**
* **Description**: Its diverse sources provide a foundation for testing domain adaptation capabilities.
* **Source**: [K4All Foundation](https://k4all.org/project/database-tunisian-arabizi/)
* **Usage**: Assess models' ability to adapt to different domains by evaluating their performance on this dataset.
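
A domain-shift evaluation needs scores broken down per domain. The sketch below assumes each example is a dict with `domain` and `label` keys; that record layout, the function names, and the gap definition are all hypothetical, and the dataset itself may not ship a domain field.

```python
from collections import defaultdict
from typing import Dict, Sequence


def accuracy_by_domain(examples: Sequence[dict], predictions: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy separately for each value of example["domain"]."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        domain = example["domain"]
        total[domain] += 1
        correct[domain] += int(example["label"] == predicted)
    return {domain: correct[domain] / total[domain] for domain in total}


def domain_gap(per_domain: Dict[str, float], source: str) -> float:
    """Accuracy on `source` minus the lowest accuracy on any other domain."""
    others = [acc for domain, acc in per_domain.items() if domain != source]
    if not others:
        raise ValueError("need at least one domain other than the source")
    return per_domain[source] - min(others)
```

A gap near zero suggests the model transfers well; a large positive gap points to a domain it handles poorly.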
---
### 🧪 Phase 2: Metric Development & Evaluation Tasks
For each task, define the evaluation metric and the corresponding dataset:
| Task | Metric | Dataset |
| ------------------------ | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| Sentiment Analysis | Accuracy / F1 Score | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
| Named Entity Recognition | F1 Score | [ArabNER](https://www.researchgate.net/publication/374279027_Named_Entity_Recognition_of_Tunisian_Arabic_Using_the_Bi-LSTM-CRF_Model) |
| Corpus Coverage | Vocabulary Coverage (%) | [Tunisian Dialect Corpus](https://huggingface.co/collections/tunis-ai/arabic-datasets-66344cf0df31dc81eb1dcf55) |
| Arabizi Robustness | Accuracy / F1 Score | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
| Code-Switching | Accuracy / F1 Score | [TunSwitch](https://zenodo.org/records/8342762) |
| Typo Robustness | Accuracy / F1 Score | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
| Zero-Shot Transfer | Accuracy / F1 Score | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
| Domain Shift | Accuracy / F1 Score | [TUNIZI](https://k4all.org/project/database-tunisian-arabizi/) |
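
The table lists "Accuracy / F1 Score" without saying how F1 is averaged across classes. The sketch below assumes macro-averaged F1, a common choice when sentiment labels are imbalanced, and writes both metrics in plain Python so the formulas are visible. In practice, scikit-learn's `accuracy_score` and `f1_score(average="macro")` compute the same quantities.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging gives every class equal weight, so a model that ignores a rare class (for example, neutral) is penalized even if its accuracy looks high.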
---
### 🗂️ Suggested Folder Structure
To maintain organization and clarity, consider the following structure:
```
TunisianEncoderModelsLeaderboard/
├── datasets/
│   ├── sentiment/
│   │   └── tunizi.json
│   ├── ner/
│   │   └── arabner.json
│   ├── coverage/
│   │   └── tunisian_dialect_corpus.json
│   ├── arabizi_robustness/
│   │   └── tunizi.json
│   ├── code_switching/
│   │   └── tunswitch.json
│   ├── typo_robustness/
│   │   └── tunizi_with_typos.json
│   ├── zero_shot/
│   │   └── tunizi.json
│   └── domain_shift/
│       └── tunisian_domain_shift.json
├── scripts/
│   ├── preprocess.py
│   ├── evaluate.py
│   └── visualize.py
└── README.md
```
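
With this layout, an evaluation script can discover tasks from the folder names instead of hard-coding them. The following sketch shows one way `scripts/evaluate.py` might do that; the helper names are hypothetical, and it assumes each dataset file is a JSON array of records (for example `{"text": ..., "label": ...}`), which the roadmap does not specify.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def discover_datasets(root: Path) -> Dict[str, List[Path]]:
    """Map each task folder under `root/datasets` to its sorted JSON files."""
    datasets_dir = Path(root) / "datasets"
    tasks: Dict[str, List[Path]] = {}
    for task_dir in sorted(p for p in datasets_dir.iterdir() if p.is_dir()):
        files = sorted(task_dir.glob("*.json"))
        if files:  # skip task folders that have no data yet
            tasks[task_dir.name] = files
    return tasks


def load_examples(path: Path) -> list:
    """Load one dataset file. Assumes the file holds a JSON array of records."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return data


if __name__ == "__main__":
    # Build a throwaway copy of part of the layout to demonstrate discovery.
    root = Path(tempfile.mkdtemp())
    task_dir = root / "datasets" / "sentiment"
    task_dir.mkdir(parents=True)
    (task_dir / "tunizi.json").write_text(
        json.dumps([{"text": "behi barcha", "label": "positive"}]), encoding="utf-8"
    )
    for task, files in discover_datasets(root).items():
        print(task, [f.name for f in files], len(load_examples(files[0])))
```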
---
### ✅ Next Steps
1. **Integrate Existing Datasets**: Incorporate the datasets mentioned above into your repository, ensuring they are properly formatted and documented.
2. **Develop Evaluation Scripts**: Write scripts to evaluate models on each task, ensuring they are compatible with the leaderboard format.
3. **Populate the Leaderboard**: As models are evaluated, update the leaderboard to reflect their performance across tasks.
4. **Documentation**: Update the README.md file to provide clear instructions on how to use the leaderboard, contribute models, and interpret results.
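
Populating the leaderboard can be as simple as rendering per-task scores into a Markdown table. The sketch below assumes results arrive as a nested dict of the form `{model: {task: score}}`; that structure, the function name `render_leaderboard`, and the ranking by average score are hypothetical choices rather than the project's actual format.

```python
from typing import Dict


def render_leaderboard(results: Dict[str, Dict[str, float]]) -> str:
    """Render {model: {task: score}} as a Markdown table, best average first."""
    tasks = sorted({task for scores in results.values() for task in scores})

    def average(scores: Dict[str, float]) -> float:
        # Averages only over the tasks a model was actually evaluated on.
        return sum(scores.values()) / len(scores) if scores else 0.0

    header = "| Model | " + " | ".join(tasks) + " | Average |"
    divider = "| --- " * (len(tasks) + 2) + "|"
    rows = []
    ranked = sorted(results.items(), key=lambda item: average(item[1]), reverse=True)
    for model, scores in ranked:
        cells = [f"{scores[t]:.2f}" if t in scores else "n/a" for t in tasks]
        rows.append(f"| {model} | " + " | ".join(cells) + f" | {average(scores):.2f} |")
    return "\n".join([header, divider, *rows])
```

Averaging over available tasks only can favor models evaluated on fewer, easier tasks; requiring a full set of scores before ranking is a stricter alternative.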