---
language:
- multilingual
- bg
- en
- fr
- de
- ru
- es
- sw
- tr
- vi
tags:
- deberta
- deberta-v3
- mdeberta
license: mit
---

# mdeberta-v3-base-lite

This model was created through vocabulary pruning of the original [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) model, reducing its size while maintaining quality for the supported Latin- and Cyrillic-script languages.
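
The exact pruning pipeline is not published here; the sketch below only illustrates the general idea. It assumes the kept token ids are collected by tokenizing a corpus in the target languages and then slicing the input embedding matrix down to those rows (rebuilding the SentencePiece tokenizer so its ids match the remapped rows is a further step omitted here):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the original model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")

# Collect the token ids actually produced when tokenizing a corpus
# in the target languages (a tiny stand-in corpus here).
corpus = [
    "This is an example text in English.",
    "Künstliche Intelligenz lernt, menschliche Sprachen zu verstehen.",
]
kept_ids = set()
for text in corpus:
    kept_ids.update(tokenizer(text)["input_ids"])
kept_ids.update(tokenizer.all_special_ids)  # always keep special tokens
kept_ids = sorted(kept_ids)

# Slice the input embedding matrix down to the kept rows.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(kept_ids), old_embeddings.shape[1])
new_embeddings.weight.data = old_embeddings[kept_ids].clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(kept_ids)
```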

## Supported Languages

- Bulgarian
- English
- French
- German
- Russian
- Spanish
- Swahili
- Turkish
- Vietnamese

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("rustemgareev/mdeberta-v3-base-lite")
model = AutoModel.from_pretrained("rustemgareev/mdeberta-v3-base-lite")

# Example usage: encode a sentence and run it through the model.
text = "This is an example text in English."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (batch, seq_len, hidden_size)
```

## Performance Evaluation

### Size Comparison

| Metric | Original Model | Lite Model | Reduction |
|--------|----------------|------------|-----------|
| Vocabulary Size | 250,102 tokens | 163,211 tokens | 34.74% |
| Disk Size | 1.06 GB | 817 MB | 23.23% |

### VRAM Usage Comparison

*Estimated using the [Hugging Face Accelerate Model Estimator](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).*

| Metric | Original Model | Lite Model | Reduction |
|--------|----------------|------------|-----------|
| Largest Layer (float32) | 735.35 MB | 478.16 MB | 34.99% |
| Total Size (float32) | 1.04 GB | 804.13 MB | 22.68% |
| Training with Adam (peak VRAM) | 4.15 GB | 3.14 GB | 24.34% |
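
As a rough sanity check of these figures (a sketch, not the estimator itself): float32 stores 4 bytes per parameter, and the Adam training row is roughly 4x the float32 model size, accounting for weights, gradients, and Adam's two moment buffers:

```python
from transformers import AutoModel

# Estimate the float32 footprint from the parameter count
# (4 bytes per parameter), then apply the ~4x Adam rule of thumb.
model = AutoModel.from_pretrained("rustemgareev/mdeberta-v3-base-lite")
n_params = sum(p.numel() for p in model.parameters())

total_mb = n_params * 4 / 2**20
print(f"float32 size:    {total_mb:.2f} MB")
print(f"Adam peak (~4x): {total_mb * 4 / 2**10:.2f} GB")
```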

### Semantic Similarity Comparison

**Evaluation Method**: Cosine similarity between embeddings of parallel sentences in different languages, using English as the reference.
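
The pooling strategy behind these embeddings is not stated on this card; the sketch below assumes mean pooling of the last hidden state over non-padding tokens, one common choice, and reproduces a single pair from the table that follows:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("rustemgareev/mdeberta-v3-base-lite")
model = AutoModel.from_pretrained("rustemgareev/mdeberta-v3-base-lite")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden state over non-padding tokens.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

en = embed("Artificial intelligence learns to understand human languages and helps people communicate.")
ru = embed("Искусственный интеллект учится понимать человеческие языки и помогает людям общаться.")
print(f"English-Russian similarity: {F.cosine_similarity(en, ru).item():.4f}")
```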

**Test Phrases Used**:

- English: "Artificial intelligence learns to understand human languages and helps people communicate."
- Bulgarian: "Изкуственият интелект се учи да разбира човешките езици и помага на хората да общуват."
- French: "L'intelligence artificielle apprend à comprendre les langages humains et aide les gens à communiquer."
- German: "Künstliche Intelligenz lernt, menschliche Sprachen zu verstehen und hilft Menschen bei der Kommunikation."
- Russian: "Искусственный интеллект учится понимать человеческие языки и помогает людям общаться."
- Spanish: "La inteligencia artificial aprende a entender los idiomas humanos y ayuda a las personas a comunicarse."
- Swahili: "Akili ya kisasa inajifunza kuelewa lugha za wanadamu na kusaidia watu kuwasiliana."
- Turkish: "Yapay zeka, insan dillerini anlamayı öğrenir ve insanların iletişim kurmasına yardımcı olur."
- Vietnamese: "Trí tuệ nhân tạo học cách hiểu ngôn ngữ con người và giúp mọi người giao tiếp."

**Similarity Results**:

| Language Pair | Original Similarity | Lite Similarity | Difference |
|---------------|---------------------|-----------------|------------|
| English-Bulgarian | 0.9276 | 0.9276 | 0.0000 |
| English-French | 0.9322 | 0.9322 | 0.0000 |
| English-German | 0.9178 | 0.9178 | 0.0000 |
| English-Russian | 0.9335 | 0.9335 | 0.0000 |
| English-Spanish | 0.9228 | 0.9228 | 0.0000 |
| English-Swahili | 0.9591 | 0.9591 | 0.0000 |
| English-Turkish | 0.9450 | 0.9450 | 0.0000 |
| English-Vietnamese | 0.7955 | 0.7955 | 0.0000 |

## License

This model is distributed under the [MIT License](https://opensource.org/licenses/MIT).