---
language:
- multilingual
- bg
- en
- fr
- de
- ru
- es
- sw
- tr
- vi
tags:
- deberta
- deberta-v3
- mdeberta
license: mit
---

# mdeberta-v3-base-lite

This model is a vocabulary-pruned version of the original [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) model: tokens not needed for the supported Latin- and Cyrillic-script languages were removed, shrinking the model while maintaining full quality for those languages.
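
The pruning script used to produce this model is not published on this card, but the core idea can be sketched as follows: collect the token ids actually needed for the target languages, then slice the embedding matrix down to those rows. Everything below (in particular the tiny `corpus`) is an illustrative placeholder, not the actual procedure:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")

# Hypothetical sample corpus; a real pruning run would use large corpora
# covering all supported languages.
corpus = ["This is an example.", "Это пример."]
keep_ids = sorted(
    {tid for text in corpus for tid in tokenizer(text)["input_ids"]}
    | set(tokenizer.all_special_ids)
)

# Slice the input embedding matrix down to the kept rows.
old_weight = model.get_input_embeddings().weight            # (250102, hidden)
new_embeddings = torch.nn.Embedding(len(keep_ids), old_weight.size(1))
new_embeddings.weight.data = old_weight[torch.tensor(keep_ids)].detach().clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(keep_ids)

# The tokenizer's SentencePiece vocabulary must also be rebuilt so that its
# ids match the new embedding rows; that step is tool-specific and omitted.
```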

## Supported Languages
- Bulgarian
- English
- French
- German
- Russian
- Spanish
- Swahili
- Turkish
- Vietnamese

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("rustemgareev/mdeberta-v3-base-lite")
model = AutoModel.from_pretrained("rustemgareev/mdeberta-v3-base-lite")

# Example usage: encode a sentence and run a forward pass
text = "This is an example text in English."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state has shape (batch, seq_len, hidden)
```

## Performance Evaluation

### Size Comparison
| Metric | Original Model | Lite Model | Reduction |
|--------|----------------|------------|-----------|
| Vocabulary Size | 250,102 tokens | 163,211 tokens | 34.74% |
| Disk Size | 1.06 GB | 817 MB | 23.23% |

### VRAM Usage Comparison
*Estimated using [Hugging Face Accelerate Model Estimator](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).*

| Metric | Original Model | Lite Model | Reduction |
|--------|----------------|------------|-----------|
| Largest Layer (float32) | 735.35 MB | 478.16 MB | 34.99% |
| Total Size (float32) | 1.04 GB | 804.13 MB | 22.68% |
| Training with Adam (peak VRAM) | 4.15 GB | 3.14 GB | 24.34% |
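
These figures come from the estimator linked above. As a quick local sanity check, the float32 parameter size can also be computed directly; a minimal sketch (not the estimator's exact method):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("rustemgareev/mdeberta-v3-base-lite")

# Each float32 parameter occupies 4 bytes; the result should land near the
# "Total Size (float32)" row in the table above.
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Parameter size: {total_bytes / 1e6:.2f} MB")
```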

### Semantic Similarity Comparison

**Evaluation Method**: Cosine similarity between embeddings of parallel sentences in different languages, using English as the reference.
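
The pooling strategy behind these sentence embeddings is not spelled out on this card; the sketch below reproduces the general procedure under the assumption of mean pooling over the last hidden state:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("rustemgareev/mdeberta-v3-base-lite")
model = AutoModel.from_pretrained("rustemgareev/mdeberta-v3-base-lite")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)         # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling -> (1, hidden)

en = embed("Artificial intelligence learns to understand human languages and helps people communicate.")
ru = embed("Искусственный интеллект учится понимать человеческие языки и помогает людям общаться.")
print(torch.nn.functional.cosine_similarity(en, ru).item())
```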

**Test Phrases Used**:
- English: "Artificial intelligence learns to understand human languages and helps people communicate."
- Bulgarian: "Изкуственият интелект се учи да разбира човешките езици и помага на хората да общуват."
- French: "L'intelligence artificielle apprend à comprendre les langages humains et aide les gens à communiquer."
- German: "Künstliche Intelligenz lernt, menschliche Sprachen zu verstehen und hilft Menschen bei der Kommunikation."
- Russian: "Искусственный интеллект учится понимать человеческие языки и помогает людям общаться."
- Spanish: "La inteligencia artificial aprende a entender los idiomas humanos y ayuda a las personas a comunicarse."
- Swahili: "Akili ya kisasa inajifunza kuelewa lugha za wanadamu na kusaidia watu kuwasiliana."
- Turkish: "Yapay zeka, insan dillerini anlamayı öğrenir ve insanların iletişim kurmasına yardımcı olur."
- Vietnamese: "Trí tuệ nhân tạo học cách hiểu ngôn ngữ con người và giúp mọi người giao tiếp."

**Similarity Results**:

| Language Pair | Original Similarity | Lite Similarity | Difference |
|---------------|-----------------|-----------------|------------|
| English-Bulgarian | 0.9276 | 0.9276 | 0.0000 |
| English-French | 0.9322 | 0.9322 | 0.0000 |
| English-German | 0.9178 | 0.9178 | 0.0000 |
| English-Russian | 0.9335 | 0.9335 | 0.0000 |
| English-Spanish | 0.9228 | 0.9228 | 0.0000 |
| English-Swahili | 0.9591 | 0.9591 | 0.0000 |
| English-Turkish | 0.9450 | 0.9450 | 0.0000 |
| English-Vietnamese | 0.7955 | 0.7955 | 0.0000 |

## License

This model is distributed under the [MIT License](https://opensource.org/licenses/MIT).