# [sentence-transformers/static-similarity-mrl-multilingual-v1](https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1)
License: [apache-2.0](https://choosealicense.com/licenses/apache-2.0/)
Multilingual similarity embeddings trained with Matryoshka loss, which
allows the embedding vectors to be truncated more effectively. The model
was trained on multilingual datasets covering a variety of domains.
It is a general-purpose model that can be used for semantic textual similarity,
paraphrase mining, text classification, clustering, and more.
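
To make the truncation idea concrete, here is a minimal pure-Python sketch of how a Matryoshka-style embedding might be shortened: keep the leading components, then rescale to unit length before comparing. The vectors are made-up 8-dimensional stand-ins for the real 1,024-dimensional ones, and the helper names `truncate_and_normalize` and `cosine_similarity` are illustrative, not part of any library.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale to unit L2 length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; real embeddings come from the model.
a = [0.9, -0.4, 0.3, 0.1, 0.05, -0.02, 0.01, 0.0]
b = [0.8, -0.5, 0.2, 0.2, -0.03, 0.04, 0.0, 0.01]

full = cosine_similarity(a, b)
short = cosine_similarity(truncate_and_normalize(a, 4),
                          truncate_and_normalize(b, 4))
print(f"full: {full:.4f}  truncated to 4 dims: {short:.4f}")
```

With a Matryoshka-trained model the leading dimensions are expected to carry most of the signal, so the truncated similarity should stay close to the full one. How close depends on the model and on how many dimensions are kept.
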
## Model Stats
Statistics describing the shape of the embedding tensor and the distribution of its values.
| item | metric | value |
| --------------| ----------------------- | ----- |
| vocab | size | 105,879 |
| embedding | dimensions | 1,024 |
| vector length | mean | 413.61 |
| vector length | median | 437.74 |
| vector length | stddev | 195.51 |
| values | mean | -0.02 |
| values | median | -0.01 |
| values | stddev | 14.30 |
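
The sketch below shows one way statistics like these could be computed for an embedding table. It assumes that "vector length" means the L2 norm of each row and that the standard deviation is the population form. The source states neither, so treat both as guesses. The tiny table is invented for the example.

```python
import math
import statistics


def table_stats(table):
    """Summarize row L2 norms and raw values of a list-of-lists table."""
    lengths = [math.sqrt(sum(v * v for v in row)) for row in table]
    values = [v for row in table for v in row]
    return {
        "vocab_size": len(table),
        "dimensions": len(table[0]),
        "length_mean": statistics.fmean(lengths),
        "length_median": statistics.median(lengths),
        "length_stddev": statistics.pstdev(lengths),  # population form (assumed)
        "value_mean": statistics.fmean(values),
        "value_median": statistics.median(values),
        "value_stddev": statistics.pstdev(values),
    }


# Invented 3 x 2 table; the real one would be 105,879 x 1,024.
toy_table = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
for name, value in table_stats(toy_table).items():
    print(f"{name}: {value}")
```
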
## Mean Pooled Quantization Loss
This test round-trips the vectors through quantization but performs the
mean pooling arithmetic in float32. The mean-pooled vectors built from the
quantized and unquantized embeddings are then compared by cosine
similarity, showing how much quantization changes the meaning of the
pooled vector.
| Precision | Cosine Similarity |
| ------------- | ----------------- |
| fp16 | 1.00000 |
| fp8 e4m3 | 0.99980 |
| fp8 e5m2 | 0.99921 |
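
The following is a rough sketch of the round-trip test for the fp16 case only, using the half-precision format that Python's `struct` module supports. It is not the script that produced the table: the token vectors are invented, the fp8 formats are not covered, and pooling happens in Python's native float64 rather than float32.

```python
import math
import struct


def fp16_roundtrip(x):
    """Encode a float as IEEE 754 half precision and decode it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pooled_roundtrip_similarity(vectors):
    """Compare the pooled original with the pooled fp16 round-tripped copy."""
    quantized = [[fp16_roundtrip(v) for v in row] for row in vectors]
    return cosine_similarity(mean_pool(vectors), mean_pool(quantized))


# Invented token vectors standing in for rows of the embedding table.
token_vectors = [[0.1, -14.3, 2.7], [3.3, 0.02, -7.9]]
print(f"fp16 pooled cosine similarity: {pooled_roundtrip_similarity(token_vectors):.5f}")
```
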
## Quantization Loss Per Vector
While ultimately the embedding vectors will be mean pooled together, it's
still useful to look at the loss per-vector in the embedding table to see
which quantization strategies retain the most vector meaning.
- **Cosine Similarity** — measures how well the *direction* of embedding vectors
is preserved after quantization, independent of scale. This is especially
relevant when embeddings are used for similarity search or retrieval.
- **MSE (Mean Squared Error)** — emphasizes large errors by squaring the
differences. Useful for detecting whether any values are badly distorted.
- **MAE (Mean Absolute Error)** — the average absolute difference between
original and quantized values. Easier to interpret, less sensitive to outliers.
| Precision | Metric | Value |
| ------------- | ------ | ----- |
| fp16 | cosine similarity | 1.00000 |
| fp8 e4m3 | cosine similarity | 0.99965 |
| fp8 e5m2 | cosine similarity | 0.99861 |
| fp16 | MSE | 0.00001 |
| fp8 e4m3 | MSE | 0.14369 |
| fp8 e5m2 | MSE | 0.56917 |
| fp16 | MAE | 0.00183 |
| fp8 e4m3 | MAE | 0.23372 |
| fp8 e5m2 | MAE | 0.46585 |
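
As an illustration of the three metrics, the sketch below computes them for a single vector and its quantized copy. The quantizer `round_mantissa` is a hypothetical, simplified stand-in that only rounds the significand to a given number of explicit mantissa bits. It ignores exponent range, subnormals, saturation, and special values, so it is not a faithful fp8 implementation, and its numbers will not match the table.

```python
import math


def round_mantissa(x, mantissa_bits):
    """Round x to `mantissa_bits` explicit mantissa bits (simplified stand-in)."""
    if x == 0.0:
        return 0.0
    fraction, exponent = math.frexp(x)       # x == fraction * 2**exponent
    scale = 2 ** (mantissa_bits + 1)         # +1 for the implicit leading bit
    return math.ldexp(round(fraction * scale) / scale, exponent)


def vector_metrics(original, quantized):
    """Return cosine similarity, MSE, and MAE between two vectors."""
    diffs = [q - o for o, q in zip(original, quantized)]
    dot = sum(o * q for o, q in zip(original, quantized))
    norm_o = math.sqrt(sum(o * o for o in original))
    norm_q = math.sqrt(sum(q * q for q in quantized))
    cosine = dot / (norm_o * norm_q) if norm_o and norm_q else 0.0
    return {
        "cosine": cosine,
        "mse": sum(d * d for d in diffs) / len(diffs),
        "mae": sum(abs(d) for d in diffs) / len(diffs),
    }


# Invented vector; 3 mantissa bits mimics e4m3 precision, 2 mimics e5m2.
vector = [1.1, 14.3, -0.37, 5.6]
for bits in (3, 2):
    quantized = [round_mantissa(v, bits) for v in vector]
    print(bits, vector_metrics(vector, quantized))
```

Fewer mantissa bits mean coarser rounding steps, which matches the ordering in the table above: e5m2 shows higher MSE and MAE than e4m3.
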
## Tokenizer Examples
**Input:** This is an example of encoding<br/>
**Tokens**: `[CLS]` `this` `is` `an` `example` `of` `en` `##co` `##ding` `[SEP]`
**Input:** The quick brown fox jumps over the lazy dog.<br/>
**Tokens**: `[CLS]` `the` `quick` `brown` `fox` `jump` `##s` `over` `the` `la` `##zy` `dog` `.` `[SEP]`
**Input:** Curaçao, naïve fiancé, jalapeño, déjà vu.<br/>
**Tokens**: `[CLS]` `curacao` `,` `nai` `##ve` `fia` `##nce` `,` `ja` `##lap` `##eno` `,` `deja` `vu` `.` `[SEP]`
**Input:** Привет, как дела?<br/>
**Tokens**: `[CLS]` `при` `##вет` `,` `как` `дела` `?` `[SEP]`
**Input:** Бързата кафява лисица прескача мързеливото куче.<br/>
**Tokens**: `[CLS]` `б` `##ър` `##за` `##та` `ка` `##ф` `##ява` `ли` `##си` `##ца` `пре` `##ска` `##ча` `м` `##ър` `##зе` `##ливо` `##то` `к` `##уч` `##е` `.` `[SEP]`
**Input:** Γρήγορη καφέ αλεπού πηδάει πάνω από τον τεμπέλη σκύλο.<br/>
**Tokens**: `[CLS]` `γ` `##ρη` `##γο` `##ρη` `κ` `##α` `##φ` `##ε` `α` `##λε` `##που` `π` `##η` `##δα` `##ει` `πανω` `απο` `τον` `τ` `##ε` `##μ` `##πε` `##λη` `σ` `##κ` `##υλο` `.` `[SEP]`
**Input:** اللغة العربية جميلة وغنية بالتاريخ.<br/>
**Tokens**: `[CLS]` `اللغة` `العربية` `ج` `##ميل` `##ة` `و` `##غنية` `با` `##لت` `##اري` `##خ` `.` `[SEP]`
**Input:** مرحبا بالعالم!<br/>
**Tokens**: `[CLS]` `م` `##رح` `##با` `با` `##ل` `##عا` `##لم` `!` `[SEP]`
**Input:** Simplified: 快速的棕色狐狸跳过懒狗。<br/>
**Tokens**: `[CLS]` `simplified` `:` `快` `速` `的` `棕` `色` `狐` `狸` `跳` `过` `懒` `狗` `。` `[SEP]`
**Input:** Traditional: 快速的棕色狐狸跳過懶狗。<br/>
**Tokens**: `[CLS]` `traditional` `:` `快` `速` `的` `棕` `色` `狐` `狸` `跳` `過` `懶` `狗` `。` `[SEP]`
**Input:** 素早い茶色の狐が怠け者の犬を飛び越える。<br/>
**Tokens**: `[CLS]` `素` `早` `い` `茶` `色` `の` `狐` `か` `怠` `け` `者` `の` `犬` `を` `飛` `ひ` `越` `える` `。` `[SEP]`
**Input:** コンピュータープログラミング<br/>
**Tokens**: `[CLS]` `コ` `##ン` `##ヒ` `##ュー` `##ター` `##フロ` `##ク` `##ラ` `##ミ` `##ンク` `[SEP]`
**Input:** 빠른 갈색 여우가 게으른 개를 뛰어넘습니다.<br/>
**Tokens**: `[CLS]` `ᄈ` `##ᅡ른` `가` `##ᆯ` `##색` `ᄋ` `##ᅧ` `##우` `##가` `ᄀ` `##ᅦ` `##ᄋ` `##ᅳ` `##른` `ᄀ` `##ᅢ를` `ᄄ` `##ᅱ` `##어` `##너` `##ᆷ` `##스` `##ᆸ니다` `.` `[SEP]`
**Input:** तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।<br/>
**Tokens**: `[CLS]` `त` `##ज` `भर` `##ी` `ल` `##ो` `##म` `##डी` `आल` `##सी` `क` `##तत` `क` `ऊपर` `क` `##द` `##ती` `ह` `।` `[SEP]`
**Input:** দ্রুত বাদামী শিয়াল অলস কুকুরের উপর দিয়ে লাফ দেয়।<br/>
**Tokens**: `[CLS]` `দ` `##রত` `বা` `##দা` `##মী` `শ` `##িযা` `##ল` `অ` `##ল` `##স` `ক` `##কর` `##ের` `উপর` `দিযে` `ল` `##া` `##ফ` `দেয` `।` `[SEP]`
**Input:** வேகமான பழுப்பு நரி சோம்பேறி நாயின் மேல் குதிக்கிறது.<br/>
**Tokens**: `[CLS]` `வ` `##ே` `##கம` `##ான` `ப` `##ழு` `##பபு` `நர` `##ி` `ச` `##ோ` `##ம` `##ப` `##ே` `##றி` `ந` `##ாய` `##ின` `மேல` `க` `##ு` `##தி` `##ககிறது` `.` `[SEP]`
**Input:** สุนัขจิ้งจอกสีน้ำตาลกระโดดข้ามสุนัขขี้เกียจ.<br/>
**Tokens**: `[CLS]` `[UNK]` `.` `[SEP]`
**Input:** ብሩክ ቡናማ ቀበሮ ሰነፍ ውሻን ተዘልሏል።<br/>
**Tokens**: `[CLS]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[SEP]`
**Input:** Hello 世界 مرحبا 🌍<br/>
**Tokens**: `[CLS]` `hello` `世` `界` `م` `##رح` `##با` `[UNK]` `[SEP]`
**Input:** 123, αβγ, абв, العربية, 中文, हिन्दी.<br/>
**Tokens**: `[CLS]` `123` `,` `α` `##β` `##γ` `,` `аб` `##в` `,` `العربية` `,` `中` `文` `,` `हिनदी` `.` `[SEP]`
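
The token lists above look consistent with BERT-style normalization (lowercasing plus accent stripping) followed by greedy WordPiece matching, though the tokenizer configuration is not stated here. Under that assumption, the sketch below shows why `Curaçao` becomes `curacao`, why `が` becomes `か`, and why a word with no matching pieces collapses to `[UNK]`. `TOY_VOCAB` is a made-up vocabulary, not the model's.

```python
import unicodedata


def normalize(text):
    """Lowercase, decompose (NFD), and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary for illustration; the real one has 105,879 entries.
TOY_VOCAB = {"en", "##co", "##ding", "example"}

print(normalize("Curaçao, naïve fiancé"))
print(normalize("が"))
print(wordpiece("encoding", TOY_VOCAB))
print(wordpiece("ብሩክ", TOY_VOCAB))
```

Because an unmatched word is replaced as a whole, scripts that are poorly covered by the vocabulary (Thai and Amharic in the examples above) lose all of their content to `[UNK]`.
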