minishlab/potion-multilingual-128M
License: mit
A multilingual embedding model. Details on how it was trained are scant because no source code is available for it. Its architecture is likely close to that of the potion-retrieval-32M model, but trained on Common Crawl data.
The 128M in the name refers to the number of parameters in the embedding table:
256 dimensions × 500,353 vocabulary entries = 128,090,368 parameters.
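As a quick check of that arithmetic, the product can be computed directly. The variable names below are illustrative and are not taken from the model's code.

```python
# Embedding table shape as reported in this card.
vocab_size = 500_353
dimensions = 256

# One parameter per (token, dimension) cell in the embedding table.
parameter_count = vocab_size * dimensions

print(f"{parameter_count:,}")           # 128,090,368
print(f"{parameter_count / 1e6:.1f}M")  # 128.1M
```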
Model Stats
Statistics describing the shape of the embedding tensor and the distribution of its values.
| item | metric | value |
|---|---|---|
| vocab | size | 500,353 |
| embedding | dimensions | 256 |
| vector length | mean | 12.73 |
| vector length | median | 11.94 |
| vector length | stddev | 5.12 |
| values | mean | -0.00 |
| values | median | -0.00 |
| values | stddev | 0.86 |
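A minimal sketch of how statistics like these could be computed, using only the Python standard library and a toy table. It assumes that "vector length" means the Euclidean (L2) norm of each row and that the standard deviation is the population form; the card states neither, and `embedding_stats` is a hypothetical helper, not the script that produced the table.

```python
import math
import statistics


def embedding_stats(embeddings):
    """Summarize a list of equal-length vectors (rows of an embedding table)."""
    # Assumption: "vector length" is the Euclidean (L2) norm of each row.
    norms = [math.sqrt(sum(v * v for v in row)) for row in embeddings]
    # Flatten the table to describe the distribution of individual values.
    values = [v for row in embeddings for v in row]
    return {
        "vocab size": len(embeddings),
        "dimensions": len(embeddings[0]),
        "vector length mean": statistics.fmean(norms),
        "vector length median": statistics.median(norms),
        # Assumption: population standard deviation; the card does not say which.
        "vector length stddev": statistics.pstdev(norms),
        "values mean": statistics.fmean(values),
        "values median": statistics.median(values),
        "values stddev": statistics.pstdev(values),
    }


# Toy 3 x 2 table standing in for the real 500,353 x 256 one.
toy = [[3.0, 4.0], [0.0, 0.0], [-6.0, 8.0]]
for name, value in embedding_stats(toy).items():
    print(f"{name}: {value:.2f}")
```

On the real model the same function would be applied to the rows of the embedding matrix, most likely through an array library rather than Python lists for speed.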
Mean Pooled Quantization Loss
This test round-trips the vectors through quantization but performs the mean-pooling arithmetic in float32. The quantized and unquantized mean-pooled vectors are then compared by cosine similarity, which shows how much quantization has changed the meaning of the pooled vector.
| Precision | Cosine Similarity |
|---|---|
| fp16 | 1.00000 |
| fp8 e4m3 | 0.99993 |
| fp8 e5m2 | 0.99973 |
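The procedure described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script behind the table: it covers only fp16 (via the `e` format of `struct`), it pools in Python's native float64 rather than float32, and the toy `token_vectors` are made up.

```python
import math
import struct


def roundtrip_fp16(value):
    """Quantize one number to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def mean_pool(vectors):
    """Average the vectors element-wise (here in float64, not float32)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pooled_similarity(token_vectors):
    """Compare the pooled original vectors with the pooled quantized ones."""
    quantized = [[roundtrip_fp16(v) for v in row] for row in token_vectors]
    return cosine_similarity(mean_pool(token_vectors), mean_pool(quantized))


# Hypothetical embeddings for a three-token input.
token_vectors = [
    [0.1234, -0.8765, 1.4142],
    [2.7182, 0.5772, -0.3333],
    [-1.6180, 3.1415, 0.0101],
]
print(f"{mean_pooled_similarity(token_vectors):.5f}")
```

The fp8 formats have no standard-library equivalent, so reproducing those rows would need a library that implements e4m3 and e5m2 rounding.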
Quantization Loss Per Vector
Although the embedding vectors are ultimately mean pooled together, it is still useful to look at the per-vector loss in the embedding table to see which quantization strategies retain the most vector meaning.
- Cosine Similarity: measures how well the direction of embedding vectors is preserved after quantization, independent of scale. This is especially relevant when embeddings are used for similarity search or retrieval.
- MSE (Mean Squared Error): emphasizes large errors by squaring the differences. Useful for detecting whether any values are badly distorted.
- MAE (Mean Absolute Error): the average absolute difference between original and quantized values. Easier to interpret and less sensitive to outliers than MSE.
| Precision | Metric | Value |
|---|---|---|
| fp16 | cosine similarity | 1.00000 |
| fp8 e4m3 | cosine similarity | 0.99965 |
| fp8 e5m2 | cosine similarity | 0.99863 |
| fp16 | MSE | 0.00000 |
| fp8 e4m3 | MSE | 0.00052 |
| fp8 e5m2 | MSE | 0.00205 |
| fp16 | MAE | 0.00011 |
| fp8 e4m3 | MAE | 0.01364 |
| fp8 e5m2 | MAE | 0.02717 |
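The three metrics can be sketched for a single vector as follows. The vectors are made-up examples chosen so that both "quantized" copies have the same MAE but different MSE, which illustrates why MSE is the more sensitive of the two to one badly distorted value. How the per-vector results were aggregated across the embedding table is not stated in this card.

```python
import math


def cosine_similarity(original, quantized):
    dot = sum(o * q for o, q in zip(original, quantized))
    norm_o = math.sqrt(sum(o * o for o in original))
    norm_q = math.sqrt(sum(q * q for q in quantized))
    return dot / (norm_o * norm_q)


def mse(original, quantized):
    """Mean of the squared element-wise differences."""
    return sum((o - q) ** 2 for o, q in zip(original, quantized)) / len(original)


def mae(original, quantized):
    """Mean of the absolute element-wise differences."""
    return sum(abs(o - q) for o, q in zip(original, quantized)) / len(original)


original = [1.0, 2.0, 3.0]
one_big_error = [1.0, 2.0, 4.5]    # hypothetical: a single value off by 1.5
spread_errors = [1.5, 2.5, 3.5]    # hypothetical: every value off by 0.5

for name, candidate in [("one big error", one_big_error),
                        ("spread errors", spread_errors)]:
    print(name,
          f"cos={cosine_similarity(original, candidate):.5f}",
          f"mse={mse(original, candidate):.2f}",
          f"mae={mae(original, candidate):.2f}")
# Both rows report mae=0.50, but mse is 0.75 for the single large error
# and 0.25 when the same total error is spread evenly.
```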
Tokenizer Examples
Input: This is an example of encoding
Tokens: ▁This ▁is ▁an ▁example ▁of ▁encoding
Input: The quick brown fox jumps over the lazy dog.
Tokens: ▁The ▁quick ▁brown ▁fox ▁jumps ▁over ▁the ▁lazy ▁dog ▁ .
Input: Curaçao, naïve fiancé, jalapeño, déjà vu.
Tokens: ▁Cura ça o ▁ , ▁na ï ve ▁fiancé ▁ , ▁ja lap eño ▁ , ▁déjà ▁vu ▁ .
Input: Привет, как дела?
Tokens: ▁При вет ▁ , ▁как ▁дела ▁?
Input: Бързата кафява лисица прескача мързеливото куче.
Tokens: ▁Бър за та ▁кафяв а ▁лис ица ▁пре ска ча ▁ мър зе ливо то ▁куче ▁ .
Input: Γρήγορη καφέ αλεπού πηδάει πάνω από τον τεμπέλη σκύλο.
Tokens: ▁Γ ρή γο ρη ▁καφέ ▁α λε πού ▁ πη δά ει ▁πάνω ▁από ▁τον ▁τε μπ έλη ▁σκύλο ▁ .
Input: اللغة العربية جميلة وغنية بالتاريخ.
Tokens: ▁اللغة ▁العربية ▁جميلة ▁وغ نية ▁بال تاريخ ▁ .
Input: مرحبا بالعالم!
Tokens: ▁مرحبا ▁بالعالم ▁!
Input: Simplified: 快速的棕色狐狸跳过懒狗。
Tokens: ▁Simp l ified ▁: ▁ 快速 的 棕 色 狐 狸 跳 过 懒 狗 。
Input: Traditional: 快速的棕色狐狸跳過懶狗。
Tokens: ▁Tradition al ▁: ▁ 快速 的 棕 色 狐 狸 跳 過 懶 狗 。
Input: 素早い茶色の狐が怠け者の犬を飛び越える。
Tokens: ▁素 早い 茶 色 の 狐 が 怠 け 者の 犬 を 飛び 越 える 。
Input: コンピュータープログラミング
Tokens: ▁ コンピュータ ー プロ グラ ミ ング
Input: 빠른 갈색 여우가 게으른 개를 뛰어넘습니다.
Tokens: ▁빠른 ▁갈 색 ▁여 우 가 ▁게 으 른 ▁ 개를 ▁뛰어 넘 습니다 ▁ .
Input: तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।
Tokens: ▁तेज़ ▁भू री ▁लो म ड़ी ▁आ ल सी ▁कुत्ते ▁के ▁ऊपर ▁ कू द ती ▁है ।
Input: দ্রুত বাদামী শিয়াল অলস কুকুরের উপর দিয়ে লাফ দেয়।
Tokens: ▁দ্রুত ▁বাদাম ী ▁শি য়াল ▁অ ল স ▁কু কুর ের ▁উপর ▁দিয়ে ▁লা ফ ▁দেয় ।
Input: வேகமான பழுப்பு நரி சோம்பேறி நாயின் மேல் குதிக்கிறது.
Tokens: ▁வேக மான ▁பழ ு ப்பு ▁ந ரி ▁சோ ம் பே றி ▁நா யின் ▁மேல் ▁கு தி க்கிறது ▁ .
Input: สุนัขจิ้งจอกสีน้ำตาลกระโดดข้ามสุนัขขี้เกียจ.
Tokens: ▁ สุนัข จิ ้ง จอ ก สีน้ําตาล กระโดด ข้าม สุนัข ขี้ เกีย จ ▁ .
Input: ብሩክ ቡናማ ቀበሮ ሰነፍ ውሻን ተዘልሏል።
Tokens: ▁ ብሩ ክ ▁ቡና ማ ▁ ቀበ ሮ ▁ሰ ነፍ ▁ ው ሻ ን ▁ተ ዘ ል ሏል ።
Input: Hello 世界 مرحبا 🌍
Tokens: ▁Hello ▁世界 ▁مرحبا ▁🌍
Input: 123, αβγ, абв, العربية, 中文, हिन्दी.
Tokens: ▁123 ▁ , ▁α β γ ▁ , ▁аб в ▁ , ▁العربية ▁ , ▁中文 ▁ , ▁हिन्दी ▁ .
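To help read the listings above, here is a small sketch that turns a token line back into text. It assumes the `▁` character (U+2581) follows the SentencePiece-style convention of marking a preceding space and that the listings separate tokens with a single ASCII space; `detokenize` is a hypothetical helper and does not use the model's tokenizer.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker shown in the token listings


def detokenize(token_line):
    """Rebuild text from a space-separated token listing."""
    tokens = token_line.split(" ")
    # Glue the pieces together, then turn each boundary marker into a space.
    text = "".join(tokens).replace(WORD_BOUNDARY, " ")
    return text.lstrip(" ")


print(detokenize("▁This ▁is ▁an ▁example ▁of ▁encoding"))
# This is an example of encoding

print(detokenize("▁The ▁quick ▁brown ▁fox ▁jumps ▁over ▁the ▁lazy ▁dog ▁ ."))
# The quick brown fox jumps over the lazy dog .
```

The second result has a space before the final period because the listing shows punctuation split into its own `▁`-prefixed piece, so under this reading the round trip is not always exact.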