Add new SentenceTransformer model

Files changed (9)

1_Pooling/config.json +1 -1
README.md +114 -147
config.json +12 -14
model.safetensors +2 -2
sentence_bert_config.json +1 -1
special_tokens_map.json +19 -5
tokenizer.json +0 -0
tokenizer_config.json +26 -18
vocab.txt +5 -0

1_Pooling/config.json CHANGED

@@ -1,5 +1,5 @@
 {
-    "word_embedding_dimension": 384,
     "pooling_mode_cls_token": false,
     "pooling_mode_mean_tokens": true,
     "pooling_mode_max_tokens": false,

 {
+    "word_embedding_dimension": 768,
     "pooling_mode_cls_token": false,
     "pooling_mode_mean_tokens": true,
     "pooling_mode_max_tokens": false,
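
Review note: this hunk keeps `pooling_mode_mean_tokens` enabled and only widens `word_embedding_dimension` from 384 to 768. As a rough illustration of what mean pooling followed by L2 normalization computes, here is a minimal pure-Python sketch. The names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are hypothetical and not part of sentence-transformers, which works on batched tensors.

```python
import math
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1; padding tokens are skipped."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


# Toy input: three token vectors, the last one is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)  # average of the first two vectors only
unit = l2_normalize(pooled)
```

With the real model the pooled vector would have 768 entries instead of 2; the logic is otherwise the same under these assumptions.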

README.md CHANGED

@@ -5,78 +5,77 @@ tags:
 - feature-extraction
 - dense
 - generated_from_trainer
-- dataset_size:117861
 - loss:MultipleNegativesRankingLoss
 widget:
-- source_sentence: A slow progression can lead to great things.
   sentences:
-  - The bookmark emoji is often used to indicate saving or marking a specific page
-    or place of interest, such as in a book or on a website. It can also be used to
-    symbolize remembering something important or significant.
-  - The chess pawn emoji is often used to represent the lowly but essential piece
-    in the game of chess. It can symbolize strategy, patience, and the importance
-    of thinking ahead in various contexts.
-  - The flag of St. Kitts and Nevis is a symbol of the island nation in the Caribbean.
-    It consists of a blue field with two white stars representing the islands of St.
-    Kitts and Nevis. The green triangles and red diagonal lines represent the country's
-    lush vegetation and struggle for freedom.
-- source_sentence: I’m starting my day with a clean space today ◽
   sentences:
-  - The tent emoji is often used to symbolize camping, outdoor adventures, or spending
-    the night in nature. It can also represent festivals, events, or temporary shelter.
-    It is commonly used in messages and posts related to camping trips, hiking, or
-    enjoying the great outdoors.
-  - The dumpling emoji represents a delicious filled pastry, often served as an appetizer
-    or snack in various cuisines. It can also symbolize comfort food, gatherings with
-    friends or family, and celebrations.
-  - The white medium-small square emoji is used to represent a white square that is
-    neither too big nor too small. It can symbolize cleanliness, simplicity, or neutrality.
-- source_sentence: Wakeboarding sounds like fun right now.
   sentences:
-  - The green book emoji is often used to symbolize reading, education, and knowledge.
-    It can also represent environmental awareness or sustainability. It is commonly
-    used in posts about literature, learning, or going green.
-  - The speedboat emoji is typically used to represent speed, travel, vacation, or
-    fun on the water. It can also be used in conversations related to boating, sailing,
-    or water activities.
-  - The weary face emoji is used to express weariness, tiredness, or exhaustion. It
-    can also convey sadness, disappointment, or frustration. This emoji is commonly
-    used when expressing feeling drained or overwhelmed.
-- source_sentence: Winter days are best spent carving through the powder.
   sentences:
-  - The hand with fingers splayed emoji is often used to represent a high five, a
-    gesture of greeting, celebration, or agreement. It can also indicate the number
-    five or be used in a playful manner to express excitement or joy.
-  - The 🕉️ emoji is commonly used to represent spirituality, meditation, peace, and
-    harmony. It is often used in the context of yoga and mindfulness practices.
-  - The snowboarder emoji shows a person riding a snowboard down a snowy slope. It
-    is often used in conversations related to winter sports, skiing, snowboarding,
-    cold weather, and outdoor activities.
-- source_sentence: Just finished the book - not sure what to think.
   sentences:
-  - The ping pong emoji is often used to represent the sport of table tennis or a
-    fun game of ping pong. It can also symbolize friendly competition or leisure activities.
-  - The 🛐 emoji is used to represent a place of worship, such as a church, mosque,
-    temple, or shrine. It is often used in the context of religion, spirituality,
-    or practice of faith.
-  - The grey heart emoji is typically used to convey a sense of neutrality or indifference
-    in a conversation. It can also represent a more subdued or muted form of love
-    or appreciation.
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 ---
-# SentenceTransformer
-This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 ## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
-- **Maximum Sequence Length:** 256 tokens
-- **Output Dimensionality:** 384 dimensions
 - **Similarity Function:** Cosine Similarity
 <!-- - **Training Dataset:** Unknown -->
 <!-- - **Language:** Unknown -->
@@ -92,8 +91,8 @@ This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps
 ```
 SentenceTransformer(
-  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
-  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Normalize()
 )
 ```
@@ -116,20 +115,20 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("zoharzaig/emoji-prediction-model")
 # Run inference
 sentences = [
-    'Just finished the book - not sure what to think.',
-    'The grey heart emoji is typically used to convey a sense of neutrality or indifference in a conversation. It can also represent a more subdued or muted form of love or appreciation.',
-    'The ping pong emoji is often used to represent the sport of table tennis or a fun game of ping pong. It can also symbolize friendly competition or leisure activities.',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
-# [3, 384]
 # Get the similarity scores for the embeddings
 similarities = model.similarity(embeddings, embeddings)
 print(similarities)
-# tensor([[ 1.0000,  0.5223, -0.0764],
-#         [ 0.5223,  1.0000,  0.0038],
-#         [-0.0764,  0.0038,  1.0000]])
 ```
 <!--
@@ -174,19 +173,19 @@ You can finetune this model on your own dataset.
 #### Unnamed Dataset
-* Size: 117,861 training samples
 * Columns: <code>sentence_0</code> and <code>sentence_1</code>
 * Approximate statistics based on the first 1000 samples:
-  |         | sentence_0                                                                        | sentence_1                                                                         |
-  |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
-  | type    | string                                                                            | string                                                                             |
-  | details | <ul><li>min: 5 tokens</li><li>mean: 11.98 tokens</li><li>max: 25 tokens</li></ul> | <ul><li>min: 17 tokens</li><li>mean: 45.61 tokens</li><li>max: 89 tokens</li></ul> |
 * Samples:
-  | sentence_0                                              | sentence_1                                                                                                                                                                                                                                                 |
-  |:--------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-  | <code>May your travels be filled with discovery.</code> | <code>The Vulcan salute emoji is often used by Star Trek fans as a way to greet each other or show appreciation for the science fiction franchise. It is also commonly used to symbolize peace, live long and prosper, or simply as a cool gesture.</code> |
-  | <code>Missing our moments together.</code>              | <code>The pink heart emoji is commonly used to express love, affection, and admiration. It can also symbolize femininity, sweetness, and care. This emoji is often sent on Valentine's Day or to show support to someone special.</code>                   |
-  | <code>The sound of waves is my favorite lullaby.</code> | <code>The beach with umbrella emoji is often used to symbolize relaxation, vacations, and sunny days spent by the ocean or sea. It can also evoke feelings of leisure, pleasure, and tranquility.</code>                                                   |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
@@ -329,79 +328,47 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch  | Step  | Training Loss |
 |:------:|:-----:|:-------------:|
-| 0.0679 | 500   | 1.0042        |
-| 0.1357 | 1000  | 0.7892        |
-| 0.2036 | 1500  | 0.6959        |
-| 0.2715 | 2000  | 0.6052        |
-| 0.3394 | 2500  | 0.5538        |
-| 0.4072 | 3000  | 0.5236        |
-| 0.4751 | 3500  | 0.5032        |
-| 0.5430 | 4000  | 0.4752        |
-| 0.6108 | 4500  | 0.4495        |
-| 0.6787 | 5000  | 0.4284        |
-| 0.7466 | 5500  | 0.4098        |
-| 0.8144 | 6000  | 0.4069        |
-| 0.8823 | 6500  | 0.398         |
-| 0.9502 | 7000  | 0.3728        |
-| 1.0181 | 7500  | 0.3515        |
-| 1.0859 | 8000  | 0.3058        |
-| 1.1538 | 8500  | 0.3023        |
-| 1.2217 | 9000  | 0.288         |
-| 1.2895 | 9500  | 0.2881        |
-| 1.3574 | 10000 | 0.277         |
-| 1.4253 | 10500 | 0.2711        |
-| 1.4931 | 11000 | 0.2782        |
-| 1.5610 | 11500 | 0.2721        |
-| 1.6289 | 12000 | 0.2589        |
-| 1.6968 | 12500 | 0.263         |
-| 1.7646 | 13000 | 0.2527        |
-| 1.8325 | 13500 | 0.2456        |
-| 1.9004 | 14000 | 0.2317        |
-| 1.9682 | 14500 | 0.2488        |
-| 2.0361 | 15000 | 0.2141        |
-| 2.1040 | 15500 | 0.214         |
-| 2.1718 | 16000 | 0.1982        |
-| 2.2397 | 16500 | 0.2109        |
-| 2.3076 | 17000 | 0.207         |
-| 2.3755 | 17500 | 0.206         |
-| 2.4433 | 18000 | 0.197         |
-| 2.5112 | 18500 | 0.1891        |
-| 2.5791 | 19000 | 0.1946        |
-| 2.6469 | 19500 | 0.2015        |
-| 2.7148 | 20000 | 0.1867        |
-| 2.7827 | 20500 | 0.1999        |
-| 2.8505 | 21000 | 0.1877        |
-| 2.9184 | 21500 | 0.2004        |
-| 2.9863 | 22000 | 0.1881        |
-| 3.0542 | 22500 | 0.1612        |
-| 3.1220 | 23000 | 0.1523        |
-| 3.1899 | 23500 | 0.1558        |
-| 3.2578 | 24000 | 0.1513        |
-| 3.3256 | 24500 | 0.1691        |
-| 3.3935 | 25000 | 0.1597        |
-| 3.4614 | 25500 | 0.1557        |
-| 3.5293 | 26000 | 0.1582        |
-| 3.5971 | 26500 | 0.1652        |
-| 3.6650 | 27000 | 0.1599        |
-| 3.7329 | 27500 | 0.1524        |
-| 3.8007 | 28000 | 0.1646        |
-| 3.8686 | 28500 | 0.1566        |
-| 3.9365 | 29000 | 0.1532        |
-| 4.0043 | 29500 | 0.153         |
-| 4.0722 | 30000 | 0.1397        |
-| 4.1401 | 30500 | 0.146         |
-| 4.2080 | 31000 | 0.137         |
-| 4.2758 | 31500 | 0.1272        |
-| 4.3437 | 32000 | 0.1353        |
-| 4.4116 | 32500 | 0.143         |
-| 4.4794 | 33000 | 0.1285        |
-| 4.5473 | 33500 | 0.1417        |
-| 4.6152 | 34000 | 0.1302        |
-| 4.6830 | 34500 | 0.1275        |
-| 4.7509 | 35000 | 0.1331        |
-| 4.8188 | 35500 | 0.1334        |
-| 4.8867 | 36000 | 0.1333        |
-| 4.9545 | 36500 | 0.1317        |
 ### Framework Versions

 - feature-extraction
 - dense
 - generated_from_trainer
+- dataset_size:65883
 - loss:MultipleNegativesRankingLoss
+base_model: sentence-transformers/all-mpnet-base-v2
 widget:
+- source_sentence: The calmness of my service dog is so comforting.
   sentences:
+  - The service dog emoji depicts a dog with a harness, denoting its role as a working
+    animal trained to assist individuals with disabilities. It is commonly used to
+    represent service animals, independence, and support for those in need.
+  - The 🧑‍🌾 emoji is commonly used to represent a farmer or someone working in agriculture.
+    It can be used in conversations related to farming, crops, gardening, and rural
+    lifestyle.
+  - The oil drum emoji is used to represent oil, petroleum, fuel, or other liquids
+    stored in a drum container. It can also symbolize industrial processes, mechanics,
+    or transportation related to oil and fuel.
+- source_sentence: Sipping water from this fountain always leaves a good taste.
   sentences:
+  - The ⛲ emoji is typically used to represent a fountain, flowing water, or a source
+    of water. It can also symbolize tranquility, relaxation, and a peaceful atmosphere.
+  - This emoji is used to represent a woman engaging in the sport of mountain biking.
+    It can be used in contexts related to sports, outdoor activities, or simply to
+    convey a sense of adventure and thrill.
+  - The crystal ball emoji is often used to symbolize magic, fortune-telling, mysticism,
+    or the unknown. It can also represent guidance, predictions, or future insights.
+    This emoji can be used in conversations related to spirituality, fantasy, astrology,
+    and predictions.
+- source_sentence: The bookstore had some amazing finds today!
   sentences:
+  - 'The keycap: 4 emoji is used to represent the number 4 in a clear and concise
+    way. It is often used in numerical sequences or lists.'
+  - The open book emoji is commonly used to represent reading, studying, learning,
+    education, or books in general. It can also be used to symbolize wisdom, knowledge,
+    or literature.
+  - The deer emoji is often used to symbolize grace, beauty, and tranquility. It can
+    also represent a love for nature and wildlife.
+- source_sentence: Hair appointment went perfectly, feeling confident!
   sentences:
+  - The emoji of a woman getting a haircut is often used to represent beauty salons,
+    haircuts, and hairstyles. It can also be used to signify self-care routines or
+    pampering sessions.
+  - The woman climbing emoji is used to represent rock climbing, outdoor adventure,
+    strength, and determination. It can be used when talking about physical activities,
+    hobbies, or overcoming challenges.
+  - The 💏 emoji is often used to represent a kiss between two individuals, such as
+    a romantic gesture or expression of love. It can also symbolize affection, intimacy,
+    or a moment of connection between partners.
+- source_sentence: Bald and beautiful, just how I like it.
   sentences:
+  - 'The woman: bald emoji is used to represent a female character without hair. It
+    can be used to show support for people undergoing chemotherapy, to represent beauty
+    in diverse forms, or simply to depict a bald woman.'
+  - The woman golfing emoji is typically used to represent a female person playing
+    golf. It can be used in the context of sports, leisure, physical activity, or
+    any mention of golf. It is often used in social media posts related to golfing
+    or to express enjoyment of the sport.
+  - The dog face emoji is commonly used to represent dogs, pets, loyalty, and cuteness.
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 ---
+# SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 ## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
+- **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) <!-- at revision 12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0 -->
+- **Maximum Sequence Length:** 384 tokens
+- **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
 <!-- - **Training Dataset:** Unknown -->
 <!-- - **Language:** Unknown -->
 ```
 SentenceTransformer(
+  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Normalize()
 )
 ```
 model = SentenceTransformer("zoharzaig/emoji-prediction-model")
 # Run inference
 sentences = [
+    'Bald and beautiful, just how I like it.',
+    'The woman: bald emoji is used to represent a female character without hair. It can be used to show support for people undergoing chemotherapy, to represent beauty in diverse forms, or simply to depict a bald woman.',
+    'The dog face emoji is commonly used to represent dogs, pets, loyalty, and cuteness.',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
+# [3, 768]
 # Get the similarity scores for the embeddings
 similarities = model.similarity(embeddings, embeddings)
 print(similarities)
+# tensor([[ 1.0000,  0.4778,  0.0503],
+#         [ 0.4778,  1.0000, -0.0784],
+#         [ 0.0503, -0.0784,  1.0000]])
 ```
 <!--
 #### Unnamed Dataset
+* Size: 65,883 training samples
 * Columns: <code>sentence_0</code> and <code>sentence_1</code>
 * Approximate statistics based on the first 1000 samples:
+  |         | sentence_0                                                                       | sentence_1                                                                         |
+  |:--------|:---------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
+  | type    | string                                                                           | string                                                                             |
+  | details | <ul><li>min: 5 tokens</li><li>mean: 11.9 tokens</li><li>max: 23 tokens</li></ul> | <ul><li>min: 18 tokens</li><li>mean: 45.38 tokens</li><li>max: 85 tokens</li></ul> |
 * Samples:
+  | sentence_0                                           | sentence_1                                                                                                                                                                                                            |
+  |:-----------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | <code>Still cooking breakfast.</code>                | <code>The hourglass not done emoji ⏳ is often used to represent the passing of time, a sense of urgency, or a countdown. It can also symbolize patience and waiting for something to be completed or resolved.</code> |
+  | <code>How do you feel about GMOs?</code>             | <code>The woman scientist emoji is used to represent a female scientist or researcher. It can be used in the context of science, research, discovery, and academia.</code>                                            |
+  | <code>The clear waters of Aruba look amazing!</code> | <code>The flag of Aruba emoji is used to represent the country of Aruba. Aruba is known for its beautiful beaches, warm weather, and vibrant culture. It is a popular tourist destination in the Caribbean.</code>    |
 * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
   ```json
   {
 ### Training Logs
 | Epoch  | Step  | Training Loss |
 |:------:|:-----:|:-------------:|
+| 0.1214 | 500   | 1.1886        |
+| 0.2428 | 1000  | 1.0327        |
+| 0.3643 | 1500  | 0.9711        |
+| 0.4857 | 2000  | 0.9062        |
+| 0.6071 | 2500  | 0.8915        |
+| 0.7285 | 3000  | 0.8699        |
+| 0.8499 | 3500  | 0.8658        |
+| 0.9713 | 4000  | 0.8191        |
+| 1.0928 | 4500  | 0.7382        |
+| 1.2142 | 5000  | 0.7059        |
+| 1.3356 | 5500  | 0.7004        |
+| 1.4570 | 6000  | 0.7012        |
+| 1.5784 | 6500  | 0.6842        |
+| 1.6999 | 7000  | 0.6994        |
+| 1.8213 | 7500  | 0.6832        |
+| 1.9427 | 8000  | 0.6597        |
+| 2.0641 | 8500  | 0.5964        |
+| 2.1855 | 9000  | 0.5506        |
+| 2.3069 | 9500  | 0.5155        |
+| 2.4284 | 10000 | 0.5531        |
+| 2.5498 | 10500 | 0.5439        |
+| 2.6712 | 11000 | 0.5471        |
+| 2.7926 | 11500 | 0.5492        |
+| 2.9140 | 12000 | 0.5331        |
+| 3.0355 | 12500 | 0.5052        |
+| 3.1569 | 13000 | 0.4309        |
+| 3.2783 | 13500 | 0.4162        |
+| 3.3997 | 14000 | 0.4268        |
+| 3.5211 | 14500 | 0.4142        |
+| 3.6425 | 15000 | 0.421         |
+| 3.7640 | 15500 | 0.4126        |
+| 3.8854 | 16000 | 0.4324        |
+| 4.0068 | 16500 | 0.4098        |
+| 4.1282 | 17000 | 0.3335        |
+| 4.2496 | 17500 | 0.3401        |
+| 4.3711 | 18000 | 0.3317        |
+| 4.4925 | 18500 | 0.3448        |
+| 4.6139 | 19000 | 0.336         |
+| 4.7353 | 19500 | 0.3299        |
+| 4.8567 | 20000 | 0.3601        |
+| 4.9781 | 20500 | 0.3347        |
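
Review note: the batch size is not visible in these hunks, but it can be estimated from the training logs and the dataset size. The sketch below assumes one optimizer step per batch (no gradient accumulation, single device), which this diff does not confirm; `implied_batch_size` is a hypothetical helper.

```python
def implied_batch_size(dataset_size: int, step: int, epoch_fraction: float) -> float:
    """Estimate samples per step from one (epoch, step) row of the training log."""
    steps_per_epoch = step / epoch_fraction
    return dataset_size / steps_per_epoch


# First log row of the new run: epoch 0.1214 at step 500, 65,883 samples.
new_run = implied_batch_size(65_883, 500, 0.1214)

# First log row of the old run: epoch 0.0679 at step 500, 117,861 samples.
old_run = implied_batch_size(117_861, 500, 0.0679)

print(round(new_run), round(old_run))
```

Both runs come out at roughly 16 samples per step under these assumptions, so the different step counts follow from the smaller dataset rather than from a different batch size.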
 ### Framework Versions

config.json CHANGED

@@ -1,25 +1,23 @@
 {
   "architectures": [
-    "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
-  "classifier_dropout": null,
-  "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size": 384,
   "initializer_range": 0.02,
-  "intermediate_size": 1536,
-  "layer_norm_eps": 1e-12,
-  "max_position_embeddings": 512,
-  "model_type": "bert",
   "num_attention_heads": 12,
-  "num_hidden_layers": 6,
-  "pad_token_id": 0,
-  "position_embedding_type": "absolute",
   "torch_dtype": "float32",
   "transformers_version": "4.53.2",
-  "type_vocab_size": 2,
-  "use_cache": true,
-  "vocab_size": 30522
 }

 {
   "architectures": [
+    "MPNetModel"
   ],
   "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
   "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "mpnet",
   "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "relative_attention_num_buckets": 32,
   "torch_dtype": "float32",
   "transformers_version": "4.53.2",
+  "vocab_size": 30527
 }
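
Review note: several values changed in this commit have to agree across files. The sketch below copies the new values from `config.json`, `1_Pooling/config.json` and `sentence_bert_config.json` as shown in this diff and runs a few sanity checks. `find_inconsistencies` is a hypothetical helper, and the checks are a reviewer's choice rather than anything the libraries enforce in this form.

```python
from typing import Dict, List

# Values copied from the new side of this diff.
model_config: Dict[str, int] = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "max_position_embeddings": 514,
}
pooling_config = {"word_embedding_dimension": 768}
sbert_config = {"max_seq_length": 384}


def find_inconsistencies(model: Dict[str, int], pooling: Dict[str, int], sbert: Dict[str, int]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if model["hidden_size"] % model["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    if pooling["word_embedding_dimension"] != model["hidden_size"]:
        problems.append("pooling dimension differs from hidden_size")
    if sbert["max_seq_length"] > model["max_position_embeddings"]:
        problems.append("max_seq_length exceeds max_position_embeddings")
    return problems


problems = find_inconsistencies(model_config, pooling_config, sbert_config)
head_dim = model_config["hidden_size"] // model_config["num_attention_heads"]
print(problems, head_dim)
```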

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6e3451807acdf6a54ce23b93dd3c192bc364958c0dff87ead32cce3d6042df29
-size 90864192

 version https://git-lfs.github.com/spec/v1
+oid sha256:aa9de72750b38df7da74d7c01d581d81f948c4816a7cea04c1308b9432291ad5
+size 437967672
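
Review note: the LFS pointer sizes give a rough parameter count. The sketch assumes every weight is stored as float32 (4 bytes), as `torch_dtype` in `config.json` suggests, and ignores the small safetensors header, so the figures are approximations only.

```python
BYTES_PER_FLOAT32 = 4

old_size_bytes = 90_864_192    # removed pointer
new_size_bytes = 437_967_672   # added pointer


def approx_params(size_bytes: int) -> int:
    """Approximate parameter count, treating the whole file as float32 weights."""
    return size_bytes // BYTES_PER_FLOAT32


old_params = approx_params(old_size_bytes)
new_params = approx_params(new_size_bytes)
growth = new_size_bytes / old_size_bytes

print(f"old ~{old_params / 1e6:.1f}M, new ~{new_params / 1e6:.1f}M, x{growth:.2f}")
```

This works out to roughly 22.7M parameters before and 109.5M after, in line with the move from a 6-layer, 384-wide BERT to a 12-layer, 768-wide MPNet shown in `config.json`.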

sentence_bert_config.json CHANGED

@@ -1,4 +1,4 @@
 {
-    "max_seq_length": 256,
     "do_lower_case": false
 }

 {
+    "max_seq_length": 384,
     "do_lower_case": false
 }

special_tokens_map.json CHANGED

@@ -1,27 +1,41 @@
 {
   "cls_token": {
-    "content": "[CLS]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
-  "mask_token": {
-    "content": "[MASK]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "pad_token": {
-    "content": "[PAD]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "sep_token": {
-    "content": "[SEP]",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,

 {
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
   "cls_token": {
+    "content": "<s>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
+  "eos_token": {
+    "content": "</s>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
   "pad_token": {
+    "content": "<pad>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
   "sep_token": {
+    "content": "</s>",
     "lstrip": false,
     "normalized": false,
     "rstrip": false,

tokenizer.json CHANGED

The diff for this file is too large to render.

tokenizer_config.json CHANGED

@@ -1,64 +1,72 @@
 {
   "added_tokens_decoder": {
     "0": {
-      "content": "[PAD]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "100": {
-      "content": "[UNK]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "101": {
-      "content": "[CLS]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "102": {
-      "content": "[SEP]",
       "lstrip": false,
-      "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
-    "103": {
-      "content": "[MASK]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     }
   },
   "clean_up_tokenization_spaces": false,
-  "cls_token": "[CLS]",
-  "do_basic_tokenize": true,
   "do_lower_case": true,
   "extra_special_tokens": {},
-  "mask_token": "[MASK]",
   "max_length": 128,
-  "model_max_length": 256,
-  "never_split": null,
   "pad_to_multiple_of": null,
-  "pad_token": "[PAD]",
   "pad_token_type_id": 0,
   "padding_side": "right",
-  "sep_token": "[SEP]",
   "stride": 0,
   "strip_accents": null,
   "tokenize_chinese_chars": true,
-  "tokenizer_class": "BertTokenizer",
   "truncation_side": "right",
   "truncation_strategy": "longest_first",
   "unk_token": "[UNK]"

 {
   "added_tokens_decoder": {
     "0": {
+      "content": "<s>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "1": {
+      "content": "<pad>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "2": {
+      "content": "</s>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "3": {
+      "content": "<unk>",
       "lstrip": false,
+      "normalized": true,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
+    "104": {
+      "content": "[UNK]",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
+    },
+    "30526": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
     }
   },
+  "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
   "do_lower_case": true,
+  "eos_token": "</s>",
   "extra_special_tokens": {},
+  "mask_token": "<mask>",
   "max_length": 128,
+  "model_max_length": 384,
   "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
   "pad_token_type_id": 0,
   "padding_side": "right",
+  "sep_token": "</s>",
   "stride": 0,
   "strip_accents": null,
   "tokenize_chinese_chars": true,
+  "tokenizer_class": "MPNetTokenizer",
   "truncation_side": "right",
   "truncation_strategy": "longest_first",
   "unk_token": "[UNK]"

vocab.txt CHANGED

@@ -1,3 +1,7 @@
 [PAD]
 [unused0]
 [unused1]
@@ -30520,3 +30524,4 @@ necessitated
 ##：
 ##？
 ##～

+<s>
+<pad>
+</s>
+<unk>
 [PAD]
 [unused0]
 [unused1]
 ##：
 ##？
 ##～
+<mask>
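
Review note: the five added vocabulary lines explain the token-id changes seen in `config.json` and `tokenizer_config.json`. The sketch assumes a token's id is its zero-based line index in `vocab.txt`; `shifted_id` is a hypothetical helper used only for this check.

```python
OLD_VOCAB_SIZE = 30522                        # vocab_size in the old config.json
PREPENDED = ["<s>", "<pad>", "</s>", "<unk>"]  # added at the top of vocab.txt
APPENDED = ["<mask>"]                          # added at the end of vocab.txt


def shifted_id(old_id: int) -> int:
    """Id of an old token after the new tokens are inserted ahead of it."""
    return old_id + len(PREPENDED)


new_vocab_size = OLD_VOCAB_SIZE + len(PREPENDED) + len(APPENDED)

# Prepended tokens take ids 0..3; the appended token takes the last id.
special_ids = {token: index for index, token in enumerate(PREPENDED)}
special_ids["[UNK]"] = shifted_id(100)         # [UNK] was id 100 in the old tokenizer
special_ids["<mask>"] = new_vocab_size - 1

print(new_vocab_size, special_ids)
```

Under that assumption the numbers agree with the rest of the commit: `vocab_size` 30527, `pad_token_id` 1, `[UNK]` at 104 and `<mask>` at 30526.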