zoharzaig committed on
Commit 42cdcd8 · verified · 1 Parent(s): 993d540

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 384,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,493 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - dense
+ - generated_from_trainer
+ - dataset_size:166106
+ - loss:MultipleNegativesRankingLoss
+ base_model: sentence-transformers/all-MiniLM-L6-v2
+ widget:
+ - source_sentence: Not another call about this...
+   sentences:
+   - The heart with ribbon emoji is commonly used to express love, affection, or gratitude.
+     It can be sent to a loved one to show appreciation or as a symbol of a special
+     bond. It is also used in celebrations like Valentine's Day or birthdays.
+   - The ox emoji is often used to symbolize strength, power, and determination. It
+     can also represent hard work, reliability, and persistence. In Chinese culture,
+     the ox is associated with prosperity and hard work. This emoji can be used in
+     contexts related to agriculture, farming, and the zodiac sign of the Ox.
+   - The 😾 emoji, known as pouting cat, is often used to express annoyance, displeasure,
+     or stubbornness. It can convey a sense of sulking or being upset about something.
+ - source_sentence: A peaceful beach day sounds amazing right now.
+   sentences:
+   - This emoji depicts a woman fairy with medium-light skin tone. It is often used
+     to represent magic, fantasy, and whimsical themes. It can also symbolize playfulness,
+     enchantment, and fairy tales.
+   - The backpack emoji is used to symbolize carrying items while traveling, hiking,
+     or going to school. It can also represent adventure and exploration.
+   - The spiral shell emoji is often used to represent the beach, ocean, sea life,
+     or vacation. It can also symbolize relaxation, tranquility, and summer vibes.
+ - source_sentence: Do you know how far it is to the nearest ladies room?
+   sentences:
+   - The women's room emoji is used to indicate a restroom designated for females.
+     It is commonly used to specify the location of women's bathrooms in public spaces
+     or buildings.
+   - The grinning face emoji is used to express happiness, joy, or a friendly greeting.
+     It can also be used to show excitement or amusement. This emoji is commonly used
+     in casual conversations.
+   - The person walking facing right emoji with light skin tone is used to represent
+     someone walking to the right. It can symbolize movement, walking, or simply the
+     action of going in the direction of the right.
+ - source_sentence: Let's fix this together.
+   sentences:
+   - The 💯 emoji is used to emphasize perfection, excellence, or a job well done. It
+     can also be used to indicate that something is top-notch or 100% correct. This
+     emoji is commonly used to show approval or admiration for something.
+   - The rightwards pushing hand emoji is used to indicate pushing or movement to the
+     right. It can also be used in a friendly manner to encourage someone to go in
+     a certain direction or to express agreement or approval.
+   - This emoji represents a man kneeling with light skin tone. It is commonly used
+     to show humility, gratitude, or to ask for forgiveness. It can also be used in
+     the context of prayer or protest.
+ - source_sentence: I love how the sound of popping bubbles is so relaxing
+   sentences:
+   - The man surfing emoji with dark skin tone is used to represent someone enjoying
+     surfing in the ocean. It can be used in the context of beach activities, water
+     sports, or simply to convey a sense of relaxation and fun.
+   - This emoji is often used to signify raising a hand in agreement, asking a question,
+     or volunteering for something. It can also be used to show excitement or cheerfulness.
+   - The 🫧 emoji is commonly used to represent bubbles or to express a feeling of lightness
+     and playfulness. It can also be used in contexts related to cleanliness, baths,
+     or water.
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+ 
+ # SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
+ 
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) <!-- at revision c9745ed1d9f207416be6d2e6f8de32d1f16199bf -->
+ - **Maximum Sequence Length:** 256 tokens
+ - **Output Dimensionality:** 384 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+ 
+ ### Model Sources
+ 
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+ 
+ ### Full Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
+   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
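The Pooling module above uses mean pooling (`pooling_mode_mean_tokens: True`): token embeddings are averaged, counting only non-padding positions. A minimal numpy sketch of that idea, with illustrative array names and shapes rather than the library's internals:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    # Average only over real tokens; padding positions are masked out.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

tokens = np.array([[[2.0, 4.0], [6.0, 8.0], [9.0, 9.0]]])  # last position is padding
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[4. 6.]] — the padded position is ignored
```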
+ 
+ ## Usage
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ First install the Sentence Transformers library:
+ 
+ ```bash
+ pip install -U sentence-transformers
+ ```
+ 
+ Then you can load this model and run inference:
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("zoharzaig/emoji-prediction-model")
+ # Run inference
+ sentences = [
+     'I love how the sound of popping bubbles is so relaxing',
+     'The \U0001fae7 emoji is commonly used to represent bubbles or to express a feeling of lightness and playfulness. It can also be used in contexts related to cleanliness, baths, or water.',
+     'The man surfing emoji with dark skin tone is used to represent someone enjoying surfing in the ocean. It can be used in the context of beach activities, water sports, or simply to convey a sense of relaxation and fun.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 384]
+ 
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[1.0000, 0.6450, 0.2089],
+ #         [0.6450, 1.0000, 0.3229],
+ #         [0.2089, 0.3229, 1.0000]])
+ ```
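Because the model ends with a Normalize module, the cosine-similarity matrix printed above is simply the matrix product of the unit-length embeddings. A minimal sketch of that equivalence in plain numpy (random vectors stand in for real model outputs):

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    # Normalise each row to unit length; pairwise cosine similarity is
    # then a plain matrix product, mirroring what model.similarity()
    # computes when similarity_fn_name is "cosine".
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 384))  # stand-in for model.encode(...) output
sim = cosine_similarity_matrix(emb)
print(np.allclose(np.diag(sim), 1.0))  # True: each vector matches itself
```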
+ 
+ <!--
+ ### Direct Usage (Transformers)
+ 
+ <details><summary>Click to see the direct usage in Transformers</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+ 
+ You can finetune this model on your own dataset.
+ 
+ <details><summary>Click to expand</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ #### Unnamed Dataset
+ 
+ * Size: 166,106 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0 | sentence_1 |
+   |:--------|:-----------|:-----------|
+   | type    | string     | string     |
+   | details | <ul><li>min: 5 tokens</li><li>mean: 11.95 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 26 tokens</li><li>mean: 45.67 tokens</li><li>max: 78 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 |
+   |:-----------|:-----------|
+   | <code>Solidarity and strength united.</code> | <code>The left-facing fist emoji with medium skin tone is often used to represent solidarity, unity, and strength. It can also be used to show support or encouragement.</code> |
+   | <code>Have you ever seen anything strange while stargazing?</code> | <code>The alien emoji is often used to represent anything extraterrestrial or outer-space related. It can also be used to convey a sense of otherness or weirdness. Some people use it to refer to conspiracy theories or unidentified flying objects.</code> |
+   | <code>Can't wait until I can hold you again.</code> | <code>This emoji represents a kiss between two men, it is often used to symbolize love, affection, or a romantic relationship between two male individuals.</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
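MultipleNegativesRankingLoss treats every other positive in the batch as a negative for a given anchor: the cosine-similarity matrix between anchors and positives is scaled (here by 20.0) and scored with cross-entropy, with the matching pair on the diagonal as the target. A hedged numpy sketch of that objective, not the library's implementation:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    # Scaled cosine similarities between each anchor and every in-batch positive.
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = scale * (unit(anchors) @ unit(positives).T)
    # Cross-entropy with the matching pair (the diagonal) as the label.
    row_max = sims.max(axis=1, keepdims=True)
    log_probs = sims - row_max - np.log(np.exp(sims - row_max).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

rng = np.random.default_rng(1)
a = rng.normal(size=(4, 8))
loss_matched = mnr_loss(a, a)                        # correct anchor/positive pairing
loss_shuffled = mnr_loss(a, np.roll(a, 1, axis=0))   # deliberately wrong pairing
print(loss_matched < loss_shuffled)  # True: matched pairs score lower loss
```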
+ 
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+ 
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 5
+ - `multi_dataset_batch_sampler`: round_robin
+ 
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+ 
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 5
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+ 
+ </details>
327
+
328
+ ### Training Logs
329
+ <details><summary>Click to expand</summary>
330
+
331
+ | Epoch | Step | Training Loss |
332
+ |:------:|:-----:|:-------------:|
333
+ | 0.0482 | 500 | 1.2811 |
334
+ | 0.0963 | 1000 | 1.014 |
335
+ | 0.1445 | 1500 | 0.874 |
336
+ | 0.1926 | 2000 | 0.7975 |
337
+ | 0.2408 | 2500 | 0.7261 |
338
+ | 0.2890 | 3000 | 0.6903 |
339
+ | 0.3371 | 3500 | 0.6547 |
340
+ | 0.3853 | 4000 | 0.6412 |
341
+ | 0.4334 | 4500 | 0.5815 |
342
+ | 0.4816 | 5000 | 0.5765 |
343
+ | 0.5298 | 5500 | 0.5461 |
344
+ | 0.5779 | 6000 | 0.5402 |
345
+ | 0.6261 | 6500 | 0.5347 |
346
+ | 0.6742 | 7000 | 0.5102 |
347
+ | 0.7224 | 7500 | 0.488 |
348
+ | 0.7706 | 8000 | 0.4968 |
349
+ | 0.8187 | 8500 | 0.4758 |
350
+ | 0.8669 | 9000 | 0.4725 |
351
+ | 0.9150 | 9500 | 0.4564 |
352
+ | 0.9632 | 10000 | 0.4563 |
353
+ | 1.0114 | 10500 | 0.4213 |
354
+ | 1.0595 | 11000 | 0.381 |
355
+ | 1.1077 | 11500 | 0.376 |
356
+ | 1.1558 | 12000 | 0.3991 |
357
+ | 1.2040 | 12500 | 0.3845 |
358
+ | 1.2522 | 13000 | 0.377 |
359
+ | 1.3003 | 13500 | 0.3752 |
360
+ | 1.3485 | 14000 | 0.3648 |
361
+ | 1.3966 | 14500 | 0.3914 |
362
+ | 1.4448 | 15000 | 0.3665 |
363
+ | 1.4930 | 15500 | 0.3867 |
364
+ | 1.5411 | 16000 | 0.3606 |
365
+ | 1.5893 | 16500 | 0.3706 |
366
+ | 1.6374 | 17000 | 0.3462 |
367
+ | 1.6856 | 17500 | 0.3616 |
368
+ | 1.7338 | 18000 | 0.3424 |
369
+ | 1.7819 | 18500 | 0.3465 |
370
+ | 1.8301 | 19000 | 0.3433 |
371
+ | 1.8783 | 19500 | 0.336 |
372
+ | 1.9264 | 20000 | 0.3448 |
373
+ | 1.9746 | 20500 | 0.3463 |
374
+ | 2.0227 | 21000 | 0.3171 |
375
+ | 2.0709 | 21500 | 0.3087 |
376
+ | 2.1191 | 22000 | 0.2961 |
377
+ | 2.1672 | 22500 | 0.2991 |
378
+ | 2.2154 | 23000 | 0.3007 |
379
+ | 2.2635 | 23500 | 0.2982 |
380
+ | 2.3117 | 24000 | 0.2886 |
381
+ | 2.3599 | 24500 | 0.2881 |
382
+ | 2.4080 | 25000 | 0.2867 |
383
+ | 2.4562 | 25500 | 0.2998 |
384
+ | 2.5043 | 26000 | 0.2942 |
385
+ | 2.5525 | 26500 | 0.2941 |
386
+ | 2.6007 | 27000 | 0.2938 |
387
+ | 2.6488 | 27500 | 0.2776 |
388
+ | 2.6970 | 28000 | 0.2705 |
389
+ | 2.7451 | 28500 | 0.2949 |
390
+ | 2.7933 | 29000 | 0.2856 |
391
+ | 2.8415 | 29500 | 0.2724 |
392
+ | 2.8896 | 30000 | 0.2891 |
393
+ | 2.9378 | 30500 | 0.2835 |
394
+ | 2.9859 | 31000 | 0.2896 |
395
+ | 3.0341 | 31500 | 0.2571 |
396
+ | 3.0823 | 32000 | 0.2541 |
397
+ | 3.1304 | 32500 | 0.2527 |
398
+ | 3.1786 | 33000 | 0.2587 |
399
+ | 3.2267 | 33500 | 0.251 |
400
+ | 3.2749 | 34000 | 0.2437 |
401
+ | 3.3231 | 34500 | 0.252 |
402
+ | 3.3712 | 35000 | 0.2533 |
403
+ | 3.4194 | 35500 | 0.2388 |
404
+ | 3.4675 | 36000 | 0.2391 |
405
+ | 3.5157 | 36500 | 0.2488 |
406
+ | 3.5639 | 37000 | 0.2442 |
407
+ | 3.6120 | 37500 | 0.2398 |
408
+ | 3.6602 | 38000 | 0.254 |
409
+ | 3.7083 | 38500 | 0.2427 |
410
+ | 3.7565 | 39000 | 0.2396 |
411
+ | 3.8047 | 39500 | 0.2409 |
412
+ | 3.8528 | 40000 | 0.2406 |
413
+ | 3.9010 | 40500 | 0.2529 |
414
+ | 3.9491 | 41000 | 0.2485 |
415
+ | 3.9973 | 41500 | 0.2427 |
416
+ | 4.0455 | 42000 | 0.2133 |
417
+ | 4.0936 | 42500 | 0.2337 |
418
+ | 4.1418 | 43000 | 0.2307 |
419
+ | 4.1899 | 43500 | 0.2172 |
420
+ | 4.2381 | 44000 | 0.2137 |
421
+ | 4.2863 | 44500 | 0.235 |
422
+ | 4.3344 | 45000 | 0.2198 |
423
+ | 4.3826 | 45500 | 0.2275 |
424
+ | 4.4307 | 46000 | 0.2339 |
425
+ | 4.4789 | 46500 | 0.227 |
426
+ | 4.5271 | 47000 | 0.2265 |
427
+ | 4.5752 | 47500 | 0.2232 |
428
+ | 4.6234 | 48000 | 0.2248 |
429
+ | 4.6715 | 48500 | 0.223 |
430
+ | 4.7197 | 49000 | 0.2287 |
431
+ | 4.7679 | 49500 | 0.2168 |
432
+ | 4.8160 | 50000 | 0.2262 |
433
+ | 4.8642 | 50500 | 0.2207 |
434
+ | 4.9123 | 51000 | 0.2043 |
435
+ | 4.9605 | 51500 | 0.2233 |
436
+
437
+ </details>
+ 
+ ### Framework Versions
+ - Python: 3.9.6
+ - Sentence Transformers: 5.0.0
+ - Transformers: 4.53.1
+ - PyTorch: 2.7.1
+ - Accelerate: 1.8.1
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.2
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.0.0",
+     "transformers": "4.53.2",
+     "pytorch": "2.7.1"
+   },
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f5726386c25d380821b8bb997a6b62feb1d27e1a16a9c609e490cfb87f941dfb
+ size 90864192
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
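The Normalize module at idx 2 rescales each pooled sentence vector to unit L2 norm, which is what lets dot product and cosine similarity coincide downstream. A minimal sketch of that step (not the library's implementation):

```python
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    # Rescale each row to unit Euclidean length, as the Normalize
    # module does to the pooled sentence embeddings.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = l2_normalize(vecs)
print(unit)  # rows become [0.6, 0.8] and [0.0, 1.0]
```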
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 256,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 256,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
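The `truncation_strategy` of `"longest_first"` trims a sentence pair to a length budget by repeatedly dropping a token from whichever sequence is currently longer. A simplified sketch of the idea, ignoring special tokens and working on plain lists of token strings:

```python
def truncate_longest_first(tokens_a, tokens_b, max_length):
    # Drop from the tail of the currently longer sequence until the
    # combined length fits the budget (special tokens not modelled here).
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > max_length:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b

a, b = truncate_longest_first(["t1", "t2", "t3", "t4"], ["u1", "u2"], 4)
print(a, b)  # ['t1', 't2'] ['u1', 'u2'] — only the longer sequence was trimmed
```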
vocab.txt ADDED
The diff for this file is too large to render. See raw diff