adbaral committed
Commit 90741a4 · verified · 1 Parent(s): 39f7a04

Add new CrossEncoder model

Files changed (7)
  1. README.md +221 -0
  2. config.json +35 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +58 -0
  7. vocab.txt +0 -0
README.md ADDED
---
language:
- en
license: apache-2.0
tags:
- cross-encoder
- sentence-transformers
- text-classification
- sentence-pair-classification
- semantic-similarity
- semantic-search
- retrieval
- reranking
- generated_from_trainer
- dataset_size:40015318
- loss:BinaryCrossEntropyLoss
base_model: cross-encoder/ms-marco-TinyBERT-L2-v2
datasets:
- adbaral/langcache-sentencepairs-v3
pipeline_tag: text-ranking
library_name: sentence-transformers
---

# Redis fine-tuned CrossEncoder model for semantic caching on LangCache

This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [cross-encoder/ms-marco-TinyBERT-L2-v2](https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L2-v2) on the [LangCache Sentence Pairs (subsets=['all'], train+val=True)](https://huggingface.co/datasets/aditeyabaral-redis/langcache-sentencepairs-v3) dataset using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for sentence pair classification.

## Model Details

### Model Description
- **Model Type:** Cross Encoder
- **Base model:** [cross-encoder/ms-marco-TinyBERT-L2-v2](https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L2-v2) <!-- at revision 81d1926f67cb8eee2c2be17ca9f793c7c3bd20cc -->
- **Maximum Sequence Length:** 512 tokens
- **Number of Output Labels:** 1 label
- **Training Dataset:**
    - [LangCache Sentence Pairs (subsets=['all'], train+val=True)](https://huggingface.co/datasets/aditeyabaral-redis/langcache-sentencepairs-v3)
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("redis/langcache-reranker-v2-tiny-bce-label-smoothing-0.5")
# Get scores for pairs of texts
pairs = [
    ['The newer Punts are still very much in existence today and race in the same fleets as the older boats .', 'The newer punts are still very much in existence today and run in the same fleets as the older boats .'],
    ['Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .', 'Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .'],
    ['After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .', 'Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .'],
    ['She married Peter Haygarth on 29 May 1964 in Durban . Her second marriage , to Robin Osborne , took place in 1977 .', 'She married Robin Osborne on May 29 , 1964 in Durban , and her second marriage with Peter Haygarth took place in 1977 .'],
    ['In 2005 she moved to Norway , settled in Geilo and worked as a rafting guide , in 2006 she started mountain biking - races .', 'In 2005 , she moved to Geilo , settling in Norway and worked as a rafting guide . She started mountain bike races in 2006 .'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'The newer Punts are still very much in existence today and race in the same fleets as the older boats .',
    [
        'The newer punts are still very much in existence today and run in the same fleets as the older boats .',
        'Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .',
        'Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .',
        'She married Robin Osborne on May 29 , 1964 in Durban , and her second marriage with Peter Haygarth took place in 1977 .',
        'In 2005 , she moved to Geilo , settling in Norway and worked as a rafting guide . She started mountain bike races in 2006 .',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
```
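
For semantic caching, a single pair score is usually turned into a hit/miss decision. Below is a minimal sketch of that pattern; since this model stores `torch.nn.Identity` as its activation function (see `config.json`), the assumption here is that `predict` returns raw logits, and the 0.8 threshold is purely illustrative and should be tuned on your own data.

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("redis/langcache-reranker-v2-tiny-bce-label-smoothing-0.5")

CACHE_HIT_THRESHOLD = 0.8  # hypothetical value; tune on held-out pairs

def is_cache_hit(query: str, cached_query: str) -> bool:
    # predict() is assumed to return a raw logit (Identity activation),
    # so squash it to [0, 1] with a sigmoid before thresholding.
    logit = float(model.predict([(query, cached_query)])[0])
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob >= CACHE_HIT_THRESHOLD

print(is_cache_hit(
    "How do I reset my password?",
    "What are the steps to reset a password?",
))
```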

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->
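
The commented section above is left empty by the card template. For reference, a minimal sketch of equivalent usage with plain 🤗 Transformers, assuming the checkpoint loads as `BertForSequenceClassification` with one output label (both declared in `config.json`):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "redis/langcache-reranker-v2-tiny-bce-label-smoothing-0.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# A cross-encoder scores both sentences jointly in one forward pass.
features = tokenizer(
    ["The newer Punts are still very much in existence today and race in the same fleets as the older boats ."],
    ["The newer punts are still very much in existence today and run in the same fleets as the older boats ."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**features).logits  # shape (1, 1): one raw score per pair
print(logits.squeeze(-1))
```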

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### LangCache Sentence Pairs (subsets=['all'], train+val=True)

* Dataset: [LangCache Sentence Pairs (subsets=['all'], train+val=True)](https://huggingface.co/datasets/aditeyabaral-redis/langcache-sentencepairs-v3)
* Size: 40,015,318 training samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | label |
  |:--------|:----------|:----------|:------|
  | type    | string    | string    | float |
  | details | <ul><li>min: 27 characters</li><li>mean: 110.99 characters</li><li>max: 199 characters</li></ul> | <ul><li>min: 29 characters</li><li>mean: 110.74 characters</li><li>max: 191 characters</li></ul> | <ul><li>min: 0.25</li><li>mean: 0.49</li><li>max: 0.75</li></ul> |
* Samples:
  | sentence1 | sentence2 | label |
  |:----------|:----------|:------|
  | <code>The film was remade in Telugu with the same name in 1981 by Chandra Mohan , Jayasudha , Chakravarthy and S. P. Balasubrahmanyam starring K. Vasu .</code> | <code>The film was written in Telugu with the same name in 1981 by Chandra Mohan , Jayasudha , Chakravarthy and S. P. Balasubrahmanyam with K. Vasu .</code> | <code>0.75</code> |
  | <code>The Cauchy stress tensor has therefore the following form</code> | <code>Therefore , the cauchy tensor has the following form</code> | <code>0.75</code> |
  | <code>John Jairo Culma , sometimes spelled as Jhon Jairo Culma ( born 17 March 1981 ) is a Colombian footballer .</code> | <code>John Jairo Culma , sometimes known as Jhon Jairo Culma ( born March 17 , 1981 ) , is a Colombian footballer .</code> | <code>0.75</code> |
* Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
  ```json
  {
      "activation_fn": "torch.nn.modules.linear.Identity",
      "pos_weight": 0.1028856709599495
  }
  ```
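
For reference, a sketch of how this loss might be instantiated in sentence-transformers with the parameters above. Only the loss parameters and the base model come from this card; the surrounding trainer and dataset setup are omitted.

```python
import torch
from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Start from the base model named in the Model Details section.
model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L2-v2", num_labels=1)

# pos_weight < 1 down-weights the positive class relative to the negative
# class (value taken from the JSON above).
loss = BinaryCrossEntropyLoss(
    model=model,
    activation_fn=torch.nn.Identity(),
    pos_weight=torch.tensor(0.1028856709599495),
)
```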

### Evaluation Dataset

#### LangCache Sentence Pairs (split=test)

* Dataset: [LangCache Sentence Pairs (split=test)](https://huggingface.co/datasets/aditeyabaral-redis/langcache-sentencepairs-v3)
* Size: 74,265 evaluation samples
* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence1 | sentence2 | label |
  |:--------|:----------|:----------|:------|
  | type    | string    | string    | int   |
  | details | <ul><li>min: 27 characters</li><li>mean: 112.72 characters</li><li>max: 197 characters</li></ul> | <ul><li>min: 27 characters</li><li>mean: 112.54 characters</li><li>max: 198 characters</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
* Samples:
  | sentence1 | sentence2 | label |
  |:----------|:----------|:------|
  | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
  | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
  | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
* Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
  ```json
  {
      "activation_fn": "torch.nn.modules.linear.Identity",
      "pos_weight": 0.1028856709599495
  }
  ```

### Framework Versions
- Python: 3.12.3
- Sentence Transformers: 5.1.0
- Transformers: 4.56.0
- PyTorch: 2.8.0+cu128
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
config.json ADDED
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "bfloat16",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "sentence_transformers": {
    "activation_fn": "torch.nn.modules.linear.Identity",
    "version": "5.1.0"
  },
  "transformers_version": "4.56.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
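
This is a very small encoder: 2 layers, hidden size 128, 2 attention heads. A quick sketch to confirm the loaded architecture matches the file above (the repo id is assumed from the usage section):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("redis/langcache-reranker-v2-tiny-bce-label-smoothing-0.5")
# Expect the values from the JSON above.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 2 128 2
```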
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:191dc53e48fb9178eb8319a471ad4c8ab8e9da09b2653af8362559f2005c8e3e
size 8776666
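
Git LFS keeps only this pointer in the repository; the ~8.8 MB weight blob is fetched separately. A small sketch to check a downloaded copy against the pointer's size and SHA-256:

```python
import hashlib
import os

EXPECTED_SHA256 = "191dc53e48fb9178eb8319a471ad4c8ab8e9da09b2653af8362559f2005c8e3e"
EXPECTED_SIZE = 8776666  # bytes, from the pointer above

path = "model.safetensors"  # assumed local path after download
assert os.path.getsize(path) == EXPECTED_SIZE, "size mismatch"

digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)
assert digest.hexdigest() == EXPECTED_SHA256, "hash mismatch"
print("model.safetensors matches the LFS pointer")
```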
special_tokens_map.json ADDED
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
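
These settings describe a standard lowercasing BERT WordPiece tokenizer. A short sketch of how it encodes a sentence pair for the cross-encoder (the repo id is assumed from the usage section); both sentences share one input sequence, separated by `[SEP]`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("redis/langcache-reranker-v2-tiny-bce-label-smoothing-0.5")

encoded = tokenizer("How do I reset my password?", "Steps to reset a password")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected shape: ['[CLS]', 'how', ..., '[SEP]', 'steps', ..., '[SEP]']
# (do_lower_case is true, so the input is lowercased first)
```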
vocab.txt ADDED
The diff for this file is too large to render. See raw diff