nielsr (HF Staff) committed on
Commit eea3f37 · verified · 1 Parent(s): c1cd11a

Add pipeline tag and official links


Hi! I'm Niels, part of the community science team at Hugging Face. This PR improves the model card for DeepConf by:
- Adding the `pipeline_tag: text-generation` to the YAML metadata for better discoverability.
- Including explicit links to the [Deep Think with Confidence paper](https://huggingface.co/papers/2508.15260), the project page, and the official GitHub repository.
- Organizing the content to better highlight the methodology and providing a BibTeX citation.

Files changed (1)
  1. README.md +25 -185
README.md CHANGED
@@ -1,19 +1,22 @@
  ---
- license: apache-2.0
  library_name: transformers
  tags:
  - custom_generate
  - sampling
  ---

-
  # DeepCONF Custom Generation Strategy

- This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, following the [Deep Think with Confidence](https://jiaweizzhao.github.io/deepconf/) approach from the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).

  ## Overview

- DeepConf monitors the confidence of generated tokens and stops generation when confidence falls below a threshold. The confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling/filtering is applied), following the methodology from the [official DeepConf implementation](https://github.com/facebookresearch/deepconf).

  ## Parameters
 
@@ -89,19 +92,13 @@ if hasattr(outputs, 'confidences'):

  ### Calibration (DeepConf-low/high)

- DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior rather than using a fixed value.

  **Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:

  ```python
  from transformers import GenerationConfig

- # Prepare inputs
- question = "What is 2 + 2?"
- messages = [{"role": "user", "content": question}]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-
  # Configure warmup generation
  warmup_cfg = GenerationConfig(
      do_sample=True,
@@ -112,7 +109,6 @@ warmup_cfg = GenerationConfig(
      return_dict_in_generate=True,
      output_confidences=True,
      num_return_sequences=8,  # Generate 8 warmup sequences
-     # Note: Do NOT set threshold here - warmup should run without early stopping
  )

  # Generate warmup sequences
@@ -134,17 +130,14 @@ print(f"Warmup min confidences: {warmup_C}")
  # Configure production generation with calibrated threshold
  gen_cfg = GenerationConfig(
      do_sample=True,
-     temperature=0.7,
-     top_p=0.95,
      max_new_tokens=512,
      enable_conf=True,
      return_dict_in_generate=True,
      output_confidences=True,

      # Automatic threshold calibration
-     deepconf_variant="low",  # "low" (aggressive, 90th percentile) or "high" (permissive, 10th percentile)
      deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
-     # Optional: deepconf_eta=0.1,  # Override eta (defaults: 0.1 for low, 0.9 for high)
  )

  # Generate with calibrated threshold
@@ -154,163 +147,6 @@ outputs = model.generate(
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
-
- print(f"Generated: {tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)}")
- ```
-
- **Variant Explanation:**
- - **DeepConf-low** (eta=0.1): Uses 90th percentile threshold → More aggressive early stopping
- - **DeepConf-high** (eta=0.9): Uses 10th percentile threshold → More permissive, allows longer generation
-
- ### Two Modes of Operation
-
- DeepConf supports two modes that match different use cases:
-
- #### Mode 1: Online Early Stopping (Default)
-
- This is the default behavior where early stopping happens **during** generation:
-
- ```python
- # Online mode: Stop immediately when confidence drops
- gen_config = GenerationConfig(
-     enable_conf=True,
-     enable_early_stopping=True,  # Default: True (online stopping)
-     threshold=17.0,
-     window_size=2048,
-     max_new_tokens=512,
- )
-
- outputs = model.generate(**inputs, generation_config=gen_config, custom_generate="kashif/DeepConf")
- ```
-
- **Use cases:**
- - Interactive generation where you want immediate results
- - Real-time applications
- - Single-sequence generation
- - Lower memory usage (no need to store full sequences)
-
- #### Mode 2: Batch Generation + Post-Processing
-
- Generate multiple sequences without early stopping, then analyze them afterward:
-
- ```python
- import torch
-
- # Phase 1: Generate multiple sequences WITHOUT early stopping
- gen_config = GenerationConfig(
-     enable_conf=True,
-     enable_early_stopping=False,  # Disable online stopping
-     output_confidences=True,
-     return_dict_in_generate=True,
-     max_new_tokens=64,
- )
-
- # Expand inputs for batch generation (e.g., 8 sequences)
- num_sequences = 8
- expanded_input_ids = inputs.input_ids.repeat(num_sequences, 1)
- if 'attention_mask' in inputs and inputs.attention_mask is not None:
-     expanded_attention_mask = inputs.attention_mask.repeat(num_sequences, 1)
- else:
-     expanded_attention_mask = None
-
- # Generate batch
- outputs = model.generate(
-     input_ids=expanded_input_ids,
-     attention_mask=expanded_attention_mask,
-     generation_config=gen_config,
-     custom_generate="kashif/DeepConf"
- )
-
- # Phase 2: Post-process to analyze confidence patterns
- from custom_generate.utils import process_batch_results
-
- results = process_batch_results(
-     outputs,
-     tokenizer,
-     window_size=2048,
-     threshold=17.0
- )
-
- # Analyze results
- print(f"Generated {results['num_traces']} sequences")
- print(f"Min confidences: {results['min_confs']}")
-
- for i, trace in enumerate(results['traces']):
-     print(f"\nSequence {i+1}:")
-     print(f"  Text: {trace['text'][:100]}...")
-     print(f"  Min confidence: {trace['min_conf']:.3f}")
-     print(f"  Would stop early: {trace['stopped_early']}")
-     if trace['stopped_early']:
-         print(f"  Stop position: {trace['stop_position']}")
- ```
-
- **Use cases:**
- - Research and experimentation (try different thresholds without regenerating)
- - Batch serving (generate multiple candidates at once)
- - Analysis and voting (like the official implementation)
- - Calibration and threshold tuning
-
- **Utility Functions:**
-
- The `custom_generate/utils.py` module provides helper functions:
-
- - `process_batch_results()`: Analyze batch outputs to detect early stopping positions
- - `analyze_early_stopping()`: Calculate statistics on early stopping behavior
- - `compute_warmup_threshold()`: Derive threshold from warmup confidences
- - `extract_answer()`: Parse LaTeX `\boxed{answer}` patterns
-
- #### Complete Workflow Example (Like Official DeepConf)
-
- This demonstrates the full workflow matching the official implementation:
-
- ```python
- # Step 1: Warmup phase - generate multiple sequences
- warmup_config = GenerationConfig(
-     do_sample=True,
-     temperature=0.7,
-     max_new_tokens=64,
-     enable_conf=True,
-     enable_early_stopping=False,  # No stopping during warmup
-     output_confidences=True,
-     return_dict_in_generate=True,
- )
-
- # Expand for 8 warmup sequences
- num_warmup = 8
- expanded_ids = inputs.input_ids.repeat(num_warmup, 1)
- expanded_mask = inputs.attention_mask.repeat(num_warmup, 1) if 'attention_mask' in inputs else None
-
- warmup_outputs = model.generate(
-     input_ids=expanded_ids,
-     attention_mask=expanded_mask,
-     generation_config=warmup_config,
-     custom_generate="kashif/DeepConf"
- )
-
- # Process warmup to get min confidences
- from custom_generate.utils import process_batch_results, compute_warmup_threshold
-
- warmup_results = process_batch_results(warmup_outputs, tokenizer, window_size=10)
- print(f"Warmup min confidences: {warmup_results['min_confs']}")
-
- # Step 2: Compute threshold from warmup
- threshold = compute_warmup_threshold(
-     warmup_results['min_confs'],
-     variant="low"  # or "high"
- )
- print(f"Calibrated threshold: {threshold:.3f}")
-
- # Step 3: Final generation with calibrated threshold
- final_config = GenerationConfig(
-     enable_conf=True,
-     enable_early_stopping=True,  # Online stopping with calibrated threshold
-     threshold=threshold,
-     window_size=10,
-     max_new_tokens=128,
- )
-
- final_output = model.generate(**inputs, generation_config=final_config, custom_generate="kashif/DeepConf")
- print(tokenizer.decode(final_output.sequences[0], skip_special_tokens=True))
  ```

  ## Technical Details
@@ -318,25 +154,29 @@
  ### Confidence Calculation

  The confidence score for each generated token is calculated as follows:
-
- 1. **Extract top-k tokens**: Get the top-k (default: 20) tokens with highest probabilities from the full vocabulary
- 2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens
- 3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens
-
- This approach:
- - Uses the **full probability distribution** (before any top-k/top-p/temperature filtering)
- - Always considers a **fixed number of tokens** (conf_topk=20)
- - Naturally **includes the sampled token** if it's in the top-k

  ### Online Stopping

  The online method uses a sliding window of confidence scores:
- - Maintains a window of the last `window_size` (default: 2048) confidence scores
- - Calculates the mean confidence over this window
- - Stops generation when: `mean_confidence < threshold`

  ## Requirements

  - PyTorch >= 1.13.0
  - Transformers >= 4.35.0
  ---
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
  tags:
  - custom_generate
  - sampling
  ---
 
 
  # DeepCONF Custom Generation Strategy

+ This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, as presented in the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).
+
+ - **Project Page:** [https://jiaweizzhao.github.io/deepconf](https://jiaweizzhao.github.io/deepconf)
+ - **GitHub Repository:** [https://github.com/facebookresearch/deepconf](https://github.com/facebookresearch/deepconf)

  ## Overview

+ DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It monitors the confidence of generated tokens and stops generation when confidence falls below a threshold. The confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling/filtering is applied), following the methodology from the official implementation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks.
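For orientation, a minimal usage sketch with a fixed stopping threshold; it assumes a loaded `model`, `tokenizer`, and tokenized `inputs`, and the parameter values shown are illustrative rather than prescriptive:

```python
from transformers import GenerationConfig

# Online early stopping with a fixed threshold (values are illustrative)
gen_config = GenerationConfig(
    do_sample=True,
    max_new_tokens=512,
    enable_conf=True,             # enable DeepConf confidence tracking
    threshold=17.0,               # stop when the windowed mean confidence drops below this
    window_size=2048,             # sliding-window length for the running mean
    return_dict_in_generate=True,
    output_confidences=True,      # return per-step confidence scores
)

outputs = model.generate(
    **inputs,
    generation_config=gen_config,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```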

  ## Parameters

  ### Calibration (DeepConf-low/high)

+ DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior.

  **Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:

  ```python
  from transformers import GenerationConfig

  # Configure warmup generation
  warmup_cfg = GenerationConfig(
      do_sample=True,

      return_dict_in_generate=True,
      output_confidences=True,
      num_return_sequences=8,  # Generate 8 warmup sequences
  )

  # Generate warmup sequences
  # Configure production generation with calibrated threshold
  gen_cfg = GenerationConfig(
      do_sample=True,
      max_new_tokens=512,
      enable_conf=True,
      return_dict_in_generate=True,
      output_confidences=True,

      # Automatic threshold calibration
+     deepconf_variant="low",  # "low" (aggressive) or "high" (permissive)
      deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
  )

  # Generate with calibrated threshold
  outputs = model.generate(
      **inputs,
      generation_config=gen_cfg,
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
  ```
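The warmup steps elided between these two blocks collect `warmup_C` and derive the threshold. Per the variant semantics documented elsewhere in this repository, `low` (eta=0.1) corresponds to a 90th-percentile threshold over the warmup minimum confidences (aggressive stopping) and `high` (eta=0.9) to a 10th-percentile threshold (permissive). A minimal sketch of that derivation; the `warmup_outputs` variable, the stacking of `confidences` into a `(num_sequences, num_steps)` tensor, and the helper name are assumptions for illustration:

```python
import numpy as np
import torch

# Assumption: the warmup generate call returned `warmup_outputs`, whose
# `confidences` is a per-step sequence of tensors of shape (num_sequences,)
confs = torch.stack(list(warmup_outputs.confidences), dim=1)  # (num_sequences, num_steps)
warmup_C = confs.min(dim=1).values.tolist()                   # per-sequence minimum confidence

def derive_threshold(warmup_min_confs, variant="low"):
    # "low" (eta=0.1): 90th percentile -> aggressive early stopping
    # "high" (eta=0.9): 10th percentile -> permissive, longer generations
    pct = 90 if variant == "low" else 10
    return float(np.percentile(warmup_min_confs, pct))

print(f"Calibrated threshold: {derive_threshold(warmup_C, 'low'):.3f}")
```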

  ## Technical Details

  ### Confidence Calculation

  The confidence score for each generated token is calculated as follows:
+ 1. **Extract top-k tokens**: Get the top-k (default: 20) tokens with highest probabilities from the full vocabulary.
+ 2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens.
+ 3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens.
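A minimal sketch of this computation for a single decoding step, assuming raw logits over the full vocabulary; the function name is illustrative:

```python
import torch

def token_confidence(logits: torch.Tensor, conf_topk: int = 20) -> torch.Tensor:
    # logits: (batch, vocab_size) raw scores for the current step,
    # taken before any top-k/top-p/temperature filtering
    log_probs = torch.log_softmax(logits, dim=-1)
    topk_log_probs = log_probs.topk(conf_topk, dim=-1).values
    return -topk_log_probs.mean(dim=-1)  # higher score = more confident
```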
  ### Online Stopping

  The online method uses a sliding window of confidence scores:
+ - Maintains a window of the last `window_size` (default: 2048) confidence scores.
+ - Calculates the mean confidence over this window.
+ - Stops generation when `mean_confidence < threshold`.
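A minimal sketch of the stopping check, reusing the `token_confidence` helper sketched above; the decoding-loop usage in the comments is illustrative:

```python
from collections import deque

window = deque(maxlen=2048)  # holds the last window_size confidence scores

def should_stop(window, threshold=17.0):
    # Mean confidence over the sliding window; stop once it falls below threshold
    return len(window) > 0 and sum(window) / len(window) < threshold

# Inside the decoding loop (illustrative):
#   window.append(token_confidence(step_logits).item())
#   if should_stop(window, threshold):
#       break
```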

  ## Requirements

  - PyTorch >= 1.13.0
  - Transformers >= 4.35.0

+ ## Citation
+
+ ```bibtex
+ @article{fu2025deep,
+   title={Deep Think with Confidence},
+   author={Fu, Yichao and Wang, Xuewei and Tian, Yuandong and Zhao, Jiawei},
+   journal={arXiv preprint arXiv:2508.15260},
+   year={2025}
+ }
+ ```