nielsr (HF Staff) committed on
Commit eea3f37 · verified · 1 Parent(s): c1cd11a

Add pipeline tag and official links


Hi! I'm Niels, part of the community science team at Hugging Face. This PR improves the model card for DeepConf by:
- Adding the `pipeline_tag: text-generation` to the YAML metadata for better discoverability.
- Including explicit links to the [Deep Think with Confidence paper](https://huggingface.co/papers/2508.15260), the project page, and the official GitHub repository.
- Organizing the content to better highlight the methodology and providing a BibTeX citation.

Files changed (1)
  1. README.md +25 -185
README.md CHANGED
@@ -1,19 +1,22 @@
  ---
- license: apache-2.0
  library_name: transformers
  tags:
  - custom_generate
  - sampling
  ---

-
  # DeepCONF Custom Generation Strategy

- This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, following the [Deep Think with Confidence](https://jiaweizzhao.github.io/deepconf/) approach from the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).

  ## Overview

- DeepConf monitors the confidence of generated tokens and stops generation when confidence falls below a threshold. The confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling/filtering is applied), following the methodology from the [official DeepConf implementation](https://github.com/facebookresearch/deepconf).

  ## Parameters
 
@@ -89,19 +92,13 @@ if hasattr(outputs, 'confidences'):

  ### Calibration (DeepConf-low/high)

- DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior rather than using a fixed value.

  **Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:

  ```python
  from transformers import GenerationConfig

- # Prepare inputs
- question = "What is 2 + 2?"
- messages = [{"role": "user", "content": question}]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-
  # Configure warmup generation
  warmup_cfg = GenerationConfig(
      do_sample=True,
@@ -112,7 +109,6 @@ warmup_cfg = GenerationConfig(
      return_dict_in_generate=True,
      output_confidences=True,
      num_return_sequences=8,  # Generate 8 warmup sequences
-     # Note: Do NOT set threshold here - warmup should run without early stopping
  )

  # Generate warmup sequences
@@ -134,17 +130,14 @@ print(f"Warmup min confidences: {warmup_C}")
  # Configure production generation with calibrated threshold
  gen_cfg = GenerationConfig(
      do_sample=True,
-     temperature=0.7,
-     top_p=0.95,
      max_new_tokens=512,
      enable_conf=True,
      return_dict_in_generate=True,
      output_confidences=True,

      # Automatic threshold calibration
-     deepconf_variant="low",  # "low" (aggressive, 90th percentile) or "high" (permissive, 10th percentile)
      deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
-     # Optional: deepconf_eta=0.1,  # Override eta (defaults: 0.1 for low, 0.9 for high)
  )

  # Generate with calibrated threshold
@@ -154,163 +147,6 @@ outputs = model.generate(
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
-
- print(f"Generated: {tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)}")
- ```
-
- **Variant Explanation:**
- - **DeepConf-low** (eta=0.1): Uses 90th percentile threshold → More aggressive early stopping
- - **DeepConf-high** (eta=0.9): Uses 10th percentile threshold → More permissive, allows longer generation
-
- ### Two Modes of Operation
-
- DeepConf supports two modes that match different use cases:
-
- #### Mode 1: Online Early Stopping (Default)
-
- This is the default behavior where early stopping happens **during** generation:
-
- ```python
- # Online mode: Stop immediately when confidence drops
- gen_config = GenerationConfig(
-     enable_conf=True,
-     enable_early_stopping=True,  # Default: True (online stopping)
-     threshold=17.0,
-     window_size=2048,
-     max_new_tokens=512,
- )
-
- outputs = model.generate(**inputs, generation_config=gen_config, custom_generate="kashif/DeepConf")
- ```
-
- **Use cases:**
- - Interactive generation where you want immediate results
- - Real-time applications
- - Single-sequence generation
- - Lower memory usage (no need to store full sequences)
-
- #### Mode 2: Batch Generation + Post-Processing
-
- Generate multiple sequences without early stopping, then analyze them afterward:
-
- ```python
- import torch
-
- # Phase 1: Generate multiple sequences WITHOUT early stopping
- gen_config = GenerationConfig(
-     enable_conf=True,
-     enable_early_stopping=False,  # Disable online stopping
-     output_confidences=True,
-     return_dict_in_generate=True,
-     max_new_tokens=64,
- )
-
- # Expand inputs for batch generation (e.g., 8 sequences)
- num_sequences = 8
- expanded_input_ids = inputs.input_ids.repeat(num_sequences, 1)
- if 'attention_mask' in inputs and inputs.attention_mask is not None:
-     expanded_attention_mask = inputs.attention_mask.repeat(num_sequences, 1)
- else:
-     expanded_attention_mask = None
-
- # Generate batch
- outputs = model.generate(
-     input_ids=expanded_input_ids,
-     attention_mask=expanded_attention_mask,
-     generation_config=gen_config,
-     custom_generate="kashif/DeepConf"
- )
-
- # Phase 2: Post-process to analyze confidence patterns
- from custom_generate.utils import process_batch_results
-
- results = process_batch_results(
-     outputs,
-     tokenizer,
-     window_size=2048,
-     threshold=17.0
- )
-
- # Analyze results
- print(f"Generated {results['num_traces']} sequences")
- print(f"Min confidences: {results['min_confs']}")
-
- for i, trace in enumerate(results['traces']):
-     print(f"\nSequence {i+1}:")
-     print(f"  Text: {trace['text'][:100]}...")
-     print(f"  Min confidence: {trace['min_conf']:.3f}")
-     print(f"  Would stop early: {trace['stopped_early']}")
-     if trace['stopped_early']:
-         print(f"  Stop position: {trace['stop_position']}")
- ```
-
- **Use cases:**
- - Research and experimentation (try different thresholds without regenerating)
- - Batch serving (generate multiple candidates at once)
- - Analysis and voting (like the official implementation)
- - Calibration and threshold tuning
-
- **Utility Functions:**
-
- The `custom_generate/utils.py` module provides helper functions:
-
- - `process_batch_results()`: Analyze batch outputs to detect early stopping positions
- - `analyze_early_stopping()`: Calculate statistics on early stopping behavior
- - `compute_warmup_threshold()`: Derive threshold from warmup confidences
- - `extract_answer()`: Parse LaTeX `\boxed{answer}` patterns
-
- #### Complete Workflow Example (Like Official DeepConf)
-
- This demonstrates the full workflow matching the official implementation:
-
- ```python
- # Step 1: Warmup phase - generate multiple sequences
- warmup_config = GenerationConfig(
-     do_sample=True,
-     temperature=0.7,
-     max_new_tokens=64,
-     enable_conf=True,
-     enable_early_stopping=False,  # No stopping during warmup
-     output_confidences=True,
-     return_dict_in_generate=True,
- )
-
- # Expand for 8 warmup sequences
- num_warmup = 8
- expanded_ids = inputs.input_ids.repeat(num_warmup, 1)
- expanded_mask = inputs.attention_mask.repeat(num_warmup, 1) if 'attention_mask' in inputs else None
-
- warmup_outputs = model.generate(
-     input_ids=expanded_ids,
-     attention_mask=expanded_mask,
-     generation_config=warmup_config,
-     custom_generate="kashif/DeepConf"
- )
-
- # Process warmup to get min confidences
- from custom_generate.utils import process_batch_results, compute_warmup_threshold
-
- warmup_results = process_batch_results(warmup_outputs, tokenizer, window_size=10)
- print(f"Warmup min confidences: {warmup_results['min_confs']}")
-
- # Step 2: Compute threshold from warmup
- threshold = compute_warmup_threshold(
-     warmup_results['min_confs'],
-     variant="low"  # or "high"
- )
- print(f"Calibrated threshold: {threshold:.3f}")
-
- # Step 3: Final generation with calibrated threshold
- final_config = GenerationConfig(
-     enable_conf=True,
-     enable_early_stopping=True,  # Online stopping with calibrated threshold
-     threshold=threshold,
-     window_size=10,
-     max_new_tokens=128,
- )
-
- final_output = model.generate(**inputs, generation_config=final_config, custom_generate="kashif/DeepConf")
- print(tokenizer.decode(final_output.sequences[0], skip_special_tokens=True))
  ```

  ## Technical Details
@@ -318,25 +154,29 @@
  ### Confidence Calculation

  The confidence score for each generated token is calculated as follows:
-
- 1. **Extract top-k tokens**: Get the top-k (default: 20) tokens with highest probabilities from the full vocabulary
- 2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens
- 3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens
-
- This approach:
- - Uses the **full probability distribution** (before any top-k/top-p/temperature filtering)
- - Always considers a **fixed number of tokens** (conf_topk=20)
- - Naturally **includes the sampled token** if it's in the top-k

  ### Online Stopping

  The online method uses a sliding window of confidence scores:
- - Maintains a window of the last `window_size` (default: 2048) confidence scores
- - Calculates the mean confidence over this window
- - Stops generation when: `mean_confidence < threshold`

  ## Requirements

  - PyTorch >= 1.13.0
  - Transformers >= 4.35.0
  ---
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
  tags:
  - custom_generate
  - sampling
  ---
 
 
  # DeepCONF Custom Generation Strategy

+ This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, as presented in the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).
+
+ - **Project Page:** [https://jiaweizzhao.github.io/deepconf](https://jiaweizzhao.github.io/deepconf)
+ - **GitHub Repository:** [https://github.com/facebookresearch/deepconf](https://github.com/facebookresearch/deepconf)

  ## Overview

+ DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It monitors the confidence of generated tokens and stops generation when confidence falls below a threshold. The confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling/filtering is applied), following the methodology from the official implementation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks.
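For orientation, a minimal usage sketch with a fixed stopping threshold; it assumes a loaded `model`, `tokenizer`, and tokenized `inputs`, and the parameter values shown are illustrative rather than prescriptive:

```python
from transformers import GenerationConfig

# Online early stopping with a fixed threshold (values are illustrative)
gen_config = GenerationConfig(
    do_sample=True,
    max_new_tokens=512,
    enable_conf=True,             # enable DeepConf confidence tracking
    threshold=17.0,               # stop when the windowed mean confidence drops below this
    window_size=2048,             # sliding-window length for the running mean
    return_dict_in_generate=True,
    output_confidences=True,      # return per-step confidence scores
)

outputs = model.generate(
    **inputs,
    generation_config=gen_config,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```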

  ## Parameters

  ### Calibration (DeepConf-low/high)

+ DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior.

  **Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:

  ```python
  from transformers import GenerationConfig

  # Configure warmup generation
  warmup_cfg = GenerationConfig(
      do_sample=True,

      return_dict_in_generate=True,
      output_confidences=True,
      num_return_sequences=8,  # Generate 8 warmup sequences
  )

  # Generate warmup sequences
  # Configure production generation with calibrated threshold
  gen_cfg = GenerationConfig(
      do_sample=True,
      max_new_tokens=512,
      enable_conf=True,
      return_dict_in_generate=True,
      output_confidences=True,

      # Automatic threshold calibration
+     deepconf_variant="low",  # "low" (aggressive) or "high" (permissive)
      deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
  )

  # Generate with calibrated threshold
  outputs = model.generate(
      **inputs,
      generation_config=gen_cfg,
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
  ```
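The warmup steps elided between these two blocks collect `warmup_C` and derive the threshold. Per the variant semantics documented elsewhere in this repository, `low` (eta=0.1) corresponds to a 90th-percentile threshold over the warmup minimum confidences (aggressive stopping) and `high` (eta=0.9) to a 10th-percentile threshold (permissive). A minimal sketch of that derivation; the `warmup_outputs` variable, the stacking of `confidences` into a `(num_sequences, num_steps)` tensor, and the helper name are assumptions for illustration:

```python
import numpy as np
import torch

# Assumption: the warmup generate call returned `warmup_outputs`, whose
# `confidences` is a per-step sequence of tensors of shape (num_sequences,)
confs = torch.stack(list(warmup_outputs.confidences), dim=1)  # (num_sequences, num_steps)
warmup_C = confs.min(dim=1).values.tolist()                   # per-sequence minimum confidence

def derive_threshold(warmup_min_confs, variant="low"):
    # "low" (eta=0.1): 90th percentile -> aggressive early stopping
    # "high" (eta=0.9): 10th percentile -> permissive, longer generations
    pct = 90 if variant == "low" else 10
    return float(np.percentile(warmup_min_confs, pct))

print(f"Calibrated threshold: {derive_threshold(warmup_C, 'low'):.3f}")
```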

  ## Technical Details

  ### Confidence Calculation

  The confidence score for each generated token is calculated as follows:
+ 1. **Extract top-k tokens**: Get the top-k (default: 20) tokens with highest probabilities from the full vocabulary.
+ 2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens.
+ 3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens.
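A minimal sketch of this computation for a single decoding step, assuming raw logits over the full vocabulary; the function name is illustrative:

```python
import torch

def token_confidence(logits: torch.Tensor, conf_topk: int = 20) -> torch.Tensor:
    # logits: (batch, vocab_size) raw scores for the current step,
    # taken before any top-k/top-p/temperature filtering
    log_probs = torch.log_softmax(logits, dim=-1)
    topk_log_probs = log_probs.topk(conf_topk, dim=-1).values
    return -topk_log_probs.mean(dim=-1)  # higher score = more confident
```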
  ### Online Stopping

  The online method uses a sliding window of confidence scores:
+ - Maintains a window of the last `window_size` (default: 2048) confidence scores.
+ - Calculates the mean confidence over this window.
+ - Stops generation when `mean_confidence < threshold`.
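A minimal sketch of the stopping check, reusing the `token_confidence` helper sketched above; the decoding-loop usage in the comments is illustrative:

```python
from collections import deque

window = deque(maxlen=2048)  # holds the last window_size confidence scores

def should_stop(window, threshold=17.0):
    # Mean confidence over the sliding window; stop once it falls below threshold
    return len(window) > 0 and sum(window) / len(window) < threshold

# Inside the decoding loop (illustrative):
#   window.append(token_confidence(step_logits).item())
#   if should_stop(window, threshold):
#       break
```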

  ## Requirements

  - PyTorch >= 1.13.0
  - Transformers >= 4.35.0

+ ## Citation
+
+ ```bibtex
+ @article{fu2025deep,
+   title={Deep Think with Confidence},
+   author={Fu, Yichao and Wang, Xuewei and Tian, Yuandong and Zhao, Jiawei},
+   journal={arXiv preprint arXiv:2508.15260},
+   year={2025}
+ }
+ ```