---
license: apache-2.0
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- lora
- peft
- ifeval
- commoneval
- wildvoice
- voicebench
- fine-tuned
---
# Qwen2.5-0.5B Text Classification Model
This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using LoRA (Low-Rank Adaptation) for text classification tasks. The model has been specifically trained to classify text into three categories based on VoiceBench dataset patterns.
## 🎯 Model Description
The model has been trained to classify text into three distinct categories:
- **ifeval**: Instruction-following tasks with specific formatting requirements and step-by-step instructions
- **commoneval**: Factual questions and knowledge-based queries requiring direct answers
- **wildvoice**: Conversational, informal language patterns and natural dialogue
## 📊 Performance Results
### Overall Performance
- **Overall Accuracy**: **93.33%** (28/30 correct predictions)
- **Training Method**: LoRA (Low-Rank Adaptation)
- **Trainable Parameters**: 0.88% of total parameters (4,399,104 out of 498,431,872)
### Per-Category Performance
| Category | Accuracy | Correct/Total | Description |
|----------|----------|---------------|-------------|
| **ifeval** | **100%** | 10/10 | Perfect performance on instruction-following tasks |
| **commoneval** | **80%** | 8/10 | Good performance on factual questions |
| **wildvoice** | **100%** | 10/10 | Perfect performance on conversational text |
### Confusion Matrix
```
ifeval:
  -> ifeval: 10
commoneval:
  -> commoneval: 8
  -> unknown: 1
  -> wildvoice: 1
wildvoice:
  -> wildvoice: 10
```
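The headline figures follow directly from this confusion matrix. As a quick check, the sketch below recomputes per-category and overall accuracy. The dictionary layout and variable names are illustrative; the counts are copied from the matrix.

```python
# Confusion counts copied from the matrix: expected label -> {predicted label: count}
confusion = {
    "ifeval": {"ifeval": 10},
    "commoneval": {"commoneval": 8, "unknown": 1, "wildvoice": 1},
    "wildvoice": {"wildvoice": 10},
}

# (correct, total) per expected category
per_category = {}
for expected, predicted_counts in confusion.items():
    total = sum(predicted_counts.values())
    correct = predicted_counts.get(expected, 0)
    per_category[expected] = (correct, total)

correct_all = sum(correct for correct, _ in per_category.values())
total_all = sum(total for _, total in per_category.values())
overall_accuracy = correct_all / total_all

for label, (correct, total) in per_category.items():
    print(f"{label}: {correct}/{total} = {correct / total:.0%}")
print(f"overall: {correct_all}/{total_all} = {overall_accuracy:.2%}")
```

With only 10 test items per category, each misclassification moves a category's accuracy by 10 percentage points, so these figures are best read as a coarse indication.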
## 🔬 Development Journey & Methods Tried
### Initial Challenges
We started with several approaches that didn't work well:
1. **GRPO (Group Relative Policy Optimization)**: Initial attempts with GRPO training showed poor performance
   - Loss decreased but the model wasn't learning classification
   - Model generated irrelevant responses like "unknown", "txt", "com"
   - Overall accuracy: ~20%
2. **Full Fine-tuning**: Attempted full fine-tuning of larger models
   - CUDA out-of-memory issues with larger models
   - Numerical instability with certain model architectures
   - Poor convergence on the classification task
3. **Complex Prompt Formats**: Tried various complex prompt structures
   - "Classify this text as ifeval, commoneval, or wildvoice: ..."
   - Model struggled with complex instructions
   - Generated explanations instead of simple labels
### Breakthrough: Direct Classification Approach
The key breakthrough came with a **direct, simple approach**:
#### 1. **Simplified Prompt Format**
Instead of complex classification prompts, we used a simple format:
```
Text: {input_text}
Label: {expected_label}
```
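A minimal sketch of how this format might be applied, assuming training strings include the label and inference prompts stop right after `Label:` (as the usage code in this card does). The helper names `build_inference_prompt` and `build_training_example` are hypothetical and not part of the released code. Whether an end-of-sequence token was appended during training is not stated here.

```python
def build_inference_prompt(text: str) -> str:
    """Prompt given to the model at inference time; the label is left open."""
    return f"Text: {text}\nLabel:"


def build_training_example(text: str, label: str) -> str:
    """Full training string: the inference prompt followed by the expected label."""
    return f"{build_inference_prompt(text)} {label}"


example = build_training_example("What is the capital of France?", "commoneval")
print(example)
```

Sharing one prompt builder between training and inference keeps the two formats identical, which matters given the prompt sensitivity noted under Limitations.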
#### 2. **LoRA (Low-Rank Adaptation)**
- Used PEFT library for efficient fine-tuning
- Only trained 0.88% of parameters
- Much more stable than full fine-tuning
- Faster training and inference
#### 3. **Focused Training Data**
Created clear, distinct examples for each category:
- **ifeval**: Instruction-following with specific formatting requirements
- **commoneval**: Factual questions requiring direct answers
- **wildvoice**: Conversational, informal language patterns
#### 4. **Optimal Hyperparameters**
- **Learning Rate**: 5e-4 (higher than initial attempts)
- **Batch Size**: 2 (smaller for stability)
- **Max Length**: 128 (shorter sequences)
- **Training Steps**: 150
- **LoRA Rank**: 8 (focused learning)
## 🚀 Usage
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model_id = "manbeast3b/qwen2.5-0.5b-text-classification"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()


def classify_text(text):
    """Return the raw label text the model generates for `text`."""
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=15,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Test examples (sampling is enabled, so outputs can vary between runs)
print(classify_text("Follow these instructions exactly: Write 3 sentences about cats."))
# Expected: ifeval
print(classify_text("What is the capital of France?"))
# Expected: commoneval
print(classify_text("Hey, how are you doing today?"))
# Expected: wildvoice
```
### Advanced Usage with Confidence Scoring
```python
from collections import Counter

LABELS = ("ifeval", "commoneval", "wildvoice")


def classify_with_confidence(text, num_samples=5):
    """Sample several generations and return (majority label, vote share).

    Reuses `model`, `tokenizer` and `torch` from the Basic Usage snippet.
    """
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    predictions = []
    for _ in range(num_samples):
        with torch.no_grad():
            generated = model.generate(
                **inputs,
                max_new_tokens=15,
                do_sample=True,
                temperature=0.3,  # Slightly higher for diversity
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        raw = tokenizer.decode(generated[0][prompt_len:], skip_special_tokens=True)
        raw = raw.strip().lower()
        # Clean up prediction: check the labels in a fixed order and keep the
        # first one that appears as a substring, otherwise "unknown"
        prediction = next((label for label in LABELS if label in raw), "unknown")
        predictions.append(prediction)
    # Confidence = share of samples that agree with the majority label
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)


# Example with confidence
label, confidence = classify_with_confidence("Please follow these steps: 1) Read 2) Think 3) Write")
print(f"Prediction: {label}, Confidence: {confidence:.2%}")
```
## 📈 Training Details
### Model Architecture
- **Base Model**: Qwen/Qwen2.5-0.5B-Instruct
- **Parameters**: 498,431,872 total, 4,399,104 trainable (0.88%)
- **Precision**: FP16 (mixed precision)
- **Device**: CUDA (GPU accelerated)
### Training Configuration
```python
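from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank
    lora_alpha=16,  # LoRA alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Sequence length is not a TrainingArguments field; apply it at tokenization,
# e.g. tokenizer(..., truncation=True, max_length=MAX_LENGTH)
MAX_LENGTH = 128

# Training Arguments
training_args = TrainingArguments(
    output_dir="./outputs",  # placeholder path, not taken from the original run
    learning_rate=5e-4,
    per_device_train_batch_size=2,
    max_steps=150,
    fp16=True,
    gradient_accumulation_steps=1,
    warmup_steps=20,
    weight_decay=0.01,
    max_grad_norm=1.0,
)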
```
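The 4,399,104 trainable parameters reported under Model Architecture can be reproduced from this LoRA configuration. Each adapted linear layer with input width `d_in` and output width `d_out` gains `r * (d_in + d_out)` parameters. The sketch below assumes the layer dimensions of the base model's published configuration (hidden size 896, MLP width 4864, 24 layers, 2 key/value heads of dimension 64). Those values are not stated in this card.

```python
# Dimensions assumed from the Qwen2.5-0.5B config; not stated in this model card.
HIDDEN = 896
INTERMEDIATE = 4864
NUM_LAYERS = 24
KV_DIM = 2 * 64  # num_key_value_heads * head_dim
RANK = 8

# (d_in, d_out) of every linear layer listed in target_modules
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable = per_layer * NUM_LAYERS
total = 498_431_872  # total parameter count reported in this card

print(f"per layer: {per_layer:,}")
print(f"trainable: {trainable:,} ({trainable / total:.2%} of total)")
```

Under these assumptions the count matches the reported figure exactly, which suggests the adapters cover all seven projection types in every transformer layer.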
### Dataset
The model was trained on synthetic data representing three text categories:
- **60 total samples** (20 per category)
- **ifeval**: Instruction-following tasks with specific formatting requirements
- **commoneval**: Factual questions and knowledge-based queries
- **wildvoice**: Conversational, informal language patterns
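Combined with the training configuration, this dataset size implies how many passes the model made over the data. A rough calculation, assuming a single GPU and that all 60 samples were used for training (neither is stated explicitly in this card):

```python
num_samples = 60            # 20 per category
per_device_batch_size = 2
gradient_accumulation = 1
max_steps = 150
num_devices = 1             # assumption: single-GPU run

effective_batch = per_device_batch_size * gradient_accumulation * num_devices
samples_seen = max_steps * effective_batch
steps_per_epoch = num_samples // effective_batch
epochs = samples_seen / num_samples

print(f"steps per epoch: {steps_per_epoch}")
print(f"samples seen: {samples_seen}")
print(f"approximate epochs: {epochs:.1f}")
```

Under these assumptions the 150-step run covers the dataset about five times.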
## 🔍 Error Analysis
### Failed Predictions (2 out of 30)
1. **"What is 2 plus 2?"** → Predicted: `unknown` (Expected: `commoneval`)
   - Model generated: `#eval{1} Label: #eval{2} Label: #`
   - Issue: Model generated code-like syntax instead of a simple label
2. **"What is the opposite of hot?"** → Predicted: `wildvoice` (Expected: `commoneval`)
   - Model generated: `#wildvoice:comoneval:hot:yourresponse:whatis`
   - Issue: Model generated a complex response instead of a simple label
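Both failures can be traced through the substring-based cleanup used in the confidence-scoring example. The sketch below applies the same rules to the two raw generations quoted above. `normalize_label` is a hypothetical helper name, and the sketch assumes the evaluation used this same cleanup.

```python
LABELS = ("ifeval", "commoneval", "wildvoice")


def normalize_label(raw_output: str) -> str:
    """Check labels in a fixed order; return the first one found, else 'unknown'."""
    text = raw_output.strip().lower()
    for label in LABELS:
        if label in text:
            return label
    return "unknown"


# Raw generations quoted in the error analysis
raw_outputs = {
    "What is 2 plus 2?": "#eval{1} Label: #eval{2} Label: #",
    "What is the opposite of hot?": "#wildvoice:comoneval:hot:yourresponse:whatis",
}

for question, raw in raw_outputs.items():
    print(f"{question!r} -> {normalize_label(raw)}")
```

The first output contains none of the three labels, so it falls through to `unknown`. The second contains `comoneval` (one `m`), so only `wildvoice` matches.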
### Success Factors
- **Simple prompt format** was crucial for success
- **LoRA fine-tuning** provided stable training
- **Focused training data** with clear category distinctions
- **Appropriate hyperparameters** (learning rate, batch size, etc.)
## 🛠️ Technical Implementation
### Files Structure
```
merged_classification_model/
├── README.md                 # This file
├── config.json               # Model configuration
├── generation_config.json    # Generation settings
├── model.safetensors         # Model weights (988MB)
├── tokenizer.json            # Tokenizer vocabulary
├── tokenizer_config.json     # Tokenizer configuration
├── special_tokens_map.json   # Special tokens mapping
├── added_tokens.json         # Added tokens
├── merges.txt                # BPE merges
├── vocab.json                # Vocabulary
└── chat_template.jinja       # Chat template
```
### Dependencies
```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "transformers>=4.56.0"
pip install "torch>=2.0.0"
pip install "peft>=0.17.0"
pip install "accelerate>=0.21.0"
```
## 🎯 Use Cases
This model is particularly useful for:
- **Text categorization** in educational platforms
- **Content filtering** based on text type
- **Dataset preprocessing** for machine learning pipelines
- **VoiceBench-style evaluation** systems
- **Instruction following detection** in AI systems
- **Conversational vs. factual text separation**
## ⚠️ Limitations
1. **Synthetic Training Data**: Model was trained on synthetic data and may not generalize perfectly to all real-world text
2. **Three-Category Limitation**: Only classifies into the three predefined categories
3. **Prompt Sensitivity**: Performance may vary with different prompt formats
4. **Edge Cases**: Some edge cases (like mathematical questions) may be misclassified
5. **Language**: Primarily trained on English text
## 🔮 Future Improvements
1. **Larger Training Dataset**: Use real VoiceBench data with proper audio transcription
2. **More Categories**: Expand to include additional text types
3. **Multilingual Support**: Train on multiple languages
4. **Confidence Calibration**: Improve confidence scoring
5. **Few-shot Learning**: Add support for few-shot classification
## 📚 Citation
```bibtex
@misc{qwen2.5-0.5b-text-classification,
title={Qwen2.5-0.5B Text Classification Model for VoiceBench-style Evaluation},
author={Your Name},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/manbeast3b/qwen2.5-0.5b-text-classification}},
note={Fine-tuned using LoRA on synthetic text classification data}
}
```
## 🤝 Contributing
Contributions are welcome! Please feel free to:
- Report issues with the model
- Suggest improvements
- Submit pull requests
- Share your use cases
## 📄 License
This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.
---
**Model Performance Summary:**
- ✅ **93.33% Overall Accuracy**
- ✅ **100% ifeval accuracy** (instruction-following)
- ✅ **100% wildvoice accuracy** (conversational)
- ✅ **80% commoneval accuracy** (factual questions)
- ✅ **Efficient LoRA fine-tuning** (0.88% trainable parameters)
- ✅ **Fast inference** with small model size
- ✅ **Easy to use** with simple API
*This model represents a successful application of LoRA fine-tuning for text classification, achieving high accuracy with minimal computational resources.*