---
license: apache-2.0
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- lora
- peft
- ifeval
- commoneval
- wildvoice
- voicebench
- fine-tuned
---

# Qwen2.5-0.5B Text Classification Model

This model is a LoRA (Low-Rank Adaptation) fine-tune of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) for text classification. It was trained to sort text into three categories modeled on VoiceBench dataset patterns.

## Model Description

The model has been trained to classify text into three distinct categories:

- **ifeval**: Instruction-following tasks with specific formatting requirements and step-by-step instructions
- **commoneval**: Factual questions and knowledge-based queries requiring direct answers
- **wildvoice**: Conversational, informal language patterns and natural dialogue

## Performance Results

### Overall Performance

- **Overall Accuracy**: **93.33%** (28/30 correct predictions)
- **Training Method**: LoRA (Low-Rank Adaptation)
- **Trainable Parameters**: 0.88% of total parameters (4,399,104 out of 498,431,872)

### Per-Category Performance

| Category | Accuracy | Correct/Total | Description |
|----------|----------|---------------|-------------|
| **ifeval** | **100%** | 10/10 | All instruction-following examples classified correctly |
| **commoneval** | **80%** | 8/10 | Two factual questions misclassified |
| **wildvoice** | **100%** | 10/10 | All conversational examples classified correctly |

### Confusion Matrix

```
ifeval:
  -> ifeval: 10
commoneval:
  -> commoneval: 8
  -> unknown: 1
  -> wildvoice: 1
wildvoice:
  -> wildvoice: 10
```
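
The accuracy figures reported in this section follow directly from the confusion matrix. A minimal sketch of that arithmetic, reading each top-level entry as the true label and each arrow as a predicted label; the dictionary layout and function names are illustrative and not part of the released code:

```python
# Confusion matrix from this section: true label -> {predicted label: count}
confusion = {
    "ifeval": {"ifeval": 10},
    "commoneval": {"commoneval": 8, "unknown": 1, "wildvoice": 1},
    "wildvoice": {"wildvoice": 10},
}


def per_class_accuracy(matrix):
    """Fraction of each true label's examples that were predicted correctly."""
    return {
        true: preds.get(true, 0) / sum(preds.values())
        for true, preds in matrix.items()
    }


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(preds.get(true, 0) for true, preds in matrix.items())
    total = sum(sum(preds.values()) for preds in matrix.values())
    return correct / total


print(per_class_accuracy(confusion))
print(f"{overall_accuracy(confusion):.2%}")  # 28 correct out of 30
```

With only 10 examples per category, a single misclassification moves a category's accuracy by 10 percentage points, so the per-category numbers are coarse.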

## Development Journey & Methods Tried

### Initial Challenges

We started with several approaches that didn't work well:

1. **GRPO (Group Relative Policy Optimization)**: Initial attempts with GRPO training performed poorly
   - Loss decreased, but the model wasn't learning the classification task
   - Model generated irrelevant responses like "unknown", "txt", "com"
   - Overall accuracy: ~20%, below the ~33% expected from guessing among three balanced classes

2. **Full Fine-tuning**: Attempted full fine-tuning of larger models
   - CUDA out-of-memory errors with larger models
   - Numerical instability with certain model architectures
   - Poor convergence on the classification task

3. **Complex Prompt Formats**: Tried various complex prompt structures
   - "Classify this text as ifeval, commoneval, or wildvoice: ..."
   - Model struggled with complex instructions
   - Generated explanations instead of simple labels

### Breakthrough: Direct Classification Approach

The key breakthrough came with a **direct, simple approach**:

#### 1. **Simplified Prompt Format**
Instead of complex classification prompts, we used a simple format:
```
Text: {input_text}
Label: {expected_label}
```
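
As a rough illustration of how this format can be applied, the sketch below builds both the training string and the inference prompt from the same template. The helper names `build_prompt` and `build_training_example` are hypothetical, and whether the original training strings also ended with an end-of-sequence token is not stated here, so none is appended:

```python
LABELS = ("ifeval", "commoneval", "wildvoice")


def build_prompt(text):
    """Inference prompt: stops at 'Label:' so the model supplies the label."""
    return f"Text: {text}\nLabel:"


def build_training_example(text, label):
    """Training string: the inference prompt followed by the expected label."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return f"{build_prompt(text)} {label}"


print(build_training_example("What is the capital of France?", "commoneval"))
```

Keeping one template for both training and inference means the model always sees the label at the same position, directly after `Label:`.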

#### 2. **LoRA (Low-Rank Adaptation)**
- Used the PEFT library for efficient fine-tuning
- Only trained 0.88% of parameters
- Much more stable than full fine-tuning
- Faster training than full fine-tuning; the merged model has the same architecture as the base model, so inference cost is unchanged

#### 3. **Focused Training Data**
Created clear, distinct examples for each category:
- **ifeval**: Instruction-following with specific formatting requirements
- **commoneval**: Factual questions requiring direct answers
- **wildvoice**: Conversational, informal language patterns

#### 4. **Optimal Hyperparameters**
- **Learning Rate**: 5e-4 (higher than initial attempts)
- **Batch Size**: 2 (smaller for stability)
- **Max Length**: 128 (shorter sequences)
- **Training Steps**: 150
- **LoRA Rank**: 8 (focused learning)

## Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
tokenizer = AutoTokenizer.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
model.eval()

def classify_text(text):
    # Same "Text: ... Label:" format used during training
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=15,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Test examples (sampling is enabled, so outputs can occasionally differ)
print(classify_text("Follow these instructions exactly: Write 3 sentences about cats."))
# Expected output: ifeval

print(classify_text("What is the capital of France?"))
# Expected output: commoneval

print(classify_text("Hey, how are you doing today?"))
# Expected output: wildvoice
```

### Advanced Usage with Confidence Scoring
```python
from collections import Counter

LABELS = ("ifeval", "commoneval", "wildvoice")

def classify_with_confidence(text, num_samples=5):
    """Sample several generations and return the majority label with its vote share.

    Reuses `model`, `tokenizer`, and `torch` from the Basic Usage example.
    """
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    predictions = []

    for _ in range(num_samples):
        with torch.no_grad():
            generated = model.generate(
                **inputs,
                max_new_tokens=15,
                do_sample=True,
                temperature=0.3,  # Slightly higher for diversity
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )

        # Decode only the newly generated tokens
        response = tokenizer.decode(generated[0][prompt_len:], skip_special_tokens=True)
        prediction = response.strip().lower()

        # Clean up prediction: first known label found as a substring, else "unknown"
        prediction = next((label for label in LABELS if label in prediction), "unknown")
        predictions.append(prediction)

    # "Confidence" is the fraction of samples that agree with the majority label
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)

# Example with confidence
label, confidence = classify_with_confidence("Please follow these steps: 1) Read 2) Think 3) Write")
print(f"Prediction: {label}, Confidence: {confidence:.2%}")
```

## Training Details

### Model Architecture
- **Base Model**: Qwen/Qwen2.5-0.5B-Instruct
- **Parameters**: 498,431,872 total, 4,399,104 trainable (0.88%)
- **Precision**: FP16 (mixed precision)
- **Device**: CUDA (GPU accelerated)

### Training Configuration
```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,            # Rank
    lora_alpha=16,  # LoRA alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Sequence length cap, applied when tokenizing
# (max_length is not a TrainingArguments field)
MAX_LENGTH = 128

# Training Arguments
training_args = TrainingArguments(
    output_dir="./lora-classifier",  # placeholder path, not from the original run
    learning_rate=5e-4,
    per_device_train_batch_size=2,
    max_steps=150,
    fp16=True,
    gradient_accumulation_steps=1,
    warmup_steps=20,
    weight_decay=0.01,
    max_grad_norm=1.0,
)
```
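
The trainable-parameter count can be checked by hand from the LoRA configuration: each adapted linear layer adds two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below is a back-of-the-envelope check, not output from the training run. The layer dimensions (24 layers, hidden size 896, key/value projection width 128, MLP width 4864) are assumptions about the Qwen2.5-0.5B architecture and should be verified against the model's `config.json`:

```python
RANK = 8             # r from the LoRA configuration
NUM_LAYERS = 24      # assumed number of transformer layers
HIDDEN = 896         # assumed hidden size
KV_DIM = 128         # assumed key/value projection width (grouped-query attention)
INTERMEDIATE = 4864  # assumed MLP intermediate size

# (in_features, out_features) for each targeted projection in one layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) per adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
total = 498_431_872  # total reported in this README
print(f"{trainable:,} trainable ({trainable / total:.2%} of total)")
```

Under these assumed dimensions the count comes to 4,399,104, which agrees with the trainable-parameter figure reported under Model Architecture.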

### Dataset
The model was trained on synthetic data representing three text categories:
- **60 total samples** (20 per category)
- **ifeval**: Instruction-following tasks with specific formatting requirements
- **commoneval**: Factual questions and knowledge-based queries
- **wildvoice**: Conversational, informal language patterns

At a batch size of 2 with no gradient accumulation, 150 training steps cover 300 examples, which is roughly five passes over the 60 samples (assuming a single device).

## Error Analysis

### Failed Predictions (2 out of 30)
1. **"What is 2 plus 2?"** → Predicted: `unknown` (Expected: `commoneval`)
   - Model generated: `#eval{1} Label: #eval{2} Label: #`
   - Issue: Model generated code-like syntax instead of a simple label

2. **"What is the opposite of hot?"** → Predicted: `wildvoice` (Expected: `commoneval`)
   - Model generated: `#wildvoice:comoneval:hot:yourresponse:whatis`
   - Issue: Model generated a complex response instead of a simple label
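
Both failures are consistent with substring-based label cleanup of the kind used in the Advanced Usage example. The sketch below assumes the evaluation applied that same cleanup (this is not confirmed here) and runs it on the two raw generations quoted above; `normalize_label` is a hypothetical helper name:

```python
LABELS = ("ifeval", "commoneval", "wildvoice")


def normalize_label(raw_output):
    """Return the first known label found as a substring, else 'unknown'."""
    text = raw_output.strip().lower()
    for label in LABELS:
        if label in text:
            return label
    return "unknown"


# Raw generations quoted in the two failed predictions
failed_outputs = [
    "#eval{1} Label: #eval{2} Label: #",
    "#wildvoice:comoneval:hot:yourresponse:whatis",
]
for raw in failed_outputs:
    print(repr(raw), "->", normalize_label(raw))
```

The first output contains no complete label, so it falls through to `unknown`. The second contains `wildvoice` alongside a misspelled `comoneval`, so only `wildvoice` matches.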

### Success Factors
- **Simple prompt format** was crucial for success
- **LoRA fine-tuning** provided stable training
- **Focused training data** with clear category distinctions
- **Appropriate hyperparameters** (learning rate, batch size, etc.)

## Technical Implementation

### File Structure
```
merged_classification_model/
├── README.md                  # This file
├── config.json                # Model configuration
├── generation_config.json     # Generation settings
├── model.safetensors          # Model weights (988MB)
├── tokenizer.json             # Tokenizer vocabulary
├── tokenizer_config.json      # Tokenizer configuration
├── special_tokens_map.json    # Special tokens mapping
├── added_tokens.json          # Added tokens
├── merges.txt                 # BPE merges
├── vocab.json                 # Vocabulary
└── chat_template.jinja        # Chat template
```

### Dependencies
```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "transformers>=4.56.0"
pip install "torch>=2.0.0"
pip install "peft>=0.17.0"
pip install "accelerate>=0.21.0"
```

## Use Cases

This model is particularly useful for:
- **Text categorization** in educational platforms
- **Content filtering** based on text type
- **Dataset preprocessing** for machine learning pipelines
- **VoiceBench-style evaluation** systems
- **Instruction following detection** in AI systems
- **Conversational vs. factual text separation**

## Limitations

1. **Small, Synthetic Dataset**: The model was trained on 60 synthetic samples and evaluated on 30 examples, so the reported accuracy is a rough estimate and may not carry over to real-world text
2. **Three-Category Limitation**: Only classifies into the three predefined categories
3. **Prompt Sensitivity**: Performance may vary with prompt formats other than `Text: ...` / `Label:`
4. **Edge Cases**: Some inputs, such as the arithmetic question "What is 2 plus 2?", may be misclassified or yield malformed output
5. **Language**: Primarily trained on English text

## Future Improvements

1. **Larger Training Dataset**: Use real VoiceBench data with proper audio transcription
2. **More Categories**: Expand to include additional text types
3. **Multilingual Support**: Train on multiple languages
4. **Confidence Calibration**: Improve confidence scoring
5. **Few-shot Learning**: Add support for few-shot classification

## Citation

```bibtex
@misc{qwen2.5-0.5b-text-classification,
  title={Qwen2.5-0.5B Text Classification Model for VoiceBench-style Evaluation},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/manbeast3b/qwen2.5-0.5b-text-classification}},
  note={Fine-tuned using LoRA on synthetic text classification data}
}
```

## Contributing

Contributions are welcome! Please feel free to:
- Report issues with the model
- Suggest improvements
- Submit pull requests
- Share your use cases

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.

---

**Model Performance Summary:**
- ✅ **93.33% Overall Accuracy** (28/30)
- ✅ **100% ifeval accuracy** (instruction-following)
- ✅ **100% wildvoice accuracy** (conversational)
- ✅ **80% commoneval accuracy** (factual questions)
- ✅ **Efficient LoRA fine-tuning** (0.88% trainable parameters)
- ✅ **Fast inference** with small model size
- ✅ **Easy to use** with simple API

*This model applies LoRA fine-tuning to text classification, reaching 93.33% accuracy on a 30-example test set while training only 0.88% of the parameters.*