	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	tags:
	- text-classification
	- lora
	- peft
	- ifeval
	- commoneval
	- wildvoice
	- voicebench
	- fine-tuned
	---

	# Qwen2.5-0.5B Text Classification Model

	This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using LoRA (Low-Rank Adaptation) for text classification tasks. The model has been specifically trained to classify text into three categories based on VoiceBench dataset patterns.

	## 🎯 Model Description

	The model has been trained to classify text into three distinct categories:
	- ifeval: Instruction-following tasks with specific formatting requirements and step-by-step instructions
	- commoneval: Factual questions and knowledge-based queries requiring direct answers
	- wildvoice: Conversational, informal language patterns and natural dialogue

	## 📊 Performance Results

	### Overall Performance
	- Overall Accuracy: 93.33% (28/30 correct predictions)
	- Training Method: LoRA (Low-Rank Adaptation)
	- Trainable Parameters: 0.88% of total parameters (4,399,104 out of 498,431,872)

	### Per-Category Performance
	| Category | Accuracy | Correct/Total | Description |
	|----------|----------|---------------|-------------|
	| ifeval | 100% | 10/10 | Perfect performance on instruction-following tasks |
	| commoneval | 80% | 8/10 | Good performance on factual questions |
	| wildvoice | 100% | 10/10 | Perfect performance on conversational text |

	### Confusion Matrix
	```
	ifeval:
	-> ifeval: 10
	commoneval:
	-> commoneval: 8
	-> unknown: 1
	-> wildvoice: 1
	wildvoice:
	-> wildvoice: 10
	```
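
The reported accuracies follow directly from the confusion matrix above. A minimal sketch that recomputes them, with the matrix transcribed by hand into a dictionary (the variable and function names are illustrative and not part of the original evaluation code):

```python
# Confusion matrix transcribed from this README: true label -> {predicted label: count}
confusion = {
    "ifeval": {"ifeval": 10},
    "commoneval": {"commoneval": 8, "unknown": 1, "wildvoice": 1},
    "wildvoice": {"wildvoice": 10},
}


def per_category_accuracy(matrix):
    """Return {label: fraction of that label's examples predicted correctly}."""
    return {
        label: row.get(label, 0) / sum(row.values())
        for label, row in matrix.items()
    }


def overall_accuracy(matrix):
    """Return correct predictions divided by all predictions."""
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


for label, acc in per_category_accuracy(confusion).items():
    print(f"{label}: {acc:.0%}")
print(f"overall: {overall_accuracy(confusion):.2%}")  # 28/30 -> 93.33%
```

With only 10 examples per category, each misclassification moves a per-category accuracy by 10 percentage points, so these figures are coarse estimates.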

	## 🔬 Development Journey & Methods Tried

	### Initial Challenges
	We started with several approaches that didn't work well:

	1. GRPO (Group Relative Policy Optimization): Initial attempts with GRPO training performed poorly
	   - The loss decreased, but the model did not learn the classification task
	   - The model generated irrelevant responses such as "unknown", "txt", "com"
	   - Overall accuracy: ~20%

	2. Full Fine-tuning: Attempted full fine-tuning of larger models
	   - CUDA out-of-memory errors with larger models
	   - Numerical instability with certain model architectures
	   - Poor convergence on the classification task

	3. Complex Prompt Formats: Tried various complex prompt structures
	   - "Classify this text as ifeval, commoneval, or wildvoice: ..."
	   - The model struggled with complex instructions
	   - It generated explanations instead of simple labels

	### Breakthrough: Direct Classification Approach

	The key breakthrough came with a direct, simple approach:

	#### 1. Simplified Prompt Format
	Instead of complex classification prompts, we used a simple format:
	```
	Text: {input_text}
	Label: {expected_label}
	```
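
The format above can be sketched as two small helpers: one that builds a full training example and one that builds the matching inference prompt, which stops right after `Label:` so the model completes it. The helper names and the label validation are assumptions for illustration, and whether an end-of-sequence token was appended during training is not stated here.

```python
LABELS = ("ifeval", "commoneval", "wildvoice")


def build_training_example(text: str, label: str) -> str:
    """Format one supervised example as 'Text: ...' followed by 'Label: ...'."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return f"Text: {text}\nLabel: {label}"


def build_inference_prompt(text: str) -> str:
    """Format the prompt used at inference; the model generates the label."""
    return f"Text: {text}\nLabel:"


example = build_training_example("What is the capital of France?", "commoneval")
print(example)
```

Keeping the inference prompt a strict prefix of the training example means the first generated tokens line up with the label position the model saw during training.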

	#### 2. LoRA (Low-Rank Adaptation)
	- Used PEFT library for efficient fine-tuning
	- Only trained 0.88% of parameters
	- Much more stable than full fine-tuning
	- Faster training than full fine-tuning; once the adapter is merged into the base weights, inference cost matches the base model

	#### 3. Focused Training Data
	Created clear, distinct examples for each category:
	- ifeval: Instruction-following with specific formatting requirements
	- commoneval: Factual questions requiring direct answers
	- wildvoice: Conversational, informal language patterns

	#### 4. Optimal Hyperparameters
	- Learning Rate: 5e-4 (higher than initial attempts)
	- Batch Size: 2 (smaller for stability)
	- Max Length: 128 (shorter sequences)
	- Training Steps: 150
	- LoRA Rank: 8 (focused learning)

	## 🚀 Usage

	### Basic Usage
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Load the model and tokenizer
	model = AutoModelForCausalLM.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
	tokenizer = AutoTokenizer.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
	model.eval()

	def classify_text(text):
	    """Return the text the model generates after the "Label:" prompt."""
	    prompt = f"Text: {text}\nLabel:"
	    inputs = tokenizer(prompt, return_tensors="pt")

	    with torch.no_grad():
	        generated = model.generate(
	            **inputs,
	            max_new_tokens=15,
	            do_sample=True,
	            temperature=0.1,
	            top_p=0.9,
	            pad_token_id=tokenizer.eos_token_id,
	            eos_token_id=tokenizer.eos_token_id,
	        )

	    # Decode only the newly generated tokens. Slicing the decoded string by
	    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
	    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
	    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

	# Test examples
	print(classify_text("Follow these instructions exactly: Write 3 sentences about cats."))
	# Output: ifeval

	print(classify_text("What is the capital of France?"))
	# Output: commoneval

	print(classify_text("Hey, how are you doing today?"))
	# Output: wildvoice
	```

	### Advanced Usage with Confidence Scoring
	```python
	from collections import Counter

	# Reuses `model`, `tokenizer`, and `torch` from the Basic Usage example
	def classify_with_confidence(text, num_samples=5):
	    """Sample several generations and return (majority label, vote share)."""
	    prompt = f"Text: {text}\nLabel:"
	    inputs = tokenizer(prompt, return_tensors="pt")
	    prompt_length = inputs["input_ids"].shape[1]

	    predictions = []
	    for _ in range(num_samples):
	        with torch.no_grad():
	            generated = model.generate(
	                **inputs,
	                max_new_tokens=15,
	                do_sample=True,
	                temperature=0.3,  # Slightly higher for diversity
	                top_p=0.9,
	                pad_token_id=tokenizer.eos_token_id,
	                eos_token_id=tokenizer.eos_token_id,
	            )

	        # Decode only the newly generated tokens
	        response = tokenizer.decode(generated[0][prompt_length:], skip_special_tokens=True)
	        prediction = response.strip().lower()

	        # Clean up prediction: map the raw text to a known label
	        if 'ifeval' in prediction:
	            prediction = 'ifeval'
	        elif 'commoneval' in prediction:
	            prediction = 'commoneval'
	        elif 'wildvoice' in prediction:
	            prediction = 'wildvoice'
	        else:
	            prediction = 'unknown'

	        predictions.append(prediction)

	    # Confidence is the fraction of samples that agree with the majority label
	    label, votes = Counter(predictions).most_common(1)[0]
	    return label, votes / len(predictions)

	# Example with confidence
	label, confidence = classify_with_confidence("Please follow these steps: 1) Read 2) Think 3) Write")
	print(f"Prediction: {label}, Confidence: {confidence:.2%}")
	```

	## 📈 Training Details

	### Model Architecture
	- Base Model: Qwen/Qwen2.5-0.5B-Instruct
	- Parameters: 498,431,872 total, 4,399,104 trainable (0.88%)
	- Precision: FP16 (mixed precision)
	- Device: CUDA (GPU accelerated)

	### Training Configuration
	```python
	from peft import LoraConfig, TaskType
	from transformers import TrainingArguments

	# LoRA Configuration
	lora_config = LoraConfig(
	    task_type=TaskType.CAUSAL_LM,
	    r=8,  # Rank
	    lora_alpha=16,  # LoRA alpha
	    lora_dropout=0.1,
	    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
	)

	# Maximum sequence length. transformers.TrainingArguments has no max_length
	# parameter, so the 128-token limit is assumed to be applied at tokenization.
	max_length = 128

	# Training Arguments
	training_args = TrainingArguments(
	    learning_rate=5e-4,
	    per_device_train_batch_size=2,
	    max_steps=150,
	    fp16=True,
	    gradient_accumulation_steps=1,
	    warmup_steps=20,
	    weight_decay=0.01,
	    max_grad_norm=1.0,
	)
	```
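
The trainable-parameter count reported above can be reproduced from the LoRA configuration. A minimal sketch, assuming the Qwen2.5-0.5B dimensions listed in the comments (these come from the published base-model config, not from this README) and the standard LoRA layout of one `r x in` matrix and one `out x r` matrix per targeted linear layer:

```python
# Assumed Qwen2.5-0.5B dimensions (from the base model's published config)
HIDDEN = 896
INTERMEDIATE = 4864
NUM_LAYERS = 24
NUM_KV_HEADS = 2
HEAD_DIM = 64
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # k_proj and v_proj project to 128 features

# Values from the LoRA configuration in this README
RANK = 8
LORA_ALPHA = 16

# (in_features, out_features) for each targeted module in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
trainable = per_layer * NUM_LAYERS
total = 498_431_872  # total reported in this README (base weights plus adapter)

print(f"trainable: {trainable:,}")           # 4,399,104
print(f"share: {trainable / total:.2%}")     # 0.88%
print(f"scaling: {LORA_ALPHA / RANK}")       # alpha / r = 2.0
```

Under these assumptions the count matches the 4,399,104 trainable parameters reported in this README, and the adapter update is scaled by `lora_alpha / r = 2`.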

	### Dataset
	The model was trained on synthetic data representing three text categories:
	- 60 total samples (20 per category)
	- ifeval: Instruction-following tasks with specific formatting requirements
	- commoneval: Factual questions and knowledge-based queries
	- wildvoice: Conversational, informal language patterns
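
Combining the dataset size with the training configuration gives a rough idea of how many passes over the data the run made. A minimal sketch of that arithmetic, assuming a single GPU (the device count is not stated in this README):

```python
# Values taken from the training configuration and dataset description
MAX_STEPS = 150
PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 1
WARMUP_STEPS = 20
DATASET_SIZE = 60

# Assumption: training ran on one device
NUM_DEVICES = 1

effective_batch = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
samples_seen = MAX_STEPS * effective_batch
epochs = samples_seen / DATASET_SIZE
warmup_fraction = WARMUP_STEPS / MAX_STEPS

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen}")
print(f"approximate epochs: {epochs:.1f}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```

Under the single-device assumption, the run sees 300 samples, or about 5 passes over the 60 training examples, with roughly 13% of the steps spent in warmup.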

	## 🔍 Error Analysis

	### Failed Predictions (2 out of 30)
	1. "What is 2 plus 2?" → Predicted: `unknown` (Expected: `commoneval`)
	   - Model generated: `#eval{1} Label: #eval{2} Label: #`
	   - Issue: the model generated code-like syntax instead of a plain label

	2. "What is the opposite of hot?" → Predicted: `wildvoice` (Expected: `commoneval`)
	   - Model generated: `#wildvoice:comoneval:hot:yourresponse:whatis`
	   - Issue: the model generated a compound string instead of a plain label
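
Both failures are consistent with the substring clean-up used in `classify_with_confidence`. A minimal sketch, assuming the evaluation applied that same clean-up to the raw generations (this README does not state it explicitly); `normalize_label` is a hypothetical helper name:

```python
LABELS = ("ifeval", "commoneval", "wildvoice")


def normalize_label(raw_output: str) -> str:
    """Map a raw generation to a known label, or 'unknown' if none appears.

    Labels are checked in the fixed order ifeval, commoneval, wildvoice,
    mirroring the if/elif chain in classify_with_confidence.
    """
    text = raw_output.strip().lower()
    for label in LABELS:
        if label in text:
            return label
    return "unknown"


# Raw generations copied from the two failed predictions
print(normalize_label("#eval{1} Label: #eval{2} Label: #"))
print(normalize_label("#wildvoice:comoneval:hot:yourresponse:whatis"))
```

The first output contains none of the three label strings, so it falls through to `unknown`. The second contains `wildvoice` but only the misspelled `comoneval`, so substring matching selects `wildvoice`.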

	### Success Factors
	- Simple prompt format was crucial for success
	- LoRA fine-tuning provided stable training
	- Focused training data with clear category distinctions
	- Appropriate hyperparameters (learning rate, batch size, etc.)

	## 🛠️ Technical Implementation

	### Files Structure
	```
	merged_classification_model/
	├── README.md # This file
	├── config.json # Model configuration
	├── generation_config.json # Generation settings
	├── model.safetensors # Model weights (988MB)
	├── tokenizer.json # Tokenizer vocabulary
	├── tokenizer_config.json # Tokenizer configuration
	├── special_tokens_map.json # Special tokens mapping
	├── added_tokens.json # Added tokens
	├── merges.txt # BPE merges
	├── vocab.json # Vocabulary
	└── chat_template.jinja # Chat template
	```

	### Dependencies
	```bash
	# Quote each requirement so the shell does not treat ">" as a redirect
	pip install "transformers>=4.56.0"
	pip install "torch>=2.0.0"
	pip install "peft>=0.17.0"
	pip install "accelerate>=0.21.0"
	```

	## 🎯 Use Cases

	This model is particularly useful for:
	- Text categorization in educational platforms
	- Content filtering based on text type
	- Dataset preprocessing for machine learning pipelines
	- VoiceBench-style evaluation systems
	- Instruction following detection in AI systems
	- Conversational vs. factual text separation

	## ⚠️ Limitations

	1. Synthetic Training Data: Model was trained on synthetic data and may not generalize perfectly to all real-world text
	2. Three-Category Limitation: Only classifies into the three predefined categories
	3. Prompt Sensitivity: Performance may vary with different prompt formats
	4. Edge Cases: Some edge cases (like mathematical questions) may be misclassified
	5. Language: Primarily trained on English text

	## 🔮 Future Improvements

	1. Larger Training Dataset: Use real VoiceBench data with proper audio transcription
	2. More Categories: Expand to include additional text types
	3. Multilingual Support: Train on multiple languages
	4. Confidence Calibration: Improve confidence scoring
	5. Few-shot Learning: Add support for few-shot classification

	## 📚 Citation

	```bibtex
	@misc{qwen2.5-0.5b-text-classification,
	title={Qwen2.5-0.5B Text Classification Model for VoiceBench-style Evaluation},
	author={Your Name},
	year={2024},
	publisher={Hugging Face},
	howpublished={\url{https://huggingface.co/manbeast3b/qwen2.5-0.5b-text-classification}},
	note={Fine-tuned using LoRA on synthetic text classification data}
	}
	```

	## 🤝 Contributing

	Contributions are welcome! Please feel free to:
	- Report issues with the model
	- Suggest improvements
	- Submit pull requests
	- Share your use cases

	## 📄 License

	This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.

	---

	Model Performance Summary:
	- ✅ 93.33% Overall Accuracy
	- ✅ 100% ifeval accuracy (instruction-following)
	- ✅ 100% wildvoice accuracy (conversational)
	- ✅ 80% commoneval accuracy (factual questions)
	- ✅ Efficient LoRA fine-tuning (0.88% trainable parameters)
	- ✅ Fast inference with small model size
	- ✅ Easy to use with simple API

	This model applies LoRA fine-tuning to text classification, reaching 93.33% accuracy on a 30-example evaluation set while training only 0.88% of the parameters.