Overview
This document presents evaluation results for DeepSeek-R1-Distill-Llama-70B, quantized to 4 bits with GPTQ and evaluated with the Language Model Evaluation Harness on the ARC-Challenge benchmark, followed by MMLU results.
📊 Evaluation Summary
| Metric | Value | Description | 8-bit |
|---|---|---|---|
| Accuracy (acc,none) | 21.2% | Raw accuracy: percentage of correct answers. | 21.2% |
| Standard Error (acc_stderr,none) | 1.19% | Uncertainty in the accuracy estimate. | 1.2% |
| Normalized Accuracy (acc_norm,none) | 25.4% | Accuracy after dataset-specific normalization. | 25.2% |
| Standard Error (acc_norm_stderr,none) | 1.27% | Uncertainty for normalized accuracy. | 1.3% |
📌 Interpretation:
- The model correctly answered 21.2% of the questions.
- After normalization, accuracy rises to 25.4%, a gain of 4.2 percentage points.
- The standard errors (1.19% and 1.27%) correspond to a 95% confidence interval of roughly ±2.3 to ±2.5 percentage points around each estimate.
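The reported standard errors are consistent with the usual binomial formula for a proportion. The following is a minimal sketch, assuming the 1,172 evaluated questions are independent and using the rounded accuracies from the table; the function names are illustrative and not part of the evaluation harness.

```python
import math

N_SAMPLES = 1172  # number of ARC-Challenge questions evaluated


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def confidence_interval(p: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation confidence interval (95% when z = 1.96)."""
    half_width = z * binomial_stderr(p, n)
    return p - half_width, p + half_width


# Rounded accuracies taken from the summary table.
for name, acc in [("acc", 0.212), ("acc_norm", 0.254)]:
    se = binomial_stderr(acc, N_SAMPLES)
    low, high = confidence_interval(acc, N_SAMPLES)
    print(f"{name}: stderr={se:.2%}, 95% CI=({low:.1%}, {high:.1%})")
```

With these inputs the formula gives about 1.19% and 1.27%, matching the table. The harness may use a slightly different estimator (for example a sample standard deviation or a bootstrap), so small deviations would not be surprising.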
⚙️ Model Configuration
- Model: DeepSeek-R1-Distill-Llama-70B
- Parameters: 70 billion
- Quantization: 4-bit GPTQ
- Source: Hugging Face (`hf`)
- Precision: `torch.float16`
- Hardware: NVIDIA A100 80GB PCIe
- CUDA Version: 12.4
- PyTorch Version: 2.6.0+cu124
- Batch Size: 1
- Evaluation Time: 365.89 seconds (~6 minutes)
📌 Interpretation:
- The evaluation was performed on a high-performance GPU (A100 80GB).
- The model is significantly larger than the previous 8B version, with GPTQ 4-bit quantization reducing memory footprint.
- A batch size of 1 was used, which likely made the evaluation slower than a larger batch would have.
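To make the memory reduction concrete, here is a rough back-of-the-envelope sketch of weight storage at different bit widths. It assumes every parameter is stored at the stated width and ignores GPTQ's per-group scales and zero points, any layers left unquantized, activations, and the KV cache, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # 70 billion parameters, as listed in the configuration

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized half precision
gptq4_gb = weight_memory_gb(N_PARAMS, 4)  # idealized 4-bit weights

print(f"float16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:   ~{gptq4_gb:.0f} GB")
print(f"reduction:       {fp16_gb / gptq4_gb:.0f}x")
```

Under these assumptions the half-precision weights alone (about 140 GB) would not fit on a single 80 GB A100, while the 4-bit weights (about 35 GB) would.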
📂 Dataset Information
- Dataset: AI2 ARC-Challenge
- Task Type: Multiple Choice
- Number of Samples Evaluated: 1,172
- Few-shot Examples Used: 0 (zero-shot setting)
📌 Interpretation:
- This benchmark assesses grade-school-level scientific reasoning.
- Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.
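A zero-shot run like this one could be set up through the harness's Python entry point, `lm_eval.simple_evaluate`. The sketch below is an assumption about how to reproduce the setup, not the authors' actual command: the repository id is taken from the model tree on this page, `dtype=float16` mirrors the listed precision, and loading a GPTQ checkpoint additionally requires a GPTQ backend to be installed. The helper names are hypothetical.

```python
def build_eval_kwargs(repo_id: str) -> dict:
    """Collect evaluation settings matching the configuration described above."""
    return {
        "model": "hf",  # Hugging Face backend
        "model_args": f"pretrained={repo_id},dtype=float16",
        "tasks": ["arc_challenge"],
        "num_fewshot": 0,  # zero-shot setting
        "batch_size": 1,
    }


def run_evaluation(repo_id: str) -> dict:
    """Run the harness and return the ARC-Challenge metrics (needs a large GPU)."""
    import lm_eval  # imported lazily so the settings can be inspected without it

    results = lm_eval.simple_evaluate(**build_eval_kwargs(repo_id))
    return results["results"]["arc_challenge"]


# Assumed repository id, taken from the model tree on this page.
REPO_ID = "empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit"
print(build_eval_kwargs(REPO_ID))
# metrics = run_evaluation(REPO_ID)  # uncomment on suitable hardware
# metrics["acc,none"], metrics["acc_norm,none"] are the keys from the summary table
```

Keeping the settings in a plain dictionary makes it easy to vary one factor at a time, for example `num_fewshot`, when comparing runs.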
📈 Performance Insights
- The `higher_is_better` flag confirms that higher accuracy is preferred.
- The model's raw accuracy (21.2%) is significantly lower than that of state-of-the-art models (60–80% on ARC-Challenge).
- Quantization Impact: The 4-bit GPTQ quantization reduces memory usage but may also impact accuracy slightly.
- Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).
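One way to gauge the quantization impact is to compare the 4-bit results against the last column of the summary table, read here as the scores of an 8-bit variant (an assumption about that column). The sketch below computes a simple z-score for the difference between two accuracies from their standard errors. It treats the two runs as independent, which is conservative because both were scored on the same questions; a paired test on per-question results would be more precise.

```python
import math


def z_score_difference(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """z-score of (acc_a - acc_b), treating the two estimates as independent."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)


# (accuracy, standard error) pairs from the summary table.
z_acc = z_score_difference(0.212, 0.0119, 0.212, 0.012)   # 4-bit vs 8-bit, raw
z_norm = z_score_difference(0.254, 0.0127, 0.252, 0.013)  # 4-bit vs 8-bit, normalized

print(f"raw accuracy:        z = {z_acc:.2f}")
print(f"normalized accuracy: z = {z_norm:.2f}")
```

Both z-scores are far below the conventional threshold of about 1.96, so on this benchmark the table shows no measurable difference between the two bit widths.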
📊 Detailed Evaluation on MMLU Challenges
| Metric | Value | Description |
|---|---|---|
| MMLU | 37.88% | Averaged over MMLU-Stem, MMLU-Social-Sciences, MMLU-Humanities, MMLU-Other |
| MMLU-Humanities | 31.83% | Averaged over MMLU-Formal-Logic, MMLU-Prehistory, MMLU-World-Religions, MMLU-Philosophy, MMLU-High-School-World-History, MMLU-Professional-Law, MMLU-High-School-US-History, MMLU-Logical-Fallacies, MMLU-International-Law, MMLU-High-School-European-History, MMLU-Moral-Disputes, MMLU-Moral-Scenarios, MMLU-Jurisprudence |
| MMLU-Social-Sciences | 45.43% | Averaged over MMLU-Public-Relations, MMLU-Sociology, MMLU-Security-Studies, MMLU-High-School-Government-and-Politics, MMLU-High-School-Psychology, MMLU-Human-Sexuality, MMLU-US-Foreign-Policy, MMLU-High-School-Microeconomics, MMLU-Econometrics, MMLU-High-School-Macroeconomics, MMLU-High-School-Geography, MMLU-Professional-Psychology |
| MMLU-Stem | 33.01% | Averaged over MMLU-Conceptual-Physics, MMLU-High-School-Chemistry, MMLU-College-Biology, MMLU-College-Chemistry, MMLU-Machine-Learning, MMLU-Elementary-Mathematics, MMLU-Abstract-Algebra, MMLU-Astronomy, MMLU-High-School-Statistics, MMLU-Anatomy, MMLU-College-Mathematics, MMLU-Computer-Security, MMLU-College-Computer-Science, MMLU-Electrical-Engineering, MMLU-College-Physics, MMLU-High-School-Computer-Science, MMLU-High-School-Physics, MMLU-High-School-Biology, MMLU-High-School-Mathematics |
| MMLU-Other | 44.48% | Averaged over MMLU-Medical-Genetics, MMLU-Global-Facts, MMLU-Marketing, MMLU-College-Medicine, MMLU-Human-Aging, MMLU-Virology, MMLU-Business-Ethics, MMLU-Clinical-Knowledge, MMLU-Professional-Medicine, MMLU-Nutrition, MMLU-Miscellaneous, MMLU-Professional-Accounting, MMLU-Management |
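The overall MMLU score is not the plain mean of the four category scores, which would be about 38.7%. It appears to be weighted by the number of questions in each category. The sketch below illustrates this with question counts assumed from the standard MMLU test split (they are not stated on this page and should be verified against the harness output).

```python
# Category scores (percent) from the table above.
SCORES = {
    "humanities": 31.83,
    "social_sciences": 45.43,
    "stem": 33.01,
    "other": 44.48,
}

# ASSUMPTION: question counts per category in the standard MMLU test split.
COUNTS = {
    "humanities": 4705,
    "social_sciences": 3077,
    "stem": 3153,
    "other": 3107,
}


def simple_mean(scores: dict) -> float:
    """Unweighted mean of the category scores."""
    return sum(scores.values()) / len(scores)


def weighted_mean(scores: dict, counts: dict) -> float:
    """Mean of category scores weighted by the number of questions in each."""
    total = sum(counts[name] for name in scores)
    return sum(scores[name] * counts[name] for name in scores) / total


print(f"unweighted mean: {simple_mean(SCORES):.2f}%")
print(f"weighted mean:   {weighted_mean(SCORES, COUNTS):.2f}%")
```

Under these assumed counts the weighted mean lands within rounding error of the reported 37.88%, which suggests the aggregate is dominated by the larger Humanities category.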
📌 Let us know if you need further analysis or model tuning! 🚀
Model tree for empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit
Base model
deepseek-ai/DeepSeek-R1-Distill-Llama-70B