SLR-Bench / benchmark_data.csv
Model,Logical Power Ranking,Logical Power Score,Accuracy,Syntax Score,Logic Basic Accuracy,Logic Easy Accuracy,Logic Medium Accuracy,Logic Hard Accuracy
o3,1,15.5,0.78,0.8,0.99,0.93,0.74,0.45
o4-mini-high,2,12.8,0.64,0.88,0.98,0.96,0.4,0.21
o4-mini,3,12.3,0.61,0.86,0.93,0.88,0.52,0.13
o1,4,11.9,0.59,0.68,0.92,0.89,0.41,0.15
o3-mini,5,11.6,0.58,0.75,0.97,0.9,0.37,0.07
o4-mini-low,6,10.3,0.52,0.91,0.91,0.81,0.25,0.09
o1-mini,7,10.1,0.5,0.95,0.97,0.82,0.2,0.03
Llama-3.1-8B-Tuned-FFT,8,9.4,0.47,0.96,0.92,0.77,0.17,0.02
gemini-2.0-flash-thinking-exp-01-21,9,8.6,0.43,0.83,0.93,0.65,0.13,0.01
Llama-3.1-8B-Tuned-LoRA,10,8.4,0.42,1.0,0.95,0.57,0.15,0.01
DeepSeek-R1-Distill-Llama-70B,11,8.1,0.4,0.57,0.89,0.61,0.09,0.03
gpt-4.5-preview,12,7.2,0.36,1.0,0.95,0.41,0.06,0.02
gpt-4o,13,6.7,0.34,0.96,0.9,0.37,0.06,0.01
Llama-3.3-70B-Instruct,14,5.1,0.26,0.99,0.84,0.17,0.01,0.01
Llama-3.1-8B-Instruct,15,5.0,0.25,0.87,0.82,0.17,0.01,0.0
QwQ-32B-Preview,16,4.6,0.23,0.84,0.77,0.15,0.0,0.01
Internlm2-20b,17,3.9,0.19,0.82,0.71,0.07,0.0,0.0
Qwen2-57B-A14B-Instruct,18,3.9,0.2,0.81,0.71,0.07,0.0,0.0
CodeLlama-34b-Instruct-hf,19,3.5,0.17,0.78,0.68,0.01,0.0,0.0
Mixtral-8x7B-Instruct-v0.1,20,3.1,0.15,0.93,0.61,0.01,0.0,0.0
Llama-3.2-3B-Instruct,21,1.6,0.08,0.61,0.31,0.01,0.0,0.0