Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction Paper • 2512.18880 • Published 13 days ago • 23
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28, 2025 • 18
Good Intentions Beyond ACL: Who Does NLP for Social Good, and Where? Paper • 2510.04434 • Published Oct 6, 2025 • 5
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search Paper • 2509.25454 • Published Sep 29, 2025 • 141
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning Paper • 2509.22576 • Published Sep 26, 2025 • 134
The Majority is not always right: RL training for solution aggregation Paper • 2509.06870 • Published Sep 8, 2025 • 16
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification Paper • 2508.05629 • Published Aug 7, 2025 • 180
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training Paper • 2508.00414 • Published Aug 1, 2025 • 93
Phi-Ground Tech Report: Advancing Perception in GUI Grounding Paper • 2507.23779 • Published Jul 31, 2025 • 44
Diversity-Enhanced Reasoning for Subjective Questions Paper • 2507.20187 • Published Jul 27, 2025 • 25
view article Article Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face +3 Jul 29, 2025 • 206
The Invisible Leash: Why RLVR May Not Escape Its Origin Paper • 2507.14843 • Published Jul 20, 2025 • 85
AGUVIS: Unified Pure Vision GUI Agents Collection https://aguvis-project.github.io • 3 items • Updated Dec 20, 2024 • 7
Through the Valley: Path to Effective Long CoT Training for Small Language Models Paper • 2506.07712 • Published Jun 9, 2025 • 18
HardTests: Synthesizing High-Quality Test Cases for LLM Coding Paper • 2505.24098 • Published May 30, 2025 • 43
DataDecide Collection A suite of models, data, and evals over 25 corpora, 14 sizes, and 3 seeds to measure how accurately small experiments predict rankings at large scale. • 358 items • Updated 12 days ago • 21
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models Paper • 2505.20236 • Published May 26, 2025 • 2