Abstract
InfoSynth is a framework that automatically generates novel and diverse reasoning benchmarks for large language models using information-theoretic principles and genetic algorithms.
Large language models (LLMs) have made significant advances in reasoning and code generation, yet efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is both expensive and time-consuming. Moreover, existing benchmarks frequently leak into LLM training data, so novel and diverse benchmarks are needed to accurately assess models' genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions for new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity than their seed datasets. The algorithm also provides a means of controlling the novelty, diversity, and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel, and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
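As a rough illustration of the abstract's information-theoretic idea, the sketch below computes a KL-divergence novelty score and an entropy diversity score over discrete representations of benchmark problems. The paper's exact formulation is not stated in the abstract, so the cluster-assignment representation, the smoothing constant, and all function names here are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (assumed formulation, not the InfoSynth implementation):
# novelty as KL(synth || seed) and diversity as Shannon entropy, computed
# over discrete "bins" (e.g., k-means cluster assignments of problem
# embeddings -- the choice of representation is an assumption).
import numpy as np

def _histogram(assignments: np.ndarray, n_bins: int, eps: float = 1e-9) -> np.ndarray:
    """Empirical distribution over bin assignments, smoothed to avoid zeros."""
    counts = np.bincount(assignments, minlength=n_bins).astype(float) + eps
    return counts / counts.sum()

def novelty_kl(seed_assign: np.ndarray, synth_assign: np.ndarray, n_bins: int) -> float:
    """KL(synth || seed): higher means the synthesized problems occupy
    regions of the representation space the seed set rarely covers."""
    p = _histogram(synth_assign, n_bins)
    q = _histogram(seed_assign, n_bins)
    return float(np.sum(p * np.log(p / q)))

def diversity_entropy(synth_assign: np.ndarray, n_bins: int) -> float:
    """Shannon entropy of the synthesized distribution: higher means the
    generated problems spread evenly across bins instead of collapsing."""
    p = _histogram(synth_assign, n_bins)
    return float(-np.sum(p * np.log(p)))

if __name__ == "__main__":
    # Toy example: the seed set is concentrated in bins 0-3, while the
    # synthesized set spreads over all 8 bins, so both scores are high.
    rng = np.random.default_rng(0)
    n_bins = 8
    seed_assign = rng.integers(0, 4, size=200)
    synth_assign = rng.integers(0, 8, size=200)
    print("novelty (KL):", round(novelty_kl(seed_assign, synth_assign, n_bins), 3))
    print("diversity (entropy):", round(diversity_entropy(synth_assign, n_bins), 3))
```

Under this reading, a higher KL against the seed distribution signals that generated problems land in new regions of the representation space, while higher entropy signals that they are spread broadly rather than clustered around a few templates; neither score requires running model evaluations.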
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls (2025)
- UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models (2025)
- PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code (2025)
- CodeSimpleQA: Scaling Factuality in Code Large Language Models (2025)
- PerfCoder: Large Language Models for Interpretable Code Performance Optimization (2025)
- Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement (2025)
- AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration (2025)