new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Oct 28

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.

WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.

  • 16 authors
·
Apr 27 2

BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp relies on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas the GPT-5 achieves 55.9%. Integrating the GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research system.

Étude cognitive des processus de construction d'une requête dans un système de gestion de connaissances médicales

This article presents the Cogni-CISMeF project, which aims at improving medical information search in the CISMeF system (Catalog and Index of French-language health resources) by including a conversational agent to interact with the user in natural language. To study the cognitive processes involved during the information search, a bottom-up methodology was adopted. Experimentation has been set up to obtain human dialogs between a user (playing the role of patient) dealing with medical information search and a CISMeF expert refining the request. The analysis of these dialogs underlined the use of discursive evidence: vocabulary, reformulation, implicit or explicit expression of user intentions, conversational sequences, etc. A model of artificial agent is proposed. It leads the user in its information search by proposing to him examples, assistance and choices. This model was implemented and integrated in the CISMeF system. ---- Cet article d\'ecrit le projet Cogni-CISMeF qui propose un module de dialogue Homme-Machine \`a int\'egrer dans le syst\`eme d'indexation de connaissances m\'edicales CISMeF (Catalogue et Index des Sites M\'edicaux Francophones). Nous avons adopt\'e une d\'emarche de mod\'elisation cognitive en proc\'edant \`a un recueil de corpus de dialogues entre un utilisateur (jouant le r\^ole d'un patient) d\'esirant une information m\'edicale et un expert CISMeF af inant cette demande pour construire la requ\^ete. Nous avons analys\'e la structure des dialogues ainsi obtenus et avons \'etudi\'e un certain nombre d'indices discursifs : vocabulaire employ\'e, marques de reformulation, commentaires m\'eta et \'epilinguistiques, expression implicite ou explicite des intentions de l'utilisateur, encha\^inement conversationnel, etc. De cette analyse, nous avons construit un mod\`ele d'agent artificiel dot\'e de capacit\'es cognitives capables d'aider l'utilisateur dans sa t\^ache de recherche d'information. Ce mod\`ele a \'et\'e impl\'ement\'e et int\'egr\'e dans le syst\`eme CISMeF.

  • 5 authors
·
Feb 10, 2014