arXiv:2510.21440

Redefining Retrieval Evaluation in the Era of LLMs

Published on Oct 24 · Submitted by Giovanni Trappolini on Oct 27

AI-generated summary

A new metric, UDCG, is introduced to better evaluate Retrieval Augmented Generation systems by accounting for both the utility of relevant documents and the distraction of irrelevant ones, improving correlation with end-to-end answer accuracy.

Abstract

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users examine documents sequentially with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric that uses an LLM-oriented positional discount to directly optimize correlation with end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.

Community

Paper submitter

We introduce UDCG (Utility and Distraction-aware Cumulative Gain), a novel metric specifically designed for evaluating retrieval systems in RAG pipelines.

Traditional IR metrics like nDCG, MAP, and MRR fail in RAG settings due to two critical misalignments:
(1) they assume monotonically decreasing attention with rank position, unlike LLMs;
(2) they treat all irrelevant documents equally, ignoring that some actively distract LLMs and degrade generation quality while others are harmless (see the rank-discount sketch below).
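
For concreteness, here is the standard DCG/nDCG rank discount that these classical metrics build in. This is the textbook formulation, not code from the paper; it only illustrates the "diminishing attention by rank" assumption and the fact that binary relevance scores every irrelevant passage as a harmless zero.

```python
import math

def dcg(relevances):
    """Standard DCG: the gain at rank i is discounted by log2(i + 1).

    This hard-codes the assumption that attention decays monotonically with
    rank -- reasonable for human readers, but not for an LLM that ingests the
    whole retrieved context at once.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """nDCG: DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Binary relevance treats every irrelevant document as a zero,
# whether or not it would actually distract the generator.
print(ndcg([1, 0, 1, 0, 0]))  # same score regardless of how "distracting" the 0s are
```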

UDCG addresses these limitations by replacing binary relevance with continuous utility scores that capture both the positive contributions of relevant passages and the negative impacts of distracting ones, and by using an LLM-oriented positional discount that directly optimizes correlation with end-to-end answer accuracy.
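
This post does not spell out the exact formula, so the following is only a minimal sketch of the idea as described above: per-passage utility scores that can be negative for distracting passages, aggregated with a learned, LLM-oriented positional weight vector instead of a fixed log discount. The function name `udcg`, the score range, and the flat example weights are illustrative assumptions, not the authors' reference implementation (that lives in the repo linked below).

```python
import numpy as np

def udcg(utilities, weights):
    """Sketch of a UDCG-style score for one retrieved list.

    utilities: per-passage utility scores (assumed range, e.g. [-1, 1]);
               positive = the passage helps the LLM answer,
               negative = the passage distracts it.
    weights:   LLM-oriented positional discount, one weight per rank.
               Unlike nDCG's fixed log discount, these are free parameters
               chosen to maximize correlation with end-to-end answer accuracy.
    """
    u = np.asarray(utilities, dtype=float)
    w = np.asarray(weights, dtype=float)[: len(u)]
    return float(np.dot(w, u))

# A useful passage at rank 1 and a distracting one at rank 3 under a flat discount:
print(udcg([0.9, 0.0, -0.6, 0.1], weights=[1.0, 1.0, 1.0, 1.0]))  # 0.4
```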
Experiments across 5 QA datasets (NQ, TriviaQA, PopQA, BioASQ, NoMIRACL) and 6 LLMs demonstrate that UDCG improves correlation with RAG performance by up to 36% compared to traditional metrics.
Code and data: github.com/GiovanniTRA/UDCG
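
The correlation analysis itself can be reproduced with a few lines. The sketch below assumes Spearman correlation between per-query metric scores and binary answer correctness, which may differ from the exact protocol in the paper; `ndcg_scores` and `udcg_scores` are hypothetical arrays produced by the functions sketched above.

```python
from scipy.stats import spearmanr

def metric_vs_accuracy_correlation(metric_scores, answer_correct):
    """Correlate per-query retrieval-metric scores with end-to-end answer
    correctness (1 = the RAG system answered correctly, 0 = it did not)."""
    rho, _ = spearmanr(metric_scores, answer_correct)
    return rho

# Hypothetical comparison of how well each metric tracks answer accuracy:
# rho_ndcg = metric_vs_accuracy_correlation(ndcg_scores, correct)
# rho_udcg = metric_vs_accuracy_correlation(udcg_scores, correct)
```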
