Abstract
GAIA benchmarks general AI assistants using real-world questions that challenge both reasoning and multi-modality handling, showcasing a significant gap between human and AI performance.
We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.
Community
This work is both fascinating and necessary, yet it seems to overlook the representation of the median human in its analysis. The educational breakdown of the annotators, as presented, does not align with the general population's educational levels. For context:
In the general U.S. population aged 25 and older in 2022, only 23% held a bachelor’s degree as their highest degree, while 14% had advanced education (like a master’s, professional, or doctoral degree), according to the Census Bureau.
In contrast, the paper indicates a much higher educational level among annotators:
Bachelor’s Degree: 61%
Master’s Degree: 26%
PhD: 17%
This discrepancy raises questions about the representativeness of the research sample compared to the general population.
In the level 3 example you say explicitly "Use commas as thousands separators in the number of minutes.". The provided answer is "Ground truth: White; 5876". Should it not be "Ground truth: White; 5,876"?
You're absolutely right: "Use commas as thousands separators in the number of minutes." comes from an older version of the dataset; we will remove it in the next version of the paper.
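The formatting issue raised above (5876 vs. 5,876) is the kind of mismatch a tolerant answer comparison can absorb. Below is a minimal sketch of such a comparison. It is not the official GAIA scorer; the function names and the exact normalization rules (ignoring commas, `$` and `%` in numbers, ignoring case and extra whitespace in strings, splitting multi-part answers on `;`) are assumptions made for illustration.

```python
import re
from typing import Optional


def normalize_number(text: str) -> Optional[float]:
    """Parse a number while ignoring thousands separators, '$' and '%'.

    Returns None when the text is not a number. (Assumed rule, not GAIA's.)
    """
    cleaned = text.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_string(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def quasi_exact_match(prediction: str, ground_truth: str) -> bool:
    """Compare a multi-part answer such as 'White; 5876' part by part."""
    pred_parts = prediction.split(";")
    truth_parts = ground_truth.split(";")
    if len(pred_parts) != len(truth_parts):
        return False
    for pred, truth in zip(pred_parts, truth_parts):
        truth_number = normalize_number(truth)
        if truth_number is not None:
            # Numeric part: compare values, so "5,876" equals "5876".
            if normalize_number(pred) != truth_number:
                return False
        elif normalize_string(pred) != normalize_string(truth):
            return False
    return True


print(quasi_exact_match("White; 5,876", "White; 5876"))
```

With a comparison along these lines, the presence or absence of the thousands separator would not change the score, whichever form the ground truth uses.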
There is indeed likely a discrepancy, impossible to resolve, between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.
Very cool benchmark, congrats!
Can you share any examples from levels 1 & 2 where GPT-4 got the right answer, but the human annotators didn't? I think this would be quite interesting to learn whether there's a type of multi-step question that LLMs are intrinsically better at than humans
Most of the mistakes made by human validators (and the reason we don't get a 100% human score) were attention mistakes (misreading or mistyping something, for example) rather than a difference in actual capability - unless you count "focus" as a capability, in which case we could argue that machines in general are already better at it than most of us 😅
@gregmialz would have specific examples of this.
GAIA is the Turing test of AI!
"There is indeed likely a discrepancy that is impossible to solve between the distribution of the annotators and the general population."
I wouldn't say impossible, but I'm not sure how feasible it is to do so:
- require and validate specific standardized test results in the annotator's CV, taken within X years (depending on the type of test), prior to acceptance
- rank annotators against the population they are being compared with
- annotator pay should reflect current job responsibilities and requirements to obtain
I love that NASA question. It will be something else entirely when LLMs are nailing those level 3 questions. I mean you could wrestle the answer out with some clever and patient prompt engineering and chaining, but when it can do that zero shot... Basically magic. This oughta be the gold standard.
What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?
This couples nicely with this benchmark suite -> GPQA: A Graduate-Level Google-Proof Q&A Benchmark[https://arxiv.org/abs/2311.12022]
@someone13574 yes, these questions are quite easy to re-create or slightly modify in the case of memorization.
But also: getting the right answer without a good "trace of reasoning" doesn't mean much on this dataset
Nice work! Those questions are fun. It's sad that the new ChatGPT with all tools (web, image, Python) doesn't have a proper API, so it can't be tested as well. Here is a totally cherry-picked example (it worked only once), and it is still a loss because the answer is not properly formatted:
What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?
You'd have to be a bit of a bastard to do that 😂 maybe someone would do it to poison the competition?
It's certainly not something to overlook.
Here are my thoughts:
- Publish 70% of the dataset, then keep 30% behind a trusted API. Hugging Face et al. could easily implement this functionality. Essentially we would all have to agree that this central authority is trustworthy and unbiased.
- Regularly update the dataset. This requires humans and is expensive. Who has the incentive to do this?
- Synthetic dataset generated on the fly. Is this even plausible and is it self defeating? 
- Close your eyes and hope for the best 
😂
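The first option above (publish 70% and keep 30% behind a trusted API) can be made reproducible by deriving the split from each task identifier rather than from a random shuffle. The following is only a sketch under that assumption: the function names and the hashing scheme are hypothetical and do not describe how the GAIA validation and test sets were actually built.

```python
import hashlib
from typing import Iterable, List, Tuple


def is_public(task_id: str, public_fraction: float = 0.7) -> bool:
    """Assign a task to the public split based on a hash of its id.

    The same id always lands in the same split, so adding new tasks
    later never moves an existing task across the boundary.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < public_fraction


def split_tasks(
    task_ids: Iterable[str], public_fraction: float = 0.7
) -> Tuple[List[str], List[str]]:
    """Return (public_ids, private_ids); private answers stay server-side."""
    public: List[str] = []
    private: List[str] = []
    for task_id in task_ids:
        target = public if is_public(task_id, public_fraction) else private
        target.append(task_id)
    return public, private


# Hypothetical ids, used only to exercise the split.
public_ids, private_ids = split_tasks(f"task-{i}" for i in range(1000))
print(len(public_ids), len(private_ids))
```

A hash-based split only approximates the requested fraction, and it does nothing about the trust question raised above: whoever hosts the private part still has to be trusted.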
People really don't care about data contamination. How about we resist running to chatgpt with the dataset ha.
Hi! Thank you all for your points about data contamination!
This is precisely why
- we only released the answers on the validation set, not on the test set, which is considerably bigger
- we released the precise recipe for generating such a dataset, in the hope that it will be extended with time
- we ask for the reasoning trace of the model
But since, at the moment, even the best models don't reach more than a few points on level 3, I think we have some time before us :)
The 'GAIA' paper presents a fascinating study but raises a crucial question: does the higher educational level of annotators, compared to the general population, affect the evaluation of AI performance? This discrepancy might skew the AI's ability to handle real-world tasks that are more representative of the broader population's capabilities and perspectives. It's vital to consider a more diverse range of annotators to truly assess AI's proficiency in real-world scenarios.
@clefourrier @gregmialz hi. I was looking for 'pure riddles' (number of tools used equal to zero) in the dataset.
The following tasks contain an incorrect number of tools in the solution (i.e. the described solution mentions 'websearch' or other tools, but the 'tools' section is empty):
- 305ac316-eef6-4446-960a-92d80d542f82
- cf106601-ab4f-4af9-b045-5295fe67b37d
- 5a0c1adf-205e-4841-a666-7c3ef95def9d
btw, I'm not sure the answers for the following riddles are correct:
- 42576abe-0deb-4869-8c63-225c2d75a95a (ask GPT-4 to think step by step)
- ec09fa32-d03f-4bf8-84b0-1f16922c3ae4
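The inconsistency reported above (a solution that mentions a tool while the 'tools' section is empty) can be searched for automatically. This is a rough sketch: the record layout (`task_id`, `steps`, `tools`) and the keyword list are assumptions for illustration, not the dataset's actual schema, and keyword matching will produce both false positives and misses.

```python
from typing import Dict, List

# Hypothetical phrases that suggest a tool was used in the written solution.
TOOL_KEYWORDS = ("web search", "websearch", "search engine", "web browser", "calculator")


def find_mislabeled_zero_tool_tasks(records: List[Dict]) -> List[str]:
    """Return ids of tasks with no listed tools whose steps mention one."""
    flagged = []
    for record in records:
        steps = record["steps"].lower()
        mentions_tool = any(keyword in steps for keyword in TOOL_KEYWORDS)
        if not record["tools"] and mentions_tool:
            flagged.append(record["task_id"])
    return flagged


# Made-up records, only to show the expected input shape.
records = [
    {"task_id": "example-a", "steps": "1. Web search for the article.\n2. Read it.", "tools": []},
    {"task_id": "example-b", "steps": "1. Reason about the riddle.", "tools": []},
    {"task_id": "example-c", "steps": "1. Use a search engine.", "tools": ["Search engine"]},
]
print(find_mislabeled_zero_tool_tasks(records))
```

Flagged tasks would still need a manual look, since a solution can mention a tool without requiring it.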
Hey guys! I just finished reading the paper and wanted to share my summary notes with you.
Intro
- Problem → The more complex the tasks used to evaluate AI performance, the harder it is to rely on humans for evaluation and to build tailored benchmarks. Existing benchmarks fall short because of (1) expert-level tasks and (2) subjective human evaluation
- GAIA → Evaluation benchmark built to evaluate AI assistants on tasks that are simple but tedious for humans, assessing their capability to plan, reason, and use tools
- Evaluation → Consists of 466 questions. Zero-shot is a must. Answers should be a number or a plain sentence of a few words (e.g. 100, Andrew, etc.). The questions must follow this framework:
  - Real-world challenging questions
  - Ungameability: hard to brute-force without cheating due to # steps + reasoning trace verification + question diversity
  - Unambiguity: questions shouldn't have multiple interpretations and answers should be unique
  - Simplicity: easy-to-understand questions + easy-to-verify answers (factoid answers)
- LLMs do poorly on GAIA (<30% vs. ~90% for humans). Why? Because of the lack of integration between planning, reasoning, and tool use
Related work
- Evaluating LLMs → Performance on human-expert exams such as the US Bar or USMLE. Suggestions include:
  - Compile evaluations
  - Human-in-the-loop evaluation (time-consuming + difficult to scale)
  - Model evaluation (hard to evaluate state-of-the-art models)
- Evaluating General Assistants → Focuses on the technical performance of AI assistants, such as API calls, whereas GAIA focuses on real-world questions
GAIA
- What does it consist of?
  - Questions consist of text + attached file / SOT
  - Answers must be short, single and easy to verify
  - Requires fundamental abilities: reasoning, multi-modality, code, and tool usage
- Principles
  - Design choices: rewards adaptability rather than specialized knowledge
  - Interpretability: model performance is easy to analyze
  - Robustness against memorization: action space size + question diversity
  - Easy to use
- Evaluation
  - Automated, fast and factual (quasi-exact verification)
- Composition of GAIA
  - Not so strict on behaviour, as there are multiple ways to solve each question
  - Levels
    - Level 1: ≤5 steps, no tool usage needed
    - Level 2: 5<steps<10, reasoning + tools
    - Level 3: near perfection, large # steps and # tool calls
  - Accessibility of questions + diversity (topic domains + cultures)
- Build and extend GAIA
  - Craft questions → use of SOTs such as Wikipedia
  - Validate questions → validation = same answer from 3 different annotators
  - Relying on the web → information changes over time, paywalls, restrictions
LLM results on GAIA
- Their performance is very poor on these questions
- API / tool access is a huge improvement
- GPT4 surpasses web-search task due to memorization
Discussion
- Reproducibility is not that important (Time decay)
- Static vs dynamic benchmarks. Time decay is key (data contamination, web info disappearance). GAIA aims to update questions as time passes
- Evaluation unification.
- Partial vs full automation: partial is the current landscape. Full is the target but represents a game change in work and the economy. Solving GAIA requires full automation to guarantee objectivity. Ethics and open source will be key
Limitations
- Evaluating traces is not implemented
- Question design is still expensive
- Lack of linguistic / cultural diversity
Key Points 🎯
- LLMs do poorly on simple but tedious tasks, which require multiple steps, multi-modality and planning
- GAIA is simple, scalable and automated, enabling evaluation of AI assistants on simple but tedious tasks
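The validation rule in the notes above (a question is kept when 3 different annotators give the same answer) can be written as a small check. This is a sketch of the rule as summarized in these notes, not the paper's exact protocol; the function name and the light normalization (trimming whitespace, ignoring case) are assumptions.

```python
from typing import Sequence


def is_validated(answers: Sequence[str], required: int = 3) -> bool:
    """Return True when at least `required` annotators all agree.

    Answers are compared after trimming whitespace and lower-casing,
    which is an assumed normalization, not one taken from the paper.
    """
    normalized = {answer.strip().lower() for answer in answers}
    return len(answers) >= required and len(normalized) == 1


print(is_validated(["Andrew", "andrew ", " Andrew"]))
print(is_validated(["100", "101", "100"]))
```

Requiring unanimity is the strictest reading of the rule. A majority vote would keep more questions, but it conflicts with the unambiguity requirement listed in the notes.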