arxiv:2508.18106

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Published on Aug 25 · Submitted by wangjunjie on Sep 1
#1 Paper of the day
Authors:

Abstract

A.S.E is a repository-level benchmark for evaluating the security of AI-generated code, highlighting challenges in secure coding and the limitations of LLMs in real-world scenarios.

AI-generated summary

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks often lack relevance to real-world AI programming scenarios, making them inadequate for assessing the practical security risks associated with AI-generated code in production environments. To address this gap, we introduce A.S.E (AI Code Generation Security Evaluation), a repository-level evaluation benchmark designed to closely mirror real-world AI programming tasks, offering a comprehensive and reliable framework for assessing the security of AI-generated code. Our evaluation of leading LLMs on A.S.E reveals several key findings. In particular, current LLMs still struggle with secure coding. The complexity of repository-level scenarios presents challenges for LLMs that typically perform well on snippet-level tasks. Moreover, a larger reasoning budget does not necessarily lead to better code generation. These observations offer valuable insights into the current state of AI code generation, assisting developers in selecting the most appropriate models for practical tasks, while laying the foundation for refining LLMs to generate secure and efficient code in real-world applications.

Community

Paper author, Paper submitter • edited Sep 1

🤖 AI is revolutionizing how we write code, with LLMs acting as tireless coding partners! But with this incredible speed comes a critical question: is the code they generate truly secure? 🛡️

Many security benchmarks only scratch the surface 🧐, testing code in isolated snippets. This approach misses the real-world vulnerabilities that can lurk in the complex interactions across an entire project.
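
As a toy illustration of the cross-file problem (this example is mine, not drawn from the A.S.E benchmark, and all file and function names are hypothetical): the helper below looks harmless as an isolated snippet, and only becomes a path-traversal vulnerability once you see how another file in the same repository calls it.

```python
# storage/files.py -- hypothetical file names, for illustration only.
# Viewed in isolation, this helper raises no obvious red flags: it just
# saves a named report under the project's upload directory.
import os

UPLOAD_DIR = "/srv/app/uploads"

def save_report(name: str, data: bytes) -> str:
    path = os.path.join(UPLOAD_DIR, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


# web/routes.py -- a different file in the same repository.
# Only with this cross-file context does the flaw appear: `name` arrives
# straight from the request, so a value like "../../etc/cron.d/job" walks
# out of UPLOAD_DIR when the path is opened.
def handle_upload(request_params: dict, body: bytes) -> str:
    return save_report(request_params["filename"], body)

# A repository-level evaluation can trace this data flow; a snippet-level
# check that only ever sees save_report() has no reason to flag it.
```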

🚀 Enter A.S.E, a pioneering repository-level benchmark that's changing the game! Instead of just looking at a single file, A.S.E evaluates security across a whole codebase, providing a much more realistic and challenging test for our AI models.
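
To make the repository-level idea concrete, here is a minimal sketch (my own, not the A.S.E evaluation pipeline) of what that granularity means in practice: once an AI-generated change has been written into a local checkout, run a static analyzer over the whole project rather than over the single generated snippet. Bandit, a real Python security linter, is used here as a stand-in judge; the report fields read below are assumptions based on its JSON output, and any repository-wide analyzer would make the same point.

```python
# repo_scan.py -- illustrative sketch only, not the A.S.E pipeline.
import json
import subprocess
import sys


def scan_repository(repo_dir: str) -> list[dict]:
    """Run Bandit recursively over the checkout and return reported issues."""
    proc = subprocess.run(
        ["bandit", "-r", repo_dir, "-f", "json", "-q"],
        capture_output=True,
        text=True,
    )
    report = json.loads(proc.stdout or "{}")
    return report.get("results", [])


if __name__ == "__main__":
    issues = scan_repository(sys.argv[1] if len(sys.argv) > 1 else ".")
    for issue in issues:
        print(f"{issue['filename']}:{issue['line_number']} "
              f"[{issue['issue_severity']}] {issue['issue_text']}")
    print(f"{len(issues)} potential issue(s) found across the repository.")
```

The point is only the scope: the analyzer sees every file the generated code touches or is called from, which is exactly the context that snippet-level benchmarks throw away.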

This is a huge step forward ⏩ in building a future where AI assistants are not just powerful coders, but also vigilant security partners. It's time to push for safer, more reliable AI-generated code!

#AISecurity 🔐 #SecureCoding 💻 #LLMs 🧠 #CodeGeneration ⌨️

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.18106 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.18106 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.18106 in a Space README.md to link it from this page.

Collections including this paper 13