---
title: DataEngEval
emoji: 🔥
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---
# DataEngEval

A comprehensive evaluation platform for systematically benchmarking model performance across languages and tasks, with a focus on data engineering tools and technologies.
## 🚀 Features

- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses the Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos
## 🎯 Current Use Cases

### SQL Generation

- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance

### Code Generation

- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality

### Documentation Generation

- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality
## 🏗️ Project Structure

```
dataeng-leaderboard/
├── app.py                  # Main Gradio application
├── requirements.txt        # Dependencies for Hugging Face Spaces
├── config/                 # Configuration files
│   ├── app.yaml            # App settings
│   ├── models.yaml         # Model configurations
│   ├── metrics.yaml        # Scoring weights
│   └── use_cases.yaml      # Use case definitions
├── src/                    # Source code modules
│   ├── evaluator.py        # Dataset management and evaluation
│   ├── models_registry.py  # Model configuration and interfaces
│   ├── scoring.py          # Metrics computation
│   └── utils/              # Utility functions
├── tasks/                  # Multi-use-case datasets
│   ├── sql_generation/     # SQL generation tasks
│   ├── code_generation/    # Python data processing tasks
│   └── documentation/      # Technical documentation tasks
├── prompts/                # SQL generation templates
└── test/                   # Test files
```
## 🚀 Quick Start

### Running on Hugging Face Spaces

1. **Fork this Space**: Click "Fork" on the Hugging Face Space
2. **Configure**: Add your `HF_TOKEN` as a secret in the Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models
### Running Locally

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd dataeng-leaderboard
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables (optional):

   ```bash
   export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
   ```

4. Run the application:

   ```bash
   gradio app.py
   ```
## 📊 Usage

### Evaluating Models

1. **Select Dataset**: Choose from the available datasets (NYC Taxi)
2. **Choose Dialect**: Select the target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural-language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and the updated leaderboard
### Understanding Metrics

The platform computes several metrics for each evaluation:

- **Correctness (Exact)**: Binary score (0/1) for an exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation (see the sketch below)
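
Since transpilation is powered by sqlglot (see Acknowledgments), a dialect-compliance check can reduce to "does the SQL transpile cleanly to the target dialect?". The sketch below is illustrative only; the function name `check_dialect_compliance` is an assumption, not the actual code in `src/scoring.py`:

```python
# A minimal sketch of a dialect-compliance check, assuming it reduces to a
# clean sqlglot transpile; check_dialect_compliance is a hypothetical name.
import sqlglot
from sqlglot.errors import SqlglotError

def check_dialect_compliance(sql: str, dialect: str) -> int:
    """Return 1 if the SQL transpiles to the target dialect, else 0."""
    try:
        sqlglot.transpile(sql, write=dialect)
        return 1
    except SqlglotError:
        return 0

print(check_dialect_compliance(
    "SELECT vendor_id, COUNT(*) FROM trips GROUP BY 1", "bigquery"))  # 1
```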
The **Composite Score** combines all metrics using the following weights:

- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
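
As a worked example, here is a minimal sketch of the weighted sum described above. The canonical weights live in `config/metrics.yaml`; that each metric (including latency) is normalized to [0, 1] before weighting is an assumption:

```python
# A minimal sketch of the composite score: a weighted sum of per-metric
# values, each assumed to be normalized to [0, 1].
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,  # assumed normalized so faster responses score higher
}

def composite_score(metrics: dict) -> float:
    """Weighted sum over the metrics; missing metrics count as 0."""
    return sum(w * metrics.get(name, 0.0) for name, w in WEIGHTS.items())

example = {
    "correctness": 1.0,
    "execution_success": 1.0,
    "result_match_f1": 0.8,
    "dialect_compliance": 1.0,
    "readability": 0.9,
    "latency": 0.7,
}
print(round(composite_score(example), 3))  # 0.95
```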
## ⚙️ Configuration

### Adding New Models

Edit `config/models.yaml` to add new models:

```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
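
For context, the sketch below shows one plausible way such an entry could drive a remote generation call through the Hugging Face Inference API. It is an assumption about the internals, not a copy of `src/models_registry.py`:

```python
# A minimal sketch of consuming a models.yaml entry, assuming PyYAML and
# huggingface_hub are installed; this approximates src/models_registry.py.
import os
import yaml
from huggingface_hub import InferenceClient

with open("config/models.yaml") as f:
    cfg = yaml.safe_load(f)["models"][0]

client = InferenceClient(model=cfg["model_id"], token=os.getenv("HF_TOKEN"))
answer = client.text_generation(
    "Write a Presto query: total fare amount per payment type.",
    max_new_tokens=cfg["params"]["max_new_tokens"],
    temperature=cfg["params"]["temperature"],
)
print(answer)
```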
### Adding New Datasets

1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - `schema.sql`: Database schema definition
   - `loader.py`: Database creation script
   - `cases.yaml`: Test cases with questions and reference SQL
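
As a starting point, here is a minimal sketch of a `loader.py`, assuming the evaluator runs queries against a DuckDB database built from `schema.sql` (the `trips` table, sample row, and `load` entry point are illustrative assumptions, not the project's actual interface):

```python
# loader.py — a minimal sketch of a database creation script, assuming a
# DuckDB backend; table name and entry-point signature are hypothetical.
import duckdb

def load(db_path: str = ":memory:") -> duckdb.DuckDBPyConnection:
    """Build the database from schema.sql and insert a few sample rows."""
    con = duckdb.connect(db_path)
    with open("schema.sql") as f:
        con.execute(f.read())  # create the tables defined in schema.sql
    con.execute("INSERT INTO trips VALUES (1, 'credit_card', 12.50)")
    return con

if __name__ == "__main__":
    load("my_dataset.duckdb")
```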
## 🤝 Contributing

### Adding New Features

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request
### Testing

Run the test suite:

```bash
python run_tests.py
```
## 📄 License

This project is licensed under the Apache-2.0 License.
## 🙏 Acknowledgments

- Built with [Gradio](https://gradio.app)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org)
- Model APIs from [Hugging Face](https://huggingface.co)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)