---
title: DataEngEval
emoji: 🔥
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---
# DataEngEval

A comprehensive evaluation platform for systematically benchmarking model performance across languages and tasks, with a focus on data engineering tools and technologies.
## 🚀 Features

- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses the Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos
## 🎯 Current Use Cases

### SQL Generation

- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance

### Code Generation

- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality

### Documentation Generation

- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality
## 🏗️ Project Structure

```
dataeng-leaderboard/
├── app.py                  # Main Gradio application
├── requirements.txt        # Dependencies for Hugging Face Spaces
├── config/                 # Configuration files
│   ├── app.yaml            # App settings
│   ├── models.yaml         # Model configurations
│   ├── metrics.yaml        # Scoring weights
│   └── use_cases.yaml      # Use case definitions
├── src/                    # Source code modules
│   ├── evaluator.py        # Dataset management and evaluation
│   ├── models_registry.py  # Model configuration and interfaces
│   ├── scoring.py          # Metrics computation
│   └── utils/              # Utility functions
├── tasks/                  # Multi-use-case datasets
│   ├── sql_generation/     # SQL generation tasks
│   ├── code_generation/    # Python data processing tasks
│   └── documentation/      # Technical documentation tasks
├── prompts/                # SQL generation templates
└── test/                   # Test files
```
## 🚀 Quick Start

### Running on Hugging Face Spaces

1. **Fork this Space**: Click "Fork" on the Hugging Face Space
2. **Configure**: Add your `HF_TOKEN` as a secret in the Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models
### Running Locally

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd dataeng-leaderboard
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables (optional):

   ```bash
   export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
   ```

4. Run the application:

   ```bash
   gradio app.py
   ```
## 📊 Usage

### Evaluating Models

1. **Select Dataset**: Choose from the available datasets (NYC Taxi)
2. **Choose Dialect**: Select the target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural-language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and the updated leaderboard
### Understanding Metrics

The platform computes several metrics for each evaluation:

- **Correctness (Exact)**: Binary score (0/1) for an exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation (see the sketch below)
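
Since transpilation is powered by sqlglot (see Acknowledgments), a dialect-compliance check can reduce to "does the SQL transpile cleanly to the target dialect?". The sketch below is illustrative only; the function name `check_dialect_compliance` is an assumption, not the actual code in `src/scoring.py`:

```python
# A minimal sketch of a dialect-compliance check, assuming it reduces to a
# clean sqlglot transpile; check_dialect_compliance is a hypothetical name.
import sqlglot
from sqlglot.errors import SqlglotError

def check_dialect_compliance(sql: str, dialect: str) -> int:
    """Return 1 if the SQL transpiles to the target dialect, else 0."""
    try:
        sqlglot.transpile(sql, write=dialect)
        return 1
    except SqlglotError:
        return 0

print(check_dialect_compliance(
    "SELECT vendor_id, COUNT(*) FROM trips GROUP BY 1", "bigquery"))  # 1
```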
The **Composite Score** combines all metrics using the following weights:

- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
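
As a worked example, here is a minimal sketch of the weighted sum described above. The canonical weights live in `config/metrics.yaml`; that each metric (including latency) is normalized to [0, 1] before weighting is an assumption:

```python
# A minimal sketch of the composite score: a weighted sum of per-metric
# values, each assumed to be normalized to [0, 1].
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,  # assumed normalized so faster responses score higher
}

def composite_score(metrics: dict) -> float:
    """Weighted sum over the metrics; missing metrics count as 0."""
    return sum(w * metrics.get(name, 0.0) for name, w in WEIGHTS.items())

example = {
    "correctness": 1.0,
    "execution_success": 1.0,
    "result_match_f1": 0.8,
    "dialect_compliance": 1.0,
    "readability": 0.9,
    "latency": 0.7,
}
print(round(composite_score(example), 3))  # 0.95
```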
## ⚙️ Configuration

### Adding New Models

Edit `config/models.yaml` to add new models:

```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
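
For context, the sketch below shows one plausible way such an entry could drive a remote generation call through the Hugging Face Inference API. It is an assumption about the internals, not a copy of `src/models_registry.py`:

```python
# A minimal sketch of consuming a models.yaml entry, assuming PyYAML and
# huggingface_hub are installed; this approximates src/models_registry.py.
import os
import yaml
from huggingface_hub import InferenceClient

with open("config/models.yaml") as f:
    cfg = yaml.safe_load(f)["models"][0]

client = InferenceClient(model=cfg["model_id"], token=os.getenv("HF_TOKEN"))
answer = client.text_generation(
    "Write a Presto query: total fare amount per payment type.",
    max_new_tokens=cfg["params"]["max_new_tokens"],
    temperature=cfg["params"]["temperature"],
)
print(answer)
```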
### Adding New Datasets

1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - `schema.sql`: Database schema definition
   - `loader.py`: Database creation script
   - `cases.yaml`: Test cases with questions and reference SQL
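
As a starting point, here is a minimal sketch of a `loader.py`, assuming the evaluator runs queries against a DuckDB database built from `schema.sql` (the `trips` table, sample row, and `load` entry point are illustrative assumptions, not the project's actual interface):

```python
# loader.py — a minimal sketch of a database creation script, assuming a
# DuckDB backend; table name and entry-point signature are hypothetical.
import duckdb

def load(db_path: str = ":memory:") -> duckdb.DuckDBPyConnection:
    """Build the database from schema.sql and insert a few sample rows."""
    con = duckdb.connect(db_path)
    with open("schema.sql") as f:
        con.execute(f.read())  # create the tables defined in schema.sql
    con.execute("INSERT INTO trips VALUES (1, 'credit_card', 12.50)")
    return con

if __name__ == "__main__":
    load("my_dataset.duckdb")
```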
## 🤝 Contributing

### Adding New Features

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request
### Testing

Run the test suite:

```bash
python run_tests.py
```
## 📄 License

This project is licensed under the Apache-2.0 License.
## 🙏 Acknowledgments

- Built with [Gradio](https://gradio.app)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org)
- Model APIs from [Hugging Face](https://huggingface.co)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)