ToGMAL Prompt Difficulty Analyzer
Real-time LLM capability boundary detection using vector similarity search.
What This Does
This system analyzes any prompt and tells you:
- How difficult it is for current LLMs (based on real benchmark data)
- Why it's difficult (shows similar benchmark questions)
- What to do about it (actionable recommendations)
Key Innovation
Instead of clustering by domain (all math together), we cluster by difficulty: what is actually hard for LLMs, regardless of domain.
Real Data
- 14,042 MMLU questions with real success rates from top models
- <50ms query time for real-time analysis
- Production-ready vector database
Demo Links
- Local: http://127.0.0.1:7860
- Public: https://99b38fc2e31da2f83d.gradio.live
Example Results
Hard Questions (Low Success Rates)
Prompt: "Statement 1 | Every field is also a ring..."
Risk: HIGH (23.9% success)
Recommendation: Multi-step reasoning with verification
Prompt: "Find all zeros of polynomial x³ + 2x + 2 in Z₇"
Risk: MODERATE (43.8% success)
Recommendation: Use chain-of-thought prompting
Easy Questions (High Success Rates)
Prompt: "What is 2 + 2?"
Risk: MINIMAL (100% success)
Recommendation: Standard LLM response adequate
Prompt: "What is the capital of France?"
Risk: MINIMAL (100% success)
Recommendation: Standard LLM response adequate
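The mapping from a success rate to a risk label and recommendation can be sketched as below. This is a minimal illustration only: the function name `classify_risk` and the cutoffs (0.30 and 0.70) are assumptions chosen to be consistent with the four examples above, not the project's confirmed thresholds. The labels and recommendation strings are taken from those examples.

```python
def classify_risk(success_rate: float) -> tuple[str, str]:
    """Map a success rate in [0, 1] to a (risk label, recommendation) pair.

    The cutoffs below are hypothetical; only the labels and the
    recommendation texts come from the example results.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate < 0.30:  # assumed cutoff for HIGH risk
        return "HIGH", "Multi-step reasoning with verification"
    if success_rate < 0.70:  # assumed cutoff for MODERATE risk
        return "MODERATE", "Use chain-of-thought prompting"
    return "MINIMAL", "Standard LLM response adequate"


if __name__ == "__main__":
    # Success rates from the example results above.
    for rate in (0.239, 0.438, 1.0):
        label, advice = classify_risk(rate)
        print(f"{rate:.1%} success: {label} ({advice})")
```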
Technical Details
Architecture
User Prompt → Embedding Model → Vector DB → K Nearest Questions → Weighted Score
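The last two stages of this pipeline can be sketched as follows, assuming the vector DB has already returned the k nearest benchmark questions together with a similarity value and a per-question success rate. The `Neighbor` type, the function name `weighted_success_rate`, and the similarity-proportional weighting are illustrative assumptions; the project's actual weighting scheme may differ.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    """One benchmark question returned by the nearest-neighbor search (hypothetical type)."""
    question: str
    similarity: float    # similarity to the prompt, assumed to lie in [0, 1]
    success_rate: float  # fraction of models answering correctly, in [0, 1]


def weighted_success_rate(neighbors: list[Neighbor]) -> float:
    """Similarity-weighted mean of the neighbors' success rates.

    Assumption: more similar questions get proportionally more weight.
    Falls back to a plain mean when no neighbor has positive similarity.
    """
    if not neighbors:
        raise ValueError("need at least one neighbor")
    # Clamp negative similarities to zero so they cannot flip the sign.
    weights = [max(n.similarity, 0.0) for n in neighbors]
    total = sum(weights)
    if total == 0.0:
        return sum(n.success_rate for n in neighbors) / len(neighbors)
    return sum(w * n.success_rate for w, n in zip(weights, neighbors)) / total


if __name__ == "__main__":
    # Made-up neighbors purely for illustration; not real benchmark data.
    nearest = [
        Neighbor("question A", similarity=0.9, success_rate=0.25),
        Neighbor("question B", similarity=0.6, success_rate=0.50),
        Neighbor("question C", similarity=0.3, success_rate=0.75),
    ]
    print(f"Estimated success rate: {weighted_success_rate(nearest):.3f}")
```

Weighting by similarity means a near-duplicate benchmark question dominates the estimate, while loosely related neighbors contribute only a little.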
Components
- Sentence Transformers (all-MiniLM-L6-v2) for embeddings
- ChromaDB for vector storage
- Real MMLU data with success rates from top models
- Gradio for web interface
Next Steps
- Add more benchmark datasets (GPQA, MATH)
- Fetch real per-question results from multiple top models
- Integrate with ToGMAL MCP server for Claude Desktop
- Deploy to HuggingFace Spaces for permanent hosting
Quick Start
# Install dependencies
uv pip install -r requirements.txt
uv pip install gradio
# Run the demo
python demo_app.py
Visit http://127.0.0.1:7860 to use the web interface.