| # Quizbowl Agent Goals and Evaluation | |
| ## Objectives | |
| ### Tossup Agents | |
| - Respond to each question with the best guess and a calibrated confidence score | |
| - Buzz at the earliest point at which enough information is available to answer confidently | |
| - Avoid incorrect buzzes | |
| - Maintain consistent performance across topics | |
| ### Bonus Agents | |
| - Answer parts correctly with accurate confidence estimation | |
| - Provide a clear explanation of the reasoning, which human team members use to validate or choose the suggested answer | |
| - Adapt to varying difficulty levels (easy, medium, hard) | |
| ## Performance Metrics | |
| ### Tossup Metrics | |
| - **Accuracy**: Percentage of correct answers | |
| - **Average Buzz Position**: How early in the question you buzz (earlier is better) | |
| - **Confidence Calibration**: How closely the confidence score matches actual accuracy | |
| - **Score**: Points earned based on buzz position and correctness | |
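As a rough illustration of how the tossup metrics could be computed from logged results, here is a minimal sketch. The `TossupResult` record, its field names, and the point values are assumptions. In particular, the scoring rule below is a flat placeholder that ignores buzz position, since the actual rule is not specified here.

```python
from dataclasses import dataclass


@dataclass
class TossupResult:
    """One logged tossup attempt (hypothetical record layout)."""
    correct: bool
    buzz_position: int    # token index at which the agent buzzed
    question_length: int  # total number of tokens in the question


def tossup_metrics(results, correct_points=10, incorrect_points=-5):
    """Compute accuracy, mean normalized buzz position, and a placeholder score."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    # Normalize by question length so that 0.0 is the start and 1.0 is the end.
    avg_buzz_position = sum(r.buzz_position / r.question_length for r in results) / n
    # Placeholder scoring: assumed point values, no position bonus.
    score = sum(correct_points if r.correct else incorrect_points for r in results)
    return {"accuracy": accuracy, "avg_buzz_position": avg_buzz_position, "score": score}


results = [
    TossupResult(correct=True, buzz_position=40, question_length=100),
    TossupResult(correct=False, buzz_position=20, question_length=100),
    TossupResult(correct=True, buzz_position=60, question_length=120),
    TossupResult(correct=True, buzz_position=90, question_length=100),
]
print(tossup_metrics(results))
```

Normalizing the buzz position by question length keeps the average comparable across questions of different lengths.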
| ### Bonus Metrics | |
| - **Accuracy**: Percentage of correct answers across all parts | |
| - **Confidence Calibration**: How closely the confidence score matches actual accuracy | |
| - **Explanation Quality**: Relevance and clarity of reasoning | |
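Confidence calibration, listed for both tossup and bonus agents, can be quantified in several ways. One common choice is the expected calibration error (ECE), sketched below. This is an assumption about how calibration might be measured, not necessarily the measure used in the official evaluation, and the function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin.

    confidences: floats in [0, 1]; outcomes: booleans (True if the answer was correct).
    A value of 0.0 means confidence matches accuracy in every occupied bin.
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, correct in bucket if correct) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece


# An overconfident agent: claims 0.9 confidence but is right only 1 time in 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

A lower value is better. An agent that reports 0.9 confidence should be correct on roughly 90% of the answers it gives at that confidence.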
| ## Evaluating Your Agent | |
| ### Testing Baseline Performance | |
| 1. Run the default agent configuration on the development set of questions | |
| 2. Record metrics (accuracy, confidence, buzz position) | |
| 3. Identify specific weaknesses in performance | |
| ### Validating Improvements | |
| After each enhancement: | |
| 1. Run the agent on the same development set of questions | |
| 2. Compare metrics to previous version | |
| 3. Check for improvements in weak areas | |
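The comparison step above can be sketched as a small helper that reports the change in each metric between two runs on the same development set. The function name, the metric keys, and the choice of which metrics count as "lower is better" are assumptions for illustration.

```python
def compare_metrics(baseline, candidate,
                    lower_is_better=("avg_buzz_position", "calibration_error")):
    """Report the per-metric delta between two runs and whether it is an improvement.

    Only metrics present in both runs are compared. The metric names are hypothetical.
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        improved = delta < 0 if name in lower_is_better else delta > 0
        report[name] = {"delta": delta, "improved": improved}
    return report


# Hypothetical metrics from a baseline run and an enhanced run.
baseline = {"accuracy": 0.62, "avg_buzz_position": 0.71}
candidate = {"accuracy": 0.70, "avg_buzz_position": 0.65}

for name, entry in compare_metrics(baseline, candidate).items():
    status = "improved" if entry["improved"] else "not improved"
    print(f"{name}: {entry['delta']:+.3f} ({status})")
```

Keeping the development set fixed between runs is what makes these deltas meaningful. A change in the question set would confound the comparison.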
| ### Final Evaluation Criteria | |
| Your final agent will be evaluated on: | |
| 1. Overall accuracy across diverse questions | |
| 2. Optimal buzz timing (neither too early nor too late) | |
| 3. Confidence threshold calibration | |
| 4. Explanation quality (for bonus agents) | |
| <!-- ## Setting Goals for Your Agent | |
| ### Minimum Goals | |
| - Accuracy above 60% | |
| - Appropriate confidence threshold (0.7-0.9) | |
| - Reasonable buzz positions | |
| ### Advanced Goals | |
| - Multi-step pipelines with specialized components | |
| - Accuracy above 85% | |
| - Strategic early buzzing on familiar topics | |
| - Detailed, accurate explanations for bonus questions --> |