# Advanced Pipeline Examples

This guide shows how to implement and validate sophisticated multi-step pipelines for quizbowl agents.

## Goals

Using advanced pipelines, you will:

- Improve accuracy by 15-25% over single-step agents
- Create specialized components for different tasks
- Implement effective confidence calibration
- Build robust buzzer strategies

## Two-Step Justified Confidence Pipeline

### Baseline Performance

Standard single-step agents typically achieve:

- Accuracy: ~65-70%
- Poorly calibrated confidence
- Limited explanation for answers

### Loading the Pipeline Example

1. Navigate to the "Tossup Agents" tab
2. Click "Select Pipeline to Import..." and choose "two-step-justified-confidence.yaml"
3. Click "Import Pipeline"

### Understanding the Pipeline Structure

This pipeline has two distinct steps:

#### Step A: Answer Generator

- Uses OpenAI/gpt-4o-mini
- Takes question text as input
- Generates an answer candidate
- Focuses solely on accurate answer generation

#### Step B: Confidence Evaluator

- Uses Cohere/command-r-plus
- Takes question text AND the generated answer from Step A
- Evaluates confidence and provides justification
- Specialized for confidence assessment
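
The data flow is easy to see in code. Below is a minimal Python sketch of the same two-step structure, assuming a generic `call_model(model_id, system_prompt, user_prompt)` wrapper around whichever LLM client you use; the function name, prompt wording, and `score|justification` output format are illustrative, not the actual schema of the imported YAML.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical wrapper: (model_id, system_prompt, user_prompt) -> completion text
ModelFn = Callable[[str, str, str], str]

@dataclass
class PipelineResult:
    answer: str
    confidence: float
    justification: str

def run_two_step(question: str, call_model: ModelFn) -> PipelineResult:
    # Step A: answer generation only, no confidence estimation here
    answer = call_model(
        "openai/gpt-4o-mini",
        "You are a quizbowl answer generator. Return only the answer.",
        question,
    )
    # Step B: the evaluator sees both the question and Step A's answer
    evaluation = call_model(
        "cohere/command-r-plus",
        "Rate the answer's correctness from 0 to 1 and justify it. "
        "Reply as: <score>|<justification>",
        f"Question: {question}\nCandidate answer: {answer}",
    )
    score_text, _, justification = evaluation.partition("|")
    try:
        confidence = float(score_text.strip())
    except ValueError:
        confidence = 0.0  # unparseable output -> treat as "do not buzz"
    confidence = min(max(confidence, 0.0), 1.0)
    return PipelineResult(answer.strip(), confidence, justification.strip())
```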
### Validation

Test the pipeline and check:

- Is accuracy improved over the single-step baseline?
- Are confidence scores better calibrated?
- Does the justification explain the reasoning clearly?
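
Calibration can be checked quantitatively rather than by eye. The sketch below computes a simple expected calibration error (ECE) over logged (confidence, was-correct) pairs; the 10-bin scheme is a common convention, not anything the tool prescribes.

```python
def expected_calibration_error(
    records: list[tuple[float, bool]], n_bins: int = 10
) -> float:
    """Size-weighted mean |average confidence - accuracy| per confidence bin.
    Lower is better; a perfectly calibrated agent scores 0.0."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # put conf == 1.0 in the top bin
        bins[idx].append((conf, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / len(records)) * abs(avg_conf - accuracy)
    return ece

# e.g. expected_calibration_error([(0.92, True), (0.85, True), (0.80, False)])
```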
### Results

The two-step justified confidence pipeline typically achieves:

- Accuracy: ~80-85%
- Well-calibrated confidence scores
- Clear justification for answers and confidence
- More strategic buzzing
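
"More strategic buzzing" means turning the calibrated confidence into a buzz-or-wait decision. A minimal sketch, assuming the decision compares confidence against a threshold that relaxes as more of the question is revealed (the 0.9 and 0.4 constants are placeholders to tune against your own test set):

```python
def should_buzz(confidence: float, fraction_revealed: float) -> bool:
    # Demand near-certainty on early clues, where a wrong buzz is costly,
    # and lower the bar toward the end of the question.
    threshold = 0.9 - 0.4 * fraction_revealed  # 0.9 at the start, 0.5 at the end
    return confidence >= threshold
```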
## Enhancing the Two-Step Pipeline

### Step 1: Upgrade the Answer Generator

#### Current Performance

The default example uses gpt-4o-mini, which may lack:

- Specialized knowledge in some areas
- Consistent answer formatting

#### Implementation

1. Click on Step A
2. Change the model to a stronger option (e.g., gpt-4o)
3. Modify the system prompt to focus on answer precision, as sketched below
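
In terms of the earlier sketch, the whole change is the model identifier plus a tighter Step A system prompt (the prompt wording here is illustrative):

```python
ANSWER_SYSTEM_PROMPT = (
    "You are a quizbowl answer generator. Return only the canonical answer "
    "line, with no explanation, articles, or trailing punctuation."
)

def generate_answer(question: str, call_model: ModelFn) -> str:
    # Upgraded from openai/gpt-4o-mini to openai/gpt-4o for broader coverage
    return call_model("openai/gpt-4o", ANSWER_SYSTEM_PROMPT, question).strip()
```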
#### Validation

Test with sample questions and check:

- Has answer accuracy improved?
- Is formatting more consistent?

#### Results

With the upgraded answer generator:

- Accuracy increases to ~85-90%
- More consistent answer formats

### Step 2: Improve the Confidence Evaluator

#### Current Performance

The default evaluator may:

- Overestimate confidence on some topics
- Provide limited justification

#### Implementation

1. Click on Step B
2. Enhance the system prompt:

```
You are an expert confidence evaluator for quizbowl answers.
Your task:
1. Evaluate ONLY the correctness of the provided answer
2. Consider question completeness and available clues
3. Provide specific justification for your confidence score
4. Be especially critical of answers with limited supporting evidence
Remember:
- Early, difficult clues justify lower confidence
- Later, obvious clues justify higher confidence
- Domain expertise should be reflected in your assessment
```
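
For the evaluator to apply the "early clues justify lower confidence" rule, it has to know how much of the question it has seen. One way to surface that, sketched below, is to state the revealed fraction explicitly in the Step B input; this assumes your runner exposes the partial question text and the full question's word count.

```python
def build_evaluator_input(partial_question: str, total_words: int, answer: str) -> str:
    # Tell the evaluator how far into the question the buzz point is,
    # so it can discount confidence on early, difficult clues.
    revealed = len(partial_question.split()) / max(total_words, 1)
    return (
        f"Question so far ({revealed:.0%} revealed): {partial_question}\n"
        f"Candidate answer: {answer}"
    )
```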
#### Validation

Test and verify:

- Are confidence scores better aligned with correctness?
- Does the justification include specific clues from the questions?
- Is confidence calibrated appropriately for question position?

#### Results

With the improved evaluator:

- More accurate confidence calibration
- Detailed justifications citing specific clues
- Better buzzing decisions

## Three-Step Pipeline with Analysis

### Concept

Adding a dedicated analysis step before answer generation:

1. **Step A: Question Analyzer**
   - Identifies key clues, entities, and relationships
   - Determines question category and format
2. **Step B: Answer Generator**
   - Uses the analysis to generate accurate answers
   - Focuses on formatting and precision
3. **Step C: Confidence Evaluator**
   - Assesses answer quality based on analysis and clues
   - Determines optimal buzz timing

### Implementation

Create this pipeline from scratch or modify the two-step example.
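
If you prefer to reason about it in code first, the extension to the earlier two-step sketch is one extra call whose output is threaded into both later steps (model choices and prompt wording illustrative):

```python
def run_three_step(question: str, call_model: ModelFn) -> PipelineResult:
    # Step A: structured analysis of clues, entities, and category
    analysis = call_model(
        "openai/gpt-4o-mini",
        "List the key clues, named entities, relationships, and likely "
        "category of this quizbowl question.",
        question,
    )
    # Step B: the generator sees the question plus the analysis
    answer = call_model(
        "openai/gpt-4o",
        "You are a quizbowl answer generator. Return only the answer.",
        f"Question: {question}\nAnalysis: {analysis}",
    )
    # Step C: the evaluator sees everything produced so far
    evaluation = call_model(
        "cohere/command-r-plus",
        "Rate the answer's correctness from 0 to 1 and justify it. "
        "Reply as: <score>|<justification>",
        f"Question: {question}\nAnalysis: {analysis}\nCandidate answer: {answer}",
    )
    score_text, _, justification = evaluation.partition("|")
    try:
        confidence = float(score_text.strip())
    except ValueError:
        confidence = 0.0
    return PipelineResult(answer.strip(), min(max(confidence, 0.0), 1.0), justification.strip())
```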
### Validation

Compare to the two-step pipeline:

- Does the analysis step improve answer accuracy?
- Does it provide better performance on difficult questions?
- Are there improvements in early buzzing?

### Results

Three-step pipelines typically achieve:

- Accuracy: ~90-95%
- Earlier correct buzzes
- Exceptional performance on difficult questions

## Specialty Pipeline: Literature Focus

### Concept

Create a pipeline specialized for literature questions:

1. **Step A: Literary Analyzer**
   - Identifies literary techniques, periods, and styles
   - Recognizes author-specific clues
2. **Step B: Answer Generator**
   - Specialized for literary works and authors
   - Formats answers according to literary conventions
3. **Step C: Confidence Evaluator**
   - Calibrated specifically for literature questions

### Implementation

Create specialized system prompts for each step focusing on literary knowledge.
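
One lightweight way to manage the specialization, sketched below, is to keep the three prompts together as a per-category set and swap the set in when assembling the pipeline (all wording illustrative):

```python
LITERATURE_PROMPTS = {
    "analyzer": (
        "Identify the literary period, techniques, character names, and "
        "author-specific clues in this quizbowl question."
    ),
    "generator": (
        "Answer with the work or author. Follow quizbowl convention: the "
        "title alone for works, the full name for authors."
    ),
    "evaluator": (
        "You evaluate literature answers. Translated titles, well-known "
        "epithets, and author surnames may all be acceptable; weigh that "
        "in your confidence score."
    ),
}
```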
### Validation

Test specifically on literature questions and compare to the general pipeline.

### Results

Specialty pipelines can achieve:

- 95%+ accuracy in their specialized domain
- Earlier buzzing on category-specific questions
- Better performance on difficult clues

## Best Practices for Advanced Pipelines

1. **Focused Components**: Each step should have a clear, single responsibility
2. **Efficient Communication**: Pass only necessary information between steps
3. **Strong Fundamentals**: Start with a solid two-step pipeline before adding complexity
4. **Consistent Testing**: Validate each change against the same test set (see the sketch below)
5. **Strategic Model Selection**: Use different models for tasks where they excel
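
To make practice 4 concrete, here is a minimal harness, assuming a `run` callable shaped like the earlier sketches and a fixed list of (question, gold answer) pairs; holding the test set constant is what makes before/after comparisons meaningful.

```python
def evaluate(run, test_set: list[tuple[str, str]]) -> float:
    """Fraction of questions answered correctly on a fixed test set.
    Exact-match scoring is a simplification: real quizbowl scoring
    accepts alternate answer forms."""
    correct = 0
    for question, gold in test_set:
        result = run(question)  # e.g. run_two_step bound to a real ModelFn
        correct += result.answer.lower() == gold.lower()
    return correct / len(test_set)
```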