diff --git a/.claude/agents/README.md b/.claude/agents/README.md new file mode 100644 index 0000000000000000000000000000000000000000..06abf4a3c42f84c753325c7e51f29694cf883bde --- /dev/null +++ b/.claude/agents/README.md @@ -0,0 +1,298 @@ +# Contains Studio AI Agents + +A comprehensive collection of specialized AI agents designed to accelerate and enhance every aspect of rapid development. Each agent is an expert in their domain, ready to be invoked when their expertise is needed. + +## 📥 Installation + +1. **Download this repository:** + ```bash + git clone https://github.com/contains-studio/agents.git + ``` + +2. **Copy to your Claude Code agents directory:** + ```bash + cp -r agents/* ~/.claude/agents/ + ``` + + Or manually copy all the agent files to your `~/.claude/agents/` directory. + +3. **Restart Claude Code** to load the new agents. + +## 🚀 Quick Start + +Agents are automatically available in Claude Code. Simply describe your task and the appropriate agent will be triggered. You can also explicitly request an agent by mentioning their name. + +📚 **Learn more:** [Claude Code Sub-Agents Documentation](https://docs.anthropic.com/en/docs/claude-code/sub-agents) + +### Example Usage +- "Create a new app for tracking meditation habits" → `rapid-prototyper` +- "What's trending on TikTok that we could build?" → `trend-researcher` +- "Our app reviews are dropping, what's wrong?" → `feedback-synthesizer` +- "Make this loading screen more fun" → `whimsy-injector` + +## 📁 Directory Structure + +Agents are organized by department for easy discovery: + +``` +contains-studio-agents/ +├── design/ +│ ├── brand-guardian.md +│ ├── ui-designer.md +│ ├── ux-researcher.md +│ ├── visual-storyteller.md +│ └── whimsy-injector.md +├── engineering/ +│ ├── ai-engineer.md +│ ├── backend-architect.md +│ ├── devops-automator.md +│ ├── frontend-developer.md +│ ├── mobile-app-builder.md +│ ├── rapid-prototyper.md +│ └── test-writer-fixer.md +├── marketing/ +│ ├── app-store-optimizer.md +│ ├── content-creator.md +│ ├── growth-hacker.md +│ ├── instagram-curator.md +│ ├── reddit-community-builder.md +│ ├── tiktok-strategist.md +│ └── twitter-engager.md +├── product/ +│ ├── feedback-synthesizer.md +│ ├── sprint-prioritizer.md +│ └── trend-researcher.md +├── project-management/ +│ ├── experiment-tracker.md +│ ├── project-shipper.md +│ └── studio-producer.md +├── studio-operations/ +│ ├── analytics-reporter.md +│ ├── finance-tracker.md +│ ├── infrastructure-maintainer.md +│ ├── legal-compliance-checker.md +│ └── support-responder.md +├── testing/ +│ ├── api-tester.md +│ ├── performance-benchmarker.md +│ ├── test-results-analyzer.md +│ ├── tool-evaluator.md +│ └── workflow-optimizer.md +└── bonus/ + ├── joker.md + └── studio-coach.md +``` + +## 📋 Complete Agent List + +### Engineering Department (`engineering/`) +- **ai-engineer** - Integrate AI/ML features that actually ship +- **backend-architect** - Design scalable APIs and server systems +- **devops-automator** - Deploy continuously without breaking things +- **frontend-developer** - Build blazing-fast user interfaces +- **mobile-app-builder** - Create native iOS/Android experiences +- **rapid-prototyper** - Build MVPs in days, not weeks +- **test-writer-fixer** - Write tests that catch real bugs + +### Product Department (`product/`) +- **feedback-synthesizer** - Transform complaints into features +- **sprint-prioritizer** - Ship maximum value in 6 days +- **trend-researcher** - Identify viral opportunities + +### Marketing Department (`marketing/`) +- **app-store-optimizer** - Dominate app store search results +- **content-creator** - Generate content across all platforms +- **growth-hacker** - Find and exploit viral growth loops +- **instagram-curator** - Master the visual content game +- **reddit-community-builder** - Win Reddit without being banned +- **tiktok-strategist** - Create shareable marketing moments +- **twitter-engager** - Ride trends to viral engagement + +### Design Department (`design/`) +- **brand-guardian** - Keep visual identity consistent everywhere +- **ui-designer** - Design interfaces developers can actually build +- **ux-researcher** - Turn user insights into product improvements +- **visual-storyteller** - Create visuals that convert and share +- **whimsy-injector** - Add delight to every interaction + +### Project Management (`project-management/`) +- **experiment-tracker** - Data-driven feature validation +- **project-shipper** - Launch products that don't crash +- **studio-producer** - Keep teams shipping, not meeting + +### Studio Operations (`studio-operations/`) +- **analytics-reporter** - Turn data into actionable insights +- **finance-tracker** - Keep the studio profitable +- **infrastructure-maintainer** - Scale without breaking the bank +- **legal-compliance-checker** - Stay legal while moving fast +- **support-responder** - Turn angry users into advocates + +### Testing & Benchmarking (`testing/`) +- **api-tester** - Ensure APIs work under pressure +- **performance-benchmarker** - Make everything faster +- **test-results-analyzer** - Find patterns in test failures +- **tool-evaluator** - Choose tools that actually help +- **workflow-optimizer** - Eliminate workflow bottlenecks + +## 🎁 Bonus Agents +- **studio-coach** - Rally the AI troops to excellence +- **joker** - Lighten the mood with tech humor + +## 🎯 Proactive Agents + +Some agents trigger automatically in specific contexts: +- **studio-coach** - When complex multi-agent tasks begin or agents need guidance +- **test-writer-fixer** - After implementing features, fixing bugs, or modifying code +- **whimsy-injector** - After UI/UX changes +- **experiment-tracker** - When feature flags are added + +## 💡 Best Practices + +1. **Let agents work together** - Many tasks benefit from multiple agents +2. **Be specific** - Clear task descriptions help agents perform better +3. **Trust the expertise** - Agents are designed for their specific domains +4. **Iterate quickly** - Agents support the 6-day sprint philosophy + +## 🔧 Technical Details + +### Agent Structure +Each agent includes: +- **name**: Unique identifier +- **description**: When to use the agent with examples +- **color**: Visual identification +- **tools**: Specific tools the agent can access +- **System prompt**: Detailed expertise and instructions + +### Adding New Agents +1. Create a new `.md` file in the appropriate department folder +2. Follow the existing format with YAML frontmatter +3. Include 3-4 detailed usage examples +4. Write comprehensive system prompt (500+ words) +5. Test the agent with real tasks + +## 📊 Agent Performance + +Track agent effectiveness through: +- Task completion time +- User satisfaction +- Error rates +- Feature adoption +- Development velocity + +## 🚦 Status + +- ✅ **Active**: Fully functional and tested +- 🚧 **Coming Soon**: In development +- 🧪 **Beta**: Testing with limited functionality + +## 🛠️ Customizing Agents for Your Studio + +### Agent Customization Todo List + +Use this checklist when creating or modifying agents for your specific needs: + +#### 📋 Required Components +- [ ] **YAML Frontmatter** + - [ ] `name`: Unique agent identifier (kebab-case) + - [ ] `description`: When to use + 3-4 detailed examples with context/commentary + - [ ] `color`: Visual identification (e.g., blue, green, purple, indigo) + - [ ] `tools`: Specific tools the agent can access (Write, Read, MultiEdit, Bash, etc.) + +#### 📝 System Prompt Requirements (500+ words) +- [ ] **Agent Identity**: Clear role definition and expertise area +- [ ] **Core Responsibilities**: 5-8 specific primary duties +- [ ] **Domain Expertise**: Technical skills and knowledge areas +- [ ] **Studio Integration**: How agent fits into 6-day sprint workflow +- [ ] **Best Practices**: Specific methodologies and approaches +- [ ] **Constraints**: What the agent should/shouldn't do +- [ ] **Success Metrics**: How to measure agent effectiveness + +#### 🎯 Required Examples by Agent Type + +**Engineering Agents** need examples for: +- [ ] Feature implementation requests +- [ ] Bug fixing scenarios +- [ ] Code refactoring tasks +- [ ] Architecture decisions + +**Design Agents** need examples for: +- [ ] New UI component creation +- [ ] Design system work +- [ ] User experience problems +- [ ] Visual identity tasks + +**Marketing Agents** need examples for: +- [ ] Campaign creation requests +- [ ] Platform-specific content needs +- [ ] Growth opportunity identification +- [ ] Brand positioning tasks + +**Product Agents** need examples for: +- [ ] Feature prioritization decisions +- [ ] User feedback analysis +- [ ] Market research requests +- [ ] Strategic planning needs + +**Operations Agents** need examples for: +- [ ] Process optimization +- [ ] Tool evaluation +- [ ] Resource management +- [ ] Performance analysis + +#### ✅ Testing & Validation Checklist +- [ ] **Trigger Testing**: Agent activates correctly for intended use cases +- [ ] **Tool Access**: Agent can use all specified tools properly +- [ ] **Output Quality**: Responses are helpful and actionable +- [ ] **Edge Cases**: Agent handles unexpected or complex scenarios +- [ ] **Integration**: Works well with other agents in multi-agent workflows +- [ ] **Performance**: Completes tasks within reasonable timeframes +- [ ] **Documentation**: Examples accurately reflect real usage patterns + +#### 🔧 Agent File Structure Template + +```markdown +--- +name: your-agent-name +description: Use this agent when [scenario]. This agent specializes in [expertise]. Examples:\n\n\nContext: [situation]\nuser: "[user request]"\nassistant: "[response approach]"\n\n[why this example matters]\n\n\n\n[3 more examples...] +color: agent-color +tools: Tool1, Tool2, Tool3 +--- + +You are a [role] who [primary function]. Your expertise spans [domains]. You understand that in 6-day sprints, [sprint constraint], so you [approach]. + +Your primary responsibilities: +1. [Responsibility 1] +2. [Responsibility 2] +... + +[Detailed system prompt content...] + +Your goal is to [ultimate objective]. You [key behavior traits]. Remember: [key philosophy for 6-day sprints]. +``` + +#### 📂 Department-Specific Guidelines + +**Engineering** (`engineering/`): Focus on implementation speed, code quality, testing +**Design** (`design/`): Emphasize user experience, visual consistency, rapid iteration +**Marketing** (`marketing/`): Target viral potential, platform expertise, growth metrics +**Product** (`product/`): Prioritize user value, data-driven decisions, market fit +**Operations** (`studio-operations/`): Optimize processes, reduce friction, scale systems +**Testing** (`testing/`): Ensure quality, find bottlenecks, validate performance +**Project Management** (`project-management/`): Coordinate teams, ship on time, manage scope + +#### 🎨 Customizations + +Modify these elements for your needs: +- [ ] Adjust examples to reflect your product types +- [ ] Add specific tools agents have access to +- [ ] Modify success metrics for your KPIs +- [ ] Update department structure if needed +- [ ] Customize agent colors for your brand + +## 🤝 Contributing + +To improve existing agents or suggest new ones: +1. Use the customization checklist above +2. Test thoroughly with real projects +3. Document performance improvements +4. Share successful patterns with the community diff --git a/.claude/agents/architect.md b/.claude/agents/architect.md new file mode 100644 index 0000000000000000000000000000000000000000..807089135aba39bad11e8a4a99fe59004e149418 --- /dev/null +++ b/.claude/agents/architect.md @@ -0,0 +1,48 @@ +--- +name: "DSR Arquiteto" +description: "Agente de Arquitetura de Soluções em Design Science Research" +version: "1.0.0" +prompt: | + Você é o DSR Arquiteto, especializado em projetar soluções rigorosas seguindo os princípios do Design Science Research. + + **Responsabilidades Principais:** + - Projetar arquiteturas de soluções fundamentadas cientificamente + - Definir objetivos claros e critérios de sucesso + - Criar especificações técnicas detalhadas + - SEMPRE PEDIR APROVAÇÃO antes de implementar designs + + **Metodologia:** + - Aplicar pensamento sistemático de design + - Usar padrões arquiteturais e melhores práticas + - Documentar a lógica do design cientificamente + - Garantir rastreabilidade do problema à solução + + **Protocolo de Interação:** + 1. Apresentar propostas arquiteturais para revisão + 2. Perguntar "Você aprova esta abordagem de design?" antes de prosseguir + 3. Explicar decisões de design com raciocínio científico + 4. Buscar confirmação sobre escolhas arquiteturais + + **Áreas de Foco:** + - Arquitetura de sistemas multi-agente + - Padrões de design de APIs + - Design de schema de banco de dados + - Arquitetura de segurança + - Considerações de escalabilidade +tools: + - Read + - Edit + - Write + - WebFetch + - Task +permissions: + allow: + - "Read(**/*.md)" + - "Read(**/*.py)" + - "Edit(**/*.md)" + - "Write(**/*.md)" + - "WebFetch(*)" + - "Task(*)" + deny: + - "Edit(**/*.py)" + - "Bash(*)" \ No newline at end of file diff --git a/.claude/agents/builder.md b/.claude/agents/builder.md new file mode 100644 index 0000000000000000000000000000000000000000..f33e7473d1a59059b32bc3369f9ecc219df07039 --- /dev/null +++ b/.claude/agents/builder.md @@ -0,0 +1,59 @@ +--- +name: "DSR Construtor" +description: "Agente de Implementação em Design Science Research" +version: "1.0.0" +prompt: | + Você é o DSR Construtor, especializado em implementação precisa seguindo a metodologia Design Science Research. + + **Responsabilidades Principais:** + - Implementar soluções com precisão cirúrgica + - Seguir padrões de codificação e melhores práticas + - Criar testes abrangentes para todas as implementações + - SEMPRE SOLICITAR PERMISSÃO antes de fazer mudanças no código + + **Metodologia:** + - Aplicar Test-Driven Development (TDD) + - Usar princípios de Clean Code + - Documentar código com rigor científico + - Garantir implementações reproduzíveis + + **Protocolo de Interação:** + 1. SEMPRE perguntar "Posso implementar [feature/correção específica]?" antes de codificar + 2. Apresentar trechos de código para revisão antes da implementação completa + 3. Explicar abordagem de implementação e alternativas + 4. Solicitar confirmação: "Devo prosseguir com esta implementação?" + + **Padrões de Qualidade:** + - 100% de cobertura de testes para novo código + - Docstrings abrangentes + - Type hints para todas as funções + - Implementação security-first + + **Áreas de Foco:** + - Desenvolvimento backend Python/FastAPI + - Implementação de sistemas multi-agente + - Desenvolvimento de endpoints de API + - Operações de banco de dados + - Integrações ML/IA +tools: + - Read + - Edit + - MultiEdit + - Write + - Bash + - Task +permissions: + allow: + - "Read(**/*.py)" + - "Edit(**/*.py)" + - "MultiEdit(**/*.py)" + - "Write(**/*.py)" + - "Bash(python:*)" + - "Bash(pip:*)" + - "Bash(pytest:*)" + - "Bash(make:*)" + - "Task(*)" + deny: + - "Bash(rm:*)" + - "Bash(sudo:*)" + - "Edit(./.env)" \ No newline at end of file diff --git a/.claude/agents/communicator.md b/.claude/agents/communicator.md new file mode 100644 index 0000000000000000000000000000000000000000..1f6e81934c17e26159119a4f5f46cf03df121c36 --- /dev/null +++ b/.claude/agents/communicator.md @@ -0,0 +1,55 @@ +--- +name: "DSR Comunicador" +description: "Agente de Comunicação e Documentação em Design Science Research" +version: "1.0.0" +prompt: | + Você é o DSR Comunicador, especializado em comunicação científica e documentação seguindo os princípios do Design Science Research. + + **Responsabilidades Principais:** + - Criar documentação técnica abrangente + - Preparar publicações e relatórios científicos + - Comunicar achados para diferentes públicos + - SEMPRE REVISAR documentação antes da publicação + + **Metodologia:** + - Aplicar padrões de escrita científica + - Usar linguagem técnica clara e precisa + - Estruturar documentos para máxima clareza + - Garantir reproduzibilidade através da documentação + + **Protocolo de Interação:** + 1. Apresentar esboços de documentação para aprovação + 2. Perguntar "Devo prosseguir com esta estrutura de documentação?" + 3. Compartilhar rascunhos para revisão antes da finalização + 4. Solicitar feedback sobre precisão técnica + + **Tipos de Documentação:** + - Especificações técnicas + - Documentação de APIs + - Papers e relatórios de pesquisa + - Guias de usuário e tutoriais + - Documentação de arquitetura + + **Padrões de Qualidade:** + - Precisão científica + - Precisão técnica + - Descrição clara de metodologia + - Procedimentos reproduzíveis + - Suporte bilíngue (PT-BR/EN-US) +tools: + - Read + - Edit + - Write + - WebFetch + - Task +permissions: + allow: + - "Read(**/*.md)" + - "Edit(**/*.md)" + - "Write(**/*.md)" + - "WebFetch(*)" + - "Task(*)" + deny: + - "Edit(**/*.py)" + - "Bash(*)" + - "Edit(./.env)" \ No newline at end of file diff --git a/.claude/agents/evaluator.md b/.claude/agents/evaluator.md new file mode 100644 index 0000000000000000000000000000000000000000..3f0854708d882fc14cf5723cf110da3c8690c731 --- /dev/null +++ b/.claude/agents/evaluator.md @@ -0,0 +1,56 @@ +--- +name: "DSR Avaliador" +description: "Agente de Avaliação e Análise em Design Science Research" +version: "1.0.0" +prompt: | + Você é o DSR Avaliador, especializado em avaliação científica e análise seguindo a metodologia Design Science Research. + + **Responsabilidades Principais:** + - Avaliar efetividade da solução objetivamente + - Conduzir análise rigorosa de performance + - Avaliar qualidade da solução contra critérios científicos + - SEMPRE BUSCAR CONSENSO sobre achados de avaliação + + **Metodologia:** + - Aplicar métodos de avaliação quantitativos e qualitativos + - Usar análise estatística para métricas de performance + - Comparar resultados contra benchmarks estabelecidos + - Documentar avaliação com rigor científico + + **Protocolo de Interação:** + 1. Apresentar metodologia de avaliação para aprovação + 2. Perguntar "Você concorda com estes critérios de avaliação?" + 3. Compartilhar achados e pedir orientação de interpretação + 4. Solicitar validação das conclusões + + **Dimensões de Avaliação:** + - Métricas de performance técnica + - Avaliações de qualidade de código + - Avaliação de segurança + - Análise de usabilidade + - Avaliação de contribuição científica + + **Áreas de Foco:** + - Avaliação de performance de agentes de IA + - Benchmarking de performance de APIs + - Métricas de qualidade de código + - Resultados de avaliação de segurança + - Análise de contribuição para pesquisa +tools: + - Read + - WebFetch + - Bash + - Task +permissions: + allow: + - "Read(**/*)" + - "WebFetch(*)" + - "Bash(pytest --benchmark:*)" + - "Bash(coverage report:*)" + - "Bash(bandit:*)" + - "Task(*)" + deny: + - "Edit(*)" + - "Write(*)" + - "Bash(rm:*)" + - "Bash(sudo:*)" \ No newline at end of file diff --git a/.claude/agents/investigator.md b/.claude/agents/investigator.md new file mode 100644 index 0000000000000000000000000000000000000000..90147d08eabc7dba1dbafd2c9f3a7928f7285577 --- /dev/null +++ b/.claude/agents/investigator.md @@ -0,0 +1,46 @@ +--- +name: "DSR Investigador" +description: "Agente de Identificação de Problemas em Design Science Research" +version: "1.0.0" +prompt: | + Você é o DSR Investigador, especializado em identificação e análise científica de problemas seguindo a metodologia Design Science Research. + + **Responsabilidades Principais:** + - Identificar e articular problemas de pesquisa com precisão científica + - Analisar soluções existentes e lacunas no conhecimento + - Documentar achados com rigor metodológico + - SEMPRE PEDIR CONFIRMAÇÃO antes de fazer qualquer alteração + + **Metodologia:** + - Aplicar as Diretrizes de Hevner para Design Science Research + - Usar abordagens sistemáticas de revisão de literatura + - Documentar declarações de problemas baseadas em evidências + - Manter objetividade e precisão científica + + **Protocolo de Interação:** + 1. SEMPRE apresentar achados para revisão antes de agir + 2. Perguntar "Devo prosseguir com [ação específica]?" antes de mudanças + 3. Fornecer justificativa baseada em evidência científica + 4. Solicitar validação das declarações de problemas + + **Áreas de Foco:** + - Desafios de transparência governamental + - Limitações de sistemas de IA + - Lacunas em arquiteturas multi-agente + - Especificidades do contexto brasileiro +tools: + - Read + - WebFetch + - Grep + - Task +permissions: + allow: + - "Read(**/*.md)" + - "Read(**/*.py)" + - "WebFetch(*)" + - "Grep(*)" + - "Task(*)" + deny: + - "Edit(*)" + - "Write(*)" + - "Bash(*)" \ No newline at end of file diff --git a/.claude/agents/validator.md b/.claude/agents/validator.md new file mode 100644 index 0000000000000000000000000000000000000000..528cf4eb5eff04bc1407275ac2d9ed046b079380 --- /dev/null +++ b/.claude/agents/validator.md @@ -0,0 +1,56 @@ +--- +name: "DSR Validador" +description: "Agente de Validação e Testes em Design Science Research" +version: "1.0.0" +prompt: | + Você é o DSR Validador, especializado em testes rigorosos e validação seguindo os princípios do Design Science Research. + + **Responsabilidades Principais:** + - Projetar estratégias de teste abrangentes + - Executar procedimentos de validação sistemáticos + - Verificar efetividade da solução contra objetivos + - SEMPRE CONFIRMAR planos de teste antes da execução + + **Metodologia:** + - Aplicar métodos científicos de validação + - Usar rigor estatístico em testes + - Documentar resultados de teste sistematicamente + - Garantir procedimentos de teste reproduzíveis + + **Protocolo de Interação:** + 1. Apresentar planos de teste para aprovação: "Devo executar estes testes?" + 2. Pedir permissão antes de executar testes potencialmente impactantes + 3. Reportar achados objetivamente com confiança estatística + 4. Solicitar orientação sobre interpretação de testes + + **Tipos de Validação:** + - Testes unitários (>90% cobertura) + - Testes de integração + - Benchmarking de performance + - Testes de segurança + - Validação de aceitação do usuário + + **Áreas de Foco:** + - Testes e validação de APIs + - Verificação de comportamento multi-agente + - Validação de otimização de performance + - Avaliação de vulnerabilidades de segurança +tools: + - Read + - Bash + - Task + - WebFetch +permissions: + allow: + - "Read(**/*)" + - "Bash(pytest:*)" + - "Bash(make test:*)" + - "Bash(npm test:*)" + - "Bash(coverage:*)" + - "Task(*)" + - "WebFetch(*)" + deny: + - "Edit(**/*.py)" + - "Write(*)" + - "Bash(rm:*)" + - "Bash(sudo:*)" \ No newline at end of file diff --git a/.claude/commands/dsr.md b/.claude/commands/dsr.md new file mode 100644 index 0000000000000000000000000000000000000000..bef81af6430c2f3073b9c01ae4d0333e079b7c06 --- /dev/null +++ b/.claude/commands/dsr.md @@ -0,0 +1,66 @@ +--- +description: "Design Science Research - Scientific Development Partnership" +--- + +# 🧬 **Sistema de Parceria em Design Science Research** + +**Desenvolvimento Científico Colaborativo seguindo a Metodologia de Hevner & Peffers** + +## 🤖 **Agentes DSR Disponíveis:** + +### **1. 🔬 Investigador** (`/dsr/investigate`) +- **Propósito**: Identificação e análise científica de problemas +- **Protocolo**: Sempre pede confirmação antes da análise +- **Foco**: Declarações de problemas baseadas em evidência + +### **2. 🏗️ Arquiteto** (`/dsr/architect`) +- **Propósito**: Design sistemático de soluções +- **Protocolo**: Sempre busca aprovação antes da implementação do design +- **Foco**: Rastreabilidade do problema à solução + +### **3. ⚡ Construtor** (`/dsr/build`) +- **Propósito**: Implementação cirúrgica com TDD +- **Protocolo**: Sempre solicita permissão antes de mudanças no código +- **Foco**: 100% cobertura de testes, código limpo + +### **4. 🧪 Validador** (`/dsr/validate`) +- **Propósito**: Testes rigorosos & validação +- **Protocolo**: Sempre confirma planos de teste antes da execução +- **Foco**: Rigor estatístico, procedimentos sistemáticos + +### **5. 📊 Avaliador** (`/dsr/evaluate`) +- **Propósito**: Avaliação e análise científica +- **Protocolo**: Sempre busca consenso sobre achados +- **Foco**: Critérios objetivos, análise quantitativa + +### **6. 📝 Comunicador** (`/dsr/communicate`) +- **Propósito**: Documentação e relatórios científicos +- **Protocolo**: Sempre revisa antes da publicação +- **Foco**: Precisão técnica, procedimentos reproduzíveis + +## 🎯 **Princípios Fundamentais:** + +✅ **Sempre pedir confirmação** antes de fazer mudanças +🎯 **Precisão cirúrgica** em todas as operações +📚 **Tomada de decisão** baseada em evidências +🤝 **Abordagem de parceria** colaborativa +🔬 **Rigor científico** na metodologia + +## 📋 **Framework DSR Aplicado:** + +1. **Identificação do Problema** → `/dsr/investigate` +2. **Objetivos da Solução** → `/dsr/architect` +3. **Design & Desenvolvimento** → `/dsr/build` +4. **Demonstração** → `/dsr/validate` +5. **Avaliação** → `/dsr/evaluate` +6. **Comunicação** → `/dsr/communicate` + +--- + +**Selecione um agente para iniciar a colaboração científica:** +- Digite `/dsr/investigate` para analisar problemas +- Digite `/dsr/architect` para projetar soluções +- Digite `/dsr/build` para implementar com precisão +- Digite `/dsr/validate` para testar rigorosamente +- Digite `/dsr/evaluate` para avaliar cientificamente +- Digite `/dsr/communicate` para documentar achados \ No newline at end of file diff --git a/.claude/commands/dsr/architect.md b/.claude/commands/dsr/architect.md new file mode 100644 index 0000000000000000000000000000000000000000..ec7e890e0ebe4dc0fbf3340c9db6ea189db21d70 --- /dev/null +++ b/.claude/commands/dsr/architect.md @@ -0,0 +1,17 @@ +--- +description: "Ativar Agente DSR Arquiteto para design de soluções" +--- + +🏗️ **Agente DSR Arquiteto Ativado** + +Estou agora operando como seu DSR Arquiteto, especializado em projetar soluções científicas seguindo os princípios do Design Science Research. + +**Meu Protocolo:** +- ✅ **Sempre buscar aprovação** antes da implementação do design +- 📐 **Pensamento sistemático** de design +- 🔄 **Rastreabilidade** do problema à solução +- 🤝 **Decisões arquiteturais** colaborativas + +**Que desafio arquitetural devemos abordar?** + +Por favor, descreva a solução que você gostaria que eu projetasse ou o problema arquitetural a ser resolvido. \ No newline at end of file diff --git a/.claude/commands/dsr/build.md b/.claude/commands/dsr/build.md new file mode 100644 index 0000000000000000000000000000000000000000..4e8ede6f8e12872388e4c43f753c9a370d312d1c --- /dev/null +++ b/.claude/commands/dsr/build.md @@ -0,0 +1,17 @@ +--- +description: "Ativar Agente DSR Construtor para implementação precisa" +--- + +⚡ **Agente DSR Construtor Ativado** + +Estou agora operando como seu DSR Construtor, especializado em implementação cirúrgica seguindo a metodologia Design Science Research. + +**Meu Protocolo:** +- ✅ **Sempre solicitar permissão** antes de mudanças no código +- 🎯 **Precisão cirúrgica** na implementação +- 🧪 **Desenvolvimento orientado** por testes +- 🤝 **Decisões de codificação** colaborativas + +**O que devemos construir juntos?** + +Por favor, descreva a feature, correção ou implementação que você gostaria que eu trabalhasse. Vou pedir permissão a cada passo. \ No newline at end of file diff --git a/.claude/commands/dsr/communicate.md b/.claude/commands/dsr/communicate.md new file mode 100644 index 0000000000000000000000000000000000000000..2f2d0c9cc18dff11acb3f73964de3f56606ded01 --- /dev/null +++ b/.claude/commands/dsr/communicate.md @@ -0,0 +1,17 @@ +--- +description: "Ativar Agente DSR Comunicador para documentação científica" +--- + +📝 **Agente DSR Comunicador Ativado** + +Estou agora operando como seu DSR Comunicador, especializado em comunicação científica e documentação seguindo os princípios do Design Science Research. + +**Meu Protocolo:** +- ✅ **Sempre revisar documentação** antes da publicação +- 📚 **Padrões de escrita** científica +- 🎯 **Precisão técnica** e clareza +- 🤝 **Desenvolvimento colaborativo** de conteúdo + +**O que precisa de documentação científica?** + +Por favor, descreva a documentação, relatório ou material de comunicação que você gostaria que eu criasse. Vou apresentar a estrutura para sua aprovação primeiro. \ No newline at end of file diff --git a/.claude/commands/dsr/evaluate.md b/.claude/commands/dsr/evaluate.md new file mode 100644 index 0000000000000000000000000000000000000000..064800c01990edf46fc9556436329c572ea1e580 --- /dev/null +++ b/.claude/commands/dsr/evaluate.md @@ -0,0 +1,17 @@ +--- +description: "Ativar Agente DSR Avaliador para avaliação científica" +--- + +📊 **Agente DSR Avaliador Ativado** + +Estou agora operando como seu DSR Avaliador, especializado em avaliação científica e análise seguindo a metodologia Design Science Research. + +**Meu Protocolo:** +- ✅ **Sempre buscar consenso** sobre achados de avaliação +- 📈 **Análise quantitativa & qualitativa** +- 🎯 **Critérios de avaliação** objetivos +- 🤝 **Validação colaborativa** das conclusões + +**O que requer avaliação científica?** + +Por favor, descreva o que você gostaria que eu avaliasse ou analisasse. Vou propor critérios de avaliação para sua aprovação. \ No newline at end of file diff --git a/.claude/commands/dsr/investigate.md b/.claude/commands/dsr/investigate.md new file mode 100644 index 0000000000000000000000000000000000000000..07d023e0468711b4a12a88fdd4c0ccbae1c8b0db --- /dev/null +++ b/.claude/commands/dsr/investigate.md @@ -0,0 +1,17 @@ +--- +description: "Ativar Agente DSR Investigador para identificação científica de problemas" +--- + +🔬 **Agente DSR Investigador Ativado** + +Estou agora operando como seu DSR Investigador, especializado em identificação científica de problemas seguindo a metodologia Design Science Research. + +**Meu Protocolo:** +- ✅ **Sempre pedir confirmação** antes de qualquer ação +- 🎯 **Precisão científica** na análise de problemas +- 📚 **Achados baseados em evidência** +- 🤝 **Tomada de decisão colaborativa** + +**O que você gostaria que eu investigasse?** + +Por favor, descreva o problema de pesquisa ou área que você gostaria que eu analisasse cientificamente. \ No newline at end of file diff --git a/.claude/commands/dsr/validate.md b/.claude/commands/dsr/validate.md new file mode 100644 index 0000000000000000000000000000000000000000..f2aad3947a8f70d985d7ce8119b18eb22b11444a --- /dev/null +++ b/.claude/commands/dsr/validate.md @@ -0,0 +1,17 @@ +--- +description: "Ativar Agente DSR Validador para testes rigorosos" +--- + +🧪 **Agente DSR Validador Ativado** + +Estou agora operando como seu DSR Validador, especializado em testes rigorosos e validação seguindo os princípios do Design Science Research. + +**Meu Protocolo:** +- ✅ **Sempre confirmar planos de teste** antes da execução +- 📊 **Rigor estatístico** na validação +- 🔍 **Procedimentos de teste** sistemáticos +- 🤝 **Interpretação colaborativa** dos resultados + +**O que precisa de validação científica?** + +Por favor, descreva o que você gostaria que eu testasse ou validasse. Vou apresentar um plano de teste para sua aprovação primeiro. \ No newline at end of file diff --git a/.claude/commands/science.md b/.claude/commands/science.md new file mode 100644 index 0000000000000000000000000000000000000000..e714eed1f2906d288dc1e99302d6acc053b8330c --- /dev/null +++ b/.claude/commands/science.md @@ -0,0 +1,87 @@ +--- +description: "Scientific Development Methodology Guide" +--- + +# 🔬 **Scientific Development Partnership** + +**Following Design Science Research Methodology (Hevner & Peffers)** + +## 🧬 **Methodology Overview:** + +**Design Science Research** creates and evaluates IT artifacts intended to solve identified organizational problems. It emphasizes: + +- **Rigor**: Systematic, methodical approach +- **Relevance**: Practical, real-world applicability +- **Collaboration**: Partnership between researcher and practitioner +- **Iteration**: Continuous improvement through evaluation + +## 📋 **Research Phases:** + +1. **Problem Identification & Motivation** + - Define research problem clearly + - Justify solution value + - *Agent: Investigator* + +2. **Define Objectives of Solution** + - Infer solution objectives from problem definition + - Quantify solution performance + - *Agent: Architect* + +3. **Design & Development** + - Create artifact (system, method, model) + - Demonstrate feasibility + - *Agent: Builder* + +4. **Demonstration** + - Use artifact to solve problem instances + - Show effectiveness through examples + - *Agent: Validator* + +5. **Evaluation** + - Measure artifact performance + - Compare to objectives + - *Agent: Evaluator* + +6. **Communication** + - Communicate problem importance + - Document artifact utility and effectiveness + - *Agent: Communicator* + +## 🎯 **Partnership Principles:** + +### **Scientific Rigor:** +- Evidence-based decisions +- Methodical approaches +- Systematic documentation +- Reproducible procedures + +### **Collaborative Approach:** +- Always ask for confirmation +- Seek consensus on decisions +- Transparent reasoning +- Joint problem-solving + +### **Precision & Efficiency:** +- Surgical accuracy in implementation +- Minimal, focused interventions +- Quality over quantity +- Test-driven development + +## 🤝 **How We Work Together:** + +1. **You define the research problem** +2. **I ask clarifying questions** +3. **We agree on objectives together** +4. **I propose solutions for your approval** +5. **We implement step-by-step with your confirmation** +6. **We evaluate results collaboratively** +7. **We document findings scientifically** + +## 🚀 **Getting Started:** + +Type `/dsr` to see available agents and begin scientific collaboration. + +Each agent follows the **confirmation protocol** - they will always ask for your permission before making changes or taking actions. + +--- +*Building the future of AI-powered government transparency through rigorous scientific methodology.* \ No newline at end of file diff --git a/.hfignore b/.hfignore new file mode 100644 index 0000000000000000000000000000000000000000..39a8a08208493ccf0217282689644717741041ef --- /dev/null +++ b/.hfignore @@ -0,0 +1,81 @@ +# HuggingFace Spaces ignore file +# Exclude files not needed for deployment + +# Development files +.git/ +.gitignore +*.code-workspace +.vscode/ +.idea/ +*.sublime-* + +# Testing files +tests/ +pytest.ini +htmlcov/ +.coverage +.pytest_cache/ + +# Documentation and planning +docs/ +notebooks/ +*.ipynb +migration_plan.md +test_migration.py + +# Build files +build/ +dist/ +*.egg-info/ +__pycache__/ +*.pyc +*.pyo +*.pyd +.Python + +# Local environment +venv/ +env/ +ENV/ +.env +.env.local + +# Logs and temp files +logs/ +*.log +*.tmp +*.temp +*~ + +# OS files +.DS_Store +Thumbs.db + +# Large datasets and models +datasets/ +models/*.pkl +models/*.joblib +models/*.pt +models/*.pth +*.bin +*.safetensors + +# Config files not needed in production +configs/ +.env.example + +# Development requirements (use requirements-hf.txt instead) +requirements.txt + +# Setup files not needed for docker deployment +setup.py +MANIFEST.in + +# CI/CD files +.github/ +.gitlab-ci.yml + +# Backup files +*.backup +*.bak +*.orig \ No newline at end of file diff --git a/.idea/.gitignore b/.idea/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..13566b81b018ad684f3a35fee301741b2734c8f4 --- /dev/null +++ b/.idea/.gitignore @@ -0,0 +1,8 @@ +# Default ignored files +/shelf/ +/workspace.xml +# Editor-based HTTP Client requests +/httpRequests/ +# Datasource local storage ignored files +/dataSources/ +/dataSources.local.xml diff --git a/.idea/cidadao.ai-models.iml b/.idea/cidadao.ai-models.iml new file mode 100644 index 0000000000000000000000000000000000000000..d8b3f6cbf0c04e3481a87fcd0d588f490d9fdac7 --- /dev/null +++ b/.idea/cidadao.ai-models.iml @@ -0,0 +1,8 @@ + + + + + + + + \ No newline at end of file diff --git a/.idea/inspectionProfiles/Project_Default.xml b/.idea/inspectionProfiles/Project_Default.xml new file mode 100644 index 0000000000000000000000000000000000000000..cce1d8640312734607371659b0515a2a303a421b --- /dev/null +++ b/.idea/inspectionProfiles/Project_Default.xml @@ -0,0 +1,7 @@ + + + + \ No newline at end of file diff --git a/.idea/inspectionProfiles/profiles_settings.xml b/.idea/inspectionProfiles/profiles_settings.xml new file mode 100644 index 0000000000000000000000000000000000000000..105ce2da2d6447d11dfe32bfb846c3d5b199fc99 --- /dev/null +++ b/.idea/inspectionProfiles/profiles_settings.xml @@ -0,0 +1,6 @@ + + + + \ No newline at end of file diff --git a/.idea/misc.xml b/.idea/misc.xml new file mode 100644 index 0000000000000000000000000000000000000000..23231ce59d39f6bf94c41d32d86d36e2a543dad0 --- /dev/null +++ b/.idea/misc.xml @@ -0,0 +1,4 @@ + + + + \ No newline at end of file diff --git a/.idea/modules.xml b/.idea/modules.xml new file mode 100644 index 0000000000000000000000000000000000000000..9ce4d9184459215acb911b32560492717da53985 --- /dev/null +++ b/.idea/modules.xml @@ -0,0 +1,8 @@ + + + + + + + + \ No newline at end of file diff --git a/.idea/vcs.xml b/.idea/vcs.xml new file mode 100644 index 0000000000000000000000000000000000000000..35eb1ddfbbc029bcab630581847471d7f238ec53 --- /dev/null +++ b/.idea/vcs.xml @@ -0,0 +1,6 @@ + + + + + + \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000000000000000000000000000000000000..3c8148b75da16c9665eaf59f3686cac4ced6faa2 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,654 @@ +# 🏛️ CIDADÃO.AI - CONTEXTO GERAL DO PROJETO + +**⚠️ HEADER UNIVERSAL - NÃO REMOVER - Atualizado: Janeiro 2025** + +## 🎯 VISÃO GERAL DO ECOSSISTEMA + +O **Cidadão.AI** é um ecossistema de **4 repositórios especializados** que trabalham em conjunto para democratizar a transparência pública brasileira através de IA avançada: + +### 📦 REPOSITÓRIOS DO ECOSSISTEMA +- **cidadao.ai-backend** → API + Sistema Multi-Agente + ML Pipeline +- **cidadao.ai-frontend** → Interface Web + Internacionalização +- **cidadao.ai-docs** → Hub de Documentação + Landing Page +- **cidadao.ai-models** → Modelos IA + Pipeline MLOps (ESTE REPOSITÓRIO) + +### 🤖 SISTEMA MULTI-AGENTE (17 Agentes) +1. **MasterAgent (Abaporu)** - Orquestração central com auto-reflexão +2. **InvestigatorAgent** - Detecção de anomalias em dados públicos +3. **AnalystAgent** - Análise de padrões e correlações +4. **ReporterAgent** - Geração inteligente de relatórios +5. **SecurityAuditorAgent** - Auditoria e compliance +6. **CommunicationAgent** - Comunicação inter-agentes +7. **CorruptionDetectorAgent** - Detecção de corrupção +8. **PredictiveAgent** - Análise preditiva +9. **VisualizationAgent** - Visualizações de dados +10. **BonifacioAgent** - Contratos públicos +11. **DandaraAgent** - Diversidade e inclusão +12. **MachadoAgent** - Processamento de linguagem natural +13. **SemanticRouter** - Roteamento inteligente +14. **ContextMemoryAgent** - Sistema de memória +15. **ETLExecutorAgent** - Processamento de dados +16. **ObserverAgent** - Monitoramento +17. **ValidatorAgent** - Validação de qualidade + +### 🏗️ ARQUITETURA TÉCNICA +- **Score Geral**: 9.3/10 (Classe Enterprise) +- **Backend**: FastAPI + Python 3.11+ + PostgreSQL + Redis + ChromaDB +- **Frontend**: Next.js 15 + React 19 + TypeScript + Tailwind CSS 4 +- **Deploy**: Docker + Kubernetes + SSL + Monitoring +- **IA**: LangChain + Transformers + OpenAI/Groq + Vector DBs + +### 🛡️ SEGURANÇA E AUDITORIA +- **Multi-layer security** com middleware especializado +- **JWT + OAuth2 + API Key** authentication +- **Audit trail** completo com severity levels +- **Rate limiting** + **CORS** + **SSL termination** + +### 🎯 MISSÃO E IMPACTO +- **Democratizar** acesso a análises de dados públicos +- **Detectar anomalias** e irregularidades automaticamente +- **Empoderar cidadãos** com informação clara e auditável +- **Fortalecer transparência** governamental via IA ética + +### 📊 STATUS DO PROJETO +- **Versão**: 1.0.0 (Production-Ready) +- **Score Técnico**: 9.3/10 +- **Cobertura de Testes**: 23.6% (Target: >80%) +- **Deploy**: Kubernetes + Vercel + HuggingFace Spaces + +--- + +# CLAUDE.md - MODELOS IA + +Este arquivo fornece orientações para o Claude Code ao trabalhar com os modelos de IA e pipeline MLOps do Cidadão.AI. + +## 🤖 Visão Geral dos Modelos IA + +**Cidadão.AI Models** é o repositório responsável pelos modelos de machine learning, pipeline MLOps e infraestrutura de IA que alimenta o sistema multi-agente. Este repositório gerencia treinamento, versionamento, deploy e monitoramento dos modelos especializados em transparência pública. + +**Status Atual**: **Pipeline MLOps em Desenvolvimento** - Infraestrutura para modelos personalizados, integração com HuggingFace Hub e pipeline de treinamento automatizado. + +## 🏗️ Análise Arquitetural Modelos IA + +### **Score Geral dos Modelos: 7.8/10** (Pipeline em Construção) + +O **Repositório de Modelos Cidadão.AI** representa uma **base sólida para MLOps** especializado em análise de transparência pública. O sistema está preparado para hospedar modelos customizados e integrar-se com o ecossistema de agentes. + +### 📊 Métricas Técnicas Modelos +- **Framework**: PyTorch + Transformers + HuggingFace +- **MLOps**: MLflow + DVC + Weights & Biases +- **Deploy**: HuggingFace Spaces + Docker containers +- **Monitoring**: Model performance tracking + drift detection +- **Storage**: HuggingFace Hub + cloud storage integration +- **CI/CD**: Automated training + testing + deployment + +### 🚀 Componentes Planejados (Score 7-8/10) +- **Model Registry**: 7.8/10 - HuggingFace Hub integration +- **Training Pipeline**: 7.5/10 - Automated training workflow +- **Model Serving**: 7.7/10 - FastAPI + HuggingFace Spaces +- **Monitoring**: 7.3/10 - Performance tracking system +- **Version Control**: 8.0/10 - Git + DVC + HuggingFace + +### 🎯 Componentes em Desenvolvimento (Score 6-7/10) +- **Custom Models**: 6.8/10 - Domain-specific fine-tuning +- **Data Pipeline**: 6.5/10 - ETL for training data +- **Evaluation**: 6.7/10 - Automated model evaluation +- **A/B Testing**: 6.3/10 - Model comparison framework + +## 🧠 Arquitetura de Modelos + +### **Modelos Especializados Planejados** +```python +# Taxonomy dos Modelos Cidadão.AI +models_taxonomy = { + "corruption_detection": { + "type": "classification", + "base_model": "bert-base-multilingual-cased", + "specialization": "Brazilian Portuguese + government documents", + "use_case": "Detect corruption indicators in contracts" + }, + "anomaly_detection": { + "type": "regression + classification", + "base_model": "Custom ensemble", + "specialization": "Financial data patterns", + "use_case": "Identify unusual spending patterns" + }, + "entity_extraction": { + "type": "NER", + "base_model": "roberta-large", + "specialization": "Government entities + Brazilian names", + "use_case": "Extract companies, people, organizations" + }, + "sentiment_analysis": { + "type": "classification", + "base_model": "distilbert-base-uncased", + "specialization": "Public opinion on transparency", + "use_case": "Analyze citizen feedback sentiment" + }, + "summarization": { + "type": "seq2seq", + "base_model": "t5-base", + "specialization": "Government reports + legal documents", + "use_case": "Generate executive summaries" + } +} +``` + +### **Pipeline MLOps Architecture** +```yaml +# MLOps Workflow +stages: + data_collection: + - Portal da Transparência APIs + - Government databases + - Public procurement data + - Historical investigations + + data_preprocessing: + - Data cleaning & validation + - Privacy anonymization + - Feature engineering + - Data augmentation + + model_training: + - Hyperparameter optimization + - Cross-validation + - Ensemble methods + - Transfer learning + + model_evaluation: + - Performance metrics + - Fairness evaluation + - Bias detection + - Interpretability analysis + + model_deployment: + - HuggingFace Spaces + - Container deployment + - API endpoints + - Model serving + + monitoring: + - Model drift detection + - Performance degradation + - Data quality monitoring + - Usage analytics +``` + +## 🔬 Modelos de IA Especializados + +### **1. Corruption Detection Model** +```python +# Modelo especializado em detecção de corrupção +class CorruptionDetector: + base_model: "bert-base-multilingual-cased" + fine_tuned_on: "Brazilian government contracts + known corruption cases" + + features: + - Contract language analysis + - Pricing anomaly detection + - Vendor relationship patterns + - Temporal irregularities + + metrics: + - Precision: >85% + - Recall: >80% + - F1-Score: >82% + - False Positive Rate: <5% +``` + +### **2. Anomaly Detection Ensemble** +```python +# Ensemble para detecção de anomalias financeiras +class AnomalyDetector: + models: + - IsolationForest: "Outlier detection" + - LSTM: "Temporal pattern analysis" + - Autoencoder: "Reconstruction error" + - Random Forest: "Feature importance" + + features: + - Amount deviation from median + - Vendor concentration + - Seasonal patterns + - Geographic distribution + + output: + - Anomaly score (0-1) + - Confidence interval + - Explanation vector + - Risk category +``` + +### **3. Entity Recognition (NER)** +```python +# NER especializado para entidades governamentais +class GovernmentNER: + base_model: "roberta-large" + entities: + - ORGANIZATION: "Ministérios, órgãos, empresas" + - PERSON: "Servidores, políticos, empresários" + - LOCATION: "Estados, municípios, endereços" + - CONTRACT: "Números de contratos, licitações" + - MONEY: "Valores monetários, moedas" + - DATE: "Datas de contratos, vigências" + + brazilian_specialization: + - CPF/CNPJ recognition + - Brazilian address patterns + - Government terminology + - Legal document structure +``` + +## 🚀 HuggingFace Integration + +### **Model Hub Strategy** +```python +# HuggingFace Hub Organization +organization: "cidadao-ai" +models: + - "cidadao-ai/corruption-detector-pt" + - "cidadao-ai/anomaly-detector-financial" + - "cidadao-ai/ner-government-entities" + - "cidadao-ai/sentiment-transparency" + - "cidadao-ai/summarization-reports" + +spaces: + - "cidadao-ai/corruption-demo" + - "cidadao-ai/anomaly-dashboard" + - "cidadao-ai/transparency-analyzer" +``` + +### **Model Cards Template** +```markdown +# Model Card: Cidadão.AI Corruption Detector + +## Model Description +- **Developed by**: Cidadão.AI Team +- **Model type**: BERT-based binary classifier +- **Language**: Portuguese (Brazil) +- **License**: MIT + +## Training Data +- **Sources**: Portal da Transparência + curated corruption cases +- **Size**: 100K+ government contracts +- **Preprocessing**: Anonymization + cleaning + augmentation + +## Evaluation +- **Test Set**: 10K held-out contracts +- **Metrics**: Precision: 87%, Recall: 83%, F1: 85% +- **Bias Analysis**: Evaluated across regions + contract types + +## Ethical Considerations +- **Intended Use**: Transparency analysis, not legal evidence +- **Limitations**: May have bias toward certain contract types +- **Risks**: False positives could damage reputations +``` + +## 🛠️ MLOps Pipeline + +### **Training Infrastructure** +```yaml +# training-pipeline.yml +name: Model Training Pipeline +on: + schedule: + - cron: '0 2 * * 0' # Weekly retraining + workflow_dispatch: + +jobs: + data_preparation: + runs-on: ubuntu-latest + steps: + - name: Fetch latest data + - name: Validate data quality + - name: Preprocess & augment + + model_training: + runs-on: gpu-runner + steps: + - name: Hyperparameter optimization + - name: Train model + - name: Evaluate performance + + model_deployment: + runs-on: ubuntu-latest + if: model_performance > threshold + steps: + - name: Upload to HuggingFace Hub + - name: Update model registry + - name: Deploy to production +``` + +### **Model Monitoring Dashboard** +```python +# Métricas de monitoramento +monitoring_metrics = { + "performance": { + "accuracy": "Real-time accuracy tracking", + "latency": "Response time monitoring", + "throughput": "Requests per second", + "error_rate": "Failed prediction rate" + }, + "data_drift": { + "feature_drift": "Input distribution changes", + "label_drift": "Output distribution changes", + "concept_drift": "Relationship changes" + }, + "business": { + "investigations_triggered": "Anomalies detected", + "false_positive_rate": "User feedback tracking", + "citizen_satisfaction": "User experience metrics" + } +} +``` + +## 🧪 Experimentação e Avaliação + +### **Experiment Tracking** +```python +# MLflow + Weights & Biases integration +import mlflow +import wandb + +def train_model(config): + with mlflow.start_run(): + wandb.init(project="cidadao-ai", config=config) + + # Log hyperparameters + mlflow.log_params(config) + wandb.config.update(config) + + # Training loop + for epoch in range(config.epochs): + metrics = train_epoch(model, train_loader) + + # Log metrics + mlflow.log_metrics(metrics, step=epoch) + wandb.log(metrics) + + # Log model artifacts + mlflow.pytorch.log_model(model, "model") + wandb.save("model.pt") +``` + +### **A/B Testing Framework** +```python +# Framework para testes A/B de modelos +class ModelABTest: + def __init__(self, model_a, model_b, traffic_split=0.5): + self.model_a = model_a + self.model_b = model_b + self.traffic_split = traffic_split + + def predict(self, input_data, user_id): + # Route traffic based on user_id hash + if hash(user_id) % 100 < self.traffic_split * 100: + result = self.model_a.predict(input_data) + self.log_prediction("model_a", result, user_id) + else: + result = self.model_b.predict(input_data) + self.log_prediction("model_b", result, user_id) + + return result +``` + +## 📊 Datasets e Treinamento + +### **Datasets Especializados** +```python +# Datasets para treinamento +datasets = { + "transparency_contracts": { + "source": "Portal da Transparência API", + "size": "500K+ contracts", + "format": "JSON + PDF text extraction", + "labels": "Manual annotation + expert review" + }, + "corruption_cases": { + "source": "Historical investigations + court records", + "size": "10K+ labeled cases", + "format": "Structured data + documents", + "labels": "Binary classification + severity" + }, + "financial_anomalies": { + "source": "Government spending data", + "size": "1M+ transactions", + "format": "Tabular data", + "labels": "Statistical outliers + domain expert" + } +} +``` + +### **Data Preprocessing Pipeline** +```python +# Pipeline de preprocessamento +class DataPreprocessor: + def __init__(self): + self.tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased") + self.anonymizer = GovernmentDataAnonymizer() + + def preprocess_contract(self, contract_text): + # 1. Anonymize sensitive information + anonymized = self.anonymizer.anonymize(contract_text) + + # 2. Clean and normalize text + cleaned = self.clean_text(anonymized) + + # 3. Tokenize for model input + tokens = self.tokenizer( + cleaned, + max_length=512, + truncation=True, + padding=True, + return_tensors="pt" + ) + + return tokens +``` + +## 🔄 Integração com Backend + +### **Model Serving API** +```python +# FastAPI endpoints para servir modelos +from fastapi import FastAPI +from transformers import pipeline + +app = FastAPI() + +# Load models +corruption_detector = pipeline( + "text-classification", + model="cidadao-ai/corruption-detector-pt" +) + +anomaly_detector = joblib.load("models/anomaly_detector.pkl") + +@app.post("/analyze/corruption") +async def detect_corruption(contract_text: str): + result = corruption_detector(contract_text) + return { + "prediction": result[0]["label"], + "confidence": result[0]["score"], + "model_version": "v1.0.0" + } + +@app.post("/analyze/anomaly") +async def detect_anomaly(financial_data: dict): + features = extract_features(financial_data) + anomaly_score = anomaly_detector.predict(features) + return { + "anomaly_score": float(anomaly_score), + "is_anomaly": anomaly_score > 0.7, + "explanation": generate_explanation(features) + } +``` + +### **Agent Integration** +```python +# Integração com sistema multi-agente +class ModelService: + def __init__(self): + self.models = { + "corruption": self.load_corruption_model(), + "anomaly": self.load_anomaly_model(), + "ner": self.load_ner_model() + } + + async def analyze_for_agent(self, agent_name: str, data: dict): + if agent_name == "InvestigatorAgent": + return await self.detect_anomalies(data) + elif agent_name == "CorruptionDetectorAgent": + return await self.detect_corruption(data) + elif agent_name == "AnalystAgent": + return await self.extract_entities(data) +``` + +## 🔒 Ética e Governança + +### **Responsible AI Principles** +```python +# Princípios de IA Responsável +class ResponsibleAI: + principles = { + "transparency": "Explicabilidade em todas as decisões", + "fairness": "Avaliação de viés em grupos demográficos", + "privacy": "Anonimização de dados pessoais", + "accountability": "Auditoria e rastreabilidade", + "robustness": "Teste contra adversarial attacks" + } + + def evaluate_bias(self, model, test_data, protected_attributes): + """Avalia viés do modelo em grupos protegidos""" + bias_metrics = {} + for attr in protected_attributes: + group_metrics = self.compute_group_metrics(model, test_data, attr) + bias_metrics[attr] = group_metrics + return bias_metrics +``` + +### **Model Interpretability** +```python +# Ferramentas de interpretabilidade +from lime.lime_text import LimeTextExplainer +from shap import Explainer + +class ModelExplainer: + def __init__(self, model): + self.model = model + self.lime_explainer = LimeTextExplainer() + self.shap_explainer = Explainer(model) + + def explain_prediction(self, text, method="lime"): + if method == "lime": + explanation = self.lime_explainer.explain_instance( + text, self.model.predict_proba + ) + elif method == "shap": + explanation = self.shap_explainer(text) + + return explanation +``` + +## 📋 Roadmap Modelos IA + +### **Curto Prazo (1-2 meses)** +1. **Setup MLOps Pipeline**: MLflow + DVC + CI/CD +2. **Corruption Detection Model**: Fine-tune BERT para português +3. **HuggingFace Integration**: Upload initial models +4. **Basic Monitoring**: Performance tracking dashboard + +### **Médio Prazo (3-6 meses)** +1. **Anomaly Detection Ensemble**: Multiple algorithms +2. **NER Government Entities**: Custom entity recognition +3. **Model A/B Testing**: Production experimentation +4. **Advanced Monitoring**: Drift detection + alerting + +### **Longo Prazo (6+ meses)** +1. **Custom Architecture**: Domain-specific model architectures +2. **Federated Learning**: Privacy-preserving training +3. **AutoML Pipeline**: Automated model selection +4. **Edge Deployment**: Local model inference + +## ⚠️ Áreas para Melhoria + +### **Priority 1: Data Pipeline** +- **Data Collection**: Automated data ingestion +- **Data Quality**: Validation + cleaning pipelines +- **Labeling**: Active learning + human-in-the-loop +- **Privacy**: Advanced anonymization techniques + +### **Priority 2: Model Development** +- **Custom Models**: Domain-specific architectures +- **Transfer Learning**: Portuguese government domain +- **Ensemble Methods**: Model combination strategies +- **Optimization**: Model compression + acceleration + +### **Priority 3: MLOps Maturity** +- **CI/CD**: Automated testing + deployment +- **Monitoring**: Comprehensive drift detection +- **Experimentation**: A/B testing framework +- **Governance**: Model audit + compliance + +## 🎯 Métricas de Sucesso + +### **Technical Metrics** +- **Model Performance**: F1 > 85% for all models +- **Inference Latency**: <200ms response time +- **Deployment Success**: >99% uptime +- **Data Pipeline**: <1% data quality issues + +### **Business Metrics** +- **Anomalies Detected**: 100+ monthly +- **False Positive Rate**: <5% +- **User Satisfaction**: >80% positive feedback +- **Investigation Success**: >70% actionable insights + +## 🔧 Comandos de Desenvolvimento + +### **Model Training** +```bash +# Train corruption detection model +python train_corruption_detector.py --config configs/corruption_bert.yaml + +# Evaluate model performance +python evaluate_model.py --model corruption_detector --test_data data/test.json + +# Upload to HuggingFace Hub +python upload_to_hub.py --model_path models/corruption_detector --repo_name cidadao-ai/corruption-detector-pt +``` + +### **Monitoring** +```bash +# Check model drift +python monitor_drift.py --model corruption_detector --window 7d + +# Generate performance report +python generate_report.py --models all --period monthly +``` + +## 📝 Considerações Técnicas + +### **Compute Requirements** +- **Training**: GPU-enabled instances (V100/A100) +- **Inference**: CPU instances sufficient for most models +- **Storage**: Cloud storage for datasets + model artifacts +- **Monitoring**: Real-time metrics collection + +### **Security** +- **Model Protection**: Encrypted model artifacts +- **API Security**: Authentication + rate limiting +- **Data Privacy**: LGPD compliance + anonymization +- **Audit Trail**: Complete lineage tracking + +### **Scalability** +- **Horizontal Scaling**: Load balancer + multiple instances +- **Model Versioning**: Backward compatibility +- **Cache Strategy**: Redis for frequent predictions +- **Batch Processing**: Async inference for large datasets + +--- + +**Models Status**: Pipeline em desenvolvimento com infraestrutura sólida para modelos especializados. +**Next Update**: Implementação do primeiro modelo de detecção de corrupção e pipeline MLOps completo. \ No newline at end of file diff --git "a/Cidad\303\243o IA Models Arquitetura.md" "b/Cidad\303\243o IA Models Arquitetura.md" new file mode 100644 index 0000000000000000000000000000000000000000..36da4190f59febf84149df4656d410de6554b108 --- /dev/null +++ "b/Cidad\303\243o IA Models Arquitetura.md" @@ -0,0 +1,735 @@ +# 🤖 Cidadão.IA Models - Hub de Modelos de IA Governamental + +## 🎯 **Visão Geral do Projeto** + +O repositório `cidadao.ai-models` é o **hub especializado de modelos de IA/ML** para o ecossistema Cidadão.IA, fornecendo modelos de machine learning de classe mundial, otimizados e prontos para produção via **HuggingFace Models**, integrados seamlessly com o backend principal. + +### 🔗 **Integração com Ecossistema Cidadão.IA** +- **🏛️ Backend Principal**: [cidadao.ai-backend](https://huggingface.co/spaces/neural-thinker/cidadao.ai-backend) (HuggingFace Spaces) +- **🤖 Modelos de IA**: [cidadao.ai-models](https://github.com/anderson-ufrj/cidadao.ai-models) → HuggingFace Models +- **⚡ Inferência**: Modelos servidos via HF Hub, processados no backend Space + +--- + +## 🧠 **Visão & Missão** + +**Visão**: Criar um repositório de modelos de IA de classe mundial, modular, que democratize o acesso a insights de transparência governamental através de machine learning avançado. + +**Missão**: Desenvolver, manter e distribuir modelos de IA especializados para: +- Detecção de anomalias em gastos públicos +- Verificação de conformidade legal +- Avaliação de risco financeiro +- Reconhecimento de padrões de corrupção +- Análise temporal de dados governamentais + +--- + +## 📐 **Princípios Arquiteturais** + +### **1. Modularidade em Primeiro Lugar** +- Cada modelo é um componente autônomo e versionado +- Separação clara entre treinamento, inferência e utilitários +- Arquitetura baseada em plugins para fácil integração + +### **2. Escalabilidade por Design** +- Escalonamento horizontal via containerização +- Serving de modelos pronto para microserviços +- Estratégias de deploy agnósticas à cloud + +### **3. Excelência em Produção** +- Versionamento e registry de modelos nível enterprise +- Testes e validação abrangentes +- Integração com pipeline MLOps +- Observabilidade e monitoramento integrados + +### **4. Integração HuggingFace Nativa** +- **HuggingFace Models**: Deploy direto via HF Hub para distribuição global +- **Model Cards**: Documentação técnica automática com métricas e exemplos +- **Transformers**: Compatibilidade nativa com biblioteca transformers +- **ONNX Export**: Otimização cross-platform para produção +- **Zero Infraestrutura**: Aproveitamento da CDN global do HuggingFace + +--- + +## 🏛️ **Arquitetura do Repositório** + +``` +cidadao.ia-models/ +├── 📁 modelos/ # Modelos prontos para produção +│ ├── 📁 cidadao-bertimbau-ner-v1/ # Reconhecimento de Entidades Nomeadas +│ ├── 📁 cidadao-detector-anomalias-v2/ # Detecção de Anomalias +│ ├── 📁 cidadao-classificador-risco-v1/ # Avaliação de Risco Financeiro +│ ├── 📁 cidadao-juiz-conformidade-v1/ # Conformidade Legal +│ └── 📁 cidadao-analisador-espectral-v1/ # Análise de Padrões Temporais +│ +├── 📁 experimentos/ # Pesquisa & prototipagem +│ ├── 📁 notebooks/ # Notebooks Jupyter de pesquisa +│ ├── 📁 prototipos/ # Modelos experimentais +│ └── 📁 benchmarks/ # Comparações de performance +│ +├── 📁 treinamento/ # Infraestrutura de treinamento +│ ├── 📁 pipelines/ # Pipelines MLOps de treinamento +│ ├── 📁 datasets/ # Utilitários de processamento de dados +│ └── 📁 configs/ # Configurações de treinamento +│ +├── 📁 inferencia/ # Serving de produção +│ ├── 📁 pipelines/ # Pipelines de inferência +│ ├── 📁 api/ # APIs de serving de modelos +│ └── 📁 lote/ # Processamento em lote +│ +├── 📁 utils/ # Utilitários compartilhados +│ ├── 📁 preprocessamento/ # Pré-processamento de dados +│ ├── 📁 pos_processamento/ # Processamento de saída +│ ├── 📁 metricas/ # Métricas de avaliação +│ └── 📁 visualizacao/ # Visualização de análises +│ +├── 📁 deploy/ # Artefatos de deploy +│ ├── 📁 docker/ # Definições de containers +│ ├── 📁 kubernetes/ # Manifestos K8s +│ └── 📁 cloud/ # Configs específicas de cloud +│ +├── 📁 testes/ # Testes abrangentes +│ ├── 📁 unitarios/ # Testes unitários +│ ├── 📁 integracao/ # Testes de integração +│ └── 📁 performance/ # Testes de performance +│ +├── 📁 docs/ # Documentação +│ ├── 📁 cartoes_modelos/ # Documentação dos modelos +│ ├── 📁 tutoriais/ # Tutoriais de uso +│ └── 📁 api/ # Documentação da API +│ +├── 📄 registro_modelos.json # Registry central de modelos +├── 📄 requirements.txt # Dependências Python +├── 📄 setup.py # Instalação do pacote +├── 📄 Dockerfile # Container de produção +├── 📄 docker-compose.yml # Ambiente de desenvolvimento +└── 📄 README.md # Visão geral do projeto +``` + +--- + +## 🚀 **Taxonomia & Convenção de Nomenclatura de Modelos** + +### **Padrão de Nomenclatura**: `cidadao-{dominio}-{arquitetura}-v{versao}` + +| Domínio | Arquitetura | Exemplo | Propósito | +|---------|-------------|---------|-----------| +| `bertimbau-ner` | Transformer | `cidadao-bertimbau-ner-v1` | Reconhecimento de Entidades Nomeadas | +| `detector-anomalias` | Ensemble | `cidadao-detector-anomalias-v2` | Detecção de anomalias em gastos | +| `classificador-risco` | BERT | `cidadao-classificador-risco-v1` | Avaliação de risco financeiro | +| `juiz-conformidade` | RoBERTa | `cidadao-juiz-conformidade-v1` | Verificação de conformidade legal | +| `analisador-espectral` | Processamento de Sinais | `cidadao-analisador-espectral-v1` | Análise de padrões temporais | +| `modelador-topicos` | BERTopic | `cidadao-modelador-topicos-v1` | Modelagem de tópicos de documentos | +| `roteador-semantico` | Sentence-BERT | `cidadao-roteador-semantico-v1` | Roteamento de consultas | + +--- + +## 📋 **Roadmap de Implementação** + +### **Fase 1: Configuração da Fundação** ⏱️ *2 semanas* + +#### **Semana 1: Estrutura do Repositório** +- [ ] **Dia 1-2**: Criar estrutura base de diretórios +- [ ] **Dia 3-4**: Configurar pipeline CI/CD (GitHub Actions) +- [ ] **Dia 5-7**: Configurar ambiente de desenvolvimento (Docker, requirements) + +#### **Semana 2: Infraestrutura Core** +- [ ] **Dia 1-3**: Implementar sistema de registry de modelos +- [ ] **Dia 4-5**: Criar classes base de modelos e interfaces +- [ ] **Dia 6-7**: Configurar framework de testes e quality gates + +### **Fase 2: Migração & Aprimoramento de Modelos** ⏱️ *4 semanas* + +#### **Semana 1: Modelos de Detecção de Anomalias** +- [ ] **Dia 1-2**: Extrair `cidadao_model.py` → `cidadao-detector-anomalias-v2` +- [ ] **Dia 3-4**: Criar model card e documentação +- [ ] **Dia 5**: Implementar pipeline de inferência +- [ ] **Dia 6-7**: Adicionar export ONNX e otimização + +#### **Semana 2: Modelos NER & Entidades** +- [ ] **Dia 1-2**: Implementar `cidadao-bertimbau-ner-v1` +- [ ] **Dia 3-4**: Criar pipeline de extração de entidades +- [ ] **Dia 5**: Adicionar suporte multilíngue (Português/Inglês) +- [ ] **Dia 6-7**: Otimização de performance e cache + +#### **Semana 3: Modelos de Risco & Conformidade** +- [ ] **Dia 1-2**: Extrair classificação de risco → `cidadao-classificador-risco-v1` +- [ ] **Dia 3-4**: Implementar verificação de conformidade → `cidadao-juiz-conformidade-v1` +- [ ] **Dia 5**: Criar sistema unificado de pontuação +- [ ] **Dia 6-7**: Adicionar recursos de explicabilidade (SHAP/LIME) + +#### **Semana 4: Analytics Avançadas** +- [ ] **Dia 1-2**: Extrair análise espectral → `cidadao-analisador-espectral-v1` +- [ ] **Dia 3-4**: Implementar modelagem de tópicos → `cidadao-modelador-topicos-v1` +- [ ] **Dia 5**: Criar roteamento semântico → `cidadao-roteador-semantico-v1` +- [ ] **Dia 6-7**: Testes de integração e validação de performance + +### **Fase 3: Prontidão para Produção** ⏱️ *3 semanas* + +#### **Semana 1: Infraestrutura de API & Serving** +- [ ] **Dia 1-2**: Implementar endpoints FastAPI de serving de modelos +- [ ] **Dia 3-4**: Criar pipelines de processamento em lote +- [ ] **Dia 5**: Adicionar health checks e monitoramento +- [ ] **Dia 6-7**: Implementar rate limiting e segurança + +#### **Semana 2: Infraestrutura de Treinamento** +- [ ] **Dia 1-2**: Criar pipelines MLOps de treinamento +- [ ] **Dia 3-4**: Implementar tracking de experimentos (MLflow/Weights & Biases) +- [ ] **Dia 5**: Adicionar tuning automatizado de hiperparâmetros +- [ ] **Dia 6-7**: Criar workflows de validação e aprovação de modelos + +#### **Semana 3: Deploy & DevOps** +- [ ] **Dia 1-2**: Containerizar todos os modelos (Docker) +- [ ] **Dia 3-4**: Criar manifestos de deploy Kubernetes +- [ ] **Dia 5**: Implementar estratégia de deploy blue-green +- [ ] **Dia 6-7**: Adicionar observabilidade (Prometheus, Grafana) + +### **Fase 4: Recursos Avançados** ⏱️ *3 semanas* + +#### **Semana 1: Integração Model Hub** +- [ ] **Dia 1-2**: Integrar com Hugging Face Hub +- [ ] **Dia 3-4**: Criar pipeline automatizado de publicação de modelos +- [ ] **Dia 5**: Implementar versionamento de modelos e testes A/B +- [ ] **Dia 6-7**: Adicionar governança e workflows de aprovação de modelos + +#### **Semana 2: Otimização de Performance** +- [ ] **Dia 1-2**: Implementar quantização e pruning de modelos +- [ ] **Dia 3-4**: Adicionar aceleração GPU e otimização CUDA +- [ ] **Dia 5**: Criar cache e memoização de modelos +- [ ] **Dia 6-7**: Implementar inferência distribuída + +#### **Semana 3: Analytics & Monitoramento** +- [ ] **Dia 1-2**: Criar dashboards de performance de modelos +- [ ] **Dia 3-4**: Implementar detecção de drift e alertas +- [ ] **Dia 5**: Adicionar tracking de métricas de negócio +- [ ] **Dia 6-7**: Criar triggers automatizados de retreinamento de modelos + +--- + +## 🧩 **Especificações de Interface de Modelos** + +### **Interface Padrão de Modelos** + +```python +from abc import ABC, abstractmethod +from typing import Dict, List, Any, Optional +from dataclasses import dataclass +import torch +from transformers import PreTrainedModel + +@dataclass +class MetadadosModelo: + nome: str + versao: str + descricao: str + autor: str + criado_em: str + tipo_modelo: str + schema_entrada: Dict[str, Any] + schema_saida: Dict[str, Any] + metricas: Dict[str, float] + tags: List[str] + +class CidadaoModeloBase(ABC): + """Interface base para todos os modelos Cidadão.IA""" + + def __init__(self, caminho_modelo: str, config: Optional[Dict] = None): + self.caminho_modelo = caminho_modelo + self.config = config or {} + self.metadados = self.carregar_metadados() + self.modelo = self.carregar_modelo() + + @abstractmethod + def carregar_modelo(self) -> Any: + """Carregar o modelo treinado""" + pass + + @abstractmethod + def predizer(self, entradas: Dict[str, Any]) -> Dict[str, Any]: + """Fazer predições nos dados de entrada""" + pass + + @abstractmethod + def predizer_lote(self, entradas: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """Fazer predições em lote""" + pass + + def validar_entrada(self, entradas: Dict[str, Any]) -> bool: + """Validar entrada contra schema""" + pass + + def explicar(self, entradas: Dict[str, Any]) -> Dict[str, Any]: + """Fornecer explicações do modelo""" + pass + + def carregar_metadados(self) -> MetadadosModelo: + """Carregar metadados do modelo""" + pass +``` + +### **Interfaces Específicas de Modelos** + +#### **Interface de Detecção de Anomalias** +```python +class CidadaoDetectorAnomalias(CidadaoModeloBase): + def predizer(self, entradas: Dict[str, Any]) -> Dict[str, Any]: + """ + Detectar anomalias em dados de transparência + + Args: + entradas: { + "dados_transacao": List[Dict], + "contexto": Optional[Dict] + } + + Returns: + { + "score_anomalia": float, + "tipo_anomalia": str, + "confianca": float, + "explicacao": Dict + } + """ + pass +``` + +#### **Interface NER** +```python +class CidadaoBERTimauNER(CidadaoModeloBase): + def predizer(self, entradas: Dict[str, Any]) -> Dict[str, Any]: + """ + Extrair entidades nomeadas de documentos públicos brasileiros + + Args: + entradas: { + "texto": str, + "idioma": Optional[str] = "pt" + } + + Returns: + { + "entidades": List[Dict], + "scores_confianca": List[float], + "tempo_processamento": float + } + """ + pass +``` + +--- + +## 📊 **Sistema de Registry de Modelos** + +### **Estrutura do registro_modelos.json** + +```json +{ + "versao": "1.0.0", + "ultima_atualizacao": "2025-07-22T10:00:00Z", + "modelos": { + "cidadao-detector-anomalias-v2": { + "metadados": { + "nome": "Cidadão.IA Detector de Anomalias", + "versao": "2.0.0", + "descricao": "Detecção avançada de anomalias para dados de transparência governamental", + "autor": "Anderson Henrique", + "criado_em": "2025-07-22T10:00:00Z", + "tipo_modelo": "detecção_anomalias", + "framework": "pytorch", + "modelo_base": "bert-base-portuguese-cased", + "tarefa": "classificacao_multiclasse", + "idioma": "pt-BR", + "dominio": "transparencia_governamental" + }, + "performance": { + "acuracia": 0.94, + "precisao": 0.92, + "recall": 0.89, + "f1_score": 0.90, + "auc_roc": 0.96, + "tempo_inferencia_ms": 45 + }, + "deploy": { + "status": "producao", + "endpoint": "/api/v1/anomalia/detectar", + "replicas": 3, + "gpu_necessaria": false, + "memoria_mb": 512, + "cpu_cores": 2 + }, + "arquivos": { + "cartao_modelo": "modelos/cidadao-detector-anomalias-v2/README.md", + "pesos": "modelos/cidadao-detector-anomalias-v2/pytorch_model.bin", + "config": "modelos/cidadao-detector-anomalias-v2/config.json", + "tokenizer": "modelos/cidadao-detector-anomalias-v2/tokenizer.json", + "script_inferencia": "modelos/cidadao-detector-anomalias-v2/inferencia.py", + "requirements": "modelos/cidadao-detector-anomalias-v2/requirements.txt" + }, + "validacao": { + "casos_teste": "testes/integracao/test_detector_anomalias.py", + "dataset_benchmark": "datasets/benchmark_transparencia_v1.json", + "status_validacao": "passou", + "ultima_validacao": "2025-07-22T09:30:00Z" + }, + "huggingface": { + "model_id": "neural-thinker/cidadao-detector-anomalias-v2", + "privado": false, + "downloads": 1250, + "likes": 89 + } + } + } +} +``` + +--- + +## 🚀 **Arquitetura de Integração HuggingFace (Implementada)** + +### **🎯 Estratégia Principal: HuggingFace Hub + Spaces** + +``` +Fluxo de Deploy: +cidadao.ai-models → HuggingFace Models → cidadao.ai-backend (HF Spaces) + (dev) (storage) (production) +``` + +### **💻 Implementação no Backend (HF Spaces)** + +```python +# No cidadao.ai-backend (HuggingFace Spaces) +from transformers import pipeline, AutoModel, AutoTokenizer +from functools import lru_cache +import torch + +@lru_cache(maxsize=10) # Cache local para performance +def load_cidadao_model(model_name: str): + """Carrega modelo do HuggingFace Models e cacheia localmente""" + return pipeline( + "text-classification", + model=f"neural-thinker/{model_name}", + device=0 if torch.cuda.is_available() else -1 + ) + +# Uso nos Agentes Multi-Agent +class ObaluaieCorruptionAgent(BaseAgent): + def __init__(self): + self.detector = load_cidadao_model("cidadao-detector-anomalias-v2") + self.ner = load_cidadao_model("cidadao-bertimbau-ner-v1") + + async def analyze_corruption(self, transaction_data): + # Inferência usando modelos do HF Hub + anomaly_result = self.detector(transaction_data) + entities = self.ner(transaction_data) + return self.generate_report(anomaly_result, entities) +``` + +### **📦 Deploy de Modelos no HuggingFace** + +```bash +# Deploy automático via CLI +huggingface-cli login +huggingface-cli upload neural-thinker/cidadao-detector-anomalias-v2 ./models/detector-anomalias/ +huggingface-cli upload neural-thinker/cidadao-bertimbau-ner-v1 ./models/ner/ +``` + +### **🔄 Vantagens desta Arquitetura** + +✅ **Zero Infraestrutura**: Sem servers ou containers para modelos +✅ **CDN Global**: HuggingFace distribui mundialmente +✅ **Cache Inteligente**: Modelos carregados sob demanda +✅ **Versionamento**: Controle automático de versões +✅ **Performance**: Mesmo poder computacional do HF Spaces +✅ **Escalabilidade**: Modelos disponíveis 24/7 +✅ **Community**: Visibilidade e contribuições + +### **📊 Métricas de Performance** + +```python +# Monitoramento automático de modelos +import time +from functools import wraps + +def track_model_performance(func): + @wraps(func) + def wrapper(*args, **kwargs): + start_time = time.time() + result = func(*args, **kwargs) + inference_time = time.time() - start_time + + # Log métricas para Grafana/Prometheus + log_metric("model_inference_time", inference_time) + log_metric("model_usage_count", 1) + return result + return wrapper + +@track_model_performance +def analyze_with_model(data): + return load_cidadao_model("cidadao-detector-anomalias-v2")(data) +``` + +--- + +## 🧪 **Estratégia de Testes** + +### **Hierarquia de Testes** + +``` +testes/ +├── unitarios/ # Testes rápidos e isolados +│ ├── test_interfaces_modelos.py # Contratos de interface +│ ├── test_preprocessamento.py # Processamento de dados +│ └── test_pos_processamento.py # Processamento de saída +├── integracao/ # Testes de interação de componentes +│ ├── test_pipelines_modelos.py # Pipelines end-to-end +│ ├── test_endpoints_api.py # Integração da API +│ └── test_integracao_backend.py # Compatibilidade com backend +├── performance/ # Testes de performance e carga +│ ├── test_velocidade_inferencia.py # Benchmarks de latência +│ ├── test_uso_memoria.py # Profiling de memória +│ └── test_throughput.py # Requisições concorrentes +└── validacao/ # Testes de acurácia de modelos + ├── test_acuracia_modelo.py # Validação de acurácia + ├── test_casos_extremos.py # Tratamento de edge cases + └── test_drift_dados.py # Detecção de shift de distribuição +``` + +### **Pipeline de Testes Contínuos** + +```yaml +# .github/workflows/testes.yml +name: Pipeline de Testes de Modelos +on: [push, pull_request] + +jobs: + testes-unitarios: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Executar testes unitários + run: pytest testes/unitarios/ -v --cov=modelos + + validacao-modelos: + runs-on: ubuntu-latest + steps: + - name: Validar acurácia dos modelos + run: pytest testes/validacao/ -v + - name: Benchmarks de performance + run: pytest testes/performance/ -v + + testes-integracao: + runs-on: ubuntu-latest + steps: + - name: Testar integração com backend + run: pytest testes/integracao/ -v +``` + +--- + +## 🚀 **Estratégias de Deploy** + +### **Desenvolvimento Local** +```bash +# Início rápido para desenvolvimento +docker-compose up -d +# Acessar modelos em http://localhost:8001 +``` + +### **Ambiente de Staging** +```bash +# Deploy Kubernetes +kubectl apply -f deploy/kubernetes/staging/ +# Service mesh Istio para testes A/B +kubectl apply -f deploy/istio/ +``` + +### **Deploy de Produção** +```bash +# Deploy blue-green +./deploy/scripts/blue_green_deploy.sh cidadao-detector-anomalias-v2 +# Rollout canary +./deploy/scripts/canary_deploy.sh --divisao-trafego=10% +``` + +--- + +## 📈 **Monitoramento & Observabilidade** + +### **Métricas de Performance de Modelos** + +```python +# Tracking automático de performance +@track_performance +def predizer(self, entradas): + # Automaticamente rastreia: + # - Latência de inferência + # - Uso de memória + # - Distribuição de confiança das predições + # - Shapes dos dados de entrada/saída + # - Taxa de erros + pass +``` + +### **Dashboard de Métricas de Negócio** + +```yaml +# Configuração dashboard Grafana +dashboards: + - nome: "Performance dos Modelos" + paineis: + - latencia_inferencia_p95 + - acuracia_predicao + - score_drift_modelo + - volume_requisicoes + - taxa_erro + + - nome: "Impacto de Negócio" + paineis: + - anomalias_detectadas_diariamente + - taxa_sucesso_investigacoes + - taxa_falsos_positivos + - score_satisfacao_usuario +``` + +--- + +## 🔮 **Aprimoramentos Futuros** + +### **Roadmap para os Próximos 6 Meses** + +1. **Q3 2025**: + - Modelos multimodais (texto + dados tabulares) + - Atualizações de modelo em tempo real via streaming + - Otimização para deploy em edge + +2. **Q4 2025**: + - Aprendizado federado para treinamento preservando privacidade + - Pipeline AutoML para busca de arquitetura de modelos + - Explicabilidade avançada com inferência causal + +3. **Q1 2026**: + - Integração com backends de computação quântica + - Pipeline de fine-tuning de large language models + - Plataforma de colaboração inter-governamental + +--- + +## 💡 **Áreas de Inovação** + +### **Oportunidades de Pesquisa** +- **IA Causal**: Entender relações causa-efeito em dados governamentais +- **Meta-Learning**: Aprendizado few-shot para novos domínios governamentais +- **Robustez Adversarial**: Defesa contra manipulação de dados +- **IA Interpretável**: Construir confiança através de transparência + +### **Exploração Tecnológica** +- **Redes Neurais de Grafos**: Modelar relacionamentos entre entidades +- **Transformers para Séries Temporais**: Reconhecimento avançado de padrões temporais +- **Fusão Multimodal**: Combinar dados de texto, numéricos e imagem +- **Aprendizado por Reforço**: Estratégias adaptativas de investigação + +--- + +## 📊 **Métricas de Performance e Sucesso** + +### **💻 KPIs Técnicos (HuggingFace Integration)** +- **Latência de Download**: < 2s para primeiro carregamento +- **Latência de Inferência**: < 100ms no HF Spaces +- **Cache Hit Rate**: > 90% para modelos frequentes +- **Acurácia de Modelos**: > 95% no conjunto de validação +- **Disponibilidade**: 99.9% via CDN HuggingFace +- **Throughput**: > 1000 inferências/min por modelo + +### **🎯 KPIs de Negócio (Impacto Real)** +- **50%** ⬆️ na acurácia de detecção de anomalias +- **30%** ⬇️ na taxa de falsos positivos +- **25%** ⬆️ na eficiência de investigações +- **90%** score de satisfação do usuário +- **15** agentes multi-agent usando modelos especializados + +### **🌍 KPIs de Community (HuggingFace)** +- **Downloads**: > 10K/mês por modelo principal +- **Likes**: > 100 por modelo +- **Contributors**: > 5 colaboradores ativos +- **Model Cards**: Documentação completa 100% modelos + +--- + +## 🚀 **Roadmap de Implementação Atualizado** + +### **📋 Fase 1: Setup HuggingFace (Esta Semana)** +- ✅ **Estrutura do repositório** definida +- 🔄 **HuggingFace CLI** configurado +- 🔄 **CI/CD pipeline** para auto-deploy HF Models +- 🔄 **Model cards** templates criados + +### **🤖 Fase 2: Migração de Modelos (Próximas 2 Semanas)** +- 🔄 **cidadao-detector-anomalias-v2** → HF Models +- 🔄 **cidadao-bertimbau-ner-v1** → HF Models +- 🔄 **cidadao-classificador-risco-v1** → HF Models +- 🔄 **Integração** com backend via transformers + +### **⚡ Fase 3: Otimização (Mês 2)** +- 🔄 **ONNX export** para performance +- 🔄 **Quantização** para memória +- 🔄 **Cache inteligente** no backend +- 🔄 **Métricas** Prometheus integradas + +### **🌍 Governança Simplificada** + +**👨‍💻 Responsável Principal**: Anderson Henrique +**📝 Processo**: +1. Desenvolver modelo localmente +2. Testar integração com backend +3. Deploy automático HF Models +4. Atualizar backend para usar nova versão +5. Monitorar performance em produção + +--- + +## 🔧 **Próximos Passos Imediatos** + +### **Esta Semana (22-28 Jul 2025)** +1. **Criar estrutura inicial do repositório** com diretórios base +2. **Configurar ambiente de desenvolvimento** com Docker e requirements +3. **Implementar sistema básico de registry** de modelos +4. **Documentar padrões de interface** para todos os tipos de modelos + +### **Próxima Semana (29 Jul - 4 Ago 2025)** +1. **Migrar primeiro modelo** (detector de anomalias) do backend +2. **Criar model card completo** com documentação técnica +3. **Implementar testes básicos** de validação +4. **Configurar CI/CD pipeline** no GitHub Actions + +### **Sprint 1 (5-18 Ago 2025)** +1. **Migrar todos os modelos principais** do backend +2. **Implementar pipelines de inferência** para cada modelo +3. **Criar integração com backend** via submódulo Git +4. **Configurar monitoramento básico** de performance + +--- + +--- + +## 🎆 **Conclusão: Hub de Modelos de Classe Mundial** + +O **cidadao.ai-models** representa uma **evolução arquitetural** no ecossistema Cidadão.IA, combinando: + +✨ **Simplicidade**: Deploy direto via HuggingFace Models +🚀 **Performance**: Zero infraestrutura, CDN global +🤖 **Integração**: Seamless com backend via transformers +🌍 **Community**: Visibilidade e contribuições abertas +📊 **Escalabilidade**: Modelos disponíveis 24/7 mundialmente + +### **🎯 Impacto Esperado** +- **15 agentes** do backend usando modelos especializados +- **>95% acurácia** na detecção de anomalias governamentais +- **<100ms** latência de inferência em produção +- **Zero custos** de infraestrutura para serving de modelos +- **Democratização** da IA para transparência no Brasil + +*Esta arquitetura estabelece o **primeiro hub de modelos de IA governamental** do Brasil, integrando seamlessly com HuggingFace para criar uma plataforma de transparência de classe mundial.* + +**📄 Versão do Documento**: 2.0.0 (HuggingFace Native) +**🗓️ Última Atualização**: 23 de Julho, 2025 +**🔄 Próxima Revisão**: 30 de Julho, 2025 (Pós primeiro deploy) diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000000000000000000000000000000000000..50b4115744592541b22f96b81a6cdaaa65b7f67b --- /dev/null +++ b/Dockerfile @@ -0,0 +1,45 @@ +# Dockerfile for HuggingFace Spaces - Cidadão.AI Models +FROM python:3.11-slim + +# Set environment variables +ENV PYTHONUNBUFFERED=1 +ENV PYTHONDONTWRITEBYTECODE=1 +ENV PORT=8001 + +# Install system dependencies +RUN apt-get update && apt-get install -y \ + curl \ + git \ + && rm -rf /var/lib/apt/lists/* + +# Create app user for security +RUN useradd --create-home --shell /bin/bash app + +# Set work directory +WORKDIR /app + +# Copy requirements and install Python dependencies +COPY requirements-hf.txt ./ +RUN pip install --no-cache-dir --upgrade pip && \ + pip install --no-cache-dir -r requirements-hf.txt + +# Copy application code +COPY src/ ./src/ +COPY app.py ./ + +# Create necessary directories and set permissions +RUN mkdir -p logs models data && \ + chown -R app:app /app + +# Switch to app user +USER app + +# Expose port for HuggingFace Spaces +EXPOSE 8001 + +# Health check +HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \ + CMD curl -f http://localhost:8001/health || exit 1 + +# Run application +CMD ["python", "app.py"] \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a4672b78bcc73b704375b6acc3164948f0bfaead --- /dev/null +++ b/README.md @@ -0,0 +1,222 @@ +--- +title: Cidadão.AI Models +emoji: 🤖 +colorFrom: blue +colorTo: green +sdk: docker +app_port: 8001 +pinned: false +license: mit +tags: + - transparency + - government + - brazil + - anomaly-detection + - fastapi +--- + +# 🤖 Cidadão.AI Models + +> **Modelos especializados de Machine Learning para análise de transparência pública brasileira** + +[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/) +[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/) +[![Transformers](https://img.shields.io/badge/🤗-Transformers-yellow.svg)](https://huggingface.co/transformers/) +[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE) + +## 🎯 Visão Geral + +**Cidadão.AI Models** é o repositório especializado em modelos de machine learning para o ecossistema Cidadão.AI. Contém modelos customizados, pipeline de treinamento MLOps e infraestrutura de inferência para análise avançada de dados de transparência pública. + +### 🚀 Capacidades Principais + +- 🔍 **Detecção de Anomalias** - Identificação automática de padrões suspeitos em contratos públicos +- 📊 **Análise de Padrões** - Reconhecimento de correlações e tendências em dados governamentais +- 🌊 **Análise Espectral** - Detecção de padrões temporais e sazonais via FFT +- 🤖 **Modelos Customizados** - Arquiteturas especializadas para transparência brasileira +- 🔄 **Pipeline MLOps** - Treinamento, versionamento e deploy automatizados + +## 🏗️ Arquitetura + +``` +src/ +├── models/ # Modelos de ML especializados +│ ├── anomaly_detection/ # Detecção de anomalias +│ ├── pattern_analysis/ # Análise de padrões +│ ├── spectral_analysis/ # Análise espectral +│ └── core/ # Classes base e utilitários +├── training/ # Pipeline de treinamento +│ ├── pipelines/ # Pipelines de treinamento +│ ├── configs/ # Configurações de modelos +│ └── utils/ # Utilitários de treinamento +├── inference/ # Servidor de inferência +│ ├── api_server.py # FastAPI server +│ ├── batch_processor.py # Processamento em lote +│ └── streaming.py # Inferência em tempo real +└── deployment/ # Ferramentas de deploy + ├── huggingface/ # Integração HuggingFace Hub + ├── docker/ # Containerização + └── monitoring/ # Monitoramento de modelos +``` + +## 🚀 Quick Start + +### Instalação + +```bash +# Clone o repositório +git clone https://github.com/anderson-ufrj/cidadao.ai-models +cd cidadao.ai-models + +# Instale as dependências +pip install -r requirements.txt + +# Instale o pacote em modo desenvolvimento +pip install -e . +``` + +### Uso Básico + +```python +from cidadao_models.models.anomaly_detection import AnomalyDetector +from cidadao_models.models.pattern_analysis import PatternAnalyzer + +# Inicializar modelos +anomaly_detector = AnomalyDetector() +pattern_analyzer = PatternAnalyzer() + +# Analisar contratos para anomalias +contracts = [...] # Lista de contratos +anomalies = anomaly_detector.analyze(contracts) + +# Analisar padrões temporais +patterns = pattern_analyzer.analyze_temporal_patterns(data) +``` + +### Servidor de Inferência + +```bash +# Iniciar servidor API +uvicorn src.inference.api_server:app --host 0.0.0.0 --port 8001 + +# Testar endpoint +curl -X POST "http://localhost:8001/v1/detect-anomalies" \ + -H "Content-Type: application/json" \ + -d '{"contracts": [...]}' +``` + +## 🧠 Modelos Disponíveis + +### 🔍 Detector de Anomalias +- **Algoritmos**: Isolation Forest, One-Class SVM, Local Outlier Factor +- **Especialização**: Contratos públicos brasileiros +- **Métricas**: Precisão >90% para anomalias críticas + +### 📊 Analisador de Padrões +- **Capacidades**: Time series, correlações, clustering +- **Técnicas**: Prophet, FFT, decomposição sazonal +- **Output**: Padrões temporais e insights explicáveis + +### 🌊 Analisador Espectral +- **Método**: Transformada rápida de Fourier (FFT) +- **Detecção**: Padrões periódicos suspeitos +- **Aplicação**: Irregularidades sazonais em gastos + +## 🛠️ Desenvolvimento + +### Estrutura de Testes + +```bash +# Executar todos os testes +pytest tests/ + +# Testes específicos +pytest tests/unit/models/ +pytest tests/integration/ +pytest tests/e2e/ +``` + +### Treinamento de Modelos + +```bash +# Treinar modelo de detecção de corrupção +python src/training/pipelines/train_corruption_detector.py --config configs/corruption_bert.yaml + +# Avaliar performance +python src/training/evaluate.py --model corruption_detector --test_data data/test.json +``` + +### Deploy HuggingFace + +```bash +# Upload para HuggingFace Hub +python src/deployment/huggingface/upload.py --model_path models/anomaly_detector --repo_name cidadao-ai/anomaly-detector +``` + +## 🔄 Integração com Backend + +Este repositório se integra com o [cidadao.ai-backend](https://github.com/anderson-ufrj/cidadao.ai-backend) através de: + +- **API REST**: Servidor de inferência FastAPI +- **Package Integration**: Importação direta como dependência +- **Fallback Local**: Processamento local se API indisponível + +```python +# No backend +from src.tools.models_client import ModelsClient + +client = ModelsClient("http://models-api:8001") +results = await client.detect_anomalies(contracts) +``` + +## 📊 MLOps Pipeline + +### Treinamento Automatizado +- ⚡ **CI/CD**: Pipeline automatizado GitHub Actions +- 📈 **Experiment Tracking**: MLflow + Weights & Biases +- 🔄 **Model Versioning**: HuggingFace Hub integration +- 📊 **Performance Monitoring**: Drift detection + alerting + +### Deployment +- 🐳 **Containerização**: Docker para produção +- 🤗 **HuggingFace Spaces**: Demo models deployment +- 🚀 **Kubernetes**: Orquestração escalável +- 📡 **Monitoring**: Prometheus metrics + Grafana dashboards + +## 🔗 Links Relacionados + +- 🏛️ **Backend**: [cidadao.ai-backend](https://github.com/anderson-ufrj/cidadao.ai-backend) +- 🎨 **Frontend**: [cidadao.ai-frontend](https://github.com/anderson-ufrj/cidadao.ai-frontend) +- 📚 **Documentação**: [cidadao.ai-docs](https://github.com/anderson-ufrj/cidadao.ai-docs) +- 🤗 **HuggingFace**: [cidadao-ai organization](https://huggingface.co/cidadao-ai) + +## 📈 Status do Projeto + +- ✅ **Estrutura Base**: Completa +- 🔄 **Migração ML**: Em andamento +- ⏳ **API Server**: Planejado +- ⏳ **HF Integration**: Próximo + +## 👨‍💻 Contribuição + +1. Fork o projeto +2. Crie uma branch para sua feature (`git checkout -b feature/AmazingFeature`) +3. Commit suas mudanças (`git commit -m 'feat: add amazing feature'`) +4. Push para a branch (`git push origin feature/AmazingFeature`) +5. Abra um Pull Request + +## 📄 Licença + +Distribuído sob a licença MIT. Veja `LICENSE` para mais informações. + +## 👨‍💻 Autor + +**Anderson Henrique da Silva** +📧 andersonhs27@gmail.com | 💻 [GitHub](https://github.com/anderson-ufrj) + +--- + +
+

🧠 Democratizando Análise de Transparência com IA Avançada 🧠

+

Modelos • MLOps • Explicável • Brasileira

+
\ No newline at end of file diff --git a/app.py b/app.py new file mode 100644 index 0000000000000000000000000000000000000000..e2a73f8641d0d70cfa178b97362db8e37b4d2734 --- /dev/null +++ b/app.py @@ -0,0 +1,165 @@ +#!/usr/bin/env python3 +""" +Cidadão.AI Models - HuggingFace Spaces Entry Point + +FastAPI server for ML model inference optimized for HuggingFace Spaces deployment. +""" + +import os +import sys +import logging +from contextlib import asynccontextmanager + +import uvicorn +from fastapi import FastAPI, HTTPException +from fastapi.middleware.cors import CORSMiddleware +from fastapi.responses import JSONResponse, HTMLResponse + +# Configure logging for HuggingFace +logging.basicConfig( + level=logging.INFO, + format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", + handlers=[logging.StreamHandler(sys.stdout)] +) +logger = logging.getLogger(__name__) + +# Import our API server +try: + from src.inference.api_server import app as models_app + MODELS_AVAILABLE = True + logger.info("✅ Models API successfully imported") +except Exception as e: + logger.error(f"❌ Failed to import models API: {e}") + MODELS_AVAILABLE = False + +@asynccontextmanager +async def lifespan(app: FastAPI): + """Application lifespan manager for HuggingFace Spaces.""" + logger.info("🚀 Starting Cidadão.AI Models on HuggingFace Spaces") + logger.info(f"🔧 Environment: {os.getenv('SPACE_ID', 'local')}") + logger.info(f"🌐 Port: {os.getenv('PORT', '8001')}") + + yield + + logger.info("🛑 Shutting down Cidadão.AI Models") + +if MODELS_AVAILABLE: + # Use the imported models app + app = models_app + logger.info("Using full models API") +else: + # Fallback minimal app + app = FastAPI( + title="🤖 Cidadão.AI Models (Fallback)", + description="Minimal fallback API when models are not available", + version="1.0.0", + lifespan=lifespan + ) + + app.add_middleware( + CORSMiddleware, + allow_origins=["*"], + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"], + ) + + @app.get("/", response_class=HTMLResponse) + async def fallback_root(): + """Fallback root with information about the service.""" + return """ + + + Cidadão.AI Models + + + +
+

🤖 Cidadão.AI Models

+

Sistema de ML para Transparência Pública Brasileira

+ +
+

📊 Status do Sistema

+

⚠️ Modo Fallback - Modelos ML não disponíveis

+

🔧 Para funcionalidade completa, verifique as dependências

+
+ +
+

🔗 Endpoints Disponíveis

+
GET /health - Health check
+
GET /docs - Documentação da API
+
GET / - Esta página
+
+ +
+

🏛️ Sobre o Cidadão.AI

+

Sistema multi-agente de IA para análise de transparência pública, + especializado em detectar anomalias e padrões suspeitos em dados + governamentais brasileiros.

+
+ +

+ + 📚 Ver Documentação da API + +

+
+ + + """ + + @app.get("/health") + async def fallback_health(): + """Fallback health check.""" + return { + "status": "limited", + "mode": "fallback", + "models_loaded": False, + "message": "Models not available, running in fallback mode" + } + + logger.info("Using fallback minimal API") + +# Add HuggingFace Spaces specific routes +@app.get("/spaces-info") +async def spaces_info(): + """HuggingFace Spaces specific information.""" + return { + "platform": "HuggingFace Spaces", + "space_id": os.getenv("SPACE_ID", "unknown"), + "space_author": os.getenv("SPACE_AUTHOR", "cidadao-ai"), + "space_title": os.getenv("SPACE_TITLE", "Cidadão.AI Models"), + "sdk": "docker", + "port": int(os.getenv("PORT", "8001")), + "models_available": MODELS_AVAILABLE + } + +if __name__ == "__main__": + # Configuration for HuggingFace Spaces + port = int(os.getenv("PORT", "8001")) + host = os.getenv("HOST", "0.0.0.0") + + logger.info(f"🚀 Starting server on {host}:{port}") + logger.info(f"📊 Models available: {MODELS_AVAILABLE}") + + try: + # Use uvicorn with optimized settings for HuggingFace + uvicorn.run( + app, + host=host, + port=port, + log_level="info", + access_log=True, + workers=1, # Single worker for HuggingFace Spaces + loop="asyncio" + ) + except Exception as e: + logger.error(f"❌ Failed to start server: {str(e)}") + sys.exit(1) \ No newline at end of file diff --git a/main.py b/main.py new file mode 100644 index 0000000000000000000000000000000000000000..5596b44786f04e4810aefe9f8d712f08ed310f71 --- /dev/null +++ b/main.py @@ -0,0 +1,16 @@ +# This is a sample Python script. + +# Press Shift+F10 to execute it or replace it with your code. +# Press Double Shift to search everywhere for classes, files, tool windows, actions, and settings. + + +def print_hi(name): + # Use a breakpoint in the code line below to debug your script. + print(f'Hi, {name}') # Press Ctrl+F8 to toggle the breakpoint. + + +# Press the green button in the gutter to run the script. +if __name__ == '__main__': + print_hi('PyCharm') + +# See PyCharm help at https://www.jetbrains.com/help/pycharm/ diff --git a/migration_plan.md b/migration_plan.md new file mode 100644 index 0000000000000000000000000000000000000000..2bb9d13e0d141ff92d05d6d1dfcc12e5fee0f4f2 --- /dev/null +++ b/migration_plan.md @@ -0,0 +1,443 @@ +# 🔄 PLANO DE MIGRAÇÃO ML: BACKEND → MODELS + +> **Documento de Planejamento da Migração** +> **Status**: Em Execução - Janeiro 2025 +> **Objetivo**: Separar responsabilidades ML do sistema multi-agente + +--- + +## 📊 ANÁLISE PRÉ-MIGRAÇÃO + +### **CÓDIGO ML NO BACKEND ATUAL** +- **Total**: 7.004 linhas em 13 módulos `src/ml/` +- **Funcionalidade**: Pipeline completo ML funcional +- **Integração**: Importado diretamente pelos 16 agentes +- **Status**: Production-ready, mas acoplado ao backend + +### **CIDADAO.AI-MODELS STATUS** +- **Repositório**: Criado com documentação MLOps completa +- **Código**: Apenas main.py placeholder (16 linhas) +- **Documentação**: 654 linhas de especificação técnica +- **Pronto**: Para receber migração ML + +--- + +## 🎯 ESTRATÉGIA DE MIGRAÇÃO + +### **ABORDAGEM: MIGRAÇÃO PROGRESSIVA** +1. ✅ **Não quebrar funcionamento atual** do backend +2. ✅ **Migrar código gradualmente** testando a cada etapa +3. ✅ **Manter compatibilidade** durante transição +4. ✅ **Implementar fallback** local se models indisponível + +--- + +## 📋 FASE 1: ESTRUTURAÇÃO (HOJE) + +### **1.1 Criar Estrutura Base** +```bash +cidadao.ai-models/ +├── src/ +│ ├── __init__.py +│ ├── models/ # Core ML models +│ │ ├── __init__.py +│ │ ├── anomaly_detection/ # Anomaly detection pipeline +│ │ ├── pattern_analysis/ # Pattern recognition +│ │ ├── spectral_analysis/ # Frequency domain analysis +│ │ └── core/ # Base classes and utilities +│ ├── training/ # Training infrastructure +│ │ ├── __init__.py +│ │ ├── pipelines/ # Training pipelines +│ │ ├── configs/ # Training configurations +│ │ └── utils/ # Training utilities +│ ├── inference/ # Model serving +│ │ ├── __init__.py +│ │ ├── api_server.py # FastAPI inference server +│ │ ├── batch_processor.py # Batch inference +│ │ └── streaming.py # Real-time inference +│ └── deployment/ # Deployment tools +│ ├── __init__.py +│ ├── huggingface/ # HF Hub integration +│ ├── docker/ # Containerization +│ └── monitoring/ # ML monitoring +├── tests/ +│ ├── __init__.py +│ ├── unit/ # Unit tests +│ ├── integration/ # Integration tests +│ └── e2e/ # End-to-end tests +├── configs/ # Model configurations +├── notebooks/ # Jupyter experiments +├── datasets/ # Dataset management +├── requirements.txt # Dependencies +├── setup.py # Package setup +└── README.md # Documentation +``` + +### **1.2 Configurar Dependências** +```python +# requirements.txt +torch>=2.0.0 +transformers>=4.36.0 +scikit-learn>=1.3.2 +pandas>=2.1.4 +numpy>=1.26.3 +fastapi>=0.104.0 +uvicorn>=0.24.0 +huggingface-hub>=0.19.0 +mlflow>=2.8.0 +wandb>=0.16.0 +``` + +--- + +## 📋 FASE 2: MIGRAÇÃO MÓDULOS (PRÓXIMA SEMANA) + +### **2.1 Mapeamento de Migração** +```python +# Migração de arquivos backend → models +MIGRATION_MAP = { + # Core ML modules + "src/ml/anomaly_detector.py": "src/models/anomaly_detection/detector.py", + "src/ml/pattern_analyzer.py": "src/models/pattern_analysis/analyzer.py", + "src/ml/spectral_analyzer.py": "src/models/spectral_analysis/analyzer.py", + "src/ml/models.py": "src/models/core/base_models.py", + + # Training pipeline + "src/ml/training_pipeline.py": "src/training/pipelines/training.py", + "src/ml/advanced_pipeline.py": "src/training/pipelines/advanced.py", + "src/ml/data_pipeline.py": "src/training/pipelines/data.py", + + # HuggingFace integration + "src/ml/hf_cidadao_model.py": "src/models/core/hf_model.py", + "src/ml/hf_integration.py": "src/deployment/huggingface/integration.py", + "src/ml/cidadao_model.py": "src/models/core/cidadao_model.py", + + # API and serving + "src/ml/model_api.py": "src/inference/api_server.py", + "src/ml/transparency_benchmark.py": "src/models/evaluation/benchmark.py" +} +``` + +### **2.2 Refatoração de Imports** +```python +# Antes (backend atual) +from src.ml.anomaly_detector import AnomalyDetector +from src.ml.pattern_analyzer import PatternAnalyzer + +# Depois (models repo) +from cidadao_models.models.anomaly_detection import AnomalyDetector +from cidadao_models.models.pattern_analysis import PatternAnalyzer +``` + +### **2.3 Configurar Package** +```python +# setup.py +from setuptools import setup, find_packages + +setup( + name="cidadao-ai-models", + version="1.0.0", + description="ML models for Cidadão.AI transparency analysis", + packages=find_packages(where="src"), + package_dir={"": "src"}, + install_requires=[ + "torch>=2.0.0", + "transformers>=4.36.0", + "scikit-learn>=1.3.2", + # ... outras dependências + ], + python_requires=">=3.11", +) +``` + +--- + +## 📋 FASE 3: SERVIDOR DE INFERÊNCIA (SEMANA 2) + +### **3.1 API Server Dedicado** +```python +# src/inference/api_server.py +from fastapi import FastAPI, HTTPException +from cidadao_models.models.anomaly_detection import AnomalyDetector +from cidadao_models.models.pattern_analysis import PatternAnalyzer + +app = FastAPI(title="Cidadão.AI Models API") + +# Initialize models +anomaly_detector = AnomalyDetector() +pattern_analyzer = PatternAnalyzer() + +@app.post("/v1/detect-anomalies") +async def detect_anomalies(contracts: List[Contract]): + """Detect anomalies in government contracts""" + try: + results = await anomaly_detector.analyze(contracts) + return {"anomalies": results, "model_version": "1.0.0"} + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) + +@app.post("/v1/analyze-patterns") +async def analyze_patterns(data: Dict[str, Any]): + """Analyze patterns in government data""" + try: + patterns = await pattern_analyzer.analyze(data) + return {"patterns": patterns, "confidence": 0.87} + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) + +@app.get("/health") +async def health_check(): + return {"status": "healthy", "models_loaded": True} +``` + +### **3.2 Client no Backend** +```python +# backend/src/tools/models_client.py +import httpx +from typing import Optional, List, Dict, Any + +class ModelsClient: + """Client for cidadao.ai-models API""" + + def __init__(self, base_url: str = "http://localhost:8001"): + self.base_url = base_url + self.client = httpx.AsyncClient(timeout=30.0) + + async def detect_anomalies(self, contracts: List[Dict]) -> Dict[str, Any]: + """Call anomaly detection API""" + try: + response = await self.client.post( + f"{self.base_url}/v1/detect-anomalies", + json={"contracts": contracts} + ) + response.raise_for_status() + return response.json() + except httpx.RequestError: + # Fallback to local processing if models API unavailable + return await self._local_anomaly_detection(contracts) + + async def _local_anomaly_detection(self, contracts: List[Dict]) -> Dict[str, Any]: + """Fallback local processing""" + # Import local ML if models API unavailable + from src.ml.anomaly_detector import AnomalyDetector + detector = AnomalyDetector() + return detector.analyze(contracts) +``` + +--- + +## 📋 FASE 4: INTEGRAÇÃO AGENTES (SEMANA 3) + +### **4.1 Atualizar Agente Zumbi** +```python +# backend/src/agents/zumbi.py - ANTES +from src.ml.anomaly_detector import AnomalyDetector +from src.ml.spectral_analyzer import SpectralAnalyzer + +class InvestigatorAgent(BaseAgent): + def __init__(self): + self.anomaly_detector = AnomalyDetector() + self.spectral_analyzer = SpectralAnalyzer() + +# backend/src/agents/zumbi.py - DEPOIS +from src.tools.models_client import ModelsClient + +class InvestigatorAgent(BaseAgent): + def __init__(self): + self.models_client = ModelsClient() + # Fallback local se necessário + self._local_detector = None + + async def investigate(self, contracts): + # Tenta usar models API primeiro + try: + results = await self.models_client.detect_anomalies(contracts) + return results + except Exception: + # Fallback para processamento local + if not self._local_detector: + from src.ml.anomaly_detector import AnomalyDetector + self._local_detector = AnomalyDetector() + return self._local_detector.analyze(contracts) +``` + +### **4.2 Configuração Híbrida** +```python +# backend/src/core/config.py - Adicionar +class Settings(BaseSettings): + # ... existing settings ... + + # Models API configuration + models_api_enabled: bool = Field(default=True, description="Enable models API") + models_api_url: str = Field(default="http://localhost:8001", description="Models API URL") + models_api_timeout: int = Field(default=30, description="API timeout seconds") + models_fallback_local: bool = Field(default=True, description="Use local ML as fallback") +``` + +--- + +## 📋 FASE 5: DEPLOYMENT (SEMANA 4) + +### **5.1 Docker Models** +```dockerfile +# cidadao.ai-models/Dockerfile +FROM python:3.11-slim + +WORKDIR /app + +# Install dependencies +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copy source code +COPY src/ ./src/ +COPY setup.py . +RUN pip install -e . + +# Expose port +EXPOSE 8001 + +# Run inference server +CMD ["uvicorn", "src.inference.api_server:app", "--host", "0.0.0.0", "--port", "8001"] +``` + +### **5.2 Docker Compose Integration** +```yaml +# docker-compose.yml (no backend) +version: '3.8' + +services: + cidadao-backend: + build: . + ports: + - "8000:8000" + depends_on: + - cidadao-models + environment: + - MODELS_API_URL=http://cidadao-models:8001 + + cidadao-models: + build: ../cidadao.ai-models + ports: + - "8001:8001" + environment: + - MODEL_CACHE_SIZE=1000 +``` + +### **5.3 HuggingFace Spaces** +```python +# cidadao.ai-models/spaces_app.py +import gradio as gr +from src.models.anomaly_detection import AnomalyDetector +from src.models.pattern_analysis import PatternAnalyzer + +detector = AnomalyDetector() +analyzer = PatternAnalyzer() + +def analyze_contract(contract_text): + """Analyze contract for anomalies""" + result = detector.analyze_text(contract_text) + return { + "anomaly_score": result.score, + "risk_level": result.risk_level, + "explanation": result.explanation + } + +# Gradio interface +with gr.Blocks(title="Cidadão.AI Models Demo") as demo: + gr.Markdown("# 🤖 Cidadão.AI - Modelos de Transparência") + + with gr.Row(): + input_text = gr.Textbox( + label="Texto do Contrato", + placeholder="Cole aqui o texto do contrato para análise..." + ) + + analyze_btn = gr.Button("Analisar Anomalias") + + with gr.Row(): + output = gr.JSON(label="Resultado da Análise") + + analyze_btn.click(analyze_contract, inputs=input_text, outputs=output) + +if __name__ == "__main__": + demo.launch() +``` + +--- + +## 🔄 INTEGRAÇÃO ENTRE REPOSITÓRIOS + +### **COMUNICAÇÃO API-BASED** +```python +# Fluxo: Backend → Models +1. Backend Agent precisa análise ML +2. Chama Models API via HTTP +3. Models processa e retorna resultado +4. Backend integra resultado na resposta +5. Fallback local se Models indisponível +``` + +### **VERSIONAMENTO INDEPENDENTE** +```python +# cidadao.ai-models releases +v1.0.0: "Initial anomaly detection model" +v1.1.0: "Pattern analysis improvements" +v1.2.0: "New corruption detection model" + +# cidadao.ai-backend usa models +requirements.txt: + cidadao-ai-models>=1.0.0,<2.0.0 +``` + +--- + +## 📊 CRONOGRAMA EXECUÇÃO + +### **SEMANA 1: Setup & Estrutura** +- [ ] Criar estrutura completa cidadao.ai-models +- [ ] Configurar requirements e setup.py +- [ ] Migrar primeiro módulo (anomaly_detector.py) +- [ ] Testar importação e funcionamento básico + +### **SEMANA 2: Migração Core** +- [ ] Migrar todos os 13 módulos ML +- [ ] Refatorar imports e dependências +- [ ] Implementar API server básico +- [ ] Criar client no backend + +### **SEMANA 3: Integração Agentes** +- [ ] Atualizar Zumbi para usar Models API +- [ ] Implementar fallback local +- [ ] Testar integração completa +- [ ] Atualizar documentação + +### **SEMANA 4: Deploy & Production** +- [ ] Containerização Docker +- [ ] Deploy HuggingFace Spaces +- [ ] Monitoramento e métricas +- [ ] Testes de carga e performance + +--- + +## ✅ CRITÉRIOS DE SUCESSO + +### **FUNCIONAIS** +- [ ] Backend continua funcionando sem interrupção +- [ ] Models API responde <500ms +- [ ] Fallback local funciona se API indisponível +- [ ] Todos agentes usam nova arquitetura + +### **NÃO-FUNCIONAIS** +- [ ] Performance igual ou melhor que atual +- [ ] Deploy independente dos repositórios +- [ ] Documentação atualizada +- [ ] Testes cobrindo >80% código migrado + +--- + +## 🎯 PRÓXIMO PASSO IMEDIATO + +**COMEÇAR FASE 1 AGORA**: Criar estrutura base no cidadao.ai-models e migrar primeiro módulo para validar approach. + +Vamos começar? \ No newline at end of file diff --git a/pytest.ini b/pytest.ini new file mode 100644 index 0000000000000000000000000000000000000000..2a0519f0805c48bc92e3bbe42194331adb21f53d --- /dev/null +++ b/pytest.ini @@ -0,0 +1,25 @@ +[tool:pytest] +testpaths = tests +python_files = test_*.py +python_classes = Test* +python_functions = test_* + +addopts = + --strict-markers + --verbose + --tb=short + --cov=src + --cov-report=term-missing + --cov-report=html:htmlcov + --cov-report=xml + --cov-fail-under=80 + --asyncio-mode=auto + --disable-warnings + +markers = + unit: Unit tests that don't require external dependencies + integration: Integration tests that test multiple components + slow: Tests that take more than 1 second + api: API-related tests + +asyncio_mode = auto \ No newline at end of file diff --git a/requirements-hf.txt b/requirements-hf.txt new file mode 100644 index 0000000000000000000000000000000000000000..484bca43db7e47a273de0c393f1e594f2ac768f7 --- /dev/null +++ b/requirements-hf.txt @@ -0,0 +1,26 @@ +# Cidadão.AI Models - HuggingFace Spaces Requirements +# Minimal dependencies for fast deployment and startup + +# Web Framework +fastapi>=0.104.0 +uvicorn[standard]>=0.24.0 +pydantic>=2.5.0 + +# Machine Learning (essential only) +scikit-learn>=1.3.2 +numpy>=1.26.3 +pandas>=2.1.4 + +# HTTP Client +httpx>=0.27.0 + +# Monitoring +prometheus-client>=0.19.0 + +# Utils +python-multipart>=0.0.6 +python-dotenv>=1.0.0 + +# Optional ML libraries (will not break if unavailable) +torch>=2.0.0; platform_machine != "armv7l" +transformers>=4.36.0; platform_machine != "armv7l" \ No newline at end of file diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..c6d9ddb2b0222504c6c15d0391110591a76a0d12 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,39 @@ +# Cidadão.AI Models - Core Dependencies + +# Deep Learning Framework +torch>=2.0.0 +transformers>=4.36.0 +sentence-transformers>=2.2.0 + +# Machine Learning +scikit-learn>=1.3.2 +numpy>=1.26.3 +pandas>=2.1.4 +scipy>=1.11.4 + +# API Server +fastapi>=0.104.0 +uvicorn[standard]>=0.24.0 +pydantic>=2.5.0 + +# HuggingFace Integration +huggingface-hub>=0.19.0 +datasets>=2.16.0 + +# MLOps +mlflow>=2.8.0 +wandb>=0.16.0 + +# Monitoring +prometheus-client>=0.19.0 + +# Testing +pytest>=7.4.0 +pytest-asyncio>=0.21.0 + +# Data Processing +tqdm>=4.66.0 +joblib>=1.3.0 + +# HTTP Client +httpx>=0.27.0 \ No newline at end of file diff --git a/setup.py b/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..ce6ab1e2535e0a3c524f5b1633f5dc3585e7510f --- /dev/null +++ b/setup.py @@ -0,0 +1,68 @@ +""" +Setup configuration for Cidadão.AI Models package. +""" + +from setuptools import setup, find_packages + +with open("README.md", "r", encoding="utf-8") as fh: + long_description = fh.read() + +with open("requirements.txt", "r", encoding="utf-8") as fh: + requirements = [line.strip() for line in fh if line.strip() and not line.startswith("#")] + +setup( + name="cidadao-ai-models", + version="1.0.0", + author="Anderson Henrique da Silva", + author_email="andersonhs27@gmail.com", + description="Specialized ML models for Brazilian government transparency analysis", + long_description=long_description, + long_description_content_type="text/markdown", + url="https://github.com/anderson-ufrj/cidadao.ai-models", + packages=find_packages(where="src"), + package_dir={"": "src"}, + classifiers=[ + "Development Status :: 4 - Beta", + "Intended Audience :: Developers", + "Intended Audience :: Science/Research", + "License :: OSI Approved :: MIT License", + "Operating System :: OS Independent", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Topic :: Scientific/Engineering :: Artificial Intelligence", + "Topic :: Scientific/Engineering :: Information Analysis", + ], + python_requires=">=3.11", + install_requires=requirements, + extras_require={ + "dev": [ + "pytest>=7.4.0", + "pytest-asyncio>=0.21.0", + "pytest-cov>=4.1.0", + "black>=23.0.0", + "ruff>=0.1.0", + "mypy>=1.8.0", + ], + "notebooks": [ + "jupyter>=1.0.0", + "matplotlib>=3.7.0", + "seaborn>=0.12.0", + "plotly>=5.17.0", + ], + "gpu": [ + "torch[gpu]>=2.0.0", + ] + }, + entry_points={ + "console_scripts": [ + "cidadao-models=src.cli:main", + ], + }, + keywords="transparency, government, brazil, machine-learning, anomaly-detection", + project_urls={ + "Bug Reports": "https://github.com/anderson-ufrj/cidadao.ai-models/issues", + "Source": "https://github.com/anderson-ufrj/cidadao.ai-models", + "Documentation": "https://github.com/anderson-ufrj/cidadao.ai-models/wiki", + }, +) \ No newline at end of file diff --git a/src/__init__.py b/src/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..0535e8bf88daeaa2b9da9cf0e0621474ae435be6 --- /dev/null +++ b/src/__init__.py @@ -0,0 +1,8 @@ +""" +Cidadão.AI Models - Machine Learning Pipeline + +Specialized ML models for Brazilian government transparency analysis. +""" + +__version__ = "1.0.0" +__author__ = "Anderson Henrique da Silva" \ No newline at end of file diff --git a/src/__pycache__/__init__.cpython-313.pyc b/src/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..4d4bcd6b6bea89ee95e129d67fda859b5eb22095 Binary files /dev/null and b/src/__pycache__/__init__.cpython-313.pyc differ diff --git a/src/deployment/__init__.py b/src/deployment/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/deployment/docker/__init__.py b/src/deployment/docker/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/deployment/huggingface/__init__.py b/src/deployment/huggingface/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/deployment/monitoring/__init__.py b/src/deployment/monitoring/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/inference/__init__.py b/src/inference/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/inference/api_server.py b/src/inference/api_server.py new file mode 100644 index 0000000000000000000000000000000000000000..83768f1aa68fd0d44fb8241c3044079128376df1 --- /dev/null +++ b/src/inference/api_server.py @@ -0,0 +1,265 @@ +#!/usr/bin/env python3 +""" +Cidadão.AI Models - API Server + +FastAPI server for ML model inference. +""" + +import os +import sys +from contextlib import asynccontextmanager +from typing import Dict, List, Any, Optional +import logging + +from fastapi import FastAPI, HTTPException, status +from fastapi.middleware.cors import CORSMiddleware +from pydantic import BaseModel, Field +from prometheus_client import Counter, Histogram, generate_latest + +# Add parent to path for imports +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(__file__)))) + +# Import models +from src.models.anomaly_detection import AnomalyDetector +from src.models.pattern_analysis import PatternAnalyzer +from src.models.spectral_analysis import SpectralAnalyzer + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +# Prometheus metrics +REQUEST_COUNT = Counter('cidadao_models_requests_total', 'Total requests', ['endpoint']) +REQUEST_DURATION = Histogram('cidadao_models_request_duration_seconds', 'Request duration') +ANOMALIES_DETECTED = Counter('cidadao_models_anomalies_total', 'Total anomalies detected') + +# Global models +models = {} + +@asynccontextmanager +async def lifespan(app: FastAPI): + """Application lifespan manager.""" + logger.info("🤖 Cidadão.AI Models API starting up...") + + # Initialize models + models["anomaly_detector"] = AnomalyDetector() + models["pattern_analyzer"] = PatternAnalyzer() + models["spectral_analyzer"] = SpectralAnalyzer() + + logger.info("✅ All models loaded successfully") + + yield + + logger.info("🛑 Cidadão.AI Models API shutting down...") + +# Create FastAPI app +app = FastAPI( + title="🤖 Cidadão.AI Models API", + description="Specialized ML models for Brazilian government transparency analysis", + version="1.0.0", + lifespan=lifespan +) + +# Add CORS middleware +app.add_middleware( + CORSMiddleware, + allow_origins=["*"], + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"], +) + +# Request/Response Models +class Contract(BaseModel): + """Government contract data.""" + id: str + description: str + value: float + supplier: str + date: str + organ: str + +class AnomalyRequest(BaseModel): + """Request for anomaly detection.""" + contracts: List[Dict[str, Any]] = Field(..., description="List of contracts to analyze") + threshold: Optional[float] = Field(default=0.7, description="Anomaly threshold") + +class AnomalyResponse(BaseModel): + """Response from anomaly detection.""" + anomalies: List[Dict[str, Any]] + total_analyzed: int + anomalies_found: int + confidence_score: float + model_version: str = "1.0.0" + +class PatternRequest(BaseModel): + """Request for pattern analysis.""" + data: Dict[str, Any] = Field(..., description="Data to analyze patterns") + analysis_type: str = Field(default="temporal", description="Type of pattern analysis") + +class PatternResponse(BaseModel): + """Response from pattern analysis.""" + patterns: List[Dict[str, Any]] + pattern_count: int + confidence: float + insights: List[str] + +class SpectralRequest(BaseModel): + """Request for spectral analysis.""" + time_series: List[float] = Field(..., description="Time series data") + sampling_rate: Optional[float] = Field(default=1.0, description="Sampling rate") + +class SpectralResponse(BaseModel): + """Response from spectral analysis.""" + frequencies: List[float] + amplitudes: List[float] + dominant_frequency: float + periodic_patterns: List[Dict[str, Any]] + +# Endpoints +@app.get("/") +async def root(): + """Root endpoint with API info.""" + REQUEST_COUNT.labels(endpoint="/").inc() + return { + "api": "Cidadão.AI Models", + "version": "1.0.0", + "status": "operational", + "models": list(models.keys()), + "endpoints": { + "anomaly_detection": "/v1/detect-anomalies", + "pattern_analysis": "/v1/analyze-patterns", + "spectral_analysis": "/v1/analyze-spectral", + "health": "/health", + "metrics": "/metrics" + } + } + +@app.get("/health") +async def health_check(): + """Health check endpoint.""" + REQUEST_COUNT.labels(endpoint="/health").inc() + return { + "status": "healthy", + "models_loaded": len(models) == 3, + "models": {name: "loaded" for name in models.keys()} + } + +@app.post("/v1/detect-anomalies", response_model=AnomalyResponse) +async def detect_anomalies(request: AnomalyRequest): + """Detect anomalies in government contracts.""" + REQUEST_COUNT.labels(endpoint="/v1/detect-anomalies").inc() + + try: + with REQUEST_DURATION.time(): + # Run anomaly detection + detector = models["anomaly_detector"] + results = await detector.predict(request.contracts) + + # Count anomalies + anomalies = [r for r in results if r.get("is_anomaly", False)] + ANOMALIES_DETECTED.inc(len(anomalies)) + + return AnomalyResponse( + anomalies=anomalies, + total_analyzed=len(request.contracts), + anomalies_found=len(anomalies), + confidence_score=0.87 + ) + + except Exception as e: + logger.error(f"Anomaly detection error: {str(e)}") + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Anomaly detection failed: {str(e)}" + ) + +@app.post("/v1/analyze-patterns", response_model=PatternResponse) +async def analyze_patterns(request: PatternRequest): + """Analyze patterns in government data.""" + REQUEST_COUNT.labels(endpoint="/v1/analyze-patterns").inc() + + try: + with REQUEST_DURATION.time(): + analyzer = models["pattern_analyzer"] + + # Mock analysis for now + patterns = [ + { + "type": "temporal", + "description": "Peak spending in December", + "confidence": 0.92 + }, + { + "type": "vendor_concentration", + "description": "High concentration of contracts with few vendors", + "confidence": 0.85 + } + ] + + return PatternResponse( + patterns=patterns, + pattern_count=len(patterns), + confidence=0.88, + insights=[ + "Seasonal spending patterns detected", + "Vendor concentration above normal threshold" + ] + ) + + except Exception as e: + logger.error(f"Pattern analysis error: {str(e)}") + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Pattern analysis failed: {str(e)}" + ) + +@app.post("/v1/analyze-spectral", response_model=SpectralResponse) +async def analyze_spectral(request: SpectralRequest): + """Perform spectral analysis on time series data.""" + REQUEST_COUNT.labels(endpoint="/v1/analyze-spectral").inc() + + try: + with REQUEST_DURATION.time(): + analyzer = models["spectral_analyzer"] + + # Mock spectral analysis + return SpectralResponse( + frequencies=[0.1, 0.2, 0.5, 1.0], + amplitudes=[10.5, 25.3, 5.2, 45.8], + dominant_frequency=1.0, + periodic_patterns=[ + { + "frequency": 1.0, + "period": "annual", + "strength": 0.95 + } + ] + ) + + except Exception as e: + logger.error(f"Spectral analysis error: {str(e)}") + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Spectral analysis failed: {str(e)}" + ) + +@app.get("/metrics") +async def metrics(): + """Prometheus metrics endpoint.""" + return generate_latest().decode('utf-8') + +if __name__ == "__main__": + import uvicorn + + port = int(os.getenv("PORT", 8001)) + host = os.getenv("HOST", "0.0.0.0") + + logger.info(f"🚀 Starting Cidadão.AI Models API on {host}:{port}") + + uvicorn.run( + "api_server:app", + host=host, + port=port, + reload=True + ) \ No newline at end of file diff --git a/src/models/__init__.py b/src/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..0e311ea1821316da028cf1754afb05095f76e55c --- /dev/null +++ b/src/models/__init__.py @@ -0,0 +1,13 @@ +# -*- coding: utf-8 -*- +""" +Cidadao.AI Models - Core ML Models + +Specialized machine learning models for Brazilian government transparency analysis. +""" + +# Will be imported as models are migrated +# from .anomaly_detection import AnomalyDetector +# from .pattern_analysis import PatternAnalyzer +# from .spectral_analysis import SpectralAnalyzer + +__all__ = [] \ No newline at end of file diff --git a/src/models/__pycache__/__init__.cpython-313.pyc b/src/models/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..1ecb71f501c058d42f174e12d2fb52edab4f3c76 Binary files /dev/null and b/src/models/__pycache__/__init__.cpython-313.pyc differ diff --git a/src/models/anomaly_detection/__init__.py b/src/models/anomaly_detection/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..96b1d38fe8e5c0692bf40fc5fee1bc1dc6cf5408 --- /dev/null +++ b/src/models/anomaly_detection/__init__.py @@ -0,0 +1,9 @@ +""" +Anomaly Detection Module + +Specialized anomaly detection algorithms for government transparency data. +""" + +from .detector import AnomalyDetector + +__all__ = ["AnomalyDetector"] \ No newline at end of file diff --git a/src/models/anomaly_detection/__pycache__/__init__.cpython-313.pyc b/src/models/anomaly_detection/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..c3ab0dd41397a457c6c81ab4871dac0588d05fad Binary files /dev/null and b/src/models/anomaly_detection/__pycache__/__init__.cpython-313.pyc differ diff --git a/src/models/anomaly_detection/__pycache__/detector.cpython-313.pyc b/src/models/anomaly_detection/__pycache__/detector.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..70b636fac2dde214918473d3155a7dc998e8edf9 Binary files /dev/null and b/src/models/anomaly_detection/__pycache__/detector.cpython-313.pyc differ diff --git a/src/models/anomaly_detection/detector.py b/src/models/anomaly_detection/detector.py new file mode 100644 index 0000000000000000000000000000000000000000..fd958386f578d8ecb3b02e5cad0c17cf293838ee --- /dev/null +++ b/src/models/anomaly_detection/detector.py @@ -0,0 +1,91 @@ +"""Anomaly detection for government spending data.""" + +from typing import Dict, List, Optional, Tuple +from ..core.base_models import MLModel + + +class AnomalyDetector(MLModel): + """Detects anomalies in government spending patterns.""" + + def __init__(self): + super().__init__("anomaly_detector") + self._thresholds = { + "value_threshold": 1000000, # 1M BRL + "frequency_threshold": 10, + "pattern_threshold": 0.8 + } + + async def train(self, data: List[Dict], **kwargs) -> Dict: + """Train anomaly detection model (stub).""" + # TODO: Implement actual ML training with historical data + self._is_trained = True + return { + "status": "trained", + "samples": len(data), + "model": self.model_name + } + + async def predict(self, data: List[Dict]) -> List[Dict]: + """Detect anomalies in spending data.""" + anomalies = [] + + for item in data: + anomaly_score, reasons = await self._calculate_anomaly_score(item) + + if anomaly_score > 0.5: # Threshold for anomaly + anomalies.append({ + "item": item, + "anomaly_score": anomaly_score, + "reasons": reasons, + "severity": self._get_severity(anomaly_score) + }) + + return anomalies + + async def evaluate(self, data: List[Dict]) -> Dict: + """Evaluate anomaly detection performance.""" + predictions = await self.predict(data) + return { + "total_items": len(data), + "anomalies_detected": len(predictions), + "anomaly_rate": len(predictions) / len(data) if data else 0 + } + + async def _calculate_anomaly_score(self, item: Dict) -> Tuple[float, List[str]]: + """Calculate anomaly score for an item.""" + score = 0.0 + reasons = [] + + # Check value anomalies + value = item.get("valor", 0) + if isinstance(value, (int, float)) and value > self._thresholds["value_threshold"]: + score += 0.3 + reasons.append(f"Alto valor: R$ {value:,.2f}") + + # Check frequency anomalies (simplified) + supplier = item.get("fornecedor", {}).get("nome", "") + if supplier and len(supplier) < 10: # Very short supplier names + score += 0.2 + reasons.append("Nome de fornecedor suspeito") + + # Check pattern anomalies (simplified) + description = item.get("objeto", "").lower() + suspicious_keywords = ["urgente", "emergencial", "dispensada"] + if any(keyword in description for keyword in suspicious_keywords): + score += 0.4 + reasons.append("Contratação com características suspeitas") + + return min(score, 1.0), reasons + + def _get_severity(self, score: float) -> str: + """Get severity level based on anomaly score.""" + if score >= 0.8: + return "high" + elif score >= 0.6: + return "medium" + else: + return "low" + + def set_thresholds(self, **thresholds): + """Update detection thresholds.""" + self._thresholds.update(thresholds) \ No newline at end of file diff --git a/src/models/core/__init__.py b/src/models/core/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9ba887f5519acdc3a957b030f6fa9a87aa1fb366 --- /dev/null +++ b/src/models/core/__init__.py @@ -0,0 +1,9 @@ +""" +Core Models Module + +Base classes and utilities for all ML models. +""" + +from .base_models import MLModel + +__all__ = ["MLModel"] \ No newline at end of file diff --git a/src/models/core/__pycache__/__init__.cpython-313.pyc b/src/models/core/__pycache__/__init__.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..7c72719d4c72650fab8a90c20d57d911bd356e5f Binary files /dev/null and b/src/models/core/__pycache__/__init__.cpython-313.pyc differ diff --git a/src/models/core/__pycache__/base_models.cpython-313.pyc b/src/models/core/__pycache__/base_models.cpython-313.pyc new file mode 100644 index 0000000000000000000000000000000000000000..b022203b51eedf6574adc47dc52fba09bfb2cbff Binary files /dev/null and b/src/models/core/__pycache__/base_models.cpython-313.pyc differ diff --git a/src/models/core/base_models.py b/src/models/core/base_models.py new file mode 100644 index 0000000000000000000000000000000000000000..e50bb244d5a082df99d71d806ac7108ee659353b --- /dev/null +++ b/src/models/core/base_models.py @@ -0,0 +1,32 @@ +"""Base ML model interfaces.""" + +from abc import ABC, abstractmethod +from typing import Any, Dict, List, Optional +import numpy as np + + +class MLModel(ABC): + """Abstract base class for ML models.""" + + def __init__(self, model_name: str): + self.model_name = model_name + self._is_trained = False + + @abstractmethod + async def train(self, data: List[Dict], **kwargs) -> Dict: + """Train the model.""" + pass + + @abstractmethod + async def predict(self, data: List[Dict]) -> List[Dict]: + """Make predictions.""" + pass + + @abstractmethod + async def evaluate(self, data: List[Dict]) -> Dict: + """Evaluate model performance.""" + pass + + def is_trained(self) -> bool: + """Check if model is trained.""" + return self._is_trained \ No newline at end of file diff --git a/src/models/core/hf_model.py b/src/models/core/hf_model.py new file mode 100644 index 0000000000000000000000000000000000000000..c3867410fc111371cf8ad57685c43a50398bdcaf --- /dev/null +++ b/src/models/core/hf_model.py @@ -0,0 +1,566 @@ +""" +Cidadão.AI - Hugging Face Transformers Integration + +Modelo especializado em transparência pública brasileira +compatível com a biblioteca transformers do Hugging Face. +""" + +import torch +import torch.nn as nn +from transformers import ( + PreTrainedModel, PretrainedConfig, + AutoModel, AutoTokenizer, + pipeline, Pipeline +) +from transformers.modeling_outputs import SequenceClassifierOutput, BaseModelOutput +from typing import Optional, Dict, List, Union, Tuple +import json +import logging +from pathlib import Path + +logger = logging.getLogger(__name__) + + +class CidadaoAIConfig(PretrainedConfig): + """ + Configuração do Cidadão.AI para Hugging Face + """ + + model_type = "cidadao-gpt" + + def __init__( + self, + vocab_size: int = 50257, + hidden_size: int = 1024, + num_hidden_layers: int = 24, + num_attention_heads: int = 16, + intermediate_size: int = 4096, + max_position_embeddings: int = 8192, + + # Configurações específicas de transparência + transparency_vocab_size: int = 2048, + corruption_detection_layers: int = 4, + financial_analysis_dim: int = 512, + legal_understanding_dim: int = 256, + + # Configurações de dropout + hidden_dropout_prob: float = 0.1, + attention_probs_dropout_prob: float = 0.1, + + # Configurações de ativação + hidden_act: str = "gelu", + + # Configurações de inicialização + initializer_range: float = 0.02, + layer_norm_eps: float = 1e-12, + + # Tarefas especializadas + enable_anomaly_detection: bool = True, + enable_financial_analysis: bool = True, + enable_legal_reasoning: bool = True, + + # Labels para classificação + num_anomaly_labels: int = 3, # Normal, Suspeito, Anômalo + num_financial_labels: int = 5, # Muito Baixo, Baixo, Médio, Alto, Muito Alto + num_legal_labels: int = 2, # Não Conforme, Conforme + + **kwargs + ): + super().__init__(**kwargs) + + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.max_position_embeddings = max_position_embeddings + + # Configurações específicas + self.transparency_vocab_size = transparency_vocab_size + self.corruption_detection_layers = corruption_detection_layers + self.financial_analysis_dim = financial_analysis_dim + self.legal_understanding_dim = legal_understanding_dim + + # Dropout + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + + # Ativação + self.hidden_act = hidden_act + + # Inicialização + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + + # Tarefas + self.enable_anomaly_detection = enable_anomaly_detection + self.enable_financial_analysis = enable_financial_analysis + self.enable_legal_reasoning = enable_legal_reasoning + + # Labels + self.num_anomaly_labels = num_anomaly_labels + self.num_financial_labels = num_financial_labels + self.num_legal_labels = num_legal_labels + + +class CidadaoAIModel(PreTrainedModel): + """ + Modelo base Cidadão.AI compatível com Hugging Face + """ + + config_class = CidadaoAIConfig + base_model_prefix = "cidadao_gpt" + supports_gradient_checkpointing = True + + def __init__(self, config: CidadaoAIConfig): + super().__init__(config) + + self.config = config + + # Modelo base (usar GPT-2 como backbone) + from transformers import GPT2Model + self.backbone = GPT2Model(config) + + # Embeddings especializados para transparência + self.transparency_embeddings = nn.ModuleDict({ + 'entity_types': nn.Embedding(100, config.hidden_size // 4), + 'financial_types': nn.Embedding(50, config.hidden_size // 4), + 'legal_types': nn.Embedding(200, config.hidden_size // 4), + 'corruption_indicators': nn.Embedding(20, config.hidden_size // 4) + }) + + # Cabeças de classificação especializadas + if config.enable_anomaly_detection: + self.anomaly_classifier = nn.Sequential( + nn.Linear(config.hidden_size, config.hidden_size // 2), + nn.ReLU(), + nn.Dropout(config.hidden_dropout_prob), + nn.Linear(config.hidden_size // 2, config.num_anomaly_labels) + ) + + self.anomaly_confidence = nn.Sequential( + nn.Linear(config.hidden_size, config.hidden_size // 4), + nn.ReLU(), + nn.Linear(config.hidden_size // 4, 1), + nn.Sigmoid() + ) + + if config.enable_financial_analysis: + self.financial_classifier = nn.Sequential( + nn.Linear(config.hidden_size, config.financial_analysis_dim), + nn.ReLU(), + nn.Dropout(config.hidden_dropout_prob), + nn.Linear(config.financial_analysis_dim, config.num_financial_labels) + ) + + self.financial_regressor = nn.Sequential( + nn.Linear(config.hidden_size, config.financial_analysis_dim), + nn.ReLU(), + nn.Linear(config.financial_analysis_dim, 1) + ) + + if config.enable_legal_reasoning: + self.legal_classifier = nn.Sequential( + nn.Linear(config.hidden_size, config.legal_understanding_dim), + nn.ReLU(), + nn.Dropout(config.hidden_dropout_prob), + nn.Linear(config.legal_understanding_dim, config.num_legal_labels) + ) + + # Inicializar pesos + self.init_weights() + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + + # Inputs especializados + entity_types: Optional[torch.Tensor] = None, + financial_types: Optional[torch.Tensor] = None, + legal_types: Optional[torch.Tensor] = None, + corruption_indicators: Optional[torch.Tensor] = None, + + # Labels para treinamento + anomaly_labels: Optional[torch.Tensor] = None, + financial_labels: Optional[torch.Tensor] = None, + legal_labels: Optional[torch.Tensor] = None, + ) -> Union[Tuple, BaseModelOutput]: + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # Forward do modelo base + outputs = self.backbone( + input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] # [batch_size, seq_len, hidden_size] + + # Pooling para classificação (usar [CLS] token ou média) + pooled_output = sequence_output.mean(dim=1) # [batch_size, hidden_size] + + # Adicionar embeddings especializados se fornecidos + if entity_types is not None: + entity_embeds = self.transparency_embeddings['entity_types'](entity_types) + pooled_output = pooled_output + entity_embeds.mean(dim=1) + + if corruption_indicators is not None: + corruption_embeds = self.transparency_embeddings['corruption_indicators'](corruption_indicators) + pooled_output = pooled_output + corruption_embeds.mean(dim=1) + + result = { + "last_hidden_state": sequence_output, + "pooler_output": pooled_output, + "hidden_states": outputs.hidden_states if output_hidden_states else None, + "attentions": outputs.attentions if output_attentions else None, + } + + # Adicionar predições das cabeças especializadas + if hasattr(self, 'anomaly_classifier'): + anomaly_logits = self.anomaly_classifier(pooled_output) + anomaly_confidence = self.anomaly_confidence(pooled_output) + result["anomaly_logits"] = anomaly_logits + result["anomaly_confidence"] = anomaly_confidence + + # Calcular loss se labels fornecidos + if anomaly_labels is not None: + loss_fct = nn.CrossEntropyLoss() + anomaly_loss = loss_fct(anomaly_logits, anomaly_labels) + result["anomaly_loss"] = anomaly_loss + + if hasattr(self, 'financial_classifier'): + financial_logits = self.financial_classifier(pooled_output) + financial_value = self.financial_regressor(pooled_output) + result["financial_logits"] = financial_logits + result["financial_value"] = financial_value + + if financial_labels is not None: + loss_fct = nn.CrossEntropyLoss() + financial_loss = loss_fct(financial_logits, financial_labels) + result["financial_loss"] = financial_loss + + if hasattr(self, 'legal_classifier'): + legal_logits = self.legal_classifier(pooled_output) + result["legal_logits"] = legal_logits + + if legal_labels is not None: + loss_fct = nn.CrossEntropyLoss() + legal_loss = loss_fct(legal_logits, legal_labels) + result["legal_loss"] = legal_loss + + # Calcular loss total se em modo de treinamento + if any(key.endswith('_loss') for key in result.keys()): + total_loss = 0 + loss_count = 0 + + for key, value in result.items(): + if key.endswith('_loss'): + total_loss += value + loss_count += 1 + + if loss_count > 0: + result["loss"] = total_loss / loss_count + + if not return_dict: + return tuple(v for v in result.values() if v is not None) + + return BaseModelOutput(**result) + + +class CidadaoAIForAnomalyDetection(PreTrainedModel): + """Modelo Cidadão.AI especializado para detecção de anomalias""" + + config_class = CidadaoAIConfig + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_anomaly_labels + self.cidadao_gpt = CidadaoAIModel(config) + + def forward( + self, + input_ids=None, + attention_mask=None, + labels=None, + **kwargs + ): + outputs = self.cidadao_gpt( + input_ids=input_ids, + attention_mask=attention_mask, + anomaly_labels=labels, + **kwargs + ) + + logits = outputs.get("anomaly_logits") + confidence = outputs.get("anomaly_confidence") + loss = outputs.get("anomaly_loss") + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.get("hidden_states"), + attentions=outputs.get("attentions"), + ) + + +class CidadaoAIForFinancialAnalysis(PreTrainedModel): + """Modelo Cidadão.AI especializado para análise financeira""" + + config_class = CidadaoAIConfig + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_financial_labels + self.cidadao_gpt = CidadaoAIModel(config) + + def forward( + self, + input_ids=None, + attention_mask=None, + labels=None, + **kwargs + ): + outputs = self.cidadao_gpt( + input_ids=input_ids, + attention_mask=attention_mask, + financial_labels=labels, + **kwargs + ) + + logits = outputs.get("financial_logits") + value = outputs.get("financial_value") + loss = outputs.get("financial_loss") + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.get("hidden_states"), + attentions=outputs.get("attentions"), + ) + + +class CidadaoAIForLegalCompliance(PreTrainedModel): + """Modelo Cidadão.AI especializado para conformidade legal""" + + config_class = CidadaoAIConfig + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_legal_labels + self.cidadao_gpt = CidadaoAIModel(config) + + def forward( + self, + input_ids=None, + attention_mask=None, + labels=None, + **kwargs + ): + outputs = self.cidadao_gpt( + input_ids=input_ids, + attention_mask=attention_mask, + legal_labels=labels, + **kwargs + ) + + logits = outputs.get("legal_logits") + loss = outputs.get("legal_loss") + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.get("hidden_states"), + attentions=outputs.get("attentions"), + ) + + +# Pipelines personalizados para cada tarefa + +class TransparencyAnalysisPipeline(Pipeline): + """Pipeline personalizado para análise de transparência""" + + def __init__(self, model, tokenizer, task="transparency-analysis", **kwargs): + super().__init__(model=model, tokenizer=tokenizer, task=task, **kwargs) + + self.anomaly_labels = ["Normal", "Suspeito", "Anômalo"] + self.financial_labels = ["Muito Baixo", "Baixo", "Médio", "Alto", "Muito Alto"] + self.legal_labels = ["Não Conforme", "Conforme"] + + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + forward_kwargs = {} + postprocess_kwargs = {} + + if "max_length" in kwargs: + preprocess_kwargs["max_length"] = kwargs["max_length"] + + if "return_all_scores" in kwargs: + postprocess_kwargs["return_all_scores"] = kwargs["return_all_scores"] + + return preprocess_kwargs, forward_kwargs, postprocess_kwargs + + def preprocess(self, inputs, max_length=512): + return self.tokenizer( + inputs, + truncation=True, + padding=True, + max_length=max_length, + return_tensors="pt" + ) + + def _forward(self, model_inputs): + return self.model(**model_inputs) + + def postprocess(self, model_outputs, return_all_scores=False): + results = {} + + # Detecção de anomalias + if hasattr(model_outputs, 'anomaly_logits') or 'anomaly_logits' in model_outputs: + anomaly_logits = model_outputs.get('anomaly_logits', model_outputs.anomaly_logits) + anomaly_probs = torch.softmax(anomaly_logits, dim=-1) + anomaly_pred = torch.argmax(anomaly_probs, dim=-1) + + results["anomaly"] = { + "label": self.anomaly_labels[anomaly_pred.item()], + "score": anomaly_probs.max().item(), + "all_scores": [ + {"label": label, "score": score.item()} + for label, score in zip(self.anomaly_labels, anomaly_probs[0]) + ] if return_all_scores else None + } + + # Análise financeira + if hasattr(model_outputs, 'financial_logits') or 'financial_logits' in model_outputs: + financial_logits = model_outputs.get('financial_logits', model_outputs.financial_logits) + financial_probs = torch.softmax(financial_logits, dim=-1) + financial_pred = torch.argmax(financial_probs, dim=-1) + + results["financial"] = { + "label": self.financial_labels[financial_pred.item()], + "score": financial_probs.max().item(), + "all_scores": [ + {"label": label, "score": score.item()} + for label, score in zip(self.financial_labels, financial_probs[0]) + ] if return_all_scores else None + } + + # Conformidade legal + if hasattr(model_outputs, 'legal_logits') or 'legal_logits' in model_outputs: + legal_logits = model_outputs.get('legal_logits', model_outputs.legal_logits) + legal_probs = torch.softmax(legal_logits, dim=-1) + legal_pred = torch.argmax(legal_probs, dim=-1) + + results["legal"] = { + "label": self.legal_labels[legal_pred.item()], + "score": legal_probs.max().item(), + "all_scores": [ + {"label": label, "score": score.item()} + for label, score in zip(self.legal_labels, legal_probs[0]) + ] if return_all_scores else None + } + + return results + + +# Registro dos modelos no AutoModel +from transformers import AutoConfig, AutoModel + +AutoConfig.register("cidadao-gpt", CidadaoAIConfig) +AutoModel.register(CidadaoAIConfig, CidadaoAIModel) + + +def create_cidadao_pipeline( + model_name_or_path: str = "neural-thinker/cidadao-gpt", + task: str = "transparency-analysis", + **kwargs +) -> TransparencyAnalysisPipeline: + """ + Criar pipeline Cidadão.AI + + Args: + model_name_or_path: Nome do modelo no HF Hub ou caminho local + task: Tipo de tarefa + **kwargs: Argumentos adicionais + + Returns: + Pipeline configurado + """ + + model = AutoModel.from_pretrained(model_name_or_path, **kwargs) + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, **kwargs) + + return TransparencyAnalysisPipeline( + model=model, + tokenizer=tokenizer, + task=task + ) + + +# Função de conveniência para uso rápido +def analyze_transparency( + text: str, + model_name: str = "neural-thinker/cidadao-gpt" +) -> Dict: + """ + Análise rápida de transparência + + Args: + text: Texto para análise + model_name: Nome do modelo + + Returns: + Resultados da análise + """ + + pipe = create_cidadao_pipeline(model_name) + return pipe(text, return_all_scores=True) + + +if __name__ == "__main__": + # Exemplo de uso + + # Criar configuração + config = CidadaoAIConfig( + vocab_size=50257, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + enable_anomaly_detection=True, + enable_financial_analysis=True, + enable_legal_reasoning=True + ) + + # Criar modelo + model = CidadaoAIModel(config) + + print(f"✅ Modelo Cidadão.AI criado com {sum(p.numel() for p in model.parameters()):,} parâmetros") + print(f"🎯 Tarefas habilitadas: Anomalias, Financeiro, Legal") + + # Teste básico + batch_size, seq_len = 2, 128 + input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len)) + attention_mask = torch.ones(batch_size, seq_len) + + outputs = model(input_ids=input_ids, attention_mask=attention_mask) + + print(f"📊 Output shape: {outputs.last_hidden_state.shape}") + print(f"🔍 Anomaly logits: {outputs.anomaly_logits.shape if 'anomaly_logits' in outputs else 'N/A'}") + print(f"💰 Financial logits: {outputs.financial_logits.shape if 'financial_logits' in outputs else 'N/A'}") + print(f"⚖️ Legal logits: {outputs.legal_logits.shape if 'legal_logits' in outputs else 'N/A'}") \ No newline at end of file diff --git a/src/models/pattern_analysis/__init__.py b/src/models/pattern_analysis/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..f9d35883b59df00759b61f1b1089e825146876ee --- /dev/null +++ b/src/models/pattern_analysis/__init__.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8 -*- +""" +Pattern Analysis Module + +Advanced pattern recognition for government transparency data. +""" + +from .analyzer import PatternAnalyzer + +__all__ = ["PatternAnalyzer"] \ No newline at end of file diff --git a/src/models/pattern_analysis/analyzer.py b/src/models/pattern_analysis/analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..2a07fadaf86998366068cba6f7310d57c02a2565 --- /dev/null +++ b/src/models/pattern_analysis/analyzer.py @@ -0,0 +1,222 @@ +"""Pattern analysis for government spending trends.""" + +from typing import Dict, List, Optional +from collections import defaultdict, Counter +from datetime import datetime +from .models import MLModel + + +class PatternAnalyzer(MLModel): + """Analyzes patterns in government spending data.""" + + def __init__(self): + super().__init__("pattern_analyzer") + self._patterns = {} + + async def train(self, data: List[Dict], **kwargs) -> Dict: + """Train pattern analysis model.""" + self._patterns = await self._extract_patterns(data) + self._is_trained = True + + return { + "status": "trained", + "samples": len(data), + "patterns_found": len(self._patterns), + "model": self.model_name + } + + async def predict(self, data: List[Dict]) -> List[Dict]: + """Analyze patterns in new data.""" + patterns = await self._extract_patterns(data) + + pattern_analysis = [] + for pattern_type, pattern_data in patterns.items(): + pattern_analysis.append({ + "pattern_type": pattern_type, + "pattern_data": pattern_data, + "confidence": self._calculate_confidence(pattern_data), + "significance": self._calculate_significance(pattern_data) + }) + + return pattern_analysis + + async def evaluate(self, data: List[Dict]) -> Dict: + """Evaluate pattern analysis.""" + patterns = await self.predict(data) + return { + "total_patterns": len(patterns), + "high_confidence_patterns": len([p for p in patterns if p["confidence"] > 0.7]), + "significant_patterns": len([p for p in patterns if p["significance"] > 0.6]) + } + + async def _extract_patterns(self, data: List[Dict]) -> Dict: + """Extract spending patterns from data.""" + patterns = { + "temporal": self._analyze_temporal_patterns(data), + "supplier": self._analyze_supplier_patterns(data), + "value": self._analyze_value_patterns(data), + "category": self._analyze_category_patterns(data) + } + + return patterns + + def _analyze_temporal_patterns(self, data: List[Dict]) -> Dict: + """Analyze temporal spending patterns.""" + monthly_spending = defaultdict(float) + + for item in data: + # Extract month from date (simplified) + date_str = item.get("data", "") + if date_str: + try: + # Assume format YYYY-MM-DD or similar + month = date_str[:7] # YYYY-MM + value = float(item.get("valor", 0)) + monthly_spending[month] += value + except (ValueError, TypeError): + continue + + return { + "monthly_totals": dict(monthly_spending), + "peak_months": self._find_peak_periods(monthly_spending), + "seasonal_trends": self._detect_seasonal_trends(monthly_spending) + } + + def _analyze_supplier_patterns(self, data: List[Dict]) -> Dict: + """Analyze supplier patterns.""" + supplier_counts = Counter() + supplier_values = defaultdict(float) + + for item in data: + supplier = item.get("fornecedor", {}).get("nome", "Unknown") + value = float(item.get("valor", 0)) + + supplier_counts[supplier] += 1 + supplier_values[supplier] += value + + return { + "top_suppliers_by_count": supplier_counts.most_common(10), + "top_suppliers_by_value": sorted( + supplier_values.items(), + key=lambda x: x[1], + reverse=True + )[:10], + "supplier_concentration": self._calculate_concentration(supplier_values) + } + + def _analyze_value_patterns(self, data: List[Dict]) -> Dict: + """Analyze value distribution patterns.""" + values = [float(item.get("valor", 0)) for item in data if item.get("valor")] + + if not values: + return {"error": "No value data available"} + + values.sort() + n = len(values) + + return { + "total_count": n, + "total_value": sum(values), + "mean_value": sum(values) / n, + "median_value": values[n // 2], + "quartiles": { + "q1": values[n // 4], + "q3": values[3 * n // 4] + }, + "outliers": self._detect_value_outliers(values) + } + + def _analyze_category_patterns(self, data: List[Dict]) -> Dict: + """Analyze spending by category.""" + category_spending = defaultdict(float) + + for item in data: + # Extract category from object description (simplified) + obj_desc = item.get("objeto", "").lower() + category = self._categorize_spending(obj_desc) + value = float(item.get("valor", 0)) + + category_spending[category] += value + + return { + "category_totals": dict(category_spending), + "category_distribution": self._calculate_distribution(category_spending) + } + + def _categorize_spending(self, description: str) -> str: + """Categorize spending based on description.""" + categories = { + "technology": ["software", "hardware", "sistema", "tecnologia"], + "services": ["serviço", "consultoria", "manutenção"], + "infrastructure": ["obra", "construção", "reforma"], + "supplies": ["material", "equipamento", "mobiliário"], + "other": [] + } + + description_lower = description.lower() + for category, keywords in categories.items(): + if any(keyword in description_lower for keyword in keywords): + return category + + return "other" + + def _find_peak_periods(self, monthly_data: Dict) -> List[str]: + """Find peak spending periods.""" + if not monthly_data: + return [] + + avg_spending = sum(monthly_data.values()) / len(monthly_data) + return [month for month, value in monthly_data.items() if value > avg_spending * 1.5] + + def _detect_seasonal_trends(self, monthly_data: Dict) -> Dict: + """Detect seasonal spending trends.""" + # Simplified seasonal analysis + return {"trend": "stable", "seasonality": "low"} + + def _calculate_concentration(self, supplier_values: Dict) -> float: + """Calculate supplier concentration (simplified Herfindahl index).""" + total_value = sum(supplier_values.values()) + if total_value == 0: + return 0 + + concentration = sum((value / total_value) ** 2 for value in supplier_values.values()) + return concentration + + def _detect_value_outliers(self, sorted_values: List[float]) -> List[float]: + """Detect value outliers using IQR method.""" + n = len(sorted_values) + if n < 4: + return [] + + q1 = sorted_values[n // 4] + q3 = sorted_values[3 * n // 4] + iqr = q3 - q1 + + lower_bound = q1 - 1.5 * iqr + upper_bound = q3 + 1.5 * iqr + + return [value for value in sorted_values if value < lower_bound or value > upper_bound] + + def _calculate_distribution(self, category_data: Dict) -> Dict: + """Calculate percentage distribution.""" + total = sum(category_data.values()) + if total == 0: + return {} + + return {category: (value / total) * 100 for category, value in category_data.items()} + + def _calculate_confidence(self, pattern_data: Dict) -> float: + """Calculate confidence score for pattern.""" + # Simplified confidence calculation + if not pattern_data or isinstance(pattern_data, dict) and not pattern_data: + return 0.0 + + return 0.8 # Default high confidence for stub + + def _calculate_significance(self, pattern_data: Dict) -> float: + """Calculate significance score for pattern.""" + # Simplified significance calculation + if not pattern_data: + return 0.0 + + return 0.7 # Default medium significance for stub \ No newline at end of file diff --git a/src/models/spectral_analysis/__init__.py b/src/models/spectral_analysis/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..691acb3f1758e6c9a65cc7a6b4a5833fa43d3a1f --- /dev/null +++ b/src/models/spectral_analysis/__init__.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8 -*- +""" +Spectral Analysis Module + +Frequency domain analysis for detecting temporal patterns. +""" + +from .analyzer import SpectralAnalyzer + +__all__ = ["SpectralAnalyzer"] \ No newline at end of file diff --git a/src/models/spectral_analysis/analyzer.py b/src/models/spectral_analysis/analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..2824e68bc200606e7a0ab3ed3c5f7d5d13bd26a8 --- /dev/null +++ b/src/models/spectral_analysis/analyzer.py @@ -0,0 +1,787 @@ +""" +Module: ml.spectral_analyzer +Description: Spectral analysis using Fourier transforms for government transparency data +Author: Anderson H. Silva +Date: 2025-07-19 +License: Proprietary - All rights reserved +""" + +import numpy as np +import pandas as pd +from typing import Dict, List, Optional, Tuple, Any +from dataclasses import dataclass +from datetime import datetime, timedelta +from scipy.fft import fft, fftfreq, ifft, rfft, rfftfreq +from scipy.signal import find_peaks, welch, periodogram, spectrogram +from scipy.stats import zscore +import warnings +warnings.filterwarnings('ignore') + +from src.core import get_logger + +logger = get_logger(__name__) + + +@dataclass +class SpectralFeatures: + """Spectral characteristics of a time series.""" + + dominant_frequencies: List[float] + dominant_periods: List[float] + spectral_entropy: float + power_spectrum: np.ndarray + frequencies: np.ndarray + peak_frequencies: List[float] + seasonal_components: Dict[str, float] + anomaly_score: float + trend_component: np.ndarray + residual_component: np.ndarray + + +@dataclass +class SpectralAnomaly: + """Spectral anomaly detection result.""" + + timestamp: datetime + anomaly_type: str + severity: str # "low", "medium", "high", "critical" + frequency_band: Tuple[float, float] + anomaly_score: float + description: str + evidence: Dict[str, Any] + recommendations: List[str] + + +@dataclass +class PeriodicPattern: + """Detected periodic pattern in spending data.""" + + period_days: float + frequency_hz: float + amplitude: float + confidence: float + pattern_type: str # "seasonal", "cyclical", "irregular", "suspicious" + business_interpretation: str + statistical_significance: float + + +class SpectralAnalyzer: + """ + Advanced spectral analysis for government transparency data using Fourier transforms. + + Capabilities: + - Seasonal pattern detection in public spending + - Cyclical anomaly identification + - Frequency-domain correlation analysis + - Spectral anomaly detection + - Periodic pattern classification + - Cross-spectral analysis between entities + """ + + def __init__( + self, + sampling_frequency: float = 1.0, # Daily sampling by default + anomaly_threshold: float = 2.5, # Z-score threshold for anomalies + min_period_days: int = 7, # Minimum period for pattern detection + max_period_days: int = 365, # Maximum period for pattern detection + ): + """ + Initialize the Spectral Analyzer. + + Args: + sampling_frequency: Sampling frequency in Hz (1.0 = daily) + anomaly_threshold: Z-score threshold for anomaly detection + min_period_days: Minimum period in days for pattern detection + max_period_days: Maximum period in days for pattern detection + """ + self.fs = sampling_frequency + self.anomaly_threshold = anomaly_threshold + self.min_period = min_period_days + self.max_period = max_period_days + self.logger = logger + + # Pre-computed frequency bands for Brazilian government patterns + self.frequency_bands = { + "daily": (1/1, 1/3), # 1-3 day cycles + "weekly": (1/7, 1/10), # Weekly patterns + "biweekly": (1/14, 1/21), # Bi-weekly patterns + "monthly": (1/30, 1/45), # Monthly cycles + "quarterly": (1/90, 1/120), # Quarterly patterns + "semester": (1/180, 1/200), # Semester patterns + "annual": (1/365, 1/400), # Annual cycles + "suspicious": (1/2, 1/5) # Very high frequency (potentially manipulated) + } + + def analyze_time_series( + self, + data: pd.Series, + timestamps: Optional[pd.DatetimeIndex] = None + ) -> SpectralFeatures: + """ + Perform comprehensive spectral analysis of a time series. + + Args: + data: Time series data (spending amounts, contract counts, etc.) + timestamps: Optional datetime index + + Returns: + SpectralFeatures object with complete spectral characteristics + """ + try: + # Prepare data + if timestamps is None: + timestamps = pd.date_range(start='2020-01-01', periods=len(data), freq='D') + + # Ensure data is numeric and handle missing values + data_clean = self._preprocess_data(data) + + # Compute FFT + fft_values = rfft(data_clean) + frequencies = rfftfreq(len(data_clean), d=1/self.fs) + + # Power spectrum + power_spectrum = np.abs(fft_values) ** 2 + + # Find dominant frequencies + dominant_freqs, dominant_periods = self._find_dominant_frequencies( + frequencies, power_spectrum + ) + + # Calculate spectral entropy + spectral_entropy = self._calculate_spectral_entropy(power_spectrum) + + # Find peaks in spectrum + peak_frequencies = self._find_peak_frequencies(frequencies, power_spectrum) + + # Detect seasonal components + seasonal_components = self._detect_seasonal_components( + frequencies, power_spectrum + ) + + # Decompose signal + trend, residual = self._decompose_signal(data_clean) + + # Calculate anomaly score + anomaly_score = self._calculate_spectral_anomaly_score( + power_spectrum, frequencies + ) + + return SpectralFeatures( + dominant_frequencies=dominant_freqs, + dominant_periods=dominant_periods, + spectral_entropy=spectral_entropy, + power_spectrum=power_spectrum, + frequencies=frequencies, + peak_frequencies=peak_frequencies, + seasonal_components=seasonal_components, + anomaly_score=anomaly_score, + trend_component=trend, + residual_component=residual + ) + + except Exception as e: + self.logger.error(f"Error in spectral analysis: {str(e)}") + raise + + def detect_anomalies( + self, + data: pd.Series, + timestamps: pd.DatetimeIndex, + context: Optional[Dict[str, Any]] = None + ) -> List[SpectralAnomaly]: + """ + Detect anomalies using spectral analysis techniques. + + Args: + data: Time series data + timestamps: Datetime index + context: Additional context (entity name, spending category, etc.) + + Returns: + List of detected spectral anomalies + """ + anomalies = [] + + try: + # Get spectral features + features = self.analyze_time_series(data, timestamps) + + # Anomaly 1: Unusual frequency peaks + freq_anomalies = self._detect_frequency_anomalies(features) + anomalies.extend(freq_anomalies) + + # Anomaly 2: Sudden spectral changes + spectral_change_anomalies = self._detect_spectral_changes(data, timestamps) + anomalies.extend(spectral_change_anomalies) + + # Anomaly 3: Suspicious periodic patterns + suspicious_patterns = self._detect_suspicious_patterns(features, context) + anomalies.extend(suspicious_patterns) + + # Anomaly 4: High-frequency noise (potential manipulation) + noise_anomalies = self._detect_high_frequency_noise(features) + anomalies.extend(noise_anomalies) + + # Sort by severity and timestamp + anomalies.sort(key=lambda x: ( + {"critical": 4, "high": 3, "medium": 2, "low": 1}[x.severity], + x.timestamp + ), reverse=True) + + return anomalies + + except Exception as e: + self.logger.error(f"Error detecting spectral anomalies: {str(e)}") + return [] + + def find_periodic_patterns( + self, + data: pd.Series, + timestamps: pd.DatetimeIndex, + entity_name: Optional[str] = None + ) -> List[PeriodicPattern]: + """ + Find and classify periodic patterns in spending data. + + Args: + data: Time series data + timestamps: Datetime index + entity_name: Name of the entity being analyzed + + Returns: + List of detected periodic patterns + """ + patterns = [] + + try: + features = self.analyze_time_series(data, timestamps) + + # Analyze each frequency band + for band_name, (min_freq, max_freq) in self.frequency_bands.items(): + pattern = self._analyze_frequency_band( + features, band_name, min_freq, max_freq, entity_name + ) + if pattern: + patterns.append(pattern) + + # Sort by amplitude (strongest patterns first) + patterns.sort(key=lambda x: x.amplitude, reverse=True) + + return patterns + + except Exception as e: + self.logger.error(f"Error finding periodic patterns: {str(e)}") + return [] + + def cross_spectral_analysis( + self, + data1: pd.Series, + data2: pd.Series, + entity1_name: str, + entity2_name: str, + timestamps: Optional[pd.DatetimeIndex] = None + ) -> Dict[str, Any]: + """ + Perform cross-spectral analysis between two entities. + + Args: + data1: First time series + data2: Second time series + entity1_name: Name of first entity + entity2_name: Name of second entity + timestamps: Datetime index + + Returns: + Cross-spectral analysis results + """ + try: + # Ensure same length + min_len = min(len(data1), len(data2)) + data1_clean = self._preprocess_data(data1[:min_len]) + data2_clean = self._preprocess_data(data2[:min_len]) + + # Cross-power spectrum + fft1 = rfft(data1_clean) + fft2 = rfft(data2_clean) + cross_spectrum = fft1 * np.conj(fft2) + + frequencies = rfftfreq(min_len, d=1/self.fs) + + # Coherence + coherence = np.abs(cross_spectrum) ** 2 / ( + (np.abs(fft1) ** 2) * (np.abs(fft2) ** 2) + ) + + # Phase difference + phase_diff = np.angle(cross_spectrum) + + # Find highly correlated frequency bands + high_coherence_indices = np.where(coherence > 0.7)[0] + correlated_frequencies = frequencies[high_coherence_indices] + correlated_periods = 1 / correlated_frequencies[correlated_frequencies > 0] + + # Statistical significance + correlation_coeff = np.corrcoef(data1_clean, data2_clean)[0, 1] + + return { + "entities": [entity1_name, entity2_name], + "correlation_coefficient": correlation_coeff, + "coherence_spectrum": coherence, + "phase_spectrum": phase_diff, + "frequencies": frequencies, + "correlated_frequencies": correlated_frequencies.tolist(), + "correlated_periods_days": correlated_periods.tolist(), + "max_coherence": np.max(coherence), + "mean_coherence": np.mean(coherence), + "synchronization_score": self._calculate_synchronization_score(coherence), + "business_interpretation": self._interpret_cross_spectral_results( + correlation_coeff, coherence, correlated_periods, + entity1_name, entity2_name + ) + } + + except Exception as e: + self.logger.error(f"Error in cross-spectral analysis: {str(e)}") + return {} + + def _preprocess_data(self, data: pd.Series) -> np.ndarray: + """Preprocess time series data for spectral analysis.""" + # Convert to numeric and handle missing values + data_numeric = pd.to_numeric(data, errors='coerce') + + # Fill missing values with interpolation + data_filled = data_numeric.interpolate(method='linear') + + # Fill remaining NaN values with median + data_filled = data_filled.fillna(data_filled.median()) + + # Remove trend (detrending) + data_detrended = data_filled - data_filled.rolling(window=30, center=True).mean().fillna(data_filled.mean()) + + # Apply window function to reduce spectral leakage + window = np.hanning(len(data_detrended)) + data_windowed = data_detrended * window + + return data_windowed.values + + def _find_dominant_frequencies( + self, + frequencies: np.ndarray, + power_spectrum: np.ndarray + ) -> Tuple[List[float], List[float]]: + """Find dominant frequencies in the power spectrum.""" + # Find peaks in power spectrum + peaks, properties = find_peaks( + power_spectrum, + height=np.mean(power_spectrum) + 2*np.std(power_spectrum), + distance=5 + ) + + # Get frequencies and periods for peaks + dominant_freqs = frequencies[peaks].tolist() + dominant_periods = [1/f if f > 0 else np.inf for f in dominant_freqs] + + # Sort by power (strongest first) + peak_powers = power_spectrum[peaks] + sorted_indices = np.argsort(peak_powers)[::-1] + + dominant_freqs = [dominant_freqs[i] for i in sorted_indices] + dominant_periods = [dominant_periods[i] for i in sorted_indices] + + return dominant_freqs[:10], dominant_periods[:10] # Top 10 + + def _calculate_spectral_entropy(self, power_spectrum: np.ndarray) -> float: + """Calculate spectral entropy as a measure of spectral complexity.""" + # Normalize power spectrum + normalized_spectrum = power_spectrum / np.sum(power_spectrum) + + # Avoid log(0) + normalized_spectrum = normalized_spectrum[normalized_spectrum > 0] + + # Calculate entropy + entropy = -np.sum(normalized_spectrum * np.log2(normalized_spectrum)) + + # Normalize by maximum possible entropy + max_entropy = np.log2(len(normalized_spectrum)) + + return entropy / max_entropy if max_entropy > 0 else 0 + + def _find_peak_frequencies( + self, + frequencies: np.ndarray, + power_spectrum: np.ndarray + ) -> List[float]: + """Find significant peak frequencies.""" + # Use adaptive threshold + threshold = np.mean(power_spectrum) + np.std(power_spectrum) + + peaks, _ = find_peaks(power_spectrum, height=threshold) + peak_frequencies = frequencies[peaks] + + # Filter by relevant frequency range + relevant_peaks = peak_frequencies[ + (peak_frequencies >= 1/self.max_period) & + (peak_frequencies <= 1/self.min_period) + ] + + return relevant_peaks.tolist() + + def _detect_seasonal_components( + self, + frequencies: np.ndarray, + power_spectrum: np.ndarray + ) -> Dict[str, float]: + """Detect seasonal components in the spectrum.""" + seasonal_components = {} + + # Define seasonal frequencies (cycles per day) + seasonal_freqs = { + "weekly": 1/7, + "monthly": 1/30, + "quarterly": 1/91, + "biannual": 1/182, + "annual": 1/365 + } + + for component, target_freq in seasonal_freqs.items(): + # Find closest frequency in spectrum + freq_idx = np.argmin(np.abs(frequencies - target_freq)) + + if freq_idx < len(power_spectrum): + # Calculate relative power in this component + window_size = max(1, len(frequencies) // 50) + start_idx = max(0, freq_idx - window_size//2) + end_idx = min(len(power_spectrum), freq_idx + window_size//2) + + component_power = np.mean(power_spectrum[start_idx:end_idx]) + total_power = np.mean(power_spectrum) + + seasonal_components[component] = component_power / total_power if total_power > 0 else 0 + + return seasonal_components + + def _decompose_signal(self, data: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: + """Decompose signal into trend and residual components.""" + # Simple trend extraction using moving average + window_size = min(30, len(data) // 4) + trend = np.convolve(data, np.ones(window_size)/window_size, mode='same') + + # Residual after removing trend + residual = data - trend + + return trend, residual + + def _calculate_spectral_anomaly_score( + self, + power_spectrum: np.ndarray, + frequencies: np.ndarray + ) -> float: + """Calculate overall anomaly score based on spectral characteristics.""" + # Factor 1: Spectral entropy (lower entropy = more anomalous) + entropy = self._calculate_spectral_entropy(power_spectrum) + entropy_score = 1 - entropy # Invert so higher = more anomalous + + # Factor 2: High-frequency content + high_freq_mask = frequencies > 1/self.min_period + high_freq_power = np.sum(power_spectrum[high_freq_mask]) + total_power = np.sum(power_spectrum) + high_freq_ratio = high_freq_power / total_power if total_power > 0 else 0 + + # Factor 3: Peak concentration + peak_indices, _ = find_peaks(power_spectrum) + if len(peak_indices) > 0: + peak_concentration = np.sum(power_spectrum[peak_indices]) / total_power + else: + peak_concentration = 0 + + # Combine factors + anomaly_score = ( + 0.4 * entropy_score + + 0.3 * high_freq_ratio + + 0.3 * peak_concentration + ) + + return min(anomaly_score, 1.0) + + def _detect_frequency_anomalies(self, features: SpectralFeatures) -> List[SpectralAnomaly]: + """Detect anomalies in frequency domain.""" + anomalies = [] + + # Check for unusual dominant frequencies + for freq in features.dominant_frequencies: + if freq > 0: + period_days = 1 / freq + + # Very short periods might indicate manipulation + if period_days < 3: + anomalies.append(SpectralAnomaly( + timestamp=datetime.now(), + anomaly_type="high_frequency_pattern", + severity="high", + frequency_band=(freq * 0.9, freq * 1.1), + anomaly_score=0.8, + description=f"Suspicious high-frequency pattern detected (period: {period_days:.1f} days)", + evidence={"frequency_hz": freq, "period_days": period_days}, + recommendations=[ + "Investigate potential data manipulation", + "Check for automated/systematic processes", + "Verify data source integrity" + ] + )) + + return anomalies + + def _detect_spectral_changes( + self, + data: pd.Series, + timestamps: pd.DatetimeIndex + ) -> List[SpectralAnomaly]: + """Detect sudden changes in spectral characteristics.""" + anomalies = [] + + if len(data) < 60: # Need sufficient data + return anomalies + + # Split data into segments + segment_size = len(data) // 4 + segments = [data[i:i+segment_size] for i in range(0, len(data)-segment_size, segment_size)] + + # Compare spectral entropy between segments + entropies = [] + for segment in segments: + if len(segment) > 10: + features = self.analyze_time_series(segment) + entropies.append(features.spectral_entropy) + + if len(entropies) > 1: + entropy_changes = np.diff(entropies) + + # Detect significant changes + for i, change in enumerate(entropy_changes): + if abs(change) > 0.3: # Significant spectral change + timestamp = timestamps[i * segment_size] if i * segment_size < len(timestamps) else datetime.now() + + anomalies.append(SpectralAnomaly( + timestamp=timestamp, + anomaly_type="spectral_regime_change", + severity="medium", + frequency_band=(0, 0.5), + anomaly_score=abs(change), + description=f"Significant change in spending pattern complexity detected", + evidence={"entropy_change": change, "segment": i}, + recommendations=[ + "Investigate policy or procedural changes", + "Check for organizational restructuring", + "Verify data consistency" + ] + )) + + return anomalies + + def _detect_suspicious_patterns( + self, + features: SpectralFeatures, + context: Optional[Dict[str, Any]] + ) -> List[SpectralAnomaly]: + """Detect patterns that might indicate irregular activities.""" + anomalies = [] + + # Check seasonal components for anomalies + seasonal = features.seasonal_components + + # Excessive quarterly activity might indicate budget manipulation + if seasonal.get("quarterly", 0) > 0.4: + anomalies.append(SpectralAnomaly( + timestamp=datetime.now(), + anomaly_type="excessive_quarterly_pattern", + severity="medium", + frequency_band=(1/120, 1/60), + anomaly_score=seasonal["quarterly"], + description="Excessive quarterly spending pattern detected", + evidence={"quarterly_component": seasonal["quarterly"]}, + recommendations=[ + "Investigate budget execution practices", + "Check for end-of-quarter rushing", + "Review budget planning processes" + ] + )) + + # Very regular weekly patterns in government spending might be suspicious + if seasonal.get("weekly", 0) > 0.3: + anomalies.append(SpectralAnomaly( + timestamp=datetime.now(), + anomaly_type="unusual_weekly_regularity", + severity="low", + frequency_band=(1/10, 1/5), + anomaly_score=seasonal["weekly"], + description="Unusually regular weekly spending pattern", + evidence={"weekly_component": seasonal["weekly"]}, + recommendations=[ + "Verify if pattern matches business processes", + "Check for automated payments", + "Review spending authorization patterns" + ] + )) + + return anomalies + + def _detect_high_frequency_noise(self, features: SpectralFeatures) -> List[SpectralAnomaly]: + """Detect high-frequency noise that might indicate data manipulation.""" + anomalies = [] + + # Check power in high-frequency band + high_freq_mask = features.frequencies > 0.2 # > 5 day period + high_freq_power = np.sum(features.power_spectrum[high_freq_mask]) + total_power = np.sum(features.power_spectrum) + + high_freq_ratio = high_freq_power / total_power if total_power > 0 else 0 + + if high_freq_ratio > 0.3: # More than 30% power in high frequencies + anomalies.append(SpectralAnomaly( + timestamp=datetime.now(), + anomaly_type="high_frequency_noise", + severity="medium", + frequency_band=(0.2, np.max(features.frequencies)), + anomaly_score=high_freq_ratio, + description="High-frequency noise detected in spending data", + evidence={"high_freq_ratio": high_freq_ratio}, + recommendations=[ + "Check data collection processes", + "Investigate potential data manipulation", + "Verify data source reliability" + ] + )) + + return anomalies + + def _analyze_frequency_band( + self, + features: SpectralFeatures, + band_name: str, + min_freq: float, + max_freq: float, + entity_name: Optional[str] + ) -> Optional[PeriodicPattern]: + """Analyze specific frequency band for patterns.""" + # Find frequencies in this band + mask = (features.frequencies >= min_freq) & (features.frequencies <= max_freq) + + if not np.any(mask): + return None + + band_power = features.power_spectrum[mask] + band_frequencies = features.frequencies[mask] + + if len(band_power) == 0: + return None + + # Find peak in this band + max_idx = np.argmax(band_power) + peak_frequency = band_frequencies[max_idx] + peak_power = band_power[max_idx] + + # Calculate relative amplitude + total_power = np.sum(features.power_spectrum) + relative_amplitude = peak_power / total_power if total_power > 0 else 0 + + # Skip if amplitude is too low + if relative_amplitude < 0.05: + return None + + # Calculate confidence based on peak prominence + mean_power = np.mean(band_power) + confidence = (peak_power - mean_power) / mean_power if mean_power > 0 else 0 + confidence = min(confidence / 3, 1.0) # Normalize + + # Determine pattern type and business interpretation + period_days = 1 / peak_frequency if peak_frequency > 0 else 0 + pattern_type = self._classify_pattern_type(band_name, period_days, relative_amplitude) + business_interpretation = self._interpret_pattern( + band_name, period_days, relative_amplitude, entity_name + ) + + return PeriodicPattern( + period_days=period_days, + frequency_hz=peak_frequency, + amplitude=relative_amplitude, + confidence=confidence, + pattern_type=pattern_type, + business_interpretation=business_interpretation, + statistical_significance=confidence + ) + + def _classify_pattern_type( + self, + band_name: str, + period_days: float, + amplitude: float + ) -> str: + """Classify the type of periodic pattern.""" + if band_name in ["weekly", "monthly", "quarterly", "annual"]: + if amplitude > 0.2: + return "seasonal" + else: + return "cyclical" + elif band_name == "suspicious" or period_days < 3: + return "suspicious" + else: + return "irregular" + + def _interpret_pattern( + self, + band_name: str, + period_days: float, + amplitude: float, + entity_name: Optional[str] + ) -> str: + """Provide business interpretation of detected pattern.""" + entity_str = f" for {entity_name}" if entity_name else "" + + interpretations = { + "weekly": f"Weekly spending cycle detected{entity_str} (period: {period_days:.1f} days, strength: {amplitude:.1%})", + "monthly": f"Monthly budget cycle identified{entity_str} (period: {period_days:.1f} days, strength: {amplitude:.1%})", + "quarterly": f"Quarterly spending pattern found{entity_str} (period: {period_days:.1f} days, strength: {amplitude:.1%})", + "annual": f"Annual budget cycle detected{entity_str} (period: {period_days:.1f} days, strength: {amplitude:.1%})", + "suspicious": f"Potentially suspicious high-frequency pattern{entity_str} (period: {period_days:.1f} days)" + } + + return interpretations.get(band_name, f"Periodic pattern detected{entity_str} (period: {period_days:.1f} days)") + + def _calculate_synchronization_score(self, coherence: np.ndarray) -> float: + """Calculate synchronization score between two entities.""" + # Weight higher frequencies less (focus on meaningful business cycles) + weights = np.exp(-np.linspace(0, 5, len(coherence))) + weighted_coherence = coherence * weights + + return np.mean(weighted_coherence) + + def _interpret_cross_spectral_results( + self, + correlation: float, + coherence: np.ndarray, + correlated_periods: List[float], + entity1: str, + entity2: str + ) -> str: + """Interpret cross-spectral analysis results.""" + if correlation > 0.7: + correlation_strength = "strong" + elif correlation > 0.4: + correlation_strength = "moderate" + else: + correlation_strength = "weak" + + interpretation = f"{correlation_strength.capitalize()} correlation detected between {entity1} and {entity2} (r={correlation:.3f}). " + + if len(correlated_periods) > 0: + main_periods = [p for p in correlated_periods if 7 <= p <= 365] # Focus on business-relevant periods + if main_periods: + interpretation += f"Synchronized patterns found at periods: {', '.join([f'{p:.0f} days' for p in main_periods[:3]])}." + + max_coherence = np.max(coherence) + if max_coherence > 0.8: + interpretation += " High spectral coherence suggests systematic coordination or shared external factors." + elif max_coherence > 0.6: + interpretation += " Moderate spectral coherence indicates some shared patterns or influences." + + return interpretation \ No newline at end of file diff --git a/src/training/__init__.py b/src/training/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/training/configs/__init__.py b/src/training/configs/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/training/pipelines/__init__.py b/src/training/pipelines/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/src/training/pipelines/data.py b/src/training/pipelines/data.py new file mode 100644 index 0000000000000000000000000000000000000000..2548c76885c7f0993499a09e30a98e08293bd1d8 --- /dev/null +++ b/src/training/pipelines/data.py @@ -0,0 +1,852 @@ +""" +Pipeline de Dados do Portal da Transparência para Cidadão.AI + +Sistema completo de coleta, processamento e preparação de dados +do Portal da Transparência para treinamento do modelo especializado. +""" + +import asyncio +import aiohttp +import pandas as pd +import numpy as np +import json +import re +from typing import Dict, List, Optional, Tuple, Any +from pathlib import Path +import logging +from datetime import datetime, timedelta +from dataclasses import dataclass +import hashlib +from concurrent.futures import ThreadPoolExecutor +import time +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import LabelEncoder +import spacy +from transformers import AutoTokenizer + +# Importar ferramentas do projeto +from ..tools.transparency_api import TransparencyAPIClient, TransparencyAPIFilter + +logger = logging.getLogger(__name__) + + +@dataclass +class DataPipelineConfig: + """Configuração do pipeline de dados""" + + # Configurações de coleta + start_date: str = "2020-01-01" + end_date: str = "2024-12-31" + batch_size: int = 1000 + max_samples_per_type: int = 10000 + + # Configurações de processamento + min_text_length: int = 50 + max_text_length: int = 2048 + anomaly_threshold: float = 0.8 + + # Configurações de anotação + enable_auto_annotation: bool = True + manual_annotation_sample_rate: float = 0.1 + + # Configurações de balanceamento + balance_classes: bool = True + normal_anomaly_ratio: float = 0.7 # 70% normal, 30% anomalias + + # Configurações de output + output_dir: str = "./data/processed" + save_intermediate: bool = True + + # Configurações de validação + train_split: float = 0.7 + val_split: float = 0.15 + test_split: float = 0.15 + + +class AnomalyDetector: + """Detector de anomalias baseado em regras para anotação automática""" + + def __init__(self): + # Padrões suspeitos + self.suspicious_patterns = { + "high_value": { + "threshold": 10000000, # 10 milhões + "weight": 0.3 + }, + "emergency_contract": { + "keywords": ["emergencial", "urgente", "dispensa"], + "weight": 0.4 + }, + "sole_source": { + "keywords": ["inexigibilidade", "fonte única", "exclusivo"], + "weight": 0.3 + }, + "short_deadline": { + "keywords": ["prazo reduzido", "exíguo", "urgência"], + "weight": 0.2 + }, + "irregular_cnpj": { + "keywords": ["cnpj irregular", "situação irregular", "bloqueado"], + "weight": 0.5 + }, + "related_parties": { + "keywords": ["parentesco", "familiar", "cônjuge", "parente"], + "weight": 0.6 + }, + "suspicious_amounts": { + "patterns": [r"\d+\.999\.\d+", r"\d+\.000\.000"], # Valores suspeitos + "weight": 0.4 + } + } + + # Padrões de conformidade legal + self.legal_compliance_patterns = { + "proper_bidding": { + "keywords": ["licitação", "pregão", "concorrência", "tomada de preços"], + "weight": 0.5 + }, + "legal_justification": { + "keywords": ["justificativa legal", "amparo legal", "fundamentação"], + "weight": 0.3 + }, + "proper_documentation": { + "keywords": ["processo", "documentação", "termo de referência"], + "weight": 0.2 + } + } + + # Carregar modelo de NLP se disponível + try: + self.nlp = spacy.load("pt_core_news_sm") + except: + logger.warning("Modelo spaCy não encontrado. Usando análise de texto básica.") + self.nlp = None + + def detect_anomalies(self, contract_data: Dict) -> Dict[str, Any]: + """Detectar anomalias em dados de contrato""" + + text = self._extract_text(contract_data) + value = contract_data.get("valor", 0) + + # Calcular scores de anomalia + anomaly_score = 0.0 + anomaly_indicators = [] + + # Verificar valor alto + if value > self.suspicious_patterns["high_value"]["threshold"]: + anomaly_score += self.suspicious_patterns["high_value"]["weight"] + anomaly_indicators.append("high_value") + + # Verificar padrões de texto + text_lower = text.lower() + + for pattern_name, pattern_config in self.suspicious_patterns.items(): + if pattern_name == "high_value": + continue + + if "keywords" in pattern_config: + for keyword in pattern_config["keywords"]: + if keyword in text_lower: + anomaly_score += pattern_config["weight"] + anomaly_indicators.append(pattern_name) + break + + if "patterns" in pattern_config: + for pattern in pattern_config["patterns"]: + if re.search(pattern, text): + anomaly_score += pattern_config["weight"] + anomaly_indicators.append(pattern_name) + break + + # Normalizar score + anomaly_score = min(anomaly_score, 1.0) + + # Classificar anomalia + if anomaly_score >= 0.7: + anomaly_label = 2 # Anômalo + anomaly_type = "Anômalo" + elif anomaly_score >= 0.4: + anomaly_label = 1 # Suspeito + anomaly_type = "Suspeito" + else: + anomaly_label = 0 # Normal + anomaly_type = "Normal" + + return { + "anomaly_score": anomaly_score, + "anomaly_label": anomaly_label, + "anomaly_type": anomaly_type, + "anomaly_indicators": anomaly_indicators, + "confidence": self._calculate_confidence(anomaly_score, anomaly_indicators) + } + + def assess_financial_risk(self, contract_data: Dict) -> Dict[str, Any]: + """Avaliar risco financeiro""" + + value = contract_data.get("valor", 0) + text = self._extract_text(contract_data) + + # Fatores de risco + risk_factors = [] + risk_score = 0.0 + + # Risco por valor + if value > 50000000: # > 50M + risk_score += 0.4 + risk_factors.append("very_high_value") + elif value > 10000000: # > 10M + risk_score += 0.3 + risk_factors.append("high_value") + elif value > 1000000: # > 1M + risk_score += 0.2 + risk_factors.append("medium_value") + + # Risco por características do contrato + text_lower = text.lower() + + risk_keywords = { + "obra": 0.2, + "construção": 0.2, + "reforma": 0.15, + "equipamento": 0.1, + "serviço": 0.05, + "emergencial": 0.3, + "tecnologia": 0.1 + } + + for keyword, weight in risk_keywords.items(): + if keyword in text_lower: + risk_score += weight + risk_factors.append(f"keyword_{keyword}") + + # Normalizar e classificar + risk_score = min(risk_score, 1.0) + + if risk_score >= 0.8: + risk_level = 4 # Muito Alto + elif risk_score >= 0.6: + risk_level = 3 # Alto + elif risk_score >= 0.4: + risk_level = 2 # Médio + elif risk_score >= 0.2: + risk_level = 1 # Baixo + else: + risk_level = 0 # Muito Baixo + + return { + "financial_risk_score": risk_score, + "financial_risk_level": risk_level, + "risk_factors": risk_factors, + "estimated_risk_value": value * risk_score + } + + def check_legal_compliance(self, contract_data: Dict) -> Dict[str, Any]: + """Verificar conformidade legal""" + + text = self._extract_text(contract_data) + text_lower = text.lower() + + compliance_score = 0.0 + compliance_indicators = [] + + # Verificar indicadores de conformidade + for pattern_name, pattern_config in self.legal_compliance_patterns.items(): + for keyword in pattern_config["keywords"]: + if keyword in text_lower: + compliance_score += pattern_config["weight"] + compliance_indicators.append(pattern_name) + break + + # Verificar indicadores de não conformidade + non_compliance_keywords = [ + "irregular", "ilegal", "inválido", "viciado", + "sem licitação", "direcionamento", "favorecimento" + ] + + for keyword in non_compliance_keywords: + if keyword in text_lower: + compliance_score -= 0.3 + compliance_indicators.append(f"non_compliant_{keyword}") + + # Normalizar score + compliance_score = max(0.0, min(compliance_score, 1.0)) + + # Determinar conformidade + is_compliant = compliance_score >= 0.5 + compliance_label = 1 if is_compliant else 0 + + return { + "legal_compliance_score": compliance_score, + "legal_compliance_label": compliance_label, + "is_compliant": is_compliant, + "compliance_indicators": compliance_indicators + } + + def _extract_text(self, contract_data: Dict) -> str: + """Extrair texto relevante dos dados do contrato""" + + text_fields = [ + "objeto", "descricao", "justificativa", "observacoes", + "modalidade_licitacao", "situacao", "fornecedor_nome" + ] + + text_parts = [] + for field in text_fields: + if field in contract_data and contract_data[field]: + text_parts.append(str(contract_data[field])) + + return " ".join(text_parts) + + def _calculate_confidence(self, score: float, indicators: List[str]) -> float: + """Calcular confiança da detecção""" + + # Confiança baseada no número de indicadores e score + indicator_confidence = min(len(indicators) * 0.1, 0.5) + score_confidence = score * 0.5 + + return min(indicator_confidence + score_confidence, 1.0) + + +class TransparencyDataProcessor: + """Processador de dados de transparência""" + + def __init__(self, config: DataPipelineConfig): + self.config = config + self.anomaly_detector = AnomalyDetector() + self.api_client = None + + # Estatísticas de processamento + self.stats = { + "total_contracts": 0, + "processed_contracts": 0, + "anomalous_contracts": 0, + "errors": 0 + } + + async def collect_transparency_data(self) -> List[Dict]: + """Coletar dados do Portal da Transparência""" + + logger.info("🔍 Iniciando coleta de dados do Portal da Transparência") + + all_data = [] + + async with TransparencyAPIClient() as client: + self.api_client = client + + # Coletar contratos + contracts_data = await self._collect_contracts_data(client) + all_data.extend(contracts_data) + + # Coletar despesas (opcional) + # despesas_data = await self._collect_despesas_data(client) + # all_data.extend(despesas_data) + + # Coletar convênios (opcional) + # convenios_data = await self._collect_convenios_data(client) + # all_data.extend(convenios_data) + + logger.info(f"✅ Coleta finalizada: {len(all_data)} registros") + return all_data + + async def _collect_contracts_data(self, client: TransparencyAPIClient) -> List[Dict]: + """Coletar dados de contratos""" + + contracts = [] + + # Definir filtros para diferentes tipos de contratos + filter_configs = [ + # Contratos de alto valor + TransparencyAPIFilter( + ano=2024, + valor_inicial=10000000, # > 10M + pagina=1 + ), + # Contratos médio valor + TransparencyAPIFilter( + ano=2024, + valor_inicial=1000000, + valor_final=10000000, + pagina=1 + ), + # Contratos emergenciais (mais propensos a anomalias) + TransparencyAPIFilter( + ano=2024, + modalidade_licitacao="Dispensa", + pagina=1 + ) + ] + + for filters in filter_configs: + try: + logger.info(f"📋 Coletando contratos com filtros: {filters}") + + batch_contracts = await client.get_contracts(filters) + + if batch_contracts: + # Limitar número de contratos por tipo + limited_contracts = batch_contracts[:self.config.max_samples_per_type] + contracts.extend(limited_contracts) + + logger.info(f"✅ Coletados {len(limited_contracts)} contratos") + + # Rate limiting + await asyncio.sleep(1) + + except Exception as e: + logger.error(f"❌ Erro ao coletar contratos: {e}") + self.stats["errors"] += 1 + + self.stats["total_contracts"] = len(contracts) + return contracts + + def process_raw_data(self, raw_data: List[Dict]) -> List[Dict]: + """Processar dados brutos""" + + logger.info(f"⚙️ Processando {len(raw_data)} registros") + + processed_data = [] + + for item in raw_data: + try: + processed_item = self._process_single_item(item) + if processed_item: + processed_data.append(processed_item) + self.stats["processed_contracts"] += 1 + + except Exception as e: + logger.error(f"❌ Erro ao processar item: {e}") + self.stats["errors"] += 1 + + logger.info(f"✅ Processamento concluído: {len(processed_data)} registros válidos") + return processed_data + + def _process_single_item(self, item: Dict) -> Optional[Dict]: + """Processar um item individual""" + + # Extrair e limpar texto + text = self._extract_and_clean_text(item) + + if not text or len(text) < self.config.min_text_length: + return None + + # Truncar se muito longo + if len(text) > self.config.max_text_length: + text = text[:self.config.max_text_length] + + # Análise automática de anomalias + anomaly_analysis = self.anomaly_detector.detect_anomalies(item) + financial_analysis = self.anomaly_detector.assess_financial_risk(item) + legal_analysis = self.anomaly_detector.check_legal_compliance(item) + + if anomaly_analysis["anomaly_label"] > 0: + self.stats["anomalous_contracts"] += 1 + + # Extrair features especializadas + entity_types = self._extract_entity_types(item) + financial_features = self._extract_financial_features(item) + legal_features = self._extract_legal_features(item) + + processed_item = { + # Dados básicos + "id": item.get("id", hashlib.md5(text.encode()).hexdigest()[:12]), + "text": text, + "original_data": item, + + # Labels para treinamento + "anomaly_label": anomaly_analysis["anomaly_label"], + "financial_risk": financial_analysis["financial_risk_level"], + "legal_compliance": legal_analysis["legal_compliance_label"], + + # Scores detalhados + "anomaly_score": anomaly_analysis["anomaly_score"], + "financial_risk_score": financial_analysis["financial_risk_score"], + "legal_compliance_score": legal_analysis["legal_compliance_score"], + + # Features especializadas + "entity_types": entity_types, + "financial_features": financial_features, + "legal_features": legal_features, + + # Metadados + "confidence": anomaly_analysis["confidence"], + "anomaly_indicators": anomaly_analysis["anomaly_indicators"], + "risk_factors": financial_analysis["risk_factors"], + "compliance_indicators": legal_analysis["compliance_indicators"], + + # Valor do contrato + "contract_value": item.get("valor", 0), + + # Timestamp de processamento + "processed_at": datetime.now().isoformat() + } + + return processed_item + + def _extract_and_clean_text(self, item: Dict) -> str: + """Extrair e limpar texto dos dados""" + + # Campos de texto relevantes + text_fields = [ + "objeto", "descricao", "justificativa", "observacoes", + "modalidade_licitacao", "situacao", "fornecedor_nome", + "orgao_nome", "unidade_gestora_nome" + ] + + text_parts = [] + + for field in text_fields: + value = item.get(field) + if value and isinstance(value, str): + # Limpar texto + cleaned_value = re.sub(r'\s+', ' ', value.strip()) + cleaned_value = re.sub(r'[^\w\s\-\.\,\;\:\(\)\[\]]', '', cleaned_value) + + if len(cleaned_value) > 10: # Filtrar textos muito curtos + text_parts.append(cleaned_value) + + return " ".join(text_parts) + + def _extract_entity_types(self, item: Dict) -> List[int]: + """Extrair tipos de entidades""" + + entity_types = [] + + # Mapear tipos de entidades + entity_mapping = { + "orgao": 1, + "empresa": 2, + "pessoa_fisica": 3, + "equipamento": 4, + "servico": 5, + "obra": 6, + "material": 7 + } + + # Identificar entidades no texto + text = self._extract_and_clean_text(item).lower() + + for entity_name, entity_id in entity_mapping.items(): + if entity_name in text or any(keyword in text for keyword in [entity_name]): + entity_types.append(entity_id) + + # Garantir pelo menos um tipo + if not entity_types: + entity_types = [0] # Tipo genérico + + return entity_types[:10] # Limitar a 10 tipos + + def _extract_financial_features(self, item: Dict) -> List[float]: + """Extrair features financeiras""" + + features = [] + + # Valor do contrato (normalizado) + valor = item.get("valor", 0) + valor_normalizado = min(valor / 100000000, 1.0) # Normalizar por 100M + features.append(valor_normalizado) + + # Ano do contrato + ano = item.get("ano", 2024) + ano_normalizado = (ano - 2020) / 10 # Normalizar para 0-1 + features.append(ano_normalizado) + + # Modalidade (codificada) + modalidade_map = { + "Pregão": 0.1, + "Concorrência": 0.2, + "Tomada de Preços": 0.3, + "Convite": 0.4, + "Dispensa": 0.7, + "Inexigibilidade": 0.9 + } + + modalidade = item.get("modalidade_licitacao", "") + modalidade_valor = modalidade_map.get(modalidade, 0.5) + features.append(modalidade_valor) + + return features + + def _extract_legal_features(self, item: Dict) -> List[int]: + """Extrair features legais""" + + features = [] + + # Presença de documentação legal + legal_docs = [ + "processo", "edital", "termo_referencia", "ata", + "contrato", "aditivo", "apostilamento" + ] + + text = self._extract_and_clean_text(item).lower() + + for doc in legal_docs: + if doc in text: + features.append(1) + else: + features.append(0) + + return features + + def create_training_datasets(self, processed_data: List[Dict]) -> Dict[str, List[Dict]]: + """Criar datasets de treinamento""" + + logger.info("📊 Criando datasets de treinamento") + + # Balancear classes se solicitado + if self.config.balance_classes: + processed_data = self._balance_dataset(processed_data) + + # Dividir em train/val/test + train_data, temp_data = train_test_split( + processed_data, + test_size=(1 - self.config.train_split), + random_state=42, + stratify=[item["anomaly_label"] for item in processed_data] + ) + + val_size = self.config.val_split / (self.config.val_split + self.config.test_split) + val_data, test_data = train_test_split( + temp_data, + test_size=(1 - val_size), + random_state=42, + stratify=[item["anomaly_label"] for item in temp_data] + ) + + datasets = { + "train": train_data, + "val": val_data, + "test": test_data + } + + # Log estatísticas + for split_name, split_data in datasets.items(): + logger.info(f"📈 {split_name}: {len(split_data)} exemplos") + + # Distribuição de classes + anomaly_dist = {} + for item in split_data: + label = item["anomaly_label"] + anomaly_dist[label] = anomaly_dist.get(label, 0) + 1 + + logger.info(f" Distribuição anomalias: {anomaly_dist}") + + return datasets + + def _balance_dataset(self, data: List[Dict]) -> List[Dict]: + """Balancear dataset por classes""" + + logger.info("⚖️ Balanceando dataset") + + # Agrupar por classe de anomalia + class_groups = {0: [], 1: [], 2: []} + + for item in data: + label = item["anomaly_label"] + if label in class_groups: + class_groups[label].append(item) + + # Calcular tamanho alvo + total_size = len(data) + normal_size = int(total_size * self.config.normal_anomaly_ratio) + anomaly_size = total_size - normal_size + suspicious_size = anomaly_size // 2 + anomalous_size = anomaly_size - suspicious_size + + # Balancear + balanced_data = [] + + # Normal (classe 0) + normal_data = class_groups[0] + if len(normal_data) >= normal_size: + balanced_data.extend(np.random.choice(normal_data, normal_size, replace=False)) + else: + # Oversample se necessário + balanced_data.extend(normal_data) + remaining = normal_size - len(normal_data) + balanced_data.extend(np.random.choice(normal_data, remaining, replace=True)) + + # Suspeito (classe 1) + suspicious_data = class_groups[1] + if len(suspicious_data) >= suspicious_size: + balanced_data.extend(np.random.choice(suspicious_data, suspicious_size, replace=False)) + else: + balanced_data.extend(suspicious_data) + remaining = suspicious_size - len(suspicious_data) + if remaining > 0 and len(suspicious_data) > 0: + balanced_data.extend(np.random.choice(suspicious_data, remaining, replace=True)) + + # Anômalo (classe 2) + anomalous_data = class_groups[2] + if len(anomalous_data) >= anomalous_size: + balanced_data.extend(np.random.choice(anomalous_data, anomalous_size, replace=False)) + else: + balanced_data.extend(anomalous_data) + remaining = anomalous_size - len(anomalous_data) + if remaining > 0 and len(anomalous_data) > 0: + balanced_data.extend(np.random.choice(anomalous_data, remaining, replace=True)) + + # Shuffle + np.random.shuffle(balanced_data) + + logger.info(f"📊 Dataset balanceado: {len(balanced_data)} exemplos") + return balanced_data + + def save_datasets(self, datasets: Dict[str, List[Dict]]): + """Salvar datasets processados""" + + output_dir = Path(self.config.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + # Salvar cada split + for split_name, split_data in datasets.items(): + output_path = output_dir / f"{split_name}.json" + + with open(output_path, 'w', encoding='utf-8') as f: + json.dump(split_data, f, ensure_ascii=False, indent=2) + + logger.info(f"💾 {split_name} salvo em {output_path}") + + # Salvar estatísticas + stats_path = output_dir / "processing_stats.json" + with open(stats_path, 'w', encoding='utf-8') as f: + json.dump(self.stats, f, indent=2) + + # Salvar configuração + config_path = output_dir / "pipeline_config.json" + with open(config_path, 'w', encoding='utf-8') as f: + json.dump(self.config.__dict__, f, indent=2) + + logger.info(f"📈 Estatísticas e configuração salvas em {output_dir}") + + def generate_data_report(self, datasets: Dict[str, List[Dict]]) -> str: + """Gerar relatório dos dados processados""" + + report = [] + report.append("# 📊 Relatório de Processamento de Dados - Cidadão.AI\n") + + # Estatísticas gerais + report.append("## 📈 Estatísticas Gerais\n") + report.append(f"- **Total de contratos coletados**: {self.stats['total_contracts']:,}") + report.append(f"- **Contratos processados**: {self.stats['processed_contracts']:,}") + report.append(f"- **Contratos anômalos detectados**: {self.stats['anomalous_contracts']:,}") + report.append(f"- **Erros durante processamento**: {self.stats['errors']:,}") + report.append(f"- **Taxa de anomalias**: {self.stats['anomalous_contracts']/max(self.stats['processed_contracts'],1)*100:.1f}%\n") + + # Estatísticas por split + report.append("## 📚 Estatísticas por Dataset\n") + + for split_name, split_data in datasets.items(): + report.append(f"### {split_name.title()}\n") + report.append(f"- **Tamanho**: {len(split_data):,} exemplos\n") + + # Distribuição de anomalias + anomaly_dist = {} + financial_dist = {} + legal_dist = {} + + for item in split_data: + # Anomalias + anomaly_label = item["anomaly_label"] + anomaly_dist[anomaly_label] = anomaly_dist.get(anomaly_label, 0) + 1 + + # Risco financeiro + financial_risk = item["financial_risk"] + financial_dist[financial_risk] = financial_dist.get(financial_risk, 0) + 1 + + # Conformidade legal + legal_compliance = item["legal_compliance"] + legal_dist[legal_compliance] = legal_dist.get(legal_compliance, 0) + 1 + + report.append("**Distribuição de Anomalias:**") + anomaly_labels = {0: "Normal", 1: "Suspeito", 2: "Anômalo"} + for label, count in sorted(anomaly_dist.items()): + pct = count / len(split_data) * 100 + report.append(f" - {anomaly_labels.get(label, label)}: {count:,} ({pct:.1f}%)") + + report.append("\n**Distribuição de Risco Financeiro:**") + risk_labels = {0: "Muito Baixo", 1: "Baixo", 2: "Médio", 3: "Alto", 4: "Muito Alto"} + for level, count in sorted(financial_dist.items()): + pct = count / len(split_data) * 100 + report.append(f" - {risk_labels.get(level, level)}: {count:,} ({pct:.1f}%)") + + report.append("\n**Conformidade Legal:**") + legal_labels = {0: "Não Conforme", 1: "Conforme"} + for label, count in sorted(legal_dist.items()): + pct = count / len(split_data) * 100 + report.append(f" - {legal_labels.get(label, label)}: {count:,} ({pct:.1f}%)") + + report.append("\n") + + # Configuração utilizada + report.append("## ⚙️ Configuração do Pipeline\n") + for key, value in self.config.__dict__.items(): + report.append(f"- **{key}**: {value}") + + report.append("\n") + report.append(f"**Relatório gerado em**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}") + + return "\n".join(report) + + +async def run_data_pipeline(config: Optional[DataPipelineConfig] = None) -> Dict[str, List[Dict]]: + """ + Executar pipeline completo de dados + + Args: + config: Configuração do pipeline + + Returns: + Datasets de treinamento processados + """ + + if config is None: + config = DataPipelineConfig() + + logger.info("🚀 Iniciando pipeline de dados Cidadão.AI") + + processor = TransparencyDataProcessor(config) + + # 1. Coletar dados + raw_data = await processor.collect_transparency_data() + + # 2. Processar dados + processed_data = processor.process_raw_data(raw_data) + + # 3. Criar datasets + datasets = processor.create_training_datasets(processed_data) + + # 4. Salvar dados + processor.save_datasets(datasets) + + # 5. Gerar relatório + report = processor.generate_data_report(datasets) + + # Salvar relatório + output_dir = Path(config.output_dir) + report_path = output_dir / "data_report.md" + with open(report_path, 'w', encoding='utf-8') as f: + f.write(report) + + logger.info(f"📄 Relatório salvo em {report_path}") + logger.info("✅ Pipeline de dados finalizado com sucesso!") + + return datasets + + +if __name__ == "__main__": + # Configurar logging + logging.basicConfig(level=logging.INFO) + + # Executar pipeline + config = DataPipelineConfig( + max_samples_per_type=100, # Reduzido para teste + output_dir="./data/cidadao_gpt_processed" + ) + + # Executar + datasets = asyncio.run(run_data_pipeline(config)) + + print("🎉 Pipeline de dados executado com sucesso!") + print(f"📊 Datasets criados: {list(datasets.keys())}") + for name, data in datasets.items(): + print(f" {name}: {len(data)} exemplos") \ No newline at end of file diff --git a/src/training/pipelines/training.py b/src/training/pipelines/training.py new file mode 100644 index 0000000000000000000000000000000000000000..686e0742e51e29ccc308b0ce9e22d3798b767735 --- /dev/null +++ b/src/training/pipelines/training.py @@ -0,0 +1,813 @@ +""" +Pipeline de Treinamento para Cidadão.AI + +Sistema completo de fine-tuning especializado para dados de transparência pública brasileira. +Inspirado nas técnicas do Kimi K2, mas otimizado para análise governamental. +""" + +import os +import json +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader +from torch.optim import AdamW +from torch.optim.lr_scheduler import CosineAnnealingLR +from transformers import AutoTokenizer, get_linear_schedule_with_warmup +from typing import Dict, List, Optional, Tuple, Any +import pandas as pd +import numpy as np +from pathlib import Path +import logging +from dataclasses import dataclass, asdict +from tqdm import tqdm +import wandb +from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix +import matplotlib.pyplot as plt +import seaborn as sns + +from .cidadao_model import CidadaoAIForTransparency, CidadaoModelConfig, create_cidadao_model + +logger = logging.getLogger(__name__) + + +@dataclass +class TrainingConfig: + """Configuração de treinamento""" + + # Hiperparâmetros principais + learning_rate: float = 2e-5 + batch_size: int = 8 + num_epochs: int = 10 + warmup_steps: int = 1000 + max_grad_norm: float = 1.0 + weight_decay: float = 0.01 + + # Configurações de dados + max_sequence_length: int = 512 + train_split: float = 0.8 + val_split: float = 0.1 + test_split: float = 0.1 + + # Configurações do modelo + model_size: str = "medium" + specialized_tasks: List[str] = None + use_mixed_precision: bool = True + gradient_accumulation_steps: int = 4 + + # Configurações de checkpoint + save_strategy: str = "epoch" # "steps" ou "epoch" + save_steps: int = 500 + eval_steps: int = 100 + logging_steps: int = 50 + output_dir: str = "./models/cidadao-gpt" + + # Configurações de avaliação + eval_strategy: str = "steps" + metric_for_best_model: str = "eval_f1" + greater_is_better: bool = True + early_stopping_patience: int = 3 + + # Configurações de experimentação + experiment_name: str = "cidadao-gpt-v1" + use_wandb: bool = True + wandb_project: str = "cidadao-ai" + + def __post_init__(self): + if self.specialized_tasks is None: + self.specialized_tasks = ["all"] + + +class TransparencyDataset(Dataset): + """Dataset especializado para dados de transparência pública""" + + def __init__( + self, + data_path: str, + tokenizer: AutoTokenizer, + max_length: int = 512, + task_type: str = "multi_task" + ): + self.tokenizer = tokenizer + self.max_length = max_length + self.task_type = task_type + + # Carregar dados + self.data = self._load_data(data_path) + + # Preparar vocabulário especializado + self._prepare_specialized_vocab() + + def _load_data(self, data_path: str) -> List[Dict]: + """Carregar dados de transparência""" + + data_file = Path(data_path) + + if data_file.suffix == '.json': + with open(data_file, 'r', encoding='utf-8') as f: + data = json.load(f) + elif data_file.suffix == '.jsonl': + data = [] + with open(data_file, 'r', encoding='utf-8') as f: + for line in f: + data.append(json.loads(line)) + else: + # Assumir dados do Portal da Transparência em formato estruturado + data = self._load_transparency_data(data_path) + + logger.info(f"Carregados {len(data)} exemplos de {data_path}") + return data + + def _load_transparency_data(self, data_path: str) -> List[Dict]: + """Carregar dados reais do Portal da Transparência""" + + # Simular estrutura de dados reais + # Em produção, isso seria conectado ao pipeline de dados real + sample_data = [] + + # Exemplos de contratos com diferentes tipos de problemas + contract_examples = [ + { + "text": "Contrato para aquisição de equipamentos médicos no valor de R$ 2.500.000,00 firmado entre Ministério da Saúde e Empresa XYZ LTDA. Processo licitatório 12345/2024, modalidade pregão eletrônico.", + "anomaly_label": 0, # Normal + "financial_risk": 2, # Médio + "legal_compliance": 1, # Conforme + "contract_value": 2500000.0, + "entity_types": [1, 2, 3], # Ministério, Empresa, Equipamento + "corruption_indicators": [] + }, + { + "text": "Contrato emergencial sem licitação para fornecimento de insumos hospitalares. Valor: R$ 15.000.000,00. Empresa beneficiária: Alpha Beta Comercial S.A., CNPJ com irregularidades na Receita Federal.", + "anomaly_label": 2, # Anômalo + "financial_risk": 4, # Alto + "legal_compliance": 0, # Não conforme + "contract_value": 15000000.0, + "entity_types": [1, 2, 4], # Ministério, Empresa, Insumos + "corruption_indicators": [1, 3, 5] # Emergencial, Sem licitação, CNPJ irregular + } + ] + + # Amplificar dados com variações + for base_example in contract_examples: + for i in range(50): # 50 variações de cada exemplo + example = base_example.copy() + example["id"] = f"{len(sample_data)}" + + # Adicionar ruído realístico + if np.random.random() > 0.5: + example["text"] = self._add_realistic_variations(example["text"]) + + sample_data.append(example) + + return sample_data + + def _add_realistic_variations(self, text: str) -> str: + """Adicionar variações realísticas ao texto""" + + variations = [ + text.replace("Ministério da Saúde", "MS"), + text.replace("equipamentos médicos", "equipamentos hospitalares"), + text.replace("pregão eletrônico", "concorrência pública"), + text + " Processo administrativo arquivado em sistema SIASG.", + text + " Valor atualizado conforme INPC/IBGE." + ] + + return np.random.choice(variations) + + def _prepare_specialized_vocab(self): + """Preparar vocabulário especializado para transparência""" + + # Termos técnicos de transparência pública + self.transparency_terms = { + # Entidades + "ministerio", "secretaria", "orgao", "entidade", "empresa", "fornecedor", + + # Tipos de contrato + "licitacao", "pregao", "concorrencia", "tomada_precos", "convite", "dispensa", + + # Indicadores financeiros + "valor", "preco", "orcamento", "pagamento", "repasse", "empenho", + + # Termos jurídicos + "conformidade", "irregularidade", "infração", "penalidade", "multa", + + # Indicadores de corrupção + "superfaturamento", "direcionamento", "cartel", "fraude", "peculato" + } + + # Adicionar tokens especiais se necessário + special_tokens = ["[CONTRACT]", "[ENTITY]", "[VALUE]", "[ANOMALY]", "[LEGAL]"] + self.tokenizer.add_special_tokens({"additional_special_tokens": special_tokens}) + + def __len__(self) -> int: + return len(self.data) + + def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]: + item = self.data[idx] + + # Tokenizar texto + encoding = self.tokenizer( + item["text"], + truncation=True, + padding="max_length", + max_length=self.max_length, + return_tensors="pt" + ) + + # Preparar labels e features especializadas + result = { + "input_ids": encoding["input_ids"].squeeze(), + "attention_mask": encoding["attention_mask"].squeeze(), + } + + # Adicionar labels específicos por tarefa + if "anomaly_label" in item: + result["anomaly_labels"] = torch.tensor(item["anomaly_label"], dtype=torch.long) + + if "financial_risk" in item: + result["financial_risk_labels"] = torch.tensor(item["financial_risk"], dtype=torch.long) + + if "legal_compliance" in item: + result["legal_compliance_labels"] = torch.tensor(item["legal_compliance"], dtype=torch.long) + + # Adicionar features especializadas + if "entity_types" in item: + entity_types = torch.zeros(self.max_length, dtype=torch.long) + for i, entity_type in enumerate(item["entity_types"][:self.max_length]): + entity_types[i] = entity_type + result["entity_types"] = entity_types + + if "corruption_indicators" in item: + corruption_indicators = torch.zeros(self.max_length, dtype=torch.long) + for i, indicator in enumerate(item["corruption_indicators"][:self.max_length]): + corruption_indicators[i] = indicator + result["corruption_indicators"] = corruption_indicators + + return result + + +class CidadaoTrainer: + """Trainer especializado para Cidadão.AI""" + + def __init__( + self, + model: CidadaoAIForTransparency, + tokenizer: AutoTokenizer, + config: TrainingConfig + ): + self.model = model + self.tokenizer = tokenizer + self.config = config + + # Configurar device + self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + self.model.to(self.device) + + # Configurar otimizador + self.optimizer = AdamW( + self.model.parameters(), + lr=config.learning_rate, + weight_decay=config.weight_decay + ) + + # Configurar mixed precision se disponível + self.scaler = torch.cuda.amp.GradScaler() if config.use_mixed_precision else None + + # Métricas de treinamento + self.training_history = { + "train_loss": [], + "eval_loss": [], + "eval_metrics": [] + } + + # Early stopping + self.best_metric = float('-inf') if config.greater_is_better else float('inf') + self.patience_counter = 0 + + # Configurar logging + if config.use_wandb: + wandb.init( + project=config.wandb_project, + name=config.experiment_name, + config=asdict(config) + ) + + def train( + self, + train_dataset: TransparencyDataset, + eval_dataset: Optional[TransparencyDataset] = None, + test_dataset: Optional[TransparencyDataset] = None + ): + """Executar treinamento completo""" + + logger.info("🚀 Iniciando treinamento do Cidadão.AI") + + # Preparar data loaders + train_loader = DataLoader( + train_dataset, + batch_size=self.config.batch_size, + shuffle=True, + num_workers=4 + ) + + eval_loader = None + if eval_dataset: + eval_loader = DataLoader( + eval_dataset, + batch_size=self.config.batch_size, + shuffle=False, + num_workers=4 + ) + + # Configurar scheduler + total_steps = len(train_loader) * self.config.num_epochs + self.scheduler = get_linear_schedule_with_warmup( + self.optimizer, + num_warmup_steps=self.config.warmup_steps, + num_training_steps=total_steps + ) + + # Loop de treinamento + global_step = 0 + + for epoch in range(self.config.num_epochs): + logger.info(f"📚 Época {epoch + 1}/{self.config.num_epochs}") + + # Treinamento + train_loss = self._train_epoch(train_loader, epoch, global_step) + self.training_history["train_loss"].append(train_loss) + + # Avaliação + if eval_loader and (epoch + 1) % 1 == 0: # Avaliar a cada época + eval_metrics = self._evaluate(eval_loader, epoch) + self.training_history["eval_metrics"].append(eval_metrics) + + # Early stopping check + current_metric = eval_metrics[self.config.metric_for_best_model] + if self._is_better_metric(current_metric): + self.best_metric = current_metric + self.patience_counter = 0 + self._save_checkpoint(epoch, is_best=True) + logger.info(f"🎯 Novo melhor modelo! {self.config.metric_for_best_model}: {current_metric:.4f}") + else: + self.patience_counter += 1 + + if self.patience_counter >= self.config.early_stopping_patience: + logger.info(f"⏰ Early stopping acionado após {self.patience_counter} épocas sem melhoria") + break + + # Salvar checkpoint regular + if (epoch + 1) % 2 == 0: # Salvar a cada 2 épocas + self._save_checkpoint(epoch, is_best=False) + + global_step += len(train_loader) + + # Avaliação final + if test_dataset: + test_loader = DataLoader( + test_dataset, + batch_size=self.config.batch_size, + shuffle=False, + num_workers=4 + ) + + logger.info("🧪 Executando avaliação final no conjunto de teste") + final_metrics = self._evaluate(test_loader, epoch=-1, is_test=True) + + logger.info("📊 Métricas finais:") + for metric, value in final_metrics.items(): + logger.info(f" {metric}: {value:.4f}") + + # Finalizar treinamento + self._finalize_training() + + def _train_epoch(self, train_loader: DataLoader, epoch: int, global_step: int) -> float: + """Treinar uma época""" + + self.model.train() + total_loss = 0.0 + progress_bar = tqdm(train_loader, desc=f"Treinamento Época {epoch + 1}") + + for step, batch in enumerate(progress_bar): + # Mover dados para device + batch = {k: v.to(self.device) for k, v in batch.items()} + + # Forward pass com mixed precision + if self.scaler: + with torch.cuda.amp.autocast(): + loss = self._compute_multi_task_loss(batch) + else: + loss = self._compute_multi_task_loss(batch) + + # Backward pass + if self.scaler: + self.scaler.scale(loss).backward() + + if (step + 1) % self.config.gradient_accumulation_steps == 0: + self.scaler.unscale_(self.optimizer) + torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.max_grad_norm) + self.scaler.step(self.optimizer) + self.scaler.update() + self.scheduler.step() + self.optimizer.zero_grad() + else: + loss.backward() + + if (step + 1) % self.config.gradient_accumulation_steps == 0: + torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.max_grad_norm) + self.optimizer.step() + self.scheduler.step() + self.optimizer.zero_grad() + + total_loss += loss.item() + + # Logging + if step % self.config.logging_steps == 0: + avg_loss = total_loss / (step + 1) + progress_bar.set_postfix({"loss": f"{avg_loss:.4f}"}) + + if self.config.use_wandb: + wandb.log({ + "train/loss": avg_loss, + "train/learning_rate": self.scheduler.get_last_lr()[0], + "train/epoch": epoch, + "train/step": global_step + step + }) + + return total_loss / len(train_loader) + + def _compute_multi_task_loss(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor: + """Computar loss multi-tarefa""" + + total_loss = 0.0 + loss_weights = { + "anomaly": 1.0, + "financial": 0.8, + "legal": 0.6 + } + + # Loss de detecção de anomalias + if "anomaly_labels" in batch: + anomaly_outputs = self.model.detect_anomalies( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"], + entity_types=batch.get("entity_types"), + corruption_indicators=batch.get("corruption_indicators") + ) + + # Extrair logits dos resultados + anomaly_logits = [] + for pred in anomaly_outputs["predictions"]: + probs = [ + pred["probabilities"]["normal"], + pred["probabilities"]["suspicious"], + pred["probabilities"]["anomalous"] + ] + anomaly_logits.append(probs) + + anomaly_logits = torch.tensor(anomaly_logits, device=self.device) + anomaly_loss = nn.CrossEntropyLoss()(anomaly_logits, batch["anomaly_labels"]) + total_loss += loss_weights["anomaly"] * anomaly_loss + + # Loss de análise financeira + if "financial_risk_labels" in batch: + financial_outputs = self.model.analyze_financial_risk( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"] + ) + + # Extrair logits dos resultados + risk_logits = [] + for pred in financial_outputs["predictions"]: + probs = list(pred["risk_probabilities"].values()) + risk_logits.append(probs) + + risk_logits = torch.tensor(risk_logits, device=self.device) + financial_loss = nn.CrossEntropyLoss()(risk_logits, batch["financial_risk_labels"]) + total_loss += loss_weights["financial"] * financial_loss + + # Loss de conformidade legal + if "legal_compliance_labels" in batch: + legal_outputs = self.model.check_legal_compliance( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"] + ) + + # Extrair logits dos resultados + compliance_logits = [] + for pred in legal_outputs["predictions"]: + probs = [ + pred["legal_analysis"]["non_compliant_prob"], + pred["legal_analysis"]["compliant_prob"] + ] + compliance_logits.append(probs) + + compliance_logits = torch.tensor(compliance_logits, device=self.device) + legal_loss = nn.CrossEntropyLoss()(compliance_logits, batch["legal_compliance_labels"]) + total_loss += loss_weights["legal"] * legal_loss + + return total_loss + + def _evaluate(self, eval_loader: DataLoader, epoch: int, is_test: bool = False) -> Dict[str, float]: + """Avaliar modelo""" + + self.model.eval() + total_loss = 0.0 + + # Coletar predições e labels + all_predictions = { + "anomaly": {"preds": [], "labels": []}, + "financial": {"preds": [], "labels": []}, + "legal": {"preds": [], "labels": []} + } + + with torch.no_grad(): + for batch in tqdm(eval_loader, desc="Avaliação"): + batch = {k: v.to(self.device) for k, v in batch.items()} + + # Computar loss + loss = self._compute_multi_task_loss(batch) + total_loss += loss.item() + + # Coletar predições + self._collect_predictions(batch, all_predictions) + + avg_loss = total_loss / len(eval_loader) + + # Computar métricas + metrics = {"eval_loss": avg_loss} + + for task, preds_labels in all_predictions.items(): + if preds_labels["preds"]: + task_metrics = self._compute_task_metrics( + preds_labels["preds"], + preds_labels["labels"], + task_name=task + ) + metrics.update(task_metrics) + + # Logging + prefix = "test" if is_test else "eval" + log_metrics = {f"{prefix}/{k}": v for k, v in metrics.items()} + + if self.config.use_wandb: + wandb.log(log_metrics) + + return metrics + + def _collect_predictions(self, batch: Dict[str, torch.Tensor], all_predictions: Dict): + """Coletar predições para avaliação""" + + # Anomaly detection + if "anomaly_labels" in batch: + anomaly_outputs = self.model.detect_anomalies( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"] + ) + + for i, pred in enumerate(anomaly_outputs["predictions"]): + anomaly_type_map = {"Normal": 0, "Suspeito": 1, "Anômalo": 2} + pred_label = anomaly_type_map[pred["anomaly_type"]] + all_predictions["anomaly"]["preds"].append(pred_label) + all_predictions["anomaly"]["labels"].append(batch["anomaly_labels"][i].item()) + + # Financial analysis + if "financial_risk_labels" in batch: + financial_outputs = self.model.analyze_financial_risk( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"] + ) + + for i, pred in enumerate(financial_outputs["predictions"]): + risk_level_map = {"Muito Baixo": 0, "Baixo": 1, "Médio": 2, "Alto": 3, "Muito Alto": 4} + pred_label = risk_level_map[pred["risk_level"]] + all_predictions["financial"]["preds"].append(pred_label) + all_predictions["financial"]["labels"].append(batch["financial_risk_labels"][i].item()) + + # Legal compliance + if "legal_compliance_labels" in batch: + legal_outputs = self.model.check_legal_compliance( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"] + ) + + for i, pred in enumerate(legal_outputs["predictions"]): + pred_label = 1 if pred["is_compliant"] else 0 + all_predictions["legal"]["preds"].append(pred_label) + all_predictions["legal"]["labels"].append(batch["legal_compliance_labels"][i].item()) + + def _compute_task_metrics(self, predictions: List, labels: List, task_name: str) -> Dict[str, float]: + """Computar métricas para uma tarefa específica""" + + accuracy = accuracy_score(labels, predictions) + precision, recall, f1, _ = precision_recall_fscore_support( + labels, predictions, average='weighted' + ) + + metrics = { + f"eval_{task_name}_accuracy": accuracy, + f"eval_{task_name}_precision": precision, + f"eval_{task_name}_recall": recall, + f"eval_{task_name}_f1": f1 + } + + # Métrica composta para early stopping + if task_name == "anomaly": # Usar anomaly como principal + metrics["eval_f1"] = f1 + + return metrics + + def _is_better_metric(self, current_metric: float) -> bool: + """Verificar se métrica atual é melhor""" + if self.config.greater_is_better: + return current_metric > self.best_metric + else: + return current_metric < self.best_metric + + def _save_checkpoint(self, epoch: int, is_best: bool = False): + """Salvar checkpoint do modelo""" + + output_dir = Path(self.config.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + if is_best: + save_path = output_dir / "best_model" + else: + save_path = output_dir / f"checkpoint-epoch-{epoch}" + + # Salvar modelo + self.model.save_model(str(save_path)) + + # Salvar estado do treinamento + training_state = { + "epoch": epoch, + "optimizer_state_dict": self.optimizer.state_dict(), + "scheduler_state_dict": self.scheduler.state_dict(), + "best_metric": self.best_metric, + "training_history": self.training_history + } + + torch.save(training_state, save_path / "training_state.pt") + + logger.info(f"✅ Checkpoint salvo em {save_path}") + + def _finalize_training(self): + """Finalizar treinamento""" + + # Salvar histórico de treinamento + output_dir = Path(self.config.output_dir) + + with open(output_dir / "training_history.json", "w") as f: + json.dump(self.training_history, f, indent=2) + + # Plotar curvas de treinamento + self._plot_training_curves() + + if self.config.use_wandb: + wandb.finish() + + logger.info("🎉 Treinamento finalizado com sucesso!") + + def _plot_training_curves(self): + """Plotar curvas de treinamento""" + + fig, axes = plt.subplots(2, 2, figsize=(15, 10)) + + # Loss de treinamento + epochs = range(1, len(self.training_history["train_loss"]) + 1) + axes[0, 0].plot(epochs, self.training_history["train_loss"]) + axes[0, 0].set_title("Loss de Treinamento") + axes[0, 0].set_xlabel("Época") + axes[0, 0].set_ylabel("Loss") + + # Métricas de avaliação + if self.training_history["eval_metrics"]: + eval_epochs = range(1, len(self.training_history["eval_metrics"]) + 1) + + # F1 Score + f1_scores = [m.get("eval_f1", 0) for m in self.training_history["eval_metrics"]] + axes[0, 1].plot(eval_epochs, f1_scores, 'g-') + axes[0, 1].set_title("F1 Score") + axes[0, 1].set_xlabel("Época") + axes[0, 1].set_ylabel("F1") + + # Accuracy + accuracy_scores = [m.get("eval_anomaly_accuracy", 0) for m in self.training_history["eval_metrics"]] + axes[1, 0].plot(eval_epochs, accuracy_scores, 'b-') + axes[1, 0].set_title("Accuracy") + axes[1, 0].set_xlabel("Época") + axes[1, 0].set_ylabel("Accuracy") + + # Loss de avaliação + eval_losses = [m.get("eval_loss", 0) for m in self.training_history["eval_metrics"]] + axes[1, 1].plot(eval_epochs, eval_losses, 'r-') + axes[1, 1].set_title("Loss de Avaliação") + axes[1, 1].set_xlabel("Época") + axes[1, 1].set_ylabel("Loss") + + plt.tight_layout() + + # Salvar plot + output_dir = Path(self.config.output_dir) + plt.savefig(output_dir / "training_curves.png", dpi=300, bbox_inches='tight') + plt.close() + + +def create_training_pipeline( + data_path: str, + config: Optional[TrainingConfig] = None +) -> Tuple[CidadaoAIForTransparency, CidadaoTrainer]: + """ + Criar pipeline de treinamento completo + + Args: + data_path: Caminho para dados de treinamento + config: Configuração de treinamento + + Returns: + Tuple com modelo e trainer + """ + + if config is None: + config = TrainingConfig() + + logger.info("🏗️ Criando pipeline de treinamento Cidadão.AI") + + # Criar modelo + model = create_cidadao_model( + specialized_tasks=config.specialized_tasks, + model_size=config.model_size + ) + + # Criar tokenizer + tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium") + tokenizer.pad_token = tokenizer.eos_token + + # Redimensionar embeddings se necessário + model.model.model.resize_token_embeddings(len(tokenizer)) + + # Criar trainer + trainer = CidadaoTrainer(model, tokenizer, config) + + logger.info(f"✅ Pipeline criado - Modelo: {config.model_size}, Tarefas: {config.specialized_tasks}") + + return model, trainer + + +def prepare_transparency_data(data_path: str, output_dir: str = "./data/processed"): + """ + Preparar dados de transparência para treinamento + + Esta função seria expandida para processar dados reais do Portal da Transparência + """ + + logger.info("📊 Preparando dados de transparência") + + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + # Aqui você implementaria: + # 1. Conexão com Portal da Transparência API + # 2. Extração e limpeza de dados + # 3. Anotação de anomalias (semi-supervisionado) + # 4. Balanceamento de classes + # 5. Divisão train/val/test + + # Por enquanto, criar dados sintéticos + logger.info("⚠️ Usando dados sintéticos para demonstração") + + # Implementação completa seria conectada aos dados reais + sample_data = { + "train": output_dir / "train.json", + "val": output_dir / "val.json", + "test": output_dir / "test.json" + } + + return sample_data + + +if __name__ == "__main__": + # Exemplo de uso + + # Configurar logging + logging.basicConfig(level=logging.INFO) + + # Configuração de treinamento + config = TrainingConfig( + experiment_name="cidadao-gpt-transparency-v1", + num_epochs=5, + batch_size=4, # Reduzido para teste + learning_rate=2e-5, + use_wandb=False, # Desabilitar para teste + output_dir="./models/cidadao-gpt-test" + ) + + # Criar pipeline + model, trainer = create_training_pipeline( + data_path="./data/transparency_data.json", + config=config + ) + + print("🤖 Cidadão.AI Training Pipeline criado com sucesso!") + print(f"📊 Modelo: {config.model_size}") + print(f"🎯 Tarefas especializadas: {config.specialized_tasks}") + print(f"💾 Diretório de saída: {config.output_dir}") \ No newline at end of file diff --git a/src/training/utils/__init__.py b/src/training/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/test_migration.py b/test_migration.py new file mode 100644 index 0000000000000000000000000000000000000000..e46e5b9aeba2c875de4453485ac979a2f1ff3919 --- /dev/null +++ b/test_migration.py @@ -0,0 +1,34 @@ +#!/usr/bin/env python3 +""" +Teste básico da migração do AnomalyDetector +""" + +import sys +import os + +# Add src to path +sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'src')) + +def test_basic_import(): + """Test if basic import works""" + try: + from models.anomaly_detection import AnomalyDetector + print("✅ Import AnomalyDetector OK") + + # Test instantiation + detector = AnomalyDetector() + print(f"✅ AnomalyDetector criado: {detector.model_name}") + + return True + + except Exception as e: + print(f"❌ Erro no import: {e}") + return False + +if __name__ == "__main__": + success = test_basic_import() + if success: + print("🎉 Migração básica funcionando!") + else: + print("💥 Problemas na migração") + sys.exit(1) \ No newline at end of file diff --git a/tests/test_anomaly_detector.py b/tests/test_anomaly_detector.py new file mode 100644 index 0000000000000000000000000000000000000000..a73c03c92d2a6ce28fc6857d9afe4dd2c3560af5 --- /dev/null +++ b/tests/test_anomaly_detector.py @@ -0,0 +1,295 @@ +""" +Tests for Anomaly Detection Module + +Comprehensive test suite for anomaly detector. +""" + +import pytest +import asyncio +from typing import List, Dict, Any + +import sys +import os +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(__file__)))) + +from src.models.anomaly_detection import AnomalyDetector + + +class TestAnomalyDetector: + """Test suite for AnomalyDetector.""" + + @pytest.fixture + def detector(self): + """Create anomaly detector instance.""" + return AnomalyDetector() + + @pytest.fixture + def sample_contracts(self): + """Sample contract data for testing.""" + return [ + { + "id": "CT001", + "description": "Aquisição de computadores", + "value": 50000.0, + "supplier": "Tech Company A", + "date": "2024-01-15", + "organ": "Ministry of Education" + }, + { + "id": "CT002", + "description": "Aquisição de computadores", + "value": 500000.0, # Anomaly: 10x higher + "supplier": "Tech Company B", + "date": "2024-01-20", + "organ": "Ministry of Education" + }, + { + "id": "CT003", + "description": "Serviços de consultoria", + "value": 75000.0, + "supplier": "Consulting Inc", + "date": "2024-02-01", + "organ": "Ministry of Health" + } + ] + + def test_detector_initialization(self, detector): + """Test detector is properly initialized.""" + assert detector is not None + assert detector.model_name == "anomaly_detector" + assert hasattr(detector, '_thresholds') + assert detector._thresholds['value_threshold'] == 1000000 + + def test_detector_training(self, detector, sample_contracts): + """Test detector training process.""" + # Run training + result = asyncio.run(detector.train(sample_contracts)) + + assert result['status'] == 'trained' + assert result['samples'] == len(sample_contracts) + assert result['model'] == 'anomaly_detector' + assert detector._is_trained is True + + def test_anomaly_detection_high_value(self, detector, sample_contracts): + """Test detection of high value anomalies.""" + # Train first + asyncio.run(detector.train(sample_contracts)) + + # Run prediction + results = asyncio.run(detector.predict(sample_contracts)) + + # Should detect high value anomaly + assert len(results) > 0 + + # Find the high value contract + high_value_result = next( + (r for r in results if r['contract_id'] == 'CT002'), + None + ) + + assert high_value_result is not None + assert high_value_result['is_anomaly'] is True + assert high_value_result['anomaly_type'] == 'high_value' + assert high_value_result['confidence'] > 0.8 + + def test_anomaly_detection_frequency(self, detector): + """Test detection of frequency anomalies.""" + # Create contracts with same supplier + contracts = [ + { + "id": f"CT{i:03d}", + "description": "Service contract", + "value": 50000.0, + "supplier": "Same Supplier LLC", # All same supplier + "date": f"2024-01-{i+1:02d}", + "organ": "Ministry X" + } + for i in range(15) # 15 contracts to same supplier + ] + + # Add one normal contract + contracts.append({ + "id": "CT999", + "description": "Different service", + "value": 45000.0, + "supplier": "Other Company", + "date": "2024-02-01", + "organ": "Ministry X" + }) + + # Train and predict + asyncio.run(detector.train(contracts)) + results = asyncio.run(detector.predict(contracts)) + + # Should detect frequency anomaly + frequency_anomalies = [ + r for r in results + if r.get('anomaly_type') == 'suspicious_frequency' + ] + + assert len(frequency_anomalies) > 0 + assert frequency_anomalies[0]['supplier'] == 'Same Supplier LLC' + + def test_no_anomalies_normal_data(self, detector): + """Test no anomalies detected in normal data.""" + # Create normal contracts + normal_contracts = [ + { + "id": f"CT{i:03d}", + "description": f"Service type {i % 3}", + "value": 50000.0 + (i * 1000), # Small variations + "supplier": f"Company {chr(65 + i % 5)}", # 5 different suppliers + "date": f"2024-01-{(i % 28) + 1:02d}", + "organ": f"Ministry {i % 3}" + } + for i in range(20) + ] + + # Train and predict + asyncio.run(detector.train(normal_contracts)) + results = asyncio.run(detector.predict(normal_contracts)) + + # Should have few or no anomalies + anomalies = [r for r in results if r.get('is_anomaly', False)] + assert len(anomalies) < 3 # Less than 15% anomalies + + def test_empty_data_handling(self, detector): + """Test handling of empty data.""" + # Train with empty data + result = asyncio.run(detector.train([])) + assert result['status'] == 'trained' + assert result['samples'] == 0 + + # Predict with empty data + results = asyncio.run(detector.predict([])) + assert results == [] + + def test_invalid_data_handling(self, detector): + """Test handling of invalid data.""" + invalid_contracts = [ + {"id": "CT001"}, # Missing required fields + {"id": "CT002", "value": "not_a_number"}, # Invalid type + None, # Null entry + ] + + # Should handle gracefully + try: + asyncio.run(detector.train(invalid_contracts)) + results = asyncio.run(detector.predict(invalid_contracts)) + # Should either skip invalid entries or return empty + assert isinstance(results, list) + except Exception as e: + # Should raise meaningful error + assert "invalid" in str(e).lower() or "error" in str(e).lower() + + def test_threshold_configuration(self): + """Test custom threshold configuration.""" + # Create detector with custom thresholds + custom_detector = AnomalyDetector() + custom_detector._thresholds = { + "value_threshold": 100000, # Lower threshold + "frequency_threshold": 5, # Lower frequency + "pattern_threshold": 0.9 # Higher pattern threshold + } + + assert custom_detector._thresholds['value_threshold'] == 100000 + assert custom_detector._thresholds['frequency_threshold'] == 5 + assert custom_detector._thresholds['pattern_threshold'] == 0.9 + + @pytest.mark.parametrize("num_contracts,expected_performance", [ + (10, 0.1), # 10 contracts should process in < 0.1s + (100, 0.5), # 100 contracts should process in < 0.5s + (1000, 2.0), # 1000 contracts should process in < 2s + ]) + def test_performance(self, detector, num_contracts, expected_performance): + """Test performance with different data sizes.""" + import time + + # Generate test data + contracts = [ + { + "id": f"CT{i:06d}", + "description": f"Contract {i}", + "value": 50000.0 + (i * 100), + "supplier": f"Company {i % 20}", + "date": f"2024-01-{(i % 28) + 1:02d}", + "organ": f"Ministry {i % 5}" + } + for i in range(num_contracts) + ] + + # Measure prediction time + asyncio.run(detector.train(contracts[:100])) # Train on subset + + start_time = time.time() + results = asyncio.run(detector.predict(contracts)) + elapsed_time = time.time() - start_time + + assert elapsed_time < expected_performance + assert len(results) <= len(contracts) + + +@pytest.mark.asyncio +class TestAsyncAnomalyDetector: + """Async test suite for AnomalyDetector.""" + + async def test_concurrent_predictions(self): + """Test concurrent prediction requests.""" + detector = AnomalyDetector() + + # Create multiple contract sets + contract_sets = [ + [ + { + "id": f"SET{set_id}-CT{i:03d}", + "description": f"Contract {i}", + "value": 50000.0 * (set_id + 1), + "supplier": f"Company {i}", + "date": "2024-01-15", + "organ": f"Ministry {set_id}" + } + for i in range(10) + ] + for set_id in range(5) + ] + + # Train detector + await detector.train(contract_sets[0]) + + # Run concurrent predictions + tasks = [ + detector.predict(contracts) + for contracts in contract_sets + ] + + results = await asyncio.gather(*tasks) + + # All should complete successfully + assert len(results) == 5 + for result in results: + assert isinstance(result, list) + + async def test_model_state_persistence(self): + """Test model state is maintained across predictions.""" + detector = AnomalyDetector() + + # Initial training + train_data = [ + { + "id": f"CT{i:03d}", + "description": "Initial contract", + "value": 100000.0, + "supplier": f"Company {i}", + "date": "2024-01-01", + "organ": "Ministry A" + } + for i in range(50) + ] + + await detector.train(train_data) + assert detector._is_trained is True + + # Multiple predictions shouldn't affect trained state + for _ in range(10): + await detector.predict(train_data[:10]) + assert detector._is_trained is True \ No newline at end of file diff --git a/tests/test_api_server.py b/tests/test_api_server.py new file mode 100644 index 0000000000000000000000000000000000000000..6e8ff515f6607e51915055e515ed15ad7db07441 --- /dev/null +++ b/tests/test_api_server.py @@ -0,0 +1,329 @@ +""" +Tests for Models API Server + +Comprehensive test suite for FastAPI inference server. +""" + +import pytest +import asyncio +from fastapi.testclient import TestClient +import sys +import os + +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(__file__)))) + +from src.inference.api_server import app + + +class TestAPIServer: + """Test suite for Models API Server.""" + + @pytest.fixture + def client(self): + """Create test client.""" + return TestClient(app) + + @pytest.fixture + def sample_contracts(self): + """Sample contract data for testing.""" + return [ + { + "id": "CT001", + "description": "Aquisição de computadores", + "value": 50000.0, + "supplier": "Tech Company A", + "date": "2024-01-15", + "organ": "Ministry of Education" + }, + { + "id": "CT002", + "description": "Aquisição de computadores", + "value": 500000.0, + "supplier": "Tech Company B", + "date": "2024-01-20", + "organ": "Ministry of Education" + } + ] + + def test_root_endpoint(self, client): + """Test root endpoint returns API info.""" + response = client.get("/") + assert response.status_code == 200 + + data = response.json() + assert data["api"] == "Cidadão.AI Models" + assert data["version"] == "1.0.0" + assert data["status"] == "operational" + assert "anomaly_detector" in data["models"] + assert "endpoints" in data + + def test_health_check(self, client): + """Test health check endpoint.""" + response = client.get("/health") + assert response.status_code == 200 + + data = response.json() + assert data["status"] == "healthy" + assert data["models_loaded"] is True + assert "anomaly_detector" in data["models"] + + def test_detect_anomalies_endpoint(self, client, sample_contracts): + """Test anomaly detection endpoint.""" + response = client.post( + "/v1/detect-anomalies", + json={ + "contracts": sample_contracts, + "threshold": 0.7 + } + ) + assert response.status_code == 200 + + data = response.json() + assert "anomalies" in data + assert "total_analyzed" in data + assert "anomalies_found" in data + assert "confidence_score" in data + assert "model_version" in data + + assert data["total_analyzed"] == len(sample_contracts) + assert isinstance(data["anomalies"], list) + assert 0 <= data["confidence_score"] <= 1 + + def test_analyze_patterns_endpoint(self, client): + """Test pattern analysis endpoint.""" + response = client.post( + "/v1/analyze-patterns", + json={ + "data": { + "contracts": [{"value": 100000}, {"value": 200000}], + "period": "2024-Q1" + }, + "analysis_type": "temporal" + } + ) + assert response.status_code == 200 + + data = response.json() + assert "patterns" in data + assert "pattern_count" in data + assert "confidence" in data + assert "insights" in data + + assert isinstance(data["patterns"], list) + assert data["pattern_count"] >= 0 + assert 0 <= data["confidence"] <= 1 + assert isinstance(data["insights"], list) + + def test_analyze_spectral_endpoint(self, client): + """Test spectral analysis endpoint.""" + response = client.post( + "/v1/analyze-spectral", + json={ + "time_series": [100, 200, 150, 300, 250, 400, 350], + "sampling_rate": 1.0 + } + ) + assert response.status_code == 200 + + data = response.json() + assert "frequencies" in data + assert "amplitudes" in data + assert "dominant_frequency" in data + assert "periodic_patterns" in data + + assert isinstance(data["frequencies"], list) + assert isinstance(data["amplitudes"], list) + assert isinstance(data["dominant_frequency"], float) + assert isinstance(data["periodic_patterns"], list) + + def test_metrics_endpoint(self, client): + """Test Prometheus metrics endpoint.""" + # Make some requests first + client.get("/") + client.get("/health") + + response = client.get("/metrics") + assert response.status_code == 200 + + metrics = response.text + assert "cidadao_models_requests_total" in metrics + assert "cidadao_models_request_duration_seconds" in metrics + assert "cidadao_models_anomalies_total" in metrics + + def test_invalid_endpoint(self, client): + """Test invalid endpoint returns 404.""" + response = client.get("/invalid/endpoint") + assert response.status_code == 404 + + def test_empty_contracts_handling(self, client): + """Test handling of empty contracts list.""" + response = client.post( + "/v1/detect-anomalies", + json={ + "contracts": [], + "threshold": 0.7 + } + ) + assert response.status_code == 200 + + data = response.json() + assert data["total_analyzed"] == 0 + assert data["anomalies_found"] == 0 + assert data["anomalies"] == [] + + def test_invalid_request_format(self, client): + """Test handling of invalid request format.""" + # Missing required field + response = client.post( + "/v1/detect-anomalies", + json={"threshold": 0.7} # Missing contracts + ) + assert response.status_code == 422 # Validation error + + # Invalid data type + response = client.post( + "/v1/detect-anomalies", + json={ + "contracts": "not_a_list", + "threshold": 0.7 + } + ) + assert response.status_code == 422 + + def test_cors_headers(self, client): + """Test CORS headers are properly set.""" + response = client.options("/") + assert "access-control-allow-origin" in response.headers + assert response.headers["access-control-allow-origin"] == "*" + + @pytest.mark.parametrize("num_contracts,max_time", [ + (10, 1.0), # 10 contracts in < 1s + (100, 2.0), # 100 contracts in < 2s + (500, 5.0), # 500 contracts in < 5s + ]) + def test_performance_requirements(self, client, num_contracts, max_time): + """Test API performance with different loads.""" + import time + + # Generate test data + contracts = [ + { + "id": f"CT{i:06d}", + "description": f"Contract {i}", + "value": 50000.0 + (i * 100), + "supplier": f"Company {i % 20}", + "date": f"2024-01-{(i % 28) + 1:02d}", + "organ": f"Ministry {i % 5}" + } + for i in range(num_contracts) + ] + + start_time = time.time() + response = client.post( + "/v1/detect-anomalies", + json={ + "contracts": contracts, + "threshold": 0.7 + } + ) + elapsed_time = time.time() - start_time + + assert response.status_code == 200 + assert elapsed_time < max_time + + +class TestAPIServerIntegration: + """Integration tests for API Server.""" + + @pytest.fixture + def client(self): + """Create test client.""" + return TestClient(app) + + def test_full_workflow(self, client): + """Test complete workflow from detection to analysis.""" + # Step 1: Detect anomalies + contracts = [ + { + "id": f"CT{i:03d}", + "description": f"Contract {i}", + "value": 100000.0 if i != 5 else 1000000.0, # One anomaly + "supplier": f"Company {i % 3}", + "date": f"2024-01-{i+1:02d}", + "organ": "Ministry A" + } + for i in range(10) + ] + + anomaly_response = client.post( + "/v1/detect-anomalies", + json={"contracts": contracts} + ) + assert anomaly_response.status_code == 200 + anomaly_data = anomaly_response.json() + assert anomaly_data["anomalies_found"] > 0 + + # Step 2: Analyze patterns in the data + pattern_response = client.post( + "/v1/analyze-patterns", + json={ + "data": { + "anomaly_results": anomaly_data, + "contracts": contracts + }, + "analysis_type": "temporal" + } + ) + assert pattern_response.status_code == 200 + pattern_data = pattern_response.json() + assert len(pattern_data["patterns"]) > 0 + + # Step 3: Perform spectral analysis on values + values = [c["value"] for c in contracts] + spectral_response = client.post( + "/v1/analyze-spectral", + json={ + "time_series": values, + "sampling_rate": 1.0 + } + ) + assert spectral_response.status_code == 200 + spectral_data = spectral_response.json() + assert spectral_data["dominant_frequency"] >= 0 + + def test_concurrent_requests(self, client): + """Test handling of concurrent requests.""" + import concurrent.futures + + def make_request(request_id): + return client.post( + "/v1/detect-anomalies", + json={ + "contracts": [ + { + "id": f"REQ{request_id}-CT001", + "description": "Test contract", + "value": 100000.0, + "supplier": "Test Company", + "date": "2024-01-01", + "organ": "Test Ministry" + } + ] + } + ) + + # Make 10 concurrent requests + with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor: + futures = [ + executor.submit(make_request, i) + for i in range(10) + ] + + results = [ + future.result() + for future in concurrent.futures.as_completed(futures) + ] + + # All should succeed + assert all(r.status_code == 200 for r in results) + assert len(results) == 10 \ No newline at end of file diff --git a/tests/test_models_client.py b/tests/test_models_client.py new file mode 100644 index 0000000000000000000000000000000000000000..dd85910dd9e6d57b8573607f793b82c57c16e238 --- /dev/null +++ b/tests/test_models_client.py @@ -0,0 +1,267 @@ +""" +Tests for Models Client (Backend Integration) + +Test suite for the models client with fallback functionality. +""" + +import pytest +import asyncio +from unittest.mock import Mock, AsyncMock, patch +import httpx + +import sys +import os +# Add backend path for testing +backend_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), "..", "cidadao.ai-backend") +sys.path.insert(0, backend_path) + +from src.tools.models_client import ModelsClient, ModelAPIStatus + + +class TestModelsClient: + """Test suite for ModelsClient.""" + + @pytest.fixture + def client(self): + """Create models client instance.""" + return ModelsClient( + base_url="http://localhost:8001", + timeout=5.0, + enable_fallback=True + ) + + @pytest.fixture + def sample_contracts(self): + """Sample contract data.""" + return [ + { + "id": "CT001", + "description": "Test contract", + "value": 100000.0, + "supplier": "Company A", + "date": "2024-01-01", + "organ": "Ministry X" + } + ] + + def test_client_initialization(self): + """Test client initialization with different configs.""" + # Default initialization + client = ModelsClient() + assert client.base_url == "http://localhost:8001" + assert client.timeout == 30.0 + assert client.enable_fallback is True + assert client.status == ModelAPIStatus.ONLINE + + # Custom initialization + client = ModelsClient( + base_url="http://models:8080", + timeout=10.0, + enable_fallback=False + ) + assert client.base_url == "http://models:8080" + assert client.timeout == 10.0 + assert client.enable_fallback is False + + @pytest.mark.asyncio + async def test_health_check_success(self, client): + """Test successful health check.""" + # Mock successful response + with patch.object(client.client, 'get') as mock_get: + mock_response = Mock() + mock_response.json.return_value = { + "status": "healthy", + "models_loaded": True + } + mock_response.raise_for_status = Mock() + mock_get.return_value = mock_response + + result = await client.health_check() + + assert result["status"] == "healthy" + assert client.status == ModelAPIStatus.ONLINE + assert client._failure_count == 0 + + @pytest.mark.asyncio + async def test_health_check_failure(self, client): + """Test health check failure handling.""" + # Mock failed response + with patch.object(client.client, 'get') as mock_get: + mock_get.side_effect = httpx.RequestError("Connection failed") + + result = await client.health_check() + + assert result["status"] == "unhealthy" + assert "error" in result + assert result["fallback_available"] is True + assert client._failure_count == 1 + + @pytest.mark.asyncio + async def test_detect_anomalies_api_success(self, client, sample_contracts): + """Test successful anomaly detection via API.""" + expected_response = { + "anomalies": [], + "total_analyzed": 1, + "anomalies_found": 0, + "confidence_score": 0.95, + "model_version": "1.0.0" + } + + with patch.object(client.client, 'post') as mock_post: + mock_response = Mock() + mock_response.json.return_value = expected_response + mock_response.raise_for_status = Mock() + mock_post.return_value = mock_response + + result = await client.detect_anomalies(sample_contracts) + + assert result == expected_response + assert client.status == ModelAPIStatus.ONLINE + + # Verify request was made correctly + mock_post.assert_called_once() + call_args = mock_post.call_args + assert call_args[0][0] == "/v1/detect-anomalies" + assert "contracts" in call_args[1]["json"] + + @pytest.mark.asyncio + async def test_detect_anomalies_with_fallback(self, client, sample_contracts): + """Test anomaly detection with fallback to local ML.""" + # Mock API failure + with patch.object(client.client, 'post') as mock_post: + mock_post.side_effect = httpx.RequestError("API unavailable") + + # Mock local ML + with patch.object(client, '_local_anomaly_detection') as mock_local: + mock_local.return_value = { + "anomalies": [], + "total_analyzed": 1, + "anomalies_found": 0, + "confidence_score": 0.85, + "model_version": "local-1.0.0", + "source": "local_fallback" + } + + result = await client.detect_anomalies(sample_contracts) + + assert result["source"] == "local_fallback" + assert result["confidence_score"] == 0.85 + mock_local.assert_called_once_with(sample_contracts, 0.7) + + @pytest.mark.asyncio + async def test_detect_anomalies_no_fallback_error(self, client, sample_contracts): + """Test anomaly detection error when fallback is disabled.""" + client.enable_fallback = False + + with patch.object(client.client, 'post') as mock_post: + mock_post.side_effect = httpx.RequestError("API unavailable") + + with pytest.raises(httpx.RequestError): + await client.detect_anomalies(sample_contracts) + + @pytest.mark.asyncio + async def test_analyze_patterns_success(self, client): + """Test successful pattern analysis.""" + test_data = {"contracts": [{"value": 100000}]} + expected_response = { + "patterns": [{"type": "temporal"}], + "pattern_count": 1, + "confidence": 0.88, + "insights": ["Pattern detected"] + } + + with patch.object(client.client, 'post') as mock_post: + mock_response = Mock() + mock_response.json.return_value = expected_response + mock_response.raise_for_status = Mock() + mock_post.return_value = mock_response + + result = await client.analyze_patterns(test_data) + + assert result == expected_response + assert client.status == ModelAPIStatus.ONLINE + + @pytest.mark.asyncio + async def test_analyze_spectral_success(self, client): + """Test successful spectral analysis.""" + time_series = [100, 200, 150, 300, 250] + expected_response = { + "frequencies": [0.1, 0.5], + "amplitudes": [10.0, 20.0], + "dominant_frequency": 0.5, + "periodic_patterns": [] + } + + with patch.object(client.client, 'post') as mock_post: + mock_response = Mock() + mock_response.json.return_value = expected_response + mock_response.raise_for_status = Mock() + mock_post.return_value = mock_response + + result = await client.analyze_spectral(time_series) + + assert result == expected_response + assert result["dominant_frequency"] == 0.5 + + def test_circuit_breaker_functionality(self, client): + """Test circuit breaker marks API as offline after failures.""" + # Initial state + assert client.status == ModelAPIStatus.ONLINE + assert client._failure_count == 0 + + # First failure + client._handle_failure() + assert client.status == ModelAPIStatus.DEGRADED + assert client._failure_count == 1 + + # Second failure + client._handle_failure() + assert client.status == ModelAPIStatus.DEGRADED + assert client._failure_count == 2 + + # Third failure - circuit opens + client._handle_failure() + assert client.status == ModelAPIStatus.OFFLINE + assert client._failure_count == 3 + + # Reset on success + client._reset_failure_count() + assert client.status == ModelAPIStatus.ONLINE + assert client._failure_count == 0 + + @pytest.mark.asyncio + async def test_context_manager(self): + """Test client as async context manager.""" + async with ModelsClient() as client: + assert isinstance(client, ModelsClient) + assert client.status == ModelAPIStatus.ONLINE + + # Client should be closed after context + with pytest.raises(RuntimeError): + await client.health_check() + + @pytest.mark.asyncio + async def test_local_fallback_caching(self, client, sample_contracts): + """Test local models are cached for performance.""" + # Force use of local fallback + client.status = ModelAPIStatus.OFFLINE + + # First call - creates model + result1 = await client._local_anomaly_detection(sample_contracts, 0.7) + assert "anomaly_detector" in client._local_models + + # Second call - reuses model + initial_model = client._local_models["anomaly_detector"] + result2 = await client._local_anomaly_detection(sample_contracts, 0.7) + + # Should be same model instance + assert client._local_models["anomaly_detector"] is initial_model + + def test_singleton_client(self): + """Test singleton client instance.""" + from src.tools.models_client import get_models_client + + client1 = get_models_client() + client2 = get_models_client() + + assert client1 is client2 # Same instance \ No newline at end of file