# 🏛️ CIDADÃO.AI - GENERAL PROJECT CONTEXT
⚠️ UNIVERSAL HEADER - DO NOT REMOVE - Updated: January 2025

## 🎯 ECOSYSTEM OVERVIEW
Cidadão.AI is an ecosystem of four specialized repositories that work together to democratize Brazilian public transparency through advanced AI:
## 📦 ECOSYSTEM REPOSITORIES
- cidadao.ai-backend → API + Multi-Agent System + ML Pipeline
- cidadao.ai-frontend → Web Interface + Internationalization
- cidadao.ai-docs → Documentation Hub + Landing Page
- cidadao.ai-models → AI Models + MLOps Pipeline (THIS REPOSITORY)
## 🤖 MULTI-AGENT SYSTEM (17 Agents)
- MasterAgent (Abaporu) - Central orchestration with self-reflection
- InvestigatorAgent - Anomaly detection in public data
- AnalystAgent - Pattern and correlation analysis
- ReporterAgent - Intelligent report generation
- SecurityAuditorAgent - Auditing and compliance
- CommunicationAgent - Inter-agent communication
- CorruptionDetectorAgent - Corruption detection
- PredictiveAgent - Predictive analysis
- VisualizationAgent - Data visualization
- BonifacioAgent - Public contracts
- DandaraAgent - Diversity and inclusion
- MachadoAgent - Natural language processing
- SemanticRouter - Intelligent routing
- ContextMemoryAgent - Memory system
- ETLExecutorAgent - Data processing
- ObserverAgent - Monitoring
- ValidatorAgent - Quality validation
## 🏗️ TECHNICAL ARCHITECTURE
- Overall Score: 9.3/10 (Enterprise-grade)
- Backend: FastAPI + Python 3.11+ + PostgreSQL + Redis + ChromaDB
- Frontend: Next.js 15 + React 19 + TypeScript + Tailwind CSS 4
- Deploy: Docker + Kubernetes + SSL + Monitoring
- AI: LangChain + Transformers + OpenAI/Groq + Vector DBs
## 🛡️ SECURITY AND AUDITING
- Multi-layer security with specialized middleware
- JWT + OAuth2 + API key authentication
- Complete audit trail with severity levels
- Rate limiting + CORS + SSL termination
## 🎯 MISSION AND IMPACT
- Democratize access to public data analysis
- Detect anomalies and irregularities automatically
- Empower citizens with clear, auditable information
- Strengthen government transparency through ethical AI
## 📊 PROJECT STATUS
- Version: 1.0.0 (Production-Ready)
- Technical Score: 9.3/10
- Test Coverage: 23.6% (Target: >80%)
- Deploy: Kubernetes + Vercel + HuggingFace Spaces
# CLAUDE.md - AI MODELS
This file provides guidance to Claude Code when working with the Cidadão.AI AI models and MLOps pipeline.
## 🤖 AI Models Overview
Cidadão.AI Models is the repository responsible for the machine learning models, MLOps pipeline, and AI infrastructure that power the multi-agent system. It manages training, versioning, deployment, and monitoring of the models specialized in public transparency.

Current Status: MLOps pipeline under development - infrastructure for custom models, HuggingFace Hub integration, and an automated training pipeline.
## 🏗️ AI Models Architectural Analysis
Overall Models Score: 7.8/10 (Pipeline Under Construction)

The Cidadão.AI Models repository provides a solid foundation for MLOps specialized in public transparency analysis. The system is prepared to host custom models and to integrate with the agent ecosystem.
### 📊 Model Technical Metrics
- Framework: PyTorch + Transformers + HuggingFace
- MLOps: MLflow + DVC + Weights & Biases
- Deploy: HuggingFace Spaces + Docker containers
- Monitoring: Model performance tracking + drift detection
- Storage: HuggingFace Hub + cloud storage integration
- CI/CD: Automated training + testing + deployment
### 🚀 Planned Components (Score 7-8/10)
- Model Registry: 7.8/10 - HuggingFace Hub integration
- Training Pipeline: 7.5/10 - Automated training workflow
- Model Serving: 7.7/10 - FastAPI + HuggingFace Spaces
- Monitoring: 7.3/10 - Performance tracking system
- Version Control: 8.0/10 - Git + DVC + HuggingFace

### 🎯 Components in Development (Score 6-7/10)
- Custom Models: 6.8/10 - Domain-specific fine-tuning
- Data Pipeline: 6.5/10 - ETL for training data
- Evaluation: 6.7/10 - Automated model evaluation
- A/B Testing: 6.3/10 - Model comparison framework
## 🧠 Model Architecture

### Planned Specialized Models
```python
# Cidadão.AI model taxonomy
models_taxonomy = {
    "corruption_detection": {
        "type": "classification",
        "base_model": "bert-base-multilingual-cased",
        "specialization": "Brazilian Portuguese + government documents",
        "use_case": "Detect corruption indicators in contracts"
    },
    "anomaly_detection": {
        "type": "regression + classification",
        "base_model": "Custom ensemble",
        "specialization": "Financial data patterns",
        "use_case": "Identify unusual spending patterns"
    },
    "entity_extraction": {
        "type": "NER",
        "base_model": "roberta-large",
        "specialization": "Government entities + Brazilian names",
        "use_case": "Extract companies, people, organizations"
    },
    "sentiment_analysis": {
        "type": "classification",
        "base_model": "distilbert-base-uncased",
        "specialization": "Public opinion on transparency",
        "use_case": "Analyze citizen feedback sentiment"
    },
    "summarization": {
        "type": "seq2seq",
        "base_model": "t5-base",
        "specialization": "Government reports + legal documents",
        "use_case": "Generate executive summaries"
    }
}
```
### MLOps Pipeline Architecture

```yaml
# MLOps workflow
stages:
  data_collection:
    - Portal da Transparência APIs
    - Government databases
    - Public procurement data
    - Historical investigations
  data_preprocessing:
    - Data cleaning & validation
    - Privacy anonymization
    - Feature engineering
    - Data augmentation
  model_training:
    - Hyperparameter optimization
    - Cross-validation
    - Ensemble methods
    - Transfer learning
  model_evaluation:
    - Performance metrics
    - Fairness evaluation
    - Bias detection
    - Interpretability analysis
  model_deployment:
    - HuggingFace Spaces
    - Container deployment
    - API endpoints
    - Model serving
  monitoring:
    - Model drift detection
    - Performance degradation
    - Data quality monitoring
    - Usage analytics
```
## 🔬 Specialized AI Models

### 1. Corruption Detection Model

```
# Model specialized in corruption detection (spec, not runnable code)
class CorruptionDetector:
    base_model: "bert-base-multilingual-cased"
    fine_tuned_on: "Brazilian government contracts + known corruption cases"
    features:
        - Contract language analysis
        - Pricing anomaly detection
        - Vendor relationship patterns
        - Temporal irregularities
    metrics:
        - Precision: >85%
        - Recall: >80%
        - F1-Score: >82%
        - False Positive Rate: <5%
```
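The metric targets above can be checked with a small stdlib helper once a labeled test set exists; `detector_metrics` is a hypothetical name used for illustration, not an existing module in this repository.

```python
def detector_metrics(y_true, y_pred):
    """Precision, recall, F1, and false-positive rate for a binary detector."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "false_positive_rate": fpr}
```

A release gate would then assert, e.g., `metrics["precision"] > 0.85` and `metrics["false_positive_rate"] < 0.05` before promoting a checkpoint.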
### 2. Anomaly Detection Ensemble

```
# Ensemble for financial anomaly detection (spec, not runnable code)
class AnomalyDetector:
    models:
        - IsolationForest: "Outlier detection"
        - LSTM: "Temporal pattern analysis"
        - Autoencoder: "Reconstruction error"
        - Random Forest: "Feature importance"
    features:
        - Amount deviation from median
        - Vendor concentration
        - Seasonal patterns
        - Geographic distribution
    output:
        - Anomaly score (0-1)
        - Confidence interval
        - Explanation vector
        - Risk category
```
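One of the ensemble signals above, "amount deviation from median", can be sketched as a robust z-score mapped into the ensemble's 0-1 anomaly-score range. `median_deviation_score` and the 5-sigma saturation point are illustrative choices, not the production implementation.

```python
import numpy as np

def median_deviation_score(amounts):
    """Robust z-score (deviation from median, scaled by MAD) squashed to
    the 0-1 anomaly-score range used by the ensemble output."""
    amounts = np.asarray(amounts, dtype=float)
    med = np.median(amounts)
    mad = np.median(np.abs(amounts - med)) or 1.0  # guard against MAD == 0
    robust_z = np.abs(amounts - med) / (1.4826 * mad)
    return np.clip(robust_z / 5.0, 0.0, 1.0)  # saturate at 5 robust sigmas
```

Using the median and MAD rather than mean and standard deviation keeps a single huge contract from masking its own anomalousness.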
### 3. Entity Recognition (NER)

```
# NER specialized for government entities (spec, not runnable code)
class GovernmentNER:
    base_model: "roberta-large"
    entities:
        - ORGANIZATION: "Ministries, agencies, companies"
        - PERSON: "Civil servants, politicians, business owners"
        - LOCATION: "States, municipalities, addresses"
        - CONTRACT: "Contract and procurement numbers"
        - MONEY: "Monetary amounts, currencies"
        - DATE: "Contract dates, validity periods"
    brazilian_specialization:
        - CPF/CNPJ recognition
        - Brazilian address patterns
        - Government terminology
        - Legal document structure
```
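The CPF/CNPJ recognition listed above can start from simple surface patterns before any learned NER layer. The regexes below match only the canonical formatted notations and are an illustrative sketch; they do not validate check digits.

```python
import re

# Canonical Brazilian identifier formats:
#   CPF  (individuals): 000.000.000-00
#   CNPJ (companies):   00.000.000/0000-00
CPF_RE = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")
CNPJ_RE = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")

def find_br_ids(text):
    """Extract formatted CPF and CNPJ mentions from free text."""
    return {"cpf": CPF_RE.findall(text), "cnpj": CNPJ_RE.findall(text)}
```

In a full pipeline these matches would be merged with the transformer's entity spans and optionally verified with the check-digit algorithm.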
## 🚀 HuggingFace Integration

### Model Hub Strategy

```yaml
# HuggingFace Hub organization
organization: "cidadao-ai"
models:
  - "cidadao-ai/corruption-detector-pt"
  - "cidadao-ai/anomaly-detector-financial"
  - "cidadao-ai/ner-government-entities"
  - "cidadao-ai/sentiment-transparency"
  - "cidadao-ai/summarization-reports"
spaces:
  - "cidadao-ai/corruption-demo"
  - "cidadao-ai/anomaly-dashboard"
  - "cidadao-ai/transparency-analyzer"
```
### Model Card Template

```markdown
# Model Card: Cidadão.AI Corruption Detector

## Model Description
- **Developed by**: Cidadão.AI Team
- **Model type**: BERT-based binary classifier
- **Language**: Portuguese (Brazil)
- **License**: MIT

## Training Data
- **Sources**: Portal da Transparência + curated corruption cases
- **Size**: 100K+ government contracts
- **Preprocessing**: Anonymization + cleaning + augmentation

## Evaluation
- **Test Set**: 10K held-out contracts
- **Metrics**: Precision: 87%, Recall: 83%, F1: 85%
- **Bias Analysis**: Evaluated across regions + contract types

## Ethical Considerations
- **Intended Use**: Transparency analysis, not legal evidence
- **Limitations**: May have bias toward certain contract types
- **Risks**: False positives could damage reputations
```
## 🛠️ MLOps Pipeline

### Training Infrastructure

```yaml
# training-pipeline.yml
name: Model Training Pipeline
on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly retraining
  workflow_dispatch:
jobs:
  data_preparation:
    runs-on: ubuntu-latest
    steps:
      - name: Fetch latest data
      - name: Validate data quality
      - name: Preprocess & augment
  model_training:
    runs-on: gpu-runner
    steps:
      - name: Hyperparameter optimization
      - name: Train model
      - name: Evaluate performance
  model_deployment:
    runs-on: ubuntu-latest
    if: model_performance > threshold  # illustrative gate, not literal Actions syntax
    steps:
      - name: Upload to HuggingFace Hub
      - name: Update model registry
      - name: Deploy to production
```
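The `model_performance > threshold` gate above is pseudocode; inside a deployment script it might look like the sketch below. `maybe_deploy` and the 0.85 F1 threshold are assumptions, and the `huggingface_hub` upload runs only when the gate passes.

```python
def maybe_deploy(metrics: dict, model_dir: str, repo_id: str,
                 threshold: float = 0.85) -> bool:
    """Upload a trained model to the Hub only when it beats the quality gate."""
    if metrics.get("f1", 0.0) <= threshold:
        return False  # keep the current production model
    # Imported lazily so the gate itself needs no extra dependency
    from huggingface_hub import HfApi
    HfApi().upload_folder(folder_path=model_dir, repo_id=repo_id)
    return True
```

The CI job would call this after evaluation, e.g. `maybe_deploy(eval_metrics, "out/model", "cidadao-ai/corruption-detector-pt")`, and skip the registry update when it returns `False`.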
### Model Monitoring Dashboard

```python
# Monitoring metrics
monitoring_metrics = {
    "performance": {
        "accuracy": "Real-time accuracy tracking",
        "latency": "Response time monitoring",
        "throughput": "Requests per second",
        "error_rate": "Failed prediction rate"
    },
    "data_drift": {
        "feature_drift": "Input distribution changes",
        "label_drift": "Output distribution changes",
        "concept_drift": "Relationship changes"
    },
    "business": {
        "investigations_triggered": "Anomalies detected",
        "false_positive_rate": "User feedback tracking",
        "citizen_satisfaction": "User experience metrics"
    }
}
```
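The `feature_drift` entry above needs a concrete statistic; the Population Stability Index (PSI) is a common, dependency-light choice. This is a sketch using the conventional rule of thumb (PSI above roughly 0.2 is usually treated as significant drift), not the project's monitoring code.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference and a current feature sample; values above
    ~0.2 are commonly treated as significant drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

A drift job would compute this per feature over a sliding window and raise an alert when any feature crosses the threshold.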
## 🧪 Experimentation and Evaluation

### Experiment Tracking

```python
# MLflow + Weights & Biases integration
import mlflow
import wandb

def train_model(config: dict):
    with mlflow.start_run():
        wandb.init(project="cidadao-ai", config=config)

        # Log hyperparameters
        mlflow.log_params(config)
        wandb.config.update(config)

        # Training loop (model, train_loader, and train_epoch come from the
        # surrounding training code)
        for epoch in range(config["epochs"]):
            metrics = train_epoch(model, train_loader)

            # Log metrics
            mlflow.log_metrics(metrics, step=epoch)
            wandb.log(metrics)

        # Log model artifacts
        mlflow.pytorch.log_model(model, "model")
        wandb.save("model.pt")
```
### A/B Testing Framework

```python
# A/B testing framework for models
import hashlib

class ModelABTest:
    def __init__(self, model_a, model_b, traffic_split=0.5):
        self.model_a = model_a
        self.model_b = model_b
        self.traffic_split = traffic_split

    def predict(self, input_data, user_id):
        # Route traffic by a stable hash of user_id (the built-in hash()
        # is randomized per process, so use hashlib for consistent buckets)
        bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
        if bucket < self.traffic_split * 100:
            result = self.model_a.predict(input_data)
            self.log_prediction("model_a", result, user_id)
        else:
            result = self.model_b.predict(input_data)
            self.log_prediction("model_b", result, user_id)
        return result

    def log_prediction(self, variant, result, user_id):
        # Sketch: persist for offline comparison of the two variants
        print(variant, user_id, result)
```
## 📊 Datasets and Training

### Specialized Datasets

```python
# Training datasets
datasets = {
    "transparency_contracts": {
        "source": "Portal da Transparência API",
        "size": "500K+ contracts",
        "format": "JSON + PDF text extraction",
        "labels": "Manual annotation + expert review"
    },
    "corruption_cases": {
        "source": "Historical investigations + court records",
        "size": "10K+ labeled cases",
        "format": "Structured data + documents",
        "labels": "Binary classification + severity"
    },
    "financial_anomalies": {
        "source": "Government spending data",
        "size": "1M+ transactions",
        "format": "Tabular data",
        "labels": "Statistical outliers + domain expert review"
    }
}
```
### Data Preprocessing Pipeline

```python
# Preprocessing pipeline
from transformers import AutoTokenizer

class DataPreprocessor:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
        # Project-specific anonymizer, defined elsewhere in this repository
        self.anonymizer = GovernmentDataAnonymizer()

    def preprocess_contract(self, contract_text):
        # 1. Anonymize sensitive information
        anonymized = self.anonymizer.anonymize(contract_text)
        # 2. Clean and normalize text
        cleaned = self.clean_text(anonymized)
        # 3. Tokenize for model input
        tokens = self.tokenizer(
            cleaned,
            max_length=512,
            truncation=True,
            padding=True,
            return_tensors="pt"
        )
        return tokens
```
## 🔄 Backend Integration

### Model Serving API

```python
# FastAPI endpoints for serving models
import joblib
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load models at startup
corruption_detector = pipeline(
    "text-classification",
    model="cidadao-ai/corruption-detector-pt"
)
anomaly_detector = joblib.load("models/anomaly_detector.pkl")

@app.post("/analyze/corruption")
async def detect_corruption(contract_text: str):
    result = corruption_detector(contract_text)
    return {
        "prediction": result[0]["label"],
        "confidence": result[0]["score"],
        "model_version": "v1.0.0"
    }

@app.post("/analyze/anomaly")
async def detect_anomaly(financial_data: dict):
    # extract_features and generate_explanation are project-specific helpers
    features = extract_features(financial_data)
    anomaly_score = float(anomaly_detector.predict(features)[0])
    return {
        "anomaly_score": anomaly_score,
        "is_anomaly": anomaly_score > 0.7,
        "explanation": generate_explanation(features)
    }
```
### Agent Integration

```python
# Integration with the multi-agent system
class ModelService:
    def __init__(self):
        self.models = {
            "corruption": self.load_corruption_model(),
            "anomaly": self.load_anomaly_model(),
            "ner": self.load_ner_model()
        }

    async def analyze_for_agent(self, agent_name: str, data: dict):
        if agent_name == "InvestigatorAgent":
            return await self.detect_anomalies(data)
        elif agent_name == "CorruptionDetectorAgent":
            return await self.detect_corruption(data)
        elif agent_name == "AnalystAgent":
            return await self.extract_entities(data)
```
## 🔒 Ethics and Governance

### Responsible AI Principles

```python
# Responsible AI principles
class ResponsibleAI:
    principles = {
        "transparency": "Explainability in every decision",
        "fairness": "Bias evaluation across demographic groups",
        "privacy": "Anonymization of personal data",
        "accountability": "Auditing and traceability",
        "robustness": "Testing against adversarial attacks"
    }

    def evaluate_bias(self, model, test_data, protected_attributes):
        """Evaluate model bias across protected groups."""
        bias_metrics = {}
        for attr in protected_attributes:
            group_metrics = self.compute_group_metrics(model, test_data, attr)
            bias_metrics[attr] = group_metrics
        return bias_metrics
```
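`compute_group_metrics` is left abstract above; one concrete metric it could report is the demographic parity gap, the spread in positive-prediction rate across groups. The helper below is an illustrative sketch, not the repository's implementation.

```python
def demographic_parity_gap(y_pred, groups):
    """Largest difference in positive-prediction rate between groups;
    a gap near 0 means the model flags all groups at similar rates."""
    rates = {}
    for g in set(groups):
        preds = [p for p, grp in zip(y_pred, groups) if grp == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values()), rates
```

For corruption detection, `groups` might be the contract's region or agency type, so a large gap signals the model flags some regions disproportionately.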
### Model Interpretability

```python
# Interpretability tooling
from lime.lime_text import LimeTextExplainer
from shap import Explainer

class ModelExplainer:
    def __init__(self, model):
        self.model = model
        self.lime_explainer = LimeTextExplainer()
        self.shap_explainer = Explainer(model)

    def explain_prediction(self, text, method="lime"):
        if method == "lime":
            return self.lime_explainer.explain_instance(
                text, self.model.predict_proba
            )
        elif method == "shap":
            return self.shap_explainer([text])  # SHAP expects a batch
        raise ValueError(f"Unknown explanation method: {method}")
```
## 📋 AI Models Roadmap

### Short Term (1-2 months)
- Setup MLOps Pipeline: MLflow + DVC + CI/CD
- Corruption Detection Model: Fine-tune BERT for Portuguese
- HuggingFace Integration: Upload initial models
- Basic Monitoring: Performance tracking dashboard

### Medium Term (3-6 months)
- Anomaly Detection Ensemble: Multiple algorithms
- NER Government Entities: Custom entity recognition
- Model A/B Testing: Production experimentation
- Advanced Monitoring: Drift detection + alerting

### Long Term (6+ months)
- Custom Architecture: Domain-specific model architectures
- Federated Learning: Privacy-preserving training
- AutoML Pipeline: Automated model selection
- Edge Deployment: Local model inference
## ⚠️ Areas for Improvement

### Priority 1: Data Pipeline
- Data Collection: Automated data ingestion
- Data Quality: Validation + cleaning pipelines
- Labeling: Active learning + human-in-the-loop
- Privacy: Advanced anonymization techniques

### Priority 2: Model Development
- Custom Models: Domain-specific architectures
- Transfer Learning: Portuguese government domain
- Ensemble Methods: Model combination strategies
- Optimization: Model compression + acceleration

### Priority 3: MLOps Maturity
- CI/CD: Automated testing + deployment
- Monitoring: Comprehensive drift detection
- Experimentation: A/B testing framework
- Governance: Model audit + compliance
## 🎯 Success Metrics

### Technical Metrics
- Model Performance: F1 > 85% for all models
- Inference Latency: <200ms response time
- Deployment Success: >99% uptime
- Data Pipeline: <1% data quality issues

### Business Metrics
- Anomalies Detected: 100+ monthly
- False Positive Rate: <5%
- User Satisfaction: >80% positive feedback
- Investigation Success: >70% actionable insights
## 🔧 Development Commands

### Model Training

```bash
# Train the corruption detection model
python train_corruption_detector.py --config configs/corruption_bert.yaml

# Evaluate model performance
python evaluate_model.py --model corruption_detector --test_data data/test.json

# Upload to the HuggingFace Hub
python upload_to_hub.py --model_path models/corruption_detector --repo_name cidadao-ai/corruption-detector-pt
```

### Monitoring

```bash
# Check for model drift
python monitor_drift.py --model corruption_detector --window 7d

# Generate a performance report
python generate_report.py --models all --period monthly
```
## 📝 Technical Considerations

### Compute Requirements
- Training: GPU-enabled instances (V100/A100)
- Inference: CPU instances sufficient for most models
- Storage: Cloud storage for datasets + model artifacts
- Monitoring: Real-time metrics collection

### Security
- Model Protection: Encrypted model artifacts
- API Security: Authentication + rate limiting
- Data Privacy: LGPD compliance + anonymization
- Audit Trail: Complete lineage tracking

### Scalability
- Horizontal Scaling: Load balancer + multiple instances
- Model Versioning: Backward compatibility
- Cache Strategy: Redis for frequent predictions
- Batch Processing: Async inference for large datasets
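The Redis cache strategy above can be sketched as a key-hashing wrapper around the predict call. `cache_key` and `cached_predict` are hypothetical helpers; any client exposing `get`/`setex` (such as `redis.Redis`) works.

```python
import hashlib
import json

def cache_key(payload: dict) -> str:
    """Stable cache key for a prediction request (key-order independent)."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return "pred:" + hashlib.sha256(blob).hexdigest()

def cached_predict(client, payload, predict_fn, ttl=3600):
    """Return a cached prediction when available; otherwise run the model
    and store the JSON-serialized result with a TTL."""
    key = cache_key(payload)
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)
    result = predict_fn(payload)
    client.setex(key, ttl, json.dumps(result))
    return result
```

The TTL bounds staleness after a model update; a deployment hook could also delete the `pred:*` keyspace when a new model version goes live.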
Models Status: pipeline under development, with solid infrastructure for specialized models. Next Update: implementation of the first corruption detection model and the complete MLOps pipeline.