# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**SummerizerApp** is a FastAPI-based text summarization REST API service deployed on Hugging Face Spaces. Despite the directory name, this is NOT an Android app - it's a cloud-based backend service providing multiple summarization engines through versioned API endpoints.

## Development Commands

### Testing

```bash
# Run all tests with coverage (90% minimum required)
pytest

# Run specific test categories
pytest -m unit           # Unit tests only
pytest -m integration    # Integration tests only
pytest -m "not slow"     # Skip slow tests
pytest -m ollama         # Tests requiring Ollama service

# Run with coverage report
pytest --cov=app --cov-report=html:htmlcov
```

### Code Quality

```bash
# Lint code (with auto-fix)
ruff check --fix app/

# Format code
ruff format app/

# Run both linting and formatting
ruff check --fix app/ && ruff format app/
```

### Running Locally

```bash
# Install dependencies
pip install -r requirements.txt

# Run development server (with auto-reload)
uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload

# Run production server
uvicorn app.main:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
# Build and run with docker-compose (full stack with Ollama)
docker-compose up --build

# Build HF Spaces optimized image (V2 only)
docker build -f Dockerfile -t summarizer-app .
docker run -p 7860:7860 summarizer-app

# Development stack
docker-compose -f docker-compose.dev.yml up
```

## Architecture

### Multi-Version API System

The application runs **three independent API versions simultaneously**:

**V1 API** (`/api/v1/*`): Ollama + Transformers Pipeline
- `/api/v1/summarize` - Non-streaming Ollama summarization
- `/api/v1/summarize/stream` - Streaming Ollama summarization
- `/api/v1/summarize/pipeline/stream` - Streaming Transformers summarization
- Dependencies: External Ollama service + local transformers model
- Use case: Local/on-premises deployment with custom models

**V2 API** (`/api/v2/*`): HuggingFace Streaming (Primary for HF Spaces)
- `/api/v2/summarize/stream` - Streaming HF summarization with advanced features
- Dependencies: Local transformers model only
- Features: Adaptive token calculation, recursive summarization for long texts
- Use case: Cloud deployment on resource-constrained platforms

**V3 API** (`/api/v3/*`): Web Scraping + Summarization
- `/api/v3/scrape-and-summarize/stream` - Scrape article from URL and stream summarization
- Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
- Features: Backend web scraping, caching, user-agent rotation, metadata extraction
- Use case: End-to-end article summarization from URL (Android app primary use case)
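A minimal, hedged client sketch for consuming the V2 streaming endpoint with `httpx`. The request field names (`text`, `max_tokens`) are assumptions inferred from the validation constraints later in this document; the SSE framing matches the format described in the Request Flow section.

```python
# Hypothetical client sketch: stream a summary from the V2 endpoint.
import json

import httpx


def stream_summary(base_url: str, text: str, max_tokens: int = 128) -> str:
    """Consume the V2 SSE stream and return the concatenated summary."""
    summary_parts: list[str] = []
    with httpx.Client(timeout=90.0) as client:
        with client.stream(
            "POST",
            f"{base_url}/api/v2/summarize/stream",
            json={"text": text, "max_tokens": max_tokens},
        ) as response:
            response.raise_for_status()
            for line in response.iter_lines():
                if not line.startswith("data: "):
                    continue  # skip blank lines / keep-alives
                chunk = json.loads(line[len("data: "):])
                if chunk.get("done"):
                    break
                summary_parts.append(chunk.get("content", ""))
    return "".join(summary_parts)


if __name__ == "__main__":
    print(stream_summary("http://localhost:7860", "Long article text..."))
```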
### Service Layer Components

**OllamaService** (`app/services/summarizer.py` - 277 lines)
- Communicates with external Ollama inference engine via HTTP
- Normalizes URLs (handles `0.0.0.0` bind addresses)
- Dynamic timeout calculation based on text length
- Streaming support with JSON line parsing

**TransformersService** (`app/services/transformers_summarizer.py` - 158 lines)
- Uses local transformer pipeline (distilbart-cnn-6-6 model)
- Fast inference without external dependencies
- Streaming with token chunking

**HFStreamingSummarizer** (`app/services/hf_streaming_summarizer.py` - 630 lines, most complex)
- **Adaptive Token Calculation**: Adjusts `max_new_tokens` based on input length (see the sketch after this list)
- **Recursive Summarization**: Chunks long texts (>1500 chars) and creates summaries of summaries
- **Device Auto-detection**: Handles GPU (bfloat16/float16) vs CPU (float32)
- **TextIteratorStreamer**: Real-time token streaming via threading
- **Batch Dimension Validation**: Strict singleton batch enforcement to prevent OOM
- Supports T5, BART, and generic models with chat templates
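A hedged sketch of how `TextIteratorStreamer`-based streaming with an adaptive `max_new_tokens` budget can be wired up. The length heuristic (a fraction of the input token count, clamped to a range) and the helper names are illustrative assumptions, not the service's actual implementation.

```python
# Illustrative sketch only; not the actual HFStreamingSummarizer code.
from threading import Thread

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "sshleifer/distilbart-cnn-6-6"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)


def adaptive_max_new_tokens(input_token_count: int, floor: int = 32, ceiling: int = 128) -> int:
    """Assumed heuristic: scale the summary budget with input length, clamped."""
    return max(floor, min(ceiling, input_token_count // 4))


def stream_summary(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Singleton batch only: the service enforces batch size 1 to avoid OOM.
    assert inputs["input_ids"].shape[0] == 1

    streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=adaptive_max_new_tokens(inputs["input_ids"].shape[1]),
    )
    # generate() blocks, so it runs in a thread while we drain the streamer.
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    for token_text in streamer:
        yield token_text
```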
**ArticleScraperService** (`app/services/article_scraper.py`)
- Uses trafilatura for high-quality article extraction (F1 score: 0.958)
- User-agent rotation to avoid anti-scraping measures
- Content quality validation (minimum length, sentence structure)
- Metadata extraction (title, author, date, site_name)
- Async HTTP requests with configurable timeouts
- In-memory caching with TTL for performance
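A rough sketch of the extraction step, assuming `httpx` for the async fetch and trafilatura's `extract`/`extract_metadata` helpers; the real service adds caching, user-agent rotation, and richer quality checks.

```python
# Simplified extraction sketch; not the actual ArticleScraperService.
import httpx
import trafilatura


async def scrape_article(url: str, timeout: float = 10.0) -> dict:
    """Fetch a page and extract readable text plus basic metadata."""
    async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()

    html = response.text
    text = trafilatura.extract(html) or ""
    if len(text) < 100:  # mirrors the minimum-content validation described in this document
        raise ValueError("Extraction produced too little content")

    metadata = trafilatura.extract_metadata(html)
    return {
        "text": text,
        "title": getattr(metadata, "title", None),
        "author": getattr(metadata, "author", None),
    }
```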
### Request Flow

```
HTTP Request
    ↓
Middleware (app/core/middleware.py)
    - Request ID generation/tracking
    - Request/response timing
    - CORS headers
    ↓
Route Handler (app/api/v1 or app/api/v2)
    - Pydantic schema validation
    ↓
Service Layer (OllamaService, TransformersService, or HFStreamingSummarizer)
    - Text processing and summarization
    ↓
Streaming Response (Server-Sent Events format)
    - Token chunks: {"content": "token", "done": false, "tokens_used": N}
    - Final chunk: {"content": "", "done": true, "latency_ms": float}
```
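A minimal sketch of a route handler producing this SSE framing with FastAPI's `StreamingResponse`. The schema, route path, and the `generate_tokens` placeholder are assumptions standing in for the repository's actual code.

```python
# Hypothetical route sketch showing the SSE framing described above.
import json
import time

from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

router = APIRouter()


class SummarizeRequest(BaseModel):  # placeholder schema
    text: str
    max_tokens: int = 128


async def generate_tokens(text: str, max_tokens: int):
    """Placeholder for the service layer; yields words as pseudo-tokens."""
    for word in text.split()[:max_tokens]:
        yield word + " "


@router.post("/summarize/stream")
async def summarize_stream(request: SummarizeRequest) -> StreamingResponse:
    async def event_stream():
        start = time.perf_counter()
        tokens_used = 0
        async for token in generate_tokens(request.text, request.max_tokens):
            tokens_used += 1
            payload = {"content": token, "done": False, "tokens_used": tokens_used}
            yield f"data: {json.dumps(payload)}\n\n"
        latency_ms = round((time.perf_counter() - start) * 1000, 1)
        yield f"data: {json.dumps({'content': '', 'done': True, 'latency_ms': latency_ms})}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```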
### Configuration Management

Settings are managed via `app/core/config.py` using Pydantic BaseSettings. Key environment variables:

**V1 Configuration (Ollama)**:
- `OLLAMA_HOST` - Ollama service host (default: `http://localhost:11434`)
- `OLLAMA_MODEL` - Model to use (default: `llama3.2:1b`)
- `ENABLE_V1_WARMUP` - Enable V1 warmup (default: `false`)

**V2 Configuration (HuggingFace)**:
- `HF_MODEL_ID` - Model ID (default: `sshleifer/distilbart-cnn-6-6`)
- `HF_DEVICE_MAP` - Device mapping (default: `auto`)
- `HF_TORCH_DTYPE` - Torch dtype (default: `auto`)
- `HF_MAX_NEW_TOKENS` - Max new tokens (default: `128`)
- `ENABLE_V2_WARMUP` - Enable V2 warmup (default: `true`)

**V3 Configuration (Web Scraping)**:
- `ENABLE_V3_SCRAPING` - Enable V3 API (default: `true`)
- `SCRAPING_TIMEOUT` - HTTP timeout for scraping (default: `10` seconds)
- `SCRAPING_MAX_TEXT_LENGTH` - Max text to extract (default: `50000` chars)
- `SCRAPING_CACHE_ENABLED` - Enable caching (default: `true`)
- `SCRAPING_CACHE_TTL` - Cache TTL (default: `3600` seconds / 1 hour)
- `SCRAPING_UA_ROTATION` - Enable user-agent rotation (default: `true`)
- `SCRAPING_RATE_LIMIT_PER_MINUTE` - Rate limit per IP (default: `10`)

**Server Configuration**:
- `SERVER_HOST`, `SERVER_PORT`, `LOG_LEVEL`, `LOG_FORMAT`
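A hedged sketch of how these variables could map onto a settings class, assuming the `pydantic-settings` package (older Pydantic v1 imports `BaseSettings` from `pydantic` directly); the field names are inferred from the environment variables above and the actual `app/core/config.py` may organize them differently.

```python
# Illustrative settings sketch, not the repository's actual config module.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # V1 (Ollama)
    ollama_host: str = "http://localhost:11434"
    ollama_model: str = "llama3.2:1b"
    enable_v1_warmup: bool = False

    # V2 (HuggingFace)
    hf_model_id: str = "sshleifer/distilbart-cnn-6-6"
    hf_max_new_tokens: int = 128
    enable_v2_warmup: bool = True

    # V3 (Web scraping)
    enable_v3_scraping: bool = True
    scraping_timeout: int = 10
    scraping_cache_ttl: int = 3600


settings = Settings()  # reads values from the environment by default
```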
### Core Infrastructure

**Logging** (`app/core/logging.py`) - **Powered by Loguru**
- **Structured Logging**: Automatic JSON serialization for production, colored text for development
- **Environment-Aware**: Auto-detects HuggingFace Spaces (JSON logs) vs local development (colored logs)
- **Request ID Context**: Automatic propagation via `contextvars` (no manual passing required)
- **Backward Compatible**: `get_logger()` and `RequestLogger` class maintain existing API
- **Configuration**:
  - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
  - `LOG_FORMAT`: `json`, `text`, or `auto` (default: auto-detect based on environment)
- **Features**:
  - Lazy evaluation for performance (`logger.opt(lazy=True)`)
  - Exception tracing with full stack traces
  - Automatic request ID binding without manual propagation
  - Structured fields (request_id, status_code, duration_ms, etc.)
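A small sketch of the `contextvars`-based request ID pattern with Loguru. This is a common approach; the variable and helper names are illustrative, not necessarily those used in `app/core/logging.py`.

```python
# Illustrative pattern: bind a per-request ID into every Loguru record via
# contextvars, so handlers never pass it manually. Names are assumptions.
import contextvars
import sys

from loguru import logger

request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")


def _add_request_id(record):
    # Loguru calls this patcher for every record before handlers see it.
    record["extra"]["request_id"] = request_id_var.get()


logger.remove()
logger.configure(patcher=_add_request_id)
logger.add(sys.stderr, format="{time} | {level} | {extra[request_id]} | {message}")

# Middleware would call request_id_var.set(new_id) at the start of each request;
# every log line emitted while handling that request then carries the ID.
request_id_var.set("abc123")
logger.info("processing summarization request")
```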
**Middleware** (`app/core/middleware.py`)
- Request context middleware for tracking
- Automatic request ID generation/extraction from headers
- Context variable injection for automatic logging propagation
- CORS middleware for cross-origin requests

**Error Handling** (`app/core/errors.py`)
- Custom exception handlers
- Structured error responses with request IDs

## Coding Conventions (from .cursor/rules)

### Key Principles
- Use functional, declarative programming; avoid classes where possible
- Use descriptive variable names with auxiliary verbs (e.g., `is_active`, `has_permission`)
- Use lowercase with underscores for directories and files (e.g., `routers/user_routes.py`)

### Python/FastAPI Specific
- Use `def` for pure functions and `async def` for asynchronous operations
- Use type hints for all function signatures
- Prefer Pydantic models over raw dictionaries for input validation
- File structure: exported router, sub-routes, utilities, static content, types (models, schemas)

### Error Handling Pattern
- Handle errors and edge cases at the beginning of functions
- Use early returns for error conditions to avoid deeply nested if statements
- Place the happy path last in the function for improved readability
- Avoid unnecessary else statements; use the if-return pattern instead
- Use guard clauses to handle preconditions and invalid states early
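A brief example of the guard-clause style described above, using the input limits from this document (the function itself is illustrative, not repository code):

```python
# Guard clauses first, happy path last (illustrative example).
from fastapi import HTTPException


def validate_summarize_input(text: str, max_tokens: int) -> str:
    if not text or not text.strip():
        raise HTTPException(status_code=422, detail="Text must not be empty")
    if len(text) > 32_000:
        raise HTTPException(status_code=422, detail="Text exceeds 32,000 characters")
    if not 1 <= max_tokens <= 2_048:
        raise HTTPException(status_code=422, detail="max_tokens must be between 1 and 2048")

    # Happy path: input is valid, return the normalized text.
    return text.strip()
```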
### FastAPI Guidelines
- Use functional components and Pydantic models for validation
- Use `def` for synchronous, `async def` for asynchronous operations
- Prefer lifespan context managers over `@app.on_event("startup")` (see the sketch after this list)
- Use middleware for logging, error monitoring, and performance optimization
- Use HTTPException for expected errors
- Optimize with async functions for I/O-bound tasks
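A minimal lifespan sketch illustrating the preferred startup/shutdown pattern; the warmup flag is a hypothetical stand-in for the services' actual warmup hooks.

```python
# Lifespan context manager instead of @app.on_event("startup") (illustrative).
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: e.g. warm up the V2 model when ENABLE_V2_WARMUP is set.
    app.state.model_ready = True  # hypothetical flag; real code would load the model here
    yield
    # Shutdown: release resources (close clients, free the model, etc.).
    app.state.model_ready = False


app = FastAPI(lifespan=lifespan)
```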
## Deployment Context

**Primary Deployment**: Hugging Face Spaces (Docker SDK)
- Port 7860 required
- V2-only deployment for resource efficiency
- Model cache: `/tmp/huggingface`
- Environment variable: `HF_SPACE_ROOT_PATH` for proxy awareness

**Alternative Deployments**: Railway, Google Cloud Run, AWS ECS
- Docker Compose support for full stack (Ollama + API)
- Persistent volumes for model caching

## Performance Characteristics

**V1 (Ollama + Transformers)**:
- Memory: ~2-4GB RAM when warmup enabled
- Inference: ~2-5 seconds per request
- Startup: ~30-60 seconds when warmup enabled

**V2 (HuggingFace Streaming)**:
- Memory: ~500MB RAM when warmup enabled
- Inference: Real-time token streaming
- Startup: ~30-60 seconds (includes model download when warmup enabled)
- Model size: ~300MB download (distilbart-cnn-6-6)

**V3 (Web Scraping + Summarization)**:
- Memory: ~550MB RAM (V2 + scraping dependencies: +10-50MB)
- Scraping: 200-500ms typical, <10ms on cache hit
- Total latency: 2-5s (scrape + summarize)
- Success rate: 95%+ article extraction
- Docker image: +5-10MB for trafilatura dependencies

**Optimization Strategy**:
- V1 warmup disabled by default to save memory
- V2 warmup enabled by default for first-request performance
- Adaptive timeouts scale with text length: base 60s + 3s per 1000 chars, capped at 90s (see the sketch after this list)
- Text truncation at 4000 chars for efficiency
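The adaptive timeout rule stated above maps directly onto a small helper; this sketch encodes the documented formula, though the actual function name and location are not specified here.

```python
# Sketch of the adaptive timeout rule: 60s base + 3s per 1000 chars, capped at 90s.
def calculate_timeout(text: str, base: float = 60.0, per_kchar: float = 3.0, cap: float = 90.0) -> float:
    return min(cap, base + per_kchar * (len(text) / 1000))


# Example: 5,000 characters -> 60 + 3 * 5 = 75 seconds; 20,000 characters -> capped at 90 seconds.
```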
## Important Implementation Notes

### Streaming Response Format

All streaming endpoints use Server-Sent Events (SSE) format:

```
data: {"content": "token text", "done": false, "tokens_used": 10}
data: {"content": "more tokens", "done": false, "tokens_used": 20}
data: {"content": "", "done": true, "latency_ms": 1234.5}
```

### HF Streaming Improvements (Recent Changes)

The V2 API includes several critical improvements documented in `FAILED_TO_LEARN.MD`:
- Adaptive `max_new_tokens` calculation based on input length
- Recursive summarization for texts >1500 chars
- Batch dimension enforcement (singleton batches only)
- Better length parameter tuning for distilbart model

### Request Tracking

Every request gets a unique request ID (UUID or from `X-Request-ID` header) for:
- Request/response correlation
- Error tracking
- Performance monitoring
- Logging and debugging

### Input Validation Constraints

**V1/V2 (Text Input)**:
- Max text length: 32,000 characters
- Max tokens: 1-2,048 tokens
- Temperature: 0.0-2.0
- Top-p: 0.0-1.0

**V3 (URL Input)**:
- URL format: http/https schemes only
- URL length: <2000 characters
- SSRF protection: Blocks localhost and private IP ranges
- Max extracted text: 50,000 characters
- Minimum content: 100 characters for valid extraction
- Rate limiting: 10 requests/minute per IP (configurable)
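These constraints translate naturally into Pydantic field validation. The following is a hedged sketch: class names, field names, and default values are assumptions, not the repository's actual schemas.

```python
# Illustrative request schemas mirroring the constraints above (names assumed).
from pydantic import BaseModel, Field, HttpUrl


class SummarizeRequest(BaseModel):  # V1/V2 text input
    text: str = Field(..., min_length=1, max_length=32_000)
    max_tokens: int = Field(128, ge=1, le=2_048)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(1.0, ge=0.0, le=1.0)


class ScrapeAndSummarizeRequest(BaseModel):  # V3 URL input
    url: HttpUrl = Field(..., description="http/https URL, under 2000 characters")
    max_tokens: int = Field(256, ge=1, le=2_048)
```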
## Testing Requirements

- **Coverage requirement**: 90% minimum (enforced by pytest.ini)
- **Coverage reports**: Terminal output + HTML in `htmlcov/`
- **Test markers**: `unit`, `integration`, `slow`, `ollama`
- **Async mode**: Auto-enabled for async tests

When adding new features:
1. Write tests BEFORE implementation where possible
2. Ensure 90% coverage is maintained
3. Use appropriate markers for test categorization
4. Mock external dependencies (Ollama service, model downloads)
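A short, hedged test sketch showing the marker and mocking conventions. The endpoint path comes from this document; the patch target (`OllamaService.summarize`) and response shape are hypothetical and should be pointed at the actual service method.

```python
# Illustrative unit test: marker-based categorization + mocked service layer.
from unittest.mock import AsyncMock, patch

import pytest
from fastapi.testclient import TestClient

from app.main import app


@pytest.mark.unit
def test_v1_summarize_uses_mocked_ollama():
    # Patch the external Ollama call so the test never needs the real service.
    # The patch target is hypothetical; use the actual service method name.
    with patch(
        "app.services.summarizer.OllamaService.summarize",
        new=AsyncMock(return_value="a short summary"),
    ):
        client = TestClient(app)
        response = client.post("/api/v1/summarize", json={"text": "Some long article text."})
        assert response.status_code == 200
```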
## V3 Web Scraping API Details

### Architecture

V3 adds backend web scraping so the Android app can send a URL and receive a streamed summary without client-side scraping overhead.

### Key Components
- **ArticleScraperService**: Handles HTTP requests, trafilatura extraction, user-agent rotation
- **SimpleCache**: In-memory TTL-based cache (1 hour default) for scraped content
- **V3 Router**: `/api/v3/scrape-and-summarize/stream` endpoint
- **SSRF Protection**: Validates URLs to prevent internal network access

### Request Flow (V3)

```
1. POST /api/v3/scrape-and-summarize/stream {"url": "...", "max_tokens": 256}
2. Check cache for URL (cache hit = <10ms, cache miss = fetch)
3. Scrape article with trafilatura (200-500ms typical)
4. Validate content quality (>100 chars, sentence structure)
5. Cache scraped content for 1 hour
6. Stream summarization using V2 HF service
7. Return SSE stream: metadata event → content chunks → done event
```

### SSE Response Format (V3)

```json
// Event 1: Metadata
data: {"type":"metadata","data":{"title":"...","author":"...","scrape_latency_ms":450.2}}

// Event 2-N: Content chunks (same as V2)
data: {"content":"The","done":false,"tokens_used":1}

// Event N+1: Done
data: {"content":"","done":true,"latency_ms":2340.5}
```

### Benefits Over Client-Side Scraping
- 3-5x faster (2-5s vs 5-15s on mobile)
- No battery drain on device
- Reduced mobile data usage (summary only, not full page)
- 95%+ success rate vs 60-70% on mobile
- Shared caching across all users
- Instant server updates without app deployment

### Security Considerations
- SSRF protection blocks localhost, 127.0.0.1, and private IP ranges (10.x, 192.168.x, 172.x) (see the sketch after this list)
- Per-IP rate limiting (10 req/min default)
- Per-domain rate limiting (10 req/min per domain)
- Content length limits (50,000 chars max)
- Timeout protection (10s default)
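A hedged sketch of an SSRF guard consistent with the rules above, using the standard library's `ipaddress` module; the actual implementation in the repository may differ.

```python
# Illustrative SSRF guard: reject non-http(s) schemes, localhost, and private IPs.
import ipaddress
import socket
from urllib.parse import urlparse


def is_url_allowed(url: str) -> bool:
    if len(url) >= 2000:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return False
    try:
        # Resolve the host and reject loopback/private/link-local targets.
        resolved = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for _family, _type, _proto, _canonname, sockaddr in resolved:
        address = ipaddress.ip_address(sockaddr[0])
        if address.is_loopback or address.is_private or address.is_link_local:
            return False
    return True
```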
### Resource Impact
- Memory: +10-50MB over V2 (~550MB total)
- Docker image: +5-10MB for trafilatura/lxml
- CPU: Negligible (trafilatura is efficient)
- Compatible with HuggingFace Spaces free tier (<600MB)