# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
**SummarizerApp** is a FastAPI-based text summarization REST API service deployed on Hugging Face Spaces. Despite the directory name, this is NOT an Android app - it's a cloud-based backend service providing multiple summarization engines through versioned API endpoints.
## Development Commands
### Testing
```bash
# Run all tests with coverage (90% minimum required)
pytest
# Run specific test categories
pytest -m unit # Unit tests only
pytest -m integration # Integration tests only
pytest -m "not slow" # Skip slow tests
pytest -m ollama # Tests requiring Ollama service
# Run with coverage report
pytest --cov=app --cov-report=html:htmlcov
```
### Code Quality
```bash
# Lint code (with auto-fix)
ruff check --fix app/
# Format code
ruff format app/
# Run both linting and formatting
ruff check --fix app/ && ruff format app/
```
### Running Locally
```bash
# Install dependencies
pip install -r requirements.txt
# Run development server (with auto-reload)
uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
# Run production server
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
### Docker
```bash
# Build and run with docker-compose (full stack with Ollama)
docker-compose up --build
# Build HF Spaces optimized image (V2 only)
docker build -f Dockerfile -t summarizer-app .
docker run -p 7860:7860 summarizer-app
# Development stack
docker-compose -f docker-compose.dev.yml up
```
## Architecture
### Multi-Version API System
The application runs **three independent API versions simultaneously**:
**V1 API** (`/api/v1/*`): Ollama + Transformers Pipeline
- `/api/v1/summarize` - Non-streaming Ollama summarization
- `/api/v1/summarize/stream` - Streaming Ollama summarization
- `/api/v1/summarize/pipeline/stream` - Streaming Transformers summarization
- Dependencies: External Ollama service + local transformers model
- Use case: Local/on-premises deployment with custom models
**V2 API** (`/api/v2/*`): HuggingFace Streaming (Primary for HF Spaces)
- `/api/v2/summarize/stream` - Streaming HF summarization with advanced features
- Dependencies: Local transformers model only
- Features: Adaptive token calculation, recursive summarization for long texts
- Use case: Cloud deployment on resource-constrained platforms
**V3 API** (`/api/v3/*`): Web Scraping + Summarization
- `/api/v3/scrape-and-summarize/stream` - Scrape article from URL and stream summarization
- Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
- Features: Backend web scraping, caching, user-agent rotation, metadata extraction
- Use case: End-to-end article summarization from URL (Android app primary use case)
### Service Layer Components
**OllamaService** (`app/services/summarizer.py` - 277 lines)
- Communicates with external Ollama inference engine via HTTP
- Normalizes URLs (handles `0.0.0.0` bind addresses)
- Dynamic timeout calculation based on text length
- Streaming support with JSON line parsing
**TransformersService** (`app/services/transformers_summarizer.py` - 158 lines)
- Uses local transformer pipeline (distilbart-cnn-6-6 model)
- Fast inference without external dependencies
- Streaming with token chunking
**HFStreamingSummarizer** (`app/services/hf_streaming_summarizer.py` - 630 lines, most complex)
- **Adaptive Token Calculation**: Adjusts `max_new_tokens` based on input length (see the sketch after this list)
- **Recursive Summarization**: Chunks long texts (>1500 chars) and creates summaries of summaries
- **Device Auto-detection**: Handles GPU (bfloat16/float16) vs CPU (float32)
- **TextIteratorStreamer**: Real-time token streaming via threading
- **Batch Dimension Validation**: Strict singleton batch enforcement to prevent OOM
- Supports T5, BART, and generic models with chat templates
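A minimal sketch of the adaptive-token idea, assuming a chars-per-token heuristic; the function name, thresholds, and recursion cutoffs here are illustrative, not copied from `app/services/hf_streaming_summarizer.py`:
```python
# Hypothetical illustration only; names and constants are not the service's own.
def compute_max_new_tokens(input_chars: int, floor: int = 32, ceiling: int = 256) -> int:
    """Scale the summary token budget with input length, clamped to a sane range."""
    estimate = input_chars // 12  # rough chars-per-summary-token heuristic
    return max(floor, min(estimate, ceiling))
```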
**ArticleScraperService** (`app/services/article_scraper.py`)
- Uses trafilatura for high-quality article extraction (F1 score: 0.958)
- User-agent rotation to avoid anti-scraping measures
- Content quality validation (minimum length, sentence structure)
- Metadata extraction (title, author, date, site_name)
- Async HTTP requests with configurable timeouts
- In-memory caching with TTL for performance
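As an illustration of that scraping flow (fetch, extract, quality-check), here is a hedged sketch using `httpx` and `trafilatura`; the function name and return shape are assumptions, not the service's real interface:
```python
import httpx
import trafilatura

async def scrape_article(url: str, timeout: float = 10.0) -> dict | None:
    # Fetch the page asynchronously with a configurable timeout
    async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()
    # Extract the main article text, dropping comment sections
    text = trafilatura.extract(response.text, include_comments=False)
    if not text or len(text) < 100:  # content quality guard (minimum length)
        return None
    metadata = trafilatura.extract_metadata(response.text)
    return {"text": text, "title": metadata.title if metadata else None}
```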
### Request Flow
```
HTTP Request
↓
Middleware (app/core/middleware.py)
- Request ID generation/tracking
- Request/response timing
- CORS headers
↓
Route Handler (app/api/v1 or app/api/v2; V3 has its own flow, described below)
- Pydantic schema validation
↓
Service Layer (OllamaService, TransformersService, or HFStreamingSummarizer)
- Text processing and summarization
↓
Streaming Response (Server-Sent Events format)
- Token chunks: {"content": "token", "done": false, "tokens_used": N}
- Final chunk: {"content": "", "done": true, "latency_ms": float}
```
### Configuration Management
Settings are managed via `app/core/config.py` using Pydantic BaseSettings. Key environment variables:
**V1 Configuration (Ollama)**:
- `OLLAMA_HOST` - Ollama service host (default: `http://localhost:11434`)
- `OLLAMA_MODEL` - Model to use (default: `llama3.2:1b`)
- `ENABLE_V1_WARMUP` - Enable V1 warmup (default: `false`)
**V2 Configuration (HuggingFace)**:
- `HF_MODEL_ID` - Model ID (default: `sshleifer/distilbart-cnn-6-6`)
- `HF_DEVICE_MAP` - Device mapping (default: `auto`)
- `HF_TORCH_DTYPE` - Torch dtype (default: `auto`)
- `HF_MAX_NEW_TOKENS` - Max new tokens (default: `128`)
- `ENABLE_V2_WARMUP` - Enable V2 warmup (default: `true`)
**V3 Configuration (Web Scraping)**:
- `ENABLE_V3_SCRAPING` - Enable V3 API (default: `true`)
- `SCRAPING_TIMEOUT` - HTTP timeout for scraping (default: `10` seconds)
- `SCRAPING_MAX_TEXT_LENGTH` - Max text to extract (default: `50000` chars)
- `SCRAPING_CACHE_ENABLED` - Enable caching (default: `true`)
- `SCRAPING_CACHE_TTL` - Cache TTL (default: `3600` seconds / 1 hour)
- `SCRAPING_UA_ROTATION` - Enable user-agent rotation (default: `true`)
- `SCRAPING_RATE_LIMIT_PER_MINUTE` - Rate limit per IP (default: `10`)
**Server Configuration**:
- `SERVER_HOST`, `SERVER_PORT`, `LOG_LEVEL`, `LOG_FORMAT`
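An abridged sketch of the settings pattern with a few representative fields; the full set lives in `app/core/config.py`. Note that on Pydantic v2 the import comes from the separate `pydantic-settings` package:
```python
from pydantic_settings import BaseSettings  # `from pydantic import BaseSettings` on v1

class Settings(BaseSettings):
    # Each field is overridden by a matching environment variable at startup
    ollama_host: str = "http://localhost:11434"
    hf_model_id: str = "sshleifer/distilbart-cnn-6-6"
    scraping_cache_ttl: int = 3600  # seconds
    log_level: str = "INFO"

settings = Settings()
```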
### Core Infrastructure
**Logging** (`app/core/logging.py`) - **Powered by Loguru**
- **Structured Logging**: Automatic JSON serialization for production, colored text for development
- **Environment-Aware**: Auto-detects HuggingFace Spaces (JSON logs) vs local development (colored logs)
- **Request ID Context**: Automatic propagation via `contextvars` (no manual passing required)
- **Backward Compatible**: `get_logger()` and `RequestLogger` class maintain existing API
- **Configuration**:
- `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
- `LOG_FORMAT`: `json`, `text`, or `auto` (default: auto-detect based on environment)
- **Features**:
- Lazy evaluation for performance (`logger.opt(lazy=True)`)
- Exception tracing with full stack traces
- Automatic request ID binding without manual propagation
- Structured fields (request_id, status_code, duration_ms, etc.)
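A simplified sketch of the request-ID binding pattern; `app/core/logging.py` additionally wires environment-aware sinks (JSON vs. colored text), so treat this as illustrative:
```python
from contextvars import ContextVar
from loguru import logger

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

def _attach_request_id(record) -> None:
    # Runs for every log record; pulls the ID set by the middleware
    record["extra"]["request_id"] = request_id_var.get()

logger.configure(patcher=_attach_request_id)
logger.info("summarization started")  # request_id is attached automatically
```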
**Middleware** (`app/core/middleware.py`)
- Request context middleware for tracking
- Automatic request ID generation/extraction from headers
- Context variable injection for automatic logging propagation
- CORS middleware for cross-origin requests
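A hedged sketch of how such middleware typically looks; the real implementation is in `app/core/middleware.py`, and `request_id_var` is the context variable from the logging sketch above:
```python
import uuid
from starlette.middleware.base import BaseHTTPMiddleware

class RequestContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse the caller's ID when provided, otherwise mint a fresh one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        request_id_var.set(request_id)  # propagates to Loguru via contextvars
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id  # echo for correlation
        return response
```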
**Error Handling** (`app/core/errors.py`)
- Custom exception handlers
- Structured error responses with request IDs
## Coding Conventions (from .cursor/rules)
### Key Principles
- Use functional, declarative programming; avoid classes where possible
- Use descriptive variable names with auxiliary verbs (e.g., `is_active`, `has_permission`)
- Use lowercase with underscores for directories and files (e.g., `routers/user_routes.py`)
### Python/FastAPI Specific
- Use `def` for pure functions and `async def` for asynchronous operations
- Use type hints for all function signatures
- Prefer Pydantic models over raw dictionaries for input validation
- File structure: exported router, sub-routes, utilities, static content, types (models, schemas)
### Error Handling Pattern
- Handle errors and edge cases at the beginning of functions
- Use early returns for error conditions to avoid deeply nested if statements
- Place the happy path last in the function for improved readability
- Avoid unnecessary else statements; use the if-return pattern instead
- Use guard clauses to handle preconditions and invalid states early
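For example, a handler shaped like this hypothetical one (the validation bounds mirror the constraints documented later; the happy-path body is a placeholder):
```python
from fastapi import HTTPException

async def summarize(text: str, max_tokens: int) -> str:
    # Guard clauses first: reject invalid input early instead of nesting ifs
    if not text.strip():
        raise HTTPException(status_code=422, detail="text must not be empty")
    if not 1 <= max_tokens <= 2048:
        raise HTTPException(status_code=422, detail="max_tokens out of range")
    # Happy path last; a real handler would delegate to a service here
    return text[:max_tokens]  # placeholder for the actual summarization call
```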
### FastAPI Guidelines
- Use functional components and Pydantic models for validation
- Use `def` for synchronous, `async def` for asynchronous operations
- Prefer lifespan context managers over `@app.on_event("startup")` (example after this list)
- Use middleware for logging, error monitoring, and performance optimization
- Use HTTPException for expected errors
- Optimize with async functions for I/O-bound tasks
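A minimal lifespan sketch, as recommended above; the warmup comment stands in for the project's actual startup logic:
```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: warm the model here so the first request is fast
    # (e.g., load the HF pipeline); shutdown logic goes after the yield.
    yield

app = FastAPI(lifespan=lifespan)
```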
## Deployment Context
**Primary Deployment**: Hugging Face Spaces (Docker SDK)
- Port 7860 required
- V2-only deployment for resource efficiency
- Model cache: `/tmp/huggingface`
- Environment variable: `HF_SPACE_ROOT_PATH` for proxy awareness
**Alternative Deployments**: Railway, Google Cloud Run, AWS ECS
- Docker Compose support for full stack (Ollama + API)
- Persistent volumes for model caching
## Performance Characteristics
**V1 (Ollama + Transformers)**:
- Memory: ~2-4GB RAM when warmup enabled
- Inference: ~2-5 seconds per request
- Startup: ~30-60 seconds when warmup enabled
**V2 (HuggingFace Streaming)**:
- Memory: ~500MB RAM when warmup enabled
- Inference: Real-time token streaming
- Startup: ~30-60 seconds (includes model download when warmup enabled)
- Model size: ~300MB download (distilbart-cnn-6-6)
**V3 (Web Scraping + Summarization)**:
- Memory: ~550MB RAM (V2 + scraping dependencies: +10-50MB)
- Scraping: 200-500ms typical, <10ms on cache hit
- Total latency: 2-5s (scrape + summarize)
- Success rate: 95%+ article extraction
- Docker image: +5-10MB for trafilatura dependencies
**Optimization Strategy**:
- V1 warmup disabled by default to save memory
- V2 warmup enabled by default for first-request performance
- Adaptive timeouts scale with text length: base 60s + 3s per 1000 chars, capped at 90s (see the sketch below)
- Text truncation at 4000 chars for efficiency
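The timeout rule translates directly into a one-liner (the function name is illustrative):
```python
def adaptive_timeout(text_length: int) -> float:
    # base 60s + 3s per 1000 chars, capped at 90s
    return min(60.0 + 3.0 * (text_length / 1000), 90.0)
```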
## Important Implementation Notes
### Streaming Response Format
All streaming endpoints use Server-Sent Events (SSE) format:
```
data: {"content": "token text", "done": false, "tokens_used": 10}
data: {"content": "more tokens", "done": false, "tokens_used": 20}
data: {"content": "", "done": true, "latency_ms": 1234.5}
```
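One way a client might consume this stream, assuming standard `data: `-prefixed SSE lines and using `httpx` (which the service already depends on); this is a sketch, not shipped client code:
```python
import json
import httpx

async def consume_stream(url: str, payload: dict) -> str:
    parts: list[str] = []
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=payload) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue  # skip keep-alives and blank lines
                chunk = json.loads(line[len("data: "):])
                if chunk.get("done"):
                    break
                parts.append(chunk["content"])
    return "".join(parts)
```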
### HF Streaming Improvements (Recent Changes)
The V2 API includes several critical improvements documented in `FAILED_TO_LEARN.MD`:
- Adaptive `max_new_tokens` calculation based on input length
- Recursive summarization for texts >1500 chars
- Batch dimension enforcement (singleton batches only)
- Better length parameter tuning for distilbart model
### Request Tracking
Every request gets a unique request ID (UUID or from `X-Request-ID` header) for:
- Request/response correlation
- Error tracking
- Performance monitoring
- Logging and debugging
### Input Validation Constraints
**V1/V2 (Text Input)**:
- Max text length: 32,000 characters
- Max tokens: 1-2,048 tokens
- Temperature: 0.0-2.0
- Top-p: 0.0-1.0
**V3 (URL Input)**:
- URL format: http/https schemes only
- URL length: <2000 characters
- SSRF protection: Blocks localhost and private IP ranges
- Max extracted text: 50,000 characters
- Minimum content: 100 characters for valid extraction
- Rate limiting: 10 requests/minute per IP (configurable)
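These constraints map naturally onto a Pydantic schema. A hypothetical V1/V2 request model (the real schemas live in the app's schema modules and may differ in names and defaults):
```python
from pydantic import BaseModel, Field

class SummarizeRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=32_000)
    max_tokens: int = Field(256, ge=1, le=2048)
    temperature: float = Field(1.0, ge=0.0, le=2.0)
    top_p: float = Field(1.0, ge=0.0, le=1.0)
```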
## Testing Requirements
- **Coverage requirement**: 90% minimum (enforced by pytest.ini)
- **Coverage reports**: Terminal output + HTML in `htmlcov/`
- **Test markers**: `unit`, `integration`, `slow`, `ollama`
- **Async mode**: Auto-enabled for async tests
When adding new features:
1. Write tests BEFORE implementation where possible
2. Ensure 90% coverage is maintained
3. Use appropriate markers for test categorization
4. Mock external dependencies (Ollama service, model downloads)
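An illustrative test following these conventions; async mode is auto, so no asyncio marker is needed, and the mocked service is a stand-in rather than real project code:
```python
import pytest
from unittest.mock import AsyncMock

@pytest.mark.unit
async def test_summarizer_propagates_service_errors():
    service = AsyncMock()  # mock the external dependency instead of calling Ollama
    service.summarize.side_effect = RuntimeError("ollama unavailable")
    with pytest.raises(RuntimeError):
        await service.summarize("some text")
```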
## V3 Web Scraping API Details
### Architecture
V3 adds backend web scraping capabilities so the Android app can send URLs and receive streamed summaries without client-side scraping overhead.
### Key Components
- **ArticleScraperService**: Handles HTTP requests, trafilatura extraction, user-agent rotation
- **SimpleCache**: In-memory TTL-based cache (1 hour default) for scraped content (sketched after this list)
- **V3 Router**: `/api/v3/scrape-and-summarize/stream` endpoint
- **SSRF Protection**: Validates URLs to prevent internal network access
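A minimal TTL-cache sketch in the spirit of SimpleCache (illustrative; the actual class may differ):
```python
import time

class SimpleCache:
    def __init__(self, ttl_seconds: int = 3600):
        self._ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:  # lazily evict stale entries on read
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic() + self._ttl, value)
```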
### Request Flow (V3)
```
1. POST /api/v3/scrape-and-summarize/stream {"url": "...", "max_tokens": 256}
2. Check cache for URL (cache hit = <10ms, cache miss = fetch)
3. Scrape article with trafilatura (200-500ms typical)
4. Validate content quality (>100 chars, sentence structure)
5. Cache scraped content for 1 hour
6. Stream summarization using V2 HF service
7. Return SSE stream: metadata event β†’ content chunks β†’ done event
```
### SSE Response Format (V3)
```
// Event 1: Metadata
data: {"type":"metadata","data":{"title":"...","author":"...","scrape_latency_ms":450.2}}
// Event 2-N: Content chunks (same as V2)
data: {"content":"The","done":false,"tokens_used":1}
// Event N+1: Done
data: {"content":"","done":true,"latency_ms":2340.5}
```
### Benefits Over Client-Side Scraping
- 3-5x faster (2-5s vs 5-15s on mobile)
- No battery drain on device
- Reduced mobile data usage (summary only, not full page)
- 95%+ success rate vs 60-70% on mobile
- Shared caching across all users
- Instant server updates without app deployment
### Security Considerations
- SSRF protection blocks localhost, 127.0.0.1, and private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
- Per-IP rate limiting (10 req/min default)
- Per-domain rate limiting (10 req/min per domain)
- Content length limits (50,000 chars max)
- Timeout protection (10s default)
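A hedged sketch of such an SSRF guard using only the standard library; the real validator may also resolve DNS and re-check redirect targets:
```python
import ipaddress
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return False
    try:
        address = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return True  # a hostname, not a literal IP; DNS-level checks omitted here
    return not (address.is_private or address.is_loopback or address.is_link_local)
```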
### Resource Impact
- Memory: +10-50MB over V2 (~550MB total)
- Docker image: +5-10MB for trafilatura/lxml
- CPU: Negligible (trafilatura is efficient)
- Compatible with HuggingFace Spaces free tier (<600MB)