# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**SummerizerApp** is a FastAPI-based text summarization REST API service deployed on Hugging Face Spaces. Despite the directory name, this is NOT an Android app - it's a cloud-based backend service providing multiple summarization engines through versioned API endpoints.

## Development Commands

### Testing
```bash
# Run all tests with coverage (90% minimum required)
pytest

# Run specific test categories
pytest -m unit                    # Unit tests only
pytest -m integration             # Integration tests only
pytest -m "not slow"              # Skip slow tests
pytest -m ollama                  # Tests requiring Ollama service

# Run with coverage report
pytest --cov=app --cov-report=html:htmlcov
```

### Code Quality
```bash
# Lint code (with auto-fix)
ruff check --fix app/

# Format code
ruff format app/

# Run both linting and formatting
ruff check --fix app/ && ruff format app/
```

### Running Locally
```bash
# Install dependencies
pip install -r requirements.txt

# Run development server (with auto-reload)
uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload

# Run production server
uvicorn app.main:app --host 0.0.0.0 --port 7860
```

### Docker
```bash
# Build and run with docker-compose (full stack with Ollama)
docker-compose up --build

# Build HF Spaces optimized image (V2 only)
docker build -f Dockerfile -t summarizer-app .
docker run -p 7860:7860 summarizer-app

# Development stack
docker-compose -f docker-compose.dev.yml up
```

## Architecture

### Multi-Version API System

The application runs **three independent API versions simultaneously**:

**V1 API** (`/api/v1/*`): Ollama + Transformers Pipeline
- `/api/v1/summarize` - Non-streaming Ollama summarization
- `/api/v1/summarize/stream` - Streaming Ollama summarization
- `/api/v1/summarize/pipeline/stream` - Streaming Transformers summarization
- Dependencies: External Ollama service + local transformers model
- Use case: Local/on-premises deployment with custom models

**V2 API** (`/api/v2/*`): HuggingFace Streaming (Primary for HF Spaces)
- `/api/v2/summarize/stream` - Streaming HF summarization with advanced features
- Dependencies: Local transformers model only
- Features: Adaptive token calculation, recursive summarization for long texts
- Use case: Cloud deployment on resource-constrained platforms

**V3 API** (`/api/v3/*`): Web Scraping + Summarization
- `/api/v3/scrape-and-summarize/stream` - Scrape article from URL and stream summarization
- Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
- Features: Backend web scraping, caching, user-agent rotation, metadata extraction
- Use case: End-to-end article summarization from URL (Android app primary use case)
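
For orientation, a minimal Python client call against the V2 streaming endpoint might look like the sketch below. The request-body fields (`text`, `max_tokens`) are assumptions inferred from the input-validation constraints later in this document; the authoritative contract is the Pydantic schema under `app/`.

```python
import httpx

# Hypothetical request body; check the V2 Pydantic schema for the real fields.
payload = {"text": "Long article text to summarize...", "max_tokens": 128}

with httpx.stream(
    "POST",
    "http://localhost:7860/api/v2/summarize/stream",
    json=payload,
    timeout=60.0,
) as response:
    for line in response.iter_lines():
        # SSE frames are prefixed with "data: " (see Streaming Response Format below).
        if line.startswith("data: "):
            print(line[len("data: "):])
```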

### Service Layer Components

**OllamaService** (`app/services/summarizer.py` - 277 lines)
- Communicates with external Ollama inference engine via HTTP
- Normalizes URLs (handles `0.0.0.0` bind addresses)
- Dynamic timeout calculation based on text length
- Streaming support with JSON line parsing

**TransformersService** (`app/services/transformers_summarizer.py` - 158 lines)
- Uses local transformer pipeline (distilbart-cnn-6-6 model)
- Fast inference without external dependencies
- Streaming with token chunking

**HFStreamingSummarizer** (`app/services/hf_streaming_summarizer.py` - 630 lines, the most complex service)
- **Adaptive Token Calculation**: Adjusts `max_new_tokens` based on input length (see the sketch after this list)
- **Recursive Summarization**: Chunks long texts (>1500 chars) and creates summaries of summaries
- **Device Auto-detection**: Handles GPU (bfloat16/float16) vs CPU (float32)
- **TextIteratorStreamer**: Real-time token streaming via threading
- **Batch Dimension Validation**: Strict singleton batch enforcement to prevent OOM
- Supports T5, BART, and generic models with chat templates
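
The adaptive token calculation referenced above can be sketched as follows. The ratios here are illustrative only; the real heuristic in `hf_streaming_summarizer.py` is the source of truth.

```python
def adaptive_max_new_tokens(text: str, floor: int = 32, ceiling: int = 256) -> int:
    """Scale the summary budget with input length (illustrative ratios only)."""
    # Rough token estimate: ~4 characters per token for English text.
    input_tokens = len(text) // 4
    # Target a summary around a quarter of the input, clamped to sane bounds.
    return max(floor, min(ceiling, input_tokens // 4))
```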

**ArticleScraperService** (`app/services/article_scraper.py`)
- Uses trafilatura for high-quality article extraction (F1 score: 0.958)
- User-agent rotation to avoid anti-scraping measures
- Content quality validation (minimum length, sentence structure)
- Metadata extraction (title, author, date, site_name)
- Async HTTP requests with configurable timeouts
- In-memory caching with TTL for performance

### Request Flow

```
HTTP Request
    ↓
Middleware (app/core/middleware.py)
    - Request ID generation/tracking
    - Request/response timing
    - CORS headers
    ↓
Route Handler (app/api/v1 or app/api/v2)
    - Pydantic schema validation
    ↓
Service Layer (OllamaService, TransformersService, or HFStreamingSummarizer)
    - Text processing and summarization
    ↓
Streaming Response (Server-Sent Events format)
    - Token chunks: {"content": "token", "done": false, "tokens_used": N}
    - Final chunk: {"content": "", "done": true, "latency_ms": float}
```

### Configuration Management

Settings are managed via `app/core/config.py` using Pydantic BaseSettings. Key environment variables:

**V1 Configuration (Ollama)**:
- `OLLAMA_HOST` - Ollama service host (default: `http://localhost:11434`)
- `OLLAMA_MODEL` - Model to use (default: `llama3.2:1b`)
- `ENABLE_V1_WARMUP` - Enable V1 warmup (default: `false`)

**V2 Configuration (HuggingFace)**:
- `HF_MODEL_ID` - Model ID (default: `sshleifer/distilbart-cnn-6-6`)
- `HF_DEVICE_MAP` - Device mapping (default: `auto`)
- `HF_TORCH_DTYPE` - Torch dtype (default: `auto`)
- `HF_MAX_NEW_TOKENS` - Max new tokens (default: `128`)
- `ENABLE_V2_WARMUP` - Enable V2 warmup (default: `true`)

**V3 Configuration (Web Scraping)**:
- `ENABLE_V3_SCRAPING` - Enable V3 API (default: `true`)
- `SCRAPING_TIMEOUT` - HTTP timeout for scraping (default: `10` seconds)
- `SCRAPING_MAX_TEXT_LENGTH` - Max text to extract (default: `50000` chars)
- `SCRAPING_CACHE_ENABLED` - Enable caching (default: `true`)
- `SCRAPING_CACHE_TTL` - Cache TTL (default: `3600` seconds / 1 hour)
- `SCRAPING_UA_ROTATION` - Enable user-agent rotation (default: `true`)
- `SCRAPING_RATE_LIMIT_PER_MINUTE` - Rate limit per IP (default: `10`)

**Server Configuration**:
- `SERVER_HOST`, `SERVER_PORT`, `LOG_LEVEL`, `LOG_FORMAT`
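
A condensed sketch of how these settings are typically declared (field names mirror the variables above; `app/core/config.py` is the source of truth, and the import path depends on the Pydantic version):

```python
from pydantic_settings import BaseSettings  # `from pydantic import BaseSettings` on v1


class Settings(BaseSettings):
    # Field names map to environment variables case-insensitively,
    # so OLLAMA_HOST overrides ollama_host, and so on.
    ollama_host: str = "http://localhost:11434"
    ollama_model: str = "llama3.2:1b"
    hf_model_id: str = "sshleifer/distilbart-cnn-6-6"
    hf_max_new_tokens: int = 128
    enable_v2_warmup: bool = True
    scraping_cache_ttl: int = 3600


settings = Settings()  # reads the process environment (and .env, if configured)
```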

### Core Infrastructure

**Logging** (`app/core/logging.py`) - **Powered by Loguru**
- **Structured Logging**: Automatic JSON serialization for production, colored text for development
- **Environment-Aware**: Auto-detects HuggingFace Spaces (JSON logs) vs local development (colored logs)
- **Request ID Context**: Automatic propagation via `contextvars` (no manual passing required)
- **Backward Compatible**: `get_logger()` and `RequestLogger` class maintain existing API
- **Configuration**:
  - `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
  - `LOG_FORMAT`: `json`, `text`, or `auto` (default: auto-detect based on environment)
- **Features**:
  - Lazy evaluation for performance (`logger.opt(lazy=True)`)
  - Exception tracing with full stack traces
  - Automatic request ID binding without manual propagation
  - Structured fields (request_id, status_code, duration_ms, etc.)
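
The `contextvars`-based request ID propagation can be sketched like this (a hypothetical helper; the real wiring lives in `app/core/logging.py`):

```python
from contextvars import ContextVar

from loguru import logger

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

# The patcher runs for every record, so the bound request_id appears in
# structured output without being passed through call sites.
logger.configure(
    patcher=lambda record: record["extra"].setdefault("request_id", request_id_var.get())
)

logger.info("summarization started")  # carries request_id automatically
```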

**Middleware** (`app/core/middleware.py`)
- Request context middleware for tracking
- Automatic request ID generation/extraction from headers
- Context variable injection for automatic logging propagation
- CORS middleware for cross-origin requests
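
An illustrative shape for the request-ID middleware (the production version in `app/core/middleware.py` may differ in details):

```python
import uuid
from contextvars import ContextVar

from starlette.middleware.base import BaseHTTPMiddleware

# Same context variable as in the logging sketch above.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


class RequestContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse the caller's ID when present; otherwise mint a fresh one.
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        token = request_id_var.set(request_id)
        try:
            response = await call_next(request)
        finally:
            request_id_var.reset(token)
        response.headers["X-Request-ID"] = request_id
        return response  # registered via app.add_middleware(RequestContextMiddleware)
```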

**Error Handling** (`app/core/errors.py`)
- Custom exception handlers
- Structured error responses with request IDs

## Coding Conventions (from .cursor/rules)

### Key Principles
- Use functional, declarative programming; avoid classes where possible
- Use descriptive variable names with auxiliary verbs (e.g., `is_active`, `has_permission`)
- Use lowercase with underscores for directories and files (e.g., `routers/user_routes.py`)

### Python/FastAPI Specific
- Use `def` for pure functions and `async def` for asynchronous operations
- Use type hints for all function signatures
- Prefer Pydantic models over raw dictionaries for input validation
- File structure: exported router, sub-routes, utilities, static content, types (models, schemas)

### Error Handling Pattern
- Handle errors and edge cases at the beginning of functions
- Use early returns for error conditions to avoid deeply nested if statements
- Place the happy path last in the function for improved readability
- Avoid unnecessary else statements; use the if-return pattern instead
- Use guard clauses to handle preconditions and invalid states early
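
The pattern in code (a sketch; `run_summarization` is a hypothetical stand-in for the real service call):

```python
from fastapi import HTTPException


async def run_summarization(text: str, max_tokens: int) -> str:
    ...  # stand-in for the actual service-layer call


async def summarize(text: str, max_tokens: int) -> str:
    # Guard clauses first: reject invalid states before any work happens.
    if not text.strip():
        raise HTTPException(status_code=422, detail="text must not be empty")
    if max_tokens < 1:
        raise HTTPException(status_code=422, detail="max_tokens must be positive")

    # Happy path last, with no trailing else.
    return await run_summarization(text, max_tokens)
```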

### FastAPI Guidelines
- Use functional components and Pydantic models for validation
- Use `def` for synchronous, `async def` for asynchronous operations
- Prefer lifespan context managers over `@app.on_event("startup")` (example below)
- Use middleware for logging, error monitoring, and performance optimization
- Use HTTPException for expected errors
- Optimize with async functions for I/O-bound tasks
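
The preferred lifespan pattern, as a minimal example:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup work (e.g. model warmup) goes before the yield...
    yield
    # ...and shutdown cleanup goes after it.


app = FastAPI(lifespan=lifespan)
```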

## Deployment Context

**Primary Deployment**: Hugging Face Spaces (Docker SDK)
- Port 7860 required
- V2-only deployment for resource efficiency
- Model cache: `/tmp/huggingface`
- Environment variable: `HF_SPACE_ROOT_PATH` for proxy awareness

**Alternative Deployments**: Railway, Google Cloud Run, AWS ECS
- Docker Compose support for full stack (Ollama + API)
- Persistent volumes for model caching

## Performance Characteristics

**V1 (Ollama + Transformers)**:
- Memory: ~2-4GB RAM when warmup enabled
- Inference: ~2-5 seconds per request
- Startup: ~30-60 seconds when warmup enabled

**V2 (HuggingFace Streaming)**:
- Memory: ~500MB RAM when warmup enabled
- Inference: Real-time token streaming
- Startup: ~30-60 seconds (includes model download when warmup enabled)
- Model size: ~300MB download (distilbart-cnn-6-6)

**V3 (Web Scraping + Summarization)**:
- Memory: ~550MB RAM (V2 + scraping dependencies: +10-50MB)
- Scraping: 200-500ms typical, <10ms on cache hit
- Total latency: 2-5s (scrape + summarize)
- Success rate: 95%+ article extraction
- Docker image: +5-10MB for trafilatura dependencies

**Optimization Strategy**:
- V1 warmup disabled by default to save memory
- V2 warmup enabled by default for first-request performance
- Adaptive timeouts scale with text length: base 60s + 3s per 1000 chars, capped at 90s (see the helper below)
- Text truncation at 4000 chars for efficiency
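
The adaptive timeout rule reduces to a one-liner (a sketch; the production code may round or clamp differently):

```python
def adaptive_timeout(text: str, base: float = 60.0, cap: float = 90.0) -> float:
    """Base 60s plus 3s per 1000 characters, capped at 90s."""
    return min(cap, base + 3.0 * (len(text) / 1000))
```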

## Important Implementation Notes

### Streaming Response Format
All streaming endpoints use Server-Sent Events (SSE) format:
```
data: {"content": "token text", "done": false, "tokens_used": 10}
data: {"content": "more tokens", "done": false, "tokens_used": 20}
data: {"content": "", "done": true, "latency_ms": 1234.5}
```
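
On the emitting side, each event is a `data: <json>` frame followed by a blank line. A minimal helper in that shape (illustrative; the real generators live in the service classes):

```python
import json


def sse_frame(payload: dict) -> str:
    """Render one event in the stream format shown above."""
    return f"data: {json.dumps(payload)}\n\n"


sse_frame({"content": "token text", "done": False, "tokens_used": 10})
```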

### HF Streaming Improvements (Recent Changes)
The V2 API includes several critical improvements documented in `FAILED_TO_LEARN.MD`:
- Adaptive `max_new_tokens` calculation based on input length
- Recursive summarization for texts >1500 chars
- Batch dimension enforcement (singleton batches only)
- Better length-parameter tuning for the distilbart model
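
The recursive summarization idea, sketched under the assumption that each pass shrinks the text (the real implementation splits on sentence boundaries rather than raw character offsets):

```python
def summarize_recursively(text: str, summarize_once, chunk_size: int = 1500) -> str:
    """Summary-of-summaries sketch; `summarize_once` stands in for one model call."""
    if len(text) <= chunk_size:
        return summarize_once(text)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partials = " ".join(summarize_once(chunk) for chunk in chunks)
    # Recurse on the joined partial summaries until the result fits one pass.
    return summarize_recursively(partials, summarize_once, chunk_size)
```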

### Request Tracking
Every request gets a unique request ID (UUID or from `X-Request-ID` header) for:
- Request/response correlation
- Error tracking
- Performance monitoring
- Logging and debugging

### Input Validation Constraints

**V1/V2 (Text Input)**:
- Max text length: 32,000 characters
- Max tokens: 1-2,048
- Temperature: 0.0-2.0
- Top-p: 0.0-1.0

**V3 (URL Input)**:
- URL format: http/https schemes only
- URL length: <2000 characters
- SSRF protection: Blocks localhost and private IP ranges
- Max extracted text: 50,000 characters
- Minimum content: 100 characters for valid extraction
- Rate limiting: 10 requests/minute per IP (configurable)

## Testing Requirements

- **Coverage requirement**: 90% minimum (enforced by pytest.ini)
- **Coverage reports**: Terminal output + HTML in `htmlcov/`
- **Test markers**: `unit`, `integration`, `slow`, `ollama`
- **Async mode**: Auto-enabled for async tests

When adding new features:
1. Write tests BEFORE implementation where possible
2. Ensure 90% coverage is maintained
3. Use appropriate markers for test categorization
4. Mock external dependencies (Ollama service, model downloads)

## V3 Web Scraping API Details

### Architecture
V3 adds backend web scraping so the Android app can send URLs and receive streamed summaries without client-side scraping overhead.

### Key Components
- **ArticleScraperService**: Handles HTTP requests, trafilatura extraction, user-agent rotation
- **SimpleCache**: In-memory TTL-based cache (1 hour default) for scraped content (sketched below)
- **V3 Router**: `/api/v3/scrape-and-summarize/stream` endpoint
- **SSRF Protection**: Validates URLs to prevent internal network access
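
A minimal TTL cache in the spirit of SimpleCache (illustrative; the real class may handle size limits and eviction differently):

```python
import time


class SimpleTTLCache:
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict stale entries on read
            return None
        return value

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)
```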

### Request Flow (V3)
```
1. POST /api/v3/scrape-and-summarize/stream {"url": "...", "max_tokens": 256}
2. Check cache for URL (cache hit = <10ms, cache miss = fetch)
3. Scrape article with trafilatura (200-500ms typical)
4. Validate content quality (>100 chars, sentence structure)
5. Cache scraped content for 1 hour
6. Stream summarization using V2 HF service
7. Return SSE stream: metadata event β†’ content chunks β†’ done event
```

### SSE Response Format (V3)
```json
// Event 1: Metadata
data: {"type":"metadata","data":{"title":"...","author":"...","scrape_latency_ms":450.2}}

// Event 2-N: Content chunks (same as V2)
data: {"content":"The","done":false,"tokens_used":1}

// Event N+1: Done
data: {"content":"","done":true,"latency_ms":2340.5}
```

### Benefits Over Client-Side Scraping
- 3-5x faster (2-5s vs 5-15s on mobile)
- No battery drain on device
- Reduced mobile data usage (summary only, not full page)
- 95%+ success rate vs 60-70% on mobile
- Shared caching across all users
- Instant server updates without app deployment

### Security Considerations
- SSRF protection blocks localhost, 127.0.0.1, and private IP ranges (10.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12); see the sketch after this list
- Per-IP rate limiting (10 req/min default)
- Per-domain rate limiting (10 req/min per domain)
- Content length limits (50,000 chars max)
- Timeout protection (10s default)
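
A sketch of the SSRF guard described above (illustrative; the production check may also cover IPv6, redirects, and DNS rebinding):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_url_allowed(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the hostname and reject loopback/private/link-local targets.
        ip = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
    except (socket.gaierror, ValueError):
        return False
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```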

### Resource Impact
- Memory: +10-50MB over V2 (~550MB total)
- Docker image: +5-10MB for trafilatura/lxml
- CPU: Negligible (trafilatura is efficient)
- Compatible with HuggingFace Spaces free tier (<600MB)