# V3 API Fix: Support Both URL and Text Input
## Problem Statement
The V3 endpoint `/api/v3/scrape-and-summarize/stream` currently only accepts URLs in the request body. When the Android app sends plain text instead of a URL, the request fails with **422 Unprocessable Entity** due to URL validation failure.
### Error Symptoms
```
INFO: 10.16.17.219:29372 - "POST /api/v3/scrape-and-summarize/stream HTTP/1.1" 422 Unprocessable Entity
2025-11-11 05:39:49,140 - app.core.middleware - INFO - Request lXqCov: POST /api/v3/scrape-and-summarize/stream
2025-11-11 05:39:49,143 - app.core.middleware - INFO - Response lXqCov: 422 (2.64ms)
```
**Key Indicator:** Response time < 3ms means the request is failing at **schema validation** before any scraping logic runs.
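The failure is easy to reproduce. Below is a sketch (it assumes the Android app currently puts the pasted text in the `url` field, and uses the deployment URL from the README examples):
```python
import requests

# Plain text in the "url" field fails the URL regex at schema validation,
# so the request is rejected in ~3 ms without any scraping running
response = requests.post(
    "https://your-space.hf.space/api/v3/scrape-and-summarize/stream",
    json={"url": "Pasted article text rather than a link...", "max_tokens": 256},
)
print(response.status_code)  # 422 Unprocessable Entity
```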
### Root Cause
**Current Schema** (`app/api/v3/schemas.py`):
```python
class ScrapeAndSummarizeRequest(BaseModel):
url: str = Field(..., description="URL of article to scrape and summarize")
# ... other fields
@validator('url')
def validate_url(cls, v):
# URL validation regex that rejects plain text
if not url_pattern.match(v):
raise ValueError('Invalid URL format')
return v
```
**Problem:** The `url` field is **required** and must match a URL pattern. When the Android app sends plain text (a non-URL), validation fails β†’ 422 error.
---
## Solution Overview
Make the V3 endpoint **intelligent**: it should handle both input types:
1. **URL Input** β†’ Scrape article from URL + Summarize
2. **Text Input** β†’ Skip scraping + Summarize directly
This provides a single, unified endpoint for the Android app without needing to choose between multiple endpoints.
---
## Design Approach
### Option 1: Flexible Input Field (Recommended)
**Schema Design:**
```python
class ScrapeAndSummarizeRequest(BaseModel):
url: Optional[str] = None
text: Optional[str] = None
# ... other fields (max_tokens, temperature, etc.)
@model_validator(mode='after')
def check_url_or_text(self):
"""Ensure exactly one of url or text is provided."""
if not self.url and not self.text:
raise ValueError('Either url or text must be provided')
if self.url and self.text:
raise ValueError('Provide either url OR text, not both')
return self
@field_validator('url')
def validate_url(cls, v):
"""Validate URL format if provided."""
if v is None:
return v
# URL validation logic
return v
@field_validator('text')
def validate_text(cls, v):
"""Validate text if provided."""
if v is None:
return v
if len(v) < 50:
raise ValueError('Text too short (minimum 50 characters)')
if len(v) > 50000:
raise ValueError('Text too long (maximum 50,000 characters)')
return v
```
**Request Examples:**
```json
// URL-based request (scraping enabled)
{
"url": "https://example.com/article",
"max_tokens": 256,
"temperature": 0.3
}
// Text-based request (direct summarization)
{
"text": "Your article text here...",
"max_tokens": 256,
"temperature": 0.3
}
```
**Endpoint Logic:**
```python
@router.post("/scrape-and-summarize/stream")
async def scrape_and_summarize_stream(
request: Request,
payload: ScrapeAndSummarizeRequest
):
"""Handle both URL scraping and direct text summarization."""
    request_id = getattr(request.state, 'request_id', 'unknown')

    # Determine input type
if payload.url:
# URL input β†’ Scrape + Summarize
article_data = await article_scraper_service.scrape_article(payload.url)
text_to_summarize = article_data['text']
metadata = {
'title': article_data.get('title'),
'author': article_data.get('author'),
'source': 'scraped',
'scrape_latency_ms': article_data.get('scrape_time_ms')
}
else:
# Text input β†’ Direct Summarization
text_to_summarize = payload.text
metadata = {
'source': 'direct_text',
'text_length': len(payload.text)
}
# Stream summarization (same for both paths)
return StreamingResponse(
_stream_generator(text_to_summarize, payload, metadata, request_id),
media_type="text/event-stream",
headers={"Cache-Control": "no-cache", ...}
)
```
---
### Option 2: Auto-Detection (Alternative)
**Schema Design:**
```python
class ScrapeAndSummarizeRequest(BaseModel):
input: str = Field(..., description="URL to scrape OR text to summarize")
# ... other fields
```
**Endpoint Logic:**
```python
# Auto-detect if input is URL or text
if _is_valid_url(payload.input):
# URL detected β†’ Scrape + Summarize
article_data = await article_scraper_service.scrape_article(payload.input)
text_to_summarize = article_data['text']
else:
# Plain text detected β†’ Direct Summarization
text_to_summarize = payload.input
```
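The snippet above references a `_is_valid_url` helper that does not exist yet; a minimal sketch might look like:
```python
from urllib.parse import urlparse

def _is_valid_url(value: str) -> bool:
    """Heuristic: treat input as a URL only if it parses with an http(s) scheme and a host."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ('http', 'https') and bool(parsed.netloc)
```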
**Pros:**
- Single input field (simpler API)
- No client-side branching between modes
**Cons:**
- Ambiguous: What if text looks like a URL?
- Harder to debug issues
- Less explicit intent
**Verdict:** Option 1 is clearer and more explicit.
---
## Implementation Plan
### Step 1: Update Request Schema
**File:** `app/api/v3/schemas.py`
**Changes:**
1. Make `url` field Optional (change from required to `Optional[str] = None`)
2. Add `text` field as Optional (`Optional[str] = None`)
3. Add `@model_validator` to ensure exactly one is provided
4. Update `url` validator to handle None
5. Add `text` validator for length constraints
**Code:**
```python
from pydantic import BaseModel, Field, field_validator, model_validator
from typing import Optional
import ipaddress
import re
from urllib.parse import urlparse
class ScrapeAndSummarizeRequest(BaseModel):
"""Request schema supporting both URL scraping and direct text summarization."""
    url: Optional[str] = Field(
        None,
        description="URL of article to scrape and summarize",
        examples=["https://example.com/article"]
    )
    text: Optional[str] = Field(
        None,
        description="Direct text to summarize (alternative to URL)",
        examples=["Your article text here..."]
    )
max_tokens: Optional[int] = Field(
default=256,
ge=1,
le=2048,
description="Maximum tokens in summary"
)
temperature: Optional[float] = Field(
default=0.3,
ge=0.0,
le=2.0,
description="Sampling temperature"
)
top_p: Optional[float] = Field(
default=0.9,
ge=0.0,
le=1.0,
description="Nucleus sampling"
)
prompt: Optional[str] = Field(
default="Summarize this article concisely:",
description="Custom summarization prompt"
)
include_metadata: Optional[bool] = Field(
default=True,
description="Include article metadata in response"
)
use_cache: Optional[bool] = Field(
default=True,
description="Use cached content if available (URL mode only)"
)
@model_validator(mode='after')
def check_url_or_text(self):
"""Ensure exactly one of url or text is provided."""
if not self.url and not self.text:
raise ValueError('Either "url" or "text" must be provided')
if self.url and self.text:
raise ValueError('Provide either "url" OR "text", not both')
return self
@field_validator('url')
@classmethod
def validate_url(cls, v: Optional[str]) -> Optional[str]:
"""Validate URL format if provided."""
if v is None:
return v
# URL validation regex
url_pattern = re.compile(
r'^https?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' # domain
r'localhost|' # localhost
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # or IP
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE
)
if not url_pattern.match(v):
raise ValueError('Invalid URL format. Must start with http:// or https://')
        # SSRF protection: inspect the parsed hostname rather than the raw
        # string, so a path segment like /v10.2/ is not a false positive
        hostname = (urlparse(v).hostname or '').lower()
        if hostname == 'localhost':
            raise ValueError('Cannot scrape localhost URLs')
        try:
            ip = ipaddress.ip_address(hostname)
        except ValueError:
            ip = None  # hostname is a domain name, not an IP literal
        if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
            raise ValueError('Cannot scrape private or loopback IP addresses')
if len(v) > 2000:
raise ValueError('URL too long (maximum 2000 characters)')
return v
@field_validator('text')
@classmethod
def validate_text(cls, v: Optional[str]) -> Optional[str]:
"""Validate text content if provided."""
if v is None:
return v
if len(v) < 50:
raise ValueError('Text too short (minimum 50 characters)')
if len(v) > 50000:
raise ValueError('Text too long (maximum 50,000 characters)')
        # Reject input that is mostly whitespace
        if len(re.sub(r'\s', '', v)) < 30:
            raise ValueError('Text contains mostly whitespace')
return v
```
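A quick sanity check of the combined validators (a sketch; assumes the schema above is importable as shown):
```python
from pydantic import ValidationError

# Text mode: passes with >= 50 characters of real content
req = ScrapeAndSummarizeRequest(text="word " * 20)
assert req.url is None and req.max_tokens == 256

# Neither field: the model_validator rejects the request
try:
    ScrapeAndSummarizeRequest()
except ValidationError as err:
    print(err.errors()[0]['msg'])
    # -> Value error, Either "url" or "text" must be provided
```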
---
### Step 2: Update Endpoint Logic
**File:** `app/api/v3/scrape_summarize.py`
**Changes:**
1. Detect input type (URL vs text)
2. Branch logic accordingly
3. Adjust metadata based on input type
4. Keep streaming logic the same
**Code:**
```python
# Assumes the module's existing imports: json, time, logger,
# HTTPException, Request, StreamingResponse,
# article_scraper_service, hf_streaming_service
@router.post("/scrape-and-summarize/stream")
async def scrape_and_summarize_stream(
request: Request,
payload: ScrapeAndSummarizeRequest
):
"""
Scrape article from URL OR summarize provided text.
Supports two modes:
1. URL mode: Scrape article from URL then summarize
2. Text mode: Summarize provided text directly
Returns:
Server-Sent Events stream with metadata and content chunks
"""
request_id = getattr(request.state, 'request_id', 'unknown')
# Determine input mode
if payload.url:
# URL Mode: Scrape + Summarize
logger.info(f"[{request_id}] V3 URL mode: {payload.url}")
scrape_start = time.time()
try:
article_data = await article_scraper_service.scrape_article(
url=payload.url,
use_cache=payload.use_cache
)
except Exception as e:
logger.error(f"[{request_id}] Scraping failed: {e}")
raise HTTPException(
status_code=502,
detail=f"Failed to scrape article: {str(e)}"
)
scrape_latency_ms = (time.time() - scrape_start) * 1000
logger.info(f"[{request_id}] Scraped in {scrape_latency_ms:.2f}ms, "
f"extracted {len(article_data['text'])} chars")
# Validate scraped content
if len(article_data['text']) < 100:
raise HTTPException(
status_code=422,
detail="Insufficient content extracted from URL. "
"Article may be behind paywall or site may block scrapers."
)
text_to_summarize = article_data['text']
metadata = {
'input_type': 'url',
'url': payload.url,
'title': article_data.get('title'),
'author': article_data.get('author'),
'date': article_data.get('date'),
'site_name': article_data.get('site_name'),
'scrape_method': article_data.get('method', 'static'),
'scrape_latency_ms': scrape_latency_ms,
'extracted_text_length': len(article_data['text']),
}
else:
# Text Mode: Direct Summarization
logger.info(f"[{request_id}] V3 text mode: {len(payload.text)} chars")
text_to_summarize = payload.text
metadata = {
'input_type': 'text',
'text_length': len(payload.text),
}
# Stream summarization (same for both modes)
return StreamingResponse(
_stream_generator(text_to_summarize, payload, metadata, request_id),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no",
"X-Request-ID": request_id,
}
)
async def _stream_generator(text: str, payload: ScrapeAndSummarizeRequest, metadata: dict, request_id: str):
"""Generate SSE stream for summarization."""
# Send metadata event first
if payload.include_metadata:
metadata_event = {
"type": "metadata",
"data": metadata
}
yield f"data: {json.dumps(metadata_event)}\n\n"
# Stream summarization chunks
summarization_start = time.time()
tokens_used = 0
try:
async for chunk in hf_streaming_service.summarize_text_stream(
text=text,
max_new_tokens=payload.max_tokens,
temperature=payload.temperature,
top_p=payload.top_p,
prompt=payload.prompt,
):
if not chunk.get('done', False):
tokens_used = chunk.get('tokens_used', tokens_used)
yield f"data: {json.dumps(chunk)}\n\n"
except Exception as e:
logger.error(f"[{request_id}] Summarization failed: {e}")
error_event = {
"type": "error",
"error": str(e),
"done": True
}
yield f"data: {json.dumps(error_event)}\n\n"
return
summarization_latency_ms = (time.time() - summarization_start) * 1000
# Calculate total latency
total_latency_ms = summarization_latency_ms
if metadata.get('input_type') == 'url':
total_latency_ms += metadata.get('scrape_latency_ms', 0)
logger.info(f"[{request_id}] V3 request completed in {total_latency_ms:.2f}ms")
```
---
### Step 3: Update Tests
**File:** `tests/test_v3_api.py`
**New Test Cases:**
```python
# Assumes the test module imports json, pytest, unittest.mock.patch,
# and provides an async `client` fixture
@pytest.mark.asyncio
async def test_v3_text_mode_success(client):
"""Test V3 endpoint with text input (no scraping)."""
response = await client.post(
"/api/v3/scrape-and-summarize/stream",
json={
"text": "This is a test article with enough content to summarize properly. "
"It has multiple sentences and provides meaningful information.",
"max_tokens": 128,
"include_metadata": True
}
)
assert response.status_code == 200
    # StreamingResponse appends a charset, so compare the prefix only
    assert response.headers['content-type'].startswith('text/event-stream')
# Parse SSE stream
events = []
for line in response.text.split('\n'):
if line.startswith('data: '):
events.append(json.loads(line[6:]))
# Check metadata event
metadata_event = next(e for e in events if e.get('type') == 'metadata')
assert metadata_event['data']['input_type'] == 'text'
assert metadata_event['data']['text_length'] > 0
assert 'scrape_latency_ms' not in metadata_event['data'] # No scraping in text mode
# Check content events exist
content_events = [e for e in events if 'content' in e]
assert len(content_events) > 0
@pytest.mark.asyncio
async def test_v3_url_mode_success(client):
"""Test V3 endpoint with URL input (with scraping)."""
with patch('app.services.article_scraper.article_scraper_service.scrape_article') as mock_scrape:
        mock_scrape.return_value = {
            # Must exceed the endpoint's 100-character minimum for scraped content
            'text': 'Scraped article content with enough substance to summarize. ' * 3,
            'title': 'Test Article',
            'url': 'https://example.com/test',
            'method': 'static'
        }
response = await client.post(
"/api/v3/scrape-and-summarize/stream",
json={
"url": "https://example.com/test",
"max_tokens": 128
}
)
assert response.status_code == 200
# Parse events
events = []
for line in response.text.split('\n'):
if line.startswith('data: '):
events.append(json.loads(line[6:]))
# Check metadata shows URL mode
metadata_event = next(e for e in events if e.get('type') == 'metadata')
assert metadata_event['data']['input_type'] == 'url'
assert 'scrape_latency_ms' in metadata_event['data']
@pytest.mark.asyncio
async def test_v3_missing_both_url_and_text(client):
"""Test validation error when neither url nor text provided."""
response = await client.post(
"/api/v3/scrape-and-summarize/stream",
json={
"max_tokens": 128
}
)
assert response.status_code == 422
    # Model-level validators report loc ["body"] in Pydantic v2,
    # so assert on the message text instead of the field location
    error_detail = response.json()['detail']
    assert 'url' in error_detail[0]['msg'] and 'text' in error_detail[0]['msg']
@pytest.mark.asyncio
async def test_v3_both_url_and_text_provided(client):
"""Test validation error when both url and text provided."""
response = await client.post(
"/api/v3/scrape-and-summarize/stream",
json={
"url": "https://example.com/test",
"text": "Some text here",
"max_tokens": 128
}
)
assert response.status_code == 422
@pytest.mark.asyncio
async def test_v3_text_too_short(client):
"""Test validation error for text that's too short."""
response = await client.post(
"/api/v3/scrape-and-summarize/stream",
json={
"text": "Too short", # Less than 50 chars
"max_tokens": 128
}
)
assert response.status_code == 422
assert 'too short' in response.json()['detail'][0]['msg'].lower()
```
---
### Step 4: Update Documentation
**File:** `CLAUDE.md`
**Update V3 API section:**
````markdown
### V3 API (/api/v3/*): Web Scraping + Summarization
**Endpoint:** POST `/api/v3/scrape-and-summarize/stream`
**Supports two modes:**
1. **URL Mode** (scraping enabled):
```json
{
"url": "https://example.com/article",
"max_tokens": 256
}
```
- Scrapes article from URL
- Caches result for 1 hour
- Streams summarization
2. **Text Mode** (direct summarization):
```json
{
"text": "Your article text here...",
"max_tokens": 256
}
```
- Skips scraping
- Summarizes text directly
- Useful when scraping fails or text already extracted
**Features:**
- Explicit input selection (`url` vs `text` field)
- Backend web scraping with trafilatura
- In-memory caching (URL mode only)
- User-agent rotation
- Metadata extraction (URL mode: title, author, date)
- SSRF protection
- Rate limiting
**Response Format:**
Same Server-Sent Events format for both modes:
```
data: {"type":"metadata","data":{"input_type":"url|text",...}}
data: {"content":"token","done":false,"tokens_used":N}
data: {"content":"","done":true,"latency_ms":MS}
```
````
**File:** `README.md`
**Add usage examples:**
````markdown
### V3 API Examples
**Scrape and Summarize from URL:**
```bash
curl -X POST "https://your-space.hf.space/api/v3/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"max_tokens": 256,
"temperature": 0.3
}'
```
**Summarize Direct Text:**
```bash
curl -X POST "https://your-space.hf.space/api/v3/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{
"text": "Your article text here...",
"max_tokens": 256,
"temperature": 0.3
}'
```
**Python Example:**
```python
import json

import requests
# URL mode
response = requests.post(
"https://your-space.hf.space/api/v3/scrape-and-summarize/stream",
json={"url": "https://example.com/article", "max_tokens": 256},
stream=True
)
# Text mode
response = requests.post(
"https://your-space.hf.space/api/v3/scrape-and-summarize/stream",
json={"text": "Article content here...", "max_tokens": 256},
stream=True
)
for line in response.iter_lines():
if line.startswith(b'data: '):
data = json.loads(line[6:])
if data.get('content'):
print(data['content'], end='')
```
````
---
## Benefits of This Approach
### 1. Single Unified Endpoint
- Android app uses one endpoint for everything
- No need to choose between `/api/v2/` and `/api/v3/`
- Simpler client-side logic
### 2. Graceful Fallback
- If scraping fails (paywall, blocked), user can paste text manually
- App can catch 502 errors and prompt user to provide text directly, as sketched below
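
In client terms the fallback could look like this (Python sketch; the Android client would do the equivalent in Kotlin, and `API` is a placeholder):
```python
import requests

API = "https://your-space.hf.space/api/v3/scrape-and-summarize/stream"

def summarize(url: str, pasted_text: str | None = None) -> requests.Response:
    # Try URL mode first; on a scraping failure (502), retry in text mode
    resp = requests.post(API, json={"url": url, "max_tokens": 256}, stream=True)
    if resp.status_code == 502 and pasted_text:
        resp = requests.post(API, json={"text": pasted_text, "max_tokens": 256}, stream=True)
    return resp
```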
### 3. Backward Compatible
- Existing URL-based requests still work
- No breaking changes for current users
### 4. Better Error Messages
```json
// Missing both
{
  "detail": [
    {
      "type": "value_error",
      "loc": ["body"],
      "msg": "Value error, Either \"url\" or \"text\" must be provided"
    }
  ]
}
// Both provided
{
  "detail": [
    {
      "type": "value_error",
      "loc": ["body"],
      "msg": "Value error, Provide either \"url\" OR \"text\", not both"
    }
  ]
}
// Text too short
{
  "detail": [
    {
      "type": "value_error",
      "loc": ["body", "text"],
      "msg": "Value error, Text too short (minimum 50 characters)"
    }
  ]
}
```
### 5. Clear Metadata
```json
// URL mode metadata
{
"type": "metadata",
"data": {
"input_type": "url",
"url": "https://...",
"title": "Article Title",
"scrape_latency_ms": 450.2
}
}
// Text mode metadata
{
"type": "metadata",
"data": {
"input_type": "text",
"text_length": 1234
}
}
```
---
## Testing Checklist
- [ ] Test URL mode with valid URL
- [ ] Test text mode with valid text
- [ ] Test validation: missing both url and text (expect 422)
- [ ] Test validation: both url and text provided (expect 422)
- [ ] Test validation: text too short (< 50 chars, expect 422)
- [ ] Test validation: text too long (> 50k chars, expect 422)
- [ ] Test validation: invalid URL format (expect 422)
- [ ] Test SSRF protection: localhost URL (expect 422)
- [ ] Test SSRF protection: private IP (expect 422)
- [ ] Test metadata event in URL mode (includes scrape_latency_ms)
- [ ] Test metadata event in text mode (no scrape_latency_ms)
- [ ] Test streaming format same for both modes
- [ ] Test cache works in URL mode
- [ ] Test cache not used in text mode
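
Several of the rejection rows above collapse nicely into one parametrized test (a sketch; assumes the same async `client` fixture as in Step 3):
```python
import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("body", [
    {},                                                   # neither url nor text
    {"url": "https://example.com/a", "text": "x" * 60},   # both provided
    {"text": "too short"},                                # under 50 characters
    {"text": "x" * 50001},                                # over 50,000 characters
    {"url": "ftp://example.com/a"},                       # invalid URL scheme
    {"url": "http://localhost/a"},                        # SSRF: localhost
    {"url": "http://192.168.1.5/a"},                      # SSRF: private IP
])
async def test_v3_validation_rejections(client, body):
    response = await client.post(
        "/api/v3/scrape-and-summarize/stream",
        json={**body, "max_tokens": 128},
    )
    assert response.status_code == 422
```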
---
## Deployment Steps
1. **Update Schema** (`app/api/v3/schemas.py`)
- Make url Optional
- Add text Optional
- Add model_validator for mutual exclusivity
- Update validators
2. **Update Endpoint** (`app/api/v3/scrape_summarize.py`)
- Add input type detection
- Branch logic for URL vs text mode
- Adjust metadata
3. **Update Tests** (`tests/test_v3_api.py`)
- Add text mode tests
- Add validation tests
- Ensure 90% coverage
4. **Update Docs** (`CLAUDE.md`, `README.md`)
- Document both modes
- Add examples
5. **Test Locally**
```bash
pytest tests/test_v3_api.py -v
```
6. **Deploy to HF Spaces**
- Push changes
- Monitor logs
- Test both modes on live deployment
7. **Update Android App**
- App can now send either URL or text to same endpoint
- Graceful fallback: if scraping fails, prompt user for text
---
## Success Criteria
βœ… URL mode works (scraping + summarization)
βœ… Text mode works (direct summarization)
βœ… Validation errors are clear and helpful
βœ… No 422 errors when text is sent
βœ… Metadata correctly indicates input type
βœ… Tests pass with 90%+ coverage
βœ… Documentation updated
βœ… Android app can use single endpoint for both scenarios
---
## Estimated Impact
- **Code Changes:** ~100 lines modified
- **New Tests:** ~8 test cases
- **Breaking Changes:** None (backward compatible)
- **Performance:** No impact (same logic, just more flexible input)
- **Memory:** No impact
- **Deployment Time:** ~30 minutes
---
## Conclusion
This fix transforms the V3 API from a URL-only endpoint to a **smart, dual-mode endpoint** that gracefully handles both URLs and plain text. The Android app gains flexibility without added complexity, and users get better error messages when validation fails.
**Ready to implement!** πŸš€