Spaces:
Running
title: Text Summarizer API
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
Text Summarizer API
A FastAPI-based text summarization service powered by Ollama and Mistral 7B model.
π Features
- Fast text summarization using local LLM inference
- RESTful API with FastAPI
- Health monitoring and logging
- Docker containerized for easy deployment
- Free deployment on Hugging Face Spaces
π‘ API Endpoints
Health Check
GET /health
V1 API (Ollama + Transformers Pipeline)
POST /api/v1/summarize
POST /api/v1/summarize/stream
POST /api/v1/summarize/pipeline/stream
V2 API (HuggingFace Streaming)
POST /api/v2/summarize/stream
V3 API (Web Scraping + Summarization)
POST /api/v3/scrape-and-summarize/stream
π Live Deployment
β Successfully deployed and tested on Hugging Face Spaces!
- Live Space: https://colin730-SummarizerApp.hf.space
- API Documentation: https://colin730-SummarizerApp.hf.space/docs
- Health Check: https://colin730-SummarizerApp.hf.space/health
- V2 Streaming API: https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream
Quick Test
# Test the live deployment
curl https://colin730-SummarizerApp.hf.space/health
curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
-H "Content-Type: application/json" \
-d '{"text":"This is a test of the live API.","max_tokens":50}'
Request Format (V1 and V2 compatible):
{
"text": "Your long text to summarize here...",
"max_tokens": 256,
"prompt": "Summarize the following text concisely:"
}
API Documentation
- Swagger UI:
/docs - ReDoc:
/redoc
π§ Configuration
The service uses the following environment variables:
V1 Configuration (Ollama)
OLLAMA_MODEL: Model to use (default:llama3.2:1b)OLLAMA_HOST: Ollama service host (default:http://localhost:11434)OLLAMA_TIMEOUT: Request timeout in seconds (default:60)ENABLE_V1_WARMUP: Enable V1 warmup (default:false)
V2 Configuration (HuggingFace)
HF_MODEL_ID: HuggingFace model ID (default:sshleifer/distilbart-cnn-6-6)HF_DEVICE_MAP: Device mapping (default:autofor GPU fallback to CPU)HF_TORCH_DTYPE: Torch dtype (default:auto)HF_HOME: HuggingFace cache directory (default:/tmp/huggingface)HF_MAX_NEW_TOKENS: Max new tokens (default:128)HF_TEMPERATURE: Sampling temperature (default:0.7)HF_TOP_P: Nucleus sampling (default:0.95)ENABLE_V2_WARMUP: Enable V2 warmup (default:true)
V3 Configuration (Web Scraping)
ENABLE_V3_SCRAPING: Enable V3 API (default:true)SCRAPING_TIMEOUT: HTTP timeout for scraping (default:10seconds)SCRAPING_MAX_TEXT_LENGTH: Max text to extract (default:50000chars)SCRAPING_CACHE_ENABLED: Enable caching (default:true)SCRAPING_CACHE_TTL: Cache TTL (default:3600seconds / 1 hour)SCRAPING_UA_ROTATION: Enable user-agent rotation (default:true)SCRAPING_RATE_LIMIT_PER_MINUTE: Rate limit per IP (default:10)
Server Configuration
SERVER_HOST: Server host (default:127.0.0.1)SERVER_PORT: Server port (default:8000)LOG_LEVEL: Logging level (default:INFO)
π³ Docker Deployment
Local Development
# Build and run with docker-compose
docker-compose up --build
# Or run directly
docker build -f Dockerfile.hf -t summarizer-app .
docker run -p 7860:7860 summarizer-app
Hugging Face Spaces
This app is optimized for deployment on Hugging Face Spaces using Docker SDK.
V2-Only Deployment on HF Spaces:
- Uses
t5-smallmodel (~250MB) for fast startup - No Ollama dependency (saves memory and disk space)
- Model downloads during warmup for instant first request
- Optimized for free tier resource limits
Environment Variables for HF Spaces:
ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=true
HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
HF_HOME=/tmp/huggingface
π Performance
V1 (Ollama + Transformers Pipeline)
- V1 Models: llama3.2:1b (Ollama) + distilbart-cnn-6-6 (Transformers)
- Memory usage: ~2-4GB RAM (when V1 warmup enabled)
- Inference speed: ~2-5 seconds per request
- Startup time: ~30-60 seconds (when V1 warmup enabled)
V2 (HuggingFace Streaming) - Primary on HF Spaces
- V2 Model: sshleifer/distilbart-cnn-6-6 (~300MB download)
- Memory usage: ~500MB RAM (when V2 warmup enabled)
- Inference speed: Real-time token streaming
- Startup time: ~30-60 seconds (includes model download when V2 warmup enabled)
V3 (Web Scraping + Summarization)
- Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
- Memory usage: ~550MB RAM (V2 + scraping: +10-50MB)
- Scraping speed: 200-500ms typical, <10ms on cache hit
- Total latency: 2-5 seconds (scrape + summarize)
- Success rate: 95%+ article extraction
Memory Optimization
- V1 warmup disabled by default (
ENABLE_V1_WARMUP=false) - V2 warmup enabled by default (
ENABLE_V2_WARMUP=true) - HuggingFace Spaces: V2-only deployment (no Ollama)
- Local development: V1 endpoints work if Ollama is running externally
- distilbart-cnn-6-6 model: Optimized for HuggingFace Spaces free tier with CNN/DailyMail fine-tuning
π οΈ Development
Setup
# Install dependencies
pip install -r requirements.txt
# Run locally
uvicorn app.main:app --host 0.0.0.0 --port 7860
Testing
# Run tests
pytest
# Run with coverage
pytest --cov=app
π Usage Examples
V1 API (Ollama)
import requests
import json
# V1 streaming summarization
response = requests.post(
"https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream",
json={
"text": "Your long article or text here...",
"max_tokens": 256
},
stream=True
)
for line in response.iter_lines():
if line.startswith(b'data: '):
data = json.loads(line[6:])
print(data["content"], end="")
if data["done"]:
break
V2 API (HuggingFace Streaming) - Recommended
import requests
import json
# V2 streaming summarization (same request format as V1)
response = requests.post(
"https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream",
json={
"text": "Your long article or text here...",
"max_tokens": 128 # V2 uses max_new_tokens
},
stream=True
)
for line in response.iter_lines():
if line.startswith(b'data: '):
data = json.loads(line[6:])
print(data["content"], end="")
if data["done"]:
break
V3 API (Web Scraping + Summarization) - Android App Primary Use Case
V3 supports two modes: URL scraping or direct text summarization
Mode 1: URL Scraping (recommended for articles)
import requests
import json
# V3 scrape article from URL and stream summarization
response = requests.post(
"https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
json={
"url": "https://example.com/article",
"max_tokens": 256,
"include_metadata": True, # Get article title, author, etc.
"use_cache": True # Use cached content if available
},
stream=True
)
for line in response.iter_lines():
if line.startswith(b'data: '):
data = json.loads(line[6:])
# First event: metadata
if data.get("type") == "metadata":
print(f"Input type: {data['data']['input_type']}") # 'url'
print(f"Title: {data['data']['title']}")
print(f"Author: {data['data']['author']}")
print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")
# Content events
elif "content" in data:
print(data["content"], end="")
if data["done"]:
print(f"\n\nTotal time: {data['latency_ms']}ms")
break
Mode 2: Direct Text Summarization (fallback when scraping fails)
import requests
import json
# V3 direct text summarization (no scraping)
response = requests.post(
"https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
json={
"text": "Your article text here... (minimum 50 characters)",
"max_tokens": 256,
"include_metadata": True
},
stream=True
)
for line in response.iter_lines():
if line.startswith(b'data: '):
data = json.loads(line[6:])
# First event: metadata
if data.get("type") == "metadata":
print(f"Input type: {data['data']['input_type']}") # 'text'
print(f"Text length: {data['data']['text_length']} chars\n")
# Content events
elif "content" in data:
print(data["content"], end="")
if data["done"]:
break
Note: Provide either url OR text, not both. Text mode is useful as a fallback when:
- Article is behind a paywall
- Website blocks scrapers
- User has already extracted the text manually
Android Client (SSE)
// Android SSE client example
val client = OkHttpClient()
val request = Request.Builder()
.url("https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream")
.post(RequestBody.create(
MediaType.parse("application/json"),
"""{"text": "Your text...", "max_tokens": 128}"""
))
.build()
client.newCall(request).enqueue(object : Callback {
override fun onResponse(call: Call, response: Response) {
val source = response.body()?.source()
source?.use { bufferedSource ->
while (true) {
val line = bufferedSource.readUtf8Line()
if (line?.startsWith("data: ") == true) {
val json = line.substring(6)
val data = Gson().fromJson(json, Map::class.java)
// Update UI with data["content"]
if (data["done"] == true) break
}
}
}
}
})
cURL Examples
# Test live deployment
curl https://colin730-SummarizerApp.hf.space/health
# V1 API (if Ollama is available)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream" \
-H "Content-Type: application/json" \
-d '{"text": "Your text...", "max_tokens": 256}'
# V2 API (HuggingFace streaming - recommended)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream" \
-H "Content-Type: application/json" \
-d '{"text": "Your text...", "max_tokens": 128}'
# V3 API - URL mode (web scraping + summarization)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article", "max_tokens": 256, "include_metadata": true}'
# V3 API - Text mode (direct summarization, no scraping)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}'
Test Script
# Use the included test script
./scripts/test_endpoints.sh https://colin730-SummarizerApp.hf.space
π Security
- Non-root user execution
- Input validation and sanitization
- Rate limiting (configurable)
- API key authentication (optional)
π Monitoring
The service includes:
- Health check endpoint
- Request logging
- Error tracking
- Performance metrics
π Troubleshooting
Common Issues
- Model not loading: Check if Ollama is running and model is pulled
- Out of memory: Ensure sufficient RAM (8GB+) for Mistral 7B
- Slow startup: Normal on first run due to model download
- API errors: Check logs via
/docsendpoint
Logs
View application logs in the Hugging Face Spaces interface or check the health endpoint for service status.
π License
MIT License - see LICENSE file for details.
π€ Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
β Deployment Status
Successfully deployed and tested on Hugging Face Spaces! π
- β
Proxy-aware FastAPI with
root_pathsupport - β All endpoints working (health, docs, V2 API)
- β Real-time streaming summarization
- β No 404 errors - all paths correctly configured
- β Test script included for easy verification
Recent Fixes Applied
- Added
root_path=os.getenv("HF_SPACE_ROOT_PATH", "")for HF Spaces proxy awareness - Ensured binding to
0.0.0.0:7860as required by HF Spaces - Verified V2 router paths (
/api/v2/summarize/stream) with no double prefixes - Created test script for external endpoint verification
Live Space: https://colin730-SummarizerApp.hf.space π―