---
title: Text Summarizer API
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---
# Text Summarizer API

A FastAPI-based text summarization service with two backends: Ollama-served LLMs (V1) and Hugging Face Transformers models (V2).
## Features

- Fast text summarization using local LLM inference
- RESTful API with FastAPI
- Health monitoring and logging
- Docker containerized for easy deployment
- Free deployment on Hugging Face Spaces
## API Endpoints

### Health Check

```
GET /health
```

### V1 API (Ollama + Transformers Pipeline)

```
POST /api/v1/summarize
POST /api/v1/summarize/stream
POST /api/v1/summarize/pipeline/stream
```

### V2 API (HuggingFace Streaming)

```
POST /api/v2/summarize/stream
```
## Live Deployment

Successfully deployed and tested on Hugging Face Spaces!

- Live Space: https://colin730-SummarizerApp.hf.space
- API Documentation: https://colin730-SummarizerApp.hf.space/docs
- Health Check: https://colin730-SummarizerApp.hf.space/health
- V2 Streaming API: https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream
### Quick Test

```bash
# Test the live deployment
curl https://colin730-SummarizerApp.hf.space/health

curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a test of the live API.","max_tokens":50}'
```
### Request Format (V1 and V2 compatible)

```json
{
  "text": "Your long text to summarize here...",
  "max_tokens": 256,
  "prompt": "Summarize the following text concisely:"
}
```
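The route handlers themselves aren't reproduced in this README. The sketch below is a minimal, hypothetical FastAPI wiring that accepts exactly this request shape and streams Server-Sent Events with the `content`/`done` fields used in the examples further down; everything beyond those documented field names (the stubbed token generator in particular) is an illustrative assumption, not the app's actual code.

```python
# Illustrative sketch only, not the app's actual implementation.
# Field names ("text", "max_tokens", "prompt", "content", "done") follow the
# formats documented in this README; the summarization logic is stubbed out.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str
    max_tokens: int = 256
    prompt: str = "Summarize the following text concisely:"

@app.post("/api/v2/summarize/stream")
def summarize_stream(req: SummarizeRequest) -> StreamingResponse:
    def events():
        # A real implementation would yield model tokens here.
        for token in req.text.split()[: req.max_tokens]:
            yield f"data: {json.dumps({'content': token + ' ', 'done': False})}\n\n"
        yield f"data: {json.dumps({'content': '', 'done': True})}\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")
```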
### API Documentation

- Swagger UI: `/docs`
- ReDoc: `/redoc`
## Configuration

The service uses the following environment variables:

### V1 Configuration (Ollama)

- `OLLAMA_MODEL`: Model to use (default: `llama3.2:1b`)
- `OLLAMA_HOST`: Ollama service host (default: `http://localhost:11434`)
- `OLLAMA_TIMEOUT`: Request timeout in seconds (default: `60`)
- `ENABLE_V1_WARMUP`: Enable V1 warmup (default: `false`)

### V2 Configuration (HuggingFace)

- `HF_MODEL_ID`: HuggingFace model ID (default: `sshleifer/distilbart-cnn-6-6`)
- `HF_DEVICE_MAP`: Device mapping (default: `auto` for GPU with CPU fallback)
- `HF_TORCH_DTYPE`: Torch dtype (default: `auto`)
- `HF_HOME`: HuggingFace cache directory (default: `/tmp/huggingface`)
- `HF_MAX_NEW_TOKENS`: Max new tokens (default: `128`)
- `HF_TEMPERATURE`: Sampling temperature (default: `0.7`)
- `HF_TOP_P`: Nucleus sampling (default: `0.95`)
- `ENABLE_V2_WARMUP`: Enable V2 warmup (default: `true`)

### Server Configuration

- `SERVER_HOST`: Server host (default: `127.0.0.1`)
- `SERVER_PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (default: `INFO`)
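How the application reads these variables isn't shown here; a minimal sketch using plain `os.getenv` with the documented defaults could look like the following (the real code may use a settings class instead).

```python
# Illustrative config loader using the defaults documented above.
import os

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2:1b")
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_TIMEOUT = int(os.getenv("OLLAMA_TIMEOUT", "60"))
ENABLE_V1_WARMUP = os.getenv("ENABLE_V1_WARMUP", "false").lower() == "true"

HF_MODEL_ID = os.getenv("HF_MODEL_ID", "sshleifer/distilbart-cnn-6-6")
HF_HOME = os.getenv("HF_HOME", "/tmp/huggingface")
HF_MAX_NEW_TOKENS = int(os.getenv("HF_MAX_NEW_TOKENS", "128"))
HF_TEMPERATURE = float(os.getenv("HF_TEMPERATURE", "0.7"))
HF_TOP_P = float(os.getenv("HF_TOP_P", "0.95"))
ENABLE_V2_WARMUP = os.getenv("ENABLE_V2_WARMUP", "true").lower() == "true"

SERVER_HOST = os.getenv("SERVER_HOST", "127.0.0.1")
SERVER_PORT = int(os.getenv("SERVER_PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```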
## Docker Deployment

### Local Development

```bash
# Build and run with docker-compose
docker-compose up --build

# Or run directly
docker build -f Dockerfile.hf -t summarizer-app .
docker run -p 7860:7860 summarizer-app
```
### Hugging Face Spaces

This app is optimized for deployment on Hugging Face Spaces using the Docker SDK.

**V2-only deployment on HF Spaces:**

- Uses a compact summarization model (default `sshleifer/distilbart-cnn-6-6`, ~300MB) for fast startup
- No Ollama dependency (saves memory and disk space)
- Model downloads during warmup so the first request responds without delay
- Optimized for free-tier resource limits

Environment variables for HF Spaces:

```
ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=true
HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
HF_HOME=/tmp/huggingface
```
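The warmup step itself isn't spelled out in this README. Assuming the standard `transformers.pipeline` API, a V2 warmup could be as simple as instantiating the summarization pipeline at startup so the model weights are downloaded and cached before the first request; the snippet below is a sketch of that idea, not the repository's actual code.

```python
# Illustrative V2 warmup sketch: pre-load the summarization model at startup
# so the first request does not pay the download/initialization cost.
import os

from transformers import pipeline

if os.getenv("ENABLE_V2_WARMUP", "true").lower() == "true":
    summarizer = pipeline(
        "summarization",
        model=os.getenv("HF_MODEL_ID", "sshleifer/distilbart-cnn-6-6"),
    )
    # A tiny dry run forces full tokenizer/model initialization.
    summarizer("Warmup text to initialize the model.", max_length=16, min_length=2)
```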
## Performance

### V1 (Ollama + Transformers Pipeline)
- V1 Models: llama3.2:1b (Ollama) + distilbart-cnn-6-6 (Transformers)
- Memory usage: ~2-4GB RAM (when V1 warmup enabled)
- Inference speed: ~2-5 seconds per request
- Startup time: ~30-60 seconds (when V1 warmup enabled)
### V2 (HuggingFace Streaming) - Primary on HF Spaces
- V2 Model: sshleifer/distilbart-cnn-6-6 (~300MB download)
- Memory usage: ~500MB RAM (when V2 warmup enabled)
- Inference speed: Real-time token streaming
- Startup time: ~30-60 seconds (includes model download when V2 warmup enabled)
### Memory Optimization

- V1 warmup disabled by default (`ENABLE_V1_WARMUP=false`)
- V2 warmup enabled by default (`ENABLE_V2_WARMUP=true`)
- HuggingFace Spaces: V2-only deployment (no Ollama)
- Local development: V1 endpoints work if Ollama is running externally
- `distilbart-cnn-6-6` model: fine-tuned on CNN/DailyMail and sized for the HuggingFace Spaces free tier
## Development

### Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
### Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=app
```
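As a quick smoke test against the same `app.main:app` object used in the run command above, something like the following works with FastAPI's `TestClient`. The exact body returned by `/health` isn't documented here, so only the status code is asserted; the test file name is illustrative.

```python
# tests/test_health.py - minimal smoke test sketch for the /health endpoint.
from fastapi.testclient import TestClient

from app.main import app  # same app object targeted by "uvicorn app.main:app"

client = TestClient(app)

def test_health_returns_200():
    response = client.get("/health")
    assert response.status_code == 200
```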
## Usage Examples

### V1 API (Ollama)

```python
import requests
import json

# V1 streaming summarization
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream",
    json={
        "text": "Your long article or text here...",
        "max_tokens": 256
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        print(data["content"], end="")
        if data["done"]:
            break
```
### V2 API (HuggingFace Streaming) - Recommended

```python
import requests
import json

# V2 streaming summarization (same request format as V1)
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream",
    json={
        "text": "Your long article or text here...",
        "max_tokens": 128  # V2 uses max_new_tokens
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        print(data["content"], end="")
        if data["done"]:
            break
```
### Android Client (SSE)

```kotlin
// Android SSE client example (OkHttp + Gson)
val client = OkHttpClient()
val request = Request.Builder()
    .url("https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream")
    .post(RequestBody.create(
        MediaType.parse("application/json"),
        """{"text": "Your text...", "max_tokens": 128}"""
    ))
    .build()

client.newCall(request).enqueue(object : Callback {
    override fun onFailure(call: Call, e: IOException) {
        // Handle network errors
    }

    override fun onResponse(call: Call, response: Response) {
        val source = response.body()?.source()
        source?.use { bufferedSource ->
            while (true) {
                val line = bufferedSource.readUtf8Line() ?: break
                if (line.startsWith("data: ")) {
                    val json = line.substring(6)
                    val data = Gson().fromJson(json, Map::class.java)
                    // Update UI with data["content"]
                    if (data["done"] == true) break
                }
            }
        }
    }
})
```
### cURL Examples

```bash
# Test live deployment
curl https://colin730-SummarizerApp.hf.space/health

# V1 API (if Ollama is available)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text...", "max_tokens": 256}'

# V2 API (HuggingFace streaming - recommended)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text...", "max_tokens": 128}'
```
### Test Script

```bash
# Use the included test script
./scripts/test_endpoints.sh https://colin730-SummarizerApp.hf.space
```
## Security

- Non-root user execution
- Input validation and sanitization
- Rate limiting (configurable)
- API key authentication (optional; see the sketch below)
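The README doesn't show how the optional API-key check is implemented. One common FastAPI pattern is a dependency that is skipped when no key is configured; the sketch below illustrates that pattern, with the `X-API-Key` header, `API_KEY` variable, and `/protected` route all being assumed names rather than this service's real ones.

```python
# Illustrative optional API-key dependency; the header name and API_KEY
# environment variable are assumptions, not necessarily what this service uses.
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def require_api_key(x_api_key: str | None = Header(default=None)) -> None:
    expected = os.getenv("API_KEY")
    if expected is None:
        return  # auth disabled when no key is configured
    if x_api_key != expected:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

@app.get("/protected", dependencies=[Depends(require_api_key)])
def protected() -> dict:
    return {"ok": True}
```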
## Monitoring

The service includes:

- Health check endpoint
- Request logging (see the sketch below)
- Error tracking
- Performance metrics
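As an illustration of the kind of request logging and timing described above (not the service's actual code), a small FastAPI middleware can record each request's method, path, status code, and latency:

```python
# Illustrative request-logging middleware; the real service's logging setup
# may differ.
import logging
import time

from fastapi import FastAPI, Request

logger = logging.getLogger("summarizer")
app = FastAPI()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "%s %s -> %d (%.1f ms)",
        request.method,
        request.url.path,
        response.status_code,
        elapsed_ms,
    )
    return response
```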
## Troubleshooting

### Common Issues

- Model not loading: check that Ollama is running and the model has been pulled
- Out of memory: ensure sufficient RAM (8GB+ if running a larger Ollama model) for local V1 use
- Slow startup: normal on first run due to the model download
- API errors: check the application logs and verify request formats via the `/docs` endpoint

### Logs

View application logs in the Hugging Face Spaces interface, or check the health endpoint for service status.
## License

MIT License - see the LICENSE file for details.
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## Deployment Status

Successfully deployed and tested on Hugging Face Spaces!

- Proxy-aware FastAPI with `root_path` support
- All endpoints working (health, docs, V2 API)
- Real-time streaming summarization
- No 404 errors; all paths correctly configured
- Test script included for easy verification
### Recent Fixes Applied

- Added `root_path=os.getenv("HF_SPACE_ROOT_PATH", "")` for HF Spaces proxy awareness (see the sketch below)
- Ensured binding to `0.0.0.0:7860` as required by HF Spaces
- Verified V2 router paths (`/api/v2/summarize/stream`) with no double prefixes
- Created a test script for external endpoint verification
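Putting those fixes together, the proxy-aware startup might look roughly like this; the `root_path` expression is quoted from the list above, while the rest of the arrangement is a plausible sketch rather than a copy of the repository's code.

```python
# Illustrative startup wiring for HF Spaces: proxy-aware root_path plus
# binding to 0.0.0.0:7860 as the platform requires.
import os

import uvicorn
from fastapi import FastAPI

app = FastAPI(root_path=os.getenv("HF_SPACE_ROOT_PATH", ""))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```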
Live Space: https://colin730-SummarizerApp.hf.space