Backend Performance Analysis & Optimization Guide
Executive Summary
The Android app streaming functionality is working perfectly. However, users experience 60+ second delays before seeing results due to backend cold start and processing time on Hugging Face Spaces.
Impact: Poor user experience, high abandonment risk
Root Cause: Ollama model loading + CPU-only inference on HF free tier
Priority: HIGH - Affects all users on every request
Performance Analysis
Current Performance Metrics
From HF Logs:
2025-10-15 06:01:03 - POST /api/generate starts
2025-10-15 06:01:29 - Response 200 | Duration: 1m2s
From Android Network Inspector:
- Request Type: event-stream (SSE)
- Status: 200 OK
- Total Time: 1 minute 3 seconds
- Data Size: 8.1 KB
Performance Breakdown
| Phase | Duration | Percentage |
|---|---|---|
| Server Cold Start | 10-20s | 16-32% |
| Model Loading (Ollama) | 30-40s | 48-64% |
| Text Generation | 10-15s | 16-24% |
| Total | 60-65s | 100% |
Root Causes
Ollama on CPU-only infrastructure
- Uses localhost:11434/api/generate
- Model weights are loaded on every cold start
- CPU inference is 10-20x slower than GPU
Hugging Face Free Tier Limitations
- Space goes to sleep after 15 minutes inactivity
- Cold start required on wake-up
- Shared CPU resources
- No GPU access
Large Input Text
- Processing 4,000+ character inputs
- Longer inputs mean longer generation time (see the truncation sketch below)
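A cheap mitigation, independent of the recommendations that follow, is to cap input length on the server before inference. A minimal sketch; the 4,000-character cap is an assumption to tune against summary quality:

# Assumption: cap roughly matching the ~4,000-character inputs observed today; tune as needed
MAX_INPUT_CHARS = 4000

def truncate_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    # Trim overly long inputs so generation time stays bounded
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Prefer cutting at the last sentence boundary before the limit
    last_period = cut.rfind(". ")
    return cut[:last_period + 1] if last_period > 0 else cut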
Optimization Recommendations
Priority 1: Quick Wins (Can Implement Today)
1.1 Replace Ollama with Faster Alternative
Impact: 70-85% faster inference
Effort: 2-3 hours
Cost: Free
Current: llama3.2:1b via Ollama
Problem: Even though 1B is small, Ollama adds significant overhead:
- Ollama server startup time
- Model loading into memory (even 1B takes 20-30s on CPU)
- Ollama's API layer adds latency
- Not optimized for CPU-only inference
Recommended Solutions:
Option A: Switch to Transformers Pipeline (FASTEST for CPU)
from transformers import pipeline

# Load once at startup
summarizer = pipeline(
    "summarization",
    model="sshleifer/distilbart-cnn-6-6",  # Distilled model, optimized for speed
    device=-1  # CPU
)

# Use in your endpoint
result = summarizer(
    text,
    max_length=130,
    min_length=30,
    do_sample=False
)
summary = result[0]["summary_text"]
Why it's faster:
- No Ollama overhead
- Distilled model with far fewer parameters; can be further sped up with ONNX export or quantization
- Faster model loading
- Better batching
Expected: 60s → 8-12s (80% improvement)
Option B: Use a Specialized Summarization Model
# Larger than distilbart, but well-optimized for summarization quality
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=-1
)
Expected: 60s → 10-15s (75% improvement)
Option C: Keep llama3.2:1b but Optimize Ollama
If you must keep llama3.2:1b:
# Pre-load model at startup
import ollama

@app.on_event("startup")
def load_model():
    # Warm up the model so the first user request doesn't pay the load cost
    ollama.generate(model="llama3.2:1b", prompt="test")
Expected: 60s → 25-35s (40% improvement)
1.2 Keep Model Loaded in Memory
Impact: Eliminates 30-40s loading time
Effort: 30 minutes
Cost: Free
Problem: Currently, the model is loaded for each request, adding 30-40s overhead.
Solution: Load model once at application startup and keep it in memory.
from fastapi import FastAPI
import ollama

app = FastAPI()

# Load model at startup (runs once)
@app.on_event("startup")
async def load_model():
    global model_client
    model_client = ollama.Client()
    # Warm up the model so the weights stay resident in memory
    model_client.generate(
        model="llama3.2:1b",
        prompt="test",
        stream=False
    )
    print("Model loaded and ready!")

# Use pre-loaded model in endpoints
@app.post("/api/v1/summarize/stream")
async def summarize_stream(request: SummarizeRequest):
    response = model_client.generate(
        model="llama3.2:1b",
        prompt=request.text,
        stream=True
    )
    # ... stream response
Expected Result: 62s → 15-25s (first request), 10-15s (subsequent)
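A related tweak worth testing, if the Ollama version installed in the Space supports it, is the keep_alive option, which asks Ollama not to unload the model between requests. A hedged sketch; the parameter and the -1 value should be verified against the installed client:

# Assumption: keep_alive is supported by the installed ollama client; -1 asks it to keep the model loaded indefinitely
response = model_client.generate(
    model="llama3.2:1b",
    prompt=request.text,
    stream=True,
    keep_alive=-1
)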
1.3 Set Up Keep-Warm Service
Impact: Eliminates cold starts
Effort: 10 minutes
Cost: Free
Problem: HF Space goes to sleep after 15 minutes, causing 10-20s cold start.
Solution: Ping your space every 10 minutes to keep it awake.
Option A: UptimeRobot (Recommended)
- Go to https://uptimerobot.com
- Create a free account
- Add an HTTP(s) monitor:
  - URL: https://colin730-summarizerapp.hf.space/
  - Interval: 10 minutes
- Done!
Option B: GitHub Actions
Create .github/workflows/keep-warm.yml:
name: Keep HF Space Warm
on:
  schedule:
    - cron: '*/10 * * * *'  # Every 10 minutes
jobs:
  ping:
    runs-on: ubuntu-latest
    steps:
      - name: Ping Space
        run: |
          curl -f https://colin730-summarizerapp.hf.space/ || echo "Ping failed"
Expected Result: Eliminates 10-20s cold start delay
Priority 2: Medium-term Solutions (This Week)
2.1 Switch to the Hugging Face Transformers Library
Impact: 70-80% faster
Effort: 2-3 hours
Cost: Free (with rate limits)
Problem: Ollama is not optimized for CPU-only environments.
Solution: Use HF's native transformers library with optimized models.
from transformers import pipeline
from fastapi.responses import StreamingResponse
import json

# Load at startup (much faster than Ollama)
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",  # or "sshleifer/distilbart-cnn-12-6" for speed
    device=-1  # CPU
)

@app.post("/api/v1/summarize/stream")
async def summarize_stream(request: SummarizeRequest):
    async def generate():
        # Process in chunks for a streaming effect
        text = request.text
        max_chunk_size = 1024
        for i in range(0, len(text), max_chunk_size):
            chunk = text[i:i + max_chunk_size]
            result = summarizer(
                chunk,
                max_length=150,
                min_length=30,
                do_sample=False
            )
            summary_chunk = result[0]['summary_text']
            # Stream as SSE
            yield f"data: {json.dumps({'content': summary_chunk, 'done': False})}\n\n"
        yield f"data: {json.dumps({'content': '', 'done': True})}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
Advantages:
- Faster loading (optimized for CPU)
- Better caching
- Native HF integration
- Simpler deployment
Expected Result: 62s → 10-15s
2.2 Implement Response Caching
Impact: Instant responses for repeated inputs
Effort: 1-2 hours
Cost: Free
import hashlib
import json

def get_cache_key(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()

# Simple in-memory cache
summary_cache = {}

@app.post("/api/v1/summarize/stream")
async def summarize_stream(request: SummarizeRequest):
    cache_key = get_cache_key(request.text)

    # Check cache first
    if cache_key in summary_cache:
        cached_summary = summary_cache[cache_key]

        # Stream cached result
        async def stream_cached():
            for word in cached_summary.split():
                yield f"data: {json.dumps({'content': word + ' ', 'done': False})}\n\n"
            yield f"data: {json.dumps({'content': '', 'done': True})}\n\n"

        return StreamingResponse(stream_cached(), media_type="text/event-stream")

    # ... generate new summary and cache it
    summary_cache[cache_key] = summary
Expected Result: Cached requests: 1-2s (instant)
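One design note on the cache above: a plain dict grows without bound on a long-running Space. A minimal bounded-LRU sketch; the 256-entry cap is an assumption to tune to the Space's memory budget:

from collections import OrderedDict

MAX_CACHE_ENTRIES = 256  # assumption: tune to available memory

summary_cache = OrderedDict()

def cache_put(key: str, value: str) -> None:
    # Insert or refresh the entry, then evict the least recently used one when over the cap
    summary_cache[key] = value
    summary_cache.move_to_end(key)
    if len(summary_cache) > MAX_CACHE_ENTRIES:
        summary_cache.popitem(last=False)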
Priority 3: Long-term Solutions (Upgrade Path)
3.1 Upgrade to Hugging Face Pro
Impact: 80-90% faster, eliminates all cold starts
Effort: 5 minutes
Cost: $9/month
Benefits:
- Persistent hardware (no cold starts)
- GPU access (10-20x faster inference)
- Always-on instance
- Better resource allocation
Expected Result: 62s → 3-5s consistently
3.2 Migrate to Dedicated Infrastructure
Impact: Full control, optimal performance
Effort: 4-8 hours
Cost: $10-50/month
Options:
- DigitalOcean GPU Droplet ($10/month + GPU hours)
- AWS Lambda + SageMaker (Pay per use)
- Railway.app ($5-20/month)
- Render.com ($7-25/month)
Advantages:
- No cold starts
- GPU access
- Better monitoring
- Scalable
Expected Result: 62s → 2-5s consistently
3.3 Use Managed AI APIs
Impact: Instant responses, no infrastructure management
Effort: 2-3 hours (API integration)
Cost: Pay per use (~$0.001 per summary)
Options:
OpenAI GPT-3.5/4:
import openai

# Note: this is the legacy (openai<1.0) SDK interface; newer SDK versions expose
# the equivalent call as client.chat.completions.create(...)
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Summarize: {text}"
    }],
    stream=True
)

for chunk in response:
    # Forward each streamed token to the client
    delta = chunk["choices"][0]["delta"].get("content", "")
- Response time: 1-3s
- Cost: ~$0.001-0.003 per summary
Anthropic Claude:
import anthropic

client = anthropic.Client(api_key="...")
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=300,  # required by the Messages API
    messages=[{"role": "user", "content": f"Summarize: {text}"}],
    stream=True
)
- Response time: 1-2s
- Cost: ~$0.0005-0.002 per summary
Google Gemini:
- Free tier: 60 requests/minute
- Response time: 1-3s
- Cost: Free → $0.0005 per summary
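For reference, a minimal streaming sketch with the google-generativeai client; the API key handling and model name are assumptions, so pick whichever Gemini model fits the account's quota:

import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

response = model.generate_content(f"Summarize: {text}", stream=True)
for chunk in response:
    # Forward each streamed chunk of text to the client
    print(chunk.text, end="")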
Expected Result: 62s → 1-3s, zero maintenance
Performance Comparison
| Solution | Time | Cold Start | Cost | Effort | Recommendation |
|---|---|---|---|---|---|
| Current (llama3.2:1b + Ollama) | 60-65s | Yes | Free | - | Poor UX |
| Keep-Warm Service | 50-55s | No | Free | 10 min | Do first |
| Pre-load Model | 35-45s | Yes | Free | 30 min | Do first |
| Switch to Transformers | 8-12s | Minimal | Free | 2-3 hrs | Best free option |
| HF Pro + GPU | 3-5s | No | $9/mo | 5 min | Best value |
| Managed API (OpenAI/Claude) | 1-3s | No | Pay per use | 2-3 hrs | Best performance |
Recommended Implementation Plan
Phase 1: Immediate (Do Today)
- Set up UptimeRobot keep-warm (10 min)
- Pre-load llama3.2 model at startup (30 min)
Expected: 60s → 35-40s (35% improvement)
Phase 2: This Week
- Replace Ollama with Transformers pipeline (2-3 hrs): biggest impact
- Implement response caching (1-2 hrs)
Expected: 35s → 8-12s (additional 70% improvement)
Phase 3: Future Consideration
- Evaluate HF Pro vs Managed APIs
- Based on usage patterns and budget, choose:
- HF Pro if self-hosted control is important
- Managed API if cost-per-use works better
Expected: 8s → 2-5s (additional 60-75% improvement)
Next Steps
- Review this analysis with the backend team
- Pick quick wins from Phase 1 (can be done in <1 hour)
- Measure results after each change
- Share metrics so we can validate improvements
Success Metrics
- Target: < 10 seconds for first response
- Ideal: < 5 seconds for first response
- Acceptable: < 15 seconds with good UX feedback
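To validate these targets after each change, one lightweight check is to time how long the streaming endpoint takes to return its first chunk. A minimal sketch using requests, with the Space URL and payload field taken from the examples above (adjust if the request schema differs):

import time
import requests

# URL and payload field taken from the examples above; adjust if the schema differs
URL = "https://colin730-summarizerapp.hf.space/api/v1/summarize/stream"
payload = {"text": "Some long article text to summarize..."}

start = time.monotonic()
with requests.post(URL, json=payload, stream=True, timeout=180) as resp:
    resp.raise_for_status()
    first_chunk_at = None
    for line in resp.iter_lines():
        if line and first_chunk_at is None:
            first_chunk_at = time.monotonic()
            print(f"Time to first chunk: {first_chunk_at - start:.1f}s")
print(f"Total time: {time.monotonic() - start:.1f}s")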
Additional Notes
Current Stack:
- Hugging Face Spaces (Free Tier)
- Ollama (localhost:11434)
- CPU-only inference
- Model: llama3.2:1b (Good choice for speed!)
Android App Status:
- Working perfectly
- Streaming implementation is correct
- UI updates in real-time once data arrives
- The only issue is waiting for the backend to start responding
Document Version: 1.0
Date: October 15, 2025
Prepared for: Backend Team - SummarizeAI App
Contact: Android Team