Spaces:

colin730
/

SummarizerApp

Running

App Files Files Community

SummarizerApp / V4_LOCAL_SETUP.md

ming

Migrate to Ruff for linting/formatting and add comprehensive import tests

29ed661 5 days ago

preview code

raw

history blame contribute delete

7.69 kB

V4 Local Setup for M4 MacBook Pro

Summary

V4 is successfully configured and running on your M4 MacBook Pro with MPS (Metal Performance Shaders) acceleration!

Hardware Configuration

Model: M4 MacBook Pro (Mac16,7)
CPU: 14 cores (10 performance + 4 efficiency)
Memory: 24GB unified memory
GPU: Apple Silicon with MPS support
OS: macOS 26.1

V4 Configuration (.env)

# V4 Structured JSON API Configuration (Outlines)
ENABLE_V4_STRUCTURED=true
ENABLE_V4_WARMUP=true

# V4 Model Configuration
V4_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
V4_MAX_TOKENS=256
V4_TEMPERATURE=0.2

# V4 Performance Optimization (M4 MacBook Pro with MPS)
V4_USE_FP16_FOR_SPEED=true
V4_ENABLE_QUANTIZATION=false

Performance Metrics

Model Loading

Device: mps:0 (Metal Performance Shaders)
Dtype: torch.float16 (FP16 for speed)
Quantization: FP16 (MPS, fast mode)
Load time: ~5 seconds
Warmup time: ~22 seconds
Memory usage: ~2-3GB unified memory

Inference Performance

Expected speed: 2-5 seconds per request
Token generation: ~10-20 tokens/sec
Device utilization: GPU accelerated via MPS

Starting the Server

# Start V4-enabled server
conda run -n summarizer uvicorn app.main:app --host 0.0.0.0 --port 7860

# Server will warmup V4 on startup (takes 20-30s)
# Look for these log messages:
#   ✅ V4 model initialized successfully
#   Model device: mps:0
#   Torch dtype: torch.float16

Testing V4

Via curl

# Test V4 stream-json endpoint (Outlines-constrained)
curl -X POST http://localhost:7860/api/v4/scrape-and-summarize/stream-json \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your article text here...",
    "style": "executive",
    "max_tokens": 256
  }'

Via Python

import requests

url = "http://localhost:7860/api/v4/scrape-and-summarize/stream-json"
payload = {
    "text": "Your article text here...",
    "style": "executive",  # Options: skimmer, executive, eli5
    "max_tokens": 256
}

response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'))

V4 Endpoints

/api/v4/scrape-and-summarize/stream - Raw JSON token streaming
/api/v4/scrape-and-summarize/stream-ndjson - NDJSON patch streaming (best for Android)
/api/v4/scrape-and-summarize/stream-json - Outlines-constrained JSON (most reliable schema)

Structured Output Format

V4 guarantees the following JSON structure:

{
  "title": "6-10 word headline",
  "main_summary": "2-4 sentence summary",
  "key_points": [
    "Key point 1",
    "Key point 2",
    "Key point 3"
  ],
  "category": "1-2 word topic label",
  "sentiment": "positive|negative|neutral",
  "read_time_min": 3
}

Summarization Styles

skimmer - Quick facts and highlights for fast reading
executive - Business-focused summary with key takeaways (recommended)
eli5 - "Explain Like I'm 5" - simple, accessible explanations

Code Changes Made

1. Added MPS Detection (`app/services/structured_summarizer.py`)

# Detect both CUDA and MPS
use_cuda = torch.cuda.is_available()
use_mps = torch.backends.mps.is_available() and torch.backends.mps.is_built()

if use_cuda:
    logger.info("CUDA is available. Using GPU for V4 model.")
elif use_mps:
    logger.info("MPS (Metal Performance Shaders) is available. Using Apple Silicon GPU for V4 model.")
else:
    logger.info("No GPU available. V4 model will run on CPU.")

2. Fixed Model Loading for MPS

# MPS requires explicit device setting, not device_map
if use_mps:
    self.model = AutoModelForCausalLM.from_pretrained(
        settings.v4_model_id,
        torch_dtype=torch.float16,  # Fixed: was `dtype=` (incorrect)
        cache_dir=settings.hf_cache_dir,
        trust_remote_code=True,
    ).to("mps")  # Explicit MPS device

3. Added FP16 Support for MPS

elif (use_cuda or use_mps) and use_fp16_for_speed:
    device_str = "CUDA GPU" if use_cuda else "MPS (Apple Silicon)"
    logger.info(f"Loading V4 model in FP16 for maximum speed on {device_str}...")
    # ... FP16 loading logic

Known Issues

Outlines JSON Generation Reliability

The Outlines library (0.1.1) with Qwen 1.5B sometimes generates malformed JSON with extra characters. This is a known limitation of constrained decoding with smaller models.

Symptoms:

ValidationError: Extra data: line 1 column 278 (char 277)
input_value='{"title":"Apple Announce...":5}#RRR!!##R!R!R##!#!!'

Workarounds:

Use the /stream or /stream-ndjson endpoints instead (more reliable)
Retry failed requests (Outlines generation is non-deterministic)
Consider using a larger model (Qwen 3B) for better JSON reliability
Use lower temperature (already set to 0.2 for stability)

Memory Considerations

Current usage: ~2-3GB unified memory for V4
Total with all services: ~4-5GB (V2 + V3 + V4)
Your 24GB Mac: Plenty of headroom ✅

Performance Comparison

Version	Device	Memory	Inference Time	Use Case
V1	Ollama	~2-4GB	2-5s	Local custom models
V2	CPU/GPU	~500MB	Streaming	Fast free-form summaries
V3	CPU/GPU	~550MB	2-5s	Web scraping + summarization
V4	MPS	~2-3GB	2-5s	Structured JSON output

Next Steps

For Production Use

Test with real articles: Feed V4 actual articles from your Android app
Monitor memory: Use Activity Monitor to track memory usage
Benchmark performance: Measure actual inference times under load
Consider alternatives if Outlines is unreliable:
- Switch to /stream-ndjson endpoint (more reliable, progressive updates)
- Use post-processing to clean JSON output
- Upgrade to a larger model (Qwen 3B or Phi-3-Mini 3.8B)

For Development

Disable V4 warmup when not testing:

ENABLE_V4_WARMUP=false  # Saves 20-30s startup time

Run only V4 (disable V1/V2/V3 to save memory):

ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=false
ENABLE_V3_SCRAPING=false

Experiment with temperature:

V4_TEMPERATURE=0.1  # Even more deterministic (may be too rigid)
V4_TEMPERATURE=0.3  # More creative (may reduce schema compliance)

Troubleshooting

Model not loading on MPS

Check PyTorch MPS support:

conda run -n summarizer python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"

Server startup fails

Check the logs:

conda run -n summarizer uvicorn app.main:app --host 0.0.0.0 --port 7860
# Look for "✅ V4 model initialized successfully"

JSON validation errors

This is expected with Qwen 1.5B + Outlines. Consider:

Using /stream-ndjson endpoint
Implementing retry logic
Using a larger model

Resources

Model: Qwen/Qwen2.5-1.5B-Instruct on HuggingFace
Outlines: Outlines 0.1.1 Documentation
PyTorch MPS: Apple Silicon GPU Acceleration

Success Indicators

✅ Model loads on MPS (mps:0) ✅ FP16 dtype enabled (torch.float16) ✅ Fast loading (~~5 seconds) ✅ Memory efficient (~~2-3GB) ✅ Inference working (generates output) ⚠️ Outlines reliability (known issue with Qwen 1.5B)

Status: V4 is fully operational on your M4 MacBook Pro! 🎉