SummarizerApp / V4_LOCAL_SETUP.md
ming
Migrate to Ruff for linting/formatting and add comprehensive import tests
29ed661

V4 Local Setup for M4 MacBook Pro

Summary

V4 is successfully configured and running on your M4 MacBook Pro with MPS (Metal Performance Shaders) acceleration!

Hardware Configuration

  • Model: M4 MacBook Pro (Mac16,7)
  • CPU: 14 cores (10 performance + 4 efficiency)
  • Memory: 24GB unified memory
  • GPU: Apple Silicon with MPS support
  • OS: macOS 26.1

V4 Configuration (.env)

# V4 Structured JSON API Configuration (Outlines)
ENABLE_V4_STRUCTURED=true
ENABLE_V4_WARMUP=true

# V4 Model Configuration
V4_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
V4_MAX_TOKENS=256
V4_TEMPERATURE=0.2

# V4 Performance Optimization (M4 MacBook Pro with MPS)
V4_USE_FP16_FOR_SPEED=true
V4_ENABLE_QUANTIZATION=false

Performance Metrics

Model Loading

  • Device: mps:0 (Metal Performance Shaders)
  • Dtype: torch.float16 (FP16 for speed)
  • Quantization: FP16 (MPS, fast mode)
  • Load time: ~5 seconds
  • Warmup time: ~22 seconds
  • Memory usage: ~2-3GB unified memory

Inference Performance

  • Expected speed: 2-5 seconds per request
  • Token generation: ~10-20 tokens/sec
  • Device utilization: GPU accelerated via MPS

Starting the Server

# Start V4-enabled server
conda run -n summarizer uvicorn app.main:app --host 0.0.0.0 --port 7860

# Server will warmup V4 on startup (takes 20-30s)
# Look for these log messages:
#   βœ… V4 model initialized successfully
#   Model device: mps:0
#   Torch dtype: torch.float16

Testing V4

Via curl

# Test V4 stream-json endpoint (Outlines-constrained)
curl -X POST http://localhost:7860/api/v4/scrape-and-summarize/stream-json \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your article text here...",
    "style": "executive",
    "max_tokens": 256
  }'

Via Python

import requests

url = "http://localhost:7860/api/v4/scrape-and-summarize/stream-json"
payload = {
    "text": "Your article text here...",
    "style": "executive",  # Options: skimmer, executive, eli5
    "max_tokens": 256
}

response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'))

V4 Endpoints

  1. /api/v4/scrape-and-summarize/stream - Raw JSON token streaming
  2. /api/v4/scrape-and-summarize/stream-ndjson - NDJSON patch streaming (best for Android)
  3. /api/v4/scrape-and-summarize/stream-json - Outlines-constrained JSON (most reliable schema)

Structured Output Format

V4 guarantees the following JSON structure:

{
  "title": "6-10 word headline",
  "main_summary": "2-4 sentence summary",
  "key_points": [
    "Key point 1",
    "Key point 2",
    "Key point 3"
  ],
  "category": "1-2 word topic label",
  "sentiment": "positive|negative|neutral",
  "read_time_min": 3
}

Summarization Styles

  1. skimmer - Quick facts and highlights for fast reading
  2. executive - Business-focused summary with key takeaways (recommended)
  3. eli5 - "Explain Like I'm 5" - simple, accessible explanations

Code Changes Made

1. Added MPS Detection (app/services/structured_summarizer.py)

# Detect both CUDA and MPS
use_cuda = torch.cuda.is_available()
use_mps = torch.backends.mps.is_available() and torch.backends.mps.is_built()

if use_cuda:
    logger.info("CUDA is available. Using GPU for V4 model.")
elif use_mps:
    logger.info("MPS (Metal Performance Shaders) is available. Using Apple Silicon GPU for V4 model.")
else:
    logger.info("No GPU available. V4 model will run on CPU.")

2. Fixed Model Loading for MPS

# MPS requires explicit device setting, not device_map
if use_mps:
    self.model = AutoModelForCausalLM.from_pretrained(
        settings.v4_model_id,
        torch_dtype=torch.float16,  # Fixed: was `dtype=` (incorrect)
        cache_dir=settings.hf_cache_dir,
        trust_remote_code=True,
    ).to("mps")  # Explicit MPS device

3. Added FP16 Support for MPS

elif (use_cuda or use_mps) and use_fp16_for_speed:
    device_str = "CUDA GPU" if use_cuda else "MPS (Apple Silicon)"
    logger.info(f"Loading V4 model in FP16 for maximum speed on {device_str}...")
    # ... FP16 loading logic

Known Issues

Outlines JSON Generation Reliability

The Outlines library (0.1.1) with Qwen 1.5B sometimes generates malformed JSON with extra characters. This is a known limitation of constrained decoding with smaller models.

Symptoms:

ValidationError: Extra data: line 1 column 278 (char 277)
input_value='{"title":"Apple Announce...":5}#RRR!!##R!R!R##!#!!'

Workarounds:

  1. Use the /stream or /stream-ndjson endpoints instead (more reliable)
  2. Retry failed requests (Outlines generation is non-deterministic)
  3. Consider using a larger model (Qwen 3B) for better JSON reliability
  4. Use lower temperature (already set to 0.2 for stability)

Memory Considerations

  • Current usage: ~2-3GB unified memory for V4
  • Total with all services: ~4-5GB (V2 + V3 + V4)
  • Your 24GB Mac: Plenty of headroom βœ…

Performance Comparison

Version Device Memory Inference Time Use Case
V1 Ollama ~2-4GB 2-5s Local custom models
V2 CPU/GPU ~500MB Streaming Fast free-form summaries
V3 CPU/GPU ~550MB 2-5s Web scraping + summarization
V4 MPS ~2-3GB 2-5s Structured JSON output

Next Steps

For Production Use

  1. Test with real articles: Feed V4 actual articles from your Android app
  2. Monitor memory: Use Activity Monitor to track memory usage
  3. Benchmark performance: Measure actual inference times under load
  4. Consider alternatives if Outlines is unreliable:
    • Switch to /stream-ndjson endpoint (more reliable, progressive updates)
    • Use post-processing to clean JSON output
    • Upgrade to a larger model (Qwen 3B or Phi-3-Mini 3.8B)

For Development

  1. Disable V4 warmup when not testing:

    ENABLE_V4_WARMUP=false  # Saves 20-30s startup time
    
  2. Run only V4 (disable V1/V2/V3 to save memory):

    ENABLE_V1_WARMUP=false
    ENABLE_V2_WARMUP=false
    ENABLE_V3_SCRAPING=false
    
  3. Experiment with temperature:

    V4_TEMPERATURE=0.1  # Even more deterministic (may be too rigid)
    V4_TEMPERATURE=0.3  # More creative (may reduce schema compliance)
    

Troubleshooting

Model not loading on MPS

Check PyTorch MPS support:

conda run -n summarizer python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"

Server startup fails

Check the logs:

conda run -n summarizer uvicorn app.main:app --host 0.0.0.0 --port 7860
# Look for "βœ… V4 model initialized successfully"

JSON validation errors

This is expected with Qwen 1.5B + Outlines. Consider:

  • Using /stream-ndjson endpoint
  • Implementing retry logic
  • Using a larger model

Resources

Success Indicators

βœ… Model loads on MPS (mps:0) βœ… FP16 dtype enabled (torch.float16) βœ… Fast loading (5 seconds) βœ… Memory efficient (2-3GB) βœ… Inference working (generates output) ⚠️ Outlines reliability (known issue with Qwen 1.5B)


Status: V4 is fully operational on your M4 MacBook Pro! πŸŽ‰