Spaces:

colin730
/

SummarizerApp

Running

App Files Files Community

SummarizerApp / V4_LOCAL_SETUP.md

ming

Migrate to Ruff for linting/formatting and add comprehensive import tests

29ed661 6 days ago

preview code

raw

history blame contribute delete

7.69 kB

	# V4 Local Setup for M4 MacBook Pro

	## Summary

	V4 is successfully configured and running on your M4 MacBook Pro with MPS (Metal Performance Shaders) acceleration!

	## Hardware Configuration

	- Model: M4 MacBook Pro (Mac16,7)
	- CPU: 14 cores (10 performance + 4 efficiency)
	- Memory: 24GB unified memory
	- GPU: Apple Silicon with MPS support
	- OS: macOS 26.1

	## V4 Configuration (.env)

	```bash
	# V4 Structured JSON API Configuration (Outlines)
	ENABLE_V4_STRUCTURED=true
	ENABLE_V4_WARMUP=true

	# V4 Model Configuration
	V4_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
	V4_MAX_TOKENS=256
	V4_TEMPERATURE=0.2

	# V4 Performance Optimization (M4 MacBook Pro with MPS)
	V4_USE_FP16_FOR_SPEED=true
	V4_ENABLE_QUANTIZATION=false
	```

	## Performance Metrics

	### Model Loading
	- Device: `mps:0` (Metal Performance Shaders)
	- Dtype: `torch.float16` (FP16 for speed)
	- Quantization: FP16 (MPS, fast mode)
	- Load time: ~5 seconds
	- Warmup time: ~22 seconds
	- Memory usage: ~2-3GB unified memory

	### Inference Performance
	- Expected speed: 2-5 seconds per request
	- Token generation: ~10-20 tokens/sec
	- Device utilization: GPU accelerated via MPS

	## Starting the Server

	```bash
	# Start V4-enabled server
	conda run -n summarizer uvicorn app.main:app --host 0.0.0.0 --port 7860

	# Server will warmup V4 on startup (takes 20-30s)
	# Look for these log messages:
	# ✅ V4 model initialized successfully
	# Model device: mps:0
	# Torch dtype: torch.float16
	```

	## Testing V4

	### Via curl

	```bash
	# Test V4 stream-json endpoint (Outlines-constrained)
	curl -X POST http://localhost:7860/api/v4/scrape-and-summarize/stream-json \
	-H "Content-Type: application/json" \
	-d '{
	"text": "Your article text here...",
	"style": "executive",
	"max_tokens": 256
	}'
	```

	### Via Python

	```python
	import requests

	url = "http://localhost:7860/api/v4/scrape-and-summarize/stream-json"
	payload = {
	"text": "Your article text here...",
	"style": "executive", # Options: skimmer, executive, eli5
	"max_tokens": 256
	}

	response = requests.post(url, json=payload, stream=True)

	for line in response.iter_lines():
	if line:
	print(line.decode('utf-8'))
	```

	## V4 Endpoints

	1. `/api/v4/scrape-and-summarize/stream` - Raw JSON token streaming
	2. `/api/v4/scrape-and-summarize/stream-ndjson` - NDJSON patch streaming (best for Android)
	3. `/api/v4/scrape-and-summarize/stream-json` - Outlines-constrained JSON (most reliable schema)

	## Structured Output Format

	V4 guarantees the following JSON structure:

	```json
	{
	"title": "6-10 word headline",
	"main_summary": "2-4 sentence summary",
	"key_points": [
	"Key point 1",
	"Key point 2",
	"Key point 3"
	],
	"category": "1-2 word topic label",
	"sentiment": "positive\|negative\|neutral",
	"read_time_min": 3
	}
	```

	## Summarization Styles

	1. `skimmer` - Quick facts and highlights for fast reading
	2. `executive` - Business-focused summary with key takeaways (recommended)
	3. `eli5` - "Explain Like I'm 5" - simple, accessible explanations

	## Code Changes Made

	### 1. Added MPS Detection (`app/services/structured_summarizer.py`)

	```python
	# Detect both CUDA and MPS
	use_cuda = torch.cuda.is_available()
	use_mps = torch.backends.mps.is_available() and torch.backends.mps.is_built()

	if use_cuda:
	logger.info("CUDA is available. Using GPU for V4 model.")
	elif use_mps:
	logger.info("MPS (Metal Performance Shaders) is available. Using Apple Silicon GPU for V4 model.")
	else:
	logger.info("No GPU available. V4 model will run on CPU.")
	```

	### 2. Fixed Model Loading for MPS

	```python
	# MPS requires explicit device setting, not device_map
	if use_mps:
	self.model = AutoModelForCausalLM.from_pretrained(
	settings.v4_model_id,
	torch_dtype=torch.float16, # Fixed: was `dtype=` (incorrect)
	cache_dir=settings.hf_cache_dir,
	trust_remote_code=True,
	).to("mps") # Explicit MPS device
	```

	### 3. Added FP16 Support for MPS

	```python
	elif (use_cuda or use_mps) and use_fp16_for_speed:
	device_str = "CUDA GPU" if use_cuda else "MPS (Apple Silicon)"
	logger.info(f"Loading V4 model in FP16 for maximum speed on {device_str}...")
	# ... FP16 loading logic
	```

	## Known Issues

	### Outlines JSON Generation Reliability

	The Outlines library (0.1.1) with Qwen 1.5B sometimes generates malformed JSON with extra characters. This is a known limitation of constrained decoding with smaller models.

	Symptoms:
	```
	ValidationError: Extra data: line 1 column 278 (char 277)
	input_value='{"title":"Apple Announce...":5}#RRR!!##R!R!R##!#!!'
	```

	Workarounds:
	1. Use the `/stream` or `/stream-ndjson` endpoints instead (more reliable)
	2. Retry failed requests (Outlines generation is non-deterministic)
	3. Consider using a larger model (Qwen 3B) for better JSON reliability
	4. Use lower temperature (already set to 0.2 for stability)

	### Memory Considerations

	- Current usage: ~2-3GB unified memory for V4
	- Total with all services: ~4-5GB (V2 + V3 + V4)
	- Your 24GB Mac: Plenty of headroom ✅

	## Performance Comparison

	\| Version \| Device \| Memory \| Inference Time \| Use Case \|
	\|---------\|--------\|---------\|----------------\|----------\|
	\| V1 \| Ollama \| ~2-4GB \| 2-5s \| Local custom models \|
	\| V2 \| CPU/GPU \| ~500MB \| Streaming \| Fast free-form summaries \|
	\| V3 \| CPU/GPU \| ~550MB \| 2-5s \| Web scraping + summarization \|
	\| V4 \| MPS \| ~2-3GB \| 2-5s \| Structured JSON output \|

	## Next Steps

	### For Production Use

	1. Test with real articles: Feed V4 actual articles from your Android app
	2. Monitor memory: Use Activity Monitor to track memory usage
	3. Benchmark performance: Measure actual inference times under load
	4. Consider alternatives if Outlines is unreliable:
	- Switch to `/stream-ndjson` endpoint (more reliable, progressive updates)
	- Use post-processing to clean JSON output
	- Upgrade to a larger model (Qwen 3B or Phi-3-Mini 3.8B)

	### For Development

	1. Disable V4 warmup when not testing:
	```bash
	ENABLE_V4_WARMUP=false # Saves 20-30s startup time
	```

	2. Run only V4 (disable V1/V2/V3 to save memory):
	```bash
	ENABLE_V1_WARMUP=false
	ENABLE_V2_WARMUP=false
	ENABLE_V3_SCRAPING=false
	```

	3. Experiment with temperature:
	```bash
	V4_TEMPERATURE=0.1 # Even more deterministic (may be too rigid)
	V4_TEMPERATURE=0.3 # More creative (may reduce schema compliance)
	```

	## Troubleshooting

	### Model not loading on MPS

	Check PyTorch MPS support:
	```bash
	conda run -n summarizer python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"
	```

	### Server startup fails

	Check the logs:
	```bash
	conda run -n summarizer uvicorn app.main:app --host 0.0.0.0 --port 7860
	# Look for "✅ V4 model initialized successfully"
	```

	### JSON validation errors

	This is expected with Qwen 1.5B + Outlines. Consider:
	- Using `/stream-ndjson` endpoint
	- Implementing retry logic
	- Using a larger model

	## Resources

	- Model: [Qwen/Qwen2.5-1.5B-Instruct on HuggingFace](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
	- Outlines: [Outlines 0.1.1 Documentation](https://outlines-dev.github.io/outlines/)
	- PyTorch MPS: [Apple Silicon GPU Acceleration](https://pytorch.org/docs/stable/notes/mps.html)

	## Success Indicators

	✅ Model loads on MPS (`mps:0`)
	✅ FP16 dtype enabled (`torch.float16`)
	✅ Fast loading (~5 seconds)
	✅ Memory efficient (~2-3GB)
	✅ Inference working (generates output)
	⚠️ Outlines reliability (known issue with Qwen 1.5B)

	---

	Status: V4 is fully operational on your M4 MacBook Pro! 🎉