### Current observations

- **L40S 48GB** → faster than realtime. Use `pace:"realtime"` to avoid client over‑buffering.
- **L4 24GB** → slightly **below** realtime even with pre‑roll buffering, TF32/Autotune, smaller chunks (`max_decode_frames`), and the **base** checkpoint.
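The two knobs mentioned above can be set in the session configuration. A minimal sketch, using only the field names and values stated here (`pace` and `max_decode_frames`); the surrounding message envelope is an assumption and may differ in the actual build:

```json
{
  "pace": "realtime",
  "max_decode_frames": 50
}
```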
### Practical guidance

- For consistent realtime, target **~40GB VRAM per active stream** (e.g., **A100 40GB**, or MIG slices ≈ **35–40GB** on newer GPUs).
- Keep client‑side **overlap‑add** (25–40 ms) for seamless chunk joins.
- Prefer **`pace:"realtime"`** once playback begins; use **ASAP** only to build a short pre‑roll if needed.
- Optional knob: **`max_decode_frames`** (default **50** ≈ 2.0 s). Reducing to **36–45** can lower per‑chunk latency/VRAM, but doesn't increase frames/sec throughput.
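The client‑side overlap‑add above can be sketched as a linear crossfade between the tail of the previous chunk and the head of the next one. A minimal NumPy version, assuming mono float audio at 48 kHz (function and parameter names are illustrative, not part of the build's API):

```python
import numpy as np

def overlap_add(prev_tail: np.ndarray, chunk: np.ndarray, overlap: int) -> np.ndarray:
    """Crossfade the start of `chunk` with the saved tail of the previous chunk.

    `overlap` is the crossfade length in samples; 30 ms at 48 kHz is 1440.
    Returns the playable samples. The caller keeps the last `overlap` samples
    of `chunk` as `prev_tail` for the next call.
    """
    fade_in = np.linspace(0.0, 1.0, overlap)  # equal-gain linear ramps
    fade_out = 1.0 - fade_in
    head = prev_tail * fade_out + chunk[:overlap] * fade_in
    return np.concatenate([head, chunk[overlap:-overlap]])

# Example: 30 ms overlap at 48 kHz
sr = 48_000
overlap = int(0.030 * sr)  # 1440 samples
```

Because the ramps sum to 1, a steady signal passes through the join unchanged, which is what makes the chunk boundary inaudible.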
### Concurrency

This research build is designed for **one active jam per GPU**. Concurrency would require GPU partitioning (MIG) or horizontal scaling with a session scheduler.
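A session scheduler for the horizontal‑scaling case can be very small: track which GPUs (or MIG slices) are free and bind at most one jam to each. A hypothetical sketch, not part of this build (names and the two‑GPU pool are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GpuPool:
    """One-active-jam-per-GPU scheduler: each GPU holds at most one session."""
    free: set = field(default_factory=lambda: {0, 1})  # available GPU ids
    active: dict = field(default_factory=dict)          # session_id -> gpu_id

    def acquire(self, session_id: str) -> Optional[int]:
        if not self.free:
            return None  # no GPU free: queue the session or scale out
        gpu = self.free.pop()
        self.active[session_id] = gpu
        return gpu

    def release(self, session_id: str) -> None:
        self.free.add(self.active.pop(session_id))
```

A rejected `acquire` (returning `None`) is the signal to either queue the jam or route it to another host.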