Spaces: ucalyptus/sglang-prefill-decoded-aggregation


ucalyptus committed on May 7

Commit 535344b (verified) · 1 parent: ceaf146

Add 3 files

Files changed (3)

README.md +7 -5
index.html +312 -18
prompts.txt +0 -0

README.md CHANGED

@@ -1,10 +1,12 @@
 ---
-title: Sglang Prefill Decoded Aggregation
-emoji: 📈
-colorFrom: gray
-colorTo: purple
+title: sglang-prefill-decoded-aggregation
+emoji: 🐳
+colorFrom: blue
+colorTo: gray
 sdk: static
 pinned: false
+tags:
+  - deepsite
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

index.html CHANGED

@@ -1,19 +1,313 @@
-<!doctype html>
-<html>
-	<head>
-		<meta charset="utf-8" />
-		<meta name="viewport" content="width=device-width" />
-		<title>My static Space</title>
-		<link rel="stylesheet" href="style.css" />
-	</head>
-	<body>
-		<div class="card">
-			<h1>Welcome to your static Space!</h1>
-			<p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
-			<p>
-				Also don't forget to check the
-				<a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
-			</p>
-		</div>
-	</body>
 </html>

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>DeepSeek Deployment with SGLang: Visual Explanation</title>
+    <script src="https://cdn.tailwindcss.com"></script>
+    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
+    <style>
+        body {
+            font-family: 'Inter', sans-serif;
+            background-color: #f3f4f6; /* Light gray background */
+        }
+        .section-title {
+            font-size: 1.75rem; /* Larger section titles */
+            font-weight: 700;
+            color: #1e3a8a; /* Dark blue */
+            border-bottom: 2px solid #3b82f6; /* Medium blue border */
+            padding-bottom: 0.5rem;
+            margin-bottom: 1.5rem;
+        }
+        .subsection-title {
+            font-size: 1.25rem;
+            font-weight: 600;
+            color: #1d4ed8; /* Slightly lighter blue */
+            margin-top: 1rem;
+            margin-bottom: 0.75rem;
+        }
+        .card {
+            background-color: #ffffff;
+            border-radius: 0.75rem; /* More rounded corners */
+            box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
+            padding: 1.5rem;
+            margin-bottom: 1.5rem;
+            transition: transform 0.2s ease-in-out;
+        }
+        .card:hover {
+            transform: translateY(-5px);
+        }
+        .highlight {
+            background-color: #eff6ff; /* Light blue background for highlights */
+            color: #1e40af; /* Darker blue text for highlights */
+            padding: 0.25rem 0.75rem;
+            border-radius: 0.375rem;
+            font-weight: 600;
+        }
+        .metric {
+            font-size: 1.1rem;
+            font-weight: 700;
+            color: #16a34a; /* Green for positive metrics */
+        }
+        .comparison-metric {
+            font-size: 1rem;
+            font-weight: 600;
+            color: #52525b; /* Neutral gray for comparison details */
+        }
+        ul {
+            list-style-type: none; /* Remove default bullets */
+            padding-left: 0;
+        }
+        li {
+            position: relative;
+            padding-left: 1.75rem; /* Space for custom bullet */
+            margin-bottom: 0.75rem;
+            line-height: 1.6;
+        }
+        li::before {
+            content: '✓'; /* Custom checkmark bullet */
+            position: absolute;
+            left: 0;
+            color: #2563eb; /* Blue checkmark */
+            font-weight: bold;
+            font-size: 1.25rem;
+        }
+        .arrow {
+            font-size: 1.5rem;
+            color: #3b82f6;
+            margin: 0 0.5rem;
+        }
+        .gpu-icon svg {
+            width: 24px;
+            height: 24px;
+            fill: currentColor;
+            margin-right: 8px;
+        }
+        .flex-container {
+            display: flex;
+            align-items: center;
+            justify-content: space-around;
+            flex-wrap: wrap;
+        }
+        .flow-item {
+            text-align: center;
+            margin: 1rem;
+            padding: 1rem;
+            background-color: #e0e7ff;
+            border-radius: 0.5rem;
+            min-width: 150px;
+        }
+    </style>
+</head>
+<body class="p-4 md:p-8">
+    <div class="max-w-5xl mx-auto">
+        <header class="mb-12 text-center">
+            <h1 class="text-4xl font-bold text-gray-800 mb-2">Deploying DeepSeek with SGLang</h1>
+            <p class="text-xl text-gray-600">Achieving High Performance with PD Disaggregation &amp; Large-scale Expert Parallelism</p>
+            <p class="text-sm text-gray-500 mt-1">Based on work published by the SGLang Team, May 5, 2025</p>
+        </header>
+        <section class="mb-10">
+            <h2 class="section-title">Key Achievements with SGLang</h2>
+            <div class="grid md:grid-cols-2 gap-6">
+                <div class="card">
+                    <h3 class="subsection-title">🚀 Near Official Performance</h3>
+                    <p class="text-gray-700">SGLang's implementation on 12 nodes (96 H100 GPUs) nearly matches DeepSeek's official inference throughput.</p>
+                    <p class="mt-2">Input: <span class="metric">52.3k tokens/s per node</span></p>
+                    <p>Output: <span class="metric">22.3k tokens/s per node</span> (for 2k token inputs)</p>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">💰 Cost Efficiency</h3>
+                    <p class="text-gray-700">This throughput translates to <span class="metric">$0.20 / 1M output tokens</span>, approximately <span class="highlight">one-fifth the cost</span> of the official DeepSeek Chat API.</p>
+                </div>
+                <div class="card md:col-span-2">
+                    <h3 class="subsection-title">⚡ Throughput Boost</h3>
+                    <p class="text-gray-700">Optimized strategy improves output throughput by up to <span class="metric">5x</span> compared to vanilla tensor parallelism on the same resources.</p>
+                </div>
+            </div>
+            <div class="card mt-6">
+                <h3 class="subsection-title">Core SGLang Enhancements</h3>
+                <ul>
+                    <li>Support for Prefill-Decode (PD) Disaggregation.</li>
+                    <li>Large-scale Expert Parallelism (EP), including DeepEP, DeepGEMM, and EPLB.</li>
+                    <li>Open-source implementation for community access and development.</li>
+                </ul>
+            </div>
+        </section>
+        <section class="mb-10">
+            <h2 class="section-title">Parallelism Design Strategies</h2>
+            <div class="grid md:grid-cols-2 gap-6">
+                <div class="card">
+                    <h3 class="subsection-title">Attention Layers (MLA)</h3>
+                    <p class="text-gray-700">Utilizes <span class="highlight">DP Attention</span> (Data Parallelism):</p>
+                    <ul>
+                        <li>Eliminates KV cache duplication across devices.</li>
+                        <li>Significantly reduces memory overhead.</li>
+                        <li>Supports hybrid data and tensor parallelism for flexibility.</li>
+                    </ul>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">Dense FFNs</h3>
+                    <p class="text-gray-700">Adopts <span class="highlight">Data Parallelism (DP)</span> over Tensor Parallelism (TP):</p>
+                    <ul>
+                        <li><span class="font-semibold">Enhanced Scalability:</span> Avoids fragmentation and ensures balanced workloads.</li>
+                        <li><span class="font-semibold">Optimized Memory Efficiency:</span> Per-device memory is often lowest at a low TP degree, which makes DP the better fit.</li>
+                        <li><span class="font-semibold">Minimized Communication:</span> Reduces all-reduce operations by 50% compared to pure TP.</li>
+                    </ul>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">Sparse FFNs (Mixture of Experts)</h3>
+                    <p class="text-gray-700">Implements <span class="highlight">Expert Parallelism (EP)</span>:</p>
+                    <ul>
+                        <li>Distributes expert weights across multiple devices.</li>
+                        <li>Scales memory capacity effectively.</li>
+                        <li>Addresses challenges like irregular communication and workload imbalance using DeepEP.</li>
+                    </ul>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">LM Head</h3>
+                    <p class="text-gray-700">Employs <span class="highlight">Data Parallelism (DP)</span>:</p>
+                    <ul>
+                        <li>Mirrors the strategy for dense FFNs.</li>
+                        <li>Reduces memory overhead for large vocabulary computations.</li>
+                        <li>Simplifies communication across devices.</li>
+                    </ul>
+                </div>
+            </div>
+        </section>
+        <section class="mb-10">
+            <h2 class="section-title">Prefill-Decode (PD) Disaggregation</h2>
+            <div class="card">
+                <p class="text-gray-700 mb-4">LLM inference has two phases: computation-intensive <span class="font-semibold">Prefill</span> and memory-intensive <span class="font-semibold">Decode</span>. Scheduling both phases in one unified engine is inefficient.</p>
+                <h3 class="subsection-title">Problems with Unified Scheduling:</h3>
+                <ul>
+                    <li>Prefill batches interrupt ongoing decode batches, delaying token generation.</li>
+                    <li>DP Attention workers become imbalanced, which increases decode latency.</li>
+                    <li>Unified scheduling is incompatible with DeepEP's two dispatch modes.</li>
+                </ul>
+                <h3 class="subsection-title mt-4">SGLang's PD Disaggregation Solution:</h3>
+                <div class="flex-container my-4 p-4 bg-blue-50 rounded-lg">
+                    <div class="flow-item">Input Request</div>
+                    <div class="arrow">➔</div>
+                    <div class="flow-item">Prefill Server<br/>(Computes KV Cache)</div>
+                    <div class="arrow">➔</div>
+                    <div class="flow-item">Data Transfer (RDMA)</div>
+                    <div class="arrow">➔</div>
+                    <div class="flow-item">Decode Server<br/>(Iterative Token Gen)</div>
+                </div>
+                <p class="text-gray-700">This separation allows tailored optimizations for each phase, maximizing GPU utilization.</p>
+                <h4 class="font-semibold text-gray-800 mt-3 mb-1">Key Implementation Details:</h4>
+                <ul>
+                    <li><span class="highlight">Non-blocking Transfer:</span> Background data send/receive.</li>
+                    <li><span class="highlight">RDMA-Based Transfer:</span> Efficient for non-contiguous memory.</li>
+                    <li><span class="highlight">Flexible API Integration:</span> Supports Mooncake and NIXL.</li>
+                </ul>
+            </div>
+        </section>
+        <section class="mb-10">
+            <h2 class="section-title">Large-scale Expert Parallelism Optimizations</h2>
+            <div class="space-y-6">
+                <div class="card">
+                    <h3 class="subsection-title">Expert Parallelism with DeepEP</h3>
+                    <p class="text-gray-700">DeepEP streamlines EP by efficiently routing tokens to experts across GPUs.</p>
+                    <p class="text-gray-700 mt-2"><span class="highlight">Normal Dispatch:</span> For prefill (long inputs, max throughput). Incompatible with CUDA Graph.</p>
+                    <p class="text-gray-700 mt-1"><span class="highlight">Low-Latency Dispatch:</span> For decode (output tokens, min delay). Supports CUDA Graph.</p>
+                    <p class="text-gray-700 mt-2">SGLang's <span class="font-semibold">PD Disaggregation</span> enables using both modes effectively with DP Attention.</p>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">DeepGEMM Integration</h3>
+                    <p class="text-gray-700">Optimizes MoE matrix multiplications (Grouped GEMMs).</p>
+                    <p class="text-gray-700 mt-2"><span class="highlight">Contiguous Layout Kernel:</span> For prefill (dynamic shapes). Used with DeepEP's Normal Dispatch (requires permutation).</p>
+                    <p class="text-gray-700 mt-1"><span class="highlight">Masked Layout Kernel:</span> For decode (fixed shapes, CUDA Graph compatible). Used with DeepEP's Low-Latency Dispatch.</p>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">Two-batch Overlap (TBO)</h3>
+                    <p class="text-gray-700">Splits a batch into two micro-batches to <span class="highlight">overlap computation and communication</span>.</p>
+                    <ul>
+                        <li>Lowers peak memory usage.</li>
+                        <li>Addresses limited communication bandwidth in multi-node setups.</li>
+                        <li>SGLang uses an abstraction layer (operations and yield points) for a clean implementation.</li>
+                        <li>Optimized launch order in prefill to avoid CPU-blocking by DeepEP.</li>
+                    </ul>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">Expert Parallelism Load Balancer (EPLB)</h3>
+                    <p class="text-gray-700">Addresses uneven workload distribution in MoE models.</p>
+                    <ul>
+                        <li>Computes optimal expert arrangement to minimize imbalance.</li>
+                        <li>Uses redundant experts (e.g., 288 instead of 256) for flexible placement.</li>
+                        <li>Enables diverse parallelism sizes (e.g., 12 or 72).</li>
+                        <li>SGLang implements efficient, non-disruptive rebalancing.</li>
+                    </ul>
+                    <p class="mt-2 text-gray-600">Effectiveness depends on matching input distribution to serving workload (achieved via larger batches or periodic rebalancing).</p>
+                </div>
+            </div>
+        </section>
+        <section class="mb-10">
+            <h2 class="section-title">Evaluation Highlights</h2>
+            <div class="grid md:grid-cols-2 gap-6">
+                <div class="card">
+                    <h3 class="subsection-title">Prefill Phase Performance</h3>
+                    <p class="text-gray-700">On 4 nodes (32 H100s, EP32):</p>
+                    <p>Up to <span class="metric">3.3x improvement</span> over TP16 baseline.</p>
+                    <p>Throughput within <span class="comparison-metric">5.6% of DeepSeek's official profile</span> (assuming perfect balance).</p>
+                    <p class="mt-1">Example: <span class="highlight">50,302 tokens/s per node</span> for 4K prompts.</p>
+                </div>
+                <div class="card">
+                    <h3 class="subsection-title">Decode Phase Performance</h3>
+                    <p class="text-gray-700">On 9 nodes (72 H100s, EP72):</p>
+                    <p><span class="metric">5.2x speedup</span> over TP16 baseline.</p>
+                    <p>With simulated MTP, throughput <span class="comparison-metric">6.6% below DeepSeek's profile</span>.</p>
+                    <p class="mt-1">Example: <span class="highlight">22,282 tokens/s per node</span> for 2K inputs.</p>
+                </div>
+            </div>
+            <div class="card mt-6">
+                <h3 class="subsection-title">Ablation Study: Two-batch Overlap (TBO)</h3>
+                <p class="text-gray-700"><span class="font-semibold">Prefill:</span></p>
+                <ul>
+                    <li>Supports larger batch sizes (e.g., 16k tokens/device vs 8k OOM without TBO).</li>
+                    <li><span class="metric">27-35% throughput increase</span> by overlapping computation and communication.</li>
+                </ul>
+                <p class="text-gray-700 mt-3"><span class="font-semibold">Decode:</span></p>
+                <ul>
+                    <li>Speedup contingent on batch size (e.g., <span class="metric">25.5% at 256 tokens/device</span>).</li>
+                    <li>Most substantial speedup (<span class="metric">35%</span>) in simulated MTP with prolonged attention.</li>
+                </ul>
+            </div>
+            <div class="card mt-6">
+                <h3 class="subsection-title">Ablation Study: EPLB</h3>
+                <p class="text-gray-700">Delivers significant speedup by mitigating workload imbalance:</p>
+                <ul>
+                    <li>Prefill: <span class="metric">1.49x speedup</span>.</li>
+                    <li>Decode: <span class="metric">2.54x speedup</span>.</li>
+                </ul>
+                <p class="text-gray-700 mt-2">Strong correlation between <span class="highlight">workload balancedness and overall throughput</span>.</p>
+                <p class="text-gray-700 mt-2">Expert usage distributions differ between prefill and decode, which supports using PD disaggregation to choose expert placement separately for each phase.</p>
+            </div>
+        </section>
+        <section class="mb-6">
+            <h2 class="section-title">Conclusion</h2>
+            <div class="card">
+                <p class="text-gray-700 leading-relaxed">
+                    By integrating Prefill-Decode Disaggregation with Expert Parallelism optimizations (DeepEP, DeepGEMM, TBO, EPLB), SGLang deploys the large DeepSeek model on H100 GPUs with performance nearly matching official reports while significantly reducing costs.
+                    The open-source nature of these components empowers the community to build upon these optimizations for efficient large-scale LLM serving.
+                </p>
+            </div>
+        </section>
+        <footer class="text-center mt-12 py-6 border-t border-gray-300">
+            <p class="text-gray-600">Visual summary generated based on "Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs" by The SGLang Team.</p>
+        </footer>
+    </div>
+<p style="border-radius: 8px; text-align: center; font-size: 12px; color: #fff; margin-top: 16px;position: fixed; left: 8px; bottom: 8px; z-index: 10; background: rgba(0, 0, 0, 0.8); padding: 4px 8px;">Made with <img src="https://enzostvs-deepsite.hf.space/logo.svg" alt="DeepSite Logo" style="width: 16px; height: 16px; vertical-align: middle;display:inline-block;margin-right:3px;filter:brightness(0) invert(1);"><a href="https://enzostvs-deepsite.hf.space" style="color: #fff;text-decoration: underline;" target="_blank" >DeepSite</a> - 🧬 <a href="https://enzostvs-deepsite.hf.space?remix=ucalyptus/sglang-prefill-decoded-aggregation" style="color: #fff;text-decoration: underline;" target="_blank" >Remix</a></p></body>
 </html>
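
Note on the cost figure in index.html: the page states 22.3k output tokens/s per node and $0.20 per 1M output tokens but does not show how one leads to the other. The sketch below works through one plausible derivation. The hourly GPU price is an assumption chosen for illustration and does not appear in the page; the 8 GPUs per node follows from the stated 96 H100 GPUs on 12 nodes.

```python
# Worked arithmetic for the "$0.20 / 1M output tokens" claim.
# ASSUMPTION: ASSUMED_USD_PER_GPU_HOUR is hypothetical; the page gives no GPU price.

GPUS_PER_NODE = 96 // 12                  # 96 H100 GPUs on 12 nodes (from the page)
OUTPUT_TOKENS_PER_SEC_PER_NODE = 22_300   # 22.3k tokens/s per node (from the page)
ASSUMED_USD_PER_GPU_HOUR = 2.00           # hypothetical rental price


def cost_per_million_output_tokens(tokens_per_sec_per_node: float,
                                   gpus_per_node: int,
                                   usd_per_gpu_hour: float) -> float:
    """Return the USD cost of generating one million output tokens on one node."""
    node_cost_per_hour = gpus_per_node * usd_per_gpu_hour
    tokens_per_hour = tokens_per_sec_per_node * 3600
    return node_cost_per_hour / tokens_per_hour * 1_000_000


cost = cost_per_million_output_tokens(
    OUTPUT_TOKENS_PER_SEC_PER_NODE, GPUS_PER_NODE, ASSUMED_USD_PER_GPU_HOUR
)
print(f"${cost:.2f} per 1M output tokens")
```

Under that assumed price the result rounds to the $0.20 shown in the page; a different hourly rate scales the figure linearly.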

prompts.txt ADDED

File without changes
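
Note on the EPLB card in index.html: the page says redundant experts (for example 288 slots for 256 experts) allow a more balanced placement but does not show how. The sketch below illustrates the general idea with a simple greedy heuristic. It is not the EPLB algorithm; the function names, the toy loads, and the 8-expert, 12-slot, 4-GPU sizes are all hypothetical.

```python
import heapq

# Illustrative greedy heuristic only; NOT the actual EPLB algorithm.


def replicate_experts(loads, num_slots):
    """Give every expert one replica, then assign each spare slot to the
    expert whose per-replica load is currently highest."""
    replicas = [1] * len(loads)
    heap = [(-load, e) for e, load in enumerate(loads)]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(num_slots - len(loads)):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-loads[e] / replicas[e], e))
    return replicas


def place_replicas(loads, replicas, num_gpus):
    """Place replicas, heaviest first, on the least-loaded GPU with a free slot,
    preferring GPUs that do not already host the same expert."""
    total_slots = sum(replicas)
    assert total_slots % num_gpus == 0, "slots must divide evenly across GPUs"
    slots_per_gpu = total_slots // num_gpus
    items = sorted(
        ((loads[e] / replicas[e], e)
         for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    gpu_load = [0.0] * num_gpus
    gpu_experts = [[] for _ in range(num_gpus)]
    for load, e in items:
        free = [g for g in range(num_gpus) if len(gpu_experts[g]) < slots_per_gpu]
        fresh = [g for g in free if e not in gpu_experts[g]]
        g = min(fresh or free, key=lambda i: gpu_load[i])
        gpu_load[g] += load
        gpu_experts[g].append(e)
    return gpu_experts, gpu_load


# Toy workload: expert 0 is hot, expert 1 is warm, the rest are cold.
loads = [60, 30, 10, 10, 10, 10, 10, 10]
replicas = replicate_experts(loads, num_slots=12)
gpu_experts, gpu_load = place_replicas(loads, replicas, num_gpus=4)
print("replicas per expert:", replicas)
print("load per GPU:", gpu_load)
```

Without redundancy the GPU hosting expert 0 would carry at least 60 units of load; with four spare slots the hot experts are split across GPUs and the heaviest GPU carries less than that.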