remove old files
- blog-export-headrs.html +0 -192
- blog-export.html +0 -0
- blog-export.md +0 -0
blog-export-headrs.html
DELETED
@@ -1,192 +0,0 @@
-<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
-
-<h2>TL;DR</h2>
-
-<h2>First Steps: Training on one GPU</h2>
-
-<h3>Memory usage in Transformers</h3>
-
-<h4>Memory profiling a training step</h4>
-
-<h4>Weights/grads/optimizer states memory</h4>
-
-<h4>Activations memory</h4>
-
-<h3><strong>Activation recomputation</strong></h3>
-
-<h3>Gradient accumulation</h3>
-
-<h2>Data Parallelism</h2>
-
-<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
-
-<h4><strong>Second optimization:</strong> Bucketing gradients</h4>
-
-<h4><strong>Third optimization:</strong> Interplay with gradient accumulation</h4>
-
-<h3>Revisit global batch size</h3>
-
-<h3>Our journey up to now</h3>
-
-<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
-
-<h4>Memory usage revisited</h4>
-
-<h4>ZeRO-1: Partitioning Optimizer States</h4>
-
-<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
-
-<h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>
-
-<h2>Tensor Parallelism</h2>
-
-<h3>Tensor Parallelism in a Transformer Block</h3>
-
-<h3>Sequence Parallelism</h3>
-
-<h2>Context Parallelism</h2>
-
-<h3>Introducing Context Parallelism</h3>
-
-<h3>Discovering Ring Attention</h3>
-
-<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
-
-<h2></h2>
-
-<h2>Pipeline Parallelism</h2>
-
-<h3>Splitting layers on various nodes - All forward, all backward</h3>
-
-<h3>One-forward-one-backward and Llama 3.1 schemes</h3>
-
-<h3>Interleaving stages</h3>
-
-<h3>Zero Bubble and DualPipe</h3>
-
-<h2>Expert parallelism</h2>
-
-<h2>5D parallelism in a nutshell</h2>
-
-<h2>How to Find the Best Training Configuration</h2>
-
-<h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
-<h4>A primer on GPU</h4>
-
-<h3>How to improve performance with Kernels?</h3>
-
-<h4>Memory Coalescing</h4>
-
-<h4>Tiling</h4>
-
-<h4>Thread Coarsening</h4>
-
-<h4>Minimizing Control Divergence</h4>
-
-<h3>Flash Attention 1-3</h3>
-
-<h3>Fused Kernels</h3>
-
-<h3>Mixed Precision Training</h3>
-
-<h4>FP16 and BF16 training</h4>
-
-<h4>FP8 pretraining</h4>
-
-<h2>Conclusion</h2>
-
-<h3>What you learned</h3>
-
-<h3>What we learned</h3>
-
-<h3>What’s next?</h3>
-
-<h2>References</h2>
-
-<h3>Landmark LLM Scaling Papers</h3>
-
-<h3>Training Frameworks</h3>
-
-<h3>Debugging</h3>
-
-<h3>Distribution Techniques</h3>
-
-<h3>CUDA Kernels</h3>
-
-<h3>Hardware</h3>
-
-<h3>Others</h3>
-
-<h2>Appendix</h2>
-
-<h3>A0: Parallel Programming Crash Course</h3>
-
-<h4>Broadcast</h4>
-
-<h4>Reduce & AllReduce</h4>
-
-<h4><strong>A quick focus on Ring All-Reduce</strong></h4>
-
-<h4>Gather & AllGather</h4>
-
-<h4>Scatter & ReduceScatter</h4>
-
-<h4>Barrier</h4>
-
-<h4>NCCL: NVIDIA Collective Communications Library</h4>
-
-<h3>A1: Profiling</h3>
-
-<h4>Kernels</h4>
-
-<h2>Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries</h2>
-
-<h2>include <torch/extension.h></h2>
-
-<h2>include <cuda.h></h2>
-
-<h2>include <cuda_runtime.h></h2>
-
-<h2>Load and compile the CUDA extension</h2>
-
-<h2>Define input tensors</h2>
-
-<h2>Run the CUDA kernel</h2>
-
-<h3>A2: TP Backward pass</h3>
-
-<h3>A3: ZeRO-R</h3>
-
-<h4>$P_a:$ Partitioned Activation Checkpointing</h4>
-
-<h4><strong>$C_B:$ Constant Size Buffers</strong></h4>
-
-<h4><strong>$M_D$: Memory Defragmentation</strong></h4>
-
-<h4>Communication Analysis of ZeRO-R</h4>
-
-<h3>A5: Memory profile</h3>
-
-<h2>Set up optimizer</h2>
-
-<h3>TP: Practical PyTorch Implementation</h3>
-
-<h2>This is the <code>f</code> function in the paper: https://arxiv.org/abs/1909.08053</h2>
-
-<h2>core logic of Column Parallel linear</h2>
-
-<h4>Gelu code</h4>
-
-<h4>Interconnect</h4>
-
-<h3>How to profile your code</h3>
-
-<h3>Formulas for compute / comms balance</h3>
-
-<h3>Integrating Context Parallelism with TP/SP</h3>
-
-<h3>The nanotron FP8 recipe</h3>
-
-<h2>Overlapping computation and communication</h2>
-
blog-export.html
DELETED
The diff for this file is too large to render.
blog-export.md
DELETED
The diff for this file is too large to render.