remove old files
- blog-export-headrs.html +0 -192
- blog-export.html +0 -0
- blog-export.md +0 -0
blog-export-headrs.html
DELETED
@@ -1,192 +0,0 @@
-<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
-
-<h2>TL;DR</h2>
-
-<h2>First Steps: Training on one GPU</h2>
-
-<h3>Memory usage in Transformers</h3>
-
-<h4>Memory profiling a training step</h4>
-
-<h4>Weights/grads/optimizer states memory</h4>
-
-<h4>Activations memory</h4>
-
-<h3><strong>Activation recomputation</strong></h3>
-
-<h3>Gradient accumulation</h3>
-
-<h2>Data Parallelism</h2>
-
-<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
-
-<h4><strong>Second optimization:</strong> Bucketing gradients</h4>
-
-<h4><strong>Third optimization:</strong> Interplay with gradient accumulation</h4>
-
-<h3>Revisit global batch size</h3>
-
-<h3>Our journey up to now</h3>
-
-<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
-
-<h4>Memory usage revisited</h4>
-
-<h4>ZeRO-1: Partitioning Optimizer States</h4>
-
-<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
-
-<h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>
-
-<h2>Tensor Parallelism</h2>
-
-<h3>Tensor Parallelism in a Transformer Block</h3>
-
-<h3>Sequence Parallelism</h3>
-
-<h2>Context Parallelism</h2>
-
-<h3>Introducing Context Parallelism</h3>
-
-<h3>Discovering Ring Attention</h3>
-
-<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
-
-<h2></h2>
-
-<h2>Pipeline Parallelism</h2>
-
-<h3>Splitting layers on various nodes - All forward, all backward</h3>
-
-<h3>One-forward-one-backward and Llama 3.1 schemes</h3>
-
-<h3>Interleaving stages</h3>
-
-<h3>Zero Bubble and DualPipe</h3>
-
-<h2>Expert parallelism</h2>
-
-<h2>5D parallelism in a nutshell</h2>
-
-<h2>How to Find the Best Training Configuration</h2>
-
-<h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
-<h4>A primer on GPU</h4>
-
-<h3>How to improve performance with Kernels?</h3>
-
-<h4>Memory Coalescing</h4>
-
-<h4>Tiling</h4>
-
-<h4>Thread Coarsening</h4>
-
-<h4>Minimizing Control Divergence</h4>
-
-<h3>Flash Attention 1-3</h3>
-
-<h3>Fused Kernels</h3>
-
-<h3>Mixed Precision Training</h3>
-
-<h4>FP16 and BF16 training</h4>
-
-<h4>FP8 pretraining</h4>
-
-<h2>Conclusion</h2>
-
-<h3>What you learned</h3>
-
-<h3>What we learned</h3>
-
-<h3>What’s next?</h3>
-
-<h2>References</h2>
-
-<h3>Landmark LLM Scaling Papers</h3>
-
-<h3>Training Frameworks</h3>
-
-<h3>Debugging</h3>
-
-<h3>Distribution Techniques</h3>
-
-<h3>CUDA Kernels</h3>
-
-<h3>Hardware</h3>
-
-<h3>Others</h3>
-
-<h2>Appendix</h2>
-
-<h3>A0: Parallel Programming Crash Course</h3>
-
-<h4>Broadcast</h4>
-
-<h4>Reduce & AllReduce</h4>
-
-<h4><strong>A quick focus on Ring All-Reduce</strong></h4>
-
-<h4>Gather & AllGather</h4>
-
-<h4>Scatter & ReduceScatter</h4>
-
-<h4>Barrier</h4>
-
-<h4>NCCL: NVIDIA Collective Communications Library</h4>
-
-<h3>A1: Profiling</h3>
-
-<h4>Kernels</h4>
-
-<h2>Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries</h2>
-
-<h2>include <torch/extension.h></h2>
-
-<h2>include <cuda.h></h2>
-
-<h2>include <cuda_runtime.h></h2>
-
-<h2>Load and compile the CUDA extension</h2>
-
-<h2>Define input tensors</h2>
-
-<h2>Run the CUDA kernel</h2>
-
-<h3>A2: TP Backward pass</h3>
-
-<h3>A3: ZeRO-R</h3>
-
-<h4>$P_a:$ Partitioned Activation Checkpointing</h4>
-
-<h4><strong>$C_B:$ Constant Size Buffers</strong></h4>
-
-<h4><strong>$M_D$: Memory Defragmentation</strong></h4>
-
-<h4>Communication Analysis of ZeRO-R</h4>
-
-<h3>A5: Memory profile</h3>
-
-<h2>Set up optimizer</h2>
-
-<h3>TP: Practical PyTorch Implementation</h3>
-
-<h2>This is the <code>f</code> function in the paper: https://arxiv.org/abs/1909.08053</h2>
-
-<h2>core logic of Column Parallel linear</h2>
-
-<h4>Gelu code</h4>
-
-<h4>Interconnect</h4>
-
-<h3>How to profile your code</h3>
-
-<h3>Formulas for compute / comms balance</h3>
-
-<h3>Integrating Context Parallelism with TP/SP</h3>
-
-<h3>The nanotron FP8 recipe</h3>
-
-<h2>Overlapping computation and communication</h2>
-
blog-export.html
DELETED
The diff for this file is too large to render.
blog-export.md
DELETED
The diff for this file is too large to render.