# Distributing Training

<Tip warning={true}>

Section under construction. Feel free to contribute!

</Tip>

## Multi-GPU Training with TRL
The trainers in TRL use [🤗 Accelerate](https://github.com/huggingface/accelerate) to enable distributed training across multiple GPUs or nodes. To do so, first create an [🤗 Accelerate](https://github.com/huggingface/accelerate) config file by running

```bash
accelerate config
```

and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running:

```bash
accelerate launch train.py
```
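Here, `train.py` stands for any TRL training script. As a point of reference, a minimal one might look like the sketch below; the model and dataset names are illustrative placeholders, not requirements:

```python
# train.py: a minimal sketch of a TRL training script.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative dataset and model; substitute your own.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=SFTConfig(output_dir="Qwen2.5-0.5B-SFT"),
    train_dataset=dataset,
)
trainer.train()
```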
We also provide config files in the [examples folder](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:

```shell
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py <SCRIPT_ARGS>
```

This automatically distributes the workload across all available GPUs.
Under the hood, [🤗 Accelerate](https://github.com/huggingface/accelerate) creates one copy of the model per GPU. Each process:

- Processes its own batch of data
- Computes the loss and gradients for that batch
- Synchronizes gradients across all GPUs before the optimizer step, so every copy of the model stays identical
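Conceptually, the loop that 🤗 Accelerate runs on your behalf looks like the simplified sketch below. It uses a toy model for illustration; the TRL trainers handle all of this internally, so you never have to write it yourself.

```python
import torch
from accelerate import Accelerator

# `accelerate launch` spawns one process per GPU; each process builds
# its own copy of the model.
accelerator = Accelerator()

model = torch.nn.Linear(16, 1)  # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)

# `prepare` wraps the model for distributed training and shards the
# dataloader so each process sees a different slice of the data.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # gradients are synchronized across GPUs here
    optimizer.step()
    optimizer.zero_grad()
```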
The effective batch size is calculated as:

$$
\text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps}
$$
To maintain a consistent effective batch size when scaling to multiple GPUs, make sure to update `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly. For example, the following configurations are equivalent and should yield the same results:
| Number of GPUs | Per device batch size | Gradient accumulation steps | Comments |
| --- | --- | --- | --- |
| 1 | 32 | 1 | Possibly high memory usage, but faster training |
| 1 | 4 | 8 | Lower memory usage, slower training |
| 8 | 4 | 1 | Multi-GPU to get the best of both worlds |
<Tip>

Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details.

</Tip>
## Multi-Node Training

We're working on a guide for multi-node training. Stay tuned! 🤗