Practical 7
DDP Optimization and Debugging
Objective
In this practical session, you will:
- Understand DDP performance optimization techniques
- Explore Gradient Accumulation
- Explore Automatic Mixed Precision (AMP)
- Identify how optimization appears in real code
- Compare different optimization strategies
- Prepare to build an optimized DDP training loop
This session prepares you to improve training efficiency and scalability in multi-GPU systems.
Background
In Lecture 7, we introduced the following:
- DDP performance bottlenecks
- Gradient accumulation (reducing communication overhead)
- Automatic Mixed Precision (AMP)
- GPU memory optimization
- Debugging common DDP issues
In this session, you will analyze these techniques directly in code.
Part 1—Run DDP with Gradient Accumulation
1) Create a file:
practical7_accumulation.py
2) Use the provided script
3) Run using torchrun
torchrun --nproc_per_node=2 practical7_accumulation.py
👉 Replace 2 with the number of GPUs available
4) Observe
- Training runs across multiple GPUs
- Gradients are not synchronized every step
- Optimizer updates happen every few steps
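If you want to see the whole pattern in one place, the loop the provided script implements can be sketched roughly as follows. The toy model, random data, and `ACCUM_STEPS` value are illustrative assumptions, not the course's actual script:

```python
# Hedged sketch of DDP training with gradient accumulation.
# The toy model, random data, and ACCUM_STEPS value are assumptions,
# not the course-provided script. Launch with:
#   torchrun --nproc_per_node=2 practical7_accumulation.py
import os
from contextlib import nullcontext

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUM_STEPS = 4  # accumulate gradients over 4 micro-batches


def is_update_step(step: int, accum_steps: int = ACCUM_STEPS) -> bool:
    """True on steps where the optimizer updates (and DDP syncs)."""
    return (step + 1) % accum_steps == 0


def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(32, 2).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(16):  # toy loop over random micro-batches
        x = torch.randn(8, 32, device=local_rank)
        y = torch.randint(0, 2, (8,), device=local_rank)

        # no_sync() skips the gradient all-reduce; sync only on update steps.
        ctx = nullcontext() if is_update_step(step) else model.no_sync()
        with ctx:
            # Divide so the summed gradients average over the micro-batches.
            loss = loss_fn(model(x), y) / ACCUM_STEPS
            loss.backward()

        if is_update_step(step):
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Note that `no_sync()` is a context manager: gradients accumulate locally inside it, and the all-reduce fires only on the `backward()` that runs outside it.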
Part 2—Understand Gradient Accumulation
1) Focus on these lines:
loss = loss_fn(output, y) / ACCUM_STEPS
model.no_sync()
if (step + 1) % ACCUM_STEPS == 0:
2) Observe
- Why do we divide the loss?
- Why do we skip synchronization?
- What happens if no_sync() is removed?
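To see why the loss is divided by `ACCUM_STEPS`, here is a tiny numeric check that needs no GPU. It uses a made-up toy model whose gradient equals the loss value, so the arithmetic is easy to follow:

```python
# Why divide the loss? For a toy model where d(loss)/dw equals the loss
# value, accumulating gradients of loss/ACCUM_STEPS over 4 micro-batches
# matches one update on the mean loss of the combined batch.
ACCUM_STEPS = 4
micro_losses = [2.0, 4.0, 6.0, 8.0]  # made-up per-micro-batch losses

# Accumulation: each backward() adds grad(loss / ACCUM_STEPS).
accumulated_grad = sum(loss / ACCUM_STEPS for loss in micro_losses)

# One big batch would see the mean loss, hence the same gradient.
big_batch_grad = sum(micro_losses) / len(micro_losses)

print(accumulated_grad)  # 5.0
print(big_batch_grad)    # 5.0
```

Without the division, the accumulated gradient would be `ACCUM_STEPS` times too large, which effectively multiplies the learning rate.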
Part 3—Run DDP with Mixed Precision (AMP)
1) Create a file:
practical7_amp.py
2) Use the provided script
3) Run:
torchrun --nproc_per_node=2 practical7_amp.py
4) Observe
- Training runs faster
- GPU memory usage is reduced
- Same model, but different numerical precision
Part 4—Understand AMP
1) Focus on these lines:
with autocast(device_type="cuda"):
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
2) Observe
- Why do we use mixed precision?
- Why do we need GradScaler?
- What happens without it?
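The need for GradScaler comes down to fp16's limited dynamic range. This standard-library-only demo round-trips a float through IEEE half precision (the `"e"` struct format) to show a small gradient underflowing, and surviving once scaled; the gradient and scale values are made up for illustration:

```python
# Why GradScaler? fp16 cannot represent very small magnitudes: values
# below ~6e-8 underflow to zero, silently killing small gradients.
import struct


def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE half precision (fp16)."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8           # a small gradient magnitude (made-up value)
print(to_fp16(tiny_grad))  # 0.0 -- underflows to zero in fp16

scale = 2.0 ** 16          # GradScaler-style loss scale (assumed value)
scaled = to_fp16(tiny_grad * scale)
print(scaled > 0.0)        # True -- the scaled gradient survives in fp16
print(scaled / scale)      # unscaling recovers the gradient (approximately)
```

Scaling the loss before `backward()` shifts gradients into fp16's representable range; `scaler.step()` unscales them back in fp32 before the optimizer uses them.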
Part 5—Compare the Two Approaches
1) Fill the table:
| Feature | Gradient Accumulation | Mixed Precision |
|---|---|---|
| Goal | | |
| Effect on Memory | | |
| Effect on Speed | | |
| Effect on Communication | | |
2) Questions:
- Which method reduces GPU memory usage more?
- Which method improves speed more?
- Can we combine both methods?
Takeaway
- Gradient Accumulation → reduces communication overhead.
- Mixed Precision → reduces memory usage and computation cost.
- Combining both leads to highly optimized training.