Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030)

Parallelism in Deep Learning 2025/26 spring


Practical 7

DDP Optimization and Debugging


Objective

In this practical session, you will:

  • Understand DDP performance optimization techniques
  • Explore Gradient Accumulation
  • Explore Automatic Mixed Precision (AMP)
  • Identify how optimization appears in real code
  • Compare different optimization strategies
  • Prepare to build an optimized DDP training

This session prepares you to improve training efficiency and scalability in multi-GPU systems.


Background

In Lecture 7, we introduced the following:

  • DDP performance bottlenecks
  • Gradient accumulation (reducing communication overhead)
  • Automatic Mixed Precision (AMP)
  • GPU memory optimization
  • Debugging common DDP issues

In this session, you will analyze these techniques directly in code.


Part 1: Run DDP with Gradient Accumulation

1) Create a file:

  • practical7_accumulation.py

2) Use the provided script

  • Download

3) Run using torchrun

  • torchrun --nproc_per_node=2 practical7_accumulation.py

👉 Replace 2 with the number of GPUs available

4) Observe

  • Training runs across multiple GPUs
  • Gradients are not synchronized every step
  • Optimizer updates happen every few steps

Part 2: Understand Gradient Accumulation

1) Focus on these lines:

  • loss = loss_fn(output, y) / ACCUM_STEPS
  • model.no_sync()
  • if (step + 1) % ACCUM_STEPS == 0:

2) Observe

  • Why do we divide the loss?
  • Why do we skip synchronization?
  • What happens if no_sync() is removed?

Part 3: Run DDP with Mixed Precision (AMP)

1) Create a file:

  • practical7_amp.py

2) Use the provided script

  • Download

3) Run:

  • torchrun --nproc_per_node=2 practical7_amp.py

4) Observe

  • Training runs faster
  • GPU memory usage is reduced
  • Same model, but different numerical precision

Part 4: Understand AMP

1) Focus on these lines:

  • with autocast(device_type="cuda"):
  • scaler.scale(loss).backward()
  • scaler.step(optimizer)
  • scaler.update()

2) Observe

  • Why do we use mixed precision?
  • Why do we need GradScaler?
  • What happens without it?

Part 5: Compare the Two Approaches

1) Fill the table:

| Feature                 | Gradient Accumulation | Mixed Precision |
|-------------------------|-----------------------|-----------------|
| Goal                    |                       |                 |
| Effect on Memory        |                       |                 |
| Effect on Speed         |                       |                 |
| Effect on Communication |                       |                 |

2) Questions:

  • Which method reduces GPU memory usage more?
  • Which method improves speed more?
  • Can we combine both methods?

Takeaway
  • Gradient Accumulation → reduces communication overhead.
  • Mixed Precision → reduces memory usage and computation cost.
  • Combining both leads to highly optimized training.
