Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030)

Parallelism in Deep Learning 2025/26 spring


Practical 7

DDP Optimization and Debugging


Objective

In this practical session, you will:

  • Understand DDP performance optimization techniques
  • Explore Gradient Accumulation
  • Explore Automatic Mixed Precision (AMP)
  • Identify how optimization appears in real code
  • Compare different optimization strategies
  • Prepare to build an optimized DDP training

This session prepares you to improve training efficiency and scalability in multi-GPU systems.


Background

In Lecture 7, we introduced the following:

  • DDP performance bottlenecks
  • Gradient accumulation (reducing communication overhead)
  • Automatic Mixed Precision (AMP)
  • GPU memory optimization
  • Debugging common DDP issues

In this session, you will analyze these techniques directly in code.


Part 1: Run DDP with Gradient Accumulation

1) Create a file:

  • practical7_accumulation.py

2) Use the provided script

  • Download

3) Run using torchrun

  • torchrun --nproc_per_node=2 practical7_accumulation.py

👉 Replace 2 with the number of GPUs available

4) Observe

  • Training runs across multiple GPUs
  • Gradients are not synchronized every step
  • Optimizer updates happen every few steps

Part 2: Understand Gradient Accumulation

1) Focus on these lines:

  • loss = loss_fn(output, y) / ACCUM_STEPS
  • model.no_sync()
  • if (step + 1) % ACCUM_STEPS == 0:

2) Observe

  • Why do we divide the loss?
  • Why do we skip synchronization?
  • What happens if no_sync() is removed?

Part 3: Run DDP with Mixed Precision (AMP)

1) Create a file:

  • practical7_amp.py

2) Use the provided script

  • Download

3) Run:

  • torchrun --nproc_per_node=2 practical7_amp.py

4) Observe

  • Training runs faster
  • GPU memory usage is reduced
  • Same model, but different numerical precision

Part 4: Understand AMP

1) Focus on these lines:

  • with autocast(device_type="cuda"):
  • scaler.scale(loss).backward()
  • scaler.step(optimizer)
  • scaler.update()

2) Observe

  • Why do we use mixed precision?
  • Why do we need GradScaler?
  • What happens without it?

Part 5: Compare the Two Approaches

1) Fill the table:

| Feature                 | Gradient Accumulation | Mixed Precision |
|-------------------------|-----------------------|-----------------|
| Goal                    |                       |                 |
| Effect on Memory        |                       |                 |
| Effect on Speed         |                       |                 |
| Effect on Communication |                       |                 |

2) Questions:

  • Which method reduces GPU memory usage more?
  • Which method improves speed more?
  • Can we combine both methods?

Takeaway
  • Gradient Accumulation → reduces communication overhead.
  • Mixed Precision → reduces memory usage and computation cost.
  • Combining both leads to highly optimized training.
