Institute of Computer Science, University of Tartu

Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring


Practical 7

DDP Optimization and Debugging


Objective

In this practical session, you will:

  • Understand DDP performance optimization techniques
  • Explore Gradient Accumulation
  • Explore Automatic Mixed Precision (AMP)
  • Identify how optimization appears in real code
  • Compare different optimization strategies
  • Prepare to build an optimized DDP training

This session prepares you to improve training efficiency and scalability in multi-GPU systems.


Background

In Lecture 7, we introduced the following:

  • DDP performance bottlenecks
  • Gradient accumulation (reducing communication overhead)
  • Automatic Mixed Precision (AMP)
  • GPU memory optimization
  • Debugging common DDP issues

In this session, you will analyze these techniques directly in code.


Part 1—Run DDP with Gradient Accumulation

1) Create a file:

  • practical7_accumulation.py

2) Use the provided script

  • Download

3) Run using torchrun

  • torchrun --nproc_per_node=2 practical7_accumulation.py

👉 Replace 2 with the number of GPUs available

4) Observe

  • Training runs across multiple GPUs
  • Gradients are not synchronized every step
  • Optimizer updates happen every few steps
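The provided script is not reproduced on this page, but the DDP skeleton that torchrun expects can be sketched as follows. This is a minimal sketch, not the actual script: the model, port, and environment defaults are illustrative, and the defaults simply let the file also run as one plain process for quick checks.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets these variables for every process it launches;
    # the defaults let the script also run as a plain single process
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "0" if False else "1")
    os.environ.setdefault("LOCAL_RANK", "0")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

    model = torch.nn.Linear(10, 1).to(device)
    # DDP replicates the model on every rank and all-reduces gradients
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    x = torch.randn(8, 10, device=device)
    loss = ddp_model(x).sum()
    loss.backward()  # gradient synchronization across ranks happens here
    dist.destroy_process_group()
    return float(loss)

if __name__ == "__main__":
    main()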

Part 2—Understand Gradient Accumulation

1) Focus on these lines:

  • loss = loss_fn(output, y) / ACCUM_STEPS
  • with model.no_sync():
  • if (step + 1) % ACCUM_STEPS == 0:

2) Questions:

  • Why do we divide the loss?
  • Why do we skip synchronization?
  • What happens if no_sync() is removed?
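Putting those three lines together, the accumulation loop can be sketched roughly like this. This is a sketch, not the provided script itself: `train_accumulate` and its arguments are illustrative names, and the `getattr` fallback only lets the same loop also run with a plain, non-DDP module.

```python
import contextlib
import torch
import torch.nn as nn

ACCUM_STEPS = 4

def train_accumulate(model, optimizer, loss_fn, batches):
    """One pass over `batches`; weights update once per ACCUM_STEPS micro-batches.

    `model` may be DDP-wrapped (then no_sync() suppresses the gradient
    all-reduce on non-update steps) or a plain nn.Module, for which a
    no-op context is substituted.
    """
    updates = 0
    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        is_update_step = (step + 1) % ACCUM_STEPS == 0
        no_sync = getattr(model, "no_sync", contextlib.nullcontext)
        with (contextlib.nullcontext() if is_update_step else no_sync()):
            # divide so the summed gradients match one large-batch gradient
            loss = loss_fn(model(x), y) / ACCUM_STEPS
            loss.backward()  # gradients accumulate in param.grad until step()
        if is_update_step:
            optimizer.step()      # one update per ACCUM_STEPS micro-batches
            optimizer.zero_grad()
            updates += 1
    return updates
```

Note that synchronization is skipped on the non-update steps precisely because an all-reduce of gradients that will only keep accumulating locally would be wasted communication.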

Part 3—Run DDP with Mixed Precision (AMP)

1) Create a file:

  • practical7_amp.py

2) Use the provided script

  • Download

3) Run:

  • torchrun --nproc_per_node=2 practical7_amp.py

4) Observe

  • Training runs faster
  • GPU memory usage is reduced
  • Same model, but different numerical precision

Part 4—Understand AMP

1) Focus on these lines:

  • with autocast(device_type="cuda"):
  • scaler.scale(loss).backward()
  • scaler.step(optimizer)
  • scaler.update()

2) Questions:

  • Why do we use mixed precision?
  • Why do we need GradScaler?
  • What happens without it?

Part 5—Compare the Two Approaches

1) Fill the table:

| Feature                 | Gradient Accumulation | Mixed Precision |
|-------------------------|-----------------------|-----------------|
| Goal                    |                       |                 |
| Effect on Memory        |                       |                 |
| Effect on Speed         |                       |                 |
| Effect on Communication |                       |                 |

2) Questions:

  • Which method reduces GPU memory usage more?
  • Which method improves speed more?
  • Can we combine both methods?

Takeaway
  • Gradient Accumulation → reduces communication overhead.
  • Mixed Precision → reduces memory usage and computation cost.
  • Combining both leads to highly optimized training.
