Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring
Practical 6

DistributedDataParallel (DDP) & Collective Communication


Objective

In this practical session, you will:

  • Run training using DistributedDataParallel (DDP)
  • Understand multi-process training (one process per GPU)
  • Observe rank and world size
  • Understand how data is split manually in DDP
  • Observe balanced GPU usage
  • Compare DDP with DataParallel (previous session)

This session prepares you for efficient multi-GPU training.


Background

In Lecture 6, we introduced:

  • DistributedDataParallel (DDP)
  • Process-based parallelism
  • Collective communication (All-Reduce)
  • Gradient synchronization

In this session, you will experimentally observe these concepts.


Part 1—Run DDP Training

1) Create a file:

  • practical6_ddp.py

2) Use the provided script

  • Download

3) Run using torchrun

  • torchrun --nproc_per_node=2 practical6_ddp.py

👉 Replace 2 with the number of available GPUs

4) Observe

  • Multiple processes start
  • Each process runs on a different GPU
  • Output appears from multiple ranks
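Use the provided script as your reference. As a rough orientation, a minimal DDP training script of the kind `practical6_ddp.py` contains might look like the sketch below. The names `BATCH_SIZE`, `DIM`, and `get_batch` come from the snippets in Parts 2–3; the linear model, the loss, and the constant values are placeholder assumptions, not the provided script itself.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

BATCH_SIZE = 64   # global batch size, split across ranks (assumed value)
DIM = 128         # feature dimension (assumed value)

def get_batch(batch_size, dim, device):
    # Placeholder data: a random regression batch on the given device.
    x = torch.randn(batch_size, dim, device=device)
    y = torch.randn(batch_size, 1, device=device)
    return x, y

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Every process builds the same model; DDP keeps the replicas in sync.
    model = DDP(torch.nn.Linear(DIM, 1).to(device), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(10):
        # Each rank processes its own shard of the global batch.
        x, y = get_batch(BATCH_SIZE // dist.get_world_size(), DIM, device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradients are All-Reduced across ranks here
        opt.step()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

# Only run when launched via torchrun (which sets LOCAL_RANK).
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```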

Part 2—Understand Rank & Processes

1) Add this line:

  • print(f"Rank: {dist.get_rank()} | Local Rank: {local_rank}")

2) Observe

👉 You should see:

Rank 0 → GPU 0
Rank 1 → GPU 1
...

Note: DDP uses one process per GPU (unlike DataParallel)
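Where do these rank numbers come from? torchrun communicates each process's identity through environment variables, which `dist.get_rank()` mirrors once the process group is initialised. The sketch below (plain Python, no GPU needed) simulates what each of the two processes spawned by `torchrun --nproc_per_node=2` would see; the helper name `describe_process` is our own, not part of any API.

```python
import os

def describe_process(env=os.environ):
    """Report the identity torchrun assigns to this process.

    torchrun sets RANK (global process index), LOCAL_RANK (index on this
    node) and WORLD_SIZE (total process count) before the script starts;
    dist.get_rank() and dist.get_world_size() return the same values
    after init_process_group().
    """
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return f"Rank: {rank}/{world_size} | Local Rank: {local_rank} -> GPU {local_rank}"

# Simulate the two processes launched by `torchrun --nproc_per_node=2`:
for r in range(2):
    print(describe_process({"RANK": str(r), "LOCAL_RANK": str(r), "WORLD_SIZE": "2"}))
```

On a single node, LOCAL_RANK and RANK coincide; they differ only in multi-node jobs, where LOCAL_RANK still selects the GPU on each machine.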


Part 3—Understand Data Splitting

1) Look at this line:

  • x, y = get_batch(BATCH_SIZE // dist.get_world_size(), DIM, device)

2) Answer

  • Is data split automatically?
  • How is batch size distributed across GPUs?

Note: In DDP, you split the batch manually, unlike DataParallel, which splits it automatically.
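The division above gives each rank an equal shard of the global batch, so the effective batch size per optimizer step stays constant as you add GPUs. A quick sketch of the arithmetic (the `BATCH_SIZE` value is an assumption):

```python
# Each rank draws its own shard: per-rank batch = BATCH_SIZE // world_size.
BATCH_SIZE = 64  # assumed global batch size

for world_size in (1, 2, 4):
    per_rank = BATCH_SIZE // world_size
    print(f"{world_size} GPU(s): {per_rank} samples per rank, "
          f"{per_rank * world_size} total per step")
```

Note that if BATCH_SIZE is not divisible by the world size, floor division silently shrinks the effective batch; in full data-loading pipelines this sharding is usually delegated to `torch.utils.data.distributed.DistributedSampler` instead of being done by hand.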


Part 4—Observe Communication (All-Reduce)

1) Focus on this line:

  • loss.backward()

2) Answer

  • When does communication happen?
  • Why is it needed?

Note: During backward pass, gradients are synchronized across all GPUs using All-Reduce communication.
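What All-Reduce computes can be simulated in plain Python: every rank contributes its local gradient, and every rank receives the same reduced result. This is only a conceptual sketch with a made-up helper name; real DDP averages bucketed gradient tensors inside `loss.backward()`, overlapped with the backward computation, typically via NCCL.

```python
# Simulate All-Reduce (sum, then average) on per-rank gradient vectors.

def all_reduce_mean(per_rank_grads):
    """Every rank contributes its gradient; every rank receives the mean."""
    world_size = len(per_rank_grads)
    summed = [sum(g[i] for g in per_rank_grads)
              for i in range(len(per_rank_grads[0]))]
    mean = [s / world_size for s in summed]
    # After All-Reduce, all ranks hold the identical averaged gradient,
    # so their optimizer steps (and thus their model replicas) stay in lockstep.
    return [mean for _ in range(world_size)]

grads = [[1.0, 2.0], [3.0, 4.0]]   # local gradients from rank 0 and rank 1
print(all_reduce_mean(grads))      # every rank ends up with [2.0, 3.0]
```

This is why communication is needed: without it, each replica would step in a different direction and the models would drift apart.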


Part 5—Monitor GPU Usage

1) Run:

  • watch -n 1 nvidia-smi

2) Observe:

  • GPU utilization
  • Memory usage

3) Compare with DataParallel:

  • Are GPUs more balanced?
  • Is GPU 0 still a bottleneck?
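Besides watching nvidia-smi from the outside, each rank can report its own device's memory from inside the training script. A hedged sketch (the helper name is our own; it needs a visible CUDA device at runtime and falls back to a notice on CPU-only machines):

```python
import torch

def memory_report(local_rank=0):
    """Format this process's GPU memory use in MiB, or a CPU-only notice."""
    if not torch.cuda.is_available():
        return "no CUDA device visible to this process"
    allocated = torch.cuda.memory_allocated(local_rank) / 2**20
    reserved = torch.cuda.memory_reserved(local_rank) / 2**20
    return (f"GPU {local_rank}: {allocated:.1f} MiB allocated, "
            f"{reserved:.1f} MiB reserved")

print(memory_report())
```

Printing this from every rank each epoch makes the DP-vs-DDP balance question concrete: with DDP the numbers should be roughly equal across GPUs, whereas with DataParallel GPU 0 typically holds noticeably more.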

Part 6—Compare with DataParallel

1) Run your Practical 5 code (DataParallel) and compare:

| Aspect      | DP         | DDP      |
|-------------|------------|----------|
| Processes   | 1          | Multiple |
| Data split  | Automatic  | Manual   |
| GPU usage   | Imbalanced | Balanced |
| Scalability | Limited    | Better   |

2) Answer

  • Which is faster?
  • Which scales better?
  • Why?

Note: DDP avoids the central bottleneck of DataParallel by using distributed processes and collective communication.


Takeaway
  • DDP is the standard method for multi-GPU training
  • It provides:
    • Better performance
    • Better scalability
  • But requires:
    • More setup
    • Understanding of distributed systems
  • Each GPU runs its own process
  • You must launch with torchrun
  • Batch must be manually divided
