Practical 6
DistributedDataParallel (DDP) & Collective Communication
Objective
In this practical session, you will:
- Run training using DistributedDataParallel (DDP)
- Understand multi-process training (one process per GPU)
- Observe rank and world size
- Understand how data is split manually in DDP
- Observe balanced GPU usage
- Compare DDP with DataParallel (previous session)
This session prepares you to use efficient multi-GPU training.
Background
In Lecture 6, we introduced:
- DistributedDataParallel (DDP)
- Process-based parallelism
- Collective communication (All-Reduce)
- Gradient synchronization
In this session, you will experimentally observe these concepts.
Part 1—Run DDP Training
1) Create a file:
practical6_ddp.py
2) Use the provided script
3) Run using torchrun
torchrun --nproc_per_node=2 practical6_ddp.py
👉 Replace 2 with the number of GPUs available
4) Observe
- Multiple processes start
- Each process runs on a different GPU
- Output appears from multiple ranks
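If the provided script is not at hand, a minimal sketch of what practical6_ddp.py might contain follows. The model, `BATCH_SIZE`, `DIM`, and the `get_batch` helper are illustrative assumptions, not the actual handout code; the DDP setup calls (`init_process_group`, wrapping the model in `DistributedDataParallel`) are the standard PyTorch API.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

BATCH_SIZE = 64   # illustrative values, not from the handout
DIM = 128

def get_batch(batch_size, dim, device):
    # Synthetic random data stands in for a real DataLoader
    x = torch.randn(batch_size, dim, device=device)
    y = torch.randn(batch_size, 1, device=device)
    return x, y

def main():
    # torchrun sets LOCAL_RANK (and RANK, WORLD_SIZE) for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Each process owns exactly one GPU and one model replica
    model = DDP(nn.Linear(DIM, 1).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        # Each process trains on its own shard of the global batch
        x, y = get_batch(BATCH_SIZE // dist.get_world_size(), DIM, device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()   # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

This script only does something useful when launched with torchrun, which spawns one copy per GPU; running it with plain `python` will fail because `LOCAL_RANK` is unset.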
Part 2—Understand Rank & Processes
1) Add this print statement:
print(f"Rank: {dist.get_rank()} | Local Rank: {local_rank}")
2) Observe
👉 You should see:
Rank 0 → GPU 0
Rank 1 → GPU 1
...
Note: DDP uses one process per GPU (unlike DataParallel)
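The `local_rank` in the print statement above typically comes from an environment variable that torchrun exports for every process it spawns. A minimal sketch for checking these variables (the defaults of 0 and 1 apply when the script is run without torchrun, i.e. as a single-process job):

```python
import os

# torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each worker
# process; under plain `python` they are absent, so we fall back to
# a single-process view of the job.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"Rank: {rank} | Local Rank: {local_rank} | World Size: {world_size}")
```

Rank is the global process index across all machines; local rank is the index on the current machine, which is why it is used to pick the GPU.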
Part 3—Understand Data Splitting
1) Look at this line:
x, y = get_batch(BATCH_SIZE // dist.get_world_size(), DIM, device)
2) Answer
- Is data split automatically?
- How is batch size distributed across GPUs?
Note: In DDP, you split the batch manually, not automatically like DataParallel
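The arithmetic behind that line is plain integer division of the global batch across processes. A quick pure-Python check (the global batch size of 64 is an illustrative assumption, mirroring `BATCH_SIZE` in the script):

```python
GLOBAL_BATCH_SIZE = 64  # illustrative; the script calls this BATCH_SIZE

# What each process receives for a given world size
for world_size in (1, 2, 4):
    per_gpu = GLOBAL_BATCH_SIZE // world_size
    print(f"world_size={world_size}: {per_gpu} samples per GPU")

# With 2 GPUs, each process sees half the global batch
assert GLOBAL_BATCH_SIZE // 2 == 32
```

Because every process computes its own shard, the effective batch size per optimizer step stays equal to the global batch size regardless of GPU count.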
Part 4—Observe Communication (All-Reduce)
1) Focus on this line:
loss.backward()
2) Answer
- When does communication happen?
- Why is it needed?
Note: During backward pass, gradients are synchronized across all GPUs using All-Reduce communication.
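All-Reduce can be understood without GPUs: every rank contributes its local gradient, the values are summed element-wise across ranks, and every rank ends up holding the identical result; DDP then averages by dividing by the world size. A pure-Python simulation of that sum-then-average step (the gradient values are made up for illustration):

```python
# Local gradients computed independently by each rank on its own shard
local_grads = {0: [0.2, -0.4, 1.0],   # rank 0
               1: [0.6,  0.4, -1.0]}  # rank 1
world_size = len(local_grads)

# All-Reduce (sum): element-wise sum across ranks; after the
# operation, every rank holds this same reduced vector
reduced = [sum(g[i] for g in local_grads.values())
           for i in range(len(local_grads[0]))]

# DDP averages the sum so the update matches single-GPU training
# on the full (unsplit) batch
averaged = [v / world_size for v in reduced]
print(averaged)   # [0.4, 0.0, 0.0]
```

This is why communication must happen during the backward pass: without it, each replica would step in a different direction and the model copies would drift apart.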
Part 5—Monitor GPU Usage
1) Run:
watch -n 1 nvidia-smi
2) Observe:
- GPU utilization
- Memory usage
3) Compare with DataParallel:
- Are GPUs more balanced?
- Is GPU 0 still a bottleneck?
Part 6—Compare with DataParallel
1) Run your Practical 5 code (DataParallel) and compare:
| Aspect | DP | DDP |
| --- | --- | --- |
| Processes | 1 | Multiple |
| Data split | Automatic | Manual |
| GPU usage | Imbalanced | Balanced |
| Scalability | Limited | Better |
2) Answer
- Which is faster?
- Which scales better?
- Why?
Note: DDP avoids the central bottleneck of DataParallel by using distributed processes and collective communication.
Takeaway
- DDP is the standard method for multi-GPU training
- It provides:
- Better performance
- Better scalability
- But requires:
- More setup
- Understanding of distributed systems
- Each GPU runs its own process
- You must launch with torchrun
- Batch must be manually divided