Practical 6
DistributedDataParallel (DDP) & Collective Communication
Objective
In this practical session, you will:
- Run training using DistributedDataParallel (DDP)
- Understand multi-process training (one process per GPU)
- Observe rank and world size
- Understand how data is split manually in DDP
- Observe balanced GPU usage
- Compare DDP with DataParallel (previous session)
This session prepares you to use efficient multi-GPU training.
Background
In Lecture 6, we introduced:
- DistributedDataParallel (DDP)
- Process-based parallelism
- Collective communication (All-Reduce)
- Gradient synchronization
In this session, you will experimentally observe these concepts.
Part 1—Run DDP Training
1) Create a file:
practical6_ddp.py
2) Use the provided script
3) Run using torchrun
torchrun --nproc_per_node=2 practical6_ddp.py
👉 Replace 2 with the number of GPUs available
4) Observe
- Multiple processes start
- Each process runs on a different GPU
- Output appears from multiple ranks
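If the provided script is not at hand, a minimal sketch of what practical6_ddp.py might contain follows. The model, `BATCH_SIZE`, `DIM`, and the `get_batch` helper are illustrative assumptions, not the actual handout code; the DDP setup calls (`init_process_group`, wrapping the model in `DistributedDataParallel`) are the standard PyTorch API.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

BATCH_SIZE = 64   # illustrative values, not from the handout
DIM = 128

def get_batch(batch_size, dim, device):
    # Synthetic random data stands in for a real DataLoader
    x = torch.randn(batch_size, dim, device=device)
    y = torch.randn(batch_size, 1, device=device)
    return x, y

def main():
    # torchrun sets LOCAL_RANK (and RANK, WORLD_SIZE) for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Each process owns exactly one GPU and one model replica
    model = DDP(nn.Linear(DIM, 1).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        # Each process trains on its own shard of the global batch
        x, y = get_batch(BATCH_SIZE // dist.get_world_size(), DIM, device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()   # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

This script only does something useful when launched with torchrun, which spawns one copy per GPU; running it with plain `python` will fail because `LOCAL_RANK` is unset.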
Part 2—Understand Rank & Processes
1) Add this print statement:
print(f"Rank: {dist.get_rank()} | Local Rank: {local_rank}")
2) Observe
👉 You should see:
Rank 0 → GPU 0
Rank 1 → GPU 1
...
Note: DDP uses one process per GPU (unlike DataParallel)
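The `local_rank` in the print statement above typically comes from an environment variable that torchrun exports for every process it spawns. A minimal sketch for checking these variables (the defaults of 0 and 1 apply when the script is run without torchrun, i.e. as a single-process job):

```python
import os

# torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each worker
# process; under plain `python` they are absent, so we fall back to
# a single-process view of the job.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"Rank: {rank} | Local Rank: {local_rank} | World Size: {world_size}")
```

Rank is the global process index across all machines; local rank is the index on the current machine, which is why it is used to pick the GPU.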
Part 3—Understand Data Splitting
1) Look at this line:
x, y = get_batch(BATCH_SIZE // dist.get_world_size(), DIM, device)
2) Answer
- Is data split automatically?
- How is batch size distributed across GPUs?
Note: In DDP, you split the batch manually, not automatically like DataParallel
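The arithmetic behind that line is plain integer division of the global batch across processes. A quick pure-Python check (the global batch size of 64 is an illustrative assumption, mirroring `BATCH_SIZE` in the script):

```python
GLOBAL_BATCH_SIZE = 64  # illustrative; the script calls this BATCH_SIZE

# What each process receives for a given world size
for world_size in (1, 2, 4):
    per_gpu = GLOBAL_BATCH_SIZE // world_size
    print(f"world_size={world_size}: {per_gpu} samples per GPU")

# With 2 GPUs, each process sees half the global batch
assert GLOBAL_BATCH_SIZE // 2 == 32
```

Because every process computes its own shard, the effective batch size per optimizer step stays equal to the global batch size regardless of GPU count.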
Part 4—Observe Communication (All-Reduce)
1) Focus on this line:
loss.backward()
2) Answer
- When does communication happen?
- Why is it needed?
Note: During backward pass, gradients are synchronized across all GPUs using All-Reduce communication.
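All-Reduce can be understood without GPUs: every rank contributes its local gradient, the values are summed element-wise across ranks, and every rank ends up holding the identical result; DDP then averages by dividing by the world size. A pure-Python simulation of that sum-then-average step (the gradient values are made up for illustration):

```python
# Local gradients computed independently by each rank on its own shard
local_grads = {0: [0.2, -0.4, 1.0],   # rank 0
               1: [0.6,  0.4, -1.0]}  # rank 1
world_size = len(local_grads)

# All-Reduce (sum): element-wise sum across ranks; after the
# operation, every rank holds this same reduced vector
reduced = [sum(g[i] for g in local_grads.values())
           for i in range(len(local_grads[0]))]

# DDP averages the sum so the update matches single-GPU training
# on the full (unsplit) batch
averaged = [v / world_size for v in reduced]
print(averaged)   # [0.4, 0.0, 0.0]
```

This is why communication must happen during the backward pass: without it, each replica would step in a different direction and the model copies would drift apart.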
Part 5—Monitor GPU Usage
1) Run:
watch -n 1 nvidia-smi
2) Observe:
- GPU utilization
- Memory usage
3) Compare with DataParallel:
- Are GPUs more balanced?
- Is GPU 0 still a bottleneck?
Part 6—Compare with DataParallel
1) Run your Practical 5 code (DataParallel) and compare:
| Aspect | DP | DDP |
| --- | --- | --- |
| Processes | 1 | Multiple |
| Data split | Automatic | Manual |
| GPU usage | Imbalanced | Balanced |
| Scalability | Limited | Better |
2) Answer
- Which is faster?
- Which scales better?
- Why?
Note: DDP avoids the central bottleneck of DataParallel by using distributed processes and collective communication.
Takeaway
- DDP is the standard method for multi-GPU training
- It provides:
- Better performance
- Better scalability
- But requires:
- More setup
- Understanding of distributed systems
- Each GPU runs its own process
- You must launch with torchrun
- Batch must be manually divided