Practical 10
Hybrid Parallelism (DP + MP + Pipeline Concept)
Objective
In this practical session, you will:
- Run a hybrid parallel training system
- Identify:
  - Data Parallelism (DP)
  - Model Parallelism (MP)
  - Pipeline concept (micro-batching)
- Modify the system and observe behavior changes
Background
Students should:
- Understand DP, MP, and the pipeline concept (from lecture)
- Have access to:
  - A multi-GPU machine (≥ 4 GPUs)
  - PyTorch with distributed support
Setup Instructions
- Step 1 — Allocate GPUs (HPC):
  srun --partition=gpu --gres=gpu:4 --pty bash
- Step 2 — Run the code:
  torchrun --nproc_per_node=2 python/sample.py
Part 1—Run and Observe
1) Use the following script:
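The full sample.py is not reproduced here; below is a minimal sketch of the structure the later parts assume. The constants BATCH_SIZE and MICRO_BATCHES and the device1 assignment come from Parts 2–5; the two-stage linear model, the stage1/stage2 names, and the tensor dimensions are illustrative assumptions.

```python
# Hypothetical sketch of python/sample.py (not the original listing).
import os

import torch
import torch.distributed as dist
import torch.nn as nn

BATCH_SIZE = 64
MICRO_BATCHES = 4
IN_DIM, HIDDEN, OUT_DIM = 128, 256, 10  # illustrative shapes

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Model parallelism: each process splits its model across two GPUs.
    device0 = torch.device(f"cuda:{local_rank}")      # stage 1
    device1 = torch.device(f"cuda:{local_rank + 2}")  # stage 2
    stage1 = nn.Linear(IN_DIM, HIDDEN).to(device0)
    stage2 = nn.Linear(HIDDEN, OUT_DIM).to(device1)
    params = list(stage1.parameters()) + list(stage2.parameters())
    opt = torch.optim.SGD(params, lr=0.01)
    loss_fn = nn.MSELoss()

    # Data parallelism: each rank draws its own (here: random) batch.
    x = torch.randn(BATCH_SIZE, IN_DIM)
    y = torch.randn(BATCH_SIZE, OUT_DIM)

    opt.zero_grad()
    # Pipeline concept: the batch is processed as micro-batches.
    for i, (mx, my) in enumerate(zip(x.chunk(MICRO_BATCHES),
                                     y.chunk(MICRO_BATCHES))):
        h = torch.relu(stage1(mx.to(device0)))  # forward, stage 1
        out = stage2(h.to(device1))             # forward, stage 2
        loss = loss_fn(out, my.to(device1)) / MICRO_BATCHES
        loss.backward()                         # gradients accumulate
        print(f"[Rank {rank}] micro-batch {i}: loss {loss.item():.4f}")

    # DP synchronization: average gradients across ranks, then step.
    for p in params:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()
    opt.step()
    print(f"[Rank {rank}] optimizer step done (gradients synchronized)")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under this device assignment, rank 0 uses GPUs 0 and 2 and rank 1 uses GPUs 1 and 3, so the two processes together cover all four allocated GPUs.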
2) Questions:
- Q1) How many processes are running?
- Q2) Which GPUs does each rank use?
- Q3) How many micro-batches are processed?
- Q4) When does synchronization happen?
Part 2—Modify Micro-batches
1) Change:
MICRO_BATCHES = 4
2) Try:
MICRO_BATCHES = 2
MICRO_BATCHES = 8
3) Questions:
- Q1) What changes in the output?
- Q2) How many forward/backward steps now?
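If the batch is split with torch.Tensor.chunk as in the Part 1 sketch, the number of forward/backward passes per optimizer step equals MICRO_BATCHES. A quick standalone check (hypothetical, mirroring that sketch):

```python
import torch

x = torch.randn(64, 128)  # BATCH_SIZE = 64
for m in (2, 4, 8):
    chunks = x.chunk(m)
    print(f"MICRO_BATCHES={m}: {len(chunks)} micro-batches "
          f"of size {chunks[0].shape[0]}")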
Part 3—Change Batch Size
1) Modify:
BATCH_SIZE = 64
2) Try:
BATCH_SIZE = 32
BATCH_SIZE = 128
3) Questions:
- Q1) Does execution pattern change?
- Q2) What stays the same?
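With MICRO_BATCHES held fixed, changing BATCH_SIZE changes the size of each micro-batch rather than the number of steps, assuming the chunk-based split from the Part 1 sketch:

```python
import torch

for bs in (32, 64, 128):
    chunks = torch.randn(bs, 128).chunk(4)  # MICRO_BATCHES = 4
    print(f"BATCH_SIZE={bs}: {len(chunks)} micro-batches "
          f"of size {chunks[0].shape[0]}")
```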
Part 4—Change Number of Processes (DP)
1) Run: torchrun --nproc_per_node=1 python/sample.py
2) Questions:
- Q1) What happens to [Rank 1]?
- Q2) Is synchronization still happening?
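A hypothetical diagnostic line (assuming the Part 1 sketch) makes the answer visible: print the world size after initialization. With --nproc_per_node=1 only rank 0 exists, and the all-reduce still executes but averages over a single rank.

```python
import torch.distributed as dist

# After init_process_group(): report how many ranks are participating.
print(f"[Rank {dist.get_rank()}] world_size = {dist.get_world_size()}")
```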
Part 5—Break Model Parallelism (Important)
1) Modify the code:
- Replace:
  device1 = torch.device(f"cuda:{local_rank + 2}")
- With:
  device1 = torch.device(f"cuda:{local_rank}")
2) Questions:
- Q1) What changes in output?
- Q2) Are multiple GPUs still used?
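To verify where each stage lives after the change, a hypothetical check using the stage1/stage2 names from the Part 1 sketch:

```python
# Hypothetical check: report which GPU holds each model stage.
print(f"[Rank {rank}] stage1 on {next(stage1.parameters()).device}, "
      f"stage2 on {next(stage2.parameters()).device}")
```

After the modification both stages report cuda:{local_rank}, so each process runs on a single GPU; with two processes, two of the four GPUs remain in use. Data parallelism survives, but model parallelism does not.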