Practical 5
Data Parallelism with nn.DataParallel
Objective
In this practical session, you will:
- Train a model on a single GPU
- Convert it to multi-GPU training using nn.DataParallel
- Measure training time and GPU usage
- Observe how input data is split across GPUs
- Identify the limitations of Data Parallelism
This practical prepares you to understand why more advanced methods (e.g., DDP) are needed.
Background
In Lecture 5, we introduced:
- Data Parallelism (DP)
- Model replication
- Batch splitting
- Gradient aggregation
- GPU utilization
In this session, you will experimentally observe these concepts.
Part 1—Baseline (Single GPU)
1) Create a file named:
practical5_single.py
2) Use the following script:
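The handout does not reproduce the script itself; a minimal sketch could look like the following. The model architecture, batch size, and step count are illustrative assumptions, not values prescribed by the course, and the script falls back to CPU so it runs anywhere:

```python
# practical5_single.py
# Illustrative baseline sketch: ToyModel, batch size, and step count
# are placeholder choices, not prescribed by the course.
import time

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 4096),
            nn.ReLU(),
            nn.Linear(4096, 10),
        )

    def forward(self, x):
        return self.net(x)

model = ToyModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

batch_size = 64
num_steps = 20

start = time.time()
for step in range(num_steps):
    # Synthetic data stands in for a real dataset.
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)

    step_start = time.time()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # make per-step timing meaningful on GPU
    print(f"step {step:02d} | loss {loss.item():.4f} | "
          f"step time {time.time() - step_start:.4f}s")

total = time.time() - start
print(f"total training time: {total:.2f}s")
if device.type == "cuda":
    print(f"max memory allocated: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```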
3) Run the script:
python practical5_single.py
4) Observe
- Step time
- Total training time
- GPU memory usage
👉 This is your baseline performance
Part 2—Multi-GPU with DataParallel
1) Create a file named:
practical5_dp.py
2) Use the following script:
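Again, the original script is not included here; one possible version keeps the Part 1 toy setup (same assumed architecture and hyperparameters) and changes only the model wrapping:

```python
# practical5_dp.py
# Same toy setup as the Part 1 sketch; the only substantive change is the
# nn.DataParallel wrapper. Architecture and hyperparameters are assumptions.
import time

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpus = torch.cuda.device_count()
print(f"visible GPUs: {n_gpus}")

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 4096),
            nn.ReLU(),
            nn.Linear(4096, 10),
        )

    def forward(self, x):
        return self.net(x)

model = ToyModel().to(device)
# Replicates the model on every visible GPU and splits each input batch
# along dim 0; with zero or one GPU it behaves like the plain module.
model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

batch_size = 64
num_steps = 20

start = time.time()
for step in range(num_steps):
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)

    step_start = time.time()
    optimizer.zero_grad()
    loss = criterion(model(x), y)   # forward: batch is scattered to the GPUs
    loss.backward()                 # backward: gradients are gathered on GPU 0
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()
    print(f"step {step:02d} | loss {loss.item():.4f} | "
          f"step time {time.time() - step_start:.4f}s")

print(f"total training time: {time.time() - start:.2f}s")
for i in range(n_gpus):
    print(f"GPU {i}: max memory {torch.cuda.max_memory_allocated(i) / 1e6:.1f} MB")
```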
3) Run the script:
python practical5_dp.py
4) Observe
- Number of GPUs used
- Step time
- Total training time
- Memory usage per GPU
Part 3—Compare Performance
1) Fill in the table:

| Setup | Step Time (sec) | Total Time (sec) |
| --- | --- | --- |
| Single GPU | | |
| DataParallel | | |
2) Answer:
- Did training become faster?
- Is the speedup proportional to number of GPUs?
- What differences do you observe in memory usage?
Part 4—Understand Data Splitting
1) Add this print inside the model's forward method (not in the training loop: the loop only ever sees the full batch on one device, whereas DataParallel calls forward once per GPU with that GPU's slice):

print(f"Inside model on device {x.device}, shape: {x.shape}")
2) Run the code and observe the output
👉 Look at:
- Which GPU is used
- How many samples each GPU processes
3) Even vs Odd GPU Splitting
Run the script with different numbers of GPUs:
srun --partition=gpu --gres=gpu:2 --cpus-per-task=16 --mem=16G --pty bash
srun --partition=gpu --gres=gpu:3 --cpus-per-task=16 --mem=16G --pty bash
srun --partition=gpu --gres=gpu:5 --cpus-per-task=16 --mem=16G --pty bash
4) Compare how the batch is divided in each case. Fill in this table:

| #GPUs | Batch Size | Split per GPU |
| --- | --- | --- |
| 2 | 64 | 32/32 |
| 3 | 64 | ? |
| 5 | 64 | ? |
5) Then answer
- Is the batch split manually in your code?
- How does DataParallel divide the batch across GPUs?
- When is the split equal?
- When is it uneven?
- Why does this happen?
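For intuition about the uneven cases: in current PyTorch, DataParallel's scatter step divides the batch with torch.chunk semantics, i.e. full chunks of ceil(batch / num_gpus) with the last chunk absorbing the shortfall. The helper below is a pure-Python sketch of that rule, not DataParallel's actual code, so verify the sizes against your printouts from step 2:

```python
import math

def chunk_split(batch_size: int, num_gpus: int) -> list[int]:
    """Per-GPU batch sizes under torch.chunk-style splitting:
    full chunks of ceil(batch_size / num_gpus); the last chunk is smaller."""
    full = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(full, remaining))
        remaining -= full
    return sizes

for n in (2, 3, 5):
    print(f"{n} GPUs: {chunk_split(64, n)}")
# 2 GPUs: [32, 32]
# 3 GPUs: [22, 22, 20]
# 5 GPUs: [13, 13, 13, 13, 12]
```

Note that the split is only equal when the batch size is divisible by the number of GPUs, which is one reason odd GPU counts behave differently.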
Part 5—Identify Bottlenecks
1) Observe GPU usage using:
watch -n 1 nvidia-smi
2) Answer:
- Is GPU 0 more loaded than others?
- Are all GPUs equally utilized?
- Where could a bottleneck occur?
Part 6—Discussion
Based on your observations:
- What is the main advantage of DataParallel?
- What is its main limitation?
- Why might it not scale well to many GPUs?
Important Insight
nn.DataParallel
- Replicates the model on each GPU
- Splits input automatically
- Collects gradients on GPU 0
This can create a central bottleneck
Takeaway
- Data Parallelism is easy to use
- It improves performance (to some extent)
- But it has scaling limitations