Practical 5
Data Parallelism with nn.DataParallel
Objective
In this practical session, you will:
- Train a model on a single GPU
- Convert it to multi-GPU training using nn.DataParallel
- Measure training time and GPU usage
- Observe how input data is split across GPUs
- Identify the limitations of Data Parallelism
This practical prepares you to understand why more advanced methods (e.g., DDP) are needed.
Background
In Lecture 5, we introduced:
- Data Parallelism (DP)
- Model replication
- Batch splitting
- Gradient aggregation
- GPU utilization
In this session, you will experimentally observe these concepts.
Part 1—Baseline (Single GPU)
1) Create a file named:
practical5_single.py
2) Use the following script:
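The handout does not reproduce the script itself; a minimal sketch could look like the following. The model architecture, batch size, and step count are illustrative assumptions, not values prescribed by the course, and the script falls back to CPU so it runs anywhere:

```python
# practical5_single.py
# Illustrative baseline sketch: ToyModel, batch size, and step count
# are placeholder choices, not prescribed by the course.
import time

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 4096),
            nn.ReLU(),
            nn.Linear(4096, 10),
        )

    def forward(self, x):
        return self.net(x)

model = ToyModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

batch_size = 64
num_steps = 20

start = time.time()
for step in range(num_steps):
    # Synthetic data stands in for a real dataset.
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)

    step_start = time.time()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # make per-step timing meaningful on GPU
    print(f"step {step:02d} | loss {loss.item():.4f} | "
          f"step time {time.time() - step_start:.4f}s")

total = time.time() - start
print(f"total training time: {total:.2f}s")
if device.type == "cuda":
    print(f"max memory allocated: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```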
3) Run the script:
python practical5_single.py
4) Observe
- Step time
- Total training time
- GPU memory usage
👉 This is your baseline performance
Part 2—Multi-GPU with DataParallel
1) Create a file named:
practical5_dp.py
2) Use the following script:
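Again, the original script is not included here; one possible version keeps the Part 1 toy setup (same assumed architecture and hyperparameters) and changes only the model wrapping:

```python
# practical5_dp.py
# Same toy setup as the Part 1 sketch; the only substantive change is the
# nn.DataParallel wrapper. Architecture and hyperparameters are assumptions.
import time

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpus = torch.cuda.device_count()
print(f"visible GPUs: {n_gpus}")

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 4096),
            nn.ReLU(),
            nn.Linear(4096, 10),
        )

    def forward(self, x):
        return self.net(x)

model = ToyModel().to(device)
# Replicates the model on every visible GPU and splits each input batch
# along dim 0; with zero or one GPU it behaves like the plain module.
model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

batch_size = 64
num_steps = 20

start = time.time()
for step in range(num_steps):
    x = torch.randn(batch_size, 1024, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)

    step_start = time.time()
    optimizer.zero_grad()
    loss = criterion(model(x), y)   # forward: batch is scattered to the GPUs
    loss.backward()                 # backward: gradients are gathered on GPU 0
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()
    print(f"step {step:02d} | loss {loss.item():.4f} | "
          f"step time {time.time() - step_start:.4f}s")

print(f"total training time: {time.time() - start:.2f}s")
for i in range(n_gpus):
    print(f"GPU {i}: max memory {torch.cuda.max_memory_allocated(i) / 1e6:.1f} MB")
```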
3) Run the script:
python practical5_dp.py
4) Observe
- Number of GPUs used
- Step time
- Total training time
- Memory usage per GPU
Part 3—Compare Performance
1) Fill in the table:

| Setup | Step Time (sec) | Total Time (sec) |
| --- | --- | --- |
| Single GPU | | |
| DataParallel | | |
2) Answer:
- Did training become faster?
- Is the speedup proportional to number of GPUs?
- What differences do you observe in memory usage?
Part 4—Understand Data Splitting
1) Add this print inside the model's forward method (not in the training loop: the loop only ever sees the full batch on one device, whereas DataParallel calls forward once per GPU with that GPU's slice):

print(f"Inside model on device {x.device}, shape: {x.shape}")
2) Run the code and observe the output
👉 Look at:
- Which GPU is used
- How many samples each GPU processes
3) Even vs Odd GPU Splitting
Run the script with different numbers of GPUs:
srun --partition=gpu --gres=gpu:2 --cpus-per-task=16 --mem=16G --pty bash
srun --partition=gpu --gres=gpu:3 --cpus-per-task=16 --mem=16G --pty bash
srun --partition=gpu --gres=gpu:5 --cpus-per-task=16 --mem=16G --pty bash
4) Compare how the batch is divided in each case. Fill in this table:

| #GPUs | Batch Size | Split per GPU |
| --- | --- | --- |
| 2 | 64 | 32/32 |
| 3 | 64 | ? |
| 5 | 64 | ? |
5) Then answer
- Is the batch split manually in your code?
- How does DataParallel divide the batch across GPUs?
- When is the split equal?
- When is it uneven?
- Why does this happen?
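For intuition about the uneven cases: in current PyTorch, DataParallel's scatter step divides the batch with torch.chunk semantics, i.e. full chunks of ceil(batch / num_gpus) with the last chunk absorbing the shortfall. The helper below is a pure-Python sketch of that rule, not DataParallel's actual code, so verify the sizes against your printouts from step 2:

```python
import math

def chunk_split(batch_size: int, num_gpus: int) -> list[int]:
    """Per-GPU batch sizes under torch.chunk-style splitting:
    full chunks of ceil(batch_size / num_gpus); the last chunk is smaller."""
    full = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(full, remaining))
        remaining -= full
    return sizes

for n in (2, 3, 5):
    print(f"{n} GPUs: {chunk_split(64, n)}")
# 2 GPUs: [32, 32]
# 3 GPUs: [22, 22, 20]
# 5 GPUs: [13, 13, 13, 13, 12]
```

Note that the split is only equal when the batch size is divisible by the number of GPUs, which is one reason odd GPU counts behave differently.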
Part 5—Identify Bottlenecks
1) Observe GPU usage using:
watch -n 1 nvidia-smi
2) Answer:
- Is GPU 0 more loaded than others?
- Are all GPUs equally utilized?
- Where could a bottleneck occur?
Part 6—Discussion
Based on your observations:
- What is the main advantage of DataParallel?
- What is its main limitation?
- Why might it not scale well to many GPUs?
Important Insight
nn.DataParallel
- Replicates the model on each GPU
- Splits input automatically
- Collects gradients on GPU 0
This can create a central bottleneck
Takeaway
- Data Parallelism is easy to use
- It improves performance (to some extent)
- But it has scaling limitations