Institute of Computer Science, University of Tartu (courses.cs.ut.ee)
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring


Practical 5

Data Parallelism with nn.DataParallel


Objective

In this practical session, you will:

  • Train a model on a single GPU
  • Convert it to multi-GPU training using nn.DataParallel
  • Measure training time and GPU usage
  • Observe how input data is split across GPUs
  • Identify the limitations of Data Parallelism

This practical prepares you to understand why more advanced methods, such as DistributedDataParallel (DDP), are needed.


Background

In Lecture 5, we introduced:

  • Data Parallelism (DP)
  • Model replication
  • Batch splitting
  • Gradient aggregation
  • GPU utilization

In this session, you will experimentally observe these concepts.
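The lecture concepts above can be sketched in plain Python before touching any GPU. The sketch below is a conceptual illustration only, with a toy one-number "gradient" per parameter; it is not how PyTorch implements these steps internally:

```python
def dp_step(batch, n_replicas, grad_fn):
    """One conceptual Data Parallelism step:
    1. split the batch across replicas (batch splitting),
    2. each replica computes gradients on its chunk (model replication),
    3. gradients are summed on the primary device (gradient aggregation)."""
    chunk = -(-len(batch) // n_replicas)  # ceil division, like torch.chunk
    chunks = [batch[i:i + chunk] for i in range(0, len(batch), chunk)]
    per_replica_grads = [grad_fn(c) for c in chunks]   # in parallel on real GPUs
    return [sum(g) for g in zip(*per_replica_grads)]   # aggregated on "GPU 0"

# Toy "model" whose gradient is just the sum of its chunk.
grads = dp_step(list(range(8)), 2, lambda c: [sum(c)])
print(grads)  # [28]: chunks [0..3] and [4..7] give 6 and 22, summed on GPU 0
```

Note that aggregating on one primary device is exactly the design choice you will probe for bottlenecks in Part 5.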


Part 1—Baseline (Single GPU)

1) Create a file named:

  • practical5_single.py

2) Use the following script:

  • Download

3) Run the script:

  • python practical5_single.py

4) Observe

  • Step time
  • Total training time
  • GPU memory usage

👉 This is your baseline performance.
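The downloadable script is not reproduced here, but a minimal single-GPU training loop of the kind used in this part could look like the sketch below. The model architecture, batch size, and step count are illustrative assumptions, not the course script:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative model and data sizes -- replace with the course script's values.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

start = time.time()
for step in range(5):
    t0 = time.time()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"step {step}: {time.time() - t0:.4f} s, loss {loss.item():.4f}")
print(f"total: {time.time() - start:.2f} s")
if device.type == "cuda":
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```

Per-step time, total time, and peak memory printed here are the three numbers you will compare against in Part 3.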


Part 2—Multi-GPU with DataParallel

1) Create a file named:

  • practical5_dp.py

2) Use the following script:

  • Download

3) Run the script:

  • python practical5_dp.py

4) Observe

  • Number of GPUs used
  • Step time
  • Total training time
  • Memory usage per GPU
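The conversion from the single-GPU version is a one-line change: wrap the model in `nn.DataParallel`. The sketch below (same illustrative model as in Part 1, not the actual course script) shows the wrapping; on a machine without multiple GPUs it simply runs unwrapped:

```python
import torch
import torch.nn as nn

# Same illustrative model as the single-GPU sketch -- not the course script.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))

# The only change needed for multi-GPU training: wrap the model.
# nn.DataParallel replicates the module on every visible GPU, scatters
# the input batch along dim 0, runs the replicas in parallel, and
# gathers outputs (and, during backward, gradients) on device 0.
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# The training loop itself is unchanged: inputs still go to device 0,
# and DataParallel distributes them from there.
x = torch.randn(64, 512, device=device)
out = model(x)
print(out.shape)  # torch.Size([64, 10]) -- gathered back on device 0
```

Because inputs enter and outputs leave through device 0, that GPU does extra work; keep this in mind when you compare per-GPU memory usage below.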

Part 3—Compare Performance

1) Fill the table:

Setup          Step Time (sec)   Total Time (sec)
Single GPU
DataParallel

2) Answer:

  • Did training become faster?
  • Is the speedup proportional to number of GPUs?
  • What differences do you observe in memory usage?
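When judging whether the speedup is proportional to the number of GPUs, it helps to compute speedup and parallel efficiency explicitly. The timings below are made-up placeholders; substitute your measured values from the table:

```python
# Hypothetical timings -- replace with your measurements.
t_single = 12.0   # total training time on 1 GPU (s)
t_dp = 7.5        # total training time with DataParallel (s)
n_gpus = 2

speedup = t_single / t_dp        # > 1 means DataParallel was faster
efficiency = speedup / n_gpus    # 1.0 would be perfect linear scaling
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Efficiency well below 100% is expected: replication, scatter/gather, and gradient collection on GPU 0 all cost time that a single GPU does not pay.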

Part 4—Understand Data Splitting

1) Add this print inside your model's forward method (DataParallel splits the batch only when the wrapped model is called, so a print placed in the training loop before the forward pass would show the full, un-split batch on a single device):

print(f"Inside model on device {x.device}, shape: {x.shape}")

2) Run the code and observe the output
👉 Look at:

  • Which GPU is used
  • How many samples each GPU processes
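A self-contained version of this experiment looks like the sketch below (a hypothetical toy model, not the course script). With N GPUs, forward runs once per replica and each replica prints its own chunk; on a CPU-only machine, DataParallel just runs the unwrapped module once on the full batch:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        # Runs once per replica, so each one reports its own slice of the batch.
        print(f"Inside model on device {x.device}, shape: {x.shape}")
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(Toy()).to(device)

out = model(torch.randn(64, 8, device=device))
print(f"Gathered output on {out.device}, shape: {out.shape}")
```

On two GPUs you should see two "Inside model" lines, each with 32 samples, followed by a single gathered output of 64 samples on device 0.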

3) Even vs Odd GPU Splitting

Run the script with different numbers of GPUs:

  • srun --partition=gpu --gres=gpu:2 --cpus-per-task=16 --mem=16G --pty bash
  • srun --partition=gpu --gres=gpu:3 --cpus-per-task=16 --mem=16G --pty bash
  • srun --partition=gpu --gres=gpu:5 --cpus-per-task=16 --mem=16G --pty bash

4) Compare how the batch is divided in each case and fill in this table:

#GPUs   Batch Size   Split per GPU
2       64           32/32
3       64           ?
5       64           ?

5) Then answer

  • Is the batch split manually in your code?
  • How does DataParallel divide the batch across GPUs?
  • When is the split equal?
  • When is it uneven?
  • Why does this happen?
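To check your answers: DataParallel scatters the batch with `torch.chunk`-style splitting, where each chunk holds ceil(batch_size / n_gpus) samples and the last GPU gets whatever remains. The table entries can be reproduced in plain Python:

```python
import math

def dp_split(batch_size, n_gpus):
    """Chunk sizes produced by torch.chunk-style splitting:
    ceil(batch_size / n_gpus) per chunk, remainder on the last chunk."""
    size = math.ceil(batch_size / n_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(size, remaining))
        remaining -= size
    return sizes

for n in (2, 3, 5):
    print(n, dp_split(64, n))
# 2 -> [32, 32]
# 3 -> [22, 22, 20]
# 5 -> [13, 13, 13, 13, 12]
```

The split is equal exactly when the chunk size divides the batch evenly; otherwise the last GPU processes a smaller chunk, which slightly unbalances the work.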

Part 5—Identify Bottlenecks

1) Observe GPU usage using:

watch -n 1 nvidia-smi

2) Answer:

  • Is GPU 0 more loaded than others?
  • Are all GPUs equally utilized?
  • Where could a bottleneck occur?

Part 6—Discussion

Based on your observations:

  • What is the main advantage of DataParallel?
  • What is its main limitation?
  • Why might it not scale well to many GPUs?

Important Insight


nn.DataParallel

  • Replicates the model on each GPU
  • Splits input automatically
  • Collects gradients on GPU 0

This can make GPU 0 a central bottleneck.


Takeaway
  • Data Parallelism is easy to use
  • It improves performance (to some extent)
  • But it has scaling limitations
