Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring


Practical 3

Profiling & Identifying Performance Bottlenecks


Objective

In this practical session, you will:

  • Measure training performance of a large model
  • Compute throughput (samples per second)
  • Analyze speedup behavior
  • Identify computation vs I/O bottlenecks
  • Use PyTorch’s built-in profiler
  • Connect experimental results to Amdahl’s Law

This practical prepares you to reason about performance before scaling to multi-GPU training.


Background

In Lecture 3, we introduced:

  • Speedup
  • Scalability
  • Throughput
  • Efficiency
  • Amdahl’s Law
  • Bottlenecks (computation vs I/O vs communication)
  • PyTorch profiler

In this session, you will experimentally observe these concepts.
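As a refresher, the first four quantities follow directly from measured timings. A minimal sketch (the helper names below are ours, not part of the course code):

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p: how much faster the parallel run is."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, num_workers):
    """Efficiency E = S / p: speedup per worker; 1.0 is ideal (linear scaling)."""
    return speedup(t_serial, t_parallel) / num_workers

def throughput(num_samples, elapsed_seconds):
    """Samples processed per second."""
    return num_samples / elapsed_seconds

# Example: 100 s serially, 20 s on 8 GPUs
print(speedup(100, 20))        # 5.0x
print(efficiency(100, 20, 8))  # 0.625
```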


Task 1—Baseline Measurement (Single GPU)

1) Create a file named: practical3.py
2) Use the following script:

import torch
import torch.nn as nn
import time

DIM = 8192
DEPTH = 6
BATCH_SIZE = 64
STEPS = 20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sync():
    # CUDA kernels launch asynchronously; synchronize so that time.time()
    # measures completed GPU work. No-op on a CPU-only machine, where
    # calling torch.cuda.synchronize() would raise an error.
    if device.type == "cuda":
        torch.cuda.synchronize()

class LargeLinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        layers = []
        for _ in range(DEPTH):
            layers.append(nn.Linear(DIM, DIM))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = LargeLinearModel().to(device)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

print("Running baseline experiment...")

total_start = time.time()

for step in range(STEPS):

    # Synthetic data generated on the device; in a real pipeline
    # this is where data loading (I/O) would happen.
    x = torch.randn(BATCH_SIZE, DIM, device=device)
    y = torch.randn(BATCH_SIZE, DIM, device=device)

    sync()
    step_start = time.time()

    output = model(x)
    loss = loss_fn(output, y)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    sync()
    step_time = time.time() - step_start

    throughput = BATCH_SIZE / step_time

    print(f"Step {step:02d} | "
          f"Step time: {step_time:.4f}s | "
          f"Throughput: {throughput:.2f} samples/sec")

total_time = time.time() - total_start
print(f"\nTotal training time: {total_time:.2f}s")
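Note that the first step or two usually include one-off costs (CUDA context creation, allocator warm-up), so a summary should average over the steady-state steps only. A small helper sketch, not part of the provided script:

```python
def summarize(step_times, batch_size, warmup=2):
    """Mean step time and throughput, ignoring the first `warmup` steps."""
    steady = step_times[warmup:]
    mean_time = sum(steady) / len(steady)
    return mean_time, batch_size / mean_time

# Example with made-up timings: the first step is slow due to warm-up.
mean_t, tput = summarize([0.50, 0.11, 0.10, 0.10, 0.10], batch_size=64, warmup=1)
print(f"{mean_t:.4f}s/step, {tput:.1f} samples/sec")
```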

Task 2—Compute Throughput & Efficiency

1) Run the script with:

  • BATCH_SIZE = 32
  • BATCH_SIZE = 64
  • BATCH_SIZE = 128

2) Create a table:

| Batch size | Step time (s) | Throughput (samples/sec) |
|------------|---------------|--------------------------|
| 32         |               |                          |
| 64         |               |                          |
| 128        |               |                          |

3) Answer:

  • Does throughput increase linearly?
  • When does GPU utilization improve?
  • What happens to efficiency?
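One way to automate the sweep is to wrap the timing loop in a function and call it once per batch size. A sketch with a much smaller model so it also runs quickly on CPU (the dimensions and step count here are illustrative, not the assignment's values):

```python
import time
import torch
import torch.nn as nn

def measure_throughput(model, batch_size, dim, device, steps=5):
    """Average samples/sec over `steps` training-like iterations."""
    loss_fn = nn.MSELoss()
    opt = torch.optim.Adam(model.parameters())
    times = []
    for _ in range(steps):
        x = torch.randn(batch_size, dim, device=device)
        y = torch.randn(batch_size, dim, device=device)
        if device.type == "cuda":
            torch.cuda.synchronize()
        t0 = time.time()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
        if device.type == "cuda":
            torch.cuda.synchronize()
        times.append(time.time() - t0)
    return batch_size / (sum(times) / len(times))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dim = 256  # toy size for illustration; use DIM = 8192 for the assignment
model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)).to(device)

for bs in (32, 64, 128):
    print(f"batch={bs:3d}: {measure_throughput(model, bs, dim, device):.1f} samples/sec")
```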

Task 3—Artificial I/O Bottleneck

1) Add the following line before the forward pass: time.sleep(0.01)
2) Run the experiment again.
3) Answer:

  • What happens to throughput?
  • Is the GPU fully utilized?
  • Which part of training becomes the bottleneck?

This simulates an I/O bottleneck.


Task 4—Apply Amdahl’s Law

1) Assume that:

  • 30% of total time is I/O
  • 70% is computation

2) Using Amdahl’s Law, where P is the fraction of total time that can be sped up (here the computation, so P = 0.7):

{$ S_{max} = \frac{1}{1-P} $}

3) Compute the theoretical maximum speedup if computation is infinitely fast.
4) Does this match your intuition from the experiment?
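With P = 0.7, the arithmetic can be checked in a couple of lines:

```python
def amdahl_max_speedup(p):
    """S_max = 1 / (1 - P): the limit as the improvable part takes zero time."""
    return 1.0 / (1.0 - p)

# With 70% computation and 30% serial I/O, the I/O caps the speedup at ~3.33x.
print(amdahl_max_speedup(0.7))
```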


Task 5—Use PyTorch Profiler

1) Place the profiler around one training step (forward + backward), right after data is created:

from torch.profiler import profile, ProfilerActivity

# Profile a single forward + backward pass (the optimizer step is excluded
# to isolate model compute). record_shapes=True attributes time to input
# shapes, which helps pinpoint the dominant layer.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()

# On a CPU-only machine, sort by "cpu_time_total" instead.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

2) Identify:

  • Which operation consumes most CUDA time?
  • Is backward more expensive than forward?
  • Which layer dominates runtime?
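Beyond the summary table, the profiler can export a timeline that you can open in chrome://tracing or Perfetto, which often makes the forward-vs-backward cost split visually obvious. A self-contained sketch (a tiny stand-in model and CPU-only activities, so it also runs without a GPU):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    loss = model(x).sum()
    loss.backward()

# Writes a timeline viewable in chrome://tracing or ui.perfetto.dev
prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```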

Task 6 (optional) — Monitor GPU Usage

1) While running the script, execute watch -n 1 nvidia-smi
2) Observe:

  • GPU utilization (%)
  • Memory usage
  • Power consumption