Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring


Practical 3

Profiling & Identifying Performance Bottlenecks


Objective

In this practical session, you will:

  • Measure training performance of a large model
  • Compute throughput (samples per second)
  • Analyze speedup behavior
  • Identify computation vs I/O bottlenecks
  • Use PyTorch’s built-in profiler
  • Connect experimental results to Amdahl’s Law

This practical prepares you to reason about performance before scaling to multi-GPU training.


Background

In Lecture 3, we introduced:

  • Speedup
  • Scalability
  • Throughput
  • Efficiency
  • Amdahl’s Law
  • Bottlenecks (computation vs I/O vs communication)
  • PyTorch profiler

In this session, you will experimentally observe these concepts.
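As a reminder, the metrics above can be written as small helper functions. This is an illustrative sketch using the standard textbook definitions, not part of the practical's scripts; the numbers in the example are made up.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel

def efficiency(s, n_workers):
    """Efficiency E = S / N: how well N workers are used (1.0 = perfect scaling)."""
    return s / n_workers

def throughput(n_samples, t_total):
    """Samples processed per second."""
    return n_samples / t_total

# Illustrative numbers (not measured): 100 s serially, 30 s on 4 GPUs.
s = speedup(100.0, 30.0)   # ~3.33
e = efficiency(s, 4)       # ~0.83
print(f"S = {s:.2f}, E = {e:.2f}")
```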


Task 1—Baseline Measurement (Single GPU)

1) Create a file named: practical3.py
2) Use the following script:

import torch
import torch.nn as nn
import time

DIM = 8192
DEPTH = 6
BATCH_SIZE = 64
STEPS = 20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class LargeLinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        layers = []
        for _ in range(DEPTH):
            layers.append(nn.Linear(DIM, DIM))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = LargeLinearModel().to(device)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

print("Running baseline experiment...")

total_start = time.time()

for step in range(STEPS):

    x = torch.randn(BATCH_SIZE, DIM, device=device)
    y = torch.randn(BATCH_SIZE, DIM, device=device)

    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before starting the timer
    step_start = time.time()

    output = model(x)
    loss = loss_fn(output, y)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the step's GPU work to finish before stopping the timer
    step_time = time.time() - step_start

    throughput = BATCH_SIZE / step_time

    print(f"Step {step:02d} | "
          f"Step time: {step_time:.4f}s | "
          f"Throughput: {throughput:.2f} samples/sec")

total_time = time.time() - total_start
print(f"\nTotal training time: {total_time:.2f}s") 

Task 2—Compute Throughput & Efficiency

1) Run the script with:

  • BATCH_SIZE = 32
  • BATCH_SIZE = 64
  • BATCH_SIZE = 128

2) Create a table:

Batch size | Step time (sec) | Throughput (samples/sec)
-----------|-----------------|-------------------------
32         |                 |
64         |                 |
128        |                 |

3) Answer:

  • Does throughput increase linearly?
  • When does GPU utilization improve?
  • What happens to efficiency?
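To see why throughput can keep growing while per-sample efficiency saturates, here is a sketch that fills the table programmatically. The step times below are hypothetical placeholders; substitute your own measurements.

```python
# Hypothetical step times in seconds; replace with your measured values.
measurements = {32: 0.045, 64: 0.060, 128: 0.105}

base_bs = 32
base_tp = base_bs / measurements[base_bs]  # baseline throughput

print(f"{'Batch':>5} | {'Step (s)':>8} | {'Samples/s':>9} | {'Rel. efficiency':>15}")
for bs, t in measurements.items():
    tp = bs / t
    # Relative efficiency: achieved throughput vs perfect linear scaling
    # from the batch-32 baseline.
    rel_eff = tp / (base_tp * bs / base_bs)
    print(f"{bs:>5} | {t:>8.3f} | {tp:>9.1f} | {rel_eff:>15.2f}")
```

If the GPU was underutilized at small batches, relative efficiency stays near 1.0 as the batch grows; once the GPU saturates, it drops below 1.0.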

Task 3—Artificial I/O Bottleneck

1) Add the following line inside the training loop, just before the forward pass: time.sleep(0.01)
2) Run the experiment again.
3) Answer:

  • What happens to throughput?
  • Is the GPU fully utilized?
  • Which part of training becomes the bottleneck?

This simulates an I/O bottleneck.
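One way to quantify the bottleneck, using illustrative numbers rather than your measurements: compare the average step time with and without the sleep. The difference approximates the serial I/O time per step, and its share of the step is the serial fraction that Amdahl's Law needs.

```python
# Illustrative averages; replace with your measured step times.
t_without_sleep = 0.060  # avg step time before adding time.sleep(0.01)
t_with_sleep = 0.070     # avg step time after

io_time = t_with_sleep - t_without_sleep  # time spent "doing I/O" per step
io_fraction = io_time / t_with_sleep      # serial fraction of the step

print(f"I/O time per step: {io_time * 1000:.1f} ms")
print(f"I/O fraction: {io_fraction:.2%}")  # this is (1 - P) in Amdahl's Law
```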


Task 4—Apply Amdahl’s Law

1) Assume that:

  • 30% of total time is I/O
  • 70% is computation

2) Use Amdahl’s Law, where {$ P $} is the fraction of total time that can be accelerated (here the computation fraction, 0.7):

{$ S_{max} = \frac{1}{1-P} $}

3) Compute the theoretical maximum speedup if computation is infinitely fast.
4) Does this match your intuition from the experiment?
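As a check on your arithmetic, the computation with the assumed 70/30 split looks like this:

```python
# 70% of time is computation (P = 0.7). If computation becomes infinitely
# fast, the remaining serial 30% bounds the achievable speedup.
P = 0.7
S_max = 1 / (1 - P)
print(f"Maximum speedup: {S_max:.2f}x")  # 1 / 0.3, i.e. about 3.33x
```

No matter how much the computation is accelerated, total training time can shrink by at most this factor while the I/O fraction stays serial.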


Task 5—Use PyTorch Profiler

1) Place the profiler around one training step (forward + backward), right after data is created:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))

2) Identify:

  • Which operation consumes most CUDA time?
  • Is backward more expensive than forward?
  • Which layer dominates runtime?
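Beyond the text table, the profiler can export a timeline trace for visual inspection. A minimal self-contained sketch, profiling a single matrix multiply on CPU so it runs even without a GPU (on a GPU run you would add ProfilerActivity.CUDA and sort by CUDA time, as in the snippet above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(512, 512)
b = torch.randn(512, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    c = a @ b  # the operation being profiled

# Text summary of the hottest ops, sorted by CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))

# Timeline view: open trace.json in chrome://tracing or ui.perfetto.dev
prof.export_chrome_trace("trace.json")
```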

Task 6 (optional) — Monitor GPU Usage

1) While the script is running, execute in a second terminal: watch -n 1 nvidia-smi
2) Observe:

  • GPU utilization (%)
  • Memory usage
  • Power consumption
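nvidia-smi reports device-wide figures; from inside the script, PyTorch's own counters give a per-process view of GPU memory. A small sketch using standard torch.cuda memory APIs (it prints a notice and does nothing else on a CPU-only machine):

```python
import torch

def report_gpu_memory(tag=""):
    """Print this process's CUDA memory counters, if a GPU is present."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available")
        return
    mb = 1024 ** 2
    print(f"[{tag}] allocated: {torch.cuda.memory_allocated() / mb:.1f} MiB | "
          f"reserved: {torch.cuda.memory_reserved() / mb:.1f} MiB | "
          f"peak: {torch.cuda.max_memory_allocated() / mb:.1f} MiB")

report_gpu_memory("before step")
# ... run a training step here ...
report_gpu_memory("after step")
```

Calling this before and after a training step shows how much of the memory usage seen in nvidia-smi belongs to your tensors versus PyTorch's caching allocator.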