Practical 3
Profiling & Identifying Performance Bottlenecks
Objective
In this practical session, you will:
- Measure training performance of a large model
- Compute throughput (samples per second)
- Analyze speedup behavior
- Identify computation vs I/O bottlenecks
- Use PyTorch’s built-in profiler
- Connect experimental results to Amdahl’s Law
This practical prepares you to reason about performance before scaling to multi-GPU training.
Background
In Lecture 3, we introduced:
- Speedup
- Scalability
- Throughput
- Efficiency
- Amdahl’s Law
- Bottlenecks (computation vs I/O vs communication)
- PyTorch profiler
In this session, you will experimentally observe these concepts.
Task 1—Baseline Measurement (Single GPU)
1) Create a file named: practical3.py
2) Use the following script:
```python
import torch
import torch.nn as nn
import time

DIM = 8192
DEPTH = 6
BATCH_SIZE = 64
STEPS = 20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class LargeLinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        layers = []
        for _ in range(DEPTH):
            layers.append(nn.Linear(DIM, DIM))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


model = LargeLinearModel().to(device)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

print("Running baseline experiment...")
total_start = time.time()

for step in range(STEPS):
    x = torch.randn(BATCH_SIZE, DIM, device=device)
    y = torch.randn(BATCH_SIZE, DIM, device=device)

    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for pending GPU work before timing
    step_start = time.time()

    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if device.type == "cuda":
        torch.cuda.synchronize()  # ensure the step has finished on the GPU
    step_time = time.time() - step_start
    throughput = BATCH_SIZE / step_time

    print(f"Step {step:02d} | "
          f"Step time: {step_time:.4f}s | "
          f"Throughput: {throughput:.2f} samples/sec")

total_time = time.time() - total_start
print(f"\nTotal training time: {total_time:.2f}s")
```
Task 2—Compute Throughput & Efficiency
1) Run the script with:
- BATCH_SIZE = 32
- BATCH_SIZE = 64
- BATCH_SIZE = 128
2) Create a table:
| Batch size | Step time (s) | Throughput (samples/s) |
|-----------:|--------------:|-----------------------:|
| 32         |               |                        |
| 64         |               |                        |
| 128        |               |                        |
3) Answer:
- Does throughput increase linearly?
- When does GPU utilization improve?
- What happens to efficiency?
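Once the table is filled in, efficiency can be computed as the ratio of measured throughput to the throughput you would get under perfect linear scaling from the smallest batch. The sketch below shows the arithmetic; the step times in `measurements` are hypothetical placeholders, to be replaced with your own numbers from the table:

```python
# Sketch: turning step-time measurements into throughput and scaling efficiency.
# The step times below are HYPOTHETICAL placeholders -- substitute the values
# you measured in Task 2.
measurements = {32: 0.20, 64: 0.25, 128: 0.40}  # batch size -> step time (s)

base_bs = 32
base_tp = base_bs / measurements[base_bs]  # baseline throughput (samples/s)

for bs in sorted(measurements):
    tp = bs / measurements[bs]
    # Efficiency = achieved throughput / throughput under perfect linear scaling
    efficiency = tp / (base_tp * bs / base_bs)
    print(f"batch={bs:4d} | {tp:7.1f} samples/s | efficiency={efficiency:.2f}")
```

With these placeholder numbers, throughput still rises with batch size while efficiency falls, which is the typical sub-linear pattern you should look for in your own measurements.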
Task 3—Artificial I/O Bottleneck
1) Add the following line inside the training loop, just before the forward pass: `time.sleep(0.01)`
2) Run the experiment again.
3) Answer:
- What happens to throughput?
- Is the GPU fully utilized?
- Which part of training becomes the bottleneck?
This simulates an I/O bottleneck.
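To see why a fixed per-step stall caps throughput no matter how fast the GPU is, a quick back-of-envelope helps. The compute time below is a hypothetical value for illustration only:

```python
# Back-of-envelope: how a fixed 10 ms per-step stall caps throughput.
# compute_time is a HYPOTHETICAL per-step GPU compute time for illustration.
BATCH_SIZE = 64
compute_time = 0.005   # s per step (hypothetical)
io_stall = 0.010       # s per step, the injected time.sleep(0.01)

tp_compute_only = BATCH_SIZE / compute_time
tp_with_stall = BATCH_SIZE / (compute_time + io_stall)
tp_ceiling = BATCH_SIZE / io_stall  # even infinitely fast compute cannot beat this

print(f"compute only:        {tp_compute_only:.0f} samples/s")
print(f"with 10 ms stall:    {tp_with_stall:.0f} samples/s")
print(f"I/O-imposed ceiling: {tp_ceiling:.0f} samples/s")
```

The ceiling term is exactly the serial-fraction limit that Amdahl's Law formalizes in the next task.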
Task 4—Apply Amdahl’s Law
1) Assume that:
- 30% of total time is I/O
- 70% is computation
2) Using Amdahl’s Law:
$$ S_{max} = \frac{1}{1 - P} $$

where $P$ is the fraction of total time that can be sped up (here, the computation fraction).
3) Compute the theoretical maximum speedup if computation is infinitely fast.
4) Does this match your intuition from the experiment?
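Plugging the assumed 70% computation share into the formula gives a quick numeric check:

```python
# Amdahl's Law: S_max = 1 / (1 - P), where P is the fraction of time
# that benefits from the speedup (here, the 70% computation share).
P = 0.70
s_max = 1 / (1 - P)
print(f"Theoretical maximum speedup: {s_max:.2f}x")  # -> 3.33x
```

Even with infinitely fast computation, the 30% I/O share limits overall speedup to about 3.3x, which should match the throughput ceiling you observed in Task 3.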
Task 5—Use PyTorch Profiler
1) Place the profiler around one training step (forward + backward), right after data is created:
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```
2) From the profiler table, identify:
- Which operation consumes the most CUDA time?
- Is backward more expensive than forward?
- Which layer dominates runtime?
Task 6 (optional) — Monitor GPU Usage
1) While the script is running, open a second terminal and execute:
`watch -n 1 nvidia-smi`
2) Observe:
- GPU utilization (%)
- Memory usage
- Power consumption