Practical 3
Profiling & Identifying Performance Bottlenecks
Objective
In this practical session, you will:
- Measure training performance of a large model
- Compute throughput (samples per second)
- Analyze speedup behavior
- Identify computation vs I/O bottlenecks
- Use PyTorch’s built-in profiler
- Connect experimental results to Amdahl’s Law
This practical prepares you to reason about performance before scaling to multi-GPU training.
Background
In Lecture 3, we introduced:
- Speedup
- Scalability
- Throughput
- Efficiency
- Amdahl’s Law
- Bottlenecks (computation vs I/O vs communication)
- PyTorch profiler
In this session, you will experimentally observe these concepts.
Task 1—Baseline Measurement (Single GPU)
1) Create a file named: practical3.py
2) Use the following script:
```python
import torch
import torch.nn as nn
import time

DIM = 8192
DEPTH = 6
BATCH_SIZE = 64
STEPS = 20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class LargeLinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        layers = []
        for _ in range(DEPTH):
            layers.append(nn.Linear(DIM, DIM))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


model = LargeLinearModel().to(device)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

print("Running baseline experiment...")
total_start = time.time()

for step in range(STEPS):
    x = torch.randn(BATCH_SIZE, DIM, device=device)
    y = torch.randn(BATCH_SIZE, DIM, device=device)

    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for pending GPU work before timing
    step_start = time.time()

    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if device.type == "cuda":
        torch.cuda.synchronize()  # ensure the step has finished on the GPU
    step_time = time.time() - step_start
    throughput = BATCH_SIZE / step_time

    print(f"Step {step:02d} | "
          f"Step time: {step_time:.4f}s | "
          f"Throughput: {throughput:.2f} samples/sec")

total_time = time.time() - total_start
print(f"\nTotal training time: {total_time:.2f}s")
```
Task 2—Compute Throughput & Efficiency
1) Run the script with:
- BATCH_SIZE = 32
- BATCH_SIZE = 64
- BATCH_SIZE = 128
2) Create a table:
| Batch size | Step time (s) | Throughput (samples/s) |
|-----------:|--------------:|-----------------------:|
| 32         |               |                        |
| 64         |               |                        |
| 128        |               |                        |
3) Answer:
- Does throughput increase linearly?
- When does GPU utilization improve?
- What happens to efficiency?
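Once the table is filled in, efficiency can be computed as the ratio of measured throughput to the throughput you would get under perfect linear scaling from the smallest batch. The sketch below shows the arithmetic; the step times in `measurements` are hypothetical placeholders, to be replaced with your own numbers from the table:

```python
# Sketch: turning step-time measurements into throughput and scaling efficiency.
# The step times below are HYPOTHETICAL placeholders -- substitute the values
# you measured in Task 2.
measurements = {32: 0.20, 64: 0.25, 128: 0.40}  # batch size -> step time (s)

base_bs = 32
base_tp = base_bs / measurements[base_bs]  # baseline throughput (samples/s)

for bs in sorted(measurements):
    tp = bs / measurements[bs]
    # Efficiency = achieved throughput / throughput under perfect linear scaling
    efficiency = tp / (base_tp * bs / base_bs)
    print(f"batch={bs:4d} | {tp:7.1f} samples/s | efficiency={efficiency:.2f}")
```

With these placeholder numbers, throughput still rises with batch size while efficiency falls, which is the typical sub-linear pattern you should look for in your own measurements.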
Task 3—Artificial I/O Bottleneck
1) Add the following line inside the training loop, just before the forward pass: `time.sleep(0.01)`
2) Run the experiment again.
3) Answer:
- What happens to throughput?
- Is the GPU fully utilized?
- Which part of training becomes the bottleneck?
This simulates an I/O bottleneck.
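To see why a fixed per-step stall caps throughput no matter how fast the GPU is, a quick back-of-envelope helps. The compute time below is a hypothetical value for illustration only:

```python
# Back-of-envelope: how a fixed 10 ms per-step stall caps throughput.
# compute_time is a HYPOTHETICAL per-step GPU compute time for illustration.
BATCH_SIZE = 64
compute_time = 0.005   # s per step (hypothetical)
io_stall = 0.010       # s per step, the injected time.sleep(0.01)

tp_compute_only = BATCH_SIZE / compute_time
tp_with_stall = BATCH_SIZE / (compute_time + io_stall)
tp_ceiling = BATCH_SIZE / io_stall  # even infinitely fast compute cannot beat this

print(f"compute only:        {tp_compute_only:.0f} samples/s")
print(f"with 10 ms stall:    {tp_with_stall:.0f} samples/s")
print(f"I/O-imposed ceiling: {tp_ceiling:.0f} samples/s")
```

The ceiling term is exactly the serial-fraction limit that Amdahl's Law formalizes in the next task.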
Task 4—Apply Amdahl’s Law
1) Assume that:
- 30% of total time is I/O
- 70% is computation
2) Using Amdahl’s Law:
$$ S_{max} = \frac{1}{1 - P} $$

where $P$ is the fraction of total time that can be sped up (here, the computation fraction).
3) Compute the theoretical maximum speedup if computation is infinitely fast.
4) Does this match your intuition from the experiment?
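Plugging the assumed 70% computation share into the formula gives a quick numeric check:

```python
# Amdahl's Law: S_max = 1 / (1 - P), where P is the fraction of time
# that benefits from the speedup (here, the 70% computation share).
P = 0.70
s_max = 1 / (1 - P)
print(f"Theoretical maximum speedup: {s_max:.2f}x")  # -> 3.33x
```

Even with infinitely fast computation, the 30% I/O share limits overall speedup to about 3.3x, which should match the throughput ceiling you observed in Task 3.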
Task 5—Use PyTorch Profiler
1) Place the profiler around one training step (forward + backward), right after data is created:
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```
2) From the profiler table, identify:
- Which operation consumes the most CUDA time?
- Is backward more expensive than forward?
- Which layer dominates runtime?
Task 6 (optional) — Monitor GPU Usage
1) While the script is running, open a second terminal and execute:
`watch -n 1 nvidia-smi`
2) Observe:
- GPU utilization (%)
- Memory usage
- Power consumption