Practical 2
CPU vs GPU: Understanding Hardware Parallelism
Objective
In this practical session, you will:
- Compare matrix multiplication performance on CPU (NumPy) and GPU (PyTorch CUDA)
- Measure execution time correctly
- Analyze speedup behavior
- Understand when GPU parallelism becomes beneficial
This practical builds intuition about hardware-level parallelism and performance scaling.
Background
Modern deep learning relies heavily on GPU acceleration. However, GPUs are not always faster than CPUs.
Important Note on Parallelism
Both experiments in this practical are parallel:
- The CPU version uses multi-threaded BLAS libraries (limited number of cores).
- The GPU version uses thousands of CUDA cores (massively parallel architecture).
The key difference is not whether they are parallel; it is the scale of parallelism.
- CPU → limited parallelism (few powerful cores).
- GPU → massive parallelism (thousands of smaller cores).
The performance depends on:
- Matrix size
- Parallel workload
- Memory bandwidth
- Kernel launch overhead
In this practical, you will experimentally observe these effects.
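To build intuition for how these factors interact on the CPU side alone, the sketch below (an illustrative addition, not part of the practical script; the helper name matmul_gflops is ours) times a single matrix multiplication and converts it to achieved GFLOP/s, using the standard count of roughly 2N^3 floating-point operations for an N x N matmul. Small matrices leave most of the parallel hardware idle, so throughput is far below peak:

```python
import time
import numpy as np

def matmul_gflops(n: int, repeats: int = 3) -> float:
    """Return achieved GFLOP/s for an n x n matmul (~2*n**3 FLOPs)."""
    a = np.random.randn(n, n)
    b = np.random.randn(n, n)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # multi-threaded BLAS call
        times.append(time.perf_counter() - start)
    # Use the best run: it is least polluted by OS scheduling noise.
    return (2 * n**3) / min(times) / 1e9

# Throughput typically rises with N as the BLAS threads get enough work.
for n in (64, 512, 2048):
    print(f"N={n:5d}: {matmul_gflops(n):8.2f} GFLOP/s")
```

The same saturation effect, at a much larger scale, is what you will observe on the GPU in the tasks below.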
Task 1 — Implement CPU vs GPU Benchmark
The script measures matrix multiplication performance on both CPU and GPU, averages multiple runs, and computes speedup.
1) Create a file named: practical2.py
2) Copy the following full script into the file:
import numpy as np
import torch
import time

# ----------------------------
# Configuration
# ----------------------------
SIZES = [1000, 2000, 4000]
REPEATS = 3

print("=" * 60)
print("CPU vs GPU Matrix Multiplication Benchmark")
print("=" * 60)
print(f"PyTorch CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
print()

# ----------------------------
# Benchmark Loop
# ----------------------------
for N in SIZES:
    print(f"\nMatrix Size: {N} x {N}")
    print("-" * 40)

    # ----------------------------
    # CPU (NumPy)
    # ----------------------------
    A_cpu = np.random.randn(N, N)
    B_cpu = np.random.randn(N, N)

    cpu_times = []
    for _ in range(REPEATS):
        start = time.time()
        C_cpu = A_cpu @ B_cpu
        end = time.time()
        cpu_times.append(end - start)
    cpu_avg = sum(cpu_times) / REPEATS
    print(f"CPU Avg Time: {cpu_avg:.4f} sec")

    # ----------------------------
    # GPU (PyTorch)
    # ----------------------------
    if torch.cuda.is_available():
        A_gpu = torch.randn(N, N, device="cuda")
        B_gpu = torch.randn(N, N, device="cuda")

        # Warm-up (important for GPU timing)
        torch.matmul(A_gpu, B_gpu)
        torch.cuda.synchronize()

        gpu_times = []
        for _ in range(REPEATS):
            torch.cuda.synchronize()
            start = time.time()
            C_gpu = torch.matmul(A_gpu, B_gpu)
            torch.cuda.synchronize()
            end = time.time()
            gpu_times.append(end - start)
        gpu_avg = sum(gpu_times) / REPEATS
        print(f"GPU Avg Time: {gpu_avg:.4f} sec")

        speedup = cpu_avg / gpu_avg
        print(f"Speedup (CPU/GPU): {speedup:.2f}x")

print("=" * 60)
3) Run the script:
python practical2.py
Task 2 — Scaling Study
1) Increase the matrix size (e.g., 1000, 2000, 4000, 6000, 8000).
2) Create a results table:
| Matrix Size | CPU Time (sec) | GPU Time (sec) | Speedup |
3) Plot matrix size vs speedup.
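One way to produce the plot is with matplotlib. The sketch below is a starting point, not a required solution; the numbers in sizes and speedups are placeholders that you must replace with the measurements from your own table:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

# Placeholder data -- substitute the results from your Task 2 table.
sizes = [1000, 2000, 4000, 6000, 8000]
speedups = [0.8, 2.5, 6.0, 9.0, 12.0]

plt.figure()
plt.plot(sizes, speedups, marker="o")
plt.axhline(1.0, linestyle="--", color="gray", label="CPU = GPU")  # break-even line
plt.xlabel("Matrix size N")
plt.ylabel("Speedup (CPU time / GPU time)")
plt.title("CPU vs GPU speedup scaling")
plt.legend()
plt.savefig("speedup.png", dpi=150)
```

The horizontal line at 1.0 marks the break-even point: sizes where the curve sits below it are faster on the CPU.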
Task 3 — Analysis Questions
Answer the following:
- For which matrix size does GPU become faster than CPU?
- Why is GPU sometimes slower for small matrices?
- Why do we call torch.cuda.synchronize() before measuring time?
- Why does speedup increase as matrix size increases?
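The synchronize() question can be explored empirically. CUDA kernel launches are asynchronous: torch.matmul returns as soon as the kernel is queued, not when it finishes. The sketch below (the helper name timed_matmul is ours; it returns None on machines without a GPU) times the same multiplication with and without synchronization so you can compare the two numbers:

```python
import time
import torch

def timed_matmul(n: int = 2048):
    """Time one GPU matmul with and without torch.cuda.synchronize()."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    # Warm-up so compilation/initialization cost is excluded.
    torch.matmul(a, b)
    torch.cuda.synchronize()

    # Without synchronize: we only time the asynchronous kernel *launch*.
    start = time.time()
    torch.matmul(a, b)
    launch_only = time.time() - start

    # With synchronize: we wait until the computation actually finishes.
    start = time.time()
    torch.matmul(a, b)
    torch.cuda.synchronize()
    full_compute = time.time() - start
    return launch_only, full_compute
```

If you run this on a CUDA machine, the unsynchronized time is typically far smaller, which is exactly why omitting synchronize() makes GPU benchmarks look misleadingly fast.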
Task 4 — Monitor GPU Usage
1) While running the script, open another terminal and execute:
watch -n 1 nvidia-smi
2) Observe:
- GPU utilization (%)
- Memory usage
- Power consumption