Practical 2
CPU vs GPU: Understanding Hardware Parallelism
Objective
In this practical session, you will:
- Compare matrix multiplication performance on CPU (NumPy) and GPU (PyTorch CUDA)
- Measure execution time correctly
- Analyze speedup behavior
- Understand when GPU parallelism becomes beneficial
This practical builds intuition about hardware-level parallelism and performance scaling.
Background
Modern deep learning relies heavily on GPU acceleration. However, GPUs are not always faster than CPUs.
Important Note on Parallelism
Both experiments in this practical are parallel:
- The CPU version uses multi-threaded BLAS libraries (limited number of cores).
- The GPU version uses thousands of CUDA cores (massively parallel architecture).
The key difference is not whether they are parallel; it is the scale of parallelism.
- CPU → limited parallelism (few powerful cores).
- GPU → massive parallelism (thousands of smaller cores).
The performance depends on:
- Matrix size
- Parallel workload
- Memory bandwidth
- Kernel launch overhead
In this practical, you will experimentally observe these effects.
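To build intuition for how these factors interact on the CPU side alone, the sketch below (an illustrative addition, not part of the practical script; the helper name matmul_gflops is ours) times a single matrix multiplication and converts it to achieved GFLOP/s, using the standard count of roughly 2N^3 floating-point operations for an N x N matmul. Small matrices leave most of the parallel hardware idle, so throughput is far below peak:

```python
import time
import numpy as np

def matmul_gflops(n: int, repeats: int = 3) -> float:
    """Return achieved GFLOP/s for an n x n matmul (~2*n**3 FLOPs)."""
    a = np.random.randn(n, n)
    b = np.random.randn(n, n)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # multi-threaded BLAS call
        times.append(time.perf_counter() - start)
    # Use the best run: it is least polluted by OS scheduling noise.
    return (2 * n**3) / min(times) / 1e9

# Throughput typically rises with N as the BLAS threads get enough work.
for n in (64, 512, 2048):
    print(f"N={n:5d}: {matmul_gflops(n):8.2f} GFLOP/s")
```

The same saturation effect, at a much larger scale, is what you will observe on the GPU in the tasks below.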
Task 1 — Implement CPU vs GPU Benchmark
The script measures matrix multiplication performance on both CPU and GPU, averages multiple runs, and computes speedup.
1) Create a file named: practical2.py
2) Copy the following full script into the file:
import numpy as np
import torch
import time

# ----------------------------
# Configuration
# ----------------------------
SIZES = [1000, 2000, 4000]
REPEATS = 3

print("=" * 60)
print("CPU vs GPU Matrix Multiplication Benchmark")
print("=" * 60)
print(f"PyTorch CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
print()

# ----------------------------
# Benchmark Loop
# ----------------------------
for N in SIZES:
    print(f"\nMatrix Size: {N} x {N}")
    print("-" * 40)

    # ----------------------------
    # CPU (NumPy)
    # ----------------------------
    A_cpu = np.random.randn(N, N)
    B_cpu = np.random.randn(N, N)

    cpu_times = []
    for _ in range(REPEATS):
        start = time.time()
        C_cpu = A_cpu @ B_cpu
        end = time.time()
        cpu_times.append(end - start)
    cpu_avg = sum(cpu_times) / REPEATS
    print(f"CPU Avg Time: {cpu_avg:.4f} sec")

    # ----------------------------
    # GPU (PyTorch)
    # ----------------------------
    if torch.cuda.is_available():
        A_gpu = torch.randn(N, N, device="cuda")
        B_gpu = torch.randn(N, N, device="cuda")

        # Warm-up (important for GPU timing)
        torch.matmul(A_gpu, B_gpu)
        torch.cuda.synchronize()

        gpu_times = []
        for _ in range(REPEATS):
            torch.cuda.synchronize()
            start = time.time()
            C_gpu = torch.matmul(A_gpu, B_gpu)
            torch.cuda.synchronize()
            end = time.time()
            gpu_times.append(end - start)
        gpu_avg = sum(gpu_times) / REPEATS
        print(f"GPU Avg Time: {gpu_avg:.4f} sec")

        speedup = cpu_avg / gpu_avg
        print(f"Speedup (CPU/GPU): {speedup:.2f}x")

print("=" * 60)
3) Run the script:
python practical2.py
Task 2 — Scaling Study
1) Increase the matrix size (e.g., 1000, 2000, 4000, 6000, 8000).
2) Create a results table:
| Matrix Size | CPU Time (sec) | GPU Time (sec) | Speedup |
3) Plot matrix size vs speedup.
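One way to produce the plot is with matplotlib. The sketch below is a starting point, not a required solution; the numbers in sizes and speedups are placeholders that you must replace with the measurements from your own table:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

# Placeholder data -- substitute the results from your Task 2 table.
sizes = [1000, 2000, 4000, 6000, 8000]
speedups = [0.8, 2.5, 6.0, 9.0, 12.0]

plt.figure()
plt.plot(sizes, speedups, marker="o")
plt.axhline(1.0, linestyle="--", color="gray", label="CPU = GPU")  # break-even line
plt.xlabel("Matrix size N")
plt.ylabel("Speedup (CPU time / GPU time)")
plt.title("CPU vs GPU speedup scaling")
plt.legend()
plt.savefig("speedup.png", dpi=150)
```

The horizontal line at 1.0 marks the break-even point: sizes where the curve sits below it are faster on the CPU.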
Task 3 — Analysis Questions
Answer the following:
- For which matrix size does GPU become faster than CPU?
- Why is GPU sometimes slower for small matrices?
- Why do we call torch.cuda.synchronize() before measuring time?
- Why does speedup increase as matrix size increases?
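The synchronize() question can be explored empirically. CUDA kernel launches are asynchronous: torch.matmul returns as soon as the kernel is queued, not when it finishes. The sketch below (the helper name timed_matmul is ours; it returns None on machines without a GPU) times the same multiplication with and without synchronization so you can compare the two numbers:

```python
import time
import torch

def timed_matmul(n: int = 2048):
    """Time one GPU matmul with and without torch.cuda.synchronize()."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    # Warm-up so compilation/initialization cost is excluded.
    torch.matmul(a, b)
    torch.cuda.synchronize()

    # Without synchronize: we only time the asynchronous kernel *launch*.
    start = time.time()
    torch.matmul(a, b)
    launch_only = time.time() - start

    # With synchronize: we wait until the computation actually finishes.
    start = time.time()
    torch.matmul(a, b)
    torch.cuda.synchronize()
    full_compute = time.time() - start
    return launch_only, full_compute
```

If you run this on a CUDA machine, the unsynchronized time is typically far smaller, which is exactly why omitting synchronize() makes GPU benchmarks look misleadingly fast.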
Task 4 — Monitor GPU Usage
1) While running the script, open another terminal and execute:
watch -n 1 nvidia-smi
2) Observe:
- GPU utilization (%)
- Memory usage
- Power consumption