Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring

Practical 2

CPU vs GPU: Understanding Hardware Parallelism


Objective

In this practical session, you will:

  • Compare matrix multiplication performance on CPU (NumPy) and GPU (PyTorch CUDA)
  • Measure execution time correctly
  • Analyze speedup behavior
  • Understand when GPU parallelism becomes beneficial

This practical builds intuition about hardware-level parallelism and performance scaling.


Background

Modern deep learning relies heavily on GPU acceleration. However, GPUs are not always faster than CPUs.

Important Note on Parallelism
Both experiments in this practical are parallel:

  • The CPU version uses multi-threaded BLAS libraries (limited number of cores).
  • The GPU version uses thousands of CUDA cores (massively parallel architecture).

The key difference is not whether they are parallel, but the scale of parallelism.

  • CPU → limited parallelism (few powerful cores).
  • GPU → massive parallelism (thousands of smaller cores).
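To make the scale difference concrete, both counts can be queried from Python. A minimal sketch (note that `multi_processor_count` is the number of streaming multiprocessors, and each SM contains many CUDA cores, so the total core count is far higher):

```python
import os

import torch

# Logical CPU cores visible to the OS (typically a few to a few dozen)
print(f"CPU logical cores: {os.cpu_count()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Each streaming multiprocessor (SM) holds many CUDA cores;
    # the total core count is SMs times cores-per-SM (architecture-dependent).
    print(f"GPU streaming multiprocessors: {props.multi_processor_count}")
else:
    print("No CUDA device visible")
```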

The performance depends on:

  • Matrix size
  • Parallel workload
  • Memory bandwidth
  • Kernel launch overhead

In this practical, you will experimentally observe these effects.
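One way to quantify these effects is arithmetic intensity: an N x N multiplication performs about 2N^3 floating-point operations while touching only about 3N^2 matrix elements, so the compute-to-memory ratio grows linearly with N. A rough back-of-the-envelope sketch (float64, ignoring caches and blocking):

```python
def matmul_stats(n: int) -> tuple[float, float]:
    """Approximate FLOPs and arithmetic intensity of an n x n matmul."""
    flops = 2.0 * n**3              # one multiply + one add per inner-product term
    bytes_moved = 3 * n**2 * 8      # read A and B, write C; 8 bytes per float64
    return flops, flops / bytes_moved

for n in [1000, 2000, 4000]:
    flops, intensity = matmul_stats(n)
    print(f"N={n}: ~{flops:.1e} FLOPs, ~{intensity:.0f} FLOPs/byte")
```

Because this ratio grows with N, larger matrices keep the GPU's many cores busy relative to memory traffic, which is one reason speedup improves with size.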


Code (Download)

Task 1 — Implement CPU vs GPU Benchmark (Download)

The script measures matrix multiplication performance on both CPU and GPU, averages multiple runs, and computes speedup.

1) Create a file named: practical2.py
2) Copy the following full script into the file:

import numpy as np
import torch
import time


# ----------------------------
# Configuration
# ----------------------------

SIZES = [1000, 2000, 4000]
REPEATS = 3

print("=" * 60)
print("CPU vs GPU Matrix Multiplication Benchmark")
print("=" * 60)

print(f"PyTorch CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

print()

# ----------------------------
# Benchmark Loop
# ----------------------------
for N in SIZES:

    print(f"\nMatrix Size: {N} x {N}")
    print("-" * 40)

    # ----------------------------
    # CPU (NumPy)
    # ----------------------------
    A_cpu = np.random.randn(N, N)
    B_cpu = np.random.randn(N, N)

    cpu_times = []

    for _ in range(REPEATS):
        start = time.perf_counter()
        C_cpu = A_cpu @ B_cpu
        end = time.perf_counter()
        cpu_times.append(end - start)

    cpu_avg = sum(cpu_times) / REPEATS
    print(f"CPU Avg Time: {cpu_avg:.4f} sec")

    # ----------------------------
    # GPU (PyTorch)
    # ----------------------------
    if torch.cuda.is_available():

        A_gpu = torch.randn(N, N, device="cuda")
        B_gpu = torch.randn(N, N, device="cuda")

        # Warm-up (important for GPU timing)
        torch.matmul(A_gpu, B_gpu)
        torch.cuda.synchronize()

        gpu_times = []

        for _ in range(REPEATS):
            torch.cuda.synchronize()
            start = time.perf_counter()

            C_gpu = torch.matmul(A_gpu, B_gpu)

            torch.cuda.synchronize()
            end = time.perf_counter()

            gpu_times.append(end - start)

        gpu_avg = sum(gpu_times) / REPEATS
        print(f"GPU Avg Time: {gpu_avg:.4f} sec")

        speedup = cpu_avg / gpu_avg
        print(f"Speedup (CPU/GPU): {speedup:.2f}x")

    else:
        print("CUDA not available; skipping GPU benchmark")

    print("=" * 60)

3) Run the script: python practical2.py


Task 2 — Scaling Study

1) Increase the matrix size (e.g., 1000, 2000, 4000, 6000, 8000).

2) Create a results table:

Matrix Size | CPU Time (sec) | GPU Time (sec) | Speedup
------------|----------------|----------------|--------


3) Plot matrix size vs speedup.
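The table for steps 2 and 3 can also be collected programmatically; a minimal sketch using only the standard library (the numbers below are placeholders, to be replaced with the averages printed by practical2.py):

```python
import csv

# Placeholder measurements: (matrix size, CPU seconds, GPU seconds).
# Replace these with your own benchmark output.
results = [
    (1000, 0.1, 0.1),
    (2000, 0.1, 0.1),
]

with open("practical2_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Matrix Size", "CPU Time (sec)", "GPU Time (sec)", "Speedup"])
    for n, cpu, gpu in results:
        writer.writerow([n, cpu, gpu, round(cpu / gpu, 2)])
```

The resulting CSV can then be plotted (matrix size on the x-axis, speedup on the y-axis) with matplotlib or a spreadsheet.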


Task 3 — Analysis Questions

Answer the following:

  • For which matrix size does GPU become faster than CPU?
  • Why is GPU sometimes slower for small matrices?
  • Why do we call torch.cuda.synchronize() before measuring time?
  • Why does speedup increase as matrix size increases?
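The synchronization question can be checked empirically. CUDA kernel launches are asynchronous: the Python call returns before the multiplication finishes, so without synchronization you would time only the launch. A small sketch (wrapped in a hypothetical helper, and guarded so it also runs without a GPU):

```python
import time

import torch


def timed_gpu_matmul(n: int = 4000):
    """Return (launch_seconds, total_seconds), or None when CUDA is absent."""
    if not torch.cuda.is_available():
        return None
    a = torch.randn(n, n, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    a @ a                          # kernel is launched, not necessarily finished
    launch = time.perf_counter() - t0
    torch.cuda.synchronize()       # block until the kernel actually completes
    total = time.perf_counter() - t0
    return launch, total


timings = timed_gpu_matmul()
if timings is None:
    print("CUDA not available")
else:
    print(f"launch-only: {timings[0]:.5f}s, after sync: {timings[1]:.5f}s")
```

On a GPU, the launch-only time is typically far smaller than the synchronized time, which is why the benchmark above calls torch.cuda.synchronize() around every measurement.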

Task 4 — Monitor GPU Usage

1) While the script is running, open a second terminal and execute: watch -n 1 nvidia-smi
2) Observe:

  • GPU utilization (%)
  • Memory usage
  • Power consumption
The proprietary copyright of the learning materials belongs to the University of Tartu. Use of the materials is permitted for the purposes and under the conditions of free use of a work as provided in the Copyright Act; when using the materials, the user must cite their author. Use of the materials for any other purpose is permitted only with the prior written consent of the University of Tartu.