Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring
Institute of Computer Science, University of Tartu


Practical 4

Understanding Parallelism (Concept + Code)


Objective

In this session, we explore the core ideas behind parallelism in deep learning:

  • Data Parallelism
  • Model Parallelism

We will simulate these concepts using simple PyTorch code.


Part 1: Baseline (Single GPU)
import torch
import torch.nn as nn

# Select device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple neural network (2 linear layers)
model = nn.Sequential(
    nn.Linear(1000, 2000),  # First layer
    nn.ReLU(),              # Activation
    nn.Linear(2000, 1000)   # Second layer
).to(device)  # Move model to device

# Create dummy input data (batch of 64 samples, each of size 1000)
x = torch.randn(64, 1000).to(device)

# Forward pass: data goes through the model
output = model(x)

# Print output shape
print("Output shape:", output.shape)
Key Idea
  • The entire model runs on a single device
  • The full batch is processed at once; this is our baseline (no parallelism)
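To make the baseline concrete, it helps to measure how long a single forward pass takes before we start splitting anything. Below is a minimal timing sketch; the warm-up pass and the `torch.cuda.synchronize()` calls are our additions (not part of the exercise above) and are needed because CUDA kernels launch asynchronously:

```python
import time
import torch
import torch.nn as nn

# Same toy model as above; falls back to CPU when no GPU is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000),
).to(device)
x = torch.randn(64, 1000).to(device)

# Warm-up pass so one-time initialization does not distort the measurement
with torch.no_grad():
    model(x)

# On GPU, synchronize before reading the clock: otherwise we would only
# measure the kernel launch, not the computation itself
if device.type == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output = model(x)
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"Forward pass took {elapsed * 1000:.2f} ms, output shape {tuple(output.shape)}")
```

Keeping this number around lets you check later whether a parallel variant actually helps.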

Part 2: Simulated Data Parallelism
# Split the batch into 2 smaller chunks
# Example: 64 samples → 2 chunks of 32
chunks = torch.chunk(x, 2)

outputs = []

# Loop over each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i} with shape {chunk.shape}")

    # The SAME model processes each chunk
    # In real Data Parallelism, this happens on different GPUs
    out = model(chunk)

    outputs.append(out)

# Combine outputs back into a single tensor
final_output = torch.cat(outputs)

print("Final output shape:", final_output.shape)
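Because the same weights are applied to every chunk, the chunked result should match the single-pass baseline up to floating-point tolerance. A quick self-contained sanity check of that claim (the seed and tolerance are our choices, added so the comparison is reproducible):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000),
).to(device)
x = torch.randn(64, 1000).to(device)

with torch.no_grad():
    full = model(x)                                              # whole batch at once
    chunked = torch.cat([model(c) for c in torch.chunk(x, 2)])   # two halves, then concat

# Same weights either way; only the batching differs, so the outputs
# agree up to floating-point rounding
print("Outputs match:", torch.allclose(full, chunked, atol=1e-5))
```

This is the property that makes data parallelism correct: splitting the batch changes the schedule of the computation, not its result.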
Key Idea
  • We split the data, not the model
  • The same model processes different parts of the batch
In real systems:
  • Each chunk runs on a different GPU
  • Results are combined at the end
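For reference, single-machine data parallelism in PyTorch can be done with `torch.nn.DataParallel`, which scatters the batch across the available GPUs, replicates the model on each, and gathers the outputs automatically. A minimal sketch (it simply runs as an ordinary single-device forward pass when fewer than two GPUs are present; for real workloads PyTorch recommends `DistributedDataParallel` instead):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000),
)

if torch.cuda.device_count() >= 2:
    # Replicates the model on each GPU, splits the batch along dim 0,
    # runs the replicas in parallel, and concatenates the outputs
    model = nn.DataParallel(model).to("cuda")
    x = torch.randn(64, 1000).to("cuda")
else:
    # Without multiple GPUs this is just the plain single-device case
    x = torch.randn(64, 1000)

output = model(x)
print("Output shape:", output.shape)  # same shape either way: (64, 1000)
```

Note that this is exactly the split-process-concatenate pattern simulated above, done for you inside one `model(x)` call.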

Part 3: Simulated Model Parallelism
# Check if at least 2 GPUs are available
if torch.cuda.device_count() >= 2:

    # Define two devices
    device0 = torch.device("cuda:0")
    device1 = torch.device("cuda:1")

    # Split the model into two parts
    layer1 = nn.Linear(1000, 2000).to(device0)  # First layer on GPU 0
    layer2 = nn.Linear(2000, 1000).to(device1)  # Second layer on GPU 1

    # Input data starts on GPU 0
    x = torch.randn(64, 1000).to(device0)

    # Forward pass through first layer (GPU 0)
    out = layer1(x)

    # Move intermediate result to GPU 1
    out = out.to(device1)

    # Continue forward pass on GPU 1
    out = layer2(out)

    print("Output shape:", out.shape)

else:
    print("Model parallelism requires at least 2 GPUs.")
Key Idea
  • We split the model, not the data
  • Data flows between GPUs layer by layer
Important:
  • Moving data between GPUs introduces communication overhead
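A common way to hide part of this overhead is pipeline parallelism: split the batch into micro-batches, so that while stage 2 works on micro-batch i, stage 1 can already start on micro-batch i+1 (the idea behind schemes such as GPipe). The sketch below is device-free: both "stages" are just layers run sequentially on one device, so it only illustrates the order of work, not an actual speedup:

```python
import torch
import torch.nn as nn

stage1 = nn.Linear(1000, 2000)   # would live on GPU 0
stage2 = nn.Linear(2000, 1000)   # would live on GPU 1

x = torch.randn(64, 1000)
micro_batches = torch.chunk(x, 4)  # 4 micro-batches of 16 samples

outputs = []
pending = None  # activation waiting to enter stage 2
with torch.no_grad():
    for mb in micro_batches:
        # On real hardware these two steps run on different GPUs at the
        # same time: stage 2 consumes the previous activation while
        # stage 1 produces the next one
        if pending is not None:
            outputs.append(stage2(pending))
        pending = stage1(mb)
    outputs.append(stage2(pending))  # drain the pipeline

final = torch.cat(outputs)
print("Final shape:", final.shape)  # (64, 1000)
```

With k micro-batches, each GPU is idle only while the pipeline fills and drains, instead of for half of every batch as in the naive two-stage version above.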
Questions:
  • Which approach is easier to implement?
  • Which approach helps when the model is too large?
  • What is the main overhead in model parallelism?
Institute of Computer Science, Faculty of Science and Technology, University of Tartu

For questions about course content and organization, contact the course organizers.
The proprietary copyright of the study materials belongs to the University of Tartu. The materials may be used for the free-use purposes and under the conditions provided in the Copyright Act; when using them, the user must credit the author of the materials.
Using the study materials for any other purpose is permitted only with the prior written consent of the University of Tartu.
Terms of use of the Courses environment