Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring


Practical 4

Understanding Parallelism (Concept + Code)


Objective

In this session, we explore the core ideas behind parallelism in deep learning:

  • Data Parallelism
  • Model Parallelism

We will simulate these concepts using simple PyTorch code.


Part 1: Baseline (Single GPU)
import torch
import torch.nn as nn

# Select device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple neural network (2 linear layers)
model = nn.Sequential(
    nn.Linear(1000, 2000),  # First layer
    nn.ReLU(),              # Activation
    nn.Linear(2000, 1000)   # Second layer
).to(device)  # Move model to device

# Create dummy input data (batch of 64 samples, each of size 1000)
x = torch.randn(64, 1000).to(device)

# Forward pass: data goes through the model
output = model(x)

# Print output shape
print("Output shape:", output.shape)
Key Idea
  • The entire model runs on a single device
  • The full batch is processed at once; this is our baseline (no parallelism)
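To make the baseline concrete, the forward pass can be timed. Below is a small sketch under the same setup as above; the warm-up pass and the `torch.cuda.synchronize()` calls are there because CUDA kernels launch asynchronously, so without them the measured time would not reflect the actual computation.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000)
).to(device)
x = torch.randn(64, 1000, device=device)

# Warm-up pass so one-time initialisation does not skew the measurement
with torch.no_grad():
    model(x)

if device.type == "cuda":
    torch.cuda.synchronize()  # wait for all queued kernels before starting the clock
start = time.perf_counter()
with torch.no_grad():
    output = model(x)
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for the forward pass to actually finish
elapsed = time.perf_counter() - start

print(f"One forward pass over the full batch: {elapsed * 1e3:.2f} ms")
print("Output shape:", output.shape)
```

Keep this number in mind: the parallel variants below only pay off when they beat this single-device time.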

Part 2: Simulated Data Parallelism
# Split the batch into 2 smaller chunks
# Example: 64 samples → 2 chunks of 32
chunks = torch.chunk(x, 2)

outputs = []

# Loop over each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i} with shape {chunk.shape}")

    # The SAME model processes each chunk
    # In real Data Parallelism, this happens on different GPUs
    out = model(chunk)

    outputs.append(out)

# Combine outputs back into a single tensor
final_output = torch.cat(outputs)

print("Final output shape:", final_output.shape)
Key Idea
  • We split the data, not the model
  • The same model processes different parts of the batch
In real systems:
  • Each chunk runs on a different GPU
  • Results are combined at the end
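During training, real data parallelism also involves the backward pass: each replica computes gradients on its own chunk, and the gradients are averaged (an all-reduce) before the optimizer step. Below is a minimal CPU-only sketch of that idea, using two deep copies of one linear layer as stand-ins for per-GPU replicas; it illustrates the math, not DistributedDataParallel's actual implementation.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1000, 10)                           # the "global" model
replicas = [copy.deepcopy(model) for _ in range(2)]   # one replica per simulated GPU

x = torch.randn(64, 1000)
target = torch.randn(64, 10)

# Each replica sees only its own chunk of the batch
for replica, cx, ct in zip(replicas, torch.chunk(x, 2), torch.chunk(target, 2)):
    loss = nn.functional.mse_loss(replica(cx), ct)
    loss.backward()  # gradients stay local to the replica

# Simulated all-reduce: average the per-replica gradients into the global model
for global_p, *replica_ps in zip(model.parameters(),
                                 *[r.parameters() for r in replicas]):
    global_p.grad = torch.stack([p.grad for p in replica_ps]).mean(dim=0)

print("Averaged gradient shape:", model.weight.grad.shape)
```

Because both chunks have the same size, the averaged gradient equals the gradient of one full-batch backward pass, which is why data parallelism does not change the training math.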

Part 3: Simulated Model Parallelism
# Check if at least 2 GPUs are available
if torch.cuda.device_count() >= 2:

    # Define two devices
    device0 = torch.device("cuda:0")
    device1 = torch.device("cuda:1")

    # Split the model into two parts
    layer1 = nn.Linear(1000, 2000).to(device0)  # First layer on GPU 0
    layer2 = nn.Linear(2000, 1000).to(device1)  # Second layer on GPU 1

    # Input data starts on GPU 0
    x = torch.randn(64, 1000).to(device0)

    # Forward pass through first layer (GPU 0)
    out = layer1(x)

    # Move intermediate result to GPU 1
    out = out.to(device1)

    # Continue forward pass on GPU 1
    out = layer2(out)

    print("Output shape:", out.shape)

else:
    print("Model parallelism requires at least 2 GPUs.")
Key Idea
  • We split the model, not the data
  • Data flows between GPUs layer by layer
Important:
  • Moving data between GPUs introduces communication overhead
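That overhead can be estimated back-of-the-envelope: every forward pass ships the intermediate activation produced by `layer1` from GPU 0 to GPU 1. The sketch below computes the per-pass transfer size for the model above, assuming float32 activations; it runs on CPU, since only the tensor shape matters.

```python
import torch

batch_size, hidden = 64, 2000            # shape of layer1's output
intermediate = torch.randn(batch_size, hidden)

# Bytes shipped between GPUs per forward pass = elements * bytes per element
bytes_per_pass = intermediate.numel() * intermediate.element_size()
print(f"Intermediate tensor {tuple(intermediate.shape)}: "
      f"{bytes_per_pass / 1e6:.2f} MB per forward pass")  # 0.51 MB for float32
```

Half a megabyte per pass is small, but the transfer also serialises the two GPUs: GPU 1 sits idle while GPU 0 computes, and vice versa, which is what pipeline parallelism later tries to hide.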
Questions:
  • Which approach is easier to implement?
  • Which approach helps when the model is too large?
  • What is the main overhead in model parallelism?
Institute of Computer Science, Faculty of Science and Technology, University of Tartu
In case of technical problems or questions, write to:

Contact the course organizers with organizational and course content questions.
The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.