Practical 4
Understanding Parallelism (Concept + Code)
Objective
In this session, we explore the core ideas behind parallelism in deep learning:
- Data Parallelism
- Model Parallelism
We will simulate these concepts using simple PyTorch code.
Part 1: Baseline (Single GPU)
import torch
import torch.nn as nn
# Select device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Define a simple neural network (2 linear layers)
model = nn.Sequential(
    nn.Linear(1000, 2000),  # First layer
    nn.ReLU(),              # Activation
    nn.Linear(2000, 1000)   # Second layer
).to(device)                # Move model to device
# Create dummy input data (batch of 64 samples, each of size 1000)
x = torch.randn(64, 1000).to(device)
# Forward pass: data goes through the model
output = model(x)
# Print output shape
print("Output shape:", output.shape)
Key Idea
- The entire model runs on a single device
- The full batch is processed at once. This is our baseline (no parallelism)
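To make comparisons with the parallel variants concrete, the baseline can be timed. A minimal sketch, assuming the same two-layer model as above; note that on a GPU, torch.cuda.synchronize() is needed because kernel launches are asynchronous:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000)
).to(device)
x = torch.randn(64, 1000).to(device)

# Warm-up pass so one-time setup costs are excluded from the measurement
model(x)
if device.type == "cuda":
    torch.cuda.synchronize()

start = time.perf_counter()
output = model(x)
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for all GPU kernels to finish
elapsed = time.perf_counter() - start

print(f"Forward pass took {elapsed * 1e3:.2f} ms, output shape {tuple(output.shape)}")
```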
Part 2: Simulated Data Parallelism
# Split the batch into 2 smaller chunks
# Example: 64 samples → 2 chunks of 32
chunks = torch.chunk(x, 2)
outputs = []
# Loop over each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i} with shape {chunk.shape}")
    # The SAME model processes each chunk
    # In real Data Parallelism, this happens on different GPUs
    out = model(chunk)
    outputs.append(out)
# Combine outputs back into a single tensor
final_output = torch.cat(outputs)
print("Final output shape:", final_output.shape)
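A useful sanity check is that chunked processing gives (numerically) the same result as the full-batch baseline, since each sample is independent in the forward pass. A small self-contained sketch (torch.allclose with a tolerance is used because floating-point results can differ slightly between batch shapes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000)
)
x = torch.randn(64, 1000)

full = model(x)                                            # whole batch at once
chunked = torch.cat([model(c) for c in torch.chunk(x, 2)]) # two halves, recombined

print("Results match:", torch.allclose(full, chunked, atol=1e-5))
```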
Key Idea
- We split the data, not the model
- The same model processes different parts of the batch
In real systems:
- Each chunk runs on a different GPU
- Results are combined at the end
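In PyTorch, the simplest real implementation of this idea is nn.DataParallel, which scatters each batch across the available GPUs and gathers the outputs (for serious training, torch.nn.parallel.DistributedDataParallel is preferred, but it needs more setup). A minimal sketch that falls back to a single device when fewer than 2 GPUs are present:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000)
)

if torch.cuda.device_count() >= 2:
    # Replicates the model on each GPU; each replica gets a slice of the batch
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

x = torch.randn(64, 1000).to(device)
output = model(x)  # split, run in parallel, and gathered automatically
print("Output shape:", output.shape)
```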
Part 3: Simulated Model Parallelism
# Check if at least 2 GPUs are available
if torch.cuda.device_count() >= 2:
    # Define two devices
    device0 = torch.device("cuda:0")
    device1 = torch.device("cuda:1")
    # Split the model into two parts
    layer1 = nn.Linear(1000, 2000).to(device0)  # First layer on GPU 0
    layer2 = nn.Linear(2000, 1000).to(device1)  # Second layer on GPU 1
    # Input data starts on GPU 0
    x = torch.randn(64, 1000).to(device0)
    # Forward pass through first layer (GPU 0)
    out = layer1(x)
    # Move intermediate result to GPU 1
    out = out.to(device1)
    # Continue forward pass on GPU 1
    out = layer2(out)
    print("Output shape:", out.shape)
else:
    print("Model parallelism requires at least 2 GPUs.")
Key Idea
- We split the model, not the data
- Data flows between GPUs layer by layer
Important:
- Moving data between GPUs introduces communication overhead
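The size of this overhead can be estimated from the shapes in the example above: the intermediate activation that crosses the GPU link has shape (64, 2000) in float32. A quick back-of-the-envelope calculation:

```python
# Intermediate activation that must move from GPU 0 to GPU 1 per forward pass
batch, hidden = 64, 2000   # shape of the output of layer1 in the example above
bytes_per_float32 = 4

transferred = batch * hidden * bytes_per_float32
print(f"{transferred / 1024:.0f} KiB per forward pass")  # 500 KiB
```

During training the cost roughly doubles, because the gradient of the same activation must travel back across the link in the backward pass.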
Questions:
- Which approach is easier to implement?
- Which approach helps when the model is too large?
- What is the main overhead in model parallelism?