Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030), 2025/26 spring


Practical 4

Understanding Parallelism (Concept + Code)


Objective

In this session, we explore the core ideas behind parallelism in deep learning:

  • Data Parallelism
  • Model Parallelism

We will simulate these concepts using simple PyTorch code.


Part 1: Baseline (Single GPU)
import torch
import torch.nn as nn

# Select device: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple neural network (2 linear layers)
model = nn.Sequential(
    nn.Linear(1000, 2000),  # First layer
    nn.ReLU(),              # Activation
    nn.Linear(2000, 1000)   # Second layer
).to(device)  # Move model to device

# Create dummy input data (batch of 64 samples, each of size 1000)
x = torch.randn(64, 1000).to(device)

# Forward pass: data goes through the model
output = model(x)

# Print output shape
print("Output shape:", output.shape)
Key Idea
  • The entire model runs on a single device
  • The full batch is processed at once; this is our baseline (no parallelism)
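To make the baseline concrete, the forward pass can be timed. Below is a small sketch under the same setup as above; the warm-up pass and the `torch.cuda.synchronize()` calls are there because CUDA kernels launch asynchronously, so without them the measured time would not reflect the actual computation.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(1000, 2000),
    nn.ReLU(),
    nn.Linear(2000, 1000)
).to(device)
x = torch.randn(64, 1000, device=device)

# Warm-up pass so one-time initialisation does not skew the measurement
with torch.no_grad():
    model(x)

if device.type == "cuda":
    torch.cuda.synchronize()  # wait for all queued kernels before starting the clock
start = time.perf_counter()
with torch.no_grad():
    output = model(x)
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for the forward pass to actually finish
elapsed = time.perf_counter() - start

print(f"One forward pass over the full batch: {elapsed * 1e3:.2f} ms")
print("Output shape:", output.shape)
```

Keep this number in mind: the parallel variants below only pay off when they beat this single-device time.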

Part 2: Simulated Data Parallelism
# Split the batch into 2 smaller chunks
# Example: 64 samples → 2 chunks of 32
chunks = torch.chunk(x, 2)

outputs = []

# Loop over each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i} with shape {chunk.shape}")

    # The SAME model processes each chunk
    # In real Data Parallelism, this happens on different GPUs
    out = model(chunk)

    outputs.append(out)

# Combine outputs back into a single tensor
final_output = torch.cat(outputs)

print("Final output shape:", final_output.shape)
Key Idea
  • We split the data, not the model
  • The same model processes different parts of the batch
In real systems:
  • Each chunk runs on a different GPU
  • Results are combined at the end
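During training, real data parallelism also involves the backward pass: each replica computes gradients on its own chunk, and the gradients are averaged (an all-reduce) before the optimizer step. Below is a minimal CPU-only sketch of that idea, using two deep copies of one linear layer as stand-ins for per-GPU replicas; it illustrates the math, not DistributedDataParallel's actual implementation.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1000, 10)                           # the "global" model
replicas = [copy.deepcopy(model) for _ in range(2)]   # one replica per simulated GPU

x = torch.randn(64, 1000)
target = torch.randn(64, 10)

# Each replica sees only its own chunk of the batch
for replica, cx, ct in zip(replicas, torch.chunk(x, 2), torch.chunk(target, 2)):
    loss = nn.functional.mse_loss(replica(cx), ct)
    loss.backward()  # gradients stay local to the replica

# Simulated all-reduce: average the per-replica gradients into the global model
for global_p, *replica_ps in zip(model.parameters(),
                                 *[r.parameters() for r in replicas]):
    global_p.grad = torch.stack([p.grad for p in replica_ps]).mean(dim=0)

print("Averaged gradient shape:", model.weight.grad.shape)
```

Because both chunks have the same size, the averaged gradient equals the gradient of one full-batch backward pass, which is why data parallelism does not change the training math.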

Part 3: Simulated Model Parallelism
# Check if at least 2 GPUs are available
if torch.cuda.device_count() >= 2:

    # Define two devices
    device0 = torch.device("cuda:0")
    device1 = torch.device("cuda:1")

    # Split the model into two parts
    layer1 = nn.Linear(1000, 2000).to(device0)  # First layer on GPU 0
    layer2 = nn.Linear(2000, 1000).to(device1)  # Second layer on GPU 1

    # Input data starts on GPU 0
    x = torch.randn(64, 1000).to(device0)

    # Forward pass through first layer (GPU 0)
    out = layer1(x)

    # Move intermediate result to GPU 1
    out = out.to(device1)

    # Continue forward pass on GPU 1
    out = layer2(out)

    print("Output shape:", out.shape)

else:
    print("Model parallelism requires at least 2 GPUs.")
Key Idea
  • We split the model, not the data
  • Data flows between GPUs layer by layer
Important:
  • Moving data between GPUs introduces communication overhead
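That overhead can be estimated back-of-the-envelope: every forward pass ships the intermediate activation produced by `layer1` from GPU 0 to GPU 1. The sketch below computes the per-pass transfer size for the model above, assuming float32 activations; it runs on CPU, since only the tensor shape matters.

```python
import torch

batch_size, hidden = 64, 2000            # shape of layer1's output
intermediate = torch.randn(batch_size, hidden)

# Bytes shipped between GPUs per forward pass = elements * bytes per element
bytes_per_pass = intermediate.numel() * intermediate.element_size()
print(f"Intermediate tensor {tuple(intermediate.shape)}: "
      f"{bytes_per_pass / 1e6:.2f} MB per forward pass")  # 0.51 MB for float32
```

Half a megabyte per pass is small, but the transfer also serialises the two GPUs: GPU 1 sits idle while GPU 0 computes, and vice versa, which is what pipeline parallelism later tries to hide.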
Questions:
  • Which approach is easier to implement?
  • Which approach helps when the model is too large?
  • What is the main overhead in model parallelism?
Institute of Computer Science, Faculty of Science and Technology, University of Tartu
In case of technical problems or questions, write to:

Contact the course organizers with organizational and course content questions.
The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.