Parallelism in Deep Learning (LTAT.06.030), spring 2025/26


Practical 8

Layer-wise Model Parallelism in PyTorch


Objective

In this practical session, you will:

  • Implement layer-wise model parallelism
  • Distribute a neural network across multiple GPUs
  • Understand forward data movement between devices
  • Observe GPU utilization imbalance
  • Measure performance limitations

Background

In Lecture 8, we introduced the following:
👉 Model Parallelism = Split the model across GPUs, not the data

  • GPU 0 → first layers
  • GPU 1 → later layers
  • Data moves between GPUs
  • Execution is sequential
  • Not all GPUs work at the same time

In this session, you will observe these properties of model parallelism experimentally.
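
The examples below assume a machine with at least two visible CUDA devices. A quick sanity check of your environment before you start (nothing here is specific to a particular cluster):

import torch

# Confirm that CUDA is usable and that at least two GPUs are visible
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))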


Part 1—Simple Implementation (Core Learning)
import torch
import torch.nn as nn

# Define devices
device0 = torch.device("cuda:0")
device1 = torch.device("cuda:1")

class SimpleMPModel(nn.Module):
    def __init__(self):
        super().__init__()

        # First part → GPU 0
        self.layer1 = nn.Linear(1000, 2000).to(device0)
        self.relu1 = nn.ReLU().to(device0)

        # Second part → GPU 1
        self.layer2 = nn.Linear(2000, 1000).to(device1)
        self.relu2 = nn.ReLU().to(device1)

    def forward(self, x):
        # Move input to GPU 0
        x = x.to(device0)

        # Forward on GPU 0
        x = self.layer1(x)
        x = self.relu1(x)

        # Move to GPU 1  ← KEY STEP
        x = x.to(device1)

        # Forward on GPU 1
        x = self.layer2(x)
        x = self.relu2(x)

        return x

# Test
model = SimpleMPModel()
x = torch.randn(32, 1000)

output = model(x)
print(output.device)

Run and check:
  • Where does computation start?
  • When does .to(device1) happen?
  • Which GPU holds the final output? (see the check sketched below)
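
To answer the last question, and to confirm where each layer's weights live, you can print the device of every parameter. A small check using the model defined above:

# Each parameter stays on the GPU its layer was moved to in __init__
for name, param in model.named_parameters():
    print(name, "->", param.device)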

Part 2—Add Training
import torch.optim as optim

model = SimpleMPModel()
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(5):
    x = torch.randn(32, 1000)
    y = torch.randn(32, 1000).to(device1)  # target on last GPU

    output = model(x)
    loss = loss_fn(output, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Step {step}, Loss: {loss.item():.4f}")

Part 3—Track Data Movement

Modify forward:

print("Before move:", x.device)
x = x.to(device1)
print("After move:", x.device)

👉 This shows the GPU-to-GPU transfer
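
You can also measure the transfer itself. The tensor that crosses the GPU boundary in this model has shape (32, 2000) (the output of layer1), so a rough, self-contained sketch of the cost looks like this (absolute numbers depend on your interconnect, e.g. PCIe vs NVLink):

import time

# Same shape as the activation handed from GPU 0 to GPU 1
act = torch.randn(32, 2000, device=device0)
size_mb = act.nelement() * act.element_size() / 1024**2
print(f"Transfer size: {size_mb:.2f} MB")

torch.cuda.synchronize(device0)
start = time.time()
act = act.to(device1)
torch.cuda.synchronize(device1)
print(f"Transfer time: {time.time() - start:.6f} s")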


Part 4—Measure Time
import time

# Make sure both GPUs are idle before starting the clock
torch.cuda.synchronize(device0)
torch.cuda.synchronize(device1)

start = time.time()
output = model(x)
# Wait for work on both GPUs; synchronize() with no argument only waits for the current device
torch.cuda.synchronize(device0)
torch.cuda.synchronize(device1)
end = time.time()

print("Forward time:", end - start)

Key Observations:
  • GPU 0 works → then GPU 1 works
  • No overlap (not parallel in time); the profiler sketch below makes this visible
  • .to(device) introduces communication cost
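
One way to see the sequential behaviour directly is the PyTorch profiler. The sketch below (assuming a recent PyTorch with torch.profiler) records one forward pass; the exported timeline shows the kernels on cuda:0 finishing before those on cuda:1 begin:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(x)
    torch.cuda.synchronize(device0)
    torch.cuda.synchronize(device1)

# Summary table of GPU kernels, plus a per-GPU timeline for chrome://tracing or Perfetto
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("mp_trace.json")
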
Key Insight (Very Important)

👉 Layer-wise model parallelism:

  • ✔ Solves the memory problem: each GPU only has to hold its part of the model
  • ✖ Does NOT solve the idle-GPU problem: while one GPU computes, the other waits
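
To see the memory half of this trade-off, check how much memory each GPU holds once the model is built; each device stores only its own layers (exact numbers depend on your setup):

# Memory currently allocated by tensors on each device
print("GPU 0:", torch.cuda.memory_allocated(device0) / 1024**2, "MB")
print("GPU 1:", torch.cuda.memory_allocated(device1) / 1024**2, "MB")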
