Practical 9
Pipeline Parallelism in PyTorch
Objective
By the end of this practical, students will:
- Understand pipeline execution
- Implement model partitioning across GPUs
- Apply micro-batching
- Observe pipeline efficiency (bubble problem)
- Understand the differences between:
  - Naive Pipeline (Option A)
  - GPipe (Option B)
  - 1F1B (Option C)
- Analyze how scheduling affects performance
Provided Code Files
Students will work with three versions:
1) Naive Pipeline (Option A):
practical9_OptionA.py
2) GPipe (Option B):
practical9_OptionB.py
3) 1F1B (Option C):
practical9_OptionC.py
Pipeline Scheduling Comparison
| Method | Behavior | Efficiency |
| --- | --- | --- |
| Naive | Sequential forward + backward, one stage active at a time | ❌ Very low |
| GPipe | Forward all micro-batches → backward all micro-batches | ⚠️ Medium |
| 1F1B | Interleaved forward/backward in steady state | ✅ High |
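The efficiency column can be made concrete with a simple cost model. The sketch below is an idealized step-time calculation, assuming p equally sized stages, m micro-batches, equal forward and backward cost (1 time unit each per stage), and zero communication cost; these assumptions are ours, not part of the provided scripts. Note that in this simple model 1F1B has the same step time as GPipe; its practical advantage is that backward starts early, which caps activation memory at roughly p in-flight micro-batches instead of m.

```python
# Idealized step time, in micro-batch stage-time units, for a pipeline
# with p stages and m micro-batches. Assumes equal stage costs and
# ignores communication -- an illustrative model, not a measurement.

def step_time(method, p, m):
    if method == "naive":
        return 2 * p * m            # only one stage active at a time
    if method in ("gpipe", "1f1b"):
        return 2 * (m + p - 1)      # pipeline fill + steady state + drain
    raise ValueError(method)

for method in ("naive", "gpipe", "1f1b"):
    t = step_time(method, p=4, m=8)
    # Useful work per stage is 2*m units; the rest is bubble.
    print(f"{method}: {t} units, utilization = {2 * 8 / t:.0%}")
```

With p=4 and m=8 the naive schedule keeps each GPU busy only 25% of the time, while the pipelined schedules reach about 73% in this model.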
Part 1: Run and Observe (Option A)
Understand naive pipeline behavior
1) Instructions
- Run Option A code
- Observe:
  - Step time
  - GPU utilization (qualitatively)
2) Questions
- Q1) Are GPUs working in parallel?
- Q2) Where do you expect idle time?
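Before answering Q1 and Q2, it can help to trace the timeline by hand. The sketch below is a minimal simulation of a naive 2-stage pipeline (our own illustration, not the provided script): the whole batch moves forward stage by stage, then backward in reverse, so at any instant exactly one stage is busy.

```python
# Timeline of a naive pipeline for one full batch, assuming each
# stage takes 1 time unit for forward and 1 for backward.

def naive_timeline(num_stages):
    events = []
    t = 0
    for s in range(num_stages):            # forward, stage by stage
        events.append((t, f"F{s}"))
        t += 1
    for s in reversed(range(num_stages)):  # backward, in reverse order
        events.append((t, f"B{s}"))
        t += 1
    return events

print(naive_timeline(2))
# Every time step has exactly one active stage; the other GPU idles,
# so the GPUs are never working in parallel.
```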
Part 2: Analyze GPipe (Option B)
Understand pipeline fill and drain
1) Instructions
- Run Option B code
- Compare with Option A:
  - Execution time
  - Loss behavior
2) Questions
- Q1) What changed compared to Option A?
- Q2) Why do we separate forward and backward?
- Q3) Does this remove pipeline bubbles?
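The fill-and-drain pattern behind Q3 can be written out as a schedule. The sketch below generates a GPipe-style schedule for p stages and m micro-batches, under the same equal-cost assumptions as before (our illustration, not the provided code): forwards propagate as a wavefront during the fill phase, and backwards only begin once the last forward has finished, which is exactly where the bubbles remain.

```python
# GPipe-style schedule: all forwards first, then all backwards.
# Returns, per stage, a list of (start_time, op) pairs.

def gpipe_schedule(p, m):
    sched = {s: [] for s in range(p)}
    for i in range(m):
        for s in range(p):
            sched[s].append((i + s, f"F{i}"))   # forward wavefront
    fwd_end = m + p - 1                          # last forward completes
    for i in range(m):
        for s in range(p):
            # backward wavefront runs from the last stage back to stage 0
            sched[s].append((fwd_end + i + (p - 1 - s), f"B{i}"))
    return sched
```

The total step time comes out to 2(m + p − 1) units: micro-batching shrinks the bubbles at the ends of each phase, but the fill and drain periods never disappear entirely.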
Part 3: Analyze 1F1B (Option C)
Understand overlapping execution
1) Instructions
- Run Option C code
- Compare with:
  - Option A
  - Option B
2) Questions
- Q1) When does backward start?
- Q2) What is different from GPipe?
- Q3) Why is this more efficient?
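A useful way to see the answer to Q1 is to list the operations the last stage performs under 1F1B. The sketch below is our own illustration (not the provided script): after a short warm-up, the last stage alternates one forward with one backward, so backward work begins as soon as the first micro-batch finishes its forward pass, long before all forwards are done.

```python
# Operation order on the LAST stage under 1F1B for m micro-batches.
# Warm-up: the first forward arrives. Steady state: alternate one
# backward with one forward. Cool-down: drain the final backward.

def last_stage_ops_1f1b(m):
    ops = ["F0"]                          # warm-up
    for i in range(1, m):
        ops += [f"B{i-1}", f"F{i}"]       # steady state: 1F, 1B
    ops.append(f"B{m-1}")                 # cool-down
    return ops

print(last_stage_ops_1f1b(4))
# ['F0', 'B0', 'F1', 'B1', 'F2', 'B2', 'F3', 'B3']
```

Contrast this with GPipe, where the same stage would run F0 F1 F2 F3 and only then B0 B1 B2 B3: 1F1B frees each micro-batch's activations right after its backward, so far fewer activations are held in memory at once.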
Part 4: Modify and Experiment
Understand impact of micro-batching
1) Change
Set `NUM_MICROBATCHES` to 2, 4, and 8.
2) Run for each option:
  - A
  - B
  - C
3) Record Results
| Method | µB=2 | µB=4 | µB=8 |
| --- | --- | --- | --- |
| A | | | |
| B | | | |
| C | | | |
4) Questions
- Q1) Does increasing micro-batches always help?
- Q2) Which method benefits most?
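Q1 has a quantitative angle. In the standard idealized model (equal stage costs, no communication overhead; our assumption, not a measurement), the fraction of a pipelined step lost to bubbles is (p − 1) / (m + p − 1) for p stages and m micro-batches, so increasing m helps, but with diminishing returns, and smaller micro-batches can also hurt per-kernel GPU efficiency in practice.

```python
# Bubble fraction of an idealized pipelined step: the fill/drain
# overhead (p - 1) divided by the total step length (m + p - 1).

def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for m in (2, 4, 8):
    print(f"p=4, m={m}: bubble = {bubble_fraction(4, m):.1%}")
```

For p=4, doubling m from 2 to 4 cuts the bubble from 60% to about 43%, while doubling again only reaches about 27%: each doubling buys less.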
Part 5: Code Understanding
1) Analyze Option C (1F1B)
Focus on this part of the micro-batch loop:

```python
if i > 0:
    prev_out = forward_outputs[i - 1]
    prev_target = targets[i - 1]
    loss = loss_fn(prev_out, prev_target)
    loss.backward()
```
2) Questions
- Q1) Why do we delay backward by one step?
- Q2) What would happen if we remove the `i > 0` check?
- Q3) What happens to pipeline overlap?
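The effect of the `i > 0` guard can be modeled without any GPUs. The sketch below is our own torch-free illustration (function and variable names are hypothetical, not from the provided script): at step i the loop launches the forward for micro-batch i and, in the same step, runs the backward for micro-batch i − 1, so the two can overlap on different stages; removing the delay would force forward and backward of the same micro-batch into one step, collapsing the overlap back to a GPipe-like dependency.

```python
# The one-step backward delay from Option C, as a pure scheduling
# sketch: each step pairs forward(i) with backward(i - 1), plus a
# final drain step for the last backward.

def one_f_one_b_steps(num_microbatches):
    steps = []
    for i in range(num_microbatches):
        ops = [f"forward({i})"]
        if i > 0:                  # delay backward by one iteration
            ops.append(f"backward({i - 1})")
        steps.append(ops)
    steps.append([f"backward({num_microbatches - 1})"])  # drain
    return steps

for step in one_f_one_b_steps(3):
    print(step)
```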
Final Discussion
Discuss the following:
- Why is Option A not true pipeline parallelism?
- What is the main limitation of GPipe?
- How does 1F1B reduce pipeline bubbles?
- Which approach would you use in large-scale systems?