Practical 9
Pipeline Parallelism in PyTorch
Objective
By the end of this practical, students will:
- Understand pipeline execution
- Implement model partitioning across GPUs
- Apply micro-batching
- Observe pipeline efficiency (bubble problem)
- Understand the differences between:
  - Naive Pipeline (Option A)
  - GPipe (Option B)
  - 1F1B (Option C)
- Analyze how scheduling affects performance
Provided Code Files
Students will work with three versions:
1) Naive Pipeline (Option A):
practical9_OptionA.py
2) GPipe (Option B):
practical9_OptionB.py
3) 1F1B (Option C):
practical9_OptionC.py
Pipeline Scheduling Comparison
| Method | Behavior | Efficiency |
| --- | --- | --- |
| Naive | Sequential forward + backward, one stage active at a time | ❌ Very low |
| GPipe | Forward all micro-batches → backward all micro-batches | ⚠️ Medium |
| 1F1B | Interleaved forward/backward in steady state | ✅ High |
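The efficiency column can be made concrete with a simple cost model. The sketch below is an idealized step-time calculation, assuming p equally sized stages, m micro-batches, equal forward and backward cost (1 time unit each per stage), and zero communication cost; these assumptions are ours, not part of the provided scripts. Note that in this simple model 1F1B has the same step time as GPipe; its practical advantage is that backward starts early, which caps activation memory at roughly p in-flight micro-batches instead of m.

```python
# Idealized step time, in micro-batch stage-time units, for a pipeline
# with p stages and m micro-batches. Assumes equal stage costs and
# ignores communication -- an illustrative model, not a measurement.

def step_time(method, p, m):
    if method == "naive":
        return 2 * p * m            # only one stage active at a time
    if method in ("gpipe", "1f1b"):
        return 2 * (m + p - 1)      # pipeline fill + steady state + drain
    raise ValueError(method)

for method in ("naive", "gpipe", "1f1b"):
    t = step_time(method, p=4, m=8)
    # Useful work per stage is 2*m units; the rest is bubble.
    print(f"{method}: {t} units, utilization = {2 * 8 / t:.0%}")
```

With p=4 and m=8 the naive schedule keeps each GPU busy only 25% of the time, while the pipelined schedules reach about 73% in this model.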
Part 1: Run and Observe (Option A)
Understand naive pipeline behavior
1) Instructions
- Run Option A code
- Observe:
  - Step time
  - GPU utilization (qualitatively)
2) Questions
- Q1) Are GPUs working in parallel?
- Q2) Where do you expect idle time?
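Before answering Q1 and Q2, it can help to trace the timeline by hand. The sketch below is a minimal simulation of a naive 2-stage pipeline (our own illustration, not the provided script): the whole batch moves forward stage by stage, then backward in reverse, so at any instant exactly one stage is busy.

```python
# Timeline of a naive pipeline for one full batch, assuming each
# stage takes 1 time unit for forward and 1 for backward.

def naive_timeline(num_stages):
    events = []
    t = 0
    for s in range(num_stages):            # forward, stage by stage
        events.append((t, f"F{s}"))
        t += 1
    for s in reversed(range(num_stages)):  # backward, in reverse order
        events.append((t, f"B{s}"))
        t += 1
    return events

print(naive_timeline(2))
# Every time step has exactly one active stage; the other GPU idles,
# so the GPUs are never working in parallel.
```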
Part 2: Analyze GPipe (Option B)
Understand pipeline fill and drain
1) Instructions
- Run Option B code
- Compare with Option A:
  - Execution time
  - Loss behavior
2) Questions
- Q1) What changed compared to Option A?
- Q2) Why do we separate forward and backward?
- Q3) Does this remove pipeline bubbles?
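The fill-and-drain pattern behind Q3 can be written out as a schedule. The sketch below generates a GPipe-style schedule for p stages and m micro-batches, under the same equal-cost assumptions as before (our illustration, not the provided code): forwards propagate as a wavefront during the fill phase, and backwards only begin once the last forward has finished, which is exactly where the bubbles remain.

```python
# GPipe-style schedule: all forwards first, then all backwards.
# Returns, per stage, a list of (start_time, op) pairs.

def gpipe_schedule(p, m):
    sched = {s: [] for s in range(p)}
    for i in range(m):
        for s in range(p):
            sched[s].append((i + s, f"F{i}"))   # forward wavefront
    fwd_end = m + p - 1                          # last forward completes
    for i in range(m):
        for s in range(p):
            # backward wavefront runs from the last stage back to stage 0
            sched[s].append((fwd_end + i + (p - 1 - s), f"B{i}"))
    return sched
```

The total step time comes out to 2(m + p − 1) units: micro-batching shrinks the bubbles at the ends of each phase, but the fill and drain periods never disappear entirely.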
Part 3: Analyze 1F1B (Option C)
Understand overlapping execution
1) Instructions
- Run Option C code
- Compare with:
  - Option A
  - Option B
2) Questions
- Q1) When does backward start?
- Q2) What is different from GPipe?
- Q3) Why is this more efficient?
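A useful way to see the answer to Q1 is to list the operations the last stage performs under 1F1B. The sketch below is our own illustration (not the provided script): after a short warm-up, the last stage alternates one forward with one backward, so backward work begins as soon as the first micro-batch finishes its forward pass, long before all forwards are done.

```python
# Operation order on the LAST stage under 1F1B for m micro-batches.
# Warm-up: the first forward arrives. Steady state: alternate one
# backward with one forward. Cool-down: drain the final backward.

def last_stage_ops_1f1b(m):
    ops = ["F0"]                          # warm-up
    for i in range(1, m):
        ops += [f"B{i-1}", f"F{i}"]       # steady state: 1F, 1B
    ops.append(f"B{m-1}")                 # cool-down
    return ops

print(last_stage_ops_1f1b(4))
# ['F0', 'B0', 'F1', 'B1', 'F2', 'B2', 'F3', 'B3']
```

Contrast this with GPipe, where the same stage would run F0 F1 F2 F3 and only then B0 B1 B2 B3: 1F1B frees each micro-batch's activations right after its backward, so far fewer activations are held in memory at once.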
Part 4: Modify and Experiment
Understand impact of micro-batching
1) Change
Set `NUM_MICROBATCHES` to 2, 4, and 8.
2) Run for each option:
  - A
  - B
  - C
3) Record Results
| Method | µB=2 | µB=4 | µB=8 |
| --- | --- | --- | --- |
| A | | | |
| B | | | |
| C | | | |
4) Questions
- Q1) Does increasing micro-batches always help?
- Q2) Which method benefits most?
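Q1 has a quantitative angle. In the standard idealized model (equal stage costs, no communication overhead; our assumption, not a measurement), the fraction of a pipelined step lost to bubbles is (p − 1) / (m + p − 1) for p stages and m micro-batches, so increasing m helps, but with diminishing returns, and smaller micro-batches can also hurt per-kernel GPU efficiency in practice.

```python
# Bubble fraction of an idealized pipelined step: the fill/drain
# overhead (p - 1) divided by the total step length (m + p - 1).

def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for m in (2, 4, 8):
    print(f"p=4, m={m}: bubble = {bubble_fraction(4, m):.1%}")
```

For p=4, doubling m from 2 to 4 cuts the bubble from 60% to about 43%, while doubling again only reaches about 27%: each doubling buys less.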
Part 5: Code Understanding
1) Analyze Option C (1F1B)
Focus on this part of the micro-batch loop:

```python
if i > 0:
    prev_out = forward_outputs[i - 1]
    prev_target = targets[i - 1]
    loss = loss_fn(prev_out, prev_target)
    loss.backward()
```
2) Questions
- Q1) Why do we delay backward by one step?
- Q2) What would happen if we remove the `i > 0` check?
- Q3) What happens to pipeline overlap?
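The effect of the `i > 0` guard can be modeled without any GPUs. The sketch below is our own torch-free illustration (function and variable names are hypothetical, not from the provided script): at step i the loop launches the forward for micro-batch i and, in the same step, runs the backward for micro-batch i − 1, so the two can overlap on different stages; removing the delay would force forward and backward of the same micro-batch into one step, collapsing the overlap back to a GPipe-like dependency.

```python
# The one-step backward delay from Option C, as a pure scheduling
# sketch: each step pairs forward(i) with backward(i - 1), plus a
# final drain step for the last backward.

def one_f_one_b_steps(num_microbatches):
    steps = []
    for i in range(num_microbatches):
        ops = [f"forward({i})"]
        if i > 0:                  # delay backward by one iteration
            ops.append(f"backward({i - 1})")
        steps.append(ops)
    steps.append([f"backward({num_microbatches - 1})"])  # drain
    return steps

for step in one_f_one_b_steps(3):
    print(step)
```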
Final Discussion
Discuss the following:
- Why is Option A not true pipeline parallelism?
- What is the main limitation of GPipe?
- How does 1F1B reduce pipeline bubbles?
- Which approach would you use in large-scale systems?