Homework 1
General Instructions & Submission
Release & Deadline
- Release date: 2 April 2026
- Deadline: 16 April 2026 (23:59)
Submission Requirements
Each student must submit:
1) Code files
- All modified scripts used in the homework
- Must be runnable
2) Report (single PDF file), including:
- Answers to all questions
- Tables of Results
- Short explanations
Task 1—Reproducible Benchmark Setup (3 Points)
Modify the DDP script to:
1) Fix randomness:
Add at the beginning of your script:
import torch

torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
2) Log per-step time (rank 0 only)
Measure the time for each training step using time.time():
- Start the timer before the forward pass
- End the timer after the optimizer step
- Print the time only on rank 0
3) Run at least 30 steps, ignoring the first 5 (warmup) steps when computing averages
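The timing loop described above can be sketched as follows. `train_step` is a placeholder for your actual forward/backward/optimizer code, and the rank is read from the `RANK` environment variable that torchrun sets per process; `time.perf_counter()` is used here because it is better suited to interval timing than `time.time()`.

```python
import os
import time

def train_step():
    # Placeholder for: forward pass, loss, backward pass, optimizer.step()
    time.sleep(0.001)

RANK = int(os.environ.get("RANK", "0"))  # set by torchrun; 0 when run standalone
WARMUP, STEPS = 5, 30

step_times = []
for step in range(STEPS):
    start = time.perf_counter()            # start timer before the forward pass
    train_step()
    elapsed = time.perf_counter() - start  # stop after the optimizer step
    if step >= WARMUP:                     # discard the first 5 warmup steps
        step_times.append(elapsed)
    if RANK == 0:
        print(f"step {step}: {elapsed * 1000:.2f} ms")

if RANK == 0:
    avg = sum(step_times) / len(step_times)
    print(f"avg time/step (excluding warmup): {avg * 1000:.2f} ms")
```

In the real script, replace `train_step` with your training code and keep the print statements guarded by the rank check so only one process writes to the log.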
Submission:
1) Your modified DDP script
2) A short report (max 1 page) answering:
- Q1) Why is warmup needed in GPU benchmarking?
- Q2) Why must we control randomness?
Task 2—Strong Scaling Analysis (3 Points)
1) Keep global batch size fixed (64), and run:
torchrun --nproc_per_node=1 ...
torchrun --nproc_per_node=2 ...
torchrun --nproc_per_node=4 ...
2) Compute:

| #GPUs | Time/step | Speedup | Efficiency |
|-------|-----------|---------|------------|
| 1     | 1.0       | 1.0     |            |
| 2     |           |         |            |
| 4     |           |         |            |
Submission:
1) Your modified DDP script
2) Completed table
3) Short explanation:
- Q1) Is scaling linear?
- Q2) Where does efficiency drop?
- Q3) Give a quantitative explanation (not just words)
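The Speedup and Efficiency columns follow directly from the measured times: S(n) = T(1)/T(n) and E(n) = S(n)/n. A small helper (the numbers in the example are made up; substitute your measurements) might look like:

```python
def speedup_and_efficiency(t1, tn, n):
    # Speedup S = T(1) / T(n); efficiency E = S / n
    s = t1 / tn
    return s, s / n

# Hypothetical example: 1-GPU step takes 1.0 units, 2-GPU step takes 0.6 units
s, e = speedup_and_efficiency(t1=1.0, tn=0.6, n=2)
print(f"speedup={s:.2f}, efficiency={e:.2f}")
```

Efficiency of 1.0 means perfect linear scaling; values below 1.0 quantify the overhead added by communication and synchronization.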
Task 3—Communication vs Computation (3 Points)
Create two scenarios:
- Case A — Small model (communication dominates)
  DIM = 1024
  DEPTH = 2
- Case B — Large model (computation dominates)
  DIM = 8192
  DEPTH = 8
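To reason about the two cases before running them, note that DDP all-reduces one gradient value per parameter every step. Assuming the model is roughly a stack of DEPTH square linear layers of width DIM (an assumption — check your actual script), a back-of-the-envelope estimate of parameter count and gradient traffic is:

```python
def model_stats(dim, depth, bytes_per_param=4):
    # Rough estimate assuming `depth` square linear layers of width `dim`
    # (weight matrices only; biases ignored). fp32 = 4 bytes per parameter.
    params = depth * dim * dim
    grad_bytes = params * bytes_per_param  # all-reduced by DDP every step
    return params, grad_bytes

small = model_stats(1024, 2)   # Case A
large = model_stats(8192, 8)   # Case B
print("small (params, grad bytes):", small)
print("large (params, grad bytes):", large)
print("large/small gradient-traffic ratio:", large[1] / small[1])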
Submission:
1) Completed table
| Case  | #GPUs | Time/step | Speedup |
|-------|-------|-----------|---------|
| Small | 1     | 1.0       |         |
| Small | 2     |           |         |
| Small | 4     |           |         |
| Large | 1     | 1.0       |         |
| Large | 2     |           |         |
| Large | 4     |           |         |
2) Short explanation:
- Q1) In which case is DDP more efficient?
- Q2) Why does performance differ between small and large models?
- Q3) When does communication become the bottleneck?
Task 4—DataParallel vs DDP (Deep Comparison) (3 Points)
1) Run both:
- DP (Practical 5)
- DDP (this week)
2) Analyze:
- GPU utilization (via nvidia-smi)
- Step time variance
- Memory usage
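One way to collect the utilization and memory numbers is to sample nvidia-smi in CSV mode (e.g. `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`) while training runs, then parse the output. The parser below is an illustrative sketch; the sample string stands in for real captured output.

```python
def parse_smi_csv(text):
    # Parses output of:
    #   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
    #              --format=csv,noheader,nounits
    rows = []
    for line in text.strip().splitlines():
        idx, util, mem = [field.strip() for field in line.split(",")]
        rows.append({"gpu": int(idx), "util_pct": int(util), "mem_mib": int(mem)})
    return rows

# Hypothetical captured output: GPU 0 busy and full, GPU 1 mostly idle
sample = "0, 97, 10240\n1, 32, 4096\n"
stats = parse_smi_csv(sample)
print(stats)
```

A pattern like the sample above (GPU 0 saturated, other GPUs underused) is exactly the DP bottleneck signature that Q1 asks about.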
Submission:
- Q1) Why does DP suffer from a bottleneck on GPU 0?
- Q2) Why does DDP scale better architecturally?
- Q3) In what scenario could DP still be acceptable?
Task 5—Research Challenge (3 Points)
1) Choose one:
Option A — Artificial Communication Delay
- Add delay before backward:
import time
time.sleep(0.01)
- Analyze:
- Q1) How does this affect scaling?
- Q2) Does speedup degrade linearly or non-linearly?
- Q3) Relate your observation to Amdahl’s Law
Option B — Batch Size Scaling Law
- Keep GPUs fixed (e.g., 4), vary batch:
32, 64, 128, 256
- Analyze:
- Q1) Does larger batch improve scaling?
- Q2) When does performance saturate?
Option C — Imbalance Experiment
- Modify workload:
if dist.get_rank() == 0:
    time.sleep(0.02)
- Analyze:
- Q1) What happens to overall training time?
- Q2) What does this reveal about synchronization?
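For Option A's Amdahl's Law question, it helps to compute the predicted speedup for a given serial (non-parallelizable) fraction f: S(n) = 1 / (f + (1 - f)/n). The delay you inject is such a serial fraction. The numbers below are hypothetical (a 10 ms fixed delay on a 100 ms step, i.e. f = 0.1):

```python
def amdahl_speedup(serial_frac, n):
    # Amdahl's Law: S(n) = 1 / (f + (1 - f) / n)
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# Hypothetical: f = 0.1 (e.g. a 10 ms delay on a 100 ms step)
for n in (1, 2, 4):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

Comparing these predictions against your measured speedups shows whether the degradation is explained by the injected serial fraction alone or by additional synchronization costs.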
2) Students must:
- Use numbers (not opinions)
- Show tables + short reasoning
- Explain why, not just what
Submission:
1) Code for your experiment
2) Table of results
3) Short explanation (12-15 lines max)