Homework 1
General Instructions & Submission
Release & Deadline
- Release date: 2 April 2026
- Deadline: 16 April 2026 (23:59)
Submission Requirements
Each student must submit:
1) Code files
- All modified scripts used in the homework
- Must be runnable
2) Report (single PDF file), including:
- Answers to all questions
- Tables of Results
- Short explanations
Task 1—Reproducible Benchmark Setup (3 Points)
Modify the DDP script to:
1) Fix randomness:
Add at the beginning of your script:
import torch

torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
2) Log per-step time (rank 0 only)
Measure the time for each training step using time.time():
- Start the timer before the forward pass
- End the timer after the optimizer step
- Print the time only on rank 0
3) Run at least 30 steps, ignoring the first 5 (warmup) steps when computing averages
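The timing loop described above can be sketched as follows. `train_step` is a placeholder for your actual forward/backward/optimizer code, and the rank is read from the `RANK` environment variable that torchrun sets per process; `time.perf_counter()` is used here because it is better suited to interval timing than `time.time()`.

```python
import os
import time

def train_step():
    # Placeholder for: forward pass, loss, backward pass, optimizer.step()
    time.sleep(0.001)

RANK = int(os.environ.get("RANK", "0"))  # set by torchrun; 0 when run standalone
WARMUP, STEPS = 5, 30

step_times = []
for step in range(STEPS):
    start = time.perf_counter()            # start timer before the forward pass
    train_step()
    elapsed = time.perf_counter() - start  # stop after the optimizer step
    if step >= WARMUP:                     # discard the first 5 warmup steps
        step_times.append(elapsed)
    if RANK == 0:
        print(f"step {step}: {elapsed * 1000:.2f} ms")

if RANK == 0:
    avg = sum(step_times) / len(step_times)
    print(f"avg time/step (excluding warmup): {avg * 1000:.2f} ms")
```

In the real script, replace `train_step` with your training code and keep the print statements guarded by the rank check so only one process writes to the log.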
Submission:
1) Your modified DDP script
2) A short report (max 1 page) answering:
- Q1) Why is warmup needed in GPU benchmarking?
- Q2) Why must we control randomness?
Task 2—Strong Scaling Analysis (3 Points)
1) Keep global batch size fixed (64), and run:
torchrun --nproc_per_node=1 ...
torchrun --nproc_per_node=2 ...
torchrun --nproc_per_node=4 ...
2) Compute:

| #GPUs | Time/step | Speedup | Efficiency |
|-------|-----------|---------|------------|
| 1     | 1.0       | 1.0     |            |
| 2     |           |         |            |
| 4     |           |         |            |
Submission:
1) Your modified DDP script
2) Completed table
3) Short explanation:
- Q1) Is scaling linear?
- Q2) Where does efficiency drop?
- Q3) Give a quantitative explanation (not just words)
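The Speedup and Efficiency columns follow directly from the measured times: S(n) = T(1)/T(n) and E(n) = S(n)/n. A small helper (the numbers in the example are made up; substitute your measurements) might look like:

```python
def speedup_and_efficiency(t1, tn, n):
    # Speedup S = T(1) / T(n); efficiency E = S / n
    s = t1 / tn
    return s, s / n

# Hypothetical example: 1-GPU step takes 1.0 units, 2-GPU step takes 0.6 units
s, e = speedup_and_efficiency(t1=1.0, tn=0.6, n=2)
print(f"speedup={s:.2f}, efficiency={e:.2f}")
```

Efficiency of 1.0 means perfect linear scaling; values below 1.0 quantify the overhead added by communication and synchronization.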
Task 3—Communication vs Computation (3 Points)
Create two scenarios:
- Case A — Small model (communication dominates)
  DIM = 1024
  DEPTH = 2
- Case B — Large model (computation dominates)
  DIM = 8192
  DEPTH = 8
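To reason about the two cases before running them, note that DDP all-reduces one gradient value per parameter every step. Assuming the model is roughly a stack of DEPTH square linear layers of width DIM (an assumption — check your actual script), a back-of-the-envelope estimate of parameter count and gradient traffic is:

```python
def model_stats(dim, depth, bytes_per_param=4):
    # Rough estimate assuming `depth` square linear layers of width `dim`
    # (weight matrices only; biases ignored). fp32 = 4 bytes per parameter.
    params = depth * dim * dim
    grad_bytes = params * bytes_per_param  # all-reduced by DDP every step
    return params, grad_bytes

small = model_stats(1024, 2)   # Case A
large = model_stats(8192, 8)   # Case B
print("small (params, grad bytes):", small)
print("large (params, grad bytes):", large)
print("large/small gradient-traffic ratio:", large[1] / small[1])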
Submission:
1) Completed table
| Case  | #GPUs | Time/step | Speedup |
|-------|-------|-----------|---------|
| Small | 1     | 1.0       |         |
| Small | 2     |           |         |
| Small | 4     |           |         |
| Large | 1     | 1.0       |         |
| Large | 2     |           |         |
| Large | 4     |           |         |
2) Short explanation:
- Q1) In which case is DDP more efficient?
- Q2) Why does performance differ between small and large models?
- Q3) When does communication become the bottleneck?
Task 4—DataParallel vs DDP (Deep Comparison) (3 Points)
1) Run both:
- DP (Practical 5)
- DDP (this week)
2) Analyze:
- GPU utilization (via nvidia-smi)
- Step time variance
- Memory usage
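One way to collect the utilization and memory numbers is to sample nvidia-smi in CSV mode (e.g. `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`) while training runs, then parse the output. The parser below is an illustrative sketch; the sample string stands in for real captured output.

```python
def parse_smi_csv(text):
    # Parses output of:
    #   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
    #              --format=csv,noheader,nounits
    rows = []
    for line in text.strip().splitlines():
        idx, util, mem = [field.strip() for field in line.split(",")]
        rows.append({"gpu": int(idx), "util_pct": int(util), "mem_mib": int(mem)})
    return rows

# Hypothetical captured output: GPU 0 busy and full, GPU 1 mostly idle
sample = "0, 97, 10240\n1, 32, 4096\n"
stats = parse_smi_csv(sample)
print(stats)
```

A pattern like the sample above (GPU 0 saturated, other GPUs underused) is exactly the DP bottleneck signature that Q1 asks about.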
Submission:
- Q1) Why does DP suffer from a bottleneck on GPU 0?
- Q2) Why does DDP scale better architecturally?
- Q3) In what scenario could DP still be acceptable?
Task 5—Research Challenge (3 Points)
1) Choose one:
Option A — Artificial Communication Delay
- Add delay before backward:
import time
time.sleep(0.01)
- Analyze:
- Q1) How does this affect scaling?
- Q2) Does speedup degrade linearly or non-linearly?
- Q3) Relate your observation to Amdahl’s Law
Option B — Batch Size Scaling Law
- Keep GPUs fixed (e.g., 4), vary batch:
32, 64, 128, 256
- Analyze:
- Q1) Does larger batch improve scaling?
- Q2) When does performance saturate?
Option C — Imbalance Experiment
- Modify workload:
if dist.get_rank() == 0:
    time.sleep(0.02)
- Analyze:
- Q1) What happens to overall training time?
- Q2) What does this reveal about synchronization?
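For Option A's Amdahl's Law question, it helps to compute the predicted speedup for a given serial (non-parallelizable) fraction f: S(n) = 1 / (f + (1 - f)/n). The delay you inject is such a serial fraction. The numbers below are hypothetical (a 10 ms fixed delay on a 100 ms step, i.e. f = 0.1):

```python
def amdahl_speedup(serial_frac, n):
    # Amdahl's Law: S(n) = 1 / (f + (1 - f) / n)
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# Hypothetical: f = 0.1 (e.g. a 10 ms delay on a 100 ms step)
for n in (1, 2, 4):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

Comparing these predictions against your measured speedups shows whether the degradation is explained by the injected serial fraction alone or by additional synchronization costs.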
2) Students must:
- Use numbers (not opinions)
- Show tables + short reasoning
- Explain why, not just what
Submission:
1) Code for your experiment
2) Table of results
3) Short explanation (12-15 lines max)