Homework 3
General Instructions & Submission
Release & Deadline
- Release date: 16 April 2026
- Deadline: 30 April 2026 (23:59)
Submission Requirements
Each student must submit:
1) Code files
- All modified scripts used in the homework
- Must be runnable
2) Report (a single PDF file), including:
- Answers to all questions
- Tables of Results
- Short explanations
Base Code
Use the LargeLinearModelMP class code provided in our Lecture 8 practical session.
Use the script distributed with that session as your starting point.
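The Lecture 8 script itself is not reproduced in this handout. For orientation only, here is a minimal sketch of the kind of script the tasks assume, reconstructed from the identifiers they reference (DIM, DEPTH, the if i < depth // 2 split, and the x = x.to(layer_device) baton pass). The values of DIM and DEPTH and the toy training step are assumptions; treat the actual Lecture 8 code as authoritative.

```python
# Minimal sketch, not the official Lecture 8 code: it only mirrors the names
# and structure that the tasks below refer to.
import torch
import torch.nn as nn

DIM = 4096    # width of each linear layer (assumed value)
DEPTH = 6     # number of linear layers (assumed value)

class LargeLinearModelMP(nn.Module):
    """A stack of linear layers split across two GPUs (naive model parallelism)."""

    def __init__(self, dim: int = DIM, depth: int = DEPTH):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(depth):
            # Splitting logic examined in Task 1: first half on GPU 0, rest on GPU 1.
            if i < depth // 2:
                device = torch.device("cuda:0")
            else:
                device = torch.device("cuda:1")
            self.layers.append(nn.Linear(dim, dim).to(device))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            layer_device = next(layer.parameters()).device
            x = x.to(layer_device)   # "baton pass" examined in Task 5
            x = layer(x)
        return x

if __name__ == "__main__":
    model = LargeLinearModelMP()
    x = torch.randn(64, DIM, device="cuda:0")   # assumed batch size
    out = model(x)
    loss = out.pow(2).mean()                    # toy loss, just to exercise backward()
    loss.backward()
    print("output device:", out.device)
```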
Task 1—Current Behavior (3 Points)
Focus on the splitting logic: if i < depth // 2:
1) Do: Run the provided code.
2) Answer:
- Q1) How many layers are assigned to GPU 0 and GPU 1?
- Q2) In the forward() method, at what exact point does the data move from GPU 0 to GPU 1?
Task 2—Increase Model Size (3 Points)
Focus on the configuration constants: DIM and DEPTH.
- Do: Increase DEPTH (e.g., from 6 to 12, then 24).
- Submit: A table comparing DEPTH vs. Step Time; a timing sketch follows this task.
- Answer: Does increasing the number of layers improve speed? Explain why or why not based on your observations.
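One way to collect the Step Time numbers for this table is sketched below. The batch size, warm-up count, and helper names are assumptions, not requirements; any timing method that synchronizes the GPUs before reading the clock is fine.

```python
# Sketch of a step-time measurement; not the only valid approach.
import time
import torch

def sync_all_gpus():
    # Wait for all queued work on every visible GPU before reading the clock.
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)

def average_step_time(model, x, optimizer, criterion, n_warmup=3, n_measure=10):
    """Return the average wall-clock time of one training step, in seconds."""
    def one_step():
        out = model(x)
        loss = criterion(out, torch.zeros_like(out))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    for _ in range(n_warmup):   # warm-up steps are excluded from the measurement
        one_step()
    sync_all_gpus()
    start = time.time()
    for _ in range(n_measure):
        one_step()
    sync_all_gpus()
    return (time.time() - start) / n_measure
```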
Task 3—Use All GPUs (3 Points)
Refactor the model to be dynamic.
- Do: Modify LargeLinearModelMP so it automatically detects all available GPUs (torch.cuda.device_count()) and distributes the layers evenly across all of them (an illustrative mapping follows this task).
- Submit: Your modified __init__ method code.
- Answer: Explain the logic you used to calculate which GPU receives which layer index.
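As an illustration of what "evenly" can mean, one possible mapping is sketched below. It is only a hint, not the required scheme; you may use a different assignment as long as you can explain it in your answer.

```python
# One possible even assignment: each GPU receives a contiguous block of layers.
def gpu_for_layer(i: int, depth: int, num_gpus: int) -> int:
    """Return the GPU index for layer i when depth layers are spread over num_gpus GPUs."""
    return (i * num_gpus) // depth

# Example: depth=20 over 5 GPUs -> layers 0-3 on GPU 0, 4-7 on GPU 1, and so on.
print([gpu_for_layer(i, 20, 5) for i in range(20)])
```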
Task 4—Run with More GPUs (3 Points)
Use your modified code from Task 3.
- Do: Set DEPTH = 20. Run the code using:
  - 2 GPUs
  - 5 GPUs
- Submit: A table showing Step Time and GPU Memory Usage for each configuration; a memory-reporting sketch follows this task.
- Answer: Are all GPUs being utilized? Does adding more GPUs make the training faster in this specific setup?
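One way to fill in the GPU Memory Usage column is sketched below; whether you report peak allocated or reserved memory (or nvidia-smi readings) is your choice, as long as you state it in the report.

```python
# Sketch of per-GPU memory reporting; the function name is an assumption.
import torch

def report_gpu_memory():
    """Print the peak allocated memory per GPU (since the last reset), in MiB."""
    for d in range(torch.cuda.device_count()):
        peak_mib = torch.cuda.max_memory_allocated(d) / (1024 ** 2)
        print(f"GPU {d}: peak allocated {peak_mib:.1f} MiB")

# Call torch.cuda.reset_peak_memory_stats(d) for every device before the measured
# steps, then call report_gpu_memory() afterwards.
```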
Task 5—Data Transfer (3 Points)
Focus on the "Baton Pass" in forward(): x = x.to(layer_device).
- Do: Count how many times x.to(layer_device) is triggered in a single forward pass (an instrumentation sketch follows this task).
- Submit:
  - Number of transfers with 2 GPUs:
  - Number of transfers with 5 GPUs:
- Answer: What is the relationship between the number of GPUs and the total communication overhead?
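If you prefer to count empirically rather than by hand, one option is sketched below. It assumes your model keeps its layers in model.layers, as in the base script, and it counts only the cases where the activation actually changes device (which may differ from how often the line itself executes).

```python
# Sketch of instrumented counting; the helper name is an assumption.
import torch

def count_device_moves(model, x):
    """Replay one forward pass and count how often the activation changes device."""
    moves = 0
    for layer in model.layers:
        layer_device = next(layer.parameters()).device
        if x.device != layer_device:
            moves += 1
        x = x.to(layer_device)
        x = layer(x)
    return moves
```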
Hints:
- Visualization: To understand the distribution, you can print the device of each layer after initialization (e.g., print(next(layer.parameters()).device)); a short sketch follows these hints.
- Efficiency: Remember that moving data between devices takes time.
- Timing: Use torch.cuda.synchronize() to ensure your Step Time measurements are accurate.
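A minimal sketch of the visualization hint, assuming the layers live in model.layers as in the base script:

```python
import torch.nn as nn

def print_layer_devices(model: nn.Module) -> None:
    """Print the device that each layer's parameters live on."""
    for i, layer in enumerate(model.layers):
        print(f"layer {i}: {next(layer.parameters()).device}")
```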