Homework 3
General Instructions & Submission
Release & Deadline
- Release date: 16 April 2026
- Deadline: 30 April 2026 (23:59)
Submission Requirements
Each student must submit:
1) Code files
- All modified scripts used in the homework
- Must be runnable
2) Report (a single PDF file), including:
- Answers to all questions
- Tables of Results
- Short explanations
Base Code
Use the LargeLinearModelMP class code provided in our Lecture 8 practical session.
Use the script distributed with that session as your starting point.
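The Lecture 8 script itself is not reproduced in this handout. For orientation only, here is a minimal sketch of the kind of script the tasks assume, reconstructed from the identifiers they reference (DIM, DEPTH, the if i < depth // 2 split, and the x = x.to(layer_device) baton pass). The values of DIM and DEPTH and the toy training step are assumptions; treat the actual Lecture 8 code as authoritative.

```python
# Minimal sketch, not the official Lecture 8 code: it only mirrors the names
# and structure that the tasks below refer to.
import torch
import torch.nn as nn

DIM = 4096    # width of each linear layer (assumed value)
DEPTH = 6     # number of linear layers (assumed value)

class LargeLinearModelMP(nn.Module):
    """A stack of linear layers split across two GPUs (naive model parallelism)."""

    def __init__(self, dim: int = DIM, depth: int = DEPTH):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(depth):
            # Splitting logic examined in Task 1: first half on GPU 0, rest on GPU 1.
            if i < depth // 2:
                device = torch.device("cuda:0")
            else:
                device = torch.device("cuda:1")
            self.layers.append(nn.Linear(dim, dim).to(device))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            layer_device = next(layer.parameters()).device
            x = x.to(layer_device)   # "baton pass" examined in Task 5
            x = layer(x)
        return x

if __name__ == "__main__":
    model = LargeLinearModelMP()
    x = torch.randn(64, DIM, device="cuda:0")   # assumed batch size
    out = model(x)
    loss = out.pow(2).mean()                    # toy loss, just to exercise backward()
    loss.backward()
    print("output device:", out.device)
```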
Task 1—Current Behavior (3 Points)
Focus on the splitting logic: if i < depth // 2:
1) Do: Run the provided code.
2) Answer:
- Q1) How many layers are assigned to GPU 0 and GPU 1?
- Q2) In the forward() method, at what exact point does the data move from GPU 0 to GPU 1?
Task 2—Increase Model Size (3 Points)
Focus on the configuration constants: DIM and DEPTH.
- Do: Increase DEPTH (e.g., from 6 to 12, then 24).
- Submit: A table comparing DEPTH vs. Step Time; a timing sketch follows this task.
- Answer: Does increasing the number of layers improve speed? Explain why or why not based on your observations.
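One way to collect the Step Time numbers for this table is sketched below. The batch size, warm-up count, and helper names are assumptions, not requirements; any timing method that synchronizes the GPUs before reading the clock is fine.

```python
# Sketch of a step-time measurement; not the only valid approach.
import time
import torch

def sync_all_gpus():
    # Wait for all queued work on every visible GPU before reading the clock.
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)

def average_step_time(model, x, optimizer, criterion, n_warmup=3, n_measure=10):
    """Return the average wall-clock time of one training step, in seconds."""
    def one_step():
        out = model(x)
        loss = criterion(out, torch.zeros_like(out))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    for _ in range(n_warmup):   # warm-up steps are excluded from the measurement
        one_step()
    sync_all_gpus()
    start = time.time()
    for _ in range(n_measure):
        one_step()
    sync_all_gpus()
    return (time.time() - start) / n_measure
```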
Task 3—Use All GPUs (3 Points)
Refactor the model to be dynamic.
- Do: Modify LargeLinearModelMP so it automatically detects all available GPUs (torch.cuda.device_count()) and distributes the layers evenly across all of them (an illustrative mapping follows this task).
- Submit: Your modified __init__ method code.
- Answer: Explain the logic you used to calculate which GPU receives which layer index.
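As an illustration of what "evenly" can mean, one possible mapping is sketched below. It is only a hint, not the required scheme; you may use a different assignment as long as you can explain it in your answer.

```python
# One possible even assignment: each GPU receives a contiguous block of layers.
def gpu_for_layer(i: int, depth: int, num_gpus: int) -> int:
    """Return the GPU index for layer i when depth layers are spread over num_gpus GPUs."""
    return (i * num_gpus) // depth

# Example: depth=20 over 5 GPUs -> layers 0-3 on GPU 0, 4-7 on GPU 1, and so on.
print([gpu_for_layer(i, 20, 5) for i in range(20)])
```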
Task 4—Run with More GPUs (3 Points)
Use your modified code from Task 3.
- Do: Set DEPTH = 20. Run the code using:
  - 2 GPUs
  - 5 GPUs
- Submit: A table showing Step Time and GPU Memory Usage for each configuration; a memory-reporting sketch follows this task.
- Answer: Are all GPUs being utilized? Does adding more GPUs make the training faster in this specific setup?
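One way to fill in the GPU Memory Usage column is sketched below; whether you report peak allocated or reserved memory (or nvidia-smi readings) is your choice, as long as you state it in the report.

```python
# Sketch of per-GPU memory reporting; the function name is an assumption.
import torch

def report_gpu_memory():
    """Print the peak allocated memory per GPU (since the last reset), in MiB."""
    for d in range(torch.cuda.device_count()):
        peak_mib = torch.cuda.max_memory_allocated(d) / (1024 ** 2)
        print(f"GPU {d}: peak allocated {peak_mib:.1f} MiB")

# Call torch.cuda.reset_peak_memory_stats(d) for every device before the measured
# steps, then call report_gpu_memory() afterwards.
```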
Task 5—Data Transfer (3 Points)
Focus on the "Baton Pass" in forward(): x = x.to(layer_device).
- Do: Count how many times x.to(layer_device) is triggered in a single forward pass (an instrumentation sketch follows this task).
- Submit:
  - Number of transfers with 2 GPUs:
  - Number of transfers with 5 GPUs:
- Answer: What is the relationship between the number of GPUs and the total communication overhead?
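If you prefer to count empirically rather than by hand, one option is sketched below. It assumes your model keeps its layers in model.layers, as in the base script, and it counts only the cases where the activation actually changes device (which may differ from how often the line itself executes).

```python
# Sketch of instrumented counting; the helper name is an assumption.
import torch

def count_device_moves(model, x):
    """Replay one forward pass and count how often the activation changes device."""
    moves = 0
    for layer in model.layers:
        layer_device = next(layer.parameters()).device
        if x.device != layer_device:
            moves += 1
        x = x.to(layer_device)
        x = layer(x)
    return moves
```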
Hints:
- Visualization: To understand the distribution, you can print the device of each layer after initialization (e.g., print(next(layer.parameters()).device)); a short sketch follows these hints.
- Efficiency: Remember that moving data between devices takes time.
- Timing: Use torch.cuda.synchronize() to ensure your Step Time measurements are accurate.
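A minimal sketch of the visualization hint, assuming the layers live in model.layers as in the base script:

```python
import torch.nn as nn

def print_layer_devices(model: nn.Module) -> None:
    """Print the device that each layer's parameters live on."""
    for i, layer in enumerate(model.layers):
        print(f"layer {i}: {next(layer.parameters()).device}")
```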