
Parallelism in Deep Learning 2025/26 spring


Homework 3

General Instructions & Submission


Release & Deadline
  • Release date: 16 April 2026
  • Deadline: 30 April 2026 (23:59)
Submission Requirements

Each student must submit:

1) Code files

  • All modified scripts used in the homework
  • Must be runnable

2) Report (PDF, single file). Include:

  • Answers to all questions
  • Tables of results
  • Short explanations

Base Code

Use the LargeLinearModelMP class code provided in our Lecture 8 practical session.

Use the following script:

  • Download

Task 1—Current Behavior (3 Points)

Focus on the splitting logic: if i < depth // 2:
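For orientation, the two-GPU split in the lecture's class presumably follows the pattern sketched below. This is a minimal illustration of the splitting idea, not the exact Lecture 8 code; the names (dim, depth, layers) are assumptions, and the provided script is authoritative.

```python
import torch
import torch.nn as nn

class LargeLinearModelMP(nn.Module):
    """Minimal sketch of a two-GPU model-parallel stack of linear layers."""

    def __init__(self, dim, depth):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(depth):
            # First half of the layers on GPU 0, second half on GPU 1.
            device = torch.device("cuda:0" if i < depth // 2 else "cuda:1")
            self.layers.append(nn.Linear(dim, dim).to(device))

    def forward(self, x):
        for layer in self.layers:
            layer_device = next(layer.parameters()).device
            x = x.to(layer_device)  # "baton pass": move activations to the layer's device
            x = layer(x)
        return x
```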

1) Do: Run the provided code.
2) Answer:

  • Q1) How many layers are assigned to GPU 0 and GPU 1?
  • Q2) In the forward() method, at what exact point does the data move from GPU 0 to GPU 1?

Task 2—Increase Model Size (3 Points)

Focus on the configuration constants: DIM and DEPTH.

  • Do: Increase DEPTH (e.g., from 6 to 12, then 24).
  • Submit: A table comparing DEPTH vs. Step Time.
  • Answer: Does increasing the number of layers improve speed? Explain why or why not based on your observations.
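If you need a starting point for the measurement, the sketch below shows one way to time a single step. The training-loop details (loss function, optimizer, where the labels live) are assumptions, not part of the provided script.

```python
import time
import torch

def sync_all_gpus():
    # Wait for pending kernels on every visible GPU, not just the current one.
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)

def time_one_step(model, x, y, optimizer, loss_fn):
    """Time one forward/backward/update step with proper GPU synchronization."""
    sync_all_gpus()
    start = time.perf_counter()

    optimizer.zero_grad()
    out = model(x)
    loss = loss_fn(out, y.to(out.device))  # labels must live where the output lives
    loss.backward()
    optimizer.step()

    sync_all_gpus()
    return time.perf_counter() - start
```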

Task 3—Use All GPUs (3 Points)

Refactor the model to be dynamic.

  • Do: Modify LargeLinearModelMP so it automatically detects all available GPUs (torch.cuda.device_count()) and distributes the layers evenly across all of them.
  • Submit: Your modified __init__ method code.
  • Answer: Explain the logic you used to calculate which GPU receives which layer index.
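One common way to assign layers evenly is contiguous blocks of layers per GPU; the snippet below illustrates only the index arithmetic and is not the only valid scheme (the variable names are assumptions).

```python
import torch

depth = 20
num_gpus = torch.cuda.device_count()

for i in range(depth):
    # Contiguous-block assignment: layer i goes to GPU (i * num_gpus) // depth,
    # so each GPU receives roughly depth / num_gpus consecutive layers.
    gpu_index = (i * num_gpus) // depth
    print(f"layer {i:2d} -> cuda:{gpu_index}")
```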

Task 4—Run with More GPUs (3 Points)

Use your modified code from Task 3.

  • Do: Set DEPTH = 20. Run the code using:
    • 2 GPUs
    • 5 GPUs
  • Submit: A table showing Step Time and GPU Memory Usage for each configuration.
  • Answer: Are all GPUs being utilized? Does adding more GPUs make the training faster in this specific setup?
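To run the same script with different numbers of GPUs, one common approach is to restrict the visible devices before CUDA is initialized. The snippet below is a sketch assuming a standard CUDA setup; the device indices are examples.

```python
import os

# Must be set before the first CUDA call (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"          # expose 2 GPUs
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4"  # expose 5 GPUs

import torch
print(torch.cuda.device_count())  # should report the restricted count

# Per-GPU memory after some training steps can be read with, e.g.:
# for d in range(torch.cuda.device_count()):
#     print(d, torch.cuda.max_memory_allocated(device=d))
```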

Task 5—Data Transfer (3 Points)

Focus on the "Baton Pass" in forward(): x = x.to(layer_device).

  • Do: Count how many times x.to(layer_device) is triggered in a single forward pass.
  • Submit:
    • Number of transfers with 2 GPUs:
    • Number of transfers with 5 GPUs:
  • Answer: What is the relationship between the number of GPUs and the total communication overhead?
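Note that x.to(layer_device) is effectively a no-op when x already lives on that device, so the interesting number is how often the activation actually changes device. If you want to count this programmatically rather than by inspection, a sketch (assuming the model keeps its layers in model.layers, as in the earlier sketch) is:

```python
def count_transfers(model, x):
    """Count how many times the activation actually changes device in one forward pass."""
    transfers = 0
    for layer in model.layers:
        layer_device = next(layer.parameters()).device
        if x.device != layer_device:
            transfers += 1
        x = x.to(layer_device)
        x = layer(x)
    return transfers
```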

Hints:
  • Visualization: to understand the distribution, print the device of each layer after initialization (e.g., print(next(layer.parameters()).device)).
  • Efficiency: moving data between devices takes time. Use torch.cuda.synchronize() to ensure your Step Time measurements are accurate.
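A compact way to apply the visualization hint, assuming the model keeps its layers in a model.layers ModuleList as in the earlier sketches:

```python
def print_layer_devices(model):
    for i, layer in enumerate(model.layers):
        # Each nn.Linear stores its weight on exactly one device.
        print(f"layer {i:2d} -> {layer.weight.device}")
```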
