Institute of Computer Science, University of Tartu
Parallelism in Deep Learning (LTAT.06.030)
Parallelism in Deep Learning 2025/26 spring


Homework 2

General Instructions & Submission


Release & Deadline
  • Release date: 9 April 2026
  • Deadline: 23 April 2026 (23:59)
Submission Requirements

Each student must submit:

1) Code files

  • All modified scripts used in the homework
  • Must be runnable

2) Report (PDF — single file) Include:

  • Answers to all questions
  • Tables of Results
  • Short explanations

Task 1—Combine the Codes (6 Points)

Combine Code 1 and Code 2 to create an optimized training script.

1) Requirements
Students must:

  • Start from Code 1
  • Integrate AMP (Automatic Mixed Precision) from Code 2
  • Produce a new file named: practical5_single.py

2) Implementation Instructions
Your implementation must include:

  • autocast()
  • GradScaler()
  • scaler.scale(loss).backward()
  • scaler.step(optimizer)
  • scaler.update()

Hints:

  • Keep the gradient accumulation logic from Code 1 unchanged, and integrate AMP inside it.
  • Do not modify the training logic structure — only enhance it with AMP.

Task 2—Run and Compare All Versions (6 Points)

1) Requirements
Students must run the following three versions:

  • Code 1: DDP + Gradient Accumulation
  • Code 2: DDP + AMP
  • Code 3: Combined (Accumulation + AMP)

2) Comparison Table
Fill in the table based on your observations:

Code   | Time (s) | Memory Usage (GB) | Stability
-------|----------|-------------------|----------
Code 1 |          |                   |
Code 2 |          |                   |
Code 3 |          |                   |

3) Comparison Table: Vary ACCUM_STEPS in Code 1 and Code 3
Run both codes with each of the values ACCUM_STEPS = 4, 8, 16, 32:

Code   | ACCUM_STEPS | Time (s) | Memory Usage (GB) | Observations
-------|-------------|----------|-------------------|-------------
Code 1 | 4           |          |                   |
Code 1 | 8           |          |                   |
Code 1 | 16          |          |                   |
Code 1 | 32          |          |                   |
Code 3 | 4           |          |                   |
Code 3 | 8           |          |                   |
Code 3 | 16          |          |                   |
Code 3 | 32          |          |                   |
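One possible way to fill the Time (s) and Memory Usage (GB) columns is sketched below. The helper name `measure` is an assumption, not part of the provided codes; on a CUDA machine, `torch.cuda.max_memory_allocated()` reports the peak tensor allocation since the last reset.

```python
# Sketch: time one training run and record peak GPU memory in GB.
import time
import torch

def measure(train_fn):
    """Run train_fn once; return (elapsed_seconds, peak_gpu_gb)."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()     # clear the peak counter
        torch.cuda.synchronize()                 # flush pending kernels before timing
    start = time.perf_counter()
    train_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # wait for all GPU work to finish
    elapsed = time.perf_counter() - start
    peak_gb = (torch.cuda.max_memory_allocated() / 1e9
               if torch.cuda.is_available() else 0.0)
    return elapsed, peak_gb
```

Under DDP, call this inside each process and report the rank-0 values, since every rank runs the same workload.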

Task 3—Analysis Report (3 Points)

Write a short report (1-2 pages) answering:

  • Q1) Compare the execution times of the three versions and explain the differences in terms of gradient synchronization and numerical precision.
  • Q2) Compare GPU memory usage and explain how gradient accumulation and mixed precision affect memory differently.
  • Q3) Explain how gradient accumulation changes the effective batch size. Why must the loss be divided by ACCUM_STEPS?
