Practical sessions:
Part 1: Foundations of Parallelism & Deep Learning (Weeks 1–4)
- Practical 1: Implement a toy neural network in PyTorch and visualize forward/backward passes.
- Practical 2: Compare matrix multiplication speeds using NumPy on the CPU versus PyTorch on the GPU.
- Practical 3: Profile a CNN training script and identify the main bottlenecks.
- Practical 4: No hands-on exercise this week; the lecture material is theoretical, and its implementation is deferred to later sessions.
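A minimal sketch of what Practical 1 might start from (assuming PyTorch is installed; the layer sizes and toy data are illustrative, not prescribed by the course):

```python
import torch
import torch.nn as nn

# A tiny two-layer network; sizes are arbitrary choices for illustration.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(16, 4)        # a toy batch of 16 samples
target = torch.randn(16, 1)

# Forward pass: compute predictions and a scalar loss.
pred = model(x)
loss = nn.functional.mse_loss(pred, target)

# Backward pass: autograd populates .grad on every parameter,
# which students can then inspect or visualize.
loss.backward()

for name, p in model.named_parameters():
    print(name, tuple(p.grad.shape))
```

From here, visualizing the passes can be as simple as printing intermediate activations with forward hooks or plotting the gradient norms per layer.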
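Practical 2's comparison can be sketched as below (assuming NumPy and PyTorch; the matrix size is an arbitrary choice, and the script falls back to CPU so it runs even without a GPU):

```python
import time
import numpy as np
import torch

N = 512  # modest size so the comparison runs anywhere

a_np = np.random.rand(N, N).astype(np.float32)
b_np = np.random.rand(N, N).astype(np.float32)

t0 = time.perf_counter()
c_np = a_np @ b_np
numpy_time = time.perf_counter() - t0

# Use the GPU when available; otherwise time torch on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
a_t = torch.from_numpy(a_np).to(device)
b_t = torch.from_numpy(b_np).to(device)

t0 = time.perf_counter()
c_t = a_t @ b_t
if device == "cuda":
    torch.cuda.synchronize()  # GPU kernels are async; wait before stopping the clock
torch_time = time.perf_counter() - t0

print(f"NumPy ({N}x{N}): {numpy_time:.4f}s | torch on {device}: {torch_time:.4f}s")
```

The `synchronize()` call matters: without it, GPU timings measure only kernel launch, not execution.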
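For Practical 3, one possible starting point uses `torch.profiler` on a small CNN step (the model and batch are stand-ins; real sessions would profile the actual training script):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# A stand-in CNN; the practical would profile the course's real training script.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
)
x = torch.randn(8, 3, 32, 32)
target = torch.randint(0, 10, (8,))

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    loss = nn.functional.cross_entropy(model(x), target)
    loss.backward()

# The table ranks ops by time; the heaviest rows are the bottleneck candidates.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

On a GPU machine, adding `ProfilerActivity.CUDA` to `activities` exposes kernel-level timings as well.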
Part 2: Core Parallel Strategies in Practice (Weeks 5–11)
- Practical 5: Convert a single-GPU training script to use torch.nn.DataParallel and observe its single-process bottleneck (per-step replicate/scatter/gather overhead and Python GIL contention).
- Practical 6: Convert the single-GPU script to use DDP (with torchrun) and compare its performance against the DP implementation.
- Practical 7: Train a model with AMP and gradient accumulation to observe the benefits and practice DDP launch configurations.
- Practical 8: Apply basic model parallelism by distributing layers and tensors of a feedforward network across multiple devices.
- Practical 9: Work through pipeline parallelism with toy examples and outline the implementation steps for a mock (paper-only) pipeline module.
- Practical 10: Design a hybrid DDP+PP strategy for a toy transformer in PyTorch, analyzing pros, cons, and communication costs.
- Practical 11: Recap and Project Q&A.
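The DataParallel conversion in Practical 5 amounts to one wrapper call; a minimal sketch (with no or one GPU, the wrapper degenerates to a plain single-device call, which is exactly the per-step replicate/scatter/gather machinery the practical examines):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# DataParallel re-replicates the module across visible GPUs on every forward
# pass, scatters the batch, and gathers the outputs on one device -- all from
# a single Python process.
dp_model = nn.DataParallel(model)

x = torch.randn(32, 10)
out = dp_model(x)
print(out.shape)
```

Because all of this happens in one process, DataParallel is bottlenecked by the GIL and the gather device, which is what motivates the move to DDP in the next session.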
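Practical 6's DDP conversion can be previewed in a single process by faking the environment variables that torchrun would normally set (gloo backend, CPU only, world size 1 -- a sketch of the API, not a real multi-GPU run; the port number is an arbitrary choice):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun normally sets these; we fake a one-process "cluster" so the
# script runs anywhere.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 2))   # each rank holds a full model replica
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                 # gradients are all-reduced across ranks here
opt.step()

dist.destroy_process_group()
print("one DDP step done, loss:", loss.item())
```

Launched for real, the same script would run as `torchrun --nproc_per_node=N script.py`, with one process (and one model replica) per GPU, which is the comparison point against DataParallel.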
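A possible skeleton for Practical 7's AMP-plus-accumulation loop (the accumulation factor and toy data are illustrative; bfloat16 autocast is used on CPU so the sketch runs without a GPU):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 autocast needs CUDA; bfloat16 autocast also works on CPU.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Linear(10, 2).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Loss scaling is only needed for float16 on GPU; disabled => passthrough.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

accum_steps = 4  # simulate a 4x larger batch by accumulating gradients
opt.zero_grad()
for step in range(8):
    x = torch.randn(8, 10, device=device)
    y = torch.randn(8, 2, device=device)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        # Divide so the accumulated gradient matches one big-batch step.
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(opt)     # unscales grads; skips the step on inf/nan
        scaler.update()
        opt.zero_grad()
print("final micro-batch loss:", loss.item())
```

Under DDP, the same loop would wrap the non-final micro-batches in `model.no_sync()` to avoid paying an all-reduce per accumulation step.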
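Practical 8's layer-wise model parallelism reduces to placing sub-modules on different devices and moving activations between them; a sketch that falls back to a single CPU "device" when only one device exists, so the placement logic is still exercised:

```python
import torch
import torch.nn as nn

# Two target devices; with fewer than two GPUs both halves land on the
# same device, but the inter-device activation hop is still written out.
dev0 = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
dev1 = torch.device("cuda:1") if torch.cuda.device_count() > 1 else dev0

class SplitNet(nn.Module):
    """Feedforward net with its two halves on different devices."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU()).to(dev0)
        self.part2 = nn.Linear(32, 2).to(dev1)

    def forward(self, x):
        h = self.part1(x.to(dev0))
        return self.part2(h.to(dev1))   # activation crosses the device boundary

model = SplitNet()
out = model(torch.randn(16, 10))
print(out.shape, out.device)
```

The key observation for the session: with this naive split, one device is always idle while the other computes -- the problem pipeline parallelism addresses next.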
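Since Practical 9 stays theoretical, the toy example can be a pure-Python simulation of a GPipe-style schedule, enough to derive the pipeline "bubble" on paper (stage and micro-batch counts are arbitrary):

```python
# Toy GPipe-style forward schedule: with S stages and M micro-batches, the
# pipeline takes S + M - 1 ticks instead of S * M, and the idle "bubble"
# fraction is (S - 1) / (S + M - 1).
def pipeline_ticks(stages: int, micro_batches: int) -> int:
    ticks = stages + micro_batches - 1
    busy = [[False] * ticks for _ in range(stages)]
    for m in range(micro_batches):
        for s in range(stages):
            busy[s][m + s] = True   # micro-batch m reaches stage s at tick m+s
    return ticks

def bubble_fraction(stages: int, micro_batches: int) -> float:
    total_slots = stages * pipeline_ticks(stages, micro_batches)
    useful = stages * micro_batches
    return 1 - useful / total_slots

print(pipeline_ticks(4, 8))                  # 11 ticks
print(round(bubble_fraction(4, 8), 3))       # 0.273
```

The takeaway matches the lecture: increasing the number of micro-batches shrinks the bubble, at the cost of smaller per-stage batches.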
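For Practical 10's communication-cost analysis, a back-of-envelope model in plain Python is a reasonable starting point. All numbers below (parameter and activation sizes, degrees) are made-up assumptions for illustration:

```python
# Rough per-step communication volume for a hybrid DDP + pipeline setup:
# each pipeline boundary moves activations forward and activation-gradients
# back once per micro-batch, while DDP all-reduces each stage's gradients
# once per optimizer step.
def hybrid_comm_bytes(param_bytes, activation_bytes, stages, micro_batches, dp_degree):
    # Pipeline traffic: 2x (activations + their grads) per boundary per micro-batch.
    pp = 2 * activation_bytes * (stages - 1) * micro_batches
    # DDP traffic per replica: ring all-reduce moves ~2*(d-1)/d of the
    # stage's gradient bytes; each stage holds param_bytes / stages.
    dp = 2 * (dp_degree - 1) / dp_degree * (param_bytes / stages)
    return pp, dp

pp, dp = hybrid_comm_bytes(
    param_bytes=4e9,        # 1B fp32 parameters (assumed)
    activation_bytes=8e6,   # 8 MB of boundary activations per micro-batch (assumed)
    stages=4, micro_batches=8, dp_degree=2,
)
print(f"pipeline traffic/step: {pp/1e6:.0f} MB; DDP all-reduce/replica: {dp/1e6:.0f} MB")
```

Plugging in a toy transformer's real sizes turns this into the pros/cons table the practical asks for: pipeline traffic scales with micro-batch count, DDP traffic with per-stage parameter count.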
Part 3: Project Work and Assessment (Weeks 12–16)
- Practical 12: Project Work Session 1
- Practical 13: Project Work Session 2
- Practical 14: Project Work Session 3
- Practical 15: Final Project Presentations
- Practical 16: Final Exam / Assessment