Final Project
Building a High-Performance Hybrid Parallel Engine
Project Description and Required Tasks
This project is intentionally open-ended. Different implementation approaches are acceptable as long as the technical objectives are addressed and properly analyzed.
In Practical 10, we implemented a "naive" hybrid parallel system. While it worked, it was sequential (Stage 2 waited for Stage 1) and hard-coded (it only worked on a specific 4-GPU setup).
Your Mission: Transform that educational script into a professional Hybrid Parallel Engine. You will move from a conceptual model to a system that supports dynamic scaling, asynchronous execution, and memory optimization.
Technical Objectives
You are required to upgrade the base code to meet these Four Milestones:
- Milestone 1: Dynamic Hybrid Parallel Topology (Scaling) (10 Points)
- The Problem: The current code is hard-coded for a 4-GPU environment.
- The Task: Use
argparseto allow the user to define the 3D grid dimensions (DP, PP, MP) via the command line. - Requirement: The script must automatically assign the correct GPU ranks to the correct process groups.
- Command Example:
torchrun --nproc_per_node=8 train.py --dp 2 --pp 2 --mp 2
- Milestone 2: Pipeline Overlap (Performance) (12 Points)
- The Problem: The current micro-batch loop is "Stop-and-Wait."
- The Task: Implement Asynchronous Pipelining.
- Requirement: Students should attempt asynchronous pipelining using CUDA streams, RPC, or other overlap strategies. The goal is to reduce idle GPU time by overlapping the execution of different micro-batches whenever possible.
- Milestone 3: Advanced Optimization (AMP) (8 Points)
- The Problem: Standard 32-bit training is slow and consumes significant memory.
- The Task: Integrate Automatic Mixed Precision (AMP) using
torch.cuda.amp - Requirement: The engine must support
float16training and handle the necessary gradient scaling to maintain model stability.
- Milestone 4: Performance Profiling (10 Points)
- The Task: Conduct a comparative study.
- Requirement: Compare your optimized engine against the "Base Code" from Practical 10. You must report:
- Throughput: Samples processed per second.
- Memory Savings: Maximum batch size achievable with AMP vs. without.
Deliverables
Students must submit a .zip file containing:
1) hybrid_engine.py : Your complete, commented source code.
2) A text file containing the torchrun commands used for the experiments.
3) Technical Report (PDF):
- A diagram of your 3D Grid mapping.
- An explanation of how you handled the Pipeline Overlap.
- Performance tables and a brief conclusion on the bottlenecks you observed.
4) Submission Deadline : The final project must be submitted by 31 May 2026 (23:59). Late submissions may not be accepted unless approved in advance.
Getting Started
- Base Code: Use the script provided in Practical 10 as your starting point.
- Environment: Use the same HPC allocation parameters:
srun --partition=gpu --gres=gpu:4. - Tip: Start by making the configuration dynamic (Milestone 1) before attempting the asynchronous streams (Milestone 2).
Note to Students: This project mimics the real-world challenges faced by engineers at companies like OpenAI and Meta. Focus on the communication between GPUs; that is where the real magic happens.