Homework 1 (5 points)
Deadline: Wednesday, September 24, 23:59 (no late submissions)
Watch Andrej Karpathy’s “Let's build GPT: from scratch, in code, spelled out” video and review the accompanying code.
Your task is to
- write down any questions or parts that you found confusing,
- reflect briefly on what you learned,
- answer the following three questions:
  - Why are the attention scores scaled by the square root of the head size?
  - In the video, attention is described as a communication mechanism in which the elements of the sequence can be seen as nodes in a directed graph. What has to change in the self-attention of the decoder-only transformer implemented in the video so that every node has a connection to every other node and to itself (a complete digraph with self-loops)?
  - Why is the `torch.tril` function useful, according to Karpathy?
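As a refresher for the last question, the causal-masking pattern from the video can be sketched as follows (a minimal sketch with uniform logits rather than the full query/key attention; the toy sequence length `T = 4` is an assumption for illustration):

```python
import torch

T = 4  # toy sequence length (assumed for illustration)

# Lower-triangular matrix of ones: entry (i, j) is 1 iff j <= i.
tril = torch.tril(torch.ones(T, T))

# Start from zero attention logits, then forbid attending to future
# positions by setting them to -inf before the softmax.
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float("-inf"))
wei = torch.softmax(wei, dim=-1)

print(wei)
# Each row i is a probability distribution over positions 0..i only;
# row 1, for example, is [0.5, 0.5, 0.0, 0.0].
```

Because `softmax` turns `-inf` into exactly zero weight, each position can only aggregate information from itself and earlier positions.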
This assignment is intended to help you become familiar with the inner workings of GPT-style models and Transformers in general. Your submission should be no more than one page.
Grading: Full credit (5 points) will be awarded for complete submissions that address all parts of the task.
Note: Use of AI tools such as ChatGPT is allowed and even encouraged, but must be disclosed (e.g., via a short description or by attaching the relevant conversation history).
Video link: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=107s