TL;DR
The goal of this course is to prepare students for research in the foundations of deep learning. By the end of the course you should be able to read most cutting-edge papers in this field, and be capable of reproducing at least some experimental results (those that do not require an inordinate amount of computational and human resources). Ideally, you should be on your way to working on original research in the field. To achieve this, the course will require a large amount of independence from students, including both self-study and peer study.
Formal Description
A graduate-level course on recent advances and open questions in the foundations of machine learning, and specifically deep learning. We will review both classical results and recent papers in areas including classifiers and generalization gaps, representation learning, generative models, adversarial robustness, out-of-distribution performance, and more. This is a fast-moving area and it will be a fast-moving course. We will aim to cover both state-of-the-art results and the intellectual foundations behind them, and have a substantive discussion on both the “big picture” and the technical details of the papers. In addition to the theoretical lectures, the course will involve a programming component aiming to get students to the point where they can both reproduce results from papers and work on their own research. This component will be largely self-directed, and we expect students to be proficient in Python and in picking up technologies and libraries such as pytorch/numpy/etc. on their own (aka “Stack Overflow oriented programming”).
Lectures
Friday 16:15-18:00 Delta - 1022 (Kallol Roy, Boris Kudryashov, Irina Bocharova)
https://ut-ee.zoom.us/j/98180227833?pwd=jbGPnn9OlP4nID5BdISbSZb53x04Lb.1
Meeting ID: 981 8022 7833 Passcode: 771089
Contacts
Responsible Lecturers:
- Kallol Roy kallol.roy@ut.ee (Delta:3082)
- Boris Kudryashov boris.kudryashov@ut.ee (Delta:3089)
- Irina Bocharova irina.bocharova@ut.ee (Delta:3089)
Contents
Lectures and practice sessions
N | Lecture | Reading |
---|---|---|
1 | Introduction to the course and a quick review of classical ML: representation (i.e., approximation theorems), optimization (convexity, stochastic gradient descent), generalization (bias/variance tradeoff), and how these differ from the modern paradigm. Transformer architecture: how it works, why it is well-suited for GPUs, auto-regressive language models, and the next-token prediction task (a minimal attention sketch follows the table). Some questions: are transformers useful for their inductive bias, or for their highly efficient GPU implementation? Differences between fine-tuning, prompt tuning, and linear readouts. | Model: original paper and annotated version (colab version), Peter Bloem blog - informal initial introduction, Phuong-Hutter formal survey, Andrej Karpathy video - GPT from scratch (even if you don’t code along with the video, you will learn a lot from watching it and Karpathy’s commentary). Vision: vision transformer, MLP mixer. Efficiency: compute/energy consumption of models, GPUs and linear algebra. Inductive bias: learning convolutions from scratch (Behnam Neyshabur). Linear-time attention reading: Efficient attention, Nyströmformer (blog, paper), Linformer, MEGA (sub-quadratic), Attention-free transformer, Pretraining without attention (SSM) |
2 | Generative models: the variational principle, VAEs, normalizing flows. | Reading: Chapter 2 (VAE) of the Kingma and Welling survey on VAEs. Chapter 3 of Wainwright and Jordan (exponential distributions; you can skim the concrete examples in 3.3). Lilian Weng blog on normalizing flows. Survey by Kobyzev, Prince, and Brubaker (see also the CVPR 21 tutorial) |
3 | Diffusion models (a forward-process sketch follows the table). | Reading: On Perusall - Weng blog, Karras et al unifying design space, McAllester math of diffusion. Additional resources: Latent diffusion (Rombach et al), classifier-free guidance (Ho and Salimans), blog posts of Song and Das, Vahdat tutorial (video, 2 hours). |
4 | Entropy. Uniform and non-uniform source codes. Applications of Huffman codes to different scenarios (a Huffman-coding sketch follows the table). | Reading: |
5 | Arithmetic coding. Practical implementation. Ideas behind universal coding | Reading: |
6 | Dictionary-based low-complexity universal coding techniques. Interval coding and recency rank coding. LZ-77 algorithm, monotonic codes. LZ-77 examples and implementation issues (a toy LZ-77 encoder follows the table). | Reading: |
7 | LZW algorithm, PPM algorithm, Examples. Implementation issues | Reading: |
8 | Burrows-Wheeler transform and its applications. Overview of the best archivers; a BWT-based data compression algorithm. Understanding lossy data compression: multimedia data compression, sampling, transforms, quantization. | Reading: |
9 | Privacy in machine learning. Attacks on non-private models: membership inference, extracting training data from GPT-2 and diffusion models, failure of heuristics (e.g. the attack on InstaHide), Exposed! A survey of attacks on private data. Issues with DP for deep learning: Tramer-Boneh (DP needs better features), Bagdasaryan-Shmatikov (DP impacts subgroups differently). Machine unlearning: see this. Relaxations of DP: label DP, privacy-preserving predictions. DP fine-tuning of large models (a DP-SGD sketch follows the table). | 2014 manuscript on Differential Privacy by Dwork and Roth. For issues of computational complexity, see the survey of Vadhan. DP-SGD paper; see also lecture notes by Smith and Ullman, notes by Kamath, and slides by Bellet. This video of Kamath can also be useful. https://differentialprivacy.org/ |
10 | Protein Folding: AlphaFold - guest lecture by Gustaf Ahdritz. | AlphaFold1 paper, AlphaFold2 paper. Blog: Mohammed AlQuraishi blog1, blog2 |
11 | Training Dynamics: differences between back-propagation and perturbative methods, natural gradient, edge of stability, deep bootstrap, and the effects of design choices such as batch norm, residual connections, and SGD vs. Adam. | Lecture notes of Roger Grosse, Deep Bootstrap paper, Edge of stability paper, SGD complexity paper. Francis Bach’s blog on depth-2 network dynamics (guest post by Lénaïc Chizat). Chinchilla paper on scaling laws. |
12 | Training dynamics, continued. | We will look at Deep Bootstrap, Edge of Stability, and scaling laws (particularly Chinchilla, and to what extent they are challenged by LLaMA). Some other reading: mathematical models that demonstrate the above phenomena: deep bootstrap in kernels, understanding edge-of-stability via a minimalist example, edge-of-stability in 2-layer nets, explaining neural scaling laws, power laws in kernels (see also this, this, and nearest-neighbor rates). |
13 | Reinforcement learning | David Silver slides on MDPs, AlphaZero paper, MuZero paper. Proximal Policy Optimization (PPO), Schulman et al. |
14 | Test-time computation: test-time augmentation, beam search, retrieval-based models, differentiable vs. non-differentiable memory and tools. | Survey on augmented language models. Best-of-n outputs: WebGPT paper; plurality voting: Wang et al; Minerva paper. In-context learning, and is it really “learning” or “context conditioning”: Min et al - in-context examples more useful for the data distribution than the labels; Wei et al - LLMs can adapt to the label distribution as well. Chain of thought: Wei et al; zero-shot CoT: Kojima et al (“step by step”). Differentiable memory: RETRO (DeepMind), Memorizing transformers, Recurrent memory (Bulatov et al). Non-differentiable memory, natural language as universal API: Toolformer (Schick et al); see also “Bing inner monologue” (e.g. here, here; unsure to what extent these are confirmed), langchain, Taskmatrix.ai |
15 | AI Safety, Fairness, Accountability, Transparency, Alignment. | Fair ML textbook. Hendrycks safety course. Algorithmic Auditing, Vecchione et al. Against predictive optimization, Wang et al. Meta-study on bias papers in NLP. Feature highlighting explanations in model interpretability (Barocas et al). The mythos of model interpretability - Lipton. Gender Shades - Buolamwini and Gebru. |
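
To make the lecture-1 transformer material concrete, here is a minimal single-head causal self-attention step in numpy. It is only a sketch: the dimensions, the projection matrices `W_q`/`W_k`/`W_v`, and the random toy inputs are illustrative assumptions, not code from any of the papers above.

```python
# Minimal single-head causal self-attention -- the core of an autoregressive
# (next-token-prediction) transformer block, stripped of everything else.
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """x: (T, d_model) token embeddings; W_*: (d_model, d_head) projections."""
    T, _ = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v                   # (T, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])               # (T, T) similarities
    mask = np.triu(np.ones((T, T)), 1).astype(bool)       # forbid looking ahead
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v                                    # (T, d_head) output

rng = np.random.default_rng(0)
T, d_model, d_head = 5, 16, 8                             # toy sizes
x = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
print(causal_self_attention(x, W_q, W_k, W_v).shape)      # (5, 8)
```

The (T, T) score matrix is also why naive attention is quadratic in sequence length, which is what the "linear-time attention" readings above try to avoid.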
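For lecture 3, the sketch below shows the closed-form DDPM-style forward (noising) step and the standard epsilon-prediction training objective. The linear beta schedule and the placeholder `eps_model` are assumptions standing in for choices a real implementation (e.g. a U-Net and a tuned schedule) would make.

```python
# Sketch of the diffusion forward process x_t ~ q(x_t | x_0) and the
# epsilon-prediction MSE objective on a single toy datapoint.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form sample of x_t given clean data x_0 and noise eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_model(x_t, t):
    """Placeholder for the learned noise predictor (a neural network in practice)."""
    return np.zeros_like(x_t)

x0 = rng.normal(size=(8,))              # pretend data point
t = rng.integers(T)                     # random timestep
eps = rng.normal(size=x0.shape)         # the injected noise
x_t = q_sample(x0, t, eps)
loss = np.mean((eps_model(x_t, t) - eps) ** 2)   # "predict the noise" objective
print(t, loss)
```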
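For lectures 4-5, a compact Huffman-code construction from symbol counts, compared against the empirical entropy of the same source. The toy string and the heap-based construction are illustrative; the point is only to see average code length approach entropy.

```python
# Huffman coding over symbol frequencies, plus the empirical entropy in
# bits/symbol for comparison.
import heapq
from collections import Counter
from math import log2

def huffman_code(freqs):
    """freqs: dict symbol -> count. Returns dict symbol -> bitstring."""
    # Heap entries are (frequency, unique tiebreak index, partial codebook).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two lowest-frequency nodes
        f2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, i2, merged))
    return heap[0][2]

text = "abracadabra"
freqs = Counter(text)
code = huffman_code(freqs)
n = len(text)
avg_len = sum(freqs[s] * len(code[s]) for s in freqs) / n
entropy = -sum((f / n) * log2(f / n) for f in freqs.values())
print(code, f"avg {avg_len:.3f} bits/symbol vs entropy {entropy:.3f}")
```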
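For lecture 6, a toy LZ-77 encoder/decoder that emits (offset, length, next-symbol) triples. The small window and the brute-force longest-match search are deliberately naive assumptions, nothing like a production compressor, but they make the triple format concrete.

```python
# Toy LZ-77: encode to (offset, length, next_symbol) triples and decode back.

def lz77_encode(data, window=255):
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):            # brute-force match search
            length = 0
            # Never let a match consume the final character, so a literal
            # "next symbol" always exists.
            while (i + length < len(data) - 1 and
                   data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for off, length, sym in triples:
        for _ in range(length):           # char-by-char copy handles overlaps
            out.append(out[-off])
        out.append(sym)
    return "".join(out)

msg = "abracadabra abracadabra"
triples = lz77_encode(msg)
assert lz77_decode(triples) == msg
print(triples)
```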
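For lecture 9, the core of a single DP-SGD step (per-example gradient clipping plus calibrated Gaussian noise) on a toy least-squares problem. The clip norm, noise multiplier, and loss are illustrative assumptions; a real experiment would use a library with a privacy accountant (e.g. Opacus) rather than this hand-rolled loop.

```python
# Sketch of DP-SGD on a linear model: clip each example's gradient to norm C,
# sum, add Gaussian noise with scale sigma*C, then take an averaged step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)   # toy regression data
w = np.zeros(d)
C, sigma, lr = 1.0, 1.1, 0.1          # clip norm, noise multiplier, step size

for step in range(100):
    # Per-example gradients of the squared loss 0.5 * (x.w - y)^2
    residual = X @ w - y                              # (n,)
    per_example_grads = residual[:, None] * X         # (n, d)
    # Clip each example's gradient to L2 norm at most C
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / C)
    # Sum, add calibrated Gaussian noise, average, and step
    noisy_sum = clipped.sum(axis=0) + rng.normal(scale=sigma * C, size=d)
    w -= lr * noisy_sum / n
print(w)
```

The clipping bounds each example's influence (the sensitivity), which is what lets the added Gaussian noise be translated into a formal (epsilon, delta) guarantee by an accountant.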
Topic | Content | Start (week) | Deadline (week) |
---|---|---|---|
Measuring information | For a real discrete data source, estimate the theoretically achievable compression ratio. Compare with the efficiency of existing archivers (a starting-point sketch follows this table). | 1 | 6 |
Universal data compression | Implement programs for estimating the efficiency of a given set of universal data compression algorithms. Choose the best one for a given data source. Compare with the efficiency of standard archivers. | 5 | 13 |
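
As a starting point for the "Measuring information" assignment, the sketch below estimates the zero-order (i.i.d. byte) entropy of a file and compares it with what standard-library compressors actually achieve; `corpus.txt` is a placeholder for your own data source, and the zero-order model is an assumption that ignores higher-order structure, so good archivers can beat it.

```python
# Zero-order entropy estimate of a file vs. achieved bits/byte of gzip/bz2/lzma.
import bz2, gzip, lzma
from collections import Counter
from math import log2

def byte_entropy(data: bytes) -> float:
    """Empirical entropy in bits per byte under an i.i.d. byte model."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * log2(c / n) for c in counts.values())

with open("corpus.txt", "rb") as f:          # replace with your data source
    data = f.read()

print(f"zero-order entropy: {byte_entropy(data):.3f} bits/byte")
for name, compress in [("gzip", gzip.compress), ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    ratio = 8 * len(compress(data)) / len(data)
    print(f"{name:5s}: {ratio:.3f} bits/byte")
```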