Most importantly, the online course that we follow:
https://www.cs.toronto.edu/~rgrosse/courses/csc2541_2021/
Linear Algebra
- Linear Algebra and Calculus refresher from the Stanford Machine Learning course
- Linear Algebra Review and Reference from Stanford
- Computational Linear Algebra for Coders from fast.ai
Calculus
- Vector, Matrix, and Tensor Derivatives from Stanford
- The Matrix Calculus You Need For Deep Learning from fast.ai
Probability
- Probabilities and Statistics refresher from the Stanford Machine Learning course
- Review of Probability Theory from Stanford
Links for Lecture 1: A Toy Model: Linear Regression
- Tom Goldstein's talk, "An empirical look at generalization in neural nets"
- Eigenvectors and eigenvalues | Essence of linear algebra (3Blue1Brown)
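The two links above circle the idea Lecture 1 develops: gradient descent on linear regression decouples along the eigenvectors of the data covariance, and each eigenvalue sets a convergence rate. Below is a minimal JAX sketch of that setup; the synthetic data and step-size choice are placeholders, not taken from the lecture.

```python
import jax
import jax.numpy as jnp

# Synthetic least-squares problem: y = X w_true exactly.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (200, 5))
w_true = jnp.arange(1.0, 6.0)
y = X @ w_true

def loss(w):
    return 0.5 * jnp.mean((X @ w - y) ** 2)

# The loss Hessian is the data covariance X^T X / n; its largest eigenvalue
# bounds the usable step size (stability requires lr < 2 / lambda_max).
lam = jnp.linalg.eigvalsh(X.T @ X / X.shape[0])
lr = 1.0 / lam[-1]

w = jnp.zeros(5)
for _ in range(500):
    w = w - lr * jax.grad(loss)(w)
print(jnp.max(jnp.abs(w - w_true)))  # small: gradient descent recovers w_true
```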
Links for Practice 1: JAX Tutorial
- Google introductory video covering grad, jit, vmap, and pmap
- SciPy 2020 talk
- JAX Docs
- JAX Github notebooks
- CSC413 course on neural networks
- Lecture 3 on autodiff: slides and lecture notes
Tutorials:
- JAX Tutorial 1
- JAX Tutorial 2: https://colab.research.google.com/drive/1dMZVo9JqI573TSpWLZ6_W5pTKPTPsbpj?usp=sharing
- JAX NN and Data Loading
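For quick reference, a minimal sketch of the JAX transformations the tutorial links cover, namely grad, jit, and vmap (pmap is omitted since it needs multiple devices); the toy function is a placeholder.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.tanh(x) ** 2)

grad_f = jax.grad(f)            # reverse-mode gradient of a scalar function
fast_grad_f = jax.jit(grad_f)   # XLA-compiled version of the same function
batched = jax.vmap(grad_f)      # maps grad_f over a leading batch axis

x = jnp.ones(3)
print(grad_f(x))                  # per-example gradient
print(fast_grad_f(x))             # same values, compiled
print(batched(jnp.ones((4, 3))))  # shape (4, 3): one gradient per batch row
```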
Links for Lecture 2: Taylor Approximations
- AutoDiff slides by Mathieu Blondel
- What AutoDiff is and what it is not
- Reverse and forward modes explained (1)
- Reverse and forward modes explained (2)
- JAX docs on forward and reverse modes
- VJPs and JVPs
- Hessian via JVP and VJP (see the sketch after this list)
- Gauss-Newton Matrix
- Conjugate Gradient explained
- Conjugate Gradient explained short version
- Positive Definite Matrix
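Several of the links above fit together in one small example: a Hessian-vector product built as a JVP of the gradient (forward-over-reverse), plugged into conjugate gradient so that H x = -g is solved without ever materializing H. The quadratic below is a stand-in objective, not an example from the lecture.

```python
import jax
import jax.numpy as jnp

A = jnp.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = jnp.array([1.0, -1.0])

def f(w):
    return 0.5 * w @ A @ w - b @ w

def hvp(w, v):
    # Forward-mode JVP of the reverse-mode gradient: returns H(w) v.
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

def conjugate_gradient(matvec, rhs, iters=50, tol=1e-10):
    # Textbook CG for solving matvec(x) = rhs with a positive definite operator.
    x = jnp.zeros_like(rhs)
    r = rhs - matvec(x)
    p = r
    for _ in range(iters):
        if r @ r < tol:
            break
        Ap = matvec(p)
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

w0 = jnp.zeros(2)
newton_step = conjugate_gradient(lambda v: hvp(w0, v), -jax.grad(f)(w0))
print(w0 + newton_step)   # the minimizer of f, i.e. the solution of A w = b
```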
Links for Problem Set 1: Gradient Descent with Momentum
- Blog post on gradient descent with momentum
- Text on iterative methods
- Sutskever, Martens, Dahl & Hinton, "On the importance of initialization and momentum in deep learning"
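For reference, a minimal sketch of the heavy-ball momentum update the links above discuss; the ill-conditioned quadratic is a placeholder objective.

```python
import jax
import jax.numpy as jnp

def loss(w):
    # Ill-conditioned quadratic: curvatures 10 and 1 along the two axes.
    return 0.5 * jnp.sum(jnp.array([10.0, 1.0]) * w ** 2)

lr, beta = 0.02, 0.9
w = jnp.array([1.0, 1.0])
v = jnp.zeros(2)

for _ in range(300):
    g = jax.grad(loss)(w)
    v = beta * v - lr * g   # velocity accumulates an exponential average of past gradients
    w = w + v
print(w)  # close to the minimizer [0, 0]
```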
Links for Problem Set 2: Computing the Grassmannian Length
- Why deeper networks are more expressive than wide, shallow ones
- Grassmannian length video presentation
- Forward vs. reverse auto-differentiation (1)
- Forward vs. reverse auto-differentiation (2)
- Forward vs. reverse auto-differentiation (3)
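A minimal JAX illustration of the forward-vs-reverse trade-off behind the three links above: jacfwd assembles the Jacobian from JVPs (one per input dimension), jacrev from VJPs (one per output dimension).

```python
import jax
import jax.numpy as jnp

def f(x):                       # R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
J_fwd = jax.jacfwd(f)(x)        # 3 JVPs, one per input dimension
J_rev = jax.jacrev(f)(x)        # 2 VJPs, one per output dimension
print(jnp.allclose(J_fwd, J_rev))  # True: same Jacobian either way
```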
Links for Problem Set 3: Path Energy and Geodesics
- Paper discussing the geometric interpretation of the Fisher distance, using the Normal distribution as an example (see the worked example after this list)
- Explicit calculation of the Fisher matrix for a given parametrization
- Leibniz integral rule for differentiation under the integral sign
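As a concrete anchor for the Fisher-distance links above, here is the classical worked example for the univariate Normal in the (μ, σ) parametrization; this is a standard textbook result, not taken from the linked paper.

```latex
% Fisher information of N(\mu, \sigma^2) in the (\mu, \sigma) parametrization
I(\mu,\sigma)
  = -\,\mathbb{E}\!\left[\nabla^{2}_{(\mu,\sigma)} \log p(x \mid \mu,\sigma)\right]
  = \begin{pmatrix} 1/\sigma^{2} & 0 \\ 0 & 2/\sigma^{2} \end{pmatrix},
\qquad
ds^{2} = \frac{d\mu^{2} + 2\, d\sigma^{2}}{\sigma^{2}} .
```

Up to the constant factor of 2 on dσ², this is the Poincaré half-plane metric, which is where the hyperbolic-geometry interpretation of the Fisher distance between Normal distributions comes from.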
Links for Lecture 5: Adaptive Gradient Methods, Normalization, Weight Decay
- Original Batch Norm Paper
- Bridge between optimizers and Hessian approximation
- Batch Norm + Adam optimizer
- NeurIPS Test of Time talk
- On an experiment with random batch transformations
- About stability
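To connect the adaptive-methods and weight-decay links above, a minimal sketch of a single-parameter-vector Adam loop with decoupled weight decay (the AdamW variant), written out explicitly; the hyperparameters are the usual defaults and the quadratic loss is a placeholder.

```python
import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum(w ** 2)

lr, b1, b2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 1e-2
w = jnp.array([1.0, -2.0])
m = jnp.zeros_like(w)   # first-moment (mean of gradients) estimate
v = jnp.zeros_like(w)   # second-moment (uncentered variance) estimate

for t in range(1, 1001):
    g = jax.grad(loss)(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (jnp.sqrt(v_hat) + eps)
    w = w - lr * wd * w                # decoupled weight decay, applied directly to the weights
print(w)  # moves steadily toward 0
```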
Links for Lecture 6: Infinite Limits and Overparameterization
- Priors for Infinite Networks
- Gaussian Process Behaviour in Wide Deep Neural Networks
- Approximate Inference Turns Deep Networks into Gaussian Processes
- On Exact Computation with an Infinitely Wide Neural Net
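The papers above study exact infinite-width kernels; as a finite-width counterpart, here is a minimal sketch of the empirical neural tangent kernel, the inner product of parameter gradients at two inputs, for a tiny ad-hoc MLP (the architecture and initialization are assumptions, not taken from the papers).

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes=(3, 16, 1)):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def apply(params, x):
    # Tiny MLP with a single scalar output.
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return (x @ W + b)[0]

def empirical_ntk(params, x1, x2):
    # K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>, summed over all parameter arrays.
    j1 = jax.grad(apply)(params, x1)
    j2 = jax.grad(apply)(params, x2)
    leaves1 = jax.tree_util.tree_leaves(j1)
    leaves2 = jax.tree_util.tree_leaves(j2)
    return sum(jnp.vdot(a, b) for a, b in zip(leaves1, leaves2))

key = jax.random.PRNGKey(0)
params = init_params(key)
x1, x2 = jnp.ones(3), jnp.arange(3.0)
print(empirical_ntk(params, x1, x2))
```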