| # | Date | Topic | Lecturer |
|---|------|-------|----------|
| Ia | 7th of February | Introduction to the course | Sven Laur |
| Ib | 7th of February | Decision trees and association rules | Sven Laur |
| II | 14th of February | Linear models and polynomial interpolation | Anna Leontjeva |
| III | 21st of February | Performance evaluation measures | Sven Laur |
| IV | 28th of February | Introduction to optimization | Konstantin Tretyakov |
| V | 6th of March | Linear classification | Konstantin Tretyakov |
| VI | 13th of March | Feed-forward neural networks for prediction tasks | Sven Laur |
| VII | 20th of March | Basics of probabilistic modelling | Sven Laur |
| VIII | 27th of March | Maximum likelihood and maximum a posteriori estimates | Sven Laur |
| IX | 3rd of April | Model-based clustering techniques | Meelis Kull |
| X | 10th of April | Expectation-maximisation and data augmentation algorithms | Meelis Kull |
| XI | 17th of April | PCA | Kristjan Korjus |
| XII | 24th of April | Statistical learning theory | Sven Laur/Jüri Lember |
| -- | 1st of May | International Workers' Day | V.I.Lenin |
| XIII | 8th of May | Support Vector Machines | Konstantin Tretyakov |
| XIV | 15th of May | Kernel methods | Konstantin Tretyakov |
| XV | 22nd of May | Basics of ensemble methods | Raivo Kolde |
| XVI | 29th of May | Particle filters | Raivo Kolde |

Lectures take place on Tuesdays, 14:15, room 404

Exercise sessions are on Tuesdays, 12:15, room 402

### Ia. Introduction to the course

**Given by** Sven Laur and Konstantin Tretyakov

**Brief summary:** Advantages and drawbacks of machine learning. When it is appropriate to use machine learning and when knowledge-based modelling is more appropriate. Overview of standard experiment design. Potential applications and limits of machine learning. Broad classification of machine learning methods. Roadmap for the course.

**Slides:** Handwritten slides PDF slides

**Literature:** ??

### Ib. Association rules and decision trees

**Given by** Sven Laur

**Brief summary:**
What is an association rule and which rules are useful? Notions of support, confidence and applicability. Heuristic ways to define rules. Decision trees. ID3 algorithm for learning a good decision tree from examples. C4.5 algorithm for finding decision trees. CART algorithm for predicting continuous variables. Over-fitting and tree pruning.

**Slides:** Handwritten slides PDF slides

**Literature**

- Lecture slides by Tom Mitchell
- Thomas Mitchell: Machine learning (1997) pages 52 - 80

**Complementary exercises**:

- Thomas Mitchell: Machine learning (1997) pages 77 - 78
- Iris dataset
- Tutorial for using C4.5

**Free implementations**:
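
For example, the rpart package in R provides a CART-style learner; a minimal sketch on the Iris dataset mentioned above (the pruning threshold is illustrative only):

```r
# Fit and prune a classification tree on the built-in iris data
library(rpart)
fit <- rpart(Species ~ ., data = iris)        # grow the tree
print(fit)                                    # inspect the split rules
pruned <- prune(fit, cp = 0.05)               # prune back to counter over-fitting
pred <- predict(pruned, iris, type = "class")
mean(pred == iris$Species)                    # training accuracy
```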

### II. Linear models and polynomial interpolation

**Given by** Anna Leontjeva and Sven Laur

**Brief summary:**
What is a linear model? How to detect linear trends in the data? Mean square error and normalised mean square error of a given linear model. Ordinary least squares estimation and its geometrical interpretation. Polynomial interpolation as a linear regression problem. How does the experiment design influence the reconstruction of linear dependencies? Influence and leverage of various data points. Linear regression for categorical data. One-way and two-way analysis of variance. Methods for regression diagnostics and outlier detection. Model selection and regularisation. No free lunch theorem for interpretation.

**Slides:** Linear regression

**Literature:**

- Germán Rodríguez: Generalized Linear Models Chapter 2
- Kaare Brandt Petersen and Michael Syskind Pedersen:The Matrix Cookbook: Derivatives
- Russell Davidson and James G. MacKinnon: Econometric Theory and Methods: Chapter 2: The Geometry of Linear Regression

**Complementary exercises**:

- Sanford Weisberg: Applied Linear Regression (3rd edition) pages 18 - 19, 38 - 46, 65 - 68, 92 - 95, 137 - 146, 191 - 193, 206 - 210
- Various datasets used in the examples of Applied Linear Regression
- Various datasets for fitting linear models in Princeton lectures

**Free implementations**:

- Built-in stats package in R: `anova`, `glm`, `lm`, `residuals`

- Diagnostics for linear model fitting in R
- Example datasets for linear model fitting in R
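
A minimal sketch of polynomial interpolation posed as ordinary least squares, using only base R; the toy signal is our own choice:

```r
# Polynomial regression is linear in the coefficients, so lm() handles it
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)    # noisy non-linear signal
fit <- lm(y ~ poly(x, degree = 3))            # cubic basis, OLS estimate
summary(fit)$r.squared                        # goodness of fit
plot(x, y); lines(x, fitted(fit), col = "red")
```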

### III. Performance evaluation measures

**Given by** Sven Laur

**Brief summary:**
Principles of experiment design. Machine learning as minimisation of future costs. Overview of standard loss functions. Stochastic estimation of future costs by random sampling (Monte-Carlo integration). Theoretical limitations. Standard validation methods: holdout, randomised holdout, cross-validation, leave-one-out, bootstrapping. Advantages and drawbacks of standard validation methods.

**Slides:** Attach:lecture-4.pdf

**Literature**:

- Davison and Hinkley: Bootstrap Methods and Their Application
- Molinaro, Simon and Pfeiffer: Prediction Error Estimation: A Comparison of Resampling Methods
- Arlot and Celisse: A survey of cross-validation procedures for model selection
- Efron: Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation
- Efron and Tibshirani: Improvements on Cross-Validation: The .632+ Bootstrap Method
- Wolfgang Härdle: Applied Nonparametric Regression: Choosing the smoothing parameter (Chapter 5)
- Yang: Can the Strengths of AIC and BIC Be Shared?
- van Erven, Grunwald and de Rooij: Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma

**Complementary exercises**:

- Generate data from a simple linear or polynomial regression model, use various validation methods, and report the results:
  - Did the training method choose the correct model?
  - Are there differences when the correct model is not feasible?
  - Estimate the bias and variance of the training method
  - Did the validation method correctly estimate the expected losses?

- Try various classification and linear regression methods together with various validation methods and report the results

**Free implementations**:
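
A hand-rolled 5-fold cross-validation sketch in base R; the data-generating model and the fold count are arbitrary illustrations:

```r
# Estimate future losses of a linear model by 5-fold cross-validation
set.seed(1)
n <- 100
x <- runif(n); y <- 2 * x + rnorm(n, sd = 0.3)
folds <- sample(rep(1:5, length.out = n))     # random fold assignment
cv_mse <- sapply(1:5, function(k) {
  fit <- lm(y ~ x, subset = folds != k)       # train on four folds
  mean((y[folds == k] -
        predict(fit, data.frame(x = x[folds == k])))^2)  # test on the fifth
})
mean(cv_mse)                                  # cross-validated mean square error
```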

### IV. Introduction to optimization

**Given by** Konstantin Tretyakov

**Brief summary:**
Optimization problem statement and basic terminology (minimum, maximum, gradient, Hessian). Classical methods for unconstrained optimization - gradient descent, stochastic gradient descent, Newton's method. Brief overview of constrained optimization.

**Slides:** (pdf) (Auxiliary reading) (Video)

**Literature:**

- Wikipedia has some good articles on the topic. Otherwise, pick any fat textbook with "Optimization methods" in the title.
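
A minimal gradient-descent sketch in R on a toy quadratic of our own choosing:

```r
# Minimise f(w) = (w1 - 1)^2 + 2 * (w2 + 3)^2 by gradient descent
grad <- function(w) c(2 * (w[1] - 1), 4 * (w[2] + 3))  # analytic gradient
w <- c(0, 0)                                  # starting point
eta <- 0.1                                    # fixed step size
for (i in 1:200) {
  w <- w - eta * grad(w)                      # step against the gradient
}
w                                             # converges to the minimum (1, -3)
```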

### V. Linear classification

**Given by** Konstantin Tretyakov

**Brief summary:** General description of linear classifiers. Components of the linear discriminant function: weight vector and bias term. Fisher discriminant, least-squares methods, the perceptron, batch and online training, and various cost functions used in training together with the resulting training algorithms.

**Slides:** (pdf) (Auxiliary reading)

**Literature**:

- Duda, Hart & Stork: Pattern Classification: Linear discriminant functions
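
A minimal online perceptron sketch in base R; the separable toy data are our own construction:

```r
# Online perceptron updates on linearly separable 2D data
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2)) # leading 1 absorbs the bias term
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)       # separable labels
w <- rep(0, 3)                                # weight vector (incl. bias)
for (epoch in 1:20) {
  for (i in 1:n) {
    if (y[i] * sum(w * X[i, ]) <= 0) {        # misclassified point
      w <- w + y[i] * X[i, ]                  # perceptron update rule
    }
  }
}
mean(sign(X %*% w) == y)                      # reaches 1 on separable data
```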

### VI. Feed-forward neural networks for prediction tasks

**Given by** Sven Laur

**Brief summary:**
Neural networks as a toolbox for approximating complex functions. Generalised linear models and the conceptual design of a feed-forward network. Hidden layer as an adaptive and non-linear map into a higher-dimensional feature space. Sigmoid functions and radial basis functions as standard ways to build the non-linear mapping. Backpropagation algorithm as an efficient gradient descent procedure. Higher-order methods for minimising the training error. Computer vision and invariance under shifts and rotations. Training methods for enforcing this type of invariance.

**Slides:** PDF slides Handwritten slides

**Literature:**

- Bishop: Pattern Recognition and Machine Learning pages 225 - 272

**Complementary exercises**:

- Bishop: Pattern Recognition and Machine Learning pages 284 - 290
- Use neural networks for classification and prediction on the various datasets listed below and compare with the results obtained in the earlier exercise sessions
- Build a translation-invariant neural network for distinguishing numbers in the Semeion Handwritten Digit Data Set:
  - First, use random small translations to increase the data set.
  - Second, use the tangent propagation method.
  - Try two-class versus multi-class classification tasks.

**Free implementations**:
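
A minimal sketch, assuming the nnet package (single-hidden-layer feed-forward networks); the size and decay values are illustrative only:

```r
# One hidden layer with 5 units, weight decay as regularisation
library(nnet)
set.seed(1)
fit <- nnet(Species ~ ., data = iris, size = 5,
            decay = 0.01, maxit = 200, trace = FALSE)
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)                    # training accuracy
```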

### VII. Basics of probabilistic modelling

**Given by** Sven Laur

**Brief summary:**
Overview of the main distributions used for probabilistic reasoning: Bernoulli, Binomial, Beta, Poisson, Gaussian and Laplacian distributions. Text classification using naive Bayes classifiers. Bayesian networks. Probabilistic model for linear regression and the corresponding maximum likelihood solution. The effect of error models on the objective function and performance.

**Slides:** Attach:lecture-9.pdf Handwritten slides Naive Bayes explained for text classification

**Literature:**

- Bishop: Pattern Recognition and Machine Learning pages 67 - 120
- Weiss, Indurkhya, Zhang & Damerau: Text Mining: Predictive Methods for Analyzing Unstructured Information pages 52 - 70
- Bishop: Pattern Recognition and Machine Learning pages 137 - 161
- Ricci: Fitting distributions with R

**Complementary exercises**:

- Bishop: Pattern Recognition and Machine Learning exercises from pages 127 - 136 that are related to practical tasks
- Bishop: Pattern Recognition and Machine Learning pages 220 - 224
- Some practical exercises to confirm various hypotheses about how some variables are distributed in real life.
- Build a naive Bayes filter for detecting spam. Use the Spambase Data Set for training and testing.

**Free implementations**:

- Various distributions in the built-in stats package in R and `qqplot`
- MASS package in R: `fitdistr`, `mvrnorm`
- Predbayescor package in R that implements the naive Bayes model
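
A minimal naive Bayes sketch, assuming the e1071 package; the Iris data stand in for a real corpus such as Spambase:

```r
# Naive Bayes with class-conditional Gaussians for each feature
library(e1071)
fit <- naiveBayes(Species ~ ., data = iris)
pred <- predict(fit, iris)
table(pred, iris$Species)                     # confusion matrix on training data
```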

### VIII. Maximum likelihood and maximum a posteriori estimates

**Given by** Sven Laur

**Brief summary:**
Formalisation of the Maximum Likelihood and Maximum A Posteriori principles. Simple examples of ML and MAP estimates. Logistic regression and linear discriminant analysis. Link function. Connection between regularisation and the Maximum A Posteriori estimate. Regularised linear models and the corresponding priors on parameters.

**Slides:** Attach:lecture-10.pdf Handwritten slides

**Literature:**

- Duda, Hart & Stork: Pattern Classification pages 84 - 107
- Bishop: Pattern Recognition and Machine Learning pages 137 - 161
- Bishop: Pattern Recognition and Machine Learning pages 204 - 220

**Complementary exercises**:

- Bishop: Pattern Recognition and Machine Learning pages 173 - 177
- Bishop: Pattern Recognition and Machine Learning pages 220 - 224
- Practical comparison of various linear regression methods on data with different error distributions.

**Free implementations**:

- Built-in stats package in R: `glm`, `anova`
- LARS package in R
- Quantreg package in R: `rq`
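
A minimal sketch of both estimates in R: logistic regression fit by maximum likelihood with base R's `glm`, and ridge regression as a MAP estimate under a Gaussian prior, assuming the MASS package; the datasets and penalty value are illustrative:

```r
# ML estimate: logistic regression via iteratively reweighted least squares
fit_ml <- glm(am ~ hp + wt, data = mtcars, family = binomial)
coef(fit_ml)

# MAP estimate: the L2 penalty corresponds to a Gaussian prior on coefficients
library(MASS)
fit_map <- lm.ridge(mpg ~ hp + wt, data = mtcars, lambda = 1)
coef(fit_map)
```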

### IX. Model-based clustering techniques

**Given by** Meelis Kull

**Brief summary:**
Phylogenetic trees and hierarchical clustering. Independent mutation model. Hard and soft clustering techniques. K-means clustering as a maximum likelihood estimate. Gaussian mixture model as a generalisation of the k-means algorithm. Two ways to minimise the objective function: hard and soft clustering. Soft clustering as expectation-maximisation. Robust Gaussian mixture models. Mixtures of Bernoulli distributions.

**Slides:** Attach:lecture-11.pdf

**Video:** http://uttv.ee/naita?id=10562 (Please log in to the UTTV system with ut account)

**Literature:**

- Bishop: Pattern Recognition and Machine Learning pages 423 - 448

**Complementary exercises**:

- Bishop: Pattern Recognition and Machine Learning pages 455 - 459
- Practical clustering tasks:
  - Discretisation of continuous variables, such as height and weight or gene expression
  - Finding relations between such discretised signals

**Free implementations**:
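
A minimal sketch of hard and soft clustering in R: k-means from the built-in stats package and, assuming the mclust package, a Gaussian mixture fitted with EM:

```r
# Hard clustering: k-means with random restarts
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(km$cluster, iris$Species)               # clusters vs known species

# Soft clustering: Gaussian mixture model (assumes the mclust package)
library(mclust)
gmm <- Mclust(iris[, 1:4], G = 3)
head(gmm$z)                                   # responsibilities (soft assignments)
```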

### X. Expectation-maximisation and data augmentation algorithms

**Given by** Meelis Kull

**Brief summary:**
Formal derivation of the EM algorithm and its convergence proof. Kullback-Leibler divergence. Bayesian and frequentist interpretations. Robust Gaussian mixtures and their application in geodata analysis. Reconstruction of missing values with the EM algorithm. Data augmentation algorithm as a generalisation of the EM algorithm. Imputation of missing data with the data augmentation algorithm.

**Video:** http://uttv.ee/naita?id=10820 (Please log in to the UTTV system with ut account)

**Literature:**

- Bishop: Pattern Recognition and Machine Learning pages 450 - 455
- Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions pages 90 - 100

**Complementary exercises**:

- Bishop: Pattern Recognition and Machine Learning pages 455 - 459
- Use EM and DA algorithms for imputation of missing data and compare the results
- Use EM and DA algorithms for clustering of events or locations:
  - Try to fit a Gaussian mixture model to diamond resources.
  - Try to fit a Gaussian mixture model to petrol resources.
  - Try to use more complex models to track the spread of swine flu.

**Free implementations**:
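
A hand-rolled EM sketch for a two-component univariate Gaussian mixture in base R; all parameter choices are our own toy values:

```r
# EM for the mixture pi1 * N(mu1, sigma1^2) + (1 - pi1) * N(mu2, sigma2^2)
set.seed(1)
x <- c(rnorm(150, 0, 1), rnorm(100, 4, 0.8))  # simulated mixture data

pi1 <- 0.5; mu <- c(-1, 5); sigma <- c(1, 1)  # crude initial guesses
for (iter in 1:100) {
  # E-step: responsibility of component 1 for each point
  d1 <- pi1 * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  # M-step: re-estimate parameters from the soft assignments
  pi1 <- mean(r1)
  mu <- c(sum(r1 * x) / sum(r1), sum((1 - r1) * x) / sum(1 - r1))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1)))
}
round(c(pi1, mu, sigma), 2)                   # close to the generating values
```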

### XI. Principal Component Analysis (PCA)

**Given by** Kristjan Korjus

**Brief summary:**
I would like you to know what PCA is, how and when to use it, how to explain it to someone not familiar with the topic, how to plot and interpret the results of the analysis, and how to write a nice report using the method. To sum up, I would like you to acquire the full skillset needed to use PCA in practice.

This lecture describes the mathematical background of PCA. Your goal is to apply PCA to the data of the richest and the biggest countries in the world and write a nice report explaining your analysis of the data.

**Materials:** Lecture notes, R code, video

**Literature:**

- Bishop: Pattern Recognition and Machine Learning pages 559 - 599

**Complementary exercises**:

- Bishop: Pattern Recognition and Machine Learning pages 599 - 603
- Try PCA and ICA methods on various image collections and interpret the results

**Free implementations**:

- Built-in stats package in R: `princomp`, `prcomp`
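
A minimal PCA sketch using base R and the built-in USArrests data:

```r
# Standardise the variables, then rotate onto the principal axes
p <- prcomp(USArrests, scale. = TRUE)
summary(p)                                    # variance explained per component
biplot(p)                                     # scores and loadings together
```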

### XII. Statistical learning theory

**Given by** Sven Laur

**Brief summary:**
Bias-variance dilemma revisited. Training and testing data as iid samples from the distribution of future challenges. Confidence bounds on cost estimates via Monte-Carlo integration. Why does the training error underestimate future costs? Case study for the finite function set. Bias in the training error and its connection to the union bound and multiple hypothesis testing. Consistency and identifiability properties. VC-dimension as a way to estimate the bias in the training error. Rademacher complexity and its connection to the bias in the training error. Limitations of statistical learning theory.

**Slides:** Slides Handwritten slides

**Literature:**

- Cristianini & Shawe-Taylor: Support Vector Machines: Generalisation Theory (Chapter 4)
- Bartlett & Mendelson: Rademacher and Gaussian Complexities: Risk Bounds and Structural Results
- David MacKay: Information Theory, Inference, and Learning Algorithms: Capacity of a Single Neuron

**Complementary exercises**:

- Estimate the difference between training and test errors for different classifiers (see the sketch after this list):
  - Draw data from a linearly separable model with some Gaussian random shifts
  - Try various linear and non-linear classifiers
  - Plot the discrepancy as a function of training sample size
  - Draw data from a more complex model that cannot be represented by the predictor classes
  - Repeat the procedure
  - Estimate VC and Rademacher complexities and see if SLT bounds coincide with practice

- Estimate the difference between training and test errors for different prediction algorithms:
  - Draw the data from a linear model
  - Try various linear and non-linear predictors
  - Plot the discrepancy as a function of training sample size
  - Draw data from a more complex model that cannot be represented by the predictor classes
  - Repeat the procedure
  - Estimate VC and Rademacher complexities and see if SLT bounds coincide with practice
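
A base-R sketch for the first exercise; the data-generating model and the degree-5 classifier are our own toy choices:

```r
# Optimism of the training error as a function of training sample size
set.seed(1)
gap <- sapply(c(20, 50, 100, 200, 500), function(n) {
  x <- runif(n);     y <- sign(x - 0.5 + rnorm(n, sd = 0.2))
  xt <- runif(1000); yt <- sign(xt - 0.5 + rnorm(1000, sd = 0.2))
  fit <- glm(factor(y) ~ poly(x, 5), family = binomial)   # flexible classifier
  tr <- mean(ifelse(predict(fit) > 0, 1, -1) != y)        # training error
  te <- mean(ifelse(predict(fit, data.frame(x = xt)) > 0, 1, -1) != yt)
  te - tr                                                 # the discrepancy
})
gap                                           # shrinks as the sample grows
```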

**Free implementations**:

- None that we know

### XIII. Support Vector Machines

**Given by** Konstantin Tretyakov

**Brief summary:**
Recap on algebra and geometry. Maximal margin classifiers. Reformulation as a quadratic programming problem. Primal and dual forms. SVM as an example of a regularized learning problem. Hinge loss as an example of a surrogate loss function.

**Slides:** (pdf)

**Video:** http://uttv.ee/naita?id=11400 (Please log in to the UTTV system with ut account)

**Literature:**

- Cristianini and Shawe-Taylor: An Introduction to Support Vector Machines pages 93 - 112
- Schölkopf and Smola: Learning with Kernels pages 189 - 215
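
A minimal sketch, assuming the e1071 package (an R interface to libsvm); the kernel and cost values are illustrative:

```r
# Soft-margin SVM with a linear kernel
library(e1071)
fit <- svm(Species ~ ., data = iris, kernel = "linear", cost = 1)
summary(fit)                                  # support vectors per class
mean(predict(fit, iris) == iris$Species)      # training accuracy
```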

### XIV. Kernel methods

**Given by** Konstantin Tretyakov

**Brief summary:** Feature space mapping, the kernel trick, dual representation, kernelization.

**Slides:** (pdf)

**Video:** (uttv)

**Literature:**

- Cristianini and Shawe-Taylor: An Introduction to Support Vector Machines pages 112 - 123
- Schölkopf and Smola: Learning with Kernels pages 227-247
- David MacKay: Information Theory, Inference, and Learning Algorithms pages 535 - 549
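
A base-R sketch of the kernel trick: the Gram matrix of an RBF kernel computed directly from pairwise distances, without ever constructing the feature map (the helper name is our own):

```r
# K[i, j] = exp(-gamma * ||x_i - x_j||^2)
rbf_gram <- function(X, gamma = 1) {
  d2 <- as.matrix(dist(X))^2                  # pairwise squared distances
  exp(-gamma * d2)
}
K <- rbf_gram(scale(iris[, 1:4]))
dim(K)                                        # 150 x 150: data enter only via K
```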

### XV. Basics of ensemble methods

**Given by** Raivo Kolde

**Brief summary:**
Bayesian view on model selection. Ensembles as a Monte Carlo integration technique. Committee voting as Bayesian model averaging. Bagging as bootstrapping together with averaging. Sequential error correction methods and the idea of data point weighting. AdaBoost algorithm and its reformulation as a standard minimisation problem with a peculiar cost function. Non-robustness of the AdaBoost algorithm and alternatives. Mixtures of experts and the relation to lazy learning.

**Literature:**

- Bishop: Pattern Recognition and Machine Learning pages 653 - 674
- Hastie, Tibshirani & Friedman: The Elements of Statistical Learning pages 337 - 387

**Complementary Exercises:**

- Bishop: Pattern Recognition and Machine Learning pages 674-677
- Study the robustness and precision of bagging and boosting on the Spambase dataset with simple tree-based classifiers.
- Study the behaviour of Bayesian Model Averaging for linear models. Interpret the results.

**Free implementations**:

- BMS package for Bayesian Model Averaging in R: `bms`, `topmodels.bms`, `image`
- BMA package for Bayesian Model Averaging in R: `bicreg`, `bic.glm`, `bic.surv` and `imageplot.bma`
- Ipred package in R: `bagging`, `inbagg` and `bootest`
- Ada package in R: `ada` and `predict.ada`
- Gbm package in R: `gbm` and `gbm.perf`
- Mboost package in R
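
A hand-rolled bagging sketch with rpart trees (our own code, not tied to the packages above); the ensemble size and split are arbitrary:

```r
# Bagging: bootstrap resampling plus majority voting over trees
library(rpart)
set.seed(2)
train <- sample(nrow(iris), 100)
trees <- lapply(1:25, function(b) {
  boot <- sample(train, length(train), replace = TRUE)  # bootstrap sample
  rpart(Species ~ ., data = iris[boot, ])
})
votes <- sapply(trees, function(t)
  as.character(predict(t, iris[-train, ], type = "class")))
pred <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
mean(pred == iris$Species[-train])            # held-out accuracy
```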

### XVI. Particle filters

**Given by** Raivo Kolde

**Brief summary:**
Reconstruction of latent variables (signal) from observations, given a complete specification of the system. Examples of such problems: speech reconstruction, motion tracking, restoration of old music records, finding and predicting regime changes in stock markets. State space model and observation function. Filtering as a prediction task for the current timepoint. Smoothing as prediction for timepoints in the past that also uses present observations. A principled way of reconstructing the posterior distribution of the system state. Kalman filter as an exact solution for a particular type of system. Limitations of Kalman filtering. Rejection and importance sampling. How to use importance sampling for solving filtering problems. Inherent weaknesses of importance sampling. Resampling as a way to eliminate particles with degenerate weights. Auxiliary particle filtering.

**Slides:** Attach:lecture-19.pdf

**Literature:**

- Candy: Bayesian Signal Processing: Classical, Modern, and Particle Filtering Methods
- Chen: Bayesian Filtering: From Kalman Filters to Particle Filters and Beyond
- Cappé, Godsill & Moulines: An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo

**Video lectures**:

**Complementary Exercises:**

- Reconstruction of corrupted audio signal with Kalman and particle filtering.
- Reconstruction of trajectories of moving balls based on video signal.

**Free implementations**:
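
A hand-rolled sketch of a one-dimensional Kalman filter for a random-walk state in base R; all noise levels are our own toy values:

```r
# Latent random walk x_t observed with noise; the filter recovers E[x_t | y_1..t]
set.seed(3)
n <- 100
q <- 0.1; r <- 1                              # process / observation noise variances
x <- cumsum(rnorm(n, sd = sqrt(q)))           # latent state
y <- x + rnorm(n, sd = sqrt(r))               # observations

m <- numeric(n); P <- numeric(n)
m_prev <- 0; P_prev <- 10                     # diffuse initial guess
for (t in 1:n) {
  m_pred <- m_prev; P_pred <- P_prev + q      # predict step
  K <- P_pred / (P_pred + r)                  # Kalman gain
  m[t] <- m_pred + K * (y[t] - m_pred)        # update step
  P[t] <- (1 - K) * P_pred
  m_prev <- m[t]; P_prev <- P[t]
}
plot(y, col = "grey"); lines(x); lines(m, col = "red")  # truth vs filtered
```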