No.  | Date             | Topic                                                      | Lecturer
Ia   | 7th of February  | Introduction to the course                                 | Sven Laur
Ib   | 7th of February  | Decision trees and association rules                       | Sven Laur
II   | 14th of February | Linear models and polynomial interpolation                 | Anna Leontjeva
III  | 21st of February | Performance evaluation measures                            | Sven Laur
IV   | 28th of February | Introduction to optimization                               | Konstantin Tretyakov
V    | 6th of March     | Linear classification                                      | Konstantin Tretyakov
VI   | 13th of March    | Feed-forward neural networks for prediction tasks          | Sven Laur
VII  | 20th of March    | Basics of probabilistic modelling                          | Sven Laur
VIII | 27th of March    | Maximum likelihood and maximum a posteriori estimates      | Sven Laur
IX   | 3rd of April     | Model-based clustering techniques                          | Meelis Kull
X    | 10th of April    | Expectation-maximisation and data augmentation algorithms  | Meelis Kull
XI   | 17th of April    | PCA                                                        | Kristjan Korjus
XII  | 24th of April    | Statistical learning theory                                | Sven Laur / Jüri Lember
--   | 1st of May       | International Workers' Day                                 | V. I. Lenin
XIII | 8th of May       | Support Vector Machines                                    | Konstantin Tretyakov
XIV  | 15th of May      | Kernel methods                                             | Konstantin Tretyakov
XV   | 22nd of May      | Basics of ensemble methods                                 | Raivo Kolde
XVI  | 29th of May      | Particle filters                                           | Raivo Kolde

Lectures take place on Tuesdays, 14:15, room 404
Exercise sessions are on Tuesdays, 12:15, room 402

Ia. Introduction to the course

Given by Sven Laur and Konstantin Tretyakov
Brief summary: Advantages and drawbacks of machine learning. When is it appropriate to use machine learning and when is knowledge-based modelling more appropriate? Overview of standard experiment design. Potential applications and limits of machine learning. Broad classification of machine learning methods. Roadmap for the course.
Slides: Handwritten slides PDF slides
Literature: ??

Ib. Association rules and decision trees

Given by Sven Laur

Brief summary: What is an association rule and which rules are useful? Notions of support, confidence and applicability. Heuristic ways to define rules. Decision trees. ID3 algorithm for learning a good decision tree from examples. C4.5 algorithm for finding decision trees. CART algorithm for predicting continuous variables. Over-fitting and tree pruning.
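
For a concrete feel of support and confidence, here is a minimal R sketch that computes both for a candidate rule over a 0/1 transaction matrix; the basket data and item names are made up for illustration.

    # Made-up 0/1 transaction matrix: rows are baskets, columns are items.
    baskets <- matrix(c(1, 1, 0,   1, 1, 1,   0, 1, 1,   1, 0, 0,   1, 1, 0),
                      ncol = 3, byrow = TRUE,
                      dimnames = list(NULL, c("bread", "butter", "milk")))

    # Support and confidence of the rule {lhs} => {rhs}.
    rule_stats <- function(data, lhs, rhs) {
      lhs_hit  <- rowSums(data[, lhs, drop = FALSE]) == length(lhs)
      both_hit <- lhs_hit & (rowSums(data[, rhs, drop = FALSE]) == length(rhs))
      c(support    = mean(both_hit),                 # fraction of baskets containing lhs and rhs
        confidence = sum(both_hit) / sum(lhs_hit))   # empirical estimate of P(rhs | lhs)
    }

    rule_stats(baskets, lhs = "bread", rhs = "butter")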

Slides: Handwritten slides PDF slides

Literature

  • Lecture slides by Tom Mitchell
  • Thomas Mitchell: Machine learning (1997) pages 52 - 80

Complementary exercises:

Free implementations:

II. Linear models and polynomial interpolation

Given by Anna Leontjeva and Sven Laur

Brief summary: What is a linear model? How to detect linear trends in the data. Mean square error and normalised mean square error of a given linear model. Ordinary least squares estimation and its geometrical interpretation. Polynomial interpolation as a linear regression problem. How does the experiment design influence the reconstruction of linear dependencies? Influence and leverage of various data points. Linear regression for categorical data. One-way and two-way analysis of variance. Methods for regression diagnostics and outlier detection. Model selection and regularisation. No free lunch theorem for interpretation.
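
A minimal R sketch of ordinary least squares and of polynomial interpolation treated as a linear regression problem; the simulated data and the polynomial degree are arbitrary illustrations.

    set.seed(1)
    # Made-up data with a quadratic trend.
    x <- seq(0, 1, length.out = 50)
    y <- 2 + 3 * x - 4 * x^2 + rnorm(50, sd = 0.2)

    lin_fit  <- lm(y ~ x)                        # straight-line model
    poly_fit <- lm(y ~ poly(x, 2, raw = TRUE))   # degree-2 polynomial as a linear model

    # Ordinary least squares by hand: beta = (X'X)^{-1} X'y
    X    <- cbind(1, x, x^2)
    beta <- solve(t(X) %*% X, t(X) %*% y)

    mean(residuals(poly_fit)^2)                  # mean square error of the polynomial fit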

Slides: Linear regression

Literature:

Complementary exercises:

Free implementations:

III. Performance evaluation measures

Given by Sven Laur

Brief summary: Principles of experiment design. Machine learning as minimisation of future costs. Overview of standard loss functions. Stochastic estimation of future costs by random sampling (Monte-Carlo integration). Theoretical limitations. Standard validation methods: holdout, randomised holdout, cross-validation, leave-one-out, bootstrapping. Advantages and drawbacks of standard validation methods.
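
A minimal R sketch of k-fold cross-validation for estimating the expected squared loss of a linear model; the simulated data and the choice of k = 10 are illustrative only.

    set.seed(2)
    # Made-up regression data.
    n <- 100
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.3)
    data <- data.frame(x = x, y = y)

    k     <- 10
    folds <- sample(rep(1:k, length.out = n))   # random fold assignment

    cv_losses <- sapply(1:k, function(i) {
      train <- data[folds != i, ]
      test  <- data[folds == i, ]
      fit   <- lm(y ~ x, data = train)
      mean((test$y - predict(fit, newdata = test))^2)   # squared loss on the held-out fold
    })

    mean(cv_losses)   # cross-validation estimate of the expected loss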

Slides: Attach:lecture-4.pdf

Literature:

Complementary exercises:

  • Generate data from a simple linear or polynomial regression model, apply various validation methods, and report the results:
    • Did the training method choose the correct model?
    • Are there differences when the correct model is not feasible?
    • Estimate the bias and variance of the training method
    • Did the validation method correctly estimate the expected losses?
  • Try various classification and linear regression methods together with various validation methods and report the results

Free implementations:

IV. Introduction to optimization

Given by Konstantin Tretyakov
Brief summary: Optimization problem statement and basic terminology (minimum, maximum, gradient, Hessian). Classical methods for unconstrained optimization - gradient descent, stochastic gradient descent, Newton's method. Brief overview of constrained optimization.
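
A minimal R sketch of gradient descent and of a single Newton step on a simple quadratic objective; the objective function and step size are made up for illustration.

    # Made-up objective: f(w) = (w1 - 1)^2 + 2 * (w2 + 3)^2, minimum at (1, -3).
    f      <- function(w) (w[1] - 1)^2 + 2 * (w[2] + 3)^2
    grad_f <- function(w) c(2 * (w[1] - 1), 4 * (w[2] + 3))

    gradient_descent <- function(w, step = 0.1, iters = 100) {
      for (i in 1:iters) {
        w <- w - step * grad_f(w)     # move against the gradient
      }
      w
    }

    gradient_descent(c(0, 0))         # converges towards c(1, -3)

    # Newton's method needs the Hessian; here it is constant, so one step suffices.
    H <- matrix(c(2, 0, 0, 4), nrow = 2)
    w <- c(0, 0)
    w - solve(H, grad_f(w))           # exactly c(1, -3)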

Slides: (pdf) (Auxiliary reading) (Video)

Literature:

  • Wikipedia has some good articles on the topic. Otherwise, pick any fat textbook with "Optimization methods" in the title.

V. Linear classification

Given by Konstantin Tretyakov
Brief summary: General description of linear classifiers. Components of the linear discriminant function: weight vector and bias term. Fisher discriminant, least-squares methods, and the perceptron. Batch and online training. Various cost functions used in training and the resulting training algorithms.
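
A minimal R sketch of the perceptron training rule on a linearly separable toy problem; the data generation and the number of epochs are arbitrary.

    set.seed(3)
    # Made-up, linearly separable toy data; the first column is a constant for the bias term.
    n <- 100
    X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))
    y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)

    perceptron <- function(X, y, epochs = 20) {
      w <- rep(0, ncol(X))
      for (e in 1:epochs) {
        for (i in 1:nrow(X)) {
          if (y[i] * sum(w * X[i, ]) <= 0) {   # misclassified point
            w <- w + y[i] * X[i, ]             # perceptron update rule
          }
        }
      }
      w
    }

    w <- perceptron(X, y)
    mean(sign(X %*% w) == y)                   # training accuracy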

Slides: (pdf) (Auxiliary reading)

Literature:

  • Duda, Hart & Stork: Pattern Classification: Linear discriminant functions

VI. Feed-forward neural networks for prediction tasks

Given by Sven Laur
Brief summary: Neural networks as a toolbox for approximating complex functions. Generalised linear models and the conceptual design of a feed-forward network. Hidden layer as an adaptive and non-linear map to a higher-dimensional feature space. Sigmoid functions and radial basis functions as standard ways to build non-linear mappings. Backpropagation algorithm as an efficient gradient descent procedure. Higher-order methods for minimising the training error. Computer vision and invariance under shifts and rotations. Training methods for enforcing this type of invariance.
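
A minimal R sketch of a one-hidden-layer network trained with backpropagation (plain gradient descent on the squared error); the toy data, network size and learning rate are arbitrary illustrations.

    set.seed(4)
    # Made-up 1-D regression data.
    x <- seq(-3, 3, length.out = 60)
    y <- sin(x) + rnorm(60, sd = 0.1)

    sigmoid <- function(z) 1 / (1 + exp(-z))

    h    <- 5                                        # number of hidden units
    W1   <- matrix(rnorm(h * 2, sd = 0.5), nrow = h) # hidden weights (bias column + input)
    w2   <- rnorm(h + 1, sd = 0.5)                   # output weights (bias + hidden units)
    rate <- 0.1

    for (iter in 1:5000) {
      X1   <- rbind(1, x)                   # 2 x n input with bias row
      A    <- sigmoid(W1 %*% X1)            # h x n hidden activations
      A1   <- rbind(1, A)                   # add bias for the output layer
      yhat <- as.vector(w2 %*% A1)          # network output

      err     <- yhat - y                   # derivative of 0.5 * squared error
      grad_w2 <- A1 %*% err / length(y)
      delta   <- (w2[-1] * matrix(err, nrow = h, ncol = length(y), byrow = TRUE)) *
                 A * (1 - A)                # backpropagated error at the hidden layer
      grad_W1 <- delta %*% t(X1) / length(y)

      w2 <- w2 - rate * as.vector(grad_w2)
      W1 <- W1 - rate * grad_W1
    }

    mean((yhat - y)^2)                      # approximate training error after the loop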

Slides: PDF slides Handwritten slides

Literature:

  • Bishop: Pattern Recognition and Machine Learning pages 225 - 272

Complementary exercises:

Free implementations:

VII. Basics of probabilistic modelling

Given by Sven Laur
Brief summary: Overview of the main distributions used for probabilistic reasoning: Bernoulli, binomial, beta, Poisson, Gaussian and Laplace distributions. Text classification using naive Bayes classifiers. Bayesian networks. Probabilistic model for linear regression and the corresponding maximum likelihood solution. The effect of error models on the objective function and performance.
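
A minimal R sketch of a Bernoulli naive Bayes classifier with Laplace smoothing for 0/1 word-occurrence features; the tiny document matrix and labels are made up.

    # Bernoulli naive Bayes with Laplace smoothing for 0/1 word-occurrence features.
    train_nb <- function(X, y) {
      classes <- sort(unique(y))
      prior   <- table(factor(y, levels = classes)) / length(y)
      # P(word present | class), smoothed with pseudo-counts.
      cond <- sapply(classes, function(cl) (colSums(X[y == cl, , drop = FALSE]) + 1) /
                                           (sum(y == cl) + 2))
      list(classes = classes, prior = prior, cond = cond)
    }

    predict_nb <- function(model, xnew) {
      logpost <- sapply(seq_along(model$classes), function(j) {
        p <- model$cond[, j]
        log(model$prior[j]) + sum(xnew * log(p) + (1 - xnew) * log(1 - p))
      })
      model$classes[which.max(logpost)]
    }

    # Made-up example: columns are word indicators, labels are "spam"/"ham".
    X <- rbind(c(1, 1, 0), c(1, 0, 0), c(0, 1, 1), c(0, 0, 1))
    y <- c("spam", "spam", "ham", "ham")
    model <- train_nb(X, y)
    predict_nb(model, c(1, 1, 0))   # classifies the new document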

Slides: Attach:lecture-9.pdf Handwritten slides Naive Bayes explained for text classification

Literature:

  • Bishop: Pattern Recognition and Machine Learning pages 67 - 120
  • Weiss, Indurkhya, Zhang & Damerau: Text Mining: Predictive Methods for Analyzing Unstructured Information pages 52 - 70
  • Bishop: Pattern Recognition and Machine Learning pages 137 - 161
  • Ricci: Fitting distributions with R

Complementary exercises:

  • Bishop: Pattern Recognition and Machine Learning exercises from pages 127 - 136 that are related to practical tasks
  • Bishop: Pattern Recognition and Machine Learning pages 220 - 224
  • Some practical exercises to confirm various hypotheses about how some variables are distributed in real life.
  • Build a naive Bayes filter for detecting spam. Use the Spambase Data Set for training and testing.

Free implementations:

VIII. Maximum likelihood and maximum a posteriori estimates

Given by Sven Laur
Brief summary: Formalisation of the Maximum Likelihood and Maximum A Posteriori principles. Simple examples of ML and MAP estimates. Logistic regression and linear discriminant analysis. Link function. Connection between regularisation and the Maximum A Posteriori estimate. Regularised linear models and the corresponding priors on the parameters.
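
A minimal R sketch contrasting the maximum likelihood estimate (ordinary least squares) with the MAP estimate under a zero-mean Gaussian prior (ridge regression), plus logistic regression via the built-in glm; the data and the value of lambda are arbitrary.

    set.seed(5)
    # Made-up regression data.
    n <- 40
    X <- cbind(1, rnorm(n), rnorm(n))
    y <- X %*% c(1, 2, -1) + rnorm(n, sd = 0.5)

    # Maximum likelihood estimate under Gaussian noise = ordinary least squares.
    beta_ml <- solve(t(X) %*% X, t(X) %*% y)

    # MAP estimate under a zero-mean Gaussian prior on the weights = ridge regression.
    lambda   <- 1
    beta_map <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

    # Logistic regression (another maximum likelihood estimate) via the built-in glm.
    z   <- as.integer(X %*% c(1, 2, -1) + rnorm(n) > 0)
    fit <- glm(z ~ X - 1, family = binomial)
    coef(fit)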

Slides: Attach:lecture-10.pdf Handwritten slides

Literature:

  • Duda, Hart & Stork: Pattern Classification pages 84 - 107
  • Bishop: Pattern Recognition and Machine Learning pages 137 - 161
  • Bishop: Pattern Recognition and Machine Learning pages 204 - 220

Complementary exercises:

  • Bishop: Pattern Recognition and Machine Learning pages 173 - 177
  • Bishop: Pattern Recognition and Machine Learning pages 220 - 224
  • Practical comparison of various linear regression methods on data with different error distributions.

Free implementations:

IX. Model-based clustering techniques

Given by Meelis Kull
Brief summary: Phylogenetic trees and hierarchical clustering. Independent mutation model. Hard and soft clustering techniques. K-means clustering as a maximum likelihood estimate. Gaussian mixture model as a generalisation of the k-means algorithm. Two ways to minimise the objective function: hard and soft clustering. Soft clustering as expectation-maximisation. Robust Gaussian mixture models. Mixtures of Bernoulli distributions.
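
A minimal R sketch of hard clustering with the built-in kmeans and of the soft-assignment step of a spherical Gaussian mixture with equal weights; the two-cluster toy data is made up.

    set.seed(6)
    # Two made-up clusters in 2-D.
    X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))

    # Hard clustering with the built-in k-means.
    fit <- kmeans(X, centers = 2, nstart = 10)
    table(fit$cluster)

    # Soft assignment for a spherical Gaussian mixture with equal weights:
    # the responsibility of cluster k for x is proportional to exp(-||x - mu_k||^2 / 2).
    d2   <- sapply(1:2, function(k) rowSums((X - matrix(fit$centers[k, ], nrow(X), 2,
                                                        byrow = TRUE))^2))
    resp <- exp(-d2 / 2)
    resp <- resp / rowSums(resp)     # each row sums to one
    head(resp)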

Slides: Attach:lecture-11.pdf
Video: http://uttv.ee/naita?id=10562 (Please log in to the UTTV system with a UT account)

Literature:

  • Bishop: Pattern Recognition and Machine Learning pages 423 - 448

Complementary exercises:

  • Bishop: Pattern Recognition and Machine Learning pages 455 - 459
  • Practical clustering tasks:
    • Discretisation of continuous variables, such as height and weight or gene expression
    • Finding relations between such discretised signals

Free implementations:

X. Expectation-maximisation and data augmentation algorithms

Given by Meelis Kull
Brief summary: Formal derivation of the EM algorithm and its convergence proof. Kullback-Leibler divergence. Bayesian and frequentist interpretations. Robust Gaussian mixtures and their application in geodata analysis. Reconstruction of missing values with the EM algorithm. Data augmentation algorithm as a generalisation of the EM algorithm. Imputation of missing data with the data augmentation algorithm.
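
A minimal R sketch of the EM algorithm for a two-component univariate Gaussian mixture; the data and the initial parameter guesses are made up.

    set.seed(7)
    # Made-up data from a two-component mixture of Gaussians.
    x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))

    # Initial parameter guesses.
    pi1 <- 0.5; mu <- c(-1, 5); sigma <- c(1, 1)

    for (iter in 1:50) {
      # E-step: responsibilities of component 1 for each point.
      d1 <- pi1       * dnorm(x, mu[1], sigma[1])
      d2 <- (1 - pi1) * dnorm(x, mu[2], sigma[2])
      r1 <- d1 / (d1 + d2)

      # M-step: re-estimate the mixing proportion, means and standard deviations.
      pi1      <- mean(r1)
      mu[1]    <- sum(r1 * x) / sum(r1)
      mu[2]    <- sum((1 - r1) * x) / sum(1 - r1)
      sigma[1] <- sqrt(sum(r1 * (x - mu[1])^2) / sum(r1))
      sigma[2] <- sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1))
    }

    c(pi1 = pi1, mu = mu, sigma = sigma)   # should be close to the generating values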

Video: http://uttv.ee/naita?id=10820 (Please log in to the UTTV system with a UT account)

Literature:

  • Bishop: Pattern Recognition and Machine Learning pages 450 - 455
  • Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions pages 90 - 100

Complementary exercises:

  • Bishop: Pattern Recognition and Machine Learning pages 455 - 459
  • Use the EM and DA algorithms for imputation of missing data and compare the results
  • Use the EM and DA algorithms for clustering of events or locations

Free implementations:

XI. Principal Component Analysis (PCA)

Given by Kristjan Korjus
Brief summary: I would like you to know what PCA is, how and when to use it, how to explain it to someone not familiar with the topic, how to plot and interpret the results of the analysis, and how to write a nice report using the method. To sum up, I would like you to acquire the full skill set needed to use PCA in practice.

This lecture describes the mathematical background of PCA. Your goal is to apply PCA to data on the richest and biggest countries in the world and write a nice report explaining your analysis of the data.
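
A minimal R sketch of PCA with the built-in prcomp, using the USArrests data set that ships with R (chosen only for illustration; it is not the country data used in the homework).

    # PCA on the built-in USArrests data set (illustration only).
    pca <- prcomp(USArrests, scale. = TRUE)   # centre and scale the variables first

    summary(pca)     # proportion of variance explained by each component
    pca$rotation     # loadings: how each original variable enters the components
    head(pca$x)      # scores: the data projected onto the principal components
    biplot(pca)      # joint plot of scores and loadings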

Materials: Lecture notes, R code, video

Literature:

  • Bishop: Pattern Recognition and Machine Learning pages 559 - 599

Complementary exercises:

Free implementations:

  • Built-in stats package in R: princomp, prcomp

XII. Statistical learning theory

Given by Sven Laur
Brief summary: Bias-variance dilemma revisited. Training and testing data as iid samples from the distribution of future challenges. Confidence bounds on cost estimates via Monte-Carlo integration. Why does the training error underestimate future costs? Case study for a finite function set. Bias in the training error and its connection to the union bound and multiple hypothesis testing. Consistency and identifiability properties. VC-dimension as a way to estimate the bias in the training error. Rademacher complexity and its connection to the bias in the training error. Limitations of statistical learning theory.
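
A minimal R sketch that estimates how much the training error underestimates the test error as a function of the training sample size, in the spirit of the complementary exercises below; the data-generating model and the polynomial degree are arbitrary.

    set.seed(8)
    # How much does the training error underestimate the test error? (made-up sine model)
    simulate_gap <- function(n, degree = 5) {
      x  <- runif(n);    y  <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
      xt <- runif(1000); yt <- sin(2 * pi * xt) + rnorm(1000, sd = 0.3)
      fit <- lm(y ~ poly(x, degree))
      train_err <- mean(residuals(fit)^2)
      test_err  <- mean((yt - predict(fit, newdata = data.frame(x = xt)))^2)
      test_err - train_err
    }

    sizes <- c(20, 50, 100, 200, 500)
    gaps  <- sapply(sizes, function(n) mean(replicate(50, simulate_gap(n))))
    plot(sizes, gaps, type = "b", xlab = "training sample size",
         ylab = "test error minus training error")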

Slides: Slides Handwritten slides

Literature:

Complementary exercises:

  • Estimate the difference between training and test errors for different classifiers:
    • Draw data from a linearly separable model with some Gaussian random shifts
    • Try various linear and non-linear classifiers
    • Plot the discrepancy as a function of the training sample size
    • Draw data from a more complex model that cannot be represented by the predictor classes
    • Repeat the procedure
    • Estimate VC and Rademacher complexities and see if SLT bounds coincide with practice
  • Estimate the difference between training and test errors for different prediction algorithms:
    • Draw data from a linear model
    • Try various linear and non-linear predictors
    • Plot the discrepancy as a function of the training sample size
    • Draw data from a more complex model that cannot be represented by the predictor classes
    • Repeat the procedure
    • Estimate VC and Rademacher complexities and see if SLT bounds coincide with practice

Free implementations:

  • None that we know


XIII. Support Vector Machines

Given by Konstantin Tretyakov
Brief summary: Recap on algebra and geometry. Maximal margin classifiers. Reformulation as a quadratic programming problem. Primal and dual forms. SVM as an example of a regularized learning problem. Hinge loss as an example of a surrogate loss function.
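
A minimal R sketch of a linear soft-margin SVM using the svm function from the e1071 package (assumed to be installed); the two-class toy data is made up.

    library(e1071)   # provides svm(); assumed to be installed

    set.seed(9)
    # Made-up two-class data.
    X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
               matrix(rnorm(60, mean = 2.5), ncol = 2))
    y <- factor(rep(c("a", "b"), each = 30))
    train <- data.frame(x1 = X[, 1], x2 = X[, 2], y = y)

    # Linear soft-margin SVM; cost is the penalty on margin violations.
    fit <- svm(y ~ x1 + x2, data = train, kernel = "linear", cost = 1)
    table(predicted = predict(fit, train), actual = train$y)
    fit$SV           # the support vectors found by the quadratic programming solver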

Slides: (pdf)
Video: http://uttv.ee/naita?id=11400 (Please log in to the UTTV system with a UT account)

Literature:

  • Cristianini and Shawe-Taylor: An Introduction to Support Vector Machines pages 93 - 112
  • Schölkopf and Smola: Learning with Kernels pages 189 - 215

XIV. Kernel methods

Given by Konstantin Tretyakov
Brief summary: Feature space mapping, the kernel trick, dual representation, kernelization.
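
A minimal R sketch of the kernel trick in action: a Gaussian (RBF) kernel matrix and kernel ridge regression in the dual representation (used here only as a simple example of kernelization); the data, kernel width and regularisation constant are arbitrary.

    # Gaussian (RBF) kernel matrix between the rows of two matrices.
    rbf_kernel <- function(A, B, sigma = 1) {
      d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
      exp(-d2 / (2 * sigma^2))
    }

    set.seed(10)
    # Made-up 1-D regression data.
    X <- matrix(runif(40), ncol = 1)
    y <- sin(4 * X[, 1]) + rnorm(40, sd = 0.1)

    # Kernel ridge regression: work only with the kernel matrix (dual representation).
    K      <- rbf_kernel(X, X, sigma = 0.3)
    lambda <- 0.1
    alpha  <- solve(K + lambda * diag(nrow(K)), y)       # dual coefficients

    Xnew <- matrix(seq(0, 1, length.out = 5), ncol = 1)
    yhat <- rbf_kernel(Xnew, X, sigma = 0.3) %*% alpha   # predictions via kernels only
    yhat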

Slides: (pdf)
Video: (uttv)
Literature:

XV. Basics of ensemble methods

Given by Raivo Kolde
Brief summary: Bayesian view on model selection. Ensembles as a Monte Carlo integration technique. Committee voting as Bayesian model averaging. Bagging as bootstrapping together with averaging. Sequential error correction methods and the idea of data point weighting. AdaBoost algorithm and its reformulation as a standard minimisation problem with a peculiar cost function. Non-robustness of the AdaBoost algorithm and alternatives. Mixtures of experts and the relation to lazy learning.
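
A minimal R sketch of bagging: classification trees from the rpart package (a recommended package assumed to be available) fitted on bootstrap resamples and combined by majority vote; the toy data and ensemble size are arbitrary.

    library(rpart)   # classification trees; assumed to be available

    set.seed(11)
    # Made-up two-class data with a non-linear decision boundary.
    n  <- 200
    x1 <- runif(n); x2 <- runif(n)
    y  <- factor(ifelse(x1^2 + x2^2 + rnorm(n, sd = 0.2) > 1, "pos", "neg"))
    train <- data.frame(x1, x2, y)

    # Bagging: fit trees on bootstrap resamples and combine them by majority vote.
    B     <- 25
    trees <- lapply(1:B, function(b) {
      boot <- train[sample(n, replace = TRUE), ]      # bootstrap resample
      rpart(y ~ x1 + x2, data = boot, method = "class")
    })

    bagged_predict <- function(trees, newdata) {
      votes <- sapply(trees, function(tr) as.character(predict(tr, newdata, type = "class")))
      apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote
    }

    mean(bagged_predict(trees, train) == train$y)     # training accuracy of the ensemble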

Slides: (pdf) Video

Literature:

  • Bishop: Pattern Recognition and Machine Learning pages 653 - 674
  • Hastie, Tibshirani & Friedman: The Elements of Statistical Learning pages 337 - 387

Complementary Exercises:

  • Bishop: Pattern Recognition and Machine Learning pages 674-677
  • Study the robustness and precision of bagging and boosting on the Spambase dataset with simple tree-based classifiers.
  • Study the behaviour of Bayesian Model Averaging for linear models. Interpret the results.

Free implementations:

XVI. Particle filters

Given by Raivo Kolde
Brief summary: Reconstruction of latent variables (the signal) from observations, given a complete specification of the system. Examples of such problems: speech reconstruction, motion tracking, restoration of old music records, finding and predicting regime changes in stock markets. State space model and observation function. Filtering as a prediction task for the current timepoint. Smoothing as prediction for past timepoints that also uses present observations. A principled way of reconstructing the posterior distribution of the system state. Kalman filter as an exact solution for a particular type of system. Limitations of Kalman filtering. Rejection and importance sampling. How to use importance sampling for solving filtering problems. Inherent weaknesses of importance sampling. Resampling as a way to eliminate particles with degenerate weights. Auxiliary particle filtering.
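
A minimal R sketch of a bootstrap particle filter for a made-up state space model with a random-walk state and Gaussian observations; the noise levels and the number of particles are arbitrary.

    set.seed(12)
    # Made-up state space model: random-walk state, noisy observations.
    T_len <- 100
    state <- cumsum(rnorm(T_len, sd = 0.3))           # hidden signal
    obs   <- state + rnorm(T_len, sd = 1)             # what we actually observe

    # Bootstrap particle filter: propagate, weight by the likelihood, resample.
    N         <- 1000
    particles <- rnorm(N, 0, 1)
    filtered  <- numeric(T_len)
    for (t in 1:T_len) {
      particles <- particles + rnorm(N, sd = 0.3)     # propagate through the state model
      w <- dnorm(obs[t], mean = particles, sd = 1)    # importance weights from the likelihood
      w <- w / sum(w)
      filtered[t] <- sum(w * particles)               # posterior mean estimate of the state
      particles <- sample(particles, N, replace = TRUE, prob = w)   # resampling step
    }

    plot(state, type = "l"); lines(filtered, col = "red")   # compare truth and filter output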

Slides: Attach:lecture-19.pdf

Literature:

  • Candy: Bayesian Signal Processing: Classical, Modern, and Particle Filtering Methods
  • Chen: Bayesian Filtering: From Kalman Filters to Particle Filters and Beyond
  • Cappé, Godsill & Moulines: An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo

Video lectures:

Complementary Exercises:

  • Reconstruction of corrupted audio signal with Kalman and particle filtering.
  • Reconstruction of trajectories of moving balls based on video signal.

Free implementations: