# Syllabus

### Introductory part(Konstantin Tretyakov)

• Lecture 1: Introduction & Motivation
• Pattern Analysis: What, Why & How?
• Supervised learning. Classification. Regression.
• Case study: Linear regression. Empirical risk minimization. Maximum likelihood estimation.
• Practical session 1: Scilab
• Introduction to programming in Scilab.
• Analysis of a simple dataset using linear regression.
• Lecture 2: Basics of Optimization
• Unconstrained optimization. Differentiability. Fermat' theorem. Gradient descent. Convexity.
• Constrained optimization. Constrained gradient descent. Method of Lagrange multipliers.
• Practical session 2: Optimization
• Problem solving session. Constrained optimization. Method of Lagrange multipliers.
• Lecture 3: Basics of Probability and Statistics
• Random Variables. Probability Distributions.
• Conditional Distribution. Bayes' Rule. Independence.
• Probability theory vs Statistics.
• Hypothesis Testing. Confidence Intervals.
• Practical session 3: Statistics in Pattern Analysis
• Frequent substring example. Hypothesis testing. Multiple hypothesis testing.
• Risk estimation. Confidence intervals

### Main part(Tijl De Bie, Konstantin Tretyakov)

These lectures are based on joint work of Tijl De Bie, Nello Cristianini, and John Shawe-Taylor

### Class 1 (Monday)

• Say what?
This lecture will start with a general overview of pattern analysis. We will discuss the importance of patterns, in science and technology, business, and in society at large. We will argue that humans have an innate ability to search for (and find!) patterns without much effort, with a bias towards those that are strong, reliable, and have predictive value. In fact, we need this ability to survive.
On the other hand, we will discuss examples where our desire to observe patterns misguides us (or has misguided others). This raises serious concerns: how to determine which patterns can be trusted?
• Challenges in automatic pattern analysis
The examples will provide an overview of the main issues in pattern finding. We will motivate that there is a growing need and along with this a growing potential to develop automatic tools to find patterns in data in an automatic way, in some ways better than humans would be able to do.
We will introduce here a formal framework to study patterns and the automatic search for them. This involves the definition of a pattern (building on a concept we call a pattern function), the pattern space, and the capacity of the pattern space (building on a concept we call the capacity functional).

### Class 2 (Tuesday)

• Statistical foundations of pattern analysis
One may not be interested in just any type of pattern, but rather only in the strongest ones. In other words, when searching for interesting patterns, we usually focus intuitively on those that contrast the most with our prior assumptions. Technically, one says that one searches for the most significant patterns. We will discuss here the issues in computing the significance of a discovered pattern.
Besides this, the examples in the first lecture will have pointed out that patterns are not always as useful. Indeed, some are highly unstable, and may not be a property of ‘nature’. Such patterns are merely artefacts of the data, potentially due to noise or other artefacts. It is of great importance to recognize this, and to identify whether an identified pattern is stable or not.
The significance and stability of patterns are intimately connected, and we will attempt to elucidate the connection here.
• Algorithmic foundations of pattern analysis
A pattern needs to be studied statistically, but before all it needs to be found… This is in general a non-trivial task, as the search space is often extremely large (usually exponentially large in the problem size).
We introduce a common approach to search in discrete pattern spaces reminiscent of Branch and Bound and related search techniques. We explain this on the problem of finding the ‘longest frequent substring’ in a given string.

### Practical session 1 (Tuesday)

• Mining frequent itemsets: a case study
In this first practical session you will apply the principles of pattern analysis to the problem of mining frequent itemsets in databases. Specific topics covered are:
• The fact that size and frequency matter.
• How to find large and frequent itemsets.
• Computing the significance of a frequent itemset.
• Computing the stability of a frequent itemset: if it is frequent in a given dataset, will it be frequent in the future also?
• Mapping the concepts ‘pattern (function)’, ‘pattern space’, and ‘capacity (functional)’ onto this instance of pattern analysis.

### Class 3 (Wednesday)

• Patterns in numbers
This lecture will introduce patterns in numbers as an important branch of pattern analysis. It is the language of science since centuries, and has meant a revolution as a new way of thinking.
• Patterns in vectorial data
Numeric data is often multi-dimensional, and we will discuss some examples here. Some data is inherently numeric, some can be represented as such.
By working out the example of linear regression (Least Squares Regression as seen in the preparatory lectures, and Ridge Regression), we will hint at the so-called kernel-trick for a first time, which set the scene for the next lectures.

### Class 4 (Wednesday)

• Getting a grip on high-dimensional data
In the previous lecture we have introduced some examples of vectorial data. Here we will discuss some tools that allow to explore such data in an unsupervised way. In particular, we will discuss two extremely important and influential methods, Principal Component Analysis (PCA) which allows to extract the main sources of variation in the data, and K-means clustering which allows to recognize a grouping/clustering structure in the data.
With PCA, we discuss an algorithm that can be solved highly efficiently by means of a simple eigenvalue problem. On the other hand, we will see that the K-means problem is essentially very hard, but nevertheless, that it can be tackled reasonably well by means of an approximative method.
• Statistical analysis and kernel versions
A recurring theme in pattern analysis is the determination of the reliability of the patterns found, their significance and/or their stability. For patterns in vectorial data these statistical aspects are understood very well, and we will introduce an important technique for doing so. This technique relies on the so-called Rademacher complexity of pattern spaces. We apply this to the PCA problem.
The second part of this lecture discusses the use of the so-called kernel trick, already touched upon in a previous lecture. We derive the kernel versions of PCA and K-means clustering, and show their use on an example.

### Practical session 2 (Thursday)

• Patterns in vector spaces (1)
The session will start with simple experiments with linear regression showing the effect of dimensionality and training set size on generalization error. Regularization is introduced as a method to control test error. Ways of tuning the regularization parameter (based on training set performance, validation set performance, cross-validation) are considered. The kernel trick allows to exploit all the machinery of linear regression into the solution of a non-linear regression task. The students will implement kernel ridge regression (kRR) as an example. An important resulting insight is that regularization is especially important for kRR.

### Class 5 (Thursday)

• Supervised learning
Whereas the previous lectures have focused on unsupervised techniques (also known as exploratory data analysis), today we discuss two examples of supervised learning.
The first problem we discuss is linear regression, and we attack it using the methods known as Least Squares Regression (LSR) and Ridge Regression (RR).
The second problem concerns (binary) classification, and allows one to learn how to assign data to one of two classes. The method we will discuss is a trivial variant of RR, and is known as Fisher’s Discriminant Analysis (FDA).
We will also briefly discuss support vector machines for classification, as a more robust alternative to FDA.
• Statistical analysis and kernel versions
Just as for the unsupervised learning techniques, we can provide a statistical analysis of RR and FDA based on Rademacher complexities. Subsequently, we derive the kernel versions of these algorithms. We discuss the importance of regularization and put this in the light of a general pattern analysis context. We show how this is reflected in statistical bounds, and illustrate the effect of regularization by showing empirical results.

### Class 6 (Friday)

• Advanced topics: Kernels for structured data
We have pointed out one of the strengths of kernel methods: their ability to deal with structured types of data, such as text, strings, graph nodes,... Here we will provide a few examples, which should make clear how kernels are designed.