You need a topic and a supervisor to pass this course.
Any data-mining-related topic that is complex enough and has a university supervisor will do.
Normally you should choose your BSc or MSc thesis topic.
A young PhD student can pick a topic that brings them closer to their first article.

Potential supervisors and their areas of interest

If you know what you want, just contact one of the people listed below and try to agree on a seminar topic that interests them. You can also look through their seminars for topics given out in previous years.

Data mining and machine learning topics naturally divide into areas, and you can ask the supervisors in each area for topics.

Bioinformatics
- Hedi Peterson
- Kaur Alasoo
- Leopold Parts
- Dima Fishman
- Elena Sügis
- Jaak Vilo
Robotics
- Alvo Aabloo
Spatial data
- Anna Leontjeva
- Artjom Lind
- Amnir Hadachi
Neuroscience
- Tambet Matiisen
- Ilya Kuzovkin
- Ardi Tampuu
- Raul Vicente
Natural language processing
- Sven Laur
- Kairit Sirts
- Mark Fishel
- Eduard Barbu
Business-process mining
- Fabrizio Maggi
- Marlon Dumas
- Dietmar Pfahl
- Mario Ezequiel Scott
Medical data: cleaning and analysis
- Sven Laur
- Jaak Vilo
Theoretical aspects in machine learning
- Meelis Kull


Derive topic from your own thesis topic

If you already have a thesis topic that is related to data mining or machine learning, or you are choosing one right now, then a well-framed subtask can be used as a topic for the seminar.

You can choose any thesis topic from the Theses Topics Registry.
You need to contact the supervisor and me to fix the exact scope of the project.

Particular topics for this year

Machine learning in software development

Mining software repositories and social networks to understand team performance in agile software projects

Student: ---
Supervisor: Ezequiel Scott
Problem description: Mining software repositories consists of applying techniques that extract data from software repositories in order to make use of development data. Many kinds of repositories are used intensively by developers today, such as source control and issue tracking repositories (e.g. Bitbucket, GitHub, Jira). These repositories contain a wealth of information that can be extracted and analyzed to study various development phenomena. For example, some studies have explored how software projects evolve and how to identify relevant issues. However, few studies have explored the role of human factors in the data analyzed from software repositories. This is surprising, since human factors are involved in every software development process. The goal of this project is to use data from social networks about software developers to analyze team performance as determined by well-known agile metrics such as velocity. We will provide a dataset of several software projects, and your task will be to augment it with data from social networks. In addition, you will use simple predictive models and/or statistics to describe the impact of social features on team performance.

Uncovering dependencies among User Stories in agile software projects

Student: ---
Supervisor: Ezequiel Scott
Problem description: Requirements are usually expressed as User Stories in agile software development. Although User Stories are expected to follow a fixed structure (“As <a role>, I want to <a feature> in order to <a benefit>”), they are still written in natural language with informal descriptions. This can lead to poor-quality user stories that are difficult for developers to understand. Existing quality frameworks argue that good-quality user stories are independent: they should not overlap in concept and should be schedulable and implementable in any order. In this context, the aim of this project is to use an unsupervised learning approach to identify clusters of dependent user stories. As a starting point, you can analyze the unstructured text of user stories by using topic models, which aim to uncover relationships between words and documents. We will provide a dataset with user stories of several software projects.
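
Before moving to topic models, it may help to see how overlap between stories can be measured at all. The sketch below is a much simpler stand-in for a topic model: it compares stories by the cosine similarity of their word counts and lists the pairs that reach a threshold. The example stories, the stop-word list, the function names, and the 0.4 threshold are all illustrative assumptions and are not taken from the provided dataset.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list covering the words of the fixed story template.
STOP_WORDS = {"as", "a", "an", "i", "want", "to", "in", "order", "the", "my"}


def bag_of_words(story):
    """Lower-case the story, split it into words, and drop template words."""
    words = re.findall(r"[a-z]+", story.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def overlapping_pairs(stories, threshold=0.4):
    """Return index pairs of stories whose similarity reaches the threshold."""
    bags = [bag_of_words(s) for s in stories]
    return [(i, j) for i, j in combinations(range(len(bags)), 2)
            if cosine(bags[i], bags[j]) >= threshold]


# Made-up example stories.
stories = [
    "As a customer, I want to reset my password in order to regain access to my account",
    "As a customer, I want to change my password in order to keep my account secure",
    "As an admin, I want to export monthly reports in order to share them with managers",
]
print(overlapping_pairs(stories))
```

Pairs flagged this way are only candidates for dependence: word overlap says nothing about scheduling order, which is where a topic model or richer features would come in.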

Recommending issues to developers

Student: ---
Supervisor: Ezequiel Scott
Problem description: In agile software development, task allocation is often based on self-assignment. That is, developers choose the user stories that they will develop during the sprint according to their own preferences and experience. Industry practice gives some evidence to support this method of task allocation, but how it takes place is not yet completely clear. As far as we know, developers apply different strategies for self-assigning different types of tasks (new feature, enhancement, bug fix). However, applying these strategies to decide which task to develop can be difficult for inexperienced developers. The goal of this project is to use features of the developers to recommend tasks to them. To this end, you will use clustering techniques to provide the recommendations. You will be provided with a dataset with user stories of several agile projects.

Business-process mining

Deviance Mining

Student: ---
Supervisor: Fabrizio Maria Maggi
Problem description:
Deviant executions of a business process are those that deviate, negatively or positively, from normative or desirable outcomes, such as executions that undershoot or exceed performance targets. This project aims to implement a new approach for discriminating between normal and deviant executions. We start from the requirement that the discovered rules should explain potential causes of the observed deviances. Using feature types extracted with pattern mining techniques as a baseline, we explore more complex feature types to achieve higher accuracy. The approach will be implemented in Java.

Predictive Monitoring

Student: ---
Supervisor: Fabrizio Maria Maggi
Problem description:
Predictive process monitoring is concerned with exploiting event logs to predict how running (uncompleted) cases will unfold up to their completion. In this project, we implement an instance of a predictive process monitoring framework for estimating the probability that a given predicate will be fulfilled upon completion of a running case. The prediction problem is approached in two phases. First, prefixes of previous traces are clustered according to control flow information. Second, a classifier is built for each cluster using event data to discriminate between fulfillments and violations. At runtime, a prediction is made on a running case by mapping it to a cluster and applying the corresponding classifier. The approach will be implemented in Java.
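
The two-phase structure described above can be sketched in a few lines. This is a toy illustration written in Python rather than the project's Java, and it makes strong simplifying assumptions: "clustering" is replaced by exact grouping on the activity sequence of a fixed-length prefix, and the per-cluster classifier is a k-nearest-neighbour vote on a single numeric attribute. The event log, the `amount` attribute, and all function names are made up.

```python
from collections import defaultdict

# Each training trace: (list of (activity, amount) events, predicate fulfilled?).
# The log and the "amount" attribute are invented for illustration.
TRAINING_LOG = [
    ([("register", 100), ("check", 100), ("approve", 100)], True),
    ([("register", 120), ("check", 120), ("approve", 120)], True),
    ([("register", 900), ("check", 900), ("reject", 900)], False),
    ([("register", 950), ("check", 950), ("reject", 950)], False),
    ([("register", 300), ("escalate", 300), ("reject", 300)], False),
    ([("register", 80), ("escalate", 80), ("approve", 80)], True),
]


def control_flow(prefix):
    """Cluster key: the sequence of activities, ignoring event data."""
    return tuple(activity for activity, _ in prefix)


def train(log, prefix_length=2):
    """Phase 1: group prefixes by control flow. Phase 2: store labelled data per group."""
    clusters = defaultdict(list)
    for events, fulfilled in log:
        prefix = events[:prefix_length]
        last_amount = prefix[-1][1]
        clusters[control_flow(prefix)].append((last_amount, fulfilled))
    return dict(clusters)


def predict(clusters, running_prefix, k=1):
    """Map the running case to its cluster and take a k-nearest-neighbour vote."""
    members = clusters.get(control_flow(running_prefix))
    if not members:
        return None  # unseen control flow: no prediction
    amount = running_prefix[-1][1]
    nearest = sorted(members, key=lambda m: abs(m[0] - amount))[:k]
    # Fraction of the nearest training prefixes whose case fulfilled the predicate.
    return sum(fulfilled for _, fulfilled in nearest) / len(nearest)


model = train(TRAINING_LOG)
print(predict(model, [("register", 110), ("check", 110)]))
print(predict(model, [("register", 400), ("escalate", 400)]))
```

With `k=1` the output is a hard 0 or 1. A larger `k` gives a rough fulfilment probability, and a real implementation would replace both the grouping and the vote with proper clustering and classification algorithms.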

Natural language processing

Implement a recursive auto-encoder for learning sentence representations

Student: ---
Supervisor: Kairit Sirts
Problem description:
The task is to implement a recursive auto-encoder for learning natural language sentence representations. The inputs to the model are parsed sentences, and the recursive encoder must encode each sentence using the structure of its parse tree. The decoder part of the auto-encoder must then generate the same sentence from the encoded representation using a recurrent neural network.

The project should be implemented using either TensorFlow or PyTorch (https://devblogs.nvidia.com/recursive-neural-networks-pytorch/). The student is required to have some familiarity with natural language processing (at least knowing what a parse tree is) and must have experience in using either TensorFlow or PyTorch.

Linking concepts in the Estonian Wordnet with the corresponding Estonian Wikipedia pages

Student: ---
Supervisor: Eduard Barbu
Problem description:
The Estonian Wordnet is a lexical ontology that lists the senses of each word and the various relations that hold between word senses. The challenge is to automatically find the mapping between Wordnet concepts and the titles of pages in the Estonian Wikipedia. Various supervised and unsupervised methods can be tried.

Ontology population

Student: ---
Supervisor: Eduard Barbu
Problem description:
The Estonian Wordnet is a lexical ontology where for each word its senses are listed and various relations between word senses hold. The task is to populate the ontology concepts with the instances present in Estonian Wikipedia (e.g. Donald Trump and Kersti Kaljulaid are instances of the concept "president"). Various approaches (unsupervised and supervised) can be tried.

Ontology rules acquisition

Student: ---
Supervisor: Eduard Barbu
Problem description:
The task is to mine Estonian Wikipedia for rules involving quantifiers (e.g. a car has four wheels, most birds fly, a president's mandate lasts four years, etc.). Various approaches (unsupervised and supervised) can be tried.

Time-series analysis and prediction

Regime change detection in biomass measurements

Student: ---
Supervisor: Sven Laur
Problem description:
PRIA has recurring radar and satellite images of fields and meadows. The task is to reliably detect when a particular field is harvested or another agricultural event takes place. The data is already aggregated to field level; your task is to build linear and nonlinear predictors and filters to detect regime changes and correlate them with agricultural events.
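
One common filter for this kind of task is a two-sided CUSUM detector, sketched below as a possible starting point. It estimates a baseline mean from a short window, accumulates deviations from it, and raises an alarm when the accumulated sum exceeds a threshold. The series, the window length, and the `drift` and `threshold` values are invented for illustration and would need tuning on the real field-level data.

```python
def cusum_change_points(series, baseline_length=5, drift=0.5, threshold=3.0):
    """Return the indices at which a two-sided CUSUM detector raises an alarm."""
    alarms = []
    start = 0
    while start + baseline_length < len(series):
        baseline = series[start:start + baseline_length]
        mean = sum(baseline) / baseline_length
        up = down = 0.0
        alarm = None
        for i in range(start + baseline_length, len(series)):
            deviation = series[i] - mean
            # Accumulate deviations beyond the allowed drift in each direction.
            up = max(0.0, up + deviation - drift)
            down = max(0.0, down - deviation - drift)
            if up > threshold or down > threshold:
                alarm = i
                break
        if alarm is None:
            break
        alarms.append(alarm)
        start = alarm  # re-estimate the baseline from the new regime
    return alarms


# Made-up biomass index for one field: stable, then a sharp drop (e.g. a harvest).
biomass = [5.0, 5.2, 4.9, 5.1, 5.0, 5.1, 4.8, 5.0,
           1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 0.9, 1.0]
print(cusum_change_points(biomass))
```

The detected indices can then be compared with the dates of known agricultural events. A lower threshold reacts faster but produces more false alarms on noisy measurements.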

Analysis of data from power tests

Student: ---
Supervisor: Sven Laur
Problem description:
Power electronic components are tested extensively before they are deployed. Nevertheless, some of them fail. Your task is to build a linear or nonlinear predictive model for the test measurements. If the achieved prediction is good, one can then look for abnormalities (unexplainable residuals) and correlate them with failures.
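
As a minimal sketch of the residual idea, the code below fits an ordinary least-squares line to one measurement as a function of another and flags points whose residual is unusually large. The two measured quantities (a load current and a temperature), the data, and the limit of two standard deviations are assumptions made for illustration; the real test data will have its own variables and noise levels.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def abnormal_indices(xs, ys, z_limit=2.0):
    """Indices whose residual lies more than z_limit standard deviations from zero."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if spread and abs(r) > z_limit * spread]


# Hypothetical test run: load current (A) against temperature (degrees C).
current = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
temperature = [22.1, 23.9, 26.0, 28.1, 29.9, 32.0, 40.0, 36.1, 37.9, 40.0]
print(abnormal_indices(current, temperature))
```

Note that a large outlier also pulls the fitted line towards itself, so a robust fit or a model trained only on known-good components may separate abnormal residuals more cleanly.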

Analysis of drone power levels

Student: ---
Supervisor: Sven Laur
Problem description:
The flight time of a drone depends on many external factors. Nevertheless, it should be possible to predict flight time using the power level graph as the main indicator. Your task is to build a linear or nonlinear predictor that can estimate the remaining flight time on the fly.
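
The simplest linear predictor of this kind fits a straight line to the recent power level readings and extrapolates it to a cutoff level. The sketch below does exactly that; the sampling times, the percentage scale, the 20 % cutoff, and the function name are illustrative assumptions rather than properties of the actual drone data.

```python
def remaining_flight_time(times, levels, cutoff=20.0):
    """Extrapolate a least-squares line through (time, level) points to the cutoff.

    Returns the estimated seconds left after the last sample, or None when the
    power level is not dropping and no estimate can be made.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_level = sum(levels) / n
    stt = sum((t - mean_t) ** 2 for t in times)
    stl = sum((t - mean_t) * (level - mean_level) for t, level in zip(times, levels))
    slope = stl / stt  # change in power level per second
    if slope >= 0:
        return None
    intercept = mean_level - slope * mean_t
    time_at_cutoff = (cutoff - intercept) / slope
    return max(0.0, time_at_cutoff - times[-1])


# Made-up readings: power level in percent, sampled once a minute.
times = [0, 60, 120, 180, 240]
levels = [100.0, 95.0, 90.0, 85.0, 80.0]
print(remaining_flight_time(times, levels))
```

To run on the fly, the function can be called on a sliding window of the latest readings, so that the slope follows changes in load such as wind or manoeuvres.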

Generalisation of Support Vector Machines as Adversarial Learning

Student: ---
Supervisor: Sven Laur
Problem description:
Support Vector Machines are commonly stated as maximal-margin classifiers, which leads to the use of the hinge loss function. Another, geometrically more interpretable, restatement of the problem is an adversarial setting with a standard loss function, which is easier to generalise to multilayer neural networks. The aim of this work is to explore how this adversarial setting generalises to deep learning.

Radar imaging

The following topics are given out by AS Datel:

Openness and visibility of a geographic location to the viewing direction of the Sentinel-1 satellite
Bulk import and processing of the land areas of infrastructure objects
An unsupervised method for detecting the subsidence of buildings from radar-interferometric measurements
Assessing the suitability of an object for radar-interferometric monitoring from a single radar image
An application that computes vertical and horizontal deformation

Further details on these topics are available on request.

Data Mining Research Seminar, spring 2017/18