
Data Mining Research Seminar (MTAT.03.277), fall 2017/18

Older data mining seminars: 2008k » 2008s » 2009k » 2009s » 2010k » 2011k » 2012s » 2014k » 2014s » 2015k » 2015s » 2016k » 2016s


Potential project topics and supervisors

  • You need a topic and a supervisor to pass this course.
  • Any data mining related topic that is complex enough and has a university supervisor will do.
  • Normally you should choose your BSc or MSc thesis topic.
  • A young PhD student can pick something that brings them closer to their first article.
  • We do not have many topics to offer. Sorry about that!
  • You can also look for unclaimed topics from previous years.

General topics

Data mining topics naturally divide into areas; you can ask the supervisors below for topics in their area.

  • Bioinformatics (Hedi Peterson, Dima Fishman, Elena Sügis, Jaak Vilo)
  • Robotics (Alvo Aabloo's research group)
  • Spatial data (Anna Leontjeva, Artjom Lind, Amnir Hadachi)
  • Neuroscience (Tambet Matiisen, Ilya Kuzovkin, Raul Vicente)
  • Natural language processing (Sven Laur, Kairit Sirts, Mark Fishel)
  • Business-process mining (Fabrizio Maggi, Marlon Dumas)
  • Medical data: cleaning and analysis (Sven Laur, Jaak Vilo)
  • Machine learning (Meelis Kull)

If you already know what you want, contact these people directly and try to agree on a seminar topic that interests both of you. You can also look through previous seminars for topics.

Particular topics for this year

Bioinformatics

Modeling the CRISPR/Cas9 genome editing system

Student: ---
Supervisor: Leopold Parts
Problem description: The CRISPR/Cas9 system has revolutionized research in cell biology. DNA can now be edited easily and accurately, enabling experiments ranging from understanding gene function and establishing mutations that cause disease, to correcting inherited genetic defects. The system relies on targeting the Cas9 enzyme, which generates breaks in DNA, to chosen locations in the genome using a guide RNA (gRNA). However, as all locations cannot be targeted equally well, many gRNAs are used to edit one gene, which limits the scale of experiments that interrogate large numbers of genes at once. This motivates the need for a better understanding of the factors that aid targeting.

The features that determine the efficacy of Cas9 function have only been tested to some extent. For example, it is known that DNA sequence composition and its accessibility play a role [1,2]. Whether the editing results in a change for a cell further depends on the expression of the targeted gene and exon, conservation of the region across evolution, the protein domain edited, and the genetic background of the line. However, the extent of the influence of these factors remains poorly characterized for now, and it is difficult to predict whether a newly designed gRNA will perform well in a genome editing experiment.

The aim of this project is to build a predictive model of genome editing outcome, focusing on the properties of the targeted region. The available data are 12,475,734 existing editing experiment outcomes and their genomic context features; pipelines exist to compute additional informative features. Knowledge of genomics is beneficial.
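As a concrete starting point, guides can be turned into feature vectors before any model is fit. The Python sketch below uses positional one-hot encoding plus GC content; these particular features are illustrative assumptions, not the project's actual pipeline.

```python
def gc_content(seq):
    """Fraction of G/C bases in the guide sequence."""
    return sum(1 for b in seq if b in "GC") / len(seq)

def one_hot(seq, alphabet="ACGT"):
    """Positional one-hot encoding: one indicator per (position, base)."""
    feats = []
    for base in seq:
        feats.extend(1.0 if base == a else 0.0 for a in alphabet)
    return feats

def featurize(grna):
    """Feature vector for one guide: 4 indicators per position + GC content."""
    return one_hot(grna) + [gc_content(grna)]

x = featurize("GACGTAACGGTTCCAAGTCA")  # a 20-nt guide -> 20*4 + 1 = 81 features
```

Any standard regression or classification model could then be trained on such vectors against the measured editing outcomes.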

References:

  • Smith, Justin D., et al. "Quantitative CRISPR interference screens in yeast identify chemical-genetic interactions and new rules for guide RNA design." Genome biology 17.1 (2016). http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0900-9
  • Li, W. et al. “MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens” Genome Biology 15:554 (2014). https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0554-4

Topics from Kaur Alasoo

  • Identifying genetic variants that regulate chromatin accessibility in different experimental conditions
  • Identifying genetic variants that regulate the shape of accessible chromatin regions
  • Using multi-trait modelling to identify genetic variants associated with protein cell surface expression
  • Augmenting transcript annotations based on experimental data
  • Identifying genetic variants that regulate protein abundance via RNA splicing
  • Identifying novel genetic variants that lead to loss of gene function by disrupting RNA splicing in the Estonian Biobank

Geo-data and robotics

  • Ask Amnir Hadachi for topics

Natural language processing

Vector space models for paraphrasing

Student: ---
Supervisor: Mark Fishel
Problem description: The aim is to apply vector space model learning to n-grams, in order to learn vectors for words and phrases, similar to GloVe / word2vec / other embedding models. The final goal is to take an input phrase or sentence and generate other phrases / sentences with a similar or the same meaning (paraphrases).
Data: 1.5 billion words of raw text data in Estonian.
Methods: a word2vec-like approach with random resampling for achieving ngram2vec. Python base code is available; your task is to expand it, implement additional algorithms, and thoroughly test and analyze them.
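To make the retrieval side of the idea concrete, here is a minimal count-based sketch: it builds context vectors for unigrams and bigrams and compares them by cosine similarity. The real project would train word2vec-style embeddings rather than use raw counts; the tiny corpus below is fabricated.

```python
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat lay on the mat".split(),
]

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context_vectors(sentences, window=2):
    vecs = defaultdict(Counter)
    for toks in sentences:
        units = list(enumerate(toks))              # unigrams, by position
        units += list(enumerate(ngrams(toks, 2)))  # bigrams, by start position
        for i, unit in units:
            for j, ctx in enumerate(toks):
                if 0 < abs(i - j) <= window:       # crude: a bigram's own second
                    vecs[unit][ctx] += 1           # word also counts as context
    return vecs

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = context_vectors(corpus)
# retrieve the unit distributionally closest to the phrase "the cat"
best = max((u for u in vecs if u != "the cat"),
           key=lambda u: cosine(vecs["the cat"], vecs[u]))
```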

Vector space models for translation

Student: ---
Supervisor: Mark Fishel
Problem description: The aim is to implement one of the existing methods of learning bilingual vector space models and adapt it to n-grams (phrases), later applying it to generate translations of phrases and sentences. The final goal is to train vectors for words and phrases in two languages based on little or no translation data and lots of raw text data and use them for machine translation.
Data: 1.5 billion words of raw text data in Estonian + much more raw text in English + translation data for Estonian-English.
Methods: Some Python base code is available; your task is to implement one of the algorithms for bilingual vector space model training and thoroughly test and analyze it.

Neural network classifier for short clinical texts

Student: ---
Supervisor: Kairit Sirts (kairit.sirts@ut.ee)
Problem description: The goal of this project is to experiment with neural network (NN) text classification methods for classifying a relatively small sample of short clinical texts. Typically, neural networks need large sample sizes to learn properly; the aim here is to find out whether NN-based classification methods can also learn from a relatively small dataset (500 samples) to discriminate between Alzheimer’s patients and healthy controls. The project involves experimenting with existing neural network classifier implementations, both fully-connected and convolutional, and possibly implementing your own model with keras/tensorflow.
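To illustrate the mechanics at toy scale, the sketch below trains a one-hidden-layer network on bag-of-words features in pure Python. The vocabulary, texts, and labels are all fabricated; the actual project would use keras/tensorflow on the real 500-sample dataset.

```python
import math, random

random.seed(0)
VOCAB = ["memory", "forget", "walk", "garden"]   # hypothetical cue words

def bow(text):
    """Bag-of-words vector over the toy vocabulary."""
    toks = text.lower().split()
    return [float(toks.count(w)) for w in VOCAB]

# fabricated toy data: 1 = patient, 0 = control
data = [("forget forget memory", 1), ("memory forget", 1),
        ("walk garden", 0), ("garden walk walk", 0)]
X = [bow(t) for t, _ in data]
y = [label for _, label in data]

H = 3                                            # hidden units
W1 = [[random.uniform(-0.5, 0.5) for _ in VOCAB] for _ in range(H)]
W2 = [random.uniform(-0.5, 0.5) for _ in range(H)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return h, sigmoid(sum(v * hi for v, hi in zip(W2, h)))

for _ in range(500):                             # plain SGD on cross-entropy loss
    for x, t in zip(X, y):
        h, p = forward(x)
        d_out = p - t                            # dL/dz at the output
        for i in range(H):
            d_h = d_out * W2[i] * (1 - h[i] ** 2)   # backprop through tanh
            W2[i] -= 0.1 * d_out * h[i]
            for j in range(len(VOCAB)):
                W1[i][j] -= 0.1 * d_h * x[j]

preds = [round(forward(x)[1]) for x in X]        # fits the toy training set
```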

Evaluating word embeddings with analogy tests

Student: ---
Supervisor: Kairit Sirts (kairit.sirts@ut.ee)
Problem description: The goal of this project is to evaluate word embeddings in many languages with analogy tests. In particular, we are interested in morphological analogy questions, e.g. ‘go’ is to ‘went’ as ‘stand’ is to ‘stood’. The project involves gathering a morphological analogy test set for many languages from Universal Dependencies treebanks, training word embedding models for these languages on Wikipedia dumps using one or several word embedding systems (such as word2vec and GloVe), and evaluating the trained embeddings on the collected analogy sets.
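The analogy test itself is simple to state in code: answer "a is to b as c is to ?" by taking the vocabulary word whose vector is closest to vec(b) - vec(a) + vec(c). The toy vectors below are hand-made; the project would evaluate embeddings trained on Wikipedia dumps.

```python
from math import sqrt

# hand-made toy embeddings: dimension 2 roughly encodes past tense
emb = {
    "go":    [1.0, 0.0, 0.2],
    "went":  [1.0, 1.0, 0.2],
    "stand": [0.0, 0.0, 0.9],
    "stood": [0.0, 1.0, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Answer 'a : b :: c : ?' by nearest neighbour to b - a + c."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

answer = analogy("go", "went", "stand")
```

Accuracy on a test set is then just the fraction of questions for which the returned word matches the expected one.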

Business-process mining

Deviance Mining

Student: ---
Supervisor: Fabrizio Maria Maggi
Problem description:
Deviant executions of a business process are those that deviate, positively or negatively, from normative or desirable outcomes, such as executions that undershoot or exceed performance targets. This project aims at implementing a new approach for discriminating between normal and deviant executions. We start from the requirement that the discovered rules should explain potential causes of the observed deviances. Using feature types extracted with pattern-mining techniques as a baseline, we will explore more complex feature types to achieve higher accuracy. The approach will be implemented in Java.
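As a minimal illustration of the baseline feature type, each trace can be encoded by its activity frequencies, after which any standard classifier can try to separate normal from deviant runs. The sketch is in Python for brevity (the project itself targets Java), and the traces are fabricated.

```python
from collections import Counter

# fabricated event-log traces, grouped by label
traces = {
    "normal":  [["a", "b", "c"], ["a", "b", "b", "c"]],
    "deviant": [["a", "c"], ["a", "c", "c"]],
}

# fixed feature space: every activity seen anywhere in the log
activities = sorted({act for runs in traces.values() for t in runs for act in t})

def features(trace):
    """Activity-frequency vector: the simplest pattern-mining features."""
    counts = Counter(trace)
    return [counts[a] for a in activities]

vectors = {lab: [features(t) for t in runs] for lab, runs in traces.items()}
```

Here the deviant traces are exactly those that skip activity "b", so even a trivial rule on the feature vectors would separate the classes.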

Predictive Monitoring

Student: ---
Supervisor: Fabrizio Maria Maggi
Problem description:
Predictive process monitoring is concerned with exploiting event logs to predict how running (uncompleted) cases will unfold up to their completion. In this project, we implement an instance of a predictive process monitoring framework for estimating the probability that a given predicate will be fulfilled upon completion of a running case. The prediction problem is approached in two phases. First, prefixes of previous traces are clustered according to control flow information. Second, a classifier is built for each cluster using event data to discriminate between fulfillments and violations. At runtime, a prediction is made on a running case by mapping it to a cluster and applying the corresponding classifier. The approach will be implemented in Java.
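The runtime structure of the two phases can be sketched as follows (in Python for brevity; the project targets Java). Clustering is reduced to exact prefix matching and the per-cluster classifier to a majority vote, both gross simplifications of the framework described above; the traces are fabricated.

```python
from collections import Counter, defaultdict

# fabricated historical prefixes with their eventual outcomes
history = [
    (("register", "check"), "fulfilled"),
    (("register", "check"), "fulfilled"),
    (("register", "skip"), "violated"),
]

# phase 1: "cluster" historical prefixes by control flow (exact match here)
clusters = defaultdict(list)
for prefix, outcome in history:
    clusters[prefix].append(outcome)

# phase 2: one trivial majority-vote "classifier" per cluster
model = {p: Counter(o).most_common(1)[0][0] for p, o in clusters.items()}

def predict(running_prefix):
    """Map a running case to its cluster and apply that cluster's classifier."""
    return model.get(tuple(running_prefix), "unknown")
```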

Mining software repositories and social networks to understand team performance in agile software projects

Student: ---
Supervisor: Ezequiel Scott
Problem description:
Mining software repositories consists of applying data-mining techniques to software repositories to leverage development data. Many kinds of repositories are intensively used by developers today, such as source control and issue-tracking repositories (e.g. Bitbucket, GitHub, Jira). These repositories contain a wealth of information that can be extracted and analyzed to study several development phenomena. For example, some studies have explored how software projects evolve and how to identify relevant issues. However, few studies have explored the role of human factors in the data analyzed from software repositories. This is surprising, since human factors are involved in every software development process. The goal of this project is to use data from social networks about software developers to analyze team performance as determined by well-known agile metrics such as velocity. We will provide a dataset about several software projects, and your task will be to augment it with data from social networks. In addition, you will use simple predictive models and/or statistics to describe the impact of social features on team performance.

Uncovering dependencies among User Stories in agile software projects

Student: ---
Supervisor: Ezequiel Scott
Problem description:
Requirements are usually expressed as User Stories in agile software development. Although User Stories are expected to follow a fixed structure (“As <a role>, I want to <a feature> in order to <a benefit>”), they are still written in natural language with informal descriptions. This can lead to bad-quality user stories that are difficult for developers to understand. Existing quality frameworks argue that good-quality user stories are independent: they should not overlap in concept and should be schedulable and implementable in any order. In this context, the aim of this project is to use an unsupervised learning approach to identify clusters of dependent user stories. As a starting point, you can analyze the unstructured text of user stories using topic models, which aim to uncover relationships between words and documents. We will provide a dataset with user stories from several software projects.
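As a minimal stand-in for topic models, the sketch below links stories whose word overlap (Jaccard similarity) exceeds a threshold and treats the connected groups as candidate dependency clusters. The stories, stop-word list, and threshold are all illustrative assumptions.

```python
# fabricated user stories; the first two plausibly depend on each other
stories = [
    "As a user, I want to reset my password in order to regain access",
    "As a user, I want to change my password in order to stay secure",
    "As an admin, I want to export reports in order to share metrics",
]

STOP = {"as", "a", "an", "i", "to", "in", "my", "order", "want"}

def words(s):
    """Content words of a story: punctuation stripped, stop words removed."""
    return {w.strip(",.").lower() for w in s.split()} - STOP

def jaccard(a, b):
    return len(a & b) / len(a | b)

sets = [words(s) for s in stories]
n = len(stories)
labels = list(range(n))                        # each story starts alone
for i in range(n):
    for j in range(i + 1, n):
        if jaccard(sets[i], sets[j]) > 0.2:    # illustrative threshold
            old, new = labels[j], labels[i]    # merge the two clusters
            labels = [new if lab == old else lab for lab in labels]

clusters = {}
for idx, lab in enumerate(labels):
    clusters.setdefault(lab, []).append(idx)
```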

Predicting Story Points for User Stories

Student: ---
Supervisor: Ezequiel Scott
Problem description:
Story points are a unit of measure for expressing an estimate of the effort involved in implementing a user story in agile software development. Estimating user stories in story points can be a difficult task, particularly when developers do not have enough experience. Existing approaches have explored the use of deep learning models to recommend a story-point estimate for a given user story. Similarly, the goal of this project is to use simple models to predict story-point estimates for user stories. To this end, you will be provided with a dataset with user stories from several agile projects. You will analyze the correlation between time-to-solve and story points in the dataset to determine how good the predictions are. In addition, you will compare the results with common baseline benchmarks used in the context of effort estimation.
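The two sanity checks named above, the correlation between time-to-solve and story points and a mean-estimate baseline, can be sketched directly. All numbers below are fabricated.

```python
from math import sqrt

points = [1, 2, 3, 5, 8]          # story-point estimates
hours  = [2, 4, 7, 11, 18]        # hypothetical time-to-solve

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(points, hours)

# baseline: always predict the mean story points, a common
# effort-estimation benchmark to compare any model against
mean_pred = sum(points) / len(points)
baseline_mae = sum(abs(p - mean_pred) for p in points) / len(points)
```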

Applications of Machine Learning

Indoor localisation using signal strengths at access points

Student: ---
Supervisor: Meelis Kull
Problem description:
Using data to understand where people are is one of the fundamental tasks in a smart home. The standard motion sensors used in security systems do not provide good precision, and localisation can be improved by adding other sources of data. For this project we consider a setting where a person wears an acceleration sensor on the wrist and multiple access points receive signals from it. The received signal strength makes it possible to localise the person. Instead of modelling the complicated physics of how the signal travels, we take a much simpler approach: machine learning. The person first goes to all corners of all rooms and carefully annotates the location; this can then be used as a training set to learn a localisation model using standard machine learning methods.

For this project we have access to several datasets; one of them is 1 month long, with the potential to localise the resident whenever he/she is at home. The goal is to learn a predictive model that is reasonably accurate in telling which room the person is in. Optionally, this could be followed by an attempt to improve precision further and predict where in the room the person is. The project has good potential to be extended into a bachelor's or master's thesis.
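The learning setup can be made concrete with a fingerprinting sketch: each training example is a vector of received signal strengths (one per access point) labelled with a room, and a 1-nearest-neighbour lookup already yields a room classifier. The RSSI values (in dBm) below are fabricated.

```python
from math import dist  # Euclidean distance, Python 3.8+

# fabricated fingerprints: (RSSI at AP1, AP2, AP3) -> room label
train = [
    ((-40, -70, -80), "kitchen"),
    ((-42, -68, -79), "kitchen"),
    ((-75, -45, -60), "bedroom"),
    ((-80, -50, -58), "bedroom"),
]

def predict_room(rssi):
    """1-nearest-neighbour lookup over the labelled fingerprints."""
    return min(train, key=lambda ex: dist(ex[0], rssi))[1]

room = predict_room((-41, -69, -81))  # a reading close to the kitchen fingerprints
```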

Activity recognition using acceleration sensor data

Student: ---
Supervisor: Meelis Kull
Problem description:
Have you ever wondered how FitBit or some other device on the wrist figures out what you are doing? Here is your chance to explore the task of automatic activity recognition yourself! I can provide several datasets, where the activities have been carefully labelled and one or more acceleration sensors have been used. The first simple goal is to predict whether the person is walking, standing, sitting or lying down. Optionally, one could try to recognise more complicated activities such as cooking or eating. The project can easily be extended into a bachelor's or master's thesis, for example by considering more activities or learning location-specific models.
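A typical first pipeline slices the acceleration stream into fixed windows, computes simple statistics per window, and classifies on those. The sketch below uses a hand-set variance threshold instead of a learned model, and the signal is fabricated.

```python
from math import sqrt

# fabricated acceleration magnitudes (m/s^2), two 5-sample windows
signal = [9.8, 11.2, 8.4, 12.0, 9.0,     # oscillating: a "walking" segment
          9.8, 9.8, 9.9, 9.8, 9.8]       # nearly constant: a "still" segment

def window_features(window):
    """Per-window mean and standard deviation of the magnitude."""
    mean = sum(window) / len(window)
    std = sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    return mean, std

def classify(window):
    _, std = window_features(window)
    return "walking" if std > 0.5 else "still"   # hand-set threshold, not learned

labels = [classify(signal[i:i + 5]) for i in range(0, len(signal), 5)]
```

A real model would replace the threshold with a classifier trained on many such feature vectors, distinguishing standing, sitting, and lying down as well.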

Adjustment of classifiers after the context changes

Student: ---
Supervisor: Meelis Kull
Problem description:
Machine learning models are often very sensitive to the context and can fail badly if there is even a slight change in context. Suppose you use the photos that you took this summer and train a model to classify hair colour of the people in the photos. You now go and apply the model on the photos this autumn and realise that it wrongly predicts too many people as brunette. You study what the problem is and realise that the summer photos were taken outside in the sun whereas the new autumn photos are mostly indoors and dark.

The goal of this project is to use some standard machine learning datasets and try out a simple procedure for adjusting the model to the new context. This procedure modifies the predictions such that in the new context the distribution of predicted classes is the same as in the training data. The adjustment methods were recently proposed by the supervisor in a theoretical paper, and the aim is to show their usefulness in practice. This project has the advantage of working with simple methods while being very relevant to the state of the art in machine learning. It can be extended into a bachelor's or master's thesis.
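The adjustment step can be sketched as the standard prior-shift correction: rescale each predicted class probability by the ratio of the target class distribution (here, the training distribution) to the classifier's average predicted distribution in the new context, then renormalise. Whether this matches the supervisor's exact method is an assumption; the numbers are fabricated.

```python
def adjust(probs, predicted_priors, target_priors):
    """Reweight one prediction so the aggregate matches the target priors."""
    scaled = [p * t / q for p, q, t in zip(probs, predicted_priors, target_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]

# e.g. the model predicts "brunette" 70% of the time on autumn photos,
# but brunettes were only 40% of the training data
probs = [0.8, 0.2]                        # one photo: [brunette, blond]
adjusted = adjust(probs, [0.7, 0.3], [0.4, 0.6])
```

After adjustment the over-predicted class is down-weighted, countering the context change described above.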

Theoretical Aspects of Machine Learning

Demonstrating that isotonic calibration is biased and deriving a correction to reduce it

Student: Mari-Liis Allikivi
Supervisor: Meelis Kull
Problem description:
Our previous research has shown that isotonic calibration has problems with over-confidence in the tails of the calibration map, and has proposed corrections for that. During the experiments it has become apparent that the problem might not lie only in the tails. This makes the currently proposed corrections suitable only in some cases, making it difficult to decide whether and which correction to use. In order to propose a suitable correction, we first want to demonstrate that isotonic calibration gives biased estimates of the calibrated probabilities and show what the bias looks like. Knowing that, we will then try to derive a suitable correction using the information about the bias.
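The object under study, the isotonic calibration map, is computed by the pool-adjacent-violators (PAV) algorithm, sketched below on toy data. The bias study itself would repeat such fits over many simulated datasets and compare the calibrated probabilities to the known true ones.

```python
def pav(scores_labels):
    """Pool Adjacent Violators on (score, label) pairs.
    Returns the calibrated probability for each example, in score order."""
    pairs = sorted(scores_labels)
    merged = []                       # blocks of [sum_of_labels, count]
    for _, lab in pairs:
        merged.append([lab, 1])
        # pool while running means violate monotonicity
        while (len(merged) > 1 and
               merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]):
            s, c = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += c
    out = []
    for s, c in merged:
        out.extend([s / c] * c)       # each example gets its block's mean
    return out

# toy (score, binary label) pairs
data = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.8, 1), (0.9, 1)]
calibrated = pav(data)
```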

Generalisation of Support Vector Machines as Adversarial Learning

Student: ---
Supervisor: Sven Laur
Problem description:
Support Vector Machines are commonly stated in terms of maximal-margin classifiers, which leads to the use of the hinge loss function. Another, geometrically more interpretable, restatement of the problem is an adversarial setting with a standard loss function, which is easier to generalise to multilayer neural networks. The aim of this work is to explore how this adversarial setting generalises to deep learning.
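The maximal-margin starting point can be written down directly as subgradient descent on the hinge loss; the project would then contrast this formulation with the adversarial restatement. The 2-D data below are fabricated.

```python
# linear SVM trained by subgradient descent on
# L = max(0, 1 - y * (w.x + b)) + (lam/2) * ||w||^2
data = [((2.0, 2.0), 1), ((3.0, 1.5), 1),
        ((-2.0, -1.0), -1), ((-1.5, -2.5), -1)]   # labels are +1 / -1

w = [0.0, 0.0]
b = 0.0
lr, lam = 0.1, 0.01          # learning rate and L2 regularisation strength

for _ in range(200):
    for (x1, x2), y in data:
        margin = y * (w[0] * x1 + w[1] * x2 + b)
        if margin < 1:                       # inside the margin: hinge is active
            w[0] += lr * (y * x1 - lam * w[0])
            w[1] += lr * (y * x2 - lam * w[1])
            b += lr * y
        else:                                # outside: only shrink the weights
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]

preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for (x1, x2), _ in data]
```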
