Data Mining Research Seminar (MTAT.03.277), 2016/17 fall

Potential project topics and supervisors

  • You need a topic and a supervisor to pass this course.
  • Any data mining related topic that is complex enough and has a university supervisor will do.
  • Normally you should choose your BSc or MSc thesis topic.
  • A young PhD student can pick a topic that brings them closer to their first article.

General topics

Data mining topics naturally divide into areas, and you can ask the supervisors listed for each area about available topics:

  • Bioinformatics (Hedi Peterson, Dima Fishman, Elena Sügis, Jaak Vilo)
  • Robotics (Alvo Aabloo's research group)
  • Spatial data (Anna Leontjeva, Artjom Lind, Amnir Hadachi)
  • Neuroscience (Tambet Matiisen, Ilya Kuzovkin, Raul Vicente)
  • Natural language processing (Sven Laur, Kairit Sirts, Mark Fishel)
  • Business-process mining (Fabrizio Maggi, Marlon Dumas)
  • Medical data: cleaning and analysis (Sven Laur, Jaak Vilo)

If you know what you want, just contact these people and try to agree on a seminar topic that interests them. You can also look through previous seminars for topics.

Particular topics

Bioinformatics

Modeling the CRISPR/Cas9 genome editing system

Student: ---
Supervisor: Leopold Parts
Problem description:
The CRISPR/Cas9 system has revolutionized research in cell biology. DNA can now be edited easily and accurately, enabling experiments ranging from understanding gene function and establishing mutations that cause disease, to correcting inherited genetic defects. The system relies on targeting the Cas9 enzyme, which generates breaks in DNA, to chosen locations in the genome using a guide RNA (gRNA). However, because not all locations can be targeted equally well, many gRNAs are used to edit one gene, which limits the scale of experiments that interrogate large numbers of genes at once. This motivates the need for a better understanding of the factors that aid targeting.

The features that determine the efficacy of Cas9 function have only been tested to some extent. For example, it is known that DNA sequence composition and its accessibility play a role [1,2]. Whether the editing results in a change for a cell further depends on the expression of the targeted gene and exon, conservation of the region across evolution, the protein domain edited, and the genetic background of the line. However, the extent of the influence of these factors remains poorly characterized for now, and it is difficult to predict whether a newly designed gRNA will perform well in a genome editing experiment.

The aim of this project is to build a predictive model of genome editing outcome, focusing on the properties of the targeted region. As a first step, the editing readouts can be modeled as in [2], and the model expanded to include the abundant additional genomic information. Alternatively, other machine learning approaches can be tested. The project will be in collaboration with the Genetic Screens of Cellular Traits group at the Wellcome Trust Sanger Institute, where new data for validating the findings can be generated.

This cutting-edge project is well suited for someone with experience in (or desire to acquire) machine learning or statistical modeling methods, and basic data science skills of obtaining, cleaning, and visualising data. Knowledge of genomics is beneficial.

References:

  • [1] Smith, Justin D., et al. "Quantitative CRISPR interference screens in yeast identify chemical-genetic interactions and new rules for guide RNA design." Genome Biology 17.1 (2016). http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0900-9
  • [2] Li, W., et al. "MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens." Genome Biology 15:554 (2014). https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0554-4

Predicting human health from genomic and life history information

Student: ---
Supervisor: Leopold Parts
Problem description:
An ambitious goal of personalized medicine is to predict human health. Estonia is in a unique position: data from electronic health records, the Estonian Genome project, and additional assays are available for the same set of individuals, providing abundant high-quality information. In addition, genomic measurements have been gathered for many more non-Estonians, which could be mined for linkages that are also present in our population. For the first time, we are in a position to test how well we can forecast health using these diverse sources of data.

We previously attacked the problem of predicting individual characteristics using genomic information in yeast [1], and found that traits can be predicted surprisingly well, with on average 91% accuracy, when using information about variation in DNA, as well as other measurements for the same individual. Importantly, close relatives greatly aided prediction. This demonstrated that there are no fundamental limitations to accurate prediction, and we are now asking if the same holds true for human health information.

The aim of this project is to predict elements of electronic health records based on all the rest of the available data on the person, including DNA sequence and phenotypes of closely related individuals. The methods used would initially follow those of [1], starting with standard linear mixed models to combine information from the genome and other traits, and expanding to random forest based methods for a more flexible model class. If desired, other types of approaches, such as deep neural networks, can be tested. The project is in collaboration with the Estonian Genome Center (Geenivaramu) and its scientists.

This data science project is well-suited for someone with experience in (or desire to acquire) machine learning or statistical modeling methods, and basic data science skills of obtaining, cleaning, and visualising data. Knowledge of genomics is beneficial.

References

  • [1] Kaspar Märtens, Johan Hallin, Jonas Warringer, Gianni Liti, Leopold Parts. “Predicting quantitative traits from genome and phenome with near perfect accuracy”. Nature Communications, 2016. http://www.nature.com/ncomms/2016/160510/ncomms11512/full/ncomms11512.html

Analyzing human induced pluripotent stem cell images with deep neural networks

Student: ---
Supervisor: Leopold Parts
Problem description:
To start understanding how individual cells work, their characteristics have to be measured. A standard way to do this is using high throughput microscopy, which is a rich source of quantitative data, with thousands of pixel values measured for every cell. The main problem in making use of this large scale information is extracting meaningful features. There are many existing software packages that are able to generate numbers from each cell image, ranging from area and shape characteristics to values for Zernike features and Gabor filters, but these hundreds of numbers rarely correspond directly to a biologically interpretable signal. Alternatively, a desired biological characteristic can be modelled and quantified from the images, but this requires extensive feature engineering.

A promising alternative to the abundant local and scarce broad global image features is to extract features from deep neural networks. These models are able to make use of large-scale data to learn spatial correlations that are not easily inferred from standard parametric models. In recent years, this class of methods has outperformed alternatives on most image processing tasks, ranging from object detection to semantic embedding.

The aim of this project is to train a multilayer convolutional neural network to represent the microscopy images in a low-dimensional space, and to assess whether the neuron activities correspond to meaningful signal. This is the case for yeast cells [1], and we are now ready to test whether it holds for human cells as well.

The Wellcome Trust has funded an initiative (www.hipsci.org) to create induced pluripotent stem cell lines from hundreds of healthy and diseased individuals. These cells are special, as they have been reverted to a ground (“pluripotent”) state, in which they can be propagated in the lab, and from which they can be differentiated into many different cell types. Many cell images have been acquired from each donor, and are publicly available, together with a wealth of genomic data. In addition, we have been awarded a grant from NVIDIA, which provided us with a new Tesla GPU hosted at the Department of Computer Science in Tartu.

This computational project is suitable for someone with relatively good computer engineering and hacking skills, as well as some background in mathematics.

References

  • Tanel Pärnamaa, Leopold Parts. “Accurate classification of protein subcellular localization from high throughput microscopy images using deep learning”. doi: http://dx.doi.org/10.1101/050757

Geo-data and robotics

Predicting Bus Arrival Time on the Basis of Sparse GPS Data

Student: ---
Supervisor: Amnir Hadachi & Artjom Lind
Problem description:
Accurate real-time prediction of bus arrival times is vital to both bus operation control and passengers, because arrival time feeds into many operations. For example, transit operators can respond promptly to unexpected service interruptions and delays by introducing various bus control strategies. It can also help to spot problematic routes and shifts that run late. Many algorithms have been developed for this problem, and their results are satisfactory. However, prediction accuracy can still be improved; the challenge lies in the sparseness of the data and in the inaccuracies introduced during data gathering. It is therefore necessary to investigate how to clean the data and, where needed, correct it from a mapping perspective. This step is called preprocessing of the raw data. In addition, a geographic information system (GIS) will need to be integrated for geo-localization on maps. The key points of this project can be summarized as follows:

  • Predicting Arrival Time
  • Estimating Travel Time
  • Map-Matching (correcting erroneous GPS data)
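
As a rough illustration of the map-matching step, the sketch below snaps a noisy GPS fix to the closest point on a single road segment. It assumes coordinates already projected to a planar system (for example metres), and the function name `snap_to_segment` and the sample points are hypothetical. Real map matching would also take the road network topology and the sequence of fixes into account.

```python
def snap_to_segment(point, seg_start, seg_end):
    """Return the point on the segment seg_start-seg_end that is closest to `point` (2-D, planar)."""
    px, py = point
    ax, ay = seg_start
    bx, by = seg_end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return (ax, ay)
    # Position of the orthogonal projection along the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


# Hypothetical road segment along the x-axis and two noisy GPS fixes.
road = ((0.0, 0.0), (4.0, 0.0))
print(snap_to_segment((2.0, 1.0), *road))  # projection falls inside the segment
print(snap_to_segment((5.0, 1.0), *road))  # projection is clamped to the segment end
```

Clamping the projection keeps the corrected position on the road even when the raw fix overshoots the end of the segment.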

Anomalous Bus Trajectory Detection Using GPS Traces

Student: ---
Supervisor: Amnir Hadachi & Artjom Lind
Problem description:
GPS is now present in many of the devices we use in daily life, and the amount of data available is enormous. The traces left behind by GPS-enabled vehicles let us observe the dynamics of a city’s road network. Today’s cities are living entities full of diverse information that can help us understand how to make our lives better, more comfortable, and safer. The GPS traces collected from buses make it possible to extract statistical, dynamic, and behavioral information about bus drivers and about urban road traffic. The main objective of this project is to use the GPS traces to build an understanding of drivers’ behavior, which in turn will help us observe anomalies in bus trajectories. In other words, the resulting algorithm should be able to detect anomalies from the patterns extracted from the GPS traces. The key points in this project can be summarized as follows:

  • Trajectory Extraction
  • Pattern Extraction
  • Anomaly Detection

Safe flight-time prediction for drones

Student: ---
Supervisor: Sven Laur
Problem description:
Drones are commonly powered by lithium batteries, which have a fixed capacity. State-of-the-art drones stay airborne for around one hour. The exact airtime depends on the drone, the route, and weather conditions. The aim of this project is to use simple models to predict the airtime to within 5-10 seconds, so that the drone pilot can adequately plan and adjust his or her mission. We will provide onboard information and flight telemetry for this task.

Natural language processing

Quantifying the variability of the word embedding spaces

Student: ---
Supervisor: Kairit Sirts
Problem description:
Word embeddings represent natural language words in a low-dimensional dense vector space. Using word embeddings as features in natural language processing tools is beneficial because they preserve various semantic and syntactic properties of the language. However, different word embedding spaces, even when trained on the same data using the same method, are not directly comparable because their axes are not aligned. The goal of this project is to evaluate the variability between different instances of word embedding spaces induced from the same data. The project consists of three main parts:

  1. Train several embedding spaces on the same training data using a tool such as word2vec.
  2. Align the axes of the different embedding spaces by learning linear transformation functions.
  3. Evaluate the variability between the different aligned embedding spaces.
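
The alignment and evaluation steps can be sketched as follows. This is only one possible choice: it assumes the linear transformation is restricted to an orthogonal matrix (the orthogonal Procrustes solution) and that row i of both matrices holds the vector of the same word. The random matrices stand in for trained embeddings, and all names here are hypothetical.

```python
import numpy as np


def align_orthogonal(source, target):
    """Return the orthogonal matrix W minimizing ||source @ W - target||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
# Stand-in for one embedding space: 50 "words" in 5 dimensions (random, not trained).
space_a = rng.normal(size=(50, 5))
# Simulate a second training run as a rotated copy of the first plus small noise.
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
space_b = space_a @ rotation + 0.01 * rng.normal(size=(50, 5))

w = align_orthogonal(space_a, space_b)
# Frobenius-norm residuals before and after alignment; the latter is one crude variability measure.
residual_before = np.linalg.norm(space_a - space_b)
residual_after = np.linalg.norm(space_a @ w - space_b)
print(residual_before, residual_after)
```

The residual left after alignment is one simple way to quantify how much two embedding spaces still differ; per-word distances or nearest-neighbour overlap would be natural alternatives.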

Implement a command line utility for extracting semantic propositions from dependency trees

Student: ---
Supervisor: Kairit Sirts
Problem description:
Dependency parse trees express the syntactic and semantic structure of sentences using directed graphs. According to the dependency structure, each word in a sentence has a head: a word that it is syntactically and/or semantically dependent on. For instance, the main verb is typically the root of the sentence, the subject and the direct and indirect objects are dependents of the verb, and adjectives are dependent on the nouns they modify. Stanford CoreNLP is a collection of tools and libraries that implement various natural language processing methods. It includes Semgrex, a package for performing pattern matching on dependency graphs. The goal of this project is to implement a command line utility for matching patterns on dependency graphs using the existing CoreNLP Semgrex API.

Neural network based dependency parser

Student: ---
Supervisor: Kairit Sirts
Problem description:
A dependency parser is one of the core NLP tools; it expresses the syntactic structure of sentences with directed dependency graphs. Recently, neural network based dependency parsers have reached state-of-the-art performance in English. The goal of this project is to experimentally train neural network based dependency parsers on Estonian using the Universal Dependencies treebank. The project can develop into a Bachelor's or Master's thesis.

Learning the space of morphological transformations

Student: ---
Supervisor: Kairit Sirts
Problem description:
The ability to generate morphologically inflected (for nouns) or conjugated (for verbs) word forms is important for many natural language processing systems, especially in morphologically complex languages such as Estonian. The goal of this project is to learn the space of morphological transformations using TransE, a well-known model used in the relation prediction task. Relation prediction systems learn from fact triples (head entity, relation, tail entity) by projecting all entities and relations into a low-dimensional dense vector space. In this project the same method will be applied to triples (base form, morphological transformation, inflected/conjugated form). The project involves two steps:

  1. Prepare the training and test data from morphologically annotated Multext-East corpora.
  2. Conduct experiments with the TransE model (C++ code available).

The project can develop into a Bachelor's or Master's thesis.
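
To illustrate the triple-based setup, the sketch below scores candidate inflected forms with the TransE dissimilarity ||head + relation - tail||, where a lower value means a more plausible triple. The 2-dimensional vectors are hand-picked for illustration rather than trained, and the function and variable names are hypothetical.

```python
from math import sqrt


def transe_score(head, relation, tail):
    """TransE dissimilarity ||head + relation - tail||_2; lower means a more plausible triple."""
    return sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical 2-dimensional embeddings, hand-picked for illustration (not trained).
embeddings = {
    "maja": [1.0, 0.0],   # base form ("house")
    "majas": [1.0, 1.0],  # inessive form
    "majad": [2.0, 0.0],  # plural form
}
inessive = [0.0, 1.0]     # vector for the transformation "base form -> inessive"

# Rank candidate word forms by how closely base form + transformation lands on them.
candidates = ["majas", "majad"]
ranked = sorted(
    candidates,
    key=lambda word: transe_score(embeddings["maja"], inessive, embeddings[word]),
)
print(ranked[0])
```

In a trained model the vectors would be learned so that correct triples score lower than corrupted ones; here they are fixed only to show how the score ranks candidates.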

Predicting Loan Applicant Reliability

Student: ---
Supervisor: Mark Fishel
Problem description:
Data:

  • information on clients, their financial responsibilities and recent behaviour

Aim -- model client’s future behaviour:

  • should they get a loan
  • do we expect them to pay on time / be late / default / etc.

Methods:

  • series/sequence modelling, HMMs/RNNs
  • simpler things like binary classification

Modelling Semantic Similarity

Student: ---
Supervisor: Mark Fishel
Problem description:
Data:

  • domain name queries, their availability, final client’s order
  • lots of raw text
  • WordNet

Aim -- learn to suggest alternative domains:

  • e.g. “kurjadkassid.ee” is taken, maybe you’re interested in “tigedadkiisud.ee”?

Methods:

  • word vector representation learning (like word2vec/GloVe)
  • later maybe deep learning

Probabilistic models and semi-supervised learning

Student: ---
Supervisor: Mark Fishel
Problem description:
Given

  • texts + part-of-speech tags and other morphological info

find a better prediction model for

  • predicting PoS & morphological info

Method

  • design a better probabilistic graphical model, modelling output dependency network
  • the sexy part: do semi-supervised learning from partially annotated data (EM / CRP)

Morphological Segmentation via Bayesian Inference

Student: ---
Supervisor: Mark Fishel
Problem description:
Implement and test the method published in http://www.aclweb.org/anthology/P08-1084. For a very friendly introduction to the methodology, see http://www.isi.edu/natural-language/people/bayes-with-tears.pdf

Business-process mining

Causal Mining of Business Process Deviance

Student: ---
Supervisor: Marlon Dumas
Problem description:
Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviate, in a negative or positive way, with respect to the expected or desirable outcomes of the process. Deviant executions of a business process include those that violate compliance rules, or executions that undershoot or exceed performance targets. Deviance mining is concerned with uncovering the reasons for deviant executions by analyzing business process event logs. Current deviance mining techniques are focused on identifying patterns or rules that are correlated with deviant outcomes. However, the obtained patterns might not actually help to explain the causes of the deviance. In this thesis, you will enhance existing deviance mining techniques with causal discovery techniques in order to more precisely identify the potential causes of deviant process executions.

Dynamic Time Warping for Predictive Monitoring of Business Processes

Student: ---
Supervisor: Marlon Dumas
Problem description:
Predictive business process monitoring refers to a family of online process monitoring methods that seek to predict as early as possible the outcome of each case given its current (incomplete) execution trace and given a set of traces of previously completed cases. In this context, an outcome may be the fulfillment of a compliance rule, a performance objective (e.g., maximum allowed cycle time) or business goal, or any other characteristic of a case that can be determined upon its completion. For example, in a sales process, a possible outcome is the placement of a purchase order by a potential customer, whereas in a debt recovery process, a possible outcome is the receipt of a debt repayment.

Existing approaches for predictive business process monitoring are designed for processes with a relatively high level of regularity, where most cases go through the same stages and these stages are more or less of the same length. In the case of very irregular processes where the number of stages and their length is variable, the accuracy of these techniques generally suffers. In this project, you will design an approach to predictive process monitoring that addresses this limitation by using a time series analysis technique known as dynamic time warping. The thesis will adopt an experimental approach. You will implement a prototype and compare it with implementations of other predictive process monitoring techniques using a collection of real-life event logs.
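
For reference, dynamic time warping compares two sequences of different lengths by allowing parts of one to be stretched or compressed along the time axis. The sketch below is a minimal pure-Python version of the classic dynamic-programming recurrence. The function name `dtw_distance`, the absolute-difference cost, and the example sequences are illustrative assumptions, not part of the project specification.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical per-stage measurements for two cases; the second case repeats one stage.
short_case = [1.0, 2.0, 3.0]
long_case = [1.0, 2.0, 2.0, 3.0]
print(dtw_distance(short_case, long_case))
```

Because the repeated stage can be matched to the same element of the shorter sequence, the two cases above are treated as identical, which is the property that makes the technique attractive for processes with stages of variable length.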

Topic in Process Mining

Student: ---
Supervisor: Fabrizio Maggi
Problem description:
Process discovery techniques try to generate process models from execution logs. Declarative process modeling languages are more suitable than procedural notations for representing discovery results derived from logs of processes that operate in dynamic, poorly predictable environments. However, existing declarative discovery approaches mine declarative specifications by treating each activity in a business process as an atomic, instantaneous event. In this project, we investigate how to use discriminative rule mining in the discovery task to characterize the lifecycles that lead to constraint violations and those that ensure constraint fulfillment. The approach will be implemented in Java as a plug-in for the process mining tool ProM.
Related literature: http://link.springer.com/chapter/10.1007%2F978-3-319-09870-8_21

Topic in Deviance Mining

Student: ---
Supervisor: Fabrizio Maggi
Problem description:
Deviant executions of a business process are those that deviate in a negative or positive way with respect to normative or desirable outcomes, such as executions that undershoot or exceed performance targets. This project aims to implement a new approach for discriminating between normal and deviant executions. We start from the requirement that the discovered rules should explain potential causes of the observed deviance. Using feature types extracted with pattern mining techniques as a baseline, we explore more complex feature types to achieve higher accuracy. The approach will be implemented in Java as a plug-in for the process mining tool ProM.
Related literature: http://link.springer.com/chapter/10.1007%2F978-3-662-45563-0_25

Pre-assigned topics

Combining feature extraction methods in SSVEP-based BCI

Student: Anti Ingel
Supervisor: Ilya Kuzovkin
Problem description:
In recent years, developing a direct communication channel between the brain and an external device has received much attention. One way to achieve this is to evoke a brain potential called the steady-state visual evoked potential (SSVEP) and then use a feature extraction method to extract information related to the potential from an electroencephalography (EEG) recording. A brain-computer interface (BCI) can use this information to send commands chosen by the user to an external device at an information transfer rate of up to 25 bit/min using a consumer-grade EEG device. The aim of this project is to show that combining several feature extraction methods in the decision-making process of the BCI achieves better performance than using a single feature extraction method.

Aspect mining and fact extraction

Student: Lisa Yankovskaja
Supervisor: Sven Laur & Hedi Peterson
Problem description:
The aim of this research project is to become familiar with fact-extraction techniques known as aspect mining and to apply them in the domain of product classification and summarisation. We use hand-curated descriptions of bioinformatics tools as the gold standard to compare against.

Context-based clustering for fact extraction

Student: Robert Roosalu
Supervisor: Sven Laur
Problem description:
Rule-based fact extraction methods are quite common in the medical domain because they have guaranteed precision. However, they also have low recall. Context-based clustering of candidate phrases can simplify both the derivation and the validation of fact extraction rules. The aim of this project is to develop a practical web tool for this purpose and to validate its performance on medical data.

Extraction and Clustering of App Features from App Reviews

Student: Lisa Yankovskaja
Supervisor: Sven Laur & Hedi Peterson
Problem description:
The aim of this research project is to become familiar with fact-extraction techniques known as aspect mining and to apply them in the domain of product classification and summarisation. We use hand-curated descriptions of bioinformatics tools as the gold standard to compare against.

Semi-automatic method for creating virtual reality environments

Student: Andres Traumann
Supervisor: Margus Niitsoo
Problem description:
The aim of this research project is to use a 360-degree camera to create a virtual reality illusion by showing individually taken pictures on a fixed grid. The concrete subtask is to stitch images taken with the camera into 360-degree projections on the sphere and to compare the result with the existing solution from Samsung in terms of efficiency and quality.
