Data Mining Research Seminar (MTAT.03.277), 2015/16 fall

Older Datamining Seminars: 2008k » 2008s » 2009k » 2009s » 2010k » 2011k » 2012s » 2014k » 2014s » 2014k

Potential project topics and supervisors

Under construction. To be completed

Over the years, various supervisors have offered many topics, so if you do not find a suitable topic on this page, look at the topics from previous years.

Bioinformatics

More topics are offered in the separate bioinformatics seminar, MTAT.03.242 Bioinformatics Seminar.

Automatic annotation of training materials (ppt, pdf) with EDAM terms

  • Supervisor: Hedi Peterson
  • Student:
  • Slides:
  • Abstract:

In the ELIXIR project we aim to link training materials from different bioinformatics areas to the topics, operations, data formats etc. described in the EDAM Ontology. Additionally, as there is a Bioinformatics Tools and Services registry, it would be useful to detect tool names automatically, provide links to existing tool descriptions, or add tools to the annotation pipeline waiting list.

Annotating the existing materials with EDAM Ontology terms and extracting the tools requires text processing and some data mining skills.

In practice, the input files are ppt and pdf files, from which free text needs to be extracted, words mapped to ontology terms and, ideally, potential tool names extracted.
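As a rough illustration of the "words mapped to ontology terms" step, here is a minimal sketch of dictionary-based matching on already extracted text. The vocabulary and the identifiers in `TERMS` are hypothetical placeholders, not real EDAM entries; a real pipeline would load labels and synonyms from the ontology file and would first need a ppt/pdf text extractor.

```python
import re

# Hypothetical mini-vocabulary: real labels and identifiers would be
# loaded from the EDAM ontology file instead of being hard-coded.
TERMS = {
    "sequence alignment": "hypothetical:operation_sequence_alignment",
    "phylogenetics": "hypothetical:topic_phylogenetics",
    "fasta": "hypothetical:format_fasta",
}

def annotate(text, terms=TERMS):
    """Return sorted (label, term_id) pairs whose label occurs in the text."""
    # Lower-case and collapse whitespace so labels match across line breaks.
    normalised = re.sub(r"\s+", " ", text.lower())
    hits = []
    for label, term_id in terms.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(label) + r"\b", normalised):
            hits.append((label, term_id))
    return sorted(hits)

slide_text = "Multiple  sequence\nalignment of FASTA files"
annotations = annotate(slide_text)
```

Exact string matching like this misses inflected forms and synonyms, which is where the data mining part of the project would come in.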

Automatic EDAM Ontology term mapping to publication abstracts

  • Supervisor: Hedi Peterson
  • Student:
  • Slides:
  • Abstract:

Every day thousands of scientific publications are accepted in the biomedical field, and many bioinformatics methods are developed and published daily. We would therefore like to automatically extract from the abstracts of scientific articles the words that can be mapped to EDAM Ontology terms about topics, operations and data formats. This would make it easier to get an overview of the available tools and would automatically link published tools to relevant terms, so that end users can query and retrieve them easily.

To fulfil the task, the student needs to work with a large set of text files, extract terms, and map them to the existing knowledge base or, if possible, propose new terms based on the analysed text.

Autoencoder on EEG/ECoG Data

  • Supervisor: Ilya Kuzovkin
  • Student:
  • Slides:
  • Abstract:

The idea is to take a Kaggle EEG competition dataset, which has a very strong benchmark and published solutions by the winning teams. Most probably those solutions are based on complex hand-crafted features. We will apply an autoencoder / deep autoencoder / stacked RBM to see how close to the benchmark we can get using only these fully automatic feature extraction methods. Complexity: strong bachelor / master

3D convolutional filter for fMRI data

  • Supervisor: Ilya Kuzovkin
  • Student:
  • Slides:
  • Abstract: The very nature of fMRI data is 3D: it is a volumetric scan of brain activity. A convolutional filter allows a deep neural network to learn features (filters) that are useful for classifying, for example, images. The idea is to apply a 3D convolutional filter to fMRI data and see a) how successful it will be, and b) what the learnt features (the ones considered useful by a DNN for classifying brain activity) look like. Will they reveal some interesting patterns of fMRI brain activity?
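To make the notion of a 3D convolutional filter concrete, the following is a minimal pure-Python sketch of sliding one fixed kernel over a small volume. The toy volume and the averaging kernel are made up for illustration; in the project itself the kernels would be learnt by a deep learning library rather than written by hand.

```python
def conv3d_valid(volume, kernel):
    """Slide a 3D kernel over a 3D volume (no padding, stride 1).

    Both arguments are nested lists indexed as [z][y][x]. As in most deep
    learning libraries, this is cross-correlation (the kernel is not flipped).
    """
    vz, vy, vx = len(volume), len(volume[0]), len(volume[0][0])
    kz, ky, kx = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(vz - kz + 1):
        plane = []
        for y in range(vy - ky + 1):
            row = []
            for x in range(vx - kx + 1):
                total = 0.0
                for dz in range(kz):
                    for dy in range(ky):
                        for dx in range(kx):
                            total += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(total)
            plane.append(row)
        out.append(plane)
    return out

# Toy 3x3x3 "scan" with one active voxel in the centre,
# filtered with a 2x2x2 averaging kernel.
volume = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
volume[1][1][1] = 8.0
kernel = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
response = conv3d_valid(volume, kernel)
```

Every 2x2x2 window of this toy volume contains the active centre voxel, so all eight responses are equal.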

Bayesian dosage adjustment during on-going treatment

  • Supervisor: Tauno Metsalu
  • Student: Tõnis Tasa
  • Slides:
  • Abstract: Existing pharmacokinetic models can be updated using Bayesian frameworks. Individual estimates for subjects can then be obtained from only a limited amount of sampling data. The new estimates can be used to conduct simulation studies for assessing the efficacy of dosing schemes and optimising them. The aim of this project is to develop such a framework for dose adjustment in on-going treatments with arbitrary dosing schedules.
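The kind of Bayesian update meant here can be sketched on a grid. The example below assumes a one-compartment intravenous bolus model with a known volume of distribution, a Gaussian population prior on clearance and Gaussian measurement noise. All of these are simplifying assumptions for illustration, and the numbers are invented rather than taken from any real drug or patient.

```python
import math

def concentration(dose, volume, clearance, t):
    """One-compartment IV bolus model: C(t) = dose/V * exp(-(CL/V) * t)."""
    return dose / volume * math.exp(-clearance / volume * t)

def posterior_clearance(grid, prior, samples, dose, volume, sigma):
    """Grid-based Bayes update of clearance from (time, concentration) samples."""
    weights = []
    for cl, p in zip(grid, prior):
        log_lik = 0.0
        for t, observed in samples:
            residual = observed - concentration(dose, volume, cl, t)
            log_lik += -0.5 * (residual / sigma) ** 2  # Gaussian noise model
        weights.append(p * math.exp(log_lik))
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative numbers only.
grid = [1.0 + 0.1 * i for i in range(41)]  # candidate clearances, 1.0 to 5.0 L/h
raw_prior = [math.exp(-0.5 * (cl - 3.0) ** 2) for cl in grid]  # centred on 3 L/h
prior = [p / sum(raw_prior) for p in raw_prior]

true_cl = 2.0  # the (normally unknown) individual clearance
samples = [(t, concentration(100.0, 20.0, true_cl, t)) for t in (2.0, 6.0)]

post = posterior_clearance(grid, prior, samples, dose=100.0, volume=20.0, sigma=0.1)
estimate = sum(cl * p for cl, p in zip(grid, post))  # posterior mean clearance
```

With only two samples the posterior mean already moves from the population value towards the individual's clearance; the posterior could then drive simulations of alternative dosing schedules.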

Business process mining and data acquisition

No dedicated topics this year. However, any data mining topic from the topic list of the software engineering group is acceptable, provided that the supervisors agree.

Robotics

Topics available from previous years. See 2014 fall.

Computer vision and image processing

Using Neural Networks for Diabetic Retinopathy Detection in Eye Images

  • Supervisor: Dimitro Fishman
  • Student:
  • Slides:
  • Abstract:

Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow up, miscommunication, and delayed treatment (from http://www.kaggle.com/c/diabetic-retinopathy-detection).

Thus, we will try to develop an automated method for detecting DR in eye images using machine learning techniques, in particular neural networks. We will start by implementing a simple Softmax classifier and later substitute it with a three-layer artificial neural network; the ultimate goal is to build a convolutional neural network for image classification. In this project we will follow the Convolutional Neural Networks course, an open online course from Stanford (http://cs231n.github.io/).

  • Project prerequisites:
    • Matrix/vector arithmetic
    • Understanding of the basics of machine learning
    • Programming language: Python (if you don't have experience with Python, you will learn it)
  • Associated topics: Image processing, neural networks, cost function optimization, parameter tuning, model comparison
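For orientation, the first step mentioned above (a simple Softmax classifier) can be sketched in a few lines of plain Python. The two-feature toy data stand in for image features and are purely illustrative; the function names are our own, and a real implementation would use vectorised array operations.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(W, b, x, label, lr):
    """One gradient step of a linear softmax classifier on one example.

    W[k] is the weight vector of class k. Returns the cross-entropy loss
    measured before the update.
    """
    scores = [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]
    probs = softmax(scores)
    for k in range(len(W)):
        grad = probs[k] - (1.0 if k == label else 0.0)  # dLoss/dscore_k
        W[k] = [w - lr * grad * xi for w, xi in zip(W[k], x)]
        b[k] -= lr * grad
    return -math.log(probs[label])

# Toy data: two "features", two classes.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
for epoch in range(100):
    losses = [sgd_step(W, b, x, y, lr=0.5) for x, y in data]
```

The loss starts at ln 2 for two classes and shrinks as the weights separate the two examples.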

Iterative Closest Point algorithm and its applications

  • Supervisor: Andres Traumann
  • Student:
  • Slides:
  • Abstract: The Iterative Closest Point algorithm is a common way to reconstruct 3D objects from 2D images or from partial 3D measurements. The aim of the project is to use either a Kinect or a standard mobile phone to reconstruct 3D surfaces. The exact topic will be agreed in discussion with the student. Potential project topics are:
    • Kinect as a simple 3D scanner
    • Surface reconstruction from mobile phone images.
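The core loop of Iterative Closest Point alternates between matching each point to its nearest neighbour and solving for the best rigid motion. The sketch below shows this in 2D with brute-force matching, to keep the least-squares step in closed form. It is a simplified illustration with made-up points, not a 3D scanner pipeline; a 3D version would typically use an SVD-based rotation estimate and a spatial index for the neighbour search.

```python
import math

def transform(points, theta, tx, ty):
    """Rotate 2D points by theta (radians) and translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def nearest(p, points):
    """Brute-force nearest neighbour of p among points."""
    return min(points, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def best_rigid_2d(src, dst):
    """Least-squares rotation and translation mapping src[i] onto dst[i]."""
    n = len(src)
    scx = sum(x for x, _ in src) / n
    scy = sum(y for _, y in src) / n
    dcx = sum(x for x, _ in dst) / n
    dcy = sum(y for _, y in dst) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - scx, py - scy
        bx, by = qx - dcx, qy - dcy
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dcx - (c * scx - s * scy), dcy - (s * scx + c * scy)

def icp_2d(source, target, iterations=10):
    """Align source to target by repeated matching and rigid fitting."""
    current = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        theta, tx, ty = best_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current

# An L-shaped point set and a slightly rotated, shifted copy of it.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
source = transform(target, 0.1, 0.1, -0.05)
aligned = icp_2d(source, target)
```

The method only converges to the right answer when the initial misalignment is small enough for the nearest-neighbour matches to be mostly correct, as they are in this toy case.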

Generating 3D point cloud of a corridor system/large room using monocular SLAM

  • Supervisor: Janno Jõgeva
  • Student: Janno Jõgeva
  • Slides:
  • Abstract: SLAM stands for Simultaneous Localisation And Mapping. It is widely used in robotics to enable a robot to familiarise itself with its surroundings (mapping) while navigating the space using the map being generated (localisation). One is needed for the other; think of the "chicken or the egg" problem.

This project would use a single camera and the LSD-SLAM toolkit developed in Munich. The toolkit also uses ROS (Robot Operating System) for input and output. The main aim of this project is to produce a dataset consisting of a point cloud and the input video of a larger space, with documentation on how it was obtained. The secondary aim is to write a configuration management script (e.g. for Ansible) that sets the whole system up on a Linux-based system, more precisely a single-click setup for a clean Debian-based system. This is to simplify future use of the toolkit.

Deliverables:

  • Point cloud and input video of selected space
  • Configuration management script for setting up the capturing system
  • Project report

Language technology

Semi-supervised NER

  • Supervisor: Aleksandr Tkatšenko
  • Student:
  • Slides:
  • Abstract: Named Entity Recognition (NER) is the task of extracting interesting information units from text, such as person names, geographical locations and organisations. A standard way to develop a NER tagger is to manually annotate named entities in a piece of text and let a machine learning algorithm learn discriminative rules for each entity type. Since manual labelling is a time-consuming process, a natural way to further improve a NER tagger is to make use of freely available text data, such as news articles, digital books, etc.
    The goal of the project is to improve an existing NER tagger using unlabelled text, based on the following algorithm. First, you process the unlabelled text corpus with the existing NER tagger and aggregate the predictions for each entity. Instances with a consistent NE tag prediction throughout the text can then be added to the original manually labelled dataset and the system re-trained. Repeating these steps multiple times can result in a more robust entity tagger. Your task is to implement the algorithm and document the results. You will be provided with the existing NE tagger and the unlabelled text corpus.

Deliverables:

  • working code
  • ne-tagged dataset
  • report which summarises experimental results
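The aggregation step of the algorithm described in the abstract could look roughly like the sketch below. It assumes the tagger output has already been flattened into (entity string, tag) pairs; the thresholds `min_count` and `min_agreement` are hypothetical parameters for illustration, not values prescribed by the project.

```python
from collections import Counter, defaultdict

def consistent_entities(predictions, min_count=3, min_agreement=0.9):
    """Select entity strings whose predicted tag is stable across the corpus.

    predictions: iterable of (entity_string, tag) pairs produced by an
    existing tagger on unlabelled text.
    """
    tags_by_entity = defaultdict(Counter)
    for entity, tag in predictions:
        tags_by_entity[entity][tag] += 1

    selected = {}
    for entity, counts in tags_by_entity.items():
        total = sum(counts.values())
        tag, freq = counts.most_common(1)[0]
        # Keep entities seen often enough and tagged (almost) always the same way.
        if total >= min_count and freq / total >= min_agreement:
            selected[entity] = tag
    return selected

predictions = (
    [("Tallinn", "LOC")] * 4
    + [("Nokia", "ORG")] * 2 + [("Nokia", "LOC")] * 2
    + [("Mari", "PER")]
)
trusted = consistent_entities(predictions)
```

Sentences containing the selected entities would then be added to the training set and the tagger re-trained, repeating the loop for several rounds.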

Mapping named entities

  • Supervisor: Aleksandr Tkatšenko
  • Student:
  • Slides:
  • Abstract: Named entity recognition refers to the task of automatically extracting certain types of entities (e.g. names of people, organisations, locations, etc.) from text. Given a piece of text, a typical NER system will find and tag chunks of text representing an entity string, e.g. “<LOCATION>Iraani</LOCATION> president <PERSON>Mahmoud Ahmadinejad</PERSON> süüdistas vaenlaste salasepitsusi riiki tabanud rängas põuas.” However, such output is of limited value, since it is not yet clear that the tagged string “Iraani” refers to a concrete object in the world: the country Iraan. Your task is to implement a method which maps extracted strings to concrete objects. It is up to you which collection of “concrete objects” to use. It can be, for instance, simply a collection of all possible geographical locations, or a list of countries, cities, etc. Your system should effectively address the problem of word form variation; for instance, an inflected string such as “Tallinnast” should be mapped to “Tallinn”. Try to make your solution as generic as possible so that it can be applied to other languages without much tweaking.

Resources

  • Named entity recognition system
  • annotated corpus for NER for evaluation (locations, people names, organisations)
  • morphological analyser / lemmatiser
  • database of world locations: Geonames.org (also in Estonian).
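As a baseline for the word form variation problem, one could try the crude heuristic sketched below: map a mention to the longest gazetteer name that is a prefix of it. This is only an assumption-laden stand-in for the morphological analyser listed above. It fails whenever inflection changes the stem, and the tiny gazetteer is invented for the example.

```python
def map_to_gazetteer(mention, gazetteer):
    """Map an inflected mention to the longest gazetteer name that is its prefix.

    Returns None when no gazetteer name matches.
    """
    lowered = mention.lower()
    candidates = [name for name in gazetteer if lowered.startswith(name.lower())]
    return max(candidates, key=len) if candidates else None

# Hypothetical mini-gazetteer; a real one could be built from a location database.
gazetteer = ["Tartu", "Tallinn", "Iraan"]
mapped = [map_to_gazetteer(m, gazetteer) for m in ("Tallinnast", "Iraani", "Riia")]
```

Comparing such a baseline with lemmatiser-based matching on the annotated corpus would show how much the morphological analysis actually contributes.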

Automated Classification of Estonian Online Media Articles

  • Supervisor: Aleksandr Tkatšenko
  • Student:
  • Slides:
  • Abstract: Document classification refers to the task of automatically assigning a topic category to a text document based on its content. The goal of this project is to explore different forms of document representation with respect to classification accuracy. Possible representation forms include words, lemmas, n-grams, tf-idf weighting, article section weighting, named entities, tags, LSI/LDA/random indexing, clustering, etc. For this purpose, the student will be provided with a large corpus collected from major Estonian online news outlets. The student will need to topic-tag a set of articles for training and evaluation purposes and implement and compare several document representation schemes.

Deliverables:

  • topic-tagged dataset,
  • implementation of several representation schemes,
  • report which summarises experimental results.
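One of the representation forms mentioned in the abstract, tf-idf weighting, can be sketched as follows. This uses one common variant (relative term frequency times the natural log of inverse document frequency, without smoothing); libraries differ in the exact formula, and the two tokenised toy documents are made up.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append(
            {term: (count / len(doc)) * math.log(n / df[term]) for term, count in tf.items()}
        )
    return vectors

docs = [
    ["government", "budget", "vote"],
    ["football", "match", "vote"],
]
vectors = tfidf(docs)
```

A term that occurs in every document, such as "vote" here, gets weight zero, while terms specific to one document are weighted up. Swapping tokens for lemmas or n-grams changes only the input to the function.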

Multi-label document classification

  • Supervisor: Aleksandr Tkatšenko
  • Student:
  • Slides:
  • Abstract: As opposed to the classical document classification task, where each document is assigned exactly one topic category, multi-label classification deals with the situation where a document is characterised by a number of more fine-grained topics or folksonomy tags (e.g. military, Syria crisis, USA, Barack Obama, chemical weapons, etc.). The goal of this project is to implement and compare several existing methods of multi-label document classification in the context of the Estonian online media domain. For this purpose, the student will be provided with a large corpus collected from major Estonian online news outlets, together with the topic tags assigned by the editors.

Deliverables:

  • implementation of several approaches for multi-label classification,
  • report which summarises experimental results.

Materials:

  • Statistical topic models for multi-label document classification, Rubin et al, Machine learning, 2012
  • Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, D Ramage et al, 2009
  • OpenCalais - a service by Reuters that automatically extracts semantic information from web pages: http://viewer.opencalais.com/
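Comparing multi-label methods requires a metric that handles sets of labels per document. One common choice, shown here as a suggestion rather than the evaluation prescribed for the project, is micro-averaged F1; the gold and predicted tag sets below are invented.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of gold and predicted label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # labels correctly predicted
        fp += len(p - g)  # labels predicted but not in the gold set
        fn += len(g - p)  # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"military", "Syria crisis"}, {"USA"}]
predicted = [{"military"}, {"USA", "sport"}]
score = micro_f1(gold, predicted)
```

Micro-averaging pools the counts over all documents, so frequent tags dominate the score; a macro-averaged variant would weight every tag equally.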

Machine learning of temporal relation annotation

  • Supervisor: Siim Orasmaa
  • Student:
  • Slides:
  • Abstract: Temporal annotation specifies which events are described in a natural language text and how these events are temporally located and ordered. The goal of the current project is to create a prototype of a system that can automatically identify temporal relations in Estonian news texts. The project involves experimenting with different supervised machine learning techniques, using data from the Estonian TimeML annotated corpus. The temporal relations in the corpus have been prepared closely following the TempEval-2 setting (Pustejovsky, Verhagen 2009; Verhagen et al. 2010). The initial aim is to predict temporal relations based on given event and temporal expression annotations, but the task can also be extended to predicting temporal relations based solely on syntactic annotations.
  • References:
    • Estonian TimeML Annotated corpus. See also https://github.com/soras/EstTimeMLCorpus/blob/master/readme.txt.
    • Pustejovsky, J., & Verhagen, M. (2009). SemEval-2010 task 13: evaluating events, time expressions, and temporal relations (TempEval-2). In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (pp. 112-116). Association for Computational Linguistics.
    • Verhagen, M., Sauri, R., Caselli, T., & Pustejovsky, J. (2010). SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th international workshop on semantic evaluation (pp. 57-62). Association for Computational Linguistics.

Sentence segmentation and preliminary analysis of the English USENET corpus

  • Supervisor: Mark Fishel
  • Student:
  • Slides:
  • Abstract: The USENET corpus is a huge English-language corpus whose text is in raw form and needs sentence segmentation and a preliminary statistical analysis. The main difficulty lies in the size of the corpus, and the project is fairly technical.
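Because the corpus is too large to load at once, a streaming design is a natural starting point. The sketch below reads lines lazily and splits each blank-line-separated paragraph with a deliberately naive boundary rule. Both the rule and the paragraph convention are assumptions for illustration; the rule will mis-split abbreviations such as "Mr. Smith".

```python
import re

# Naive boundary: sentence-final punctuation, whitespace, then an
# uppercase letter or an opening quote.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')

def split_sentences(lines):
    """Yield sentences from an iterable of text lines, one paragraph at a time."""
    buffer = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
            continue
        if buffer:  # a blank line closes the current paragraph
            yield from BOUNDARY.split(" ".join(buffer))
            buffer = []
    if buffer:
        yield from BOUNDARY.split(" ".join(buffer))

lines = ["Hello there. How are", "you? Fine.", "", "New post here."]
sentences = list(split_sentences(lines))
```

Since `split_sentences` is a generator, it can be fed an open file object directly and keeps only one paragraph in memory, which matters at this corpus size.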

Latent parts of speech in a parallel corpus

  • Supervisor: Mark Fishel
  • Student:
  • Slides:
  • Abstract: The goal is to investigate whether parts of speech can be discovered from a parallel corpus without annotation, and whether they can be used to improve machine translation quality. There is at least one well-established, simple method for using parts of speech in machine translation. A simplified variant would be to divide words into content words and function words instead of parts of speech.

Machine learning

Spatiotemporal modeling of accidents and the Estonian rescue board response times

  • Supervisor: Anna Leontjeva
  • Student:
  • Slides:
  • Abstract: This is a practice-oriented master's thesis topic that can be started as a seminar project.

Prediction of the accidents by data enrichment of the Estonian rescue board with public resources

  • Supervisor: Anna Leontjeva
  • Student:
  • Slides:
  • Abstract: This is a practice-oriented master's thesis topic that can be started as a seminar project.