Spreadsheet with data mining and machine learning topics
The topics below are from the autumn semester, so some of them may already have been completed.
Natural language processing
Multilingual speech recognition
Supervisor: Mark Fishel
Problem description: Typical speech recognition solutions handle one language at a time. At the same time, there is a multilingual speech corpus (voice.mozilla.org) and open-source software for training speech recognition models (https://github.com/mozilla/DeepSpeech). Your task is to test whether multiple languages can fit into the same model and whether bigger languages (like English) help the quality of recognizing smaller languages (like Estonian).
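As one possible starting point, the sketch below pools per-language training manifests into a single multilingual one before training. It assumes DeepSpeech-style CSV manifests (wav_filename, wav_filesize, transcript columns); the file paths, the language list, and the exact column layout are assumptions to be checked against the actual Common Voice importer output.
```python
import pandas as pd

# Per-language manifests produced by a Common Voice importer.
# Paths are placeholders; the column layout is assumed, not verified.
manifests = {
    "en": "cv_en/train.csv",
    "et": "cv_et/train.csv",
}

frames = []
for lang, path in manifests.items():
    df = pd.read_csv(path)
    df["language"] = lang          # keep a language tag for later analysis
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)

# Shuffle so that training batches mix languages, then write one manifest.
combined = combined.sample(frac=1.0, random_state=0)
combined.drop(columns=["language"]).to_csv("train_multilingual.csv", index=False)
```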
Medical text generation with deep language models
Supervisor: Mark Fishel & Raivo Kolde
Problem description: The study of medical texts is complicated by privacy requirements. It is almost impossible to get freely available texts in the medical domain due to privacy concerns. One way out of this problem is artificial text synthesis, and recent developments in text synthesis make this technically possible. Your task is to explore the applicability of state-of-the-art methods on a large collection of Estonian medical texts.
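Purely as an illustration of the workflow, the sketch below fine-tunes an off-the-shelf causal language model with Hugging Face transformers and then samples synthetic text from it. The gpt2 checkpoint and the tiny in-memory corpus are placeholders; for this topic an Estonian-capable pretrained model and the real medical text collection would be substituted.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and corpus; replace with an Estonian-capable model
# and the actual medical text collection.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
corpus = ["patient presented with chest pain", "follow-up visit scheduled"]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(2):
    for text in corpus:
        inputs = tokenizer(text, return_tensors="pt")
        # For causal LM fine-tuning the labels are the input ids themselves.
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Sample synthetic text from the fine-tuned model.
model.eval()
prompt = tokenizer("patient presented with", return_tensors="pt")
generated = model.generate(**prompt, max_length=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```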
Autonomous Driving
Environment based localisation
Supervisor: Dmytro Fishman
Problem description:
Self-driving cars rely heavily on pre-built high-definition maps to orient themselves in space. As surroundings change rapidly (especially in urban areas), the maps must change as well to keep autonomous driving safe. In this project, we want to build a system that can first detect changes in the environment and then update the underlying map in response to these changes. This work will build on top of Vladislav Fediukov's MSc thesis, which was about detecting changes in the environment by comparing pre-recorded LiDAR point clouds and camera images with new measurements.
We are interested in second-year MSc students who would be willing to carry out this work as part of their MSc theses.
Human-like speed curves
Supervisor: Tambet Matiisen
Problem description:
When the vehicle has to stop for a traffic light, a pedestrian crossing, or to give way, it is unclear what the most comfortable speed curve for bringing it to a stop is. Should it be linear? Should it be exponential? Should it happen in one or two phases? Figuring this out is the goal of this project.
The project consists of:
- Doing literature review on speed curves for stopping.
- Recording a number of speed curves from human drivers using our test vehicle.
- Analyzing those speed curves.
- Proposing a mathematical formula for altering the speed with respect to the distance to the stopping point.
The final result of the project is a simple mathematical formula: speed = f(distance). You have to find f. If there are multiple candidate f-s, then you can also do a user survey: have three people in the car, brake with different f-s, and have them rate each curve from 1 to 5.
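To make the "speed = f(distance)" idea concrete, here is a small Python sketch of three candidate curves (linear, constant-deceleration, two-phase). The approach speed, braking distance, and the shapes of the curves are illustrative assumptions, not results of the project.
```python
import numpy as np

V_MAX = 10.0      # approach speed in m/s (assumed)
D_START = 30.0    # distance at which braking begins, in metres (assumed)

def f_linear(d):
    """Speed proportional to the remaining distance."""
    return V_MAX * np.clip(d / D_START, 0.0, 1.0)

def f_const_decel(d):
    """Constant deceleration: v = sqrt(2 * a * d), with a chosen so that
    v(D_START) = V_MAX."""
    a = V_MAX ** 2 / (2 * D_START)
    return np.sqrt(2 * a * np.clip(d, 0.0, D_START))

def f_two_phase(d):
    """Two phases: gentle slow-down far away, firmer braking close in."""
    d = np.clip(d, 0.0, D_START)
    half = D_START / 2
    return np.where(d > half,
                    0.7 * V_MAX + 0.3 * V_MAX * (d - half) / half,
                    0.7 * V_MAX * np.sqrt(d / half))

if __name__ == "__main__":
    # Print the three candidate speeds at a few distances to the stop point.
    for d in (30, 20, 10, 5, 1, 0):
        print(d,
              round(float(f_linear(d)), 2),
              round(float(f_const_decel(d)), 2),
              round(float(f_two_phase(d)), 2))
```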
End-to-end driving dataset and baseline model
Supervisor: Tambet Matiisen
Problem description:
There are two main approaches to autonomous driving: the modular approach, which decomposes the driving pipeline into independent and testable modules, and the end-to-end approach, which models the entire driving pipeline as one big neural network. While the modular approach is easier to test and debug, it is also more complicated to set up and relies on specific know-how in localization, perception, planning, etc. End-to-end driving, on the other hand, requires just the collection of a big enough dataset to train a powerful neural network model. Of course, it has its own shortcomings: it requires a lot of computing power and can be unreliable. We want to push the limits of end-to-end driving models by making them more capable and reliable. To this end we need a reliable dataset to train our models on.
The project consists of:
- Fixing the sensor set that will be in the dataset.
- Collecting the data by driving around in Tartu. Alternatively: using existing data.
- Postprocessing and cleaning the dataset.
- Training a baseline end-to-end driving model using imitation learning (a rough sketch follows below).
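As a rough illustration of what the baseline could look like, here is a minimal behavioural-cloning sketch in PyTorch: a small convolutional network maps a front-camera image to a steering command and is trained with a regression loss on recorded human driving. The DrivingDataset placeholder, the network architecture, and the image size are assumptions for the sketch, not the project's actual design.
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class DrivingDataset(Dataset):
    """Placeholder dataset: in the real project this would read camera
    frames and the recorded steering angles from the collected logs."""
    def __init__(self, n=256):
        self.images = torch.rand(n, 3, 66, 200)        # dummy RGB frames
        self.steering = torch.rand(n, 1) * 2 - 1       # dummy angles in [-1, 1]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.steering[idx]

class SteeringNet(nn.Module):
    """Small CNN regressor from image to steering angle (assumed architecture)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 100), nn.ReLU(),
            nn.Linear(100, 1), nn.Tanh(),              # steering in [-1, 1]
        )

    def forward(self, x):
        return self.head(self.features(x))

def train():
    loader = DataLoader(DrivingDataset(), batch_size=32, shuffle=True)
    model = SteeringNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for epoch in range(3):
        for images, steering in loader:
            loss = loss_fn(model(images), steering)    # imitate the human command
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: loss {loss.item():.4f}")

if __name__ == "__main__":
    train()
```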
Bioinformatics
List of topics in Google Docs
Supervisor: Kaur Alasoo
Health Informatics
List of topics in Google Docs
Supervisors: Raivo Kolde, Sulev Reisberg and Sven Laur
Software Engineering and business process management
You can pick only topics related to statistics, machine learning and data mining
Eesti Energia
Contact person: Kristjan Eljand
Detailed description of topics
- Neighborhood Power Plant
- Electric Vehicle Charging Predictor
- Grid congestion manager
- Solar production forecast
- Solar Potential Calculator
- Mining Vehicle Analytics
Eesti Haigekassa
Finding anomalous medical bills
Background: In general, every time you visit a doctor, the doctor (hospital) submits a medical bill for that visit to the Health Insurance Fund – paying such bills is one of the Fund's core activities. The Fund checks whether these bills are justified and, for incorrect bills, claims the money back from the hospital. Among other things, certain a priori suspicious bills are checked year after year, for example ones where the same procedure has been performed on the same person repeatedly on the same or adjacent days. Bills pre-selected in this way are reviewed manually, asking the hospital for explanations and additional documents where necessary. The vast majority of these bills nevertheless turn out to be justified, but some also lead to repayment claims. The goal of the internship is to apply statistical models to narrow down the set of bills that go to manual review.
Topics:
- The same service for the same person appearing on different bills, provided on dates up to 3 days apart. If desired, three sub-cases can be distinguished:
- One bill from a general practitioner and the other from a specialist
- Bills from different doctors at the same hospital
- Bills from different hospitals
- Bills that carry the same number in the hospital's own system but have been submitted to the Health Insurance Fund more than once.
The dataset thus consists of sets of bills matching the conditions described above (the typical example is a pair of bills, but there are many other cases as well).
The goal is to find those sets of bills (or services – a decision to be made) in which the payment for at least one service has been reclaimed. It is strongly recommended to use a method that, in addition to a binary result, also gives a number (the "probability" of a reclaim) that can be used to prioritise the bill sets for review. Note that the dataset is imbalanced, i.e. sets with a reclaim make up only about 1% of all the sets described above. Pointing out the concrete reasons why certain sets are more suspicious than others is of secondary importance. This does not mean, however, that the content side could not be explored if you are interested. We prefer the model to be implementable in R.
A recommended, but still avoidable, part of the internship is the creative-technical work of feature construction (or, alternatively, finding a model that learns the features itself, or, alternatively, estimating the inaccuracy that results from ignoring the set structure). More specifically, in the raw data table each row is one bill (and, in a second table, each row is one medical service), but the object of the model is precisely a set of bills (in the simplified typical case, a pair). The information about the bills and services belonging to a set therefore has to be aggregated ("onto one row") to make the dataset digestible for a machine learning model. This work has been done to some extent, but there is always room for improvement.
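Purely to illustrate the aggregation-to-one-row step and the probability scoring (the topic itself prefers an implementation in R), here is a small Python sketch. The toy tables and all column names (bill_set_id, provider, amount, day_gap, reclaimed) are invented for the example and do not correspond to the actual Haigekassa data.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the raw tables: one row per service on a bill,
# grouped into bill sets. All column names are assumptions for the sketch.
services = pd.DataFrame({
    "bill_set_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "provider":    ["GP", "specialist", "GP", "GP", "A", "B", "A", "A"],
    "amount":      [12.0, 30.0, 12.0, 12.0, 55.0, 55.0, 8.0, 9.0],
    "day_gap":     [0, 2, 0, 1, 0, 3, 0, 0],
})
# One label per bill set: was the payment of at least one service reclaimed?
labels = pd.Series({1: 0, 2: 1, 3: 0, 4: 0}, name="reclaimed")

# Aggregate each set "onto one row" so that it can be fed to a model.
features = services.groupby("bill_set_id").agg(
    n_services=("amount", "size"),
    total_amount=("amount", "sum"),
    max_day_gap=("day_gap", "max"),
    n_providers=("provider", "nunique"),
)

X, y = features, labels.loc[features.index]

# class_weight="balanced" is one simple way to handle ~1% positives;
# on the real data a proper train/validation split and a ranking metric
# (e.g. precision at k) would of course be needed.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)

# The score used to prioritise bill sets for manual review.
scores = pd.Series(model.predict_proba(X)[:, 1], index=X.index,
                   name="reclaim_probability")
print(scores.sort_values(ascending=False))
```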
STACC
Contact persons: Karl-Oskar Masing & Dage Särg
Detailed description of topics
- Cross-alphabet named entity disambiguation
- News text classification in terms of events' location
- Name recognition tasks
- Correcting the ambiguous named entities
- Classification of namesakes
- Improving morphological analysis and named entity recognition
- Training the existing neural models further with new samples
- Improving sentiment analysis
- Text summarization based on keywords
- Detecting hot topics in the media over time
- Assessing model quality over time
- Automatic improvement of corpus annotations
- Extending training data with automatically generated counter examples