- You need a topic and supervisor to pass this course
- Any data-mining-related topic that is complex enough and has a university supervisor will do
- Normally you should choose your BSc or MSc thesis topic
- A young PhD student can choose a topic that brings them closer to their first article.
Kickoff slides
- Sven Laur. General organisation
Natural language processing
General contact person: Mark Fishel
Minimum number of available topics: 5
- Kairit Sirts
- Compressing fine-tuned input embeddings of NLP neural models with a projection matrix
- Neural models for Estonian text analysis
- Interpretable neural text classification models
- Sven Laur
- Phrase similarity measures that are robust to word order (Hele-Andra Kuulmets)
Bioinformatics
General contact person: Jaak Vilo
Minimum number of available topics: 5
- Dmytro Fishman
- Cell Phenotyping using Convolutional Neural Networks
- Astrocytes Segmentation in Brain Microscopy Images
- Using CapsuleNets for Human Tissue Segmentation
- Leopold Parts
- Finding features that predict why a cancer requires a gene for growth
- Analysing CRISPR/Cas9 gRNA libraries
- Analysing CRISPR/Cas9 gene knockout experiments
- Kaur Alasoo
- Analysis of genetic variants regulating gene expression in cis
- The effect of chromatin accessibility on gene expression
- Meta-analysis of trans-eQTLs across cell types and tissues
- Predicting cell-type-specific genetic effects with neural networks
Neuroscience
General contact person: Raul Vicente
Minimum number of available topics: 5
- Tambet Matiisen
- Sequence-aware recommendation system (Maksym Semikin)
Machine learning
General contact person: Meelis Kull
Minimum number of available topics: 3
- Leopold Parts
- Comparing automated inference engines
- Meelis Kull
- Model concatenation for classifier calibration
- Measuring model reliability for better aggregation of probabilistic classifiers
- Combining predictive models for activity recognition in SPHERE
- Activity recognition from accelerometers in SPHERE (Hristijan Sardjoski)
- Obtaining error bars in time-series regression models
- Unsupervised pre-training for context change adaptation
- Fundamentals of unsupervised pattern recognition
Analysis of medical data and personalised medicine
General contact person: Sven Laur
Minimum number of available topics: 3
- Marek Oja
- Recovery of disease treatment cases from EHR data
- Sven Laur
- Analysis of medical procedure logs: fairness and anomalies
- Disease treatment trajectories
- Extraction of stroke related facts from EHR
- Extraction of diagnostic facts from medical imaging descriptions
Analysis of big data
General contact person: Sherif Sakr
Minimum number of available topics: 3
Selected topics
- Automated Selection and Optimization of Distributed Machine Learning Algorithms
- Declarative Querying of Distributed Graphs
- Complex Event Processing Over Event Intervals: The Case of Apache Flink
- Comparative Evaluation for the Performance of Big Stream Processing Systems
- Online Detection of Electrical Vehicle Charging Activity
- Auto Tuning of Flink Jobs: A Machine Learning Approach
- Fast Creation of Training Data using Weak Supervision
- Toward Interpretable Machine Learning Techniques
- Interpretability of automatically extracted machine learning features in medical images
Complete list of Big Data topics and their descriptions
Analysis of software development
- Ezequiel Scott
- Understanding team performance in agile software development
Mining software repositories consists of applying techniques to extract data from software repositories in order to leverage development data. Developers today make intensive use of many kinds of repositories, such as source control and issue tracking repositories (e.g. Bitbucket, GitHub, Jira). These repositories contain a wealth of information that can be extracted, analyzed and explored to study a range of development phenomena. For example, some studies have explored the evolution of projects and the prediction of relevant issues. However, very little attention has been paid to the role of human factors in the data analyzed from software repositories. This is surprising, since human factors are involved in every software development process. The goal of this project is to use repository data about software developers to analyze their relationship with the team and their performance. We will provide a dataset of several software projects, and your task will be to calculate several performance metrics in the context of agile software development. In addition, you will use simple predictive models and/or statistics to describe team performance.
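As a rough illustration of what "calculating performance metrics" could look like, the sketch below computes sprint velocity and a completion ratio from a handful of issue records, then makes a naive moving-average forecast. The record layout (`sprint`, `assignee`, `story_points`, `done`), the sample values, and the choice of metrics are all assumptions made for this example. They are not the course dataset or the metrics the supervisor has in mind.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical issue records: field names and values are illustrative only,
# not taken from the dataset provided in the project.
issues = [
    {"sprint": 1, "assignee": "ann", "story_points": 3, "done": True},
    {"sprint": 1, "assignee": "bob", "story_points": 5, "done": True},
    {"sprint": 1, "assignee": "ann", "story_points": 2, "done": False},
    {"sprint": 2, "assignee": "ann", "story_points": 8, "done": True},
    {"sprint": 2, "assignee": "bob", "story_points": 3, "done": True},
    {"sprint": 3, "assignee": "bob", "story_points": 5, "done": True},
    {"sprint": 3, "assignee": "ann", "story_points": 5, "done": False},
]


def sprint_velocity(records):
    """Sum the story points of completed issues, per sprint."""
    velocity = defaultdict(int)
    for r in records:
        velocity[r["sprint"]] += r["story_points"] if r["done"] else 0
    return dict(sorted(velocity.items()))


def completion_ratio(records):
    """Completed story points divided by committed story points, per sprint."""
    committed = defaultdict(int)
    completed = defaultdict(int)
    for r in records:
        committed[r["sprint"]] += r["story_points"]
        if r["done"]:
            completed[r["sprint"]] += r["story_points"]
    return {s: completed[s] / committed[s] for s in sorted(committed)}


def forecast_next_velocity(velocity, window=2):
    """Naive forecast: mean velocity over the last `window` sprints."""
    recent = list(velocity.values())[-window:]
    return mean(recent)


velocity = sprint_velocity(issues)
ratios = completion_ratio(issues)
forecast = forecast_next_velocity(velocity)
print(velocity, ratios, forecast)
```

A moving average is only one possible stand-in for a "simple predictive model". A linear regression over past sprints, or per-developer breakdowns of the same metrics, would fit the task description equally well.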