Project ideas
Word Vectors for Estonian Language
Train word vectors for Estonian and show that "case vectors" (käändevektorid) exist, e.g. koerale - koer + kass = kassile. You can use a combined corpus of Estonian text together with word2vec or GloVe. For the later analysis you might find the Gensim toolkit useful.
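A minimal sketch of the analogy check with Gensim (4.x API; the corpus file name is a placeholder and one tokenized sentence per line is assumed):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Placeholder corpus file: one tokenized Estonian sentence per line.
    sentences = LineSentence("corpus_et.txt")
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=10, workers=4)

    # "Case vector" analogy: koerale - koer + kass should land near kassile.
    print(model.wv.most_similar(positive=["koerale", "kass"], negative=["koer"], topn=5))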
Supervisor: Tambet Matiisen
Difficulty: simple
Stability Analysis of Word Vectors for Estonian Language
The goal of this project is to verify whether there is enough data to train stable word vectors for Estonian.
- Train word vectors on the full dataset and treat the result as the standard model.
- Train with only 10%, 20%, ..., 90% of the data and find a linear transformation from each trained model to the standard model (see the sketch after this list).
- Report the normalized MSE for each subset.
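A minimal sketch of the transformation and error computation, assuming both models are Gensim (4.x) Word2Vec models of the same dimensionality compared on their shared vocabulary; dividing by the variance of the target vectors is just one reasonable choice of normalization:

    import numpy as np

    def normalized_mse(subset_model, standard_model):
        # Compare only on the vocabulary common to both models.
        common = [w for w in subset_model.wv.index_to_key
                  if w in standard_model.wv.key_to_index]
        X = np.array([subset_model.wv[w] for w in common])    # subset-model vectors
        Y = np.array([standard_model.wv[w] for w in common])  # standard-model vectors

        # Least-squares linear map W minimizing ||XW - Y||^2.
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)

        residual = X @ W - Y
        return (residual ** 2).mean() / Y.var()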
This helps to diagnose the cause if the word vectors turn out not to be good:
- If stability is achieved early, then word2vec itself is probably not a good fit for modeling Estonian.
- If stability is not achieved, or is barely achieved only very late, then we might not have enough training data.
Supervisor: Sven Laur
Difficulty: simple
Replicate examples in Andrej Karpathy's blog post for Estonian language
Andrej Karpathy wrote a wonderful blog post about recurrent neural networks and how to use them to generate text. He trained models that generate
- Paul Graham blog posts (he's a famous startup guy),
- Shakespeare plays,
- Wikipedia articles,
- LaTeX scientific articles (with drawings!),
- Linux source code,
- baby names.
Your task will be to create similar models for Estonian, for example
- speeches of Lennart Meri, Toomas Hendrik Ilves or Edgar Savisaar,
- writings of Andrus Kivirähk, Kaur Kender or nihilist.fm,
- Estonian Vikipeedia articles,
- Estonian newspapers: Postimees, SL Õhtuleht or Eesti Ekspress,
- Estonian PhD theses,
- Estonian baby names.
Basically, produce a generative model for any big enough dataset you can find. Each of these is a separate project, and the list here is not exhaustive; you can propose your own. If you want to use something from the web, keep in mind that you need to scrape and clean it first, which can take quite some time. Share responsibilities with team members!
You can use Andrej Karpathy's original Lua code or his 100-line Python implementation. But my suggestion would be to make friends with an existing deep learning toolkit. For example, Keras is simple, full-featured and includes a text generation example.
NB! This is character-level prediction, not word-level prediction. Character-level prediction is supposedly more suitable for Estonian.
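A minimal character-level sketch in Keras, in the spirit of its text generation example (the corpus file name is a placeholder):

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Placeholder corpus file with the raw text to model.
    text = open("corpus_et.txt", encoding="utf-8").read()
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}

    # Cut the text into overlapping windows of maxlen characters,
    # each paired with the character that follows it.
    maxlen, step = 40, 3
    windows = [text[i:i + maxlen] for i in range(0, len(text) - maxlen, step)]
    targets = [text[i + maxlen] for i in range(0, len(text) - maxlen, step)]

    x = np.zeros((len(windows), maxlen, len(chars)), dtype="float32")
    y = np.zeros((len(windows), len(chars)), dtype="float32")
    for i, window in enumerate(windows):
        for t, c in enumerate(window):
            x[i, t, char_to_idx[c]] = 1.0
        y[i, char_to_idx[targets[i]]] = 1.0

    model = keras.Sequential([
        keras.Input(shape=(maxlen, len(chars))),
        layers.LSTM(128),
        layers.Dense(len(chars), activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    model.fit(x, y, batch_size=128, epochs=10)

Text is then generated by repeatedly sampling the next character from the softmax output and appending it to the input window.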
Lists of other Estonian corpora:
Supervisor: Tambet Matiisen / Tanel Pärnamaa
Difficulty: simple
Sentiment classification
Classify the sentiment of an article/comment using a simple bag-of-words approach with a neural network on top. You can use existing datasets:
- Teachers.ee portal comments and ratings
- Hinnavaatlus.ee portal comments and ratings
- Postimees.ee opinionated articles
- Segmented versions of the corpora?
Implementation-wise, you can follow an example from Keras.
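A minimal bag-of-words sketch, with scikit-learn used only for vectorization (the two example comments and labels are placeholders for a real dataset):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from tensorflow import keras
    from tensorflow.keras import layers

    # Placeholder data: replace with real comments and 0/1 sentiment labels.
    texts = ["väga hea toode, soovitan", "halb kogemus, ei soovita"]
    labels = [1, 0]

    vectorizer = CountVectorizer(max_features=10000, binary=True)
    X = vectorizer.fit_transform(texts).toarray().astype("float32")
    y = np.array(labels, dtype="float32")

    model = keras.Sequential([
        keras.Input(shape=(X.shape[1],)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(X, y, batch_size=32, epochs=5)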
Supervisor: Tanel Pärnamaa
Difficulty: simple
Sentiment classification using recurrent neural network
Classify the sentiment of an article/comment using a recurrent neural network. You can use the same datasets as in the previous project. Implementation-wise, you can follow an example from Keras.
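A minimal recurrent sketch, assuming the comments have already been tokenized into padded integer sequences (all sizes are placeholders):

    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size = 20000   # placeholder vocabulary size
    maxlen = 200         # placeholder comment length after padding/truncation

    model = keras.Sequential([
        keras.Input(shape=(maxlen,)),
        layers.Embedding(vocab_size, 128),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    # model.fit(padded_token_ids, labels, ...) once the sequences are prepared.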
Supervisor: Tambet Matiisen
Difficulty: intermediate
Named Entity Recognition
Perform named entity recognition on the Estonian NER dataset.
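One common way to frame this (an assumption here, not a requirement of the project) is token-level classification with a bidirectional LSTM; a minimal sketch, assuming sentences have been converted to padded word-index sequences with IOB tag indices as targets:

    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size = 30000   # placeholder vocabulary size
    n_tags = 9           # placeholder number of IOB tags (B-PER, I-PER, ..., O)

    model = keras.Sequential([
        keras.Input(shape=(None,)),
        layers.Embedding(vocab_size, 100, mask_zero=True),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    # model.fit(padded_word_ids, padded_tag_ids, ...) once the dataset is prepared.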
Supervisor: Tanel Pärnamaa
Difficulty: simple
Machine translation
Create a sequence-to-sequence network to translate between English and Estonian. Datasets you can use:
Implementation-wise, some hints to get started: Keras, Blocks or Groundhog.
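A minimal encoder-decoder sketch in Keras, in the spirit of its sequence-to-sequence example (token counts are placeholders; one-hot character inputs are assumed for simplicity):

    from tensorflow import keras
    from tensorflow.keras import layers

    num_encoder_tokens = 100  # placeholder: size of the English character set
    num_decoder_tokens = 120  # placeholder: size of the Estonian character set
    latent_dim = 256

    # Encoder: read the source sequence, keep only the final LSTM states.
    encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
    _, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

    # Decoder: generate the target sequence conditioned on the encoder states.
    decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
    decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                         initial_state=[state_h, state_c])
    decoder_outputs = layers.Dense(num_decoder_tokens,
                                   activation="softmax")(decoder_outputs)

    model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")

During training the decoder input is the target sequence shifted by one step (teacher forcing); at inference time a separate decoding loop feeds the model its own predictions.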
Supervisor: Tambet Matiisen / Tanel Pärnamaa
Difficulty: hard
Kaggle competition: Is your model smarter than an 8th grader?
You can participate in the Kaggle competition proposed by the Allen Institute for Artificial Intelligence (AI2), "Is your model smarter than an 8th grader?". The training data for this project consists of 2,500 multiple-choice questions from a typical US 8th-grade science curriculum. Each question has four possible answers, of which exactly one is correct.
Supervisor: Ilya Kuzovkin
Difficulty: hard
StackOverflow question answering
Wouldn't it be nice if you could use StackOverflow even when offline? In this project you train a model to answer coding questions using a StackOverflow dataset. As these questions tend to be very specific, it might be easier to generate just the first sentence of the answer from the question title, or perhaps only the tags from the title. You could also apply the model to other StackExchange sites, which might have more general questions.
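For the tags-from-title variant, one simple framing (an assumption here, not part of the project description) is multi-label classification over the most frequent tags:

    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size = 50000   # placeholder: size of the title vocabulary
    n_tags = 1000        # placeholder: number of most frequent tags to predict

    model = keras.Sequential([
        keras.Input(shape=(None,)),
        layers.Embedding(vocab_size, 128),
        layers.GlobalAveragePooling1D(),
        layers.Dense(n_tags, activation="sigmoid"),  # independent probability per tag
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam")
    # model.fit(title_token_ids, tag_indicator_matrix, ...) once the dataset is prepared.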
Supervisor: Tambet Matiisen
Difficulty: hard