Data Mining Research Seminar (MTAT.03.277), autumn 2015/16

Project ideas

Word Vectors for Estonian Language

Train word vectors for the Estonian language and show that "case vectors" (käändevektorid) exist, e.g. koerale - koer + kass = kassile ("to the dog" - "dog" + "cat" = "to the cat"). You can use the combined corpus for Estonian together with word2vec or GloVe. For later analysis you might find the Gensim toolkit useful.
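
The analogy test above can be sketched in a few lines. This is only an illustration under assumptions: the three-dimensional vectors are hand-made toy values, not trained embeddings, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c],
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (an assumption, NOT trained):
# dimension 0 ~ "dog", dimension 1 ~ "cat", dimension 2 ~ allative case ("to ...").
toy_vectors = {
    "koer":    [1.0, 0.0, 0.0],
    "koerale": [1.0, 0.0, 1.0],
    "kass":    [0.0, 1.0, 0.0],
    "kassile": [0.0, 1.0, 1.0],
    "maja":    [0.3, 0.3, 0.0],
    "majale":  [0.3, 0.3, 1.0],
}

result = analogy(toy_vectors, "koerale", "koer", "kass")
print(result)  # with these toy vectors: kassile
```

With real vectors loaded into Gensim, the corresponding query would typically be `most_similar(positive=["koerale", "kass"], negative=["koer"])` on the loaded word vectors.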

Supervisor: Tambet Matiisen
Difficulty: simple

Stability Analysis of Word Vectors for Estonian Language

The goal of this project is to verify whether there is enough data to train stable word vectors for the Estonian language.

  1. Train word vectors on the full dataset and treat the result as the standard model.
  2. Train on only 10%, 20%, ..., 90% of the data and, for each subset, find a linear transformation from the subset model to the standard model.
  3. Report the normalized MSE of the transformed vectors for each subset.
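
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under assumptions, not a prescribed method: the vectors are synthetic NumPy arrays rather than trained embeddings, the rows of both matrices are assumed to correspond to the same words, the transformation is fitted by ordinary least squares, and "normalized" is taken to mean dividing the MSE by the variance of the standard vectors (the normalization is not defined here). The function name `normalized_mse` is hypothetical.

```python
import numpy as np

def normalized_mse(subset_vecs, standard_vecs):
    """Fit W minimizing ||subset_vecs @ W - standard_vecs||^2 and return the
    residual MSE divided by the variance of standard_vecs (assumed normalization)."""
    W, _, _, _ = np.linalg.lstsq(subset_vecs, standard_vecs, rcond=None)
    residual = standard_vecs - subset_vecs @ W
    mse = np.mean(residual ** 2)
    baseline = np.mean((standard_vecs - standard_vecs.mean(axis=0)) ** 2)
    return mse / baseline

rng = np.random.default_rng(0)

# Synthetic "standard model": 200 words, 5 dimensions.
standard = rng.normal(size=(200, 5))

# A "stable" subset model: the same vectors up to a rotation.
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
stable = standard @ rotation

# An "unstable" subset model: rotated vectors plus heavy noise.
noisy = stable + rng.normal(scale=1.0, size=stable.shape)

nmse_exact = normalized_mse(stable, standard)
nmse_noisy = normalized_mse(noisy, standard)
print(nmse_exact, nmse_noisy)
```

A value near 0 means the subset model is a linear image of the standard model; values approaching 1 mean the transformation explains almost nothing.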

This helps to diagnose the cause if the word vectors are not good:

  • If stability is achieved early (with a small share of the data), more data is unlikely to help, so word2vec itself is probably not well suited to modeling the Estonian language.
  • If stability is not achieved, or is achieved only with nearly all of the data, then we might not have enough training data.

Supervisor: Sven Laur
Difficulty: simple

Replicate the examples in Andrej Karpathy's blog post for the Estonian language

Andrej Karpathy wrote a wonderful blog post about recurrent neural networks and how to use them to generate text. He trained models for generating:

  • Paul Graham blog posts (he's a famous startup guy),
  • Shakespeare plays,
  • Wikipedia articles,
  • LaTeX scientific articles (with drawings!),
  • Linux source code,
  • baby names.

Your task is to create similar models for the Estonian language, for example:

  • speeches of Lennart Meri, Toomas Hendrik Ilves or Edgar Savisaar,
  • writings of Andrus Kivirähk, Kaur Kender or nihilist.fm,
  • Estonian Vikipeedia articles,
  • Estonian newspapers: Postimees, SL Õhtuleht or Eesti Ekspress,
  • Estonian PhD theses,
  • Estonian baby names.

Basically, produce a generative model for any sufficiently large dataset you can find. Each of these is a separate project, and the list is not exhaustive: you can propose your own. If you want to use something from the web, keep in mind that you need to scrape and clean it first, which can take quite some time. Share responsibilities with team members!

You can use Andrej Karpathy's original Lua code or his 100-line Python implementation, but my suggestion would be to get familiar with an existing deep learning toolkit. For example, Keras is simple, full-featured and includes a text generation example.

NB! This is character-level prediction, not word-level prediction. Character-level prediction is supposedly more suitable for the Estonian language, presumably because its rich inflection produces a very large number of distinct word forms.
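
To make the character-level setup concrete, here is a small sketch of how training examples are commonly prepared: a window slides over the text, and each window of characters is paired with the character that follows it. The function name `make_char_examples`, the window size and the sample string are assumptions chosen for illustration; a real project would feed such pairs to a recurrent network.

```python
def make_char_examples(text, window=4, step=1):
    """Turn text into (context, next character) pairs encoded as integer indices.

    Returns the list of examples and the sorted character vocabulary.
    """
    chars = sorted(set(text))
    char_to_index = {ch: i for i, ch in enumerate(chars)}
    examples = []
    for start in range(0, len(text) - window, step):
        context = text[start:start + window]   # input characters
        next_char = text[start + window]       # character to predict
        encoded = [char_to_index[ch] for ch in context]
        examples.append((encoded, char_to_index[next_char]))
    return examples, chars

# Tiny made-up sample text ("welcome" in Estonian).
examples, chars = make_char_examples("tere tulemast", window=4)

# Decode the first pair back to characters to check the encoding.
first_context, first_target = examples[0]
print("".join(chars[i] for i in first_context), "->", repr(chars[first_target]))
```

The vocabulary here is the set of distinct characters, so it stays small no matter how many inflected word forms the corpus contains.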

Lists of other Estonian corpora:

  • Korpused (Corpora)
  • Keelekogud (Language collections)
  • Tekstikorpused (Text corpora)
  • Metashare

Supervisor: Tambet Matiisen / Tanel Pärnamaa
Difficulty: simple

Sentiment classification

Classify the sentiment of an article or comment using a simple bag-of-words representation with a neural network on top. You can use existing datasets:

  • Teachers.ee portal comments and ratings
  • Hinnavaatlus.ee portal comments and ratings
  • Postimees.ee opinionated articles
  • Segmented versions of the corpora?

Implementation-wise, you can start from a Keras example.
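
As a rough sketch of the bag-of-words step, each text can be turned into a fixed-length vector of word counts, which then serves as input to the neural network. The helper names `build_vocabulary` and `bag_of_words` are hypothetical, the whitespace tokenization is a simplifying assumption, and the example comments are made up rather than taken from the datasets.

```python
from collections import Counter

def build_vocabulary(documents, max_words=1000):
    """Map the most frequent lowercased words to column indices."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words))}

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector

# Made-up comments: "very good teacher", "very bad service".
comments = ["väga hea õpetaja", "väga halb teenus"]
vocabulary = build_vocabulary(comments)

# "väga väga hea tundmatu": "tundmatu" (unknown) is not in the vocabulary.
vector = bag_of_words("väga väga hea tundmatu", vocabulary)
print(vector)
```

Word order is discarded in this representation, which is the main thing a recurrent model would add.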

Supervisor: Tanel Pärnamaa
Difficulty: simple

Sentiment classification using a recurrent neural network

Classify the sentiment of an article or comment using a recurrent neural network. You can use the same datasets as in the previous project. Implementation-wise, you can start from a Keras example.

Supervisor: Tambet Matiisen
Difficulty: intermediate

Named Entity Recognition

Perform named entity recognition on the Estonian NER dataset.

Supervisor: Tanel Pärnamaa
Difficulty: simple

Machine translation

Create a sequence-to-sequence translation network to translate between English and Estonian. Datasets you can use:

  • Movie subtitles
  • European Parliament speeches
  • European Union laws

Implementation-wise, some starting points are Keras, Blocks or Groundhog.

Supervisor: Tambet Matiisen / Tanel Pärnamaa
Difficulty: hard

Kaggle competition: Is your model smarter than an 8th grader?

You can participate in the Kaggle competition proposed by the Allen Institute for Artificial Intelligence (AI2), "Is your model smarter than an 8th grader?". The training data for this project consists of 2,500 multiple-choice questions from a typical US 8th grade science curriculum. Each question has four possible answers, of which exactly one is correct.

Supervisor: Ilya Kuzovkin
Difficulty: hard

StackOverflow question answering

Wouldn't it be nice if you could use StackOverflow even when offline? In this project you train a model to answer coding questions using the StackOverflow dataset. As these questions tend to be very specific, it might be easier to generate the first sentence of the answer from the question title, or maybe just to generate tags from the title. You also have the option to apply this model to other StackExchange sites, which might have more general questions.

Supervisor: Tambet Matiisen
Difficulty: hard
