Institute of Computer Science
Natural Language Processing (LTAT.01.001)

Natural Language Processing 2021/22 spring

Project

More information in Moodle

Project 1: Question Answering for Estonian

In this project, you will work with an extractive question answering dataset for the Estonian language. The dataset follows the SQuAD v1.1 format and has 776 context-question-answer triplets for training and 892 for testing.
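As a rough illustration of the SQuAD v1.1 layout, the sketch below flattens the nested JSON structure (articles, paragraphs, questions, answers) into context-question-answer rows. The sample record is hand-written for this example and is not taken from EstQA, and the helper name flatten_squad is hypothetical. The copy of the dataset hosted on Hugging Face may expose the fields in a different, already flattened form, so treat this as an assumption about the raw file format only.

```python
# Tiny hand-written sample in the SQuAD v1.1 layout (NOT taken from EstQA).
SAMPLE = {
    "version": "1.1",
    "data": [
        {
            "title": "Tartu",
            "paragraphs": [
                {
                    "context": "Tartu is the second largest city in Estonia.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Which country is Tartu in?",
                            # answer_start is a character offset into the context
                            "answers": [{"text": "Estonia", "answer_start": 36}],
                        }
                    ],
                }
            ],
        }
    ],
}


def flatten_squad(squad):
    """Yield one (context, question, answer_text, answer_start) tuple per answer."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (context, qa["question"],
                           answer["text"], answer["answer_start"])


rows = list(flatten_squad(SAMPLE))
for context, question, text, start in rows:
    # In extractive QA the answer is a span of the context.
    assert context[start:start + len(text)] == text
```

Checking that each answer text can be recovered from its character offset is a cheap sanity test to run before training a span-prediction model.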

The evaluation is the same as for the SQuAD v1.1 dataset and has two main metrics:

  • Exact Match (EM): "[M]easures the percentage of predictions that match any one of the ground truth answers exactly."
  • (Macro-averaged) F1 score: "[M]easures the average overlap between the prediction and ground truth answer. We treat the prediction and ground truth as bags of tokens, and compute their F1. We take the maximum F1 over all of the ground truth answers for a given question, and then average over all of the questions."
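The two metrics can be sketched as follows. The normalization is modelled on the official SQuAD v1.1 evaluation script (lowercasing, removing punctuation and the English articles a/an/the, collapsing whitespace). The function names and the toy predictions are made up for this example, and the official script remains the reference for any reported numbers.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """F1 over the bags of tokens of the prediction and one ground truth."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """predictions: {id: str}; references: {id: [str, ...]}. Returns percentages."""
    em_total = f1_total = 0.0
    for qid, truths in references.items():
        pred = predictions.get(qid, "")  # a missing prediction scores zero
        # Take the best score over all ground-truth answers for the question.
        em_total += max(exact_match(pred, t) for t in truths)
        f1_total += max(token_f1(pred, t) for t in truths)
    n = len(references)
    return {"exact_match": 100 * em_total / n, "f1": 100 * f1_total / n}


# Toy data, invented for illustration only.
predictions = {"q1": "Tartu", "q2": "the city of Tartu"}
references = {"q1": ["Tartu"], "q2": ["Tartu", "Tartu linn"]}
result = evaluate(predictions, references)
```

In the toy data the first prediction matches exactly, while the second only shares the token "tartu" with its references, so it contributes nothing to EM but still earns partial F1 credit.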

Link to the dataset: https://huggingface.co/datasets/anukaver/EstQA

Link to the thesis presenting the dataset: https://digikogu.taltech.ee/et/Download/234d8fd6-108b-446a-847d-008eeb902737

Project 2: Fake Job Postings Detection

In this project, you will work with a dataset of 18K job postings, of which 17,200 are real and 800 are fake. Each posting comes with several fields, such as location, department, company profile, job description, and requirements, among others.

The dataset was split into train, development, and test sets. You will have access to the train and development sets to develop your models. Your goal is to develop a model and run it on the held-out test set, which is provided without labels. After that, we will evaluate your model and post the metrics on the course page.

Three main metrics will be reported:

  • F1-score for the "fake" class (main metric)
  • Micro-averaged F1-score
  • Macro-averaged F1-score
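To make the difference between the three metrics concrete, here is a minimal pure-Python sketch that computes them from lists of gold and predicted labels. The label strings "real" and "fake", the function name f1_report, and the toy labels are assumptions made for this example; the encoding in the actual data files may differ.

```python
def f1_report(y_true, y_pred, positive="fake"):
    """Return the F1 of the positive class plus micro- and macro-averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro-averaging pools the counts over classes before computing F1.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    # Macro-averaging computes F1 per class and takes the unweighted mean.
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return {"f1_fake": per_class.get(positive, 0.0),
            "micro_f1": micro,
            "macro_f1": macro}


# Toy labels, invented for illustration: 8 real and 2 fake postings.
y_true = ["real"] * 8 + ["fake"] * 2
y_pred = ["real"] * 7 + ["fake"] + ["fake", "real"]
report = f1_report(y_true, y_pred)
```

For single-label classification the micro-averaged F1 equals accuracy, so on imbalanced data it is dominated by the majority class. In the toy example it is 0.8 even though the F1 for the "fake" class is only 0.5, which is one reason the "fake" class F1 is a more informative main metric here.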

The main challenges of this task are dealing with imbalanced data and, possibly, designing a model that takes several inputs.
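One common way to handle the imbalance is to weight each class inversely to its frequency in the loss function. The sketch below applies the heuristic weight = N / (K * n_c), where N is the total number of examples, K the number of classes, and n_c the count of class c. The counts are the overall figures quoted above; the real training split will have different counts, and this is only one possible approach, not a prescribed method for the project.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Overall class counts from the task description (assumed here; recompute
# them on the actual training split before use).
counts = {"real": 17200, "fake": 800}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

With these counts, a misclassified fake posting would cost roughly twenty times as much as a misclassified real one. Alternatives such as oversampling the minority class or tuning the decision threshold on the development set are also worth comparing.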

Link to the data: https://moodle.ut.ee/pluginfile.php/2113036/mod_resource/content/1/task2_data.zip

The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.