Project
More information in Moodle
Project 1: Question Answering for Estonian
In this project, you are going to work with an extractive question answering dataset for the Estonian language. The dataset follows the SQuAD v1.1 format and has 776 context-question-answer triplets for training and 892 for testing.
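For orientation, the sketch below flattens a SQuAD v1.1-style dictionary into context-question-answer triplets. The nesting (`data`, `paragraphs`, `qas`, `answers` with `text` and `answer_start`) is the standard SQuAD v1.1 layout. The function name `flatten_squad` and the toy record are illustrative assumptions, and the files actually distributed for this dataset may be organised differently.

```python
def flatten_squad(squad):
    """Yield one dict per (context, question, answer) triplet."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "id": qa["id"],
                        "context": context,
                        "question": qa["question"],
                        "answer_text": answer["text"],
                        # Character offset of the answer inside the context.
                        "answer_start": answer["answer_start"],
                    }


# Toy record made up for illustration; it is NOT taken from the dataset.
CONTEXT = "Tartu Ülikool asutati 1632. aastal."
toy = {
    "data": [{
        "title": "toy",
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [{
                "id": "toy-0",
                "question": "Millal asutati Tartu Ülikool?",
                "answers": [{"text": "1632",
                             "answer_start": CONTEXT.index("1632")}],
            }],
        }],
    }]
}

triplets = list(flatten_squad(toy))
```

Because the task is extractive, each answer should be recoverable as `context[answer_start:answer_start + len(answer_text)]`, which is a useful sanity check after loading the data.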
The evaluation is the same as for the SQuAD v1.1 dataset and has two main metrics:
- Exact Match (EM): "[M]easures the percentage of predictions that match any one of the ground truth answers exactly."
- (Macro-averaged) F1 score: "[M]easures the average overlap between the prediction and ground truth answer. We treat the prediction and ground truth as bags of tokens, and compute their F1. We take the maximum F1 over all of the ground truth answers for a given question, and then average over all of the questions."
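The two metrics above can be sketched in plain Python as follows. The normalisation steps (lowercasing, removing punctuation, removing the English articles "a", "an", "the", collapsing whitespace) mirror the commonly used SQuAD v1.1 evaluation script; whether the course evaluation applies exactly the same normalisation to Estonian text is an assumption. The `evaluate` signature (a list of predicted strings and a list of lists of reference strings) is chosen here for illustration only.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def token_f1(prediction, ground_truth):
    """F1 over bags of tokens, as in the quoted definition."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Take the max over ground truths per question, then average (in %)."""
    em_total = f1_total = 0.0
    for pred, gts in zip(predictions, references):
        em_total += max(exact_match(pred, gt) for gt in gts)
        f1_total += max(token_f1(pred, gt) for gt in gts)
    n = len(predictions)
    return {"exact_match": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}


# Made-up predictions and references, for illustration only.
scores = evaluate(
    ["Tartu", "Tallinnas aastal 1918"],
    [["Tartu"], ["aastal 1918", "1918"]],
)
```

In this toy case the first prediction matches exactly, while the second overlaps with its best reference on two of three predicted tokens, so it contributes 0 to EM and 0.8 to F1.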
Link to the dataset: https://huggingface.co/datasets/anukaver/EstQA
Link to the thesis presenting the dataset: https://digikogu.taltech.ee/et/Download/234d8fd6-108b-446a-847d-008eeb902737
Project 2: Fake Job Postings Detection
In this project, you are going to work with a dataset of 18K job postings, of which 17,200 are real and 800 are fake. Each posting comes with several fields, such as location, department, company profile, job description, and requirements.
The dataset was split into train, development, and test sets. To develop your models, you will have access to the train and development sets. Your goal is to develop a model and run it on the held-out test set, for which the labels are not provided. After that, we will evaluate your model and post the metrics on the course page.
Three main metrics will be reported:
- F1-score for the "fake" class (main metric)
- Micro-averaged F1-score
- Macro-averaged F1-score
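To make the difference between the three metrics concrete, here is a minimal pure-Python sketch. The label encoding (1 for "fake", 0 for "real"), the function names, and the toy labels are assumptions for illustration; the course evaluation code may be implemented differently.

```python
def class_counts(y_true, y_pred, label):
    """Return (tp, fp, fn) for one class treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_report(y_true, y_pred, fake_label=1):
    labels = sorted(set(y_true) | set(y_pred))
    counts = {lab: class_counts(y_true, y_pred, lab) for lab in labels}
    # F1 of the "fake" class only (the main metric).
    fake_f1 = f1_from_counts(*counts.get(fake_label, (0, 0, 0)))
    # Macro: unweighted mean of per-class F1 scores.
    macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    # Micro: pool tp/fp/fn over all classes, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro_f1 = f1_from_counts(tp, fp, fn)
    return {"fake_f1": fake_f1, "micro_f1": micro_f1, "macro_f1": macro_f1}


# Made-up labels: 8 real postings, 2 fake ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
report = f1_report(y_true, y_pred)
```

In this toy example the micro-averaged F1 is 0.8 (for single-label classification it equals accuracy), while the F1 for the "fake" class is only 0.5. This is why the "fake"-class F1 is the more informative number when one class is rare.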
The main challenge of this task is dealing with imbalanced data and, possibly, designing a model that takes several inputs.
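One common option for handling the imbalance (not prescribed by the task) is to weight the loss for each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count). The counts are the full-dataset figures quoted in this description; in practice they would be recomputed on the train split, whose exact class counts are not given here. The function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Map each class to n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Full-dataset counts from the task description (assumed, not train-split counts).
weights = balanced_class_weights({"real": 17200, "fake": 800})
```

With these counts a "fake" posting gets weight 11.25 and a "real" one roughly 0.52, so an error on a fake posting costs about 21.5 times as much as an error on a real one. Resampling the training data or tuning the decision threshold on the development set are alternatives that can be compared with this approach.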
Link to the data: https://moodle.ut.ee/pluginfile.php/2113036/mod_resource/content/1/task2_data.zip