Institute of Computer Science

Data engineering for Conversion Master's (LTAT.02.026), 2024/25 spring

Homework

HW4 (due 23.05)

Apache Spark DataFrames and SQL with Yelp Dataset.

Dive into Big Data analysis using Apache Spark DataFrames and SQL on the Yelp dataset.

Process and manipulate data in parallel, learning to load Yelp tables as DataFrames, extract user statistics, scrutinize businesses, and generate pivot tables with the Spark DataFrame API and Spark SQL.

Ensure a functional Spark environment and submit Python scripts and outputs as deliverables.

Guide: BigDataLab

HW3 (due 24.04)

ETL Processes with Docker, Superset, and Data Visualization

Objective: The goal is to use a dockerised Apache Superset for dashboard creation.

Task 1: Docker and Apache Superset Setup Review

Explain the following command: docker run -d -v ${PWD}:/data:rw -p 8080:8088 -e "SUPERSET_SECRET_KEY=<your_new_secret_key>" --name superset my/superset:duckdb

The explanation should cover the following:

  • What does each part of the command do?
  • For -v, what is the Windows equivalent of ${PWD}?

If you participated in the lecture, you have Superset already set up. Explain the steps needed to start working with it again:

  • in case the container already exists.
  • in case the container has been deleted.
  • in case you want to change the mounted folder.

Task 2: Data Analysis and Visualization in Superset

Make sure you have both datasets available in the Superset instance:

  • based on http://airviro.klab.ee.
  • based on https://www.ilmateenistus.ee/kliima/ajaloolised-ilmaandmed/.

Refer to L04; the data transformation scripts can be found at https://github.com/adlerpriit/DECM_WS_Docker_ETL_AS.

Let's delve further into explaining dust concentration in the air:

  • Using Superset, create a chart (pivot table) that displays dust particle concentration (PM10) per weekday (rows) and time of the day (24-hour format; columns).
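
To make the shape of the requested pivot concrete, here is a minimal sketch in plain Python of the grouping that a weekday-by-hour pivot performs. The readings are made-up values, not data from airviro.klab.ee, and averaging is an assumption: the task does not say which aggregate to use. In Superset itself the same grouping would be configured in the chart settings rather than written by hand.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical hourly readings: (timestamp, PM10 value). Not real measurements.
readings = [
    ("2024-03-04 08:00", 12.0),  # Monday
    ("2024-03-04 09:00", 18.0),  # Monday
    ("2024-03-11 08:00", 16.0),  # Monday, one week later
    ("2024-03-05 08:00", 10.0),  # Tuesday
]

def pivot_pm10(rows):
    """Return {weekday: {hour: mean PM10}} from (timestamp, value) pairs."""
    cells = defaultdict(list)
    for stamp, pm10 in rows:
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # One cell per (weekday, hour of day) combination.
        cells[(WEEKDAYS[ts.weekday()], ts.hour)].append(pm10)
    table = defaultdict(dict)
    for (weekday, hour), values in cells.items():
        # Assumption: aggregate each cell with the arithmetic mean.
        table[weekday][hour] = mean(values)
    return {day: dict(hours) for day, hours in table.items()}

pivot = pivot_pm10(readings)
```

Each outer key corresponds to a row of the pivot table (weekday) and each inner key to a column (hour, 0 to 23). The two Monday 08:00 readings fall into the same cell and are averaged.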

Task 3: Dashboard Creation in Superset

Create a new dashboard in Superset in which you will need to include the following:

  • The chart you created in Task 2
  • Two more charts of your choice, one of them based on ilmateenistus data (try to vary the chart types, e.g. a line plot and a box plot)
  • A markdown element containing:
    • Title
    • Brief explanation of the dashboard
    • Your credentials

Submission: a Markdown document for Task 1 and the dashboard downloaded as PNG or JPG for Task 3.

HW2 (due 27.03)

Practical Application of Database Design Principles

Objective: The goal is to design, implement, and query a straightforward database system.

Task 1: Reading Assignment

Instructions: Explore the online book titled "Database Design". Specifically, focus on Chapters 3 & 13, which discuss the "Database Development Process". After reading, fill out the "Database Development & Characteristics Concept Test" in Moodle.

Task 2: Design a Database Schema

Scenario: Your task is to create a database design for a small-scale library system. This system requires tracking information related to books, authors, library patrons, and book loans.

Requirements:

  • Construct an Entity-Relationship (ER) diagram to represent the library system. Your diagram should include entities for books, authors, patrons, and loans.
  • Compose a concise description of your design, detailing the relationships between the entities.

Task 3: Implement the Database

Tools: Choose a relational database management system (RDBMS) you're comfortable with (options include SQLite, MySQL, PostgreSQL).

Steps:

  • Construct the database and its tables in line with your design from Task 2.
  • Populate each table with sample data, ensuring at least 5 entries per table. (Feel free to use ChatGPT or Gemini for schema-based random data generation.)
  • Include SQL scripts or a link to an SQL file that details your schema creation and data insertion operations.
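
As an illustration of the steps above, the following is a minimal sketch using SQLite through Python's standard sqlite3 module. All table and column names are hypothetical and only one of many reasonable designs; the junction table book_authors assumes that a book may have several authors. Only two sample rows are inserted to show the pattern, whereas the task asks for at least 5 per table.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not prescribed by the task.
SCHEMA = """
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE books (
    book_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
-- Junction table: a book can have several authors and vice versa.
CREATE TABLE book_authors (
    book_id   INTEGER NOT NULL REFERENCES books(book_id),
    author_id INTEGER NOT NULL REFERENCES authors(author_id),
    PRIMARY KEY (book_id, author_id)
);
CREATE TABLE patrons (
    patron_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE loans (
    loan_id     INTEGER PRIMARY KEY,
    book_id     INTEGER NOT NULL REFERENCES books(book_id),
    patron_id   INTEGER NOT NULL REFERENCES patrons(patron_id),
    due_date    TEXT NOT NULL,   -- ISO 8601 date, e.g. 2025-03-01
    returned_on TEXT             -- NULL while the book is still out
);
"""

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Parameterised bulk insert; the rows are made-up sample data.
conn.executemany(
    "INSERT INTO authors (author_id, name) VALUES (?, ?)",
    [(1, "Author A"), (2, "Author B")],
)
conn.commit()

author_count = conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
```

Storing dates as ISO 8601 text keeps them comparable with ordinary string comparison in SQLite; MySQL and PostgreSQL would more naturally use a DATE column.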

Task 4: Querying the Database

Objective: Develop SQL queries to execute the following tasks:

  • Produce a list of all books authored by a specified individual.
  • Identify all patrons who have overdue book loans.
  • Enumerate books that have not been borrowed.

Submission Requirements: Furnish the SQL queries alongside a brief explanation of the expected output for each query.
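
The three queries listed above can be sketched as follows, again with SQLite via Python's sqlite3 module. The compact schema, the sample rows, and the function names are all hypothetical. "Overdue" is assumed to mean a loan that has not been returned and whose due date has passed; the reference date is passed in as a parameter so the result does not depend on when the code runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Compact, hypothetical schema and sample rows; names are illustrative only.
conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);
CREATE TABLE patrons (patron_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, book_id INTEGER,
                    patron_id INTEGER, due_date TEXT, returned_on TEXT);
INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO books VALUES (1, 'Book One'), (2, 'Book Two'), (3, 'Book Three');
INSERT INTO book_authors VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO patrons VALUES (1, 'Patron X'), (2, 'Patron Y');
INSERT INTO loans VALUES (1, 1, 1, '2025-01-10', NULL),
                         (2, 2, 2, '2025-01-10', '2025-01-05');
""")

def books_by_author(author_name):
    """Titles of all books written by the named author."""
    sql = """SELECT b.title FROM books b
             JOIN book_authors ba ON ba.book_id = b.book_id
             JOIN authors a ON a.author_id = ba.author_id
             WHERE a.name = ? ORDER BY b.title"""
    return [row[0] for row in conn.execute(sql, (author_name,))]

def patrons_with_overdue_loans(today):
    """Patrons holding a loan that is unreturned and due before `today`."""
    sql = """SELECT DISTINCT p.name FROM patrons p
             JOIN loans l ON l.patron_id = p.patron_id
             WHERE l.returned_on IS NULL AND l.due_date < ?
             ORDER BY p.name"""
    return [row[0] for row in conn.execute(sql, (today,))]

def never_borrowed_books():
    """Books with no matching row in loans (anti-join via LEFT JOIN)."""
    sql = """SELECT b.title FROM books b
             LEFT JOIN loans l ON l.book_id = b.book_id
             WHERE l.loan_id IS NULL ORDER BY b.title"""
    return [row[0] for row in conn.execute(sql)]
```

With this sample data, Patron X holds an unreturned loan that was due on 2025-01-10, Patron Y returned their book before the due date, and Book Three has never been lent out.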

Deliverables: Your homework submission should consist of your ER diagram (pdf or png), SQL scripts for database creation and data entry, your SQL queries with expected outputs, and any accompanying explanations.

HW1 (due 27.02)

Data Source Exploration for Group Project

In this homework assignment, you will find a publicly available and reliable data source for a group project. After selecting a suitable data source, provide a concise overview of its key attributes, including:

  • Dataset name and a brief description
  • Purpose of the data and its potential use in a group project
  • Type of data (e.g., tabular, time series, geospatial) and data types (e.g., numerical, categorical, text)
  • Update frequency and historical data availability
  • Data ownership, licensing, and attribution requirements
  • Privacy, ethical concerns, and necessary steps to address them
  • Accessibility (e.g., direct download, API) and any API usage information
  • Data size, scalability, and quality considerations
  • Preprocessing and cleaning tasks required before analysis

Please do not upload the data itself. The preferred submission format is markdown. You can write your homework at https://hackmd.io/ (requires login).

The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.