Institute of Computer Science, University of Tartu
Data engineering for Conversion Master's (LTAT.02.026)

Data engineering for Conversion Master's 2025/26 spring


Homework

Homework submissions are made in the Moodle environment.

HW3 (due 26.04)

ETL Processes, Superset, and Data Visualization

Objective: start the Lecture 4 environment, run the ETL workflow, connect Superset to the warehouse, and create visualizations based on the course datasets.

Task 1: Lecture 4 Environment and Superset Startup Review

Start the Lecture 4 environment from the repository root inside the devcontainer:

make init
make up-superset
make devcontainer-join-course-network
make ps

In your own words, explain what each command does and why it is needed.

Your explanation should cover:

  • what each command does;
  • which services are expected to exist for Lecture 4 (postgres, superset, superset-redis, and the one-time superset-init);
  • why make devcontainer-join-course-network is needed before running ETL from the devcontainer;
  • how make and Docker Compose work together in this repository.

Also explain how you would start working again in each of these cases:

  • the Superset stack already exists but is stopped;
  • the containers have been removed;
  • you changed .env values or reopened the repository from a different host folder.

Task 2: Data Analysis and Visualization in Superset

Make sure the current Lecture 4 datasets are available in your Superset instance.

Run the ETL and validation flow from the repository root inside the devcontainer:

make etl-bootstrap
.venv/bin/python -m etl.airviro.cli run --from 2026-01-01 --to 2026-03-26 --source-key air_quality_station_8 --source-key pollen_station_25 --verbose
make warehouse-status

For this homework, keep the date range inside 2026. The example date range above was validated during course preparation. Do not use wide historical backfills here, because older API responses may contain timestamp inconsistencies.

Then connect Superset to the warehouse and use the current serving views:

  • l4_mart.v_air_quality_hourly_station_8
  • l4_mart.v_pollen_daily_station_25
  • l4_mart.v_airviro_measurements_long (optional helper dataset)

Warehouse connection values:

  • host: postgres
  • port: 5432
  • database: warehouse
  • username: warehouse
  • password: warehouse
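In Superset's database connection form these values are combined into a single SQLAlchemy-style URI. A minimal sketch of how that URI is assembled from the values above (`quote_plus` is only a safeguard for special characters in credentials):

```python
from urllib.parse import quote_plus

# Warehouse connection values from the list above.
user, password = "warehouse", "warehouse"
host, port, database = "postgres", 5432, "warehouse"

# Superset accepts a SQLAlchemy-style URI for PostgreSQL databases.
uri = f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"
print(uri)  # postgresql://warehouse:warehouse@postgres:5432/warehouse
```

Note that `host: postgres` is the Docker Compose service name, which resolves only for containers attached to the course network.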

Create a chart in Superset that shows dust particle concentration (PM10) by:

  • rows: weekday;
  • columns: hour of day (0-23);
  • metric: average pm10.

Recommended dataset:

  • l4_mart.v_air_quality_hourly_station_8

Recommended fields:

  • weekday: day_short
  • hour: hour_number
  • metric: AVG(pm10)

day_short is intentionally space-padded in the datetime dimension so weekday labels sort chronologically in many Superset charts.
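To see why space-padding makes weekday labels sort chronologically under plain string ordering, here is a small illustration (this mimics the idea; it is an assumption about how day_short is actually constructed in the view):

```python
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Pad with a decreasing number of leading spaces: Monday gets the most.
# Since a space sorts before any letter, lexicographic (alphabetical)
# ordering of the padded labels reproduces chronological order.
padded = [" " * (len(days) - i) + day for i, day in enumerate(days)]
print(sorted(padded) == padded)  # True
```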

Task 3: Dashboard Creation in Superset

Create a new dashboard in Superset that includes:

  • the chart you created in Task 2;
  • two more charts of your choice;
  • at least one chart based on l4_mart.v_pollen_daily_station_25;
  • at least one chart based on air-quality indicators from l4_mart.v_air_quality_hourly_station_8 such as ws10, wd10, pm2_5, pm10, o3, no2, or so2;
  • a Markdown element containing a title, a brief explanation of the dashboard, and your name.

Do not rely on temp for this homework: during the lecture period, the source API returned unreliable temperature values for station 8. The homework still targets station 8, because the current Lecture 4 simple ETL and serving views are intentionally built around that station. If you or your agent buddy feel confident, you can choose another air quality station (see ohuseire.ee).

Try to keep the visuals different from each other, for example a pivot table, a line chart, and a box plot or bar chart.

Submission

Submit:

  • one Markdown document for Task 1;
  • one screenshot of the Task 2 chart;
  • one screenshot or exported image of the final dashboard from Task 3.

HW2 (due 26.03)

Practical Application of Database Design Principles

Objective: Design, implement, and query a straightforward database system.

Task 1: Reading Assignment

Instructions: Explore the online book titled "Database Design". Specifically, focus on Chapters 3 & 13, which discuss the "Database Development Process". After reading, fill out the "Database Development & Characteristics Concept Test" in Moodle.

Task 2: Design a Database Schema

Scenario: Your task is to create a database design for a small-scale library system. This system requires tracking information related to books, authors, library patrons, and book loans. Requirements:

  • Construct an Entity-Relationship (ER) diagram to represent the library system. Your diagram should include entities for books, authors, patrons, and loans.
    • Keep in mind possible many-to-many relationships.
    • You can add more tables to complete your design.
  • Compose a concise description of your design, detailing the relationships between the entities.
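As a hedged sketch of resolving the books-authors many-to-many with a junction table (all table, column, and author names here are illustrative, not the required design), in SQLite via Python:

```python
import sqlite3

# A junction table (book_author) resolves the many-to-many relationship:
# each row links one book to one author.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book   (book_id   INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE book_author (
    book_id   INTEGER REFERENCES book(book_id),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)
);
INSERT INTO author VALUES (1, 'Alice Author'), (2, 'Bob Writer');
INSERT INTO book   VALUES (1, 'Database Design'), (2, 'Another Title');
INSERT INTO book_author VALUES (1, 1), (1, 2), (2, 2);
""")

# Books authored by a specified individual (compare Task 4 below).
rows = cur.execute("""
    SELECT b.title
    FROM book b
    JOIN book_author ba ON ba.book_id = b.book_id
    JOIN author a       ON a.author_id = ba.author_id
    WHERE a.name = 'Bob Writer'
    ORDER BY b.title
""").fetchall()
print(rows)  # [('Another Title',), ('Database Design',)]
```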

Task 3: Implement the Database

Tools: Choose a relational database management system (RDBMS) you're comfortable with (options include SQLite, MySQL, PostgreSQL, DuckDB).

Steps:

  • Construct the database and its tables in line with your design from Task 2.
  • Populate each table with sample data, ensuring at least 5 entries per main entity table. (Feel free to use ChatGPT or Gemini for schema-based random data generation)
  • Include SQL scripts or a link to an SQL file that details your schema creation and data insertion operations.

Task 4: Querying the Database

Objective: Develop SQL queries to execute the following tasks:

  • Produce a list of all books authored by a specified individual.
  • Identify all patrons who have overdue book loans.
  • Enumerate books that have never been borrowed.

Submission Requirements: Furnish the SQL queries alongside a brief explanation of the expected output for each query.
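A minimal sketch of the "never borrowed" query pattern against a toy schema (table and column names are illustrative): an anti-join via LEFT JOIN ... IS NULL.

```python
import sqlite3

# Toy schema: a book with no matching loan row has never been borrowed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE loan (
    loan_id       INTEGER PRIMARY KEY,
    book_id       INTEGER REFERENCES book(book_id),
    due_date      TEXT,
    returned_date TEXT
);
INSERT INTO book VALUES (1, 'SQL Basics'), (2, 'Data Modeling'), (3, 'ETL in Practice');
INSERT INTO loan (book_id, due_date, returned_date)
VALUES (1, '2026-03-01', NULL), (2, '2026-02-01', '2026-02-20');
""")

# LEFT JOIN keeps every book; the WHERE clause keeps only those
# with no loan at all (the anti-join pattern).
never_borrowed = cur.execute("""
    SELECT b.title
    FROM book b
    LEFT JOIN loan l ON l.book_id = b.book_id
    WHERE l.loan_id IS NULL
""").fetchall()
print(never_borrowed)  # [('ETL in Practice',)]
```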

Deliverables: Your homework submission should consist of your ER diagram (pdf or png), SQL scripts for database creation and data entry, your SQL queries with expected outputs, and any accompanying explanations.

HW1 (due 12.03)

Data Source Exploration for Group Project

In this homework assignment, you will find a publicly available and reliable data source for a group project. After selecting a suitable data source, provide a concise overview of its key attributes, including:

  • Dataset name and a brief description
  • Purpose of the data and its potential use in a group project
  • Type of data (e.g., tabular, time series, geospatial) and data types (e.g., numerical, categorical, text)
  • Update frequency and historical data availability
  • Data ownership, licensing, and attribution requirements
  • Privacy, ethical concerns, and necessary steps to address them
  • Accessibility (e.g., direct download, API) and any API usage information
  • Data size, scalability, and quality considerations
  • Preprocessing and cleaning tasks required before analysis

Please do not upload the data itself. The preferred submission format is markdown. You can write your homework at https://hackmd.io/ (requires login).
