Institute of Computer Science, University of Tartu
Data engineering for Conversion Master's (LTAT.02.026)

Data engineering for Conversion Master's 2025/26 spring


Homework

Homework submissions are made in the Moodle environment.

HW3 (due 26.04)

ETL Processes, Superset, and Data Visualization

Objective: start the Lecture 4 environment, run the ETL workflow, connect Superset to the warehouse, and create visualizations based on the course datasets.

Task 1: Lecture 4 Environment and Superset Startup Review

Start the Lecture 4 environment from the repository root inside the devcontainer:

make init
make up-superset
make devcontainer-join-course-network
make ps

In your own words, explain what each command does and why it is needed.

Your explanation should cover:

  • what each command does;
  • which services are expected to exist for Lecture 4 (postgres, superset, superset-redis, and the one-time superset-init);
  • why make devcontainer-join-course-network is needed before running ETL from the devcontainer;
  • how make and Docker Compose work together in this repository.

Also explain how you would start working again in each of these cases:

  • the Superset stack already exists but is stopped;
  • the containers have been removed;
  • you changed .env values or reopened the repository from a different host folder.

Task 2: Data Analysis and Visualization in Superset

Make sure the current Lecture 4 datasets are available in your Superset instance.

Run the ETL and validation flow from the repository root inside the devcontainer:

make etl-bootstrap
.venv/bin/python -m etl.airviro.cli run --from 2026-01-01 --to 2026-03-26 --source-key air_quality_station_8 --source-key pollen_station_25 --verbose
make warehouse-status

For this homework, keep the date range inside 2026. The example date range above was validated during course preparation. Do not use wide historical backfills here, because older API responses may contain timestamp inconsistencies.

Then connect Superset to the warehouse and use the current serving views:

  • l4_mart.v_air_quality_hourly_station_8
  • l4_mart.v_pollen_daily_station_25
  • l4_mart.v_airviro_measurements_long (optional helper dataset)

Warehouse connection values:

  • host: postgres
  • port: 5432
  • database: warehouse
  • username: warehouse
  • password: warehouse
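In Superset's database connection form these values are combined into a single SQLAlchemy-style URI. A minimal sketch of how that URI is assembled from the values above (`quote_plus` is only a safeguard for special characters in credentials):

```python
from urllib.parse import quote_plus

# Warehouse connection values from the list above.
user, password = "warehouse", "warehouse"
host, port, database = "postgres", 5432, "warehouse"

# Superset accepts a SQLAlchemy-style URI for PostgreSQL databases.
uri = f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"
print(uri)  # postgresql://warehouse:warehouse@postgres:5432/warehouse
```

Note that `host: postgres` is the Docker Compose service name, which resolves only for containers attached to the course network.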

Create a chart in Superset that shows dust particle concentration (PM10) by:

  • rows: weekday;
  • columns: hour of day (0-23);
  • metric: average pm10.

Recommended dataset:

  • l4_mart.v_air_quality_hourly_station_8

Recommended fields:

  • weekday: day_short
  • hour: hour_number
  • metric: AVG(pm10)

day_short is intentionally space-padded in the datetime dimension so weekday labels sort chronologically in many Superset charts.
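To see why space-padding makes weekday labels sort chronologically under plain string ordering, here is a small illustration (this mimics the idea; it is an assumption about how day_short is actually constructed in the view):

```python
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Pad with a decreasing number of leading spaces: Monday gets the most.
# Since a space sorts before any letter, lexicographic (alphabetical)
# ordering of the padded labels reproduces chronological order.
padded = [" " * (len(days) - i) + day for i, day in enumerate(days)]
print(sorted(padded) == padded)  # True
```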

Task 3: Dashboard Creation in Superset

Create a new dashboard in Superset that includes:

  • the chart you created in Task 2;
  • two more charts of your choice;
  • at least one chart based on l4_mart.v_pollen_daily_station_25;
  • at least one chart based on air-quality indicators from l4_mart.v_air_quality_hourly_station_8 such as ws10, wd10, pm2_5, pm10, o3, no2, or so2;
  • a Markdown element containing a title, a brief explanation of the dashboard, and your name.

Do not rely on temp for this homework: during the lecture period, the source API returned unreliable temperature values for station 8. The homework still targets station 8, because the current Lecture 4 simple ETL and serving views are intentionally built around that station. If you or your agent buddy feel confident, you can choose another air quality station (see ohuseire.ee).

Try to keep the visuals different from each other, for example a pivot table, a line chart, and a box plot or bar chart.

Submission

Submit:

  • one Markdown document for Task 1;
  • one screenshot of the Task 2 chart;
  • one screenshot or exported image of the final dashboard from Task 3.

HW2 (due 26.03)

Practical Application of Database Design Principles

Objective: Design, implement, and query a straightforward database system.

Task 1: Reading Assignment

Instructions: Explore the online book titled "Database Design". Specifically, focus on Chapters 3 & 13, which discuss the "Database Development Process". After reading, fill out the "Database Development & Characteristics Concept Test" in Moodle.

Task 2: Design a Database Schema

Scenario: Your task is to create a database design for a small-scale library system. This system requires tracking information related to books, authors, library patrons, and book loans. Requirements:

  • Construct an Entity-Relationship (ER) diagram to represent the library system. Your diagram should include entities for books, authors, patrons, and loans.
    • Keep in mind possible many-to-many relationships.
    • You can add more tables to complete your design.
  • Compose a concise description of your design, detailing the relationships between the entities.
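As a hedged sketch of resolving the books-authors many-to-many with a junction table (all table, column, and author names here are illustrative, not the required design), in SQLite via Python:

```python
import sqlite3

# A junction table (book_author) resolves the many-to-many relationship:
# each row links one book to one author.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book   (book_id   INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE book_author (
    book_id   INTEGER REFERENCES book(book_id),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (book_id, author_id)
);
INSERT INTO author VALUES (1, 'Alice Author'), (2, 'Bob Writer');
INSERT INTO book   VALUES (1, 'Database Design'), (2, 'Another Title');
INSERT INTO book_author VALUES (1, 1), (1, 2), (2, 2);
""")

# Books authored by a specified individual (compare Task 4 below).
rows = cur.execute("""
    SELECT b.title
    FROM book b
    JOIN book_author ba ON ba.book_id = b.book_id
    JOIN author a       ON a.author_id = ba.author_id
    WHERE a.name = 'Bob Writer'
    ORDER BY b.title
""").fetchall()
print(rows)  # [('Another Title',), ('Database Design',)]
```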

Task 3: Implement the Database

Tools: Choose a relational database management system (RDBMS) you're comfortable with (options include SQLite, MySQL, PostgreSQL, DuckDB).

Steps:

  • Construct the database and its tables in line with your design from Task 2.
  • Populate each table with sample data, ensuring at least 5 entries per main entity table. (Feel free to use ChatGPT or Gemini for schema-based random data generation)
  • Include SQL scripts or a link to an SQL file that details your schema creation and data insertion operations.

Task 4: Querying the Database

Objective: Develop SQL queries to execute the following tasks:

  • Produce a list of all books authored by a specified individual.
  • Identify all patrons who have overdue book loans.
  • Enumerate books that have never been borrowed.

Submission Requirements: Furnish the SQL queries alongside a brief explanation of the expected output for each query.
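A minimal sketch of the "never borrowed" query pattern against a toy schema (table and column names are illustrative): an anti-join via LEFT JOIN ... IS NULL.

```python
import sqlite3

# Toy schema: a book with no matching loan row has never been borrowed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE loan (
    loan_id       INTEGER PRIMARY KEY,
    book_id       INTEGER REFERENCES book(book_id),
    due_date      TEXT,
    returned_date TEXT
);
INSERT INTO book VALUES (1, 'SQL Basics'), (2, 'Data Modeling'), (3, 'ETL in Practice');
INSERT INTO loan (book_id, due_date, returned_date)
VALUES (1, '2026-03-01', NULL), (2, '2026-02-01', '2026-02-20');
""")

# LEFT JOIN keeps every book; the WHERE clause keeps only those
# with no loan at all (the anti-join pattern).
never_borrowed = cur.execute("""
    SELECT b.title
    FROM book b
    LEFT JOIN loan l ON l.book_id = b.book_id
    WHERE l.loan_id IS NULL
""").fetchall()
print(never_borrowed)  # [('ETL in Practice',)]
```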

Deliverables: Your homework submission should consist of your ER diagram (pdf or png), SQL scripts for database creation and data entry, your SQL queries with expected outputs, and any accompanying explanations.

HW1 (due 12.03)

Data Source Exploration for Group Project

In this homework assignment, you will find a publicly available and reliable data source for a group project. After selecting a suitable data source, provide a concise overview of its key attributes, including:

  • Dataset name and a brief description
  • Purpose of the data and its potential use in a group project
  • Type of data (e.g., tabular, time series, geospatial) and data types (e.g., numerical, categorical, text)
  • Update frequency and historical data availability
  • Data ownership, licensing, and attribution requirements
  • Privacy, ethical concerns, and necessary steps to address them
  • Accessibility (e.g., direct download, API) and any API usage information
  • Data size, scalability, and quality considerations
  • Preprocessing and cleaning tasks required before analysis

Please do not upload the data itself. The preferred submission format is markdown. You can write your homework at https://hackmd.io/ (requires login).
