Homework
Homework submissions are made in the Moodle environment.
HW3 (due 26.04)
ETL Processes, Superset, and Data Visualization
Objective: start the Lecture 4 environment, run the ETL workflow, connect Superset to the warehouse, and create visualizations based on the course datasets.
Task 1: Lecture 4 Environment and Superset Startup Review
Start the Lecture 4 environment from the repository root inside the devcontainer:
- `make init`
- `make up-superset`
- `make devcontainer-join-course-network`
- `make ps`
In your own words, explain what each command does and why it is needed.
Your explanation should cover:
- what each command does;
- which services are expected to exist for Lecture 4 (postgres, superset, superset-redis, and the one-time superset-init);
- why `make devcontainer-join-course-network` is needed before running ETL from the devcontainer;
- how `make` and Docker Compose work together in this repository.
Also explain how you would start working again in each of these cases:
- the Superset stack already exists but is stopped;
- the containers have been removed;
- you changed `.env` values or reopened the repository from a different host folder.
Task 2: Data Analysis and Visualization in Superset
Make sure the current Lecture 4 datasets are available in your Superset instance.
Run the ETL and validation flow from the repository root inside the devcontainer:
- `make etl-bootstrap`
- `.venv/bin/python -m etl.airviro.cli run --from 2026-01-01 --to 2026-03-26 --source-key air_quality_station_8 --source-key pollen_station_25 --verbose`
- `make warehouse-status`
For this homework, keep the date range inside 2026. The example range above, which runs up to the Lecture 4 date, was validated during course preparation. Do not use wide historical backfills here, because older API responses may contain timestamp inconsistencies.
Then connect Superset to the warehouse and use the current serving views:
- `l4_mart.v_air_quality_hourly_station_8`
- `l4_mart.v_pollen_daily_station_25`
- `l4_mart.v_airviro_measurements_long` (optional helper dataset)
Warehouse connection values:
- host: `postgres`
- port: `5432`
- database: `warehouse`
- username: `warehouse`
- password: `warehouse`
Create a chart in Superset that shows dust particle concentration (PM10) by:
- rows: weekday;
- columns: hour of day (0-23);
- metric: average `pm10`.
Recommended dataset: `l4_mart.v_air_quality_hourly_station_8`
Recommended fields:
- weekday: `day_short`
- hour: `hour_number`
- metric: `AVG(pm10)`
`day_short` is intentionally space-padded in the datetime dimension so weekday labels sort chronologically in many Superset charts.
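The aggregation behind this chart is a plain weekday-by-hour GROUP BY. A minimal sketch using Python's built-in `sqlite3`, with a hypothetical in-memory `measurements` table standing in for `l4_mart.v_air_quality_hourly_station_8` (the column names `day_short`, `hour_number`, and `pm10` follow the recommended fields above; the sample values are invented):

```python
import sqlite3

# In-memory stand-in for the warehouse view; the real data lives in Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (day_short TEXT, hour_number INTEGER, pm10 REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [
        ("Mon", 8, 12.0),
        ("Mon", 8, 18.0),  # two Monday 08:00 readings -> their average is 15.0
        ("Mon", 9, 20.0),
        ("Tue", 8, 30.0),
    ],
)

# The same aggregation a Superset pivot table performs:
# rows = weekday, columns = hour of day, metric = AVG(pm10).
rows = conn.execute(
    """
    SELECT day_short, hour_number, AVG(pm10) AS avg_pm10
    FROM measurements
    GROUP BY day_short, hour_number
    ORDER BY day_short, hour_number
    """
).fetchall()
for day, hour, avg_pm10 in rows:
    print(day, hour, avg_pm10)
```

In Superset, the pivot-table chart builder does the spreading of `hour_number` into columns for you; the query above only shows the grouped metric that feeds it.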
Task 3: Dashboard Creation in Superset
Create a new dashboard in Superset that includes:
- the chart you created in Task 2;
- two more charts of your choice;
- at least one chart based on `l4_mart.v_pollen_daily_station_25`;
- at least one chart based on air-quality indicators from `l4_mart.v_air_quality_hourly_station_8`, such as `ws10`, `wd10`, `pm2_5`, `pm10`, `o3`, `no2`, or `so2`;
- a Markdown element containing a title, a brief explanation of the dashboard, and your name.
Do not rely on `temp` for this homework. During the lecture period, the source API returned unreliable temperature values for station 8.
The homework stays on station 8 because the current Lecture 4 simple ETL and serving views are intentionally built around that station.
If you or your agent buddy feel confident, you can choose another air quality station (see ohuseire.ee).
Try to make the visuals differ from each other, for example a pivot table, a line chart, and a box plot or bar chart.
Submission
Submit:
- one Markdown document for Task 1;
- one screenshot of the Task 2 chart;
- one screenshot or exported image of the final dashboard from Task 3.
HW2 (due 26.03)
Practical Application of Database Design Principles
Objective: Design, implement, and query a straightforward database system.
Task 1: Reading Assignment
Instructions: Explore the online book titled "Database Design". Specifically, focus on Chapters 3 & 13, which discuss the "Database Development Process". After reading, fill out the "Database Development & Characteristics Concept Test" in Moodle.
Task 2: Design a Database Schema
Scenario: Your task is to create a database design for a small-scale library system. This system requires tracking information related to books, authors, library patrons, and book loans. Requirements:
- Construct an Entity-Relationship (ER) diagram to represent the library system. Your diagram should include entities for books, authors, patrons, and loans.
- Keep in mind possible many-to-many relationships.
- You can add more tables to complete your design.
- Compose a concise description of your design, detailing the relationships between the entities.
Task 3: Implement the Database
Tools: Choose a relational database management system (RDBMS) you're comfortable with (options include SQLite, MySQL, PostgreSQL, DuckDB).
Steps:
- Construct the database and its tables in line with your design from Task 2.
- Populate each table with sample data, ensuring at least 5 entries per main entity table. (Feel free to use ChatGPT or Gemini for schema-based random data generation)
- Include SQL scripts or a link to an SQL file that details your schema creation and data insertion operations.
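One possible shape for such a schema can be sketched with Python's built-in `sqlite3`. All table and column names here are illustrative, not a required design; note the `book_authors` junction table resolving the many-to-many relationship between books and authors:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE books (
    book_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
-- Junction table: a book can have several authors and vice versa.
CREATE TABLE book_authors (
    book_id   INTEGER REFERENCES books(book_id),
    author_id INTEGER REFERENCES authors(author_id),
    PRIMARY KEY (book_id, author_id)
);
CREATE TABLE patrons (
    patron_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE loans (
    loan_id     INTEGER PRIMARY KEY,
    book_id     INTEGER REFERENCES books(book_id),
    patron_id   INTEGER REFERENCES patrons(patron_id),
    loaned_at   TEXT NOT NULL,
    due_at      TEXT NOT NULL,
    returned_at TEXT            -- NULL while the book is still out
);
""")

# A couple of sample rows (the assignment asks for at least 5 per main table).
conn.execute("INSERT INTO authors VALUES (1, 'Ursula K. Le Guin')")
conn.execute("INSERT INTO books VALUES (1, 'The Dispossessed')")
conn.execute("INSERT INTO book_authors VALUES (1, 1)")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

The same `CREATE TABLE` statements work with minor dialect changes in MySQL, PostgreSQL, or DuckDB.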
Task 4: Querying the Database
Objective: Develop SQL queries to execute the following tasks:
- Produce a list of all books authored by a specified individual.
- Identify all patrons who have overdue book loans.
- List books that have never been borrowed.
Submission requirements: provide the SQL queries alongside a brief explanation of the expected output for each query.
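The three query tasks above can be sketched against a minimal version of the library schema. This is one possible set of answers, using Python's built-in `sqlite3` with invented table names, column names, and sample data; your own queries should match your Task 2 design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);
CREATE TABLE patrons (patron_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, book_id INTEGER,
                    patron_id INTEGER, due_at TEXT, returned_at TEXT);

INSERT INTO authors VALUES (1, 'Ann Author'), (2, 'Bob Writer');
INSERT INTO books VALUES (1, 'First Book'), (2, 'Second Book'), (3, 'Shelf Warmer');
INSERT INTO book_authors VALUES (1, 1), (2, 2);
INSERT INTO patrons VALUES (1, 'Pat Patron'), (2, 'Quinn Reader');
-- Loan 1 is overdue (past due date, never returned); loan 2 was returned.
INSERT INTO loans VALUES (1, 1, 1, '2020-01-01', NULL),
                         (2, 2, 2, '2020-01-01', '2019-12-30');
""")

# 1) All books by a given author.
by_author = conn.execute("""
    SELECT b.title FROM books b
    JOIN book_authors ba ON ba.book_id = b.book_id
    JOIN authors a ON a.author_id = ba.author_id
    WHERE a.name = 'Ann Author'
""").fetchall()

# 2) Patrons with overdue loans (due date passed, book not returned).
overdue = conn.execute("""
    SELECT DISTINCT p.name FROM patrons p
    JOIN loans l ON l.patron_id = p.patron_id
    WHERE l.returned_at IS NULL AND l.due_at < date('now')
""").fetchall()

# 3) Books that have never been borrowed (anti-join via LEFT JOIN).
never_borrowed = conn.execute("""
    SELECT b.title FROM books b
    LEFT JOIN loans l ON l.book_id = b.book_id
    WHERE l.loan_id IS NULL
""").fetchall()

print(by_author, overdue, never_borrowed)
```

Expected outputs on this sample data: query 1 returns 'First Book', query 2 returns 'Pat Patron', and query 3 returns 'Shelf Warmer'.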
Deliverables: Your homework submission should consist of your ER diagram (pdf or png), SQL scripts for database creation and data entry, your SQL queries with expected outputs, and any accompanying explanations.
HW1 (due 12.03)
Data Source Exploration for Group Project
In this homework assignment, you will find a publicly available and reliable data source for a group project. After selecting a suitable data source, provide a concise overview of its key attributes, including:
- Dataset name and a brief description
- Purpose of the data and its potential use in a group project
- Type of data (e.g., tabular, time series, geospatial) and data types (e.g., numerical, categorical, text)
- Update frequency and historical data availability
- Data ownership, licensing, and attribution requirements
- Privacy, ethical concerns, and necessary steps to address them
- Accessibility (e.g., direct download, API) and any API usage information
- Data size, scalability, and quality considerations
- Preprocessing and cleaning tasks required before analysis
Please do not upload the data itself. The preferred submission format is Markdown. You can write your homework at https://hackmd.io/ (requires login).