Homework
HW4 (due 23.05)
Apache Spark DataFrames and SQL with Yelp Dataset
Dive into Big Data analysis using Apache Spark DataFrames and SQL on the Yelp dataset. Process and manipulate data in parallel, learning to load Yelp tables as DataFrames, extract user statistics, scrutinize businesses, and generate pivot tables with the Spark DataFrame API and Spark SQL. Ensure a functional Spark environment and submit Python scripts and outputs as deliverables.
BigDataLab
HW3 (due 09.05)
ETL Processes with Docker, Superset, and Data Visualization
Objective: The goal is to use a dockerised Apache Superset for dashboard creation.
Task 1: Docker and Apache Superset Setup Review
Explain the following command: docker run -d -v ${PWD}:/data:rw -p 8080:8088 -e "SUPERSET_SECRET_KEY=<your_new_secret_key>" --name superset my/superset:duckdb
The explanation should cover the following:
- What does each part of the command do?
- For -v, what is the Windows equivalent of ${PWD}?
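As a starting point for the explanation, the command can be broken into its parts programmatically. The sketch below only tokenises the command text with Python's standard shlex module and pairs each option with its argument; it does not run Docker. The function name split_command and the set OPTIONS_WITH_VALUE are assumptions made for this illustration and cover only the options that appear in this command.

```python
import shlex

# The command exactly as quoted in the task (placeholders left untouched).
COMMAND = (
    'docker run -d -v ${PWD}:/data:rw -p 8080:8088 '
    '-e "SUPERSET_SECRET_KEY=<your_new_secret_key>" '
    '--name superset my/superset:duckdb'
)

# Assumption for this sketch: the options in this command that consume
# the following token as their value. "-d" takes no value.
OPTIONS_WITH_VALUE = {"-v", "-p", "-e", "--name"}


def split_command(command):
    """Return a list of (token, value) pairs; value is None for bare words."""
    tokens = shlex.split(command)
    parts = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in OPTIONS_WITH_VALUE and i + 1 < len(tokens):
            parts.append((token, tokens[i + 1]))
            i += 2
        else:
            parts.append((token, None))
            i += 1
    return parts


if __name__ == "__main__":
    for token, value in split_command(COMMAND):
        print(token if value is None else f"{token} -> {value}")
```

Each printed line is one part of the command that the written explanation should address.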
If you participated in the lecture, you have Superset already set up. Explain the steps needed to start working with it again:
- in case the container already exists.
- in case the container has been deleted.
- in case you want to change the mounted folder.
Task 2: Data Analysis and Visualization in Superset
Make sure you have both datasets available in the Superset instance:
- based on http://airviro.klab.ee (refer to the lecture on 04.04)
- based on https://www.ilmateenistus.ee/kliima/ajaloolised-ilmaandmed/ (refer to the lecture on 11.04)
Data transformation scripts can be found at https://github.com/adlerpriit/ETL_superset.
Let's delve further into explaining dust concentration in the air:
- Using Superset, create a chart that displays dust particle concentration per weekday and time of the day (24-hour format).
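The aggregation behind such a chart can be checked outside Superset. The following is a minimal sketch in plain Python, assuming the data arrive as (timestamp, concentration) pairs; the sample readings are made up for illustration and the function name mean_by_weekday_hour is hypothetical. It groups readings by weekday and hour and averages each group, which is the same shape of result a weekday-by-hour heatmap or pivot table would display.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up readings for illustration only: (timestamp, dust concentration).
SAMPLE = [
    ("2024-04-01 08:00", 12.0),  # Monday
    ("2024-04-08 08:00", 18.0),  # Monday, one week later
    ("2024-04-01 17:00", 30.0),  # Monday evening
    ("2024-04-06 08:00", 6.0),   # Saturday
]


def mean_by_weekday_hour(rows):
    """Average the values for every (weekday name, hour 0-23) combination."""
    groups = defaultdict(list)
    for timestamp, value in rows:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        key = (WEEKDAYS[moment.weekday()], moment.hour)
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    for (weekday, hour), value in sorted(mean_by_weekday_hour(SAMPLE).items()):
        print(f"{weekday} {hour:02d}:00 -> {value}")
```

Comparing a few cells computed this way with the Superset chart is a quick sanity check that the weekday and hour columns were derived correctly.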
Task 3: Dashboard Creation in Superset
Create a new dashboard in Superset in which you will need to include the following:
- The chart you created in Task 2
- 2 more charts of your choice (try to use different chart types, e.g. a line plot and a boxplot)
- A markdown element containing:
  - Title
  - Brief explanation of the dashboard
  - Your credentials
Submission: PDF document for Task 1, screenshot with opened datasets for Task 2, downloaded dashboard as PNG or JPG for Task 3.
HW2 (due 28.03)
Practical Application of Database Design Principles
Objective: The goal is to design, implement, and query a straightforward database system.
Task 1: Reading Assignment
Instructions: Explore the online book titled "Database Design". Specifically, focus on Chapter 13, which discusses the "Database Development Process". After reading, answer the questions provided at the end of Chapter 13.
Task 2: Design a Database Schema
Scenario: Your task is to create a database design for a small-scale library system. This system requires tracking information related to books, authors, library patrons, and book loans.
Requirements:
- Construct an Entity-Relationship (ER) diagram to represent the library system. Your diagram should include entities for books, authors, patrons, and loans.
- Compose a concise description of your design, detailing the relationships between the entities.
Task 3: Implement the Database
Tools: Choose a relational database management system (RDBMS) you're comfortable with (options include SQLite, MySQL, PostgreSQL).
Steps:
- Construct the database and its tables in line with your design from Task 2.
- Populate each table with sample data, ensuring at least 5 entries per table. (Feel free to use ChatGPT or Gemini for schema-based random data generation)
- Include SQL scripts or a link to an SQL file that details your schema creation and data insertion operations.
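The steps above could look roughly like the following sketch, which uses SQLite through Python's built-in sqlite3 module. All table and column names are hypothetical, the rows are placeholders, and the design is deliberately simplified (one author per book, fewer rows than the task asks for); your own schema from Task 2 may well differ.

```python
import sqlite3

# Hypothetical schema: one possible layout, not the required one.
SCHEMA = """
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE books (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES authors(author_id)
);
CREATE TABLE patrons (
    patron_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE loans (
    loan_id     INTEGER PRIMARY KEY,
    book_id     INTEGER NOT NULL REFERENCES books(book_id),
    patron_id   INTEGER NOT NULL REFERENCES patrons(patron_id),
    loaned_on   TEXT NOT NULL,   -- ISO date, e.g. 2024-03-01
    due_on      TEXT NOT NULL,
    returned_on TEXT             -- NULL while the book is still out
);
"""


def build_database(path=":memory:"):
    """Create the tables and insert a few placeholder rows."""
    conn = sqlite3.connect(path)
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO authors VALUES (?, ?)",
                     [(1, "Author A"), (2, "Author B")])
    conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                     [(1, "Book One", 1), (2, "Book Two", 2)])
    conn.executemany("INSERT INTO patrons VALUES (?, ?)",
                     [(1, "Patron X"), (2, "Patron Y")])
    conn.executemany("INSERT INTO loans VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, "2024-03-01", "2024-03-15", None)])
    conn.commit()
    return conn


if __name__ == "__main__":
    connection = build_database()
    print(connection.execute("SELECT COUNT(*) FROM books").fetchone()[0])
```

Passing a file name instead of ":memory:" writes the database to disk, which makes it easy to attach to the submission.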
Task 4: Querying the Database
Objective: Develop SQL queries to execute the following tasks:
- Produce a list of all books authored by a specified individual.
- Identify all patrons who have overdue book loans.
- Enumerate books that have not been borrowed.
Submission Requirements: Furnish the SQL queries alongside a brief explanation of the expected output for each query.
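To illustrate the shape such queries might take, here is a small self-contained sketch against an in-memory SQLite database. The schema, names, and rows are hypothetical placeholders, and the reference date for "overdue" is passed in explicitly as an assumption so the result does not depend on the day the script runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books   (book_id INTEGER PRIMARY KEY, title TEXT,
                      author_id INTEGER REFERENCES authors(author_id));
CREATE TABLE patrons (patron_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE loans   (loan_id INTEGER PRIMARY KEY, book_id INTEGER,
                      patron_id INTEGER, due_on TEXT, returned_on TEXT);
INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO books   VALUES (1, 'Book One', 1), (2, 'Book Two', 1),
                           (3, 'Book Three', 2);
INSERT INTO patrons VALUES (1, 'Patron X'), (2, 'Patron Y');
INSERT INTO loans   VALUES
    (1, 1, 1, '2024-03-01', NULL),          -- still out, due date passed
    (2, 3, 2, '2024-03-01', '2024-02-20');  -- returned before the due date
""")


def books_by_author(conn, author_name):
    """Titles of all books written by the given author."""
    rows = conn.execute(
        "SELECT b.title FROM books b "
        "JOIN authors a ON a.author_id = b.author_id "
        "WHERE a.name = ? ORDER BY b.title", (author_name,))
    return [row[0] for row in rows]


def patrons_with_overdue_loans(conn, today):
    """Patrons holding an unreturned loan whose due date is before `today`."""
    rows = conn.execute(
        "SELECT DISTINCT p.name FROM patrons p "
        "JOIN loans l ON l.patron_id = p.patron_id "
        "WHERE l.returned_on IS NULL AND l.due_on < ? "
        "ORDER BY p.name", (today,))
    return [row[0] for row in rows]


def never_borrowed_books(conn):
    """Books with no matching row in the loans table."""
    rows = conn.execute(
        "SELECT b.title FROM books b "
        "LEFT JOIN loans l ON l.book_id = b.book_id "
        "WHERE l.loan_id IS NULL ORDER BY b.title")
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(books_by_author(conn, "Author A"))
    print(patrons_with_overdue_loans(conn, "2024-03-28"))
    print(never_borrowed_books(conn))
```

Dates are stored as ISO-formatted text here, so a plain string comparison orders them correctly; other RDBMSs offer a native DATE type for the same purpose.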
Deliverables: Your homework submission should consist of your responses to the reading assignment, your ER diagram, a narrative of your database schema, SQL scripts for database creation and data entry, your SQL queries with expected outputs, and any accompanying explanations.
HW1 (due 29.02)
Data Source Exploration for Group Project
In this homework assignment, you will find a publicly available and reliable data source for a group project. After selecting a suitable data source, provide a concise overview of its key attributes, including:
- Dataset name and a brief description
- Type of data (e.g., tabular, time series, geospatial) and data types (e.g., numerical, categorical, text)
- Purpose of the data and its potential use in a group project
- Update frequency and historical data availability
- Data ownership, licensing, and attribution requirements
- Data size, scalability, and quality considerations
- Accessibility (e.g., direct download, API) and any API usage information
- Privacy, ethical concerns, and necessary steps to address them
- Preprocessing and cleaning tasks required before analysis
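Several of these attributes (data types, size, missing values) can be estimated with a quick profiling script before writing the overview. The sketch below is one possible approach using only the Python standard library; the sample CSV, its column names, and the helper names guess_type and profile are made up for illustration, and the type guess is a rough heuristic rather than a full schema inference.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file.
SAMPLE_CSV = """station,measured_at,value
A,2024-01-01,12.5
B,2024-01-01,
A,2024-01-02,9
"""


def guess_type(values):
    """Classify a column as 'numerical', 'text', or 'empty' (rough heuristic)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "numerical"


def profile(csv_text):
    """Return (row count, {column: {'type': ..., 'missing': ...}})."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for column in columns:
        values = [row[column] for row in rows]
        report[column] = {
            "type": guess_type(values),
            "missing": sum(1 for v in values if v == ""),
        }
    return len(rows), report


if __name__ == "__main__":
    row_count, report = profile(SAMPLE_CSV)
    print(row_count)
    for column, info in report.items():
        print(column, info)
```

For a real file, the text could be read with open(path, newline="") and passed to the same function; dates are reported as text here and would need separate handling.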