04.04.2024 Data Engineering Lecture
Setup
Before we begin, we need to create a folder specifically for today's session. This folder will contain all the necessary files and directories we will work with during the workshop. Please follow the steps below to set up the environment on your system:
- Open your terminal or command prompt.
- Navigate to the Desktop or a specific project or lecture practicals folder where you wish to create the session folder. You can do this by executing the command:
```bash
cd Desktop   # Change this to your project folder or lecture practicals folder as needed.
mkdir ETL_superset
cd ETL_superset
```
Now that we've created the folder for today's session, let's make sure our working environment is ready. In the previous session, we created our own custom Superset image, and its container might still be running, so our first step is to stop and remove it. Don't worry, no actual data will be lost by doing this. Please execute the following commands in your terminal:
```bash
docker stop superset
docker rm superset
```
Next, we'll be creating two new Docker containers that utilize the same `ETL_superset` folder as a mount point:
- A Superset container with our project mounted.
- A Python development container with our project mounted.
Execute the following commands to set up the Superset container. Remember to change `SUPERSET_SECRET_KEY` to a new secret value and keep it confidential; in particular, do not publish it on GitHub or in any other public repository.
```bash
# Secret key should be changed and kept secret, not published to GitHub :)
docker run -d -v ${PWD}:/data:rw -p 8080:8088 \
  -e "SUPERSET_SECRET_KEY=your_new_secret_key" \
  --name superset <replace with image name we created last time, i.e. my/superset:duckdb - see docker images for list>

# Update username, first name, last name, email and password as you see fit
docker exec -it superset superset fab create-admin --username admin --firstname Admin --lastname Superset --email admin@example.com --password admin
docker exec -it superset superset db upgrade
docker exec -it superset superset init
```
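Superset can take a little while to finish initializing. As an optional check, you can poll it from Python until it responds; this is only a sketch, assuming the port mapping above, Superset's standard `/health` endpoint, and a host Python with `requests` installed:

```python
import time
import requests

# Poll Superset's health endpoint until the container reports it is ready.
url = 'http://localhost:8080/health'
for attempt in range(30):
    try:
        if requests.get(url, timeout=2).ok:
            print('Superset is up')
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    print('Superset did not come up in time; check "docker logs superset"')
```

Alternatively, simply wait a minute and open http://localhost:8080 in your browser.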
For setting up the development container in Visual Studio Code, please follow these steps:
- Open the `ETL_superset` folder in VS Code.
- Open the command palette using the keyboard shortcut `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac). Type and select `Dev Containers: Add Dev Container Configuration Files...`.
- If prompted, select `Add Configuration to Workspace`.
- Choose `Python 3` and then select `3.12-bullseye` as the version.
- Select 'none' when asked for additional packages and click `OK`.
- Again, use the keyboard shortcut `Ctrl+Shift+P` or `Cmd+Shift+P` and type and select `Dev Containers: Reopen in Container`.
Python Script to Fetch Air Quality Data in Tartu
Next, we'll be developing a Python script to download air quality data for Tartu. This data is accessible via the 'csv' API endpoint provided by http://airviro.klab.ee. Below, you will find the code snippets designed for this task, including how to download data in CSV format, handle command-line arguments, and fetch data for a specified range.
Ensure you have Python installed, along with the `requests` and `pandas` libraries. You might need to install the `requests` library if you haven't done so already, which can be done by running `pip install requests` in your terminal (in VS Code, open the 'Terminal' menu and select 'New Terminal').
Here is the Python script for downloading the air quality data:
```python
import requests
import datetime
import argparse

# Define command line arguments for station ID, year start, and year end
parser = argparse.ArgumentParser(description='Fetch air quality data from airviro.klab.ee')
parser.add_argument('station_id', type=int, default=8, help='The station ID to fetch data for')
parser.add_argument('start', type=int, help='The start year to fetch data for')
parser.add_argument('stop', type=int, help='The end year to fetch data for (included)')
args = parser.parse_args()

def fetch_air_range(station_id, date_from, date_until):
    # Define the URL of the endpoint that provides the air quality data
    url = 'http://airviro.klab.ee/station/csv'
    # Define the parameters for the POST request
    data = {
        'filter[type]': 'INDICATOR',
        'filter[cancelSearch]': '',
        'filter[stationId]': station_id,
        'filter[dateFrom]': date_from,
        'filter[dateUntil]': date_until,
        'filter[submitHit]': '1',
        'filter[indicatorIds]': ''
    }
    # Send a POST request to the endpoint
    response = requests.post(url, data)
    # Return the text content of the response
    return response.text

def get_first_and_last_day_of_month(year, month):
    # Calculate the first and last day of the month
    first_day = datetime.date(year, month, 1)
    if month == 12:
        num_days = 31
    else:
        num_days = (datetime.date(year, month + 1, 1) - datetime.timedelta(days=1)).day
    last_day = datetime.date(year, month, num_days)
    return first_day, last_day

def fetch_air_year(station_id, year):
    # Fetch monthly results within a year
    for month in range(1, 13):
        first_day, last_day = get_first_and_last_day_of_month(year, month)
        print(f'Fetching data from {first_day} to {last_day}, included')
        monthdata = fetch_air_range(station_id, first_day, last_day)
        with open(f'csv/air_{station_id}_{year}_{month}.csv', 'w') as f:
            f.write(monthdata)

for year in range(args.start, args.stop + 1):
    print(f'Fetching data for year {year}')
    fetch_air_year(args.station_id, year)
```
This script requires command-line arguments to specify the station ID and the range of years for which you wish to fetch the data. Ensure you create a directory named `csv` in the same folder as your script, or modify the path in the script accordingly, to avoid file write errors.

The script utilizes `requests` to send a POST request to the air quality data endpoint and saves the fetched data into CSV files organized by station ID, year, and month.

After saving this script, you can run it from the command line by navigating to the script's directory and executing the command `python script_name.py station_id start_year end_year`, replacing `script_name.py` with the actual name of your script and filling in the appropriate values for `station_id`, `start_year`, and `end_year`.
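If you prefer not to create the `csv` folder by hand, a small optional addition near the top of the script (after the imports) creates it automatically; this is a suggested tweak, not part of the listing above:

```python
import os

# Optional: create the output folder for the monthly CSV files if it does not
# exist yet, so the open('csv/...', 'w') calls do not fail.
os.makedirs('csv', exist_ok=True)
```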
Transforming Air Quality Data
In our dataset, we've identified a few areas for improvement to enhance usability and learning experience. Specifically:
- we'll be transforming the data format from CSV to Parquet,
- adjusting the numeric values to use a period (`.`) as the decimal separator instead of a comma,
- converting the first column from a string to a date format, and
- renaming the date column.
To accomplish these tasks, we will use the Python `pandas` library for its powerful data manipulation capabilities.

Ensure you have Python and the `pandas` and `pyarrow` libraries installed on your system. If you haven't installed `pandas` yet, you can do so by running `pip install pandas pyarrow` in your terminal.
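As a warm-up, the snippet below shows on a tiny, made-up two-row sample what the combination of `sep=';'`, `decimal=','` and `parse_dates` does; the `Kuupäev` column name matches the real files, while the `SO2` values and timestamps are invented for illustration:

```python
import io
import pandas as pd

# A made-up sample mimicking the raw format: ';' as separator,
# ',' as decimal separator, and an Estonian date column named 'Kuupäev'.
raw = "Kuupäev;SO2\n2023-01-02 01:00;1,23\n2023-01-01 23:00;0,98\n"

sample = (
    pd.read_csv(io.StringIO(raw), sep=';', decimal=',', parse_dates=['Kuupäev'])
    .sort_values(['Kuupäev'])
    .rename(columns={'Kuupäev': 'DateTime'})
)

print(sample.dtypes)  # DateTime is datetime64[ns], SO2 is float64
```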
Below is the Python script to perform the transformation:
```python
import pandas as pd
import argparse

# Define command line arguments for input and output files
parser = argparse.ArgumentParser(description='Transform air quality data')
parser.add_argument('infile', type=str, help='The input file to transform')
parser.add_argument('outfile', type=str, help='The output file to write the transformed data to')
args = parser.parse_args()

# Read the input file, specifying the separator, decimal, and the column to parse as date
data = (
    pd.read_csv(args.infile, sep=';', decimal=',', parse_dates=['Kuupäev'])
    .sort_values(['Kuupäev'])
    .rename(columns={'Kuupäev': 'DateTime'})
)

# Write the transformed data to the output file in Parquet format
data.to_parquet(args.outfile, index=False)
```
This script takes a downloaded `.csv` file, adjusts the number format and column names, and saves the transformed data as a Parquet file.

We will use a Bash loop to apply this script to all downloaded files. This step assumes you have a `parquet` directory where the transformed files will be stored. If the `parquet` directory does not exist, please create it before running the loop.
Run the following commands in your terminal:
```bash
for fn in $(ls csv); do
    echo "Transforming $fn..."
    python transform_AirQuality.py csv/$fn parquet/$(basename $fn .csv).parquet
done
```
This loop iterates over each file in the `csv` directory, transforming it into a Parquet file with the corrected formatting and saving it in the `parquet` directory. The loop prints the name of each file as it is being transformed, providing a clear indication of its progress.
Remember, while the transformation process improves data usability, make sure the integrity and accuracy of the data are maintained throughout any such process.
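One quick way to check this is to read a transformed file back with `pandas` and inspect the column types and value ranges. The sketch below assumes at least one file has already been written to the `parquet` directory and that you run it from the project folder:

```python
import glob
import pandas as pd

# Pick the first transformed file and inspect it.
files = sorted(glob.glob('parquet/*.parquet'))
check = pd.read_parquet(files[0])

print(check.dtypes)        # DateTime should be datetime64[ns], measurements should be floats
print(check['DateTime'].min(), '->', check['DateTime'].max())
print(check.describe())    # quick look at value ranges to spot obviously broken columns
```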
Visualizing Data
With the data now prepared and stored, our next step is to explore and visualize it. For this purpose, we'll be using the Apache Superset instance we initiated during the setup process. Apache Superset is open-source software that allows for data exploration and visualization through an intuitive interface.
Here's how to get started with visualizing your air quality data in Apache Superset:
- Open your web browser and navigate to `localhost:8080`. This will take you to the Apache Superset login page.
- Enter the credentials you set up during the initialization process to log in.
Once logged in, follow these steps to connect to your data source:
- Click on the '+' icon, then select 'Connect Database'.
- In the 'Connect Database' screen, you'll see a 'Supported databases' dropdown menu. Select `DuckDB` from this list.
- For the URI, you need to specify the location of your database. Enter `duckdb:////data/AirQuality.db` into the URI field.
By connecting to the DuckDB database, you're setting up Superset to directly query and visualize the data you've transformed and stored. DuckDB is an in-process SQL OLAP Database Management System, perfect for analytical queries on large datasets, like the air quality data we're working with.
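If you want to confirm that the Parquet files are readable by DuckDB before wiring them into Superset, you can run the same kind of query locally. This is only a sketch, assuming you install the `duckdb` Python package (`pip install duckdb`) in the development container and run it from the project folder, so the relative `parquet/` path resolves:

```python
import duckdb

# Query the Parquet files directly, much as Superset will do through its DuckDB connection.
con = duckdb.connect()  # an in-memory database is enough for a quick check
result = con.execute(
    "SELECT count(*) AS n_rows, min(DateTime) AS first_ts, max(DateTime) AS last_ts "
    "FROM read_parquet('parquet/*.parquet')"
).df()
print(result)
```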
To start visualizing the air quality data, we first need to create a dataset within Apache Superset. This dataset will be the foundation for our explorations and visualizations. Follow the steps below to create a dataset from the Parquet files we prepared earlier:
- From the top navigation bar, select `SQL` and then `SQL Lab` from the dropdown menu. SQL Lab is a flexible tool within Superset that allows for executing SQL queries on your connected databases.
- Within SQL Lab, you'll see an option to select your database. Choose the database connection you set up previously (the one connected to `DuckDB`).
- In the SQL query editor that appears, you'll enter a SQL query to load your data. Type or copy the following SQL command into the editor:

```sql
SELECT * FROM read_parquet('/data/parquet/*.parquet') ORDER BY DateTime
```

- This SQL command tells Superset to read all Parquet files located in `/data/parquet/` and order the resulting records by the `DateTime` column. Using the `read_parquet` function allows us to directly query and work with the Parquet files as if they were tables in a database, making it extremely convenient for analysis.
- After entering the query, execute it by pressing the "Run selection" button. This action will load the data and display the results within SQL Lab. After confirming that the query works, save it as a dataset using the "Save" -> "Save dataset" button.
Creating Your First Plot: Boxplot
After preparing and connecting your dataset, you're now ready to dive into the visualization capabilities of Apache Superset. We will start by creating a boxplot, which is excellent for visualizing the distribution of your data across different categories, such as time periods.
- Navigate to the charts section by selecting `Charts` from the top navigation menu, then click on `+ Chart` to begin creating a new chart.
- From the 'Choose a dataset' dropdown, select the dataset you created in the previous steps.
- For the visualization type, scroll through the options under '#ECharts' and select `Boxplot`. Click `Create new chart`.
Configuring the Data Panel
The next step involves setting up your data for the boxplot:
- Distribute Across: Select `DateTime` as the dimension to distribute data points across the boxplot.
- Time Grain: Set this to `Day` to aggregate data points by each day.
- Dimensions:
  - Click on `+` next to 'Dimensions' to add custom SQL expressions.
  - First, add a custom SQL expression `month(DateTime)` and title it `month`. This will help with the correct order of the months.
  - Add another custom SQL expression `monthname(DateTime)` and title it `monthname` for displaying month names on the boxplot.
- Metrics: Choose `avg(TEMP)` as the metric to display the average temperature.
After configuring these options, click `Create chart` to generate your initial boxplot.
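If you want to cross-check what the chart aggregates, a rough `pandas` equivalent is sketched below; it assumes the Parquet files contain a `TEMP` column (as used in the Metrics step) and that you run it from the project folder:

```python
import glob
import pandas as pd

# Load all transformed files into one frame.
frames = [pd.read_parquet(fn) for fn in sorted(glob.glob('parquet/*.parquet'))]
data = pd.concat(frames, ignore_index=True)

# Day time grain with the avg(TEMP) metric: one average temperature per day...
daily = data.set_index('DateTime')['TEMP'].resample('D').mean()

# ...then the per-month distribution that the boxplot summarizes.
monthly = daily.groupby(daily.index.month).describe()
print(monthly[['min', '25%', '50%', '75%', 'max']])
```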
Customizing the Boxplot for Month Ordering
To address the challenge of custom ordering by month names, we can apply an SQL trick using a `CASE WHEN` statement. This approach adjusts the `monthname` dimension to enforce a custom order:
- Navigate back to the boxplot's settings by clicking on the 'Edit' button for your chart.
- Remove the `month` dimension since it was only needed for ordering.
- Modify the `monthname` dimension by replacing it with the following custom SQL expression:
```sql
CASE
    WHEN monthname(DateTime) = 'January'   THEN '           Jan'
    WHEN monthname(DateTime) = 'February'  THEN '          Feb'
    WHEN monthname(DateTime) = 'March'     THEN '         Mar'
    WHEN monthname(DateTime) = 'April'     THEN '        Apr'
    WHEN monthname(DateTime) = 'May'       THEN '       May'
    WHEN monthname(DateTime) = 'June'      THEN '      Jun'
    WHEN monthname(DateTime) = 'July'      THEN '     Jul'
    WHEN monthname(DateTime) = 'August'    THEN '    Aug'
    WHEN monthname(DateTime) = 'September' THEN '   Sep'
    WHEN monthname(DateTime) = 'October'   THEN '  Oct'
    WHEN monthname(DateTime) = 'November'  THEN ' Nov'
    WHEN monthname(DateTime) = 'December'  THEN 'Dec'
    ELSE monthname(DateTime)
END
```
- After entering this custom SQL, click `Update chart` to apply the changes.
This adjustment enforces a custom ordering for the months in your boxplot: strings with more leading spaces sort first, so padding the abbreviations with a decreasing number of spaces (most before 'Jan', none before 'Dec') makes the default alphabetical sort match calendar order, ensuring that the data is displayed correctly.
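If the padding trick seems opaque, this small Python check illustrates why it works; the labels and space counts mirror the SQL expression above:

```python
# Strings with more leading spaces sort first, so padding January the most and
# December not at all makes alphabetical order equal calendar order.
labels = ['           Jan', '          Feb', '         Mar', '        Apr',
          '       May', '      Jun', '     Jul', '    Aug',
          '   Sep', '  Oct', ' Nov', 'Dec']
print(sorted(labels) == labels)  # True
```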
By following these steps, you have created a visually appealing boxplot that represents your data accurately and addresses specific formatting and ordering challenges. Such customizations enhance the interpretability and effectiveness of your data visualizations.
Using the same guidelines, create another chart for average monthly wind speed and a pivot table to display average dust particles in the air across months and weekdays. Remember to save all your charts!
Creating a Dashboard in Apache Superset
After visualizing your data through various charts, the next step is to organize these visualizations on a dashboard. Apache Superset provides an intuitive interface for creating and managing dashboards, aggregating multiple charts and enabling interactive data exploration.
Step 1: Creating a New Dashboard
- Navigate to the `Dashboard` tab located in the top menu.
- Click on `+ Dashboard` to initiate the creation of a new dashboard.
- You'll be taken to a drag-and-drop interface, where you can start assembling your dashboard.
Step 2: Adding Charts to the Dashboard
- On the right-side panel, you’ll see a list of charts you've created. Drag and drop the charts you wish to include onto the dashboard area.
- Charts already being used in any dashboard will appear greyed out. However, if desired, you can still select and add them to your current dashboard.
Step 3: Saving Your Dashboard
- Once you’ve arranged the charts to your liking, don’t forget to save your dashboard. Look for the "Save Dashboard" button at the interface's top right.
- You should customize your dashboard by adding titles, dividing lines, and text blocks for additional context or instructions.
Step 4: Adding Filters
- Apache Superset allows you to add filters to your dashboard, enhancing its interactivity and usability. Filters enable dynamic data selection, allowing dashboard viewers to focus on specific aspects or time ranges of the data.
- To add a filter, look for the "Filters" option or a symbol resembling a funnel. You can specify which data dimensions the filter should apply to, such as dates, categories, or any custom SQL conditions.
Interactive Features: Pivot Table Exploration
- If a pivot table is included in your dashboard, it offers an interactive feature allowing you to click on any of the categories displayed.
- Clicking on a category within the pivot table can dynamically filter or highlight associated data across the dashboard.
Conclusion
By carefully curating charts and employing interactive features like filters and pivot tables, you create a dynamic and informative dashboard. This not only aids in data analysis but also in storytelling with data, providing a comprehensive view of the data at hand.
Remember, the effectiveness of a dashboard lies in its design and the relevance of the information it presents. Regularly review and update your dashboards to ensure they remain useful and engaging for their audience.