04.04.2024 Data Engineering Lecture
Setup
Before we begin, we need to create a folder specifically for today's session. This folder will contain all the necessary files and directories we will work with during the workshop. Please follow the steps below to set up the environment on your system:
- Open your terminal or command prompt.
- Navigate to the Desktop or a specific project or lecture practicals folder where you wish to create the session folder. You can do this by executing the command:
```bash
cd Desktop   # Change this to your project folder or lecture practicals folder as needed.
mkdir ETL_superset
cd ETL_superset
```
Now that we've created the folder for today's session, let's make sure our working environment is ready. In the previous session, we created our own custom Superset image, and its container might still be running, so our first step is to stop and remove it. Don't worry, no actual data will be lost by doing this. Please execute the following commands in your terminal:
```bash
docker stop superset
docker rm superset
```
Next, we'll be creating two new Docker containers that utilize the same `ETL_superset` folder as a mount point:
- A Superset container with our project mounted.
- A Python development container with our project mounted.
Execute the following commands to set up the Superset container. Remember to change `SUPERSET_SECRET_KEY` to a new secret value and keep it confidential; in particular, do not publish it on GitHub or in any other public repository.
```bash
# Secret key should be changed and kept secret, not published to GitHub :)
docker run -d -v ${PWD}:/data:rw -p 8080:8088 \
  -e "SUPERSET_SECRET_KEY=your_new_secret_key" \
  --name superset <replace with image name we created last time, i.e. my/superset:duckdb - see docker images for list>

# Update username, first name, last name, email and password as you see fit
docker exec -it superset superset fab create-admin --username admin --firstname Admin --lastname Superset --email admin@example.com --password admin
docker exec -it superset superset db upgrade
docker exec -it superset superset init
```
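Superset can take a little while to finish initializing. As an optional check, you can poll it from Python until it responds; this is only a sketch, assuming the port mapping above, Superset's standard `/health` endpoint, and a host Python with `requests` installed:

```python
import time
import requests

# Poll Superset's health endpoint until the container reports it is ready.
url = 'http://localhost:8080/health'
for attempt in range(30):
    try:
        if requests.get(url, timeout=2).ok:
            print('Superset is up')
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    print('Superset did not come up in time; check "docker logs superset"')
```

Alternatively, simply wait a minute and open http://localhost:8080 in your browser.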
For setting up the development container in Visual Studio Code, please follow these steps:
- Open the `ETL_superset` folder in VS Code.
- Open the command palette using the keyboard shortcut `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac). Type and select `Dev Containers: Add Dev Container Configuration Files...`.
- If prompted, select `Add Configuration to Workspace`.
- Choose `Python 3` and then select `3.12-bullseye` as the version.
- Select 'none' when asked for additional packages and click `OK`.
- Again, use the keyboard shortcut `Ctrl+Shift+P` or `Cmd+Shift+P` and type and select `Dev Containers: Reopen in Container`.
Python Script to Fetch Air Quality Data in Tartu
Next, we'll be developing a Python script to download air quality data for Tartu. This data is accessible via the 'csv' API endpoint provided by http://airviro.klab.ee. Below, you will find the code snippets designed for this task, including how to download data in CSV format, handle command-line arguments, and fetch data for a specified range.
Ensure you have Python installed, along with the `requests` and `pandas` libraries. You might need to install the `requests` library if you haven't done so already, which can be done by running `pip install requests` in your terminal (in VS Code, open the 'Terminal' menu and select 'New Terminal').
Here is the Python script for downloading the air quality data:
```python
import requests
import datetime
import argparse

# Define command line arguments for station ID, year start, and year end
parser = argparse.ArgumentParser(description='Fetch air quality data from airviro.klab.ee')
parser.add_argument('station_id', type=int, default=8, help='The station ID to fetch data for')
parser.add_argument('start', type=int, help='The start year to fetch data for')
parser.add_argument('stop', type=int, help='The end year to fetch data for (included)')
args = parser.parse_args()

def fetch_air_range(station_id, date_from, date_until):
    # Define the URL of the endpoint that provides the air quality data
    url = 'http://airviro.klab.ee/station/csv'
    # Define the parameters for the POST request
    data = {
        'filter[type]': 'INDICATOR',
        'filter[cancelSearch]': '',
        'filter[stationId]': station_id,
        'filter[dateFrom]': date_from,
        'filter[dateUntil]': date_until,
        'filter[submitHit]': '1',
        'filter[indicatorIds]': ''
    }
    # Send a POST request to the endpoint
    response = requests.post(url, data)
    # Return the text content of the response
    return response.text

def get_first_and_last_day_of_month(year, month):
    # Calculate the first and last day of the month
    first_day = datetime.date(year, month, 1)
    if month == 12:
        num_days = 31
    else:
        num_days = (datetime.date(year, month + 1, 1) - datetime.timedelta(days=1)).day
    last_day = datetime.date(year, month, num_days)
    return first_day, last_day

def fetch_air_year(station_id, year):
    # Fetch monthly results within a year
    for month in range(1, 13):
        first_day, last_day = get_first_and_last_day_of_month(year, month)
        print(f'Fetching data from {first_day} to {last_day}, included')
        monthdata = fetch_air_range(station_id, first_day, last_day)
        with open(f'csv/air_{station_id}_{year}_{month}.csv', 'w') as f:
            f.write(monthdata)

for year in range(args.start, args.stop + 1):
    print(f'Fetching data for year {year}')
    fetch_air_year(args.station_id, year)
```
This script requires command-line arguments to specify the station ID and the range of years for which you wish to fetch the data. Ensure you create a directory named `csv` in the same folder as your script, or modify the path in the script accordingly, to avoid file write errors.

The script utilizes `requests` to send a POST request to the air quality data endpoint and saves the fetched data into CSV files organized by station ID, year, and month.

After saving this script, you can run it from the command line by navigating to the script's directory and executing the command `python script_name.py station_id start_year end_year`, replacing `script_name.py` with the actual name of your script and filling in the appropriate values for `station_id`, `start_year`, and `end_year`.
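If you prefer not to create the `csv` folder by hand, a small optional addition near the top of the script (after the imports) creates it automatically; this is a suggested tweak, not part of the listing above:

```python
import os

# Optional: create the output folder for the monthly CSV files if it does not
# exist yet, so the open('csv/...', 'w') calls do not fail.
os.makedirs('csv', exist_ok=True)
```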
Transforming Air Quality Data
In our dataset, we've identified a few areas for improvement to enhance usability and learning experience. Specifically:
- we'll be transforming the data format from CSV to Parquet,
- adjusting the numeric values to use a period (`.`) as the decimal separator instead of a comma,
- converting the first column from a string to a date format, and
- renaming the date column.
To accomplish these tasks, we will use the Python `pandas` library for its powerful data manipulation capabilities.

Ensure you have Python and the `pandas` and `pyarrow` libraries installed on your system. If you haven't installed `pandas` yet, you can do so by running `pip install pandas pyarrow` in your terminal.
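As a warm-up, the snippet below shows on a tiny, made-up two-row sample what the combination of `sep=';'`, `decimal=','` and `parse_dates` does; the `Kuupäev` column name matches the real files, while the `SO2` values and timestamps are invented for illustration:

```python
import io
import pandas as pd

# A made-up sample mimicking the raw format: ';' as separator,
# ',' as decimal separator, and an Estonian date column named 'Kuupäev'.
raw = "Kuupäev;SO2\n2023-01-02 01:00;1,23\n2023-01-01 23:00;0,98\n"

sample = (
    pd.read_csv(io.StringIO(raw), sep=';', decimal=',', parse_dates=['Kuupäev'])
    .sort_values(['Kuupäev'])
    .rename(columns={'Kuupäev': 'DateTime'})
)

print(sample.dtypes)  # DateTime is datetime64[ns], SO2 is float64
```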
Below is the Python script to perform the transformation:
```python
import pandas as pd
import argparse

# Define command line arguments for input and output files
parser = argparse.ArgumentParser(description='Transform air quality data')
parser.add_argument('infile', type=str, help='The input file to transform')
parser.add_argument('outfile', type=str, help='The output file to write the transformed data to')
args = parser.parse_args()

# Read the input file, specifying the separator, decimal, and the column to parse as date
data = (
    pd.read_csv(args.infile, sep=';', decimal=',', parse_dates=['Kuupäev'])
    .sort_values(['Kuupäev'])
    .rename(columns={'Kuupäev': 'DateTime'})
)

# Write the transformed data to the output file in Parquet format
data.to_parquet(args.outfile, index=False)
```
This script takes a downloaded `.csv` file, adjusts the number format and column names, and saves the transformed data as a Parquet file.

We will use a Bash loop to apply this script to all downloaded files. This step assumes you have a `parquet` directory where the transformed files will be stored. If the `parquet` directory does not exist, please create it before running the loop.
Run the following commands in your terminal:
```bash
for fn in $(ls csv); do
    echo "Transforming $fn..."
    python transform_AirQuality.py csv/$fn parquet/$(basename $fn .csv).parquet
done
```
This loop iterates over each file in the `csv` directory, transforming it into a Parquet file with the corrected formatting and saving it in the `parquet` directory. The loop prints the name of each file as it is being transformed, providing a clear indication of its progress.
Remember, while the transformation process improves data usability, make sure the integrity and accuracy of the data are maintained throughout any such process.
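One quick way to check this is to read a transformed file back with `pandas` and inspect the column types and value ranges. The sketch below assumes at least one file has already been written to the `parquet` directory and that you run it from the project folder:

```python
import glob
import pandas as pd

# Pick the first transformed file and inspect it.
files = sorted(glob.glob('parquet/*.parquet'))
check = pd.read_parquet(files[0])

print(check.dtypes)        # DateTime should be datetime64[ns], measurements should be floats
print(check['DateTime'].min(), '->', check['DateTime'].max())
print(check.describe())    # quick look at value ranges to spot obviously broken columns
```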
Visualizing Data
With the data now prepared and stored, our next step is to explore and visualize it. For this purpose, we'll be using the Apache Superset instance we initiated during the setup process. Apache Superset is open-source software that allows for data exploration and visualization through an intuitive interface.
Here's how to get started with visualizing your air quality data in Apache Superset:
- Open your web browser and navigate to `localhost:8080`. This will take you to the Apache Superset login page.
- Enter the credentials you set up during the initialization process to log in.
Once logged in, follow these steps to connect to your data source:
- Click on the '+' icon, then select 'Connect Database'.
- In the 'Connect Database' screen, you'll see a 'Supported databases' dropdown menu. Select `DuckDB` from this list.
- For the URI, you need to specify the location of your database. Enter `duckdb:////data/AirQuality.db` into the URI field.
By connecting to the DuckDB database, you're setting up Superset to directly query and visualize the data you've transformed and stored. DuckDB is an in-process SQL OLAP Database Management System, perfect for analytical queries on large datasets, like the air quality data we're working with.
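If you want to confirm that the Parquet files are readable by DuckDB before wiring them into Superset, you can run the same kind of query locally. This is only a sketch, assuming you install the `duckdb` Python package (`pip install duckdb`) in the development container and run it from the project folder, so the relative `parquet/` path resolves:

```python
import duckdb

# Query the Parquet files directly, much as Superset will do through its DuckDB connection.
con = duckdb.connect()  # an in-memory database is enough for a quick check
result = con.execute(
    "SELECT count(*) AS n_rows, min(DateTime) AS first_ts, max(DateTime) AS last_ts "
    "FROM read_parquet('parquet/*.parquet')"
).df()
print(result)
```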
To start visualizing the air quality data, we first need to create a dataset within Apache Superset. This dataset will be the foundation for our explorations and visualizations. Follow the steps below to create a dataset from the Parquet files we prepared earlier:
- From the top navigation bar, select `SQL` and then `SQL Lab` from the dropdown menu. SQL Lab is a flexible tool within Superset that allows for executing SQL queries on your connected databases.
- Within SQL Lab, you'll see an option to select your database. Choose the database connection you set up previously (the one connected to `DuckDB`).
- In the SQL query editor that appears, you'll enter a SQL query to load your data. Type or copy the following SQL command into the editor:

```sql
SELECT * FROM read_parquet('/data/parquet/*.parquet') ORDER BY DateTime
```

- This SQL command tells Superset to read all Parquet files located in `/data/parquet/` and order the resulting records by the `DateTime` column. Using the `read_parquet` function allows us to directly query and work with the Parquet files as if they were tables in a database, making it extremely convenient for analysis.
- After entering the query, execute it by pressing the "Run selection" button. This action will load the data and display the results within SQL Lab. After confirming that the query works, save it as a dataset using the "Save" -> "Save dataset" button.
Creating Your First Plot: Boxplot
After preparing and connecting your dataset, you're now ready to dive into the visualization capabilities of Apache Superset. We will start by creating a boxplot, which is excellent for visualizing the distribution of your data across different categories, such as time periods.
- Navigate to the charts section by selecting `Charts` from the top navigation menu, then click on `+ Chart` to begin creating a new chart.
- From the 'Choose a dataset' dropdown, select the dataset you created in the previous steps.
- For the visualization type, scroll through the options under '#ECharts' and select `Boxplot`. Click `Create new chart`.
Configuring the Data Panel
The next step involves setting up your data for the boxplot:
- Distribute Across: Select `DateTime` as the dimension to distribute data points across the boxplot.
- Time Grain: Set this to `Day` to aggregate data points by each day.
- Dimensions:
  - Click on `+` next to 'Dimensions' to add custom SQL expressions.
  - First, add a custom SQL expression `month(DateTime)` and title it `month`. This will help with the correct order of the months.
  - Add another custom SQL expression `monthname(DateTime)` and title it `monthname` for displaying month names on the boxplot.
- Metrics: Choose `avg(TEMP)` as the metric to display the average temperature.
After configuring these options, click `Create chart` to generate your initial boxplot.
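If you want to cross-check what the chart aggregates, a rough `pandas` equivalent is sketched below; it assumes the Parquet files contain a `TEMP` column (as used in the Metrics step) and that you run it from the project folder:

```python
import glob
import pandas as pd

# Load all transformed files into one frame.
frames = [pd.read_parquet(fn) for fn in sorted(glob.glob('parquet/*.parquet'))]
data = pd.concat(frames, ignore_index=True)

# Day time grain with the avg(TEMP) metric: one average temperature per day...
daily = data.set_index('DateTime')['TEMP'].resample('D').mean()

# ...then the per-month distribution that the boxplot summarizes.
monthly = daily.groupby(daily.index.month).describe()
print(monthly[['min', '25%', '50%', '75%', 'max']])
```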
Customizing the Boxplot for Month Ordering
To address the challenge of custom ordering by month names, we can apply an SQL trick using a `CASE WHEN` statement. This approach adjusts the `monthname` dimension to enforce a custom order:
- Navigate back to the boxplot's settings by clicking on the 'Edit' button for your chart.
- Remove the `month` dimension since it was only needed for ordering.
- Modify the `monthname` dimension by replacing it with the following custom SQL expression:
```sql
CASE
    WHEN monthname(DateTime) = 'January'   THEN '           Jan'
    WHEN monthname(DateTime) = 'February'  THEN '          Feb'
    WHEN monthname(DateTime) = 'March'     THEN '         Mar'
    WHEN monthname(DateTime) = 'April'     THEN '        Apr'
    WHEN monthname(DateTime) = 'May'       THEN '       May'
    WHEN monthname(DateTime) = 'June'      THEN '      Jun'
    WHEN monthname(DateTime) = 'July'      THEN '     Jul'
    WHEN monthname(DateTime) = 'August'    THEN '    Aug'
    WHEN monthname(DateTime) = 'September' THEN '   Sep'
    WHEN monthname(DateTime) = 'October'   THEN '  Oct'
    WHEN monthname(DateTime) = 'November'  THEN ' Nov'
    WHEN monthname(DateTime) = 'December'  THEN 'Dec'
    ELSE monthname(DateTime)
END
```
- After entering this custom SQL, click `Update chart` to apply the changes.
This adjustment enforces a custom ordering for the months in your boxplot: strings with more leading spaces sort first, so padding the abbreviations with a decreasing number of spaces (most before 'Jan', none before 'Dec') makes the default alphabetical sort match calendar order, ensuring that the data is displayed correctly.
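If the padding trick seems opaque, this small Python check illustrates why it works; the labels and space counts mirror the SQL expression above:

```python
# Strings with more leading spaces sort first, so padding January the most and
# December not at all makes alphabetical order equal calendar order.
labels = ['           Jan', '          Feb', '         Mar', '        Apr',
          '       May', '      Jun', '     Jul', '    Aug',
          '   Sep', '  Oct', ' Nov', 'Dec']
print(sorted(labels) == labels)  # True
```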
By following these steps, you have created a visually appealing boxplot that represents your data accurately and addresses specific formatting and ordering challenges. Such customizations enhance the interpretability and effectiveness of your data visualizations.
Using the same guidelines, create another chart for average monthly wind speed and a pivot table to display average dust particles in the air across months and weekdays. Remember to save all your charts!
Creating a Dashboard in Apache Superset
After visualizing your data through various charts, the next step is to organize these visualizations on a dashboard. Apache Superset provides an intuitive interface for creating and managing dashboards, aggregating multiple charts and enabling interactive data exploration.
Step 1: Creating a New Dashboard
- Navigate to the `Dashboard` tab located in the top menu.
- Click on `+ Dashboard` to initiate the creation of a new dashboard.
- You'll be taken to a drag-and-drop interface, where you can start assembling your dashboard.
Step 2: Adding Charts to the Dashboard
- On the right-side panel, you’ll see a list of charts you've created. Drag and drop the charts you wish to include onto the dashboard area.
- Charts already being used in any dashboard will appear greyed out. However, if desired, you can still select and add them to your current dashboard.
Step 3: Saving Your Dashboard
- Once you’ve arranged the charts to your liking, don’t forget to save your dashboard. Look for the "Save Dashboard" button at the interface's top right.
- You should customize your dashboard by adding titles, dividing lines, and text blocks for additional context or instructions.
Step 4: Adding Filters
- Apache Superset allows you to add filters to your dashboard, enhancing its interactivity and usability. Filters enable dynamic data selection, allowing dashboard viewers to focus on specific aspects or time ranges of the data.
- To add a filter, look for the "Filters" option or a symbol resembling a funnel. You can specify which data dimensions the filter should apply to, such as dates, categories, or any custom SQL conditions.
Interactive Features: Pivot Table Exploration
- If a pivot table is included in your dashboard, it offers an interactive feature allowing you to click on any of the categories displayed.
- Clicking on a category within the pivot table can dynamically filter or highlight associated data across the dashboard.
Conclusion
By carefully curating charts and employing interactive features like filters and pivot tables, you create a dynamic and informative dashboard. This not only aids in data analysis but also in storytelling with data, providing a comprehensive view of the data at hand.
Remember, the effectiveness of a dashboard lies in its design and the relevance of the information it presents. Regularly review and update your dashboards to ensure they remain useful and engaging for their audience.