HW2: ETL Process for Air Quality Data
In this homework assignment, you will build upon the ETL process that we started in the classroom using air quality data from http://airviro.klab.ee/. The goal is to extract, transform, and load the data into a structured format that allows for easy analysis.
You can use the existing repository at https://github.com/adlerpriit/ETL, which contains steps to download and clean the hourly data, as inspiration and reference. Your task is to process this data further and create tables of daily and monthly average values for all columns in the dataset.
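The daily and monthly aggregation step can be sketched with pandas as follows. This is a minimal, hedged example: the column names (SO2, NO2) and values are placeholders, not the actual columns from airviro.klab.ee, and your real code would read the cleaned hourly table produced by the earlier ETL steps instead of constructing one inline.

```python
import pandas as pd

# Hypothetical hourly air-quality measurements indexed by timestamp.
# Replace this with the cleaned hourly table from the ETL pipeline;
# the columns shown here are illustrative only.
hourly = pd.DataFrame(
    {"SO2": [1.2, 1.4, 1.1, 1.3], "NO2": [10.0, 12.0, 11.0, 9.0]},
    index=pd.date_range("2023-01-01", periods=4, freq="h"),
)

# Daily and monthly means across all numeric columns.
daily = hourly.resample("D").mean()
monthly = hourly.resample("MS").mean()
```

Because `resample(...).mean()` operates on every numeric column at once, the same two lines cover "all columns in the dataset" regardless of how many pollutants the cleaned table contains.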
While completing this assignment, pay close attention to the following aspects:
- Data management principles: Ensure data integrity, accuracy, and consistency throughout the ETL process.
- Variable and column names: Use clear, descriptive, and consistent names for variables and columns in your code and output tables.
- File organization and hierarchy: Maintain a well-organized folder structure for your code, data, and documentation files.
- Documentation: Write a comprehensive README.md file that explains the steps and processes involved in the ETL process, including instructions on how to replicate the process.
- GitHub repository: Store and publish all your code in a GitHub repository. You may use a private repository if necessary, but make sure the instructor has access.
HW1: Data Source Exploration for Group Project
In this homework assignment, you will find a publicly available and reliable data source for a group project. After selecting a suitable data source, provide a concise overview of its key attributes, including:
- Dataset name and a brief description
- Type of data (e.g., tabular, time series, geospatial) and data types (e.g., numerical, categorical, text)
- Purpose of the data and its potential use in a group project
- Update frequency and historical data availability
- Data ownership, licensing, and attribution requirements
- Data size, scalability, and quality considerations
- Accessibility (e.g., direct download, API) and any API usage information
- Privacy, ethical concerns, and necessary steps to address them
- Preprocessing and cleaning tasks required before analysis
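Several of the attributes above (size, data types, missing values, duplicates) can be captured with a quick profiling pass over the candidate dataset. The sketch below uses a small invented DataFrame in place of the real file; in practice you would load your chosen source, e.g. with `pd.read_csv`, and record the resulting profile in your overview.

```python
import pandas as pd

# Hypothetical candidate dataset for a quick quality audit; the station
# and value columns are placeholders for whatever the real source contains.
df = pd.DataFrame(
    {
        "station": ["A", "A", "B", "B", "B"],
        "value": [1.0, None, 3.0, 3.0, 4.5],
    }
)

# Facts worth recording in your overview: row count, column types,
# missing values per column, and duplicate rows needing cleanup.
profile = {
    "rows": len(df),
    "dtypes": df.dtypes.astype(str).to_dict(),
    "missing_per_column": df.isna().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
}
```

Numbers like these make the "quality considerations" and "preprocessing tasks" bullets concrete rather than speculative.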