Group Project: Implementing an ETL Process and Data Visualization
Objective: In groups of 2-4 students, implement an ETL process that extracts data from one or more sources, transforms it, and loads it into an SQL database or a flat-file database. Then create a compelling data visualization using Tableau, Apache Superset, R ggplot (RMarkdown), or Python plotnine.
Description:
Form groups of 2-4 students and choose one or more data sources from Homework 1.
Extract and transform the source data:
- Pull in the selected data source(s) and preprocess the data as needed.
- Perform data transformation and cleaning tasks to ensure data consistency and accuracy.
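As a rough illustration of the extract and transform steps, the sketch below parses a small CSV extract and cleans it using only the Python standard library. The sample data, the column names (`city`, `date`, `temp_c`), and the functions `extract` and `transform` are hypothetical and not part of the assignment; your own sources from Homework 1 will need different cleaning rules.

```python
import csv
import io

# Hypothetical raw extract; in a real pipeline this would come from a file or an API.
RAW_CSV = """city,date,temp_c
 Berlin ,2024-05-01,18.5
berlin,2024-05-01,18.5
Munich,2024-05-01,
Hamburg,2024-05-02,15.0
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize city names, drop rows without a temperature, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()  # consistent spelling and casing
        temp = row["temp_c"].strip()
        if not temp:
            continue  # skip rows with a missing measurement
        key = (city, row["date"])
        if key in seen:
            continue  # skip duplicate (city, date) pairs
        seen.add(key)
        cleaned.append({"city": city, "date": row["date"], "temp_c": float(temp)})
    return cleaned


clean_rows = transform(extract(RAW_CSV))
```

Keeping extraction and transformation in separate functions makes each step easier to test and to document in the README.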
Load the transformed data:
- Load the cleaned and transformed data into an SQL database or a flat-file database.
- Ensure proper indexing and organization of the data for efficient querying and analysis.
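One possible way to approach the load step is sketched below with SQLite through Python's built-in `sqlite3` module. The table name `readings`, its columns, the index name `idx_readings_date`, and the sample rows are assumptions made for illustration; any SQL database that fits your data is acceptable.

```python
import sqlite3

# Hypothetical cleaned rows, standing in for the output of a transform step.
rows = [
    ("Berlin", "2024-05-01", 18.5),
    ("Hamburg", "2024-05-02", 15.0),
]


def load(conn, records):
    """Create the target table if needed, insert the records, and add an index."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS readings (
               city   TEXT NOT NULL,
               date   TEXT NOT NULL,
               temp_c REAL NOT NULL,
               PRIMARY KEY (city, date)
           )"""
    )
    # INSERT OR REPLACE overwrites rows with the same primary key,
    # so rerunning the pipeline does not create duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO readings (city, date, temp_c) VALUES (?, ?, ?)",
        records,
    )
    # Index the column that queries are expected to filter on.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_readings_date ON readings (date)")
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path instead for a persistent database
load(conn, rows)
load(conn, rows)  # a second run leaves the row count unchanged
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The composite primary key together with `INSERT OR REPLACE` keeps the load step idempotent, which simplifies the "how to run the pipeline" instructions.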
Create a data visualization:
- Using Apache Superset, Tableau, R ggplot (RMarkdown), or Python plotnine, create a compelling visualization that highlights key insights and tells a story with the data.
- Ensure that the visualization is clear, concise, and visually appealing.
Focus on data management and data engineering aspects:
- Ensure that your project follows data management and data engineering best practices.
- Use a shared git repository for version control and collaboration among group members.
- Maintain an up-to-date README that documents the project progress and provides a clear overview of the project structure.
- Provide detailed documentation on how to use the pipeline from start to finish, including any prerequisites, installation steps, and instructions for running the ETL process and generating the visualization.
Final submission:
- Submit the link to your shared git repository containing the ETL pipeline, visualization code, README, and documentation.
- Include a brief report (around 500-700 words) describing the project's objectives, data sources, ETL process, visualization, and any challenges encountered and how they were addressed.
This group project aims to provide hands-on experience in implementing an ETL process and creating a data visualization while emphasizing the importance of data management and data engineering principles. Collaborate effectively within your group and ensure clear communication to complete the project successfully. Good luck!
Project due: 23.05.2024