Homework
HW3 (due 29.05)
Apache Spark DataFrames and SQL with Yelp Dataset
Dive into Big Data analysis with Apache Spark DataFrames and SQL on the Yelp dataset. Process and manipulate data in parallel: load the Yelp tables as DataFrames, extract user statistics, analyze businesses, and build pivot tables with both the Spark DataFrame API and Spark SQL. Make sure you have a working Spark environment, and submit your Python scripts and their outputs as deliverables. BigDataLab
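A minimal PySpark sketch of the kind of workflow involved (the file paths and column names such as review_count, city, and stars are assumptions based on the public Yelp dataset, not requirements of the assignment):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("YelpHW3").getOrCreate()

# Load Yelp tables as DataFrames (paths are assumptions)
users = spark.read.json("data/yelp_academic_dataset_user.json")
business = spark.read.json("data/yelp_academic_dataset_business.json")

# DataFrame API: simple user statistics
users.agg(
    F.count("*").alias("n_users"),
    F.avg("review_count").alias("avg_reviews_per_user"),
).show()

# Spark SQL: top cities by number of businesses
business.createOrReplaceTempView("business")
spark.sql("""
    SELECT city, COUNT(*) AS n_businesses
    FROM business
    GROUP BY city
    ORDER BY n_businesses DESC
    LIMIT 10
""").show()

# Pivot table: number of businesses per city and star rating
business.groupBy("city").pivot("stars").count().show(5)
```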
HW2 (due 10.04)
ETL Process for Air Quality Data
Perform an ETL (Extract, Transform, Load) process on air quality data from http://airviro.klab.ee/ and create tables with hourly, daily, and monthly average values for all columns in the dataset. Adhere to data management principles, maintain an organized file structure, and document the process in a README.md file. Publish the code on GitHub (private repositories are allowed). Further instructions
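A minimal sketch of the transform step in pandas, assuming the raw export is a CSV with a timestamp column named "time"; the actual column names and export format on airviro.klab.ee may differ:

```python
import pandas as pd

# Extract: read the raw export (file name and column name are assumptions)
raw = pd.read_csv("data/raw/airviro_export.csv", parse_dates=["time"])

# Transform: index by timestamp and keep the numeric measurement columns
df = raw.set_index("time").sort_index()
numeric = df.select_dtypes("number")

# Hourly, daily, and monthly averages over all columns
hourly = numeric.resample("h").mean()
daily = numeric.resample("D").mean()
monthly = numeric.resample("MS").mean()

# Load: write the aggregated tables to an organized output folder
for name, table in {"hourly": hourly, "daily": daily, "monthly": monthly}.items():
    table.to_csv(f"data/processed/{name}_averages.csv")
```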
HW1 (due 27.03)
Data Source Exploration for Group Project
Identify a suitable data source for a group project and briefly describe its key attributes, such as data type, purpose, update frequency, ownership, and other relevant aspects. Further instructions