Processing data with SQL in Apache Spark
In this practice session we will take a look at how to process data using SQL. We will download the Yelp dataset and use the Apache Spark SQL interface to process it.
References
- Spark DataFrame/SQL programming guide: https://spark.apache.org/docs/2.3.2/sql-programming-guide.html
- Spark SQL Functions - https://spark.apache.org/docs/2.3.2/api/sql/index.html
- Spark Lecture slides: https://courses.cs.ut.ee/LTAT.06.005/2018_fall/uploads/Main/L9_2018.pdf
- Spark (Python) SQL examples: https://github.com/apache/spark/blob/master/examples/src/main/python/sql/
Dataset Description
The dataset that we will analyze using SQL is the Yelp Dataset. It contains Yelp businesses, reviews, and user data for use in personal, educational, and academic purposes. It is available for teaching about databases and NLP. The dataset consists of 6 JSON files: businesses, reviews, users, check-ins, tips, and photos.
NB! Using this dataset requires that you read and agree with the Dataset Challenge Dataset Agreement.
- Name: Yelp Dataset
- Location: https://www.yelp.com/dataset
- Documentation: https://www.yelp.com/dataset/documentation/main
- Alternative Download link for a smaller sample of the dataset for testing: https://owncloud.ut.ee/owncloud/index.php/s/HWqreZxGRA6Q45d (Ask for password from the lab supervisor)
Dataset tables are: Review, User, Business, Checkin, Tip, and Photo.
The tables are joined on shared key attributes such as user_id and business_id. As we are dealing with JSON files, the order of the attributes may differ from the documentation, and some attributes might in turn contain nested structures. Check the dataset documentation for the complete list of (nested) attributes.
Exercise 9.1. Preparing environment for working with SQL in Apache Spark
We will continue using the PyCharm Python IDE to simplify working with SQL in Spark. We will download and run an example Spark Python SQL script to check that everything is working. You can use this script as a template in the following exercises.
- Either create a new Python Spark project in PyCharm (just like in the previous lab) or reuse your previous project.
- Download the following Python Spark SQL example sql_example.py file and move it inside your PySpark project.
- You will find other (Python) Spark SQL examples in the examples\src\main\python\sql folder inside the Spark distribution you downloaded and unpacked in the previous practice session.
- Spark (Python) SQL examples are also available here: https://github.com/apache/spark/blob/master/examples/src/main/python/sql/
- Download the smaller sample of the dataset for development and testing in the lab from here: https://owncloud.ut.ee/owncloud/index.php/s/HWqreZxGRA6Q45d (Ask for password from the lab supervisor)
- Create a new folder for input files (Does not have to be inside your Spark Python project) and move the Yelp dataset JSON files there.
- Run the sql_example.py Python script to check that everything is working as needed.
- Its arguments should be: input_folder output_folder
- When you first try the example script, you should use the location of the Yelp small_review folder as the input_folder.
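For reference, a minimal sketch of such a script is shown below. This is only an assumption of what sql_example.py roughly looks like; the actual file may differ in its details.

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Usage: sql_example.py input_folder output_folder
    input_folder = sys.argv[1]
    output_folder = sys.argv[2]

    spark = SparkSession.builder.appName("SQL example").getOrCreate()

    # Read all JSON files from the input folder into a DataFrame
    reviews = spark.read.json(input_folder)
    reviews.printSchema()

    # Register the DataFrame as a table so it can be queried with SQL
    reviews.registerTempTable("review")
    result = spark.sql("SELECT user_id, stars FROM review LIMIT 10")
    result.show()

    # Write the result out as JSON files and stop the session
    result.write.json(output_folder)
    spark.stop()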
NB! You must solve the data processing tasks of the following exercises by using SQL statements; you should avoid using the DataFrame API and Spark RDD commands. However, you can still use the DataFrame read/write commands and other commands which provide information about the structure and content of the tables, such as:
df.printSchema()
df.show(nrOfRows)
df.registerTempTable("table_name")
spark.read
df.write
(In Spark 2.x, df.createOrReplaceTempView("table_name") is the non-deprecated equivalent of df.registerTempTable.)
Exercise 9.2. Processing data using Spark SQL
- Find the top 5 users who gave out the most 5-star reviews.
- Write a select statement that queries the review table
- Group the review table by user_id and review stars S (S = 1, 2, 3, 4, or 5) and compute the count of rows inside each group as starcount (starcount = the number of times the user gave S stars).
- This grouped table should look something like this:
+----------------------+-----+---------+
|user_id               |stars|starcount|
+----------------------+-----+---------+
|Zckxoi9_9btUFld8X10x-g|5    |1        |
|ZcVOYdRznoCpQx62Sl0mAA|2    |1        |
+----------------------+-----+---------+
- Filter the result to keep only rows where the stars value is equal to 5 and order by starcount in descending order (see the sketch below).
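One possible shape for this query, as a sketch (assuming the review table has been registered as review, as in the template above):

# Group by user and stars, then keep only the 5-star rows
grouped = spark.sql("""
    SELECT user_id, stars, COUNT(*) AS starcount
    FROM review
    GROUP BY user_id, stars
""")
grouped.registerTempTable("grouped")

spark.sql("""
    SELECT user_id, stars, starcount
    FROM grouped
    WHERE stars = 5
    ORDER BY starcount DESC
    LIMIT 5
""").show()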
- For each user and review stars value S (S = 1, 2, 3, 4, or 5), compute the percentage of the total S-star reviews that this person gave.
- You can reuse the intermediate (user_id, stars, starcount) table from the previous task, but without the ordering and filtering.
- Compute the total count of S-star reviews for each possible stars value (S = 1, 2, 3, 4, or 5) as a new column total.
- You can use the ... OVER (PARTITION BY ...) statement. Check the lecture slides for an example, or see the sketch after this task.
- Compute the percentage of the user's S-star reviews (S = 1, 2, 3, 4, or 5) out of the total number of S-star reviews in the dataset.
- Example result:
+----------------------+-----+---------+-----+-----------------+
|user_id               |stars|starcount|total|percent          |
+----------------------+-----+---------+-----+-----------------+
|ZcYNpemiHw2OOpQxkVbrVg|4    |10       |332  |3.01204819277108 |
+----------------------+-----+---------+-----+-----------------+
- Where in this example row:
- stars - 4-star reviews
- starcount - this user gave 10 4-star reviews
- total - there were a total of 332 4-star reviews in the review table
- percent - 10 out of a total of 332 is approximately 3%
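A sketch of the window-function approach (assuming the (user_id, stars, starcount) table from the previous task is registered as grouped):

# For every row, SUM(starcount) OVER (PARTITION BY stars) is the total
# number of reviews in the dataset with that row's stars value
spark.sql("""
    SELECT user_id, stars, starcount,
           SUM(starcount) OVER (PARTITION BY stars) AS total,
           100.0 * starcount / SUM(starcount) OVER (PARTITION BY stars) AS percent
    FROM grouped
""").show()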
- Find the average number of stars for businesses that offer takeout and have a picture of the menu available.
- You will need to join 3 tables: photo, business, and review.
- Takeout availability can be found in the nested attributes field of the business table. Check the full schema of the business table using
business.printSchema()
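A sketch of the three-table join. The field names attributes.RestaurantsTakeOut and photo.label (with the value 'menu'), and whether the takeout flag is stored as a boolean or a string, are assumptions based on the dataset documentation; verify them with business.printSchema() and photo.printSchema() for your dataset version:

# Join business, review and photo on business_id; keep businesses that
# offer takeout and have at least one photo labelled as a menu photo.
# Grouping by business keeps the AVG correct even though the join may
# duplicate review rows when a business has several menu photos.
spark.sql("""
    SELECT b.business_id, b.name, AVG(r.stars) AS avg_stars
    FROM business b
    JOIN review r ON r.business_id = b.business_id
    JOIN photo p ON p.business_id = b.business_id
    WHERE p.label = 'menu'
      AND b.attributes.RestaurantsTakeOut = 'True'
    GROUP BY b.business_id, b.name
""").show()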
Exercise 9.3. Creating SQL User Defined Functions in Python
We will create a Python function for computing the average length of words in a text field and call it as a User Defined Function (UDF) inside SQL statements to solve the following task:
- Find the top 5 users whose reviews have the highest average word length.
- Create a UDF which takes a block of text, divides the text into words, and computes the average word length.
- Output should look something like this (except that different users are in the top):
+----------------------+------------------+
|user_id               |avgW              |
+----------------------+------------------+
|ZcQr5rS6oL6CVoCA6IysSA|5.037685650661365 |
|ZcVagtiGZQxNR64r5dCzBQ|5.021671826625387 |
|ZcLKXikTHYOnYt5VYRO5sg|5.019230769230769 |
+----------------------+------------------+
NB! While it is possible to complete this task without creating your own UDF (similarly to the WordCount SQL example), you must solve it by creating your own Python function as a UDF to get full credit for this exercise.
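A sketch of one way to do this. The names avg_word_length and avgWordLen are illustrative, and averaging the per-review word lengths per user is one possible interpretation of the task:

from pyspark.sql.types import DoubleType

# Split the text into words and return the average word length
# (0.0 for empty or missing text)
def avg_word_length(text):
    words = text.split() if text else []
    if not words:
        return 0.0
    return sum(len(w) for w in words) / float(len(words))

# Register the Python function as a UDF callable from SQL statements
spark.udf.register("avgWordLen", avg_word_length, DoubleType())

spark.sql("""
    SELECT user_id, AVG(avgWordLen(text)) AS avgW
    FROM review
    GROUP BY user_id
    ORDER BY avgW DESC
    LIMIT 5
""").show()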
Bonus Exercise
NB! Because the cluster was down for most of the week, the bonus task is cancelled this week. Next week's bonus task will give double extra points and its deadline will be extended by one week.
Deliverables:
- SQL scripts (9.2.1, 9.2.2, 9.2.3, 9.3)
- Either as a single script that solves all individual tasks or as multiple separate scripts.
- Output of the SQL scripts (9.2.1, 9.2.2, 9.2.3, 9.3)
- The output should contain the resulting SQL table or DataFrame content.
- Output of the scripts should be provided either as:
- JSON output file(s)
- or the respective
df.show()
command's output in a text file.
Potential issues and solutions
- A "The system cannot find the path specified" error when executing your Spark application for the first time.
- Check that your SPARK_HOME environment variable value is correct
- You can also set the SPARK_HOME location programmatically inside your script in the following way:
import os
os.environ['SPARK_HOME'] = "/home/pelle/soft/spark-2.3.2"
- It is suggested to use Java 8 as the default Java on your computer.
- NB! Java 9 and 10 will not work without issues!
- Be careful with the code indentation in your Python script. Python is very strict when it comes to mixing tabs and spaces.
- Check indentation manual for Python: https://docs.python.org/2.0/ref/indentation.html
- A "Python in worker has different version 2.7 than that in driver 3.6" error.
- Check that the default Python version in your computer is the same version you use inside your PyCharm project.
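- If the versions still differ, one common workaround (a sketch; these lines must run before the SparkSession is created) is to point both the driver and the workers at the same interpreter:

import os
import sys

# Force the Spark workers to use the same Python interpreter as the driver
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable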