Processing data with SQL in Apache Spark
In this practice session we will take a look at how to process data using SQL. We will download the Yelp dataset and use the Apache Spark SQL interface to process it.
References
- Spark DataFrame/SQL programming guide: https://spark.apache.org/docs/2.3.2/sql-programming-guide.html
- Spark SQL Functions - https://spark.apache.org/docs/2.3.2/api/sql/index.html
- Spark Lecture slides: https://courses.cs.ut.ee/LTAT.06.005/2018_fall/uploads/Main/L9_2018.pdf
- Spark (Python) SQL examples: https://github.com/apache/spark/blob/master/examples/src/main/python/sql/
Dataset Description
The dataset that we will analyze using SQL is the Yelp Dataset. It contains Yelp businesses, reviews, and user data for use in personal, educational, and academic purposes. It is available for teaching about databases and NLP. The dataset consists of 6 JSON files: businesses, reviews, users, check-ins, tips, and photos.
NB! Using this dataset requires that you read and agree with the Dataset Challenge Dataset Agreement.
- Name: Yelp Dataset
- Location: https://www.yelp.com/dataset
- Documentation: https://www.yelp.com/dataset/documentation/main
- Alternative Download link for a smaller sample of the dataset for testing: https://owncloud.ut.ee/owncloud/index.php/s/HWqreZxGRA6Q45d (Ask for password from the lab supervisor)
Dataset tables are: Review, User, Business, Checkin, Tip, and Photo.
The tables are joined on shared key attributes such as user_id and business_id. As we are dealing with JSON files, the order of the attributes may differ from the documentation, and some attributes might in turn contain nested structures. Check the dataset documentation for the complete list of (nested) attributes.
Exercise 9.1. Preparing environment for working with SQL in Apache Spark
We will continue using the PyCharm Python IDE to simplify working with SQL in Spark. We will download and run an example Spark Python SQL script to check that everything is working. You can use this script as a template in the following exercises.
- Either create a new Python Spark project in PyCharm (just like in the previous lab) or reuse your previous project.
- Download the following Python Spark SQL example sql_example.py file and move it inside your PySpark project.
- You will find other (Python) Spark SQL examples in the examples\src\main\python\sql folder inside the Spark distribution you downloaded and unpacked in the previous practice session.
- Spark (Python) SQL examples are also available here: https://github.com/apache/spark/blob/master/examples/src/main/python/sql/
- Download the smaller sample of the dataset for development and testing in the lab from here: https://owncloud.ut.ee/owncloud/index.php/s/HWqreZxGRA6Q45d (Ask for password from the lab supervisor)
- Create a new folder for input files (Does not have to be inside your Spark Python project) and move the Yelp dataset JSON files there.
- Run the sql_example.py Python script to check that everything is working as needed.
- Its arguments should be: input_folder output_folder
- When you first try the example script, you should use the location of the Yelp small_review folder as the input_folder.
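For reference, a minimal sketch of such a script is shown below. This is only an assumption of what sql_example.py roughly looks like; the actual file may differ in its details.

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Usage: sql_example.py input_folder output_folder
    input_folder = sys.argv[1]
    output_folder = sys.argv[2]

    spark = SparkSession.builder.appName("SQL example").getOrCreate()

    # Read all JSON files from the input folder into a DataFrame
    reviews = spark.read.json(input_folder)
    reviews.printSchema()

    # Register the DataFrame as a table so it can be queried with SQL
    reviews.registerTempTable("review")
    result = spark.sql("SELECT user_id, stars FROM review LIMIT 10")
    result.show()

    # Write the result out as JSON files and stop the session
    result.write.json(output_folder)
    spark.stop()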
NB! You must solve the data processing tasks of the following exercises by using SQL statements; you should avoid using the DataFrame API and Spark RDD commands. However, you can still use the DataFrame read/write commands and other commands which provide information about the structure and content of the tables, such as:
df.printSchema()
df.show(nrOfRows)
df.registerTempTable("table_name")
spark.read
df.write
(In Spark 2.x, df.createOrReplaceTempView("table_name") is the non-deprecated equivalent of df.registerTempTable.)
Exercise 9.2. Processing data using Spark SQL
- Find the top 5 users who gave out the most 5-star reviews.
- Write a select statement that queries the review table
- Group the review table by user_id and review stars S (S = 1, 2, 3, 4, or 5) and compute the count of rows inside each group as starcount (starcount = the number of times the user gave S stars).
- This grouped table should look something like this:
+----------------------+-----+---------+
|user_id               |stars|starcount|
+----------------------+-----+---------+
|Zckxoi9_9btUFld8X10x-g|5    |1        |
|ZcVOYdRznoCpQx62Sl0mAA|2    |1        |
+----------------------+-----+---------+
- Filter the result to keep only rows where the stars value is equal to 5 and order by starcount in descending order (see the sketch below).
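One possible shape for this query, as a sketch (assuming the review table has been registered as review, as in the template above):

# Group by user and stars, then keep only the 5-star rows
grouped = spark.sql("""
    SELECT user_id, stars, COUNT(*) AS starcount
    FROM review
    GROUP BY user_id, stars
""")
grouped.registerTempTable("grouped")

spark.sql("""
    SELECT user_id, stars, starcount
    FROM grouped
    WHERE stars = 5
    ORDER BY starcount DESC
    LIMIT 5
""").show()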
- For each user and review stars value S (S = 1, 2, 3, 4, or 5), compute the percentage of the total S-star reviews that this person gave.
- You can reuse the intermediate (user_id, stars, starcount) table from the previous task, but without the ordering and filtering.
- Compute the total count of S-star reviews for each possible stars value (S = 1, 2, 3, 4, or 5) as a new column total.
- You can use the ... OVER (PARTITION BY ...) statement. Check the lecture slides for an example, or see the sketch after this task.
- Compute the percentage of the user's S-star reviews (S = 1, 2, 3, 4, or 5) out of the total number of S-star reviews in the dataset.
- Example result:
+----------------------+-----+---------+-----+-----------------+
|user_id               |stars|starcount|total|percent          |
+----------------------+-----+---------+-----+-----------------+
|ZcYNpemiHw2OOpQxkVbrVg|4    |10       |332  |3.01204819277108 |
+----------------------+-----+---------+-----+-----------------+
- Where in this example row:
- stars - 4-star reviews
- starcount - this user gave 10 4-star reviews
- total - there were a total of 332 4-star reviews in the review table
- percent - 10 out of a total of 332 is approximately 3%
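A sketch of the window-function approach (assuming the (user_id, stars, starcount) table from the previous task is registered as grouped):

# For every row, SUM(starcount) OVER (PARTITION BY stars) is the total
# number of reviews in the dataset with that row's stars value
spark.sql("""
    SELECT user_id, stars, starcount,
           SUM(starcount) OVER (PARTITION BY stars) AS total,
           100.0 * starcount / SUM(starcount) OVER (PARTITION BY stars) AS percent
    FROM grouped
""").show()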
- Find the average number of stars for businesses that offer takeout and have a picture of the menu available.
- You will need to join 3 tables: photo, business, and review.
- Takeout availability can be found in the nested attributes field of the business table. Check the full schema of the business table using
business.printSchema()
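A sketch of the three-table join. The field names attributes.RestaurantsTakeOut and photo.label (with the value 'menu'), and whether the takeout flag is stored as a boolean or a string, are assumptions based on the dataset documentation; verify them with business.printSchema() and photo.printSchema() for your dataset version:

# Join business, review and photo on business_id; keep businesses that
# offer takeout and have at least one photo labelled as a menu photo.
# Grouping by business keeps the AVG correct even though the join may
# duplicate review rows when a business has several menu photos.
spark.sql("""
    SELECT b.business_id, b.name, AVG(r.stars) AS avg_stars
    FROM business b
    JOIN review r ON r.business_id = b.business_id
    JOIN photo p ON p.business_id = b.business_id
    WHERE p.label = 'menu'
      AND b.attributes.RestaurantsTakeOut = 'True'
    GROUP BY b.business_id, b.name
""").show()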
Exercise 9.3. Creating SQL User Defined Functions in Python
We will create a Python function for computing the average length of words in a text field and call it as a User Defined Function (UDF) inside SQL statements to solve the following task:
- Find the top 5 users whose reviews have the highest average word length.
- Create a UDF which takes a block of text, divides the text into words, and computes the average word length.
- Output should look something like this (except that different users are in the top):
+----------------------+------------------+
|user_id               |avgW              |
+----------------------+------------------+
|ZcQr5rS6oL6CVoCA6IysSA|5.037685650661365 |
|ZcVagtiGZQxNR64r5dCzBQ|5.021671826625387 |
|ZcLKXikTHYOnYt5VYRO5sg|5.019230769230769 |
+----------------------+------------------+
NB! While it is possible to complete this task without creating your own UDF (similarly to the WordCount SQL example), you must solve it by creating your own Python function as a UDF to get full credit for this exercise.
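A sketch of one way to do this. The names avg_word_length and avgWordLen are illustrative, and averaging the per-review word lengths per user is one possible interpretation of the task:

from pyspark.sql.types import DoubleType

# Split the text into words and return the average word length
# (0.0 for empty or missing text)
def avg_word_length(text):
    words = text.split() if text else []
    if not words:
        return 0.0
    return sum(len(w) for w in words) / float(len(words))

# Register the Python function as a UDF callable from SQL statements
spark.udf.register("avgWordLen", avg_word_length, DoubleType())

spark.sql("""
    SELECT user_id, AVG(avgWordLen(text)) AS avgW
    FROM review
    GROUP BY user_id
    ORDER BY avgW DESC
    LIMIT 5
""").show()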
Bonus Exercise
NB! Because the cluster was down for most of the week, the bonus task is cancelled this week. Next week's bonus task will give double extra points and its deadline will be extended by one week.
Deliverables:
- SQL scripts (9.2.1, 9.2.2, 9.2.3, 9.3)
- Either as a single script that solves all individual tasks or as multiple separate scripts.
- Output of the SQL scripts (9.2.1, 9.2.2, 9.2.3, 9.3)
- The output should contain the resulting SQL table or DataFrame content.
- Output of the scripts should be provided either as:
- JSON output file(s)
- or the respective
df.show()
command's output in a text file.
Potential issues and solutions
- A "The system cannot find the path specified" error when executing your Spark application for the first time.
- Check that your SPARK_HOME environment variable value is correct
- You can also set the SPARK_HOME location programmatically inside your script in the following way:
import os
os.environ['SPARK_HOME'] = "/home/pelle/soft/spark-2.3.2"
- It is suggested to use Java 8 as the default Java on your computer.
- NB! Java 9 and 10 will not work without issues!
- Be careful with the code indentation in your Python script. Python is very strict when it comes to mixing tabs and spaces.
- Check indentation manual for Python: https://docs.python.org/2.0/ref/indentation.html
- A "Python in worker has different version 2.7 than that in driver 3.6" error.
- Check that the default Python version in your computer is the same version you use inside your PyCharm project.
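- If the versions still differ, one common workaround (a sketch; these lines must run before the SparkSession is created) is to point both the driver and the workers at the same interpreter:

import os
import sys

# Force the Spark workers to use the same Python interpreter as the driver
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable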