Distributed Data Processing on the Cloud (LTAT.06.005), 2018/19 fall


Practice 7 - Higher level languages: Apache Pig

References

  • Overview of Apache Pig Latin - http://pig.apache.org/docs/r0.17.0/basic.html
  • Lecture slides - https://courses.cs.ut.ee/2018/DDPC/fall/uploads/Main/L7_2018.pdf
  • Apache Pig Tutorial - https://cwiki.apache.org/confluence/display/PIG/PigTutorial
  • Apache Pig version 0.17.0 API - https://pig.apache.org/docs/r0.17.0/api/

Preparation

  • Download the pre-compiled Pig Tutorial example: pigtutorial.tar.gz
    • It contains a pre-compiled version of Pig that simplifies running Pig scripts on your local computer.
    • Unpack the pigtutorial.tar.gz archive into a folder of your choice on your computer and use this folder as the working directory for the following exercises.
  • Download the Pig distribution pig-0.17.0.tar.gz from https://www-us.apache.org/dist/pig/pig-0.17.0/ and unpack it.
    • Copy the following two files from the unpacked archive into your working directory:
      • pig-0.17.0-core-h2.jar
      • lib/piggybank.jar
  • NB! Each student will get a new Cloudera cluster account for working with Apache Pig.
    • You can access your new Cloudera credentials from here if you are logged in.
    • Contact lab assistant if there are any issues with your credentials or access.

Exercise 7.1. Creating and running Pig scripts locally on your computer

  • Create a new WordCount example Pig script named myscript.pig inside the working directory.
  • Its content should be:
    • A = LOAD 'input';
      B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
      C = GROUP B BY word;
      D = FOREACH C GENERATE COUNT(B), group;
      STORE D INTO 'output';
  • Create an input folder inside the working directory and move some text files into it. You can reuse the Gutenberg books from previous labs.
  • Open a command-line program (cmd on Windows or a terminal on Linux), navigate to the working directory, and execute the script using the following Java command:
    • java -cp pig.jar org.apache.pig.Main -x local myscript.pig
    • If the java command does not work, make sure that the Java 8 SDK is properly installed, the JAVA_HOME environment variable is set correctly, and that the Java bin folder has been added to the system PATH variable.
  • Check the results in the output folder.

Exercise 7.2. Creating and running Pig scripts in the cluster

  • Log into our Hadoop cluster: http://172.17.77.39:8889/
  • Open the page Query Editors -> Pig
  • Create the following Pig script:
    • A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
      (last_name,first_name,balance,address,city,last_transaction,bank_name);
      DESCRIBE A;
      B = FOREACH A GENERATE bank_name, balance;
      DESCRIBE B;
      STORE B INTO 'processed_bank_accounts' USING PigStorage(',');
      
    • In this script example, the HDFS input path is Unclaimed_bank_accounts.csv and the output folder path is processed_bank_accounts.
    • Change these paths as needed and make sure the Unclaimed_bank_accounts.csv file exists in the correct HDFS location in the cluster.
    • Make sure the output folder does not exist before running the script!
  • Try to find the DESCRIBE statement results in the log file (search for a string such as "last_transaction").
  • Always use the DESCRIBE statement when trying to debug error messages involving relation and field names.
  • Investigate the script output folder processed_bank_accounts located in the cluster HDFS.
  • Assign types to each of the fields in the A relation.
    • double for balance, int for integers, and chararray for strings are sufficient for now.
  • Save the Pig script in your computer so you can use it as a base for the following exercises.
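A typed schema can be declared directly in the LOAD statement. The column types below are assumptions about the data; adjust them if a field does not match:

```
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
    (last_name:chararray, first_name:chararray, balance:double,
     address:chararray, city:chararray, last_transaction:chararray,
     bank_name:chararray);
DESCRIBE A; -- should now print the typed schema
```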

Exercise 7.3. Statistics with Apache Pig

  • You can continue running the scripts on your local computer to speed up prototyping and debugging.
  • Let's now use Apache Pig to solve the first two tasks of Exercise 3.2 from Practice 3 - Processing data with MapReduce.
  • Modify the Pig script (from 7.2) to:
    1. Group all data by bank name and calculate the sum of bank balances inside each group.
      • Output should contain both the bank name and the calculated sum.
    2. Find an example of GROUP BY either in the lecture slides or in the Pig Latin manual linked under References.
    3. Now, in addition to the sum, also calculate the minimum, maximum, and average.
  • Save the script as the first deliverable.
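As a sketch of the general grouping pattern (relation and field names follow the 7.2 script; treat this as a starting point, not the full solution):

```
B = GROUP A BY bank_name;
-- inside FOREACH, the aggregate runs over the bag A within each group,
-- which is why the argument is written A.balance
C = FOREACH B GENERATE group AS bank_name, SUM(A.balance) AS total_balance;
STORE C INTO 'bank_sums' USING PigStorage(',');
```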

Using PiggyBank UDFs to improve CSV file loading

  • You will run into issues with additional commas inside field values in the input file rows. You can fix this by using a special CSV loader available in the PiggyBank UDF collection.
    • Replace USING PigStorage (',') with USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    • This should work without issues in the Cloudera cluster.
    • However, when running the Pig script on your local computer, you need to register the piggybank.jar file (which we previously extracted from pig-0.17.0.tar.gz) to make CSVExcelStorage available inside your Pig script.
    • Add the following line at the start of your Pig script to register it:
      • REGISTER piggybank.jar;
      • This REGISTER statement assumes the jar file is accessible from the working directory. Use a full or relative path if the jar is located elsewhere.
    • Unfortunately, the tutorial pig.jar we are using is missing some libraries that are included in the full Pig core and are needed by PiggyBank. You can make these libraries available by also registering pig-0.17.0-core-h2.jar, which we previously extracted from pig-0.17.0.tar.gz. Add the following line at the start of your Pig script to register it:
      • REGISTER pig-0.17.0-core-h2.jar;
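Putting the pieces together, the start of a locally runnable script could look like this (the jar paths assume both jars sit in the working directory, and the column types are assumptions about the data):

```
REGISTER pig-0.17.0-core-h2.jar;
REGISTER piggybank.jar;
-- CSVExcelStorage handles commas inside quoted field values
A = LOAD 'Unclaimed_bank_accounts.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS
    (last_name:chararray, first_name:chararray, balance:double,
     address:chararray, city:chararray, last_transaction:chararray,
     bank_name:chararray);
```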

Exercise 7.4. Nested queries in Pig

  • Now, let's use Apache Pig to solve the last two tasks of Exercise 3.2 and the Exercise 3.3 task from Practice 3 - Processing data with MapReduce:
    1. For each City, generate top 3 highest and top 3 lowest balances.
      • You can execute multiple statements inside the foreach in the following way:
        • B = GROUP A BY city;
          C = FOREACH B {
              FIRST_RESULT = ... ;
              SECOND_RESULT = ... ;
              GENERATE group, FIRST_RESULT, SECOND_RESULT;
          }
        • You can use many of the Pig statements inside the enclosing block (FILTER, ORDER, LIMIT, DISTINCT, ...)
        • This allows you to use multiple sequential statements to compute both highest and lowest values at the same time.
    2. Perform the previous analysis for each unique City and Bank Name
    3. Now, analyse only entries from a specific last transaction year.
      • You can use the FILTER statement.
      • Consider: where is the best location to put the FILTER statement?
      • You can split a chararray (string) field by using STRSPLIT(string, delimiter, split_limit)
      • You can get rid of nesting in the result using FLATTEN(...)
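A nested FOREACH for the top-3 task could be sketched like this (alias names are illustrative; the ORDER direction determines highest vs lowest):

```
B = GROUP A BY city;
C = FOREACH B {
    sorted  = ORDER A BY balance DESC; -- use ASC for the lowest balances
    highest = LIMIT sorted 3;
    GENERATE group AS city, highest;
}
```

For the year filter, assuming last_transaction is a string in a month/day/year format, STRSPLIT(last_transaction, '/', 3) produces a tuple whose last element is the year, and FLATTEN turns that tuple into separate fields that FILTER can compare against.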
  • Save the script as the second deliverable
  • Execute the created script in the Cloudera cluster (http://172.17.77.39:8889) and take screenshots of the result as the third deliverable.

Bonus Exercise (Additional credit points)

Preparation

  • Bonus tasks will not work properly with the Pig Tutorial pig.jar we used in the lab.
  • You can either use the cluster web interface or set up the full Pig distribution on your computer.
  • Setting up the full Pig distribution:
    • Download pig-0.17.0.tar.gz from https://www-eu.apache.org/dist/pig/ and unpack it.
    • Configure the PIG_HOME environment variable. It should point to the full path of the unpacked pig-0.17.0 folder, which contains the bin folder.
    • Add the full path of the pig-0.17.0/bin folder to your user PATH environment variable so that the pig command is available from the command line.
    • Modify pig.cmd (on Windows) and replace the following line:
      • set HADOOP_BIN_PATH=\bin
    • with:
      • set HADOOP_BIN_PATH=\libexec
    • Configure the HADOOP_HOME environment variable. On Windows, its bin subdirectory should contain the winutils.exe file.
    • Verify that the JAVA_HOME environment variable exists and, on Windows, that its path does not contain any spaces.
      • You can replace common Windows folder names in paths with their 8.3 short forms (e.g. PROGRA~1 for "Program Files").
      • You can use dir /x folder_path to find the short names of folders.
    • You should then be able to execute Pig scripts from the command line using: pig -x local myscriptname.pig

Bonus tasks

  1. Solve Exercise 7.3 using the CUBE operation instead of the GROUP BY statement.
  2. Implement TF-IDF in Pig
    • It should not require more than 5 FOREACH ... GENERATE and 3 GROUP BY statements, but it is OK if you use more.
    • There is a simple way to configure PigStorage() to add the file name of the input data as an additional column in the resulting relation. Investigate the PigStorage API page.
  3. Run the TF-IDF Pig script on /tmp/books_medium or /tmp/books_large datasets.
    • How much slower or faster did it run in comparison to your Practice 4 - MapReduce in Information Retrieval solution?
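If you get stuck on the input-file-name column in task 2, the PigStorage API documentation describes a constructor option for tagging each record with its source file. A sketch under that assumption (verify the option name against the Pig 0.17 API docs):

```
-- '-tagFile' prepends the source file name as the first column
books = LOAD '/tmp/books_medium' USING PigStorage('\t', '-tagFile')
        AS (filename:chararray, line:chararray);
```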

Deliverables of the bonus task:

  1. 7.3 with Cube operation Pig script
  2. TF-IDF Pig script
  3. Screenshots from the cluster web interface which show that you have successfully finished executing TF-IDF on /tmp/books_medium or /tmp/books_large datasets.

Deliverables

  1. Pig scripts (7.3, 7.4) you created in the lab.
  2. Screenshots from the cluster web interface which show that you have successfully finished executing your Exercise 7.4 Pig script in the cluster.

Solutions to common issues and errors

  • SUM, MIN, MAX and AVG operations inside Pig statements have to be written in capital letters.
  • Apache Pig is not available for you in the cluster job editor
    • Apache Pig job editor is not accessible from the labuser account.
    • Each student will get a new Cloudera cluster account for working with Apache Pig. You can access your new Cloudera credentials from here if you are logged in. Contact lab assistant if there are any issues with your credentials or access.
  • If java command does not work, make sure that Java 8 sdk is properly installed, JAVA_HOME system variable is set properly and that java bin folder has been added to the system PATH variable.