Practice 7 - Higher level languages: Apache Pig
References
- Overview of Apache Pig Latin - http://pig.apache.org/docs/r0.17.0/basic.html
- Lecture slides - https://courses.cs.ut.ee/2018/DDPC/fall/uploads/Main/L7_2018.pdf
- Apache Pig Tutorial - https://cwiki.apache.org/confluence/display/PIG/PigTutorial
- Apache Pig version 0.17.0 API - https://pig.apache.org/docs/r0.17.0/api/
Preparation
- Download the pre-compiled Pig Tutorial example: pigtutorial.tar.gz
- It contains a compiled version of Pig that simplifies running Pig scripts on your local computer.
- Unpack the `pigtutorial.tar.gz` archive into a freely chosen folder on your computer and use this folder as the working directory in the following exercises.
- Download the Pig distribution `pig-0.17.0.tar.gz` from https://www-us.apache.org/dist/pig/pig-0.17.0/ and unpack it.
- Copy the following two files from the unpacked archive into your working directory:
  - `pig-0.17.0-core-h2.jar`
  - `lib/piggybank.jar`
- NB! Each student will get a new Cloudera cluster account for working with Apache Pig.
- You can access your new Cloudera credentials from here if you are logged in.
- Contact the lab assistant if there are any issues with your credentials or access.
Exercise 7.1. Creating and running Pig Scripts locally on your computer
- Create a new WordCount example Pig script named `myscript.pig` inside the working directory. Its content should be:

```pig
A = load 'input';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into 'output';
```

- Create an `input` folder inside the working directory and move some text files into it. You can reuse the Gutenberg books from the previous labs.
- Open a command-line program (cmd in Windows or a terminal in Linux), navigate into the working directory and execute the script using the following Java command:

```shell
java -cp pig.jar org.apache.pig.Main -x local myscript.pig
```

- If the java command does not work, make sure that the Java 8 SDK is properly installed, the `JAVA_HOME` system variable is set correctly and that the Java bin folder has been added to the system `PATH` variable.
- Check the results in the `output` folder.
Exercise 7.2. Creating and running Pig Scripts in the Cluster
- Log into our Hadoop cluster: http://172.17.77.39:8889/
- Open the page Query Editors -> Pig
- Create the following Pig script:

```pig
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',')
    AS (last_name, first_name, balance, address, city, last_transaction, bank_name);
DESCRIBE A;
B = foreach A GENERATE bank_name, balance;
DESCRIBE B;
STORE B INTO 'processed_bank_accounts' USING PigStorage(',');
```

- In this script example, the HDFS input path is `Unclaimed_bank_accounts.csv` and the output folder path is `processed_bank_accounts`.
- Change these paths as needed and make sure the `Unclaimed_bank_accounts.csv` file exists in the correct HDFS location in the cluster.
- Make sure the output folder does not exist!
- Try to find the `DESCRIBE` statement result in the log file (search for the string "last_transaction", for example).
- Always use the `DESCRIBE` statement when trying to debug error messages involving relation and field names.
- Investigate the script output folder `processed_bank_accounts` located in the cluster HDFS.
- Assign types to each of the fields in the A relation: `double` for balance, `int` for integers and `chararray` for strings would be sufficient for now.
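As a sketch, the typed schema for relation A could look like the following (the types follow the suggestion above; treating `last_transaction` as a plain chararray for now is an assumption):

```pig
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',')
    AS (last_name:chararray, first_name:chararray, balance:double,
        address:chararray, city:chararray, last_transaction:chararray,
        bank_name:chararray);
-- DESCRIBE should now show the declared types instead of bytearray:
DESCRIBE A;
```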
- Save the Pig script in your computer so you can use it as a base for the following exercises.
Exercise 7.3. Statistics with Apache Pig
- You can continue running the scripts in your local computer to speed up prototyping and debugging.
- Let's now use Apache Pig to solve the first two tasks of exercise 3.2 from Practice 3 - Processing data with MapReduce.
- Modify the Pig script (from 7.2) to:
  - Group all data by bank name and calculate the sum of bank balances inside each group.
  - The output should contain both the bank name and the calculated sum.
  - Find an example of Group By either in the lecture slides or in the Pig Latin Manual linked under References.
  - Now, in addition to the sum, also calculate the minimum, maximum and average.
- Save the script as the first deliverable.
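A minimal sketch of the grouping step, assuming the relation A from 7.2 with `balance` typed as a double (the output path `bank_stats` is just an illustrative name):

```pig
G = GROUP A BY bank_name;
-- SUM/MIN/MAX/AVG require a numeric field, hence the typed balance field:
S = FOREACH G GENERATE group AS bank_name,
        SUM(A.balance) AS total,
        MIN(A.balance) AS lowest,
        MAX(A.balance) AS highest,
        AVG(A.balance) AS average;
STORE S INTO 'bank_stats' USING PigStorage(',');
```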
Using PiggyBank UDFs to improve CSV file loading
- You will run into issues with the additional commas in the input file rows. We can fix this by using a special CSV loader available in the PiggyBank UDF collection.
- Replace `USING PigStorage(',')` with `USING org.apache.pig.piggybank.storage.CSVExcelStorage()`.
- This should work without issues in the Cloudera cluster.
- However, when running the Pig script on your local computer, you need to register the PiggyBank `piggybank.jar` file, which we previously extracted from `pig-0.17.0.tar.gz`, to make CSVExcelStorage available from inside your Pig script. Add the following line at the start of your Pig script to register it:

```pig
REGISTER piggybank.jar;
```

- This REGISTER statement assumes the jar file is accessible from the working directory. Add a full or relative path to the jar if it is located somewhere else.
- Unfortunately, the tutorial pig.jar we are using is missing some libraries that are otherwise included in the Pig core and are needed by PiggyBank. We can make these libraries available by also registering the `pig-0.17.0-core-h2.jar`, which we previously extracted from `pig-0.17.0.tar.gz`. Add the following line at the start of your Pig script to register it:

```pig
REGISTER pig-0.17.0-core-h2.jar;
```
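Putting the pieces together, the start of a locally runnable script might look like this (a sketch assuming both jars sit in the working directory and the typed schema from 7.2):

```pig
REGISTER pig-0.17.0-core-h2.jar;
REGISTER piggybank.jar;

-- CSVExcelStorage handles quoted fields that contain commas:
A = LOAD 'Unclaimed_bank_accounts.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    AS (last_name:chararray, first_name:chararray, balance:double,
        address:chararray, city:chararray, last_transaction:chararray,
        bank_name:chararray);
```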
Exercise 7.4. Nested queries in Pig
- Now, let's use Apache Pig to solve the last two tasks in exercise 3.2 and also the exercise 3.3 task from Practice 3 - Processing data with MapReduce:
- For each City, generate top 3 highest and top 3 lowest balances.
  - You can execute multiple statements inside the `foreach` in the following way:

```pig
B = GROUP A BY city;
C = foreach B {
    FIRST_RESULT = ... ;
    SECOND_RESULT = ... ;
    GENERATE group, FIRST_RESULT, SECOND_RESULT;
}
```

  - You can use many of the Pig statements inside the enclosing block (FILTER, ORDER, SPLIT, DISTINCT, ...).
  - This allows you to use multiple sequential statements to compute both the highest and the lowest values at the same time.
- Perform the previous analysis for each unique City and Bank Name.
- Now, analyse only entries from a specific last transaction year.
  - You can use the `FILTER` statement.
  - Consider: where is the best location to put the `FILTER` statement?
  - You can split a chararray (string) field by using `STRSPLIT(string, delimiter, split_limit)`.
  - You can get rid of nesting in the result by using `FLATTEN(...)`.
- Save the script as the second deliverable
- Execute the created script in the Cloudera cluster (http://172.17.77.39:8889) and take screenshots of the result as the third deliverable.
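As a hedged sketch of the hints above (the `/` delimiter, the date field order and the year value are assumptions about the data, so adjust them to the actual format):

```pig
-- Extract the last-transaction year with STRSPLIT and FLATTEN:
D = FOREACH A GENERATE city, bank_name, balance,
        FLATTEN(STRSPLIT(last_transaction, '/', 3))
            AS (m:chararray, dd:chararray, y:chararray);
E = FILTER D BY y == '2015';

-- Top 3 highest and lowest balances per city, using a nested foreach:
B = GROUP E BY city;
C = FOREACH B {
    sorted_desc = ORDER E BY balance DESC;
    top3_highest = LIMIT sorted_desc 3;
    sorted_asc = ORDER E BY balance ASC;
    top3_lowest = LIMIT sorted_asc 3;
    GENERATE group, top3_highest, top3_lowest;
}
```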
Bonus Exercise (Additional credit points)
Preparation
- Bonus tasks will not work properly with the Pig Tutorial pig.jar we used in the lab
- You can either use the cluster web interface or set up full Pig distribution in your computer.
- Setting up the full Pig distribution:
  - Download pig-0.17.0.tar.gz from https://www-eu.apache.org/dist/pig/ and unpack it.
  - Configure the `PIG_HOME` environment variable. It should point to the full path of the `pig-0.17.0` folder that you unpacked and which contains the `bin` folder.
  - Add the full path of the `pig-0.17.0/bin` folder to your user PATH environment variable so that the pig command is available from the command line.
  - Modify pig.cmd (in Windows) and replace the line `set HADOOP_BIN_PATH=\bin` with `set HADOOP_BIN_PATH=\libexec`.
  - Configure the `HADOOP_HOME` environment variable. It should include the winutils.exe file in the bin subdirectory if you are using Windows.
  - Verify that the `JAVA_HOME` environment variable exists and that its path does not contain any spaces if you are using Windows.
    - You can replace some common Windows folder names in paths, such as "Program Files", with their 8.3 short forms (`PROGRA~1` for "Program Files").
    - You can use `dir /x folder_path` to find the short names of folders.
- You should then be able to execute Pig scripts from the command line using:

```shell
pig -x local myscriptname.pig
```
Bonus tasks
- Solve Exercise 7.3 using the CUBE operation instead of the GROUP BY statement.
- Implement TF-IDF in Pig.
  - It should not require more than 5 foreach generate and 3 group by statements, but it is OK if you use more.
  - There is a simple way to configure PigStorage() to assign the file names of the input data as an additional column in the resulting relation. Investigate the PigStorage API page.
  - Run the TF-IDF Pig script on the `/tmp/books_medium` or `/tmp/books_large` datasets.
  - How much slower or faster did it run in comparison to your Practice 4 - MapReduce in Information Retrieval solution?
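For the CUBE task, a minimal sketch of the operator's shape (relation A and its fields are assumed from 7.2; check the Pig Latin manual for the exact CUBE semantics before relying on this):

```pig
-- CUBE produces aggregates for all combinations of the listed dimensions;
-- the grouped records land in a bag named 'cube':
cubed = CUBE A BY CUBE(bank_name, city);
result = FOREACH cubed GENERATE FLATTEN(group) AS (bank_name, city),
        SUM(cube.balance) AS total;
```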
Deliverables of the bonus task:
- The 7.3 Pig script using the CUBE operation
- TF-IDF Pig script
- Screenshots from the cluster web interface which show that you have successfully finished executing TF-IDF on the `/tmp/books_medium` or `/tmp/books_large` datasets.
Deliverables
- Pig scripts (7.3, 7.4) you created in the lab.
- Screenshots from the cluster web interface which show that you have successfully finished executing your Exercise 7.4 Pig script in the cluster.
Solutions to common issues and errors
- The SUM, MIN, MAX and AVG operations inside Pig statements have to be written in capital letters.
- Apache Pig is not available for you in the cluster job editor:
  - The Apache Pig job editor is not accessible from the labuser account.
  - Each student will get a new Cloudera cluster account for working with Apache Pig. You can access your new Cloudera credentials from here if you are logged in. Contact the lab assistant if there are any issues with your credentials or access.
- If the java command does not work, make sure that the Java 8 SDK is properly installed, the `JAVA_HOME` system variable is set correctly and that the Java bin folder has been added to the system `PATH` variable.
Solutions for this task can no longer be submitted.