Distributed Data Processing on the Cloud (LTAT.06.005), 2018/19 fall


Practice 7 - Higher level languages: Apache Pig

References

  • Overview of Apache Pig Latin - http://pig.apache.org/docs/r0.17.0/basic.html
  • Lecture slides - https://courses.cs.ut.ee/2018/DDPC/fall/uploads/Main/L7_2018.pdf
  • Apache Pig Tutorial - https://cwiki.apache.org/confluence/display/PIG/PigTutorial
  • Apache Pig version 0.17.0 API - https://pig.apache.org/docs/r0.17.0/api/

Preparation

  • Download the pre-compiled Pig Tutorial example: pigtutorial.tar.gz
    • It contains a pre-compiled version of Pig that simplifies running Pig scripts on your local computer.
    • Unpack the pigtutorial.tar.gz archive into a folder of your choice on your computer and use this folder as the working directory for the following exercises.
  • Download the Pig distribution pig-0.17.0.tar.gz from https://www-us.apache.org/dist/pig/pig-0.17.0/ and unpack it.
    • Copy the following two files from the unpacked archive into your working directory:
      • pig-0.17.0-core-h2.jar
      • lib/piggybank.jar
  • NB! Each student will get a new Cloudera cluster account for working with Apache Pig.
    • You can access your new Cloudera credentials from here if you are logged in.
    • Contact lab assistant if there are any issues with your credentials or access.

Exercise 7.1. Creating and running Pig scripts locally on your computer

  • Create a new WordCount example Pig script named myscript.pig inside the working directory.
  • Its content should be:
    • A = LOAD 'input';
      B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
      C = GROUP B BY word;
      D = FOREACH C GENERATE COUNT(B), group;
      STORE D INTO 'output';
  • Create an input folder inside the working directory and move some text files into it. You can reuse the Gutenberg books from previous labs.
  • Open a command-line program (cmd on Windows or a terminal on Linux), navigate to the working directory, and execute the script using the following Java command:
    • java -cp pig.jar org.apache.pig.Main -x local myscript.pig
    • If the java command does not work, make sure that the Java 8 SDK is properly installed, the JAVA_HOME environment variable is set correctly, and that the Java bin folder has been added to the system PATH variable.
  • Check the results in the output folder.

Exercise 7.2. Creating and running Pig scripts in the cluster

  • Log into our Hadoop cluster: http://172.17.77.39:8889/
  • Open the page Query Editors -> Pig
  • Create the following Pig script:
    • A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
      (last_name,first_name,balance,address,city,last_transaction,bank_name);
      DESCRIBE A;
      B = FOREACH A GENERATE bank_name, balance;
      DESCRIBE B;
      STORE B INTO 'processed_bank_accounts' USING PigStorage(',');
      
    • In this script example, the HDFS input path is Unclaimed_bank_accounts.csv and the output folder path is processed_bank_accounts.
    • Change these paths as needed and make sure the Unclaimed_bank_accounts.csv file exists in the correct HDFS location in the cluster.
    • Make sure the output folder does not exist before running the script!
  • Try to find the DESCRIBE statement results in the log file (search for a string such as "last_transaction").
  • Always use the DESCRIBE statement when trying to debug error messages involving relation and field names.
  • Investigate the script output folder processed_bank_accounts located in the cluster HDFS.
  • Assign types to each of the fields in the A relation.
    • double for balance, int for integers, and chararray for strings are sufficient for now.
  • Save the Pig script in your computer so you can use it as a base for the following exercises.
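A typed schema can be declared directly in the LOAD statement. The column types below are assumptions about the data; adjust them if a field does not match:

```
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
    (last_name:chararray, first_name:chararray, balance:double,
     address:chararray, city:chararray, last_transaction:chararray,
     bank_name:chararray);
DESCRIBE A; -- should now print the typed schema
```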

Exercise 7.3. Statistics with Apache Pig

  • You can continue running the scripts on your local computer to speed up prototyping and debugging.
  • Let's now use Apache Pig to solve the first two tasks of Exercise 3.2 from Practice 3 - Processing data with MapReduce.
  • Modify the Pig script (from 7.2) to:
    1. Group all data by bank name and calculate the sum of bank balances inside each group.
      • Output should contain both the bank name and the calculated sum.
    2. Find an example of GROUP BY either in the lecture slides or in the Pig Latin manual linked under References.
    3. Now, in addition to the sum, also calculate the minimum, maximum, and average.
  • Save the script as the first deliverable.
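As a sketch of the general grouping pattern (relation and field names follow the 7.2 script; treat this as a starting point, not the full solution):

```
B = GROUP A BY bank_name;
-- inside FOREACH, the aggregate runs over the bag A within each group,
-- which is why the argument is written A.balance
C = FOREACH B GENERATE group AS bank_name, SUM(A.balance) AS total_balance;
STORE C INTO 'bank_sums' USING PigStorage(',');
```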

Using PiggyBank UDFs to improve CSV file loading

  • You will run into issues with additional commas inside field values in the input file rows. You can fix this by using a special CSV loader available in the PiggyBank UDF collection.
    • Replace USING PigStorage (',') with USING org.apache.pig.piggybank.storage.CSVExcelStorage()
    • This should work without issues in the Cloudera cluster.
    • However, when running the Pig script on your local computer, you need to register the piggybank.jar file (which we previously extracted from pig-0.17.0.tar.gz) to make CSVExcelStorage available inside your Pig script.
    • Add the following line at the start of your Pig script to register it:
      • REGISTER piggybank.jar;
      • This REGISTER statement assumes the jar file is accessible from the working directory. Use a full or relative path if the jar is located elsewhere.
    • Unfortunately, the tutorial pig.jar we are using is missing some libraries that are included in the full Pig core and are needed by PiggyBank. You can make these libraries available by also registering pig-0.17.0-core-h2.jar, which we previously extracted from pig-0.17.0.tar.gz. Add the following line at the start of your Pig script to register it:
      • REGISTER pig-0.17.0-core-h2.jar;
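Putting the pieces together, the start of a locally runnable script could look like this (the jar paths assume both jars sit in the working directory, and the column types are assumptions about the data):

```
REGISTER pig-0.17.0-core-h2.jar;
REGISTER piggybank.jar;
-- CSVExcelStorage handles commas inside quoted field values
A = LOAD 'Unclaimed_bank_accounts.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS
    (last_name:chararray, first_name:chararray, balance:double,
     address:chararray, city:chararray, last_transaction:chararray,
     bank_name:chararray);
```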

Exercise 7.4. Nested queries in Pig

  • Now, let's use Apache Pig to solve the last two tasks of Exercise 3.2 and the Exercise 3.3 task from Practice 3 - Processing data with MapReduce:
    1. For each City, generate top 3 highest and top 3 lowest balances.
      • You can execute multiple statements inside the foreach in the following way:
        • B = GROUP A BY city;
          C = FOREACH B {
              FIRST_RESULT = ... ;
              SECOND_RESULT = ... ;
              GENERATE group, FIRST_RESULT, SECOND_RESULT;
          }
        • You can use many of the Pig statements inside the enclosing block (FILTER, ORDER, LIMIT, DISTINCT, ...)
        • This allows you to use multiple sequential statements to compute both highest and lowest values at the same time.
    2. Perform the previous analysis for each unique City and Bank Name
    3. Now, analyse only entries from a specific last transaction year.
      • You can use the FILTER statement.
      • Consider: where is the best location to put the FILTER statement?
      • You can split a chararray (string) field by using STRSPLIT(string, delimiter, split_limit)
      • You can get rid of nesting in the result using FLATTEN(...)
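A nested FOREACH for the top-3 task could be sketched like this (alias names are illustrative; the ORDER direction determines highest vs lowest):

```
B = GROUP A BY city;
C = FOREACH B {
    sorted  = ORDER A BY balance DESC; -- use ASC for the lowest balances
    highest = LIMIT sorted 3;
    GENERATE group AS city, highest;
}
```

For the year filter, assuming last_transaction is a string in a month/day/year format, STRSPLIT(last_transaction, '/', 3) produces a tuple whose last element is the year, and FLATTEN turns that tuple into separate fields that FILTER can compare against.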
  • Save the script as the second deliverable
  • Execute the created script in the Cloudera cluster (http://172.17.77.39:8889) and take screenshots of the result as the third deliverable.

Bonus Exercise (Additional credit points)

Preparation

  • Bonus tasks will not work properly with the Pig Tutorial pig.jar we used in the lab.
  • You can either use the cluster web interface or set up the full Pig distribution on your computer.
  • Setting up the full Pig distribution:
    • Download pig-0.17.0.tar.gz from https://www-eu.apache.org/dist/pig/ and unpack it.
    • Configure the PIG_HOME environment variable. It should point to the full path of the unpacked pig-0.17.0 folder, which contains the bin folder.
    • Add the full path of the pig-0.17.0/bin folder to your user PATH environment variable so that the pig command is available from the command line.
    • Modify pig.cmd (on Windows) and replace the following line:
      • set HADOOP_BIN_PATH=\bin
    • with:
      • set HADOOP_BIN_PATH=\libexec
    • Configure the HADOOP_HOME environment variable. On Windows, its bin subdirectory should contain the winutils.exe file.
    • Verify that the JAVA_HOME environment variable exists and, on Windows, that its path does not contain any spaces.
      • You can replace common Windows folder names in paths with their 8.3 short forms (e.g. PROGRA~1 for "Program Files").
      • You can use dir /x folder_path to find the short names of folders.
    • You should then be able to execute Pig scripts from the command line using: pig -x local myscriptname.pig

Bonus tasks

  1. Solve Exercise 7.3 using the CUBE operation instead of the GROUP BY statement.
  2. Implement TF-IDF in Pig
    • It should not require more than 5 FOREACH ... GENERATE and 3 GROUP BY statements, but it is OK if you use more.
    • There is a simple way to configure PigStorage() to add the file name of the input data as an additional column in the resulting relation. Investigate the PigStorage API page.
  3. Run the TF-IDF Pig script on /tmp/books_medium or /tmp/books_large datasets.
    • How much slower or faster did it run in comparison to your Practice 4 - MapReduce in Information Retrieval solution?
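If you get stuck on the input-file-name column in task 2, the PigStorage API documentation describes a constructor option for tagging each record with its source file. A sketch under that assumption (verify the option name against the Pig 0.17 API docs):

```
-- '-tagFile' prepends the source file name as the first column
books = LOAD '/tmp/books_medium' USING PigStorage('\t', '-tagFile')
        AS (filename:chararray, line:chararray);
```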

Deliverables of the bonus task:

  1. 7.3 with Cube operation Pig script
  2. TF-IDF Pig script
  3. Screenshots from the cluster web interface which show that you have successfully finished executing TF-IDF on /tmp/books_medium or /tmp/books_large datasets.

Deliverables

  1. Pig scripts (7.3, 7.4) you created in the lab.
  2. Screenshots from the cluster web interface which show that you have successfully finished executing your Exercise 7.4 Pig script in the cluster.

Solutions to common issues and errors

  • SUM, MIN, MAX and AVG operations inside Pig statements have to be written in capital letters.
  • Apache Pig is not available for you in the cluster job editor
    • Apache Pig job editor is not accessible from the labuser account.
    • Each student will get a new Cloudera cluster account for working with Apache Pig. You can access your new Cloudera credentials from here if you are logged in. Contact lab assistant if there are any issues with your credentials or access.
  • If java command does not work, make sure that Java 8 sdk is properly installed, JAVA_HOME system variable is set properly and that java bin folder has been added to the system PATH variable.