Practice 3 - Processing data with MapReduce
The goal of this exercise is to learn how to design new MapReduce applications and how to execute them in real Hadoop clusters. You will create a new MapReduce application, download an example data set from an open data repository and perform a series of data processing tasks. At the end of the session, you will learn how to execute the created MapReduce job in a Cloudera Hadoop cluster. We have set up a Cloudera Hadoop cluster with 128 CPU cores, 512GB of RAM and 6TB of storage for this lab session.
References
Here is a list of websites and documents that contain supporting information for this practice session.
Manuals
- MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop wiki: https://hadoop.apache.org/
Dataset Description
The dataset that we will analyze using MapReduce is taken from the Socrata OpenData repository:
- Name: Unclaimed bank accounts
- Location: https://opendata.socrata.com/Government/Unclaimed-bank-accounts/n2rk-fwkj
- Dataset attributes (column names) are:
- Last / Business Name
- First Name
- Balance
- Address
- City
- Last Transaction Date
- Bank Name
Download the dataset from Socrata OpenData as a .csv file (Export -> CSV) and name it Unclaimed_bank_accounts.csv.
Exercise 3.1. Create a new MapReduce application
We will modify the same MapReduce project we used last week by creating a new MapReduce class.
- Create a new folder named `bank_accounts` in your project, which we will use as an input folder.
- Move the downloaded `Unclaimed_bank_accounts.csv` file into the `bank_accounts` folder.
- Create a new `ee.ut.cs.ddpc` package inside the project.
  - Do not use the old `org.apache.hadoop.examples` package, as you will otherwise have issues later when trying to execute your program in the cluster.
- Copy the `WordCount.java` example from the last practice session into the `ee.ut.cs.ddpc` package folder and rename it to `Lab3MRApp.java`.
- Modify the copied class and change the Map output and the Reduce input & output types to the `Text` type:
  - in the Mapper class,
  - in the Reducer class,
  - in the `main()` method.
- Also add the following lines to the `main()` method to reconfigure the map output key and value types to be `Text`:

```java
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
```

- NB! Remove the Combiner configuration statement in the main method. The Reduce function that you will create cannot be used directly as a Combiner this time.
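After these changes, the job setup in the `main()` method of your `Lab3MRApp` class could look roughly like the minimal sketch below. The Mapper and Reducer class names (`BalanceMapper`, `AvgBalanceReducer`) are placeholders for whatever you name your own classes; the argument parsing is still WordCount's original multi-input style:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = Job.getInstance(conf, "lab3 mr app");
    job.setJarByClass(Lab3MRApp.class);
    job.setMapperClass(BalanceMapper.class);       // placeholder name
    // NB! No job.setCombinerClass(...) call this time.
    job.setReducerClass(AvgBalanceReducer.class);  // placeholder name
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // WordCount-style argument parsing: the last argument is the output
    // folder, all previous arguments are input folders.
    for (int i = 0; i < otherArgs.length - 1; ++i) {
        FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
```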
Exercise 3.2. Modify the MapReduce application to analyze the data
Modify the application to perform the following data analysis tasks:
- For each City, calculate the average account Balance (a sketch of a possible Mapper and Reducer for this task is shown after this list).
  - Input to the map method is `<line_number, line in the CSV file>`, where the line contains all the original table column values for one table record/line, separated by commas.
  - Split the whole line by commas and extract the City and Balance field values.
    - Some values may contain additional commas! The easiest way to deal with them while testing is to ignore lines that do not have 7 values after splitting them by commas.
  - Decide what the Map method output key and value should be.
  - Modify the Reduce method to calculate the average of the values.
- In addition to the average, also calculate the minimum and maximum account balance.
  - Instead of writing out a single value using `context.write()`, the Reduce function should compute and output multiple values.
  - The output should be either:
    - A single key-value pair, where the value is a combined string which contains all the output values: `(city, "min_balance,avg_balance,max_balance")`.
    - 3 different key-value pairs, using 3 `context.write()` calls:
      - `("city,MIN", min_balance)`
      - `("city,AVG", avg_balance)`
      - `("city,MAX", max_balance)`
    - Creating a combined key that includes the type of the value allows us to later differentiate between the different key-value pairs in the output.
- Instead of the simple max and min, calculate and output a list of the top 3 highest balances and the top 3 lowest.
- Perform the previous analysis for each unique City and Bank Name combination.
  - Use the keys to control by which attributes the data is grouped.
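The following minimal sketch shows one possible shape of the Mapper and Reducer for the average-balance task, to be placed inside your `Lab3MRApp` class. The column indexes are assumptions based on the dataset description above (Balance as the 3rd field, City as the 5th); verify them against your actual CSV file:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (city, balance) pairs; skips malformed lines and the header row.
public static class BalanceMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length != 7) {
            return; // some values contain extra commas; ignore such lines
        }
        String city = fields[4];    // assumed column position of City
        String balance = fields[2]; // assumed column position of Balance
        try {
            Double.parseDouble(balance); // also skips the CSV header row
        } catch (NumberFormatException e) {
            return;
        }
        context.write(new Text(city), new Text(balance));
    }
}

// Computes the average balance for each city.
public static class AvgBalanceReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        for (Text val : values) {
            sum += Double.parseDouble(val.toString());
            count++;
        }
        context.write(key, new Text(String.valueOf(sum / count)));
    }
}
```

Extending the Reducer to also track the minimum and maximum (or top-3 lists) only requires keeping extra running values inside the same loop and emitting them with additional `context.write()` calls, as described above.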
Exercise 3.3. Passing user-defined parameters to Map or Reduce tasks
Now, let's change the MapReduce program to analyse entries only from a specific user-defined transaction year, and ask the user to provide the year value as the third argument to the program. The arguments to the program should become: `input_folder output_folder year`
Additional parameters cannot be passed directly to the Map and Reduce tasks because we are not executing those tasks directly. Instead, the parameters should be written into the MapReduce Job Configuration before running the job. The Job Configuration object is made available to the Map and Reduce processes through `context`, and the configuration parameter values can be extracted from there.
- Modify the run configuration of your new application to contain three arguments, `input_folder output_folder year`, instead of just two.
- Change how the `main()` method parses the input arguments into input and output folders. Currently it uses the last program argument as the output folder and all previous arguments as input folders, to allow you to define multiple input paths. Change it to use the first argument as the input folder and the second as the output folder:

```java
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
```

- The value of the user-defined `YEAR` should be read from the program arguments in the `main()` method and written into the MapReduce job configuration, so that it is passed along to all the Map and Reduce processes (see the sketch after this list).
  - You can use the `conf.set(name, value);` method to write the user-defined `YEAR` value into the MapReduce configuration.
- Inside the Map method, use the `context.getConfiguration().get(name, default_value);` method to access the previously defined `YEAR` value.
- Use the `YEAR` value to filter out all input file entries that have the wrong transaction year.
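A minimal sketch of the relevant fragments, assuming the configuration property is simply named "year" and that Last Transaction Date is the 6th CSV column (both are illustrative assumptions):

```java
// In main(): the year must be written to the configuration BEFORE the
// Job object is created, because Job.getInstance() copies the configuration.
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
conf.set("year", otherArgs[2]); // "year" is an illustrative property name
Job job = Job.getInstance(conf, "lab3 mr app");

// In the map() method: read the year back and filter out other years.
String year = context.getConfiguration().get("year", "");
String lastTransactionDate = fields[5]; // assumed column position; verify!
if (!lastTransactionDate.contains(year)) {
    return; // wrong transaction year; skip this entry
}
```

The `contains()` check is the simplest filter that works regardless of the exact date format; a stricter parse of the date field is of course also possible.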
Exercise 3.4. Running MapReduce applications in a real Cluster
In this part of the exercise you will run your program in a real MapReduce cluster. This time we have set up a 1+4 node Hadoop cluster in our cloud where you can submit your application as a MapReduce job.
Hadoop Cluster:
http://172.17.77.39:8889/
Talk to your lab supervisor to get access (make sure you write the information down for future use).
NB! To access local cloud resources you will have to be inside the university network. When working from home or the dormitory, you will need to set up a VPN connection to the university network.
- UT VPN guide - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590
- Export your program as a .jar file (not an executable jar).
  - Eclipse IDE
    - Right click on your Eclipse project and choose: Export -> Java -> JAR file -> Next.
    - Choose the location where to save it, uncheck .classpath and .project. Click Finish.
  - IntelliJ IDEA
    - Open `View -> Tool Windows -> Maven Projects` and choose `Lifecycle -> package` inside your project there.
    - The `.jar` file will be created inside the `target` folder once the packaging process has finished.
- Use the File Browser to create a personal folder for uploading files. NB! Use your first and last name when naming the folder.
- Upload the .jar file to the cluster file system, into your personal folder.
- Create a new Java job (not MapReduce!) using the Java Editor.
  - Name your job so you will recognize it later.
  - Specify `Path` to be the full path to the jar file you previously uploaded to the File Browser.
    - Something like: `/user/labuser/my_folder_name/mymapreduce.jar`
  - Specify `Class` as the full classpath to the main class (which contains your main() method) of your program.
    - It has to be the full path: package path and class name (if you use a package).
    - In our case it should be: `ee.ut.cs.ddpc.Lab3MRApp`
  - Specify the program command line arguments in the `Arguments` part. Make sure you use a separate argument entry for each! There should be 3 argument entries/lines.
    - Make sure the input and output folders are inside the personal folder you previously created.
- Submit your job using the Execute button (or Ctrl+Enter) and wait until it has finished.
- Check that you got the correct results in the output folder in the File Browser.
  - You can check the list of all the executed MapReduce jobs in the Job Browser.
- Save a screenshot which shows that you have successfully run the MapReduce program in the cluster.
Bonus home exercise (Additional lab credit points)
The number of entries in the input data set we use in this lab is quite small. Your task is to generate additional input data and investigate how much input data is needed to make your program run for at least 5 minutes.
- Create a data generation script in your preferred programming language (Python, Java, Bash, etc.) to generate additional input data (see the sketch after this list).
  - Try not to simply generate random strings or numbers, as otherwise the program will no longer behave similarly.
  - Instead, you can create a list of possible cities, addresses, bank names, first and last names to draw values from, and generate random numerical bank balances and dates.
- Investigate how large the input data size should be for the program to take at least 5 minutes in the cluster.
  - Take care that your program does not run for more than 10 minutes, as it may affect the other students' work.
- Investigate whether your MapReduce job gets any benefit from parallel execution in the cluster.
  - TIP: You can force the MapReduce framework to use a specific number of Reduce tasks by using the `job.setNumReduceTasks(16);` command in the `main()` method of your application.
  - TIP: To force the Map tasks to be parallelized, you can divide the input data into separate files.
- Deliverables:
  - How much data did you need to generate?
  - Submit the data generation scripts, or screenshots/configuration files if you used an existing program to generate data.
  - Describe the experiments you executed to investigate how much data was needed to make your MapReduce application run for 5 minutes.
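For illustration, here is a minimal data generator sketch in Java, assuming the 7-column layout described earlier; all the sample values and the output file name are made up, and the date format may need adjusting to match the real dataset:

```java
import java.io.PrintWriter;
import java.util.Locale;
import java.util.Random;

// A minimal sketch of a data generator for the 7-column dataset layout;
// all sample values and the output file name are made up.
public class DataGenerator {
    public static void main(String[] args) throws Exception {
        String[] lastNames = {"Smith", "Jones", "Tamm", "Kask"};
        String[] firstNames = {"John", "Mary", "Jaan", "Mari"};
        String[] addresses = {"1 Main St", "5 Oak Ave", "12 Lake Rd"};
        String[] cities = {"Calgary", "Edmonton", "Toronto", "Vancouver"};
        String[] banks = {"Bank A", "Bank B", "Bank C"};
        int rows = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
        Random rnd = new Random();
        try (PrintWriter out = new PrintWriter("generated_accounts.csv")) {
            for (int i = 0; i < rows; i++) {
                double balance = rnd.nextDouble() * 10_000;
                // Assumed MM/DD/YYYY date format; adjust to match the dataset.
                String date = String.format("%02d/%02d/%d",
                        rnd.nextInt(12) + 1, rnd.nextInt(28) + 1,
                        1990 + rnd.nextInt(30));
                // Locale.US keeps a dot as the decimal separator so the
                // balance field does not break the comma-separated format.
                out.printf(Locale.US, "%s,%s,%.2f,%s,%s,%s,%s%n",
                        lastNames[rnd.nextInt(lastNames.length)],
                        firstNames[rnd.nextInt(firstNames.length)],
                        balance,
                        addresses[rnd.nextInt(addresses.length)],
                        cities[rnd.nextInt(cities.length)],
                        date,
                        banks[rnd.nextInt(banks.length)]);
            }
        }
    }
}
```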
Deliverables
- MapReduce application source code
- A screenshot of either the Java Editor or the MapReduce Jobs page in the cluster web interface, showing that you have successfully finished executing your MapReduce code in the cluster.
  - The Java job entry with your name must exist in the cluster web interface.
Solutions to common issues and errors
- Make sure that you have removed the Combiner configuration statement in the main method. The Reduce function that you will create cannot be used directly as a Combiner this time.
- If you get an error about the MapReduce example tests when creating the jar file, you can try deleting the tests (the test classes inside the /test/java/ folder of your project), running maven clean, and then running maven package again.
- If you get other errors when packaging your project as a `.jar` file in IntelliJ IDEA, then you can try a more general way of creating `.jar` artifacts:
  - Go to `File -> Project Structure -> Project Settings -> Artifacts`.
  - Add a new JAR type artifact from modules with dependencies.
  - Choose `ee.ut.cs.ddpc.Lab3MRApp` as the main class. Press OK.
  - In the Output Layout pane on the left side, remove everything but the `hadoop-mapreduce-examples.jar` and the `hadoop-mapreduce-examples` compile output rows.
  - Press OK to close the window and go to `Build -> Build Artifacts -> Build`. The jar file should now be generated into the `out/artifacts/hadoop_mapreduce_examples_jar` folder inside your project.
- If you get an error about an unsupported version, try changing the Java compliance version to 1.8.
  - Eclipse: Right click on the project -> Properties -> Java Compiler -> Compiler compliance level.
  - Export your project as a jar file again.
- NB! You will run into problems when running Hadoop MapReduce programs if your Windows username includes a space character. We suggest that you create an alternative user on your computer whose username contains no whitespace characters and complete the exercise under that user.
- You should also avoid spaces in folder paths.
NB! If you run into any issues not covered here then contact the lab supervisor.