Practice 3 - Processing data with MapReduce
The goal of this exercise is to learn how to design new MapReduce applications and how to execute them in real Hadoop clusters. You will create a new MapReduce application, download an example data set from an open data repository and perform a series of data processing tasks. At the end of the session, you will learn how to execute the created MapReduce job in a Cloudera Hadoop cluster. We have set up a Cloudera Hadoop cluster with 128 CPU cores, 512GB of RAM and 6TB of storage for this lab session.
References
Here is a list of websites and documents that contain supporting information for this practice session.
Manuals
- MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop wiki: https://hadoop.apache.org/
Dataset Description
The dataset that we will analyze using MapReduce is taken from the Socrata OpenData repository:
- Name: Unclaimed bank accounts
- Location: https://opendata.socrata.com/Government/Unclaimed-bank-accounts/n2rk-fwkj
- Dataset attributes (column names) are:
- Last / Business Name
- First Name
- Balance
- Address
- City
- Last Transaction Date
- Bank Name
Download the dataset from Socrata OpenData as a .csv file (Export -> CSV) and name it Unclaimed_bank_accounts.csv.
Exercise 3.1. Create a new MapReduce application
We will modify the same MapReduce project we used last week by creating a new MapReduce class.
- Create a new folder named `bank_accounts` in your project, which we will use as an input folder.
- Move the downloaded `Unclaimed_bank_accounts.csv` file into the `bank_accounts` folder.
- Create a new `ee.ut.cs.ddpc` package inside the project.
  - Do not use the old `org.apache.hadoop.examples` package, as you will otherwise have issues later when trying to execute your program in the cluster.
- Copy the `WordCount.java` example from the last practice session into the `ee.ut.cs.ddpc` package folder and rename it to `Lab3MRApp.java`.
- Modify the copied class and change the Map output and the Reduce input & output types to the `Text` type:
  - in the Mapper class,
  - in the Reducer class,
  - in the `main()` method.
- Also add the following lines to the `main()` method to reconfigure the map output key and value types to be `Text`:

```java
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
```

- NB! Remove the Combiner configuration statement in the main method. The Reduce function that you will create cannot be used directly as a Combiner this time.
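After these changes, the job setup in the `main()` method of your `Lab3MRApp` class could look roughly like the minimal sketch below. The Mapper and Reducer class names (`BalanceMapper`, `AvgBalanceReducer`) are placeholders for whatever you name your own classes; the argument parsing is still WordCount's original multi-input style:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = Job.getInstance(conf, "lab3 mr app");
    job.setJarByClass(Lab3MRApp.class);
    job.setMapperClass(BalanceMapper.class);       // placeholder name
    // NB! No job.setCombinerClass(...) call this time.
    job.setReducerClass(AvgBalanceReducer.class);  // placeholder name
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // WordCount-style argument parsing: the last argument is the output
    // folder, all previous arguments are input folders.
    for (int i = 0; i < otherArgs.length - 1; ++i) {
        FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
```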
Exercise 3.2. Modify the MapReduce application to analyze the data
Modify the application to perform the following data analysis tasks:
- For each City, calculate the average account Balance (a sketch of a possible Mapper and Reducer for this task is shown after this list).
  - Input to the map method is `<line_number, line in the CSV file>`, where the line contains all the original table column values for one table record/line, separated by commas.
  - Split the whole line by commas and extract the City and Balance field values.
    - Some values may contain additional commas! The easiest way to deal with them while testing is to ignore lines that do not have 7 values after splitting them by commas.
  - Decide what the Map method output key and value should be.
  - Modify the Reduce method to calculate the average of the values.
- In addition to the average, also calculate the minimum and maximum account balance.
  - Instead of writing out a single value using `context.write()`, the Reduce function should compute and output multiple values.
  - The output should be either:
    - A single key-value pair, where the value is a combined string which contains all the output values: `(city, "min_balance,avg_balance,max_balance")`.
    - 3 different key-value pairs, using 3 `context.write()` calls:
      - `("city,MIN", min_balance)`
      - `("city,AVG", avg_balance)`
      - `("city,MAX", max_balance)`
    - Creating a combined key that includes the type of the value allows us to later differentiate between the different key-value pairs in the output.
- Instead of the simple max and min, calculate and output a list of the top 3 highest balances and the top 3 lowest.
- Perform the previous analysis for each unique City and Bank Name combination.
  - Use the keys to control by which attributes the data is grouped.
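The following minimal sketch shows one possible shape of the Mapper and Reducer for the average-balance task, to be placed inside your `Lab3MRApp` class. The column indexes are assumptions based on the dataset description above (Balance as the 3rd field, City as the 5th); verify them against your actual CSV file:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (city, balance) pairs; skips malformed lines and the header row.
public static class BalanceMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length != 7) {
            return; // some values contain extra commas; ignore such lines
        }
        String city = fields[4];    // assumed column position of City
        String balance = fields[2]; // assumed column position of Balance
        try {
            Double.parseDouble(balance); // also skips the CSV header row
        } catch (NumberFormatException e) {
            return;
        }
        context.write(new Text(city), new Text(balance));
    }
}

// Computes the average balance for each city.
public static class AvgBalanceReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        for (Text val : values) {
            sum += Double.parseDouble(val.toString());
            count++;
        }
        context.write(key, new Text(String.valueOf(sum / count)));
    }
}
```

Extending the Reducer to also track the minimum and maximum (or top-3 lists) only requires keeping extra running values inside the same loop and emitting them with additional `context.write()` calls, as described above.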
Exercise 3.3. Passing user-defined parameters to Map or Reduce tasks
Now, let's change the MapReduce program to analyse entries only from a specific user-defined transaction year, and ask the user to provide the year value as the third argument to the program. The arguments to the program should become: `input_folder output_folder year`
Additional parameters cannot be passed directly to the Map and Reduce tasks because we are not executing those tasks directly. Instead, the parameters should be written into the MapReduce Job Configuration before running the job. The Job Configuration object is made available to the Map and Reduce processes through `context`, and the configuration parameter values can be extracted from there.
- Modify the run configuration of your new application to contain three arguments, `input_folder output_folder year`, instead of just two.
- Change how the `main()` method parses the input arguments into input and output folders. Currently it uses the last program argument as the output folder and all previous arguments as input folders, to allow you to define multiple input paths. Change it to use the first argument as the input folder and the second as the output folder:

```java
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
```

- The value of the user-defined `YEAR` should be read from the program arguments in the `main()` method and written into the MapReduce job configuration, so that it is passed along to all the Map and Reduce processes (see the sketch after this list).
  - You can use the `conf.set(name, value);` method to write the user-defined `YEAR` value into the MapReduce configuration.
- Inside the Map method, use the `context.getConfiguration().get(name, default_value);` method to access the previously defined `YEAR` value.
- Use the `YEAR` value to filter out all input file entries that have the wrong transaction year.
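A minimal sketch of the relevant fragments, assuming the configuration property is simply named "year" and that Last Transaction Date is the 6th CSV column (both are illustrative assumptions):

```java
// In main(): the year must be written to the configuration BEFORE the
// Job object is created, because Job.getInstance() copies the configuration.
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
conf.set("year", otherArgs[2]); // "year" is an illustrative property name
Job job = Job.getInstance(conf, "lab3 mr app");

// In the map() method: read the year back and filter out other years.
String year = context.getConfiguration().get("year", "");
String lastTransactionDate = fields[5]; // assumed column position; verify!
if (!lastTransactionDate.contains(year)) {
    return; // wrong transaction year; skip this entry
}
```

The `contains()` check is the simplest filter that works regardless of the exact date format; a stricter parse of the date field is of course also possible.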
Exercise 3.4. Running MapReduce applications in a real Cluster
In this part of the exercise you will run your program in a real MapReduce cluster. This time we have set up a 1+4 node Hadoop cluster in our cloud where you can submit your application as a MapReduce job.
Hadoop Cluster:
http://172.17.77.39:8889/
Talk to your lab supervisor to get access (make sure you write the information down for future use).
NB! To access local cloud resources you will have to be inside the university network. When working from home or the dormitory, you will need to set up a VPN connection to the university network.
- UT VPN guide - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590
- Export your program as a .jar file (not an executable jar).
  - Eclipse IDE
    - Right click on your Eclipse project and choose: Export -> Java -> JAR file -> Next.
    - Choose the location where to save it, uncheck .classpath and .project. Click Finish.
  - IntelliJ IDEA
    - Open `View -> Tool Windows -> Maven Projects` and choose `Lifecycle -> package` inside your project there.
    - The `.jar` file will be created inside the `target` folder once the packaging process has finished.
- Use the File Browser to create a personal folder for uploading files. NB! Use your first and last name when naming the folder.
- Upload the .jar file to the cluster file system, into your personal folder.
- Create a new Java job (not MapReduce!) using the Java Editor.
  - Name your job so you will recognize it later.
  - Specify `Path` to be the full path to the jar file you previously uploaded to the File Browser.
    - Something like: `/user/labuser/my_folder_name/mymapreduce.jar`
  - Specify `Class` as the full classpath to the main class (which contains your main() method) of your program.
    - It has to be the full path: package path and class name (if you use a package).
    - In our case it should be: `ee.ut.cs.ddpc.Lab3MRApp`
  - Specify the program command line arguments in the `Arguments` part. Make sure you use a separate argument entry for each! There should be 3 argument entries/lines.
    - Make sure the input and output folders are inside the personal folder you previously created.
- Submit your job using the Execute button (or Ctrl+Enter) and wait until it has finished.
- Check that you got the correct results in the output folder in the File Browser.
  - You can check the list of all the executed MapReduce jobs in the Job Browser.
- Save a screenshot which shows that you have successfully run the MapReduce program in the cluster.
Bonus home exercise (Additional lab credit points)
The number of entries in the input data set we use in this lab is quite small. Your task is to generate additional input data and investigate how much input data is needed to make your program run for at least 5 minutes.
- Create a data generation script in your preferred programming language (Python, Java, Bash, etc.) to generate additional input data (see the sketch after this list).
  - Try not to simply generate random strings or numbers, as otherwise the program will no longer behave similarly.
  - Instead, you can create a list of possible cities, addresses, bank names, first and last names to draw values from, and generate random numerical bank balances and dates.
- Investigate how large the input data size should be for the program to take at least 5 minutes in the cluster.
  - Take care that your program does not run for more than 10 minutes, as it may affect the other students' work.
- Investigate whether your MapReduce job gets any benefit from parallel execution in the cluster.
  - TIP: You can force the MapReduce framework to use a specific number of Reduce tasks by using the `job.setNumReduceTasks(16);` command in the `main()` method of your application.
  - TIP: To force the Map tasks to be parallelized, you can divide the input data into separate files.
- Deliverables:
  - How much data did you need to generate?
  - Submit the data generation scripts, or screenshots/configuration files if you used an existing program to generate data.
  - Describe the experiments you executed to investigate how much data was needed to make your MapReduce application run for 5 minutes.
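For illustration, here is a minimal data generator sketch in Java, assuming the 7-column layout described earlier; all the sample values and the output file name are made up, and the date format may need adjusting to match the real dataset:

```java
import java.io.PrintWriter;
import java.util.Locale;
import java.util.Random;

// A minimal sketch of a data generator for the 7-column dataset layout;
// all sample values and the output file name are made up.
public class DataGenerator {
    public static void main(String[] args) throws Exception {
        String[] lastNames = {"Smith", "Jones", "Tamm", "Kask"};
        String[] firstNames = {"John", "Mary", "Jaan", "Mari"};
        String[] addresses = {"1 Main St", "5 Oak Ave", "12 Lake Rd"};
        String[] cities = {"Calgary", "Edmonton", "Toronto", "Vancouver"};
        String[] banks = {"Bank A", "Bank B", "Bank C"};
        int rows = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
        Random rnd = new Random();
        try (PrintWriter out = new PrintWriter("generated_accounts.csv")) {
            for (int i = 0; i < rows; i++) {
                double balance = rnd.nextDouble() * 10_000;
                // Assumed MM/DD/YYYY date format; adjust to match the dataset.
                String date = String.format("%02d/%02d/%d",
                        rnd.nextInt(12) + 1, rnd.nextInt(28) + 1,
                        1990 + rnd.nextInt(30));
                // Locale.US keeps a dot as the decimal separator so the
                // balance field does not break the comma-separated format.
                out.printf(Locale.US, "%s,%s,%.2f,%s,%s,%s,%s%n",
                        lastNames[rnd.nextInt(lastNames.length)],
                        firstNames[rnd.nextInt(firstNames.length)],
                        balance,
                        addresses[rnd.nextInt(addresses.length)],
                        cities[rnd.nextInt(cities.length)],
                        date,
                        banks[rnd.nextInt(banks.length)]);
            }
        }
    }
}
```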
Deliverables
- MapReduce application source code
- A screenshot of either the Java Editor or the MapReduce Jobs page in the cluster web interface, showing that you have successfully finished executing your MapReduce code in the cluster.
  - The Java job entry with your name must exist in the cluster web interface.
Solutions to common issues and errors
- Make sure that you have removed the Combiner configuration statement in the main method. The Reduce function that you will create cannot be used directly as a Combiner this time.
- If you get an error about the MapReduce example tests when creating the jar file, you can try deleting the tests (the test classes inside the /test/java/ folder of your project), running maven clean, and then running maven package again.
- If you get other errors when packaging your project as a `.jar` file in IntelliJ IDEA, then you can try a more general way of creating `.jar` artifacts:
  - Go to `File -> Project Structure -> Project Settings -> Artifacts`.
  - Add a new JAR type artifact from modules with dependencies.
  - Choose `ee.ut.cs.ddpc.Lab3MRApp` as the main class. Press OK.
  - In the Output Layout pane on the left side, remove everything but the `hadoop-mapreduce-examples.jar` and the `hadoop-mapreduce-examples` compile output rows.
  - Press OK to close the window and go to `Build -> Build Artifacts -> Build`. The jar file should now be generated into the `out/artifacts/hadoop_mapreduce_examples_jar` folder inside your project.
- If you get an error about an unsupported version, try changing the Java compliance version to 1.8.
  - Eclipse: Right click on the project -> Properties -> Java Compiler -> Compiler compliance level.
  - Export your project as a jar file again.
- NB! You will run into problems when running Hadoop MapReduce programs if your Windows username includes a space character. We suggest that you create an alternative user on your computer whose username contains no whitespace characters and complete the exercise under that user.
- You should also avoid spaces in folder paths.
NB! If you run into any issues not covered here then contact the lab supervisor.