Practice 5 - Statistics in MapReduce
The goal of this exercise is to create a MapReduce application for performing statistical analysis of a given dataset. You will create a new MapReduce application, download an example dataset from an open data repository and perform a series of simple data processing tasks. At the end of the session you will also learn how to execute the created MapReduce job on a real Hadoop cluster.
References
The documents and web sites referred to below contain supporting information for this practice.
Manuals
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop MapReduce tutorial: http://hadoop.apache.org/docs/r1.2.0/mapred_tutorial.html
- Eclipse IDE: http://www.eclipse.org/
Exercise 5.1. Statistics with MapReduce
- The goal of this exercise is to analyse a dataset using MapReduce
- Make a new MapReduce Eclipse project (like in the last practice session)
- To be able to run your MapReduce program in the cluster, change the Java compatibility option to version 7 (if it is not already using version 7).
- Right-click on your project -> Properties -> Java Compiler -> JDK Compliance -> 1.7
- Create a folder named "input" in your Eclipse project
- The dataset that you will analyse is taken from the UCI Machine Learning Repository
- Name: Adult Data Set
- Location: http://archive.ics.uci.edu/ml/datasets/Adult
- Download the adult.data file from the Data Folder on the UCI page and move it to the input folder
- You may have to delete the last two empty lines in the file, or you will get parsing errors
- Dataset attributes (column names) are: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country (each record additionally ends with an income class label)
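For reference, each record in adult.data is a single comma-separated line; the first record of the file should look like the following (verify against your downloaded copy):
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K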
- Take the WordCount.java example from the last practice session, add it to the src folder in your project and modify it.
- NB! Do not forget to turn off the use of a combiner:
//job.setCombinerClass(IntSumReducer.class);
- NB! You will also have to modify the Map and Reduce class input and output data types.
- Also configure Map class output key and value types in the Main class job object.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
- (Change the types according to your needs.)
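As a point of reference, here is a minimal sketch of what the job setup in the main method could look like after these changes. The class and job names are only illustrative assumptions, and the Mapper/Reducer classes are sketched after the task list below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CountryAgeStats {  // hypothetical main class name

    // StatsMapper and StatsReducer (sketched further below) would be nested here

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "country age statistics");  // Job.getInstance(conf, ...) on newer Hadoop versions
        job.setJarByClass(CountryAgeStats.class);
        job.setMapperClass(StatsMapper.class);
        // NB! The combiner stays turned off for this exercise:
        // job.setCombinerClass(StatsReducer.class);
        job.setReducerClass(StatsReducer.class);
        // Map output types differ from the final Reduce output types:
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);  // e.g. a combined "min,max" value as Text
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}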
- The modified application must do the following (a sketch of a possible Map and Reduce implementation is shown after this list):
- For each native-country, find the minimum and maximum age
- In Map, you can output for example: (<native-country>,<age>)
- In Reduce you get each unique key (in this case native-country) and a list of all values (in this case 'ages') and have to find the smallest and largest values in the list.
- In Reduce either output the two different values as separate key-value pairs or create a comma-separated combined value for them.
- In addition to minimum and maximum, also find average and median values
- Do the previous analysis for each unique native-country AND workclass pair.
- Create a combined key, for example something like: (<native-country;workclass>, <value>)
- You can interpret the answers as a table instead of a list.
- And finally, instead of a single specific column (age), do the analysis for every numerical column
- Create a loop inside Map that outputs a key/value pair for every column you want to process separately.
- Use a unique key for each column, e.g. use the name of the numerical column as part of the key.
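Below is a minimal sketch, assuming the column layout listed above (age in column 0, native-country in column 13), of what the Map and Reduce classes for the basic min/max/average/median task could look like. The class names and the comma-separated output format are only illustrative:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class StatsMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Records are comma-separated; age is column 0, native-country is column 13
        String[] cols = value.toString().split(",");
        if (cols.length < 14) return;  // skip empty or malformed lines
        int age;
        try {
            age = Integer.parseInt(cols[0].trim());
        } catch (NumberFormatException e) {
            return;  // skip lines whose age field does not parse
        }
        context.write(new Text(cols[13].trim()), new IntWritable(age));
    }
}

public static class StatsReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Collect and sort all ages of one country so that min, max and median
        // can be read off directly
        List<Integer> ages = new ArrayList<Integer>();
        for (IntWritable v : values) {
            ages.add(v.get());
        }
        Collections.sort(ages);
        int n = ages.size();
        int min = ages.get(0);
        int max = ages.get(n - 1);
        long sum = 0;
        for (int a : ages) sum += a;
        double avg = (double) sum / n;
        double median = (n % 2 == 1)
                ? ages.get(n / 2)
                : (ages.get(n / 2 - 1) + ages.get(n / 2)) / 2.0;
        // The comma-separated combined-value variant of the output:
        context.write(key, new Text(min + "," + max + "," + avg + "," + median));
    }
}
For the combined-key variants, the Map class would instead emit keys such as new Text(cols[13].trim() + ";" + cols[1].trim()) for the native-country/workclass pairs, or loop over the numerical columns and include the column name in the key.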
Exercise 5.2. Running your MapReduce application in a real Cluster
In this part of the exercise you will run your program in a real MapReduce cluster. This time we have set up a 1+4 node Hadoop cluster in our cloud where you can submit your application as a MapReduce job.
Hadoop Cluster:
http://172.17.136.3:8888/
Talk to your lab supervisor to get access (make sure you note the credentials down for future use)
- Export your own program as a .jar file (not an executable jar)
- Right click on your eclipse project, choose: Export -> Java -> JAR file -> Next
- Choose the location where to save it, uncheck .classpath and .project. Click Finish.
- Use the HDFS File Browser to create a personal folder for uploading files. NB! Use your last name when naming the folder.
- Upload the .jar file to the cluster File System into your personal folder (use the HDFS File Browser, not My Documents view)
- Create a new Java Job (not MapReduce!)
Query Editors -> Job Designer -> New Action -> Java
- Name your job so you will recognize it later
- Specify Jar Path to be the full path to the jar file you uploaded. Something like:
/user/labuser/my_folder_name/mymapreduce.jar
- Specify the main class as the fully qualified name of the class containing the main method your program uses.
- This has to be the full path: package name + class name (if you use a package), e.g. something like mypackage.CountryAgeStats.
- Specify the program's command line arguments in the Args field as needed
- Make sure the input and output folders are inside the personal folder you previously created
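For example, if your program reads the input path as args[0] and the output path as args[1] (as in the job setup sketch of Exercise 5.1; the folder name below is only illustrative):
my_folder_name/input my_folder_name/output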
- Submit your job and wait until it is finished
- If you get an error about an unsupported version, try changing the Java compliance version to 1.6 (right-click on your Eclipse project -> Properties -> Java Compiler -> JDK Compliance)
- If you get error 500, go to http://172.17.136.5:8888/jobbrowser and check the status of your job; it might still have been successfully submitted.
- In the job browser, you can find the list of map and reduce tasks that were run for your job, and under each of them their respective log files, where you can search for error messages if something did not work as expected
- Check that you got the correct results in the output folder
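With the combined-value Reduce output sketched in Exercise 5.1, each line of the output files would follow a pattern like the one below, with a tab between the key and the value:
<native-country>    <min>,<max>,<average>,<median>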
Deliverables
- MapReduce application source code.
- Output files (part-r-0000*) of your job.
- The screenshot of running the application in the cluster.
- Answer the following questions:
- What are the respective advantages/disadvantages of using either:
- separate key-value pairs to output the results
- or using comma separated combined values
- When running this exercise in the Hadoop cluster, were the computations run in a parallel fashion?
- Were there multiple map and reduce tasks?
- How can you verify it from the output of the MapReduce job?