Practice 5 - Designing MapReduce Applications
The goal of this exercise is to analyse an open dataset using MapReduce. You will create a new MapReduce application, download an example dataset from an open data repository, and perform a series of data processing tasks.
References
The referenced documents and web sites contain supportive information for this practice.
- MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop wiki: https://hadoop.apache.org/
Dataset Description
The dataset that we will analyze using MapReduce is taken from the UCI Machine Learning Repository and contains census information about adults living in the USA. The original goal of the dataset was to predict whether the income of a person exceeds $50K per year.
- Name: UCI Adult dataset
- Location: http://archive.ics.uci.edu/ml/datasets/Adult
- Dataset attributes (column names) are:
  0. age, 1. workclass, 2. fnlwgt, 3. education, 4. education-num, 5. marital-status, 6. occupation, 7. relationship, 8. race, 9. sex, 10. capital-gain, 11. capital-loss, 12. hours-per-week, 13. native-country, 14. Classification
Download the adult.data and adult.test files from the dataset Data Folder.
Exercise 5.1. Create a new MapReduce application
The goal of this exercise is to analyse the UCI Adult dataset using MapReduce.
- Use your old IntelliJ MapReduce project or create a new one as in the last practice session.
- Create a folder named "input" inside your IntelliJ project.
- Download the adult.data and adult.test files from the dataset Data Folder page and move them inside the input folder.
  - You may have to delete the last 2 empty lines in the files, or you will get parsing errors.
- Create a new ee.ut.cs.ddpc package inside the project.
- Copy the WordCount.java example from the last practice session into the ee.ut.cs.ddpc package folder and rename it to Lab5MRApp.java.
- Modify the copied class and change the Map output and Reduce input & output types to the Text type:
- In the Mapper class:
- In the Reducer class:
- In the main() method:
- Also add the following lines to the main() method to reconfigure the map output key and value types to be Text:

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
- NB! Remove the Combiner configuration statement in the main method. The Reduce function that you will create cannot be used directly as a Combiner this time:

//job.setCombinerClass(IntSumReducer.class);
Exercise 5.2. Data analytics using MapReduce
Modify the application to perform the following tasks on the UCI Adult data set:
- For each native-country, calculate the average hours-per-week.
  - In the Map method:
    - The input key-value pair is (line_nr, line), where line is a single line from the CSV file containing 14 comma-separated values.
    - Split the input line by commas and output the native-country as the key and hours-per-week as the value: (native-country, hours-per-week)
  - In the Reduce method:
    - The input is (native-country, [hours-per-week]): the input key is a unique native-country and the value is a list-like iterable object of all hours-per-week values from this native-country.
    - Compute the average of all hours-per-week values.
    - The output should be (native-country, avg_val)
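Stripped of the Hadoop boilerplate, the per-record and per-key logic described above can be sketched in plain Java. The helper names below are hypothetical; in the real application this logic lives inside the map() and reduce() methods, which write their results out with context.write(). Column indices follow the attribute list above (12 = hours-per-week, 13 = native-country).

```java
import java.util.List;

class AvgSketch {
    // What map() would emit for one input line: native-country (column 13)
    // as the key and hours-per-week (column 12) as the value.
    static String[] mapRecord(String line) {
        String[] cols = line.split(",");
        return new String[] { cols[13].trim(), cols[12].trim() };
    }

    // What reduce() would compute over all hours-per-week values of one key.
    static double average(List<String> hoursPerWeek) {
        double sum = 0;
        for (String h : hoursPerWeek) {
            sum += Double.parseDouble(h);
        }
        return sum / hoursPerWeek.size();
    }
}
```

In Lab5MRApp both the key and the computed average would be wrapped in Text objects before being written out, since the Map output and Reduce input/output types were changed to Text.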
- In addition to the average value, also find the minimum and maximum values.
  - Instead of writing out a single value using context.write(), the Reduce function should compute and output multiple values.
  - The output should be written as either:
    - 3 different key-value pairs, using 3 context.write() calls:
      - ("native-country,MIN", min_val)
      - ("native-country,AVG", avg_val)
      - ("native-country,MAX", max_val)
    - or a single key-value pair where the value object contains multiple results: (native-country, "min_val, avg_val, max_val")
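The per-key computation for this task can be sketched as a single pass over one key's values (plain Java, without the Hadoop types; the helper name is hypothetical):

```java
import java.util.List;

class MinAvgMaxSketch {
    // One pass over a key's values, as reduce() would do before calling
    // context.write() once per result (or once with a combined value).
    static double[] minAvgMax(List<String> values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (String v : values) {
            double d = Double.parseDouble(v.trim());
            if (d < min) min = d;
            if (d > max) max = d;
            sum += d;
        }
        return new double[] { min, sum / values.size(), max };
    }
}
```

A single pass matters here because the values arrive in Reduce as an Iterable that can only be traversed once without buffering them yourself.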
- Perform the previous analysis for each unique native-country AND workclass pair.
  - Use the Map output key to control by which attributes the data is grouped when it "arrives" in the Reduce method.
  - You can create a combined key, for example: ("native-country,workclass", value)
- And finally, instead of one specific column (hours-per-week), perform the analysis for every numerical column.
  - Create a loop inside the Map method that outputs a (key, value) pair for every column you want to process separately.
  - Use a unique key for each column to make sure the columns are grouped separately in the Reduce function. A simple way to achieve this is to add the name of the numerical column as another component of the combined key.
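The combined-key loop described above can be sketched as follows (plain Java, helper name hypothetical). The choice of numerical columns is an assumption read off the attribute list: age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week.

```java
import java.util.ArrayList;
import java.util.List;

class MultiColumnMapSketch {
    // Numerical columns of the Adult dataset (indices from the attribute list).
    static final int[] NUM_COLS = { 0, 2, 4, 10, 11, 12 };
    static final String[] NUM_NAMES = { "age", "fnlwgt", "education-num",
            "capital-gain", "capital-loss", "hours-per-week" };

    // What map() would emit for one input line: one key-value pair per
    // numerical column, with a combined "country,workclass,column" key.
    static List<String[]> mapRecord(String line) {
        String[] cols = line.split(",");
        String country = cols[13].trim();
        String workclass = cols[1].trim();
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < NUM_COLS.length; i++) {
            String key = country + "," + workclass + "," + NUM_NAMES[i];
            out.add(new String[] { key, cols[NUM_COLS[i]].trim() });
        }
        return out;
    }
}
```

Because each column name is part of the key, Hadoop's shuffle phase groups the values of different columns into separate Reduce calls, so the Reduce function itself needs no changes.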
Exercise 5.3. Passing user defined parameters to Map or Reduce tasks.
Now, let's change the MapReduce program to analyse entries only from a specific user-defined occupation and ask the user to provide the occupation value as the third argument to the program. The arguments to the program should become: input_folder output_folder occupation
Additional parameters cannot be passed directly to the Map and Reduce tasks, because we do not execute those tasks directly when the application runs in a distributed manner (in a cluster). Instead, they should be written to the MapReduce Job Configuration before running the job. The Job Configuration object is made available to the Map and Reduce processes through the context, and configuration parameter values can be extracted from there.
- Modify the run configuration of your new application to contain three arguments instead of just two: input_folder output_folder occupation
- Change how the main() method parses the input arguments. Currently it uses the last program argument as the output folder and all previous arguments as input folders, to allow you to define multiple input paths. Change it to use the first argument as the input folder and the second as the output folder:

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
- The value of the user-defined occupation should be read from the program arguments in the main() method and written into the MapReduce job configuration so that it is passed along to all the Map and Reduce processes.
  - You can use the conf.set(name, value); method to write the user-defined occupation value into the MapReduce configuration.
- Inside the Map method, use the context.getConfiguration().get(name, default_value); method to access the previously defined occupation value.
- Use the occupation value to filter out all input file entries that have the wrong occupation field/column value.
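Put together, the relevant fragments could look like the sketch below. This is not a complete program: it assumes the surrounding WordCount-style main() and Mapper from the previous steps, the property name "occupation" is an arbitrary choice, and column index 6 for occupation is taken from the attribute list above.

```java
// In main(), after parsing the program arguments
// (otherArgs: input_folder output_folder occupation):
Configuration conf = new Configuration();
conf.set("occupation", otherArgs[2]);
Job job = Job.getInstance(conf, "Lab5MRApp");

// Inside the Mapper's map() method:
String occupation = context.getConfiguration().get("occupation", "");
String[] cols = value.toString().split(",");
if (!cols[6].trim().equals(occupation)) {
    return; // skip entries whose occupation column does not match
}
```

Note that conf.set() must be called before the Job is created from the Configuration; values set afterwards are not guaranteed to reach the tasks.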
Deliverables
- MapReduce application source code.
- Output files (part-r-0000*) of your job.
- Answer the following questions:
- What are the respective advantages/disadvantages of using either:
  - separate key-value pairs to output the results,
  - or comma-separated combined values?
Solutions to common issues and errors.
- If you get an error about a ClassNotFoundException:
  - Make sure you have selected the Include dependencies with the "Provided" scope option in the Run Configuration of your application.
- Make sure that you have removed the Combiner configuration statement in the main method. The Reduce function that you created cannot be used directly as a Combiner this time.
- If you get an error about failing MapReduce example tests when creating the jar file, try deleting the tests (the Test classes inside the /test/java/ folder of your project), then run maven clean and maven package again.
- NB! You will run into problems running Hadoop MapReduce programs if your Windows username includes a space character. We suggest creating an alternative user on your computer whose name contains no whitespace characters and completing the exercise under that user.
- You should also avoid spaces in folder paths.
NB! If you run into any issues not covered here then contact the lab supervisor.