Practice 5 - Designing MapReduce Applications
The goal of this exercise is to analyse an open dataset using MapReduce. You will create a new MapReduce application, download an example dataset from an open data repository, and perform a series of data processing tasks.
References
The documents and web sites below contain supporting information for the practice.
- MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Eclipse IDE: http://www.eclipse.org/
Exercise 5.1. Create a new MapReduce application
The goal of this exercise is to analyse the UCI Adult dataset using MapReduce.
- Use your existing MapReduce Eclipse project or create a new one as in the last practice session
- Create a folder named "input" in your Eclipse project
- The dataset you will analyze is taken from the UCI Machine Learning Repository
- Name: Adult Data Set
- Location: http://archive.ics.uci.edu/ml/datasets/Adult
- Download the adult.data file from the Data Folder on the UCI page and move it to the input folder
- You may have to delete the last 2 empty lines in the file - otherwise you will get parsing errors
- Dataset attributes (column names) are:
- age
- workclass
- fnlwgt
- education
- education-num
- marital-status
- occupation
- relationship
- race
- sex
- capital-gain
- capital-loss
- hours-per-week
- native-country
- Classification
- Make a copy of the WordCount.java example from the last practice session and rename it.
- NB! Do not forget to turn off the use of a combiner:
//job.setCombinerClass(IntSumReducer.class);
- Modify the content of the pom.xml file inside your project folder to add one more Hadoop dependency
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
- Save the pom.xml file.
Exercise 5.2. Data analytics using MapReduce
Modify the application to perform the following tasks on the UCI Adult data set:
- For each native-country, calculate the average hours-per-week
- Map method:
- Input key-value pair is:
(line_nr, line)
, where line is a single line from the CSV file, which contains 15 comma-separated values (the 14 attributes and the classification).
- You should split the input line by commas and output native-country as the key and hours-per-week as the value:
(native-country, hours-per-week)
- In Reduce:
- Input is:
(native-country, [hours-per-week])
- Input key is a unique native-country and the value is a list-like iterable object of all 'hours-per-week' values from this native-country.
- Compute the average of all hours-per-week values.
- Output should be
(native-country, avg_val)
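The map-side parsing and reduce-side averaging described above can be sketched in plain Java (outside Hadoop, so it is easy to test in isolation); the field indices and the class/method names are assumptions based on the adult.data layout, not part of the Hadoop API:

```java
public class AvgHours {
    // adult.data layout: hours-per-week is the 13th field (index 12),
    // native-country the 14th (index 13); fields carry leading spaces.
    static final int HOURS = 12;
    static final int COUNTRY = 13;

    // What the Map method would emit as the output key for one input line.
    public static String mapKey(String line) {
        return line.split(",")[COUNTRY].trim();
    }

    // What the Map method would emit as the output value for one input line.
    public static int mapValue(String line) {
        return Integer.parseInt(line.split(",")[HOURS].trim());
    }

    // What the Reduce method would do with all values of one key:
    // a single pass computing sum and count, then the average.
    public static double average(Iterable<Integer> hours) {
        long sum = 0;
        int count = 0;
        for (int h : hours) {
            sum += h;
            count++;
        }
        return (double) sum / count;
    }
}
```

Note that in the real Reducer the incoming Iterable of values can only be traversed once, which is why the sketch computes sum and count in a single pass.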
- In addition to average value, also find minimum and maximum values
- Instead of writing out a single value using
context.write()
, the Reduce function should compute and output multiple values.
- Output value should be either:
- 3 different key-value pairs, using 3
context.write()
calls:
("native-country,MIN", min_val)
("native-country,AVG", avg_val)
("native-country,MAX", max_val)
- Or as a combined value:
(native-country, "min_val, avg_val, max_val")
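The reduce-side logic for the three statistics can be sketched in plain Java; the single-pass structure mirrors what the real Reduce method would do with its once-traversable Iterable of values, and the class/method names are illustrative:

```java
public class HoursStats {
    // One pass over the values, as the Reduce method would do:
    // returns { min, sum, max, count }.
    public static int[] minSumMaxCount(Iterable<Integer> values) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE, sum = 0, count = 0;
        for (int v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
            sum += v;
            count++;
        }
        return new int[] { min, sum, max, count };
    }

    // The second output variant from the text: one combined value per key.
    public static String combined(Iterable<Integer> values) {
        int[] s = minSumMaxCount(values);
        double avg = (double) s[1] / s[3];
        return s[0] + ", " + avg + ", " + s[2];
    }
}
```

For the first variant, the Reduce method would instead call context.write() three times, appending ",MIN", ",AVG" and ",MAX" to the key text.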
- Perform the previous analysis for each unique native-country AND workclass pair.
- Use the Map output key to control by which attributes the data is grouped when it "arrives" in the Reduce method
- You can create a combined key. For example:
("native-country,workclass", value)
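Building such a combined Map output key can be sketched like this; the field indices are assumptions based on the adult.data attribute order (workclass is the 2nd field, native-country the 14th):

```java
public class CombinedKey {
    // Build the "native-country,workclass" grouping key from one input line.
    public static String key(String line) {
        String[] fields = line.split(",");
        return fields[13].trim() + "," + fields[1].trim();
    }
}
```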
- Finally, instead of a single specific column (hours-per-week), perform the analysis for every numerical column
- Create a loop inside the Map method that outputs a (key, value) pair for every column you want to process separately.
- Use a unique key for each column to make sure they are grouped separately in the Reduce function. A simple way to achieve this is to add the name of the numerical column as another component of the combined key.
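The per-column loop inside the Map method can be sketched as follows; the chosen column indices and names are assumptions based on the attribute list above, and the list-of-strings return type just stands in for repeated context.write() calls:

```java
import java.util.ArrayList;
import java.util.List;

public class PerColumnMap {
    // An assumed subset of the numerical columns of adult.data: index and name.
    static final int[] COLS = { 0, 4, 10, 11, 12 };
    static final String[] NAMES = {
        "age", "education-num", "capital-gain", "capital-loss", "hours-per-week"
    };

    // Emit one "country,column <tab> value" pair per numerical column,
    // as the Map method would do via context.write().
    public static List<String> mapLine(String line) {
        String[] fields = line.split(",");
        String country = fields[13].trim();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < COLS.length; i++) {
            out.add(country + "," + NAMES[i] + "\t" + fields[COLS[i]].trim());
        }
        return out;
    }
}
```

Because the column name is part of the key, each column's values arrive at the Reduce method as a separate group and can be aggregated independently.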
In case of issues
- You may have to modify the Map and Reduce class input and output data types.
- It is important to notice that the signature of the IntSumReducer class: Reducer<Text,IntWritable,Text,IntWritable> specifies that Reduce method input types are
(Text, IntWritable)
and output types are
(Text, IntWritable)
- If you change the input or output types of Map/Reduce methods then you also have to change:
- The Mapper/Reducer class signature
- Map/Reduce method input and output types
- Respective MapReduce Job output or Map output (have to be added) data types inside the Main method:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Deliverables
- MapReduce application source code.
- Output files (part-r-0000*) of your job.
- Answer the following questions:
- What are the respective advantages/disadvantages of using either:
- separate key-value pairs to output the results
- or using comma separated combined values
Solutions can no longer be submitted for this exercise.