Practice 8 - Designing MapReduce Applications
The goal of this exercise is to analyse an open dataset using MapReduce. You will create a new MapReduce application, download an example dataset from an open data repository, and perform a series of data processing tasks.
References
Here is a list of web sites and documents that contain supportive information for the practice.
Manuals
- MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/r2.9.2/api/
- IntelliJ IDEA: https://www.jetbrains.com/idea/
- Hadoop wiki https://hadoop.apache.org/
NB! It is critical to watch the MapReduce lectures before the Labs
- We will be using the MapReduce distributed computing model which was introduced in the lecture
- The lab programming tasks will be much harder to follow without understanding this model
Practical session communication!
There will be no physical lab sessions for the foreseeable future; the labs should be completed online.
- Lab supervisors will provide support through Slack
- We have set up a Slack workspace for lab and course related discussions. Invitation link was sent separately to all students through email.
- Use the #practice8-mr-algorithms Slack channel for lab related questions and discussion.
- When asking questions or support from lab assistant, please make sure to also provide all needed information, including screenshots, configurations, and even code if needed.
- If code needs to be shown to the lab supervisor, send it (or a link to it) through Slack Direct Messages.
- If you have not received Slack invitation, please contact lab assistants through email.
In case of issues check:
- Pinned messages in the #practice8-mr-algorithms Slack channel.
- The MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- The "Solutions to common issues and errors" section at the end of the guide.
- Ask in the #practice8-mr-algorithms Slack channel.
Dataset Description
The dataset that we will analyze using MapReduce is taken from the UCI Machine Learning Repository and contains census information about a set of adults living in the USA. The original goal of the dataset was to predict whether the income of a person exceeds $50K per year.
- Name: UCI Adult dataset
- Location: http://archive.ics.uci.edu/ml/datasets/Adult
- Dataset attributes (column names) are:
  0. age
  1. workclass
  2. fnlwgt
  3. education
  4. education-num
  5. marital-status
  6. occupation
  7. relationship
  8. race
  9. sex
  10. capital-gain
  11. capital-loss
  12. hours-per-week
  13. native-country
  14. Classification

Download the adult.data file from the dataset Data Folder.
Exercise 8.1. Preparing a new MapReduce application
The goal of this exercise is to prepare a new MapReduce application by customizing the WordCount example code.
- Use your old IntelliJ MapReduce project or make a new one like in the last practice session
- Create folder named "inputdataset" inside your IntelliJ project
- Download the adult.data file from the dataset Data Folder page and move it inside the inputdataset folder.
  - You may have to delete the 2 last empty lines in the file, or you will get parsing errors.
- Create a new ee.ut.cs.ddpc package inside the project.
  - Right click on the src/main/java folder inside your project and choose New -> Package.
- Copy the WordCount example from the last practice session into the ee.ut.cs.ddpc package folder and rename it as Lab8MRApp.
- Modify the copied class and change the Map output and Reduce input & output types to the Text type.
- NB! Alternatively, if you want to make your life a bit easier, you can change the IntWritable types to DoubleWritable (instead of Text).
  - This makes the following tasks slightly easier, but will only allow you to use the "Pairs" approach from the lecture and not the "Stripes" approach in the last exercise.
- In the Mapper class:
- In the Reducer class:
- In the main() method:
- Also add the following lines to the main() method to reconfigure the map output key and value types to be Text:
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(Text.class);
- NB! Remove the Combiner configuration statement in the main method. The Reduce function that you will create can not be used directly as a Combiner this time.
//job.setCombinerClass(IntSumReducer.class);
Solving the errors.
- You will have a number of errors in the Lab8MRApp.java class now as a result of changing the output types of the Map and Reduce methods.
- This is because the Map and Reduce functions are still internally using IntWritable objects instead of Text objects. You should replace the IntWritable objects with Text objects.
- You do not have to solve all the errors now. It is not yet important to make sure that the application is executed without errors. You can solve the errors while completing the tasks in the next Exercise.
- Also, it is important to remember that Hadoop MapReduce does not use Java base types (Integer, String, Double) as input and output values, but rather Writable objects. Text and the other Writable objects are used slightly differently:
  - Text - contains a single String value
    - Creating a Text object: Text myText = new Text("My string");
    - Updating a Text object value: myText.set("New string value");
    - Reading a Text object value: String value = myText.toString();
  - IntWritable, LongWritable - contain a single numerical value of the respective type.
    - Creating a Writable object: IntWritable myWritable = new IntWritable(1);
    - Updating a Writable object value: myWritable.set(71);
    - Reading a Writable object value: int value = myWritable.get();
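Putting the pieces above together, a minimal sketch of what the Mapper could look like after the type change (this is illustrative only, not the required solution; it simply echoes each line under a placeholder key, and Exercise 8.2 replaces this with real column extraction):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: a Mapper whose output key AND value are both Text,
// matching the job.setMapOutputKeyClass/ValueClass(Text.class) settings.
public class SketchMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Placeholder logic: echo the whole input line under a fixed key.
        outKey.set("line");
        outValue.set(value.toString());
        context.write(outKey, outValue);
    }
}
```

Note how the reusable Text objects are updated with set() rather than created anew for each record, mirroring the Writable usage rules listed above.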
Exercise 8.2. Data analytics using MapReduce
The goal of this exercise is to analyse the UCI Adult dataset using MapReduce.
Modify the application to perform the following tasks on the UCI Adult data set:
- For each native-country, calculate the average hours-per-week
  - Map method:
    - Input key-value pair is: (line_nr, line), where line is a single line from the csv file, which contains 15 comma separated values.
    - You should split the input line by commas and output native-country as the key and hours-per-week as the value.
      - It is suggested to use the string.split(delimiter) method in Java (https://www.w3docs.com/snippets/java/how-to-split-a-string-in-java.html) rather than StringTokenizer, as it outputs an array, which makes it easier to extract specific dataset column values.
    - Output: (native-country, hours-per-week)
  - In Reduce:
    - Input is: (native-country, [hours-per-week]), where the input key is a unique native-country and the value is a list-like iterable object of all hours-per-week values from this native-country.
    - Compute the average of all hours-per-week values.
    - Output should be (native-country, avg_val)
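Before wiring this into Hadoop, the per-line parsing and the averaging can be tried out in plain Java. This is a sketch: column indices 13 and 12 follow the attribute list above, and note that adult.data puts a space after each comma, so the values need trimming:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: extract (native-country, hours-per-week) from one csv line and
// average the collected values, mirroring the Map and Reduce steps.
public class AvgSketch {

    // Map-side logic: split the line and pick the two columns of interest.
    static String[] extract(String line) {
        String[] cols = line.split(",");
        String country = cols[13].trim();   // native-country
        String hours = cols[12].trim();     // hours-per-week
        return new String[] { country, hours };
    }

    // Reduce-side logic: average all hours-per-week values for one key.
    static double average(List<String> hoursValues) {
        double sum = 0;
        for (String h : hoursValues) {
            sum += Double.parseDouble(h);
        }
        return sum / hoursValues.size();
    }

    public static void main(String[] args) {
        String line = "39, State-gov, 77516, Bachelors, 13, Never-married, "
                + "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, "
                + "United-States, <=50K";
        String[] kv = extract(line);
        System.out.println(kv[0] + " -> " + kv[1]);             // United-States -> 40
        System.out.println(average(Arrays.asList("40", "20"))); // 30.0
    }
}
```

In the real Mapper the extracted pair would be written with context.write() using Text objects instead of being returned.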
- In addition to the average value, also find the minimum and maximum values
  - Instead of writing out a single value using context.write(), the Reduce function should compute and output multiple values.
  - The output should be written as either:
    - 3 different key-value pairs, using 3 context.write() calls:
      ("native-country,MIN", min_val)
      ("native-country,AVG", avg_val)
      ("native-country,MAX", max_val)
      - This follows the "Pairs" approach described in the lecture
    - Or as a single key-value pair where the value object contains multiple results: (native-country, "min_val, avg_val, max_val")
      - This follows the "Stripes" approach described in the lecture
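The aggregation itself is ordinary Java. A sketch of computing all three statistics in one pass over the grouped values (a hypothetical helper method, not the required class layout):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: one pass over the grouped values computes min, avg and max,
// which the Reduce function can then emit as three pairs or one stripe.
public class MinAvgMaxSketch {

    static double[] minAvgMax(List<String> values) {
        double min = Double.MAX_VALUE;
        double max = -Double.MAX_VALUE;
        double sum = 0;
        for (String v : values) {
            double d = Double.parseDouble(v.trim());
            min = Math.min(min, d);
            max = Math.max(max, d);
            sum += d;
        }
        return new double[] { min, sum / values.size(), max };
    }

    public static void main(String[] args) {
        double[] stats = minAvgMax(Arrays.asList("40", "20", "60"));
        // "Pairs" style output would be three separate context.write() calls:
        System.out.println("United-States,MIN\t" + stats[0]); // 20.0
        System.out.println("United-States,AVG\t" + stats[1]); // 40.0
        System.out.println("United-States,MAX\t" + stats[2]); // 60.0
    }
}
```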
- Perform the previous analysis for each unique native-country AND workclass pair.
- Use the Map output key to modify which attributes the data is grouped by when it "arrives" in the Reduce method
- You can create a combined key. For example: ("native-country,workclass", value)
- And finally, instead of only the hours-per-week column, perform the analysis for every numerical column
- Create a loop inside the Map method that outputs a (key, value) pair for every column you want to process separately.
- Use a unique key for each column to make sure they are grouped separately in Reduce function. Simple way to achieve this is to add the Label of the numerical column as another component into the combined key (e.g (native-country,workclass,"AGE")).
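The Map-side loop can be sketched in plain Java as follows. The chosen numerical column indices and labels are assumptions based on the attribute list above; adjust them as needed:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: for one input line, emit a (combinedKey, value) pair per
// numerical column, so each column is grouped separately in Reduce.
public class ColumnLoopSketch {

    // Numerical columns from the attribute list (index -> label).
    static final int[] NUM_COLS = { 0, 2, 4, 10, 11, 12 };
    static final String[] LABELS = { "AGE", "FNLWGT", "EDUCATION-NUM",
            "CAPITAL-GAIN", "CAPITAL-LOSS", "HOURS-PER-WEEK" };

    static List<String[]> emit(String line) {
        String[] cols = line.split(",");
        String country = cols[13].trim();
        String workclass = cols[1].trim();
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < NUM_COLS.length; i++) {
            // Combined key: grouping attributes plus the column label.
            String key = country + "," + workclass + "," + LABELS[i];
            pairs.add(new String[] { key, cols[NUM_COLS[i]].trim() });
        }
        return pairs; // in the real Mapper: context.write(...) per pair
    }

    public static void main(String[] args) {
        String line = "39, State-gov, 77516, Bachelors, 13, Never-married, "
                + "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, "
                + "United-States, <=50K";
        for (String[] kv : emit(line)) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
        // e.g. first pair: United-States,State-gov,AGE  39
    }
}
```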
NB! Each sub-task should simply modify the existing code, you do not need to create 4 different applications!
Exercise 8.3. Passing user defined parameters to Map or Reduce tasks.
The goal of this exercise is to teach you how you can pass global variables to Map and Reduce tasks through the job context. This is necessary when running MapReduce jobs in clusters, as Map and Reduce tasks are executed on many servers concurrently, and it is not possible to pass global variables directly.
Now, let's change the MapReduce program to analyse entries only from a specific user defined occupation and ask users to provide the occupation value as the third argument to the program. The arguments to the program should become: input_folder output_folder occupation
Additional parameters can not be passed directly to the Map and Reduce tasks because we are not executing them directly when the application is running in a distributed manner (in a cluster). Instead, they should be written to the MapReduce Job Configuration before running the job. The Job Configuration object is made available to the Map and Reduce processes through the context, and configuration parameter values can be extracted from there.
- Modify the run configuration of your new application to contain three arguments: input_folder output_folder occupation instead of just two.
- Change how the main() method parses the input arguments as input and output folders. Currently it uses the last program argument as the output folder and all previous arguments as input folders, to allow you to define multiple input paths. Instead, change it to use the first argument as the input folder and the second as the output folder:
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
- The value of the user defined occupation should be read from the program arguments in the main() method and written into the MapReduce job configuration so that it is passed along to all the Map and Reduce processes.
  - You can use the conf.set(name, value); method to write the user defined occupation value into the MapReduce configuration.
    - You must call conf.set(name, value); before the Job job = Job.getInstance(conf, "word count"); line, otherwise it will not be passed along properly.
- Inside the Map method, use the context.getConfiguration().get(name, default_value); method to access the previously defined occupation value.
- Use the occupation value to filter out all the input data entries which do not match the user defined occupation value.
Deliverables
- MapReduce application source code.
- Only the final version of the MapReduce code is needed. You do not need to submit intermediate versions or separate versions for different tasks.
- Output files (part-r-0000*) of your job.
- Only the output of the final completed task needs to be submitted. The tasks are incremental and it will be clear which tasks have been completed.
- Answer the following questions:
- What are the respective advantages/disadvantages of using either:
  - separate key-value pairs to output the results
  - or comma separated combined values
Solutions to common issues and errors.
- If you want to use other data types, such as DoubleWritable, they can be imported from the org.apache.hadoop.io package, like this: import org.apache.hadoop.io.DoubleWritable;
- If you get an error about ClassNotFoundException:
  - Make sure you have selected the Include dependencies with the "Provided" scope option in the Run Configuration of your application.
- Make sure that you have removed the Combiner configuration statement in the main method. The Reduce function that you will create can not be used directly as a Combiner this time.
- NB! You will run into problems while running Hadoop MapReduce programs if your Windows username includes a space character. We suggest creating an alternative user on your computer whose username contains no whitespace characters and completing the exercise under that user.
- You should also avoid spaces in folder paths.
NB! If you run into any issues not covered here then contact the lab supervisor through Slack.