Practice 8 - Designing MapReduce Applications
The goal of this exercise is to analyse an open dataset using MapReduce. You will create a new MapReduce application, download an example dataset from an open data repository, and perform a series of data processing tasks.
References
Here is a list of web sites and documents that contain supportive information for the practice.
Manuals
- MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/r2.9.2/api/
- IntelliJ IDEA: https://www.jetbrains.com/idea/
- Hadoop wiki https://hadoop.apache.org/
NB! It is critical to watch the MapReduce lectures before the Labs
- We will be using the MapReduce distributed computing model which was introduced in the lecture
- The lab programming tasks will be much harder to follow without understanding this model
Practical session communication!
There will be no physical lab sessions for the foreseeable future; the labs should be completed online.
- Lab supervisors will provide support through Slack
- We have set up a Slack workspace for lab and course related discussions. Invitation link was sent separately to all students through email.
- Use the #practice8-mr-algorithms Slack channel for lab related questions and discussion.
- When asking questions or support from lab assistant, please make sure to also provide all needed information, including screenshots, configurations, and even code if needed.
- If code needs to be shown to the lab supervisor, send it (or a link to it) through Slack Direct Messages.
- If you have not received Slack invitation, please contact lab assistants through email.
In case of issues check:
- Pinned messages in the #practice8-mr-algorithms Slack channel.
- The MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- The "Solutions to common issues and errors" section at the end of the guide.
- Ask in the #practice8-mr-algorithms Slack channel.
Dataset Description
The dataset that we will analyze using MapReduce is taken from the UCI Machine Learning Repository and contains census information about a set of adults living in the USA. The original goal of the dataset was to predict whether the income of a person exceeds $50K per year.
- Name: UCI Adult dataset
- Location: http://archive.ics.uci.edu/ml/datasets/Adult
- Dataset attributes (column names) are:
  0. age
  1. workclass
  2. fnlwgt
  3. education
  4. education-num
  5. marital-status
  6. occupation
  7. relationship
  8. race
  9. sex
  10. capital-gain
  11. capital-loss
  12. hours-per-week
  13. native-country
  14. Classification

Download the adult.data file from the dataset Data Folder.
Exercise 8.1. Preparing a new MapReduce application
The goal of this exercise is to prepare a new MapReduce application by customizing the WordCount example code.
- Use your old IntelliJ MapReduce project or make a new one like in the last practice session
- Create folder named "inputdataset" inside your IntelliJ project
- Download the adult.data file from the dataset Data Folder page and move it inside the inputdataset folder.
  - You may have to delete the 2 last empty lines in the file, or you will get parsing errors.
- Create a new ee.ut.cs.ddpc package inside the project.
  - Right click on the src/main/java folder inside your project and choose New -> Package.
- Copy the WordCount example from the last practice session into the ee.ut.cs.ddpc package folder and rename it as Lab8MRApp.
- Modify the copied class and change the Map output and Reduce input & output types to the Text type.
- NB! Alternatively, if you want to make your life a bit easier, you can change the IntWritable types to DoubleWritable (instead of Text).
  - This makes the following tasks slightly easier, but will only allow you to use the "Pairs" approach from the lecture and not the "Stripes" approach in the last exercise.
- In the Mapper class:
- In the Reducer class:
- In the main() method:
- Also add the following lines to the main() method to reconfigure the map output key and value types to be Text:
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(Text.class);
- NB! Remove the Combiner configuration statement in the main method. The Reduce function that you will create can not be used directly as a Combiner this time.
//job.setCombinerClass(IntSumReducer.class);
Solving the errors.
- You will have a number of errors in the Lab8MRApp.java class now as a result of changing the output types of the Map and Reduce methods.
- This is because the Map and Reduce functions are still internally using IntWritable objects instead of Text objects. You should replace the IntWritable objects with Text objects.
- You do not have to solve all the errors now. It is not yet important to make sure that the application is executed without errors. You can solve the errors while completing the tasks in the next Exercise.
- Also, it is important to remember that Hadoop MapReduce does not use Java base types (Integer, String, Double) as input and output values, but rather Writable objects. Text and the other Writable objects are used slightly differently:
  - Text - contains a single String value
    - Creating a Text object: Text myText = new Text("My string");
    - Updating a Text object value: myText.set("New string value");
    - Reading a Text object value: String value = myText.toString();
  - IntWritable, LongWritable - contain a single numerical value of the respective type.
    - Creating a Writable object: IntWritable myWritable = new IntWritable(1);
    - Updating a Writable object value: myWritable.set(71);
    - Reading a Writable object value: int value = myWritable.get();
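Putting the pieces above together, a minimal sketch of what the Mapper could look like after the type change (this is illustrative only, not the required solution; it simply echoes each line under a placeholder key, and Exercise 8.2 replaces this with real column extraction):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: a Mapper whose output key AND value are both Text,
// matching the job.setMapOutputKeyClass/ValueClass(Text.class) settings.
public class SketchMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Placeholder logic: echo the whole input line under a fixed key.
        outKey.set("line");
        outValue.set(value.toString());
        context.write(outKey, outValue);
    }
}
```

Note how the reusable Text objects are updated with set() rather than created anew for each record, mirroring the Writable usage rules listed above.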
Exercise 8.2. Data analytics using MapReduce
The goal of this exercise is to analyse the UCI Adult dataset using MapReduce.
Modify the application to perform the following tasks on the UCI Adult data set:
- For each native-country, calculate the average hours-per-week
  - Map method:
    - Input key-value pair is: (line_nr, line), where line is a single line from the csv file, which contains 15 comma separated values.
    - You should split the input line by commas and output native-country as the key and hours-per-week as the value.
      - It is suggested to use the string.split(delimiter) method in Java (https://www.w3docs.com/snippets/java/how-to-split-a-string-in-java.html) rather than StringTokenizer, as it outputs an array, which makes it easier to extract specific dataset column values.
    - Output: (native-country, hours-per-week)
  - In Reduce:
    - Input is: (native-country, [hours-per-week]), where the input key is a unique native-country and the value is a list-like iterable object of all hours-per-week values from this native-country.
    - Compute the average of all hours-per-week values.
    - Output should be (native-country, avg_val)
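Before wiring this into Hadoop, the per-line parsing and the averaging can be tried out in plain Java. This is a sketch: column indices 13 and 12 follow the attribute list above, and note that adult.data puts a space after each comma, so the values need trimming:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: extract (native-country, hours-per-week) from one csv line and
// average the collected values, mirroring the Map and Reduce steps.
public class AvgSketch {

    // Map-side logic: split the line and pick the two columns of interest.
    static String[] extract(String line) {
        String[] cols = line.split(",");
        String country = cols[13].trim();   // native-country
        String hours = cols[12].trim();     // hours-per-week
        return new String[] { country, hours };
    }

    // Reduce-side logic: average all hours-per-week values for one key.
    static double average(List<String> hoursValues) {
        double sum = 0;
        for (String h : hoursValues) {
            sum += Double.parseDouble(h);
        }
        return sum / hoursValues.size();
    }

    public static void main(String[] args) {
        String line = "39, State-gov, 77516, Bachelors, 13, Never-married, "
                + "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, "
                + "United-States, <=50K";
        String[] kv = extract(line);
        System.out.println(kv[0] + " -> " + kv[1]);             // United-States -> 40
        System.out.println(average(Arrays.asList("40", "20"))); // 30.0
    }
}
```

In the real Mapper the extracted pair would be written with context.write() using Text objects instead of being returned.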
- In addition to the average value, also find the minimum and maximum values
  - Instead of writing out a single value using context.write(), the Reduce function should compute and output multiple values.
  - The output should be written as either:
    - 3 different key-value pairs, using 3 context.write() calls:
      ("native-country,MIN", min_val)
      ("native-country,AVG", avg_val)
      ("native-country,MAX", max_val)
      - This follows the "Pairs" approach described in the lecture
    - Or as a single key-value pair where the value object contains multiple results: (native-country, "min_val, avg_val, max_val")
      - This follows the "Stripes" approach described in the lecture
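The aggregation itself is ordinary Java. A sketch of computing all three statistics in one pass over the grouped values (a hypothetical helper method, not the required class layout):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: one pass over the grouped values computes min, avg and max,
// which the Reduce function can then emit as three pairs or one stripe.
public class MinAvgMaxSketch {

    static double[] minAvgMax(List<String> values) {
        double min = Double.MAX_VALUE;
        double max = -Double.MAX_VALUE;
        double sum = 0;
        for (String v : values) {
            double d = Double.parseDouble(v.trim());
            min = Math.min(min, d);
            max = Math.max(max, d);
            sum += d;
        }
        return new double[] { min, sum / values.size(), max };
    }

    public static void main(String[] args) {
        double[] stats = minAvgMax(Arrays.asList("40", "20", "60"));
        // "Pairs" style output would be three separate context.write() calls:
        System.out.println("United-States,MIN\t" + stats[0]); // 20.0
        System.out.println("United-States,AVG\t" + stats[1]); // 40.0
        System.out.println("United-States,MAX\t" + stats[2]); // 60.0
    }
}
```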
- Perform the previous analysis for each unique native-country AND workclass pair.
- Use the Map output key to modify which attributes the data is grouped by when it "arrives" in the Reduce method
- You can create a combined key. For example: ("native-country,workclass", value)
- And finally, instead of only the hours-per-week column, perform the analysis for every numerical column
- Create a loop inside the Map method that outputs a (key, value) pair for every column you want to process separately.
- Use a unique key for each column to make sure they are grouped separately in Reduce function. Simple way to achieve this is to add the Label of the numerical column as another component into the combined key (e.g (native-country,workclass,"AGE")).
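The Map-side loop can be sketched in plain Java as follows. The chosen numerical column indices and labels are assumptions based on the attribute list above; adjust them as needed:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: for one input line, emit a (combinedKey, value) pair per
// numerical column, so each column is grouped separately in Reduce.
public class ColumnLoopSketch {

    // Numerical columns from the attribute list (index -> label).
    static final int[] NUM_COLS = { 0, 2, 4, 10, 11, 12 };
    static final String[] LABELS = { "AGE", "FNLWGT", "EDUCATION-NUM",
            "CAPITAL-GAIN", "CAPITAL-LOSS", "HOURS-PER-WEEK" };

    static List<String[]> emit(String line) {
        String[] cols = line.split(",");
        String country = cols[13].trim();
        String workclass = cols[1].trim();
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < NUM_COLS.length; i++) {
            // Combined key: grouping attributes plus the column label.
            String key = country + "," + workclass + "," + LABELS[i];
            pairs.add(new String[] { key, cols[NUM_COLS[i]].trim() });
        }
        return pairs; // in the real Mapper: context.write(...) per pair
    }

    public static void main(String[] args) {
        String line = "39, State-gov, 77516, Bachelors, 13, Never-married, "
                + "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, "
                + "United-States, <=50K";
        for (String[] kv : emit(line)) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
        // e.g. first pair: United-States,State-gov,AGE  39
    }
}
```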
NB! Each sub-task should simply modify the existing code, you do not need to create 4 different applications!
Exercise 8.3. Passing user defined parameters to Map or Reduce tasks.
The goal of this exercise is to teach you how you can pass global variables to Map and Reduce tasks through the job context. This is necessary when running MapReduce jobs in clusters, as Map and Reduce tasks are executed on many servers concurrently, and it is not possible to pass global variables directly.
Now, let's change the MapReduce program to analyse entries only from a specific user defined occupation and ask users to provide the occupation value as the third argument to the program. The arguments to the program should become: input_folder output_folder occupation
Additional parameters can not be passed directly to the Map and Reduce tasks because we are not executing them directly when the application is running in a distributed manner (in a cluster). Instead, they should be written to the MapReduce Job Configuration before running the job. The Job Configuration object is made available to the Map and Reduce processes through the context, and configuration parameter values can be extracted from there.
- Modify the run configuration of your new application to contain three arguments: input_folder output_folder occupation instead of just two.
- Change how the main() method parses the input arguments as input and output folders. Currently it uses the last program argument as the output folder and all previous arguments as input folders, to allow you to define multiple input paths. Instead, change it to use the first argument as the input folder and the second as the output folder:
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
- The value of the user defined occupation should be read from the program arguments in the main() method and written into the MapReduce job configuration so that it is passed along to all the Map and Reduce processes.
  - You can use the conf.set(name, value); method to write the user defined occupation value into the MapReduce configuration.
    - You must call conf.set(name, value); before the Job job = Job.getInstance(conf, "word count"); line, otherwise it will not be passed along properly.
- Inside the Map method, use the context.getConfiguration().get(name, default_value); method to access the previously defined occupation value.
- Use the occupation value to filter out all the input data entries which do not match the user defined occupation value.
Deliverables
- MapReduce application source code.
- Only the final version of the MapReduce code is needed. You do not need to submit intermediate versions or separate versions for different tasks.
- Output files (part-r-0000*) of your job.
- Only the output of the final completed task needs to be submitted. The tasks are incremental and it will be clear which tasks have been completed.
- Answer the following questions:
- What are the respective advantages/disadvantages of using either:
  - separate key-value pairs to output the results
  - or comma separated combined values
Solutions to common issues and errors.
- If you want to use other data types, such as DoubleWritable, they can be imported from the org.apache.hadoop.io package, like this: import org.apache.hadoop.io.DoubleWritable;
- If you get an error about ClassNotFoundException:
  - Make sure you have selected the Include dependencies with the "Provided" scope option in the Run Configuration of your application.
- Make sure that you have removed the Combiner configuration statement in the main method. The Reduce function that you will create can not be used directly as a Combiner this time.
- NB! You will run into problems while running Hadoop MapReduce programs if your Windows username includes a space character. We suggest creating an alternative user on your computer whose username contains no whitespace characters and completing the exercise under that user.
- You should also avoid spaces in folder paths.
NB! If you run into any issues not covered here then contact the lab supervisor through Slack.