Practice 7 - Introduction to MapReduce
In this practice session we will start working with the Apache Hadoop MapReduce framework. You will learn how to set up and configure a MapReduce environment on your computer using IntelliJ IDEA, without having to install Hadoop. Then you will study the source code of an example MapReduce program, introduce a number of improvements, and execute it in a pseudo-distributed manner on your computer.
References
Here is a list of web sites and documents that contain supportive information for the practice.
Manuals
- MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/r2.9.2/api/
- IntelliJ IDEA: https://www.jetbrains.com/idea/
- Hadoop wiki: https://hadoop.apache.org/
Additional information
- List of companies using Hadoop and their use cases https://cwiki.apache.org/confluence/display/HADOOP2/PoweredBy
- List of Hadoop tools, frameworks and engines https://hadoopecosystemtable.github.io/
- Commercial Hadoop distributions and support https://cwiki.apache.org/confluence/display/HADOOP2/Distributions+and+Commercial+Support
NB! It is critical to watch the MapReduce lecture before the Lab
- We will be using the MapReduce distributed computing model which was introduced in the lecture
- The lab programming tasks will be much harder to follow without understanding this model
Practical session communication!
There will be no physical lab sessions for the foreseeable future; labs should be completed online.
- Lab supervisors will provide support through Slack
- We have set up a Slack workspace for lab and course related discussions. Invitation link was sent separately to all students through email.
- Use the `#practice7-mapreduce` Slack channel for lab-related questions and discussion.
- When asking for support from a lab assistant, please make sure to provide all needed information, including screenshots, configurations, and even code if needed.
- If code needs to be shown to the lab supervisor, send it (or a link to it) through Slack Direct Messages.
- If you have not received Slack invitation, please contact lab assistants through email.
In case of issues check:
- Pinned messages in the `#practice7-mapreduce` Slack channel.
- The MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- The possible solutions to common issues section at the end of this guide.
- Ask in the `#practice7-mapreduce` Slack channel.
Exercise 7.1. Installing prerequisites for Hadoop
A good IDE can help a lot when programming in any new language or framework. In this exercise you will set up an IDE and configure your computer so that you can run Hadoop MapReduce programs in local machine mode without actually having to install Hadoop.
- Install IntelliJ IDEA if you don't have it already installed on your laptop.
  - You can download "IntelliJ IDEA Community Edition" from https://www.jetbrains.com/idea/
- You also need to install the Java JDK, version 8+ (if you don't yet have it installed).
- Download the following files from the Hadoop downloads page:
  - `hadoop-2.9.2-src.tar.gz`
  - `hadoop-2.9.2.tar.gz`
- Unpack the downloaded files on your hard disk.
  - In Windows you can use 7-Zip to unpack them.
  - In Linux, you can use the `tar xf hadoop-2.9.2-src.tar.gz` command to unpack them.
- We do not need to install Apache Hadoop on our computers. We will instead use IntelliJ to run MapReduce programs directly using the Hadoop MapReduce Java libraries.
Additional tasks if you are using Windows (Otherwise skip this section)
- Download windows utilities required for Hadoop.
  - Newer Hadoop libraries require additional native Windows libraries, which are not distributed with the Hadoop binaries by default. It is possible to build them from the Hadoop source code, but we will download pre-built versions instead to save time.
- Download `hadoop-2.8.1.zip` from the GitHub repository.
- Unpack the container, scan it with antivirus software, and copy its content (only files) into the previously created `hadoop-2.9.2/bin` folder. (NB! Not the source folder!)
- If you get an error about `MSVCR100.dll`, then you may have to download and install the Microsoft Visual C++ 2010 Redistributable Package (the 64-bit version if you're using a 64-bit OS).
Exercise 7.2 Configuring IntelliJ IDEA for Hadoop
We will import and configure the MapReduce example Java Project into IntelliJ IDEA.
- Start the IntelliJ IDEA application
- Choose Import Project from existing sources.
- Import the project from the `hadoop-mapreduce-project\hadoop-mapreduce-examples\pom.xml` file inside the previously unpacked folder of hadoop-2.9.2-src.tar.gz.
  - NB! Make sure to select pom.xml from the correct folder!
- NB! Make sure to select `Import Maven project automatically`.
- Wait until Maven has finished configuring the project dependencies.
- This may take several minutes (5-20 depending on internet speed).
- You may want to jump ahead to Exercise 7.4 while Maven is working on the project.
Exercise 7.3. Running the WordCount example in IntelliJ IDEA
In this exercise you will learn how to execute the MapReduce WordCount example through the IntelliJ IDEA. You will also configure the input and output directories for WordCount.
- Find the WordCount class inside your project (src/main/java/org.apache.hadoop.examples/)
- Create a new folder named input inside your project. We will put all the input files for the WordCount MapReduce application there.
- Download 5 random books from Project Gutenberg (https://www.gutenberg.org/) in Plain Text (UTF-8) format.
- Move the downloaded text files into the input folder.
- Let's also fix the logging output of the program. Currently it does not output any information about the executed program. Add the following two lines at the start of the `main()` method:
  org.apache.log4j.BasicConfigurator.configure();
  org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
- Try to run the `main()` method in this class as a Java application.
  - Right click on the WordCount class and choose `Run 'WordCount.main()'`.
  - You will initially see an error concerning the number of supplied arguments.
- Modify the run configuration of the WordCount class to change what arguments are supplied to it.
  - Go to `Run -> Edit Configurations` and select the WordCount class run configuration.
  - The WordCount class requires two Program arguments: the `input` folder and the `output` folder.
  - Specify the previously created folder (where you moved the Gutenberg books) as the input folder and an arbitrarily named folder as the output.
  - When using relative folder paths, folders are created inside the project main folder.
- NB! Modify the run configuration of the WordCount class to also have the `Include dependencies with "Provided" scope` option activated.
  - This will make sure that all the compile-time-only Maven dependencies are also available at run-time.
- Modify the run configuration's `Environment variables` to define where Apache Hadoop is located on your machine.
  - Click on the Browse button at the right side of the Environment variables field inside the Run Configuration window.
  - Set up `HADOOP_HOME`: it should point to the FULL PATH of the hadoop-2.9.2 folder.
  - Additional task if you are using Windows (otherwise skip setting up PATH): set up `PATH` so that it points to the FULL PATH of the hadoop-2.9.2\bin folder.
- After all the changes, the run configuration should look something like this:
- Run the WordCount example again.
- If the execution is successful, a new output folder should have been created
- Output will be written into part-r-00000 file inside the output folder
- NB! If you run the application again, output folder must first be deleted, moved or re-configured as a new folder.
- Check that the output of the program looks correct
- Each line of the output should contain two values, a Key and a Value, where:
  - Key: a word
  - Value: the number of times this word appeared in the input dataset (its word count)
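For illustration only, the key and value on each line are tab-separated, so the output could look something like this (the words and counts here are made up; yours depend entirely on the downloaded books):

```
gutenberg	6
project	12
the	4031
```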
Exercise 7.4 Familiarizing yourself with the WordCount example
Before we go to the programming tasks, it is important to understand some of the MapReduce coding aspects.
PS! If you jumped ahead while waiting for Maven to finish, the location of the WordCount.java file is:
- From inside the IntelliJ project: `src\main\java\org\apache\hadoop\examples`
- When opening it with a different editor: `hadoop-mapreduce-project\hadoop-mapreduce-examples\src\main\java\org\apache\hadoop\examples`
- It is important to note that Hadoop MapReduce does not use Java base types (Integer, String, Double) as input and output values, but rather Writable objects.
- `Text` - contains a single `String` value.
  - Creating a Text object: `Text myText = new Text("My string");`
  - Updating the Text object value: `myText.set("New string value");`
  - Reading the Text object value: `String value = myText.toString();`
- `IntWritable`, `LongWritable` - contain a single numerical value of the respective type.
  - Creating a Writable object: `IntWritable myWritable = new IntWritable(1);`
  - Updating the Writable object value: `myWritable.set(71);`
  - Reading the Writable object value: `int value = myWritable.get();`
- The goal of Writable types is to allow Hadoop to read and write data in a serialized form for transmission.
- It also allows users to define their own Writable classes for custom Key and Value data types (e.g., ImageWritable, MatrixWritable, PersonWritable) when needed.
- You can check the available Writable classes and an example user defined writable class here: https://hadoop.apache.org/docs/r3.0.1/api/org/apache/hadoop/io/Writable.html
- For custom Key objects, you actually have to implement the WritableComparable interface, not just Writable (a purely illustrative sketch follows below): https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/WritableComparable.html
- NB! You DO NOT have to create your own Writable data objects in this lab!
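Purely for illustration (again, you do not need this in this lab), here is a minimal sketch of what a user-defined key type could look like. The class name `WordFileWritable` and its fields are invented for this example:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key holding a (word, filename) pair.
public class WordFileWritable implements WritableComparable<WordFileWritable> {
    private String word = "";
    private String file = "";

    public WordFileWritable() {} // Hadoop requires a no-argument constructor

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word); // serialize the fields in a fixed order
        out.writeUTF(file);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF(); // deserialize in the same order as write()
        file = in.readUTF();
    }

    @Override
    public int compareTo(WordFileWritable other) { // defines how keys are sorted
        int cmp = word.compareTo(other.word);
        return cmp != 0 ? cmp : file.compareTo(other.file);
    }
}
```

In practice such a key class should also override `hashCode()` (the default partitioner uses it to assign keys to reduce tasks) and `equals()`.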
- Check through the MapReduce WordCount example code. It consists of the `WordCount` class, which contains:
  - The `TokenizerMapper` subclass, which defines the user-defined `map` method that is applied to the input list elements.
    - Input Key and Value types are: `(Object key, Text value)`
      - key = line offset
      - value = a single line from a text file
    - The map function splits the line by spaces using the `StringTokenizer` class and outputs (word, 1) for every word it finds.
    - Output Key and Value types are: `(Text word, IntWritable one)`
  - The `IntSumReducer` subclass, which defines the user-defined `reduce` method that is applied on the Map output.
    - Input Key and Value types are: `(Text key, Iterable<IntWritable> values)`
      - key = word
      - values contains all the ones (or partial sums produced by the combiner)
    - The reduce function iterates over the values, computes their sum, and outputs (word, sum).
    - Output Key and Value types are: `(Text key, IntWritable result)`
      - key = word
      - result = sum
  - The `main` method, which configures the WordCount MapReduce job.
    - It first checks the program arguments.
    - Defines which classes are used as the Mapper, Combiner, and Reducer:
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
    - Defines what the output types are:
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
    - And configures the job input and output paths.
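For orientation, the core of the example looks roughly as follows (abridged from WordCount.java; consult the actual file in your project for the authoritative version):

```java
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1) for every token
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up all counts received for this word
        }
        result.set(sum);
        context.write(key, result); // emit (word, total count)
    }
}
```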
Exercise 7.5 Modifying the WordCount example.
Your goal in this task is to learn how to modify existing MapReduce applications. You will improve how the WordCount application splits lines of text into words and will change how the data is grouped between Map and Reduce tasks.
One of the main problems with the simple MapReduce WordCount example is that it does not remove punctuation marks and other special characters, and as a result it counts words like `cat` and `cat.` separately.
We will also change how the data is grouped in the Reduce task. Initially all unique words are grouped together and their sum is computed. In the second task of this exercise we will change WordCount to calculate the count of words for each input file separately, simply by changing how the data is grouped in the Reduce task.
- Modify the WordCount map method to remove punctuation marks (.,:;# etc.) and try to get it to output only clean lowercase words.
- Modify the program so that instead of calculating the global count of words, it calculates the count of words for each file separately.
  - Instead of the current output (word, count) of the application, we want the output to be (word, file, count), where count is the number of occurrences of word in file.
  - We have to change how the data is grouped between the Map and Reduce tasks. Currently it is simply grouped by the unique word. We will change it to be grouped by both the word and the filename.
    - Change the map output key to also include the file name.
    - You can create a combined string as the key which contains both the word and the filename: (`word;filename`)
  - You can find the name of the file from which the input is taken by using the following methods:
    FileSplit split = (FileSplit) context.getInputSplit();
    String filename = split.getPath().toString();
  - A possible sketch of the modified map method is shown after this list.
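As referenced above, here is one possible shape for the modified map method. This is a sketch, not the only correct solution: the regular expression, the `;` separator, and `getPath().getName()` (which yields just the file name rather than the full path returned by `toString()`) are all just one set of choices, and `word` and `one` are the fields already defined in `TokenizerMapper`:

```java
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Find out which input file this line came from
    // (requires: import org.apache.hadoop.mapreduce.lib.input.FileSplit;).
    FileSplit split = (FileSplit) context.getInputSplit();
    String filename = split.getPath().getName();

    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Lowercase the token and strip everything that is not a letter or a digit.
        String token = itr.nextToken().toLowerCase().replaceAll("[^a-z0-9]", "");
        if (token.isEmpty()) {
            continue; // the token consisted only of punctuation
        }
        // Combined key, so that grouping in Reduce happens per (word, file) pair.
        word.set(token + ";" + filename);
        context.write(word, one);
    }
}
```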
Bonus exercise
The goal of the bonus task is to investigate the other available MapReduce example applications and to introduce additional improvements to one of them.
- Look through the other available MapReduce examples (which you previously unpacked) and choose one of them (other than WordCount) which you think could be improved in some way, like we did with WordCount.
  - However, the improvements must be different from the changes we made in Exercise 7.5.
    - For example: removing punctuation marks in a different MapReduce example will not be sufficient!
- Document all the changes you made and explain why you think these changes are beneficial.
  - Submit the modified Java class file and your comments to get the bonus credit points.
Deliverables:
- Modified WordCount.java file
- Output of the modified program (part-r-00000 file)
- Answer the following questions:
  - Why is the `IntSumReducer` class in the WordCount example used not only as a reducer (`job.setReducerClass`) but also as a combiner (`job.setCombinerClass`)?
    - What advantages (if any) does it provide?
    - Can all possible Reduce functions be used directly as combiners?
In case of issues, check these potential solutions to common issues:
- NB! You will run into problems while running Hadoop MapReduce programs if your Windows username includes a space character. We suggest that you create an alternative user on your computer whose username contains no whitespace characters and complete the exercise under that user.
- You should also avoid spaces in folder paths.
- If you get an error about missing libraries in IntelliJ IDEA, don't forget to activate the include dependencies with "provided" scope in the run configuration of your application.
- If you do not see any output when running the MapReduce program, make sure you have added the logging configuration from Exercise 7.3, i.e. the following two lines at the start of the main() method:
  org.apache.log4j.BasicConfigurator.configure();
  org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
- If you see a lot of classes with errors in your project after Maven has finished configuring it:
  - Make sure you have imported the project from the correct pom.xml file:
    - Import the project from the `hadoop-mapreduce-project\hadoop-mapreduce-examples\pom.xml` file inside the Hadoop source folder you have previously unpacked.
    - If the wrong file is imported, you may accidentally have imported the whole Hadoop MapReduce project source code, not just the MapReduce examples project.