Practice 4 - Introduction to MapReduce
In this practice session you will start working with the Apache Hadoop MapReduce framework. You will learn how to set up a MapReduce environment on your computer using the Eclipse IDE without having to install or configure Hadoop. Then you will study an example MapReduce program, improve it, and execute it.
References
The referenced documents and websites contain supporting information for this practice session.
Manuals
- MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Eclipse IDE: http://www.eclipse.org/
Extra information
- List of companies using Hadoop and their use cases https://wiki.apache.org/hadoop/PoweredBy
- Hadoop wiki https://hadoop.apache.org/
- List of Hadoop tools, frameworks and engines https://hadoopecosystemtable.github.io/
Exercise 4.1. Installing prerequisites for Hadoop
A good IDE can help a lot when programming in any new language or framework. In this exercise we will set up Eclipse so it can run Hadoop MapReduce programs in local mode on a single machine, without actually having to install Hadoop.
- Install Eclipse if you don't have it on your laptop. You can download "Eclipse IDE for Java Developers" from http://www.eclipse.org/downloads/
- You may also need to install the Java JDK, version 8
- Download hadoop-2.8.3-src.tar.gz from the Hadoop downloads page
- Unpack the downloaded file on your hard disk.
- In Windows you can use 7zip or WinRAR
- In Linux, you can use the command: tar xf hadoop-2.8.3-src.tar.gz
Additional tasks if you are using Windows (Otherwise skip this section)
- NB! You will run into problems running Hadoop MapReduce programs if your Windows username includes a space character. We suggest that you create an alternative user account without any whitespace characters in its name and complete the exercise under that user.
- You should also avoid spaces in folder paths.
- Download and install Cygwin
- The installation location should be C:\cygwin64
- Continue with the following tasks while Cygwin is being installed, but check its progress every once in a while, because the installer will ask you to specify additional options during the installation process
- Set up Windows environment variables for Cygwin.
- Open your System Control Panel: Control Panel -> System -> Advanced System Settings
- Find "Environment Variables" under the "Advanced" tab
- Modify PATH under System variables by adding the string "C:\cygwin64\bin"
- NB! Make sure the installation path really is "C:\cygwin64\" (it might be C:\cygwin\ or some other folder instead)
- This will enable the Hadoop libraries to run Linux-style commands on your computer.
- If this does not work for you, try using a Linux Virtual Machine or completing the exercises on lab computers.
- Download the Windows utilities required for Hadoop.
- Create a new bin folder inside your previously downloaded and unpacked Hadoop 2.8.3 folder.
- Download hadoop-2.8.1.zip from the GitHub repository
- Unpack the archive, scan it with an antivirus program, and copy its contents into the newly created bin folder.
- Set up Windows environment variables for Hadoop.
- Open your System Control Panel: Control Panel -> System -> Advanced System Settings
- Find "Environment Variables" under the "Advanced" tab
- Modify the PATH variable under System variables by adding the path to the bin folder inside the unpacked Hadoop folder.
Exercise 4.2. Configuring Eclipse for Hadoop
- Start Eclipse IDE
- Most recent Eclipse versions come with Maven project support included. If yours does not, you can install the Maven plugin for Eclipse manually:
- In Eclipse, go to Help -> Install New Software -> Add
- Add the Maven repository http://download.eclipse.org/technology/m2e/releases
- Create a new Java project in Eclipse and convert it to a Maven project (right click on the project name -> Configure -> Convert to Maven Project)
- Add Hadoop libraries to your project using Maven:
- Modify the content of the pom.xml file inside your project folder to add Hadoop library dependencies
- Add the following block before the final </project> line:
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.8.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.8.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.8.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
- Save the pom.xml file.
- Now your Eclipse should be configured for executing MapReduce programs
- If you get an error about jdk.tools:
- Make sure your JAVA_HOME system variable points to the Java JDK installation path.
- Open your System Control Panel: Control Panel -> System -> Advanced System Settings
- Check if JAVA_HOME is set. Its value should be the main directory of your Java 8 JDK installation.
- If it does not exist, add a new JAVA_HOME system variable
- If you still get the same error, add the following dependency to the Maven pom.xml:
<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <scope>system</scope>
  <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
  <version>1.8.0_161</version>
</dependency>
- Change the <version>1.8.0_161</version> line to match the version of the Java JDK installed on your computer.
- Update the Maven configuration for your project
- Right click your project (in the Eclipse project explorer) and choose Maven -> Update Project
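- After the Maven update finishes, you can optionally verify that the Hadoop libraries resolve by compiling and running a small throwaway class like the sketch below. HadoopSetupCheck is just a hypothetical name for this check and is not part of the exercises; if it compiles and prints the configuration, the dependencies are set up correctly.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hypothetical throwaway class: if this compiles and runs inside Eclipse,
// the Hadoop Maven dependencies are resolving correctly.
public class HadoopSetupCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println("Loaded Hadoop Configuration: " + conf);
        System.out.println("Example Hadoop Path object: " + new Path("input"));
    }
}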
Exercise 4.3. Running the WordCount example in Eclipse
- Browse to the previously downloaded and unpacked hadoop-2.8.3-src directory on your computer
- Copy the hadoop-mapreduce-project\hadoop-mapreduce-examples\src\main\java\org folder into your Eclipse project's src folder.
- Find the WordCount class inside your Eclipse project (src/org/apache/hadoop/examples/)
- Create a new folder named input inside your Eclipse project. We will put all the input files for the WordCount MapReduce application there.
- Download 5 random books from Project Gutenberg in plain text (UTF-8) format
- Move the downloaded text files into the input folder.
- Modify the VM arguments of the project to add the Hadoop home directory location
- Right click WordCount class -> Run As -> Run Configuration -> Arguments
- Add -Dhadoop.home.dir=PATH_TO_HADOOP to the VM arguments, replacing PATH_TO_HADOOP with the actual path to the unpacked Hadoop 2.8.3 folder on your computer
- It should look something like this: -Dhadoop.home.dir=C:\Users\jakovits\Downloads\hadoop-2.8.3.tar\hadoop-2.8.3
- Try to execute this class in Eclipse
- Right click on the WordCount class -> Run As -> Java Application
- You will initially see an error concerning the number of supplied arguments.
- Modify the configuration of the WordCount class to change what arguments should be supplied to it.
- Right click WordCount class -> Run As -> Run Configuration -> Arguments
- The WordCount class takes two command line arguments: the input folder and the output folder (a sketch of the example's driver is shown after this list)
- Specify the previously created folder (where you moved the Gutenberg books) as the input folder and an arbitrarily named folder as the output folder
- When using relative folder paths in Eclipse, folders are created inside the Eclipse project's main folder
- If the execution is successful, the output will be written into the part-r-00000 file inside the output folder
- If you run the application again, the output folder must first be deleted or moved, or you must change the output argument to a new folder.
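- For reference, the main method of the bundled WordCount class looks roughly like the simplified sketch below (the real example is slightly more general and accepts several input folders); it shows where the two command line arguments end up and why the output folder must not already exist.
// Simplified sketch of the WordCount driver; TokenizerMapper and IntSumReducer
// are the inner classes of the bundled example and are not shown here.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    if (args.length != 2) {
        // This is the error you see before the run arguments are configured
        System.err.println("Usage: wordcount <input folder> <output folder>");
        System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer class reused as a combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. "input"
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. "output"; the job fails if it already exists
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}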
Exercise 4.4. Modifying the WordCount example.
Your goal in this task is to learn how to modify existing MapReduce applications. You will improve how the lines of text are tokenized into words and also change how the data is grouped between Map and Reduce tasks.
- Modify the WordCount map method to ignore punctuation marks (.,:;# etc.) and try to get it to output only clean, lowercase words.
- Modify the program so that instead of calculating the global count of words, it calculates the count of words inside each file independently.
- Instead of (word, count) we want to get (word, file, count)
- We have to change how the data is grouped between Map and Reduce tasks. Currently it is grouped by word.
- Change the map output key to also include the file name (see the sketch after this list)
- You can find the name of the file from which the input is taken by using:
FileSplit split = (FileSplit) context.getInputSplit();
String filename = split.getPath().toString();
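- As a minimal sketch (not the required solution), the TokenizerMapper inner class in WordCount.java could be modified along the following lines. The cleanup regex and the word<TAB>filename key format are assumptions, and you will also need to import org.apache.hadoop.mapreduce.lib.input.FileSplit:
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Name of the file the current input split was read from
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().getName(); // or getPath().toString() for the full path

        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Lowercase the token and drop every character that is not a letter
            String token = itr.nextToken().toLowerCase().replaceAll("[^a-z]", "");
            if (token.isEmpty()) {
                continue; // the token was pure punctuation or digits
            }
            // Group counts by (word, file) instead of just by word
            word.set(token + "\t" + filename);
            context.write(word, one);
        }
    }
}
- With a key like this, the IntSumReducer class can stay unchanged: it simply sums the counts for whatever key it receives, and the key now also carries the file name.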
Bonus exercise
- Look through the other available MapReduce examples (which you previously unpacked).
- Choose any of them that could be improved in some way, just like we did with WordCount.
- However, the improvements must be different from the changes we made in Exercise 4.4!
- Document all the changes you made and explain why you think these changes are beneficial.
- Submit the modified Java class file and your comments to get the bonus credit points.
Deliverables:
- Modified WordCount.java file
- Output of the modified program (part-r-00000)
- Answer the following questions:
- Why is the IntSumReducer class in the WordCount example used not only as a reducer (job.setReducerClass) but also as a combiner (job.setCombinerClass)?
- What advantages (if any) does it provide?
- Can all Reduce functions also be used directly as combiners?