Practice 7 - Introduction to MapReduce
In this practice session we will start working with the Apache Hadoop MapReduce framework. You will learn how to set up and configure a MapReduce environment on your computer using IntelliJ IDEA, without having to install Hadoop. Then you will study the source code of an example MapReduce program, introduce a number of improvements, and execute it in a pseudo-distributed manner on your computer.
References
Here is a list of web sites and documents that contain supportive information for the practice.
Manuals
- MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- Hadoop API: http://hadoop.apache.org/docs/r2.9.2/api/
- IntelliJ IDEA: https://www.jetbrains.com/idea/
- Hadoop wiki: https://hadoop.apache.org/
Additional information
- List of companies using Hadoop and their use cases https://cwiki.apache.org/confluence/display/HADOOP2/PoweredBy
- List of Hadoop tools, frameworks and engines https://hadoopecosystemtable.github.io/
- Commercial Hadoop distributions and support https://cwiki.apache.org/confluence/display/HADOOP2/Distributions+and+Commercial+Support
NB! It is critical to watch the MapReduce lecture before the Lab
- We will be using the MapReduce distributed computing model which was introduced in the lecture
- The lab programming tasks will be much harder to follow without understanding this model
Practical session communication!
There will be no physical lab sessions for the foreseeable future; labs should be completed online.
- Lab supervisors will provide support through Slack
- We have set up a Slack workspace for lab and course related discussions. Invitation link was sent separately to all students through email.
- Use the `#practice7-mapreduce` Slack channel for lab-related questions and discussion.
- When asking for support from a lab assistant, please make sure to provide all needed information, including screenshots, configurations, and even code if needed.
- If code needs to be shown to the lab supervisor, send it (or a link to it) through Slack Direct Messages.
- If you have not received Slack invitation, please contact lab assistants through email.
In case of issues check:
- Pinned messages in the `#practice7-mapreduce` Slack channel.
- The MapReduce tutorial: http://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- The possible solutions to common issues section at the end of this guide.
- Ask in the `#practice7-mapreduce` Slack channel.
Exercise 7.1. Installing prerequisites for Hadoop
A good IDE can help a lot when programming in any new language or framework. In this exercise you will set up an IDE and configure your computer so that you can run Hadoop MapReduce programs in local machine mode without actually having to install Hadoop.
- Install IntelliJ IDEA if you don't have it already installed on your laptop.
  - You can download "IntelliJ IDEA Community Edition" from https://www.jetbrains.com/idea/
- You also need to install the Java JDK, version 8+ (if you don't yet have it installed).
- Download the following files from the Hadoop downloads page:
  - `hadoop-2.9.2-src.tar.gz`
  - `hadoop-2.9.2.tar.gz`
- Unpack the downloaded files on your hard disk.
  - In Windows you can use 7-Zip to unpack them.
  - In Linux, you can use the `tar xf hadoop-2.9.2-src.tar.gz` command to unpack them.
- We do not need to install Apache Hadoop on our computers. We will instead use IntelliJ to run MapReduce programs directly using the Hadoop MapReduce Java libraries.
Additional tasks if you are using Windows (Otherwise skip this section)
- Download windows utilities required for Hadoop.
  - Newer Hadoop libraries require additional native Windows libraries, which are not distributed with the Hadoop binaries by default. It is possible to build them from the Hadoop source code, but we will download pre-built versions instead to save time.
- Download `hadoop-2.8.1.zip` from the GitHub repository.
- Unpack the container, scan it with antivirus software, and copy its content (only files) into the previously created `hadoop-2.9.2/bin` folder. (NB! Not the source folder!)
- If you get an error about `MSVCR100.dll`, then you may have to download and install the Microsoft Visual C++ 2010 Redistributable Package (the 64-bit version if you're using a 64-bit OS).
Exercise 7.2 Configuring IntelliJ IDEA for Hadoop
We will import and configure the MapReduce example Java Project into IntelliJ IDEA.
- Start the IntelliJ IDEA application
- Choose Import Project from existing sources.
- Import the project from the `hadoop-mapreduce-project\hadoop-mapreduce-examples\pom.xml` file inside the previously unpacked folder of hadoop-2.9.2-src.tar.gz.
  - NB! Make sure to select pom.xml from the correct folder!
- NB! Make sure to select `Import Maven project automatically`.
- Wait until Maven has finished configuring the project dependencies.
- This may take several minutes (5-20 depending on internet speed).
- You may want to jump ahead to Exercise 7.4 while Maven is working on the project.
Exercise 7.3. Running the WordCount example in IntelliJ IDEA
In this exercise you will learn how to execute the MapReduce WordCount example through the IntelliJ IDEA. You will also configure the input and output directories for WordCount.
- Find the WordCount class inside your project (src/main/java/org.apache.hadoop.examples/)
- Create a new folder named input inside your project. We will put all the input files for the WordCount MapReduce application there.
- Download 5 random books from Project Gutenberg (https://www.gutenberg.org/) in Plain Text (UTF-8) format.
- Move the downloaded text files into the input folder.
- Let's also fix the logging output of the program. Currently it does not output any information about the executed program. Add the following two lines at the start of the `main()` method:
  org.apache.log4j.BasicConfigurator.configure();
  org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
- Try to run the `main()` method in this class as a Java application.
  - Right click on the WordCount class and choose `Run 'WordCount.main()'`.
  - You will initially see an error concerning the number of supplied arguments.
- Modify the run configuration of the WordCount class to change what arguments are supplied to it.
  - Go to `Run -> Edit Configurations` and select the WordCount class run configuration.
  - The WordCount class requires two Program arguments: the `input` folder and the `output` folder.
  - Specify the previously created folder (where you moved the Gutenberg books) as the input folder and an arbitrarily named folder as the output.
  - When using relative folder paths, folders are created inside the project main folder.
- NB! Modify the run configuration of the WordCount class to also have the `Include dependencies with "Provided" scope` option activated.
  - This will make sure that all the compile-time-only Maven dependencies are also available at run-time.
- Modify the run configuration's `Environment variables` to define where Apache Hadoop is located on your machine.
  - Click on the Browse button at the right side of the Environment variables field inside the Run Configuration window.
  - Set up `HADOOP_HOME`: it should point to the FULL PATH of the hadoop-2.9.2 folder.
  - Additional task if you are using Windows (otherwise skip setting up PATH): set up `PATH` so that it points to the FULL PATH of the hadoop-2.9.2\bin folder.
- After all the changes, the run configuration should look something like this:
- Run the WordCount example again.
- If the execution is successful, a new output folder should have been created
- Output will be written into part-r-00000 file inside the output folder
- NB! If you run the application again, output folder must first be deleted, moved or re-configured as a new folder.
- Check that the output of the program looks correct
- Each line of the output should contain two values, a Key and a Value, where:
  - Key: a word
  - Value: the number of times this word appeared in the input dataset (its word count)
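For illustration only, the key and value on each line are tab-separated, so the output could look something like this (the words and counts here are made up; yours depend entirely on the downloaded books):

```
gutenberg	6
project	12
the	4031
```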
Exercise 7.4 Familiarizing yourself with the WordCount example
Before we go to the programming tasks, it is important to understand some of the MapReduce coding aspects.
PS! If you jumped ahead while waiting for Maven to finish, the location of the WordCount.java file is:
- From inside the IntelliJ project: `src\main\java\org\apache\hadoop\examples`
- When opening it with a different editor: `hadoop-mapreduce-project\hadoop-mapreduce-examples\src\main\java\org\apache\hadoop\examples`
- It is important to note that Hadoop MapReduce does not use Java base types (Integer, String, Double) as input and output values, but rather Writable objects.
- `Text` - contains a single `String` value.
  - Creating a Text object: `Text myText = new Text("My string");`
  - Updating the Text object value: `myText.set("New string value");`
  - Reading the Text object value: `String value = myText.toString();`
- `IntWritable`, `LongWritable` - contain a single numerical value of the respective type.
  - Creating a Writable object: `IntWritable myWritable = new IntWritable(1);`
  - Updating the Writable object value: `myWritable.set(71);`
  - Reading the Writable object value: `int value = myWritable.get();`
- The goal of Writable types is to allow Hadoop to read and write data in a serialized form for transmission.
- It also allows users to define their own Writable classes for custom Key and Value data types (e.g., ImageWritable, MatrixWritable, PersonWritable) when needed.
- You can check the available Writable classes and an example user defined writable class here: https://hadoop.apache.org/docs/r3.0.1/api/org/apache/hadoop/io/Writable.html
- For custom Key objects, you actually have to implement the WritableComparable interface, not just Writable (a purely illustrative sketch follows below): https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/WritableComparable.html
- NB! You DO NOT have to create your own Writable data objects in this lab!
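Purely for illustration (again, you do not need this in this lab), here is a minimal sketch of what a user-defined key type could look like. The class name `WordFileWritable` and its fields are invented for this example:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key holding a (word, filename) pair.
public class WordFileWritable implements WritableComparable<WordFileWritable> {
    private String word = "";
    private String file = "";

    public WordFileWritable() {} // Hadoop requires a no-argument constructor

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word); // serialize the fields in a fixed order
        out.writeUTF(file);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF(); // deserialize in the same order as write()
        file = in.readUTF();
    }

    @Override
    public int compareTo(WordFileWritable other) { // defines how keys are sorted
        int cmp = word.compareTo(other.word);
        return cmp != 0 ? cmp : file.compareTo(other.file);
    }
}
```

In practice such a key class should also override `hashCode()` (the default partitioner uses it to assign keys to reduce tasks) and `equals()`.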
- Check through the MapReduce WordCount example code. It consists of the `WordCount` class, which contains:
  - The `TokenizerMapper` subclass, which defines the user-defined `map` method that is applied to the input list elements.
    - Input Key and Value types are: `(Object key, Text value)`
      - key = line offset
      - value = a single line from a text file
    - The map function splits the line by spaces using the `StringTokenizer` class and outputs (word, 1) for every word it finds.
    - Output Key and Value types are: `(Text word, IntWritable one)`
  - The `IntSumReducer` subclass, which defines the user-defined `reduce` method that is applied on the Map output.
    - Input Key and Value types are: `(Text key, Iterable<IntWritable> values)`
      - key = word
      - values contains all the ones (or partial sums produced by the combiner)
    - The reduce function iterates over the values, computes their sum, and outputs (word, sum).
    - Output Key and Value types are: `(Text key, IntWritable result)`
      - key = word
      - result = sum
  - The `main` method, which configures the WordCount MapReduce job.
    - It first checks the program arguments.
    - Defines which classes are used as the Mapper, Combiner, and Reducer:
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
    - Defines what the output types are:
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
    - And configures the job input and output paths.
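For orientation, the core of the example looks roughly as follows (abridged from WordCount.java; consult the actual file in your project for the authoritative version):

```java
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1) for every token
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up all counts received for this word
        }
        result.set(sum);
        context.write(key, result); // emit (word, total count)
    }
}
```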
Exercise 7.5 Modifying the WordCount example.
Your goal in this task is to learn how to modify existing MapReduce applications. You will improve how the WordCount application splits lines of text into words and will change how the data is grouped between Map and Reduce tasks.
One of the main problems with the simple MapReduce WordCount example is that it does not remove punctuation marks and other special characters, and as a result it counts words like `cat` and `cat.` separately.
We will also change how the data is grouped in the Reduce task. Initially all unique words are grouped together and their sum is computed. In the second task of this exercise we will change WordCount to calculate the count of words for each input file separately, simply by changing how the data is grouped in the Reduce task.
- Modify the WordCount map method to remove punctuation marks (.,:;# etc.) and try to get it to output only clean lowercase words.
- Modify the program so that instead of calculating the global count of words, it calculates the count of words for each file separately.
  - Instead of the current output (word, count) of the application, we want the output to be (word, file, count), where count is the number of occurrences of word in file.
  - We have to change how the data is grouped between the Map and Reduce tasks. Currently it is simply grouped by the unique word. We will change it to be grouped by both the word and the filename.
    - Change the map output key to also include the file name.
    - You can create a combined string as the key which contains both the word and the filename: (`word;filename`)
  - You can find the name of the file from which the input is taken by using the following methods:
    FileSplit split = (FileSplit) context.getInputSplit();
    String filename = split.getPath().toString();
  - A possible sketch of the modified map method is shown after this list.
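As referenced above, here is one possible shape for the modified map method. This is a sketch, not the only correct solution: the regular expression, the `;` separator, and `getPath().getName()` (which yields just the file name rather than the full path returned by `toString()`) are all just one set of choices, and `word` and `one` are the fields already defined in `TokenizerMapper`:

```java
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Find out which input file this line came from
    // (requires: import org.apache.hadoop.mapreduce.lib.input.FileSplit;).
    FileSplit split = (FileSplit) context.getInputSplit();
    String filename = split.getPath().getName();

    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Lowercase the token and strip everything that is not a letter or a digit.
        String token = itr.nextToken().toLowerCase().replaceAll("[^a-z0-9]", "");
        if (token.isEmpty()) {
            continue; // the token consisted only of punctuation
        }
        // Combined key, so that grouping in Reduce happens per (word, file) pair.
        word.set(token + ";" + filename);
        context.write(word, one);
    }
}
```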
Bonus exercise
The goal of the bonus task is to investigate the other available MapReduce example applications and to introduce additional improvements to one of them.
- Look through the other available MapReduce examples (which you previously unpacked) and choose one of them (other than WordCount) which you think could be improved in some way, like we did with WordCount.
  - However, the improvements must be different from the changes we made in Exercise 7.5.
    - For example: removing punctuation marks in a different MapReduce example will not be sufficient!
- Document all the changes you made and explain why you think these changes are beneficial.
  - Submit the modified Java class file and your comments to get the bonus credit points.
Deliverables:
- Modified WordCount.java file
- Output of the modified program (part-r-00000 file)
- Answer the following questions:
  - Why is the `IntSumReducer` class in the WordCount example used not only as a reducer (`job.setReducerClass`) but also as a combiner (`job.setCombinerClass`)?
    - What advantages (if any) does it provide?
    - Can all possible Reduce functions be used directly as combiners?
In case of issues, check these potential solutions to common issues:
- NB! You will run into problems while running Hadoop MapReduce programs if your Windows username includes a space character. We suggest that you create an alternative user on your computer whose username contains no whitespace characters and complete the exercise under that user.
- You should also avoid spaces in folder paths.
- If you get an error about missing libraries in IntelliJ IDEA, don't forget to activate the include dependencies with "provided" scope in the run configuration of your application.
- If you do not see any output when running the MapReduce program, make sure you have added the logging configuration from Exercise 7.3, i.e. the following two lines at the start of the main() method:
  org.apache.log4j.BasicConfigurator.configure();
  org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
- If you see a lot of classes with errors in your project after Maven has finished configuring it:
  - Make sure you have imported the project from the correct pom.xml file:
    - Import the project from the `hadoop-mapreduce-project\hadoop-mapreduce-examples\pom.xml` file inside the Hadoop source folder you have previously unpacked.
    - If the wrong file is imported, you may accidentally have imported the whole Hadoop MapReduce project source code, not just the MapReduce examples project.