Practice 4 - Introduction to MapReduce
In this practice session you will start working with the Apache Hadoop MapReduce framework. You will learn how to set up a MapReduce environment in the Eclipse IDE without installing or configuring Hadoop on your computer, and how to modify and run MapReduce programs.
References
The following documents and web sites contain supporting information for this practice session.
- Hadoop MapReduce tutorial
- List of companies using Hadoop and their use cases
- Hadoop wiki
- List of Hadoop tools, frameworks and engines
- Hadoop API
- Eclipse IDE
Exercise 4.1. Configuring Eclipse for Hadoop
A good IDE can help a lot when programming in any new language or framework. In this exercise we will set up Eclipse so that it can run Hadoop MapReduce programs in local machine mode without actually having to install Hadoop.
- If you are using Windows (otherwise skip this part):
- First download and install Cygwin
- Keep the default installation location, so that the Cygwin binaries end up in C:\cygwin\bin or C:\cygwin64\bin
- Set up the Windows environment variables for Cygwin:
- Open your System Control Panel: Control Panel -> System -> Advanced System Settings
- Find "Environment Variables" under the "Advanced" tab
- Modify PATH under System variables by appending the string ";C:\cygwin\bin" or ";C:\cygwin64\bin" (on a 64-bit system) to the end of it (do not forget the semicolon)
- This enables the Hadoop libraries to run Linux-style commands on your computer.
- If this does not work for you, try using a Linux virtual machine or completing the exercises on the lab computers.
- Start Eclipse (if you don't have it on your laptop, download "Eclipse IDE for Java Developers" from http://www.eclipse.org/downloads/)
- You may also need to install the Java SDK
- Create a new Java project
- Download the following files from Hadoop downloads:
- hadoop-0.23.11.tar.gz (for libraries)
- hadoop-0.23.11-src.tar.gz (for example sources)
- Unpack the downloaded files on your hard disk.
- Add Hadoop libraries to your project:
- Right-click on your project in the project explorer in Eclipse
- Choose: Build Path -> Configure Build Path
- Under the Libraries tab, choose "Add External JARs..."
- Browse to the folder into which you unpacked the downloaded hadoop-0.23.11.tar.gz
- Inside it, browse to /share/hadoop/ and add all .jar files from the following subfolders:
- /common and /common/lib
- /mapreduce and /mapreduce/lib
- /yarn
- Now your Eclipse is ready to execute MapReduce programs
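If you want to confirm that the JAR files from the steps above are really visible to your project, a small check like the following may help. It is only a sketch: the three class names are our own pick of one representative class per JAR group (common, mapreduce, yarn), and the class name HadoopClasspathCheck is hypothetical. It uses plain Java reflection, so it compiles even when the Hadoop libraries are missing.

```java
// Sanity check (a sketch): are the Hadoop JARs added to the build path visible?
class HadoopClasspathCheck {

    // One representative class per JAR group; this selection is an assumption.
    static final String[] EXPECTED_CLASSES = {
        "org.apache.hadoop.conf.Configuration",          // share/hadoop/common
        "org.apache.hadoop.mapreduce.Job",               // share/hadoop/mapreduce
        "org.apache.hadoop.yarn.conf.YarnConfiguration"  // share/hadoop/yarn
    };

    // Returns true if the named class can be loaded, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, HadoopClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : EXPECTED_CLASSES) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with Run As -> Java Application. If a line is reported as MISSING, recheck the corresponding subfolder in the build path.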
Exercise 4.2. Running the WordCount example in Eclipse
- Open the unpacked hadoop-0.23.11-src folder
- Copy the subfolder hadoop-mapreduce-project\hadoop-mapreduce-examples\src\main\java\org of the unpacked folder to your Eclipse project src folder.
- Find the WordCount class inside your Eclipse project.
- Try to run this class in Eclipse
- Right-click on the WordCount class -> Run As -> Java Application
- Create a folder named input inside the Eclipse project
- Download some (about 10) books in plain text (.txt) format from Project Gutenberg.
- Move the downloaded text files to the input folder in the Eclipse project.
- The WordCount class takes two command line arguments: the input folder and the output folder
- To modify the command line arguments, right-click the WordCount class -> Run As -> Run Configurations... -> Arguments
- Specify the previously created folder as the input folder and an arbitrarily named folder as the output folder
- When using relative folder paths in Eclipse, the folders are located inside the Eclipse project main folder
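If you are unsure where a relative folder name from the Arguments tab ends up, you can print its absolute location. The sketch below uses only the standard java.nio.file API, not Hadoop. The class name ShowArgumentPaths and the fallback names "input" and "output" are hypothetical. It assumes the run configuration keeps Eclipse's default working directory, which is the project folder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Prints where relative folder arguments resolve to (a sketch, standard Java only).
class ShowArgumentPaths {

    // Resolves a possibly relative folder name against the current working directory.
    static Path resolve(String folder) {
        return Paths.get(folder).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // Fall back to example names when no arguments are given.
        String in = args.length > 0 ? args[0] : "input";
        String out = args.length > 1 ? args[1] : "output";
        System.out.println("input  -> " + resolve(in));
        System.out.println("output -> " + resolve(out));
    }
}
```

Running it with the same arguments as WordCount shows the two folders the job would read from and write to.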
Exercise 4.3. Modifying the WordCount example Map and Reduce methods.
- As independent work:
- Modify the WordCount Map method to ignore punctuation marks (.,:;# etc.) so that it outputs only clean words.
- Modify the program so that, instead of calculating the global sum of words, it calculates the word count for each file separately.
- Change the key of the map output to also include the file name
- You can find the name of the file from which the input is taken by using:
FileSplit split = (FileSplit) context.getInputSplit();
String filename = split.getPath().toString();
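As a starting point for both modifications, the string handling can be tried out in plain Java before wiring it into the Map method. The following is one possible sketch, not the expected solution: the class WordCleaning, its method names and the tab separator in the key are all hypothetical choices. Every run of characters that is neither a letter nor a digit is treated as a separator, so "it's" is split into "it" and "s".

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java helpers for experimenting with word cleaning and per-file keys (a sketch).
class WordCleaning {

    // Splits a line into words; anything that is not a letter or digit separates words.
    static List<String> cleanWords(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {   // split may yield an empty leading token
                words.add(token);
            }
        }
        return words;
    }

    // Builds a composite key from file name and word; the tab separator is arbitrary.
    static String fileWordKey(String filename, String word) {
        return filename + "\t" + word;
    }

    public static void main(String[] args) {
        for (String w : cleanWords("Hello, world: it's #1; hello!")) {
            System.out.println(fileWordKey("book1.txt", w));
        }
    }
}
```

Inside the Map method, the composite string could then be wrapped in a Text object and written as the output key in place of the bare word. If you prefer a shorter key, split.getPath().getName() returns only the last component of the path.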
Bonus exercise
- Investigate the other available MapReduce examples
- Choose any of them that looks like it could be improved in some way (similar to what we did with WordCount)
- Avoid all the alternative Word Count examples!
- Also avoid making the exact same changes to other MapReduce examples!
- Document all the changes that you made and explain why you think these changes are beneficial.
- Submit the modified Java class files and your documentation to get bonus credit points
Deliverables:
- Modified WordCount.java file
- Output of running the modified program in Eclipse (saved in a text file)
- Answer the following questions (Use your mind, Google and other interesting sources of information):
- Why is the IntSumReducer class in the WordCount example used not only as a reducer (job.setReducerClass) but also as a combiner (job.setCombinerClass)?
- What advantages (if any) does it provide?
- Can all possible Reduce functions be used like this?
Solutions can no longer be submitted for this task.
Issues
- If you get an error complaining about a .staging file in a tmp directory, try to change the location of the tmp directory by adding the following VM argument to your program's Run configuration:
-Dhadoop.tmp.dir=tmp
- You can use the following code to automatically delete the output folder before running your MapReduce job:
(new Path(otherArgs[1])).getFileSystem(conf).delete(new Path(otherArgs[1]), true);
- Be careful with this as your MapReduce program has permissions to delete any folders in your home directory!
- Alternative download links are available from the university servers.
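Regarding the automatic deletion of the output folder mentioned above: because the delete call removes whatever path it is given, it may be worth guarding it. The sketch below is not the Hadoop FileSystem call from the tip. It swaps in the standard java.nio.file API, which is enough in local mode where the output folder lives on the local disk. The class SafeOutputCleanup and the method deleteInside are hypothetical names. The method refuses to delete anything that does not lie strictly inside the given base folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Guarded recursive delete for a local output folder (a sketch, standard Java only).
class SafeOutputCleanup {

    // Deletes 'folder' recursively, but only if it resolves strictly inside 'baseDir'.
    // Returns true if something was deleted, false if the folder did not exist.
    static boolean deleteInside(Path baseDir, String folder) throws IOException {
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(folder).normalize();
        if (target.equals(base) || !target.startsWith(base)) {
            throw new IllegalArgumentException("Refusing to delete " + target);
        }
        if (!Files.exists(target)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(target)) {
            // Reverse order puts files before the directories that contain them.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Uses the working directory as base and "output" as an example default.
        String out = args.length > 1 ? args[1] : "output";
        System.out.println("deleted: " + deleteInside(Paths.get(""), out));
    }
}
```

With this guard, an argument such as ../Documents or an absolute path outside the project causes an exception instead of a deletion.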