Practice 10 - Introduction to MapReduce
References
The referenced documents and websites contain supporting information for this practice.
Manuals
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop MapReduce tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
- Eclipse IDE: http://www.eclipse.org/
Exercise 10.1. Setting up Eclipse for Hadoop
A good IDE can help a lot when programming in any language or framework. In this exercise we will set up Eclipse so it can run Hadoop MapReduce programs in local mode on your machine, without actually having to install Hadoop.
- This part applies only if you are using Windows; otherwise skip it.
- Start Eclipse. (If you don't have it on your laptop, download "Eclipse IDE for Java Developers" from http://www.eclipse.org/downloads/.)
- You may also need to install the Java JDK.
- Start a new Java project
- Add MapReduce skeleton to project src folder
- Download input file for the application
- Add libraries
- Download and unpack libraries needed from: lib.zip
- Right click on your project in the project explorer
- Choose: Build path -> Configure Build Path
- Under Libraries tab, choose "Add External JARs..."
- Add all of the unpacked jars!
- Most of the Java errors should now disappear in Eclipse.
- When you finish the program you can run and debug it in the normal way, without having to set up the MapReduce framework on your PC or laptop; the sketch below illustrates the idea.
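This works because a Hadoop job created with a default Configuration (no cluster settings on the classpath) runs entirely inside the local JVM, so Eclipse can run and debug it like any other Java program. The following driver is only a hypothetical sketch, using the identity Mapper/Reducer as placeholders; the class name and file paths are illustrative, and the provided skeleton already contains an equivalent main method.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalModeDemo {
    public static void main(String[] args) throws Exception {
        // A default Configuration has no cluster settings, so the job
        // runs in local mode inside this JVM.
        Configuration conf = new Configuration();
        Job job = new Job(conf, "local-demo");
        job.setJarByClass(LocalModeDemo.class);
        job.setMapperClass(Mapper.class);    // identity mapper (placeholder)
        job.setReducerClass(Reducer.class);  // identity reducer (placeholder)
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("input.txt"));
        FileOutputFormat.setOutputPath(job, new Path("output")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}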
Exercise 10.2. Pi in MapReduce
In the last practice you rewrote your Python Pi-estimation program in Java. This time you will update the program to use MapReduce for parallelization.
- Modify the Map and Reduce methods in the skeleton to calculate Pi (a minimal sketch of both methods follows this list)
- Map gets <line_nr, number> as an argument, where:
- line_nr is the line number from the input file(s)
- number is how many random numbers are to be generated and used.
- Map should evaluate the function sqrt(1-x^2) the specified number of times, generating a random argument x in [0,1) for each evaluation. It should then find the average of all the calculated function values.
- Map should output <0, avg>.
- Reduce gets <id, list<avg>> as an argument, where:
- id - can be ignored for now; all map task outputs should have the same id (key) value.
- list<avg> - the list of averages calculated by all the map tasks
- Reduce has to combine the results of the Map tasks to calculate Pi:
- Calculate the final average of the values received from the map tasks.
- Multiply that average by 4 to get the value of Pi (the average of sqrt(1-x^2) over [0,1] approximates Pi/4).
- Reduce should output <id, value_of_Pi>
- The rest of the functions needed for a MapReduce application are already implemented for you.
- The program has to print out the hostname of the machine!
- For example, use InetAddress.getLocalHost().getHostName()
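For reference, here is a minimal sketch of the two methods. It assumes the skeleton uses the org.apache.hadoop.mapreduce API with LongWritable/Text input and IntWritable/DoubleWritable intermediate types; the exact class and type names in your skeleton may differ.

import java.io.IOException;
import java.net.InetAddress;
import java.util.Random;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PiSketch {

    public static class PiMapper
            extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
        @Override
        protected void map(LongWritable lineNr, Text value, Context context)
                throws IOException, InterruptedException {
            // The line content says how many random samples to draw.
            int n = Integer.parseInt(value.toString().trim());
            Random rnd = new Random();
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                double x = rnd.nextDouble();   // random argument in [0,1)
                sum += Math.sqrt(1.0 - x * x); // f(x) = sqrt(1 - x^2)
            }
            // Emit the average under a constant key (0) so the outputs of
            // all map tasks end up in the same reduce call.
            context.write(new IntWritable(0), new DoubleWritable(sum / n));
        }
    }

    public static class PiReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable id, Iterable<DoubleWritable> avgs,
                              Context context)
                throws IOException, InterruptedException {
            // Print the hostname, as the exercise requires.
            System.out.println("Running on: "
                    + InetAddress.getLocalHost().getHostName());
            double sum = 0.0;
            int count = 0;
            for (DoubleWritable avg : avgs) {
                sum += avg.get();
                count++;
            }
            // The average of sqrt(1 - x^2) over [0,1] approximates Pi/4.
            context.write(id, new DoubleWritable(4.0 * sum / count));
        }
    }
}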
Exercise 10.3. Using MapReduce in a cluster
In this part of the exercise you will run your program in a real MapReduce cluster. This time we have set up a 1+4 node Hadoop MapReduce cluster in our cloud, where you can submit your application as a MapReduce job.
NB! The server's and HDFS's filesystems are completely separate and have different folder structures! MapReduce jobs read data only from the HDFS, and thus you have to move data to HDFS before you can use it as input for the MapReduce jobs.
Cluster:
IP: 54.242.93.149
username: ubuntu
password: on blackboard (make sure you note it down for future use)
- Try out WordCount example
- Log into the server using ssh
- Create a folder with your name on the server
mkdir Firstname_Lastname
- Create a folder with your name on HDFS
hadoop fs -mkdir Firstname_Lastname
- Upload the pg1112.txt file to HDFS (it is located in the server's home folder /home/ubuntu)
- It is taken from http://www.gutenberg.org/ebooks/1112
hadoop fs -copyFromLocal pg1112.txt Firstname_Lastname/pg1112.txt
- Run the Hadoop WordCount example
hadoop jar hadoop-1.0.4/hadoop-examples-1.0.4.jar wordcount Firstname_Lastname/pg1112.txt Firstname_Lastname/output
- Check results of the MapReduce job under HDFS folder Firstname_Lastname/output
- Listing folders in HDFS:
hadoop fs -ls <folder path>
- Printing out text file contents in HDFS:
hadoop fs -cat <file path>
Now it is time to try out your own MapReduce program
- Export your own program as a .jar file
- Right click on your eclipse project, choose: Export -> Java -> JAR file -> Next
- Choose the location where to save it, uncheck .classpath and .project. Click Finish.
- Upload the .jar file to the machine using scp
- Work in the folder ~/Firstname_Lastname you created earlier.
- For example:
scp pi.jar ubuntu@54.242.93.149:Firstname_Lastname/
- Copy input file to HDFS
hadoop fs -copyFromLocal input Firstname_Lastname/input
- Run the program in the Hadoop MapReduce framework
hadoop jar <path to jar file> <classpath> <arguments>
- For example:
hadoop jar pi.jar MapReduceSkeleton Firstname_Lastname/input Firstname_Lastname/output 5
- Note: To properly parallelize the computations, you can replace the single input file with a new folder where each line of the original file is in a separate file, as shown below.
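- A hypothetical way to prepare such a folder on the server (folder and file names are illustrative):
mkdir parts
split -l 1 input parts/part_
hadoop fs -copyFromLocal parts Firstname_Lastname/input_parts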
- Commands for working with the HDFS file system follow this style:
hadoop fs -ls /
hadoop fs -cat /user/root/test.txt
- NB! MapReduce output is always a folder!
Deliverables: (2 points)
- The Java program and its source code
- The output of the MapReduce job in a .txt file - do this when running in the cluster!
- It has to include the hostname of the machine running your Java code!
- A description of how hard it was for you to grasp the idea of MapReduce and to learn programming in this model.