Practice 12 - Advanced MapReduce: Finding TF-IDF
The following documents and web sites contain supporting information for this practice.
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop MapReduce tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
- Lecture 08.05 - MapReduce in Information Retrieval slides
Exercise 12.1. Term frequency–inverse document frequency (TF-IDF) with MapReduce
Read through the referred documents to learn more about the term frequency–inverse document frequency algorithm. Your goal is to calculate the TF-IDF using MapReduce for a set of documents we have already uploaded to HDFS in the Hadoop cluster. This time the dataset is a bit larger and, once again, it has been taken from http://www.gutenberg.org/.
Create the MapReduce application to calculate TF-IDF of a given document set
- Download tfidf.zip as the basis for creating the application. It contains the application skeleton as an Eclipse project.
- Main class is in the tfidf package: tfidf.MapreduceSkeletonThird.java
- The program takes 3 arguments: <input folder> <output folder> <number of documents in the input folder>
- Extract the Eclipse project to a freely chosen folder
- Start Eclipse and create a new project, specifying your previously chosen folder as the project path so that the extracted files are used.
- As input for your Eclipse application you can use books.zip
- Unpack the books.zip file and use the resulting folder as the input of your MapReduce application in Eclipse.
- Once again, most of the work has been done for you and you only have to define the content of the Map and Reduce methods.
- However, as you should remember from the lecture slides, calculating TF-IDF requires several MapReduce jobs, so this time you will have to define 4 Map and 3 Reduce methods, each in a separate class.
- First MR job - Word counts for each word and document
- Input (LineNr, Line in document)
- Split the line into words and output each word.
- Output (word;filename, 1)
- Input (word;filename, [counts])
- Sum all the counts as n
- Output (word;filename, n)
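The first job's logic can be sketched in plain Java without the Hadoop API. The class and method names below are illustrative, not part of the provided skeleton; in the real job the filename comes from the input split and the summing happens in a Reducer.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java simulation of the first MR job: map emits (word;filename, 1)
// for every word of every line, reduce sums the counts per key into n.
public class WordCountJob {

    // Simulates map + reduce over the lines of one document.
    public static Map<String, Integer> run(String[] lines, String filename) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Map step: split the line into words, key each as "word;filename"
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (word.isEmpty()) continue;
                // Reduce step: sum all the 1's for this key as n
                counts.merge(word + ";" + filename, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In the real Mapper you would obtain the filename from the `FileSplit` of the task's input and emit `Text`/`IntWritable` pairs instead of using a `HashMap`.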
- Second MR job - Word frequency for each document
- Input (word;filename, n)
- change the key to be only filename, and move the word into value
- Output (filename, word;n)
- Input (filename, [word;n])
- Sum all the n's in the whole document as N and output every word again. You may need two passes: one to sum all the n's and one to write out each word again, one at a time!
- Iterators are one-traversal-only, so you will have to store the values in a structure such as an ArrayList or a HashMap.
- Output (word;filename, n;N)
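The second job's Reduce can be sketched as follows, again in plain Java with illustrative names. Because a Hadoop value iterator can only be traversed once, the sketch caches the values in an ArrayList during the first pass and re-emits them in the second:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the second job's reduce: for one filename the
// values arrive as "word;n" strings. Pass 1 caches them and sums the n's
// into N; pass 2 emits (word;filename, n;N) for every cached word.
public class DocFrequencyReduce {

    public static List<String> reduce(String filename, Iterable<String> values) {
        List<String> cached = new ArrayList<>();  // one-pass iterator workaround
        long bigN = 0;
        for (String v : values) {                 // first cycle: cache and sum n's
            cached.add(v);
            bigN += Long.parseLong(v.substring(v.indexOf(';') + 1));
        }
        List<String> out = new ArrayList<>();
        for (String v : cached) {                 // second cycle: emit every word
            String word = v.substring(0, v.indexOf(';'));
            String n = v.substring(v.indexOf(';') + 1);
            out.add(word + ";" + filename + "\t" + n + ";" + bigN);
        }
        return out;
    }
}
```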
- Third MR job - Word frequency in the whole dataset
- Input (word;filename, n;N)
- Move filename to value field and add 1 to the end of value field
- Output (word, filename;n;N;1)
- Input (word, [filename;n;N;1])
- Calculate the sum of the value field's last entries as m and move the filename back into the key
- Again, you will have to go through the values twice: once to find the sum and once to write out all the entries again, so you will again have to store the values somewhere.
- Output (word;filename, n;N;m)
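The third job's Reduce follows the same two-pass pattern; here m counts how many documents contain the word, since each value carries a trailing 1. A plain-Java sketch with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the third job's reduce: for one word the values
// arrive as "filename;n;N;1". Pass 1 caches them and sums the trailing
// 1's into m; pass 2 re-emits (word;filename, n;N;m) with the filename
// moved back into the key.
public class CorpusFrequencyReduce {

    public static List<String> reduce(String word, Iterable<String> values) {
        List<String> cached = new ArrayList<>();
        long m = 0;
        for (String v : values) {                 // pass 1: cache and sum the 1's
            cached.add(v);
            m += Long.parseLong(v.substring(v.lastIndexOf(';') + 1));
        }
        List<String> out = new ArrayList<>();
        for (String v : cached) {                 // pass 2: re-emit every entry
            String[] p = v.split(";");            // [filename, n, N, 1]
            out.add(word + ";" + p[0] + "\t" + p[1] + ";" + p[2] + ";" + m);
        }
        return out;
    }
}
```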
- Fourth MR job - Calculating the TF-IDF value
- Input (word;filename, n;N;m)
- Calculate TF-IDF from n, N, m and D. D is the total number of documents and is known ahead of time: it is the third program argument.
- TFIDF = n/N * log(D/m)
- NB! Be careful when dividing integers by integers: integer division truncates, so n/N would be 0 for any n < N. Use doubles (or cast to double) instead.
- Output (word;filename, TF-IDF)
- There is no Reduce in this job. Map output is directly written out.
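The map-only step boils down to one formula. A minimal sketch is shown below; note the casts to double, and note that `Math.log` is the natural logarithm here, while the lecture slides may use another base (the base only scales all scores uniformly):

```java
// Sketch of the fourth job's computation: TFIDF = n/N * log(D/m).
// The double casts matter: with int arithmetic, n / N truncates to 0
// whenever n < N, which is almost always.
public class TfIdf {

    public static double tfIdf(int n, int bigN, int m, int d) {
        return ((double) n / bigN) * Math.log((double) d / m);
    }
}
```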
NB! When parsing arbitrary text files, you can never be sure you will get the input you expect, so map or reduce tasks may hit errors. Use Java try/catch blocks in such cases so the program can continue when it encounters a malformed record.
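The guarding pattern can be sketched like this (the helper name is illustrative); in a real job you might also increment a Hadoop counter in the catch block to keep track of skipped records:

```java
// Sketch: skip malformed tokens instead of letting one bad record
// kill the whole map or reduce task.
public class SafeParse {

    public static long sumValid(String[] tokens) {
        long sum = 0;
        for (String t : tokens) {
            try {
                sum += Long.parseLong(t);
            } catch (NumberFormatException e) {
                // Malformed record: log/count it and continue with the rest.
            }
        }
        return sum;
    }
}
```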
Deploy the application in the cluster
- Export the application as a normal (not executable) JAR file
- Upload the jar file to the server using scp command
- Run the application
- We have uploaded 378 books in HDFS under "books" folder, so use that folder as input
- hadoop jar tfidf.jar tfidf.MapReduceSkeletonThird books FirstName_LastName/outbooks 378
- Measure how long it runs
- Save the output of the application
Cluster: IP: 220.127.116.11, username: ubuntu, password: ask the lab assistant if you do not remember it
Deliverables:
- MapReduce application source code
- Command line output of running the application in the cluster
Bonus:
- If you use 3 MapReduce jobs instead of 4, by removing the last job
- If you add and use combiners for the first 3 jobs successfully