Practice 4 - MapReduce in Information Retrieval
In this session, you will create a MapReduce application for computing Term Frequency–Inverse Document Frequency (TF-IDF) of a set of documents. As shown in the lecture, we need to execute multiple MapReduce jobs to calculate TF-IDF. You will create 4 MapReduce programs (inside a single IDE project) and execute them in a sequence, where each subsequent MapReduce job uses the output of the previous one.
The dataset has once again been taken from http://www.gutenberg.org/.
References
The referenced documents and web sites contain supporting information for this practice.
TF-IDF
- Lecture MapReduce in Information Retrieval slides
- http://en.wikipedia.org/wiki/Tf%E2%80%93idf
- http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
Hadoop
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Things to keep in mind before you start
- When parsing arbitrary text files, you can never be sure you get what you expect as input. Thus, there might be errors in either Map or Reduce tasks. Use Java try/catch exception handling in such cases to allow the program to continue when it gets an error.
- As mentioned several times in the lab, be aware that you cannot always use the Reducer as a Combiner directly.
- Use plain text (UTF-8) format English books from http://www.gutenberg.org/ as input files again.
- The default MapReduce input format TextInputFormat uses the position of each line in the file as the key when reading input data. To be able to use our own keys in the second, third, or fourth MapReduce job, we need to use KeyValueTextInputFormat like this:
job.setInputFormatClass(KeyValueTextInputFormat.class);
- To automatically delete the output folder of your MapReduce application (see the driver sketch after this list for both snippets in context):
(new Path(otherArgs[1])).getFileSystem(conf).delete(new Path(otherArgs[1]));
- Don't hesitate to contact lab supervisor when you need support with technical issues or exercises.
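For context, below is a minimal sketch of what a job driver's main() method could look like with both of the snippets above (KeyValueTextInputFormat and output-folder deletion) in place. It is an illustration under assumptions, not the skeleton code provided for the exercises; the class name and argument positions are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf, "example job");
        job.setJarByClass(ExampleDriver.class);
        // Mapper/Reducer classes and output key/value types omitted for brevity.

        // Read (key TAB value) pairs written by the previous job instead of line offsets.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        // Automatically delete the output folder so the job can be rerun without errors.
        (new Path(otherArgs[1])).getFileSystem(conf).delete(new Path(otherArgs[1]), true);
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}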
Term Frequency–Inverse Document Frequency in MapReduce
Our goal is to compute Term Frequency–Inverse Document Frequency (TF-IDF) for each document and word combination in the dataset.
The formula for TF-IDF is:
TFIDF(Word, Document) = n/N * log(D/m)
where:
- n: number of occurrences of Word in Document
- N: total number of words in Document
- D: total number of documents in the dataset
- m: number of documents Word occurs in
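As a quick sanity check with made-up numbers: if a word occurs n = 5 times in a document of N = 1000 words, and appears in m = 10 out of D = 100 documents, then TFIDF = 5/1000 * log(100/10) = 0.005 * log(10), which is 0.005 with a base-10 logarithm or roughly 0.0115 with the natural logarithm. The logarithm base only rescales all values by a constant factor, so it does not change which words rank as most important for a document.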
A single MapReduce job is not enough to calculate TF-IDF. We need to create a chain of MapReduce jobs, where each subsequent job uses the output of the previous MapReduce job as input. In this lab, we will create four MapReduce jobs, which perform the following tasks:
- Calculate the word count for each word and document.
- Calculate total number of words for each document.
- Calculate the number of documents each word occurs in.
- Calculate the TF-IDF value for each word and document pair.
PS! 4 MapReduce jobs are not strictly required for calculating TF-IDF. It is possible to manage with 3 (check the Bonus task) or even 2 MapReduce jobs without reducing parallel efficiency.
Dataset Description
The dataset that we will analyze using MapReduce is taken from the Project Gutenberg free eBook repository.
- Name: Free Gutenberg eBooks
- Source: https://www.gutenberg.org/
- Location in the Hadoop Cluster:
  - Testing dataset - /tmp/books folder in HDFS (10 books, 4MB)
  - Small dataset - /tmp/books_small folder in HDFS (96 books, 30MB)
  - Medium dataset - /tmp/books_medium folder in HDFS (769 books, 280MB)
  - Large dataset - /tmp/books_large folder in HDFS (7772 books, 2.8GB)
- Link to download 10 books from the cluster: download_sample_books.zip
- Command for bulk downloading from Gutenberg:
- NB! You DO NOT need to download large datasets yourself!
- This is given just as a reference!
wget -m -H http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en
- If you do use it, be careful: the unquoted & causes the shell to keep the command running in the background
Exercise 4.1 First MR job
Create a MapReduce job to calculate the word count for each word and document. You can reuse the previously created WordCount program from the Practice session 2: Introduction to Hadoop MapReduce (Java).
- You can download the Job1WordCount file as a MapReduce job skeleton for this exercise and modify it as needed.
- The MapReduce job should consist of the following Map and Reduce functions (a minimal sketch is given at the end of this exercise):
  - Map:
    - Input: (LineNr, Line in document) as (key, value)
    - Function: Split the line into words and output word and filename as key and 1 as value.
    - Output: (word;filename, 1)
  - Reduce:
    - Input: (word;filename, [counts])
    - Function: Sum all the counts as n
    - Output: (word;filename, n)
- Currently, the MapReduce application does not output any information about the executed program in the IDE. Add the following two lines at the start of the main() method:
org.apache.log4j.BasicConfigurator.configure();
org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
- Do the same for all the following MapReduce jobs.
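To make the Map and Reduce functions described above concrete, here is a minimal sketch of what they could look like as static inner classes of the job class. It is an illustration under assumptions (the new org.apache.hadoop.mapreduce API, splitting on non-word characters, lower-casing tokens), not the provided Job1WordCount skeleton; adapt it as needed.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: (LineNr, Line in document) -> (word;filename, 1)
public static class WordFileMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The name of the input file identifies the document.
        String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String token : value.toString().split("\\W+")) {
            if (token.isEmpty()) continue;
            outKey.set(token.toLowerCase() + ";" + filename);
            context.write(outKey, one);
        }
    }
}

// Reduce: (word;filename, [counts]) -> (word;filename, n)
public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (IntWritable v : values) n += v.get();
        context.write(key, new IntWritable(n));
    }
}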
Exercise 4.2 Second MR job
Create a MapReduce job to calculate the total number of words for each document.
- You can download the Job2WordFrequency file as a MapReduce job skeleton for this exercise and modify it as needed.
- NB! Input folder of this job should be the output folder of the previous job!
- The MapReduce job should consist of the following Map and Reduce functions (a sketch of the Reduce function follows this list):
  - Map:
    - Input: (word;filename, n)
    - Function: Change the key to be only filename, and move the word into the value
    - Output: (filename, word;n)
  - Reduce:
    - Input: (filename, [word;n])
    - Function: Sum all the n's in the whole document as N and output every word again. You will have to make two passes: one to sum all the n's and one to write out all the words again, one at a time.
      - The Hadoop Iterable is a one-traversal-only object, so you will have to store the values somewhere, such as in an ArrayList.
    - Output: (word;filename, n;N)
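Because the two passes over a one-traversal Iterable are the tricky part here, below is a hedged sketch of what the Reduce function could look like. It assumes the Map step emits values in the "word;n" form described above; it is not the provided Job2WordFrequency skeleton.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: (filename, [word;n]) -> (word;filename, n;N)
public static class WordFrequencyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // First pass: cache the values (the Iterable can only be traversed once)
        // and sum all the n's into N.
        List<String> cached = new ArrayList<>();
        long N = 0;
        for (Text v : values) {
            String wordAndCount = v.toString();        // "word;n"
            cached.add(wordAndCount);
            N += Long.parseLong(wordAndCount.split(";")[1]);
        }
        // Second pass: re-emit every word with both n and N attached.
        for (String wordAndCount : cached) {
            String[] parts = wordAndCount.split(";");  // [word, n]
            context.write(new Text(parts[0] + ";" + key.toString()),
                          new Text(parts[1] + ";" + N));
        }
    }
}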
Exercise 4.3 Third MR job
Create a MapReduce job to calculate the number of documents each word occurs in.
- You can download the Job3TotalFrequency file as a MapReduce job skeleton for this exercise and modify it as needed.
- NB! Input folder of this job should be the output folder of the previous job!
- The MapReduce job should consist of the following Map and Reduce functions (a sketch of the Map function follows this list):
  - Map:
    - Input: (word;filename, n;N)
    - Function: Move filename to the value field and add 1 to the end of the value field
    - Output: (word, filename;n;N;1)
  - Reduce:
    - Input: (word, [filename;n;N;1])
    - Function: Sum all the 1's as m and move filename back to the key
      - Again, you will have to look through the values twice: once to find the sum and once to output all the entries again.
    - Output: (word;filename, n;N;m)
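For illustration, the Map function of this job might look like the sketch below (assuming KeyValueTextInputFormat, so the key already arrives as "word;filename" and the value as "n;N"). The Reduce side then mirrors the two-pass pattern from Job 2: first sum the trailing 1's into m, then re-emit each cached entry as (word;filename, n;N;m). This is a sketch under assumptions, not the provided Job3TotalFrequency skeleton.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: (word;filename, n;N) -> (word, filename;n;N;1)
public static class TotalFrequencyMapper extends Mapper<Text, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = key.toString().split(";");            // [word, filename]
        outKey.set(parts[0]);                                   // key becomes just the word
        outValue.set(parts[1] + ";" + value.toString() + ";1"); // filename;n;N;1
        context.write(outKey, outValue);
    }
}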
Exercise 4.4 Fourth MR job
Create a MapReduce job to calculate the TF-IDF value for each word and filename pair.
- You can download the Job4Tfidf file as a MapReduce job skeleton for this exercise and modify it as needed.
- NB! Input folder of this job should be the output folder of the previous job!
- The MapReduce job should consist of the following Map and Reduce functions:
  - Map:
    - Input: (word;filename, n;N;m)
    - Function: Calculate TF-IDF based on n, N, m, and D: TFIDF = n/N * log(D/m)
      - D is the total number of documents in the dataset and is known ahead of time.
      - NB! Be careful when dividing integers by integers! You should use doubles instead, or the division will not work properly.
    - Output: (word;filename, TF-IDF)
  - There is no Reduce in this job. The Map output is written directly to HDFS as output.
- You should provide the value of D as an argument to the MapReduce application so that you do not need to modify the application to change the value of D (the sketch below shows one way to read D from the job Configuration).
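As a hedged sketch (not the provided Job4Tfidf skeleton), the Map function could read D from the job Configuration and do the calculation in doubles, roughly as follows; the property name "total.documents" is just an assumption chosen for this example.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: (word;filename, n;N;m) -> (word;filename, TF-IDF); there is no Reduce step.
public static class TfidfMapper extends Mapper<Text, Text, Text, DoubleWritable> {
    private double D;   // total number of documents, passed in from the driver

    @Override
    protected void setup(Context context) {
        // In the driver's main(): conf.set("total.documents", otherArgs[2]);
        D = Double.parseDouble(context.getConfiguration().get("total.documents"));
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(";");   // [n, N, m]
        double n = Double.parseDouble(parts[0]);
        double N = Double.parseDouble(parts[1]);
        double m = Double.parseDouble(parts[2]);
        // Doubles avoid integer division truncating n/N (and D/m) to zero.
        double tfidf = (n / N) * Math.log(D / m);
        context.write(key, new DoubleWritable(tfidf));
    }
}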
The output after the fourth MapReduce job should look something like this: Job 4 example output. It should not contain negative numbers unless you used an incorrect total number of documents.
The larger the TF-IDF value, the more 'important' the word is for characterizing that specific document.
Exercise 4.5 Run the whole job chain in the Hadoop Cluster
Hadoop Cluster:
Address: http://172.17.77.39:8889/
Location of books for input:
- Testing dataset - /tmp/books folder in HDFS (10 books, 4MB) - smaller dataset of English books for testing
- Small dataset - /tmp/books_small folder in HDFS (96 books, 30MB)
- Medium dataset - /tmp/books_medium folder in HDFS (769 books, 500MB)
- Large dataset - /tmp/books_large folder in HDFS (7772 books, 6GB)
Take a screenshot of the cluster web interface which shows that you have successfully finished executing your MapReduce code in the cluster on at least the Medium dataset.
NB! To access local cloud resources you have to be inside the university network. When working from home or a dormitory, you will need to set up a VPN connection to the university network.
- UT VPN guide - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590
Bonus home exercise (Additional lab credit points)
- Restructure your work to use 3 MapReduce jobs instead of 4 by combining Exercise 4.3 and Exercise 4.4 into one job.
- Instead of executing 3 Java MapReduce Jobs in a sequence, combine all jobs into one Java executable and call all 3 jobs from a single program.
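For the second bonus item, one possible approach (a sketch under assumptions; the class, job, and path names are placeholders) is to configure and run the jobs one after another from a single main(), waiting for each job to finish before starting the next:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TfidfChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path books = new Path(args[0]);            // input folder with the books
        Path out1 = new Path(args[1] + "/job1");   // intermediate outputs
        Path out2 = new Path(args[1] + "/job2");
        Path out3 = new Path(args[1] + "/job3");   // final TF-IDF output

        Job job1 = Job.getInstance(conf, "word count per document");
        // ... setJarByClass, Mapper/Reducer classes, key/value types ...
        FileInputFormat.addInputPath(job1, books);
        FileOutputFormat.setOutputPath(job1, out1);
        if (!job1.waitForCompletion(true)) System.exit(1);   // stop the chain on failure

        Job job2 = Job.getInstance(conf, "words per document");
        // ... Mapper/Reducer classes, KeyValueTextInputFormat ...
        FileInputFormat.addInputPath(job2, out1);
        FileOutputFormat.setOutputPath(job2, out2);
        if (!job2.waitForCompletion(true)) System.exit(1);

        Job job3 = Job.getInstance(conf, "document frequency and TF-IDF");
        // ... Mapper/Reducer classes, KeyValueTextInputFormat ...
        FileInputFormat.addInputPath(job3, out2);
        FileOutputFormat.setOutputPath(job3, out3);
        System.exit(job3.waitForCompletion(true) ? 0 : 1);
    }
}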
Deliverables
- MapReduce application source code.
- Sample output files of your jobs (at least one output file for each MapReduce job) when running them in the cluster (when using /tmp/books_medium or /tmp/books_large as the input folder)
  - Try to keep the sample output files smaller than 10MB!
- Screenshot of the cluster web interface which displays that you have successfully finished executing your MapReduce code in the cluster.
- It is sufficient to take the screenshot of only the last MapReduce job.
Solutions to common issues and errors.
- If your MapReduce job does not work properly in the cluster: Check the job logs at the Job browser: http://172.17.77.39:8889/hue/jobbrowser
- If the Cluster Job editor does not properly display whether the MapReduce job was completed: check the job browser at http://172.17.77.39:8889/hue/jobbrowser
- If you get an error about running the MapReduce example tests when creating the jar file, you can try deleting the tests (the test classes inside the /test/java/ folder of your project), running maven clean, and running maven package again.
- If you get other errors when packaging your project as a .jar file in IntelliJ IDEA, then you can try a more general way of creating .jar Artifacts:
  - Go to File -> Project Structure -> Project Settings -> Artifacts
  - Add a new JAR type artifact from modules and dependencies
  - Choose ee.ut.cs.ddpc.Lab3MRApp as the main class. Press OK.
  - On the left side Output layout window, remove everything but the hadoop-mapreduce-examples.jar and hadoop-mapreduce-examples compile output rows.
  - Press OK to close the window and go to Build -> Build Artifacts -> Build. The jar file should now be generated into the out/artifacts/hadoop_mapreduce_examples_jar folder inside your project.
- If you cannot access the cluster: to access local cloud resources you have to be inside the university network, so you should use a VPN connection to the university network if you are not inside institute buildings.
- UT VPN guide - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590
NB! If you run into any issues not covered here then contact the lab supervisor.