
Distributed Data Processing on the Cloud 2018/19 fall


Practice 4 - MapReduce in Information Retrieval

In this session, you will create a MapReduce application for computing Term Frequency–Inverse Document Frequency (TF-IDF) for a set of documents. As shown in the lecture, we need to execute multiple MapReduce jobs to calculate TF-IDF. You will create 4 MapReduce programs (inside a single IDE project) and execute them in a sequence, where each following MapReduce job uses the output of the previous one.

This time the dataset has once again been taken from http://www.gutenberg.org/.

References

The referenced documents and web sites contain supporting information for this practice session.

TF-IDF

  • Lecture MapReduce in Information Retrieval slides
  • http://en.wikipedia.org/wiki/Tf%E2%80%93idf
  • http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

Hadoop

  • Hadoop API: http://hadoop.apache.org/docs/stable/api/
  • Hadoop MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Things to keep in mind before you start

  1. When parsing arbitrary text files, you can never be sure you get what you expect as input. Thus, there might be errors in either Map or Reduce tasks. Use Java try/catch exception handling in such cases to allow the program to continue when it encounters an error (see the sketch after this list).
  2. As mentioned several times in the lab, be aware that you cannot always use the Reducer as a Combiner directly.
  3. Use plain text (UTF-8) format English books from http://www.gutenberg.org/ as input files again.
  4. The default MapReduce input format, TextInputFormat, assigns the byte offset of each line as the key when reading input data. To be able to get our own assigned keys in the second, third, or fourth MapReduce job, we need to use KeyValueTextInputFormat like this:
    • job.setInputFormatClass(KeyValueTextInputFormat.class);
  5. To automatically delete the output folder of your MapReduce application: (new Path(otherArgs[1])).getFileSystem(conf).delete(new Path(otherArgs[1]));
  6. Don't hesitate to contact the lab supervisor when you need support with technical issues or exercises.
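
To illustrate points 1, 4, and 5 above, here is a minimal sketch of a driver fragment and a defensive map() (variable and class names are illustrative and not taken from the lab skeletons; the usual Hadoop imports from org.apache.hadoop.conf, org.apache.hadoop.fs, and org.apache.hadoop.mapreduce are assumed):

    // Driver fragment for a follow-up job in the chain.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "tf-idf step");
    // Point 4: read the previous job's output as (Text key, Text value) pairs instead of (offset, line).
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // Point 5: remove an existing output folder so the job can be re-run without manual cleanup.
    Path outputPath = new Path(otherArgs[1]);
    outputPath.getFileSystem(conf).delete(outputPath, true);
    FileOutputFormat.setOutputPath(job, outputPath);

    // Point 1: wrap the parsing logic so one malformed record does not fail the whole task.
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // ... parse the record and emit results with context.write(...) ...
        } catch (Exception e) {
            // Skip the broken record and continue with the next one.
        }
    }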

Term Frequency–Inverse Document Frequency in MapReduce

Our goal is to compute the Term Frequency–Inverse Document Frequency (TF-IDF) for each document and word combination in the dataset.

The formula for TF-IDF is:

TFIDF(Word, Document) = n/N * log(D/m)

where:

  • n : number of occurrences of Word in Document
  • N : Total number of words in Document
  • D : Total number of documents in the dataset
  • m : number of documents Word occurs in
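
As an illustrative example (with made-up numbers): if a word occurs n = 3 times in a document of N = 100 words, and appears in m = 5 out of D = 10 documents, then TFIDF = 3/100 * log(10/5) ≈ 0.03 * 0.693 ≈ 0.021 (using the natural logarithm; a different logarithm base only rescales all values by a constant factor).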

A single MapReduce job is not enough to calculate TF-IDF. We need to create a chain of MapReduce jobs, where each following job uses the output of the previous MapReduce job as input. In this lab, we will create four MapReduce jobs, which perform the following tasks:

  1. Calculate the word count for each word and document.
  2. Calculate the total number of words for each document.
  3. Calculate the number of documents each word occurs in.
  4. Calculate the TF-IDF value for each word and document pair.



PS! 4 MapReduce jobs are not strictly required for calculating TF-IDF. It is possible to manage with 3 (check the Bonus task) or even 2 MapReduce jobs without reducing parallel efficiency.

Dataset Description

The dataset that we will analyze using MapReduce is taken from the Project Gutenberg free eBook repository.

  • Name: Free Gutenberg eBooks
  • Source: https://www.gutenberg.org/
  • Location in the Hadoop Cluster :
    • Testing dataset - /tmp/books folder in HDFS (10 books, 4MB)
    • Small dataset - /tmp/books_small folder in HDFS (96 books, 30MB)
    • Medium dataset - /tmp/books_medium folder in HDFS (769 books, 280MB)
    • Large dataset - /tmp/books_large folder in HDFS (7772 books, 2.8GB)
  • Link to download 10 books from the cluster: download_sample_books.zip
  • Command for bulk downloading from Gutenberg:
    • NB! You DO NOT need to download large datasets yourself!
    • This is given just as a reference!
    • wget -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
    • If you do use it, quote the URL as shown above; otherwise the shell interprets the & characters and the command stays running in the background.

Exercise 4.1 First MR job

Create a MapReduce job to calculate the word count for each word and document. You can reuse the previously created WordCount program from Practice session 2: Introduction to Hadoop MapReduce (Java).

  • You can download the Job1WordCount file as a MapReduce job skeleton for this exercise and modify it as needed.
  • The MapReduce job should consist of the following Map and Reduce functions:
    • Map:
      • Input: (LineNr, Line in document) as (key, value)
      • Function: Split the line into words and output word and filename as key and 1 as value.
      • Output: (word;filename, 1)
    • Reduce:
      • Input: (word;filename, [counts])
      • Function: Sum all the counts as n
      • Output: (word;filename, n)
  • Currently, the MapReduce application does not output any information about the executed program in the IDE. Add the following two lines at the start of the main() method:
org.apache.log4j.BasicConfigurator.configure();
org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.INFO);
  • Do the same for all the following MapReduce jobs.
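
A minimal sketch of the Map and Reduce functions described above (class names and the ";" key separator are illustrative; the Job1WordCount skeleton may structure this differently):

    // Mapper: emit (word;filename, 1) for every word on the input line.
    public static class WordDocMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String word : value.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word + ";" + fileName), ONE);
                }
            }
        }
    }

    // Reducer: sum all the 1's into n and emit (word;filename, n).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int n = 0;
            for (IntWritable count : values) {
                n += count.get();
            }
            context.write(key, new IntWritable(n));
        }
    }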

Exercise 4.2 Second MR job

Create a MapReduce job to calculate the total number of words for each document.

  • You can download the Job2WordFrequency file as a MapReduce job skeleton for this exercise and modify it as needed.
  • NB! Input folder of this job should be the output folder of the previous job!
  • The MapReduce job should consist of the following Map and Reduce functions:
    • Map:
      • Input: (word;filename, n)
      • Function: Change the key to be only the filename, and move the word into the value
      • Output: (filename, word;n)
    • Reduce:
      • Input: (filename, [word;n])
      • Function: Sum all the n's in the whole document as N and output every word again. You will have to make two passes: one to sum all the n's and one to write out all the words again one at a time (see the sketch below).
        • Hadoop's Iterable can only be traversed once, so you will have to store the values somewhere, such as an ArrayList.
      • Output: (word;filename, n;N)
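
A minimal sketch of the two-pass Reduce described above (illustrative names; it assumes the job uses KeyValueTextInputFormat so both the key and the value arrive as Text, and that java.util.List/ArrayList are imported):

    // Reducer: first pass sums all the n's into N, second pass re-emits every word.
    public static class DocTotalReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text fileName, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> buffered = new ArrayList<>(); // the Iterable can be traversed only once
            long totalWords = 0;                       // N
            for (Text value : values) {
                String wordAndCount = value.toString();           // "word;n"
                buffered.add(wordAndCount);
                totalWords += Long.parseLong(wordAndCount.split(";")[1]);
            }
            for (String wordAndCount : buffered) {
                String[] parts = wordAndCount.split(";");         // parts[0] = word, parts[1] = n
                context.write(new Text(parts[0] + ";" + fileName),
                              new Text(parts[1] + ";" + totalWords));
            }
        }
    }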

Exercise 4.3 Third MR job

Create a MapReduce job to calculate the number of documents each word occurs in.

  • You can download the Job3TotalFrequency file as a MapReduce job skeleton for this exercise and modify it as needed.
  • NB! Input folder of this job should be the output folder of the previous job!
  • The MapReduce job should consist of the following Map and Reduce functions:
    • Map:
      • Input: (word;filename, n;N)
      • Function: Move the filename to the value field and add 1 to the end of the value field
      • Output: (word, filename;n;N;1)
    • Reduce:
      • Input: (word, [filename;n;N;1])
      • Function: Sum all the 1's as m and move the filename back into the key
        • Again, you will have to go through the values twice: once to find the sum and once to output all the entries again (see the sketch below).
      • Output: (word;filename, n;N;m)
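
A minimal sketch of the Map step described above (illustrative names; the Reduce follows the same buffer-and-two-pass pattern as the Job 2 sketch, summing the trailing 1's into m):

    // Mapper: turn (word;filename, n;N) into (word, filename;n;N;1).
    public static class WordKeyMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = key.toString().split(";"); // parts[0] = word, parts[1] = filename
            context.write(new Text(parts[0]),
                          new Text(parts[1] + ";" + value.toString() + ";1"));
        }
    }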

Exercise 4.4 Fourth MR job

Create a MapReduce job to calculate the TF-IDF value for each word and filename pair.

  • You can download the Job4Tfidf file as a MapReduce job skeleton for this exercise and modify it as needed.
  • NB! Input folder of this job should be the output folder of the previous job!
  • The MapReduce job should consist of the following Map and Reduce functions:
    • Map:
      • Input: (word;filename, n;N;m)
      • Function: Calculate TF-IDF based on n, N, m, and D:
        • TFIDF = n/N * log(D/m)
        • D is the total number of Documents in the dataset and is known ahead of time.
        • NB! Be careful when dividing integers by integers! You should use doubles instead, otherwise the division will be truncated.
      • Output: (word;filename, TF-IDF)
    • There is no Reduce in this job. Map output is directly written to HDFS as output.
  • You should provide the value of D as an argument to the MapReduce application so that you do not need to modify the application to change the value of D.
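
A minimal sketch of the map-only job described above (illustrative names; here D is assumed to be passed through the job Configuration after being read from an extra command-line argument, e.g. otherArgs[2], in main()):

    // Driver fragment: pass D to the map tasks and disable the Reduce phase.
    job.getConfiguration().setLong("tfidf.total.documents", Long.parseLong(otherArgs[2])); // D
    job.setNumReduceTasks(0); // map output is written straight to HDFS

    // Mapper: turn (word;filename, n;N;m) into (word;filename, TF-IDF).
    public static class TfIdfMapper extends Mapper<Text, Text, Text, DoubleWritable> {
        private long totalDocuments; // D
        @Override
        protected void setup(Context context) {
            totalDocuments = context.getConfiguration().getLong("tfidf.total.documents", 1);
        }
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(";");
            double n = Double.parseDouble(parts[0]); // doubles avoid truncating integer division
            double N = Double.parseDouble(parts[1]);
            double m = Double.parseDouble(parts[2]);
            double tfidf = (n / N) * Math.log(totalDocuments / m);
            context.write(key, new DoubleWritable(tfidf));
        }
    }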

The output after the fourth MapReduce job should look something like this: Job 4 example output. It should not contain negative numbers unless you used an incorrect total number of documents.

The larger the TF-IDF value, the more 'important' the word is for characterizing that specific document.

Exercise 4.5 Run the whole job chain in the Hadoop Cluster

Hadoop Cluster:
Address: http://172.17.77.39:8889/
Location of books for input:

  • Testing dataset - /tmp/books folder in HDFS (10 books, 4MB) - Smaller dataset of English books for testing
  • Small dataset - /tmp/books_small folder in HDFS (96 books, 30MB)
  • Medium dataset - /tmp/books_medium folder in HDFS (769 books, 500MB)
  • Large dataset - /tmp/books_large folder in HDFS (7772 books, 6GB)

Take a screenshot of the cluster web interface which shows that you have successfully finished executing your MapReduce code in the cluster on at least the Medium dataset.

NB! To access local cloud resources you will have to be inside the university network. When working from home or a dormitory, you will need to set up a VPN connection to the university network.

  • UT VPN guide - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590

Bonus home exercise (Additional lab credit points)

  • Restructure your work to use 3 MapReduce jobs instead of 4 by combining Exercise 4.3 and Exercise 4.4 into one job.
  • Instead of executing 3 Java MapReduce Jobs in a sequence, combine all jobs into one Java executable and call all 3 jobs from a single program.
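
A minimal sketch of running the whole chain from a single main() (illustrative names; the createXxxJob helpers are hypothetical methods that each configure one of the jobs above, and the intermediate folder paths are placeholders):

    // Inside main(String[] args) throws Exception:
    // run the jobs one after another, stopping the chain if a job fails.
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job1 = createWordCountJob(conf, otherArgs[0], "/tmp/out1");    // Exercise 4.1
    if (!job1.waitForCompletion(true)) { System.exit(1); }

    Job job2 = createWordFrequencyJob(conf, "/tmp/out1", "/tmp/out2"); // Exercise 4.2
    if (!job2.waitForCompletion(true)) { System.exit(1); }

    Job job3 = createTfIdfJob(conf, "/tmp/out2", otherArgs[1]);        // combined Exercises 4.3 and 4.4
    System.exit(job3.waitForCompletion(true) ? 0 : 1);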

Deliverables

  • MapReduce application source code.
  • Sample output files of your jobs (at least one output file for each MapReduce job) when running them in the cluster (when using /tmp/books_medium or /tmp/books_large as the input folder)
    • Try to keep the samples of output files smaller than 10MB!
  • Screenshot of the cluster web interface which displays that you have successfully finished executing your MapReduce code in the cluster.
    • It is sufficient to take the screenshot of only the last MapReduce job.

Solutions to common issues and errors.

  • If your MapReduce job does not work properly in the cluster: Check the job logs at the Job browser: http://172.17.77.39:8889/hue/jobbrowser
  • If the Cluster Job editor does not properly display whether MapReduce job was completed: check the job browser at http://172.17.77.39:8889/hue/jobbrowser
  • If you get an error about running MapReduce example tests when creating the jar file, you can try deleting the tests (the test classes inside the /test/java/ folder of your project), running maven clean, and running maven package again.
  • If you get other errors when packaging your project as a .jar file in IntelliJ IDEA, then you can try a more general way of creating .jar artifacts:
    • Go to File -> Project Structure -> Project Settings -> Artifacts
    • Add a new JAR type artifact from modules and dependencies
    • Choose ee.ut.cs.ddpc.Lab3MRApp as the main class. Press OK.
    • In the Output Layout pane on the left side, remove everything except the hadoop-mapreduce-examples.jar and hadoop-mapreduce-examples compile output rows.
    • Press OK to close the window and go to Build -> Build Artifacts -> Build. The .jar file should now be generated in the out/artifacts/hadoop_mapreduce_examples_jar folder inside your project.
  • If you cannot access the cluster: to access local cloud resources you have to be inside the university network, so use a VPN connection to the university network if you are not inside the institute buildings.
    • UT VPN guide - https://wiki.ut.ee/pages/viewpage.action?pageId=17105590

NB! If you run into any issues not covered here then contact the lab supervisor.
