Practice 11 - Analyzing data using MapReduce
References
The referenced documents and web sites contain supporting information for this practice.
Manuals
- Hadoop API: http://hadoop.apache.org/docs/stable/api/
- Hadoop MapReduce tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Exercise 11.1. Statistics with MapReduce
This time you will work with a real data set that you need to analyse using MapReduce.
Dataset
The data set is taken from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
Download the modified version (the header row has been removed, so the file can be used directly as input for the MapReduce application) from: winequality-red_mr.csv
Exercise
Create a MapReduce application to analyse the wine data set.
- Download and use MapReduceSkeletonSecond.java as the basis for creating the application.
- Input to Map is: <key, column values>
- The column values of an entry are separated by the ";" character.
- See the String.split() method.
- Column attributes are: "fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
- Map should extract the specific value you are working with and output it as <quality, column value>
- Reduce should then aggregate the values for each quality (key) it receives as input.
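The parsing step described above can be sketched in plain Java. This is a minimal illustration, not the provided MapReduceSkeletonSecond.java: the class and method names are our own, and in the real Mapper the extracted pair would be emitted with context.write() rather than returned.

```java
// Minimal sketch of the map-side parsing described above.
// Each input line holds the 12 semicolon-separated columns;
// quality is the last column, "fixed acidity" is column 0.
public class LineParseSketch {

    // Extract <quality, column value> from one CSV line.
    // Returns {quality, value} as strings.
    public static String[] parseLine(String line, int columnIndex) {
        String[] columns = line.split(";");
        String quality = columns[columns.length - 1]; // quality is the last column
        String value = columns[columnIndex];          // e.g. 0 = "fixed acidity"
        return new String[] { quality, value };
    }

    public static void main(String[] args) {
        // One sample row in the winequality-red_mr.csv format
        String line = "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5";
        String[] kv = parseLine(line, 0);
        System.out.println("<" + kv[0] + ", " + kv[1] + ">"); // prints <5, 7.4>
    }
}
```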
The Application must do the following:
- For each quality (last column value: 3,4,5,6,7,8), find the average value of "fixed acidity".
- Map should extract the specific column value for fixed acidity and output it as <quality, column value>
- Reduce should then aggregate the list of values it receives as input, calculate their average, and output <quality, average>
- In addition to the average, also calculate the median, maximum and minimum.
- You can output all the statistics at once as one string 'line', or output each of them as a separate key-value pair, where the key can be a combination of the ID and the statistic name.
- Update the MapReduce application to calculate the statistics for any given column.
- Give the column number as an argument, save it in the job configuration, and read its value from the job configuration in the setup() function!
- You can use context.getConfiguration().getInt("name", default_value); to read values from the configuration in the setup method.
- Update the MapReduce job to calculate statistics for every column at the same time. For example, create a new key by combining the previous key and the column name (key = key + "column_name").
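The per-quality statistics computed in reduce() can be sketched in plain Java, outside the Hadoop API. This is a sketch under the assumption that the reducer first collects its Iterable of values into a list; the helper names are illustrative, and in the real reducer the column index from the job configuration would be read via context.getConfiguration().getInt(...) in setup().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the statistics the reducer computes for one
// quality key. Helper names are illustrative, not from the skeleton.
public class StatsSketch {

    public static double average(List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    // Median: middle element for an odd count, mean of the two middle
    // elements for an even count (values are sorted first).
    public static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) return sorted.get(n / 2);
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static double min(List<Double> values) { return Collections.min(values); }
    public static double max(List<Double> values) { return Collections.max(values); }

    public static void main(String[] args) {
        // Fixed-acidity values collected for one quality key (sample data)
        List<Double> values = List.of(7.4, 7.8, 7.8, 11.2, 7.4);
        // One possible output format: all statistics as one string 'line'
        System.out.println("avg=" + average(values) + " median=" + median(values)
                + " min=" + min(values) + " max=" + max(values));
    }
}
```

If you output each statistic as a separate key-value pair instead, the key can combine the quality and the statistic name, e.g. "5_median".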
Deploy the application in the cluster (details on the previous lab page)
- Upload the application jar file and the winequality-red_mr.csv input file to the server using the scp command
- Upload the winequality-red_mr.csv file to HDFS
- Use the command:
hadoop fs -copyFromLocal <Server path> <HDFS path>
- Run the application
- Measure how long it runs
- Save the output of the application
Deliverables
- MapReduce application source code
- Output of running the application in the cluster.
- Answer the following questions.
- What is the highest pH value for quality 5 wine?
- What is the lowest fixed acidity for quality 6 wine?
- What is the median value of total sulfur dioxide for quality 4 wine?
- How hard was it to complete this exercise in comparison to the previous one?
- Which part of the exercises did you like better: Grid or Cloud?