Basics of Grid and Cloud Computing (MTAT.08.011), spring 2012/13

Practice 11 - Analyzing data using MapReduce

References

The documents and web sites listed here contain supporting information for the practice.

Manuals

  • Hadoop API: http://hadoop.apache.org/docs/stable/api/
  • Hadoop MapReduce tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html

Exercise 11.1. Statistics with MapReduce

This time you will work with an actual data set that you need to analyse using MapReduce.

Dataset

The data set is taken from http://archive.ics.uci.edu/ml/datasets/Wine+Quality

Download the modified version, winequality-red_mr.csv, in which the first row has been removed so that the file can be used directly as input for a MapReduce application.

Exercise

Create a MapReduce application to analyse the wine data set.

  • Download and use MapReduceSkeletonSecond.java as the basis for creating the application.
    • Input to Map is: <key, column values>
    • The column values of each entry are separated by the ";" character.
      • See the String.split() method.
    • Column attributes are: "fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
    • Map should extract the specific value you are working with and output it as <quality, column value>
    • Reduce should then aggregate the values for each quality (key) it receives as input.
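
The map step described above can be sketched in plain Java, without the Hadoop classes, to show how a row is split and turned into a <quality, column value> pair. The class name WineMapSketch, the method mapRow and the sample row are illustrative assumptions and are not part of MapReduceSkeletonSecond.java; in the real job the same logic would sit inside the Mapper's map() method and the pair would be written to the context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java sketch of the map step (hypothetical names, no Hadoop dependency).
class WineMapSketch {
    // "quality" is the last of the 12 columns, so its 0-based index is 11.
    static final int QUALITY_INDEX = 11;

    // Splits one input row on ";" and returns the pair <quality, column value>.
    // columnIndex is 0-based: 0 = fixed acidity, 8 = pH, 10 = alcohol.
    static Map.Entry<Integer, Double> mapRow(String row, int columnIndex) {
        String[] fields = row.trim().split(";");
        int quality = Integer.parseInt(fields[QUALITY_INDEX].trim());
        double value = Double.parseDouble(fields[columnIndex].trim());
        return new SimpleEntry<>(quality, value);
    }

    public static void main(String[] args) {
        // Illustrative row in the same 12-column layout as the input file.
        String row = "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5";
        Map.Entry<Integer, Double> out = mapRow(row, 0);
        System.out.println(out.getKey() + "\t" + out.getValue());
    }
}
```

Keeping the column index as a parameter means the same method can later serve the "any given column" variant of the exercise.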

The application must do the following:

  1. For each quality (last column value: 3,4,5,6,7,8), find the average value of "fixed acidity".
    • Map should extract the specific column value for fixed acidity and output it as <quality, column value>
    • Reduce should then aggregate the list of values it gets as an input, calculate their average and output <quality, average>
  2. In addition to the average, also calculate the median, maximum and minimum.
    • You can output all the statistics at once as a single string 'line', or output each of them as a separate key-value pair, where the key can be a combination of ID and statistic_name.
  3. Update the MapReduce application to calculate the statistics for any given column.
    • Pass the column number as an argument, save it in the job configuration, and read its value from the job configuration in the setup() method.
    • You can use context.getConfiguration().getInt("name", default_value) to read the value from the configuration in the setup() method.
  4. Update the MapReduce job to calculate the statistics for every column at the same time. For example, create a new key by combining the previous key and the column name (key = key + "column_name").
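
The reduce-side statistics from steps 1 and 2, and the combined key from step 4, can be sketched in plain Java as follows. The names WineReduceSketch, summarize and compositeKey, and the "_" separator, are assumptions made for illustration. In a real Reducer the values arrive as an Iterable that can be traversed only once and whose Writable object Hadoop reuses, so each value would need to be copied into a list like the one used here before the median can be computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the reduce step (hypothetical names, no Hadoop dependency).
class WineReduceSketch {
    // Column names in file order, as listed in the exercise.
    static final String[] COLUMNS = {
        "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
        "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
        "pH", "sulphates", "alcohol", "quality"
    };

    // Combined key for step 4, for example "5_pH"; the separator is an arbitrary choice.
    static String compositeKey(int quality, int columnIndex) {
        return quality + "_" + COLUMNS[columnIndex];
    }

    // Returns {average, median, minimum, maximum} for a non-empty list of values.
    static double[] summarize(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values); // copy, then sort for median/min/max
        Collections.sort(sorted);
        int n = sorted.size();
        double sum = 0.0;
        for (double v : sorted) {
            sum += v;
        }
        // Middle element for odd n, mean of the two middle elements for even n.
        double median = (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
        return new double[] { sum / n, median, sorted.get(0), sorted.get(n - 1) };
    }

    public static void main(String[] args) {
        // Made-up values, only to show the call; not taken from the data set.
        double[] s = summarize(Arrays.asList(4.0, 1.0, 3.0, 2.0));
        System.out.println(compositeKey(5, 8)
                + "\tavg=" + s[0] + " median=" + s[1] + " min=" + s[2] + " max=" + s[3]);
    }
}
```

Sorting once gives the median, minimum and maximum together, at the cost of holding all values for one key in memory, which is acceptable for a data set of this size.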

Deploy the application on the cluster (details are on the previous lab page):

  • Upload the application jar file and the winequality-red_mr.csv input file to the server using the scp command
  • Upload the winequality-red_mr.csv file to HDFS
    • Use the hadoop fs -copyFromLocal <Server path> <HDFS path> command
  • Run the application
  • Measure how long it runs
  • Save the output of the application
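
One possible way to measure the run time is to take timestamps around the blocking call that runs the job. The sketch below shows the idea with a generic Runnable. The names RunTimerSketch and timeMillis are hypothetical, and the sleep only stands in for the job; in the driver the timed call would be the one that submits the job and waits for it to finish, such as job.waitForCompletion(true), assuming the skeleton runs the job that way.

```java
// Plain-Java sketch of wall-clock timing (hypothetical names, no Hadoop dependency).
class RunTimerSketch {
    // Runs the task and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-in for the MapReduce job: sleep for a short while.
        long elapsed = timeMillis(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Job took " + elapsed + " ms");
    }
}
```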

Deliverables

  1. MapReduce application source code
  2. Output of running the application in the cluster.
  3. Answer the following questions.
    1. What is the highest pH value for quality 5 wine?
    2. What is the lowest fixed acidity for quality 6 wine?
    3. What is the median value of total sulfur dioxide for quality 4 wine?
    4. How hard was it to complete this exercise in comparison to the previous one?
    5. Which part of the exercises did you like better: Grid or Cloud?
The economic copyright of the course materials belongs to the University of Tartu. Use of the course materials is permitted for the purposes and under the conditions of free use of a work provided for in the Copyright Act. When using the course materials, the user is obliged to credit their author.
Use of the course materials for other purposes is permitted only with the prior written consent of the University of Tartu.