Institute of Computer Science
Basics of Cloud Computing (MTAT.08.027) 2017/18 spring

Practice 5 - Designing MapReduce Applications

The goal of this practice session is to analyse an open dataset using MapReduce. You will create a new MapReduce application, download an example dataset from an open data repository, and perform a series of data processing tasks.

References

The documents and websites listed below contain supporting information for this practice session.

  • MapReduce tutorial: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  • Hadoop API: http://hadoop.apache.org/docs/stable/api/
  • Eclipse IDE: http://www.eclipse.org/

Exercise 5.1. Create a new MapReduce application

The goal of this exercise is to analyse the UCI Adult dataset using MapReduce.

  • Use your existing MapReduce Eclipse project or create a new one, as in the previous practice session.
  • Create a folder named "input" in your Eclipse project.
  • The dataset you will analyse is taken from the UCI Machine Learning Repository:
    • Name: Adult Data Set
    • Location: http://archive.ics.uci.edu/ml/datasets/Adult
    • Download the adult.data file from the Data Folder link on the UCI page and move it to the input folder.
      • You may have to delete the last 2 empty lines of the file, otherwise you will get parsing errors.
    • Dataset attributes (column names) are:
      1. age
      2. workclass
      3. fnlwgt
      4. education
      5. education-num
      6. marital-status
      7. occupation
      8. relationship
      9. race
      10. sex
      11. capital-gain
      12. capital-loss
      13. hours-per-week
      14. native-country
      15. Classification
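  • Make a copy of the WordCount.java example from the last practice session and rename it.
    • NB! Do not forget to turn off the combiner by commenting out this line: //job.setCombinerClass(IntSumReducer.class);
      • If the Reduce logic were also applied as a combiner to partial groups, values such as averages would come out wrong.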
  • Modify the pom.xml file inside your project folder to add one more Hadoop dependency. If the file already has a <dependencies> element, add only the <dependency> block inside it.
    • <dependencies>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>1.2.1</version>
        </dependency>
      </dependencies>
    • Save the pom.xml file.
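
As a quick check of the file layout described above, the following is a minimal plain-Java sketch (no Hadoop classes) of how a single line of adult.data could be split into its 15 columns. The class name AdultLineParser and its parse method are hypothetical helpers, not part of the practice materials. The sketch assumes that values are separated by commas, possibly followed by whitespace, and it skips blank or incomplete lines instead of failing on them.

```java
import java.util.Optional;

// Plain-Java sketch (no Hadoop classes); AdultLineParser is a hypothetical helper.
final class AdultLineParser {
    // Assumption: 14 attributes plus the classification column, as listed above.
    static final int EXPECTED_COLUMNS = 15;

    // Returns the trimmed fields, or an empty Optional for blank or malformed lines.
    static Optional<String[]> parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty(); // e.g. empty lines at the end of the file
        }
        String[] fields = line.split(",");
        if (fields.length != EXPECTED_COLUMNS) {
            return Optional.empty(); // unexpected number of columns
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim(); // drop whitespace around each value
        }
        return Optional.of(fields);
    }
}
```

In a Map method, a check of this kind could be applied to the text of the input value before any field is parsed, so that empty lines are ignored rather than causing parsing errors.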

Exercise 5.2. Data analytics using MapReduce

Modify the application to perform the following tasks on the UCI Adult data set:

  1. For each native-country, calculate the average hours-per-week
    • Map method:
      • The input key-value pair is (line_nr, line), where line is a single line from the csv file. Each line contains 15 comma-separated values: the 14 attributes and the classification.
      • Split the input line by commas and output native-country as the key and hours-per-week as the value: (native-country, hours-per-week)
    • Reduce method:
      • Input is: (native-country, [hours-per-week])
      • The input key is a unique native-country and the value is a list-like iterable object containing all hours-per-week values for that native-country.
      • Compute the average of all hours-per-week values.
      • Output should be (native-country, avg_val)
  2. In addition to the average, also find the minimum and maximum values.
    • Instead of writing out a single value using context.write(), the Reduce function should compute and output multiple values.
    • The output should be either:
      • 3 different key-value pairs, using 3 context.write() calls:
        • ("native-country,MIN", min_val)
        • ("native-country,AVG", avg_val)
        • ("native-country,MAX", max_val)
      • Or as a combined value (native-country, "min_val, avg_val, max_val").
  3. Perform the previous analysis for each unique native-country AND workclass pair.
    • The Map output key determines by which attributes the data is grouped when it "arrives" in the Reduce method, so change the key to change the grouping.
    • You can create a combined key. For example: ("native-country,workclass", value)
  4. Finally, instead of the single hours-per-week column, perform the analysis for every numerical column.
    • Create a loop inside the Map method that outputs a (key, value) pair for every column you want to process separately.
    • Use a unique key for each column to make sure they are grouped separately in the Reduce function. A simple way to achieve this is to add the name of the numerical column as another component of the combined key.
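
Tasks 1 and 2 can be sketched in plain Java, without Hadoop classes, to make the data flow visible. The class HoursPerCountry and its map, reduce and run methods are hypothetical names, and run only imitates the grouping that Hadoop performs between Map and Reduce. The sketch assumes zero-based column indexes 12 (hours-per-week) and 13 (native-country), following the attribute list in Exercise 5.1, and it produces the combined "min_val, avg_val, max_val" output variant.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of tasks 1 and 2; all names here are hypothetical.
final class HoursPerCountry {
    static final int HOURS_PER_WEEK = 12; // 13th column, zero-based index
    static final int NATIVE_COUNTRY = 13; // 14th column, zero-based index

    // "Map": one csv line -> (native-country, hours-per-week).
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        String country = fields[NATIVE_COUNTRY].trim();
        int hours = Integer.parseInt(fields[HOURS_PER_WEEK].trim());
        return new SimpleEntry<String, Integer>(country, hours);
    }

    // "Reduce": all values of one key -> "min_val, avg_val, max_val".
    // Assumes at least one value, which holds for every key that was emitted.
    static String reduce(Iterable<Integer> values) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        long sum = 0;
        int count = 0;
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
            count++;
        }
        double avg = (double) sum / count;
        return min + ", " + avg + ", " + max;
    }

    // Stand-in for the shuffle: group map output by key, then reduce each group.
    static Map<String, String> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In the real job, the body of map would go into the Map method and the body of reduce into the Reduce method, with context.write() calls replacing the return values.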

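The combined keys of tasks 3 and 4 can be illustrated in the same plain-Java style. CompositeKeyMapper is a hypothetical name, and the choice of numerical columns (age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week, by zero-based index) is an assumption based on the attribute list in Exercise 5.1. For every input line, the sketch emits one (key, value) pair per numerical column, with the key built as "native-country,workclass,column-name".

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the Map side of tasks 3 and 4; names are hypothetical.
final class CompositeKeyMapper {
    static final int WORKCLASS = 1;       // 2nd column, zero-based index
    static final int NATIVE_COUNTRY = 13; // 14th column, zero-based index

    // Assumption: these are the numerical columns of the dataset.
    static final int[] NUMERIC_INDEXES = {0, 2, 4, 10, 11, 12};
    static final String[] NUMERIC_NAMES = {
        "age", "fnlwgt", "education-num",
        "capital-gain", "capital-loss", "hours-per-week"
    };

    // One csv line -> one (combined key, value) pair per numerical column.
    static List<Map.Entry<String, Integer>> map(String line) {
        String[] fields = line.split(",");
        // Task 3: group by native-country AND workclass.
        String group = fields[NATIVE_COUNTRY].trim() + "," + fields[WORKCLASS].trim();
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (int i = 0; i < NUMERIC_INDEXES.length; i++) {
            int value = Integer.parseInt(fields[NUMERIC_INDEXES[i]].trim());
            // Task 4: the column name keeps each column in its own group.
            String key = group + "," + NUMERIC_NAMES[i];
            output.add(new SimpleEntry<String, Integer>(key, value));
        }
        return output;
    }
}
```

Because the column name is part of the key, the Reduce logic for minimum, average and maximum does not need to change: each (native-country, workclass, column) combination simply arrives as its own group.
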
In case of issues

  • You may have to modify the Map and Reduce class input and output data types.
    • Note that the signature of the IntSumReducer class, Reducer<Text,IntWritable,Text,IntWritable>, specifies that the Reduce method input types are (Text, IntWritable) and the output types are (Text, IntWritable).
    • If you change the input or output types of the Map/Reduce methods, you also have to change:
      1. The Mapper/Reducer class signature
      2. Map/Reduce method input and output types
      3. The respective MapReduce job output data types and Map output data types inside the main method (the Map output calls have to be added):
        •  job.setMapOutputKeyClass(Text.class); 
           job.setMapOutputValueClass(IntWritable.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);

Deliverables

  • MapReduce application source code.
  • Output files (part-r-0000*) of your job.
  • Answer the following question:
    1. What are the respective advantages and disadvantages of using either:
      • separate key-value pairs to output the results
      • or comma-separated combined values?
  • Institute of Computer Science
  • Faculty of Science and Technology
  • University of Tartu
The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.