Homework 2 (HW02) - First look at the data
In this homework we take a first look at the data and try to understand its attributes. Please start by downloading the 'adult' dataset as CSV and the 'instacart' dataset as CSV. Note that the 'instacart' file is only a small fraction of the full dataset announced at https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2. Before doing EX1-2, please make sure that you have read slides 74-106 of Lecture 02 on the types of attributes and how to study attributes individually. Before doing EX3 and EX4, please read the tutorial on tidyr and the tutorial on ggplot.
Exercise 1-2 (EX1-2) (2 points) Describing and cleaning the dataset 'adult'
In this exercise the main goal is to clean up the dataset 'adult' and understand its attributes. The dataset has been made unclean on purpose so that you can practice cleaning it up. Most of the introduced errors are of the kinds described in slides 92, 101 and 106 of Lecture 02. Additionally, some values have been introduced that are nonsensical given the meaning of the attribute.
Feel free to choose the order in which you solve the subtasks in this Exercise EX1-2.
Please clean up the dataset and make sure you:
- replace all nonsensical values by missing values;
- use NA to denote missing values (original file uses '?');
- justify all the changes that you make, explaining why you are sure that each change is correct and what you think could have caused the error;
- count how many rows are affected by each change that you make in the dataset;
- report values that look suspicious even if you are not sure that they are definitely wrong.
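As a starting point for the cleaning steps above, here is a minimal sketch in R. The inline CSV, the column names (age, workclass, hours_per_week) and the validity bounds are assumptions made up for illustration; the real file may use different names and will need rules that you justify yourself.

```r
# Toy stand-in for the 'adult' CSV; column names and values are assumptions.
csv_text <- "age,workclass,hours_per_week
39,State-gov,40
-5,?,60
50,Private,200"

# na.strings turns every '?' field into NA while reading;
# strip.white removes stray spaces around the fields first.
adult <- read.csv(text = csv_text, na.strings = "?", strip.white = TRUE)

# Replace values that cannot be true given the meaning of the attribute.
# The bounds below are illustrative only.
bad_age <- !is.na(adult$age) & adult$age < 0
adult$age[bad_age] <- NA

bad_hours <- !is.na(adult$hours_per_week) & adult$hours_per_week > 168
adult$hours_per_week[bad_hours] <- NA

# Number of rows touched by each rule, to report alongside the justification.
n_bad_age <- sum(bad_age)
n_bad_hours <- sum(bad_hours)
```

Keeping one logical mask per rule makes it easy to report the affected row count for every change separately.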
Hint 1: you do not have to do all the cleaning in R. You may use other tools, such as spreadsheet software or command-line scripts, provided that you specify which tool you used and count the number of affected rows.
Hint 2: one way to count affected rows is to define a function that takes the original data.frame and the modified data.frame as input and counts how many rows differ.
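The function described in Hint 2 could look roughly like the sketch below. The name count_changed_rows is hypothetical, and the sketch assumes that both data frames have the same columns and the same rows in the same order (that is, cleaning changed values but did not drop or reorder rows).

```r
# Count rows in which at least one cell differs between two data frames.
# Assumes identical shape and row order; the function name is hypothetical.
count_changed_rows <- function(original, modified) {
  stopifnot(identical(dim(original), dim(modified)))
  differs_by_column <- Map(function(x, y) {
    # Compare as text so that columns of different types can be compared.
    x <- as.character(x)
    y <- as.character(y)
    # Two NAs count as equal; NA versus a value counts as a change.
    xor(is.na(x), is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
  }, original, modified)
  # A row is affected if any of its columns differs.
  row_differs <- Reduce(`|`, differs_by_column)
  sum(row_differs)
}

before <- data.frame(age = c(39, -5, 50),
                     workclass = c("State-gov", NA, "Private"))
after <- before
after$age[2] <- NA
n_changed <- count_changed_rows(before, after)  # 1 row differs
```

A plain `before != after` would return NA wherever either side is missing, which is why the comparison handles NA explicitly.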
Describe all attributes in the dataset. For each attribute:
- specify the type of the attribute (nominal/ordinal/interval/ratio)
- describe the set of possible values for this attribute
- count the number of missing values
- draw a plot which you think best summarises this attribute
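The per-attribute checks listed above might be sketched as follows. The toy data frame and its column names are assumptions, and the plot types (a histogram for a ratio attribute, a bar chart for a nominal one) are only one reasonable choice.

```r
library(ggplot2)

# Toy stand-in for the cleaned 'adult' data; names and values are assumptions.
adult <- data.frame(
  age = c(39, 50, NA, 28, 37, 52),
  workclass = c("Private", "State-gov", "Private", NA, "Private", "Self-emp")
)

# Number of missing values per attribute.
missing_per_column <- colSums(is.na(adult))

# Observed values: a range for a numeric attribute,
# a frequency table (including NA) for a nominal one.
age_range <- range(adult$age, na.rm = TRUE)
workclass_counts <- table(adult$workclass, useNA = "ifany")

# Histogram for a ratio attribute, bar chart for a nominal attribute.
p_age <- ggplot(adult[!is.na(adult$age), ], aes(x = age)) +
  geom_histogram(binwidth = 5)
p_workclass <- ggplot(adult, aes(x = workclass)) +
  geom_bar()
```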
Exercise 3 (EX3) (1 point) Visualisations about the dataset 'adult'
(a) Create a table where each row stands for an occupation, each column stands for a level of education, and the cells in the table contain the average salary of people with the corresponding occupation and education level.
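One possible way to build such a table in base R is tapply with two grouping factors. The sketch below uses invented rows and assumes the columns are called occupation, education and salary, with salary stored as a number; check how the salary attribute is actually encoded in your file.

```r
# Toy data; column names and values are assumptions for illustration.
adult <- data.frame(
  occupation = c("Sales", "Sales", "Tech-support", "Tech-support", "Sales"),
  education  = c("Bachelors", "HS-grad", "Bachelors", "Bachelors", "Bachelors"),
  salary     = c(52000, 31000, 60000, 64000, 48000)
)

# Rows are occupations, columns are education levels, cells are mean salaries.
# Combinations that never occur in the data come out as NA.
avg_salary <- tapply(
  adult$salary,
  list(occupation = adult$occupation, education = adult$education),
  mean, na.rm = TRUE
)
```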
(b) Use some function from the tidyr package to convert the table from step (a) into a format with columns education, occupation and average_salary.
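For the reshaping step, pivot_longer is one tidyr function that fits (older tutorials use gather for the same purpose). The wide table below is a made-up example with occupation stored as a regular column; if your table from step (a) is a matrix, convert it to a data frame first.

```r
library(tidyr)

# Toy wide table: one row per occupation, one column per education level.
wide <- data.frame(
  occupation = c("Sales", "Tech-support"),
  Bachelors = c(50000, 62000),
  `HS-grad` = c(31000, NA),
  check.names = FALSE  # keep the hyphen in the column name
)

# Every column except occupation becomes an (education, average_salary) pair.
long <- pivot_longer(
  wide,
  cols = -occupation,
  names_to = "education",
  values_to = "average_salary"
)

# Reorder the columns to match the requested layout.
long <- long[, c("education", "occupation", "average_salary")]
```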
(c) Plot the table from step (b) with the ggplot command geom_tile(aes(x=education,y=occupation,fill=average_salary)).
(d) Reorder the education levels in the plot in what you think is a natural order (this can be achieved by creating an ordered factor before plotting).
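Steps (c) and (d) can be combined as in the sketch below. The three education levels and their order are placeholders; the real data contains more levels, and choosing a sensible order for them is part of the exercise.

```r
library(ggplot2)

# Toy long-format table; values are invented for illustration.
long <- data.frame(
  education = c("Bachelors", "HS-grad", "Masters",
                "Bachelors", "HS-grad", "Masters"),
  occupation = rep(c("Sales", "Tech-support"), each = 3),
  average_salary = c(50000, 31000, 58000, 62000, 40000, 70000)
)

# Levels listed from lowest to highest; extend with the levels in your data.
education_order <- c("HS-grad", "Bachelors", "Masters")
long$education <- factor(long$education,
                         levels = education_order, ordered = TRUE)

# ggplot places discrete values along the axis in factor-level order.
p <- ggplot(long) +
  geom_tile(aes(x = education, y = occupation, fill = average_salary))
```

Without the factor step, ggplot would sort the education labels alphabetically.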
(e) List 3 interesting facts that you can read out from this plot.
(f) Create another plot of this dataset that you think conveys interesting information. List 3 interesting facts that you can read out from this plot.
Exercise 4 (EX4) (1 point) Visualisations about the dataset 'instacart'
(a) Read the blog entry announcing that the 'instacart' dataset is being made public. Using this information and by studying the dataset itself, identify the type of each attribute (nominal/ordinal/interval/ratio) and describe its meaning.
(b) Create a figure consisting of 7 histograms next to each other using facet_grid, one histogram for each day of the week. Each histogram must have 24 bars, one for each hour of the day. The height of a bar must correspond to the number of product purchases (that is, the number of rows) in the data that were made in that hour of the corresponding day of the week.
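A rough sketch of such a faceted histogram is given below. It assumes the day and hour columns are named order_dow and order_hour_of_day and that hours are integers from 0 to 23; check the names and encodings in your CSV. The toy data covers only three days, so it produces three panels rather than seven.

```r
library(ggplot2)

# Toy stand-in for the 'instacart' rows; column names are assumptions.
instacart <- data.frame(
  order_dow = c(0, 0, 1, 1, 1, 6),
  order_hour_of_day = c(9, 9, 14, 15, 15, 23)
)

# Explicit breaks give exactly 24 bins, each centred on one hour (0..23).
hour_breaks <- seq(-0.5, 23.5, by = 1)

# facet_grid(. ~ order_dow) puts one panel per day next to each other.
p <- ggplot(instacart, aes(x = order_hour_of_day)) +
  geom_histogram(breaks = hour_breaks) +
  facet_grid(. ~ order_dow)

# Cross-check of the bar heights: rows per (day, hour) combination.
counts <- table(day = instacart$order_dow, hour = instacart$order_hour_of_day)
```

Fixing the breaks explicitly, rather than relying on the default bin count, keeps the bars aligned with whole hours in every panel.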
(c) List 3 interesting facts that you can read out from this plot.
(d) Create another plot of this dataset that you think conveys interesting information. List 3 interesting facts that you can read out from this plot.