Data Mining (Andmekaeve) - Courses - Institute of Computer Science

HW2. Data cleaning and descriptive statistics

Because of Estonian Independence Day on 24.02, you have almost two weeks to solve this homework (this is an exception; usually you have one week). The deadline is 26.02 at 23:59.

We are working with the adult dataset. Start by reading it in. The data has been made unclean on purpose so that you can see what data often looks like in real life, so expect the kinds of problems you would meet in real-life scenarios.

Hint: Python and R may try to guess the data type of each feature when you read the data in the default way. Since the data is not clean, the guessed type can differ from what you would expect. Take the time to find out how to specify the data type when reading the file, or how to change it afterwards.
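As a minimal sketch of the hint above in Python with pandas: read every column as text first, then convert explicitly and look at what failed to convert. The sample rows and column names here are made up for illustration and are not taken from the actual file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real file; column names are assumptions.
raw = io.StringIO(
    "age,workclass,hours.per.week\n"
    "39,State-gov,40\n"
    "fifty,Private,13\n"
    "?,Private,40\n"
)

# Read everything as text so pandas does not guess types from dirty values.
df = pd.read_csv(raw, dtype=str)

# Convert one column explicitly; values that are not numbers become NaN.
df["age_num"] = pd.to_numeric(df["age"], errors="coerce")

# The original strings that failed to convert point at input mistakes
# (or at missing-value markers such as "?").
bad_age = df.loc[df["age_num"].isna(), "age"].tolist()
print(df.dtypes)
print(bad_age)
```

Listing the values that failed conversion is usually a quick way to find what needs fixing by hand before the column can be treated as numeric.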

EX 1. The first task requires you to get an overall sense of the provided dataset. Answer the following questions.

What is the data about?
What are the different features and their types (discrete, continuous, etc.; range of values; possible meaning of the features)?
How many rows are in the dataset?

Next, perform some data cleaning by changing the initial dataset. Try to eliminate obvious data input mistakes. It is also very common for a categorical (factor) feature to be written down in several different ways because of input errors (spelling mistakes, etc.). Aggregate information related to a single feature from multiple columns, and remove features that are unnecessary for interpreting the data (e.g. row number columns).
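One possible way to unify the spellings of a categorical feature in pandas is sketched below. The values are invented examples; the variants actually present in the file have to be found by inspecting the data, for instance with value_counts().

```python
import pandas as pd

# Hypothetical values; the real spellings in the file may differ.
sex = pd.Series(["Male", " male", "FEMALE", "Female ", "M", "femal"])

# Step 1: trim whitespace and unify case, which removes the most common variants.
cleaned = sex.str.strip().str.lower()

# Step 2: map the remaining variants by hand after inspecting the distinct values.
# Series.replace with a dict matches whole values, not substrings.
fixes = {"m": "male", "femal": "female"}
cleaned = cleaned.replace(fixes)

print(cleaned.value_counts())
```

Keeping the manual mapping in a dictionary also documents the changes, which helps when describing what was changed and why.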

Note that data prep usually also deals with missing values (noted as "?" in our file). However, do not remove/alter missing values at this stage. We will handle them later. What you can do is change them into actual missing values to make your life easier in the next exercises ("?" is not internally used as the mark for missing value, there is a separate data type for that).
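A sketch of two ways to turn the "?" marker into real missing values with pandas, assuming a comma-separated file. The sample rows and column names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample; note the space before the second "?".
raw = io.StringIO(
    "age,native.country\n"
    "39,United-States\n"
    "?,Cuba\n"
    "28, ?\n"
)

# Option A: declare the marker while reading.
# skipinitialspace drops spaces after the delimiter, so " ?" is also caught.
df_a = pd.read_csv(raw, na_values=["?"], skipinitialspace=True)

# Option B: read everything as text, strip whitespace, then blank out "?" cells.
raw.seek(0)
df_b = pd.read_csv(raw, dtype=str)
df_b = df_b.apply(lambda col: col.str.strip())
df_b = df_b.mask(df_b == "?")  # cells equal to "?" become missing values

print(df_a.isna().sum())
print(df_b.isna().sum())
```

Option A is shorter, while Option B keeps the other dirty values as text so they can still be inspected and fixed before type conversion.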

Identify and fix the problems in the initial dataset. Describe all changes you made and why. The dataset prepared here will be used in the following exercises.

Hint: You do not have to do all of the initial cleaning with R/Python. Some of the tasks can be easier with tools like OpenRefine (http://openrefine.org/) or plain spreadsheet software. But R and Python will surely do the job nicely as well!

EX 2. Characterise and describe the dataset from the previous exercise.

For each feature (depending on what is applicable) show its:

Mean, median, max, min, standard deviation (for numerical features)
Frequency table (for categorical features)
Missing values (whether there are any, and how many)
Plot showing the distribution or frequencies of values (choose a suitable plot for feature type)

Comment on the implications of the outcome (is the distribution skewed towards small or large values, is there anything unexpected).
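The listed statistics could be computed along these lines with pandas. The small data frame is a made-up stand-in for the cleaned dataset from EX 1, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset.
df = pd.DataFrame({
    "age": [25, 38, 28, 44, None, 90],
    "workclass": ["Private", "Private", "Local-gov", None, "Private", "Self-emp"],
})

# Numerical feature: mean, median, min, max, standard deviation.
# Missing values are skipped by default.
age_stats = df["age"].agg(["mean", "median", "min", "max", "std"])

# Categorical feature: frequency table, with missing values counted explicitly.
workclass_freq = df["workclass"].value_counts(dropna=False)

# Number of missing values per feature.
missing = df.isna().sum()

# A mean well above the median hints at a right-skewed distribution.
print(age_stats, workclass_freq, missing, sep="\n\n")
```

With matplotlib installed, df["age"].plot.hist() and workclass_freq.plot.bar() are one way to get a first version of the plots.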

EX 3. Identify potential outliers in the numerical features of the dataset by applying the inter-quartile range (IQR), i.e. the range between the lower (25%) and upper (75%) quartiles. Are there values that lie more than k*IQR above the upper quartile or more than k*IQR below the lower quartile? Use k=1.5.
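The IQR rule above can be sketched as follows. The values are invented, and pandas' default linear interpolation is used for the quartiles; other quantile definitions can give slightly different bounds.

```python
import pandas as pd

# Hypothetical numerical feature with one extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="hours")

k = 1.5
q1 = values.quantile(0.25)   # lower quartile
q3 = values.quantile(0.75)   # upper quartile
iqr = q3 - q1

# Bounds: anything below lower or above upper is a potential outlier.
lower = q1 - k * iqr
upper = q3 + k * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print(outliers)
```

The rule only flags candidates; whether a flagged value is a mistake or a rare but valid observation still has to be judged from the data.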

Explain whether you should remove outliers and, if so, which ones. Outliers may come from mistakes (incorrect data input, measurement error) or may be correct but rare values. You do not need to decide for each observation separately, but explain your chosen outlier exclusion criteria in general terms. Then remove the outliers you decided to exclude and recalculate the statistics from EX 2 for the changed features (including the plot). Any observations?

EX 4. Often, the datasets you work with have missing values (in our dataset they are marked as "?"). It is important to understand whether the values are missing completely at random or missing in some systematic way. There are some missing values in the features age and native.country. Investigate which kind of missingness (systematic or random) we have in each of our features. Decide whether you can or should impute the missing data in some of those features and explain your decision. If you decide to impute, do so (you don’t have to use anything too complicated) and compare the relevant statistics and plots from EX 2 with the values from the initial dataset. If you decide not to impute, explain why (and convince us).
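One simple way to look for systematic missingness is to compare the share of missing values across the groups of another feature, as sketched below. The data and the choice of the grouping feature are made up for illustration, and median imputation is just one of several simple options.

```python
import pandas as pd

# Hypothetical data in which age is missing only for one group.
df = pd.DataFrame({
    "age": [25, None, 28, None, 52, 31, None, 40],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female"],
})

# Share of missing ages within each group; large differences between groups
# suggest the values are not missing completely at random.
missing_by_sex = df["age"].isna().groupby(df["sex"]).mean()

# Simple imputation with the median, kept in a new column so that
# the statistics before and after can be compared.
df["age_imputed"] = df["age"].fillna(df["age"].median())

print(missing_by_sex)
print(df[["age", "age_imputed"]].describe())
```

Comparing the two describe() columns shows a typical side effect: the median stays the same, while the standard deviation shrinks because many identical values were added.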

EX 5. Find 3 creative ways to describe the relationships between two or more variables (in R you can use the ggplot2 package; in Python, pandas, seaborn or matplotlib). Add these plots to your report along with descriptions of what they show. Don’t forget to add appropriate axis labels and titles.
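Before plotting, it can help to compute the table a plot will be based on. The sketch below uses pandas only; the data frame and the column names (including income) are hypothetical stand-ins, not taken from the actual file.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset.
df = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Bachelors", "HS-grad", "Bachelors"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
    "hours": [40, 50, 35, 40, 60, 45],
})

# Two categorical features: row-normalised contingency table,
# i.e. the share of each income class within each education level.
shares = pd.crosstab(df["education"], df["income"], normalize="index")

# Numerical vs categorical: median hours within each income class.
hours_by_income = df.groupby("income")["hours"].median()

print(shares)
print(hours_by_income)
```

With matplotlib installed, shares.plot.bar(stacked=True) would turn the first table into a stacked bar chart; axis labels and a title still need to be added by hand.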

Data Mining (Andmekaeve), spring 2016/17