Data Mining - Courses - Institute of Computer Science

HW02 (21.02) - Descriptive statistics

In this homework you will use the abalone dataset (a short description of the variables is available here). First, read in the dataset, then complete the following tasks.

1. Before going deeper into the analysis, practice your skills in R/Python/etc. by answering the following questions:

a. What are the column names of the dataset?

b. How many observations (i.e. rows) are in this data frame?

c. Print the first 3 rows of the dataset. What are the values of the rings feature in the printed observations?

d. Extract the last 2 rows of the data frame. What is the weight of these abalones?

e. What is the value of diameter in row 755?

f. How many missing values are in the height column?

g. What is the mean of the height column? Exclude missing values from this calculation.

h. Extract the subset of rows of the data frame where gender is M and weight values are below 0.75. What is the mean of diameter in this subset?

i. What is the most frequent rings value?

j. What is the minimum of length when rings is equal to 18?

2. Now let’s try to actually understand the data by answering the following questions:

a. What is the data about?

b. What are the different features and their types (discrete, continuous, etc.)?

c. How many rows are in the dataset?

d. For each feature show its:

mean
median
max
min
standard deviation
distribution (with a plot)

Comment on the outcome (e.g. is the distribution skewed, with more small or more large values?).
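As a rough illustration of the statistics listed in 2d, here is a minimal sketch using only the Python standard library. The function name summarize and the toy_heights values are made up for illustration and are not taken from the abalone data. Missing values are assumed to be represented as None.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for one numeric feature."""
    # Drop missing values (assumed here to be encoded as None).
    clean = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "max": max(clean),
        "min": min(clean),
        # Sample standard deviation (divides by n - 1).
        "std": statistics.stdev(clean),
    }

# Hypothetical toy values, NOT real abalone measurements.
toy_heights = [0.10, 0.12, None, 0.15, 0.11, 0.40]
summary = summarize(toy_heights)
print(summary)
```

A mean that is clearly larger than the median, as in this toy example, is one hint that the distribution has a long right tail; a histogram should confirm it.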

3. Create scatterplots between all variables except gender (in R and Python you can draw a scatter matrix of all variables at once; you don’t have to plot each pair manually). Also calculate the correlations between all variables. Plot separately the scatterplot of rings and the variable most correlated with it. Also plot separately the two most correlated variables. Explain what you observe.
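To make the correlation part of task 3 concrete, the sketch below computes the Pearson correlation coefficient by hand and picks the column most correlated (in absolute value) with a target column. It assumes Pearson correlation is the intended measure, which the task does not state. The helper names pearson and most_correlated_with and the toy values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def most_correlated_with(target, columns):
    """Return (name, r) of the column with the largest |r| against target."""
    scores = {
        name: pearson(columns[target], values)
        for name, values in columns.items()
        if name != target
    }
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores[best]

# Hypothetical toy columns, NOT real abalone measurements.
toy = {
    "rings": [7, 9, 10, 12, 15],
    "diameter": [0.30, 0.35, 0.42, 0.45, 0.52],
    "height": [0.10, 0.09, 0.15, 0.11, 0.14],
}
best_name, best_r = most_correlated_with("rings", toy)
print(best_name, round(best_r, 3))
```

In practice a data-frame library computes the full correlation matrix in one call; the manual version is only meant to show what is being calculated.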

4. Identify potential outliers in the dataset using the interquartile range (IQR), i.e. the range between the 25% and 75% quantiles (the lower and upper quartiles): are there values more than k*IQR above the upper quartile or more than k*IQR below the lower quartile? Use k=1.5. Are there any multidimensional outliers, i.e. observations that are outliers in several variables? Note: you can also make boxplots to gain some intuition about the outliers.
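The IQR rule can be sketched as follows. This is one possible reading of the task, assuming quartiles are computed with linear interpolation (the "inclusive" method of statistics.quantiles, Python 3.8+); other quartile definitions give slightly different bounds. The function names iqr_bounds and outlier_flags and the toy columns are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_flags(columns, k=1.5):
    """For each row index, list the columns in which that row is an outlier."""
    n_rows = len(next(iter(columns.values())))
    flags = [[] for _ in range(n_rows)]
    for name, values in columns.items():
        low, high = iqr_bounds(values, k)
        for i, v in enumerate(values):
            if v < low or v > high:
                flags[i].append(name)
    return flags

# Hypothetical toy columns, NOT real abalone measurements.
toy = {
    "length": [0.45, 0.50, 0.52, 0.55, 0.48, 0.95],
    "height": [0.10, 0.12, 0.11, 0.13, 0.12, 0.60],
    "weight": [0.50, 0.60, 0.55, 0.70, 0.65, 0.62],
}
flags = outlier_flags(toy)
# Rows flagged in two or more columns are candidate multidimensional outliers.
multi = [i for i, cols in enumerate(flags) if len(cols) >= 2]
print(flags, multi)
```

Counting how many columns flag each row gives a simple way to look for observations that are outliers in several variables at once.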

5. Explain whether you should remove outliers and, if so, which ones (you do not need to decide for each observation separately; try to explain in general). Remove the outliers you decided to remove and recalculate the statistics from ex. 2d. What do you observe?

6. (bonus 1p) Fit a straight line, i.e. y = w0 + w1*x, approximating the relation between the diameter and weight of an abalone. Parameters w0 and w1 should be chosen to minimize the mean squared error of the fit, i.e. sum((y-y0)^2) / N, where y0 is the actual value of the dependent variable, y is the predicted (fitted) value, and N is the number of observations. This has to be done without using curve-fitting functions like lm() or polyfit().
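One way to approach this without a curve-fitting function is the closed-form least-squares solution: setting the derivatives of the mean squared error with respect to w0 and w1 to zero gives w1 = sum((x - mean(x)) * (y0 - mean(y0))) / sum((x - mean(x))^2) and w0 = mean(y0) - w1 * mean(x). The sketch below assumes diameter is x and weight is the dependent variable, which the task leaves open. The function names fit_line and mse and the toy values are hypothetical; gradient descent would be another valid approach.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates (w0, w1) for y = w0 + w1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1

def mse(xs, ys, w0, w1):
    """Mean squared error of the line w0 + w1*x against the actual values ys."""
    return sum((w0 + w1 * x - y0) ** 2 for x, y0 in zip(xs, ys)) / len(xs)

# Hypothetical toy values, NOT real abalone measurements.
toy_diameter = [0.30, 0.35, 0.40, 0.45, 0.50]
toy_weight = [0.40, 0.55, 0.80, 0.95, 1.20]
w0, w1 = fit_line(toy_diameter, toy_weight)
print(w0, w1, mse(toy_diameter, toy_weight, w0, w1))
```

A quick sanity check is to nudge w0 or w1 slightly and confirm that the mean squared error does not decrease.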

Data Mining 2015/16 spring
