HW02 (21.02) - Descriptive statistics
In this homework you will use the abalone dataset (a short description of the variables is available here). First, read in the dataset, then complete the following tasks.
1. Before going deeper into the analysis, practice your skills in R/Python/etc. by answering the following questions:
2. Now let’s try to actually understand the data by answering the following questions:
- mean
- median
- max
- min
- standard deviation
- distribution (with a plot)
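The summary statistics listed above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name `describe` and the sample values are made up for the example and are not part of the abalone data. The distribution plot is not covered here, since it needs a plotting library (for instance a histogram).

```python
import statistics


def describe(values):
    """Return basic summary statistics for one numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        # Sample standard deviation (divides by n - 1).
        "std": statistics.stdev(values),
    }


# Made-up values for illustration only, not taken from the abalone data.
sample = [2.0, 4.0, 4.0, 5.0, 10.0]
summary = describe(sample)
print(summary)
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` gives the population version, and the two differ noticeably only for small samples.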
3. Create scatterplots between all variables except gender (in R and Python you can draw a scatter matrix of all variables at once; you don’t have to plot each pair manually). Also calculate the correlations between all variables. Plot separately the scatterplot of rings and the variable most correlated with it. Also plot separately the two most correlated variables. Explain what you observe.
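One way to compute the pairwise correlations and pick the variable most correlated with rings is sketched below. It assumes Pearson correlation (the task does not say which coefficient to use) and uses a tiny made-up table; the helper names `pearson`, `correlation_matrix` and `most_correlated_with` are hypothetical. In practice a data-frame library does the same in one call, for example pandas `DataFrame.corr()`, with `pandas.plotting.scatter_matrix` for the plots.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Correlation of every column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def most_correlated_with(target, corr):
    """Name of the column with the largest absolute correlation to target."""
    others = {name: r for name, r in corr[target].items() if name != target}
    return max(others, key=lambda name: abs(others[name]))


# Made-up toy values for illustration only, not taken from the abalone data.
toy = {
    "diameter": [1.0, 2.0, 3.0, 4.0],
    "weight": [1.0, 4.0, 9.0, 16.0],
    "rings": [2.0, 4.0, 6.0, 8.0],
}
corr = correlation_matrix(toy)
best = most_correlated_with("rings", corr)
print(best, corr["rings"][best])
```

Ranking by absolute value means a strong negative correlation would also be picked up, which is usually what "most correlated" is meant to capture.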
4. Identify potential outliers in the dataset using the inter-quartile range (IQR), i.e. the range between the lower (25%) and upper (75%) quartiles: are there values that lie more than k*IQR above the upper quartile or more than k*IQR below the lower quartile? Use k=1.5. Are there any multidimensional outliers, i.e. observations that are outliers in many variables? Note: you can also make boxplots to gain some intuition about the outliers.
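The IQR rule can be sketched as below, using only the standard library. The quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption: other quartile conventions give slightly different fences. The helper names and the toy values are made up for the example. Counting, per observation, in how many variables it is flagged gives a simple view of multidimensional outliers.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_flags(values, k=1.5):
    """True for each value that falls outside the fences."""
    low, high = iqr_fences(values, k)
    return [v < low or v > high for v in values]


def outlier_counts(columns, k=1.5):
    """For each observation, the number of variables in which it is an outlier."""
    flags = [outlier_flags(vals, k) for vals in columns.values()]
    return [sum(row) for row in zip(*flags)]


# Made-up toy values for illustration only, not taken from the abalone data.
toy = {
    "diameter": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "weight": [2, 3, 4, 5, 6, 7, 8, 9, 90],
}
counts = outlier_counts(toy)
print(iqr_fences(toy["diameter"]), counts)
```

Here the last observation is flagged in both variables, so it would count as a multidimensional outlier under this simple definition.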
5. Explain whether you should remove outliers and, if so, which ones (you do not need to decide for each observation separately; try to explain in general). Remove the outliers you decided to remove and recalculate the statistics from ex. 2d. What do you observe?
6. (bonus 1p) Fit a line, i.e. y = w0 + w1*x, approximating the relation between the diameter and the weight of an abalone. The parameters w0 and w1 should be chosen to minimize the mean squared error of the fit, i.e. sum((y-y0)^2) / N, where y0 is the actual value of the dependent variable and y is the predicted (fitted) value. This has to be done without using curve fitting functions such as lm() or polyfit().
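One possible approach, sketched here under the assumption that a closed-form solution is acceptable: setting the derivatives of the mean squared error with respect to w0 and w1 to zero gives w1 = sum((x-mean(x))*(y0-mean(y0))) / sum((x-mean(x))^2) and w0 = mean(y0) - w1*mean(x). The function names and the toy values below are made up for illustration; gradient descent on the same error would be another valid route.

```python
def fit_line(x, y0):
    """Least-squares estimates (w0, w1) for y = w0 + w1*x, in closed form."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y0) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y0))
    sxx = sum((a - mean_x) ** 2 for a in x)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def mse(x, y0, w0, w1):
    """Mean squared error sum((y - y0)^2) / N of the fitted line."""
    return sum((w0 + w1 * a - b) ** 2 for a, b in zip(x, y0)) / len(x)


# Made-up toy values lying exactly on y = 1 + 2x, not abalone data.
x = [1.0, 2.0, 3.0, 4.0]
y0 = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(x, y0)
print(w0, w1, mse(x, y0, w0, w1))
```

A quick sanity check is to perturb w0 or w1 slightly and confirm that the error returned by `mse` does not decrease.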