HW 5 Descriptive analysis.
(due March 22nd)
1. Study the classic Iris data set (see Wikipedia and the UCI repository). Take the file iris.data. It is more convenient to refer to a particular variable by name, so assign names to the columns of your data, for example: names(iris) = c("sepal length","sepal width","petal length","petal width","class")
- For each variable calculate mean, standard deviation, median, minimum, maximum and mode. Describe what type of data you have (e.g. continuous or discrete). Are the distributions of variables symmetric, positively or negatively skewed?
- Plot 4 boxplots: one for each continuous variable with respect to the class (the x-axis is the subclass of the flower: Iris-setosa, Iris-versicolor, Iris-virginica, and the y-axis is the continuous variable). What do these figures show and how should they be interpreted?
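The summary statistics and boxplots asked for in task 1 could be sketched in R roughly as follows. This is only a sketch under assumptions: it uses R's built-in `iris` data frame as a stand-in for the downloaded iris.data file (so the columns are called `Sepal.Length`, ..., `Species` rather than the names assigned above), and `stat_mode` and `skewness` are hypothetical helper names, since base R provides neither.

```r
# Assumption: the built-in iris data frame stands in for iris.data.
data(iris)
num_vars <- names(iris)[1:4]

# Hypothetical helper: most frequent value (the smallest one if there are ties).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical helper: moment-based sample skewness
# (> 0 suggests positive skew, < 0 negative skew, near 0 rough symmetry).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# One column per variable, one row per statistic.
summary_table <- sapply(iris[num_vars], function(x) c(
  mean = mean(x), sd = sd(x), median = median(x),
  min = min(x), max = max(x), mode = stat_mode(x), skewness = skewness(x)
))
print(round(summary_table, 3))

# One boxplot per continuous variable, grouped by class.
op <- par(mfrow = c(2, 2))
for (v in num_vars) {
  boxplot(iris[[v]] ~ iris$Species, xlab = "class", ylab = v, main = v)
}
par(op)
```

The skewness value is only one possible indicator; comparing mean and median, or looking at a histogram, are equally reasonable ways to judge symmetry.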
2. Plot all 4 variables against each other on X-Y plots (2 at a time), highlighting the different species by color. Characterise how well the variables can be used to differentiate between the three species.
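One possible way to draw all pairwise X-Y plots at once is base R's `pairs()`. The sketch below again assumes the built-in `iris` data frame in place of iris.data; the colour choice in `species_cols` is arbitrary.

```r
# Assumption: the built-in iris data frame stands in for iris.data.
data(iris)

# One colour per class, in the order of levels(iris$Species).
species_cols <- c("red", "darkgreen", "blue")
point_cols <- species_cols[as.integer(iris$Species)]

# Scatterplot matrix: every pair of the four continuous variables.
pairs(iris[1:4], col = point_cols, pch = 19,
      main = "Iris variables, coloured by class")
```

Here `levels(iris$Species)` gives the class that corresponds to each entry of `species_cols`.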
3. Identify potential outliers in the iris data.
- Apply the inter-quartile range (IQR), i.e. the range between the 25% and 75% quartiles: are there values that are more than k*IQR above the upper quartile or more than k*IQR below the lower quartile? Use k=1.5. What should k be to detect the 10 largest values in the data, and what should k be to detect the 10 smallest values in the data?
- Now use the "z-score": how many standard deviations a value lies above or, respectively, below the mean. Which values are more than 3 standard deviations away from the mean? z-score = (x_i - mean(X)) / stdev(X) (see also Wikipedia on Standard score)
Are the identified outliers the same in the two cases?
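The two outlier rules from task 3 can be written as small functions and compared. This is a minimal sketch, assuming the built-in `iris` data frame and using only the sepal width column as an example; `iqr_outliers` and `z_outliers` are hypothetical helper names, and the quartiles come from R's default `quantile()` method.

```r
# Assumption: the built-in iris data frame stands in for iris.data.
data(iris)
x <- iris$Sepal.Width

# Indices of values more than k*IQR above the upper quartile
# or more than k*IQR below the lower quartile.
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- k * (q[2] - q[1])
  which(x > q[2] + fence | x < q[1] - fence)
}

# Indices of values more than `threshold` standard deviations from the mean.
z_outliers <- function(x, threshold = 3) {
  z <- (x - mean(x)) / sd(x)
  which(abs(z) > threshold)
}

iqr_idx <- iqr_outliers(x)
z_idx <- z_outliers(x)

print(x[iqr_idx])               # values flagged by the IQR rule
print(x[z_idx])                 # values flagged by the z-score rule
print(setdiff(iqr_idx, z_idx))  # flagged by the IQR rule only
```

Because `k` and `threshold` are arguments, the same functions can be called repeatedly with different values when exploring how many points each setting flags.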
4. Kernel density estimation: use these two data sets, klient1.txt and klient3.txt. Plot the density of each data set. How would you characterise these two data sets (klient1 and klient3)? (They represent the times during the week when people go shopping.)
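A density plot of this kind can be produced with base R's `density()`. The sketch below does not read the real files, whose format is not shown here. Instead it uses synthetic "hour of the week" values with two made-up peaks, purely as a labelled stand-in. If the files hold one number per line, something like `scan("klient1.txt")` might read them, but that is an assumption.

```r
# Assumption: synthetic "hour of the week" values (0 to 168) stand in for
# klient1.txt. The two peaks below are invented for illustration only.
set.seed(1)
shopping_hours <- c(rnorm(600, mean = 18, sd = 2),    # hypothetical early-week evening peak
                    rnorm(400, mean = 132, sd = 3))   # hypothetical weekend midday peak
shopping_hours <- shopping_hours[shopping_hours >= 0 & shopping_hours <= 168]

# Gaussian kernel with the default bandwidth rule.
d <- density(shopping_hours)

plot(d, xlab = "hour of the week",
     main = "Kernel density estimate (synthetic data)")
rug(shopping_hours)  # tick marks showing the individual observations
```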
5. What is "the right" kernel and its width? Experiment with different kernels and widths. Also experiment with random samples of the data, taking for example 100, 500, 1000, ... values instead of the nearly 90K values in the full data. How many values, and which kernels, would be sufficient to interpret the data as you did in 4?
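The experiment in task 5 could be organised as a loop over sample sizes and kernels. This sketch again uses synthetic data in place of the real klient files. The kernel names are options accepted by `density()`, while `width_factor` is a hypothetical variable passed to its `adjust` argument, which scales the default bandwidth.

```r
# Assumption: synthetic data stand in for the full data set of about 90K values.
set.seed(1)
full <- c(rnorm(6000, mean = 18, sd = 2), rnorm(4000, mean = 132, sd = 3))

kernels <- c("gaussian", "epanechnikov", "rectangular")
sizes <- c(100, 500, 1000)
width_factor <- 1  # values below 1 narrow the kernel, values above 1 widen it

estimates <- list()
op <- par(mfrow = c(length(sizes), 1))
for (n in sizes) {
  sub <- sample(full, n)  # random subsample without replacement
  # Reference curve: density of the full data, default settings.
  plot(density(full), lwd = 2, main = paste("n =", n),
       xlab = "hour of the week")
  for (i in seq_along(kernels)) {
    d <- density(sub, kernel = kernels[i], adjust = width_factor)
    lines(d, col = i + 1)
    estimates[[paste(n, kernels[i])]] <- d
  }
  legend("top", legend = c("full data", kernels),
         col = seq_len(length(kernels) + 1), lty = 1)
}
par(op)
```

Rerunning with other values of `width_factor`, or with an explicit `bw` argument instead of `adjust`, shows how strongly the picture depends on the width compared with the kernel shape.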
6. (Bonus, 2p) Often, the datasets you work with have missing values. It is important to understand whether the values are missing completely at random or whether the missing data are systematic in some way. In this task you are provided with two files called iris_missing_1.txt and iris_missing_2.txt. Investigate which case of missing data you are dealing with in each file. Based on your findings, impute the missing values. Describe how you do it and compare the result with the initial dataset. How do your descriptive statistics differ? Measure the mean squared error (MSE) of your imputations for each variable ( MSE = (1/n) * Sum{i=1..n} (imputed value_i − real value_i)^2 ).
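The MSE computation can be illustrated on a toy case. The sketch below does not use iris_missing_1.txt or iris_missing_2.txt. It takes the built-in `iris` data frame, removes 15 petal length values completely at random (a hypothetical scenario), imputes them with the class mean, and evaluates the formula above, taking n to be the number of imputed values. Class-mean imputation is just one simple choice, not necessarily the appropriate one for either file.

```r
# Assumption: the built-in iris data frame plays the role of the "real" values.
data(iris)
set.seed(1)

truth <- iris$Petal.Length
observed <- truth
missing_idx <- sample(length(truth), 15)  # hypothetical: 15 values missing completely at random
observed[missing_idx] <- NA

# Class-mean imputation: each NA is replaced by the mean of the
# observed values of the same class.
class_means <- ave(observed, iris$Species,
                   FUN = function(v) mean(v, na.rm = TRUE))
imputed <- ifelse(is.na(observed), class_means, observed)

# MSE over the imputed entries only (n = number of imputed values).
mse <- mean((imputed[missing_idx] - truth[missing_idx])^2)
print(mse)

# Compare a descriptive statistic before and after imputation.
print(c(sd_real = sd(truth), sd_imputed = sd(imputed)))
```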