HW 07 (1.11) Descriptive data mining, R
Although it is not obligatory, we encourage you to use R for these exercises. If you have never used R, read an introductory manual first and keep a list of functions at hand, e.g. http://www.sr.bham.ac.uk/~ajrs/R/r-function_list.html. Most of the functions needed in this homework are already available in R; a short search on the Internet will turn up the right ones.
The data file is here: Attach:data.txt
1. Read in the data file data.txt. For each variable, calculate the mean, standard deviation, median, minimum and maximum. Describe what type of data each variable contains (continuous, discrete, etc.).
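A possible starting point for exercise 1, as a minimal sketch. The data frame below is synthetic and only stands in for data.txt; the column names and the `header = TRUE` setting in the commented-out `read.table` call are assumptions about the file layout, not facts about it.

```r
# Synthetic stand-in for data.txt. The real file would be read with something like
#   d <- read.table("data.txt", header = TRUE)   # header = TRUE is an assumption
set.seed(1)
d <- data.frame(a = rnorm(100), b = runif(100), e = rpois(100, 3))

# One row per variable, one column per statistic
stats <- t(sapply(d, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
}))
print(round(stats, 3))

# Storage type and number of distinct values give a hint about
# continuous versus discrete variables
print(sapply(d, class))
print(sapply(d, function(x) length(unique(x))))
```

`summary(d)` gives a similar overview in one call, but it does not report the standard deviation.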
2. There are some obvious outliers in the data. Find a way to detect and remove the rows that contain them. Calculate the same statistics again as in the first exercise. Explain why some of these statistics have changed considerably and others only slightly. In the following exercises, use the data with the outliers removed.
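One common way to flag outliers is an interquartile-range rule (Tukey's fences). The sketch below applies it to synthetic data with planted outliers. The helper name `is_outlier` and the multiplier `k = 3` are illustrative assumptions (1.5 is the other conventional choice), and other detection rules are equally acceptable for this exercise.

```r
# Synthetic data with three planted outliers (not the real data.txt)
set.seed(2)
d <- data.frame(a = rnorm(200), b = rnorm(200, mean = 5, sd = 2))
d$a[c(10, 50)] <- c(1000, -999)
d$b[120] <- 9999

# TRUE where a value lies outside [Q1 - k * IQR, Q3 + k * IQR]
is_outlier <- function(x, k = 3) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

flags <- sapply(d, is_outlier)        # logical matrix with the same shape as d
clean <- d[rowSums(flags) == 0, ]     # keep only rows without any flagged value

print(nrow(d) - nrow(clean))          # number of rows removed

# Compare a statistic that uses every value with one based on ranks
print(rbind(before = sapply(d, mean),   after = sapply(clean, mean)))
print(rbind(before = sapply(d, median), after = sapply(clean, median)))
```

The fences are built from quartiles, which are themselves barely affected by a few extreme values; a rule based on the mean and standard deviation would be distorted by the very outliers it is meant to find.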
3. Plot the kernel density estimate curves on top of each other (all variables on the same plot). Explain the principle by which the curves are estimated (what "using a kernel" means here, etc.).
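A minimal sketch of overlaid density curves using base R's `density()`, again on synthetic data. The second half recomputes one point of the estimate by hand to illustrate the principle: the estimate at a point is the average of Gaussian bumps centred at the observations, with the bandwidth as their standard deviation. The helper `kde_at` is a hypothetical name introduced only for this illustration.

```r
# Synthetic stand-in data (not the real data.txt)
set.seed(3)
d <- data.frame(a = rnorm(200), b = rnorm(200, mean = 2, sd = 0.5), e = rexp(200))

# density() uses a Gaussian kernel and the bw.nrd0 bandwidth by default
dens <- lapply(d, density)
xr <- range(sapply(dens, function(k) range(k$x)))
yr <- c(0, max(sapply(dens, function(k) max(k$y))))

# Empty plot with common axes, then one curve per variable
plot(NA, xlim = xr, ylim = yr, xlab = "value", ylab = "density",
     main = "Kernel density estimates")
for (i in seq_along(dens)) lines(dens[[i]], col = i, lwd = 2)
legend("topright", legend = names(dens), col = seq_along(dens), lwd = 2)

# The principle by hand: average of Gaussian bumps centred at each observation
kde_at <- function(x0, x, h) mean(dnorm((x0 - x) / h)) / h
h <- bw.nrd0(d$a)
manual <- kde_at(0, d$a, h)
builtin <- approx(dens$a$x, dens$a$y, xout = 0)$y
print(c(manual = manual, builtin = builtin))   # should agree closely, not exactly
```

The two values differ slightly because `density()` evaluates the estimate on a grid using a binned approximation.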
4. Find a variable that divides variable "e" into two classes with clearly different averages. Make two boxplots next to each other, showing the variability of these two classes. Explain what a boxplot consists of (what the different lines and dots represent).
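A sketch of side-by-side boxplots with base R's `boxplot()`. The grouping variable `g` is hypothetical and the data are synthetic; in the real data the grouping variable has to be found first, for instance by comparing class means with `tapply` for each candidate.

```r
# Synthetic data: g is a hypothetical binary variable that splits e into two classes
set.seed(4)
g <- rep(c(0, 1), each = 100)
d <- data.frame(g = g, e = rnorm(200, mean = ifelse(g == 1, 5, 0)))

# Class averages of e for each value of g
print(tapply(d$e, d$g, mean))

# One box per class, drawn next to each other
bp <- boxplot(e ~ g, data = d, xlab = "g", ylab = "e")

# Numbers behind the drawing, one column per box:
# lower whisker end, lower hinge (about Q1), median, upper hinge (about Q3), upper whisker end
print(bp$stats)

# Points drawn individually: values further than 1.5 * IQR from the box
print(bp$out)
```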
5. Make a QQ-plot to compare the distributions of variables "a" and "b". Are the distributions similar? Explain what a QQ-plot consists of (what exactly the points on the plot mean and how conclusions can be drawn from their pattern).
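A sketch of a two-sample QQ-plot with base R's `qqplot()`, using synthetic vectors in place of the real "a" and "b". The last lines check what the points are: with equal sample sizes, the i-th point pairs the i-th smallest value of one sample with the i-th smallest value of the other.

```r
# Synthetic samples: same shape, different location and scale
set.seed(5)
a <- rnorm(300)
b <- rnorm(300, mean = 1, sd = 2)

qqplot(a, b, xlab = "quantiles of a", ylab = "quantiles of b")
abline(0, 1, lty = 2)   # reference line y = x

# The plotted coordinates are the sorted samples
qq <- qqplot(a, b, plot.it = FALSE)
print(all.equal(qq$x, sort(a)))
print(all.equal(qq$y, sort(b)))
```

Points falling on the line y = x suggest the same distribution; points on some other straight line suggest the same shape with a different location or scale; a curved pattern suggests different shapes.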
6. (Bonus 2p) It is possible to test numerically whether two variables have significantly different distributions. Find two different methods for doing this in the literature. Use them to compute a p-value indicating whether variables "a" and "b" have similar or different distributions. Explain what conclusion can be drawn. (1p if you consider only one method)
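Two candidate tests that ship with base R are shown below as a sketch on synthetic data; they are examples, and other methods from the literature are equally valid choices for this exercise. Note that the two tests answer slightly different questions: the Kolmogorov-Smirnov test compares the whole empirical distribution functions, while the Wilcoxon rank-sum test is mainly sensitive to a shift in location.

```r
# Synthetic samples with a deliberate location shift (not the real data)
set.seed(6)
a <- rnorm(300)
b <- rnorm(300, mean = 0.5)

ks <- ks.test(a, b)       # two-sample Kolmogorov-Smirnov test
wx <- wilcox.test(a, b)   # Wilcoxon rank-sum (Mann-Whitney) test

print(c(ks = ks$p.value, wilcoxon = wx$p.value))
```

A small p-value is evidence against the null hypothesis that both samples come from the same distribution. A large p-value does not prove that the distributions are identical, only that the test found no detectable difference.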