Institute of Computer Science
Data Mining (MTAT.03.183)

Data Mining 2012/13 fall


HW 07 (1.11) Descriptive data mining, R

Although it is not obligatory, we encourage you to use R for these exercises. If you have never used R, read a manual first and keep a list of functions at hand, e.g. http://www.sr.bham.ac.uk/~ajrs/R/r-function_list.html. Most of the functions needed in this homework are already available in R; a short search on the Internet will lead you to the right ones.

The data file is here: data.txt

1. Read in the data file data.txt. For each variable, calculate the mean, standard deviation, median, minimum and maximum. Describe what type of data each variable holds (continuous, discrete, etc.).
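As a rough illustration of the five statistics named in exercise 1, here is a minimal sketch in Python using only the standard library (the homework itself recommends R, where mean(), sd(), median(), min() and max() do the same job). The values below are made up for the example; they are not taken from data.txt, whose layout is not described on this page.

```python
import statistics

# Made-up values standing in for two columns of data.txt (an assumption).
columns = {
    "a": [2.0, 4.0, 4.0, 5.0, 10.0],
    "b": [1, 1, 2, 3, 3],
}

def describe(values):
    """Return the five summary statistics asked for in exercise 1."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),   # sample standard deviation (divides by n - 1)
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

summary = {name: describe(values) for name, values in columns.items()}
for name, stats in summary.items():
    print(name, stats)
```

In this toy data, "b" takes only a few whole-number values, which is the kind of hint that suggests a discrete variable, while "a" looks continuous.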

2. There are some obvious outliers in the data. Find a way to detect and remove the rows that contain them. Calculate the same statistics as in the first exercise again. Explain why some of these statistics have changed considerably and others only slightly. In the following exercises, use the data with the outliers removed.
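One common way to flag outliers, though not necessarily the one intended here, is Tukey's rule: a value is suspicious if it lies more than 1.5 interquartile ranges outside the quartiles. The sketch below (Python, standard library only) applies that rule column by column and drops any row containing a flagged value. The rows and the factor k = 1.5 are assumptions made for the illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up rows (a, b); the last row has an obviously extreme value in "a".
rows = [(10.0, 1.0), (11.0, 2.0), (12.0, 3.0),
        (13.0, 4.0), (14.0, 5.0), (500.0, 3.0)]

columns = list(zip(*rows))                    # one tuple per variable
bounds = [iqr_bounds(col) for col in columns]

# Keep a row only if every value lies inside its column's fences.
clean = [row for row in rows
         if all(lo <= x <= hi for x, (lo, hi) in zip(row, bounds))]

a_before = columns[0]
a_after = [row[0] for row in clean]
print("mean   before/after:", statistics.mean(a_before), statistics.mean(a_after))
print("median before/after:", statistics.median(a_before), statistics.median(a_after))
```

Comparing the two printed lines shows the effect the exercise asks about: the mean moves a lot when the extreme row is dropped, while the median barely changes.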

3. Plot the kernel density estimate curves of all variables on top of each other in a single plot. Explain the principle by which the curves are estimated (what "using a kernel" means here, etc.).
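To make the phrase "using a kernel" concrete, the following sketch computes a kernel density estimate by hand in Python (standard library only) instead of calling a ready-made function such as R's density(). It assumes a Gaussian kernel and a fixed, arbitrarily chosen bandwidth; the sample values are made up.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the 'bump' placed on each observation."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, bandwidth):
    """Density estimate at x: the average of kernels centred on the observations."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample) / (n * bandwidth)

sample = [1.0, 1.5, 2.0, 4.0, 4.2]   # made-up values
bandwidth = 0.5                      # hypothetical choice; larger means smoother

grid = [i * 0.1 for i in range(-20, 81)]          # evaluation points from -2.0 to 8.0
curve = [kde(x, sample, bandwidth) for x in grid]

# The curve should integrate to roughly 1, like any density.
print("approximate area under the curve:", sum(curve) * 0.1)
```

Plotting `curve` against `grid` gives the density curve; repeating this for each variable and drawing the curves on shared axes gives the kind of overlay the exercise describes.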

4. Find a variable that divides variable "e" into classes with clearly different averages. Make two boxplots side by side, showing the variability of "e" in these two classes. Explain what a boxplot consists of (what the different lines and dots represent).
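The numbers behind a standard (Tukey-style) boxplot can be computed directly, which may help when explaining its parts. The sketch below is Python with the standard library only; the grouping variable "g" and all values are hypothetical. Quartile conventions differ between programs, so a plotting function may report slightly different box edges for the same data.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Box edges, median line, whisker ends and outlier dots for one group."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lo_fence = q1 - k * (q3 - q1)
    hi_fence = q3 + k * (q3 - q1)
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "whisker_low": min(inside),    # most extreme value still inside the fences
        "q1": q1,                      # lower edge of the box
        "median": med,                 # line inside the box
        "q3": q3,                      # upper edge of the box
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < lo_fence or v > hi_fence],  # drawn as dots
    }

# Made-up (e, g) pairs; "g" is a hypothetical two-valued grouping variable.
pairs = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (5.0, 0), (20.0, 0),
         (11.0, 1), (12.0, 1), (13.0, 1), (14.0, 1), (15.0, 1)]

groups = {}
for e, g in pairs:
    groups.setdefault(g, []).append(e)

stats = {g: boxplot_stats(values) for g, values in groups.items()}
for g, s in stats.items():
    print("class", g, s)
```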

5. Make a QQ-plot to compare the distributions of variables "a" and "b". Are the distributions similar? Explain what a QQ-plot consists of (what exactly the points on the plot mean and how conclusions can be drawn from their pattern).
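Each point of a two-sample QQ-plot pairs the same quantile of the two variables. The sketch below (Python, standard library only) computes such points by hand, using one simple linear-interpolation definition of a quantile; other definitions exist, and plotting functions such as R's qqplot() may handle samples of different size differently. The data are made up so that "b" has the same shape as "a" but a different location and scale.

```python
def quantile(sorted_values, p):
    """Linear-interpolation quantile of already sorted data, for 0 <= p <= 1."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

def qq_points(x, y, n_points=5):
    """Pairs (quantile of x, quantile of y) taken at the same probabilities."""
    xs, ys = sorted(x), sorted(y)
    probs = [i / (n_points - 1) for i in range(n_points)]
    return [(quantile(xs, p), quantile(ys, p)) for p in probs]

# Made-up samples of different sizes.
a = [3.0, 1.0, 5.0, 2.0, 4.0]
b = [12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]

points = qq_points(a, b)
for qa, qb in points:
    print(qa, qb)
```

Here the points fall on a straight line that is not the diagonal y = x, which indicates the same distributional shape with a shift and a rescaling. Points bending away from a straight line would suggest differently shaped distributions.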

6. (Bonus 2p) It is possible to test numerically whether two variables have significantly different distributions. Find two different methods for doing this in the literature. Use them to compute a p-value indicating whether variables "a" and "b" have similar or different distributions. Explain what conclusion can be drawn. (1p if you consider only one method.)
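As one possible illustration, and only one of several candidate methods, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) and turns it into a p-value with a permutation test rather than the usual asymptotic formula. It is written in Python with the standard library only; in R, ks.test() provides the classical version of this test. The data, the number of permutations and the seed are assumptions for the example.

```python
import random

def ecdf(values, t):
    """Fraction of observations that are <= t."""
    return sum(v <= t for v in values) / len(values)

def ks_statistic(x, y):
    """Largest vertical gap between the two empirical CDFs."""
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in list(x) + list(y))

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Share of random relabellings whose statistic is at least the observed one."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # +1 keeps the estimate away from exactly 0

# Made-up, clearly separated samples.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]

p_value = permutation_p_value(a, b)
print("KS statistic:", ks_statistic(a, b))
print("permutation p-value:", p_value)
```

A small p-value means that a gap this large would rarely arise if both samples came from the same distribution. A large p-value does not prove that the distributions are identical, only that the data give no strong evidence of a difference.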

The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.