HW3. Descriptive statistics ... (05.03)
1. Kernel density estimation: use the two data sets klient1.txt and klient3.txt. Plot their density distributions to compare them. The numbers represent the time of the week (measured in hours from Sunday midnight) at which two different groups of people went shopping over an entire year. Choose an informative kernel and kernel width, and justify your choice. Briefly characterise the two data sets (klient1 and klient3).
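Hint for task 1: the following is a minimal sketch of a Gaussian kernel density estimate, not a required solution. It assumes the times have already been read into a list of floats; the file layout is not specified here, so `toy_hours` holds made-up values, and the names `silverman_bandwidth` and `gaussian_kde` are illustrative. Silverman's rule of thumb is only one possible way to pick a width, and the sketch ignores the fact that the week wraps around at 168 hours.

```python
import math
import statistics


def silverman_bandwidth(samples):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n**(-1/5)."""
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return 0.9 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at every grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


# Made-up shopping times in hours since Sunday midnight (NOT the real data).
toy_hours = [34.5, 35.0, 36.2, 58.1, 60.4, 82.0, 83.5, 107.9, 130.2, 131.0, 132.5, 133.1]

grid = [0.5 * i for i in range(337)]  # 0.0, 0.5, ..., 168.0
h_auto = silverman_bandwidth(toy_hours)
density_auto = gaussian_kde(toy_hours, grid, h_auto)  # wide kernel, smooths heavily
density_narrow = gaussian_kde(toy_hours, grid, 1.0)   # 1-hour kernel, keeps daily peaks
```

A rule-of-thumb width computed over the whole week tends to be wide enough to blur the daily rhythm, which is one reason the task asks you to justify the width rather than accept a default.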
2. Extract the information about Fridays and Saturdays from the data above. Plot the four densities (2 sets of clients, 2 days) as in task 1, overlaying them in a single plot. Select two of the four distributions and make a Q-Q plot to compare them. Interpret the Q-Q plot and describe how the distributions differ.
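Hint for task 2: a possible sketch of the day extraction and of the points behind a Q-Q plot. It assumes that hour 0 is 00:00 on Sunday, so that Friday covers hours 120-144 and Saturday hours 144-168; if "Sunday midnight" means the end of Sunday in the real data, the day indices shift by one. The helper names and the values in `toy_week` are made up for illustration.

```python
import statistics

HOURS_PER_DAY = 24
# ASSUMPTION: hour 0 is 00:00 on Sunday, so Friday is day 5 and Saturday is day 6.
FRIDAY, SATURDAY = 5, 6


def hours_on_day(samples, day_index):
    """Keep the samples that fall on one day and convert them to hour-of-day."""
    start = day_index * HOURS_PER_DAY
    return [t - start for t in samples if start <= t < start + HOURS_PER_DAY]


def qq_points(a, b, n_points=9):
    """Pair the same quantiles of two samples; the sample sizes may differ."""
    qa = statistics.quantiles(a, n=n_points + 1, method="inclusive")
    qb = statistics.quantiles(b, n=n_points + 1, method="inclusive")
    return list(zip(qa, qb))


# Made-up values for illustration only (NOT taken from klient1.txt or klient3.txt).
toy_week = [121.5, 129.0, 130.2, 133.8, 137.5, 139.1, 153.0, 154.4, 156.9, 158.2, 160.0]
friday = hours_on_day(toy_week, FRIDAY)
saturday = hours_on_day(toy_week, SATURDAY)

for x, y in qq_points(friday, saturday):
    print(f"Friday {x:5.2f} h   Saturday {y:5.2f} h")
```

Plotting the pairs as X-Y points gives the Q-Q plot. Points on the diagonal indicate equal distributions, a line parallel to the diagonal indicates a shift in time, and a slope different from 1 indicates a difference in spread.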
3. Study the data in product_time_shop.txt. It contains information about a few products, shops and times of purchases through the week. Describe the data: which products and shops are included, how many purchases of each product were made in each shop, and which periods are covered?
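Hint for task 3: a sketch of how the counts could be tabulated. The layout of product_time_shop.txt is not described here, so the sketch assumes one tab-separated record per line in the order product, time, shop (guessed from the file name); check the real file before relying on this. The product and shop names in `toy_lines` are invented.

```python
from collections import Counter

# ASSUMPTION: each record is "product<TAB>time<TAB>shop"; verify against the real file.
toy_lines = [
    "milk\t10.5\tshopA",
    "milk\t34.2\tshopA",
    "bread\t35.0\tshopB",
    "milk\t59.7\tshopB",
    "bread\t130.1\tshopA",
]


def parse_record(line):
    """Split one line into (product, time in hours, shop)."""
    product, time_text, shop = line.rstrip("\n").split("\t")
    return product, float(time_text), shop


records = [parse_record(line) for line in toy_lines]

# Number of purchases for every (shop, product) combination.
purchases = Counter((shop, product) for product, _, shop in records)
times = [t for _, t, _ in records]

for (shop, product), count in sorted(purchases.items()):
    print(f"{shop:6s} {product:6s} {count}")
print(f"time range: {min(times)} to {max(times)}")
```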
4. Draw violin plots and/or boxplots (preferably overlaid) that allow comparing different weekdays, shops, and product sales. Identify some meaningful illustrations to draw conclusions about 1) different weekdays, 2) different products, 3) different shops. State your hypothesis first, and then carry out the corresponding analysis of the data.
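Hint for task 4: a boxplot is a drawing of the five-number summary of each group, so it can help to compute those numbers before plotting. The sketch below assumes records of the form (product, time, shop) with time in hours since the start of the week; that layout, the helper names and the toy values are assumptions for illustration.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum (what a boxplot draws)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def summarise_by(records, key):
    """Group hour-of-day values by key(record) and summarise each group."""
    groups = defaultdict(list)
    for record in records:
        _, time, _ = record
        groups[key(record)].append(time % 24)
    return {name: five_number_summary(v) for name, v in groups.items() if len(v) >= 2}


# Hypothetical (product, time, shop) records; names and values are made up.
toy_records = [
    ("milk", 33.0, "shopA"), ("milk", 35.0, "shopA"), ("milk", 42.0, "shopA"),
    ("milk", 57.0, "shopB"), ("milk", 66.0, "shopB"), ("milk", 68.0, "shopB"),
]

by_shop = summarise_by(toy_records, key=lambda r: r[2])
by_day = summarise_by(toy_records, key=lambda r: int(r[1] // 24))
```

Changing the `key` function switches between comparing shops, weekdays or products without touching the rest of the code.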
5. Use the same data as in task 3. Explore the data and identify whether any of the shops has run out of a popular product during the day (which shops, products, days?). Find a visualisation that would convince the reader or a shop manager. Formulate the principles of an automated procedure to identify all such events across an entire supermarket (or several supermarkets).
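Hint for task 5: one possible principle for an automated procedure is to compare, for each shop and product, the time of the last sale on every day with the typical time of the last sale. The sketch below is only an illustration of that idea under stated assumptions: records are (product, time, shop) tuples, the 3-hour threshold `min_gap_hours` is arbitrary, and all names and values are made up. An early last sale can also have other causes (low demand, shorter opening hours), so flagged days are candidates to inspect, not proof.

```python
import statistics
from collections import defaultdict


def last_sale_per_day(records):
    """Map (shop, product) -> {day: hour-of-day of the last purchase that day}."""
    last = defaultdict(dict)
    for product, time, shop in records:
        day, hour = int(time // 24), time % 24
        per_day = last[(shop, product)]
        per_day[day] = max(hour, per_day.get(day, 0.0))
    return last


def suspected_stockouts(records, min_gap_hours=3.0):
    """Flag days whose last sale is at least min_gap_hours before the median last sale."""
    flagged = []
    for (shop, product), per_day in last_sale_per_day(records).items():
        typical = statistics.median(per_day.values())
        for day, hour in sorted(per_day.items()):
            if typical - hour >= min_gap_hours:
                flagged.append((shop, product, day, hour))
    return flagged


# Hypothetical records: on day 3 the sales of "milk" in "shopA" stop at 13:00.
toy_records = [
    ("milk", 33.0, "shopA"), ("milk", 44.0, "shopA"),
    ("milk", 58.0, "shopA"), ("milk", 67.5, "shopA"),
    ("milk", 81.0, "shopA"), ("milk", 85.0, "shopA"),
    ("milk", 107.0, "shopA"), ("milk", 116.5, "shopA"),
]

print(suspected_stockouts(toy_records))
```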
NB! Bonus tasks are voluntary; choose only one, and only if you have completed all of tasks 1-5 (the exception is task 8, which you can play in any case).
6. (Bonus task, voluntary, 2p)
- Draw a Q-Q plot for the whole data from task 1 and interpret it. Solve that problem somehow and make meaningful use of Q-Q plots for the entire data in task 1.
- Show how to use Q-Q plots for the data from task 3 as well. Make a brief example that uses the shopping time data.
7. (Bonus task, voluntary, 2p; only if you have completed tasks 1-5)
- Make use of heat-maps and dot-plots (X-Y plots), also measuring the correlation coefficient, for the data in task 3. (Figure out what to compare and how.)
- Consider issues of normalisation of such data. What problems may need to be overcome with normalisation? Apply these ideas and present visual proof of the utility of normalisation.
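Hint for task 7: a small sketch of one normalisation and of the Pearson correlation coefficient. It assumes purchases have already been counted per time slot for each product; the counts for the two hypothetical products below are made up. Scaling each row to fractions makes a popular and a niche product comparable on one colour scale in a heat-map, while the Pearson coefficient itself does not change under such rescaling.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def to_fractions(counts):
    """Scale a row of counts so that it sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Made-up purchase counts in four time slots for two hypothetical products.
popular = [40, 120, 200, 80]
niche = [2, 5, 11, 4]

r_raw = pearson(popular, niche)
r_norm = pearson(to_fractions(popular), to_fractions(niche))
print(f"raw r = {r_raw:.3f}, normalised r = {r_norm:.3f}")
```

Because the coefficient is unaffected by this kind of scaling, the visible benefit of normalisation shows up in the heat-map colours rather than in the correlation value, which is worth keeping in mind when deciding what to compare.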
8. Bonus task: play the game at http://guessthecorrelation.com/. Anyone who scores above 50 will receive one bonus point. Present the "proof" as a screenshot. The winner with the highest score will receive one more point. Post your results on Piazza to claim the points for the best score.