HW 06 (25.10) Clustering III and Seriation

1. Simulate DBSCAN on the data from last week's task 1 (on paper). Aim for about 3 clusters.

2. Outline an algorithm for density-based clustering that can handle clusters of varying densities.

3. Look at some example binary matrices from here (tarball here). Do any of them follow the Pareto principle? If not, generate similar data and demonstrate how you would have discovered such an example.

Comment from TA: It is enough to consider 80-20 here ("narrow" Pareto principle).
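One possible way to test the 80-20 reading on a binary matrix is sketched below. It assumes, as an interpretation not stated in the task, that "following the Pareto principle" means the densest 20% of rows (or columns) hold about 80% of all 1s. The function name `top_share` and its parameters are hypothetical.

```python
import numpy as np

def top_share(m, axis=1, fraction=0.2):
    """Share of all 1s held by the densest `fraction` of rows (axis=1)
    or columns (axis=0). Assumed interpretation of the 80-20 rule."""
    m = np.asarray(m)
    # Number of 1s per row (axis=1) or per column (axis=0), densest first.
    counts = np.sort(m.sum(axis=axis))[::-1]
    # At least one row/column is always included.
    k = max(1, int(round(fraction * counts.size)))
    total = counts.sum()
    return counts[:k].sum() / total if total else 0.0

# Toy matrix: one dense row out of five holds 4 of the 5 ones.
example = np.array([
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
row_share = top_share(example, axis=1)
col_share = top_share(example, axis=0)
```

A matrix would then count as roughly Pareto-like along rows when `row_share` is close to 0.8; how close is close enough is left to your own judgement.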

4. Implement a goodness measure for the above example data that counts how well each value is surrounded by its own "kind". For both 0s and 1s, take into account all 8 neighbours. Calculate the scores for all the matrices above.

Comment from TA: "By its own kind" means how 1s are surrounded by 1s and 0s surrounded by 0s.
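A minimal sketch of one such measure, assuming the score is simply the total number of (cell, neighbour) pairs that share the same value. The name `neighbour_score` is hypothetical, and cells on the border are assumed to have fewer than 8 neighbours rather than wrapping around.

```python
import numpy as np

def neighbour_score(m):
    """Count, over all cells, how many of the up to 8 neighbours
    have the same value as the cell itself (0 next to 0, 1 next to 1)."""
    m = np.asarray(m)
    rows, cols = m.shape
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # a cell is not its own neighbour
            # `cells` and `neigh` are aligned views: neigh[i, j] is the
            # neighbour of cells[i, j] at offset (dr, dc).
            cells = m[max(0, -dr):rows - max(0, dr),
                      max(0, -dc):cols - max(0, dc)]
            neigh = m[max(0, dr):rows - max(0, -dr),
                      max(0, dc):cols - max(0, -dc)]
            total += int(np.sum(cells == neigh))
    return total

uniform = np.ones((2, 2), dtype=int)      # every neighbour matches
checker = np.array([[1, 0], [0, 1]])      # only diagonal neighbours match
uniform_score = neighbour_score(uniform)  # 4 cells * 3 neighbours = 12
checker_score = neighbour_score(checker)  # 4 cells * 1 diagonal match = 4
```

Each matching pair is counted twice here (once from each side), which does not affect comparisons between orderings of the same matrix. To compare matrices of different sizes, the total could be divided by the number of neighbour pairs.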

5. Implement some sort of data reordering for the above matrices and try to maximise your score from task 4. For which datasets do you find "optimal results", "good results", or really "bad results"?

Comment from TA: You can decide yourself where to draw the line between optimal vs good and good vs bad. When maximising, it is enough to find an "approximately maximal" result; you don't have to look at all permutations.
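One simple way to search for an approximately maximal ordering is hill climbing over random row and column swaps, sketched below. This is only one of many possible reordering strategies, not a prescribed one. The names `reorder` and `neighbour_score`, the iteration count, and the seed are all assumptions; the scoring function is a stand-in for whatever measure you implemented in task 4.

```python
import random
import numpy as np

def neighbour_score(m):
    """Stand-in score: number of (cell, neighbour) pairs with equal values."""
    rows, cols = m.shape
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            cells = m[max(0, -dr):rows - max(0, dr),
                      max(0, -dc):cols - max(0, dc)]
            neigh = m[max(0, dr):rows - max(0, -dr),
                      max(0, dc):cols - max(0, -dc)]
            total += int(np.sum(cells == neigh))
    return total

def reorder(m, iterations=2000, seed=0):
    """Hill climbing: swap two random rows or columns and keep the swap
    only if the score improves. Returns (reordered matrix, score)."""
    rng = random.Random(seed)
    best = np.array(m)
    best_score = neighbour_score(best)
    for _ in range(iterations):
        axis = rng.randrange(2)        # 0 = swap rows, 1 = swap columns
        if best.shape[axis] < 2:
            continue                   # nothing to swap along this axis
        i, j = rng.sample(range(best.shape[axis]), 2)
        cand = best.copy()
        if axis == 0:
            cand[[i, j], :] = cand[[j, i], :]
        else:
            cand[:, [i, j]] = cand[:, [j, i]]
        score = neighbour_score(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy input: alternating rows, the worst case for the stand-in score.
start = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]])
start_score = neighbour_score(start)
result, result_score = reorder(start)
```

Because only improving swaps are accepted, the search can get stuck in a local maximum. Restarting from several random permutations, or occasionally accepting a worse swap, are common ways around that.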

6. (Bonus 2p) Write a project proposal (1 A4 page) for density-based clustering (e.g. developing new ideas, making test data, ...) or for matrix reordering/biclustering tasks. Motivate the problem and ask a relevant question that could be answered with a project.

Comment from TA: Try to be specific about which type of data is going to be used (numeric, non-numeric, etc.) and how the methods apply to that type of data.

Data Mining 2012/13 fall
