HW 10 (due April 26th) Clustering, Seriation, ...
1. Cluster the disease frequencies and visualise them as a heatmap (or PCA plot). Consider whether any normalisation is needed and, if so, apply it first. Experiment with distance measures and identify which one works best for you. Describe what you observe in the data.
Here is the updated version of the data file disease_freq_large.txt with ICD-10 codes.
2. Show the disease-disease distance matrix as a heatmap based on the best clustering order that you achieved.
3. Identify diseases with specific time-dependent patterns - annual cyclic behaviour, epidemics, etc. Characterise why and how they are "interesting". Experiment with defining "interesting" patterns and retrieving the diseases that match these predefined patterns.
4. and 5. Use these binary data examples - all_matrices.txt and the single tarball. Apply two different seriation and/or two-way clustering methods to all data sets. Produce a compact visual output (e.g. heatmaps in three columns) showing the original and the two "ordered" versions of each data set. Characterise the chosen methods briefly and use the provided example data sets to illustrate where they produce similar or different results.
6. (1p) Project planning - propose one idea and a project plan for a project task. Describe the idea, data, tasks and goals and, if possible, the business case and value. Develop the project description and "pitch" your idea to the others. (You do not need to stick to this project in the end.)
7. (2p) Experiment with the binary data from tasks 4 and 5 by defining your own "goodness measure" and performing optimisation to achieve the best possible result. Feel free to use any optimisation heuristic - simulated annealing, genetic algorithms, etc. Compare your solution(s) to those of tasks 4 and 5.
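
To make the idea in task 7 concrete, here is a minimal sketch of one possible "goodness measure" and a simple simulated annealing search over row and column orders. Everything in it is an assumption made for illustration: the measure (the number of neighbouring cells that differ, lower is better), the function names adjacency_cost and anneal_order, the linear cooling schedule, and the small block matrix used as a demo. It does not read all_matrices.txt and is not meant as the expected solution.

```python
import numpy as np


def adjacency_cost(matrix):
    """Count vertically and horizontally adjacent cells that differ.

    Illustrative goodness measure (an assumption, not prescribed by the task):
    a well-ordered binary matrix has large uniform regions, so fewer changes
    between neighbouring cells means a lower (better) cost.
    """
    m = np.asarray(matrix, dtype=int)
    return int(np.abs(np.diff(m, axis=0)).sum() + np.abs(np.diff(m, axis=1)).sum())


def anneal_order(matrix, steps=5000, t0=2.0, seed=0):
    """Search row/column permutations with simulated annealing.

    Returns (best_cost, row_order, col_order). Each move swaps two rows or
    two columns; worse moves are accepted with probability exp(-delta / temp).
    """
    rng = np.random.default_rng(seed)
    m = np.asarray(matrix, dtype=int)
    rows = np.arange(m.shape[0])
    cols = np.arange(m.shape[1])
    cost = adjacency_cost(m)
    best = (cost, rows.copy(), cols.copy())

    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temp = t0 * (1.0 - step / steps) + 1e-9
        order = rows if rng.random() < 0.5 else cols
        if len(order) < 2:
            continue
        i, j = rng.choice(len(order), size=2, replace=False)
        order[i], order[j] = order[j], order[i]  # propose a swap in place
        new_cost = adjacency_cost(m[np.ix_(rows, cols)])
        if new_cost <= cost or rng.random() < np.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best[0]:
                best = (cost, rows.copy(), cols.copy())
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap

    return best


if __name__ == "__main__":
    # Hypothetical demo data: two 4x4 blocks of ones, then shuffled.
    blocks = np.kron(np.eye(2, dtype=int), np.ones((4, 4), dtype=int))
    rng = np.random.default_rng(1)
    shuffled = blocks[np.ix_(rng.permutation(8), rng.permutation(8))]

    best_cost, row_order, col_order = anneal_order(shuffled)
    print("cost before:", adjacency_cost(shuffled))
    print("cost after: ", best_cost)
    print(shuffled[np.ix_(row_order, col_order)])
```

The same adjacency_cost function can be applied to the orderings obtained in tasks 4 and 5, which gives one numeric way to compare them with the optimised result. Other measures (for example, ones that reward a banded or nested structure) would favour different orderings, so the choice of measure is part of the experiment.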