Data Mining (Andmekaeve) - Courses - Institute of Computer Science

HW10. Clustering I (30.04)

Use the clustering example data set (attached file: Clustering_example.xlsx) to simulate the algorithms. Use Euclidean distance.

1. Simulate hierarchical clustering of the example data manually. Be smart: you do not need the full 20x20 distance matrix if you can determine the smallest distances visually. Provide the order of all mergers. Where several candidate mergers have the same distance, just decide which one to perform first. Draw a tree (dendrogram) of the clustering and provide one ordering of the data points along that tree (the order of its leaves).
a) single linkage (min distance clustering)
b) complete linkage (max distance clustering)
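To check a manual simulation, the two linkage rules can be sketched in a few lines of plain Python. This is only a naive illustration, not the required method: the function name `agglomerative` and the four points P, Q, R, S are hypothetical and are not taken from Clustering_example.xlsx. Ties are broken by taking the first pair encountered.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative(points, linkage="single"):
    """Naive agglomerative clustering.

    points:  dict mapping a label to a coordinate tuple (hypothetical input format).
    linkage: "single" uses the minimum pairwise distance between two clusters,
             "complete" uses the maximum pairwise distance.
    Returns (merges, leaf_order), where merges is a list of
    (cluster_a, cluster_b, distance) in the order the mergers happen.
    """
    aggregate = min if linkage == "single" else max
    clusters = [(name,) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = aggregate(dist(points[p], points[q]) for p in a for q in b)
            if best is None or d < best[0]:  # strict "<": first pair wins a tie
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # concatenation keeps each subtree contiguous
        merges.append((a, b, d))
    return merges, clusters[0]


# Hypothetical example points, chosen so that the two linkages build different trees.
example = {"P": (0, 0), "Q": (10, 0), "R": (21, 0), "S": (33, 0)}
for rule in ("single", "complete"):
    merges, leaf_order = agglomerative(example, rule)
    print(rule, [(a, b) for a, b, _ in merges], leaf_order)
```

On these points single linkage chains R and then S onto the cluster {P, Q}, while complete linkage first pairs R with S and only then joins the two pairs. The returned leaf order is one valid left-to-right ordering of the dendrogram leaves.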

2. Using the same data, simulate K-means clustering starting from the initial cluster centers (points) A, C, F, M, as indicated in the file.
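The iteration being simulated can be sketched as follows, assuming the standard Lloyd-style K-means (assign every point to its nearest center, then move each center to the mean of its points, until nothing changes). The function name `kmeans`, the six points and the two starting centers are hypothetical; the real task starts from the four centers A, C, F, M in the example file.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centers, max_iter=100):
    """Lloyd-style K-means from explicitly given initial centers.

    Returns (centers, assignment), where assignment[i] is the index of the
    center that points[i] belongs to.
    """
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical data: two visible groups, both initial centers placed in the first one.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centers, labels = kmeans(data, centers=[(1, 1), (2, 1)])
print(final_centers, labels)
```

Writing down the assignment and the center coordinates after every pass of the loop gives the same trace a manual simulation should produce.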

3. Cluster the large data set using existing software or a library. The file is an RGB version of a row-shuffled image: http://biit.cs.ut.ee/imgshuffle/data/DM2017/DM2017.txt

The tool for re-rendering the image in a different row order is here: http://biit.cs.ut.ee/imgshuffle/index.cgi?fname=DM2017&dname=DM2017&shufflefile=DM2017.txt

The tool above allows you to paste in the row IDs (R0001, R0002, etc.) in a different permutation order. It will render the same image according to the new order of rows. Your task is to cluster the data and fetch the row IDs in some order provided by the clustering.

Try out clustering the data using hierarchical clustering or K-means, for example.
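Once a clustering method has produced one label per row, the labels still have to be turned into a list of row IDs. The sketch below shows one arbitrary but deterministic way to do that: clusters are emitted in label order, and rows within a cluster are sorted by distance to the cluster centroid. The function name `order_rows`, the in-memory input format and the tiny one-value "rows" are assumptions for illustration; the layout of DM2017.txt is not described here, so parsing is left out.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def order_rows(ids, vectors, labels):
    """Turn per-row cluster labels into one ordering of row IDs.

    ids:     row identifiers such as "R0001"
    vectors: one numeric feature tuple per row (for example flattened RGB values)
    labels:  one cluster label per row, from any clustering method
    """
    groups = defaultdict(list)
    for row_id, vec, label in zip(ids, vectors, labels):
        groups[label].append((row_id, vec))

    ordered = []
    for label in sorted(groups):
        members = groups[label]
        # Centroid = coordinate-wise mean of the rows in this cluster.
        centroid = tuple(
            sum(col) / len(members) for col in zip(*(vec for _, vec in members))
        )
        # Stable sort: rows at equal distance keep their input order.
        members.sort(key=lambda member: dist(member[1], centroid))
        ordered.extend(row_id for row_id, _ in members)
    return ordered


# Hypothetical rows with a single brightness value each.
row_ids = ["R0001", "R0002", "R0003", "R0004", "R0005"]
rows = [(200,), (10,), (220,), (0,), (240,)]
cluster_labels = [1, 0, 1, 0, 1]
print("\n".join(order_rows(row_ids, rows, cluster_labels)))
```

The printed list is in the form the re-rendering tool expects to have pasted in. With hierarchical clustering, the leaf order of the dendrogram can be used directly instead of this grouping step.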

What is depicted in the image? Insert a version of the "recovered image" into the report. Of course, it does not need to be perfect; it is merely meant to visualise what you get from the analysis.

4. Read the analysis "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance" http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
a) Identify the various data analysis techniques used throughout the analysis
b) Identify potential business needs and cases for such analysis

5. Develop a description of one idea for a data mining project. This should be a description approximately one A4 page long, covering the data, goal, analysis methods, and the key expected final result. Try to assess the team size and the time required to complete such a project.

We have also collected a number of ideas and even some data sets for you to use. Directory with project ideas, data, etc: https://drive.google.com/drive/folders/0B5TbAaWYr3OgR2FyVnlIVjBoVDA

The document for final projects that have been started (your own project or one of the proposed ones) is here: https://docs.google.com/presentation/d/1JrCE0O0R3kRLt8jjrO4dO1LKySJ6WyOFeM9i23W37h4/edit

6. (Bonus, 1 point) Taxify - special bonus task. Provide one novel idea and project description for how the data from Taxify could and should be analysed in order to provide a business case that should help the company succeed. State the business goal of the analysis, what data would be needed, and which analysis should be performed. Note: we would love to share your idea with Taxify. Please indicate in the text whether you are OK with sharing that part of your report with Taxify; ideally, write it as a separate document to make it easier to forward.

In collaboration with Taxify the following special prizes will be offered:
I prize - 300€ + 100€ taxi credit,
II prize - 200€ + 75€ taxi credit,
III prize - 150€ + 50€ taxi credit.

Additional comment from Taxify: One of the most important aspects is efficiency in business - e.g.
- many riders in the same car (car pooling), analysis of parallel ongoing rides; ridesharing is 1-2% of transportation, how can it be improved further?
- demand/supply ratio, short term demand prediction and dynamic pricing (supply-demand heatmap)
- prediction of pickup / dropoff places & times, per rider and as a group
It is possible that Taxify will also offer an internship to some student(s).

Data Mining (Andmekaeve), spring 2016/17