HW6. K-d tree, RPTree, Treap, Union-Find (11th of Oct, 23.59)
EX1. Build a K-d tree on the two-column data provided. Split data by choosing median values. Draw the final 2-dimensional K-d tree (You don't have to give every step). You can either do this by hand or write a program to do it for you (or partially do it for you). Do not use the red point when building the K-D tree.
After building the tree, answer the following questions:
- How would you use this structure to find points in the quadrant [5, 10] x [0, 5] (bottom-right quadrant).
- Describe how you would use this data structure to find the closest points (Euclidean distance) to the red query point (x=2.69,y=2.93). How many calculations would it take to find the closest point naively (i.e. by calculating the distance to every other point)?
X, Y 7.08, 9.78 7.36, 7.76 0.81, 4.50 2.03, 6.12 8.15, 7.49 5.67, 3.51 1.97, 2.31 0.05, 5.09 9.57, 1.15 7.98, 6.38 9.07, 1.75 4.29, 8.09 3.14, 3.09 1.45, 1.35 2.80, 2.59 1.82, 5.59 5.18, 4.76 9.51, 7.23 0.65, 8.89 5.06, 5.50 9.53, 0.28 3.09, 4.46 9.70, 8.74 6.76, 4.51
EX2. Use the same data as in exercise 1 and build a Random-Projection tree. You can use the following pseudo-algorithm to generate splits:
- Generate a random 2-D unit vector;
- Find projections of all data in the split onto the unit vector;
- Find the split location (median).
- How to use the RP tree to perform the closest neighbour query?
- How to guarantee that the found point is actually the closest neighbour?
You can make use of the following code to generate a unit vector and find projection (dot product of a unit vector and data point).
# Unit vector code from https://stackoverflow.com/questions/6283080/random-unit-vector-in-multi-dimensional-space from random import gauss import numpy as np def make_rand_vector(dims): vec = [gauss(0, 1) for i in range(dims)] mag = sum(x**2 for x in vec) ** .5 return [x/mag for x in vec] def projection(point, unit_vec): return np.dot(point, unit_vec) # Use example unit_vector = make_rand_vector(2) print(projection([5.18, 4.76], unit_vector))
You can use these data structures also on higher dimensional data! Use the 3-D data points given below and show how to do one random projection split.
As a unit vector use the following (0.577, 0.577, 0.577). Find a hyperplane that would split the data points at “median” - half on one side, half on another side. Label the below points accordingly and calculate the distance from the hyperplane. Provide formulae in your report.
Hint 1: In order to find a splitting hyperplane, you should first project your data onto the given unit vector. Practically you have vector-on-vector projections. Now, the question is, how do you find the median projection (hint: median value))?
Hint 2: Given the unit vector and the scalar you have all the data to find the median point. Then you can use this found median point and unit vector to find the dividing hyperplane. Hyperplane can be found using its normal vector (what is the normal vector here?) and a point on the hyperplane (note that the hyperplane will be 1 dimension less than your initial data).
X Y Z 73.64 10.36 97.37 34.17 94.43 33.35 32.08 97.13 25.58 51.45 89.81 38.43 15.24 95.85 97.98 44.86 12.32 29.15 34.50 28.76 48.76 67.45 22.75 53.81 7.510 90.28 81.90 77.12 26.87 22.04 71.35 66.32 86.06 23.59 38.70 26.60 17.07 44.94 41.63 12.07 61.15 20.94 48.47 37.50 25.41 92.62 66.00 54.00 79.05 60.91 1.420 1.540 83.79 92.31 20.09 65.39 75.14
Bonus 1p: visualise the given data and the separating hyperplane!
EX4. Treap Treap is an augmented binary search tree. Practically it has properties of a BST + heap. Every node is associated with a value, which has the constraint that larger values are higher up in the tree than children nodes (for max heap). Those keys are often randomly generated.
Every time you insert an element, you generate an additional heap value (priority value) for it and insert it like you would insert to a BST, which is followed by rotations (single rotations) to fix the heap property (just rotate it up the tree) until the tree is fixed. In order to delete a key, you first change the heap's value to -infinite and rotate it down the tree.
Insert the following keys in order (left-to-right) to a Treap. For BST, the keys are in lexicographical order (A < B, B < C, etc..):
A, B, C, D, E, F, G, H, I
1) Every time you insert one, generate a random number for the heap priority values. For at least one insertion, show how you perform the rotations.
2) Now insert the keys again (to a new tree), but this time, take every heap priority value to be equal to 1. What happened?
3) Play around with the distribution of priority keys, such that the tree becomes more balanced when inserting nodes in the same order (A to I). (Don't worry if you don't get a perfectly balanced tree)
Hint: video lecture & slides on Treap (Augmenting data structures).
EX5. Union-Find. We have made a generator for artificial team composition with "random individuals". Players indicate with whom they want to be in the same team. Your task is to generate the teams (assign the final team number to each player). Then to use the Union-Find algorithm with and without "path-compression" to demonstrate the effect of path compression (check questions below). If the data is "too small", make it bigger (increase n), if the generator code is "too easy", make it more challenging for the Union-Find to handle. Otherwise, just measure the time (or nr of operations) in case of without and with path-compression.
import random import time import numpy as np n = 10_000 # nr of players nr = 7 # Nr of teams nb = 10 # random thing to make it look complicated # Generate the pairs who want to "play together". Goal: data has to be a bit "skewed" # after honouring all these wishes, how many teams do we have, who belongs to which team? start_time = time.time() random.seed(10) # you can cycle over many seeds ... (random trials) merge= i = 1 while( i < n*nb ): a = random.randint( 1+(i//nb) , 1+(i//nb)+nb*nr ) % n b = a + random.randint(1,3)*nr if a==b or a==0 or b==0 or b > n or a > n : continue merge.append( [a,b] ) # a takes b into the team (unless already there) i += 1 end_time = time.time() generation_time= end_time - start_time print("Time to generate:", int(generation_time*100_000)/100_000, "s ", merge[:5])
Dataset: The pairs in list "merge" are (player1, player2) pairs, where player 2 shall join the team of player 1, whatever that team is.
Questions: How could you make use of the Union-Find data structure to efficiently answer the following questions:
- What's the team nr of each player?. Identify the team by the smallest nr of all the team members. e.g. team of players 8, 5, 7, 9, 4, 10 is identified by "4".
- How many teams and how many members in each team? (random data may play tricks to us)
- Is anyone left out of any team(s) by still being singleton (how many?)?
EX6. Bonus (2p) Gal Gadot is a climate scientist. As part of her research, she often needs to query the maximum temperature the earth has observed within a particular date range in history, based on a growing set of measurements that she collects and adds to frequently. Assume that temperatures and dates are integers representing values. Help Gal Gadot evaluate such range queries efficiently by implementing a database supporting the following operations:
|record_temp(t, d)||record a measurement of temperature t on date d|
|max_in_range(d1, d2)||return max temperature observed between dates d1 and d2 (including d2)|
To solve this problem, Gal Gadot needs to store temperature measurements in an AVL tree, where each node A stores a measurement A.item with a date property A.item.key and temperature property A.item.temp
To help evaluate the desired range query, we will augment each node with: A.max₋temp, the maximum temperature stored in A’s subtree; and both A.min₋date and A.max₋date, the minimum and maximum dates stored in A’s subtree respectively.
1) Describe an O(1)-time algorithm to compute the value of these augmentations on node A, assuming all other nodes in A’s subtree have already been correctly augmented.
2) Describe a database to implement operations record_temp(t, d) and max_in range(d1, d2), each in worst-case O(log n) time, where n is the number of unique dates of measurements stored in the database at the time of the operation.