HW5. Treap/Heap, Union-find, Random Projection (9th of Oct, 23.59)
EX1. (1p). Read: Kamp (CACM 2010 53:7) "You're doing it all wrong.". What is the main criticism in the article? What is your opinion about the article? Discuss how exactly the Heap's use of implicit indexing (i => 2*i, 2*i+1; and parent==i/2 ) should be changed according to Kamp's article.
EX2. (1p). Consider an interval (x, y). In this exercise, we will interpret these intervals as geometrical points in 2D space, such that x would be on the x-axis and y on the y-axis (interval (1,2) would be represented as a point p = (1,2) on 2D plot).
Example: Consider interval (3,5). This would be plotted as the following:
First of all, we will fix one interval (5, 10) and use it as a reference interval. The rest of the points in this 2D space need to be coloured according to the type of relationship they have with the reference interval. For example interval (1, 3) would be non-overlapping from the left to the reference interval, interval (11, 14) non-overlapping from the right etc. How many types of relationships are there in this 2D space? Visualise these types by colouring points (making coloured regions) in the space accordingly. Include a reference interval and also one example interval from each region.
EX3. (1p). Treap is an augmented binary search tree. Practically it has properties of a BST + heap. Every node is associated with a value, which has the constraint that larger values are higher up in the tree than children nodes (for max heap). Those keys are often randomly generated.
Every time you insert an element, you generate an additional heap value (priority value) for it and insert it like you would insert to a BST, which is followed by rotations (single rotations) to fix the heap property (just rotate it up the tree) until the tree is fixed. In order to delete a key, you first change the heap's value to -infinite and rotate it down the tree.
Insert the following keys in order (left-to-right) to a Treap. For BST, the keys are in lexicographical order (A < B, B < C, etc..):
A, B, C, D, E, F, G, H, I
1) Every time you insert one, generate a random number for the heap priority values. For at least one insertion, show how you perform the rotations.
2) Now insert the keys again (to a new tree), but this time, take every heap priority value to be equal to 1. What happened?
3) Play around with distribution of priority keys, such that the tree becomes more balanced when inserting nodes in the same order (A to I). (Don't worry if you don't get a perfectly balanced tree)
Hint: video lecture & slides on Treap (Augmenting data structures).
EX4. (1p). Your task is to use Union-Find with and without path compression on the following nodes (consider each to be a single-element set already).
1, 2, 3, 4, 5, 6, 7, 8
Briefly describe the algorithm. How do you decide which node becomes the parent when applying union? How does path compression work?
First, apply union operation in the following order:
Union(1,2), Union(3,4), Union(5,6), Union(6,7), Union(3,7), Union(4,8), Union(1,6)
What is the height of this tree? Now demonstrate finding 3 (when using path compression, how does the tree change? If you find 3 again, then how many comparisons less do you have to do?). How can you use this data structure to figure out that two nodes belong to the same set?
Show the final result both in array-structure and in a tree-structure
EX5. (1p). Use 4-D data points given below. Define a random hyperplane direction (hint: random 4-D unit vector). Find such a hyperplane that would split the data points at “median” - half on one side, half on another side. Label the below points accordingly and calculate the distance from the hyperplane. Provide formulae in your report.
Hints:
Feel free to take any vector in this case, just make sure it is a unit vector (it makes it simpler). You can turn a vector into a unit vector by dividing it by its magnitude (google: vector magnitude/length, unit vector). (If you really want to sample a uniformly random vector, then you should actually sample each coordinate from Normal distribution N(0,1), and then normalize the vector to length 1.)
Hint 2: In order to find a splitting hyperplane, you should first project your data onto that unit vector. Consider each datapoint to be a vector on its own, so you project vectors on a vector. Now, the question is, how do you find the median projection (hint: think of scalar projections and find the median value?)?
Hint 3: Next task is to use this found median point and initial random vector to define the dividing hyperplane. Hyperplane can be defined using its normal vector and a point on the hyperplane (google :), ps hyperplane will be 1 dimension less than your initial data). As the point, you can use the median point which you found previously. What could the normal vector be?
X Y Z D 73.64 10.36 97.37 53.80 34.17 94.43 33.35 73.01 32.08 97.13 25.58 43.36 51.45 89.81 38.43 48.67 15.24 95.85 97.98 81.92 44.86 12.32 29.15 81.43 34.50 28.76 48.76 66.48 67.45 22.75 53.81 40.16 7.510 90.28 81.90 45.47 77.12 26.87 22.04 81.32 71.35 66.32 86.06 35.05 23.59 38.70 26.60 47.57 17.07 44.94 41.63 72.22 12.07 61.15 20.94 37.69 48.47 37.50 25.41 82.65 92.62 66.00 54.00 65.94 79.05 60.91 1.420 39.06 1.540 83.79 92.31 63.50 20.09 65.39 75.14 69.09
EX6. (bonus 2p) Oct-tree is a generalisation of quad-tree from 2D into 3D. Now generalise it further into 4D and show how the data from the previous exercise would be distributed. Note that to identify a particular (largest bounding) cell a binary vector - 0/1 along each dimension is sufficient. Assign to each point the label of a bounding box with such 4-bit numbering. If a cell has more than one value, split it further. For the first data point, the outermost bounding box is 1101 and for the second point, it is 0010. If some cell contains multiple values and needs to be split further, indicate that particular next-level split as well. How would you search for a new query point listing all points within a radius 10 from it?
EX7. (bonus 2p) Range searching in high-dimensional points (k - d points) find its use in computational geometry. For example, 5 attributes of a structured data can be age, gender, yearly income, years of work experience, and number of diplomas, viewed as a 5-D point (34, 1, 40000, 10, 2).
- Show (or prove) how to efficiently implement a tree-based data structure to perform range searches over this data, e.g., for a query like: What are the individuals that are older than 30 and earn between 23k and 50k with more than 10 years of experience?
- Can you come up with your own real-world idea of use case for a similar situation, applied to a collection of structured elements? You can create your own high-dimensional (maximum 10 dimension) synthetic data (using some distribution of your choice). Try to adapt your variation of these trees to efficiently answer a range of search queries. Give some examples and try to measure the performance (variance etc).