HW6. Heaps and Succinct Data Structures (14th of Oct, 23.59)
EX1. & EX2. (1P + 1P) Experiment with k-ary (d-ary) heaps on random integers from 0 to 999 999 999 (precision 1B, i.e. 1 000 000 000). Experiment with k = 2..12 (for k = 2, use the same implementation starting from index 0, not the “standard” binary heap starting from index 1).
Build a heap (heapify) from unsorted random data using the linear-time method. Report the heapify times for 1 000 and 1 000 000 values, and how much slower it gets with a 1000-fold increase in data.
Compare the linear-time heapify against simply inserting the values one by one. Note that this can be done in place within the same array, by taking the next element (insertion) and bubbling it up. Report the same time measurements. State which method is actually faster, and by how much (on 1M values).
Apply a random decrease-value operation to a randomly selected element of each heap (random position), decreasing it by a random integer between 1 and 100. Report the times and how much slower the operation becomes with a 1000-fold increase in data.
Assess which k demonstrates the best performance in each measurement.
After identifying the best k, report the values of the root and its children (the top two layers) on the same data, i.e. after heapify, after one-by-one insertions, and after the random-decrease operations.
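The three operations above (linear-time heapify, one-by-one insertion with bubble-up, and decrease-value) can be sketched roughly as follows, assuming a 0-indexed d-ary min-heap (decrease-value then restores order by bubbling up); the class and method names are illustrative, not prescribed:

```python
import random

class DaryMinHeap:
    """0-indexed d-ary min-heap: children of node i are d*i + 1 .. d*i + d."""

    def __init__(self, d=2):
        self.d = d
        self.a = []

    def _bubble_up(self, i):
        a, d = self.a, self.d
        while i > 0 and a[i] < a[(i - 1) // d]:
            parent = (i - 1) // d
            a[i], a[parent] = a[parent], a[i]
            i = parent

    def _sift_down(self, i):
        a, d, n = self.a, self.d, len(self.a)
        while True:
            first = d * i + 1
            if first >= n:
                return
            # index of the smallest among the (at most d) children of i
            smallest = min(range(first, min(first + d, n)), key=a.__getitem__)
            if a[i] <= a[smallest]:
                return
            a[i], a[smallest] = a[smallest], a[i]
            i = smallest

    def heapify(self, values):
        """Linear-time bottom-up build (Floyd's method)."""
        self.a = list(values)
        for i in range((len(self.a) - 2) // self.d, -1, -1):
            self._sift_down(i)

    def insert(self, x):
        """One-by-one build: append the next element, then bubble up (in place)."""
        self.a.append(x)
        self._bubble_up(len(self.a) - 1)

    def decrease_value(self, i, delta):
        """Decrease the element at position i and restore the heap property."""
        self.a[i] -= delta
        self._bubble_up(i)

# example: build a 4-ary heap on random values in the required range
h = DaryMinHeap(d=4)
h.heapify([random.randrange(1_000_000_000) for _ in range(10_000)])
```

Timing both build methods is then just a matter of calling `heapify(data)` versus a loop of `insert(x)` on the same data and measuring each.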
EX3. (1P) Analyze a continuous stream of integers, continuously identifying the k smallest elements in the stream. At the end of processing the stream, report the k-th smallest value in the data. Perform the experiments and report the identified value and the time it took to run the algorithm. Run the experiments for:
- Two sizes of data: 1 000 000 values and 100 000 000 values.
- Six types of data: a) positive random integers and b) Fibonacci numbers, both limited in precision to modulo 1000, 1 000 000, and 1 000 000 000.
- Analyze such streams for k = 100, 10 000, and 1 000 000.
Hint: the Fibonacci sequence (mod 1000) begins 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 597, ...
Report the results of all experiments in a single table with 5 columns: the identified value and the processing time for both the 1M and the 100M stream, and in the final column the ratio of the two times (ca 100?). In total there will be 18 rows: two types of data (random and Fibonacci) times three precisions, each for three values of k (2*3*3 = 18).
Draw a conclusion on how the identified values and measured times depend on the stream length, data type, precision, and the value of k.
Hint: use a max-heap to store the k smallest values seen so far. Implement the data structure explicitly. A binary or k-ary heap is recommended as the most straightforward approach.
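A minimal reference sketch of the streaming idea, using Python's heapq with negated values to behave as a max-heap (the assignment asks for an explicit heap implementation, so treat this only as a baseline for checking your results); `fib_mod_stream` is an illustrative generator name, not part of the assignment:

```python
import heapq

def kth_smallest_in_stream(stream, k):
    """Keep the k smallest values seen so far in a max-heap of size k
    (heapq is a min-heap, so values are stored negated).
    Returns the k-th smallest value at the end of the stream."""
    heap = []                        # heap[0] is -(current k-th smallest)
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:           # beats the current k-th smallest
            heapq.heapreplace(heap, -x)
    return -heap[0]

def fib_mod_stream(n, mod):
    """Illustrative generator: n Fibonacci numbers modulo `mod`."""
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, (a + b) % mod
```

For example, `kth_smallest_in_stream(fib_mod_stream(1_000_000, 1000), 100)` processes one of the required 18 configurations; the random-integer streams can be generated analogously.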
EX4. (1P) Succinct tree representation (heap-like). Let’s create a succinct heap-like representation of the BST built for the Collatz sequence starting from 17, inserted “as is” into a BST (no balancing); see HW4, EX1. Secondly, let’s traverse that tree based on the succinct representation only.
Create the heap-like representation of the tree: the bit-vector (feel free to use a list of bytes/integers instead of bits for this learning exercise) and the list of tree node values. You may need to create new leaf nodes (labeled 0) where leaves are not present initially.
Using the two linear lists, a) the succinct bit-vector representation and b) the respective values stored in a list/array, perform a breadth-first traversal of the tree. Output the tree node values; where a value does not exist, output “L” (leaf).
Note that you will need to implement simple rank/select functions (select: at which index m is the k-th 1; rank: how many 1-s there are up to position m). As the example data is so small, let’s not worry about the most optimal storage of the bit-vector or the complexity of these helper functions (simple linear counting is sufficient). The goal is simply to get it running.
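One way these pieces can fit together, assuming the level-order (heap-like) encoding where bit i is 1 for a real node and 0 for an added leaf, and `values` holds the real nodes' values in level order; the example tree here is a tiny stand-in (root 17 with children 13 and 52), not the actual Collatz BST, and all function names are illustrative:

```python
def rank1(bits, m):
    """How many 1-s there are in bits[0..m] (simple linear count)."""
    return sum(bits[:m + 1])

def select1(bits, k):
    """At which index the k-th 1 sits (k is 1-based); -1 if absent."""
    ones = 0
    for i, b in enumerate(bits):
        ones += b
        if ones == k:
            return i
    return -1

def bfs_print(bits, values):
    """Breadth-first output using only the two linear lists: position i
    is the (i+1)-th node in level order; if bits[i] == 1 it is the k-th
    real node (k = rank1(bits, i)) and its value is values[k - 1], with
    its children at 1-based positions 2k and 2k + 1."""
    out = []
    for i, b in enumerate(bits):
        out.append(str(values[rank1(bits, i) - 1]) if b else "L")
    return " ".join(out)

bits   = [1, 1, 1, 0, 0, 0, 0]   # stand-in tree: 17 with children 13, 52
values = [17, 13, 52]
print(bfs_print(bits, values))   # 17 13 52 L L L L
```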
EX5. (1P) General tree. Let’s now create a succinct depth-first parenthesization-based representation for a more general tree: the trie of words. Implement the “size of subtree” function using the same principle as in EX4.
First, build the parenthesis-based succinct depth-first representation of the trie and output the parentheses (we know they can also be encoded as 1-s and 0-s, but for the sake of clarity, let’s use parentheses). Below each opening parenthesis, print the letter of the respective node.
Second, perform the traversal and one “size of subtree” operation. Use the smaller data first, with the node “ab” as the query, and only when that works, test it on the larger list of words.
words = ["abab", "abacus", "taco", "tactic", "aba" ]
words = ["action", "activate", "actor", "active", "activity", "actual", "actress", "actionable", "activist", "actuate", "act", "acting", "activator", "actuation", "actively", "actin", "actinium", "actinoid", "actinosphere", "actomyosin"]
The code to construct the trie and traverse it is already available here: https://colab.research.google.com/drive/1GWudTO_V3BsH85ZS0VCA_qjoSpCUASZ2
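A possible sketch of the EX5 machinery, assuming a nested-dict trie and one label letter per node; word-end markers are omitted for brevity (a full trie would mark, e.g., "aba" inside "abab"), and all helper names are illustrative:

```python
def make_trie(words):
    """Nested-dict trie, word-end markers omitted to keep the sketch short."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root

def build_parens(trie, letter=""):
    """Depth-first parenthesization: '(' with the node's letter on entry,
    ')' on exit; letters[k] labels the k-th '(' in parens."""
    parens, letters = ["("], [letter]
    for ch in sorted(trie):
        p, l = build_parens(trie[ch], ch)
        parens += p
        letters += l
    parens.append(")")
    return parens, letters

def subtree_size(parens, i):
    """Nodes in the subtree whose '(' is at index i: scan to the matching
    ')' and halve the span (simple linear scan, as in EX4)."""
    depth = 0
    for j in range(i, len(parens)):
        depth += 1 if parens[j] == "(" else -1
        if depth == 0:
            return (j - i + 1) // 2
    raise ValueError("unbalanced parentheses")

def find_node(parens, letters, path):
    """Index of the '(' of the node reached by following `path`,
    stepping over non-matching siblings via their subtree sizes."""
    i = 0                                    # '(' of the root
    for ch in path:
        j = i + 1                            # first child, if any
        while parens[j] == "(" and letters[parens[:j + 1].count("(") - 1] != ch:
            j += 2 * subtree_size(parens, j)   # skip this sibling's subtree
        if parens[j] != "(":
            raise KeyError(path)
        i = j
    return i

trie = make_trie(["abab", "abacus", "taco", "tactic", "aba"])
parens, letters = build_parens(trie)
print("".join(parens))                                        # 15 nodes, 30 parens
print(subtree_size(parens, find_node(parens, letters, "ab"))) # size of "ab" subtree
```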
EX6. Bonus (2P) Read Poul-Henning Kamp, “You’re Doing It Wrong” (CACM 2010, 53:7), http://queue.acm.org/detail.cfm?id=1814327 . What is the main criticism in the article? What is your opinion about the article?
EX7. Bonus (1P) Use the code from EX5 and experiment on a large dictionary (e.g. on Unix machines there is a list of words; Mac: /usr/share/dict/words , or grab one from https://github.com/dwyl/english-words ). How large would the compact representation of the entire dictionary actually be? (No need to make it compact; just use the code to pre-calculate the expected size.)
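As a rough back-of-the-envelope helper, assuming the EX5-style encoding costs about 2 parenthesis bits plus one byte of letter label per trie node (the function name and the one-byte-per-label assumption are illustrative, not prescribed):

```python
def succinct_size_bytes(n_nodes):
    """Estimated size of the compact representation: 2 bits per node
    for the balanced parentheses plus 8 bits per node for the labels."""
    return (2 * n_nodes + 8 * n_nodes) / 8

# e.g. a trie with 1 000 000 nodes would take roughly 1.25 MB
print(succinct_size_bytes(1_000_000))   # 1250000.0 bytes
```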