HW7. Hashing, Bloom Filter, Succinct Trees (18th of Oct, 23.59)
EX1. Put the following 12 integers [63, 17, 72, 20, 74, 80, 23, 67, 29, 6, 62, 56] into two hash tables.
- With chaining into a hash table of size 9;
- Using an open quadratic probing scheme (use squares as probing strategy, i.e. 1, 4, 9, 16,...), into a hash table of size 17.
Use the following hash function h(x) = (x * 13 + 42) % m (where m is the size of the hash table).
Questions: How many collisions were there in each case? When should one prefer the open addressing strategy over chaining (and vice versa) to resolve hash collisions? What are the tradeoffs?
EX2. Exercise 2 Watch this presentation by Titus Brown (2011): http://pyvideo.org/video/402/pycon-2011--handling-ridiculous-amounts-of-data-w
Explain the concept of Bloom filters. Why did Titus and his colleagues make use of Bloom filters in this case? How should Titus and his colleagues choose the hash functions {hⱼ} so that they do not count collisions and thus making it probabilistic?
Hint: Bloom filter is represented by a bit array and is described by its length m and a number of different hash functions {hⱼ}. It supports only two operations: Adding an element into the set and Testing whether an element is not a member of the set as shown by the code snippet bellow.
Adding element to the Bloom filter |
Input: Element x ∊D for the dataset D ={ x₁,........... xₙ} Input: Bloom filter with k hash functions {hⱼ}, j = 1 .....k for j ⟵1 to k do: l ⟵hⱼ(x) BLOOMFILTER[j]←1
EX3. Grab the text file https://gutenberg.org/files/74/74-0.txt (Tom Sawyer).
Determine the full alphabet of the file. (split to chars, count frequency of each char). Count frequencies of every possible 10-mer (sequences of characters of length 10) in text. Which ten-mers are the most frequent?
Make Bloom filter (1.25x the length of text, i.e. ca 80% full) for 10-mers. Hash every 10-letter sequence into the Bloom filter (use 5 hash tables, with different keys).
Now start from 10-letter string: “Within two” and extract the text from Bloom filter. I.e. use “ithin two” + char for every character of Tom Sawyer alphabet. If it is the only char that is in the Bloom filter, you have got the next 10-letter sequence. In this case most likely “ithin two ”. How far can you determine the text correctly?
EX4. Use the tree from the trie data structure for the following words. D={ word, world, armor, amour, amer, armour, arm, warm } (the same example as used in the HW 4)
Represent this tree using a depth first unary succinct tree representation(both the bitstring and edge contents (linked to the target node) as a separate array).
Make a simulation of an example query: “output all letters getting to leaves”. The last character of each word effectively. (no implementation)
EX5. On the same example as Ex4. represent the same trie with a parenthesis representation and separate node contents array. Illustrate an example query (calculate subtree size) from path prefix “ar”, (no implementation).
EX6. (Bonus 2p) Implement example code for Task 4 (minimal plain python, no fancy libraries, please)
Note that using this small data you do not need to generate any index on top of the bit-string representation of the tree. Feel free to implement the necessary bit operations as simple loops over the bitstrings.
EX7. (Bonus 2p) Implement example code for Task 5 (minimal plain python, no fancy libraries, please).
With “ x & ( 1 << i ) “ you can get the True/False value if the i’th bit (from least significant side) is set or not. You should probably start the tree from the least significant side of the bitstring. Note that the description of the tree would nicely fit into the python "integer" data type (2 bits per node).
Store characters of tree nodes in a separate array or list and read them from there.
Note - bit vector class for python
https://engineering.purdue.edu/kak/dist/BitVector-3.5.0.html
from BitVector import BitVector m = 35 bv = BitVector( size = m ) bv[ 10 ] = 1 print( bv ) bv << 2 print( bv ) bv2 = BitVector( size = m , intVal = 31 ) print( bv2 ) bv2 << 3 print( bv2 ) print( bv | bv2 ) ---> outputs 00000000001000000000000000000000000 00000000100000000000000000000000000 00000000000000000000000000000011111 00000000000000000000000000011111000 00000000100000000000000000011111000