HW7. Hashing, Bloom Filter (17th of Oct, 23.59)
EX1. Use the integers as in here. Experiment with the universal hashing scheme as presented in the lecture. Assess the goodness of the universal scheme - hash the given integers to hash tables of sizes 1000, 10,000 , 100,000 and 1,000,000 slots. For each hash table size experiment with the p, a, and b selected randomly. Report (ideally visualize) the number of collisions in total for different table sizes (over many trials with random p, a and b).
EX2. Find some p, m, a and b (for a universal family as described in the lectures) for which the integers given in EX1 are especially non-uniformly distributed (have many collisions in total or have too many collisions for one specific slot). How easy it is to find such bad cases? I.e. how many different hash functions did you have to evaluate? Describe how ignoring rules of selecting p, m, a or b can lead to a bad case (many collisions). Attempt to illustrate how the number of collisions are spread over the large set of random hash functions.
EX3. So far we did not ask what to do when the hash table grows full and/or we do not know in advance the final nr. of elements to be hashed. One answer to this is so-called "Extendible hashing" or also "Dynamic hashing". Please study online materials, e.g https://loonytek.com/2016/05/17/extendible-hashing/. Summarise the key points of extendible hashing, providing also one illustration (can be from the internet; properly cited of course). Also, report the most useful resource(s).
EX4. Hash functions are usually represented as integer -> integer mappings. Yet, keys in dictionaries are usually words, ID-s etc. The purpose of this task is to try to map every English word (e.g. https://raw.githubusercontent.com/dwyl/english-words/master/words.txt) into an integer (before even applying the hash functions). Since there are many more potential words than integers of length 32 bit or 64 bit, one may ask - how many collisions one would generate simply by conversion of words into integers. Try out some simple ideas - like simply adding all character values ord(c) together and count nr of collisions. Then try to find and implement a better mapping - and report how many collisions there will still be when applied to above very long list of English words. You can measure two things: total nr of colliding integers; and the "most occupied" integer value - for both, the 32 and 64 bit integers.
EX5. Watch this presentation by Titus Brown (2011): http://pyvideo.org/video/402/pycon-2011--handling-ridiculous-amounts-of-data-w Explain very briefly the concept of Bloom filters and why Titus and his colleagues use Bloom filters in this case?
EX6. (Bonus 3p) Hidden messages using Bloom Filters. Experiment with the De Bruijn graph approach discussed by Titus Brown in the lecture (see EX3) for "storing text" using Bloom filters... Grab a dictionary like this: http://www.gutenberg.org/cache/epub/972/pg972.txt Insert word definitions as individual "messages" -
For example for the sentence: EXILE, n. One who serves his country by residing abroad, yet is not an ambassador. Store: "EXILE," , "XILE, " , "ILE, n", "LE, n.", ... (You can use padding in front, e.g. '..........EXILE,' to create keys of length 16 (10+5+1), then '.........EXILE, ' , etc ) ,
Hints:
create a hash function for text - converting it first into a large integer use one large hash table bitmap (Bloom filter), and apply several different hash functions on the same array To "retrieve", query the Bloom filter with "EXILE," and re-create the entry by trying out all possible characters... (do you improve results by allowing English characters and punctuation only?) Experiment with changing the size of filter and what proportion of bits are set to 1 Experiment with the nr of 3-4-...X different hash functions and different keyword lengths. (you can assume all possible characters from byte code 0 to 255)
Measure how many word entries would be retrieved correctly and how many would "get broken" under different parameter settings.
EX7. (Bonus 3p) String A is an anagram of another string B if A is a permutation of the letters in B; for example, (indicatory, dictionary) and (brush, shrub) are pairs of words that are anagrams of each other. In this problem, all strings will be ASCII strings containing only the lowercase English letters a to z.
Given two strings A and B, the anagram substring count of B in A is the number of contiguous substrings of A that are anagrams of B. For example, if A = ’esleastealaslatet’ and B = ’tesla’, then, of the 13 contiguous substrings in A of length |B| = 5, exactly 3 of them are anagrams of B, namely (’least’, ’steal’, ’slate’), so the anagram substring count of B in A is 3.
- Given string A and a positive integer k, describe a data structure that can be built in O(|A|) time, which will then support a single operation: given a different string B with |B| = k, return the anagram substring count of B in A in O(k) time.
- Given string T and an array of n length-k strings S = (S0, . . . , Sn-1) satisfying 0 < k < |T|, describe an O(|T| + nk)-time algorithm to return an array A = (a0, . . . , an-1) for which ai is the anagram substring count of Si in T for all i ∈ {0, . . . , n − 1}.