HW7. Hashing, Dynamic programming (22th of Oct, 23.59)
EX1. Use the integers as in here. Experiment with the universal hashing scheme as presented in the lecture. Assess the goodness of the universal scheme - hash the given integers to hash tables of sizes 1000, 10,000 , 100,000 and 1,000,000 slots. For each hash table size experiment with the p, a, and b selected randomly. Report (ideally visualize) the number of collisions in total for different table sizes (over many trials with random p, a and b).
EX2. Find some p, m, a and b (for a universal family as described in the lectures) for which the integers given in EX1 are especially non-uniformly distributed (have many collisions in total or have too many collisions for one specific slot). How easy it is to find such bad cases? I.e. how many different hash functions did you have to evaluate? Describe how ignoring rules of selecting p, m, a or b can lead to a bad case (many collisions). Attempt to illustrate how the number of collisions are spread over the large set of random hash functions.
EX3. Watch this presentation by Titus Brown (2011): http://pyvideo.org/video/402/pycon-2011--handling-ridiculous-amounts-of-data-w Explain very briefly the concept of Bloom filters and why Titus and his colleagues use Bloom filters in this case?
EX4. Implement a simple edit distance calculation procedure (you can also use the spreadsheet if you want, especially for tracing of operations manually). Calculate edit distance between 'abracadabra' and 'abarcababbra'. Reconstruct the alignment or set of operations to get the second sequence from the first.
EX5. Demonstrate the knapsack problem on the following data (choosing the weight constraint as 6):
Name Cost Weight O1 2 3 O2 7 2 O3 5 1 O4 8 4
Explain what is a knapsack problem. Solve it using the dynamic programming approach. Draw the table and explain how it is filled. Show how to get the selected elements. Make an example of a practical problem that can be solved with this approach.
Hint: We did not talk about the knapsack problem in the lecture but there is a lot of material available for this problem (google knapsack problem). This video might help you get started:
EX6. (Bonus 3p) Hidden messages using Bloom Filters. Experiment with the De Bruijn graph approach discussed by Titus Brown in the lecture (see EX3) for "storing text" using Bloom filters... Grab a dictionary like this: http://www.gutenberg.org/cache/epub/972/pg972.txt Insert word definitions as individual "messages" -
EXILE, n. One who serves his country by residing abroad, yet is not an ambassador. Store: "EXILE," , "XILE, " , "ILE, n", "LE, n.", ... (You can use padding in front, e.g. '..........EXILE,' to create keys of length 16 (10+5+1), then '.........EXILE, ' , etc ) ,
Hints:
create a hash function for text - converting it first into a large integer use one large hash table bitmap (Bloom filter), and apply several different hash functions on the same array To "retrieve", query the Bloom filter with "EXILE," and re-create the entry by trying out all possible characters... (do you improve results by allowing English characters and punctuation only?) Experiment with changing the size of filter and what proportion of bits are set to 1 Experiment with the nr of 3-4-...X different hash functions and different keyword lengths. (you can assume all possible characters from byte code 0 to 255)
Measure how many word entries would be retrieved correctly and how many would "get broken" under different parameter settings.