HW7. Hashing and Bloom Filters (due 21st of Oct, 23:59)
EX1. (1P) Fibonacci number cycles (Pisano periods) modulo m. (theory)
We have in the past generated Fibonacci numbers modulo m. When m is fixed, e.g. m = 100, the Fibonacci numbers F(n) = F(n-1) + F(n-2) are calculated modulo m.
Prove that for every fixed m the sequence inevitably becomes cyclic, i.e. after some F(n) it starts repeating itself indefinitely (e.g. by the pigeonhole principle).
Additionally, calculate an upper bound on n after which you know the cycle must have started. (The bound you calculate does not need to be optimal; just give some upper bound such that the real number is guaranteed to be smaller.)
With the same argument, calculate an upper bound on the length of the cycle.
Hint: you can also check Wikipedia, and the entry A001175 in the OEIS (Online Encyclopedia of Integer Sequences). You can look up by how much your bound is “too pessimistic” compared to the actual length of the cycle.
EX2. (1P) Finding the Pisano period (cycle) in Fibonacci numbers mod m by brute force.
For every m, determine by brute-force simulation when the cycle starts, i.e. the moment when the Fibonacci sequence modulo m starts to repeat. You should be able to choose the size of the hash table based on the arguments from EX1 (or something more compact).
Report m, the n of the F(n) from which onward the cycle starts, and the length of the cycle. Since you can do this for many m, visualize the obtained numbers in a scatterplot, with m on the X axis and n and length(cycle) on the Y axis.
Also make a smaller table, a sparse shortlist of the results, reporting only the rows where m is a power of 2, for example. In the table, also provide the actual numbers that mark the beginning of the cycle: (n-1, n) and F(n-1), F(n).
How large values of m can you crack within a reasonable time limit, e.g. limiting your code to one hour of run time (leave your code running in the background and enjoy a coffee instead)?
Hint: the easiest way to achieve this might be to hash the two consecutive numbers together, detect when the cycle starts (a collision in the hash table?), and then run one full cycle to measure its length.
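For illustration, a minimal Python sketch of this pair-hashing idea (not a prescribed solution), using Python's built-in dict as the hash table; the function name and the values of m printed at the end are just illustrative choices:

```python
# Minimal sketch: detect the Pisano period for one m by remembering every
# consecutive pair (F(n-1) mod m, F(n) mod m) in a dict (hash table).
def pisano(m):
    seen = {}                        # pair -> index n at which it was first observed
    a, b = 0, 1                      # F(0), F(1) modulo m
    n = 0
    while (a, b) not in seen:
        seen[(a, b)] = n
        a, b = b, (a + b) % m        # one Fibonacci step modulo m
        n += 1
    start = seen[(a, b)]             # index where the repeated pair was first seen
    return start, n - start          # cycle start and cycle length

if __name__ == "__main__":
    for m in (2, 10, 100, 2**10):
        print(m, pisano(m))
```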
EX3. (1P) Infinite stream and counting word frequencies using a hash.
See the code that produces words with variable frequency.
https://colab.research.google.com/drive/1L7UA-sOZwrS2VSldhAX-ZcstUfsna28o
Count the occurrences of every word for the first 1M words in the stream.
- (0.2p) - implement this with the (Python) built-in hash table (dict). At the end, print the 50 most frequent words and their frequencies.
- (0.4p) - implement the same functionality but with an explicit hash function and hash table of your own, replicating the same result as the built-in method (a sketch of one possible layout is given at the end of this exercise).
- (0.4p) - replicate similar functionality, but with only a single 1-byte counter per word (an approximate log-counter).
A byte can count only up to 255 occurrences, while we want to count into the tens or hundreds of thousands of values. A 1-byte log-counter is therefore incremented with exponentially decreasing probability (a bit like the towering in a skip list).
If you use powers of 2, then increase the counter from 4 (2**4 = 16) to 5 (2**5 = 32) only if a random coin toss falls below probability 1/32. With exponent 2, the largest value of such a log-counter would correspond to approximately 2**255, which is astronomically large. Let's use a smaller base instead, for example powers of 1.1. Increment such a log-counter by one only probabilistically, when the random value falls below this ever-decreasing probability.
For base 1.1, log_1.1(10,000) ≈ 96.63, hence increment the counter from 96 to 97 only if the random value is below 1/10,000 (p <= 0.0001).
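For illustration, one way such a 1-byte probabilistic log-counter could look in Python. This is a sketch only: the base 1.1 follows the text above, while the decoding formula is the standard Morris-counter estimate for this increment rule, which is an added assumption rather than part of the task statement:

```python
import random

BASE = 1.1   # log base: a counter value c is incremented with probability BASE**(-c)

def log_increment(c):
    """Increment a 1-byte log-counter c probabilistically; saturates at 255."""
    if c < 255 and random.random() < BASE ** (-c):
        return c + 1
    return c

def log_estimate(c):
    """Decode a counter value back to an approximate count
    (standard Morris-counter estimate for this increment rule)."""
    return (BASE ** c - 1) / (BASE - 1)

# toy check: feed 10_000 occurrences of one word into a single counter
c = 0
for _ in range(10_000):
    c = log_increment(c)
print(c, round(log_estimate(c)))
```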
Run all three methods in parallel and validate that the first two methods indeed give the same result, and that the third approach, with only a single-byte counter, gives approximately the same result. How precise is the log-counter?
Report the running times of the three approaches (for 1M values from the stream).
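For the explicit hash-table bullet above, a minimal sketch of what such a counter could look like (separate chaining with a simple polynomial string hash; the table size, the constant 131 and the class name are arbitrary illustrative choices, not a required interface):

```python
class ChainedCounter:
    """Word counter on top of an explicit hash table (separate chaining)."""
    def __init__(self, n_buckets=1 << 16):
        self.n = n_buckets
        self.buckets = [[] for _ in range(self.n)]

    def _hash(self, word):
        h = 0
        for ch in word:                      # simple polynomial rolling hash
            h = (h * 131 + ord(ch)) % self.n
        return h

    def add(self, word):
        bucket = self.buckets[self._hash(word)]
        for i, (w, c) in enumerate(bucket):
            if w == word:
                bucket[i] = (w, c + 1)
                return
        bucket.append((word, 1))

    def items(self):
        for bucket in self.buckets:
            yield from bucket
```

Sorting the output of items() by count in decreasing order should then reproduce the same top-50 list as the built-in dict.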
EX4. (1P) Create a Bloom filter for the same data stream.
Experiment with an explicit implementation of the Bloom filter approach for the data stream from the previous task.
How large should the Bloom filter be for the stream from the previous task? How does the filter size affect the false positive rate (i.e. false alarms for values that were actually never in the data before)?
Try to find optimal parameters for the Bloom filter, so that the filter is small but achieves 99% correctness when returning the answer “I have seen this value”, i.e. the false positive rate should be below 1% after you have already processed the stream of 1M values, measured on the next 100K elements from the same stream.
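For orientation, the standard sizing formulas for a Bloom filter with n expected keys and target false positive rate p are m = -n·ln(p)/(ln 2)^2 bits and k = (m/n)·ln 2 hash functions. A minimal sketch built around them (the double-hashing trick via MD5 is an arbitrary illustrative choice, not a prescribed design):

```python
import math, hashlib

class BloomFilter:
    """Minimal Bloom filter sketch; sizing uses the standard formulas
    m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions."""
    def __init__(self, n_expected, p_false=0.01):
        self.m = max(8, int(-n_expected * math.log(p_false) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n_expected) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # double hashing: derive k positions from two halves of an MD5 digest
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(key))
```

You would then feed the first 1M stream words into such a filter (with n_expected set to the number of distinct words you expect) and measure the false positive rate on the next 100K elements, as the task asks.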
EX5. (1P) Create an “approximately counting” Bloom Filter for the same data
A normal Bloom filter uses a single bit for each hash function. If, for a new key, any of the addressed bits is 0, that key has certainly never been witnessed before. If all of them are 1, it is possible that the key has been seen, or that a collision has occurred. But this does not count the number of occurrences of a key.
Let's combine the two principles - the log-counting hash table and the Bloom filter. The task is to count approximately how frequent a certain keyword has been in the past. For that, instead of a single bit, use a single byte as a counter in the Bloom filter, as in EX3.
The “approximately counting” implementation of the Bloom filter should increment the hashed locations probabilistically. The smallest of the addressed values is the count observed so far (the count cannot be larger than that). Add to it only probabilistically; if the other addressed values are already larger, keep those counts (hence the smallest value corresponds to a possible overestimate of the frequency).
The goal is two-fold:
- Use little space and no storage of the key itself for counting element frequencies
- Detect the frequent elements in the stream during the counting.
Is the Bloom filter counter sufficiently similar to the log-counter from EX3?
Hint: to debug the code you may still need to keep all keys somewhere, together with their counts, so that you can query the Bloom filter for their frequencies and compare.
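A sketch of how the two ideas could be combined (illustrative only; it reuses the base-1.1 log-counter rule from EX3 and the same double-hashing trick as in the EX4 sketch, and the class name and default sizes are assumptions):

```python
import random, hashlib

BASE = 1.1  # same illustrative log-counter base as in the EX3 sketch

class CountingBloom:
    """Approximately counting Bloom filter: k byte-sized log-counters per key,
    conservative (minimum-based) probabilistic update."""
    def __init__(self, m_counters=1 << 20, k=4):
        self.m, self.k = m_counters, k
        self.counters = bytearray(self.m)        # one byte per slot

    def _positions(self, key):
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        pos = self._positions(key)
        c_min = min(self.counters[p] for p in pos)        # current estimate for this key
        if c_min < 255 and random.random() < BASE ** (-c_min):
            for p in pos:
                if self.counters[p] == c_min:             # only bump the minimal slots
                    self.counters[p] += 1

    def estimate(self, key):
        c_min = min(self.counters[p] for p in self._positions(key))
        return (BASE ** c_min - 1) / (BASE - 1)           # Morris-style decode of the minimum
```

Bumping only the slots that equal the current minimum is the conservative update described above; the already larger slots are left untouched.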
EX6. (Bonus 2p)
Modify the Bloom filter counter for the stream so that it counts frequencies of observed elements, but only captures sudden increases of word frequencies within the recent streaming window.
For that, the log-counters can be used. But in addition to incrementing counters, also implement decrementing of values (possibly down to 0) to kick out old and infrequent elements.
Decrease the log-counters in a similar probabilistic manner as they are increased. Hence frequent elements would still remain frequent, but smaller values could be “kicked out” of the counter.
Can you detect from the stream the elements that are the most frequent at any given time?
(The goal is not to use explicit storage of the keys and counts, but to observe from this counting Bloom filter alone that a key is present and appears to be extremely frequent over the last time window.)
Conduct an experiment on how small the counting filter can be while still giving the answers with some approximation. Note that you have the full data, so you can check that the Bloom filter achieves somewhat similar results, but with a hopefully much smaller memory footprint.
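One illustrative (not prescribed) way to realize the probabilistic decrement, building on the counting-Bloom sketch above: every W stream elements (with W a window size you choose in the experiment), sweep over the counters and decrement each non-zero counter with a probability that mirrors the increment rule:

```python
import random

def decay(counters, base=1.1):
    """One probabilistic decay sweep over a bytearray of log-counters:
    a counter at value c is decremented with probability base**(-(c - 1)),
    so small (infrequent) values vanish quickly while large ones persist."""
    for i, c in enumerate(counters):
        if c > 0 and random.random() < base ** (-(c - 1)):
            counters[i] = c - 1
```

With such sweeps, keys that keep arriving stay at high counter values across windows, while rare and old keys decay back towards 0, which is exactly the “kicking out” described above.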