25.09 (16-18) Exact string matching & Multi-pattern matching
1. Use pattern P=TATTAA and text ATTATTAAATATTAATAT. Simulate the BNDM approach: draw the automaton with epsilon-transitions (nondeterminstic automaton on reverse patern) and attempt to simulate it with the bitmap based approach. You can assume alphabet of 2 characters A and T. Then you need two bitvectors (in this case 8 bits) for letters bv[A]=00010011 and bv[T]=00101100.
2. Draw the (deterministic) BOM algorithm automaton for the above pattern. Simulate on text as above.
3. 3. Read the article and identify which extensions of exact search have been proposed there
Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching Communications of the ACM October 1992, Volume 35 Issue 10 [ACM Digital Library:http://doi.acm.org/10.1145/135239.135243] PDF
4. Study the Shift-OR or Shift-AND algorithm for exact matching. Apply it on the same pattern and text as task 1.
5. P = { TATTA , TTAGTATT, AGAT , TTAA, GAATT }
S=ATGTATTAAGTTTAGTATTAGGTACGTCCAT
Simulate Aho-Corasick (AC) algorithm on above text and patterns.
6. (Bonus 2p) Think about the AC algorithm and very large set of patterns and how you would implement such an automaton (goto[state,char]) to be both space-efficient and (very) fast. Argue about such choices. And if possible, study how fgrep has implemented it. Could you do it better?