HW 01 18.09 (16-18) Exact string matching
1. Implement the brute force method and make actual speed comparisons by varying
a) Pattern length (e.g. 2, 4, 8, 16, 32) b) alphabet size 4, 20,
2. Implement the KMP algorithm and measure similarly the speed, compare to brute force.
3. Implement the Boyer-Moore-Horspool algorithm and measure similarly the speed.
4. Draw the KMP automaton and create the Fail[] table for string ABABACABABAA according to Init_Fail() function. Simulate the algorithm during the creation of the table, and then matching it on some text. Can you identify cases where the Fail links may be optimised further? Explain.
5. Study the grep family (grep, ggrep, egrep, fgrep, agrep) tools of Unix. What functionality is being offered? Test the speed of some of those tools on same data as 1.-3.
6. Bonus (2 p) Tune the Hume Sunday version and loop unrolling, comparing against 1.-3. Estimate how many times is "best" for performance in your implementation.
Hints:
Estimate speed in terms of how many MB/s can you search for with that particular algorithm.
You can use any programming language. But make sure to measure speed in good enough detail! Use gnuplot or Excel or R to visualise the measured speeds graphically.
Data sources:
- DNA sequences: Human chromosomes (mostly alphabet size 4, ignoring N, spacers and annotations)
- Protein sequences, e.g. NCBI (e.g. Swissprot DB ftp://ftp.ncbi.nih.gov/blast/db/FASTA/swissprot.gz )
- Natural language: Project Gutenberg. There you can find enough books to download. (suggest to use scripting and wget). Make sure to get TXT versions. :-)
- Generate random text according to you own needs. Then you can vary the alphabet size, exact length, character distributions (even or biases), etc. Also, you can generate the possible "worst case scenario" data sets easily.
Note: you can always implant your exact pattern that you would like to find in those text files!