HW12. Text search, Burrows Wheeler, Suffix array, Feedback (8th of December, 23.59)
Well done! Now you are at the very end of the course!
This last week we will study some of the most essential text algorithms! There is some code in here: https://colab.research.google.com/drive/1k8OSanq9MCQPYre83Njnj5V6Pibj7Ky4#scrollTo=gPUF2ap9mn-F
First, we generate strings and patterns, then provide also a suffix array code (alll generated by ChatGPT, so exercise caution).
Attach:text1000.txt Attach:text1mb.txt Attach:patterns.txt
T1: Given two files and a list of patterns, match them against the files.
Match each of patterns using the “naive” method, Knuth-Morris Pratt and (some) Boyer-Moore like algorithm to detect all the occurrences of patterns. Report the nr of occurrences and the nr of comparisons made.
For each algorithm and pattern - count and identify on which line these patterns required the most comparisons. Report the pattern, and the line and the nr of comparisons on that line. Briefly state, what explains the results.
T2: Convert each pattern into a backward oracle matching automaton.
Using that automaton perform a similar matching exercise. Again, count the nr of occurrences (should be the same, right?), and the number of character comparisons/matches needed.
Come up with two patterns - one that is “slowest” to match, and one that is fastest to match. Provide evidence - nr of comparisons or speed.
T3: Evaluate UNIX tools for searching.
Use the same patterns and measure the speed of grep and grep -f (fgrep) for matching a single pattern and many patterns simultaneously. Assess how many MB/s would be searched for each pattern, or all of them in one go. Also, evaluate approximate matching tools like agrep and tre-grep for speed and practical use. Select one that allows you to perform "fuzzy match" using edit distance.
T4. Build a Burrows-Wheeler transformation of the two texts provided.
Add in report the BWT for shorter text (use \n for newline, and wrap appropriately). Also provide a decoder that converts BWT encoded text back into original. Make sure to use the efficient version for decoding (even if coding is “slow”). Report the speeds of coding and decoding.
Examine the repetitiveness of the original text compared to BWT version of the same text. How would you characterise the repetitiveness?
Assess how well compression algorithms (zip, gzip, bzip) would compress the original and BWT transformed versions. Also experiment with various compression parameters of these tools, for speed and compression rate.
T5. Build a suffix array of the provided texts.
Search for same patterns. One code for a small example is in the code file. Assess how many comparisons were needed to search from within the SA index for the texts. Compare that to T1 and T2.
EX6. (1p) Provide feedback for the course and homework topics so far
Please fill in this form: https://forms.gle/MzksD9jufSWAJuHT6
EX7 (1p) Create a project plan for you and your team. This will be a separate PDF file upload form on the courses upload choices.
Include in the project plan:
- Name of the team
- Title of the project
- Team members
- Abstract
- The description of an idea to accomplish
- Timeline with the precision of dates, that should include ca 5 milestones (e.g. problem statement and division of work ready; code ready; experiment plan ready; experiments finished; illustrations planned for the report) and 2 deliverables (code ready for experimenation; final poster in print)
- First ideas what to include in the final report to be presented as a poster
This joint report should be done and shared by the entire team. And all team members can upload it independently as the PDF file, as well (a separate form in uploads)
Some videos
Boyer-Moore-Horspool -
Burrows-Wheeler Transform