HW11. Text algorithms, NFA and projects (21st of Nov, 23.59)
EX1: Write a very simple "grep" all by yourself - for comparing naive search and one other method of your choice from KMP, BMH, BMHHS algorithms. Use for example the same dictionary of English words as was used for hash function collision counting (https://raw.githubusercontent.com/dwyl/english-words/master/words.txt).
EX2: Now, in comparison, use the grep (https://en.wikipedia.org/wiki/Grep) and fgrep (or grep -F) for multiple patterns. State the speed - how many MB/s and how many times faster are these than your code? What is the speed difference matching one (long, short) pattern vs many patterns?
You can calculate MB/s by doing the following. Check what is the filesize in MB. Calculate the time taken to process every line in the file (try to match each line to your pattern). MB/s is then Filesize divided by seconds taken to process it.
EX3: Given the following
(ab|(aa|b)(ba)*(bb|a))*
Construct an NFA (Nondeterministic-Finite Automaton) using Thompson construction algorithm.
Minimize the constructed NFA.
Provide example patterns which are accepted by this automaton and some that are not.
EX4 (Bonus 3p): A finite state transducer (FST) is a type of deterministic finite automaton whose output is a string and not just accept or reject. The following are state diagrams of finite state transducers T1 and T2.
Each transition of an FST is labeled with two symbols, one designating the input symbol for that transition and the other designating the output symbol. The two symbols are written with a slash, /, separating them. In T1, the transition from q1 to q2 has input symbol 2 and output symbol 1. Some transitionsmay have multiple input–output pairs, such as the transition in T1 from q1 to itself. When an FST computes on an input string w, it takes the input symbols w1 · · ·wn one by one and, starting at the start state, follows the transitions by matching the input labels with the sequence of symbols w1 · · ·wn = w. Every time it goes along a transition, it outputs the corresponding output symbol. For example, on input 2212011, machine T1 enters the sequence of states q1, q2, q2, q2, q2, q1, q1, q1 and produces output 1111000. On input abbb, T2 outputs 1011. Give the sequence of states entered and the output produced in each of the following parts:
- T1 on input 011
- T1 on input 211
- T2 on input b
- T2 on input bbab
EX5 (Bonus 3p): Use the finite state transducer (FST) from EX4. Give the state diagram of an FST with the following behavior. Its input and output alphabets are {0,1}. Its output string is identical to the input string on the even positions but inverted on the odd positions. For example, on input 0000111 it should output 1010010.
EX6 & EX7: The course ends with the project to be completed usually by a small team of 2-3 people. Your task is to come up with one project proposal all by yourself (this is an individual task, you will have time to share) - taking ideas from project ideas file, from previous homeworks, or your imagination + google. E.g. you may attempt to extend some of the past exercises. Note - this is a proposal that you may use to attract other students to join; and this is at the same time a proposal that you do not need to start executing necessarily. So. it's more a planning exercise, not execution exercise.
Your project proposal should have:
- Title
- 2-5 sentences of a short description
- Briefly described the motivation and the main challenge of this project
- Division of tasks, the estimated number of work hours per task, and deadline (aim at our poster session!)
- Allocation of the 2-3 people to tasks and hours.
- Recommended: create a GANTT chart to cover the previous two points.
- Description of the envisioned end results that would go to the project report/poster.
We will discuss these in the practice session.