HW 04 16.10 Edit distance
1. Implement an edit distance calculation method (script) for accounting for the “trasnposal” of characters at cost 1. Test and measure the speed of your implementation (how many words of similar length could be compared in 1 minute?).
2. Argue about how often the transposals would be used in texts of alphabet size 2. In the lecture there was a claim/question whether they would be used a lot.
3. Experimental validation: Create example sequences in alphabet of size 2 and 3, and count the nr of times the transposal was actually used in the minimizing paths. Can you quantify if it is used more often in the case of alphabet of size 2 vs larger alphabets? Calculate the nr of transposal as the ratio of all operations on minimizing paths. You should also be able to test the nr of deletions vs insertions when you have sequences of different lengths.
4. Implement extensions to edit distance operations so that repetitions of the characters does not affect the final score, and that you could plug in a table for varying the cost for possible pairs of most likely substitutions. (e.g. c1aaaaalis, c1allis, c1aaalliis) would all be detected at the very small distance from cialis.
5. Use the "dictionary" search from https://biit-dev.cs.ut.ee/~orasmaa/gen_ed_test/?mode=1&dicts=2 Play around with words in the list of words https://biit-dev.cs.ut.ee/~orasmaa/gen_ed_test/viewFile.cgi?viewDict=2 - see which words could not yet be found as you would expect, and see which rules should be added to the list. You can also switch the word file. Unfortunately the GUI is in Estonian at the moment :-( - team up with someone who can assist you :-).
6. (3p) Fetch the "kohanimed.txt" file and find there the
- most unique names (nothing is very similar),
- most common names (nr. of similar names is the largest).
(you can use tools like agrep, or others).