HW12. Suffix trees / arrays, BWT Transform, Feedback (2nd December, 23.59)
Just state that you've completed peer reviews for EX1 and EX2. No need for anything extra.
EX1 (1p) . Peer review the first article assigned to you.
EX2 (1p) . Peer review the second article assigned to you.
EX3. (2p) Repeated String Match: Given two strings a and b, return the minimum number of times you should repeat string a so that string b is a substring of it. If it is impossible for b to be a substring of a after repeating it, return -1.
Note: string "abc" repeated 0 times is "", repeated 1 time is "abc" and repeated 2 times is "abcabc".
Hint: The key to solve this problem is to visualize that S = A+A+.... should be at least B.length and not greater than (A.length + B.length) in order to have B as substring. Now use Boyer–Moore–Horspool to design an algorithm to keep searching for the substring until reach the constraint (length < [b.length + a.length]).
Run your code on the following Test Cases and show your output
Test Case 1:
Input: a = "abcd", b = "cdabcdab"
Output: 3
Test Case 2:
Input: a = "a", b = "aa"
Output: 2
You can use the pseudocode
repeatedStringMatch = function(string a, string b ) { origin = a; initialize times = 1; //Check until reach max length return a.indexOf(b) >= 0 ? times : -1;
EX4. ( 1p)
- Perform Burrows-Wheeler Transform on the following string: CGGTCGCT$. Make sure to draw the matrix. Demonstrate how you can reconstruct the original string from BWT. Where and why is BWT used?
This video might help you get started:
- Given the string T = acracca$ construct a suffix tree using the following algorithm:
Simple-Suffix-Tree-Algorithm(T) Create the root node, with empty string for i ← 1 to n do Traverse current tree from the root Match symbols in the edge label one-by-one with symbols in the current suffix, Ti if a mismatch occurs then Split the edge at the position of mismatch to create a new node, if need be Insert suffix Ti into the suffix tree at the position of mismatch end if end for
EX5. (2p): Number of Ways to Separate Numbers: You wrote down many positive integers in a string called num. However, you realized that you forgot to add commas to seperate the different numbers. You remember that the list of integers was non-decreasing and that no integer had leading zeros.
Return the number of possible lists of integers that you could have written down to get the string num. Since the answer may be large, return it modulo 10 9 + 7
Run your code on the following Test Cases and show your output
Test Case 1:
Input: num = "327"
Output: 2
Test Case 2:
Input: num = "094"
Output: 0
You can use the pseudocode to define the count function
countWays(i, threshold) -> int if i == N: return 1 # 1 way to parse empty sequence N: if i + len(threshold) > N: return 0 # not enough digits if num[i] == '0': return 0 # no leading zeros and must be ways = 0 # for parsing len(threshold) digits from i, the nding number # might be lower than the prior digits L = len(threshold) if num[i:i+L] >= threshold: # take i:i+L and continue increment ways += countWays(i+L, num[i:i+L]) # any longer number will work too for L2 in range(L+1, N-i+1): # take next L2 digits and continue, each is a t threshold later on ways += countWays(i+L2, num[i:i+L2]) ways %= BIG return ways
EX6. (1p) Edit Distance Many word processors and keyword search engines have a spelling inbuilt correction feature. If we type in a misspelled word x, the word processor or search engine can suggest a correction y. The correction y should be a word that is close to x. One way to measure the similarity in spelling between two text strings is by “edit distance.”
The edit distance d(x, y) of two strings of text, x[1 . . m] and y[1 . . n], is defined to be the minimum possible cost of a sequence of “transformation operations” (defined below) that transforms string x[1 . . m] into string y[1 . . n]. To define the effect of the transformation operations, we use an auxiliary string z[1 . . s] that holds the intermediate results. At the beginning of the transformation sequence, s = m and z[1 . . s] = x[1 . . m] (i.e., we start with string x[1 . . m]). At the end of the transformation sequence, we should have s = n and z[1 . . s] = y[1 . . n]. Throughout the tranformation, we maintain the current length s of
string z, as well as a cursor position i, i.e., an index into string z. The invariant 1 ≤ i ≤ s + 1 holds at all times during the transformation.
Prove, for any two strings x and ywith edit distance d(x, y), there exists a sequence S of transformation operations that transforms x to y with cost d(x, y) where S does not contain any “left” operations.
EX7. (1p) Provide feedback for the course and homework topics so far
- which topics were most useful?
- what needs to be covered better in the course?
- are there some topics that would need more practical implementation assignments?
- are there some topics that got too much attention (e.g. too boring or otherwise already well known)