HW8. Time warping, Hashing, Graphs (29th of Oct, 23.59)
EX1. Implement the dynamic time warping algorithm and compare the following 6 series:
s1: 2 5 5 7 9 1 6 6 8 9 1 6 6 8 9
s2: 1 2 5 6 7 2 6 9 9 9 2 6 7 8
s3: 4 2 1 4 8 8 9 3 2 3 2 2 6 9 2 1 1 6 1
s4: 1 5 7 3 7 2 8 9 9 2 2 3 7 8
s5: 3 2 2 1 1 8 3 2 6 9 1 6 1
s6: 1 4 6 8 8 2 1 2 7 9 2 7 9 9 9 9 9 9
Visualise pairs of sequences by “stretching time” (dynamic time warp) in some way, so that with three pairs all 6 sequences are “covered”. Start from the most similar pair of them all, then the most similar pair where neither sequence is among the first pair, and finally the last (third) pair. Provide visualisations and calculated pairwise distances for the pairs.
EX2. During the discussion on hashing, we did not ask, how to deal when the hash table grows full and/or we do not know in advance the final nr. of elements to be hashed. One answer to this is so-called "Extendible hashing" or also "Dynamic hashing". Please study online materials, e.g https://loonytek.com/2016/05/17/extendible-hashing/ for 30-40 minutes on this topic. In submission highlight the most useful resource(s). To validate that you did EX2, you must also do EX3.
EX3. Now summarise the key points of extendible hashing, providing also one illustration (can be from the internet; properly cited of course). Approximately 0.5-1 pages should do. Provide key ideas as bullet points, for example.
EX4. Hash functions are usually represented as integer -> integer mappings. Yet, keys in dictionaries are usually words, ID-s etc. The purpose of this task is to try to map every English word (e.g. https://raw.githubusercontent.com/dwyl/english-words/master/words.txt) into an integer (before even applying the hash functions). Since there are many more potential words than integers of length 32 bit or 64 bit, one may ask - how many collisions one would generate simply by conversion of words into integers. Try out some simple ideas - like simply adding all character values ord(c)
together and count nr of collisions. Then try to find and implement a better mapping - and report how many collisions there will still be when applied to above very long list of English words. You can measure two things: total nr of colliding integers; and the "most occupied" integer value - for both, the 32 and 64 bit integers.
EX5. Here are many data types that can be represented as graphs. Study online collection at https://snap.stanford.edu/data/ . Select from there 4 different types and sizes of graphs. Consider them as undirected and characterise them by calculating:
- Nr of nodes
- Nr of edges
- Distribution of node degrees
Hints: it could be that some information on the website about nr of nodes and edges is outdated.
EX6. (Bonus 3p) According to the central dogma of molecular biology, DNA is transcribed into RNA molecules, which then are translated into proteins. Pick one RNA sequence e.g. https://rnacentral.org/rna/URS0000735371/9606, which structure is illustrated in a figure. You can use any other RNA sequence that you can find on the same website.
Compare the original RNA sequence to its “reverse self” (Hint: this is just reading the same sequence from the tail, don't confuse with complement or reverse-complement versions https://www.bioinformatics.org/sms/rev_comp.html) and try to identify some of the longest overlaps, e.g. the longest perfect matches between the original sequence and its reverse self. Report their lengths. Now, try allowing one error (insertion/deletion/error). Starting from every start position - try to find the longest “reverse match” to it. Note - it does not need to be perfect nor fast, nor visualised perfectly - just try it out if you could find some of them. Describe your approach and provide visualisations.