HW 08 20.11 Regular languages, Approximate matching
1. Study the definition of back-references in regular expressions. Explain by some examples and illustrated some powerful uses.
2. Explain set by step the following one-liner that we wrote during the lecture:
vilo@mac:~$ perl -e ' foreach $i (2..20){ print "$i B".( "a"x$i ) . "B\n" } ' | egrep -v 'B(aa+)(\1)+B' | awk '{ print $1 } ' 2 3 5 7 11 13 17 19 vilo@mac:~$
3. Make an explicit version of the "Four Russians" technique for patterns baa (column for all three letters by one "state").
4. Make an explicit version of the "Four Russians" technique for pattern baabaa'. But this time not in one "state" but rather 3 blocks of "states" for 2 characters matched at a time. Make a test on the text 'abbaaabaa'.
5. Propose an idea for a potential project - a question that should have a "quick answer" with <40h of work.
Examples for bioinformatics (<80h):
1) take ENCODE data on transcription factor binding sites and identify how many matches there are in the genome within the histone/nucleosome wrapped regions and in open areas. Are TF binding sites preferred in the "open" regions?
2) Match the TF binding sites against genome with SNP-s. How many TF binding sites would potentially disturb the TF binding due to a common SNP? (either regexp or the position weight matrix case)
6. Bonus (2p) Since back-references may require a lot of bactracking, they can also become very slow. I know that in past people were able to make grep queries that would last many seconds or minutes even on very short input files and line lengths. Can you discover some of such very slow instance of an input and the grep query (may depend also on grep implementation!). Explain the reasons.