Arvutiteaduse instituut
  1. Kursused
  2. 2012/13 sügis
  3. Tekstialgoritmid (MTAT.03.190)
EN
Logi sisse

Tekstialgoritmid 2012/13 sügis

Muuda lehte
Muudatuste ajalugu Üleslaetud failid

TEXT 2012

  • Main
  • Lectures
  • Project
    • Demos
  • Homework
    • Homework upload
    • admin
  • Links
Muuda külgriba

HW 08 20.11 Regular languages, Approximate matching

1. Study the definition of back-references in regular expressions. Explain by some examples and illustrated some powerful uses.

2. Explain set by step the following one-liner that we wrote during the lecture:

vilo@mac:~$ perl -e ' foreach $i (2..20){ print "$i B".( "a"x$i ) . "B\n" } ' | egrep -v 'B(aa+)(\1)+B' | awk '{ print $1 } ' 
2
3
5
7
11
13
17
19
vilo@mac:~$ 

3. Make an explicit version of the "Four Russians" technique for patterns baa (column for all three letters by one "state").

4. Make an explicit version of the "Four Russians" technique for pattern baabaa'. But this time not in one "state" but rather 3 blocks of "states" for 2 characters matched at a time. Make a test on the text 'abbaaabaa'.

5. Propose an idea for a potential project - a question that should have a "quick answer" with <40h of work.

Examples for bioinformatics (<80h):

1) take ENCODE data on transcription factor binding sites and identify how many matches there are in the genome within the histone/nucleosome wrapped regions and in open areas. Are TF binding sites preferred in the "open" regions?

2) Match the TF binding sites against genome with SNP-s. How many TF binding sites would potentially disturb the TF binding due to a common SNP? (either regexp or the position weight matrix case)

6. Bonus (2p) Since back-references may require a lot of bactracking, they can also become very slow. I know that in past people were able to make grep queries that would last many seconds or minutes even on very short input files and line lengths. Can you discover some of such very slow instance of an input and the grep query (may depend also on grep implementation!). Explain the reasons.

  • Arvutiteaduse instituut
  • Loodus- ja täppisteaduste valdkond
  • Tartu Ülikool
Tehniliste probleemide või küsimuste korral kirjuta:

Kursuse sisu ja korralduslike küsimustega pöörduge kursuse korraldajate poole.
Õppematerjalide varalised autoriõigused kuuluvad Tartu Ülikoolile. Õppematerjalide kasutamine on lubatud autoriõiguse seaduses ettenähtud teose vaba kasutamise eesmärkidel ja tingimustel. Õppematerjalide kasutamisel on kasutaja kohustatud viitama õppematerjalide autorile.
Õppematerjalide kasutamine muudel eesmärkidel on lubatud ainult Tartu Ülikooli eelneval kirjalikul nõusolekul.
Tartu Ülikooli arvutiteaduse instituudi kursuste läbiviimist toetavad järgmised programmid:
euroopa sotsiaalfondi logo