Procedures. Text Algorithm Projects. Automaton toolbox: RE -> NFA, determinisation and minimization. Project


1

Procedures
- Select a topic (today)
- Prepare a project goal (hypothesis to test, programs to implement, experiments to run): 1 day
- Prepare a realistic project plan (3 hours)
- Aim at achieving results quickly (1 week = 1 AP)
- Summarize key findings (1 day)
- Prepare a poster (1 day)
- Plan to achieve results by Dec 3
- Poster session and end of the course: Dec 10
- Be ready to sell it with a good poster!

Text Algorithm Projects
- Aim at:
  - experimental measurements
  - developing a novel idea/solution
  - implementing a useful piece of software
- Choose a topic with the potential to be extended into a research question (an MSc thesis, for example)

Project
- One or two persons per project
- If projects are related, collaborate to agree on formats, etc.
- Same topic in a different programming language (C / Java / ...) in rare cases

Automaton toolbox: RE -> NFA, determinisation and minimization
- Automaton construction from regular expressions (Thompson and Glushkov)
- A tool for automaton determinisation and minimization
- Create simple input and output formats, in the spirit of the UNIX philosophy
- Goal: educational
- Python/Perl
- From Liina Kamm
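As a rough illustration of the determinisation step in the automaton toolbox, the classical subset construction can be sketched in Python as below. The NFA encoding (a dict from `(state, symbol)` to a set of states, with `None` standing for an epsilon move) and the names `eps_closure`, `determinise` and `accepts` are assumptions made for this sketch, not a format prescribed by the project.

```python
from collections import deque

EPS = None  # assumed label for epsilon transitions


def eps_closure(states, delta):
    """Return all states reachable from `states` via epsilon moves."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def determinise(start, accepting, delta, alphabet):
    """Subset construction: NFA -> DFA whose states are frozensets of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta, d_accept = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        cur = queue.popleft()
        if cur & accepting:
            d_accept.add(cur)
        for a in alphabet:
            moved = set()
            for s in cur:
                moved |= set(delta.get((s, a), ()))
            if not moved:
                continue  # no transition: the DFA is left partial
            nxt = eps_closure(moved, delta)
            d_delta[(cur, a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return d_start, d_accept, d_delta


def accepts(dfa, text):
    """Run the (partial) DFA on `text`."""
    state, accept, delta = dfa[0], dfa[1], dfa[2]
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accept


# Hand-built NFA for (a|b)*abb, used only as test input.
nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
dfa = determinise(0, {3}, nfa_delta, "ab")
```

Minimization and the regular-expression front end (Thompson or Glushkov) would be separate tools reading and writing the same automaton format.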

2

Automaton toolbox
- Given an input automaton, implement a converter that creates an automaton allowing for 1 or 2 errors

Higher-level automaton/RE
- M: pattern/regexp or PWM, or (M1 M2).{10,15}M3+.{10,15}M4
- Match an automaton where nodes or labels can mean matching an individual pattern exactly or approximately, or a motif
- Calculate individual and total score(s)
- Output individual motif scores and actual distances
- Hint: matching of individual motifs can be treated as external routines (Sally, etc.)
- Ideal to extend into an MSc project

Framework for counting operations
- Create a programming framework that counts elementary operations instead of measuring time
- An extra bonus would be an implementation, for example in C++, that can be compiled for either purpose
- Proposed by Markko Merzin

Searching in a sorted word list
- Abacus Abracadabra, or Abacus 2racadabra 4m
- Search the word list approximately
- Output the line number and content of each line where a match occurs, and the similarity measure

Trie index to a file
- Build a trie of the words and save it to a file
- Search the trie approximately

Tool for the Ekspress puzzle
- Create a tool that helps to create and solve the puzzle that appeared in EE
- Input: a file with words
- We also have Hendrik Nigul's C implementation available
- Generate puzzles, find the shortest path between two words (via intermediate links), etc.
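The approximate word-list search, with its required output of line number, content and similarity measure, could look roughly like the sketch below. It uses plain unit-cost Levenshtein distance as the similarity measure. The function names, the `max_dist` threshold and the sample word list are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, keeping only two DP rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def search_wordlist(query, words, max_dist):
    """Return (line number, word, distance) for every word within max_dist."""
    hits = []
    for line_no, word in enumerate(words, 1):
        d = edit_distance(query, word)
        if d <= max_dist:
            hits.append((line_no, word, d))
    return hits


# Made-up word list; in the project this would be read from a file.
words = ["abacus", "abracadabra", "cadabra", "zebra"]
hits = search_wordlist("abacas", words, 1)
for line_no, word, dist in hits:
    print(line_no, word, dist)
```

This scans every line. A trie index or the sorted order of the list could be used to prune the search, which is what the related project ideas explore.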

3

Computing the universal edit distance, fast
- We have Reina Käärik's implementation for computing (matching with) the universal edit distance
- Can it be made faster?
- The program code is available from Siim Orasmaa

Universal edit distance
- Given word pairs and an estimated distance between them, derive the transformations and weights for a generalized edit distance
- Hint: use the existing transformations to extract the differences and try to replace them with new transformations that have a new cost
- Propose an incremental method (suggestion: generate test data for yourself)
- Hint: suitable for extending into an MSc thesis

Universal searcher
- Create a complete list of transformations with costs, so that words in Russian-language texts can be found using Estonian spelling
- Combine different transliteration rules, add Estonian-specific ones (õ ы etc.)

Universal searcher
- Create a complete list of transformations with costs, so that words in some English text can be found using Estonian spelling
- Generate rules that try to guess spelling errors arising from writing a word the way it sounds: Eindžel > Angel, etc.
- Perhaps do the same for Russian as well?

Estonian words vs. other languages
- Compare Estonian with other languages: find Estonian words that also occur in other languages (exactly and approximately?)
- Use frequent (or long) words
- Data: Project Gutenberg
- Proposed by Toomas Römer

Data structures for DFA matching
- How to implement automaton matching that is optimal speed-wise?
- Table lookup (indexed?), linked list, binary search tree, etc.
- Test using bytes, integers, and UTF-8
- Study existing tools, summarize what has been used
- Implement 4-5 choices (create a test environment)
- Vary the alphabet size, and measure/test
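To make the idea of a generalized edit distance with weighted transformations concrete, here is a minimal dynamic-programming sketch. It is not the existing implementation mentioned on the slides. The rule table, its costs and the function name `generalized_distance` are hypothetical, and the example words are lowercased because the rules are matched case-sensitively.

```python
def generalized_distance(src, dst, rules, default_cost=1.0):
    """Cheapest cost of rewriting src into dst.

    rules maps (from_substring, to_substring) -> cost. Single-character
    insertions, deletions and substitutions cost default_cost.
    """
    inf = float("inf")
    n, m = len(src), len(dst)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = dp[i][j]
            if cur == inf:
                continue
            if i < n and j < m:  # match or substitute one character
                step = 0.0 if src[i] == dst[j] else default_cost
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cur + step)
            if i < n:            # delete src[i]
                dp[i + 1][j] = min(dp[i + 1][j], cur + default_cost)
            if j < m:            # insert dst[j]
                dp[i][j + 1] = min(dp[i][j + 1], cur + default_cost)
            for (a, b), cost in rules.items():  # multi-character rewrites
                if src.startswith(a, i) and dst.startswith(b, j):
                    ni, nj = i + len(a), j + len(b)
                    dp[ni][nj] = min(dp[ni][nj], cur + cost)
    return dp[n][m]


# Hypothetical sound-based rules with made-up costs.
rules = {("ei", "a"): 0.2, ("dž", "g"): 0.2}
with_rules = generalized_distance("eindžel", "angel", rules)
without_rules = generalized_distance("eindžel", "angel", {})
print(with_rules, without_rules)
```

With an empty rule table the function reduces to ordinary Levenshtein distance, which gives a convenient baseline when testing learned rules and weights.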

4

Time Warp
- Apply time warping on:
  - Sound?
  - Word frequencies in news -> time warp for related patterns
- Find temporal profiles of words:
  - Given text (news) with time stamps, find words over-represented in a time window
  - Create word frequency profiles
  - Find other words that have similar profiles

Bit-parallelism
- Project proposal by Dmitri Kostandi: investigate bit-parallel solutions

Motif discovery: PWM optimization
- Take the 98 sequences data for Yeast, and the TGAAA+TTT+ and G.GATGAG.T motifs (approximate)
  _%2B2_W_all.fa&ORGANISM=&FN_SELECT=Yeast_cluster_of_98_all.fa&USER_SEQ=Yeast_ 600_%2B2_W_all.fa&ORDER=none&USER_CLUSTERING=&WHICH=ALL&EXTRACT=MATCHING_MOTIF&PATTERNS=%0D%0A 1%3AG.GATGAG.T&CHARWIDTH=&CHARHEIGHT=&COLORSCHEME=GAUSS&VISUALIZATION=DEF
- Try to create the best PWM: one that is as frequent in the data as possible, yet occurs only 1-2 times per sequence, preferably in the expected place
- Match these on the entire yeast genome

Motif discovery: 2 motif model
- Take the 98 sequences data for Yeast, and the TGAAA+TTT+ and G.GATGAG.T motifs (approximate)
- Try to create a model of two motifs co-occurring in the data. For example, build an automaton recognising them (on either strand)
- Match the automaton on the entire Yeast genome

Compare position weight matrices
- The aim of the project is to find position weight matrices (PWMs) that represent similar motifs
- Given a list of matrices (Transfac PWMs), come up with a distance measure that captures the similarity between binding sites that are represented as PWMs
- Report a clustering of the PWMs based on the distance measure used
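For the PWM-related projects, the basic operation is scoring every window of a sequence against a matrix. The sketch below builds a log-odds PWM from a few aligned sites and scans a sequence with it. The example sites are made-up instances of the G.GATGAG.T pattern, not taken from the yeast data, and the uniform background, the pseudocount and all function names are assumptions.

```python
import math


def pwm_from_sites(sites, pseudocount=1.0, alphabet="ACGT"):
    """Build a log2-odds PWM (uniform background) from equal-length sites."""
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        counts = {ch: pseudocount for ch in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({ch: math.log2(counts[ch] / total / background)
                    for ch in alphabet})
    return pwm


def scan(seq, pwm):
    """Return (position, score) for every full-length window of seq."""
    width = len(pwm)
    return [(i, sum(pwm[k][seq[i + k]] for k in range(width)))
            for i in range(len(seq) - width + 1)]


def best_hit(seq, pwm):
    """Highest-scoring window as (position, score)."""
    return max(scan(seq, pwm), key=lambda hit: hit[1])


# Made-up binding sites matching G.GATGAG.T, for illustration only.
sites = ["GAGATGAGAT", "GCGATGAGCT", "GTGATGAGGT"]
pwm = pwm_from_sites(sites)
seq = "TTTT" + "GCGATGAGCT" + "AAAA"
position, score = best_hit(seq, pwm)
print(position, round(score, 2))
```

Optimizing a PWM as the slide describes would then mean adjusting the matrix and a score threshold so that hits are frequent across the 98 sequences but limited to one or two per sequence. Matching on either strand would additionally need the reverse complement.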

5

Clustering DNA binding domains
- The aim is to provide a full clustering of known transcription factor DNA binding domains using amino acid sequences
- Given a list of amino acid sequences (domains) and a conversion weight matrix (BLOSUM, PAM):
  - align the domains and then cluster them hierarchically using various distance measures (edit distance, Hamming)
  - report the hierarchical clusters

Finding miRNA target sites from exons
- The aim is to find both exact matches and matches with errors of known microRNAs (~22 bp) in known human exons
  - download miRNA sequences from miRBase
  - extract exon sequences from Ensembl or UCSC
  - perform motif matching
  - report miRNA-exon pairs

DNA flexibility motifs
- Given a subset of motifs that enhance DNA bending, come up with a scoring scheme for measuring promoter flexibility
  - match known DNA bending motifs (AAAA, GGGCCC etc.) to known promoters
  - check distances between motifs and take the DNA helix structure into account (10.5 bp/turn)
  - come up with a scoring scheme that takes the motifs and their distances into account for defining the flexibility of promoters
  - report promoters with their respective flexibility scores

Exon similarities on amino acid level
- The aim of this project is to find genes where alternating exons (exon A or exon B in a transcript) code for a similar part of the protein
  - extract exon sequences given the coordinates from the UCSC or Ensembl browser
  - translate them in the right coding frame to amino acid sequences
  - perform sequence comparison using amino acid conversion weight matrices (BLOSUM, PAM)
  - report exon pairs with a similarity score

Natural text
- Given (preprocessed?) text -> build tag clouds for the most over-represented words in one set of texts vs. background
- Hint: use

Tag clouds
- Directories -> all text files in a directory (and its subdirectories) are counted
- Possibly one directory is dedicated as background
- A) word relative frequencies (from all words)
- B) word frequencies (count files)
- A directory can have a specific labels file
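The tag-cloud idea of ranking words by how over-represented they are in one text set against a background can be sketched as a ratio of relative frequencies. The tokenisation, the add-one style smoothing and the names `word_counts` and `overrepresented` are assumptions of this sketch. Reading the text from directories is left out.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def overrepresented(foreground, background, top=10, pseudocount=1.0):
    """Rank foreground words by relative frequency ratio vs. background."""
    fg, bg = word_counts(foreground), word_counts(background)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    vocab = len(set(fg) | set(bg))
    scores = {}
    for word, count in fg.items():
        # Smoothed relative frequencies, so unseen background words
        # do not cause a division by zero.
        fg_freq = (count + pseudocount) / (fg_total + pseudocount * vocab)
        bg_freq = (bg[word] + pseudocount) / (bg_total + pseudocount * vocab)
        scores[word] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy texts standing in for the contents of two directories.
foreground = "automaton automaton automaton the trie"
background = "the the the cat sat on the mat"
ranked = overrepresented(foreground, background)
for word, ratio in ranked:
    print(word, round(ratio, 2))
```

The ratio could drive the font size in the cloud. Variant B on the slide (counting files rather than word occurrences) would replace the token counts with per-file presence counts.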

6

Newsfeed analysis
- Prepare a working prototype of a tool to analyse trends in news/texts
- Across time: measure frequencies, identify hot topics (words), identify words associated with certain topics (e.g. in which context a company/person/party has been mentioned in the news)

Selection procedure
- I created a page in the ATI wiki: Teadusrühmad => BIIT => TA_2008
- For editing: at1w1k1
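A first step towards the newsfeed prototype could be per-date word frequency profiles plus a simple test for "hot" words. In the sketch below a word counts as hot on a date if its relative frequency there is at least `min_ratio` times its mean over all dates. That criterion, the sample headlines and the function names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def frequency_profiles(dated_texts):
    """Map each word to {date: relative frequency of the word on that date}."""
    per_date = defaultdict(Counter)
    for date, text in dated_texts:
        per_date[date].update(text.lower().split())
    profiles = defaultdict(dict)
    for date, counts in per_date.items():
        total = sum(counts.values())
        for word, count in counts.items():
            profiles[word][date] = count / total
    return profiles, sorted(per_date)


def hot_words(profiles, dates, date, min_ratio=2.0):
    """Words whose frequency on `date` is >= min_ratio times their mean."""
    hot = []
    for word, profile in profiles.items():
        mean = sum(profile.values()) / len(dates)  # absent dates count as 0
        if profile.get(date, 0.0) >= min_ratio * mean:
            hot.append(word)
    return sorted(hot)


# Made-up headlines with dates.
news = [
    ("2008-11-01", "bank opens new office"),
    ("2008-11-02", "bank crisis bank crisis deepens"),
    ("2008-11-03", "new office for bank"),
]
profiles, dates = frequency_profiles(news)
print(hot_words(profiles, dates, "2008-11-02"))
```

The same profiles could be compared pairwise, for example with time warping, to find words whose frequencies rise and fall together.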