An introduction to multiple alignments

Size: px
Start display at page:

Download "An introduction to multiple alignments"

Transcription

1 An introduction to multiple alignments original version by Cédric Notredame, updated by Laurent Falquet Overview! Multiple alignments! How-to, Goal, problems, use! Patterns! PROSITE database, syntax, use! PSI-BLAST! BLAST, matrices, use! [ Profiles/HMMs ]

2 Overview! What are multiple alignments?! How can I use my alignments?! How does the computer align the sequences?! The progressive alignment algorithm! What are the difficulties?! Pre-requisite?! How can we compare sequences?! How can we align sequences? Sometimes two sequences are not enough The man with TWO watches NEVER knows the exact time

3 What is a multiple sequence alignment?! What can it do for me?! How can I produce one of these?! How can I use it? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : What is a multiple sequence alignment?! Structural/biochemical criteria! Residues playing a similar role end up in the same column.! Evolution criteria! Residues having the same ancestor end up in the same column. chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. :

4 How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM unknown AKDDRIRYDNEMKSWEEQMAE * :.*. : Extrapolation Less Than 30 % id BUT Conserved where it MATTERS Unkown Sequence Homology? SwissProt

5 How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Extrapolation Prosite Patterns P-K-R-[PA]-x(1)-[ST] How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : L? K>R Extrapolation Prosite Patterns Prosite Profiles A F D E F G H Q I V L W -More Sensitive -More Specific

6 PROSITE profile (see also HMMs) A Substitution Cost For Every Amino Acid, At Every Position How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Phylogeny chite wheat trybr mouse -Evolution -Paralogy/Orthology

7 How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Phylogeny Struc. Prediction Column Constraint " Evolution Constraint " Structure Constraint How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Phylogeny Struc. Prediction PsiPred or PhD For secondary Structure Prediction: 75% Accurate. Threading: is improving but is not yet as good.

8 How can I use a multiple alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Phylogeny Struc. Prediction Caution! Automatic Multiple Sequence Alignment methods are not always perfect

9 The problem! why is it difficult to compute a multiple sequence alignment? Biology What is a good alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * Computation What is the good alignment? The problem! why is it difficult to compute a multiple sequence alignment? CIRCULAR PROBLEM... Good Sequences Good Alignment

10 The problem! Same as pairwise alignment problem! We do NOT know how sequences evolve.! We do NOT understand the relation between structures and sequences.! We would NOT recognize the correct alignment if we had it IN FRONT of our eyes The Charlie Chaplin paradox

11 What do I need to know to make a good multiple alignment?! How do sequences evolve?! How does the computer align the sequences?! How can I choose my sequences?! What is the best program?! How can I use my alignment? An alignment is a story ADKPKRPLSAYMLWLN Insertion Deletion ADKPRRP---LS-YMLWLN ADKPKRPKPRLSAYMLWLN Mutation ADKPKRPLSAYMLWLN Mutations + Selection ADKPKRPLSAYMLWLN ADKPRRPLS-YMLWLN ADKPKRPKPRLSAYMLWLN

12 Homology! Same sequences -> same origin? -> same function? - > same 3D fold? %Sequence Identity Same 3D Fold 30% Twilight Zone Length 100 Residues and mutations! All residues are equal, but some more than others Aliphatic Aromatic L V I F M C P A G G T C S Y H K D N E W Q R Small Hydrophobic Polar Accurate matrices are data driven rather than knowledge driven

13 Substitution matrices Different Flavors: Pam: 250, 350 Blosum: 45, 62 What is the best substitution matrix?! Mutation rates depend on families Family S N Histone Insulin Interleukin I !"Globin Apolipoprot. AI Interferon G Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (0.08 Billion years)! Choosing the right matrix may be tricky! Gonnet250 > BLOSUM62 > PAM250! Depends on the family, the program used and its tuning

14 Insertions and deletions? Indel Cost Affine Gap Penalty Cost=GOP+GEP*L L Cost Cost L L How to align many sequences?! Exact algorithms are computing time consuming! Needlemann & Wunsch! Smith & Waterman 2 Globins =>1 sec

15 How to align many sequences?! Exact algorithms are computing time consuming! Needlemann & Wunsch! Smith & Waterman 3 Globins =>2 mn How to align many sequences?! Exact algorithms are computing time consuming! Needlemann & Wunsch! Smith & Waterman! -> heuristic wished 4 Globins =>5 hours

16 How to align many sequences?! Exact algorithms are computing time consuming! Needlemann & Wunsch! Smith & Waterman! -> heuristic really wished! 5 Globins =>3 weeks How to align many sequences?! Exact algorithms are computing time consuming! Needlemann & Wunsch! Smith & Waterman! -> heuristic required! 6 Globins =>9 years

17 How to align many sequences?! Exact algorithms are computing time consuming! Needlemann & Wunsch! Smith & Waterman! -> heuristic definitely required! 7 Globins =>1000 years Existing methods 1-Carillo and Lipman: -MSA, DCA. -Few Small Closely Related Sequence. -Do Well When They Can Run. 4-Progressive: -ClustalW, Pileup, Multalign - Fast and Sensitive 2-Segment Based: 3-Iterative: -DIALIGN, MACAW. -May Align Too Few Residues -HMMs, HMMER, SAM. -Slow, Sometimes Inacurate -Good Profile Generators 5-Mixtures: -T-Coffee, MAFFT, MUSCLE, ProbCons, Psi-Praline, - Very sensitive

18 Progressive alignment Feng and Dolittle, 1980; Taylor 1981 Dynamic Programming Using A Substitution Matrix Progressive alignment Feng and Dolittle, 1980; Taylor Depends on the CHOICE of the sequences. -Depends on the ORDER of the sequences (Tree). -Depends on the PARAMETERS: Substitution Matrix. Penalties (Gop, Gep). Sequence Weight. Tree making Algorithm.

19 Progressive alignment! Works well when phylogeny is dense! No outlayer sequence! Example: river crossing Selecting sequences from a BLAST output

20 A common mistake!!! Sequences too closely related PRVA_MACFU PRVA_HUMAN PRVA_GERSP PRVA_MOUSE PRVA_RAT PRVA_RABIT SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE :**::*.*******:***:* :****************..::******:*********** PRVA_MACFU PRVA_HUMAN PRVA_GERSP PRVA_MOUSE PRVA_RAT PRVA_RABIT DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES :*** ******.******.**** *:************.:******:** Identical sequences brings no information Multiple sequence alignments thrive on diversity

21 Respect information! PRVA_MACFU PRVA_HUMAN PRVA_GERSP PRVA_MOUSE PRVA_RAT PRVA_RABIT TPCC_MOUSE PRVA_MACFU PRVA_HUMAN PRVA_GERSP PRVA_MOUSE PRVA_RAT PRVA_RABIT TPCC_MOUSE PRVA_MACFU PRVA_HUMAN PRVA_GERSP PRVA_MOUSE PRVA_RAT PRVA_RABIT TPCC_MOUSE SMTDLLN----AEDIKKA SMTDLLN----AEDIKKA SMTDLLS----AEDIKKA SMTDVLS----AEDIKKA SMTDLLS----AEDIKKA AMTELLN----AEDIKKA MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :*..*:::: VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :.. *.*..:*: *: * *. :::..:*:::**:.*:*: :** : -This alignment is not informative about the relation between TPCC MOUSE and the rest of the sequences. -A better spread of the sequences is needed Selecting diverse sequences PRVB_CYPCA PRVB_BOACO PRV1_SALSA PRVB_LATCH PRVB_RANES PRVA_MACFU PRVA_ESOLU PRVB_CYPCA PRVB_BOACO PRV1_SALSA PRVB_LATCH PRVB_RANES PRVA_MACFU PRVA_ESOLU -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE : *:.:..*.:*. * ** *: * : * :* * **:** LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES- LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES- LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES- LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES- LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES- LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE *:... ::.: : *: ***:.**:*. :** :: EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA- EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ- DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA- QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA- EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA :**.*:.*.* *: ** ::.* **** **::** ** -A REASONABLE model now exists. -Going further:remote homologues.

22 Aligning remote homologues PRVA_MACFU PRVA_ESOLU PRVB_CYPCA PRVB_BOACO PRV1_SALSA PRVB_LATCH PRVB_RANES TPCS_RABIT TPCS_PIG TPCC_MOUSE PRVA_MACFU PRVA_ESOLU PRVB_CYPCA PRVB_BOACO PRV1_SALSA PRVB_LATCH PRVB_RANES TPCS_RABIT TPCS_PIG TPCC_MOUSE SMTDLLNA----EDIKKA AKDLLKA----DDIKKA AFAGVLND----ADIAAA AFAGILSD----ADIAAG MACAHLCKE----ADIKTA AVAKLLAA----ADVTAA SITDIVSE----KDIDAA -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :: VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :..:... *: * : * :* :.*:*: :**. Going further PRVA_MACFU PRVB_BOACO PRV1_SALSA TPCS_RABIT TPCS_PIG TPCC_MOUSE TPC_PATYE PRVA_MACFU PRVB_BOACO PRV1_SALSA TPCS_RABIT TPCS_PIG TPCC_MOUSE TPC_PATYE VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI. :... ::. : * :* :.* *. : *. PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA- PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-- PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-- PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ TPCS_PIG Swiss Institute FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ of Bioinf ormatics TPCC_MOUSE Institut Suisse LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE de Bioinf ormatique ::.. :: : ::.* :.** *. :** :: LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-- LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-- LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--- FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ- FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ- LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE- LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA :. :: : :: * :..* :. :** ::

23 What makes a good alignment! The more divergent the sequences, the better! The fewer indels, the better! Nice ungapped blocks separated with indels! Different classes of residues within a block:! Completely conserved (*)! Size and hydropathy conserved (:)! Size or hydropathy conserved (.)! The ultimate evaluation is a matter of personal judgment and knowledge Avoiding pitfalls

24 Naming your sequences the right way! Never use white spaces in your sequence names! Never use special symbols. Stick to plain letters, numbers and the underscore sign (_) to replace spaces. Avoid ALL other signs, especially the most tempting ones #,, *, >, <! Never use names longer than 15 characters! Never give the same name to 2 different sequences in your set. Some programs accept it, others like ClustalW don t. Do not use too many sequences!

25 Beware of Repeats! There is a problem when two sequences do not contain the same number of repeats! It is then better to manually extract the repeats and to align them separately. Individual repeats can be recognized using Dotlet or Dotter. Keep a biological perspective chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGLwheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS * ***.:: ::... : *... : *. *: * DIFFERENT PARAMETERS chite KSEWEAKAATAKQNY-I--RALQE-YERNG-Gwheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA trybr RKVYEEMAEKDKERY----K--RE-M mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE----- : : * :.* :

26 Do not overtune!!! chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. *.:... :.. *. *: * DO NOT PLAY WITH PARAMETERS! IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF! chite AATAKQNYIRALQEYERNGGwheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : BaliBase classification and benchmark PROBLEM Description Even Phylogenic Spread. One Outlayer Sequence Two Distantly related Groups Long Internal Indel Long Terminal Indel

27 Choosing the right method Source: BaliBase Thompson et al, NAR, (1999) Do et al, Genome Res. (2005) PROBLEM Program Strategy ClustalW, T-coffee, MUSCLE, ProbCons T-Coffee, MUSCLE, ProbCons ProbCons, MUSCLE, MAFFT Dialign II, ProbCons, T-Coffee Dialign II, ProbCons, MAFFT Some interesting links

28 More links! MUSCLE! MAFFT! PROBCONS! PSI-PRALINE! 3D-COFFEE! Conclusion! The best alignment method:! Your brain! The right data! The best evaluation method:! Your eyes! Experimental information (SwissProt)! Choosing the sequences well is important! Beware of repeated elements! What can I conclude?! Homology -> information extrapolation! How can I go further?! Patterns! Profiles! HMMs!

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl) Protein Sequence Analysis BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl) Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical

More information

Dynamic Programming Algorithms

Dynamic Programming Algorithms Dynamic Programming Algorithms Sequence alignments, scores, and significance Lucy Skrabanek ICB, WMC February 7, 212 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Creation of a PAM matrix

Creation of a PAM matrix Rationale for substitution matrices Substitution matrices are a way of keeping track of the structural, physical and chemical properties of the amino acids in proteins, in such a fashion that less detrimental

More information

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments BLAST 100 times faster than dynamic programming. Good for database searches. Derive a list of words of length w from query (e.g., 3 for protein, 11 for DNA) High-scoring words are compared with database

More information

Recent progress in multiple sequence alignment: a survey

Recent progress in multiple sequence alignment: a survey Recent progress in multiple sequence alignment: a survey Cédric Notredame Information Génétique et Structurale, UMR 1889, 31 Chemin Joseph Aiguier, 13 006 Marseille, France Tel.: +33 4 911 646 06; Fax:

More information

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases Chapter 7: Similarity searches on sequence databases All science is either physics or stamp collection. Ernest Rutherford Outline Why is similarity important BLAST Protein and DNA Interpreting BLAST Individualizing

More information

Why learn sequence database searching? Searching Molecular Databases with BLAST

Why learn sequence database searching? Searching Molecular Databases with BLAST Why learn sequence database searching? Searching Molecular Databases with BLAST What have I cloned? Is this really!my gene"? Basic Local Alignment Search Tool How BLAST works Interpreting search results

More information

Database Searching and BLAST Dannie Durand

Database Searching and BLAST Dannie Durand Computational Genomics and Molecular Biology, Fall 2013 1 Database Searching and BLAST Dannie Durand Tuesday, October 8th Review: Karlin-Altschul Statistics Recall that a Maximal Segment Pair (MSP) is

More information

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1 BSCI348S Fall 2003 Midterm 1 Multiple Choice: select the single best answer to the question or completion of the phrase. (5 points each) 1. The field of bioinformatics a. uses biomimetic algorithms to

More information

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous. Two proteins sharing a common ancestor are said to be homologs. Homologyoften implies structural

More information

MATH 5610, Computational Biology

MATH 5610, Computational Biology MATH 5610, Computational Biology Lecture 2 Intro to Molecular Biology (cont) Stephen Billups University of Colorado at Denver MATH 5610, Computational Biology p.1/24 Announcements Error on syllabus Class

More information

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem. Dec-82 Oct-84 Aug-86 Jun-88 Apr-90 Feb-92 Nov-93 Sep-95 Jul-97 May-99 Mar-01 Jan-03 Nov-04 Sep-06 Jul-08 May-10 Mar-12 Growth of GenBank 160,000,000,000 180,000,000 Introduction to Bioinformatics Iosif

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 2. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 2. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 2 Sepp Hochreiter gene Central Dogma nucleus DNA 1. transcription (mrna) 2. transport mrna protein 3. translation (ribosom, trna) 4. folding (protein)

More information

AN IMPROVED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM

AN IMPROVED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM AN IMPROVED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM Manish Kumar Department of Computer Science and Engineering, Indian School of Mines, Dhanbad-826004, Jharkhand,

More information

Germ-line vs somatic-variation theories

Germ-line vs somatic-variation theories BME 128 Tuesday April 26 (1) Filling in the gaps Antibody diversity, how is it achieved? - by specialised (!) mechanisms Chp6 (Protein Diversity & Sequence Analysis) - more about the main concepts in this

More information

VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch

VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch VL Algorithmische BioInformatik (19710) WS2013/2014 Woche 3 - Mittwoch Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Vorlesungsthemen Part 1: Background

More information

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid 3D Structure Prediction with Fold Recognition/Threading Michael Tress CNB-CSIC, Madrid MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSY RKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFAL VYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDL

More information

Theory and Application of Multiple Sequence Alignments

Theory and Application of Multiple Sequence Alignments Theory and Application of Multiple Sequence Alignments a.k.a What is a Multiple Sequence Alignment, How to Make One, and What to Do With It Brett Pickett, PhD History Structure of DNA discovered (1953)

More information

Changing Mutation Operator of Genetic Algorithms for optimizing Multiple Sequence Alignment

Changing Mutation Operator of Genetic Algorithms for optimizing Multiple Sequence Alignment International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 11 (2013), pp. 1155-1160 International Research Publications House http://www. irphouse.com /ijict.htm Changing

More information

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG Chapman & Hall/CRC Mathematical and Computational Biology Series ALGORITHMS IN BIO INFORMATICS A PRACTICAL INTRODUCTION WING-KIN SUNG CRC Press Taylor & Francis Group Boca Raton London New York CRC Press

More information

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences. BLAST Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences. An example could be aligning an mrna sequence to genomic DNA. Proteins are frequently composed of

More information

Protein Structure Prediction. christian studer , EPFL

Protein Structure Prediction. christian studer , EPFL Protein Structure Prediction christian studer 17.11.2004, EPFL Content Definition of the problem Possible approaches DSSP / PSI-BLAST Generalization Results Definition of the problem Massive amounts of

More information

Sequence Databases and database scanning

Sequence Databases and database scanning Sequence Databases and database scanning Marjolein Thunnissen Lund, 2012 Types of databases: Primary sequence databases (proteins and nucleic acids). Composite protein sequence databases. Secondary databases.

More information

BIOINFORMATICS AND FUNCTIONAL GENOMICS

BIOINFORMATICS AND FUNCTIONAL GENOMICS BIOINFORMATICS AND FUNCTIONAL GENOMICS third edition Jonathan Pevsner Bioinformatics and Functional Genomics Bioinformatics and Functional Genomics Third Edition Jonathan Pevsner Department of Neurology,

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec6:Interpreting Your Multiple Sequence Alignment Interpreting Your Multiple Sequence

More information

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs 1997 Oxford University Press Nucleic Acids Research, 1997, Vol. 25, No. 17 3389 3402 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Stephen F. Altschul*, Thomas L. Madden,

More information

Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction

Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction Sequence Analysis '17 -- lecture 16 1. Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction Alpha helix Right-handed helix. H-bond is from the oxygen at i to the nitrogen

More information

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University Sequence Based Function Annotation Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University Usage scenarios for sequence based function annotation Function prediction of newly cloned

More information

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics Alignment methods Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics Alignment methods Sequence alignment Assembly vs alignment Alignment methods Common issues Platform

More information

Why Use BLAST? David Form - August 15,

Why Use BLAST? David Form - August 15, Wolbachia Workshop 2017 Bioinformatics BLAST Basic Local Alignment Search Tool Finding Model Organisms for Study of Disease Can yeast be used as a model organism to study cystic fibrosis? BLAST Why Use

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Dr. Taysir Hassan Abdel Hamid Lecturer, Information Systems Department Faculty of Computer and Information Assiut University taysirhs@aun.edu.eg taysir_soliman@hotmail.com

More information

Protein 3D Structure Prediction

Protein 3D Structure Prediction Protein 3D Structure Prediction Michael Tress CNIO ?? MREYKLVVLGSGGVGKSALTVQFVQGIFVDE YDPTIEDSYRKQVEVDCQQCMLEILDTAGTE QFTAMRDLYMKNGQGFALVYSITAQSTFNDL QDLREQILRVKDTEDVPMILVGNKCDLEDER VVGKEQGQNLARQWCNCAFLESSAKSKINVN

More information

1.1 What is bioinformatics? What is computational biology?

1.1 What is bioinformatics? What is computational biology? Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 16, 2006 3 1 Introduction 1.1 What is bioinformatics? What is computational biology? Bioinformatics and computational biology are multidisciplinary

More information

Profile-Profile Alignment: A Powerful Tool for Protein Structure Prediction. N. von Öhsen, I. Sommer, R. Zimmer

Profile-Profile Alignment: A Powerful Tool for Protein Structure Prediction. N. von Öhsen, I. Sommer, R. Zimmer Profile-Profile Alignment: A Powerful Tool for Protein Structure Prediction N. von Öhsen, I. Sommer, R. Zimmer Pacific Symposium on Biocomputing 8:252-263(2003) PROFILE-PROFILE ALIGNMENT: A POWERFUL TOOL

More information

Getting To Know Your Protein

Getting To Know Your Protein Getting To Know Your Protein Comparative Protein Analysis: Part II. Protein Domain Identification & Classification Robert Latek, PhD Sr. Bioinformatics Scientist Whitehead Institute for Biomedical Research

More information

BIOINFORMATICS IN BIOCHEMISTRY

BIOINFORMATICS IN BIOCHEMISTRY BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses on the analysis of molecular sequences (DNA, RNA, and

More information

Lecture Notes for Fall 2012 Protein Structure Prediction

Lecture Notes for Fall 2012 Protein Structure Prediction Lecture Notes for 20.320 Fall 2012 Protein Structure Prediction Ernest Fraenkel Introduction We ill examine to methods for analyzing sequences in order to determine the structure of the proteins. The first

More information

Introduction to bioinformatics

Introduction to bioinformatics 58266 Introduction to bioinformatics Autumn 26 Esa Pitkänen Master's Degree Programme in Bioinformatics (MBI) Department of Computer Science, University of Helsinki http://www.cs.helsinki.fi/mbi/courses/6-7/itb/

More information

SNP calling and VCF format

SNP calling and VCF format SNP calling and VCF format Laurent Falquet, Oct 12 SNP? What is this? A type of genetic variation, among others: Family of Single Nucleotide Aberrations Single Nucleotide Polymorphisms (SNPs) Single Nucleotide

More information

Molecular Databases and Tools

Molecular Databases and Tools NWeHealth, The University of Manchester Molecular Databases and Tools Afternoon Session: NCBI/EBI resources, pairwise alignment, BLAST, multiple sequence alignment and primer finding. Dr. Georgina Moulton

More information

CHAPTER 21 LECTURE SLIDES

CHAPTER 21 LECTURE SLIDES CHAPTER 21 LECTURE SLIDES Prepared by Brenda Leady University of Toledo To run the animations you must be in Slideshow View. Use the buttons on the animation to play, pause, and turn audio/text on or off.

More information

Baum-Welch and HMM applications. November 16, 2017

Baum-Welch and HMM applications. November 16, 2017 Baum-Welch and HMM applications November 16, 2017 Markov chains 3 states of weather: sunny, cloudy, rainy Observed once a day at the same time All transitions are possible, with some probability Each state

More information

Cross-references to the corresponding SWISS-PROT entries as well as to matched sequences from the PDB 3D-structure database 2 are also provided.

Cross-references to the corresponding SWISS-PROT entries as well as to matched sequences from the PDB 3D-structure database 2 are also provided. Amos Bairoch and Philipp Bucher are group leaders at the Swiss Institute of Bioinformatics (SIB), whose mission is to promote research, development of software tools and databases as well as to provide

More information

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen Homology Modelling Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen Why are Protein Structures so Interesting? They provide a detailed picture of interesting biological features,

More information

Introduction to bioinformatics

Introduction to bioinformatics 58266 Introduction to bioinformatics Autumn 27 Esa Pitkänen Master's Degree Programme in Bioinformatics (MBI) Department of Computer Science, University of Helsinki http://www.cs.helsinki.fi/mbi/courses/7-8/itb/

More information

AB INITIO PROTEIN STRUCTURE PREDICTION ALGORITHMS

AB INITIO PROTEIN STRUCTURE PREDICTION ALGORITHMS San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2011 AB INITIO PROTEIN STRUCTURE PREDICTION ALGORITHMS Maciej Kicinski San Jose State University

More information

Motif Discovery from Large Number of Sequences: a Case Study with Disease Resistance Genes in Arabidopsis thaliana

Motif Discovery from Large Number of Sequences: a Case Study with Disease Resistance Genes in Arabidopsis thaliana Motif Discovery from Large Number of Sequences: a Case Study with Disease Resistance Genes in Arabidopsis thaliana Irfan Gunduz, Sihui Zhao, Mehmet Dalkilic and Sun Kim Indiana University, School of Informatics

More information

Content Planner Construction via Evolutionary Algorithms and a Corpus-based Fitness Function

Content Planner Construction via Evolutionary Algorithms and a Corpus-based Fitness Function Content Planner Construction via Evolutionary Algorithms and a Corpus-based Fitness Function Pablo A. Duboue and Kathleen R. McKeown Computer Science Department Columbia University in the city of New York

More information

Evolutionary Genetics. LV Lecture with exercises 6KP

Evolutionary Genetics. LV Lecture with exercises 6KP Evolutionary Genetics LV 25600-01 Lecture with exercises 6KP HS2017 >What_is_it? AATGATACGGCGACCACCGAGATCTACACNNNTC GTCGGCAGCGTC 2 NCBI MegaBlast search (09/14) 3 NCBI MegaBlast search (09/14) 4 Submitted

More information

Original article: AN ENHANCED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM Manish Kumar

Original article: AN ENHANCED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM Manish Kumar Original article: AN ENHANCED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM Manish Kumar Department of Computer Science and Engineering, Indian School of Mines,

More information

BIOINFORMATICS Introduction

BIOINFORMATICS Introduction BIOINFORMATICS Introduction Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a 1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? (Molecular) Bio -informatics One idea

More information

RNA Secondary Structure Prediction Computational Genomics Seyoung Kim

RNA Secondary Structure Prediction Computational Genomics Seyoung Kim RNA Secondary Structure Prediction 02-710 Computational Genomics Seyoung Kim Outline RNA folding Dynamic programming for RNA secondary structure prediction Covariance model for RNA structure prediction

More information

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases. What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases. Bioinformatics is the marriage of molecular biology with computer

More information

Insight II. Homology. March Scranton Road San Diego, CA / Fax: 858/

Insight II. Homology. March Scranton Road San Diego, CA / Fax: 858/ Insight II Homology March 2000 9685 Scranton Road San Diego, CA 92121-3752 858/458-9990 Fax: 858/458-0136 Copyright * This document is copyright 2000, Molecular Simulations Inc., a subsidiary of Pharmacopeia,

More information

Quality Function Deployment

Quality Function Deployment Quality Function Deployment Customer Driven Product Development Illustrated by Example Define Identify CCRs Design Optimize for 6 Validate Define the Project Business Case, Opportunity Statement, Goal,

More information

Alignment to a database. November 3, 2016

Alignment to a database. November 3, 2016 Alignment to a database November 3, 2016 How do you create a database? 1982 GenBank (at LANL, 2000 sequences) 1988 A way to search GenBank (FASTA) Genome Project 1982 GenBank (at LANL, 2000 sequences)

More information

JPred and Jnet: Protein Secondary Structure Prediction.

JPred and Jnet: Protein Secondary Structure Prediction. JPred and Jnet: Protein Secondary Structure Prediction www.compbio.dundee.ac.uk/jpred ...A I L E G D Y A S H M K... FUNCTION? Protein Sequence a-helix b-strand Secondary Structure Fold What is the difference

More information

M. Phil. (Computer Science) Programme < >

M. Phil. (Computer Science) Programme < > M. Phil. (Computer Science) Programme Department of Information and Communication Technology, Fakir Mohan University, Vyasa Vihar, Balasore-756019, Odisha. MPCS11: Research Methodology Unit

More information

Genome editing. Knock-ins

Genome editing. Knock-ins Genome editing Knock-ins Experiment design? Should we even do it? In mouse or rat, the HR-mediated knock-in of homologous fragments derived from a donor vector functions well. However, HR-dependent knock-in

More information

Application for Automating Database Storage of EST to Blast Results. Vikas Sharma Shrividya Shivkumar Nathan Helmick

Application for Automating Database Storage of EST to Blast Results. Vikas Sharma Shrividya Shivkumar Nathan Helmick Application for Automating Database Storage of EST to Blast Results Vikas Sharma Shrividya Shivkumar Nathan Helmick Outline Biology Primer Vikas Sharma System Overview Nathan Helmick Creating ESTs Nathan

More information

Bioinformatics with basic local alignment search tool (BLAST) and fast alignment (FASTA)

Bioinformatics with basic local alignment search tool (BLAST) and fast alignment (FASTA) Vol. 6(1), pp. 1-6, April 2014 DOI: 10.5897/IJBC2013.0086 Article Number: 093849744377 ISSN 2141-2464 Copyright 2014 Author(s) retain the copyright of this article http://www.academicjournals.org/jbsa

More information

Introduction to BIOINFORMATICS

Introduction to BIOINFORMATICS Introduction to BIOINFORMATICS Antonella Lisa CABGen Centro di Analisi Bioinformatica per la Genomica Tel. 0382-546361 E-mail: lisa@igm.cnr.it http://www.igm.cnr.it/pagine-personali/lisa-antonella/ What

More information

A Brief Introduction to Bioinformatics

A Brief Introduction to Bioinformatics A Brief Introduction to Bioinformatics Dan Lopresti Associate Professor Office PL 404B dal9@lehigh.edu February 2007 Slide 1 Motivation Biology easily has 500 years of exciting problems to work on. Donald

More information

Basic Bioinformatics: Homology, Sequence Alignment,

Basic Bioinformatics: Homology, Sequence Alignment, Basic Bioinformatics: Homology, Sequence Alignment, and BLAST William S. Sanders Institute for Genomics, Biocomputing, and Biotechnology (IGBB) High Performance Computing Collaboratory (HPC 2 ) Mississippi

More information

BLAST Basics. ... Elements of Bioinformatics Spring, Tom Carter. tom/

BLAST Basics. ... Elements of Bioinformatics Spring, Tom Carter.  tom/ BLAST Basics...... Elements of Bioinformatics Spring, 2003 Tom Carter http://astarte.csustan.edu/ tom/ March, 2003 1 Sequence Comparison One of the fundamental tasks we would like to do in bioinformatics

More information

Introns early. Introns late

Introns early. Introns late Introns early Introns late Self splicing RNA are an example for catalytic RNA that could have been present in RNA world. There is little reason to assume that the RNA world was not plagued by self-splicing

More information

Protein design. CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror

Protein design. CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror Protein design CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror 1 Outline Why design proteins? Overall approach: Simplifying the protein design problem Protein design methodology Designing the backbone

More information

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS*

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS* COMPUTATIONAL METHODS IN SCIENCE AND TECHNOLOGY 9(1-2) 93-100 (2003/2004) Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS* DARIUSZ PLEWCZYNSKI AND LESZEK RYCHLEWSKI BiolnfoBank

More information

STRUCTURAL BIOLOGY. α/β structures Closed barrels Open twisted sheets Horseshoe folds

STRUCTURAL BIOLOGY. α/β structures Closed barrels Open twisted sheets Horseshoe folds STRUCTURAL BIOLOGY α/β structures Closed barrels Open twisted sheets Horseshoe folds The α/β domains Most frequent domain structures are α/β domains: A central parallel or mixed β sheet Surrounded by α

More information

PRESENTING SEQUENCES 5 GAATGCGGCTTAGACTGGTACGATGGAAC 3 3 CTTACGCCGAATCTGACCATGCTACCTTG 5

PRESENTING SEQUENCES 5 GAATGCGGCTTAGACTGGTACGATGGAAC 3 3 CTTACGCCGAATCTGACCATGCTACCTTG 5 Molecular Biology-2017 1 PRESENTING SEQUENCES As you know, sequences may either be double stranded or single stranded and have a polarity described as 5 and 3. The 5 end always contains a free phosphate

More information

Modern BLAST Programs

Modern BLAST Programs Modern BLAST Programs Jian Ma and Louxin Zhang Abstract The Basic Local Alignment Search Tool (BLAST) is arguably the most widely used program in bioinformatics. By sacrificing sensitivity for speed, it

More information

Tutorial Segmentation and Classification

Tutorial Segmentation and Classification MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION v171025 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel

More information

Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University

Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University my background Undergraduate Degree computer systems engineer (ASU

More information

5-Minute Compliance Matrix. Ebook

5-Minute Compliance Matrix. Ebook 5-Minute Compliance Matrix Ebook. Introduction We at VisibleThread have seen more than our fair share of compliance matrices. We continue to work closely with proposal managers who manually create the

More information

Sequence searching and sequence alignments MBV-INFX410

Sequence searching and sequence alignments MBV-INFX410 Sequence searching and sequence alignments MBV-INFX410 In this exercise we will start with a bacterial DNA repair protein called Nth and identify its homologs in different species, including humans, using

More information

Practical Bioinformatics for Biologists (BIOS 441/641)

Practical Bioinformatics for Biologists (BIOS 441/641) Practical Bioinformatics for Biologists (BIOS 441/641) - Course overview Yanbin Yin MO444 1 Room and computer access Room entry code: 2159 Computer access: user poduser 2 Compared to BIOS 443/643 and 646

More information

BioPerl: Pairwise Sequence Alignment

BioPerl: Pairwise Sequence Alignment BioPerl: Pairwise Sequence Alignment use Bio::Tools::pSW; $factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla', '-gap' => 2, '-ext' => 2, ); $factory->align_and_show($seq, $seq2, STDOUT); /4/2002

More information

LAB. WALRUSES AND WHALES AND SEALS, OH MY!

LAB. WALRUSES AND WHALES AND SEALS, OH MY! Name Period Date LAB. WALRUSES AND WHALES AND SEALS, OH MY! Walruses and whales are both marine mammals. So are dolphins, seals, and manatee. They all have streamlined bodies, legs reduced to flippers,

More information

Product Applications for the Sequence Analysis Collection

Product Applications for the Sequence Analysis Collection Product Applications for the Sequence Analysis Collection Pipeline Pilot Contents Introduction... 1 Pipeline Pilot and Bioinformatics... 2 Sequence Searching with Profile HMM...2 Integrating Data in a

More information

Unified nomenclature for the winged helix/forkhead transcription factors

Unified nomenclature for the winged helix/forkhead transcription factors CORRESPONDENCE Unified nomenclature for the winged helix/forkhead transcription factors Klaus H. Kaestner, 1,4,5 Walter Knöchel, 2,4 and Daniel E. Martínez 3,4,5 1 Department of Genetics, University of

More information

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence Annotating 7G24-63 Justin Richner May 4, 2005 Zfh2 exons Thd1 exons Pur-alpha exons 0 40 kb 8 = 1 kb = LINE, Penelope = DNA/Transib, Transib1 = DINE = Novel Repeat = LTR/PAO, Diver2 I = LTR/Gypsy, Invader

More information

Annotating Fosmid 14p24 of D. Virilis chromosome 4

Annotating Fosmid 14p24 of D. Virilis chromosome 4 Lo 1 Annotating Fosmid 14p24 of D. Virilis chromosome 4 Lo, Louis April 20, 2006 Annotation Report Introduction In the first half of Research Explorations in Genomics I finished a 38kb fragment of chromosome

More information

Small Genome Annotation and Data Management at TIGR

Small Genome Annotation and Data Management at TIGR Small Genome Annotation and Data Management at TIGR Michelle Gwinn, William Nelson, Robert Dodson, Steven Salzberg, Owen White Abstract TIGR has developed, and continues to refine, a comprehensive, efficient

More information

How can something so small cause problems so large?

How can something so small cause problems so large? How can something so small cause problems so large? Objectives Identify the structural components of DNA and relate to its function Create and ask questions about a model of DNA DNA is made of genes. Gene

More information

Connecting With Our Customers. Telltacodelmar.com

Connecting With Our Customers. Telltacodelmar.com Connecting With Our Customers Telltacodelmar.com Introducing Telltacodelmar.com Telltacodelmar.com is a website through which our customers grade their experience with us. It s always on 24/7. Introducing

More information

Teaching Principles of Enzyme Structure, Evolution, and Catalysis Using Bioinformatics

Teaching Principles of Enzyme Structure, Evolution, and Catalysis Using Bioinformatics KBM Journal of Science Education (2010) 1 (1): 7-12 doi: 10.5147/kbmjse/2010/0013 Teaching Principles of Enzyme Structure, Evolution, and Catalysis Using Bioinformatics Pablo Sobrado Department of Biochemistry,

More information

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI Variation detection based on second generation sequencing data Xin LIU Department of Science and Technology, BGI liuxin@genomics.org.cn 2013.11.21 Outline Summary of sequencing techniques Data quality

More information

homology - implies common ancestry. If you go back far enough, get to one single copy from which all current copies descend (premise 1).

homology - implies common ancestry. If you go back far enough, get to one single copy from which all current copies descend (premise 1). Drift in Large Populations -- the Neutral Theory Recall that the impact of drift as an evolutionary force is proportional to 1/(2N) for a diploid system. This has caused many people to think that drift

More information

BETTER & FASTER! HOW TO PRIORITIZE REQUIREMENTS: Razvan Radulian, Why-What-How Consulting. Making the impossible possible!

BETTER & FASTER! HOW TO PRIORITIZE REQUIREMENTS: Razvan Radulian, Why-What-How Consulting. Making the impossible possible! HOW TO PRIORITIZE REQUIREMENTS: BETTER & FASTER! Razvan Radulian, Why-What-How Consulting Making the impossible possible! Research Triangle Park IIBA Chapter Meeting AGENDA Part I (pre-workshop): Core

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Goals of this course Learn about Software tools Databases Methods (Algorithms) in

More information

Allen Hutchison Google, Inc. November 7, Automated Testing Dos and Don ts

Allen Hutchison Google, Inc. November 7, Automated Testing Dos and Don ts Allen Hutchison Google, Inc. November 7, 2006 Automated Testing Dos and Don ts Allen Hutchison Google, Inc. November 7, 2006 Automated Testing: Dos and Don ts Collection of experiences and observations

More information

Managers at Bryant University

Managers at Bryant University The Character of Success for Managers at Bryant University Interviewing Guide (Revised 8/25/04) Career Strategies, Inc. Boston, MA A New Approach to Interviewing for Managers at Bryant University An interviewer

More information

Multiple Sequence Alignment Using Optimization Algorithms

Multiple Sequence Alignment Using Optimization Algorithms Multiple Sequence Alignment Using Optimization Algorithms M. F. Omar, R. A. Salam, R. Abdullah, N. A. Rashid Abstract Proteins or genes that have similar sequences are likely to perform the same function.

More information

Personal Web Portfolios (resources for students 11/2017) Cynthia Todd

Personal Web Portfolios (resources for students 11/2017) Cynthia Todd Personal (resources for students 11/2017) Cynthia Todd Information Science What You Will Learn What is A Portfolio and Resources to Build an Effective One of of 2 What is a Personal Portfolio It is defined

More information

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

Genomic DNA ASSEMBLY BY REMAPPING. Course overview ASSEMBLY BY REMAPPING Laurent Falquet, The Bioinformatics Unravelling Group, UNIFR & SIB MA/MER @ UniFr Group Leader @ SIB Course overview Genomic DNA PacBio Illumina methylation de novo remapping Annotation

More information

Talent Management System User Guide. Employee Profile, Goal Management & Performance Management

Talent Management System User Guide. Employee Profile, Goal Management & Performance Management Talent Management System User Guide Employee Profile, Goal Management & Performance Management January 2017 Table of Contents OVERVIEW... 1 Access the Talent Management System (TMS)... 1 Access the TMS...

More information

Applications of genetic algorithms in bioinformatics

Applications of genetic algorithms in bioinformatics San Jose State University SJSU ScholarWorks Master's Theses Master's Theses and Graduate Research 2008 Applications of genetic algorithms in bioinformatics Amie Judith Radenbaugh San Jose State University

More information

ON USING DNA DISTANCES AND CONSENSUS IN REPEATS DETECTION

ON USING DNA DISTANCES AND CONSENSUS IN REPEATS DETECTION ON USING DNA DISTANCES AND CONSENSUS IN REPEATS DETECTION Petre G. POP Technical University of Cluj-Napoca, Romania petre.pop@com.utcluj.ro Abstract: Sequence repeats are the simplest form of regularity

More information

Methylation Analysis of Next Generation Sequencing

Methylation Analysis of Next Generation Sequencing Methylation Analysis of Next Generation Sequencing Muhammad Ahmer Jamil *, Prof. Holger Frohlich Priv.-Doz. Dr. Osman El-Maarri Bonn-Aachen International Center for Information Technology Institute of

More information

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme Illumina (Solexa) Current market leader Based on sequencing by synthesis Current read length 100-150bp Paired-end easy, longer matepairs harder Error ~0.1% Mismatch errors dominate Throughput: 4 Tbp in

More information