Lecture: Algorithmic Bioinformatics (19710), WS 2013/2014, Week 3 - Wednesday
1 Lecture: Algorithmic Bioinformatics (19710), WS 2013/2014, Week 3 - Wednesday. Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin
2 Lecture topics
Part 1: Background Basics (4): 1. The Nucleic Acid World; 2. Protein Structure; 3. Dealing with Databases
Part 2: Sequence Alignments (2): 4. Producing and Analyzing Sequence Alignments; 5. Pairwise Sequence Alignment and Database Searching; 6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3): 7. Recovering Evolutionary History; 8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4): 9. Revealing Genome Features; 10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4): 11. Obtaining Secondary Structure from Sequence; 12. Predicting Secondary Structures
Part 6: Tertiary Structures (4): 13. Modeling Protein Structure; 14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8): 15. Proteome and Gene Expression Analysis; 16. Clustering Methods and Statistics; 17. Systems Biology
3 3rd semester (WS 12/13): DP; pairwise sequence alignment; Needleman/Wunsch; Smith-Waterman; FastA; BLAST; multiple sequence alignment; HMMs. Today: last part of the alignment block (review). Book: 6.1, 6.2, 6.6
4 Alignment scoring matrix. Protein matrix: [figure]
5 Use of a scoring matrix: P L S - - C F G G L T - A C H L, Score = 3
6 Multiple sequence alignment
7 Sequence logo
8 Profiles and sequence logos
9 Biological Motifs. A large number of biological units with common functions tend to exhibit similarities at the sequence level. These range from very short motifs, such as gene splice sites, DNA regulatory binding sites recognized by transcription factors (proteins that bind to the promoter and control gene expression), and microRNAs, all the way to protein families. It is often desirable to model such motifs in order to search for new instances. Probabilistic models are very useful for this task.
10 Promoter
11 Regulation of Genes: Transcription Factor (Protein), RNA polymerase (Protein), DNA, Regulatory Element, Gene
12 Regulation of Genes: Transcription Factor (Protein), RNA polymerase, DNA, Regulatory Element, Gene
13 Regulation of Genes: Transcription Factor, New protein, RNA polymerase, DNA, Regulatory Element, Gene
14 Motif Logo. Motifs can mutate at less important bases. The five motifs listed here have mutations at positions 3 and 5. Representations called motif logos illustrate the conserved regions of a motif. Position: 1234567. Motifs: TGGGGGA, TGAGAGA, TGGGGGA, TGAGAGA, TGAGGGA
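The observation about positions 3 and 5 can be checked by counting bases per column, which is also the first step toward a frequency matrix. The following is a minimal sketch, not part of the lecture; the helper name `column_counts` is hypothetical.

```python
from collections import Counter

# The five motif instances from the slide.
motifs = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]

def column_counts(seqs):
    """Return one Counter of base counts per alignment column (hypothetical helper)."""
    return [Counter(column) for column in zip(*seqs)]

counts = column_counts(motifs)

# Consensus: the most frequent base in each column.
consensus = "".join(c.most_common(1)[0][0] for c in counts)

# Variable positions (1-based): columns with more than one distinct base.
variable = [i + 1 for i, c in enumerate(counts) if len(c) > 1]

print(consensus)  # TGAGGGA
print(variable)   # [3, 5]
```

A sequence logo draws the same per-column information graphically, with conserved columns shown as tall letters.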
15 Example: Calmodulin-Binding Motif (calcium-binding proteins)
16 PSSM starting point: a gap-less MSA of known instances of a given motif. The motif is represented by either: 1. a consensus; 2. a Position Specific Scoring Matrix (PSSM).
17 Sequence logos: Visualizing PSSMs
18 Frequency matrix
19 Frequency matrices. Three uses of frequency matrices: describe a sequence feature; calculate the probability of occurrence of the feature in a random sequence; calculate the degree of match between a new sequence and a feature.
20 Frequency Matrices, PSSMs, and Profiles. A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores. PSSMs are also called Position Weight Matrices (PWMs) or profiles.
21 Methods for converting frequency matrices to PSSMs. (1) Using the log ratio of observed to expected frequencies, $\text{score}(j,i) = \log \frac{m(j,i)}{f(j)}$, where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences). (2) Using an amino acid substitution matrix (Dayhoff similarity matrix).
22 Pseudo-counts. How do we get a score for a position with zero counts for a particular character? We can't take log(0). Solution: add a small number to all positions with zero frequency.
23 Consensus sequences. Different ways to describe a consensus, from crude to refined: consensus site; sequence logos; Position Specific Score Matrix (PSSM); Hidden Markov Model (HMM).
24 Constructing a consensus: 1. Collect sequences. 2. Align sequences (consensus sites are descriptions of the alignment). 3. Condense the set of sequences into a summary description (consensus, PSSM, or HMM). 4. Apply the scoring matrix in alignments/searches.
25 Position Specific Score Matrix (PSSM). A position specific scoring matrix (PSSM) is a matrix based on the amino acid (or nucleic acid) frequencies at every position of a multiple alignment. The PSSM calculated from these frequencies assigns higher scores to residues that appear more often than expected by chance at a given position.
26 Creating a PSSM: Example. Alignment: NTEGEWI, NITRGEW, NIAGECC. Amino acid frequencies at every position of the alignment: [table]
27 Creating a PSSM: Example. Amino acids that do not appear at a specific position of a multiple alignment must also be considered, in order to model every possible sequence and to have calculable log-odds scores. A simple procedure called pseudo-counts assigns minimal scores to residues that do not appear at a certain position of the alignment, according to the following equation: $\text{Score}(i,j) = \frac{\text{Frequency}(i,j) + \text{pseudocount}}{N + 20 \cdot \text{pseudocount}}$, where Frequency(i,j) is the count of occurrences of residue i in column j, pseudocount is a number greater than or equal to 1, N is the number of sequences in the multiple alignment, and 20 is the number of amino acids.
28 Creating a PSSM: Example. In this example, N = 3 and let's use pseudocount = 1. Raw values: Score(N) at position 1 = 3/3 = 1; Score(I) at position 1 = 0/3 = 0. Readjusted: Score(I) at position 1 -> (0+1) / (3+20) = 1/23 ≈ 0.043; Score(N) at position 1 -> (3+1) / (3+20) = 4/23 ≈ 0.174. The PSSM is obtained by taking the logarithm of the values obtained above divided by the background frequency of the residues. To simplify this example, we'll assume that every amino acid appears equally often in protein sequences, i.e. f_i = 0.05 for every i. Taking base-10 logarithms: PSSM Score(I) at position 1 = log((1/23) / 0.05) ≈ -0.061; PSSM Score(N) at position 1 = log((4/23) / 0.05) ≈ 0.541.
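The worked example can be reproduced in a few lines. This is a sketch under the slide's simplifying assumptions (pseudocount = 1, uniform background of 0.05); the function name `build_pssm` and the choice of base-10 logarithms are assumptions made here, since the slide does not state the base.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_pssm(seqs, pseudocount=1.0, background=0.05):
    """Log-odds PSSM from a gap-free alignment (hypothetical helper).

    Each column becomes a dict mapping residue -> log10(adjusted frequency / background).
    """
    n = len(seqs)
    denominator = n + len(AMINO_ACIDS) * pseudocount
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)  # residues missing from the column count as 0
        pssm.append({
            aa: math.log10((counts[aa] + pseudocount) / denominator / background)
            for aa in AMINO_ACIDS
        })
    return pssm

alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]
pssm = build_pssm(alignment)

print(round(pssm[0]["N"], 3))  # 0.541  (N is seen 3 times at position 1)
print(round(pssm[0]["I"], 3))  # -0.061 (I is never seen at position 1)
```

With a different logarithm base the numbers change, but the signs do not: residues seen more often than the background get positive scores, unseen residues get negative ones.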
29 Creating a PSSM: Example. The matrix assigns positive scores to residues that appear more often than expected by chance and negative scores to residues that appear less often than expected by chance.
30 Using a PSSM. To search for matches to a PSSM, scan along the sequence using a window of the same length (L) as the PSSM. The matrix is slid along the sequence one residue at a time, and the scores of the residues in each window of length L are summed. Windows scoring higher than an empirically predetermined threshold are reported.
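The sliding-window scan described above can be sketched as follows. The three-column DNA matrix `toy_pssm`, the test sequence, and the threshold are made-up values for illustration only; they do not come from the lecture.

```python
def scan(sequence, pssm, threshold):
    """Slide a PSSM along a sequence and report windows scoring above the threshold.

    pssm is a list of dicts (one per column) mapping residue -> score.
    Returns a list of (start_index, window, score) tuples with 0-based starts.
    """
    length = len(pssm)
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        score = sum(column[residue] for column, residue in zip(pssm, window))
        if score > threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical PSSM that favours the word "ACG".
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]

hits = scan("TTACGACGT", toy_pssm, threshold=2.5)
print(hits)  # [(2, 'ACG', 3.0), (5, 'ACG', 3.0)]
```

This gap-free scan costs on the order of (sequence length × L) score lookups; allowing gaps requires the dynamic programming approach mentioned on the following slide.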
31 Searching with a PSSM. Most approaches use dynamic programming, usually the Smith-Waterman variant. Excellent method for finding distantly related sequences. The gap model is affine, with gap-open and gap-extend penalties that depend on the position in the alignment. Can be used to locate a motif in an alignment and then edit the alignment.
32 PSI-BLAST
33 Position-Specific Iterated BLAST. Intuition: substitution matrices should be specific to a particular site, e.g. penalize an alanine-to-glycine substitution more in a helix. Idea: use BLAST with high stringency to get a set of closely related sequences; align those sequences to create a new substitution matrix for each position; then use that matrix to find additional sequences. Properties: cycling/iterative method; gives increased sensitivity for detecting distantly related proteins; can give insight into functional relationships; very refined statistical methods; fast (still based on BLAST methods); simple to use.
34 PSI-BLAST Principle: 1. First, a standard blastp search is performed. 2. The highest-scoring hits are used to generate a multiple alignment. 3. A PSSM is generated from the multiple alignment: highly conserved residues get high scores, less conserved residues get lower scores, and sequences >98% similar are not included (to avoid biasing the PSSM). 4. Another similarity search is performed, this time using the new PSSM. 5. Steps 2-4 are repeated until convergence, i.e. until no new sequences appear after an iteration.
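The iterate-until-convergence idea can be illustrated with a deliberately simplified toy loop. This is not BLAST or PSI-BLAST: it assumes equal-length, gap-free sequences, scores whole sequences against a log-odds PSSM, and omits the >98% similarity filter. All names, sequences, and the threshold are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # uniform background, an assumption

def pssm_from(seqs, pseudocount=1.0):
    """Log-odds PSSM (base 10) from equal-length, gap-free sequences."""
    denominator = len(seqs) + len(ALPHABET) * pseudocount
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({aa: math.log10((counts[aa] + pseudocount) / denominator / BACKGROUND)
                     for aa in ALPHABET})
    return pssm

def score(seq, pssm):
    """Sum of per-column PSSM scores for one sequence."""
    return sum(column[residue] for column, residue in zip(pssm, seq))

def iterative_search(query, database, threshold, max_rounds=10):
    """Rebuild the PSSM from all included sequences until no new hits appear."""
    included = {query}
    for _ in range(max_rounds):
        pssm = pssm_from(sorted(included))
        hits = {s for s in database if score(s, pssm) > threshold}
        new_hits = hits - included
        if not new_hits:  # convergence: no new sequences in this round
            break
        included |= new_hits
    return included

database = ["ACDEFH", "ACDKLH", "WWWWWW"]
one_pass = iterative_search("ACDEFG", database, threshold=1.0, max_rounds=1)
converged = iterative_search("ACDEFG", database, threshold=1.0)

print(sorted(one_pass))   # ['ACDEFG', 'ACDEFH']
print(sorted(converged))  # ['ACDEFG', 'ACDEFH', 'ACDKLH']
```

In this toy data set the more distant sequence ACDKLH is missed by the first pass and only recovered once the closer hit ACDEFH has been folded into the PSSM, which is the effect the iteration is meant to achieve.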
35 Example: Aminoacyl-tRNA synthetases. 20 enzymes for 20 amino acids. Each is very different: big, small, monomers, tetramers. All bind to their appropriate tRNAs and amino acids with high specificity. TrpRS and TyrRS share only 13% sequence identity, BUT the overall structures of TrpRS and TyrRS are similar (structure-function relationship). Tryptophanyl-tRNA synthetase; Tyrosyl-tRNA synthetase.
36 Same SCOP family based on catalytic domain. Overall structure similarity noted.
37 So is there sequence similarity between TyrRS and TrpRS? Given the structural similarities, we would expect to find sequence similarity. BUT: blastp of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an e-value cutoff of 10.
38 No TrpRS!?
39 Try using PSI-BLAST. PSI-BLAST is available from the BLAST main page. The query form is just like the one for blastp, BUT one extra formatting option must be used: "Format for PSI-BLAST" (activate the tick box!). A second e-value cutoff, the threshold for inclusion, determines which alignments will be used for the PSSM build. First search using TyrRS as query: Db = SwissProt; limit = Bacteria [ORGN]; Threshold for inclusion =
42 After a few iterations
43 TyrRS similarity to TrpRS!
44 Power of PSI-BLAST. We knew TyrRS and TrpRS were similar, functionally and structurally. BLASTP gave no indication; PSI-BLAST was able to detect their weak sequence similarity. Words of caution: be sure to inspect and think about the results included in the PSSM build; include/exclude sequences on the basis of biological knowledge: you are in the driving seat! PSI-BLAST performance varies according to the choice of matrix, filter, statistics etc., just like BLASTP.
45 Why (not) PSI-BLAST. If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are all homologous, the sensitivity at a given specificity improves significantly. However, if non-homologous sequences are included in the PSSMs, the PSSMs are corrupted: they then pull in more non-homologous sequences and become worse than generic scoring matrices.
46 Query and results. Does the query really have a relationship with the results? One way to check is to run the search in the opposite direction, but searches are often not reversible even when the homology is true.
47 PSI-BLAST caveats. Increased ability to find distant homologues, at the cost of the additional care required to prevent non-homologous sequences from being included in the PSSM calculation. When in doubt, leave it out! Examine sequences with moderate similarity carefully. Be particularly cautious about matches to sequences with highly biased amino acid content: low-complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology. Screen them out of your query sequences!
48 Profile HMMs (Hidden Markov Models)
49 Markov Chains (weather example with the states Rain, Sunny, Cloudy). States: three states (sunny, cloudy, rainy). State transition matrix: the probability of the weather given the previous day's weather. Initial distribution: the probability of the system being in each of the states at time 0.
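The three ingredients (states, transition matrix, initial distribution) can be written down directly. The sketch below uses made-up probabilities, since the slide gives none; `step` and `path_probability` are hypothetical helper names.

```python
states = ["sunny", "cloudy", "rainy"]

# Hypothetical transition probabilities: transition[today][tomorrow]. Each row sums to 1.
transition = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}

# Hypothetical initial distribution at time 0: we know it is sunny.
initial = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}

def step(distribution):
    """Advance a distribution over states by one day."""
    return {t: sum(distribution[s] * transition[s][t] for s in states) for t in states}

def path_probability(path):
    """Probability of one specific sequence of weather states."""
    p = initial[path[0]]
    for previous, current in zip(path, path[1:]):
        p *= transition[previous][current]
    return p

print(step(initial))                                  # {'sunny': 0.5, 'cloudy': 0.25, 'rainy': 0.25}
print(path_probability(["sunny", "sunny", "rainy"]))  # 0.125
```

The Markov property shows up in `path_probability`: each factor depends only on the previous day's state, never on earlier days.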
50 Hidden Markov Models. Hidden states: the (TRUE) states of a system that may be described by a Markov process (e.g., the weather). Observable states: the states of the process that are 'visible' (e.g., seaweed dampness).
51 Components of an HMM. Output matrix: contains the probability of observing a particular observable state given that the hidden model is in a particular hidden state. Initial distribution: contains the probability of the (hidden) model being in a particular hidden state at time t = 1. State transition matrix: holds the probability of a hidden state given the previous hidden state.
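Given these three components, the probability of an observation sequence is obtained by summing over all hidden paths, which the forward algorithm does efficiently. The sketch below reduces the weather example to two hidden states for brevity and uses made-up probabilities; none of the numbers come from the lecture.

```python
states = ["sunny", "rainy"]  # hidden states (reduced to two for brevity)

# Hypothetical parameters.
initial = {"sunny": 0.5, "rainy": 0.5}
transition = {
    "sunny": {"sunny": 0.75, "rainy": 0.25},
    "rainy": {"sunny": 0.25, "rainy": 0.75},
}
# Output matrix: probability of the seaweed observation given the hidden weather.
output = {
    "sunny": {"dry": 0.75, "soggy": 0.25},
    "rainy": {"dry": 0.25, "soggy": 0.75},
}

def forward(observations):
    """Forward algorithm: P(observations), summed over all hidden state paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: initial[s] * output[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            t: output[t][obs] * sum(alpha[s] * transition[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())

print(forward(["dry"]))         # 0.5
print(forward(["dry", "dry"]))  # 0.28125
```

The loop keeps one number per hidden state instead of enumerating every path, so the cost grows linearly with the length of the observation sequence.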
52 Building from an existing alignment. Alignment: ACA---ATG, TCAACTATC, ACAC--AGC, AGA---ATC, ACCG--ATC. Figure labels: output probabilities, insertion, transition probabilities. An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.
53 Query a new sequence. Suppose I have a query protein sequence and want to know which family it belongs to. There can be many paths leading to the generation of this sequence; we need to find all these paths and sum their probabilities. Consensus sequence: ACAC--ATC. P(ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x 0.8x1 x 0.8 ≈ 4.7 x 10^-2
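The product on this slide can be recomputed directly. In the sketch below each factor pair is read as (emission probability) x (transition probability to the next state), which is an interpretation of the slide's notation; the final transition is set to 1.0 because the slide lists none after the last emission.

```python
import math

# (base, emission probability, transition probability to the next state)
# Values are copied from the product on the slide.
consensus_path = [
    ("A", 0.8, 1.0),
    ("C", 0.8, 1.0),
    ("A", 0.8, 0.6),
    ("C", 0.4, 0.6),  # the insertion state
    ("A", 1.0, 1.0),
    ("T", 0.8, 1.0),
    ("C", 0.8, 1.0),  # assumed: no further transition factor after the last base
]

probability = math.prod(emission * trans for _, emission, trans in consensus_path)
log_probability = sum(math.log(emission * trans) for _, emission, trans in consensus_path)

print(round(probability, 4))  # 0.0472, i.e. about 4.7 x 10^-2
```

In practice such products are accumulated as sums of logarithms (as in `log_probability`), because multiplying many small probabilities underflows for longer sequences.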
54 Profile Hidden Markov Models. Statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is and which residues are likely. They use position-specific scores for amino acids (or nucleotides) and position-specific penalties for opening and extending an insertion or deletion.
55 Advantages of using HMMs. HMMs have a formal probabilistic basis: probability theory guides how all the scoring parameters should be set. They can do things that more heuristic methods cannot do easily; for example, a profile HMM can be trained from unaligned sequences if a trusted alignment isn't yet known. HMMs have a consistent theory behind gap and insertion scores.
56 Advantages of using HMMs. In most details, profile HMMs are a slight improvement over a carefully constructed profile, but less skill and manual intervention are necessary to use profile HMMs. HMMs can produce true global alignments, unlike BLAST.
57 Limitations of HMMs. They do not capture any higher-order correlations: an HMM assumes that the identity of a particular position is independent of the identity of all other positions. HMMs make poor models of RNAs because an HMM cannot describe base pairs. Compare protein threading methods, which usually include scoring terms for nearby amino acids in a three-dimensional protein structure. Slower and less user-friendly than PSI-BLAST.
58 Applications of profile HMMs. Database searching for weak homologies (alternative to PSI-BLAST). Automated annotation of the domain structure of proteins.
59 Applications of profile HMMs. Useful for organizing sequences into evolutionarily related families. Databases like Pfam are constructed by distinguishing between a stable, curated seed alignment of a small number of representative sequences and full alignments of all detectable homologs. HMMER is used to make a model of the seed, search the database for homologs, and automatically produce the full alignment by aligning every sequence to the seed consensus.
60 Constructing a profile HMM. A multiple sequence alignment is made of known members of a given protein family; the quality of the alignment and the number and diversity of the sequences are crucial for success. A profile HMM of the family is built from the alignment; the model-building program uses the alignment together with its prior knowledge of the general nature of proteins. A model-scoring program is used to assign a score with respect to the model to any sequence of interest; the better the score, the higher the chance that the query sequence is homologous to the protein family in the model. Each sequence in a database is scored to find the members of the family present in the database.
61 HMMER structure/topology. M = match state; I = insertion (w.r.t. profile: insert gap characters in profile); D = deletion (w.r.t. sequence: insert gap characters in sequence); N = N-terminal, un-aligned; C = C-terminal, un-aligned; J = joining segment, un-aligned
62 Profile HMM programs: HMMER. Developed by Sean Eddy. Freely available under the GNU General Public License. Includes model-building and model-scoring programs relevant to homology detection. Contains a program that calibrates a model by scoring it against a set of random sequences and fitting an extreme value distribution to the resulting raw scores; the parameters of this distribution are then used to calculate accurate E-values for sequences of interest.
63 Programs in the HMMER 2 package. hmmalign: align sequences to an existing model. hmmbuild: build a model from a multiple sequence alignment. hmmcalibrate: takes an HMM and empirically determines parameters that are used to make searches more sensitive by calculating more accurate E-values. hmmconvert: convert a model file into different formats, including a compact HMMER 2 binary format and best-effort emulation of GCG profiles. hmmemit: emit sequences probabilistically from a profile HMM. hmmfetch: get a single model from an HMM database. hmmindex: index an HMM database. hmmpfam: search an HMM database for matches to a query sequence. hmmsearch: search a sequence database for matches to an HMM.
64 PSI-BLAST vs. profile HMMs (pHMMs). PSI-BLAST: input: sequence; database: sequences; algorithm: constructs a PSSM from an initial pass and uses it in the next pass; output: distantly related sequences; sensitive (+) but less specific (-). HMMs: more sensitive, but slower and less user-friendly than PSI-BLAST.
65 Summary
66 More information online at medicalbioinformaticsgroup.de/teaching. Thank you! Tim Conrad, AG Medical Bioinformatics. Further questions
More informationEpigenetics and DNase-Seq
Epigenetics and DNase-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2018 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Anthony
More informationBacterial Genome Annotation
Bacterial Genome Annotation Bacterial Genome Annotation For an annotation you want to predict from the sequence, all of... protein-coding genes their stop-start the resulting protein the function the control
More information3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid
3D Structure Prediction with Fold Recognition/Threading Michael Tress CNB-CSIC, Madrid MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSY RKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFAL VYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDL
More informationIntroduction to Cellular Biology and Bioinformatics. Farzaneh Salari
Introduction to Cellular Biology and Bioinformatics Farzaneh Salari Outline Bioinformatics Cellular Biology A Bioinformatics Problem What is bioinformatics? Computer Science Statistics Bioinformatics Mathematics...
More informationImaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized
1 2 3 Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized medicine, risk assessment etc Public Health Bio
More informationIntroduction to Bioinformatics Finish. Johannes Starlinger
Introduction to Bioinformatics Finish Johannes Starlinger This Lecture Genomics Sequencing Gene prediction Evolutionary relationships Motifs - TFBS Transcriptomics Alignment Proteomics Structure prediction
More informationTutorial for Stop codon reassignment in the wild
Tutorial for Stop codon reassignment in the wild Learning Objectives This tutorial has two learning objectives: 1. Finding evidence of stop codon reassignment on DNA fragments. 2. Detecting and confirming
More informationMATH 5610, Computational Biology
MATH 5610, Computational Biology Lecture 2 Intro to Molecular Biology (cont) Stephen Billups University of Colorado at Denver MATH 5610, Computational Biology p.1/24 Announcements Error on syllabus Class
More informationApplications of HMMs in Computational Biology. BMI/CS Colin Dewey
Applications of HMMs in Computational Biology BMI/CS 576 www.biostat.wisc.edu/bmi576.html Colin Dewey cdewey@biostat.wisc.edu Fall 2008 The Gene Finding Task Given: an uncharacterized DNA sequence Do:
More informationSingle alignment: FASTA. 17 march 2017
Single alignment: FASTA 17 march 2017 FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.[1] FASTA is pronounced
More informationApplicazioni biotecnologiche
Applicazioni biotecnologiche Analisi forense Sintesi di proteine ricombinanti Restriction Fragment Length Polymorphism (RFLP) Polymorphism (more fully genetic polymorphism) refers to the simultaneous occurrence
More informationGenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs
Gene Finding GenBank Growth GenBank Growth In 2003 ~ 31 million sequences ~ 37 billion base pairs GenBank: Exponential Growth Growth of GenBank in billions of base pairs from release 3 in April of 1994
More informationG4120: Introduction to Computational Biology
G4120: Introduction to Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Lecture 3 February 13, 2003 Copyright 2003 Oliver Jovanovic, All Rights Reserved. Bioinformatics
More informationEvolutionary Genetics. LV Lecture with exercises 6KP
Evolutionary Genetics LV 25600-01 Lecture with exercises 6KP HS2017 >What_is_it? AATGATACGGCGACCACCGAGATCTACACNNNTC GTCGGCAGCGTC 2 NCBI MegaBlast search (09/14) 3 NCBI MegaBlast search (09/14) 4 Submitted
More informationChallenging algorithms in bioinformatics
Challenging algorithms in bioinformatics 11 October 2018 Torbjørn Rognes Department of Informatics, UiO torognes@ifi.uio.no What is bioinformatics? Definition: Bioinformatics is the development and use
More informationGrundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson
Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 11, 2011 1 1 Introduction Grundlagen der Bioinformatik Summer 2011 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a) 1.1
More informationCS273B: Deep learning for Genomics and Biomedicine
CS273B: Deep learning for Genomics and Biomedicine Lecture 2: Convolutional neural networks and applications to functional genomics 09/28/2016 Anshul Kundaje, James Zou, Serafim Batzoglou Outline Anatomy
More informationCollect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017
Annotation Annotation for D. virilis Chris Shaffer July 2012 l Big Picture of annotation and then one practical example l This technique may not be the best with other projects (e.g. corn, bacteria) l
More informationProfile HMMs. 2/10/05 CAP5510/CGS5166 (Lec 10) 1 START STATE 1 STATE 2 STATE 3 STATE 4 STATE 5 STATE 6 END
Profile HMMs START STATE 1 STATE 2 STATE 3 STATE 4 STATE 5 STATE 6 END 2/10/05 CAP5510/CGS5166 (Lec 10) 1 Profile HMMs with InDels Insertions Deletions Insertions & Deletions DELETE 1 DELETE 2 DELETE 3
More informationComputational gene finding
Computational gene finding Devika Subramanian Comp 470 Outline (3 lectures) Lec 1 Lec 2 Lec 3 The biological context Markov models and Hidden Markov models Ab-initio methods for gene finding Comparative
More informationMotif Search CMSC 423
Motif Search CMSC 423 Central Dogma of Biology proteins Translation mrna (T U) Transcription Genome DNA = double-stranded, linear molecule each strand is string over {A,C,G,T} strands are complements of
More informationFinding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm
Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm Yen-Wei Chu 1,3, Chuen-Tsai Sun 3, Chung-Yuan Huang 2,3 1) Department of Information Management 2) Department
More informationAnalysis of Biological Sequences SPH
Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu nuts and bolts meet Tuesdays & Thursdays, 3:30-4:50 no exam; grade derived from 3-4 homework assignments plus a final project (open book,
More informationCollect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018
Annotation Annotation for D. virilis Chris Shaffer July 2012 l Big Picture of annotation and then one practical example l This technique may not be the best with other projects (e.g. corn, bacteria) l
More informationBioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University
Bioinformatics: Sequence Analysis COMP 571 Luay Nakhleh, Rice University Course Information Instructor: Luay Nakhleh (nakhleh@rice.edu); office hours by appointment (office: DH 3119) TA: Leo Elworth (DH
More informationGene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar
Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification
More informationProtein Structure Prediction. christian studer , EPFL
Protein Structure Prediction christian studer 17.11.2004, EPFL Content Definition of the problem Possible approaches DSSP / PSI-BLAST Generalization Results Definition of the problem Massive amounts of
More informationFACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE
FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE BIOMOLECULES COURSE: COMPUTER PRACTICAL 1 Author of the exercise: Prof. Lloyd Ruddock Edited by Dr. Leila Tajedin 2017-2018 Assistant: Leila Tajedin (leila.tajedin@oulu.fi)
More informationB L A S T! BLAST: Basic local alignment search tool 11/23/2010. Copyright notice. November 29, Outline of today s lecture BLAST. Why use BLAST?
November 29, 2010 BLAST: Basic local alignment search tool B L A S T! Jonathan Pevsner, Ph.D. Bioinformatics pevsner@kennedykrieger.org Johns Hopkins School of Medicine Copyright notice Many of the images
More informationUNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics
UNIVERSITY OF KWAZULU-NATAL EXAMINATIONS: MAIN, 2010 SUBJECT, COURSE AND CODE: GENE 320: Bioinformatics DURATION: 3 HOURS TOTAL MARKS: 125 Internal Examiner: Dr. Ché Pillay External Examiner: Prof. Nicola
More informationProtein Bioinformatics Part I: Access to information
Protein Bioinformatics Part I: Access to information 260.655 April 6, 2006 Jonathan Pevsner, Ph.D. pevsner@kennedykrieger.org Outline [1] Proteins at NCBI RefSeq accession numbers Cn3D to visualize structures
More informationExploring Similarities of Conserved Domains/Motifs
Exploring Similarities of Conserved Domains/Motifs Sotiria Palioura Abstract Traditionally, proteins are represented as amino acid sequences. There are, though, other (potentially more exciting) representations;
More informationCS273: Algorithms for Structure Handout # 5 and Motion in Biology Stanford University Tuesday, 13 April 2004
CS273: Algorithms for Structure Handout # 5 and Motion in Biology Stanford University Tuesday, 13 April 2004 Lecture #5: 13 April 2004 Topics: Sequence motif identification Scribe: Samantha Chui 1 Introduction
More informationChapter 4 DNA Structure & Gene Expression
Biology 12 Name: Cell Biology Per: Date: Chapter 4 DNA Structure & Gene Expression Complete using BC Biology 12, pages 108-153 4.1 DNA Structure pages 112-114 1. DNA stands for and is the genetic material
More informationProkaryotic Annotation Pipeline SOP HGSC, Baylor College of Medicine
1 Abstract A prokaryotic annotation pipeline was developed to automatically annotate draft and complete bacterial genomes. The protein coding genes in the genomes are predicted by the combination of Glimmer
More informationGene Prediction in Eukaryotes
Gene Prediction in Eukaryotes Jan-Jaap Wesselink Biomol Informatics, S.L. jjw@biomol-informatics.com June 2010/Madrid jjw@biomol-informatics.com (BI) Gene Prediction June 2010/Madrid 1 / 34 Outline 1 Gene
More informationMolecular Modeling Lecture 8. Local structure Database search Multiple alignment Automated homology modeling
Molecular Modeling 2018 -- Lecture 8 Local structure Database search Multiple alignment Automated homology modeling An exception to the no-insertions-in-helix rule Actual structures (myosin)! prolines
More informationProblem Set 4. I) I) Briefly describe the two major goals of this paper. (2 pts)
Problem 1: Clustering (33 points) Problem Set 4 Microarray and DNA chip technologies have made it possible to study expression patterns of thousand of genes simultaneously. The amount of data coming out
More informationAna Teresa Freitas 2016/2017
Finding Regulatory Motifs in DNA Sequences Ana Teresa Freitas 2016/2017 Combinatorial Gene Regulation A recent microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed
More informationOptimization of Process Parameters of Global Sequence Alignment Based Dynamic Program - an Approach to Enhance the Sensitivity.
Optimization of Process Parameters of Global Sequence Alignment Based Dynamic Program - an Approach to Enhance the Sensitivity of Alignment Dr.D.Chandrakala 1, Dr.T.Sathish Kumar 2, S.Preethi 3, D.Sowmya
More informationCascaded walks in protein sequence space: Use of artificial sequences in remote homology detection between natural proteins
Supporting text Cascaded walks in protein sequence space: Use of artificial sequences in remote homology detection between natural proteins S. Sandhya, R. Mudgal, C. Jayadev, K.R. Abhinandan, R. Sowdhamini
More information