ALGORITHMS IN BIOINFORMATICS: A PRACTICAL INTRODUCTION. WING-KIN SUNG. Chapman & Hall/CRC Mathematical and Computational Biology Series. CRC Press

Chapman & Hall/CRC Mathematical and Computational Biology Series
ALGORITHMS IN BIOINFORMATICS: A PRACTICAL INTRODUCTION
WING-KIN SUNG
CRC Press, Taylor & Francis Group
Boca Raton, London, New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK

Contents

Preface

1 Introduction to Molecular Biology
  1.1 DNA, RNA, and Protein
    1.1.1 Proteins
    1.1.2 DNA
    1.1.3 RNA
  1.2 Genome, Chromosome, and Gene
    1.2.1 Genome
    1.2.2 Chromosome
    1.2.3 Gene
    1.2.4 Complexity of the Organism versus Genome Size
    1.2.5 Number of Genes versus Genome Size
  1.3 Replication and Mutation of DNA
  1.4 Central Dogma (from DNA to Protein)
    1.4.1 Transcription (Prokaryotes)
    1.4.2 Transcription (Eukaryotes)
    1.4.3 Translation
  1.5 Post-Translation Modification (PTM)
  1.6 Population Genetics
  1.7 Basic Biotechnological Tools
    1.7.1 Restriction Enzymes
    1.7.2 Sonication
    1.7.3 Cloning
    1.7.4 PCR
    1.7.5 Gel Electrophoresis
    1.7.6 Hybridization
    1.7.7 Next Generation DNA Sequencing
  1.8 Brief History of Bioinformatics
  1.9 Exercises

2 Sequence Similarity
  2.1 Introduction
  2.2 Global Alignment Problem
    2.2.1 Needleman-Wunsch Algorithm
    2.2.2 Running Time Issue
    2.2.3 Space Efficiency Issue
    2.2.4 More on Global Alignment
  2.3 Local Alignment
  2.4 Semi-Global Alignment
  2.5 Gap Penalty
    2.5.1 General Gap Penalty Model
    2.5.2 Affine Gap Penalty Model
    2.5.3 Convex Gap Model
  2.6 Scoring Function
    2.6.1 Scoring Function for DNA
    2.6.2 Scoring Function for Protein
  2.7 Exercises

3 Suffix Tree
  3.1 Introduction
  3.2 Suffix Tree
  3.3 Simple Applications of a Suffix Tree
    3.3.1 Exact String Matching Problem
    3.3.2 Longest Repeated Substring Problem
    3.3.3 Longest Common Substring Problem
    3.3.4 Longest Common Prefix (LCP)
    3.3.5 Finding a Palindrome
    3.3.6 Extracting the Embedded Suffix Tree of a String from the Generalized Suffix Tree
    3.3.7 Common Substring of 2 or More Strings
  3.4 Construction of a Suffix Tree
    3.4.1 Step 1: Construct the Odd Suffix Tree
    3.4.2 Step 2: Construct the Even Suffix Tree
    3.4.3 Step 3: Merge the Odd and the Even Suffix Trees
  3.5 Suffix Array
    3.5.1 Construction of a Suffix Array
    3.5.2 Exact String Matching Using a Suffix Array
  3.6 FM-Index
    3.6.1 Definition
    3.6.2 The occ Data Structure
    3.6.3 Exact String Matching Using the FM-Index
  3.7 Approximate Searching Problem
  3.8 Exercises

4 Genome Alignment
  4.1 Introduction
  4.2 Maximum Unique Match (MUM)
    4.2.1 How to Find MUMs
  4.3 MUMmer1: LCS
    4.3.1 Dynamic Programming Algorithm in O(n^2) Time
    4.3.2 An O(n log n)-time Algorithm
  4.4 MUMmer2 and MUMmer3
    4.4.1 Reducing Memory Usage
    4.4.2 Employing a New Alternative Algorithm for Finding MUMs
    4.4.3 Clustering Matches
    4.4.4 Extension of the Definition of MUM
  4.5 Mutation Sensitive Alignment
    4.5.1 Concepts and Definitions
    4.5.2 The Idea of the Heuristic Algorithm
    4.5.3 Experimental Results
  4.6 Dot Plot for Visualizing the Alignment
  4.7 Further Reading
  4.8 Exercises

5 Database Search
  5.1 Introduction
    5.1.1 Biological Database
    5.1.2 Database Searching
    5.1.3 Types of Algorithms
  5.2 Smith-Waterman Algorithm
  5.3 FastA
    5.3.1 FastP Algorithm
    5.3.2 FastA Algorithm
  5.4 BLAST
    5.4.1 BLAST1
    5.4.2 BLAST2
    5.4.3 BLAST1 versus BLAST2
    5.4.4 BLAST versus FastA
    5.4.5 Statistics for Local Alignment
  5.5 Variations of the BLAST Algorithm
    5.5.1 MegaBLAST
    5.5.2 BLAT
    5.5.3 PatternHunter
    5.5.4 PSI-BLAST (Position-Specific Iterated BLAST)
  5.6 Q-gram Alignment based on Suffix ARrays (QUASAR)
    5.6.1 Algorithm
    5.6.2 Speeding Up and Reducing the Space for QUASAR
    5.6.3 Time Analysis
  5.7 Locality-Sensitive Hashing
  5.8 BWT-SW
    5.8.1 Aligning Query Sequence to Suffix Tree
    5.8.2 Meaningful Alignment
  5.9 Are Existing Database Searching Methods Sensitive Enough?
  5.10 Exercises

6 Multiple Sequence Alignment
  6.1 Introduction
  6.2 Formal Definition of the Multiple Sequence Alignment Problem
  6.3 Methods for Solving the MSA Problem
  6.4 Dynamic Programming Method
  6.5 Center Star Method
  6.6 Progressive Alignment Method
    6.6.1 ClustalW
    6.6.2 Profile-Profile Alignment
    6.6.3 Limitation of Progressive Alignment Construction
  6.7 Iterative Method
    6.7.1 MUSCLE
    6.7.2 Log-Expectation (LE) Score
  6.8 Further Reading
  6.9 Exercises

7 Phylogeny Reconstruction
  7.1 Introduction
    7.1.1 Mitochondrial DNA and Inheritance
    7.1.2 The Constant Molecular Clock
    7.1.3 Phylogeny
    7.1.4 Applications of Phylogeny
    7.1.5 Phylogenetic Tree Reconstruction
  7.2 Character-Based Phylogeny Reconstruction Algorithm
    7.2.1 Maximum Parsimony
    7.2.2 Compatibility
    7.2.3 Maximum Likelihood Problem
  7.3 Distance-Based Phylogeny Reconstruction Algorithm
    7.3.1 Additive Metric and Ultrametric
    7.3.2 Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
    7.3.3 Additive Tree Reconstruction
    7.3.4 Nearly Additive Tree Reconstruction
    7.3.5 Can We Apply Distance-Based Methods Given a Character-State Matrix?
  7.4 Bootstrapping
  7.5 Can Tree Reconstruction Methods Infer the Correct Tree?
  7.6 Exercises

8 Phylogeny Comparison
  8.1 Introduction
  8.2 Similarity Measurement
    8.2.1 Computing MAST by Dynamic Programming
    8.2.2 MAST for Unrooted Trees
  8.3 Dissimilarity Measurements
    8.3.1 Robinson-Foulds Distance
    8.3.2 Nearest Neighbor Interchange Distance (NNI)
    8.3.3 Subtree Transfer Distance (STT)
    8.3.4 Quartet Distance
  8.4 Consensus Tree Problem
    8.4.1 Strict Consensus Tree
    8.4.2 Majority Rule Consensus Tree
    8.4.3 Median Consensus Tree
    8.4.4 Greedy Consensus Tree
    8.4.5 R* Tree
  8.5 Further Reading
  8.6 Exercises

9 Genome Rearrangement
  9.1 Introduction
  9.2 Types of Genome Rearrangements
  9.3 Computational Problems
  9.4 Sorting an Unsigned Permutation by Reversals
    9.4.1 Upper and Lower Bound on an Unsigned Reversal Distance
    9.4.2 4-Approximation Algorithm for Sorting an Unsigned Permutation
    9.4.3 2-Approximation Algorithm for Sorting an Unsigned Permutation
  9.5 Sorting a Signed Permutation by Reversals
    9.5.1 Upper Bound on Signed Reversal Distance
    9.5.2 Elementary Intervals, Cycles, and Components
    9.5.3 The Hannenhalli-Pevzner Theorem
  9.6 Further Reading
  9.7 Exercises

10 Motif Finding
  10.1 Introduction
  10.2 Identifying Binding Regions of TFs
  10.3 Motif Model
  10.4 The Motif Finding Problem
  10.5 Scanning for Known Motifs
  10.6 Statistical Approaches
    10.6.1 Gibbs Motif Sampler
    10.6.2 MEME
  10.7 Combinatorial Approaches
    10.7.1 Exhaustive Pattern-Driven Algorithm
    10.7.2 Sample-Driven Approach
    10.7.3 Suffix Tree-Based Algorithm
    10.7.4 Graph-Based Method
  10.8 Scoring Function
  10.9 Motif Ensemble Methods
    10.9.1 Approach of MotifVoter
    10.9.2 Motif Filtering by the Discriminative and Consensus Criteria
    10.9.3 Sites Extraction and Motif Generation
  10.10 Can Motif Finders Discover the Correct Motifs?
  10.11 Motif Finding Utilizing Additional Information
    10.11.1 Regulatory Element Detection Using Correlation with Expression
    10.11.2 Discovery of Regulatory Elements by Phylogenetic Footprinting
  10.12 Exercises

11 RNA Secondary Structure Prediction
  11.1 Introduction
    11.1.1 Base Interactions in RNA
    11.1.2 RNA Structures
  11.2 Obtaining RNA Secondary Structure Experimentally
  11.3 RNA Structure Prediction Based on Sequence Only
  11.4 Structure Prediction with the Assumption That There is No Pseudoknot
  11.5 Nussinov Folding Algorithm
  11.6 ZUKER Algorithm
    11.6.1 Time Analysis
    11.6.2 Speeding up Multi-Loops
    11.6.3 Speeding up Internal Loops
  11.7 Structure Prediction with Pseudoknots
    11.7.1 Definition of a Simple Pseudoknot
    11.7.2 Akutsu's Algorithm for Predicting an RNA Secondary Structure with Simple Pseudoknots
  11.8 Exercises

12 Peptide Sequencing
  12.1 Introduction
  12.2 Obtaining the Mass Spectrum of a Peptide
  12.3 Modeling the Mass Spectrum of a Fragmented Peptide
    12.3.1 Amino Acid Residue Mass
    12.3.2 Fragment Ion Mass
  12.4 De Novo Peptide Sequencing Using Dynamic Programming
    12.4.1 Scoring by Considering y-ions
    12.4.2 Scoring by Considering y-ions and b-ions
  12.5 De Novo Sequencing Using Graph-Based Approach
  12.6 Peptide Sequencing via Database Search
  12.7 Further Reading
  12.8 Exercises

13 Population Genetics
  13.1 Introduction
    13.1.1 Locus, Genotype, Allele, and SNP
    13.1.2 Genotype Frequency and Allele Frequency
    13.1.3 Haplotype and Phenotype
    13.1.4 Technologies for Studying the Human Population
    13.1.5 Bioinformatics Problems
  13.2 Hardy-Weinberg Equilibrium
  13.3 Linkage Disequilibrium
    13.3.1 D and D'
    13.3.2 r^2
  13.4 Genotype Phasing
    13.4.1 Clark's Algorithm
    13.4.2 Perfect Phylogeny Haplotyping Problem
    13.4.3 Maximum Likelihood Approach
    13.4.4 Phase Algorithm
  13.5 Tag SNP Selection
    13.5.1 Zhang et al.'s Algorithm
    13.5.2 IdSelect
  13.6 Association Study
    13.6.1 Categorical Data Analysis
    13.6.2 Relative Risk and Odds Ratio
    13.6.3 Linear Regression
    13.6.4 Logistic Regression
  13.7 Exercises

References

Index