Article: A Teaching Approach From the Exhaustive Search Method to the Needleman-Wunsch Algorithm

Zhongneng Xu*, Yayun Yang, Beibei Huang

From the Department of Ecology, Jinan University, Guangzhou 510632, China; Departament d'Enginyeria Química, Universitat Rovira i Virgili, 26 Av. dels Països Catalans, 43007 Tarragona, Spain; Department of Experimental Therapeutics, The University of Texas M. D. Anderson Cancer Center, Unit 36, Houston, TX 77030, USA

Volume 45, Number 3, May/June 2017, Pages 194-204
*Address for correspondence to: Department of Ecology, Jinan University, Guangzhou 510632, China. E-mail: txuzn@jnu.edu.cn
Received 4 April 2016; Revised 12 August 2016; Accepted 6 September 2016
DOI 10.1002/bmb.21027
Published online 14 October 2016 in Wiley Online Library (wileyonlinelibrary.com)

Abstract

The Needleman-Wunsch algorithm has become one of the core algorithms in bioinformatics; however, it needs explanations better suited to students from different major backgrounds. Using a pair of sample sequences and a simple scoring system, the connection between the exhaustive search method and the Needleman-Wunsch algorithm was analyzed to explain this algorithm more thoroughly. The present study could benefit the teaching and learning of the Needleman-Wunsch algorithm. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):194-204, 2017.

Keywords: Needleman-Wunsch algorithm; exhaustive search; sequence alignment; teaching

Introduction

The Needleman-Wunsch algorithm (NW) has been widely used in global sequence alignment even though running it imposes substantial time and space requirements [1-3]. In addition to its use in global alignment, NW helped in the development of other algorithms, such as the Smith-Waterman algorithm, BLAST, and the CLUSTAL series [4-9]. Understanding this algorithm is therefore a requirement for bioinformatics learners. Since the development of NW, efforts have been made to make it more widely understood and more easily used. Smith and Waterman (1981a; 1981b) [10,11] proved the equation mathematically and added an improved scoring system, thereby providing the equation currently used to calculate the scoring matrix for the alignment between two sequences:

\[
S_{i,j} =
\begin{cases}
0, & \text{when } i = 1 \text{ and } j = 1 \\
S_{i,j-1} + g, & \text{when } i = 1 \text{ and } j > 1 \\
S_{i-1,j} + g, & \text{when } i > 1 \text{ and } j = 1 \\
\max\left(S_{i-1,j-1} + M_{i,j},\; S_{i,j-1} + g,\; S_{i-1,j} + g\right), & \text{when } i > 1 \text{ and } j > 1
\end{cases}
\tag{1}
\]

where S_{i,j} is the score at the position in row i and column j of the matrix, g is the penalty for a gap, and M_{i,j} is the score for aligning the characters at the position in row i and column j of the matrix. Work on improving NW reported the application of affine gap penalties in linear time [12,13]. Setubal and Meidanis (1997) [14] used a simple scoring system (three scores, for a match, a mismatch, and a gap, respectively) and two DNA sequences to explain how NW calculates the values in the scoring matrix.

As core content of the bioinformatics course, which has recently become a basic course in biology, NW is taught to students with different major backgrounds. Even with the available publications on this algorithm, it remains difficult to teach NW to some students or new learners, especially those who lack knowledge of dynamic programming. Developing new methods to teach NW is therefore sometimes necessary. Exhaustive search methods may be easy for students who do not have a strong mathematical background, even though the number of candidate alignments that such methods must examine grows exponentially with sequence length.
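As an illustration of how Eq. (1) is evaluated, the following is a minimal sketch rather than a reference implementation. The function name `nw_score_matrix` and its parameters are hypothetical, the indices are 0-based (so row 0 and column 0 play the role of i = 1 and j = 1 in Eq. (1)), and the default scores are assumed to be the simple system used later in this article (+1 match, -1 mismatch, -2 gap).

```python
def nw_score_matrix(seq_m, seq_n, match=1, mismatch=-1, gap=-2):
    """Fill the score matrix of Eq. (1) with 0-based indices.

    Assumption: columns follow seq_m and rows follow seq_n; row 0 and
    column 0 mean "no characters taken yet" from that sequence.
    """
    rows, cols = len(seq_n) + 1, len(seq_m) + 1
    S = [[0] * cols for _ in range(rows)]

    # First row and first column: only gaps can be added.
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + gap

    # Interior cells: best of diagonal (match/mismatch), left, and up.
    for i in range(1, rows):
        for j in range(1, cols):
            m = match if seq_n[i - 1] == seq_m[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + m,
                          S[i][j - 1] + gap,
                          S[i - 1][j] + gap)
    return S


if __name__ == "__main__":
    for row in nw_score_matrix("ac", "at"):
        print(row)
```

For the sample pair ac and at used in the Methods, the bottom-right cell of the returned matrix is the score of the best global alignment.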
If a teaching scenario could derive dynamic programming from an exhaustive search method, NW could be easily understood. Furthermore, a new perspective on NW could help in raising new questions or in putting forward better alignment algorithms. To facilitate the above, the present study developed a teaching method.

Biochemistry and Molecular Biology Education

Methods

Suppose a Pair of Sample Sequences

For a simple example, suppose a pair of DNA sequences, ac (Sequence M) and at (Sequence N), is to be globally aligned.

Alignment Extension

To explain alignment extension easily, the positions of the alignment are shown in Table 1. When a position is extended, there are three possible situations: (1) Sequence M provides a character and Sequence N provides a gap, (2) Sequence M provides a character and Sequence N provides a character, and (3) Sequence M provides a gap and Sequence N provides a character.

TABLE 1. The positions of the pair alignment
Positions: 1st position, 2nd position, 3rd position, 4th position
Sequence M: a c
Sequence N: a t
Note: this table shows only one of the results of the global alignment of the sample sequences.

Scoring System

A scoring system must be fixed before the alignment is carried out. To explain the method more easily, three simple scores for each position of a two-sequence alignment [15] were selected:

\[
\text{score} =
\begin{cases}
+1, & \text{matching characters} \\
-1, & \text{mismatching characters} \\
-2, & \text{a character aligning a gap}
\end{cases}
\]

Results

In the exhaustive search method, there were 13 complete results for the example (Fig. 1). When at least one character remained in both sequences, the alignment extended in three directions; when characters remained in only one of the sequences, the alignment extended in only one direction; and when no characters remained in either sequence, the extension stopped. There are nine combinations of characters from Sequence M and Sequence N (Fig. 2). Based on the simple scoring system above, each combination of characters has a maximum alignment score, representing the best alignment for that combination.
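The exhaustive search described above can be sketched in a few lines of code. This is only an illustrative sketch under the scoring system of the Methods; the names `enumerate_alignments` and `score` are hypothetical, and "-" is assumed as the gap symbol.

```python
def enumerate_alignments(seq_m, seq_n):
    """Yield every global alignment as a (top, bottom) pair of strings."""
    def extend(i, j, top, bottom):
        # No characters left in either sequence: the alignment is complete.
        if i == len(seq_m) and j == len(seq_n):
            yield top, bottom
            return
        if i < len(seq_m):                      # character from M, gap in N
            yield from extend(i + 1, j, top + seq_m[i], bottom + "-")
        if i < len(seq_m) and j < len(seq_n):   # character from M and from N
            yield from extend(i + 1, j + 1, top + seq_m[i], bottom + seq_n[j])
        if j < len(seq_n):                      # gap in M, character from N
            yield from extend(i, j + 1, top + "-", bottom + seq_n[j])

    yield from extend(0, 0, "", "")


def score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the per-position scores of one complete alignment."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


alignments = list(enumerate_alignments("ac", "at"))
best = max(alignments, key=lambda pair: score(*pair))
print(len(alignments), best, score(*best))  # prints: 13 ('ac', 'at') 0
```

The three `if` branches correspond to the three extension directions; when one sequence is exhausted only one branch remains, which is why the search tree thins out near its leaves.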
As noted above, among the 13 complete alignments of all characters from both sequences, the maximum of the thirteen scores is 0, indicating that the best global alignment of the two sequences is ac vs. at.

Some of the alignments derived from the same combination of characters can be deleted. For example, there were three forms of alignment in the combination of the character a from each sequence, and each of them extended to three derived alignments, thereby producing nine resulting alignments (Fig. 3). In each derived group of three alignments, the characters added in the next position were the same, as shown below:

    Sequence M: c      Sequence M: c      Sequence M: -
    Sequence N: -      Sequence N: t      Sequence N: t

The characters added in all extended positions in each derived group of three alignments were also the same, as follows:

    Sequence M: c-     Sequence M: c      Sequence M: -c
    Sequence N: -t     Sequence N: t      Sequence N: t-

That is, if the combination of the front characters (no matter how many gaps) in the sequences was the same, the characters added to the extended positions of the derived alignments were identical. Thus, comparing the alignments of the front characters allows the deletion of the alignments derived from the lower-scoring alignments of those front characters.

Nine boxes containing alignments, from zero characters aligning zero characters to two characters aligning two characters, were set in a 3 × 3 matrix (Fig. 4). In the matrix, the positions of alignments could extend from the left boxes to the right boxes, from up to down, or from left-up to right-down. If the position of an alignment extended from left to right by one box, one character from Sequence M was added to the top sequence in the extended position and one gap was added to the bottom sequence; if the position of an alignment extended from up to down by one box, one gap was added to the top sequence in the extended position and one character from Sequence N was added to the bottom sequence; if the position of an alignment extended from left-up to right-down by one box, one character from Sequence M and one character from Sequence N were added to the top sequence and the bottom sequence, respectively, in the extended position. With this topological change, all alignments of the example could be filled into these boxes (Fig. 5).

Different alignments in the same combination of characters result in different scores. Only the alignment with the highest score in each box was kept, and the others were omitted. The comparison of the front characters provides sufficient information for deciding which alignments to keep; thus, when the positions of the alignments are extended from left to right, from up to down, and from left-up to right-down, box by box, we can keep only the alignment with the highest score in each box (Fig. 6). This method saves calculation time and storage space compared with the exhaustive search method (Fig. 1 or Fig. 5). When the scores replace the character alignments in each box (Fig. 7), NW emerges.

FIG 1. The global alignment of the two DNA sequences, ac and at, using an exhaustive search method. The last row gives the scores of the complete alignments under a simple scoring system: +1 for a match, -1 for a mismatch, and -2 for a gap in each position of an alignment.

FIG 2. Nine combinations of characters from Sequence M (ac) and Sequence N (at) during the global alignment process. I, Sequence M provides zero characters and Sequence N provides zero characters. II, Sequence M provides one character and Sequence N provides zero characters. III, Sequence M provides two characters and Sequence N provides zero characters. IV, Sequence M provides zero characters and Sequence N provides one character. V, Sequence M provides one character and Sequence N provides one character. VI, Sequence M provides two characters and Sequence N provides one character. VII, Sequence M provides zero characters and Sequence N provides two characters. VIII, Sequence M provides one character and Sequence N provides two characters. IX, Sequence M provides two characters and Sequence N provides two characters.

FIG 3. Comparison of the alignments derived from the same combination of characters (a from Sequence M and a from Sequence N) in an example. [Color figure can be viewed at wileyonlinelibrary.com]

FIG 4. A 3 × 3 matrix that can contain the alignments from zero characters to two characters of Sequence M (ac) and Sequence N (at). The numbers in the boxes represent the numbers of characters provided by the respective sequences.

Discussion

Suggestion for Teaching Practices

When we taught NW in classes, we first provided the students with the example pair of sequences, ac (Sequence M) and at (Sequence N), and three simple scores for each position of a two-sequence alignment: +1 for matching characters, -1 for mismatching characters, and -2 for a character aligning a gap. The students were then required to find by themselves how many results the alignment of the sequences had and which one had the highest score.
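The box-by-box selection described in the Results (Figs. 4 to 6) can also be sketched in code, which may help students check the boxes they fill in by hand. This is a minimal sketch under stated assumptions: the function name `best_alignment_per_box` is hypothetical, each box stores a (score, top, bottom) tuple, "-" is the gap symbol, and ties are broken arbitrarily in favour of the first candidate.

```python
def best_alignment_per_box(seq_m, seq_n, match=1, mismatch=-1, gap=-2):
    """Keep only the highest-scoring alignment in each box of the matrix.

    Assumption: columns count characters taken from seq_m and rows count
    characters taken from seq_n, as in the layout described for Fig. 4.
    """
    cols, rows = len(seq_m) + 1, len(seq_n) + 1
    boxes = [[None] * cols for _ in range(rows)]
    boxes[0][0] = (0, "", "")  # zero characters aligning zero characters

    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                continue
            candidates = []
            if c > 0:  # from the left box: character from M, gap in N
                s, top, bottom = boxes[r][c - 1]
                candidates.append((s + gap, top + seq_m[c - 1], bottom + "-"))
            if r > 0:  # from the up box: gap in M, character from N
                s, top, bottom = boxes[r - 1][c]
                candidates.append((s + gap, top + "-", bottom + seq_n[r - 1]))
            if r > 0 and c > 0:  # from the left-up box: one character each
                s, top, bottom = boxes[r - 1][c - 1]
                pair = match if seq_m[c - 1] == seq_n[r - 1] else mismatch
                candidates.append(
                    (s + pair, top + seq_m[c - 1], bottom + seq_n[r - 1]))
            # Drop the lower-scoring alignments, keep one best per box.
            boxes[r][c] = max(candidates, key=lambda cand: cand[0])
    return boxes


if __name__ == "__main__":
    for row in best_alignment_per_box("ac", "at"):
        print(row)
```

Replacing each stored tuple by its score alone leaves exactly the score matrix of Eq. (1), which is the step the text describes as the point where NW emerges.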
According to the students' responses, we taught them how to find all the results and introduced the exhaustive search tree. The students were also told that, if the sequences were long enough, the number of possible alignments would grow so quickly that neither people nor computers could find the answer within a practical time. Thus, NW was introduced in the way described in the present study. To help the students understand this deduction clearly, we suggest writing all the steps of the deduction on the blackboard or on paper, step by step, and sometimes asking the students to finish some of the procedures. A sample lesson plan is included in Table 2. Based on their practice, the students could understand how to find all the results by the exhaustive search method, and also that this method consumed time and energy. By observing the step-by-step deduction from the exhaustive search method to NW, and sometimes joining in the deductive procedures, they could understand this algorithm, which saves time and is suitable for computer programming, within the scope of their knowledge even if they did not have a sufficient mathematical background. With additional homework on a more complicated example, NW could leave a lasting impression. After the students grasped a complex dynamic programming algorithm deduced from a simpler exhaustive search method, they could easily understand why NW is a dynamic programming algorithm and why the highest score is obtained by using Eq. (1); they could more easily acquire the knowledge of other chapters partly based on NW, such as multiple sequence alignment, phylogenetic analysis, and database searching for similar sequences; and they gained greater confidence in learning bioinformatics. Because they understood the details of this deduction, some students in our classes came up with new ideas for sequence comparison in the course or in their later research projects. The suggested activity therefore fits into the learning and teaching of a bioinformatics course.
Sequence alignment is the basis of a bioinformatics course, and NW is the basis of sequence alignment. Understanding NW helps students grasp other useful algorithms for sequence comparison. When sequence alignment is taught, we suggest that the deduction from the exhaustive search method to NW be included in the teaching series (Fig. 8). In the curriculum of a bioinformatics major, Sequence Alignment should be an independent course after the basic mathematics course, and the deduction in the present study could be taught and discussed in the earlier chapters. If a bioinformatics course is open to students with different major backgrounds, sequence alignment is a separate chapter, and in such limited time NW, taught from the exhaustive search method, may be the only algorithm that needs a detailed deduction.

The Understanding of the Needleman-Wunsch Algorithm From the Viewpoint of Research

In the 1970s, although NW was widely used by the biological community, it had not been mathematically proved, lacked a widely useful scoring matrix, and was sometimes not sensitive enough to find local similarity [11]. Smith and Waterman provided the proof of NW and suggested a suitable scoring system (1981b) for this algorithm. In fact, the form of NW described above is not the original one published in 1970. The improved NW, to which Waterman and other scientists contributed, is easier to express and deduce. NW was the first to provide an iterative matrix method for finding sequence homology, but it focuses on the similarity of whole sequences and on alignment over the whole length, so the more meaningful local homology, such as a gene with a short sequence inside a longer DNA sequence, may be neglected. The Smith-Waterman algorithm, a revision of NW, and other algorithms for local alignment can be more agile in finding homologous segments in long sequences [10].

FIG 5. Alignments of the two DNA sequences, ac and at, as filled into the 3 × 3 matrix of boxes shown in Fig. 4.

FIG 6. The selection of the optimal alignment with the highest score in each box of the 3 × 3 matrix of the alignment of the two DNA sequences, ac and at, from Figs. 4 and 5. The alignments of each box were deduced from three directions: an additional character in the top sequence provided by Sequence M and an additional gap in the bottom sequence were added to the alignment from the left box; an additional gap in the top sequence and an additional character in the bottom sequence provided by Sequence N were added to the alignment from the up box; and an additional character in the top sequence provided by Sequence M and an additional character in the bottom sequence provided by Sequence N were added to the alignment from the left-up box. The alignments in the boxes in the first row of the matrix were deduced only from the boxes to their left, and those in the first column were deduced only from the boxes above them. The optimal alignment with the maximum score under a simple scoring system, e.g., +1 for a match, -1 for a mismatch, and -2 for a gap in each position of an alignment, was selected.
As an early homology algorithm for sequence comparison and one of the basic algorithms in bioinformatics, NW is used for global alignment but is not suitable as a search program; its improved variants, such as the Smith-Waterman algorithm, are better at searching for meaningful sequences, as in the BLAST programs. Previous works have attempted to explain how to calculate the score matrix of NW to make it more useful [1,10,16]. Although the computation is easily described [Eq. (1)], the previous deductions of NW, e.g., the inspiration partly drawn from the dot matrix or the later mathematical proofs, have not been presented clearly for new learners. Not thoroughly understanding the foundational algorithm may affect users' understanding of the significance of the results obtained from programs involving NW and may inhibit the development of further dynamic algorithms for obtaining optimal alignment results, especially in multiple alignment. Rather than listing the alignment scores, the present study extended the alignment positions one by one, starting from an exhaustive search method, to obtain the optimal alignment. The exhaustive search method for comparing two sequences is time-consuming but more easily understood by new users and learners than the previous algorithms. Thus, the gap between the basic concept and the algorithm of sequence alignment is bridged in the present study.

Benefiting from an easy understanding of NW and the information from the above deduction, some teaching techniques for related algorithms could be developed. The number of global pair alignments was previously estimated [17,18], but calculating it still poses a mathematical question for some researchers. For Sequence I (where i is the number of characters in this sequence) and Sequence J (where j is the number of characters in this sequence), the number of all possible results for global alignment (R_{i,j}) ranges from 3^{min(i,j)} to 3^{i+j}. R_{i,j} can be calculated in detail; however, some methods are complicated, as in the following:

\[
R_{i,j} =
\begin{cases}
1, & \text{when } j = 0 \\
2i + 1, & \text{when } j = 1 \\
2i^2 + 2i + 1, & \text{when } j = 2 \\
\dfrac{4i^3 + 6i^2 + 8i + 3}{3}, & \text{when } j = 3
\end{cases}
\tag{2}
\]

Based on the deduction in this study, the values of R_{i,j} can be listed in a matrix as they are calculated (Figs. 4 and 5), as follows:

\[
\begin{array}{llllll}
R_{0,0}=1 & R_{1,0}=1 & R_{2,0}=1 & R_{3,0}=1 & R_{4,0}=1 & R_{5,0}=1 \\
R_{0,1}=1 & R_{1,1}=3 & R_{2,1}=5 & R_{3,1}=7 & R_{4,1}=9 & R_{5,1}=11 \\
R_{0,2}=1 & R_{1,2}=5 & R_{2,2}=13 & R_{3,2}=25 & R_{4,2}=41 & R_{5,2}=61 \\
R_{0,3}=1 & R_{1,3}=7 & R_{2,3}=25 & R_{3,3}=63 & R_{4,3}=129 & R_{5,3}=231 \\
R_{0,4}=1 & R_{1,4}=9 & R_{2,4}=41 & R_{3,4}=129 & R_{4,4}=321 & R_{5,4}=681 \\
R_{0,5}=1 & R_{1,5}=11 & R_{2,5}=61 & R_{3,5}=231 & R_{4,5}=681 & R_{5,5}=1683
\end{array}
\]

From the above matrix, an iterative formula to calculate R_{i,j} was obtained:

\[
R_{i,j} = R_{i-1,j-1} + R_{i-1,j} + R_{i,j-1}
\tag{3}
\]

With Eq. (3), the total number of alignments of two sequences of any length is easily calculated.

This research helps to reconsider time-saving methods.

FIG 7. The scores replaced the character alignments in the boxes of Fig. 6. The scores of the alignments were calculated with a simple scoring system: +1 for a match, -1 for a mismatch, and -2 for a gap in each position of an alignment. [Color figure can be viewed at wileyonlinelibrary.com]

TABLE 2. A sample lesson plan for teaching the Needleman-Wunsch algorithm
Items: Description
Course: An introduction to bioinformatics
Chapter: Alignment of pairs of sequences
Lesson title: From the exhaustive search method to the Needleman-Wunsch algorithm
Lesson objectives: Let the students who may not have a strong mathematical background easily understand the Needleman-Wunsch algorithm by using the deduction beginning with the exhaustive search method.
Summary of actions: The students are first given the example of two sequences and asked to find the alignment result with the highest score. The teacher then introduces the exhaustive search tree to find all the results, from which the one with the highest score is found. To save time, the Needleman-Wunsch algorithm is introduced in the way described in the present study.
Materials/Equipment: Markers and a whiteboard, or chalk and a blackboard
References: [1] The present study. [2] Mount, D. W. (2004) Bioinformatics: Sequence and Genome Analysis, 2nd ed., Cold Spring Harbor Laboratory Press, New York. [3] Setubal, J. C., and Meidanis, J. (1997) Introduction to Computational Molecular Biology, PWS Publishing Company, Boston.
Homework: 1. Use the Needleman-Wunsch algorithm to align a pair of amino acid sequences, MTP and MSRDETHTP, with three simple scores: +1 for matching characters, -1 for mismatching characters, and -2 for a character aligning a gap. 2. Align a pair of amino acid sequences, MTP and MSRDETHTP, with the BLOSUM62 scoring system, -10 for a gap opening penalty, and -0.5 for a gap extension penalty.

FIG 8. The suggestion for teaching sequence alignment in a bioinformatics course.
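Returning to the count of alignments, the recurrence in Eq. (3) can be checked numerically against the matrix of R values and against the closed forms of Eq. (2). The sketch below is illustrative only; the function name `count_alignments` is hypothetical, and the boundary values R_{i,0} = R_{0,j} = 1 are taken from the first row and column of the matrix listed in the text.

```python
def count_alignments(i, j):
    """Number of global alignments of sequences with i and j characters.

    Uses R[i][j] = R[i-1][j-1] + R[i-1][j] + R[i][j-1] (Eq. (3)) with
    boundary values of 1 when either sequence provides no characters.
    """
    # table[row][col] holds R_{col,row}: col characters from Sequence I,
    # row characters from Sequence J.
    table = [[1] * (i + 1) for _ in range(j + 1)]
    for row in range(1, j + 1):
        for col in range(1, i + 1):
            table[row][col] = (table[row - 1][col - 1]
                               + table[row - 1][col]
                               + table[row][col - 1])
    return table[j][i]


if __name__ == "__main__":
    # The sample sequences ac and at have two characters each.
    print(count_alignments(2, 2))
    # Compare the recurrence with the closed form of Eq. (2) for j = 3.
    for n in range(6):
        closed_form = (4 * n**3 + 6 * n**2 + 8 * n + 3) // 3
        print(n, count_alignments(n, 3), closed_form)
```

Because each value depends only on its three already-computed neighbours, the count is obtained with the same left-to-right, top-to-bottom sweep that fills the boxes of Fig. 4.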
Because the trace-back is performed after the computation of the score matrix is complete [1,2,10,14], every score needs to be stored at the same time. Moreover, memory-efficient algorithms improved from NW have been created [4,12,13,19]. In the present study, when the sub-alignment matrix is used and the alignments in the boxes are computed from left to right and from up to down, computer space is required only for the sub-alignments in the row or column of boxes currently being extended, and the memory used for earlier rows or columns of boxes is no longer required. However, the larger space needed to store the characters is a trade-off in some situations. The understanding of NW presented herein represents a new approach that might lead to new applications, methods, or algorithms. These warrant further study.
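The memory observation above can be illustrated with a sketch that keeps only two rows of boxes at a time. This is an assumption-laden illustration, not the authors' implementation: the function name `best_alignment_two_rows` is hypothetical, each box stores a (score, top, bottom) tuple so that no trace-back is needed, and ties are broken in favour of the first candidate.

```python
def best_alignment_two_rows(seq_m, seq_n, match=1, mismatch=-1, gap=-2):
    """Return (score, top, bottom) of a best global alignment.

    Only the previous and the current row of boxes are held in memory;
    the price is that every box carries its alignment strings.
    """
    # First row of boxes: Sequence N has provided no characters yet.
    prev = [(0, "", "")]
    for c in range(1, len(seq_m) + 1):
        s, top, bottom = prev[c - 1]
        prev.append((s + gap, top + seq_m[c - 1], bottom + "-"))

    for r in range(1, len(seq_n) + 1):
        # First box of the row can only be reached from the box above.
        s, top, bottom = prev[0]
        curr = [(s + gap, top + "-", bottom + seq_n[r - 1])]
        for c in range(1, len(seq_m) + 1):
            pair = match if seq_m[c - 1] == seq_n[r - 1] else mismatch
            ds, dt, db = prev[c - 1]   # left-up box
            ls, lt, lb = curr[c - 1]   # left box
            us, ut, ub = prev[c]       # up box
            curr.append(max(
                (ds + pair, dt + seq_m[c - 1], db + seq_n[r - 1]),
                (ls + gap, lt + seq_m[c - 1], lb + "-"),
                (us + gap, ut + "-", ub + seq_n[r - 1]),
                key=lambda cand: cand[0],
            ))
        prev = curr  # the older row is no longer needed
    return prev[-1]


if __name__ == "__main__":
    print(best_alignment_two_rows("ac", "at"))
```

The design trades a full score matrix plus trace-back for strings stored in each live box, which is the trade-off in character storage mentioned in the text.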

Conclusion

An exhaustive search for pair alignment, whose cost grows exponentially with sequence length, is herein more easily understood. NW as presented herein requires limited time and space to be run by computers, and a deduction of NW from its exhaustive origin was reported to enrich the teaching methods for NW. The topological transformation (Fig. 5) is a bridge connecting the exhaustive search (Fig. 1) with NW. The success of the deduction from an exhaustive method to NW in the present study facilitates an understanding of the foundational dynamic algorithm of global pair alignment and encourages new thinking in the exploration and application of alignment algorithms.

References

[1] Needleman, S. B., and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
[2] Mount, D. W. (2004) Bioinformatics: Sequence and Genome Analysis, 2nd ed., Cold Spring Harbor Laboratory Press, New York.
[3] Chakraborty, A., and Bandyopadhyay, S. (2013) FOGSAA: Fast optimal global sequence alignment algorithm. Sci. Rep. 3, 1746.
[4] Chao, K.-M., Hardison, R. C., and Miller, W. (1994) Recent developments in linear-space alignment methods: A survey. J. Comp. Biol. 1, 271-291.
[5] Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.
[6] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
[7] Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G., and Thompson, J. D. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31, 3497-3500.
[8] Huang, X., and Chao, K. (2003) A generalized global alignment algorithm. Bioinformatics 19, 228-233.
[9] Huang, W., Umbach, D. M., and Li, L. (2006) Accurate anchoring alignment of divergent sequences. Bioinformatics 22, 29-34.
[10] Smith, T. F., and Waterman, M. S. (1981a) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
[11] Smith, T. F., and Waterman, M. S. (1981b) Comparison of biosequences. Adv. Appl. Math. 2, 482-489.
[12] Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708.
[13] Gotoh, O. (1990) Optimal sequence alignment allowing for long gaps. Bull. Math. Biol. 52, 359-373.
[14] Setubal, J. C., and Meidanis, J. (1997) Introduction to Computational Molecular Biology, PWS Publishing Company, Boston.
[15] Xu, Z. N., Ed. (2008) Bioinformatics, Tsinghua University Press, Beijing.
[16] Waterman, M. S., Smith, T. F., and Beyer, W. A. (1976) Some biological sequence metrics. Adv. Math. 20, 367-387.
[17] Waterman, M. S. (1994) Parametric and ensemble sequence alignment algorithms. Bull. Math. Biol. 56, 743-767.
[18] Waterman, M. S., Eggert, M., and Lander, E. (1992) Parametric sequence comparisons. Proc. Natl. Acad. Sci. U. S. A. 89, 6090-6093.
[19] Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 341-343.