CHAPTER 4 PATTERN CLASSIFICATION, SEARCHING AND SEQUENCE ALIGNMENT


4.1 INTRODUCTION

This chapter discusses three major tasks: pattern classification in a given DNA sample, query pattern searching in a target database, and global alignment of given protein sequences. Each task is carried out with both the existing and the proposed systems, and the performance of each is evaluated.

4.2 PATTERN CLASSIFICATION FOR GENERATING UNIQUE IDENTIFICATION NUMBER

The generation of a unique identification number from a given human DNA sample can be automated using the proposed system, which classifies the sample into valid and invalid sequences and generates an identification number for each. The identification numbers of the valid sequences are checked for repetition, and from this the unique identification number of the individual is found. The various representations of the nucleotides present in DNA and RNA are shown in Table 4.1. The set of valid sequences (VS) identified by the proposed system for the given DNA sample with Base pair = 32 and Sequence = 25 is shown in Figure 4.1.

Table 4.1 Representation of nucleotides present in DNA and RNA

S.No.  Nucleotide  Presence  Character representation  Fuzzy and color equivalent
1      Adenine     DNA/RNA   A
2      Thymine     DNA       T
3      Guanine     DNA/RNA   G
4      Cytosine    DNA/RNA   C
5      Uracil      RNA       U                         0.5

DNA SAMPLE: HUMAN-1 [BASE PAIR = 32, SEQUENCE = 25]

AATGTGTTGTGTGACCCCTCAAAATCTCTCAAATGTGTTTTTACAC
TCCGTTGGTAATATGGAATGTGTTAAAGTTGCTACCCGGGGTTTT
TTAATGTGTCTCT

Figure 4.1 Identification of valid sequences from the sample

VS[1] = , VS[2] = , VS[3] = ,
VS[4] = , VS[5] = , VS[6] = ,
VS[7] = , VS[8] = , VS[9] = ,
VS[10] = , VS[11] = , VS[12] = ,
VS[13] =

The Unique Identification Number is:

PATTERN SEARCHING TECHNIQUES

Pattern Searching in DNA Sequence using Hash Coding Technique

Hash coding is particularly suited to fast, ungapped searches for a small sequence, sequence pattern or motif in a large database. The word "hash" derives from the hash table, a look-up table commonly used in computer science and here constructed from a database. Such a table is an abstract, or index, of the information present in the database and is often used for fast searches. A hash table is also called an associative array, in which a specific name or key is associated with each piece of data or value stored in it.

As applied to sequence searching, hashing is the process of breaking the sequence into small words, or k-tuples, of a specified size and creating a hash table in which those words are keyed to position numbers. The values associated with each key are stored in buckets. The number of buckets equals the number of keys, which depends on the k-tuple size. The keys of the query are looked up in the table built from the database, and for each key that matches, the offset is saved. In general, hashing reduces the complexity of the search problem to the order of the length of all the sequences in the database.

Pattern searching for the sample shown in Figure 4.2 is done using the hash coding technique. For key size k = 1 there are 4 buckets and for key size k = 2 there are 16 buckets; the values stored are the positions of each key in the target sequence. A similar hash table is constructed for the query sequence with k = 1 and k = 2, giving the positions of each key in the query sequence. The offset value, which yields the number of matches, is obtained by subtracting the positions in the query sequence from the positions in the target sequence. Finally, the performance of the algorithm is analysed.

Figure 4.2 Sample of pattern searching

Analysis of Hash Coding Algorithm in Pattern Searching

In the hash coding method, if the k-tuple is large, the speed is high, the specificity (the ability to pick up accurate and meaningful matches) is high, and the sensitivity (the ability to find approximate or distant matches) is low. Conversely, if the k-tuple is small, the speed is low and the specificity is low, but the sensitivity is high. Hash coding is used in widely used sequence matching and search programs such as BLAST, available from the National Center for Biotechnology Information (NCBI), and FASTA. In this case, since the comparison is executed in each repetition of the loop, the comparison operation is taken as the algorithm's base operation.

Target Database

Let c1 and c2 be the counts of base operations performed in the hash table for k = 1 (keys A, T, G and C) and for k = 2 (keys AA, AT, AG, AC, ..., CC) respectively. For a target input of size 43:

c1(43) = 172
c2(43) = 172

Count of base operations for the target database:

c_t(43) = c1(43) + c2(43) = 344

Query Pattern

Let c3 and c4 be the counts of base operations performed in the hash table for k = 1 (keys A, T, G and C) and for k = 2 (keys AA, AT, AG, AC, ..., CC) respectively. For a query pattern of size 7:

c3(7) = 28
c4(7) = 28

Count of base operations for the query pattern:

c_q(7) = c3(7) + c4(7) = 56

The total count of base operations in the hash coding algorithm for an input of size 50, C_hash(50), is given by

C_hash(50) = c_t(43) + c_q(7) = 344 + 56 = 400

The run time of the hash coding algorithm for an input of size 50, T_hash(50), where one base operation takes 1/100th of a second, is given by

T_hash(50) = C_hash(50)/100 = 400/100 = 4 seconds

Pattern Searching in DNA Sequence using NFPR Technique

The objective of the NFPR algorithm is to generate the unique identification number of a given human DNA sample and to check whether a query pattern is present in the given target database.

Step 1 Normalize the training inputs to the Neural-Fuzzy processor using the conditions given in Table 3.1.

Step 2 Generate the Weight for Inference and the Category for Inference using the Ignition Function (IGF) and the Tracking Function (TRF) as in Table 3.3.

Step 3 Infer the category of the nucleotide pairs in the target database using the Category Inference Function (CIF) as in Table 3.7.

Step 4 Generate the unique identification number for every valid sequence in the target database using Equation (3.8).

Step 5 Generate the unique identification number for the query pattern using Equation (3.8).

Step 6 Compare the identification number of the query pattern with all the identification numbers of the target database. If there is a match, the presence of the pattern in the database is confirmed; if there is no match, its absence is confirmed.

The separator output of the NFPR processor for the target database is shown in Figure 4.3.

Figure 4.3 Separator output of NFPR processor

Various valid sequences of the target database:

Valid sequence [VS1] =
Valid sequence [VS2] =
Valid sequence [VS3] =
Valid sequence [VS4] =
Valid sequence [VS5] =
Valid sequence [VS6] =

Output for Query Pattern

Query pattern =
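The hash coding lookup described earlier in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in the chapter: the function names `build_hash_table` and `offset_counts` and the two short sequences are hypothetical, positions are 0-based, and the sample of Figure 4.2 is not reproduced. Each k-tuple is keyed to the positions where it starts, and an offset (target position minus query position) that recurs for many k-tuples points to an ungapped match.

```python
from collections import Counter, defaultdict


def build_hash_table(seq, k):
    """Map every k-tuple in seq to the list of 0-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def offset_counts(target, query, k):
    """Count how often each offset (target position - query position) occurs."""
    target_table = build_hash_table(target, k)
    query_table = build_hash_table(query, k)
    offsets = Counter()
    for key, query_positions in query_table.items():
        # .get avoids creating empty buckets for keys absent from the target
        for t in target_table.get(key, []):
            for q in query_positions:
                offsets[t - q] += 1
    return offsets


# Hypothetical sequences for illustration (assumption, not the thesis sample).
target = "ACGTTGCA"
query = "GTTG"
offsets = offset_counts(target, query, k=2)
best_offset, hits = offsets.most_common(1)[0]
print(best_offset, hits)  # offset shared by the most k-tuples, and how many share it
```

With k = 1 a DNA sequence can fill at most 4 buckets and with k = 2 at most 16, which matches the bucket counts quoted in the text.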

Classification Performed in NFPR System

The proposed NFPR system classifies the pairs of nucleotides in the given human DNA sample as valid (AA, AT, AG, AC, TA, TG, TC, GA, GT, GC, CA, CT, CG) or invalid (TT, GG, CC). If a pair is identified as valid, the pair together with the next five consecutive nucleotide bases is taken as one complete sequence. If a pair is identified as invalid, the system takes the second nucleotide base of the pair together with the next nucleotide base in the sample as a new pair and checks whether it is valid or invalid. If that pair is also invalid, the system again takes the second base of the current pair with the next base in the sample, and so on.

For example, consider the following DNA sample:

AATGTGTTGTGTGACCCCTCAAAATCTCTCAAATGTGTT
TTTACACTCCGTTGGTAATATGGAATGTGTTAAAGTTGCTACCCG
GGGTTTTTTAATGTGTCTCT

The system separates it as follows:

AATGTGT TGTGTGA C C C CTCAAAA TCTCTCA AATGTGT T T T
TACACTC CGTTGGT AATATGG AATGTGT TAAAGTT GCTACCC G G G
GTTTTTT AATGTGT CTCTXXX

Each seven-base group, such as AATGTGT, begins with a valid sequence pair (here AA) and forms one complete valid sequence. The single bases C C C, T T T and G G G come from invalid sequence pairs (CC, TT and GG). The last group, CTCT, is filled out with X to seven characters.
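The pair-by-pair classification above can be sketched as a single scan over the sample. This is a minimal sketch, not the Neural-Fuzzy processor itself: the function name `segment`, the `length` and `pad` parameters, and the treatment of a lone trailing base as the start of a padded valid sequence are assumptions made for illustration. The invalid pairs and the sample are taken from the text.

```python
INVALID_PAIRS = {"TT", "GG", "CC"}  # pairs listed as invalid in the text

SAMPLE = (
    "AATGTGTTGTGTGACCCCTCAAAATCTCTCAAATGTGTTTTTACAC"
    "TCCGTTGGTAATATGGAATGTGTTAAAGTTGCTACCCGGGGTTTT"
    "TTAATGTGTCTCT"
)


def segment(sample, length=7, pad="X"):
    """Split a DNA sample into valid sequences and single bases from invalid pairs."""
    valid, invalid = [], []
    i = 0
    while i < len(sample):
        if sample[i:i + 2] in INVALID_PAIRS:
            # Invalid pair: set its first base aside and retry from the second base.
            invalid.append(sample[i])
            i += 1
        else:
            # Valid pair: the pair plus the next five bases form one sequence,
            # padded with X if the sample ends early (assumption for short tails).
            valid.append(sample[i:i + length].ljust(length, pad))
            i += length
    return valid, invalid


valid, invalid = segment(SAMPLE)
print(valid)
print(invalid)
```

On the sample above this scan yields the same fourteen seven-base groups and the same C C C, T T T, G G G runs as the worked example.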

Analysis of Proposed NFPR Technique in Pattern Searching

Let c1 be the count of base operations performed in training and inference, and c2 the count of base operations performed in comparing identification numbers in the proposed NFPR system. For a target input of size 50:

c1(50) = 109
c2(50) = 6

The total count of base operations in the NFPR algorithm for an input of size 50, C_nfpr(50), is given by

C_nfpr(50) = c1(50) + c2(50) = 109 + 6 = 115

The run time of the NFPR algorithm for an input of size 50, T_nfpr(50), where one base operation takes 1/100th of a second, is given by

T_nfpr(50) = C_nfpr(50)/100 = 115/100 = 1.15 seconds

The base operations of the NFPR and hash coding algorithms are shown in Figure 4.4.

Figure 4.4 Base operation of NFPR and Hash coding algorithm

SEQUENCE ALIGNMENT TECHNIQUES

Sequencing determines the order of nucleotides in a DNA or RNA molecule or the order of amino acids in a protein; sequence alignment arranges such sequences so that their similar regions can be compared. Alignments are the basis of sequence analysis methods and are used to pinpoint the occurrence of conserved motifs. A motif is a consecutive string of amino acids in a protein sequence whose general character is repeated, or conserved, at a particular position in all sequences of a multiple alignment. Sequence similarity measures can be classified as either global or local.

There are two mathematical aspects to sequence alignment. The first is the algorithm used to find sequence similarities and the second is the method used to determine which similarities are interesting and important. Similarities between sequences can be studied using the dot plot method and dynamic programming algorithms such as the Needleman-Wunsch algorithm (global alignment) and the Smith-Waterman algorithm (local alignment), which the FASTA and BLAST programs approximate.

The objective is to determine the optimum match of two protein sequences. Matching procedures such as the Needleman-Wunsch algorithm usually involve scoring schemes that impose a gap penalty every time a skip is made in one or the other sequence in order to improve the degree of matching. The matching itself may score only identities, or it may use a weighted scale that gives partial credit for matched amino acids that are structurally similar, genetically similar or evolutionarily favored. All these factors are taken into account, with matched identities or similar residues counting positively and gaps counting negatively, and an alignment score is computed.

Dynamic Programming using Needleman-Wunsch Algorithm

Dynamic programming provides a reliable computational method for aligning two protein or nucleic acid sequences, and it is central to sequence analysis because it produces the alignment on which other analyses are based. The method compares every pair of characters in the two sequences and generates an alignment. An alignment is generated by starting at the ends of the two sequences, attempting to match all possible pairs of characters between the sequences, and following a scoring scheme for matches, mismatches and gaps. The resulting alignment includes matched characters, mismatched characters and gaps, positioned so that the number of matches between identical or related characters is the maximum possible.

The dynamic programming method usually used for global alignment of sequences is the Needleman-Wunsch algorithm, which maximizes the number of matches along the entire length of the sequences. For protein sequences, the simplest system of comparison is one based on identity: a match in an alignment is scored only if the two aligned amino acids are identical. The procedure generates a matrix of numbers that represents all possible alignments between the sequences, and the highest set of sequential scores in the matrix defines an optimal alignment.

The dynamic programming method is guaranteed, in a mathematical sense, to provide the optimal alignment for a given set of user-defined variables, including the choice of scoring matrix and gap penalties. Gaps may also be present at the ends of sequences when extra sequence is left over after the alignment. These end gaps are often, but not always, given a gap penalty.
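The matrix-filling and trace-back steps described above can be sketched as follows. This is a minimal sketch with an identity-based scoring scheme chosen as an assumption (match = 1, mismatch = -1, linear gap = -1, end gaps penalized); it is not the scoring matrix or gap model used by the NCBI tool, so its scores are not expected to agree with the NW score reported in this chapter. The function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

Changing `match`, `mismatch` and `gap` changes which alignment is optimal, which is the sense in which the optimum is guaranteed only for a given set of user-defined variables.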

Global Alignment of Protein Sequences using National Center for Biotechnology Information (NCBI) Tool

The process of performing global alignment of two protein sequences using the NCBI-BLAST tool, whose global alignment is implemented with the Needleman-Wunsch technique, is discussed below. The character and color representations of the amino acids present in the protein sequences are given in Table 4.2.

Table 4.2 Representation of amino acids present in protein sequences

S.No.  Amino acid     Type           Character and color representation  Fuzzy equivalent
1      alanine        Hydrophobic    A
2      cysteine       Hydrophilic    C
3      aspartic acid  charged (-ve)  D
4      glycine        Hydrophobic    G
5      lysine         charged (+ve)  K
6      leucine        Hydrophobic    L
7      methionine     Hydrophobic    M
8      asparagine     Hydrophilic    N
9      proline        Hydrophobic    P
10     glutamine      Hydrophilic    Q
11     arginine       charged (+ve)  R
12     tyrosine       Hydrophilic    Y
13     Space                         S                                   0.50

The subject sequence and the query sequence that are to be globally aligned using the NCBI-BLAST tool are given in Figure 4.5.

Figure 4.5 Sample of protein sequences for aligning

The output generated by NCBI-BLAST for the above sample of protein sequences is given in Table 4.3, and the aligned sequences generated by NCBI-BLAST are shown in Figure 4.6.

Table 4.3 Generated output from NCBI-BLAST

Query ID       : lcl
Description    : None
Molecule type  : amino acid
Query Length   : 12

Subject ID     : lcl (unnamed protein product)
Description    : None
Molecule type  : amino acid
Length         : 13

NW Score       : 23
Identities     : 6/13 (46%)
Gaps           : 1/13 (8%)
Mismatches     : 6

Query  1  AGC-GNRCKCRYP  12
          A C G +C CR
Sbjct  1  ADCNGRQCLCRPM  13
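The identity, mismatch and gap figures in Table 4.3 can be recounted directly from the two aligned rows. The sketch below is a minimal illustration; the function name `alignment_stats` is hypothetical, and a column is counted as a gap whenever either row holds the gap character.

```python
def alignment_stats(query_row, subject_row, gap_char="-"):
    """Return (matches, mismatches, gaps, columns) for two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == gap_char or s == gap_char:
            gaps += 1
        elif q == s:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps, len(query_row)


# Aligned rows copied from Table 4.3.
matches, mismatches, gaps, columns = alignment_stats("AGC-GNRCKCRYP", "ADCNGRQCLCRPM")
print(f"Identities {matches}/{columns} ({round(100 * matches / columns)}%)")
print(f"Gaps {gaps}/{columns} ({round(100 * gaps / columns)}%)")
print(f"Mismatches {mismatches}")
```

The counts agree with the table: 6 identities, 6 mismatches and 1 gap over 13 columns.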

Figure 4.6 Aligned sequences generated by NCBI-BLAST

Number of Matches / Identities = 6/13 (46%)
Number of Mismatches = 6
Number of Gaps = 1
Alignment Efficiency = 46%

Analysis of Needleman-Wunsch Technique in Sequence Alignment

Let c1 be the count of base operations performed in building the matrix and c2 the count of base operations performed in the trace-back operation of the Needleman-Wunsch algorithm. For a sequence input of size 25:

c1(25) = 132
c2(25) = 156

The total count of base operations in the Needleman-Wunsch algorithm for an input of size 25, C_needl(25), is given by

C_needl(25) = c1(25) + c2(25) = 132 + 156 = 288

The run time of the Needleman-Wunsch algorithm for an input of size 25, T_needl(25), where one base operation takes 1/100th of a second, is given by

T_needl(25) = C_needl(25)/100 = 288/100 = 2.88 seconds

Classification Performed in NFSA Algorithm

The proposed NFSA system classifies the pairs of amino acids in the given normalized sample into the match category (AA, CC, DD, GG, KK, LL, MM, NN, PP, QQ, RR) or the no-match category (AC, CD, ...). If an amino acid pair is identified as a match, the location of the first amino acid in the pair is stored, and the two consecutive amino acids after the second one in the pair are taken as the new pair for classification. If an amino acid pair is identified as a no-match, the system takes the second amino acid of the pair together with the next amino acid in the sample as a new pair and checks whether it is a match or a no-match. If that pair is also a no-match, the system again takes the second amino acid of the current pair with the next amino acid in the sample, and so on.

For example, consider the following normalized sample:

AADGCCNGGNRRQCCKLCCRRYPPM

The amino acid pairs AA, CC, GG, RR, ... are matches, and their locations 1, 5, 8, 11, ... are stored. The remaining pairs, such as DG and GC, are no-match pairs.
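The scan described above can be sketched as follows. This is a minimal sketch, not the Neural-Fuzzy classifier: it treats a pair as a match simply when its two residues are identical, and the function names `interleave` and `match_locations` are hypothetical. The normalized sample happens to equal the subject and query sequences of Table 4.3 interleaved residue by residue; using `interleave` to build it is an assumption about how such a sample could be formed.

```python
from itertools import zip_longest


def interleave(first, second):
    """Interleave two sequences residue by residue (assumed normalization)."""
    return "".join(x + y for x, y in zip_longest(first, second, fillvalue=""))


def match_locations(sample):
    """Return the 1-based locations of the first residue of every matching pair."""
    locations = []
    i = 0
    while i + 1 < len(sample):
        if sample[i] == sample[i + 1]:
            locations.append(i + 1)  # store the location of the first residue
            i += 2                   # continue with the pair after this one
        else:
            i += 1                   # retry from the second residue of the pair
    return locations


sample = interleave("ADCNGRQCLCRPM", "AGCGNRCKCRYP")
print(sample)
print(match_locations(sample))
```

For the sample in the text the scan stores the locations 1, 5, 8, 11, 14, 18, 20 and 23.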

Figure 4.7 Screen shot of existing NCBI tool used in sequence alignment

Global Alignment of Protein Sequences using Proposed NFSA Technique

The objective of the proposed NFSA algorithm is to perform global alignment of protein sequences in order to identify the number of matches, gaps and mismatches. The Neural-Fuzzy processor is first trained with the fuzzy equivalents of all the amino acids present in the protein sequences, so that each pair can be placed in either the match or the no-match category. The normalized form of sequence 1 and sequence 2, which are to be aligned, is generated as shown in Figure 4.8.

Figure 4.8 Normalized sequences

The processes performed in the associator are:

a) Using the weights for inference and the categories for inference generated by the Neural-Fuzzy processor, the associator identifies the locations of the matching sequences, as marked in Figure 4.9. The set {1, 5, 8, 11, 14, 18, 20, 23} indicates that there are 8 matches between the sequences.

Figure 4.9 Location of match sequences

b) Generate a set of pairs for the matched locations identified above, {(1, 2) (5, 6) (8, 9) (11, 12) (14, 15) (18, 19) (20, 21) (23, 24)}, that is, {(1, 1+1) (5, 5+1) ... (23, 23+1)}.

c) Create a single set from the above pairs:
A = {1, 2, 5, 6, 8, 9, 11, 12, 14, 15, 18, 19, 20, 21, 23, 24}

d) Identify the elements missing from set A and create a set B:
B = {3, 4, 7, 10, 13, 16, 17, 22, 25}

e) Check the first element of every pair in set A for even or odd. If it is even, interchange it with the element that follows it:
A = {(1, 2), (5, 6), (9, 8), (11, 12), (15, 14), (19, 18), (21, 20), (23, 24)}

f) Check whether the elements of set B are continuous. Two continuous elements form a pair. An element that is not continuous is checked for even or odd: if it is even, add an element S before it; if it is odd, add S after it:
B = {(3, 4), (7, S), (S, 10), (13, S), (17, 16), (S, 22), (25, S)}

g) Map the elements of set A and set B one after another as in Figure 4.10.

Figure 4.10 Mapped sequences of proposed NFSA system

h) Check for other possible matches using the pairs from process (b), {(1, 2) (5, 6) (8, 9) (11, 12) (14, 15) (18, 19) (20, 21) (23, 24)}. For each pair (a, b) with a > 1, check whether the elements at positions a-1 and b+1 are the same; if so, another possible match is found. The pairs with values greater than 1 are {(5, 6) (8, 9) (11, 12) (14, 15) (18, 19) (20, 21) (23, 24)}. For the pair (5, 6), the 5-1 = 4th and 6+1 = 7th elements are not the same, so there is no possibility of another match. For the pair (8, 9), the 8-1 = 7th and 9+1 = 10th elements are the same, so another possible match (7, 10) is found between the 7th and 10th elements, as shown in Figure 4.10. No further match is found for the other pairs (11, 12) (14, 15) (18, 19) (20, 21) (23, 24), making the total number of matches in the proposed system 9.

i) Append set A and set B to create set C:
C = {(1, 2), (3, 4), (5, 6), (7, S), (9, 8), (S, 10), (11, 12), (13, S), (15, 14), (17, 16), (19, 18), (21, 20), (S, 22), (23, 24), (25, S)}
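Processes (b) to (i) can be sketched with plain lists as follows. This is a minimal sketch, not the associator itself: the function names `build_sets` and `extra_matches` are hypothetical, set C is ordered by the smallest position in each pair, and a run of three or more continuous elements in set B is paired two at a time, which is an assumption because the text only shows runs of one or two.

```python
def build_sets(locations, n):
    """Return (A, B, C) as lists of pairs for match locations in a sample of length n."""
    pairs = [(p, p + 1) for p in locations]                     # process (b)
    covered = {x for pair in pairs for x in pair}               # process (c)
    missing = [x for x in range(1, n + 1) if x not in covered]  # process (d)
    # Process (e): put the odd position first in every matched pair.
    a_set = [(a, b) if a % 2 == 1 else (b, a) for a, b in pairs]
    # Process (f): pair continuous elements, otherwise add the spacer S.
    b_set = []
    i = 0
    while i < len(missing):
        x = missing[i]
        if i + 1 < len(missing) and missing[i + 1] == x + 1:
            y = missing[i + 1]
            b_set.append((x, y) if x % 2 == 1 else (y, x))
            i += 2
        else:
            b_set.append((x, "S") if x % 2 == 1 else ("S", x))
            i += 1
    # Process (i): merge both sets, ordered by the smallest position in each pair.
    c_set = sorted(a_set + b_set, key=lambda pr: min(v for v in pr if v != "S"))
    return a_set, b_set, c_set


def extra_matches(sample, locations):
    """Process (h): for a pair (a, a+1) with a > 1, compare positions a-1 and a+2."""
    found = []
    for a in locations:
        b = a + 1
        if a > 1 and b < len(sample) and sample[a - 2] == sample[b]:
            found.append((a - 1, b + 1))
    return found


sample = "AADGCCNGGNRRQCCKLCCRRYPPM"
locations = [1, 5, 8, 11, 14, 18, 20, 23]
a_set, b_set, c_set = build_sets(locations, len(sample))
print(c_set)
print(extra_matches(sample, locations))
```

On the sample in the text the sketch reproduces sets A, B and C as listed above and finds the single additional match (7, 10).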

Figure 4.11 Aligned sequences of proposed NFSA system

Number of Matches / Identities = 9/13 (approximately 70%)
Number of Mismatches = 2
Number of Gaps = 5

Algorithm Efficiency
Number of comparisons used in alignment = (Training + Inference + Aligning) = 132 comparisons
Alignment Efficiency = approximately 70%

The alignment efficiency and the base operations of Needleman-Wunsch and NFSA are shown in Figure 4.12 and Figure 4.13.


Figure 4.12 Alignment efficiency of Needleman-Wunsch versus NFSA

Analysis of Proposed NFSA Technique in Sequence Alignment

Let c1 be the count of base operations performed in training and inference, and c2 the count of base operations for aligning the sequences in the proposed NFSA system. For a sequence input of size 25:

c1(25) = 102
c2(25) = 30

The total count of base operations in the proposed NFSA algorithm for an input sequence of size 25, C_nfsa(25), is given by

C_nfsa(25) = c1(25) + c2(25) = 102 + 30 = 132

The run time of the proposed NFSA algorithm for an input sequence of size 25, T_nfsa(25), where one base operation takes 1/100th of a second, is given by

T_nfsa(25) = C_nfsa(25)/100 = 132/100 = 1.32 seconds

Figure 4.13 Base operation of NFSA and Needleman-Wunsch
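The run-time figures in this chapter all follow from the same arithmetic: the total count of base operations divided by 100, since one base operation is taken as 1/100th of a second. The short sketch below recomputes them from the counts quoted in the text; the names `OPS_PER_SECOND`, `COUNTS` and `run_time` are hypothetical.

```python
OPS_PER_SECOND = 100  # one base operation = 1/100th of a second (from the text)

# Base-operation counts quoted in the chapter, as (first part, second part).
COUNTS = {
    "Hash coding (input size 50)": (344, 56),
    "NFPR (input size 50)": (109, 6),
    "Needleman-Wunsch (input size 25)": (132, 156),
    "NFSA (input size 25)": (102, 30),
}


def run_time(total_operations):
    """Run time in seconds for a given number of base operations."""
    return total_operations / OPS_PER_SECOND


for name, (first, second) in COUNTS.items():
    total = first + second
    print(f"{name}: {total} operations, {run_time(total):.2f} seconds")
```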

IDENTIFICATION OF MUTATION IN HUMAN DNA

A further objective of the NFPR algorithm is to identify precisely where a mutation occurs in a given human DNA sample. It does so by classifying the sample into valid and invalid sequences and comparing their identification numbers.

Case 1

Before Mutation

VS1/RS   VS2      IS1 IS1 IS1  VS3      VS4      VS5/RS
AATGTGT  TGTGTGA  C   C   C    CTCAAAA  TCTCTCA  AATGTGT

IS2 IS2 IS2  VS6      VS7      VS8      VS9/RS   VS10
T   T   T    TACACTC  CGTTGGT  AATATGG  AATGTGT  TAAAGTT

VS11     IS3 IS3 IS3  VS12     VS13/RS  VS14
GCTACCC  G   G   G    GTTTTTT  AATGTGT  CTCTXXX

After Mutation in Valid sequence [Point mutation]

VS1/RS   VS2      IS1 IS1 IS1  VS3      VS4      VS5/RS
AATGTGT  TGTGTGA  C   C   C    CTCACAA  TCTCTCA  AATGTGT

IS2 IS2 IS2  VS6      VS7      VS8      VS9/RS   VS10
T   T   T    TACACTC  CGTTGGT  AATATGG  AATGTGT  TAAAGTT

VS11     IS3 IS3 IS3  VS12     VS13/RS  VS14
GCTACCC  G   G   G    GTTTTTT  AATGTGT  CTCTXXX

In case 1 the point mutation occurs in valid sequence VS 1,3, where the mutant C replaces the fifth base (CTCAAAA becomes CTCACAA). It can be identified by the change in the identification number of VS 1,3, as shown in Table 4.4, with no alteration in the invalid sequences, as shown in Table 4.5. The mutation results in a change in the polypeptide sequence, which might change the shape or function of the protein, depending on where in the sequence it occurs. The process of point mutation is shown in Figure 4.14.

Table 4.4 Discriminator-D1 outputs for valid sequence of HUMAN-1 before and after point mutation

Columns: Valid Sequence (VS), Identification Number Before Mutation, Repeated Sequence (RS), Identification Number After Mutation, Repeated Sequence (RS)

VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS VS 1, VS 1, VS 1, VS 1, RS VS 1, VS 1, VS 1, VS 1, RS VS 1, RS VS 1,

Table 4.5 Discriminator-D2 outputs for invalid sequence of HUMAN-1 before and after point mutation

Columns: Invalid Sequence (IS), Identification Number Before Mutation, Identification Number After Mutation

IS 1, IS 1, IS 1,

Figure 4.14 Point mutation

Case 2

After Mutation in Valid sequence [Frame shift mutation-insertion]

VS1/RS   VS2      IS1 IS1 IS1  VS3      VS4      VS5/RS
AATGTGT  TGTGTGA  C   C   C    CTCAAAA  TCTCTCA  AATGTGT

IS2 IS2 IS2  VS6      VS7      VS8      VS9/RS   VS10
T   T   T    TACACTC  CGTTGGT  AATATGG  AATGTGT  TAAAGTT

VS11 IS3 IS3 IS3 VS12 VS13/RS VS14
GCTACCC G CGGGTTT T T TAATGTG TCTCTXX

In case 2 the frame shift mutation (insertion) occurs in the invalid sequence IS 1,3, where the mutant C is inserted. It alters the valid sequence VS 1,12, as shown in Table 4.6, and the invalid sequence IS 1,3, as shown in Table 4.7, and can be identified by the change in the identification numbers of both the valid and the invalid sequences after mutation. The frame shift mutation (insertion) results in a change in the polypeptide sequence, which might change the shape or function of the protein, depending on where in the sequence it occurs. The process of frame shift mutation-insertion is shown in Figure 4.15.

Table 4.6 Discriminator-D1 outputs for valid sequence of HUMAN-1 before and after frame-shift [insertion] mutation

Columns: Valid Sequence (VS), Identification Number Before Mutation, Repeated Sequence (RS), Identification Number After Mutation, Repeated Sequence (RS)

VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS VS 1,

Table 4.7 Discriminator-D2 outputs for invalid sequence of HUMAN-1 before and after frame-shift [insertion] mutation

Columns: Invalid Sequence (IS), Identification Number Before Mutation, Identification Number After Mutation

IS 1, IS 1, IS 1, IS 1,

Figure 4.15 Frame shift mutation-insertion

Case 3

After Mutation in Invalid sequence [Point mutation (Neutral or Silent)]

VS1/RS   VS2      IS1 IS1 IS1  VS3      VS4      VS5/RS
AATGTGT  TGTGTGA  C   C   C    CTCAAAA  TCTCTCA  AATGTGT

IS2 IS2 IS2  VS6      VS7      VS8      VS9/RS   VS10
T   T   T    TACACTC  CGTTGGT  AATATGG  AATGTGT  TAAAGTT

VS11     IS3 IS3 IS3 IS3  VS12     VS13/RS  VS14
GCTACCC  G   G   G   G    GTTTTTT  AATGTGT  CTCTXXX

In case 3 the point mutation occurs in the same invalid sequence IS 1,3 as in case 2, but with the mutant G. It alters only the invalid sequence IS 1,3 and can be identified only by the change in the identification number of IS 1,3, as shown in Table 4.8; the identification numbers of the valid sequences remain unaltered, as shown in Table 4.9. This point mutation results in no change in the polypeptide sequence and has no consequence for the organism. The process of point mutation (neutral or silent) is shown in Figure 4.16.

Table 4.8 Discriminator-D2 outputs for invalid sequence of HUMAN-1 before and after point [neutral or silent] mutation

Columns: Invalid Sequence (IS), Identification Number Before Mutation, Identification Number After Mutation

IS 1, IS 1, IS 1,

Table 4.9 Discriminator-D1 outputs for valid sequence of HUMAN-1 before and after point [neutral or silent] mutation

Columns: Valid Sequence (VS), Identification Number Before Mutation, Repeated Sequence (RS), Identification Number After Mutation, Repeated Sequence (RS)

VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1,

Figure 4.16 Point mutation-neutral or silent

Case 4

After Mutation in Valid sequence [Frame shift mutation-deletion]

VS1/RS   VS2      IS1 IS1 IS1  VS3      VS4      VS5/RS
AATGTGT  TGTGTGA  C   C   C    CTCAAAA  TCTCTCA  AATGTGT

IS2 IS2 IS2  VS6      VS7      VS8      VS9/RS   VS10
T   T   T    TACACTC  CGTTGGT  AATATGG  AATGTGT  TAAAGTT

VS11     IS3 IS3 IS3  VS12     VS13/RS  VS14
GCTACCC  G   G   G    GTTTTTA  ATGTGTC  TCTXXX

In case 4 the frame shift mutation (deletion) occurs in valid sequence VS 1,12 through the removal of a T. It can be identified by the change in the identification number of valid sequence VS 1,12, as shown in Table 4.10, with no alteration in any of the invalid sequences, as shown in Table 4.11. As a result, a change in the polypeptide sequence occurs, which might change the shape or function of the protein, depending on where in the sequence it occurs. The process of frame shift mutation-deletion is shown in Figure 4.17.

Table 4.10 Discriminator-D1 outputs for valid sequence of HUMAN-1 before and after frame-shift [deletion] mutation

Columns: Valid Sequence (VS), Identification Number Before Mutation, Repeated Sequence (RS), Identification Number After Mutation, Repeated Sequence (RS)

VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS VS 1,

Table 4.11 Discriminator-D2 outputs for invalid sequence of HUMAN-1 before and after frame-shift [deletion] mutation

Columns: Invalid Sequence (IS), Identification Number Before Mutation, Identification Number After Mutation

IS 1, IS 1, IS 1,

Figure 4.17 Frame shift mutation-deletion

Case 5

After Mutation in Valid sequence [Inversion mutation]

VS1/RS   VS2      IS1 IS1 IS1  VS3      VS4      VS5/RS
AATGTGT  TGTGTGA  C   C   C    CTCAAAA  TCTCTCA  AATGTGT

IS2 IS2 IS2  VS6      VS7      VS8      VS9/RS   VS10
T   T   T    TACACTC  CGTTGGT  AATATGG  AATGTGT  TTGAAAT

VS11     IS3 IS3 IS3  VS12     VS13/RS  VS14
GCTACCC  G   G   G    GTTTTTT  AATGTGT  CTCTXXX

Table 4.12 Discriminator-D1 outputs for valid sequence of HUMAN-1 before and after inversion mutation

Columns: Valid Sequence (VS), Identification Number Before Mutation, Repeated Sequence (RS), Identification Number After Mutation, Repeated Sequence (RS)

VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS RS VS 1, VS 1, VS 1, VS 1, RS VS 1,


In case 5 the inversion mutation occurs in valid sequence VS 1,10, where TAAAGTT is replaced by the mutant TTGAAAT. It can be identified by the change in the identification number of valid sequence VS 1,10, as shown in Table 4.12, with no alteration in any of the invalid sequences, as shown in Table 4.13. The process of inversion mutation is shown in Figure 4.18. The inversion mutation results in a change in the polypeptide sequence, which might change the shape or function of the protein, depending on where in the sequence it occurs.

Table 4.13 Discriminator-D2 outputs for invalid sequence of HUMAN-1 before and after inversion mutation

Columns: Invalid Sequence (IS), Identification Number Before Mutation, Identification Number After Mutation

IS 1, IS 1, IS 1,

Figure 4.18 Inversion mutation
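The idea behind the mutation cases, namely that a change shows up in the valid sequences, the invalid sequences, or both, can be sketched by segmenting the sample before and after a mutation and comparing the results. This is a minimal sketch under stated assumptions: it compares the segments themselves rather than the identification numbers of Equation (3.8), which are not reproduced here; the function names `segment` and `changed` are hypothetical; and only Case 1 and Case 3 are shown, because the rows listed for the other cases do not follow from the simple pair rule used here.

```python
INVALID_PAIRS = {"TT", "GG", "CC"}  # pairs listed as invalid in this chapter

SAMPLE = (
    "AATGTGTTGTGTGACCCCTCAAAATCTCTCAAATGTGTTTTTACAC"
    "TCCGTTGGTAATATGGAATGTGTTAAAGTTGCTACCCGGGGTTTT"
    "TTAATGTGTCTCT"
)


def segment(sample, length=7, pad="X"):
    """Split a sample into valid sequences and single bases from invalid pairs."""
    valid, invalid = [], []
    i = 0
    while i < len(sample):
        if sample[i:i + 2] in INVALID_PAIRS:
            invalid.append(sample[i])
            i += 1
        else:
            valid.append(sample[i:i + length].ljust(length, pad))
            i += length
    return valid, invalid


def changed(before, after):
    """1-based indices at which two segment lists differ, including extra items."""
    longest = max(len(before), len(after))
    return [
        i + 1
        for i in range(longest)
        if i >= len(before) or i >= len(after) or before[i] != after[i]
    ]


valid_0, invalid_0 = segment(SAMPLE)

# Case 1: point mutation, fifth base of the third valid sequence becomes C.
case_1 = SAMPLE[:21] + "C" + SAMPLE[22:]
valid_1, invalid_1 = segment(case_1)
print(changed(valid_0, valid_1), changed(invalid_0, invalid_1))

# Case 3: an extra G inside the third run of invalid bases.
case_3 = SAMPLE[:83] + "G" + SAMPLE[83:]
valid_3, invalid_3 = segment(case_3)
print(changed(valid_0, valid_3), changed(invalid_0, invalid_3))
```

In Case 1 only the third valid sequence differs, while in Case 3 the valid sequences are untouched and only the list of invalid bases grows, which mirrors the behaviour reported for Tables 4.4, 4.5, 4.8 and 4.9.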
