Metaheuristics for the DNA Fragment Assembly Problem

International Journal of Computational Intelligence Research, Vol. 1, No. 2 (2005). © Research India Publications

Metaheuristics for the DNA Fragment Assembly Problem

Gabriel Luque 1 and Enrique Alba 2
1 Dpto. de Lenguajes y Ciencias de la Computación, E.T.S.I. Informática, Campus Teatinos, 29071 Málaga, SPAIN. gabriel@lcc.uma.es
2 Dpto. de Lenguajes y Ciencias de la Computación, E.T.S.I. Informática, Campus Teatinos, 29071 Málaga, SPAIN. eat@lcc.uma.es

Abstract: As more research centers embark on sequencing new genomes, the problem of DNA fragment assembly for shotgun sequencing is growing in importance and complexity. Accurate and fast assembly is a crucial part of any sequencing project, and since the DNA fragment assembly problem is NP-hard, exact solutions are very difficult to obtain. Various heuristics have been designed for solving the fragment assembly problem, but, in general, they are unable to sequence very large DNA molecules. In this work, we present several methods (a canonical genetic algorithm, a CHC method, a scatter search algorithm, and a simulated annealing) to accurately solve problem instances that are 77,000 base pairs long.

Keywords: DNA Fragment Assembly, Genetic Algorithm, Scatter Search, Simulated Annealing, CHC.

I. Introduction

DNA fragment assembly is a technique that attempts to reconstruct the original DNA sequence from a large number of fragments, each several hundred base pairs (bps) long. DNA fragment assembly is needed because current technology, such as gel electrophoresis, cannot directly and accurately sequence DNA molecules longer than 1000 bases. However, most genomes are much longer. For example, human DNA is about 3.2 billion nucleotides in length and cannot be read all at once. The following technique was developed to deal with this limitation. First, the DNA molecule is amplified, that is, many copies of the molecule are created.
The molecules are then cut at random sites to obtain fragments that are short enough to be sequenced directly. The overlapping fragments are then assembled back into the original DNA molecule. This strategy is called shotgun sequencing. Originally, the assembly of short fragments was done by hand, which is inefficient and error-prone. Hence, a lot of effort has been put into finding techniques to automate shotgun sequence assembly. Over the past decade, a number of fragment assembly packages have been developed and used to sequence different organisms. The most popular packages are PHRAP [1], TIGR assembler [2], STROLL [3], CAP3 [4], Celera assembler [5], and EULER [6]. These packages deal with the previously described challenges to a different extent, but none of them solves them all. Each package automates fragment assembly using a variety of algorithms, the most popular of which are greedy-based. This work reports on the design and implementation of several algorithms to tackle large instances of the DNA fragment assembly problem.

The remainder of this paper is organized as follows. In the next section, we present background information about the DNA fragment assembly problem. In Section III, the details of the heuristics are presented. In Section IV, we discuss the operators, the fitness functions [7], and how to design and implement these methods for the DNA fragment assembly problem. Section V presents the software that we use. We analyze the results of our experiments in Section VI. We end this paper by giving our final thoughts and conclusions in Section VII.

II. The DNA Fragment Assembly Problem

With the advances of computational science, bioinformatics has become more and more attractive to researchers in the field of computational biology. Genomic data analysis using computational approaches is very popular as well. As we know, the primary goal of any genomic project is to determine the complete sequence of the genome and its genetic content.
Thus, a genome project is accomplished in two steps: the first is genome sequencing, and the second is genome annotation (i.e., the process of identifying the boundaries between genes and other features in raw DNA

sequence).

In this paper, we focus on genome sequencing, which is also known as the DNA fragment assembly problem. The fragment assembly occurs at the very beginning of the process, and all other steps depend on its accuracy. The input of the DNA fragment assembly problem is a set of fragments that are randomly cut from a DNA sequence. DNA (deoxyribonucleic acid) is a double helix of two anti-parallel and complementary nucleotide sequences (see Figure 1). One strand is read from 5' to 3' and the other from 3' to 5'. A DNA sequence is always read in the 5' to 3' direction. There are four kinds of nucleotides in any DNA sequence: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C).

Figure. 1: Double Stranded DNA

We continue this section by giving a vivid analogy for the fragment assembly problem: "Imagine several copies of a book cut by scissors into thousands of pieces, say 10 million. Each copy is cut in an individual way, such that a piece from one copy may overlap a piece from another copy. Assume one million pieces are lost and the remaining nine million are splashed with ink; try to recover the original text." [6] We can think of the DNA target sequence as the original text and the DNA fragments as the pieces cut out from the book. To further understand the problem, we need the following basic terminology:

Fragment: A short sequence of DNA with length up to 1000 bps.
Shotgun data: A set of fragments.
Prefix: A substring comprising the first n characters of fragment f.
Suffix: A substring comprising the last n characters of fragment f.
Overlap: Common sequence between the suffix of one fragment and the prefix of another fragment.
Layout: An alignment of a collection of fragments based on the overlap order, i.e., the order in which the fragments must be joined.
Contig: A layout consisting of contiguous overlapping fragments, i.e., a sequence in which the overlap between adjacent fragments is greater than a predefined threshold.
Consensus: A sequence or string derived from the layout by taking the majority vote for each column of the layout.

To measure the quality of a consensus, we can look at the distribution of the coverage. Coverage at a base position is defined as the number of fragments at that position. It is a measure of the redundancy of the fragment data: it denotes the number of fragments, on average, in which a given nucleotide in the target DNA is expected to appear. It is computed as the number of bases read from fragments over the length of the target DNA [8]:

Coverage = (Σ_{i=1}^{n} length of fragment i) / (target sequence length)    (1)

where n is the number of fragments. TIGR uses the coverage metric to ensure the correctness of the assembly result. The coverage usually ranges from 6 to 10 [9]. The higher the coverage, the fewer the expected gaps, and the better the result.

A. DNA Sequencing Process

To determine the function of specific genes, scientists have learned to read the sequence of nucleotides comprising a DNA sequence in a process called DNA sequencing. The fragment assembly starts with breaking the given DNA sequence into small fragments. To do that, multiple exact copies of the original DNA sequence are made. Each copy is then cut into short fragments at random positions (duplicate and sonicate phases in Figure 2). Then this biological material is converted into computer data (sequence and call bases phases in Figure 2). These steps take place in the laboratory. After the fragment set is obtained, the traditional assembly approach is followed in this order: overlap, layout, and then consensus. To ensure that enough fragments overlap, the reading of fragments continues until the required coverage is reached. These steps are the last steps in Figure 2. In what follows, we give a brief description of each of the three phases.

Overlap Phase - Finding the overlapping fragments. This phase consists in finding the best or longest match between the suffix of one sequence and the prefix of another. In this step, we compare all possible pairs of fragments to determine their similarity. Usually, a dynamic programming algorithm for semiglobal alignment is used in this step. The intuition behind finding the pairwise overlap is that fragments with a significant overlap score are very likely to be next to each other in the target sequence.
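As a minimal illustration of the overlap phase, the following sketch computes exact suffix-prefix overlaps. This is a simplification of the method described above: the actual approach uses dynamic programming for semiglobal alignment, which also tolerates mismatches and gaps. All names here are illustrative.

```python
def overlap_score(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    A simplified stand-in for the semiglobal-alignment score used in the
    overlap phase: exact matching only, no mismatches or gaps allowed.
    """
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

# Fragments from the worked example in Section II
print(overlap_score("TCGGA", "CGGATG"))   # suffix "CGGA" matches prefix -> 4
print(overlap_score("CGGATG", "ATGTC"))   # suffix "ATG" matches prefix -> 3
```

In a real assembler, every pair of fragments (and reverse complements) would be scored this way, and the scores stored for the layout phase.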

Figure. 2: Graphical representation of DNA sequencing and assembly [10]
1. Duplicate and 2. Sonicate; 3. Sequence; 4. Call Bases:
CCGTAGCCGGGATCCCGTCC
CCCGAACAGGCTCCCGCCGTAGCCG
AAGCTTTTTCCCGAACAGGCTCCCG
5. Layout:
CCGTAGCCGGGATCCCGTCC
CCCGAACAGGCTCCCGCCGTAGCCG
AAGCTTTTCTCCCGAACAGGCTCCCG
6. Call Consensus:
AAGCTTTTCTCCCGAACAGGCTCCCGCCGTAGCCGGGATCCCGTCC

Layout Phase - Finding the order of fragments based on the computed similarity score. This is the most difficult step, because it is hard to tell the true overlaps apart due to the following challenges:

Repeated regions: Repeats are sequences that appear two or more times in the target DNA. Repeated regions have caused problems in many genome-sequencing projects, and none of the current assembly programs can handle them perfectly.

Chimeras and contamination: Chimeras arise when two fragments that are not adjacent or overlapping on the target molecule join together into one fragment. Contamination occurs due to the incomplete purification of the fragment from the vector DNA.

Unknown orientation: After the original sequence is cut into many fragments, the orientation is lost. A fragment can be read in either the 5' to 3' or the 3' to 5' direction, and one does not know which strand should be selected. If one fragment does not have any overlap with another, it is still possible that its reverse complement has such an overlap.

Base call errors: There are three types of base call errors: substitution, insertion, and deletion errors. They occur due to experimental errors in the electrophoresis procedure. Errors affect the detection of fragment overlaps.

Incomplete coverage: It happens when the algorithm is not able to assemble a given set of fragments into a single contig.

After the order is determined, a progressive alignment algorithm is applied to combine all the pairwise alignments obtained in the overlap phase.

Consensus Phase - Deriving the DNA sequence from the layout. The most common technique used in this phase is to apply the majority rule when building the consensus. The consensus determination requires multiple alignments in high coverage regions.

Example: We next give an example of this process. Given a set of fragments {F1 = GTCAG, F2 = TCGGA, F3 = ATGTC, F4 = CGGATG}, assume the four fragments are read in the 5' to 3' direction. First, we determine the overlap of each pair of fragments using the semiglobal alignment algorithm. Next, we determine the order of the fragments based on the overlap scores computed in the overlap phase. Suppose we obtain the order F2 F4 F3 F1. Then, the layout and the consensus for this example can be constructed as follows:

F2 -> TCGGA
F4 ->  CGGATG
F3 ->     ATGTC
F1 ->       GTCAG
Consensus -> TCGGATGTCAG

In this example, the resulting order allows us to build a sequence having just one contig. Since finding the exact order takes a huge amount of time, a heuristic such as a genetic algorithm can be applied in this step [7], [11], [12]. In the following sections, we illustrate how several heuristics are implemented for this problem.

III. Optimization Algorithms Used

In this section we briefly describe the methods that we have used: a genetic algorithm, a CHC method, a scatter search, and a simulated annealing. Once the basic behavior of these techniques is understood, we show in the next section how they can be used to solve the DNA fragment assembly problem.
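The layout and consensus steps of the example above can be sketched as follows. This is a minimal illustration, assuming the layout is given as (offset, fragment) pairs taken from the worked example; real assemblers work on full multiple alignments with gaps.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus from a layout given as (offset, fragment)
    pairs: collect the bases covering each column, then keep the most
    frequent base per column."""
    columns = {}
    for offset, frag in layout:
        for i, base in enumerate(frag):
            columns.setdefault(offset + i, []).append(base)
    return "".join(Counter(columns[p]).most_common(1)[0][0]
                   for p in sorted(columns))

# Worked example from the text: order F2 F4 F3 F1
layout = [(0, "TCGGA"), (1, "CGGATG"), (4, "ATGTC"), (6, "GTCAG")]
print(consensus(layout))  # TCGGATGTCAG
```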

A. Genetic Algorithm

The Genetic Algorithm (GA) was invented in the mid-1970s by John Holland [13]. It is based on Darwin's theory of evolution. A GA is an iterative technique that applies stochastic operators to a pool of individuals (the population) in order to improve the average fitness from one generation to the next. Every individual in the population is the encoded version of a tentative solution. Initially, this population is randomly generated. An evaluation function associates a fitness value to every individual, indicating its suitability for the problem. GAs are traditionally associated with a binary representation, but nowadays other types of representation can be found. A GA usually applies a recombination operator on two tentative solutions, plus a mutation operator that randomly modifies the individual contents to promote diversity. A genetic algorithm can be used for optimization problems with multiple parameters and multiple objectives. It is commonly used to tackle NP-hard problems such as the DNA fragment assembly and the Travelling Salesman Problem (TSP). NP-hard problems require tremendous computational resources to solve exactly; genetic algorithms help to find good solutions in a reasonable amount of time. Figure 3 shows a scheme of the GA. More details about the inner workings of the algorithm can be found in [14], [15].

1 t = 0
2 initialize P(t)
3 evaluate structures in P(t)
4 while not end do
5   t = t + 1
6   select O(t) from P(t-1)
7   recombine structures in O(t)
8   mutate structures in O(t)
9   evaluate structures in O(t)
10  replace P(t) from O(t) (and P(t-1))

Figure. 3: Scheme of the GA algorithm

B. CHC Algorithm

CHC (Cross generational elitist selection, Heterogeneous recombination, and Cataclysmic mutation) [16] is a variant of the genetic algorithm with a particular way of promoting diversity. It uses a highly disruptive crossover operator to produce new individuals maximally different from their parents.
It is combined with a conservative selection strategy which introduces a kind of inherent elitism. Figure 4 shows a scheme of the CHC algorithm, whose main features are the following:

Mating is not restricted to the best individuals; parents are randomly paired in a mating pool C(t) (line 6 of Figure 4). However, recombination is only applied if the Hamming distance between the parents is above a certain threshold, a mechanism of incest prevention (line 8 of Figure 4).

CHC uses a half-uniform crossover (HUX), which exchanges exactly half of the differing parental genes (line 9 of Figure 4).

Traditional selection methods do not guarantee the survival of the best individuals, though these have a higher probability of surviving. In contrast, CHC guarantees the survival of the best individuals selected from the set of parents (P(t-1)) and offspring (C'(t)) put together (line 11 of Figure 4).

Mutation is not applied directly as an operator. CHC applies a re-start mechanism if the population remains unchanged for some number of generations (lines 12-13 of Figure 4). The new population includes one copy of the best individual, while the rest of the population is generated by mutating some percentage of the bits of that best individual.

1 t = 0
2 initialize P(t)
3 evaluate structures in P(t)
4 while not end do
5   t = t + 1
6   select: C(t) = P(t-1)
7   for each pair (p1,p2) in C(t)
8     if incest prevention condition holds
9       add HUX(p1,p2) to C'(t)
10  evaluate structures in C'(t)
11  replace P(t) from C'(t) and P(t-1)
12  if convergence(P(t))
13    re-start P(t)

Figure. 4: Scheme of the CHC algorithm

C. Scatter Search

Scatter Search (SS) [17], [18] is a population-based metaheuristic that uses a reference set to combine its solutions and construct new ones. The method generates a reference set from a population of solutions. Then a subset is selected from this reference set. The selected solutions are combined to obtain starting solutions on which to run an improvement procedure.
The result of this improvement can motivate an update of the reference set. A pseudo-code of the scatter search is shown in Figure 5.

1 generate P
2 build RefSet from P
3 while not end do
4   generate subsets S from RefSet
5   for each subset s in S do
6     recombine solutions in s to obtain x_s
7     improve x_s
8     update RefSet with x_s
9   if convergence(RefSet)
10    generate a new P
11    build RefSet from P and the old RefSet

Figure. 5: Scheme of the SS algorithm

The procedures involved in the SS method are:

Initial population creation: The first step of this technique is to generate an initial population (line 1 of Figure 5). This population must be a wide set of disperse solutions, but it must also include good solutions. Several strategies can be applied to obtain a population with these properties.

Reference set update and creation: The SS operates on a small set of solutions, the RefSet, consisting of the good solutions found during the search. The good solutions are not limited to those with the best objective values: by good solutions we mean solutions with the best objective values as well as disperse solutions (to avoid local optima). In general, the RefSet is composed of two subsets: one for the best solutions and another for diverse solutions. This reference set is created from the initial population (line 2 of Figure 5), and it is updated whenever a new solution is generated (line 8 of Figure 5). This set is also partially reinitialized when the search has stagnated (lines 9-11 of Figure 5).

Subset generation: This procedure (line 4 of Figure 5) operates on the reference set to produce the subsets of its solutions that serve as a basis for creating combined solutions.

Solution combination: It transforms a given subset of solutions into one or more combined solution vectors (line 6 of Figure 5).

Improvement method: This procedure transforms a trial solution into an enhanced solution (line 7 of Figure 5).

D. Simulated Annealing

Simulated Annealing (SA) [19] is a stochastic optimization technique which has its origin in statistical mechanics. It is based upon a cooling procedure used in industry. This procedure heats the material to a high temperature so that it becomes a liquid and the atoms can move relatively freely. The temperature is then slowly lowered so that at each temperature the atoms can move enough to begin adopting the most stable configuration. In principle, if the material is cooled slowly enough, the atoms are able to reach the most stable (optimum) configuration. This smooth cooling process is known as annealing.
Figure 6 shows a scheme of SA. First of all, the parameter T, called the temperature, and the current solution are initialized (lines 2-4). A new solution s1 is accepted as the current solution if δ = f(s1) - f(s0) < 0. Stagnation in local optima is prevented by also accepting solutions which increase the objective function value, with probability exp(-δ/T) when δ > 0. This process is repeated several times to obtain good sampling statistics for the current temperature. The number of such iterations is given by the parameter Markov chain length, whose name alludes to the fact that the sequence of accepted solutions is a Markov chain (a sequence of states in which each state depends only on the previous one). Then the temperature is decremented (line 14), and the entire process is repeated until a frozen state is achieved at T_min (line 15). The value of T usually varies from a relatively large value to a small value close to zero.

1 t = 0
2 initialize(T)
3 s0 = Initial_Solution()
4 v0 = Evaluate(s0)
5 repeat
6   repeat
7     t = t + 1
8     s1 = Generate(s0,T)
9     v1 = Evaluate(s1)
10    if Accept(v0,v1,T)
11      s0 = s1
12      v0 = v1
13  until t mod Markov_Chain_length == 0
14  T = Update(T)
15 until loop stop criterion satisfied

Figure. 6: Scheme of the Simulated Annealing (SA)

IV. DNA Fragment Assembly Using Heuristics

Let us now give some details about the most important issues of our implementation and how we have used the previous methods to solve the DNA fragment assembly problem. First, we address the common details, such as the solution representation and the fitness function; then, we describe the specific features of each algorithm.

A. Common Issues

Solution Representation

We use the permutation representation with integer number encoding. A permutation of integers represents a sequence of fragment numbers, where successive fragments overlap. This representation requires a list of fragments, each assigned a unique integer ID.
For example, 8 fragments will need eight identifiers: 0, 1, 2, 3, 4, 5, 6, 7. The permutation representation requires special operators to make sure that we always obtain legal (feasible) solutions. In order to maintain a legal solution, two conditions must be satisfied: (1) all fragments must be present in the ordering, and (2) no duplicate fragments are allowed in the ordering. For example, in one possible ordering of 4 fragments, fragment 3 is at the first position, fragment 0 is at the second position, and so on.
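The two feasibility conditions above can be checked in a couple of lines; the helper below is an illustrative sketch, not part of the authors' implementation.

```python
def is_legal(ordering, n_fragments):
    """Feasibility check for the permutation encoding: every fragment ID
    in 0..n_fragments-1 appears exactly once (no omissions, no duplicates)."""
    return sorted(ordering) == list(range(n_fragments))

print(is_legal([3, 0, 2, 1], 4))  # True: a valid permutation of 4 fragment IDs
print(is_legal([3, 0, 0, 1], 4))  # False: fragment 0 is duplicated, 2 missing
```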

Fitness Function

A fitness function is used to evaluate how good a particular solution is. It is applied to each individual in the population, and it should guide the algorithm towards the optimal solution. In the DNA fragment assembly problem, the fitness function measures the quality of the multiple sequence alignment and seeks the best-scoring alignment. Parsons et al. proposed two different fitness functions [7].

Fitness function F1 sums the overlap score w(f[i], f[i+1]) for adjacent fragments f[i] and f[i+1] in a given solution. When this fitness function is used, the objective is to maximize the score: the best individual has the highest score, since the order proposed by that solution has strong overlaps between adjacent fragments.

F1(l) = Σ_{i=0}^{n-2} w(f[i], f[i+1])    (2)

Fitness function F2 not only sums the overlap score for adjacent fragments, but also sums the distance-weighted overlap score for all other possible pairs:

F2(l) = Σ_{i=0}^{n-1} Σ_{j=0}^{n-1} |i - j| · w(f[i], f[j])    (3)

This fitness function penalizes solutions in which strong overlaps occur between non-adjacent fragments in the layout. When this fitness function is used, the objective is to minimize the score: the best individual has the lowest score.

Program Termination

The program can be terminated in one of two ways: we can specify a maximum number of evaluations to stop the algorithm, or we can stop the algorithm when the solution is no longer improving.

B. GA Details

Population Size

We use a fixed-size population initialized with random permutations.

Recombination Operator

Two or more parents are recombined to produce one or more offspring. The purpose of this operator is to allow partial solutions to evolve in different individuals and then combine them to produce a better solution.
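Fitness function F1 can be sketched in a few lines. The overlap table w below is a toy example with made-up scores for the four fragments of Section II (F1..F4 mapped to IDs 0..3), not real alignment scores.

```python
def f1(ordering, w):
    """Fitness F1: sum of overlap scores between adjacent fragments in the
    ordering (to be maximized). `w` maps (id_a, id_b) pairs to precomputed
    overlap scores; missing pairs score 0."""
    return sum(w.get((ordering[i], ordering[i + 1]), 0)
               for i in range(len(ordering) - 1))

# Toy overlap table: F2-F4, F4-F3, and F3-F1 overlap (illustrative scores)
w = {(1, 3): 4, (3, 2): 3, (2, 0): 3}
print(f1([1, 3, 2, 0], w))  # order F2 F4 F3 F1 -> 4 + 3 + 3 = 10
print(f1([0, 1, 2, 3], w))  # an ordering with no adjacent overlaps -> 0
```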
The recombination operator is implemented by running through the population and, for each individual, deciding whether it should be selected for crossover using a parameter called the crossover rate (P_c). A crossover rate of 1.0 indicates that all the selected individuals are used in the crossover, so there are no survivors. However, empirical studies have shown that better results are achieved by a crossover rate between 0.65 and 0.85, which implies that the probability of an individual moving unchanged to the next generation ranges from 0.15 to 0.35. For our experimental runs, we use the order-based crossover (OX) and the edge-recombination crossover (ER). These operators were specifically designed for tackling problems with permutation representations.

The order-based crossover operator first copies the fragment IDs between two random positions in Parent1 into the offspring's corresponding positions. We then copy the rest of the fragments from Parent2 into the offspring in the relative order presented in Parent2. If a fragment ID is already present in the offspring, we skip that fragment. The method preserves the feasibility of every tentative solution in the population.

Edge recombination preserves the adjacencies that are common to both parents. This operator is appropriate because a good fragment ordering consists of fragments that are related to each other by a similarity metric and should therefore be adjacent to one another. Parsons [12] uses the edge recombination operator as follows:

A. Calculate the adjacencies.
B. Select the first position (s) from one of the parents.
C. Select a new position (s') in the following order until no fragments remain:
   1. An s' adjacent to s is selected if this adjacency is shared by both parents.
   2. The s' that has more remaining adjacencies is selected.
   3. s' is randomly selected if the candidates have an equal number of remaining adjacencies.

Mutation Operator

This operator is used for the modification of single individuals.
The reason we need a mutation operator is to maintain diversity in the population. Mutation is implemented by running through the whole population and, for each individual, deciding whether to select it for mutation, based on a parameter called the mutation rate (P_m). For our experimental runs, we use the swap mutation operator and the invert-segment mutation operator. The first operator randomly selects two positions in a permutation and then swaps the fragments at those two positions. The second also selects two positions in a permutation and then inverts the order of the fragments in the partial permutation defined by the two random positions (i.e., it swaps two edges in the equivalent graph). Since these operators do not introduce any duplicate numbers into the permutation, the solutions they produce are always feasible.

Selection Operator

The purpose of selection is to weed out the bad solutions. It requires a population as a parameter, processes the population using the fitness function, and returns a new population. The level of the selection pressure is very important: if the pressure is too low, convergence becomes very slow; if the pressure is too high, convergence will be premature, to a local optimum.
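The permutation operators described above (order-based crossover and the two mutation operators) can be sketched as follows. This is an illustrative re-implementation, not the authors' code; note that all three operators preserve feasibility by construction.

```python
import random

def order_based_crossover(p1, p2):
    """OX: copy a random slice of p1 into the child, then fill the remaining
    positions with the missing IDs in the relative order they appear in p2."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = p1[a:b + 1]
    fill = [x for x in p2 if x not in child]
    for i in range(n):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

def swap_mutation(perm):
    """Swap the fragments at two random positions."""
    p = perm[:]
    i, j = random.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]
    return p

def invert_segment_mutation(perm):
    """Invert the sub-permutation between two random positions
    (equivalent to swapping two edges in the layout graph)."""
    p = perm[:]
    i, j = sorted(random.sample(range(len(p)), 2))
    p[i:j + 1] = reversed(p[i:j + 1])
    return p

random.seed(1)
child = order_based_crossover([3, 0, 2, 1], [0, 1, 2, 3])
print(sorted(child))  # [0, 1, 2, 3]: feasibility is preserved
```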

In this work, we use a ranking selection mechanism [20], in which the GA first sorts the individuals based on fitness and then selects the individuals with the best fitness scores until the specified population size is reached. Preliminary results favored this method over a set of other selection techniques analyzed. Note that the population size grows whenever a new offspring is produced by the crossover or mutation operators.

C. CHC Details

Incest Prevention

As stated before, this method has a mechanism of incest prevention to avoid the recombination of similar solutions. Typically, the Hamming distance is used as the similarity measure, but it is unsuitable for permutations. In our experiments, we define the distance between two solutions as the total number of edges minus the number of common edges.

Recombination

The crossover that we use in our CHC creates a single child by preserving the edges the parents have in common and then randomly assigning the remaining edges in order to generate a legal permutation.

Population restart

Whenever the population converges, it is partially randomized for a restart, using the best individual found so far as a template and creating new individuals by repeatedly swapping edges until a specific fraction of edges differ from those of the template.

D. SS Details

Initial population creation

There exist several ways to obtain an initial population of good and disperse solutions. In our experiments, the solutions of the population were randomly generated to achieve a certain level of diversity. Then, we apply the improvement method (explained below) to these solutions in order to obtain better ones.

Subset generation and recombination operator

The method generates all 2-element subsets of the reference set and then applies the recombination operator to them. For our experimental runs, we use the order-based crossover explained in the previous section.
Improvement method

We apply a hillclimbing procedure to improve the solutions. The hillclimber is a variant of Lin's two-opt [21]: two positions are randomly selected, and the sub-permutation between them is inverted (swapping the two corresponding edges). Whenever an improvement is found, the solution is updated, and the hillclimber continues until it has performed a predetermined number of swap operations.

E. SA Details

Cooling scheme

The cooling schedule controls the values of the temperature parameter: it specifies the initial value and how the temperature is updated at each stage of the algorithm. Several schemes have been proposed in the literature [22]:

Proportional SA: T_k = α · T_{k-1}    (4)
CSA:  T_k = T_0 / ln(1 + k)    (5)
FSA:  T_k = T_0 / k    (6)
VFSA: T_k = T_0 · e^{-k}    (7)

The proportional scheme is the most widely used. In this case, the decrease is controlled by the factor α, where α ∈ (0, 1).

Markov chain length

The number of iterations between two consecutive changes of the temperature is given by the parameter Markov chain length, whose name alludes to the fact that the sequence of accepted solutions is a Markov chain (a sequence of states in which each state depends only on the previous one).

Move operator

This operator generates a new neighbor from the current solution. For our experimental runs, we use the edge swap operator, which randomly selects two positions from a permutation and then inverts the order of the fragments between these two positions.

V. The MALLBA Project

Before moving on to Section VI, in which we analyze the results obtained by the heuristics, we introduce the MALLBA project, which we used in this work and where all our programs can be found.
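The cooling schedules and the SA acceptance rule described in Section III can be sketched as follows. The parameter values here are illustrative; only the proportional scheme with α = 0.99 corresponds to the setting actually used in the experiments.

```python
import math
import random

def proportional(T, alpha=0.99):
    """Proportional (geometric) cooling: T_k = alpha * T_{k-1}."""
    return alpha * T

def csa(T0, k):
    """Classical schedule: T_k = T0 / ln(1 + k)."""
    return T0 / math.log(1 + k)

def accept(delta, T):
    """Metropolis rule: always accept improvements (delta < 0); accept
    worsening moves with probability exp(-delta / T)."""
    return delta < 0 or random.random() < math.exp(-delta / T)

# Five steps of proportional cooling with alpha = 0.99
T = 100.0
for _ in range(5):
    T = proportional(T)
print(round(T, 2))  # 100 * 0.99^5, i.e. 95.1
```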

The MALLBA research project [23] is aimed at developing a library of optimization algorithms that can deal with parallelism in a user-friendly and, at the same time, efficient manner. Its three target environments are sequential, LAN, and WAN computer platforms. All the algorithms described in the previous section are implemented as software skeletons (similar to the concept of software pattern) with a common internal and public interface. This permits fast prototyping and transparent access to parallel platforms. MALLBA skeletons distinguish between the concrete problem to be solved and the solver technique. Skeletons are generic templates to be instantiated by the user with the features of the problem. All the knowledge related to the solver method (e.g., parallel considerations) and its interactions with the problem is implemented by the skeleton and offered to the user. Skeletons are implemented by a set of required and provided C++ classes that represent an abstraction of the entities participating in the solver method:

Provided classes: They implement internal aspects of the skeleton in a problem-independent way. The most important provided classes are Solver (the algorithm) and SetUpParams (setup parameters).

Required classes: They specify information related to the problem. Each skeleton includes the Problem and Solution required classes, which encapsulate the problem-dependent entities needed by the solver method. Depending on the skeleton, other classes may be required.

Therefore, the user of a MALLBA skeleton only needs to implement the particular features related to the problem. This considerably speeds up the creation of new algorithms with minimum effort, especially if they are built as combinations of existing skeletons (hybrids). The infrastructure used in the MALLBA project is made up of communication networks and clusters of computers located at the Spanish universities of Málaga, La Laguna, and UPC in Barcelona.
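The provided/required split can be illustrated with a small sketch. MALLBA itself provides C++ classes; the Python below and all names in it are illustrative stand-ins, not the library's actual API.

```python
import random

# "Provided" class: problem-independent solver logic, supplied by a skeleton.
class Solver:
    def __init__(self, problem, max_evals):
        self.problem, self.max_evals = problem, max_evals

    def run(self):
        """A trivial random-search solver standing in for a real skeleton."""
        best, best_fit = None, float("-inf")
        for _ in range(self.max_evals):
            s = self.problem.random_solution()
            fit = self.problem.evaluate(s)
            if fit > best_fit:
                best, best_fit = s, fit
        return best, best_fit

# "Required" class: the user fills in only the problem-dependent parts.
class OneMax:
    """A stand-in problem: maximize the number of ones in a bit string."""
    def __init__(self, n):
        self.n = n

    def random_solution(self):
        return [random.randint(0, 1) for _ in range(self.n)]

    def evaluate(self, s):
        return sum(s)

best, fit = Solver(OneMax(16), max_evals=200).run()
print(fit)
```

The point of the pattern is that swapping in a different problem, or a different solver skeleton, requires touching only one of the two classes.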
These nodes are interconnected by a chain of Fast Ethernet and ATM circuits. The MALLBA library is publicly available. By using this library, we were able to quickly code algorithmic prototypes to cope with the inherent difficulties of the DNA fragment assembly problem.

VI. Experimental Results

A target sequence with accession number BX was used in this work. It was obtained from the NCBI web site. It is the sequence of a Neurospora crassa (common bread mold) BAC, and is 77,292 base pairs long. To test and analyze the performance of our algorithms, we generated two problem instances with GenFrag [24]. GenFrag takes a known DNA sequence and uses it as a parent strand from which to randomly generate fragments according to the criteria (mean fragment length and coverage of the parent sequence) supplied by the user.

Table 1: Parameter settings used in the experiments

Common parameters
  Independent runs        30
  Fitness function        F1
  Cutoff                  30
  Max evaluations
Genetic Algorithm
  Popsize                 1024
  Crossover               OX (0.8)
  Mutation                Edge Swap (0.2)
  Selection               Ranking
CHC
  Popsize                 50
  Crossover               specific (1.0)
  Restart                 Edge Swap (30%)
Scatter Search
  Initial popsize         15
  Reference set           8 (5 + 3)
  Subset generation       all 2-element subsets (28)
  Crossover               OX (1.0)
  Improvement             Edge Swap (100 iterations)
Simulated Annealing
  Move operator           Edge Swap
  Markov chain length     total number of evaluations / 100
  Cooling scheme          Proportional (α = 0.99)

The first problem instance contains 442 fragments with an average fragment length of 708 bps and coverage 4. The second problem instance contains 773 fragments with an average fragment length of 703 bps and coverage 7. We evaluated the results in terms of the number of contigs assembled. We use a GA, a CHC, a SS, and a SA to solve this problem. To allow a fair comparison among the results of these heuristics, we have configured them to perform a similar computational effort (the same maximum number of evaluations for every algorithm).
The scatter search is an exception, since it executes a local search (a quite time-consuming operator) to improve the solutions, but this operator does not perform complete evaluations of the solutions: it only needs to know whether the new solution is better or worse than the current one. SS is therefore configured so that its total execution time is similar to that of the GA. Since the results of these algorithms vary depending on the parameter settings, we previously performed a complete analysis to study how the parameters affect the performance of the algorithms. From these analyses, we conclude that the best settings for our instances of the fragment assembly problem are the following. The GA uses a population size of 1024 individuals, OX as crossover operator (with probability 0.8), and an edge swap mutation operator (with probability 0.2). For CHC we use a population of 50 individuals, and the restart method swaps edges

until the edges of the new solution differ by at least 30% from those of the old one. The SS uses an initial population of 15 individuals; the reference set is composed of eight solutions (the five best and three diverse ones); in every iteration it generates all 2-element subsets (28 subsets) and combines the solutions of each subset to generate a new solution. The improvement method (100 iterations) is then applied to these solutions to obtain better ones. For SA, we also tested several configurations and cooling schemes, and the best results were obtained with a Markov chain of length equal to the total number of evaluations divided by 100 and a proportional cooling scheme with a decreasing factor of 0.99. Because of the stochastic nature of the algorithms, we performed 30 independent runs of each test to gather meaningful experimental data. A summary of the conditions for our experimentation is found in Table 1.

Table 2 shows the results and performance for all data instances and algorithms described in this work: the fitness of the best solution obtained (b), the average fitness found (f), the average number of evaluations (e), and the average time in seconds (t). We do not show the standard deviation because the fluctuations in the accuracy of different runs are rather small, indicating that the algorithms are very robust (as confirmed by the ANOVA results). The best results are boldfaced.

Table 2: Results on the two instances. Rows: Genetic Algorithm, CHC, Scatter Search, Simulated Annealing; columns: b, f, e, and t for each instance.

Let us discuss some of the results found in this table. First, for the two instances, it is clear that SA outperforms the rest of the algorithms from every point of view. In fact, SA obtains better fitness values than the previously best known solutions [25]. Also, its execution time is the lowest one.
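Two of the settings above can be made concrete with a short sketch. Both functions below are our minimal reading of the text, with hypothetical names (this is not the code used in the experiments): the CHC restart applies swaps until at least 30% of the permutation differs from the old solution, and the SA cooling holds the temperature for a Markov chain of length equal to the total number of evaluations divided by 100 before multiplying it by α = 0.99.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// CHC restart (our reading): apply random swaps to the permutation of
// fragments until at least minDiffRatio of its positions differ from
// the original ordering.
std::vector<int> restartBySwaps(std::vector<int> perm,
                                double minDiffRatio,
                                unsigned seed = 1) {
    const std::vector<int> original = perm;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, perm.size() - 1);
    auto diffRatio = [&] {
        std::size_t d = 0;
        for (std::size_t i = 0; i < perm.size(); ++i)
            if (perm[i] != original[i]) ++d;
        return static_cast<double>(d) / perm.size();
    };
    while (diffRatio() < minDiffRatio)
        std::swap(perm[pick(rng)], perm[pick(rng)]);
    return perm;
}

// SA cooling: the temperature is held constant for a Markov chain of
// length totalEvals / 100 and then decreased proportionally (t <- alpha*t),
// so it is multiplied by alpha exactly 100 times over the whole run.
double finalTemperature(double t0, double alpha, long totalEvals) {
    long chainLen = totalEvals / 100;    // Markov chain length
    if (chainLen == 0) chainLen = 1;     // guard for very short runs
    double t = t0;
    for (long e = 0; e < totalEvals; ++e)
        if ((e + 1) % chainLen == 0)
            t *= alpha;                  // proportional cooling step
    return t;
}
```

The 30% threshold and the α = 0.99 factor are the values from Table 1; everything else (seeds, function signatures) is illustrative.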
The reason for this is that SA operates on a single solution, while the rest of the methods are population-based and, in addition, execute time-consuming operators (especially the crossover operation). CHC is the worst algorithm in both execution time and solution quality. Its larger runtime is due to the additional computations needed to detect the convergence of the population and to detect incest mating. CHC is not able to solve the DNA fragment assembly problem adequately, and it may need a local search (as proposed in [16]) to reduce the search space. The SS obtains better solutions than the GA, and it also achieves them in a shorter runtime. This means that a structured application of operators and explicit reference search points are a good idea for this problem. The computational effort to solve this problem (considering only the number of evaluations) is similar for all the heuristics with the exception of SS, because its most time-consuming operation is the improvement method, which does not perform complete evaluations of the solutions. This result indicates that all the algorithms examine a similar number of points in the search space, and that the difference in solution quality is due to how they explore it. For this problem, trajectory-based methods such as simulated annealing are more effective than population-based ones. Thus, the resulting ranking of algorithms from best to worst is SA, SS, GA, and finally CHC.

Figure 7: Best fitness evolution for the first instance (x-axis: evaluations (%); y-axis: best fitness value; one curve per algorithm: GA, CHC, SS, SA).

In Figure 7 we show a trace of the best fitness evolution during the search for all the heuristics on the first instance. Let us analyze the behavior of each algorithm. The GA steadily improves its best solution, and with a larger number of evaluations it might obtain better solutions, although its convergence speed is slow.
In fact, we tested the GA with a larger number of evaluations, and it certainly improved its solutions, but it did not reach the quality of the SS solutions. CHC converges quickly to a suboptimal solution, and then it is very difficult for this algorithm to get away from it. SS and SA quickly improve the quality of the solutions at first, and then the search advances slowly; it can be noticed that the slope of the SA curve is rather steep compared to that of the rest of the algorithms. We have performed statistical analyses to ensure the significance of the results and to confirm that our conclusions

are valid. First, we performed a one-way ANOVA to compare the means of the data. The most interesting parameter returned by this method is the p-value, which indicates whether the data are statistically different (the p-value is less than the significance level) or whether, on the contrary, we cannot ensure that they are different (the p-value is equal to or greater than the significance level). In our case we use a significance level of 5%, and this test shows that there exists a significant difference among the algorithms (the p-value is always less than 0.05). Then, we performed a multiple comparison using the one-way ANOVA results to determine which means are significantly different. The results of these tests are shown in Table 3. The boldfaced results indicate that the comparison is negative (i.e., the difference is not significant). In general, we can observe that all the results are statistically different (zero is not included in the confidence interval), with the exception of the numbers of evaluations of SA and GA, which are too similar.

Table 3: Multiple comparison for the two instances: 95% confidence intervals (f ×1e5, e ×1e5, t ×1e8).

First instance:
  GA  vs CHC: f (0.70, 0.72)    e (0.7, 1.2)    t (-1.2, -1.1)
  GA  vs SS:  f (0.55, 0.58)    e (4.7, 5.1)    t (0.1, 0.3)
  GA  vs SA:  f (-1.36, -1.34)  e (-0.1, 0.3)   t (0.2, 0.3)
  CHC vs SS:  f (-0.16, -0.14)  e (3.7, 4.1)    t (1.3, 1.5)
  CHC vs SA:  f (-2.07, -2.05)  e (-1.1, -0.7)  t (1.3, 1.5)
  SS  vs SA:  f (-1.93, -1.91)  e (-5.5, -4.6)  t (0.1, 0.2)

Second instance:
  GA  vs CHC: f (0.52, 1.12)    e (0.4, 1.1)    t (-4.2, -3.6)
  GA  vs SS:  f (0.14, 0.74)    e (4.6, 5.2)    t (0.2, 0.9)
  GA  vs SA:  f (-3.37, )       e (-0.3, 0.4)   t (0.4, 1.1)
  CHC vs SS:  f (-0.67, -0.07)  e (3.8, 4.5)    t (4.1, 4.8)
  CHC vs SA:  f (-4.19, -3.61)  e (-1.0, -0.4)  t (4.3, 5.0)
  SS  vs SA:  f (-3.82, -3.22)  e (-5.2, -4.6)  t (0.2, 0.5)

Finally, Table 4 shows the final number of contigs computed in every case.

Table 4: Final best contigs.
  Algorithm   First instance   Second instance
  GA          6                4
  CHC         7                5
  SS          6                4
  SA          4                2
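The contig counts in Table 4 rest on a simple criterion: adjacent fragments in the final layout whose overlap exceeds the cutoff belong to the same contig, and every weaker junction starts a new one. A minimal sketch of this count (the function name and input encoding are ours, not the assembler's code):

```cpp
#include <vector>

// Count contigs from the overlaps between consecutive fragments in the
// final layout: an overlap at or below the cutoff breaks the layout
// into a new contig; a non-empty layout always has at least one contig.
int countContigs(const std::vector<int>& adjacentOverlaps, int cutoff) {
    int contigs = 1;
    for (int w : adjacentOverlaps)
        if (w <= cutoff) ++contigs;    // weak junction: start a new contig
    return contigs;
}
```

With the cutoff of 30 from Table 1, a layout whose consecutive overlaps are, say, 50, 10, 40, and 25 splits into three contigs, and the optimum solution is the one whose overlaps all exceed the cutoff (a single contig).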
A contig is a sequence in which the overlap between adjacent fragments is greater than a threshold (the cutoff parameter). Hence, the optimum solution has a single contig. This value is used as a high-level criterion to judge the overall quality of the results since, as we said before, it is difficult to capture the dynamics of the problem in a mathematical function. These values are computed by applying a final refinement step with a greedy heuristic popular in this application [26]. We have found that in some (extreme) cases a solution with a better fitness than another one can generate a larger number of contigs (i.e., a worse solution). This is why more research is still needed to obtain a more accurate mapping from fitness to contig number. The values of this table, however, confirm again that the SA method clearly outperforms the rest, that CHC obtains the worst results, and that SS and GA obtain a similar number of contigs.

VII. Conclusions

The DNA fragment assembly is a very complex problem in computational biology. Since the problem is NP-hard, the optimal solution is impossible to find for real cases, except for very small problem instances. Hence, computational techniques of affordable complexity, such as heuristics, are needed for it. We tackled this problem with four sequential heuristics: three population-based ones (genetic algorithm, scatter search, and CHC) and a trajectory-based method (simulated annealing). We have observed that the latter outperforms the rest of the methods (which motivates more research on trajectory-based methods in the future). It obtains much better solutions than the others, and it runs faster. This local search algorithm improved on the best fitness values known in the literature. On the contrary, CHC is not able to obtain good solutions for this problem. The results of the SS are only slightly better than those of the GA.
For the future, we plan to study new and more adequate fitness functions for this problem, explicitly including the number of contigs. These new functions are quite time-consuming, and we therefore also plan to extend this study by using parallel versions of these algorithms. Previous experiments showed that parallelism not only reduces the execution time but also improves solution accuracy [25].

Acknowledgments

The authors are partially supported by the Ministry of Science and Technology and FEDER under contract TIN C04-01 (the OPLINK project).

References

[1] P. Green. Phrap.
[2] G.G. Sutton, O. White, M.D. Adams, and A.R. Kerlavage. TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Tech., pp. 9–19.
[3] T. Chen and S.S. Skiena. Trie-based data structures for sequence assembly. In The Eighth Symposium on Combinatorial Pattern Matching.
[4] X. Huang and A. Madan. CAP3: A DNA sequence assembly program. Genome Research, 9.
[5] E.W. Myers. Towards simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 2(2).
[6] P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. The MIT Press, London.
[7] R. Parsons, S. Forrest, and C. Burks. Genetic algorithms, operators, and DNA fragment assembly. Machine Learning, 21.
[8] J. Setubal and J. Meidanis. Fragment assembly of DNA. In Introduction to Computational Molecular Biology. University of Campinas, Brazil.
[9] S. Kim. A Structured Pattern Matching Approach to Shotgun Sequence Assembly. PhD thesis, Computer Science Department, The University of Iowa.
[10] C.F. Allex. Computational Methods for Fast and Accurate DNA Fragment Assembly. UW technical report CS-TR, Department of Computer Sciences, University of Wisconsin-Madison.
[11] C. Notredame and D.G. Higgins. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 24.
[12] R. Parsons and M.E. Johnson. A case study in experimental design applied to genetic algorithms with applications to DNA sequence assembly. American Journal of Mathematical and Management Sciences, 17.
[13] J.H. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, Michigan.
[14] D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
[15] L. Davis. Handbook of Genetic Algorithms. Van Nostrand Reinhold.
[16] L.J. Eshelman. The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. In Foundations of Genetic Algorithms, 1. Morgan Kaufmann.
[17] F. Glover, M. Laguna, and R. Martí. Fundamentals of scatter search and path relinking. Control and Cybernetics, 39(3).
[18] M. Laguna. Scatter search. In Handbook of Applied Optimization. Oxford University Press.
[19] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220(4598).
[20] D. Whitley. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best. In J.D. Schaffer, editor, Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann.
[21] S. Lin and B.W. Kernighan. An effective heuristic algorithm for the TSP. Operations Research, 21.
[22] X. Yao. A new SA algorithm. International Journal of Computer Mathematics, 56.
[23] E. Alba and the MALLBA Group. MALLBA: A library of skeletons for combinatorial optimisation. In R. Feldmann and B. Monien, editors, Proceedings of Euro-Par, volume 2400 of LNCS. Springer-Verlag.
[24] M.L. Engle and C. Burks. Artificially generated data sets for testing DNA fragment assembly algorithms. Genomics, 16.
[25] E. Alba, G. Luque, and S. Khuri. Assembling DNA fragments with parallel algorithms. In B. McKay, editor, Proceedings of the IEEE Congress on Evolutionary Computation (CEC-2005), Edinburgh, UK.
[26] L. Li and S. Khuri. A comparison of DNA fragment assembly algorithms. In International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences.

Author Biographies

Enrique Alba is a Professor of Computer Science at the University of Málaga, Spain. He received his Ph.D. degree for work on designing and analyzing parallel and distributed genetic algorithms.
His current research interests involve the design and application of evolutionary algorithms and other metaheuristics to real problems, including telecommunications, combinatorial optimization, and bioinformatics. The main focus of all his work is on parallelism. Dr. Alba has published many scientific papers in international conferences and journals and three books on the theory and applications of evolutionary algorithms, and he holds national and international awards for his research results.

Gabriel Luque received the Engineering degree in Computer Science from the University of Málaga, Spain. He is currently working towards his Ph.D. on the design of new evolutionary algorithms, specifically parallel algorithms, and their application to complex problems in the domains of bioinformatics, natural language processing, and combinatorial optimization in general.