Oligonucleotide Design by Multilevel Optimization

January 21, 2005. Technical Report. Oligonucleotide Design by Multilevel Optimization Colas Schretter & Michel C. Milinkovitch e-mail: mcmilink@ulb.ac.be

Colas Schretter¹, Michel C. Milinkovitch¹
¹ Laboratory of Evolutionary Genetics, Institute of Molecular Biology and Medicine (IBMM), Free University of Brussels (ULB)
Email: Colas Schretter - cschrett@ulb.ac.be; Michel C. Milinkovitch - mcmilink@ulb.ac.be (corresponding author)

Abstract

Background: Many molecular biology experiments make use of short RNA or DNA sequences called oligonucleotides, and their success depends heavily on oligonucleotide design. The constraints and desired properties of these oligonucleotides vary among applications, such as long oligos for micro-arrays, primer pairs for PCR amplification and sequencing, and siRNA for knocking down gene expression. Most methods proposed in the literature are conceived with one specific application in mind. The aim of our work is to specify a general framework for building design applications. Each algorithm presented here is a building block that can be combined with others to create a customized oligonucleotide design pipeline.

Results: We present a collection of complementary techniques for the selection of high-quality oligonucleotides for PCR and DNA array experiments. The general pipeline proceeds by successively selecting the best candidates on various criteria, such as the minimization of secondary structures, evaluated with a statistical mechanics approach, and the maximization of specificity. The latter is optimized by searching the genome for a short list of finalist candidates. Furthermore, we maintain diversity in the population of candidates to ensure exploration of the solution domain.

Conclusions: The candidate selection method we developed yields high-quality oligonucleotides and is implemented in a collection of design applications available at http://www.ulb.ac.be/sciences/ueg/softwares

1 Background

A comparison of the most widely recognized solutions to the problem of oligonucleotide design [1,2] underscores the importance of several key features for high-quality design. Specifically, a successful computer-assisted oligonucleotide design should: enforce hard bounds on properties such as the estimated melting temperature and the oligo length; minimize the likelihood of secondary structure formation, namely hairpins, homodimers and heterodimers; achieve perfect local complementarity with the query sequence; and minimize cross-hybridization with non-target sequences within the considered genome.

However, most of these constraints are independent, so it is impossible to define a single objective function whose optimum is optimal across all constraints. Hence, we formulate the problem as a multi-criteria optimization problem. Multi-criteria optimization methods [3] aim to find the Pareto-optimal set of solutions. A design is said to be Pareto-optimal if no feasible design improves one of the objectives without simultaneously worsening at least one other objective.

Our approach consists in pruning the domain of candidate oligonucleotides by optimizing each set of independent constraints in turn, round by round. After each round, the set of candidate solutions shrinks. Furthermore, we maintain diversity within the population of candidates by selecting only nearly non-overlapping candidates, which allows a trade-off between exploitation of the best solutions found so far and exploration of the domain of potential solutions.

2 Method

Each possible candidate sub-sequence enters the selection pipeline shown in Figure 1. At each stage, the set of best candidates is determined by searching for the cluster of best solutions under the current criterion. This approach is a heuristic, as a possibly optimal oligonucleotide could be discarded at an early stage of the selection pipeline without being tested against the other criteria. Hence, we keep the largest possible set of acceptable candidates at each stage.

[Figure 1 diagram. Recoverable labels: Query Sequence, Constraints, Parameters, Combinatorial Generation, Constraints Filter, Dimer Energy Optimization, Domain Sampling, Specificity Optimization, Result Set.]

Figure 1: The selection pipeline. After each stage, only a subset of the candidates is retained.

2.1 Generation of Candidates

Every candidate oligonucleotide whose length lies between the user-specified minimum and maximum is generated from the query sequence. The number of such candidates is

$$N = \sum_{i=L_{min}}^{L_{max}} (W - i + 1), \quad \text{with } L_{min} \le L_{max} \le W,$$

where $W$ is the width of the design area, and $L_{min}$ and $L_{max}$ are the minimum and maximum oligonucleotide lengths, respectively. Although $N$ grows as the product of $W$ and the number of admissible lengths, the computational resources of a workstation allow every candidate to be generated for the values of $W$ and $(L_{max} - L_{min})$ used in practice, e.g., $N = 5236$ if $W = 500$, $L_{min} = 20$ and $L_{max} = 30$. Complete coverage of the solution domain at this early stage ensures that an a priori optimal solution is not missed.

2.2 Filtering from Constraints

The set of candidates is then filtered against a series of user-specified hard constraints: the accepted range of the estimated melting temperature ($T_m$); the accepted tolerance for overlap with repeated or microsatellite regions; and the minimum and mean quality of the query sequence at the oligonucleotide positions. Filtering on quality is flexible, and the user can skip that criterion. Furthermore, in the case of primer design, we provide independent quality testing for the 3'-end of the oligonucleotides.

2.2.1 $T_m$ Estimation

We use well-known and validated methods for $T_m$ estimation. If the length of the oligonucleotide sequence is below 20, we use the Wallace model [4]

$$T_m = 2\,(A + T) + 4\,(G + C),$$

where A, C, G and T are the numbers of the corresponding nucleotides. Otherwise, we use a more elaborate thermodynamic method [5]

$$T_m = \frac{T\,\Delta H}{\Delta H - \Delta G + R\,T \ln(C)} + 16.6 \log[\mathrm{salt}] - 269.3,$$

where $T = 298.2$ K is an experimental temperature, $\Delta H$ and $\Delta G$ are, respectively, the sums of the nearest-neighbor enthalpies and Gibbs free energies (in cal/mol), $R = 1.987$ cal/(mol K) is the molar gas constant, $C$ is the oligonucleotide concentration, and [salt] is a correction term that depends on the experimental salt conditions.

2.2.2 Repetitions and Microsatellites Masking

Because repeated regions are generally very numerous within genomes, they are unspecific, and oligonucleotides containing repetitions should be avoided. Therefore, we attach to each query sequence a binary mask that indicates the positions of repeats. The union of all repeated regions is found by regular expression matching. We accept masked bases within an oligonucleotide up to a tolerance proportional to the length of the candidate. Indeed, a given oligonucleotide can be a very high-quality solution even if a few of its positions overlap masked regions.

2.3 Internal Energy Optimization

All candidate oligonucleotides that passed the stages described above are sorted according to their internal energy, i.e., their relative risk of forming hairpin and homodimer secondary structures. All possible hairpin and homodimer configurations $s$ are enumerated.
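As a recap of Sections 2.1 and 2.2, the generation and filtering stages can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' Java implementation: the function names, the repeat pattern, and the `mask_tolerance` parameter are hypothetical. Only the Wallace branch of the $T_m$ estimate is shown, because the nearest-neighbor parameters needed for longer oligonucleotides are not given in the text.

```python
import re

def candidate_spans(width, l_min, l_max):
    """Yield (start, length) for every sub-sequence of admissible length."""
    for length in range(l_min, l_max + 1):
        for start in range(width - length + 1):
            yield start, length

def wallace_tm(oligo):
    """Wallace rule: 2 degrees per A or T, 4 degrees per G or C."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

def repeat_mask(query, min_copies=4, max_unit=4):
    """Binary mask of tandem repeats found by regular expression matching.

    The pattern is an assumption: a unit of 1..max_unit bases repeated
    at least min_copies times in a row.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    mask = [0] * len(query)
    for match in pattern.finditer(query):
        for pos in range(match.start(), match.end()):
            mask[pos] = 1
    return mask

def passes_filters(query, mask, start, length, tm_range, mask_tolerance):
    """Apply the hard constraints to one short candidate (length < 20)."""
    oligo = query[start:start + length]
    tm_min, tm_max = tm_range
    if not tm_min <= wallace_tm(oligo) <= tm_max:
        return False
    # The tolerance is proportional to the candidate length (Section 2.2.2).
    return sum(mask[start:start + length]) <= mask_tolerance * length

# Candidate count for the example values quoted in Section 2.1.
n_candidates = sum(1 for _ in candidate_spans(500, 20, 30))
```

With W = 500, L_min = 20 and L_max = 30, `n_candidates` evaluates to 5236, which matches the figure quoted in Section 2.1.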

We slide one oligo or primer over itself for homodimer and hairpin configurations, or over the other primer for heterodimer estimation. Each possible offset corresponds to a state $s \in S$. The value $\Delta G / U_{ref}$ estimates the risk of hairpin and homodimer formation [6], with

$$\Delta G = -k_B T \ln \left[ \sum_{s \in S} e^{-U_s / (k_B T)} \right],$$

where $k_B = 0.0083144...$ is the Boltzmann constant and $U_s$ is the reference internal energy of the state $s$. Hence, for every single state $s \in S$,

$$U_s = -k_B T \ln e^{-U_s / (k_B T)}.$$

For each state $s$, we use the Wallace model [4] to weight the sum of the interactions for each base. Indeed, we estimate $U$ as

$$U = k_B \left[ 2\,(A + T) + 4\,(G + C) \right],$$

where A, C, G and T count the nucleotides of each type that are bonded with their complement in the current dimer configuration. We normalize each total energy estimate by a reference energy value $U_{ref}$ in order to select the best candidates regardless of oligo length. Indeed, a sorting criterion directly proportional to $U$ would systematically favor short oligonucleotides, because of their intrinsically lower hybridization energy. To compute $U_{ref}$, we simply sum the interaction factor (2 or 4) associated with each base of the oligonucleotide sequence and multiply the sum by $k_B$.

2.4 Domain Sampling

We select nearly non-overlapping candidates to ensure domain exploration and to diversify the population of candidates for the subsequent specificity optimization stage. We proceed by walking through the set of candidates sorted by internal energy, as shown in Figure 2. An item is discarded if it overlaps the union of previously retained oligonucleotides by more than a threshold $t$ of about 10 nucleotide positions. As the list is initially sorted by increasing internal energy, the procedure gives priority to oligonucleotides with lower internal energy, i.e., high-quality solutions are selected first.

The domain sampling stage is motivated by the observation that close neighbors in the candidate space, i.e., largely overlapping oligonucleotides, exhibit nearly identical scores. Therefore, the population of candidates must be diversified to avoid selecting a single cluster of close candidates.
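The walk described in this section can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function name, the tuple layout, and the default threshold of 10 are assumptions made for illustration.

```python
def sample_domain(candidates, max_overlap=10):
    """Priority-based sampling of nearly non-overlapping candidates.

    candidates: iterable of (start, length, energy) tuples.
    max_overlap: the threshold t; the default of 10 is an assumption.
    """
    covered = set()   # union of the positions of retained candidates
    retained = []
    # Walk the candidates by increasing internal energy (best first).
    for start, length, energy in sorted(candidates, key=lambda c: c[2]):
        span = range(start, start + length)
        overlap = sum(1 for pos in span if pos in covered)
        if overlap > max_overlap:
            continue  # too close to an already retained candidate
        retained.append((start, length, energy))
        covered.update(span)
    return retained

example = [(0, 20, 0.10), (5, 20, 0.20), (12, 20, 0.30), (40, 20, 0.05)]
selected = sample_domain(example)
```

In this example the candidate starting at position 5 is discarded, because 15 of its positions are already covered by the lower-energy candidate starting at position 0. The candidate starting at position 12 is kept, since only 8 of its positions are covered.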

Figure 2: Priority-based sampling. Rectangles represent the relative positions and lengths of the oligos. All candidates are sorted vertically by internal energy score. If a candidate is selected, its rectangle is filled in black. Regions masked by previously selected spans are cast in grey onto the following candidates. The first oligonucleotide, i.e., the one with the lowest internal energy, is always retained. The 6th candidate, for example, is discarded because of its overlap with the union of previously retained oligonucleotides.

2.5 Pairing of Primers

Oligonucleotide design for PCR applications generally requires further constraints: the pairing of oligonucleotides, then called primers; a low difference in $T_m$ between the two members of a pair; a maximum amplicon size; and a minimal risk of heterodimer formation. We generate a fixed number of primer pairs in increasing order of internal energy score. Then, we reject pairs whose amplicons fall outside the user-defined range of amplicon sizes. Valid pairs are sorted by increasing heterodimer risk, evaluated with the thermodynamic model presented in Section 2.3.

2.6 Specificity Optimization

The final selection stage identifies solutions that minimize cross-hybridization (i.e., hybridization with a non-target sequence within the genome). We evaluate specificity by extracting two quantities from the output of a Blast query:
1. whether the first hit corresponds to a perfect match of the candidate on the genome,
2. the number of base matches in the second hit.
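The ranking that this section describes can be sketched in Python as follows. The sketch does not invoke Blast: it assumes that the two quantities above have already been parsed into a small record, and the record type, field names, and function names are hypothetical. How a candidate without a perfect first hit is handled is not specified in the text, so ranking such candidates last is an assumption.

```python
from typing import NamedTuple

class BlastSummary(NamedTuple):
    """Hypothetical per-candidate summary parsed from a Blast report."""
    name: str
    first_hit_perfect: bool   # criterion 1
    second_hit_matches: int   # criterion 2

def rank_by_specificity(summaries):
    """Return the candidates with the most specific ones first.

    Assumption: candidates whose first hit is not a perfect match are
    ranked after all the others.
    """
    return sorted(
        summaries,
        key=lambda s: (not s.first_hit_perfect, s.second_hit_matches))

def pair_score(forward, reverse):
    """Score of a primer pair: the worse (larger) score of its members."""
    return max(forward.second_hit_matches, reverse.second_hit_matches)

candidates = [
    BlastSummary("a", True, 17),
    BlastSummary("b", True, 12),
    BlastSummary("c", False, 9),
]
ranked = [s.name for s in rank_by_specificity(candidates)]
```

Here `ranked` places "b" before "a" because its second hit matches fewer bases, and "c" comes last because its first hit is not a perfect match.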

Candidates are sorted by increasing Blast score of the second-best hit, so that more specific oligonucleotides are higher in the list. The specificity score of a primer pair is defined as the worse specificity score of its two members. Our approach therefore requires, for each candidate, a Blast pass on the considered genome or on a database of mRNA, depending on the application. To speed up this process, we propose to extract, for the considered genome, a database of non-specific regions dedicated to specificity testing. A given region is defined to be non-specific if it is similar to another region within the genome. Hence, against such a database, a significant first hit demonstrates poor specificity. A few alternative specificity evaluation heuristics have been proposed in the literature [7,8].

3 Implementation and Results

Two oligonucleotide design applications have been implemented in Java using our common multilevel optimization pipeline, namely OptiAmp (Design of Primers for PCR Amplifications) and LOD (Long Oligo Design). The OligoFaktory web portal embeds these bioinformatic tools in a web-based framework. This dynamic and interactive web application provides a consistent form-based input interface and a consistent presentation of outputs. Each plugin tool reads an input parameter file and writes its results to an output file. Both input and output files conform to a common XML interchange format. An XHTML form is associated with each tool to collect parameters from the user's input and to produce the input XML files. For all applications, a unified presentation of result sheets provides distribution graphs and location bar graphs to visualize the result set. Moreover, easy-to-spot warning flags are shown in case of problems with hairpin and homodimer secondary structures and/or with specificity. The project aims to help researchers achieve a painless, rapid, automated, and reliable design.
3.1 Brucellas Design

A hybrid micro-array was designed to capture the expression of genes for both Brucella suis 1330 and Brucella melitensis 16M microbial species. Pairing of orthologous genes and alignment of consensus sequences were performed as explained in [9]. This preprocessing yielded 2853 consensus sequences. Figure 3 shows charts of relevant features of the designed oligonucleotide set. Note that the oligonucleotide locations tend to cluster at the 3'-end of the query sequences. The specificity score indicates the maximum number of non-specific base pairings on the second-best hit found on the genome of Brucella suis. The first hit corresponds to the perfect match of the oligonucleotide on its target genes in both suis and melitensis. The temperature range does not exceed two degrees, as specified by the constraints.

Figure 3: Features of the designed micro-array. Panels: (a) oligo length distribution; (b) relative oligo locations; (c) $T_m$ distances from 79°; (d) specificity score distribution.

4 Discussion and Conclusions

We have presented a general framework for building oligonucleotide design applications. The method is based on the principles of multilevel optimization strategies. A general pipeline of candidate selection is described, and the algorithmic details of each stage are given.

The main contribution of our method, compared to current techniques, is by far the internal energy optimization stage based on statistical mechanics principles. Indeed, we observed that a poor evaluation of dimerization risk is one of the most important causes of failure in applications, if not the most important. Our proposal is a refined and realistic model for evaluating the risk of dimer formation. Evaluating this model requires more computational resources than the common heuristics. However, today's workstations allow the design of large batches of queries with our method.

A refined model based on nearest neighbors could be used to better estimate the internal energy $U$. However, we observed that the classification of dimer configurations performed with the computationally cheap Wallace model is accurate in practice.

The domain sampling strategy is essential to allow a trade-off between exploration of the solution space and exploitation of the best solutions found so far. Furthermore, additional optimization constraints can be taken into account by appending successive domain sampling stages, each followed by selection. Introducing diversification within the population of candidates is a key point for a parameterizable multi-criteria optimization pipeline.

Acknowledgments

The Internal Energy Optimization section has greatly benefited from discussions with Daniel Van Belle on statistical mechanics. This work was supported by the Université Libre de Bruxelles (ULB) and the Région Wallonne (BioRobot-Initiative 114840).

References

1. Burpo JF: A critical review of PCR primer design algorithms and cross-hybridization case study. Tech. rep., Department of Chemical Engineering, Stanford University 2001.
2. Kampke T, Kieninger M, Mecklenburg M: Efficient primer design algorithms. Bioinformatics 2001, 17(3).
3. Candler W, Norton R: Multilevel Programming. Tech. rep., unpublished research memorandum, DRC, World Bank, Washington 1976.
4. Wallace RB, Shaffer J, Murphy RF, Bonner J, Hirose T, Itakura K: Hybridization of synthetic oligodeoxyribonucleotides to phi chi 174 DNA: the effect of single base pair mismatch. Nucleic Acids Research 1979, 6.
5. Wetmur JG, Sninsky JJ: PCR Strategies, Academic Press 1995, chap. 6:69-83.
6. Groebe DR, Uhlenbeck OC: Characterization of RNA hairpin loop stability. Nucleic Acids Research 1988, 16.
7. Kurata K, Nakamura H: Novel Method for Primer/Probe Design and Sequence Analysis. Tech. rep., School of Engineering, The University of Tokyo 2000.
8. Rahmann S: Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach. Journal of Bioinformatics and Computational Biology 2003, 1(2):343-361.
9. Schretter C, Milinkovitch MC: Automated Long Oligo Design on Consensus Regions of Similar Genomes. Tech. rep., Unit of Evolutionary Genetics, Université Libre de Bruxelles 2004.