Oligonucleotide Design by Multilevel Optimization


January 21, 2005. Technical Report. Oligonucleotide Design by Multilevel Optimization Colas Schretter & Michel C. Milinkovitch e-mail: mcmilink@ulb.ac.be

Oligonucleotide Design by Multilevel Optimization

Colas Schretter (1), Michel C. Milinkovitch (1)

(1) Laboratory of Evolutionary Genetics, Institute of Molecular Biology and Medicine (IBMM), Free University of Brussels (ULB)

Email: Colas Schretter - cschrett@ulb.ac.be; Michel C. Milinkovitch - mcmilink@ulb.ac.be; Corresponding author

Abstract

Background: Many molecular biology experiments make use of small RNA or DNA sequences called oligonucleotides. Their success is highly dependent on oligonucleotide design. The constraints and properties required of such oligonucleotides vary among applications, such as long oligos for micro-arrays, primer pairs for PCR amplification and sequencing, and siRNA to knock down gene expression. Most methods proposed in the literature are conceived with a single, specific application in mind. The aim of our work is to specify a general framework for building design applications. Every algorithm given here is a building block that can be combined with others to create a customized oligonucleotide design pipeline.

Results: We present a collection of complementary techniques for the selection of high-quality oligonucleotides for PCR and DNA array experiments. The general pipeline proceeds by successive selection of the best candidates on various criteria, such as minimization of secondary structures, using statistical mechanics approaches, and maximization of specificity. The latter is optimized by performing searches on the genome for a short list of finalist candidates. Furthermore, we maintain diversity in the population of candidates to ensure domain exploration.

Conclusions: The method of candidate selection we developed yields high-quality oligonucleotides and is implemented in a collection of design applications that is available at http://www.ulb.ac.be/sciences/ueg/softwares

1 Background

A comparison of the most recognized solutions to the problem of oligonucleotide design [1,2] underscores the importance of several key features for high-quality design. Specifically, a successful computer-assisted oligonucleotide design should:

- enforce hard bounds on properties such as the estimated melting temperature and the oligo length;
- minimize the likelihood of secondary structure formation, namely hairpins, homodimers and heterodimers;
- achieve perfect local complementarity with the query sequence;
- minimize cross-hybridization with non-target sequences within the considered genome.

However, most of these constraints are independent, so it is impossible to define a single objective function whose solutions are optimal across all constraints. Hence, we formulate the problem as a multi-criteria optimization problem. Multi-criteria optimization methods [3] seek the Pareto-optimal set of solutions. A design is said to be Pareto-optimal if no feasible design improves one of the objectives without simultaneously worsening at least one other objective.

Our approach consists of pruning the domain of candidate oligonucleotides by optimizing each set of independent constraints in a tour-by-tour process. After each tour, the size of the set of candidate solutions is reduced. Furthermore, we maintain diversity within the population of candidates by selecting only nearly non-overlapping candidates, hence allowing for a trade-off between exploitation of the best solutions so far and exploration of the domain of potential solutions.

2 Method

Each possible candidate sub-sequence enters the selection pipeline shown in Figure 1. The set of best candidates is determined by searching for the cluster of best solutions based on the current criterion. This approach is a heuristic, as a possibly optimal oligonucleotide could be discarded at an early stage of the selection pipeline without being tested against the other criteria. Hence, we keep the largest possible set of acceptable candidates at each stage.
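The tour-by-tour pruning described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(start, length)` candidate representation, the function names and the two toy scoring criteria are assumptions made for illustration only. Each stage sorts the surviving candidates on one criterion (lower is better) and keeps a fixed number of them.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a candidate is a (start, length) window
# inside the design area of the query sequence.
Candidate = Tuple[int, int]
# A stage pairs a scoring function (lower is better) with a survivor count.
Stage = Tuple[Callable[[Candidate], float], int]


def generate_candidates(width: int, l_min: int, l_max: int) -> List[Candidate]:
    """Enumerate every (start, length) window that fits in the design area."""
    return [
        (start, length)
        for length in range(l_min, l_max + 1)
        for start in range(width - length + 1)
    ]


def run_pipeline(candidates: Sequence[Candidate],
                 stages: Sequence[Stage]) -> List[Candidate]:
    """Apply each stage in turn, keeping only the best-scoring candidates."""
    survivors = list(candidates)
    for score, keep in stages:
        survivors = sorted(survivors, key=score)[:keep]
    return survivors


pool = generate_candidates(width=500, l_min=20, l_max=30)
# Two toy criteria (illustrative only): prefer windows starting near
# position 250, then prefer shorter windows.
toy_stages = [(lambda c: abs(c[0] - 250), 200), (lambda c: c[1], 10)]
finalists = run_pipeline(pool, toy_stages)
```

For a design area of width 500 and lengths from 20 to 30, `generate_candidates` returns 5236 windows, which matches the example quoted in Section 2.1. In a real pipeline the scoring functions would be the constraint filter, the internal energy and the specificity score rather than the toy criteria used here.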

[Figure 1: The selection pipeline. After each stage, only a subset of the candidates is retained. Diagram labels: Query Sequence, Constraints, Parameters, Combinatorial Generation, Constraints Filter, Dimer Energy Optimization, Domain Sampling, Specificity Optimization, Result Set.]

2.1 Generation of Candidates

Every candidate oligonucleotide that conforms to the user-specified minimum and maximum lengths is generated from the query sequence. The number of such candidates is

$$N = \sum_{i=L_{\min}}^{L_{\max}} (W - i + 1), \qquad L_{\min} \le L_{\max} \le W,$$

where $W$ is the design area width, and $L_{\min}$ and $L_{\max}$ are the minimum and maximum oligonucleotide lengths, respectively. Although $N$ grows with both $W$ and the length range $(L_{\max} - L_{\min})$, the computational resources of a workstation allow the generation of every candidate for the values used in practice, e.g., $N = 5236$ if $W = 500$, $L_{\min} = 20$ and $L_{\max} = 30$. The complete coverage of the solution domain at this early stage ensures that an a priori optimal solution is not missed.

2.2 Filtering from Constraints

The set of candidates is then filtered against a series of user-specified hard constraints:

- the accepted range of the melting temperature estimate ($T_m$);
- the accepted tolerance for overlap with repeated or microsatellite regions;
- the minimum and mean query sequence quality at the oligonucleotide positions.

The filtering on quality is flexible, and the user can skip that criterion. Furthermore, in the case of primer design, we provide independent quality testing for the 3'-end of the oligonucleotides.

2.2.1 T_m Estimation

We use well-known and validated methods for $T_m$ estimation. If the oligonucleotide is shorter than 20 nucleotides, we use the Wallace model [4]

$$T_m = 2\,(A + T) + 4\,(G + C),$$

where $A$, $C$, $G$ and $T$ are the numbers of the corresponding nucleotides. Otherwise, we use a more elaborate thermodynamic method [5]

$$T_m = \frac{T\,\Delta H}{\Delta H - \Delta G + R\,T \ln(C)} + 16.6 \log[\mathrm{salt}] - 269.3,$$

where $T = 298.2$ K is an experimental temperature, $\Delta H$ and $\Delta G$ are, respectively, the sums of the nearest-neighbor enthalpies and Gibbs free energies (in cal/mol), $R = 1.987$ cal/(mol K) is the molar gas constant, $C$ is the oligonucleotide concentration, and $[\mathrm{salt}]$ is a correction term that depends on the experimental salt conditions.

2.2.2 Repetitions and Microsatellites Masking

Because they are generally very numerous within genomes, repeated regions are non-specific, and oligonucleotides containing repeats should be avoided. Therefore, we attach to each query sequence a binary mask that indicates the positions of repeats. The union of all repeated regions is found by regular expression matching. We accept masked bases within oligonucleotides with a tolerance proportional to the length of the candidate oligonucleotide. Indeed, a given oligonucleotide can be a very high-quality solution even if a few of its positions overlap masked regions.

2.3 Internal Energy Optimization

All candidate oligonucleotides that passed the above-described stages are sorted according to their internal energy, i.e., their relative risk of forming hairpin and homodimer secondary structures. All possible hairpin and homodimer configurations $s$ are enumerated.
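A minimal Python sketch of the dimer-state enumeration may help fix ideas. It rests on several assumptions that are not confirmed by the source: states are the offsets obtained by sliding the reversed strand over the oligonucleotide; each bonded base contributes a Wallace-style weight of 2 (A/T) or 4 (G/C) multiplied by $k_B$; the free energy is taken as $G = k_B T \ln \sum_s e^{U_s / k_B T}$ with the sign convention shown in the code; the temperature of 298.2 K is borrowed from the $T_m$ section; and hairpin loop constraints are ignored. All function names are hypothetical.

```python
import math

K_B = 0.0083144          # constant quoted in the text
TEMPERATURE = 298.2      # kelvin; an assumption borrowed from the T_m section
WEIGHT = {"A": 2, "T": 2, "G": 4, "C": 4}       # Wallace-style weights
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def state_energies(oligo, other=None):
    """Return U_s for every offset of `other` slid antiparallel over `oligo`.

    With `other` omitted the oligo is slid over itself (homodimer states);
    passing the second primer gives heterodimer states.
    """
    other = oligo if other is None else other
    reverse = other[::-1]
    energies = []
    for offset in range(-(len(reverse) - 1), len(oligo)):
        bonded = 0
        for j, base in enumerate(reverse):
            i = offset + j
            if 0 <= i < len(oligo) and COMPLEMENT[oligo[i]] == base:
                bonded += WEIGHT[oligo[i]]
        energies.append(K_B * bonded)
    return energies


def reference_energy(oligo):
    """U_ref: every base of the oligo bonded to its complement."""
    return K_B * sum(WEIGHT[base] for base in oligo)


def dimer_risk(oligo, other=None, temperature=TEMPERATURE):
    """G / U_ref, with G = k_B T ln(sum over states of exp(U_s / k_B T))."""
    kt = K_B * temperature
    partition = sum(math.exp(u / kt) for u in state_energies(oligo, other))
    return kt * math.log(partition) / reference_energy(oligo)
```

Under these assumptions, a self-complementary sequence such as `"GAATTC"` has one state in which every base is bonded, so its largest state energy equals its reference energy and its risk exceeds that of `"GAAAAC"`, which has the same reference energy but far fewer possible bonds.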

We slide one oligo or primer over itself for homodimer and hairpin configurations, or over the other primer in the case of heterodimer estimation. Each possible offset corresponds to a state $s \in S$. The value $G/U_{ref}$ estimates the risk of hairpin and homodimer formation [6], with

$$G = k_B T \ln \left[ \sum_{s \in S} e^{(U_s / k_B T)} \right],$$

where $k_B = 0.0083144...$ is the Boltzmann constant and $U_s$ is the reference internal energy of the state $s$. Hence, for a single state $s$,

$$U_s = k_B T \ln e^{(U_s / k_B T)}.$$

For each state $s$, we use the Wallace model [4] to weight the sum of the interactions for each base. We estimate $U$ as

$$U = k_B \left[ 2\,(A + T) + 4\,(G + C) \right],$$

where $A$, $C$, $G$ and $T$ are the numbers of nucleotides of each type that are hydrogen-bonded to their complement in the current dimer configuration. We normalize each total energy estimate by a reference energy value $U_{ref}$ to select the best candidates regardless of oligo length. Indeed, a sorting criterion directly proportional to $U$ would systematically favor short oligonucleotides because of their intrinsically lower hybridization energy. To compute $U_{ref}$, we simply sum the interaction factor (2 or 4) associated with each base of the oligonucleotide sequence and multiply the sum by $k_B$.

2.4 Domain Sampling

We select nearly non-overlapping candidates to ensure domain exploration and to diversify the population of candidates for the subsequent specificity optimization stage. We proceed by walking the set of candidates sorted on their internal energy, as shown in Figure 2. An item is discarded if it overlaps the union of previously retained oligonucleotides by more than a threshold $t$ of about 10 nucleotide positions. As the list is initially sorted by increasing internal energy, the procedure gives priority to oligonucleotides with lower internal energy, i.e., high-quality solutions are selected first.

The domain sampling stage is motivated by the observation that close neighbors in the candidate space, i.e., largely overlapping oligonucleotides, exhibit nearly identical scores. Therefore, the population of candidates must be diversified to avoid selecting a single cluster of close candidates.
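The priority-based sampling can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' code: the `(start, length, energy)` tuple layout, the function name and the default threshold of 10 positions are assumptions.

```python
from typing import List, Set, Tuple

# Hypothetical layout: (start position, length, internal energy score).
Candidate = Tuple[int, int, float]


def domain_sample(candidates: List[Candidate],
                  max_overlap: int = 10) -> List[Candidate]:
    """Walk candidates by increasing energy and keep those that overlap the
    union of previously retained spans by at most `max_overlap` positions."""
    covered: Set[int] = set()          # union of retained positions
    retained: List[Candidate] = []
    for start, length, energy in sorted(candidates, key=lambda c: c[2]):
        span = range(start, start + length)
        overlap = sum(1 for position in span if position in covered)
        if overlap > max_overlap:
            continue                   # too close to an already retained oligo
        retained.append((start, length, energy))
        covered.update(span)
    return retained


example = [(0, 20, 0.1), (5, 20, 0.2), (15, 20, 0.3), (40, 20, 0.4)]
sampled = domain_sample(example)
```

In this example the second candidate shares 15 positions with the first and is discarded, while the third shares only 5 and is kept. The candidate with the lowest energy is always retained, because the covered set is empty when it is examined.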

Figure 2: Priority-based sampling. Rectangles represent the relative positions and lengths of oligos. All candidates are sorted vertically on their internal energy score. If a candidate is selected, its rectangle is filled in black. Regions masked by previously selected spans are shown in grey on the following candidates. The first oligonucleotide, i.e., the one with the lowest internal energy, is always retained. The 6th candidate, for example, is discarded because of its overlap with the union of previously retained oligonucleotides.

2.5 Pairing of Primers

Oligonucleotide design for PCR applications generally requires further constraints: pairing of oligonucleotides (then called primers), a low difference in $T_m$ between the two members of the pair, a maximum amplicon size, and minimization of the risk of heterodimer formation. We generate a fixed number of primer pairs in increasing order of internal energy score. Then, we reject pairs whose amplicons fall outside the user-defined range of amplicon sizes. Valid pairs are sorted by increasing heterodimer risk, evaluated with the thermodynamic model presented in Section 2.3.

2.6 Specificity Optimization

The final selection stage identifies solutions that minimize cross-hybridization (i.e., hybridization with a non-target sequence within the genome). We evaluate specificity by determining, from the output of a Blast query:

1. whether the first hit corresponds to a perfect match of the candidate on the genome,
2. the number of base matches in the second hit.
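The two quantities above can be turned into a ranking with a short Python sketch. It does not call Blast: it assumes the hits for each candidate have already been parsed into a list of matched-base counts, best hit first. The class, its field names and the choice to drop candidates whose first hit is not a perfect match are assumptions made for illustration. The pair score follows the rule, stated in this section, that a primer pair is scored by the worse of its two members.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpecificityRecord:
    """Hypothetical container for parsed search results of one candidate."""
    name: str
    length: int
    hit_matches: List[int]   # matched bases per hit, best hit first

    @property
    def perfect_first_hit(self) -> bool:
        # Criterion 1: the best hit matches the candidate over its full length.
        return bool(self.hit_matches) and self.hit_matches[0] == self.length

    @property
    def second_hit(self) -> int:
        # Criterion 2: matched bases in the second hit (0 if there is none).
        return self.hit_matches[1] if len(self.hit_matches) > 1 else 0


def rank_by_specificity(records: List[SpecificityRecord]) -> List[SpecificityRecord]:
    """Keep candidates with a perfect first hit, most specific first."""
    valid = [record for record in records if record.perfect_first_hit]
    return sorted(valid, key=lambda record: record.second_hit)


def pair_score(forward: SpecificityRecord, reverse: SpecificityRecord) -> int:
    """Score a primer pair by the worse (larger) second-hit score."""
    return max(forward.second_hit, reverse.second_hit)


a = SpecificityRecord("a", 20, [20, 14])
b = SpecificityRecord("b", 20, [20, 9])
c = SpecificityRecord("c", 20, [18, 7])
ranking = [record.name for record in rank_by_specificity([a, b, c])]
```

Here candidate `c` is dropped because its best hit is not a perfect match, and `b` ranks ahead of `a` because its second-best hit matches fewer bases.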

Candidates are sorted by increasing Blast score of the second-best hit: more specific oligonucleotides are higher in the list. The specificity score of a primer pair is defined as the worse specificity score of its two members. Our approach therefore requires, for each candidate, a Blast pass on the considered genome or on a database of mRNA, depending on the specific application. To speed up this process, we propose to extract, for the considered genome, a database of non-specific regions dedicated to specificity testing. Hence, a significant first hit in this database demonstrates poor specificity. A given region is defined to be non-specific if it is similar to another region within the genome. A few alternative specificity evaluation heuristics are proposed in the literature [7,8].

3 Implementation and Results

Two oligonucleotide design applications have been implemented in Java using our common multilevel optimization pipeline, namely OptiAmp (Design of Primers for PCR Amplifications) and LOD (Long Oligo Design). The OligoFaktory web portal embeds these bioinformatic tools in a web-based framework. The dynamic and interactive web application provides a consistent form-based input interface and presentation of outputs. Each plugin tool reads an input parameter file and writes its results to an output file. Both input and output files conform to a common XML interchange file format. An XHTML form is associated with each tool to fetch parameters from the user's input and to produce input XML files. For all applications, a unified presentation of result sheets provides distribution graphs and location bar graphs to visualize the result set. Moreover, easy-to-spot warning flags are shown in case of problems with hairpin and homodimer secondary structures and/or with specificity. The project aims to give researchers a painless, rapid, automated, and reliable design process.

3.1 Brucellas Design

A hybrid micro-array was designed to capture the expression of genes for both the Brucella suis 1330 and Brucella melitensis 16M microbial species. Pairing of orthologous genes and alignments of consensus sequences were performed as explained in [9]. This preprocessing yielded 2853 consensus sequences. Figure 3 shows charts of relevant features of the designed oligonucleotide set. Note that the oligonucleotide locations tend to cluster toward the 3'-end of the query sequences. The specificity score indicates the maximum number of non-specific base pairings on the second-best hit found on the genome of Brucella suis. The first hit corresponds to the perfect match of the oligonucleotide on its target genes in both suis and melitensis. The temperature range does not exceed two degrees, as specified by the constraints.

[Figure 3: Features of the designed micro-array. Panels: (a) Oligos Length Distribution; (b) Relative Oligos Locations; (c) Tm Distances from 79°; (d) Specificity Scores Distribution.]

4 Discussion and Conclusions

We have presented a general framework for building oligonucleotide design applications. The method is based on the principles of multilevel optimization strategies. A general pipeline of candidate selection is described, and algorithmic details of each stage are given.

The main contribution of our method, compared to current techniques, is by far the internal energy optimization stage based on statistical mechanics principles. Indeed, in our experience, a poor evaluation of dimerization risk is one of the most important causes of failure in applications, if not the most important. Our proposal is a refined and realistic model for evaluating the risk of dimer formation. Evaluating this model requires more computational resources than the common heuristics. However, today's workstations allow the design of large batches of queries with our method. A refined model based on nearest neighbors could be used to better estimate the internal energy $U$. However, we observed that the classification of dimer configurations performed with the computationally cheap Wallace model is accurate in practice.

The domain sampling strategy is essential to allow a trade-off between exploration of the solution space and exploitation of the best solutions found so far. Furthermore, additional optimization constraints can be taken into account by appending successive domain sampling stages, each followed by selection. Introducing diversification within the population of candidates is a key point for a parameterizable multi-criteria optimization pipeline.

Acknowledgments

The Internal Energy Optimization section has greatly benefited from discussions with Daniel Van Belle on statistical mechanics. This work was supported by the Université Libre de Bruxelles (ULB) and the Région Wallonne (BioRobot-Initiative 114840).

References

1. Burpo JF: A critical review of PCR primer design algorithms and crosshybridization case study. Tech. rep., Department of Chemical Engineering, Stanford University 2001.
2. Kämpke T, Kieninger M, Mecklenburg M: Efficient primer design algorithms. Bioinformatics 2001, 17(3).
3. Candler W, Norton R: Multilevel Programming. Tech. rep., unpublished research memorandum, DRC, World Bank, Washington 1976.
4. Wallace RB, Shaffer J, Murphy RF, Bonner J, Hirose T, Itakura K: Hybridization of synthetic oligodeoxyribonucleotides to phi chi 174 DNA: the effect of single base pair mismatch. Nucleic Acids Research 1979, 6.
5. Wetmur JG, Sninsky JJ: PCR Strategies, Academic Press 1995 chap. 6:69-83.
6. Groebe DR, Uhlenbeck OC: Characterization of RNA hairpin loop stability. Nucleic Acids Research 1988, 16.
7. Kurata K, Nakamura H: Novel Method for Primer/Probe Design and Sequence Analysis. Tech. rep., School of Engineering, The University of Tokyo 2000.
8. Rahmann S: Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach. Journal of Bioinformatics and Computational Biology 2003, 1(2):343-361.
9. Schretter C, Milinkovitch MC: Automated Long Oligo Design on Consensus Regions of Similar Genomes. Tech. rep., Unit of Evolutionary Genetics, Université Libre de Bruxelles 2004.