Clustering over DNA Strings

Proceedings of the 5th WSEAS Int. Conf. on COMPUTATIONAL INTELLIGENCE, MAN-MACHINE SYSTEMS AND CYBERNETICS, Venice, Italy, November 20-22, 2006, pp. 184-188

NURIA GOMEZ BLAS, EUGENIO SANTOS, MIGUEL ANGEL DIAZ

This work has been partially supported by Spanish Grant TIC2003-09319-c03-03.

Abstract: This paper presents an unsupervised learning algorithm over strings that can be applied to amino acid sequences. The algorithm yields a set of amino acids that has minimal distance to the whole DNA pool. Biological laboratory protocols can then be studied on this cluster instead of on the whole pool, and the resulting protocols can be used on the whole pool. This process can help to develop protocols applicable in the DNA computing paradigm. A brief review of DNA computing is also given.

Key Words: DNA Computing, String Clustering, Unsupervised Learning, Protocols

1 Introduction

Deoxyribonucleic acid (DNA) is the chemical inside the cell nucleus that carries the genetic instructions for making living organisms. A DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder. The sides are made of sugar and phosphate molecules. The rungs are made of nitrogen-containing chemicals called bases. Each strand is a chain of nucleotides, and each nucleotide is composed of one sugar molecule, one phosphate group, and a base. Four different bases are present in DNA: adenine (A), thymine (T), cytosine (C), and guanine (G). The particular order of the bases arranged along the sugar-phosphate backbone is called the DNA sequence; the sequence specifies the exact genetic instructions required to create a particular organism with its own unique traits.

The two strands of the DNA molecule are held together at their bases by weak (hydrogen) bonds. The four bases pair in a set manner: adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G). These pairs of bases are known as base pairs; see Figure 1.
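The pairing rule above (A with T, C with G) can be illustrated with a minimal Python sketch. The function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the paper.

```python
# Base pairing as stated in the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("CATTAGC"))          # GTAATCG
print(reverse_complement("CATTAGC"))  # GCTAATG
```

The reverse complement is included because the two strands of a double helix run in opposite directions, so the partner strand is conventionally read backwards.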
DNA and RNA computing is a new computational paradigm that harnesses biological molecules to solve computational problems. Research in this area began with an experiment by Leonard Adleman [1] in 1994 that used the tools of molecular biology to solve a hard computational problem. Adleman's experiment solved a small instance of the Hamiltonian Path Problem, a close relative of the Traveling Salesman Problem (TSP), by manipulating DNA. This marked the first solution of a mathematical problem with the tools of biology.

1.1 Enzymes

In molecular biology, enzymes play an important role as catalysts of reactions and as transformers of the information encoded in DNA molecules. Several enzymes have played an important role in DNA computing and, as tools for the manipulation and transformation of DNA molecules, will continue to do so [2].

Figure 1: DNA Molecule.

Restriction enzymes arise in natural systems as part of the cell's defense mechanisms, in which they cut up DNA that is not protected by specific chemicals, that is, not methylated. Typically, a restriction enzyme recognizes a specific nucleotide sequence in double-stranded DNA and cuts the molecule at that point. Another enzyme is ligase, which aids the formation of the phosphodiester linkages in double-stranded DNA. Other useful enzymes are exonucleases and endonucleases, which chop DNA into its component nucleotides from the end of a strand or from a location internal to a strand, respectively.

An important technique in molecular biology is the Polymerase Chain Reaction (PCR). The enzyme polymerase synthesizes a DNA strand in the 5' to 3' direction on a DNA template. PCR is used for the amplification of specific base sequences. In the basic technique, double-stranded DNA is separated, and short oligonucleotide primers are hybridized to the target. This requires that part of the desired sequence be known, so that the primers can be determined. Then, new DNA is synthesized from the primers using a polymerase. At this point, the amount of DNA has doubled. Then, additional primers are added to delineate the other end of the target sequence, and the entire process is repeated. Because of its repetitive nature, PCR has been automated, and it is capable of generating a huge number of copies of the target sequence.

Gel electrophoresis [3] is a technique used to separate and visualize DNA molecules based on their size. It exploits the fact that DNA molecules are electrically charged. The gel is a matrix of molecules with holes of varying size. Under an applied electric field, DNA molecules move a distance that depends on their size, with smaller molecules moving farther. By comparison to standards, the sizes of the molecules under analysis are determined. With appropriate cutting of molecules and standards, DNA molecules can be sequenced with this technique.
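Two of the laboratory techniques described in this section have simple in-silico analogues, sketched below in Python. The sketch is idealized: it assumes that every PCR cycle exactly doubles the target and that a gel orders fragments purely by length. The function names `pcr_copies` and `gel_order` are illustrative only.

```python
def pcr_copies(initial_copies: int, cycles: int) -> int:
    """Idealized PCR: each cycle doubles the amount of target DNA."""
    return initial_copies * 2 ** cycles


def gel_order(fragments: list[str]) -> list[str]:
    """Order fragments as a gel separates them.

    Smaller molecules move farther, so the shortest fragment comes first
    (farthest from the starting well).
    """
    return sorted(fragments, key=len)


print(pcr_copies(1, 30))                        # 1073741824
print(gel_order(["ACGTACGT", "ACG", "ACGTA"]))  # ['ACG', 'ACGTA', 'ACGTACGT']
```

Thirty idealized cycles already yield about a billion copies from a single molecule, which is why the text describes PCR as generating a huge number of copies.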

2 DNA Computing

Solving a problem with DNA must be done in a laboratory in order to exploit the full advantages of parallelism and recombination. The DNA computing process can be summarized in these steps:

Generate DNA strands corresponding to the coded individuals of the initial population.
Some information must be coded into DNA in order to solve a problem; therefore, a combination of nucleotides in a DNA strand is needed. A specific codification must be defined for a given problem according to the biological operations involved, that is, sticky ends and the way enzymes work. For example, the following string represents a DNA strand that codes an individual; it has two sticky ends:

      ...CGTAACCGTACCCAGACGTACGTAACCGT
    TGCATGCATTGGCATGGGTCTGCATGCAT...

Apply enzymes in order to start the recombination process.
The way DNA strands join or are modified depends on the application of enzymes, which manipulate and transform DNA molecules. The most usual enzymes are restriction enzymes, polymerase, ligase, exonucleases and endonucleases (see Section 1.1). After applying an enzyme, the previous string could result in two new DNA strands:

      ...CGTAACCGTACCC
    TGCATGCATTGGCAT...

    ...AGACGTACGTAACCGT
    GGGTCTGCATGCAT...

Extract the solution.
Usually, gel electrophoresis is used to separate DNA molecules based on their size, knowing the number of nucleotides of the desired solution.

The next section gives an example of the Travelling Salesman Problem using DNA computation.

2.1 TSP DNA Algorithm

The following sequence of steps, which are adapted from Adleman's DNA algorithm, will extract the shortest route between two cities S and C.

1. Assign a unique DNA sequence for each city.
2. Construct DNA representations of binary paths between two cities X and Y as follows, where n is the distance between the two cities.
3. To form longer routes out of binary paths, splice two binary paths P1 and P2 together if and only if the final city code for P1 matches the first city code for P2. Delete half of the code of the final city in P1 and half of the code of the first city in P2.
4. To prevent loops, do not form longer routes if the final city in P2 matches the first city in P1.
5. Repeat the above two steps until no more routes can be formed.
6. Place all the DNA sequences produced so far into a test tube.
7. Extract all those routes which start with the DNA code of the desired start city and place them in a separate test tube.
8. Extract all those routes which end with the DNA code of the desired destination city and place them in a separate test tube.
9. Sort all the remaining routes by length. Gel electrophoresis can be used here.

The shortest DNA sequence represents the shortest route between the desired start and destination cities. Also, the sequence of DNA codes in the shortest strand provides information as to the order in which cities are encountered, and the length of the route can be calculated.
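The nine steps above can be simulated in software. The following Python sketch is not the laboratory protocol itself, and several details are assumptions made only for illustration: the city codes in `CITY_CODE`, the example distances in `ROADS`, the encoding of a binary path as the code of X, then n spacer symbols `N`, then the code of Y, and a loop rule that is stricter than step 4 (no city may be visited twice), which guarantees that the simulation terminates.

```python
# Hypothetical 4-base city codes and road distances (illustrative only).
CITY_CODE = {"S": "ACGT", "A": "AGCA", "B": "TCGA", "C": "GGCC"}
HALF = 2  # half the length of a city code
ROADS = {("S", "A"): 3, ("A", "C"): 4, ("S", "B"): 2, ("B", "C"): 9, ("A", "B"): 1}


def binary_path(x, y, n):
    """Step 2 (assumed encoding): code of X, n spacer symbols, code of Y."""
    return CITY_CODE[x] + "N" * n + CITY_CODE[y]


def splice(p1, p2):
    """Step 3: join two paths, deleting half of each copy of the shared city code."""
    (cities1, strand1), (cities2, strand2) = p1, p2
    return cities1 + cities2[1:], strand1[:-HALF] + strand2[HALF:]


def all_routes(roads):
    """Steps 3 to 6: keep splicing until no new route appears (the 'test tube')."""
    binary = [((x, y), binary_path(x, y, n)) for (x, y), n in roads.items()]
    routes, frontier = set(binary), set(binary)
    while frontier:
        new = set()
        for route in frontier:
            for step in binary:
                # Ends must match; revisiting any city is forbidden (assumption).
                if route[0][-1] == step[0][0] and step[0][1] not in route[0]:
                    joined = splice(route, step)
                    if joined not in routes:
                        new.add(joined)
        routes |= new
        frontier = new
    return routes


def shortest_route(roads, start, goal):
    """Steps 7 to 9: filter by start code, then by end code, then sort by length."""
    tube = [r for r in all_routes(roads) if r[1].startswith(CITY_CODE[start])]
    tube = [r for r in tube if r[1].endswith(CITY_CODE[goal])]
    tube.sort(key=lambda r: len(r[1]))
    return tube[0] if tube else None


cities, strand = shortest_route(ROADS, "S", "C")
print(cities, strand, strand.count("N"))
```

With these example roads the shortest strand corresponds to the route S, A, C, and counting the spacer symbols recovers its length (3 + 4 = 7). Because every city contributes a fixed number of bases to a strand, strand length only ranks routes by distance when the city codes are short relative to the distances involved; this is a property of the assumed encoding, not a claim about the original experiment.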

String s                                Δ(u_i, c)  Cluster c
                                                   ABCDEFGH
BHHBHGFGBEBFEGABFGCDGFGAECHFEHFDHFFGDA             ABCDEFGH
FEADGEFFAAAHDFGHEEFFGC                  B          ACDEFGH
CEAHBFEFEAADBDFGFGGHAB                  B          ACDEFGHB
ECEBBHBEBDEEABDAGGCH                    F          ACDEGHB
EEGFBHCFFEBFHHFHAFHE                    FD         ACEGHB
FCEECGDDFGDDAEHHCCFADACHF               FDB        ACEGH
HHCHFBAEEDHAFEEFHBGAFCBDHA              FBD        ACEGHD
DHBABHAECDHHCFABBHDFCBACABFDGFDHGE      BF         ACEGHDF
BDBDBAEGEFHADGHFBAHGEAGBH               BC         AEGHDF
CDAECGGBEEBCBHEHGGEGCDBEDAFEH           CB         AEGHDFB
CCHBGBCCFCFAFABGHGGCBBBHHED             C
HGHBCCADEDBFEGBBDBFBEEF
GAAHHCDABGHDDFAAEBGBDG
BFBDEGEGEFCACAEGBHBGEEEHHHABADEGCFAFG
ADACBFHDCDBAGFHHFHAFEGACCGDCA
EBHFCDADBFGACBFBFGFHHCBDFACBDF
DFDAGAFEDHBEECEEGHHE
HFEADDCBBABHBFFBGCGBCBEDF
CCGAGCGGHFGHCACDFGCGD                   EB         AGHDFBC
DEEDCHBCHACFGBGDFBHFDGBD                E          AGHDFBCE

Table 1: One iteration in the learning algorithm with 20 strings (8 symbols/amino acids).

3 DNA Clustering

The whole process of DNA computing must be performed in a laboratory, that is, using test tubes, applying enzymes, and heating or cooling test tubes in order to produce reactions among the strands. A priori knowledge of these conditions might help biologists to adapt protocols for performing such tasks, that is, computation with DNA. The following learning algorithm can be applied over strings representing the amino acid sequence in a DNA pool (that is, grouping nucleotides three by three) in order to obtain one or more clusters (amino acid sets) that represent the whole DNA pool.

Given two strings u and v, a distance δ(u, v) is defined as the set of symbols that appear in u and do not appear in v.
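The two ingredients just introduced, grouping nucleotides three by three and the directed distance δ, can be sketched in Python as follows. The function names `codons` and `delta` are illustrative, and the sketch treats each triplet as an opaque symbol rather than translating it into an amino acid.

```python
def codons(dna: str) -> list[str]:
    """Split a DNA string into consecutive triplets, ignoring a trailing partial one."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def delta(u, v) -> set:
    """Directed distance: the symbols that appear in u and do not appear in v."""
    return set(u) - set(v)


print(codons("ATGGCCTAA"))                           # ['ATG', 'GCC', 'TAA']
print(delta("abcde", "age"))                         # the set {b, c, d}
print(delta(codons("ATGGCCTAA"), codons("ATGTAA")))  # {'GCC'}
```

Note that δ is not symmetric: `delta("abcde", "age")` contains b, c and d, whereas `delta("age", "abcde")` contains only g.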
A more general distance measure can be defined as $\Delta(u, v) = \delta(u, v) \cup \delta(v, u)$. In this case some mathematical properties hold. Let $u = abcde$, $v = age$ and $w = hiaj$; then:

$\Delta(u, v) = \{b, c, d, g\} = \Delta(v, u) = \{g, b, c, d\}$

$\Delta(u, \Delta(v, w)) = \{a, g, h, i, j, b, c, d\} = \Delta(\Delta(u, v), w) = \{g, b, c, d, h, i, j, a\}$

$\Delta(u, w) = \{b, c, d, e, h, i, j\} = \Delta(\Delta(u, v), \Delta(v, w)) = \{b, h, i, j, e, c, d\}$

$\Delta(u, u) = \{\lambda\}$

Unsupervised learning over a set of strings $s = \{u_1, u_2, \ldots, u_n\}$ can be approximated along the lines of the unsupervised learning used in Self-Organizing Maps, in order to obtain a string $c$ such that:

$$\sum_{i=1}^{n} |\Delta(u_i, c)| \le \sum_{i=1}^{n} |\Delta(u_i, k)| \quad \forall k \ne c \qquad (1)$$

where $|\cdot|$ denotes the number of symbols in a set.

Next, the learning algorithm is proposed as follows:

1. Take any arbitrary cluster $c$ (for example $c = u_p$ or $c = \lambda$).
2. For each string $u_i \in s$:
   (a) Let $p = \Delta(u_i, c)$.
   (b) If $p \subset c$, then an arbitrary symbol belonging to $p$ must be removed from $c$ provided $|\delta(u_i, c)| > 0$; if $|\delta(u_i, c)| = 0$, then a symbol belonging to $\delta(u_i, c)$ must be added to $c$.
   (c) If $p \not\subset c$, then an arbitrary symbol belonging to $p$ must be added to $c$ provided $|\delta(c, u_i)| > 0$; if $|\delta(c, u_i)| = 0$, then a symbol belonging to $\delta(c, u_i)$ must be removed from $c$.
3. This last process must be repeated over the set $s$ until a desired cluster length is reached.

A cluster is obtained when the algorithm stops. This cluster represents the amino acid set that has the minimum distance to the whole DNA pool. Biological operations over such a set might be applied over the whole pool.

4 Conclusions

A clustering algorithm over symbolic information has been proposed. An amino acid set is obtained when it is applied over DNA strands. This set could be studied in order to find biological protocols that can be used on the whole DNA set. Such protocols would be of special interest in the field of DNA computing. This algorithm is intended as a tool to help biologists. More clustering algorithms could be implemented in order to take into account the order of the amino acids in a DNA strand and the cardinality of a given amino acid in a sequence.

References:

[1] Leonard M. Adleman. Molecular Computation of Solutions to Combinatorial Problems. Science 266 (11): 1021-1024. (1994).
[2] Russell Deaton, Max Garzon et al. DNA computing: A review. Fundamenta Informaticae 30, 23-41. IOS Press. (1997).
[3] Old, R. W. and Primrose, S. B. Principles of Gene Manipulation. Blackwell Scientific. (1985).
[4] Martyn Amos. Theoretical and Experimental DNA Computation. Springer. ISBN 3-540-65773-8. (2005).
[5] Dan Boneh, Christopher Dunworth, Richard J. Lipton, and Jiri Sgall. On the Computational Power of DNA. DAMATH: Discrete Applied Mathematics and Combinatorial Operations Research and Computer Science 71. (1996).
[6] Gheorghe Paun, Grzegorz Rozenberg, Arto Salomaa. DNA Computing - New Computing Paradigms. Springer-Verlag. ISBN 3540641963. (1998).
[7] Lila Kari, Greg Gloor, Sheng Yu. Using DNA to solve the Bounded Post Correspondence Problem. Theoretical Computer Science 231 (2): 192-203. (2000).
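Illustrative sketch of the Section 3 distance and learning algorithm. The Python code below is a minimal sketch under stated assumptions, not the authors' implementation. The update rule is an interpretation based on the behaviour visible in Table 1: when the cluster holds symbols that the current string lacks, one of them is removed; otherwise a symbol that the string has and the cluster lacks is added. The "arbitrary" symbol is chosen with `min()` only to make runs reproducible, and the names `big_delta`, `update`, `learn_cluster` and `max_passes` are hypothetical.

```python
def delta(u, v):
    """Directed distance: symbols that appear in u and do not appear in v."""
    return set(u) - set(v)


def big_delta(u, v):
    """Symmetric distance: union of the two directed distances."""
    return delta(u, v) | delta(v, u)


def total_distance(strings, cluster):
    """Left-hand side of Equation (1), counting symbols in each distance set."""
    return sum(len(big_delta(u, cluster)) for u in strings)


def update(cluster, u):
    """One pass of step 2 for a single string (assumed rule, see lead-in)."""
    extra = delta(cluster, u)    # symbols in the cluster that u lacks
    missing = delta(u, cluster)  # symbols in u that the cluster lacks
    if extra:
        drop = min(extra)        # deterministic stand-in for "arbitrary"
        return [s for s in cluster if s != drop]
    if missing:
        return list(cluster) + [min(missing)]
    return list(cluster)


def learn_cluster(strings, cluster="", target_len=None, max_passes=10):
    """Repeat step 2 over the set until the desired cluster length is reached."""
    c = list(cluster)
    for _ in range(max_passes):
        for u in strings:
            c = update(c, u)
            if target_len is not None and len(c) == target_len:
                return "".join(c)
    return "".join(c)


u, v, w = "abcde", "age", "hiaj"
print(sorted(big_delta(u, v)))                # ['b', 'c', 'd', 'g']
print(learn_cluster([u, v, w], target_len=2)) # ae
print(total_distance([u, v, w], "ae"))        # 8
```

On the three example strings of Section 3, the sketch stops at the cluster `ae`, whose total distance to the set is 8, compared with 9 for the cluster `a` alone. Applied to the first rows of Table 1, `update` reproduces the cluster column where only one symbol is eligible (for instance, ABCDEFGH becomes ACDEFGH after the second string); rows with several eligible symbols depend on the arbitrary choice and need not match.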