Molecular Biology: DNA sequencing

Size: px

Start display at page:

Download "Molecular Biology: DNA sequencing"

Carmella Hubbard
5 years ago
Views:

1 Molecular Biology: DNA sequencing Author: Prof Marinda Oosthuizen Licensed under a Creative Commons Attribution license. TABLE OF CONTENTS INTRODUCTION... 2 SEQUENCING BY DIDEOXY CHAIN TERMINATION... 2 MANUAL SEQUENCING... 4 AUTOMATED SEQUENCING... 6 Cycle Sequencing... 6 SEQUENCING OF LARGE TEMPLATES... 7 Primer walking... 7 Subcloning... 9 Genome sequencing SEQUENCE ANALYSIS Evaluate your sequence data Sequence Assembly Sequence comparison using BLAST Sequence alignment APPLICATIONS Diagnostics Phylogenetics Microsatellites DEFINITIONS REFERENCES P a g e

2 INTRODUCTION The biological information of DNA is contained in its nucleotide sequence. DNA sequencing is the experimental process of determining the exact order of nucleotide bases in a DNA molecule. Early methods of sequencing nucleic acids depended upon degrading the molecules with specific enzymes that cleaved particular combinations of bases, followed by identification of labelled fragments on two-dimensional gels. It was only possible to sequence very short molecules using this degradative approach. In the mid-1970s two methods emerged which allowed sequencing of long stretches of DNA for the first time: the chemical sequencing method of Maxam and Gilbert (1977) and the dideoxy chain-termination method developed by Sanger et al. (1977). The hallmark of these developments was the ability to read the base sequence directly from an autoradiograph of a onedimensional gel, by simply following the progression of the band pattern in adjacent tracks, each of which corresponds to a particular base. The chemical sequencing method of Maxam and Gilbert (1977) is a degradative approach. The DNA fragment to be sequenced is first radioactively labelled at one end and then subjected to specific chemical reactions that in each case cleave at a particular type of base. The degradation products are separated according to chain length on long polyacrylamide gels, which can resolve fragments differing in length by a single nucleotide. This method was initially more popular and widely used in molecular biology laboratories, but because of innovations which enabled longer sequences to be obtained, the dideoxy chain-termination method gained ascendancy. SEQUENCING BY DIDEOXY CHAIN TERMINATION The sequencing method most commonly used today is the dideoxy chain-termination method (Sanger et al., 1977). Dideoxy sequencing allows the determination of 200 to 1000 nucleotides from a singlestranded DNA template, using as starting material either cloned or amplified DNA. The dideoxy sequencing reaction mix includes the template DNA, deoxynucleoside triphosphates (dntps), dideoxynucleoside triphosphates (ddntps), a DNA polymerase enzyme and a primer. Like the Polymerase Chain Reaction (PCR), dideoxy sequencing starts with the denaturation of the DNA template and the binding of an oligonucleotide primer (annealing), but unlike PCR a single primer is used. (For a detailed description of PCR, see the PCR techniques course notes.) DNA polymerases require the presence of a small piece of DNA, the primer, in order to initiate the growth of a DNA chain. The primer is complementary to the DNA template and chain elongation starts at the 3 end of the primer. The primer therefore determines the starting point of the sequence being read, and the direction of the sequencing reaction. For sequencing of the insert of a plasmid template, a sequencing primer based on the vector sequence adjacent to the insert can be used. For PCR products, either the forward or reverse amplification primer can be used as the sequencing primer, or an internal sequencing primer can be designed. 2 P a g e

3 One of the single most important factors in successful DNA sequencing is proper PRIMER DESIGN. A sequencing primer should have the following characteristics: A melting temperature (Tm) in the range of 52 C to 65 C No internal secondary structure Balanced distribution of G/C and A/T rich domains No complementarity between 3 ends (so primer-dimers will not form) No significant hairpin formation (>3 bp) (if there are regions of self-complementarity in the primer sequence, the primer can anneal to itself and fold up to form a hairpin loop) Lack of secondary priming sites in the target sequence Low specific binding at the 3' end (i.e. lower GC content to avoid mispriming) Starting from the 3 end of the annealed primer, the DNA polymerase synthesizes a new DNA strand, adding dntps or ddntps that are complementary to those on the template. If allowed to go to completion, a new strand of DNA would be the result. However, ddntps have no 3' hydroxyl (3'-OH) group (see Figure 2), so once a ddntp is incorporated at the end of a growing DNA strand, there is no way for the polymerase to continue elongating it. The dntp:ddntp ratio is adjusted so that most of the time the polymerase enzyme will use a dntp, but every so often a ddntp will be chosen, which will terminate the elongation reaction. At the end of the sequencing reaction, there are multiple copies of every possible fragment, each terminated by a ddntp. Separation of the reaction products on long, thin, high-resolution polyacrylamide gels (sequencing gels) or in thin capillaries containing liquid polymer allows the determination of the nucleotide sequence. Figure 1: Dideoxynucleoside triphosphates (ddntps) lack a 3' hydroxyl group, and if incorporated into a 3 P a g e growing strand of DNA, they will terminate the elongation reaction.

4 MANUAL SEQUENCING After annealing of the primer to the template, the sample is divided into four aliquots, A, C, G and T. To each of these aliquots is added DNA polymerase, all four dntps (one of which is radiolabelled) and one of the four ddntps. In the A tube, the primer is elongated by DNA polymerase in the presence of all four dntps and dideoxy-atp (ddatp). This analogue can be incorporated instead of datp. Remember, ddatp has no 3'-OH group, so incorporation of ddatp blocks the coupling of the next dntp and terminates the elongation of the chain. Incorporation of ddatp at the A positions (opposite the Ts of the template) results in a mixture of reaction products all terminated with a ddatp (Figure 2). products 5' GCTddA 5' GCTATTddA 5' GCTATTAddA 5' GCTATTAATTCTGCddA 5' GCTATTAATTCTGCACCTddA 5' GCTATTAATTCTGCACCTAddA template 3' CGATAATTAAGACGTGGATT-5' Figure 2: Schematic representation of the products obtained in a dideoxy-atp sequencing reaction. In the C tube, ddctp is added and a mixture of products each terminated at the C positions (opposite the Gs of the template) is produced. Likewise, in the G tube, ddgtp is added and in the T tube, ddttp is added. At the end of the reaction, the four tubes contain a population of every possible reaction product; in each tube the reaction products are terminated by the ddntp that was included (Figure 3). In manual sequencing reactions, the sequencing products are labelled by adding a radioactively labelled dntp. The most commonly used radiolabel is [α- 35 S]dATP, although [α- 32 P]dATP and [α- 33 P]dATP are also used. 35 S emits low energy β radiation, which gives sharper bands on autoradiographs, causes less damage to the sugar-phosphate backbone of the DNA and is safer for users. The sequencing products from each tube are loaded into adjacent wells on a high resolution denaturing polyacrylamide gel (sequencing gel). Sequencing gels are usually about 40 cm long, they contain between 4% and 8% acrylamide and 7 M urea. 4 P a g e

5 DNA in solution is a negatively charged molecule, and on application of an electrical current, the sequencing products move towards the anode. As they migrate through the gel, they are separated according to their size (for more details on electrophoresis, see the Blotting techniques course notes). Once electrophoresis is complete, the gel is dried down and exposed to X-ray film. The DNA sequence can then be read from the autoradiograph. The smaller products will move through the gel fastest, and will appear at the bottom of the autoradiograph. The sequence of the DNA is therefore read from the bottom to the top of the autoradiograph in the four adjacent lanes and this is known as a sequence ladder (see Figure 3). A B Figure 3: A: Separation of the dideoxy reaction products on a high resolution sequencing gel allows the reading of the nucleotide sequence, B: Example of an autoradiograph of a manual sequencing gel. Manual sequencing is labour intensive and slow. Bases must be read by eye, and there are sometimes problems with the resolution of short stretches of DNA. Several major advances in the 1980s and 1990s made the automation of Sanger dideoxy sequencing possible. These included: 5 P a g e

6 the development of heat stable DNA polymerases (e.g. Taq polymerase) the introduction of thermocycling the development of four-colour fluorescent dyes to label ddntps the modification of the electrophoresis technique such that automated visualization and recording of the sequence of migrating ddntps could be accomplished. AUTOMATED SEQUENCING These days, the running and interpretation of sequencing gels is usually performed automatically. In automated sequencing, fluorescent groups in the primer or in the ddntps replace the radioactive label. The fluorescent groups are detected during the electrophoresis by laser irradiation and light detectors. Cycle Sequencing One of the limitations of the standard chain termination method is that only a single labelled DNA molecule is produced from each primer-template complex. The sensitivity of the method is therefore limited by the amount of DNA template that can be used in the reaction. Cycle sequencing is a method in which a small number of template DNA molecules are used over and over again to generate the sequence ladder. Most cycle sequencing chemistries these days use ddntps that are labelled with fluorescent groups (dye-labelled terminators). The four ddntps are labelled with different fluorophores, each of which has a different emission spectrum. The advantage of this is that the sequencing reaction can be carried out in a single tube and analyzed in a single run. The cycle sequencing reaction mix includes the template DNA, a primer, dntps, ddntps (each labelled with a different fluorophore) and a thermostable DNA polymerase (usually a variant of Taq polymerase). The reaction mix is subject to repeated rounds of denaturation, annealing and elongation, rather like a PCR. During the elongation step, the thermostable DNA polymerase adds dntps or ddntps. The formation of the new DNA strand is terminated upon the addition of a ddntp. At the end of the cycles, multiple copies of every possible fragment are present, each terminated by a ddntp. The sequencing reaction is purified and subject to electrophoresis in an automated sequencer. The older automated sequencing machines (ABI 370, ABI 377, ALF, Hitachi, LiCor) had slab gels, but in the newer machines the electrophoresis is carried out in a capillary containing a liquid polymer. After each run the capillary is automatically emptied and refilled with the fluid polymer; the advantage of this is that the capillaries can be reused many times. The ABI 310 has a single capillary, the 3100 and 3130XL have 16 capillaries while the 3700 model has 96! Upon application of an electrical current, the fragments in the reaction mix migrate through the gel or the polymer and are separated according to size. As the sequencing products reach the end of the gel or the capillary, they are exposed to a laser beam which excites the fluorescent group and it emits light. The four different fluorophores emit light at different wavelengths which is detected by photomultipliers. A computer program interprets the fluorescence that has been detected and 6 P a g e

converts the pattern of fluorescent peaks obtained at the four different wavelengths into a nucleotide sequence of 500 or even up to 800 nucleotides.

7 converts the pattern of fluorescent peaks obtained at the four different wavelengths into a nucleotide sequence of 500 or even up to 800 nucleotides. The sequence is viewed in the form of an electropherogram (Figure 4). A very good animated illustration of automated sequence analysis, including both cycle sequencing and sample electrophoresis, is available at Figure 4: Example of a portion of an electropherogram. SEQUENCING OF LARGE TEMPLATES As we have seen, we can obtain up to 800 nucleotides from a single template. But how do we obtain the sequence of a template that is longer than 800 bp? The following will be discussed: Primer walking - for sequencing a relatively small piece of DNA (>1 4 kb) Subcloning when you need to sequence larger templates (> 4 kb) Genome sequencing Primer walking Sequencing by primer walking is an effective strategy for obtaining the sequence of DNA templates ranging in size from 1-4 kb. Primer walking is illustrated in Figure 5. For a cloned template, the initial sequence data is obtained using a standard primer which hybridizes to the vector sequence upstream of the insert DNA. For a PCR product, one of the amplification primers can be used to obtain the initial sequence data. A sequencing primer is then designed towards the 3 end of this initial sequencing data. This oligonucleotide is used to prime a second sequencing reaction on the same template DNA. The data obtained from the second reaction will overlap with the initial data and extend the sequence further downstream. By repeated cycles of oligonucleotide design and DNA sequencing, the cloned insert is sequenced completely in one direction. The same strategy is used to sequence the other strand beginning with a second standard primer which primes in the other direction. The complete DNA sequence can then be compiled from all the smaller sequences. 7 P a g e

8 Figure 5: Primer walking to obtain the full length sequence of a template that is too large to sequence in a single sequencing reaction. 8 P a g e

9 Subcloning It is not efficient to sequence larger templates (> 4 kb) by primer walking. Such larger DNA templates must first be broken up into smaller overlapping fragments which can be cloned into a vector for sequencing. These subclones should overlap, so that the individual DNA sequences will themselves overlap. The overlaps can be then located, either by eye or using a computer, and the master sequence gradually built up. There are several ways of producing the subclones. The DNA could be cleaved with two different restriction endonucleases, producing one set of fragments with say Sau3A and another with AluI. However, this method suffers from the drawback that the restriction enzyme sites may be inconveniently placed and individual fragments may be too long to be completely sequenced. Often four or five different enzymes will have to be used to clear up all the gaps in the master sequence. (For more details on cloning see the Cloning techniques course notes). An alternative method is to use shotgun cloning (Figure 6), where the DNA template is randomly sheared and cloned into a vector, producing a shotgun library. Clones from this library are sequenced and the individual sequences assembled, until the final finished sequence of the original template is obtained. One drawback of this technique is that some regions of the DNA template may be easier to clone than others, resulting in areas in the final sequence assembly which are over-represented while other regions are under-represented or represented by reads on only one strand. In fact, there may be regions where there are no subclones at all, and this would result in gaps in the master sequence. Such regions would be targeted for further sequencing. Numerous strategies, including the use of PCR, have been developed for the closure of gaps in the sequence. 9 P a g e

10 Figure 6: Shotgun sequencing strategy. Genome sequencing New automated sequencing technologies and improved computer assisted methods for sequence assembly have greatly increased the speed of sequencing, and have made sequencing of whole genomes a reality. By sequencing the entire genome of an organism, all of its genes can be identified. Scientists hope that by deciphering the genome sequence of an organism they will gain understanding of how the organism functions, and how its genes work together to direct its growth and development. In the case of disease-causing organisms, genome sequences may reveal genes responsible for virulence and pathogenesis and it may be possible to identify novel vaccine candidate genes or new drug targets. It is hoped that the human genome sequence may shed light on genetically-related diseases. Most genomes are far too big to be inserted into a vector and must be broken into smaller bits in order to be cloned. For example, bacteria have genomes that range from 1 Mb up to 10 Mb, and some are even larger. How do we sequence such a huge piece of DNA? There are two strategies for sequencing whole genomes: the hierarchical approach and the whole genome shotgun approach. See Figure 7 for a schematic comparison. 10 P a g e

11 In HIERARCHICAL SEQUENCING, the genomic DNA is first broken down into manageable sized pieces (approximately 100 kb), and these pieces are cloned separately into a cloning vector that will accept large inserts of 100 kb to 3 Mb (e.g. BAC, P1 or YAC). It is then possible to physically order these large clones, resulting in several clones which, between them, span the entire length of the genome. This is known as a tiling path (a minimal subset of overlapping clones that span the whole genome). Ideally the fewest possible clones are identified to make up the tiling path (there are numerous ways of physically ordering the clones, which are beyond the scope of this course. You can read about these methods elsewhere if you are interested). Although the genomic DNA has been fragmented, the size of the piece of DNA in each clone is still too large to sequence directly (remember it is only possible to sequence around 800 bp at a time). So each individual large clone is then sequenced separately, usually by breaking it up into little pieces (approximately 2 kb) and cloning these small fragments of DNA into a sequencing vector. This process is known as sub-cloning (cloning a cloned piece of DNA!). The sequences of many small 2 kb subclones are obtained and assembled to give the sequence of the original large (approximately 100 kb) clone. Since the order of the large clones in the tiling path is known, the complete genome sequence is obtained by assembling the overlapping sequences of these clones. Hierarchical sequencing was the basis of the publicly funded Human Genome Project. The WHOLE GENOME SHOTGUN APPROACH does not make use of ordered subclones. Instead, the entire genome is fragmented into small clones which are sequenced and assembled. Shotgun sequencing was the basis of the privately funded Human Genome Project, and today most small genomes (e.g. bacterial genomes) are sequenced using this strategy. 11 P a g e

12 Figure 7: Strategies for sequencing whole genomes: hierarchical sequencing versus the whole genome shotgun approach. SEQUENCE ANALYSIS Once you have generated a sequence trace file, you will need to manipulate it in some way in order to analyse your data. The first thing you ll need to do is to evaluate the quality of your sequence data to see whether it s worth continuing to analyse it. If you have generated more than one sequence from your template you will need to assemble those sequences into a single contig. Once your sequences have been assembled and edited, you can write out a consensus sequence and analyse it further. You may want to compare it with existing sequence data to find out how your organism is related to other known organisms, or perhaps to locate unique sequences that could be used to identify your organism in a diagnostic test. 12 P a g e

Many programs are available for sequence manipulation and analysis. Some of these are freely available and you can download them from the web and install them on your own computer.

13 Many programs are available for sequence manipulation and analysis. Some of these are freely available and you can download them from the web and install them on your own computer. Evaluate your sequence data You will first need to evaluate the quality of your sequence data. A good sequence looks like this: Peaks should be sharp, well-defined, and scaled high in the first several panels of printed analyzed data, as shown here The characteristics of peaks on a chromatogram change from panel to panel in a predictable way for a typical sequencing run. In the first few panels, the peaks should be sharp, well defined and scaled high. Peak definition should remain fairly good for up to about 550 bases. Several freely available programmes are available for viewing AB1 trace files. These include programs such as: Chromas LITE ( BioEdit ( Trev (a sequence editor which is part of the Staden Package: Sequence Assembly If you have generated more than one sequence from your template you will need to assemble those sequences into a single contig. A contig is an aligned contiguous section of sequence comprising two or more reads joined together by virtue of matching bases. It is always a good idea to obtain a forward and reverse read for every template you wish to sequence in order to check the accuracy of your data. Gap4 can be recommended for sequence assembly. Gap4 forms part of the Staden Package ( The original version of gap4 was described in Bonfield et al. (1995). Four different programs are installed when you install the Staden Package: pregap4, gap4, spin and trev. Pregap4: Before entry into a gap4 database the raw data from sequencing instruments needs to be passed through several processes, such as screening for vectors, quality 13 P a g e

14 evaluation, and conversion of data formats. Pregap4 provides a graphical user interface to set up the processing required to prepare trace data for assembly or analysis; and also gives a method for its automation. Gap4: Gap4 is a Genome Assembly Program. The program contains all the tools that would be expected from an assembly program plus many unique features and a very easily used interface. It performs assembly, contig joining, assembly checking, repeat searching, experiment suggestion, read pair analysis and contig editing. Spin: Spin is an interactive and graphical program for analysing and comparing sequences. It contains functions to search for restriction sites, consensus sequences/motifs and protein coding regions, can analyse the composition of the sequence and translate DNA to protein. It also contains functions for locating segments of similarity within and between sequences, and for finding global and local alignments between pairs of sequences. Trev: For some types of sequencing project it is convenient to view and edit the chromatogram data prior to assembly into a gap4 database, and this is the function of the program trev (we have already met trev). Sequence comparison using BLAST Newly obtained sequences can be compared with databases of previously characterized genes and proteins to identify them, to assign possible functions or to obtain evolutionary information. One of the most commonly used comparison tools is BLAST, (Basic Local Alignment Search Tool), a method for rapid searching of nucleotide and protein databases. There are three main sequence databases which share sequence data on a daily basis: DNA Data Bank of Japan (DDBJ) EMBL Nucleotide Sequence Database GenBank GenBank is maintained by the National Centre for Biotechnology Information (NCBI) in the National Library of Medicine (NLM) at the National Institutes of Health (NIH) in Bethesda, Maryland, USA. The NCBI creates public databases, conducts research in computational biology and develops software for analyzing sequence data. BLAST is NCBI s sequence similarity search tool. Many different BLAST programs are available. The type of program to choose will depend on the nature of the query sequence, the purpose of the search and the target database. For nucleotide queries, megablast is currently the best program to use to find an identical match to a query sequence. Discontiguous megablast and blastn are the programs of choice for finding similar sequences from related organisms. To identify a query amino acid sequence and to find similar sequences from other organisms in protein databases, blastp is the program of choice. blastx translates a nucleotide query sequence in all six reading frames and compares the translation to a protein database. 14 P a g e

15 Sequence alignment It is possible to produce sequence alignments to compare related sequences. If two biological sequences are sufficiently similar, almost invariably they have similar biological functions and are probably descended from a common ancestor. This implies that function is encoded into sequence and that there is a redundancy in the encoding, such that positions in the sequence may be changed without perceptible changes in the function. Sequence alignments provide the basis for predicting de novo the secondary structure of proteins, for knowledge-based tertiary structure predictions and for inferring phylogenetic trees and resolving questions of ancestry between species. There are many sequence alignment programs available. One of the most commonly used is ClustalW (Higgins et al., 1994) ( a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen via viewing Cladograms or Phylograms. APPLICATIONS There are many applications for sequence data. The following are just a few: Diagnostics The DNA sequences of disease-causing organisms can be used to assist in the diagnosis of veterinary diseases. Sequence data can be particularly useful for the identification of microbial disease-causing organisms such as viruses, bacteria and protozoa. Frequently many closely related species are present in the field, only some of which cause disease, and it is often difficult to distinguish between these species in any other way. One of the most widely used loci in diagnosis is the ribosomal RNA operon. Portions of the nucleotide sequence of rrna genes are highly conserved between all organisms, particularly in regions which determine the secondary structure of the ribosome. Other regions of rrna genes, which are not under pressure to remain unchanged, vary between species. The structure of rrna genes is therefore ideal for the development of a PCR test to distinguish between related species. Primers can be designed in conserved areas; these primers will amplify a portion of the rrna gene from all related parasites. Very specific and sensitive probes can then be designed in the variable regions of the rrna genes to distinguish between the parasites. Phylogenetics Phylogenies can be used to determine evolutionary relationships between organisms. Such relationships can assist in epidemiological studies and can be used to determine the origin of isolates in outbreaks of certain diseases. 15 P a g e

16 Phylogenetic analyses that are based on sequence data depend upon initial homology inferences at the amino acid or nucleotide level. Thus, the accuracy of phylogenies will depend upon this first step. The most valuable phylogenetic information can be obtained from amino acid sequence alignments. It makes little sense to align untranslated nucleotide sequences of coding regions, because of codon degeneracy (codon variability is higher than amino acid variability). The exception is if one is looking for variation within the nucleotide patterns themselves, as would be the case with trna or rrna genes, promoter regions or terminators. In these cases the same methods that are used to align amino acids can be applied. Microsatellites Microsatellites are stretches of DNA that consist of tandem repeats of a simple sequence of nucleotides. The repeat units are generally di-, tri- tetra- or pentanucleotides (for example, AAT repeated 15 times in succession). Microsatellites tend to occur in non-coding DNA. Microsatellites form through slipped-strand mispairing at replication pauses or single strand annealing following exonucleolytic degradation at a DNA double-strand break. These processes can result in variable numbers of repeat units and thus microsatellites tend to be highly polymorphic. Microsatellites are useful genetic markers precisely because they are polymorphic, and in addition they are locus-specific. PCR primers can be designed in the regions flanking the repeats and using the polymerase chain reaction (PCR), the microsatellites can be easily amplified. The number of repeat units that an individual has at a given locus can be easily resolved using polyacrylamide gels or in capillaries containing a liquid polymer. There are numerous applications for microsatellites. They have been used as markers for certain disease conditions. Some microsatellite alleles are associated, through genetic linkage, with certain mutations in coding regions of the DNA that can cause a variety of medical disorders, such as schizophrenia (Bailer et al., 2002). Because of their high specificity, microsatellites are frequently used as forensic markers to test DNA in court cases, and have been used to identify both human and animal DNA samples. DNA from the crime scene is compared with DNA from the suspect. Match identities for microsatellite profiles can be very high in fact, the probability that the evidence from the crime scene is not a match with that of the suspect is less than one in many millions in most cases. Parentage analysis is another application for microsatellites. Each individual inherits one length of nucleotide repeats from his or her mother and one from his or her father (individuals with a single band received the same band from both their mother and their father). For captive or endangered species microsatellites can serve as tools to evaluate inbreeding levels. DEFINITIONS Annealing: Binding of an oligonucleotide primer to a DNA or RNA template. Codon: A triplet of nucleotides in a messenger RNA molecule coding for a single amino acid. Contig: an aligned contiguous section of sequence comprising two or more DNA sequences joined together by virtue of matching bases. 16 P a g e

17 Deoxynucleoside triphosphates (dntps): The basic building blocks of DNA, consisting of a nitrogenous base (adenine, guanine, thymine, or cytosine), a phosphate molecule, and a sugar molecule (deoxyribose). Dideoxynucleoside triphosphates (ddntps): Modified dntps that lack a 3' hydroxyl (3'-OH) group, and therefore terminate strand synthesis when incorporated into a growing DNA chain. DNA polymerase: An enzyme that synthesizes DNA on a DNA or RNA template. DNA sequencing: The experimental process of determining the exact order of nucleotide bases in a DNA molecule. Dye-labelled terminators: ddntps that are labelled with fluorescent groups. Fluorophore: A molecule that can be excited by light to emit fluorescence. Genetic code: The sequence of nucleotides, coded in triplets (codons) along the mrna, that specifies the sequence of amino acids in the translated protein. The DNA sequence of a gene can be used to predict the mrna sequence, and the genetic code can in turn be used to predict the amino acid sequence. Melting temperature (Tm): The temperature at which two strands of a double-stranded DNA molecule detach due to breakage of the hydrogen bonds between them. Open reading frame: A series of codons each of which encodes an amino acid, uninterrupted by a stop codon (TAA, TAG or TGA). All ORFs have the potential to encode a protein or polypeptide although many may not actually do so. Polyacrylamide: A material used to make gels for separation of mixtures of macromolecules (such as proteins, small DNA or RNA molecules of up to 1000 nucleotides) by electrophoresis. Polyacrylamide is used in DNA sequencing gels to separate the sequencing ladder. Primer: A short oligonucleotide that, when annealed to its complementary sequence in a singlestranded DNA template molecule, provides a start point for synthesis of a new DNA strand. Reading frame: One of three possible ways of translating a nucleotide sequence into codons. Subcloning: The transfer of a cloned fragment of DNA, or a part thereof, from one vector to another. Template DNA: The DNA molecule that is copied during a strand synthesis reaction. In a sequencing reaction, strand synthesis is catalysed by a DNA polymerase. Thermostable DNA polymerase: A DNA polymerase that is relatively unaffected by high temperatures. Taq polymerase is the most commonly used thermostable DNA polymease and was isolated from Thermus aquaticus, a bacterium found in hotsprings in Yellowstone National Park in the United States of America. 17 P a g e

18 REFERENCES 1. Bailer, U, Leisch, F, Meszaros, K, Lenzinger, E, Willinger, U, Strobl, R, Heiden, A, Gebhardt, C, Doge, E, Fuchs, K, Sieghart, W, Kasper, S, Hornik, K & Aschauer HN Genome scan for susceptibility loci for schizophrenia and bipolar disorder. Biological Psychiatry, 52: Bonfield,J.K., Smith,K.F. and Staden,R. A new DNA sequence assembly program. Nucleic Acids Research, 24, (1995). 3. Higgins, D, Thompson, J, Gibson, T, Thompson, JD, Higgins, DG & Gibson, TJ CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22: Maxam, AM & Gilbert, W A new method for sequencing DNA. Proceedings of the National Academy of Sciences USA, 74, Sanger, F, Nicklen, S & Coulson, AR DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences USA, 74, P a g e