
Lecture 11
Sequence Assembly
February 10, 1998
Lecturer: Phil Green
Notes: Kavita Garg

Introduction

This is the first of two lectures by Phil Green on sequence assembly. Yeast and some of the bacterial genomes have already been completely sequenced, and there is a rapidly growing body of known human sequence. All of this has been made possible by our ability to read DNA. This lecture discusses the shotgun strategy for rapid DNA sequencing and begins the discussion of sequence assembly algorithms (to be concluded in the next lecture). The method and importance of assigning error probabilities to each base are also discussed.

Steps In Sequencing

We begin with a high-level overview of sequencing. There are three stages in this process.

Make a physical map of sequence-ready clones: These clones can be cosmids, Bacterial Artificial Chromosomes (BACs), or Yeast Artificial Chromosomes (YACs). Cosmids have an insert size of 40 kb. They are the easiest to work with, but the number of clones required is large. YACs have an insert size of kb. Though fewer YACs are required, they are difficult to work with, and they often show deletions, recombinations, and chimeras (i.e., two or more unrelated DNA fragments joined in the same clone). Local experience is that smaller YACs don't suffer much from these problems, so it is feasible to construct maps from (small) YACs and then subclone them into cosmids for further processing. Maintaining 2x coverage of the genome by YACs (i.e., an average of two YAC clones covering each point on the genome) protects against chimeras and other anomalous clones. BACs are also commonly used in cloning projects. They have an insert size of kb, and seem to be relatively free of the abnormalities found in large YACs. Techniques for constructing clone maps were/will be discussed in lectures 10, 13, and 14.

Sequence each clone by the Shotgun Strategy: This is the strategy used for rapid DNA sequencing, and it consists of two phases, a shotgun phase and a finishing phase.
The Shotgun phase involves three steps:

Subclone the DNA of interest (already cloned in a cosmid, YAC, or BAC) into M13 or plasmids. The DNA of interest is sheared physically to make random small inserts of 1 to 2 kb, which are then cloned into M13 or plasmids. The subclones are not mapped.

These subclones are then sequenced using automated DNA sequencing. This involves a number of steps, which are described in the section on automated DNA sequencing below. The sequencing reads vary from bps in length and the coverage is around 6 to 10 times (i.e., each base of the cosmid, BAC, or YAC clone is covered by 6 to 10 reads).

Assembly of subclones into contigs is done using various computational tools, to be outlined in the section on sequence assembly below.

After the assembly is done, there will often be gaps in the sequence: segments where reads did not cover the clone, or where all reads were of low quality. As a result, we might get two or more contigs. In the finishing phase, additional subclones spanning the gaps are obtained and sequenced. The goal is to allow all data to be joined into a single contig with an error rate of or better.

Annotation/Identification: Once the sequence of a piece of DNA is known, it can be identified by comparing it to other sequences in the database. Finding genes, regulatory elements, repeats, etc., using various available tools reveals the biology of the sequence.

Automated DNA sequencing

This involves determining the DNA sequence of each of the subclones. Sanger's method of sequencing is widely used and is described here. After a subclone has been amplified and purified, it is subjected to four different reactions (one reaction for each base). Each reaction results in a collection of single-stranded molecules which all begin at the 5' end of the subclone and extend to a randomly selected base specific to the reaction. Each reaction contains the following reagents/enzyme:

Template: The DNA fragment to be sequenced.
DNA Polymerase: An enzyme which synthesizes a complementary strand against the given template.
Primer: An oligonucleotide (around 20 bp) which binds to the template and serves as the starting point for the polymerase. The primer is complementary to a known sequence on the cloning vector (M13, plasmid) near the beginning of the insert.
dNTPs (deoxynucleotides): The chemical form of the bases A, G, C, and T (dATP, dGTP, dCTP, and dTTP). These are incorporated by the polymerase to make the complementary strand.
ddNTPs (dideoxynucleotides): A special kind of dNTP which leads to termination of the growing strand when incorporated.

Each reaction mixture contains all four dNTPs and exactly one of ddATP, ddGTP, ddCTP, or ddTTP, at a lower concentration than the dNTPs. For example, the reaction for base A contains ddATP plus all the other reagents listed above. Polymerase constructs a complementary strand against the template using dNTPs. Each A incorporated into the new strand will be either dATP or ddATP (randomly selected), with ddATP causing termination of the growth of the new strand. E.g.

    5' TGCTTGTAATCT 3'    template
          <- aTTAGA 5'    new strand; "a" denotes ddATP and
           aCATTAGA 5'    terminates the growing strand
          aACATTAGA 5'
       aCGAACATTAGA 5'

Here we get all the fragments of the new strand that start at its 5' end and end at an A. If we know the lengths of these fragments, we know the positions of all A's in the new strand, and hence of the complementary T's in the template. Similarly, we get all sequences ending at G, C, and T by repeating the experiment with the other ddNTPs. This gives a mixture of fragments ending in A, G, C, and T. Each of these fragments is labeled with a dye depending on whether it ends in A, G, C, or T (hence we have four different dyes). There are two methods, depending on the labeling. If the primer is labeled, it is called Dye Primer Sequencing, and if the terminating (ddNTP) base is labeled, it is called Dye Terminator Sequencing. (The dyes are attached to the primers or terminating bases during their manufacture.) Fluorescent dyes are most commonly used and can be detected by laser-based methods.

The fragments are then separated by gel electrophoresis. All four reaction mixtures for a subclone are run on a single lane of the gel. Each gel has lanes and hence we can get up to 64 reads in one run lasting 4 to 16 hours. Fragments are detected by laser excitation of the fluorescent dye. We get a profile of fluorescence intensity as a function of fragment size at each of the four dye wavelengths.

Data Quality

The quality of the data is low towards the end of the read for several reasons. First, these are the longer fragments, and there is a smaller proportionate difference in length between large fragments that differ by one base. Second, the fluorescent signal is weaker, since fewer long molecules are created during the reaction. (Random incorporation of ddNTPs results in a geometric distribution of concentration versus size.) Finally, the longer times until these fragments pass the laser detector allow more diffusion of the molecules in the gel.
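The chain-termination reactions and the geometric fall-off in fragment abundance described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not a description of any real instrument: it ignores the primer and vector sequence, treats the chance `p_dd` of incorporating a ddNTP at each matching base as a made-up constant, and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def new_strand(template):
    """Strand synthesized against `template`, written 5'->3'
    (i.e., the reverse complement of the template)."""
    return "".join(COMPLEMENT[b] for b in reversed(template))

def fragment_lengths(template, dd_base):
    """Lengths of the possible products of one reaction: every prefix of
    the new strand that ends in `dd_base`, where a ddNTP could terminate it."""
    strand = new_strand(template)
    return [i + 1 for i, b in enumerate(strand) if b == dd_base]

def read_sequence(template):
    """Recover the new strand from the fragment lengths of the four
    reactions, as ordering the fragments by size on a gel would."""
    length_to_base = {}
    for base in "ACGT":
        for n in fragment_lengths(template, base):
            length_to_base[n] = base
    return "".join(length_to_base[n] for n in sorted(length_to_base))

def relative_abundance(k, p_dd):
    """Probability that a chain stops at the k-th matching base (k = 1, 2, ...)
    if each matching base is a ddNTP with probability p_dd (assumed constant).
    This is a geometric distribution, so longer fragments are rarer."""
    return p_dd * (1 - p_dd) ** (k - 1)

template = "TGCTTGTAATCT"                # template from the example above
print(new_strand(template))              # AGATTACAAGCA
print(fragment_lengths(template, "A"))   # [1, 3, 6, 8, 9, 12]
print(read_sequence(template))           # AGATTACAAGCA
```

The lengths 6, 8, 9, and 12 correspond to the four fragments drawn in the example; the sketch also reports the two shorter products (lengths 1 and 3). Because `relative_abundance` shrinks by a factor of `1 - p_dd` at each further matching base, the signal from long fragments is weaker, which is one of the reasons given for poorer quality at the end of a read.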
Usable reads extending over 800 bases are routinely obtainable, but the above factors (among others) currently prevent dramatically longer reads.

Other factors, such as the polymerase and the sequencing reaction, also influence the quality of the data. For example, compression of peaks is observed when part of the (single-stranded) fragment self-anneals. That is, the fragment forms a hairpin loop structure, which allows it to move faster on the gel than would be expected from its length. Compressions are sequence-specific and are found mostly in GC-rich regions of the sequence. The bottom line is that there may be low-quality regions in the midst of otherwise high-quality segments of a read. Ideally, reads from other clones covering the same region, or from the complementary strand, will be of higher quality.

Error probabilities

Since there is a lot of variation in quality, it is important to assign a quality value to each base. That is, for each base call, we estimate the probability that it is correct. This is very valuable during sequence assembly, since it allows us to use the entire read length during assembly of the sequence. (Older methods didn't work

effectively unless the reads were hand-trimmed to contain only the high-quality portions.) Also, reads can be put together more accurately if we know the error probability attached to each base. Additionally, it helps to create a more accurate consensus sequence by focusing on the high-quality traces rather than averaging over lower-quality data.

Method for defining error probabilities

Error probabilities were estimated by the following procedure. First, we determined three key parameters visible in the traces that seemed to correlate with erroneous base calls (see the section on parameters below). Next, we obtained a large set of reads covering accurately known sequences, so that it was possible to classify each base call as correct or incorrect. Given the trace parameters for these reads, we determined the empirical error rate for each choice of parameter threshold. The resulting data are summarized in a lookup table. For a new read whose error probabilities are to be estimated, we compute the trace parameter values and read the estimated error rate from the lookup table.

Parameters for defining the error probabilities

The following parameters, calculated from the fluorescence intensity traces, seem to be most critical in determining base-calling error rates.

Distance from the nearest unresolved peak: It may be difficult to resolve successive peaks when they overlap significantly. Base calls tend to be more accurate if they are well separated from unresolved peaks. The distance parameter for each base is the distance (number of bases) from it to the nearest unresolved peak.

Spacing Criterion: Ideally, peaks will be evenly spaced, and the more even the better. We quantify this by computing the spacing between the peaks in a window of seven bases centered on the peak of interest. The spacing parameter is the ratio between the smallest and largest spacings observed in this window.
Size Criterion: With good data, one of the four fluorescence signals will clearly dominate the other three (presumably background noise) at each peak. This is quantified in the size parameter: the ratio of the height of the largest uncalled peak to that of the smallest called peak, again measured over a window of seven bases centered on the base of interest.

Empirically, there is good agreement between these numerical parameters and read quality, as judged by human experts.

Sequence Assembly

Now that we have each of the subclones sequenced and an error probability assigned to each base, we want to reconstruct the sequence of the clone.

[Figure: reads shown against the clone (e.g., one cosmid).]
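Since assembly relies on the per-base error probabilities, it may help to see how the spacing parameter and the lookup-table procedure described earlier could be computed. The following Python sketch is a deliberate simplification: it uses a single trace parameter rather than three, it clips the seven-base window at the ends of the read (an assumption), and the function names and data layout are hypothetical rather than taken from any actual base-calling program.

```python
from bisect import bisect_right

def spacing_ratio(peak_positions, i, half_window=3):
    """Ratio of the smallest to the largest peak spacing in a window of
    2 * half_window + 1 peaks centered on peak i (clipped at the read ends).
    Expects at least two strictly increasing positions; 1.0 = perfectly even."""
    lo = max(0, i - half_window)
    hi = min(len(peak_positions) - 1, i + half_window)
    window = peak_positions[lo:hi + 1]
    gaps = [b - a for a, b in zip(window, window[1:])]
    return min(gaps) / max(gaps)

def build_lookup(training, thresholds):
    """training: (parameter_value, is_correct) pairs for base calls in reads
    that cover accurately known sequence.
    Returns [(threshold, empirical error rate of calls with value >= threshold)]."""
    table = []
    for t in sorted(thresholds):
        kept = [ok for value, ok in training if value >= t]
        if kept:
            errors = sum(1 for ok in kept if not ok)
            table.append((t, errors / len(kept)))
    return table

def estimate_error(table, value):
    """Error rate for the strictest threshold that `value` satisfies,
    or None if it satisfies none of them."""
    keys = [t for t, _ in table]
    k = bisect_right(keys, value)
    return table[k - 1][1] if k else None

# Made-up training data: (spacing ratio, was the base call correct?)
training = [(0.9, True), (0.8, True), (0.5, False), (0.4, True), (0.2, False)]
table = build_lookup(training, thresholds=[0.0, 0.5, 0.8])

peaks = [0, 10, 20, 25, 40, 50, 60]      # made-up peak positions in a trace
ratio = spacing_ratio(peaks, 3)          # gaps 10, 10, 5, 15, 10, 10
print(ratio, estimate_error(table, ratio))
```

With realistic amounts of training data, the same idea extends to thresholds on all three parameters jointly, so that each base call is assigned the empirical error rate of the training calls that looked similar to it.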

In the literature, the problem of reconstructing a sequence from overlapping subsequences is sometimes formulated as follows:

The Shortest Common Superstring Problem: Given a set S of strings over some alphabet, find a shortest string which contains each member of S as a contiguous substring.

Unfortunately, this simple abstraction doesn't capture the real problem very well. First, it assumes perfect data. Second, even with perfect data, there is a problem with repeats. Biological sequences contain local repeats (found in a limited region) and interspersed repeats (present at various places in the genome). For example, sequences of the so-called Alu family, related sequences of about 300 bp, are repeated about a million times in mammalian genomes, or about once every few kilobases. Thus, a cosmid is likely to contain repeats, and the Shortest Common Superstring will tend to collapse them, particularly where several repeats occur in tandem, producing an incorrect solution. Fortunately, exact repeats are rare, since different copies have mutated differently over evolutionary time spans. Furthermore, collapsed repeats in an assembly would tend to result in an unusually large number of reads covering the repeated region, a signal which might be used to trigger more careful processing. In any event, the following (quasi-)mathematical formulation of the sequence assembly problem is probably a more accurate abstraction of the real problem:

The Sequence Assembly Problem: Given a set S of strings with an error probability for each string letter, find a string for which the probability of observing S (as randomly placed substrings with error) is greatest.

Main Steps in Sequence Assembly

Details about sequence assembly algorithms will be covered in the next lecture, but in broad outline, most sequence assembly programs proceed as follows.

1. Do pairwise comparison of all the reads to determine potentially overlapping reads.
2. Determine layout, i.e., the pattern of overlaps.
3. Determine the consensus sequence, i.e., the best guess at the clone sequence.

There are, of course, variations on this outline. E.g., one of the early sequence assembly programs, due to Rodger Staden, loosely intermixed all three steps. Whenever a pair of reads having a strong overlap was detected, the pair was replaced by their consensus sequence and the process was iterated.
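The iterate-and-merge idea can be sketched in a few lines of Python. This is a toy greedy assembler, not Staden's program or any real tool: it assumes error-free reads all taken from the same strand, requires exact suffix/prefix matches, ignores error probabilities entirely, and uses hypothetical function names and a made-up minimum overlap `min_len`.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if that length is below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly replace the pair of sequences with the longest exact
    overlap by their merge; stop when no overlaps remain.
    Returns the remaining contigs (more than one if there are gaps)."""
    contigs = list(dict.fromkeys(reads))    # drop exact duplicate reads
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):     # step 1: compare all pairs
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        # With exact matches, the "consensus" of the pair is simply the merge.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGA"]))   # ['ATGGCGTACGGA']
print(greedy_assemble(["AAACCC", "GGGTTT"]))               # two contigs: a gap
```

Two limitations are worth noting. Like the Shortest Common Superstring formulation, this greedy merge will collapse exact repeats, and because it recompares every pair of contigs after each merge, its cost grows at least quadratically with the number of reads, which a practical program would need to avoid or speed up.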

What is the scope of the resulting computational problem? Shotgun sequencing of a BAC clone might result in reads, if reads tended to be short. Full shotgun assembly of a bacterial genome (typically a few megabases) could produce tens of thousands of reads. E.g., UW's STC is sequencing the Pseudomonas aeruginosa bacterium, which has approximately 6 Mb of DNA. They have completed about forty-five thousand reads, and expect to have about ninety thousand at completion. Assembly of the current forty-five thousand reads takes an hour or two; assembly of the complete genome should be possible in an overnight run. Run time is mainly driven by the number of pairwise matches, which is approximately quadratic in the number of repeats. This is not typically a problem for bacterial genomes, which are relatively small and have relatively few repeats, but might become a bigger issue with human sequence.

References

[1] Brent Ewing and Phil Green. Basecalling of automated sequencer traces using phred. II. Error probabilities. Submitted, December
[2] Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Submitted, December 1997.