DNA and genome sequencing. Matthew Hudson Dept of Crop Sciences University of Illinois

1 DNA and genome sequencing Matthew Hudson Dept of Crop Sciences University of Illinois

2 Genome projects: 2,424 ongoing genome projects (696 for eukaryotes); 520 completed genomes (47 from eukaryotes). Almost every crop now has a genome project.

3 DNA Sequencing Dideoxy sequencing was developed by Fred Sanger at Cambridge in the 1970s. Often called Sanger sequencing. Nobel prize number 2 for Fred Sanger in 1980, shared with Walter Gilbert from Harvard (inventor of the now little-used Maxam-Gilbert sequencing method).

4 Sanger's dideoxy DNA sequencing method: how it works. 1. The DNA template is denatured to single strands. 2. A DNA primer (with its 3' end near the sequence of interest) is annealed to the template DNA and extended with DNA polymerase. 3. Four reactions are set up, each containing: (a) DNA template, e.g. a plasmid; (b) primer; (c) DNA polymerase; (d) dNTPs (dATP, dTTP, dCTP, and dGTP). 4. Next, a different radio-labeled dideoxynucleotide (ddATP, ddTTP, ddCTP, or ddGTP) is added to each of the four reaction tubes at 1/100th the concentration of the normal dNTPs.

5

6 Terminators stop further elongation of a DNA deoxyribose-phosphate backbone. ddNTPs are terminators: they possess a 3'-H instead of a 3'-OH, compete in the reaction with normal dNTPs, and cannot form a phosphodiester bond with the next nucleotide. Whenever a radio-labeled ddNTP is incorporated in the chain, DNA synthesis terminates.

7 hasta la vista

8 Manual dideoxy DNA sequencing: how it works (cont.). 5. Each of the four reaction mixtures produces a population of DNA molecules with chains terminating at each occurrence of that reaction's terminator base. 6. Extension products in each of the four reaction mixtures therefore end with a different radio-labeled ddNTP (depending on the base). 7. Next, each reaction mixture is electrophoresed in a separate lane (4 lanes) at high voltage on a polyacrylamide gel. 8. The pattern of bands in each of the four lanes is visualized on X-ray film. 9. The location of the bands in each of the four lanes indicates the size of the fragments terminating with the respective radio-labeled ddNTP. 10. The DNA sequence is deduced from the pattern of bands in the 4 lanes.

9 Vigilant et al PNAS 86:

10 Radio-labeled ddNTPs (4 rxns). Sequence (5' to 3'): G G A T A T A A C C C C T G T. Gel labels: short products, long products.
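
The band-reading logic of the four-lane gel can be sketched in a few lines of Python. This is only a toy model under stated assumptions: primer length, band intensity, and gel resolution are ignored, and the function names are illustrative rather than taken from any sequencing software. It uses the sequence shown on the gel slide.

```python
# Toy model of four-lane dideoxy sequencing (all names here are illustrative).
SEQUENCE = "GGATATAACCCCTGT"  # newly synthesized strand, 5' to 3', from the gel slide


def lane_fragment_lengths(synthesized: str) -> dict[str, list[int]]:
    """Return, for each ddNTP lane, the lengths of the terminated products."""
    lanes: dict[str, list[int]] = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(synthesized, start=1):
        # A ddNTP of this base can terminate the chain at this position,
        # giving a product exactly `position` nucleotides long in that lane.
        lanes[base].append(position)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read bands from the shortest to the longest product to recover the sequence."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = lane_fragment_lengths(SEQUENCE)
recovered = read_gel(lanes)
print(lanes)
print(recovered)
```

Each band length appears in exactly one lane, which is why reading across the four lanes from short to long products gives the sequence back.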

11 Manual vs automatic sequencing. Manual sequencing has basically died out. It needs four lanes per template and radioactive gels; in one day a technician can get four sets of four lanes from one gel, with maybe 300 base pairs of data from each template. Everyone now uses automatic sequencing. The downside is that no single lab can afford the machine, so it is done in a central facility (e.g. the Keck Center). Most automated DNA sequencers can be loaded robotically and operate around the clock for weeks with minimal labor.

12 Dye deoxy terminators: one tube, one gel lane or capillary.

13 Robotic 96 capillary machine: ABI 3730 xl

14 DNA sequence output from ABI 377 (a gel-based sequencer). 1. Trace files (dye signals) are analyzed and bases called to create chromatograms. 2. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.
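
As a rough illustration of reconciling reads from opposite strands, the sketch below reverse-complements the reverse-strand read and merges it with the forward read base by base. It assumes both reads cover exactly the same region and have equal length, and it uses "N" for an uncalled base; real software also aligns the reads and weighs quality values. The function names are hypothetical.

```python
# Minimal sketch: reconcile a forward read with a read from the opposite strand.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reconcile(forward: str, reverse: str) -> str:
    """Merge two equal-length reads of the same region from opposite strands."""
    flipped = reverse_complement(reverse)
    if len(flipped) != len(forward):
        raise ValueError("this sketch assumes reads of equal length")
    consensus = []
    for f, r in zip(forward, flipped):
        if f == r:
            consensus.append(f)
        elif f == "N":      # forward strand uncalled: trust the reverse strand
            consensus.append(r)
        elif r == "N":      # reverse strand uncalled: trust the forward strand
            consensus.append(f)
        else:               # strands disagree: leave the base ambiguous
            consensus.append("N")
    return "".join(consensus)


forward_read = "GGATNTAACC"   # one uncalled base in the forward read
reverse_read = "GGTTATATCC"   # same region read from the opposite strand
print(reconcile(forward_read, reverse_read))
```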

15 Genome sequencing How do you use these chunks of sequence to make a whole genome sequence?

16 The traditional genome. A physical map is made. A BAC tiling path is created. BACs are farmed out to hundreds of collaborating laboratories; each lab does a few BACs. Arabidopsis, E. coli, etc. were done this way, but since Craig Venter got interested, everything is going shotgun.

17 Shotgun genome sequencing. Clone by clone (traditional): slow and expensive, but accurate and complete, and assembly is straightforward. Shotgun: much faster and cheaper, but very hard to get a complete genome assembly of large (>10 Mb) genomes.

18 Finished genome: whole chromosome sequences; done clone by clone; need physical map; e.g. human, Arabidopsis. Shotgun genome: 100 kb average chunks; e.g. poplar. Maize now: some BAC contigs; MAGIs.

19 Shotgun sequencing. Workflow: extract DNA; shear; ligate into library; pick clones; grow clones; extract vector DNA; sequence using ddNTPs; read fragments with gel or capillary. ~700 bases per read; one or two reads per clone. Shotgun sequence of mouse: ~2.6 Gb, 7x coverage. That's 26,000,000 sequencing reactions and 13,000,000 minipreps.
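
The read and miniprep counts on this slide follow from simple coverage arithmetic. The short Python check below uses the slide's own figures and assumes two reads per clone, which is what the 2:1 ratio of reactions to minipreps implies.

```python
# Coverage arithmetic for the mouse shotgun example (figures from the slide).
genome_size_bp = 2_600_000_000   # ~2.6 Gb
coverage = 7                     # 7x coverage
read_length_bp = 700             # ~700 bases per read
reads_per_clone = 2              # assumption: both ends of each clone are read

# Total bases to sequence, divided by bases obtained per read.
reads_needed = genome_size_bp * coverage // read_length_bp
# One miniprep per clone, each clone giving `reads_per_clone` reads.
minipreps_needed = reads_needed // reads_per_clone

print(f"{reads_needed:,} sequencing reactions")
print(f"{minipreps_needed:,} minipreps")
```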

20 The genome factory. There are a few centers around the world that have a factory big enough to do shotgun sequence of a large eukaryotic genome: Broad Institute, MIT; Baylor College of Medicine, Houston; Washington University, St Louis; DoE Joint Genome Institute, Walnut Creek, CA; Sanger Centre, Cambridge; Beijing Genomics Institute, Chinese Academy of Sciences.

21 Pictures from JGI

22 Qpix robot picks colonies

23

24

25 Biomek PCR / cleanup robot

26

27 PCR 384 x 4 x 48 x 3

28 About 150 sequencers, at $200,000 each

29 Sequence analysis

30 Bioinformatics: armies of programmers and large supercomputers are necessary to assemble and annotate the sequence.

31 Assembly and annotation. Assembly: we have to compare those 30,000,000 sequences with each other and work out how they fit together. A nasty mathematical problem. Annotation: when we have the sequence, we have to work out where the genes are and what they do. Mostly a computational problem, with very large databases.
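
To give a feel for the assembly problem, here is a deliberately naive greedy overlap assembler in Python. It is a sketch only, not how production assemblers work: it assumes error-free reads, all from the same strand, with no repeats, and the function names are made up for illustration. Comparing every read with every other read is exactly what does not scale: for 30,000,000 reads that is on the order of 4.5 x 10^14 pairs.

```python
# Naive greedy overlap assembly of error-free, same-strand reads (toy example).
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]


reads = ["GGATATAAC", "TAACCCC", "CCCCTGT"]
print(greedy_assemble(reads))
```

With repeats longer than a read, two different merges can look equally good, which is one reason large shotgun assemblies are so hard to complete.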

32 Whole-genome resequencing. Wouldn't it be great to have the whole genome of each line you work with? Then the whole genome would be haplotyped. Whole plant or metazoan genomes still cost $40-50 million. NIH has targets for the cost of a human genome: $100,000 in 2010 and $1,000 in 2020. This is likely to be achieved ahead of schedule. Human resequencing technology is likely to have a big impact on plant biology also.
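
The two NIH targets imply a particular rate of decline if one assumes the cost falls at a constant exponential rate between them (an assumption made here for illustration; the slide does not say so). A quick Python calculation of the implied halving time:

```python
import math

# NIH cost targets for a human genome, as quoted on the slide.
cost_2010 = 100_000.0
cost_2020 = 1_000.0
years = 2020 - 2010

# Assuming a constant exponential decline between the two targets:
fold_drop = cost_2010 / cost_2020                        # 100-fold
halving_time_years = years * math.log(2) / math.log(fold_drop)
annual_factor = (cost_2020 / cost_2010) ** (1 / years)   # cost multiplier per year

print(f"halving time: {halving_time_years:.2f} years")
print(f"each year costs {annual_factor:.3f} times the previous year")
```

Under that assumption the cost would have to halve roughly every year and a half to go from the first target to the second.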

33 Cost of sequencing is falling exponentially (plot of cost per base, $).

34 Robotic 96 capillary machine: ABI 3730 xl

35 DNA sequence output from ABI 377 (a gel-based sequencer). 1. Trace files (~350 KB / run). 2. Analyzed and bases called to create sequence and quality files (~2 KB / run). 3. One run is about 700 base pairs (bp). 4. Typical genome project: soybean, 6M runs so far.

36 Limits to how cheap sequencing can get using the Sanger method. Workflow: extract DNA; shear; ligate into library; pick clones; grow clones; extract vector DNA; sequence using ddNTPs; read fragments with gel or capillary. ~700 bases per read; one or two reads per clone. Cost: $2 per read at high throughput, plus costs of clone generation, ~$1. Total current lowest cost: ~$5/kb, 0.5c per Q20 base.
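
The per-kilobase figure can be roughly reproduced from the per-read numbers. The sketch below assumes one 700-base read per clone and that every base is usable; how the slide's ~$5/kb was actually derived is not stated, so this is only one way to land in the same neighborhood. It also spells out what Q20 means on the standard Phred scale, where Q = -10 log10(error probability).

```python
# Rough Sanger cost per kilobase from the slide's per-read figures.
cost_per_read = 2.00       # dollars, high throughput
cost_per_clone = 1.00      # dollars, clone generation
read_length_bp = 700       # ~700 bases per read
reads_per_clone = 1        # assumption: one read per clone

cost_per_clone_total = cost_per_clone + reads_per_clone * cost_per_read
bases_per_clone = reads_per_clone * read_length_bp
cost_per_kb = cost_per_clone_total / bases_per_clone * 1000
cents_per_base = cost_per_kb / 1000 * 100


def phred_error_probability(q: float) -> float:
    """Error probability for a Phred quality score Q, where Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


print(f"~${cost_per_kb:.2f} per kb, ~{cents_per_base:.2f} cents per base")
print(f"Q20 means an error probability of {phred_error_probability(20):.2%}")
```

Not every base in a read reaches Q20, so the cost per Q20 base is somewhat higher than the raw cost per base, consistent with the slide's rounding up.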

37 Next-generation sequencing. A number of proprietary technologies, most based on the manipulation of microbeads and/or nanobeads, where sequencing is performed without gels or capillaries. First on the market was a company called 454 (now Roche), now on its second generation of instruments. 454 has a major competitor in Solexa (now Illumina). Recently AB announced its own next-generation platform, SOLiD (AB acquired Agencourt).

38 Next-generation sequencing approach. Extract and shear DNA. No E. coli, no plasmids, no freezers, no hydras, no gels, no capillaries. Isolate clonal molecules on beads (polony amplification). Immobilize on solid support. Fluorescent or luminescent readout in situ.

39 454 Sequencing technology

40 Picowell (50nm) technology

41 Sequencing by synthesis using chemiluminescence. GS20: 20 Mb of sequence for ~$5,000 in running costs. Quality is similar to early ESTs (97-98% at best). We have no clone information, so no read pairings. Homopolymer

42

43 Data output: flowgram file in binary SFF format, about 250 MB per run. Similar to a trace file: contains luminosity readings for each of 1.6M wells from a photomultiplier, for each of four bases, for each of 42 flow cycles. Processed using on-board FPGA with instrument. Others have tried to improve the software, but 454's is still best all round.
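
A flowgram can be turned into bases by rounding each luminosity reading to the nearest whole number of incorporated nucleotides. The Python sketch below shows the idea for a single well; it is not 454's base caller. The flow order "TACG" and the signal values are assumptions made up for illustration, and signals are taken to be already normalized so that one incorporation reads about 1.0.

```python
# Toy flowgram base calling for one well (illustrative values, not real data).
FLOW_ORDER = "TACG"  # assumed nucleotide flow order within each cycle

# From the slide: readings per run = wells x bases x flow cycles.
readings_per_run = 1_600_000 * 4 * 42


def call_bases(flow_signals: list[float], flow_order: str = FLOW_ORDER) -> str:
    """Round each normalized signal to a homopolymer length and emit that many bases."""
    called = []
    for index, signal in enumerate(flow_signals):
        base = flow_order[index % len(flow_order)]
        copies = max(0, int(signal + 0.5))  # nearest whole number of incorporations
        called.append(base * copies)
    return "".join(called)


clean = [0.05, 0.02, 0.04, 2.08,   # cycle 1: GG
         0.03, 0.97, 0.06, 0.01,   # cycle 2: A
         1.04, 1.02, 0.02, 0.05,   # cycle 3: T, A
         0.95, 1.93, 3.88, 0.04]   # cycle 4: T, AA, CCCC
noisy = clean[:14] + [3.45, 0.04]  # same well, weaker signal for the C run

print(f"{readings_per_run:,} readings per run")
print(call_bases(clean))
print(call_bases(noisy))
```

The second call drops one C from the run of four: the longer the homopolymer, the smaller the relative difference between n and n+1 incorporations, which is where this chemistry tends to make its errors.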

44 454 FLX. Claimed: 100 Mb per run, 200+ base reads. Cost: ~$12,000 / run in reagents & basic maintenance. Ours delivered Tues June 12; no data yet.

45 1Gb of sequence for < $3,000 in running costs

46

47

48 Data output. No access to data yet; reportedly: a series of huge image files. Each is color. Analysis uses image analysis techniques. Raw data output is ~500 GB per run. Current customers say compute infrastructure cannot cope: 100s of CPU hours to process one run. Raw data currently must be discarded.

49 Polony sequencing / ABI SOLiD. George Church's group invented the polony method. Since developed by Agencourt, now bought by ABI. Similar to Solexa: no wells, small beads, 4-color fluorescent detection, about 1 Gb per run, about $3,000 per run. Uses ligation of nucleotide-specific probes rather than reversible terminators.

50

51 Summary of NGS technologies
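
The summary slide itself did not survive transcription, so the following is not its content. As a rough stand-in, the Python sketch below puts the running-cost figures quoted on the earlier slides on a common dollars-per-megabase scale. It assumes that the "1 Gb for < $3,000" figure refers to Solexa and treats it as an upper bound; the FLX yield is the vendor's claim, and read length, accuracy, and instrument price are all ignored.

```python
# Running cost per megabase, using only figures quoted on earlier slides.
# Each entry: (cost in dollars, megabases obtained for that cost).
platforms = {
    "Sanger (capillary)": (5.0, 0.001),          # ~$5 per kb
    "454 GS20": (5_000.0, 20.0),                 # 20 Mb for ~$5,000
    "454 FLX (claimed)": (12_000.0, 100.0),      # 100 Mb for ~$12,000
    "Solexa (assumed)": (3_000.0, 1_000.0),      # 1 Gb for < $3,000
    "SOLiD": (3_000.0, 1_000.0),                 # about 1 Gb for about $3,000
}

cost_per_mb = {name: cost / mb for name, (cost, mb) in platforms.items()}

for name, dollars in sorted(cost_per_mb.items(), key=lambda item: -item[1]):
    print(f"{name:<22} ~${dollars:,.0f} per Mb")
```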