Complete Genomics Technology Overview


Complete Genomics has developed a third-generation sequencing platform capable of generating complete genome sequence data at unprecedented throughput and low cost. The company is deploying this platform to offer comprehensive human genome sequencing services through its commercial-scale, fully automated human genome center. This service will enable Complete Genomics' customers to characterize the full spectrum of genetic variants that exist in large numbers of human subjects. Large-scale human genome studies will enable our customers to further elucidate the genetic underpinnings of complex diseases and drug responses.

Complete Genomics' sequencing platform is the synthesis of technology advancements in libraries, arrays, sequencing assay, instruments and software, integrated into a complete system for large-scale studies of complete human genomes (Figure 1).

Libraries: DNA libraries that enable Complete Genomics' cPAL technology and diploid human genome assembly
Arrays: Massively parallel DNA nano-arrays that minimize reagent usage and maximize imaging efficiency
Assay: A proprietary assay that combines hybridization and ligation to produce high-accuracy reads with minimal reagent usage
Instruments: High-speed instruments for rapidly reading Complete Genomics' submicron DNA nano-arrays
Software: Assembly software for rapidly reconstructing diploid genomes from billions of paired-end reads

Figure 1: Technology synthesis

The low reagent usage and high imaging efficiency of Complete Genomics' sequencing platform enable sequencing of complete human genomes at a fraction of the cost of alternative approaches.

The accuracy of Complete Genomics' novel sequencing chemistry, combined with the informational power of its unique library structures, enables the company to resolve many of the complexities of the human genome, and thereby provide the high-quality human genome datasets required to understand complex diseases and drug responses.

Furthermore, Complete Genomics is incorporating hundreds of sequencing instruments and an enterprise-class data center into a single facility, creating the world's largest commercial human genome sequencing center. This new facility will enable Complete Genomics' customers to sequence thousands of human genomes efficiently and cost-effectively. The company's customers are not burdened with the operational, computational, and capital purchase costs of owning and operating the instruments, nor do they need the computing resources necessary for large-scale sequencing of complete human genomes.

Libraries

Complete Genomics uses a three-tiered DNA fragment library architecture to resolve the unique structural characteristics of the human genome.

Tier 1: ~500 base pair (bp) fragments, 35-base paired-end reads
35-base paired-end reads from ~500 bp fragments are sufficient to span the majority of repetitive elements in the human genome, including Alu repeats, which make up 10% of the genome.

Tier 2: ~5-10 kilobase pair (kbp) fragments, 35-base paired-end reads
5-10 kbp mate-pair reads span most long interspersed nuclear element (LINE) repeats and tandem short interspersed nuclear element (SINE) repeats in the human genome.

Tier 3: ~100 kbp fragments, long fragment reads (LFR)
Complete Genomics has developed a proprietary technology, Long Fragment Reads (LFR), which enables independent sequencing and analysis of the two parental chromosomes in a diploid sample. This capability is critical to sequencing diploid human genomes, as it allows heterozygote phasing over large intervals (potentially entire chromosomes), even in areas with high recombination rates. In addition, by distinguishing calls from the two chromosomes, LFR allows higher-confidence calling of homozygous positions (>99% of the genome) at low coverage. Additional applications of LFR include resolution of extensive rearrangements in cancer genomes and full-length sequencing of alternatively spliced transcripts.

Library construction process

Complete Genomics' paired-end DNA libraries consist of genomic DNA fragments with known synthetic DNA sequences (called adaptors) interspersed at regular intervals (Figure 2). The adaptors act as starting points for reading up to 10 bases from each adaptor-genomic DNA junction.

Complete Genomics uses a proprietary library construction process to insert four adaptors into each DNA fragment. While the spacing of the adaptors varies by a few bases, two of the four adaptors are inserted into contiguous genomic DNA, so that the 10-base reads from each end of these adaptors result in 20 contiguous bases.

Figure 2: Multiple-adaptor library construction process

Four adaptors support 70-base reads (35 bases per paired end). The read length may be increased by inserting more adaptors.

Long Fragment Read (LFR) process

Complete Genomics' proprietary Long Fragment Read (LFR) technology provides the benefits of dramatically longer read lengths, including haplotype phasing (Figure 3). Genomic DNA of approximately 100 kbp is used as the input for LFR, as the length of the input DNA determines the interval over which phasing can be performed.

This high-molecular-weight genomic DNA is aliquoted into a 384-well plate such that ~0.1 haploid genomes (10% of a haploid genome) go into each well. The DNA fragments in each well are amplified, and this amplified DNA is fragmented to ~500 bp. The DNA in each well is ligated to adaptor arms containing a unique identifier, and the ligated DNA from all 384 wells is then pooled into a single tube.

This pooled DNA is then used as input to Complete Genomics' standard library construction and sequencing processes.

In aggregate, the 384 wells will contain approximately 40 fragments spanning each position in the genome, with about 20 fragments coming from the maternal chromosome and 20 from the paternal chromosome. At a rate of 0.1 genome equivalents per well, there is a 10% chance that fragments in a well will overlap, and a 50% chance that any such overlapping fragments are derived from separate parental chromosomes. Thus, ~95% of the data from a well will be derived from a single parental chromosome.

Resolving parental chromosomes

The data from each well are then mapped to one or more reference genomes, and reads that map near each other are grouped by their unique identifiers, enabling reconstruction of the ~100 kbp haploid fragments in each well. Single nucleotide polymorphisms (SNPs) within the sample are then used to distinguish between 100 kbp fragments from the maternal and paternal chromosomes.

The initial 40 genome equivalents described above yield on average a 100 kbp maternal fragment starting every 5 kbp and a 100 kbp paternal fragment starting every 5 kbp. Thus, two consecutive maternal fragments will overlap each other on average by ~95 kbp. In the human genome, there are typically single nucleotide polymorphisms (SNPs) within 95 kbp, many of which will be heterozygous in any given sample. Using these SNPs, maternal fragments are distinguished from paternal fragments; by chaining together overlapping fragments, large maternal and paternal segments (up to complete chromosomes) can be constructed separately.

Phasing will not be possible across long repeat sections such as satellites in centromeric regions. For most practical purposes, however, Complete Genomics' technology increases the effective read length from 35 bp to over 100 kbp.

Figure 3: Long Fragment Read process
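The LFR figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It is only a check of the stated numbers, not a model of the laboratory process; the function name and the returned keys are illustrative, and the rounding of 38.4 to 40 fragments follows the text.

```python
def lfr_summary(wells=384, genome_equiv_per_well=0.1, fragment_kbp=100):
    """Recompute the LFR numbers quoted in the text (illustrative helper)."""
    # Expected number of fragments covering a given position across all wells.
    fragments_per_position = wells * genome_equiv_per_well  # 38.4, "approximately 40"

    # The text rounds to 40 and splits the fragments evenly between parents.
    per_parent = 40 / 2

    # With 20 fragments of 100 kbp per parent, one starts every 5 kbp on average,
    # so consecutive fragments from the same parent overlap by about 95 kbp.
    start_spacing_kbp = fragment_kbp / per_parent
    overlap_kbp = fragment_kbp - start_spacing_kbp

    # 10% chance of overlap in a well, 50% chance the overlap mixes both parents.
    mixed_fraction = genome_equiv_per_well * 0.5
    single_parent_fraction = 1 - mixed_fraction

    return {
        "fragments_per_position": fragments_per_position,
        "start_spacing_kbp": start_spacing_kbp,
        "overlap_kbp": overlap_kbp,
        "single_parent_fraction": single_parent_fraction,
    }


if __name__ == "__main__":
    for name, value in lfr_summary().items():
        print(f"{name}: {value:.2f}")
```

Changing `wells` or `genome_equiv_per_well` shows how the aliquoting scheme trades coverage per position against the chance that a well mixes both parental chromosomes.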

Arrays

Complete Genomics has developed ultra-high-density DNA arrays that can be read with standard fluorescent chemistry and commercial imaging equipment, minimizing the cost of both reagents and imaging equipment.

Clonal DNA amplification in solution: DNA nano-balls (DNBs)

Complete Genomics sequencing is performed on amplified DNA clusters termed DNA nano-balls (DNBs). The amplification avoids the cost and challenges of relying on the single-fluorophore measurements used by single-molecule sequencing systems.

Figure 4: DNA nano-ball formation

Starting with a small circular DNA template (Figure 4) consisting of approximately 80 bases of genomic DNA and four synthetic adaptors, Complete Genomics generates a head-to-tail concatemer consisting of more than 200 copies of the circular template. Complete Genomics has developed a variety of proprietary techniques for forming this concatemer into a ball (a DNA nano-ball, or DNB) as well as controlling its size, density and binding affinity to surfaces and to other DNBs. One milliliter (ml) of reaction volume generates over 10 billion DNBs, sufficient for sequencing an entire human genome.

Unlike alternative approaches, clonal DNA amplification is not performed in emulsions or on surfaces. The amplification process occurs in solution and in a single reaction chamber, allowing for higher density and lower reagent usage. Additionally, the DNB production process inherently produces clonal amplicons; it is not subject to the stochastic variation from limiting dilution that is inherent in alternative approaches.

Patterned substrates

Complete Genomics produces patterned substrates (Figure 5) with two-dimensional arrays of spots that are activated to capture and hold DNBs. The patterned surfaces are produced using standard silicon processing techniques. Each spot contains a single DNB.

Figure 5: Patterned substrate

Complete Genomics' patterned arrays achieve a significantly higher density of DNA spots than the unpatterned arrays that are typically used, leading to fewer pixels per base read, faster processing, and more efficient reagent use. The company's first-generation commercial patterned substrates are 25 mm by 75 mm (1" x 3") standard microscope slides, each with the capacity to hold approximately 1 billion individual spots that can bind DNBs. DNB arrays with even higher density are under development.

Figure 6: Slide preparation
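To give a feel for what a slide of this capacity means, the sketch below estimates the raw read bases per slide and the corresponding fold coverage. It uses the ~1 billion spots per slide and the 70-base reads per DNB stated in this overview. The haploid genome size of roughly 3.1 billion bases and the occupancy parameter are assumptions introduced here, and the result is a raw upper bound before any filtering or mapping losses.

```python
SPOTS_PER_SLIDE = 1_000_000_000      # "approximately 1 billion" spots (from the text)
BASES_PER_DNB = 70                   # 35 bases per paired end (from the text)
HAPLOID_GENOME_BP = 3_100_000_000    # assumption: ~3.1 Gb human haploid genome


def slide_yield(occupancy=1.0):
    """Return (raw bases per slide, raw fold coverage) for a given spot occupancy."""
    bases = SPOTS_PER_SLIDE * occupancy * BASES_PER_DNB
    return bases, bases / HAPLOID_GENOME_BP


if __name__ == "__main__":
    bases, coverage = slide_yield(occupancy=1.0)
    print(f"{bases / 1e9:.0f} Gb raw, about {coverage:.1f}x coverage")
```

Under these assumptions a fully occupied slide yields on the order of tens of gigabases; lowering `occupancy` scales the estimate linearly.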

Self-assembling DNB arrays

Complete Genomics makes a DNB array by introducing the DNBs to the patterned surface (Figure 6). The DNBs stick to the activated, or sticky, spots and do not stick to the field between the spots. Once a single DNB has stuck to a spot, it repels other DNBs, resulting in one DNB per spot. DNBs are three-dimensional, resulting in more DNA copies per square nanometer of binding surface than traditional DNA arrays. This three-dimensional quality further reduces the quantity of sequencing reagents required, and results in brighter spots and more efficient imaging. In practice, DNB array occupancies exceed 90% (Figure 7).

A high-density DNB array thus self-assembles from DNBs in solution, removing one of the most costly aspects of making traditional patterned oligo or DNA arrays.

Figure 7: Four-color image of a DNB array

Assay

The historical drawback of sequencing by ligation has been short read length, which is typically limited to approximately six bases from the ligation site. Complete Genomics has increased the read length to 10 bases; and by inserting multiple adaptors into each genomic fragment, each of which has two ligation sites, multiple adjacent 10-base segments of genomic DNA can be read.

cPAL: ligation-based DNA sequencing

Complete Genomics' sequencing assay, called combinatorial probe-anchor ligation (cPAL), retains many of the advantages of sequencing by hybridization (SBH), including DNA array parallelism, independent and non-iterative base reading, and the capacity to read multiple bases per reaction. In addition, cPAL resolves two SBH limitations: an inability to read simple repeats and a need for exceptionally intensive computation.

cPAL uses pools of probes labeled with four distinct dyes (one per base) to read the positions adjacent to each adaptor (Figure 8). There is a separate pool of probes for each read position. Complete Genomics' proprietary approach allows 10 contiguous bases to be read from each end of an adaptor.

Ligating the matching probes with the adjacent anchors dramatically improves the full-match specificity of the probe binding compared to hybridization without ligation. Under optimal fluidics and imaging conditions, the raw error rate of this assay can be below 0.1%. After each base is read, the entire anchor-probe complex is washed away. The next anchor is then hybridized, and the next probe pool is ligated to the anchor. There is no chaining of consecutive probes, and thus no accumulation of errors.

Figure 8: Combinatorial probe-anchor ligation (cPAL)

One of the unique advantages of cPAL is random access (independent and non-iterative base reading). Each base-read cycle does not depend on the completeness of any of the previous cycles. This provides excellent fault tolerance: if a base read fails, it does not prevent interpretation of the rest of the reads for that DNB; if desired, the failed base can simply be reassayed.

Another key advantage of independent base reading is its tolerance of low ligation yield per cycle. This dramatically reduces the required probe and enzyme concentrations, thereby substantially reducing reagent costs.

cPAL further allows for reading multiple positions per cycle, which is not possible with sequencing by synthesis. Reading multiple positions per cycle decreases the number of cycles, thus reducing reagent consumption and imaging time.
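The random-access property described above can be illustrated with a toy model in Python. This is a conceptual sketch only: the dye-to-base assignment and the function names are hypothetical, and the chemistry is reduced to a table lookup. What it shows is that each position is interrogated in its own cycle, so cycles can run in any order and a failed cycle produces a single no-call ("N") without affecting the other positions.

```python
# Hypothetical dye assignment: one distinct dye per base, as in the text.
DYE_FOR_BASE = {"A": "dye1", "C": "dye2", "G": "dye3", "T": "dye4"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def read_position(genomic, offset, fail=False):
    """One cycle: hybridize anchor, ligate the probe pool for `offset`, image, wash."""
    if fail:
        return "N"  # no usable signal for this position only
    observed_dye = DYE_FOR_BASE[genomic[offset]]  # the matching probe lights up
    return BASE_FOR_DYE[observed_dye]


def read_arm(genomic, n_positions=10, failed_cycles=(), order=None):
    """Read `n_positions` bases next to an adaptor, one independent cycle each."""
    order = range(n_positions) if order is None else order
    calls = ["N"] * n_positions
    for offset in order:
        # No cycle depends on the outcome of any other cycle.
        calls[offset] = read_position(genomic, offset, fail=offset in failed_cycles)
    return "".join(calls)


if __name__ == "__main__":
    template = "ACGTACGTAC"
    print(read_arm(template))                     # all ten cycles succeed
    print(read_arm(template, failed_cycles={3}))  # one failed cycle, one no-call
```

In a chained chemistry, by contrast, a failed step would affect every later position; here the failed position could simply be read again in an extra cycle.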

Instruments

Figure 9: Complete Genomics sequencing instrument

Complete Genomics' instrument design is highly modular. Each of its components can be independently upgraded as suppliers release new, improved versions. Relying on standardized components enables Complete Genomics to track its suppliers' technology roadmaps to deliver state-of-the-art performance, while leveraging the continuous cost reductions of standardized components. High-volume purchases by Complete Genomics will enable the company to work with suppliers to improve the performance of Complete Genomics' sequencing system.

Complete Genomics' sequencing instrument (Figure 9) consists of three loosely coupled subsystems: (1) DNA arrays, which are packaged into flow slides; (2) a standard liquid-handling robot; and (3) a high-speed imager. This modular design enables Complete Genomics to adjust components easily as specifications or performance criteria change, enabling rapid reconfiguration that keeps pace with hardware technology development.

Flow slides

Complete Genomics has developed a powerful flow slide platform for minimizing reagent use and simplifying fluorescence imaging (Figure 10). Micro-channels formed on top of the patterned substrates enable efficient reagent delivery and elimination of dead volume while simultaneously satisfying the optical requirements for high (DNA spots measured per cycle) resolution imaging. Process capacity can be increased by adding more flow slides to the liquid handler deck. This ensures that increases in imager speed are matched by increases in process capacity.

Figure 10: Flow slides

Fluidics robot

Complete Genomics uses standard, off-the-shelf liquid-handling robots to pipette reagents to the flow slides. When reactions are complete and a flow slide is ready to be imaged, a robot arm transfers the slide from the liquid-handling deck to the imager stage. Each instrument can run two to 12 slides in parallel: while one slide is imaging, the remaining slides are in various stages of preparation for imaging.

Imager

The imager is constructed from off-the-shelf components to form a four-color fluorescence microscope. The main components are the illuminator, filter changer, microscope objective, tube lens, motorized stage, and detector.

Software

Complete Genomics has developed its own suite of base calling, mapping, assembly, and analysis software. The base calling software receives data from the imager after each reaction cycle. Images are processed to determine bases at each position on a DNB array. Called bases for each DNB are collated to form raw read data. Mapping, assembly and analysis software operate on read data and produce a variety of outputs, including reads aligned to a reference genome and consensus sequence assembly of overlapping DNB reads.

Base calling software

Four images, one for each color dye, are generated for each adaptor position.
The position of each spot in an image and the resulting intensities for each of the four colors are determined, adjusting for crosstalk between dyes and for background intensity. A quantitative model is fitted to the resulting four-dimensional dataset. A base is called for a given spot, with a quality score that reflects how well the four intensities fit the model.

Read data format

Read data is encoded in a compact binary format and includes both a called base and a quality score. The quality score is correlated with base accuracy. Analysis software, including sequence assembly software, uses the score to determine the contribution of evidence from individual bases within a read.

Figure 11: Read data format

Reads are gapped due to the DNB structure (Figure 11). Gap sizes vary (usually +/- 1 base) due to the variability inherent in enzyme digestion. Due to the random-access nature of cPAL, reads may occasionally have an unread base ("no call") in an otherwise high-quality DNB. Read pairs are mated as described in the Libraries section.

Mapping

Complete Genomics has developed high-speed mapping software capable of aligning read data to a reference sequence. The mapping software can map a 50X coverage human data set to a human reference sequence in less than 24 hours. The software runs on a commodity Linux cluster and scales horizontally with more CPUs.

The mapping is tolerant of small variations from a reference sequence, such as those caused by individual genomic variation, read errors or unread bases. This often allows direct reconstruction of SNPs. To support assembly of larger variations, including large-scale structural changes or regions of dense variation, each 35-base arm of a DNB is mapped separately, with mate-pairing constraints applied after alignment.

Assembly

Complete Genomics has developed sequence assembly software (Figure 12) that runs on a commodity Linux cluster and scales horizontally with more CPUs.

Figure 12: Localized reads

The assembly software supports the DNB read structure (mated, gapped reads with non-called bases) and is designed to generate a diploid genome assembly by leveraging Long Fragment Reads (LFR) technology for phasing. In addition to reconstructing SNPs, the Complete Genomics assembler can reconstruct novel segments not present in a reference sequence. The algorithm uses a combination of evidential (Bayesian) reasoning and de Bruijn graph-based algorithms. The use of a statistical model, which is empirically calibrated to each dataset, allows all read data to be used, without pre-filtering or data trimming.

In addition to genotype data and contig sequences, the Complete Genomics assembler identifies large-scale structural variations (deletions, translocations, etc.) and copy number variations by leveraging mated reads.
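The idea of mapping each arm separately and applying mate-pairing constraints afterwards can be sketched in a few lines of Python. This is a simplified illustration, not Complete Genomics' mapping software: the function name, the (chromosome, position) hit format and the gap window are hypothetical, and strand orientation is ignored.

```python
from itertools import product


def pair_arms(left_hits, right_hits, min_gap, max_gap):
    """Keep (left, right) hit pairs consistent with the expected mate separation.

    Each hit is a (chromosome, position) tuple produced by mapping one arm on
    its own. A pair is kept only if both arms map to the same chromosome and
    the right arm lies between min_gap and max_gap bases downstream of the left.
    """
    pairs = []
    for (l_chrom, l_pos), (r_chrom, r_pos) in product(left_hits, right_hits):
        if l_chrom == r_chrom and min_gap <= r_pos - l_pos <= max_gap:
            pairs.append(((l_chrom, l_pos), (r_chrom, r_pos)))
    return pairs


if __name__ == "__main__":
    # Each arm may map to several places; the mate constraint resolves the ambiguity.
    left = [("chr1", 1000), ("chr2", 5000)]
    right = [("chr1", 1450), ("chr1", 90000), ("chr2", 100)]
    print(pair_arms(left, right, min_gap=300, max_gap=700))
```

Because the constraint is applied after alignment, arm hits that fail it are still available, which is the kind of evidence a downstream step could use to look for structural changes.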

Genome Center

Complete Genomics is building the world's largest commercial human genome sequencing center to provide turnkey, outsourced complete human genome sequencing to customers worldwide. Comprising two major components, the operations center and the data center, Complete Genomics' genome sequencing center will have the capacity to sequence 1,000 complete human genomes in 2009 and 20,000 in

Operations center

By 2010, the operations center will hold 192 sequencing instruments, organized with four instruments to a pod, eight pods to a bay, and six bays to the center. The center will also house the production bio-suite, capable of preparing more than 200 samples per day for sequencing.

Data center

By 2010, the data center will contain approximately 60,000 processors with 30 petabytes of reliable storage (Figure 13). Owing to the infrastructure requirements (particularly power) of such a large data center, the center will be split across two physical locations: a small on-site facility and a larger remote facility. The two locations will be connected with an optical fiber backbone.

The data center will employ enterprise-class security technology of the kind commonly deployed by banks, government agencies, and other secure data institutions. This will ensure that outsiders cannot access any data or analyses on Complete Genomics' computers and that strict separation between customer datasets is maintained.

Figure 13: Complete Genomics data center
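The instrument count for the operations center follows directly from the pod and bay layout described above, as this brief Python check shows. The variable names are illustrative.

```python
INSTRUMENTS_PER_POD = 4   # four instruments to a pod
PODS_PER_BAY = 8          # eight pods to a bay
BAYS_IN_CENTER = 6        # six bays to the center

instruments_per_bay = INSTRUMENTS_PER_POD * PODS_PER_BAY   # 32
total_instruments = instruments_per_bay * BAYS_IN_CENTER   # 192

if __name__ == "__main__":
    print(f"{instruments_per_bay} instruments per bay, {total_instruments} in total")
```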

Find out more about Complete Genomics

For more information about Complete Genomics, its technology or sequencing services, please contact the company at:

Complete Genomics, Inc.
Corporate Headquarters
2071 Stierlin Court
Mountain View, CA
Phone: