Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es
Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio Nanopore General considerations Reducing the complexity
Cost per raw Megabase of DNA
Technological exponential growth Nokia 1200 (2007) iphone 5S (2013)
Sanger sequencing
Sanger sequencing Traditional DNA sequencing method Ideal for small sequencing projects Read length around 600-700 bp Around 5-10$ per reaction 384 reactions in parallel at most Applied Biosystems is the main technological provider
Sanger sequencing
Sanger sequencing
ABI files Sequences are usually delivered as ABI files. Binary files. Contain real signal, chromatogram and called sequence and quality They can be opened with: chromas (win) (www.technelysium.com.au/chromas.html) Sequence scanner (win) (Applied Biosystems) FinchTv (win, mac, linux) (www.geospiza.com/products/finchtv.shtml) trev (win, mac, linux) (part of Staden). Usually they have a.ab1 extension.
Sequence and quality Phred score = - 10 log (prob error)
Sequence and quality Sequence in fasta format >SGN-E577821 TAACAATAACAGTCAGAGCCCGTGTGGGAGCAGTACCGTGGAGTCATCCAGCGGAG AAACGGTTGTTCACGCGCCTAATACGCGACACGCGCCGC >SGN-E577821 54 57 54 57 48 48 48 48 57 57 57 47 47 41 42 41 47 57 57 57 57 47 44 44 44 44 50 50 54 57 57 46 43 37 44 43 57 37 37 37 57 57 57 57 52 52 52 52 57 50 47 47 52 52 47 52 52 50 50 50 50 50 57 57 54 57 57 57 57 57 57 57 46 46 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 29 29
Sequence and quality Due to technical limitations different technologies have different errors patterns.
Sanger sequencing Quality is worst at the beginning and at the end.
Other sources of error Pre-sequencing: PCR mutation-like errors Cloning artifacts, chimeric molecules Post-sequencing: Assembly artifacts Alignment errors due to: Reference Alignment algorithms SNV calling software
2nd generation sequencing
Sanger vs NGS sequencing Sanger NGS Num. sequences per reaction 1 clone Millions of molecules Max. parallelization 384 Several millions Sequence quality High Low Sequence length 600-800 bp 35-20000 (depends on the platform) Throughtput Low High
Sanger vs NGS
Library preparation Fragmentation Sonication Nebulization Shearing Size selection End repair Sequencing adaptor ligation Purification
454 First NGS platform (and first to be phased out) Pirosequencing based chemistry Long reads (400-700bp) Most expensive cost per base Ideal for de novo sequencing projects Owned by Roche ½, ¼ and 1/8 run can be ordered >1 million reads GS FLX+ and GS Junior
454
454 quality The lengthiest the homopolymer the less quality. It is very difficult to differentiate AAAAAA from AAAAA. Quality diminishes with the sequence length.
Illumina Previously known as Solexa Reversible terminators based sequencing technique Short reads (50 or 250bp depending on the version) Lowest cost per base Ideal for resequencing projects Highest throughput Runs divided in 8 lanes Up to 4000 million reads Can sequence both ends of the molecules (paired ends) HiSeq2500, NextSeq 500 and MiSeq
Illumina 1 sample: 30X human genome
HiSeq X Ten
Illumina
Illumina Quality diminishes with sequence length. No homopolymer problem, mainly substitution errors.
SOLiD Ligation based sequencing chemistry Short reads (35-75bp depending on the version) Only for resequencing projects It used to produce nucleotide sequences, but colors Color sequences have poor quality, but nucleotide sequences have high quality 115 or 320 million reads
SOLiD
SOLiD Really?
Ion Torrent Around 60-80 M reads. 200 pb length. Sequences based on H+ production Error rates lower than other 2nd generation Error pattern similar to 454, with homopolymer problem. Very cheap per run. Belongs to Life technologies (Applied Biosystems)
Ion Torrent
3rd generation sequencing
PacBio 3rd generation platform (single molecule) Polymerase based chemistry (SMRT) Longest NGS reads (up to 20,000bp) Very high error rate Ideal for de novo sequencing projects 45000 reads
PacBio 3rd generation, single molecule detection. No amplification step required. Nucleotides labeled on the phosphate removed during the polymerization. Sequencing based on the time required by the polymerase to incorporate a nucleotide (Polymerase requires milliseconds versus microseconds for the stochastic diffusion)
PacBio length distribution Longest lengths Limited by the polymerase damages due to the laser Sequencing strategies Standard Circular consensus Taken from flxlexblog.wordpress.com Distributions for standard sequencing, for circular the mean is around 250 pb.
PacBio quality distribution Distributions for standard sequencing. For circular sequencing the quality is at least 99.9% Taken from flxlexblog.wordpress.com
Nanopore Senses differences in ion flow USB powered
Nanopore-first data Not very reliable (yet)
Nanopore alignments
Nanopore alignments
Nanopore accuracy
Nanopore-Illumina hybrid error correction
Sequencing technologies Modified from Michael Stromberg
NGS capabilities Modified from ngsbuzz.blogspot.com Ngsbuzz gives 2.94 Gb yield to Pacbio.
Throughput
Cost per nucleotide
Cost per reaction
Read length
Bioinformatic challenges Huge data files handling. Beefy computers required. Software still being developed or missing. Ad-hoc software required during the analysis. Existing software tailored to experienced bioinformaticians. Dollar for dollar rule proposed EDSAC by Computer Laboratory Cambridge
Bioinformatic challenges Amount of data managed on a typical transcriptome assembly
NGS vs Hard Drive Stein Genome Biology 2010 11:207 doi:10.1186/gb-2010-11-5-207
Reducing the complexity
Genome Pros: Finest resolution Reproducible Cons: Expensive ($600 per sample) Lots of information will be lost if no reference is available, especially in the repetitive regions.
RNASeq Pros: Cheaper than whole genome sequencing ($300) Well proven methodologies Cons: RNA handling For many samples is pricier than GBS Follows gene density Reproducible Follows gene density
Exome
Exome
Exome Pros: More complete representation than RNASeq More reproducible than RNASeq Cons: Exome capture platforms only available in model species Pricier than RNASeq
doi: 10.1534/genetics.112.147710 Genotyping by Sequencing (GBS) GBS review: Nature Reviews Genetics 12, 499510 (July 2011) doi:10.1038/nrg3012
Genotyping by Sequencing (GBS) Pros: Cheap ($50 per sample) Lots of variation Cons: Prone to artifacts (e.g. false SNPs due to repetitive DNA) if no reference genome is available. Degree of coverage along the genome depends on the Restriction Enzyme chosen How reproducible is it? GBS based SNPs in Soy doi: 10.1534/genetics.112.147710
Amplicons Pros: Cons: Cheap for few genes Not scalable for lots of genes Labor intensive Previous sequence information is required Amplicon sets can be ordered, but the design is expensive
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Jose Blanca COMAV institute bioinf.comav.upv.es