Next Generation Sequencing Lecture Saarbrücken, 19. March 2012 Sequencing Platforms
Contents
  Introduction
  Sequencing Workflow
  Platforms
    Roche 454
    ABI SOLiD
    Illumina Genome Analyzer / HiSeq
  Problems
  Quality Scores
  Tweaks
  Data Formats
  Resources
Introduction
  Rapid technological development
  3 NGS platforms mainly used:
    Roche 454
    ABI SOLiD
    Illumina Genome Analyzer / HiSeq
  Others
    Sanger (1st generation)
    Polonator
    Ion Torrent
    Budget systems: Illumina MiSeq, 454 Junior
    Helicos Heliscope
    Oxford Nanopore
    Pacific Biosciences SMRT
  (Baker, Nature Methods, 2010)
  Sequencing Platforms, Fabian Müller, 2012-03-19
Distribution of Platforms
  [Figures: distribution shown for all platforms, GAII, HiSeq, SOLiD, and 454]
  (http://pathogenomics.bham.ac.uk/hts/)
Workflow
  Library preparation
    Fragmentation
    Adapter ligation
    Size selection
    Amplification
    Dehybridization
  Immobilization
  (2nd) Amplification
    Emulsion PCR
    Solid phase
  Sequencing
    By synthesis
      o Cyclic reversible termination
      o Single-nucleotide addition
    By ligation
  Imaging
  Base calling
  Alignment/assembly
  Higher level data processing
  Quality control
  (Mardis, Annu. Rev. Genomics Hum. Genet., 2008)
Illumina Genome Analyzer / HiSeq
  Previously: Solexa
  Then: Genome Analyzer
  State of the art: HiSeq 2500
  Currently the most widely used platform
  Cluster generation step
  Sequencing by synthesis
    Reversible terminators
Illumina Genome Analyzer / HiSeq
  Hybridization to the flow cell
    Lawn of sequences complementary to the adapters
  1 flow cell has 8 lanes
  (Mardis, Annu. Rev. Genomics Hum. Genet., 2008; www.illumina.com)
Illumina Genome Analyzer / HiSeq
  Cluster generation:
    Bridge amplification
    Denaturation
    Random cluster distances
    Strand removal, so that only a single direction is sequenced
  (Mardis, Annu. Rev. Genomics Hum. Genet., 2008)
Illumina Genome Analyzer / HiSeq
  Sequencing by synthesis using reversible terminator chemistry
    Add sequencing primer
    Add labeled nucleotides
    Excite and detect light emission
    Remove blocking group
  Image analysis for cluster identification
  Sequence of images yields DNA sequence
  (Metzker, Nature Reviews Genetics, 2010)
Illumina Genome Analyzer / HiSeq
  Extensive output
    HiSeq allows 2 flow cells to be processed simultaneously and imaged from top and bottom
  Substitutions are the most common error
  Spike-ins facilitate quality control and base-calling calibration
    e.g. ΦX174 phage genome
  Limitations
    Read length limited by dephasing
    Quality decreases towards read ends as signal intensities decline
    Substitution biases, as only 2 lasers excite the 4 dNTPs (A/C, G/T)
    Alternative base callers (e.g. Ibis)
    Low-complexity reads
      o Result from sequencing junk (e.g. dust, lint, ...)
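The dephasing limit on read length can be made concrete with a toy calculation. This is a minimal sketch, not Illumina's actual signal model: it assumes every strand in a cluster independently falls behind or runs ahead with a fixed per-cycle probability and never re-synchronizes. The values of p_lag and p_lead are illustrative assumptions, not measured rates.

```python
def in_phase_fraction(cycles, p_lag=0.005, p_lead=0.005):
    """Fraction of strands in a cluster still in sync after each cycle.

    Toy model: p_lag and p_lead are assumed per-cycle probabilities of
    incorporating no base or more than one base, respectively.
    """
    p_in_sync = 1.0 - p_lag - p_lead
    return [p_in_sync ** cycle for cycle in range(1, cycles + 1)]


fractions = in_phase_fraction(100)
for cycle in (1, 25, 50, 100):
    print(f"cycle {cycle:3d}: {fractions[cycle - 1]:.2f} of strands in phase")
```

With a total dephasing rate of 1% per cycle, only about 37% of the strands are still in phase after 100 cycles in this model, which is one way to see why signal purity and base quality drop towards the read end.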
Roche 454 GS / FLX / Titanium
  Current machine version: GS FLX+
  Emulsion PCR amplification
  Sequencing by synthesis
    Pyrosequencing
Roche 454 GS20 / FLX / Titanium
  Emulsion PCR
    Use a water-in-oil emulsion to isolate single DNA molecules
    Amplification in microreactors produces millions of copies on each bead
    Also applies to ABI SOLiD
  Molecule-to-bead ratio is chosen to ensure 1 molecule per bead
  Occupied beads can be separated from empty ones via the second adapter sequence
  (Metzker, Nature Reviews Genetics, 2010)
Roche 454 GS20 / FLX / Titanium
  A picotiter plate contains 1 bead per well
    ~2M wells
  Reagents are added
  Nucleotides (unlabeled) are successively washed across the plate
  ATP-driven luciferase light reactions make it possible to monitor which and how many bases are incorporated
  Imaging via high-resolution CCD camera
  (Metzker, Nature Reviews Genetics, 2010)
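How the light signal per nucleotide flow translates into bases can be sketched as follows. This is a simplified illustration, not the vendor's base caller: it assumes the signal is already normalized so that 1.0 corresponds to one incorporated base, and it assumes a cyclic flow order of T, A, C, G. The function name call_flowgram and the example intensities are made up for illustration.

```python
def call_flowgram(intensities, flow_order="TACG"):
    """Turn normalized per-flow light intensities into a base sequence.

    Each flow offers one nucleotide; the rounded intensity is taken as the
    number of bases incorporated in that flow (the homopolymer length).
    """
    bases = []
    for index, signal in enumerate(intensities):
        nucleotide = flow_order[index % len(flow_order)]
        count = max(0, int(signal + 0.5))  # round to the nearest integer
        bases.append(nucleotide * count)
    return "".join(bases)


print(call_flowgram([1.02, 0.05, 0.97, 2.10, 0.10, 3.40]))
print(call_flowgram([1.02, 0.05, 0.97, 2.10, 0.10, 3.60]))
```

The two calls differ only in the last intensity (3.4 versus 3.6), yet one yields three and the other four A's. Small intensity differences in long homopolymers therefore change the called length, which is why insertions and deletions dominate the 454 error profile.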
Roche 454 GS20 / FLX / Titanium
  Problems
    Mixed beads
      o Software postprocessing
    Long homopolymers can lead to inconsistent calls
      o Primary errors are insertions and deletions
    Bleed-over signals and ghost wells
      o Strong light emissions may influence neighboring well readout
      o Software correction
  Limitations
    Emulsion PCR technically challenging
    Polymerase and luciferase efficiency drops during run
  Long reads
  Deep sequencing
ABI SOLiD
  Life Technologies / Applied Biosystems
  Current machine versions: SOLiD4, 5500XL
  Emulsion PCR similar to 454
  Sequencing by ligation
  (http://www.appliedbiosystems.com)
ABI SOLiD
  Use labeled oligonucleotides
    Degenerate positions 3-5
    Specific dinucleotides at 1-2
    1 of 4 fluorescent dyes
  (Mardis, Annu. Rev. Genomics Hum. Genet., 2008)
ABI SOLiD
  Sequencing: do for 5 primer offsets, and for each offset do n cycles of
    Ligation of oligos from mixture
      o First 2 bases will match the template
    Imaging
    Capping of unextended probes
      o Phosphatase treatment to prevent any remaining unextended strands from contributing to out-of-phase ligation events
    Cleaving off the fluor
  (Mardis, Annu. Rev. Genomics Hum. Genet., 2008)
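The nested loop (5 primer offsets, n ligation cycles each) can be sketched to show which template positions each ligation interrogates. This is a simplified model under stated assumptions: every ligation advances the strand by 5 bases (dinucleotide at probe positions 1-2, degenerate positions 3-5), successive primers are shifted by one base, position 0 is the first template base, and negative positions lie in the adapter. The names PROBE_STEP, PRIMER_ROUNDS, interrogated_pairs, and coverage are illustrative.

```python
from collections import Counter

PROBE_STEP = 5     # assumed: each ligation extends the strand by 5 bases
PRIMER_ROUNDS = 5  # primer offsets 0, -1, ..., -4 relative to the template start


def interrogated_pairs(n_cycles):
    """Yield (primer_round, cycle, (pos, pos + 1)) for every ligation."""
    for primer_round in range(PRIMER_ROUNDS):
        for cycle in range(n_cycles):
            start = cycle * PROBE_STEP - primer_round
            yield primer_round, cycle, (start, start + 1)


def coverage(n_cycles):
    """Count how often each position is part of an interrogated dinucleotide."""
    counts = Counter()
    for _, _, pair in interrogated_pairs(n_cycles):
        counts.update(pair)
    return counts


counts = coverage(n_cycles=10)
covered_twice = all(counts[pos] == 2 for pos in range(0, 46))
print("template positions 0-45 interrogated twice:", covered_twice)
```

In this model, 10 cycles per primer produce 50 color calls, and every template position from 0 to 45 is part of exactly two interrogated dinucleotides.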
ABI SOLiD
  Imaging cycling produces a chain of colors (color space)
  Each base is captured twice
  If the first base is known (we know the adapter), the remaining bases follow from the sequence of colors
  Alignment in color space
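The translation between bases and colors can be sketched in a few lines. The slides do not spell out the dye-to-dinucleotide table; the sketch below uses the standard SOLiD two-base encoding, in which the color of a dinucleotide is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). Function names are illustrative, and turning everything after a missing call into 'N' is a simplification.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode(bases):
    """Base sequence -> first base followed by one color per adjacent pair."""
    colors = [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(bases, bases[1:])]
    return bases[0] + "".join(str(color) for color in colors)


def decode(color_read):
    """First base plus colors -> bases; after a missing call ('.') emit 'N'."""
    bases = [color_read[0]]
    for color in color_read[1:]:
        previous = bases[-1]
        if color == "." or previous == "N":
            bases.append("N")
        else:
            bases.append(CODE_TO_BASE[BASE_TO_CODE[previous] ^ int(color)])
    return "".join(bases)


print(encode("TTGAT"))  # T0123
print(decode("T0123"))  # TTGAT
print(decode("T01.3"))  # TTGNN
```

Because each base is derived from its predecessor, a single wrong color corrupts every downstream base when decoding naively. This is one reason reads are aligned in color space rather than translated first.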
ABI SOLiD
  Double interrogation of each base facilitates discrimination of errors from true polymorphisms (SNPs)
    If a reference sequence is present
    Works better in theory than in practice
  High accuracy
  Problems
    Probes do not necessarily ligate next to the primer, leading to signal decline
  Limitations
    Emulsion PCR technically challenging
    Long run times
    Short read lengths
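Why double interrogation helps to separate errors from SNPs can be illustrated in color space. A true single-base change alters the two adjacent colors that cover it, while a measurement error typically alters a single color. The sketch assumes the standard SOLiD XOR color code (A=0, C=1, G=2, T=3) and an ungapped read-to-reference comparison; classify_mismatches and its labels are illustrative, not part of any SOLiD tool.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """One color per adjacent base pair (XOR of the 2-bit base codes)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]


def classify_mismatches(ref_colors, read_colors):
    """Label color mismatches as a candidate SNP or a likely color error."""
    calls = []
    i = 0
    while i < len(ref_colors):
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        nxt = i + 1
        # A SNP changes two adjacent colors but keeps their XOR unchanged.
        if (nxt < len(ref_colors)
                and ref_colors[nxt] != read_colors[nxt]
                and ref_colors[i] ^ ref_colors[nxt]
                == read_colors[i] ^ read_colors[nxt]):
            calls.append((i, "candidate SNP"))
            i += 2
        else:
            calls.append((i, "likely color error"))
            i += 1
    return calls


reference = to_colors("TACGGT")
with_snp = to_colors("TACTGT")  # G -> T at the fourth base
with_error = list(reference)
with_error[3] = 2               # one miscalled color
print(classify_mismatches(reference, with_snp))
print(classify_mismatches(reference, with_error))
```

The first call reports a candidate SNP at color index 2 and the second a likely color error at index 3. Real data is less clean, in line with the remark that this works better in theory than in practice.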
Problems: All Platforms
  Dephasing
    Sequencing cycles out of sync
    Source
      o Multiple bases inserted
      o No base inserted
      o Terminator stuck or ineffective
  Adapter problems
    Adapter chimeras
    Sequencing into the adapter
  PCR artifacts
    E.g. coverage variation
  Library contamination
  Local effects
    E.g. bubbles, machine calibration, incomplete mixing of reagents
  Broken chemistry
    E.g. degraded fluorophores / polymerase
  QC is essential!
    Tools: FastQC, SuperDeDuper, samtools, GATK, ...
      o See exercise
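"Sequencing into the adapter" happens when the insert is shorter than the read, so the read ends with adapter sequence. A minimal sketch of how this can be trimmed, assuming exact matches only (real trimmers tolerate mismatches). The function name trim_adapter, the min_overlap parameter, and the adapter string in the example are illustrative choices, not taken from the lecture.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove adapter sequence from the 3' end of a read (exact matching)."""
    # Case 1: the complete adapter occurs somewhere in the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with a prefix of the adapter; try longest first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter prefix, assumed for illustration
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER))
print(trim_adapter("ACGTACGTAGATC", ADAPTER))
print(trim_adapter("ACGTACGT", ADAPTER))
```

All three calls return ACGTACGT: the first read contains the full adapter, the second ends with a 5-base adapter prefix, and the third contains no adapter at all. A small min_overlap removes more adapter remnants but also trims more real bases by chance.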
Platform Comparison
                     Roche 454      ABI SOLiD              Illumina HiSeq
  Read length        700-1000 bp    75 bp (75+35 bp PE)    2 * 105 bp
  Runtime            23 h           7 d / genome           12 d
  Initial release    10/2005        2007                   Early 2007
  #reads             1*10^6         6*10^8                 3*10^9 (SE), 6*10^9 (PE)
  Error rates        ~1%            ~0.1%                  ~1%
  Sequencing cost    ~20 $ / Mb     ~0.5 $ / Mb            ~30 $ / Gb
  Machine cost: ~690,000 $
Quality Scores
  Phred score (Q): $Q = -10 \log_{10} P$
    Here P denotes the estimated base-calling error probability
  Base quality scores tend to decline towards the end of the read
    Reads are often trimmed before or in the alignment step
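The Phred definition and a very simple form of end trimming can be sketched as follows. The conversion functions follow the formula above directly. The trimming function is a naive illustration that assumes a fixed quality cutoff (min_q=20 is an arbitrary choice); actual trimming tools usually use more robust criteria.

```python
import math


def phred_from_error(p):
    """Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Inverse of the Phred formula: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def trim_tail(qualities, min_q=20):
    """Return how many leading bases to keep after cutting the low-quality 3' end."""
    keep = len(qualities)
    while keep > 0 and qualities[keep - 1] < min_q:
        keep -= 1
    return keep


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {error_from_phred(q):.4f}")
print(trim_tail([30, 30, 25, 10, 8]))
```

Q20 corresponds to a 1 in 100 chance of a wrong base call and Q30 to 1 in 1000. In the example, the last two bases fall below the cutoff, so 3 bases are kept.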
Tweaks
  Paired-end sequencing
    Virtually increases read length
    Better mapping
    Long inserts allow for efficient assembly
    Helpful in resolving structural variations and repetitive regions
  Mate-pair libraries
    Similar to paired end, but involves circularization
    Used for larger DNA molecules
    Provides distance information
  (www.illumina.com; Kircher 2011)
Tweaks
  Directional libraries
    Sequence only 1 strand
  Barcoding
    Aka multiplexing
    Adding sample-specific tags allows for sequencing multiple samples in a single lane
    The samples can be separated based on their tags
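Separating samples by their tags (demultiplexing) can be sketched as below. This is a minimal illustration under simplifying assumptions: the tag is the first few bases of the read itself, all tags have the same length, and only exact matches are assigned. The barcode sequences and sample names are invented for the example.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample.

    barcodes maps tag sequence -> sample name; tags are assumed to sit at
    the start of each read and to share one length.
    """
    tag_length = len(next(iter(barcodes)))
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = barcodes.get(tag, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}  # hypothetical tags
reads = ["ACGTTTTT", "TGCAGGGG", "NNNNCCCC"]
print(demultiplex(reads, barcodes))
```

Reads whose tag matches no known barcode, for example because of a sequencing error in the tag, end up in the "unassigned" group here. Tags that differ in several positions make it possible to tolerate such errors.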
File Formats
  Image data
    Usually discarded after base calling
  FASTA/FASTQ
    Identifier (typically specifies flow cell location)
    Sequence
    Quality scores (FASTQ only)
  SAM/BAM
    File format for aligned reads
    However, due to good compression and annotation, also often used for storing unaligned reads
    More in the alignment lecture
File Formats
  FASTA/FASTQ
    Identifier (typically specifies flow cell location and read number)
    Sequence
    Quality scores (ASCII encoded, FASTQ only)
    Color space equivalents exist for SOLiD
    Different ASCII encodings for quality scores exist
  *.fastq
    @HWUSI-EAS100R:6:73:941:1973#0/1
    GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
    +HWUSI-EAS100R:6:73:941:1973#0/1
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
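Reading such a record and decoding its quality string can be sketched as follows. The sketch assumes strict 4-line records without line wrapping. The two common ASCII offsets are 33 and 64; guess_offset is a simple heuristic (characters below ';' cannot occur in the offset-64 variants), not a reliable detector. All function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the identifier
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def guess_offset(quality):
    """Heuristic: any character below ';' implies the offset-33 encoding."""
    return 33 if min(quality) < ";" else 64


def decode_quality(quality, offset=33):
    """ASCII-encoded quality string -> list of integer scores."""
    return [ord(char) - offset for char in quality]


EXAMPLE = """@HWUSI-EAS100R:6:73:941:1973#0/1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+HWUSI-EAS100R:6:73:941:1973#0/1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
"""

for name, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
    offset = guess_offset(quality)
    print(name, "offset", offset, "last scores", decode_quality(quality, offset)[-6:])
```

For the record shown on the slide, the character '9' only fits the offset-33 encoding, under which 'I' decodes to 40 and '9' to 24. Decoding with the wrong offset shifts all scores by 31, so the encoding should be checked before trimming or filtering.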
File Formats
  *.csfasta
    >186_2041_1641_F3
    T122233110.3012011122133012030.1110.31220022220.120
    >186_2041_1706_F3
    T11132121312201321220103230123.2113.31201112230.031
    >186_2041_1709_F3
    T2103022220322301123212223030330323320201102233.123
  *.qual
    >97_2040_1850_F3
    38 36 26 33 41 26 24 33 28 31 27 23 5 35 32 31 11 10 24 38 22 24 7 12 15 21 12 18 34 31 27
    >97_2040_1898_F3
    41 41 41 38 32 29 39 24 23 36 32 38 25 30 28 21 27 33 34 33 24 27 9 35 34 14 30 18 33 8 13 32
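A color-space record like the ones above can be read with a few lines of code. This is a minimal sketch that assumes one read per header line, a leading known base followed by color digits, and '.' for a missing color call. The name parse_csfasta is illustrative.

```python
def parse_csfasta(text):
    """Yield (name, first_base, colors) per record; a '.' call becomes None."""
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            name = line[1:]
        else:
            colors = [None if char == "." else int(char) for char in line[1:]]
            yield name, line[0], colors


EXAMPLE = """>186_2041_1641_F3
T122233110.3012011122133012030.1110.31220022220.120
"""

for name, first_base, colors in parse_csfasta(EXAMPLE):
    missing = colors.count(None)
    print(name, first_base, len(colors), "colors,", missing, "missing calls")
```

The *.qual file uses the same headers with one space-separated quality value per color, so both files can be joined on the read name.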
Resources
  Seqanswers
    Forum and wiki for all sorts of questions concerning NGS
    http://seqanswers.com
  NCBI Short Read Archive (SRA)
    Data archive for NGS data
    Discontinued? Maybe not
    http://www.ncbi.nlm.nih.gov/sra
  European Nucleotide Archive (ENA)
    http://www.ebi.ac.uk/ena/
  DNAnexus
    Cloud-based data management and analysis capabilities for sequencing providers and researchers
    https://dnanexus.com/
Sequencing Platforms
  Questions?