Overview of sequencing Technologies


1 Overview of sequencing Technologies Basic Bioinformatics Training 2017 ILRI, Addis Ababa, Ethiopia Dec 11-15, 2017 Trushar Shah / Joyce Njuguna

2 Outline of the talk
- Introduction
- History of DNA sequencing
- Current Technologies
- Applications
- Bioinformatics

3 History of DNA Sequencing

4 Sequencing Platforms 1986 Dye terminator Sanger sequencing (dideoxy chain termination) technology. Peaking at about 900kb/day

5 Sanger sequencing chromatogram

6 Sequencing Platforms 1986: Fluorescently labelled ddNTPs 1987: Applied Biosystems ABI : Capillary gel electrophoresis 1999: Applied Biosystems ABI 3700 DNA Analyzer ABI 3730

7 Why NGS?
- Sanger sequencing has limitations:
  - Cost
  - Speed of data generation
  - Involves cloning

8 NGS Technologies
- Solexa (Illumina Inc.)
- 454 sequencing (Roche Inc.)
- AB SOLiD (Invitrogen Inc.)
- Helicos
- PacBio

9 Recent advances in sequencing technologies

10 Sequencing Costs Inclusive of labor, administration, reagent, instruments, bioinformatics.

11 Cost of Sequencing Figure: share of total project cost (0 to 100%) taken by sample collection and experimental design, sequencing, data reduction, data management, and downstream analyses, shown for three periods: Pre-NGS (approximately 2000), Now (approximately 2010), Future (approximately 2020). Sboner et al. Genome Biology :125 doi: /gb

12 Speed in sequencing ~50,000 times faster than in 2000

13 Next generation sequencing (NGS)
- 454/FLX Titanium: generates more than 1 million high-quality reads per run and read lengths of 400 bases per 10-hour instrument run
- AB SOLiD system: the SOLiD 4hq system can generate up to 300 Gb sequence data (75 bp, matepair: 2 x 75 bp; paired-end: 75 x 35 bp)
- Illumina/Solexa: HiSeq 2000 is capable of generating 200 Gb of data per run and 25 Gb of data per day (2 x 100 bp)

14 Next-generation DNA sequencing instruments
All commercially available sequencers have the following shared attributes:
- Random fragmentation of starting DNA, ligation with custom linkers = a library
- Library amplification on a solid surface (either bead or glass)
- Direct step-by-step detection of each nucleotide base incorporated during the sequencing reaction
- Hundreds of thousands to hundreds of millions of reactions imaged per instrument run = massively parallel sequencing
- Shorter read lengths than capillary sequencers
- A digital read type that enables direct quantitative comparisons

15 Sequencing principle From Hudson M.E., Mol. Ecol. Res. 8(1):3-17 (2008).

16 Sequencing Platforms: Next Generation 2005: Next Generation Sequencing (NGS), or massively parallel sequencing, brought advances in both throughput and speed. The first platform was 454 Life Sciences pyrosequencing (1.5Gb/day)

17 Sequencing Chemistry
- Pyrosequencing, e.g. Roche / 454
- Sequencing by synthesis, e.g. Illumina
- Sequencing by ligation, e.g. Life Technologies SOLiD
- Ion semiconductor sequencing, e.g. Ion Torrent

18 Sequencing Platforms: Next Generation 2006: the second NGS platform was Solexa (later acquired by Illumina). It is now the dominant platform: an estimated >90% of all bases sequenced come from an Illumina machine. Sequencing by Synthesis (200Gb/day)

19 Sequencing Platforms: Next Generation Illumina NovaSeq

20 Roche/454 Pyrosequencer
- Random Fragmentation
- Adapter Ligation
- Single Stranded Adapter Ligated Library
- emPCR for clonal amplification

21 Roche/454 Pyrosequencer
- Load beads into PicoTiter Plate
- Load Enzyme Beads
- Centrifugation
- Sequencing by Synthesis

22 Pyrosequencing From Ronaghi, M., Gen. Res. (2001) 11:3-11

23 Pyrosequencing From Ronaghi, M., Gen. Res. (2001) 11:3-11

24 Life Technologies SOLiD: sequencing by ligation Dressman 2003

25 Life Technologies SOLiD: sequencing by ligation
- custom adapter library
- emPCR on magnetic beads
- sequencing by ligation using fluorescent probes from a common primer
- sequential rounds of ligation from a series of primers
- fixed/known nucleotides for each probe set
- identify two bases each cycle, or two-base encoding


27 This allows for error correction

28 Sequencing Platforms: Benchtop Sequencers
- Roche 454 Junior
- Illumina: MiniSeq, MiSeq, NextSeq, HiSeq
- Life Technologies Ion Torrent: Ion PGM, Ion Proton, Ion S5

29 Sequencing Platforms: The 3rd Generation sequencer 2009: single-molecule real-time (SMRT) sequencing by Pacific Biosciences. Read lengths near 100Kb are possible. Instruments: PacBio RS II, PacBio Sequel

30 PacBio Advances (RS II vs Sequel) California Condor data (~1.2Gbp genome), based on 4 SMRT cells, Jan 2017
Metric | RS II | Sequel
Read count | 448,767 | 1,947,684
N50 | 10,426 | 4,293
Other rows in the table: longest read, # reads > 12Kb, coverage > 12Kb.
Figure: read-length histograms (frequency vs. read length, 0e+00 to 1e+05) for the RS II data and the Sequel data.

31 Whole transcripts: PacBio Iso-Seq Produces full-length transcripts without assembly. The isoform sequencing (Iso-Seq) application generates full-length cDNA sequences from the 5' end of transcripts to the poly-A tail. The Iso-Seq method generates accurate information about alternatively spliced exons and transcriptional start sites.

32 Sequencing Platforms: The 3rd Generation sequencer 2015: Oxford Nanopore, founded in 2005 and currently in beta testing. The sequencer uses nanopore technology developed in the 1990s to sequence single molecules. Throughput is 500Mb per flowcell, with reads of nearly 200Kb possible. Instruments: ONT MinION, ONT PromethION, ONT SmidgION. Slide source: Bert Overduin, Intro to NGS Sequencing Technologies, 17 Feb 2017, Nairobi

33 Increase in Genome Sequencing Projects JGI Genomes online database genome sequencing projects

34 Performance of the various sequencers


36 Illumina Benchtop

37 Illumina Production

38 Technology Overview: Solexa/Illumina Sequencing

39 Immobilize DNA to Surface

40 Technology Overview: Solexa Sequencing

41 Sequence Colonies

42 Sequence Colonies

43 Call Sequence

44 Adapted from Metzker 2010

45 Adapted from Metzker 2010

46 Video Illumina dye sequencing

47 Comparative account
Platform | 454/FLX | Solexa (Illumina) | AB SOLiD
Read length | ~ bp | 36, 75, 106, or 151 bp | 75 bp
Single read | Yes | Yes | Yes
Paired-end reads | Yes | Yes | Yes
Long-insert (several Kbp) mate-paired reads | Yes | Yes | No
Number of reads per instrument run | 500K | >100M | 400M
Max data output | 0.5 Gbp | 20.5 Gbp | 20 Gbp
Run time to 1 Gb | 6 Days | <1 Day | <1 Day
Ease of use (workflow) | Difficult | Least difficult | Difficult
Base calling | Flow space | Nucleotide space | Color space
DNA Applications
Whole genome sequencing and resequencing | Yes | Yes | Yes
De novo sequencing | Yes | Yes | Yes
Targeted resequencing | Yes | Yes | Yes
Discovery of genetic variants (SNPs, InDels, copy number, chromosomal rearrangements) | Yes | Yes | Yes
Chromatin immunoprecipitation (ChIP) | Yes | Yes | Yes
Methylation analysis | Yes | Yes | Yes
Metagenomics | Yes | No | No
RNA Applications
Whole transcriptome | Yes | Yes | Yes
Small RNA | Yes | Yes | Yes
Expression tags | Yes | Yes | Yes


49 Adapted from Jeffrey 2017

50 Video Oxford Nanopore sequencing


53 Video Single molecule real time sequencing

54 Anatomy of an NGS Library
A library fragment consists of: adapter, insert, adapter.
Each adapter carries:
- flowcell/bead binding sequences
- amplification primers
- sequencing primers
- indexing primers

55 Anatomy of an NGS Library
- Single-end: Read 1
- Paired-end: Read 1, Read 2

56 Anatomy of an NGS Library
A single-end read can contain:
- a partial sequence from the library fragment
- the complete sequence from the library fragment, followed by partial (or complete) adapter sequence
- no library fragment, only partial (or complete) adapter sequence

57 Anatomy of an NGS Library
A paired-end read (Read 1, Read 2) can contain:
- non-overlapping reads: each read holds a partial sequence from the library fragment
- overlapping reads: the complete sequence from the library fragment
- no library fragment: non-overlapping reads holding partial (or complete) adapter sequence
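
When a read runs past the end of the insert, the tail of the read is adapter rather than sample DNA. A minimal sketch of how such adapter sequence could be clipped is given below. The function name `trim_adapter`, the `min_overlap` parameter and the example reads are assumptions made for illustration; production trimmers also tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove adapter sequence from the 3' end of a read (exact matching only)."""
    # Case 1: the complete adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the first few adapter bases fit before the read ends.
    longest = min(len(adapter), len(read)) - 1
    for olen in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:olen]):
            return read[:-olen]
    # Case 3: no adapter detected, the read is returned unchanged.
    return read


# Illustrative adapter prefix and made-up reads (assumptions, not from the slides).
ADAPTER = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER)   # complete adapter present
partial = trim_adapter("ACGTACGTAGATC", ADAPTER)           # partial adapter at the end
dimer = trim_adapter("AGATCGGAAGAGC", ADAPTER)             # no library fragment at all
```

A read that is empty after trimming corresponds to the "no library fragment" case on the slide.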

58 NGS: Applications
- Genome resequencing
  - Population genetics
  - Variant discovery
  - Unique genetic elements
- Whole genome sequencing

59 NGS: Applications
- RNA-seq
  - Quantitative expression analysis
  - Splice site detection
  - miRNA identification
  - Variant detection

60 NGS: Applications
- ChIP-seq

61 Applications of NGS in crops (Trends Biotech 27)
Figure (flow diagram):
- Starting material: reference genotypes; parental lines of mapping populations; wild relatives; different tissues of the same genotype / same tissues of different genotypes (e.g. NILs); germplasm lines / natural population
- Templates: cDNA, RRG, BACs, gDNA/cDNA, cDNAs, pools of PCR products / metagenomics
- Next Generation Sequencing (NGS)
- Outputs: ESTs, gene space, genome assembly, sequence variations, SNPs/indels/haplotypes, gene expression data, candidate genes, genomic resources, SSRs, phenotype data, genetic maps and QTLs, alien introgression, expression QTLs (eQTLs), candidate markers/genes, population biology
- End uses: GENOME STRUCTURE AND DYNAMICS; CROP BREEDING

62 Oh, my god! What should I do now? Challenges in analysis of NGS data NGS machines Tsunami of sequence data

63 Acknowledgments Matthew L. Settles, Genome Center Bioinformatics Core, University of California Davis; Bert Overduin, Training and Outreach Bioinformatician, University of Edinburgh

64 Thank you Basic Bioinformatics Training 2017 ILRI, Addis Ababa, Ethiopia Dec 11-15, 2017

65 Bioinformatics tools
Reference genome scenario:
- Mapping of sequence reads to reference genome
- MAQ, NOVOalign, SOAP, ZOOM, GMAP
No-reference genome scenario:
- De novo assembly
- Velvet, SHARCGS, MIRA2, etc.
Visualization:
- User friendly interface
- EagleView, Tablet, Maqview, GBrowse

66 Desired features for analytical tool
- Consideration of quality
- Flexibility in handling: read length, number of mismatches, gapped alignment
- High speed
- Accuracy in mapping/assembly
- Amenable to visualization tool

67 NGS data
- Genomic data
- Transcriptomic data

68 Assembly De novo sequencing involves assembling overlapping reads to form contiguous sequences of DNA. It is done in cases where there is no genomic information available.

69 Tools for Assembly ABySS, ALLPATHS, Edena, Euler-SR, Newbler, MIRA, SSAKE, Velvet

70 Assembly 1. Overlap-layout consensus (OLC) 2. Euler / De Bruijn approach

71 Traditional approach Overlap-layout-consensus method for assembly. Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap. Traverse the graph to find unambiguous paths which form contigs.
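
The overlap step of this method can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation used by any particular assembler: the function names, the `min_len` threshold and the three reads are made up, and only exact suffix/prefix overlaps are considered.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Return the length of the longest suffix of a that is a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # A hit counts only if the rest of a, from here on, matches b's prefix.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_len=3):
    """Build {(read_a, read_b): overlap_length} for every overlapping ordered pair."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_len)
        if olen > 0:
            graph[(a, b)] = olen
    return graph


reads = ["ACGTAC", "TACGGA", "GGATTC"]  # made-up reads for illustration
graph = overlap_graph(reads)
```

Here each read is a node and each dictionary key is a directed edge. The two edges found chain the reads into a single unambiguous path, which would be laid out as one contig.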

72 Assembly: Overview The assembly problem: three repeats can cause a misassembly of the inner segments

73 Current approaches Euler / De Bruijn approach. Introduced as an alternative to the overlap-layout-consensus approach used in capillary sequencing. More suited for short read assembly. Based on the De Bruijn graph. Implemented in Velvet, the most widely used short read assembly method at present.

74 De Bruijn graph method Break each read sequence into overlapping fragments of size k (k-mers). Form the De Bruijn graph such that each (k-1)-mer represents a node in the graph. An edge exists from node a to node b iff there exists a k-mer whose prefix is a and whose suffix is b. Traverse the graph along unambiguous paths to form contigs.
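
The construction just described can be sketched as follows. This is a simplified illustration with made-up reads and hypothetical function names: it ignores k-mer multiplicity, reverse complements and sequencing errors, and the walk only checks that a node has a single outgoing edge.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the sorted list of nodes it has an edge to."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(set)
    for kmer in kmers:
        # The k-mer's prefix and suffix (both of length k-1) are the two nodes.
        graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


def walk(graph, start):
    """Follow edges from start while exactly one unvisited successor exists."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if nxt in visited:  # stop instead of looping around a cycle
            break
        contig += nxt[-1]   # each step extends the contig by one base
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ACGTC", "CGTCA"]  # made-up reads for illustration
graph = de_bruijn(reads, k=3)
contig = walk(graph, "AC")
```

With k = 3 the two reads share the k-mers CGT and GTC, so the graph is a single chain and the walk spells out the merged sequence.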

75 K = 3 De Bruijn graph

76 Sequencing machine Short read alignment

77 Need to map them back to reference

78 Alignment software In the last two years, many tools for short-read alignment have been published: Eland, Maq, Bowtie, Biostrings, BWA, SSAHA2, Soap, RMAP, SHRiMP, ZOOM, NovoAlign, Mosaik, Slider, ... Which one is right for your task?

79 Desired features
- Consideration of quality
- Flexibility in handling
  - Read length
  - Number of mismatches
  - Gapped alignment
- High speed
- Accuracy in mapping/assembly

80 Short read alignment: Algorithms
Short-read aligners base their algorithm on one of these ideas:
- use spaced-seed indexing
  - hash seed words from the reference
  - hash seed words from the reads
- use the Burrows-Wheeler transform (BWT)
BWT seems to be the winning idea (very fast, sufficiently accurate), and is used by the newest tools (Bowtie, SOAPv2, BWA).
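
To make the "hash seed words from the reference" idea concrete, here is a minimal seed-and-verify sketch. It uses contiguous rather than spaced seeds, handles mismatches only (no gaps), and all names, parameters and sequences are assumptions for illustration rather than the behaviour of any listed tool.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Hash every seed_len-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def align(read, reference, index, seed_len, max_mismatches=1):
    """Return sorted (start, mismatches) hits found by seed lookup plus verification."""
    hits = set()
    # Cut the read into non-overlapping seeds and look each one up.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Verify the whole read against the candidate location.
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


reference = "ACGTACGTTAGCCGATAG"  # made-up reference
index = build_seed_index(reference, seed_len=4)
exact = align("TTAGCCGA", reference, index, seed_len=4)
one_off = align("TTAGCCTA", reference, index, seed_len=4)
```

The second read has one mismatch in its second seed, yet it is still found because the first seed matches exactly; this is why a read is split into several seeds.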

81 Next Generation Data & Format

82 Fasta format
- The standard format for nucleotide and protein sequences is FASTA, named after the program. It is very easy to read and write manually or with a program:
Name of sequence:
>sequence id more info yet more info
Sequence itself, in one or many lines:
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG

83 Multiple fasta format
>sequence id more info yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
>sequence 2 id more info yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
>sequence 3 id more info yet more info
ACCCGTGAGTAGTGTGTGCTCATCTCGTAGCATCGATCGAAATCGTCATCG
CAGATGTCGTGCTAGCTGTGCATGCATCGATCGATCGATGTAGCTAGTAGT
CAGTAGTAGTCGTAGATGTCGTCGATGCTAGTAGTGCTGCTGTGTGCTGTC
CCACTAGCTGCATCGATG
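
Because the format is this simple, a multi-FASTA file can be read with a short loop. The sketch below is one possible way to do it; `parse_fasta` and the example records are hypothetical, and the first word after `>` is assumed to be the sequence ID.

```python
def parse_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word is the ID, the rest is free text.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence line: may be one of many lines for the same record.
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 more info yet more info
ACGT
TTGA
>seq2
GGCC
"""
sequences = parse_fasta(example.splitlines())
```

The same function works on an open file handle, since it only iterates over lines.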

84 Fasta + quality
- This is the standard output of Sanger platforms: two files
>sequence id more info yet more info
ACCCGTGA
>sequence id more info yet more info
- It's great and easy to read, but takes up a lot of disk space

85 Fastq format: FASTA with Qualities
GGGGGGAAGTCGGCAAAATAGATCCGTAACTTCGGG
+HWI-EAS225:3:1:2:854#0/1
GGGAAGATCTCAAAAACAGAAGTAAAACATCGAACG
+HWI-EAS225:3:1:2:1595#0/1
a`abbbababbbabbbbbbabb`aaababab\aa_`

86 Fastq format
Each read is represented by four lines:
- '@' followed by read ID
- sequence
- '+', optionally followed by repeated read ID
- quality string: same length as sequence; each character encodes the base-call quality of one base
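
The four-line layout translates directly into a small parser. The sketch below is an assumption-laden illustration: `parse_fastq` is a hypothetical name, the record is made up, and sequences or qualities wrapped over several lines are not handled.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # The next three lines complete the record; "" marks a truncated file.
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


example = ["@read1", "ACGT", "+", "IIII"]  # made-up record
records = list(parse_fastq(example))
```

Checking that the quality string has the same length as the sequence catches most truncated or corrupted files early.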

87 Fastq
- Much less fun to read, less space per base and only one file
_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR _SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Encoding varies. Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII 64 to 126.
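
Decoding a quality string is a matter of subtracting the ASCII offset from each character. The sketch below assumes offset 64 for Illumina 1.3+ as stated on the slide and, as a further assumption, offset 33 for Sanger-style files; the function names are hypothetical. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def phred_scores(quality, offset=64):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


illumina = phred_scores("hB@", offset=64)   # Illumina 1.3+ style characters
sanger = phred_scores("II9", offset=33)     # Sanger-style characters (assumed offset)
p_wrong = error_probability(20)
```

Using the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before filtering on quality.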

88 Data space
- Base space
- Colour space
example:
>1_1_1_F3_I1
T
>2_2_2_F3_I1
T
>3_3_3_F3_I1
T
>4_4_4_F3_I1
T
>1_1_1_F3_I1
TAAAATGCCCCCCCCCCCC
>2_2_2_F3_I1
TTTTTAAAAAAGGCCCCCC
>3_3_3_F3_I1
TAAAAATTTTTTTTTGGGGG
>4_4_4_F3_I1
TAAAAATTTTGGGCCCCCCC
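
The relation between colour space and base space can be sketched with the two-base encoding used by SOLiD, where each colour (0 to 3) describes the transition between two adjacent bases. The code below is an illustration only: the function names and the example read are assumptions, and the leading letter is taken to be the known last base of the primer.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a colour is the XOR of two base codes


def encode(sequence, primer_base="T"):
    """Convert a base-space sequence to a colour-space read such as 'T3131'."""
    colours = []
    previous = primer_base
    for base in sequence:
        colours.append(str(BASES.index(previous) ^ BASES.index(base)))
        previous = base
    return primer_base + "".join(colours)


def decode(colour_read):
    """Convert a colour-space read back to base space."""
    previous = colour_read[0]
    bases = []
    for colour in colour_read[1:]:
        # Each colour, combined with the previous base, fixes the next base.
        previous = BASES[BASES.index(previous) ^ int(colour)]
        bases.append(previous)
    return "".join(bases)


colour_read = encode("ACGT")
decoded = decode(colour_read)
corrupted = decode("T3031")  # same read with one colour call changed
```

A single wrong colour changes every base after it when decoded naively, which is why colour-space reads are normally aligned in colour space and converted afterwards.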


90 The Love of Ambiguity
IUPAC Code | Meaning | Complement
A | A | T
C | C | G
G | G | C
T | T | A
M | A/C | K
R | A/G | Y
W | A/T | W
S | C/G | S
Y | C/T | R
K | G/T | M
V | A/C/G | B
H | A/C/T | D
D | A/G/T | H
B | C/G/T | V
X | null | X
N | A/C/G/T | N
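
The complement column of this table is all that is needed to reverse-complement a sequence containing ambiguity codes. A minimal sketch, with a hypothetical function name and a made-up input:

```python
# Complement of each IUPAC code, taken from the table above.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "K": "M", "R": "Y", "Y": "R",
    "W": "W", "S": "S",
    "V": "B", "B": "V", "H": "D", "D": "H",
    "X": "X", "N": "N",
}


def reverse_complement(sequence):
    """Reverse-complement a DNA sequence, keeping IUPAC ambiguity codes valid."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


result = reverse_complement("ACGTRYN")
```

Applying the function twice returns the original sequence, which is a quick sanity check on the table.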

91 Paired-end sequencing: Principle The two ends of the fragments get different adapters. Hence, one can sequence from one end with one primer, then repeat to get the other end with the other primer. This yields mate pairs of reads, separated by a known distance. For large distances, circularisation might be needed.

92 Paired-End Reads Jarvie & Harkins (454) Nature Methods May 2008

93 FASTQ and paired-end reads Convention for paired-end runs: the reads are reported in two FASTQ files, such that the nth read in the first file is mate-paired to the nth read in the second file. The read IDs must match.
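
This convention is easy to verify before running an aligner. The sketch below assumes unwrapped four-line records and mate suffixes of the form `/1` and `/2`; the function names and read IDs are hypothetical.

```python
def read_ids(fastq_lines):
    """Return the read IDs of a FASTQ file given as a list of lines (4 per read)."""
    return [line.strip()[1:].split()[0] for line in fastq_lines[0::4]]


def base_id(read_id):
    """Drop a trailing /1 or /2 mate suffix, if present."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def mates_in_sync(fastq1_lines, fastq2_lines):
    """True if both files have the same number of reads with matching IDs in order."""
    ids1 = read_ids(fastq1_lines)
    ids2 = read_ids(fastq2_lines)
    return len(ids1) == len(ids2) and all(
        base_id(a) == base_id(b) for a, b in zip(ids1, ids2)
    )


# Made-up example files, one list element per line.
file1 = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "GGCC", "+", "IIII"]
file2 = ["@r1/2", "TTGA", "+", "IIII", "@r2/2", "CCAA", "+", "IIII"]
in_sync = mates_in_sync(file1, file2)
```

Files typically fall out of sync when reads are filtered from one file but not the other, so this check is most useful after quality trimming.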

94 Paired ends: Uses
Paired-end sequencing is useful
- to find micro-indels
- to find copy-number variations
- for assembly tasks
- to look for splice variants
but of little value for
- standard ChIP-Seq
- normal RNA-Seq

95 Coverage In resequencing, we hope to sequence uniformly, i.e., to see each part of the genome represented by the same number of reads. Due to the random nature of shotgun sequencing, we need to cover the genome several times in order to see each position at least once. In other techniques (ChIP-Seq, RNA-Seq, Tag-Seq, CNV-Seq, etc.), the local coverage is what we are interested in.
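
A rough way to quantify "several times" is the mean coverage C = N x L / G (N reads of length L over a genome of size G). Assuming reads land uniformly at random, so that per-base depth is approximately Poisson distributed (an idealisation that real data only approximate), the expected fraction of positions with no read is e^(-C). The function names and numbers below are illustrative assumptions.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each position: C = N * L / G."""
    return n_reads * read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of positions with zero reads, under a Poisson model."""
    return math.exp(-coverage)


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Made-up run: one million 100 bp reads on a 10 Mbp genome.
coverage = mean_coverage(1_000_000, 100, 10_000_000)
missed = fraction_uncovered(coverage)
needed = reads_needed(30, 100, 10_000_000)
```

At 1x mean coverage about 37% of positions are expected to be missed entirely, which illustrates why a genome has to be covered several times over.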

96 SNP Discovery: Goal sequencing errors SNP

97 SNP Discovery: Base Qualities High quality Low quality

98 SNP Discovery
haploid
strain 1: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 2: AACGTTCGCATA AACGTTCGCATA
strain 3: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
individual 1: AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2: AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3: AACGTTAGCATA AACGTTAGCATA

99 Genotyping & Consensus Generation
haploid
strain 1 [A]: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 2 [C]: AACGTTCGCATA AACGTTCGCATA
strain 3 [A]: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
individual 1 [A/C]: AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2 [C/C]: AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3 [A/A]: AACGTTAGCATA AACGTTAGCATA
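
The genotype labels in this example can be reproduced by counting the bases observed at the variable position. The sketch below is a deliberately naive caller: the function name and the `min_fraction` threshold are assumptions, and real callers also weigh base and mapping qualities.

```python
from collections import Counter


def call_genotype(bases, ploidy=2, min_fraction=0.2):
    """Call a genotype from the bases observed at one position."""
    if not bases:
        raise ValueError("no reads cover this position")
    counts = Counter(bases)
    total = sum(counts.values())
    if ploidy == 1:
        # Haploid: report the most frequent base.
        return counts.most_common(1)[0][0]
    # Diploid: keep up to two alleles seen in at least min_fraction of reads.
    alleles = [b for b, n in counts.most_common() if n / total >= min_fraction][:2]
    if len(alleles) == 1:
        alleles = alleles * 2  # homozygous
    return "/".join(sorted(alleles))


REF_READ = "AACGTTAGCATA"
ALT_READ = "AACGTTCGCATA"
SITE = 6  # 0-based index of the position where the two reads differ

individual_1 = call_genotype([r[SITE] for r in [REF_READ, REF_READ, ALT_READ, ALT_READ]])
individual_2 = call_genotype([r[SITE] for r in [ALT_READ] * 4])
individual_3 = call_genotype([r[SITE] for r in [REF_READ] * 2])
strain_2 = call_genotype([r[SITE] for r in [ALT_READ] * 2], ploidy=1)
```

The `min_fraction` threshold is what separates a heterozygous site from an isolated sequencing error in otherwise homozygous data.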

100 Tablet: visualization of consensus & SNP

101 Conclusions Next-generation sequencing has revolutionized our ability to study genome variation. Comparisons of matched genomes are revealing mutations that have prognostic value in predicting outcomes and response to stress. While sequencing data production and its associated costs are becoming less expensive, a substantial and skilled team of experts is required for effective and timely genome analysis.

102 Thanks

103 Burrows-Wheeler Transformation Bowtie builds a genome index based on the Burrows-Wheeler Transformation (BWT) and the FM Index. The Burrows-Wheeler Transformation of a text T, BWT(T), is constructed as follows. The Burrows-Wheeler Matrix of T is the matrix whose rows are all distinct cyclic rotations of T$ sorted lexicographically ($ is less than all other characters). BWT(T) is the sequence of characters in the last column of this matrix.
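
The definition above can be turned into code almost word for word. This is a naive sketch for small inputs, not how Bowtie builds its index: sorting all rotations explicitly needs memory quadratic in the text length. The function name and the example text are assumptions.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, built from the sorted rotation matrix."""
    terminated = text + "$"  # '$' sorts before the letters A-Z and a-z in ASCII
    rotations = sorted(
        terminated[i:] + terminated[:i] for i in range(len(terminated))
    )
    # BWT(T) is the last column of the Burrows-Wheeler Matrix.
    return "".join(rotation[-1] for rotation in rotations)


transformed = bwt("abaaba")
```

For the text abaaba the sorted rotations are $abaaba, a$abaab, aaba$ab, aba$aba, abaaba$, ba$abaa and baaba$a, and reading their last characters top to bottom gives the transform.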

104 LF Mapping The Burrows-Wheeler Matrix has a property called the LF mapping: the i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column. This property underlies algorithms that use the BWT to navigate or search the text. For example, the 2nd occurrence of a in the last column corresponds to the same text character as the 2nd occurrence of a in the first column.
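
The LF mapping can be computed from the BWT string alone, because the first column is simply the sorted characters of the last column. The sketch below uses a hypothetical function name and the transform of the made-up text abaaba as input.

```python
def lf_mapping(bwt_string):
    """For each row i, return the row whose first character is row i's last character."""
    first_column = sorted(bwt_string)
    # Row at which each character first appears in the first column.
    first_row = {}
    for row, char in enumerate(first_column):
        first_row.setdefault(char, row)
    seen = {}  # how many times each character has occurred so far in the last column
    mapping = []
    for char in bwt_string:
        rank = seen.get(char, 0)
        # The rank-th occurrence in the last column maps to the
        # rank-th occurrence of the same character in the first column.
        mapping.append(first_row[char] + rank)
        seen[char] = rank + 1
    return mapping


lf = lf_mapping("abba$aa")  # BWT of the text "abaaba"
```

In this example row 3 ends with the 2nd a of the last column and maps to row 2, which starts with the 2nd a of the first column.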

105 Reversing the Transformation To recreate T from BWT(T), start with i = 0 and T = BWT[0], then repeatedly apply rule 2: T = BWT[LF(i)] + T; i = LF(i), where LF(i) maps row i to the row whose first character corresponds to row i's last character according to the LF mapping. Stop when the $ terminator is reached; the string built so far is the final T.
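
The reversal can be sketched as a loop that walks the LF mapping from row 0 and collects characters back to front. The function names are hypothetical, the LF mapping is recomputed here so the block stands alone, and the input is the transform of the made-up text abaaba.

```python
def lf_mapping(bwt_string):
    """For each row i, return the row whose first character is row i's last character."""
    first_row = {}
    for row, char in enumerate(sorted(bwt_string)):
        first_row.setdefault(char, row)
    seen, mapping = {}, []
    for char in bwt_string:
        rank = seen.get(char, 0)
        mapping.append(first_row[char] + rank)
        seen[char] = rank + 1
    return mapping


def inverse_bwt(bwt_string):
    """Recover the original text (without '$') from its Burrows-Wheeler transform."""
    lf = lf_mapping(bwt_string)
    row = 0  # row 0 starts with '$', so its last character is the last character of T
    reversed_text = []
    while bwt_string[row] != "$":
        reversed_text.append(bwt_string[row])  # text is recovered right to left
        row = lf[row]
    return "".join(reversed(reversed_text))


original = inverse_bwt("abba$aa")
```

No rotation matrix is rebuilt during the reversal; only the last column and the LF mapping are needed, which is what makes BWT-based indexes compact.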