Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

1 Next Generation Sequences & Chloroplast Assembly 8 June, 2012 Jongsun Park

2 Table of Contents
1 History of Sequencing Technologies
2 Genome Assembly Processes With NGS Sequences
3 How to Assemble Chloroplast Sequences

3 History of Sequencing Technologies

4 Classical Method: Sanger Sequencing
Four reaction tubes, each containing primer, DNA template, and one dideoxynucleotide: +ddATP (A), +ddTTP (T), +ddGTP (G), +ddCTP (C). Each tube produces new DNA strands that end at the corresponding base.
Using dideoxynucleotides, DNA synthesis is stopped at random positions in four different reaction tubes.

5 Automated Method: Sanger Sequencing
The same four terminators, each labeled with a different dye: +ddATP with dye (A), +ddTTP with dye (T), +ddGTP with dye (G), +ddCTP with dye (C), together with primer and DNA template.
Using four fluorescent dyes, a scanner can read the sequence directly. With 96 capillaries, one machine can read 96 or 384 different samples in one run.

6 Capacity of Automated Sanger Sequencing
One capillary reads about 700 bases (quality value higher than 20) per hour. One ABI 3730 machine with a 384-capillary kit therefore yields 700 * 384 * 24 = 6,451,200 bp per day, not counting the sample preparation process. Reading the human genome (3,000,000,000 bp) at 1x coverage on one such machine would take about 465 days (3,000,000,000 bp / 6,451,200 bp per day = 465.0). De novo assembly usually requires 6x to 10x coverage, so obtaining enough sequence for a genome project would take about 2,790 to 4,650 days.
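The arithmetic on this slide can be checked with a short sketch. The constant names are illustrative, and the human genome size of 3,000,000,000 bp is an assumed round figure.

```python
# Throughput of one automated Sanger machine, using the figures from the slide.
BASES_PER_CAPILLARY_PER_HOUR = 700   # bases with quality value > 20
CAPILLARIES = 384                    # ABI 3730 with a 384-capillary kit
HOURS_PER_DAY = 24
HUMAN_GENOME_BP = 3_000_000_000      # assumed round figure for 1x coverage

bp_per_day = BASES_PER_CAPILLARY_PER_HOUR * CAPILLARIES * HOURS_PER_DAY


def days_for_coverage(coverage):
    """Days one machine needs to produce `coverage`-fold human genome data."""
    return coverage * HUMAN_GENOME_BP / bp_per_day


if __name__ == "__main__":
    print(f"{bp_per_day:,} bp per day")
    for coverage in (1, 6, 10):
        print(f"{coverage}x coverage: about {days_for_coverage(coverage):,.0f} days")
```

Sample preparation time is ignored here, as on the slide, so real projects would take longer.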

7 Next Generation Sequencing (NGS) Technologies
Solexa (Illumina); GS Titanium (Roche 454); SOLiD (ABI); SMRT (Pacific Biosciences); Helicos (Helicos BioSciences)
Next (or current) generation sequencing technologies have accelerated genome sequencing projects and have broadened the range of applications of genome sequences.

8 NGS: 454 Technology

9 Microtiter plate X80

10 NGS: Solexa Technology

11 NGS: Solexa Technology

12 NGS: Solexa Technology

13 NGS: Solexa Technology

14 Capacities of Next Generation Sequencers
ABI 3730 (ABI): 384 x 700 bp = 268,800 bp = 269 Kb (per reaction / 1 hr)
GS Titanium (Roche): 950,000 x 450 bp = 427,500,000 bp = 427.5 Mb (per reaction / 2-3 days)
Solexa GA2 (Illumina): 35,000,000 x 7 x (151 x 2) bp = 73,990,000,000 bp = 74.0 Gb (per reaction / 12 days)
HiSeq2000 (Illumina): 70,000,000 x 14 x (101 x 2) bp = 197,960,000,000 bp = 198.0 Gb (per reaction / 8 days)
SOLiD 4 (ABI): 1,400,000,000 x 75 bp (50+25) = 105,000,000,000 bp = 105.0 Gb (per reaction / 11 days)

15 Single Read and Mate Pair (Paired End) Sequences
Single read is the simplest method: each sample (cluster or bead) is read in one direction only. The mate pair (or paired end) method reads short sequences from both ends of each fragment, which is similar to BAC end, fosmid end, and cosmid end sequences. Mate pair (paired end) reads are useful for building larger sequences that contain gaps, a process called scaffolding.

16 Summary of Costs For NGS Technologies Sequence explosion, Nature, 46(1),

17 Pros and Cons of NGS Technologies
Pros
Large number of reads per run
Extremely low cost for generating a huge amount of sequence
Diverse applications, not only in genomics but also in transcriptomics, small RNAs, epigenomics, and others, with small modifications of protocols
Cons
Short read length (36 bp to 151 bp, or 450 bp)
Different types of sequencing quality (GS, Solexa, and SOLiD)
Difficult to handle such large amounts of sequence with ordinary programs, including those for de novo assembly
Impossible to obtain a small amount of sequence at low cost

18 Genome Assembly Processes With NGS Sequences

19 Whole Shotgun Sequence Strategy
An assembly process is essential for a genome project because each sequence read is shorter than 1 kb. For Sanger sequences, assembly was conducted with several popular programs, such as phrap and PCAP3.
(Figure: genome assembly, scaffolding, assembled genome.)

20 Genome Assembly Process
We can perform genome assembly manually! 23 sequences must be compared with each other: 23C2 = 23 * 22 / 2 = 253 comparisons!

21 How to Find Overlapped Sequences?
Using a dynamic programming algorithm, we can write a program that finds similar sequences. The complexity of this algorithm is O(n^3).
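A minimal sketch of dynamic programming alignment follows. It is not the program referred to on the slide: it computes only a global alignment score, and the scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions. With a fixed penalty per gap position, this particular form needs time proportional to the product of the two sequence lengths, which is cheaper than the general bound quoted above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the score table are kept in memory, which is enough for the score but not for recovering the alignment itself.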

22 Classification of Alignments (1)
Is the whole sequence considered for the alignment? Yes: global alignment. No: local alignment.

23 Classification of Alignments (2)
How many sequences are considered for the alignment? More than 2 sequences: multiple alignment. 2 sequences: pairwise alignment.

24 Famous Bioinformatics Tools for Alignments
Global alignment: ClustalW, T-Coffee, and MUSCLE
Local alignment: FASTA and BLAST (Basic Local Alignment Search Tool), provided by NCBI
Pairwise alignment: BLAST and FASTA
Multiple sequence alignment: ClustalW, T-Coffee, MUSCLE, and others

25 Example of Genome Assembly: Vitis vinifera
Pairwise comparison of 6,200,000 reads: 6,200,000C2 = 19,219,996,900,000 comparisons

26 Genome Assemblers

27 Short read Sequence Assembly (1)
Short read sequences generated by NGS machines cause several problems for the already well established genome assemblers. Too many reads demand nearly unlimited computational power, and reads that are too short do not give overlaps reliable enough to build long contigs. A new algorithm was required to reduce the computational load, and short reads need another strategy to produce reliable contig sequences. Dealing with so many sequences also caused several technical problems, such as limits on physical memory, hard disk space, and computing power.

28 Short read Sequence Assembly (2)
Pairwise comparison of 563,466,202 reads: 563,466,202C2 = 563,466,202 * 563,466,201 / 2 = 158,747,080,116,419,301 comparisons?!
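The comparison counts quoted on the slides so far (23 sequences, 6,200,000 reads, and 563,466,202 reads) all follow from the same formula, nC2 = n(n-1)/2. A small sketch to reproduce them, with an illustrative function name:

```python
from math import comb  # available in Python 3.8 and later


def pairwise_comparisons(n_reads):
    """Number of distinct read pairs when every read is compared with every other."""
    return comb(n_reads, 2)  # n * (n - 1) / 2


if __name__ == "__main__":
    for n in (23, 6_200_000, 563_466_202):
        print(f"{n:,} reads -> {pairwise_comparisons(n):,} comparisons")
```

The count grows with the square of the number of reads, which is why all-against-all comparison becomes impractical for short read data sets.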

29 De Bruijn Graph: Alternative Method For Alignment

30 De Bruijn Graph Algorithm For Alignment (1)
This algorithm has been used to find overlapping short read sequences quickly. It consists of three parts:
i) Generating k-mer sequences
ii) Constructing the de Bruijn graph
iii) Resolving the graph to generate sequences

31 De Bruijn Graph Algorithm For Alignment (2)
Read: GCAAAACACTTA
K = 3
k-mers: GCA CAA AAA AAA AAC ACA CAC ACT CTT TTA
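The k-mer generation step can be sketched in a few lines. The function name is illustrative; the read and the value of K are the ones from the slide.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    print(" ".join(kmers("GCAAAACACTTA", 3)))
```

A read of length L yields L - k + 1 k-mers, so the 12-base read gives 10 k-mers for K = 3, with AAA appearing twice.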

32 De Bruijn Graph Algorithm For Alignment (3)
Read 1: GCAAAACACTTA
Read 2: ACACTTATTCGT
K = 3
k-mers of read 2: ACA CAC ACT CTT TTA TAT ATT TTC TCG CGT
(Figure: de Bruijn graph; the nodes added by read 2 are TAT, ATT, TTC, TCG, CGT.)

33 De Bruijn Graph Algorithm For Alignment (4)
(Figure: de Bruijn graph with the nodes TAT, ATT, TTC, TCG, CGT appended.)
Resolved sequence: GCAAAACACTTATTCGT
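The graph construction for the two example reads can be sketched as below. This is a simplified illustration, not the data structure of any particular assembler: k-mers are used as nodes, an edge joins two k-mers that appear consecutively in a read, and the function names are hypothetical. Instead of resolving the graph, which needs extra rules for repeated k-mers such as AAA, the sketch only checks that the merged sequence from the slide is a valid walk through the graph.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Return (k-mer occurrence counts, edges between consecutive k-mers)."""
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        counts.update(ks)
        for left, right in zip(ks, ks[1:]):
            edges[left].add(right)  # left and right overlap by k - 1 bases
    return counts, edges


def is_walk(seq, edges, k):
    """True if every consecutive k-mer pair of seq is an edge of the graph."""
    ks = kmers(seq, k)
    return all(right in edges[left] for left, right in zip(ks, ks[1:]))


if __name__ == "__main__":
    reads = ["GCAAAACACTTA", "ACACTTATTCGT"]
    counts, edges = build_graph(reads, 3)
    merged = "GCAAAACACTTATTCGT"
    print("k-mers shared by both reads:",
          sorted(set(kmers(reads[0], 3)) & set(kmers(reads[1], 3))))
    print("merged sequence is a walk in the graph:", is_walk(merged, edges, 3))
```

The k-mers shared by both reads are where the two reads join in the graph, so no read-against-read comparison is needed to find the overlap.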

34 How to Assemble Chloroplast Sequences

35 Revisit: Whole Shotgun Sequence Strategy
An assembly process is essential for a genome project because each sequence read is shorter than 1 kb. For Sanger sequences, assembly was conducted with several popular programs, such as phrap and PCAP3.
(Figure: genome assembly, scaffolding, assembled genome.)

36 Revisit: Genome Assembly Process
We can perform genome assembly manually! 23 sequences must be compared with each other: 23C2 = 23 * 22 / 2 = 253 comparisons!

37 Genome Assemblers For NGS Sequences (1)

38 Velvet Assembler (1)
velveth: ~exeman/velvet_1.1.03/velveth k31 31 -shortPaired -fastq (FASTQ file names lost in transcription)
velvetg: ~exeman/velvet_1.1.03/velvetg k31 -ins_length 500 -cov_cutoff auto -exp_cov auto

39 Velvet Assembler (2)

40 Velvet Assembler (3) Statistics of assembled sequences

41 Genome Assemblers For NGS Sequences (2)

42 ABySS Assembler (1)
~exeman/abyss-1.2.7/ABYSS/ABYSS -k 31 -o ABYSS.k31.fasta (FASTQ file names lost in transcription)

43 ABySS Assembler (2)

44 Thank you for your attention!