Genome Assembly With Next Generation Sequencers

Size: px
Start display at page:

Download "Genome Assembly With Next Generation Sequencers"

Transcription

1 Genome Assembly With Next Generation Sequencers Personal Genomics Institute 3 May, 2011 Jongsun Park

2 Table of Contents 1 Central Dogma and Omics Studies 2 History of Sequencing Technologies 3 Genome Assembly Processes With NGS Sequences 4 Current Status of Plant Genomes

3 Central Dogma and -Omics Studies

4 Advances of Biotechnologies Central Dogma in Molecular Biology and Bioinformatics DNA RNA Proteins Genomics Transcriptomics Proteomics Bioinformatics

5 Further Omics Studies With Bioinformatics Proteomics Transcriptomics 3D structure Metabolomics Pathway Analysis Gene Expression Regulatory Network Non-coding RNAs Genomics Phylogenomics Population Genomics Epigenomics

6 Position of Sequences in Central Dogma! MASTWTSWAMTCCAAMST Predicted from sequences AGCUACGUGAGAGACGUACUGUAC Target for Sequencing AGCTACGTGAGAGACGTACUGTAC

7 History of Sequencing Technologies

8 Classical Method: Sanger Sequencing +ddatp Primer New DNA strain DNA Template A +ddttp Primer New DNA strain DNA Template T +ddgtp Primer New DNA strain DNA Template G +ddctp Primer New DNA strain DNA Template C - Using dideoxynucleotides, Dna synthesis can be stopped randomly with four different reaction tubes.

9 Automated Method: Sanger Sequencing +ddatp with dye Primer New DNA strain DNA Template A +ddttp with dye Primer New DNA strain DNA Template T +ddgtp with dye Primer New DNA strain DNA Template G +ddctp with dye Primer New DNA strain DNA Template C - Using four fluorescent dyes, scanner can read sequences directly. - With 96 capillaries, machine can read 96 or 384 different samples at one time.

10 Capacity of Automated Sanger Sequencing - Per one capillary, 700 bases (Quality Value is higher than 20) can be read in one hour. - Per one machine, ABI3730 with 384 capillaries kit, 700 * 384 * 24 = 6,451,200 bp can be obtained without considering sample preparation process. - If you want to get 1x human genome sequences using one this machine, it will take 465 days (30,000,000,000 bp / 6,451,200 bp = ). - For de novo assembly, usually 6x to 10x coverage are required: it will take 4,650 days for obtaining enough sequences for genome project.

11 Next Generation Sequencing (NGS) Technologies Solexa; Illumina GS-Titanium; Roche 454 SOLiD; ABI SMRT; Pacific Bioscience Helicos; Helicos Bioscience - Next (or Current) generation sequencing technologies have accelerated the speed of genome sequencing projects and have broaden application range of genome sequences.

12 NGS: 454 Technology

13 NGS: Solexa Technology

14 NGS: SOLiD Technology (1)

15 NGS: SOLiD Technology (2)

16 NGS: SOLiD Technology (3)

17 Capacities of Next Generation Sequencers 384 x 700 bp = 268,800 bp = 269Kb (per one reaction / 1 hr) ABI 3730; ABI 950,000 x 450 bp = 405,000,000 bp = 405Mb (per one reaction / 2-3 days) GS-Titanium; Roche ,000,000 x 7 x (151 x 2) bp = 73,990,000,000 bp = 74.0 Gb (per one reaction / 12 days) Solexa GA2; Illumina 70,000,000 x 14 x (101 x 2) bp = 197,960,000,000 bp = Gb (per one reaction / 8 days) HiSeq2000; Illumina 1,400,000,000 x 75 bp (50+25) = 105,500,000,000 bp = 105.5Gb (per one reaction / 11 days) SOLiD 4; ABI

18 Single Read and Mate-Pair (Pair-end) Sequences - Single read is the most simple method: just read one direction for each sample (cluster or bead). - Mate-pair (or pair-end) method can generate both side of short-read sequences of each read, which is similar to BAC-end, Fosmid-end, Cosmid-end sequences. - Mate-pair (pair-end) methods are useful for generating larger sequences with gaps (We called it scaffolding.)

19 Summary of Costs For NGS Technologies Sequence explosion, Nature, 46(1),

20 Pros and Cons of NGS Technologies Pros - Large number of reads per one run - Extremely low costs for generating huge amount of sequences - Diverse applications not only for genomics but also for transcriptomics, small RNAs, epigenomics, and etc. with small modification of protocols. Cons - Short read length per each reads (36 bp to 151 bp or 450bp) - Different types of sequencing qualities (GS, Solexa, and SOLiD) - Difficult to deal with large size of sequences with normal programs including de novo assembly - Impossible to get small amount of sequences with low costs

21 Genome Assembly Processes With NGS Sequences

22 Whole Shotgun Sequence Strategy - Assembly process is essential for genome project because read length of each sequence is less than 1 kb. - Assembly process was conducted by several popular programs, such as phrap and PCAP3, for Sanger sequences. Assembled genome Scaffolding Genome Assembly

23 Genome Assembly Process - We can perform genome assembly manually! 23 sequences should be compared with each other! 23C2 = 23*22 / 2 = 253 comparison!

24 How to Find Overlapped Sequences? - Using dynamic algorithm, we can make the program for finding similar sequences. - Complexity of this algorithm is O(n 3 ).

25 Classification of Alignments (1) Considering whole sequences for alignment? Yes Global Alignment No Local alignment

26 Classification of Alignments (2) How many sequences are considered for alignment? More than 2 sequences! Multiple alignment 2 sequences! Pair-wise alignment

27 Famous Bioinformatics Tools for Alignments - Global alignment : ClustalW, T-coffee, and MUSCLE - Local alignment : FASTA, and BLAST (Basic Local Alignment Searching Tools) provided by NCBI. - Pair-wise alignment : BLAST and FASTA - Multiple sequence alignment : ClustalW, T-coffee, MUSCLE, and etc.

28 Example of Genome Assembly: Vitis Vinifera Pair-wise comparison of 6,200,000 reads 6,200,000C2 = 19,219,996,900,000 comparisons

29 Genome Assemblers

30 Short-read Sequence Assembly (1) - Short-read sequences generated by NGS machines cause several problems of already well-established genome assemblers. - Too many reads require near to infinite computational power. - Too short reads cannot find reliable overlaps to make long contigs. - To reduce computational power, new algorithm was required. - Short reads require another strategy to make reliable contig sequences. - Dealing a lot of sequences also caused several technical problems, such as physical memory problems, harddisk space, and computing power.

31 Short-read Sequence Assembly (2) 563,466, ,466,202C563,466,201 = 158,747,080,116,419,301 comparison?!

32 De brujin Graph: Alternative Method For Alignment

33 De brujin Graph Algorithm For Alignment (1) - This algorithm has been utilized for finding overlapped short-read sequences quickly. - This algorithm consists of three parts: i) Generating k-mer sequences ii) Constructing de brujin graph iii) Resolving the graph with generating sequences

34 De brujin Graph Algorithm For Alignment (2) GCAAAACACTTA K = 3 GCA CAA AAA AAA AAC ACA CAC ACT CTT TTA

35 De brujin Graph Algorithm For Alignment (3) GCAAAACACTTA ACACTTATTCGT ACA CAC ACT CTT TTA TAT ATT TTC TCG CGT K = CGT TCG TTC 2-8 ATT 2-7 TAT

36 De brujin Graph Algorithm For Alignment (4) CGT 2-10 TCG 2-9 TTC 2-8 ATT TAT GCAAAACACTTATTCGT

37 Genome Assemblers For Short Read Sequences

38 Examples of Plant Genome de novo Assembly Lithocarpus hancei Ficus altissima Ficus altissima Ficus altissima Ficus tinctoria Ficus tinctoria # of contigs 462, , , , , ,937 Total length 112,614,098 87,502,701 33,293,636 61,369,608 87,427, ,554,688 Maximum length Average length 1,748 1,090 1,688 1,334 1,274 1, N50 length ,937,915,739 bp # of contigs 950 ea Total length 976,089 bp Maximum length 12,606 bp Average length 1, bp N50 length 3,061 bp

39 Giant Panda Genome Project

40 Sordaria macrospora Genome Project

41 Large Scale Genome Projects: 1000 Human Genomes

42 Large Scale Genome Projects: BGI 1000 Genomes

43 Large Scale Genome Projects: Genome 10K Project

44 Current Status of Plant Genome Projects

45 Unpublished 11 Higher Plant Genomes Species name Method Size (Mb) # of contigs # of transcripts Arabidopsis lyrata WGS ,670 Medicago truncatula BAC, WGS ,334 Selaginella moellendorffii WGS ,285 Lycopersicon esculentum WGS, BAC ,409 49,389 Solanum phureja WGS , ,512 Ricinus communis WGS ,518 38,613 Mimulus guttatus WGS ,243 47,442 Manihot esculenta WGS ,216 27,501 Phoenix dactylifera WGS ,704 - Prunus persica WGS ,689 Oryza glaberrima WGS ,309 - All pictures are from wikipedia.

46 14 Published Higher Plant Genomes from 11 Species Species name Journal Method Size (Mb) # of contigs # of transcripts Arabidopsis thaliana Nature, 2000 BAC +α ,615 Oryza sativa japonica Oryza sativa indica Oryza sativa japonica (syngenta) Science, 2002 Nature, 2005 Science, 2002 PLoS Biology, 2005 BAC +α ,710 WGS ,267 49,710 PLoS Biology, 2005 WGS ,777 45,824 Populus trichocarpa Science, 2006 WGS ,012 45,555 Vitis vinifera Nature, 2007 WGS, Complete ,434 Carica papaya Nature, 2008 WGS ,677 28,589 Lotus japonicus DNA Research, 2008 WGS ,945 26,700 Sorghum bicolor Nature, 2009 WGS ,304 36,338 Zea mays Science, 2009 BAC, WGS 2, ,764 Cucumis sativus Nature genetics, 2009 WGS ,488 26,682 Glycine max Nature, 2010 WGS ,262 62,199 Brachypodium distachyon Malus x domestica All pictures are from wikipedia. Nature, 2010 WGS ,255 Nature Genetics, 2010 WGS ,621 57,386 *

47 7 Unicellular Plants Genomes Species name Journal Method Size (Mb) # of contigs # of transcripts Chlamydomonas reinhardtii Science, 2007 WGS ,709 Micromonas pusilla CCMP1545 Science, 2009 WGS ,547 Micromonas sp. RCC299 Science, 2009 WGS ,108 Ostreococcus lucimarinus CCE9901 PNAS, 2007 WGS ,488 Ostreococcus sp. RCC809 Not published yet WGS ,492 Ostreococcus tauri PNAS, 2007 WGS ,725 Coccomyxa sp. C169 Not published yet WGS ,629 All pictures are from wikipedia.

48 Distribution of 32 Plant Genome Size Mb Unicellular Plants

49 Distribution of Number of Transcripts of 29 Plants # of transcripts 120, , ,000 80,000 60,000 40,000 20,000 36,338 38,613 33,410 Unicellular Plants ,70027,501 28,58928,68930,43432,25532,670 22,285 16,709 7,4887,4927,725 9,629 10,10810,547 45,55545,824 47, ,71053,42353,764 67,393 62,199 0

50 Relationship between Genome Size and Transcripts # of transcript/genome Size (Mb) Unicellular Plants

51 Comparisons with Genomes in Other Kingdoms Mb Bacteria Oomycota Eukaryota Fungi Metazoa Viridaeplanta 5 species 6 species 23 species 256 species 90 species 32 species

52 Cucumber Genome Sequences

53 Thank you for your attention!