The Genetic Code: How to Decipher the Book of Life. Institute of Plant and Microbial Biology, Academia Sinica. 施明德

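1 The Genetic Code: How to Decipher the Book of Life. Institute of Plant and Microbial Biology, Academia Sinica. 施明德; Distinguished Research Fellow 邢禹依 (Yue-Ie Caroline Hsing). Tel: , bohsing@gate.sinica.edu.tw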

2 A cell's daily work

3 DNA → RNA → protein. DNA: stores genetic information. RNA: transmits genetic information. Protein: the product of genetic information. This is the cell's information flow.
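The information flow on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `transcribe` and `translate` and the example sequence are assumptions, not part of the talk, and the codon table is the standard genetic code written in the compact TCAG ordering.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumed layout).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna):
    """DNA -> RNA: the mRNA matches the coding strand with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """RNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 12-base coding sequence, for illustration only.
mrna = transcribe("ATGGCCAAATGA")
print(mrna, translate(mrna))
```

The sketch ignores splicing, the choice of reading frame and the search for the start codon, all of which a real cell (and a real annotation pipeline) must handle.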

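4 Chromosomes: formed by folding double-helical DNA

5 The cell's genetic information: genes. A gene is a DNA segment that carries genetic information. Gregor Mendel ( )

6 The cell's genetic information: genes. Thomas H. Morgan ( )

7 How do we decipher the book of life? 1. How to read the book 2. How to interpret the book 3. How to find the gene behind a mutant trait

8 Only part of a chromosome consists of genes

9 How do we decipher the book of life? 1. How to read the book (technology: sequencing; strategy: DNA assembly) 2. How to interpret the book 3. How to find the gene behind a mutant trait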

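10 Chemical structure of DNA

11 DNA sequencing

12 Sanger sequencing. RNA; DNA; for sequencing

13 Steps of Sanger sequencing: 1. Obtain DNA fragments 2. Clone them into a plasmid vector and amplify 3. Run the sequencing reaction (DNA synthesis) 4. Separate the products by electrophoresis 5. Read the sequence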
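Steps 3 to 5 can be mimicked with a toy simulation. The sketch below assumes the classic four-lane setup, in which each reaction contains one chain-terminating ddNTP, so every lane holds the fragments that end in that base; sorting all fragments by length, as the gel does, spells out the newly synthesized strand. The function names and the template are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize(template):
    """Strand the polymerase would build: the reverse complement of the template."""
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


def termination_lanes(new_strand):
    """For each ddNTP lane, list the lengths of the fragments that terminate there."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Electrophoresis orders fragments by length; read bands shortest to longest."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


template = "ATGC"  # hypothetical template
lanes = termination_lanes(synthesize(template))
print(lanes, read_gel(lanes))
```

Automated instruments such as the ABI machines replace the four lanes with four fluorescent dyes in one capillary, but the length-ordered readout is the same idea.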

14 Frederick Sanger ( ). Sir Frederick Sanger won two Nobel Prizes in Chemistry: one for protein sequencing (1958), the other for DNA sequencing (1980)

15 ³²P or ³⁵S; ABI 377; ABI 3730

16 Qpix picking robot

17 Biomek FX robot

18

19 Human Genome Project. J. Craig Venter: Celera Genomics (J. Craig Venter Institute). Francis Collins: Human Genome Project, which was first led by James Watson

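20 Second-generation sequencing (next-generation sequencing): 1. Obtain DNA fragments 2. Ligate adaptors to the DNA fragments 3. Amplify the fragments by PCR 4. Using the amplified fragments as templates, synthesize the complementary strand from fluorescently labeled nucleotides or short DNA fragments 5. Read the signals in order

21 Roche 454 sequencing. Throughput: each run sequences Mb. Advantage: longer read length ( bp). Disadvantages: the highest average cost per base, and repetitive sequences cannot be sequenced accurately. Suited to transcriptome analysis

22 Illumina SBS sequencing. Throughput: each run sequences 20 Gb. Advantage: coverage depth can exceed 40 times the genome size. Disadvantages: read length is about bp, and the error rate rises with read length. Currently the mainstream next-generation method; suited to genome sequencing and transcriptome analysis

23 Applied Biosystems SOLiD sequencing. Throughput: each run sequences Gb. Advantages: the lowest average cost per bp and the highest accuracy. Disadvantage: reads are 50 bp or shorter. Suited to genome resequencing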

24 Improvements in the rate of DNA sequencing over the past 30 years and into the future Nature 458,

25 Third-generation sequencing: single-molecule DNA sequencing, without PCR amplification

26 New-generation sequencing platforms under development Fluorescence-based single-molecule sequencing Nano-technologies for single-molecule sequencing Electronic detection for single-molecule sequencing Electron microscopy for single-molecule sequencing Other approaches for single-molecule sequencing

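27 Helicos sequencing. Similar to 454 or Illumina sequencing, but with greatly improved fluorescence detection, so single-molecule sequencing works without PCR amplification. Throughput: each run sequences more than 25 Gb. Disadvantages: read length is about 30 bp, and repetitive sequences are resolved poorly

28 Pacific Biosciences sequencing. A DNA polymerase is immobilized on a glass substrate and supplied with dNTPs that carry a base-specific, removable fluorescent label (attached to the phosphate group rather than the base), so polymerization along the template DNA can proceed without interruption. Throughput: a single read of template DNA can reach 10 kbp

29 Nanopore sequencing. A funnel-shaped membrane protein that forms a nanopore (currently most often α-hemolysin, a hemolytic toxin) is embedded in a lipid bilayer. An exonuclease or helicase attached on top unwinds double-stranded DNA into single strands and feeds them into the pore. Each base passes through the nanopore at a different rate and perturbs the current to a different extent, which is how the DNA sequence is read

30 Nanopore sequencing. Nanopore DNA sequencing, as envisioned, will use an exonuclease (green) to cleave nucleotides from DNA. Each nucleotide will be directed into the nanopore (blue), where it interacts with a cyclodextrin "adapter" (red, donut-shaped) and interrupts a current to an extent that pinpoints its identity (such as G = guanine, C = cytosine, and T = thymine).

31 Nanopore sequencing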

32

33

34

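35 How do we decipher the book of life? 1. How to read the book (technology: sequencing; strategy: DNA assembly) 2. How to interpret the book 3. How to find the gene behind a mutant trait

36 Basic workflow of a sequencing project

37 DNA PUZZLE

38 The difficulty of completing the puzzle is positively correlated with chromosome size. 1 piece = 150 bp = 1 Illumina read. Coverage depth = 25 times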
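The puzzle analogy can be made quantitative with the usual relation depth = (number of reads × read length) / genome size. The sketch below uses the slide's 150 bp reads and 25-fold depth; the 1 Mb genome in the example is a hypothetical value chosen only to keep the numbers small, and the function names are illustrative.

```python
def reads_needed(genome_size_bp, read_length_bp=150, depth=25):
    """Number of reads (puzzle pieces) needed to reach the target average depth."""
    total_bases = genome_size_bp * depth
    return -(-total_bases // read_length_bp)  # ceiling division on integers


def coverage_depth(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is covered."""
    return n_reads * read_length_bp / genome_size_bp


# Hypothetical 1 Mb genome; not a figure from the slides.
n = reads_needed(1_000_000)
print(n, coverage_depth(n, 150, 1_000_000))
```

Because the number of pieces scales linearly with genome size while the number of possible piece-to-piece comparisons scales much faster, larger genomes are disproportionately harder to assemble.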

39 Sequencing milestones
Genome | Date | Size | Institute | Method
Homo sapiens mtDNA | | ,159 bp (1 circular) | - | -
Haemophilus influenzae (bacteria) | | ,830,137 bp (1 circular) | TIGR | Shotgun
Mycoplasma genitalium (bacteria) | | ,070 bp (1 circular) | TIGR | Shotgun
Escherichia coli (bacteria) | | ,639,221 bp (1 circular) | University of Wisconsin-Madison | Shotgun
Methanococcus jannaschii (Archaeon) | | ,739,933 bp (3 circular) | DOE | Shotgun
Saccharomyces cerevisiae (yeast) | | ,067,280 bp (16 linear) | 100+ labs | Mapping
Caenorhabditis elegans (nematode) | | ,000,000 bp (6 linear) | Consortium | Mapping

40 Sequencing milestones
Genome | Date | Size | Institute | Method
Drosophila melanogaster (fruit fly) | | ,000,000 bp | UC Berkeley, Celera Genomics | Shotgun w/ BAC map
Arabidopsis thaliana (angiosperm) | | ,000,000 bp (5 linear) | Consortium | BAC-by-BAC
Homo sapiens (human) | | ,400,000,000 bp | Human Genome Project & Celera Genomics | Mapping & Shotgun
Oryza sativa (rice) | | ,000,000 bp | International Rice Genome Sequencing Project, Monsanto, Beijing Genome Institute, Syngenta | BAC-by-BAC & Shotgun

41 Chromosome assembly

42 Two assembly approaches. Overlap-Layout-Consensus (OLC): used by most assemblers for previous-generation (Sanger) sequencing, e.g. Celera Assembler, PCAP, Phusion, Arachne. de Bruijn graph: used by most assemblers for NGS data, e.g. SOAPdenovo, ALLPATHS-LG, Velvet, ABySS

43 Overlap-Layout-Consensus (OLC)
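A toy version of the OLC idea: compute suffix-prefix overlaps between all pairs of reads, then repeatedly merge the pair with the longest overlap. This is a deliberately simplified sketch, not how Celera Assembler or any other named tool works; greedy merging stands in for the layout and consensus steps, the reads are error-free, and the names `overlap` and `greedy_assemble` are invented for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # Overlap step: every ordered pair is compared (quadratic in the read count).
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical error-free reads drawn from the sequence ATGGCGTACCTGA.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The nested loop over read pairs is exactly the cost that makes OLC expensive for very large read sets.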

44 5 billion reads? Comparing every read with every other read means about 12.5 quintillion (1.25 × 10^19) pairwise overlaps; at 1 million overlaps/second, that would take roughly 400,000 years
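The arithmetic can be checked directly under one stated assumption: every unordered pair of reads is compared exactly once, giving n(n-1)/2 comparisons. The helper names below are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def pairwise_overlaps(n_reads):
    """Number of unordered read pairs in an all-against-all comparison."""
    return n_reads * (n_reads - 1) // 2


def years_needed(n_pairs, overlaps_per_second=1_000_000):
    """Wall-clock years to test n_pairs at a fixed comparison rate."""
    return n_pairs / overlaps_per_second / SECONDS_PER_YEAR


for n_reads in (1_000_000, 100_000_000, 5_000_000_000):
    pairs = pairwise_overlaps(n_reads)
    print(f"{n_reads:>13,} reads -> {pairs:.3e} pairs, {years_needed(pairs):.3g} years")
```

The point of the loop is the quadratic growth: a 50-fold increase in reads costs about 2,500 times as many comparisons, which is why short-read assemblers avoid explicit all-against-all overlap.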

45 de Bruijn graph: find all k-mers, then build the graph. Every k-mer is a node. Two nodes are linked by an edge if they share a (k-1)-mer, that is, the last k-1 bases of one are the first k-1 bases of the other. An assembly is a path through the graph that visits each edge at least once. Because of sequencing errors and gaps in coverage, the reads give only a rough estimate of the true graph of the genome
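The construction described on this slide fits in a short sketch: k-mers as nodes, an edge wherever the (k-1)-suffix of one node equals the (k-1)-prefix of another. The walk at the end assumes an error-free, repeat-free toy data set whose graph is a single unbranched path; real assemblers such as Velvet or SOAPdenovo do far more (error correction, bubble removal, scaffolding). All names are illustrative.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; a -> b when the last k-1 bases of a start b."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    by_prefix = defaultdict(list)
    for node in sorted(nodes):
        by_prefix[node[:-1]].append(node)
    return {node: list(by_prefix.get(node[1:], [])) for node in sorted(nodes)}


def walk_unbranched(edges):
    """Follow the path from the first node with no incoming edge (toy case only)."""
    has_incoming = {b for successors in edges.values() for b in successors}
    starts = [node for node in edges if node not in has_incoming]
    if not starts:
        raise ValueError("graph has no start node (e.g. a cycle)")
    node = starts[0]
    contig, seen = node, {node}
    while len(edges[node]) == 1 and edges[node][0] not in seen:
        node = edges[node][0]
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


# Hypothetical error-free reads from ATGGCGTACC, with k = 4.
graph = build_graph(["ATGGCG", "GGCGTA", "CGTACC"], k=4)
print(walk_unbranched(graph))
```

Building the graph costs time roughly proportional to the total number of k-mers, which is why this approach scales to billions of short reads where all-against-all overlap does not.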

46 Benefits and drawbacks of OLC and de Bruijn graph
Benefits of OLC: can deal with variable-length reads and reads from different sequencing platforms; overlaps can be long and thus more reliable; can resolve repeats of up to read size.
Drawbacks of OLC: computationally intensive, because the number of overlaps grows quickly with the number of reads and coverage.
Benefits of the graph: computationally efficient.
Drawbacks of the graph: the maximum k-mer size is limited by the shortest read; all overlaps in the graph are exact overlaps of k-1 bases; repeats longer than the k-mer length cannot be resolved; errors in the reads create spurious branches in the graph, so error correction is required.

47 2158 Base Pairs of Human Sequence GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTAT GGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAGAAACCCCCAAGCTAGGATTTCTG CAGCTCATGAAGCCTTGGAGATAAATGAGTAAGTGGGGGAAAATCTTGCTGTTAAAAAGGAAATCTCATCCTTTGCTGAATAT ATTCAGTTGCCATTGATAGGATACTTAAATTAAACTGCATTTGAACTGGAGGATTATTTGGGGAGTTATTACTCTATTTAAAA AAGTTTTTTTTTTAAATGAAGGACAGCCACCATGTGGAGGTGGTTTTAGTCATTTTATGAATTCAATGGCTTTGCTGTGATCC TAAATTAATTTCTTGAAGGGCTATCCCTAGGATATTGTGAGGATATAAAATAAATACAATTCTTTACATATCTAAAACATTCT GACAGGGAAAATTTTCCAGATGTAGAATGCTCATCTGCACTAGAACATTTTCTAGTAGAACTTCTGCTAGTGGGGAAAACATG ATAACAACATAAGGTTTAAAAAAAAAATTTTAGAAAATACTTCAAGATTAAGACAAAGATAAGAGGAAATGCTGTCTTGAGTG TTGTTAAACATTCTGTGGGTTACCAAGGAAGGCTGGGAAATCTCTTCTGGAGATCTCAGAAAATGAGAAAGATTCTTAAAGTT GGAGTCATAAAAACTCAGGGTTGGCAGAGACCTTAAAGGTCACTTAGCTGAACCACCCATCTGGTGCTTGAATCACCTCAACA CTATCCTTGCCAAGTGGTCATTGTTAAACTATTTTATGATTTTTCTGAAGAAGGTTACAGAATCTTCTTCAGAGATCTTAGGG AAAAAAAAAAAAGATTGTCGTGAGAGTTGAAAATCCTGCCATTGTAACCAGTTGATCTACGGTTTCTGATTCTGTCATGCAAC ATATTTATTTTCCAGTTTCTTGTCATCTACAAATTCGATATGCCTGCCTTCTGTGTGTCATCCATATTTCTGAGAAAAATATG AAGGCCAGGAATAGAGCCCTGTGACATGACATAGAAACTACCCTCCAGGTTCATGTCTTCATGAATCACCATCTTTTGTATTG TTCACTCAATTACTAAGCCACCCAGTTACACTGTGACTCAGCTCATATTTCTCCATTTGGATCTTAAGAATGCCAATCGTAGC TGCGGATCTTAAATTTATAGTAAATCTATTACAGTAAATTAAGCTAGCACAATCTGATTTATTTATTCTTAGTGAATATAAGC TGGCTTCTAGTCGTCACTACTTTCTTTTTAAAGTGCTTGGAGACCATTCCTTTAATAATCCATTAGAATATCTTTCCAAATCA CTGTGTTCTGTAGTTTGGGAAGTCTGCCTTCTTCCCCTTTTTGAAAATTTATGCTACATTTATCATCTCATCTTCTAGCACCT CTCCATTCTTTGTGATTCCTCAACTATCCACAGAGAGCAATTCCATGGCCTGCCTACAAGGTCTTTCGGTTTCCTGGGATTTG CCCATCCAGTCCAGTAATTCATTTAGAATGGATCAATTATTTGCTATCTTACATCTTTTTACCCATTTTAGAGTTTAATTTCT TCTCCCTTTTTCAGTCTGACAGTCATTCTCCTTGATAGAGAAGCCAGGAACAAAATAGGAGGGAGAGAGTTTTGCTTTTTCTT TATTATCTACTGCTTTTAACAATAAACCTTCCTTGTTTTGATGTTATTATGTTGTTTGTCTTTTTTTTTTACTTATTTGCCTT TGTGACATGGGGACGGTGATAGGGCCTTAAATATAATTTTAAAATAGGGAATAAATGGTTGTCTTTAGTATTTTATTTTGTTT 
TATTATTATTATTATTATTGTTATTTTTGCAAGCTTCAGCTAATTTGGAATTGTAGCTCTCCTGACATTATTCTTATAAGCTC ATTCCACTCTCTTATAGACCATCATTACATGCCCTCTTTCCATCTTTTAAAATATGTCCTTTAAAAATCTGACCTGGGAGAAA TCTCTGTGAAGCCGTGTTGGTTACTTAAGTGCCACCCCTCTTTTCTTCCTGAGAGGATCATTTGTGATTGCAGTTACAGTTGA

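48 How do we decipher the book of life? 1. How to read the book (technology: sequencing; strategy: DNA assembly) 2. How to interpret the book 3. How to find the gene behind a mutant trait

49 Basic workflow of gene annotation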

50

51

52 TURKEY TROTS TO WATER GG FROM CINCPAC ACTION COM THIRD FLEET INFO COMINCH CTF SEVENTY-SEVEN X WHERE IS RPT WHERE IS TASK FORCE THIRTY FOUR RR THE WORLD WONDERS

53 A typical gene structure (figure labels, 5' to 3'): 5' flanking region; GC box; CAAT box; GC box; TATA box; transcription start site (TSS); initiation codon; introns bounded by GT ... AG; stop codon; AATAAA signal; poly(A)-addition site; 3' flanking region
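Several of the landmarks in this figure are plain sequence motifs, so a first-pass scan is easy to sketch. The code below looks for open reading frames (an ATG followed in frame by a stop codon) on the forward strand and for the AATAAA polyadenylation signal. It is a naive illustration under stated assumptions: no introns, forward strand only, and a made-up test sequence. Gene-prediction software relies on statistical models rather than this kind of exact matching.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, orf) tuples, 0-based with end exclusive, forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:  # codons before the stop
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


def find_polya_signals(seq):
    """0-based start positions of the AATAAA polyadenylation signal."""
    return [m.start() for m in re.finditer("AATAAA", seq.upper())]


# Hypothetical sequence: a short ORF followed by a poly(A) signal.
example = "CCATGGCCAAATGATTAATAAAGG"
print(find_orfs(example), find_polya_signals(example))
```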

54 Basic workflow of gene annotation

55 Prediction by software

56 Prediction by software

57 Comparison against a database of known protein sequences

>LOC_Os02g protein peptidase, T1 family, putative, expressed
Length=236
Score = 437 bits (1123), Expect = 8e-156, Method: Compositional matrix adjust.
Identities = 212/235 (90%), Positives = 222/235 (94%), Gaps = 0/235 (0%)

Query  1    MGDSQYSFSLTTFSPSGKLVQIEHALTAVGSGQTSLGIKASNGVVIATEKKLPSILVDEA  60
            MGDSQYSFSLTTFSPSGKLVQIEHALTAVGSGQTSLGIKA+NGVVIATEKKLPSILVDE
Sbjct  1    MGDSQYSFSLTTFSPSGKLVQIEHALTAVGSGQTSLGIKAANGVVIATEKKLPSILVDET  60

Query  61   SVQKIQHLTPNIGTVYSGMGPDFRVLVRKSRKQAEQYLRLYKEPIPVTQLVRETATVMQE  120
            SVQKIQ LTPNIG VYSGMGPDFRVLVRKSRKQA+QY RLYKE IPVTQLVRETA VMQE
Sbjct  61   SVQKIQSLTPNIGVVYSGMGPDFRVLVRKSRKQAQQYYRLYKETIPVTQLVRETAAVMQE  120

Query  121  FTQSGGVRPFGVSLLVAGYDDKGPQLYQVDPSGSYFSWKASAMGKNVSNAKTFLEKRYTE  180
            FTQSGGVRPFGVSLL+AGYDD GPQLYQVDPSGSYFSWKASAMGKNVSNAKTFLEKRYTE
Sbjct  121  FTQSGGVRPFGVSLLIAGYDDNGPQLYQVDPSGSYFSWKASAMGKNVSNAKTFLEKRYTE  180

Query  181  DMELDDAIHTAILTLKEGFEGEISSKNIEIGKIGTDKVFRVLTPAEIDDYLAEVE  235
            DMELDDAIHTAILTLKEG+EG+IS+ NIEIG I +D+ F+VLTPAEI D+L EVE
Sbjct  181  DMELDDAIHTAILTLKEGYEGQISANNIEIGVIRSDREFKVLTPAEIKDFLEEVE  235

>LOC_Os08g protein late embryogenesis abundant group 1, putative, expressed
Length=151
Score = 74.7 bits (182), Expect = 1e-16, Method: Compositional matrix adjust.
Identities = 59/159 (37%), Positives = 84/159 (53%), Gaps = 20/159 (13%)

Query  1    MQSMKETASNIAASAKSGMDKTKATLEEKAEKMKTRDPVQKQMATQVKEDKINQAEMQKR  60
            MQS KE A+N+ ASA++GMDK++A ++ + EK R+ K A AE +K+
Sbjct  8    MQSTKEAAANVGASARAGMDKSRAAVQGQVEKATARNAADKDAAEVRRQERLQAAEEEKQ  67

Query  61   ETRQHNAAMKEAAGAGTGLGLGTATHSTTGQVGHGTGTHQMSALP--GHGTGQLTDRVVE  118
            NAA KE A G G A H + G G +A P GH V +
Sbjct  68   HAMAANAAAKERATGGAG-----AYHPSQG----APGVDPRAAQPTGGH--------VQD  110

Query  119  GTAVTDPIGRNTGTGR-TTAHNTHVGGGGATGYGTGGGY  156
            G A + P+G TGT R + AHN HVG + +GTGG Y
Sbjct  111  GVAESRPVGTATGTARPSAAHNPHVGSDFSQAHGTGGQY  149
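The "Identities" and "Gaps" figures in a report like this are simple column counts over the aligned sequences. A minimal sketch, assuming the two strings are already aligned to equal length with "-" for gaps; it reproduces only these two counts, not BLAST's score, E-value or "Positives" (which need a substitution matrix). The function names are illustrative, and the example input is the first 60-column block of the first hit.

```python
def alignment_stats(query, subject):
    """Count identical columns, gapped columns and total columns of an alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, subject))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, subject))
    return identities, gaps, len(query)


def percent(count, total):
    """Percentage rounded to the nearest integer."""
    return round(100 * count / total)


# First alignment block of the first hit shown on the slide.
query_block = "MGDSQYSFSLTTFSPSGKLVQIEHALTAVGSGQTSLGIKASNGVVIATEKKLPSILVDEA"
sbjct_block = "MGDSQYSFSLTTFSPSGKLVQIEHALTAVGSGQTSLGIKAANGVVIATEKKLPSILVDET"
identities, gaps, columns = alignment_stats(query_block, sbjct_block)
print(f"Identities = {identities}/{columns} ({percent(identities, columns)}%), "
      f"Gaps = {gaps}/{columns}")
```

High identity to a protein of known function is the basis for the "putative" product names attached to predicted genes.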

58 24 exons CDS join( , , , , , , , , , , , , , , , , , , , , , , , ) /codon_start=1 /product="similar to Arabidopsis thaliana Putative ATPase (ISW2-like) (AC011623)" /translation="mgkpgkygdgddddseeeqlspsssageeeeeeveeeegeeqqe EQGEEEEGFSGDEEEQEVEGEADGEQVEEEEEEESSVGEEEAEAEGEEEEEEVEEEQG AGEEEEEEVDEEEIEAVTTGAGGDDDDEEVGDDGGAEEESQSTEDDEVAAGKDGGGED GDKLEDATGNAEIGKRERAKLREMQKLKKHKIQEILDAQNKAIDADMMTEEEEDEEYL KEEEDALDGAGGTRLVSQPSCIKGKMRDYQLAGLNWLIRLYENGINGILADEMGLGKT LQTISLLGYLHEFRGITGPHMVVAPKSTLGNWMKEIQRFCPVLRAIKFLGNPEERNHI RENLLVPGKFDVCVTSFEMAIKEKTALKRFSWRYIIIDEAHRIKNENSLLSKTMRIYN TNYRLLITGTPLQNNLHELWSLLNFLLPEIFSSAETFDDWFQISGENDQHEVVQQLHK VLRPFLLRRLKSDVEKGLPPKKETILKVGMSEMQKQYYRALLQKDLEVVNAGGERKRL LNIAMQLRKCCNHPYLFQGAEPGPPYTTGDHLIENAGKMVLLDKLLPKLKERDSRVLI FSQMTRLLDILEDYLMYKGYQYCRIDGNTGGEDRDASIEAFNKPGSEKFVFLLSTRAG GLGINLATADVVILYDSDWNPQVDLQAQDRAHRIGQKKEVQVFRFCTEYTIEEKVIER AYKKLALDALVIQQGRLAEQKAVNKDELLQMVRFGAEMVFSSKDSTITDEDIDRIIAK GEEATAQLDAKMKKFTEDAIKFKMDDTAELYDFDDDKMCCHIWSDFVIIECTLFFLKD ENKLDFKKLVTDNWIEPTSRRERKRNYSESDYFKQALRQGAPAKPREPRIPRMPHLHD FQFFNTQRLNELYEKEVKYLVQANQKKDTVGEGDDEDQLEPLTVEEQEEKEQLLEEGF STWTRRDFNTFIRACEKYGRNDIKNISSEMEGKTEEEVQRYAKVFQERYKELNDYDRV IKNIEKGEARIYRKDEIMKAIGKKLDRYKNPWLELKIQYGQNKGKLYNEECDRFMLCM VHKLGYGNWDELKAAFRMSPLFRFDWFVKSRTTQELARRCETLIRLVEKENQEYDERE RLARKDKKNMSPAKRSSSRSLDTPPQSSSKRRRQSYTEANAGSGRRRRG"
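A `join(...)` location lists the exon intervals that are concatenated to form the coding sequence. The sketch below shows the idea on a made-up 23-base sequence with two exons; the coordinates of the 24 exons on the slide are not reproduced here. Positions are 1-based and inclusive, as in GenBank feature tables; only the simple `start..end` form on the forward strand is handled, and the function names are illustrative.

```python
import re


def parse_join(location):
    """Parse 'join(3..8,16..21)' into [(3, 8), (16, 21)] (1-based, inclusive)."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def splice(genomic, exons):
    """Concatenate the exon intervals to obtain the coding sequence."""
    return "".join(genomic[start - 1:end] for start, end in exons)


def introns(genomic, exons):
    """Sequences lying between consecutive exons."""
    return [genomic[prev_end:next_start - 1]
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


# Hypothetical gene: exon ATGGCC, intron GTAAGAG, exon AAATGA.
genomic = "GGATGGCCGTAAGAGAAATGATT"
exons = parse_join("join(3..8,16..21)")
cds = splice(genomic, exons)
print(cds, introns(genomic, exons))
```

In this toy case the intron begins with GT and ends with AG, the splice-site rule shown in the gene-structure figure, and the spliced sequence starts with ATG and ends with a stop codon.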

59

60

61

62

63 ASPGC (Academia Sinica Plant Genome Center): Chin-San Chen, Hong-Hwa Chen, Teh-Yuan Chow, Yue-Ie Caroline Hsing, Jei-Fu Shaw, Hong-pang Wu
YMGC (National Yang Ming University Genome Research Center): Kwang-Jen Hsiao, Shih-Feng Peter Tsai
Vita Genomics, Inc.: Ye-Shyon Ellson Chen
ASPGC (Academia Sinica Plant Genome Center): Shu-Chen Chang, Chih-Ying Chao, Ya-Ting Chao, Yua-Li Chen, Tsai-Ru Chen, Shih-Kuang Chen, Hsiu-Chi Chen, Yi-Shin Chen, Chia-Hsiung Cheng, An-Chi Chien, Yu-Ti Chiu, Joseph Chow, Mu-Kuei Chu, Chun-I Chung, Fang-Jung Fan, Shih-yun Han, Yi-Chih Hong, Ai-Ling Hour, Yu-Yun Hsiao, Shin-hsin Hsiao, Jaw-Nan Hsiung, Chih-Hsiung Hsu, Pi-Chu Hsu, Chun-Tung Hsu, Jiann-Jang Huang, Pau-Inn Kau, Pei-Fang Lee, Ming-Chu Lee, Jiunn-Kuan Lee, Hsing-Fang Lee, Huai-Lin Leu, Yen-Fen Li, Ya-Wam Li, Hai-Lun Li, Tao-Yi Lin, Shu-Jen Lin, Yao-Cheng Lin, Huei-Mei Lin, John-Yu Liou, Su-Mei Liu, Po-Chang Lu, Chun-Lin Su, Rou-Fen Wang, Fu-Jin Wei, Shu-wan Wu, Li-Fang Wu, Cheng-Chieh Wu, Chi-Chang Yang, Kong-Chung Yang, Chao-Yuan Yu, Shih-Wen Yu
YMGC: Chung-Yung Chen, Hsiang-Ju Chen, Pei-Wei Chen, Wen-Chun Chen, Ya-Tzu Chen, Ming-Huang Ho, Tsai-Lien Liao, Jeng-Li Lin, Tze-Tze Liu, Yen-Ming Liu, Pei-Wen Lo, Yi-Chien Lu, Chung-Mei Pan, Yi-Ling Sun, Hui-Chi Tsai, Keh-Ming Wu
Vita: Chin Chang, Ea-Gen Chao, Vi-Ann Chao, Grace Chien, Yueh-Mei Chou, Su-Chi Huang, Teng-Yi Tsen, Pei-Feng Wang, Kuang-Den Chen, Yu-Tin Chen, Ying-Ta Lai, Cheng-Hsin Lu, Chun-Lin Su, Chien-Wei Tsai, Chi-Ta Yang

64 Resequencing personal genome project as example

65

66 Sequencing-based genome-wide association study in rice

67 Comparative sequence analysis 1,000 M yr 500 M yr 80 M yr Sequence similarity =~ Functional similarity Phil Hieter

68 Metagenomics

69 Metabolomics information at NCBI

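70 How do we decipher the book of life? 1. How to read the book (technology: sequencing; strategy: DNA assembly) 2. How to interpret the book 3. How to find the gene behind a mutant trait

71 Rice gibberellin-insensitive dwarf mutant gene Dwarf 1 encodes the α-subunit of GTP-binding protein. M. Ashikari et al., PNAS 98: Academician 張德慈 (Te-Tzu Chang) ( )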

72

73 Gene cloning by resequencing and SNP mapping

74 Putative SNP frequency in the candidate region Bulked data of mutated Line1 F 2 generation Bulked data of WT F 2 generation
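One common way to use bulked data like this is to compute, at each putative SNP, the fraction of reads carrying the mutant allele in each bulk, and to look for positions where the mutant bulk is close to 1.0 while the wild-type bulk is clearly lower. The sketch below illustrates that idea only; it is not the analysis used in the talk. The read counts, the thresholds (`min_mutant`, `max_wt`) and the function names are all hypothetical.

```python
def snp_frequency(alt_reads, total_reads):
    """Fraction of reads at a position that carry the non-reference allele."""
    return alt_reads / total_reads if total_reads else 0.0


def candidate_snps(mutant_bulk, wt_bulk, min_mutant=0.9, max_wt=0.5):
    """Positions where the mutant bulk is near-fixed but the WT bulk is not.

    Each bulk maps position -> (alt_reads, total_reads). Thresholds are
    illustrative assumptions, not values from the slides.
    """
    hits = []
    for pos in sorted(mutant_bulk):
        if pos not in wt_bulk:
            continue
        mutant_freq = snp_frequency(*mutant_bulk[pos])
        wt_freq = snp_frequency(*wt_bulk[pos])
        if mutant_freq >= min_mutant and wt_freq <= max_wt:
            hits.append((pos, mutant_freq, wt_freq))
    return hits


# Made-up read counts for three positions in a candidate region.
mutant_f2 = {100: (20, 20), 200: (10, 20), 300: (19, 20)}
wildtype_f2 = {100: (7, 21), 200: (10, 20), 300: (18, 20)}
print(candidate_snps(mutant_f2, wildtype_f2))
```

Position 300 in the example is near-fixed in both bulks, so it is more likely a background difference from the reference than the causal mutation; only positions that separate the two bulks are kept.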

75 Narrow down and touchdown Mutated Line1 F 2 generation Mutated Line2 F 2 generation

76

77

78

79 Gnotobiotics: the study of organisms raised in germ-free conditions

80 2, 4, 5

81

82

83

84

85 Metagenomic sequencing will be done to characterize the microbial communities (bacterial, viral, and phage) from body sites including the oral cavity, skin, vagina, the GI tract, and the nasopharyngeal tract.

86

87

88

89 Thank you