ادخ مانب DNA Sequencing یرهطم لضفلاوبا دیس
|
|
- Marcia Francis
- 5 years ago
- Views:
Transcription
1 بنام خدا DNA Sequencing سید ابوالفضل مطهری
2 Sequencing Problem
3 Sequencing Problem S = X 1 X 2 X 3 X 4...X G 1 X G
4 Sequencing Problem S = X 1 X 2 X 3 X 4...X G 1 X G finding the sequence
5 Sequencing Problem S = X 1 X 2 X 3 X 4...X G 1 X G finding the sequence or finding some parts of the sequence
6 Sequencing Problem S = X 1 X 2 X 3 X 4...X G 1 X G finding the sequence or finding some parts of the sequence DNA Sequencing sequence is the genome
7 Genome
8 Genome Human Genome consists of 23 chromosomes
9 Genome Human Genome consists of 23 chromosomes Each chromosome contains a DNA sequence
10 Genome Human Genome consists of 23 chromosomes Each chromosome contains a DNA sequence The total length is G =
11 Genome
12 DexoyriboNucleic Acid (DNA) Phosphate p Sugar Bases T Thymine A Adenine C G Cytosine Guanine
13 DexoyriboNucleic Acid (DNA) Phosphate Sugar p Covalent p Bases T Thymine A Adenine C G Cytosine Guanine
14 DexoyriboNucleic Acid (DNA) Phosphate p Sugar Covalent p Bases T A Thymine Adenine Hydrogen A T C G G C Cytosine Guanine
15 p Single Stranded DNA
16 Single Stranded DNA p p
17 Single Stranded DNA p p p
18 Single Stranded DNA p p p p
19 Single Stranded DNA p p p p p
20 Single Stranded DNA A G G T C p p p p p TACGG...
21 Fred Sanger Won the Nobel Prize for Chemistry twice Sequencing Insulin DNA Sequencing
22 Phi X 174 Bactriophage First Genome to be Sequenced Genome Size: 5386 bp Fred Sanger (1977)
23 GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGA AGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGC GTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAG CGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCT Phi X 174 Bactriophage CATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGG CTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGG CCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGA AACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTC CGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTA TGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCG CTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATC GTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCAC GATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTA First Genome to be Sequenced ATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTT CTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGA GGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTT Genome Size: 5386 bp TCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAA GAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTC CGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATT Fred Sanger (1977) CAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTT AATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGC CGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGT TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTT CTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGC TTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGAC GTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAG ATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTC AAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCC TGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTG ATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGAT GCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAG AGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGT CTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAA AGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTC GTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTA ATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCA CGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGAT AACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGAT GAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAA AATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTA TTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGT TTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGC CGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCG GACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGC AGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGG CTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTT CAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAA CCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTAT GCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA
24 Lambda Phage Genome Size: 48,502 bp Fred Sanger, et al (1982) Method: Shotgun Sequencing
25 Shotgun Sequencing
26 Reads Shotgun Sequencing
27 Shotgun Sequencing Reads Hearken to the reed-flute
28 Shotgun Sequencing Reads banishment from its home Lamenting its banishment from it complains, Lamenting its Hearken to the reed-flute reed-flute, how it complains,
29 Shotgun Sequencing Reads banishment from its home Lamenting its banishment from it complains, Lamenting its reed-flute, how it complains, Hearken to the reed-flute
30 Shotgun Sequencing Reads banishment from its home Lamenting its banishment from it complains, Lamenting its Hearken to the reed-flute, how it complains,
31 Shotgun Sequencing Reads banishment from its home Lamenting its banishment from Hearken to the reed-flute, how it complains, Lamenting its
32 Shotgun Sequencing Reads banishment from its home Hearken to the reed-flute, how it complains, Lamenting its banishment from
33 Shotgun Sequencing Reads Hearken to the reed-flute, how it complains, it complains, Lamenting its Lamenting its banishment from banishment from its home
34 DNA Shotgun Sequencing Human Genome
35 DNA Shotgun Sequencing {A, C, G, T }
36 DNA Shotgun Sequencing {A, C, G, T } Reads
37 DNA Shotgun Sequencing {A, C, G, T } Reads
38 DNA Shotgun Sequencing ACCGTCCTTAAGG {A, C, G, T } Reads
39 DNA Shotgun Sequencing ACCGTCCTTAAGG Reads {A, C, G, T } ACCGTCGTTAAGG
40 DNA Shotgun Sequencing ACCGTCCTTAAGG Reads {A, C, G, T } ACCGTCGTTAAGG ACCGTCGTTAAGG
41 DNA Shotgun Sequencing {A, C, G, T } Reads GTTTCCTAAGTGG ACTGCTTCGTCCTA ACCGTCGTTAAGG CCTTAAGCGGTTC ACTGCTTCGTCCTA
42 Starting Big Projects
43 Starting Big Projects 1989 Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished 1996
44 Starting Big Projects 1989 Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished E. Coli: 4.6 Mb M. Capricolum: 1 Mb C. elegans: 100 Mb
45 Starting Big Projects 1989 Genome of s. cerevisiae Size of Genome: 12 Mb 16 Chromosomes Finished Human Genome Project (HGP) Size of Genome: 3.2 Gb 23 Chromosomes Finished E. Coli: 4.6 Mb M. Capricolum: 1 Mb C. elegans: 100 Mb
46 Strategy: Hierarchical SG
47 Strategy: Hierarchical SG Random Fragments kbp
48 Strategy: Hierarchical SG Random Fragments kbp Tiling path Physical Map
49 Strategy: Hierarchical SG Random Fragments kbp Tiling path Physical Map 500 bp Shotgun
50 H Influenzae The first sequenced free-living organism Genome Size: 1.8 Mb Strategy: Whole Genome Shotgun Sequencing The Institute for Genomic Research TIGR Finished 1995 Done in 13 months Cost: 50 cents per base
51 Whole Genome Shotgun Sequencing
52 Whole Genome Shotgun Sequencing Shotgun Number of Reads: N Read Length: L ATCCGGTACGGATCGTACA
53 HGP FINISHED
54 There is Never Enough Sequencing courtesy: Batzoglou
55 There is Never Enough Sequencing 100 million species courtesy: Batzoglou
56 There is Never Enough Sequencing 100 million species 7 billion individuals courtesy: Batzoglou
57 There is Never Enough Sequencing 100 million species Somatic mutations (e.g., HIV, cancer) 7 billion individuals courtesy: Batzoglou
58 Sequencing Growth Cost of one human genome 2004: $30,000, : $100, : $10, : $4,000 (today)???: $300 courtesy: Batzoglou
59 Next Generation Sequencing Very High Throughput Higher error rate Shorter Length Very Cheap
60 Computational Challenges courtesy: Batzoglou
61 Computational Challenges How to assemble the reads? courtesy: Batzoglou
62 Computational Challenges How to assemble the reads? For given genome size G, what are the requirements courtesy: Batzoglou
63 Computational Challenges How to assemble the reads? For given genome size G, what are the requirements Number of Reads: N courtesy: Batzoglou
64 Computational Challenges How to assemble the reads? For given genome size G, what are the requirements Number of Reads: N Length of Reads: L courtesy: Batzoglou
65 Computational Challenges How to assemble the reads? For given genome size G, what are the requirements Number of Reads: N Length of Reads: L N Possible or not L courtesy: Batzoglou
66 Coverage Condition
67 Coverage Condition N cov is the minimum number of reads required to cover the sequence
68 Coverage Condition N cov is the minimum number of reads required to cover the sequence N N cov impossible L
69 Repeat Condition courtesy: Tse
70 Repeat Condition L-spectrum is given
71 Repeat Condition L-spectrum is given ATAGCCTAGCGAT
72 Repeat Condition L-spectrum is given ATAGCCTAGCGAT ATAGC TAGCC AGCCT GCCTA CCTAG CTAGC TAGCG AGCGA GCGAT
73 Repeat Condition de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers
74 Repeat Condition de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers CCTA CCCT GCCC CTAG AGCC ATAG TAGC AGCG GCGA CGAT
75 Repeat Condition de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers CCTA CCCT GCCC Find an Eulerian Path CTAG AGCC TAGC ATAG AGCG GCGA CGAT
76 Repeat Condition de Bruijn Graph Add a node for each (L-1)- mer in a read Add edges for adjacent (L-1)-mers CCTA CCCT GCCC Find an Eulerian Path CTAG AGCC ATAGCCCTAGCGAT TAGC ATAG AGCG GCGA CGAT
77 Repeat Condition A B A
78 Repeat Condition A Interleaved repeats B A B A B A
79 Repeat Condition A Interleaved repeats B A B A B Triple repeat A A A A
80 Repeat Condition A Interleaved repeats B A B A B Triple repeat A A A A L crit is the minimum read length where there exists a unique solution for the problem
81 Repeat Condition L>L crit N impossible L crit L
82 Putting together Normalized number of reads N N cov L Normalized Read Length
83 Putting together Normalized number of reads N N cov 1 covered not covered L Normalized Read Length
84 Putting together Normalized number of reads N N cov triple or interleaved repeats no triple or interleaved repeats 1 covered not covered L crit L Normalized Read Length
85 Putting together Normalized number of reads N N cov triple or interleaved repeats no triple or interleaved repeats 1 covered not covered L crit L Normalized Read Length
86 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
87 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
88 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
89 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
90 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
91 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
92 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
93 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
94 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
95 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
96 what about the rest? The Greedy Algorithm Algorithm I. Find two reads with the largest overlap and merge them into one read. II. Repeat 1 until only one contig remains.
97 RANDOM Sequence N N cov 1 L
98 RANDOM Sequence N N cov Is feasible 1 L Motahari, Bresler, Tse (2012)
99 Human Chromosome 19 N N cov 10 Longest triple repeat Longest interleaved repeat 9 Longest repeat L crit (Bresler, Bresler & T. 12)
100 Human Chromosome 19 N N cov 10 Longest triple repeat Longest interleaved repeat 9 Longest repeat Greedy Algorithm 4 3 Modified de Bruijn Algorithm L crit (Bresler, Bresler & T. 12)
101 WHY? (Bresler, Bresler & T. 12)
102 WHY? log ( number of repeats) 15 Human chromosome data 5 i.i.d. fit (Bresler, Bresler & T. 12)