1 In the name of God. DNA Sequencing. Seyed Abolfazl Motahari

6 Sequencing Problem: $S = X_1 X_2 X_3 X_4 \dots X_{G-1} X_G$. Task: finding the sequence, or finding some parts of the sequence. In DNA sequencing, the sequence is the genome.

10 Genome: The human genome consists of 23 chromosomes. Each chromosome contains a DNA sequence. The total length is G =

11 Genome

14 Deoxyribonucleic Acid (DNA). Components: phosphate (p), sugar, and bases. Bases: T (thymine), A (adenine), C (cytosine), G (guanine). Covalent bonds join the phosphates and sugars; hydrogen bonds pair A with T and G with C.

20 Single Stranded DNA. [Figure: a backbone of five phosphate groups (p) carrying the bases A, G, G, T, C; sequence label: TACGG...]

21 Fred Sanger: won the Nobel Prize in Chemistry twice, for sequencing insulin and for DNA sequencing.

22 Phi X 174 Bacteriophage: first genome to be sequenced. Genome size: 5386 bp. Fred Sanger (1977).

23 Phi X 174 Bacteriophage: first genome to be sequenced. Genome size: 5386 bp. Fred Sanger (1977). [Slide background: the genome sequence, beginning GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCG...]

24 Lambda Phage. Genome size: 48,502 bp. Fred Sanger et al. (1982). Method: shotgun sequencing.

28 Shotgun Sequencing. Reads: "banishment from its home"; "Lamenting its banishment from"; "it complains, Lamenting its"; "Hearken to the reed-flute"; "reed-flute, how it complains,"

33 Shotgun Sequencing. Reads arranged by overlap: "Hearken to the reed-flute, how it complains,"; "it complains, Lamenting its"; "Lamenting its banishment from"; "banishment from its home"
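The arrangement of the reed-flute reads relies on suffix-prefix overlaps between reads. The following is a minimal sketch of that overlap computation, assuming exact (error-free) matching; the function name `overlap` is introduced here for illustration and does not come from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


# The five text reads shown on the slides.
reads = [
    "banishment from its home",
    "Lamenting its banishment from",
    "it complains, Lamenting its",
    "Hearken to the reed-flute",
    "reed-flute, how it complains,",
]

# Print every ordered pair of distinct reads that shares a non-empty overlap.
for a in reads:
    for b in reads:
        if a is not b and overlap(a, b) > 0:
            print(f"{a!r} -> {b!r}: overlap {overlap(a, b)}")
```

Chaining the pairs with the largest overlaps places the reads in the order shown on slide 33.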

34 DNA Shotgun Sequencing: Human Genome

40 DNA Shotgun Sequencing. Alphabet: {A, C, G, T}. Fragment: ACCGTCCTTAAGG. Reads: ACCGTCGTTAAGG, ACCGTCGTTAAGG

41 DNA Shotgun Sequencing. Alphabet: {A, C, G, T}. Reads: GTTTCCTAAGTGG, ACTGCTTCGTCCTA, ACCGTCGTTAAGG, CCTTAAGCGGTTC, ACTGCTTCGTCCTA
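The fragment ACCGTCCTTAAGG and the read ACCGTCGTTAAGG shown above are not identical, which may be meant to illustrate a read error (this interpretation is an assumption; the slides do not say so). A small sketch for locating the differing positions, with the helper name `mismatches` being hypothetical:

```python
def mismatches(a: str, b: str) -> list[int]:
    """Return the 0-based positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


fragment = "ACCGTCCTTAAGG"  # sequence shown before the reads appear
read = "ACCGTCGTTAAGG"      # read shown on the same slide

print(mismatches(fragment, read))  # [6]: C in the fragment, G in the read
```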

43 Starting Big Projects. 1989: genome of S. cerevisiae; size of genome: 12 Mb; 16 chromosomes; finished 1996.

45 Starting Big Projects. Human Genome Project (HGP): size of genome: 3.2 Gb; 23 chromosomes; finished. Other genomes: E. coli: 4.6 Mb; M. capricolum: 1 Mb; C. elegans: 100 Mb.

49 Strategy: Hierarchical Shotgun (SG). Random fragments (kbp-scale); physical map and tiling path; shotgun at 500 bp.

50 H. influenzae: the first sequenced free-living organism. Genome size: 1.8 Mb. Strategy: Whole Genome Shotgun Sequencing. The Institute for Genomic Research (TIGR). Finished 1995; done in 13 months. Cost: 50 cents per base.

52 Whole Genome Shotgun Sequencing. Number of reads: N. Read length: L. Example sequence: ATCCGGTACGGATCGTACA
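The whole-genome shotgun setting with N reads of length L can be simulated in a few lines. This is only a sketch under stated assumptions: read start positions are drawn uniformly at random, reads are error-free, the genome is treated as linear, and the short sequence from the slide is used as a toy genome. The function name `shotgun_reads` is hypothetical.

```python
import random


def shotgun_reads(genome: str, n_reads: int, read_len: int, seed: int = 0) -> list[str]:
    """Sample `n_reads` substrings of length `read_len` at uniform random starts."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


genome = "ATCCGGTACGGATCGTACA"  # toy genome taken from the slide
reads = shotgun_reads(genome, n_reads=8, read_len=6)
for r in reads:
    print(r)
```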

53 HGP FINISHED

57 There is Never Enough Sequencing: 100 million species; 7 billion individuals; somatic mutations (e.g., HIV, cancer). Courtesy: Batzoglou

58 Sequencing Growth. Cost of one human genome: 2004: $30,000,...; ...: $100,...; ...: $10,...; ...: $4,000 (today); ???: $300. Courtesy: Batzoglou

59 Next Generation Sequencing: very high throughput; higher error rate; shorter read length; very cheap.

65 Computational Challenges. How to assemble the reads? For a given genome size G, what are the requirements on the number of reads N and the length of reads L? [Plot: N versus L, asking which (N, L) pairs make assembly possible or not.] Courtesy: Batzoglou

68 Coverage Condition: $N_{\mathrm{cov}}$ is the minimum number of reads required to cover the sequence. [Plot: N versus L; the region $N < N_{\mathrm{cov}}$ is marked impossible.]
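The coverage condition can be explored numerically. The sketch below is a Monte Carlo estimate of the probability that N random reads of length L cover a genome of length G. It assumes a circular genome (to avoid edge effects) and uniformly random read starts; both are modelling assumptions made here, and the function names are hypothetical. It does not compute $N_{\mathrm{cov}}$ itself.

```python
import random


def covers_circle(G: int, N: int, L: int, rng: random.Random) -> bool:
    """True if N reads of length L, placed uniformly on a circle of length G, cover it."""
    covered = [False] * G
    for _ in range(N):
        start = rng.randrange(G)
        for offset in range(L):
            covered[(start + offset) % G] = True
    return all(covered)


def coverage_probability(G: int, N: int, L: int, trials: int = 100, seed: int = 0) -> float:
    """Fraction of random trials in which every position is covered."""
    rng = random.Random(seed)
    return sum(covers_circle(G, N, L, rng) for _ in range(trials)) / trials


def uncovered_fraction(G: int, N: int, L: int) -> float:
    """Expected fraction of positions missed: each read misses a given position w.p. 1 - L/G."""
    return (1 - L / G) ** N


# Coverage probability rises as the number of reads grows (G = 1000, L = 100).
for n in (20, 50, 100):
    print(n, coverage_probability(1000, n, 100), round(uncovered_fraction(1000, n, 100), 4))
```

With too few reads (N·L < G) full coverage is impossible, which corresponds to the region below the coverage curve.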

72 Repeat Condition. The L-spectrum is given. Example: the sequence ATAGCCTAGCGAT has the 5-spectrum ATAGC, TAGCC, AGCCT, GCCTA, CCTAG, CTAGC, TAGCG, AGCGA, GCGAT. Courtesy: Tse
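The L-spectrum in this example is the collection of all length-L substrings of the sequence. A minimal sketch (the function name `l_spectrum` is introduced here for illustration); the list is returned in left-to-right order for readability, although the assembler is only given the spectrum as an unordered collection:

```python
def l_spectrum(seq: str, L: int) -> list[str]:
    """All substrings of length L, one per start position."""
    return [seq[i:i + L] for i in range(len(seq) - L + 1)]


spectrum = l_spectrum("ATAGCCTAGCGAT", 5)
print(" ".join(spectrum))
```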

76 Repeat Condition: de Bruijn Graph. Add a node for each (L-1)-mer in a read; add edges for adjacent (L-1)-mers; find an Eulerian path. Nodes: CCTA, CCCT, GCCC, CTAG, AGCC, ATAG, TAGC, AGCG, GCGA, CGAT. Reconstructed sequence: ATAGCCCTAGCGAT
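The de Bruijn construction can be sketched as follows, using the 5-mers of ATAGCCCTAGCGAT (the sequence reconstructed on the slide) as reads. The Eulerian path is found with Hierholzer's algorithm, which is a standard choice made here and not necessarily the method used in the lecture; the code assumes an Eulerian path exists and that reads are error-free. All function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads: list[str]) -> dict[str, list[str]]:
    """One edge per read, from its (L-1)-mer prefix to its (L-1)-mer suffix."""
    graph = defaultdict(list)
    for r in reads:
        graph[r[:-1]].append(r[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_edges = {u: list(vs) for u, vs in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, vs in graph.items():
        balance[u] += len(vs)
        for v in vs:
            balance[v] -= 1
    # Start at the node with one surplus outgoing edge, else at any node.
    start = next((u for u, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]


def spell(path: list[str]) -> str:
    """Concatenate the first node with the last letter of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATAGC", "TAGCC", "AGCCC", "GCCCT", "CCCTA",
         "CCTAG", "CTAGC", "TAGCG", "AGCGA", "GCGAT"]
assembled = spell(eulerian_path(de_bruijn(reads)))
print(assembled)  # ATAGCCCTAGCGAT
```

The node TAGC is visited twice because TAGC occurs twice in the sequence; the path is still unique here because the reads (L = 5) are longer than that repeat.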

80 Repeat Condition. [Diagrams: interleaved repeats (A ... B ... A ... B) and a triple repeat (A ... A ... A).] $L_{\mathrm{crit}}$ is the minimum read length for which the problem has a unique solution.
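Repeat lengths of the kind named on this slide can be measured directly on a small sequence. The brute-force sketch below finds the longest substring that occurs at least a given number of times (2 for an ordinary repeat, 3 for a triple repeat). It does not detect interleaved repeats and is only practical for short sequences; the function name `longest_repeat` is hypothetical.

```python
from collections import Counter


def longest_repeat(seq: str, min_count: int = 2) -> int:
    """Length of the longest substring occurring at least `min_count` times (0 if none)."""
    best = 0
    for length in range(1, len(seq)):
        counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
        if max(counts.values()) >= min_count:
            best = length
        else:
            break  # a repeat of length n always contains one of length n - 1
    return best


seq = "ATAGCCCTAGCGAT"
print(longest_repeat(seq))               # 4 (TAGC occurs twice)
print(longest_repeat(seq, min_count=3))  # 1 (only single letters occur three times)
```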

81 Repeat Condition: $L > L_{\mathrm{crit}}$. [Plot: N versus L; the region to the left of $L_{\mathrm{crit}}$ is marked impossible.]

85 Putting together. [Plot: normalized number of reads $N / N_{\mathrm{cov}}$ versus normalized read length L, with $L_{\mathrm{crit}}$ marked. Vertical axis: not covered below 1, covered above 1. Horizontal axis: triple or interleaved repeats to the left of $L_{\mathrm{crit}}$, no triple or interleaved repeats to the right.]

86 What about the rest? The Greedy Algorithm: I. Find the two reads with the largest overlap and merge them into one read. II. Repeat step I until only one contig remains.
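The two steps of the greedy algorithm can be sketched as below. This is a simple quadratic-time illustration that assumes error-free reads and exact suffix-prefix overlaps; it stops early if no overlapping pair remains. The three reads are hypothetical fragments of ATAGCCCTAGCGAT chosen for the example, and the function names are not from the slides.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> list[str]:
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlapping pair left; several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["CTAGCGAT", "ATAGCCCT", "GCCCTAGC"]  # hypothetical example reads
print(greedy_assemble(reads))  # ['ATAGCCCTAGCGAT']
```

Because the merge order is driven only by overlap length, long repeats can mislead this procedure, which is the limitation the repeat condition describes.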

98 Random Sequence. [Plot: $N / N_{\mathrm{cov}}$ versus L, with the feasible region marked.] Motahari, Bresler, Tse (2012)

100 Human Chromosome 19. [Plot: $N / N_{\mathrm{cov}}$ versus L, with curves for the Greedy Algorithm and the Modified de Bruijn Algorithm; marked read lengths: longest repeat, longest interleaved repeat, longest triple repeat, $L_{\mathrm{crit}}$.] (Bresler, Bresler & T. 12)

102 Why? [Plot: log(number of repeats) for human chromosome data, compared with an i.i.d. fit.] (Bresler, Bresler & T. 12)