Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

Size: px
Start display at page:

Download "Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I"

Transcription

1 Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I Dr David Studholme. 18 th February BIO1033 theme lecture February

2 28 February 2

3 (Head of Bioinformatics) (Computational biologist) (Post-doc) (Post-doc) (PhD) 28 February 3

4 28 February 4

5 28 February 5

6 The plan Part I: bacteria Part II: eukaryotes 28 February 6

7 What we will cover in the next hour Examples of how we use bacterial genomes Short-read sequence data Alignment of short reads against reference Calling single-nucleotide variation Assembly? 28 February 7

8 SINGLE-NUCLEOTIDE VARIATION 28 February 8

9 Example: fowl typhoid 28 February 9

10 Salmonella enterica subspecies enterica serotype Gallinarum FOWL TYPHOID Photo credit: 28 February 10

11 Live vaccines: Salmonella strain SG9R 28 February 11

12 What do we know about attenuated strain SG9R? SG9 (wild type) SG9R (vaccine) Pathogenic Attenuated Smooth colonies Rough colonies rfaj TCA [Ser] rfaj TAA [Stop] 28 February 12

13 Outbreaks linked to the vaccine strain SG9R? Spleen, Farm A Faeces, Farm H Faeces, Farm A Vaccine MLVA (multi-locus variable number of tandem repeats analysis) 28 February 13

14 Whole-genome sequencing of Salmonella Gallinarum Strain Comments Depth of coverage SG9 Wild type, UK, x SG9Ra Vaccine 342 x SG9Rb Vaccine, x MB4523 Outbreak, Belgium, x Illumina GAIIx paired reads ( nt) Photo: 28 February 14

15 Single-nucleotide differences SG9 wild type SG 287/ SG9R vaccine 9 MB4523 outbreak 28 February 15

16 Single-nucleotide substitutions: Basis of attenuation? 28 February 16

17 Single-nucleotide substitutions: Basis of attenuation 28 February 17

18 Single-nucleotide substitutions: Reversal of attenuation? vaccine outbreak 28 February 18

19 Fowl typhoid project - conclusions Concern over undefined live vaccines Possibility of reversal by a few substitutions 28 February 19

20 Example II KIWIFRUIT CANKER 28 February 20

21 Example: kiwi-fruit canker 28 February 21

22 28 February 22

23 Genetic relationships between PSA outbreaks 28 February 23

24 Genetic relationships between PSA outbreaks 28 February 24

25 Genetic relationships between PSA outbreaks 28 February 25

26 Genetic relationships between PSA outbreaks (single-nucleotide differences) 28 February 26

27 Example III BANANA XANTHOMONAS WILT 28 February 27

28 Example: Banana Xanthomonas wilt 28 February 28

29 Example: banana Xanthomonas wilt 28 February 29

30 Causal agent of BXW disease: Xanthomonas campestris pv. musacearum (Xcm) 28 February 30

31 28 February 31

32 BXW is a recently emerging disease February 32

33 Xcm is closely related to X. vasicola 28 February 33

34 28 February 34

35 Genetic homogeneity 28 February 35

36 Whole-genome sequencing of Xvm isolates 28 February 36

37 Xvm isolates fall into two major sequence types (based on single-nucleotide differences) NCPPB4379 Uganda (Kayunga) 2007 NCPPB4394 Tanzania 2007 NCPPB4433 Burundi 2008 NCPPB4434 Kenya NCPPB4383 NCPPB4384 Uganda (Wakiso) 2007 Uganda (Nakasongola) 2007 Red type Xcm Sub-lineage II 100 NCPPB4380 Uganda (Kiboga) % identical NCPPB4381 Uganda (Luwero) 2007 NCPPB4395 Tanzania NCPPB4392 Tanzania NCPPB2005 Ethiopia 1967 (Enset) NCPPB2251 Ethiopia 1969 NCPPB4387 D. R. Congo 2007 Blue type Xcm Sub-lineage I 99 NCPPB4389 Rwanda February 37

38 The spread of banana Xanthomonas wilt disease (BXW) 28 February 38

39 The spread of banana Xanthomonas wilt disease (BXW) 28 February 39

40 Exemplar projects: single-nucleotide variation 28 February 40

41 28 February 41

42 SINGLE NUCLEOTIDE VARIATION: HOW WE DO IT 28 February 42

43 28 February 43

44 28 February 44

45 28 February 45

46 28 February 46

47 ... previously Now (in Exeter) February 47

48 28 February 48

49 How much data? 28 February 49

50 Shotgun DNA sequencing Genomic DNA Fragmented DNA 28 February 50

51 DNA sequencing 5' 3' 3' 5' 900 bp Sanger 100 bp (Illumina) 450 bp (Roche 454) 28 February 51

52 DNA sequencing: paired reads 5' 3' 3' 5' 900 bp Sanger 100 bp (Illumina) 450 bp (Roche 454) 28 February 52

53 What do the data look like? 28 February 53

54 What do the data look like? 28 February 54

55 FastQ format 28 February 55

56 FastQ format 1. Title line 2. Sequence line 4. Quality line 28 February 56

57 Quality scores encoded in ASCII 28 February 57

58 Alignment of sequence reads versus a reference genome sequence 28 February 58

59 Alignment of sequence reads versus a reference genome sequence 28 February 59

60 DNA sequencing: paired reads 5' 3' 3' 5' 900 bp Sanger 100 bp (Illumina) 450 bp (Roche 454) 28 February 60

61 Alignment of sequence reads versus a reference genome sequence 28 February 61

62 Visualisation of alignments 28 February 62

63 The alignment-tool deluge 28 February 63

64 My favourite short-read aligner 28 February 64

65 Short-read alignment: Some considerations How to handle non-unique matches? Mask repetitive sequences? Splicing-aware? 28 February 65

66 Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet : doi: /nrg February 66

67 28 February 67

68 Alignment of sequence reads versus a reference genome sequence 28 February 68

69 SAMtools Pileup format 28 February 69

70 SNP calling Assuming that false positives follow a binomial distribution with a 1.00 % probability of a single base-call being incorrect. 28 February 70

71 A word about SNP calling Base pairs Unambiguous Ambiguous Unambiguous: 95% consensus in each dataset >= 10 x depth in each dataset 28 February 71

72 Ambiguous single-nucleotide variation 28 February 72

73 GENE CONTENT 28 February 73

74 Gene content: global overview Xcm Xcm Xoo 28 February 74

75 28 February 75

76 Incorporating short-read assembly 28 February 76

77 Alignment is safer than assembly 28 February 77

78 Evolutionary events inferred from comparative genomics 28 February 78

79 Gene presence/absence These genes are unique to the banana pathogenic bacterial isolates 28 February 79

80 Inferring gene content from alignment 28 February 80

81 Gene content workflow Alignment 1 Alignment 2 Alignment 3 Alignment 4 CovereageBed CovereageBed CovereageBed CovereageBed Spreadsheet 1 Spreadsheet 2 Spreadsheet 3 Spreadsheet 4 Custom Perl script R, pheatmap Heatmap 28 February 81

82 DE NOVO ASSEMBLY 28 February 82

83 Photo credit: Ben Casey 28 February 83

84 Sequencing and assembling a genome 28 February 84

85 Shotgun DNA sequencing Genomic DNA Fragmented DNA 28 February 85

86 DNA sequencing 5' 3' 3' 5' 900 bp Sanger 100 bp (Illumina) 450 bp (Roche 454) 28 February 86

87 DNA sequencing: paired reads 5' 3' 3' 5' 900 bp Sanger 100 bp (Illumina) 450 bp (Roche 454) 28 February 87

88 The de novo sequence assembly problem 28 February 88

89 28 February 89

90 28 February 90

91 28 February 91

92 28 February 92

93 28 February 93

94 Illumina sequence read 100 bp Sanger / capillary sequence read 900 bp Human genome 3,200,000 bp 28 February 2014 Wheat genome 16,000,000,000 94

95 GATGGATAAGTTTTCTGACA CTGAATACAGGGATGTCTAT TTTTCTGACAAGCACTTCAG AAGCACTTCAGGGCTGAGCA TCAGGGCTGAGCATCCTGAA AGGGCTGAGCATCCTGAATA GACGATTTGATGGATAAGTT GGGATGTCTATCCGGAGGAA TCCGGAGGAATGTTCTGCCA Sequence reads GACGATTTGATGGATAAGTT GATGGATAAGTTTTCTGACA TTTTCTGACAAGCACTTCAG AAGCACTTCAGGGCTGAGCA TCAGGGCTGAGCATCCTGAA AGGGCTGAGCATCCTGAATA CTGAATACAGGGATGTCTAT GGGATGTCTATCCGGAGGAA TCCGGAGGAATGTTCTGCCA Contiguous sequence ( contig ) GACGATTTGATGGATAAGTTTTCTGACAAGCACTTCAGGGCTGAGCATCCTGAATACAGGGATGTCTATCCGGAGGAATGTTCTGCCA 28 February 95

96 The greedy algorithm (for de novo sequence assembly) 28 February 96

97 Simple greedy approach 28 February 97

98 Simple greedy approach 28 February 98

99 Simple greedy approach 28 February 99

100 Simple greedy approach 28 February 100

101 Simple greedy approach...acaggaggt GAGGTCCAGA......ACAGGAGGTCCAGA February 101

102 Simple greedy approach 28 February 102

103 Simple greedy approach 28 February 103

104 Simple greedy approach 28 February 104

105 Simple greedy approach 28 February 105

106 Simple greedy approach 28 February 106

107 Simple greedy approach 28 February 107

108 Simple greedy approach 28 February 108

109 Simple greedy approach 28 February 109

110 Simple greedy approach 28 February 110

111 The problem of repeat sequences 28 February 111

112 The problem of repetitive sequences CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATTGA 28 February 112

113 The problem of repetitive sequences CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATTGA CGCATATATATATAT TATATATATATATATA ATATATATATATATA CGCGCATATATATAT TATATATATATATATA TATATATATATATATA TATATATATATATATA TATATGCCGATT ATGCCGATTGA 28 February 113

114 The problem of repetitive sequences TATATATATATATATA TATATATATATATATA ATATATATATATATA CGCGCATATATATAT ATGCCGATTGA CGCATATATATATAT TATATGCCGATT TATATATATATATATA TATATATATATATATA 28 February 114

115 The problem of repetitive sequences TATATATATATATATA TATATATATATATATA ATATATATATATATA CGCGCATATATATAT ATGCCGATTGA CGCATATATATATAT TATATGCCGATT TATATATATATATATA TATATATATATATATA CGCGCATATATATATATATATATGCCGATTGA 28 February 115

116 The problem of repetitive sequences CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATTGA CGCGCATATATATATATATATATGCCGATTGA 28 February 116

117 OLC and k-mer graphs GRAPH-BASED METHODS 28 February 117

118 What is a graph? undirected directed 28 February 118

119 OVERLAP LAYOUT CONSENSUS (OLC) 28 February 119

120 Overlap-consensus-Layout (OLC) ATGCCGTTGAACTTCGTTGAACACATGGTCATAC Genome sequence ATGCCGTT GCCGTTGAA GAACACATGG GAACTTCGTTGA CACATGGTCAT TTGAACACAT Sequence reads 28 February 120

121 Overlap-consensus-Layout (OLC) ATGCCGTT GCCGTTGAA GAACACATGG CACATGGTCAT GAACTTCGTTGA TTGAACACAT 28 February 121

122 Is there a Hamiltonian path? (passes through every node exactly once) ATGCCGTT GCCGTTGAA GAACACATGG CACATGGTCAT GAACTTCGTTGA TTGAACACAT 28 February 122

123 Hamiltonian path ATGCCGTT GCCGTTGAA GAACACATGG CACATGGTCAT GAACTTCGTTGA TTGAACACAT ATGCCGTT GCCGTTGAA GAACTTCGTTGA TTGAACACAT GAACACATGG CACATGGTCAT ATGCCGTTGAACTTCGTTGAACACATGGTCAT 28 February 123

124 OLC requires all-versus-all comparisons ATGCCGTT GCCGTTGAA CACATGGTCAT GAACACATGG TTGAACACAT GAACTTCGTTGA 28 February 124

125 K-MER GRAPH METHOD 28 February 125

126 Genome size = 2,250,000,000 bp Average read length = 52 bp 176 Gb of sequence data 28 February 126

127 Building the k-mer graph aaccgg aacc accg ccgg K = 4 28 February 127

128 Building the k-mer graph 28 February 128

129 Reconstructing the original sequence from the k-mer graph Eulerian path: passes through every edge at least once 28 February 129

130 Finding the Eulerian path in a k-mer graph Ideal data Real data 28 February 130

131 From contigs to scaffolds 28 February 131

132 GATGGATAAGTTTTCTGACA CTGAATACAGGGATGTCTAT TTTTCTGACAAGCACTTCAG AAGCACTTCAGGGCTGAGCA TCAGGGCTGAGCATCCTGAA AGGGCTGAGCATCCTGAATA GACGATTTGATGGATAAGTT GGGATGTCTATCCGGAGGAA TCCGGAGGAATGTTCTGCCA Sequence reads GACGATTTGATGGATAAGTT GATGGATAAGTTTTCTGACA TTTTCTGACAAGCACTTCAG AAGCACTTCAGGGCTGAGCA TCAGGGCTGAGCATCCTGAA AGGGCTGAGCATCCTGAATA CTGAATACAGGGATGTCTAT GGGATGTCTATCCGGAGGAA TCCGGAGGAATGTTCTGCCA GACGATTTGATGGATAAGTTTTCTGACAAGCACTTCAGGGCTGAGCATCCTGAATACAGGGATGTCTATCCGGAGGAATGTTCTGCCA Contiguous sequence ( contig ) 28 February 132

133 Sequence reads Contig assembly Contig 1 Contig2 Contig3 Contig6 Contig4 Contig5 Contig3 Contig 1 Contig5 Contig2 Contig4 Contig6 Scaffold 28 February 133

134 Paired-end sequencing 5' 3' 3' 5' 900 bp Sanger 100 bp (Illumina) 450 bp (Roche 454) 500 bp 50 kbp 28 February 134

135 Scaffolding (using paired reads) 28 February 135

136 Software: assembly 28 February 136

137 28 February 137