Bioinformatic analysis of Illumina sequencing data for comparative genomics, Part I. Dr David Studholme. 18th February 2014. BIO1033 theme lecture. 28 February 2014 @davidjstudholme
Career timeline, 1996 to 2014: PhD; post-doc; post-doc; computational biologist; Head of Bioinformatics
The plan. Part I: bacteria. Part II: eukaryotes.
What we will cover in the next hour: examples of how we use bacterial genomes; short-read sequence data; alignment of short reads against a reference; calling single-nucleotide variation; assembly?
SINGLE-NUCLEOTIDE VARIATION
Example: fowl typhoid
Salmonella enterica subspecies enterica serotype Gallinarum: FOWL TYPHOID. Photo credit: http://www.cfsph.iastate.edu/diseaseinfo/disease-images.php?name=fowl-typhoid
Live vaccines: Salmonella strain SG9R
What do we know about attenuated strain SG9R?
SG9 (wild type): pathogenic; smooth colonies; rfaJ TCA [Ser]
SG9R (vaccine): attenuated; rough colonies; rfaJ TAA [Stop]
Outbreaks linked to the vaccine strain SG9R? MLVA (multi-locus variable number of tandem repeats analysis) of isolates from spleen (Farm A), faeces (Farm H), faeces (Farm A) and the vaccine.
Whole-genome sequencing of Salmonella Gallinarum (Illumina GAIIx paired reads, 36-100 nt)
Strain | Comments | Depth of coverage
SG9 | Wild type, UK, 1955 | 81x
SG9Ra | Vaccine | 342x
SG9Rb | Vaccine, 2009 | 412x
MB4523 | Outbreak, Belgium, 2009 | 390x
Photo: http://www.timpestridge.co.uk
Single-nucleotide differences between SG9 (wild type), SG 287/91, SG9R (vaccine) and MB4523 (outbreak). Numbers shown on the figure: 281, 1, 2, 1, 9
Single-nucleotide substitutions: Basis of attenuation?
Single-nucleotide substitutions: Basis of attenuation
Single-nucleotide substitutions: Reversal of attenuation? (vaccine versus outbreak)
Fowl typhoid project, conclusions: concern over undefined live vaccines; possibility of reversal by a few substitutions
Example II: KIWIFRUIT CANKER
Example: kiwifruit canker
Genetic relationships between PSA outbreaks (single-nucleotide differences)
Example III: BANANA XANTHOMONAS WILT
Example: banana Xanthomonas wilt
Causal agent of BXW disease: Xanthomonas campestris pv. musacearum (Xcm)
BXW is a recently emerging disease. Years shown on the figure: 1968, 2001, 2005, 2006, 2007, 2007, 2007
Xcm is closely related to X. vasicola
Genetic homogeneity
Whole-genome sequencing of Xvm isolates
Xvm isolates fall into two major sequence types (based on single-nucleotide differences)
Isolates in the tree: NCPPB4379, Uganda (Kayunga), 2007; NCPPB4394, Tanzania, 2007; NCPPB4433, Burundi, 2008; NCPPB4434, Kenya, 2008; NCPPB4383, Uganda (Wakiso), 2007; NCPPB4384, Uganda (Nakasongola), 2007; NCPPB4380, Uganda (Kiboga), 2007; NCPPB4381, Uganda (Luwero), 2007; NCPPB4395, Tanzania, 2007; NCPPB4392, Tanzania, 2007; NCPPB2005, Ethiopia, 1967 (Enset); NCPPB2251, Ethiopia, 1969; NCPPB4387, D. R. Congo, 2007; NCPPB4389, Rwanda, 2007
Clade labels: Red type (Xcm Sub-lineage II); Blue type (Xcm Sub-lineage I). Annotation: 99.9985% identical. Numbers on branches: 91, 100, 99, 100, 99. Scale bar: 0.1
The spread of banana Xanthomonas wilt disease (BXW)
Exemplar projects: single-nucleotide variation
SINGLE NUCLEOTIDE VARIATION: HOW WE DO IT
...previously. Now (in Exeter)...
How much data?
Shotgun DNA sequencing: genomic DNA is broken into fragmented DNA
DNA sequencing read lengths: 900 bp (Sanger); 450 bp (Roche 454); 100 bp (Illumina)
DNA sequencing: paired reads. 900 bp (Sanger); 450 bp (Roche 454); 100 bp (Illumina)
What do the data look like?
FastQ format
FastQ format: 1. Title line; 2. Sequence line; 3. Separator line (+); 4. Quality line
Quality scores encoded in ASCII
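A minimal sketch of reading FastQ records and decoding the quality line. It assumes the Sanger / Illumina 1.8+ convention, in which each ASCII character encodes a Phred score offset by 33 (older Illumina pipelines used an offset of 64). The function names and the example record are illustrative and are not taken from the lecture.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def parse_fastq(lines):
    """Yield (title, sequence, quality string) from 4-line FastQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines in fours
    for title, seq, _separator, qual in zip(it, it, it, it):
        yield title.strip()[1:], seq.strip(), qual.strip()


def decode_quality(qual, offset=PHRED_OFFSET):
    """Convert ASCII quality characters to integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical record, for illustration only
EXAMPLE = "@read1\nGATTACA\n+\nIIII!!#\n"

records = list(parse_fastq(EXAMPLE.splitlines()))
for title, seq, qual in records:
    print(title, seq, decode_quality(qual))
```

Under this encoding a base call with Q = 20 has a 1 in 100 chance of being wrong, and Q = 40 a 1 in 10,000 chance.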
Alignment of sequence reads versus a reference genome sequence
Visualisation of alignments
The alignment-tool deluge
My favourite short-read aligner
Short-read alignment, some considerations: How to handle non-unique matches? Mask repetitive sequences? Splicing-aware?
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011;12:443-51. doi: 10.1038/nrg2986
SAMtools Pileup format
SNP calling: false positives are assumed to follow a binomial distribution, with a 1.00% probability of any single base call being incorrect.
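As a rough illustration of that assumption, the sketch below computes the binomial probability that at least k of the n base calls covering a site are wrong purely through sequencing error. The function name and the example depths are hypothetical. It treats any erroneous call as counting towards the variant, so it is conservative compared with requiring every error to produce the same wrong base.

```python
from math import comb


def prob_at_least(k, n, p=0.01):
    """P(X >= k) for X ~ Binomial(n, p): k or more erroneous calls out of n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative depths only: chance that errors alone explain a variant
# supported by at least 95% of the reads covering a site.
for depth in (10, 20, 40):
    supporting = -(-95 * depth // 100)  # ceiling of 95% of the depth
    print(depth, supporting, prob_at_least(supporting, depth))
```

With a 1% per-base error rate, the chance that nearly all reads at a well-covered site are wrong in this way is vanishingly small, which is why high-consensus calls at adequate depth can be trusted.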
A word about SNP calling: base pairs are classed as unambiguous or ambiguous. Unambiguous: 95% consensus in each dataset and >= 10 x depth in each dataset.
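The two criteria can be expressed as a small filter. This is a sketch under stated assumptions: the input layout (one dictionary of base counts per dataset for a given site) and the function name are hypothetical, and "95% consensus" is read here as "the most common base accounts for at least 95% of the calls".

```python
def is_unambiguous(counts_per_dataset, min_depth=10, min_consensus=0.95):
    """True if every dataset has enough depth and a clear majority base."""
    for counts in counts_per_dataset:
        depth = sum(counts.values())
        if depth < min_depth:
            return False  # too few reads cover this site
        if max(counts.values()) / depth < min_consensus:
            return False  # reads disagree too much
    return True


# Hypothetical site covered in two datasets
print(is_unambiguous([{"A": 20}, {"G": 40, "A": 1}]))  # deep, clear consensus
print(is_unambiguous([{"A": 15, "G": 5}]))             # mixed calls
```

Sites failing either test in any dataset would be set aside as ambiguous instead of being reported as single-nucleotide differences.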
Ambiguous single-nucleotide variation
GENE CONTENT
Gene content: global overview (Xcm, Xcm, Xoo)
Incorporating short-read assembly
Alignment is safer than assembly
Evolutionary events inferred from comparative genomics
Gene presence/absence: these genes are unique to the banana-pathogenic bacterial isolates
Inferring gene content from alignment
Gene content workflow: each alignment (Alignment 1 to 4) is passed through coverageBed to give one spreadsheet per alignment (Spreadsheet 1 to 4); a custom Perl script combines the spreadsheets; R (pheatmap) draws the heatmap.
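The combining step might look something like the sketch below. It is written in Python instead of Perl and is not the lecture's script. It assumes tab-separated coverageBed output in which the gene name is the fourth column and the fraction of the gene covered by reads is the last column. The sample names and values are invented for illustration.

```python
import csv
import io


def breadth_by_gene(coverage_text):
    """Map gene name (column 4) to fraction of bases covered (last column)."""
    reader = csv.reader(io.StringIO(coverage_text), delimiter="\t")
    return {row[3]: float(row[-1]) for row in reader if row}


def combine(samples):
    """Build a header and rows of [gene, breadth in sample 1, sample 2, ...]."""
    tables = {name: breadth_by_gene(text) for name, text in samples.items()}
    genes = sorted(set().union(*tables.values()))
    header = ["gene"] + list(tables)
    rows = [[g] + [tables[name].get(g, 0.0) for name in tables] for g in genes]
    return header, rows


# Hypothetical per-alignment tables; real ones would be read from files
SAMPLES = {
    "isolate1": "chr\t0\t900\tgeneA\t120\t900\t900\t1.0\n"
                "chr\t1000\t1600\tgeneB\t0\t0\t600\t0.0\n",
    "isolate2": "chr\t0\t900\tgeneA\t95\t450\t900\t0.5\n"
                "chr\t1000\t1600\tgeneB\t60\t600\t600\t1.0\n",
}

header, rows = combine(SAMPLES)
for row in [header] + rows:
    print("\t".join(str(x) for x in row))
```

The resulting gene-by-isolate matrix is the kind of table a heatmap function can plot directly: a gene with breadth near 1 is present in that isolate, and a gene with breadth near 0 is likely absent.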
DE NOVO ASSEMBLY
Photo credit: Ben Casey http://commons.wikimedia.org/wiki/file:dna_alignment_written_in_paper.jpg
Sequencing and assembling a genome
The de novo sequence assembly problem
Illumina sequence read: 100 bp. Sanger / capillary sequence read: 900 bp. Human genome: 3,200,000,000 bp. Wheat genome: 16,000,000,000 bp.
Sequence reads: GATGGATAAGTTTTCTGACA CTGAATACAGGGATGTCTAT TTTTCTGACAAGCACTTCAG AAGCACTTCAGGGCTGAGCA TCAGGGCTGAGCATCCTGAA AGGGCTGAGCATCCTGAATA GACGATTTGATGGATAAGTT GGGATGTCTATCCGGAGGAA TCCGGAGGAATGTTCTGCCA
Sequence reads, ordered by overlap: GACGATTTGATGGATAAGTT GATGGATAAGTTTTCTGACA TTTTCTGACAAGCACTTCAG AAGCACTTCAGGGCTGAGCA TCAGGGCTGAGCATCCTGAA AGGGCTGAGCATCCTGAATA CTGAATACAGGGATGTCTAT GGGATGTCTATCCGGAGGAA TCCGGAGGAATGTTCTGCCA
Contiguous sequence (contig): GACGATTTGATGGATAAGTTTTCTGACAAGCACTTCAGGGCTGAGCATCCTGAATACAGGGATGTCTATCCGGAGGAATGTTCTGCCA
The greedy algorithm (for de novo sequence assembly)
Simple greedy approach: ...ACAGGAGGT and GAGGTCCAGA... overlap and merge to give ...ACAGGAGGTCCAGA...
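A minimal sketch of a greedy assembler: repeatedly find the pair of sequences with the longest suffix-to-prefix overlap and merge them, until no overlap of at least a chosen minimum length remains. The function names and the minimum overlap of 5 are assumptions made for illustration. Real assemblers also have to cope with sequencing errors, reverse-complement reads and repeats, none of which this sketch handles. The nine reads and the expected contig are the ones shown on the earlier contig slide.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Merge the best-overlapping pair until no overlap >= min_len is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


READS = [
    "GATGGATAAGTTTTCTGACA", "CTGAATACAGGGATGTCTAT", "TTTTCTGACAAGCACTTCAG",
    "AAGCACTTCAGGGCTGAGCA", "TCAGGGCTGAGCATCCTGAA", "AGGGCTGAGCATCCTGAATA",
    "GACGATTTGATGGATAAGTT", "GGGATGTCTATCCGGAGGAA", "TCCGGAGGAATGTTCTGCCA",
]
CONTIG = ("GACGATTTGATGGATAAGTTTTCTGACAAGCACTTCAGGGCTGAGCATCCTGAATACAGGG"
          "ATGTCTATCCGGAGGAATGTTCTGCCA")

print(greedy_assemble(["ACAGGAGGT", "GAGGTCCAGA"]))
print(greedy_assemble(READS) == [CONTIG])
```

Every round compares every sequence against every other, which is affordable for nine reads but not for the billions produced by an Illumina run.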
The problem of repetitive sequences
Genome sequence: CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATTGA
Sequence reads: CGCATATATATATAT TATATATATATATATA ATATATATATATATA CGCGCATATATATAT TATATATATATATATA TATATATATATATATA TATATATATATATATA TATATGCCGATT ATGCCGATTGA
Assembly (repeat collapsed): CGCGCATATATATATATATATATGCCGATTGA
GRAPH-BASED METHODS: OLC and k-mer graphs
What is a graph? Undirected versus directed. http://en.wikipedia.org/wiki/graph_%28mathematics%29
OVERLAP LAYOUT CONSENSUS (OLC)
Overlap-Layout-Consensus (OLC). Genome sequence: ATGCCGTTGAACTTCGTTGAACACATGGTCATAC. Sequence reads: ATGCCGTT GCCGTTGAA GAACACATGG GAACTTCGTTGA CACATGGTCAT TTGAACACAT
Is there a Hamiltonian path (one that passes through every node exactly once)?
Hamiltonian path: ATGCCGTT, GCCGTTGAA, GAACTTCGTTGA, TTGAACACAT, GAACACATGG, CACATGGTCAT. Consensus: ATGCCGTTGAACTTCGTTGAACACATGGTCAT
OLC requires all-versus-all comparisons
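A small sketch of the overlap and layout steps for the six reads in the OLC example. Each read is a node; a directed edge joins read a to read b when a suffix of a matches a prefix of b. The minimum overlap of 3 and the brute-force search over all orderings are assumptions chosen to keep the example short, and are only feasible for a handful of reads.

```python
from itertools import permutations

READS = ["ATGCCGTT", "GCCGTTGAA", "GAACACATGG",
         "GAACTTCGTTGA", "CACATGGTCAT", "TTGAACACAT"]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """All-versus-all comparison: returns edges and the number of comparisons."""
    edges, comparisons = {}, 0
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                comparisons += 1
                n = overlap(a, b, min_len)
                if n:
                    edges[(a, b)] = n
    return edges, comparisons


def hamiltonian_paths(reads, edges):
    """Orderings that visit every read once, following overlap edges only."""
    return [p for p in permutations(reads)
            if all((a, b) in edges for a, b in zip(p, p[1:]))]


def consensus(path, edges):
    """Spell out the sequence implied by a path through the overlap graph."""
    seq = path[0]
    for a, b in zip(path, path[1:]):
        seq += b[edges[(a, b)]:]
    return seq


edges, comparisons = overlap_graph(READS)
print("pairwise comparisons:", comparisons)
for path in hamiltonian_paths(READS, edges):
    print(consensus(path, edges))
```

Six reads already need 30 ordered comparisons, and the count grows with the square of the number of reads. That cost is the main reason OLC suits long-read data better than very large short-read datasets.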
K-MER GRAPH METHOD
Genome size = 2,250,000,000 bp; average read length = 52 bp; 176 Gb of sequence data
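A quick worked calculation of what those figures imply, taking 1 Gb to mean 10^9 bases (an assumption about the units on the slide). Depth of coverage is the total number of sequenced bases divided by the genome size, and the number of reads is the total divided by the average read length.

```python
genome_size_bp = 2_250_000_000
read_length_bp = 52
total_bases = 176_000_000_000  # 176 Gb, assuming 1 Gb = 10^9 bases

depth = total_bases / genome_size_bp    # average depth of coverage
n_reads = total_bases / read_length_bp  # approximate number of reads

print(f"Average depth of coverage: {depth:.1f}x")
print(f"Approximate number of reads: {n_reads:.2e}")
```

That is roughly 78-fold coverage spread over more than three billion reads, far too many for all-versus-all overlap comparison.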
Building the k-mer graph: with K = 4, the sequence aaccgg yields the k-mers aacc, accg and ccgg
Reconstructing the original sequence from the k-mer graph. Eulerian path: passes through every edge exactly once
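A minimal sketch of the idea, using the aaccgg example with K = 4. It assumes the common de Bruijn formulation in which each k-mer is an edge joining its first K-1 bases to its last K-1 bases. The function names are illustrative, and the path search is a plain Hierholzer-style traversal that expects error-free data with an Eulerian path present.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(kmer_list):
    """Each k-mer becomes an edge from its prefix to its suffix (k-1 bases)."""
    graph = defaultdict(list)
    for km in kmer_list:
        graph[km[:-1]].append(km[1:])
    return graph


def eulerian_path(graph):
    """Walk every edge exactly once (assumes such a path exists)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if any
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1),
                 next(iter(graph)))
    edges = {n: list(v) for n, v in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """First node in full, then the last base of each following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = build_graph(kmers("aaccgg", 4))
print(dict(graph))
print(spell(eulerian_path(graph)))
```

Building the graph needs only one pass over the k-mers, with no read-against-read comparison, which is what makes the approach practical for very large short-read datasets.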
Finding the Eulerian path in a k-mer graph: ideal data versus real data
From contigs to scaffolds
Sequence reads are assembled into contigs (Contig1 to Contig6). Scaffold order: Contig3, Contig1, Contig5, Contig2, Contig4, Contig6
Paired-end sequencing. Read lengths: 900 bp (Sanger); 450 bp (Roche 454); 100 bp (Illumina). Fragment sizes shown: 500 bp and 50 kbp
Scaffolding (using paired reads)
Software: assembly. http://seqanswers.com/forums/showthread.php?t=43