Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I. Dr David Studholme. 18th February 2014. BIO1033 theme lecture. 28 February 2014 @davidjstudholme

[Career timeline, 2014 back to 1996, most recent first: Head of Bioinformatics, Computational biologist, Post-doc, Post-doc, PhD]

The plan
  Part I: bacteria
  Part II: eukaryotes

What we will cover in the next hour
  Examples of how we use bacterial genomes
  Short-read sequence data
  Alignment of short reads against reference
  Calling single-nucleotide variation
  Assembly?

SINGLE-NUCLEOTIDE VARIATION

Example: fowl typhoid

Salmonella enterica subspecies enterica serotype Gallinarum: FOWL TYPHOID
Photo credit: http://www.cfsph.iastate.edu/diseaseinfo/disease-images.php?name=fowl-typhoid

Live vaccines: Salmonella strain SG9R

What do we know about attenuated strain SG9R?
  SG9 (wild type)    SG9R (vaccine)
  Pathogenic         Attenuated
  Smooth colonies    Rough colonies
  rfaJ TCA [Ser]     rfaJ TAA [Stop]

Outbreaks linked to the vaccine strain SG9R?
MLVA (multi-locus variable number of tandem repeats analysis) of: Spleen, Farm A; Faeces, Farm H; Faeces, Farm A; Vaccine

Whole-genome sequencing of Salmonella Gallinarum

  Strain   Comments                  Depth of coverage
  SG9      Wild type, UK, 1955       81 x
  SG9Ra    Vaccine                   342 x
  SG9Rb    Vaccine, 2009             412 x
  MB4523   Outbreak, Belgium, 2009   390 x

Illumina GAIIx paired reads (36-100 nt)
Photo: http://www.timpestridge.co.uk

Single-nucleotide differences
[Figure: numbers of single-nucleotide differences among SG9 wild type, SG 287/91, SG9R vaccine and MB4523 outbreak; values shown: 281, 1, 2, 1, 9]

Single-nucleotide substitutions: Basis of attenuation?

Single-nucleotide substitutions: Basis of attenuation

Single-nucleotide substitutions: Reversal of attenuation? (vaccine versus outbreak)

Fowl typhoid project - conclusions
  Concern over undefined live vaccines
  Possibility of reversal by a few substitutions

Example II: KIWIFRUIT CANKER

Example: kiwi-fruit canker

Genetic relationships between PSA outbreaks

Genetic relationships between PSA outbreaks (single-nucleotide differences)

Example III: BANANA XANTHOMONAS WILT

Example: banana Xanthomonas wilt

Causal agent of BXW disease: Xanthomonas campestris pv. musacearum (Xcm)

BXW is a recently emerging disease
[Map labelled with years: 2001, 1968, 2005, 2007, 2007, 2007, 2006]

Xcm is closely related to X. vasicola

Genetic homogeneity

Whole-genome sequencing of Xvm isolates

Xvm isolates fall into two major sequence types (based on single-nucleotide differences)
[Figure: tree of isolates. Tip labels in the order they appear on the slide:
  NCPPB4379  Uganda (Kayunga) 2007
  NCPPB4394  Tanzania 2007
  NCPPB4433  Burundi 2008
  NCPPB4434  Kenya 2008
  NCPPB4383  Uganda (Wakiso) 2007
  NCPPB4384  Uganda (Nakasongola) 2007
  NCPPB4380  Uganda (Kiboga) 2007
  NCPPB4381  Uganda (Luwero) 2007
  NCPPB4395  Tanzania 2007
  NCPPB4392  Tanzania 2007
  NCPPB2005  Ethiopia 1967 (Enset)
  NCPPB2251  Ethiopia 1969
  NCPPB4387  D. R. Congo 2007
  NCPPB4389  Rwanda 2007
Group labels: Red type, Xcm Sub-lineage II; Blue type, Xcm Sub-lineage I. Annotation: 99.9985% identical. Node labels: 91, 100, 99, 100, 99. Scale bar: 0.1]

The spread of banana Xanthomonas wilt disease (BXW)

Exemplar projects: single-nucleotide variation

SINGLE NUCLEOTIDE VARIATION: HOW WE DO IT

... previously
Now (in Exeter)...

How much data?

Shotgun DNA sequencing
Genomic DNA; Fragmented DNA

DNA sequencing
[Figure: double-stranded DNA (5' 3' / 3' 5') with read lengths: 900 bp (Sanger), 450 bp (Roche 454), 100 bp (Illumina)]

DNA sequencing: paired reads
[Figure: same read-length labels as the previous slide]

What do the data look like?

FastQ format

FastQ format
  1. Title line
  2. Sequence line
  3. Separator line ("+")
  4. Quality line
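
A minimal sketch of reading the records listed above, assuming each read occupies exactly four lines (no wrapped sequences). The function name `read_fastq` and the example read are hypothetical and are not taken from the lecture.

```python
def read_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of FASTQ lines.

    Assumption: unwrapped records, i.e. exactly four lines per read.
    """
    it = iter(lines)
    for title in it:
        title = title.rstrip("\n")
        if not title:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        separator = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        # Line 1 starts with "@", line 3 starts with "+"
        if not title.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + title)
        # One quality character per base
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + title)
        yield title[1:], seq, qual


# Hypothetical single-read example
example = ["@read1\n", "GATTACA\n", "+\n", "IIIIII5\n"]
records = list(read_fastq(example))
```

In practice an established parser would normally be preferred; the sketch is only meant to make the four-line layout concrete.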

Quality scores encoded in ASCII
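
As an illustration of the encoding, the sketch below converts quality characters to Phred scores and then to error probabilities. It assumes an ASCII offset of 33 (some older Illumina pipelines used 64); the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    Assumption: Phred+33 encoding; pass offset=64 for older Illumina data.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


scores = phred_scores("II5!")
probabilities = [error_probability(q) for q in scores]
```

Under this encoding a score of 20 corresponds to a 1% chance that the base call is wrong.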

Alignment of sequence reads versus a reference genome sequence

Visualisation of alignments

The alignment-tool deluge

My favourite short-read aligner

Short-read alignment: Some considerations
  How to handle non-unique matches?
  Mask repetitive sequences?
  Splicing-aware?

Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011;12:443-51. doi: 10.1038/nrg2986

SAMtools pileup format
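
A rough sketch of tallying base calls from one pileup line. It assumes the usual six tab-separated columns (sequence name, position, reference base, depth, read bases, base qualities); the function name and the example line are hypothetical, and less common symbols are simply skipped.

```python
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count the bases observed at one site from the pileup read-bases column."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            i += 2  # start of a read; the next character is its mapping quality
        elif c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
        elif c in ".,":
            counts[ref_base.upper()] += 1  # match on forward/reverse strand
            i += 1
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1  # mismatch on forward/reverse strand
            i += 1
        else:
            i += 1  # "$" (end of read), "*" (deleted base) and other symbols
    return counts


# Hypothetical pileup line
line = "chr1\t100\tA\t8\t..,^F.T,t$-2ag.\tIIIIIIII"
name, pos, ref, depth, bases, quals = line.split("\t")
counts = count_pileup_bases(ref, bases)
```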

SNP calling
False positives are assumed to follow a binomial distribution, with a 1.00% probability of any single base-call being incorrect.
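
One way to turn this assumption into a number is to ask how likely it is that k or more of the n base calls covering a site are wrong purely by chance. The sketch below assumes independent errors and is only an illustration; the lecture does not state exactly how the binomial model was applied, and the function name is hypothetical.

```python
from math import comb


def prob_at_least_k_errors(k, n, p_error=0.01):
    """Binomial tail: P(at least k of n base calls are wrong), errors independent."""
    return sum(
        comb(n, i) * p_error ** i * (1 - p_error) ** (n - i)
        for i in range(k, n + 1)
    )


# Chance that all ten reads at a 10x site are miscalled
p_all_wrong = prob_at_least_k_errors(10, 10)
# Chance that at least two of forty reads are miscalled
p_two_of_forty = prob_at_least_k_errors(2, 40)
```

A variant supported by many reads is therefore very unlikely to be explained by sequencing error alone, whereas one or two discordant reads at a deeply covered site are not surprising.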

A word about SNP calling
[Chart: base pairs classed as Unambiguous or Ambiguous]
Unambiguous: 95% consensus in each dataset and >= 10 x depth in each dataset
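
The two criteria can be sketched as a simple filter. The representation of a site as a dictionary of base counts per dataset, and the function names, are assumptions made for illustration.

```python
def dataset_is_unambiguous(base_counts, min_depth=10, min_consensus=0.95):
    """True if one dataset has enough depth and a clear consensus base at a site."""
    depth = sum(base_counts.values())
    if depth < min_depth:
        return False
    return max(base_counts.values()) / depth >= min_consensus


def site_is_unambiguous(counts_per_dataset):
    """A site counts as unambiguous only if every dataset passes both criteria."""
    return all(dataset_is_unambiguous(c) for c in counts_per_dataset)


# Hypothetical base counts at one site in two datasets
clear_site = [{"A": 40}, {"T": 58, "A": 1}]
mixed_site = [{"A": 40}, {"T": 18, "A": 2}]
shallow_site = [{"A": 40}, {"T": 9}]
```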

Ambiguous single-nucleotide variation

GENE CONTENT

Gene content: global overview
[Figure panels: Xcm, Xcm, Xoo]

Incorporating short-read assembly

Alignment is safer than assembly

Evolutionary events inferred from comparative genomics

Gene presence/absence
These genes are unique to the banana pathogenic bacterial isolates

Inferring gene content from alignment

Gene content workflow
  Alignment 1 -> coverageBed -> Spreadsheet 1
  Alignment 2 -> coverageBed -> Spreadsheet 2
  Alignment 3 -> coverageBed -> Spreadsheet 3
  Alignment 4 -> coverageBed -> Spreadsheet 4
  Spreadsheets 1-4 -> Custom Perl script -> R, pheatmap -> Heatmap
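
The step that combines the per-alignment tables might look something like the sketch below. It is written in Python rather than Perl and is not the author's script. It assumes that each table gives, for every gene, the fraction of its length covered by aligned reads; the strain names, gene names and 0.5 threshold are hypothetical.

```python
def build_matrix(coverage_by_sample):
    """Merge per-sample {gene: fraction covered} tables into one gene x sample matrix."""
    genes = sorted({g for table in coverage_by_sample.values() for g in table})
    samples = sorted(coverage_by_sample)
    # Genes missing from a table are treated as having no coverage
    rows = [[coverage_by_sample[s].get(g, 0.0) for s in samples] for g in genes]
    return genes, samples, rows


def presence_absence(rows, threshold=0.5):
    """Call a gene present (1) where its covered fraction reaches the threshold."""
    return [[1 if frac >= threshold else 0 for frac in row] for row in rows]


# Hypothetical breadth-of-coverage values
coverage = {
    "strainA": {"gene1": 1.00, "gene2": 0.98, "gene3": 0.02},
    "strainB": {"gene1": 0.99, "gene2": 0.10},
}
genes, samples, rows = build_matrix(coverage)
calls = presence_absence(rows)
```

A matrix of this shape, genes as rows and genomes as columns, is the kind of input a heatmap function expects.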

DE NOVO ASSEMBLY

Photo credit: Ben Casey http://commons.wikimedia.org/wiki/file:dna_alignment_written_in_paper.jpg

Sequencing and assembling a genome

The de novo sequence assembly problem

Illumina sequence read: 100 bp
Sanger / capillary sequence read: 900 bp
Human genome: 3,200,000,000 bp
Wheat genome: 16,000,000,000 bp

Sequence reads (unordered):
  GATGGATAAGTTTTCTGACA
  CTGAATACAGGGATGTCTAT
  TTTTCTGACAAGCACTTCAG
  AAGCACTTCAGGGCTGAGCA
  TCAGGGCTGAGCATCCTGAA
  AGGGCTGAGCATCCTGAATA
  GACGATTTGATGGATAAGTT
  GGGATGTCTATCCGGAGGAA
  TCCGGAGGAATGTTCTGCCA

Sequence reads (ordered by their overlaps):
  GACGATTTGATGGATAAGTT
  GATGGATAAGTTTTCTGACA
  TTTTCTGACAAGCACTTCAG
  AAGCACTTCAGGGCTGAGCA
  TCAGGGCTGAGCATCCTGAA
  AGGGCTGAGCATCCTGAATA
  CTGAATACAGGGATGTCTAT
  GGGATGTCTATCCGGAGGAA
  TCCGGAGGAATGTTCTGCCA

Contiguous sequence (contig):
  GACGATTTGATGGATAAGTTTTCTGACAAGCACTTCAGGGCTGAGCATCCTGAATACAGGGATGTCTATCCGGAGGAATGTTCTGCCA

The greedy algorithm (for de novo sequence assembly)

Simple greedy approach
  ...ACAGGAGGT
         GAGGTCCAGA...
  ...ACAGGAGGTCCAGA...
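
A minimal sketch of a greedy assembler: repeatedly merge the two sequences with the longest suffix-prefix overlap until nothing overlaps any more. It uses the nine 20-nt example reads from the contig slide earlier in the lecture. The function names and the minimum overlap of 3 are assumptions, and real assemblers also have to cope with sequencing errors, reverse complements and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair again and again; return the final contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = [
    "GATGGATAAGTTTTCTGACA", "CTGAATACAGGGATGTCTAT", "TTTTCTGACAAGCACTTCAG",
    "AAGCACTTCAGGGCTGAGCA", "TCAGGGCTGAGCATCCTGAA", "AGGGCTGAGCATCCTGAATA",
    "GACGATTTGATGGATAAGTT", "GGGATGTCTATCCGGAGGAA", "TCCGGAGGAATGTTCTGCCA",
]
contigs = greedy_assemble(reads)
```

For these repeat-free reads the greedy merges should reproduce the single contig shown on that slide.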

The problem of repetitive sequences

Genome sequence:
  CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATTGA

Sequence reads:
  CGCATATATATATAT
  TATATATATATATATA
  ATATATATATATATA
  CGCGCATATATATAT
  TATATATATATATATA
  TATATATATATATATA
  TATATATATATATATA
  TATATGCCGATT
  ATGCCGATTGA

Assembly of these reads:
  CGCGCATATATATATATATATATGCCGATTGA

Genome sequence versus assembly:
  CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATTGA
  CGCGCATATATATATATATATATGCCGATTGA

GRAPH-BASED METHODS: OLC and k-mer graphs

What is a graph? (undirected, directed)
http://en.wikipedia.org/wiki/graph_%28mathematics%29

OVERLAP LAYOUT CONSENSUS (OLC)

Overlap-layout-consensus (OLC)
Genome sequence:
  ATGCCGTTGAACTTCGTTGAACACATGGTCATAC
Sequence reads:
  ATGCCGTT
  GCCGTTGAA
  GAACACATGG
  GAACTTCGTTGA
  CACATGGTCAT
  TTGAACACAT

Is there a Hamiltonian path? (passes through every node exactly once)
[Graph: one node for each of the six sequence reads listed above]

Hamiltonian path
Reads in path order, aligned by their overlaps, with the consensus on the last line:
  ATGCCGTT
    GCCGTTGAA
          GAACTTCGTTGA
                  TTGAACACAT
                    GAACACATGG
                       CACATGGTCAT
  ATGCCGTTGAACTTCGTTGAACACATGGTCAT
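
The three OLC steps can be sketched by brute force for the six example reads: compute overlaps, search for an ordering that visits every read once with each consecutive pair overlapping, then merge. The function names and the minimum overlap of 3 are assumptions. Trying every permutation is only feasible for a handful of reads; finding a Hamiltonian path is NP-complete in general.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def hamiltonian_layout(reads, min_len=3):
    """Return the first read order in which every consecutive pair overlaps."""
    for order in permutations(reads):
        overlaps = [overlap(a, b, min_len) for a, b in zip(order, order[1:])]
        if all(overlaps):
            return list(order), overlaps
    return None, None


def consensus(order, overlaps):
    """Merge reads along the path, skipping the overlapping prefix of each read."""
    seq = order[0]
    for read, k in zip(order[1:], overlaps):
        seq += read[k:]
    return seq


reads = ["ATGCCGTT", "GCCGTTGAA", "GAACACATGG",
         "GAACTTCGTTGA", "CACATGGTCAT", "TTGAACACAT"]
order, overlaps = hamiltonian_layout(reads)
contig = consensus(order, overlaps)
```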

OLC requires all-versus-all comparisons
[Graph nodes: ATGCCGTT, GCCGTTGAA, CACATGGTCAT, GAACACATGG, TTGAACACAT, GAACTTCGTTGA]

K-MER GRAPH METHOD

Genome size = 2,250,000,000 bp
Average read length = 52 bp
176 Gb of sequence data
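
A back-of-envelope calculation with these three numbers shows why all-versus-all read comparison becomes impractical at this scale. It assumes that "176 Gb" means 176 gigabases and that every read has the average length; the variable names are illustrative.

```python
genome_size = 2_250_000_000      # bp
read_length = 52                 # bp, average
total_bases = 176_000_000_000    # assumption: 176 Gb = 176 gigabases

# Average depth of coverage = total sequenced bases / genome size
depth = total_bases / genome_size

# Approximate number of reads
n_reads = total_bases // read_length

# Number of read pairs an all-versus-all overlap step would have to consider
pairwise_comparisons = n_reads * (n_reads - 1) // 2

print(f"depth of coverage: about {depth:.0f}x")
print(f"number of reads: about {n_reads:.2e}")
print(f"pairwise comparisons: about {pairwise_comparisons:.2e}")
```

That works out to roughly 78x coverage from about 3.4 billion reads, and on the order of 10^18 possible read pairs.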

Building the k-mer graph
Sequence: aaccgg
k-mers (K = 4): aacc, accg, ccgg
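
A small sketch of the decomposition shown above. It assumes the de Bruijn formulation, in which each k-mer becomes an edge from its first k-1 letters to its last k-1 letters; the slide itself does not say whether k-mers are treated as nodes or as edges, and the function names are hypothetical.

```python
def kmers(sequence, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(kmer_list):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to."""
    graph = {}
    for kmer in kmer_list:
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
    return graph


words = kmers("aaccgg", 4)
graph = de_bruijn_graph(words)
```

In a real assembly the k-mers are taken from every read rather than from a known sequence, so reads are never compared with each other directly.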

Reconstructing the original sequence from the k-mer graph
Eulerian path: passes through every edge exactly once
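
The reconstruction can be sketched with Hierholzer's algorithm, which walks the graph using each edge once. The sketch assumes k-mers are edges between (k-1)-mer nodes, that the data are error-free and that an Eulerian path exists; the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(kmer_list):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: return a list of nodes that uses every edge once."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_degree[t] += 1
    nodes = set(graph) | set(in_degree)
    # Start where out-degree exceeds in-degree by one, if such a node exists
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {n: list(t) for n, t in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: add the node to the path
    return path[::-1]


def spell(path):
    """Rebuild the sequence: first node plus the last letter of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


def reconstruct(sequence, k):
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return spell(eulerian_path(de_bruijn_graph(words)))


rebuilt = reconstruct("aaccgg", 4)
```

When repeats longer than k-1 are present, more than one Eulerian path can exist, so the reconstruction is no longer unique.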

Finding the Eulerian path in a k-mer graph
[Figure panels: Ideal data; Real data]

From contigs to scaffolds

Sequence reads -> contig assembly -> scaffold
  Contigs as assembled: Contig1, Contig2, Contig3, Contig6, Contig4, Contig5
  Contigs ordered in the scaffold: Contig3, Contig1, Contig5, Contig2, Contig4, Contig6

Paired-end sequencing
[Figure: read lengths 900 bp (Sanger), 450 bp (Roche 454), 100 bp (Illumina); additional labels 500 bp and 50 kbp]

Scaffolding (using paired reads)

Software: assembly
http://seqanswers.com/forums/showthread.php?t=43