Why are we here? Introduction
- Valerie McLaughlin
Transcription
1 Why are we here? Introduction
2 Genome assembly Original DNA Fragments Sequenced ends Fragments Contigs Scaffold
3 A correct assembly The right motifs, the correct number of times, in correct order and position.
4 Black box processing DATA Processing RESULTS
5 Working with heuristics DATA Processing RESULTS
6 Why use heuristics for genome assembly? The problem is not completely defined. Exhaustive methods are: too limited, thus producing simple partial solutions; too slow, not scaling well. Data varies too much and no good models are available. Heuristics are so much faster and easier, and they work! (sometimes, anyway) (Diagram: DATA, Processing, RESULTS)
7 Black box processing done right (Diagram: DATA, Processing, RESULTS) Use good data and check that it meets the preconditions for processing. Know (roughly) how the processing works. Check soundness and sanity of results.
8 Questions?
9 Sequencing and assembly 101 Lecture #1
10 A brief history of DNA sequencing: 1953, double helix structure (Watson & Crick); 1977, rapid DNA sequencing (Sanger); 1977, first full (5k) genome, bacteriophage Phi X; late 80s, first production Sanger sequencers; mid 90s, DNA microarrays; 2001, draft human genome; 2004, first 454 pyrosequencing machine; 2006, first Solexa/Illumina sequencer; 2011, PacBio RS; 2014, Nanopore, Bionano; 2015, Dovetail, 10x
11 Next generation sequencing
12 Paired libraries (PE and LMP)
13 Single molecule, long reads. Figure from: Heather & Chain, "The sequence of sequencers: The history of sequencing DNA", Genomics, 2015
14 Sequencing technologies Updated as of Q1 2017
15 The genome assembly problem (WGS) Original DNA Fragments Sequenced ends Fragments Contigs Scaffold
16 A correct assembly has: The right motifs, the correct number of times, in correct order and position. None of which is assessed by length stats.
17 Overlap Layout Consensus
18 Overlap - Layout - Consensus
19 Overlap Layout Consensus: Key points. Finding overlaps and defining them is key. The layout can be quite difficult. The method tracks every read. The consensus is constructed from the reads.
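The overlap step named in the key points can be sketched in a few lines. This is only a minimal illustration, assuming exact suffix-prefix matches and the two toy reads used later in the deck; the function names `overlap` and `merge` and the `min_len` parameter are hypothetical, and real OLC assemblers use tolerant alignment and indexing rather than this brute-force comparison.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join `a` and `b` through their overlap, or return None if there is none."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


# Toy reads from the De Bruijn graph slides.
reads = {"seq1": "TTCTAAGT", "seq2": "CGATTCTA"}

# All-vs-all comparison: every read is compared to every other read.
overlaps = {
    (x, y): overlap(reads[x], reads[y])
    for x in reads
    for y in reads
    if x != y
}
merged = merge(reads["seq2"], reads["seq1"])
```

The all-vs-all dictionary is what makes this approach expensive on large read sets: the number of comparisons grows with the square of the number of reads.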
20 De Bruijn Graphs
21-38 Assembling a De Bruijn Graph (animation, k=3) >seq1 TTCTAAGT >seq2 CGATTCTA Kmers are added one at a time as nodes, each with a count, and linked to the next kmer in the read by an edge labelled with the added base. Final graph (count in parentheses): CGA(1) -T-> GAT(1) -T-> ATT(1) -C-> TTC(2) -T-> TCT(2) -A-> CTA(2) -A-> TAA(1) -G-> AAG(1) -T-> AGT(1). Walking the graph spells CGATTCTAAGT.
39-49 Assembling a De Bruijn Graph (animation, continued) >seq3 CGATTGTAAGT is added. The counts of the kmers shared with seq3 (CGA, GAT, ATT, TAA, AAG, AGT) go up to 2, and a new branch appears between ATT and TAA: ATT -G-> TTG(1) -T-> TGT(1) -A-> GTA(1) -A-> TAA. The graph now spells two paths: CGATTCTAAGT and CGATTGTAAGT.
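The construction walked through on these slides can be reproduced with a short sketch. It assumes k=3, kmers as nodes with counts, and edges between consecutive kmers of a read; `build_graph` and `paths` are hypothetical names, and the path walker is only suitable for tiny graphs without cycles.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All kmers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Insert new kmers or update the count for those already present."""
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        counts.update(ks)
        for u, v in zip(ks, ks[1:]):
            edges[u].add(v)
    return counts, edges


def paths(counts, edges):
    """Spell every source-to-sink path (toy, acyclic graphs only)."""
    has_incoming = {v for targets in edges.values() for v in targets}
    sources = sorted(n for n in counts if n not in has_incoming)
    found = []

    def walk(node, seq):
        nexts = sorted(edges.get(node, ()))
        if not nexts:
            found.append(seq)
        for nxt in nexts:
            walk(nxt, seq + nxt[-1])

    for source in sources:
        walk(source, source)
    return found


two_reads = ["TTCTAAGT", "CGATTCTA"]
three_reads = two_reads + ["CGATTGTAAGT"]

counts2, edges2 = build_graph(two_reads, 3)
counts3, edges3 = build_graph(three_reads, 3)
paths2 = paths(counts2, edges2)
paths3 = paths(counts3, edges3)
```

With the third read the node ATT gains a second outgoing edge, which is the branch shown on the last slides of the animation.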
50 Graphs get complicated Image from dskernel.blogspot.co.uk
51 Common structures: Polymorphism, Errant base call, Repeating element
52-57 Cleaning graphs: Clip tips. Remove low coverage nodes. Remove bubbles.
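Of the three cleaning steps, removing low coverage nodes is the simplest to sketch. The example below assumes a hypothetical read set where one read carries a single base error, and a hypothetical threshold `min_count`; it does not implement tip clipping or bubble removal, and the same low-count pattern can also come from a real polymorphism, so the threshold is a judgment call.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def remove_low_coverage(counts, min_count):
    """Keep only the nodes (kmers) seen at least `min_count` times."""
    return Counter({kmer: c for kmer, c in counts.items() if c >= min_count})


# Hypothetical data: three clean reads and one read with a C->G error.
reads = ["CGATTCTAAGT"] * 3 + ["CGATTGTAAGT"]

counts = count_kmers(reads, 3)
cleaned = remove_low_coverage(counts, min_count=2)
removed = sorted(set(counts) - set(cleaned))
```

Here the three kmers introduced by the error (GTA, TGT, TTG) are the only ones dropped, which also collapses the bubble they formed.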
58 OLC vs. De Bruijn
Representation of the problem. OLC: (reads) overlap graph. De Bruijn: De Bruijn (kmer) graph.
Steps to add a read. OLC: insert read, compare to every read already included and insert overlaps. De Bruijn: insert new kmers or update count for those already present.
Strengths. OLC: tracks reads; intuitive representation; consensus. De Bruijn: computational speed; ability to handle big datasets.
Optimal depth. OLC: just enough to cover genome and give accurate consensus. De Bruijn: the higher the better (to grow SNR).
Typical sequencing technologies processed. OLC: long reads (PacBio, Nanopore); hybrids (Illumina + long reads). De Bruijn: Illumina.
59 The size of the universe (table: noncanonical vs. canonical representation, for K odd and K even)
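The table values did not survive in this transcript, but the idea can be checked by brute force for small k. The sketch below assumes the usual definition of the canonical form as the lexicographically smaller of a kmer and its reverse complement; `universe_sizes` is a hypothetical helper. For odd k no kmer equals its own reverse complement, so the canonical universe is exactly half of 4^k; for even k the 4^(k/2) palindromic kmers are their own partner, giving (4^k + 4^(k/2)) / 2.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of the kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def universe_sizes(k):
    """(noncanonical, canonical) number of possible kmers, by enumeration."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return len(all_kmers), len({canonical(km) for km in all_kmers})


sizes = {k: universe_sizes(k) for k in (2, 3, 4, 5)}
```

Enumeration is only practical for small k; for real values of k the closed formulas above are what matter.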
60 The K tradeoff Longer kmers are more unique in the target, disentangling the graph. Shorter kmers overlap more often, favouring contiguity. Every read of length L produces L-k+1 kmers, so higher k -> less kmer coverage. Every single error affects up to k kmers, so higher k -> more erroneous kmers. A typical choice for 100bp reads is k=71.
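The arithmetic behind the tradeoff is easy to make concrete. The sketch below assumes error-free reads of uniform length for the coverage formula, and the 50x base coverage in the example is a hypothetical figure; the function names are illustrative only.

```python
def kmers_per_read(read_len, k):
    """Every read produces L - k + 1 kmers."""
    return read_len - k + 1


def kmer_coverage(base_coverage, read_len, k):
    """Kmer coverage shrinks by the factor (L - k + 1) / L."""
    return base_coverage * (read_len - k + 1) / read_len


def kmers_hit_by_error(read_len, k, pos):
    """Number of kmers in a read that contain the base at 0-based `pos`."""
    first = max(0, pos - k + 1)      # first kmer start that still covers pos
    last = min(pos, read_len - k)    # last kmer start that covers pos
    return last - first + 1


# 100bp reads at a hypothetical 50x base coverage.
cov_k31 = kmer_coverage(50, 100, 31)
cov_k71 = kmer_coverage(50, 100, 71)
hit_k31 = kmers_hit_by_error(100, 31, 50)
hit_k71 = kmers_hit_by_error(100, 71, 50)
```

With k=71 a 100bp read yields only 30 kmers, so an error in the middle of the read spoils all of them, while with k=31 the same error spoils 31 of 70.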
61 Resolving repeats using reads
62 A correct assembly has: The right motifs, the correct number of times, in correct order and position. None of which is assessed by length stats.
63 Graphs, contigs, and scaffolds
Graphs: assembler's representation. More information. Allow some back-tracking. Can encode support/ambiguity.
Unitig. Sequence origin: 1 element in the graph. Expected quality: very high. Main quality driver: sequence data, cleanup, overlap detection.
Contig. Sequence origin: supported chain in the graph. Expected quality: high. Main quality driver: + graph complexity, single-read mapping & entropy.
Scaffold. Sequence origin: external-link group of contigs. Expected quality: variable. Main quality driver: + pair reliability, parametrisation.
Visualization of a w2rap-contigger GFA for an E. coli dataset assembly. Rendered using Bandage (Wick R.R. et al., Bioinformatics, 31(20),
64 Beware of N50 N50 is the most used metric in the assembly world, and it should not be: Using contiguity as the primary goal rewards risky joining. N50 is affected by filtering, and not very sensitive! (Figure: example sets of contigs with lengths from 400bp to 1400bp.)
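Both warnings can be reproduced with a small N50 calculation. The contig lengths below are a hypothetical set that borrows the values visible in the slide figure, not a reconstruction of it; N50 is taken here as the length of the contig at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total.

```python
def n50(lengths):
    """Length of the sequence at which half of the total length is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig set using lengths like those in the figure (bp).
contigs = [1400, 800, 800, 700, 500, 500, 500, 500, 400, 400]

# Filtering: drop everything shorter than 500bp.
filtered = [length for length in contigs if length >= 500]

# Joining: merge the 1400bp and one 800bp contig into a 2200bp sequence.
joined = [2200, 800, 700, 500, 500, 500, 500, 400, 400]

n50_all = n50(contigs)
n50_filtered = n50(filtered)
n50_joined = n50(joined)
```

In this toy set, discarding two short contigs moves N50 from 700 to 800 without changing any sequence, while a join that alters the two longest contigs leaves it at 700.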
65 Contiguity stats N50 is the most used metric in the assembly world, and it should not be. (Plot: Cumulative Length vs. Sequence Count for Scaffolds, Contigs and Unitigs.) Don't forget to check your Ns!!!
66 Running ABySS as a first pass assembler It runs easily and can use both single and multi-host multiprocessing. Creates a ton of useful output, and a nice log: Kmer spectra. LMP fragment sizes histogram (mapped to contigs). PE fragment sizes histogram (mapped to unitigs). Length stats. Redirected log.
67 Fragment Sizes (histogram: Fragment Count vs. Fragment Size, bins of 10bp)
68 Questions?
70 Assembly and graph exploration Hands on #1
71 Back to the drawing board: experiment design, QC, data preparation, QC, assembly, QC, QC Lecture #2
72 Assembly project workflow (diagram labels) Prior Knowledge Genome Characteristics Preliminary Evaluation Objectives Karyotype: Genome size, Ploidy Heterozygosity Sequencing Strategy GC content Sequencing Contaminants / Symbionts Sequencing Data Data Sets: Close relatives Draft Assemblies Validation, and feature Analysis External Validation Data Genes / ESTs / RNAseq / Markers Objectives met? Final Assembly and Validation Mitochondria Assembled NGS scaffolds Chloroplast
73 Experiment design (you choose the data!) Know your biological question. Plan your data processing (from an information perspective). Decide on conditions and biological/technical replicates. Decide on technologies and coverages: How will the typical bias affect your experiment? Is the coverage enough? Significant results?
75 The assembly is just a probabilistic model of a genome, condensing the information from the experimental evidence. All the information is already present in the experimental results.
76 A correct assembly has: The right motifs, the correct number of times, in correct order and position. None of which is assessed by length stats.
77 Sample and library preparation: a source of bias DNA/RNA extraction techniques have bias, and sample quality limits sequencing! Samples are never pure. PCR generates further bias. No chemical reaction is perfect, nor complete. You can learn what your typical biases are: Assess them. Take their impact into account. Try to get better data produced.
78 Do not neglect the QC data from the lab: Concentrations. Sample contamination. Fragment sizes!
79 Read preparation:
- Adaptor trimming: if you have lots of adaptor sequence, but ESPECIALLY if you have linkers from LMP (check Nextclip).
- Pair joining: allows higher k on overlapping reads. Might lose longer fragments.
- Quality trimming: only if your data is terrible and you are short of memory.
- Error correction: once it miscorrects, all subsequent processing is tainted. Your approach should be able to cope with errors; EC is just one option. PacBio reads are a special case, more about that later.
- Deduplication: hard to do right, sometimes needed, scaffolders handle it.
- Digital normalisation: for rna-* / meta-*, and only if you understand what it does.
IN GENERAL: Illumina is better than it used to be. Keep it in mind.
80 Counting kmers >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT Kmer counts (k=3): CGA 2, GAT 2, ATT 2, TTC 2, TCT 2, CTA 2, TAA 2, AAG 2, AGT 2, TTG 1, TGT 1, GTA 1. (Plot: Kmer Count vs. Kmer Frequency.)
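The step from kmer counts to a kmer spectrum is a histogram of the counts themselves. A minimal sketch for the three toy reads, assuming k=3 and noncanonical counting (each kmer counted as written, ignoring the reverse strand), with the hypothetical helpers `count_kmers` and `spectrum`:

```python
from collections import Counter


def count_kmers(reads, k):
    """How many times each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def spectrum(counts):
    """How many distinct kmers occur at each frequency."""
    return Counter(counts.values())


reads = ["TTCTAAGT", "CGATTCTA", "CGATTGTAAGT"]
counts = count_kmers(reads, 3)
spec = spectrum(counts)

# Print the spectrum as a small text histogram, lowest frequency first.
for frequency in sorted(spec):
    print(frequency, "#" * spec[frequency])
```

For these reads nine distinct kmers are seen twice and three are seen once; on real data the same histogram is what tools such as KAT plot.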
81 The kmer spectra
82 The kmer spectra's components (plot: Density vs. Frequency; component at 10x)
83 The kmer spectra's components (plot: Density vs. Frequency; components at 10x and 2 x 10x = 20x)
84 The kmer spectra's components (plot: Density vs. Frequency; components at 10x, 2 x 10x = 20x and 3 x 10x = 30x)
85 The kmer spectrum and its dissection. We typically use KAT to count kmers. From the spectrum you can read: Kmer coverage. Genome size. Errors vs. good kmers. Comparing different spectra (KAT): Is a reference-free library assessment. Runs fast. Gives at least a better vs. worse result.
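Reading kmer coverage and genome size off a spectrum can be sketched as follows. This is a rough, commonly used approximation rather than what KAT does internally: it assumes a single dominant peak at the kmer coverage, a hand-picked `error_cutoff` below which kmers are treated as errors, and an invented toy spectrum. The estimate is the total number of non-error kmer occurrences divided by the peak coverage.

```python
def estimate_genome_size(spectrum, error_cutoff):
    """Return (kmer coverage peak, genome size estimate in kmers).

    `spectrum` maps a frequency to the number of distinct kmers seen
    that many times; frequencies below `error_cutoff` are ignored.
    """
    solid = {f: n for f, n in spectrum.items() if f >= error_cutoff}
    peak = max(solid, key=solid.get)
    total_occurrences = sum(f * n for f, n in solid.items())
    return peak, total_occurrences / peak


# Invented toy spectrum: an error tail at 1-2x, a main peak at 10x,
# and a few duplicated kmers at 20x.
toy_spectrum = {1: 5000, 2: 300, 8: 50, 9: 200, 10: 500, 11: 200, 12: 50, 20: 25}

peak, size = estimate_genome_size(toy_spectrum, error_cutoff=5)
```

The 25 kmers at 20x are counted twice in the estimate, which is how duplicated content contributes to genome size even though it adds no new distinct kmers.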
86 If an assembly is correct, then the original reads should be a plausible sequencing set for the resulting genome model.
87 Checking content inclusion using KAT Just compare the frequency of kmers in the assembly to the reads' spectrum.
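The comparison can be illustrated with a small cross-tabulation: for each kmer, how often it appears in the reads against how many copies of it the assembly contains. This is a toy sketch in the spirit of that comparison, not KAT's implementation; it assumes k=3, noncanonical kmers, the three toy reads from the counting slide, and a hypothetical assembly that kept only one of the two paths.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """How many times each kmer occurs across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def inclusion_table(read_counts, asm_counts):
    """table[read frequency][assembly copies] = number of distinct kmers."""
    table = defaultdict(Counter)
    for kmer in set(read_counts) | set(asm_counts):
        table[read_counts.get(kmer, 0)][asm_counts.get(kmer, 0)] += 1
    return table


reads = ["TTCTAAGT", "CGATTCTA", "CGATTGTAAGT"]
assembly = ["CGATTCTAAGT"]  # hypothetical assembly: one path only

table = inclusion_table(count_kmers(reads, 3), count_kmers(assembly, 3))
missing = table[1][0]   # kmers seen once in the reads, absent from the assembly
included = table[2][1]  # kmers seen twice in the reads, present once
```

Content that is well supported by the reads but sits in the zero-copies column is what the assembly has dropped; content in the reads' error range with zero copies is what it should drop.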
88 KAT vs. CEGMA C. fraxinea - Ash Dieback - NORNEX
89 Real case (plot annotations: Heterozygous content, Homozygous content, Duplications, Errors, Discarded heterozygous content, Discarded homozygous content)
90 Assembly validation tools (plot: Alignment of PacBio reads to hexaploid wheat assemblies; Cumulative percentage of aligned sequences vs. Percent coverage; series: IWGSC (with 3B), IWGSC (publication), Synthetic W7984, Earlham v1, Earlham v1.1, Cadenza)
91 Finding breakpoints by mapping reads
92 Example from the wheat genome
93 Questions?
95 Stare at the spectra Group activity #1
96 The sequencing portfolio: Every technology and how to combine them Lecture #3
97 Take your pick
- Illumina paired end: a good and cheap way to get the motifs.
- Long mate pairs: a hint at order and distances.
- 10x linked reads: Illumina + molecule tags!
- PacBio/Nanopore: long reads (longer, not very precise motifs); circular consensus reads (long, expensive, precise motifs).
- Optical maps: good positional information.
- Hi-C: spatial distance, relates to linear distance; Dovetail cleans it up.
- Genetic maps, markers, deletion bins, synteny, etc.
102 Creating and Sequencing Paired Libraries
103 Scaffolding with paired reads
104 Fragment Sizes (histogram: Fragment Count vs. Fragment Size, bins of 10bp)
105 Read mapping stats
106 Read mapping stats
107 About gap closing
BEWARE:
- Heuristics are too greedy.
- If there was a gap: when did we lose that information, and why?
- Walking is not the same as bridging.
- You can be masking problems.
If you need to:
- Do it as the last step.
- Check QC, metrics and stats before and after; eyeball typical cases.
- Be conscious it IS a patch.
111 The 10x lost SNPs tale Courtesy of Graham Etherington
112 Example 1: PacBio + Illumina, MaSuRCA. "Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm", Aleksey V. Zimin et al., Genome Research
113 Example 2: complex haplotype phasing (workflow diagram labels) PCR free Paired End DISCOVAR de Novo Contig Assembly Filter Contigs Alternative Haplotype recovery Nextera LMP SOAP-based scaffolding BAC-by-BAC Sequences Haplotype phasing and reconstruction Haplotype Phased Assembly Pop-seq Optical map based extension Bionano Optical Maps Pseudo-molecule Reconstruction Collapsed Haplotype Scaffolded Assembly Chromosome Binned Assembly
114 Example 3: 10x haplotype phasing (diagram labels) Naive Short-Read approach Original Haplotypes Short read sequencing ($) Reads (no phasing) Chimeric/mixed haplotypes w2rap: preserving variation in assembly graphs Original Haplotypes Short read sequencing ($) Reads (no phasing) Assembly graph (no phasing - all paths) +LMP ($) Updated Roadmap: Collapsed/long haplotypes Assembly graph (no phasing - all paths) +LMP ($) +10x ($$) +Maps Phased and long Haplotypes
115 Questions?
117 Playing with longer-range, and QC Hands on #2
118 Questions?
119 The genome assembly clinic Group Activity #2
120 Questions?
121 Closing remarks Don't do assembly, do research!
122 The graph is the assembly Graphs: assembler's representation. More information. Allow some back-tracking. Can encode support/ambiguity. GFA format: Gaining good acceptance. Many assemblers (inc. w2rap). Fasta + Graph supported. Visualization of a w2rap-contigger GFA for an E. coli dataset assembly. Rendered using Bandage (Wick R.R. et al., Bioinformatics, 31(20),
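To make "Fasta + Graph" concrete, here is a minimal sketch that reads the two most common GFA version 1 record types: `S` (segment: name and sequence) and `L` (link: from, orientation, to, orientation, overlap). The GFA text is an invented toy graph encoding the two-path example from the De Bruijn slides with blunt (0M) links, not w2rap output, and `parse_gfa` is a hypothetical helper that ignores tags and every other record type.

```python
from collections import defaultdict

# Invented toy graph: CGATT, then either C or G, then TAAGT.
GFA_TEXT = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tCGATT",
    "S\t2\tC",
    "S\t3\tG",
    "S\t4\tTAAGT",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
])


def parse_gfa(text):
    """Collect segments (name -> sequence) and links from GFA1 text."""
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append(tuple(fields[1:6]))
    return segments, links


segments, links = parse_gfa(GFA_TEXT)

# Segments with more than one outgoing link are where the graph keeps
# ambiguity that a plain FASTA file would have to flatten.
outgoing = defaultdict(list)
for source, _, target, _, _ in links:
    outgoing[source].append(target)
branching = [name for name, targets in outgoing.items() if len(targets) > 1]
```

The segment sequences alone are the FASTA view; the links are the extra information that lets a viewer such as Bandage draw both paths.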
123 The graph has more information on genes and regions Better region representation Better gene representation Example: A. thaliana assembly Total TAIR CDSs: In contigs: (98.77%) Not in contigs: 337 (1.23%) Paths found automatically*: 175 (51.93%) Most of the rest have paths, just more complex *Ad-hoc analysis using blast+ and Bandage; "automatically" means by Bandage path finding
124 Moving away from the reference Reference bias: Problem in human genome analysis. Crops are more complex and plastic -> bigger problem. Many genomes: What about annotation? Many analyses and reconciliation? Which reference is best, and for which study?
125 Different options with multi-genome analysis
Hierarchical Stacking: Fast to produce, starts looking great. Can be opaque. Forces reference bias. Lock-in factor / difficult to evolve.
Information Integration: Slower to produce. Needs openness and transparency. Reduces (models) reference bias. True positive network effect. Purpose-built subsets!
(Diagram labels: Level1 references (very few), true de novo; genome assemblies (many), not fully de novo; low-information genomes (many more): re-seq, captures, etc.; Genome Informative Model)
126 Multi-genomes, genetics and genomics: convergence! Genome Haplotypes as building blocks Graphs with detailed variation Genome Genome Translation system Genome Informative Model Genome Scales have converged already: Markers Genome Genome Genome Long reads Linked reads
127 Questions?
128 If you're still confused... come back for more confusion next year! Thank you! Bernardo J. Clavijo, Gonzalo Garcia, Jon Wright, Luis Yanes