Why are we here? Introduction

1 Why are we here? Introduction

2 Genome assembly Original DNA Fragments Sequenced ends Fragments Contigs Scaffold

3 A correct assembly The right motifs, the correct number of times, in correct order and position.

4 Black box processing DATA Processing RESULTS

5 Working with heuristics DATA Processing RESULTS

6 Why use heuristics for genome assembly? The problem is not completely defined. Exhaustive methods are either too limited, producing only simple partial solutions, or too slow, scaling poorly. Data varies too much and no good models are available. Heuristics are so much faster and easier, and they work! (sometimes, anyway)

7 Black box processing done right DATA Processing RESULTS Use good data, and check that it meets the pre-conditions to be well processed. Know (roughly) how the processing works. Check soundness and sanity of results.

8 Questions? 8

9 Sequencing and assembly 101 Lecture #1

10 A brief history of DNA sequencing 1953: double helix structure (Watson & Crick). 1977: rapid DNA sequencing (Sanger). 1977: first full (5k) genome, bacteriophage Phi X. Late 80s: first production Sanger sequencers. Mid 90s: DNA microarrays. 2001: draft human genome. 2004: first 454 pyrosequencing machine. 2006: first Solexa/Illumina sequencer. 2011: PacBio RS. 2014: Nanopore, Bionano. 2015: Dovetail, 10x.

11 Next generation sequencing

12 Paired libraries (PE and LMP)

13 Single molecule, long reads Figure from: The sequence of sequencers: The history of sequencing DNA Heather & Chain - Genomics 2015

14 Sequencing technologies Updated as of Q1 2017

15 The genome assembly problem (WGS) Original DNA Fragments Sequenced ends Fragments Contigs Scaffold 15

16 A correct assembly has: The right motifs, the correct number of times, in correct order and position. None of which is assessed by length stats. 16

17 Overlap Layout Consensus 17

18 Overlap - Layout - Consensus 18

19 Overlap Layout Consensus: Key points Finding overlaps and defining them is key. The layout can be quite difficult. The method tracks every read. The consensus is constructed from the reads. 19
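
As a rough illustration of the overlap step, the sketch below finds the longest exact suffix-prefix overlap between two reads and merges them. It borrows the two toy reads used in the De Bruijn graph slides; the function names and the `min_len` threshold are assumptions made for this example, and real overlappers also tolerate sequencing errors, which this does not.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Exact matching only; returns 0 if no overlap of at least `min_len` exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a, b, min_len=3):
    """Merge `b` onto the end of `a` if they overlap, else return None."""
    n = overlap(a, b, min_len)
    if n == 0:
        return None
    return a + b[n:]


seq1 = "TTCTAAGT"
seq2 = "CGATTCTA"

# seq2 ends with TTCTA, which is how seq1 starts.
olap = overlap(seq2, seq1)
merged = merge(seq2, seq1)
```

Comparing every read against every other read this way is quadratic in the number of reads, which is one reason the layout step gets expensive on large datasets.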

20 De Bruijn Graphs 20

21 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 21

22 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 TTC 22

23 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 TTC T 1 TCT 23

24 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 TTC T 1 A 1 TCT CTA 24

25 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 TTC T 1 A 1 A 1 TCT CTA TAA 25

26 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 TTC T 1 A 1 A 1 G 1 TCT CTA TAA AAG 26

27 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 TTC T 1 A 1 A 1 G 1 T 1 TCT CTA TAA AAG AGT 27

28 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA T 1 A A G CGA TTC T TCT CTA TAA AAG AGT 28

29 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T T 1 A A G CGA GAT TTC T TCT CTA TAA AAG AGT 29

30 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T G 1 T 1 T A A CGA GAT ATT TTC T TCT CTA TAA AAG AGT 30

31 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 1 A 1 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT 31

32 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 1 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT 32

33 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT 33

34 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT CGA 34

35 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT CGAT 35

36 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT CGATT 36

37 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT CGATTC 37

38 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT CGATTCTAAGT 38

39 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 1 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT CGATTCTAAGT 39

40 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 1 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT 40

41 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 1 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT 41

42 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT 42

43 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT G 1 TTG 43

44 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT G 1 TTG T 1 TGT 44

45 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 1 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT G 1 T 1 A 1 TTG TGT GTA 45

46 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 2 G 1 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT G A 1 T 1 TTG TGT GTA A 1 46

47 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 2 G 2 T 1 CGA GAT ATT TTC TCT CTA TAA AAG AGT G A 1 T 1 TTG TGT GTA A 1 47

48 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 2 G 2 T 2 CGA GAT ATT TTC TCT CTA TAA AAG AGT G A 1 T 1 TTG TGT GTA A 1 48

49 Assembling a De Bruijn Graph >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT 2 T 2 T 2 C 2 T 2 A 2 A 2 G 2 T 2 CGA GAT ATT TTC TCT CTA TAA AAG AGT G A 1 T 1 TTG TGT GTA A 1 CGATTCTAAGT CGATTGTAAGT 49
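
The build-up shown in the slides above can be sketched in a few lines. This is a minimal illustration assuming k=3, forward strand only, and edges drawn between kmers that are consecutive within a read; the function names are made up for this example and it is not how any particular assembler stores its graph.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All kmers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Count kmers (nodes) and link kmers that follow each other in a read."""
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        counts.update(ks)
        for a, b in zip(ks, ks[1:]):
            edges[a].add(b)
    return counts, edges


def spell_paths(counts, edges):
    """Spell every path from a node with no incoming edge to a dead end."""
    targets = {b for succ in edges.values() for b in succ}
    sources = sorted(n for n in counts if n not in targets)
    paths = []

    def walk(node, seq, seen):
        nxt = [n for n in sorted(edges.get(node, ())) if n not in seen]
        if not nxt:
            paths.append(seq)
            return
        for n in nxt:
            walk(n, seq + n[-1], seen | {n})

    for s in sources:
        walk(s, s, {s})
    return sorted(paths)


reads = ["TTCTAAGT", "CGATTCTA", "CGATTGTAAGT"]
counts, edges = build_graph(reads, 3)
paths = spell_paths(counts, edges)
```

With the three reads from the slides, the graph has a single bubble (TTC-TCT-CTA versus TTG-TGT-GTA), so two sequences are spelled out, matching the last slide of the build-up.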

50 Graphs get complicated Image from dskernel.blogspot.co.uk 50

51 Common structures Polymorphism Errant base call Repeating element 51

52 Cleaning graphs 52

53 Cleaning graphs Clip tips 53

54 Cleaning graphs Clip tips Remove low coverage nodes 54

55 Cleaning graphs Clip tips Remove low coverage nodes Remove bubbles 55

56 Cleaning graphs Clip tips Remove low coverage nodes Remove bubbles 56

57 Cleaning graphs Clip tips Remove low coverage nodes Remove bubbles 57

58 OLC vs. De Bruijn
Representation of the problem. OLC: (reads) overlap graph. De Bruijn: De Bruijn (kmer) graph.
Steps to add a read. OLC: insert read, compare to every read already included and insert overlaps. De Bruijn: insert new kmers or update count for those already present.
Strengths. OLC: tracks reads, intuitive representation, consensus. De Bruijn: computational speed, ability to handle big datasets.
Optimal depth. OLC: just enough to cover genome and give accurate consensus. De Bruijn: the higher the better (to grow SNR).
Typical sequencing technologies processed. OLC: long reads (PacBio, Nanopore), hybrids (Illumina + long reads). De Bruijn: Illumina.

59 The size of the universe (Table: k is odd vs. k is even, for the noncanonical and the canonical representation.)
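
The numbers behind this comparison follow from simple counting, sketched below. There are 4^k possible kmers; in the canonical representation a kmer and its reverse complement are the same entry, which halves the universe for odd k, while for even k the reverse-complement palindromes (4^(k/2) of them) pair with themselves. The function names are made up for this example, and the brute-force check is only practical for small k.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def kmer_universe(k):
    """Return (noncanonical, canonical) number of possible kmers of size k."""
    total = 4 ** k
    if k % 2 == 1:
        # An odd kmer can never equal its own reverse complement.
        canonical = total // 2
    else:
        palindromes = 4 ** (k // 2)
        canonical = (total + palindromes) // 2
    return total, canonical


def canonical_brute_force(k):
    """Count canonical kmers by enumeration (small k only)."""
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        revcomp = "".join(COMPLEMENT[c] for c in reversed(kmer))
        seen.add(min(kmer, revcomp))
    return len(seen)
```

This is one reason odd k values are the usual choice: no kmer is its own reverse complement, so every canonical entry stands for exactly two strand-specific kmers.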

60 The K tradeoff Longer kmers are more unique in the target, disentangling the graph. Smaller kmers will overlap more often, favouring contiguity. Every read produces L-k+1 kmers. Higher k -> less coverage. Every single error affects k kmers. Higher k -> more errors. A typical choice for 100bp reads is k=71. 60
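
The arithmetic of the tradeoff can be made concrete with a short sketch. The function names are made up for this example, and the 50x base coverage in the usage lines is an arbitrary assumption.

```python
def kmers_per_read(read_len, k):
    """A read of length L yields L - k + 1 kmers (none if the read is shorter than k)."""
    return max(read_len - k + 1, 0)


def kmer_coverage(base_coverage, read_len, k):
    """Expected kmer coverage for a given base coverage."""
    return base_coverage * kmers_per_read(read_len, k) / read_len


def kmers_hit_by_error(pos, read_len, k):
    """Number of kmers in a read that contain the base at 0-based position `pos`."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, read_len - k)
    return max(last_start - first_start + 1, 0)


# 100bp reads at k=71: only 30 kmers per read, so 50x base coverage
# becomes 15x kmer coverage, and an error in the middle of the read
# spoils every kmer the read produces.
per_read = kmers_per_read(100, 71)
cov = kmer_coverage(50, 100, 71)
spoiled = kmers_hit_by_error(50, 100, 71)
```

An error affects up to k kmers; near the ends of a read, or when k is close to the read length, the count is capped by how many kmers actually cover that base.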

61 Resolving repeats using reads 61

62 A correct assembly has: The right motifs, the correct number of times, in correct order and position. None of which is assessed by length stats. 62

63 Graphs, contigs, and scaffolds Graphs: the assembler's representation. More information. Allow some back-tracking. Can encode support/ambiguity.
Sequence origin | Expected quality | Main quality driver
Unitig: 1 element in the graph | Very high | Sequence data, cleanup, overlap detection
Contig: supported chain in the graph | High | + graph complexity, single-read mapping & entropy
Scaffold: external-link group of contigs | Variable | + pair reliability, parametrisation
Visualization of a w2rap-contigger GFA for an E. coli dataset assembly. Rendered using Bandage (Wick R.R. et al., Bioinformatics, 31(20),

64 Beware of N50 N50 is the most used metric in the assembly world, and it should not be: using contiguity as the primary goal rewards risky joining. N50 is affected by filtering, and not very sensitive! (Figure: example contig sets, with lengths from 400bp to 1400bp.)
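
For reference, N50 is the length such that sequences of that length or longer hold at least half of the assembled bases. The sketch below computes it and shows how filtering alone can move it; the contig sets are hypothetical, picked from the range of lengths on the slide rather than reproducing its figure.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Hypothetical contig sets (lengths in bp).
contigs = [1400, 500, 500, 500, 500, 400, 400]
unfiltered = n50(contigs)
filtered_400 = n50([c for c in contigs if c > 400])   # drop the 400bp contigs
filtered_500 = n50([c for c in contigs if c > 500])   # drop everything up to 500bp
```

Dropping the 400bp contigs leaves N50 unchanged at 500, while a slightly stricter filter jumps it to 1400 without a single base of the assembly getting any better.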

65 Contiguity stats N50 is the most used metric in the assembly world, and it should not be. (Figure: cumulative length vs. sequence count for scaffolds, contigs and unitigs.) Don't forget to check your Ns!!!

66 Running ABySS as a first-pass assembler It runs easily and can use both single- and multi-host multiprocessing. It creates a ton of useful output and a nice log: kmer spectra, LMP fragment size histogram (mapped to contigs), PE fragment size histogram (mapped to unitigs), length stats, redirected log.

67 Fragment Sizes (Figure: histogram of fragment count vs. fragment size, in bins of 10bp.)

68 Questions? 68

70 Assembly and graph exploration Hands on #1

71 Back to the drawing board: experiment design, QC, data preparation, QC, assembly, QC, QC Lecture #2

72 Assembly project workflow Prior Knowledge Genome Characteristics Preliminary Evaluation Objectives Karyotype: Genome size, Ploidy Heterozygosity Sequencing Strategy GC content Sequencing Contaminants / Symbionts Sequencing Data Data Sets: Close relatives Draft Assemblies Validation, and feature Analysis External Validation Data Genes / ESTs / RNAseq / Markers Objectives met? Final Assembly and Validation Mitochondria Assembled NGS scaffolds Chloroplast

73 Experiment design (you choose the data!) Know your biological question. Plan your data processing (from an information perspective). Decide on conditions and biological/technical replicates. Decide on technologies and coverages: How will the typical bias affect your experiment? Is the coverage enough? Will the results be significant?

75 The assembly is just a probabilistic model of a genome, condensing the information from the experimental evidence. All the information is already present in the experimental results. 75

76 A correct assembly has: The right motifs, the correct number of times, in correct order and position. None of which is assessed by length stats. 76

77 Sample and library preparation: a source of bias DNA/RNA extraction techniques have bias, and sample quality limits sequencing! Samples are never pure. PCR generates further bias. No chemical reaction is perfect, nor complete. You can learn what your typical biases are: Assess them. Take their impact into account. Try to get better data produced.

78 Do not neglect the QC data from the lab Concentrations. Sample contamination. Fragment sizes! 78

79 Read preparation: Adaptor trimming: if you have lots of adaptor sequence, but ESPECIALLY if you have linkers from LMP (check NextClip). Pair joining: allows higher k on overlapping reads; might lose longer fragments. Quality trimming: only if your data is terrible and you are short of memory. Error correction: once it miscorrects, all subsequent processing is tainted. Your approach should be able to cope with errors; EC is just one option. PacBio reads are a special case, more about that later. Deduplication: hard to do right, sometimes needed, scaffolders handle it. Digital normalisation: rna-* / meta-*, and only if you understand what it does. IN GENERAL: Illumina is better than it used to be. Keep it in mind.

80 Counting kmers >seq1 TTCTAAGT >seq2 CGATTCTA >seq3 CGATTGTAAGT (Figure: the De Bruijn graph of the three sequences with a count on each kmer, next to a histogram of kmer count vs. kmer frequency.)
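
Counting the kmers of the three toy sequences and turning the counts into a spectrum can be sketched as follows. This assumes k=3 and forward-strand (noncanonical) counting, which is what the graph slides use; the function names are made up for this example, and a real kmer counter such as KAT works very differently under the hood.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every kmer of size k across a list of sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectrum(counts):
    """Histogram: kmer frequency -> number of distinct kmers seen that many times."""
    return dict(Counter(counts.values()))


reads = ["TTCTAAGT", "CGATTCTA", "CGATTGTAAGT"]
counts = count_kmers(reads, 3)
spec = spectrum(counts)
```

Here nine distinct kmers are seen twice and three (TTG, TGT, GTA, the variant branch) are seen once, so the spectrum has just two bars.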

81 The kmer spectra 81

82 The kmer spectra's components (Figure: density vs. frequency, with a component at 10x.)

83 The kmer spectra's components (Figure: density vs. frequency, with components at 10x and 20x.)

84 The kmer spectra's components (Figure: density vs. frequency, with components at 10x, 20x and 30x.)

85 The kmer spectrum and its dissection. We typically use KAT to count kmers. From the spectrum you can read: kmer coverage, genome size, errors vs. good kmers. Comparing different spectra (KAT) is a reference-free library assessment: it runs fast and gives at least a better vs. worse result.
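
One common way to read kmer coverage and genome size off a spectrum is sketched below: ignore the low-frequency error kmers, take the main peak as the kmer coverage, and divide the total number of remaining kmer occurrences by it. The spectrum values and the error cutoff are invented for illustration, and this is a back-of-the-envelope estimate rather than what KAT itself computes.

```python
def estimate_genome_size(spectrum, error_cutoff):
    """Estimate (kmer coverage, genome size) from a kmer spectrum.

    `spectrum` maps frequency -> number of distinct kmers at that frequency.
    Kmers at or below `error_cutoff` are treated as sequencing errors.
    """
    solid = {f: n for f, n in spectrum.items() if f > error_cutoff}
    # The tallest remaining bar is taken as the single-copy kmer coverage.
    peak = max(solid, key=solid.get)
    total_occurrences = sum(f * n for f, n in solid.items())
    return peak, total_occurrences // peak


# Invented toy spectrum: an error tail at 1-2x, a main peak at 10x,
# and a small two-copy component at 20x.
toy_spectrum = {1: 500, 2: 50, 9: 100, 10: 800, 11: 100, 20: 50}
coverage, size = estimate_genome_size(toy_spectrum, error_cutoff=3)
```

The 20x kmers count twice towards the size, which is the point of weighting by frequency: content present in two copies should occupy twice the space in the genome model.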

86 If an assembly is correct, then the original reads should be a plausible sequencing set for the resulting genome model.

87 Checking content inclusion using KAT Just compare the frequency of kmers in the assembly to the reads spectrum. 87
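
The comparison can be sketched as a small matrix: for each kmer frequency in the reads, how many distinct kmers appear 0, 1, 2, ... times in the assembly. This is only an illustration of the idea on the toy sequences from the earlier slides, with forward-strand counting, k=3 and made-up function names; it is not KAT's implementation, and kmers found only in the assembly are ignored here.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """Count every kmer of size k across a list of sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def inclusion_matrix(reads, assembly, k):
    """Map read frequency -> {copies in assembly -> number of distinct kmers}."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    matrix = defaultdict(Counter)
    for kmer, freq in read_counts.items():
        matrix[freq][asm_counts.get(kmer, 0)] += 1
    return {freq: dict(row) for freq, row in matrix.items()}


reads = ["TTCTAAGT", "CGATTCTA", "CGATTGTAAGT"]
assembly = ["CGATTCTAAGT"]          # only one branch of the bubble was kept
matrix = inclusion_matrix(reads, assembly, 3)
```

All nine kmers seen twice in the reads are present once in the assembly, while the three kmers of the variant branch, seen once in the reads, are absent from it: exactly the kind of discarded content the spectra plots make visible.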

88 KAT vs. CEGMA C. fraxinea - Ash Dieback - NORNEX

89 Real case Heterozygous content Homozygous content Duplications Errors Discarded heterozygous content Discarded homozygous content 89

90 Assembly validation tools (Figure: alignment of PacBio reads to hexaploid wheat assemblies; cumulative percentage of aligned sequences vs. percent coverage, for IWGSC (with 3B), IWGSC (publication), Synthetic W7984, Earlham v1, Earlham v1.1 and Cadenza.)

91 Finding breakpoints by mapping reads

92 Example from the wheat genome

93 Questions? 93

95 Stare at the spectra Group activity #1

96 The sequencing portfolio: Every technology and how to combine them Lecture #3

97 Take your pick Illumina paired end: a good and cheap way to get the motifs. Long mate pairs: a hint at order and distances. 10x linked reads: Illumina + molecule tags! PacBio/Nanopore long reads: longer, not very precise, motifs. PacBio/Nanopore circular consensus reads: long, expensive, precise motifs. Optical maps: good positional information. Hi-C: spatial distance, relates to linear distance; Dovetail cleans it up. Genetic maps, markers, deletion bins, synteny, etc, etc, etc.

102 Creating and Sequencing Paired Libraries 102

103 Scaffolding with paired reads 103

104 Fragment Sizes (Figure: histogram of fragment count vs. fragment size, in bins of 10bp.)

105 Read mapping stats 105

106 Read mapping stats 106

107 About gap closing Beware: Heuristics are too greedy. If there was a gap, when did we lose that information, and why? Walking is not the same as bridging. You can be masking problems. If you need to: Make it the last step. Check QC, metrics and stats before and after, and eyeball typical cases. Be conscious that it IS a patch.

111 The 10x lost SNPs tale Courtesy of Graham Etherington

112 Example 1: PacBio + Illumina, MaSuRCA "Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm", Aleksey V. Zimin et al., Genome Research

113 Example 2: complex haplotype phasing PCR free Paired End DISCOVAR de Novo Contig Assembly Filter Contigs Alternative Haplotype recovery Nextera LMP SOAP-based scaffolding BAC-by-BAC Sequences Haplotype phasing and reconstruction Haplotype Phased Assembly Pop-seq Optical map based extension Bionano Optical Maps Pseudo-molecule Reconstruction Collapsed Haplotype Scaffolded Assembly Chromosome Binned Assembly 113

114 Example 3: 10x haplotype phasing Naive Short-Read approach Original Haplotypes Short read sequencing ($) Reads (no phasing) Chimeric/mixed haplotypes w2rap: preserving variation in assembly graphs Original Haplotypes Short read sequencing ($) Reads (no phasing) Assembly graph (no phasing - all paths) +LMP ($) Updated Roadmap: Collapsed/long haplotypes Assembly graph (no phasing - all paths) +LMP ($) +10x ($$) +Maps Phased and long Haplotypes 114

115 Questions? 115

117 Playing with longer-range, and QC Hands on #2

118 Questions? 118

119 The genome assembly clinic Group Activity #2

120 Questions? 120

121 Closing remarks Don't do assembly, do research!

122 The graph is the assembly Graphs: the assembler's representation. More information. Allow some back-tracking. Can encode support/ambiguity. GFA format: gaining good acceptance; many assemblers (inc. w2rap); Fasta + Graph supported. Visualization of a w2rap-contigger GFA for an E. coli dataset assembly. Rendered using Bandage (Wick R.R. et al., Bioinformatics, 31(20),

123 The graph has more information on genes and regions Better region representation. Better gene representation. Example: A. thaliana assembly. Total TAIR CDSs: In contigs: (98.77%) Not in contigs: 337 (1.23%) Paths found automatically*: 175 (51.93%) Most of the rest have paths, just more complex. *Ad-hoc analysis using blast+ and Bandage; "automatically" means by Bandage path finding.

124 Moving away from the reference Reference bias: Problem in human genome analysis. Crops are more complex and plastic -> bigger problem. Many genomes: What about annotation? Many analyses and reconciliation? Which reference is best? for which study?

125 Different options with multi-genome analysis Hierarchical Stacking Information Integration Low-information genomes Re-seq, Captures, etc (Many More) Genome Genome Genome Genome assemblies Not fully de novo, (Many) Genome Informative Model Genome Level1 references (Very Few) True de novo Genome Genome Genome Fast to produce, starts looking great Can be opaque Forces reference bias Lock-in factor / difficult to evolve Slower to produce Needs openness and transparency Reduces (models) reference bias True positive network effect Purpose-built subsets!

126 Multi-genomes, genetics and genomics: convergence! Genome Haplotypes as building blocks Graphs with detailed variation Genome Genome Translation system Genome Informative Model Genome Scales have converged already: Markers Genome Genome Genome Long reads Linked reads

127 Questions? 127

128 If you're still confused... come back for more confusion next year! Thank you! Bernardo J. Clavijo, Gonzalo Garcia, Jon Wright, Luis Yanes