Assembly Ian Misner, Ph.D. Bioinformatics Crash Course
Multiple flavors to choose from. De novo: No prior sequence knowledge required. Takes what you have and tries to build the best contigs/scaffolds possible. The more data the better: multiple library types, multiple sequencing platforms. Reference: Take your reads and put them back together, but using a guide. The reference doesn't have to be the exact same species or strain. You cannot align what isn't there. DNA sequences: Genomes. RNA sequences: Transcriptomes.
De Novo Assembly The process of reconstructing the original DNA sequence from the fragment reads alone. Think of it like a jigsaw puzzle: Find reads that fit together (overlap). Some reads fit in multiple locations (repeats). Your children lost some pieces (sequencing bias). Some pieces are dirty (adaptor contamination, errors).
Star Wars-omics Small genome Come to the Dark Side we have cookies
Star Wars-omics Reads: e......w o The Dar Come to he Dork Side..we have cookies de......he Dark Sid Overlap: Come to o The Dar he Dork Side. Dark Sid e......w de......he.we have cookies Consensus: Come to the Dark Side......we have cookies
Torsten Seemann
Assembly approaches: Greedy assembly; Overlap-layout-consensus (OLC); de Bruijn graphs; String graphs; Seed and extend. These all do the same thing, but they use different shortcuts to deal with the data.
The Ghost of Genomes Past. Genetic maps, physical maps, knowledge of genome structure, haploid genomes, accurate & long reads, resources (time, people, $$$$$): Yes Yes Yes Yes Yes Yes. What do you get when you have millions of dollars to spend, tons of people to help, accurate long reads, and genetic maps? Keith Bradnam, UC Davis
The Ghost of Genomes Present. Genetic maps, physical maps, knowledge of genome structure, haploid genomes, accurate & long reads, resources (time, people, $$$$$): Yes Yes Yes Yes Yes Yes. What do you get when you have no dollars to spend, one person to help, less accurate short reads, and no genetic maps? Keith Bradnam, UC Davis
What's the problem? We want the best possible assembly from the smallest sequencing cost and the least amount of bioinformatics effort. "I have one simple request, and that is to have sharks with freaking laser beams attached to their heads!" (Dr. Evil)
What's the problem? REPEATS! REPEATS! REPEATS! REPEATS! REPEATS... you get the point. The repeat paradox: It is nearly impossible to resolve repeats of length n unless you have reads longer than n.
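The repeat paradox can be illustrated with a toy example. The sketch below is not from the slides: the two genomes, the repeat `ACG`, and the helper `all_reads` are hypothetical, and the reads are idealized (error-free, one read starting at every position). Two different genomes that share the same repeat three times yield exactly the same set of reads when the reads are no longer than the repeat, so no assembler could tell them apart. Reads long enough to span the repeat plus both flanks separate them.

```python
from collections import Counter


def all_reads(genome: str, read_len: int) -> Counter:
    """Idealized sequencing: every substring of length read_len, with counts."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "ACG"  # hypothetical repeat of length n = 3

# Same repeat three times; only the unique pieces between the copies are swapped.
genome_a = "TT" + REPEAT + "C" + REPEAT + "G" + REPEAT + "TT"
genome_b = "TT" + REPEAT + "G" + REPEAT + "C" + REPEAT + "TT"

for read_len in (3, 5):
    same = all_reads(genome_a, read_len) == all_reads(genome_b, read_len)
    verdict = "indistinguishable" if same else "distinguishable"
    print(f"read length {read_len}: genomes are {verdict}")
```

With reads of length 3 the two genomes produce identical read sets; with reads of length 5 they do not.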
What's the problem?
Greedy Assembly Find sequences with overlaps: 1. Find the largest overlaps 2. Merge those overlaps Pros: Simple in practice. Cons: Early mistakes can create bad assemblies. Lars Arvestad
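The two greedy steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular assembler is implemented: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the toy reads are all hypothetical, and the reads are assumed to be error-free and from one strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Step 1: find the largest overlap among all ordered pairs.
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Step 2: merge that pair into one sequence.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["Dark Side we h", "Come to the D", "de we have cookies", "o the Dark Si"]
print(greedy_assemble(reads))
```

Because each merge is final, a wrong early merge (for example, one caused by a repeat) is never undone, which is the weakness noted in the cons.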
O-L-C Overlap: What reads overlap? Create a node for that read Create a directed edge Layout: How do we combine those reads? Simplify graph Find the shortest paths in the graph Consensus: Derive the contigs from the graphs.
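The three O-L-C stages can be sketched on toy data as follows. Everything here is an assumption made for illustration: the function names, the `min_len` threshold, and the reads are hypothetical, the reads are error-free, and the genome is linear with no repeats. The layout step simply follows the best unused edge rather than performing real graph simplification, and the consensus step concatenates reads rather than voting over a multiple alignment.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads: list[str], min_len: int = 3) -> dict[str, dict[str, int]]:
    """Overlap: one node per read, one directed edge a -> b weighted by overlap length."""
    graph: dict[str, dict[str, int]] = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


def layout(graph: dict[str, dict[str, int]]) -> list[str]:
    """Layout (toy): start at a read with no incoming edge, follow the best edge."""
    has_incoming = {b for edges in graph.values() for b in edges}
    start = next(r for r in graph if r not in has_incoming)
    path, seen = [start], {start}
    while True:
        candidates = {b: n for b, n in graph[path[-1]].items() if b not in seen}
        if not candidates:
            return path
        nxt = max(candidates, key=candidates.get)
        path.append(nxt)
        seen.add(nxt)


def consensus(path: list[str], graph: dict[str, dict[str, int]]) -> str:
    """Consensus (toy): append the non-overlapping tail of each read along the path."""
    contig = path[0]
    for prev, cur in zip(path, path[1:]):
        contig += cur[graph[prev][cur]:]
    return contig


reads = ["Dark Side we h", "Come to the D", "de we have cookies", "o the Dark Si"]
graph = build_overlap_graph(reads)
path = layout(graph)
print(consensus(path, graph))
```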
Overlap
Layout
Consensus
Ben Langmead - JHU
Keith Bradnam UC Davis
Assembly Quality Assessment
Key Metrics Bird Genome Assembly
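The metrics shown on this slide are not recoverable from the text. As an assumption, one statistic commonly reported for assemblies is N50: the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, with a hypothetical function name and made-up contig lengths:

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 requires at least one contig")


# Made-up contig lengths for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

A single number like this says nothing about correctness, so it is best read alongside other measures.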
Keith Bradnam UC Davis
Where does this leave us? There is not one single assembler that works for every data set. No single assembler performs well across all measures. In the Assemblathon 2 paper, the choice of one command option by one tool for one metric caused scoring errors for the overall assembler ranking.
What does this all mean? There is no consensus on how to make a good assembly. Use different assemblers, and use different options within assemblers. Assembly will get better, it has to, but it will take time. Long reads will help! LOOK AT YOUR DATA!