Assembly. Ian Misner, Ph.D. Bioinformatics Crash Course. Bioinformatics Core


Multiple flavors to choose from. De novo: no prior sequence knowledge required. Takes what you have and tries to build the best contigs/scaffolds possible. The more data the better: multiple library types, multiple sequencing platforms. Reference: take your reads and put them back together, but using a guide. The reference doesn't have to be the exact same species or strain. You cannot align what isn't there. DNA sequences give genomes; RNA sequences give transcriptomes.

De Novo Assembly. The process of reconstructing the original DNA sequence from the fragment reads alone. Think of it like a jigsaw puzzle: find reads that fit together (overlap); some reads fit in multiple locations (repeats); your children lost some pieces (sequencing bias); some pieces are dirty (adaptor contamination, errors).

Star Wars-omics. Small genome: Come to the Dark Side we have cookies

Star Wars-omics
Reads:
  e......w
  o The Dar
  Come to
  he Dork Side.
  .we have cookies
  de......
  he Dark Sid
Overlap:
  Come to
        o The Dar
           he Dork Side.
           he Dark Sid
                     de......
                      e......w
                            .we have cookies
Consensus:
  Come to the Dark Side......we have cookies
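The consensus step on this slide can be sketched as a per-column majority vote. This is a minimal illustration, not the method of any particular assembler. The seven reads are read off the slide; the offsets in `LAYOUT` are an assumption (they are the positions at which each read lines up with the consensus sentence), and the function name `consensus` is made up for this example.

```python
from collections import Counter

# (offset, read) pairs. Reads are taken from the slide; offsets are assumed.
LAYOUT = [
    (0, "Come to"),
    (6, "o The Dar"),
    (9, "he Dork Side."),
    (9, "he Dark Sid"),
    (19, "de......"),
    (20, "e......w"),
    (26, ".we have cookies"),
]


def consensus(layout):
    """Call each column by majority vote over the reads that cover it."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    # Uncovered columns (none in this layout) would be reported as "?".
    return "".join(
        col.most_common(1)[0][0] if col else "?" for col in columns
    )


print(consensus(LAYOUT))
```

Two things to notice in the result. The "Dork" error is outvoted, because two other reads say "a" at that column. The capital "T" in "o The Dar" survives, because that column is covered by only one read: with no second opinion, a vote cannot correct anything. Coverage is what makes consensus work.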

Torsten Seemann

Assembly approaches: greedy assembly; overlap-layout-consensus (OLC); de Bruijn graphs; string graphs; seed and extend. These all do the same thing, but they use different shortcuts to deal with the data.

The Ghost of Genomes Past. Genetic maps: Yes. Physical maps: Yes. Knowledge of genome structure: Yes. Haploid genomes: Yes. Accurate & long reads: Yes. Resources (time, people, $$$$$): Yes. What do you get when you have millions of dollars to spend, tons of people to help, accurate long reads, and genetic maps? Keith Bradnam, UC Davis


The Ghost of Genomes Present. Genetic maps; physical maps; knowledge of genome structure; haploid genomes; accurate & long reads; resources (time, people, $$$$$). Yes Yes Yes Yes Yes Yes. What do you get when you have no dollars to spend, one person to help, less accurate short reads, and no genetic maps? Keith Bradnam, UC Davis


What's the problem? We want the best possible assembly from the smallest sequencing cost and the least amount of bioinformatics effort. "I have one simple request, and that is to have sharks with freaking laser beams attached to their heads!" (Dr. Evil)

What's the problem? REPEATS! REPEATS! REPEATS! REPEATS! REPEATS... you get the point. The repeat paradox: it is nearly impossible to resolve repeats of length n unless you have reads longer than n.
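The repeat paradox can be made concrete with a toy example. The sketch below is an idealized illustration under stated assumptions: reads are error-free, every position of the genome is covered, and the sequences and the helper name `all_reads` are invented for this example. Two different genomes that differ only in the order of the unique segments between three copies of a repeat produce exactly the same set of reads when the reads are no longer than the repeat.

```python
REPEAT = "GATTACA"  # repeat of length n = 7 (made-up sequence)

# Same building blocks, but the two middle segments are swapped.
genome_1 = "TTT" + REPEAT + "CCC" + REPEAT + "GGG" + REPEAT + "AAA"
genome_2 = "TTT" + REPEAT + "GGG" + REPEAT + "CCC" + REPEAT + "AAA"


def all_reads(genome, read_len):
    """Every substring of length read_len: perfect, error-free coverage."""
    return {
        genome[i:i + read_len]
        for i in range(len(genome) - read_len + 1)
    }


n = len(REPEAT)

# Reads as long as the repeat: the two genomes are indistinguishable.
print(all_reads(genome_1, n) == all_reads(genome_2, n))

# Reads that span the repeat plus one base on each side: now they differ.
print(all_reads(genome_1, n + 2) == all_reads(genome_2, n + 2))
```

With reads of length n, no read ever sees what lies on both sides of a repeat copy, so no assembler, however clever, can tell the two genomes apart from the reads alone. Once a read reaches past the repeat on both sides, the flanking bases tie each copy to its neighbors.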

What's the problem?

Greedy Assembly. Find sequences with overlaps: 1. Find the largest overlap. 2. Merge the two reads that share it. 3. Repeat until nothing more can be merged. Pros: simple in practice. Cons: early mistakes can create bad assemblies. Lars Arvestad
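The greedy procedure can be sketched in a few lines. This is a minimal teaching sketch, assuming error-free reads, exact suffix-to-prefix matching, and a single strand; the function names `overlap_len` and `greedy_assemble` and the `min_overlap` parameter are invented for this example and do not come from any real assembler.

```python
from itertools import permutations


def overlap_len(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that are fully contained in another read.
    contigs = [
        r for r in contigs
        if not any(r != other and r in other for other in contigs)
    ]
    while True:
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            n = overlap_len(a, b, min_overlap)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            return contigs  # nothing left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_n:])


reads = [
    "have cookies",
    "o the Dark Si",
    "Come to the D",
    "ide we have c",
    "Dark Side we",
]
print(greedy_assemble(reads))
```

On these clean reads the merges chain together into a single contig. The weakness named on the slide shows up as soon as a repeat makes a wrong pair share the largest overlap: the merge is never undone, so every later step builds on the mistake.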

O-L-C. Overlap: What reads overlap? Create a node for each read; create a directed edge for each overlap. Layout: How do we combine those reads? Simplify the graph; find the shortest paths in the graph. Consensus: Derive the contigs from the graph.
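The three OLC stages can be sketched on a toy data set. This is a simplified illustration under stated assumptions, not a production design: reads are error-free, overlaps are exact suffix-to-prefix matches, the graph is simplified only by removing transitive edges, and the layout is assumed to be a single unbranched chain. All function names and the `min_len` parameter are invented for this example. With error-free reads the consensus stage reduces to stitching reads together; real data would need a vote across overlapping reads.

```python
def overlap_len(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap: one node per read, one directed edge per overlap."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_len(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


def remove_transitive_edges(graph):
    """Simplify: drop a->c when a->b and b->c already imply it."""
    reduced = {a: dict(edges) for a, edges in graph.items()}
    for a, edges in graph.items():
        for b in edges:
            for c in graph[b]:
                if c in edges:
                    reduced[a].pop(c, None)
    return reduced


def layout(graph):
    """Layout: walk the chain from the only read with no incoming edge."""
    has_incoming = {b for edges in graph.values() for b in edges}
    starts = [read for read in graph if read not in has_incoming]
    if len(starts) != 1:
        raise ValueError("expected exactly one start read")
    path = [starts[0]]
    while graph[path[-1]]:
        successors = graph[path[-1]]
        if len(successors) != 1:
            raise ValueError("branch in graph (possible repeat)")
        nxt = next(iter(successors))
        if nxt in path:
            raise ValueError("cycle in graph")
        path.append(nxt)
    return path


def consensus(path, graph):
    """Consensus (error-free case): append each read's new characters."""
    contig = path[0]
    for a, b in zip(path, path[1:]):
        contig += b[graph[a][b]:]
    return contig


reads = ["he Dark S", "Come to t", "Dark Side", "o the Dar", "e to the "]
graph = build_overlap_graph(reads)
reduced = remove_transitive_edges(graph)
path = layout(reduced)
print(consensus(path, reduced))
```

Here each read overlaps both its neighbor and the read after that, so the raw graph has extra "skip" edges. Removing them leaves one clean chain. The `ValueError` on a branch is deliberate: a branch is where a repeat would force a real assembler to stop and break the contig.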

Overlap

Layout

Consensus

Ben Langmead - JHU

Keith Bradnam UC Davis

Assembly Quality Assessment

Key Metrics Bird Genome Assembly
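The figure for this slide did not survive, so which metrics it listed is not known. As an assumption, the sketch below shows N50, one contiguity statistic that is commonly reported for assemblies: the length of the contig at which the running total, counting from the longest contig down, first reaches half of the total assembly length. The function name `n50` and the example lengths are invented for this illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths, total 290: 80 + 70 = 150 passes the halfway mark.
print(n50([80, 70, 50, 40, 30, 20]))
```

A single number like this is easy to inflate by joining contigs incorrectly, which is one reason to compare several measures rather than rank assemblies on one.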

Keith Bradnam UC Davis

Where does this leave us? There is not one single assembler that works for every data set. No single assembler performs well across all measures. In the Assemblathon 2 paper, the choice of one command-line option by one tool for one metric caused scoring errors for the overall assembler ranking.

What does this all mean? There is no consensus on how to make a good assembly. Use different assemblers, and use different options within assemblers. Assembly will get better, it has to, but it will take time. Long reads will help! LOOK AT YOUR DATA!