A shotgun introduction to sequence assembly (with Velvet) MCB 247 - Brem, Eisen and Pachter
Hot off the press
January 27, 2009 06:00 AM Eastern Time
Illumina Launches Suite of Next-Generation Sequencing Kits
New Kits Dramatically Increase Throughput and Bring Powerful Sequencing Applications Within Reach of Every Customer
SAN DIEGO--(BUSINESS WIRE)--Illumina (NASDAQ:ILMN) today announced the release of new sequencing chemistry kits and complementary software for its Genome Analyzer system. These new kits and software enable researchers to generate 40% more reads per run and extend read length to greater than 75 base pairs (bp). Also launched is the new Mate Pair Library Preparation Kit, which provides support for generating longer insert paired-end libraries and is complementary to Illumina's existing short-end paired libraries. These new improvements enable researchers to generate 10 to 15 Gigabases (Gb) of high-quality data per run, more than doubling the output previously attainable on the Genome Analyzer.
"The availability of mate pair library kits and long paired-end reads has greatly increased the flexibility and capacity of our Illumina sequencers. I believe that they have greatly improved our ability to sequence cDNA libraries and may even open up the possibility to do de novo sequencing on the Illumina sequencer," said W. Richard McCombie, Ph.D., Professor at the Cold Spring Harbor Laboratory. "They are also greatly helping our medical resequencing by giving us more data and the ability to look for small insertions and deletions in patient samples."
Illumina's unique combination of very high density and long reads allows researchers to economically take on a broad range of projects, such as whole human genome sequencing and de novo sequencing of complex organisms. In addition to the higher output and longer reads afforded by the new kits and software, Illumina's flexible mate pair technique allows researchers to generate paired-end insert libraries measuring two to five kilobases (kb) to more comprehensively catalogue large structural variations.
Coupled with Illumina's standard paired-end insert libraries (200-500 bp), which are necessary for detection of smaller structural variants, these kits provide researchers with the most comprehensive set of library preparation tools for accurate and comprehensive sequencing and characterization of complex genomes.
"In addition to providing new solutions for de novo sequencing, the combination of short insert paired-end reads with the new longer insert mate pair sequencing is the most powerful approach for maximal coverage across the genome. This combination enables detection of the widest range of structural variant types and is essential for accurately identifying complex rearrangements," said David Bentley, Vice-President and Chief Scientist of DNA Sequencing at Illumina.
Under an early access program, researchers at the National Center for Genome Resources (NCGR) have started working with the new long read and Mate Pair Library Kits. "At NCGR, the long read and mate pair chemistries are already enabling our cotton de novo and human resequencing projects. Four of our Genome Analyzers are now dedicated to 2 x 88 and 2 x 106 base pair runs, generating up to 20.5 Gigabases per run and a raw accuracy of greater than 99% over 106 base pairs. Additionally, we're excited to use these improvements for structural variant detection and metagenomics," said Greg May, Ph.D., Director of the Genome Center at NCGR.
Assembly basics
(Paired) read length, insert size, coverage, contigs, scaffolds, Lander-Waterman model/equation/statistics, N50 metric
The chicken (puzzle) and egg (assembly)
The chicken is the sequenced part of the genome (you don't know what this is, but it's definitely incomplete). This is the puzzle. The egg is the assembly you produce.
Contigs and Scaffolds
Notation
$L$ = read length
$T$ = minimum detectable overlap
$G$ = genome size
$N$ = number of reads
$c = NL/G$ = coverage
$\theta = T/L$
$\sigma = 1 - \theta$
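Lander-Waterman
Expected number of islands: $N e^{-c\sigma}$
Expected number of islands consisting of $j$ clones: $N e^{-2c\sigma} \left(1 - e^{-c\sigma}\right)^{j-1}$
Expected number of contigs: $N e^{-c\sigma} - N e^{-2c\sigma}$
Expected length of an island: $L \left( \frac{e^{c\sigma} - 1}{c} + 1 - \sigma \right)$
Expected length of a contig: $L \, \frac{1}{1 - e^{-c\sigma}} \left( \frac{e^{c\sigma} - 1}{c} + 1 - \sigma - e^{-c\sigma} \right)$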
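The Lander-Waterman expectations can be evaluated numerically. The following is a minimal sketch and not part of the original slides: the function names, dictionary keys and example values are illustrative assumptions. It treats a contig as an island of at least two reads, so the contig length is the total island length minus the singleton islands (each of length L), divided by the number of contigs.

```python
import math


def lander_waterman(N, L, G, T):
    """Evaluate the Lander-Waterman expectations (assumes N > 0 and T < L)."""
    c = N * L / G            # coverage
    sigma = 1.0 - T / L      # sigma = 1 - theta, with theta = T / L
    x = c * sigma
    p = math.exp(-x)         # e^{-c sigma}
    q = -math.expm1(-x)      # 1 - e^{-c sigma}, computed stably for small x
    islands = N * p
    singletons = N * p * p   # islands made of exactly one read (j = 1)
    island_length = L * (math.expm1(x) / c + 1.0 - sigma)
    contig_length = L / q * (math.expm1(x) / c + 1.0 - sigma - p)
    return {
        "coverage": c,
        "islands": islands,
        "singleton_islands": singletons,
        "contigs": islands - singletons,
        "island_length": island_length,
        "contig_length": contig_length,
    }


def islands_with_j_reads(N, L, G, T, j):
    """Expected number of islands consisting of exactly j reads (j >= 1)."""
    x = (N * L / G) * (1.0 - T / L)
    return N * math.exp(-2.0 * x) * (-math.expm1(-x)) ** (j - 1)


# Hypothetical example: 100,000 reads of 36 bp from a 1 Mb genome,
# requiring an 18 bp overlap to detect that two reads join.
stats = lander_waterman(N=100_000, L=36, G=1_000_000, T=18)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Raising the coverage c in this sketch shows the characteristic behavior of the model: the number of islands first rises with the number of reads, then falls as islands merge.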
Quantifying an assembly
In addition to recording # contigs, # scaffolds, etc., a popular number is the N50 size: the largest number E such that at least half of the bases are in contigs (scaffolds) of size at least E.
Example: If the contigs have sizes 7, 4, 3, 2, 2, 1, 1 (kb), the N50 contig size is 4 kb (the total is 20 kb, and the two largest contigs together cover 7 + 4 = 11 kb, which is at least half).
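One common way to compute the N50 size is to sort the contigs from largest to smallest and accumulate their sizes until at least half of the total is covered; the size of the contig that crosses the halfway mark is the N50. The sketch below is illustrative and not from the slides; the function name `n50` is an assumption.

```python
def n50(sizes):
    """Return the N50 of a collection of contig (or scaffold) sizes."""
    total = sum(sizes)
    running = 0
    for size in sorted(sizes, reverse=True):
        running += size
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return size
    raise ValueError("sizes must be non-empty")


# The example from the slide, with sizes in kb.
print(n50([7, 4, 3, 2, 2, 1, 1]))
```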
Fragment assembly
Computational challenge: assemble individual short fragments (reads) into a single genomic sequence (superstring).
Difficult because of: repeats, sequencing errors, sequencing bias, strand ambiguity, lack of unique solution, size of problem.
Computational complexity
Problem: Given a set of strings, find a shortest string that contains all of them.
Input: Strings $s_1, s_2, \ldots, s_n$.
Desired output: A string $s$ that contains all strings $s_1, s_2, \ldots, s_n$ as substrings, such that the length of $s$ is minimized.
This is a hard problem: the shortest common superstring problem is NP-hard.
Example
Set of strings: 000, 001, 010, 011, 100, 101, 110, 111
A superstring: 000001010011100101110111
Shortest superstring: 0001110100
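The example can be checked mechanically. The sketch below is not from the slides: it verifies that a candidate contains every input string, and it also includes a greedy overlap-merging heuristic. That heuristic is a swapped-in illustration, and it is not guaranteed to find a shortest superstring. All function names are assumptions.

```python
def is_superstring(candidate, strings):
    """True if every input string occurs in candidate as a substring."""
    return all(s in candidate for s in strings)


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of strings with the largest overlap."""
    pool = sorted(set(strings))
    # Strings contained in another string add nothing to the superstring.
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for n, s in enumerate(pool) if n not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""


strings = ["000", "001", "010", "011", "100", "101", "110", "111"]
print(is_superstring("000001010011100101110111", strings))
print(is_superstring("0001110100", strings))
print(greedy_superstring(strings))
```

Eight distinct strings of length 3 need at least 8 + 3 - 1 = 10 characters, which is why the 10-character superstring in the example is optimal.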
Representing assemblies with de Bruijn graphs
Velvet Overview
Step 1: Construct the de Bruijn graph from the reads.
Step 2: Simplification.
Step 3: Error removal.
Step 4: Resolution of repeats.
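Step 1 can be illustrated with a minimal sketch of de Bruijn graph construction. This is a textbook-style version, not Velvet's actual data structure: nodes are (k-1)-mers, each k-mer observed in a read contributes one edge, and reverse-complement strands are ignored for brevity. The function name `build_de_bruijn` and the example reads are assumptions.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to a Counter of the (k-1)-mers that follow it.

    The count on an edge is the number of times the corresponding k-mer
    was observed across all reads.
    """
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


# Two hypothetical overlapping reads.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, dict(successors))
```

Edge counts act as a rough per-k-mer coverage, which is the kind of information later error-removal steps can use to tell genuine sequence from sequencing errors.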
Removing tips
A tip is a chain of nodes that is disconnected on one end. Tips arise from sequencing errors and coverage gaps. Short tips (shorter than 2k bp, where k is the k-mer length) are clipped.
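Tip clipping can be sketched on a plain adjacency map. This is a simplified illustration, not Velvet's implementation: the graph is a dict from node to a set of successors, tip length is measured in nodes (assuming each node adds one base), and a tip is removed only on the basis of its length. Velvet applies additional criteria that are not modeled here, so in this sketch a fork with two short branches would lose both. The names `find_tip` and `clip_tips` are assumptions.

```python
def find_tip(start, forward, backward):
    """Walk from a dead end back to the junction it hangs from.

    `start` has no neighbors in `forward`. Returns the chain of nodes,
    or None if the chain is not attached to a branching node.
    """
    chain = [start]
    node = start
    while True:
        if len(backward[node]) != 1:
            return None                      # isolated chain or merge point
        parent = next(iter(backward[node]))
        if len(forward[parent]) > 1:
            return chain                     # parent is the junction
        chain.append(parent)
        node = parent


def clip_tips(succ, k):
    """Return a copy of the graph without tips shorter than 2k nodes.

    Every node must appear as a key of `succ`, possibly with an empty set.
    """
    pred = {u: set() for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    doomed = set()
    for node in succ:
        tip = None
        if not succ[node]:                   # dead end at the end of a chain
            tip = find_tip(node, succ, pred)
        elif not pred[node]:                 # dead end at the start of a chain
            tip = find_tip(node, pred, succ)
        if tip is not None and len(tip) < 2 * k:
            doomed.update(tip)
    return {u: {v for v in vs if v not in doomed}
            for u, vs in succ.items() if u not in doomed}


# Hypothetical graph: a main path n0 -> ... -> n9 with a one-node tip at n3.
succ = {f"n{i}": {f"n{i + 1}"} for i in range(9)}
succ["n9"] = set()
succ["n3"].add("x")
succ["x"] = set()
clipped = clip_tips(succ, k=2)
print(sorted(clipped))
```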
Untangling repeats using mate pairs
Comparison of assemblers
References
Lander and Waterman (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis.
Jones and Pevzner (2004) An Introduction to Bioinformatics Algorithms.
Zerbino and Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.