High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler


1 High-Throughput Bioinformatics: Re-sequencing and de novo assembly Elena Czeizler

2 Sequencing data. Current sequencing technologies produce large amounts of data in the form of short reads. The output sequences are quite short (25-100 bp). A sequencing run produces batches of tens to hundreds of millions of sequences. Challenge: make sense of all these short reads, either by finding the mapping locations of sample reads in a reference genome or by de novo assembly of a genome.

3 Topics today: the re-sequencing approach; the de novo sequencing approach.

4 De novo vs. reference assembly of a genome. Reads can be assembled into contigs: contiguous sequences of DNA created by assembling overlapping sequenced fragments of a chromosome. [picture from M. Frilander]

5 Re-sequencing pipeline. Goal: identify single nucleotide polymorphisms (SNPs), insertions/deletions, structural variants, etc. Example: the 1000 Genomes Project aims to provide a comprehensive resource on human genetic variation. The pipeline proceeds in a similar manner for all platforms: 1. Create an index for searching the reference genome. 2. Using the index, align the reads to the reference. 3. Form a consensus sequence of the reads. 4. Identify SNPs etc.

6 Challenges of short read alignment. The human genome contains many repetitive sequences, which causes mapping ambiguity. Reads are short; if they fall in repetitive regions, it is hard to know where they truly map. We have to account for both sequencing (machine) errors and true variation between the sample and the reference. The volume of data can be so large that performance is a real concern.

7 Short read aligners: Examples

8 Short read alignment. Because of the huge amount of data, BLAST is too slow, so faster alignment methods have been developed. The most important ingredient is the index: a look-up structure for rapidly finding short sequences. Two families of methods: hash-table-based methods, and methods based on suffix/prefix tries or arrays (Burrows-Wheeler transform). The index is constructed for the reads, for the reference genome, or for both.

9 Aligners based on hash tables. Extensions of the idea of BLAST: seed-and-extend. Two steps: indexing and alignment. Indexing: divide the reads of length L into bins based on their first n nucleotides, which serve as the key. [Figure: a read of length L whose first n nucleotides select one of bin 1, bin 2, bin 3, ...] In practice, n is roughly 20, which gives \(4^{20} \approx 1.1 \times 10^{12}\) possible bins.

10 Aligners based on hash tables. Alignment: for each position p in the reference genome, consider the next L nucleotides in the reference sequence. Seed: find the appropriate bin using the first n nucleotides (out of L). Extend: match the remaining reference sequence to the reads in the bin. Some methods allow gaps, others don't. [Figure: a window of length L starting at position p, with its first n nucleotides used as the seed.]
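
The indexing and alignment steps on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular aligner: it assumes all reads have the same length L, does ungapped extension only, and the names `build_read_index`, `align_reads` and `max_mismatches` are hypothetical.

```python
from collections import defaultdict

def build_read_index(reads, n):
    """Indexing: bin the reads by their first n nucleotides (the hash key)."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        index[read[:n]].append(read_id)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align_reads(reference, reads, n, max_mismatches=0):
    """Alignment: slide over the reference, seed on n bases, extend without gaps.

    Assumes (for illustration) that all reads share one length L.
    Returns a list of (read_id, reference_position) hits.
    """
    read_len = len(reads[0])
    index = build_read_index(reads, n)
    hits = []
    for p in range(len(reference) - read_len + 1):
        window = reference[p:p + read_len]
        # Seed: the first n bases of the window select one bin of reads.
        for read_id in index.get(window[:n], []):
            # Extend: compare the remaining L - n bases.
            if hamming(window[n:], reads[read_id][n:]) <= max_mismatches:
                hits.append((read_id, p))
    return hits
```

Note that mismatches are tolerated only in the extension part; a mismatch inside the first n bases makes the read land in a different bin, which is the limitation that spaced seeds address.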

11 Aligners based on hash tables: seed improvements. Spaced seeds: allow a number of mismatches in a hit (e.g., at most 2) by using several templates in parallel. Example: the Maq method divides each read into 4 seeds of equal length. Reasoning: two mismatches fall into at most two seed segments, leaving the other two to match perfectly. Procedure: find candidates by looking up all 6 possible pairs of seeds (for each read) in the index, then check the remaining segments of each candidate and remove candidates with too many mismatches. Allows gaps within the seed (insertion, deletion). Trapnell C & Salzberg SL. How to map billions of short reads onto genomes. Nature Biotech. (2009) 27: 455
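
The pigeonhole argument on this slide (4 segments, at most 2 mismatches, so at least 2 segments match exactly) can be illustrated as follows. This is a simplified sketch in the spirit of the description above, not Maq's actual code: it indexes the reference rather than the reads, assumes the read length is divisible by 4, handles mismatches only, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def segments(s, parts=4):
    """Split s into `parts` equal-length pieces (len(s) must be divisible by parts)."""
    step = len(s) // parts
    return [s[i * step:(i + 1) * step] for i in range(parts)]

def build_pair_index(reference, read_len):
    """One hash table per pair of segments: the 6 templates for 4 segments."""
    pairs = list(combinations(range(4), 2))
    tables = {pair: defaultdict(list) for pair in pairs}
    for p in range(len(reference) - read_len + 1):
        segs = segments(reference[p:p + read_len])
        for i, j in pairs:
            tables[(i, j)][(segs[i], segs[j])].append(p)
    return tables

def map_read(read, reference, tables, max_mismatches=2):
    """Look up all 6 segment pairs, then verify each candidate position."""
    segs = segments(read)
    candidates = set()
    for (i, j), table in tables.items():
        candidates.update(table.get((segs[i], segs[j]), []))
    hits = []
    for p in sorted(candidates):
        window = reference[p:p + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((p, mismatches))
    return hits
```

Every placement with at most 2 mismatches is guaranteed to appear among the candidates; the verification step is still needed because a candidate may match on two segments and differ heavily on the other two.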

12 Aligners based on hash tables. By choosing proper seeds, hash-table index aligners can be very sensitive, but this may decrease performance: short seeds => false positives that slow down the (later) mapping process; longer seeds => more seeds needed => more memory, which also slows down the mapping process. The more sensitive the aligner, the greater the performance hit. Comprehensive hash tables take a lot of memory, which degrades performance in practical implementations. Depending on the application, we may want to sacrifice sensitivity for performance, or limit the number of mismatches.

13 Alignment algorithms based on suffix/prefix tries/arrays. A trie is an ordered tree data structure used to store a set of strings. All the descendants of a node have a common prefix: the string associated with that node. By using certain representations of suffix/prefix tries (e.g., suffix tree, enhanced suffix array and FM-index), the alignment to multiple identical copies of a substring in the reference is done only once, because all identical copies collapse onto a single path in the trie. With a typical hash table index, the alignment must be performed for each individual copy. Many current aligners use the FM-index, which is based on the Burrows-Wheeler transform (BWT).

14 Burrows-Wheeler transform. Creates a transformation of the reference sequence that: contains the same information as the original sequence; can be compressed more efficiently; allows fast lookup of substrings; can be back-transformed. Burrows-Wheeler transform with suffix array: memory-efficient (~1 GB for the human genome). Used by Bowtie and BWA.

15 Burrows-Wheeler transform: Example. Sequence: ACAACG. Add the dollar symbol to mark the end of the string; $ is considered to be lexicographically smaller than all the other symbols. Create all cyclic rotations of the sequence ACAACG$ and sort them in lexicographic order. BWT of the sequence = concatenation of the last character of each line in the sorted list, here: GC$AAAC. Suffix array: list of the original row numbers (6,2,0,3,1,4,5). The suffix array A of a string S is an array of integers providing the starting positions of the suffixes of S in lexicographical order: A[i] = the starting position of the i-th smallest suffix in S.
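
The example on this slide can be reproduced with a short Python sketch. It sorts the suffixes of the terminated string directly, which gives the same order as sorting the cyclic rotations because $ is unique and smallest. This naive construction is for illustration only (real indexers use far more memory-efficient algorithms), and the function names are hypothetical.

```python
def suffix_array(text):
    """Starting positions of the suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(sequence, terminator="$"):
    """Return (BWT string, suffix array) of sequence + terminator.

    Relies on the terminator sorting before every other symbol,
    which holds for "$" versus letters in ASCII.
    """
    text = sequence + terminator
    sa = suffix_array(text)
    # The last column of the sorted rotation matrix holds the character
    # that precedes each suffix (text[-1] is the terminator when i == 0).
    last_column = "".join(text[i - 1] for i in sa)
    return last_column, sa

print(bwt("ACAACG"))
```

For ACAACG this yields the BWT GC$AAAC and the suffix array (6,2,0,3,1,4,5) given on the slide.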

16 Reversing the BWT. The BW matrix has the property of 'last-first mapping': the i-th occurrence of character a in the last column corresponds to the same character in the original string X as the i-th occurrence of a in the first column. We can use this property to reverse the transformation. Example: X = acaacg, BWT(X) = gc$aaac. Langmead et al., Genome Biology,
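
A minimal sketch of how the last-first mapping reverses the transform, assuming the terminator $ sorts before all other symbols (so row 0 of the sorted matrix starts with $). The helper name `inverse_bwt` is hypothetical.

```python
from collections import Counter

def inverse_bwt(last):
    """Recover the original string (without "$") from its BWT `last`."""
    # first_row[c]: row at which the block of c's starts in the first column.
    counts = Counter(last)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # rank[i]: number of earlier occurrences of last[i] in the last column.
    seen, rank = Counter(), []
    for c in last:
        rank.append(seen[c])
        seen[c] += 1
    # Row 0 starts with "$", so its last character is the final character
    # of X. Each LF step moves to the row starting with that character,
    # whose last character is the one preceding it in X.
    row, out = 0, []
    for _ in range(len(last) - 1):
        c = last[row]
        out.append(c)
        row = first_row[c] + rank[row]
    return "".join(reversed(out))

print(inverse_bwt("gc$aaac"))
```

The string is recovered right to left, one character per step, using only the last column and character counts.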

17 Suffix array interval. Suffix array values store the positions of the occurrences of suffixes in the string. All suffixes that have a given substring W as a prefix appear on consecutive rows, so all occurrences of W in X appear in an interval (consecutive indices) of the suffix array. Examples for X = ACAACG: W = AC -> [2,3], W = A -> [1,3].
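
Because the suffixes are sorted, the interval for a substring W can be found with two binary searches. The sketch below is one straightforward way to do this on an explicit suffix array (the FM index reaches the same interval without storing the full array); `sa_interval` is a hypothetical name.

```python
def sa_interval(text, sa, w):
    """Inclusive interval [start, end] of suffix array rows prefixed by w, or None."""
    def prefix(row):
        # The first len(w) characters of the suffix stored in this row.
        return text[sa[row]:sa[row] + len(w)]

    # Lower bound: first row whose prefix is >= w.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first row whose prefix is > w.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    return (start, lo - 1) if start < lo else None

text = "ACAACG$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa_interval(text, sa, "AC"), sa_interval(text, sa, "A"))
```

The positions of the occurrences in X are then the suffix array values inside the interval, e.g. rows 2 and 3 hold positions 0 and 3 for W = AC.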

18 FM index. Ferragina and Manzini (2000) developed a method for searching for occurrences of a pattern P in the BWT-transformed string with minimal memory requirements. We define the rank of a character a as the number of times the same character a occurred previously in the BWT. Example: let X = abaaba, with BWT(X) = abba$aa, and suppose we search for the occurrences of the pattern P = aba. Start the search from the right end of the pattern: a. Look in the first column and take the range of rows that begin with the character a: [1,4]. Note: we only know the columns F and L and the ranks.

19 FM index: example (cont.). X = abaaba, BWT(X) = abba$aa, P = aba. Next, find all rows beginning with the next-longest proper suffix of P: ba. In the L column of the shaded area (the rows beginning with a) there are 2 b's => there are 2 instances where b precedes a. Using the ranks of these 2 b characters (0 and 1), we obtain the rows that have ba as a prefix: the rows beginning with b_0 and b_1, i.e., the first 2 rows in the b section.

20 FM index: example (cont.). X = abaaba, BWT(X) = abba$aa, P = aba. Next, find all rows beginning with the final suffix of P: aba. In the L column of the shaded area (the rows beginning with ba) we have a in both cases => in both instances ba is preceded by a. Using the ranks of these 2 a characters (2 and 3), we obtain the rows that have aba as a prefix: the rows beginning with a_2 and a_3, i.e., the last 2 rows in the a section.
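
The three steps of this example correspond to the backward search loop sketched below. It is a didactic version under simplifying assumptions: the rank (occurrence) table is stored in full for every position, whereas practical FM-index implementations sample it to save memory. The names `build_fm_index` and `backward_search` are hypothetical.

```python
from collections import Counter

def build_fm_index(bwt_string):
    """Return (first_row, occ) for a BWT string.

    first_row[c]: first row of the block of c's in column F.
    occ[c][i]:    number of c's in bwt_string[:i], i.e. the rank of a c at row i.
    """
    counts = Counter(bwt_string)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    occ = {c: [0] for c in counts}
    for ch in bwt_string:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return first_row, occ

def backward_search(pattern, bwt_string):
    """Inclusive row interval of rows prefixed by pattern, or None if absent."""
    first_row, occ = build_fm_index(bwt_string)
    lo, hi = 0, len(bwt_string)  # half-open interval [lo, hi) of rows
    for c in reversed(pattern):  # process the pattern right to left
        if c not in first_row:
            return None
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return None
    return lo, hi - 1

print(backward_search("aba", "abba$aa"))
```

For P = aba the interval evolves as rows 1-4 (a), rows 5-6 (ba) and finally rows 3-4 (aba), matching the slides. Each pattern character costs a constant number of table look-ups, independent of the text length.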

21 Advantages of BWT/FM. Summary: the Burrows-Wheeler transform allows searching in a way that is both time-efficient and space-efficient. Checking the occurrence of a substring W takes \(O(|W|)\) time, using \(4n + (n \log_2 n)/8\) bits of space. Rapid placement of reads on the genome with a BWT index: much faster than hash-based search at the same sensitivity level, and even faster at reduced sensitivity. The complete index can be loaded into memory => additional speedup.

22 Re-sequencing in color space. For SOLiD reads, sequence alignment is done in color space. Translation into nucleotides is the last step, after alignment! Use software that supports color space. In practice, this is just an option you give to the alignment program.

23 Using read qualities in alignment? Knowing the error probability of each base, the aligner may apply a lower penalty to a mismatch at an error-prone base. Not all programs use read qualities (check the manuals)! Most new methods use them; among the older ones, MAQ does.

24 SNP calling. After aligning the short reads to the reference genome, identify the nucleotides that differ from the reference. Make a consensus sequence of the reads. Simplest approach: at each position, choose the most common base. Better approach: use quality values in the voting. [Figure: reference, consensus and individual reads, with a SNP marked.]
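
The two voting schemes can be contrasted with a toy example. The sketch assumes that the aligned reads have already been turned into a per-position pileup of (base, quality) pairs, and it uses the Phred quality score directly as the vote weight; that weighting is an illustrative assumption, not the statistical model of any real SNP caller. All names are hypothetical.

```python
from collections import defaultdict

def call_base(pileup, use_quality=True):
    """Consensus base for one position.

    pileup: list of (base, phred_quality) pairs observed at the position.
    With use_quality=False every read gets one vote (simple majority).
    """
    votes = defaultdict(float)
    for base, qual in pileup:
        votes[base] += qual if use_quality else 1.0
    # Ties are broken alphabetically; an empty pileup gives no call.
    return max(sorted(votes), key=votes.get) if votes else None

def call_snps(reference, pileups, use_quality=True):
    """Return (position, reference_base, consensus_base) where they differ."""
    snps = []
    for pos, (ref_base, pileup) in enumerate(zip(reference, pileups)):
        consensus = call_base(pileup, use_quality)
        if consensus is not None and consensus != ref_base:
            snps.append((pos, ref_base, consensus))
    return snps

reference = "ACGT"
pileups = [
    [("A", 30), ("A", 30)],
    [("C", 30), ("T", 5), ("T", 5)],    # two low-quality T's against one good C
    [("G", 10), ("T", 35), ("T", 30)],  # well-supported difference
    [],                                 # no coverage
]
print(call_snps(reference, pileups, use_quality=False))
print(call_snps(reference, pileups, use_quality=True))
```

In this example the simple majority vote reports a SNP at position 1 that is supported only by low-quality bases, while the quality-weighted vote keeps the reference base there.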

25 SNP function and annotation. Genome-wide association studies (GWAS) aim at finding SNPs associated with certain phenotypes: they examine many common genetic variants in different individuals and identify those that are associated with particular traits, e.g., major diseases. The location of a SNP in the genome is important for interpretation: Is it in an exon sequence? Does it change the amino acid or introduce a stop codon? Does the change disrupt the protein structure? Is it in an RNA-coding region? Is it in a gene promoter or another regulatory region?

26 De novo assembly

27 De novo genome assembly. De novo: no reference genome is available. Assemble overlapping short reads into longer contigs (contiguous sequences). There are 3 main strategies for short read assembly: greedy extension (a string-based method), and de Bruijn graph and overlap-layout (graph-based approaches). For large datasets of more than a hundred million short reads, de Bruijn graph-based assemblers are most appropriate. Software based on de Bruijn graphs: ABySS, ALLPATHS, Edena, Velvet and SOAPdenovo.

28 Velvet assembly - de Bruijn graphs. Motivation: all-against-all overlap calculation is too expensive for short read data. Idea: consider the k-mers contained in a read instead of the reads themselves (this compresses redundant sequences). Start by rewriting each read as a sequence of k-mers (consecutive k-mers overlap by k-1 bases). Step 1: reads are hashed to a predefined k-mer length (k=21 for 25 bp reads). Each k-mer has an ID that maps the k-mer back to the read and its position in the read. Each k-mer is recorded together with its reverse complement. Step 2: for each read, Velvet records which k-mers are overlapped by subsequent reads. The original set of k-mers is cut each time an overlap with another read begins or ends; the k-mers between the cuts form the nodes of the de Bruijn graph.

29 Velvet assembly - de Bruijn graphs. Split reads into k-mers (here 5-mers) and align all k-mers in the reads. De Bruijn graph: each node represents a series of overlapping k-mers. The final nucleotides of the k-mers make up the sequence of the node. Each node has a twin node representing the reverse complement k-mers, so that overlaps between reads from opposite strands are taken into account. k must be odd to ensure that a k-mer cannot be its own reverse complement. D. R. Zerbino and E. Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
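
Why k must be odd can be checked by brute force. In an odd-length k-mer the middle base would have to be its own complement, which no nucleotide is; the sketch below simply enumerates all k-mers for small k to confirm this. The function names are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def self_complementary_kmers(k):
    """All k-mers that are equal to their own reverse complement."""
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in all_kmers if kmer == reverse_complement(kmer)]

for k in (2, 3, 4, 5):
    print(k, len(self_complementary_kmers(k)))
```

For even k some k-mers (e.g. ACGT) coincide with their reverse complement, so a node and its twin would be the same node; for odd k the list is empty.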

30 Velvet assembly - de Bruijn graphs. Edges between 2 nodes: the last k-mer of the edge's origin overlaps with the first k-mer of its destination. Edges connect nodes in the order in which they occur in a read. If multiple reads use the same edge, the edge multiplicity is increased. Reads are mapped as paths through the graph. D. R. Zerbino and E. Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
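
A stripped-down version of the graph construction, to make edges and multiplicities concrete. Unlike Velvet as described above, this sketch uses one node per k-mer (no merged series of k-mers), ignores reverse complements and twin nodes, and keeps no read IDs; `kmers` and `de_bruijn_edges` are hypothetical names.

```python
from collections import Counter

def kmers(read, k):
    """The k-mers of a read in order; consecutive k-mers overlap by k-1 bases."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_edges(reads, k):
    """Map each edge (source k-mer, destination k-mer) to its multiplicity."""
    edges = Counter()
    for read in reads:
        path = kmers(read, k)  # each read is a path through the graph
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges

edges = de_bruijn_edges(["ACGTA", "CGTAC"], k=3)
for (src, dst), multiplicity in sorted(edges.items()):
    print(src, "->", dst, multiplicity)
```

The edge CGT -> GTA is used by both reads and therefore gets multiplicity 2, while the other two edges have multiplicity 1.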

31 Velvet - de Bruijn graph. After construction, the graph is processed to remove errors etc. by: Simplification: whenever a node A has only one outgoing edge, pointing to a node B that has only one incoming edge, the two nodes (and their twins) are merged, collapsing chains into single nodes. Removing 'tips': chains of nodes that are disconnected on one end (probably due to an error at a read end); sufficiently long tips are kept (true sequence interrupted by a coverage gap). Tour Bus algorithm to remove 'bubbles': redundant paths that start and end at the same nodes and contain similar sequences (possibly caused by SNPs or errors).
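
The simplification rule (merge A and B when A has a single outgoing edge and B a single incoming edge) can be sketched on a plain k-mer graph. This illustration covers only chain collapsing: twin nodes, tip removal and Tour Bus are left out, isolated cycles are not reported, and the function name `collapse_chains` is hypothetical.

```python
from collections import defaultdict

def collapse_chains(edges):
    """Merge unbranched chains of k-mer nodes into longer sequences.

    edges: iterable of (src_kmer, dst_kmer) pairs where dst overlaps src
    by k-1 bases. Returns the sequences of the merged nodes, sorted by
    the first k-mer of each chain.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
        nodes.update((src, dst))

    def mergeable(a, b):
        # a has only one outgoing edge (to b), b only one incoming edge (from a)
        return len(succ[a]) == 1 and len(pred[b]) == 1

    merged = []
    for node in sorted(nodes):
        # Skip nodes that will be absorbed into their predecessor's chain.
        if len(pred[node]) == 1 and mergeable(next(iter(pred[node])), node):
            continue
        seq, cur = node, node
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if not mergeable(cur, nxt):
                break
            seq += nxt[-1]  # each merged k-mer contributes its final base
            cur = nxt
        merged.append(seq)
    return merged

print(collapse_chains([("ACG", "CGT"), ("CGT", "GTA"), ("GTA", "TAC")]))
print(collapse_chains([("AAC", "ACG"), ("ACG", "CGT"), ("ACG", "CGA")]))
```

A linear chain collapses into a single node, whereas a branch point stops the merge, leaving the two alternative continuations as separate nodes.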

32 Velvet - de Bruijn graph. Tour Bus algorithm to remove 'bubbles': start from node A and progress along the graph, visiting nodes in order of increasing distance from the origin (to the right). If a previously visited node is encountered (e.g., D), backtrack along the current path and along the path from the previous visit to find the closest common ancestor (here: A). Extract and align the sequences from the two retraced paths (B C and BC). Merge the paths if they are similar enough. Simplify. D. R. Zerbino and E. Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:

33 What read length does de novo genome assembly require? Key problem in full assembly: short reads are likely to occur at several places, which complicates sequence reconstruction (if the downstream regions diverge, the contig cannot be extended). Theoretical analysis (Whiteford et al., An analysis of the feasibility of short read sequencing, Nucleic Acids Research, 2005): E. coli: with 30 bp read length, 75% of the genome is covered by contigs > 10,000 bp. C. elegans: with 50 bp read length, 51% of the genome is covered by contigs > 10,000 bp. Human (chromosome 1): with 50 bp read length, ~17% is covered by contigs > 10,000 bp.

34 [Figure: percentage of the E. coli genome covered by contigs greater than a threshold length, as a function of read length.] Whiteford, N. et al. (2005) Nucleic Acids Res. 33, e171. Note: theoretical analysis with perfect sequencing (no errors)!

35 [Figure panels: C. elegans; Human]

36 Paired reads facilitate assembly. So far we have considered only reads from individual DNA fragments (also called single-end reads). Paired reads: Paired-end sequencing: DNA fragments are sequenced from both ends (the part in the middle is not sequenced). Mate-pair library preparation: select sheared DNA fragments of a certain size (2-5 kb) and circularize them by means of an internal adapter, bringing the ends that were previously distant from one another into close proximity; the circle is cut into fragments, and the fragments containing the adapter are selected and sequenced by paired-end sequencing. Result: you know the reads from both ends and their approximate distance. This helps to find the correct genome location when individual reads can be mapped to multiple locations, and it also helps to resolve repeat structures.

37 Mate-pair library preparation

38 Genome assembly using paired-end reads [454 Technical note 1]

39 Paired-end mapping reveals structural variation in the human genome. Examples of structural variation (SV): deletions, duplications, copy-number variations, insertions, inversions, etc. SVs may have a more significant impact on phenotypic variation than SNPs. SVs are implicated in, e.g., female fertility, susceptibility to HIV infection and systemic autoimmunity. Five different signatures were used to predict SVs. Korbel et al., Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome, Science

40 Example: sequencing of the giant panda genome. Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463,

41 Sequencing setup for the giant panda. 37 paired-end insert libraries with insert sizes of 150 bp, 500 bp, 2 kb, 5 kb and 10 kb were constructed. Illumina Genome Analyzer platform. 176 Gb of usable sequence, 73x coverage of the whole genome. Average read length of 52 bp. SOAPdenovo assembler, using the de Bruijn graph algorithm.

42 Genome assembly using paired-end reads. Assemble the short reads from small insert size libraries (<500 bp) into contigs. Use the paired-end information to join contigs into scaffolds (scaffold = an arrangement of contigs with gaps between them), step by step from the shortest to the longest insert size. Filling gaps in scaffolds: look at paired-end reads with one end in a contig and the other end in a gap region, and start a local assembly from the unmapped end.

43 Summary of assembly. Final contig size 2.24 Gb. Estimated genome size 2.40 Gb. N50 is the largest length L such that the combined length of the contigs of length >= L is at least 50% of the total length of all contigs. Determine N50 by sorting the contigs by length and adding up their lengths, starting from the longest, until the running sum reaches half of the total.
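
The N50 definition above translates directly into code. A minimal sketch, with the hypothetical function name `n50` and made-up contig lengths used purely as an example:

```python
def n50(contig_lengths):
    """Largest L such that contigs of length >= L cover at least half the total."""
    total = sum(contig_lengths)
    running = 0
    # Add up lengths from the longest contig downwards.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None  # no contigs given

print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Here the total length is 54; the three longest contigs (10, 9, 8) sum to 27, which is the first running total to reach half of 54, so N50 is 8.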

44 Genome annotation de novo. Full genome assembly is only the beginning; next things to do: Gene finding: align known genes of model species against the new genome. Ab initio prediction methods, based on statistical signals within the DNA: hidden Markov model-based prediction of genes (Genscan, Augustus, HMMgene) and, more recently, support vector machine (SVM)-based gene finding approaches (mGene). Gene annotation: the functions of the genes that can be aligned to the new genome give some hint. Gene orthologs: InParanoid (eukaryotic species), MultiParanoid.

45 Literature
Recommended additional reading:
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):
Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
Further references:
Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology 10:R83.
Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5):
Magi A et al. (2010) Bioinformatics for Next Generation Sequencing Data. Genes 1:
Homer N, Merriman B, Nelson SF (2009) BFAST: An Alignment Tool for Large Scale Genome Resequencing. PLoS ONE 4(11): e7767. doi: /journal.pone
Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463,
Whiteford N et al. (2005) An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 33, e171.
Whole Genome Assembly using Paired End Reads in E. coli, B. licheniformis, and S. cerevisiae. 454 Application note 1,
Korbel et al. (2007) Paired-end mapping reveals extensive structural variation in the human genome. Science 318:
Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42,