Next-Generation Genome Sequencing

Next-Generation Genome Sequencing Jarkko Salojärvi, D.Sc. (Tech) Department of Biosciences, Division of Plant Biology Department of Veterinary Biosciences, Veterinary Microbiology and Epidemiology University of Helsinki 28.10.2010

Topics Today Quick recap of the relevant parts of the SOLiD and Solexa/Illumina technologies. Sequencing in practice. Properties of the data. What the raw data looks like. Re-sequencing. Sequencing de novo. Some examples throughout the presentation...

Where it all began: Sanger sequencing Developed in 1977. Read length up to 800-1000 bases. Current platforms allow 96 concurrent reads. Still needed for closing gaps, sequencing long repeated fragments Frederick Sanger 1918- twice Nobel laureate

High-throughput sequencing technologies - summary

Throughput of different technologies (most recent versions)
                          SOLiD 4      Solexa HiSeq   454 Titanium
Runtime                   5-10 days    8 days         10 h
Read length (bp)          35-50        100            400+
Raw sequence (Gb/run)     80-100       150-200        0.4-0.6
Accuracy                  >99.94%      >98.5%         99% (at 400 bp)

Sequencing in Practice

What do you get from a sequencing service? Huge text files, usually reporting two things: 1. Base calls for each read. 2. Read qualities. In SOLiD these are in two separate files, .csfasta and .qv.qual. In Solexa, both are reported in the same .fastq file. At Biocenter Viikki: SOLiD 4 and 454 Titanium. At FIMM Meilahti: Solexa. Commercial services are available; the price is in the thousands per run.

Assembly task Given: A text file with lots of short reads, nucleotide sequences. Task: Align these, either with respect to each other (de novo) or a reference genome (re-sequencing). Essential: Coverage: Number of overlapping reads. Depth: Number of reads on a single nucleotide. Composition of reads (fragments/mate-pair) Is there a reference sequence of the organism?

Fragments vs. mate-pairs 1. Individual fragments. 2. Paired reads. Mate-pair: genomic DNA is fragmented, and size-selected inserts are circularized and linked by means of an internal adaptor. Paired end: genomic DNA is fragmented into short segments, followed by sequencing of both ends of the segment (but not the part in between). The end result is the same: you know the reads from both ends, plus the average distance between the reads. A large fraction of short reads are difficult to map uniquely to the genome, and the second read of a pair can be used to find the correct location.

Mate-pair reads [Korbel et al. 07]
[Figure, panel A: starting from human genomic DNA: i) shearing and size selection, ii) protection and adapter ligation, iii) circularization, iv) random cleavage, v) linker(+) read isolation, vi) sequencing of >30 million paired ends with 454 technology, vii) computational analysis and mapping of structural variants (SVs), viii) histogram of the span of paired ends (i.e. distance between mapped ends, in bp) with cutoff I and cutoff D.]
Result: a snapshot of the full genome every ~3 kbp. Can be used to align contigs from a standard sequencing run.

Example: paired end mapping to reveal structural variation (SV) in human genome [Korbel et al. 07]
[Figure, panel B: paired ends originating from the individual (sample) sequence are best-placed on the human reference genome, and their span and orientation are compared with the expected values. No SV: normally mapped ends. Deletion (region deleted from the sample genome): end distance > cutoff D. Insertion, simple (region inserted in the sample genome): end distance < cutoff I. Insertion of sequence from a distant locus: mated or unmated insertion. Inversion breakpoint (region inverted in the sample genome): altered end orientation, i.e. an end that maps in inverted orientation relative to the original (sample) locus.]

De novo vs. re-sequencing of genome In de novo assembly, reads are assembled into contigs. Contig: a contiguous sequence of DNA created by assembling overlapping sequenced fragments of a chromosome. Reference assembly = re-sequencing. If the genome/template is known: reference assembly. If the genome/template is unknown: de novo assembly. Reference assembly: even with a gap in sequence coverage, the reference genome tells that the sequences are from the same contig or genomic region, so it tolerates short read lengths. De novo assembly: the same original contig (for example an mRNA) may be split into multiple too-short contigs (contig 1, contig 2); longer reads provide more overlap for connecting individual reads.

SOLiD raw data

SOLiD probes Probes are designed for reading two nucleotides at a time. Four different colors. The resulting sequence is in color space... but a color is not unique to one dinucleotide: CG, GC and TA are also red?!

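SOLiD sequence decoding Key to decoding: the known last base of the adapter oligo sequence.
[Figure: known base at position 0 = A. Color at positions 0-1 allows CA, AC, GT, TG; color at positions 1-2 allows CA, AC, GT, TG; color at positions 2-3 allows AA, CC, GG, TT. Starting from the known A, the decoded bases are A C A A.]
Petri Auvinen, DNA Sequencing and Genomi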

SOLiD raw data files: raw reads in .csfasta
SOLiD in general gives out two files: the reads in color space (.csfasta) and the read qualities (.qv.qual).
<filename>.csfasta
Overall format: last base of the adapter oligo sequence + color space presented in numbers.

                   2nd nucleotide
                   A   C   G   T
1st nucleotide A   0   1   2   3
               C   1   0   3   2
               G   2   3   0   1
               T   3   2   1   0

Format of each record:
>TAG_ID
Color_space

Example:
>1_88_1830_R3
G32113123201300232320
>1_89_1562_R3
G23133131233333101320
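The encoding table above can be sketched in a few lines of Python. This is only an illustration, not the SOLiD software: the function names `color`, `encode` and `decode` are made up here. It relies on the observation that, if the bases are numbered A=0, C=1, G=2, T=3, every entry of the table equals the bitwise XOR of the two base numbers.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3 (numbering assumed for this sketch)


def color(first, second):
    """Color of a dinucleotide; reproduces the 4x4 table via XOR."""
    return BASES.index(first) ^ BASES.index(second)


def encode(adapter_base, seq):
    """Return a csfasta-style string: known adapter base + one color per base."""
    full = adapter_base + seq
    return adapter_base + "".join(str(color(a, b)) for a, b in zip(full, full[1:]))


def decode(csread):
    """Translate a color-space read back to bases, starting from the known base."""
    base = csread[0]
    out = []
    for digit in csread[1:]:
        base = BASES[BASES.index(base) ^ int(digit)]
        out.append(base)
    return "".join(out)


if __name__ == "__main__":
    print(decode("G32113123201300232320"))  # first example read from the slide
    print(encode("A", "CAA"))               # prints A110
```

Decoding needs the known first base; without it, each color string is compatible with four different base sequences.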

SOLiD raw data files - qualities in .qv.qual
Quality values are in <filename>.qv.qual: a phred-like score for each position of each read.
Score: q = -10*log10(p), where p is the probability that the call is wrong.

Format of each record:
>TAG_ID
quality values

Example:
>97_2040_1850_F3
38 36 26 33 41 26 24 33 28 31 27 23 5 35 32 31 11 10 24 38 22 24 7 12 15 21 12 18 34 31 27 11 15 26 13 14 17 17 13 12 8 5 17 5 12
>97_2040_1898_F3
41 41 41 38 32 29 39 24 23 36 32 38 25 30 28 21 27 33 34 33 24 27 9 35 34 14 30 18 33 8 13 32 10 31 24 7 22 5 27 30 21 5 0 27 9

p         q
0.5       3
0.1       10
0.01      20
0.001     30
0.0001    40
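The relation q = -10*log10(p) and the p/q table can be checked with a short Python sketch. The helper names `phred` and `error_probability` are invented for this example, and the expected-error sum is only one simple way to summarize a read.

```python
import math


def phred(p):
    """Phred-like score for an error probability p."""
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse of phred(): error probability for a score q."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01, 0.001, 0.0001):
        print(p, round(phred(p)))

    # First ten quality values of read 97_2040_1850_F3 from the slide.
    quals = [38, 36, 26, 33, 41, 26, 24, 33, 28, 31]
    expected_errors = sum(error_probability(q) for q in quals)
    print("expected number of wrong calls:", expected_errors)
```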

Benefit: Complementation in color space
One benefit of color space is that it is self-complementing:

                   2nd nucleotide
                   A   C   G   T
1st nucleotide A   0   1   2   3
               C   1   0   3   2
               G   2   3   0   1
               T   3   2   1   0

Sequence
Base          A G C T C G T C G T G C A G
Color space    2 3 2 2 3 1 2 3 1 1 3 1 2

Complemented
Base          T C G A G C A G C A C G T C
Color space    2 3 2 2 3 1 2 3 1 1 3 1 2
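The self-complementing property can be verified on the slide's own sequence. The sketch below is illustrative only and uses the same assumed numbering A=0, C=1, G=2, T=3, under which the table is an XOR. Complementing a base is then an XOR with 3 (A and T swap, C and G swap), and that constant cancels inside every dinucleotide.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3 (numbering assumed for this sketch)
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def colors(seq):
    """Colors of all adjacent base pairs in seq."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]


def complement(seq):
    return "".join(COMPLEMENT[b] for b in seq)


if __name__ == "__main__":
    seq = "AGCTCGTCGTGCAG"
    print(colors(seq))
    print(complement(seq))
    print(colors(complement(seq)))
    # The reverse complement simply has the colors in reverse order.
    print(colors(complement(seq)[::-1]) == colors(seq)[::-1])
```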

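Downside One incorrect color call can screw up the rest of the read in decoding.
[Figure: correct read: colors at positions 0-1, 1-2, 2-3 allow (CA, AC, GT, TG), (CA, AC, GT, TG), (AA, CC, GG, TT); decoded bases A C A A. With an error in the second color: (CA, AC, GT, TG), (TA, GC, CG, AT), (AA, CC, GG, TT); decoded bases A C G G.]
In color space there is still only one error -> alignment MUST be done in color space!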
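The example on this slide can be replayed in Python. This is a small sketch with a made-up `decode` helper, again assuming the numbering A=0, C=1, G=2, T=3 so that the color table becomes an XOR.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3 (numbering assumed for this sketch)


def decode(first_base, colors):
    """Decode a list of colors into bases, starting from the known first base."""
    base, out = first_base, []
    for c in colors:
        base = BASES[BASES.index(base) ^ c]
        out.append(base)
    return "".join(out)


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    true_colors = [1, 1, 0]
    read_colors = [1, 3, 0]  # second color miscalled
    good = decode("A", true_colors)
    bad = decode("A", read_colors)
    print(good, bad)
    print("color errors:", mismatches(true_colors, read_colors))
    print("base errors after decoding:", mismatches(good, bad))
```

Every base after the miscalled color is wrong, because each decoded base depends on the previous one.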

Solexa raw data

Solexa pipeline

Solexa output in fastq file
Solexa raw data comes in one text file, by default named by flowcell lane and read direction, for example s_7_1_sequence.txt.
Four lines per read:
1. @Sequence identifier
2. Raw sequence letters (A, T, C, G, N)
3. +same_sequence identifier
4. Read quality codes (Phred-like)

Fastq quality scores Quality scores are reported as ASCII characters, which saves disk space. Example:
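A minimal Python sketch of reading the four-line records and turning the ASCII quality codes back into numbers. The record below is invented for illustration, and the offset of 64 is an assumption: the offset depends on the pipeline version (Sanger-style files use 33), so check the documentation of the pipeline that produced the file.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        plus = handle.readline().rstrip()
        qual = handle.readline().rstrip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        yield header[1:], seq, qual


def decode_quality(qual, offset=64):
    """Convert ASCII quality codes to integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    # Hypothetical record, not taken from a real run.
    example = io.StringIO("@read_1\nACGTN\n+read_1\nhhgfB\n")
    for name, seq, qual in read_fastq(example):
        print(name, seq, decode_quality(qual))
```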

Re-sequencing

Requirements All alignment programs are designed for Unix/Linux platforms; Windows is too slow. Written in C, some parts in Perl. Need a lot of memory: for human-sized genomes, at least 8 GB of RAM. Need a lot of disk space: data files are now ~5 GB. Runs take from tens of minutes to hours to complete. An account at CSC or some other computation facility is recommended. No graphical user interfaces. Command line interface example: >assemble.pl reads_in.csfasta read_qualities_f3_qv.qual 157000000 -ref_file TAIR9_chr.fas -ref_type nt -NO_CORRECTION

Re-sequencing pipeline Proceeds in a similar manner for all platforms: 1. Create an index to be used for searching the reference genome. 2. Using the index, align reads to reference. 3. Form a consensus sequence. 4. Identify SNPs etc.

Short read alignment Because of the huge amount of data, BLAST is too slow, and faster alignment methods have been developed. The faster methods use shortcuts based on indexing, where you search only a small part of the sequences: hashing-based indexing and the Burrows-Wheeler transform. Progress is rapid; methods published 2 years ago are now old.

Hashing-based aligners
First generation of read aligners. Extend the idea of BLAST.
Indexing: divide reads of length L into bins based on their first n nucleotides (n is roughly 20).
Alignment: for each position p in the reference genome:
  1. Reference sequence = the next L nucleotides starting at p.
  2. Find the appropriate bin.
  3. Match the remaining reference sequence to the reads in the bin.
Software: No gaps: Eland, Maq. Gaps allowed: Elandv2, SOAP, GenomeMapper (part of SHORE).
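The indexing and alignment steps above can be sketched in Python as follows. This is a toy illustration of the idea, not the algorithm of Eland or Maq: it assumes all reads have the same length, uses a very short seed (n=4) so that the example fits on a slide, allows no gaps, and misses reads with a mismatch inside the seed. All names and sequences are made up.

```python
from collections import defaultdict


def build_read_index(reads, n):
    """Bin reads by their first n nucleotides."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        index[read[:n]].append(read_id)
    return index


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def align(reference, reads, n=4, max_mismatches=1):
    """Return {read_id: [(position, mismatches), ...]} for ungapped hits."""
    length = len(reads[0])  # assumption: all reads have the same length L
    index = build_read_index(reads, n)
    hits = defaultdict(list)
    for p in range(len(reference) - length + 1):
        window = reference[p:p + length]
        # Only reads in the bin of the window's first n bases are compared.
        for read_id in index.get(window[:n], []):
            mm = hamming(window[n:], reads[read_id][n:])
            if mm <= max_mismatches:
                hits[read_id].append((p, mm))
    return dict(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCT"
    reads = ["CGTTAGCC", "GCCGATTG", "TTTTTTTT"]
    print(align(reference, reads))
```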

Burrows-Wheeler transform Next generation of sequence aligners. Searches for sequence matches using a prefix trie. Results in a fast read alignment method. Requires less memory. Small gaps allowed. Software: Bowtie (no gaps), BWA, SOAP2 (2-way BWT).
[Figure: prefix trie of the reference ^GOOGOL. Task: find a match to LOL, given at most one mismatch.]
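The GOOGOL/LOL task can be reproduced with a small Python sketch of backward search on a Burrows-Wheeler index. It is a teaching version under simplifying assumptions: the suffix array is built by plain sorting (fine for a 6-letter reference, hopeless for a genome), "$" marks the end of the reference, only substitutions are allowed, and none of the pruning tricks of real aligners are used. The function names are invented for this example.

```python
def bwt_index(text):
    """Suffix array, cumulative counts C and occurrence table for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    counts, total = {}, 0
    for c in alphabet:  # counts[c] = number of characters smaller than c
        counts[c] = total
        total += text.count(c)
    occ = {c: [0] for c in alphabet}  # occ[c][i] = occurrences of c in bwt[:i]
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, bwt, counts, occ


def search(text, query, max_mismatches=0):
    """Return sorted (position, mismatches) of query in text, substitutions only."""
    sa, _, counts, occ = bwt_index(text)
    hits = set()

    def recurse(i, lo, hi, budget):
        # [lo, hi) is the suffix-array interval of suffixes matching query[i+1:].
        if lo >= hi:
            return
        if i < 0:
            for k in range(lo, hi):
                hits.add((sa[k], max_mismatches - budget))
            return
        for c in counts:
            if c == "$":
                continue
            cost = 0 if c == query[i] else 1
            if cost <= budget:
                recurse(i - 1, counts[c] + occ[c][lo], counts[c] + occ[c][hi],
                        budget - cost)

    recurse(len(query) - 1, 0, len(text) + 1, max_mismatches)
    return sorted(hits)


if __name__ == "__main__":
    print(bwt_index("GOOGOL")[1])                   # BWT of GOOGOL$
    print(search("GOOGOL", "LOL", max_mismatches=1))
```

The query is processed from its last letter to its first, and each step only narrows an interval of the suffix array, which is why the search time does not depend on the size of the reference.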

SHort Read Mapping Package = SHRiMP k-mer hashing step + a very efficient implementation of the Smith-Waterman algorithm. Can be used for letter space and color space reads. Slower than the others, but gives the optimal local alignment.
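For reference, a plain (unoptimized) Smith-Waterman local alignment in Python. This is a textbook sketch, not SHRiMP's implementation; the scoring values (match 2, mismatch -1, linear gap penalty -2) are arbitrary choices for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "ACGT"))
```

The full dynamic-programming table costs time proportional to the product of the two sequence lengths, which is why a hashing step is used first to restrict it to promising regions.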

Performance comparison Homer N, Merriman B, Nelson SF (2009) BFAST: An Alignment Tool for Large Scale Genome Resequencing. PLoS ONE 4(11): e7767. doi:10.1371/journal.pone.0007767

Re-sequencing in colorspace Colorspace has its own pros and cons, which can be taken into account in sequence alignment. Translation into nucleotides as the last step after alignment! Use software that supports colorspace. In practice, this is just an option you give to the alignment program.

Using read qualities in alignment? Not all programs use them! (Check the manuals.) Most new methods use read qualities. Out of the old ones: Maq. Let's look at how the assembly goes with Maq... It has been used a lot in early papers and is a benchmark for new methods regarding speed and alignment.

Maq - workflow
  maq fasta2bfa ref.fasta ref.bfa
      Convert the reference sequences to the binary fasta format.
  maq fastq2bfq reads.fastq reads-1.bfq
      Convert the reads to the binary fastq format.
  maq match reads-1.map ref.bfa reads-1.bfq
      Align the reads to the reference.
  maq mapcheck ref.bfa reads-1.map >mapcheck.txt
      Statistics from the alignment.
  maq assemble consensus.cns ref.bfa reads-1.map 2>assemble.log
      Build the mapping assembly.
  maq cns2fq consensus.cns >cns.fq
      Extract consensus sequences and qualities.
  maq cns2snp consensus.cns >cns.snp
      Extract list of SNPs.


Aligning reads - analysing results Software for visualization: Maqviewer, SHOREmap, IGV, Tablet Visualization is VERY important! See the real quality of the data, alignment, coverage etc. Helps to identify errors Helps to evaluate SNP calls, identify gaps etc. There are wings, a propeller, and a pilot - it must be...

Maqviewer Only basic functionality.

Tablet viewer Coded in Java. Graphical interface. May be slow.

SNP calling After aligning the short reads to the reference genome, identify nucleotides that differ from the reference. Make a consensus sequence of the reads. Simplest: choose the most common base. Better: use quality values in the voting.
[Figure: reference, consensus and individual reads, with a SNP marked.]
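The quality-weighted voting can be sketched in Python as below. This is a deliberately naive illustration, not the statistical model that Maq or other SNP callers use: each aligned base votes with its quality value, the consensus is the base with the largest total, and a SNP is reported when the consensus differs from the reference at sufficient depth. The alignment format (start position, bases, qualities), the `min_depth` threshold and the data are all assumptions made for this example.

```python
from collections import defaultdict


def call_snps(reference, alignments, min_depth=3):
    """alignments: list of (start, bases, qualities) for ungapped reads.

    Returns (consensus, [(position, reference base, called base, depth), ...]).
    """
    votes = [defaultdict(int) for _ in reference]
    depth = [0] * len(reference)
    for start, bases, quals in alignments:
        for offset, (base, q) in enumerate(zip(bases, quals)):
            votes[start + offset][base] += q  # vote weighted by quality
            depth[start + offset] += 1

    consensus, snps = [], []
    for pos, ref_base in enumerate(reference):
        if depth[pos] == 0:
            consensus.append("N")  # no read covers this position
            continue
        call = max(votes[pos], key=votes[pos].get)
        consensus.append(call)
        if call != ref_base and depth[pos] >= min_depth:
            snps.append((pos, ref_base, call, depth[pos]))
    return "".join(consensus), snps


if __name__ == "__main__":
    reference = "ACGTACGT"
    alignments = [
        (0, "ACGAAC", [30] * 6),
        (1, "CGAACG", [30] * 6),
        (2, "GAACGT", [30] * 6),
        (2, "GTACGT", [5] * 6),  # low-quality read supporting the reference
    ]
    print(call_snps(reference, alignments))
```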

SNP calling Software for SNP detection: SOAPsnp, Maq, probhd, SHOREmap, MUMmer. Maq computes a phred-type score for the SNPs. SNPs are hard to define; usually some thresholds are given, based on depth and on the number of differing nucleotides.

Example: Sequencing of A. thaliana genome, mutations induced by EMS Two different mutants sequenced with two platforms: SOLiD + Maq: 16724 SNPs. Solexa + Maq: 18469 SNPs. Roughly 1 SNP per 10,000 bases.

Further analysis Which SNP is responsible for the mutant phenotype? Usually some window of the genome is known. To identify the SNP, you need to combine SNP locations and genome annotation: Is the SNP in a coding sequence, exon, intron, promoter, 3'/5' UTR, junk? Is the SNP disruptive? Does translation result in a stop codon or an altered amino acid sequence? Each splice variant can be different, and the variants are not known. Location in the protein? If done properly, this would require protein structure prediction (very hard).

Sequencing de novo

How long do reads need to be for de novo genome assembly? Key problem: repeats in the genome that are longer than the read length. Theoretical analysis: E. coli: 30 bp read length, 75% of the genome is covered with contigs > 10,000 bp. C. elegans: 50 bp read length, 51% of the genome is covered with contigs > 10,000 bp. Human: 50 bp read length, ~15% is covered with contigs > 10,000 bp (chromosome 1). Re-sequencing and de novo sequencing of the majority of a bacterial genome is theoretically possible with read lengths of 20-30 bp. With longer genomes, significant proportions are left uncovered.

Percentage of the E.coli genome covered by contigs greater than a threshold length as a function of read length Whiteford, N. et al. Nucl. Acids Res. 2005 33:e171; doi:10.1093/nar/gni170

[Figure: percentage of the genome covered by contigs longer than a threshold length, as a function of read length l (nt), for C. elegans and human.] Use paired-end mapping to connect the longer contigs.

Genome assembly using paired end reads Figure 1: An illustration of the Paired End assembly process. Paired End reads are used to order and orient the contigs derived from the Newbler assembly. The large blue lines represent contigs generated from the whole genome shotgun sequencing and assembly. The multiple blue and grey lines represent Paired End information. The blue segments represent the two 20 nucleotide regions that were sequenced while the dotted grey line represents the distance between those two sequenced regions. [454 Technical note 1]

Software for de novo genome assembly
Software: ALLPATHS, Edena, ABySS, Velvet and SOAPdenovo. Based on de Bruijn graphs.
Fujimoto et al.: Illumina/Solexa sequencing of a human genome. 200-bp insert libraries, 12 runs. Read length 51-76 nt. 40x coverage.
De novo assembly of unmapped reads, comparison of contigs longer than 100 bp: ABySS 6,535 (violet), SOAPdenovo 4,826 (yellow), Velvet 6,617 (green). Contigs that were aligned with more than 90% identity were considered shared contigs. On average, 64.6% of the contigs showed high sequence similarity (at least 90% identity) to a contig in each of the other assemblies. In PCR validation, 181 out of 186 randomly selected contigs were amplified with the proper length.
[Figure 4a of Fujimoto et al.: comparison of contigs generated by ABySS, SOAPdenovo and Velvet.]
Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42, 931-936.

Velvet assembly - de Bruijn graphs Split reads into k-mers. Align all k-mers in the reads (here 5-mers). de Bruijn graph: each node represents a series of overlapping k-mers. The final nucleotides of the k-mers make up the sequence of the node. The last k-mer of an arc's origin overlaps with the first k-mer of its destination. Reads are mapped as paths through the graph. D. R. Zerbino and E. Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18: 821-829.
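A toy Python version of the k-mer graph idea. It is much simpler than Velvet and is not its implementation: each node is a single k-mer, arcs join k-mers that are adjacent in some read, and unbranched paths are merged into contigs. There is no handling of reverse complements, sequencing errors or repeats, and isolated cycles are ignored. The reads and function names are invented for this example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Graph with k-mers as nodes and arcs between k-mers adjacent in a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # make sure every k-mer is present as a node
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph


def contigs(graph):
    """Merge unbranched paths of the graph into contig sequences."""
    preds = defaultdict(set)
    for a, successors in graph.items():
        for b in successors:
            preds[b].add(a)

    def continues(a, b):
        # The arc a -> b can be merged if a has one way out and b one way in.
        return len(graph[a]) == 1 and len(preds[b]) == 1

    result = []
    for node in list(graph):
        incoming = preds.get(node, set())
        if len(incoming) == 1 and continues(next(iter(incoming)), node):
            continue  # node lies inside a path; it is reached from the path start
        contig, current = node, node
        while len(graph[current]) == 1:
            nxt = next(iter(graph[current]))
            if not continues(current, nxt):
                break
            contig += nxt[-1]  # each step adds the final nucleotide of the k-mer
            current = nxt
        result.append(contig)
    return sorted(result)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
    print(contigs(de_bruijn(reads, k=4)))
```

With a repeated k-mer the path branches, and the assembly breaks into several shorter contigs. This is the effect described on the slide about the read length that de novo assembly requires.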

Example: de novo fragment assembly of one SOLiD run of A. thaliana Data: 68,073,401 reads. Read length 50 nt. 12x coverage. Assembly: At least 3x coverage: 29 M reads. Number of contigs: 295,203. Median contig length: 276 nt. Longest contig: 3399 nt. Shortest contig: 82 nt. Sum of contig lengths: 98,457,405 nt, ~62.7% of the genome. [Figure: contig length (nt) plotted against contig index.]

SNP analysis of de novo vs. re-sequencing The sequenced genome was A. thaliana, Cvi ecotype. De novo, Velvet + MUMmer: 2,371,409 SNPs. Re-sequencing, TAIR9 + Maq: 183,811 SNPs. Published SNP list: 810,205 SNPs. Warning: the indices of the published list of Cvi SNPs do not match the TAIR9 reference genome! The consensus sequence of Cvi has not been published (released).

Example: Sequencing of the giant panda genome Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317.

Genome assembly using paired end reads v2 Several libraries with different insert lengths Can use the same sequencing technology for whole assembly. Strategy: Join reads with short insert lengths into contigs Make into scaffolds by mapping unpaired ends to other contigs Scaffold = set of contigs with spaces between. Use longer insert libraries for arranging contigs

Sequencing setup for Giant Panda 37 paired-end insert libraries with insert sizes of 150 bp, 500 bp, 2 kb, 5 kb and 10 kb. Illumina Genome Analyzer platform. 176 Gb of usable sequence, 73x coverage. Average read length of 52 bp.

Summary of Assembly Final contig size 2.24 Gb Estimated genome size 2.40 Gb.
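The coverage figure of the panda project follows from the numbers on these slides: coverage is the amount of usable sequence divided by the genome size. A short Python check, with function names chosen for this example only:

```python
def coverage(total_bases, genome_size):
    """Average coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """Number of reads required to reach a target average coverage."""
    return target_coverage * genome_size / read_length


if __name__ == "__main__":
    usable_sequence = 176e9   # 176 Gb of usable sequence
    genome_size = 2.40e9      # estimated genome size, 2.40 Gb
    print(round(coverage(usable_sequence, genome_size), 1))   # about 73x
    # Approximate number of 52 bp reads behind 176 Gb of sequence.
    print(round(usable_sequence / 52 / 1e9, 2), "billion reads")
```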

Genome annotation de novo Gene finding: Align known genes of model species against the new genome. Hidden Markov Model-based prediction of genes: Genscan, Augustus, HMMgene Gene annotation: Function of the genes that can be aligned to new genome give some clue. Gene orthologues, InParanoid, Multiparanoid.

Structure of the umami receptor T1R1 gene Heterodimer T1R1/T1R3 may be the sole receptor for umami taste. Umami: detection of the carboxylate anion of glutamic acid, a naturally occurring amino acid common in meats, cheese, broth, stock and other protein-heavy foods. In panda T1R1 is a pseudogene. Recent mutation, may explain the diet?

Example application: nucleosome positioning

Chromatin structure Chromatin=combination of DNA, RNA, and protein that makes up chromosomes. Functions: Package DNA, strengthen the DNA to allow mitosis and meiosis Serves as a mechanism to control expression and DNA replication. Changes in chromatin structure are affected by chemical modifications of histone proteins such as methylation (DNA and proteins) and acetylation (proteins), and by non-histone, DNA-binding proteins.

Predicting nucleosome positions Separate DNA into nucleosome vs. linker DNA parts. Sequence these with 454. Nucleosome ~146 bp, linker DNA ~50-500 bp. Construct a model to predict nucleosome positions. [Field et al. 08]

Computational model Nucleosomes: estimate a (position-specific) di-nucleotide model P_N over all nucleotide sequences. Linker DNA: estimate a 5-mer model P_L for linker DNA vs. nucleosome.

\[
\mathrm{Score}(S) = \log\frac{P_N(S)}{P_L(S)}
= \log\frac{P_{N,1}(S[1]) \prod_{i=2}^{147} P_{N,i}(S[i] \mid S[i-1])}
{P_l(S[1]) \prod_{i=2}^{147} P_l(S[i] \mid S[\max(1,i-4)], \ldots, S[i-1])}
\]

Estimate the score for the whole DNA, taking into account all legal configurations c of nucleosome positioning. Normalize to get probabilities:

\[
P(W_c[S]) = \frac{W_c[S]}{\sum_{c' \in C} W_{c'}[S]}
\]

Result Nucleosome localization can be predicted from DNA sequence. Two different types of regulation by chromatin in yeast promoters: Nucleosome-depleted areas: genes showing relatively low cell-to-cell expression variability, or transcriptional noise. Nucleosome-rich areas: transcription factors need to compete with nucleosomes for access to the DNA => variability in gene expression.

Further uses for high-throughput sequencing? Cataloging sequences and their variation: Between individuals and species. SNPs, quantitative trait loci. Copy number variations. Mutations and genome rearrangements. Metagenomics. Evolution at an individual level. Phylogeny. Epigenetics: DNA methylation (using ChIP-seq). Chromatin structure. Transcriptome: Digital Gene Expression. ChIP-seq. Splice variants. microRNA. Cell-specific gene expression.

What can high-throughput sequencing do for you? [Kahvejian et al. 08]

References
Li, H., Homer, N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5):473-483.
Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42, 931-936.
Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317.
Magi, A. et al. (2010) Bioinformatics for Next Generation Sequencing Data. Genes 1:294-307.
Vera, J.C., Wheat, C.W., Fescemyer, H.W., Frilander, M.J., Crawford, D.L., Hanski, I., and Marden, J.H. (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molecular Ecology 17:1636-1647.
Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., et al. (2008) Distinct Modes of Regulation by Chromatin Encoded through Nucleosome Positioning Signals. PLoS Comput Biol 4(11): e1000216. doi:10.1371/journal.pcbi.1000216
Kahvejian, A., Quackenbush, J., Thompson, J.F. (2008) What would you do if you could sequence everything? Nature Biotechnology 26(10):1125-1133.
Zerbino, D.R. and Birney, E. (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18: 821-829.
Korbel et al. (2007) Paired-end mapping reveals extensive structural variation in the human genome. Science 318:420-426.
Whole Genome Assembly using Paired End Reads in E. coli, B. licheniformis, and S. cerevisiae. 454 Application note 1, 2006.
Whiteford, N. et al. (2005) An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 33, e171.