Introduction to Next Generation Sequencing

The Sequencing Revolution
Introduction to Next Generation Sequencing
Dena Leshkowitz, WIS
1st BIOmics Workshop

High-Throughput Short-Read Sequencing Technologies
  Highly parallel reactions (millions to billions possible)
  Performed on cloned DNA populations
  Companies:
    454/Roche: launched in 2005. Pyrosequencing by synthesis.
    Solexa/Illumina: launched in late 2006. Reversible terminator sequencing by synthesis (dye-labeled nucleotides).
    Agencourt/ABI/Invitrogen: launched mid-2007. Sequencing by ligation (dye-labeled dinucleotides).

DNA Sequencing Throughput History

Cost for Sequencing the Human Genome

Objectives
  Illumina Genome Analyzer pipeline overview
  Interpreting the pipeline's output
  Tools and approaches to further analyze the sequences

Pipeline Components Overview
  Master Script (GOAT): Image Analysis, Base Calling, Alignment (GERALD)

Technology Overview

Image Analysis & Base Calling
  Flow cell:
    A flow cell contains eight lanes (Lane 1 ... Lane 8).
    Each lane contains two columns (Column 1, Column 2); each column contains up to 50 tiles.
    Each tile is imaged four times per cycle, one image per base.
  DNA clusters are located and quantified across all images.
  Each cluster, at each cycle, generates four fluorescence intensities.
  Naively, the highest of the four values determines the base.

Base Calling: Intensity Correction
  Cross-talk correction: the emission spectra of the four dyes overlap.
  Normalization: a scaling factor makes the intensities equivalent.
  [Figure: overlapping emission spectra of dye X and dye Y]

Base Calling: Phasing/Prephasing Correction
  [Figure: phasing and prephasing]
  Requires a sample with a random, balanced base composition, and is therefore usually done on PhiX, our control.

Base Calling
  [Figure: corrected intensities for A, C, G and T yield a base call (here C) and a quality score]
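
The naive rule mentioned above (the highest of the four intensities determines the base) can be sketched in a few lines of Python. This is only an illustration: the channel order A, C, G, T and the intensity values are assumptions, and the real pipeline applies the cross-talk, normalization and phasing corrections before calling a base.

```python
# Naive base calling: for each cycle, pick the channel with the highest intensity.
# The channel order (A, C, G, T) and all intensity values are illustrative assumptions.
CHANNELS = "ACGT"


def naive_base_call(intensities):
    """Return the base whose channel has the highest intensity."""
    if len(intensities) != len(CHANNELS):
        raise ValueError("expected one intensity per channel (A, C, G, T)")
    best = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    return CHANNELS[best]


def call_read(cycles):
    """Call one base per cycle for a single cluster."""
    return "".join(naive_base_call(c) for c in cycles)


# Hypothetical intensities for one cluster over three cycles.
example_cycles = [
    (120.0, 850.0, 40.0, 15.0),  # C channel is brightest
    (900.0, 300.0, 25.0, 10.0),  # A channel is brightest
    (30.0, 45.0, 60.0, 700.0),   # T channel is brightest
]
print(call_read(example_cycles))  # CAT
```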

Quality Score
  Each base has a quality score.
  Solexa's base scoring is similar to Phred scores, a way of expressing estimates of sequencing error probabilities.
    Q_phred = -10 log10(Pe)
    Pe = error probability of a particular base call
    Q20 = 1 error in 100 bases
    Q30 = 1 error in 1000 bases
  The quality score is stored in ASCII format: ASCII character code = quality value + 64

Quality Filtering (GERALD)
  Chastity: the ratio of the brightest intensity to the sum of the brightest and second-brightest intensities.
    C = I_A / (I_A + I_B) > 0.6
    (I_A = brightest intensity, I_B = second-brightest intensity)
  Filter (pure bases): a sequence with a chastity below 0.6 on two or more bases among the first 25 bases is filtered out.

Alignment: Program ELAND
  Very fast.
  Only 2 mismatches are allowed in the first 32 bases (N is not counted as a mismatch).
  Alignments are used to estimate error rates.

Alignment: Programs in GERALD (ELAND)
  Eland type       Application                         Description
  Eland_extended   Single reads                        Aligns single reads to a reference
  Eland_pair       Paired reads                        Aligns paired reads
  Eland_tag        DGE                                 Aligns to a nonredundant reference set of sequence tags
  Eland_rna        Single reads, whole transcriptome   Aligns to a reference genome, splice junctions and contaminations
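
The quality-score formula, the ASCII encoding and the chastity filter described above can be sketched as follows. This is a minimal illustration, assuming Phred-style scores as given by the formula in the text; the function names, the example intensities and the short quality string are hypothetical and are not taken from the pipeline.

```python
import math

QUALITY_OFFSET = 64  # from the text: ASCII character code = quality value + 64


def phred_from_error(pe):
    """Q = -10 * log10(Pe)."""
    return -10.0 * math.log10(pe)


def error_from_phred(q):
    """Invert the formula: Pe = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_qualities(quality_string):
    """Turn an ASCII quality string into a list of integer quality values."""
    return [ord(ch) - QUALITY_OFFSET for ch in quality_string]


def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest.

    Assumes positive intensities, so the denominator is never zero.
    """
    brightest, second = sorted(intensities, reverse=True)[:2]
    return brightest / (brightest + second)


def passes_chastity_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Reject a read when two or more of its first 25 bases fall below the threshold."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


print(round(phred_from_error(0.01)))  # 20
print(decode_qualities("hT^"))        # [40, 20, 30]
```

Only the first `window` cycles are inspected, so a drop in signal purity late in the read does not cause the read to be filtered.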

Sequence Output Formats

Sequence Output: FASTQ (s_1_sequence.txt)
  Line 1: Unique ID for a sequencing read
  Line 2: Sequence
  Line 3: Repeat of the ID (preceded by a + sign)
  Line 4: Base-calling quality scores (analogous to Phred scores, but encoded as ASCII characters)
  Example:
    @30LH2AAXX:8:1:984:225
    ATTCCCCTGTACTGAGACATAGAGAGTTTGCAAGACCA
    +30LH2AAXX:8:1:984:225
    \\\\\\\\\\Z\\\ZZZ\\\\\\W\\\\\ZYYYVYVVV

ELAND Alignment Outputs
  s_n_export.txt
    Results of the alignment of all reads in the lane. The fields are tab-separated to facilitate export to databases. The last field on each line is a flag indicating whether the read passed the filter (Y or N).
  s_n_sorted.txt
    Contains only entries for reads that pass pure-bases filtering and have a unique alignment in the reference. Entries are sorted by alignment position.
  Example:
    30LL2AAXX 1 53 735 205 ACGTGCTTACCCTACCACTCTATACCACCATCACTACC UUUUUUUUUUUUUUUUUUUUUUULUULUUUQQOQQIOO NC_001133.fna 354 F 19T10C3ATG1 0
    30LL2AAXX 1 8 348 612 ACGTGCTTACCCTACCACTTTATACCACCACCACATGC UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUQQQQQOMO NC_001133.fna 354 F 38 59
    30LL2AAXX 1 78 835 1401 TACCCTACCACTTTATACCACCACCACATGCCATACTC UUUUUUUUUUUUUUUUU

Alignment File Format (tab-delimited)
  Run folder name
  Lane
  Tile
  X coordinate of cluster
  Y coordinate of cluster
  Index string: blank for a non-indexed run
  Read number: 1 or 2 for paired-read analysis
  Read
  Quality string: in symbolic ASCII format
  Match chromosome: name of the chromosome match, or a code indicating why no match resulted
  Match contig: gives the contig name
  Match position: always with respect to the forward strand
  Match strand: F for forward, R for reverse
  Match descriptor: concise description of the alignment
  Single-read alignment score
  Paired-read alignment score
  Partner chromosome (paired read)
  Partner contig (paired read)
  Partner offset
  Partner strand
  Filtering: did the read pass quality filtering? Y for yes, N for no
  Example:
    30LH2AAXX 8 85 1701 577 CAAATATGTTCAACAAAATTATAGTAGAAAGCTTTCCA ]]]]]]]]]]]]]]]]\]]]]]]\\]Z]]]YYYYYVVV NC_000067.5.fasta 3011999 F 30A7 11 Y

Run Statistics Quality Control

Run Statistics: Summary.htm (Report folder)
  The number of detected clusters.
  The number of clusters that passed filtering.
  The average intensity of all color channels in all tiles for the first cycle: should be above 100.
  Percent intensity after 20 cycles: should be 50% or more.
  %PF: should be above 50% (possible problems: too many clusters, faint clusters).
  %Aligned: filtered reads that are uniquely aligned.
  %Error rate: should be 1.5 or below.

IVC.htm (Intensity Versus Cycle)
  The percentage of each base called as a function of the cycle. Each channel (A, T, G, C) is plotted separately.

Error.htm
  The red bar shows the percentage of bases at each cycle that are wrong, based on the ELAND alignment.
  The error rate rises with the cycles.
  Remark: the sequences were selected on their ability to align over the first 32 bases.

Pipeline Outputs
  You will find the following folders within the run folder:
    Folder              Data type
    GERALD_29-01-2009   Original GERALD folder from the pipeline (can contain CASAVA); optional data
    FINAL_29-01-2009    Final text outputs: sequences, alignments; optional data (also found in GERALD)
    Report              Summarized as a web page, Summary.htm; optional data (also found in GERALD)
  [Figure: folder structure and storage space for run FC1012X, showing the Gerald and Images folders (original folders from the pipeline) with sizes of 750Gb and 250Gb; less than 100Gb is transferred to the storage server (dapsas)]

Statistics of Runs Performed

The Jigsaw Puzzle
  One run yields 4Gb, made of 100 million pieces, each 40 bases long, and some do not fit correctly.

First Step in Analysis
  Sequence data (bases & quality)
  Mapping: aligning to a reference sequence
    1. Resequencing
    2. Transcriptome analysis (RNA-seq)
    3. Cistrome analysis (ChIP-seq)
  De novo assembly: assembling individual sequences into larger sequences

De Novo Sequencing Example: Pseudomonas syringae
  Butler et al. FEMS Microbiol Lett 291 (2009) 103-111
  6-million-base genome, 42X coverage
  ~3.5 million paired-end reads of 36 bases
  De novo assembly using VELVET and EDENA: at least 3% of the reference genome was absent from the assembly (842 unassembled regions).
  Unassembled regions are noncoding RNA.
  90% of the protein-coding genes were assembled with 100% accuracy over their full length.

Differences Among the Mapping Applications
  Speed (Bowtie: string matching using the Burrows-Wheeler Transform)
  Use of quality data (MAQ, consed)
  Ability to perform multiple mapping (Nexalign)
  Number of mismatches and indels supported (Soap)
  Length of seed alignment supported (Eland: 32 bases)
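
The arithmetic behind the run size in the jigsaw analogy and the coverage in the Pseudomonas syringae example can be checked with a short calculation. This sketch assumes that "3.5 million paired-end reads" means 3.5 million read pairs (7 million reads of 36 bases) and that the genome is exactly 6 million bases; both are assumptions made to reproduce the round numbers on the slide.

```python
def total_bases(n_reads, read_length):
    """Total number of sequenced bases in a run."""
    return n_reads * read_length


def coverage(n_reads, read_length, genome_size):
    """Average coverage: sequenced bases divided by genome size."""
    return total_bases(n_reads, read_length) / genome_size


# Jigsaw puzzle: 100 million pieces of 40 bases each.
print(total_bases(100_000_000, 40))            # 4000000000, i.e. 4Gb

# De novo example: assumed 3.5 million pairs = 7 million reads of 36 bases,
# on an assumed genome of 6 million bases.
print(coverage(2 * 3_500_000, 36, 6_000_000))  # 42.0
```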

Resequencing Example: SNP Detection & Reporting Using CONSED

Consed Can Detect an Inserted Base

CASAVA: Consensus Assessment of Sequence And Variation (Illumina)
  Post-sequencing analysis: uses the export.txt files from the ELAND alignment as input.
  For resequencing projects: produces a set of allele calls of SNPs.
  For RNA-seq (whole transcriptome sequencing): provides counts for exons, genes and splice junctions.

RNA-Seq
  http://en.wikipedia.org/wiki/rna-seq

RNA-seq
  The expression value is calculated by counting the number of reads per gene, exon or splice junction.
  Normalization of the expression value is done by:
    Dividing the number of reads by the virtual length of the gene or exon
    Scaling the number of reads between the samples

RNA-Seq: An Example of Alternative Splicing
  Chromosome   Start       End         GeneSymbol   Count_Normalized_Lane2   Count_Lane2
  c12          11351282    11354633    PRB4         0.57268                  524
  c4           70896237    70902762    STATH        31.23833                 18743
  c7           142539296   142546956   PIP          11.35417                 6540
  c12          10889715    10893342    PRR4         137.22163                77393
  c12          11310124    11313908    PRB3         7.40788                  8082
  c20          43314293    43316620    SLPI         26.63712                 15929
  © 2008 by Cold Spring Harbor Laboratory Press. Marioni J C et al. Genome Res. 2008;18:1509-1517
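
The two normalization steps described above (dividing by length, then scaling between samples) can be sketched as below. The constants are an assumption: the sketch uses the widely used reads-per-kilobase-per-million-reads form, which may differ from the exact scaling CASAVA applies, and it is not expected to reproduce the normalized values in the table. The gene names, lengths and read totals are hypothetical.

```python
def normalize_counts(counts, lengths, total_reads,
                     per_length=1000, per_reads=1_000_000):
    """Normalize raw read counts by feature length and by sample size.

    counts      -- dict: feature name -> number of reads in this sample
    lengths     -- dict: feature name -> (virtual) length in bases
    total_reads -- total number of reads in this sample
    The per_length / per_reads constants (kilobase, million reads) are an assumption.
    """
    length_scale = {name: lengths[name] / per_length for name in counts}
    sample_scale = total_reads / per_reads
    return {
        name: counts[name] / length_scale[name] / sample_scale
        for name in counts
    }


# Hypothetical features and samples, for illustration only.
lengths = {"geneA": 2000, "geneB": 1000}
sample1 = normalize_counts({"geneA": 500, "geneB": 500}, lengths, 10_000_000)
sample2 = normalize_counts({"geneA": 1000, "geneB": 500}, lengths, 20_000_000)

print(sample1)  # {'geneA': 25.0, 'geneB': 50.0}
print(sample2)  # {'geneA': 25.0, 'geneB': 25.0}
```

In this example geneA has twice as many raw reads in the second sample, but the second sample is also twice as deep, so its normalized value is unchanged.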

Visualize the Sequence Data: Importing to Genome Viewers

Basic Output Files: BED
  An example of a BED format file for reads that mapped to a genome:
    CHR    START      STOP       NAME      COUNT   STRAND
    chr1   17071700   17071733   seqname   2       +
    chr1   17071700   17071734   seqname   3       +
    chr1   17071700   17071735   seqname   4       +
    chr1   17071700   17071736   seqname   26      +
    chr1   17071701   17071736   seqname   2       +
    chr1   17071702   17071736   seqname   3       +
    chr1   17088793   17088829   seqname   1       +

Basic Output Files: WIG
  Sequencing "signal" (wiggle track):
    Locus    1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
    Signal   0  2  3  5  6  6  6  6  6  6   6   6   4   3   1   0
  The same signal in wiggle format:
    variableStep chrom=chr1
    1 0
    2 2
    3 3
    4 5
    5 6
    6 6
    7 6
    8 6
    9 6
    10 6
    11 6
    12 6
    13 4
    14 3
    15 1

Imported BED & Wiggle Files in the IGB Genome Browser
  Sultan et al. Science. 2008 Aug 15;321(5891):956-60

Defining DNA-Protein Interactions: ChIP-Seq

MACS: Model-based Analysis for ChIP-Seq
  [Figure label: Binding]
  Use confident peaks to model the shift size.
  CSHL 2009, Shirley Liu
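
The relation between the BED intervals and the wiggle "signal" shown in the file-format examples above can be sketched as a per-base coverage count. This is a minimal illustration, not the tool used to produce the tracks in the talk. It assumes the usual BED convention of 0-based, half-open intervals and 1-based positions in wiggle, and it assumes the COUNT column is the number of reads sharing that alignment. The intervals below are hypothetical.

```python
from collections import Counter, defaultdict


def bed_to_coverage(intervals):
    """Accumulate per-base depth from (chrom, start, end, count) tuples.

    BED intervals are treated as 0-based and half-open; the returned
    positions are 1-based, as used by wiggle variableStep tracks.
    """
    coverage = defaultdict(Counter)
    for chrom, start, end, count in intervals:
        for pos in range(start + 1, end + 1):
            coverage[chrom][pos] += count
    return coverage


def coverage_to_wiggle(coverage):
    """Format per-base depth as variableStep wiggle lines."""
    lines = []
    for chrom in sorted(coverage):
        lines.append(f"variableStep chrom={chrom}")
        for pos in sorted(coverage[chrom]):
            lines.append(f"{pos} {coverage[chrom][pos]}")
    return lines


# Hypothetical intervals: two overlapping alignments on chr1.
example = [("chr1", 0, 3, 2), ("chr1", 1, 4, 1)]
for line in coverage_to_wiggle(bed_to_coverage(example)):
    print(line)
# variableStep chrom=chr1
# 1 2
# 2 3
# 3 3
# 4 1
```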

Example of a Peak (MACS)
  chr                  chr1
  start                4838075
  end                  4838758
  length               684
  summit               278
  tags                 68
  -10*LOG10(pvalue)    459.98
  Fold enrichment      42.53
  FDR (%)              0.84

Bioinformatics wiki
  http://bip.weizmann.ac.il/wiki
  Everybody is invited to read and add to this wiki!

THANKS
  See you at the workshop this afternoon.