NEXT GENERATION SEQUENCING
HISTORY
Sanger sequencing
- Dominant for the last ~30 years
- Longest reads ~1000 bp
- Primer-based, so poorly suited to repetitive regions or SNP sites
Next generation sequencing
- Much shorter reads, 25 to 300 bp
- Higher throughput
- Lower cost per Mb
- Single-molecule sequencing (no cloning step)
- Since Jan 2008, more DNA has been sequenced than in all previous years
ILLUMINA (SOLEXA)
Computational Biology Research Group
PAIRED END SEQUENCING
- The two ends of each fragment get different adapters.
- One end is sequenced with one primer, then the other end with the other primer.
- This yields pairs of reads separated by a known distance.
- The known distance provides additional information when aligning reads, allowing better resolution in repetitive regions.
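To illustrate how the known distance helps in repeats, here is a minimal sketch. The function name, the candidate-position lists, and the tolerance are hypothetical; real aligners also take strand and orientation into account.

```python
from itertools import product

def pick_consistent_pair(hits_1, hits_2, expected_distance, tolerance):
    """Return the (pos_1, pos_2) combination whose spacing is closest to
    the expected distance, or None if none is within the tolerance.

    hits_1 / hits_2 are candidate start positions for each read of a pair.
    """
    best, best_deviation = None, None
    for pos_1, pos_2 in product(hits_1, hits_2):
        deviation = abs(abs(pos_2 - pos_1) - expected_distance)
        if deviation <= tolerance and (best_deviation is None or deviation < best_deviation):
            best, best_deviation = (pos_1, pos_2), deviation
    return best

# Read 1 falls in a repeat and matches three places; read 2 maps uniquely.
print(pick_consistent_pair([1000, 5000, 9000], [5210],
                           expected_distance=200, tolerance=50))
# prints (5000, 5210)
```

Only one of the three repeat copies lies at a plausible distance from the uniquely placed mate, so the ambiguity is resolved.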
SOLID SEQUENCING
NGS APPLICATIONS
- Resequencing: characterise different related species or strains
- Transcriptome analysis: analysis of the entire transcribed part of an organism's genome
- Examine chromatin modifications: quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)
- de novo genome assembly
ILLUMINA DATA
- Generates short reads (~35-100 bp)
- Good for resequencing
- Can be used for de novo sequencing with paired-end reads
READS
- Acquire and process images and convert to FASTQ
- Get data
- Quality control
- Map to genome
- Visualisation
- Post-processing: peak finding, SNP calling
FASTQ FORMAT
    @read_identifier#0/1
    TATACAATGCACTTAGTCATCCGCGTATCACTTTAT
    +
    IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I
- #0: index number for a multiplexed sample (0 for no indexing)
- /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
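The four-line record layout can be read with a few lines of Python. This is a sketch only: the FastqRecord type, the HEADER pattern, and the short example record are illustrative assumptions, and established libraries handle many more edge cases.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

class FastqRecord(NamedTuple):
    name: str
    index: Optional[int]   # value after '#', if present
    pair: Optional[int]    # 1 or 2 after '/', if present
    sequence: str
    quality: str

# '@' + name, then an optional '#<index>' and an optional '/<1|2>'
HEADER = re.compile(r"^@(?P<name>.+?)(?:#(?P<index>\d+))?(?:/(?P<pair>[12]))?$")

def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per four-line FASTQ record."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence, plus, quality = next(stripped), next(stripped), next(stripped)
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        match = HEADER.match(header)
        if match is None or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        index, pair = match["index"], match["pair"]
        yield FastqRecord(
            name=match["name"],
            index=int(index) if index is not None else None,
            pair=int(pair) if pair is not None else None,
            sequence=sequence,
            quality=quality,
        )

# Shortened, made-up record in the same layout as the slide's example.
example = ["@read_identifier#0/1\n", "ACGTACGT\n", "+\n", "IIIIGII4\n"]
for record in parse_fastq(example):
    print(record.name, record.index, record.pair, record.sequence)
```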
FASTQ SCORING
- Q_phred = -10 log10(e)
- Q_solexa = 10 log10(p(X) / (1 - p(X)))
- e is the estimated probability of a base call being wrong
- p(X) is the probability of the called base X being right, so p(X) = 1 - e
- Currently, the Illumina FASTQ format encodes a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data only Phred scores from 0 to 40 are expected).
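The two scoring formulas can be checked numerically. The sketch below assumes p(X) = 1 - e, so that both scores are functions of the error probability e; the function names are illustrative.

```python
import math

def phred_from_error(e: float) -> float:
    """Q_phred = -10 log10(e)."""
    return -10 * math.log10(e)

def solexa_from_error(e: float) -> float:
    """Q_solexa = 10 log10(p / (1 - p)) with p = 1 - e."""
    p = 1 - e
    return 10 * math.log10(p / (1 - p))

def error_from_phred(q: float) -> float:
    """Invert the Phred formula: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def solexa_to_phred(q_solexa: float) -> float:
    """Since 10^(Q_solexa/10) = (1 - e)/e, adding 1 gives 1/e."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

for e in (0.5, 0.1, 0.01, 0.001):
    print(f"e={e:<6} Q_phred={phred_from_error(e):6.2f} "
          f"Q_solexa={solexa_from_error(e):6.2f}")
```

The two scales nearly coincide for small e and diverge for poor base calls: at e = 0.5 the Solexa score is 0 while the Phred score is about 3.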
BASE CALL QUALITY STRINGS
- If e is the probability that the base call is wrong, the (standard Sanger) Phred score is Q_phred = -10 log10(e).
- The score is written as the character with ASCII code Q + 64.
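Decoding a quality string is the reverse of the encoding rule. The sketch below takes the offset as a parameter: 64 is the value stated on this slide, while the Sanger FASTQ encoding uses 33. The function names and the example string are illustrative.

```python
def decode_quality(quality: str, offset: int = 64) -> list:
    """Convert each quality character to its Phred score (ASCII code - offset)."""
    scores = [ord(char) - offset for char in quality]
    if scores and min(scores) < 0:
        raise ValueError("character below the offset; wrong encoding assumed?")
    return scores

def error_probabilities(scores) -> list:
    """Convert Phred scores to error probabilities, e = 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in scores]

scores = decode_quality("hhhTB", offset=64)
print(scores)                       # prints [40, 40, 40, 20, 2]
print(error_probabilities(scores))
```

A character with an ASCII code below 64, such as the '4' (code 52) in the FASTQ example earlier in these slides, cannot come from the +64 encoding; that string decodes only under the +33 offset.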
READ ALIGNMENT
- For applications such as resequencing, SNP calling, and many others, the reads need to be aligned to a reference genome.
- The number of reads is 10-100 million or more.
- Optimal alignment algorithms are too time-intensive: O(mn) per read, for a read of length m and a reference of length n.
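To make the O(mn) cost concrete, here is a minimal sketch of the Smith-Waterman local alignment score, one example of an optimal alignment algorithm. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(read: str, reference: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score; fills an (m+1) x (n+1) table row by row."""
    m, n = len(read), len(reference)
    previous = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):            # m rows ...
        current = [0] * (n + 1)
        for j in range(1, n + 1):        # ... times n columns = O(mn) cells
            substitution = match if read[i - 1] == reference[j - 1] else mismatch
            current[j] = max(
                0,                                   # start a new local alignment
                previous[j - 1] + substitution,      # align the two bases
                previous[j] + gap,                   # gap in the reference
                current[j - 1] + gap,                # gap in the read
            )
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("ACGT", "TTACGTTT"))  # prints 8
```

Every read visits every reference position, so the work grows with the product of read length and genome length, multiplied again by the number of reads.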
ADDITIONAL CHALLENGES
- Read errors
  - Dominant cause of mismatches
  - Detection of substitutions?
  - Importance of base-call quality
- Repetitive regions / accuracy
  - ~20% of the human genome is repetitive for 32 bp reads
  - Use paired-end information
ALIGNMENT ALGORITHM APPROACHES
- Hashing (seed-and-extend paradigm, k-mers + Smith-Waterman)
  - Hash the entire genome: straightforward, easily parallelized, but large memory
  - Hash the read sequences: flexible memory footprint, harder to parallelize
- Indexing by Burrows-Wheeler Transform
  - Pros: fast and relatively small memory
  - Cons: decrease in performance for longer reads
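Both approaches can be sketched on a toy genome. This is an illustration under simplifying assumptions, not how any particular aligner is implemented: the extension step counts mismatches instead of running Smith-Waterman, the BWT is built by sorting all rotations (far too slow for a real genome), and all function names and the toy sequence are made up.

```python
from collections import defaultdict

def build_kmer_index(genome: str, k: int) -> dict:
    """Hash the genome: map every k-mer to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def seed_and_extend(read, genome, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then extend by counting mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
    return hits

def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations of text + '$'."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def backward_search_count(bwt_string: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    counts = {}
    for char in bwt_string:
        counts[char] = counts.get(char, 0) + 1
    first, total = {}, 0           # first[c]: number of characters smaller than c
    for char in sorted(counts):
        first[char] = total
        total += counts[char]
    low, high = 0, len(bwt_string)
    for char in reversed(pattern):
        if char not in first:
            return 0
        low = first[char] + bwt_string[:low].count(char)
        high = first[char] + bwt_string[:high].count(char)
        if low >= high:
            return 0
    return high - low

genome = "GATTACAGATTACC"
index = build_kmer_index(genome, k=3)
print(seed_and_extend("ATTACC", genome, index, k=3))   # prints [(1, 1), (8, 0)]
print(backward_search_count(bwt(genome), "ATTAC"))     # prints 2
```

The hash index stores every k-mer position explicitly, which is where the large memory footprint comes from. The BWT search instead narrows a single interval one pattern character at a time; production tools replace the `count` calls here with precomputed occurrence tables.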