NEXT GENERATION SEQUENCING. Farhat Habib

HISTORY

Sanger sequencing
- Dominant for the last ~30 years
- Longest reads ~1000 bp
- Primer-based, so poorly suited to repetitive regions or SNP sites

Next generation sequencing
- Much shorter reads, 25 to 300 bp
- Higher throughput
- Lower cost per Mb
- Single-molecule sequencing (no cloning step)
- More DNA sequenced since January 2008 than in all previous years combined

ILLUMINA (SOLEXA) Computational Biology Research Group

PAIRED END SEQUENCING

- The two ends of each fragment get different adapters.
- Hence one can sequence from one end with one primer, then repeat from the other end with the other primer.
- This yields pairs of reads separated by a known distance.
- The known distance provides additional information when aligning reads, allowing better resolution in repetitive regions.
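
To make the last point concrete, here is a minimal sketch of how the known distance between mates can resolve an ambiguous placement. The function name `consistent_pairs`, its parameters, and the candidate positions are all hypothetical; a real aligner handles strand, orientation, and a distribution of fragment lengths, which this sketch ignores.

```python
from itertools import product


def consistent_pairs(hits_1, hits_2, expected_distance, tolerance):
    """Keep (pos_1, pos_2) placements whose separation fits the library.

    hits_1 and hits_2 are candidate genome positions for each mate
    (hypothetical aligner output); distances are in bp.
    """
    pairs = []
    for pos_1, pos_2 in product(hits_1, hits_2):
        separation = abs(pos_2 - pos_1)
        if abs(separation - expected_distance) <= tolerance:
            pairs.append((pos_1, pos_2))
    return pairs


# Mate 1 falls in a repeat and matches three places; mate 2 matches once.
mate_1_hits = [1_000, 52_000, 910_000]
mate_2_hits = [52_310]
print(consistent_pairs(mate_1_hits, mate_2_hits, expected_distance=300, tolerance=50))
```

Only the placement of mate 1 that lies about one fragment length from mate 2 survives, which is the extra resolution in repeats that the slide refers to.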

SOLID SEQUENCING

NGS APPLICATIONS

- Resequencing: characterise different related species or strains
- Transcriptome analysis: analysis of the entire transcribed part of an organism's genome
- Examine chromatin modifications: quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)
- De novo genome assembly

ILLUMINA DATA

- Generates short reads (~35-100 bp)
- Good for resequencing
- Can be used for de novo sequencing with paired-end reads

READS

- Acquire and process images and convert to FASTQ
- Get data
- Quality control
- Map to genome
- Visualisation
- Post-processing: peak finding, SNP calling

FASTQ FORMAT

    @read_identifier#0/1
    TATACAATGCACTTAGTCATCCGCGTATCACTTTAT
    +
    IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I

- #0: index number for a multiplexed sample (0 for no indexing)
- /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
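
As an illustration of the record layout, the following sketch reads four-line FASTQ records and splits the identifier into its name, index, and pair member. The names `FastqRecord`, `ID_PATTERN`, and `parse_fastq` are hypothetical, the example record is a shortened stand-in for the one on the slide, and the sketch assumes each sequence sits on a single line (no wrapped records).

```python
import re
from itertools import islice
from typing import Iterable, Iterator, NamedTuple, Optional


class FastqRecord(NamedTuple):
    name: str
    index: Optional[int]   # multiplex index after '#', if present
    mate: Optional[int]    # 1 or 2 after '/', if present
    sequence: str
    quality: str


# Identifier layout from the slide: @name, optional #index, optional /mate.
ID_PATTERN = re.compile(r"^@(?P<name>.+?)(?:#(?P<index>\d+))?(?:/(?P<mate>[12]))?$")


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (no wrapped sequences)."""
    lines = iter(lines)
    for header in lines:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [line.rstrip("\n") for line in islice(lines, 3)]
        if len(rest) != 3:
            raise ValueError(f"truncated record at {header!r}")
        sequence, separator, quality = rest
        match = ID_PATTERN.match(header)
        if match is None or not separator.startswith("+"):
            raise ValueError(f"malformed record at {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at {header!r}")
        index, mate = match["index"], match["mate"]
        yield FastqRecord(
            name=match["name"],
            index=int(index) if index is not None else None,
            mate=int(mate) if mate is not None else None,
            sequence=sequence,
            quality=quality,
        )


# Shortened example record (assumption: same identifier style as the slide).
example = "@read_identifier#0/1\nTATACA\n+\nIIII4I\n"
for record in parse_fastq(example.splitlines()):
    print(record)
```

For real data, an established parser such as Biopython's `Bio.SeqIO` is usually the safer choice; the sketch is only meant to show which part of the record carries which piece of information.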

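FASTQ SCORING

$Q_{\text{Phred}} = -10 \log_{10}(e)$

$Q_{\text{Solexa}} = 10 \log_{10}\left(\frac{p(X)}{1 - p(X)}\right)$

- $e$ is the estimated probability of a base call being wrong
- $p(X)$ is the probability of the called base $X$ being right, so $p(X) = 1 - e$

In the FASTQ variant written by the Illumina pipeline from version 1.3 up to (but not including) 1.8, a Phred quality score from 0 to 62 is encoded using ASCII 64 to 126 (although in raw read data only Phred scores from 0 to 40 are expected).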
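
The two scales can be compared numerically. The sketch below evaluates both formulas for the same error probability and converts a Solexa score to a Phred score; the conversion follows from substituting $p(X) = 1 - e$ into the two definitions. The function names are hypothetical.

```python
import math


def phred_from_error(e):
    """Q_Phred = -10 log10(e), with e the probability the call is wrong."""
    return -10 * math.log10(e)


def solexa_from_error(e):
    """Q_Solexa = 10 log10(p / (1 - p)), with p = 1 - e."""
    p_right = 1 - e
    return 10 * math.log10(p_right / e)


def solexa_to_phred(q_solexa):
    """10^(Q_Solexa / 10) = (1 - e) / e, so adding 1 gives 1 / e."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


for e in (0.5, 0.1, 0.01, 0.001):
    print(f"e={e:<6} Phred={phred_from_error(e):6.2f} Solexa={solexa_from_error(e):6.2f}")
```

The two scores differ noticeably only for low-quality calls (large `e`); for small error probabilities they are nearly equal, since `1 - e` is close to 1.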

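BASE CALL QUALITY STRINGS

If $e$ is the probability that the base call is wrong, the (standard Sanger) Phred score is $Q_{\text{Phred}} = -10 \log_{10}(e)$.

In the Illumina 1.3+ encoding, each score is written as the character with ASCII code Q + 64. The Sanger FASTQ variant uses Q + 33 instead.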
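
A short sketch of decoding a quality string follows. The offset is left as a parameter because it depends on which FASTQ variant produced the file: the example record in the FASTQ FORMAT section contains the characters `4` and `:`, whose ASCII codes (52 and 58) are below 64, so that particular string is consistent with an offset of 33 rather than 64. The function names are hypothetical.

```python
def decode_quality(quality_string, offset=64):
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(character) - offset for character in quality_string]


def error_probability(q_phred):
    """Invert Q = -10 log10(e) to recover the error probability."""
    return 10 ** (-q_phred / 10)


# Offset 64: 'h' is ASCII 104, '^' is 94, 'T' is 84.
print(decode_quality("hh^T", offset=64))

# Characters taken from the slide's example string, under both offsets.
print(decode_quality("IG4:", offset=33))
print(decode_quality("IG4:", offset=64))  # negative values signal the wrong offset
```

Negative decoded scores are a quick sanity check that the assumed offset does not match the file.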

READ ALIGNMENT

- For applications such as resequencing, SNP calling, and many others, the reads need to be aligned to a reference genome.
- The number of reads is in the range of 10-100 million or more.
- Optimal alignment algorithms are too time-intensive: O(mn) per read, for a read of length m and a reference of length n.
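
To show where the O(mn) cost comes from, here is a minimal sketch of Smith-Waterman local alignment scoring, which fills one table cell per (read base, reference base) pair. The scoring values and the function names are illustrative assumptions, and `cell_updates` uses rough figures (a 36 bp read, a reference of about 3 billion bases, 10 million reads) only to indicate the order of magnitude.

```python
def smith_waterman_score(read, reference, match=2, mismatch=-1, gap=-2):
    """Best local alignment score; visits all m x n table cells."""
    previous = [0] * (len(reference) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        current = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            substitution = match if read[i - 1] == reference[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution, # align the two bases
                previous[j] + gap,              # gap in the reference
                current[j - 1] + gap,           # gap in the read
            )
            best = max(best, current[j])
        previous = current
    return best


def cell_updates(read_length, reference_length, read_count):
    """Number of table cells filled when every read is aligned optimally."""
    return read_length * reference_length * read_count


print(smith_waterman_score("ACGT", "TTACGTTT"))
print(f"{cell_updates(36, 3_000_000_000, 10_000_000):.2e} cell updates")
```

Under those assumed figures the total is on the order of 10^18 cell updates, which is why short-read aligners avoid running the full dynamic program against the whole reference.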

ADDITIONAL CHALLENGES

- Read errors
  - Dominant cause of mismatches
  - Detection of substitutions?
  - Importance of base-call quality
- Repetitive regions / accuracy
  - ~20% of the human genome is repetitive for 32 bp reads
  - Use paired-end information

ALIGNMENT ALGORITHM APPROACHES

- Hashing (seed-and-extend paradigm, k-mers + Smith-Waterman)
  - The entire genome: straightforward and easily parallelized, but needs large memory
  - The read sequences: flexible memory footprint, harder to parallelize
- Indexing by Burrows-Wheeler Transform
  - Pros: fast and relatively small memory
  - Cons: decrease in performance for longer reads
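
The hashing approach, with the genome as the hashed side, can be sketched as follows. This is a simplified illustration, not any particular aligner's method: the function names and the toy genome are hypothetical, and the extension step is an ungapped mismatch count standing in for the Smith-Waterman extension named above.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Hash every k-mer of the genome to the positions where it occurs."""
    index = defaultdict(list)
    for position in range(len(genome) - k + 1):
        index[genome[position:position + k]].append(position)
    return index


def seed_and_extend(read, genome, index, k, max_mismatches=2):
    """Seed with exact k-mer hits, then verify each candidate placement."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset  # where the read would begin in the genome
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    placements = []
    for start in sorted(candidates):
        window = genome[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            placements.append((start, mismatches))
    return placements


genome = "TTGACCGTAAGCTTGGAC"            # toy reference (assumption)
index = build_kmer_index(genome, k=3)
read = "CGTTAGCT"                         # genome[5:13] with one substitution
print(seed_and_extend(read, genome, index, k=3))
```

The memory trade-off on the slide is visible here: the index holds one entry per genome position, so it grows with the genome, while hashing the reads instead would make the table size depend on the read set.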