Bioinformatics in next generation sequencing projects

Rickard Sandberg, Assistant Professor, Department of Cell and Molecular Biology, Karolinska Institutet, May 2013

Standard sequence library generation

Illumina Sequencing Technology

Illumina (Solexa) Sequencing

Illumina paired-end and index-read sequencing

Once sequenced, the problem becomes computational
  Computational analysis is the bottleneck
  Rapid improvement in sequencing
  Still a need for customized analysis in most projects

Overview of computational analyses
  [Figure: genome sequence and assembled contig; RNA-Seq expression levels; ChIP-Seq peak calling]
  Primary Analyses:
  Image analysis
  Base calling
  Mapping (Assembly)
  Data type specific analyses (e.g. peak calling, calculate expression)
  Custom project specific analyses

Preliminary Analyses
  Real Time Analysis: raw images (TB) are converted to text files (GB) containing sequences and quality scores
  Platform-specific analysis using the vendor's programs

Sequenced reads
  Fasta file (read identifier, then sequence):
  >EAS54_6_R1_2_1_413_324
  CCCTTCTTGTCTTCAGCGTTTCTCC
  Fastq file (read identifier, sequence, separator, quality scores):
  @HWI-EAS269:1:120:1786:18#0/1
  GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG
  +HWI-EAS269:1:120:1786:18#0/1
  aaab^_u_aa [ U [ _Z ] a `WU_^X `GT^_ \ TM^ ^ \ \ Z \ YQVVXUBBBB
  SOLiD, csfasta file:
  >1_39_146_F3
  T22100200202311030112002022222002021
  >1_39_194_F3
  T11022322003020303320012223122202221
  SOLiD, QV file:
  >1_39_146_F3
  14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
  >1_39_194_F3
  26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14
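The four-line Fastq layout shown above (identifier, sequence, separator, quality string) can be read with a few lines of Python. This is only a minimal sketch: the names `FastqRecord` and `read_fastq` and the read `read1` are made up for illustration, and it assumes unwrapped records with no blank lines inside the file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example, not data from the slides.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHHH\n@read2\nACGT\n+read2\n!!!!\n")
records = list(read_fastq(example))
for record in records:
    print(record.identifier, len(record.sequence))
```

The length check matters in practice: every base must have exactly one quality character.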

Phred Quality Score, Q
  Each base call has an estimate of the probability of being wrong (error probability, p)
  Q = -10 * log10(p)
  Phred Quality Score | Probability of incorrect base call | Base call accuracy
  10                  | 1 in 10                            | 90 %
  20                  | 1 in 100                           | 99 %
  30                  | 1 in 1000                          | 99.9 %
  40                  | 1 in 10000                         | 99.99 %
  50                  | 1 in 100000                        | 99.999 %
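The relationship between Q and p can be checked numerically. The sketch below simply applies the formula from the slide in both directions; the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(p) to get the error probability p."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(p)."""
    return -10 * math.log10(p)


# Print the rows of the table for Q = 10 ... 50.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: p = {p:g}, accuracy = {100 * (1 - p):.3f} %")
```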

FastQ encodings
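In a Fastq file each quality score is stored as one ASCII character: the score plus a fixed offset. The sketch below assumes the two common Phred offsets, 33 (Sanger and Illumina 1.8+) and 64 (Illumina 1.3 to 1.7), and does not handle the older Solexa score scale. The function names are illustrative, and `guess_offset` is only a heuristic.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert a Fastq quality string into integer quality scores."""
    return [ord(character) - offset for character in quality]


def guess_offset(quality: str) -> int:
    """Heuristic guess of the encoding offset.

    Characters below ';' (ASCII 59) cannot occur in offset-64 files, so their
    presence implies offset 33. A string of only high-quality offset-33
    characters will be misclassified, so inspect many reads in practice.
    """
    return 33 if min(quality) < ";" else 64


# The same scores (40, 40, 20, 2) written in both encodings.
print(decode_qualities("II5#", offset=33))
print(decode_qualities("hhTB", offset=64))
```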

Fastq quality control (FastQC) Video tutorial: http://www.youtube.com/watch?v=bz93reov87y

Quality scores for each sequence position

Quality scores for each sequence position: A good run

GC for reads

Percent A,C,G,T at each position

Relative enrichment of kmers

Overview of computational analyses
  [Figure: genome sequence and assembled contig; RNA-Seq expression levels; ChIP-Seq peak calling]
  Primary Analyses:
  Image analysis
  Base calling
  Mapping
  Assembly
  Data type specific analyses (e.g. peak calling, calculate expression)
  Custom project specific analyses

Short Read Assembly
  Velvet and SOAPdenovo: de novo genome assemblers designed specifically for short-read sequencing technologies
  Nature 2009

Two principal approaches for transcriptome reconstruction

Genome-independent transcriptome reconstruction
  Default k = 25
  Grabherr et al. Nature Biotechnology, July 2011
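The parameter k refers to k-mers, the fixed-length substrings from which short-read assemblers build a de Bruijn graph. The toy sketch below shows the idea only: it uses k = 4 on two made-up reads, ignores reverse complements, sequencing errors and coverage, and is not the implementation used by any of the tools named in these slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def kmers(sequence: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads.
reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Reads that overlap share k-mers, so they end up connected in the graph without any pairwise read-against-read alignment. A node with two outgoing edges, such as ACG here, marks a branch the assembler has to resolve.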

Finding novel non-annotated genes or transcript variants

Mapping of millions of short reads
  Task: map millions of short sequences (25-100 nt) onto a genome (3 000 Mbp) or transcriptome
  Mismatches (sequencing errors and SNPs)
  Unique / repetitive matches
  Indels (normal variation, CNVs)
  Large rearrangements (translocations)
  BLAST and BLAT were not designed for these tasks

Mapping of RNA-Seq reads STAR Garber et al. 2011 Nat Methods

Mapping of splice junctions
  Exon n GTAAGT-----------AG Exon n+1
  1. Compile sets of junctions
  2. Map reads towards genome + junction compilation
  Reference: genome chromosome Fasta files + known and putative splice junctions Fasta file

TopHat: exon-first method
  1. Identify candidate exons (A, B, C) via genomic mapping
  2. Generate possible pairings of exons (A-B, A-C, B-C)
  3. Align unmappable reads to the possible junctions
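The pairing step can be illustrated with a small sketch that joins the end of each upstream exon to the start of each downstream exon, producing junction sequences against which unmapped reads could then be aligned. The genome string, exon coordinates and the `flank` parameter are invented for illustration, and this is not TopHat's actual code.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Exon = Tuple[str, int, int]  # (name, 0-based start, exclusive end)


def candidate_junctions(genome: str, exons: List[Exon], flank: int) -> Dict[str, str]:
    """Build one junction sequence for every upstream/downstream exon pair."""
    ordered = sorted(exons, key=lambda exon: exon[1])
    junctions = {}
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(ordered, 2):
        left = genome[max(start_a, end_a - flank):end_a]       # last bases of upstream exon
        right = genome[start_b:min(end_b, start_b + flank)]    # first bases of downstream exon
        junctions[f"{name_a}-{name_b}"] = left + right
    return junctions


# Toy genome: exon A, intron (GT...AG), exon B, intron (GT...AG), exon C.
genome = "ACGTACGT" + "GTAAAAAG" + "TTGGCCAA" + "GTCCCCAG" + "CATGCATG"
exons = [("A", 0, 8), ("B", 16, 24), ("C", 32, 40)]
junctions = candidate_junctions(genome, exons, flank=4)
for name, sequence in junctions.items():
    print(name, sequence)
```

In a real setting the flank would typically be one base shorter than the read length, so that any read aligned to a junction sequence must cross the junction.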

Longer reads
  By segmenting long reads and mapping the segments independently, we can look harder for junctions we might have missed with shorter reads
  >HWI-EAS229_75_30DY0AAXX:7:1:0:949
  GATGTTCTCAGTGTCC GATGTAATCAGTGTCC AACCCTCTCAGTGTCC
  Running time is independent of intron size
  [Figure: read segments spanning a very long (100 kb+) intron]
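Segmenting a read is straightforward to sketch. The function below cuts a read into consecutive fixed-length pieces and records each piece's offset, so that independently mapped segments can later be related back to the original read. The segment length and the example read are arbitrary choices for illustration, not the values any particular aligner uses.

```python
from typing import List, Tuple


def split_read(read: str, segment_length: int) -> List[Tuple[int, str]]:
    """Return (offset, segment) pairs; the final segment may be shorter."""
    return [
        (offset, read[offset:offset + segment_length])
        for offset in range(0, len(read), segment_length)
    ]


# Hypothetical 20 nt read cut into 8 nt segments.
segments = split_read("GATGTTCTCAGTGTCCAACC", segment_length=8)
for offset, segment in segments:
    print(offset, segment)
```

If two adjacent segments map far apart on the same chromosome, the read probably spans a junction somewhere between them.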

Mapping to transcriptome
  [Figure: gene structure (5' UTR, exons, introns, 3' UTR) on the W and C strands of genomic DNA; transcription gives pre-mRNA; RNA processing (splicing, polyadenylation) gives mRNA with a poly(A) tail]

Microexons and junction coverage
  Two or more splice junctions within the same read
  [Figure: in-house mapping compared with TopHat mapping]
  Different read lengths will have different problems!

Example of STAR aligned single-cell RNA-Seq data
  Mapping speed: 308 M reads / hour
  % uniquely mapping: 60
  % multimapping: 25
  % unmapped: 15
  281 719 splice junctions
  279 356 with GT/AG
  2 123 with GC/AG
  215 with AT/AC

Storing mapped alignments
  Formats for storing alignments should include:
  genomic coordinates
  mismatches, insertions, deletions etc.
  quality information

Samtools
  Sequence Alignment/Map (SAM): generic alignment format
  Supports long and short reads
  Human readable, flexible and compact
  Emerging standard
  Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
  http://samtools.sourceforge.net/

SAM Example
  HWI-EAS269:1:114:1242:1582#0 16 chrY 616000 255 22M731N28M * 0 0 ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC #9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB NM:i:0 XS:A:-
  16: bit field, where 16 means reverse strand
  616000: start position
  22M731N28M: alignment structure. Here: 22 aligned bases, then a 731-base intron, then 28 aligned bases
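The fields of the example record can be pulled apart with plain string handling. The sketch below assumes tab-separated fields as in a real SAM file and uses the record above with its whitespace normalised. `SamRecord`, `parse_sam_line` and `is_reverse_strand` are illustrative names, not part of SAMtools.

```python
from typing import Dict, NamedTuple, Union


class SamRecord(NamedTuple):
    query_name: str
    flag: int
    reference_name: str
    position: int  # 1-based leftmost mapped position
    mapping_quality: int
    cigar: str
    sequence: str
    quality: str
    tags: Dict[str, Union[int, str]]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; the first 11 fields are mandatory."""
    fields = line.rstrip("\n").split("\t")
    tags: Dict[str, Union[int, str]] = {}
    for field in fields[11:]:  # optional TAG:TYPE:VALUE fields
        name, value_type, value = field.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10], tags)


def is_reverse_strand(record: SamRecord) -> bool:
    """Bit 16 (0x10) of the flag marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


line = "\t".join([
    "HWI-EAS269:1:114:1242:1582#0", "16", "chrY", "616000", "255",
    "22M731N28M", "*", "0", "0",
    "ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC",
    "#9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB",
    "NM:i:0", "XS:A:-",
])
record = parse_sam_line(line)
print(record.reference_name, record.position, record.cigar, is_reverse_strand(record))
```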

CIGAR Format
  M, match/mismatch
  I, insertion
  D, deletion
  S, softclip
  ...
  Ref:  GCATTCAGATGCAGTACGC
  Read:   ccTCAG--GCAGTAgtg
  Pos: 5
  CIGAR: 2S4M2D6M3S
  An ungapped 50-base alignment: 50M
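A CIGAR string is a run-length encoding of the alignment, so it can be expanded with a regular expression. In the SAM format, M, I and S (plus = and X) consume read bases, while M, D and N (plus = and X) consume reference bases. The sketch below uses that rule to compute read length and reference span. The function names and the string "3S10M2I5M" are made up for illustration.

```python
import re
from typing import List, Tuple

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    operations = [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in operations) != cigar:
        raise ValueError(f"unrecognised CIGAR string: {cigar!r}")
    return operations


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered, including skipped introns (N)."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


for cigar in ("50M", "22M731N28M", "3S10M2I5M"):
    print(cigar, query_length(cigar), reference_span(cigar))
```

For the spliced alignment 22M731N28M the read is 50 bases long but covers 781 bases of the reference, because the N operation skips the intron.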

Samtools for SAM/BAM files
  Library and software package (C, Java)
  Creating, sorting, indexing SAM & BAM
  Visualizing alignments on the command line
  SNP calling
  Short indel detection
  BAM (binary representation of SAM): ~25% file size reduction

Read mapping statistics, e.g. using RSeQC (package)
  [Figures: density of reads vs. GC content (%); nucleotide frequency (A, T, G, C) vs. position of read]

Read mapping statistics: read mapping across genes
  [Figure: read number vs. percentile of gene body (5' > 3')]

Read mapping statistics: splicing junctions
  complete_novel 9%
  partial_novel 2%
  known 89%

Read mapping statistics: duplicate and unique reads
  [Figure: number of reads (log10) and reads % vs. frequency; series: sequence base, mapping base]

Read mapping statistics: q values on mapped reads
  [Figure: Phred quality score vs. position of read]

Visualization
  Integrative Genomics Viewer, IGV (Broad Inst.)
  Custom tracks at UCSC Genome Browser

Peak characteristics differ with signal
  H3K4me3: sharp promoter peaks
  H3K36me3: broad transcription elongation signal

Important file formats
  Sequences: FastQ
  Aligned reads: SAM/BAM
  Genome annotations: Bed, Gff
  Coverage: Wig, (Tdf)
  http://genome.ucsc.edu/faq/faqformat.html

BED format
  chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  track name=pairedReads description="Clone Paired Reads" useScore=1
  chr22 1000 5000
  http://genome.ucsc.edu/faq/faqformat.html

BED continued
  track name=pairedReads description="Clone Paired Reads" useScore=1
  chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
  strand - Defines the strand - either '+' or '-'.
  thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
  blockCount - The number of blocks (exons) in the BED line.
  blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
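The coordinate conventions are easiest to see by converting the block fields into absolute positions. The sketch below parses the twelve-column example line and assumes whitespace-separated fields. The function name `parse_bed12` and the returned dictionary keys are illustrative choices.

```python
from typing import Dict, List, Tuple


def parse_bed12(line: str) -> Dict[str, object]:
    """Parse a 12-column BED line and resolve blocks to absolute coordinates."""
    fields = line.split()
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    # The lists may carry a trailing comma, so empty items are skipped.
    block_sizes = [int(item) for item in fields[10].split(",") if item]
    block_starts = [int(item) for item in fields[11].split(",") if item]
    blocks: List[Tuple[int, int]] = [
        (chrom_start + start, chrom_start + start + size)  # 0-based, end excluded
        for start, size in zip(block_starts, block_sizes)
    ]
    return {
        "chrom": fields[0],
        "start": chrom_start,
        "end": chrom_end,
        "name": fields[3],
        "strand": fields[5],
        "length": chrom_end - chrom_start,  # no +1 needed with half-open intervals
        "blocks": blocks,
    }


feature = parse_bed12("chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601")
print(feature["length"], feature["blocks"])
```

Because chromEnd is excluded, the feature length is simply chromEnd minus chromStart, and the last block ends exactly at chromEnd.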

WIG format (coverage format)
  Wiggle format (WIG) allows the display of continuous-valued data in a track format
  Variable step:
    variableStep chrom=chr2
    300701 12.5
    300702 12.5
    300703 12.5
    300704 12.5
    300705 12.5
  is equivalent to:
    variableStep chrom=chr2 span=5
    300701 12.5
  Fixed step:
    fixedStep chrom=chr3 start=400601 step=100
    11
    22
    33
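The equivalence of the two variable-step forms can be verified by expanding both to per-base values. The sketch below handles only variableStep sections and assumes 1-based positions, as used by wiggle files. The function name is illustrative.

```python
from typing import Dict, Iterable, Optional, Tuple


def expand_variable_step(lines: Iterable[str]) -> Dict[Tuple[str, int], float]:
    """Map (chrom, 1-based position) to value for variableStep data."""
    values: Dict[Tuple[str, int], float] = {}
    chrom: Optional[str] = None
    span = 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.lower().startswith("variablestep"):
            # Declaration line: key=value parameters follow the keyword.
            parameters = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = parameters["chrom"]
            span = int(parameters.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before a variableStep declaration")
        position, value = line.split()
        for offset in range(span):  # span repeats the value over consecutive bases
            values[(chrom, int(position) + offset)] = float(value)
    return values


per_base = expand_variable_step([
    "variableStep chrom=chr2",
    "300701 12.5", "300702 12.5", "300703 12.5", "300704 12.5", "300705 12.5",
])
with_span = expand_variable_step(["variableStep chrom=chr2 span=5", "300701 12.5"])
print(per_base == with_span)
```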

Data Repositories
  Short Read Archive (fastq) [discontinued!] http://www.ncbi.nlm.nih.gov/sra
  European Nucleotide Archive
  Gene Expression Omnibus (bed, wig, fastq) http://www.ncbi.nlm.nih.gov/geo/

SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics http://seqanswers.com/

Genome-independent transcriptome reconstruction: accuracy and coverage
  Grabherr et al. Nature Biotechnology, July 2011