Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects. Rickard Sandberg, Assistant Professor, Department of Cell and Molecular Biology, Karolinska Institutet. May 2013

Standard sequence library generation

Illumina Sequencing Technology

Illumina (Solexa) Sequencing

Illumina paired-end and index-read sequencing

Once sequenced, the problem becomes computational
Computational analysis is the bottleneck.
Rapid improvement in sequencing.
Most projects still need customized analysis.

Overview of computational analyses
Primary analyses: image analysis, base calling
Mapping (assembly)
Data-type-specific analyses (e.g. peak calling, calculating expression)
Custom project-specific analyses
[Figure: genome sequence (assembled contig), RNA-Seq (expression levels), ChIP-Seq (peak calling)]

Preliminary analyses
Real Time Analysis turns the raw images (TB) into text files (GB) holding sequences and quality scores.
Platform-specific analysis using the vendor's programs.

Sequenced reads
Fasta file (the header line is the read identifier):
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
Fastq file (the last line holds the quality scores):
@HWI-EAS269:1:120:1786:18#0/1
GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG
+HWI-EAS269:1:120:1786:18#0/1
aaab^_u_aa [ U [ _Z ] a `WU_^X `GT^_ \ TM^ ^ \ \ Z \ YQVVXUBBBB
SOLiD csfasta file:
>1_39_146_F3
T22100200202311030112002022222002021
>1_39_194_F3
T11022322003020303320012223122202221
SOLiD QV file:
>1_39_146_F3
14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
>1_39_194_F3
26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14

Phred Quality Score, Q
Each base call has an estimate of the probability of being wrong (error probability, p).
Q = -10 * log10(p)

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
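The relation between Q and p can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of any sequencing toolkit.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p) to get the error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score Q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print error probability and base call accuracy for the scores in the table.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q={q}: p={p:g}, accuracy={100 * (1 - p):.3f} %")
```

Each increase of 10 in Q corresponds to a tenfold lower error probability.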

FastQ encodings
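The slide gives only the topic, so as background: Sanger and Illumina 1.8+ FASTQ files store each score as the ASCII character with code Q + 33, while Illumina 1.3 to 1.7 files use Q + 64. The Python sketch below decodes a quality string under that assumption. The helper names are hypothetical, and the offset guess is only a heuristic for non-empty strings.

```python
def decode_qualities(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer scores for a given ASCII offset."""
    scores = [ord(ch) - offset for ch in quality_string]
    if scores and min(scores) < 0:
        raise ValueError("quality character below the chosen offset")
    return scores


def guess_offset(quality_string: str) -> int:
    """Heuristic: characters below ';' (ASCII 59) cannot occur with offset 64."""
    return 33 if min(ord(ch) for ch in quality_string) < 59 else 64


print(decode_qualities("BBBB", offset=64))  # low-quality tail under offset 64
print(guess_offset("#9;-7+2@4"))            # contains '#', so offset 33
```

A string made only of high characters is ambiguous, so in practice the offset is best taken from the platform and pipeline version rather than guessed.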

Fastq quality control (FastQC). Video tutorial: http://www.youtube.com/watch?v=bz93reov87y

Quality scores for each sequence position

Quality scores for each sequence position: A good run

GC for reads

Percent A,C,G,T at each position

Relative enrichment of kmers

Short read assembly
Velvet and SOAPdenovo: de novo genomic assemblers specially designed for short-read sequencing technologies. Nature 2009

Two principal approaches for transcriptome reconstruction

Genome-independent transcriptome reconstruction. Default k = 25. Grabherr et al. Nature Biotechnology, July 2011
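The slide gives only the default k. Assuming it refers to the k-mer length into which reads are decomposed before graph construction, the decomposition step can be sketched in Python as below. This shows k-mer counting only, not the assembler's actual implementation, and the function name is hypothetical.

```python
from collections import Counter


def count_kmers(reads, k=25):
    """Count every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        # A read of length L contributes L - k + 1 k-mers (none if L < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input with k=3 for readability.
print(count_kmers(["ACGTA", "CGTAC"], k=3))
```

With k = 25, a 25-nt read contributes exactly one k-mer and shorter reads contribute none.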

Finding novel non-annotated genes or transcript variants

Mapping of millions of short reads
Task: map millions of short sequences (25-100 nt) onto a genome (3 000 Mbp) or transcriptome.
Mismatches (sequencing errors and SNPs)
Unique / repetitive matches
Indels (normal variation, CNVs)
Large rearrangements (translocations)
BLAST and BLAT were not designed for these tasks.

Mapping of RNA-Seq reads. STAR. Garber et al. 2011 Nat Methods

Mapping of splice junctions
Exon n GTAAGT-----------AG Exon n+1
1. Compile sets of junctions
2. Map reads towards genome + junction compilation
Genome (chromosome Fasta files) + known and putative splice junctions (Fasta file)

TopHat, first method
Identify candidate exons (A, B, C) via genomic mapping.
Generate possible pairings of exons (A-B, A-C, B-C).
Align unmappable reads to the possible junctions (A-B, A-C, B-C).
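The pairing step can be illustrated with a small Python sketch. It is a simplified illustration of the idea on the slide, not TopHat's implementation: the exon sequences are invented toy data and the flank length is an arbitrary assumption.

```python
from itertools import combinations


def junction_candidates(exons, flank=4):
    """Build candidate junction sequences for every ordered pair of exons.

    exons: list of (name, sequence) tuples in genomic order.
    Each candidate joins the last `flank` bases of the upstream exon
    to the first `flank` bases of the downstream exon.
    """
    candidates = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(exons, 2):
        candidates[f"{name_a}-{name_b}"] = seq_a[-flank:] + seq_b[:flank]
    return candidates


# Toy exons (hypothetical sequences).
exons = [("A", "AAAACCCC"), ("B", "GGGGTTTT"), ("C", "ACGTACGT")]
for name, seq in junction_candidates(exons).items():
    print(name, seq)
```

Reads that failed to map to the genome would then be aligned against these short junction sequences.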

Longer reads
By segmenting the long reads, and mapping the segments independently, we can look harder for junctions we might have missed with shorter reads.
>HWI-EAS229_75_30DY0AAXX:7:1:0:949
GATGTTCTCAGTGTCC GATGTAATCAGTGTCC AACCCTCTCAGTGTCC
Running time independent of intron size
Very long (100Kb+) intron
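Segmenting a read is simple to sketch in Python. The 48-nt input below is just the slide's three 16-nt segments joined together and used as toy data; the function name and the fixed segment length are assumptions for illustration.

```python
def split_read(read, segment_length=16):
    """Split a read into consecutive segments, keeping each segment's offset."""
    return [
        (start, read[start:start + segment_length])
        for start in range(0, len(read), segment_length)
    ]


read = "GATGTTCTCAGTGTCC" "GATGTAATCAGTGTCC" "AACCCTCTCAGTGTCC"
for offset, segment in split_read(read):
    print(offset, segment)
```

Keeping the offset with each segment makes it possible to tell, after independent mapping, whether neighbouring segments land far apart on the genome, which would suggest an intron between them. The final segment is shorter when the read length is not a multiple of the segment length.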

Mapping to transcriptome
[Figure: gene structure (5' UTR, exons, introns, 3' UTR) on genomic DNA (W and C strands); transcription gives pre-mRNA; RNA processing (splicing, polyadenylation) gives mRNA with a poly(A) tail]

Microexons and junction coverage
2 or more splice junctions within the same read.
[Figure: in-house mapping vs. TopHat mapping]
Different read lengths will have different problems!

Example of STAR-aligned single-cell RNA-Seq data
Mapping speed: 308 M reads / hour
% uniquely mapping: 60
% multimapping: 25
% unmapped: 15
281 719 splice junctions
279 356 with GT/AG
2 123 with GC/AG
215 with AT/AC

Storing mapped alignments
Formats for storing alignments should include:
genomic coordinates
mismatches, insertions, deletions etc.
quality information

Samtools
Sequence Alignment Map (SAM)
Generic alignment format
Supports long and short reads
Human readable, flexible and compact
Emerging standard
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
http://samtools.sourceforge.net/

SAM Example
HWI-EAS269:1:114:1242:1582#0 16 chrY 616000 255 22M731N28M * 0 0 ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC #9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB NM:i:0 XS:A:-
Field 2 (16): bit field, where 16 means reverse strand
Field 4 (616000): start position
Field 6 (22M731N28M): alignment structure. Here: 22 aligned bases, then 731 bases intron, then 28 aligned bases
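A SAM alignment line is tab-separated, so the fields annotated on the slide can be pulled out with a few lines of Python. This is a minimal sketch, not a replacement for samtools: the SamRecord type and function names are illustrative, only some of the 11 mandatory fields are kept, and the record is the slide's example with the spacing artifacts removed.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (11 mandatory fields plus optional tags)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    tags = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)  # e.g. NM:i:0
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10], tags)


def is_reverse_strand(record: SamRecord) -> bool:
    """Bit 16 (0x10) of the flag marks a read aligned to the reverse strand."""
    return bool(record.flag & 16)


line = "\t".join([
    "HWI-EAS269:1:114:1242:1582#0", "16", "chrY", "616000", "255",
    "22M731N28M", "*", "0", "0",
    "ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC",
    "#9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB",
    "NM:i:0", "XS:A:-",
])
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```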

CIGAR Format
M, match/mismatch
I, insertion
D, deletion
S, softclip
...
Ref:  GCATTCAGATGCAGTACGC
Read:   cctcag--gcagtagtg
Pos: 5
CIGAR: 2S4M2D6M3S
Example of an ungapped alignment: 50M
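A CIGAR string is a sequence of length and operation pairs, so it can be split with a regular expression. The Python sketch below computes how many reference bases an alignment spans and how many read bases it accounts for. The function names are illustrative, and the inputs are the two CIGAR strings that appear elsewhere in these slides (22M731N28M and 50M).

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(ops) -> int:
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in ops if op in "MDN=X")


def query_length(ops) -> int:
    """Read bases accounted for: M, I, S, = and X consume the read."""
    return sum(n for n, op in ops if op in "MIS=X")


for cigar in ("22M731N28M", "50M"):
    ops = parse_cigar(cigar)
    print(cigar, ops, reference_span(ops), query_length(ops))
```

For 22M731N28M the read is 50 bases long but spans 781 reference bases, because the 731-base intron (N) consumes reference positions without consuming read bases.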

Samtools for SAM/BAM files
Library and software package (C, Java)
Creating, sorting, indexing SAM & BAM
Visualizing alignments on the command line
SNP calling
Short indel detection
BAM (binary representation of SAM), ~25% file size reduction

Read mapping statistics, e.g. using RSeQC (package)
[Figures: density of reads vs. GC content (%); nucleotide frequency (A, T, G, C) vs. position of read]

Read mapping statistics: read mapping across genes
[Figure: read number vs. percentile of gene body (5'->3')]

Read mapping statistics: splicing junctions
complete_novel 9%, partial_novel 2%, known 89%

Read mapping statistics: duplicate and unique reads
[Figure: number of reads (log10) and reads % vs. frequency; series: sequence base, mapping base]

Read mapping statistics: q values on mapped reads
[Figure: Phred quality score vs. position of read]

Visualization
Integrative Genomics Viewer (Broad Inst.)
Custom tracks at UCSC Genome Browser

Peak characteristics differ with signal
H3K4me3: sharp promoter peaks
H3K36me3: broad transcription elongation signal

Important file formats
Sequences: FastQ
Aligned reads: SAM/BAM
Genome annotations: Bed, Gff
Coverage: Wig, (Tdf)
http://genome.ucsc.edu/faq/faqformat.html

BED format
chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000
http://genome.ucsc.edu/faq/faqformat.html

BED continued
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
strand - Defines the strand - either '+' or '-'.
thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
blockCount - The number of blocks (exons) in the BED line.
blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
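Because block starts are relative to the feature start and coordinates are 0-based with an excluded end, turning a 12-column BED line into genomic block (exon) intervals takes a little arithmetic. The Python sketch below does this for the example line, written with whitespace-separated fields as in the UCSC format description; the function name is hypothetical.

```python
def bed_blocks(line: str):
    """Return (chrom, start, end) for each block of a 12-column BED line.

    Coordinates stay 0-based and half-open, as in BED: the end is not included.
    """
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom = fields[0]
    feature_start = int(fields[1])
    block_count = int(fields[9])
    # The lists may carry a trailing comma, e.g. "433,399,".
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("block list lengths do not match the block count")
    return [
        (chrom, feature_start + offset, feature_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]


line = "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601"
print(bed_blocks(line))
```

The last block ends exactly at the feature end (6000), which is a useful consistency check on any BED12 line.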

WIG format (coverage format)
Wiggle format (WIG) allows the display of continuous-valued data in a track format.
Variable step:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5
Fixed step:
fixedStep chrom=chr3 start=400601 step=100
11
22
33
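The equivalence of the two variable-step forms can be verified by expanding each into per-base values. The Python sketch below handles only variable-step sections, using the UCSC spelling variableStep for the declaration line; it is an illustration with a hypothetical function name, not a full WIG parser.

```python
def expand_variable_step(lines):
    """Expand variableStep WIG lines into (chrom, position, value) per base."""
    chrom = None
    span = 1
    result = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("variableStep"):
            # Declaration line: key=value parameters such as chrom and span.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before a variableStep declaration")
        start_text, value_text = line.split()
        start, value = int(start_text), float(value_text)
        # Each data line covers `span` consecutive bases with the same value.
        for position in range(start, start + span):
            result.append((chrom, position, value))
    return result


per_base = [
    "variableStep chrom=chr2",
    "300701 12.5", "300702 12.5", "300703 12.5", "300704 12.5", "300705 12.5",
]
with_span = ["variableStep chrom=chr2 span=5", "300701 12.5"]
print(expand_variable_step(per_base) == expand_variable_step(with_span))
```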

Data Repositories
Short Read Archive (fastq) [discontinued!]: http://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive
Gene Expression Omnibus (bed, wig, fastq): http://www.ncbi.nlm.nih.gov/geo/

SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics http://seqanswers.com/

Genome-independent transcriptome reconstruction: accuracy and coverage. Grabherr et al. Nature Biotechnology, July 2011