RNA-seq Data Analysis

Similar documents
10/06/2014. RNA-Seq analysis. With reference assembly. Cormier Alexandre, PhD student UMR8227, Algal Genetics Group

VM origin. Okeanos: Image Trinity_U16 (upgrade to Ubuntu16.04, thanks to Alexandros Dimopoulos) X2go: LXDE

Galaxy for Next Generation Sequencing 初探次世代序列分析平台 蘇聖堯 2013/9/12

RNA Seq: Methods and Applica6ons. Prat Thiru

RNA-Seq Software, Tools, and Workflows

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010

RNAseq Differential Gene Expression Analysis Report

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Gene Expression analysis with RNA-Seq data

Applications of short-read

Introduction to RNA-Seq

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

Introduction to RNA-Seq

How to deal with your RNA-seq data?

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

RNA-Seq Module 2 From QC to differential gene expression.

NGS part 2: applications. Tobias Österlund

RNA-Sequencing analysis

Sanger vs Next-Gen Sequencing

Transcriptome analysis

Session 8. Differential gene expression analysis using RNAseq data

Course Presentation. Ignacio Medina Presentation

Long and short/small RNA-seq data analysis

SO YOU WANT TO DO A: RNA-SEQ EXPERIMENT MATT SETTLES, PHD UNIVERSITY OF CALIFORNIA, DAVIS

Bioinformatics in next generation sequencing projects

Quantifying gene expression

RNA-Seq Analysis. Simon Andrews, Laura v

Eucalyptus gene assembly

Introduction of RNA-Seq Analysis

Differential gene expression analysis using RNA-seq

Analysis of RNA-seq Data. Feb 8, 2017 Peikai CHEN (PHD)

RNA-Seq de novo assembly training

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Sequence Analysis 2RNA-Seq

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

RNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Sequence Based Function Annotation

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Introduction to RNA sequencing

Next Generation Sequencing

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

De novo assembly in RNA-seq analysis.

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

NGS Data Analysis and Galaxy

Transcriptome Assembly and Evaluation, using Sequencing Quality Control (SEQC) Data

Bioinformatics Monthly Workshop Series. Speaker: Fan Gao, Ph.D Bioinformatics Resource Office The Picower Institute for Learning and Memory

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Why QC? Next-Generation Sequencing: Quality Control. Illumina data format. Fastq format:

1. Introduction Gene regulation Genomics and genome analyses

Intermediate RNA-Seq Tips, Tricks and Non-Human Organisms

Introduction to RNA-Seq in GeneSpring NGS Software

Differential gene expression analysis using RNA-seq

ChIP-seq analysis 2/28/2018

Next-Generation Sequencing: Quality Control

Post-assembly Data Analysis

Experimental Design. Dr. Matthew L. Settles. Genome Center University of California, Davis

Mapping strategies for sequence reads

RNA-Seq with the Tuxedo Suite

Transcriptional analysis of b-cells post flu vaccination using rna sequencing

Introduction to NGS analyses

RNA-Seq Tutorial 1. Kevin Silverstein, Ying Zhang Research Informatics Solutions, MSI October 18, 2016

Reference genomes and common file formats

Form for publishing your article on BiotechArticles.com this document to

SCALABLE, REPRODUCIBLE RNA-Seq

RNA-Seq data analysis course September 7-9, 2015

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

RNA-sequencing. Next Generation sequencing analysis Anne-Mette Bjerregaard. Center for biological sequence analysis (CBS)

RNA- seq data analysis tutorial. Andrea Sboner

RNA-Seq analysis workshop

Total RNA isola-on End Repair of double- stranded cdna

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

RNA Sequencing: Experimental Planning and Data Analysis. Nadia Atallah September 12, 2018

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

Short Read Alignment to a Reference Genome

Analysis of RNA-seq Data

Single Cell Transcriptomics scrnaseq

02 Agenda Item 03 Agenda Item

CBC Data Therapy. Metatranscriptomics Discussion

DATA FORMATS AND QUALITY CONTROL

Wheat CAP Gene Expression with RNA-Seq

RNA-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

ChIP-seq and RNA-seq


Combined final report: genome and transcriptome assemblies

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Statistical Genomics and Bioinformatics Workshop. Genetic Association and RNA-Seq Studies

Next Generation Sequencing. Tobias Österlund

Post-assembly Data Analysis

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

Galaxy Platform For NGS Data Analyses

Reference genomes and common file formats

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

Zika infected human samples

ChIP-seq and RNA-seq. Farhat Habib

From reads to results: differential. Alicia Oshlack Head of Bioinformatics

!!!! The Genome Access Course. RNA-Seq Data Analysis!!!!!!!!!!!!

Transcription:

Lecture 3. Clustering; Function/Pathway Enrichment analysis RNA-seq Data Analysis Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University Lecture 1. Map RNA-seq read to genome Lecture 2. Quantification, normalization of gene expression & detection of differentially expressed gene detection

RNA-seq Experiment mrna AAAAA AAAAA AAAAA cdna Fragments (100 to 500 bp) Illumina Sequencer can reads end(s) of each fragments Read: ACTGGACCTAGACAATG

Experimental design: single-end vs pairedend Single-end Paired-end cdna Fragment

Experimental design: stranded vs unstranded Stranded Un-stranded Gene 5 3 Reads of opposite direction come from another embedded gene

Experimental design: read length (50 bp, 100 bp, 150 bp, ) Long sequence reads could read into the adapter: 50 bp 5 3 5 150 bp 3 Adapter sequence

Experiment design 1: for quantification of gene expression Read length: 50 to 100 bp Paired vs single ends: Single end Number of reads: >5 million per sample Replicates: 3 replicates

Experiment design 2: for RNA-seq without reference genome Read length: 100-150 bp (longer reads are not always better, because of severe sequencing bias with longer fragments.) Paired-end & stranded reads; Higher read depth is necessary (Normally pooled from multiple samples)

Limitation of RNA-seq 1. Sequencing bias mrna AAAAA AAAAA cdna Fragments AAAAA Not random Reads with variable depth at different regions of the gene

Read-depth are not even across the same gene

Read-depth are not even across the same gene Consequence 1: batch effect (different batches or different protocals could have different way of sequencing bias)

Read-depth are not even across the same gene Consequence 2: RNA-seq is for comparing same gene across different samples, not for comparing different genes in the same sample

Complications in read mapping: 1. Reads from homologous genes; Gene A Gene B Read 2. Assignment to correct splicing isoform; Gene Isoform 1 Isoform 2 Reads

RNA-seq Data Analysis

Data analysis procedures Step 1. Check quality of the reads (optional); Step 2. Map reads to gene; Step 3. Estimate expression level by counting reads per gene.

Step 1. Quality Control (QC) using FASTQC Software 1. Sequencing quality score

Step 2. Map reads to genome using TOPHAT Software Alignment of genomic sequencing vs RNA-seq Cole Trapnell & Steven L Salzberg, Nature Biotechnology 27, 455-457 (2009)

Diagnose low mapping rate 1. Low quality reads or reads with adapters * Trimming tools (FASTX, Trimmomatic, et al.) 2. Contamination? fastq_species_detector (Available on BioHPC Lab. It identifies species for reads by blast against Genbank) * Trimming is not needed in majority of RNA-seq experiments except for de novo assembly

fastq_species_detector A BioHPC tool for detecting contaminants Commands: mkdir /workdir/my_db cp /shared_data/genome_db/blast_ncbi/nt* /workdir/my_db cp /shared_data/genome_db/blast_ncbi/taxdb.* /workdir/my_db /programs/fastq_species_detector/fastq_species_detector.sh my_file.fastq.gz /workdir/my_db Sample output: Read distribution over species: Species #Reads %Reads -------------------------------------- Drosophila melagaster 254 35.234 Cyprinus carpio 74 10.529 Triticum aestivum 12 2.059 Microtus ochrogaster 3 1.765 Dyella jiangningensis 3 1.765

About the files 1. Reference genome (FASTA) 2. FASTQ 3. GFF3/GTF 4. SAM/BAM >chr1 TTCTAGGTCTGCGATATTTCCTGCCTATCCATTTTGTTAACTCTTCAATG CATTCCACAAATACCTAAGTATTCTTTAATAATGGTGGTTTTTTTTTTTT TTTGCATCTATGAAGTTTTTTCAAATTCTTTTTAAGTGACAAAACTTGTA CATGTGTATCGCTCAATATTTCTAGTCGACAGCACTGCTTTCGAGAATGT AAACCGTGCACTCCCAGGAAAATGCAGACACAGCACGCCTCTTTGGGACC GCGGTTTATACTTTCGAAGTGCTCGGAGCCCTTCCTCCAGACCGTTCTCC CACACCCCGCTCCAGGGTCTCTCCCGGAGTTACAAGCCTCGCTGTAGGCC CCGGGAACCCAACGCGGTGTCAGAGAAGTGGGGTCCCCTACGAGGGACCA GGAGCTCCGGGCGGGCAGCAGCTGCGGAAGAGCCGCGCGAGGCTTCCCAG AACCCGGCAGGGGCGGGAAGACGCAGGAGTGGGGAGGCGGAACCGGGACC CCGCAGAGCCCGGGTCCCTGCGCCCCACAAGCCTTGGCTTCCCTGCTAGG GCCGGGCAAGGCCGGGTGCAGGGCGCGGCTCCAGGGAGGAAGCTCCGGGG CGAGCCCAAGACGCCTCCCGGGCGGTCGGGGCCCAGCGGCGGCGTTCGCA GTGGAGCCGGGCACCGGGCAGCGGCCGCGGAACACCAGCTTGGCGCAGGC TTCTCGGTCAGGAACGGTCCCGGGCCTCCCGCCCGCCTCCCTCCAGCCCC TCCGGGTCCCCTACTTCGCCCCGCCAGGCCCCCACGACCCTACTTCCCGC GGCCCCGGACGCCTCCTCACCTGCGAGCCGCCCTCCCGGAAGCTCCCGCC GCCGCTTCCGCTCTGCCGGAGCCGCTGGGTCCTAGCCCCGCCGCCCCCAG TCCGCCCGCGCCTCCGGGTCCTAACGCCGCCGCTCGCCCTCCACTGCGCC CTCCCCGAGCGCGGCTCCAGGACCCCGTCGACCCGGAGCGCTGTCCTGTC GGGCCGAGTCGCGGGCCTGGGCACGGAACTCACGCTCACTCCGAGCTCCC GACGTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCT ACCCGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGT

About the files 1. FASTA 2. RNA-seq data (FASTQ) 3. GFF3/GTF 4. SAM/BAM @HWUSI-EAS525:2:1:13336:1129#0/1 GTTGGAGCCGGCGAGCGGGACAAGGCCCTTGTCCA + ccacacccacccccccccc[[cccc_ccaccbbb_ @HWUSI-EAS525:2:1:14101:1126#0/1 GCCGGGACAGCGTGTTGGTTGGCGCGCGGTCCCTC + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWUSI-EAS525:2:1:15408:1129#0/1 CGGCCTCATTCTTGGCCAGGTTCTGGTCCAGCGAG + cghhchhgchehhdffccgdgh]gcchhcahwcea @HWUSI-EAS525:2:1:15457:1127#0/1 CGGAGGCCCCCGCTCCTCTCCCCCGCGCCCGCGCC + ^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWUSI-EAS525:2:1:15941:1125#0/1 TTGGGCCCTCCTGATTTCATCGGTTCTGAAGGCTG + SUIF\_XYWW]VaOZZZ\V\bYbb_]ZXTZbbb_b @HWUSI-EAS525:2:1:16426:1127#0/1 GCCCGTCCTTAGAGGCTAGGGGACCTGCCCGCCGG

About the files 1. FASTA 2. RNA-seq data (FASTQ) 3. GFF3/GTF 4. SAM/BAM @HWUSI-EAS525:2:1:13336:1129#0/1 GTTGGAGCCGGCGAGCGGGACAAGGCCCTTGTCCA + ccacacccacccccccccc[[cccc_ccaccbbb_ @HWUSI-EAS525:2:1:14101:1126#0/1 GCCGGGACAGCGTGTTGGTTGGCGCGCGGTCCCTC + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWUSI-EAS525:2:1:15408:1129#0/1 CGGCCTCATTCTTGGCCAGGTTCTGGTCCAGCGAG + cghhchhgchehhdffccgdgh]gcchhcahwcea @HWUSI-EAS525:2:1:15457:1127#0/1 CGGAGGCCCCCGCTCCTCTCCCCCGCGCCCGCGCC + ^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB @HWUSI-EAS525:2:1:15941:1125#0/1 TTGGGCCCTCCTGATTTCATCGGTTCTGAAGGCTG + SUIF\_XYWW]VaOZZZ\V\bYbb_]ZXTZbbb_b @HWUSI-EAS525:2:1:16426:1127#0/1 GCCCGTCCTTAGAGGCTAGGGGACCTGCCCGCCGG Single-end data: one file per sample Paired-end data: two files per sample

About the files 1. FASTA 2. FASTQ 3. Annotation (GFF3/GTF) 4. SAM/BAM chr12 unknown exon 96066054 96067770. +. gene_id "PGAM1P5"; gene_name "PGAM1P5"; transcript_id "NR_077225"; tss_id "TSS14770"; chr12 unknown CDS 96076483 96076598. - 1 gene_id "NTN4"; gene_name "NTN4"; p_id "P12149"; transcript_id "NM_021229"; tss_id "TSS6395"; chr12 unknown exon 96076483 96076598. -. gene_id "NTN4"; gene_name "NTN4"; p_id "P12149"; transcript_id "NM_021229"; tss_id "TSS6395"; chr12 unknown CDS 96077274 96077487. - 2 gene_id "NTN4"; gene_name "NTN4"; p_id "P12149"; transcript_id "NM_021229"; tss_id "TSS6395";... This file can be opened in Excel

About the files 1. FASTA 2. FASTQ 3. GFF3/GTF 4. Alignment (SAM/BAM) HWUSI-EAS525_0042_FC:6:23:10200:18582#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT agafgfaffcfdf[fdcffcggggccfdffagggg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 HWUSI-EAS525_0042_FC:3:28:18734:20197#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hghhghhhhhhhhhhhhhhhhhghhhhhghhfhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 HWUSI-EAS525_0042_FC:3:94:1587:14299#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hfhghhhhhhhhhhghhhhhhhhhhhhhhhhhhhg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 D3B4KKQ1:227:D0NE9ACXX:3:1305:14212:73591 0 1 11 40 51M * 0 0 GCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTGA CCCFFFFFFGFFHHJGIHHJJJFGGJJGIIIIIIGJJJJJJJJJIJJJJJE MD:Z:51 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 HWUSI-EAS525_0038_FC:5:35:11725:5663#0/1 16 1 11 40 35M * 0 0 GCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTC hhehhhhhhhhhghghhhhhhhhhhhhhhhhhhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0

Running TOPHAT Required files Reference genome. (FASTA file indexed with bowtie2- build software) RNA-seq data files. (FASTQ files) Optional files Annotation file * (GFF3 or GTF) * If not provided, TOPHAT will try to predict splicing sites;

Running TOPHAT tophat -G myannot.gff3 mygenome mydata.fastq.gz Some extra parameters --no-novel : only using splicing sites in gff/gtf file -N : mismatches per read (default: 2) -g: max number of multi-hits (default: 20) -p : number of CPU cores (BioHPC lab general: 8) -o: output directory * TOPHAT manual: http://ccb.jhu.edu/software/tophat/manual.shtml

Running TOPHAT tophat -G myannot.gff3 mygenome mydata.fastq.gz Some extra parameters --no-novel : only using splicing sites in gff/gtf file -N : mismatches per read (default: 2) -g: max number of multi-hits (default: 20) -p : number of CPU use this cores parameter; (BioHPC lab general: 8) -o: output directory In majority of the cases, it is recommended to Tophat is very slow without this option. You might want to use an alternative aligner like STAR. * TOPHAT manual: http://ccb.jhu.edu/software/tophat/manual.shtml

What you get from TOPHAT A BAM file per sample File name: accepted_hits.bam Alignment statistics File name: align_summary.txt Input: 9230201 Mapped: 7991618 (86.6% of input) of these: 1772635 (22.2%) have multiple alignments (2210 have >20) 86.6% overall read alignment rate.

. Alternative aligner: STAR Much faster to map spliced reads; Trim (adapter, et al) & align Dobin et al. (2015) Bioinformatics 29 (1): 15-21

GSNAP: A SNP-tolerant alignment * It is critical for allele species expression analysis. GSNAP limit to single splicing per read. Thomas D. Wu, and Serban Nacu Bioinformatics 2010;26:873-881

Visualizing BAM files with IGV * Before using IGV, the BAM files need to be indexed with samtools index, which creates a.bai file.

Exercise 1 Run TOPHAT to align RNA-seq reads to genome; Visualize TOPHAT results with IGV; Learn to use Linux shell script to create a pipeline;

BioHPC Lab office hours Time: Office: 1-3 pm, every Monday & Thursday 618 Rhodes Hall Sign-up: https://cbsu.tc.cornell.edu/lab/office1.aspx General bioinformatics consultation/training is provided; Available throughout the year;