Post-assembly Data Analysis

Similar documents
Post-assembly Data Analysis

Supplementary Materials for De-novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity

De novo metatranscriptome assembly and coral gene expression profile of Montipora capitata with growth anomaly

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Quantifying gene expression

Sequence Analysis 2RNA-Seq

NGS part 2: applications. Tobias Österlund

RNA-sequencing. Next Generation sequencing analysis Anne-Mette Bjerregaard. Center for biological sequence analysis (CBS)

Transcriptome analysis

Applications of short-read

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

VM origin. Okeanos: Image Trinity_U16 (upgrade to Ubuntu16.04, thanks to Alexandros Dimopoulos) X2go: LXDE

Eucalyptus gene assembly

Standard Data Analysis Report Agilent Gene Expression Service

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Sequence Based Function Annotation

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

RNA-seq Data Analysis

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

10/06/2014. RNA-Seq analysis. With reference assembly. Cormier Alexandre, PhD student UMR8227, Algal Genetics Group

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Assessing De-Novo Transcriptome Assemblies

CBC Data Therapy. Metatranscriptomics Discussion

RNA-Seq Analysis. Simon Andrews, Laura v

1. Introduction Gene regulation Genomics and genome analyses

Analysis of RNA-seq Data. Feb 8, 2017 Peikai CHEN (PHD)

Next Generation Sequencing

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Shuji Shigenobu. April 3, 2013 Illumina Webinar Series

Introduction of RNA-Seq Analysis

RNA-Seq analysis using R: Differential expression and transcriptome assembly

RNA-Seq with the Tuxedo Suite

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

ChIP-seq and RNA-seq. Farhat Habib

RNASEQ WITHOUT A REFERENCE

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

measuring gene expression December 11, 2018

Final exam: Introduction to Bioinformatics and Genomics DUE: Friday June 29 th at 4:00 pm

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

RNA Sequencing Analyses & Mapping Uncertainty

RNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford

COMPUTATIONAL PREDICTION AND CHARACTERIZATION OF A TRANSCRIPTOME USING CASSAVA (MANIHOT ESCULENTA) RNA-SEQ DATA

Differential gene expression analysis using RNA-seq

Combined final report: genome and transcriptome assemblies

Transcription Start Sites Project Report

Result Tables The Result Table, which indicates chromosomal positions and annotated gene names, promoter regions and CpG islands, is the best way for

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

MicroRNA sequencing (mirnaseq)

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

RNA

Differential gene expression analysis using RNA-seq

Analysis of RNA-seq Data. Bernard Pereira

Introduction to RNA sequencing

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

RNA Seq: Methods and Applica6ons. Prat Thiru

RNAseq Differential Gene Expression Analysis Report

NGS Data Analysis and Galaxy

measuring gene expression December 5, 2017

From reads to results: differential. Alicia Oshlack Head of Bioinformatics

RNA-Seq Software, Tools, and Workflows

Finding Genes, Building Search Strategies and Visiting a Gene Page

Finding Genes, Building Search Strategies and Visiting a Gene Page

RNA-Seq analysis workshop

From assembled genome to annotated genome

Corset: enabling differential gene expression analysis for de novo assembled transcriptomes

A fuzzy method for RNA-Seq differential expression analysis in presence of multireads

Galaxy Platform For NGS Data Analyses

RNA-Seq Analysis. August Strand Genomics, Inc All rights reserved.

Functional profiling of metagenomic short reads: How complex are complex microbial communities?

HUMkvhE [Transcriptome Resequencing Report]

SCALABLE, REPRODUCIBLE RNA-Seq

Analysis of Microarray Data

Chimp Sequence Annotation: Region 2_3

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Bioinformatics for Cell Biologists

ChIP-seq and RNA-seq

Total RNA isola-on End Repair of double- stranded cdna

Exercise on Microarray data analysis

De novo transcript sequence reconstruction from RNA- Seq: reference generation and analysis with Trinity

RNA Sequencing: Experimental Planning and Data Analysis. Nadia Atallah September 12, 2018

Exercises (Multiple sequence alignment, profile search)

Annotation of a Drosophila Gene

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

Introduction to RNA-Seq

Small Exon Finder User Guide

FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

RNA-SEQUENCING ANALYSIS

Analysis of Microarray Data

Differential expression analysis for sequencing count data. Simon Anders

Introduction to RNAseq Analysis. Milena Kraus Apr 18, 2016

Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6

TUTORIAL. Revised in Apr 2015

RNA-seq differential expression analysis. bioconnector.org/workshops

Exercises: Analysing ChIP-Seq data

Introduction to RNA-Seq

RNA-Sequencing analysis

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Transcription:

Assembled transcriptome Post-assembly Data Analysis Quantification: get expression for each gene in each sample Genes differentially expressed between samples Clustering/network analysis Identifying over-represented functional categories in DE genes Evaluation of the quality of the assembly

Part 1. Abundance estimation using RSEM Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts.

Map reads to genome (TOPHAT) vs Map reads to Transcriptome (BOWTIE) What is easier? No issues with alignment across splicing junctions. What is more difficult? The same reads could be repeatedly aligned to different splicing isoforms, and paralogousgenes.

RSEM assign ambiguous reads based on unique reads mapped to the same transcript. Red& yellow: unique regions Blue: regions shared between two transcripts Transcript 1 Transcript 2 Hass et al. Nature Protocols 8:1494

Trinity provides a script for calling BOWTIE and RSEM align_and_estimate_abundance.pl align_and_estimate_abundance.pl \ --transcripts Trinity.fasta \ --seqtype fq \ --left sequence_1.fastq.gz \ --right sequence_2.fastq.gz \ --SS_lib_type RF \ --aln_method bowtie \ --est_method RSEM \ --thread_count 4 \ --trinity_mode \ --output_prefix tis1rep1 \

Parameters for align_and_estimate_abundance.pl --aln_method: alignment method Default: bowtie. Alignment file from other aligner might not be supported. --est_method: abundance estimation method Default : RSEM, slightly more accurate. Optional: express, faster and less RAM required. --thread_count: number of threads --trinity_mode: the input reference is from Trinity. Non-trinity reference requires a gene-isoform mapping file (--gene_trans_map).

Output files from RSEM (two files per sample) *.isoforms.results table transcript_id gene_id length effective_ expected_ length count TPM FPKM IsoPct gene1_isoform1 gene1 2169 2004.97 22.1 3.63 3.93 92.08 gene1_isoform2 gene1 2170 2005.97 1.9 0.31 0.34 7.92 *.genes.results table gene_id transcript_id(s) length effective_ expected_ length count TPM FPKM gene1 gene1_isoform1,gene1_isoform 2 2169.1 2005.04 24 3.94 4.27

Output files from RSEM (two files per sample) *.isoforms.results table transcript_id gene_id length effective_ expected_ length count TPM FPKM IsoPct gene1_isoform1 gene1 2169 2004.97 22.1 3.63 3.93 92.08 gene1_isoform2 gene1 2170 2005.97 1.9 0.31 0.34 7.92 gene_id *.genes.results table sum transcript_id(s) length effective_ expected_ length count TPM FPKM Percentage of an isoform in a gene gene1 gene1_isoform1,gene1_isoform 2 2169.1 2005.04 24 3.94 4.27

Filtering transcriptome reference based on RSEM filter_fasta_by_rsem_values.pl filter_fasta_by_rsem_values.pl \ --rsem_output=s1.isoforms.results,s2.isoforms.results \ --fasta=trinity.fasta \ --output=trinity.filtered.fasta \ --isopct_cutoff=5 \ --fpkm_cutoff=10 \ --tpm_cutoff=10 \ Can be filtered by multiple RSEM files simultaneously. (criteria met in any one file will be filtered )

Part 2. Differentially Expressed Genes with 3 biological replicatesin each species % of replicates 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 20 Expression level Distribution of Expression Level of A Gene Condition 1 Condition 2

Use EdgeRor DESeqto identify DE genes Reported for each gene: Normalized read count (Log2 count-per-million); Fold change between biological conditions (Log2 fold) Q(FDR). Using FDR, fold change and read count values to filter: E.g. Log2(fold) >2 or <-2 FDR < 0.01 Log2(counts) > 2

Combine multiple-sample RSEM results into a matrix abundance_estimates_to_matrix.pl \ --est_method RSEM \ s1.genes.results \ s2.genes.results \ s3.genes.results \ s4.genes.results \ --out_prefix mystudy Output file: mystudy.counts.matrix s1 s2 s3 s4 TR10295 c0_g1 0 0 0 0 TR17714 c6_g5 51 70 122 169 TR23703 c1_g1 0 0 0 0 TR16168 c0_g2 1 1 1 0 TR16372 c0_g1 10 4 65 91 TR12445 c0_g3 0 0 0 0

Use run_de_analysis.pl script to call edger or DESeq Make a condition-sample mapping file contition1 s1 condition1 condition2 condition2 s2 s3 s4 Run script run_de_analysis.pl run_de_analysis.pl \ --matrix mystudy.counts.matrix \ --samples_file mysamples \ --method edger \ --min_rowsum_counts 10 \ --output edger_results Skip genes with summed read counts less than 10

Output files from run_de_analysis.pl logfc logcpm PValue FDR TR8841 c0_g2 13.28489 7.805577 3.64E-20 1.05E-15 TR14584 c0_g2-12.9853 7.507117 2.82E-19 4.04E-15 TR16945 c0_g1-13.3028 7.823384 4.21E-18 3.11E-14 TR15899 c3_g2 9.485185 7.708629 5.35E-18 3.11E-14 TR8034 c0_g1-10.1468 7.836908 5.42E-18 3.11E-14 TR3434 c0_g1-12.909 7.431009 1.16E-17 5.55E-14 Filtering (done in R or Excel): FDR: False-discovery-rate of DE detection (0.05 or below) logfc: fold change log2(fc) logcpm: average expression level (count-per-million)

Part 3. Clustering/Network Analysis Heat map and Hierarchical clustering K-means clustering

Clustering tools in Trinity package # Heap map analyze_diff_expr.pl \ --matrix../mystudy.tmm.fpkm.matrix \ --samples../mysamples \ -P 1e-3 -C 2 \ # K means clustering define_clusters_by_cutting_tree.pl \ -K 6 -R cluster_results.matrix.rdata Output: Images in PDF format; Gene list for each group from K-means analysis;

Part 4. Functional Category Enrichment Analysis Gene list (from DE or clustering) TR14584 c0_g2 TR16945 c0_g1 TR8034 c0_g1 TR3434 c0_g1 TR13135 c1_g2 TR15586 c2_g4 TR14584 c0_g1 TR20065 c2_g3 TR1780 c0_g1 TR25746 c0_g2 GO ID GO enrichment analysis Enriched GO categories over represented pvalue FDR GO term GO:0044456 3.40E-10 3.78E-06synapse part regulation of synaptic GO:0050804 1.49E-09 8.29E-06transmission GO:0007268 5.64E-09 2.09E-05synaptic transmission acetylcholine-activated cationselective channel GO:0004889 1.00E-07 0.00012394 activity GO:0005215 1.23E-070.000136471transporter activity

Function annotation 1. Predict open-reading-frame(dna -> protein sequence) TransDecoder-t Trinity.fasta--workdirtransDecoder-S 2. Predict gene function from sequences Trinotate package BLAST Uniprot: homologs in known proteins PFAM: protein domain) SignalP: signal peptide) TMHMM: trans-membrane domain RNAMMER: rrna Output: go_annotations.txt (Gene Ontology annotation file) Step-by-step guidance: https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=143#c

Enrichment analysis Use the geneidsin the first column run_goseq.pl \ --genes_single_factor DE.filtered.txt \ --GO_assignments go_annotations.txt \ --lengths genes.lengths.txt GO annotation file from Trinotate 3 rd column from RSEM *.genes.results file

Part 5. Evaluation of Assembly Quality Compare the Drosophila yakubaassemblyusedin the exercise to Drosophila melanogaster proteins from flybase #hit_pct_cov_bin count_in_bin >bin_below 100 5165 5165 90 1293 6458 80 1324 7782 70 1434 9216 60 1634 10850 50 1385 12235 40 1260 13495 30 1091 14586 20 992 15578 10 507 16085

BLAST against D. melanogaster protein database Blastx \ -query Trinity.fasta \ -db melanogaster.pep.all.fa \ -out blastx.outfmt6 -evalue1e-20 -num_threads 6 \ Analysis the blast results and generate histogram analyze_blastplus_tophit_coverage.pl \ blastx.outfmt6 \ Trinity.fasta \ melanogaster.pep.all.fa \