Post-assembly Data Analysis

Similar documents
Post-assembly Data Analysis

Supplementary Materials for De-novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity

De novo metatranscriptome assembly and coral gene expression profile of Montipora capitata with growth anomaly

Sequence Based Function Annotation

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

From assembled genome to annotated genome

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

NGS part 2: applications. Tobias Österlund

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Quantifying gene expression

Applications of short-read

RNA-sequencing. Next Generation sequencing analysis Anne-Mette Bjerregaard. Center for biological sequence analysis (CBS)

Transcriptome analysis

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

Sequence Analysis 2RNA-Seq

Genome Annotation - 2. Qi Sun Bioinformatics Facility Cornell University

Transcription Start Sites Project Report

10/06/2014. RNA-Seq analysis. With reference assembly. Cormier Alexandre, PhD student UMR8227, Algal Genetics Group

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

VM origin. Okeanos: Image Trinity_U16 (upgrade to Ubuntu16.04, thanks to Alexandros Dimopoulos) X2go: LXDE

RNASEQ WITHOUT A REFERENCE

Shuji Shigenobu. April 3, 2013 Illumina Webinar Series

RNA-seq Data Analysis

CBC Data Therapy. Metatranscriptomics Discussion

RNA Sequencing Analyses & Mapping Uncertainty

1. Introduction Gene regulation Genomics and genome analyses

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Eucalyptus gene assembly

Why learn sequence database searching? Searching Molecular Databases with BLAST

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Assessing De-Novo Transcriptome Assemblies

Exercises (Multiple sequence alignment, profile search)

Standard Data Analysis Report Agilent Gene Expression Service

Prokaryotic Annotation Pipeline SOP HGSC, Baylor College of Medicine

RNA-Seq Analysis. Simon Andrews, Laura v

RNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford

Introduction to RNA sequencing

Annotation of contig62 from Drosophila elegans Dot Chromosome

Annotation of Contig8 Sakura Oyama Dr. Elgin, Dr. Shaffer, Dr. Bednarski Bio 434W May 2, 2016

RNA-Seq with the Tuxedo Suite

Analysis of RNA-seq Data. Feb 8, 2017 Peikai CHEN (PHD)

COMPUTATIONAL PREDICTION AND CHARACTERIZATION OF A TRANSCRIPTOME USING CASSAVA (MANIHOT ESCULENTA) RNA-SEQ DATA

ChIP-seq and RNA-seq. Farhat Habib

Introduction of RNA-Seq Analysis

Finding Genes, Building Search Strategies and Visiting a Gene Page

Finding Genes, Building Search Strategies and Visiting a Gene Page

Transcriptome Profiling and Long Non-Coding Rna Identification in Grapevine

Combined final report: genome and transcriptome assemblies

Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro

Next Generation Sequencing

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

Annotation of a Drosophila Gene

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Small Exon Finder User Guide

RNA-Seq analysis using R: Differential expression and transcriptome assembly

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

From reads to results: differential. Alicia Oshlack Head of Bioinformatics

RNA

Chimp Sequence Annotation: Region 2_3

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Differential gene expression analysis using RNA-seq

Drosophila ficusphila F element

RNA-Sequencing analysis

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

RNA-seq analysis pipeline

Galaxy Platform For NGS Data Analyses

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

De novo transcript sequence reconstruction from RNA- Seq: reference generation and analysis with Trinity

Necklace: combining reference and assembled transcriptomes for more comprehensive RNA-Seq analysis

A fuzzy method for RNA-Seq differential expression analysis in presence of multireads

FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

Analysis of RNA-seq Data. Bernard Pereira

Textbook Reading Guidelines

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Differential gene expression analysis using RNA-seq

RNA-Seq Now What? BIS180L Professor Maloof May 24, 2018

Rapid Transcriptome Characterization for a nonmodel organism using 454 pyrosequencing

MicroRNA sequencing (mirnaseq)

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

I AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador

measuring gene expression December 11, 2018

RNA Seq: Methods and Applica6ons. Prat Thiru

SCALABLE, REPRODUCIBLE RNA-Seq

Corset: enabling differential gene expression analysis for de novo assembled transcriptomes

RNA-SEQUENCING ANALYSIS

RNA-Seq Software, Tools, and Workflows

Annotating Fosmid 14p24 of D. Virilis chromosome 4

FUNCTIONAL BIOINFORMATICS

Exercise I, Sequence Analysis

ChIP-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

Final exam: Introduction to Bioinformatics and Genomics DUE: Friday June 29 th at 4:00 pm

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Transcription:

Assembled transcriptome Post-assembly Data Analysis Quantification: the expression level of each gene in each sample DE genes: genes differentially expressed between samples Clustering/network analysis Identifying over-represented functional categories in DE genes Evaluation of the quality of the assembly

Part 1. Abundance estimation using RSEM Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts.

Map reads to genome (TOPHAT) vs Map reads to Transcriptome (BOWTIE) What is easier? No issues with alignment across splicing junctions. What is more difficult? The same reads could be repeatedly aligned to different splicing isoforms, and paralogous genes.

RSEM assign ambiguous reads based on unique reads mapped to the same transcript. Red & yellow: unique regions Blue: regions shared between two transcripts Transcript 1 Transcript 2 Hass et al. Nature Protocols 8:1494

Trinity provides a script for calling BOWTIE and RSEM align_and_estimate_abundance.pl align_and_estimate_abundance.pl \ --transcripts Trinity.fasta \ --seqtype fq \ --left sequence_1.fastq.gz \ --right sequence_2.fastq.gz \ --SS_lib_type RF \ --aln_method bowtie \ --est_method RSEM \ --thread_count 4 \ --trinity_mode \ --output_prefix tis1rep1 \

Parameters for align_and_estimate_abundance.pl --aln_method: alignment method Default: bowtie. Alignment file from other aligner might not be supported. --est_method: abundance estimation method Default : RSEM, slightly more accurate. Optional: express, faster and less RAM required. --thread_count: number of threads --trinity_mode: the input reference is from Trinity. Non-trinity reference requires a gene-isoform mapping file (--gene_trans_map).

Output files from RSEM (two files per sample) *.isoforms.results table transcript_id gene_id length effective_ expected_ length count TPM FPKM IsoPct gene1_isoform1 gene1 2169 2004.97 22.1 3.63 3.93 92.08 gene1_isoform2 gene1 2170 2005.97 1.9 0.31 0.34 7.92 *.genes.results table gene_id transcript_id(s) length effective_ expected_ length count TPM FPKM gene1 gene1_isoform1,gene1_isoform 2 2169.1 2005.04 24 3.94 4.27

Output files from RSEM (two files per sample) *.isoforms.results table transcript_id gene_id length effective_ expected_ length count TPM FPKM IsoPct gene1_isoform1 gene1 2169 2004.97 22.1 3.63 3.93 92.08 gene1_isoform2 gene1 2170 2005.97 1.9 0.31 0.34 7.92 gene_id *.genes.results table sum transcript_id(s) length effective_ expected_ length count TPM Percentage of an isoform in a gene FPKM gene1 gene1_isoform1,gene1_isoform 2 2169.1 2005.04 24 3.94 4.27

Filtering transcriptome reference based on RSEM filter_fasta_by_rsem_values.pl filter_fasta_by_rsem_values.pl \ --rsem_output=s1.isoforms.results,s2.isoforms.results \ --fasta=trinity.fasta \ --output=trinity.filtered.fasta \ --isopct_cutoff=5 \ --fpkm_cutoff=10 \ --tpm_cutoff=10 \ Can be filtered by multiple RSEM files simultaneously. (criteria met in any one filter will be filtered )

Part 2. Differentially Expressed Genes with 3 biological replicates in each species % of replicates 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 20 Expression level Distribution of Expression Level of A Gene Condition 1 Condition 2

Use EdgeR or DESeq to identify DE genes Reported for each gene: Normalized read count (Log2 count-per-million); Fold change between biological conditions (Log2 fold) Q(FDR). Using FDR, fold change and read count values to filter: E.g. Log2(fold) >2 or <-2 FDR < 0.01 Log2(counts) > 2

Combine multiple-sample RSEM results into a matrix abundance_estimates_to_matrix.pl \ --est_method RSEM \ s1.genes.results \ s2.genes.results \ s3.genes.results \ s4.genes.results \ --out_prefix mystudy Output file: mystudy.counts.matrix s1 s2 s3 s4 TR10295 c0_g1 0 0 0 0 TR17714 c6_g5 51 70 122 169 TR23703 c1_g1 0 0 0 0 TR16168 c0_g2 1 1 1 0 TR16372 c0_g1 10 4 65 91 TR12445 c0_g3 0 0 0 0

Use run_de_analysis.pl script to call edger or DESeq Make a condition-sample mapping file contition1 condition1 condition2 condition2 s1 s2 s3 s4 Run script run_de_analysis.pl run_de_analysis.pl \ --matrix mystudy.counts.matrix \ --samples_file mysamples \ --method edger \ --min_rowsum_counts 10 \ --output edger_results Skip genes with summed read counts less than 10

Output files from run_de_analysis.pl logfc logcpm PValue FDR TR8841 c0_g2 13.28489 7.805577 3.64E-20 1.05E-15 TR14584 c0_g2-12.9853 7.507117 2.82E-19 4.04E-15 TR16945 c0_g1-13.3028 7.823384 4.21E-18 3.11E-14 TR15899 c3_g2 9.485185 7.708629 5.35E-18 3.11E-14 TR8034 c0_g1-10.1468 7.836908 5.42E-18 3.11E-14 TR3434 c0_g1-12.909 7.431009 1.16E-17 5.55E-14 Filtering (done in R or Excel): FDR: False-discovery-rate of DE detection (0.05 or below) logfc: fold change log2(fc) logcpm: average expression level (count-per-million)

Part 3. Clustering/Network Analysis Heat map and Hierarchical clustering K-means clustering

Clustering tools in Trinity package # Heap map analyze_diff_expr.pl \ --matrix../mystudy.tmm.fpkm.matrix \ --samples../mysamples \ -P 1e-3 -C 2 \ # K means clustering define_clusters_by_cutting_tree.pl \ -K 6 -R cluster_results.matrix.rdata Output: Images in PDF format; Gene list for each group from K-means analysis;

Part 4. Functional Category Enrichment Analysis Results from RNA-seq data analysis DE Gene list (From DEseq or EdgeR) (from DE or clustering) TR14584 c0_g2 TR16945 c0_g1 TR8034 c0_g1 TR3434 c0_g1 TR13135 c1_g2 TR15586 c2_g4 TR14584 c0_g1 TR20065 c2_g3 TR1780 c0_g1 TR25746 c0_g2 GO ID Enriched GO categories From GO enrichment analysis of DE genes over represented pvalue FDR GO term GO:0044456 3.40E-10 3.78E-06synapse part regulation of synaptic GO:0050804 1.49E-09 8.29E-06transmission GO:0007268 5.64E-09 2.09E-05synaptic transmission acetylcholine-activated cation- GO:0004889 channel activity 1.00E-07 0.00012394selective GO:0005215 1.23E-07 0.00013647 1transporter activity

Steps in RNA-seq analysis: 1. Identifying open reading frames and translate into protein sequences (optional, not needed if using BLAST2GO et al); 2. Annotate the transcripts with GO BLAST2GO TRINOTATE INTERPROSCAN 3. Enrichment analysis

Transdecoder Identify ORF from Transcript

Transdecoder Identify ORF from Transcript Long ORF Alignment to known protein sequences or motif Markov model of coding sequences (model trained on genes from the same genome)

Transdecoder Identify ORF from Transcript #identify longest ORF TransDecoder.LongOrfs -t myassembly.fa #BLAST ORF against known protein datbase blastp -query myassembly.fa.transdecoder_dir/longest_orfs.pep \ -db swissprot -max_target_seqs 1 \ -outfmt 6 -evalue 1e-5 \ -num_threads 8 > blastp.outfmt6 #Predict ORF based on a Markov model of coding sequence (trained from long ORF) TransDecoder.Predict --cpu 8 \ --retain_blastp_hits blastp.outfmt6 \ -t myassembly.fa

Function annotation Predict gene function from sequences Trinotate package BLAST Uniprot : homologs in known proteins PFAM: protein domain) SignalP: signal peptide) TMHMM: trans-membrane domain RNAMMER: rrna Output: go_annotations.txt (Gene Ontology annotation file) Step-by-step guidance: https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=143#c or Recommended BLAST2GO https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=73#c

Enrichment analysis Use the geneids in the first column run_goseq.pl \ --genes_single_factor DE.filtered.txt \ --GO_assignments go_annotations.txt \ --lengths genes.lengths.txt GO annotation file from Trinotate Results: 3 rd column from RSEM *.genes.results file GO ID over represented pvalue FDR GO term GO:0044456 3.40E-10 3.78E-06synapse part regulation of synaptic GO:0050804 1.49E-09 8.29E-06transmission GO:0007268 5.64E-09 2.09E-05synaptic transmission

Part 5. Evaluation of Assembly Quality https://github.com/trinityrnaseq/trinityrnaseq/wiki/transcriptome-assembly-quality-assessment A. Representation of known proteins Compare to SWISSPROT (% of full length) Compare to genes in a closely related species (% of full length; completeness) Compare to BUSCO core proteins (BUSCO)

BLAST against D. melanogaster protein database Blastx \ -query Trinity.fasta \ -db melanogaster.pep.all.fa \ -out blastx.outfmt6 -evalue 1e-20 -num_threads 6 \ Analysis the blast results and generate histogram analyze_blastplus_tophit_coverage.pl \ blastx.outfmt6 \ Trinity.fasta \ melanogaster.pep.all.fa \

Part 5. Evaluation of Assembly Quality Compare the Drosophila yakuba assembly used in the exercise to Drosophila melanogaster proteins from flybase #hit_pct_cov_bin count_in_bin >bin_below 100 5165 5165 90 1293 6458 80 1324 7782 70 1434 9216 60 1634 10850 50 1385 12235 40 1260 13495 30 1091 14586 20 992 15578 10 507 16085

Part 5. Evaluation of Assembly Quality https://github.com/trinityrnaseq/trinityrnaseq/wiki/transcriptome-assembly-quality-assessment A. Statistics scores Map RNA-Seq reads to the assembly (>80% of reads can be aligned) ExN50 transcript contig length N50: shortest sequence length at 50% of the genome E90N50: N50 based on transcripts representing 90% of expression data DETONATE score With or without reference sequences