RNA Sequencing Analyses & Mapping Uncertainty

Similar documents
Analysis of RNA-seq Data. Feb 8, 2017 Peikai CHEN (PHD)

RNA-Seq Analysis. Simon Andrews, Laura v

Differential gene expression analysis using RNA-seq

RNA-sequencing. Next Generation sequencing analysis Anne-Mette Bjerregaard. Center for biological sequence analysis (CBS)

Next Generation Sequencing

Introduction of RNA-Seq Analysis

RNA

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Eucalyptus gene assembly

Transcriptome analysis

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

Applications of short-read

Canadian Bioinforma3cs Workshops

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

RNAseq Differential Gene Expression Analysis Report

Sequence Analysis 2RNA-Seq

Post-assembly Data Analysis

Figure S1. Data flow of de novo genome assembly using next generation sequencing data from multiple platforms.

1. Introduction Gene regulation Genomics and genome analyses

Analysis of RNA-seq Data. Bernard Pereira

RNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

CBC Data Therapy. Metatranscriptomics Discussion

A normalization method based on variance and median adjustment for massive mrna polyadenylation data

Genomics and Transcriptomics of Spirodela polyrhiza

Introduction to RNA-Seq in GeneSpring NGS Software

SO YOU WANT TO DO A: RNA-SEQ EXPERIMENT MATT SETTLES, PHD UNIVERSITY OF CALIFORNIA, DAVIS

measuring gene expression December 11, 2018

An introduction to RNA-seq. Nicole Cloonan - 4 th July 2018 #UQWinterSchool #Bioinformatics #GroupTherapy

Parts of a standard FastQC report

RNA Sequencing: Experimental Planning and Data Analysis. Nadia Atallah September 12, 2018

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

Introduction to RNAseq Analysis. Milena Kraus Apr 18, 2016

SCALABLE, REPRODUCIBLE RNA-Seq

Analysis Datasheet Exosome RNA-seq Analysis

VM origin. Okeanos: Image Trinity_U16 (upgrade to Ubuntu16.04, thanks to Alexandros Dimopoulos) X2go: LXDE

Quantifying gene expression

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

Result Tables The Result Table, which indicates chromosomal positions and annotated gene names, promoter regions and CpG islands, is the best way for

ChIP-seq and RNA-seq. Farhat Habib

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Course Presentation. Ignacio Medina Presentation

Statistical Genomics and Bioinformatics Workshop. Genetic Association and RNA-Seq Studies

Lees J.A., Vehkala M. et al., 2016 In Review

RNA-Seq Analysis. August Strand Genomics, Inc All rights reserved.

How much sequencing do I need? Emily Crisovan Genomics Core

How much sequencing do I need? Emily Crisovan Genomics Core September 26, 2018

Analysis of Microarray Data

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

measuring gene expression December 5, 2017

Differential gene expression analysis using RNA-seq

High performance sequencing and gene expression quantification

Optimal Calculation of RNA-Seq Fold-Change Values

Genomic resources. for non-model systems

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Introduction to NGS analyses

Galaxy Platform For NGS Data Analyses

RNA-Seq analysis using R: Differential expression and transcriptome assembly

How to deal with your RNA-seq data?

Machine Learning. HMM applications in computational biology

Nature Genetics: doi: /ng Supplementary Figure 1. H3K27ac HiChIP enriches enhancer promoter-associated chromatin contacts.

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

GENOME ASSEMBLY FINAL PIPELINE AND RESULTS

Analysis of NGS data. resources. Grid computing workshop 2015 Jan Oppelt, NCBR & CEITEC MU 1 st December, 2015

Integrative Genomics 1a. Introduction

Arraygen Technologies Pvt. Ltd.

Grundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson

Analytics Behind Genomic Testing

RNA-Seq data analysis course September 7-9, 2015

Benchmarking of RNA-seq data processing pipelines using whole transcriptome qpcr expression data

Massive Analysis of cdna Ends for simultaneous Genotyping and Transcription Profiling in High Throughput

ChIP-seq and RNA-seq

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

RNA-seq Data Analysis

Post-assembly Data Analysis

Single Cell Transcriptomics scrnaseq

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Cassava. Monitoring transcriptional changes in cassava infected with South African cassava mosaic using next-generation sequencing 4/19/2013


From reads to results: differential. Alicia Oshlack Head of Bioinformatics

Accurate, Fast, and Model-Aware Transcript Expression Quantification

Analysis of Biological Sequences SPH

RNA sequencing Integra1ve Genomics module

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

Exploring and Understanding ChIP-Seq data. Simon v

RNASEQ WITHOUT A REFERENCE

Textbook Reading Guidelines

De novo metatranscriptome assembly and coral gene expression profile of Montipora capitata with growth anomaly

Analysis of Microarray Data

Bioinformatics Monthly Workshop Series. Speaker: Fan Gao, Ph.D Bioinformatics Resource Office The Picower Institute for Learning and Memory

Supplementary Information Supplementary Figures

NGS Data Analysis and Galaxy

Introduction to Algorithms in Computational Biology Lecture 1

NGS part 2: applications. Tobias Österlund

RNA-Seq Software, Tools, and Workflows

Nature Methods: doi: /nmeth Supplementary Figure 1. Pilot CrY2H-seq experiments to confirm strain and plasmid functionality.

Sanger vs Next-Gen Sequencing

Transcription:

RNA Sequencing Analyses & Mapping Uncertainty Adam McDermaid 1/26

RNA-seq Pipelines Collection of tools for analyzing raw RNA-seq data Tier 1 Quality Check Data Trimming Tier 2 Read Alignment Assembly De-novo Reference-based Tier 3 Differential Expression Analysis Functional Enrichment Analysis Quality Check (De-novo) Assembly of Transcripts Differential Expression Analysis RNA-seq Reads Read Mapping Gene Read Count Data Trimming Transcription Unit Prediction Functional Enrichment Analysis Tier 4 Clustering & Bi-clustering Network Analyses De-novo (Bi)-Clustering Network Analysis & Modeling 2/26

Visualizations for Differential Gene Expression ViDGER R package Available on GitHub Visualization of Differential Gene Expression Results Cufflinks EdgeR DESeq2 Six publication-quality figures Two matrix options Currently under review in Bioinformatics 3/26

vsboxplot Boxplots of log-fpkm values for each condition 4/26

vsscatterplot log-fpkm scatterplots between two select conditions 5/26

vsscattermatrix Matrix option to display all pairwise scatterplot comparisons Also displays pairwise correlations and FPKM densities for each condition 6/26

vsdegmatrix Displays shared differentially expressed genes for each pairwise comparison 7/26

vsmaplot Pairwise display of log fold change compared with mean expression values Colored data points indicate significant p- values and differential expression 8/26

vsvolcano Pairwise display of log fold change compared with negative log p-value 9/26

vsvolcanomatrix Matrix option for all pairwise Volcano plots 10/26

vsfourway Conditional comparison relative to chosen sample (control) 11/26

Current Pipeline Issues Pipeline tools are not optimal Read Alignment Ambiguous Reads All downstream analyses and results are affected Quality Check (De-novo) Assembly of Transcripts RNA-seq Reads Read Mapping Gene Read Count Data Trimming Transcription Unit Prediction Differential Expression Analysis Functional Enrichment Analysis De-novo (Bi)-Clustering Network Analysis & Modeling 12/26

Mapping Uncertainty Read alignment may result in multiple possible alignment locations Reads aligned to reference genome may have multiple possible alignments Due to duplicate nature of genomes, especially in plant genomes When a read can be mapped to multiple locations, mapping uncertainty occurs 13/26

Mapping Uncertainty Occurrences Plants Highly duplicative nature of genome Animals Alternative splicing Metagenomics Sequencing of entire microbial communities simultaneously Identical genes across different species Similar, mutated or evolved genes 14/26

Mapping Uncertainty in Plants 55 RNA-seq datasets 5 Distinct species 943.5 GB of data Average of 20% ambiguous data Species Uniquemapped Multimapped Unmapped Arabidopsis thaliana Diploid plants Vitis vinifera Solanum lycopersicum Solanum tuberosum Polyploid plants Triticum aestivum 77%~89% 55%~82% 49%~87% 55%~69% 62%~69% 8%~17% 10%~25% 6%~34% 18%~26% 18%~25% 2%~5% 8%~23% 5%~44% 12%~19% 9%~18% 15/26

Current Strategies Omission of ambiguous reads Underestimation Alignment to all possible locations Overestimation Best Match Proportionate assignment Best current approach Still over/underestimates 16/26

Solving Mapping Uncertainty Two step approach: Determine severity of mapping uncertainty in a given dataset Quantification using relevant factors Address issue of mapping ambiguous reads 17/26

Quantifying Mapping Uncertainty GeneQC Computational program collecting relevant information from datasets Interprets information in meaningful way to provide quantification of mapping uncertainty Maximum base-pair similarity between gene pairs Maximum proportion of shared reads between gene pairs Partitions genes into categorizations based on mapping uncertainty severity 18/26

C D GeneQC 0.5 A B C 97 19/26

D-score Allows for comparable metric of mapping uncertainty 0.07 Combines three statistics Maximum proportion of shared ambiguous reads Maximum base-pair similarity Number of gene pair interactions Normalized between 0 and 1 for each dataset i 0.18 0.65 0.84 0.95 20/26

D-score Higher D-score indicates more mapping uncertainty D-score increased alphabetically by categorization on average 21/26

Addressing Mapping Uncertainty Co-expression Modules (CEMs) Can use expression levels for known co-expressed genes (CEGs) to predict likely expression levels for the gene locations This information can be in turn used to determine which location is most likely for any particular ambiguous read Can use existing information to gain insight into the likelihood of the correct location for alignment 22/26

Heuristic Approach to Mapping Uncertainty Utilize Negative Binomial Distribution X i ~NB 2 xҧ i, s 2 i + xҧ i ҧ x i s i 2 + ҧ x i Read alignments follow such that read counts converge to CEM mean 23/26

Theoretical Approach to Mapping Uncertainty Hidden Markov Model True state is unknown Background information used for predictions CEM statistics Read alignment interactions 24/26

Projected Outcomes GeneQC Allow for quantification of mapping uncertainty over entire dataset Statistical Modeling Allow for re-alignment of ambiguous reads aligned to genes with high levels of mapping uncertainty Implement within R package for widespread use Improve reliability of all reference-based RNA-seq analyses 25/26

Acknowledgements Dr. Qin Ma Dr. Anne Fennell NSF-IOS 1546869 NSF-OIA 1355423 SDRIC 26/26