RNA spike-in controls & analysis methods for trustworthy genome-scale measurements. Sarah A. Munro, Ph.D., Genome-Scale Measurements Group. ABRF Meeting, March 29, 2015
Overview External RNA Controls Consortium (ERCC) RNA spike-in controls erccdashboard analysis tool ERCC 2.0: Building an updated suite of RNA controls
How can we have trustworthy gene expression results? We're simultaneously measuring thousands of RNA molecules in gene expression experiments. But are we getting it right?
External RNA Controls Consortium (ERCC): initiated by industry, hosted by NIST. Initiated by Janet Warrington, VP Clinical Genomics at Affymetrix. Open to all interested parties. Voluntary. More than 90 participants from industry, academia, and government, including all major microarray technology developers and other gene expression assay developers. [Figure label: spike-ins]
ERCC control sequences are in NIST Standard Reference Material 2374, a DNA sequence library. 96 unique control sequences in DNA plasmids. Controls intended to mimic mammalian mRNA. In vitro transcription to make RNA controls. NIST SRM 2374 and related data files are available directly from NIST at http://tinyurl.com/erccsrm
Making ERCC ratio mixtures with true positive and true negative ratios: NIST plasmid DNA library, then in vitro transcription to RNA transcripts, then pooling into mixtures with known abundance ratios.
Using ERCC ratio mixtures Treated (n>3) Control (n>3)
Using ERCC ratio mixtures: Treated (n>3) and Control (n>3) samples. Measurement process (multiple steps; many people & labs; takes days to weeks), then expression measures, then statistical analysis.
Example gene expression data Treated Control
Are the RNA molecule ratios statistically different across the samples? Treated Control
Evaluate technical performance with ERCC true positive and true negative ratios Treated Control
Overview External RNA Controls Consortium (ERCC) RNA spike-in controls erccdashboard analysis tool ERCC 2.0: Building an updated suite of RNA controls
Use erccdashboard to produce standard performance metrics for any experiment. R package is available from Bioconductor and the NIST GitHub site. Open source and open access for use in other analysis tools and pipelines and in commercial software.
Gauge technical performance with 4 erccdashboard figures. Developed as part of SEQC study, with ABRF partners. Technology-independent ratio performance measures. Assessed differences in performance across experiments, laboratories, and measurement processes. Munro, S. A. et al. Nature Communications 5:5125 doi: 10.1038/ncomms6125 (2014).
Ambion ERCC Ratio Mixtures: 23 controls per subpool. Design abundance spans a 2^20 range within each subpool.
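The subpool design can be sketched in a few lines. This is only an illustration: the 23 controls per subpool and the 2^20 range come from the slide, while the four nominal Mix 1 : Mix 2 ratios, the even log2 spacing, and the names `SUBPOOL_RATIOS` and `design_mixtures` are assumptions made for the example.

```python
import numpy as np

# Assumed nominal Mix 1 : Mix 2 ratios per subpool (not stated on the slide).
SUBPOOL_RATIOS = {"A": 4.0, "B": 1.0, "C": 1.0 / 1.5, "D": 0.5}
CONTROLS_PER_SUBPOOL = 23  # from the slide
LOG2_RANGE = 20            # design abundance spans 2^20 within each subpool


def design_mixtures():
    """Return {subpool: (mix1, mix2)} arrays of relative design abundances."""
    # Hypothetical even spacing on the log2 scale across the 2^20 range.
    log2_mix1 = np.linspace(0.0, LOG2_RANGE, CONTROLS_PER_SUBPOOL)
    design = {}
    for name, ratio in SUBPOOL_RATIOS.items():
        mix1 = 2.0 ** log2_mix1
        mix2 = mix1 / ratio  # fixes the Mix 1 : Mix 2 ratio for the subpool
        design[name] = (mix1, mix2)
    return design


design = design_mixtures()
```

Subpools whose ratio differs from 1 act as true positives for differential expression; the 1:1 subpool acts as the true negative set.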
Spike-in design for SEQC RNA sequencing experiments (samples, replicates for sequencing). Rat Experiment: treated and control rat RNA, biological replicates. Interlaboratory Experiment: human reference RNA samples, technical replicates.
What is the dynamic range of my experiment? [Figure: two panels. Rat Experiment: typical sequencing, ~40 million sequence reads per replicate. Interlaboratory Experiment: deep sequencing, ~260 million sequence reads per replicate. x-axis: Log2 ERCC Spike Amount (attomol nt µg^-1 total RNA); y-axis: Log2 Normalized ERCC Counts.]
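A minimal sketch of how a dynamic-range plot like this can be summarized: fit a straight line to log2 counts against log2 spike amount, and report the log2 span of the spike amounts that were detected. The function names, the detection rule, and the toy numbers are assumptions for illustration, not the erccdashboard implementation.

```python
import numpy as np


def fit_dose_response(log2_spike, log2_counts):
    """Least-squares line through log2 counts vs. log2 spike amount."""
    slope, intercept = np.polyfit(log2_spike, log2_counts, deg=1)
    return float(slope), float(intercept)


def observed_dynamic_range(log2_spike, detected):
    """Log2 span of spike amounts among the controls flagged as detected."""
    amounts = np.asarray(log2_spike, dtype=float)[np.asarray(detected, dtype=bool)]
    return float(amounts.max() - amounts.min())


# Toy data (made up): a perfectly linear response on the log2 scale.
log2_spike = np.arange(0, 21, 2, dtype=float)  # 0, 2, ..., 20
log2_counts = 0.95 * log2_spike - 3.0
detected = log2_counts > 0                     # hypothetical detection rule

slope, intercept = fit_dose_response(log2_spike, log2_counts)
dyn_range = observed_dynamic_range(log2_spike, detected)
```

A slope near 1 means counts scale proportionally with input amount. Deeper sequencing detects lower-abundance controls, which widens the observed range.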
What was the diagnostic power? [Figure: ROC curves for the Rat Experiment and the Interlaboratory Experiment. x-axis: False Positive Rate; y-axis: True Positive Rate.] Area Under the Curve (AUC) depends on the number of controls detected!
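As a rough illustration of the AUC statistic, the sketch below scores how well differential expression P-values separate true positive controls from true negative (1:1) controls. It uses the rank-based form of AUC; the function name and the toy P-values are assumptions for the example.

```python
import numpy as np


def roc_auc(pvals_true_pos, pvals_true_neg):
    """Rank-based AUC: the chance that a random true-positive control has a
    smaller P-value than a random true-negative control (ties count half)."""
    pos = np.asarray(pvals_true_pos, dtype=float)[:, None]
    neg = np.asarray(pvals_true_neg, dtype=float)[None, :]
    wins = (pos < neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))


# Toy P-values (made up) for detected controls only.
tp_pvals = [0.001, 0.01, 0.2]
tn_pvals = [0.5, 0.05, 0.9]
auc = roc_auc(tp_pvals, tn_pvals)
```

Controls that are not detected never enter either list, so two experiments can report AUC values computed over different sets of controls.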
AUC is a reasonable summary statistic, but we'd like to evaluate our diagnostic performance as a function of abundance.
Rat Experiment MA Plot. [Figure: x-axis: Log2 Normalized Average Counts; y-axis: Log2 Normalized Ratio of Counts.]
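For reference, the two MA plot coordinates can be computed as below. This sketch assumes the counts are already normalized and strictly positive, and takes the average on the count scale before the log2; the function name is hypothetical.

```python
import numpy as np


def ma_values(treated_counts, control_counts):
    """Return (M, A): log2 ratio and log2 average of normalized counts."""
    treated = np.asarray(treated_counts, dtype=float)
    control = np.asarray(control_counts, dtype=float)
    m = np.log2(treated / control)          # y-axis: log2 ratio of counts
    a = np.log2((treated + control) / 2.0)  # x-axis: log2 average counts
    return m, a


# Toy counts (made up): one 4:1 transcript and one 1:1 transcript.
m, a = ma_values([400.0, 8.0], [100.0, 8.0])
```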
LODR: Limit of Detection of Ratios. [Figure: Rat Experiment, Reference RNA; x-axis: Average Counts; y-axis: DE Test P-values.] Model P-values as a function of average signal. Find P-value threshold based on chosen false discovery rate (here FDR = 0.1; default is FDR = 0.05). Estimate LODR from the intersection of the model confidence interval upper bound and the P-value threshold.
LODR: Limit of Detection of Ratios. [Figure: Rat Experiment, Reference RNA; x-axis: Average Counts; y-axis: DE Test P-values.] LODR provides: specified confidence in the differentially expressed transcripts above LODR (90% chance of <10% FDR); guidance for experimental design (increase signal so that transcripts rise above the LODR estimate).
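The LODR steps can be sketched as follows, with loud simplifications. The P-value threshold here uses the Benjamini-Hochberg step-up rule, and the model is a straight line on the log10-log10 scale with a constant-width upper band (`z` residual standard deviations, roughly a one-sided 90% bound). Both are stand-ins chosen for the example; the slides do not specify the FDR procedure or the regression model, and the function names are hypothetical.

```python
import numpy as np


def bh_pvalue_threshold(pvals, fdr=0.05):
    """Largest P-value passing the Benjamini-Hochberg step-up rule, or None."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    passing = p <= fdr * np.arange(1, m + 1) / m
    return float(p[passing][-1]) if passing.any() else None


def lodr_estimate(avg_counts, pvals, p_threshold, z=1.2816):
    """Average counts at which the upper band of a log-log line fit
    crosses the P-value threshold; None if P-values do not fall with signal."""
    x = np.log10(np.asarray(avg_counts, dtype=float))
    y = np.log10(np.asarray(pvals, dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)
    if slope >= 0:
        return None
    resid_sd = np.std(y - (slope * x + intercept), ddof=2)
    # Solve slope * x + intercept + z * resid_sd = log10(p_threshold) for x.
    x_cross = (np.log10(p_threshold) - intercept - z * resid_sd) / slope
    return float(10.0 ** x_cross)


# Toy data (made up) for one ratio subpool: P-values fall as counts rise.
counts = [10.0, 100.0, 1000.0, 10000.0]
control_pvals = [0.1, 0.01, 0.001, 0.0001]
lodr = lodr_estimate(counts, control_pvals, p_threshold=0.01)
threshold = bh_pvalue_threshold([0.001, 0.02, 0.03, 0.8], fdr=0.1)
```

Noisier control P-values widen the upper band and push the LODR estimate to higher counts, which matches the intent of using the confidence bound and not the fitted curve itself.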
Rat Experiment MA Plot with 4:1 LODR. [Figure: x-axis: Log2 Normalized Average Counts; y-axis: Log2 Ratio of Normalized Counts.] Increased sequencing depth shifts endogenous transcript ratio measurements above LODR.
What are the LODR estimates for my experiment? [Figure: Rat Experiment and Interlaboratory Experiment panels. x-axis: Average Counts; y-axis: DE Test P-values.]
How do the endogenous samples relate to LODR? [Figure: Rat Experiment and Interlaboratory Experiment MA plots with 4:1 LODR. x-axis: Log2 Normalized Average Counts; y-axis: Log2 Ratio of Normalized Counts.]
How much technical variability & bias is there? [Figure: Rat Experiment and Interlaboratory Experiment panels. y-axis: Log2 Ratio of Normalized Counts. Annotations: Decreased Variability; Significant Ratio Bias.]
mRNA fraction differences between samples contribute to bias in ERCC ratios. [Figure: spike-in added to total RNA (mRNA and rRNA) for Sample 1 and Sample 2, before and after mRNA enrichment.] The RNA fractions are exaggerated for illustration purposes.
Dynamic Range; Diagnostic performance (AUC); LODR (Limit of Detection of Ratios); LODR & Sample Transcripts; Variability; Bias
EVALUATE REPRODUCIBILITY ACROSS LABORATORIES
Good Performance Poor Performance
Interlaboratory analysis using erccdashboard performance metrics. Lab 1-6: Illumina + poly-A selection (Illumina kit). Lab 7-9: Life Tech + poly-A selection (Life Tech kit). Lab 10-12: Illumina + ribosomal RNA depletion.
Consistent LODR across 11 of 12 labs. Diagnostic performance was consistent within and amongst measurement processes. Lab 7 was an outlier for diagnostic performance. LODR agreement with AUC. [Figure: LODR (Average Counts) by Laboratory.]
Ratio bias is highly variable amongst experiments. Ratio bias (r_m) can be attributed to mRNA fraction difference between samples (Shippy et al. 2006), where R_s = nominal subpool ratio and (E_1/E_2)_s = empirical ratio. [Figure: log(r_m) by Laboratory.] Large standard errors indicate that mRNA fraction isn't the only factor contributing to ERCC ratio bias; mRNA enrichment protocol is a factor.
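One way to read the definitions on this slide is as a multiplicative model, (E_1/E_2)_s = r_m * R_s, so that log(r_m) is the offset between empirical and nominal log ratios. That model form is an assumption here, as are the function name and the toy numbers; the sketch estimates log(r_m) and its standard error as a simple mean over controls.

```python
import numpy as np


def estimate_log_rm(nominal_ratios, empirical_ratios):
    """Mean and standard error of log(r_m) across controls, assuming
    empirical ratio = r_m * nominal ratio for every control."""
    log_bias = (np.log(np.asarray(empirical_ratios, dtype=float))
                - np.log(np.asarray(nominal_ratios, dtype=float)))
    mean = float(log_bias.mean())
    std_err = float(log_bias.std(ddof=1) / np.sqrt(log_bias.size))
    return mean, std_err


# Toy data (made up): every empirical ratio is 0.8 times its nominal ratio.
nominal = [4.0, 1.0, 1.0 / 1.5, 0.5]
empirical = [0.8 * r for r in nominal]
log_rm, log_rm_se = estimate_log_rm(nominal, empirical)
```

When individual controls deviate from a single shared offset, the standard error grows, which is the pattern the slide points to as evidence of additional, protocol-dependent bias.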
Protocol-dependent bias from poly-A selection affects ERCC controls due to short poly-A tails. Lab 1-6: ILM Poly-A. Lab 7-9: LIF Poly-A. Lab 10-12: ILM Ribo.
mRNA enrichment protocol biases vary across individual ERCCs but are consistent for a protocol
Results of the erccdashboard publication: ratio performance measures for any technology platform and any experiment (diagnostic power; novel LODR metric; technical variability & bias). Comparison across experiments. Quantification of mRNA fraction differences between samples. Show protocol-dependent bias.
Overview External RNA Controls Consortium (ERCC) RNA spike-in controls erccdashboard analysis tool ERCC 2.0: Building an updated suite of RNA controls
ERCC 2.0: A New Suite of RNA Controls. Approached by industry and academia to build new RNA controls. NIST-hosted open, public ERCC 2.0 workshop. Workshop report and presentations available: slideshare.net/ercc-workshop. All interested parties are welcome to participate: sequence contributions, interlaboratory analysis. New and improved mRNA mimics; transcript isoforms; miRNA.
New and Improved mRNA Mimics: additional controls. Expand distributions of RNA control properties: length (> 2 kb), GC content, poly-A tail length.
Transcript Isoform Controls Transcript Design Non-cognate Spike-in RNA Variants (SIRVs) developed by Lexogen Cognate sequence selection in progress Schizosaccharomyces pombe Mixture design Dynamic Range 2 4 Design Ratios < 2:1 Lukas Paul, Lexogen
Small and miRNA Controls. Needed for validation of clinical applications: Early Detection Research Network, TGen. Other applications relevant to bacterial RNA-Seq. Non-cognate miRNA controls. Include some pre-miRNA. Direct RNA control synthesis by Agilent, no need for DNA templates. Karol Thompson, FDA.
Recap External RNA Controls Consortium (ERCC) RNA spike-in controls erccdashboard analysis tool ERCC 2.0: Building an updated suite of RNA controls
Acknowledgements: All External RNA Controls Consortium participants. NIST: Marc Salit, Steve Lund, P. Scott Pine, Justin Zook, David Duewer, Jerod Parsons, Jennifer McDaniel, Margaret Klein. Empa: Matthias Roesslein. SEQC study participants. Co-authors on erccdashboard manuscript: S. P. Lund, P. S. Pine, H. Binder, D. Clevert, A. Conesa, J. Dopazo, M. Fasold, S. Hochreiter, H. Hong, N. Jafari, D. P. Kreil, P. P. Łabaj, S. Li, Y. Liao, S. M. Lin, J. Meehan, C. E. Mason, J. Santoyo-Lopez, R. A. Setterquist, L. Shi, W. Shi, G. K. Smyth, N. Stralis-Pavese, Z. Su, W. Tong, C. Wang, J. Wang, J. Xu, Z. Ye, Y. Yang, Y. Yu, & M. Salit. For more information contact: sarah.munro@nist.gov