Technical note: Molecular Index counting adjustment methods

By Jue Fan, Jennifer Tsai, Eleen Shum

1 Introduction

1.1 Overview of BD Precise assays

BD Precise assays are fast, high-throughput next-generation sequencing (NGS) library preparation kits tailored for small-quantity RNA samples, such as single cells. They use patented BD Molecular Indexing (MI) technology together with a Sample Index (SI) to label individual mRNA transcripts. During reverse transcription, the BD Precise assays apply a non-depleting pool of 6,561 MI barcodes (or 65,536 barcodes for the BD Precise whole transcriptome amplification (WTA) assay) for stochastic and unique labeling of otherwise identical mRNA transcripts to generate unique cDNA. In addition to the MIs, 96 SI barcodes identify the sample origin of each transcript according to the well position in the 96-well BD Precise plate. These uniquely barcoded primers label polyadenylated RNA transcripts in each of the 96 wells; the wells are then pooled into a single tube for multiplex PCR target enrichment and sequencing adapter attachment (BD Precise targeted assay). The resulting library is ready for NGS on compatible Illumina sequencers and sequencing kits. Each sequencing read is then processed to identify the MI, SI and target gene using a BD primary analysis pipeline (Figure 1).

While the use of MIs allows accurate mRNA counting, errors in MIs can arise during the PCR or library preparation steps of the protocol. This technical note discusses the MI errors that can arise and details new algorithms implemented in the BD primary analysis pipeline to adjust for them.

[Figure 1: schematic of the amplicon, labeled: sequencing read 1, gene amplicon of interest, Molecular Index, Sample Index, sequencing read 2.]

Figure 1. Sequencing read structure of a BD Precise targeted assay amplicon allows for identification of the origin of each mRNA and sample.

1.2 Molecular Index counting for high-input samples

The BD Precise assay is most suitable for small sample inputs, such as single cells, which allow stochastic and unique labeling of mRNAs. As the number of transcripts increases relative to the barcode pool, the percentage of MIs that are recycled to label more than one transcript of the same gene increases; this percentage can be calculated theoretically using the Poisson distribution (Figure 2). In these situations, quantifying gene expression from MI counts without Poisson correction underestimates the number of molecules initially present. At extremely high inputs, where the number of mRNAs per gene exceeds the entire collection of 6,561 barcodes (or 65,536 for WTA), Poisson correction is no longer possible: however far the copy number of a gene in a well exceeds the pool size, at most 6,561 barcodes, all saturated, can be observed. Hence, this technical note describes a new result classification that flags genes and samples that appear to have high sample input, where MI counts would likely be underestimated.

[Figure 2: plot of the percentage of unique Molecular Indexes (y axis) against the number of molecules labeled for a particular gene (x axis).]

Figure 2. Theoretical calculation of the percentage of unique Molecular Indexes used as the number of input molecules increases, when there are 6,561 MIs.

1.3 Molecular Index errors

In addition to PCR bias correction, MIs can provide a better understanding of the quality of the library preparation procedure and the sequencing data. By looking at the number of reads for the same MI of a given gene, referred to as MI depth, it is possible to detect erroneous base calls or PCR errors generated during library preparation. For example, a combination of gene, MI and SI that is represented by multiple reads is more likely to be an accurate measurement than a combination that has only a single read. When a gene has MIs with both low and high depth, the low-depth MIs are likely due to errors during library preparation or sequencing. These MI errors generally have distributions distinct from those of the true MIs (Figure 3).

This technical note details two sequential algorithms that the BD Precise assay analysis pipeline uses to remove MI errors. First, MI errors that manifest as single-base substitution errors are identified and adjusted to the true MI barcode using recursive substitution error correction (RSEC). Subsequently, other MI errors (derived from library preparation steps or sequencing base deletions) are adjusted using distribution-based error correction (DBEC).

[Figure 3: histogram of count (y axis) against MI depth (x axis) for ACTB.]

Figure 3. The depth of each MI across a plate for a highly expressed gene, ACTB, where distinct distributions can be observed between MIs that are likely errors and true MIs.
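The Poisson relationship described in section 1.2 can be illustrated with a minimal sketch. It assumes that each molecule draws one MI uniformly at random from the pool, so that the expected number of distinct MIs observed for n molecules is M(1 - exp(-n/M)), and it inverts that relationship to estimate the input from an observed MI count. The function names and the default pool size of 6,561 (the targeted assay pool) are choices made for this illustration, not the pipeline's actual implementation.

```python
import math


def expected_unique_mis(n_molecules: float, pool_size: int = 6561) -> float:
    """Expected number of distinct MIs seen when n molecules each draw one MI
    at random from the pool (Poisson approximation)."""
    return pool_size * (1.0 - math.exp(-n_molecules / pool_size))


def poisson_corrected_count(observed_mis: float, pool_size: int = 6561) -> float:
    """Invert the relationship above to estimate the number of input molecules
    from the number of distinct MIs observed."""
    if observed_mis >= pool_size:
        # Every barcode is in use, so the input can no longer be estimated.
        raise ValueError("pool is saturated; Poisson correction is not possible")
    return -pool_size * math.log(1.0 - observed_mis / pool_size)


if __name__ == "__main__":
    for n in (100, 1000, 6561, 20000):
        unique = expected_unique_mis(n)
        print(f"{n:>6} molecules -> {unique:7.1f} unique MIs "
              f"({100 * unique / n:5.1f}% unique)")
```

The percentage of unique MIs falls as the input grows, which is the trend Figure 2 describes, and the correction fails by design once the observed count reaches the pool size.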

2 Algorithm overview

[Figure 4: flowchart, labeled: Raw MI counts; Saturated if MI count > 6,557; Adjust MI by RSEC; Undersequenced if MI coverage < 4; Calculate MI depth per gene; MI count filter; Remove non-unique MI; Adjust MI by DBEC; Output.]

Figure 4. Overview of the MI correction algorithms.

2.1 Recursive substitution error correction (RSEC)

The RSEC algorithm adjusts MI errors that are derived from PCR and sequencing substitutions. These rare erroneous events are observed when examining the MI depth. For example, in adequately sequenced samples the MI depth of MIs that are likely errors is significantly lower than that of true MIs (Figure 3). In contrast, when two very similar MIs are both genuinely used during the initial Molecular Indexing (reverse transcription) step, they generally have similar MI depth and do not need to be eliminated. As sequencing depth increases, more MI errors appear; hence RSEC is crucial for adjusting the MI count of deeply sequenced barcoded libraries.

RSEC considers two factors in error correction: 1) similarity in MI sequence and 2) MI depth. For each target gene, MIs are connected when their MI sequences are within one base of each other (Hamming distance = 1). For each connection between MI x and MI y, if

MI Depth(y) > 2 × MI Depth(x) + 1

then y is the parent MI and x is the child MI. Based on this assignment, child MIs are collapsed to their parent MI (reference 2). This process is recursive until there are no more identifiable parent/child MIs for the gene (Figure 5).

[Figure 5: network of nine raw MIs for one gene (GTCAAATT, TTCAGAAA, TTCAAACT, GTCAAAAT, GTCAAAAA, TTCAAAAA, TTCAAAAT, CTCAAAAA, TTCGGACA) with their read counts; after RSEC, two MIs remain (TTCAAAAA and TTCGGACA).]

Figure 5. Example MIs going through the RSEC algorithm.
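The parent/child rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's code: connections are evaluated on the raw depths, a child with several eligible parents is assigned to the deepest one, and each MI is then followed up its chain of parents to a root that receives its reads. The function names are hypothetical, and the read counts in the example are made up (only the MI sequences are borrowed from Figure 5).

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length MI sequences."""
    return sum(x != y for x, y in zip(a, b))


def rsec_collapse(depths: dict[str, int]) -> dict[str, int]:
    """Collapse child MIs into parent MIs for one gene.

    y is a parent of x when the two MIs differ by one base and
    depth(y) > 2 * depth(x) + 1.
    """
    parent: dict[str, str] = {}
    for x, y in combinations(depths, 2):
        if hamming(x, y) != 1:
            continue
        for child, cand in ((x, y), (y, x)):
            if depths[cand] > 2 * depths[child] + 1:
                best = parent.get(child)
                # Assumption: keep the deepest eligible parent.
                if best is None or depths[cand] > depths[best]:
                    parent[child] = cand

    def root(mi: str) -> str:
        # Depth strictly increases along parent links, so this terminates.
        while mi in parent:
            mi = parent[mi]
        return mi

    collapsed: dict[str, int] = {}
    for mi, depth in depths.items():
        r = root(mi)
        collapsed[r] = collapsed.get(r, 0) + depth
    return collapsed


# Illustrative read counts only; they are not the values from Figure 5.
example = {
    "TTCAAAAA": 150, "GTCAAAAA": 60, "GTCAAAAT": 20, "GTCAAATT": 3,
    "TTCAAAAT": 4, "TTCAAACT": 1, "TTCGGACA": 88,
}
print(rsec_collapse(example))
```

In this example TTCGGACA has no neighbor within one base and stays separate, while the other six MIs chain up to TTCAAAAA.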

2.2 MI depth calculations

After RSEC, gene MI counts are evaluated to determine their suitability for further correction. First, the algorithm identifies whether the MI depth for each gene is sufficient for error correction. According to the Poisson distribution, if MI depth is less than four, any correction algorithm removes more signal MIs than error MIs. Therefore, genes with low MI depth (< 4 reads per MI) bypass the subsequent DBEC correction steps and are designated "low depth" in the output file. The algorithm also evaluates gene MIs that appear to be saturated, where there are far too many input transcripts for stochastic unique MI labeling to occur. Gene MIs that meet neither of these two decision points move forward to the DBEC algorithm and are designated "pass" in the output file. In addition, genes with an average of more than 650 MIs per well are designated "high input" because, based on the Poisson distribution, more than 5% of these MIs are recycled; these genes are still run through DBEC (Figure 2).

2.3 Distribution-based error correction (DBEC)

Unlike RSEC, the DBEC algorithm discriminates whether an MI is an error or true signal regardless of its MI sequence. While RSEC uses both MI sequence and MI depth information to correct errors, DBEC relies only on MI depth to correct non-substitution errors. Error barcodes generally have a low MI depth that is distinct from the MI depth of true barcodes; this difference can be observed in a histogram of the MI depth (Figure 3). DBEC fits two negative binomial distributions, one for MI errors with lower MI depth and one for true signal with higher MI depth, to distinguish statistically between the two (Figure 6).

[Figure 6: probability density (y axis) against reads (x axis) for ACTB, overlaid with the negative binomial mixture fit.]

Figure 6. Graph of ACTB fitted with two negative binomial distributions during DBEC; the x axis is the MI depth.

2.3.1 Removal of recycled MIs for optimal distribution fitting

The first step removes potential recycled MIs. A recycled MI is an MI sequence with the same SI that was used more than once for a given gene prior to amplification. Although these are two unique molecules that should both be counted, because they have the same MI sequence they are folded into one and treated as though one is the amplicon of the other. For a given gene, as the number of MIs detected increases, the percentage of recycled MIs increases and can be estimated. Using the Poisson distribution with rate λ_non-unique, the number of recycled MIs for well i (n_non-unique,i) is estimated from the MI recycling rate equation below, and the per-well estimates are summed over all wells:

\[
\%\ \text{non-unique MIs} = \frac{P(X > 1 \mid \lambda_{\text{non-unique}})}{P(X > 0 \mid \lambda_{\text{non-unique}})},
\qquad
n_{\text{non-unique}} = \sum_{i=1}^{n_{\text{wells}}} n_{\text{non-unique},i}
\]

The MIs with the highest depth, up to the estimated number of recycled MIs, are excluded from distribution fitting to obtain a better negative binomial fit, but are preserved for the later counting steps.
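As a rough illustration of the recycling estimate, the sketch below reads the recycling rate as P(X > 1) / P(X > 0) for a Poisson variable X, that is, the share of used MIs that labeled more than one molecule. The note does not state how the Poisson rate is obtained for each well; here it is assumed to be the rate implied by the number of MIs observed in that well for a pool of 6,561 barcodes. The function names and that derivation are assumptions of this sketch.

```python
import math


def recycled_fraction(lam: float) -> float:
    """P(X > 1) / P(X > 0) for X ~ Poisson(lam)."""
    p_zero = math.exp(-lam)
    p_one = lam * math.exp(-lam)
    return (1.0 - p_zero - p_one) / (1.0 - p_zero)


def estimate_recycled(observed_per_well: list[int], pool_size: int = 6561) -> int:
    """Sum the estimated number of recycled MIs over all wells for one gene."""
    total = 0.0
    for k in observed_per_well:
        if k == 0:
            continue  # nothing observed in this well
        if k >= pool_size:
            raise ValueError("pool is saturated in at least one well")
        # Assumption: rate implied by k distinct MIs, from P(X > 0) = k / pool_size.
        lam = -math.log(1.0 - k / pool_size)
        total += k * recycled_fraction(lam)
    return round(total)


if __name__ == "__main__":
    wells = [40, 120, 650, 0, 900]  # made-up MI counts per well
    print(estimate_recycled(wells), "MIs estimated as recycled")
```

Under these assumptions, a well with about 650 observed MIs has a recycled share close to 5%.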

2.3.2 Estimation of parameters

In order to fit two negative binomial distributions (one for error and one for true signal), two sets of starting values for parameter estimation are approximated. The error distribution is assumed to be a negative binomial with a mean and a dispersion of one. An example of how the two distributions are fitted is shown in Figure 6. If DBEC fails, MIs with a depth of one are removed as a simple error correction step.

2.3.3 Error / signal probability estimation

After distribution fitting, the signal and error distributions are referred to as NB(µ_signal, size_signal) and NB(µ_error, size_error), respectively. An MI with depth r is treated as true signal when

\[
P(X = r \mid \mu = \mu_{\text{error}},\ \text{size} = \text{size}_{\text{error}}) < P(X = r \mid \mu = \mu_{\text{signal}},\ \text{size} = \text{size}_{\text{signal}})
\]

To avoid overcorrection, if the cutoff derived for the error distribution is higher than half of the maximum read depth, only MIs with a depth of one are removed.
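The comparison of error and signal probabilities can be sketched as follows. The sketch takes already fitted parameters as input (the mixture fitting itself is not shown), defines the cutoff as the smallest depth at which the signal probability exceeds the error probability, and applies the safeguard against overcorrection. The function names and the example parameters are illustrative assumptions, not values from the pipeline.

```python
import math


def nb_pmf(r: int, mu: float, size: float) -> float:
    """Negative binomial probability of r reads in the (mu, size) parameterization."""
    log_p = (math.lgamma(r + size) - math.lgamma(size) - math.lgamma(r + 1)
             + size * math.log(size / (size + mu))
             + r * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dbec_cutoff(mu_err: float, size_err: float,
                mu_sig: float, size_sig: float, max_depth: int) -> int:
    """Smallest depth r with P(r | error) < P(r | signal)."""
    for r in range(1, max_depth + 1):
        if nb_pmf(r, mu_err, size_err) < nb_pmf(r, mu_sig, size_sig):
            return r
    return max_depth + 1


def filter_errors(depths: dict[str, int], mu_err: float, size_err: float,
                  mu_sig: float, size_sig: float) -> dict[str, int]:
    """Keep MIs at or above the cutoff, with the safeguard against overcorrection."""
    max_depth = max(depths.values())
    cutoff = dbec_cutoff(mu_err, size_err, mu_sig, size_sig, max_depth)
    if cutoff > max_depth / 2:
        cutoff = 2  # remove only MIs with a depth of one
    return {mi: d for mi, d in depths.items() if d >= cutoff}


if __name__ == "__main__":
    # Made-up depths and fitted parameters, for illustration only.
    depths = {"MI_a": 1, "MI_b": 3, "MI_c": 35, "MI_d": 50, "MI_e": 42}
    print(filter_errors(depths, mu_err=1, size_err=1, mu_sig=40, size_sig=5))
```

The error parameters in the example follow the starting assumption of section 2.3.2 (mean and dispersion of one); the signal parameters are arbitrary.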

3 Sample data: mixed Jurkat and T47D cells

[Figure 7: paired t-SNE plots for raw Molecular Index counting (left) and adjusted Molecular Index counting (right). Panel A shows the clusters Jurkat (46.9%), NTC (38.5%) and T47D (14.6%); the remaining panels show CD3E, CDH1 and PSMB4 expression as log(number of molecules per cell).]

Figure 7. t-distributed stochastic neighbor embedding (t-SNE) visualization of a BD Precise targeted assay from a 96-well plate of mixed Jurkat and breast cancer (T47D) single cells (86 genes examined). (A) Cell clusters were identified using DBSCAN with the same parameters before and after MI adjustments. (B-D) Individual marker expression scaled both by color and point size. (B) PSMB4, a housekeeping gene that is present in both cell types; after MI adjustments, the lack of PSMB4 signal in the no-template control (NTC) cluster is highlighted further. (C) CD3E, a lymphocyte marker that highlights Jurkat cell clusters. (D) CDH1, an epithelial cell marker that highlights the T47D cluster. In general, MI adjustment removes MI noise, which allows for clear differentiation of gene expression between cell clusters.

[Figure 8: heat maps for Raw MI and Adjusted MI, with cells grouped as T47D, NTC and Jurkat; a group of low-signal cells is annotated.]

Figure 8. Heat map displaying differential gene expression by MIs between the cell clusters identified in Figure 7, [Left] before any error correction steps (Raw MI) and [Right] after RSEC and DBEC correction (Adjusted MI). Genes with low expression are shown in blue, and genes with high expression are shown in orange. Genes with similar expression patterns across these cell types are clustered together. Without error correction, the NTC cluster can show noise from highly expressed genes such as CD3E and KRT8, which are Jurkat and T47D markers, respectively. Moreover, error correction reveals distinct gene expression patterns between Jurkat and T47D cells.

References

1 Fu GK, Hu J, Wang P, Fodor SP. Counting individual DNA molecules by the stochastic attachment of diverse labels. PNAS. 2011;108(22):9026-9031.

2 Smith T, Heger A, Sudbery I. UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research. 2017; doi:10.1101/gr.209601.116. [Epub ahead of print].

BD, 1 Becton Drive, Franklin Lakes, NJ 07417. bd.com

© 2017 BD. BD, the BD logo and all other trademarks are the property of Becton, Dickinson and Company. For research use only. Not for use in diagnostic or therapeutic procedures.