Technical note: Molecular Index counting adjustment methods

By Jue Fan, Jennifer Tsai, Eleen Shum

1 Introduction

1.1 Overview of BD Precise assays

BD Precise assays are fast, high-throughput, next-generation sequencing (NGS) library preparation kits tailored for small-quantity RNA samples, such as single cells. They use patented BD Molecular Indexing (MI) technology together with a Sample Index (SI) to label individual mRNA transcripts. During reverse transcription, the BD Precise assays apply a non-depleting pool of 6,561 MI barcodes (or 65,536 barcodes for the BD Precise whole transcriptome amplification assay) for stochastic and unique labeling of otherwise identical mRNA transcripts to generate unique cDNA. In addition to the MI, 96 SI barcodes identify the sample origin of each transcript according to the well position in the 96-well BD Precise plate. These uniquely barcoded primers label polyadenylated RNA transcripts in each of the 96 wells. The wells are then pooled into a single tube for multiplex PCR target enrichment and sequencing adapter attachment (BD Precise targeted assay). The resulting library is ready for NGS using compatible Illumina sequencers and sequencing kits. Each sequencing read is then processed to identify the MI, SI and target gene using a BD primary analysis pipeline (Figure 1).

While the use of MIs allows accurate mRNA counting, errors in MIs can occur during the PCR or library preparation steps of the protocol. This technical note discusses the MI errors that can arise and details new algorithms implemented by the BD primary analysis pipeline that adjust for these potential issues.

[Figure 1 diagram labels: Sequencing read 1, Gene amplicon of interest, Molecular Index, Sample Index, Sequencing read 2]

Figure 1. Sequencing read structure of a BD Precise targeted assay amplicon allows for identification of the origin of each mRNA and sample.

1.2 Molecular Index counting for high-input samples

The BD Precise assay is most suitable for small sample inputs, such as single cells, where mRNAs can be labeled stochastically and uniquely. As the number of transcripts increases relative to the barcode pool, the percentage of MIs that are recycled to label more than one transcript of the same gene increases. This percentage can be calculated theoretically using the Poisson distribution (Figure 2). In these situations, quantifying gene expression by counting MIs without a Poisson adjustment underestimates the number of molecules initially present. At extremely high inputs, where the number of mRNAs per gene exceeds the entire collection of 6,561 barcodes (or 65,536 for WTA), Poisson correction is no longer possible: however far the copy number of a particular gene in a well exceeds the pool size, at most 6,561 barcodes, all of them saturated, can be observed. Hence, this technical note discusses the new result classification that lets users flag genes and samples that appear to have high sample input, where MI counts would likely be underestimated.

[Figure 2 plot: y axis, Unique Molecular Index (%); x axis, Molecules labeled for a particular gene]

Figure 2. Theoretical percentage of unique Molecular Indexes used as the number of input molecules increases, for a pool of 6,561 MIs.

1.3 Molecular Index errors

In addition to PCR bias correction, MIs can provide a better understanding of the quality of the library preparation procedure and sequencing data. By looking at the number of reads for the same MI of a given gene, referred to as MI depth, it is possible to detect erroneous base calls or PCR errors generated during library preparation. For example, an MI that is represented by multiple reads for a given gene and SI is more likely to be an accurate measurement than an MI that has only a single read for that gene and SI.
When a gene has MIs with both low and high depth, the low-depth MIs are likely due to errors during library preparation or sequencing. These MI errors generally have a distribution that is distinct from that of the true MIs (Figure 3).

This technical note details two sequential algorithms that the BD Precise assay analysis pipeline uses to remove MI errors. First, MI errors that manifest as single-base substitution errors are identified and adjusted to the true MI barcode using recursive substitution error correction (RSEC). Subsequently, other MI errors (derived from library preparation steps or sequencing base deletions) are adjusted using distribution-based error correction (DBEC).

[Figure 3 histogram: ACTB; x axis, MI depth; y axis, Count]

Figure 3. The depth of each MI across a plate for a highly expressed gene, ACTB, where distinct distributions can be observed between MIs that are likely errors and true MIs.
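The Poisson relationship described in section 1.2 can be sketched in a few lines of Python. This is an illustrative calculation rather than the pipeline's code: the function names are hypothetical, and the sketch assumes that every molecule of a gene in a well draws one barcode uniformly at random from the pool.

```python
import math

POOL_SIZE = 6561  # MI barcodes in the BD Precise targeted assay pool


def expected_unique_mis(n_molecules, pool_size=POOL_SIZE):
    """Expected number of distinct MIs seen when n_molecules are labeled.

    Assumption: uniform random labeling, so the number of molecules per
    barcode is approximately Poisson with mean n_molecules / pool_size.
    """
    return pool_size * (1.0 - math.exp(-n_molecules / pool_size))


def poisson_adjusted_count(n_unique_mis, pool_size=POOL_SIZE):
    """Invert the relationship above: estimate input molecules from observed MIs."""
    if n_unique_mis >= pool_size:
        # Every barcode is in use, so the input can no longer be estimated.
        raise ValueError("saturated: Poisson correction is not possible")
    return -pool_size * math.log(1.0 - n_unique_mis / pool_size)


if __name__ == "__main__":
    for n in (100, 1000, 6561):
        pct = 100.0 * expected_unique_mis(n) / n
        print(f"{n} molecules -> {pct:.1f}% unique MIs")
```

The inverse function fails by design once the observed MI count reaches the pool size, which mirrors the saturation case in which no statistical correction can recover the true input.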

2 Algorithm overview

[Figure 4 flowchart labels: Raw MI counts; Adjust MI by RSEC; Calculate MI depth per gene; MI count filter (Saturated if MI count > 6,557; Undersequenced if MI coverage < 4); Remove non-unique MI; Adjust MI by DBEC; Output]

Figure 4. Overview of the MI correction algorithms.

2.1 Recursive substitution error correction (RSEC)

The RSEC algorithm adjusts MI errors that derive from PCR and sequencing substitutions. These rare erroneous events can be seen by examining MI depth: in adequately sequenced samples, the depth of MIs that are likely errors is significantly lower than that of true MIs (Figure 3). By contrast, when two very similar MIs are both used during the initial Molecular Indexing (reverse transcription) step, they generally have similar MI depth and do not need to be eliminated. As sequencing depth increases, more MI errors appear, so RSEC is crucial for adjusting the MI count of deeply sequenced barcoded libraries.

RSEC considers two factors in error correction: 1) similarity in MI sequence and 2) MI depth. For each target gene, two MIs are connected when their sequences are within one base of each other (Hamming distance = 1). For each connection between MI x and MI y, y is the parent MI and x is the child MI if:

$$\text{MI Depth}(y) > 2 \times \text{MI Depth}(x) + 1$$

Based on this assignment, child MIs are collapsed to their parent MI.² This process is repeated recursively until there are no more identifiable parent / child MIs for the gene (Figure 5).

[Figure 5 diagram: nine raw MIs for one gene (GTCAAATT, TTCAGAAA, TTCAAACT, GTCAAAAT, GTCAAAAA, TTCAAAAA, TTCAAAAT, CTCAAAAA and TTCGGACA) collapse to two RSEC MIs (TTCAAAAA and TTCGGACA). Raw MI = 9; RSEC MI = 2.]

Figure 5. Example MIs going through the RSEC algorithm.
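The parent / child collapse can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function names and the read counts in the example are hypothetical, the threshold is taken to be depth(y) > 2 × depth(x) + 1, parent / child tests use the raw depths, and a child with several qualifying parents is assigned to the deepest one.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def rsec_collapse(depths):
    """Collapse child MIs into parent MIs for one gene in one sample.

    depths maps MI sequence -> read count (MI depth).
    Returns a dict of surviving MI -> summed read count.
    """
    parent = {}
    for x, depth_x in depths.items():
        # Candidate parents: one base away and more than 2 * depth + 1 reads.
        candidates = [
            y for y, depth_y in depths.items()
            if len(y) == len(x) and hamming(x, y) == 1
            and depth_y > 2 * depth_x + 1
        ]
        if candidates:
            parent[x] = max(candidates, key=lambda y: depths[y])

    collapsed = defaultdict(int)
    for mi, reads in depths.items():
        root = mi
        # Follow the chain child -> parent -> grandparent. A parent always
        # has strictly more reads than its child, so the chain terminates.
        while root in parent:
            root = parent[root]
        collapsed[root] += reads
    return dict(collapsed)


if __name__ == "__main__":
    example = {"AAAA": 100, "AAAT": 20, "AATT": 3, "CCCC": 50, "CCCG": 40}
    print(rsec_collapse(example))
```

In the example, AATT collapses into AAAT and then into AAAA, while CCCC and CCCG both survive because their depths are too similar for either to be treated as an error of the other.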

2.2 MI depth calculations

After RSEC, gene MI counts are evaluated to determine their suitability for further correction. First, the algorithm checks whether the MI depth for each gene is sufficient for error correction. According to the Poisson distribution, if MI depth is less than four, any correction algorithm removes more signal MIs than error MIs. Therefore, genes with low MI depth (< 4 reads per MI) bypass the subsequent DBEC correction steps and are designated "low depth" in the output file. The algorithm also identifies gene MIs that appear to be saturated, where there are far too many input transcripts for stochastic, unique MI labeling to occur. Gene MIs that do not meet either of these two decision points move forward to the DBEC algorithm and are designated "pass" in the output file. In addition, genes with an average of more than 650 MIs per well are designated "high input", because >5% of these MIs are recycled according to the Poisson distribution, but they are still run through DBEC (Figure 2).

2.3 Distribution-based error correction (DBEC)

Unlike RSEC, the DBEC algorithm discriminates whether an MI is an error or true signal regardless of its MI sequence. While RSEC uses both MI sequence and MI depth information to correct errors, DBEC relies only on MI depth to correct non-substitution errors. Error barcodes generally have a low MI depth that is distinct from the MI depth of true barcodes; this difference can be observed in a histogram of MI depth (Figure 3). DBEC fits two negative binomial distributions, one for MI errors with lower MI depth and one for true signal with higher MI depth, to distinguish statistically between the two (Figure 6).

[Figure 6 plot: ACTB; x axis, reads; y axis, Probability Density; fitted curve, Negative Binomial Mixture]

Figure 6. Graph of ACTB fitted with two negative binomial distributions during DBEC; the x axis is the MI depth.

2.3.1 Removal of recycled MIs for optimal distribution fitting

The first step removes potential recycled MIs.
A recycled MI is an MI sequence with the same SI that was used more than once for a given gene prior to amplification. Although these are two unique molecules that should both be counted, they have the same MI sequence, so they are folded into one and treated as though one is the amplicon of the other. For a given gene, as the number of MIs detected increases, the percentage of recycled MIs increases and can be estimated. Using the Poisson distribution with parameter λ_non-unique, the number of recycled MIs for well i (n_non-unique,i) is estimated from the MI recycling rate equation below, where X is the number of times an MI is used:

$$\%\,\text{non-unique MIs} = \frac{P(X > 1 \mid \lambda_{\text{non-unique}})}{P(X > 0 \mid \lambda_{\text{non-unique}})}$$

$$n_{\text{non-unique}} = \sum_{i=1}^{n_{\text{wells}}} n_{\text{non-unique},i}$$

The n_non-unique MIs with the highest MI depth are excluded from distribution fitting, which gives a better negative binomial fit, but they are preserved for the later counting steps.
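One way to evaluate the recycling estimate numerically is sketched below. The source does not state how the Poisson parameter is obtained, so this sketch makes an explicit assumption: for each well, λ is chosen so that P(X > 0) equals the observed fraction of the barcode pool in use. The function names are hypothetical, and the recycled fraction is taken to be P(X > 1) / P(X > 0).

```python
import math


def recycled_fraction(lam):
    """Fraction of used MIs that were used more than once: P(X > 1) / P(X > 0)."""
    p_used = 1.0 - math.exp(-lam)               # P(X > 0)
    p_reused = p_used - lam * math.exp(-lam)    # P(X > 1) = P(X > 0) - P(X = 1)
    return p_reused / p_used


def estimate_recycled(mi_counts_per_well, pool_size=6561):
    """Sum the estimated number of recycled MIs for one gene over all wells.

    mi_counts_per_well: observed number of distinct MIs for the gene per well.
    """
    total = 0.0
    for n_mis in mi_counts_per_well:
        if n_mis == 0:
            continue  # no MIs observed, so nothing can be recycled
        if n_mis >= pool_size:
            raise ValueError("saturated well: recycling cannot be estimated")
        # Assumption: solve P(X > 0) = n_mis / pool_size for lambda.
        lam = -math.log(1.0 - n_mis / pool_size)
        total += n_mis * recycled_fraction(lam)
    return total


if __name__ == "__main__":
    print(estimate_recycled([50, 200, 650]))
```

For small λ the fraction is close to λ / 2, so recycling is negligible for lowly expressed genes and grows steadily as more of the pool is used.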

2.3.2 Estimation of parameters

To fit two negative binomial distributions (one for error and one for true signal), two sets of starting values for parameter estimation are approximated. The error distribution is assumed to be a negative binomial with mean and dispersion of one. An example of how the two distributions are fitted is shown in Figure 6. If DBEC fails, MIs with a depth of one are removed as a simple error correction step.

2.3.3 Error / signal probability estimation

After distribution fitting, the signal and error distributions are referred to as NB(µ_signal, size_signal) and NB(µ_error, size_error), respectively. An MI with depth r is treated as signal when it is less probable under the error distribution than under the signal distribution:

$$P(X = r \mid \mu = \mu_{\text{error}}, \text{size} = \text{size}_{\text{error}}) < P(X = r \mid \mu = \mu_{\text{signal}}, \text{size} = \text{size}_{\text{signal}})$$

To avoid overcorrection, if the derived cutoff for the error distribution is higher than half of the maximum read depth, only MIs with a depth of one are removed.
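The error / signal comparison can be illustrated with a small sketch that evaluates the negative binomial probability mass function in its mean and size parameterization. It does not perform the mixture fitting itself: the signal parameters in the example are hypothetical stand-ins for fitted values, the error distribution uses mean 1 and size 1 on the assumption that "dispersion" refers to the size parameter, and the function names are illustrative only.

```python
import math


def nb_log_pmf(r, mu, size):
    """log P(X = r) for a negative binomial with mean mu and size parameter."""
    return (
        math.lgamma(r + size) - math.lgamma(size) - math.lgamma(r + 1)
        + size * math.log(size / (size + mu))
        + r * math.log(mu / (size + mu))
    )


def is_signal(depth, signal, error=(1.0, 1.0)):
    """True when the depth is less probable under the error distribution.

    signal and error are (mu, size) pairs.
    """
    return nb_log_pmf(depth, *error) < nb_log_pmf(depth, *signal)


def dbec_filter(depths, signal, error=(1.0, 1.0)):
    """Keep only the MIs whose depth is classified as signal."""
    return {mi: d for mi, d in depths.items() if is_signal(d, signal, error)}


if __name__ == "__main__":
    # Hypothetical fitted signal parameters and MI depths for one gene.
    fitted_signal = (30.0, 5.0)
    depths = {"GTCAAAAA": 42, "TTCGGACA": 27, "ACGTTGCA": 1, "CAGTACGT": 2}
    print(dbec_filter(depths, fitted_signal))
```

Working in log space keeps the comparison numerically stable for deep MIs, where the raw probabilities under the error distribution become extremely small.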

3 Sample data: mixed Jurkat and T47D cells

[Figure 7 plots: paired t-SNE panels for Raw Molecular Index counting and Adjusted Molecular Index counting. Clusters are labeled Jurkat, NTC and T47D; marker panels show CD3E, CDH1 and PSMB4, each annotated with cell and molecule counts and a color scale of log(number of molecules per cell).]

Figure 7. t-distributed stochastic neighbor embedding (t-SNE) visualization of a BD Precise targeted assay from a 96-well plate of mixed Jurkat and breast cancer (T47D) single cells (86 genes examined). (A) Cell clusters were identified using DBSCAN with the same parameters before and after MI adjustments. (B-D) Individual marker expression scaled both by color and point size. (B) PSMB4, a housekeeping gene that is present in both cell types. After MI adjustments, the lack of PSMB4 signal in the no-template control (NTC) cluster is highlighted further. (C) CD3E, a lymphocyte marker that highlights the Jurkat cell cluster. (D) CDH1, an epithelial cell marker that highlights the T47D cluster. In general, MI adjustment removes MI noise, which allows for clear differentiation of gene expression between cell clusters.

[Figure 8 heat maps: Raw MI and Adjusted MI panels, each with columns grouped as T47D, NTC and Jurkat; panel label "A. Low-signal cells"]

Figure 8. Heat map displaying differential gene expression by MIs between the cell clusters identified in Figure 7 [Left] before any error correction steps (Raw MI) and [Right] after RSEC and DBEC correction (Adjusted MI). Genes with low expression are shown in blue, and genes with high expression are shown in orange. Genes with similar expression patterns across these cell types are clustered together. Without error correction, the NTC cluster can show noise from highly expressed genes such as CD3E and KRT8, which are Jurkat and T47D markers, respectively. Moreover, error correction reveals distinct gene expression patterns between Jurkat and T47D cells.

References

1 Fu GK, Hu J, Wang P, Fodor SP. Counting individual DNA molecules by the stochastic attachment of diverse labels. PNAS. 2011;108(22):9026-9031.
2 Smith T, Heger A, Sudbery I. UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research. 2017; doi:10.1101/gr.209601.116. [Epub ahead of print].

BD, 1 Becton Drive, Franklin Lakes, NJ, 07417
bd.com
© 2017 BD. BD, the BD logo and all other trademarks are the property of Becton, Dickinson and Company.
For research use only. Not for use in diagnostic or therapeutic procedures.