RNA-seq Using Next Generation Sequencing

David C Corney 1 (davec dot corney at gmail dot com), Georgeta N Basturea 2 (gbasturea at gmail dot com)
1 Thomas Jefferson University Hospitals (formerly Princeton University, until April 2014), United States. 2 Florida, USA (formerly of University of Miami Miller School of Medicine, USA)
DOI http://dx.doi.org/10.13070/mm.en.3.203
Date last modified: 2016-04-15; original version: 2013-08-27
Cite as MATER METHODS 2013;3:203

Abstract

A comprehensive review of RNA-seq methodologies.

Introduction

Next-generation sequencing is rapidly becoming the method of choice for transcriptional profiling experiments. In contrast to microarray technology, high throughput sequencing allows identification of novel transcripts, does not require a sequenced genome and circumvents the background noise associated with fluorescence quantification. Furthermore, unlike hybridization-based detection, RNA-seq allows genome-wide analysis of transcription at single nucleotide resolution, including identification of alternative splicing events and post-transcriptional RNA editing events.

All RNA-seq experiments follow a similar protocol. Total RNA is isolated from a sample of interest and, depending on the type of RNA to be profiled, may be purified to enrich for mRNAs, microRNAs or lincRNAs, etc., prior to preparing an RNA library. Library preparation may involve such steps as reverse transcription to cDNA and PCR amplification, and may or may not preserve strandedness information. Sequencing can produce one read per fragment in a single-end sequencing reaction, or two reads from opposite ends of a fragment, separated by an unsequenced stretch, in paired-end reactions. Taken together, RNA-seq has allowed an unparalleled view of the transcriptome in normal and pathological processes and has revealed that the transcriptome is significantly more complex than previously envisioned [1].
This review will examine planning, performing and analyzing an RNA-seq experiment. Briefly, this includes determining optimal sequencing depth and number of replicates and choosing a sequencing platform; preparing and sequencing libraries; and mapping reads to a genome followed by transcript quantification. Each of these steps will be reviewed in turn. First, however, let us review in greater detail the main advantages and disadvantages of RNA-seq compared to microarrays, as well as their similarities.

Advantages and disadvantages of RNA-seq compared to microarrays

Microarray-based analysis of the transcriptome is responsible for a great deal of our current understanding of spatiotemporal-specific gene expression in development and disease. Yet, even setting aside the fact that microarray analysis is limited to organisms with sequenced genomes and to detection of known transcripts, hybridization-based detection can suffer from a number of inherent weaknesses, such as poor sensitivity, low specificity and a limited dynamic range. RNA-seq reduces, and in some cases eliminates, these limitations [2]. With sufficient sequencing depth the dynamic range is, in principle, unbounded, since it is limited only by the number of reads obtained. Non-specific hybridization or cross-hybridization is a common concern when interpreting microarrays, especially when closely related gene family members with highly similar sequence are of particular interest. In this regard, RNA-seq largely eliminates ambiguity of sequence detection, but at the cost of potential ambiguity in mapping of reads to the genome, since a sequenced read may map to multiple locations. However, advances in the use of paired-end sequencing have gone a long way to address this problem, since the added information of a second read sequence and prior knowledge of the expected distance between the two reads allow more accurate mapping.

Figure 1. Overview of analysis workflow for microarray and RNA-seq transcriptional profiling. Image from Fang et al. [2] (image released under a Creative Commons Attribution License).

Although RNA-seq does not suffer from some of the disadvantages of hybridization-based detection, a number of biases unique to high throughput sequencing have been identified. One of the first steps of mRNA library preparation is fragmentation of RNA into smaller pieces to allow sequencing. This may be done by Covaris sonication or RNase III enzymatic digestion, but use of divalent cations under elevated temperature is most common. All methods introduce bias, which can be position- and/or sequence content-based [3].

Currently, all commercially available RNA-seq platforms rely on reverse transcription and PCR amplification prior to sequencing, and sequencing is therefore subject to the biases inherent to these procedures. First, annealing of random hexamer primers to fragmented RNA is not random, which results in depletion of reads at both 5' and 3' ends [3-6]. This makes identifying the true start and end of novel transcripts a challenge, and leads to underestimation of the expression level of short genes. Second, PCR can introduce bias based upon GC content and length due to non-linear amplification [7, 8]. A number of data analysis tools to correct these biases are available, although with varying degrees of success [6, 9, 10].

Figure 2. Sequence logo showing observed and expected nucleotide distribution surrounding the 5' fragmentation site. Similar biases are present at the 3' end. Image: Roberts et al. [3] (image released under a Creative Commons Attribution License).

Given that the total number of reads per transcript is proportional to the level of a transcript multiplied by transcript length, a long transcript will be sequenced more often than a short transcript when expressed at equivalent levels. Since statistical power is closely linked to sample size, a long transcript is more likely to be found differentially expressed than a short transcript [11]. To mitigate this problem, expression levels are frequently expressed by calculating the number of reads or fragments per kilobase of transcript per million mapped reads (RPKM and FPKM, respectively) [1]. The FPKM transformation also allows direct comparison of transcript expression between two libraries with different sequencing depth, as well as giving an indication of relative expression levels between two or more transcripts in a single library.

A typical RNA-seq experiment

The next generation sequencing platforms most frequently used for RNA-seq are the Illumina HiSeq, Ion Torrent and SOLiD systems. Whilst the library preparation and nucleotide detection protocols for each platform vary, all consist of the following main steps:

Preparation of total RNA. Depending on the class of RNA to be sequenced (i.e. mRNA, lincRNA, microRNA, etc.), enrichment is performed. Good quality total RNA is critical, although alternative protocols for degraded RNA exist [12].
Library preparation. Library preparation consists of:
RNA fragmentation. Unlike short RNAs, mRNAs are typically fragmented to smaller pieces of RNA to enable sequencing.
Reverse transcription. First and second strand cDNA is reverse transcribed from fragmented RNA using random hexamers or oligo(dT) primers.
Adapter ligation. The 5' and/or 3' ends of cDNA are repaired and adapters (containing sequences to allow hybridization to a flow cell) are ligated.
Library cleanup and amplification. Libraries are enriched for correctly ligated cDNA fragments and amplified by PCR to add any remaining sequencing primer sequences.

Figure 3. Read coverage over genes is biased against 3' and 5' extremities. Fragmentation was done by either RNA hydrolysis or cDNA shearing and distribution of reads plotted for small (< 1 kb; top), medium (1-8 kb; middle) and large (> 8 kb; bottom) transcripts. Image modified from Huang et al. [4].

Library quantification, quality control and sequencing.
Library concentration is assessed using qRT-PCR and/or a Bioanalyzer, after which the library is ready for sequencing.
Data analysis. Downstream data analysis consists of quality control, such as trimming of sequencing adapters and removal of reads with poor quality scores, followed by mapping of reads, analysis of differential expression, identification of novel transcripts and pathway analysis.

Experimental design

Just as for any other technique, a well-designed RNA-seq experiment consists of proper replication, randomization and blocking [13]. An all too common topic on Internet NGS user forums, such as Seqanswers (http://www.seqanswers.com/), is how to identify statistically significant differential gene expression (DGE) from an experiment without replicates. Although it is technically possible to calculate DGE without replicates, these experiments must be interpreted with extreme caution. Unless one is purely concerned with novel transcript discovery, both technical and biological replicates must be carefully considered from the outset. In the infancy of RNA-seq, technical replicates (libraries prepared from the same RNA sample) were commonly used. However, it has been shown that biological variation far outweighs technical variation, provided that coverage of at least 5 reads/nucleotide is obtained [14, 15]. Technical replicates, therefore, are most useful when the goal is to compare performance of two or more competing sequencing technologies. If the goal is to investigate differences between treatments, biological replication is essential in order to generalize the results to a larger population. The required number of replicates will vary greatly depending on the amount of biological variability associated with the samples of interest and should be empirically determined. To this end, the number of replicates used for any prior microarray analysis is usually a good starting point.

Most sequencing platforms support multiplexing of libraries by introducing barcodes during library preparation. This allows simultaneous sequencing of multiple libraries in a single sequencing run, thereby enabling more efficient use of a sequencer. Importantly, multiplexing also facilitates a balanced block design to minimize potential confounding factors such as PCR amplification and flow cell effects [13, 16]. Consider an example where three placebo-treated samples and an equal number of drug-treated samples are to be sequenced on an Illumina HiSeq 2000 instrument. The Illumina HiSeq uses a flow cell with eight lanes, one of which is usually reserved for a PhiX sequencing control used for quality control purposes. Rather than using a single lane per sample, all six libraries should be barcoded during library preparation so that the pooled libraries can be sequenced together across six lanes. Reads are demultiplexed following sequencing based upon barcode sequence and analyzed accordingly. Such a design offers insurance against one poor sequencing lane compromising the study.
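The six-library example above can be sketched in a few lines of Python. This is a minimal illustration rather than a vendor demultiplexing tool: the barcode sequences, the sample names and the assumption that the barcode occupies the first six bases of each read are chosen purely for illustration, and only exact barcode matches are accepted.

```python
from collections import Counter

# Illustrative 6-nt barcodes for three placebo and three drug-treated
# libraries (assumed values, not taken from the text).
BARCODES = {
    "ATCACG": "placebo_1",
    "CGATGT": "placebo_2",
    "TTAGGC": "placebo_3",
    "TGACCA": "drug_1",
    "ACAGTG": "drug_2",
    "GCCAAT": "drug_3",
}


def demultiplex(reads, barcodes=BARCODES, barcode_len=6):
    """Assign each read to a sample by exact match on its leading barcode.

    Returns a dict mapping sample name -> list of reads with the barcode
    removed. Reads with an unrecognized barcode go to "undetermined".
    """
    by_sample = {name: [] for name in barcodes.values()}
    by_sample["undetermined"] = []
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        by_sample[barcodes.get(tag, "undetermined")].append(insert)
    return by_sample


def reads_per_sample(lanes, failed_lanes=()):
    """Count demultiplexed reads per sample, skipping any failed lanes.

    `lanes` maps a lane identifier to the list of reads from that lane.
    """
    totals = Counter()
    for lane_id, reads in lanes.items():
        if lane_id in failed_lanes:
            continue
        for sample, inserts in demultiplex(reads).items():
            totals[sample] += len(inserts)
    return totals
```

With all six libraries pooled in every lane, passing one lane in `failed_lanes` lowers the read count of every sample by the same amount instead of removing one sample entirely, which is the insurance the balanced design provides.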
On the other hand, if an unblocked design were used and one lane were to have unacceptably strong artifacts, an entire sample would be lost and the study compromised. While offering the major advantage of removing confounding factors, a multiplexed, balanced block design also allows for re-sequencing on one or more additional lanes at a later date to increase the number of reads as needed, without introducing sample-specific biases from flow cell to flow cell variation.

Determining the correct number of sequencing reads per sample is a challenging problem that is subject to vigorous debate. When profiling the chicken transcriptome, Wang et al. showed that 30 million reads are sufficient to obtain reliable measurement of all genes in the genome, whereas 10 million reads allow detection of 80% of genes [17]. Supporting a lower number of reads as optimal for transcriptional profiling, Tarazona et al. report that increased sequencing depth results in more false positives due to increased noise [18]. Others, including the ENCODE consortium, propose that 100-200 million reads per sample or more are required, especially when novel transcripts or splicing events are of particular interest. Sequencing depth therefore remains a rapidly evolving question without a clear consensus.

Further complicating the matter is the great deal of inter-species variation in transcriptome size. Sequencing a bacterial transcriptome will require far fewer reads than are needed for a vertebrate transcriptome [19]. However, estimating transcriptome size is problematic, since genome size and transcriptome size are imperfectly correlated, and it is especially difficult for non-model organisms without a sequenced genome. For example, the genome of the laboratory mouse is 2.6 Gb and encodes ~ 25,000 protein-coding genes, whereas a similar number of genes are encoded in the 1 Gb chicken genome.
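Because libraries differ in depth and transcripts differ in length, raw read counts cannot be compared directly; the RPKM normalization introduced earlier addresses both. The following is a minimal sketch of that calculation, not the implementation used by any particular quantification tool; the function name and the example numbers are assumptions for illustration. FPKM is computed in the same way but counts fragments (read pairs) instead of individual reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads / (length in kb * library size in millions)
         = reads * 1e9 / (length in bp * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Two transcripts expressed at the same level in a 20-million-read library:
# the 4 kb transcript collects four times as many reads as the 1 kb one,
# yet both receive the same normalized value.
long_transcript = rpkm(2000, 4000, 20_000_000)
short_transcript = rpkm(500, 1000, 20_000_000)

# The same long transcript in a library sequenced twice as deeply.
long_transcript_deeper = rpkm(4000, 4000, 40_000_000)
```

In this example all three values are equal, which is the sense in which the transformation removes the length and depth effects on counts. It does not remove the underlying difference in statistical power, since the short transcript is still supported by fewer reads.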
Therefore, efforts to sequence the transcriptome of non-model organisms will particularly benefit from a small pilot study to empirically determine acceptable sequencing depth [20].

When alternative splicing is of greater interest, obtaining paired-end sequence data can be more valuable than increasing the number of reads, due to the increased probability of a splice junction falling within or between the sequenced ends [21]. An additional advantage of paired-end RNA-seq that is particularly useful when sequencing cancer transcriptomes is the opportunity to detect chimeric transcripts resulting from gene fusion events [22]. Furthermore, obtaining paired-end sequence reads allows greater certainty when an individual read can be mapped to multiple loci on the genome, particularly in repetitive regions. An online tool called Scotty allows users to design optimal sequencing depth and number of biological replicates whilst simultaneously satisfying user-defined inputs such as maximum cost and required statistical power [23].

When calculating the required sequencing depth, one should bear in mind the potential for loss of reads to undesired RNA species, chiefly ribosomal RNA (rRNA), as well as reads that cannot be mapped, since both factors can decrease the number of usable reads for downstream analysis by as much as 60-80%. A number of hybridization-based rRNA depletion approaches have been developed to enrich for less abundant species of RNA. Broadly, enrichment strategies either deplete rRNA or allow positive selection for mRNA. For eukaryote transcriptome analysis using SOLiD or Ion Torrent platforms, polyA+ selection can optionally be performed prior to library construction using, for example, either magnetic bead-conjugated oligo(dT) oligonucleotides (Dynabeads; Life Technologies) or immobilized oligo(dT) capture probes (mRNA Catcher PLUS; Life Technologies). On the other hand, Illumina TruSeq libraries utilize two rounds of magnetic bead-conjugated oligo(dT) capture for polyA+ selection, with the final polyA+ elution step also serving to fragment and prime RNA for downstream cDNA synthesis. During polyA+ enrichment for polyadenylated mRNAs, non-polyadenylated RNA species, including microRNAs, lincRNAs and other macro ncRNAs, are depleted and not represented in the resulting libraries. In contrast, rRNA depletion strategies have been shown to preserve these RNA species [4, 24].

Two of the most frequently used rRNA depletion methods are RiboMinus (Life Technologies) and Ribo-Zero (Epicentre). Both methods utilize a pool of rRNA capture probes followed by spin column or magnetic bead-based collection of the non-rRNA fraction. rRNA capture probes for multicellular eukaryotes (human/mouse compatible) as well as microorganisms (yeast/bacteria) are available from both vendors. RiboMinus may offer the extra advantage of removing mitochondrial rRNA as well as cytoplasmic rRNA [4].
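The effect of such losses on the sequencing budget is simple arithmetic, sketched below. The function name and the 30-million-read target are assumptions for illustration; the 60% and 80% figures are the loss range quoted above.

```python
def raw_reads_needed(target_usable_reads, loss_percent):
    """Raw reads required so that `target_usable_reads` remain after losing
    `loss_percent` percent of reads to rRNA and unmappable sequence."""
    if not 0 <= loss_percent < 100:
        raise ValueError("loss_percent must be in the range [0, 100)")
    kept_percent = 100 - loss_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_usable_reads * 100 // kept_percent)


# To retain 30 million usable reads per sample:
at_60_percent_loss = raw_reads_needed(30_000_000, 60)  # 75,000,000 raw reads
at_80_percent_loss = raw_reads_needed(30_000_000, 80)  # 150,000,000 raw reads
```

At the upper end of the quoted loss range, five times the target depth must be sequenced, which is why rRNA depletion or polyA+ selection usually pays for itself.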
Finally, rRNA depletion should be the method of choice when sequencing degraded RNA isolates from formalin-fixed paraffin-embedded samples, since polyA+ selection methods assume availability of high quality total RNA to enable isolation of full-length transcripts [12].

Just as rRNA contamination will significantly reduce the number of reads mapping to mRNA, so will strongly expressed mRNAs, such as housekeeping genes, reduce the number of reads mapping to weakly expressed genes. For example, 75% of reads from a human mammary epithelial cell line library map to the most abundant 7% of the transcriptome [25]. When the number of reads is at a premium, it is most useful to have them map to regions of interest. To remove transcripts corresponding to a small number of housekeeping genes, Epicentre extended the Ribo-Zero concept by designing capture oligos to remove globin mRNAs. Exome capture microarrays may also be used to increase sensitivity, although at the cost of reduced quantification accuracy [26]. In contrast, the Rinn laboratory developed RNA CaptureSeq to specifically sequence weakly expressed regions of the transcriptome [27]. Briefly, RNA is hybridized to tiling microarrays containing probes corresponding to genomic regions of interest, and the captured RNAs are eluted and sequenced. CaptureSeq allowed ~380-fold enrichment of reads mapping to targeted regions of the transcriptome compared to conventional RNA-seq without capture. Although microarray capture is conventionally used for exome sequencing or targeted re-sequencing of DNA, Levin et al. used microarrays to capture 467 cancer-related genes for targeted RNA sequencing [28]. This approach allowed identification of mutations and fusion transcripts while largely preserving transcript abundance. An important caveat, however, is that inferring a somatic DNA mutation by cDNA sequencing is problematic without careful validation by Sanger sequencing, due to widespread RNA editing [29].
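How strongly a library is dominated by its most abundant transcripts, as in the 75%-of-reads example above, can be checked from a table of per-gene read counts. The following is a minimal sketch; the function name and the toy counts are assumptions for illustration and are not data from the cited study.

```python
def fraction_of_reads_in_top_genes(counts, top_fraction):
    """Fraction of all reads assigned to the most abundant genes.

    `counts` is a list of per-gene read counts and `top_fraction` is the
    share of genes (0 < top_fraction <= 1) treated as "most abundant".
    At least one gene is always included.
    """
    if not counts or not 0 < top_fraction <= 1:
        raise ValueError("need counts and 0 < top_fraction <= 1")
    ranked = sorted(counts, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Toy library of four genes in which a single gene takes most of the reads.
toy_counts = [70, 10, 10, 10]
share = fraction_of_reads_in_top_genes(toy_counts, 0.25)
```

A large value for a small `top_fraction` indicates that depletion of abundant species or targeted capture may recover more information than simply sequencing deeper.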
Most recently, Life Technologies released a targeted RNA-seq workflow to enable targeted sequencing of over 6000 RNAs, and Illumina is planning to release an equivalent workflow soon. Although it is based on sequencing of short PCR-amplified amplicons rather than full-length transcripts of targeted RNAs, directed sequencing will provide information comparable to quantitative RT-PCR.

RNA-seq technologies

As of 2013, the three most widely used NGS platforms for RNA-seq are SOLiD and Ion Torrent, both marketed by Life Technologies, and Illumina's HiSeq. All three platforms have similar sample input requirements and sequence millions of cDNA fragments per run. Below, sample preparation and pertinent application-specific advantages and disadvantages are discussed.

Illumina

Illumina and Ion Torrent both sequence using a sequencing by synthesis (SBS) approach, whereby incorporation of dNTPs is detected simultaneously at millions of fixed positions on a flow cell [30] (Figure 4).

Figure 4. Illumina RNA library preparation. PolyA+ RNA is enriched using oligo(dT) beads followed by fragmentation and reverse transcription. The 5' and 3' ends of cDNA fragments are next prepared to allow efficient ligation of Y adapters containing a unique barcode and primer binding sites. Finally, ligated cDNAs are PCR-amplified and ready for cluster generation and sequencing. Image: David Corney.

For Illumina, once TruSeq RNA-seq libraries have been prepared they are hybridized to a flow cell, which contains a lawn of covalently bound oligonucleotides complementary to the sequencing adapters that were introduced during library preparation. Once hybridized, the capture oligonucleotide primes DNA polymerase extension, resulting in a covalently bound full-length complementary copy of the cDNA fragment. This copy is subjected to several rounds of PCR amplification to produce discrete clonal clusters ~ 1 µm in diameter that can be optically resolved during sequencing. Obtaining optimal cluster density is critical, since it determines the number of reads obtained. Low density will result in fewer reads than expected, but over-clustering can be just as problematic, since interference and overlap between adjacent clusters on dense flow cells make image analysis and accurate base calling difficult. Therefore, accurate quantification of each library using quantitative PCR is an important aspect of library quality control.

In the case of Illumina SBS, all four dNTPs are fluorescently labeled and concurrently introduced on to the flow cell (Figure 5A). Since all four dNTPs are present, natural competition for binding between dNTPs minimizes incorporation biases. SBS proceeds through multiple cycles of nucleotide incorporation and detection.
Importantly, only one nucleotide is incorporated per cycle by use of reversibly terminated dNTPs. After nucleotide incorporation is detected by fluorescence, the fluorophore is removed, resulting in regeneration of a 3' hydroxyl group which allows incorporation of the next dNTP in the subsequent cycle. Importantly, this reversible terminator chemistry allows sequencing of homopolymeric regions, such as AAAAAA, with high confidence.

Figure 5. Sequence detection methods of Illumina, Ion Torrent and SOLiD. A. Illumina detection is fluorescence-based using reversible terminator dNTPs, resulting in one nucleotide incorporation per cycle. cDNA fragments are covalently linked to a flow cell and fluorescence detected with addition of each nucleotide. B. Ion Torrent sequencing by synthesis relies on detection of hydrogen ions (H+) for base calling. Each pH detector well contains one clonally amplified cDNA fragment on a microbead. Nucleotides are added sequentially; since nucleotides are not reversibly terminated, incorporation of multiple nucleotides is detected by an increase in the number of hydrogen ions detected. C. SOLiD sequence detection is unique in that fluorescently labeled oligonucleotides are ligated rather than incorporated by a polymerase. See text for more details. Image: modified from Berglund et al. [154] (image released under a Creative Commons Attribution License).

During base calling, fluorescence intensity values for each nucleotide are converted to nucleotide identity using a cross-talk matrix which controls for spectral overlap. Since spectral overlap is determined during the first four cycles, it is imperative that approximately equal numbers of each base be present (i.e. that the library be balanced). Therefore, it is especially important to use barcodes that are well balanced to ensure accurate demultiplexing after sequencing. Likewise, use of a dedicated PhiX control lane to estimate correct spectral overlap is strongly recommended when sequencing unbalanced libraries of AT- or GC-rich genomes [31]. The error rate (incorporation of the incorrect nucleotide) progressively mounts with increasing number of cycles; currently up to 150 cycles are supported with an overall error rate of 0.2% [32].

In the first iteration of Illumina sequencing technology, sequencing of only one end of each cDNA fragment was supported. Although single-end sequencing remains a very powerful tool, in recent years paired-end sequencing has become the most frequently used approach when sequencing the transcriptome. In this case, since the flow cell contains randomly arrayed capture oligos complementary to the 5' and 3' sequencing adapters, cDNA fragments captured by their 5' adapters can also be captured by 3' capture oligos during the bridge amplification PCR step. This allows a first sequencing run of up to 150 cycles using the 5' sequencing primer to be followed by a second sequencing run using the 3' primer, to obtain a total of 300 nt of sequence per fragment.
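The requirement for balanced base composition in the early cycles can be checked for a candidate barcode set before sequencing. The sketch below is only an illustration of the idea: the function names, the example barcodes and the 10% threshold are assumptions, not values recommended by any instrument vendor.

```python
from collections import Counter


def base_composition_per_cycle(sequences):
    """Return, for each cycle (position), the fraction of A, C, G and T
    across a set of equal-length sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    composition = []
    for cycle in range(length):
        counts = Counter(seq[cycle] for seq in sequences)
        composition.append(
            {base: counts.get(base, 0) / len(sequences) for base in "ACGT"}
        )
    return composition


def unbalanced_cycles(sequences, min_fraction=0.1):
    """List the 1-based cycles in which any base falls below `min_fraction`.

    `min_fraction` is an arbitrary illustrative threshold.
    """
    return [
        index + 1
        for index, fractions in enumerate(base_composition_per_cycle(sequences))
        if min(fractions.values()) < min_fraction
    ]


# Every base is present at every position in the first set; the second set
# has no G in its final cycle.
balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
skewed = ["ACGT", "CGTA", "GTAC", "TACA"]
```

Here `unbalanced_cycles(balanced)` is empty, whereas `unbalanced_cycles(skewed)` flags cycle 4.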
Importantly, several third-party library preparation kits are commercially available that have advantages for certain experiments; for example, Smart-seq [33] reduces the required input to as little as 100 pg total RNA.

Ion Torrent

Whereas Illumina sequencing and cluster generation rely on solid-phase PCR amplification, emulsion PCR is used to prepare Ion Torrent libraries for sequencing. First, the library template is prepared from fragmented RNA. Unlike Illumina, the standard library protocol is strand-specific by default (Figure 6). Next, beads with complementary oligonucleotides are mixed with PCR reagents and a dilute solution of cDNA library, and oil is added to make an emulsion. Ideally, each microdroplet of emulsion will contain one bead and one cDNA fragment along with PCR reagents to allow for clonal amplification. Following 16-18 cycles of PCR, the emulsion is broken by organic extraction and the beads are purified and loaded on to a disposable semiconductor sequencing chip. The sequencing chip resembles a honeycomb, in that one bead fits into one of hundreds of millions of tiny wells that serve as microreactors during sequencing, each with its own detector. Unlike Illumina's fluorescence-based SBS, Ion Torrent determines sequence identity by detecting pH alterations due to hydrogen ion release following nucleotide incorporation (Figure 5B). Since the dNTPs are not differentially labeled by a fluorophore, they must be added successively so that ion release can be associated with a particular nucleotide. Since Ion Torrent sequencing is not reliant on optical detection of dNTP incorporation, sequencing reactions are much faster and the number of reads obtainable per sequencing run has been rapidly increasing.
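The successive addition of unlabeled nucleotides can be sketched as an idealized "flowgram", in which each flow of a single nucleotide yields a signal equal to the number of bases incorporated. This is a simplified model under stated assumptions: the cyclic flow order "TACG" is assumed for illustration, the input is treated as the strand being synthesized, and real signals are noisy rather than whole numbers.

```python
def ideal_flowgram(synthesized, flow_order="TACG", max_flows=400):
    """Idealized flow signals for the strand being synthesized.

    Returns a list of (nucleotide, bases_incorporated) tuples, one per flow.
    A flow that does not match the next base gives a signal of 0, and a
    homopolymer run is incorporated within a single flow.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(synthesized) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            run += 1
            position += 1
        signals.append((nucleotide, run))
        flow += 1
    return signals


def call_bases(signals):
    """Convert (nucleotide, signal) pairs to sequence by rounding each signal
    to the nearest whole number of incorporated bases."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in signals)


flows = ideal_flowgram("TTACGGG")  # four flows: T x2, A x1, C x1, G x3
```

Because run length is read from signal magnitude, `call_bases([("A", 7.4)])` yields seven adenosines while `call_bases([("A", 7.6)])` yields eight; a small measurement error is enough to change the call for long runs, whereas the difference between signals of 1 and 2 is large in relative terms.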
However, whereas Illumina makes use of reversible terminator chemistry to restrict dNTP incorporation to once per cycle and so sequence through homopolymers, Ion Torrent relies on the number of hydrogen ions released being proportional to the number of dNTPs incorporated. Therefore, A can easily be distinguished from AA by detecting a doubling in the number of hydrogen ions released. However, distinguishing between a run of 7 and 8 adenosines is far more challenging and consequently the error rate is high (1.7%) [32].

SOLiD

Ion Torrent and SOLiD RNA library preparation share the same molecular biology (Figure 6), although the adapter sequences are different. In contrast to Illumina/Ion Torrent, SOLiD uses a sequencing by ligation approach to obtain billions of reads per sequencing run, each up to 75 bp in length. First, emulsion PCR is performed and beads containing clonally amplified cDNA fragments are attached to the surface of a sequencing flow cell. Sequencing takes place during several rounds of ligation reactions. In the first round, a sequencing primer is annealed and a mixture of 16 fluorescently labeled 8-mer oligonucleotides is added (Figure 5C). The 16 oligonucleotides represent all possible combinations of the first two nucleotides (AC, AG, AT, etc.), whereas bases 3-5 are degenerate and unknown. The final three 3' bases are conjugated to one of four fluorescent labels, each with a different excitation and emission spectrum. Therefore, each fluorophore represents four dinucleotides and, in each ligation reaction, the identity of only the first two nucleotides is interrogated. As a result, after one round of ligation the identity of these two bases is narrowed down but not known. To determine their true identity, the original primer and ligated oligos are removed, a second, n-1, primer is annealed and a new round of ligation is performed. By combining knowledge from two rounds of interrogation, the identity of the first base is confirmed. The identity of the next base is confirmed using an n-2 primer, and so forth, until an n-4 primer is used. In practice, the final three 3' bases of each 8-mer oligonucleotide are cleaved after each ligation to remove the fluorophore and provide a 5' phosphate for a second ligation reaction.

Figure 6. Ion Torrent and SOLiD libraries are both prepared using similar protocols. Briefly, partly degenerate guide adapters hybridize the fragmented target RNA to allow splint ligation of 5' and 3' adapters with defined sequences. Next, cDNA is synthesized and amplified by PCR to add additional required sequences, followed by emulsion PCR on microbeads. Image: David Corney.
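The statement that each fluorophore represents four dinucleotides, and that a color alone narrows down but does not identify a base, can be made concrete with a short sketch of two-base ("color space") encoding. The sketch assumes the commonly described scheme in which bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of the two numbers; the numeric color labels and function names are conventions chosen for illustration.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Return one color (0-3) for each overlapping dinucleotide."""
    bits = [BASE_TO_BITS[base] for base in sequence]
    return [first ^ second for first, second in zip(bits, bits[1:])]


def decode_colors(first_base, colors):
    """Recover the base sequence from a known first base plus the colors.

    Each color only says how the next base relates to the previous one,
    so decoding needs an anchor base (in practice, from the adapter).
    """
    bits = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        bits ^= color
        bases.append(BITS_TO_BASE[bits])
    return "".join(bases)


def dinucleotides_for_color(color):
    """List the dinucleotides that share a given color."""
    return [
        a + b
        for a in "ACGT"
        for b in "ACGT"
        if BASE_TO_BITS[a] ^ BASE_TO_BITS[b] == color
    ]


colors = encode_colors("ATGGC")          # [3, 1, 0, 3]
recovered = decode_colors("A", colors)   # "ATGGC"
```

Under this scheme color 0 is shared by AA, CC, GG and TT, so observing it constrains the pair without identifying it. Because every base contributes to two adjacent colors, a genuine single-base difference changes two adjacent colors, whereas an isolated color change is more likely a measurement error.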
After 5-7 cycles of ligation, fluorophore detection and fluorophore cleavage, a reset is performed and the next primer (i.e. n-1) is used for another 5-7 cycles. SOLiD sequencing, therefore, has the advantage of interrogating each nucleotide twice and accordingly has reduced errors (<0.1%) during base calling, but at the cost of shorter read length.

Non-coding RNA-seq

Until now, this review has largely focused on identification and quantification of the small proportion of the transcriptome that has coding potential. However, RNA-seq has been applied to study non-coding RNAs, such as microRNAs and lincRNAs, and even used to discover a new class of non-coding circular transcripts (circRNAs) [34-36]. To gain a complete picture of the transcriptome, biologists may combine coding and non-coding RNA-seq data. In terms of experimental design and sequencing chemistry, the sequencing requirements for non-coding RNA-seq are mostly the same as for mRNA-seq.

MicroRNAs (miRNAs) are short pieces of RNA which direct post-transcriptional gene silencing of their targets by imperfect hybridization to the 3' UTRs of mRNAs. Mature miRNAs are typically 19-24 nt in size and are generated by two cleavage events: first, cleavage of a nuclear primary transcript, which may be up to several kilobases in length, and second, cleavage of the cytoplasmic intermediate hairpin precursor, which is approximately 70 nt [37]. Due to rapid processing and turnover, precursor transcripts are sequenced infrequently and most attention has been paid to the mature form. However, identical protocols have been successfully used to sequence precursor and mature miRNAs [38]. In contrast to Sanger sequencing, which identified only the most strongly expressed miRNAs, NGS can identify weakly expressed miRNAs as well as reveal heterogeneity in length and sequence [39]. As for mRNA-seq, obtaining a good miRNA-seq library begins with obtaining good quality total RNA.
It is crucial that the RNA isolation procedure preserve the integrity of small RNAs. Indeed, the discovery of small RNAs was hindered largely because frequently used spin columns did not retain RNAs < 200 nt. Careful work from the Kim laboratory has shown that although Trizol retains small RNAs, it is a poor choice when a low number of cells is used as starting material, since miRNAs with low GC content are selectively depleted [40]. Short RNA sequencing is not restricted to miRNAs; piwi-interacting RNAs (piRNAs) were also identified and characterized using RNA-seq [41]. Except for fragmentation, which is omitted, the stages involved in small RNA library preparation are similar to conventional RNA-seq (Figure 7). The latest version of Illumina miRNA library preparation makes use of the 5' monophosphate and 3' hydroxyl groups to specifically ligate miRNAs in a reaction containing total RNA [42], whereas short RNA enrichment by polyacrylamide gel electrophoresis or magnetic bead purification is required or strongly suggested for Ion Torrent and SOLiD libraries. Following RNA adapter ligation, miRNAs are reverse transcribed, amplified by PCR and sequenced. A number of miRNA-focused analysis platforms are freely available [39]. Briefly, reads should be trimmed of barcode and adapter sequences and mapped, either to known miRNAs in miRBase [43] or to a reference genome for novel miRNA discovery. In contrast to mapping of mRNA-seq data, using a splicing-aware aligner is not necessary and BWA [44] or Bowtie [45] may be used.

Like miRNAs, lincRNAs make up part of the non-coding assortment of RNAs within eukaryotic cells, although their function is more heterogeneous and less well defined compared to miRNAs [46, 47]. With regard to sequencing of lincRNAs, paired-end RNA-seq is typically most useful [48, 49]. However, since lincRNAs are frequently antisense to known genes, it is important to know the strandedness of mapped reads, which cannot be known using conventional library preparation methods. To maintain strandedness, either the 5' and 3' adapter sequences must be unique, or the first or second cDNA strand must be biochemically marked during library preparation, typically by substituting dUTP for dTTP to enable UDG-mediated degradation of dUTP-containing DNA [50]. LincRNAs are represented in RNA-seq libraries previously subjected to polyA selection, although omitting this step may allow identification of additional RNAs [48].
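The small RNA read preprocessing described above (remove the 3' adapter, then keep inserts in the size range of mature miRNAs before mapping) can be sketched as follows. This is a deliberately simple illustration and not a substitute for a dedicated trimming tool: the adapter sequence is an example to be replaced by the one supplied with the library kit, matching is exact with no tolerance for sequencing errors, and the function names are assumptions.

```python
# Example 3' adapter (an assumption for illustration; substitute the adapter
# sequence supplied with the library preparation kit actually used).
ADAPTER_3P = "TGGAATTCTCGGGTGCCAAGG"


def trim_3p_adapter(read, adapter=ADAPTER_3P, min_overlap=8):
    """Remove the 3' adapter and everything after it.

    Searches for the first exact occurrence of the first `min_overlap`
    bases of the adapter; the read is returned unchanged if none is found.
    """
    seed = adapter[:min_overlap]
    index = read.find(seed)
    return read if index == -1 else read[:index]


def candidate_mirna_reads(reads, adapter=ADAPTER_3P, min_len=19, max_len=24):
    """Keep adapter-trimmed inserts whose length matches mature miRNAs.

    Reads in which no adapter is found are discarded, since a 19-24 nt
    insert should be followed by adapter sequence in any longer read.
    """
    kept = []
    for read in reads:
        insert = trim_3p_adapter(read, adapter)
        if len(insert) < len(read) and min_len <= len(insert) <= max_len:
            kept.append(insert)
    return kept


reads = [
    "TGAGGTAGTAGGTTGTATAGTT" + ADAPTER_3P[:14],  # 22-nt insert plus adapter
    "ACGTACGTAC" + ADAPTER_3P,                    # 10-nt insert: too short
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGT",       # no adapter found
]
inserts = candidate_mirna_reads(reads)
```

Only the first read survives, giving a 22-nt insert that could then be aligned to miRBase or a reference genome with a non-spliced aligner.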
Identification of novel lincRNAs is performed computationally by ab initio transcriptome reconstruction, combined with consideration of epigenetic markers of active transcription, as previously described [47, 51]. Likewise, circRNA transcripts were identified computationally by searching for exon scrambling in paired-end RiboMinus RNA-seq data [35].

Figure 7. MicroRNAs are sequenced by ligating RNA adapters to each end of the mature microRNA followed by reverse transcription and PCR (RT-PCR). To enable barcoding, two sequencing reactions are performed using two sequencing primers: primer one to obtain the microRNA sequence and primer two to obtain the barcode sequence. Image adapted from Nieuwerburgh et al. [155].

While not a class of non-coding genes in and of itself, the poly(A) tails of mRNA transcripts are not translated into protein. Following RNA polymerase II-dependent transcription of most mRNAs, the transcript is cleaved by cleavage and polyadenylation specificity factor (CPSF) and a stretch of untemplated adenosine monophosphates is added by poly(A) polymerase. A long-standing question has been the relationship between poly(A) tail length, transcript stability and translatability. However, owing to the difficulty of sequencing through homopolymeric regions (< 50 nt of sequential adenosines) by both Sanger and next-generation sequencing, this question has only been studied indirectly. A recent innovative approach from the Kim laboratory, which combines sequencing technologies with complex statistical analysis, has provided much insight [52]. In their method, called TAIL-seq, total RNA is first depleted of rRNAs and small ncRNAs and a biotinylated 3' adapter is ligated to the remaining mRNAs and long non-coding RNAs.
Next, the nuclease RNase T1, which at low concentration specifically cuts after G residues (and not within poly(A) tails), is incubated with the ligated RNAs, followed by pull-down with streptavidin beads to enrich for 3' adapter-ligated RNA fragments. Following 5' adapter ligation, reverse transcription and PCR amplification, libraries are sequenced from both ends. The first read provides 51 nt of sequence for mapping purposes, while the second read, up to 231 nt in length, provides tail length as indicated by a stretch of thymine nucleotides (corresponding to the pre-reverse transcription poly(A) tail). Although libraries are sequenced on the Illumina platform, which is better suited to sequencing homopolymeric regions, incomplete cleavage of the thymine reversible terminator fluorophores results in thymine fluorescence signal persisting in subsequent cycles, making non-poly(T) nucleotides largely indistinguishable from true poly(T) stretches. However, the transition from poly(T) to non-poly(T) stretches was accompanied by an increase in non-T signal. By using a Gaussian mixture hidden Markov model to detect the position of this transition, poly(A) tail length can be measured with extraordinary resolution and accuracy and at genome-wide scale. Ultimately, this technique revealed that tail length correlates with mRNA half-life, but not translational efficiency. Furthermore, TAIL-seq for the first time identified widespread uridylation and guanylation of mammalian mRNAs.

Data Analysis

All of the previous steps (experimental design, isolation of RNA and preparation of libraries) reside firmly within the skill set of the traditional wet lab biologist. In contrast, biologists may be less familiar with the techniques and approaches used to analyze the resulting RNA-seq data. One of the first challenges new RNA-seq researchers will face is the data deluge problem: the compressed single-end sequencing data from one flow cell of an Illumina HiSeq 2500 might be 20 GB, and twice as large once uncompressed to allow for processing and manipulation. Learning to handle and manipulate these large files will be one of the first tasks for the novice bioinformatician. Fortunately, there is a wealth of tools generated by biostatisticians and computational scientists that allow biologists to handle, manipulate and understand their RNA-seq data.

These tools fall into two groups. Researchers wishing to answer a relatively simple question, such as identifying genes differentially expressed between a cohort of mutants and controls, may consider commercial tools such as those offered by CLC bio (http://www.clcbio.com/) and Partek (http://www.partek.com/). The main advantage of these proprietary tools is the user-friendly, one-step means of obtaining differentially expressed genes, etc., with a dedicated technical support team for assistance with troubleshooting and data interpretation.
However, given their proprietary nature it can be difficult to fully understand and evaluate the assumptions being made during each step of analysis. For this reason, these tools will not be reviewed any further here. Instead, the remainder of this review will focus on the second group of tools, which are open source and developed, supported and published by the scientific community in the spirit of collaboration and openness. While the majority of such tools are run from the command line, which might be daunting to the novice, a number of active mailing lists and online support forums exist and are excellent sources of information for beginners and advanced users alike (Appendix). Obtaining even the most cursory understanding of the command line interface, shell commands and scripting will increase the productivity and efficiency of researchers tremendously. However, to aid the beginner and streamline analyses, a number of popular command line RNA-seq analysis tools have been implemented in open, web-based platforms such as Galaxy [53-55] and GeneProf [56].

Manipulating RNA-seq data is computationally intensive and typically requires access to a powerful cluster resource. In many cases, access to these computational resources can be obtained through institutional sequencing/genomic core facilities, and a local instance of Galaxy can be installed. In the absence of a local and dedicated cluster, users may obtain a free account on a public Galaxy server hosted by Penn State University and Emory University [57]. An excellent series of step-by-step video tutorials for typical workflows is also provided on the Galaxy website.
Feature/Tools | NGS QC Toolkit v2.2 | FastQC v0.10.0 | PRINSEQ lite v0.17 (1) | TagDust | FASTX Toolkit v0.0.13 | SolexaQA v1.10 | TagCleaner v0.12 (1) | CANGS v1.1
Supported NGS platforms | Illumina, 454 (2) | Illumina, 454 | Illumina, 454 | Illumina, 454 | Illumina | Illumina | Illumina, 454 | 454
Parallelization | Yes | Yes | No | No | No | No | No | No
Detection of FASTQ variants | Yes | Yes | Yes | No | No | Yes | No | No
Primer/Adapter removal | Yes | No (3) | No | Yes | Yes | No | Yes (4) | Yes
Homopolymer trimming (Roche 454 data) | Yes | No | No | No | No | No | No | Yes
Paired-end data integrity | Yes | No | No | No | No | No | No | No
QC of 454 paired-end reads | Yes | No | No | No | No | No | No | No
Sequence duplication filtering | No | No (5) | Yes | No | Yes | No | No | Yes
Low complexity filtering | No | No | Yes | No | Yes | No | No | No
N/X content filtering | No | No (6) | Yes | No | Yes | No | No | Yes
Compatibility with compressed input data | Yes | Yes | No | No | No | No | No | No
GC content calculation | Yes | Yes | Yes | No | No | No | No | No
File format conversion | Yes | No | No | No | No | No | No | No
Export HQ and/or filtered reads | Yes | No | Yes | Yes | Yes | No | Yes | Yes
Graphical output of QC statistics | Yes | Yes | No (7) | No | Yes | Yes | No (7) | No
Dependencies | Perl modules: Parallel::ForkManager, String::Approx, GD::Graph (optional) | - | - | - | Perl module: GD::Graph | BLAST, R, matrix2png | - | NCBI nr database

Table 1. Feature comparison of RNA-seq quality control software. Table: Patel & Jain [58]. Several QC-dedicated programs used for raw data identification. (1) Standalone version. (2) Data of any platform in FASTQ file format. (3) Only detection. (4) Only one primer/adapter sequence at a time. (5) Only reports duplication, and only for the first 200,000 reads. (6) Only reports N/X content. (7) Yes, in case of online version. doi:10.1371/journal.pone.0030619.t001

Quality control

Best practices for analyzing and understanding RNA-seq data clearly depend on the ultimate goal of the experiment: data processing to identify alternatively spliced and/or novel isoforms is very different from a pipeline for differential gene expression calling. However, regardless of the eventual goal, all data analysis will begin with quality control and pre-processing. Any major biases present in the raw data produced by the sequencer can be identified using one of several QC-dedicated programs (Table 1) [58]. FastQC [57] and similar tools use the raw sequences provided in FASTQ format (Table 2) and display basic statistics to allow a quick evaluation of whether sequences are as expected.
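To make these basic statistics concrete, the following is a minimal Python sketch, not the FastQC implementation, that reads FASTQ records and reports the number of reads, the GC percentage and the mean quality score at each read position. It assumes four lines per record and the Phred+33 quality encoding; the function names and the two-read example are hypothetical. The helper `phred_to_error` applies the standard relation Q = -10 log10(p) between a Phred score Q and the base-calling error probability p.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:          # end of file (or a blank line): stop reading
            return
        sequence = handle.readline().strip()
        handle.readline()       # the '+' separator line is ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def phred_to_error(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def basic_stats(records, offset=33):
    """Summarise read count, GC percentage and mean quality per read position."""
    n_reads = gc = n_bases = 0
    qual_sums, qual_counts = [], []
    for _identifier, sequence, quality in records:
        n_reads += 1
        n_bases += len(sequence)
        gc += sum(base in "GCgc" for base in sequence)
        for pos, char in enumerate(quality):
            if pos == len(qual_sums):   # first read that reaches this position
                qual_sums.append(0)
                qual_counts.append(0)
            qual_sums[pos] += ord(char) - offset   # assumed Phred+33 encoding
            qual_counts[pos] += 1
    return {
        "reads": n_reads,
        "gc_percent": 100 * gc / n_bases if n_bases else 0.0,
        "mean_quality_per_position": [s / c for s, c in zip(qual_sums, qual_counts)],
    }


# Hypothetical two-read input: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n5555\n")
stats = basic_stats(read_fastq(example))
print(stats)
print(phred_to_error(20))
```

A real data set would be streamed from a (usually compressed) file rather than an in-memory string, but the per-position bookkeeping is the same.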
Reported parameters include number of reads and GC percentage, per base sequence quality score (a measure of confidence of correct base calling), per base sequence content (a representation of each nucleotide at each base position to visualize position/sequence bias), per base N content (a plot of uncalled nucleotides (Ns) at each base position), duplicate reads (typically a result of PCR over-amplification during library preparation) and overrepresented sequences and k-mers. It is important to evaluate the report in the context of the anticipated results, since QC programs assume sequencing of a random and diverse library, which may not be the case depending on experimental design and library preparation. As mentioned earlier, base calling error rate is highest during the final cycles of sequencing and it is not uncommon for per base quality score to be low in these cycles (a quality score (in Phred units) of 20 equates to a 1% error rate).

File type | Description | Source of file | Reference(s)
FASTQ | Contains nucleotide sequence and corresponding quality scores together with read identifier | Raw output from sequencer | Cock et al. [59], FASTQ
SAM/BAM | Tab-delimited text file containing read alignment data, flags to indicate number of matches, mismatches and presence of correct mate read (in the case of paired-end reads). Note: BAM is a binary (not directly human-readable) version of a SAM file. | Output from aligner (TopHat, STAR, etc.) | Li et al. [60]
GTF | A general purpose tab-delimited file containing information about a list of genes. One gene per line, with characteristics such as feature type (CDS, UTR, intron etc.), start and end coordinates, strandedness and miscellaneous comments. | Depending on organism, downloadable from sources such as UCSC, Ensembl etc. | Ensembl, WUSTL GTF2.2

Table 2. File types and their sources. Table: David Corney.

Pre-processing prepares sequences for read alignment. If libraries were barcoded they should be demultiplexed using either an internal barcode sequence or a separate index read sequence. Additionally, a trimming step to remove low-quality nucleotides at the 3' end should be performed. In addition to improving the quality of alignment, removing or trimming reads with low quality base calls expedites mapping and reduces the computational resources consumed during later stages of analysis.

Read alignment

Following completion of any necessary trimming using Cutadapt [57], reads are ready either to be aligned to a reference genome or to be assembled de novo into a new transcriptome. For most model organisms, aligning to the reference genome is sufficient and will allow quantification of known genes and transcripts. Advantages of alignment to a reference genome include more efficient computing and elimination of contaminating reads, for example from microbial genomes, since they are unlikely to align correctly [61].
However, alignment to a reference relies on a good quality genome build; any errors, such as genomic deletions or rearrangements, are problematic. Bowtie 2 [62] is the first step of the Tuxedo suite of RNA-seq software and efficiently maps reads to a reference genome. Although it allows for gapped alignment, Bowtie is best suited to aligning genomic DNA reads since it does not consider introns/splicing. A better choice is TopHat 2, which uses Bowtie but additionally analyzes mapping results to identify splice junctions [63]. An alternative is STAR, which is also splicing-aware but reportedly 50 times faster than TopHat 2, with better alignment precision and sensitivity [64]. Both aligners ultimately generate a BAM file as output, which can be used in subsequent stages of analysis. An in-depth tutorial describes start-to-finish mapping and differential expression testing using the Tuxedo suite [65] (Figure 8).

Figure 8. Tuxedo suite for RNA-seq differential expression analysis. Pre-processed reads from two groups are mapped by TopHat. The resulting *.bam files are used for transcript assembly by Cufflinks with a given *.gtf file. Individual (sample-specific) assemblies are merged by Cuffmerge to generate one final assembly containing all transcripts identified across all samples. Cuffdiff performs statistical testing to identify differential expression, which can be viewed as a spreadsheet in Excel or visualized using CummeRbund. Image: David Corney.

Computational advances in de novo transcriptome assembly allow RNA-seq analysis of unsequenced genomes [48, 66, 67], although, as early non-model organism transcriptome studies demonstrated, utilizing genomic resources from closely related species as a template can aid assembly [68, 69]. However, to obtain a reliable and useful assembly a higher number of mappable reads is required [61]. For example, in their paper describing the Trinity de novo assembler, Grabherr et al. used 52.6 million read pairs when reconstructing a mouse transcriptome [66]. The quality of the resulting transcriptome assembly can be evaluated in several ways. Assemblies can be directly viewed in the Integrative Genomics Viewer (IGV; [70]), the number of potential full-length transcripts determined using the reference genome of a closely related organism, and the potential coding regions extracted and functionally annotated using TransDecoder and Trinotate, both part of the Trinity package [71]. Given the above advantages and disadvantages of reference-based and de novo-based alignment, some studies have used a combination of both: reads are first mapped to a reference and any reads that fail to align correctly are used for de novo assembly [61].

Gene quantification and differential expression testing

One of the most frequent applications of RNA-seq, analogous to microarray experiments, is to identify differentially expressed genes between two or more groups. The number of reads mapping to each RNA species is linearly related to its abundance within the cell [1]. Therefore, the number of reads discretely mapping to each gene or isoform may be used to infer the level of expression. However, a normalization step must first be performed to account for differences between libraries. The main bias that normalization seeks to resolve is the total library size (i.e. the number of aligned reads/sequencing depth), since this will vary from sample to sample. Additionally, as mentioned earlier, longer transcripts are more likely to be sequenced than short transcripts, and read coverage is often not uniform. Several normalization procedures have been developed in recent years, although the relative advantages and disadvantages are still being assessed by the community and no first choice is obvious.
However, a recent report from The French StatOmique Consortium that directly compared the most frequently used normalization techniques sheds some light on the issue [72]. The simplest normalization methods scale the read counts that map to each locus by a summary of the library. These include the total count, upper quartile and median normalization methods, in which each gene's read count is divided by the total, upper quartile or median of the counts in its library, respectively, and multiplied by the mean of that quantity across all libraries in the experiment. Two R/Bioconductor [73] packages, DESeq [74] and edgeR [75], implement similar normalization methods that calculate a sample-specific scaling factor based on the assumption that the majority of genes are not differentially expressed. The final normalization methods tested by Dillies et al. were quantile normalization, which is typically used for normalization of microarray data, and FPKM normalization. Unfortunately, the most widely used FPKM-based normalization failed to sufficiently normalize variation between samples, had a high false-positive rate and did not adequately reduce the coefficient of variation of a pool of 30 housekeeping genes [72]. In contrast, DESeq and edgeR normalization both resulted in the lowest coefficient of variation of all methods tested and a low false-positive rate after testing of simulated and real data. However, whether the underlying assumption of DESeq/edgeR that the majority of genes are not differentially expressed holds true in all cases is not clear. Overexpressed and underexpressed genes may not balance out, and recent observations of widespread transcriptional amplification in cells overexpressing c-Myc highlight the need for careful decision making when choosing a normalization method [76-78].
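The difference between library-size scaling, FPKM and a scaling factor that assumes most genes are unchanged can be illustrated with a small Python sketch. It is a simplified illustration and not the code of DESeq or edgeR: `median_of_ratios_size_factors` follows the median-of-ratios idea described for DESeq [74] (edgeR's default method differs in detail), scaling to the mean library size is one common convention for total count normalization, and the function names and toy counts are assumptions.

```python
import math
from statistics import median


def total_count_scale(counts):
    """Scale every library to the mean library size (total count normalization)."""
    lib_sizes = {sample: sum(values) for sample, values in counts.items()}
    mean_size = sum(lib_sizes.values()) / len(lib_sizes)
    return {sample: [x * mean_size / lib_sizes[sample] for x in values]
            for sample, values in counts.items()}


def fpkm(count, length_bp, mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * mapped_fragments)


def median_of_ratios_size_factors(counts):
    """One scaling factor per sample: median ratio of its counts to the
    per-gene geometric mean across samples (genes with a zero are skipped)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = {}
    for gene in range(n_genes):
        values = [counts[sample][gene] for sample in samples]
        if all(v > 0 for v in values):
            log_geo_means[gene] = sum(math.log(v) for v in values) / len(values)
    return {
        sample: math.exp(median(math.log(counts[sample][gene]) - lgm
                                for gene, lgm in log_geo_means.items()))
        for sample in samples
    }


# Toy data: library B was sequenced twice as deeply as A, and the last gene
# is additionally strongly induced in B.
toy = {"A": [10, 20, 30, 40], "B": [20, 40, 60, 800]}
factors = median_of_ratios_size_factors(toy)
print(factors["B"] / factors["A"])      # relative depth estimated from the median gene
print(sum(toy["B"]) / sum(toy["A"]))    # relative depth estimated from library totals
print(fpkm(100, 2000, 10_000_000))
```

In the toy data a single strongly induced gene inflates the library total of sample B, so scaling by totals would make the three unchanged genes appear repressed, whereas the median-based factor recovers the twofold difference in depth. The same logic explains why the approach fails if most genes genuinely change in one direction.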
Alternative strategies might be to revisit normalization after differential expression testing, removing differentially expressed genes prior to estimation of scaling factors [79], or to make use of synthetic RNA spike-in transcripts [78].

Following library normalization, statistical testing for differential expression can be performed. Two widely used counts-based workflows make use of the R/Bioconductor statistical computing environment [73]. In addition to performing count normalization, edgeR and DESeq both test for differential gene expression using a negative binomial distribution [74, 75]. First, a matrix containing the number of reads corresponding to each gene of interest for every sample is prepared using a Python script called HTSeq-count [57]. Next, edgeR or DESeq uses the counts table with biological replicates to calculate variation and test for statistically significant differential expression. Both tools can be operated at the command line or in the MultiExperiment Viewer (MeV) software, which has a convenient graphical user interface [80]. Importantly, both methods are able to make use of ANOVA-like generalized linear models (GLMs) to analyze complex experimental designs. This enables users to control for any known batch effects introduced during library preparation, as well as to analyze time course experiments and experiments with more than two groups. Additionally, both methods use the Benjamini-Hochberg procedure to control the false discovery rate (FDR) associated with multiple hypothesis testing.

On the other hand, Cufflinks/Cuffdiff [81, 82], part of the Tuxedo suite, are FPKM-based. The Cufflinks component attempts to assemble aligned reads into transcripts, isoforms and genes whilst simultaneously identifying transcriptional start sites (TSSs), whereas Cuffdiff tests for statistically significant differences in expression in the Cufflinks output. Although unable to apply GLMs, Cuffdiff has a number of advantages over DESeq/edgeR. In particular, for those uncomfortable using R/Bioconductor and the command line, Cufflinks/Cuffdiff, along with all of the other tools in the Tuxedo suite, are implemented in Galaxy. A major technical advantage of Cuffdiff is its ability to detect and test differential isoform expression, although, as noted, GLMs and complex multi-group comparisons are not supported. Quantification of isoforms is arguably one of the most challenging, but important, aspects of RNA-seq experiments, since quantifying expression at the gene level may mask alterations at the transcript level if two or more isoforms have opposite expression patterns.

During alignment, reads frequently map to multiple regions, which leads to ambiguity when deciding the true origin of the read. By default, Cufflinks will uniformly divide a multi-mapped read among all the positions to which it maps. Optionally, multi-read correction can be performed to assign reads probabilistically. However, ambiguity also exists when considering multiple isoforms transcribed from a single open reading frame (ORF). Consider the example of a full-length transcript and a truncated transcript resulting from alternative polyadenylation (Figure 9).
Although assignment of reads mapping to the 3' exons of the full-length transcript is unambiguous, 5' exon reads could originate from either the full-length or the truncated transcript. Cuffdiff attempts to resolve this problem by probabilistic deconvolution to assign reads to the correct isoform. Cuffdiff may therefore be more appropriate than edgeR/DESeq when characterizing transcriptomes where alternative splicing is frequent, since a constant number of reads per gene might mask differential expression of two or more isoforms. However, the landscape of tools available for isoform quantification is rapidly evolving and the authors of DESeq have recently published a variation of their negative binomial method called DEXSeq [83]. Rather than testing for a significant difference in the number of reads per gene between samples, DEXSeq tests whether individual exons are differentially expressed. During testing against Cuffdiff (version 1.3), the DEXSeq authors observed a far higher number of false positives identified by Cuffdiff. However, shortly after publication, a new version of Cuffdiff (version 2.0) was published and released [81]. Unfortunately, to date a comprehensive direct comparison of DEXSeq to Cuffdiff 2.0 has not been published.

Data visualization and higher level analysis

After obtaining a list of differentially expressed genes and/or transcripts, visualization and higher level analysis can proceed analogously to microarray experiments, with minor modifications. The newer versions of MeV [80], which was designed for analysis of microarrays, have several modules specifically designed for RNA-seq data. One such MeV module, which is based on the GOSeq R/Bioconductor package [10], tests for enrichment of Gene Ontology (GO) terms associated with a list of significantly over- or underexpressed genes.
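As a point of reference for what "testing for enrichment" means, the sketch below computes the standard one-sided hypergeometric p-value for the overlap between a gene list and a GO category. This is deliberately the uncorrected baseline and not the GOSeq method [10], which additionally accounts for transcript length; the function name, argument names and numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_in_category, n_de, n_de_in_category):
    """P(X >= n_de_in_category) when n_de genes are drawn without replacement
    from n_genes, of which n_in_category belong to the GO category."""
    upper = min(n_in_category, n_de)
    total = comb(n_genes, n_de)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_de - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes tested, 4 annotated to the category,
# 3 called differentially expressed, 2 of which carry the annotation.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(p_value)
```

The test treats every gene as equally likely to appear in the differentially expressed list, which is exactly the assumption that does not hold for RNA-seq data.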
Unlike microarray GO analysis, lists of differentially expressed genes identified by RNA-seq are biased towards longer transcripts, since longer transcripts yield more reads and there is greater statistical power to call genes with high read counts as differentially expressed. At the same time, Young et al. observed that some GO categories are enriched for short or long transcripts [10]. The GOSeq method attempts to control for length bias and, optionally, selection bias. Since microarray analysis is not subject to the same length bias, samples analyzed by microarray were compared to RNA-seq gene lists with and without bias correction. The GOSeq bias correction method resulted in more GO categories consistent with the microarray interrogation [10]. The same GOSeq method is available as a stand-alone package in R/Bioconductor, along with numerous packages for hierarchical clustering, preparation of heatmaps, principal component analysis (PCA) and visualization. CummeRbund, the final component of the Tuxedo suite, is an R/Bioconductor package specifically designed to visualize the Cuffdiff output and offers much of the same functionality [65].

Figure 9. Paired-end reads are obtained and mapped to a reference genome using a splice-aware aligner, such as TopHat or STAR. In this example, two isoforms are transcribed from a single gene. Although reads mapping to exon 4 are certain to originate from the long isoform, the assignment of remaining reads is ambiguous and must be handled probabilistically, for example using Cufflinks. Image: David Corney.

Emerging technologies

Much has been made of the fact that recent advances in NGS productivity outpace Moore's Law, the prediction originally applied to computing that states that the number of transistors per microprocessor will double every two years (Figure 10). Per base sequencing costs have decreased significantly, concomitant with increased sequencing output. Output increases are largely due to an increased number of reads and longer read length, and this trend continues apace. However, some of the most exciting emerging technologies will potentially address the known biases and disadvantages of RNA-seq, such as eliminating fragmentation, reverse transcription and PCR amplification biases and reducing input requirements. Sample input requirements are currently at the stage where RNA of a single eukaryotic cell may be sequenced with existing technologies [33, 84-87]. However, these approaches have relied on various degrees of amplification by PCR or in vitro transcription. A sequencing method that does away with the need to amplify or convert RNA to cDNA will allow an undistorted picture of the transcriptome.
Direct RNA sequencing (DRS) from Helicos BioSciences has paved the way to achieving this feat by capturing polyadenylated RNAs on a flow cell followed by sequencing by synthesis using fluorescently labeled nucleotide analogs, akin to Illumina sequencing [88-90]. Ultimately, the DRS system has not proved to be commercially viable due to a combination of a high error rate, a short read length and an inability to carry out paired-end sequencing. However, Helicos intellectual property rights have been licensed to Illumina and Life Technologies, raising the exciting possibility of Illumina/Ion Torrent-like DRS in the near future.

The ability to sequence longer transcripts, and eventually full-length transcripts, would remove the uncertainty when quantifying alternatively spliced genes. In addition to allowing long sequencing reactions, nanopore sequencing might also permit sequencing of RNA without amplification [91]. In nanopore sequencing, either biological (e.g. α-hemolysin [92]) or synthetic (e.g. graphene [93]) nanopores are embedded in a synthetic polymer membrane. An electric current is applied across the nanopore; as each nucleotide passes through the nanopore, a nucleotide-specific disruption in the current is detected. Elimination of the requirement to use fluorescence allows far higher density of nanopores, faster sequencing and hence greater sequence output. Oxford Nanopore Technologies are developing an α-hemolysin nanopore-based sequencer that has already been demonstrated to determine the sequence identity of single-stranded DNA [92]. Although sequencing of only up to 85-mer DNA oligos was reported in their paper, in a 2012 press release sequencing of the entire 48 kb lambda genome as a single, complete fragment was disclosed [94].

Figure 10. In recent years, sequencing costs per megabase have decreased faster than Moore's Law. Image: National Human Genome Research Institute.
Whether sequencing of full-length RNA or cDNA fragments can be similarly achieved remains to be determined. In the meantime, long reads can be obtained with the 454 and Pacific Biosciences sequencing platforms (1 kb and > 10 kb, respectively), although relatively few reads are generated and these platforms are therefore better suited to genome sequencing.

Automatic data analysis workflows

As described above, RNA-seq data analysis is performed in a number of consecutive steps, for which numerous tools and computational and statistical approaches have been developed and continue to become available. Trimmomatic [95] (http://www.usadellab.org/cms/index.php?page=trimmomatic) for read trimming and adapter removal; FastQC [57] for data set quality control; Bowtie [45] (http://bowtie.cbcb.umd.edu) and TopHat [96] (http://tophat.cbcb.umd.edu) to align reads to a reference genome; HTSeq [97] (http://www-huber.embl.de/htseq) to count transcript abundance; and Cufflinks [82] (http://cole-trapnell-lab.github.io/cufflinks/) and edgeR [75] (http://bioconductor.org) to identify and quantitate expressed genes and transcripts are only a few examples, some of which were described in detail above. A more comprehensive, continuously updated list can be found here (https://en.wikipedia.org/wiki/list_of_rna-seq_bioinformatics_tools) for reference. However, the problem of choosing the right combination of tools for each particular experimental goal remains.

As of 2015, developing user-friendly analysis pipelines has become the focus of more and more research groups. Efforts have even been directed to creating toolboxes for building such automatic analysis paths (ViennaNGS [98], NEAT [99]), guides for creating and using computing workflows [100, 101] and RNASeqGUI [102], or explicit demonstrations of data analysis [103]. However, many of these instruments still require validation on real data sets and confirmation with biological experimental results.

Below, we describe applications for which tools have been developed or improved recently.
We do not attempt to create an exhaustive list, as the field is rapidly expanding and more and more examples become available every day. Quantification of gene expression under different experimental conditions, identification of novel transcripts, alternatively spliced sites or editing events, and the detection of RNA-fusion transcripts are major applications of RNA-seq. To these, one can add the detection of non-coding and small RNAs, single-cell RNA-seq and various others.

Pipelines for gene identification and differential gene expression

Most of the current analysis workflows have been developed for the study of eukaryotic RNA, but also several analysis

Figure 11. Overview of the transcript-compatibility counts (TCC) method. An scRNA-seq example with K cells (only the reads coming from Cell1 and Cell2 are shown here) and a reference transcriptome consisting of three transcripts, t1, t2 and t3, is used for exemplification. Conventional approach: Single cells are clustered based on their transcript or gene abundances (here we only focus on transcripts for concreteness). This widely adopted pipeline involves computing a (#transcripts x #cells) expression matrix by first aligning each cell's reads to the reference. The corresponding alignment information is next to each read, which for the purpose of illustration only contains the mapped positions (the aligned reads of Cell1 are also annotated directly on the transcripts). While reads 1 and 5 are uniquely mapped to transcripts 1 and 3, reads 2, 3 and 4 are mapped to multiple transcripts (multi-mapped reads). The quantification step must therefore take into account a specific read-generating model and handle multi-mapped reads accordingly. The proposed method: Single cells are clustered based on their transcript-compatibility counts. Our method assigns the reads of each cell to equivalence classes via the process of pseudoalignment and simply counts the number of reads that fall in each class to construct a (#eq.classes x #cells) matrix of transcript-compatibility counts. Then, the method proceeds by directly using the transcript-compatibility counts for downstream processing and single cell clustering. The underlying idea here is that even though equivalence classes may not have an explicit biological interpretation, their read counts can collectively