RNA-seq Using Next Generation Sequencing


David C Corney 1 (davec dot corney at gmail dot com), Georgeta N Basturea 2 (gbasturea at gmail dot com)
1 Thomas Jefferson University Hospitals (formerly Princeton University, till April 2014), United States. 2 Florida, USA (formerly of University of Miami Miller School of Medicine, USA)
Cite as MATER METHODS 2013;3:203

Abstract

A comprehensive review of RNA-seq methodologies.

Introduction

Next-generation sequencing is rapidly becoming the method of choice for transcriptional profiling experiments. In contrast to microarray technology, high-throughput sequencing allows identification of novel transcripts, does not require a sequenced genome and circumvents the background noise associated with fluorescence quantification. Furthermore, unlike hybridization-based detection, RNA-seq allows genome-wide analysis of transcription at single-nucleotide resolution, including identification of alternative splicing events and post-transcriptional RNA editing events.

All RNA-seq experiments follow a similar protocol. Total RNA is isolated from a sample of interest and, depending on the type of RNA to be profiled, may be purified to enrich for mRNAs, microRNAs, lincRNAs, etc. prior to preparing an RNA library. Library preparation may involve such steps as reverse transcription to cDNA and PCR amplification, and may or may not preserve strandedness information. Sequencing can produce one read in a single-end sequencing reaction, or two reads from the ends of a fragment, separated by an unsequenced stretch, in paired-end reactions. Taken together, RNA-seq has allowed an unparalleled view of the transcriptome in normal and pathological processes and has revealed that the transcriptome is significantly more complex than previously envisioned [1]. This review will examine planning, performing and analyzing an RNA-seq experiment.
Briefly, this includes determining optimal sequencing depth, number of replicates, and choosing a sequencing platform; preparing and sequencing libraries; and mapping of reads to a genome followed by transcript quantification. Each of these steps will be reviewed in turn. However, let us first review in greater detail the main advantages and disadvantages, as well as similarities, of RNA-seq compared to microarrays.

Advantages and disadvantages of RNA-seq compared to microarrays

Microarray-based analysis of the transcriptome is responsible for a great deal of our current understanding of spatiotemporal-specific gene expression in development and disease. Yet, even ignoring the fact that microarray analysis is limited to organisms with sequenced genomes and to detection of known transcripts,

hybridization-based detection can suffer from a number of inherent weaknesses, such as poor sensitivity, low specificity and a limited dynamic range. RNA-seq reduces, and in some cases eliminates, these limitations [2]. With sufficient sequencing depth the dynamic range is, in principle, unlimited. Non-specific hybridization or cross-hybridization is a common concern when interpreting microarrays, especially when closely related gene family members with highly similar sequence are of particular interest. In this regard, RNA-seq largely eliminates ambiguity of sequence detection but at the cost of potential ambiguity in mapping of reads to the genome, since a sequenced read may map to multiple locations. However, paired-end sequencing has gone a long way to address this problem, since the added information of a second read sequence and prior knowledge of the expected distance between the two reads allow more accurate mapping.

Figure 1. Overview of analysis workflow for microarray and RNA-seq transcriptional profiling. Image from Fang et al. [2] (image released under a Creative Commons Attribution License).

While not suffering from some of the disadvantages of hybridization-based detection, high-throughput sequencing has a number of unique biases of its own. One of the first steps of mRNA library preparation is fragmentation of RNA into smaller pieces to allow sequencing. This may be done by Covaris sonication or RNase III enzymatic digestion, but use of divalent cations at elevated temperature is most common. All methods introduce bias, which can be position- and/or sequence content-based [3].

Currently, all commercially available RNA-seq platforms rely on reverse transcription and PCR amplification prior to sequencing, and sequencing is therefore subject to the biases inherent to these procedures. First, annealing of random hexamer primers to fragmented RNA is not random, which results in depletion of reads at both 5' and 3' ends [3-6].

Figure 2.
Sequence logo showing observed and expected nucleotide distribution surrounding the 5' fragmentation site. Similar biases are present at the 3' end. Image: Roberts et al. [3] (image released under a Creative Commons Attribution License).

This makes identifying the true start and end of novel transcripts a challenge, and leads to underestimation of the expression level of short genes. Second, PCR can introduce bias based upon GC content and length due to non-linear amplification [7, 8]. A number of data analysis tools to correct these biases are available, although with varying degrees of success [6, 9, 10]. Given that the total number of reads per transcript is proportional to the level of a transcript multiplied by transcript length, a long transcript will be sequenced more often than a short transcript when expressed at equivalent levels. Since statistical power is closely linked to sample size, a long transcript is more likely to be found differentially expressed than a short

transcript [11]. To mitigate this problem, expression levels are frequently reported as the number of reads or fragments per kilobase of transcript per million mapped reads (RPKM and FPKM, respectively) [1]. The FPKM transformation also allows direct comparison of transcript expression between two libraries with different sequencing depth, as well as an indication of relative expression levels between two or more transcripts in a single library.

Figure 3. Read coverage over genes is biased against 3' and 5' extremities. Fragmentation was done by either RNA hydrolysis or cDNA shearing and distribution of reads plotted for small (< 1 kb; top), medium (1-8 kb; middle) and large (> 8 kb; bottom) transcripts. Image modified from Huang et al. [4].

A typical RNA-seq experiment

The next generation sequencing platforms most frequently used for RNA-seq are the Illumina HiSeq, Ion Torrent and SOLiD systems. Whilst the library preparation and nucleotide detection protocols for each platform vary, all consist of the following main steps:

Preparation of total RNA. Depending on the class of RNA to be sequenced (e.g. mRNA, lincRNA, microRNA etc.), enrichment is performed. Good quality total RNA is critical, although alternative protocols for degraded RNA exist [12].
Library preparation. Library preparation consists of:
RNA fragmentation. Unlike short RNAs, mRNAs are typically fragmented to smaller pieces of RNA to enable sequencing.
Reverse transcription. First and second strand cDNA is synthesized from fragmented RNA using random hexamers or oligo(dT) primers.
Adapter ligation. The 5' and/or 3' ends of cDNA are repaired and adapters (containing sequences to allow hybridization to a flow cell) are ligated.
Library cleanup and amplification. Libraries are enriched for correctly ligated cDNA fragments and amplified by PCR to add any remaining sequencing primer sequences.
Library quantification, quality control and sequencing.
Library concentration is assessed using qRT-PCR and/or a Bioanalyzer, after which the library is ready for sequencing.
Data analysis. Downstream data analysis consists of quality control, such as trimming of sequencing adapters and removal of reads with poor quality scores, followed by mapping of reads, analysis of differential expression, identification of novel transcripts and pathway analysis.

Experimental design

Just as for any other technique, a well-designed RNA-seq experiment requires proper replication, randomization and blocking [13]. An all too common topic on Internet NGS user forums, such as SEQanswers, is how to identify statistically significant differential gene expression (DGE) from an experiment without replicates. Although it is technically possible to calculate DGE without replicates, these experiments must be interpreted with extreme caution. Unless one is purely concerned with novel transcript discovery, both technical and biological replicates must be carefully

considered from the outset. In the infancy of RNA-seq, technical replicates (libraries prepared from the same RNA sample) were commonly used. However, it has been shown that biological variation far outweighs technical variation, at least when coverage of 5 or more reads per nucleotide is obtained [14, 15]. Technical replicates, therefore, are most useful when the goal is to compare the performance of two or more competing sequencing technologies. If the goal is to investigate differences between treatments, biological replication is essential in order to generalize the results to a larger population. The required number of replicates will vary greatly depending on the amount of biological variability associated with the samples of interest and should be empirically determined. To this end, the number of replicates used for any prior microarray analysis is usually a good starting point.

Most sequencing platforms support multiplexing of libraries by introducing barcodes during library preparation. This allows simultaneous sequencing of multiple libraries in a single sequencing run, thereby enabling more efficient use of a sequencer. Importantly, multiplexing also facilitates a balanced block design to minimize potential confounding factors such as PCR amplification and flow cell effects [13, 16]. Consider an example where three placebo-treated samples and an equal number of drug-treated samples are to be sequenced on an Illumina HiSeq 2000 instrument. The Illumina HiSeq uses a flow cell with eight lanes, one of which is usually reserved for a PhiX sequencing control used for quality control purposes. Rather than use a single lane per sample, all six libraries should be barcoded during library preparation so that they can be pooled and sequenced together across six lanes. Reads are demultiplexed following sequencing based upon barcode sequence and analyzed accordingly. Such a design offers insurance against one poor sequencing lane compromising the study.
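The pooled design described above can be illustrated with a small sketch. It is only a toy model: the barcode sequences, sample names and the `(lane, barcode, sequence)` record layout are assumptions made for illustration, and real demultiplexers also tolerate barcode mismatches, which this exact-match version does not.

```python
from collections import defaultdict

# Illustrative 6-nt barcodes for three placebo and three drug-treated
# libraries (assumed values, not taken from the article).
BARCODES = {
    "ATCACG": "placebo_1",
    "CGATGT": "placebo_2",
    "TTAGGC": "placebo_3",
    "TGACCA": "drug_1",
    "ACAGTG": "drug_2",
    "GCCAAT": "drug_3",
}


def demultiplex(reads, barcodes=BARCODES):
    """Group (lane, barcode, sequence) records by sample.

    Reads whose barcode is not an exact match are collected under
    the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for lane, barcode, sequence in reads:
        sample = barcodes.get(barcode, "undetermined")
        by_sample[sample].append((lane, sequence))
    return dict(by_sample)


def reads_per_sample_without_lane(reads, failed_lane, barcodes=BARCODES):
    """Count reads per sample after discarding every read from one lane."""
    counts = {sample: 0 for sample in barcodes.values()}
    for lane, barcode, _sequence in reads:
        if lane != failed_lane and barcode in barcodes:
            counts[barcodes[barcode]] += 1
    return counts


# Toy data: each pooled library contributes one read to each of six lanes.
toy_reads = [(lane, bc, "ACGT") for lane in range(1, 7) for bc in BARCODES]

if __name__ == "__main__":
    print(reads_per_sample_without_lane(toy_reads, failed_lane=3))
```

In this toy data set, discarding lane 3 removes one sixth of the reads from every sample rather than removing a whole sample, which is the insurance that the blocked design provides.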
On the other hand, if an unblocked design were used and one lane had unacceptably strong artifacts, an entire sample would be lost and the study compromised. While offering the major advantage of removing confounding factors, a multiplexed, balanced block design also allows for re-sequencing on additional lanes at a later date to increase the number of reads as needed, without introducing sample-specific biases from flow cell to flow cell variation.

Determining the correct number of sequencing reads per sample is a challenging problem that is subject to vigorous debate. When profiling the chicken transcriptome, Wang et al. showed that 30 million reads are sufficient to obtain reliable measurement of all genes in the genome, whereas 10 million reads allow detection of 80% of genes [17]. Supporting a lower number of reads as optimal for transcriptional profiling, Tarazona et al. report that increased sequencing depth results in more false positives due to increased noise [18]. Others, including the ENCODE consortium, propose that between million reads per sample or greater are required, especially when novel transcripts or splicing events are of particular interest. Clearly, sequencing depth is a rapidly evolving issue without a clear consensus.

Further complicating the matter is the great deal of inter-species variation in transcriptome size. Sequencing a bacterial transcriptome will require far fewer reads than a vertebrate transcriptome [19]. However, estimating transcriptome size is problematic since genome size and transcriptome size are imperfectly correlated, and it is especially difficult for non-model organisms without a sequenced genome. For example, the genome of the laboratory mouse is 2.6 Gb and encodes ~25,000 protein-coding genes, whereas a similar number of genes are encoded in the 1 Gb chicken genome.
Therefore, efforts to sequence the transcriptome of non-model organisms will particularly benefit from a small pilot study to empirically determine acceptable sequencing depth [20].

When alternative splicing is of greater interest, obtaining paired-end sequence data can be more valuable than increasing the number of reads, due to the increased probability of a splice junction falling within or between the sequenced ends [21]. An additional advantage of paired-end RNA-seq that is particularly useful when sequencing cancer transcriptomes is the opportunity to detect chimeric transcripts resulting from gene fusion events [22]. Furthermore, paired-end reads allow greater certainty when an individual read can be mapped to multiple loci on the genome, particularly in repetitive regions.

An online tool called Scotty allows users to design optimal sequencing depth and number of biological replicates whilst simultaneously satisfying user-defined inputs such as maximum cost and required

statistical power [23]. When calculating the required sequencing depth, one should bear in mind the potential for loss of reads to undesired RNA species, chiefly ribosomal RNA (rRNA), as well as reads that cannot be mapped, since both factors can decrease the number of usable reads for downstream analysis by as much as 60-80%.

A number of hybridization-based rRNA depletion approaches have been developed to enrich for less abundant species of RNA. Broadly, enrichment strategies either deplete rRNA or positively select for mRNA. For eukaryote transcriptome analysis using SOLiD or Ion Torrent platforms, polyA+ selection can optionally be performed prior to library construction using, for example, either magnetic bead-conjugated oligo(dT) oligonucleotides (Dynabeads; Life Technologies) or immobilized oligo(dT) capture probes (mRNA Catcher PLUS; Life Technologies). On the other hand, Illumina TruSeq libraries utilize two rounds of magnetic bead-conjugated oligo(dT) capture for polyA+ selection, with the final polyA+ elution step also serving to fragment and prime RNA for downstream cDNA synthesis. During polyA+ enrichment for polyadenylated mRNAs, non-polyadenylated RNA species, including microRNAs, lincRNAs and other macro ncRNAs, are depleted and not represented in the resulting libraries. In contrast, rRNA depletion strategies have been shown to preserve these RNA species [4, 24].

Two of the most frequently used rRNA depletion methods are RiboMinus (Life Technologies) and Ribo-Zero (Epicentre). Both methods utilize a pool of rRNA capture probes followed by spin column or magnetic bead-based collection of the non-rRNA fraction. rRNA capture probes for multicellular eukaryotes (human/mouse compatibility) as well as microorganisms (yeast/bacteria) are available from both vendors. RiboMinus may offer the extra advantage of removing mitochondrial rRNA as well as cytoplasmic rRNA [4].
Finally, rRNA depletion should be the method of choice when sequencing degraded RNA isolates from formalin-fixed paraffin-embedded samples, since polyA+ selection methods assume availability of high quality total RNA to enable isolation of full-length transcripts [12].

Just as rRNA contamination will significantly reduce the number of reads mapping to mRNA, so will strongly expressed mRNAs, such as housekeeping genes, reduce the number of reads mapping to weakly expressed genes. For example, 75% of reads from a human mammary epithelial cell line library map to the most abundant 7% of the transcriptome [25]. Clearly, when the number of reads is at a premium, it would be most useful to have them map to regions of interest. To remove transcripts corresponding to a small number of housekeeping genes, Epicentre extended the Ribo-Zero concept by designing capture oligos to remove globin mRNAs. Exome capture microarrays may also be used to increase sensitivity, although at the cost of reduced quantification accuracy [26]. In contrast, the Rinn laboratory developed RNA CaptureSeq to specifically sequence weakly expressed regions of the transcriptome [27]. Briefly, RNA is hybridized to tiling microarrays containing probes corresponding to genomic regions of interest and the captured RNAs are eluted and sequenced. CaptureSeq allowed ~380-fold enrichment of reads mapping to targeted regions of the transcriptome compared to conventional RNA-seq without capture. Although microarray capture is conventionally used for exome sequencing or targeted re-sequencing of DNA, Levin et al. used microarrays to capture 467 cancer-related genes for targeted RNA sequencing [28]. This approach allowed identification of mutations and fusion transcripts while largely preserving transcript abundance. An important caveat, however, is that inferring a somatic DNA mutation by cDNA sequencing is problematic without careful validation by Sanger sequencing due to widespread RNA editing [29].
Most recently, Life Technologies released a targeted RNA-seq workflow to enable targeted sequencing of over 6000 RNAs and Illumina is planning to release an equivalent workflow soon. Although based on sequencing of short PCR-amplified amplicons, not full-length transcripts of targeted RNAs, directed sequencing will provide comparable information to quantitative RT-PCR.

RNA-seq technologies

As of 2013, the three most widely used NGS platforms for RNA-seq are SOLiD and Ion Torrent, both marketed by Life Technologies, and Illumina's HiSeq. All three platforms have similar sample input requirements and sequence millions of cDNA fragments per run. Below, sample preparation and pertinent application-specific advantages and disadvantages are discussed.

Illumina

Illumina and Ion Torrent both sequence using a sequencing by synthesis (SBS) approach, whereby incorporation of dNTPs is detected simultaneously at millions of fixed positions on a flow cell [30] (Figure 4).

Figure 4. Illumina RNA library preparation. PolyA+ RNA is enriched using oligo(dT) beads followed by fragmentation and reverse transcription. The 5' and 3' ends of cDNA fragments are next prepared to allow efficient ligation of Y adapters containing a unique barcode and primer binding sites. Finally, ligated cDNAs are PCR-amplified and ready for cluster generation and sequencing. Image: David Corney.

For Illumina, once TruSeq RNA-seq libraries have been prepared they are hybridized to a flow cell which contains a lawn of covalently bound oligonucleotides complementary to the sequencing adapters that were introduced during library preparation. Once hybridized, the capture oligonucleotide primes DNA polymerase extension, resulting in a covalently bound full-length complementary copy of the cDNA fragment. This copy is subjected to several rounds of PCR amplification to produce discrete clonal clusters ~1 µm in diameter that can be optically resolved during sequencing. Obtaining optimal cluster density is critical, since it will determine the number of reads obtained. Clearly, low density will result in fewer than expected reads, but over-clustering can be just as problematic, since interference and overlap between adjacent clusters on dense flow cells make accurate base calling difficult. Therefore, accurate quantification of each library using quantitative PCR is an important aspect of library quality control.

In the case of Illumina SBS, all four dNTPs are fluorescently labeled and concurrently introduced on to the flow cell (Figure 5A). Since all four dNTPs are present, natural competition for binding between dNTPs minimizes incorporation biases. SBS proceeds through multiple cycles of nucleotide incorporation and detection.
Importantly, only one nucleotide is incorporated per cycle by use of reversibly terminated dNTPs. After nucleotide incorporation is detected by fluorescence, the fluorophore is removed, resulting in regeneration of

Figure 5. Sequence detection methods of Illumina, Ion Torrent and SOLiD. A. Illumina detection is fluorescence-based using reversible terminator dNTPs, resulting in one nucleotide incorporation per cycle. cDNA fragments are covalently linked to a flow cell and fluorescence is detected with addition of each nucleotide. B. Ion Torrent sequencing by synthesis relies on detection of hydrogen ions (H+) for base calling. Each pH detector well contains one microbead carrying a clonally amplified cDNA fragment. Nucleotides are added sequentially; since nucleotides are not reversibly terminated, incorporation of multiple nucleotides is detected by an increase in the number of hydrogen ions detected. C. SOLiD sequence detection is unique in that fluorescently labeled oligonucleotides are ligated rather than incorporated by a

polymerase. See text for more details. Image: modified from Berglund et al. [154] (image released under a Creative Commons Attribution License).

a 3' hydroxyl group which allows incorporation of the next dNTP in the subsequent cycle. Importantly, this reversible terminator chemistry allows sequencing of homopolymeric regions, such as AAAAAA, with high confidence. During base calling, fluorescence intensity values for each nucleotide are converted to nucleotide identity using a cross-talk matrix which controls for spectral overlap. Since spectral overlap is determined during the first four cycles, it is imperative that approximately equal numbers of each base be present (i.e. to have a balanced library). Therefore, it is especially important to use barcodes that are well balanced to ensure accurate demultiplexing after sequencing. Likewise, use of a dedicated PhiX control lane to estimate correct spectral overlap is strongly recommended when sequencing unbalanced libraries of AT- or GC-rich genomes [31]. Error rate (incorporation of the incorrect nucleotide) progressively mounts with increasing number of cycles; currently up to 150 cycles are supported with an overall error rate of 0.2% [32].

In the first iteration of Illumina sequencing technology, sequencing of only one end of each cDNA fragment was supported. While single-end sequencing remains a very powerful tool, in recent years paired-end sequencing has become the most frequent choice when sequencing the transcriptome. In this case, since the flow cell contains randomly arrayed capture oligos complementary to the 5' and 3' sequencing adapters, during the bridge amplification PCR step, cDNA fragments captured by their 5' adapters can also be captured by 3' capture oligos. This allows for a first sequencing run of up to 150 cycles using the 5' sequencing primer to be followed by a second sequencing run using the 3' primer to obtain a total of 300 nt of sequence per fragment.
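The requirement for balanced barcodes mentioned above can be checked before pooling. The following is a minimal sketch, assuming that "balanced" means every base appears at every barcode position with at least some minimum frequency; the function names and the 10% threshold are arbitrary choices for illustration and not a vendor recommendation.

```python
from collections import Counter


def base_composition_by_cycle(barcodes):
    """Return, for each barcode position, the fraction of A, C, G and T."""
    if len({len(b) for b in barcodes}) != 1:
        raise ValueError("all barcodes must have the same length")
    n = len(barcodes)
    composition = []
    # zip(*barcodes) yields the bases seen at position 1, position 2, ...
    for position_bases in zip(*barcodes):
        counts = Counter(position_bases)
        composition.append({base: counts.get(base, 0) / n for base in "ACGT"})
    return composition


def unbalanced_cycles(barcodes, min_fraction=0.1):
    """List 1-based positions where any base is rarer than min_fraction.

    min_fraction is an assumed, illustrative threshold.
    """
    return [
        cycle
        for cycle, fractions in enumerate(base_composition_by_cycle(barcodes), start=1)
        if min(fractions.values()) < min_fraction
    ]


if __name__ == "__main__":
    print(unbalanced_cycles(["ACGT", "CGTA", "GTAC", "TACG"]))
    print(unbalanced_cycles(["AAAA", "AAAC", "AAAG", "AAAT"]))
```

The first barcode set has every base at every position, so no cycle is flagged; in the second set the first three positions contain only A and are reported.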
Importantly, several third-party library preparation kits are commercially available that have advantages for certain experiments; for example, Smart-seq [33] reduces the required input to as little as 100 pg total RNA.

Ion Torrent

Whereas Illumina sequencing and cluster generation rely on solid-phase PCR amplification, emulsion PCR is used to prepare Ion Torrent libraries for sequencing. First, the library template is prepared from fragmented RNA. Unlike Illumina, the standard library protocol is strand-specific by default (Figure 6). Next, beads with complementary oligonucleotides are mixed with PCR reagents and a dilute solution of cDNA library, and oil is added to make an emulsion. Ideally, each microdroplet of emulsion will contain one bead and one cDNA fragment along with PCR reagents to allow for clonal amplification. Following cycles of PCR the emulsion is broken by organic extraction, and the beads are purified and loaded on to a disposable semiconductor sequencing chip. The sequencing chip is laid out like a honeycomb: each bead fits into one of hundreds of millions of tiny wells that serve as microreactors during sequencing, each with its own detector. Unlike Illumina's fluorescence-based SBS, Ion Torrent determines sequence identity by detecting pH alterations due to hydrogen ion release following nucleotide incorporation (Figure 5B). Since the dNTPs are not differentially labeled by a fluorophore, they must be added successively so that ion release can be associated with a particular nucleotide. Since Ion Torrent sequencing isn't reliant on optical detection of dNTP incorporation, sequencing reactions are much faster and the number of reads obtainable per sequencing run has been rapidly increasing.
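The successive addition of nucleotides can be pictured as a "flowgram", in which each nucleotide flow yields a signal proportional to the number of bases incorporated. The sketch below is an idealised, noise-free toy model: the cyclic flow order `TACG`, the function names, and the treatment of `template` as the strand being synthesised are all assumptions made for illustration, not a description of the instrument software.

```python
def flowgram(template, flow_order="TACG", n_flows=12):
    """Return a list of (flowed_base, bases_incorporated) per flow.

    `template` is taken to be the sequence of the strand being synthesised.
    Because the nucleotides are not terminated, a run of identical bases is
    incorporated in a single flow and gives a proportionally larger signal.
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(template) and template[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def call_bases(signals):
    """Convert ideal flow signals back into a base sequence."""
    return "".join(base * run for base, run in signals)


if __name__ == "__main__":
    flows = flowgram("TTAGC", n_flows=8)
    print(flows)
    print(call_bases(flows))
```

In this ideal model the two consecutive T bases produce a single flow with a signal of 2; a real instrument must estimate that integer from a noisy measurement.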
However, whereas Illumina makes use of reversible terminator chemistry to restrict dNTP incorporation to once per cycle and sequence through homopolymers, Ion Torrent relies on the number of hydrogen ions released being proportional to the number of dNTPs incorporated. Therefore, A can easily be distinguished from AA by detecting a doubling in the number of hydrogen ions released. However, distinguishing between a run of 7 and 8 adenosines is far more challenging and consequently the error rate is high (1.7%) [32].

SOLiD

Ion Torrent and SOLiD RNA library preparation share the same molecular biology (Figure 6), although the adapter sequences are

different.

Figure 6. Ion Torrent and SOLiD libraries are both prepared using similar protocols. Briefly, partly degenerate guide adapters hybridize the fragmented target RNA to allow splint ligation of 5' and 3' adapters with defined sequences. Next, cDNA is synthesized and amplified by PCR to add additional required sequences, followed by emulsion PCR on microbeads. Image: David Corney.

In contrast to Illumina/Ion Torrent, SOLiD uses a sequencing by ligation approach to obtain billions of reads per sequencing run, each up to 75 bp in length. First, emulsion PCR is performed and beads containing clonally amplified cDNA fragments are attached to the surface of a sequencing flow cell. Sequencing takes place during several rounds of ligation reactions. In the first round, a sequencing primer is annealed and a mixture of 16 fluorescently labeled 8-mer oligonucleotides is added (Figure 5C). The 16 oligonucleotides represent all possible combinations of the first two nucleotides (AC, AG, AT etc.), whereas bases 3-5 are degenerate and unknown. The final three 3' bases are conjugated to one of four fluorescent labels, each with a different excitation and emission spectrum. Therefore, each fluorophore represents four dinucleotides and in each ligation reaction, the identity of only the first two nucleotides is interrogated. As a result, after one round of ligation the identity of these two bases is narrowed down but not known. To determine their true identity, the original primer and ligated oligos are removed, a second, n-1, primer is annealed and a new round of ligation is performed. By combining knowledge from two rounds of interrogation, the identity of the first base is confirmed. The identity of the next base is confirmed using an n-2 primer, and so forth, until an n-4 primer is used. In practice, the final three 3' bases of each 8-mer oligonucleotide are cleaved after each ligation to remove the fluorophore and provide a 5' phosphate for a second ligation reaction.
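The statement that each fluorophore represents four dinucleotides can be made concrete with the two-base ("colour space") encoding commonly described for SOLiD. The sketch below assumes the usual scheme in which bases are numbered A=0, C=1, G=2, T=3 and the colour of a dinucleotide is the bitwise XOR of its two bases; this particular mapping and the function names are not given in the text above and are stated here as assumptions.

```python
# Assumed numeric coding of bases; the colour of a dinucleotide is the XOR
# of its two base codes, so each of the four colours covers four dinucleotides.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(sequence):
    """Translate a base sequence into one colour per overlapping dinucleotide."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Recover the bases from colours, given a known first base.

    A colour alone only narrows a dinucleotide down to four candidates;
    knowing the preceding base resolves the next one.
    """
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def dinucleotides_for_color(color):
    """List the dinucleotides that share a given colour."""
    return sorted(
        a + b
        for a in "ACGT"
        for b in "ACGT"
        if BASE_TO_BITS[a] ^ BASE_TO_BITS[b] == color
    )


if __name__ == "__main__":
    colors = encode_colors("ATGGC")
    print(colors)
    print(decode_colors("A", colors))
    print(dinucleotides_for_color(0))
```

Because every base contributes to two adjacent colours, a single miscalled colour changes all downstream decoded bases, which is one reason colour-space reads are usually compared to a reference in colour space rather than decoded first.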
After 5-7 cycles of ligation, fluorophore detection and fluorophore cleavage, a reset is performed and the next primer (i.e. n-1) is used for another 5-7 cycles. SOLiD sequencing, therefore, has the advantage of interrogating each nucleotide twice and accordingly has reduced errors (<0.1%) during base calling, but at the cost of shorter read length.

Non-coding RNA-seq

Until now, this review has largely focused on identification and quantification of the small proportion of the transcriptome that has coding potential. However, RNA-seq has been applied to study non-coding RNAs, such as microRNAs and lincRNAs, and even used to discover a new class of non-coding circular transcripts (circRNAs) [34-36]. To gain a complete picture of the transcriptome, biologists may combine coding and non-coding RNA-seq data. In terms of experimental design and sequencing chemistry, the sequencing requirements for non-coding RNA-seq are mostly the same as for mRNA-seq.

MicroRNAs (miRNAs) are short pieces of RNA which direct post-transcriptional gene silencing of their targets by imperfect hybridization to the 3' UTRs of mRNAs. Mature miRNAs are typically nt in size and are generated by two cleavage events: first, cleavage of a nuclear primary transcript, which may be up to several kilobases in length, and second, cleavage of the cytoplasmic intermediate hairpin precursor that is approximately 70 nt [37]. Due to rapid processing and turnover, precursor transcripts are sequenced infrequently and most attention has been paid to the mature form. However, identical protocols have been successfully used to sequence precursor and mature miRNAs [38]. In contrast to Sanger sequencing, which identified only the most strongly expressed miRNAs, NGS can identify weakly expressed miRNAs as well as reveal heterogeneity in length and sequence [39]. As for mRNA-seq, obtaining a good miRNA-seq library begins with obtaining good quality total RNA.
It is crucial that the RNA isolation procedure preserve the integrity of small RNAs. Indeed, the fact that frequently used spin columns did not retain RNAs < 200 nt largely hindered their discovery. Careful work from the Kim laboratory has shown that although Trizol retains small RNAs, it is a poor choice when a low number of cells is used as starting material, since miRNAs with low GC content are selectively depleted [40]. Short RNA sequencing is not restricted to miRNAs; piwi-interacting RNAs (piRNAs) were also identified and characterized using RNA-seq [41]. Except for fragmentation,

which is omitted, the stages involved in small RNA library preparation are similar to conventional RNA-seq (Figure 7). The latest version of Illumina miRNA library preparation makes use of the 5' monophosphate and 3' hydroxyl groups to specifically ligate miRNAs in a reaction containing total RNA [42], whereas short RNA enrichment by polyacrylamide gel electrophoresis or magnetic bead purification is required or strongly suggested for Ion Torrent and SOLiD libraries. Following RNA adapter ligation, miRNAs are reverse transcribed, amplified by PCR and sequenced. A number of miRNA-focused analysis platforms are freely available [39]. Briefly, reads should be trimmed of barcode and adapter sequences and mapped, either to known miRNAs in miRBase [43] or to a reference genome for novel miRNA discovery. In contrast to mapping of mRNA-seq data, using a splicing-aware aligner is not necessary and BWA [44] or Bowtie [45] may be used.

Like miRNAs, lincRNAs make up part of the non-coding assortment of RNAs within eukaryotic cells, although their function is more heterogeneous and less well defined compared to miRNAs [46, 47]. With regard to sequencing of lincRNAs, paired-end RNA-seq is typically most useful [48, 49]. However, since lincRNAs are frequently antisense to known genes, it is important to know the strandedness of mapped reads, which cannot be known using conventional library preparation methods. To maintain strandedness, either the 5' and 3' adapter sequences must be unique, or the first or second cDNA strand must be biochemically marked during library preparation, typically by substituting dUTP for dTTP to enable UDG-mediated degradation of dUTP-containing DNA [50]. LincRNAs are represented in RNA-seq libraries previously subjected to polyA selection, although omitting this step may allow identification of additional RNAs [48].
Identification of novel lincRNAs is performed computationally by ab initio transcriptome reconstruction in combination with a consideration of epigenetic markers of active transcription, as previously described [47, 51]. Likewise, circRNA transcripts were identified computationally by searching for exon scrambling in paired-end RiboMinus RNA-seq data [35].

Figure 7. MicroRNAs are sequenced by ligating RNA adapters to each end of the mature microRNA followed by reverse transcription and PCR (RT-PCR). To enable barcoding, two sequencing reactions are performed using two sequencing primers: primer one to obtain the microRNA sequence and primer two to obtain the barcode sequence. Image adapted from Nieuwerburgh et al. [155].

While not a class of non-coding genes in and of themselves, the poly(A) tails of mRNA transcripts are not translated to protein. Following RNA polymerase II-dependent transcription of most mRNAs, a stretch of untemplated adenosine monophosphates is added to the transcript following cleavage, through the action of cleavage and polyadenylation specificity factor (CPSF) and poly(A) polymerase. A long-standing question has been the relationship between poly(A) tail length, transcript stability and translatability. However, owing to the difficulty of sequencing through homopolymeric (< 50 nt of sequential adenosines) regions, by both Sanger and next-generation sequencing, this question has only been studied indirectly. A recent innovative approach from the Kim laboratory, which combines sequencing technologies with complex statistical analysis, has provided much insight [52]. In their method, called TAIL-seq, total RNA is first depleted of rRNAs and small ncRNAs, and a biotinylated 3' adapter is ligated to the remaining mRNAs and long non-coding RNAs.
Next, the nuclease RNase T1, which at low concentration specifically cuts after G residues (and not within poly(A) tails), is incubated with the ligated RNAs, followed by pull-down with streptavidin beads to enrich for 3' adapter-ligated RNA fragments. Following 5' adapter ligation, reverse transcription and PCR amplification, libraries are sequenced from both ends. The first read provides 51 nt of sequence for mapping purposes, while the second read, up to 231 nt in length, provides tail length as indicated by a stretch of thymine nucleotides (corresponding to the poly(A) tail before reverse transcription). Although libraries are sequenced on the Illumina platform, which is better suited to sequencing homopolymeric regions, incomplete cleavage of the thymine reversible terminator fluorophores results in a persisting thymine fluorescence signal in subsequent cycles, making non-poly(T) nucleotides largely indistinguishable from true poly(T) stretches. However, the transition from poly(T) to non-poly(T) stretches was accompanied by an increase in non-T signal. By using a Gaussian mixture hidden Markov model to detect the position of this transition, poly(A) tail length can be measured with extraordinary resolution and accuracy at genome-wide scale. Ultimately, this technique revealed that tail length correlates with mRNA half-life, but not translational efficiency. Furthermore, TAIL-seq for the first time identified widespread uridylation and guanylation of mammalian mRNAs.

Data Analysis

All of the previous steps (experimental design, isolation of RNA and preparation of libraries) firmly reside within the skill set of the traditional wet lab biologist. In contrast, biologists may be less familiar with the techniques and approaches used to analyze the resulting RNA-seq data. One of the first challenges new RNA-seq researchers will face is the data deluge problem: the compressed single-end sequencing data from one flow cell of an Illumina HiSeq 2500 might be 20 GB, and twice as large once uncompressed to allow for processing and manipulation. Learning to handle and manipulate these large files will be one of the first tasks for the novice bioinformatician. Fortunately, there is a wealth of tools generated by biostatisticians and computational scientists to allow biologists to handle, manipulate and understand their RNA-seq data.

These tools are split into two groups. Researchers wishing to answer a relatively simple question, such as identifying genes differentially expressed between a cohort of mutants and controls, may consider commercial tools such as those offered by CLC bio and Partek. The main advantage of these proprietary tools is the user-friendly, one-step means of obtaining differentially expressed genes, etc., with a dedicated technical support team for assistance with troubleshooting and data interpretation.
However, given their proprietary nature it can be difficult to fully understand and evaluate the assumptions being made during each step of analysis. For this reason, these tools will not be reviewed any further here. Instead, the remainder of this review will focus on the second group of tools, which are open source and developed, supported and published by the scientific community in the spirit of collaboration and openness. While the majority of such tools are run from the command line, which might be daunting to the novice, a number of active mailing lists and online support forums exist and are excellent sources of information for beginners and advanced users alike (Appendix). Obtaining even the most cursory understanding of the command line interface, shell commands and scripting will increase the productivity and efficiency of researchers tremendously. However, to aid the beginner and streamline analyses, a number of popular command line RNA-seq analysis tools have been implemented in open, web-based platforms such as Galaxy [53-55] and GeneProf [56].

Manipulating RNA-seq data is computationally intensive and typically requires access to a powerful cluster resource. In many cases, access to these computational resources can be obtained through institutional sequencing/genomic core facilities, and a local instance of Galaxy can be installed. In the absence of a local and dedicated cluster, users may obtain a free account on a public Galaxy server hosted by Penn State University and Emory University [57]. An excellent series of step-by-step video tutorials for typical workflows is also provided on the Galaxy website.
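As a small illustration of handling large sequencing files without first uncompressing them to disk, the following sketch streams a FASTQ file record by record. It assumes the common four-line FASTQ layout (no line-wrapped sequences), and the function names `read_fastq` and `count_reads_and_bases` are hypothetical helpers, not part of any published tool.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (identifier, sequence, quality) tuples from a FASTQ file.

    Files ending in ".gz" are decompressed on the fly, so the
    uncompressed data never has to be written to disk.
    Assumes four lines per record.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break  # end of file (or a truncated final record)
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            yield header[1:], seq, qual  # drop the leading "@"


def count_reads_and_bases(path):
    """Return (number of reads, total number of bases) using constant memory."""
    n_reads = 0
    n_bases = 0
    for _name, seq, _qual in read_fastq(path):
        n_reads += 1
        n_bases += len(seq)
    return n_reads, n_bases
```

Because the generator holds only one record in memory at a time, the same code works for a file of a few kilobytes or many gigabytes.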
Table 1. Feature comparison of RNA-seq quality control software. Several QC-dedicated programs used for raw data identification. Table: Patel & Jain [58]. doi: /journal.pone t001

Tools (columns): A = NGS QC Toolkit v2.2; B = FastQC; C = PRINSEQ lite; D = TagDust; E = FASTX Toolkit; F = SolexaQA v1.10; G = TagCleaner v0.121; H = CANGS v1.1. Numbers in parentheses refer to the table notes.

Feature                                     A      B       C       D      E      F      G       H
Parallelization                             Yes    Yes     No      No     No     No     No      No
Detection of FASTQ variants                 Yes    Yes     Yes     No     No     Yes    No      No
Primer/Adapter removal                      Yes    No (3)  No      Yes    Yes    No     Yes (4) Yes
Homopolymer trimming (Roche 454 data)       Yes    No      No      No     No     No     No      Yes
Paired-end data integrity                   Yes    No      No      No     No     No     No      No
QC of 454 paired-end reads                  Yes    No      No      No     No     No     No      No
Sequence duplication filtering              No     No (5)  Yes     No     Yes    No     No      Yes
Low complexity filtering                    No     No      Yes     No     Yes    No     No      No
N/X content filtering                       No     No (6)  Yes     No     Yes    No     No      Yes
Compatibility with compressed input data    Yes    Yes     No      No     No     No     No      No
GC content calculation                      Yes    Yes     Yes     No     No     No     No      No
File format conversion                      Yes    No      No      No     No     No     No      No
Export HQ and/or filtered reads             Yes    No      Yes     Yes    Yes    No     Yes     Yes
Graphical output of QC statistics           Yes    Yes     No (7)  No     Yes    Yes    No (7)  No

Supported NGS platforms: Illumina, Illumina, 454 Illumina, 454 Illumina, 454 Illumina Illumina Illumina,
Dependencies: Perl modules: Parallel::ForkManager, String::Approx, GD::Graph (optional); Perl module: GD::Graph; BLAST, R, matrix2png; -; NCBI nr database

Table notes: 1 Standalone version. 2 Data of any platform in FASTQ file format. 3 Only detection. 4 Only one primer/adapter sequence at a time. 5 Only reports duplication, and only for the first 200,000 reads. 6 Only reports N/X content. 7 Yes, in case of online version.

Quality control

Best practices for analyzing and understanding RNA-seq data clearly depend on the ultimate goal of the experiment: data processing to identify alternatively spliced and/or novel isoforms is very different from a pipeline to perform differential gene expression calling. However, regardless of the eventual goal, all data analysis will begin with quality control and pre-processing. Any major biases present in the raw data produced by the sequencer can be identified using one of several QC-dedicated programs (Table 1) [58]. FastQC [57] and similar tools use the raw sequences provided in FASTQ format (Table 2) and display basic statistics to allow a quick evaluation of whether sequences are as expected.
Reported parameters include the number of reads and GC percentage, per base sequence quality score (a measure of confidence of correct base calling), per base sequence content (a representation of each nucleotide at each base position to visualize position/sequence bias), per base N content (a plot of uncalled nucleotides (Ns) at each base position), duplicate reads (typically a result of PCR over-amplification during library preparation) and overrepresented sequences and k-mers. It is important to evaluate the report in the context of the anticipated results, since QC programs assume sequencing of a random and diverse library, which may not be the case depending on experimental design and library preparation. As mentioned earlier, the base calling error rate is highest during the final cycles of sequencing and it is not uncommon for the per base quality score to be low at those positions (a Phred quality score of 20 equates to a 1% error rate).

Table 2. File types and their sources. Table: David Corney.

File type: FASTQ
Description: Contains nucleotide sequence and corresponding quality scores together with read identifier.
Source of file: Raw output from sequencer.
Reference(s): Cock et al. [59], FASTQ

File type: SAM/BAM
Description: Tab-delimited text file containing read alignment data, flags to indicate number of matches, mismatches and presence of correct mate read (in the case of paired-end reads). Note: BAM is a binary (not directly human viewable) version of a SAM file.
Source of file: Output from aligner (TopHat, STAR, etc.).
Reference(s): Li et al. [60]

File type: GTF
Description: A general purpose tab-delimited file containing information about a list of genes. One gene per line, with characteristics such as feature type (CDS, UTR, intron etc.), start and end coordinates, strandedness and miscellaneous comments.
Source of file: Depending on organism, downloadable from sources such as UCSC, Ensembl etc.
Reference(s): Ensembl, WUSTL GTF2.2

Pre-processing prepares sequences for read alignment. If libraries were barcoded they should be demultiplexed using either the internal barcode sequence or a separate index read sequence. Additionally, a trimming step to remove 3' nucleotides with low quality should be performed. In addition to improving the quality of alignment, removing or trimming reads with low quality base calls expedites mapping and reduces the computational resources consumed during later stages of analysis.

Read alignment

Following completion of any necessary trimming using Cutadapt [57], reads are ready to be aligned ab initio to a reference genome or de novo to a new transcriptome assembly. For most model organisms, aligning to the reference genome is sufficient and will allow quantification of known genes and transcripts. Advantages of alignment to a reference genome include more efficient computing and elimination of contaminating reads, for example from microbial genomes, since they are unlikely to align correctly [61].
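The Phred scale and the 3' quality trimming described in the quality control and pre-processing sections can be sketched as follows. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 gives 1%. The sketch assumes the Phred+33 ASCII encoding; other FASTQ variants use a different offset, and the function names and the default threshold of 20 are illustrative assumptions.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode an ASCII-encoded FASTQ quality string into Phred scores.

    offset=33 is an assumption (Phred+33); adjust for other FASTQ variants.
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime_low_quality(seq, quality_string, min_q=20, offset=33):
    """Remove bases from the 3' end until a base with quality >= min_q is found."""
    quals = decode_qualities(quality_string, offset)
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_string[:end]
```

This simple rule stops at the first acceptable base from the 3' end; dedicated trimmers typically use more robust criteria such as sliding windows or running sums.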
However, alignment to a reference relies on a good quality genome build; any errors, such as genomic deletions or rearrangements, are problematic. Bowtie 2 [62] is the first step of the Tuxedo suite of RNA-seq software and efficiently maps reads to a reference genome. Although it allows for gapped alignment, Bowtie is best suited to aligning genomic DNA reads since it does not consider introns/splicing. A better choice is TopHat 2, which uses Bowtie but additionally analyzes mapping results to identify splice junctions [63]. An alternative is STAR, which is also splicing-aware but reportedly 50 times faster at aligning than TopHat 2, with better alignment precision and sensitivity [64]. Both aligners ultimately generate a BAM file as output, which can be used in subsequent stages of analysis. An in-depth tutorial describes start-to-finish analysis of mapping and differential expression testing using the Tuxedo suite [65] (Figure 8).

Figure 8. Tuxedo suite for RNA-seq differential expression analysis. Pre-processed reads from two groups are mapped by TopHat. The resulting *.bam files are used for transcript assembly by Cufflinks with a given *.gtf file. Individual (sample-specific) assemblies are merged by Cuffmerge to generate one final assembly containing all transcripts identified across all samples. Cuffdiff performs statistical testing to identify differential expression, which can be viewed as a spreadsheet in Excel or visualized using cummeRbund. Image: David Corney.

Computational advances in de novo transcriptome assembly allow RNA-seq analysis of unsequenced genomes [48, 66, 67], although, as early non-model organism transcriptome studies demonstrated, utilizing genomic resources from closely related species as a template can aid assembly [68, 69]. However, to obtain a reliable and useful assembly a higher number of mappable reads is required [61]. For example, in their paper describing the Trinity de novo assembler, Grabherr et al. used 52.6 million read pairs when reconstructing a mouse transcriptome [66]. The quality of the resulting transcriptome assembly can be evaluated in several ways. Assemblies can be directly viewed in the Integrative Genomics Viewer (IGV; [70]), the number of potential full-length transcripts determined using the reference genome of a closely related organism, and the potential coding regions extracted and functionally annotated using TransDecoder and Trinotate, both part of the Trinity package [71].

Given the above advantages and disadvantages of reference-based and de novo-based alignment, some studies have used a combination of both: reads are first mapped to a reference and any reads that fail to align correctly are used for de novo assembly [61].

Gene quantification and differential expression testing

One of the most frequent applications of RNA-seq, analogous to microarray experiments, is to identify differentially expressed genes between two or more groups. The number of reads mapping to each RNA species is linearly related to its abundance within the cell [1]. Therefore, the number of reads discretely mapping to each gene or isoform may be used to infer the level of expression. However, a normalization step must first be performed to account for differences between libraries. The main bias that normalization seeks to resolve is the total library size (i.e. the number of aligned reads/sequencing depth), since this will vary from sample to sample. Additionally, as mentioned earlier, longer transcripts are more likely to be sequenced than short transcripts and read coverage is often not uniform. Several normalization procedures have been developed in recent years, although the relative advantages and disadvantages are still being assessed by the community and no first choice is obvious.
However, a recent report from The French StatOmique Consortium that directly compared the most frequently used normalization techniques sheds some light on the issue [72]. The simplest normalization methods scale read counts that map to each locus by the total number of reads. These include total count, upper quartile and median normalization methods, where the number of gene reads is divided by the total number of mapped reads and multiplied by the total, upper quartile or median number of reads from all sequenced libraries to be analyzed in the experiment, respectively. Two R/Bioconductor [73] packages, DESeq [74] and edgeR [75], implement similar normalization methods that calculate a scaling factor for each sample based on the assumption that the majority of genes are not differentially expressed. The final normalization methods tested by Dillies et al. were quantile normalization, which is typically used for normalization of microarray data, and FPKM normalization.

Unfortunately, the most widely used FPKM-based normalization failed to sufficiently normalize variation between samples, had a high false-positive rate and did not adequately reduce the coefficient of variation of a pool of 30 housekeeping genes [72]. In contrast, DESeq and edgeR normalization both resulted in the lowest coefficient of variation of all methods tested and a low false-positive rate after testing of simulated and real data. However, whether the underlying assumption of DESeq/edgeR, that the majority of genes are not differentially expressed, holds true in all cases is not clear. Overexpressed and underexpressed genes may not balance out, and recent observations of widespread transcriptional amplification in cells overexpressing c-Myc highlight the need for careful decision making when choosing a normalization method [76-78].
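To make the difference between the normalization families above concrete, the sketch below shows total count scaling and a median-of-ratios size factor in the spirit of the DESeq approach [74]. It is a simplified illustration in plain Python, not the code of either R package; the function names and the input layout (a dictionary of per-sample count lists in the same gene order) are assumptions made for this example.

```python
import math
from statistics import median


def total_count_normalize(counts):
    """Scale each library so that all libraries share the mean total count.

    counts: dict mapping sample name -> list of per-gene read counts,
    with genes in the same order for every sample.
    """
    totals = {s: sum(c) for s, c in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: [x * mean_total / totals[s] for x in c] for s, c in counts.items()}


def median_of_ratios_size_factors(counts):
    """Per-sample size factors: median ratio of each gene's count to the
    gene's geometric mean across samples.

    Genes with a zero count in any sample are skipped. The median is robust
    as long as most genes are not differentially expressed.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if min(values) <= 0:
            continue
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][g]) - log_geo_mean)
    return {s: math.exp(median(log_ratios[s])) for s in samples}
```

Dividing each sample's counts by its size factor puts the libraries on a common scale. If a few highly expressed genes change dramatically, total count scaling shifts every other gene, whereas the median-based factor is largely unaffected, which is the motivation given above for this class of methods.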
Alternative strategies might be to revisit normalization after differential expression testing, removing differentially expressed genes prior to estimation of scaling factors [79], or to make use of synthetic RNA spike-in transcripts [78].

Following library normalization, statistical testing for differential expression can be performed. Two widely used counts-based workflows make use of the R/Bioconductor statistical computing environment [73]. In addition to performing count normalization, edgeR and DESeq both test for differential gene expression using a negative binomial distribution [74, 75]. First, a matrix containing the number of reads corresponding to each gene of interest for every sample is prepared using a Python script called HTSeq-count [57]. Next, edgeR or DESeq uses the counts table with biological replicates to calculate variation and test for statistically significant differential expression. Both tools can be operated at the command line or in the MultiExperiment Viewer (MeV) software, which has a convenient graphical user interface [80]. Importantly, both methods are able to make use of ANOVA-like generalized linear models (GLMs) to analyze complex experimental designs. This enables users to control for any known batch effects introduced during library preparation, as well as to analyze time course experiments and experiments with more than two groups. Additionally, both methods use the Benjamini-Hochberg procedure to control the false discovery rate (FDR) associated with multiple hypothesis testing.

On the other hand, Cufflinks/Cuffdiff [81, 82], part of the Tuxedo suite, are FPKM-based. The Cufflinks component attempts to assemble aligned reads into transcripts, isoforms and genes whilst simultaneously identifying transcriptional start sites (TSSs), whereas Cuffdiff tests for statistically significant differences in expression in the Cufflinks output. Although unable to apply GLMs, Cuffdiff has a number of advantages over DESeq/edgeR. In particular, for those uncomfortable using R/Bioconductor and the command line, Cufflinks/Cuffdiff, along with all of the other tools in the Tuxedo suite, are implemented in Galaxy. A major technical advantage of Cuffdiff is its ability to detect and test differential isoform expression, but with the significant disadvantage that GLMs and complex multi-group comparisons are not supported.

Quantification of isoforms is arguably one of the most challenging, but important, aspects of RNA-seq experiments, since quantifying expression at the gene level may mask alterations at the transcript level if two or more isoforms have opposite expression patterns. During alignment, reads frequently map to multiple regions, which leads to ambiguity when deciding the true origin of the read. By default, Cufflinks will uniformly divide each multi-mapped read among all the positions to which it maps. Optionally, multi-read correction can be performed to probabilistically assign reads. However, ambiguity also exists when considering multiple isoforms transcribed from a single open reading frame (ORF). Consider the example of a full-length transcript and a truncated transcript resulting from alternative polyadenylation (Figure 9).
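The Benjamini-Hochberg procedure mentioned above for controlling the FDR can be sketched in a few lines. This is a generic illustration of the adjustment applied to a list of per-gene p-values, not the code used by any of the packages discussed; the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each p-value is multiplied by m / rank (m = number of tests, rank = its
    position when sorted ascending), then a running minimum from the largest
    rank downwards enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below the chosen FDR threshold (for example 0.05) are then reported as differentially expressed.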
Although assignment of reads mapping to the 3' exons of the full-length transcript is unambiguous, 5' exon reads could originate from either the full-length or truncated transcript. Cuffdiff attempts to resolve this problem by probabilistic deconvolution to assign reads to the correct isoform. Cuffdiff may therefore be more appropriate than edgeR/DESeq when characterizing transcriptomes where alternative splicing is frequent, since a constant number of reads per gene might mask differential expression of two or more isoforms. However, the landscape of tools available for isoform quantification is rapidly evolving and the authors of DESeq have recently published a variation of their negative binomial method called DEXSeq [83]. Rather than testing for a significant difference in the number of reads per gene between samples, DEXSeq tests whether individual exons are differentially expressed. During testing against Cuffdiff (version 1.3), the DEXSeq authors observed a far higher number of false positives identified by Cuffdiff. However, shortly after publication, a new version of Cuffdiff (version 2.0) was published and released [81]. Unfortunately, to date a comprehensive direct comparison of DEXSeq to Cuffdiff 2.0 has not been published.

Data visualization and higher level analysis

After obtaining a list of differentially expressed genes and/or transcripts, visualization and higher level analysis can proceed analogously to microarray experiments, with minor modifications. The newer versions of MeV [80], which was designed for analysis of microarrays, have several modules specifically designed for RNA-seq data. One such MeV module, which is based on the GOSeq R/Bioconductor package [10], tests for enrichment of Gene Ontology (GO) terms associated with a list of significantly over- or underexpressed genes.
Unlike microarray GO analysis, lists of differentially expressed genes identified by RNA-seq are biased towards longer transcripts, since longer and more strongly expressed transcripts yield more reads and therefore greater statistical power to be called differentially expressed. At the same time, Young et al. observed that some GO categories are enriched for short or long transcripts [10]. The GOSeq method attempts to control for length bias and, optionally, selection bias. Since microarray analysis is not subject to the same length bias, samples analyzed by microarray were compared to RNA-seq gene lists with and without bias correction. The GOSeq bias correction method resulted in more GO categories consistent with the microarray interrogation [10]. The same GOSeq method is available as a stand-alone package in R/Bioconductor, along with numerous packages for hierarchical clustering, preparation of heatmaps, principal component analysis (PCA) and visualization. CummeRbund, the final component of the Tuxedo suite, is an R/Bioconductor package specifically designed to visualize the Cuffdiff output and offers much of the same functionality [65].

Figure 9. Paired-end reads are obtained and mapped to a reference genome using a splice-aware aligner, such as TopHat or STAR. In this example, two isoforms are transcribed from a single gene. Although reads mapping to exon 4 are certain to originate from the long isoform, the assignment of remaining reads is ambiguous and must be handled probabilistically, for example using Cufflinks. Image: David Corney.

Emerging technologies

Much has been made of the fact that recent advances in NGS productivity outpace Moore's Law, the prediction originally applied to computing that states that the number of transistors per microprocessor will double every two years (Figure 10). Per base sequencing costs have decreased significantly, concomitant with increased sequencing output. Output increases are largely due to an increased number of reads and longer read length, and this trend continues apace. However, some of the most exciting emerging technologies will potentially address the known biases and disadvantages of RNA-seq, such as eliminating fragmentation, reverse transcription and PCR amplification biases and reducing input requirements.

Sample input requirements are currently at the stage where RNA of a single eukaryotic cell may be sequenced with existing technologies [33, 84-87]. However, these approaches have relied on various degrees of amplification by PCR or in vitro transcription. A sequencing method that does away with the need to amplify or convert RNA to cDNA will allow an undistorted picture of the transcriptome.
Direct RNA sequencing (DRS) from Helicos BioSciences has paved the way to achieving this feat by capturing polyadenylated RNAs on a flow cell followed by sequencing by synthesis using fluorescently labeled nucleotide analogs, akin to Illumina sequencing [88-90]. Ultimately, the DRS system has not proved to be commercially viable due to a combination of high error rate, short read length and an inability to carry out paired-end sequencing. However, Helicos' intellectual property rights have been licensed to Illumina and Life Technologies, raising the exciting possibility of Illumina/Ion Torrent-like DRS in the near future.

The ability to sequence longer transcripts, and eventually full-length transcripts, would remove the uncertainty when quantifying alternatively spliced genes. In addition to allowing long sequencing reactions, nanopore sequencing might also permit sequencing of RNA without amplification [91]. In nanopore sequencing either biological (e.g. α-hemolysin [92]) or synthetic (e.g. graphene [93]) nanopores are embedded in a synthetic polymer membrane. An electric current is applied across the nanopore; as each nucleotide passes through the nanopore a nucleotide-specific disruption in charge is detected. Elimination of the requirement to use fluorescence allows far higher density of nanopores, faster sequencing and hence greater sequence output. Oxford Nanopore Technologies are developing an α-hemolysin nanopore-based sequencer that has already been demonstrated to determine sequence identity of single-stranded DNA [92]. Although sequencing of only up to 85-mer DNA oligos was reported in their paper, in a 2012 press release sequencing of the entire 48 kb lambda genome as a single, complete fragment was disclosed [94].

Figure 10. In recent years, sequencing costs per megabase have decreased faster than Moore's Law. Image: National Human Genome Research Institute.
Whether sequencing of full-length RNA or cDNA fragments can be similarly achieved remains to be determined. In the meantime, long reads can be obtained by 454 and Pacific Biosciences sequencing platforms (1 kb and > 10 kb, respectively), although relatively few reads are generated and these platforms are therefore better suited to genome sequencing.

Automatic data analysis workflows

As described above, RNA-seq data analysis is performed in a number of consecutive steps for which numerous tools and computational and statistical approaches have been developed and continue to become available. Trimmomatic [95] for read trimming and adapter removal; FastQC [57] for data set quality control; Bowtie [45] and TopHat [96] to align reads to a reference genome; HTSeq [97] to count transcript abundance; and Cufflinks [82] and edgeR [75] to identify and quantitate expressed genes and transcripts are only a few examples, some of which were described in detail above. A more comprehensive, continuously updated list is also available for reference. However, the problem of choosing the right combination of tools for each particular experimental goal remains.

As of 2015, developing user-friendly analysis pipelines became the focus of more and more research groups. Efforts were even directed to creating toolboxes for building such automatic analysis paths (ViennaNGS [98], NEAT [99]), guides for creating and using computing workflows [100], [101] and RNASeqGUI [102], and explicit demonstrations of data analysis [103]. However, many of these instruments still require validation on real data sets and confirmation with biological experimental results.

Below, we describe applications for which tools have been developed or improved recently. We do not attempt to create an exhaustive list, as the field is rapidly expanding and more and more examples become available every day. Quantification of gene expression under different experimental conditions, identification of novel transcripts, alternatively spliced sites or editing events, and the detection of RNA-fusion transcripts are major applications of RNA-Seq.
To these, one can add the detection of non-coding and small RNAs, single-cell RNA-seq and various others.

Pipelines for gene identification and differential gene expression

Most of the current analysis workflows have been developed for the study of eukaryotic RNA, but also several analysis

Figure 11. Overview of the transcript-compatibility counts (TCC) method. An scRNA-seq example with K cells (only the reads coming from Cell1 and Cell2 are shown here) and a reference transcriptome consisting of three transcripts, t1, t2 and t3, are used for exemplification. Conventional approach: Single cells are clustered based on their transcript or gene abundances (here we only focus on transcripts for concreteness). This widely adopted pipeline involves computing a (#transcripts x #cells) expression matrix by first aligning each cell's reads to the reference. The corresponding alignment information is next to each read, which for the purpose of illustration only contains the mapped positions (the aligned reads of Cell1 are also annotated directly on the transcripts). While reads 1 and 5 are uniquely mapped to transcripts 1 and 3, reads 2, 3 and 4 are mapped to multiple transcripts (multi-mapped reads). The quantification step must therefore take into account a specific read-generating model and handle multi-mapped reads accordingly. The proposed method: Single cells are clustered based on their transcript-compatibility counts. Our method assigns the reads of each cell to equivalence classes via the process of pseudoalignment and simply counts the number of reads that fall in each class to construct a (#eq.classes x #cells) matrix of transcript-compatibility counts. Then, the method proceeds by directly using the transcript-compatibility counts for downstream processing and single cell clustering. The underlying idea here is that even though equivalence classes may not have an explicit biological interpretation, their read counts can collectively


Transcriptomics analysis with RNA seq: an overview Frederik Coppens Transcriptomics analysis with RNA seq: an overview Frederik Coppens Platforms Applications Analysis Quantification RNA content Platforms Platforms Short (few hundred bases) Long reads (multiple kilobases)

More information

Sequence Analysis 2RNA-Seq

Sequence Analysis 2RNA-Seq Sequence Analysis 2RNA-Seq Lecture 10 2/21/2018 Instructor : Kritika Karri kkarri@bu.edu Transcriptome Entire set of RNA transcripts in a given cell for a specific developmental stage or physiological

More information

Wet-lab Considerations for Illumina data analysis

Wet-lab Considerations for Illumina data analysis Wet-lab Considerations for Illumina data analysis Based on a presentation by Henriette O Geen Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center Complementary Approaches Illumina

More information

TECH NOTE Pushing the Limit: A Complete Solution for Generating Stranded RNA Seq Libraries from Picogram Inputs of Total Mammalian RNA

TECH NOTE Pushing the Limit: A Complete Solution for Generating Stranded RNA Seq Libraries from Picogram Inputs of Total Mammalian RNA TECH NOTE Pushing the Limit: A Complete Solution for Generating Stranded RNA Seq Libraries from Picogram Inputs of Total Mammalian RNA Stranded, Illumina ready library construction in

More information

Wheat CAP Gene Expression with RNA-Seq

Wheat CAP Gene Expression with RNA-Seq Wheat CAP Gene Expression with RNA-Seq July 9 th -13 th, 2018 Overview of the workshop, Alina Akhunova http://www.ksre.k-state.edu/igenomics/workshops/ RNA-Seq Workshop Activities Lectures Laboratory Molecular

More information

SO YOU WANT TO DO A: RNA-SEQ EXPERIMENT MATT SETTLES, PHD UNIVERSITY OF CALIFORNIA, DAVIS

SO YOU WANT TO DO A: RNA-SEQ EXPERIMENT MATT SETTLES, PHD UNIVERSITY OF CALIFORNIA, DAVIS SO YOU WANT TO DO A: RNA-SEQ EXPERIMENT MATT SETTLES, PHD UNIVERSITY OF CALIFORNIA, DAVIS SETTLES@UCDAVIS.EDU Bioinformatics Core Genome Center UC Davis BIOINFORMATICS.UCDAVIS.EDU DISCLAIMER This talk/workshop

More information

Experimental Design. Dr. Matthew L. Settles. Genome Center University of California, Davis

Experimental Design. Dr. Matthew L. Settles. Genome Center University of California, Davis Experimental Design Dr. Matthew L. Settles Genome Center University of California, Davis settles@ucdavis.edu What is Differential Expression Differential expression analysis means taking normalized sequencing

More information

RNA-Seq Analysis. Simon Andrews, Laura v

RNA-Seq Analysis. Simon Andrews, Laura v RNA-Seq Analysis Simon Andrews, Laura Biggins simon.andrews@babraham.ac.uk @simon_andrews v2018-10 RNA-Seq Libraries rrna depleted mrna Fragment u u u u NNNN Random prime + RT 2 nd strand synthesis (+

More information

Integrated NGS Sample Preparation Solutions for Limiting Amounts of RNA and DNA. March 2, Steven R. Kain, Ph.D. ABRF 2013

Integrated NGS Sample Preparation Solutions for Limiting Amounts of RNA and DNA. March 2, Steven R. Kain, Ph.D. ABRF 2013 Integrated NGS Sample Preparation Solutions for Limiting Amounts of RNA and DNA March 2, 2013 Steven R. Kain, Ph.D. ABRF 2013 NuGEN s Core Technologies Selective Sequence Priming Nucleic Acid Amplification

More information

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ), Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ), 2012-01-26 What is a gene What is a transcriptome History of gene expression assessment RNA-seq RNA-seq analysis

More information

Non-Organic-Based Isolation of Mammalian microrna using Norgen s microrna Purification Kit

Non-Organic-Based Isolation of Mammalian microrna using Norgen s microrna Purification Kit Application Note 13 RNA Sample Preparation Non-Organic-Based Isolation of Mammalian microrna using Norgen s microrna Purification Kit B. Lam, PhD 1, P. Roberts, MSc 1 Y. Haj-Ahmad, M.Sc., Ph.D 1,2 1 Norgen

More information

Next-Generation Sequencing. Technologies

Next-Generation Sequencing. Technologies Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

More information

RNA-Sequencing analysis

RNA-Sequencing analysis RNA-Sequencing analysis Markus Kreuz 25. 04. 2012 Institut für Medizinische Informatik, Statistik und Epidemiologie Content: Biological background Overview transcriptomics RNA-Seq RNA-Seq technology Challenges

More information

Matthew Tinning Australian Genome Research Facility. July 2012

Matthew Tinning Australian Genome Research Facility. July 2012 Next-Generation Sequencing: an overview of technologies and applications Matthew Tinning Australian Genome Research Facility July 2012 History of Sequencing Where have we been? 1869 Discovery of DNA 1909

More information

Computational & Quantitative Biology Lecture 6 RNA Sequencing

Computational & Quantitative Biology Lecture 6 RNA Sequencing Peter A. Sims Dept. of Systems Biology Dept. of Biochemistry & Molecular Biophysics Sulzberger Columbia Genome Center October 27, 2014 Computational & Quantitative Biology Lecture 6 RNA Sequencing We Have

More information

Lecture 7. Next-generation sequencing technologies

Lecture 7. Next-generation sequencing technologies Lecture 7 Next-generation sequencing technologies Next-generation sequencing technologies General principles of short-read NGS Construct a library of fragments Generate clonal template populations Massively

More information

SMARTer Ultra Low RNA Kit for Illumina Sequencing Two powerful technologies combine to enable sequencing with ultra-low levels of RNA

SMARTer Ultra Low RNA Kit for Illumina Sequencing Two powerful technologies combine to enable sequencing with ultra-low levels of RNA SMARTer Ultra Low RNA Kit for Illumina Sequencing Two powerful technologies combine to enable sequencing with ultra-low levels of RNA The most sensitive cdna synthesis technology, combined with next-generation

More information

Differential gene expression analysis using RNA-seq

Differential gene expression analysis using RNA-seq https://abc.med.cornell.edu/ Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, March 2018 Friederike Dündar with Luce Skrabanek & Paul Zumbo Day 1: Introduction into high-throughput

More information

Introduction of RNA-Seq Analysis

Introduction of RNA-Seq Analysis Introduction of RNA-Seq Analysis Jiang Li, MS Bioinformatics System Engineer I Center for Quantitative Sciences(CQS) Vanderbilt University September 21, 2012 Goal of this talk 1. Act as a practical resource

More information

RNAseq Differential Gene Expression Analysis Report

RNAseq Differential Gene Expression Analysis Report RNAseq Differential Gene Expression Analysis Report Customer Name: Institute/Company: Project: NGS Data: Bioinformatics Service: IlluminaHiSeq2500 2x126bp PE Differential gene expression analysis Sample

More information

Welcome to the NGS webinar series

Welcome to the NGS webinar series Welcome to the NGS webinar series Webinar 1 NGS: Introduction to technology, and applications NGS Technology Webinar 2 Targeted NGS for Cancer Research NGS in cancer Webinar 3 NGS: Data analysis for genetic

More information

Next Gen Sequencing. Expansion of sequencing technology. Contents

Next Gen Sequencing. Expansion of sequencing technology. Contents Next Gen Sequencing Contents 1 Expansion of sequencing technology 2 The Next Generation of Sequencing: High-Throughput Technologies 3 High Throughput Sequencing Applied to Genome Sequencing (TEDed CC BY-NC-ND

More information

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center High Throughput Sequencing the Multi-Tool of Life Sciences Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center DNA Technologies & Expression Analysis Cores HT Sequencing (Illumina

More information

Next-generation sequencing technologies

Next-generation sequencing technologies Next-generation sequencing technologies NGS applications Illumina sequencing workflow Overview Sequencing by ligation Short-read NGS Sequencing by synthesis Illumina NGS Single-molecule approach Long-read

More information

RNA-Seq de novo assembly training

RNA-Seq de novo assembly training RNA-Seq de novo assembly training Training session aims Give you some keys elements to look at during read quality check. Transcriptome assembly is not completely a strait forward process : Multiple strategies

More information

How to deal with your RNA-seq data?

How to deal with your RNA-seq data? How to deal with your RNA-seq data? Rachel Legendre, Thibault Dayris, Adrien Pain, Claire Toffano-Nioche, Hugo Varet École de bioinformatique AVIESAN-IFB 2017 1 Rachel Legendre Bioinformatics 27/11/2018

More information

G E N OM I C S S E RV I C ES

G E N OM I C S S E RV I C ES GENOMICS SERVICES ABOUT T H E N E W YOR K G E NOM E C E N T E R NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. Through

More information

RNA standards v May

RNA standards v May Standards, Guidelines and Best Practices for RNA-Seq: 2010/2011 I. Introduction: Sequence based assays of transcriptomes (RNA-seq) are in wide use because of their favorable properties for quantification,

More information

Next Generation Sequencing. Tobias Österlund

Next Generation Sequencing. Tobias Österlund Next Generation Sequencing Tobias Österlund tobiaso@chalmers.se NGS part of the course Week 4 Friday 13/2 15.15-17.00 NGS lecture 1: Introduction to NGS, alignment, assembly Week 6 Thursday 26/2 08.00-09.45

More information

RNA-Seq with the Tuxedo Suite

RNA-Seq with the Tuxedo Suite RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop The Basic Tuxedo Suite References Trapnell C, et al. 2009 TopHat: discovering splice junctions with

More information

02 Agenda Item 03 Agenda Item

02 Agenda Item 03 Agenda Item 01 Agenda Item 02 Agenda Item 03 Agenda Item SOLiD 3 System: Applications Overview April 12th, 2010 Jennifer Stover Field Application Specialist - SOLiD Applications Workflow for SOLiD Application Application

More information

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012 Introduction to transcriptome analysis using High Throughput Sequencing technologies D. Puthier 2012 A typical RNA-Seq experiment Library construction Protocol variations Fragmentation methods RNA: nebulization,

More information

Finding Genes with Genomics Technologies

Finding Genes with Genomics Technologies PLNT2530 Plant Biotechnology (2018) Unit 7 Finding Genes with Genomics Technologies Unless otherwise cited or referenced, all content of this presenataion is licensed under the Creative Commons License

More information

Overview of Next Generation Sequencing technologies. Céline Keime

Overview of Next Generation Sequencing technologies. Céline Keime Overview of Next Generation Sequencing technologies Céline Keime keime@igbmc.fr Next Generation Sequencing < Second generation sequencing < General principle < Sequencing by synthesis - Illumina < Sequencing

More information

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction Lecture 8 Reading Lecture 8: 96-110 Lecture 9: 111-120 DNA Libraries Definition Types Construction 142 DNA Libraries A DNA library is a collection of clones of genomic fragments or cdnas from a certain

More information

Next-generation sequencing and quality control: An introduction 2016

Next-generation sequencing and quality control: An introduction 2016 Next-generation sequencing and quality control: An introduction 2016 s.schmeier@massey.ac.nz http://sschmeier.com/bioinf-workshop/ Overview Typical workflow of a genomics experiment Genome versus transcriptome

More information

Illumina s Suite of Targeted Resequencing Solutions

Illumina s Suite of Targeted Resequencing Solutions Illumina s Suite of Targeted Resequencing Solutions Colin Baron Sr. Product Manager Sequencing Applications 2011 Illumina, Inc. All rights reserved. Illumina, illuminadx, Solexa, Making Sense Out of Life,

More information

1. Introduction Gene regulation Genomics and genome analyses

1. Introduction Gene regulation Genomics and genome analyses 1. Introduction Gene regulation Genomics and genome analyses 2. Gene regulation tools and methods Regulatory sequences and motif discovery TF binding sites Databases 3. Technologies Microarrays Deep sequencing

More information

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility 2018 ABRF Meeting Satellite Workshop 4 Bridging the Gap: Isolation to Translation (Single Cell RNA-Seq) Sunday, April 22 Basics of RNA-Seq (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly,

More information

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome See the Difference With a commitment to your peace of mind, Life Technologies provides a portfolio of robust and scalable

More information

Applications of short-read

Applications of short-read Applications of short-read sequencing: RNA-Seq and ChIP-Seq BaRC Hot Topics March 2013 George Bell, Ph.D. http://jura.wi.mit.edu/bio/education/hot_topics/ Sequencing applications RNA-Seq includes experiments

More information

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia RNA-Seq Workshop AChemS 2017 Sunil K Sukumaran Monell Chemical Senses Center Philadelphia Benefits & downsides of RNA-Seq Benefits: High resolution, sensitivity and large dynamic range Independent of prior

More information

RNA Sequencing. Next gen insight into transcriptomes , Elio Schijlen

RNA Sequencing. Next gen insight into transcriptomes , Elio Schijlen RNA Sequencing Next gen insight into transcriptomes 05-06-2013, Elio Schijlen Transcriptome complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological

More information

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms Next Generation Sequencing Lecture Saarbrücken, 19. March 2012 Sequencing Platforms Contents Introduction Sequencing Workflow Platforms Roche 454 ABI SOLiD Illumina Genome Anlayzer / HiSeq Problems Quality

More information

Application Note Selective transcript depletion

Application Note Selective transcript depletion Application Note Selective transcript depletion Sample Authors Laura de Jager RED Scientist Michael Berry Bioinformatics Scientist Luke Esau RED Senior Scientist Ross Wadsworth RED Team Lead Roche Sequencing

More information

Functional Genomics Research Stream. Research Meetings: November 2 & 3, 2009 Next Generation Sequencing

Functional Genomics Research Stream. Research Meetings: November 2 & 3, 2009 Next Generation Sequencing Functional Genomics Research Stream Research Meetings: November 2 & 3, 2009 Next Generation Sequencing Current Issues Research Meetings: Meet with me this Thursday or Friday. (bring laboratory notebook

More information

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8 Bi 8 Lecture 4 DNA approaches: How we know what we know Ellen Rothenberg 14 January 2016 Reading: from Alberts Ch. 8 Central concept: DNA or RNA polymer length as an identifying feature RNA has intrinsically

More information

NGS in Pathology Webinar

NGS in Pathology Webinar NGS in Pathology Webinar NGS Data Analysis March 10 2016 1 Topics for today s presentation 2 Introduction Next Generation Sequencing (NGS) is becoming a common and versatile tool for biological and medical

More information

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer. DNA Preparation and QC Extraction DNA was extracted from whole blood or flash frozen post-mortem tissue using a DNA mini kit (QIAmp #51104 and QIAmp#51404, respectively) following the manufacturer s recommendations.

More information

Gene Expression Technology

Gene Expression Technology Gene Expression Technology Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu Gene expression Gene expression is the process by which information from a gene

More information

Obtain superior NGS library performance with lower input amounts using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina

Obtain superior NGS library performance with lower input amounts using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina be INSPIRED drive DISCOVERY stay GENINE TECHNICAL NOTE Directional rrna depletion Obtain superior NGS library performance with lower input amounts using the NEBNext ltra II Directional RNA Library Prep

More information

Obtain superior NGS library performance with lower input amounts using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina

Obtain superior NGS library performance with lower input amounts using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina be INSPIRED drive DISCOVERY stay GENINE TECHNICAL NOTE Directional rrna depletion Obtain superior NGS library performance with lower input amounts using the NEBNext ltra II Directional RNA Library Prep

More information

Surely Better Target Enrichment from Sample to Sequencer

Surely Better Target Enrichment from Sample to Sequencer sureselect TARGET ENRICHMENT solutions Surely Better Target Enrichment from Sample to Sequencer Agilent s market leading SureSelect platform provides a complete portfolio of catalog to custom products,

More information

High-quality stranded RNA-seq libraries from single cells using the SMART-Seq Stranded Kit Product highlights:

High-quality stranded RNA-seq libraries from single cells using the SMART-Seq Stranded Kit Product highlights: TECH NOTE High-quality stranded RNA-seq libraries from single cells using the SMART-Seq Stranded Kit Product highlights: Simple workflow starts directly from 1 1,000 cells or 10 pg 10 ng total RNA to generate

More information

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq

Sequencing applications. Today's outline. Hands-on exercises. Applications of short-read sequencing: RNA-Seq and ChIP-Seq Sequencing applications Applications of short-read sequencing: RNA-Seq and ChIP-Seq BaRC Hot Topics March 2013 George Bell, Ph.D. http://jura.wi.mit.edu/bio/education/hot_topics/ RNA-Seq includes experiments

More information

TECH NOTE SMARTer T-cell receptor profiling in single cells

TECH NOTE SMARTer T-cell receptor profiling in single cells TECH NOTE SMARTer T-cell receptor profiling in single cells Flexible workflow: Illumina-ready libraries from FACS or manually sorted single cells Ease of use: Optimized indexing allows for pooling 96 cells

More information

Applied Biosystems SOLiD 3 Plus System. RNA Application Guide

Applied Biosystems SOLiD 3 Plus System. RNA Application Guide Applied Biosystems SOLiD 3 Plus System RNA Application Guide For Research Use Use Only. Not intended for any animal or human therapeutic or diagnostic use. TRADEMARKS: Trademarks of Life Technologies Corporation

More information

Parts of a standard FastQC report

Parts of a standard FastQC report FastQC FastQC, written by Simon Andrews of Babraham Bioinformatics, is a very popular tool used to provide an overview of basic quality control metrics for raw next generation sequencing data. There are

More information

Next Generation Sequencing

Next Generation Sequencing Next Generation Sequencing Complete Report Catalogue # and Service: IR16001 rrna depletion (human, mouse, or rat) IR11081 Total RNA Sequencing (80 million reads, 2x75 bp PE) Xxxxxxx - xxxxxxxxxxxxxxxxxxxxxx

More information

HaloPlex HS. Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D.

HaloPlex HS. Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D. HaloPlex HS Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D. Sr. Global Product Manager Diagnostics & Genomics Group Agilent Technologies For Research Use Only. Not for Use in Diagnostic

More information

RNA-Seq Module 2 From QC to differential gene expression.

RNA-Seq Module 2 From QC to differential gene expression. RNA-Seq Module 2 From QC to differential gene expression. Ying Zhang Ph.D, Informatics Analyst Research Informatics Support System (RISS) MSI Apr. 24, 2012 RNA-Seq Tutorials Tutorial 1: Introductory (Mar.

More information

RNA-Seq data analysis course September 7-9, 2015

RNA-Seq data analysis course September 7-9, 2015 RNA-Seq data analysis course September 7-9, 2015 Peter-Bram t Hoen (LUMC) Jan Oosting (LUMC) Celia van Gelder, Jacintha Valk (BioSB) Anita Remmelzwaal (LUMC) Expression profiling DNA mrna protein Comprehensive

More information

Rapid Method for the Purification of Total RNA from Formalin- Fixed Paraffin-Embedded (FFPE) Tissue Samples

Rapid Method for the Purification of Total RNA from Formalin- Fixed Paraffin-Embedded (FFPE) Tissue Samples Application Note 17 RNA Sample Preparation Rapid Method for the Purification of Total RNA from Formalin- Fixed Paraffin-Embedded (FFPE) Tissue Samples M. Melmogy 1, V. Misic 1, B. Lam, PhD 1, C. Dobbin,

More information

Introduction to RNA-Seq

Introduction to RNA-Seq Introduction to RNA-Seq Monica Britton, Ph.D. Bioinformatics Analyst September 2014 Workshop Overview of Today s Activities Morning RNA-Seq Concepts, Terminology, and Work Flows Two-Condition Differential

More information

NGS Data Analysis and Galaxy

NGS Data Analysis and Galaxy NGS Data Analysis and Galaxy University of Pretoria Pretoria, South Africa 14-18 October 2013 Dave Clements, Emory University http://galaxyproject.org/ Fourie Joubert, Burger van Jaarsveld Bioinformatics

More information

Introduction to RNA-Seq

Introduction to RNA-Seq Introduction to RNA-Seq Monica Britton, Ph.D. Sr. Bioinformatics Analyst March 2015 Workshop Overview of RNA-Seq Activities RNA-Seq Concepts, Terminology, and Work Flows Using Single-End Reads and a Reference

More information

Exploring of microrna markers for body fluid identification using NGS

Exploring of microrna markers for body fluid identification using NGS Exploring of microrna markers for body fluid identification using NGS Zheng Wang, Yiping Hou Institute of Forensic Medicine Sichuan University, China Barcelona May, 11, 2016 Outline Introduction of Institute

More information

measuring gene expression December 5, 2017

measuring gene expression December 5, 2017 measuring gene expression December 5, 2017 transcription a usually short-lived RNA copy of the DNA is created through transcription RNA is exported to the cytoplasm to encode proteins some types of RNA

More information

TECH NOTE Ligation-Free ChIP-Seq Library Preparation

TECH NOTE Ligation-Free ChIP-Seq Library Preparation TECH NOTE Ligation-Free ChIP-Seq Library Preparation The DNA SMART ChIP-Seq Kit Ligation-free template switching technology: Minimize sample handling in a single-tube workflow >> Simplified protocol with

More information

Genetics and Genomics in Medicine Chapter 3. Questions & Answers

Genetics and Genomics in Medicine Chapter 3. Questions & Answers Genetics and Genomics in Medicine Chapter 3 Multiple Choice Questions Questions & Answers Question 3.1 Which of the following statements, if any, is false? a) Amplifying DNA means making many identical

More information

Contact us for more information and a quotation

Contact us for more information and a quotation GenePool Information Sheet #1 Installed Sequencing Technologies in the GenePool The GenePool offers sequencing service on three platforms: Sanger (dideoxy) sequencing on ABI 3730 instruments Illumina SOLEXA

More information

Supplementary Information for:

Supplementary Information for: Supplementary Information for: A streamlined and high-throughput targeting approach for human germline and cancer genomes using Oligonucleotide-Selective Sequencing Samuel Myllykangas 1, Jason D. Buenrostro

More information

Motivation From Protein to Gene

Motivation From Protein to Gene MOLECULAR BIOLOGY 2003-4 Topic B Recombinant DNA -principles and tools Construct a library - what for, how Major techniques +principles Bioinformatics - in brief Chapter 7 (MCB) 1 Motivation From Protein

More information

RAPID, ROBUST & RELIABLE

RAPID, ROBUST & RELIABLE Roche Sample Prep Solutions for RNA-Seq Sequence what matters RAPID, ROBUST & RELIABLE Sample P le Samp Quant ifi /QC tion ca As the first step in the NGS workflow continuum, sample prep holds the key

More information

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Monica Britton, Ph.D. Sr. Bioinformatics Analyst June 23, 2017 Differential Gene Expression Generalized Workflow File Types

More information

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio

More information

PLNT2530 (2018) Unit 6b Sequence Libraries

PLNT2530 (2018) Unit 6b Sequence Libraries PLNT2530 (2018) Unit 6b Sequence Libraries Molecular Biotechnology (Ch 4) Analysis of Genes and Genomes (Ch 5) Unless otherwise cited or referenced, all content of this presenataion is licensed under the

More information

Single-Cell Whole Transcriptome Profiling With the SOLiD. System

Single-Cell Whole Transcriptome Profiling With the SOLiD. System APPLICATION NOTE Single-Cell Whole Transcriptome Profiling Single-Cell Whole Transcriptome Profiling With the SOLiD System Introduction The ability to study the expression patterns of an individual cell

More information

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1 BST 226 Statistical Methods for Bioinformatics David M. Rocke March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1 NGS Technologies Illumina Sequencing HiSeq 2500 & MiSeq PacBio Sequencing PacBio

More information
