Demo of mRNA NGS Concluding Report



Project: Demo Report
Customer: Dr. Demo
Company/Institute: Exiqon AS
Date: 09-Mar-2015

Performed by
Exiqon A/S
Company Reg.No.(CVR)
Skelstedet 16
DK-2950 Vedbæk
Denmark

Additional files provided with this report:

Content: Sampleinfo.xlsx
Description: Overview of samples and groups.

Content: Pictures
Description: High-resolution copies of the pictures presented in this report (QC plots, volcano plots, heat maps and PCAs).

Content: Data tables (spreadsheet .tsv files)
Description: All tables for genes, isoforms, CDS and TSS. Count data for all samples in tsv tables. Normalized data for all samples (FPKM) in tsv tables. Differential expression of all relevant comparisons in tsv tables. Other relevant .tsv tables (e.g. attribute tables) and GO analysis tables.

Table 1. List of additional data files included with this report.

Files provided on disc drive:

An e-mail containing information on encryption of the disc drive will be sent, and the disc drive will be forwarded by courier.

Content: Disk drive
Description: All FASTQ files associated with the project. All BAM files generated in the project, including mapped and unmapped files (use the IGV viewer to visualize).

Table 2. List of data files included on disc drive.

Ref code: 9999

Table of Contents

Summary
Experimental overview
    Sample overview
    Reference genome
    Experimental design
Project workflow
QC & Mapping
    QC Summary
    Mapping and yields
Results
    Identified genes
    Principal Component Analysis plot
    Heat map and unsupervised clustering
    Identification of novel mRNAs
    Differentially expressed genes
    Differentially expressed novel transcripts
    Volcano plot
    Gene Ontology Enrichment Analysis
Conclusion and next steps
miRSearch
Data Analysis workflow
    Software tools used for the analysis
Material and methods
    Library preparation and Next Generation Sequencing
References
Frequently asked questions

Summary

Dear Dr. Demo,

We have now finalized the Next Generation Sequencing analysis of the mRNAs identified in the samples you submitted to Exiqon Services. Next Generation Sequencing libraries were successfully prepared, quantified and sequenced for all your samples. The collected reads were subjected to quality control and downstream analysis.

The principal findings are summarized in this document. Additional information and further details on specific RNA transcripts can be found in the files listed in Table 1.

Differential expression analysis of read counts identified a subset of mRNA sequences with significant differences in the number of associated reads between the two experimental groups. We also found a number of putative novel transcripts in your samples, some of which show significant differential expression.

Exiqon's product line offers many tools for further validating potentially regulated mRNAs by qPCR, in situ hybridization or Northern blot, as well as GapmeRs for highly efficient antisense inhibition of mRNA and lncRNA function. For more information please see

If you have any questions related to this report, please do not hesitate to contact us at DxServices@exiqon.com.

Kind regards,
Exiqon Services
Exiqon A/S

Experimental overview

Sample overview

The table below lists all the samples processed in this project and their specifications according to the sample submission form. There were a total of 6 samples, split into two experimental groups.

Sample ID   Group     Sequencing batch   File Name
Control1    Control   1                  XXX088363_CS_1.fastq
Control2    Control   1                  XXX088363_CS_2.fastq
Control3    Control   1                  XXX088363_CS_3.fastq
Treated1    Treated   1                  XXX088363_TS_4.fastq
Treated2    Treated   1                  XXX088363_TS_5.fastq
Treated3    Treated   1                  XXX088363_TS_6.fastq

Table 3. Sample ID, grouping, sequencing batch and associated FASTQ file.

Reference genome

Annotation of the obtained sequences was performed using the reference annotation listed below.

Organism: Human
Reference genome: H. sapiens, hg19 / GRCh37, UCSC Genome Browser
Annotation reference: Gencode v11, Ensembl

Experimental design

The experiments were performed using the following settings:

Instrument: NextSeq500
Number of reads: 50 million
Read length: 50 bp, paired end


Project workflow

The figure below outlines the Next Generation Sequencing process for mRNA and whole transcriptome RNA sequencing at Exiqon A/S.

Figure 1. Schematic NGS workflow.

QC & Mapping

The following sections provide a summary of the QC and mapping results obtained for your dataset.

QC Summary

Following sequencing, intensity correction and base calling, an initial QC of the data is performed internally by the sequencer. This includes chastity filtering and quality scoring (Q-score; see the Frequently asked questions section for details) of each individual base in each read. At this stage the data is separated for paired-end (PE) reads to determine whether the second read differs significantly from the first in terms of overall quality.

As illustrated in Figure 2, the vast majority of the data has a Q-score greater than 30 (>99.9% correct), indicating that high quality data was obtained for all samples. Read pairs R1 (read 1) and R2 (read 2) are presented separately.

Figure 2. Average read quality of the NGS sequencing data. A Q-score above 30 is considered high quality data (red dotted line).

Figure 3 gives an overview of the average base quality. As for the average read quality, we found that the vast majority of the bases have a Q-score greater than 30 (>99.9% correct), indicating that high quality data was obtained for all samples.

Figure 3. Average base quality (R1 and R2 Q-scores) of the NGS sequencing data. The vast majority of the bases have a Q-score greater than 30 (>99.9% correct), indicating high quality data.
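As an illustration of the Q30 threshold used in Figures 2 and 3, the following Python sketch converts between Phred quality scores and error probabilities and computes the fraction of bases at or above a threshold. It assumes FASTQ quality strings in the Phred+33 encoding; the function names and the example quality string are hypothetical and are not part of the Exiqon pipeline.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_line, offset=33):
    """Decode one FASTQ quality line (assumed Phred+33) into integer Q-scores."""
    return [ord(ch) - offset for ch in quality_line]


def fraction_at_least(qualities, threshold=30):
    """Fraction of bases whose Q-score is at or above the threshold."""
    if not qualities:
        raise ValueError("no quality values given")
    return sum(1 for q in qualities if q >= threshold) / len(qualities)


# Hypothetical quality string: 'I' = Q40, '5' = Q20, '?' = Q30
scores = decode_fastq_quality("II5?")
q30_fraction = fraction_at_least(scores)
```

A Q-score of 30 corresponds to an error probability of 0.001, which is where the ">99.9% correct" figure in the captions comes from.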

Mapping and yields

Mapping of the sequencing data is a useful quality control step in the NGS data analysis pipeline, as it helps to evaluate the quality of the samples. For this purpose, we classify the reads into the following classes:

- Outmapped reads or high abundance reads: for example rRNA, mtRNA, polyA and polyC homopolymers
- Unmapped reads: no alignment possible
- Mappable reads: aligning to the reference genome

In a typical experiment it is possible to align 60-90% of the reads to the reference genome. However, this number depends on the quality of the sample and the coverage of the relevant reference genome; if the sample is degraded, fewer reads will be mRNA specific and more material will be degraded rRNA.

The following table and plot summarize the mapping results. In addition to the mapping results, the table also shows the total number of reads obtained for each sample. On average, 65 million reads were obtained from each sample and genome mapping was on average 91% for all samples. The uniformity of the samples' mapping results suggests that the samples are comparable.

Sample   Total reads   Outmapped reads: rRNAs (%)   Outmapped reads: Other (mtRNA) (%)   Mappable reads (%)   Unmapped (%)
Control
Control
Control
Treated
Treated
Treated

Table 4. Summary of the mapping results for each sample.

The following plot summarizes the mapping results for each sample.


Figure 4. Summary of mapping results of the reads by sample.

If you want to inspect the mapping in detail, please see the BAM alignment files, which are supplied on the hard disk. The BAM files can be viewed and inspected in any standard genome viewer such as the IGV browser (Robinson et al., 2011; Thorvaldsdóttir et al., 2012), downloadable from
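The read classification above can be summarized per sample as percentages. The Python sketch below is a minimal illustration, assuming that each percentage is taken relative to the total number of reads for the sample; the report does not state the denominator explicitly, and the function name and the read counts are hypothetical.

```python
def mapping_summary(class_counts):
    """Convert read counts per class into percentages of the total reads.

    class_counts: dict mapping a read class name to a read count.
    Assumption: every read falls into exactly one class.
    """
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {name: 100.0 * count / total for name, count in class_counts.items()}


# Hypothetical counts for one sample (not taken from the report)
sample_counts = {
    "outmapped_rRNA": 1_000_000,
    "outmapped_other": 500_000,
    "mappable": 45_500_000,
    "unmapped": 3_000_000,
}
percentages = mapping_summary(sample_counts)
```

Comparing the resulting percentages across samples is one simple way to check the uniformity that the text refers to.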

Results

Below you will find a summary of the principal findings for this project. The complete analysis may be found in the associated files listed in Table 1. For a detailed description of the data analysis process, see the Data Analysis workflow section.

Identified genes

Based on alignment to the reference genome, the number of identified genes per sample was calculated. The reliability of an identified gene increases with the number of identified fragments. When performing the statistical comparison of two groups, we include all genes irrespective of how few calls have been made. As can be seen from the table below, and from Figure 5, all samples included in this study have comparable call rates.

Sample ID   Number of genes identified   Number of isoforms identified
Control
Control
Control
Treated
Treated
Treated

Table 5. Number of genes and isoforms identified in each sample which have a fragment count estimation of at least 10 counts per gene.

The distribution of the calls based on the number of fragments identified is illustrated in the radar plot below. The sample name is indicated on the outer rim of the plot. The numbers of genes with at least 1, 10, 100 or 1000 fragments are illustrated as colored rings. If one sample has a significantly lower number of genes in each category, this indicates that the sample deviates from the remaining samples. Overall, the rings in the plot are consistent.

Figure 5. Radar plot showing gene call rates for each sample at different fragment count cutoff values. See the color scale at the top of the figure for the cutoff values.

Expression levels are measured as FPKM

FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is a unit for measuring expression in NGS experiments. The number of fragments corresponding to a particular gene is normalized to the transcript length and to the total number of mapped reads. In the analysis, the FPKM values are normalized using the median of the ratios to the per-gene geometric mean across samples (Anders & Huber, 2010).
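The two normalization steps described above can be sketched in Python as follows. This is a simplified illustration, not the Cufflinks/Cuffdiff implementation: `fpkm` applies the unit definition directly, and `size_factors` computes median-of-ratios scaling factors in the spirit of Anders & Huber (2010). Function names and input values are hypothetical.

```python
import math
from statistics import median


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def size_factors(values_by_sample):
    """Median-of-ratios scaling factor per sample.

    values_by_sample: dict mapping sample name to a list of per-gene values,
    with genes in the same order for every sample. Genes with a zero in any
    sample are skipped because their geometric mean is zero.
    """
    samples = list(values_by_sample)
    n_genes = len(values_by_sample[samples[0]])
    log_geo_means = []
    for g in range(n_genes):
        column = [values_by_sample[s][g] for s in samples]
        if all(v > 0 for v in column):
            log_geo_means.append((g, sum(math.log(v) for v in column) / len(column)))
    if not log_geo_means:
        raise ValueError("no gene is expressed in all samples")
    factors = {}
    for s in samples:
        log_ratios = [math.log(values_by_sample[s][g]) - lg for g, lg in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors


example_fpkm = fpkm(fragments=100, transcript_length_bp=2000,
                    total_mapped_fragments=10_000_000)
example_factors = size_factors({"a": [10, 20, 30], "b": [20, 40, 60]})
```

Dividing each sample's values by its factor puts the samples on a common scale; in the example, sample b is sequenced twice as deeply as sample a, so its factor is twice as large.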

Principal Component Analysis plot

Principal Component Analysis (PCA) is a method used to reduce the dimension of large data sets and is a useful tool to explore the naturally arising sample classes based on the expression profile. The top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts have been included in the analysis. If the biological differences between the samples are pronounced, they will account for the primary components of the variation in the data. This leads to separation of samples in different regions of a PCA plot corresponding to their biology. If other factors, e.g. sample quality, introduce more variation in the data, the samples will not cluster according to the biology.

The largest component of the variation is plotted along the X-axis and the second largest is plotted on the Y-axis. As seen below, the groups cluster on the primary component.

Figure 6. Principal component analysis (PCA) plot. The PCA was performed on all samples passing QC using the top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts.
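The selection of the most variable transcripts used as PCA input could look like the Python sketch below. The report does not specify how the log2 fold difference is computed, so this sketch assumes it is the absolute log2 ratio of the group mean FPKM values with a pseudocount of 1 to avoid division by zero; the function name, the pseudocount and the example values are all hypothetical.

```python
import math


def top_by_log2_fold(fpkm_by_gene, group_a, group_b, n=200, pseudocount=1.0):
    """Return the n genes with the largest absolute log2 fold difference.

    fpkm_by_gene: dict mapping gene name to a dict of sample name -> FPKM.
    group_a, group_b: lists of sample names for the two groups.
    """
    scored = []
    for gene, values in fpkm_by_gene.items():
        mean_a = sum(values[s] for s in group_a) / len(group_a)
        mean_b = sum(values[s] for s in group_b) / len(group_b)
        log2_fold = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
        scored.append((abs(log2_fold), gene))
    scored.sort(reverse=True)  # largest absolute difference first
    return [gene for _, gene in scored[:n]]


# Hypothetical FPKM values for three genes and four samples
data = {
    "g1": {"c1": 1, "c2": 1, "t1": 7, "t2": 7},
    "g2": {"c1": 3, "c2": 3, "t1": 3, "t2": 3},
    "g3": {"c1": 15, "c2": 15, "t1": 1, "t2": 1},
}
selected = top_by_log2_fold(data, ["c1", "c2"], ["t1", "t2"], n=2)
```

The selected genes would then form the rows of the matrix passed to the PCA or clustering step.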


Heat map and unsupervised clustering

The heat map diagram below shows the result of the two-way hierarchical clustering of RNA transcripts and samples, including the top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts. Each row represents one RNA transcript and each column represents one sample. The color of each point represents the relative expression level of a transcript across all samples. The color scale is shown at the bottom right: red represents an expression level above the mean; green represents an expression level below the mean.

Figure 7. Heat map and unsupervised hierarchical clustering by sample and transcript, performed on all samples passing QC using the top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts.

Identification of novel mRNAs

During the transcriptome assembly process, both known and novel transcripts are identified. A novel transcript is characterized as a transcript which contains features not present in the reference annotation. Thus, a novel transcript can be either a new isoform of a known gene or a transcript without any known features. For example, a novel transcript could be the result of a previously unknown splicing event for a known gene, or a previously unknown long noncoding RNA.

Identification of novel transcripts depends on the reference annotation. For the present study, the hsa hg19 genome from Gencode v11, Ensembl has been used for annotation. Transcripts that are not part of this annotation are classified as novel.

In the result files we classify novel transcripts with known features by listing the known transcripts most closely resembling the novel transcript. For novel transcripts without any known features we provide a locally unique name as transcript identifier. In addition, we provide the genomic positions of the features of the novel transcript, e.g. the location and number of exons.

Please see the section Differentially expressed novel transcripts for the differentially expressed novel transcripts, and the associated result files for the full list of identified novel transcripts. The full lists of Coding DNA Sequence (CDS), genes, exon isoforms and differential start site isoforms are presented in these files. The table annotations are complex, but a good reference is presented in the Cufflinks manual accessible at

Differentially expressed genes

To identify differentially expressed genes, it is assumed that the number of reads produced by each transcript is proportional to its abundance. Exiqon Services has customized the analysis pipeline based on the Tuxedo suite, including the cufflinks, cuffmerge and cuffdiff steps of the Tuxedo pipeline. For more details, see the Data Analysis workflow section.

Comparison of Control and Treated experimental groups, known mRNA

The table below shows the individual results for the top 20 most differentially expressed known mRNA genes. A full list of differentially expressed transcripts is given in the associated .tsv file folder listed in Table 1.

Gene_id   Gene            Locus   Control FPKM   Treated FPKM   log2_fc   q_value
XLOC_     PSG5            19:
XLOC_     TRAC,TRAJ20     14:
XLOC_     GREM1           15:
XLOC_     KCNK2           1:
XLOC_     HOXD10,HOXD11   2:
XLOC_     DKK1            10:
XLOC_     CPA4            7:
XLOC_     HAPLN1          5:
XLOC_     LHX9            1:
XLOC_     KIAA            :
XLOC_     WNT16           7:
XLOC_     BNC1            15:
XLOC_     FOXE1           9:
XLOC_     RP11-94A24.1    8:
XLOC_     GALNT5          2:
XLOC_     LOX             5:
XLOC_     RP11-265N7      15:
XLOC_     RP11-709B3.2    15:
XLOC_     SLC1A7          1:
XLOC_     ADAMTSL1        9:

Table 6. Known mRNAs: the 20 most differentially expressed mRNAs, with log fold change (Log2_FC FPKM) between groups Control and Treated and Benjamini-Hochberg FDR corrected q-values. The list is sorted on Log2_FC. Control and Treated columns are group average FPKM values.
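The q-values in these tables are Benjamini-Hochberg FDR corrected p-values. As a minimal sketch of that correction (a generic textbook implementation, not the code used by Cuffdiff), the Python function below converts a list of p-values into q-values; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical p-values for four genes
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in example_q]
```

A gene is then reported as significant when its q-value falls below the chosen cutoff, for example the q-value < 0.05 filter applied elsewhere in this report.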

Comparison of Control and Treated experimental groups, isoforms

The table below shows the individual results for the top 20 most differentially expressed isoforms. A full list of differentially expressed transcripts is given in the associated .tsv file folder listed in Table 1.

Gene_id   Gene       Locus   Control FPKM   Treated FPKM   log2_fc   q_value
XLOC_     THBS1      15:     E
XLOC_     MXRA5      X:
XLOC_     KIAA       :
XLOC_     GREM1      15:
XLOC_     DKK1       10:
XLOC_     ADAMTSL1   9:
XLOC_     DKK1       10:
XLOC_     ITGA11     15:
XLOC_     COL8A1     3:
XLOC_     MIR125B1   11:
XLOC_     SULF1      8:
XLOC_     MYOF       10:
XLOC_     LAMA4      6:
XLOC_     LOX        5:
XLOC_     HAS2       8:
XLOC_     HMGA2      12:
XLOC_     COL8A1     3:
XLOC_     RGMB       5:
XLOC_     COL6A3     2:
XLOC_     ENPP2      8:

Table 7. Isoforms: the 20 most differentially expressed isoforms, with log fold change (Log2_FC FPKM) between groups Treated and Control and Benjamini-Hochberg FDR corrected q-values. The list is sorted on Log2_FC. Control and Treated columns are group average FPKM values.

Differentially expressed novel transcripts

The table below lists the top 20 differentially expressed novel transcripts identified in this project. The second column lists the known transcripts most closely resembling each novel transcript. A full list of differentially expressed transcripts is given in the associated .tsv file folder listed in Table 1.

Gene_id   Gene     Locus   Control   Treated   log2_fc   q_value
XLOC_     KIAA     :
XLOC_     RGMB     5:
XLOC_     TWIST2   2:
XLOC_     MEGF6    1:
XLOC_     AC       X:
XLOC_     CCDC14   3:
XLOC_     EDA2R    X:
XLOC_     FLJ      :
XLOC_     SLIT2    4:
XLOC_     WEE1     11:
XLOC_     HOXA9    7:
XLOC_     MACF1    1:
XLOC_     SRSF11   1:
XLOC_     ADAM33   20:
XLOC_     KIF23    15:
XLOC_     PRKY     Y:
XLOC_     FKBP10   17:
XLOC_     COL8A1   3:
XLOC_     LOXL2    8:
XLOC_     HAS2     8:

Table 8. Novel transcripts: the 20 most differentially expressed novel transcripts, with log fold change (Log2_FC FPKM) between groups Control and Treated and Benjamini-Hochberg FDR corrected q-values. The list is sorted on Log2_FC. Control and Treated columns are group average FPKM values.


Volcano plot

The volcano plot provides a quick visual identification of the RNA transcripts displaying large-magnitude changes which are also statistically significant. The plot is constructed by plotting the p-value (-log10) on the y-axis and the expression fold change between the two experimental groups on the x-axis. There are two regions of interest in the plot: points found towards the top of the plot (high statistical significance) and points at the extreme left or right (strongly down-regulated and up-regulated, respectively). Genes that pass the filtering of q-value < 0.05 are indicated on the plot. For the present study, no genes pass this filtering. For volcano plots of other comparisons, please see the additional figures.

Figure 8. Volcano plot showing the relationship between the p-values and the fold change in normalized expression between the experimental groups Control and Treated.
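The coordinates of a volcano plot can be derived from a differential expression table as in the Python sketch below. It assumes that the fold change on the x-axis is the log2 fold change and that each record carries `log2_fc`, `p_value` and `q_value` fields; the record layout, the `min_p` floor for p-values of zero and the example values are hypothetical.

```python
import math


def volcano_points(results, q_cutoff=0.05, min_p=1e-300):
    """Compute volcano plot coordinates for each tested transcript.

    results: list of dicts with keys 'gene', 'log2_fc', 'p_value', 'q_value'.
    min_p: floor applied to p-values so that -log10(0) is never evaluated.
    """
    points = []
    for row in results:
        points.append({
            "gene": row["gene"],
            "x": row["log2_fc"],                           # effect size
            "y": -math.log10(max(row["p_value"], min_p)),  # significance
            "highlight": row["q_value"] < q_cutoff,        # passes FDR filter
        })
    return points


# Hypothetical differential expression records
example_points = volcano_points([
    {"gene": "geneA", "log2_fc": 3.2, "p_value": 0.001, "q_value": 0.01},
    {"gene": "geneB", "log2_fc": -0.4, "p_value": 0.5, "q_value": 0.8},
])
```

Points with a large absolute x and a large y are the transcripts of interest; the `highlight` flag marks those that also pass the q-value filter.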

Gene Ontology Enrichment Analysis

Gene ontology (GO; Gene Ontology Consortium, 2000) enrichment analysis attempts to identify GO terms that are significantly associated with differentially expressed protein coding genes. We investigate whether specific GO terms are more likely to be associated with the differentially expressed mRNAs.

Two different statistical tests are used and compared. Firstly, a standard Fisher's test is used to investigate enrichment of terms between the two test groups. Secondly, the Elim method takes a more conservative approach by incorporating the topology of the GO network to compensate for local dependencies between GO terms, which can mask significant GO terms. Comparing the predictions from these two methods can highlight truly relevant GO terms.

The figure below shows a comparison of the results for the GO (Biological process) terms associated with the significantly differentially expressed mRNAs that were identified between groups Control and Treated. The complete GO enrichment analysis for all of the comparisons is presented in the associated GO folder in the full dataset supplied with the report. The Cellular component (CC) and Molecular function (MF) analyses are presented in the associated data folder. In the plot, the majority of overrepresented terms are not statistically significant, but a small number of terms appear to be relevant.

Figure 9. Scatter plot for significantly enriched GO terms predicted to be associated with differentially expressed genes. The plot shows a comparison of the results obtained by the two statistical tests used. Values along the diagonal are consistent between both methods, with values in the bottom left of the plot corresponding to the terms with the most reliable estimates from both methods. The size of each dot is proportional to the number of genes mapping to that GO term, and the coloring represents the number of significantly differentially expressed genes corresponding to that term, with dark red representing more genes and yellow representing fewer.
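The classic Fisher's test for a single GO term can be sketched as a one-sided hypergeometric tail probability, as in the Python example below. This is a generic illustration and not the software used for the report; it does not implement the Elim method, which additionally uses the GO graph topology. Function names and the example counts are hypothetical.

```python
from math import comb


def expected_count(n_universe, n_annotated, n_significant):
    """Number of significant genes expected for a term by chance alone."""
    return n_annotated * n_significant / n_universe


def enrichment_p_value(n_universe, n_annotated, n_significant, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    n_universe:    all genes tested
    n_annotated:   genes annotated with the GO term
    n_significant: differentially expressed genes
    n_overlap:     differentially expressed genes annotated with the term
    Returns P(overlap >= n_overlap) under random sampling.
    """
    total = comb(n_universe, n_significant)
    upper = min(n_annotated, n_significant)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes tested, 4 annotated, 3 significant, 2 overlapping
example_p = enrichment_p_value(10, 4, 3, 2)
example_expected = expected_count(10, 4, 3)
```

The inputs correspond to the Annotated and Significant columns of a typical GO enrichment table, and `expected_count` corresponds to the Expected column.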

A list of potentially significant GO (Biological process) terms is given in the table below.

Rank in Classic KS elimks GO.ID Term Annotated Significant Expected Classic Fisher p-value p-value
GO: extracellular matrix organization E E-08
GO: inflammatory response E E-05
GO: homophilic cell adhesion E E-05
GO: anatomical structure formation involved E
GO: cell adhesion E
GO: brain development E
GO: regulation of cell migration E
GO: positive regulation of neuron differenti
GO: axon guidance
GO: glutamate metabolic process
GO: positive regulation of epithelial cell p
GO: negative regulation of blood coagulation
GO: chemical homeostasis
GO: leukocyte migration
GO: regulation of protein transport
GO: complement activation, classical pathway
GO: monocarboxylic acid biosynthetic process
GO: positive regulation of transport
GO: epithelial to mesenchymal transition
GO: high-density lipoprotein particle remode

Table 9. The top 20 significant GO terms for the genes found to be differentially expressed between Control and Treated and their corresponding annotation for Biological process (BP). The associated network topology is shown in Figure 10.


To illustrate how the different GO terms are linked, a GO network has been created.

Figure 10. GO network generated from the GO terms predicted to be enriched for the Biological process (BP) vocabulary. Nodes are colored from red to yellow, with the node with the strongest support colored red and nodes with no significant enrichment colored yellow. The five nodes with the strongest support are marked with rectangular nodes. A high-resolution version of this graph is found in the supplementary figures.

Conclusion and next steps

mRNA Next Generation Sequencing libraries were successfully prepared, quantified and sequenced for all your samples. The data passed all QC metrics, with high Q-scores, indicating good technical performance of the NGS experiment. A high percentage of the reads could be mapped to the reference genome, indicating that the samples were of high quality.

A large number of novel transcripts were identified. Note, however, that many of these will be novel isoforms or start sites of known genes and transcripts.

It is clear from the unsupervised analysis that the samples cluster according to their biological groups, indicating that the sample groups account for the largest variation in the data. The supervised analysis showed large numbers of significantly differentially expressed mRNAs at the CDS (Coding DNA Sequence) and gene level as well as at the isoform level.

Note: when navigating through these data, transcripts with expression lower than 1-5 FPKM (on average) per group might be difficult to validate in a qPCR experiment.

We would like to help you interpret the data presented in this report and guide you on how best to proceed with subsequent experiments. If you would like to arrange a time to discuss the data with us in more detail, please do not hesitate to contact DxServices@exiqon.com and we will be happy to arrange a phone call with you.


miRSearch

If you are interested in looking at which microRNAs are regulating your transcripts, Exiqon offers two options for further data mining of the results:

miRSearch 3.0

An interactive miRSearch database, offering you up-to-date information on specific microRNAs, tissues, diseases, as well as co-regulated microRNAs, target genes and much more. miRSearch includes a built-in report feature which allows you to easily collect and store all the relevant information gathered. Access miRSearch from this address:

XploreRNA

XploreRNA is an advanced database search tool for scientists engaged in transcriptome analysis. The XploreRNA app enables scientists unfamiliar with database searches to access relevant public and proprietary genetic and molecular biology databases through a simple user interface. All databases are cross-annotated, and relevant databases are regularly updated by advanced text mining of the literature, e.g. with respect to new information on microRNA-mRNA interactions. The app provides information from major databases such as Ensembl and miRBase. XploreRNA can be downloaded from the App Store and Google Play. All search results provide information on literature reference(s) with integrated access to PubMed for reading of abstracts and original publications.

Data Analysis workflow

Software tools used for the analysis

Our data analysis pipeline is based on the Tuxedo software package, which is a combination of open-source software and implements peer-reviewed statistical methods. In addition, we employ specialized software developed internally at Exiqon to interpret and improve the readability of the final results. The components of our NGS RNA-seq analysis pipeline include Bowtie2 (v ), TopHat (v2.0.11) and Cufflinks (v2.2.1) and are described in detail below.

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns the sequencing reads to the reference genome using the sequence aligner Bowtie2. TopHat also uses the sequence alignments to identify splice junctions for both known and novel transcripts.

Cufflinks takes the alignment results from TopHat and assembles the aligned sequences into transcripts, constructing a map or snapshot of the transcriptome. To guide the assembly process, an existing transcript annotation is used. In addition, we perform fragment bias correction, which seeks to correct for sequence bias introduced during library preparation (see Hansen et al., 2010 and Roberts et al., 2011). Cufflinks assembles aligned reads into different transcript isoforms based on exon usage and also determines the transcriptional start sites (TSSs).

When comparing groups, Cuffdiff is used to calculate the FPKM (number of fragments per kilobase per million mapped fragments) and to test for differential expression and regulation among the assembled transcripts across the submitted samples, using the Cufflinks output. Cuffdiff can be used to test differential expression at different levels, from CDS and gene level down to the isoform and TSS transcript level. For more information on the Cuffdiff module, see Trapnell et al. (2013).

As a final step, CummeRbund, an open-source R package, is used in combination with in-house custom software for post-processing of Cufflinks and Cuffdiff results. We use these tools to generate a visual representation of your sequencing results to aid the interpretation of the sequencing data and the analysis results.
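The order of the steps described above can be sketched as a list of command lines. The Python function below only builds the argument lists and does not run anything. The report does not list the exact commands or options used by Exiqon, so everything here is an assumption: the file and directory names are hypothetical, and the options shown (`-G`, `-g`, `-b`, `-s`, `-L`, `-o`) are commonly documented options of TopHat, Cufflinks, Cuffmerge and Cuffdiff.

```python
def tuxedo_commands(samples, index="hg19_index", annotation="gencode.gtf",
                    genome_fasta="hg19.fa", out="results"):
    """Build (but do not run) command lines for a Tuxedo-style workflow.

    samples: dict mapping group label -> list of (name, read1, read2) tuples.
    All paths are hypothetical placeholders.
    """
    commands = []
    bams = {}
    for group, libraries in samples.items():
        for name, read1, read2 in libraries:
            tophat_dir = f"{out}/tophat/{name}"
            # Spliced alignment of paired-end reads, guided by the annotation
            commands.append(["tophat2", "-G", annotation, "-o", tophat_dir,
                             index, read1, read2])
            bam = f"{tophat_dir}/accepted_hits.bam"
            # Annotation-guided assembly with fragment bias correction
            commands.append(["cufflinks", "-g", annotation, "-b", genome_fasta,
                             "-o", f"{out}/cufflinks/{name}", bam])
            bams.setdefault(group, []).append(bam)
    # assemblies.txt would list the per-sample transcripts.gtf files
    commands.append(["cuffmerge", "-g", annotation, "-s", genome_fasta,
                     "-o", f"{out}/merged", f"{out}/assemblies.txt"])
    labels = list(bams)
    # Differential expression between groups, replicates joined by commas
    commands.append(["cuffdiff", "-o", f"{out}/cuffdiff", "-b", genome_fasta,
                     "-L", ",".join(labels), f"{out}/merged/merged.gtf"]
                    + [",".join(bams[label]) for label in labels])
    return commands


example_commands = tuxedo_commands({
    "Control": [("Control1", "c1_R1.fastq", "c1_R2.fastq")],
    "Treated": [("Treated1", "t1_R1.fastq", "t1_R2.fastq")],
})
```

Each inner list could be passed to a process runner once the placeholder paths are replaced with real files; the CummeRbund post-processing step runs in R and is not covered by this sketch.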

Material and methods

All experiments were conducted at Exiqon Services, Denmark.

Library preparation and Next Generation Sequencing

The library preparation was done using the TruSeq Stranded mRNA Sample preparation kit (Illumina Inc.). The starting material (100 ng of total RNA) was mRNA enriched using the oligo-dT bead system (manufacturer). The isolated mRNA was subsequently fragmented using enzymatic fragmentation (manufacturer, enzymes?). Then first strand synthesis and second strand synthesis were performed and the double stranded cDNA was purified (AMPure XP, Beckman Coulter?). The cDNA was end repaired and 3' adenylated, Illumina sequencing adaptors were ligated onto the fragment ends, and the library was purified (AMPure XP). The stranded mRNA libraries were pre-amplified with PCR and purified (AMPure XP). The library size distribution was validated and quality inspected on a Bioanalyzer high sensitivity DNA chip (Agilent Technologies?). High quality libraries were quantified using qPCR, the concentration was normalized and the samples were pooled according to the project specification (number of reads). The library pool(s) were re-quantified with qPCR, and the optimal concentration of the library pool was used to generate the clusters on the surface of a flowcell before sequencing on a NextSeq500 instrument using the High Output sequencing kit (150 cycles) according to the manufacturer's instructions (Illumina Inc.).

References

Trapnell, C., et al. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5):
Trapnell, C., et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7,
Trapnell, C., et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9):
Langmead, B., et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3): R
Roberts, A., et al. (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 27(17):
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology 11: R106
Goff, L., et al. (2012)
Robinson, J.T., et al. (2011) Integrative genomics viewer. Nature Biotechnology 29,
Thorvaldsdóttir, H., et al. (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics.
Hansen, K.D., et al. (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, Volume 38, Issue 12.
Roberts, A., et al. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology, Volume 12, R22.
Marinov, G.K., et al. (2014) From single-cell to cell-pool transcriptomes: Stochasticity in gene expression and RNA splicing. Genome Res. 24:
Kellis, M., et al. (2013) Defining functional DNA elements in the human genome. PNAS, Vol. 111:

Frequently asked questions

Question: What is Q-score?
Answer: A quality score (or Q-score) is a prediction of the probability of an incorrect base call. Q-score = -10 log10(P(~X)), where P(~X) is the estimated probability of the base call being wrong. A quality score of 10 indicates an error probability of 0.1, a quality score of 20 indicates an error probability of 0.01, a quality score of 30 indicates an error probability of 0.001, and so on.

Question: What is the difference between FPKM and RPKM?
Answer: RPKM stands for Reads per Kilobase of exon per Million mapped reads; FPKM stands for Fragments per Kilobase of exon per Million mapped fragments. The term fragments refers to the cDNA fragments present during library preparation. Both RPKM and FPKM are normalized numbers which tell you something about the relative abundance of, for example, an assembled transcript. In paired-end sequencing, two reads are produced per cDNA fragment, whereas only one read is produced per cDNA fragment in single-end sequencing. Thus, single-end versus paired-end sequencing will affect the value of RPKM but not FPKM. Consequently, FPKM is preferred over RPKM, as it provides values that are comparable between single-end and paired-end sequencing.

Question: What does 1 FPKM mean in terms of abundance?
Answer: This is difficult to estimate and highly variable according to cell type and the total number of mRNAs in a given cell. For example, it was estimated in a single cell analysis of the cell line GM12878 that one transcript copy corresponds to 10 FPKM (Marinov et al., 2014). Others find that FPKMs are not directly comparable among different subcellular fractions, as they reflect relative abundances within a fraction rather than average absolute transcript copy numbers per cell (Kellis et al., 2013). Depending on the total amount of RNA in a cell, one transcript copy per cell corresponds to between 0.5 and 5 FPKM in PolyA+ whole-cell samples according to current estimates, with the upper end of that range corresponding to small cells with little RNA and vice versa.

Question: What is a novel RNA transcript?
Answer: A novel transcript is characterized as a transcript that contains features not present in the reference annotation. Identification of novel transcripts therefore depends on the reference annotation.

Question: A novel transcript identified seems to be a known gene when I look it up in the gene browser, why is that?
Answer: Most novel transcripts are not new genes but different isoforms of previously annotated genes. A novel transcript is most commonly a novel combination of exons or a different start site.
