Demo of mRNA NGS Concluding Report


Project: Demo Report
Customer: Dr. Demo
Company/Institute: Exiqon A/S
Date: 09-Mar-2015

Performed by
Exiqon A/S
Company Reg. No. (CVR) 18 98 44 31
Skelstedet 16
DK-2950 Vedbæk
Denmark

Additional files provided with this report:

Content: Sampleinfo.xlsx
Description: Overview of samples and groups.

Content: Pictures
Description: High resolution copies of pictures presented in this report (QC plots, volcano plots, heat maps and PCAs).

Content: Data tables (spreadsheet .tsv files)
Description:
- All tables for genes, isoforms, CDS and TSS.
- Count data for all samples in .tsv tables.
- Normalized data for all samples (FPKM) in .tsv tables.
- Differential expression of all relevant comparisons in .tsv tables.
- Other relevant .tsv tables (e.g. attribute tables) and GO analysis tables.

Table 1. List of additional data files included with this report.

Files provided on disc drive: An email containing information on encryption of the disc drive will be sent, and the disc drive will be forwarded by courier.

Content: Disc drive
Description:
- All FASTQ files associated with the project.
- All BAM files generated in the project, including mapped and unmapped files (use the IGV viewer to visualize).

Table 2. List of data files included on disc drive.

Ref code: 9999

Table of Contents

Summary
Experimental overview
    Sample overview
    Reference genome
    Experimental design
Project workflow
QC & Mapping
    QC Summary
    Mapping and yields
Results
    Identified genes
    Principal Component Analysis plot
    Heat map and unsupervised clustering
    Identification of novel mRNAs
    Differentially expressed genes
    Differentially expressed novel transcripts
    Volcano plot
    Gene Ontology Enrichment Analysis
Conclusion and next steps
miRSearch
Data Analysis workflow
    Software tools used for the analysis
Material and methods
    Library preparation and Next Generation Sequencing
References
Frequently asked questions

Summary

Dear Dr. Demo,

We have now finalized the Next Generation Sequencing analysis of the mRNAs identified in the samples you submitted to Exiqon Services.

Next Generation Sequencing libraries were successfully prepared, quantified and sequenced for all your samples. The collected reads were subjected to quality control and downstream analysis. The principal findings are summarized in this document. Additional information and further details on specific RNA transcripts can be found in the files listed in Tables 1 and 2.

Differential expression analysis of read counts identified a subset of mRNA sequences with significant differences in the associated number of reads between the two experimental groups. We also found a number of putative novel transcripts in your samples, some of which show significant differential expression.

Exiqon's product line offers many tools for further validating potentially regulated mRNAs by qPCR, in situ hybridization or Northern blot, as well as GapmeRs for highly efficient antisense inhibition of mRNA and lncRNA function. For more information please see www.exiqon.com.

If you have any questions related to this report, please do not hesitate to contact us at DxServices@exiqon.com.

Kind regards,
Exiqon Services
Exiqon A/S

Experimental overview

Sample overview

The table below lists all the samples processed in this project and their specifications according to the sample submission form. There were a total of 6 samples, split into two experimental groups.

Sample ID   Group     Sequencing batch   File Name
Control1    Control   1                  XXX088363_CS_1.fastq
Control2    Control   1                  XXX088363_CS_2.fastq
Control3    Control   1                  XXX088363_CS_3.fastq
Treated1    Treated   1                  XXX088363_TS_4.fastq
Treated2    Treated   1                  XXX088363_TS_5.fastq
Treated3    Treated   1                  XXX088363_TS_6.fastq

Table 3. Sample ID, grouping, sequencing batch and associated FASTQ file.

Reference genome

Annotation of the obtained sequences was performed using the reference annotation listed below.

Organism: Human
Reference genome: H. sapiens, hg19 / GRCh37, UCSC Genome Browser
Annotation reference: Gencode v11, Ensembl

Experimental design

The experiments were performed using the following settings:

Instrument: NextSeq500
Number of reads: 50 million
Read length: 50 bp, paired end

Project workflow

The figure below outlines the Next Generation Sequencing process for mRNA and whole transcriptome RNA sequencing at Exiqon A/S.

Figure 1. Schematic NGS workflow.

QC & Mapping

The following sections provide a summary of the QC and mapping results obtained for your dataset.

QC Summary

Following sequencing, intensity correction and base calling, an initial QC of the data is performed internally by the sequencer. This includes chastity filtering and quality scoring (Q-score) of each individual base in each read. At this stage the data for paired-end (PE) reads is separated to determine whether the second read differs significantly from the first in terms of overall quality.

As illustrated in Figure 2, the vast majority of the data has a Q-score greater than 30 (>99.9% correct), indicating that high quality data was obtained for all samples. The read pairs R1 (read 1) and R2 (read 2) are presented separately.

Figure 2. Average read quality of the NGS sequencing data. A Q-score above 30 is considered high quality data (red dotted line).

Figure 3 gives an overview of the average base quality. As for the average read quality, we found that the vast majority of the bases have a Q-score greater than 30 (>99.9% correct), indicating that high quality data was obtained for all samples.

Figure 3. Average base quality (R1 and R2 Q-scores) of the NGS sequencing data. The vast majority of the bases have a Q-score greater than 30 (>99.9% correct), indicating high quality data.
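The link between a Q-score of 30 and ">99.9% correct" follows from the standard Phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong. The short Python sketch below illustrates that conversion; the function names are illustrative and are not part of the Exiqon pipeline.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Q30 corresponds to 1 error in 1000 base calls, i.e. 99.9% accuracy.
accuracy_q30 = 1 - phred_to_error_prob(30)
```

By the same formula, Q20 corresponds to 99% accuracy and Q40 to 99.99%.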

Mapping and yields

Mapping of the sequencing data is a useful quality control step in the NGS data analysis pipeline, as it helps to evaluate the quality of the samples. For this purpose, we classify the reads into the following classes:

- Outmapped reads or high abundance reads: for example rRNA, mtRNA, polyA and polyC homopolymers
- Unmapped reads: no alignment possible
- Mappable reads: aligning to the reference genome

In a typical experiment it is possible to align 60-90% of the reads to the reference genome. However, this number depends upon the quality of the sample and the coverage of the relevant reference genome; if the sample is degraded, fewer reads will be mRNA specific and more material will be degraded rRNA.

The table and plot below summarize the mapping results. The table also shows the total number of reads obtained for each sample. On average 65 million reads were obtained from each sample and genome mapping was on average 91% for all samples. This figure corresponds to all reads outside the unmapped class in Table 4 (mappable plus outmapped reads); the mappable class alone averages about 71%. The uniformity of the samples' mapping results suggests that the samples are comparable.

Sample     Total reads   Outmapped: rRNAs (%)   Outmapped: other (mtRNA) (%)   Mappable reads (%)   Unmapped (%)
Control1   66171260      0.17                   24.025                         67.08                8.802
Control2   64346354      0.174                  24.198                         66.701               9.01
Control3   54712506      0.131                  23.647                         67.462               8.853
Treated1   54286828      0.084                  15.117                         75.551               9.338
Treated2   54613488      0.095                  13.96                          75.252               10.849
Treated3   98453684      0.08                   14.326                         75.658               10.025

Table 4. Summary of the mapping results for each sample.

Figure 4. Summary of mapping results of the reads by sample.

If you want to inspect the mapping in detail, please see the BAM alignment files, which are supplied on the hard disk. The BAM files can be viewed in any standard genome viewer such as the IGV browser (Robinson et al., 2011; Thorvaldsdóttir et al., 2012), downloadable from https://www.broadinstitute.org/igv/home.
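The averages quoted for Table 4 can be re-derived from the table itself. The sketch below copies the six rows and computes the mean read count, the mean mappable fraction, and the mean fraction of reads that are not unmapped. The report does not state how its 91% mapping figure was calculated; treating it as 100% minus the unmapped fraction is an assumption made here.

```python
# Rows copied from Table 4:
# (sample, total reads, rRNA %, other outmapped %, mappable %, unmapped %)
TABLE4 = [
    ("Control1", 66171260, 0.17, 24.025, 67.08, 8.802),
    ("Control2", 64346354, 0.174, 24.198, 66.701, 9.01),
    ("Control3", 54712506, 0.131, 23.647, 67.462, 8.853),
    ("Treated1", 54286828, 0.084, 15.117, 75.551, 9.338),
    ("Treated2", 54613488, 0.095, 13.96, 75.252, 10.849),
    ("Treated3", 98453684, 0.08, 14.326, 75.658, 10.025),
]

n = len(TABLE4)
mean_reads = sum(row[1] for row in TABLE4) / n      # about 65 million reads
mean_mappable = sum(row[4] for row in TABLE4) / n   # mappable class only, in %
mean_unmapped = sum(row[5] for row in TABLE4) / n   # unmapped class, in %
mean_aligned = 100 - mean_unmapped                  # mappable + outmapped, in %

print(f"mean reads: {mean_reads / 1e6:.1f} million")
print(f"mean mappable: {mean_mappable:.1f}%  mean aligned: {mean_aligned:.1f}%")
```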

Results

Below you will find a summary of the principal findings for this project. The complete analysis may be found in the associated files listed in Tables 1 and 2. For a detailed description of the data analysis process, see the Data Analysis workflow section.

Identified genes

Based on alignment to the reference genome, the number of identified genes per sample was calculated. The reliability of an identified gene increases with the number of fragments identified for it. When performing the statistical comparison of two groups, we include all genes irrespective of how few calls have been made. As can be seen from Table 5 and Figure 5, all samples included in this study have comparable call rates.

Sample ID   Number of genes identified   Number of isoforms identified
Control1    21123                        75845
Control2    20991                        76171
Control3    21249                        75276
Treated1    21450                        81285
Treated2    21401                        81706
Treated3    21488                        83213

Table 5. Number of genes and isoforms identified in each sample with an estimated fragment count of at least 10 per gene.

The distribution of the calls based on the number of fragments identified is illustrated in the radar plot below. The sample name is indicated on the outer rim of the plot. The numbers of genes with at least 1, 10, 100 or 1000 fragments are illustrated as colored rings. If one sample shows a markedly lower number of genes in each category, this indicates that the sample deviates from the remaining samples. Overall, the rings in the plot are consistent.

Figure 5. Radar plot showing gene call rates for each sample at different fragment count cutoff values. See color scale at top of figure for specification of cutoff values.

Expression levels are measured as FPKM

FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is a unit for measuring expression in NGS experiments. The number of fragments assigned to a particular gene is normalized to the length of the transcript and to the total number of mapped reads. In the analysis, the FPKM values are normalized using the median of the geometric means of fragment counts across samples (Anders & Huber, 2010).
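As an illustration of the two normalization steps described above, the sketch below computes an FPKM value from its definition and derives per-sample scaling factors from the median of ratios to per-gene geometric means, following the general idea of Anders & Huber (2010). The function names and the input layout are assumptions made for this example; this is not the code used to produce the report.

```python
import math
from statistics import median


def fpkm(fragments: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def size_factors(counts: dict) -> dict:
    """Median-of-ratios scaling factor per sample.

    counts maps a sample name to a list of per-gene fragment counts,
    with the same gene order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log of the geometric mean across samples for every gene.
    # Genes with a zero count in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lg
            for g, lg in enumerate(log_geo_means)
            if lg is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# 500 fragments on a 2 kb transcript in a library of 25 million mapped fragments
example_fpkm = fpkm(500, 2000, 25_000_000)  # 10.0
```

Dividing each sample's counts by its factor puts libraries of different sequencing depth on a common scale before FPKM values are compared between groups.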

Principal Component Analysis plot

Principal Component Analysis (PCA) is a method used to reduce the dimension of large data sets and is a useful tool to explore the naturally arising sample classes based on the expression profile. The top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts have been included in the analysis.

If the biological differences between the samples are pronounced, they will account for the primary components of the variation in the data. This leads to separation of the samples into different regions of the PCA plot corresponding to their biology. If other factors, e.g. sample quality, introduce more variation in the data, the samples will not cluster according to the biology.

The largest component of the variation is plotted along the X-axis and the second largest along the Y-axis. As seen below, the groups cluster on the primary component.

Figure 6. Principal component analysis (PCA) plot. The PCA was performed on all samples passing QC using the top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts.
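To make the idea of the primary component concrete, the following sketch computes each sample's score on the first principal component using power iteration on mean-centered data. The expression values are invented toy numbers, the function name is hypothetical, and the report does not say which software produced Figure 6, so this only illustrates the technique.

```python
import math


def pc1_scores(rows, n_iter=200):
    """Score of each sample on the first principal component.

    rows holds one list of log2 expression values per sample,
    with the same gene order in every sample.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Non-uniform start vector; a start exactly orthogonal to the leading
    # direction would stall the iteration, which is unlikely on real data.
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(n_iter):
        scores = [sum(x[j] * v[j] for j in range(p)) for x in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(p)) for x in centered]


# Toy log2 expression values for four genes (not taken from the report)
toy = [
    [1.0, 1.2, 5.0, 5.1],  # Control1
    [1.1, 0.9, 5.2, 4.9],  # Control2
    [0.9, 1.0, 4.8, 5.0],  # Control3
    [5.0, 5.1, 1.0, 1.1],  # Treated1
    [5.2, 4.8, 1.1, 0.9],  # Treated2
    [4.9, 5.0, 0.9, 1.2],  # Treated3
]
toy_scores = pc1_scores(toy)
```

In this toy data the three control samples receive scores of one sign and the three treated samples scores of the opposite sign, which is the kind of separation along the X-axis that the text describes.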

Heat map and unsupervised clustering

The heat map below shows the result of the two-way hierarchical clustering of RNA transcripts and samples, including the top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts. Each row represents one RNA transcript and each column represents one sample. The color of each point represents the relative expression level of a transcript across all samples. The color scale is shown at the bottom right: red represents an expression level above the mean; green represents an expression level below the mean.

Figure 7. Heat map and unsupervised hierarchical clustering by sample and transcript, performed on all samples passing QC using the top 200 transcripts (genes) that have the largest log2 fold difference based on FPKM counts.
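The coloring rule described for the heat map (red above the mean of a transcript across samples, green below) can be sketched as a row-wise centering step. Whether the report also scales each row by its standard deviation is not stated; the z-score used below is an assumption, and the function names are illustrative.

```python
from statistics import mean, pstdev


def row_z_scores(values):
    """Center one transcript's expression values on their mean and scale by the standard deviation."""
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - m) / sd for v in values]


def heat_color(z):
    """Color class for one heat map cell: above the row mean is red, below is green."""
    if z > 0:
        return "red"
    if z < 0:
        return "green"
    return "neutral"


# Toy row: low in three control samples, high in three treated samples
toy_row = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
toy_colors = [heat_color(z) for z in row_z_scores(toy_row)]
```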

Identification of novel mRNAs

During the transcriptome assembly process, both known and novel transcripts are identified. A novel transcript is a transcript which contains features not present in the reference annotation. Thus, a novel transcript can be either a new isoform of a known gene or a transcript without any known features. For example, a novel transcript could be the result of a previously unknown splicing event for a known gene, or a previously unknown long noncoding RNA.

Identification of novel transcripts depends upon the reference annotation. For the present study, the human hg19 genome with Gencode v11, Ensembl annotation has been used. Transcripts not part of this annotation are classified as novel.

In the result files we classify novel transcripts with known features by listing the known transcripts most closely resembling the novel transcript. For novel transcripts without any known features we provide a locally unique name as transcript identifier. In addition, we provide the genomic positions of the features of the novel transcript, e.g. the location and number of exons.

Please see the section Differentially expressed novel transcripts for the top differentially expressed novel transcripts, and the associated .tsv files listed in Table 1 for the full list of identified novel transcripts. The full lists of Coding DNA Sequences (CDS), genes, exon isoforms and differential start site isoforms are also presented in these files. The table annotations are complex, but a good reference is the Cufflinks manual, accessible at http://cole-trapnell-lab.github.io/cufflinks/manual/.

Differentially expressed genes

To identify differentially expressed genes, it is assumed that the number of reads produced by each transcript is proportional to its abundance. Exiqon Services has customized the analysis pipeline based on the Tuxedo suite, including the cufflinks, cuffmerge and cuffdiff steps of the Tuxedo pipeline. For more details see the Data Analysis workflow section.

Comparison of Control and Treated experimental groups, known mRNA

The table below shows the individual results for the top 20 most differentially expressed known mRNA genes. A full list of differentially expressed transcripts is given in the associated .tsv file folder listed in Table 1.

Gene_id      Gene           Locus                   Control FPKM  Treated FPKM  log2_fc   q_value
XLOC_027891  PSG5           19:43670407-43690688    0.105278      287.639       11.4158   0.00017
XLOC_015291  TRAC,TRAJ20    14:22968671-23021092    2201.85       1.82472       -10.2368  0.00017
XLOC_017524  GREM1          15:33009470-33026926    2.97941       981.979       8.36452   0.00017
XLOC_002393  KCNK2          1:215179079-215410436   0.392266      116.921       8.21949   0.00017
XLOC_029831  HOXD10,HOXD11  2:176968802-177055688   0.0568068     14.8          8.02532   0.00017
XLOC_005878  DKK1           10:54074055-54077802    0.584649      130.343       7.80053   0.00017
XLOC_048615  CPA4           7:129932973-129964020   0.0782709     16.0989       7.68427   0.00017
XLOC_043699  HAPLN1         5:82933623-83017432     0.127979      26.0171       7.66741   0.00017
XLOC_002158  LHX9           1:197881617-197904608   0.10925       18.199        7.38009   0.00017
XLOC_018098  KIAA1199       15:81071683-81282219    3.25203       517.995       7.31545   0.00017
XLOC_048511  WNT16          7:120965420-120981158   0.0359948     5.10993       7.14937   0.00017
XLOC_019134  BNC1           15:83776158-84108197    0.0357779     4.39717       6.94137   0.00858
XLOC_053529  FOXE1          9:100615535-100620841   0.0127674     1.50468       6.88085   0.00172
XLOC_052560  RP11-94A24.1   8:123424385-123440937   0.092846      9.97136       6.74681   0.00017
XLOC_029702  GALNT5         2:158114004-158184225   0.694575      66.7356       6.58618   0.00017
XLOC_043926  LOX            5:121297655-121414363   3.88793       356.906       6.5204    0.00017
XLOC_018551  RP11-265N7     15:39157031-39719396    0.0901193     8.08362       6.48702   0.00017
XLOC_018916  RP11-709B3.2   15:68591127-68593343    0.0651155     5.5167        6.40466   0.00017
XLOC_000843  SLC1A7         1:53552854-53608289     0.0106137     0.895103      6.39805   0.00172
XLOC_052902  ADAMTSL1       9:18473891-18910948     0.81662       67.3927       6.36678   0.00017

Table 6. Known mRNAs: Table of the 20 most differentially expressed mRNAs, with log fold change (Log2_FC FPKM) between groups Control and Treated, with Benjamini-Hochberg FDR corrected q-values. The list is sorted by the absolute value of Log2_FC. Control and Treated columns are group average FPKM values.
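The two statistics reported in the table can be illustrated with a short sketch. The log2 fold change is the base-2 logarithm of the Treated to Control FPKM ratio, which reproduces the table values from the rounded group averages. The second function is a generic implementation of the Benjamini-Hochberg procedure on a list of p-values. The function names are hypothetical and this is not the cuffdiff implementation.

```python
import math


def log2_fold_change(control_fpkm: float, treated_fpkm: float) -> float:
    """log2 of the Treated/Control ratio; positive means higher in Treated."""
    return math.log2(treated_fpkm / control_fpkm)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR-adjusted q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Group average FPKM values for PSG5 and TRAC,TRAJ20 from the table
psg5 = log2_fold_change(0.105278, 287.639)   # about 11.42
trac = log2_fold_change(2201.85, 1.82472)    # about -10.24
```

Because the q-value depends on the number of tests performed, q-values can only be recomputed from the complete list of p-values in the .tsv files, not from the top 20 rows alone.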

Comparison of Control and Treated experimental groups, isoforms

The table below shows the individual results for the top 20 most differentially expressed isoforms. A full list of differentially expressed transcripts is given in the associated .tsv file folder listed in Table 1.

Gene_id      Gene      Locus                    Control FPKM  Treated FPKM  log2_fc   q_value
XLOC_017584  THBS1     15:39873223-39891667     1.33247       7.57E-19      -60.6112  0.00277274
XLOC_061613  MXRA5     X:3226605-3264684        0.0181225     7.29355       8.65269   0.000632571
XLOC_018098  KIAA1199  15:81071683-81282219     1.21301       382.009       8.29887   0.000632571
XLOC_017524  GREM1     15:33009470-33026926     1.46945       461.722       8.29561   0.000632571
XLOC_005878  DKK1      10:54074055-54077802     0.274098      72.94         8.05587   0.0401713
XLOC_052902  ADAMTSL1  9:18473891-18910948      0.0681227     16.4372       7.91461   0.0172923
XLOC_005878  DKK1      10:54074055-54077802     0.268381      52.1181       7.60136   0.000632571
XLOC_018917  ITGA11    15:68594049-68724501     0.347776      61.0402       7.45546   0.000632571
XLOC_036517  COL8A1    3:99356303-99518070      1.03803       152.764       7.20131   0.000632571
XLOC_010831  MIR125B1  11:121899062-121987031   0.104625      14.9989       7.16349   0.000632571
XLOC_050971  SULF1     8:70378858-70573150      0.383663      52.1578       7.0869    0.000632571
XLOC_007420  MYOF      10:95066185-95242378     0.212314      27.9008       7.03796   0.000632571
XLOC_047074  LAMA4     6:112429962-112672498    0.114356      14.5768       6.99399   0.000632571
XLOC_043926  LOX       5:121297655-121414363    1.5448        188.765       6.93303   0.000632571
XLOC_052540  HAS2      8:122624355-122656933    0.292436      34.83         6.89607   0.000632571
XLOC_011756  HMGA2     12:66151799-66360075     0.0182708     2.13712       6.86998   0.00174538
XLOC_036517  COL8A1    3:99356303-99518070      2.34881       0.0227468     -6.69013  0.0488389
XLOC_042320  RGMB      5:98104353-98134347      0.120388      12.2992       6.67472   0.000632571
XLOC_032410  COL6A3    2:238232645-238323018    1.05789       103.225       6.60846   0.000632571
XLOC_052527  ENPP2     8:120569325-120685693    0.612772      55.4661       6.50011   0.000632571

Table 7. Isoforms: Table of the 20 most differentially expressed isoforms, with log fold change (Log2_FC FPKM) between groups Control and Treated, with Benjamini-Hochberg FDR corrected q-values. The list is sorted by the absolute value of Log2_FC. Control and Treated columns are group average FPKM values.

Differentially expressed novel transcripts

The table below lists the top 20 differentially expressed novel transcripts identified in this project. The second column lists the known transcript most closely resembling each novel transcript. A full list of differentially expressed transcripts is given in the associated .tsv file folder listed in Table 1.

Gene_id      Gene        Locus                    Control FPKM  Treated FPKM  log2_fc  q_value
XLOC_018098  KIAA1199    15:81071683-81282219     1.21301       382.009       8.29887  0.000632571
XLOC_042320  RGMB        5:98104353-98134347      0.120388      12.2992       6.67472  0.000632571
XLOC_030387  TWIST2      2:239756554-239800941    0.351385      24.6092       6.13     0.0254873
XLOC_002847  MEGF6       1:3404343-3528059        0.0160146     1.04102       6.02246  0.0227252
XLOC_062576  AC004383.4  X:133677295-133683592    0.0226872     1.37554       5.92198  0.000632571
XLOC_038209  CCDC14      3:123616151-123680564    0.083794      5.0284        5.90711  0.000632571
XLOC_062109  EDA2R       X:65814248-65859108      0.0242766     1.21551       5.64585  0.00174538
XLOC_035172  FLJ27365    22:46445357-46509808     0.0271923     1.29237       5.57067  0.000632571
XLOC_039129  SLIT2       4:20253546-20622184      0.107497      4.78918       5.47741  0.000632571
XLOC_007986  WEE1        11:9595227-9615004       0.0311229     1.37543       5.46576  0.000632571
XLOC_049166  HOXA9       7:27202053-27219880      0.0722173     3.13241       5.43879  0.000632571
XLOC_000623  MACF1       1:39546987-39952849      0.219376      9.31032       5.40735  0.000632571
XLOC_000982  SRSF11      1:70671364-70718735      0.220088      9.33529       5.40654  0.000632571
XLOC_033204  ADAM33      20:3648611-3662893       0.648576      21.274        5.03567  0.000632571
XLOC_017933  KIF23       15:69591275-69740764     0.0504923     1.63842       5.02009  0.0278861
XLOC_062832  PRKY        Y:7142012-7249729        0.0875391     2.81586       5.0075   0.000632571
XLOC_022325  FKBP10      17:39968931-39979465     1.78179       56.5106       4.98712  0.000632571
XLOC_036517  COL8A1      3:99356303-99518070      4.68501       148.41        4.98539  0.000632571
XLOC_051770  LOXL2       8:23154691-23315208      1.46322       46.2263       4.9815   0.000632571
XLOC_052540  HAS2        8:122624355-122656933    0.541262      16.4134       4.92241  0.000632571

Table 8. Novel transcripts: Table of the 20 most differentially expressed novel transcripts, with log fold change (Log2_FC FPKM) between groups Control and Treated, with Benjamini-Hochberg FDR corrected q-values. The list is sorted on Log2_FC. Control and Treated columns are group average FPKM values.

Volcano plot

The volcano plot provides a quick visual identification of the RNA transcripts displaying large-magnitude changes which are also statistically significant. The plot is constructed by plotting -log10 of the p-value on the y-axis and the expression fold change between the two experimental groups on the x-axis. There are two regions of interest in the plot: points found towards the top of the plot (high statistical significance) and points at the extreme left or right (strongly down- and up-regulated, respectively). Genes that pass the filtering of q-value < 0.05 are indicated on the plot. For the present study, no genes pass this filtering. For volcano plots of other comparisons, please see the additional figures.

Figure 8. Volcano plot showing the relationship between the p-values and the fold change in normalized expression between the experimental groups Control and Treated.

Gene Ontology Enrichment Analysis

Gene Ontology (GO; Gene Ontology Consortium, 2000) enrichment analysis attempts to identify GO terms that are significantly associated with differentially expressed protein coding genes. We investigate whether specific GO terms are more likely to be associated with the differentially expressed mRNAs.

Two different statistical tests are used and compared. Firstly, a standard Fisher's test is used to investigate enrichment of terms between the two test groups. Secondly, the Elim method takes a more conservative approach by incorporating the topology of the GO network to compensate for local dependencies between GO terms, which can mask significant GO terms. Comparing the predictions from these two methods can highlight truly relevant GO terms.

The figure below shows a comparison of the results for the GO (Biological process) terms associated with the significantly differentially expressed mRNAs that were identified between groups Control and Treated. The complete GO enrichment analysis for all of the comparisons is presented in the associated GO folder in the full dataset supplied with the report. The Cellular component (CC) and Molecular function (MF) analyses are presented in the associated data folder. In the plot, the majority of overrepresented terms are not statistically significant, but a small number of terms appear to be relevant.

Figure 9. Scatter plot for significantly enriched GO terms predicted to be associated with differentially expressed genes. The plot shows a comparison of the results obtained by the two statistical tests used. Values along the diagonal are consistent between both methods, with values in the bottom left of the plot corresponding to the terms with the most reliable estimates from both methods. The size of each dot is proportional to the number of genes mapping to that GO term, and the coloring represents the number of significantly differentially expressed genes corresponding to that term, with dark red representing more genes and yellow representing fewer.
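The first of the two tests can be illustrated with a one-sided Fisher's exact test, which is equivalent to a hypergeometric tail probability: given N background genes of which K are annotated to a GO term, how likely is it to see at least k annotated genes among n_sig differentially expressed genes? The sketch below is a generic implementation with hypothetical function and parameter names; it does not implement the Elim method and is not the software used for the report.

```python
from math import comb


def fisher_enrichment_p(k: int, n_sig: int, K: int, N: int) -> float:
    """One-sided p-value for observing at least k annotated genes.

    N: number of genes in the background set
    K: genes annotated to the GO term
    n_sig: differentially expressed genes
    k: differentially expressed genes annotated to the GO term
    """
    upper = min(K, n_sig)
    total = comb(N, n_sig)
    tail = sum(comb(K, i) * comb(N - K, n_sig - i) for i in range(k, upper + 1))
    return tail / total


def expected_hits(n_sig: int, K: int, N: int) -> float:
    """Number of annotated genes expected among the significant genes by chance."""
    return n_sig * K / N


# Toy numbers: 10 background genes, 3 annotated, 4 significant, 2 of them annotated
toy_p = fisher_enrichment_p(k=2, n_sig=4, K=3, N=10)
```

A term is called enriched when the observed count k is clearly larger than the value returned by `expected_hits` and the p-value is small.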

A list of potentially significant GO (Biological process) terms is given in the table below.

GO.ID       Term                                         Annotated  Significant  Expected  Rank in Classic Fisher  Classic KS p-value  elimKS p-value
GO:0030198  extracellular matrix organization            255        255          255       1                       9.30E-08            9.30E-08
GO:0006954  inflammatory response                        347        347          347       2                       4.00E-05            7.00E-05
GO:0007156  homophilic cell adhesion                     65         65           65        3                       8.50E-05            8.50E-05
GO:0048646  anatomical structure formation involved...   615        615          615       4                       1.20E-06            0.00012
GO:0007155  cell adhesion                                682        682          682       5                       8.00E-09            0.00019
GO:0007420  brain development                            356        356          356       6                       2.10E-05            0.00021
GO:0030334  regulation of cell migration                 335        335          335       7                       1.00E-05            0.00035
GO:0045666  positive regulation of neuron differenti...  60         60           60        8                       0.00039             0.00039
GO:0007411  axon guidance                                254        254          254       9                       0.00045             0.00045
GO:0006536  glutamate metabolic process                  27         27           27        10                      0.00046             0.00046
GO:0050679  positive regulation of epithelial cell p...  104        104          104       11                      0.0006              0.0006
GO:0030195  negative regulation of blood coagulation     32         32           32        12                      0.00082             0.00082
GO:0048878  chemical homeostasis                         482        482          482       13                      0.00082             0.00082
GO:0050900  leukocyte migration                          214        214          214       14                      0.00083             0.00083
GO:0051223  regulation of protein transport              217        217          217       15                      0.001               0.001
GO:0006958  complement activation, classical pathway     30         30           30        16                      0.00102             0.00102
GO:0072330  monocarboxylic acid biosynthetic process     128        128          128       17                      0.00116             0.00116
GO:0051050  positive regulation of transport             362        362          362       18                      0.0002              0.00117
GO:0001837  epithelial to mesenchymal transition         65         65           65        19                      0.0014              0.0014
GO:0034375  high-density lipoprotein particle remode...  12         12           12        20                      0.00142             0.00142

Table 9. The top 20 significant GO terms for the genes found to be differentially expressed between Control and Treated and their corresponding annotation for Biological process (BP). The associated network topology is shown in Figure 10.

To illustrate how the different GO terms are linked, a GO network has been created.

Figure 10. GO network generated from the GO terms predicted to be enriched for the Biological process (BP) vocabulary. Nodes are colored from red to yellow, with the node with the strongest support colored red and nodes with no significant enrichment colored yellow. The five nodes with the strongest support are drawn as rectangles. A high-resolution version of this graph is found in the supplementary figures.

Conclusion and next steps

mRNA Next Generation Sequencing libraries were successfully prepared, quantified and sequenced for all your samples. The data passed all QC metrics, with high Q-scores, indicating good technical performance of the NGS experiment. A high percentage of the reads could be mapped to the reference genome, indicating that the samples were of high quality.

A large number of novel transcripts were identified. Note, however, that many of these will be novel isoforms or start sites of known genes and transcripts.

It is clear from the unsupervised analysis that the samples cluster according to their biological groups, indicating that the difference between the sample groups accounts for the largest variation among the samples. The supervised analysis showed large numbers of significantly differentially expressed mRNAs at the CDS (Coding DNA Sequence) and gene level as well as at the isoform level.

Note: when navigating through these data, transcripts with fewer than 1-5 FPKM (on average) per group might be difficult to validate in a qPCR experiment.

We would like to help you interpret the data presented in this report and guide you on how best to proceed with subsequent experiments. If you would like to arrange a time to discuss the data with us in more detail, please do not hesitate to contact DxServices@exiqon.com and we will be happy to arrange a phone call with you.
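The note above on low FPKM values can be turned into a simple check before selecting qPCR candidates. The following is a minimal sketch only: the function name, the group names and the FPKM values are hypothetical, and the threshold of 1.0 is an assumption taken from the lower end of the 1-5 FPKM range mentioned in the note.

```python
from statistics import mean


def low_abundance_groups(group_fpkm, threshold=1.0):
    """Return the names of groups whose mean FPKM is below the threshold.

    group_fpkm maps a group name to the FPKM values of one transcript in
    the replicates of that group. The default threshold of 1.0 FPKM is an
    assumption (lower end of the 1-5 FPKM range given in the note).
    """
    return [
        name
        for name, values in group_fpkm.items()
        if mean(values) < threshold
    ]


# Hypothetical FPKM values for a single transcript, for illustration only.
transcript = {
    "Control": [0.4, 0.9, 0.6],
    "Treated": [12.1, 9.8, 11.4],
}

flagged = low_abundance_groups(transcript)
print("Groups below threshold:", flagged)
```

A transcript flagged in one or both groups is not necessarily a poor candidate, but its expression level in that group may be hard to confirm by qPCR. Raising the threshold to 5.0 applies the more conservative end of the range.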

miRSearch

If you are interested in looking at which microRNAs are regulating your transcripts, Exiqon offers two options for further data mining of the results:

miRSearch 3.0
An interactive miRSearch database, offering you up-to-date information on specific microRNAs, tissues and diseases, as well as co-regulated microRNAs, target genes and much more. miRSearch includes a built-in report feature which allows you to easily collect and store all the relevant information gathered. Access miRSearch from this address: http://www.exiqon.com/mirsearch

XploreRNA
XploreRNA is an advanced database search tool for scientists engaged in transcriptome analysis. The XploreRNA app enables scientists unfamiliar with database searches to access relevant public and proprietary genetic and molecular biology databases through a simple user interface. All databases are cross-annotated, and relevant databases are regularly updated by advanced text mining of the literature, e.g. with respect to new information on microRNA-mRNA interactions. The app provides information from major databases such as Ensembl and miRBase. XploreRNA can be downloaded from the App Store and Google Play. All search results provide information on literature reference(s), with integrated access to PubMed for reading abstracts and original publications.

Data Analysis workflow

Software tools used for the analysis

Our data analysis pipeline is based on the Tuxedo software package, which is a combination of open-source software and implements peer-reviewed statistical methods. In addition, we employ specialized software developed internally at Exiqon to interpret and improve the readability of the final results. The components of our NGS RNA-seq analysis pipeline include Bowtie2 (v2.2.2), TopHat (v2.0.11) and Cufflinks (v2.2.1) and are described in detail below.

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns the sequencing reads to the reference genome using the sequence aligner Bowtie2. TopHat also uses the sequence alignments to identify splice junctions for both known and novel transcripts.

Cufflinks takes the alignment results from TopHat and assembles the aligned sequences into transcripts, constructing a map or snapshot of the transcriptome. To guide the assembly process, an existing transcript annotation is used. In addition, we perform fragment bias correction, which seeks to correct for sequence bias introduced during library preparation (see Kasper et al., 2010 and Roberts et al., 2011). Cufflinks assembles aligned reads into different transcript isoforms based on exon usage and also determines the transcriptional start sites (TSSs).

When comparing groups, Cuffdiff is used to calculate the FPKM (number of fragments per kilobase of transcript per million mapped fragments) and to test for differential expression and regulation among the assembled transcripts across the submitted samples using the Cufflinks output. Cuffdiff can be used to test differential expression at different levels, from the CDS and gene level down to the isoform and TSS level. For more information on the Cuffdiff module, see Trapnell et al. (2013).

As a final step, CummeRbund, an open-source R package, is used in combination with in-house custom software for post-processing of the Cufflinks and Cuffdiff results. We use these tools to generate a visual representation of your sequencing results to aid the interpretation of the sequencing data and the analysis results.
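The FPKM definition above can be illustrated with a few lines of Python. This is a simplified sketch of the textbook formula only, not the Cuffdiff implementation: Cuffdiff estimates abundances with a statistical model and bias correction, whereas the function below (a hypothetical helper with made-up example numbers) simply normalizes a raw fragment count by transcript length and sequencing depth.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Simplified textbook formula; it ignores effective-length and bias
    corrections that dedicated tools apply.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions_mapped)


# Made-up example: 500 fragments on a 2,000 bp transcript in a sample
# with 25 million mapped fragments.
example_value = fpkm(500, 2_000, 25_000_000)
print(f"FPKM = {example_value:.2f}")
```

Dividing by transcript length makes long and short transcripts comparable within a sample, and dividing by the number of mapped fragments makes samples of different sequencing depth comparable.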

Material and methods

All experiments were conducted at Exiqon Services, Denmark.

Library preparation and Next Generation Sequencing

The library preparation was done using the TruSeq Stranded mRNA Sample Preparation Kit (Illumina Inc.). The starting material (100 ng of total RNA) was mRNA-enriched using the oligo-dT bead system (manufacturer). The isolated mRNA was subsequently fragmented using enzymatic fragmentation (manufacturer, enzymes?). Then first strand synthesis and second strand synthesis were performed and the double-stranded cDNA was purified (AMPure XP, Beckman Coulter). The cDNA was end-repaired and 3' adenylated, Illumina sequencing adaptors were ligated onto the fragment ends, and the library was purified (AMPure XP). The stranded mRNA libraries were pre-amplified with PCR and purified (AMPure XP). The size distribution of the libraries was validated and their quality inspected on a Bioanalyzer High Sensitivity DNA chip (Agilent Technologies). High-quality libraries were quantified using qPCR, the concentrations were normalized and the samples were pooled according to the project specification (number of reads). The library pool(s) were re-quantified with qPCR, and the optimal concentration of the library pool was used to generate the clusters on the surface of a flow cell before sequencing on a NextSeq 500 instrument using the High Output sequencing kit (150 cycles) according to the manufacturer's instructions (Illumina Inc.).

References

Trapnell, C., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5): 511-515.
Trapnell, C., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7: 562-578.
Trapnell, C., et al. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9): 1105-1111.
Langmead, B., et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3): R25.
Roberts, A., et al. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 27(17): 2325-2329.
Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11: R106.
Goff, L., et al. (2012). http://www.bioconductor.org/packages/release/bioc/html/cummerbund.html
Robinson, J. T., et al. (2011). Integrative genomics viewer. Nature Biotechnology, 29: 24-26.
Thorvaldsdóttir, H., et al. (2012). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics.
Kasper, D., et al. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38(12).
Roberts, A., et al. (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology, 12: R22.
Marinov, G. K., et al. (2014). From single-cell to cell-pool transcriptomes: Stochasticity in gene expression and RNA splicing. Genome Research, 24: 496-510.
Kellis, M., et al. (2013). Defining functional DNA elements in the human genome. PNAS, 111: 6131-6138.

Frequently asked questions

Question: What is Q-score?
Answer: A quality score (or Q-score) is a prediction of the probability of an incorrect base call. Q-score = -10 log10(P(~X)), where P(~X) is the estimated probability of the base call being wrong. A quality score of 10 indicates an error probability of 0.1, a quality score of 20 indicates an error probability of 0.01, a quality score of 30 indicates an error probability of 0.001, and so on.

Question: What is the difference between FPKM and RPKM?
Answer: RPKM stands for Reads per Kilobase of exon per Million mapped reads; FPKM stands for Fragments per Kilobase of exon per Million mapped fragments. The term fragments refers to the cDNA fragments present during library preparation. Both RPKM and FPKM are normalized numbers which tell you something about the relative abundance of, for example, an assembled transcript. In paired-end sequencing, two reads are produced per cDNA fragment, whereas only one read is produced per cDNA fragment in single-end sequencing. Thus, single-end versus paired-end sequencing will affect the value of RPKM but not FPKM. Consequently, FPKM is preferred over RPKM, as it provides values that are comparable between single-end and paired-end sequencing.

Question: What does 1 FPKM mean in terms of abundance?
Answer: This is difficult to estimate and highly variable according to cell type and the total number of mRNAs in a given cell. For example, a single-cell analysis of the cell line GM12878 estimated that one transcript copy corresponds to 10 FPKM (Marinov 2014). Others find that FPKMs are not directly comparable among different subcellular fractions, as they reflect relative abundances within a fraction rather than average absolute transcript copy numbers per cell (Kellis 2013). Depending on the total amount of RNA in a cell, one transcript copy per cell corresponds to between 0.5 and 5 FPKM in PolyA+ whole-cell samples according to current estimates, with the upper end of that range corresponding to small cells with little RNA and vice versa.

Question: What is a novel RNA transcript?
Answer: A novel transcript is a transcript that is not present in the reference annotation, for example because it originates from a region that lacks annotation. The identification of novel transcripts therefore depends on the reference annotation.

Question: A novel transcript identified seems to be a known gene when I look it up in the gene browser, why is that?
Answer: Most novel transcripts are not new genes but different isoforms of previously annotated genes. A novel transcript is most commonly a novel combination of exons or a different start site.
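Two of the answers above involve simple arithmetic that can be sketched in Python: the Q-score formula, and the rough conversion between FPKM and transcript copies per cell. The function names are hypothetical, and the copy-number helper assumes only the 0.5 to 5 FPKM-per-copy range quoted above for PolyA+ whole-cell samples; it gives an order-of-magnitude range, not a measurement.

```python
import math


def q_to_error_probability(q):
    """Error probability implied by a Q-score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_q(p_error):
    """Q-score implied by an error probability: Q = -10 * log10(P)."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def copies_per_cell_range(fpkm, fpkm_per_copy_low=0.5, fpkm_per_copy_high=5.0):
    """Rough (min, max) transcript copies per cell for a given FPKM.

    Assumes one copy per cell corresponds to between 0.5 and 5 FPKM,
    the range quoted in the FAQ for PolyA+ whole-cell samples.
    """
    return fpkm / fpkm_per_copy_high, fpkm / fpkm_per_copy_low


for q in (10, 20, 30):
    print(f"Q{q}: error probability {q_to_error_probability(q):g}")

low, high = copies_per_cell_range(10.0)
print(f"10 FPKM is roughly {low:g} to {high:g} copies per cell")
```

The wide range returned by the second helper reflects the point made in the answer: the relationship between FPKM and absolute copy number depends strongly on the total amount of RNA in the cell.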