Targeted RNA sequencing reveals the deep complexity of the human transcriptome.


Tim R. Mercer 1, Daniel J. Gerhardt 2, Marcel E. Dinger 1, Joanna Crawford 1, Cole Trapnell 3, Jeffrey A. Jeddeloh 2,4, John S. Mattick 1,4 and John L. Rinn 3,4

1 Institute for Molecular Bioscience, University of Queensland, Brisbane QLD 4067, Australia.
2 Roche NimbleGen, Inc, Research and Development, Madison, WI 53719, U.S.A.
3 Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA 02138, U.S.A.

SUPPLEMENTARY INFORMATION:
1. RESULTS
2. FIGURES
3. TABLE LEGENDS
4. DATA LEGENDS
5. REFERENCES

SUPPLEMENTARY RESULTS

Validation of array design

To confirm that our custom array design and manufacture achieve specificity and sensitivity similar to previous exome sequencing approaches (refs. 1, 2), we first conducted capture sequencing of matched foot fibroblast genomic DNA using 454 GS FLX Titanium. We found that 65.8% of sequenced reads overlapped with probed regions, providing coverage over 99.2% of targeted bases with relatively uniform coverage (Supplementary Fig. 1b). This coverage exhibited no discernible bias for probed region GC content (r² = -0.05; Supplementary Fig. 1c). To assess the reproducibility of this coverage, we performed an additional round of capture sequencing using genomic DNA harvested from an independent (fetal lung fibroblast) cell line. Of these sequenced reads, 71.8% overlapped with probed regions and provided coverage over 99.3% of targeted bases, similar to that observed above. A comparison of probe region coverage between the foot and lung fibroblast cell lines showed a high linear correlation (r² = 0.94), indicating reproducibility across the two independent cell lines. Lastly, we used the relative coverage from the genomic cDNA control capture to filter the probe design for hybridization defects, excluding from further analysis those regions for which the alignment of reads was anomalously high (Supplementary Fig. 1d).

Comparison of RNAseq and CaptureSeq library diversity

To assess the impact of capture on library diversity and to detect artifactual PCR amplification bias, we employed two alternative analyses. Firstly, we compared the population structure of matched RNAseq and CaptureSeq libraries at equivalent depth, observing a similar frequency of sequenced reads represented singly in either RNAseq (69.5%) or CaptureSeq (66.5%) libraries and only slight divergence across the broader sequenced read population (Supplementary Fig. 2b).
Secondly, we compared the relative alignment fold coverage across targeted transcripts, finding little deviation between pre-capture RNAseq and CaptureSeq libraries (Supplementary Fig. 2d). This analysis also revealed that transcripts overlapping capture probes by as little as 12% were enriched and assembled in their entirety (Supplementary Fig. 2e). Collectively, this demonstrates that, despite a reduction in transcript diversity, CaptureSeq does not notably reduce library diversity or introduce a substantial PCR amplification bias.

Assessment of variation in transcript enrichment following CaptureSeq

We next considered variation in transcript enrichment following CaptureSeq, an important prerequisite for the use of CaptureSeq in quantitative applications. We considered variation that may be introduced at two stages of the CaptureSeq protocol: following capture of transcripts, and then following sequencing. Firstly, we employed qRT-PCR to compare the fold enrichment of 11 transcripts (4 protein-coding genes and 7 novel lncRNAs), expressed across a range of abundances, between pre- and post-capture. We observed a relatively uniform enrichment (mean = 9.1, SD = 0.95), regardless of differences in transcript abundance (Supplementary Fig. 5k). Secondly, we considered the enrichment in estimated abundance for those transcripts shared between conventional RNAseq and RNA CaptureSeq, observing a high correlation between the relative coverage depth of probed regions (r² = 0.92) and between the expression of shared transcripts (r² = 0.83) over a similarly large dynamic range (over five orders of magnitude; Supplementary Fig. 5g-i).

Reproducibility of RNA CaptureSeq

To assess the reproducibility of the RNA CaptureSeq approach, we next performed two technical replicates for both independent cell lines (foot and fetal lung fibroblasts) employed within the study. Again, we assessed reproducibility between these technical replicates following capture and also following sequencing. Firstly, we employed qRT-PCR to compare the abundance of 11 transcripts (as above) between technical replicates, observing a high correlation (r² = 0.99 and r² = 1.0; Supplementary Fig. 5a,b,d,e). Secondly, we performed 454 sequencing on matched captured technical replicate samples, returning a high correlation between technical replicates for both cell lines when considering either coverage of probed regions (r² = 0.99) or estimated abundance of captured transcripts (r² = 0.97; Supplementary Fig. 5c,f).

Assessment of RNA CaptureSeq ability to retain differential gene abundance

We next assessed the ability of RNA CaptureSeq to retain with fidelity the differential gene expression profiles of the original uncaptured samples and thereby permit cross-sample comparisons. We employed qRT-PCR to determine whether changes in gene expression observed by CaptureSeq between two independent cell lines (foot and fetal lung fibroblasts) accurately reflect the underlying differences between samples prior to capture.
We confirmed for 11 transcripts (see above), expressed across a range of abundances, that relative differences in gene expression between foot and lung fibroblasts were well preserved following capture (Supplementary Fig. 6b). Furthermore, for those HOX genes considered, fold changes were similar to those previously reported using alternative methods (refs. 3, 4). Lastly, we considered whether changes in differential expression for the selected transcripts are also concordant with changes in transcript abundance as estimated by sequencing. We found that the expression of transcripts as determined by qRT-PCR in both pre- and post-capture RNA was highly concordant with abundance estimates from sequencing (r² = 0.90; Supplementary Fig. 6b). However, we observed a slightly (6%) lower estimate of fold change for highly expressed genes by sequencing than by qRT-PCR. Because lowly expressed transcripts were not appreciably affected, this possibly reflects saturation of the capture probes.
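The summary statistics quoted throughout these results (the fraction of reads overlapping probed regions, the fraction of targeted bases covered, and r² between two coverage or abundance vectors) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, the half-open interval convention, the single-chromosome simplification and the toy coordinates are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumption: intervals are half-open [start, end) on a single chromosome.
Interval = Tuple[int, int]


def on_target_fraction(reads: List[Interval], probes: List[Interval]) -> float:
    """Fraction of reads that overlap at least one probed region."""
    def overlaps_any(read: Interval) -> bool:
        return any(read[0] < p_end and p_start < read[1]
                   for p_start, p_end in probes)
    return sum(overlaps_any(r) for r in reads) / len(reads)


def covered_base_fraction(reads: List[Interval], probes: List[Interval]) -> float:
    """Fraction of probed bases covered by at least one read."""
    targeted = set()
    for start, end in probes:
        targeted.update(range(start, end))
    covered = set()
    for start, end in reads:
        covered.update(range(start, end))
    return len(targeted & covered) / len(targeted)


def r_squared(x: List[float], y: List[float]) -> float:
    """Squared Pearson correlation between two equal-length vectors.

    Not defined (division by zero) if either vector is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical coordinates, not taken from the study).
toy_probes = [(100, 200), (300, 350)]
toy_reads = [(90, 150), (140, 210), (250, 280), (320, 340)]

print(on_target_fraction(toy_reads, toy_probes))     # 3 of 4 reads overlap a probe
print(covered_base_fraction(toy_reads, toy_probes))  # 120 of 150 probed bases covered
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))         # perfectly linear toy vectors
```

The set-based base counting is chosen for readability; for genome-scale data an interval tree or a dedicated tool would be the more realistic choice.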

SUPPLEMENTARY FIGURES

Supplementary Fig. 1. Array design and validation. (a) Summary of probe region characteristics, including fractional overlap with assembled transcripts from previous gene annotation (top blue panel) or pre-capture library (lower blue panel), and the sum of sequenced reads from the pre-capture library that align within the probe region, including total aligned reads (top red panel) and aligned reads unable to be assigned to previous gene models (lower red panel). (b) Frequency distribution showing uniform mean coverage of probed regions by sequenced genomic DNA reads. (c) Comparison of probe region GC content with aligned genomic DNA enrichment. (d) Frequency distribution of sequenced genomic DNA reads aligning to probed regions. Probed regions showing high hybridisation were omitted from further analysis (red).

Supplementary Fig. 2. RNAseq and CaptureSeq comparison. (a) Cumulative distribution of probed region expression in pre-capture (blue) and captured (red) foot fibroblast libraries. (b) Cumulative frequency distribution of sequenced read representation within RNAseq (blue) and CaptureSeq (red) libraries. (c) Relationship between capture region size and effective sequencing depth according to the sequenced libraries used within this study. The size of capture (0.77 Mb) employed within this study is indicated by the blue line. (d) Normalised coverage of transcripts by aligned sequenced reads in RNAseq and RNA CaptureSeq shows a similar coverage profile. (e) Fraction of captured transcript length overlapping probed bases.
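As an illustration of the read-representation measure in panel (b), the sketch below computes the fraction of reads that are represented singly after subsampling two libraries to the same depth. It rests on assumptions not stated in the legend: reads are compared by exact sequence identity, "represented singly" is taken as the share of all reads whose sequence occurs exactly once, and the function names and toy sequences are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def subsample(reads: Sequence[str], depth: int, seed: int = 0) -> List[str]:
    """Draw `depth` reads without replacement so libraries can be compared at equal depth."""
    return random.Random(seed).sample(list(reads), depth)


def singleton_fraction(reads: Sequence[str]) -> float:
    """Fraction of reads whose exact sequence occurs only once in the library."""
    counts = Counter(reads)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(reads)


# Toy library (hypothetical sequences): two of seven reads are seen exactly once.
toy_library = ["ACGT", "ACGT", "TTGA", "GGCC", "GGCC", "GGCC", "CATG"]
print(singleton_fraction(toy_library))

# Comparing two libraries at equivalent depth (the smaller library sets the depth).
other_library = ["ACGT", "TTGA", "GGCC", "CATG", "AAAA"]
depth = min(len(toy_library), len(other_library))
print(singleton_fraction(subsample(toy_library, depth)),
      singleton_fraction(subsample(other_library, depth)))
```

A different definition (for example, singletons as a share of distinct sequences rather than of all reads) would change the numbers but not the structure of the comparison.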

Supplementary Fig. 3. Validation of captured isoforms. (a, b) RT-PCR validation of p53 isoform variants (a) and of 13 (of 15) identified intergenic ncRNAs (b). RT-PCR primers are described in Supplementary Table 4.

Supplementary Fig. 4. Characterisation of novel isoforms and intergenic transcripts. (a) Nucleotide enrichment of 5′ and 3′ splice junctions from annotated genes and from novel isoforms and intergenic transcripts identified by RNA CaptureSeq. (b, c) Cumulative fractional coverage by PhastCons elements (ref. 5) of novel and annotated exons of captured genes (b) or of assembled full-length transcripts (c; including annotated RefSeq genes, captured genes, and novel isoforms, intergenic and antisense lncRNA transcripts). (d, e) Cumulative frequency distribution indicating relative expression of novel and annotated exons (d) and assembled full-length transcripts (e). (f, g) Cumulative coding potential score (ref. 6) (f) and codon substitution frequency (ref. 7) (g) of full-length transcripts assembled from captured libraries.

Supplementary Fig. 5. Retention of differential transcript abundances by CaptureSeq. (a, b, d, e) Comparative analysis of reproducibility by qRT-PCR between two replicate libraries before capture (a, d) and following capture (b, e). (c, f) Comparative analysis of reproducibility of transcript abundance estimates derived from 454 sequencing between two replicates from foot (c) and lung (f) fibroblasts. (g) Cumulative frequency distribution with overlain box-whisker plot (mean with percentiles) indicates the magnitude of differential expression between highly and lowly expressed genes in RNAseq and RNA CaptureSeq. (h) Comparative analysis of relative expression of probed regions as determined by RNAseq and RNA CaptureSeq. (i) Comparative analysis of estimated abundance for assembled transcripts shared between pre- and post-capture libraries. (j) Probed regions ranked and grouped by descending expression show fold enrichment between pre- and post-capture sequencing (box-whisker plot showing mean with 5-95 percentile). (k) Histogram indicates fold-change enrichment (mean shown, error bars indicate SD, n=4) following capture for 6 lowly expressed and newly described ncRNAs (blue) and 4 abundant protein-coding genes (red).

Supplementary Fig. 6. Assembled transcript abundance. (a) Comparative analysis of assembled transcript abundance between foot and lung primary human fibroblasts. (b) Comparative scatter-plot indicates the close concordance between these three measures of fold change. (c) Frequency of publicly available ESTs, mRNAs (from UCSC genome browser, hg19) and CAGE tags that share any overlap with novel intergenic transcripts.

Supplementary Fig. 7. Characterization of captured intergenic transcripts. (a) Sum of RNAseq (blue) and CaptureSeq (red) aligned sequenced reads across probed regions. (b) Box-whisker plot (mean with 5-95 percentile) showing frequency of alignments overlapping probed regions with evidence of splicing (left) or no evidence of splicing (right) between pre-capture (red shades) and post-capture libraries (blue shades). (c) Cumulative fractional coverage by PhastCons elements (ref. 5) of full-length transcripts assembled from captured libraries. (d) Cumulative coding potential score (ref. 6) of full-length transcripts assembled from captured libraries.

Supplementary Fig. 8. Alignment coverage provided by CaptureSeq. (a) Cumulative frequency distribution indicating the raw sequenced read frequency aligning to captured intergenic exons from both RNAseq and CaptureSeq. Also included (yellow dashed line) is the raw sequenced read alignment for RNAseq to all assembled exons. The large difference between these raw alignment frequencies indicates the massively enriched coverage achieved by CaptureSeq. (b) Cumulative distribution of nucleotide sequence expression in pre- and post-capture libraries, showing a significant drop in nucleotides that are represented singly within CaptureSeq. (c) RT-PCR validation of novel exons annotated from captured RNA that rescue unassigned intronic reads from RNAseq (RT-PCR primers are described in Supplementary Table 4). (d) Abundance recall rates for CaptureSeq at varying library depth. For libraries of different sizes, we determined the fraction of transcripts that reach within a given percentage of the final RPKM (as determined from the full CaptureSeq library). This indicates the loss of accuracy in gene abundance estimates that follows multiplex sample preparation and the concomitant loss of sequencing depth.
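The recall measure in panel (d) can be sketched as follows. RPKM is computed with its standard definition (reads per kilobase of transcript per million mapped reads); the recall function, its relative-tolerance parameter and the toy values are assumptions for illustration, since the legend does not state the tolerance or the exact procedure used.

```python
from typing import Dict


def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def recall_within_tolerance(full_rpkm: Dict[str, float],
                            subsampled_rpkm: Dict[str, float],
                            tolerance: float) -> float:
    """Fraction of transcripts whose subsampled RPKM lies within a relative
    `tolerance` (hypothetical parameter, e.g. 0.1 for 10%) of the RPKM
    estimated from the full library. Transcripts missing from the
    subsampled library count as RPKM 0."""
    hits = 0
    for transcript_id, final in full_rpkm.items():
        estimate = subsampled_rpkm.get(transcript_id, 0.0)
        if final > 0 and abs(estimate - final) / final <= tolerance:
            hits += 1
    return hits / len(full_rpkm)


# Toy values (hypothetical, not taken from the study).
print(rpkm(read_count=500, transcript_length_bp=2000, total_mapped_reads=1_000_000))

full = {"t1": 250.0, "t2": 10.0, "t3": 1.0}
shallow = {"t1": 240.0, "t2": 13.0}  # t3 is not detected at reduced depth
print(recall_within_tolerance(full, shallow, tolerance=0.1))
```

In this toy case only the abundant transcript stays within tolerance at reduced depth, which mirrors the qualitative point of the panel: lowly expressed transcripts lose accuracy first as depth falls.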

SUPPLEMENTARY TABLE LEGENDS

Table S1. Summary of library sequencing and alignment employed within the study.
Table S2. Size and genome coordinates of probed regions.
Table S3. Characterisation of selected probe regions by: expression in c45 fibroblast libraries; fraction coverage by assembled transcripts and current RefSeq annotations; and corresponding assembled transcript or RefSeq identifiers.
Table S4. Primer sequences used within the study.

SUPPLEMENTARY DATA LEGENDS

Data S1. Transcripts assembled from RNAseq foot fibroblast short read (Illumina paired-end) libraries (bed file).
Data S2. Transcripts assembled from CaptureSeq foot fibroblast short read (Illumina paired-end) libraries (bed file).
Data S3. Transcripts assembled from CaptureSeq foot fibroblast long read (454) libraries (bed file).
Data S4. Transcripts assembled from CaptureSeq combined long (454) and short (Illumina paired-end) read libraries (bed file).
Data S5. Transcripts assembled from CaptureSeq fetal lung fibroblast long read (454) libraries (bed file).

SUPPLEMENTARY REFERENCES

1. Li, Y. et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet 42, (2010).
2. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, (2009).
3. Rinn, J.L., Bondre, C., Gladstone, H.B., Brown, P.O. & Chang, H.Y. Anatomic demarcation by positional variation in fibroblast gene expression programs. PLoS Genet 2, e119 (2006).
4. Rinn, J.L. et al. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell 129, (2007).
5. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, (2005).
6. Kong, L. et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 35, W (2007).
7. Stacey, S.N. et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet 39, (2007).