mRNA Sequencing Quality Control (V6)


Notes: the following analyses are based on 8 adult brains sequenced at USC and Yale.

1. Error Rates

The error rate of each sequencing cycle is reported for 120 tiles per lane. A tile with an error rate below 5% is considered a high-quality tile. We sequenced 16 regions collected from 5 brains. Most of the tiles satisfy this threshold, particularly during the first 50 sequencing cycles (Fig.1).

Fig.1 Box plots showing error rate against sequencing cycle number. The median error rate (thick black line) stays below the 5% threshold (red line) throughout all sequencing cycles, although error rates tend to increase towards the later cycles.
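The tile filter and the per-cycle median from Section 1 can be sketched as follows. This is a minimal illustration, not the pipeline used for the report: the function names, the input layout (plain dictionaries) and the numbers are all assumptions made for the example.

```python
from statistics import median

ERROR_THRESHOLD = 5.0  # percent; the high-quality cutoff stated in Section 1


def high_quality_tiles(tile_error_rates, threshold=ERROR_THRESHOLD):
    """Return the ids of tiles whose error rate is below the threshold."""
    return [tile for tile, rate in tile_error_rates.items() if rate < threshold]


def median_error_by_cycle(errors_by_cycle):
    """Map each cycle number to the median error rate across its tiles."""
    return {cycle: median(rates) for cycle, rates in errors_by_cycle.items()}


# Made-up error rates (percent), for illustration only.
tiles = {1: 0.8, 2: 4.9, 3: 6.2}
cycles = {1: [0.5, 0.7, 0.9], 75: [2.0, 4.0, 7.0]}

good_tiles = high_quality_tiles(tiles)
cycle_medians = median_error_by_cycle(cycles)
```

In this toy input, tile 3 is rejected because its error rate exceeds 5%, and the median of the later cycle is higher than that of the first cycle, which mirrors the trend described for Fig.1.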

2. mRNAseq sequencing depth

The numbers of unique reads mapping to the 23 chromosomes of the human reference genome (light blue) and to chrM (pink) are depicted in a bar graph ordered by brain code (Fig.2a). The same data are also shown ordered by brain neocortical areas and regions (Fig.2b). The remaining reads, namely reads with lower mapping scores (yellow), reads with multiple mapping sites (red), QC-filtered reads (cyan) and unmapped reads (gray), are depicted in a separate bar graph (Fig.2c). HSB144 has a markedly larger number of chrM reads than the other brains. In particular, two samples (HSB144.MD and HSB144.V1C) have fewer than 10M unique reads (black horizontal line) mapped to the 23 chromosomes, which might not be enough for downstream analysis. All regions of HSB125 have a small number of unique reads and a large number of multiple-mapping reads. In addition, HSB136.A1C, HSB136.DFC and HSB136.V1C have fewer unique reads because of large numbers of multiple-mapping and QC-filtered reads.

Fig.2a Bar plots showing mRNAseq depth of unique mapping, ordered by brain codes

Fig.2b Bar plots showing mRNAseq depth of unique mapping, ordered by brain function areas and regions

Fig.2c Bar plots showing mRNAseq depth, ordered by brain codes
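The depth check described in Section 2 (samples with fewer than 10M unique reads mapped to the chromosomes, and the share of unique reads taken by chrM) could be automated roughly as below. The sample names and read counts are hypothetical placeholders, not values from this data set, and the helper names are assumptions.

```python
MIN_UNIQUE_READS = 10_000_000  # the 10M cutoff drawn as a horizontal line in Fig.2


def low_depth_samples(unique_nuclear_reads, minimum=MIN_UNIQUE_READS):
    """Return the sorted names of samples with too few unique nuclear reads."""
    return sorted(name for name, n in unique_nuclear_reads.items() if n < minimum)


def chrm_percent(nuclear_unique, chrm_unique):
    """Percentage of uniquely mapped reads that fall on chrM."""
    total = nuclear_unique + chrm_unique
    return 100.0 * chrm_unique / total if total else 0.0


# Hypothetical (nuclear unique reads, chrM unique reads) per sample.
counts = {
    "sample_A": (25_000_000, 1_000_000),
    "sample_B": (6_000_000, 18_000_000),
}

flagged = low_depth_samples({name: pair[0] for name, pair in counts.items()})
chrm_share = {name: chrm_percent(*pair) for name, pair in counts.items()}
```

Here `sample_B` is flagged for low depth, and most of its unique reads map to chrM, which is the pattern the text reports for two HSB144 samples.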

3. Effect of mitochondrial reads on overall gene expression

We compared gene RPKM values computed with and without mitochondrial reads. The log2 ratio of (gene RPKM + 1) with chrM reads to (gene RPKM + 1) without chrM reads is plotted against the sample index (Fig.3). Circles indicate mitochondrial genes. The presence of chrM reads only slightly affected the samples from each brain, except those from HSB144. The presence of chrM reads produced slightly more variation for exon reads (data not shown).

Fig.3 Effect of mitochondrial reads on the RPKM of genes
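As a worked illustration of the comparison in Section 3, the sketch below computes RPKM with the standard definition (reads per kilobase of transcript per million mapped reads) and then the log2 ratio of (RPKM + 1) with and without chrM reads. It assumes that, for a nuclear gene, removing chrM reads changes only the total number of mapped reads used for normalization; the report does not state how the two RPKM sets were produced, so treat this as one plausible reading. All numbers are made up.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_ratio_with_vs_without_chrm(read_count, gene_length_bp,
                                    total_reads_with_chrm, chrm_reads):
    """log2 of (RPKM + 1) with chrM reads over (RPKM + 1) without them.

    Assumption: for a nuclear gene only the normalization total changes.
    """
    with_chrm = rpkm(read_count, gene_length_bp, total_reads_with_chrm)
    without_chrm = rpkm(read_count, gene_length_bp,
                        total_reads_with_chrm - chrm_reads)
    return math.log2((with_chrm + 1) / (without_chrm + 1))


# Made-up gene: 1000 reads on a 2 kb exon model.
# A sample with few chrM reads versus one where half of the reads are chrM.
small_effect = log2_ratio_with_vs_without_chrm(1000, 2000, 20_000_000, 500_000)
large_effect = log2_ratio_with_vs_without_chrm(1000, 2000, 20_000_000, 10_000_000)
```

Under this assumption a sample dominated by chrM reads shows a clearly negative ratio for nuclear genes, while a sample with few chrM reads stays close to zero.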

4. Uniformity of reads for protein-coding regions

We analyzed the sequencing uniformity of each gene in every sample. The exon model of each gene was split into 100 equal segments, from the 5' end to the 3' end, and the RPKM of each segment was calculated. Each segment's value was then expressed as a ratio to the median segment value of that gene. An approximately flat black line around zero indicates uniform coverage along the gene. There is a slight trend of 5' underrepresentation and 3' overrepresentation (Fig.4).

Fig.4 Reads uniformity within protein-coding region
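The segment-based uniformity measure of Section 4 can be sketched as follows. Several details are assumptions rather than facts from the report: mean per-base coverage is used as a stand-in for segment RPKM (within one gene and one sample the two are proportional), the ratio is taken on a log2 scale because the text describes a flat line "around zero", and a pseudocount of 1 is added to avoid division by zero.

```python
import math
from statistics import median


def segment_means(per_base_coverage, n_segments=100):
    """Mean coverage in n near-equal segments along the 5'-to-3' exon model."""
    length = len(per_base_coverage)
    if length < n_segments:
        raise ValueError("exon model is shorter than the number of segments")
    # Integer boundaries; every segment contains at least one base.
    bounds = [i * length // n_segments for i in range(n_segments + 1)]
    return [sum(per_base_coverage[a:b]) / (b - a)
            for a, b in zip(bounds, bounds[1:])]


def log2_ratio_to_median(values, pseudocount=1.0):
    """log2 ratio of each segment to the gene's median segment value."""
    mid = median(values)
    return [math.log2((v + pseudocount) / (mid + pseudocount)) for v in values]


# Synthetic 200 bp exon models: perfectly uniform, and rising towards the 3' end.
uniform = log2_ratio_to_median(segment_means([10] * 200))
biased = log2_ratio_to_median(segment_means(list(range(1, 201))))
```

The uniform profile is flat at zero, while the synthetic 3'-biased profile starts below zero and ends above it, the same direction as the slight trend reported for Fig.4.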

5. Classification of reads

The percentages of reads mapping to exonic, intronic and intergenic regions are calculated using the GencodeV3c annotation and shown as box plots (Fig.5). About 80% of reads map to exons, while about 20% are transcribed from intronic and intergenic regions.

Fig.5 Classification of sequenced reads
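One simple way to obtain the percentages in Section 5 is to classify each read by interval overlap, as sketched below. The priority rule (any overlap with an annotated exon counts as exonic, otherwise any overlap with a gene span counts as intronic) is an assumption; the report does not describe how boundary-spanning reads were handled. Coordinates are made up and use half-open intervals on a single chromosome.

```python
from collections import Counter


def classify_read(start, end, exons, genes):
    """Classify a read [start, end) as 'exon', 'intron' or 'intergenic'."""
    def overlaps(intervals):
        return any(start < e and s < end for s, e in intervals)

    if overlaps(exons):
        return "exon"
    if overlaps(genes):
        return "intron"
    return "intergenic"


def class_percentages(reads, exons, genes):
    """Percentage of reads in each class."""
    counts = Counter(classify_read(s, e, exons, genes) for s, e in reads)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total
            for label in ("exon", "intron", "intergenic")}


# Toy annotation: one gene with two exons, and five reads.
genes = [(100, 1000)]
exons = [(100, 200), (800, 1000)]
reads = [(110, 160), (820, 870), (900, 950), (400, 450), (2000, 2050)]

percentages = class_percentages(reads, exons, genes)
```

A linear scan over the annotation is fine for a toy example; a real annotation would call for an interval index.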

6. Ratio of transcribed nucleotides to the reference genome

The ratio of transcribed nucleotides to the reference genome is calculated and shown as box plots (Fig.6). About 4% of the total genome is transcribed, while about 95% of the mitochondrial genome is transcribed.

Fig.6a Percentage ratio of transcribed region relative to reference genome

Fig.6b Percentage ratio of annotated exons relative to reference genome

Fig.6c Percentage ratio of transcribed nucleotides relative to annotated exons
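The percentages in Section 6 amount to measuring how many reference bases are covered by at least one read. A minimal sketch, assuming reads are given as half-open intervals on one reference sequence and that any covered base counts as transcribed (the report does not state a coverage cutoff):

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_percent(intervals, reference_length):
    """Percentage of the reference covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return 100.0 * covered / reference_length


# Toy example: three reads on a 1000 bp reference, two of them overlapping.
read_intervals = [(0, 50), (40, 100), (300, 400)]
percent_transcribed = covered_percent(read_intervals, 1000)
```

The same function gives the quantities of Fig.6b and Fig.6c if the intervals or the denominator are swapped for the annotated exons.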

7. Biological and technical replicates

We performed pairwise comparisons of the RPKM for individual regions between different individuals, quantified as the log2 ratio of (RPKM + 1) and shown as box plots (Fig.7a). Here, the same region in HSB123, HSB126, HSB130 and HSB145 is treated as a set of biological replicates. The same trend was observed for exon expression RPKM, but with slightly greater variation (data not shown). We also performed pairwise comparisons of the technical replicates of HSB145-VFC to determine the variation in the RPKM of genes and exons (Fig.7b). Technical replicates show less variation than biological replicates.

Fig.7a Biological replicates comparison

Fig.7b Technical replicates comparison
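The pairwise replicate comparison of Section 7 can be sketched as below: for every pair of replicates, the per-gene log2 ratio of (RPKM + 1) is computed, and the spread of those ratios is what the box plots summarize. Replicate labels and RPKM values are invented for illustration, and the input layout (one list of gene RPKMs per replicate, in the same gene order) is an assumption.

```python
import math
from itertools import combinations


def pairwise_log2_ratios(rpkm_by_replicate):
    """Per-gene log2((a + 1) / (b + 1)) for every pair of replicates."""
    ratios = {}
    for a, b in combinations(sorted(rpkm_by_replicate), 2):
        ratios[(a, b)] = [
            math.log2((x + 1) / (y + 1))
            for x, y in zip(rpkm_by_replicate[a], rpkm_by_replicate[b])
        ]
    return ratios


# Made-up RPKM values for three genes in three replicates.
replicates = {
    "rep1": [0.0, 3.0, 7.0],
    "rep2": [0.0, 1.0, 7.0],
    "rep3": [1.0, 3.0, 15.0],
}

ratios = pairwise_log2_ratios(replicates)
```

Adding 1 before taking the ratio keeps genes with zero RPKM defined and damps noise at low expression, which is presumably why the report uses (RPKM + 1).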

8. mRNAseq vs. Exon Array

We compared log2 mRNAseq gene expression (RPKM) with log2 Exon Array gene signal intensity. The two measurements correlate well; for example, R=0.81 for HSB123-DFC (Fig.8a). A bar plot shows the Pearson correlation for all samples (Fig.8b). HSB144 showed lower correlation coefficients, probably due to an insufficient number of mappable reads.

Fig.8a Comparison of gene expression results obtained from mRNAseq and Exon array

Fig.8b Distribution of Pearson correlation coefficients of gene expression between mRNAseq and Exon array
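The Pearson correlation reported in Section 8 can be computed per sample as in the sketch below. The use of log2(RPKM + 1) rather than log2(RPKM) is an assumption made here so that genes with zero RPKM stay defined; the report only says "log2". The expression values are synthetic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic data: gene RPKM from sequencing, log2 signal from the array.
gene_rpkm = [0.0, 1.0, 3.0, 7.0, 15.0]
array_log2_signal = [5.0, 6.0, 7.0, 8.0, 9.0]

log2_rpkm = [math.log2(v + 1) for v in gene_rpkm]
r = pearson(log2_rpkm, array_log2_signal)
```

The synthetic values are perfectly linear on the log scale, so `r` comes out at 1; real data would give lower values such as the R=0.81 quoted for HSB123-DFC.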

9. Principal components analysis (PCA)

Fig.9a PCA using 4 brains: HSB123, HSB126, HSB130, HSB145

Fig.9b PCA using all 8 brains

10. Multidimensional scaling analysis (MDS)

Fig.10a MDS analysis using 4 brains: HSB123, HSB126, HSB130, HSB145

Fig.10b MDS analysis using all 8 brains

11. Hierarchical cluster analysis (HCLUST)

Fig.11a Hierarchical cluster analysis using 4 brains: HSB123, HSB126, HSB130, HSB145

Fig.11b Hierarchical cluster analysis using all 8 brains