Benchmarking of RNA-seq data processing pipelines using whole transcriptome qPCR expression data
Jan Hellemans
7th international qPCR & NGS Event, Freising, March 24th, 2015
Therapeutics: lncRNA, oncology, antisense
qPCR solutions: qbase+, qPCR training
Lab services: mRNA, miRNA, lncRNA; qPCR, ddPCR, RNA-seq
The Biogazelle team & collaborators
Lab team: Ariane De Ganck, Gaelle Vanseveren, Nele Nijs, Anthony Van Driessche, Shana Robbrecht, Tom Maes
Bio-IT team: Manuel Luypaert, Sander Claus
Project management & data analysis: Pieter Mestdagh, Jo Vandesompele, Anneleen Beckers
Collaborators: Bio-Rad, UGent
RNA-seq workflow
  experiment design
  sample collection & RNA extraction
  library prep & sequencing
  QC, data processing & interpretation
RNA-seq workflow: experiment design
  https://www.biogazelle.com/knowledge-center/selected-video-presentations
Data processing tools
Included in this analysis:
  TopHat + HTSeq
  TopHat + Cufflinks
  Sailfish
Running & planned:
  STAR + HTSeq
  Salmon (Sailfish successor)
  StringTie
Interpretation & statistics of differential gene expression not considered
Data processing tools
What is your preferred RNA-seq data mapper?
  TopHat
  MapSplice
  STAR
  alignment free (e.g. Sailfish)
  don't know
  other
What is your favorite RNA-seq quantification tool?
  Cufflinks
  RSEM
  HTSeq
  don't know
  Sailfish (Salmon)
  other
What is the front end for your analysis?
  no (command line)
  BaseSpace
  Galaxy
  Commercial solution
Data processing tools: TopHat
TopHat input:
  RNA-seq reads (FASTQ files)
  genome index
  transcriptome annotation (optional GTF file)
Procedure:
  transcriptome mapping (when annotation is provided)
  genome mapping (non-junction reads)
  spliced mapping (de novo, based on canonical donor/acceptor sites)
TopHat output:
  SAM/BAM files
Data processing tools: HTSeq
HTSeq input:
  SAM/BAM file with mapping results (e.g. from TopHat); only the uniquely mapping fraction is considered
  GTF/GFF file with genome features (e.g. gene models)
Processing:
  htseq-count tool
  different modes to deal with reads overlapping multiple features
HTSeq output:
  table with counts for each feature (exon, gene, ...)
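The counting logic described on this slide can be sketched in a few lines. This is a simplified illustration in the spirit of a "union"-style overlap mode, not the htseq-count implementation: the function name, the tuple layout of reads and exons, the `n_hits` field and the special counter names are all hypothetical. Coordinates are assumed to be half-open and strand is ignored.

```python
from collections import Counter


def count_reads_union(reads, genes):
    """Count uniquely mapped reads per gene (simplified union-style rule).

    reads: list of (chrom, start, end, n_hits) tuples (hypothetical layout)
    genes: dict gene_id -> list of (chrom, start, end) exon intervals
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end, n_hits in reads:
        # Only the uniquely mapping fraction is counted.
        if n_hits > 1:
            counts["_not_unique"] += 1
            continue
        # Collect every gene with at least one exon overlapping the read.
        hit = {
            gene_id
            for gene_id, exons in genes.items()
            for e_chrom, e_start, e_end in exons
            if e_chrom == chrom and start < e_end and e_start < end
        }
        if len(hit) == 1:
            counts[hit.pop()] += 1
        elif hit:
            counts["_ambiguous"] += 1  # read overlaps more than one gene
        else:
            counts["_no_feature"] += 1
    return counts


# Toy data (invented for illustration only).
genes = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}
reads = [
    ("chr1", 120, 170, 1),  # overlaps geneA only
    ("chr1", 390, 440, 1),  # overlaps geneA and geneB
    ("chr1", 600, 650, 1),  # overlaps no annotated feature
    ("chr1", 310, 360, 3),  # multi-mapped, skipped
    ("chr1", 450, 500, 1),  # overlaps geneB only
]
example_counts = count_reads_union(reads, genes)
```

How ambiguous reads are handled is exactly what the different counting modes change, which is why the choice of mode can shift counts for overlapping genes.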
Data processing tools: Cufflinks
Cufflinks input:
  SAM/BAM file with mapping results (e.g. from TopHat)
  GTF/GFF file with genome features (e.g. gene models)
Processing:
  assemble transcripts
  estimate transcript abundances
Cufflinks output:
  GTF file with transcript coordinates, abundances & class codes
  abundances in FPKM (expected fragments per kilobase of transcript per million fragments sequenced)
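As a worked illustration of the FPKM unit, the definition can be written out directly. This is only the textbook formula applied to a given fragment count; Cufflinks itself estimates expected fragments per transcript with a statistical model, which is not reproduced here. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_fragments):
    """Fragments per kilobase of transcript per million fragments sequenced."""
    # fragments / (length in kb) / (total fragments in millions)
    return fragments * 1e9 / (transcript_length_bp * total_fragments)


# Illustrative numbers only: 500 fragments on a 2 kb transcript
# in a library of 18 million fragments.
example_fpkm = fpkm(fragments=500, transcript_length_bp=2000, total_fragments=18_000_000)
```

With these inputs the value is roughly 13.9 FPKM: 500 fragments over 2 kb is 250 per kb, divided by 18 million fragments.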
Data processing tools: Sailfish
Sailfish input:
  RNA-seq reads (FASTQ files)
  transcriptome index
Processing:
  counting k-mer occurrence in reads rather than aligning reads
Sailfish output:
  table with transcript abundance
    KPKM (k-mers per kilobase per million mapped k-mers)
    TPM (transcripts per million)
    FPKM
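The two ideas on this slide, counting k-mers instead of aligning reads and reporting abundance as TPM, can be sketched as follows. This is a toy illustration, not Sailfish's algorithm: the assignment of k-mers to transcripts and the iterative abundance estimation are omitted, and the function names and example data are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def tpm(counts, lengths):
    """Convert per-transcript counts to transcripts per million.

    Each count is first divided by transcript length, then the
    length-normalized rates are rescaled so that they sum to 1e6.
    """
    rates = {tx: counts[tx] / lengths[tx] for tx in counts}
    total = sum(rates.values())
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


# Toy reads and transcripts (invented for illustration only).
kmer_counts = count_kmers(["ACGTAC", "CGTACG"], k=4)
example_tpm = tpm({"tx1": 100, "tx2": 300}, {"tx1": 1000, "tx2": 1500})
```

Because TPM values always sum to one million within a sample, they depend on which transcripts are in the reference, which is relevant when comparing runs against different reference transcriptomes.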
Data processing tools comparison
                       TopHat + Cufflinks          TopHat + HTSeq            Sailfish
mapping                alignment based             alignment based           alignment free
quantification level   transcript                  gene                      transcript
reported metric        normalized expression       read counts               normalized expression
novel transcripts      known & novel transcripts   only known transcripts    only known transcripts
Benchmarking: published comparisons (Chandramohan et al., 2013)
Comparison of public MAQC RNA dataset against matching TaqMan data
Good overall correlation for the tools evaluated
Limitations:
  Gene based analysis, not considering transcript variability
  Small comparison (531 data points) based on historic assays
  No detailed analysis of potential causes of different interpretations
Benchmarking: a qPCR transcriptome reference data set
Studies relying on MAQC samples:
  Microarray quality control study (Shi 2006)
  Sequencing quality control study (Su 2014)
  MicroRNA quality control study (Mestdagh 2014)
MAQC samples:
  A = universal RNA
  B = brain RNA
  C = ¾ A + ¼ B
  D = ¼ A + ¾ B
Benchmarking: a qPCR transcriptome reference data set
Benefits of MAQC samples:
  commercially available
  multiple large scale published datasets
  built-in truths (relying on known mixing proportions)
  allows evaluation of
    reproducibility
    titration response
    accuracy
    dynamic range
Biogazelle & Bio-Rad have performed a full qPCR transcriptome profiling on the MAQC samples using validated PrimePCR assays
  4 x 22 238 data points
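The "built-in truth" of the MAQC design can be made concrete with a small sketch. It assumes linear-scale expression values and that samples A and B contribute RNA in exactly the stated proportions, which is a simplification. The function names and the example values are hypothetical.

```python
def expected_mixture(expr_a, expr_b, frac_a):
    """Expected linear-scale expression in a mix of samples A and B."""
    return frac_a * expr_a + (1.0 - frac_a) * expr_b


def titration_order_ok(a, c, d, b):
    """Check the titration response: C and D must lie between A and B,
    in the order A, C, D, B (either increasing or decreasing)."""
    values = [a, c, d, b]
    return values == sorted(values) or values == sorted(values, reverse=True)


# Illustrative values only: a gene at 8.0 units in A and 2.0 units in B.
expected_c = expected_mixture(8.0, 2.0, frac_a=0.75)  # C = 3/4 A + 1/4 B
expected_d = expected_mixture(8.0, 2.0, frac_a=0.25)  # D = 1/4 A + 3/4 B
monotonic = titration_order_ok(8.0, expected_c, expected_d, 2.0)
```

A measurement platform or pipeline can then be scored on how closely its observed C and D values match these expectations, and on how often the expected ordering is preserved.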
Benchmarking: setup
Samples:
  MAQC-A & MAQC-B
  RNA-seq: 2 replicates
  qPCR: 1 replicate
Sequencing:
  poly-A seq @ NextSeq 500
  18M reads per sample
qPCR:
  PrimePCR assays for all human coding genes
  5 µl assays @ CFX384
Benchmarking: setup
Data taken into consideration:
  qPCR signal in MAQC-A & B
  Cq values between 11 & 32
  Only transcripts detected by the qPCR assays (n = 15 087)
Comparing qPCR & RNA-seq:
  Sailfish: sum of expression for transcripts detected by qPCR
  other tools: gene level quantification based on custom GTF file containing only transcripts detected by qPCR
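The selection and aggregation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the Cq bounds are treated as inclusive and required in both samples, and the function names, dictionary layouts and identifiers are hypothetical rather than taken from the actual analysis.

```python
def filter_by_cq(cq_a, cq_b, low=11.0, high=32.0):
    """Keep assays with a Cq value within [low, high] in both samples."""
    return {
        assay
        for assay in cq_a.keys() & cq_b.keys()
        if low <= cq_a[assay] <= high and low <= cq_b[assay] <= high
    }


def gene_level_sum(transcript_expr, transcript_to_gene, detected):
    """Sum transcript-level expression per gene, using only the
    transcripts that are detected by the qPCR assay."""
    totals = {}
    for tx, value in transcript_expr.items():
        if tx in detected:
            gene = transcript_to_gene[tx]
            totals[gene] = totals.get(gene, 0.0) + value
    return totals


# Toy data (invented for illustration only).
kept = filter_by_cq(
    {"assay1": 20.5, "assay2": 34.0, "assay3": 12.0},
    {"assay1": 22.0, "assay2": 30.0, "assay3": 10.5},
)
gene_expr = gene_level_sum(
    {"tx1": 4.0, "tx2": 1.5, "tx3": 7.0},
    {"tx1": "gene1", "tx2": "gene1", "tx3": "gene1"},
    detected={"tx1", "tx2"},
)
```

Restricting the sum to the transcripts an assay actually detects keeps the RNA-seq value comparable to the qPCR signal; here tx3 is excluded even though it belongs to the same gene.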
Benchmarking: gene expression levels
qPCR data:
  global mean normalized Cq values (already log2 scale)
Sailfish:
  log2 of sum of KPKM values of transcripts detected by qPCR
HTSeq:
  log2 of normalized gene counts
Cufflinks:
  log2 of gene level FPKM values
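A minimal sketch of how both data types can be brought onto a comparable log2 scale is given below. Two choices are assumptions and not stated on the slide: the sign convention (mean Cq minus Cq, so that higher values mean higher expression) and the pseudocount added before taking log2 of RNA-seq values. Function names and example values are hypothetical.

```python
import math


def global_mean_normalize(cq_values):
    """Global mean normalization of Cq values within one sample.

    Returns mean(Cq) - Cq per gene (assumed sign convention), so that
    larger values correspond to higher expression on a log2 scale.
    """
    mean_cq = sum(cq_values.values()) / len(cq_values)
    return {gene: mean_cq - cq for gene, cq in cq_values.items()}


def log2_expression(values, pseudocount=1.0):
    """log2-transform RNA-seq expression values (KPKM, counts or FPKM).

    The pseudocount is an assumption used here to avoid log2(0).
    """
    return {gene: math.log2(v + pseudocount) for gene, v in values.items()}


# Illustrative values only.
qpcr_log2 = global_mean_normalize({"gene1": 20.0, "gene2": 24.0, "gene3": 28.0})
rnaseq_log2 = log2_expression({"gene1": 7.0, "gene2": 0.0})
```

Since one Cq unit corresponds to roughly a two-fold difference in template, normalized Cq values and log2-transformed RNA-seq values can be compared directly, for example in a scatter plot or by correlation.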
Benchmarking Relative expression levels
Benchmarking More non-concordance @ low expression
Benchmarking Looking at the differences in Q2-Q4
Impact of Sailfish reference
Completeness of reference impacts mapping %
  total RNA-seq, 4 MAQC samples
Mapping % for different reference transcriptomes:
  Ensembl cDNA: 40-45%
  + Ensembl ncRNA: ~60%
  + LNCipedia: ~61%
Impact of Sailfish reference: Ensembl (cDNA + ncRNA) vs Ensembl + LNCipedia
  [figure annotations: 17.7%; 0.5% (1.1%); 4.5% (10.6%); 34.9%; 4.9%]
  [figure annotations: KPKM 10.4, 0.0; KPKM 1.5, 10.3]
Impact of reference transcriptome: Ensembl vs RefGene (Zhao, BMC Genomics 2015)
  [figure panels: expression level; differential expression. Axes: Ensembl vs RefGene]
Conclusions
Overall good concordance between qPCR & all different RNA-seq analysis tools
Some genes show strongly different absolute expression levels between qPCR & RNA-seq
Some genes show pronounced differences in differential expression
  RNA-seq vs qPCR
  analysis method specific
  specific differences may be called / missed by a single tool
The choice of reference transcriptome significantly impacts transcript expression analysis by Sailfish
Validation required because some genes show strongly deviating results depending on quantification and data processing method