GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

1 GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment. Zhaojun Zhang, Shunping Huang, Jack Wang, Xiang Zhang, Fernando Pardo Manuel de Villena, Leonard McMillan and Wei Wang. Bioinformatics 2013;29:i291-i299. Tõnis Tasa, Bioinformatics Seminar.

2 RNA-Seq: align first / assemble first

3 Problem! A fragment with paired-end reads can be aligned to two locations in the genome. (Zhang Z et al. Bioinformatics 2013;29:i291-i299. The Author. Published by Oxford University Press.)

4 It has already been taken care of! Aligner output: the best-scoring alignment (e.g. TopHat). Equal scores? Report the first, a random one, or multiple alignments (5%). Assembler: probabilistic model, equal weights (e.g. Cufflinks).

5 Genomic factors are the cause. Retrotransposition and gene duplication: high levels of similarity. Processed pseudogenes: mRNA reverse transcribed and reintegrated into the genome; introns are lost. Nonprocessed pseudogenes: gene duplication events, mutations, loss of function. Repetitive gene sequences. Gene families.

6 Two transcripts reported by Cufflinks (Zhang Z et al. Bioinformatics 2013;29:i291-i299).

7 Things to consider: biological reasons for multiple alignment are not considered; databases for identifying pseudogenes; no restoration of lost alignments; around 3.5% of the data are false positives from multiple alignments.

8 The workflow of the GeneScissors pipeline (Zhang Z et al. Bioinformatics 2013;29:i291-i299).

9 Sharing graph. Fragment attractors: where do fragments align. Links between attractors: shared fragments. Position-by-position correspondence. Transcript discovery enabled. Shared regions.
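The sharing-graph idea on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the GeneScissors implementation: attractors are reduced to opaque labels, alignments to (fragment, attractor) pairs, and the function name `build_sharing_graph` and the example identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_sharing_graph(alignments):
    """Build a toy sharing graph from (fragment_id, attractor_id) pairs.

    Returns (members, links):
      members maps each attractor to the set of fragments aligned to it;
      links maps a sorted attractor pair to the fragments aligned to both.
    """
    members = defaultdict(set)   # attractor -> fragments aligned there
    hits = defaultdict(set)      # fragment -> attractors it aligns to
    for fragment, attractor in alignments:
        members[attractor].add(fragment)
        hits[fragment].add(attractor)

    links = defaultdict(set)
    for fragment, attractors in hits.items():
        # A fragment with several alignments links every pair of its attractors.
        for a, b in combinations(sorted(attractors), 2):
            links[(a, b)].add(fragment)
    return dict(members), dict(links)


# Hypothetical example: f1 and f2 align to both a gene and a pseudogene.
alignments = [
    ("f1", "geneA"), ("f1", "pseudoA"),
    ("f2", "geneA"), ("f2", "pseudoA"),
    ("f3", "geneA"),
    ("f4", "geneB"),
]
members, links = build_sharing_graph(alignments)
```

Here `geneA` and `pseudoA` end up linked by the two shared fragments, while `geneB` stays isolated because its only fragment aligns uniquely.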

10 Classification model. Simulated data: fragments from annotation database (Ensembl). Aligner, assembler. Construction of sharing graphs. Binary classification of (target + assistant) fragment attractors. RandomForests, SVM, decision tree.

11 RandomForests: best. Good for: discrete features; correlations between features. 10-fold cross-validation. Precision, recall, F-score, AUC.
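As a reminder of how the evaluation terms on this slide fit together, here is a small sketch of precision, recall and F-score over sets of reported versus true items, plus a plain 10-fold index split. The function names and the toy gene identifiers are made up for illustration; the classifier itself is left out.

```python
def precision_recall_f(reported, truth):
    """Return (precision, recall, F-score) for two collections of identifiers."""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold i tests every k-th item from i."""
    for i in range(k):
        test = list(range(i, n_items, k))
        held_out = set(test)
        train = [j for j in range(n_items) if j not in held_out]
        yield train, test


# Hypothetical toy data: four true genes found, one missed, two spurious calls.
reported = {"g1", "g2", "g3", "g4", "p1", "p2"}
truth = {"g1", "g2", "g3", "g4", "g5"}
precision, recall, f_score = precision_recall_f(reported, truth)
folds = list(k_fold_indices(25, k=10))
```

In cross-validation, a model would be trained on each `train` list and scored on the matching `test` list, and the metrics would then be summarised across the ten folds.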

12 Features. Mismatch count: pseudogenes have a higher mutation rate. Number of exons: singletons could be processed pseudogenes. Proportions of: #1 fragments aligned to attractors; #2 shared fragments aligned to attractors; #3 entire regions of attractors covered by shared fragments. Unexpressed attractors span a shorter region and have a lower proportion of shared fragments.
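Two of the proportion features can be illustrated with a short sketch. The exact definitions used by GeneScissors are not given on the slide, so this reflects one plausible reading: the share of an attractor's fragments that are shared with another attractor, and the share of the attractor's region covered by those shared fragments. The interval representation and function names are assumptions.

```python
def shared_fraction(fragments):
    """Fraction of fragments flagged as shared.

    fragments: list of (start, end, is_shared) half-open intervals on one attractor.
    """
    if not fragments:
        return 0.0
    return sum(1 for _, _, is_shared in fragments if is_shared) / len(fragments)


def shared_coverage(region_start, region_end, fragments):
    """Fraction of [region_start, region_end) covered by shared fragments."""
    length = region_end - region_start
    if length <= 0:
        return 0.0
    # Clip shared fragments to the region and sweep them left to right.
    intervals = sorted(
        (max(start, region_start), min(end, region_end))
        for start, end, is_shared in fragments
        if is_shared
    )
    covered = 0
    frontier = region_start  # rightmost position counted so far
    for start, end in intervals:
        start = max(start, frontier)  # skip the part already counted
        if end > start:
            covered += end - start
            frontier = end
    return covered / length


# Hypothetical attractor spanning positions 0..100 with four aligned fragments.
fragments = [(0, 30, True), (20, 50, True), (60, 80, False), (90, 100, True)]
fraction = shared_fraction(fragments)
coverage = shared_coverage(0, 100, fragments)
```

With these toy fragments, three of four fragments are shared and they cover positions 0 to 50 and 90 to 100 of the region.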

13 Data simulator. Strains: CAST/EiJ, PWK/PhJ, WSB/EiJ. Simulator: 60 RNA-Seq samples; random number of genes; subset of transcripts; paired-end fragments; uniform quality score; errors.

14 Modifications to existing pipelines. TopHat does not report alignments that cross more splice junctions: affects processed pseudogenes. Cufflinks suppresses a gene when more than 75% of its fragments are mappable to multiple locations: affects unannotated pseudogenes. Downloadable patch.

15 Simulated data results: genes reported
            MapSplice   TopHat   MapSplice (GS)   TopHat (GS)
Precision   35.6%       41.8%    48.2%            48.3%
Recall      95.1%       93.2%    93%              93.2%
F-score     51.5%       58.2%    63.5%            63.6%

16 Real data. 53 F1 samples + 9 inbred samples. Each fragment aligned to the genomes of both parents separately. Alignments merged, with all distinct multiple alignments kept (false negatives recovered?). TopHat + GeneScissors.
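The merge step can be pictured with a minimal sketch. It assumes that alignments from both parental genomes have already been converted to one common coordinate system (a step not described on the slide) and that an alignment is just a (fragment, chromosome, start, end) tuple; `merge_alignments` and the sample records are hypothetical.

```python
def merge_alignments(maternal, paternal):
    """Merge two alignment lists, keeping every distinct location per fragment.

    Each input is a list of (fragment_id, chrom, start, end) tuples,
    assumed to be in a shared coordinate system.
    """
    merged = {}
    for fragment, *location in list(maternal) + list(paternal):
        # Identical locations from the two genomes collapse into one entry.
        merged.setdefault(fragment, set()).add(tuple(location))
    return {fragment: sorted(locations) for fragment, locations in merged.items()}


# Hypothetical example: f1 gains a second location only seen on the paternal genome.
maternal = [("f1", "chr1", 100, 200), ("f2", "chr2", 500, 600)]
paternal = [
    ("f1", "chr1", 100, 200),
    ("f1", "chr7", 900, 1000),
    ("f2", "chr2", 500, 600),
]
merged = merge_alignments(maternal, paternal)
```

Keeping the extra `chr7` location for `f1` is what lets a downstream step see that the fragment is ambiguous, rather than silently trusting a single placement.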

17-20 Comparisons between multiple samples run through both the GeneScissors pipeline and the TopHat pipeline (Zhang Z et al. Bioinformatics 2013;29:i291-i299).

21 Conclusions. GeneScissors adds 10 hours to the general workflow per sample. Compared to TopHat, GS reported: 4.25% fewer transcripts; 0.97% more transcripts matching splice junction annotations in Ensembl; 53.6% fewer annotated pseudogenes; 16% fewer unannotated transcripts.

22 Questions. 1. What is meant by a fragment/transcript having multiple alignments? 2. How does GeneScissors approach the multiple alignment problem differently from other RNA-Seq analysis tools?