Conserved introns reveal novel transcripts in eukaryotic genomes

Size: px
Start display at page:

Download "Conserved introns reveal novel transcripts in eukaryotic genomes"

Transcription

1 Conserved introns reveal novel transcripts in eukaryotic genomes Dominic Rose Bioinformatics Group, University of Leipzig Jena, Nov. 2009

2 Outline

3 Outline Genome-wide comparative genomics approach Search for short cons. introns in eukaryotic genomes Capable to identify novel conserved genes/transcripts Gene-detection based on intron/splice site prediction Intron detection allows to...

4 Outline Genome-wide comparative genomics approach Search for short cons. introns in eukaryotic genomes Capable to identify novel conserved genes/transcripts Gene-detection based on intron/splice site prediction Intron detection allows to... extend annotation of existing coding or UTRs identify novel protein coding genes identify novel mrna-like ncrnas

5 Gene-finding: protein-cod. vs non-cod. Protein-coding genes: Likely to be found by comparative genomics Exhibit clear evolutionary signatures (start/stop codon, accumulate synonymous mutations or reading frame preserving indels, high sequence conservation) Bunch of sophisticated implementations exists

6 Gene-finding: protein-cod. vs non-cod. Protein-coding genes: Likely to be found by comparative genomics Exhibit clear evolutionary signatures (start/stop codon, accumulate synonymous mutations or reading frame preserving indels, high sequence conservation) Bunch of sophisticated implementations exists Non-coding genes (ncrnas): Heterogeneous class of transcripts directly acting at the RNA-level Usually lack conserved sequence patterns Subclass structured ncrnas : Preserve secondary structure while underlying sequences evolve Some are easy to find: trnas, mirnas,... Others are notoriously difficult to detect: telomerase RNAs, long ncrnas,...

7 non-coding RNAs Central topic of current research [Taft 2009]

8 mrna-like non-coding RNAs

9 mrna-like non-coding RNAs ENCODE: Pervasive transcription In average 5.4 transcripts per site. Large portion of the transcriptional output of eukaryotic genomes consists of mrna-like non-coding RNAs.

10 mrna-like non-coding RNAs ENCODE: Pervasive transcription In average 5.4 transcripts per site. Large portion of the transcriptional output of eukaryotic genomes consists of mrna-like non-coding RNAs. Capped, polyadenylated, often (alternatively) spliced (just like protein-coding genes), but lack discernible open reading frames

11 mlncrnas: Examples

12 mlncrnas: Examples Gene regulators: Evf-2, Xist, rox1, Tsix, XistAS, rox2, H19, mei, LPW, KvDMR1, DGCR5, CMPD (e.g. Evf-2 acts as transciptional enhancer for distal-less homeobox genes)

13 mlncrnas: Examples Gene regulators: Evf-2, Xist, rox1, Tsix, XistAS, rox2, H19, mei, LPW, KvDMR1, DGCR5, CMPD (e.g. Evf-2 acts as transciptional enhancer for distal-less homeobox genes) Some serve as precursor for mirnas and snornas

14 mlncrnas: Examples Gene regulators: Evf-2, Xist, rox1, Tsix, XistAS, rox2, H19, mei, LPW, KvDMR1, DGCR5, CMPD (e.g. Evf-2 acts as transciptional enhancer for distal-less homeobox genes) Some serve as precursor for mirnas and snornas Abiotic stress signals: gadd7/adapt15, adapt33, hsrω, OxyR, DsrA, lbi, G90 (e.g. expression caused by UV radiation)

15 mlncrnas: Examples Gene regulators: Evf-2, Xist, rox1, Tsix, XistAS, rox2, H19, mei, LPW, KvDMR1, DGCR5, CMPD (e.g. Evf-2 acts as transciptional enhancer for distal-less homeobox genes) Some serve as precursor for mirnas and snornas Abiotic stress signals: gadd7/adapt15, adapt33, hsrω, OxyR, DsrA, lbi, G90 (e.g. expression caused by UV radiation) Biotic stress signals: His-1, ENOD40, CR20, GUT15 (e.g. expression correlated with viral insertion or carcinogenesis)

16 mlncrnas: Examples Gene regulators: Evf-2, Xist, rox1, Tsix, XistAS, rox2, H19, mei, LPW, KvDMR1, DGCR5, CMPD (e.g. Evf-2 acts as transciptional enhancer for distal-less homeobox genes) Some serve as precursor for mirnas and snornas Abiotic stress signals: gadd7/adapt15, adapt33, hsrω, OxyR, DsrA, lbi, G90 (e.g. expression caused by UV radiation) Biotic stress signals: His-1, ENOD40, CR20, GUT15 (e.g. expression correlated with viral insertion or carcinogenesis) Others: UHG, NTT, Bsr, BC1, BC200, SRA

17 mlncrnas: Examples Gene regulators: Evf-2, Xist, rox1, Tsix, XistAS, rox2, H19, mei, LPW, KvDMR1, DGCR5, CMPD (e.g. Evf-2 acts as transciptional enhancer for distal-less homeobox genes) Some serve as precursor for mirnas and snornas Abiotic stress signals: gadd7/adapt15, adapt33, hsrω, OxyR, DsrA, lbi, G90 (e.g. expression caused by UV radiation) Biotic stress signals: His-1, ENOD40, CR20, GUT15 (e.g. expression correlated with viral insertion or carcinogenesis) Others: UHG, NTT, Bsr, BC1, BC200, SRA functionally important ncrna class

18 The idea Functional pair of donor (5 ) and acceptor (3 ) splice sites will be retained over long evolutionary time scales only if

19 The idea Functional pair of donor (5 ) and acceptor (3 ) splice sites will be retained over long evolutionary time scales only if 1. The locus is transcribed into a functional transcript

20 The idea Functional pair of donor (5 ) and acceptor (3 ) splice sites will be retained over long evolutionary time scales only if 1. The locus is transcribed into a functional transcript 2. Accurate intron removal is necessary to produce a functional transcript

21 The idea Functional pair of donor (5 ) and acceptor (3 ) splice sites will be retained over long evolutionary time scales only if 1. The locus is transcribed into a functional transcript 2. Accurate intron removal is necessary to produce a functional transcript Find the intron it guides you to your novel transcript

22 The data 2 screens (all input data available at the UCSC genome browser) 15 insects, already published: Hiller et al., Genome Research 2009 (12 drosophila genomes, mosquito, beetle, honeybee) 44 vertebrates (human teleosts, lamprey)

23 The plan Insects: focus on short conserved introns (40-81 nt) Apply intronscan (preliminary filter) build alignments evaluate characteristic intron evolution train support vector machine (SVM) classify candidate set of novel introns Vertebrates: focus on independent SS prediction Apply MaxEntScan (preliminary filter) compile set of real (positive) and pseudo (false) donor/acceptor splice-sites evaluate characteristic splice-site evolution train SVM classify candidate set of novel splice-sites

24 The method A genome B genome predict introns in individual insect genomes using intronscan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > retain orthologous intronscan predictions > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + strand intron strand intron D.mel D.ere D.moj + 12 insects 1,398,939 predicted introns for D. melanogaster > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > D.mel D.ere intronscan alignments C positive negative distributions of training samples splice site substitution scores conservation scores > > > > > > > > > > > > > > > > > > > > > > > > 498,231 predictions with orthologs evaluate characteristic intron evolution intron length variation donor score variation D.moj + 12 insects acceptor score variation SVM train an SVM with these 5 discriminative features apply to 342,785 predictions that overlap no protein coding gene 0 1 True Positive Rate ROC curve of independent test set AUC = False Positive Rate conserved introns predicted

25 Nucleotide frequencies in SS positions differ less frequent more frequent compared to D.mel compared to D.mel donor (5 ) splice site acceptor (3 ) splice site percent D.sim D.sec D.yak D.ere A C G T D.ana D.pse D.per D.wil D.moj D.vir D.gri A.gam T.cas A.mel less frequent more frequent (compared to Dmel) e.g. Apis prefers A over G (donor +3) and T over C (acceptor -3)

26 Learn log-odd substitution scores log 2 ( freqpos(x,y) freq neg(x,y) ) substitution matrix x, y {A, T, C, G} x y

27 0 0 0 Evaluating intron evolution - an example positive negative A substitution scores B D.mel D.sim D.sec D.ere D.ana D.pse D.per D.wil D.vir D.moj D.gri 1 0 G T G A G A C A C C C A A G A C A T T A T A T T G G C A A T A A T A T C A T C T A C C A A G G G C T C A T T T A C A G G T G A G A C A C C C A A G A C A T T C T A T T G G C A A T A A T A T C C T T T A C C A A G G A C C C A T T T A C A G G T G A G A C C C C C A A G A C A T T T T A T T G G C A A T A A T A T C C T A T A C C A A G G A C C C A T T T A C A G G T G A G A A A C A C A A G A C A T G C T A T T G C C A A T A A T A T C A T A T A C C A A G A A C T C A A T T T A C A G G T A A G C T T T T C C G A A G A G A T A G C A T T T A T T A T G A T T C A A T T G T T T T T C A C A G G T G A G T G T A A C C G T A A C C A G C A A C T G G C T C C A G C A G T A G A C C T A T C G A A T A T A T C C G C A G G T G A G T G T A A C C G T A A C C A G C A A C T G G C T C C A G C A G T A G A C C T A T C G A A T A T A T C C G C A G G T A A G A C G C T A T T A G A A T T C A T T T A C A T T T A C A G A C G A T A A T A G T G T A T A T C T T C A T G G T A G G T A G G A T T A A C C A T C C A G C T A T C T A T A T A T C T G T A G T A A T A T C T T G A A C T A T A A T T T G C A G G T A A G C C T T A C A A A A A A C C A T A T A T A T T T T T A G T G A A T C A A T A T T G C C T T A T T T T T G T A G G T A A G A T T A T T C C G A T T T T T A T A G C T T C A T T T T T G A G A A A T T T A A T T T G A T T A A T T T T T A G +8 classified as real intron (SVM probability 0.999) +20 Conservation scores (PhastCons) 20 8 distribution of positive training samples average conservation sum of scores for region substitution scores and Density 0.03 Density 20 sum = 31.6 average = D.mel D.sim D.sec D.yak D.ere D.pse D.vir D.gri 1 0 classified as false prediction (SVM probability 0.001) G T G G G C T C A G T C G G T A C T C C A T T A T G A T T G T T T A T T T A A T A T G C G C T T G A T T T G A A G G T G G G C T C A G T C T G T G G T A C T C C A T T A T G A T T G T T T A T T T A A T A T G C G C T T G A T T T G A A G G T G G G C T C A G T C T G T G G T A C T C C A T T A T G A T T G T T T A T T T A A T A T G C G C T T G A T T T G A A G G T G G G C T C T C T C G G T A C T G C A T T A T G A T T G T T T A T T T T A T A T G C G C T T G A T T T G G A G G T G G G C T C A G T C G G A A C T C C A T T A C G A T T G T T T A T T T T A T A T G C G C T T G A T T T G G A G G T G G G C T C A G A G T C G G T A C T C C A C T G C G A T T A T T T A T T T T A T T T G C G C C T G A T T T G G A G G T G G A T T T T G G A C T C C A T T A T A A T T A T T T A T A T T A C C C G T G T T T G C G C T T G A T T T G A A G G T G G A G A T C T G G G G A C T C C A T T A T A A T T A T T T A T A T T T G C T C G T A T T T G C G C T T G A T T T G A A G Conservation scores (PhastCons) distribution of negative training samples average conservation sum of scores for region substitution scores and Density 0.5 Density sum = 3.7 average = 0.92

28 Insects: Results intronscan: 1.4 Mio introns in Dmel alignments: 498k loci 155.5k overlap annotated protein-coding transcripts Agreement with existing exon/intron annotation Both SS: 23.5k (positive samples) one SS: 14.5k (ommited) no SS: 117.5k (negative samples) SVM training: 95%, SVM-testing: 5% area under ROC: p>0.95: 80% true positives at 0.12% false positives p>0.99: 72% true positives at 0.07% false positives (4 FP, manual inspection: 3 are true introns 1 FP)

29 Novel spliced transcripts A chrx: bp predicted intron CG14614 EC FlyBase Protein Coding Genes D. melanogaster ESTs That Have Been Spliced Conservation B chr3l: bp predicted intron dally Conservation BE AI FlyBase Protein Coding Genes D. melanogaster ESTs That Have Been Spliced CO EC C chr3r: bp predicted intron CA CA CA CA CO CA CA CA CA CA CA CA CA CA D. melanogaster ESTs That Have Been Spliced Conservation D chr3l: bp predicted intron pncr009:3l RA AY CK EC CO CO FlyBase Noncoding Genes D. melanogaster mrnas from GenBank D. melanogaster ESTs That Have Been Spliced Conservation E 600 bp chr2r: predicted intron predicted intron predicted intron D. melanogaster ESTs That Have Been Spliced EY EY Conservation 369 predictions outside of protein-cod. genes (p>0.95) 131 EST/FlyBase-transcript confirmed introns, 238 unconfirmed

30 Novel protein-coding genes A chr2r: bp predicted intron chr2r.1731 CONTRAST Gene Predictions Conservation B chr3l: predicted intron predicted intron 800 bp predicted intron predicted intron chr3l a N SCAN PASA EST Gene Predictions Conservation BLASTX hit Q6IL55...IFMGLAVSQWILVCLVESGRSIFRDFEFICSGTFVLAGLFFTVFVFNQSVRNHKIYSWVVAFVIVELEIFSMYVLVARTWVPDLLACFFFCTLLMIVALVVGCHLSFSMDLTRYIAPLFILSFVLSMMSTYFLLAHLFLPNLMPYAYLIFELGLTFVMTSMIMLHAQTI... A) CONTRAST predicted coding gene, B) NSCAN coding gene 20/238 located within 100nt upstream of cod. genes 14/20 no annotated 5 UTR (in contrast to 77/218, Fischer s exact test, p=0.005) 23 extend CDS, 30 belong to novel CDS

31 Novel spliced non-coding RNAs How can we obtain novel mlncrnas out of the data?

32 Novel spliced non-coding RNAs How can we obtain novel mlncrnas out of the data? Filter the predictions

33 Novel spliced non-coding RNAs How can we obtain novel mlncrnas out of the data? Filter the predictions remove everything protein-coding

34 Novel spliced non-coding RNAs How can we obtain novel mlncrnas out of the data? Filter the predictions remove everything protein-coding remove repeats

35 Novel spliced non-coding RNAs How can we obtain novel mlncrnas out of the data? Filter the predictions remove everything protein-coding remove repeats Heureka! We ve found mlncrnas.

36 Novel spliced non-coding RNAs How can we obtain novel mlncrnas out of the data? Filter the predictions remove everything protein-coding remove repeats Heureka! We ve found mlncrnas. (at least prime candidates) 129 bona fide mlncrnas

37 Novel mrna-like non-coding RNAs 29/129 have predicted orthologous introns outside Sophophora subgenus (D. virilis, D. mojavensis, D. grimshawi) conserved exon-intron structure over 63 My

38 Novel mrna-like non-coding RNAs 29/129 have predicted orthologous introns outside Sophophora subgenus (D. virilis, D. mojavensis, D. grimshawi) conserved exon-intron structure over 63 My Mostly unstructured (just 2 transcripts have RNAz hit)

39 Experimental verification RT-PCR, 5 different developmental stages of Dmel: embryo, larva, pupa, male, female

40 Experimental verification RT-PCR, 5 different developmental stages of Dmel: embryo, larva, pupa, male, female 18/29 (62%) experimentally validated: introns in putative coding transcripts: 11/17 mlncrnas: 7/12

41 Experimental verification of mlncrnas

42 Vertebrates: Refinements Vertebrate introns! = insect introns (2% vs 54% short introns)

43 Vertebrates: Refinements Vertebrate introns! = insect introns (2% vs 54% short introns) Intron prediction! = splice-site prediction (new SVM features needed)

44 Vertebrates: Refinements Vertebrate introns! = insect introns (2% vs 54% short introns) Intron prediction! = splice-site prediction (new SVM features needed) (1) Human MaxEntScan SS score. (2-4) Log-odd substitution scores s tree, s pair, s median. (5) Total number of species in the alignment (6) Total number of species with conserved GT/AG dinucleotides and a MaxEntScan score >= 0. (7) Slope of a regression line fitted to the splice sites PhastCons sequence conservation profile. (8) Average PhastCons score. (9) Average GC content. (10) Mean pairwise identity. (11) Normalised Shannon Entropy

45 entropy mpi entropy avg GC 0.4 avg phastcons phastcons slope Nr. of 100% cons. species mpi Nr. of species avg GC Median LogOdds avg phastcons phastcons slope 1000 Fly LogOdds Tree LogOdds 5 MaxEntScan Nr. of 100% cons. species Nr. of species Median LogOdds Fly LogOdds ACC AUC Tree LogOdds 5 MaxEntScan DO AUC Vertebrates: SVM features

46 Improve log-odd substitution scores Pairwise substitution scores may not capture the true sequence evolution, e.g. due to alignment errors

47 Improve log-odd substitution scores Pairwise substitution scores may not capture the true sequence evolution, e.g. due to alignment errors. Improvement Take UCSC s 44-species tree. Reconstruct ancestral sequences for each splice-site region using prequel Learn splice-site substitution patterns for each edge e of the tree

48 Improve log-odd substitution scores Imagine a 3-way alignment

49 Improve log-odd substitution scores Reconstruct ancestor sequences

50 Improve log-odd substitution scores Count substitutions

51 Improve log-odd substitution scores Twice, for all positive and all negative samples

52 Improve log-odd substitution scores Score your splice-sites: Sum of all log-odds of all observed substitution events for a given alignment s tree = e E ( fpos (x y)/ n A log f ) pos(x n) 2 f neg (x y)/ n A f neg(x n) Consider only those substitution events that really occurred in evolution Better AUCs: 0.82 vs 0.65 (DO), 0.83 vs 0.74 (ACC)

53 Vertebrate results 2 models, AUC: 0.96 donor, 0.95 acceptor SVM training-set: 50k, SVM test-set: 5k (pos, neg each) chr21: classified 243,041 donor sites: SVM p TP [%] FP [%] sites > ,

54 Experimental validation? Experimental validation needs introns not splice-sites (primer design) Intron candidates: arbitrarily defined as adjacent do/acc pairs with distance <=5000 nt on same strand Filter for high-scoring candidates: High SVM probability, no repeats, PhastCons-basin, same species within both splice-sites, overlap with existing annotation, EST-support, Tiling Array data,... chr21: intron candidates chosen for experimental validation

55 Vertebrate examples 1/3 Scale chr22: chr22 ACC chr22 DO GM128 cell tot GM128 cell tot GM128 cyto pa- GM128 cyto pa- GM128 cyto pa+ GM128 cyto pa+ GM128 nucl pa- GM128 nucl pa- GM128 nucl pa+ GM128 nucl pa+ K562 cell total K562 cell total K562 psom pa- K562 psom pa- K562 cyto pa- K562 cyto pa- K562 cyto pa+ K562 cyto pa+ K562 nucl pa- K562 nucl pa- K562 nucl pa+ K562 nucl pa+ K562 nplsm total K562 nplsm total K562 chrm total K562 chrm total K562 nlos tot K562 nlos tot 1 _ 1 kb chr22 ACC chr22 DO chr22 phast-filtered introns intron:chr22: ENCODE Affymetrix/CSHL Subcellular RNA Localization by Tiling Array Vertebrate Multiz Alignment & Conservation (44 Species) Primate Conservation by PhastCons Primate Cons 0 _ 1 _ Placental Mammal Conservation by PhastCons Mammal Cons 0 _ 1 _ Vertebrate Conservation by PhastCons Vertebrate Cons 0 _ Multiz Align RepeatMasker Repeating Elements by RepeatMasker

56 Vertebrate examples 2/3 Scale chr22: CONTRAST SGP Genes Geneid Genes Genscan Genes 1 _ 100 bases chr22 phast-filtered introns intron:chr22: CONTRAST Gene Predictions SGP Gene Predictions Using Mouse/Human Homology Geneid Gene Predictions Genscan Gene Predictions Vertebrate Multiz Alignment & Conservation (44 Species) Primate Conservation by PhastCons Primate Cons 0 _ 1 _ Placental Mammal Conservation by PhastCons Mammal Cons 0 _ 1 _ Vertebrate Conservation by PhastCons Vertebrate Cons 0 _ Multiz Align RepeatMasker Repeating Elements by RepeatMasker

57 Primate Cons Mammal Cons Vertebrate Cons Scale chr22: <--- Gencode Manual ENST SIB Genes SGP Genes Geneid Genes Genscan Genes 1 _ 0 _ 1 _ 0 _ 1 _ 0 _ Multiz Align RepeatMasker 5 bases G A T C G G T G T G A C C C C C C T chr22 phast-filtered introns intron:chr22: ENCODE Gencode Gene Annotations Ensembl Gene Predictions Swiss Institute of Bioinformatics Gene Predictions from mrna and ESTs SGP Gene Predictions Using Mouse/Human Homology Geneid Gene Predictions Genscan Gene Predictions Vertebrate Multiz Alignment & Conservation (44 Species) Primate Conservation by PhastCons Placental Mammal Conservation by PhastCons Vertebrate Conservation by PhastCons Repeating Elements by RepeatMasker Scale chr22: <--- intron:chr22: Primate Cons Mammal Cons Vertebrate Cons Gencode Manual ENST SIB Genes SGP Genes Geneid Genes Genscan Genes 1 _ 0 _ 1 _ 0 _ 1 _ 0 _ Multiz Align RepeatMasker 10 bases T C G T C G G G T G C C T G G C C A A T G G A G A G T C G G T T C C A C T T C A G chr22 phast-filtered introns ENCODE Gencode Gene Annotations Ensembl Gene Predictions Swiss Institute of Bioinformatics Gene Predictions from mrna and ESTs SGP Gene Predictions Using Mouse/Human Homology Geneid Gene Predictions Genscan Gene Predictions Vertebrate Multiz Alignment & Conservation (44 Species) Primate Conservation by PhastCons Placental Mammal Conservation by PhastCons Vertebrate Conservation by PhastCons Repeating Elements by RepeatMasker Vertebrate examples 3/3

58 Summary Novel method that predicts genes for spliced transcripts We solely use intron information for prediction We identify novel... transcripts coding for proteins or mlncrnas transcripts without conserved secondary structures transcripts with low sequence conservation Limitations: Transcript start, transcript end?

59 Thank you Michael Hiller (Stanford) Leipzig: Sven Findeiß, Manja Marz, Christine Schulz, Sonja J. Prohaska Halle: Sandro Lein, Claudia Nickel, Gunter Reuter Freiburg: Rolf Backofen Various cities, countries, and continents: Peter F. Stadler