Promoter definition by mass genome annotation data: in silico primer extension
EMBNET course: Bioinformatics of transcriptional regulation, Jan 28 2008
Christoph Schmid

Regulation of eukaryotic transcription
[Figure from: Levine, M. and R. Tjian (2003). "Transcription regulation and animal diversity." Nature 424(6945): 147-51. With permission, Nature, Macmillan Publishers Ltd]
Upstream promoter region defined by transcription start sites (TSS)
Conventional techniques:
- nuclease protection assay
- primer extension
[Figure labels: cDNAs, genomic DNA, core promoter, TSS]

In Silico (Digital) versus in Vitro (Analog) Primer Extension
cctcacccctttccttcccacaggtccctggccaaagatttatttctcttgacaacca
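The "digital" primer extension idea can be sketched as counting how many mapped cDNAs start at each genomic position. This is only a minimal illustration: the input format (tuples of start, end, strand) and the function name are assumptions, not something the slides specify.

```python
from collections import Counter

def count_five_prime_ends(alignments):
    """Count how many cDNA 5' ends fall on each genomic position.

    `alignments` is an iterable of (start, end, strand) tuples in genomic
    coordinates with start <= end (a hypothetical input format).
    """
    counts = Counter()
    for start, end, strand in alignments:
        # On the + strand the 5' end of the transcript is the leftmost aligned
        # base; on the - strand it is the rightmost one.
        five_prime = start if strand == "+" else end
        counts[five_prime] += 1
    return counts

# Toy data: four cDNAs mapped to the + strand of the same locus.
alignments = [(100, 950, "+"), (100, 800, "+"), (103, 950, "+"), (98, 700, "+")]
profile = count_five_prime_ends(alignments)
for position in sorted(profile):
    print(position, profile[position])
```

The resulting position-to-count profile plays the role of the band intensities in an analog primer extension gel.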
A job for Bioinformatics?
Prediction based on sequence motifs does not (yet?) achieve satisfying results (for review, see Ohler and Niemann, 2001).
Large-scale projects provide corresponding data:
- Genome projects
- cDNA sequencing projects
  - oligocapping method (Suzuki, Y. et al. 2002)
  - MGC project (Strausberg, R.L. et al. 2002)
Oligocapping method -> full-length libraries
http://dbtss.hgc.jp/
DBTSS vs. conventional techniques
[Figure: number of 5' ends of DBTSS transcripts by genomic position; scale bar 100 bp]
Maire P. et al. (1987) Characterization of three optional promoters in the 5' region of the human aldolase A gene. J. Mol. Biol. 197, 425-438

TSS determined by modelling Gaussian distributions (MADAP)
[Figure: frequency of full-length transcripts by genomic position; annotations: 45 bp, 10 bp, R 84046905-84046987, R 84047148-84047231]
Schmid CD, Sengstag T, Bucher P, Delorenzi M (2007) MADAP, a flexible clustering tool for the interpretation of one-dimensional genome annotation data. Nucleic Acids Res 35: W201-205.
Webserver: http://www.isrec.isb-sib.ch/madap/
DATA INTERPRETATION WITH MADAP
input: positions of 5' ends
1. initial model: k normal distributions
2. parameter fitting with EM
3. elimination of distributions? (yes / no)
4. evaluation: data likelihood with this model
5. k = k-1, repeat until k = 1
output: best model = maximal likelihood
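The loop above can be illustrated with a small one-dimensional Gaussian mixture fitted by EM. This is a sketch of the general scheme only, not the MADAP implementation: the even initialisation, the `min_weight` elimination rule, the `min_sd` floor and all function names are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_mixture(data, k, iterations=200, min_sd=1.0, min_weight=0.02):
    """Fit up to k normal distributions with EM (hypothetical settings)."""
    lo, hi = min(data), max(data)
    # Initial model: k components spread evenly over the data range.
    means = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    sds = [max((hi - lo) / (2.0 * k), min_sd)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate weight, mean and standard deviation.
        kept = []
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            if nj / len(data) < min_weight:
                continue  # assumed rule: drop components explaining almost no data
            mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
            kept.append((nj, mean, max(math.sqrt(var), min_sd)))
        total_n = sum(n for n, _, _ in kept)
        weights = [n / total_n for n, _, _ in kept]
        means = [m for _, m, _ in kept]
        sds = [s for _, _, s in kept]
    return weights, means, sds

def log_likelihood(data, model):
    weights, means, sds = model
    return sum(
        math.log(sum(w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sds)) or 1e-300)
        for x in data
    )

def best_model(data, k_max=4):
    """Try k = k_max, k_max-1, ..., 1 and keep the model with maximal likelihood."""
    best = None
    for k in range(k_max, 0, -1):
        model = fit_mixture(data, k)
        score = log_likelihood(data, model)
        if best is None or score > best[0]:
            best = (score, model)
    return best[1]

# Toy 5'-end positions forming two clusters (invented numbers).
positions = [98, 99, 100, 100, 101, 101, 102, 102, 103, 104,
             340, 342, 343, 344, 345, 345, 346, 347, 348, 350]
weights, means, sds = best_model(positions)
print([round(m, 1) for m in sorted(means)])
```

On this toy input the surplus components collapse and are removed, leaving one normal distribution per cluster of 5' ends; each fitted mean is then a candidate TSS.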
Higher precision of in silico PE

                 [-10;10]   [-400;400]
EPD 70             0.83        1           36
RefSeq mRNA        0.32        0.95       933
Genome annot.      0.31        0.95       890
DBTSS              0.13        0.68       933
Eponine            0.12        0.46       494

[Figure labels: in silico primer ext.; conv. methods; Ohler-set; RefSeq mRNAs]
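The slide does not state how the [-10;10] and [-400;400] columns were computed. One plausible reading, assumed here, is the fraction of promoters whose annotated TSS lies within that distance of a reference TSS. Under that assumption, and with a hypothetical paired input format, the calculation could look like this:

```python
def fraction_within(pairs, window):
    """Fraction of (annotated_tss, reference_tss) pairs at most `window` bp apart.

    Both the pairing and the interpretation of the window are assumptions;
    the slide only gives the interval labels.
    """
    hits = sum(1 for annotated, reference in pairs
               if abs(annotated - reference) <= window)
    return hits / len(pairs)

# Invented positions for four promoters.
pairs = [(1000, 1003), (5200, 5150), (9000, 9800), (700, 695)]
print(fraction_within(pairs, 10))   # 0.5
print(fraction_within(pairs, 400))  # 0.75
```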
Eukaryotic Promoter Database (EPD)

ID   HS_RPS19   standard; multiple; VRT.
AC   EP68002;
DT   22-AUG-2001 (Rel. 68, created)
DT   19-DEC-2003 (Rel. 77, Last annotation update).
DE   Ribosomal protein S19.
OS   Homo sapiens (human).
HG   none.
AP   none.
NP   none.
DR   GENOME; NT_011109.15; NT_011109; [-14632542, 16750487].
DR   CLEANEX; HS_RPS19.
DR   EMBL; AC010616.5; [-21462, 150387].
DR   EMBL; AF092906.1; [-792, 1344].
DR   SWISS-PROT; P39019; RS19_HUMAN.
DR   RefSeq; NM_001022.
DR   MIM; 603474.
RN   [1]
RX   MEDLINE; 11752328.
RA   Suzuki Y., Yamashita R., Nakai K., Sugano S.
RT   DBTSS: database of human transcriptional start sites and
RT   full-length cDNAs.
RL   Nucleic Acids Res. 30:328-331(2002).
RN   [2]
RX   MEDLINE; 10521335.
RA   Strausberg RL., Feingold EA., Klausner RD., Collins FS.;
RT   The mammalian gene collection;
RL   Science 286:455-457(1999).
ME   NEDO full length human cDNA sequencing project.
ME   Oligo-capping [1].
ME   Mammalian gene collection (MGC) full-length cDNA cloning [2].
SE   tctcgcgagaccctacgcccgacttgtgcgcccgggaaaccccgtcgttccctttcccct
FL   DBTSS MGC
:
IF   -3 G 1
IF   -2 T 1
IF   -1 T 4
IF   0 C 20 80
IF   +1 C 12 13
IF   +2 C 32 2
IF   +3 T 2 2
:
TX   6. Vertebrate promoters
TX   6.1. Chromosomal genes
TX   6.1.2. Structural proteins
TX   6.1.2.3. RNA-binding proteins
TX   6.1.2.3.2. Ribosomal proteins
KW   Ribosomal protein, Disease mutation.
FP   Hs ribosomal p. S19 :+M EU:NC_000019.8 1+ 47056165; 68002.
DO   Experimental evidence: 11,12
DO   Expression/Regulation:
RF   NAR30:328 Sci286:455
//

GC-content around TSS
- 1489 Human promoter seq.
- 1802 Drosophila
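A GC-content profile around the TSS can be obtained by aligning promoter sequences on their TSS and computing the G+C fraction column by column. The sketch below assumes equal-length sequences already extracted around each TSS; the function name and the toy sequences are illustrative only.

```python
def gc_content_by_position(sequences):
    """Return the fraction of G or C at each column of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must all have the same length")
    profile = []
    for column in range(length):
        gc = sum(1 for seq in sequences if seq[column] in "GCgc")
        profile.append(gc / len(sequences))
    return profile

# Four invented 4-bp "promoters" aligned on the same reference position.
promoters = ["ACGT", "GCGA", "TCAT", "GGGC"]
print(gc_content_by_position(promoters))  # [0.5, 1.0, 0.75, 0.25]
```

For real promoter sets a sliding-window average over neighbouring columns is usually applied on top of this to smooth the curve.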
TATA is one of several signals
[Figure: Constraint (SSA-Cpr); panel labels (1830) (1664) (225) (47)]

Alternative sources of raw data to determine promoters:
Sequencing:
- 5' SAGE (5'-end Serial Analysis of Gene Expression)
- CAGE (Cap Analysis Gene Expression)
- GIS-PET (Gene Identification Signature Paired-End ditag)
Hybridization:
- Tiling array (probes for entire genome/chromosomes)
- ChIP-chip (Chromatin ImmunoPrecipitation on DNA chip)
CAGE / 5' SAGE
Advantages:
- enriched for full-length 5' end of transcripts
- high throughput (lower cost)
Disadvantages:
- no information on coding region
- relatively short tags with sequencing errors
- difficult to map
GIS-PET
Advantages:
- Paired-End tags enhance mapping
- enriched for full-length 5' end of transcripts
- high throughput (lower cost)
Disadvantages:
- no information on coding region
ChIP-chip
Advantages:
- high resolution by overlapping probes (oligos)
- signal on entire genome/chromosomes
Disadvantages:
- maps pre-initiation complex (not TSS)
- hybridization artifacts
- limited resolution
- repeat regions are excluded
New data sources for EPD
ChIP-chip pre-initiation complexes: Kim et al. (2005) Nature, 436, 876-880
GEO: GSE2672 (remapped!)
[Figure: ENSEMBL view, chromosome 12, 6.8 to 6.94 Mb; y-axis: virtual counts = (2 ** log ratio) - 1]

ChIP-chip data with insufficient resolution
FP   Hs USP5 :+R EU:NC_000012.10 1+ 6831557; 74339.
[Figure: frequency (0.0 to 2.0) by genomic position, 6831200 to 6832000]
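The "virtual counts" label describes turning a ChIP-chip log ratio into a count-like value, so that hybridization signal can be handled like tag counts. A minimal sketch of that transformation follows. The base of the logarithm is not given on the slide, so base 2 is assumed here, and the function name is hypothetical.

```python
def virtual_count(log_ratio, base=2.0):
    """Convert a probe log ratio into a virtual count: base ** log_ratio - 1."""
    return base ** log_ratio - 1.0

# Invented log2 ratios for three probes.
for log_ratio in (0.0, 1.0, 3.0):
    print(log_ratio, virtual_count(log_ratio))
```

A log ratio of 0 (no enrichment) maps to 0 virtual counts, while negative log ratios give values between -1 and 0.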