High-throughput Transcriptome analysis

1 High-throughput Transcriptome analysis: CAGE and beyond
Dr. Rimantas Kodzius, Singapore, A*STAR, IMCB, for KAUST 2008

2 Agenda
1. Current research
- PhD work on discovery of new allergens
- Postdoctoral work on Transcriptional Start Sites
  a) Tag-based technologies allow higher throughput
  b) CAGE technology to define promoters
  c) CAGE data analysis to understand transcription
- Work in Singapore on Comparative Transcriptomics
2. Research outlook at KAUST
- Nanofluidic devices for Genomics
- Production of high-volume molecular biology/genomics data
- Collaboration with bioinformatics to analyze the data

3 PhD work on identification of allergens

4 Picking/ ReArraying/ Spotting robot

5 Examples of DNA filter hybridization
- 5 patients, allergen A
- 5 patients, allergen B

6 Work in Japan, Genomic Sciences Centre
- Supported by: EU FP5 INCO2 program
- Prof. Yoshihide Hayashizaki (RIKEN)
- Dr. Piero Carninci (RIKEN)
- >200 co-authors on publication in Science

7 Genomics goes hand in hand with Transcriptomics
To understand phenotypes and diseases, we need to know:
- transcriptional regulatory networks
- timing and quantity of controlled transcripts
- TF binding sites (located at promoters)

8 Gene structure & EST cloning
[Figure: gene structure (promoter, TSS, 5'-UTR, ATG, exons, TAA, 3'-UTR) and the EST cloning workflow. In vivo: transcription, splicing. In vitro: reverse transcription, 2nd strand synthesis, cloning, sequencing, genomic alignment.]

9 Transcripts contain lots of information
- IRES: internal ribosome entry sites
- CPE: cytoplasmic polyadenylation element
- miRNA

10 Full-length cDNA libraries
The transcriptome gives a snapshot of cell activity:
- Experimental evidence of transcribed regions
- Alternative promoters, splicing and polyadenylation sites
- Defined TSS and predicted promoter
- ORF (open reading frame)
- 5'- and 3'-UTRs, transcript stability
- Quantitative analysis of gene expression
- Transcriptionally interacting partners
- Gene networks

11 Tag-based technologies
[Figure: positions of sequence tags along a transcript (promoter, TSS, 5'-UTR, ATG, TAA, 3'-UTR, poly(A) tail); tag labels: 5'-tag, 3'-tag]
1. SAGE (Serial Analysis of Gene Expression): tag next to a restriction enzyme (RE) site
3. CAGE (Cap Analysis of Gene Expression): 5'-tag
4. GIS-PET (Gene Identification Signature Paired-End ditag): 5'-tag and 3'-tag

12 CAGE tags represent cDNA
Genome annotation
- Experimental evidence of TSS and transcribed region, UTR location
- Alternative promoter sites
Promoter analysis
- Regulatory elements
- TF binding sites
- CpG islands
- Repetitive elements
Quantitative analysis of gene expression

13 CAGE steps from RNA to 20 bp tags
[Figure: protocol schematic; labels include Cap, AAAAA, Biotin, XmaJI, MmeI, MmeI-PCR, Uni-PCR, 20mer tag]
1. Reverse transcription
2. Full-length cDNA selection
3. ssDNA release
4. ssDNA capture by CAGE linker
5. Second strand synthesis
6. MmeI digestion of dsDNA
7. Ligation of second linker
8. PCR amplification
9. CAGE tag release
10. Concatenation
11. Fractionation
12. Cloning
13. Sequencing

14 Species        Assembly Ver.   Chromosomes
   Mus musculus   UCSC-May        ,X,Y
   Homo sapiens   UCSC-May        ,X,Y
   Current Statistics                                   Mus musculus         Homo sapiens
                                                        Fri, 12 Nov 2004     Thu, 13 Jan 2005
   Number of CAGE Libraries                             145                  41
   Number of CAGE Tissues                               23                   17
   Number of CAGE Plates                                8,862                3,327
   Number of CAGE Clones                                2,721,800            1,035,181
   Number of CAGE Tags                                  11,567,973           10,165,217
   Average of CAGE Tags/Clone                           4.25                 9.82
   Number of mapped CAGE Tags [ at least 1 site ]       8,825,172            6,475,536
   Number of mapped CAGE Tags [ specified 1 site ]      7,151,511            5,312,921
   Average of mapping rate                              0.62                 0.52
   Number of CTSS                                       1,260,079            1,057,486
   Number of TC                                         594,136              629,716
   Number of TU                                         39,593               33,903
   Number of TU in whole genomes                        50,612               39,903
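The two derived rows (tags per clone and mapping rate) can be recomputed from the raw counts. The sketch below assumes that the mapping rate is the single-site ("specified 1 site") count divided by the total tag count, which is what the reported values appear to match; the slide does not state the formula. The dictionary layout and function name are illustrative.

```python
# Counts copied from the statistics slide; the dictionary layout is illustrative.
stats = {
    "Mus musculus": {"clones": 2_721_800, "tags": 11_567_973,
                     "mapped_any": 8_825_172, "mapped_single": 7_151_511},
    "Homo sapiens": {"clones": 1_035_181, "tags": 10_165_217,
                     "mapped_any": 6_475_536, "mapped_single": 5_312_921},
}


def derived(s):
    """Return (tags per clone, single-site mapping rate), rounded to 2 decimals.

    Assumption: mapping rate = tags mapped to exactly one site / all tags.
    """
    return (round(s["tags"] / s["clones"], 2),
            round(s["mapped_single"] / s["tags"], 2))


for species, s in stats.items():
    print(species, derived(s))
```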

15 5'-RACE validation of Opioid receptor 1

16 Example of tissue-specific TSS
UDP-glucuronyl transferase gene: usage of seven alternative promoters

17 Definitions: CTSS and tag clusters
- CAGE-tag starting site (CTSS) = CAGE tags with identical 5'-site
- Tag cluster (TC) = overlapping CTSS on same strand
- A TC can be defined by its start and end positions, count of tags, and distribution of counts
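These definitions can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the FANTOM data: it assumes every tag has the same length (20 bp here) and treats two tags on the same strand as overlapping when their 5' positions are less than one tag length apart. The tuple layout and function names are hypothetical.

```python
from collections import Counter

TAG_LEN = 20  # assumed uniform tag length in bp


def ctss_counts(tags):
    """Count tags per CTSS; each tag is (chrom, strand, five_prime_pos)."""
    return Counter(tags)


def tag_clusters(tags, tag_len=TAG_LEN):
    """Merge CTSS on the same chromosome and strand whose tags overlap.

    With equal-length tags, two tags overlap exactly when their 5' positions
    differ by less than tag_len, so a sorted single pass is enough.
    """
    clusters = []
    for (chrom, strand, pos), n in sorted(ctss_counts(tags).items()):
        last = clusters[-1] if clusters else None
        if (last and last["chrom"] == chrom and last["strand"] == strand
                and pos - last["end"] < tag_len):
            last["end"] = pos          # extend the cluster
            last["ctss"][pos] = n      # distribution of counts
            last["tags"] += n          # total tag count
        else:
            clusters.append({"chrom": chrom, "strand": strand, "start": pos,
                             "end": pos, "tags": n, "ctss": {pos: n}})
    return clusters


example = [("chr1", "+", 100)] * 3 + [("chr1", "+", 105),
                                      ("chr1", "+", 130),
                                      ("chr1", "-", 102)]
for tc in tag_clusters(example):
    print(tc)
```

Each returned dictionary carries the four descriptors named on the slide: start, end, tag count and the per-position distribution of counts.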

18 TC with >100 tags analyzed
Four main shape classes of tag clusters

19 Tag cluster shapes
- Sharp or focused
- Broad or dispersed
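One simple way to separate the two extremes is to ask how much of a cluster's signal sits next to its dominant CTSS. The sketch below is only an illustration: the slides do not give the actual classification rule, and both cut-offs (`window`, `min_fraction`) are hypothetical values chosen for the example.

```python
def classify_shape(ctss, window=2, min_fraction=0.8):
    """Label a non-empty tag cluster 'sharp' or 'broad'.

    ctss maps genomic position -> tag count. window and min_fraction are
    hypothetical cut-offs, not values taken from the slides.
    """
    total = sum(ctss.values())
    dominant = max(ctss, key=ctss.get)  # position with the most tags
    near = sum(n for pos, n in ctss.items() if abs(pos - dominant) <= window)
    return "sharp" if near / total >= min_fraction else "broad"


print(classify_shape({100: 90, 101: 8, 140: 2}))
print(classify_shape({100: 10, 110: 12, 120: 9, 130: 11}))
```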

20 TSS sequence representation
TATA box in:
- sharp TSS
- minority of promoters
- tissue-specific genes
- high conservation
CpG islands in broad TSS
TATA site ~ -30 nt from TSS
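A quick check for a TATA box near -30 can be sketched as a motif scan upstream of a known TSS. The assumptions here are mine, not the slide's: the pattern `TATA[AT]A[AT]` is a commonly quoted TATA-box consensus, the search window of -35 to -25 is an arbitrary reading of "~ -30 nt", and `tss_index` is the 0-based index of the +1 base in the sequence.

```python
import re

# Commonly quoted TATA-box consensus; an assumption, not taken from the slides.
TATA = re.compile(r"TATA[AT]A[AT]")


def tata_positions(seq, tss_index, lo=-35, hi=-25):
    """Return start positions of TATA-box matches relative to the TSS.

    tss_index is the 0-based index of the +1 base, so the base just
    upstream is -1. The window lo..hi is an illustrative choice.
    """
    seq = seq.upper()
    hits = []
    for rel in range(lo, hi + 1):
        i = tss_index + rel
        if i >= 0 and TATA.match(seq, i):  # match anchored at index i
            hits.append(rel)
    return hits


# Synthetic promoter: TATA box starts 31 nt upstream of the +1 base.
promoter = "G" * 20 + "TATAAAA" + "G" * 23 + "CA" + "G" * 10
print(tata_positions(promoter, tss_index=51))
```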

21 The consensus initiator sequence
- TATA-box
- Positions -1,+1: Py-Pu (C,T - A,G)
- Most preferred initiators are CG, CA and TG
- 3'-UTR TSSs: GGG motif
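The Py-Pu rule at positions -1,+1 is easy to test per site and to tabulate across many sites. The following is a minimal sketch; the function names and the `(sequence, tss_index)` input layout are illustrative, with `tss_index` taken as the 0-based index of the +1 base (it must be at least 1).

```python
from collections import Counter

PYRIMIDINES, PURINES = set("CT"), set("AG")


def initiator(seq, tss_index):
    """Return the dinucleotide at positions -1,+1 (tss_index = index of +1)."""
    return seq[tss_index - 1:tss_index + 1].upper()


def is_py_pu(dinuc):
    """True if the dinucleotide is pyrimidine (C/T) followed by purine (A/G)."""
    return len(dinuc) == 2 and dinuc[0] in PYRIMIDINES and dinuc[1] in PURINES


def initiator_frequencies(sites):
    """Relative frequency of each -1,+1 dinucleotide over (seq, tss_index) pairs."""
    counts = Counter(initiator(seq, idx) for seq, idx in sites)
    total = sum(counts.values())
    return {dinuc: n / total for dinuc, n in counts.items()}


sites = [("GGCAGG", 3), ("TTCATT", 3), ("AATGAA", 3), ("CCGGCC", 3)]
print(initiator_frequencies(sites))
print([is_py_pu(initiator(seq, idx)) for seq, idx in sites])
```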

22 Dinucleotide frequency in dominant TSS

23 Over-represented k-mers
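The slide does not say how over-representation was scored. One common approach, sketched here as an assumption, is to compare k-mer frequencies in promoter sequences against a background set and report the ratio. The function names, the pseudocount and the toy sequences are all illustrative.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count all overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts


def enrichment(foreground, background, k, pseudocount=1.0):
    """Ratio of smoothed k-mer frequencies, foreground over background.

    Values above 1 indicate k-mers over-represented in the foreground.
    The pseudocount avoids division by zero for k-mers seen in one set only.
    """
    fg, bg = kmer_counts(foreground, k), kmer_counts(background, k)
    vocab = set(fg) | set(bg)
    fg_total = sum(fg.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    return {kmer: ((fg[kmer] + pseudocount) / fg_total)
                  / ((bg[kmer] + pseudocount) / bg_total)
            for kmer in vocab}


scores = enrichment(["TATAAATATAAA"], ["GCGCGCGCGCGC"], k=4)
for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(kmer, round(score, 2))
```

A real analysis would also need a significance test and a background matched for base composition; the ratio alone only ranks candidates.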

24 New concept of genes

25 Conclusions for FANTOM3 data
After analyzing 145 mouse and 41 human CAGE libraries, including GIC/GSC, 5' ESTs and FANTOM3 clones:
- Potential TCs: 736,403 mouse; 665,278 human
- TCs supported by >1 tag: 159,075 mouse; 177,563 human
- 181,047 independent transcripts in the mouse genome
- 62.5% of the genome is transcribed (not only the 2% that is protein coding)
- 65% of TUs contain alternative splicing variants
- ~ TUs, protein-coding and non-coding TUs
- 51,135 proteins
- 78,393 splicing variants
- > (72% TU) sense-antisense transcript pairs

26 In summary
- There are more different transcripts than genes (~10x)
- More than half (58%) of TUs have two or more alternative promoters, polyadenylation sites; 65% have multiple splice variants
- Four categories of promoters can be defined
- TATA-box-containing promoters are a minor subset; the majority of promoters lie within CpG islands
- There are transcription forests and deserts

28 Complementary information
- CAGE-TSSchip can be used for measuring promoter-based transcriptional activity
- Next-generation sequencing technologies boost the data output of the tag approach (Roche Genome Sequencer 20 (454), ABI SOLiD Analyzer, Illumina Genome Analyzer, Helicos HeliScope)
- Improved promoter and TSS prediction algorithms
- ENCODE project
- Genome annotation (TSS with evidence of 5 or more CAGE tags used)
- CAGE tags can be found in the UCSC Genome Browser

29 CAGE data in UCSC browser: HoxA cluster

30 Still in touch with RIKEN
RIKEN president visits alumni in Singapore

31 Work in Singapore
Comparative Genomics (Marine Genomics) laboratory at IMCB, the Institute of Molecular and Cell Biology (belongs to the A*STAR organization)

32 Marine Genomics group in Singapore

33 Work on Comparative Transcriptomics
Elephant shark (Callorhinchus milii) as a model
- Phylogenetically the oldest group of living jawed vertebrates (separated 450 million years ago)
- Genome is smaller than H. sapiens (1.2 Gb)
- Genome is being sequenced at WUGSC
- Transcriptome (full-length cDNA libraries) at IMCB in Singapore
- UCE: ultraconserved elements

34 Future research plans
- Get to know and introduce nano instruments for molecular biology/genomics
- Generate experimental data to support hypotheses
- Collaborate with computer scientists to analyze the high-volume data

35 Acknowledgements
- Teachers and scientists who introduced me to science
- Colleagues and collaborators for enriching my research
- Joint KAUST-HKUST laboratory for inviting me today
Thanks everyone for listening!