Outline 08/2018. Searching for Transcription Start Sites in Drosophila. TSS of F element genes show lower levels of H3K9me3 and HP1a

Size: px
Start display at page:

Download "Outline 08/2018. Searching for Transcription Start Sites in Drosophila. TSS of F element genes show lower levels of H3K9me3 and HP1a"

Transcription

1 Outline Searching for Transcription Start Sites in Drosophila Transcription start sites (TSS) annotation goals Promoter architecture in D. melanogaster Find the initial transcribed exon in the target species Annotate putative transcription start sites Search for core promoter motifs Wilson Leung 08/2018 Muller F element, heterochromatic, and euchromatic genes show similar expression levels TSS of F element genes show lower levels of H3K9me3 and HP1a H3K9me2 H3K9me3 H3K36me3 HP1a POF Su(var)3-9 PolII Riddle NC, et al. PLoS Genet Sep;8(9):e Riddle NC, et al. PLoS Genet Sep;8(9):e Three strategies for motif finding Multiple genes in a single species Genes with common expression pattern Sequences associated with ChIP-Seq peaks Single gene in multiple species Phylogenetic footprinting Motif finding using multiple genes within a single species A C G T Trl: FlyReg_DNaseI Multiple genes in multiple species Compare multiple sequence alignment profiles of multiple genes (Magma) Bits Sequences surrounding TSS 5 10 Predicted motif instances 1

2 Motif finding using single gene in multiple species Motif finding using multiple genes in multiple species (PhyloNet) Based on Figure 1 from Wang T and Stormo GD. PNAS 2005 Nov 29;102(48): Magma: Multiple Aligner of Genomic Multiple Alignments Key features of Magma: Runs ~70x faster than PhyloNet Analyze multiple sequence alignments with gaps Use set-covering approach to minimize redundancy in discovered motifs Computationally tractable to analyze conserved motifs in multiple eukaryotic genomes Ihuegbu NE, Stormo GD, Buhler J. J Comput Biol Feb;19(2): Estimated evolutionary distances with respect to D. melanogaster D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. bipectinata D. ananassae D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Species sequenced by modencode Species Substitutions per neutral site D. ficusphila 0.80 D. eugracilis 0.76 D. biarmipes 0.70 D. takahashii 0.65 D. elegans 0.72 D. rhopaloa 0.66 D. kikkawai 0.89 D. bipectinata 0.99 Data from Table 1 of the modencode comparative genomics white paper GEP annotation projects Goals for the transcription start sites (TSS) annotations Research goal: Identify motifs that enable Muller F element genes to function within a heterochromatic environment Annotation goals: Define search regions enriched in regulatory motifs Define precise location of TSS if possible Define search regions where TSS could be found Document the evidence used to support the TSS annotations Proposed strategies for utilizing the GEP TSS annotations Use TSS annotations to anchor the whole-genome multiple sequence alignments Multiple sequence alignment is hard (NP-hard) Example: anchored multiple alignment with DIALIGN Align orthologous promoters to identify conserved regulatory motifs: Examples: EvoPrinterHD, Pro-Coffee De novo and discriminatory motif discovery Analyze TSS search regions to identify over-represented motifs Examples: MEME Suite, HOMER 2

3 Challenges with TSS annotations Fewer constraints on untranslated regions (UTRs) UTRs evolve more quickly than coding regions Open reading frames, compatible phases of donor and acceptor sites do not apply to UTRs Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs Most gene finders do not predict UTRs Lack of experimental data Cannot use RNA-Seq data to precisely define the TSS TSS annotation workflow 1. Identify the ortholog 2. Note the gene structure in D. melanogaster 3. Annotate the coding exons 4. Classify the type of core promoter in D. melanogaster 5. Annotate the initial transcribed exon 6. Identify any core promoter motifs in region 7. Define TSS positions or TSS search regions RNA Polymerase II core promoter Peaked versus broad promoters Initiator motif (Inr) contains the TSS TFIID binds to the TATA box and Inr to initiate the assembly of the preinitiation complex (PIC) Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol Mar 15;339(2): Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol Jan-Feb;1(1): RNA-Seq Read Count RNA-Seq biases introduced by library construction 5 3 Gene Span cdna fragmentation Strong bias at the 3 end RNA fragmentation More uniform coverage Miss the 5 and 3 ends of the transcript Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet Jan;10(1): Techniques for finding TSS Identify the 5 cap at the beginning of the mrna Cap Analysis of Gene Expression (CAGE) RNA Ligase Mediated Rapid Amplification of cdna Ends (RLM-RACE) Cap-trapped Expressed Sequence Tags (5 ESTs) More information on these techniques: Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol : Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet Jun;8(6):

4 Promoter architecture in Drosophila Classify core promoter based on the Shape Index (SI) Determined by the distribution of CAGE and 5 RLM-RACE reads Shape index is a continuum Most promoters in D. melanogaster contain multiple TSS Median width = 162 bp ~70% of vertebrate genes have broad promoters Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res Feb;21(2): Genes with peaked promoters show stronger spatial and tissue specificity 46% of genes with broad promoters are expressed in all stages of embryonic development 19% of genes with peaked promoters are expressed in all stages Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res Feb;21(2): Resources for classifying the type of core promoter in D. melanogaster 9-state chromatin model Only a subset of the modencode data are available through FlyBase D. melanogaster GEP UCSC Genome Browser [Aug (BDGP Release 6) assembly] FlyBase gene annotations (release 6.22) modencode TSS (Celniker) annotations DNase I hypersensitive sites (DHS) CAGE and RAMPAGE datasets 9-state and 16-state chromatin models Transcription factor binding site (TFBS) HOT spots Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature Mar 24;471(7339): DNaseI Hypersensitive Sites (DHS) correspond to accessible regions modencode TSS annotations Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119: Two sets of modencode TSS predictions TSS (Celniker) Most recent dataset produced by modencode Available on the GEP UCSC Genome Browser TSS (Embryonic) Older dataset available from FlyBase GBrowse Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature Aug 28;512(7515): Use TSS (Celniker) dataset as the primary evidence Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res Feb;21(2):

5 Classify the D. melanogaster core promoter based on (TSS) Celniker annotations and DHS positions Consider DHS positions within a 300bp window surrounding the start of the D. melanogaster transcript DEMO: Classify the core promoter of D. melanogaster Rad23 Additional DHS data from different stages of embryonic development Additional TSS data available in FlyBase release 6.11 DHS data produced by the BDTNP project Evidence tracks: Detected DHS Positions (Embryos) DHS Read Density (Embryos) Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43. Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res Jan;23(1): Benefits of RAMPAGE RAMPAGE = RNA Annotation and Mapping of Promoters for Analysis of Gene Expression CAGE only allows sequencing of short sequence tags (~27 bp) near the 5 cap Ambiguous read mapping to large parts of the genome RAMPAGE results on the GEP UCSC Genome Browser Lifted RAMPAGE results from release 5 to release 6 Results from 36 developmental stages Combined TSS peak call from all samples Available under the Expression and Regulation section RAMPAGE produces long paired-end reads instead of short sequence tags Developed novel algorithm to identify TSS clusters Used paired-end information during peak calling Used Cufflinks to produce partial transcript models Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'- complete cdnas. Curr Protoc Mol Biol Nov 11;104:Unit 25B.11. 5

6 New Drosophila RAMPAGE datasets D. melanogaster RAMPAGE data from second biological replicate Lifted results from release 5 to the release 6 assembly Combined RAMPAGE TSS (Replicate 2) (R5) track RAMPAGE data from three other Drosophila species D. erecta, D. ananassae, and D. pseudoobscura Samples collected at 1 hour intervals throughout embryonic development (1-hour to 23-hour embryos) Combined RAMPAGE TSS track Standardize analysis of MachiBase and modencode CAGE data using CAGEr Bioconductor package developed by RIKEN Map datasets against release 6 assembly 37 modencode CAGE samples; 7 MachiBase samples Define TSS and promoters for each sample Define consensus promoters across all samples Batut PJ, Gingeras TR. Conserved noncoding transcription and core promoter regulatory code in early Drosophila development. Elife Dec 20;6. Haberle V, et al. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res Apr 30;43(8):e51. TSS classifications based on CAGEr Changes in the dominant TSS of Rad23 across different developmental stages Rad23 Stages of Development CAGE Tag Clusters Adult females Evidence for TSS annotations (in general order of importance) 1. Experimental data RNA-Seq RNA Pol II ChIP-Seq 2. Conservation Type of TSS (peaked/intermediate/broad) in D. melanogaster Sequence similarity to initial exon in D. melanogaster Sequence similarity to other Drosophila species (ROAST) Determine the gene structure in D. melanogaster FlyBase: GBrowse UTR CDS Gene Record Finder: Transcript Details 3. Core promoter motifs Inr, TATA box, etc. 6

7 Identify the initial transcribed exon using NCBI blastn Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder Use placement of the flanking exons to reduce the size of the search region if possible Increase sensitivity of nucleotide searches Change Program Selection to blastn Change Word size to 7 Change Match/Mismatch Scores to +1, -1 Change Gap Costs to Existence: 2, Extension: 1 Extrapolate TSS position based on blastn alignment of the initial transcribed exon Query start: 6 Extrapolate TSS position: 28,941-5 = 28,936 blastn: D. mel: Rad23:1 (Query) vs. contig19 (Sbjct) Assume the length of the initial transcribed exon is conserved between D. melanogaster and D. biarmipes Core promoter motifs can affect gene expression levels SCP1: High turnover of DNase I Hypersensitive Sites (DHS) between human and mouse Juven-Gershon T, et al. Rational design of a super core promoter that enhances gene expression. Nat Methods Nov;3(11): Vierstra J et al. Science Nov 21;346(6212): High turnover of core promoter motifs between D. erecta and D. ananassae Use core promoter motifs to support TSS annotations Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS Some motifs (e.g., Inr, TATA) are well-characterized Other motifs are identified based on computational analysis Presence of core promoter motifs can be used to support the TSS annotations Inr motif (TCAKTY) overlaps with the TSS (-2 to +4) Rach EA et al. Genome Biol. 2009;10(7):R73. Absence of core promoter motifs is a negative result Most D. melanogaster TSS do not contain the Inr motif 7

8 Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs Core Promoter Motifs tracks TATA box Initiator (Inr) Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087. Available under Projects è Annotation Resources è Core Promoter Motifs on the GEP web site: Show core promoter motif matches for each contig Separated by strand Visualize matches to different core promoter motifs Use UCSC Table Browser (or other means) to export the list of motif matches within the search region RNA PolII ChIP-Seq tracks (available for D. biarmipes, D. elegans, and D. ficusphila) Show regions that are enriched in RNA Polymerase II compared to input DNA DEMO: Use the Inr motif to support the TSS position of Rad23 Using RNA-Seq and RNA PolII ChIP-Seq data to define the TSS search region TSS annotation for Rad23 TSS position: 28,936 Conservation with D. melanogaster blastn search of initial exon D. mel Transcripts track Location of the Inr motif TSS search region: 28,716-28,936 Enrichment of RNA PolII upstream of the TSS position RNA-Seq read coverage upstream of the TSS position Search region defined by the extent of the RNA PolII peak 8

9 TSS annotation resources Walkthroughs: Annotation of Transcription Start Sites in Drosophila Sample TSS report for onecut Exercises: TSS Modules 1-4 (under development) Reference: TSS Annotation Workflow GEP Annotation Report: Classify the type of core promoter Evidence that supports or refutes the TSS annotation Distribution of core promoter motifs Additional TSS annotation resources The D. melanogaster gene annotations are the primary source of evidence Resources that could be useful if the D. melanogaster evidence is ambiguous Multiple sequence alignments of 26 Drosophila species PhastCons and PhyloP conservation scores Genome browsers for 28 Drosophila species RNA Pol II ChIP-Seq (D. biarmipes, D. elegans, and D. ficusphila) CAGE data for D. pseudoobscura RAMPAGE data for D. erecta, D. ananassae, and D. pseudoobscura RNA-Seq coverage, splice junctions, assembled transcripts Gnomon gene predictions Genome Browsers for 28 Drosophila species Genome alignments of 27 Drosophila species against D. melanogaster Use the conservation tracks to identify regions under selection Examine the ROAST alignments to identify the orthologous TSS regions 9

10 TSS annotation summary Questions? Most of the D. melanogaster core promoters have multiple TSS Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster Define search regions that might contain TSS Use multiple lines of evidence to infer the TSS region Identify initial exon RNA-Seq coverage blastn (change search parameters) Distribution of core promoter motifs (e.g., Inr) RNA PolII ChIP-Seq peaks Maintain conservation compared to D. melanogaster 10