ORTHOMINE - A dataset of Drosophila core promoters and its analysis. Sumit Middha Advisor: Dr. Peter Cherbas

- Introduction
- Challenges and Motivation
- D. melanogaster Promoter Dataset
- Expanding promoter sequences
- Merging the Datasets
- Analysis of D. melanogaster Dataset
- Associating Flybase Transcript IDs
- Multiple fly species Dataset
- Initial results with multiple Drosophila Dataset
- Future Work
5/21/2005

INTRODUCTION
- Stages of gene expression
- Single gene, multiple products

INTRODUCTION
- Transcriptional Control

INTRODUCTION
- Drosophila Core Promoters
Image edited from: http://163.238.8.180/~davis/bio_327/lectures/transcription/transcriptionover.html

Challenges and Motivation
- Known regulatory elements are small and degenerate
- The position of the TSS is not accurately known
- Many regulatory elements are still unknown
- Focus on assembling a dataset of promoter regions around the Transcription Start Site (TSS)
- Promoters for RNA polymerase II genes
- Experimentally determined TSS (+/- 10 bp)

The Promoter Datasets
Source            Promoter sequences   Region
Rubin, et al      1941                 -250 to +50
Kadonaga, et al   205                  -47 to +45
EPD               1926                 -250 to +50

Kadonaga, et al (UCSD): 205 promoters
- The proposed TSS adjustment to the Inr consensus (T-C-A(+1)-G/T-T-C/T) was undone using the original references
- Removed 9 promoters for which we could not confirm the TSS from the original papers

Expanding promoter sequences
- Total: 4060 promoters
- Extract the genomic sequence for each promoter (-250 to +100 bp) for uniformity
- 80 ambiguous/low-scoring BLAST hits removed
- http://bioportal.cgb.indiana.edu/orthomine/extend_dataset_cases.htm
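The window extraction on this slide can be sketched as below. This is a minimal illustration, not the code used for ORTHOMINE: the function names are hypothetical, and it assumes 1-based coordinates with the TSS at +1 and no position 0, so that -250 to +100 yields 350 bp.

```python
# Hypothetical helpers; coordinate convention (1-based, TSS = +1, no position 0)
# is an assumption, not something stated on the slide.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def promoter_window(tss, strand, upstream=250, downstream=100):
    """Return (start, end), 1-based inclusive, covering -upstream..+downstream."""
    if strand == "+":
        return tss - upstream, tss + downstream - 1
    if strand == "-":
        return tss - downstream + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def extract_promoter(chrom_seq, tss, strand, upstream=250, downstream=100):
    """Cut the window out of a chromosome string, reverse-complementing
    minus-strand promoters so every sequence reads 5' to 3'."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    if start < 1 or end > len(chrom_seq):
        raise ValueError("window runs off the end of the sequence")
    seq = chrom_seq[start - 1:end]
    return seq if strand == "+" else seq.translate(_COMPLEMENT)[::-1]
```

With this convention the TSS base always sits at index 250 of the returned string, whichever strand the promoter is on.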

Merging the datasets
- Entries in different datasets overlap for the same promoter element
- All-against-all pairwise comparison with FASTA
- E-value cutoff = 1e-34
- Maximum distance between TSSs cutoff = 14 bp
- Non-redundant dataset: 3393 (1908 + 157 + 1328)
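The redundancy rule above can be sketched as follows. The two cutoffs come from the slide; everything else is an assumption made for illustration: the `Hit` record, treating both cutoffs as inclusive, and merging transitively with a union-find.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise comparison result (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    tss_distance: int  # bp between the two TSS positions


def is_redundant(hit, max_evalue=1e-34, max_tss_distance=14):
    """Both cutoffs must hold; inclusive bounds are an assumption."""
    return hit.evalue <= max_evalue and abs(hit.tss_distance) <= max_tss_distance


def count_non_redundant(ids, hits):
    """Group promoters linked by redundant hits and count the groups."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for hit in hits:
        if is_redundant(hit):
            parent[find(hit.query)] = find(hit.subject)
    return len({find(i) for i in ids})
```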

Merging the datasets (contd.)

Dataset Analysis: Word over-representation
- Cannot ignore the position preference of regulatory motifs
- An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these could be as-yet-unknown synonyms of those elements
- Yogita Mantri (MS Bioinformatics, IUB)
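A position-aware word count, the starting point for this kind of analysis, might look like the sketch below. It only tallies words by offset from the TSS; it does not reproduce the statistical test used in the project, which would also need a background model. The function name and the `tss_index` parameter are hypothetical.

```python
from collections import Counter


def positional_word_counts(seqs, k, tss_index):
    """Count each k-mer separately at every offset relative to the TSS.

    tss_index is the 0-based index of the TSS base in every sequence
    (assumes all sequences are aligned on the TSS).
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[(i - tss_index, seq[i:i + k])] += 1
    return counts
```

Comparing the count of a word at one offset with its counts at all other offsets shows whether it is positionally constrained, which a position-blind word count would miss.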

Dataset Analysis: GO Clustering
- 1st level of Biological Processes (7 clusters)
- No strong TCAGT motif in the 'behavior' and 'regulation of biological processes' clusters
- Initial results

Flybase Transcript ID Promoter
- BLAST the non-redundant promoter dataset against the database (D. mel chromosome arms)
- TSS coordinates taken from Flybase rel4.1
- If the TSS of a dataset promoter is within +/- 50 bp of a TSS from rel4.1, assign the transcript ID to the promoter (2699)
(Figure label: Dmel chromosome)
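The assignment rule can be sketched as a nearest-TSS lookup. The +/- 50 bp window is from the slide; requiring the same chromosome arm and strand, choosing the closest annotated TSS when several qualify, and the tuple layout of `annotations` are assumptions made for illustration.

```python
def assign_transcript_id(chrom, strand, promoter_tss, annotations, max_dist=50):
    """Return the ID of the nearest annotated TSS within max_dist bp, else None.

    annotations: iterable of (transcript_id, chrom, strand, tss) tuples
    (hypothetical layout for release TSS coordinates).
    """
    best_id, best_dist = None, None
    for transcript_id, a_chrom, a_strand, a_tss in annotations:
        if a_chrom != chrom or a_strand != strand:
            continue
        dist = abs(a_tss - promoter_tss)
        if dist <= max_dist and (best_dist is None or dist < best_dist):
            best_id, best_dist = transcript_id, dist
    return best_id
```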

Flybase Transcript ID Promoter (contd.)
Offset   Count
-5       18
-4       27
-3       41
-2       37
-1       490
0        961
1        137
2        53
3        33
4        26
5        21
6        18

Multiple Fly species Dataset
- D. ananassae
- D. erecta
- D. mojavensis
- D. pseudoobscura
- D. simulans
- D. virilis
- D. yakuba
Image from Flybase http://bugbane.bio.indiana.edu:7151/

Methods for obtaining Orthologs
1. Direct BLAST
2. Circular Method

Method 1: Direct BLAST
- Shortcut method: simply BLAST the promoter sequence against the new-fly database (chr/contig/scaffold)

Method 2: Circular Method
- Extract the CDS using the Flybase gene ID for the promoter
- BLAST the CDS against the new-fly database (contig/scaffold/chr)
- Take the 5 kbp region upstream of the hit
- BLAST that region against the original D. melanogaster promoter to get the orthologous promoter in the new fly (same orientation here)

Method 2: Circular Method (contd.)
Figure labels: Dataset Promoter; CDS from Flybase; Best BLAST hit in New Fly genome; 5k upstream region; New Fly Promoter
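The "5 kbp upstream of the hit" step of the circular method depends on the strand of the CDS hit. A minimal sketch of that coordinate calculation follows; the function name is hypothetical, coordinates are assumed 1-based inclusive on the forward strand with hit_start <= hit_end, and the BLAST searches themselves are not shown.

```python
def upstream_region(hit_start, hit_end, strand, contig_length, size=5000):
    """Return (start, end) of the region upstream of a CDS hit, or None.

    For a '+' hit, upstream lies before hit_start; for a '-' hit it lies
    after hit_end. The region is clipped to the contig boundaries.
    """
    if strand == "+":
        end = hit_start - 1
        start = max(1, end - size + 1)
    elif strand == "-":
        start = hit_end + 1
        end = min(contig_length, start + size - 1)
    else:
        raise ValueError("strand must be '+' or '-'")
    return (start, end) if start <= end else None
```

Clipping matters for draft assemblies, where a hit near a contig or scaffold edge may leave less than 5 kbp (or nothing) upstream.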

Compare the methods
Two orthologs match if:
- the start coordinates of the BLAST hits are within 50 bp, OR
- a BLAST of the orthologous new-fly promoters obtained by the 2 methods against each other gives a hit of e-70 or lower (paralog)
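The match criterion reduces to a two-part predicate, sketched below. The 50 bp and e-70 thresholds are from the slide; the function signature, and treating both bounds as inclusive, are assumptions.

```python
def orthologs_match(direct_start, circular_start, cross_evalue=None,
                    max_dist=50, max_evalue=1e-70):
    """True if the two methods agree on the orthologous promoter.

    direct_start / circular_start: hit start coordinates from each method.
    cross_evalue: E-value from BLASTing the two candidate promoters against
    each other, or None if that comparison produced no hit.
    """
    if abs(direct_start - circular_start) <= max_dist:
        return True
    return cross_evalue is not None and cross_evalue <= max_evalue
```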

Direct BLAST vs Circular method
7560 comparisons, of which 6890 matched
Dxxx   Matches   No Match
Dsim   1651      20
Dyak   2192      27
Dere   2032      20
Dann   556       240
Dpse   291       197
Dvir   117       96
Dmoj   87        70
No match cases

Dataset Analysis: Multiple Drosophila
- MEME motifs: weak TCAGT and TCGAT (DRE) motifs in D. mojavensis
- Ortholog sets motifs: conserved blocks
- Clustalw analysis: high conservation in INR & TATA regions; conserved 5-mers
- GO + Clustal: TCATT is more conserved than TCAGT in 'regulation of biological processes' and unknown-function genes

Future Work
- More analysis of the datasets to draw further conclusions
- Further analysis of MEME and CLUSTAL results

Acknowledgements
- Dr. Peter Cherbas
- Dr. Haixu Tang
- Dr. Sun Kim
- CGB
- The discussion groups

References
- Kutach, A.K. and Kadonaga, J.T. 2000. The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol. Cell. Biol. 20: 4754-4764.
- Ohler, U., Liao, G., Niemann, H., and Rubin, G.M. 2002. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3: research0087.1-research0087.12.
- Cherbas, L. and Cherbas, P. 1993. The arthropod initiator: the capsite consensus plays an important role in transcription. Insect Biochem. Mol. Biol. 23: 81-90.
- Schmid, C.D., Praz, V., Delorenzi, M., Périer, R., and Bucher, P. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. 82-85.

Questions

Basic Statistics of Dataset (Extra)
1187550 basepairs in 3393 sequences
AT-content 59.05%, GC-content 40.95%

Mononucleotides
A   29.70%
C   20.67%
G   20.28%
T   29.35%

Dinucleotides (rows: first nucleotide; columns: second nucleotide)
     A        C       G       T
A    10.83%   5.21%   5.41%   8.16%
C    6.45%    4.08%   4.97%   5.10%
G    5.31%    5.73%   3.71%   5.46%
T    7.02%    5.59%   6.14%   10.53%
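Statistics of this kind can be recomputed from the sequences with a short script. The sketch below is illustrative only: the function name is hypothetical, and it assumes that only A/C/G/T are counted and that dinucleotides are counted with overlap inside each sequence, which may differ from how the figures above were produced.

```python
from collections import Counter

BASES = "ACGT"


def nucleotide_stats(seqs):
    """Return (mono_pct, di_pct): percentage dicts over A/C/G/T only."""
    mono, di = Counter(), Counter()
    for seq in seqs:
        seq = seq.upper()
        mono.update(b for b in seq if b in BASES)
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair[0] in BASES and pair[1] in BASES:
                di[pair] += 1  # overlapping dinucleotides within one sequence
    n_mono, n_di = sum(mono.values()), sum(di.values())
    mono_pct = {b: 100.0 * mono[b] / n_mono for b in BASES}
    di_pct = {a + b: 100.0 * di[a + b] / n_di for a in BASES for b in BASES}
    return mono_pct, di_pct
```

AT-content is then `mono_pct["A"] + mono_pct["T"]`.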

Ohler, Rubin, et al (Berkeley): 1941 promoters (Extra)
- Aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome
- Stringent criteria concerning coverage and 5'-end distribution
- Cautious approach for identifying TSSs: require the 5' ends of the alignments of multiple, independent cap-selected cDNAs to lie in close proximity
- http://genomebiology.com/2002/3/12/research/0087

Kadonaga, et al (UCSD): 205 promoters (Extra)
- Drosophila Core Promoter Database
- TSS mapped by nuclease protection, primer extension, or multiple 5' RACE clones
- If the reported start site overlaps a consensus Inr element, the central A in the Inr consensus (T-C-A(+1)-G/T-T-C/T) was designated as the TSS
- http://mcb.asm.org/cgi/content/full/20/13/4754
- http://www-biology.ucsd.edu/labs/kadonaga/dcpd.html
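The Inr adjustment rule on this slide can be illustrated with a pattern search. This is a sketch under assumptions: the consensus is written as the regular expression TCA[GT]T[CT], positions are 0-based indices into the promoter string, and the function names are hypothetical.

```python
import re

# T-C-A(+1)-G/T-T-C/T as a regex; the lookahead lets overlapping matches be found.
INR = re.compile(r"(?=TCA[GT]T[CT])")


def inr_tss_positions(seq):
    """0-based indices of the central A of every consensus Inr in seq."""
    return [m.start() + 2 for m in INR.finditer(seq.upper())]


def adjust_tss(seq, reported_index):
    """If the reported start falls inside a consensus Inr, move it to the A;
    otherwise return it unchanged."""
    for a_pos in inr_tss_positions(seq):
        if a_pos - 2 <= reported_index <= a_pos + 3:
            return a_pos
    return reported_index
```

The ORTHOMINE dataset undid this adjustment using the original references, so a helper like this would mainly serve to detect which entries had been shifted.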

The Eukaryotic Promoter Database: 1926 promoters (Extra)
- Transcription by RNA polymerase II (protein-coding genes)
- Assignment of TSS from experimental data (precision of +/- 5 bp or better)
- Proposed adjustment from consensus: ignored
- Minor discrepancies between publications: averaged
- Alternative promoters for the same gene: min. 20 bp apart
- http://www.epd.isb-sib.ch/

Multiple flies AXT Alignments (Extra)
- UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/downloads.html)
- BLASTZ alignments (Miller's lab, PSU)
- AXT file:
  0 chr19 3001012 3001075 chr11 70568380 70568443 3500
  TCAGCTCATAAATCACCTCCTGCCACAAGCCTGGCCTGGTCCCA
  TCTGTTCATAAACCACCTGCCATGACAAGCCTGGCCTGTTCCCAA
- Issue: gapped alignment
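A minimal sketch of reading an AXT summary line and mapping a coordinate through a gapped alignment follows. The UCSC AXT format normally carries nine fields including a strand; the line shown on the slide has eight, so the parser below accepts both and leaves the strand unset when it is absent. The names are hypothetical, and minus-strand query coordinates (which AXT reports on the reverse-complemented sequence) are not converted.

```python
from typing import NamedTuple, Optional


class AxtHeader(NamedTuple):
    index: int
    t_chrom: str
    t_start: int
    t_end: int
    q_chrom: str
    q_start: int
    q_end: int
    strand: Optional[str]
    score: int


def parse_axt_header(line):
    """Parse an AXT summary line with 9 fields, or 8 if the strand is missing."""
    f = line.split()
    if len(f) == 9:
        strand, score = f[7], f[8]
    elif len(f) == 8:
        strand, score = None, f[7]
    else:
        raise ValueError("unexpected AXT summary line: %r" % line)
    return AxtHeader(int(f[0]), f[1], int(f[2]), int(f[3]),
                     f[4], int(f[5]), int(f[6]), strand, int(score))


def map_position(t_seq, q_seq, t_start, q_start, t_pos):
    """Map a target coordinate to the aligned query coordinate.

    Walks the alignment column by column, skipping '-' gap characters.
    Returns None if t_pos is outside the block or aligned to a gap.
    """
    t, q = t_start - 1, q_start - 1
    for a, b in zip(t_seq, q_seq):
        if a != "-":
            t += 1
        if b != "-":
            q += 1
        if a != "-" and t == t_pos:
            return q if b != "-" else None
    return None
```

Because of the gaps, a fixed offset from the block start is not enough to carry a TSS coordinate from one genome to the other; the column walk above handles that.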