Lecture 7 Motif Databases and Gene Finding


1 Introduction to Bioinformatics for Medical Research
Gideon Greenspan
Lecture 7: Motif Databases and Gene Finding

2 Motif Databases & Gene Finding
- Motifs Recap
- Motif Databases
  - TRANSFAC
  - BLOCKS
- Gene Finding
  - ORFFinder
  - GenScan

3 Motifs Recap
- Biological motivation
  - Transcription factors and protein domains
- Types of Motif
  - Consensus-type motifs
  - Profile motifs
- Motif Finding
  - MEME
  - MAST instead of BLAST
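To make the consensus-type motif concrete, here is a minimal sketch that turns an IUPAC consensus string into a regular expression and reports every match in a sequence. The function names, the example consensus TATAWAW and the test sequence are illustrative assumptions, not part of the lecture.

```python
import re

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    # A lookahead is used so that overlapping matches are all reported.
    return re.compile("(?=(" + "".join(parts) + "))")

def find_consensus(consensus, sequence):
    """Return (0-based start, matched text) for every match in the sequence."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]

# Illustrative consensus and sequence (assumptions for this example).
hits = find_consensus("TATAWAW", "GGCTATAAATGCTATATATCC")
print(hits)
```

A consensus match is all-or-nothing at each position. A profile motif instead scores each position by how often each base was observed, so near-matches can still rank highly.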

4 TRANSFAC
- Eukaryotic regulatory DNA elements
- Individual regulatory sites
  - Genes to which they belong
  - Proteins which bind them
- Proteins which bind sites
  - Cellular source of protein
  - Nucleotide motif profile for binding
- Some grouping and classification

5 Searching TRANSFAC
- Search by database content
  - By identifier, factor name, gene name
  - By species, author
- Browse alphabetically
  - By name or species
- Search within a sequence
  - MatInspector
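Searching within a sequence amounts to sliding a binding-site matrix along the DNA and scoring each window. The sketch below uses a generic log-odds position weight matrix with a uniform background. It is not MatInspector's actual scoring scheme, and the counts, pseudocount and threshold are made-up values for illustration.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (0-based start, score) for each window scoring at least threshold."""
    sequence = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

# Hypothetical counts for a 4-base site, invented for this example.
counts = [
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 7, "C": 0, "G": 1, "T": 0},
]
pwm = log_odds_matrix(counts)
hits = scan("CCGATAGGATCC", pwm, threshold=4.0)
print(hits)
```

Raising the threshold trades sensitivity for fewer false positives. Here the window GATA passes, while GATC, which differs only at the last position, falls below the cutoff.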

6 TRANSFAC Site (1)
- DNA or RNA
- Gene
- Gene region
- Accession number
- Sequence of regulatory element
- Position range of factor binding site

7 TRANSFAC Site (2)
- Binding factor accession
- Factor name
- Binding quality
  - 1 = functionally confirmed
  - 2 = binding of pure protein
  - 3 = immunologically characterized extract
  - 4 = via known binding sequence
  - 5 = extract protein binding to bona fide element
  - 6 = unassigned
- Organism
- Cellular source
- External links
- Methods of identifying site

8 TRANSFAC Factor (1)
- Factor name
- Accession number
- Other names
- Organism
- Homologs
- Classification
- Size
- Amino acid sequence

9 TRANSFAC Factor (2)
- Protein sequence reference
- Features and positions
- Cell specificity
- Structural features

10 TRANSFAC Factor (3)
- FF: Functional features
- IN: Interacting factors
- MX: Matrices
- BS: Bound sites
- Nucleotide counts from bound sites
- Appropriate IUPAC subset symbol
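The step from nucleotide counts to an IUPAC symbol can be sketched as below. The thresholds are an assumption: a single base when it exceeds 50% and is at least twice as frequent as the runner-up, a two-base code when the top two bases together exceed 75%, and N otherwise. Three-base codes are left out for brevity, and the example matrix is invented.

```python
# Two-base IUPAC codes keyed by the alphabetically sorted pair of bases.
TWO_BASE_CODES = {
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
}

def iupac_symbol(column):
    """Pick an IUPAC symbol for one matrix column of base counts."""
    total = sum(column.get(b, 0) for b in "ACGT")
    if total == 0:
        return "N"
    # Rank bases by count; ties keep alphabetical order because sorted() is stable.
    ranked = sorted("ACGT", key=lambda b: column.get(b, 0), reverse=True)
    first, second = ranked[0], ranked[1]
    c1, c2 = column.get(first, 0), column.get(second, 0)
    # Assumed rule 1: dominant single base (more than 50%, at least 2x the runner-up).
    if 2 * c1 > total and c1 >= 2 * c2:
        return first
    # Assumed rule 2: the top two bases together cover more than 75%.
    if 4 * (c1 + c2) > 3 * total:
        return TWO_BASE_CODES["".join(sorted(first + second))]
    return "N"

def consensus(matrix):
    """Build the consensus string for a list of count columns."""
    return "".join(iupac_symbol(column) for column in matrix)

# Invented count matrix with ten bound sites per position.
matrix = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 5, "C": 0, "G": 5, "T": 0},
    {"A": 3, "C": 3, "G": 2, "T": 2},
    {"A": 0, "C": 7, "G": 1, "T": 2},
]
print(consensus(matrix))
```

Integer comparisons are used instead of floating-point fractions so that borderline columns, such as an exact 50/50 split, are classified deterministically.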

11 BLOCKS
- Highly conserved protein regions
  - From InterPro and Prints family databases
  - Multiple regions from each family
  - Profile motifs for each region
- Additional features
  - Phylogenetic relationships
  - CODEHOP Primer Design
  - Source of BLOSUM matrices

12 Searching BLOCKS
- Retrieve ID Number
- Search by keywords
  - Boolean rules but no field identifiers
- Search against sequence
  - Nucleotide or Protein
  - RPS-BLAST (Reverse PSI-BLAST)

13 BLOCKS Record (1)
- Source of sequences
- Motif positions
- Function description
- Link for each motif
- Motif graphs
- External links

14 BLOCKS Record (2)
- Distances from previous block
- Motif width
- Measure of specificity
- Number of appearances
- Matched sequence
- Position in sequence
- Relative similarity to others

15 Other Motif Databases
- PFAM (5724 family Markov models)
- ProDom (domain families)
- PROSITE (1634 families and domains)
- SMART (654 domain Markov models)

16 Gene Finding
- Input
  - Chromosomal genetic sequence
- Output
  - Region that encodes a gene
  - Strand and reading frame
  - Start and end of coding sequence
  - Exon-intron boundaries

17 Gene Finding Approaches
- Product-based
  - Compare to cDNAs, proteins
- Comparative genomics
  - Against other genes in genome
  - Against other genomes, e.g. mouse
- Characteristic approaches
  - Learn characteristics of known genes
  - Search for new genes using characteristics

18 Problems with Eukaryotes
- Large genome size
  - Very little codes for genes
- Intron/exon structure
  - Alternative splicing
- Pseudogenes
- About 50% success
  - Many false positives

19 Characteristic Approaches
- Transcription initiation signals
- Uneven codon usage
  - Amino acid bias
  - Species preferred codons
- GC Content
- Markov model differences
- Splicing signals
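Two of the characteristics listed above, GC content and codon usage, are simple to compute. The following is a minimal sketch with illustrative function names and a toy sequence. Real gene finders estimate these statistics from large sets of known genes.

```python
from collections import Counter

def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def codon_usage(coding_sequence):
    """Relative frequency of each codon, read in frame from the first base."""
    seq = coding_sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}

# Toy coding sequence, invented for illustration: ATG GCC GCC GCG TAA
toy = "ATGGCCGCCGCGTAA"
print(gc_content(toy))
print(codon_usage(toy))
```

In the toy sequence, alanine is encoded twice by GCC and once by GCG. That kind of uneven choice among synonymous codons is the signal a codon-usage measure picks up.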

20 ORFFinder: Output
- Open reading frames on one strand and frame
- Length threshold for drawing
- Strand and frame
- Base pair range
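The output fields above (strand, frame, base pair range, length threshold) can be reproduced with a simple six-frame scan. This is a sketch of the general idea, not ORFFinder's implementation. It assumes that an ORF runs from the first ATG to the next in-frame stop codon, and the names and toy sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame):
    """Yield 0-based (start, end) of ATG..stop ORFs in one frame; end is exclusive."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3  # the stop codon is included in the range
            start = None

def find_orfs(sequence, min_length=30):
    """Return (strand, frame, first_bp, last_bp), 1-based on the forward strand."""
    seq = sequence.upper()
    n = len(seq)
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for begin, end in orfs_in_frame(strand_seq, frame):
                if end - begin < min_length:
                    continue  # shorter than the length threshold
                if strand == "+":
                    results.append((strand, frame + 1, begin + 1, end))
                else:
                    # Map reverse-strand positions back to forward coordinates.
                    results.append((strand, frame + 1, n - end + 1, n - begin))
    return results

# Toy sequence with one short ORF: ATG AAA TTT GGG TAG
orfs = find_orfs("CCATGAAATTTGGGTAGCC", min_length=12)
print(orfs)
```

The length threshold matters in practice because short ATG-to-stop stretches occur frequently by chance in any sequence.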

21 GenScan
- One of the most accurate programs
  - Best for human/vertebrate sequences
- Markov parameters for different regions
  - Introns beginning at 3 phases
  - Exons: first, intermediate, last
  - Promoter region
  - 3' and 5' untranslated regions
  - Intergenic regions
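The idea of separate Markov parameters per region type can be illustrated with two first-order Markov chains and a log-odds score. GenScan's actual model is considerably more elaborate. This sketch only shows how region-specific transition probabilities let a program prefer one label over another, and the training strings are tiny invented examples.

```python
import math
from collections import defaultdict

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in "ACGT":
        total = sum(counts[prev][b] for b in "ACGT") + 4 * pseudocount
        model[prev] = {b: (counts[prev][b] + pseudocount) / total for b in "ACGT"}
    return model

def log_odds(sequence, model_a, model_b):
    """Log2 likelihood ratio of the transitions under model_a versus model_b."""
    seq = sequence.upper()
    return sum(
        math.log2(model_a[prev][nxt] / model_b[prev][nxt])
        for prev, nxt in zip(seq, seq[1:])
        if prev in model_a and nxt in model_a[prev]  # skip ambiguous bases
    )

# Invented training data standing in for two region types.
region_a = train_markov(["GCGCGGCCGCGC", "CCGGCGCGGC"])
region_b = train_markov(["ATATTAAATTTA", "TTAATATAAT"])

score_gc = log_odds("GCGCGC", region_a, region_b)
score_at = log_odds("ATATAT", region_a, region_b)
print(score_gc, score_at)
```

A positive score means the sequence looks more like the first region type, and a negative score favours the second. The pseudocount keeps unseen transitions from producing zero probabilities.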

22 GenScan: Input
- Type of organism
- Show mRNAs or just amino acids
- Show more than one gene prediction
- Sequence file
- Paste sequence

23 GenScan: Output
- Feature number
- Feature type
- Strand
- Predicted probability of exon
- Overall feature score
- Base pair range

24 Other Gene Finding Tools
- GeneMark (prokaryote, eukaryote)
- Glimmer (bacteria, archaea)
- GeneFinder (human, mouse, Arabidopsis)
- HMMgene (vertebrate, C. elegans)