Genomic and bioinformatics resources
徐唯哲 Paul Wei-Che HSU
Assistant Research Specialist
Bioinformatics Core, Institute of Molecular Biology, Academia Sinica, Taiwan, R.O.C.

What Bioinformatics Can Do for You
- Dry lab: data mining, data analysis
- Wet lab: experimental verification
Data analysis
http://bc.imb.sinica.edu.tw/online_tool.php
- Sequence Analysis
- RNAi Design
- Motif Searching
- DNA Primer Design
- Pathway Analysis
- Microarray Analysis
- Protein Interactions Prediction
- 3D Structure Modeling
- 3D Structure Comparison
- RNA Secondary Structure Prediction
- Biomolecular Interaction
- Protein Secondary Structure Prediction
- Subcellular Localization Prediction
- Protein Functional (Domain) Analysis

2010 NAR Database Summary Paper Category List
http://www.oxfordjournals.org/nar/database/c/
- Nucleotide Sequence Databases
- RNA sequence databases
- Protein sequence databases
- Structure Databases
- Genomics Databases (non-vertebrate)
- Metabolic and Signaling Pathways
- Human and other Vertebrate Genomes
- Human Genes and Diseases
- Microarray Data and other Gene Expression Databases
- Proteomics Resources
- Other Molecular Biology Databases
- Organelle databases
- Plant databases
- Immunological databases
Data Mining
Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases.
http://bc.imb.sinica.edu.tw/online_database.php
- Genome databases
- Nucleotide databases
- Protein databases
- Structure databases
- Other databases

1953 Watson and Crick propose the double helix model for DNA
1955 The first protein sequence, bovine insulin, is announced by F. Sanger
1970 The details of the Needleman-Wunsch algorithm for sequence comparison are published
1977 Protein Data Bank is published
1980 Multi-dimensional NMR is used for protein structure determination
1981 The Smith-Waterman algorithm for sequence alignment is published
1982 Genetics Computer Group (GCG) is created as a part of the University of Wisconsin Biotechnology Center
1985 The PCR reaction is described by Kary Mullis and co-workers
1988 The National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine (November 4, 1988)
1990 The BLAST program (Altschul et al.) is published
1990 The Human Genome Project (HGP), an international research program, begins
1991 The creation and use of expressed sequence tags (ESTs) is described
1995 Microsoft releases version 1.0 of Internet Explorer
1996 Affymetrix produces the first commercial DNA chips
1998 RNA interference is discovered in C. elegans by Mello and Fire
2003 The Human Genome Project (HGP) is completed
2008 Next-generation sequencing (NGS) technologies are advancing in quality and applications
[Figure labels: establishing biotech and genetic engineering; bioinformatics booming: structure analysis, gene annotation, genotyping, cross-species comparisons, function annotation, gene regulation analysis]
Most commonly used websites in bioinformatics
- NCBI (http://www.ncbi.nlm.nih.gov/guide/): National Center for Biotechnology Information
- Ensembl (http://www.ensembl.org/): a joint project between the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute (WTSI)
- UCSC (http://genome.ucsc.edu/): University of California Santa Cruz

NCBI Organizational Structure
- Computational Biology Branch (CBB): develops innovative algorithms (BLAST, PSI-BLAST, SEG, VAST, and COGs) and novel research approaches (text neighboring)
- Information Engineering Branch (IEB): designs and builds NCBI's production software and databases
- Information Resources Branch (IRB): plans, directs, and manages the technical operations of NCBI, including the computer systems used for research and development as well as the computer systems used to access public databases
BLAST
Basic Local Alignment Search Tool (BLAST)
http://blast.ncbi.nlm.nih.gov/blast.cgi

[Screenshot: BLAST page showing Basic BLAST and Specialized BLAST]
[Screenshot: Request ID]
Score
1. Score for match = +1
2. Mismatch penalty = -1
3. Assume gap opening (GO) penalty = -2 and gap extension (GE) penalty = -1

Expectation Values
$$E = K\,m\,n\,e^{-\lambda S}$$
- K = constant (correction for non-independence of possible starting points for matches)
- m = total length of sequences in database
- n = length of query sequence
- λ = scaling constant
- S = score of the high-scoring sequence pair (HSP)
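As a rough illustration of the scoring scheme and the expectation value above, the following sketch scores an already-aligned pair of sequences and evaluates E = K·m·n·e^(−λS). The function names, the example sequences, and the convention that a gap of length k costs GO + k·GE are assumptions made for this example. K and λ depend on the scoring system and are passed in as plain parameters, not derived.

```python
import math


def alignment_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two aligned strings of equal length ('-' marks a gap).

    Assumption: a gap of length k costs gap_open + k * gap_extend, and
    consecutive gap columns are treated as one gap regardless of which
    sequence carries the '-'.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score += gap_open  # charged once per gap
            score += gap_extend    # charged for every gap column
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


def expect_value(s, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * s)


# Hypothetical aligned pair: 5 matches, 1 mismatch, 1 gap of length 1.
example_score = alignment_score("ACGTGCA", "AC-TGGA")
```

Here the example pair scores 5·(+1) + 1·(−1) + (−2 − 1) = 1. Because E falls exponentially with S, small gains in score translate into much smaller expectation values.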
[Screenshot: Basic BLAST and Specialized BLAST]
Example: Search Conserved Domains on a protein
Homolog
- Homolog: A gene related to a second gene by descent from a common ancestral DNA sequence.
- Ortholog: Orthologs are genes in different species that have evolved from a common ancestral gene via speciation.
- Paralog: Paralogs are genes produced via gene duplication within a genome.
Types of Databases
Archival or Primary Data
- Text: PubMed
- DNA sequence: GenBank/EMBL/DDBJ
- Protein sequences/structures: PDB (RCSB)
Curated or Processed Data
- Sequences: RefSeq (curated, non-redundant DNA, mRNA, protein, etc.)
- Protein Sequences and Structures: MMDB
- Organism Maps: Entrez Genomes (human, mouse, yeast, etc.)
- Genes: LocusLink (loci), HomoloGene (orthologs), OMIM (disease)
Specialized Databases
- Organism: Maps in Entrez Genomes (human, mouse, yeast, etc.)
- Function: Sequences in UniVec (vectors), UniGene (genes)
- Sequencing Methods: dbEST, dbGSS, dbSTS, HTG

[Figure: NCBI databases and tools: Article Abstracts (MEDLINE), Taxonomy (Taxonomy Browser), Genomes (Map Viewer), 3-D Structure (MMDB, VAST), Nucleotide Sequences and Protein Sequences (BLAST)]
Other Databases
- Genetic Variation: dbSNP
- Cancer Chromosome Aberration: CCAP
- Gene Expression: SAGE
- Cancer Gene Expression: CGAP
- Genetic Disease: OMIM
- Protein: Swiss-Prot

[Screenshot: Entrez Homepage]
Ensembl
The Ensembl project was started in 1999. The Ensembl group consists of between 40 and 50 people:
- Genebuild team: creates the gene sets for the various species
- Software team: develops and maintains the BioMart data mining tool
- Comparative Genomics, Variation and Functional Genomics teams: responsible for the comparative, variation and regulatory data, respectively
- Web team: makes sure that all data are presented on the website in a clear and user-friendly way
- Outreach team: answers questions from users and gives workshops

Genome browsers
- Ensembl (public site + installable system)
- UCSC Human Genome Browser
- NCBI Map Viewer
Ensembl naming conventions
- ENSG0000XXXX for genes
- ENST0000XXXX for gene transcripts
- Species prefix: ENS for human, ENSMUS for mouse, ENSRNO for rat, etc.
- 51 species
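The naming convention above can be sketched as a small parser. This is only an illustration: the function name and lookup tables are made up for this example, it covers just the three species prefixes and two feature letters listed on the slide, and the sample identifiers are placeholders, not real records.

```python
import re

# Only the prefixes named on the slide; other species fall through to "other species".
SPECIES_BY_PREFIX = {"": "human", "MUS": "mouse", "RNO": "rat"}
FEATURE_BY_LETTER = {"G": "gene", "T": "transcript"}

# "ENS", an optional species code, a feature letter (G or T), then digits.
ENSEMBL_ID = re.compile(r"^ENS([A-Z]*?)([GT])(\d+)$")


def describe_ensembl_id(identifier):
    """Return (species, feature type) for an Ensembl-style ID, or None if it does not match."""
    m = ENSEMBL_ID.match(identifier)
    if m is None:
        return None
    prefix, letter, _digits = m.groups()
    species = SPECIES_BY_PREFIX.get(prefix, "other species")
    return species, FEATURE_BY_LETTER[letter]


# Placeholder identifiers that only illustrate the format.
examples = {
    name: describe_ensembl_id(name)
    for name in ("ENSG00000000001", "ENSMUST00000000001", "ENSRNOG00000000001")
}
```

A lookup table keeps the sketch honest about its coverage: any prefix not listed on the slide is reported as "other species" instead of being guessed.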
Species homepage
[Screenshot labels: Species, Version, Chromosome maps]

Chromosome maps
The "MapView" page displays the map of chromosome bands. To the left, feature density plots for genes, GC content, repetitive sequences and SNPs are shown.
[Screenshot: Chromosome overview]
Introduction to BioMart
http://www.ensembl.org/biomart/martview/

Data mining using BioMart
BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the European Bioinformatics Institute (EBI). The system can be used with any type of data and is particularly suited to data-mining-style searches of complex descriptive data.
UCSC Genome Browser
Center for Biomolecular Science and Engineering (CBSE) at the University of California Santa Cruz (UCSC).
[Photo: Jim MacKenzie]

UCSC Genome Browser
- Genome Browser: zooms and scrolls over chromosomes, showing the work of annotators worldwide
- Gene Sorter: shows expression, homology and other information on groups of genes that can be related in many ways
- Blat: quickly maps your sequence to the genome
- Table Browser: provides convenient access to the underlying database
- VisiGene: lets you browse through a large collection of in situ mouse and frog images to examine expression patterns
- Genome Graphs: allows you to upload and display genome-wide data sets
[Screenshot: Genome Browser]
[Screenshot: Gene Sorter]
[Screenshot: Blat]
[Screenshot: Table Browser]
[Screenshot: VisiGene]
[Screenshot: Genome Graphs]
Scenario 1: How to get genes with the highest enrichment in embryonic stem cells (ES cells)?

Digital Differential Display (DDD)
- Expressed Sequence Tags (EST): a set of single-pass sequenced cDNAs from mRNAs derived from a specific tissue or cell population
- Digital Differential Display (DDD): a tool for comparing EST profiles in order to identify genes with significantly different expression levels
Step 1: Find UniGene in NCBI
Step 2: Select species
Step 3: Define pools
Step 4: View results
Scenario 2: Get gene sequences
Search Gene for your gene name
Scenario 3: How to get all human transcription factor gene sequences
[Figure: promoter analysis, with transcription factors (TF) bound upstream of genes]
NCBI + Ensembl!!
STEP 1: Use the NCBI search bar to search keywords
        Result: 2949 transcription factors!
STEP 2: Change Display: Summary -> Brief
STEP 3: Save file (UI List)
        Labels: EntrezGene ID; Associated Gene Name or HGNC symbol
        HGNC: HUGO Gene Nomenclature Committee
        Save this file
STEP 6: Go to Ensembl (Ensembl Genome Browser)
STEP 7: Click BioMart
STEP 8: CHOOSE DATABASE: Select Ensembl 56
STEP 9: CHOOSE DATASET: Select Homo sapiens genes (GRCh37)
STEP 10: Click Filters
STEP 11: Click GENE
STEP 12: Check ID list limit
STEP 13: Select EntrezGene ID(s)
STEP 14: Input gene_result.txt or copy and paste the EntrezGene IDs
STEP 15: Click Attributes
STEP 16: Select Sequences
STEP 17: Click SEQUENCES
STEP 18: Select Unspliced (Gene)
STEP 19: Click Header Information
STEP 20: Select Associated Gene Name
STEP 21: Click Results
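The BioMart steps above can also be expressed as an XML query, which is how BioMart represents a filter/attribute selection internally. The sketch below only builds such a query string and does not contact any server. The dataset name `hsapiens_gene_ensembl`, the filter name `entrezgene`, and the attribute names `gene_exon_intron` and `external_gene_id` are assumptions that differ between Ensembl releases, so they should be checked against the XML that the BioMart web page itself offers for the query. The two EntrezGene IDs are example inputs.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(entrez_ids,
                        dataset="hsapiens_gene_ensembl",       # assumed dataset name
                        id_filter="entrezgene",                # assumed filter name
                        attributes=("gene_exon_intron",        # assumed: Unspliced (Gene)
                                    "external_gene_id")):      # assumed: Associated Gene Name
    """Build a BioMart-style XML query: one ID-list filter, FASTA output."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "FASTA",
        "header": "0",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # STEP 12-14: restrict to a list of EntrezGene IDs.
    ET.SubElement(ds, "Filter", {
        "name": id_filter,
        "value": ",".join(str(i) for i in entrez_ids),
    })
    # STEP 15-20: sequence attribute plus header information.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Example: IDs as they might appear in gene_result.txt.
query_xml = build_biomart_query([7157, 4609])
```

Keeping the names as parameters makes it easy to swap in whatever the target release actually uses.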
Scenario 4
To retrieve all the human genes on chromosome 1.
The retrieved gene information includes:
- Associated Gene Name
- Start Position (bp)
- End Position (bp)
- Strand
- 1000 bp 5' upstream
Constraints:
- With a 5' UTR
- Gene Ontology GO:0030528: transcription regulator activity
Overview in the analysis of gene regulatory network (2010/11/22)
[Figure: fold change (log2) over the time points PHX, 15, 1h, 4h, 8h, 12h, 24h, 36h, 48h, 3d, 4d, 7d, 10d]
Gene expression analysis (yields co-expressed genes)
- Normalization
- Filtering
- Clustering
Promoter analysis
- Promoter extraction
- TF binding site (TFBS)
- Motif discovery
- Homologous analysis
- Co-occurrence of TFBS
Regulatory network
- TF and targets
- Protein-protein interaction
- Pathway
- Protein modification

How to measure similarity between expression patterns? The Pearson correlation coefficient.
Pearson's correlation coefficient measures the linear association between two sets of pairs {x_i} and {y_i}:
$$r_{xy} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
- {x_i} and {y_i} are the paired percentage errors for multiplicative models
- {x_i} and {y_i} are the paired residuals for additive models
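To make the Pearson formula concrete, here is a minimal sketch in plain Python that follows the definition term by term. The function name and the three expression profiles are invented for illustration and are not data from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical log2 ratios over five time points.
gene_a = [0.1, 0.8, 1.9, 2.7, 1.5]
gene_b = [2 * v for v in gene_a]   # same shape, larger amplitude
gene_c = [-v for v in gene_a]      # mirror image

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Because r depends only on the shape of a profile and not on its amplitude, gene_a and gene_b correlate at about +1 while gene_a and gene_c correlate at about −1. That is why correlation is a common similarity measure for grouping co-expressed genes.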
An Illustrative Example
[Table: log2 ratio (experimental/control)]
[Table: Pearson's correlation coefficients]

Hierarchical Clustering
K-means Clustering

Co-expressed Gene Groups
[Figure: co-expressed genes; fold change (log2, axis from -2 to 4) over the time points PHX, 15, 1h, 4h, 8h, 12h, 24h, 36h, 48h, 3d, 4d, 7d, 10d]
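As a rough sketch of how k-means groups expression profiles, the example below alternates the two standard steps: assign each profile to its nearest centroid, then move each centroid to the mean of its members. It uses squared Euclidean distance and caller-supplied starting centroids so the result is reproducible. The function names and the four toy profiles are assumptions for this illustration, not the slides' data or any particular tool's implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, initial_centroids, iterations=10):
    """Plain k-means with fixed starting centroids; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every profile.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(profile, centroids[k]))
            for profile in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical log2 fold-change profiles: two induced, two repressed genes.
profiles = [
    [0.5, 1.5, 2.5],
    [0.6, 1.4, 2.7],
    [-0.5, -1.5, -2.0],
    [-0.4, -1.6, -2.2],
]
labels, centroids = kmeans(profiles, [profiles[0], profiles[2]])
```

In practice the number of clusters and the starting centroids have to be chosen, and different choices can give different groupings.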
HCE - Hierarchical Clustering Explorer

Gene regulation database: TRANSFAC
TRANSFAC
A database on gene transcription regulation.
[Figure: TRANSFAC tables GENE, SITE, FACTOR and MATRIX, linked by the relations "contains", "encodes for", "binds to and regulates", "interacts", "is used to construct" and "is an attribute of"]

[Screenshot: TRANSFAC: FACTOR table, protein sequence]
[Screenshot: TRANSFAC: FACTOR table, protein domains]
[Screenshot: TRANSFAC: FACTOR table, structural and functional features]
[Screenshot: TRANSFAC: FACTOR table, links to other databases]
[Screenshot: TRANSFAC: classification of transcription factors]
[Screenshot: TRANSFAC: CLASS table]
[Screenshot: TRANSFAC: FACTOR table, protein-DNA and protein-protein interactions]
TRANSFAC: MATRIX table
Two important parameters in Match™: matrix similarity and core similarity.
TF matrix (built from aligned binding sites; the core is the most conserved part):
actgcgaattatcgc
tacacgaatagaagc
agcgcgaattgacct
aatgcgaattaacgc
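To show how aligned binding sites such as the four above turn into a matrix, the sketch below counts the bases in each column and lists the columns where every site agrees. It does not reproduce the Match™ scoring itself. The function and variable names are invented for this example, and the slide does not say whether the fully conserved columns coincide with the core that Match™ uses.

```python
SITES = [
    "actgcgaattatcgc",
    "tacacgaatagaagc",
    "agcgcgaattgacct",
    "aatgcgaattaacgc",
]


def count_matrix(sites):
    """Return one {base: count} dictionary per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = {base: 0 for base in "ACGT"}
        for base in column:
            counts[base.upper()] += 1
        matrix.append(counts)
    return matrix


matrix = count_matrix(SITES)

# 1-based positions where all sites carry the same base.
conserved = [
    position
    for position, counts in enumerate(matrix, start=1)
    if max(counts.values()) == len(SITES)
]
```

For these four sites the fully conserved columns are positions 5 to 9, which spell cgaat.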
[Screenshot: TRANSFAC: Match™ tool]
[Screenshot: TRANSFAC: Match™ output]
Pathway Analysis & Data Mining for Gene Expression
MetaCore (commercial)
- Choose from ten network-creating algorithms and multiple filters for optimal data mining
- Take advantage of the annotated content database that took over 100 man-years to assemble
- Over 2,000 interactive maps with consensus knowledge of human biology and diseases
- Visualize mouse, rat, worm, fly, yeast, chimpanzee, bovine, zebrafish, mosquito, mold, rice, arabidopsis, candida, plasmodium and dog data on maps and networks
Pathway Studio (commercial)
- Find pathways and gene ontology groups affected in an experiment
- Overlay expression data on canonical pathways and visualize the effects
- Identify significant genes from a network relevance perspective
- Build new pathways and regulation networks using molecular and functional relationship information extracted from publicly available literature
VisANT (free!)

[Screenshots: MetaCore, Pathway Studio, VisANT]
VisANT: Integrative Visual Analysis Tool for Biological Networks and Pathways
Hu, Z., Mellor, J., Wu, J. and DeLisi, C. (2004) VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics, 5, 17.

Introduction
VisANT is an application for integrating biomolecular interaction data into a cohesive, graphical interface. It:
- offers an online interface for a large range of published data sets on biomolecular interactions, including those entered by users
- is integrated with standard databases for organized annotation, including GenBank, KEGG and SwissProt
URL: http://visant.bu.edu/
[Screenshot: Searching the Protein/Gene]
[Screenshot: Searching KEGG Pathway and Chemical Compounds]
[Screenshot: Load Your Own Data]

ClustalW: Multiple Sequence Alignment
Multiple Sequence Alignment
MSA is the process of finding the similarities among multiple sequences.

Sequence Homology
Multiple Sequence Alignment (e.g. ClustalW: an MSA software)
S1 A-CGTGCA
S2 ACCGTGCA
S3 A-CGTGC-
S4 A-CTTGCA
   * * ***
(* = match; differences between sequences arise from insertion, deletion and substitution)

Distances Between Sequences (e.g. Phylip: Neighbor-Joining Algorithm)
     S1  S2  S3  S4
S1        7   6   6
S2            6   6
S3                5
S4

Constructing Evolutionary Trees
[Figure: tree with leaves S1, S2, S3, S4]
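The pairwise numbers in the matrix above appear to be consistent with counting, for each pair of aligned sequences, the columns in which both carry the same residue (columns where both have a gap are ignored). That reading is an inference from the four sequences, not something the slide states. The following sketch recomputes the values under that assumption; the function and variable names are illustrative.

```python
from itertools import combinations

ALIGNMENT = {
    "S1": "A-CGTGCA",
    "S2": "ACCGTGCA",
    "S3": "A-CGTGC-",
    "S4": "A-CTTGCA",
}


def identical_positions(a, b):
    """Count aligned columns where both sequences have the same residue (gaps excluded)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


# One value per unordered pair, in the order (S1,S2), (S1,S3), ..., (S3,S4).
pairwise = {
    (p, q): identical_positions(ALIGNMENT[p], ALIGNMENT[q])
    for p, q in combinations(ALIGNMENT, 2)
}
```

Under this assumption the six values are 7, 6, 6, 6, 6 and 5, matching the matrix. Tree-building methods such as neighbor-joining normally start from distances, which can be derived from such counts, for example as the fraction of differing positions.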
Multiple sequence alignment: ClustalW
http://www.ebi.ac.uk/tools/clustalw2/index.html
[Screenshot: Results]
[Screenshot: Phylogenetic tree]

Motif Prediction
[Screenshot: DNA Sequence]
MEME: Multiple EM for Motif Elicitation
Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

Motif discovery tools: MEME
URL: http://meme.sdsc.edu/meme/intro.html
Get the result by email
[Screenshot: Summary of motifs]

Melina II
http://melina2.hgc.jp/public/index.html

WebLogo: A sequence logo generator
Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Res. 18
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Research, 14:1188-1190 (2004)
URL: http://weblogo.berkeley.edu/

Motif Logo
- Motifs can mutate at non-important bases
- The five motifs below have mutations in positions 3 and 5
- Representations called motif logos illustrate the conserved regions of a motif
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
[Figure: stacks of aligned motif instances (TGGGGGA, TGAGAGA, TGGGGGA, TGAGAGA, TGAGGGA, ...), growing from 5 to 100 sequences]

Entropy
- Define frequencies for the occurrence of each letter in each column, e.g. p_A = 1, or p_A = 0.75 and p_T = 0.25
- Compute the entropy of each column:
$$\text{column entropy} = -\sum_{X \in \{A, T, G, C\}} p_X \log p_X$$

Entropy: Example
1. AATGAGGGA
2. ATTGTGAGA
3. ACTGCGGGA
4. AGTGGGAGA
- Best case: a column of A, A, A, A has entropy 0
- Worst case: a column of A, T, G, C has entropy
$$-\sum \frac{1}{4} \log \frac{1}{4} = -4 \cdot \frac{1}{4} \cdot (-2) = 2$$

Entropy of an Alignment: Example
Column entropy: -(p_A log p_A + p_C log p_C + p_G log p_G + p_T log p_T), with base-2 logarithms and 0*log0 taken as 0
A A A
A C C
A C G
A C T
Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0] = 0
Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[(1/4)*(-2) + (3/4)*(-0.415)] = +0.811
Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2
Column_height = 2 - column_entropy
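The column entropy and the logo height (2 minus the column entropy) can be checked with a few lines of Python. This is a minimal sketch with base-2 logarithms: bases that do not occur in a column are simply skipped, which corresponds to treating 0·log 0 as 0. The function names are illustrative, and the two inputs are the three-column example and the five-site motif from the slides.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.

    Written as sum(p * log2(1/p)), which equals -sum(p * log2(p)).
    """
    counts = Counter(column)
    total = len(column)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def logo_heights(sequences):
    """Column height = 2 - column entropy, for a DNA alphabet of four letters."""
    return [2.0 - column_entropy(column) for column in zip(*sequences)]


# Three-column example from the slide: columns AAAA, ACCC and ACGT.
example = ["AAA", "ACC", "ACG", "ACT"]
entropies = [column_entropy(column) for column in zip(*example)]

# Five motif instances with variation in positions 3 and 5.
motif = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
heights = logo_heights(motif)
```

For the three-column example this gives entropies of 0, about 0.811 and 2, in line with the worked calculation. For the five-site motif, positions 3 and 5 are shorter than the fully conserved positions, which reach the maximum height of 2.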
Motif Logos: An Example
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
[Figure source: Nature Reviews Genetics, Volume 5, April 2004, 279]

BioPHP Minitools
http://140.109.32.31/biophp
Tools Summary
IMB Bioinformatics Core (online tools & DB)
- http://140.109.32.31/online_intro.php
- EMBOSS: http://anabench.bcm.umontreal.ca/html/emboss/
- BioPHP: http://140.109.32.31/biophp/index.php
Genomic Databases
- Ensembl Genome Browser: http://www.ensembl.org/
- UCSC Genome Bioinformatics Home: http://genome.ucsc.edu/
- NCBI HomePage: http://www.ncbi.nlm.nih.gov/
- GeneCards Homepage: http://www.genecards.org/
Alignment
- ClustalW and others (Max-Planck): http://toolkit.tuebingen.mpg.de/sections/alignment
- BLAST: http://blast.ncbi.nlm.nih.gov/blast.cgi
Motif Discovery
- The MEME Suite: http://meme.sdsc.edu/meme4_3_0/intro.html
- Melina II: http://melina2.hgc.jp/public/index.html
- WebLogo 3: http://weblogo.threeplusone.com/create.cgi

Thanks for your attention
paul@imb.sinica.edu.tw