Bioinformatics for Cell Biologists

Bioinformatics for Cell Biologists
15-19 March 2010
Developmental Biology and Regenerative Medicine (DBRM)

Schedule

Monday, March 15
09.00-11.00  Introduction to course and Bioinformatics (L1); D224; Helena Storvall
12.00-17.00  Core databases for bioinformatics (C1); Space; Ersen Kavak, Helena Storvall, Daniel Ramskold

Tuesday, March 16
09.00-09.45  Alignments (L2); D224; Daniel Ramskold
10.00-10.45  Phylogenetics (L3); D224; Prof. Bengt Persson, CMB and LIU
11.00-11.45  Protein Sequence Bioinformatics (L4); D224; Prof. Bengt Persson, CMB and LIU
13.00-17.00  Computer Exercise 1: Alignments, Genomes and Browsers (C2); Space; Rickard Sandberg

Wednesday, March 17
09.00-12.00  Computer Exercise 2: Phylogenetics and Proteins (C3); Space; Rickard Sandberg
13.30-15.00  Invited Speaker 1: Transcriptome and translational regulation; D224; Dr. Ola Larsson, McGill University
15.30-17.00  Next generation sequencing bioinformatics (L5); D224; Rickard Sandberg

Thursday, March 18
09.00-12.00  Computer Exercise 3: Tools for Next Gen Sequencing, Galaxy (C4); Space; Rickard Sandberg
13.00-14.30  Bioinformatics of microRNA target predictions (L6); D224
15.00-16.30  Statistical issues with genome-wide experiments (L7); D224; Yudi Pawitan, Dept of Medical Epidemiology and Biostatistics

Friday, March 19
09.00-11.00  Project work
12.00-15.00  Project presentations (20 min per group)
15.00-16.00  Wrap up and course evaluation; Rickard Sandberg

Examination
Project in groups of 2 (or 3). Form groups today!
Apply bioinformatics resources to gather all information possible about your gene of interest.
Save all information in a wiki.
Each group will present their project at the end of the course.
Each group member is expected to participate in the presentation.
Examination date: 19 March 2010 (Friday)

Getting to know you better
Name
Department
Areas of research
Bioinformatics resources currently using
Expectations of the course

Introduction to bioinformatics Helena Storvall Department of Cell and Molecular Biology Karolinska Institutet Stockholm, Sweden

Overview
What is bioinformatics?
Why is it important?
Uses of bioinformatics
Example problems
Databases and tools
Which databases solve the problem?
Take home message
Goals of the course

What is bioinformatics?
Bioinformatics is the use of computer technology to manage, analyze and understand biological information:
  Storage and sharing (databases)
  Computations and statistics
  Visualization of data
  Simulations
  Comparisons of data

Why is it important?
Data is in abundance:
  Genome assemblies
  Expression data
  Protein sequence and structure
Challenges:
  Storing the data
  Visualizing data
  Translating it into knowledge!

Sequence data
The amount of sequencing data is increasing exponentially:
  1988: ~20 000 sequences
  1998: ~3 million sequences
  2008: ~99 million sequences
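As a rough check on the growth figures above, the following sketch estimates the doubling time implied by each decade. It assumes smooth exponential growth between the quoted counts, which is an assumption: the slide gives only three approximate data points.

```python
import math

# Approximate sequence counts quoted on the slide.
counts = {1988: 20_000, 1998: 3_000_000, 2008: 99_000_000}


def doubling_time_years(n_start, n_end, years):
    """Doubling time, assuming constant exponential growth over the interval."""
    return years * math.log(2) / math.log(n_end / n_start)


years = sorted(counts)
doubling_times = {}
for start, end in zip(years, years[1:]):
    t = doubling_time_years(counts[start], counts[end], end - start)
    doubling_times[(start, end)] = t
    fold = counts[end] / counts[start]
    print(f"{start}-{end}: {fold:.0f}-fold increase, "
          f"doubling roughly every {t:.1f} years")
```

Under this assumption the data volume doubled roughly every one to two years in both decades, which is what makes storage and analysis a moving target.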

Examples of uses
De novo genome assembly: revolutionized by next generation sequencing
Transcriptomics: genome-wide expression measurements
Alignments: structure and function prediction, heritage
Protein folding simulation: folding@home, Blue Gene
QSAR: quantitative structure-activity relationship

Genome-wide mindset
HeLa cells transfected with microRNA, expression measured by microarray
Is downregulation due to direct interaction or a secondary effect?
Simple approach: search for sequence complementarity to the miRNA
Bioinformatics approach: search for enriched sequence motifs
Lim et al. Nature 2005
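The idea of searching for enriched motifs can be sketched as follows: derive the site complementary to the miRNA seed (positions 2-8), then compare how often it occurs in the downregulated 3' UTRs versus a background set. The miRNA sequence and all UTR sequences below are made-up toy data, and the plain ratio of fractions stands in for the statistical test a real analysis would need; this is not the procedure used by Lim et al.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def seed_site(mirna):
    """Target site (DNA alphabet) complementary to miRNA positions 2-8."""
    seed = mirna.upper().replace("U", "T")[1:8]
    return reverse_complement(seed)


def fraction_with_site(utrs, site):
    """Fraction of sequences containing the site at least once."""
    return sum(site in utr for utr in utrs) / len(utrs)


# Hypothetical miRNA and toy 3' UTRs, for illustration only.
mirna = "UACGGAUCCAGUUCGAUAGCA"
downregulated = ["AAGATCCGTTT", "CCGATCCGTAA", "TTTTGATCCGT", "ACACACACAC"]
background = ["ACACACAC", "GGGATCCGTA", "TTTTTTTT", "CCCCCCCC",
              "AGAGAGAG", "TGTGTGTG", "CACACACA", "GTGTGTGT"]

site = seed_site(mirna)
frac_down = fraction_with_site(downregulated, site)
frac_bg = fraction_with_site(background, site)
print(f"site {site}: {frac_down:.2f} of downregulated vs "
      f"{frac_bg:.3f} of background UTRs")
```

A site that is clearly more frequent among downregulated transcripts than in the background points towards direct targeting rather than a secondary effect.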

Scenarios

What kind of data is out there?
Others: OMIM, PDB, Pfam, UniProt

Sequence databases are synchronized

Entrez
Cross-database search in NCBI resources
Results include:
  PubMed
  Entrez Gene
  RefSeq
  OMIM
  Protein sequence
  Protein structure
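Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds ESearch URLs for a few Entrez databases and sends no request. The search term "TP53" is an arbitrary placeholder, and the helper name `esearch_url` is made up for this example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=5):
    """Build an ESearch URL for one Entrez database (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# One query, several Entrez databases: the cross-database idea in miniature.
databases = ("pubmed", "gene", "omim", "protein", "structure")
urls = {db: esearch_url(db, "TP53") for db in databases}

for db, url in urls.items():
    print(db, url)
```

Opening one of these URLs in a browser returns the matching record identifiers for that database; NCBI's usage guidelines on request rates apply to any automated use.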

Entrez Gene
Focuses on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.
Content of Entrez Gene is drawn from:
  RefSeq
  collaborating model organism databases
  many other databases available from NCBI

Gene annotations
Annotation = descriptive summary
Gene annotations encompass:
  Genomic position, strand information
  Intron-exon boundaries
  Gene name
  Isoforms
RefSeq: manually curated
Ensembl gene set and UCSC known genes: automatic annotations
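To make the listed annotation fields concrete, here is a minimal sketch of how one transcript annotation could be represented, with intron boundaries derived from the exon coordinates. The class, the gene name and the coordinates are hypothetical, and 0-based, end-exclusive coordinates are an assumption of this example; databases differ in their conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    gene_name: str
    chrom: str
    strand: str   # "+" or "-"
    exons: tuple  # (start, end) pairs, 0-based, end-exclusive

    def introns(self):
        """Intron intervals: the gaps between consecutive exons."""
        ordered = sorted(self.exons)
        return [(prev_end, next_start)
                for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]

    def spliced_length(self):
        """Length of the mature transcript (sum of exon lengths)."""
        return sum(end - start for start, end in self.exons)


# Hypothetical isoform of a made-up gene.
isoform = Transcript("GENE1", "chr1", "-",
                     ((100, 200), (300, 450), (600, 700)))
print(isoform.introns())
print(isoform.spliced_length())
```

Several isoforms of the same gene would simply be several such records sharing a gene name but differing in their exon lists.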

RefSeq
RefSeq represents the NCBI curated reference sequences.
It is manually curated and contains useful annotations.
RefSeq entries are genomic, mRNA or protein sequences.
All RefSeq sequences are assembled or taken from data deposited in GenBank.
Not all sequences are in RefSeq.

Ensembl gene set and UCSC known genes
UniProt, RefSeq
Automatically annotated
Contain predicted genes
Contain more non-protein-coding genes

Gene Ontology
GO describes how gene products behave in a cellular context.
Three organizing principles:
  Molecular function: describes activities, such as catalytic or binding activities, at the molecular level.
  Biological process: involvement in a multistep process, e.g. signal transduction, cell physiological process.
  Cellular component: what the gene product is localized to or is a subcomponent of, e.g. localized to the nucleus, subcomponent of the ribosome.

OMIM
OMIM = Online Mendelian Inheritance in Man
Summaries of human gene function
Manually curated
Focused on the relationship between genotype and phenotype
Originally focused on human disease, now encompasses all kinds of genes
Good place to start searching for information about a gene

Alignment tools
Alignment = match your sequence to known sequences
BLAST: Basic Local Alignment Search Tool
  Nucleotide, protein, translated nucleotides
  Maps against all known sequences
  Inheritance: maps to several organisms
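To illustrate what "local alignment" means, the sketch below implements Smith-Waterman, the exact dynamic-programming method for local alignment. This is not how BLAST works internally: BLAST uses a faster heuristic to find similar local matches. The scoring values (match 2, mismatch -1, gap -2) and the two sequences are arbitrary choices for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")
print(result)
```

Only the shared core of the two sequences is reported; the unrelated flanks are ignored, which is the defining property of a local alignment.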

Alignment tools
BLAT: BLAST-Like Alignment Tool
  Faster than BLAST
  Simultaneous queries
  Maps only to a genome assembly
  Only one organism at a time
  Might miss divergent or short alignments
  Connected to the UCSC genome browser

Genome browsers
UCSC genome browser and Ensembl
Collect genomic information:
  Alignments to the genome
  Expression data
  Different isoforms
  Position on the genome
Provide a comprehensive visualization of this collection
Among the most important tools in bioinformatics!

UCSC genome browser

Protein annotations
UniProt
  Protein sequence and annotation data
  Merge between Swiss-Prot and PIR
  Both reviewed and unreviewed proteins
Reviewed proteins = Swiss-Prot
  Manually curated
  Brings together experimental results, computed features and scientific conclusions
Unreviewed proteins = TrEMBL (Translated EMBL)
  Contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated in Swiss-Prot
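The "translation of a coding sequence" that TrEMBL entries are based on can be illustrated with a minimal sketch using the standard genetic code. The function name and the example CDS are made up, and real pipelines also handle alternative genetic codes, ambiguous bases and partial sequences, which this sketch does not.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds):
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up coding sequence: start codon, two codons, stop codon.
protein = translate_cds("ATGGCCAAGTGA")
print(protein)
```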

UniProt

Pfam
Pfam: Protein Family database
Collection of protein domain families
  Pfam-A: built from UniProt
  Pfam-B: unannotated, automatically generated
Pfam entries are classified in one of four ways:
  Family: a collection of related proteins
  Domain: a structural unit which can be found in multiple protein contexts
  Repeat: a short unit which is unstable in isolation but forms a stable structure when multiple copies are present
  Motif: a short unit found outside globular domains

Other protein tools
PDB (Protein Data Bank): protein structures (NMR, X-ray crystallography)
STRING: protein-protein interactions
EMBOSS pepinfo: physico-chemical properties of a protein
  Hydrophobicity, polarity, charge
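A hydrophobicity plot of the kind such tools produce can be approximated with a sliding-window average over a hydropathy scale. The sketch below uses the Kyte-Doolittle scale, one commonly used scale; the window size of 5 and the toy peptide are arbitrary, and no claim is made that this reproduces pepinfo's output or defaults.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=5):
    """Mean hydropathy for every window of the given size along the protein."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Toy peptide: a hydrophobic stretch followed by an acidic one.
profile = hydropathy_profile("LLLLLDDDDD")
print([round(v, 2) for v in profile])
```

Stretches with consistently high values are candidates for membrane-spanning or buried regions; negative values suggest surface-exposed segments.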

Pathways
KEGG: biochemical pathways
BioCarta: intracellular signaling pathways

Functional enrichment in gene sets
DAVID functional annotation tool
DAVID = Database for Annotation, Visualization and Integrated Discovery
Screens both Gene Ontology and pathways
Searches for enrichment of functional features
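What "enrichment" means statistically can be sketched with a one-sided hypergeometric test: how likely is it to see at least this many genes with a given annotation in a gene list drawn at random? This is a generic textbook test, not necessarily the statistic DAVID itself reports, and all gene counts below are invented for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the annotation."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(comb(annotated, k) * comb(population - annotated, sample - k)
                     for k in range(hits, upper + 1))
    return favourable / total


# Invented numbers: 20 genes measured, 5 carry the annotation,
# and all 5 genes in the list of interest carry it.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=5)
print(f"p = {p_value:.2e}")
```

When many annotation terms are screened at once, the resulting p-values need a multiple-testing correction before they are interpreted.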

Expression patterns
Antibody-based data:
  Human Protein Atlas
  Mamep: mouse development
  Allen Brain Map: mouse and human brain
Sequencing and array data:
  Gene Expression Omnibus
  ArrayExpress

Scenario 1
Entrez > OMIM, PubMed
UCSC genome browser
UCSC genome browser
BLAST
OMIM
BLAST > OMIM

Scenario 2
BLAT, UCSC genome browser
BLAST
EMBOSS pepinfo
UniProt
Pfam

Scenario 3
STRING
Gene Ontology
KEGG
BioCarta
PDB

Take home messages
Bioinformatics is needed to translate data into knowledge
A genome-wide approach gives a broader result
There are many tools and databases out there; this lecture only covers a selection
There are several ways to solve a problem: find your own preference

Goals of the course
After taking this course, you will:
  know about the most commonly used bioinformatic databases
  have a better understanding of how they work
  know how to find and use genomic data and genome-wide datasets, such as transcriptomes
  see how bioinformatics can be part of your own research projects