Lecture 2 Introduction to Data Formats

Similar documents
Lecture 11 Microarrays and Expression Data

DNA is normally found in pairs, held together by hydrogen bonds between the bases

Basic concepts of molecular biology

Introduction to Bioinformatics for Computer Scientists. Lecture 2

BIOLOGY. Monday 14 Mar 2016

Algorithms in Bioinformatics ONE Transcription Translation

Lecture 7 Motif Databases and Gene Finding

Basic concepts of molecular biology

Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras. Lecture - 5a Protein sequence databases

Protein Bioinformatics Part I: Access to information

Genome Informatics. Systems Biology and the Omics Cascade (Course 2143) Day 3, June 11 th, Kiyoko F. Aoki-Kinoshita

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools

Bioinformatics for Proteomics. Ann Loraine

What is necessary for life?

BIOSTAT516 Statistical Methods in Genetic Epidemiology Autumn 2005 Handout1, prepared by Kathleen Kerr and Stephanie Monks

EE550 Computational Biology

Bi Lecture 3 Loss-of-function (Ch. 4A) Monday, April 8, 13

Bioinformatics. ONE Introduction to Biology. Sami Khuri Department of Computer Science San José State University Biology/CS 123A Fall 2012

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Name. Student ID. Midterm 2, Biology 2020, Kropf 2004

SNVBox Basic User Documentation. Downloading the SNVBox database. The SNVBox database and SNVGet python script are available here:

Chimp Sequence Annotation: Region 2_3

ELE4120 Bioinformatics. Tutorial 5

The University of California, Santa Cruz (UCSC) Genome Browser

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

Genome Resources. Genome Resources. Maj Gen (R) Suhaib Ahmed, HI (M)

11 questions for a total of 120 points

Roadmap. The Cell. Introduction to Molecular Biology. DNA RNA Protein Central dogma Genetic code Gene structure Human Genome

Chapter 2: Access to Information

What is necessary for life?

Nucleic acid and protein Flow of genetic information

Human Biology 175 Lecture Notes: Cells. Section 1 Introduction

Gene-centered resources at NCBI

DNA.notebook March 08, DNA Overview


Will discuss proteins in view of Sequence (I,II) Structure (III) Function (IV) proteins in practice

Two Mark question and Answers

APPENDIX. Appendix. Table of Contents. Ethics Background. Creating Discussion Ground Rules. Amino Acid Abbreviations and Chemistry Resources

Unit 1. DNA and the Genome

Materials Protein synthesis kit. This kit consists of 24 amino acids, 24 transfer RNAs, four messenger RNAs and one ribosome (see below).

A Zero-Knowledge Based Introduction to Biology

BME205: Lecture 2 Bio systems. David Bernick

user s guide Question 1

Biology 644: Bioinformatics

Genomics and Database Mining (HCS 604.3) April 2005

NCBI Molecular Biology Resources. NCBI Resources

THE GENETIC CODE Figure 1: The genetic code showing the codons and their respective amino acids

A Field Guide to GenBank and NCBI Molecular Biology Resources

CECS Introduction to Bioinformatics University of Louisville Spring 2004 Dr. Eric Rouchka

Sequence Databases and database scanning

Lecture 19A. DNA computing

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

BST227 Introduction to Statistical Genetics

Types of Databases - By Scope

Create a model to simulate the process by which a protein is produced, and how a mutation can impact a protein s function.

Just one nucleotide! Exploring the effects of random single nucleotide mutations

Key Concept Translation converts an mrna message into a polypeptide, or protein.

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

user s guide Question 3

Molecular Biology. Biology Review ONE. Protein Factory. Genotype to Phenotype. From DNA to Protein. DNA à RNA à Protein. June 2016

RNA and Protein Synthesis

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Testing the ABC floral-organ identity model: cloning the genes

Connect the dots DNA to DISEASE

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

BIOLOGY LTF DIAGNOSTIC TEST DNA to PROTEIN & BIOTECHNOLOGY

Exploring genomes - A Treasure Hunt on the Internet

ENZYMES AND METABOLIC PATHWAYS

I nternet Resources for Bioinformatics Data and Tools

13.1 RNA Lesson Objectives Contrast RNA and DNA. Explain the process of transcription.

Array-Ready Oligo Set for the Rat Genome Version 3.0

Granby Transcription and Translation Services plc

Deoxyribonucleic Acid DNA. Structure of DNA. Structure of DNA. Nucleotide. Nucleotides 5/13/2013

GREG GIBSON SPENCER V. MUSE

INTRODUCTION TO BIOINFORMATICS. SAINTS GENETICS Ian Bosdet

Data Retrieval from GenBank

Introduction to BIOINFORMATICS

Important gene-information's

Protein Synthesis: Transcription and Translation

Sequence Based Function Annotation

STRUCTURE, DYNAMICS AND INTERACTIONS OF PROTEINS BY NMR SPECTROSCOPY

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

CHAPTER 21 GENOMES AND THEIR EVOLUTION

Introduction to Bioinformatics Online Course: IBT

Testing the ABC floral-organ identity model: cloning the genes

Why learn sequence database searching? Searching Molecular Databases with BLAST

Protein Synthesis. Application Based Questions

Introduction to Bioinformatics. What are the goals of the course? Who is taking this course? Different user needs, different approaches

ab initio and Evidence-Based Gene Finding

Forensic Science: DNA Evidence Unit

FUNCTIONAL BIOINFORMATICS

Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL


The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

Find this material useful? You can help our team to keep this site up and bring you even more content consider donating via the link on our site.

A self- created drawing, collage, or computer generated picture of your creature.

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Introduction to BIOINFORMATICS

Transcription:

Introduction to Bioinformatics for Medical Research Gideon Greenspan gdg@cs.technion.ac.il Lecture 2 Introduction to Data Formats

Introduction to Data Formats Real world, data and formats Sequences and alignments Trees and Pedigrees Motifs and Profiles Annotations and 3D structure Expression and pathways GenBank, SWISS-PROT, GeneCards 2

Data Formats Data: representation of real world Never a perfect copy Data format: computer s representation Computers require formal definitions Almost always represented as text Tools are data translators Input one type of data, output another 3

Nucleotide Sequences GAATCG TACTGT CCATTG CTCAGA ATCGTA CTGTCA T (for DNA) and U (for RNA) interchangeable 4

Reverse Complementation 5 GAATCGTACTGTCCATTGCTCA 3 3 ACTCGTTACCTGTCATGCTAAG 5 5 GAATCGTACTGTCCATTGCTCA 3 5 TGAGCAATGGACAGTACGATTC 3 5

Protein Sequences KGSQEFWPGTSHLEIGVKMDVYFS 6

Amino Acid Codes A Alanine G Gycine P Proline R Arganine H Histadine S Serine N Asparagine I Isoleucine T Threonine D Aspartic acid L Leucine W Tryptophan C Cysteine K Lysine Y Tyrosine Q Glutamine M Methionine V Valine E Glutamic acid F Phenylalanine 7

Multiple Sequences Many ways to encode in a file FASTA format is most common >HSBGPG Human gene for gla protein (BGP) GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCAT GCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGTATAAACA GTGCTGGAGGCTGGCGGGGCAGGCCAGC >HSGLTH1 Human theta 1-globin gene CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAG TAGAGACTAAATACCATATAGTGAACACCTAAGACGGGGGG CCTTGGATCCAGGGCGA 8

Alignments Comparison between two sequences Add gaps to make equal in length Seek the best possible alignment Global or local alignment GAATCGTAC CGACTTCG -GAATCGTAC- CGACT--G-CG 9

Multiple Alignment Comparison between many sequences Many encoding formats ALN is popular Human DSHUCZ M-ATKAVCVLKGDGPVQGII Bovine DSBOCZ --ATKAVCVLKGDGPVQGTI Swordfish SODL V-L-KAVCVLRGAGETTGTV Drosophila DSF V-V-KAVCVING-D-AKGTV Maize SDMZ M-V-KAVAVLAGTD-VKGTI Yeast DSBYC V---QAVAVLKGDAGVSGVV 10

Taxonomy Trees 11

Phylogenetic Trees (raccoon: 19.19959, ((sea_lion: 11.99700, seal: 12.00300): 7.52973, ((monkey: 50.85930, cat: 47.14069): 20.59201, weasel: 18.87953): 2.09460): 3.87382, dog: 25.46154); 12

Pedigrees Pe In M F Ch MS FS Se Pr Markers... 1 1 0 0 3 0 0 1 1 2 0 1 0 1 1 1 2 0 0 3 0 0 2 0 1 1 1 0 1 1 1 3 1 2 7 4 4 1 0 2 1 1 0 0 1 13

Motifs Common short sequence Simple motifs use IUPAC symbols R A or G K G or T B C or G or T Y C or T M A or C D A or G or T S G or C N Any base H A or C or T W A or T V A or C or G Example: CHGW matches CCGA, CTGT, CAGC 14

Motif Profiles Statistical model for a motif Position 1 2 3 4 5 A 0.3 0.1 0 0.1 0.5 C 0 0.4 0.9 0.2 0 G 0.5 0.2 0.1 0 0 T 0.2 0.3 0 0.7 0.5 GTCTA: 0.5 0.3 0.9 0.7 0.5 = 0.0473 TTGCT: 0.2 0.3 0.1 0.2 0.5 = 0.0006 15

Genomic Annotations Physical structure Centromere, telomeres Genes Introns, exons, alternative splice sites Binding sites Transcription factors Variable sites SNPs, repeats 16

Protein Annotations Domains Hydrogen bonds between peptides a-helices, b-sheets Pred: CCCEEEEEEEEEEHHHHHHHHHHEEEEEEEEEECCC AA: PPPILFGLSLSLEVTTFDNLVLARFSVRSVSLDVDT 17

Protein 3D structure x y z 10.982-9.774 1.377 9.623-9.833 1.984 8.913-11.104 1.521 9.187-11.630 0.461 8.814-8.614 1.546 7.372-8.754 2.039 7.339-8.625 3.562 8.370-8.307 4.131 6.284-8.846 4.132 7.998-11.599 2.304 7.266-12.832 1.907 6.096-12.456 1.005 18

Expression Levels normal hot cold uch1-2.0 0.0 0.924 gut2 0.398 0.402-1.329 fip1 0.225 0.225-2.151 msh1 0.676 0.685-0.564 vma2 0.41 0.414-1.285 meu26 0.353 0.286-1.503 git8 0.47 0.47-1.088 sec7b 0.39 0.395-1.358 apn1 0.681 0.636-0.555 wos2 0.902 0.904-0.149 sec1 0.5 0.737-1.0 spf31 1.171 0.946 0.228 slp1 0.378 0.364-1.404 shm2 0.502 0.512-0.994 19

Pathways 20

Data Formats: Summary Real world, data and data formats Nucleotide, protein sequences Sequence comparisons Alignments, motifs, profiles Relationships between entities Phylogenetic trees, pedigrees, pathways Annotations Tools are data translators 21

GenBank DNA/RNA sequence database USA National Institute of Health Synchronized daily with Europe, Japan Founded in 1982 Contains > 20 million sequences Total size > 20 billion base pairs >200 full species genomes Public submission 22

A GenBank Entry (1) Unique IDs Accession is permanent for an entry GenBank Identifier specifies one version Verbal description Organism, gene/source GenBank division, keywords Date, version Submitter, references 23

GenBank Divisions PRI Primate PHG Bacteriophage ROD Rodent SYN Synthetic MAM Other mammal UNA Unannotated VRT Vertebrate EST Expressed seq. tag INV Invertebrate PAT Patented PLN Plant, fungus, algae STS Sequence tag site BCT Bacterial GSS Genome survey VRL Viral HTG High-throughput 24

A GenBank Entry (2) The sequence itself Summary of base pair frequencies Features (at point within sequence) Genes, exons, introns, translation Promoters, binding sites Repeats, stem and loop Variation: RFLPs and SNPs Known sequence tag locations 25

Searching GenBank Structureless and structured All information fields are searchable Limits Only recent sequences Specific database only Exclude drafts, patented, etc Searching by sequence? Using BLAST tools 26

SWISS-PROT Protein sequence database Geneva University and Europe s EBI Some curation to minimize redundancy Founded in 1986 Contains > 120,000 entries Total size > 46 million amino acids TrEMBL translated nucleotide database For all coding sequences in GenBank 27

A SWISS-PROT Entry Core data References Biological source Annotations Function and associated diseases Post-translational modifications Domains and binding sites Secondary, quaternary structure 28

Searching SWISS-PROT Using SRS (sequence retrieval system) European equivalent to Entrez Other search types Retrieve accession or ID Unstructured ( full text ) Structured ( advanced ) Taxonomy browser BLAST to search by sequence 29

GeneCards Human gene database Weizmann Institute in Israel Founded in 1997 Describes > 14,000 known genes > 21,000 predicted genes Also: pseudogenes, gene clusters, etc Data mining approach Data sourced from 36 other databases 30

A GeneCards Entry (1) Aliases and descriptions HUGO standard nomenclature Chromosomal location Links to genome browsers Expression levels Different tissues 31

A GeneCards Entry (2) Available sequences Homologues in other organisms Proteins Protein families Mutations and other variation Related diseases Publications News articles 32