Bioinformatics overview

Biomedical applications in high-performance computing platforms
Oswaldo Trelles, PhD, University of Malaga

In this section we survey the bioinformatics application domain and the typical sources of data in the field.

Definition
Bioinformatics: the application of computational techniques to the management and analysis of biological data.
Computational techniques: computer science, statistics, physics, chemistry, ..., information technologies.
Management and analysis: acquisition, storage, retrieval, transmission, processing, ...
Biological data: molecular, clinical, population, environmental, ...

The domain of the data

Data production
Huge amounts of data are produced at different levels: atoms, proteins, interactions, metabolic pathways, cells, organs, organisms, populations.

Diversity of types of data

> E01306 229 bp DNA linear
gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt

Format heterogeneity

The DNA encoding human insulin-like growth factor I (IGF-I), available at GenBank as E01306.1 (http://www.ncbi.nlm.nih.gov/, search for "insulin" in All databases):

LOCUS       E01306 229 bp DNA linear PAT 04-NOV-2005
DEFINITION  DNA encoding human insulin-like growth factor I(IGF-I).
ACCESSION   E01306
VERSION     E01306.1 GI:2169565
KEYWORDS    JP 1987190088-A/1.
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1 (bases 1 to 229)
  AUTHORS   Raasu,A., Toomasu,M., Berun,N. and Majiasu,U.
  TITLE     METHOD FOR TRANSPORTING GENE PRODUCT TO MEDIUM PROPAGATING GRAM
            NEGATIVE BACTERIA
  JOURNAL   Patent: JP 1987190088-A 1 20-AUG-1987;
            KABIGEN AB
COMMENT     OS Artificial gene
            OC Artificial sequence; Genes.
            OS Homo sapiens
            PN JP 1987190088-A/1
            PD 20-AUG-1987
            CC strandedness: Single;
            CC topology: Linear;
            CC hypothetical: No;
            CC anti-sense: No;
            FH Key Location/Qualifiers
            FT /product='human insuline-like growth factor I
            FT CDS >2..223
FEATURES    Location/Qualifiers
     source 1..229
            /organism="synthetic construct"
            /mol_type="unassigned DNA"
            /db_xref="taxon:32630"
ORIGIN
        1 gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt
       61 ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg
      121 tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct
      181 ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt
//

The same sequence (E01306) at www.ebi.ac.uk:

ID   E01306; SV 1; linear; unassigned DNA; PAT; SYN; 229 BP.
AC   E01306;
DT   07-OCT-1997 (Rel. 52, Created)
DT   09-NOV-2005 (Rel. 85, Last updated, Version 3)
DE   DNA encoding human insulin-like growth factor I(IGF-I).
KW   JP 1987190088-A/1.
OS   synthetic construct
OC   other sequences; artificial sequences.
RA   Raasu A., Toomasu M., Berun N., Majiasu U.;
RT   "METHOD FOR TRANSPORTING GENE PRODUCT TO MEDIUM PROPAGATING GRAM
RT   NEGATIVE BACTERIA";
RL   Patent number JP1987190088-A/1, 20-AUG-1987.
RL   KABIGEN AB.
CC   OS Artificial gene
CC   OC Artificial sequence; Genes.
CC   OS Homo sapiens
CC   CC strandedness: Single;
CC   CC topology: Linear;
CC   CC hypothetical: No;
CC   CC anti-sense: No;
CC   FH Key Location/Qualifiers
CC   FT mat_peptide 11..220
CC   FT CDS >2..223
CC   FT /product="human insulin-like growth factor I"
FH   Key Location/Qualifiers
FT   source 1..229
FT   /organism="synthetic construct"
FT   /mol_type="unassigned DNA"
FT   /db_xref="taxon:32630"
SQ   Sequence 229 BP; 40 A; 57 C; 55 G; 77 T; 0 other;
     gaattctaac ggtcccgaaa ctctgtgcgg tgctgaactg gttgacgctc tgcagtttgt 60
     ttgcggtgac cgtggttttt attttaacaa acccactggt tatggttctt cttctcgtcg 120
     tgctccccag actggtattg ttgacgaatg ctgctttcgt tcttgcgacc tgcgtcgtct 180
     ggaaatgtat tgcgctcccc tgaaacccgc taaatctgct tagaagctt 229
//

(In both records some lines have been removed.)
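One common way to cope with this heterogeneity is to map the format-specific tags of each flat file onto a shared set of field names. The Python sketch below is a minimal illustration under stated assumptions, not a complete parser: the names `FIELD_MAP` and `normalize` are hypothetical, the mapping covers only four of the line types visible in the two records, and continuation lines are ignored.

```python
# Hypothetical mapping from format-specific tags to shared field names.
# Only a handful of the tags visible in the two records are covered.
FIELD_MAP = {
    "genbank": {"ACCESSION": "accession", "DEFINITION": "description",
                "KEYWORDS": "keywords", "SOURCE": "organism"},
    "embl": {"AC": "accession", "DE": "description",
             "KW": "keywords", "OS": "organism"},
}


def normalize(record_text, fmt):
    """Return a dict of shared field names for one flat-file record."""
    tags = FIELD_MAP[fmt]
    fields = {}
    for line in record_text.splitlines():
        parts = line.split(None, 1)  # leading tag, then the rest of the line
        if len(parts) == 2 and parts[0] in tags:
            value = parts[1].strip().rstrip(";.")
            # Keep only the first occurrence of each field.
            fields.setdefault(tags[parts[0]], value)
    return fields


genbank = """LOCUS       E01306 229 bp DNA linear PAT 04-NOV-2005
DEFINITION  DNA encoding human insulin-like growth factor I(IGF-I).
ACCESSION   E01306
KEYWORDS    JP 1987190088-A/1.
SOURCE      synthetic construct"""

embl = """ID   E01306; SV 1; linear; unassigned DNA; PAT; SYN; 229 BP.
AC   E01306;
DE   DNA encoding human insulin-like growth factor I(IGF-I).
KW   JP 1987190088-A/1.
OS   synthetic construct"""

gb_fields = normalize(genbank, "genbank")
embl_fields = normalize(embl, "embl")
print(gb_fields == embl_fields)  # both layouts carry the same information
```

Once both records are reduced to the same field names they can be compared or merged directly, which is the kind of step a workflow across several databases typically needs.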

Dispersion of data sources
More than 1000 biological data collections [1].
Bioinformatics: a web-based domain.
Bioinformatics workflows: the usual way to work.
See: [1] Infobiogen: Catalog of Databases: http://www.infobiogen.fr/services/dbcat

Types of data and applications (overview)

Sequencing data
The long DNA chain is split into small fragments that are read using sequencing technology.
Read: a short sequence obtained during the sequencing process.
Software is used to obtain contigs, scaffolds, and a consensus sequence.
Reading in annotation data from a GFF file.
Assigning aligned reads to exons and genes.
Biologically intelligent interpretation of genomic data.

FASTA and FASTQ formats (the reads below are in FASTA format):

>000014_1863_0292 length=76 uaccno=fgsmdpn08etuie
AATACTCAGGAATCGAACGGACTCGGGTATAGTATATGATCGGCAGCCAGCCG
AACATAACAGCGGCATGAAAACC
>000016_1821_0619 length=120 uaccno=fgsmdpn08ep50t
GGCAAGTTTTCGGTGTCGCTAAGCCCGAGATATCGCAGCTCACCCGTGTCGGC
GATTGCTGCTGTGACCGTCCCCAGTCGGTCACCCTCCGGCTGATTCTATCCTTACATCGG
TCGTTTC
>000021_1845_1786 length=69 uaccno=fgsmdpn08esarw
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
>000030_1849_0863 length=69 uaccno=fgsmdpn08esmpd
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
>000035_1856_0283 length=148 uaccno=fgsmdpn08es8dp
GACGCCCTTTATGCACGTTTCGCTCACAGTATCCCTTAATAGCAAGATTAATA
CCCTCAGTGGCCCCACTAGTAAAAACGATCTCTCGAGAACGACAGTTCAGTTC
ATTGGCAATCAATTTTCGGGCCGTTTCTTACCGCCTCCTCAG
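Reads in FASTA format like the ones above can be loaded with a few lines of code. The following Python sketch is an illustration only: `parse_fasta` and `_record` are hypothetical helper names, and the `length=` attribute it checks is specific to these read headers rather than part of the FASTA format itself.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The identifier is the first word of the header line.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


reads = """>000014_1863_0292 length=76 uaccno=fgsmdpn08etuie
AATACTCAGGAATCGAACGGACTCGGGTATAGTATATGATCGGCAGCCAGCCG
AACATAACAGCGGCATGAAAACC
>000021_1845_1786 length=69 uaccno=fgsmdpn08esarw
ATCCGCGCGGCCGCATTGTCGACACTGCCTGCCGGCAGTGAAGGCGAGGCGCA
GGTGGCCGATGCGCTG
"""

records = list(parse_fasta(reads))
for identifier, description, sequence in records:
    # Compare the declared length in the header with the actual length.
    declared = int(description.split("length=")[1].split()[0])
    print(identifier, declared, len(sequence))
```

For both reads the declared and observed lengths agree (76 and 69 bases).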

Assembling the puzzle
From spectrograms to a sequence of letters.
An exhaustive and resource-consuming procedure is needed to assemble the fragments into longer contigs... the sequence is coming up.
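To give a feel for why assembly is expensive, here is a toy greedy overlap assembler in Python. It is a deliberately simplified sketch, not the procedure used by production assemblers: it assumes error-free reads from a single strand with no repeats, compares every pair of reads on every pass, and the names `overlap` and `greedy_assemble` and the `min_len` threshold are illustrative choices.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


# Three overlapping fragments of a made-up 20-base sequence.
fragments = ["TTCAGTACCA", "GATTACAGGC", "CAGGCTTCAG"]
print(greedy_assemble(fragments))
```

Even this toy version performs an all-against-all comparison on each pass, which hints at why assembling millions of real reads is resource consuming.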

Biological sequence data

>ref NT_033779.4 :1-23011544 Drosophila melanogaster chromosome 2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGG
GAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAA
TGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTT
AAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCG
TTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAC
TGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGAT
ATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTC
AAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAA
GACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAAC
CCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATAT
TATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAAC
CCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGC
AACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGC
CTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGA
GATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATAT
AATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGA
ACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAG
TAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAAT
ATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGT
ATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACC
CAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCG
TATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACC
GCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTC
CCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTT
GGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATT
CATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAAT
GAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGC
AGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCT
CTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGA
AGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATG
CGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
GATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGA
TCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTAT
TTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTT
TCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGA
TTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGT
GCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGA
GAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTG
TGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGAC
AGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCG

From assembly to database entries
FASTA is the preferred format for this type of data (without annotations).
We know the text, but the meaning needs more processing.

Annotated sequence databases

ID   100K_RAT STANDARD; PRT; 889 AA.
AC   Q62671;
DT   01-NOV-1997 (Rel. 35, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
DE   100 KD PROTEIN (EC 6.3.2.-).
OS   Rattus norvegicus (Rat).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=WISTAR; TISSUE=TESTIS;
RX   MEDLINE; 92253337.
RA   MUELLER D., REHBEIN M., BAUMEISTER H., RICHTER D.;
RT   "Molecular characterization of a novel rat protein structurally
RT   related to poly(a) binding proteins and the 70K protein of the U1
RT   small nuclear ribonucleoprotein particle (snrnp).";
RL   Nucleic Acids Res. 20:1471-1475(1992).
RN   [2]
RP   ERRATUM.
RA   MUELLER D., REHBEIN M., BAUMEISTER H., RICHTER D.;
RL   Nucleic Acids Res. 20:2624-2624(1992).
CC   -!- FUNCTION: E3 UBIQUITIN-PROTEIN LIGASE WHICH ACCEPTS UBIQUITIN FROM
CC       AN E2 UBIQUITIN-CONJUGATING ENZYME IN THE FORM OF A THIOESTER AND
CC       THEN DIRECTLY TRANSFERS THE UBIQUITIN TO TARGETED SUBSTRATES (BY
CC       SIMILARITY). THIS PROTEIN MAY BE INVOLVED IN MATURATION AND/OR
CC       POST-TRANSCRIPTIONAL REGULATION OF MRNA.
CC   ----------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through...
CC   ----------------------------------------------------------------------
DR   EMBL; X64411; CAA45756.1; -.
DR   PFAM; PF00632; HECT; 1.
DR   PFAM; PF00658; PABP; 1.
KW   Ubiquitin conjugation; Ligase.
FT   DOMAIN 77 88 ASP/GLU-RICH (ACIDIC).
FT   DOMAIN 127 150 PRO-RICH.
FT   DOMAIN 579 590 ASP/GLU-RICH (ACIDIC).
FT   BINDING 858 858 UBIQUITIN (BY SIMILARITY).
SQ   SEQUENCE 889 AA; 100368 MW; DD7E6C7A CRC32;
     MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
     PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
     VYEYVRKYAE HRMLVVAEQP LHAMRKGLLD VLPKNSLEDL TAEDFRLLVN GCGEVNVQML
     ISFTSFNDES GENAEKLLQF KRWFWSIVER MSMTERQDLV YFWTSSPSLP ASEEGFQPMP
     SITIRPPDDQ HLPTANTCIS RLYVPLYSSK QILKQKLLLA IKTKNFGFV
//
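Because every line of an annotated entry like this starts with a two-letter code, the annotations can be grouped by code with very little logic. The Python sketch below assumes that line layout and uses a shortened excerpt of the entry above; `parse_swissprot` is a hypothetical helper name, and the sketch ignores details such as multi-line continuation rules.

```python
def parse_swissprot(text):
    """Group the lines of one Swiss-Prot-style entry by two-letter code."""
    fields, sequence, in_sequence = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        code, rest = line[:2], line[2:].strip()
        if code == "SQ":
            in_sequence = True  # everything after SQ is sequence data
            fields["SQ"] = [rest]
        elif in_sequence:
            sequence.append(line.replace(" ", ""))
        elif code.strip():
            fields.setdefault(code, []).append(rest)
    return fields, "".join(sequence)


# Excerpt only: the full entry declares 889 residues.
entry = """ID   100K_RAT STANDARD; PRT; 889 AA.
AC   Q62671;
OS   Rattus norvegicus (Rat).
KW   Ubiquitin conjugation; Ligase.
SQ   SEQUENCE 889 AA; 100368 MW; DD7E6C7A CRC32;
     MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
     PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
//"""

fields, sequence = parse_swissprot(entry)
accession = fields["AC"][0].rstrip(";")
print(accession, fields["OS"][0], len(sequence))
```

The dictionary keeps every annotation line under its code, so repeated codes such as DT, DR or FT in a full entry would accumulate as lists.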