Introduction and Public Sequence Databases. BME 110/BIOL 181 CompBio Tools


Introduction and Public Sequence Databases BME 110/BIOL 181 CompBio Tools Todd Lowe March 29, 2011

Admin
- Course syllabus: http://www.soe.ucsc.edu/classes/bme110/spring11
- Reading: Chapters 1, 2 (pp. 29-56), & 14
- Notes available in PDF format on-line (see class calendar page): http://www.soe.ucsc.edu/classes/bme110/spring11/bme110-calendar.htm
- Study Sections: Mon 11am, JBE Rm 373, and Wed 3:30pm, Soc Sci 2, Rm 165
- TA: John Archie (jarchie@soe.ucsc.edu)

Class Objectives
- Use bioinformatics to investigate biological problems
- Identify and use bioinformatics resources online
  - Databases
  - Remote analysis servers
  - Tools for use on your computer
- Understand bioinformatics methods in sufficient detail to
  - select the proper tool and settings to solve problems effectively
  - interpret the significance of results
- Generate quality figures to illustrate biological hypotheses or conclusions
- Appreciate future impacts of genomics on you and society

Today's topics
- Basic rules for doing bioinformatics on the web
- Types of Databases
- Searching PubMed for scientific papers
- Sequencing Technologies
- Basic tools for genomics
- Genome Databases

Bioinformatics on the Web
Golden Rules:
- Use published databases and methods
  - Supported and maintained
  - Trusted by community
- Document what you've done
  - Sequence identification numbers
  - Server, database, program versions
  - Program parameters
- Assess reliability of results
  - Understand and use reported confidence measures
  - Compare results of multiple servers
  - Do results support/conflict with other available data?
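One possible way to follow the "document what you've done" rule is to keep a small structured log entry for every analysis. The sketch below is only an illustration: the class name, field names, and all example values are hypothetical placeholders, not part of any course tool or server.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AnalysisRecord:
    """One entry in a hypothetical analysis log (field names are illustrative)."""
    sequence_ids: list        # sequence identification numbers used as input
    server: str               # web server where the analysis was run
    database: str             # database that was searched
    program: str              # program name
    program_version: str      # version as reported by the server
    parameters: dict = field(default_factory=dict)  # non-default program settings
    date_run: str = ""        # date of the run, since databases change over time


def to_log_line(record):
    """Serialize a record as one JSON line, suitable for appending to a notes file."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values only; replace them with what the server actually reports.
record = AnalysisRecord(
    sequence_ids=["EXAMPLE_ID_1"],
    server="www.ncbi.nlm.nih.gov",
    database="example nucleotide database",
    program="example search program",
    program_version="version shown on the results page",
    parameters={"expect_threshold": 10},
    date_run="2011-03-29",
)
print(to_log_line(record))
```

Writing one line per analysis keeps the log easy to search later, and recording the run date makes it possible to explain why a repeated search returns different hits.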

Most Basic Database: Sequence Repositories
- Three major sequence repositories
  - NCBI: National Center for Biotechnology Information, www.ncbi.nlm.nih.gov
  - EMBL-EBI: European Bioinformatics Institute, www.ebi.ac.uk
  - DDBJ: DNA Data Bank of Japan, www.ddbj.nig.ac.jp
- Identical sequence information in all three for the primary databases, but in slightly different file formats (e.g., GenBank at NCBI, EMBL-Bank at EBI)
- Different tools and web interfaces for searching and retrieval
- Some other specialty databases are unique to each site

Major NCBI Databases
- GenBank / Entrez: nucleotide / protein search
- PubMed / Medline: journal publication search
- Genomes: full genomes, info & sequence
- Taxbrowser: full species taxonomy
- GEO: Gene Expression Omnibus
- Many other smaller databases

NAR Database Index
- Great collation of biological databases: the Nucleic Acids Research Database Issue
  - Alphabetical: http://www.oxfordjournals.org/nar/database/a/
  - By category: http://www.oxfordjournals.org/nar/database/c/
- 1330 databases (!!) (96 new since last year)
- Sorted alphabetically & by category
- Books date quickly; use on-line collections like this to get the most current information

Literature: PubMed @ NCBI
- What: access to Medline
  - Primarily biomedical, molecular biology & biochemistry journals
- Searching
  - Entrez search engine
  - Logical operators
  - Field delimiters
  - "What's related"
  - PMIDs
- Returns article title, authors, & abstract
- PubMed Central: free access to many full-text scientific papers
- UCSC Electronic Journal access: special procedure for getting access to journal articles off-campus (http://oca.ucsc.edu/login)
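The logical operators and field delimiters listed above can be combined into a single Entrez query string. The following is a minimal sketch of how such a query might be built programmatically. It assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned in the slides; the helper function names and the example search terms are hypothetical. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption; the slides use the web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field_tag):
    """Attach a field delimiter such as [Title] or [Author] to a search term."""
    if " " in text:
        text = f'"{text}"'  # quote multi-word phrases so they stay together
    return f"{text}[{field_tag}]"


def combine(operator, *terms):
    """Join terms with a logical operator: AND, OR, or NOT."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(query, max_results=20):
    """Build (but do not send) a PubMed search URL for the given query."""
    params = {"db": "pubmed", "term": query, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


# Illustrative terms only: a title phrase combined with an author name.
query = combine(
    "AND",
    field_term("genome assembly", "Title"),
    field_term("Kent WJ", "Author"),
)
print(query)
print(esearch_url(query))
```

The same query string can be pasted directly into the PubMed search box, which is a quick way to check that the operators and field delimiters behave as intended.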

Genes & Genomes

Three Domains of Life Credit: http://scienceblogs.com/clock/2007/01/current_biological_diversity.php

What is a Genome?
- A complete set of instructions for life encoded in DNA
- Organized in chromosomes
  - prokaryotes generally have one main circular chromosome
  - eukaryotes have multiple linear chromosomes
- Instructions are generally in the form of genes, which are the units of heredity

Why Sequence a Genome?
- We wish to understand how the entire cell / organism works
  - thousands of complex gene interactions
  - a complete parts list is the first step to understanding how parts work together as a whole
- Economy of scale: faster and cheaper to sequence all genes at once than one at a time by many different researchers

First Fully Sequenced Free-Living Organism: Haemophilus influenzae
- The Institute for Genomic Research (TIGR), 1995

Human Genome Project
- In the 1980s, initial discussion to sequence the human genome (first key meeting here at UCSC!)
- Began: 1990; planned finish: 2005
- Original estimates:
  - ~100,000 genes
  - 3 billion nucleotides, projected cost $300 Million
  - less than 5% of genome codes for genes
- Early opposition to sequencing: "95% junk", taking money away from basic research
- Final results:
  - Draft completed June 2000, finished April 2003
  - Final cost ~$3 billion
  - ~25,000-30,000 genes
  - 5% of genome is conserved, likely important
  - Up to 50% of genome is transcribed with uncertain biological function

Preparing for the Human Sequencing
- Mapping the human genome
- Practice: sequencing model organisms

  Organism                        Size (Mb)   Finished
  Baker's yeast (S. cerevisiae)   12.5        1996
  E. coli                         4.5         1997
  Roundworm (C. elegans)          100         1998
  Fruit fly (D. melanogaster)     160         2000
  Plant weed (A. thaliana)        120         2000

Sequencing Genomes: Strategy #1
- Original "Top-Down" strategy
- Deliberate, small chance for major errors

Sequencing Genomes: Strategy #2
- WGS (Whole Genome Shotgun sequencing)
- "Bottom-Up" strategy
- No ordering of a clone library: straight to sequencing to build scaffolds
- Current method used for genome sequencing
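To give a feel for what "bottom-up" assembly means, here is a toy greedy overlap-merge sketch. It is not GigAssembler or any production assembler: it assumes error-free reads from one strand and a sequence with no repeats, and it ignores clone pairing and quality scores. The function names and the example reads are made up for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that matches a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(seqs):
    """Remove exact duplicates and any sequence fully contained in another."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = drop_contained(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge; each item is one contig
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


# Three unordered, overlapping reads from a made-up 20-base sequence.
reads = ["TTAGCCTAGGA", "ATGCGTACG", "GTACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGCCTAGGA']
```

Real genomes contain repeats longer than a read, which is where this greedy approach breaks down and why scaffolding information and far more careful algorithms are needed in practice.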

Race to the Finish: June 2000
- A new private company, Celera, announced it would use strategy #2 to beat the public, international effort to sequence the human genome, and planned to patent as many human genes as possible
- The public team effort changed strategy from #1 to a mixture of #1 and #2
- The change in strategy meant ordered clone libraries would not be finished, and an alternate method was needed for putting the pieces together (quickly)
- UC Santa Cruz to the rescue! David Haussler & Jim Kent, in just four weeks, created a clever program (GigAssembler) to run on a network of 100 Dell desktop computers to assemble the genome by Celera's target date in June
- The whole story: http://www.nytimes.com/2001/02/13/health/13hero.html

Sequencing Technologies
- Sanger sequencing
  - Capillary sequencing: ABI 3700 and MegaBACE machines allowed the first rapid automated sequencing in tiny capillary tubes
  - 600-800 bases at a time x 384 wells = ~268,000 bases decoded / run
  - Bulk of the original human genome was decoded on these
- NextGen technologies (massively parallel, shorter reads)
  - 454 / Roche sequencers: 100-500 bases, 400-600 Million bases / run
  - Illumina Genome Analyzer: ~2x100 bases, 150-200 Billion bases / run
  - ABI SOLiD sequencing: ~2x50 bases each, 80-100 Billion bases / run
- Chemical decoding is no longer the major cost (10 Million bp / cent)
- Re-assembling all the pieces with computer programs and interpretation (bioinformatics) is now the major cost / bottleneck
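The per-run figure for capillary sequencing above is simple arithmetic, and the same arithmetic shows why these machines needed many thousands of runs for a human-sized genome. The sketch below reproduces it using the slide's numbers. The function names are illustrative, and the 1x coverage figure is only a lower bound, since real projects sequence each base many times over.

```python
def bases_per_run(read_length, reads_per_run):
    """Total bases decoded in one run: read length times number of reads."""
    return read_length * reads_per_run


def runs_needed(genome_size, run_yield, coverage=1):
    """Whole number of runs needed to reach the given integer fold coverage."""
    total_bases = genome_size * coverage
    return -(-total_bases // run_yield)  # ceiling division on integers


def raw_decoding_cost_dollars(bases, bp_per_cent=10_000_000):
    """Chemical decoding cost at the slide's quoted rate of 10 million bp per cent."""
    return bases / bp_per_cent / 100


HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion nucleotides

low = bases_per_run(600, 384)   # 230,400
mid = bases_per_run(700, 384)   # 268,800, the slide's ~268,000
high = bases_per_run(800, 384)  # 307,200

print(low, mid, high)
print(runs_needed(HUMAN_GENOME_BP, mid))           # capillary runs for 1x coverage
print(raw_decoding_cost_dollars(HUMAN_GENOME_BP))  # dollars for 1x at the quoted rate
```

At the quoted NextGen rate the raw decoding of one genome's worth of bases costs only a few dollars, which is the point of the last two bullets: the expense has shifted to assembly and interpretation.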

Genome Challenges
- $1,000 Genome Prize: goal set in 2003 by the J. Craig Venter Science Foundation to make personalized medicine available to all ($500,000 prize)
- X Prize: build a machine that can sequence 100 people in 10 days or less, for $10,000/genome or less ($10 Million prize): http://genomics.xprize.org/

Plummeting Cost of Human Genome Sequencing Source: Wetterstrand KA http://www.genome.gov/sequencingcosts/

More than a Dozen Human Genomes Decoded, and Counting
- In 2010, more than a dozen human genomes had been completed, including those of James Watson and J. Craig Venter
- Five new genomes were published in March 2010, focusing on people with genetic diseases: http://www.nytimes.com/2010/03/11/health/research/11gene.html
- Perspective on the human genome, 10 years later: http://www.nature.com/nature/journal/v464/n7289/full/464649a.html (Nature 464, 649-650, 1 April 2010)

The Cancer Genome Atlas (TCGA) Project
- The Cancer Genome Atlas (TCGA) is a project, started in 2005, to catalogue the genetic mutations responsible for cancer using genome analysis techniques
- TCGA represents an effort in the War on Cancer that applies recently developed high-throughput genome analysis techniques and seeks to improve our ability to diagnose, treat, and prevent cancer
- Includes sequencing selected genes, measuring gene expression, and decoding the entire genomes of a portion of the 500+ patient tissue samples (20-25 different tumor types)

The Cancer Genome Atlas (TCGA) Project
- TCGA is the first large-scale genomics project funded by the NIH to devote significant resources to bioinformatic discovery
- The NCI has devoted 50% of TCGA appropriated funds, approximately $12M/year, to fund bioinformatic discovery
- Four centers, called Genome Data Analysis Centers (GDACs), are responsible for the biological interpretation of TCGA data: University of California at Santa Cruz, MD Anderson Cancer Center, Memorial Sloan Kettering Cancer Center, and The Institute for Systems Biology