Introduction and Public Sequence Databases. BME 110/BIOL 181 CompBio Tools

Introduction and Public Sequence Databases
BME 110/BIOL 181 CompBio Tools
Todd Lowe
March 29, 2011

Course Syllabus: Admin
  http://www.soe.ucsc.edu/classes/bme110/spring11
  Reading: Chapters 1, 2 (pp. 29-56), & 14
  Notes available in PDF format on-line (see class calendar page): http://www.soe.ucsc.edu/classes/bme110/spring11/bme110-calendar.htm
  Study Sections: Mon 11am, JBE Rm 373, and Wed 3:30pm, Soc Sci 2, Rm 165
  TA: John Archie (jarchie@soe.ucsc.edu)

Class Objectives
  Use bioinformatics to investigate biological problems
  Identify and use bioinformatics resources online
    Databases
    Remote analysis servers
    Tools for use on your computer
  Understand bioinformatics methods in sufficient detail to select the proper tool and settings to solve problems effectively
  Interpret the significance of results
  Generate quality figures to illustrate biological hypotheses or conclusions
  Appreciate future impacts of genomics on you and society

Today's topics
  Basic rules for doing bioinformatics on the web
  Types of Databases
  Searching PubMed for scientific papers
  Sequencing Technologies
  Basic tools for genomics
  Genome Databases

Bioinformatics on the Web
  Golden Rules:
    Use published databases and methods
      Supported and maintained
      Trusted by community
    Document what you've done
      Sequence identification numbers
      Server, database, program versions
      Program parameters
    Assess reliability of results
      Understand and use reported confidence measures
      Compare results of multiple servers
      Do results support/conflict with other available data?
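One way to act on the "document what you've done" rule is to save a small structured record next to each result. The sketch below is only an illustration: the class name, field names, and all values are hypothetical placeholders, not part of the course material or of any server's API.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass
class AnalysisRecord:
    """One documented analysis step (all field names are illustrative)."""
    sequence_ids: list   # sequence identification numbers used as input
    server: str          # where the analysis was run
    database: str        # database searched, with its release or date
    program: str         # program name and version
    parameters: dict = field(default_factory=dict)  # program parameters used
    run_date: str = field(default_factory=lambda: date.today().isoformat())

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside the results."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; substitute whatever you actually used.
record = AnalysisRecord(
    sequence_ids=["EXAMPLE_ACCESSION_1"],
    server="example-server",
    database="example-db (release unknown)",
    program="example-program 1.0",
    parameters={"e_value_cutoff": 1e-5},
    run_date="2011-03-29",
)
print(record.to_json())
```

Keeping the record as JSON makes it easy to compare runs later, for example when checking whether two servers were given the same parameters.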

Most Basic Database: Sequence Repositories
  Three major sequence repositories
    NCBI: National Center for Biotechnology Information, www.ncbi.nlm.nih.gov
    EMBL-EBI: European Bioinformatics Institute, www.ebi.ac.uk
    DDBJ: DNA Data Bank of Japan, www.ddbj.nig.ac.jp
  Identical sequence information in all three for the primary databases, but in slightly different file formats (e.g., GenBank at NCBI, EMBL-Bank at EBI)
  Different tools and web interfaces for searching and retrieval
  Some other specialty databases are unique to each site
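The point that one record can be delivered in more than one file format can be illustrated with NCBI's E-utilities `efetch` service, which takes the desired format as a parameter. This is a minimal sketch that only builds the request URLs and does not contact the server; the helper name `efetch_url` and the accession string are placeholders chosen for this example.

```python
from urllib.parse import urlencode

# NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, rettype: str = "fasta") -> str:
    """Build an efetch URL for a nucleotide record.

    rettype="fasta" requests FASTA text; rettype="gb" requests the
    GenBank flat-file format for the same record.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# "EXAMPLE_ACCESSION" is a placeholder, not a real record.
print(efetch_url("EXAMPLE_ACCESSION"))
print(efetch_url("EXAMPLE_ACCESSION", rettype="gb"))
```

Recording the exact URL used is also a convenient way to document which database and format a sequence came from.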

Major NCBI Databases
  GenBank / Entrez: nucleotide / protein search
  PubMed / Medline: journal publication search
  Genomes: full genomes, info & sequence
  Taxbrowser: full species taxonomy
  GEO: Gene Expression Omnibus
  Many other smaller databases

NAR Database Index
  Great collation of biological databases: Nucleic Acids Research Database Issue
    Alphabetical: http://www.oxfordjournals.org/nar/database/a/
    By Category: http://www.oxfordjournals.org/nar/database/c/
  1330 Databases (!!) (96 new since last year)
  Sorted alphabetically & by category
  Books date quickly; use on-line collections like this to get the most current information

Literature: PubMed @ NCBI
  What: access to Medline
    Primarily biomedical, molecular biology & biochemistry journals
  Searching
    Entrez search engine
    Logical operators
    Field delimiters
    "What's related"
    PMIDs
  Returns article title, authors, & abstract
  PubMed Central: free access to many full-text scientific papers
  UCSC Electronic Journal access
    Special procedure for getting access to journal articles off-campus (http://oca.ucsc.edu/login)
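Logical operators and field delimiters combine into a single Entrez query string: each term is restricted with a bracketed field tag, and terms are joined with uppercase AND, OR, or NOT. The sketch below composes such a string and wraps it in an E-utilities `esearch` URL without sending it; the helper names `fielded` and `combine` and the search terms are made up for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities endpoint for searching a database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field: str) -> str:
    """Restrict a term to one search field, e.g. fielded("yeast", "Title")."""
    return f"{term}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with a logical operator (AND, OR, NOT), parenthesized."""
    return "(" + f" {operator} ".join(terms) + ")"


# Papers with "genome" in the title that also mention yeast or fly there.
query = combine(
    "AND",
    fielded("genome", "Title"),
    combine("OR", fielded("yeast", "Title"), fielded("fly", "Title")),
)
print(query)  # (genome[Title] AND (yeast[Title] OR fly[Title]))

# The same query, URL-encoded for a PubMed search request.
url = f"{ESEARCH_URL}?{urlencode({'db': 'pubmed', 'term': query})}"
print(url)
```

The same query string can be pasted directly into the PubMed search box, which is a quick way to check that it returns what you expect.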

Genes & Genomes

Three Domains of Life Credit: http://scienceblogs.com/clock/2007/01/current_biological_diversity.php

What is a Genome?
  A complete set of instructions for life encoded in DNA
  Organized in chromosomes
    prokaryotes generally have one main circular chromosome
    eukaryotes have multiple linear chromosomes
  Instructions are generally in the form of genes, which are the units of heredity

Why Sequence a Genome?
  We wish to understand how the entire cell / organism works
    thousands of complex gene interactions
    complete parts list is first step to understanding how parts work together as a whole
  Economy of scale - faster, cheaper to sequence all genes at once than one at a time by many different researchers

First Fully Sequenced Organism: H. influenzae
  The Institute for Genomic Research (TIGR) - 1995

Human Genome Project
  In the 1980s, initial discussion to sequence the human genome (first key meeting here at UCSC!)
  Began: 1990
  Planned finish: 2005
  Original Estimates:
    ~100,000 genes
    3 billion nucleotides, projected cost $300 Million
    less than 5% of genome codes for genes
  Early opposition to sequencing: 95% junk, taking money away from basic research
  Final Results:
    Draft completed June 2000, Finished April 2003
    Final cost ~$3 billion
    ~25,000-30,000 genes
    now 5% of genome is conserved, likely important
    Up to 50% of genome is transcribed with uncertain biological function

Preparing for the Human Sequencing
  Mapping the human genome
  Practice: sequencing model organisms
    Organism                        Size (Mb)  Finished
    Baker's yeast (S. cerevisiae)   12.5       1996
    E. coli                         4.5        1997
    Roundworm (C. elegans)          100        1998
    Fruit fly (D. melanogaster)     160        2000
    Plant weed (A. thaliana)        120        2000

Sequencing Genomes: Strategy #1
  Original Top-Down Strategy
  Deliberate, small chance for major errors

Sequencing Genomes: Strategy #2
  WGS (Whole Genome Shotgun sequencing)
  Bottom-Up Strategy
  No ordering of a clone library - straight to sequencing to build scaffolds
  Current method used for genome sequencing
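The bottom-up idea, going straight from unordered reads to longer stretches by finding where reads overlap, can be shown with a toy greedy merge. This is a teaching sketch under strong assumptions (error-free reads, one strand, no repeats); it is not how GigAssembler or any production assembler works, and the function names and the short example sequence are invented here.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join: remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping reads taken from the made-up sequence ATGGCGTACGTTAGC.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Real genomes contain repeats and sequencing errors, which is why the greedy choice can go wrong and why assembly is a substantial computational problem in its own right.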

Race to the Finish: June 2000
  A new private company, Celera, announced it would use strategy #2 to beat the public, international effort to sequence the human genome, and planned to patent as many human genes as possible
  The public team changed strategy from #1 to a mixture of #1 and #2
  The change in strategy meant ordered clone libraries would not be finished, and an alternate method was needed for putting the pieces together (quickly)
  UC Santa Cruz to the rescue! David Haussler & Jim Kent, in just four weeks, created a clever program (GigAssembler) to run on a network of 100 Dell desktop computers to assemble the genome by Celera's target date in June
  The whole story: http://www.nytimes.com/2001/02/13/health/13hero.html

Sequencing Technologies
  Sanger Sequencing
    Capillary sequencing: ABI 3700 and MegaBACE machines allowed the first rapid automated sequencing in tiny capillary tubes
    600-800 bases at a time x 384 wells = ~268,000 bases decoded / run
    Bulk of original human genome decoded on these
  NextGen Technologies (massively parallel, shorter reads)
    454 / Roche Sequencers: 100-500 bases, 400-600 Million bases / run
    Illumina Genome Analyzer: ~2x100 bases, 150-200 Billion bases / run
    ABI SOLiD Sequencing: ~2x50 bases each, 80-100 Billion bases / run
  Chemical decoding is no longer the major cost (10 Million bp / cent)
  Re-assembling all the pieces with computer programs and interpretation (bioinformatics) is now the major cost / bottleneck
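The per-run figure for capillary sequencing is read length times the number of wells, and the same arithmetic shows the gap between technologies. The sketch below uses the midpoint of the 600-800 base range and the low end of the Illumina range from the slide; the `coverage` parameter (how many times each base is read on average) is an assumption added for illustration and is set to 1 to keep the comparison simple.

```python
import math

HUMAN_GENOME_BASES = 3_000_000_000  # ~3 billion nucleotides


def bases_per_run(read_length: int, reads_per_run: int) -> int:
    """Total bases decoded in one machine run."""
    return read_length * reads_per_run


def runs_needed(genome_size: int, coverage: float, per_run: int) -> int:
    """Whole runs required to read genome_size bases `coverage` times over."""
    return math.ceil(genome_size * coverage / per_run)


# Capillary: ~700 bases per read (midpoint of 600-800) x 384 wells.
capillary = bases_per_run(read_length=700, reads_per_run=384)
print(capillary)                                      # 268800
print(runs_needed(HUMAN_GENOME_BASES, 1, capillary))  # 11161

# Next-gen: low end of the 150-200 billion bases / run range.
nextgen = 150_000_000_000
print(runs_needed(HUMAN_GENOME_BASES, 1, nextgen))    # 1
```

Raising `coverage` scales the run counts proportionally, which makes the difference between the two generations of machines even more visible.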

Genome Challenges
  $1,000 Genome Prize: goal set in 2003 by the J. Craig Venter Science Foundation to make personalized medicine available to all ($500,000 prize)
  X Prize: build a machine that can sequence 100 people in 10 days or less, for $10,000/genome or less ($10 Million prize): http://genomics.xprize.org/

Plummeting Cost of Human Genome Sequencing Source: Wetterstrand KA http://www.genome.gov/sequencingcosts/

More than a Dozen Human Genomes Decoded, and counting
  By 2010, more than a dozen human genomes had been completed, including those of James Watson and J. Craig Venter
  Five new genomes were published in March 2010, focusing on people with genetic diseases: http://www.nytimes.com/2010/03/11/health/research/11gene.html
  Perspective on the human genome, 10 yrs later: http://www.nature.com/nature/journal/v464/n7289/full/464649a.html Nature 464, 649-650 (1 April 2010)

The Cancer Genome Atlas (TCGA) Project
  The Cancer Genome Atlas (TCGA), started in 2005, is a project to catalogue the genetic mutations responsible for cancer using genome analysis techniques
  TCGA is an effort in the War on Cancer that applies recently developed high-throughput genome analysis techniques and seeks to improve our ability to diagnose, treat, and prevent cancer
  Includes sequencing selected genes, measuring gene expression, and decoding the entire genomes of a portion of the 500+ patient tissue samples (20-25 different tumor types)

The Cancer Genome Atlas (TCGA) Project
  TCGA is the first large-scale genomics project funded by the NIH to devote significant resources to bioinformatic discovery
  The NCI has devoted 50% of TCGA appropriated funds, approximately $12M/year, to fund bioinformatic discovery
  Four centers, called Genome Analysis Data Centers -Bs, are responsible for the biological interpretation of TCGA data: University of California at Santa Cruz, MD Anderson Cancer Center, Memorial Sloan Kettering Cancer Center, and The Institute for Systems Biology