Computational Biology and Bioinformatics

Computational Biology and Bioinformatics Computational biology Development of algorithms to solve problems in biology Bioinformatics Application of computational biology to the analysis and management of biological data Applied bioinformatics Intelligent use of tools to navigate the sequence space Better experiments can be designed by a careful Bioinformatics analysis before the bench work Bioinformatics 7,850,000 hits

DNA sequencing DNA sequences are the most abundant type of sequences (5 * 10 10 ) Generated by the chain termination method (Sanger sequencing) Based on the action of a DNA polymerase that adds nucleotides to complementary strand Fluorescently labeled ddntp (Dideoxynucleotides) stop synthesis acting as chain terminators. They are included in amounts so as to terminate every time the base appears in the template Requires template DNA, and primer and one ddntp for each base: A,C,G, and T Products are separated by electrophoresis G GG Primer A A A C C C T T T

Summary of chain termination sequencing In the past, four different reactions, one for each ddntp, were separated on a gel that could resolve one-base differences. The sequence was then read from the bottom of the gel to the top.

Sequence reading of fluorescently labeled reactions Fluorescently labeled reactions scanned by laser as particular point is passed Color picked up by detector Output sent directly to computer In its simplest form a sequence can be represented as a string of nucleotides with a basic tag or identifier after a greater than character > : FASTA format Definition line (commonly called def line ) >U54469.1 CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC GTGGCTACATCATCATTGTGTTCACCGATTATTTTTTGCACAATTGCTTAATATTAATTGTACTTGCACG CTATTGTCTACGTCATAGCTATCGCTCATCTCTGTCTGTCTCTATCAAGCTATCTCTCTTTCGCGGTCAC TCGTTCTCTTTTTTCTCTCCTTTCGCATTTGCATACGCATACCACACGTTTTCAGTGTTCTCGCTCTCTC TCTCTTGTCAAGACATCGCGCGCGTGTGTGTGGGTGTGTCTCTAGCACATATACATAAATAGGAGAGCGG More information can be be added to the FASTA definition line >gb U54469.1 DMU54469 Drosophila melanogaster eukaryotic initiation factor 4E (eif4e) gene. GenBank Accession.version Locus Descritpion

Editing Errors

Capillary electrophoresis Newer automated sequencers use very thin capillary tubes Run all four fluorescently tagged reactions in same capillary Can have 96 capillaries running at the same time robotic arm and syringe 96 glass capillaries 96 well plate load bar

New sequencing technologies 454 Life Sciences: FLX titanium: 1,000,000 reads 400 bp= 400,000,000 high quality bp in 10h run http://www.454.com/products-solutions/how-it-works/index.asp

Present and Future of sequencing Sequencing costs Dropping each year Opens possibility of sequencing genomes of individuals Greatly facilitates comparative genomics.

DNA RNA protein phenotype genomic DNA databases cdna ESTs UniGene protein sequence databases

There are three major public DNA databases EMBL Housed at EBI European Bioinformatics Institute GenBank Housed at NCBI National Center for Biotechnology Information DDBJ Housed in Japan

Taxonomy at NCBI: >200,000 species are represented in GenBank http://www.ncbi.nlm.nih.gov/taxonomy/txstat.cgi

The most sequenced organisms in GenBank Homo sapiens 13.1 billion bases Mus musculus 8.4b Rattus norvegicus 6.1b Bos taurus 5.2b Zea mays 4.6b Sus scrofa 3.6b Danio rerio 3.0b Oryza sativa (japonica) 1.5b Strongylocentrotus purpurata 1.4b Nicotiana tabacum 1.1b GenBank release 168.0

Sequence databases What is a database? An indexed set of records Records retrieved using a query language Examples of sequence databases Primary databases (archival) GenBank EMBL (European Molecular Biology Laboratory) DDBJ (DNA Data Bank of Japan) Secondary databases (curated) RefSeq EMBL Genome Reviews Protein databases TPA (Third party annotations) The client server model has made access to sequence databases fast and easy Biologist query Web Browser Web Server BLAST Search Engine Database

Data flow of submissions between primary databases (Chp. 1) Integrated information retrieval system Is an interface not a database http://www.ensembl.org/index.html Entrez NIH National Institute of Health (USA) Submissions Updates (Sequin) NCBI National Center for Biotechnology Information GenBank EMBL European Molecular Biology Laboratory International Nucleotide Sequence Database Collaboration Updated every 24 hs EMBL http://www.ensembl.org/i ndex.html EBI Ensambl European Bioinformatics Institute DDBJ Center for Information Biology CIB DNA Data Bank of Japan Submissions Updates Getentry http://getentry.ddbj.nig.a c.jp/getstart-e.html Submissions Updates (Sakura) NIG National Institute of Genetics (JAPAN)

Nucleotide sequence flatfiles Header: Locus (10 characters, 1 st letter, arbitrary name, not useful), Length, Molecule type (DNA or RNA), Division code (includes functional divisions as ExpressedSeqTags, SeqTagsSites, WholeGenomeSeq, etc.), Last release date EMBL). Definition: Summary of biological information Accession number: Primary key to reference a record in the database. Used in publications (1+5 or 2+6). Version: The version is increased every time the sequence is updated. For each version there is a dif. GI geninfo identifier, which is specific of GenBank, not very useful Keywords: abandoned because of absence of controlled vocabulary. Source / Organism : taxonomic information top-down. Reference: submission credit and published paper. Feature Table: direct representation of the biological information in the record: Source: org, chromosome, map Gene may include regulatory sequences mrna from start of translation to polyadenilation signal CDS (coding sequences): from start to stop codon, provides exon coordinates. Every CDS is assigned a protein_id.version (3+5) Sequence: actual sequence ending in //

Secondary databases Refseq Comprehensive, integrated, and non-redundant set of sequences Genomic DNA RNA Protein explicitly linked 2_6 Format (the underscore is never present in GenBank accessions): Experimental Predicted NT_123456 (genomic contig) NM_123456 (mrnas) NP_123456 (Proteins) [XM_123456 (model mrnas)] [XP_123456 (model protein)] Undergo continuous curation: most up to date sequence Each RefSeq is a synthesis of information, not a piece of a primary research: equivalent to a review article Message for Removed Secondary

Secondary databases Third Party Annotation (TPA) Includes Reannotations, Combinations of novel and existing primary entries Annotations of trace archives Whole genome Shotgun data Provides GenBank accession. Version numbers and nucleotide locations for all primary entries to which the TPA sequence relates EMBL Genome reviews Includes Add information from UniProt knowledgebase, Gene Ontology Annotation, InterPro, and others Curated versions of entries representing complete genomes Standardize annotations

Protein databases http://www.ebi.ac.uk/blast2/index.html?uniprot Merge 50% = GenPept: translations of all CDS. Not curated Merge 90% = Uniprot (Swiss-Prot/TrEMBL/PIR-PSD) UniParc: most comprehensive, public nonredundant protein database Swiss-Prot (manual)/trembl(computer)/pir-psd GenBank, Patents, Int. Pr. Index (IPI) Protein Data Bank UniProt Knowledgebase: curated subset of UniParc Function Postranslational modifications Domains Catalytic sites Structures Associated diseases Pathways Etc. UniRef: UniProt nonredundant reference database: 95%, 90% and 50% sets. Functional groups Pfam Prosite IternPro Merge 95% = All predicted coding regions

www.ncbi.nlm.nih.gov

PubMed is National Library of Medicine's search service 19 million citations in MEDLINE links to participating online journals PubMed tutorial

Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases

BLAST is Basic Local Alignment Search Tool NCBI's sequence similarity search tool supports analysis of DNA and protein databases 100,000 searches per day

OMIM is Online Mendelian Inheritance in Man catalog of human genes and genetic disorders created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI

Bookshelf is searchable resource of on-line books

Taxonomy Browser is browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms practically useful to find a protein or gene from a species

Structure site includes Molecular Modeling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST)

Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775 NT_030059 Rs7079946 GenBank genomic DNA sequence Genomic contig dbsnp (single nucleotide polymorphism) DNA N91759.1 An expressed sequence tag (1 of 170) NM_006744 RefSeq DNA sequence (from a transcript) RNA NP_007635 AAC02945 Q28369 1KT7 RefSeq protein GenBank protein SwissProt protein Protein Data Bank structure record protein Page 27

NCBI s important RefSeq project: best representative sequences RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. RefSeq identifiers include the following formats: Complete genome Complete chromosome Genomic contig mrna (DNA format) Protein NC_###### NC_###### NT_###### NM_###### e.g. NM_006744 NP_###### e.g. NP_006735 Page 27

NCBI s RefSeq project: accession for genomic, mrna, protein sequences Accession Molecule Method Note AC_123456 Genomic Mixed Alternate complete genomic AP_123456 Protein Mixed Protein products; alternate NC_123456 Genomic Mixed Complete genomic molecules NG_123456 Genomic Mixed Incomplete genomic regions NM_123456 mrna Mixed Transcript products; mrna NM_123456789 mrna Mixed Transcript products; 9-digit NP_123456 Protein Mixed Protein products; NP_123456789 Protein Curation Protein products; 9-digit NR_123456 RNA Mixed Non-coding transcripts NT_123456 Genomic Automated Genomic assemblies NW_123456 Genomic Automated Genomic assemblies NZ_ABCD12345678 Genomic Automated Whole genome shotgun data XM_123456 mrna Automated Transcript products XP_123456 Protein Automated Protein products XR_123456 RNA Automated Transcript products YP_123456 Protein Auto. & Curated Protein products ZP_12345678 Protein Automated Protein products

Access to sequences: Entrez Gene at NCBI Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mrna) or protein (NP_000509) Page 29

From the NCBI home page, type beta globin and hit Search

Follow the link to Gene

Entrez Gene is in the header Note the Official Symbol HBB for beta globin Note the limits option

By applying limits, there are now far fewer entries

Entrez Gene (top of page) Note that links to many other HBB database entries are available

Entrez Gene (middle of page): genomic region, bibliography

Entrez Gene (middle of page, continued): phenotypes, function

Entrez Gene (bottom of page): RefSeq accession numbers

Entrez Protein: accession, organism, literature

Entrez Protein: features of a protein, and its sequence in the one-letter amino acid code

FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code