What is Bioinformatics? "Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline." (NCBI) "The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned." (NCBI) Source: http://www.ncbi.nlm.nih.gov/about/primer/bioinformatics.html
DNA Sequencing: chain terminator (figure labels: 5', 3'). A 3' hydroxyl group is essential for chain elongation. 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Capillary Gel Electrophoresis The sequencing reaction is run out in a single capillary gel. The gel is scanned by a laser. The sequence is read automatically using computer software from the pattern of different wavelengths emitted by the fluorescent dyes.
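The automated read-out described above can be illustrated with a minimal sketch. It assumes the laser scan has already been reduced to one set of four dye-channel intensities per peak. The channel-to-base mapping and the example intensities are hypothetical; real instruments use their own dye sets and far more elaborate signal processing.

```python
# Hypothetical dye-channel-to-base mapping (an assumption for illustration).
CHANNEL_TO_BASE = {0: "A", 1: "C", 2: "G", 3: "T"}


def call_bases(peaks):
    """Call one base per peak from four fluorescence intensities.

    `peaks` is a list of 4-tuples, one per detected peak, holding the
    intensity measured in each dye channel. The brightest channel wins;
    a tie for the maximum is reported as the ambiguity code "N".
    """
    sequence = []
    for intensities in peaks:
        top = max(intensities)
        winners = [i for i, value in enumerate(intensities) if value == top]
        sequence.append(CHANNEL_TO_BASE[winners[0]] if len(winners) == 1 else "N")
    return "".join(sequence)


# Made-up intensities, one tuple per peak.
example_peaks = [
    (900, 20, 15, 30),   # channel 0 brightest
    (10, 25, 870, 40),   # channel 2 brightest
    (50, 50, 10, 5),     # tie between channels 0 and 1
    (12, 30, 22, 910),   # channel 3 brightest
]
print(call_bases(example_peaks))
```

Reporting ties as "N" keeps ambiguous positions visible instead of guessing a base.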
Automated sequencers: ABI 3700. Made by Applied Biosystems. Most widely used automated sequencer: 96 capillaries; robot loading from 384-well plates; two to three hours per run; 600 to 700 bases per run. Figure labels: robotic arm and syringe, 96 glass capillaries, 96-well plate, load bar.
Workflow of conventional vs. second-generation sequencing. High-throughput shotgun Sanger sequencing: template amplification; Sanger cycle sequencing; capillary electrophoresis; 96 or 384 long reads per run. Cyclic array shotgun sequencing: (template amplification); template immobilization; sequencing by synthesis or hybridization; millions of short reads per run.
Illumina. Figure from M. Metzker, Nat Rev Genet, Jan. 2010
Cost of Sequence per megabase
Benefits of Next-gen sequencing https://genomevolution.org/wiki/images/1/16/plant_genome_growth.png
Why do we sequence? Genome Annotation: A complete genome sequence provides the raw data to construct a "parts list". Comparative Genomics: Conserved regions in the genome are more likely to play an important role in the biology of the species. Functional Genomics: Sequencing the RNA gives insight into the transcriptionally active regions of the genome. Population Genetics and Genomics: Genetic structure and diversity reveal the history and distribution of phenotypic traits (e.g. disease susceptibility alleles). Genetic Analysis: Map and characterize the molecular basis of allelic variants.
We have the genome sequence, now what? Well...! We don't know how many genes there are! We don't know where they are! We don't know what they do!
Definitions of Annotation: Interpreting raw sequence data into useful biological information. Information attached to genomic coordinates with a start and end point; it can occur at different levels. Addition of as much reliable and up-to-date information as possible to describe a sequence. Identification, structural description, and characterization of putative protein products and other features in the primary genomic sequence.
Genome annotation: two main levels. Structural annotation (nucleotide- and protein-level annotation): finding genes and other biologically relevant sites, thereby building a model of the genome as a set of objects with specific locations. Functional annotation: the objects are used in database searches (and experiments), with the aim of attributing biologically relevant information to the whole sequence and to individual objects. In large-scale genome analysis projects, the rate-limiting step is annotation.
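The idea of annotation as objects with specific locations can be sketched as a small record type. This is a hypothetical minimal structure, not the schema of any particular annotation format; the field names, coordinates, and attribute values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One annotated object on a genome (hypothetical minimal record)."""
    seqid: str          # chromosome or contig name
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: str         # "+" or "-"
    feature_type: str   # e.g. "gene", "exon"
    attributes: dict = field(default_factory=dict)  # functional annotation

    def length(self) -> int:
        """Number of bases covered by the feature."""
        return self.end - self.start + 1

    def overlaps(self, other: "Feature") -> bool:
        """True when both features share at least one base on the same sequence."""
        return (self.seqid == other.seqid
                and self.start <= other.end
                and other.start <= self.end)


# Structural annotation: where the object is (made-up coordinates).
gene = Feature("chr1", 1000, 2500, "+", "gene")
exon = Feature("chr1", 1000, 1200, "+", "exon")

# Functional annotation: what the object does (placeholder value).
gene.attributes["product"] = "hypothetical protein"

print(gene.length(), gene.overlaps(exon))
```

Keeping coordinates and free-form attributes in one record mirrors the two levels: the location fields carry the structural annotation, and the attribute dictionary carries the functional annotation.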
How do we get from here
to here,
Summary of gene annotation steps
Gene prediction through comparative genomics. Highly similar (conserved) regions between two genomes are likely to be functional; otherwise they would have diverged. If the genomes are too closely related, all regions are similar, not just genes. If the genomes are too far apart, analogous regions may be too dissimilar to be found.
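A minimal sketch of the conservation idea: slide a fixed-size window along two sequences that are assumed to be already aligned and report windows whose percent identity passes a threshold. The function name, window size, threshold, and toy sequences are all assumptions; real comparative pipelines compute the alignment first and use more refined scoring.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.8):
    """Return (start, identity) for each window meeting the identity threshold.

    Both sequences must already be aligned and of equal length. Gap
    characters ("-") never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy aligned sequences (made up): the first ten bases are identical,
# the rest has mostly diverged.
species_1 = "ATGGCCAAGTTTACGTACGA"
species_2 = "ATGGCCAAGTCCGTTAGCTA"

for start, identity in conserved_windows(species_1, species_2):
    print(start, identity)
```

The threshold reflects the trade-off described above: between very close genomes nearly every window passes, and between very distant genomes almost none do.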
Mouse-human comparison
From: J.W. Thomas et al., Nature, 14 August 2003
The ENCODE Project Consortium (2011) A User's Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol 9(4)
Automated Manual Merged
Basic Distributed Annotation Systems (DAS)
Contents of an Integrated Database. Experimental data: microarray, ChIP-chip. Genome and functional annotation: predicted genes, GO, MIPS FunCat. Data to support modeling efforts: protein-protein interactions, protein-DNA interactions, pathways (KEGG, AraCyc).
Bioinformaticians integrate the data into one database
1) Find the data. Decentralized databases. Data in different formats: experiments, function, models.
2) Convert to a common format. XML is a good idea (SBML).
3) Data integration. Manual: Excel sheet comparisons (biologists). Automated: Perl scripts (informaticians). Database: queries, e.g. SQL (high-production labs).
4) Gene list intersect. Annotation.
5) Modeling. Biological function in gene list. Need visualization and network modeling tools.
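Steps 2 to 4 can be sketched in a few lines of Python, used here in place of the Perl scripts the slide mentions. The two inputs, their column names, the gene identifiers, and the fold-change cutoff are all made up for illustration. The sketch only shows the pattern: read each source in its own format, normalize identifiers to a common form, then intersect the gene lists.

```python
import csv
import io

# Hypothetical source 1: comma-separated expression results.
EXPRESSION_CSV = """gene_id,fold_change
G001,2.4
g002,0.9
G003,3.1
"""

# Hypothetical source 2: tab-separated interaction pairs.
INTERACTIONS_TSV = """protein_a\tprotein_b
g001\tG003
G004\tG001
"""


def normalize(gene_id):
    """Map identifiers from either source to one common form."""
    return gene_id.strip().upper()


def load_induced_genes(text, min_fold_change=2.0):
    """Genes whose fold change meets the (assumed) cutoff."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        normalize(row["gene_id"])
        for row in reader
        if float(row["fold_change"]) >= min_fold_change
    }


def load_interactors(text):
    """Every gene that appears on either side of an interaction."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    genes = set()
    for row in reader:
        genes.add(normalize(row["protein_a"]))
        genes.add(normalize(row["protein_b"]))
    return genes


induced = load_induced_genes(EXPRESSION_CSV)
interactors = load_interactors(INTERACTIONS_TSV)

# Gene list intersect: induced genes that also have a known interaction.
shared = sorted(induced & interactors)
print(shared)
```

Python sets make the intersection a one-liner; a database-backed lab would express the same step as an SQL join on the normalized identifier.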
UCSC browser
Examples of Large Genome Projects. 1000 Genomes Project (www.1000genomes.org): An effort to sequence the genomes of 1000 people to identify genetic variants present in at least 1% of the human population. 1001 Arabidopsis thaliana Genomes Project (www.1001genomes.org): A study of the genomes and phenotypes of 1001 strains, aiming to explain phenotypic differences caused by adaptation to different conditions. Metagenomics (http://commonfund.nih.gov/hmp/): Sequencing of DNA samples from environments, for example the mouth, skin, and digestive system, to identify the different bacterial species present.
Your genome. Personal Genome Sequencing: Several companies offer a service where you can submit your DNA to be sequenced. This can help you learn more about your heritage and about which diseases you are susceptible to. Medical Genomic Studies: There is already a collection of genetic testing procedures that look for specific genes. Unfortunately, they are not always accurate, which can lead individuals to make bad decisions. The hope is that with more genes, we can make better and more informed decisions.