Introduction to Bioinformatics


Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, February 8th, 2016

Info and documentation
  Online sources only for guidance and hints: never take the internet for granted
  Campbell Biology, 9th or 10th edition, Pearson
  Reader
    Printed in black and white
    Download full-color PDF at:
    Errata:

Course evaluation
  Final mark course
    40% mark of Bioinformatic Data Analysis (Bas Dutilh)
    10% mark of Basic Maths (Kirsten ten Tusscher)
    50% mark of Mathematics/Theoretical Biology (Kirsten ten Tusscher and Rob de Boer)
  Bioinformatic Data Analysis exam
    Written exam
    Cheat sheet allowed: one hand-written A4, double-sided is OK
    Date: March 14th, 2015, 13:30-16:30, in Educatorium Gamma
  Bioinformatic Data Analysis bonus point
    Make all exercises and have them signed by your assistant
    This has to be done in the same week as the practical
    In case of emergency: last chance to sign off is on Monday before the lecture
    The maximum mark is a 10
  Mini-article was cancelled

How would you figure out the function of a protein?
  Activity assay
  X-ray structure
  Knock-out mouse
  BLAST search

How about for all proteins in a genome?

Genome sizes
  Chaos chaos (1.4 Tb, Friz 1968)
  Tb: tera base pairs (10^12)
  Gb: giga base pairs (10^9)
  Mb: mega base pairs (10^6)
  kb: kilo base pairs (10^3)
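The unit prefixes on this slide can be turned into a small conversion helper. The Python sketch below is illustrative only: the function name `format_genome_size` and the rule "use the largest unit that gives a value of at least 1" are assumptions, not part of the course material. The kilo prefix is written as kb.

```python
# Base-pair unit prefixes from the slide, largest first.
UNITS = [("Tb", 10**12), ("Gb", 10**9), ("Mb", 10**6), ("kb", 10**3)]


def format_genome_size(bp: int) -> str:
    """Express a base-pair count in the largest unit that gives a value >= 1."""
    for name, factor in UNITS:
        if bp >= factor:
            # ":g" drops trailing zeros, so 149.0 is printed as "149".
            return f"{bp / factor:g} {name}"
    return f"{bp} bp"


if __name__ == "__main__":
    # 1.4 Tb is the Chaos chaos figure quoted on the slide.
    for size in (1_400_000_000_000, 3_000_000_000, 580_000, 500):
        print(size, "->", format_genome_size(size))
```

For example, `format_genome_size(1_400_000_000_000)` gives `"1.4 Tb"`, the value quoted for Chaos chaos.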

Gene density and non-coding DNA
  Mammals (including humans) have the lowest gene density
    Gene density: number of genes in a given length of DNA
    Introns within genes
    Noncoding DNA between genes

Components of the human genome
  20,000 to 25,000 protein-coding genes (1.5%)
  Introns (25.9%)
  Transposable elements (44.7%)
    DNA transposons
    Long terminal repeat (LTR) retrotransposons
    Short interspersed nuclear elements (SINEs)
    Long interspersed nuclear elements (LINEs)
    Endogenous retroviruses
    Miniature inverted repeat transposable elements (MITEs)
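The percentages and the definition of gene density above can be made concrete with a little arithmetic. The sketch below assumes a human genome of 3,000,000,000 bp (the round figure used elsewhere in these slides). The names `component_bp` and `genes_per_mb` are hypothetical helpers introduced for illustration.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: round 3 Gb figure for the human genome

# Share of the genome per component in tenths of a percent (15 means 1.5%),
# kept as integers to avoid floating-point rounding.
COMPONENT_PER_MILLE = {
    "protein-coding genes": 15,
    "introns": 259,
    "transposable elements": 447,
}


def component_bp(per_mille: int, genome_bp: int = HUMAN_GENOME_BP) -> int:
    """Base pairs covered by a component that makes up per_mille/1000 of the genome."""
    return genome_bp * per_mille // 1000


def genes_per_mb(n_genes: int, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Gene density: number of genes per megabase (10^6 bp) of DNA."""
    return n_genes / (genome_bp / 10**6)


if __name__ == "__main__":
    for name, share in COMPONENT_PER_MILLE.items():
        print(f"{name}: {share / 10:.1f}% = {component_bp(share):,} bp")
    for n in (20_000, 25_000):
        print(f"{n:,} genes -> {genes_per_mb(n):.1f} genes per Mb")
```

Under the 3 Gb assumption, 20,000 to 25,000 genes correspond to roughly 6.7 to 8.3 genes per Mb, and the 1.5% of protein-coding sequence amounts to 45,000,000 bp.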

Largest genomes
  Kinugasasō (Paris japonica): 149,000,000,000 bp (149 Gb)
  Largest sequenced genome: Loblolly pine (Pinus taeda), 20,000,000,000 bp (20 Gb)

Smallest genomes
  Eukaryota
    Free-living: Ostreococcus tauri (12.6 Mb)
    Endosymbiont: Encephalitozoon intestinalis (2.3 Mb)
  Bacteria and Archaea
    Free-living: Mycoplasma genitalium (580 kb)
    Endosymbiont: Cand. Carsonella ruddii (160 kb)
  Viruses
    Circoviridae (1.8 kb, only two proteins!)

Human genome
  3,000,000,000 bp (3 Gb)
  Human Genome Project (HGP)
    Draft genome sequence complete in 2000
  Reference genome
    Source: blood (female) and sperm (male)
    Samples taken from many donors, but only a few were used, to protect donor identities
    Sequence is not from one individual
    >70% from one male donor
  Cost
    HGP: $3,000,000,000
    Target: $1,000 genome

Genetic diversity
  Phylogenetic Tree of Life (figure; labels: Eukaryotes, Archaea, Bacteria, Prokaryotes)

Genome sequencing
  Cloned genomes
  Segments in known order
  Fragment and sequence
  Assemble sequences
  Consensus genome
  Whole Genome Shotgun (WGS) approach
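The "fragment, sequence, assemble" idea behind the shotgun approach can be illustrated with a toy greedy overlap assembler. This is a minimal sketch, not the method used by any real assembler. It assumes error-free reads from a single strand, no repeats longer than the overlap, and a hypothetical minimum overlap of 3 bases. The function names `overlap` and `greedy_assemble` are invented for this example. Because the reads are error-free, no per-position consensus step is needed here.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Three overlapping "reads" sampled from the toy genome ATGCGTACGTTAGC.
    reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
    print(greedy_assemble(reads))
```

With the three toy reads shown, the overlaps GTAC and GTTA are enough to reconstruct the original 14-base sequence as a single contig.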

Personal genome sequences
  Figure labels: Craig Venter, James Watson, Reference Genome, Your personal genome sequence
  Each comparison is marked "~ differences" (numbers missing)
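Counting differences between a personal genome and the reference can be illustrated on toy data. The sketch below assumes the two sequences are already aligned and of equal length, so it finds only single-base substitutions. Real variant calling starts from sequencing reads and also handles insertions and deletions. The function name `find_differences` and the example sequences are made up for illustration.

```python
def find_differences(reference: str, personal: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, personal base) for each mismatch."""
    if len(reference) != len(personal):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, own_base)
        for pos, (ref_base, own_base) in enumerate(zip(reference, personal), start=1)
        if ref_base != own_base
    ]


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # toy reference sequence
    personal = "ACGTTCGTAA"   # toy personal sequence with two substitutions
    for pos, ref_base, own_base in find_differences(reference, personal):
        print(f"position {pos}: {ref_base} -> {own_base}")
```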

So we have a $200 personal genome; now the million-dollar question is:
What can I learn from my 3,000,000,000 A's, C's, G's, and T's?

Personalized medicine
  Sergey Brin: co-founder, co-investor
    LRRK2 polymorphism on chromosome 12
      28% risk of Parkinson's at age 59
      51% at age 69
      74% at age 79
  From reactive to proactive medicine
    Identify high-risk alleles
    Adapt lifestyle (e.g. risk of high blood pressure)
    Preventive screening or treatment (e.g. risk of cancer)
  Pharmacogenomics: impact of genetic variation on response to medication

Biology is Big Data science
  Figure axis: # sequenced genomes
  Moore's Law: computer power doubles every ~2 years

Omics sciences
  The suffix -ome refers to a totality of some sort

  Entity              Totality        Field
  Gene (genetics)     Genome          Genomics
  Transcript (RNA)    Transcriptome   Transcriptomics
  Protein             Proteome        Proteomics
  Metabolite          Metabolome      Metabolomics
  Lipid               Lipidome        Lipidomics
  Microbe             Microbiome      Microbiomics (?!)

  DNA, RNA, Protein
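Moore's Law as stated on this slide describes exponential growth with a fixed doubling time: after t years the growth factor is 2^(t / 2). The short sketch below evaluates that formula. The function name `growth_factor` is hypothetical, and the 2-year doubling time is the approximate figure from the slide, not a measured constant.

```python
def growth_factor(years: float, doubling_time: float = 2.0) -> float:
    """Factor by which a quantity grows in `years` if it doubles every `doubling_time` years."""
    return 2.0 ** (years / doubling_time)


if __name__ == "__main__":
    for years in (2, 10, 20):
        print(f"after {years} years: x{growth_factor(years):g}")
```

Under this assumption, computer power grows 32-fold in 10 years and 1024-fold in 20 years. The same function applies to any quantity that doubles at a fixed interval by passing a different `doubling_time`.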

Genomics
  Identify differences in gene content between genomes
  Discover new species: Biological Dark Matter
  Analyze genome evolution
  Predict gene functions
  Figure labels: Chordata, Echinodermata
  10,000 species cultured
  30,000 genomes sequenced
  1,000,000,000,000 species on earth?

Metagenomics
  Sample
  Filter
  Microbes or viruses

Metagenomic discovery of Lokiarchaeota
  Spang et al., Nature 2015

Genetic diversity
  Phylogenetic Tree of Life (figure; labels: Eukaryotes, Archaea, Bacteria, Prokaryotes)

Human microbiome and virome
  In your body:
    ~10^13 human cells
    ~10^14 bacteria
    ~10^15 viruses
  Image: Lisa Brown for

Bioinformatics
  Bioinformatics: study of informatic processes in biotic systems
    Paulien Hogeweg and Ben Hesper (Utrecht University, 1970)
  Bioinformatic Data Analysis: using computational methods to analyze biological data

Bioinformatics in Utrecht today
  Bring your laptop