How Computing Science Saved The Human Genome Project Athabasca Hall Sept. 9, 2013

How Computing Science Saved The Human Genome Project david.wishart@ualberta.ca 3-41 Athabasca Hall Sept. 9, 2013

The Human Genome Project

HGP Announcement, June 26, 2000: J. Craig Venter, Francis Collins, Bill Clinton, Tony Blair

Most Significant Scientific Accomplishments of the Last 50 Years: June 26, 2000 (Human Genome Project announcement) and July 20, 1969 (Apollo 11 Moon landing)

The Human Genome Project
First efforts started in October 1990
Two competing efforts, private vs. public (Celera vs. NIH)
First draft completed on June 26, 2000
Finished on May 18, 2006 ($3.8 billion)
Used hundreds of machines and thousands of scientists to sequence a total of 3,283,984,159 bases on 24 chromosomes

21,000 metabolite

DNA Structure & Bases

DNA Sequencing The Key to the Human Genome Project

Shotgun Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence

Multiplexed CE with Fluorescent detection ABI 3700 96x700 bases

Shotgun Sequencing Sequence Chromatogram Send to Computer Assembled Sequence

HGP - Challenges
Reading the DNA sequencer chromatograms (base calling)
Putting millions of short reads together to assemble the genome (assembly)
Identifying the genes from the DNA sequence (gene finding)
Figuring out what each gene does

HGP - Challenges
Reading the DNA sequencer chromatograms: 35 billion base calls
Putting millions of short reads together to assemble the genome: piecing 35 million reads together
Identifying the genes from the DNA sequence: finding 1% signal with >95% accuracy
Figuring out what each gene does: 20,000 x 100,000,000 comparisons

Biting Off Too Much

Computational Challenges
Reading the DNA sequencer chromatograms: 35 billion base calls
Putting millions of short reads together to assemble the genome: piecing 35 million reads together
Identifying the genes from the DNA sequence: finding 1% signal with >95% accuracy
Figuring out what each gene does: 20,000 x 100,000,000 comparisons

Principles of DNA Sequencing

Principles of DNA Sequencing [figure: fragments for each base (G, T, C, A) separated by length, from short to long]

Capillary Electrophoresis Separation by Electro-osmotic Flow

Multiplexed CE with Fluorescent detection ABI 3700 96x700 bases

Base Calling
Image processing
Peak detection
De-noising
Peak deconvolution
Signal analysis
Reliability assessment
99.99% accurate for 35 billion base calls

Base Calling With Phred
Read quality: Bad, Better, Good, Excellent
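Phred attaches a quality score Q to every base call, defined from the estimated error probability P as Q = -10 log10(P). The sketch below converts between the two; the function names are illustrative and are not part of Phred itself.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred quality score Q."""
    return -10 * math.log10(p)


def expected_errors(n_calls, q):
    """Expected number of wrong calls among n_calls bases, all at quality Q."""
    return n_calls * phred_to_error_prob(q)
```

Under this definition, 99.99% accuracy corresponds to Q = 40, and 35 billion base calls at that quality would still be expected to contain about 3.5 million errors, which is one reason each region is sequenced several times.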

Base Calling - Result
ATGTCACTGCAATTGATGTATAAATGGA
GTTAGACACTAGATCACATAGGAGTTTA
CGCTAAATGACAGATAGACA
GGGATATCTATAGATAGACACATAGCTCTCT
AATGACGACTAGCTGAGTAGATT
TTACGATCGATCGATATTACCGCGCGAAATAT
AGCTATGATGTCGAT
AGACTAGCTAGCTTCTCGGATATTAGA

Shotgun Sequencing Sequence Chromatogram Send to Computer Assembled Sequence

Computational Challenges
Reading the DNA sequencer chromatograms: 35 billion base calls
Putting millions of short reads together to assemble the genome: piecing 35 million reads together
Identifying the genes from the DNA sequence: finding 1% signal with >95% accuracy
Figuring out what each gene does: 20,000 x 100,000,000 comparisons

Sequence Assembly
Reads:
ATGGCATTGCAA
TGGCATTGCAATTTG
AGATGGTATTG
GATGGCATTGCAA
GCATTGCAATTTGAC
ATGGCATTGCAATTT
AGATGGTATTGCAATTTG
Consensus:
AGATGGCATTGCAATTTGAC

Sequence Assembly (An Analogy) The DARPA Shredder Challenge

Dynamic Programming
GAATTCAGTTA
GGATCGA

Dynamic Programming
GAATTCAGTTA
GGAT-C-G--A
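The alignment on this slide can be reproduced with a global dynamic-programming alignment (Needleman-Wunsch). The sketch below is a minimal version; the scoring values (match +1, mismatch -1, gap -1) are assumptions, since the slides do not state a scoring scheme.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
```

Several alignments can tie for the best score, so the gap placement returned here may differ from the one drawn on the slide. The table has (n+1) x (m+1) cells, which is why this approach becomes expensive for genome-scale comparisons.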

De Bruijn Graphs & Assembly
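As a rough illustration of the de Bruijn idea, the sketch below breaks reads into k-mers, links each k-mer's (k-1)-base prefix to its (k-1)-base suffix, and walks the resulting path. It assumes error-free reads and a graph with no branches or repeats; the reads are error-free versions of those on the Sequence Assembly slide, and k = 5 is an arbitrary choice. Real assemblers must also handle sequencing errors, repeats, and both strands.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph):
    """Spell out the contig of a graph that forms one simple path."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one node without incoming edges")
    node = starts[0]
    contig = node
    seen = {node}
    # Follow the single outgoing edge until the path ends or branches.
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # guard against cycles caused by repeats
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = [
    "AGATGGCATTG",
    "GATGGCATTGCAA",
    "TGGCATTGCAATTTG",
    "GCATTGCAATTTGAC",
]
contig = walk_unbranched(de_bruijn(reads, k=5))
```

For these four reads the walk spells out the 20-base consensus shown on the Sequence Assembly slide.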

A Real Assembler

The Result
>P12345 Human chromosome1
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TTGGC
... and on for 150,000,000 bases

February, 2002 *

Computational Challenges
Reading the DNA sequencer chromatograms: 35 billion base calls
Putting millions of short reads together to assemble the genome: piecing 35 million reads together
Identifying the genes from the DNA sequence: finding 1% signal with >95% accuracy
Figuring out what each gene does: 20,000 x 100,000,000 comparisons

Genome Sequence
>P12345 Human chromosome1
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TTGGC
... and on for 150,000,000 bases

Eukaryotic Gene Structure [figure labels: Upstream Intergenic Region; Transcribed Region (5' UTR, Start codon, exon 1, intron 1, exon 2, intron 2, exon 3, Stop codon, 3' UTR); Downstream Intergenic Region]

Genome Sequence
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
ATTACAGATTACAGATTACAATTAGAGATTACAGAT
TACAGATTACAGATTACAGATTACAGATTACAGATT
ACCAGATTACAGA

Problem Similar to Speech or Text Recognition

Hidden Markov Models

HMM for Gene Finding
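To make the HMM idea concrete, here is a toy two-state model decoded with the Viterbi algorithm. Everything numeric is an assumption made for illustration: the states ("N" for non-coding, "C" for coding), the GC-rich emission bias of the coding state, and the transition probabilities. Real gene finders use many more states (exons, introns, splice sites, UTRs) and trained parameters.

```python
import math

# Hypothetical toy parameters, chosen only for illustration.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
        "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35}}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # Work in log space so long sequences do not underflow.
    column = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_column, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: column[p] + math.log(TRANS[p][s]))
            new_column[s] = (column[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        column = new_column
        backpointers.append(pointers)
    # Start from the best final state and follow the pointers backwards.
    state = max(STATES, key=lambda s: column[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


labels = viterbi("AAAAAAAAGGGGGGGGGG")
```

With these assumed parameters the decoder labels the A-rich prefix as non-coding and the G-rich suffix as coding, switching state exactly once.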

Genscan The Ultimate Gene Finder

How Well Do They Do? "Evaluation of gene finding programs" S. Rogic, A. K. Mackworth and B. F. F. Ouellette. Genome Research, 11: 817-832 (2001).

Gene Prediction (Evaluation)
[Figure: actual vs. predicted gene regions, labeled TP, FP, TN, FN]
Sensitivity: measure of the % of false negative results (Sn = 0.996 means 0.4% false negatives)
Specificity: measure of the % of false positive results
Precision: measure of the % of positive predictions that are correct
Correlation: combined measure of sensitivity and specificity

Gene Prediction (Evaluation)
[Figure: actual vs. predicted gene regions, labeled TP, FP, TN, FN]
Sensitivity or Recall: $Sn = TP/(TP + FN)$
Specificity: $Sp = TN/(TN + FP)$
Precision: $Pr = TP/(TP + FP)$
Correlation: $CC = (TP \cdot TN - FP \cdot FN) / [(TP+FP)(TN+FN)(TP+FN)(TN+FP)]^{0.5}$
This is a better way of evaluating
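The four measures follow directly from the TP, FP, TN, and FN counts. The short sketch below computes them; the function name and the example counts are made up for illustration and do not come from the cited evaluation.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Compute sensitivity, specificity, precision and correlation from counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pr = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return {"Sn": sn, "Sp": sp, "Pr": pr, "CC": cc}


# Hypothetical counts: 1,000 bases, of which 110 are truly coding.
scores = evaluate(tp=90, fp=10, tn=880, fn=20)
```

Because coding sequence is only a small fraction of the genome, TN dominates the counts and specificity stays high even for a poor predictor, which is why precision and the correlation coefficient are the more informative numbers.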

Computational Challenges
Reading the DNA sequencer chromatograms: 35 billion base calls
Putting millions of short reads together to assemble the genome: piecing 35 million reads together
Identifying the genes from the DNA sequence: finding 1% signal with >95% accuracy
Figuring out what each gene does: 20,000 x 100,000,000 comparisons

A List of Genes
>P12346 Gene 1
ATGTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGAT
>P12347 Gene 2
ATGAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATT
>P12348 Gene 3
ATGTTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACA...

What Biologists Want
>P12346 Gene 1
Human hemoglobin alpha chain, transports oxygen, located on chromosome 14 p.12.1
>P12347 Gene 2
Human superoxide dismutase, removes oxygen radicals and prevents rapid aging, located on chromosome 14 p.12.21
>P12348 Gene 3
Human hemoglobin beta chain, transports oxygen, located on chromosome 14 p.12.23

What Biologists Want
The trick is to use sequence similarity (sequence matching) and prior knowledge
By 2005 millions of genes had already been characterized from other organisms
Find the human genes that are similar to the already-characterized genes and assume they are pretty much the same
This is annotation by sequence homology
The key is to do rapid sequence comparisons

Definitions by Similarity
Query: Bananas
Database:
Banana: a yellow curved fruit
Bandana: a colorful kerchief
Banal: boring and obvious
Banyan: a fig that starts as an epiphyte
Ananas: genus name for pineapple

Dynamic Programming: Too Slow
GAATTCAGTTA
GGATCGA

The BLAST Search Algorithm: 1,000-10,000x faster than DP methods
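BLAST gets its speed by avoiding full dynamic programming against every database entry: it first looks for short exact word matches (seeds) through an index and only examines entries that share seeds with the query. The sketch below shows just that seeding step on the "Bananas" example from the Definitions by Similarity slide. The word size k = 3 and the function names are assumptions, and the extension and scoring stages of real BLAST are omitted.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Index every length-k word of every database entry by its text."""
    index = defaultdict(list)
    for name in database:
        text = name.upper()
        for i in range(len(text) - k + 1):
            index[text[i:i + k]].append((name, i))
    return index


def count_seed_hits(query, index, k):
    """Count exact k-word matches between the query and each indexed entry."""
    query = query.upper()
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name, _position in index.get(query[i:i + k], ()):
            hits[name] += 1
    return dict(hits)


DATABASE = {
    "Banana": "a yellow curved fruit",
    "Bandana": "a colorful kerchief",
    "Banal": "boring and obvious",
    "Banyan": "a fig that starts as an epiphyte",
    "Ananas": "genus name for pineapple",
}

index = build_word_index(DATABASE, k=3)
hits = count_seed_hits("Bananas", index, k=3)
best = max(hits.values())
candidates = [name for name, count in hits.items() if count == best]
```

Here the index is built once and each query word is a dictionary lookup, so the cost depends on the number of seeds found rather than on the full size of the database. Only the top candidates would then go on to a slower, more careful alignment.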

The BLAST Server

Computational Challenges
Reading the DNA sequencer chromatograms: solved with Phred
Putting millions of short reads together to assemble the genome: solved with Phusion
Identifying the genes from the DNA sequence: solved with Genscan
Figuring out what each gene does: solved with BLAST

Who Were the Real Heroes of The Human Genome Project? J. Craig Venter, Francis Collins, Bill Clinton, Tony Blair

Questions? david.wishart@ualberta.ca 3-41 Athabasca Hall