CS 68: BIOINFORMATICS. Prof. Sara Mathieson Swarthmore College Spring 2018


Outline: Jan 24
  Central dogma of molecular biology
  Sequencing pipeline
  Begin: genome assembly
Note: office hours Monday 3-5pm and Wednesday 1-3pm
Age of universe: 13.8 billion years. Age of Earth: 4.5 billion years. Age of life: 3.8 billion years.

The Central Dogma of Molecular Biology

More correctly stated: The central dogma states that information in nucleic acid can be perpetuated or transferred but the transfer of information into protein is irreversible. (B. Lewin, 2004)
Image: http://ib.bioninja.com.au/

DNA --> RNA --> protein
DNA: ATGCAATCAGATTAG
RNA: CAAUCAGAU (the codons between the start codon ATG and the stop codon TAG)
Protein: Q S D
GFP protein example

Codons
Codons are made up of 3 consecutive RNA bases.
Translation begins at the start codon (AUG in RNA, written ATG in the DNA coding sequence) and stops at one of the stop codons (UAG, UGA, or UAA in RNA; TAG, TGA, or TAA in DNA).
Each codon is translated into an amino acid, and amino acids make up proteins.
If each triplet of bases encoded a unique amino acid, how many amino acids would there be?
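The counting question above can be checked by brute force. This is a minimal sketch that enumerates every ordered triplet over the four RNA bases; the names `BASES`, `codons`, and `num_codons` are illustrative and not from the slides.

```python
from itertools import product

BASES = "ACGU"  # the four RNA bases

# Enumerate every ordered triplet of bases (each position has 4 choices).
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

num_codons = len(codons)  # 4 * 4 * 4
print(num_codons)
```

Since there are only 20 standard amino acids, most amino acids must be encoded by more than one codon.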

Codon table
A: GCU,GCC,GCA,GCG
R: CGU,CGC,CGA,CGG,AGA,AGG
N: AAU,AAC
D: GAU,GAC
C: UGU,UGC
Q: CAA,CAG
E: GAA,GAG
G: GGU,GGC,GGA,GGG
H: CAU,CAC
I: AUU,AUC,AUA
L: UUA,UUG,CUU,CUC,CUA,CUG
K: AAA,AAG
M: AUG
F: UUU,UUC
P: CCU,CCC,CCA,CCG
S: UCU,UCC,UCA,UCG,AGU,AGC
T: ACU,ACC,ACA,ACG
W: UGG
Y: UAU,UAC
V: GUU,GUC,GUA,GUG
Stop: UAA,UAG,UGA
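As a rough illustration of how the codon table is used, the sketch below inverts the standard table into a codon-to-amino-acid lookup and translates an RNA string three bases at a time. The function names `transcribe` and `translate` and the `*` symbol for stop codons are assumptions made for this example. `transcribe` is a simplification that treats the input as the coding strand and only replaces T with U.

```python
# Standard genetic code, keyed by one-letter amino acid code.
# "*" marks the stop codons (a common convention, assumed here).
AMINO_TO_CODONS = {
    "A": "GCU GCC GCA GCG",
    "R": "CGU CGC CGA CGG AGA AGG",
    "N": "AAU AAC",
    "D": "GAU GAC",
    "C": "UGU UGC",
    "Q": "CAA CAG",
    "E": "GAA GAG",
    "G": "GGU GGC GGA GGG",
    "H": "CAU CAC",
    "I": "AUU AUC AUA",
    "L": "UUA UUG CUU CUC CUA CUG",
    "K": "AAA AAG",
    "M": "AUG",
    "F": "UUU UUC",
    "P": "CCU CCC CCA CCG",
    "S": "UCU UCC UCA UCG AGU AGC",
    "T": "ACU ACC ACA ACG",
    "W": "UGG",
    "Y": "UAU UAC",
    "V": "GUU GUC GUA GUG",
    "*": "UAA UAG UGA",
}

# Invert the table: codon -> amino acid.
CODON_TO_AMINO = {
    codon: amino
    for amino, codon_list in AMINO_TO_CODONS.items()
    for codon in codon_list.split()
}


def transcribe(dna):
    """Simplified transcription: treat dna as the coding strand, swap T for U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate codon by codon from position 0, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TO_AMINO[rna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(translate("CAAUCAGAU"))
```

Applied to the RNA segment from the central dogma example, CAAUCAGAU, this gives Q, S, D. Translating the full DNA example would also include M for the start codon AUG, which encodes methionine.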

Sequencing Pipeline

Image: InTech open science

Genome Assembly

Goal of genome assembly
Input: millions of reads from next-generation sequencing
Output: (ideally) entire consensus sequence of the original DNA

Assembly vocabulary
Long read: a fragment that has been read from a genomic sequence (DNA for us), usually > 1000 bp
Short read: same as a long read but < 1000 bp (usually 50-100 bp)
Paired-end read: both ends of a fragment are read, but the portion between them is unknown
bp: base pair
kb, Mb, Gb: kilobases (10^3), megabases (10^6), gigabases (10^9)
Coverage: number of times (on average) any given base is sequenced. It is the total number of bases in all reads (n reads × L bases/read) divided by the length of the genome G, i.e. coverage = nL / G.
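The coverage definition above is simple arithmetic. Here is a minimal sketch, where the function name `coverage` and the example numbers are hypothetical and chosen only for illustration:

```python
def coverage(num_reads, read_length, genome_length):
    """Average number of times each base is sequenced: n * L / G."""
    return num_reads * read_length / genome_length


# Hypothetical example: one million 100 bp reads from a 5 Mb genome.
print(coverage(1_000_000, 100, 5_000_000))
```

With these made-up numbers each base is covered 20 times on average.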

Could you design an algorithm for genome assembly?
1) With a partner, analyze these given reads. What is L (length of each read)? What is n (number of reads)?
2) Try to assemble these given reads into one continuous string. For these small numbers we can often do this by eye, but what if n = millions and L = 100? How would you tell a computer to assemble them?
3) What is G (length of the resulting genome)? From all the numbers, compute the coverage.

Overlap graph assembly

[Figure sequence: read 1 and read 2 compared at successive alignments; the labeled frames show overlap = 17 and overlap = 1]

Takeaways
Overlaps should meet some minimum threshold T (often 1/3 of the read length)
Overlaps should have a maximum number of errors allowed (roughly 2-3 depending on the error rate and overlap threshold)
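The two takeaways can be sketched as a single function: find the longest suffix of one read that matches a prefix of another, require at least T matching positions, and tolerate a bounded number of mismatches. The name `best_overlap` and its parameters are assumptions for illustration, not the course's reference implementation. Only substitution errors are handled (no insertions or deletions).

```python
def best_overlap(read_a, read_b, min_overlap, max_errors=0):
    """Length of the longest suffix of read_a that matches a prefix of
    read_b with at most max_errors mismatches, or 0 if no overlap of
    length >= min_overlap qualifies."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(read_a), len(read_b))
    # Try the longest candidate overlap first, then shrink.
    for k in range(longest, min_overlap - 1, -1):
        suffix = read_a[-k:]
        prefix = read_b[:k]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_errors:
            return k
    return 0


print(best_overlap("ACGTTGCA", "TGCAGGTC", min_overlap=3))
print(best_overlap("AAAC", "CGGG", min_overlap=2))
```

The second call returns 0 because the only matching overlap has length 1, which is below the threshold. Each call tries up to L overlap lengths and compares up to L bases for each, so a single pair costs O(L^2) in the worst case.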

Overlap graph (directed, why?)
read 1 --17--> read 2
What is the runtime for creating the overlap graph?
O(n^2) pairs, O(L^2) for each pair => really slow
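A naive construction that matches this runtime analysis might look like the following sketch. It checks every ordered pair of reads and, for simplicity, assumes error-free reads (exact matches only). The names `exact_overlap` and `build_overlap_graph` and the three short reads are made up for illustration.

```python
from itertools import permutations


def exact_overlap(read_a, read_b, min_overlap):
    """Longest suffix of read_a equal to a prefix of read_b (0 if shorter
    than min_overlap). Assumes min_overlap >= 1 and error-free reads."""
    for k in range(min(len(read_a), len(read_b)), min_overlap - 1, -1):
        if read_a.endswith(read_b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return {(i, j): overlap_length} for every ordered pair of distinct
    reads whose suffix/prefix overlap meets the threshold."""
    graph = {}
    # Ordered pairs: the edge i -> j is not the same as j -> i.
    for i, j in permutations(range(len(reads)), 2):
        k = exact_overlap(reads[i], reads[j], min_overlap)
        if k:
            graph[(i, j)] = k
    return graph


reads = ["ACGTTG", "TTGCAA", "CAAGGT"]  # hypothetical reads
print(build_overlap_graph(reads, min_overlap=3))
```

The loop over ordered pairs is the O(n^2) factor, and each overlap check is the O(L^2) factor.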

Overlap graph
[Figure: overlap graph over read 1, read 2, read 3, read 4, read 5, read 6]

Perfect graph traversal
[Figure: reads in the order read 6, read 2, read 5, read 3, read 4, read 1, with the final sequence]
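One way to picture the traversal is a greedy walk: start at the read that no other read overlaps into, repeatedly follow the largest remaining overlap, and append only the non-overlapping tail of each read. This is a sketch under strong assumptions (error-free reads, a single unambiguous path, no repeats). The names `overlap` and `assemble` and the example reads are hypothetical, and real assemblers are considerably more involved.

```python
def overlap(read_a, read_b, min_overlap):
    """Longest exact suffix/prefix overlap, or 0 if below min_overlap."""
    for k in range(min(len(read_a), len(read_b)), min_overlap - 1, -1):
        if read_a.endswith(read_b[:k]):
            return k
    return 0


def assemble(reads, min_overlap):
    """Greedy walk through the overlap graph, visiting every read once."""
    n = len(reads)
    edges = {i: {} for i in range(n)}  # edges[i][j] = overlap length of i -> j
    has_incoming = set()
    for i in range(n):
        for j in range(n):
            if i != j:
                k = overlap(reads[i], reads[j], min_overlap)
                if k:
                    edges[i][j] = k
                    has_incoming.add(j)

    # The walk starts at the unique read with no predecessor.
    starts = [i for i in range(n) if i not in has_incoming]
    if len(starts) != 1:
        raise ValueError("expected exactly one read with no predecessor")

    current = starts[0]
    visited = {current}
    sequence = reads[current]
    while True:
        candidates = {j: k for j, k in edges[current].items() if j not in visited}
        if not candidates:
            break
        nxt = max(candidates, key=candidates.get)  # largest overlap first
        sequence += reads[nxt][candidates[nxt]:]   # append the new tail only
        visited.add(nxt)
        current = nxt

    if len(visited) != n:
        raise ValueError("traversal did not reach every read")
    return sequence


print(assemble(["CAAGGT", "ACGTTG", "TTGCAA"], min_overlap=3))
```

For these three made-up reads the walk visits them in the order ACGTTG, TTGCAA, CAAGGT regardless of input order. The greedy choice can fail when reads contain repeats, which is one reason the slides call the traversal shown "perfect".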

Activity example: L = 10, T = 5
[Figure: reads R1, R2, R3, R4, R5, R6]
Li et al, Briefings in Functional Genomics (2012)