Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

Introduction to Algorithms in Computational Biology
Lecture 1

Course Information
Meetings:
  Lecture, by Dan Geiger: Mondays 16:30-18:30, Taub 4.
  Tutorial, by Ydo Wexler: Tuesdays 10:30-11:30, Taub 2.
Grade:
  20% from five question sets. These question sets are obligatory. Each contains 4-6 theoretical problems. Submit in pairs within two weeks.
  80% from the test. The test grade must exceed 55 for the homework grade to count.
Background reading: The first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al.
Information and handouts:
  This class has been edited from Nir Friedman's lecture, which is available at [URL missing]. Changes made by Dan Geiger.
  A brochure with xeroxed material is available at the Taub library.

Course Prerequisites
Computer Science and Probability background:
  Data Structures 1 (cs234218)
  Algorithms 1 (cs234247)
  Probability (any course)
Some Biology background:
  Formally: none, to allow CS students to take this course.
  Recommended: Biology 1 (especially for those in the Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).
Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.

Relations to Some Other Courses
Intro to Bioinformatics (cs236523). This course covers practical aspects and hands-on experience with web-based bioinformatics software. Although it is not a formal requirement, you are encouraged to look at its web site and examine the relevant software.
Algorithms in Computational Biology (cs236522). This is the current course, which focuses on modeling some bioinformatics problems and presents algorithms for their solution.
Bioinformatics project (cs ). Developing bioinformatics tools under close guidance.

First Homework Assignment
Read carefully the first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al.
Solve two of the questions for Chapter 2 and two of the questions for Chapter 3.
Due: during the third tutorial class, or earlier in the teaching assistant's mail slot.
Remember to submit in pairs.

Computational Biology
Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.
Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly by restricting the field to molecular biology only.

Examples of Areas of Interest
Building evolutionary trees from molecular (and other) data
Efficiently assembling genomes of various organisms
Understanding the structure of genomes (SNP, SSR, genes)
Understanding the function of genes in the cell cycle and in disease
Deciphering the structure and function of proteins

Exponential growth of biological information: growth of sequences, structures, and literature.

Four Aspects
Biological: What is the task?
Algorithmic: How to perform the task at hand efficiently?
Learning: How to adapt/estimate/learn parameters and models describing the task from examples?
Statistics: How to differentiate true phenomena from artifacts?

Example: Sequence Comparison
Biological: Evolution preserves sequences, thus similar genes might have similar function.
Algorithmic: Consider all ways to align one sequence against another.
Learning: How do we define similar sequences? Use examples to define similarity.
Statistics: When we compare against ~10^6 sequences, which matches are random and which are true?

Course Goals
Learning about computational tools for (primarily) molecular biology.
We will cover computational tasks that are posed by modern molecular biology.
We will discuss the biological motivation and setup for these tasks.
We will understand the kinds of solutions that exist and what principles justify them.

Topics I
Dealing with DNA/protein sequences:
  Finding similar sequences
  Models of sequences: Hidden Markov Models
  Gene finding
  Genome projects and how sequences are found
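The algorithmic aspect of the sequence comparison example, considering all ways to align one sequence against another, can be sketched with dynamic programming. The block below is a minimal illustration of global alignment scoring (the Needleman-Wunsch recurrence), not necessarily the formulation used later in the course. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best score over all global alignments of s and t.

    The scoring scheme (match/mismatch/gap) is an illustrative assumption.
    """
    # prev[j] = best score for aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # aligning s[:i] with the empty prefix: i gaps
        for j in range(1, len(t) + 1):
            # Align s[i-1] with t[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Align s[i-1] with a gap.
            up = prev[j] + gap
            # Align t[j-1] with a gap.
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("AAG", "AGA"))
```

The table has (len(s)+1) x (len(t)+1) cells, so the best of exponentially many alignments is found in time proportional to the product of the two lengths.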

Topics II
Models of genetic change:
  Long term: evolutionary changes among species
    Reconstructing evolutionary trees from sequences
  Short term: genetic variations in a population
    Finding genes by linkage and association

Topics III (one class, if time allows)
Protein world:
  How proteins fold: secondary and tertiary structure
  How to predict protein folds from sequence data
  How to analyze protein changes from raw experimental measurements (MassSpec)

Human Genome
Most human cells contain 46 chromosomes:
  2 sex chromosomes (X, Y): XY in males, XX in females.
  22 pairs of chromosomes named autosomes.

DNA Organization
Source: Alberts et al

The Double Helix
Source: Alberts et al

DNA Components
Four nucleotide types:
  Adenine
  Guanine
  Cytosine
  Thymine
Hydrogen bonds (electrostatic connection):
  A-T
  C-G

Genome Sizes
E. coli (bacteria): 4.6 x 10^6 bases
Yeast (simple fungi): 15 x 10^6 bases
Smallest human chromosome: 50 x 10^6 bases
Entire human genome: 3 x 10^9 bases

Genetic Information
Gene: basic unit of genetic information. Genes determine the inherited characters.
Genome: the collection of genetic information.
Chromosomes: storage units of genes.
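The A-T and C-G pairing listed under DNA Components means that one strand of the double helix determines the other. As a small sketch (the names `PAIR`, `complement`, and `reverse_complement` are chosen here for illustration and do not come from the lecture), the partner strand can be computed like this:

```python
# Base pairing from the DNA Components slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Base-by-base partner of a DNA strand."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand):
    """Partner strand read in its own direction.

    The two strands of the helix run in opposite directions, so the
    complement is reversed to read it the conventional way.
    """
    return complement(strand)[::-1]


print(complement("AAGC"))
print(reverse_complement("AAGC"))
```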

Genes
The DNA strings include:
  Coding regions ("genes")
    E. coli has ~4,000 genes
    Yeast has ~6,000 genes
    C. elegans has ~13,000 genes
    Humans have ~32,000 genes
  Control regions
    These are typically adjacent to the genes
    They determine when a gene should be expressed
  "Junk" DNA (unknown function)

The Cell
All cells of an organism contain the same DNA content (and the same genes), yet there is a variety of cell types.

Example: Tissues in Stomach
How is this variety encoded and expressed?

Central Dogma
Gene -> (transcription) -> mRNA -> (translation) -> Protein
Cells express different subsets of the genes in different tissues and under different conditions.

Transcription
Coding sequences can be transcribed to RNA.
RNA nucleotides:
  Similar to DNA, with a slightly different backbone
  Uracil (U) instead of Thymine (T)
Source: Mathews & van Holde

Transcription: RNA Editing
1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists
Exons hold information; they are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules diffuse through the nuclear membrane to the cytoplasm.

RNA roles
Messenger RNA (mRNA): Encodes protein sequences. Every three nucleotides translate to one amino acid (the protein building block).
Transfer RNA (tRNA): Decodes the mRNA molecules to amino acids. It attaches to the mRNA on one side and holds the appropriate amino acid on its other side.
Ribosomal RNA (rRNA): Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created.
...

Translation (outside the nucleus)
Translation is mediated by the ribosome.
The ribosome is a complex of protein and rRNA molecules.
The ribosome attaches to the mRNA at a translation initiation site.
The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (a polypeptide), which is released and folds into a protein.
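The transcription and translation steps above can be illustrated with a short sketch. It makes several simplifying assumptions that are not from the lecture: the input is the coding strand with introns already removed, transcription is modeled as replacing T with U, and only a small excerpt of the standard genetic code is included (codons outside the excerpt raise an error). All names are chosen for the example.

```python
# Excerpt of the standard genetic code (an assumption: only a few codons).
CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def transcribe(coding_dna):
    """Model transcription: RNA uses Uracil (U) instead of Thymine (T)."""
    return coding_dna.replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")  # simplified translation initiation site
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in the excerpt table")
        protein.append(CODON_TABLE[codon])
    return "".join(protein)


mrna = transcribe("ATGTTTGGCGCTTAA")
print(mrna)
print(translate(mrna))
```

Each letter in the result is the one-letter code of an amino acid (M for methionine, F for phenylalanine, G for glycine, A for alanine).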

Genetic Code

Protein Structure
Proteins are polypeptides of amino acids.
The structure of a protein is (mostly) determined by the sequence of amino acids that make up the protein.
There are 20 amino acids from which proteins are built.

Evolution
Related organisms have similar DNA:
  Similarity in sequences of proteins
  Similarity in organization of genes along the chromosomes
Evolution plays a major role in biology:
  Many mechanisms are shared across a wide range of organisms
  During the course of evolution, existing components are adapted for new functions

Evolution
Evolution of new organisms is driven by:
  Diversity: Different individuals carry different variants of the same basic blueprint.
  Mutations: The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
  Selection bias

The Tree of Life
Source: Alberts et al

Example for Phylogenetic Analysis
Input: four nucleotide sequences: AAG, [sequence missing], GGA, AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): Pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (also called a phylogenetic tree).
[Figure: an example tree over the input sequences. Total #substitutions = 4]

Example Continued
There are many trees possible. For example:
[Figure: two alternative trees. Left tree: total #substitutions = 3. Right tree: total #substitutions = 4.]
The left tree is better than the right tree.
Questions:
  Does this principle yield realistic phylogenetic trees? (Evolution)
  How can we compute the best tree efficiently? (Computer Science)
  What is the probability of substitutions given the data? (Learning)
  Is the best tree found significantly better than others? (Statistics)
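Counting substitutions on a fixed tree, as the parsimony principle above requires, can be sketched with Fitch's algorithm for rooted binary trees. This is a swapped-in standard technique, not necessarily the method presented in the course. The transcript lists only three of the four input sequences; the fourth sequence used below, AAA, is a hypothetical choice made for illustration, and the two tree shapes are likewise assumptions rather than the trees drawn on the slides. Trees are written as nested pairs with sequences at the leaves.

```python
def fitch_site(tree, site):
    """Return (candidate ancestral symbols, substitutions) for one position."""
    if isinstance(tree, str):  # a leaf: the observed symbol, no substitutions
        return {tree[site]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, site)
    right_set, right_cost = fitch_site(right, site)
    common = left_set & right_set
    if common:
        # The children can agree on a symbol: no substitution needed here.
        return common, left_cost + right_cost
    # The children disagree: one substitution is charged at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Minimum total number of substitutions over all positions."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


# "AAA" is a hypothetical fourth sequence (missing from the transcript).
tree_a = (("AAG", "AAA"), ("GGA", "AGA"))
tree_b = (("AAG", "GGA"), ("AAA", "AGA"))
print(parsimony_score(tree_a, 3))
print(parsimony_score(tree_b, 3))
```

Under these assumptions the first grouping needs fewer substitutions than the second, so parsimony prefers it. Searching over all tree shapes for the best one is a separate and much harder problem than scoring a single tree.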

Werner's Syndrome
A successful application of genetic linkage analysis

The Disease
First references in the 1960s
Causes premature ageing
Linkage studies from 1992
WRN gene cloned in 1996
Subsequent discovery of mechanisms involved in wild-type and mutant proteins

A Sample Input
[Figure: pedigree of one family with five numbered individuals, each labeled H or D and annotated with marker genotypes (A1/A1, A2/A2, A1/A2). The phase is inferred and a recombinant individual is marked.]
The study used 13 markers; here we see only one.
The study used 14 families; here we see only one.

Genehunter Output
[Figure: Genehunter output table with columns position, LOD_score, and information, plus each marker's name (D8S131, D8S339, D8S); data skipped. Annotations: distance between markers in centimorgans; most likely position; maximum log likelihood score.]
LOD score: log likelihood of placing the disease gene at a given distance, relative to it being unlinked.
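The LOD score described above compares the likelihood of the data when the disease gene is linked at some distance against the likelihood when it is unlinked. The sketch below covers only the simplest textbook case, a two-point score for phase-known, fully informative meioses, where theta is the recombination fraction and theta = 0.5 means unlinked. Genehunter itself computes multipoint scores over whole pedigrees, which this example does not attempt. The function and parameter names are assumptions.

```python
import math


def lod_score(recombinants, meioses, theta):
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses.

    Simplified two-point setting (an assumption for this sketch).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    # Each meiosis is recombinant with probability theta.
    linked = theta ** recombinants * (1.0 - theta) ** (meioses - recombinants)
    unlinked = 0.5 ** meioses
    if linked == 0.0:
        return float("-inf")  # the data are impossible at this theta
    return math.log10(linked / unlinked)


# Scan a few candidate recombination fractions for 1 recombinant in 10 meioses.
for theta in (0.05, 0.1, 0.2, 0.5):
    print(theta, round(lod_score(1, 10, theta), 3))
```

The theta with the highest score is the most likely distance. In this simplified setting the maximum falls at the observed fraction of recombinants.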

Final Location
[Figure: map showing the location of marker D8S339, marker D8S259, marker D8S131, and the final location of the WRN gene.]
Error in location by genetic linkage: about 1.25M base pairs.