Algorithms in Computational Biology (236522) Winter 2012 Intro Lecture

1 Algorithms in Computational Biology (236522), Winter 2012, Intro Lecture
Lecturer: Zohar Yakhini, Taub 615. Office hours: by appointment.
TA: Limor Leibovich. Office hours: see the course site.
Slides for this lecture were initially adapted from Nir Friedman's lecture at HUJI and from Roded Sharan's course at TAU. Changes and additions were introduced over several years of teaching at the Technion by Dan Geiger, Shlomo Moran, Tomer Shlomi and Zohar Yakhini.

2 Course Information
Requirements & Grades:
2 HW assignments: 15% each.
1 programming and analysis HW assignment: 30%.
Exam: 40%. The exam grade must be above 55 for the homework grades to count.

3 Bibliography
Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998.
Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS Publishing Company, 1997.
Bioinformatics, A. Polanski and M. Kimmel, Springer, 2007.
Algorithms in Bioinformatics: A Practical Introduction, Wing-Kin Sung, Chapman and Hall.

4 Course Prerequisites
Computer Science and Probability Background: Data Structures 1 (cs234218), Algorithms 1 (cs234247), Probability (any course), strong programming skills.
Some Biology Background. Formally: none, to allow CS students to take this course. Recommended: Biology 1 or a similar Biology course, and/or a serious desire to complement your knowledge of Biology by reading the appropriate material, including papers related to the motivation and application of the algorithms.

5 Relations to Some Other Courses
Bioinformatics Software (cs236523): the course Introduction to Bioinformatics covers practical aspects of, and hands-on experience with, many web-based bioinformatics programs. Although it is not a formal requirement, it is recommended that you look at its web site and examine the relevant software.
Bioinformatics Algorithms (cs236522): this is the current course, which focuses on modeling some bioinformatics problems and presents algorithms for their solution.
Bioinformatics Project (cs236524): developing bioinformatics tools under close guidance.

6 What is Computational Biology?
Computational biology addresses the development of computational tools and techniques to analyze biological data, mainly in molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics. Computational biology is closely related to Bioinformatics.

7 The Cell
Basic unit of life. Carries the complete characteristics of the species. All cells store hereditary information in DNA. All cells transform DNA to proteins, which determine the cell's structure and function. Two classes: eukaryotes (with a nucleus) and prokaryotes (without).

8 DNA
1866: Gregor Mendel, laws of inheritance; the gene.
1953: Watson and Crick, DNA structure.
Nucleotide chain: a double-helix polymer of nucleotides.
Nucleotides/bases: Adenine (A), Guanine (G), Cytosine (C), Thymine (T).

9 DNA Components
Four nucleotide types: Adenine, Guanine, Cytosine, Thymine.
Hydrogen bonds (electrostatic connection): A-T, C-G.

10 DNA Organization
DNA is packaged (10,000-fold).
Chromatin: complex of DNA and the proteins that pack it (histones).
Chromosome: contiguous stretch of DNA; carries genes.
Chromosomes come in pairs, one from each parent.
Genome: totality of DNA material.

11 Genome Sizes
E. coli (bacterium): 4.6 x 10^6 bases
Yeast (simple fungus): 15 x 10^6 bases
Smallest human chromosome: 50 x 10^6 bases
Entire human genome: 3 x 10^9 bases

12 Human/Mouse synteny
From: Initial sequencing and analysis of the human genome, Nature

13 The Human Genome
Most human cells contain 46 chromosomes: 2 sex chromosomes (X, Y), XY in males and XX in females, and 22 pairs of chromosomes named autosomes.
Cancer cells have ABERRANT genomes.
Figure: normal human karyotype vs. HT29 karyotype. HT29 is a cell line derived from human colon carcinoma.

14 Central Dogma
DNA (gene) -> (transcription) -> mRNA -> (translation) -> protein
Cells express different subsets of the genes in different tissues and under different conditions.

15 Genes
Gene: a segment of DNA that specifies the sequence of a protein. (Mendel: a unit of heredity.)
Contains (upstream) one or more regulatory sequences that either increase or decrease the rate of its transcription.
Genes are 2-3% of human DNA. The rest is non-coding DNA. It used to be called junk DNA; we now understand that much of it is functional, and its function is under active investigation.
E. coli has ~4,000 genes.
Yeast has ~6,000 genes.
C. elegans has ~18,000 genes.
Humans have ~35,000 genes.

16 Proteins
Build the cell and drive most of its functions.
Proteins are polypeptides of amino acids. They fold into the 3D structure of lowest energy. This structure is partially determined by the sequence of amino acids that make up the protein.
Proteins are modified post-translation. These post-translational modifications (PTMs) determine much of the activity and other properties of the protein. PTM examples: phosphorylation, glycosylation.

17 Protein Structure

18 PTM example: Glycosylation Adapted from C&EN

19 Transcription
Coding sequences can be transcribed to RNA.
RNA: similar to DNA, with slightly different nucleotides: a different backbone, and Uracil (U) instead of Thymine (T).
Source: Mathews & van Holde
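The base substitution on this slide can be illustrated with a minimal Python sketch. It assumes the DNA is given as the coding (sense) strand, so that the mRNA has the same sequence with U in place of T; the function name `transcribe` is hypothetical and chosen only for illustration.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding-strand sequence.

    Assumption: the input is the coding (sense) strand, so transcription
    reduces to replacing every T with U.
    """
    dna = coding_strand.upper()
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected DNA symbols: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCTTAA"))  # AUGGCUUAA
```

The sketch ignores everything else about transcription (promoters, strand choice, the different sugar backbone); it only shows the change of alphabet from ACGT to ACGU.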

20 Translation
Translation is mediated by the ribosome, a complex of protein and rRNA molecules.
The ribosome attaches to the mRNA at a translation initiation site. It then moves along the mRNA sequence and facilitates the construction of the appropriate sequence of amino acids. This chain is released and folds into a protein.
In 2009 Prof. Ada Yonath received the Nobel Prize for solving the structure of the ribosome.

21 Central Dogma in Action

22 RNA roles
Messenger RNA (mRNA): encodes protein sequences. Every three nucleotides translate to one amino acid (the protein building block).
Transfer RNA (tRNA): decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.
Ribosomal RNA (rRNA): part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the amino acid hanging from the tRNA to the amino acid chain being created.

23 The Genetic Code
Codon: a triplet of bases that codes for a specific amino acid (except the stop codons).
Stop codons: signal termination of the protein synthesis process.
Redundancy: different codons may code for the same amino acid.
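The three properties above (triplets, stop codons, redundancy) can be seen in a short Python sketch of the standard genetic code. It assumes the reading frame starts at the first base of the input and uses one-letter amino acid codes with `*` for stop; the names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (first base varies slowest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon.

    Assumption: the reading frame starts at position 0; trailing bases
    that do not form a full codon are ignored.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AUGGCUUAA"))  # MA
# Redundancy: six different codons code for leucine (L).
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "L"))
```

With 64 codons and only 20 amino acids plus stop, most amino acids are necessarily encoded by more than one codon, which is the redundancy the slide refers to.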

24 Gene structure in Eukaryotes
Exons: coding regions; more stable during evolution.
Introns: non-coding regions. Introns are spliced out to form the mature mRNA.
Alternative splicing: exons may also be spliced out, resulting in many possible proteins per gene.
In human: ~30K genes coding for many more proteins.
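A toy Python sketch of splicing and of one simple form of alternative splicing (exon skipping) follows. It assumes exon positions are already known and given as 0-based, end-exclusive intervals, and that the first and last exons are always kept; real splicing is more varied. The function names and the example sequence are made up for illustration.

```python
from itertools import combinations


def splice(pre_mrna, exons):
    """Join the exon intervals (0-based, end-exclusive) in order,
    which drops the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def exon_skipping_isoforms(pre_mrna, exons):
    """Enumerate mature transcripts that keep the first and last exon
    and any subset of the internal exons (simplifying assumption)."""
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            isoforms.append(splice(pre_mrna, [first, *kept, last]))
    return isoforms


# Made-up pre-mRNA: exons in upper case, introns in lower case.
pre_mrna = "AUGGC" + "guaag" + "CCAUU" + "guuag" + "GGUAA"
exons = [(0, 5), (10, 15), (20, 25)]

print(splice(pre_mrna, exons))                  # AUGGCCCAUUGGUAA
print(exon_skipping_isoforms(pre_mrna, exons))  # ['AUGGCGGUAA', 'AUGGCCCAUUGGUAA']
```

Under this simplification a gene with n exons yields up to 2^(n-2) transcripts, which gives a sense of how a limited number of genes can code for many more proteins.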

25 Central Dogma (+ Splicing)
DNA -> (transcription) -> pre-mRNA -> (splicing) -> mature mRNA -> (translation) -> protein

26 Model Organisms
Eukaryotes; increasing complexity. Easy to store and manipulate.
Budding yeast: 1 cell, 6K genes
Nematode worm: 959 cells, 19K genes
Fruit fly: 14K genes
Mouse (vertebrate, mammal): 30K genes

27 How is phenotypic diversity created across genetically identical cells?
Virtually every cell in your body contains a complete set of genes, but they are not all turned on in every tissue/cell type. Each cell in your body expresses only a small subset of genes at any time.

28 Gene Regulation
DNA -> (transcription) -> pre-mRNA -> (splicing) -> mature mRNA -> (translation) -> protein
Transcription factors (TFs) control transcription by binding to specific DNA sequence motifs.

29 Fig. 16.6

30 High-throughput measurement technologies
Genetic interactions: HiC
Protein-DNA (transcriptional) interactions: ChIP-on-chip, ChIP-seq
Protein-RNA interactions: CLIP
Protein-protein interactions (PPI): yeast two-hybrid
Genome(s) and related variations (DNA): DNA-seq, SNP microarrays, CGH microarrays, Methyl-seq
Transcriptome (RNA): microarrays, RNA-seq (NGS), miRNA microarrays, miRNA-seq
Proteome and PTMs (protein): mass spectrometry, protein arrays, ELISA, NMR, crystallography

31 Watson and Crick James Watson and Francis Crick discovered, in 1953, the double helix structure of DNA.

32 Watson-Crick Complementarity
A binds to T; C binds to G.
Perfect match:
AATGCTTAGTC
TTACGAATCAG
One-base mismatch:
AATGCGTAGTC
TTACGAATCAG
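The pairing rule and the two examples on this slide can be checked with a small Python sketch. It assumes, as on the slide, that the two strands are written aligned position by position (strand orientation is ignored); `complement` and `count_mismatches` are hypothetical helper names.

```python
# Watson-Crick pairing: A binds to T, C binds to G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def count_mismatches(top: str, bottom: str) -> int:
    """Count positions where `bottom` is not the complement of `top`.

    Assumption: both strands are aligned position by position.
    """
    if len(top) != len(bottom):
        raise ValueError("strands must have the same length")
    return sum(COMPLEMENT[a] != b for a, b in zip(top, bottom))


print(complement("AATGCTTAGTC"))                       # TTACGAATCAG
print(count_mismatches("AATGCTTAGTC", "TTACGAATCAG"))  # 0
print(count_mismatches("AATGCGTAGTC", "TTACGAATCAG"))  # 1
```

The single mismatch in the second pair is at the sixth position, where G faces A instead of C.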

33 Microarray technology Every spot represents a gene

34 Labeled Hybridization

35 Expression Profiling on Microarrays
Differentially label the query sample and the control (1-3). Mix and hybridize to an array. Analyze the image to obtain expression-level information.
Zohar Yakhini, Israel Steinfeld

36 Thermal Ink Jet Arrays, by Agilent Technologies
cDNA array, inkjet deposition.
In-situ synthesized oligonucleotide array mers.
Zohar Yakhini, Israel Steinfeld

37 Evolution
Evolution of new organisms is driven by:
Diversity: different individuals carry different variants of the same basic blueprint.
Mutations: the DNA sequence changes spontaneously, e.g. single-base changes, deletion/insertion of DNA segments, etc.
Selection bias: fitter variants have a larger expected number of offspring.

38 Evolution
Related organisms have similar DNA: similarity in the sequences of proteins, and similarity in the organization of genes along the chromosomes.
Evolution plays a major role in biology: many mechanisms are shared across a wide range of organisms, and during the course of evolution existing components are adapted for new functions.

39 The Tree of Life
Source: Alberts et al.

40 Algorithms in Computational Biology

41 The Four Pillars of Comp Bio
Biological: What is the task? What is the relevant question? Can it be modeled, and how?
Algorithmic: How to perform the task at hand efficiently and effectively?
Learning: How to use data to adapt/estimate/learn parameters and models that will address the task?
Statistics: How to estimate the confidence of our findings? How to distinguish true observations from spurious results? How to confidently form new hypotheses?

42 Example: DNA Sequence Comparison
Biological: Similar genes might have similar functions, thus we need to identify similar DNA sequences.
Algorithmic: Find efficient ways to compute similarity between sequences.
Learning: How do we define similar sequences? Use data to define similarity.
Statistics: When we compare to ~10^6 sequences, what is a spurious match (to be expected under a null model) and what is a true one?
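The statistical question can be made concrete with a short Python sketch. It assumes each comparison yields a p-value under some null model; the expected number of spurious matches is then the number of comparisons times the p-value cutoff. The Bonferroni correction shown is one common way to compensate, chosen here for illustration and not taken from the slides; the function names are hypothetical.

```python
def expected_spurious_hits(n_comparisons: int, p_cutoff: float) -> float:
    """Expected number of null (spurious) matches that pass a p-value
    cutoff when every comparison follows the null model."""
    return n_comparisons * p_cutoff


def bonferroni_cutoff(alpha: float, n_comparisons: int) -> float:
    """Per-comparison cutoff that keeps the chance of at least one
    spurious match at or below alpha (Bonferroni correction)."""
    return alpha / n_comparisons


n = 10**6  # database size, as on the slide
print(expected_spurious_hits(n, 1e-4))  # about 100 spurious matches
print(bonferroni_cutoff(0.05, n))       # about 5e-08
```

A match with p = 1e-4 looks convincing in isolation, but against ~10^6 sequences roughly a hundred such matches are expected by chance alone.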

43 Probe Specificity

44 Feature Extraction and QC Statistics Example: Dapple by J. Buhler.

45 Differential Expression (Healthy vs. Disease)
Prior knowledge of binary (categorical) sample information is required, e.g. tumor vs. normal, subtypes of a pathology, prognosis, etc.
Identifying (statistically significant) informative genes: provides biological insight; indicates promising research directions; reduces data dimensionality; supports a diagnostic assay.
Statistics: soundly assign significance to every gene.
Algorithmics: efficiently assess significance for tens of thousands of genes.
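One possible way to score genes for differential expression is sketched below in Python. It uses Welch's t-statistic per gene, which is an illustrative choice and not necessarily the method used in the course; it computes a score only, not a p-value. The expression values, gene names and function names are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (unequal variances allowed).
    Needs at least two values per group and non-zero total variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expression, is_disease):
    """Rank genes by |t| between disease and healthy samples.

    expression: dict mapping gene name -> list of values, one per sample.
    is_disease: list of booleans, one per sample (the binary labels).
    """
    scores = {}
    for gene, values in expression.items():
        disease = [v for v, d in zip(values, is_disease) if d]
        healthy = [v for v, d in zip(values, is_disease) if not d]
        scores[gene] = welch_t(disease, healthy)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Made-up toy data: 3 healthy samples followed by 3 disease samples.
is_disease = [False, False, False, True, True, True]
expression = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],  # clearly higher in disease
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no visible difference
}
print(rank_genes(expression, is_disease))  # ['geneA', 'geneB']
```

Turning these scores into sound significance estimates for tens of thousands of genes additionally requires a null distribution and a multiple-testing correction, which this sketch leaves out.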

46 70-gene signature predicts good/bad prognosis in breast cancer
van 't Veer et al., Nature 2002.
Find the 70 genes most correlated with prognosis. Generate a good/bad prognosis signature. Compare with a validation cohort. Bad-signature patients are 28 times more likely to develop distant metastases.
In 2007 the FDA approved MammaPrint for diagnostic use. The first microarray-based diagnostic test!

47 Course Goals
Learning about computational tools for molecular biology:
Describe computational tasks that address major questions in modern molecular biology.
Discuss the biological motivation and setup for these tasks.
Understand the kinds of solutions that exist and what principles justify them.
Adapt and develop method variants and apply them to data.

48 Topics to be addressed
Sequence alignment
Gene expression data analysis
Statistical enrichment
RNA secondary structure
HMMs and their applications
State-of-the-art measurement technologies and related computational questions
Phylogeny
Protein structure

49 Best wishes for a PROLIFIC SEMESTER!!!