Computational Genomics. Irit Gat-Viks & Ron Shamir & Haim Wolfson Fall

Computational Genomics Irit Gat-Viks & Ron Shamir & Haim Wolfson Fall 2015-16 1

What s in class this week Motivation Administrata Some very basic biology Some very basic biotechnology Examples of our type of computational problems 2

Bioinformatics The information science of biology: organize, store, analyze, visualize biological data Responds to the explosion of biological data, and builds on the IT revolution 3

Paradigm shift in biological research Classical biology: focus on a single gene or sub-system. Hypothesis driven Large-scale data; Bioinformatics Systems biology: measure (or model) the behavior of numerous parts of an entire biological system. Hypothesis generating 4

Personalized medicine 6

Administration ~5 home assignments as part of a home exam, to be done independently (50%) Final exam (50%) Must pass the Final to pass the course (TAU rules) Classes: Tue 12:15-13:30; Thu 14:45-16:00 TA: Ron Zeira (Thu 16-17). 7

Administration (cont.) Web page of the course: http://www.cs.tau.ac.il/~rshamir/cg/15/ Includes slides and full lecture scribes of previous years on each of the classes. 8

Bibliography No single textbook covers the course :-( See the full bibliography list in the website (also for basic biology) Key sources: Gusfield: Algorithms for strings, trees and sequences Durbin et al.: Biological sequence analysis Pevzner: Computational molecular biology Pevzner and Shamir (eds.): Bioinformatics for Biologists 9

learn.genetics.utah.edu 10

Lecture 1: Introduction 1. Basic biology 2. Basic biotechnology + some computational challenges arising along the way Slides prepared mainly by Ron Shamir and Adi Akavia 11

1. Basic Biology Touches on Chapters 1-8 in The Cell by Alberts et al. 12

The Cell Basic unit of life. Carries complete characteristics of the species. All cells store hereditary information in DNA. All cells transform DNA to proteins, which are the robots of the cell and determine cell s structure and function. Two classes: eukaryotes (with nucleus) and prokaryotes (without). http://regentsprep.org/regents/biology/units/organization/cell.gif 13

Gregor Mendel laws of inheritance, gene 1866 Nucleotide Chain Watson and Crick DNA structure 1953 Double helix phosphate sugar Nucleotides/ Bases: Adenine (A), Guanine (G), Cytosine (C), Thymine (T). Strong covalent bonds (phophodiester linkage) between sugars Weak hydrogen bonds between base pairs 14

http://www.cs.utexas.edu/users/almstrum/s2s/f98/10-5-kidzsoft/src/rnadnatutor.html 15

DNA (Deoxy-Ribonucleic acid) Bases: Adenine (A) Purines Guanine (G) Cytosine (C) pyrimidines Thymine (T) Bonds: G - C A - T Oriented from 5 to 3. Located in the cell nucleus 16

DNA and Chromosomes DNA is packaged (10 5 -fold) Chromatin: complex of DNA and proteins that pack it (histones) Chromosome: contiguous stretch of DNA Diploid: two homologous chromosomes, one from each parent Genome: totality of DNA material 17

Replication Replication fork 18

Genes Gene: a segment of DNA that specifies a protein. The transformation of a gene into a protein is called expression. Genes are < 3% of human DNA The rest - non-coding (used to be called junk DNA ) RNA elements Regulatory regions Retrotransposons Pseudogenes and more 19

Gene Structure 20

Gene Structure 21

The Gene Finding Problem Given a DNA sequence, predict the location of genes (open reading frames) exons and introns. A simple solution: seeking stop codons. 6 ways of interpreting DNA sequence In most cases of eukaryotic DNA, a segment encodes only one gene. Difficulty in Eukaryotic DNA: introns & exons CG Ron Shamir 2010 22

Proteins: The Cellular Machines CG Ron Shamir 2010 23

Proteins Build the cell and drive most of its functions. Polymers of amino-acids (20 total), linked by peptide bonds. Oriented (from amino to carboxyl group). Fold into 3D structure of lowest energy. 24

http://www.ornl.gov/hgmis/publicat/tko/index.htm DNA RNA protein The hard disk One program Its output transcription translation 25

RNA (Ribonucleic acid) Bases: Adenine (A) Guanine (G) Cytosine (C) Uracil (U); replaces T Oriented from 5 (phosphate) to 3 (sugar). Single-stranded => flexible backbone => secondary structure => catalytic role. 26

The RNA Folding Problem Given an RNA sequence, predict its (secondary structure) folding = the one that creates a maximum number of matched pairs GCCUUAAUGCACAUGGGCAAGCCCACGUAGCUAGUCGCGCGACACCAGUCCCAAAUAUGUUCACCCAACUCGC CUGACCGUCCCGCAGUAGCUAUACUACCGACUCCUACGCGGUUGAAACUAGACUUUUCUAGCGAGCUGUCAUA GGUAUGGUGCACUGUCUUUAAUUUUGUAUUGGGCCAGGCACGAAAGGCUUGGAAGUAAGGCCCCGCUUGACCC GAGAGGUGACAAUAGCGGCCAGGUGUAACGAUACGCGGGUGGCACGUACCCCAAACAAUUAAUCACACUGCCC GGGCUCACAUUAAUCAUGCCAUUCGUUGCCGAUCCGACCCAUAAGGAUGUGUAUGCCUCAUUCCCGGUCGGGG CGGCGACUGUUAACGCAUGAGAACUGAUUAGAUCUCGUGGUAGUGCUUGUCAAAUAGAAUGAGGCCAUUCCAC AGACAUAGCGUUUCCCAUGAGCUAGGGGUCCCAUGUCCAGGUCCCCUAAAUAAAAGAGUCUCAC 27 http://www.phys.ens.fr/~wiese/highlights/rna-folding.html 27

Transcription Template http://www.iacr.bbsrc.ac.uk/notebook/courses/guide/words/transcriptiongif.htm 28

The Genetic Code Codon - a triplet of bases, codes a specific amino acid (except the stop codons) Stop codons - signal termination of the protein synthesis process Different codons may code the same amino acid http://ntri.tamuk.edu/cell/ribosomes.html 29

Translation http://biology.kenyon.edu/courses/biol114/chap05/chapter05.html#protein 30

Expression and Regulation DNA RNA Protein transcription translation Transcription factors (TFs) : proteins that control transcription by binding to specific DNA sequence motifs. Gene 32

Proteins: The Cellular Machines 33

The Protein Folding Problem Given a sequence of amino acids, predict the 3D structure of the protein. Motivation: functionality of protein is determined by its 3D structure. Solution Approaches: Homology Threading de novo (=from scratch) 34 CG Ron Shamir 2010

The Human Genome: numbers 23 pairs of chromosomes ~3,200,000,000 bases ~21,000 genes Gene length: 1000-3000 bases, spanning 30-40K bases 35

Model Organisms Eukaryotes; increasing complexity Easy to grow, manipulate. Budding yeast 1 cell 6K genes Nematode worm 959 cells 19K genes Fruit fly vertebrate-like 14K genes mouse mammal 30K genes Lots of common ground with humans: many / most genes are common but with mutations 36

Sequence Alignment problems Given two sequences, find their best alignment: Match with insertion/deletion of min cost. Same for best match of contiguous subeq. Same for several sequences Workhorse of Bioinformatics! Key challenge: huge volume of data (more on this later) CG Ron Shamir 2010 37

Introduction II: Basic Biotechnology and computational challenges Ron Shamir and Roded Sharan CG, Fall 2014-15 39

Restriction Enzymes Natural role: break foreign DNA entering the cell. Ability: Breaks the phosphodiester bonds of a DNA upon appearance of a certain cleavage (cut) sequence. Different sequence for each enzyme Hundreds of different enzymes known. Digestion = application of restriction enzymes to a sequence. 40

Cloning Foreign DNA Cloning vector (plasmids) Recombinant DNA Introduction into host cell Use of antibiotics to grow recombinant cells

PCR 5 3 3 5 Denaturation 5 3 3 5 Annealing 5 3 5 3 3 5 3 5 5 Extension 3 5 3 Cycle 1 3 5 3 5 5 5 3 3 3 5 5 3 Cycle 2 3 5 3 5 5 3 5 3 5 3 3 5 5 3 3 5 Cycle 3 5 3 3 5 3 5 3 5 42

http://www.atdbio.com/content/20/sequencing-forensic-analysis-and-genetic-analysis 43

Gel Electrophoresis Use: race digested DNA fragments through electrically charged gel Goals: Separate a mixture of DNA fragments Measure length of DNA fragments How does it work: smaller molecule travel faster than larger ones same size and shape the same movement speed 44

http://dlab.reed.edu/projects/vgm/vgm/vgmprojectfolder/vgm/red/red.isg/mapping.html 45

The Double Digest Problem Given 3 sets of distances {X i } {Y i } {Z i }, reconstruct cut sites A 1 < <A n and B 1 < <B m s.t. {A i -A i-1 }={X}, {B i -B i-1 }={Y} for C=A U B (ordered), {C i -C I-1 }={Z} Complexity: NP hard, many heuristics. 46 46

The Partial Digest Problem Problem: Given a (multi-) set of distances { X i -X j } 1 i j n, reconstruct the original series X 1,,X n Complexity: unknown (yet) 47 47

Sequencing -----G----G Sequencing: determining the sequence of bases in a given DNA molecule. Classical approach: gel electrophoresis ---A-----A- -CC---CC -- T---T------ Basic idea: knowing the lengths of all prefixes ending with letter X gives a partial seq Creating DNA strands of different lengths : catalyzing replication in environment with terminator A*. Repeat separately with C*, G*, T* Abilities: reconstructs sequences of 500-1000 nucleotides. 48

49 http://www.atdbio.com/content/20/sequencing-forensic-analysis-and-genetic-analysis

The Sequence Assembly Problem Given a set of sub- strings, find the shortest (super)string containing all the members of the set. http://www.ornl.gov/hgmis/graphics/slides/images1.html 51

Rearrangement Rearrangement is a change in the order of complete segments along a chromosome. http://www.copernicusproject.ucr.edu/ssi/hsbiologyresources.htm 52

Genome Rearrangements Challenges: Reconstruct the evolutionary path of rearrangements Shortest sequence of rearrangements 53 between two permutations 53

More problems in sequencing data Solve all the problems above (alignment, gene finding, rearrangements, ) on really huge datasets Need to handle practical problems of efficiency time and space Need to overcome large noise (errors) due to data size 54 54

DNA Microarrays 55

Hybridization DNA double strands form by gluing of complementary single strands Complementarity rule: A-T, G-C TGAGGC ACTCCG 56

Gene Expression Arrays Assumption: transcription level indicates gene s importance in a specific condition. Given the expression profiles of normal vs. disease: - Build an algorithm to predict if a new sample is normal or disease (classifier) - Cluster disease profiles into sub-classes - Cluster genes into functional groups 58

Breast cancer treatment Van t veer et al., nature 02 59

FIN 60

Classifier construction 70-gene signature No met. Patients Met. Classify to minimize incorrect assignments in the no met. class. CG 2015 17/19 correct predictions on test set. 61

Human Variation DNA of two human beings is ~99.9% identical Phenotype and disease variation is due these 1/1000 mutations Challenges: Associate mutations to specific disease Deal with huge datasets (noise and statistics) 62 62

Challenges in network analysis: identifying modules 63

Complexity summary ~21,000 genes in the genome Hard to identify Harder to figure their function Even harder to figure how they work together 64

Promise and problems in comparative genomics (2009) genes arabidposis c. elegans h. sapiens s. cerevisiae arabidopsis 26207 0.19 0.24 0.42 c. elegans 19992 0.26 0.38 0.38 h. sapiens 21673 0.30 0.28 0.43 s. cerevisiae 5884 0.21 0.13 0.19 a(x,y)= fraction of genes in genome y that have strict orthologs in genome x Source: 10/2009 65