Computational Genomics. Ron Shamir & Roded Sharan Fall


1 Computational Genomics Ron Shamir & Roded Sharan Fall

2 Bioinformatics. The information science of biology: organize, store, analyze and visualize biological data. Responds to the explosion of biological data, and builds on the IT revolution.

3 Paradigm shift in biological research. Classical biology: focus on a single gene or sub-system; hypothesis driven. Large-scale data; bioinformatics. Systems biology: model the behavior of an entire biological system; hypothesis generating.

4 Drug discovery via the connectivity map Lamb et al., Science 2006

5 Personalized medicine

6 Administration. ~5 home assignments as part of a home exam, to be done independently (50%). Final exam (50%). Classes: Tue 12:15-13:30; Thu 14:45-16. TA: Yaron Orenstein (Thu 16-17).

7 Administration (cont.) Web page of course: Includes slides and scribes of previous years on each of the classes.

8 Lecture 1: Introduction 1. Basic biology 2. Basic biotechnology & computational challenges

9 1. Basic Biology Touches on Chapters 1-8 in The Cell.

10 The Cell. Basic unit of life. Carries complete characteristics of the species. All cells store hereditary information in DNA. All cells transform DNA to proteins, which determine the cell's structure and function. Two classes: eukaryotes (with nucleus) and prokaryotes (without).

11 Gregor Mendel: laws of inheritance, "gene" (1866). Watson and Crick: DNA structure (1953). Figure labels: nucleotide chain, double helix, phosphate, sugar. Nucleotides/bases: Adenine (A), Guanine (G), Cytosine (C), Thymine (T). Strong covalent bonds along the sugar-phosphate backbone; weak hydrogen bonds between base pairs.

13 DNA (Deoxyribonucleic acid). Bases: purines Adenine (A) and Guanine (G); pyrimidines Cytosine (C) and Thymine (T). Bonds: G-C, A-T. Oriented from 5' to 3'. Located in the cell nucleus.

14 DNA and Chromosomes. DNA is packaged (10000-fold). Chromatin: complex of DNA and the proteins that pack it (histones). Chromosome: contiguous stretch of DNA. Diploid: two homologous chromosomes, one from each parent. Genome: totality of DNA material.

15 Replication. Figure: replication fork.

16 Genes. Gene: a segment of DNA that specifies a protein. The transformation of a gene into a protein is called expression. Genes are 3% of human DNA. The rest is non-coding "junk" DNA: RNA elements (62%), regulatory regions, retrotransposons, pseudogenes, and more.

17 Proteins Build the cell and drive most of its functions. Polymers of amino-acids (20 total), linked by peptide bonds. Oriented (from amino to carboxyl group). Fold into 3D structure of lowest energy.

18 DNA → RNA → protein. Analogy: the hard disk → one program → its output. DNA to RNA is transcription; RNA to protein is translation.

19 RNA (Ribonucleic acid). Bases: Adenine (A), Guanine (G), Cytosine (C), Uracil (U; replaces T). Oriented from 5' (phosphate) to 3' (sugar). Single-stranded => flexible backbone => secondary structure => catalytic role.

20 Transcription Template

21 The Genetic Code. Codon: a triplet of bases that codes for a specific amino acid (except the stop codons). Stop codons: signal termination of the protein synthesis process. Different codons may code for the same amino acid.
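
The codon rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the names `CODON_TABLE` and `translate` are made up here, the table is the standard genetic code written with one-letter amino acid symbols, and `*` marks a stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon.

    Assumes the reading frame starts at position 0; a trailing partial
    codon is ignored.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("CAAGUAUACCGC"))
```

The table has 64 codons but only 20 amino acids plus stop, which is why different codons may code for the same amino acid (leucine, for instance, has six codons).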

22 Translation

23 Expression and Regulation. DNA → RNA → protein (transcription, then translation). Transcription factors (TFs) control transcription by binding to specific DNA sequence motifs.
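
As a rough illustration of what "binding to specific DNA sequence motifs" means computationally, the sketch below scans a sequence for a consensus motif written with IUPAC ambiguity codes. The function name `motif_hits`, the example sequence, and the motif `TATAWA` are illustrative assumptions, not taken from the lecture; real binding-site models are usually probabilistic rather than exact patterns.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_hits(seq, motif):
    """Return 0-based start positions of all (possibly overlapping) motif matches."""
    body = "".join(IUPAC[c] for c in motif)
    # A lookahead group lets overlapping occurrences be reported too.
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(seq)]


print(motif_hits("GGTATAAAGGCTATATAG", "TATAWA"))
```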

24 The Human Genome: numbers. 23 pairs of chromosomes; ~3,200,000,000 bases; ~21,000 genes. Gene length: bases, spanning 30-40K bases.

25 Model Organisms. Eukaryotes; increasing complexity. Easy to store, manipulate. Budding yeast: 1 cell, 6K genes. Nematode worm: 959 cells, 19K genes. Fruit fly: vertebrate-like, 14K genes. Mouse: mammal, 30K genes.

27 Biotechnology - Restriction Enzymes. Restriction enzyme, natural role: break foreign DNA entering the cell. Ability: breaks the phosphodiester bonds of a DNA molecule wherever a certain cleavage sequence appears. Each enzyme has a different cleavage sequence; hundreds of different enzymes are known. Digestion = application of restriction enzymes to a sequence.
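
Digestion can be simulated on a string. The sketch below is a simplification under stated assumptions: it treats the DNA as a linear, single-strand string, and the function name `digest` and its `cut_offset` parameter are hypothetical. The example uses the EcoRI recognition site GAATTC, which is cut after the first G.

```python
def digest(seq, site, cut_offset):
    """Cut `seq` at every occurrence of `site`.

    `cut_offset` is the position inside the site where the backbone is cut
    (1 means: between the first and second base of the site).
    Returns the list of fragments, in order along the sequence.
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


fragments = digest("AAGAATTCTTGAATTCGG", "GAATTC", 1)
print(fragments, [len(f) for f in fragments])
```

The fragment lengths are exactly the kind of data that the digest problems later in the lecture take as input.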

28 Cloning. Foreign DNA + cloning vector (plasmids) → recombinant DNA → introduction into host cell → use of antibiotics to grow recombinant cells.

29 PCR. Each cycle: denaturation, annealing, extension. Figure: three successive cycles.

30 Biotechnology - Gel Electrophoresis. Use: race digested DNA fragments through an electrically charged gel. Goals: separate a mixture of DNA fragments; measure the length of DNA fragments. How does it work: smaller molecules travel faster than larger ones; molecules of the same size and shape move at the same speed. Figure sources: mapping.html; GMProjectFolder/VGM/RED/RED.ISG/

32 The Double Digest Problem. Given 3 sets of distances $\{X_i\}$, $\{Y_i\}$, $\{Z_i\}$, reconstruct cut sites $A_1 < \dots < A_n$ and $B_1 < \dots < B_m$ s.t. $\{A_i - A_{i-1}\} = \{X_i\}$, $\{B_i - B_{i-1}\} = \{Y_i\}$, and, for $C = A \cup B$ (ordered), $\{C_i - C_{i-1}\} = \{Z_i\}$. Complexity: NP-hard; many heuristics.
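
To make the problem statement concrete, here is a brute-force sketch that tries every ordering of the two single-digest fragment sets. It is exponential and only meant for tiny inputs; it is not one of the heuristics the slide alludes to. The names `fragments` and `double_digest` are illustrative, and the molecule is assumed to run from position 0 to the total length.

```python
from itertools import accumulate, permutations


def fragments(cuts, length):
    """Sorted fragment lengths produced by cutting [0, length] at `cuts`."""
    points = [0] + sorted(cuts) + [length]
    return sorted(b - a for a, b in zip(points, points[1:]))


def double_digest(X, Y, Z):
    """Return (A_cuts, B_cuts) consistent with X, Y, Z, or None if none exists."""
    length = sum(X)
    if sum(Y) != length or sum(Z) != length:
        return None
    target = sorted(Z)
    for order_x in set(permutations(X)):
        a_cuts = list(accumulate(order_x))[:-1]  # drop the end of the molecule
        for order_y in set(permutations(Y)):
            b_cuts = list(accumulate(order_y))[:-1]
            if fragments(set(a_cuts) | set(b_cuts), length) == target:
                return a_cuts, b_cuts
    return None


print(double_digest([2, 3, 5], [4, 6], [1, 2, 2, 5]))
```

Several different maps may fit the same data, so the function returns just one of them.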

33 The Partial Digest Problem. Problem: given a (multi-)set of distances $\{|X_i - X_j|\}_{1 \le i < j \le n}$, reconstruct the original series $X_1, \dots, X_n$. Complexity: unknown (yet).
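
A common way to attack small instances is backtracking: repeatedly take the largest unexplained distance and try to place a point at that distance from either end. The sketch below follows that idea under a few assumptions: the input is a non-empty list of positive integers, the leftmost point is fixed at 0, and the name `partial_digest` is illustrative. It may backtrack heavily on some inputs and says nothing about the problem's complexity.

```python
from collections import Counter


def partial_digest(distances):
    """Return sorted points (starting at 0) whose pairwise distances match, or None."""
    remaining = Counter(distances)
    width = max(remaining)
    remaining[width] -= 1
    points = [0, width]

    def place(remaining):
        remaining = +remaining  # drop entries whose count fell to zero
        if not remaining:
            return sorted(points)
        largest = max(remaining)
        # The largest remaining distance must be measured from one of the two ends.
        for candidate in (largest, width - largest):
            needed = Counter(abs(candidate - p) for p in points)
            if all(remaining[d] >= c for d, c in needed.items()):
                points.append(candidate)
                solution = place(remaining - needed)
                if solution is not None:
                    return solution
                points.pop()
        return None

    return place(remaining)


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

A solution and its mirror image produce the same distances, so the output is only determined up to reflection.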

34 Sequencing. Sequencing: determining the nucleotide sequence of a given molecule. Classical approach: gel electrophoresis. Create DNA strands of different lengths by catalyzing replication in an environment with a terminator A*; repeat separately with C*, G*, T*. Abilities: reconstructs sequences of nucleotides.
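
The logic of reading a sequence from the four terminator reactions can be shown with an idealized simulation. In this sketch (names `terminator_lanes` and `read_gel` are made up for illustration) each reaction yields one fragment for every position where its terminator base was incorporated, and the gel sorts fragments by length. Real reactions are noisy and limited in read length; none of that is modeled here.

```python
def terminator_lanes(strand):
    """For each base, list the lengths of fragments ending in that base.

    `strand` is the newly synthesized strand; a fragment of length n ends
    with the terminator that was incorporated at position n.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence back: shortest fragment first, noting which lane it is in."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = terminator_lanes("GATTACA")
print(lanes)
print(read_gel(lanes))
```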

36 The Sequence Assembly Problem. Problem: given a set of substrings, find the shortest (super)string containing all the members of the set.
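
A simple greedy heuristic gives a feel for the problem: repeatedly merge the two strings with the largest suffix-prefix overlap. This is a sketch of that heuristic, not an exact algorithm; it returns a superstring that is not guaranteed to be the shortest, and it ignores sequencing errors and reverse-complement reads. The names `overlap` and `greedy_superstring` are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedily merge reads by maximum overlap; returns one common superstring."""
    # Drop duplicates and reads that are contained in another read.
    strings = sorted(
        r for r in set(reads) if not any(r != s and r in s for s in reads)
    )
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [
            s for idx, s in enumerate(strings) if idx not in (best_i, best_j)
        ] + [merged]
    return strings[0] if strings else ""


print(greedy_superstring(["ATTAGAC", "GACCTG", "CTGAA"]))
```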

37 Gene Structure

38 The Gene Finding Problem. Given a DNA sequence, predict the location of genes (open reading frames), exons and introns.
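
The simplest version of this task is scanning for open reading frames. The sketch below looks only at the three forward-strand frames and reports stretches that begin with ATG and end at the first in-frame stop codon. It is a toy under stated assumptions: it ignores the reverse strand, introns, and any statistical signal, and the name `find_orfs` and its `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return sorted (start, end) pairs, 0-based and end-exclusive.

    Each ORF starts at an ATG and ends just after the first in-frame stop.
    `min_codons` counts codons before the stop, the start codon included.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


seq = "CCATGAAATAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```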

39 Mutations - Substitution. Mutations are a major source of genetic variation. A mutation does not always change the polypeptide. Mutation types: substitution, insertion, deletion, rearrangement.
Wild-type DNA: 5' CAA GTA TAC CGC 3'
               3' GTT CAT ATG GCG 5'
mRNA:          5' CAA GUA UAC CGC 3'
Polypeptide:   Gln Val Tyr Arg
After substitution:
DNA:           5' CAC GTA TAC CGC 3'
               3' GTG CAT ATG GCG 5'
mRNA:          5' CAC GUA UAC CGC 3'
Polypeptide:   His Val Tyr Arg
In this example of a base substitution, a "C-G" base pair has been substituted for the "A-T" pair in the original DNA. The resulting mRNA transcript has a "C" in place of the original "A". Thus the first three-letter codon is now "CAC" instead of "CAA". The polypeptide encoded by this mRNA is different from the original, because the codon "CAC" codes for the amino acid histidine, whereas the original codon "CAA" codes for glutamine.

40 Mutations - Duplication. Mutations are a major source of genetic variation. A mutation does not always change the polypeptide. Mutation types: substitution, insertion, deletion, rearrangement.
Wild-type DNA: 5' CAA GTA TAC CGC 3'
               3' GTT CAT ATG GCG 5'
mRNA:          5' CAA GUA UAC CGC 3'
Polypeptide:   Gln Val Tyr Arg
After duplication:
DNA:           5' CAA GTA TAC TAC CGC 3'
               3' GTT CAT ATG ATG GCG 5'
mRNA:          5' CAA GUA UAC UAC CGC 3'
Polypeptide:   Gln Val Tyr Tyr Arg
In this example, there has been an addition to the DNA molecule. A sequence of three base pairs has been duplicated. The mRNA transcript now contains another "UAC" codon, and the resulting polypeptide is longer by one amino acid: a second tyrosine has been included.

41 Mutations - Deletion. Mutations are a major source of genetic variation. A mutation does not always change the polypeptide. Mutation types: substitution, insertion, deletion, rearrangement.
Wild-type DNA: 5' CAA GTA TAC CGC 3'
               3' GTT CAT ATG GCG 5'
mRNA:          5' CAA GUA UAC CGC 3'
Polypeptide:   Gln Val Tyr Arg
After deletion of a T-A pair:
DNA:           5' CAA GAT ACC GC 3'
               3' GTT CTA TGG CG 5'
mRNA:          5' CAA GAU ACC GC 3'
Polypeptide:   Gln Asp Thr
In this example, one of the base pairs in the DNA molecule has been deleted. A "T-A" pair has been lost, as shown above. The mRNA is now missing the "U" that was in position 5. This has caused a shift in the reading frame of the mRNA; thus, a mutation of this type is often referred to as a "frameshift mutation". The second and third amino acids are different from those in the original polypeptide.
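
The three mutation examples (substitution, duplication, deletion) can be replayed in code. This is a small sketch, assuming the top (5' to 3') strand is the coding strand, so the mRNA is that strand with T replaced by U. The codon table `CODONS` is deliberately minimal and covers only the codons that occur on these slides; the function names are illustrative.

```python
# Minimal codon table: only the codons that appear in the slide examples.
CODONS = {
    "CAA": "Gln", "GUA": "Val", "UAC": "Tyr", "CGC": "Arg",
    "CAC": "His", "GAU": "Asp", "ACC": "Thr",
}


def transcribe(coding_dna):
    """mRNA for a coding strand given 5' to 3': replace T with U."""
    return coding_dna.replace("T", "U")


def translate(mrna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return [CODONS[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)]


wild_type = "CAAGTATACCGC"
substitution = wild_type[:2] + "C" + wild_type[3:]   # A -> C at position 3
duplication = wild_type[:9] + "TAC" + wild_type[9:]  # TAC repeated
deletion = wild_type[:4] + wild_type[5:]             # T at position 5 removed

for name, dna in [
    ("wild type", wild_type),
    ("substitution", substitution),
    ("duplication", duplication),
    ("deletion", deletion),
]:
    print(name, transcribe(dna), translate(transcribe(dna)))
```

Removing a single base shifts every downstream codon, which is why the deletion changes the second and third amino acids while the substitution changes only the first.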

42 Sequence Alignment problems. Given two sequences, find their best alignment: match with insertion/deletion of min cost. Many variants. Workhorse of bioinformatics! Key challenge: huge volume of data (more on this later).
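
The min-cost formulation is usually solved by dynamic programming. Below is a minimal sketch of the unit-cost case (edit distance), keeping only two rows of the table. The name `edit_distance` and the default costs are assumptions for illustration; the alignment variants used in practice add scoring matrices, gap penalties, and local alignment.

```python
def edit_distance(a, b, indel=1, mismatch=1):
    """Minimum total cost of substitutions, insertions and deletions turning a into b."""
    # prev[j] = cost of aligning the processed prefix of a with b[:j]
    prev = [j * indel for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        cur = [i * indel]
        for j, char_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel,                                     # delete char_a
                cur[j - 1] + indel,                                  # insert char_b
                prev[j - 1] + (0 if char_a == char_b else mismatch), # match/substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("CAAGTATACCGC", "CAAGATACCGC"))
```

The running time is proportional to the product of the two lengths, which is what makes the data volume mentioned on the slide a real obstacle.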

43 Mutations - Rearrangement. Mutations are a major source of genetic variation. A mutation does not always change the polypeptide. Mutation types: substitution, insertion, deletion, rearrangement. Figure: before and after. Rearrangement is a change in the order of complete segments along a chromosome.

44 Genome Rearrangements (M. Ozery-Flato). Challenges: reconstruct the evolutionary path of rearrangements; find the shortest sequence of rearrangements between two permutations.
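
For very small permutations the shortest rearrangement scenario can be found by exhaustive search. The sketch below assumes one particular rearrangement model, unsigned reversals of a contiguous segment, which is only one of several models studied; the name `reversal_distance` is illustrative. Breadth-first search guarantees the minimum number of reversals, but the state space grows factorially, so this is for illustration only.

```python
from collections import deque


def reversal_distance(source, target):
    """Minimum number of segment reversals turning `source` into `target` (BFS)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("source and target must contain the same elements")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                # Reverse the segment perm[i:j] (length at least 2).
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached: reversals can produce every permutation


print(reversal_distance((2, 3, 1), (1, 2, 3)))
```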

45 More problems in sequencing data. Solve all the problems above (alignment, gene finding, rearrangements, ...) on really huge datasets. Need to handle practical problems of efficiency in time and space. Need to overcome large noise (false hits) due to data size.

46 DNA Microarrays

47 Hybridization. DNA double strands form by gluing of complementary single strands. Complementarity rule: A-T, G-C. Example: TGAGGC pairs with ACTCCG.
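
The complementarity rule is easy to express as a character mapping. In this sketch the helper names (`complement`, `reverse_complement`, `hybridizes`) are made up for illustration, and `hybridizes` only checks perfect position-by-position pairing of two strands written one above the other, as on the slide; real hybridization tolerates some mismatches.

```python
# A pairs with T, G pairs with C.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, in the same left-to-right order."""
    return strand.translate(PAIRING)


def reverse_complement(strand):
    """Complementary strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


def hybridizes(top, bottom):
    """True if `bottom`, written under `top`, pairs perfectly at every position."""
    return complement(top) == bottom


print(complement("TGAGGC"), reverse_complement("TGAGGC"))
print(hybridizes("TGAGGC", "ACTCCG"))
```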

49 Gene Expression Arrays. Assumption: transcription level indicates a gene's importance in a specific condition. Variants enable measuring absolute or relative levels of a transcript. Given the expression profiles of normal vs. disease: build a classifier to recognize the disease; cluster disease profiles into sub-classes; or cluster genes into functional groups.
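
One of the simplest classifiers for expression profiles is nearest centroid: average the profiles of each class and assign a new sample to the closest average. The sketch below is only an illustration of the classification task, not the method used in any study cited in the lecture. All names are hypothetical, and the three-gene profiles are invented toy numbers, not measurements.

```python
def centroid(profiles):
    """Per-gene mean over a list of equal-length expression profiles."""
    return [sum(values) / len(profiles) for values in zip(*profiles)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(labelled_profiles):
    """Map each class label to the centroid of its training profiles."""
    return {label: centroid(ps) for label, ps in labelled_profiles.items()}


def classify(model, profile):
    """Return the label whose centroid is closest to `profile`."""
    return min(model, key=lambda label: squared_distance(model[label], profile))


# Toy data: invented expression levels of three genes, for illustration only.
training = {
    "normal": [[1.0, 0.2, 0.1], [0.9, 0.3, 0.0]],
    "disease": [[0.1, 1.1, 0.9], [0.2, 0.9, 1.0]],
}
model = train(training)
print(classify(model, [0.15, 1.0, 0.95]))
```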

50 Breast cancer treatment. van 't Veer et al., Nature 2002

51 Classifier construction. Patient classes: no met. (no metastasis) and met. (metastasis). 70-gene signature. Classify to minimize incorrect assignments in the no met. class. 17/19 correct predictions on test set.

52 Human Variation. DNA of two human beings is ~99.9% identical. Phenotype and disease variation is due to these 1/1000 mutations. Challenges: associate mutations with specific diseases; deal with huge datasets; overcome the multiple hypothesis testing problem.
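
The multiple hypothesis testing problem arises because testing many mutations at a fixed significance level produces many false positives by chance. One standard and conservative remedy is the Bonferroni correction, sketched below; it is named here as a common option, not as the method the lecture prescribes. The function name `bonferroni` and the example p-values are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests that stay significant after Bonferroni correction.

    With m tests, each p-value is compared against alpha / m, which keeps
    the probability of any false positive at or below alpha.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Three hypothetical association tests: the per-test threshold is 0.05 / 3.
print(bonferroni([0.01, 0.04, 0.0001]))
```

With a million tested variants the per-test threshold drops to 5e-8, which shows how quickly the bar rises with dataset size.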

53 Protein interaction detection: yeast two-hybrid

54 Challenges in network analysis: function prediction. Figure axes: GO semantic similarity, network distance.

55 Challenges in network analysis: identifying modules

56 Complexity summary. ~21,000 genes in the genome. Hard to identify. Harder to figure out their function. Even harder to figure out how they work together.

57 Promise and problems in comparative genomics (2009). Table of genes (values not preserved): rows and columns are the genomes Arabidopsis, C. elegans, H. sapiens, S. cerevisiae. a(x,y) = fraction of genes in genome y that have strict orthologs in genome x. Source: 10/