Important points from last time

- Substitution rates differ from site to site.
- Fit a gamma (Γ) distribution to the variation in rates.
- Γ generally has two parameters, but in biology one is fixed to ensure a mean equal to 1; the other parameter (α) is called the shape parameter.
- Estimates of α from sequences are small.
- Estimates of K = 2µt jump up when α is small.
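A minimal sketch of the mean-1 gamma distribution of site rates, assuming the usual shape/scale parameterization (shape α, scale 1/α, so the mean is 1 and the variance is 1/α). The function names, the 0.1 cut-off for "slow" sites and the α values are illustrative choices, not taken from the lecture.

```python
import random

def gamma_rate_params(alpha):
    """Return (shape, scale) of a gamma distribution whose mean is 1."""
    # mean = shape * scale, so choosing scale = 1 / alpha fixes the mean at 1
    return alpha, 1.0 / alpha

def summarize(alpha, n=100_000, seed=1):
    """Sample n site rates; return (sample mean, fraction of rates below 0.1)."""
    shape, scale = gamma_rate_params(alpha)
    rng = random.Random(seed)
    rates = [rng.gammavariate(shape, scale) for _ in range(n)]
    mean = sum(rates) / n
    slow = sum(r < 0.1 for r in rates) / n  # nearly invariant sites
    return mean, slow

for alpha in (0.5, 2.0):
    mean, slow = summarize(alpha)
    print(f"alpha={alpha}: sample mean={mean:.3f}, "
          f"theoretical variance={1 / alpha:.2f}, "
          f"fraction of sites with rate < 0.1 = {slow:.3f}")
```

With a small α most sites evolve very slowly while a few evolve fast, which is the strong rate heterogeneity that small estimates of α describe.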

Other DNA based methods

- The programs of the PAML package are the most commonly used method.
- PAML makes use of a codon model: it takes the codon as the unit that changes and measures the changes from any one codon to any other.

In its simplest form, the codon model in the PAML package assumes Pr(more than one change) = 0, that is, a codon cannot change at more than one nucleotide position in a single step. Beyond the phylogenetic relationship of the sequences, the model has two parameters:

- κ: a rate that measures transition/transversion bias
- ω: a measure of nonsynonymous versus synonymous rates
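The two-parameter codon model can be sketched as a function giving the relative rate of change between two codons. This is a simplified illustration, not PAML's implementation: it omits the equilibrium codon frequencies that a full codon model also multiplies in, it writes codons in the DNA alphabet (T instead of U), and `relative_rate` is a hypothetical name.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
PURINES = {"A", "G"}

def relative_rate(codon_i, codon_j, kappa, omega):
    """Relative instantaneous rate of change from codon_i to codon_j."""
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0  # identical codons, or more than one nucleotide change
    if "*" in (CODON_TABLE[codon_i], CODON_TABLE[codon_j]):
        return 0.0  # stop codons are excluded from the model
    a, b = diffs[0]
    rate = 1.0
    if (a in PURINES) == (b in PURINES):
        rate *= kappa  # transition: purine<->purine or pyrimidine<->pyrimidine
    if CODON_TABLE[codon_i] != CODON_TABLE[codon_j]:
        rate *= omega  # nonsynonymous: the encoded amino acid changes
    return rate

print(relative_rate("TTT", "TTC", kappa=2.0, omega=0.5))  # synonymous transition
print(relative_rate("TTT", "TTA", kappa=2.0, omega=0.5))  # nonsynonymous transversion
print(relative_rate("TTT", "CCT", kappa=2.0, omega=0.5))  # two changes
```

Setting ω below, at, or above 1 then scales nonsynonymous changes down, equal to, or up relative to synonymous ones.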

So if...

- ω < 1: purifying or negative selection
- ω = 1: no selection, neutral
- ω > 1: positive selection

Other DNA based methods - examples

[Figure: inferred number of genes under positive selection. From: Kosiol et al. 2011]

[Figure: over-represented GO categories. From: Kosiol et al. 2008 PLoS Genetics 4:e1000144]

[Figure: co-evolution in complement immunity among positively selected genes (P < 0.05, FDR < 0.05). From: Kosiol et al. 2011]

[Figure: positive selection in immune genes. From: Kosiol et al. 2008 PLoS Genetics 4:e1000144]

Amino acid distance measures

As for nucleotide sequences, the Jukes-Cantor distance can be applied to amino acid sequences. The only difference is that there are 20 amino acids rather than 4 bases. With D the observed proportion of sites that differ:

$$D_{JC} = -\frac{19}{20} \ln\left(1 - \frac{20}{19} D\right)$$

This is often simplified to just

$$D_{JC} = -\ln(1 - D)$$

As for nucleotide sequences, it assumes the same rate of substitution between all amino acids.
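The two distance formulas can be checked numerically with a short sketch. The function names are illustrative, and `p` is assumed to be the observed proportion of differing sites between two aligned protein sequences.

```python
import math

def jc_protein_distance(p, n_states=20):
    """Jukes-Cantor distance for n_states equally exchangeable states."""
    b = (n_states - 1) / n_states  # 19/20 for amino acids, 3/4 for nucleotides
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, (n-1)/n) for the distance to exist")
    return -b * math.log(1 - p / b)

def poisson_distance(p):
    """Simplified form: -ln(1 - p)."""
    return -math.log(1 - p)

for p in (0.05, 0.19, 0.50):
    print(f"p={p:.2f}  JC={jc_protein_distance(p):.4f}  "
          f"simplified={poisson_distance(p):.4f}")
```

For small p the two forms are nearly identical; they drift apart as the sequences become more divergent.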

Amino acids differ in various characteristics:

- charge
- polarity
- hydrophobicity
- aromaticity
- size

It is therefore unlikely that all amino acid substitutions will occur with a similar probability. Use empirical weighting schemes when computing amino acid distances.

The genetic code (ter = termination codon)

UUU Phe   UCU Ser   UAU Tyr   UGU Cys
UUC Phe   UCC Ser   UAC Tyr   UGC Cys
UUA Leu   UCA Ser   UAA ter   UGA ter
UUG Leu   UCG Ser   UAG ter   UGG Trp

CUU Leu   CCU Pro   CAU His   CGU Arg
CUC Leu   CCC Pro   CAC His   CGC Arg
CUA Leu   CCA Pro   CAA Gln   CGA Arg
CUG Leu   CCG Pro   CAG Gln   CGG Arg

AUU Ile   ACU Thr   AAU Asn   AGU Ser
AUC Ile   ACC Thr   AAC Asn   AGC Ser
AUA Ile   ACA Thr   AAA Lys   AGA Arg
AUG Met   ACG Thr   AAG Lys   AGG Arg

GUU Val   GCU Ala   GAU Asp   GGU Gly
GUC Val   GCC Ala   GAC Asp   GGC Gly
GUA Val   GCA Ala   GAA Glu   GGA Gly
GUG Val   GCG Ala   GAG Glu   GGG Gly

Key on the slide: non polar, polar, unusual.

Dayhoff et al. (1978) computed the percent accepted mutations (PAM).

Margaret O. Dayhoff (1925-1983), Columbia University
[Picture from http://wikipedia.org]

- Took a number of globular proteins and compared every site, cataloguing the changes.
- Extrapolated the changes from a short period of time to a longer period.

PAM steps

1. Calculate how often pairs of amino acids are exchanged.
2. Calculate the frequency of occurrence of each amino acid.
3. Calculate the mutation probability.
4. Calculate how mutable each amino acid is.
5. Scale to one amino acid change per 100 residues.
6. Calculate not only the probability of each change but also the probability of no change.
7. End with a PAM score for all changes from amino acid i to amino acid j.
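The steps above can be sketched on a toy three-letter alphabet. This is a simplified illustration under stated assumptions, not Dayhoff's exact procedure: the counts and frequencies are made up, mutability is taken as (number of substitutions)/(frequency), and `pam1_from_counts` is a hypothetical name.

```python
def pam1_from_counts(counts, freqs, target=0.01):
    """Build a PAM1-style matrix: column j = original residue, row i = replacement.

    counts[i][j] is the symmetric number of observed i<->j exchanges (step 1),
    freqs[j] is the normalized frequency of residue j (step 2).
    """
    n = len(freqs)
    # Total number of exchanges each residue takes part in
    changes = [sum(counts[i][j] for i in range(n) if i != j) for j in range(n)]
    # Step 4: relative mutability = substitutions / frequency
    mutability = [changes[j] / freqs[j] for j in range(n)]
    # Step 5: scale so that the expected fraction of changed sites is `target`
    lam = target / sum(freqs[j] * mutability[j] for j in range(n))
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j and changes[j] > 0:
                # Step 3: probability that residue j is replaced by residue i
                matrix[i][j] = lam * mutability[j] * counts[i][j] / changes[j]
        matrix[j][j] = 1.0 - lam * mutability[j]  # step 6: probability of no change
    return matrix

# Hypothetical data for three residues called X, Y and Z
TOY_COUNTS = [[0, 30, 10],
              [30, 0, 20],
              [10, 20, 0]]
TOY_FREQS = [0.5, 0.3, 0.2]

for row in pam1_from_counts(TOY_COUNTS, TOY_FREQS):
    print("  ".join(f"{10000 * value:7.1f}" for value in row))  # shown x 10,000
```

Each column sums to 1, and the frequency-weighted probability of change is 1%, which is what makes the result a PAM1-style matrix.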

1572 amino acid pairwise differences (1978)

         Ala  Arg  Asn  Asp  Cys  Gln  Glu
          A    R    N    D    C    Q    E
Ala A     -   30  109  154   33   93  266
Arg R          -   17    0   10  120    0
Asn N               -  532    0   50   94
Asp D                    -    0   76  831
Cys C                         -    0    0
Gln Q                              -  422
Glu E                                   -

Normalized frequencies of amino acids within Dayhoff's dataset

Gly 0.089    Arg 0.041
Ala 0.087    Asn 0.040
Leu 0.085    Phe 0.040
Lys 0.081    Gln 0.038
Ser 0.070    Ile 0.037
Val 0.065    His 0.034
Thr 0.058    Cys 0.033
Pro 0.051    Tyr 0.030
Glu 0.050    Met 0.015
Asp 0.047    Trp 0.010

Relative mutabilities (number of substitutions / frequency)

Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18

Ala has been arbitrarily set to 100.

PAM-1 matrix (values × 10,000)

Columns: original amino acid (From). Rows: replacement amino acid (To).

          Ala   Arg   Asn   Asp   Cys   Gln   Glu
           A     R     N     D     C     Q     E
Ala A    9867     2     9    10     3     8    17
Arg R       1  9913     1     0     1    10     0
Asn N       4     1  9822    36     0     4     6
Asp D       6     0    42  9859     0     6    53
Cys C       1     1     0     0  9973     0     0
Gln Q       3     9     4     5     0  9876    27
Glu E      10     0     7    56     0    35  9865

- PAM1 is the expectation after approximately 1% of the sequence has been substituted.
- PAM2 is calculated as PAM1 × PAM1.
- PAMx is calculated as PAM(x-1) × PAM1.
- PAM250 is generally used for distant comparisons. It corresponds to 2.5 differences per site (≈ 20% identity).

NOTE: These measure divergence, not time.
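Repeated multiplication by PAM1 can be sketched with a toy two-state matrix. The matrix below is hypothetical (it is not part of Dayhoff's data); it is chosen so that its columns sum to 1 and about 1% of sites change per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """PAMn = PAM(n-1) x PAM1, starting from PAM1."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

# Hypothetical two-state PAM1: column = original state, row = replacement state
TOY_PAM1 = [[0.9925, 0.0150],
            [0.0075, 0.9850]]

for n in (1, 2, 70, 250):
    m = pam_n(TOY_PAM1, n)
    print(f"PAM{n}: P(state 0 unchanged) = {m[0][0]:.4f}, "
          f"P(state 1 unchanged) = {m[1][1]:.4f}")
```

As n grows, each column approaches the equilibrium frequencies, so the probability of still seeing the original state falls far below 1 - n × 0.01 because multiple and back substitutions accumulate at the same site.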

PAM-250 matrix (values × 100)

Columns: original amino acid (From). Rows: replacement amino acid (To).

         Ala  Arg  Asn  Asp  Cys  Gln  Glu
          A    R    N    D    C    Q    E
Ala A    13    6    9    9    5    8    9
Arg R     3   17    4    3    2    5    3
Asn N     4    4    6    7    2    5    6
Asp D     5    4    8   11    1    7   10
Cys C     2    1    1    1   52    1    1
Gln Q     3    5    5    6    1   10    7
Glu E     5    4    7   11    1    9   12

PAM scoring matrix

The PAM scoring values are generally shown as a symmetric log odds ratio matrix.

Odds (for those who do not gamble) are $\frac{p}{1-p}$, where $p$ is the probability of an event and $1-p$ is the probability of some other event.

For example, if $p = 0.5$ then the odds are 50/50 or 1 to 1 ($\frac{0.5}{0.5} = 1$), while if $p = 0.75$ then the odds are 3 to 1 ($\frac{0.75}{0.25} = 3$).

The odds ratio is the ratio of the odds for and against.

Generally the odds are presented as log values. For PAM matrices it is generally $\log_{10}$ that is used, so each integer value represents an order of magnitude.

For example, if $p = 0.08$, the odds are 0.08/0.92 = 0.087 (roughly 11 to 1 against) and the log odds are $\log_{10}(0.087) = -1.06$, while if $p = 0.996$, the odds are 0.996/0.004 = 249 (orders of magnitude larger and in the opposite direction) and the log odds are $\log_{10}(249) = +2.40$.
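The odds and log odds arithmetic can be reproduced with a few lines; the function names are illustrative.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Base-10 logarithm of the odds."""
    return math.log10(odds(p))

for p in (0.5, 0.75, 0.08, 0.996):
    print(f"p={p}: odds={odds(p):.3f}, log10 odds={log_odds(p):+.2f}")
```

Probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds, which is why a log odds matrix contains both signs.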

For a PAM scoring matrix,

$$S_{ij} = \log \frac{p_i M_{ij}}{p_i p_j} = \log \frac{M_{ij}}{p_j} = \log \frac{\text{observed frequency}}{\text{expected frequency}}$$

This matrix will be symmetric.
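A small sketch of the score S_ij = log10(M_ij / p_j) on a toy two-state example, showing why the result is symmetric. The assumptions are labeled in the code: `trans[i][j]` is taken to be the probability that residue i is replaced by residue j, the numbers are made up, and they are chosen so that p_i M_ij = p_j M_ji (each exchange is observed equally often in both directions).

```python
import math

def log_odds_scores(trans, freqs, scale=10):
    """S_ij = scale * log10(M_ij / p_j), rounded to the nearest integer."""
    n = len(freqs)
    return [[round(scale * math.log10(trans[i][j] / freqs[j]))
             for j in range(n)]
            for i in range(n)]

# Hypothetical frequencies and transition probabilities (rows sum to 1)
TOY_FREQS = [0.75, 0.25]
TOY_TRANS = [[0.99, 0.01],
             [0.03, 0.97]]

scores = log_odds_scores(TOY_TRANS, TOY_FREQS)
for row in scores:
    print(row)
```

The off-diagonal scores agree because 0.75 × 0.01 = 0.25 × 0.03, so M_01 / p_1 and M_10 / p_0 are the same ratio. The rarer state gets the higher diagonal score, as with cysteine and tryptophan in PAM250.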

PAM250 log odds scoring matrix (values multiplied by 10)

C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
N   -4   1   0  -1   0   0   2
D   -5   0   0  -1   0   1   2   4
E   -5   0   0  -1   0   0   1   3   4
Q   -5  -1  -1   0   0  -1   1   2   2   4
H   -3  -1  -1   0  -1  -2   2   1   1   3   6
R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F   -4  -3  -3  -5  -4  -5  -3  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

- A log odds of zero implies the two amino acids are found across from each other in an alignment as often as expected by chance (given their mutabilities and frequencies of occurrence).
- A log odds greater than zero implies the two amino acids are found across from each other more often than expected by chance.
- A log odds less than zero implies the two amino acids are found across from each other less often than expected by chance.

Two uses for PAM matrices

- Scoring matrix: PAM250 (very distant), PAM160 (distant), PAM70 (less distant), PAM30 (more similar), etc.
- Transition matrix: PAM1

PAM - strange (?) patterns

Lots of interesting properties:

- Many exchanges between amino acids D and E.
- Far more double codon substitutions than expected.
- Fewer of some single codon substitutions than expected, e.g. between G and W.

PAM - scoring an amino acid alignment

Consider an alignment...

Seq1      C   G   N   G
Seq2      C   G   D   R
PAM250   12   5   2  -3

The total score is 12 + 5 + 2 - 3 = 16.

The chance of getting an alignment this good by chance is given by the odds. Normally one would multiply the odds at each site (assuming independence), but since logs have been taken we can add the log odds. Because the tabulated values are multiplied by 10, a total of 16 is a log10 odds of 1.6, which corresponds to odds of 10^1.6 = 39.8.

So this is an unusual similarity between these two peptides despite their short length (in large part due to the rare cysteines across from each other).
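The alignment score can be recomputed with a short sketch. Only the four PAM250 entries used in the example are included, and the names `PAM250_SUBSET` and `alignment_score` are illustrative.

```python
# PAM250 log10 odds x 10 for the four residue pairs in the example alignment
PAM250_SUBSET = {
    ("C", "C"): 12,
    ("G", "G"): 5,
    ("N", "D"): 2,
    ("G", "R"): -3,
}

def alignment_score(seq1, seq2, matrix):
    """Sum the per-column scores of an ungapped alignment."""
    total = 0
    for a, b in zip(seq1, seq2):
        # The scoring matrix is symmetric, so look the pair up in either order
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total

total = alignment_score("CGNG", "CGDR", PAM250_SUBSET)
odds = 10 ** (total / 10)  # undo the x10 scaling, then the log10
print(f"total score = {total}, odds = {odds:.1f}")
```

Adding the scaled log odds and exponentiating once gives the same result as multiplying the per-site odds.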

- The PAM matrix was computed on globular proteins and may therefore not be a good representation of the substitution matrix for membrane or other non-globular proteins.
- It assumes that all sites are equally mutable (but not all residues).
- Only a limited number of proteins were available, in comparison to the huge numbers available today.

The JTT matrix (Jones, Taylor, Thornton 1992) was an update of the PAM matrix. It is mostly used as a transition matrix rather than as a scoring matrix (for the latter purpose PAM250 still seems to be the method of choice).

BLOSUM: a matrix of BLOCKS

- BLOcks SUbstitution Matrix.
- Based on the analysis of conserved protein regions from the BLOCKS database.
- More reliable than the PAM matrix for distantly related proteins.
- Default for BLAST searches.
- Used in many other programs, including FASTA.