G+C content


Bacterial genome structures
marc.bailly-bechet@univ-lyon1.fr

1 Introduction
2 Chromosomes Topology & Counts
3 Genome size
4 Replichores and gene orientation
5 Chirochores
6 G+C content
7 Codon usage

Introduction
G+C content is expressed as a percentage: $100 \times [G+C] / [A+T+C+G]$.
First nucleic acid technology applied to bacterial systematics.
One of the genomic characteristics recommended for the description of species and genera.
G+C content commonly varies within a range of 5% within a species and 10% within a genus.
Modulates the amino-acid content of proteins.
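The definition above can be sketched in a few lines of base R. This is only an illustration: `gc_percent` is a name chosen here, not a function from the course material, and characters other than A, C, G, T are simply ignored.

```r
# Minimal sketch: G+C content (in %) of a DNA string, base R only.
# gc_percent is an illustrative name, not part of the course material.
gc_percent <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  bases <- bases[bases %in% c("A", "C", "G", "T")]  # ignore N, gaps, etc.
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

gc_percent("ATGCGC")  # 4 of the 6 bases are G or C
```

The seqinr package provides a `GC()` function for the same purpose on real sequences.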

Introduction
DNA double helix: Watson and Crick, Nature, 1953.

Introduction
The original figure

Introduction
G+C content is the same for both strands.
Let $A_a C_c G_g T_t$ be the primary formula of one strand (a, c, g and t are the base counts). Its complementary strand then has the composition $A_t C_g G_c T_a$.
The G+C content is not affected:
$$\frac{g+c}{a+c+g+t} = \frac{c+g}{t+g+c+a}$$
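The strand symmetry can be checked numerically. The sketch below uses an arbitrary example sequence (an assumption made for illustration) and builds its reverse complement with `chartr()`.

```r
# Minimal sketch: the G+C fraction of a strand equals that of its complement.
gc_fraction <- function(bases) mean(bases %in% c("G", "C"))

strand <- strsplit("ATGCGCGTTA", "")[[1]]        # arbitrary example sequence
# chartr() swaps each base for its Watson-Crick partner; rev() restores 5'->3' order
paired <- chartr("ACGT", "TGCA", paste(strand, collapse = ""))
complement <- rev(strsplit(paired, "")[[1]])

gc_fraction(strand) == gc_fraction(complement)   # same count of G+C, same length
```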

Introduction
DNA denaturation
Quiz: How do you know whether a sample of DNA is double-stranded or single-stranded?

Introduction
Single strand or double strand?

Introduction
$T_m$ is the temperature at the midpoint of the transition.

Introduction
$T_m$ increases with DNA G+C content.
AT pairs: 2 hydrogen bonds.
GC pairs: 3 hydrogen bonds.

Between species variability
What is the distribution of G+C content in bacteria?
Study this yourself:
1 Extract G+C data from GOLD (http://www.genomesonline.org/, use the goldtable.txt dataset as previously) and study its distribution.
2 Study data from http://pbil.univ-lyon1.fr/members/mbailly/amig/data/gctopt.rdata. What is the relationship between G+C content and optimum growth temperature $T_{opt}$? With the class of organism?

Between species variability
G+C content from GOLD

Between species variability
G+C content and $T_{opt}$

Between species variability
G+C content and $T_{opt}$ (n = 739)

Between species variability
G+C content and $T_{opt}$ (n = 739)

    > shapiro.test(gctopt$topt)

            Shapiro-Wilk normality test

    data:  gctopt$topt
    W = 0.6389, p-value < 2.2e-16

    > shapiro.test(gctopt$gc)

            Shapiro-Wilk normality test

    data:  gctopt$gc
    W = 0.957, p-value = 6.941e-14
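Both Shapiro-Wilk tests reject normality, so a rank-based measure of association is one reasonable follow-up. The slides do not say which test was actually used, so the sketch below is an assumption: `rank_association` is a hypothetical helper, the column names `gc` and `topt` follow the `gctopt` object above, and the `toy` values are placeholders, not the course data.

```r
# Minimal sketch, assuming a data frame with columns gc and topt (as in gctopt).
# rank_association is a hypothetical helper name.
rank_association <- function(d) {
  # exact = FALSE avoids the warning raised when the data contain ties
  cor.test(d$gc, d$topt, method = "spearman", exact = FALSE)
}

# Placeholder values for illustration only (NOT the course data).
toy <- data.frame(gc   = c(35, 42, 50, 58, 66, 71),
                  topt = c(37, 30, 55, 28, 70, 37))
res <- rank_association(toy)
res$estimate  # Spearman's rho
res$p.value
```

With the real dataset, the call would be `rank_association(gctopt)` after loading the .rdata file.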


Between species variability
G+C content and aerobiosis
Study this yourself: study data from http://pbil.univ-lyon1.fr/r/donnees/gco2.txt. What is the relationship between G+C content and (an)aerobiosis? How does it compare with the relation to $T_{opt}$?

Between species variability
G+C content and aerobiosis

Between species variability
The distribution of G+C content in bacteria
Results from your study:

Between species variability
Underlying mechanism: why no 100% G+C genomes?
Symmetric directional mutation pressure (Noboru Sueoka): AT pairs mutate to GC pairs at rate v, and GC pairs mutate to AT pairs at rate u.
Sueoka, N. (1962) On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA, 48:582-592

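Between species variability
Underlying mechanism: theory
Here $\theta$ is the G+C content, $v$ the AT to GC mutation rate and $u$ the GC to AT mutation rate.
$$\frac{d\theta}{dt} = v(1-\theta) - u\theta$$
$$\theta(t) = \left(\theta_0 - \frac{v}{u+v}\right) e^{-(u+v)t} + \frac{v}{u+v}$$
At equilibrium:
$$\theta^* = \theta(+\infty) = \frac{v}{u+v}$$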
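The trajectory and its equilibrium are easy to explore numerically. The sketch below implements the two-rate model as written on the slide (u is the GC to AT rate, v the AT to GC rate). The function names and the rate values are illustrative assumptions.

```r
# Minimal sketch of the two-rate mutation model.
# u: GC -> AT rate, v: AT -> GC rate, theta0: initial G+C fraction.
theta_eq <- function(u, v) v / (u + v)

theta_t <- function(t, u, v, theta0) {
  (theta0 - theta_eq(u, v)) * exp(-(u + v) * t) + theta_eq(u, v)
}

# Equilibrium G+C fraction for three rate ratios (only u/v matters)
c("u = 3v" = theta_eq(3, 1), "u = v" = theta_eq(1, 1), "3u = v" = theta_eq(1, 3))
# 0.25, 0.50, 0.75

# Starting from a G+C-rich genome, theta relaxes towards v / (u + v)
theta_t(c(0, 0.5, 1, 5), u = 3, v = 1, theta0 = 0.9)
```

Whatever the starting value, the model converges to an intermediate G+C content as long as both rates are non-zero, which is the point of the "why no 100% G+C genomes?" question.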

Between species variability
Underlying mechanism: predictions
[Figure: number of species (y-axis) against G+C content in % (x-axis), annotated with "Mutation pressure" and "Negative selection"; the cases u = 3v, u = v and 3u = v are marked.]

Between species variability
Underlying mechanism: experiment
Direct experimental evidence:
Cox, E.C., Yanofsky, C. (1967) Altered base ratios in the DNA of an Escherichia coli mutator strain. Proc. Natl. Acad. Sci. USA, 58:1895-1902
Accelerated evolution experiment with a mutator strain: G+C content variation is visible at a lab time scale.

G+C content and amino-acid content in proteins
GC and the genetic code

G+C content and amino-acid content in proteins
G+C content and aa content
The impact of G+C content on the amino-acid composition of proteins was known even before the genetic code was deciphered:
Sueoka, N. (1961) Correlation between base composition of deoxyribonucleic acid and amino acid composition of protein. Proc. Natl. Acad. Sci. USA, 47:1141-1149
It was used as a clue to crack the genetic code (e.g. Ala codons are expected to be G+C rich, and indeed GCN codons are G+C rich).
First evidence that the genetic code is (almost) universal.

G+C content and amino-acid content in proteins
A null hypothesis for the aa content of proteins
CDS are built by random sampling from an urn with a given G+C content $\theta$. We assume for the sake of simplicity that $C = G = \theta/2$ and $A = T = (1-\theta)/2$. What would be the amino-acid composition of proteins under this simplistic model?

G+C content and amino-acid content in proteins
A null hypothesis for the aa content of proteins
Let $X_i \in \{A, C, G, T\}$ be a random variable for the result of draw number $i$. Denote by $P_A = P(X_i = A)$, $P_C = P(X_i = C)$, $P_G = P(X_i = G)$ and $P_T = P(X_i = T)$ the probabilities of the four bases. We have assumed that $P_C = P_G = \theta/2$ and $P_A = P_T = (1-\theta)/2$.
The probability of codon GAA is for instance:
$$P(GAA) = P(X_1 = G \cap X_2 = A \cap X_3 = A) = P_G P_A P_A$$
In coding sequences there are no stop codons (TAA, TAG or TGA), so that:
$$P(GAA \mid \text{not stop}) = \frac{P(GAA)}{P(\text{not stop})} = \frac{P_G P_A P_A}{1 - (P_T P_A P_A + P_T P_A P_G + P_T P_G P_A)}$$

G+C content and amino-acid content in proteins
A null hypothesis for the aa content of proteins
At the amino-acid level, Glu is encoded by GAA or GAG, so that:
$$P(Glu) = P(GAA \cup GAG \mid \text{not stop}) = P(GAA \mid \text{not stop}) + P(GAG \mid \text{not stop})$$
In a similar way, we can deduce the expected frequencies of all amino acids under the model.
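The two steps above (a codon probability conditioned on "not stop", then a sum over synonymous codons) can be sketched directly. The function names are illustrative choices, not part of the course material.

```r
# Minimal sketch of the null model for one amino acid (Glu).
base_prob <- function(theta) {
  c(A = (1 - theta) / 2, C = theta / 2, G = theta / 2, T = (1 - theta) / 2)
}

# Unconditional probability of a codon: product of its three base probabilities
codon_prob <- function(codon, theta) {
  prod(base_prob(theta)[strsplit(codon, "")[[1]]])
}

# Probability of a codon given that the triplet is not a stop codon
codon_prob_coding <- function(codon, theta) {
  p_stop <- sum(sapply(c("TAA", "TAG", "TGA"), codon_prob, theta = theta))
  codon_prob(codon, theta) / (1 - p_stop)
}

# Glu is encoded by GAA or GAG
p_glu <- function(theta) {
  codon_prob_coding("GAA", theta) + codon_prob_coding("GAG", theta)
}

p_glu(0.5)  # with equal base frequencies this is 2 codons out of 61, i.e. 2/61
```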

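G+C content and amino-acid content in proteins
A null hypothesis for the aa content of proteins
$$f(\theta) = \frac{1}{8 - (1-\theta)^2 (1+\theta)}$$
$$P(\theta, aa) = f(\theta) \times \begin{cases}
(1-\theta)^2 (2-\theta) & \text{if } aa \in \{Ile\} \\
(1-\theta)^2 & \text{if } aa \in \{Phe, Lys, Tyr, Asn\} \\
1-\theta^2 & \text{if } aa \in \{Leu\} \\
(1-\theta)^2 \theta & \text{if } aa \in \{Met\} \\
(1-\theta)\theta & \text{if } aa \in \{Asp, Glu, His, Gln, Cys\} \\
2(1-\theta)\theta & \text{if } aa \in \{Val, Thr\} \\
3(1-\theta)\theta & \text{if } aa \in \{Ser\} \\
(1-\theta)\theta^2 & \text{if } aa \in \{Trp\} \\
\theta(\theta+1) & \text{if } aa \in \{Arg\} \\
2\theta^2 & \text{if } aa \in \{Gly, Pro, Ala\}
\end{cases}$$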
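The closed-form expressions can be cross-checked by brute force over the 64 codons. The sketch below assumes the standard genetic code, written as a 64-letter string in TCAG order with one-letter amino-acid codes, and `null_aa_freq` is an illustrative name.

```r
# Minimal sketch: expected amino-acid frequencies under the null model,
# computed by enumeration over the standard genetic code (assumed here).
bases <- c("T", "C", "A", "G")
g <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codons <- paste0(g$b1, g$b2, g$b3)   # TTT, TTC, TTA, TTG, TCT, ...
aa <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]

null_aa_freq <- function(theta) {
  n_gc <- nchar(gsub("[AT]", "", codons))      # number of G or C in each codon
  w <- theta^n_gc * (1 - theta)^(3 - n_gc)     # equals 8 * P(codon)
  keep <- aa != "*"                            # drop the three stop codons
  # sum the weights per amino acid, then condition on "not stop"
  vapply(split(w[keep], aa[keep]), sum, numeric(1)) / sum(w[keep])
}

theta <- 0.6
f <- 1 / (8 - (1 - theta)^2 * (1 + theta))
freq <- null_aa_freq(theta)
all.equal(freq[["A"]], f * 2 * theta^2)        # Ala
all.equal(freq[["L"]], f * (1 - theta^2))      # Leu
all.equal(freq[["R"]], f * theta * (theta + 1))  # Arg
```

The same comparison can be repeated for every case of the formula; the frequencies returned by `null_aa_freq` sum to one by construction.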

G+C content and amino-acid content in proteins
What is the impact of G+C on the aa content?
Study this yourself:

    > load(url("http://pbil.univ-lyon1.fr/members/lobry/repro/gene06/uco739.rdata"))
    > dim(uco739)
    [1] 739  61
    > uco739[1:5,1:5]
                                    aaa  aac  aag  aat  aca
    ACHROMOBACTER DENITRIFICANS     216  417  691  149  134
    ACHROMOBACTER XYLOSOXIDANS      349  807 1225  283  169
    ACIDIANUS AMBIVALENS            756  330  625  519  301
    ACIDITHIOBACILLUS FERROOXIDANS  732  897 1252  662  270
    ACINETOBACTER BAUMANNII        3442 1233 1429 2590 1271

This is a dataset of codon counts in 739 bacterial species.

G+C content and amino-acid content in proteins
What is the impact of G+C on the aa content?
Compute the G+C content from codon counts. From colnames(uco739), make a vector of the number of G+C in each codon:

    aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att
      0   1   1   0   1   2   2   1   1   2   2   1   0   1   1   0
    caa cac cag cat cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt
      1   2   2   1   2   3   3   2   2   3   3   2   1   2   2   1
    gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt
      1   2   2   1   2   3   3   2   2   3   3   2   1   2   2   1
    tac tat tca tcc tcg tct tgc tgg tgt tta ttc ttg ttt
      1   0   1   2   2   1   2   2   1   0   1   1   0

There are multiple ways to do that: using strsplit or s2c in library seqinr, using %in% or looping ==, using explicit loops or sapply.
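One of the routes mentioned above (strsplit, %in% and sapply) might look like the sketch below. The short `codon_names` vector stands in for `colnames(uco739)` so that the example runs without the dataset.

```r
# Minimal sketch: number of G or C in each codon name (strsplit + %in% + sapply).
# codon_names stands in for colnames(uco739).
codon_names <- c("aaa", "aac", "aag", "aat", "aca", "gcc", "tgg")

gc_per_codon <- sapply(strsplit(codon_names, ""),
                       function(b) sum(b %in% c("g", "c")))
names(gc_per_codon) <- codon_names
gc_per_codon
```

With the real data, replacing `codon_names` by `colnames(uco739)` gives the 61-element vector shown on the slide.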

G+C content and amino-acid content in proteins
What is the impact of G+C on the aa content?
Now, thanks to matrix multiplication (%*%), compute the G+C content in percent:

                                       [,1]
    ACHROMOBACTER DENITRIFICANS    61.88166
    ACHROMOBACTER XYLOSOXIDANS     63.37462
    ACIDIANUS AMBIVALENS           37.59371
    ACIDITHIOBACILLUS FERROOXIDANS 58.72497
    ACINETOBACTER BAUMANNII        43.92444
    ACINETOBACTER CALCOACETICUS    42.96045
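A possible form of the matrix product is sketched below on the 3 x 5 excerpt of `uco739` printed earlier. Because only 5 of the 61 codons are included, the percentages it returns are not the values shown on the slide; the point is only the shape of the computation.

```r
# Minimal sketch on an excerpt of uco739 (3 species, 5 codons only).
counts <- matrix(c(216, 417,  691, 149, 134,
                   349, 807, 1225, 283, 169,
                   756, 330,  625, 519, 301),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("ACHROMOBACTER DENITRIFICANS",
                                   "ACHROMOBACTER XYLOSOXIDANS",
                                   "ACIDIANUS AMBIVALENS"),
                                 c("aaa", "aac", "aag", "aat", "aca")))
gc_per_codon <- c(aaa = 0, aac = 1, aag = 1, aat = 0, aca = 1)

# counts %*% gc_per_codon = number of G or C bases per species;
# each codon contributes 3 bases to the total.
gc_pct <- 100 * (counts %*% gc_per_codon) / (3 * rowSums(counts))
gc_pct
```

On the full dataset the same line would read `100 * (uco739 %*% gc_per_codon) / (3 * rowSums(uco739))`, assuming `uco739` is a numeric matrix (use `as.matrix()` otherwise).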

G+C content and amino-acid content in proteins
What is the impact of G+C on the aa content?
Compute the amino-acid content from codon counts. From colnames(uco739), make a vector of the corresponding amino acid:

      aaa   aac   aag   aat   aca   acc   acg   act   aga   agc
    "Lys" "Asn" "Lys" "Asn" "Thr" "Thr" "Thr" "Thr" "Arg" "Ser"
      agg   agt   ata   atc   atg   att   caa   cac   cag   cat
    "Arg" "Ser" "Ile" "Ile" "Met" "Ile" "Gln" "His" "Gln" "His"
      cca   ccc   ccg   cct   cga   cgc   cgg   cgt   cta   ctc
    "Pro" "Pro" "Pro" "Pro" "Arg" "Arg" "Arg" "Arg" "Leu" "Leu"
      ctg   ctt   gaa   gac   gag   gat   gca   gcc   gcg   gct
    "Leu" "Leu" "Glu" "Asp" "Glu" "Asp" "Ala" "Ala" "Ala" "Ala"
      gga   ggc   ggg   ggt   gta   gtc   gtg   gtt   tac   tat
    "Gly" "Gly" "Gly" "Gly" "Val" "Val" "Val" "Val" "Tyr" "Tyr"
      tca   tcc   tcg   tct   tgc   tgg   tgt   tta   ttc   ttg
    "Ser" "Ser" "Ser" "Ser" "Cys" "Trp" "Cys" "Leu" "Phe" "Leu"
      ttt
    "Phe"

Useful functions are s2c(), translate() and aaa() in the seqinr package.
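For readers without seqinr at hand, the mapping can also be built in base R. The sketch below is an alternative to the seqinr route, not the method of the slides: it hard-codes the standard genetic code as a 64-letter string in TCAG order, and `codon2aa` and the "Stp" label for stop codons are choices made here.

```r
# Minimal sketch: codon -> three-letter amino acid, base R only.
# Assumes the standard genetic code; codon2aa is an illustrative name.
bases <- c("t", "c", "a", "g")
g <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
code1 <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(code1) <- paste0(g$b1, g$b2, g$b3)   # ttt, ttc, tta, ttg, tct, ...

one2three <- c(A = "Ala", R = "Arg", N = "Asn", D = "Asp", C = "Cys",
               Q = "Gln", E = "Glu", G = "Gly", H = "His", I = "Ile",
               L = "Leu", K = "Lys", M = "Met", F = "Phe", P = "Pro",
               S = "Ser", T = "Thr", W = "Trp", Y = "Tyr", V = "Val",
               "*" = "Stp")

codon2aa <- function(codons) unname(one2three[code1[codons]])

codon2aa(c("aaa", "aac", "atg", "tgg"))
```

Calling `codon2aa(colnames(uco739))` would then give the vector of amino acids for the 61 sense codons.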

G+C content and amino-acid content in proteins
What is the impact of G+C on the aa content?
Now, in one line (or more if you do not feel that geeky), thanks to apply() and tapply(), compute the amino-acid content in each proteome:

                                    Ala  Arg  Asn  Asp Cys
    ACHROMOBACTER DENITRIFICANS    2783 1601  566 1204 206
    ACHROMOBACTER XYLOSOXIDANS     5031 2828 1090 2062 330
    ACIDIANUS AMBIVALENS           1297  700  849  853 218
    ACIDITHIOBACILLUS FERROOXIDANS 5876 3794 1559 2705 623
    ACINETOBACTER BAUMANNII        7428 4257 3823 4476 745
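One way to write the apply()/tapply() one-liner is sketched below on the same 3 x 5 excerpt of `uco739`. Since the excerpt contains both Asn codons (aac and aat), the Asn column reproduces the values in the table above; the Lys and Thr columns are only partial sums.

```r
# Minimal sketch on an excerpt of uco739 (3 species, 5 codons only).
counts <- matrix(c(216, 417,  691, 149, 134,
                   349, 807, 1225, 283, 169,
                   756, 330,  625, 519, 301),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("ACHROMOBACTER DENITRIFICANS",
                                   "ACHROMOBACTER XYLOSOXIDANS",
                                   "ACIDIANUS AMBIVALENS"),
                                 c("aaa", "aac", "aag", "aat", "aca")))
aa_of_codon <- c("Lys", "Asn", "Lys", "Asn", "Thr")  # amino acid of each column

# For each species (row), sum the codon counts by amino acid
aa_counts <- t(apply(counts, 1, function(x) tapply(x, aa_of_codon, sum)))
aa_counts[, "Asn"]  # 566, 1090, 849, as in the table above
```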

G+C content and amino-acid content in proteins
What is the impact of G+C on the aa content?
Plot the results:
Show the influence of G+C content for Ala, Lys, and Glu (at least).
Add the linear fit.
Add the neutral model as a line.

G+C content and amino-acid content in proteins
What should be obtained for Ala:
[Figure: "Ala frequency evolution with G+C". Ala content [%] against G+C content [%], with the linear fit and the neutral model.]

G+C content and amino-acid content in proteins
What should be obtained for Lys:
[Figure: "Lys frequency evolution with G+C". Lys content [%] against G+C content [%], with the linear fit and the neutral model.]

G+C content and amino-acid content in proteins
What should be obtained for Glu:
[Figure: "Glu frequency evolution with G+C". Glu content [%] against G+C content [%], with the linear fit and the neutral model.]

G+C content and amino-acid content in proteins
How to do these graphs?
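One possible answer is sketched below for Ala. It is not the author's solution: `plot_aa_vs_gc` and `neutral_ala` are names chosen here, and the data points are synthetic placeholders generated around the neutral curve, not the course data. With the real data, `gc_pct` and `aa_pct` would come from the G+C and amino-acid computations above.

```r
# Neutral-model expectation for Ala (GCN codons): 2 * theta^2 * f(theta)
neutral_ala <- function(theta) 2 * theta^2 / (8 - (1 - theta)^2 * (1 + theta))

# Hypothetical helper: scatter plot + linear fit + neutral-model curve
plot_aa_vs_gc <- function(gc_pct, aa_pct, aa_name, neutral) {
  plot(gc_pct, aa_pct, xlab = "G+C content [%]",
       ylab = paste(aa_name, "content [%]"),
       main = paste(aa_name, "frequency evolution with G+C"))
  fit <- lm(aa_pct ~ gc_pct)
  abline(fit, col = "red")                              # linear fit
  theta <- seq(0.2, 0.8, by = 0.01)
  lines(100 * theta, 100 * neutral(theta), col = "blue", lty = 2)
  legend("topleft", legend = c("Linear fit", "Neutral model"),
         col = c("red", "blue"), lty = c(1, 2))
  invisible(fit)
}

# Synthetic placeholder data for illustration only (NOT the course data)
set.seed(1)
gc_pct  <- runif(50, 30, 70)
ala_pct <- 100 * neutral_ala(gc_pct / 100) + rnorm(50, sd = 0.5)

pdf(NULL)   # draw to a null device; drop this line in an interactive session
fit <- plot_aa_vs_gc(gc_pct, ala_pct, "Ala", neutral_ala)
dev.off()
coef(fit)
```

For Lys and Glu, only the neutral function changes: the bracketed terms become $(1-\theta)^2$ and $(1-\theta)\theta$ respectively, according to the null-model formula.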