Causes and Effects of N-Terminal Codon Bias in Bacterial Genes. Mikk Eelmets Journal Club

Size: px
Start display at page:

Download "Causes and Effects of N-Terminal Codon Bias in Bacterial Genes. Mikk Eelmets Journal Club"

Transcription

1 Causes and Effects of N-Terminal Codon Bias in Bacterial Genes Mikk Eelmets Journal Club

2 Introduction Ribosomes were first observed in the mid-195s (Nobel Prize in 1974) Nobel Prize in 29 for determining the detailed structure and mechanism of the ribosome There are still many questions without an answer Many organism are enriched for poorly adapted codon at the N terminus. Why? Which mechanisms are causal for expression changes? Strategies: study genome-wide sequence correlation, study conservation among species, use synthetic genes

3 Genetic code

4 Design of experiment FlowSeq Sharon E et al Nat Biotechnol. 212 May 2;3(6):521-3

5 Design of experiment FlowSeq A mcherry sfgfp weak 114 promoters 111 ribosome binding sites + + medium strong library of 12,653 synthesized constructs transform, DNAseq & RNAseq sort on sfgfp/mcherry ratio barcode & sequence estimate protein levels for each construct B D FlowSeq C 1) 5 Kosuri S et al. Proc Natl Acad Sci U S A. 213 Aug2;11(34): count (x1) log(rfp) 15 1

6 Design of experiment Expression System and Library Design ession System and Library Design Each library construct contained a promoter, RBS, and 11 codon N-terminal peptide (including the initiating ATG ). The peptide sequences correspond to the N terminus of 137 natural E. coli genes. For each promoter/rbs/peptide combination, we encoded 13 codon variants including the most common codons (C), most rare codons (R), wild-type sequence (wt), and codon variants with variable secondary structures (ΔG). The library was cloned in-frame with superfolder GFP (sfgfp). The GFP expression level is compared via relative fluorescence to a constitutively co-expressed mcherry protein.

7 Usage (RSCU) S Protein expression of the reporter library C Promoter RBS Strong High Mid High Weak High WT High Strong Low Mid Low x 1/ 16x 6 256x ge in Expression from C to R reporter library. (A) N-terminal donvariantsshowincreasedex- (C). (B)Foldchangeinexpression ndent of RBS strength. WT, wild suredbythesfgfp:mcherryratio) n variant sets are grouped into n variants include C, R, wild-type ondary structure ( G). Not shown re mostly outside the quantitative data, and light gray squares cor- I K L 1-AA N-terminal peptides from 137 genes wt C R inc G Protein Expression (FlowSeq) Protein expression of the library (as measured by the sfgfp:mcherry ratio) covers a ~2-fold range. The 13-member codon variant sets are grouped into columns by promoter/rbs combination (top). Codon variants include B C, R, wild-type sequence C (wt), and 1 sequences with varying secondary structure ( G). ression tion R 2 = Protein levels: 7327 (51.5%) constructs They used UNAFOLD software to predict free energy of folding for each RBS-codon variant

8 RNA and DNA Abundance DNA abundances for each construct. Promoter identity is the first column and RBS identity is the second column. DNA abundances varied due to differences in DNA synthesis efficiencies as well as lower growth rate for very highly expressed genes

9 RNA and DNA Abundance Relative RNA abundances (ratio of RNA to DNA contig counts) for each construct are displayed grouped by promoter and RBS identities.

10 A Fold Change in Protein Expression 16x x Gene expression measurements of the reporter library R C Codon Variant B Number of N-terminal peptides Strong RBS Mid RBS Weak RBS WT RBS 1/6 1/16x 1/ 16x 6 256x Fold Change in Expression from C to R Fig. 1. Gene expression measurements of the reporter library. (A) N-terminal peptide sequences encoding the most rare (R) codonvariantsshowincreased expression when compared to the most common ones (C). (B)Foldchangeinexpression between C and R codon variants is largely independent of RBS strength. WT, wild type. (C) Proteinexpressionofthelibrary(asmeasuredbythesfGFP:mCherryratio) (A) N-terminal peptide sequences encoding the most rare (R) codon variants show increased expression when compared to the most common ones (C) (mean 14-fold; median 4-fold). (B) (B) Fold change in expression between C and R codon variants is largely independent of RBS strength. WT, wild type. C 1-AA N-terminal peptides from 137 genes Promoter RBS Strong High

11 wt C R inc G C E F 2 G H I K B L R =.41 p < x * T.75x 1 1 log2 RSCU R =.73 Y * V ** ** 2 p = 1 9 CTG TTA TTG CTC CTT CTA S AAA AAG ATT ATC ATA CAT CAC R.75x GGC GGT GGG GGA Q TTT TTC P GAA GAG N GAT GAC -1 C TGC TGT -2 Mean fold change in expression due to codon substitution D ** * Mean fold change in expression A (RSCU) Usage Relative Synonymous Codon due to codon substitution A ** Codon Frequency Enrichment at N-terminus of E. coli Genes Protein Expression (FlowSeq) C Relative Synonymous R =.41 Codon p < 1 Usage n = number of synonymous codons (1 n 6) for the amino acid under R2 =.73 study, Xi = number.75x of occurrences p = 1 9 of codon i..75x Relative Synonymous Codon Usage (RSCU) wt C R Choice of codon and Expression GCG GCC GCA GCT -3 Protein sequence (wt), and 1 sequences with varying secondary structure ( G). Not shown Expression are two additional inc low-promoter the quantitative G panels, which were mostly1 outside 2 gray 4 squares (FlowSeq) FlowSeq range. Dark gray squares had insufficient data, and light correspond to duplicate constructs If-1the synonymous codons of an Codon Frequency Enrichment log 2 RSCU amino acid are used equalof E. coli Genes at with N-terminus frequencies, their RSCU values will Fig Rare equal codons1.generally increase expression levels. (A) The -3-2 average fold change in expression is correlated with the choice of codon. The y axis is the slope of a linear model linking codon use to expression change..5 Codons are sorted left to right by increasing genomic frequency and colored according to their relative synonymous codon usage (RSCU) in E. coli. (P values after Bonferroni correction: *P <.5, **P <.5, P <.1). (B) The individual codon slopes (y axis) as in (A) show an inverse relationship with RSCU (x axis). (C) The individual codon slopes correlate with enrichment of codons at the N terminus of genes in E. coli. X RSCU = n i i 1 X n i i=1 TAT TAC GTG GTT GTC GTA ACC ACG ACT ACA AGC TCG TCC AGT TCT TCA CGC CGT CGG CGA AGA AGG CAG CAA CCG CCA CCT CCC AAC AAT. Rare codons generally levels. (A) The The y axis is the slope of a The average fold changeincrease in expressionexpression is correlated with the choice of codon. linear in model linking codon to change. Codons are left to right by increasing genomic e fold change expression isuse correlated with the choice ofsorted codon. 25expression OCTOBER 213 VOL 342 SCIENCE frequency and colored according to their relative synonymous codon usage (RSCU) in E. coli. (P values axis is the slope of a linear model linking codon use to expression after Bonferroni correction: *P <.5, **P <.5, P <.1). e. Codons are sorted left to right by increasing genomic frequency lored according to their relative synonymous codon usage (RSCU)

12 squares corthe quantitative gray squares cor- wt C R wt C R inc G inc G Protein Expression Protein (FlowSeq) Expression (FlowSeq) CTG TTA TTG CTG CTC TTA V Y Y C C B Mean fold change in expression due to codon substitution L Mean fold change in expression due to codon substitution B R2 =.41 2 R p < 1= p < x R2 2=.73 R =.73 p = p = 1.75x.75x.75x log2log RSCU 2 RSCU (RSCU) Usage Codon Synonymous Relative (RSCU) Usage Codon Synonymous Relative Choice of codon and Expression Codon CodonFrequency FrequencyEnrichment Enrichment at at N-terminus of E. N-terminus of E.coli coligenes Genes Fig. 2. Rare codons generally increase expression levels. (A) The average foldfold change in expression is iscorrelated average change in expression correlatedwith withthe thechoice choice of codon. codon. The The y axis is the slope of of a linear model y axis is the slope a linear modellinking linkingcodon codonuse use to to expression expression change. Codons sorted rightbybyincreasing increasinggenomic genomic frequency frequency change. Codons areare sorted leftleft to to right (B) The individual codon slopes (y axis) as in (A) show an inverse relationship with RSCU (x axis). Fig. 2. Rare codons generally increase expression levels. (A) The (C) The individual codon slopes correlate with enrichment of codons at the N terminus of genes in E. coli.

13 Choice of codon and Expression Fold Change in Protein Expression A 16x x B 16x x E G from average codon variant > -1% -8% -4% % +4% +8% > +1% > > +4 n= Change in %GC from average variant 19 Rare codons alter expression by reducing mrna secondary structure. (A) Expression changes are correlated Cwith relative changes in %GC content. Each D boxplot includes ±2% of centered value. F(B) Expression increases correlate to relative increases in free energy of folding at the front of the transcript (ΔΔG). Each boxplot includes ±2 kcal/mol of centered value. sion n R 2 =.87 ssion n G) n= R 2 = G from average variant (kcal/mol) 176 ssion x - p < 1 16

14 F 1 16x > -1% n=277-8% -4% % +4% +8% > +1% > > +4 Choice of 861 codon 19 n=89 and Expression Change in %GC from average variant 1 16x G from average variant (kcal/mol) C D F G f tai= tai fro Mean fold change in expression due to codon substitution.75x R 2 =.87 p < 1 16 Mean fold change in expression due to codon substitution (After Controlling for G).75x R 2 =.5 p =.197 Fold Change in Protein Expression 16x 1 p < 1 16 R 2 = x Codon substitution, G (kcal/mol) log 2 RSCU Fig. 3. Rare codons alter expression by reducing mrna secondary Rare codons alter expression by reducing mrna secondary structure. (C) Individual codon slopes correlate with structure. the ΔΔG (A)Expressionchangesarecorrelatedwithrelativechangesin%GC per individual codon substitution. (D) After controlling for ΔΔG with a multiple linear regression, there is no longer any relation between individual codon slopes and RSCU content. Each boxplot includes T2% of centered value. (B) Expressionincreases correlate to relative increases in free energy of folding at the front of the transcript (DDG). Each boxplot includes T2kcal/mol ofcentered value.(c) Individual codon slopes (same as Fig. 2A y axis) correlate with the DDG per -1-5 G from average codo linear regression, there is no longe slopes and RSCU (compare with Fig. plotted for all constructs within the qu their relative fold change in expressi the set. (F)Subsetsofconstructscorre Points with constant codon adaptatio

15 Choice of codon and Expression 2 +4 > e variant 176 E G from average codon variant Subset G= tai= Fold Change in Protein Level < 1 16x 1 16x 1 16x > 16x The classical translational efficiency cte i for each codon is the trna adaptation index (tai) W i absolute adaptiveness n i - the number of trna isoacceptors recognizing codon i s ij - a selective constraint on the efficiency of the codon-anticodon coupling tgcn ij gene copy number of the trna F (E) The ΔΔG versus tai= change in tai is plotted for G= all constructs within the quantitative range. Constructs are colored by their relative fold change in expression from the average codon variant within the set. in Expression 16x p < 1 16 R 2 = tai from average codon variant p =.773 R 2 =

16 G from average variant (kcal/mol) > > 16x Choice -15 of codon and Expression F -1 F tai from average codon variant tai= G= 16x 2 =.5 =.197 Fold Change in Protein Expression 16x 1 p < 1 16 R 2 =.39 p =.773 R 2 = 1 16x log 2 RSCU G from average codon variant tai from average codon variant ing mrna secondary elativechangesin%gc (B)Expressionincreases ing at the front of the ofcenteredvalue.(c) elate with the DDG per rddg withamultiple linear regression, there is no longer any relation between individual codon slopes and RSCU (compare with Fig. 2B). (E) TheDDG versus change intaiis plotted for all constructs within the quantitative range. Constructs are colored by their relative fold change in expression from the average codon variant within the set. (F)Subsetsofconstructscorrespondingtotheshadedboxesin(E).(Left) Points with constant codon adaptation and varied secondary structure, (right) points with constant secondary structure and varied codon adaptation. (F) Subsets of constructs corresponding to the shaded boxes in (E). (Left) Points with constant codon adaptation and varied secondary structure, (right) points with constant secondary structure and varied codon adaptation.

17 Hybridization probabilities and Expression +3% RBS variable codons GFP +2% Window Relative Hybridization Probability -log 1 (p-value) +1% % -1% -2% -3% Mean +/- 1 SD Best 5% of constructs (422 Constructs) Worst 5% of constructs (422 Constructs) bp Windows across transcript A multiple linear regression model that combines promoter and RBS choice, as well as N- terminal secondary structure and GC content, still explains only 54% of variation in expression levels Hybridization probabilities was calculated using UNAFOLD software mrna structure downstream of start codon is most correlated with reduced expression. (Top) The best and worst 5% of constructs as ranked by relative expression within a codon variant set are grouped and plotted as blue and red ribbons, respectively. The ribbon tops and bottoms are one standard de- viation from the mean, which is shown as a solid line. (Bottom) The p-value for linear regressions, correlating hybridization probabilities within each window to expression fold change in all constructs.

18 Conclusion Using N-terminal rare codons instead of common ones increase expression by ~14 fold (median 4 fold) Reduced RNA structure, not codon rarity itself, is responsible for expression increases This knowledge helps better predict how to synthesize genes that make enzymes or drugs in a bacterial cell.

19 Reference Daniel B. Goodman, George M. Church, Sriram Kosuri Causes and Effects of N-Terminal Codon Bias in Bacterial Genes Science. 213 Oct 25;342(6157):475-9

20 THANK YOU

21 Flow Cytometry Controls Fig. S5. All Flow Cytometry Controls. FlowSeq estimates of fluorescence ratio FlowSeq estimates of fluorescence ratio correlate well (R2 =.955, p < ) with individually measured fluorescence correlate well ratios (Rfrom 2 = sequence-verified.955, p < 2 1 clones. 16 ) with 51 individually constructs that measured were outside fluorescence of the quantitative ratios FlowSeq range are shown in red. from sequence-verified clones. 51 constructs that were outside of the quantitative FlowSeq range are shown in red.