

© 1993 Oxford University Press. Nucleic Acids Research, 1993, Vol. 21, No. 2

Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure

Paul M. Sharp and Andrew T. Lloyd

Department of Genetics, Trinity College, Dublin 2, Ireland

Received November 25, 1992; Revised and Accepted December 14, 1992

ABSTRACT

The recent determination of the complete sequence of chromosome III from the yeast Saccharomyces cerevisiae allows, for the first time, the investigation of the long range primary structure of a eukaryotic chromosome. We have found that, against a background G+C level of about 35%, there are two regions (one in each chromosome arm) in which G+C values rise to over 50%. This effect is seen in silent sites within genes, but not in noncoding intergenic sequences. The variation in G+C content is not related to differential selection of synonymous codons, and probably reflects mutational biases. That the intergenic regions do not exhibit the same phenomenon is particularly interesting, and suggests that they are under substantial constraint. The yeast chromosome may be a model of the structure of the human genome, since there is evidence that it is also a mosaic of long regions of different base compositions, reflected in wide variation of G+C content at silent sites among genes. Two possible causes of this regional effect, replication timing and recombination frequency, are discussed.

INTRODUCTION

As vast amounts of DNA sequence data accumulate, it is becoming possible to investigate whether the sequences of genes (and their patterns of evolution) are influenced by their chromosomal location. Perhaps the most striking result has been the demonstration that base composition (G+C content) varies enormously among mammalian genes and appears to be related to their genomic context (1).
For example, G+C content at third codon positions (which are generally silent) varies among human genes from less than 30% to over 90% (2); these values are correlated with the G+C content of the introns and flanking sequences of the same genes (3), and also among neighbouring genes (4,5), suggesting that the base composition variation is a local chromosomal effect. This has been interpreted as evidence that the human genome comprises a mosaic of long regions ('isochores') of different base composition (1,6). A similar genomic structure has been suggested for birds (1,6), and certain monocotyledonous plants (7), although the data are as yet less numerous. It is interesting, particularly with respect to elucidating the origins and possible significance of this phenomenon, to ask how widespread it is among other organisms.

Here we investigate base composition variation among genes from the budding yeast Saccharomyces cerevisiae. In particular, we have examined the recently determined complete sequence of chromosome III (8); this chromosome is 315 kb long, and contains presumptive genes (8,9). The results indicate position dependent variation in gene G+C content. Several factors which may be related to, and perhaps cause, this variation, namely (i) codon usage, (ii) time of replication, and (iii) frequency of recombination, are investigated. The latter two aspects are probably better understood for this chromosome than for any other, and codon usage has been more extensively analysed in S. cerevisiae than in any other eukaryote (2,10-13).

MATERIALS AND METHODS

From the complete sequence (8) of chromosome III (GenBank/EMBL/DDBJ accession number X59720), all 182 featured open reading frames (ORFs), and the intergenic regions between them, were extracted using the ACNUC sequence retrieval system (14). The featured ORFs are open reading frames of more than 100 codons in length identified in Ref. 8. Two transposable elements, Ty2 (8) and Ty5 (15), were excluded because codon usage in transposable element genes differs from that in chromosomal genes (16).
Two open reading frames (L26c, R74c; in the terminology of Ref. 8) were replaced by others which completely overlap them (LX8c, RX13w; see the GenBank entry), because the latter are longer, and have codon positional nucleotide frequencies more typical of yeast genes (13). One gene identified recently (17) as RIM1 was added. This yielded a total of 178 genes. There were only 153 intergenic sequences because some ORFs overlap slightly. Introns (only two have been identified on this chromosome) were excluded from the analysis. It is possible that some of these ORFs are not genes, though it is quite unlikely that long open reading frames exist by chance in DNA of this base composition. In addition, in a sequence of this length there may well be some errors (see, for example, Ref. 18), but we have found the overall results to be robust against minor changes in the dataset. To avoid any effects of amino acid composition, G+C content for genes was calculated only from silent third codon positions (i.e., excluding Met, Trp and termination codons). Codon usage

bias was measured by CAI, the codon adaptation index (19); CAI values can range between 0 and 1, with higher values indicating stronger bias. Values for silent site G+C content and CAI were calculated using the CODONS program (20). The expected standard deviation of G+C content among genes was calculated from the binomial theory, using the harmonic mean length of sequences.

Table 1. G+C content at silent sites in Saccharomyces cerevisiae.

                             No. of       G+C content (a)       Codon usage bias (b)
                             sequences    Mean ± SD (O/E)       Mean CAI     Rho
All genes (c)                             (2.4)
Chromosome III genes (d)                  (2.8)
Chromosome III noncoding                  ± 0.05 (1.7)

a For genes, G+C content was calculated only for silent third codon positions; O/E is the ratio of the Observed/Expected standard deviations.
b Codon usage bias is measured by CAI, the codon adaptation index (19); Rho is the correlation coefficient of G+C content with CAI.
c 575 genes from throughout the genome, listed in Ref. 13.
d 178 open reading frames (presumptive genes) from chromosome III (see Materials and Methods for details).

RESULTS

To analyse possible regional base composition variation in yeast we have examined 'silent' sites, i.e., synonymously variable sites within genes, and noncoding sequences, on the complete sequence (8) of chromosome III. By comparison with a set of 575 gene sequences from throughout the S. cerevisiae genome (13), it was established that chromosome III appears to be representative of the genome as a whole (Table 1). The G+C content of the entire chromosome is 39%, while that for the genome has been estimated (21) at 39-40%. The average value for G+C content at silent sites in ORFs on chromosome III is similar to that for all genes, and to the genomic base composition; chromosome III genes exhibit at least as much variation as the genome-wide dataset.
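As an illustration of the quantities compared in Table 1, the following Python sketch computes G+C content at silent third codon positions and the observed/expected standard deviation ratio. It is not the CODONS program (20): the function names are hypothetical, and the use of the sample standard deviation and the reading of 'binomial theory' as sqrt(p(1-p)/n), with n the harmonic mean number of silent sites, are assumptions made here.

```python
from math import sqrt
from statistics import harmonic_mean, mean, stdev

# Codons excluded from the silent-site count: Met, Trp and termination
# codons (standard genetic code), as stated in Materials and Methods.
NON_SILENT = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def silent_gc3(orf):
    """Return (G+C fraction, number of sites) at silent third codon positions."""
    orf = orf.upper()
    gc = sites = 0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        # Skip excluded codons and codons containing ambiguous bases.
        if codon in NON_SILENT or set(codon) - set("ACGT"):
            continue
        sites += 1
        gc += codon[2] in "GC"
    return (gc / sites if sites else float("nan")), sites


def observed_expected_sd(gc_values, site_counts):
    """Return (observed SD, expected binomial SD, O/E ratio).

    Assumption: expected SD = sqrt(p * (1 - p) / n_h), where p is the mean
    G+C content and n_h is the harmonic mean number of silent sites.
    """
    p = mean(gc_values)
    n_h = harmonic_mean(site_counts)
    expected = sqrt(p * (1 - p) / n_h)
    observed = stdev(gc_values)  # sample SD; the paper does not say which
    return observed, expected, observed / expected
```

An O/E ratio well above 1, such as the values in parentheses in Table 1, indicates more variation among sequences than sampling error alone would produce.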
The average G+C content in noncoding regions on chromosome III is somewhat lower than the average at silent sites in chromosome III genes; it is also less variable, both when observed standard deviations are compared, and when the ratios of observed/expected standard deviations are considered (Table 1).

The G+C content value for each gene was plotted against its position on chromosome III; the more G+C-rich genes did not appear to be randomly distributed (Fig. 1). When the corresponding data for noncoding sequences were plotted, their lower variability was evident, and there was less sign of any spatial pattern (Fig. 1). Regional effects became much clearer when moving average values for 15 adjacent sequences (genes, or intergenic regions) were considered (Fig. 2). Genes with high G+C content were seen to be clustered in two major regions, one on each arm of the chromosome (Fig. 2). Similar patterns of regionalization of base composition variation were evident for genes coded on each strand of the DNA (Fig. 2). Noncoding sequences did not show similar patterns: the only apparent (small) peak was near the centromere (Fig. 2).

To test whether the clustering of the more G+C-rich genes was significantly greater than might occur by chance, we performed simulations in which the order of genes was randomized, and the moving average G+C values for 15 adjacent sequences were recomputed. We then asked whether any values in the simulations were as high as those observed in the analysis in Fig. 2, where the observed values at the peaks in the left and right chromosome arms were 0.48 and 0.52, respectively.
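The randomization test described above can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, and weighting each window by the number of silent sites (as in the Fig. 2 legend) is assumed to apply to the simulations as well.

```python
import random


def moving_average(values, weights, window=15):
    """Weighted mean of each run of `window` adjacent sequences."""
    out = []
    for i in range(len(values) - window + 1):
        v = values[i:i + window]
        w = weights[i:i + window]
        out.append(sum(x * y for x, y in zip(v, w)) / sum(w))
    return out


def shuffle_test(values, weights, threshold, n_sim=10_000, window=15, seed=0):
    """Count simulations whose peak moving average exceeds `threshold`.

    Each simulation randomizes the order of the genes (keeping every gene's
    G+C value paired with its own weight) and recomputes the moving average.
    Requires at least `window` genes.
    """
    rng = random.Random(seed)
    genes = list(zip(values, weights))
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(genes)
        v, w = zip(*genes)
        if max(moving_average(v, w, window)) > threshold:
            hits += 1
    return hits
```

Given per-gene silent-site G+C values and site counts ordered by chromosome position, `shuffle_test(gc, sites, threshold=0.52)` returns the number of shuffles, out of `n_sim`, whose peak exceeds the threshold; dividing by `n_sim` gives an empirical P value.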
Out of 10,000 simulations in which all genes were shuffled, only 8 yielded a peak value greater than 0.52, but 426 yielded values greater than 0.48. Therefore, separate simulations were performed for genes from the two chromosome arms: 8 out of 10,000 simulations for the left arm genes yielded a peak value greater than 0.48, and 3 out of 10,000 simulations for the right arm yielded a value greater than 0.52. These simulations provide strong evidence that the clustering of G+C-rich genes in the two arms is highly significant (P < 0.001).

The chromosome was then considered as if consisting of five regions: two areas of high genic G+C content (the peaks in Fig. 2, taken to be located at kb, and kb, respectively), and the three surrounding areas. The mean G+C content values for genes and intergenic regions within these areas are given in Table 2. With respect to mean intergenic G+C content, there was no significant difference among the five regions (analysis of variance: F = 0.41 with 4,148 d.f., p = 0.80); for genes, there was no significant difference among the three regions with low G+C content (F = 0.64 with 2,107 d.f., p = 0.53), or between the two regions with high G+C content (t-test: t = 1.4 with 66 d.f., p = 0.16). Pooling the regions of high and low genic G+C content, respectively, the mean G+C content values for genes differed significantly between the two types of region (t = 9.08 with 176 d.f., p < ), but the mean values for intergenic sequences did not (t = 0.43 with 151 d.f., p = 0.67). However, these statistical tests must be taken with some caution, because the regions were designated subjectively after considering Fig. 2.

The relationship between silent site G+C content and codon usage bias (measured by the CAI) of yeast genes was investigated. Among genes from all chromosomes there was a small positive correlation between the strength of codon usage bias and G+C content, but for chromosome III genes there was no significant correlation (Table 1). Furthermore, codon usage bias values showed no relation to chromosome position, either when individual genes were considered (Fig.
3), or when a moving average was plotted (Fig. 3).

DISCUSSION

Mammalian chromosomes appear to be a mosaic of 'isochores', regions several hundred kilobases in length which differ in base composition, but within which base composition is relatively homogeneous (1,6). Recently, it has been suggested that variation in base composition may be similarly dependent on genomic location in a wide range of (if not all) eukaryotes and even

prokaryotes (22,23). However, the latter studies have not considered chromosome position and have focussed on G+C content at the third positions of genes, without taking into account variation in codon usage. This point is critical, because in diverse species (e.g., an insect, Drosophila melanogaster (24); a fungus, Aspergillus nidulans (25); a slime mould, Dictyostelium discoideum (26); and an enteric bacterium, Serratia marcescens (27)), analyses have indicated that silent site G+C content variation among genes can be largely accounted for by the frequencies of certain 'optimal' codons; these frequencies are in turn related to the level of gene expression.

Figure 1. G+C content at silent sites along yeast chromosome III. a, 178 open reading frames. b, 153 intergenic sequences. The ellipse indicates the position of the centromere.

Here we have demonstrated that genes from different regions of S. cerevisiae chromosome III have different levels of silent site G+C content. This is particularly interesting because it is quite unexpected: an earlier examination of long S. cerevisiae sequences revealed their G+C content values 'to be within a narrow range around that of the whole genome' (4), while in the studies cited above (22,23) yeast was the one eukaryote in which those authors did not find much G+C variation among genes. There has been extensive documentation of the fact that yeast genes vary considerably in codon usage bias, depending on their level of expression. Can this explain the variation in G+C content? In yeast genes expressed at a high level, codon usage is very biased (10), and 22 translationally 'optimal' codons have been identified (13): in the most strongly biased genes very few other codons are used. However, of these 22 optimal codons, 50% end in C or G, and so there would not necessarily be any correlation between the strength of codon usage bias and silent site G+C content (and none was observed among chromosome III genes).
Also, it has been reported that highly expressed transcripts are encoded by genes scattered over the chromosome (9), and we have found that genes with high or low codon usage bias (measured by the CAI) show no particular distribution along the chromosome. Thus, the G+C content variation among chromosome III genes does not appear to be related to gene expression level: we infer that it is due to the genes' location.

Table 2. G+C content in regions of yeast chromosome III.

                         Genes                      Intergenic sequences
Region (a)               N (b)    Mean ± SD (c)     N (b)    Mean ± SD (c)
High G+C regions (d)
Low G+C regions (e)

a Coordinates of the region are given in kilobases.
b Number of sequences.
c Mean and standard deviation of G+C content; for genes the value refers to silent third codon positions.
d Regions 40-80 kb and kb.
e Regions 0-40 kb, kb, and kb.

There is no reason to expect that chromosome III is atypical among yeast chromosomes. Genes located throughout the genome show levels of G+C content variability similar to those on chromosome III. Interestingly, when multivariate statistical analysis (correspondence analysis) is applied to codon usage in yeast genes, the second most important trend among genes is highly correlated with G+C content at silent sites. [For the 575 genes from throughout the genome (13), the positions of genes on the second axis produced by the correspondence analysis, and the silent site G+C content values for those genes, are correlated with a coefficient of ] The primary trend among genes is that already discussed, i.e., variation in the extent of usage of optimal codons, accounting for 34% of the variation among genes; the secondary trend (i.e., G+C content) can explain a further 6.4% of this variation. [Recently, we have reported similar observations from correspondence analysis of codon usage in a related yeast, Candida albicans (28), although the number of genes available for analysis was far more limited.] Thus, the G+C content variation is a relatively minor effect by comparison with the major variation in usage of optimal codons; furthermore, without the precise map locations provided by the complete sequence of S.
cerevisiae chromosome III, it was not apparent that this base composition variation is related to gene location.

An obvious question is whether the G+C content variation in yeast is a similar phenomenon to that observed in the human genome (1). The regions of different G+C content on yeast chromosome III are about kb in length (Fig. 2; Table 2). This is rather shorter than the presumed size of human isochores, although their lengths have not been estimated with any accuracy. Chromosomes in yeast have an average length

a little under 1 megabase, whereas human chromosomes are on average around 100 megabases long. Human genes are also longer than those in yeast (since most human genes contain introns), and the intergenic regions in humans are also longer than those in yeast. Thus, given this difference in scale between the two genomes, if isochores do exist in yeast, they might be expected to be shorter than those on human chromosomes. However, this depends critically on what the causes of this aspect of chromosome structure are, and whether these factors are the same in the two species.

Figure 2. Moving average G+C content at silent sites along the yeast chromosome III sequence; each point is the weighted average for 15 adjacent sequences (weighting was by the number of sites in each sequence). a, G+C values for coding (open circle) and noncoding (closed circle) sequences. b, Coding sequences on the Crick (open circle) and Watson (closed circle) strands. The ellipse indicates the position of the centromere.

Figure 3. Codon usage bias and gene position on yeast chromosome III. a, CAI values for individual genes. b, Moving average of CAI values for 15 adjacent genes. The ellipse indicates the position of the centromere.

Since silent sites in human genes appear to be under little, if any, selective constraint, their base composition is expected to reflect mutational biases (5,29,30). Consequently, the G+C content variation along the human genome has been interpreted as evidence that different regions of chromosomes are subject to different mutational spectra (5,31-33). Similarly, codon usage (and hence silent site base composition) in yeast genes, other than those which are highly expressed, appears to be largely influenced by mutational biases (11,12). Why might mutational biases vary among chromosome regions?
For mammals, we have speculated that this arises because the regions are replicated at different times (and that the intracellular conditions which influence the spectrum of mutations, such as the relative concentrations of free nucleotide pools, vary during the replication cycle) and that the more G+C-rich regions are those replicated early (5). [It has recently been suggested that there is no correlation between G+C content and time of replication for human genes (34); however, the time of replication varies among tissues, and since the data cited do not refer to germline cells, it may not be relevant.] The time of replication has been measured (35) for several points along the first 200 kb of yeast chromosome III, and the G+C-rich peak in the left arm of the chromosome (Fig. 2) coincides approximately with an early replicating region; the replication time data do not extend as far as the location of the peak in the right arm.

Alternatively, it has been suggested that base composition variation around the human genome may be correlated with the local frequency of recombination (A. Eyre-Walker, pers. comm.), since recombination events involve DNA repair, and that process is biased to G+C-richness (36). Comparison of distances on the physical and genetic maps of yeast chromosome III has indicated

that recombination frequencies are higher towards the middle of the two chromosome arms (8,9). Thus, on the large scale, the areas of higher G+C content and higher recombination frequency may coincide. On a shorter scale, there does not appear to be a clear correlation between these two factors. Four 'hot spots' of recombination have been identified on chromosome III (reviewed in Ref. 37). One hot spot lies between HIS4 and BIK1 at about 68 kb, within the G+C-rich region on the left arm, though a little to the right of its peak. A second hot spot lies to the left of (perhaps 10 kb away from) THR4; THR4 is in the centre of the G+C-rich peak on the right arm. A third hot spot may be close to the small peak to the left of the centromere (Fig. 2), but the fourth (near 85 kb) does not appear to be in a G+C-rich region. Of course, if G+C-richness and high recombination frequency are correlated, it remains to be determined whether either effect causes the other.

In the human genome, G+C content in noncoding sequences (introns or flanking sequences) is highly correlated with that at silent sites in the neighbouring coding sequences (4). However, the yeast and human genomes differ in this respect: the intergenic regions of yeast chromosome III do not exhibit base composition variation similar to the silent sites in genes. It might have been expected that intergenic sequences are under relatively little constraint, and should be influenced by any regional mutational biases. Thus, the observation that the intergenic sequences within the G+C-rich peaks do not show the same elevated G+C content as the silent sites in genes is particularly interesting. In yeast these intergenic regions are much shorter than in the human genome (the average length on yeast chromosome III is 638 bp). Since many regulatory elements are located within these regions, these may constitute a selective constraint on base composition.

In conclusion, it remains to be seen whether the factors determining regional genic base composition variation are the same in the human and yeast genomes.
However, whether replication time, recombination frequency, or some other factors are the underlying cause(s), the issue is likely to be more easily resolved for a 'model' organism like yeast.

ACKNOWLEDGEMENTS

This is a paper from the Irish National Centre for Bioinformatics. We are grateful to André Goffeau, David McConnell, Myra O'Regan and Ken Wolfe for discussion, and particularly to Adam Eyre-Walker for sharing his ideas on the possible link between G+C content and recombination frequencies. This work was supported by grant SC/91/603 from EOLAS (The Irish Science and Technology Agency).

REFERENCES

1. Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M. and Rodier, F. (1985) Science, 228.
2. Ikemura, T. (1985) Mol. Biol. Evol., 2.
3. Aota, S.-i. and Ikemura, T. (1986) Nucleic Acids Res., 14.
4. Ikemura, T. and Aota, S.-i. (1988) J. Mol. Biol., 203.
5. Wolfe, K.H., Sharp, P.M. and Li, W.-H. (1989) Nature, 337.
6. Bernardi, G. (1989) Annu. Rev. Genet., 23.
7. Matassi, G., Montero, L.M., Salinas, J. and Bernardi, G. (1989) Nucleic Acids Res., 17.
8. Oliver, S.G. et al. (1992) Nature, 357.
9. Yoshikawa, A. and Isono, K. (1990) Yeast, 6.
10. Bennetzen, J.L. and Hall, B.D. (1982) J. Biol. Chem., 257.
11. Sharp, P.M., Tuohy, T.M.F. and Mosurski, K.R. (1986) Nucleic Acids Res., 14.
12. Bulmer, M. (1990) Nucleic Acids Res., 18.
13. Sharp, P.M. and Cowe, E. (1991) Yeast, 7.
14. Gouy, M., Gautier, C., Attimonelli, M., Lanave, C. and di Paola, G. (1985) CABIOS, 1.
15. Voytas, D.F. and Boeke, J.D. (1992) Nature, 358.
16. Shields, D.C. and Sharp, P.M. (1989) J. Mol. Biol., 207.
17. Van Dyck, E., Foury, F., Stillman, B. and Brill, S.J. (1992) EMBO J., 11.
18. Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R. and Sonnhammer, E. (1992) Nature, 358.
19. Sharp, P.M. and Li, W.-H. (1987) Nucleic Acids Res., 15.
20. Lloyd, A.T. and Sharp, P.M. (1992) J. Hered., 83.
21. Mandel, M. (1970) In Sober, H.A. (ed.), Handbook of Biochemistry. CRC, Cleveland, pp. H75-H
22. D'Onofrio, G. and Bernardi, G. (1992) Gene, 110.
23. Sueoka, N. (1992) J. Mol. Evol., 34.
24. Shields, D.C., Sharp, P.M., Higgins, D.G. and Wright, F. (1988) Mol. Biol. Evol., 5.
25. Lloyd, A.T. and Sharp, P.M. (1991) Mol. Gen. Genet., 230.
26. Sharp, P.M. and Devine, K.M. (1989) Nucleic Acids Res., 17.
27. Sharp, P.M. (1990) Mol. Microbiol., 4.
28. Lloyd, A.T. and Sharp, P.M. (1992) Nucleic Acids Res., 20.
29. Sharp, P.M. (1989) In Hill, W.G. and Mackay, T.F.C. (eds.), Evolution and Animal Breeding. C.A.B. International, Wallingford, pp
30. Eyre-Walker, A.C. (1991) J. Mol. Evol., 33.
31. Filipski, J. (1987) FEBS Letts., 217.
32. Sueoka, N. (1988) Proc. Natl. Acad. Sci. U.S.A., 85.
33. Filipski, J., Salinas, J. and Rodier, F. (1989) J. Mol. Biol., 206.
34. Eyre-Walker, A. (1992) Nucleic Acids Res., 20.
35. Reynolds, A.E., McCarroll, R.M., Newlon, C.S. and Fangman, W.L. (1989) Mol. Cell. Biol., 9.
36. Brown, T.C. and Jiricny, J. (1988) Cell, 54.
37. Zenvirth, D., Arbel, T., Sherman, A., Goldway, M., Klein, S. and Simchen, G. (1992) EMBO J., 11.