Architectural Power Optimization by Bus Splitting

Cheng-Ta Hsieh and Massoud Pedram
EE-Systems, University of Southern California

This research was supported in part by SRC under contract No. 98-DJ-606 and by NSF under NSF-PECASE Award MIP.

Abstract

A split-bus architecture is proposed to improve the power dissipation for global data exchange among a set of modules. The resulting bus splitting problem is formulated and solved combinatorially. Experimental results show that the power saving of the split-bus architecture compared to the monolithic-bus architecture varies from 16% to 50%, depending on the characteristics of the data transfer among the modules and the configuration of the split bus. The proposed split-bus architecture can be extended to a multi-way split bus when a large number of modules are to be connected.

1 Introduction

To increase the level of integration and the performance, system-on-a-chip is widely deployed in today's designs. In such designs, communication resources are allocated to connect the on-chip modules for data exchange. Two widely used communication architectures are the point-to-point connection (unidirectional) and the shared bus (bi-directional). In addition to system-on-a-chip designs, microprocessors, digital signal processors and embedded controllers also use these two types of interconnection architecture. This paper proposes a split shared-bus architecture (c.f. Figure 1) to reduce the power consumption of the monolithic shared bus (c.f. Figure 2).

Figure 1. Split shared bus architecture.

The advantages of the shared bus architecture include simple topology, low area cost and extensibility. The disadvantages of the shared bus architecture are larger load per data-bus line, longer delay for data transfer, larger power consumption, and lower bandwidth. Fortunately, the above disadvantages, except the bandwidth, may be overcome by using a low-voltage swing signaling technique [1]. In a low-voltage swing architecture, the signal being transferred from a module is first converted to a low-voltage swing signal and then propagated along the shared bus. The low-voltage swing signal is finally converted back to a full-swing signal at the input of the receiving module.
In this way, the amount of charge on the bus will only change by $C_{BUS} V_s$, where $V_s$ is the voltage swing on the bus and $C_{BUS}$ is the capacitive load of the bus. Therefore, the low-voltage swing bus achieves a power reduction of $(1 - V_s/V)$ compared to the case of a full-swing bus, where $V$ is the full-swing voltage. The signal delay on the bus is also reduced by

$$\Delta t = \frac{C_{BUS}\,(V - V_s)}{I}$$

where $I$ is the average current of the driver. Notice that the bus encoding techniques reviewed in [2] can be used to further reduce the power consumption of the on-chip bus.

Figure 2. Monolithic shared-bus architecture.

2 Monolithic Bus Structure

Without loss of generality, we consider a one-bit bus. Results for a k-bit bus can be easily obtained by scaling the one-bit bus results by k. Assume we have $n$ modules $M_1, M_2, \ldots, M_n$ connected to each other through a bidirectional shared bus as shown in Figure 2. During the architectural simulation, we simulate the system for $N$ cycles, from cycle 1 to cycle $N$. In each cycle $t$, the data with logic value $D_t$ is transferred from module $M_{SRC(t)}$ to module $M_{DST(t)}$. Assume that the receiver gate for each module has minimum size and its input capacitance is $C_g$. Furthermore, the output capacitance of the driver for each module $M_i$ is $C_{o,i}$. $C_{BUS}$ is calculated as follows:

$$C_{BUS} = L_{BUS}\, C_u + C_c + \sum_{i=1}^{n} (C_{o,i} + C_g)$$

where $L_{BUS}$ is the physical length of the bus, $C_u$ denotes the capacitance per unit length of the bus, $C_c$ denotes the coupling capacitance due to the parallel running bus wires as well as other nearby wires on adjacent metal layers, and $n$ is the number of modules connected to the bus. The average energy consumption during the $N$ cycles is:

$$E_1 = 0.5\, C_{BUS} V^2 \cdot \frac{1}{N} \sum_{t=1}^{N} (D_t \oplus D_{t-1}) + \sum_{i=1}^{n} E_{driver,M_i}$$

where $D_t$ is the logic value on the bus in cycle $t$, $V$ is the supply voltage, and $E_{driver,M_i}$ is the average internal energy dissipation per clock cycle of the bus driver of module $M_i$.

Figure 3. Circuit diagram of tri-state bus driver.

A typical tri-state driver is shown in Figure 3. Notice that $p = \overline{in \cdot oe}$ and $n = \overline{in} \cdot oe$. The switching activities of $p$ and $n$ are:

$$sw(p) = prob(oe = 1 \to 1,\ in = 1 \to 0) + prob(oe = 1 \to 1,\ in = 0 \to 1) + prob(oe = 1 \to 0,\ in = 1 \to x) + prob(oe = 0 \to 1,\ in = x \to 1)$$

$$sw(n) = prob(oe = 1 \to 1,\ in = 1 \to 0) + prob(oe = 1 \to 1,\ in = 0 \to 1) + prob(oe = 1 \to 0,\ in = 0 \to x) + prob(oe = 0 \to 1,\ in = x \to 0)$$

where $prob(oe = v_1 \to v_2,\ in = v_3 \to v_4)$ denotes the probability of $(oe, in) = (v_1, v_3)$ in the current cycle and $(oe, in) = (v_2, v_4)$ in the next clock cycle; $x$ denotes don't care. If the input $in$ is not correlated with $oe$, the above equations can be simplified as:

$$sw(p) = 2\, prob(in)\, prob(oe)\, [1 - prob(in)\, prob(oe)]$$

$$sw(n) = 2\, prob(in = 0)\, prob(oe)\, [1 - prob(in = 0)\, prob(oe)]$$

where $prob(x)$ and $prob(x = 0)$ denote the probability of $x = 1$ and $x = 0$ in a clock cycle. The average internal energy dissipation of the driver stage per clock cycle is:

$$E_{driver} = 0.5\, V^2 \left( sw(p)\, C_{eff,bufp} + sw(n)\, C_{eff,bufn} \right)$$

where $C_{eff,bufp}$ ($C_{eff,bufn}$) denotes the physical capacitance driven by the NAND (NOR) gate.

3 Split Bus Architecture

For a long bus line, the parasitic resistance and capacitance of the bus line are large. For example, in Figure 2, the propagation delay from module M1 to module M6 is very large. To improve the timing and power consumption of the long bus, we can partition the bus into two bus segments as shown in Figure 1. The dual-port driver at the boundary of bus 1 and bus 2 relays the data from one bus to the other when such a data transfer is needed. Therefore the split bus architecture works the same way as a single bus. If the intrinsic delay (and power consumption) of the dual-port driver is small compared to the rest of the bus, which is the case for a long bus connection, then the new bus architecture will always be better than the single bus architecture. Advantages of bus splitting are:

- Smaller parasitic load: Because the bus length is reduced, the parasitic load of each bus segment is reduced.
- Larger timing slack: Due to the smaller parasitic load of the two bus parts, and because smaller output capacitances from the driver outputs are added as load to any part of the split bus, the timing slack becomes more positive.
- Smaller driver size: Because the timing slack is larger, the driver size can be made smaller while meeting the timing constraint.
- Lower power consumption: Since a smaller load and smaller drivers are used, the effective physical capacitance for each bus part is smaller. In the case of data being transferred within the same bus partition, the power consumption is significantly reduced because there is no switching activity in the other bus partition.
- Lower noise problems: Parallel running buses are at the greatest risk with respect to coupling noise. Reducing the bus wire length effectively reduces the amount of capacitive coupling noise.

In Figure 1, modules M1, M2 and M3 reside on the bus on the left and modules M4, M5 and M6 sit on the other side. Let BUS1 be the set of modules on the left bus and BUS2 denote the set of modules on the right bus. When en1 is 1, BUF1 will relay the data from bus 1 to bus 2. Similarly, BUF2 will pass the data from bus 2 to bus 1 when en2 is 1. Note that en1 and en2 should not be set to 1 at the same time. When both en1 and en2 are 0, bus 1 and bus 2 are isolated from one another. In this section, we assume the driver sizes are fixed.

3.1 Examples

In the following examples, we assume that the output capacitance of the drivers is zero and ignore the energy consumption within the drivers. The data being transferred by any module on the data bus is modeled as an independent random variable with an average switching activity equal to $sw$. The average energy consumption of the single bus architecture is calculated as:

$$E_1 = 0.5\, V^2\, sw\, C_{BUS}$$

Let $C_{BUS1}$ and $C_{BUS2}$ denote the physical capacitance on bus 1 and bus 2. The average energy consumption of the split bus architecture per clock cycle is calculated as:

$$E_2 = 0.5\, V^2\, sw \Big[ C_{BUS1} \sum_{M_i, M_j \in BUS1} xfer(M_i, M_j) + C_{BUS2} \sum_{M_i, M_j \in BUS2} xfer(M_i, M_j) + (C_{BUS1} + C_{BUS2}) \sum_{M_i \in BUS1,\ M_j \in BUS2} xfer(M_i, M_j) + (C_{BUS1} + C_{BUS2}) \sum_{M_i \in BUS2,\ M_j \in BUS1} xfer(M_i, M_j) \Big]$$

where $xfer(M_i, M_j)$ denotes the probability of module $M_i$ transferring data to module $M_j$ in any clock cycle.
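The split-bus energy expression above can be evaluated directly for any candidate partition. The following is a minimal sketch under the simplifying assumptions of this section (zero driver output capacitance, driver energy ignored, one transfer in every cycle). The function names, the dictionary layout of `xfer`, and the three-module numbers are illustrative assumptions and do not come from the paper.

```python
def monolithic_bus_energy(c_bus, sw=0.5, v=1.0):
    """E1 = 0.5 * V^2 * sw * C_BUS for the single (monolithic) bus."""
    return 0.5 * v ** 2 * sw * c_bus


def split_bus_energy(xfer, bus1, bus2, c_bus1, c_bus2, sw=0.5, v=1.0):
    """E2 for a two-way split bus under the simplified model.

    xfer maps ordered pairs (source, destination) to transfer probabilities.
    A transfer that stays inside one segment switches only that segment;
    a transfer that crosses the dual-port driver switches both segments.
    """
    bus1, bus2 = set(bus1), set(bus2)
    weighted_cap = 0.0
    for (src, dst), prob in xfer.items():
        if src in bus1 and dst in bus1:
            weighted_cap += prob * c_bus1
        elif src in bus2 and dst in bus2:
            weighted_cap += prob * c_bus2
        else:
            weighted_cap += prob * (c_bus1 + c_bus2)
    return 0.5 * v ** 2 * sw * weighted_cap


# Hypothetical three-module system (numbers chosen only for illustration),
# with one unit of capacitance per module attached to a segment.
xfer = {("M1", "M2"): 0.6, ("M2", "M3"): 0.3, ("M1", "M3"): 0.1}
e1 = monolithic_bus_energy(c_bus=3.0)
e2 = split_bus_energy(xfer, {"M1", "M2"}, {"M3"}, c_bus1=2.0, c_bus2=1.0)
saving = (e1 - e2) / e1
```

In this hypothetical case most of the traffic stays between M1 and M2, so placing the dual-port driver between M2 and M3 keeps the second segment idle for 60% of the transfers.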

In the following examples, we set $sw = 0.5$ and normalize $C_{BUS1}/|BUS1| = C_{BUS2}/|BUS2| = C_{BUS}/n = 1$, $V = 1$.

Example 1

Assume we have $n = 2k$ modules and $|BUS1| = k - a$, $|BUS2| = k + a$, where $a \in \{0, 1, \ldots, k-1\}$. The probability of transferring data from module $M_i$ to module $M_j$ in any clock cycle is $1/(2k(2k-1))$ for $i = 1..2k$, $j = 1..2k$, $i \neq j$. Then

$$E_1 = 0.5k$$

$$E_2 = 0.25 \cdot \frac{3k^3 - k^2 + a^2 (k - 1)}{2k^2 - k}$$

The power saving of the split bus over the monolithic bus can be calculated by:

$$\frac{E_1 - E_2}{E_1} = \frac{k^3 - k^2 - a^2 (k - 1)}{4k^3 - 2k^2}$$

The power saving is maximized when $a = 0$. For the case of $k = 2$ and $a = 0$, the power saving is 16%. When $k \to \infty$ and $a = 0$, the power saving is 25%. If we set $a = k$, which is the case of the monolithic bus, then the power saving is 0.

Example 2

Assume that there are four modules connected to the bus. The probability of transferring data between module $M_i$ and module $M_j$ is specified by the label of the edge $(M_i, M_j)$ in Figure 4.

Figure 4. Data transfer probabilities for Example 2 (edge labels: two edges of 1/4 and four edges of 1/8).

The energy consumption for various architectures is summarized in the following table:

Architecture                        Energy
BUS={M1,M2,M3,M4}                   1
BUS1={M1,M2}, BUS2={M3,M4}          0.75
BUS1={M1,M3}, BUS2={M2,M4}          0.875
BUS1={M1,M4}, BUS2={M2,M3}          0.875

The bus partitioning solution with BUS1={M1,M2}, BUS2={M3,M4} consumes the least power because more of the data transfers are performed within each part.

Figure 5. Data transfer probabilities for Example 3 (edge labels: one edge of 3/4, one edge of 1/8 and eight edges of 1/64).

Example 3

For the 5-module configuration shown in Figure 5, the power consumption for several configurations is listed below:

Architecture                        Energy
BUS={M1,M2,M3,M4,M5}                1.25
BUS1={M1,M2}, BUS2={M3,M4,M5}       0.66
BUS1={M1,M2,M3}, BUS2={M4,M5}       0.79
BUS1={M1,M3}, BUS2={M2,M4,M5}       1.13

The second bus splitting configuration has the lowest energy consumption, which achieves a 47% reduction in the energy consumption compared to the single bus architecture. Note that although edge (M1, M3) has a weight of 1/8, which is the second largest value in this example, adding M3 to BUS1={M1,M2} increases $C_{BUS1}$ and hence results in higher power dissipation.
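The closed-form saving of Example 1 can be cross-checked against a direct evaluation of the split-bus energy with uniform transfer probabilities. The sketch below uses exact rational arithmetic; the function names are hypothetical, and the normalization (one unit of capacitance per module, sw = 0.5, V = 1) is the one stated at the start of the examples.

```python
from fractions import Fraction


def example1_saving(k, a):
    """Closed-form saving (E1 - E2) / E1 for 2k modules split k-a / k+a."""
    k, a = Fraction(k), Fraction(a)
    return (k ** 3 - k ** 2 - a ** 2 * (k - 1)) / (4 * k ** 3 - 2 * k ** 2)


def example1_saving_by_counting(k, a):
    """Same saving, obtained by counting ordered module pairs directly."""
    n = 2 * k
    size1, size2 = k - a, k + a
    pair_prob = Fraction(1, n * (n - 1))      # uniform xfer(Mi, Mj)
    within1 = size1 * (size1 - 1) * pair_prob  # both modules on bus 1
    within2 = size2 * (size2 - 1) * pair_prob  # both modules on bus 2
    cross = 2 * size1 * size2 * pair_prob      # transfer crosses the buffer
    quarter = Fraction(1, 4)                   # 0.5 * V^2 * sw with V=1, sw=0.5
    e1 = quarter * n
    e2 = quarter * (within1 * size1 + within2 * size2 + cross * n)
    return (e1 - e2) / e1


smallest_case = example1_saving(2, 0)        # k = 2, balanced split
large_case = float(example1_saving(10 ** 6, 0))
monolithic_case = example1_saving(5, 5)      # a = k puts every module on one bus
```

Because both functions return `Fraction` values, they can be compared for exact equality over a range of `k` and `a`, which confirms that the closed form agrees with the pair-counting argument.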
3.2 An Accurate Power Consumption Model

Similar to the case of the monolithic bus, the physical capacitance on bus 1 and bus 2 can be calculated as:

$$C_{BUS1} = L_{BUS1}\, C_u + C_{c,1} + \sum_{M_i \in BUS1} (C_{o,i} + C_g) + C_{o,BUF2} + C_{in,BUF1}$$

$$C_{BUS2} = L_{BUS2}\, C_u + C_{c,2} + \sum_{M_i \in BUS2} (C_{o,i} + C_g) + C_{o,BUF1} + C_{in,BUF2}$$

where $L_{BUS1}$ and $L_{BUS2}$ are the bus lengths of bus 1 and bus 2; $C_{c,1}$ and $C_{c,2}$ are the coupling capacitances for bus 1 and bus 2; $C_{o,BUF1}$ and $C_{o,BUF2}$ are the output capacitances of BUF1 and BUF2, respectively; $C_{in,BUF1}$ and $C_{in,BUF2}$ are the input capacitances of BUF1 and BUF2. Here we assume that the wire widths of both buses are the same. Again, we assume the minimum gate size for the receiver of each module. The logic values on bus 1 and bus 2 in each clock cycle $t$ are calculated as follows:

$$BUS1_t = \begin{cases} BUS1_{t-1} & \text{if } M_{SRC(t)} \in BUS2 \text{ and } M_{DST(t)} \in BUS2 \\ D_t & \text{otherwise} \end{cases}$$

$$BUS2_t = \begin{cases} BUS2_{t-1} & \text{if } M_{SRC(t)} \in BUS1 \text{ and } M_{DST(t)} \in BUS1 \\ D_t & \text{otherwise} \end{cases}$$

where $D_t$ denotes the logic value being transferred in clock cycle $t$. The average energy consumption of the split bus architecture is calculated as:

$$E_2 = 0.5\, C_{BUS1} V^2 \cdot \frac{1}{N} \sum_{t=1}^{N} (BUS1_t \oplus BUS1_{t-1}) + 0.5\, C_{BUS2} V^2 \cdot \frac{1}{N} \sum_{t=1}^{N} (BUS2_t \oplus BUS2_{t-1}) + \sum_{i=1}^{n} E_{driver,M_i} + E_{driver,BUF1} + E_{driver,BUF2}$$

where $E_{driver,M_i}$ and $E_{driver,BUFx}$ are the average energy consumptions per clock cycle for module $M_i$ and buffer $x$ and are calculated by the equations in Section 2. $N$ is the number of simulated cycles.

3.3 A Probabilistic Power Consumption Model

Usually $N$ must be very large so that the collected trace becomes representative of real application data. To speed up the power consumption calculation, a probabilistic model can be used. Note that the model is only exact

under the assumption of data stationarity [4]. Assume that the data being transferred from each module can be modeled as a time-invariant random process with probability $prob(M_i)$ for the data value to be 1. Furthermore, assume that the data transfer pair $(M_{SRC(t)}, M_{DST(t)})$ at clock $t$ is not correlated with the data transfer pair $(M_{SRC(t')}, M_{DST(t')})$ at clock $t'$. Let $xfer(BUS1, BUS2)$ denote the probability of bus 1 transferring data to bus 2 in any clock cycle. It is calculated as:

$$xfer(BUS1, BUS2) = \sum_{M_i \in BUS1,\ M_j \in BUS2} xfer(M_i, M_j)$$

$xfer(BUS2, BUS1)$ is calculated similarly. Let $xfer(BUS1)$ denote the probability of data transfers occurring on bus 1. It is calculated as:

$$xfer(BUS1) = xfer(BUS2, BUS1) + \sum_{M_i \in BUS1,\ M_j \in BUS1 \cup BUS2} xfer(M_i, M_j)$$

$xfer(BUS2)$ is calculated similarly. Let $prob(BUS1)$ denote the probability for bus 1 having a logic value 1 in a clock cycle. It is calculated as:

$$prob(BUS1) = \frac{\sum_{M_i \in BUS1,\ M_j \in BUS1 \cup BUS2} xfer(M_i, M_j)\, prob(M_i) + \sum_{M_i \in BUS2,\ M_j \in BUS1} xfer(M_i, M_j)\, prob(M_i)}{xfer(BUS1)}$$

$prob(BUS2)$ is calculated similarly. The switching activities of bus 1 and bus 2 (assuming temporal independence of data values on the bus) are:

$$sw(BUS1) = 2\, prob(BUS1)\, [1 - prob(BUS1)]\, xfer(BUS1)$$

$$sw(BUS2) = 2\, prob(BUS2)\, [1 - prob(BUS2)]\, xfer(BUS2)$$

Therefore, the average energy consumption per clock cycle of the split bus architecture is calculated as:

$$E_2 = 0.5\, V^2 \left( C_{BUS1}\, sw(BUS1) + C_{BUS2}\, sw(BUS2) \right) + \sum_{i=1}^{n} E_{driver,M_i} + E_{driver,BUF1} + E_{driver,BUF2}$$

where $E_{driver,x}$ can be calculated by the equations in Section 2.

4 Bus Splitting for Low Power

Assume that we perform bus splitting after the modules on the bus have been placed and the bus wires have been routed. During this design phase, the order of the modules on the bus is already known; therefore the only degree of freedom is selecting a bus segment, from 1 to $n-1$, at which to place the dual-port driver. Let $sw_{BUS1}(j)$ and $E(j)$ denote the switching activity of the data on bus 1 and the energy dissipation on the bus with the dual-port driver positioned at bus segment $j$. The symbols with subscript BUS2 are defined similarly.

Greedy Algorithm

1. Calculate $sw_{BUS1}(j)$ and $sw_{BUS2}(j)$ for the buffer position at segment $j$, $j = 1..n-1$.
2. Calculate $E(j)$ for the buffer position at segment $j$, $j = 1..n-1$.
3. Find the minimum $E(j)$.

The complexity of the algorithm is dominated by that of the first step, which is $O(n^2)$. The algorithm is obviously optimal.
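The three steps of the greedy algorithm can be sketched as a scan over the n-1 candidate segments. This is a simplified illustration, not the authors' implementation: it ignores driver and buffer energy, assumes every module has the same data switching activity `sw` so that each segment's activity is `sw` times the probability that the segment is used, and takes a hypothetical per-module capacitance list `caps` to obtain the two segment capacitances for each position. The demonstration data assumes that the two 1/4 edges of Example 2 are (M1, M2) and (M3, M4) and that each edge weight is split evenly between the two transfer directions.

```python
def best_split_position(xfer, caps, sw=0.5, v=1.0):
    """Return (j, energy) minimizing the energy over segments j = 1 .. n-1.

    xfer[s][d] is the probability that module s sends data to module d,
    with modules indexed in their fixed left-to-right order on the bus.
    caps[i] is the capacitance that module i contributes to its segment.
    This direct version recomputes the sums for every position; updating
    them incrementally from one position to the next would be cheaper.
    """
    n = len(caps)
    best = None
    for j in range(1, n):                # modules 0..j-1 left, j..n-1 right
        c_bus1, c_bus2 = sum(caps[:j]), sum(caps[j:])
        used1 = used2 = 0.0              # probability each segment is active
        for s in range(n):
            for d in range(n):
                if s == d:
                    continue
                if s < j or d < j:       # transfer touches the left segment
                    used1 += xfer[s][d]
                if s >= j or d >= j:     # transfer touches the right segment
                    used2 += xfer[s][d]
        energy = 0.5 * v ** 2 * sw * (c_bus1 * used1 + c_bus2 * used2)
        if best is None or energy < best[1]:
            best = (j, energy)
    return best


# Example 2 with the module order M1, M2, M3, M4 (edge placement assumed).
edges = {(0, 1): 1 / 4, (2, 3): 1 / 4,
         (0, 2): 1 / 8, (0, 3): 1 / 8, (1, 2): 1 / 8, (1, 3): 1 / 8}
n_modules = 4
xfer = [[0.0] * n_modules for _ in range(n_modules)]
for (a, b), weight in edges.items():
    xfer[a][b] = xfer[b][a] = weight / 2
position, energy = best_split_position(xfer, caps=[1.0] * n_modules)
```

With this data the scan selects the segment between M2 and M3, and the resulting energy matches the 0.75 entry of the Example 2 table.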
When the bus splitting is performed before the system-level floor-planning is completed, we have the freedom to rearrange the order of the modules to maximize the power reduction. We first show that this problem is NP-hard.

Theorem: The bus splitting problem with unknown module order is an NP-hard problem.

Proof outline: The proof is done by converting the minimum cut into bounded sets (MCBS) problem [3] with equal set sizes to the bus splitting problem. The conversion is done by forming a bus splitting problem in which the number of modules is equal to the number of vertices in the MCBS and the data transfer probability $xfer(M_i, M_j)$ is proportional to $x + w_{i,j}$, where $x$ is a constant and $w_{i,j}$ is the weight between vertex $v_i$ and vertex $v_j$ in the MCBS. If $x$ is sufficiently large ($x \gg \max(w_{i,j})$), then partitioning into equal-size subsets will be necessary to obtain the optimal solution to the bus splitting problem (c.f. Example 1). This completes the proof.

Heuristic Algorithm

Because bus splitting with unknown module order is an NP-hard problem, we may use exhaustive search for a small value of $n$. The number of feasible splittings is $2^{n-1} - 1$. In our experiments, the exhaustive search for $n = 30$ can be done within tens of minutes on a Pentium-II 266 MHz machine. If $n$ is large, then a module clustering step is first performed to make the effective $n$ less than or equal to some predefined value. The clustering step can be done by minimizing inter-cluster data transfers while avoiding the case that the size of a certain cluster becomes much larger than the sizes of the others (to avoid the pitfall of the third configuration in Example 3). Clustering is performed by a recursive max-weight matching algorithm [5]. Next, all possible ways of bus splitting are enumerated exhaustively based on the clustering solution.

Figure 6. T-shaped bus structure.

Figure 7. H-shaped bus structure.

5 Bus Topology Variation

Instead of aligning all the modules horizontally, we may resort to other connection topologies, when allowed,

to improve the timing or power consumption. A T-shaped configuration is suitable for unbalanced partitioning while the H-shaped configuration is suitable for balanced partitioning. Note that both configurations have better delay characteristics than a horizontally-aligned configuration.

Figure 8. Transfer frequencies for various distributions.

Figure 9. Power saving for various distributions of transfer frequencies (x-axis: number of modules; y-axis: power saving, 0% to 60%; series: normal, uniform, exponential, impulse).

6 Experimental Results

There are no existing benchmarks to use for this problem. We therefore generated our own test benches. In our experimental setup, the assumptions discussed in Section 3.1 are adopted. In addition, the data exchange frequency between any two modules $M_i$ and $M_j$ is randomly weighted by an integer between 0 and 9 and follows one of the probability distributions specified in Figure 8 in a randomly generated test case. The height of each bar in Figure 8 shows the (relative) probability of the data exchange frequency between a pair of modules being equal to the x-axis value. Each point in Figure 9 represents the average power saving of the split bus over the monolithic bus, given that k modules (k from 4 upward) are connected to the bus, for 500 randomly generated test cases in which the transfer frequencies between pairs of modules follow a given distribution dist. In the following discussion, we refer to each point in Figure 9 by (k, dist), e.g., (4, normal).

Simulation results show that the test cases with exponential distributions have the largest average power saving while the test cases with impulse distributions have the smallest power saving. This is because test cases with an exponential distribution have fewer high-frequency transfers between modules, and therefore it becomes easier to keep the modules with high-frequency transfers within one part of the split bus. On the other hand, the impulse distribution has no variation in transfer frequencies and therefore provides the smallest opportunity for optimization. One important observation is that an anomaly occurs at point (6, exponential), which has the highest average power saving compared to other points of the same distribution (c.f. Figure 9).
The reason is that (6, exponential) has a higher power saving opportunity compared to (4, exponential) due to the fact that the unbalanced bus partitions (|BUS1|=2, |BUS2|=4) or (|BUS1|=4, |BUS2|=2) can result in larger power saving, as was illustrated in Example 3. For points (k, exponential) where k > 6, it is harder to achieve power saving because modules are more likely to be tightly coupled. For distributions other than the exponential distribution, the frequency distributions have much lower variance than that of the exponential distribution. Therefore the results follow the trend predicted by Example 1 in Section 3.1.

7 Conclusion

A split-bus architecture was proposed to improve the speed and power dissipation for global data exchange among a set of modules. The power model for the split bus was presented and the bus splitting problem was solved combinatorially. Experimental results showed that the power saving of the split-bus architecture compared to the monolithic-bus architecture varies from 16% to 50%, depending on the characteristics of the data transfer among the modules and the configuration of the split bus. T-shaped and H-shaped bus structures were proposed to further improve the bus performance. The proposed split-bus architecture can be extended to a multi-way split bus when a large number of modules are to be connected.

8 Reference

[1] Y. Nakagome et al., "Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI's," IEEE Journal of Solid-State Circuits, Vol. 28, No. 4, 1993.
[2] E. Macii, M. Pedram and F. Somenzi, "High level power modeling, estimation and optimization," IEEE Trans. on Computer Aided Design, Vol. 17, No. 11, Nov. 1998.
[3] M. R. Garey and D. S. Johnson, "Computers and Intractability: A Guide to the Theory of NP-Completeness," W.H. Freeman and Company, 1979.
[4] A. Leon-Garcia, "Probability and Random Processes for Electrical Engineering," Second Edition, Addison Wesley, 1993.
[5] T. Lengauer, "Combinatorial Algorithms for Integrated Circuit Layout," John Wiley & Sons, 1990.