Program Phase and Runtime Distribution-Aware Online DVFS for Combined Vdd/Vbb Scaling


Jungsoo Kim, Sungjoo Yoo, and Chong-Min Kyung
Dept. of EECS at KAIST, Dept. of EE at POSTECH

Abstract
Complex software programs are mostly characterized by phase behavior and runtime distributions. Due to the dynamism of the two characteristics, it is not efficient to make workload predictions during design-time. In our work, we present a novel online DVFS method that exploits both phase behavior and runtime distribution during runtime in combined Vdd/Vbb scaling. The presented method performs a bi-modal analysis of runtime distribution, and then a runtime distribution-aware workload prediction based on the analysis. In order to minimize the runtime overhead of the sophisticated workload prediction method, it performs table lookups to the pre-characterized data during runtime without compromising the quality of energy reduction. It also offers a new concept of program phase suitable for DVFS. Experiments show the effectiveness of the presented method in the case of H.264 decoder with two sets of long-term scenarios consisting of total 4655 frames. It offers 6.6% to 33.5% reduction in energy consumption compared with existing offline and online solutions.

I. INTRODUCTION
Dynamic voltage and frequency scaling (DVFS) is one of the most effective low power design methods. Due to the increasing leakage power consumption, DVFS now controls both supply voltage (Vdd) and body bias (Vbb) dynamically to minimize the total power consumption [1]. DVFS sets the performance level of the CPU to the ratio of predicted remaining workload to the given deadline. Thus, the accuracy of remaining workload prediction (in short, workload prediction) plays a crucial role in obtaining minimum energy consumption. Workload prediction predicts the future workload mostly based on recent history. For instance, the average of recent workloads can be a workload prediction.
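As a rough illustration of this baseline, the following Python sketch predicts the next workload as the mean of recent workloads and derives a performance level as the ratio of predicted workload to deadline. The function names, the frequency levels, and the cycle counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import deque

def predict_workload(history):
    """Naive prediction: mean of the recent per-frame workloads (in cycles)."""
    return sum(history) / len(history)

def performance_level(predicted_cycles, time_to_deadline_s, levels_hz):
    """Lowest available frequency that covers predicted_cycles / deadline."""
    required_hz = predicted_cycles / time_to_deadline_s
    for f in sorted(levels_hz):
        if f >= required_hz:
            return f
    return max(levels_hz)  # saturate at the highest level

# Hypothetical numbers, for illustration only.
recent = deque([38e6, 42e6, 40e6], maxlen=3)   # cycles per frame
levels = [0.5e9 * k for k in range(1, 5)]      # 0.5 GHz .. 2.0 GHz
freq = performance_level(predict_workload(recent), 0.1, levels)  # 0.1 s per frame
```

A predictor of this kind reacts only to the recent mean, so it cannot distinguish a wide runtime distribution from a narrow one with the same average.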
However, in reality, due to the complex behavior of software programs (e.g., data dependent iteration counts of nested loops) and architectural factors (e.g., cache miss, DDR memory page miss, etc.), such a naive prediction may not work efficiently. Fig. 1 (a) illustrates a profile of per-frame workload in H.264 decoder (an excerpt from the movie "Lord of the Rings"). The X-axis and the left-hand side Y-axis represent frame index and per-frame workload, respectively. The figure shows that the profile has two different scales of behavior, macroscopic and microscopic. First, it has a macroscopic time-varying behavior, i.e., phase behavior. There are time durations whose workload characteristics (e.g., mean, standard deviation, max value, etc.) are distinctly different from other time durations. We call a time duration with a distinct workload characteristic a phase (we will give a formal definition later in this paper). Fig. 1 (a) shows 10 phases (see the right-hand side Y-axis for phase indexes). Note that the phase index does not correspond to the required performance level of the corresponding phase in this example.

Fig. 1. Example of phase behavior and runtime distribution: (a) phase behavior in the per-frame workload of H.264 decoder and (b) runtime distributions (PDFs)

The example of Fig. 1 also shows the microscopic behavior, the runtime distribution. Fig. 1 (b) gives the runtime distributions of three representative phases to illustrate that there can be a wide runtime distribution even within a phase, and the phase itself is characterized by the runtime distribution. Our observation on the runtime characteristics of software programs suggests that, as shown in Fig. 1, the program workload has two characteristics: phase behavior and (multi-modal) runtime distribution (even within a phase). In particular, a phase can have a multi-modal runtime distribution with more than one salient peak, as phases 6 and 7 in Fig. 1 (b) show.¹
¹ Note that the single mode distribution can be considered to belong to the multi-modal distribution.
/DATE EDAA

Considering that the worst-case execution time has a significant impact on the efficiency of DVFS for real-time systems, such a multi-modal distribution needs to be carefully analyzed and exploited in order to obtain an accurate prediction of remaining workload. In our work, we aim at an online DVFS method that exploits both phase behavior and multi-modal runtime distribution in

order to make accurate workload predictions for dynamic Vdd/Vbb scaling. We address the online DVFS problem in two ways: intra-phase workload prediction and phase detection. The intra-phase workload prediction predicts workloads based on the runtime distribution of the current phase. The phase detection identifies to which phase the current instant belongs. To the best of the authors' knowledge, our work is the first approach of online DVFS for real-time systems which exploits both phase behavior and runtime distribution in combined Vdd/Vbb scaling.

This paper is organized as follows. Section II reviews existing work. Section III presents an overall flow. Section IV explains a multi-modal (i.e., bi-modal) analysis of runtime distribution. Section V gives the details of workload prediction. Section VI presents the phase detection method. Section VII reports experimental results, followed by the conclusion in Section VIII.

II. RELATED WORK
Much research has been presented on workload prediction for online DVFS, e.g., the (weighted) average of N recent workloads [2]. Recently, a control theory-based prediction method was presented [3]. The above studies are effective in the case of simple workload characteristics without phase behavior or wide runtime distributions.

Phase detection has been one of the hot research issues since it allows for new opportunities of performance optimization, e.g., dynamic adaptations of cache architecture [4] [5]. Phase detection is also applied to DVFS in [6] [7]. In these works, the per-phase runtime characteristic is modeled with a vector of execution cycles of basic blocks. A new phase is detected when two vectors are significantly different, e.g., when there is a large Hamming distance between the two vectors. The key problem here is to identify a subset of basic blocks that represent phase behavior. Exploring all the combinations of basic blocks will be prohibitively expensive in the case of current and future complex software applications with a large number of basic blocks.
In this paper, we present a practical method of phase detection, suitable for DVFS purposes, which is based on the vectors of predicted workloads for coarse grain code sections as explained in Section VI. In addition, compared with existing phase-based DVFS methods, the presented method exploits runtime distribution within a phase to better predict the remaining workload.

Runtime distribution has been actively exploited mostly in offline DVFS methods. Control flow-dependent time slack is exploited by predicting the remaining workload on a path basis in most intra-task DVFS methods: the worst-case execution path [8] [9], average-case execution path [10], and virtual execution path [11]. Recently, analytical approaches have been presented to address all the sources of runtime distribution: data dependency (e.g., number of loop iterations) and architecture (e.g., cache misses) as well as control flow [12] [13].

Existing offline runtime distribution-aware DVFS methods, if applied to online DVFS, would suffer from two limitations. First, they do not utilize phase behavior. In these methods, a single runtime distribution is obtained by running all the test benches over possibly numerous phases. Thus, phase-specific workload information is lost, which may lead to inefficiency in lowering energy consumption for software programs with noticeable phase behavior. Second, if they are applied to online DVFS without modifications, they will incur prohibitively high runtime overhead due to their computational complexity (e.g., up to 2.5 times of entire program runtime in solving differential equations numerically [13] as explained in Section VII). In order to overcome these limitations, we present a low overhead online version of the originally offline runtime distribution-aware DVFS method.

III. OVERALL FLOW FOR ONLINE DVFS
Algorithm 1 shows the overall flow of the proposed method.

Algorithm 1: Overall flow
1: if (Time % PHASE_UNIT == 0) then
2:   for $n_i$ from $N_{leaf}$ to $N_{root}$ do
3:     Bi-modal analysis of runtime distribution
4:     Workload prediction
5:   end for
6:   Check whether a new phase starts
7: end if
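The periodic structure of Algorithm 1 can be sketched in Python as below. This is only a skeleton under stated assumptions: `online_dvfs_step`, `analyze`, `predict`, and `is_new_phase` are hypothetical names, the three callbacks stand in for the bi-modal analysis, the workload prediction, and the phase check, and the stubs in the usage lines simply accumulate workloads rather than implementing the paper's prediction.

```python
def online_dvfs_step(time, phase_unit, regions, analyze, predict, is_new_phase):
    """One invocation of the periodic flow.

    regions is ordered from the beginning (root) to the end (leaf) of the
    program. Returns None outside a period boundary, otherwise a pair
    (predicted remaining workload per region, new-phase flag).
    """
    if time % phase_unit != 0:                 # line 1 of Algorithm 1
        return None
    remaining = 0.0                            # nothing remains after the leaf
    psp_vector = []
    for region in reversed(regions):           # lines 2-5: leaf -> root
        modes = analyze(region)                # line 3: bi-modal analysis
        remaining = predict(region, modes, remaining)   # line 4
        psp_vector.append(remaining)
    psp_vector.reverse()                       # back to root -> leaf order
    return psp_vector, is_new_phase(psp_vector)         # line 6

# Hypothetical stubs: each region is just a mean cycle count.
regions = [10.0, 20.0, 30.0]
result = online_dvfs_step(0, 15, regions,
                          analyze=lambda r: r,
                          predict=lambda r, m, rem: m + rem,
                          is_new_phase=lambda v: False)
```

Traversing from the leaf toward the root lets each region reuse the prediction already computed for the regions that follow it.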
Our work focuses on intra-task DVFS where the performance level is set at each performance setting point (PSP) inserted into the software code by designers or automatically. We perform workload prediction and phase detection periodically (e.g., on a granularity of PHASE_UNIT cycle period in line 1 of Algorithm 1). Note that a phase can consist of multiple consecutive periods (e.g., each period with PHASE_UNIT cycles). On every period, PSPs are traversed from the end of program ($N_{leaf}$) to the beginning of program ($N_{root}$) for workload prediction (lines 2 to 5). At each PSP, we perform a bi-modal analysis of runtime distribution (line 3) and predict the remaining workload (line 4). We approximate a multi-modal distribution with a bi-modal one, since, in most cases, the number of modes is less than or equal to two. Thus, our approximation does not incur a significant inefficiency in energy reduction as Section VII shows. The phase detection check is performed (line 6) utilizing the predicted workloads. The phase detection identifies to which phase the current period belongs. A new phase is detected when there is a large difference (in terms of Hamming distance of PSP vectors, to be explained in Section VI) between the predicted workload of the current phase and that of the current period.

Fig. 2 illustrates the workload prediction based on the bi-modal analysis. Fig. 2 (a) shows two program regions, $n_i$ and $n_{i+1}$ (a program region is a code section starting with a PSP and finishing with another PSP). Fig. 2 (b) shows the PDF (probability distribution function) of runtime distribution for each program region and the key steps of workload prediction for the program region $n_i$ in this case. Given the runtime distribution of a phase, in order to predict the energy optimal remaining workload for combined

Vdd/Vbb scaling, we apply the solution presented in [13] during runtime. However, a direct execution of the solution during the software program run on the target CPU will cause prohibitively high overhead of runtime and energy consumption as shown in Section VII. For the online purpose, we need a lightweight, yet accurate solution to the workload prediction. In this paper, we propose an approach that gives a low runtime overhead by exploiting the pre-characterized data. To do that, we first modularize the remaining workload from program region $n_i$ to the end of program into two parts: one (called the effective workload of program region $n_i$, denoted by $x_i^{eff}$) mostly determined by the PDF of program region $n_i$, and the other (called the effective workload of remaining program regions except $n_i$, denoted by $w_{i+1}^{eff}$) mostly determined by the PDFs of remaining program regions. At the PSP for program region $n_i$, we predict the remaining workload $w_i$ as the sum of $x_i^{eff}$ and $w_{i+1}^{eff}$. $x_i^{eff}$ is calculated by table lookups to the pre-characterized data (in the form of two types of look-up table) with the runtime distributions (obtained during runtime) as the input to the table lookups. $w_{i+1}^{eff}$ is obtained analytically (Section V). Fig. 2 (b) also illustrates how each of $x_i^{eff}$ and $w_{i+1}^{eff}$ is calculated.

Fig. 2. Workload prediction: (a) two program regions and (b) workload prediction with the bi-modal analysis

Algorithm 2: Mode decomposition
1: find two non-continuous points $(x_0, p_0)$ and $(x_1, p_1)$ whose probability values, $p_0$ and $p_1$, are the two highest probabilities in the original PDF
2: if there are no such two non-continuous points, then
3:   the entire PDF is considered to be a single mode
4:   return the original PDF
5: else
6:   find the saddle point $(x_s, p_s)$ between the two points, which gives the minimum probability
7:   if there is more than one saddle point then
8:     the median is selected as the saddle point
9:   end if
10:  $PDF_0 = \{(x, p) \mid x \le x_s, (x, p) \in PDF\}$
11:  $PDF_1 = \{(x, p) \mid x > x_s, (x, p) \in PDF\}$
12:  return $PDF_0$ and $PDF_1$
13: end if
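A minimal Python sketch of Algorithm 2 is given below, assuming the PDF is a histogram stored as a list of (x, p) bins sorted by x. The reading of "non-continuous" as "not in adjacent bins", the tie-breaking rule for several saddle bins, and the name `decompose_modes` are assumptions made for this illustration.

```python
def decompose_modes(pdf):
    """Split a histogram PDF into one or two modes.

    pdf: list of (x, p) bins sorted by x.
    Returns [pdf] for a single mode, or [pdf0, pdf1] split at the saddle.
    """
    if len(pdf) < 3:
        return [pdf]
    # Indices of the two bins with the highest probabilities (line 1).
    top = sorted(range(len(pdf)), key=lambda i: pdf[i][1], reverse=True)[:2]
    lo, hi = min(top), max(top)
    if hi - lo < 2:                     # peaks are adjacent: single mode (lines 2-4)
        return [pdf]
    between = range(lo + 1, hi)
    p_min = min(pdf[i][1] for i in between)               # saddle probability (line 6)
    saddles = [i for i in between if pdf[i][1] == p_min]
    s = saddles[(len(saddles) - 1) // 2]                  # middle saddle (lines 7-9)
    x_s = pdf[s][0]
    pdf0 = [(x, p) for x, p in pdf if x <= x_s]           # line 10
    pdf1 = [(x, p) for x, p in pdf if x > x_s]            # line 11
    return [pdf0, pdf1]

# Hypothetical histograms, for illustration only.
bimodal = [(1, 0.1), (2, 0.4), (3, 0.05), (4, 0.05), (5, 0.3), (6, 0.1)]
unimodal = [(1, 0.2), (2, 0.5), (3, 0.3)]
modes = decompose_modes(bimodal)
single = decompose_modes(unimodal)
```

With an even number of tied saddle bins the sketch picks the lower middle one; the algorithm as stated only says "the median".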
In order to obtain $x_i^{eff}$, the bi-modal analysis decomposes the PDF of $n_i$, $PDF_i$, into two parts called modes 0 and 1 as Fig. 2 (b) shows. For each mode, we calculate $x_i^{eff(0)}$ and $x_i^{eff(1)}$, called the effective workload of mode, by using the lookup table called $LUT_\lambda$ (details will be given in Section V). Then, we obtain the effective workload of program region $n_i$, $x_i^{eff}$, by the weighted sum of the two effective workloads of mode, looking up the other table $LUT_\beta$ for the parameter β (Section IV). The effective workload of remaining program regions other than $n_i$, i.e., $w_{i+1}^{eff}$, is calculated in a simple manner as Fig. 2 (b) shows (Section V). Finally, the summation of the two effective workloads gives the predicted remaining workload of program region $n_i$, $w_i$, as the lowest equation in Fig. 2 (b) shows.

At each PSP, the performance level is set to the ratio of predicted workload to the remaining time-to-deadline or to a level that satisfies the given deadline constraint, depending on the result of the real-time constraint check as in [13]. Note that the performance setting implies Vdd/Vbb setting since there is a one-to-one correspondence between a performance level and a pair of Vdd/Vbb settings that give the minimum energy consumption [1].

IV. BI-MODAL ANALYSIS
We calculate the effective workload of program region $n_i$, i.e., $x_i^{eff}$, in three steps: mode decomposition, workload prediction for each mode, and mode recomposition to obtain the effective workload of the program region.

A. Mode Decomposition
Mode decomposition decomposes the original runtime distribution into two modes, i.e., two separated distributions each of which has a salient peak. Algorithm 2 explains how to decompose the original runtime distribution into two modes. As Algorithm 2 shows, the original PDF is decomposed into two sub-PDFs, $PDF_0$ and $PDF_1$, at a saddle point where the probability is the minimum. In the case that there is only one mode (lines 2 to 4 in Algorithm 2), the original PDF is returned.

B. Modes Recomposition
In the step of workload prediction (Section V), the effective workload for each of the two modes ($x_i^{eff(0)}$ and $x_i^{eff(1)}$) is obtained. Then, the effective workload of program region $n_i$ is calculated as a weighted sum of the two values as follows.

$$x_i^{eff} = \beta x_i^{eff(0)} + (1 - \beta) x_i^{eff(1)} \qquad (1)$$

Parameter β determines the relative importance of the two modes. In our work, we calculate the parameter by a table lookup of pre-characterized data with the runtime distributions of program regions as the input of the table lookup. In the following, we explain how to build the lookup table for the accurate parameter calculation. The two modes will have different impacts on the final predicted remaining workload depending on three factors as follows.

- Factor 1: Ratio between the relative probabilities of the two modes in the original distribution
- Factor 2: Ratio between the execution cycles of the two modes
- Factor 3: Ratio between the execution cycle of program region $n_i$ and the remaining workload after the program region $n_i$

For instance (Factor 1), if mode 0 has a significant portion, i.e., high probability, parameter β will have a high value approaching 1. As another example, if mode 1 has much higher execution cycles than mode 0, i.e., if WCEC (worst-case execution cycle) is much higher than AEC (average execution cycle), then mode 1 has more impact than mode 0 since DVFS tends to set a high frequency (i.e., high effective workload of program region $n_i$) due to the large WCEC in order to meet the given deadline constraint. In such a case, parameter β becomes a small value to reduce the effect of mode 0 and to increase that of mode 1.

We prepare a pre-characterized table for parameter β, $LUT_\beta$, with the above three factors as the index. To be specific, Factor 1 is represented by the cumulative probability of mode 0, $P_0$ (since $P_0 + P_1 = 1$). Factor 2 is represented by the ratio of $x_i^{eff(1)}$ to $x_i^{eff(0)}$ since each of them represents the execution cycle information of each mode. Factor 3 is represented by the ratio $(x_i^{eff(0)} + x_i^{eff(1)})/w_{i+1}^{eff}$.² As $x_i^{eff(1)}/x_i^{eff(0)}$ increases (e.g., WCEC ≫ AEC) or $x_i^{eff(1)}/w_{i+1}^{eff}$ increases, parameter β decreases since mode 1 comes to have more impact on the energy optimal remaining workload than mode 0. Regarding $x_i^{eff(1)}/w_{i+1}^{eff}$, as a simple case, if $w_{i+1}^{eff}$ approaches 0, then the program region $n_i$ dominates the remaining workload. Thus, the energy optimal remaining workload of program region $n_i$ will approach the worst-case execution cycle of program region $n_i$. Thus, mode 1 (the higher portion of the PDF) dominates the remaining workload, which requires β to decrease (and 1 − β to increase) in Eqn. (1).
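The lookup of β and the weighted sum of Eqn. (1) can be sketched as follows. The table layout (a nested list indexed by quantized $P_0$, $x_i^{eff(1)}/x_i^{eff(0)}$, and $x_i^{eff(1)}/w_{i+1}^{eff}$), the step boundaries, and all table values are made up for illustration; the paper does not give the contents or resolution of $LUT_\beta$.

```python
import bisect

def quantize(value, steps):
    """Index of the largest step <= value, clamped to the table range."""
    return max(0, min(len(steps) - 1, bisect.bisect_right(steps, value) - 1))

def lookup_beta(lut, p0_steps, r12_steps, r3_steps, p0, x_eff0, x_eff1, w_eff_next):
    """Look up beta from the three factors (hypothetical 3-D table layout)."""
    ratio3 = x_eff1 / w_eff_next if w_eff_next > 0 else float("inf")
    i = quantize(p0, p0_steps)                  # Factor 1
    j = quantize(x_eff1 / x_eff0, r12_steps)    # Factor 2
    k = quantize(ratio3, r3_steps)              # Factor 3
    return lut[i][j][k]

def recompose(beta, x_eff0, x_eff1):
    """Eqn. (1): weighted sum of the two per-mode effective workloads."""
    return beta * x_eff0 + (1.0 - beta) * x_eff1

# Made-up 2x2x2 table: beta grows with P0 and shrinks with the two ratios.
p0_steps, r12_steps, r3_steps = [0.0, 0.5], [1.0, 2.0], [0.0, 1.0]
lut = [[[0.5, 0.4], [0.3, 0.2]],
       [[0.9, 0.8], [0.7, 0.6]]]
beta = lookup_beta(lut, p0_steps, r12_steps, r3_steps,
                   p0=0.7, x_eff0=100.0, x_eff1=300.0, w_eff_next=600.0)
x_eff = recompose(beta, 100.0, 300.0)
```

Clamping the indices means that the limiting case where $w_{i+1}^{eff}$ approaches 0 falls into the last Factor 3 bucket, where the smallest β would be stored.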
As a summary, when the PDFs of program regions are available during runtime by performance monitoring, the four variables, $P_0$, $x_i^{eff(1)}$, $x_i^{eff(0)}$, and $w_{i+1}^{eff}$, are calculated as explained in Section V. Then, the table $LUT_\beta$ is looked up for the parameter β, which is used in the calculation of Eqn. (1).

² In our implementation, we use the ratio $x_i^{eff(1)}/w_{i+1}^{eff}$ since, given a ratio of $x_i^{eff(1)}/x_i^{eff(0)}$, $(x_i^{eff(0)} + x_i^{eff(1)})/w_{i+1}^{eff}$ and $x_i^{eff(1)}/w_{i+1}^{eff}$ represent the same information.

V. PREDICTING REMAINING WORKLOAD FOR A DECOMPOSED MODE
In this section, we explain how to calculate the effective workload, i.e., $x_i^{eff}$, assuming a single (decomposed) mode. Our approach is based on an analytical formulation utilizing an analytical energy function. Combined Vdd/Vbb scaling does not have the quadratic relationship between energy consumption per cycle and frequency that Vdd-only scaling has. Thus, we approximate the energy consumption by fitting the golden energy model (measurement data or estimation result) as follows.

$$E_{cycle} = a f^b + c \qquad (2)$$

where $E_{cycle}$ is the energy consumption per cycle, and $a$, $b$ and $c$ are fitting parameters.³

Fig. 3. Effective remaining workload, $w_{i+1}^{eff}$

A. Effective Workload of Program Region
Fig. 3 illustrates the PDFs of two program regions, $n_i$ and $n_{i+1}$, as in Fig. 2 (a). For simplicity, we assume a unit function for $PDF_i$. We will generalize the case (i.e., utilize a general form for $PDF_i$) later in this section. Given the PDFs ($PDF_i$ and $PDF_{i+1}$ in Fig. 3), the average energy consumption of two program regions, i.e., $E(w_i)$, and the energy optimal remaining workload of program region $n_i$, i.e., $w_i$, are calculated using the energy model in Eqn. (2) as follows.⁴

$$
\begin{aligned}
E(x_i, x_{i+1}, w_i) &= (a f_i^b x_i + c x_i) + (a f_{i+1}^b x_{i+1} + c x_{i+1}) \\
E(w_i) &= \iint E(x_i, x_{i+1}, w_i)\, p_i\, p_{i+1}\, dx_i\, dx_{i+1}
        = a w_i^b x_i + \frac{a w_{i+1}^b x_{i+1}}{(1 - x_i/w_i)^b} + c x_i + c x_{i+1} \\
\frac{\partial E(w_i)}{\partial w_i} &= 0 \;\Rightarrow\; x_i \left[ 1 - \frac{w_{i+1}^b x_{i+1}}{(w_i - x_i)^{b+1}} \right] = 0 \\
w_i &= x_i + (w_{i+1}^b x_{i+1})^{1/(b+1)} = x_i + w_{i+1}^{eff} \qquad (3)
\end{aligned}
$$

As shown in Eqn.
(3), the predicted remaining workload of program region $n_i$, $w_i$, consists of the workload of the program region, i.e., $x_i$, and the second term, i.e., $(w_{i+1}^b x_{i+1})^{1/(b+1)}$. The second term represents the portion of remaining workload after program region $n_i$. We call it the effective remaining workload of $n_{i+1}$, $w_{i+1}^{eff}$. Fig. 3 illustrates that $w_{i+1}^{eff}$ represents the PDFs of remaining program regions after $n_i$.

³ The energy model can be derived analytically and numerically [14].
⁴ For the sake of explanation, we assume the unit delay as the remaining time to the given deadline at $n_i$, i.e., $T_i = 1$. Thus, we set the frequency of program region $n_i$ to $f_i = w_i/T_i = w_i$. In Eqn. (3), the frequency of program region $n_{i+1}$, $f_{i+1}$, becomes $w_{i+1}/(T_i - x_i/f_i) = w_{i+1}/(1 - x_i/w_i)$. More detailed derivation can be found in [12].

Assume the general case where program region $n_i$ also has a wide PDF. In this case, we need to apply the numerical solution in [13] to obtain the energy optimal remaining workload. However, if such an analysis is performed during runtime, it

will cause prohibitively large runtime overhead. Thus, being inspired by Eqn. (3), we model the solution, $w_i$, as follows.

$$w_i = x_i^{eff} + w_{i+1}^{eff} \qquad (4)$$

where $x_i^{eff}$ is the effective workload of program region $n_i$. Note that $w_{i+1}^{eff}$ is obtained as Eqn. (3) shows. In the case that the software program has cascaded program regions and conditional branches, we calculate the effective remaining workload of a program region in a similar manner to [13].

We calculate $x_i^{eff}$ (for each of the two modes in Section IV) by exploiting the pre-characterization of solutions. To do that, we represent $x_i^{eff}$ as follows.

$$x_i^{eff} = x_i (1 + \lambda) \qquad (5)$$

Then, we prepare a lookup table, $LUT_\lambda$, for the residue λ during design-time and perform table lookups during runtime to obtain λ. We derived the indexes of $LUT_\lambda$ as follows: $\sigma_i/\mu_i$, $\gamma_i$, and $w_{i+1}^{eff}/\mu_i$, where $\mu_i$, $\sigma_i$, and $\gamma_i$ represent the mean, standard deviation, and skewness of $PDF_i$, respectively. The rationale of choosing the three indexes is as follows. First, the residue λ depends on $\mu_i$, $w_{i+1}^{eff}$, and $PDF_i$ as the Appendix explains. Second, the PDF of a mode is modeled as a skewed normal distribution, since the decomposed mode usually does not have a clean normal distribution even though there is mostly one salient peak per mode. Thus, there can be some level of skewness ($\gamma_i$) in the PDF of a decomposed mode. The dependence of the residue λ on $\mu_i$, $w_{i+1}^{eff}$, and $PDF_i$, together with the skewed normal approximation of $PDF_i$ by ($\sigma_i$, $\gamma_i$), gives the above three indexes of $LUT_\lambda$.

VI. PHASE DETECTION
A phase needs to be characterized by a salient difference in program behavior, especially in terms of execution cycles. Conventionally, the phase is represented by a vector of execution cycles of basic blocks [4] [5]. As mentioned in Section II, a direct application of the existing phase definition may not be effective in DVFS. In our work, we first define a new vector, called PSP vector, which consists of predicted remaining workloads of program regions.
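As a rough illustration of how PSP vectors might be compared, the sketch below builds a representative vector as an element-wise median and flags a new phase when a distance exceeds a threshold. The text speaks of a Hamming distance with a 10% threshold but does not spell out how the distance is normalized; the normalized sum of absolute differences used here, and the function names, are assumptions.

```python
from statistics import median

def representative(psp_vectors):
    """Element-wise median of the PSP vectors of the periods in one phase."""
    return [median(column) for column in zip(*psp_vectors)]

def distance(rep, current):
    """Assumed metric: sum of absolute differences, relative to the representative."""
    return sum(abs(r - c) for r, c in zip(rep, current)) / sum(rep)

def is_new_phase(rep, current, threshold=0.10):
    """True when the current period differs from the phase by more than the threshold."""
    return distance(rep, current) > threshold

# Hypothetical PSP vectors (predicted remaining workloads at three PSPs).
phase_periods = [[100, 60, 20], [110, 66, 22], [90, 54, 18]]
rep = representative(phase_periods)
same = is_new_phase(rep, [104, 62, 21])
changed = is_new_phase(rep, [150, 90, 30])
```

The same `distance` function could also be applied against the representatives of earlier phases to find a matching phase whose runtime distribution can be reused.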
Then, we detect a new phase when the Hamming distance between the representative PSP vector of the current phase and that of the current period becomes greater than a threshold (set to 10% in our experiments). The rationale is that the predicted remaining workload of a program region represents the entire runtime distributions of remaining program regions. Thus, it can be a representative of future behavior. The representative PSP vector of a phase is calculated as the median vector of all the PSP vectors of periods belonging to the phase.

After the phase detection, in order to utilize the phase-level repetitive behavior, we check whether there is any previous phase similar to the newly detected one by comparing the PSP vector of the current period with the representative PSP vectors of all previously encountered phases. If so, we reuse the runtime distribution of the matched previous phase as that of the new phase. If there is no previous phase similar to the new one, then a new phase starts by maintaining a new set of runtime distribution information until another new phase is detected.

Fig. 4. Runtime distribution of test pictures: (a) foreman, (b) football, (c) stefan, and (d) akiyo

VII. EXPERIMENTAL RESULTS
A. Experimental Setup
We use a real software program, H.264 decoder (QCIF, 10fps), in the experiments. In order to investigate the effects of phase and runtime distribution, we used a total of 4655 frames of pictures. The examples consist of two sets as follows.
- Set 1: A sequence of conventional test pictures: foreman (127 frames), football (89 frames), stefan (89 frames), and akiyo (150 frames), 455 frames in total. Fig. 4 shows the runtime distributions (PDFs) of test pictures obtained from cycle-accurate simulations.
- Set 2: Four movie clips from "Lord of the Rings" (total 4200 frames). The per-frame runtime is shown in Fig. 1 (a).

We use the processor model with combined Vdd/Vbb scaling in [13]. We use 11 frequency levels up to 6.0GHz with 0.5GHz step size.⁵ We run cycle-accurate simulation with a commercial tool, ARM SoCDesigner, in order to obtain the PDFs for all the program regions in the software program.
⁵ We set the maximum frequency at 6GHz due to the tight deadline constraint of H.264 decoder, 10fps. We will be able to use a lower maximum frequency if the deadline constraint is relaxed. The presented method still works in both cases.

B. Results
We compared the energy consumption of five methods: two offline methods and three online ones. As the offline methods, we use an average execution-cycle based method (AEC) and a runtime distribution-aware method (DIST), both from [13]. For the online methods, we use a control theory-based method (CON) [3], a phase-aware average execution cycle-based one (P-AEC), and the presented one (OURS). Regarding the control theory-based method, we used the coefficients reported in [3] as the initial ones and made a further exploration of coefficients to obtain the best results. The phase-aware average execution cycle-based method, which we also present in this paper, exploits only the phase behavior while predicting the remaining workload based on the average remaining execution cycle obtained from the history. In this case, the phase detection is performed based on the Hamming distance in the vectors of average execution cycles of program regions. Tables I (a) and (b) show the energy consumptions for Sets 1 and 2, respectively. All the energy consumption data

are normalized to the energy consumption of Oracle-DVFS, which predicts the remaining workload perfectly.

TABLE I. ENERGY CONSUMPTION FOR (A) SET 1 AND (B) SET 2
Columns of (a) and (b): Oracle-DVFS (nJ) | Offline solution: AEC, DIST | Online solution: CON, P-AEC, OURS
(a) Set 1: [values lost in transcription; surviving fragment: 6.37E]
(b) Set 2, four Clip rows and an "All" row: [values lost in transcription; surviving fragment in the "All" row: 5.78E]

Analysis
Table I shows that the presented method gives 6.6% to 33.5% (6.6% to 28.2%, compared with only DIST) further reductions in energy consumption compared with the existing offline methods. Such a significant improvement results from the fact that the presented method exploits phase behavior while still applying the runtime distribution-aware workload prediction. Compared with the control theory-based method, the presented method gives 13.9% to 33.9% lower energy consumption. The main reason for such a large improvement is that H.264 pictures have (random-like) fast changing local runtime variations as well as phase behavior, as Fig. 1 shows. The control theory-based method does not track the fast changing local runtime variations well. The phase-aware average execution cycle-based method gives better results than the control theory-based one since it has an effect of averaging out the local random workload variations. The presented method even offers 5.6% to 14.0% further improvements over P-AEC. The improvements are obtained from two differences: (1) workload prediction based on runtime distribution (OURS) or average (P-AEC), and (2) phase detection based on runtime distribution (OURS) or average (P-AEC).

Runtime Overhead
We measured the runtime overhead of the proposed online workload prediction method when running H.264 decoder with "Lord of the Rings" (4200 frames) on an ARM926 processor. We set PHASE_UNIT and the number of program regions to 15 frames and eight, respectively. Under this condition, the runtime overhead ranges from 913,761 to 1,003,178 clock cycles, which corresponds to 0.021% of the total execution cycles.
Compared with the runtime overhead of the presented online method, that of the design-time solution [13], if the design-time solution is applied during runtime without modification, amounts to about 12 billion clock cycles under the same condition as above. It is 12,000 times bigger than that of the presented online method. Such a high overhead is unacceptable since the runtime overhead alone takes 2.5 times longer than the H.264 decoder run itself.

Memory Overhead of LUTs
The presented method requires two types of LUTs: $LUT_\beta$ and $LUT_\lambda$. The LUTs require memory space. The memory overhead largely depends on the number of steps (scales) in the indexes of the tables. As the numbers of steps increase, more accurate workload prediction will be achieved with a higher memory area overhead. In our implementation, the total area overhead of LUTs is 20kB, obtained by adjusting the step sizes and by compressing the contents of LUTs while exploiting the value locality in the tables.

VIII. CONCLUSION
In this paper, we presented a novel online DVFS method that utilizes both phase behavior and runtime distribution to give accurate workload predictions and thereby lower energy consumption in combined Vdd/Vbb scaling. It performs a bi-modal analysis to practically account for the multi-modal characteristics of runtime distribution. The runtime distribution-aware workload prediction is executed while exploiting the pre-characterized data in order to minimize the runtime overhead of the online method. For the phase detection for DVFS, a new concept of phase is presented which is based on the runtime distribution of program regions. Experimental results show that the presented method offers 6.6% to 33.9% further energy savings compared with existing offline and online methods.

REFERENCES
[1] S. M. Martin, et al., "Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads," Proc. ICCAD.
[2] K. Govil, et al., "Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU," Proc. MOBICOM.
[3] Y. Gu, et al., "Control Theory-based DVS for Interactive 3D Games," Proc. DAC.
[4] T. Sherwood, et al., "Discovering and Exploiting Program Phases," IEEE Micro, Nov/Dec.
[5] T. Sherwood, et al., "Phase Tracking and Prediction," Proc. ISCA.
[6] Q. Wu, et al., "A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance," Proc. MICRO.
[7] C. Isci, et al., "Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management," Proc. MICRO.
[8] S. Lee and T. Sakurai, "Run-time Voltage Hopping for Low-Power Real-time Systems," Proc. DAC.
[9] A. Azevedo, et al., "Profile-based Dynamic Voltage Scheduling using Program Checkpoints," Proc. DATE.
[10] D. Shin and J. Kim, "Optimizing Intra-task Voltage Scheduling using Data Flow Analysis," Proc. ASPDAC.
[11] J. Seo, et al., "Profile-based Optimal Intra-task Voltage Scheduling for Hard Real-time Applications," Proc. DAC.
[12] S. Hong, et al., "Runtime Distribution-aware Dynamic Voltage Scaling," Proc. ICCAD.
[13] S. Hong, et al., "Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution," Proc. DATE.
[14] J. Kim, et al., "An Analytical Dynamic Voltage Scaling of Supply Voltage and Body Bias Based on Parallelism-aware Workload and Runtime Distribution," to appear in IEEE Transactions on CAD.

APPENDIX
If we substitute $w_i$ with Eqn. (4) in Eqn. (3) and if we assume that the PDF of $n_i$ is represented by $M$ bins, i.e., $M$ pairs of execution cycle and probability, $\langle x_i(k), p_i(k) \rangle$, we obtain the following equation.

$$\sum_{k=1}^{M} x_i(k)\, p_i(k) \left[ 1 - \frac{(w_{i+1}^{eff})^{b+1}}{(x_i^{eff} + w_{i+1}^{eff} - x_i(k))^{b+1}} \right] = 0$$

As shown in the above equation, $x_i^{eff}$ depends on $x_i$, $w_{i+1}^{eff}$, and the PDF of $n_i$, i.e., the pairs $\langle x_i(k), p_i(k) \rangle$.
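The Appendix condition can be solved numerically for $x_i^{eff}$, which is one way the entries of $LUT_\lambda$ might be pre-characterized at design-time. The Python sketch below uses plain bisection; the solver choice, the function names, the value b = 2, and the reading of $x_i$ in Eqn. (5) as the mean of the PDF are assumptions made for illustration, not details given in the paper.

```python
def effective_workload(pdf, w_eff_next, b, iters=200):
    """Solve sum_k x_k p_k [1 - (W / (x_eff + W - x_k))**(b+1)] = 0 for x_eff.

    pdf: list of (x_k, p_k) bins; w_eff_next: W = w_eff of the remaining regions.
    """
    if w_eff_next <= 0:
        raise ValueError("this sketch assumes w_eff_next > 0")

    def g(x_eff):
        return sum(x * p * (1.0 - (w_eff_next / (x_eff + w_eff_next - x)) ** (b + 1))
                   for x, p in pdf)

    x_max = max(x for x, _ in pdf)
    # g tends to -infinity just above lo and g(hi) >= 0, so a root lies in (lo, hi].
    lo, hi = x_max - w_eff_next, x_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:      # interval cannot shrink further
            break
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi

def residue(pdf, w_eff_next, b):
    """lambda of Eqn. (5), taking x_i as the mean of the PDF (an assumption)."""
    mean = sum(x * p for x, p in pdf)
    return effective_workload(pdf, w_eff_next, b) / mean - 1.0

# Hypothetical inputs: a single-bin PDF and a two-bin PDF.
x_single = effective_workload([(100.0, 1.0)], 50.0, 2.0)
x_two = effective_workload([(80.0, 0.5), (120.0, 0.5)], 100.0, 2.0)
lam = residue([(80.0, 0.5), (120.0, 0.5)], 100.0, 2.0)
```

For a single-bin PDF the solution collapses to the bin value, which agrees with Eqn. (3); for a wider PDF the effective workload lies between the mean and the worst case, so the residue is positive.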