Bundle Methods for Machine Learning (presentation transcript)


Bundle Methods for Machine Learning (Teo, Vishwanathan, Smola, Le), JMLR 2010 / NIPS 2007. Presented by Kevin Duh, Bayes Reading Group, 6/4.

1-slide summary: Many machine learning methods involve solving a minimum regularized risk objective. The cutting-plane algorithm and bundle methods solve it iteratively by using a piece-wise lower bound.

Why I chose this paper: These optimization methods (invented in the 1960s) are becoming popular in supervised learning. They are very fast, scale to large datasets, and handle non-smooth convex optimization, so they are widely applicable. They can be used when LBFGS fails.

Outline: 1. Background: convexity and subgradients. 2. Cutting-plane algorithm. 3. Bundle method. 4. Different loss functions.

Warm-up. Convex set: a set S is convex if it contains the line segment joining any two of its points: x, y in S; a, b >= 0; a + b = 1 implies ax + by in S. Are these sets convex?

Background: Convex functions. A function is convex if its domain is a convex set and the segment joining any two points on f has no values lower than f: for x, y in dom(f) and a, b >= 0 with a + b = 1, a f(x) + b f(y) >= f(ax + by). What is convex, what is concave?
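Not from the slides: a minimal sketch that probes the convexity inequality above numerically. The function name, sampling range, and the 1e-12 tolerance are my own choices; random sampling can only disprove convexity, never prove it.

    import numpy as np

    def convexity_probe(f, lo=-5.0, hi=5.0, trials=10000, seed=0):
        """Randomly test a*f(x) + b*f(y) >= f(a*x + b*y) with a + b = 1.
        A single failure disproves convexity; passing is only evidence."""
        rng = np.random.default_rng(seed)
        x = rng.uniform(lo, hi, trials)
        y = rng.uniform(lo, hi, trials)
        a = rng.uniform(0.0, 1.0, trials)
        b = 1.0 - a
        return bool(np.all(a * f(x) + b * f(y) >= f(a * x + b * y) - 1e-12))

    print(convexity_probe(np.abs))     # True: |x| is convex
    print(convexity_probe(np.square))  # True: x^2 is convex
    print(convexity_probe(np.sin))     # False: sin is not convex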

Convex & differentiable functions. The gradient exists if f is differentiable. First-order condition: a differentiable f is convex iff f(y) >= f(x) + <y - x, grad f(x)> for all y. The gradient thus provides a global lower bound on f. (Figure: f(y), with the tangent line f(x) + <y - x, grad f(x)> below it.)

Nonsmooth (non-differentiable) functions. What if f is not differentiable, e.g. the L1-regularizer, or an envelope function like |x| = max(x, -x)? Its derivative at x > 0 is 1 and at x < 0 is -1, but what is the derivative at x = 0? Subgradient: a vector s is a subgradient at x if f(y) >= f(x) + <y - x, s> for all y. There may exist many subgradients at a point; the set of subgradients is called the subdifferential. The methods we deal with only require one subgradient.

Subgradient method for optimizing nonsmooth functions. Similar to gradient descent, except that it works on non-differentiable functions, step-lengths are not chosen by line-search but fixed, and it is not a descent method (a descent direction d can only be defined by <d, s> < 0 for all s in the subdifferential). Pseudo-code (a runnable sketch follows at the end of this section): Repeat until convergence: 1. s = subgradient of f at x. 2. x = x - stepsize * s. 3. Keep track of the best x so far.

Outline: 1. Background: convexity and subgradients. 2. Cutting-plane algorithm. 3. Bundle method. 4. Different loss functions.

Cutting-plane algorithm for optimizing non-smooth functions. (Figure: the lower bound improves after each iteration.) Recall that a subgradient forms a lower bound on f. Main idea: if we have multiple subgradients at different points, we get a tighter lower bound. The lower bound improves with each iteration, so minimizing the lower bound eventually minimizes the desired objective.
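Not from the slides: a runnable sketch of the subgradient-method pseudo-code above, in Python with NumPy. One deviation, plainly named: instead of the fixed step length the slide describes, it uses the common diminishing step/sqrt(t) schedule that makes the method provably converge. The example function |x - 2| and all names are mine.

    import numpy as np

    def subgradient_method(f, subgrad, x0, step=0.5, iters=1000):
        """Slide pseudo-code: repeat (1) take a subgradient, (2) step along
        its negative, (3) remember the best iterate, since objective values
        need not decrease. Uses a diminishing step/sqrt(t) schedule."""
        x = np.asarray(x0, dtype=float)
        best_x, best_f = x.copy(), f(x)
        for t in range(1, iters + 1):
            s = subgrad(x)                      # 1. subgradient at x
            x = x - (step / np.sqrt(t)) * s     # 2. x = x - stepsize * s
            if f(x) < best_f:                   # 3. track best x so far
                best_x, best_f = x.copy(), f(x)
        return best_x, best_f

    # Nonsmooth example: f(x) = |x - 2|, minimized at the kink x = 2,
    # where np.sign returns 0, itself a valid subgradient.
    f = lambda x: np.abs(x - 2.0).sum()
    subgrad = lambda x: np.sign(x - 2.0)
    print(subgradient_method(f, subgrad, x0=np.zeros(1)))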

The math. Note a slight change of notation starting now: the function f(x) becomes J(w). Overall optimization goal: minimize J(w) over w. Lower bound: given the sequence of iterates w_i and subgradients s_i, the (piecewise-linear) lower bound is J_t(w) = max_{i <= t} [ J(w_i) + <w - w_i, s_i> ].

Cutting-plane pseudocode (a runnable sketch follows below): 1. Compute J(w_t) and its subgradient s_t. 2. Compute the error: at each iteration, eps_t = min_{i <= t} J(w_i) - min_w J_t(w). 3. If the error < threshold, stop. 4. Update the bound. 5. Optimize the bound to get the new iterate. 6. Go to step 1.

Why does this work? Let w* be the optimal solution. By construction, J_t(w) <= J(w) for all w, so the optimal value is sandwiched: min_w J_t(w) <= J(w*) <= min_{i <= t} J(w_i). This error eps_t decreases monotonically; when it reaches zero, we have w*.

Final word on the cutting-plane algorithm. It has a nice stopping criterion (better than the subgradient method). The cost is solving a linear program per iteration: the size of this subproblem grows with each iteration, but usually it can be solved quickly. Speed depends critically on the set of cutting planes; zig-zag behavior is possible, slowing down convergence.

Outline: 1. Background: convexity and subgradients. 2. Cutting-plane algorithm. 3. Bundle method. 4. Different loss functions.

Standard bundle method. The zig-zag in cutting-plane is caused by taking large steps and neglecting previous solutions. Bundle methods extend cutting-plane by ensuring the new iterate is not too far from the previous ones.
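Not from the slides: a minimal sketch of the cutting-plane pseudocode above for a one-dimensional convex J, assuming a box domain [lo, hi] so the inner minimization is a bounded linear program (solved here with scipy.optimize.linprog). The example function and all names are mine.

    import numpy as np
    from scipy.optimize import linprog

    def cutting_plane(J, subgrad, w0, lo=-10.0, hi=10.0, tol=1e-6, iters=100):
        """Cutting-plane method for a 1-D convex J on [lo, hi]. Each cut
        J(w_i) + s_i*(w - w_i) lower-bounds J; minimizing the piecewise
        max of the cuts is the LP: min xi s.t. xi >= s_i*w + c_i."""
        w, cuts, best = float(w0), [], np.inf
        for _ in range(iters):
            Jw, s = J(w), subgrad(w)
            best = min(best, Jw)               # min_i J(w_i)
            cuts.append((s, Jw - s * w))       # slope s_i, intercept c_i
            A_ub = [[si, -1.0] for si, _ in cuts]   # s_i*w - xi <= -c_i
            b_ub = [-ci for _, ci in cuts]
            res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                          bounds=[(lo, hi), (None, None)])
            w, lower = res.x
            if best - lower < tol:             # error eps_t sandwiches J(w*)
                break
        return w, best

    # Example: J(w) = |w - 1| + 0.5*|w + 2|, minimized at w = 1.
    J = lambda w: abs(w - 1.0) + 0.5 * abs(w + 2.0)
    sg = lambda w: np.sign(w - 1.0) + 0.5 * np.sign(w + 2.0)
    print(cutting_plane(J, sg, w0=5.0))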

(Standard) bundle method pseudo-code. This paper argues that some of its parameters are hard to tune.

Proposed bundle method (BMRM). The regularized risk minimization objective already has a regularization term: J(w) = lambda * Omega(w) + R_emp(w). So optimize this subproblem: w_{t+1} = argmin_w lambda * Omega(w) + max_{i <= t} [ <w, a_i> + b_i ], where a_i is a subgradient of R_emp at w_i and b_i = R_emp(w_i) - <w_i, a_i>. No need for a serious/null step.

In more detail: how to solve the subproblem in step 6. Reformulate as constrained optimization: min over (w, xi) of lambda * Omega(w) + xi subject to xi >= <w, a_i> + b_i for all i <= t; then call a linear/quadratic program depending on the regularizer (a sketch follows below). The number of constraints equals the number of iterations, unrelated to the number of samples! Dual program for the L2 regularizer: maximize -(1 / 2 lambda) * ||sum_i alpha_i a_i||^2 + sum_i alpha_i b_i over alpha >= 0 with sum_i alpha_i = 1.

Convergence analysis: every iteration the error is halved!! Experiments: actual speed (plots not transcribed).
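Not from the slides: a minimal sketch of the step-6 subproblem for the L2 regularizer Omega(w) = 0.5*||w||^2, using the constrained reformulation above. A real implementation would solve the small dual QP; here a generic SLSQP solver stands in, which is an assumption of convenience, not the paper's method.

    import numpy as np
    from scipy.optimize import minimize

    def bmrm_subproblem(A, b, lam):
        """min_w lam/2*||w||^2 + max_i(<a_i, w> + b_i), solved through the
        constrained form: min over (w, xi) of lam/2*||w||^2 + xi subject
        to xi >= <a_i, w> + b_i. A is the (t, d) matrix with rows a_i."""
        t, d = A.shape

        def obj(z):                        # z packs [w (d entries), xi]
            w, xi = z[:d], z[d]
            return 0.5 * lam * w.dot(w) + xi

        cons = [{"type": "ineq",           # xi - (<a_i, w> + b_i) >= 0
                 "fun": lambda z, i=i: z[d] - (A[i].dot(z[:d]) + b[i])}
                for i in range(t)]
        z0 = np.zeros(d + 1)
        z0[d] = b.max()                    # feasible start at w = 0
        return minimize(obj, z0, method="SLSQP", constraints=cons).x[:d]

    # Two cuts in 1-D: the piecewise max is max(w, 1 - w), so with lam = 1
    # the minimizer of w^2/2 + max(w, 1 - w) is w = 0.5.
    A = np.array([[1.0], [-1.0]])
    b = np.array([0.0, 1.0])
    print(bmrm_subproblem(A, b, lam=1.0))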

Outline: 1. Background: convexity and subgradients. 2. Cutting-plane algorithm. 3. Bundle method. 4. Different loss functions.

Binary classification. Accuracy-based loss, and convex upper-bounds on it (a sketch of the soft-margin case closes this transcript). Soft margin loss: max(0, 1 - y f(x)); there is a non-smooth point, with subgradients -y f'(x) and 0. Logistic (Gaussian process classifier, MAP solution): minimize log(1 + exp(-y f(x))). MCE (Katagiri et al.): a sigmoid, with an adjustable parameter.

Structured prediction. Similar to the previous slide: convex upper bounds for the structured loss, e.g. the CRF log-loss and the structured SVM max-margin loss. We can directly calculate subgradients in closed form, but we can also obtain them from an algorithm if that is more efficient; see Algorithm 7 in the JMLR paper.

ROC score. The AUC is not continuous in w, but a nonsmooth convex bound on it is. For bundle methods, just collect the subgradient vectors and give them to the LP/QP.

Do you see the pattern? 1. Give me any problem, with any evaluation metric (it may be difficult to optimize). 2. Think of a convex upper bound for the metric; this bound does not need to be smooth, we just need to get subgradients from it. 3. Solve with gradient descent or LBFGS, or with the subgradient method, bundle method, etc. 4. Done: submit a NIPS paper.

Discussions. Fear not non-smooth convex functions. What about non-convex optimization? EM-style training where the M-step is solved by a bundle method: Yu & Joachims, Learning Structural SVMs with Latent Variables (ICML 2009). Modified bundle method: Do & Artieres, Large Margin Training of HMMs with Partially-Observed States (ICML 2009).
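Not from the slides: a minimal sketch of the soft-margin loss and one valid subgradient for a linear model f(x) = <w, x>, which is exactly the quantity a bundle or subgradient solver needs per step. The toy data and all names are mine.

    import numpy as np

    def soft_margin_loss(w, X, y):
        """Average hinge loss max(0, 1 - y*<w, x>) and one valid subgradient
        with respect to w: -y*x where the margin is violated, 0 elsewhere
        (at the kink y*<w, x> = 1, either choice is a subgradient)."""
        margins = y * (X @ w)                   # assumes f(x) = <w, x>
        losses = np.maximum(0.0, 1.0 - margins)
        active = (margins < 1.0).astype(float)  # points with positive loss
        subgrad = -(active * y) @ X / len(y)
        return losses.mean(), subgrad

    # Toy data: four points in 2-D, labels in {-1, +1}.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    loss, s = soft_margin_loss(np.zeros(2), X, y)
    print(loss, s)   # exactly what a bundle/subgradient solver consumes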