Screening for a Reweighted Penalized Conditional Gradient Method
Open Journal of Mathematical Optimization, Volume 3 (2022), article no. 3, 35 p.

The conditional gradient method (CGM) is widely used in large-scale sparse convex optimization, owing to its low per-iteration computational cost for structured sparse regularizers and its greedy approach to collecting nonzeros. We explore the sparsity-acquiring properties of a general penalized CGM (P-CGM) for convex regularizers and a reweighted penalized CGM (RP-CGM) for nonconvex regularizers, replacing the usual convex constraints with gauge-inspired penalties. This generalization does not increase the per-iteration complexity noticeably. Without assuming bounded iterates or using line search, we show O(1/t) convergence of the gap of each subproblem, which measures distance to a stationary point. We couple this with a screening rule that is safe in the convex case, converging to the true support at a rate O(1/δ0^2), where δ0 measures how close the problem is to degeneracy. In the nonconvex case the screening rule converges to the true support in a finite number of iterations, but is not necessarily safe at the intermediate iterates. In our experiments, we verify the consistency of the method and adjust the aggressiveness of the screening rule by tuning the concavity of the regularizer.
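To make the abstract's ingredients concrete, the following is a minimal sketch of a P-CGM iteration with gap-based screening, written for the special case of the l1 gauge (atoms are signed coordinate vectors), a quadratic loss f(x) = 0.5*||Ax - b||^2, and a coercive penalty phi(a) = lam*a + (rho/2)*a^2. The function name p_cgm_l1, the choice of phi, and the screening constant are our illustrative assumptions, not the paper's exact formulation; the paper treats general atomic gauges and derives a sharper safe rule.

import numpy as np

def p_cgm_l1(A, b, lam, rho, iters=500):
    """Illustrative P-CGM sketch for  min_x 0.5*||Ax - b||^2 + phi(||x||_1),
    with phi(a) = lam*a + 0.5*rho*a**2.  The coercive penalty phi keeps the
    linearized subproblem bounded, replacing the usual norm-ball constraint;
    this is a sketch under our assumptions, not the paper's exact method."""
    phi = lambda a: lam * a + 0.5 * rho * a * a
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    active = np.ones(A.shape[1], dtype=bool)   # coordinates not yet screened
    for t in range(iters):
        grad = A.T @ (A @ x - b)               # gradient of the smooth part
        # Linearized subproblem over scaled atoms s = alpha * (-sign(g_j) e_j):
        # pick the steepest active coordinate, then the optimal scaling alpha,
        # which solves  min_{alpha >= 0}  -|grad_j| * alpha + phi(alpha).
        g_act = np.where(active, np.abs(grad), 0.0)
        j = int(np.argmax(g_act))
        alpha = max(0.0, (g_act[j] - lam) / rho)
        s = np.zeros_like(x)
        s[j] = -np.sign(grad[j]) * alpha
        # Generalized Frank-Wolfe gap of the subproblem: it is nonnegative,
        # upper-bounds the primal suboptimality, and vanishes at stationarity.
        gap = max(grad @ (x - s) + phi(np.abs(x).sum()) - phi(alpha), 0.0)
        # Gap-based screening in the spirit of the paper's rule (constants
        # simplified): every support coordinate of a solution x* satisfies
        # |grad*_j| >= lam, and ||grad - grad*|| <= sqrt(2*L*gap) for convex
        # L-smooth f, so coordinates failing the test below can be dropped.
        active &= np.abs(grad) + np.sqrt(2.0 * L * gap) >= lam
        x[~active] = 0.0                       # discard screened coordinates
        gamma = 2.0 / (t + 2.0)                # open-loop step, no line search
        x = (1.0 - gamma) * x + gamma * s
    return x, active

For instance, x_hat, support = p_cgm_l1(A, b, lam=0.1, rho=1.0) on a lasso-type instance typically eliminates out-of-support coordinates well before the iterates converge; because the gap enters the screening radius, the O(1/t) gap decay is what drives the support identification described in the abstract.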

DOI: 10.5802/ojmo.14
Keywords: Dual screening, conditional gradient method, atomic sparsity, reweighted optimization
Yifan Sun 1; Francis Bach 2

1 Stony Brook University - Department of Computer Science, Stony Brook, New York, USA
2 INRIA - Département d'Informatique de l'École Normale Supérieure, PSL Research University, Paris, France
License: CC-BY 4.0
Copyrights: The authors retain unrestricted copyrights and publishing rights
@article{OJMO_2022__3__A3_0,
     author = {Yifan Sun and Francis Bach},
     title = {Screening for a {Reweighted} {Penalized} {Conditional} {Gradient} {Method}},
     journal = {Open Journal of Mathematical Optimization},
     eid = {3},
     pages = {1--35},
     publisher = {Universit\'e de Montpellier},
     volume = {3},
     year = {2022},
     doi = {10.5802/ojmo.14},
     language = {en},
     url = {https://ojmo.centre-mersenne.org/articles/10.5802/ojmo.14/}
}
