The conditional gradient method (CGM) is widely used in large-scale sparse convex optimization: it has a low per-iteration computational cost for structured sparse regularizers and collects nonzeros greedily. We explore the sparsity-acquiring properties of a general penalized CGM (P-CGM) for convex regularizers and a reweighted penalized CGM (RP-CGM) for nonconvex regularizers, replacing the usual convex constraints with gauge-inspired penalties. This generalization does not noticeably increase the per-iteration complexity. Without assuming bounded iterates or using line search, we show $O(1/t)$ convergence of the gap of each subproblem, which measures distance to a stationary point. We couple this with a screening rule which is safe in the convex case, converging to the true support at a rate $O(1/\delta^2)$, where $\delta \ge 0$ measures how close the problem is to degeneracy. In the nonconvex case the screening rule converges to the true support in a finite number of iterations, but is not necessarily safe at intermediate iterates. In our experiments, we verify the consistency of the method and adjust the aggressiveness of the screening rule by tuning the concavity of the regularizer.
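For readers unfamiliar with the base algorithm, the classical conditional gradient (Frank–Wolfe) iteration the abstract builds on can be sketched as follows. This is a minimal illustrative loop for an $\ell_1$-ball-constrained least-squares problem with the standard open-loop step size $2/(t+2)$, not the paper's penalized P-CGM or RP-CGM subproblems; the function name and test problem are our own. Note how each iteration activates at most one new coordinate, which is the greedy nonzero collection the abstract refers to.

```python
import numpy as np

def cgm_l1(A, b, tau, iters=500):
    """Conditional gradient (Frank-Wolfe) for
    min_x 0.5 * ||A x - b||^2  subject to  ||x||_1 <= tau."""
    m, n = A.shape
    x = np.zeros(n)
    for t in range(iters):
        grad = A.T @ (A @ x - b)
        # Linear minimization oracle over the l1 ball: the minimizer
        # is a signed vertex, i.e. a single (signed, scaled) coordinate.
        i = np.argmax(np.abs(grad))
        s = np.zeros(n)
        s[i] = -tau * np.sign(grad[i])
        gamma = 2.0 / (t + 2.0)  # open-loop step size, no line search
        x = (1.0 - gamma) * x + gamma * s
    return x

# Small synthetic sparse-recovery instance.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x = cgm_l1(A, b, tau=np.sum(np.abs(x_true)))
```

Because every iterate is a convex combination of $\ell_1$-ball vertices, feasibility is maintained automatically and the support grows by at most one atom per iteration; the paper's penalized variants keep this per-iteration structure while replacing the hard constraint with a gauge penalty.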

Yifan Sun ^{1}; Francis Bach ^{2}

@article{OJMO_2022__3__A3_0,
  author = {Yifan Sun and Francis Bach},
  title = {Screening for a {Reweighted} {Penalized} {Conditional} {Gradient} {Method}},
  journal = {Open Journal of Mathematical Optimization},
  eid = {3},
  publisher = {Universit\'e de Montpellier},
  volume = {3},
  year = {2022},
  doi = {10.5802/ojmo.14},
  language = {en},
  url = {https://ojmo.centre-mersenne.org/articles/10.5802/ojmo.14/}
}


Yifan Sun; Francis Bach. Screening for a Reweighted Penalized Conditional Gradient Method. Open Journal of Mathematical Optimization, Volume 3 (2022), article no. 3, 35 p. doi: 10.5802/ojmo.14. https://ojmo.centre-mersenne.org/articles/10.5802/ojmo.14/
