Screening for a Reweighted Penalized Conditional Gradient Method

The conditional gradient method (CGM) is widely used in large-scale sparse convex optimization, offering a low per-iteration computational cost for structured sparse regularizers and a greedy approach to collecting nonzeros. We explore the sparsity-acquiring properties of a general penalized CGM (P-CGM) for convex regularizers and a reweighted penalized CGM (RP-CGM) for nonconvex regularizers, replacing the usual convex constraints with gauge-inspired penalties. This generalization does not increase the per-iteration complexity noticeably. Without assuming bounded iterates or using line search, we show $O(1/t)$ convergence of the gap of each subproblem, which measures distance to a stationary point. We couple this with a screening rule which is safe in the convex case, converging to the true support at a rate $O(1/\delta^2)$, where $\delta \geq 0$ measures how close the problem is to degeneracy. In the nonconvex case the screening rule converges to the true support in a finite number of iterations, but is not necessarily safe in the intermediate iterates. In our experiments, we verify the consistency of the method and adjust the aggressiveness of the screening rule by tuning the concavity of the regularizer.


Introduction
Conditional gradient methods (CGMs) are used in constrained optimization to quickly arrive at sparse solutions of large-scale optimization problems. In this paper, we generalize their applicability to nonconvex penalized (unconstrained) problems and investigate safe screening methods to obtain sparse supports in finite time. We describe these problems as
$$\min_{x\in\mathbb{R}^d}\; f(x) + \phi(r_{\mathcal P}(x)), \tag{1}$$
where $f : \mathbb{R}^d \to \mathbb{R}$ is a convex loss function with an $L$-Lipschitz continuous gradient, $\phi : \mathbb{R}_+ \to \mathbb{R}$ is a strictly convex monotonically increasing function, and $r_{\mathcal P} : \mathbb{R}^d \to \mathbb{R}_+$ is a nonconvex variant of a gauge function, defined as the solution to
$$r_{\mathcal P}(x) = \min_{c \ge 0}\; \sum_{p\in\mathcal P_0} \gamma(c_p) \quad \text{s.t.} \quad x = \sum_{p\in\mathcal P_0} c_p\, p, \tag{2}$$
for some concave monotonically increasing function $\gamma : \mathbb{R}_+ \to \mathbb{R}_+$. Here, $\mathcal P_0$ is a finite collection of vectors in $\mathbb{R}^d$. In the usual nonzero-sparsity case, this penalty reduces to well-studied nonconvex penalties like SCAD, LSP, or $\ell_p$-"norms" for $0 < p < 1$. Problems of this form arise in machine learning, compressed sensing, low-rank matrix factorization, etc., and are often observed in practice to be more effective sparsifiers than their convex relaxations [17].
In particular, we solve (1) using the following iteration scheme:
$$s^{(t)} \in \operatorname*{argmin}_{s}\; \nabla f(x^{(t)})^T s + h^{(t)}(s), \qquad \text{(Min-Maj)}$$
where $h^{(t)}(s)$ is a local convexification of $\phi(r_{\mathcal P}(s))$ at $x^{(t)}$, followed by a merge step that averages $s^{(t)}$ into $x^{(t+1)}$. We call this the reweighted penalized conditional gradient method (RP-CGM), as it resembles both the conditional gradient method (CGM) in sparse convex optimization and reweighting schemes in majorization-minimization methods for nonconvex optimization.
Example 1. The $\ell_1$ norm is formed by picking $\mathcal P_0 = \{\pm e_1, \dots, \pm e_d\}$, the signed unit bases, and $\gamma(\xi) = \xi$. Then the solution to (2) is always unique and can be expressed in closed form as $r_{\mathcal P}(x) = \|x\|_1$. Picking instead a concave penalty $\gamma(\xi) = 2\sqrt{\xi}$ leads to the variation $r_{\mathcal P}(x) = 2\sum_i \sqrt{|x_i|}$, the "half norm". Similar transformations also lead to the smoothly clipped absolute deviation (SCAD) penalty, minimax concave penalty (MCP), etc. (See Table 1.) By using a generalized convex aggregate penalty $\phi$, we can sweep the space between constrained and unconstrained problems via the penalty's tunable curvature: maximum curvature reduces to the usual constrained problem, and minimum curvature to the usual LASSO penalized problem. The addition of the nonconvex elementwise term $\gamma$ strengthens the sparsifying behavior. However, because of the sometimes erratic way that the conditional gradient method picks step directions, naive implementations of these features easily lead to divergence. Therefore, a main contribution of this work is to carefully identify the conditions on $\phi$ and $\gamma$ such that these two modified CGMs perform optimally.
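As a quick numerical illustration of Example 1 (a sketch of our own, not code from the paper; the helper `r_p` and the test vector are hypothetical), the two choices of $\gamma$ can be evaluated directly from the elementwise form of (2):

```python
import numpy as np

def r_p(x, gamma):
    """Penalty r_P(x) = sum_i gamma(|x_i|) for the signed-basis atom set
    P_0 = {±e_1, ..., ±e_d}, the special case of (2) used in Example 1."""
    return float(np.sum(gamma(np.abs(x))))

x = np.array([4.0, -1.0, 0.0])
l1 = r_p(x, lambda xi: xi)                    # gamma(xi) = xi       -> ||x||_1
half = r_p(x, lambda xi: 2.0 * np.sqrt(xi))   # gamma(xi) = 2*sqrt(xi) -> "half norm"
print(l1, half)   # 5.0 6.0
```

The concave choice rewards magnitude sublinearly, which is exactly what makes it a more aggressive sparsifier than the linear ($\ell_1$) choice.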
The other main contribution of this work concerns safe screening, in which the variable search space is reduced dynamically by identifying which components will safely not appear in the converged solution. For example, in nonzero sparsity, we identify early on the indices i in which we are guaranteed that x * i = 0, in hopes of prematurely estimating the solution sparsity pattern. This technique is intended to reduce memory and computational cost.

Conditional gradient method
When $h(s) = \iota_{\mathcal P}(s)$, the indicator for $s \in \mathcal P$, the proposed method is the conditional gradient method (CGM) [24,29]. Also called the Frank-Wolfe method, it has been studied since the 1950s and was revitalized recently [39] for its success at quickly estimating solutions to sparse optimization problems. Because this foundational method serves as a baseline, we will refer to it as the "vanilla CGM". This method is particularly useful when the computation of the supporting hyperplane in the (Min-Maj) step is cheap (e.g., when $\mathcal P$ is the unit ball of the $\ell_1$-norm or a group norm). Much work has come from expanding its use to general (atomic) norms [20,38,39,62], with many variations such as backward steps [42,59] and fully-corrective steps [65]. Many connections between the CGM and existing methods have also been discovered, such as to mirror descent [2], the cutting plane method [72], and greedy coordinate-wise methods [20]. In its simplest version (with no away-steps, line search, or strong convexity assumptions on $f$ or $\mathcal P$), the minimum duality gap in CGM converges at rate $O(1/t)$ [24].
Penalized variants of the CGM have been studied under boundedness assumptions on iterates [2], or with improvement steps to ensure boundedness of sublevel sets [36,69]. When $f$ is quadratic and for a special form of $\phi$, the P-CGM can be shown to be equivalent to a form of the iterative shrinkage method, and under proper problem conditioning, has linear convergence [9,10].

Reweighted methods for nonconvex minimization
Our main algorithmic novelty is to solve a sequence of reweighted penalized CGM (RP-CGM) iterations in order to accommodate nonlinear $\gamma$, which appears in nonconvex penalties like the SCAD or MCP penalties in difference-of-convex or majorization-minimization methods. This results in a nonconvex penalty $h(x)$, which in practice has been shown to have superior sensing properties [17,21,26,33,48,56,67,68]. We leverage these observations to improve the screening properties of RP-CGM; by increasing the concavity of $\gamma$, we can create an aggressive support recovery method based on an easily computable duality-gap-like residual.

Applications
A main use case of CGMs is in finding generalized sparse solutions to convex losses [15,39], where the $\ell_1$-norm penalty, which promotes element-wise sparsity [13,14,22,63], is generalized to gauge functions that promote sparsity with respect to "atoms", or low-dimensional facets of a convex set. This generalizes sparse optimization to applications such as low-rank matrix optimization [31,69] and grouped feature extraction [6,64,71]. Additionally, these atoms may be feasible solutions to combinatorial problems, such as in submodular optimization [1] and object tracking [16]. The CGM has also been applied to a variety of machine learning tasks, such as graphical models [41], multitask learning [61], SVMs [43], particle filtering [44], and deep learning [5,57].

Safe screening
A screening rule returns an estimate of the support of x * given a noisy approximation x. The screening rule is safe if there are no false positives (and called sure if there are no false negatives). Safe screening rules for LASSO were first proposed by [25], and have since been extended to a number of smooth losses and generalized penalties [7,28,46,49,51,66]. An interesting related work is the "stingy coordinate descent" method [40] for LASSO, which optimizes the sparse regularized problem in a CGM-like manner, but uses screening to dynamically skip steps; this kind of method can be extended to P-CGM as well for generalized atoms. In nonconvex optimization, support recovery is discussed by [12] for handling nonlinear constraints which are iteratively linearized, and screening rules by [58] are proposed for a reweighted proximal gradient method.

Contributions and outline
We analyze the support recovery and convergence properties of P-CGM and RP-CGM on (1). We assume that the loss function $f$ is $L$-smooth, the function $\phi$ grows at least quadratically asymptotically, the function $\gamma$ has slope bounded away from $0$ and $+\infty$, and the set $\mathcal P_0$ is either finite or the union of a finite set and a nonoverlapping cone. We give three main contributions.
Under mild assumptions the RP-CGM converges to a stationary point. In particular, without boundedness assumptions on iterates, using the deterministic step size schedule $\theta^{(t)} = 2/(1+t)$, the function value error and gap-like residual of RP-CGM converge as $O(1/t)$. We offer an online gap-based screening rule, which at each iteration removes some of the non-support atoms of the true solution $x^*$. This rule is safe for convex penalties and a useful heuristic for nonconvex penalties; for all penalties it converges in finite time to the true support. Having this information can improve caching and thus subproblem efficiency, and can be used in two-stage methods if the method is terminated early. In general, CGM without line search or away steps does not guarantee finite-time support recovery. We thus give a finite-time support identification rate of $O(1/\delta^2)$ on the post-screened atoms, where $\delta$ is a problem-dependent conditioning parameter that measures the distance to degeneracy.
We present the RP-CGM in three stages, with increasing complexity. In Section 2 we consider the nonconvex element-wise penalty, giving the key intuition behind the general method, with simple proofs and analysis. In Section 3 we consider the generalized convex gauge penalized problem, using P-CGM, and show how to handle simple recession cones in P. Finally, in Section 4, we introduce reweighting of the gauge penalties, and give fully general convergence results and screening rules. Experimental results suggest promising method behavior in Section 5.

Reweighted Penalized CGM for simple sparse recovery
In this section, we introduce the RP-CGM on problems that regularize for nonzero elementwise sparsity. The goal is to present a simple instance of the full method, to clearly describe the implementation and screening steps, and to give intuition for its analysis. Later, we will expand the analysis to more general problems. We begin by considering the optimization problem
$$\min_{x\in\mathbb{R}^d}\; f(x) + \phi(r(x)), \qquad r(x) = \sum_{i=1}^d \gamma(|x_i|). \tag{3}$$
This is the simplification of (1) with $r := r_{\mathcal P}$ and $\mathcal P_0 = \{\pm e_1, \dots, \pm e_d\}$ the signed unit basis. The more general case of the $r_{\mathcal P}$ gauge-like penalty follows a similar analysis to what is presented in this section, and can be viewed intuitively as sparsity in a preimage space.

Reweighted penalized CGM
Inspired by methods in the majorization-minimization and difference-of-convex literature, we propose the RP-CGM, which at each iteration takes a penalized conditional gradient step over the following convex proxy problem:
$$\min_x\; f(x) + \phi\big(r_0 + \bar r(x;\bar x)\big), \qquad \bar r(x;\bar x) := \sum_i \gamma'(|\bar x_i|)\,|x_i|, \tag{4}$$
where $\bar r(x;\bar x)$ is the linearized function of $r$ with reference point $\bar x$. We summarize the linearized function in terms of a slope and offset,
$$w_i = \gamma'(|\bar x_i|), \qquad r_0 := r(\bar x) - \bar r(\bar x;\bar x).$$
The RP-CGM on (3) runs by repeatedly iterating
$$s^{(t)} \in \operatorname*{argmin}_s\; \nabla f(x^{(t)})^T s + \phi\big(r_0 + \bar r(s; x^{(t)})\big), \tag{5}$$
followed by the merge step $x^{(t+1)} = (1-\theta^{(t)})x^{(t)} + \theta^{(t)} s^{(t)}$, for some predetermined decaying step size sequence $\theta^{(t)} = O(1/t)$. We decompose step (5) as follows. First, assigning the reweighted variables
$$u_i = w_i s_i, \qquad v_i = -\nabla f(x^{(t)})_i / w_i, \tag{7}$$
then (5) is equivalently expressed as
$$\max_u\; v^T u - \phi(r_0 + \|u\|_1), \tag{8}$$
which incidentally is also the conjugate function of $g(u) = \phi(r_0 + \|u\|_1)$, evaluated at $v$. Now, we further simplify the task by dividing $u$ into a direction and magnitude, $u = \xi\hat u$ with $\|\hat u\|_1 = 1$ and $\xi \ge 0$. Then, because $\hat u$ and $\xi$ can be optimized independently, (8) can be further simplified to two separable problems:
$$\max_{\|\hat u\|_1 = 1}\; v^T\hat u, \qquad \max_{\xi\ge 0}\; \xi\|v\|_\infty - \phi(r_0 + \xi).$$
Solving for $\hat u$ is exactly the same as the usual LMO for vanilla CGM, and is simply $\hat u = \operatorname{sign}(v_k)e_k$ where $k = \operatorname*{argmax}_k |v_k|$. Solving for $\xi$ is at worst a 1-D convex optimization problem, which can be solved efficiently via bisection. However, if we pick $\phi$ cleverly, then recognizing the convex conjugate $\phi^*(\nu) = \max_\xi\, \nu\xi - \phi(\xi)$, the optimum satisfies $\xi + r_0 = (\phi^*)'(\nu)$, the derivative of $\phi^*$ at $\nu = \|v\|_\infty$. (To relate to the vanilla CGM, where $\phi(\xi) = \iota_{\{\xi\le 1\}}$, the convex conjugate is $\phi^*(\nu) = \nu$, which is always optimized at $\xi = 1$.) This leads to the efficient generalization of CGM in Alg. 1.

Algorithm 1 RP-CGM on simple sparse optimization

Initialize with any $x^{(0)} \in \mathbb{R}^d$.
for $t = 0, 1, 2, \dots$ do
   Compute the slopes $w_i = \gamma'(|x_i^{(t)}|)$, offset $r_0$, and reweighted gradient $v = -\nabla f(x^{(t)}) \oslash w$.
   Compute the next atom $s = \xi \operatorname{sign}(v_k)e_k$ in two steps: (Min-Maj)
      1. Find the maximizing index $k = \operatorname*{argmax}_i |v_i|$.
      2. Compute the magnitude $\xi = (\phi^*)'(|v_k|) - r_0$.
   Merge: $x^{(t+1)} = (1-\theta^{(t)})x^{(t)} + \theta^{(t)} s$. (Merge)
end for
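The loop can be sketched end-to-end in code. This is a minimal illustration of our own, not the authors' implementation: we assume the hypothetical choices $f(x) = \frac12\|Ax-b\|_2^2$, $\phi(\xi) = \frac{\mu}{2}\xi^2$ (so $(\phi^*)'(\nu) = \nu/\mu$), and $\gamma(c) = c + \beta\log(1+c)$, whose slope lies in $[1, 1+\beta]$, bounded away from $0$ and $+\infty$ as the assumptions require:

```python
import numpy as np

def rp_cgm(A, b, mu=10.0, beta=0.5, iters=500):
    """Sketch of Algorithm 1 (hypothetical instantiation, not the paper's code):
    f(x) = 0.5*||Ax - b||^2, phi(xi) = (mu/2)*xi^2 so (phi*)'(nu) = nu/mu,
    and gamma(c) = c + beta*log(1 + c) with slope gamma'(c) = 1 + beta/(1+c)."""
    d = A.shape[1]
    x = np.zeros(d)
    r = lambda v: float(np.sum(np.abs(v) + beta * np.log1p(np.abs(v))))
    for t in range(1, iters + 1):
        grad = A.T @ (A @ x - b)                   # gradient of the quadratic loss
        w = 1.0 + beta / (1.0 + np.abs(x))         # slopes w_i = gamma'(|x_i|)
        r0 = r(x) - float(np.sum(w * np.abs(x)))   # offset of the linearization
        v = -grad / w                              # reweighted gradient
        k = int(np.argmax(np.abs(v)))              # step 1: maximizing index
        xi = max(abs(float(v[k])) / mu - r0, 0.0)  # step 2: magnitude via (phi*)'
        s = np.zeros(d)
        s[k] = xi * np.sign(v[k]) / w[k]           # atom mapped back via s = u/w
        theta = 2.0 / (t + 1.0)                    # decaying step size schedule
        x = (1.0 - theta) * x + theta * s          # (Merge)
    return x

# Hypothetical small instance: a 2-sparse signal observed through Gaussian A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
x_true = np.zeros(8); x_true[0] = 1.0; x_true[1] = -0.5
b = A @ x_true
x_hat = rp_cgm(A, b)
```

On this toy instance the sketch decreases the penalized objective relative to the zero initialization; $\mu$ trades off shrinkage of the solution against aggressiveness of each magnitude step.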

The convex penalty function φ
The vanilla CGM solves an optimization problem over a bounded set,
$$\min_{x\in\mathcal P}\; f(x), \tag{9}$$
where $\mathcal P$ is some closed compact set; for example, a common choice of $\mathcal P$ is a norm ball. By introducing $\phi$, we allow the problem statement to generalize not just to convex sets, but to convex penalties as well. Specifically, let us first constrain $\gamma(c) = c$. If $\phi$ is an indicator function, then (3) is equivalent to (9) where $\mathcal P$ is the $\ell_1$-norm ball. On the other extreme, if we allow $\phi(\xi) = \xi$, then (3) resembles the usual LASSO penalized problem for sparse optimization. This choice poses a fundamental problem for the RP-CGM, since the conjugate function $\phi^*(\nu) = \iota_{\{\nu<1\}}(\nu)$, and the recovered $\xi$ will either be $0$ (no step) or $+\infty$ (immediate divergence). Therefore, it is clear that some curvature must be imposed upon $\phi$ for Algorithm 1 to be convergent.
This minimum curvature assumption is also essential for convergence analysis. Under the usual CGM framework, each new iterate s ∈ P is by design bounded, so as long as θ (t) decays, convergence is guaranteed. In the P-CGM and RP-CGM case, Assumption 1 is much weaker than boundedness, and leads to the following growth property.

Lemma 2. If Assumption 1 holds, then $\phi^*$ is smooth everywhere, and the derivative of $\phi^*$ is asymptotically nonexpansive; i.e., for some finite-valued $\xi_0$,
$$(\phi^*)'(\nu) \le \frac{\nu}{\mu_\phi} + \xi_0.$$

The proof is in Appendix A. Since $\xi = (\phi^*)'(\nu)$ will be the magnitude of each new step, this lemma says that $\xi$ can grow at most linearly with $\nu$, the magnitude of the gradient. We can interpret this as a relaxation of a boundedness assumption to a controlled-growth assumption, which is not fully general, but still much weaker.
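As a concrete check of Lemma 2, consider the hypothetical choice $\phi(\xi) = \xi + \frac{\mu_\phi}{2}\xi^2$ on $\xi \ge 0$ (linear plus quadratic, with curvature parameter $\mu_\phi > 0$); the asymptotic bound then holds with $\xi_0 = 0$:

```latex
% Worked example for Lemma 2 with phi(xi) = xi + (mu_phi/2) xi^2, xi >= 0:
\phi^*(\nu) \;=\; \sup_{\xi \ge 0}\ \nu\xi - \xi - \tfrac{\mu_\phi}{2}\xi^2
           \;=\; \frac{\max(\nu - 1,\, 0)^2}{2\mu_\phi},
\qquad
(\phi^*)'(\nu) \;=\; \frac{\max(\nu - 1,\, 0)}{\mu_\phi}
           \;\le\; \frac{\nu}{\mu_\phi} + \xi_0 \quad (\xi_0 = 0).
```

The derivative grows at most linearly in $\nu$ with slope $1/\mu_\phi$, so step magnitudes stay controlled by the gradient magnitude, which is exactly the property the convergence proof exploits.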

The concave sparsifier γ
The function $\gamma$ is inspired by concave regularization functions like the LSP or fractional $\ell_p$-norms, which have been shown in practice to more aggressively enforce sparsity. Other popular concave penalties are listed in Table 1; a more complete table is given by [33,58]. The linearization (4), given $\gamma$ concave, is a majorant of (3) and is exactly tight when $x^{(t)}$ reaches a stationary point. However, actually computing the reweighted LMO can be numerically ill-defined if $\gamma'(|x_i|)$ is either $0$ or $+\infty$, since the reweighted variables (7) will be ill-defined. This leads us to impose Assumption 2 on $\gamma$.
Note that the standard nonconvex sparsifiers (SCAD, MCP, LSP, p-norm for p < 1) do not satisfy these assumptions, and when used directly in this reweighting scheme will cause numerical instability. Therefore, we make the following modifications, to ensure stability of RP-CGM.
It is interesting to note that though we do not use the "full effect" of these canonical sparsifiers, we are able to leverage their aggressive sparsifying effect. When even a very small amount of nonconvex curvature is present, we notice a significant benefit in the numerical experiments in terms of screening and sparsification of the final solution.

Stationary points and support recovery
We define the support of $x$ as the set of indices of its nonzeros, $\operatorname{supp}(x) = \{i : x_i \ne 0\}$. For a method producing iterates $x^{(1)}, x^{(2)}, \dots \to x^*$, we say that this method has recovered the support at iteration $\bar t$ if for all $t \ge \bar t$, $\operatorname{supp}(x^{(t)}) = \operatorname{supp}(x^*)$. For a continuous (possibly nonconvex) function $h$, we work with the Clarke subdifferential [18,19]. Given Assumptions 1 and 2, the Clarke subdifferential for $h(x) = \phi(r(x))$ is
$$\partial h(x) = \phi'(r(x)) \cdot \partial r(x),$$
where we use the $\cdot$ notation here for scaling elements in a set ($\alpha \cdot S = \{\alpha x : x \in S\}$). In other words, in cases where $\phi'(r(x))$ exists, the optimality conditions can be summarized as follows: $x^*$ is a stationary point of (3) if $-\nabla f(x^*) \in \phi'(r(x^*)) \cdot \partial r(x^*)$.

Example 5. Suppose that $\gamma(|x_i|) = |x_i|$ and $\phi(\xi) = \frac12\xi^2$. Since $h(x) = \phi(r(x))$ is convex in this example, the Clarke subdifferential reduces to the usual convex subdifferential, and can be expressed element-wise. The optimality conditions can then be summarized in terms of "wiggle room"; that is, whenever $x_i = 0$, then $\nabla f(x)_i$ lies in an interval, but when $x_i \ne 0$, $\nabla f(x)_i$ must take a specific value. Duality will then allow the element-wise gradient to act as a sparsity indicator. (See also [53,69].)

For a concave $\gamma$, the elementwise linearized penalty takes the form $\bar r(\xi) = |\xi| + \xi_0$, and its Clarke subdifferential can likewise be expressed element-wise. Again, note that the duality conditions show "wiggle room" in the values of $\nabla f(x)$ at stationary $x = x^*$ for the indices at which $x_i = 0$. However, in the case of nonconvex functions $\gamma$, the gradient at optimality is less informative, since $\gamma'(|x_i|)$ changes with different input values, and moreover is not necessarily maximal when $|x_i| > 0$. For this reason, designing screening rules is nontrivial for nonconvex penalty functions, and fully safe rules may not prove fully efficient.
[Figure caption] Left: the concave penalty $\gamma$ increases the "spikiness" of the penalty; the convex penalty $\phi$ increases the effect of the aggregate value. Right: three example functions of $\gamma$. RP-CGM will behave erratically when $\gamma_{\min} = 0$ (red and blue) and $\gamma_{\max}$ is unbounded (red), so we use a penalty that is bounded on both ends (green = concave + linear).

Duality
We now give the primal and Fenchel dual formulations of (3) given a reference point $\bar x$, denoted (P-simple) and (D-simple).
Here, we define $r_0 := r(\bar x) - \bar r(\bar x; \bar x)$ and $w_i = \gamma'(|\bar x_i|)$. Given $\bar x$, both primal and dual objective functions are convex. In particular, the duality gap of this convexified subproblem, using a primal candidate $x$ and dual candidate $z = -\nabla f(x)$, can be expressed in an online-computable form and adds little overhead when used to monitor the progress of Alg. 1. Now, we will show that $\operatorname{gap}(x; \bar x)$ is an effective residual measurement, and indeed converges to $0$ at the usual $O(1/t)$ rate.

Convergence of RP-CGM
We begin with an unusual twist on a usual assumption.

Assumption 3 (L-smoothness).
We assume that $f$ is convex and $L$-smooth w.r.t. $\|\cdot\|_1$:
$$f(y) \le f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|_1^2 \quad \text{for all } x, y. \tag{13}$$
An important consequence of (13) is that, while the set of minimizers of (P-simple) may not necessarily be a singleton, their gradient $\nabla f(x^*)$ is unique; in particular, $x$ is optimal only if $\nabla f(x) = \nabla f(x^*)$. Under Assumption 3, we first show that the duality gap of the original nonconvex problem (3) is (as expected) bounded away from $0$, and is thus an inadequate measure of suboptimality.

Proof sketch. Given the conjugate function of the penalty and a suitable dual candidate, the majorant property of the linearizer (since $\phi$ is monotonically increasing and $\gamma$ is concave) bounds the original gap strictly away from $0$; since this holds for any candidate pair, the bound holds at every iterate. In other words, the duality gap of the original nonconvex problem is essentially useless for screening, since it does not converge to $0$. Instead, we measure convergence via the gap of the linearized problem at $\bar x = x$.

Proposition 8 (Residual). The duality gap $\operatorname{gap}(x, z; \bar x)$ between (P-simple) and (D-simple) at primal variable $x$ and dual variable $z$ decomposes into two nonnegative terms (a) and (b), where $\odot$ and $\oslash$ represent element-wise multiplication and division, respectively. Tightness of (a) occurs iff Fenchel-Young is satisfied with equality, e.g. $\nu \in \partial\phi(\bar r(x))$. Tightness of (b) occurs iff the primal and dual candidates are aligned. Combining these two observations, $\operatorname{gap}(x, z; \bar x) = 0$ recovers exactly the condition for $x = x^*$ to be a stationary point of (3).

Theorem 9 (Convergence). Suppose the iterates $x^{(t)}$ are generated by the steps dictated in (Min-Maj) and (Merge), using the step size sequence $\theta^{(t)} = 2/(1+t)$. Then the gap residual converges as $O(1/t)$.

This is a special case of Theorem 33, which is proven in Section 4 and Appendix B. The proof is inductive, and shows that $O(1/t)$ behavior "kicks in" at a large enough $t$; explicit constants are given in Section 4.

Convex support recovery and screening
To understand how gap-based screening works, suppose first that for some $x$, we magically have a bound $\epsilon$ on the gradient error over all indices:
$$|\nabla f(x)_i - \nabla f(x^*)_i| \le \epsilon \quad \text{for all } i.$$
Then the value of the true maximum gradient at the stationary point is at most $\epsilon$ away from the maximum value of the current gradient, e.g.
$$\max_i |\nabla f(x^*)_i| \ge \max_i |\nabla f(x)_i| - \epsilon.$$
Moreover, if at any index $k$,
$$|\nabla f(x)_k| < \max_i |\nabla f(x)_i| - 2\epsilon,$$
then $|\nabla f(x^*)_k| \le |\nabla f(x)_k| + \epsilon < \max_i |\nabla f(x^*)_i|$; in other words, index $k$ cannot possibly be maximal. Therefore, it must be that at optimality, $x_k^* = 0$. The last missing detail is the observation that the duality gap gives us this bound explicitly.
Now we formalize this notion. From the optimality conditions, for this convex reweighting problem the sparsity pattern of $x^*$ can be partially ascertained from $\nabla f(x^*)$, in that the set of nonzeros of $x^*$ must be contained in the set of maximal indices of the reweighted $\nabla f(x^*)$. Formally, define
$$\operatorname{dsupp}(x; \bar x) = \Big\{ i : \frac{|\nabla f(x)_i|}{w_i} = \max_j \frac{|\nabla f(x)_j|}{w_j} \Big\}. \tag{17}$$
Then the optimality condition (16) states that $\operatorname{supp}(\bar x^*) \subseteq \operatorname{dsupp}(\bar x^*; \bar x)$, where $\bar x^*$ minimizes (P-simple). We are in particular interested in $\bar x^* = x^*$, the stationary point of (3). From this observation, we have our first screening property.
Proposition 10 (Screening for simple sparsity) makes this precise; since it is a consequence of Proposition 34 in Section 4, we leave the proof for then.
From these two properties, we immediately get a screening rule for (3). Theorem 12 (Screening rule). For any $x$, define the screened index set by the gap-based test (19). If index $k$ fails the test, then $x_k^* = 0$, where $x^*$ is any minimizer of (3).
Note that in the convex case ($\gamma(|x_i|) = |x_i|$), $D(x) = 0$ and $\epsilon = 0$ is a safe choice for all $x$. In the general case, since we do not know $D(x)$, we cannot guarantee the safety of screening at an intermediate iterate; however, since $D(x^*) = 0$ by definition of a stationary point, $x^{(t)} \to x^*$ implies $D(x^{(t)}) \to 0$. Picking any decaying sequence $\epsilon^{(t)} \to 0$ therefore forms a heuristic rule that converges to the true support.
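A minimal sketch of the resulting rule for the simple-sparsity case (the function name and the numbers are hypothetical; in practice $\epsilon$ would come from a gap bound or the decaying sequence $\epsilon^{(t)}$):

```python
import numpy as np

def screen(grad, w, eps):
    """Gap-based screening sketch: keep only the indices whose reweighted
    gradient magnitude is within 2*eps of the maximum. All others satisfy
    x*_k = 0 whenever eps truly bounds the elementwise gradient error,
    which makes the rule safe in the convex case."""
    v = np.abs(grad) / w
    return np.flatnonzero(v >= v.max() - 2.0 * eps)

# Hypothetical numbers: index 2 is clearly maximal; with eps = 0.1 the
# rule also retains index 1 as a candidate.
idx = screen(np.array([0.1, -0.9, 1.0, 0.05]), np.ones(4), eps=0.1)
print(idx)   # [1 2]
```

Shrinking `eps` toward 0 makes the rule more aggressive, which mirrors how increasing the concavity of $\gamma$ (and hence the informativeness of the reweighted gradient) sharpens screening in the experiments.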

Degeneracy and support recovery guarantee
Following the terminology introduced in [37], we say that a solution is degenerate if some index outside its support also attains the maximum in the dual optimality condition.
To characterize nearly degenerate solutions, we define $\delta_{\min}(x^*)$ as the smallest margin by which the non-support indices fall short of this maximum; the quantity $\delta_{\min}(x^*)$ expresses the distance to degeneracy for this solution. This can be interpreted as a complementary-slackness-like condition in duality, where both the primal and dual variables are jointly active. While we may reasonably believe that many real-world problems with randomized data do not lead to degenerate solutions, near-degenerate solutions do pose problems for screening and manifold identification [11,37,45].

Corollary 13.
If $\delta_{\min} > 0$, then for a method $x^{(t)} \to x^*$, the screening rule (19) with $\epsilon = 0$ identifies $\operatorname{supp}(x^*)$ after a finite number of iterations $\bar t$; that is, for all $t \ge \bar t$, the screened index set equals $\operatorname{supp}(x^*)$.

P-CGM for general convex sparse optimization
Our goal is now to extend the results of the previous section to the generalized sparse optimization problem (1). The key addition is the "gauge-like" function $r_{\mathcal P}(x)$, which retains the sparsifying properties of $\gamma$. In this section, we will focus on problem (32) when it is convex; namely, we assume that $\gamma(c) = c$. Just as studying the convex LASSO brings to light many of the sparse recovery properties of the nonconvex problem (3), we will first study the convex penalized version of (32) to gain intuition, and present the full extension in the next section.

Gauge penalized problems
The penalized CGM (P-CGM) solves problems of the form
$$\min_{x\in\mathbb{R}^d}\; f(x) + \phi(\kappa_{\mathcal P}(x)), \tag{20}$$
where $\kappa_{\mathcal P}(x)$ is the gauge function [15,30] defined by a set $\mathcal P$ at point $x$:
$$\kappa_{\mathcal P}(x) = \min_{c\ge 0}\; \sum_{p\in\mathcal P_0} c_p \quad \text{s.t.} \quad x = \sum_{p\in\mathcal P_0} c_p\, p. \tag{21}$$
This function generalizes the $\ell_1$-norm to other size-measuring functions, including norms, semi-norms, and convex cone restrictions. It is useful to compare (21) with the definition usually given in the convex analysis literature [8,60], where the gauge function over a closed convex set $\mathcal P$ is defined as
$$\kappa_{\mathcal P}(x) = \inf\{\lambda \ge 0 : x \in \lambda\mathcal P\}.$$
In fact, this is equivalent to (21). In particular, when $\mathcal P$ is the convex hull of a set of atoms, $\kappa_{\mathcal P}(x)$ can be used to promote sparsity with respect to those atoms. The corresponding "dual gauge" is the support function
$$\sigma_{\mathcal P}(z) = \max_{p\in\mathcal P}\; z^T p,$$
which is closely related to the generalized LMO $\operatorname*{argmax}_{p\in\mathcal P} z^T p$. If $\kappa_{\mathcal P}$ is a norm, then $\sigma_{\mathcal P}$ is the usual dual norm [8,60]. A key feature of the CGM is that this LMO is often cheap to compute in practice, and despite weaker convergence guarantees compared to higher-order methods, it often converges quickly when $x^*$ is sparse with respect to a structured $\mathcal P_0$. (See also Table 2.)

Example 14 ($\ell_1$ norm). We start with the usual sparsity case of the $\ell_1$ norm. In this case, $\sigma_{\mathcal P} = \|\cdot\|_\infty$ is the dual norm of $\|\cdot\|_1$. Setting the optimality condition $0 \in \partial g(x^*)$ and decomposing by index, the gradient of $f$ along a coordinate at which the optimal variable is nonsmooth with respect to $\kappa_{\mathcal P}$ is allowed "wiggle room"; in contrast, if $g(x)$ is smooth in the direction of $x_i$ then the gradient is fixed. In terms of support recovery, $\max_i |\nabla f(x^*)_i| = \phi'(\|x^*\|_1)$, and additionally, if $|\nabla f(x^*)_i| < \phi'(\|x^*\|_1)$ then it must be that $x_i^* = 0$.

Example 15 (Weighted $\ell_1$ norm). The convex majorant in Section 2 specifically considered $\kappa_{\mathcal P}(x) = \sum_i w_i|x_i|$, for weights $w_i > 0$. Here, $\mathcal P_0 = \{\pm w_1^{-1}e_1, \dots, \pm w_d^{-1}e_d\}$, with corresponding "dual gauge" $\sigma_{\mathcal P}(z) = \max_i |z_i|/w_i$, and the LMO follows exactly the steps for the bounded maximization computation in (5).
Note also that the optimality condition of (20) for this choice of $\kappa_{\mathcal P}(x)$ exactly characterizes the optimality conditions for (P-simple). Later, we will generalize this reweighting technique to general atomic sets $\mathcal P_0$, to construct the convex majorant of the general nonconvex problem (1).
Example 16 (Latent group norm). For the task of selecting a sparse collection of overlapping subvectors, such as in gene identification, the latent group norm was proposed in [55]. For $x\in\mathbb{R}^d$, given a collection of overlapping groups $\mathcal G = \{G_1,\dots,G_K\}$ where $G_k \subset \{1,\dots,d\}$, this norm is a gauge function,
$$\|x\|_{\mathcal G} = \min_{s_1,\dots,s_K}\; \sum_{k=1}^K \|s_k\|_2 \quad \text{s.t.} \quad x = \sum_{k=1}^K s_k, \;\; \operatorname{supp}(s_k)\subseteq G_k. \tag{23}$$
In particular, (23) is the solution to (21) when $\mathcal P_0$ consists of the unit $\ell_2$-balls restricted to each group. Then $\sigma_{\mathcal P}(z) = \max_{k=1,\dots,K} \|z_{G_k}\|_2$. Now consider (20) for some smooth $\phi$. Screening in this case refers to identifying the subvectors where, at optimality, $\|s_k^*\|_2$ might be nonzero; however, just as support identification in the $\ell_1$-norm case does not imply that the values of $x_i^*$ are known, in a similar vein here it does not imply that the values of $s_k^*$ are known.
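For concreteness, the latent group norm LMO can be sketched as follows (our own illustration, not the paper's code; since $\sigma_{\mathcal P}(z) = \max_k \|z_{G_k}\|_2$, the maximizing atom is the normalized restriction of $z$ to the best group):

```python
import numpy as np

def latent_group_lmo(z, groups):
    """LMO sketch for the latent group norm: pick the group whose restriction
    of z has the largest Euclidean norm; the atom is that restriction,
    normalized to unit l2 norm (zero-padded elsewhere)."""
    norms = [np.linalg.norm(z[g]) for g in groups]
    k = int(np.argmax(norms))
    s = np.zeros_like(z, dtype=float)
    if norms[k] > 0:
        s[groups[k]] = z[groups[k]] / norms[k]
    return s, k

# Overlapping groups (hypothetical): the middle group carries the most mass.
z = np.array([0.5, 3.0, 4.0, 0.1])
s, k = latent_group_lmo(z, [[0, 1], [1, 2], [2, 3]])
print(k)   # 1
```

Note that `z @ s` equals $\|z_{G_k}\|_2 = \sigma_{\mathcal P}(z)$ for the selected group, which is exactly the dual-gauge value used in the gap computation.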
Both P-CGM and RP-CGM can be efficiently implemented for the latent group norm. However, a key numerical issue is that computing the group norm $\|x\|_{\mathcal G}$ when the groups overlap is computationally burdensome (it requires solving nontrivial subproblems) and is needed in the gap computation. Nevertheless, since gap computations are used only infrequently, for monitoring progress and for screening, this overhead can be mitigated. (Note that computing the dual norm, and thus the LMO, is comparatively cheap.)

Example 17 (Nuclear norm). For a matrix $X\in\mathbb{R}^{m\times n}$, the nuclear norm $\|X\|_*$, defined as the sum of singular values of $X$, can be expressed as a gauge over the infinite set $\mathcal P_0 = \{uv^T : \|u\|_2 = \|v\|_2 = 1\}$. Because $\mathcal P_0$ is not a finite set, screening in this scenario will most likely not be very efficient, or even useful. However, CGM is frequently applied to this choice of $\mathcal P_0$ in order to promote low-rank matrix solutions, and applying P-CGM to spectral problems is a central application in [69]. In particular, while computing the nuclear norm requires a full spectral calculation, computing the dual norm, the spectral norm, is often much cheaper using fast spectral methods, and can often exploit compressibility [70]. Table 2 summarizes these examples and key properties. Gauges and support functions for convex sets are fundamental objects in convex analysis, and are discussed further by [8,30,32,60].
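The corresponding LMO for the nuclear-norm atoms can be sketched with a full SVD (an illustration of our own; in practice one would use power iteration or a Lanczos method to compute only the leading pair):

```python
import numpy as np

def nuclear_lmo(Z):
    """LMO sketch for the nuclear-norm atom set: the atom maximizing
    <Z, u v^T> over unit-norm u, v is the leading singular pair of Z,
    and the attained value is the spectral norm of Z."""
    U, svals, Vt = np.linalg.svd(Z, full_matrices=False)
    return np.outer(U[:, 0], Vt[0, :])

Z = np.array([[3.0, 0.0], [0.0, 1.0]])
atom = nuclear_lmo(Z)
print(round(float(np.sum(Z * atom)), 6))   # 3.0, the spectral norm of Z
```

Each LMO call thus adds a single rank-one atom, which is why CGM iterates remain low-rank even though the atom set is infinite.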
Example 18 (Total variation (TV) "norm"). We now investigate a case where $\mathcal P_0$ contains a direction of recession, which introduces some ambiguity into our construction. Specifically, we investigate the TV norm, often used in signal processing as a "smoothing regularizer":
$$\|x\|_{TV} = \sum_{i=1}^{d-1} |x_{i+1} - x_i|.$$
A common way to express this in matrix/vector notation is to introduce a difference matrix $D\in\mathbb{R}^{(d-1)\times d}$ with $(Dx)_i = x_{i+1} - x_i$, so that $\|x\|_{TV} = \|Dx\|_1$. In particular, for any constant vector $x$, $\|x\|_{TV} = 0$. This adds an unbounded direction for the support function, and thus the LMO is not always defined. Note here that if $z\in\operatorname{range}(D^T)$, then $u = D(D^TD)^{-1}z$ is uniquely determined; this inspires an "effective band-aid" to deal with directions of recession.
Whenever $\mathcal P$ has a direction of recession, CGM struggles, as the LMO can return an infinite atom. We propose to isolate the optimization over this set separately. In particular, suppose, as in Assumption 4 below, that $\bar{\mathcal P}_0 = \mathcal P_0 \cup K$ where $\mathcal P_0$ is a finite set, and thus defining $\mathcal P$ as the convex hull of $\mathcal P_0$ ensures that $\mathcal P$ is compact. Then we rewrite (20) as
$$\min_{x\in\operatorname{cone}(\mathcal P),\; y\in K}\; f(x+y) + \phi(\kappa_{\mathcal P}(x)),$$
where $\operatorname{cone}(\mathcal P) := \{\alpha x : \alpha\in\mathbb{R}_+,\, x\in\mathcal P\}$ is the conic hull of $\mathcal P$. At each iteration, $x$ takes a conditional gradient step, and $y$ is updated through a full minimization. (In the case of the TV norm, this simply means that the LMO is applied to a de-meaned $\bar x = x - \frac{1}{d}(x^T\mathbf 1)\mathbf 1$.) Since the portion of the solution in $K$ is minimized exactly at each step, from this point on we only consider the support recovery properties for recovering the atoms in $\mathcal P_0$.
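The band-aid can be sketched as a thin wrapper around any bounded LMO (our own illustration; the pass-through "LMO" in the usage line is a hypothetical stand-in for a real atom search):

```python
import numpy as np

def demeaned_lmo(z, lmo):
    """Sketch of the recession-cone band-aid for the TV penalty: remove the
    component along the constant (all-ones) recession direction before the
    bounded LMO over conv(P_0). The mean component is then handled by an
    exact minimization over the cone K in a separate step."""
    z0 = z - z.mean()    # project out the all-ones direction
    return lmo(z0)

# With a pass-through "LMO" (hypothetical), the output is exactly mean-free.
out = demeaned_lmo(np.array([1.0, 2.0, 3.0]), lambda u: u)
print(out)   # [-1.  0.  1.]
```

Because the de-meaned input has no component along the recession direction, the wrapped LMO never sees the unbounded direction and always returns a finite atom.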
Assumption 4 (Atomic set conditions). $\bar{\mathcal P}_0 = \mathcal P_0 \cup K$, where $\mathcal P_0$ is a finite set of atoms and $K$ is the recession cone; moreover, $\mathcal P_0 \cap K = \emptyset$. We denote $\mathcal P = \operatorname{conv}(\mathcal P_0)$.

Generalized smoothness
To ensure the uniqueness of dsupp P (−∇f (x * )) and to give a useful gap bound, we again need a notion of smoothness on f . We again use our unusual twist on the gauge penalty.

Definition 19. A function $f$ is $L$-smooth with respect to the gauge $\kappa_{\mathcal P}$ if for all $x, y$,
$$f(y) \le f(x) + \nabla f(x)^T(y-x) + \frac{L}{2}\,\kappa_{\mathcal P}(y-x)^2. \tag{25}$$
The purpose of this generalized notion is that sometimes, given the data, tighter bounds can be computed [54]. It is similar in spirit to the notion of relative smoothness [3,47], which facilitates the analysis of generalized proximal gradient methods, where the squared 2-norm proximity measure is replaced by a Bregman divergence. For CGM, it is more computationally natural to consider generalized gauges as the penalty generalization, which we incorporate into the generalized smoothness definition. Additionally, the subadditivity property of gauges assists with bounding the iterates, a crucial step in the convergence proof.

Example 20 (Quadratic function). Suppose that $f(x) = \frac12\|Ax - b\|_2^2$. While generic norm bounds would give $d^2L_1 \ge dL_2 \ge L_\infty$, the actual values in $A$ might lead to tighter inequalities.
The relationship to usual smoothness is as follows. Suppose that $f$ is $L_2$-smooth in the usual sense (with respect to $\|\cdot\|_2$). Then since $\operatorname{diam}(\mathcal P)\,\kappa_{\mathcal P}(x) \ge \|x\|_2$, it follows that $L \le \operatorname{diam}(\mathcal P)L_2$. In this way, we refine the analysis of CGM by absorbing the usual "set size" term into $L$, which in certain cases may be smaller than $\operatorname{diam}(\mathcal P)L_2$.

Proposition 21 (Uniqueness of gradient). If (25) holds and $0 \in \operatorname{int}\mathcal P$, then $\nabla f(x^*)$ is unique at the optimum.

The same logical argument as before applies, as "smoothness" in the primal corresponds to "strong convexity" (w.r.t. $\|\cdot\|_\infty$) in the dual.

Generalized support recovery
Given a solution to (21), define the decomposition of $x$ with respect to $\mathcal P_0$ as tuples $(c_p, p)$, extracted via the mapping $\operatorname{coeff}_{\mathcal P}(x, p) = c_p$. The support of $x$ with respect to $\mathcal P_0$ is the set of atoms with nonzero coefficients, $\operatorname{supp}_{\mathcal P}(x) = \{p\in\mathcal P_0 : c_p > 0\}$. For general $\mathcal P$, neither the decomposition nor the support of $x$ is unique. As before, we say that support recovery is achieved if one such support $\operatorname{supp}_{\mathcal P}(x^*)$ of the limiting point $x^{(t)} \to x^* \in X^*$ is revealed. The reduction to the support definition in the previous section occurs when $\mathcal P_0 = \{\pm e_1, \dots, \pm e_d\}$, the signed standard basis. Then $\operatorname{supp}_{\mathcal P}(x)$ is unique, and explicitly $\operatorname{supp}_{\mathcal P}(x) = \{\operatorname{sign}(x_i)\, e_i : x_i \ne 0\}$.

Proposition 22 (Support optimality condition). Consider the general convex sparse optimization problem (20), where $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a monotonically nondecreasing function. Then for any $x^*$ a minimizer of (20), every atom in the support of $x^*$ attains the maximum in $\sigma_{\mathcal P}(-\nabla f(x^*))$.

This is the gauge equivalent of "nonzero primal gives maximal dual", and is referred to in [27] as alignment. We now generalize the definition of dual support from (17):
$$\operatorname{dsupp}_{\mathcal P}(z) = \{p\in\mathcal P_0 : z^T p = \sigma_{\mathcal P}(z)\},$$
and Proposition 22 says that $\operatorname{supp}_{\mathcal P}(x^*) \subseteq \operatorname{dsupp}_{\mathcal P}(-\nabla f(x^*))$. Finally, as in the previous section, we express the distance to degeneracy as $\delta_{\min}(x^*)$, measured over any support of $x^*$. In particular, $\delta_{\min}(x^*) = 0$ means the problem is degenerate.

Duality and gap
For $\phi$ monotonically nondecreasing, the convex function $h(x) = \phi(\kappa_{\mathcal P}(x))$ has conjugate $h^*(z) = \phi^*(\sigma_{\mathcal P}(z))$. This gives the primal-dual pair
$$\min_{x,\; y\in K}\; f(x+y) + \phi(\kappa_{\mathcal P}(x)), \qquad \text{(P-convex)}$$
$$\max_z\; -f^*(z) - \phi^*(\sigma_{\mathcal P}(z)) - \iota_{K^\circ}(z), \qquad \text{(D-convex)}$$
where $K^\circ$ is the polar cone of $K$. The duality gap between (P-convex) and (D-convex) is the difference of the two objectives, where $\iota_{K^\circ}(z) = +\infty$ if $z$ is not dual-feasible, and $0$ otherwise.

Lemma 23 (Feasible gradient). Take
Proof. The first part follows from the chain rule. For the second, by the optimality condition, z lies in the normal cone of K at y. Since 0 ∈ K, this implies z^T y ≤ 0, which means z ∈ K°.

From Lemma 23, the LMO step acquires s, defined with respect to z := −∇_x f(x + y). Additionally, since the Fenchel-Young inequality holds with equality at the gradient, f(x) + f*(∇f(x)) = ∇f(x)^T x, and thus we can simplify the gap to an online-computable quantity.
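For concreteness, consider the nonzero-sparsity case P_0 = {±e_1, ..., ±e_d} (a sketch of ours, not code from the paper): the LMO reduces to one pass over z, selecting the signed coordinate of largest magnitude, and the support function is σ_P(z) = ‖z‖_∞.

```python
import numpy as np

def lmo_l1(z):
    """LMO over the signed standard basis P0 = {+e_i, -e_i}:
    returns s = argmax_{p in P0} <z, p> and sigma_P(z) = ||z||_inf."""
    i = int(np.argmax(np.abs(z)))   # coordinate with the largest |z_i|
    s = np.zeros_like(z)
    s[i] = np.sign(z[i])            # pick +e_i or -e_i to align with z
    return s, float(np.abs(z[i]))

s, sigma = lmo_l1(np.array([0.5, -2.0, 1.0]))
# s = (0, -1, 0) and sigma = 2.0: the atom aligned with the largest-magnitude entry
```

This is why the per-iteration cost of CGM is so low in the sparse case: the LMO is a single linear scan, with no projection or proximal step.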

Proposition 24 (Gap bounds gradient error). Given a primal feasible x, denote the optimal variable as x*. Furthermore, denote y = argmin_{y′∈K} f(x + y′) and y* = argmin_{y′∈K} f(x* + y′). Then the duality gap bounds the gradient error.

Proof. Since the conjugate of h(x) = φ(κ_P(x)) is h*(z) = φ*(σ_P(z)), then, denoting z = −∇f(x + y), the bound follows because K and K° are polar cones and thus y^T ∇f(x) ≤ 0. Next, recognizing that h(x) = φ(κ_P(x)) is convex, we pick −∇f(x* + y*) ∈ ∂h(x*) and use convexity to further reduce to the result.

Theorem 25 (Support identification of screened P-CGM). Given Assumptions 1, 2, 4, and 5, the screening rule for convex penalties is safe and convergent once t is large enough that the gap condition holds, which happens at a rate t = O(1/δ_min²).

Proof. This is a direct consequence of Theorems 33 and 35.
Note that Theorem 25 imposes no conditions on the sequence θ^(k), or on the choice of φ, f, etc., except L-smoothness of f. In other words, for any method where the gap is easily computable and its convergence rate is known, a corresponding screening rule and support identification rate automatically follow. Additionally, computing L may be challenging, depending on κ_P; as shown previously, at the very least it may require a full pass over the data. However, this is a one-time calculation per dataset, and it can be estimated if the data are assumed to be drawn from specific distributions (as in sensing applications).
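To make the online-computable gap concrete, here is a small instantiation of ours (not the paper's code): take φ(ξ) = ρξ²/2, whose conjugate is φ*(ζ) = ζ²/(2ρ), and the ℓ1 gauge κ_P = ‖·‖_1 with dual σ_P = ‖·‖_∞. Using the identity f(x) + f*(∇f(x)) = ∇f(x)^T x, the gap at x with z = −∇f(x) simplifies to h(x) + h*(z) − z^T x:

```python
import numpy as np

rho = 0.5  # penalty weight in phi(xi) = rho * xi**2 / 2

def h(x):
    # h(x) = phi(kappa_P(x)) with kappa_P = l1 norm
    return 0.5 * rho * np.abs(x).sum() ** 2

def h_star(z):
    # h*(z) = phi*(sigma_P(z)) with sigma_P = l-infinity norm
    return np.max(np.abs(z)) ** 2 / (2.0 * rho)

def online_gap(x, grad):
    # online-computable duality gap at x, given grad = grad f(x)
    z = -grad
    return h(x) + h_star(z) - z @ x

g = online_gap(np.zeros(2), np.array([1.0, -2.0]))
# at x = 0 the gap reduces to h*(z) = ||grad||_inf**2 / (2*rho) = 4.0
```

Every quantity here is available from one gradient evaluation, which is what makes gap-based screening nearly free inside the CGM loop.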

Invariance
One appealing feature of the CGM is that the iteration scheme and analysis can be done in a way that is invariant to both linear scaling and translation. However, when the gauge function is not used as an indicator, this translation invariance vanishes; in general, κ_P(x) ≠ κ_{P+{b}}(x + b). Therefore the generalized problem formulation (32) is only linearly (not translation) invariant.
Proposition 27 (Invariance properties). Define Q = AP, and f (x) = g(Ax). Define w = Ax where A has full column rank. Then, using (22) and chain rule, the following hold

RP-CGM for general nonconvex sparse optimization
Finally, we consider the complete RP-CGM, which extends the method presented in Section 2 to generalized gauge penalties. The fully generalized optimization problem is given in (32). By imposing the concave transformation on the coefficients c_p, we obtain the same effect that the nonconvex regularizer had on the ℓ1 norm in Section 2. Much of the analysis closely parallels that of Section 2, especially in the proofs of key results, which we therefore defer to the appendix to avoid repetition. We also reuse the same assumptions (1, 2, 3) and analyses for the scalar functions γ and φ.
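For intuition (an illustration of ours; exact parameterizations of these penalties vary across references), two standard choices of the concave, monotonically increasing transformation γ are the log-sum penalty (LSP) and the ℓ_p "norm" transformation for 0 < p < 1:

```python
import numpy as np

# Two standard concave, monotonically increasing transformations gamma: R+ -> R+.
def gamma_lsp(c, eps=0.1):
    # log-sum penalty (LSP)
    return np.log(1.0 + c / eps)

def gamma_lp(c, p=0.5):
    # l_p "norm" transformation for 0 < p < 1
    return c ** p

c = np.linspace(0.0, 5.0, 200)
for g in (gamma_lsp, gamma_lp):
    v = g(c)
    assert np.all(np.diff(v) > 0)      # monotonically increasing
    assert np.all(np.diff(v, 2) < 0)   # concave: second differences negative
```

Both transformations grow quickly near zero and flatten for large coefficients, which is what rewards sparse supports more aggressively than a convex gauge.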

Lemma 28 (Smoothness equivalences). Suppose that f is L-smooth with respect to P. Then the following also hold:

1. Co-coercivity (33): (y − x)^T (∇f(y) − ∇f(x)) ≤ (1/(2L)) (σ_P(∇f(x) − ∇f(y))² + σ_P(∇f(y) − ∇f(x))²).

2. Strongly convex conjugate (34).

The proof is in Appendix A.
Lemma 29 (Uniqueness of gradient). Suppose Assumption 4 holds. If (25) holds, then the gradient at the global optimum is unique.

Proof. Assume that f(x) = f(x*) for some x ≠ x*, with x feasible. Then the optimality conditions imply that σ_P(∇f(x) − ∇f(x*)) = 0. This means that the vector w = ∇f(x) − ∇f(x*) cannot have any component in cone(K°), i.e. it is orthogonal to any z ∈ K.

Support recovery
As it was for κ_P, the domain of r_P is cone(P). However, the supports induced by κ_P(x) and r_P(x) are often not equivalent.
Example 30 (Different optimal support). Consider κ_P(x) = ‖x‖_1 and the corresponding nonconvex gauge r_P. The problem has optimal solution x* = (1/2, 0, 1/2), which we verify from the normal cone condition. Note that r_P(x*) = 1 as well. However, taking x = (0, √2, 0) also yields r_P(x) = 1, and has a lower objective value.

Example 31 (Different gauge support). The situation can be made even worse, in that the support of x with respect to r_P may not even intersect that with respect to κ_P. Consider x = (6, 6). Then, taking γ(c) = √c, we have two options for the decomposition.

In other words, the support supp_P(x) as defined in (26) may not be the support created by the nonconvex gauge r_P(x), which is often sparser. More generally, r_P does not act merely as a concave transformation on the weights c_p in κ_P; even the atoms themselves may be selected differently. However, it is worth noting that this scenario does not arise for the ℓ1 norm or the TV norm, which have unique and consistent supports across choices of monotonically increasing γ.
Overall, the question of the nonunique support of a given vector x over the atoms P_0 is an interesting one, but it is not the focus of this paper, which restricts attention to cases where the support is always unique.

Stationary points
We can rewrite (32) as a combined optimization problem over the coefficients c_p, p ∈ P_0. The stationary points of (35) are the x satisfying the corresponding conditions. Our goal is to find a support of such a stationary point x*. Given γ smooth everywhere except at 0, note the close similarity between this and the support optimality conditions for convex gauges. Here, the wiggle-room condition looks asymmetric, but note that if both p and −p are in P_0, then c_p = c_{−p} = 0 implies −p^T ∇f(x*) ∈ α · [−γ_max, γ_max], recovering the symmetric condition from Section 2. As before, since γ′ is a decreasing function, a nonzero coefficient of x* does not imply a maximal gradient inner product.

RP-CGM
In the case that P_0 includes directions of recession, we treat them separately by writing P_0 = P̄_0 ∪ K, where P̄_0 contains the important (finite-sized) atoms and K contains the directions of recession. We define the reweighted atomic set for a given reference point u as P_0(u) = { (1/γ′(coeff_P(u, p))) p : p ∈ P_0 }, with P(u) = conv(P_0(u)).
Then r_P(s; u) = κ_{P(u)}(s), with the corresponding reweighted support function. At each iteration, we take a penalized conditional gradient step toward solving the reweighted gauge optimization problem

(P-general)    minimize_{x, y∈K}  f(x + y) + φ(r_0 + κ_{P(x)}(x)),

together with its dual. A description of the most generalized version of the reweighted method is given in Algorithm 2.
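The reweighting step can be sketched in the nonzero-sparsity case (our own illustration; we take γ(c) = log(1 + c/ε), for which γ′(c) = 1/(ε + c) and hence 1/γ′(c) = ε + c). Atoms aligned with currently-large coordinates of the reference point are inflated, making them cheaper to select again:

```python
import numpy as np

def reweighted_atoms(u, eps=0.1):
    """Reweighted atomic set P0(u) for P0 = {+-e_i} and gamma(c) = log(1 + c/eps):
    gamma'(c) = 1/(eps + c), so each signed basis atom p = +-e_i is scaled by
    1/gamma'(coeff_P(u, p)) = eps + |u_i|."""
    scale = eps + np.abs(u)   # 1/gamma'(c_p), one value per coordinate
    atoms = []
    for i in range(u.size):
        for sign in (1.0, -1.0):
            p = np.zeros_like(u)
            p[i] = sign * scale[i]
            atoms.append(p)
    return atoms

atoms = reweighted_atoms(np.array([2.0, 0.0]))
# the atoms along e_1 are stretched to length ~2.1; those along e_2 shrink to ~0.1
```

Because the reweighting is a per-atom rescaling, the reweighted LMO has the same cost as the unweighted one.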

Convergence
Proposition 32 (Residual). Denoting gap P (x; x) the gap at x with reference x, then The proof follows closely that of Proposition 8; see Appendix A for full details.
The details of the proof closely mirror steps in previous works, and thus we give the explicit details in Appendix B.
Let us compare Theorem 33 with the usual rates for CGM. In [39], the primal convergence rate for vanilla CGM (with noiseless gradients) is given as 2C_f/(t + 2), where C_f is a curvature constant that depends on the conditioning of f and the size of P. These quantities appear here in the form of the conditioning of f (quadratic in L/μ) and, implicitly, σ_P (which grows proportionally with diam P). The new quantities ν_0, γ_min, and γ_max account for the penalty and nonconvex generalizations.

Invariance
Finally, we investigate the linear invariance properties of RP-CGM. Specifically, we consider Q = AP, f(x) = g(Ax), w = Ax, and w′ = Ax′, where A has full column rank. Linear invariance is preserved if RP-CGM applied to min_x {f(x) : x ∈ P} and to min_w {g(w) : w ∈ Q} are equivalent. Assume additionally that both x, x′ ∈ cone(P). Then the following hold.

Penalty. r_P(x) = r_Q(w). This follows from noting that the decompositions correspond, and in fact the coefficients are equal (c_q = c_{Ap}).

Stationarity. We construct the matrix P with columns containing the atoms in P_0, and c such that x = Pc, w = Ax = APc. Additionally, for any stationary point x*, if ∇f(x*) ∉ cone(P), then there exists a descent direction unaffected by the penalty r_P(x), and thus it must be that ∇f(x*) ∈ cone(P). By the same token, A^T ∇g(w*) ∈ cone(P). Therefore, the stationarity conditions are equivalent for w* = Ax*. Additionally, it can be shown through the chain rule that AP(x) = Q(w) and res_P(x) = res_Q(w). Overall, this shows that the steps and analysis of RP-CGM are invariant to linear transformations of x.

Screening
We now describe the gradient error measured in terms of this "dual gauge", where the symmetrization P̄ := P ∪ (−P) ensures that σ_P̄(z − z*) = σ_P̄(z* − z), bounding errors in both directions.
Proposition 34 (Gap bound on gradient error). Denote by D(x) = r_P(x) − r_P(x*) + r_P(x; x) − r_P(x*; x) the linearization error at x. Denoting x* a stationary point of (32) and y(x) = argmin_{y′∈K} f(x + y′), the gap bounds the gradient error. The linearization error satisfies D(x) = 0 when the regularizer is convex. The proof is similar to that of Proposition 11, and is detailed in Appendix C.

Theorem 35 (Dual screening).
For any x and some choice of ε > 0, define the screened set. Then, given Assumptions 1, 2, and 5, if the screening condition holds, then p ∈ supp_P(x*), where x* is the optimal variable in (20).

In the convex case, D(x) = 0, and thus we pick ε = 0 in our screening rule. In this scenario, not only does this screening rule achieve finite-iteration support identification, but the identification time t depends directly on δ_min.
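To illustrate how such a rule might look in practice, here is a hypothetical instantiation of ours for the ℓ1 case (the slack constant is illustrative and is not the paper's exact expression): a coordinate is declared zero at the optimum when its gradient magnitude falls short of σ_P(z) = ‖z‖_∞ by more than a gap-dependent slack plus ε.

```python
import numpy as np

def screen_l1(z, gap, L, eps=0.0):
    """Hypothetical gap-based screening sketch for P0 = {+-e_i}.
    Coordinate i is screened out (declared zero at the optimum) when |z_i|
    falls short of sigma_P(z) = ||z||_inf by more than sqrt(2*L*gap) + eps.
    The slack constant is illustrative only."""
    slack = np.sqrt(2.0 * L * gap) + eps
    return np.abs(z) < np.max(np.abs(z)) - slack   # True = screened out

screened = screen_l1(np.array([3.0, 0.1, 2.99]), gap=1e-4, L=1.0)
# with a tiny gap, only the clearly sub-maximal middle coordinate is screened
```

As the gap shrinks along the iterations, the slack shrinks with it, so more coordinates are safely eliminated; with ε > 0 the rule becomes more aggressive, matching the nonconvex behavior described above.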

Experiments
In this section, we explore the convergence behavior and screening ability of P-CGM and RP-CGM on compressed sensing (with ℓ1, group norm, and TV regularization), and on a sparse logistic regression task on a real-world dataset. The code for all the experiments is publicly available.

Sensing experiment
We first compare the various CGM variants on a simple simulated sparse sensing problem (Figures 2 and 3). We solve a least squares problem where A ∈ R^{m×n} with A_ij ∼ N(0, 1/n) i.i.d. for i = 1, ..., m, j = 1, ..., n, and b = Ax_0 for a given x_0 with 10% nonzero sparsity. Specifically, we pick m = n = 100, where perfect sensing is possible, and either sweep or tune all the hyperparameters to investigate each case. An important modification needed to improve the stability of P-CGM and RP-CGM is to dampen the step size; in particular, θ^(t) = 2/(2 + t) is too aggressive, so instead we use θ^(t) = 2/(2 + t + t_0), where t_0 is another tuned hyperparameter. In practice, this does not slow down the convergence or sensing abilities of P-CGM and RP-CGM, suggesting that it is a more appropriate step size sequence in these regimes (and is still O(1/t)). All hyperparameters (α, ρ, t_0) were tuned to present the best results for each individual method.

These two collections of figures illustrate several points. The gaps (left column) in all cases converge to 0 or machine precision at about an O(1/t) rate. The screen error (right column), measured as the support difference between x^(t) and the final converged point x*, eventually goes to 0, at a speed somewhat correlated with the "aggressiveness" of the method (RP-CGM is often more aggressive than P-CGM, but all three variants also depend heavily on the choice of hyperparameters). Note that higher ρ, smaller θ, and smaller α all correspond to more aggressive methods. In contrast, the support error, measured as the support difference between x^(t) and the ground truth x_0, tends to be better when the method is less aggressive. It is hard to draw sweeping conclusions, but this suggests that both metrics are essential for evaluating the success of sparse recovery methods.
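A minimal reproduction sketch of the sensing setup (our own simplification: we run the constrained vanilla CGM over a τ-scaled ℓ1 ball rather than the penalized variant, with an invented seed, but with the dampened step size θ^(t) = 2/(2 + t + t_0) from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 100
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))   # A_ij ~ N(0, 1/n) i.i.d.
x0 = np.zeros(n)
idx = rng.choice(n, size=n // 10, replace=False)     # 10% nonzero sparsity
x0[idx] = rng.normal(size=idx.size)
b = A @ x0

def cgm_l1_ball(A, b, tau, T=500, t0=100):
    """Vanilla CGM for min 0.5*||Ax - b||^2 over the tau-scaled l1 ball,
    using the dampened step size theta_t = 2/(2 + t + t0)."""
    x = np.zeros(A.shape[1])
    for t in range(T):
        z = -A.T @ (A @ x - b)                        # z = -grad f(x)
        i = int(np.argmax(np.abs(z)))
        s = np.zeros_like(x)
        s[i] = tau * np.sign(z[i])                    # LMO atom of the l1 ball
        theta = 2.0 / (2.0 + t + t0)
        x = (1.0 - theta) * x + theta * s             # convex-combination update
    return x

tau = np.abs(x0).sum()
x = cgm_l1_ball(A, b, tau)
# iterates stay inside the l1 ball and the objective decreases from its initial value
```

Because each update is a convex combination of points in the ball, feasibility is maintained for free, which is the property the penalized variants must recover through the curvature condition on φ.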

Other gauges
We now pursue the sensing problem for more creative choices of P 0 .
First, we consider the group norm in the case where x_0 has a "pulse-like" structure: the signal has blocks of nonzero activity separated by long spans of zero activity. This can be modeled as x_0 = Σ_i s_0^i, where s_0^i is a pulse signal across the ith overlapping window. Figure 4 shows the trajectory of such an experiment, where the complementary relationship between the primal variable and the dual norms is visible: over time, the nonzero blocks of the primal correspond to the maximal blocks of the dual.
Next, we consider total variation penalization, where κ_P(x) = Σ_i |x_i − x_{i−1}|, and we plot the cumulative sum of the demeaned z^(t) = −∇f(x^(t)). Note that at optimality, the peaks of this dual atom exactly match the "flip points" of x^(t).

(Figure captions.) Figure 4: RP-CGM with θ = 1/2, ρ = 0.01, p = 2, and t_0 = 100 for stability; the ground truth x_0 contains 3 "pulses", i.e. areas where it is nonzero, and the goal is to fit x^(t) to x_0 using this group structure prior. Figure 5: RP-CGM with θ = 3/4, ρ = 0.25, p = 2, and t_0 = 100 for stability; the ground truth x_0 contains 3 flips and is otherwise smooth, and as before the goal is to fit x^(t) to x_0 using this structure prior.
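The TV gauge and the plotted dual quantity are simple to compute (a sketch of ours):

```python
import numpy as np

def tv_gauge(x):
    # kappa_P(x) = sum_i |x_i - x_{i-1}|, the total variation of the signal
    return float(np.abs(np.diff(x)).sum())

def dual_atom_trace(z):
    # cumulative sum of the demeaned z = -grad f(x); at optimality its peaks
    # align with the "flip points" of the primal iterate
    return np.cumsum(z - z.mean())

tv = tv_gauge(np.array([0.0, 0.0, 1.0, 1.0, 0.0]))
# one step up and one step down: total variation 2.0
```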

Dorothea experiment
Finally, we consider a "real-world" experiment, in which we use these methods to classify the Dorothea dataset [35]. Sparse optimization is essential in this application, which has only 1950 samples but 100,000 attributes. Additionally, the dataset is heavily imbalanced, with very few positive labels. We run sparse logistic regression over this dataset and illustrate the performance of the different methods in Figure 6. Note that the best implementation reaches a test F1 score of about 0.3; without regularization, logistic regression achieves a test F1 score of about 0.16, highlighting the importance of sparse regularization.

Discussion
This work considers two variations of the conditional gradient method (CGM): the P-CGM, which accommodates gauge-based penalties in place of constraints, and the RP-CGM, which allows concave transformations of the gauges. The gauges may be induced by compact sets, but also accommodate "simple" directions of recession. We give a convergence rate to a stationary point, and propose a gradient screening rule and support recovery guarantee. Compared with proximal methods, these CGM-based methods often have a much cheaper per-iteration cost; e.g. for the group norm, computing the LMO (without reweighting) is trivial compared to even computing the gauge function itself. Additionally, the almost-for-free computation of the gap and residual quantities makes screening a very small computational addition.
The key challenge in showing the convergence of these methods is controlling the size of each s^(t). This was trivial in the CGM case, where s^(t) was constrained to a compact set; when the constraint is transformed into a penalty, we require a minimum amount of curvature of φ as ξ → +∞, and we restrict γ to having strict concavity only over a finite support. However, as shown in the numerical results, these restrictions do not greatly inhibit the sparsifying effects of the penalty functions.
After determining convergence behavior, we implement gap-based screening, which reveals the true solution's sparsity pattern before optimization completes. This is a deliberate tool to reduce computational cost, and can be used in a number of ways. For nonzero sparsity or group sparsity, we can simply avoid computation over the "determined zeros". For problems where the solution is very sparse, a two-stage solving technique can be used: after enough zero components have been screened away, the problem is solved over the reduced support using a more powerful (e.g. second-order) method. And for problems with a very large number of atoms that must be explicitly queried at each iteration (e.g. in submodular optimization), we can significantly reduce the search space. We therefore believe these techniques have practical benefits in a number of applications.
Finally, we do not incorporate away steps [34, 42]. In implementation, they are somewhat orthogonal to the extensions provided in this work; an away-step variant of P-CGM and RP-CGM can be implemented directly, and its analysis is a subject for future work.

Proof of Lemma 28
Proof. The proof largely follows from [52], mildly adapted.
First we prove (25) ⇒ (33). Construct g(x) = f(x) − x^T ∇f(y), which is convex, also L-smooth, and has its minimum at x = y. Then, for any w, the bound (a) holds since g is L-smooth and convex. Plugging in the construction for g and applying the last inequality twice gives (y − x)^T (∇f(y) − ∇f(x)) ≤ (1/(2L)) (σ_P(∇f(x) − ∇f(y))² + σ_P(∇f(y) − ∇f(x))²). Next we prove (25) ⇒ (34). Using the same g as before, by the optimality conditions, picking w = z − y, the bound follows; plugging in f gives the result.
Proof of Proposition 22

Proof. Since φ is monotonically nondecreasing over R_+, we have α ≥ 0. If α = 0, then ∇f(x*) = 0 and both results are trivially true. Now consider α > 0. Noting that κ_P = σ_{P°}, where P° is the polar set of P, this proves (27). Now take the conic decomposition x* = Σ_{p∈P_0} c_p p with c_p ≥ 0; the resulting inequality holds with equality if and only if p^T z* = σ_P(z*) whenever c_p > 0, proving (28).

Proof of Proposition 32
Proof. Denote y = argmin y f (x + y), and z = −∇f (x + y), and plug in κ P(x) (x) = r P (x; x). Then where (a) uses the Fenchel-Young inequality on f and f * , (b) uses the Fenchel-Young inequality on φ and φ * , (c) follows since −∇f (x + y) ∈ K • and y ∈ K, and thus y T z ≥ 0, and (d) follows from the definition of σ P(x) .
Tightness of (b) occurs iff the Fenchel-Young inequality is satisfied with equality, i.e.
Putting it all together gives the desired result.
where ∆^(t) is defined as an averaging over square roots, i.e.
which satisfies the inductive step.
The following is a generalized and modified version of a proof segment from [39], which will be used for proving O(1/t) gap convergence.

Lemma 39. Pick some 0 < T_2 < T_1. Then, under the stated conditions, the bound holds for all k > T_1.

Proof. Using the integral rule, this yields c(k).

Lemma 40 (Generalized non-monotonic gap bound). Given a bound of the form G_2/(t + D) for some G_2 and D, and ∆^(t+1) − (1 + αθ^(t)) ∆^(t) ≤ −θ^(t) res(x^(t)) + (θ^(t))² G_3 for some G_3, we have min_{i≤t} res(x^(i)) ≤ G_4/(t + D).
Proof. Assume that for all i ≤ t, gap^(i) > G_4/(t + D). Then, telescoping and picking C_1 = G_1, C_2 = αG_1G_2 + G_3G_2², and C_3 = G_2G_4, invoking Lemma 39 yields ∆^(t+1) < 0, which is impossible. Therefore, the assumption must be false.
Piecing everything in this section together gives Theorem 33 (the main convergence theorem).

C Screening proofs from Section 4
This inequality is quadratic in σ P (z * − z), which leads to the bound