Open Journal of Mathematical Optimization

Short Paper - A note on the Frank–Wolfe algorithm for a class of nonconvex and nonsmooth optimization problems


Introduction
The conditional gradient method [11], also known as the Frank–Wolfe (FW) algorithm, is one of the simplest and oldest iterative methods for minimizing a (sufficiently) smooth function over a convex and compact set. Despite its modest convergence rate of O(1/√k) for nonconvex objectives [18], the algorithm is particularly attractive when minimizing linear functions over the feasible set is computationally cheap. That is the case for several modern large-scale optimization problems from machine learning and data science, which have revitalized interest in and research on first-order optimization methods. We refer the reader to [1,6,12,16,18] and references therein for recent developments on the FW algorithm.
In the convex setting, it is well known that the FW algorithm may fail to converge to a stationary point if the objective function is nonsmooth [15,25]. Approaches for coping with this shortcoming consist of (a) approximating the objective function by a smooth one, obtained via standard smoothing techniques such as the Moreau–Yosida regularization [24] or randomized rules employing probability densities [10]; (b) considering (when possible) an epigraphical reformulation, which adds a new variable and moves the source of nonsmoothness to the constraints [9]; (c) imposing assumptions stronger than a simple oracle providing an arbitrary subgradient, so that a descent direction can be computed by solving a more complex subproblem per iteration [4,19,25]. Furthermore, when the objective is the sum of two convex functions h and c, with h smooth and c having a simple structure (e.g., piecewise linear), the FW algorithm in its generalized form given in [14] (and revisited in many recent publications) is convergent at the cost of solving a more involved subproblem per iteration.
Without relying on smoothing techniques, particular choices of subgradients, restrictive oracles, or other reformulation tricks that apply only in special cases, we show that the classic FW algorithm computes Clarke-stationary points for the broad class of nonconvex and nonsmooth problems of the form

  min_{x∈X} f(x),   (1)

where X ≠ ∅ is a convex and compact subset of an open set O ⊂ ℜ^n, and f : O → ℜ can be expressed, over X, as a minimum of a compactly parametrized family of α-Hölder smooth functions (see Definition 1 below). Such a class of functions is denoted by upper-C^{1,α}, as −f is lower-C^{1,α}, a family of functions introduced in [7].

Contributions and organization
Despite the nonsmoothness of the objective function, we show that under standard rules for defining stepsizes, the classic FW algorithm applied to (1) computes Clarke-stationary points and possesses a convergence rate of O(1/k^{α/(α+1)}), matching the one known for the smooth (but nonconvex) setting (take α = 1 and compare with [18, Table 2]). Furthermore, for applications in which f is the pointwise minimum of finitely many α-Hölder smooth functions F_i : O → ℜ (i = 1, . . . , q), we propose a new variant of the FW algorithm with stronger stationarity guarantees, namely directional stationarity (d-stationarity). The latter is the sharpest kind among the various stationarity concepts in nonsmooth and nonconvex optimization. Finally, we show that the concept of d-stationarity is equivalent to local optimality when all functions F_i are convex.
The remainder of this work is organized as follows. First, Section 2 recalls some basic definitions and provides implementable formulations of two stationarity conditions. Next, the FW algorithm is revisited in Section 3, together with its convergence analysis for the setting under consideration. The new variant of the algorithm, able to compute directionally stationary points, is presented in Section 4.

Notation
Throughout this work, O is an open set of ℜ^n and ∅ ≠ X ⊂ O is a convex and compact set. Given a point x ∈ O, we denote by V_x ⊂ O an open neighborhood of x, that is, the set V_x := {y ∈ O : ∥y − x∥ < δ} for some δ > 0, where ∥·∥ is the Euclidean norm. We denote by N_X(x) the normal cone to the set X at the point x, that is, N_X(x) := {g ∈ ℜ^n : ⟨g, y − x⟩ ≤ 0 for all y ∈ X} if x ∈ X and N_X(x) := ∅ otherwise, and by i_X the indicator function of X, which takes the value 0 on X and +∞ otherwise. The notation α is reserved for scalars in [0, 1].

Main definitions, subdifferentiability and stationarity
In this section we present some key definitions and stationarity conditions.
▶ Definition 1 (α-Hölder smooth and upper-C^{1,α} functions). A differentiable function F : O → ℜ is α-Hölder smooth with constant L > 0 if ∥∇F(x) − ∇F(y)∥ ≤ L ∥x − y∥^α for all x, y ∈ O. The function f : O → ℜ is upper-C^{1,α} on X if, for every x̄ ∈ X, there exist a neighborhood V_x̄ and a local representation

  f(y) = min_{u∈U} F(y, u) for all y ∈ V_x̄,   (2)

where U is a compact set, the functions F(·, u) are α-Hölder smooth with a constant L > 0 independent of u, and F and ∇_y F are jointly continuous.

Under the compactness assumption on X, the local representation (2) can indeed be extended to a common representation (the same function F and the same compact set U for all points of X):

  f(x) = min_{u∈U} F(x, u) for all x ∈ X.   (3)

From now on we only consider such a representation for f. Let f : O → ℜ be upper-C^{1,α}. It follows from the definition that f is locally Lipschitz (see [22, Thm. 10.31] for the Lipschitz constant). Therefore, the Clarke directional derivative

  f°(x; d) := limsup_{y→x, t↓0} [f(y + t d) − f(y)]/t

is well defined. Such a mathematical concept permits us to define the Clarke subdifferential of f at x,

  ∂_C f(x) := {g ∈ ℜ^n : ⟨g, d⟩ ≤ f°(x; d) for all d ∈ ℜ^n},

which is a nonempty, convex and compact subset of ℜ^n. Given this structure, Theorem 7.3 of [21] asserts that

  ∂_C f(x) = co{∇_x F(x, u) : u ∈ I(x)}, with I(x) := {u ∈ U : F(x, u) = f(x)},   (4)

and thus f°(x; d) = max_{g∈∂_C f(x)} ⟨g, d⟩. The class of LC^{1,α} (lower-C^{1,α}) functions is Clarke regular (because LC^1 is; [20, Thm. 1]), that is, the Clarke directional derivative coincides with the ordinary directional derivative from convex analysis, f'(x; d) := lim_{t↓0} [f(x + t d) − f(x)]/t. Unfortunately, this is not the case for UC^{1,α} (upper-C^{1,α}) functions, as we may have the strict inequality f'(x; d) < f°(x; d).

▶ Proposition 2. Let f be upper-C^{1,α} with the common representation (3). Then f is directionally differentiable on X.

Assuming the common representation (3), Proposition 2 and (4) yield that, for all x ∈ X and all d ∈ ℜ^n,

  f°(x; d) = max_{u∈I(x)} ⟨∇_x F(x, u), d⟩   (5)

and

  f'(x; d) = min_{u∈I(x)} ⟨∇_x F(x, u), d⟩.   (6)
Non-regularity of UC^{1,α} functions is now evident: compare formulae (5) and (6).
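For a concrete illustration of this gap (an illustration of ours, distinct from Example 3 considered later), take f(x) = −|x| = min{x, −x} on ℜ. Here U = {1, 2}, F(x, 1) = x and F(x, 2) = −x are α-Hölder smooth for any α (their gradients are constant), and at x̄ = 0 both indices are active. Formula (4) gives ∂_C f(0) = co{1, −1} = [−1, 1], while (5) and (6) give f°(0; d) = max{d, −d} = |d| and f'(0; d) = min{d, −d} = −|d|, so the strict inequality f'(0; d) < f°(0; d) holds for every d ≠ 0.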
Given the two derivatives f° and f', we can define two stationarity conditions for (1). A point x̄ ∈ X is Clarke stationary (C-stationary) for (1) if f°(x̄; z − x̄) ≥ 0 for all z ∈ X or, equivalently, 0 ∈ ∂_C f(x̄) + N_X(x̄). By adopting the representation (3) for f, C-stationarity becomes

  min_{z∈X} ⟨g, z − x̄⟩ = 0 for some g ∈ co{∇_x F(x̄, u) : u ∈ I(x̄)}.   (7)

A point x̄ ∈ X is directionally stationary (d-stationary) for (1) if it is stationary in the ordinary sense. It follows from Proposition 2 that d-stationarity means

  f'(x̄; z − x̄) ≥ 0 for all z ∈ X.   (8)

By adopting the representation (3) for f, d-stationarity becomes

  min_{z∈X} ⟨∇_x F(x̄, u), z − x̄⟩ = 0 for all u ∈ I(x̄).   (9)

Equations (7) and (9) show that d-stationarity is a much stronger condition than C-stationarity: (9) requires the variational condition to hold for every active gradient, whereas (7) asks for it only at a single element of their convex hull. Indeed, d-stationarity is the sharpest kind among the various stationarity concepts in nonsmooth and nonconvex optimization [17]. Both conditions coincide when f is smooth at the point under consideration: in this case, ∂_C f(x̄) = {∇f(x̄)} and f°(x̄; ·) = f'(x̄; ·) = ⟨∇f(x̄), ·⟩. This is not so in Example 3 (a pointwise minimum of smooth functions of one variable, revisited in Section 3). There, f is nondifferentiable at x̄ = 0: equation (4) gives a non-singleton Clarke subdifferential. Hence f is not regular at x̄ = 0, a point that is C-stationary (in fact a global maximizer) but not d-stationary.
The following result shows that if a d-stationary point lies in the interior of X, then f is smooth at this point.

▶ Proposition 4. Let x̄ ∈ int(X) be a d-stationary point of (1). Then ∇_x F(x̄, u) = 0 for every u ∈ I(x̄); in particular, ∂_C f(x̄) = {0} and f is differentiable at x̄.

Proof. Fix u ∈ I(x̄) and set g := ∇_x F(x̄, u). Since x̄ lies in the interior of X, there exists ρ > 0 such that x̄ + d ∈ X whenever ∥d∥ ≤ ρ. Condition (9) then yields ⟨g, d⟩ ≥ 0 for all such d; taking d = −ρ g/∥g∥ if g ≠ 0 gives −ρ∥g∥ ≥ 0, a contradiction, showing that g = 0. ◀

A word of caution may be necessary: the above result does not imply that the index set I(x̄) is a singleton.
Revisiting the Frank–Wolfe algorithm for UC^{1,α} functions

The alternative representation of the C-stationarity condition given by (7) motivates us to apply the FW algorithm of [11] to problem (1). In Algorithm 1 we assume the existence of an oracle that, for any given point x ∈ X, provides us with the value f(x) and an arbitrary Clarke subgradient g ∈ ∂_C f(x).

Algorithm 1 Frank–Wolfe (FW) algorithm
1: Let x^0 ∈ X and Tol ≥ 0 be given
2: for k = 0, 1, 2, . . . do
3:   Call the oracle to obtain an arbitrary subgradient g^k ∈ ∂_C f(x^k) and compute z^k ∈ argmin_{z∈X} ⟨g^k, z⟩
4:   Set d^k = z^k − x^k and θ_k = −⟨g^k, d^k⟩
5:   Stop if θ_k ≤ Tol: x^k is a C-stationary point within tolerance Tol
6:   Choose τ_k ∈ (0, 1] and set x^{k+1} = x^k + τ_k d^k
7: end for

Algorithm 1 is interesting in situations where minimizing a linear function over X is computationally cheap. We refer the interested reader to [12] and references therein for several optimization problems arising from machine learning and data science that are suitable for the FW algorithm. Unlike [12], which deals with smooth problems, the recent work [19] gives a list of nonsmooth but convex optimization problems that can be solved by a variant of the FW algorithm tailored to a particular structure of nonsmoothness. The main idea in [19], which dates back to [25], is to work with well-chosen approximate subgradients of f. More precisely, instead of computing an arbitrary subgradient, the method requires at every iteration the construction of a set T^k ⊂ ℜ^n containing all the subgradients of f in a neighborhood of x^k, and defines z^k by solving the more involved subproblem min_{z∈X} max_{s∈T^k} ⟨s, z⟩, which is implementable only in some particular (convex) cases [4,19,25].
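To make the procedure concrete, below is a minimal Python sketch of Algorithm 1 for the finite-min structure f(x) = min_i F_i(x) over a polytope X described by its vertices, so that the linear subproblem reduces to vertex enumeration. The instance encoding (grads, fvals, vertices) and the diminishing stepsize are illustrative assumptions of ours, not prescriptions from the paper.

import numpy as np

def frank_wolfe(grads, fvals, vertices, x0, K=1000, tol=1e-8):
    """Sketch of Algorithm 1 for f(x) = min_i F_i(x) over X = conv(vertices).

    grads[i](x) returns the gradient of F_i at x; fvals[i](x) its value.
    The oracle below returns the gradient of one active piece, which is a
    valid (arbitrary) Clarke subgradient by formula (4).
    """
    x = np.asarray(x0, dtype=float)
    theta = np.inf
    for k in range(K):
        vals = [F(x) for F in fvals]
        i = int(np.argmin(vals))              # one active index u in I(x)
        g = grads[i](x)                       # arbitrary g in the Clarke subdifferential
        z = min(vertices, key=lambda v: float(np.dot(g, v)))  # linear minimization oracle
        d = z - x
        theta = -float(np.dot(g, d))          # optimality gap theta_k >= 0
        if theta <= tol:                      # stopping test of step 5
            break                             # x is C-stationary within tol
        x = x + d / (k + 2)                   # tau_k = 1/(k+2): sum diverges, tau_k -> 0
    return x, theta

Vertex enumeration is of course only sensible for tiny illustrative polytopes; in the applications targeted by the FW literature, the linear minimization oracle is instead a cheap structured solver.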
Under the assumption that f is smooth, Algorithm 1 is the classic conditional gradient method of [11]. The sole contribution of this section is the proof that the algorithm (asymptotically) computes a C-stationary point of (1) provided f is UC^{1,α} with α ∈ (0, 1]. To this end, we assume that Tol = 0 and that the algorithm does not stop. (If Algorithm 1 stops at iteration k, then θ_k = 0 because θ_k ≥ 0 for all k: in this case, 0 = min_{z∈X} ⟨g^k, z − x^k⟩ and thus x^k is C-stationary; cf. (7).) We start our analysis with the following key result; once the inequality in Lemma 5 is established, the convergence analysis follows the same techniques found in the literature on the (smooth) FW algorithm.

▶ Lemma 5. Let f be upper-C^{1,α} with representation (3), and let ℓ > 0 be a constant such that every F(·, u), u ∈ U, is α-Hölder smooth with constant ℓ. Then the iterates of Algorithm 1 satisfy

  f(x^{k+1}) ≤ f(x^k) − τ_k θ_k + (ℓ/(1+α)) τ_k^{1+α} ∥d^k∥^{1+α} for all k.

Proof. The Carathéodory theorem [22, Thm. 2.29] ensures that every g ∈ ∂_C f(x^k) = co{∇_x F(x^k, u) : u ∈ I(x^k)} can be written as a convex combination of no more than n + 1 vectors ∇_x F(x^k, u), u ∈ I(x^k). Therefore, by replicating u_i ∈ I(x^k) and assigning λ_i^k = 0 if necessary, the subgradient g^k at iteration k of Algorithm 1 can be expressed as g^k = Σ_{i=1}^{n+1} λ_i^k ∇_x F(x^k, u_i), with λ_i^k ≥ 0, Σ_{i=1}^{n+1} λ_i^k = 1, and u_i ∈ I(x^k). α-Hölder smoothness of F(·, u_i) gives the descent inequality

  F(x^k + τ_k d^k, u_i) ≤ F(x^k, u_i) + τ_k ⟨∇_x F(x^k, u_i), d^k⟩ + (ℓ/(1+α)) τ_k^{1+α} ∥d^k∥^{1+α}.

We get from the above inequality that, for every i,

  f(x^{k+1}) ≤ F(x^{k+1}, u_i) ≤ f(x^k) + τ_k ⟨∇_x F(x^k, u_i), d^k⟩ + (ℓ/(1+α)) τ_k^{1+α} ∥d^k∥^{1+α},

because F(x^k, u_i) = f(x^k) for u_i ∈ I(x^k). Multiplying by λ_i^k and summing over i = 1, . . . , n + 1, we have thus shown that f(x^{k+1}) ≤ f(x^k) + τ_k ⟨g^k, d^k⟩ + (ℓ/(1+α)) τ_k^{1+α} ∥d^k∥^{1+α} = f(x^k) − τ_k θ_k + (ℓ/(1+α)) τ_k^{1+α} ∥d^k∥^{1+α}. ◀

▶ Theorem 6. If {τ_k} satisfies Σ_{k=0}^∞ τ_k = ∞ and lim_{k→∞} τ_k = 0, then lim_{k→∞} θ̄_k = 0, where θ̄_k := min_{j∈{1,...,k}} θ_j. Furthermore, let j(k) ∈ {1, . . . , k} be such that θ̄_k = θ_{j(k)}. Then any cluster point of the sequence {x^{j(k)}} is a C-stationary point for (1).
The sequence {f(x^k)} can be made monotone upon stricter rules for defining stepsizes. The following result is an adaptation of [2, Thm. 13.9] (which considers α = 1 and f to be ℓ-smooth).

▶ Theorem 7. Suppose that the stepsizes in Algorithm 1 are defined by one of the following rules: (i) τ_k = min{1, (θ_k/(ℓ ∥d^k∥^{1+α}))^{1/α}}, with ℓ > 0 the constant of Lemma 5; (ii) τ_k ∈ argmin_{τ∈[0,1]} f(x^k + τ d^k). Then the sequence {f(x^k)} is nonincreasing. In particular, θ̄_k vanishes at the rate O(1/k^{α/(α+1)}) (see the Appendix).

This result is a conceptual one because the constant ℓ in (i) is in general unknown, and rule (ii) amounts to globally minimizing a univariate function over the interval [0, 1]. Less stringent schemes that work well in practice employ inexact line searches (e.g., [18] and the Armijo rule).
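To see where the O(1/k^{α/(α+1)}) rate comes from, here is a short sketch of ours under simplifying assumptions: a constant stepsize τ_k ≡ τ run for k iterations, D the diameter of X (so ∥d^j∥ ≤ D), and f_* := min_{x∈X} f(x). Summing the inequality of Lemma 5 over j = 0, . . . , k − 1 gives

  τ Σ_{j=0}^{k−1} θ_j ≤ f(x^0) − f_* + k (ℓ/(1+α)) τ^{1+α} D^{1+α},

so that θ̄_k ≤ (f(x^0) − f_*)/(k τ) + (ℓ D^{1+α}/(1+α)) τ^α. The choice τ = k^{−1/(1+α)} balances the two terms and yields θ̄_k = O(1/k^{α/(α+1)}); for α = 1 this recovers the familiar O(1/√k) rate mentioned in the Introduction.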
Let us consider again Example 3 and apply Algorithm 1. If we start with x^0 = 0 and the oracle returns g^0 = ∇F_1(x^0) = 4, then we get θ_0 = 0: the algorithm stops at iteration k = 0 at the C-stationary point x^0, which is a global maximizer. This fact motivates the following section.
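This premature stop is easy to reproduce numerically with the frank_wolfe sketch above. Since Example 3 is not reproduced in full here, the instance below (F_1(x) = 4x, F_2(x) = −x, X = [0, 1]) is a hypothetical stand-in of our own, chosen only to match the quantities reported above (∇F_1(0) = 4 and θ_0 = 0 at x^0 = 0):

# Hypothetical stand-in instance: f(x) = min{4x, -x} over X = [0, 1].
# On X we have f(x) = -x, so x = 0 is the global maximizer of f over X.
fvals = [lambda x: 4.0 * x[0], lambda x: -x[0]]
grads = [lambda x: np.array([4.0]), lambda x: np.array([-1.0])]
vertices = [np.array([0.0]), np.array([1.0])]

x, theta = frank_wolfe(grads, fvals, vertices, x0=np.array([0.0]))
# Both pieces are active at x^0 = 0; the oracle picks g^0 = grad F_1(0) = 4,
# z^0 = 0 minimizes 4z over [0, 1], hence theta_0 = 0 and the method stops
# at the global maximizer. Checking grad F_2 instead would reveal a descent
# direction, which is precisely what Algorithm 2 below does.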

Directional stationarity via a modified FW algorithm
Ideally, for nonsmooth minimization problems, one wants to design an algorithm that computes a stationary point having the best chance of being a local minimum [17]. However, without further structure, guaranteeing the computation of a d-stationary point is out of reach. One special structure arises when X has finitely many known vertices, a setting exploited in [3] for more structured functions and briefly adapted to our framework in the Appendix. We now propose a new variant of the FW algorithm for computing d-stationary points under the following additional assumption on the objective function:

  f is given by f(x) = min_{i=1,...,q} F_i(x), with every F_i : O → ℜ known and α-Hölder smooth,   (12)

i.e., U in (3) is the finite index set {1, . . . , q}. Accordingly, it follows from (9) that x̄ ∈ X is a d-stationary point of problem (1) if

  min_{z∈X} ⟨∇F_i(x̄), z − x̄⟩ = 0 for all i ∈ I(x̄).   (13)

Based on this fact, Algorithm 2 seeks, at every iteration, a descent direction by checking the gradients of all functions F_i that are ϵ-active.
To be more precise, we define the following index set, with ϵ > 0 a small tolerance:

  I_ϵ(x) := {i ∈ {1, . . . , q} : F_i(x) ≤ f(x) + ϵ},

together with the gap functions θ_i(x) := −min_{z∈X} ⟨∇F_i(x), z − x⟩.

Algorithm 2 d-stationary Frank–Wolfe algorithm
1: Let x^0 ∈ X, Tol ≥ 0, and ϵ > 0 be given
2: for k = 0, 1, 2, . . . do
3:   for i ∈ I_ϵ(x^k) do
4:     Compute z^{i,k} ∈ argmin_{z∈X} ⟨∇F_i(x^k), z⟩
5:     Set d^{i,k} = z^{i,k} − x^k and θ_i(x^k) = −⟨∇F_i(x^k), d^{i,k}⟩
6:     Choose τ_{i,k} ∈ (0, 1] and set x^{i,k} = x^k + τ_{i,k} d^{i,k}
7:   end for
8:   Define i* ∈ argmin_{i∈I_ϵ(x^k)} f(x^{i,k}) and set x^{k+1} = x^{i*,k}
9:   Stop if max_{i∈I(x^k)} θ_i(x^k) ≤ Tol: x^k is a d-stationary point within tolerance Tol
10: end for

If q = 1 in (12), then Algorithm 2 boils down to the classic FW algorithm. Again aligned with the main motivation from [4,19,25], when q > 1 Algorithm 2 searches for a subgradient in ∂_C f yielding maximum descent. In order to provide an asymptotic analysis, we must allow the possibility of employing "approximate" subgradients yielded by ∇F_i(x^k) with i in I_ϵ(x^k)\I(x^k). This is in the same vein as the proximal method of [17] for DC programming. Note that θ_i(x) is nonnegative for all i ∈ I_ϵ(x) and all x ∈ X. Naturally, if at iteration k we have θ_i(x^k) = 0 for all active indices i ∈ I(x^k), then x̄ = x^k is, by (13), d-stationary for problem (1) and the algorithm should halt. This explains why the stopping test above employs the set I(x^k) instead of I_ϵ(x^k). Indeed, the alternative stopping test max_{i∈I_ϵ(x^k)} θ_i(x^k) ≤ Tol might never be triggered even when Tol > 0: we may find a direction d^{j,k} that is of descent for some F_j with j ∈ I_ϵ(x^k)\I(x^k) but not for f.
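A minimal Python sketch of Algorithm 2 follows, in the same vertex-enumeration setting as the Algorithm 1 sketch above; the numerical activity tolerance and the diminishing stepsize are illustrative assumptions of ours.

import numpy as np

def d_stationary_fw(grads, fvals, vertices, x0, eps=1e-3, K=1000, tol=1e-8):
    """Sketch of Algorithm 2 for f = min_i F_i over X = conv(vertices)."""
    x = np.asarray(x0, dtype=float)
    f_of = lambda y: min(float(F(y)) for F in fvals)
    for k in range(K):
        vals = np.array([F(x) for F in fvals])
        fx = float(vals.min())
        eps_active = np.where(vals <= fx + eps)[0]     # I_eps(x^k)
        active = np.where(vals <= fx + 1e-12)[0]       # I(x^k), numerically
        theta, trial = {}, {}
        for i in eps_active:                           # inner loop, steps 3-7
            g = grads[i](x)
            z = min(vertices, key=lambda v: float(np.dot(g, v)))
            d = z - x
            theta[i] = -float(np.dot(g, d))
            trial[i] = x + d / (k + 2)                 # candidate point x^{i,k}
        if max(theta[i] for i in active) <= tol:       # stopping test on I(x^k) only
            return x                                   # d-stationary within tol
        i_star = min(eps_active, key=lambda i: f_of(trial[i]))
        x = trial[i_star]                              # keep the best trial point
    return x

The design choice mirrors the discussion above: the descent search scans all ϵ-active pieces, while the stopping test only trusts the truly active ones.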
In what follows we analyze the convergence properties of Algorithm 2. To this end, we need to establish the continuity of the function θ_i(·). Let ω_i : O × O → ℜ be given by ω_i(x, y) := ⟨∇F_i(y), x − y⟩. Since ∇F_i is continuous by assumption, ω_i is continuous in both arguments. Then, Theorem 1.17(c) from [22] ensures that θ_i(y) = −min_{x∈X} ω_i(x, y) is a continuous function.
▶ Theorem 8. Let f : O → ℜ be given by (12) and X ⊂ O be a convex and compact set. Consider Algorithm 2 with ϵ > 0 applied to problem (1), and suppose that each stepsize τ_{i,k} is defined by one of the rules of Theorem 7 (with θ_i(x^k) and d^{i,k} in place of θ_k and d^k). Then every cluster point of the sequence {x^k} generated by the algorithm is d-stationary for (1).

Proof. Let x̄ ∈ X be an arbitrary cluster point of {x^k} and {x^{k_l}} be a subsequence such that lim_{l→∞} x^{k_l} = x̄. It follows from the continuity of F_i, i = 1, . . . , q, that I(x̄) ⊂ I_ϵ(x^{k_l}) for all l large enough. Then, by working with such indexes, we obtain lim_{l→∞} θ_i(x^{k_l}) = 0 for all i ∈ I(x̄). By taking the limit with l going to infinity, we get from the continuity of θ_i that θ_i(x̄) = 0 for all i ∈ I(x̄), that is, x̄ is d-stationary for (1); cf. (13). ◀

The next result shows how sharp the d-stationarity concept is for the problems of interest.
▶ Theorem 10. Let x̄ ∈ X be d-stationary for (1) with f given by (12). In addition, assume that there exists δ > 0 such that the functions F_i in (12) are convex on V_x̄ := {y ∈ O : ∥y − x̄∥ < δ}. Then x̄ is a local minimizer of (1).

Proof. For every j ∉ I(x̄) we have F_j(x̄) > f(x̄), and continuity thus yields ϵ_j > 0 such that F_j(y) > f(y) whenever ∥y − x̄∥ < ϵ_j. As there are only finitely many functions F_j, we conclude that ε̄ := min_{j∈{1,...,q}\I(x̄)} ϵ_j is a strictly positive constant. Again, continuity ensures the existence of ρ ∈ (0, min{δ, ε̄}] such that I(y) ⊂ I(x̄) for all y with ∥y − x̄∥ < ρ. For such y ∈ X, convexity of the active pieces and (13) give

  f(y) ≥ min_{i∈I(x̄)} F_i(y) ≥ min_{i∈I(x̄)} [F_i(x̄) + ⟨∇F_i(x̄), y − x̄⟩] ≥ f(x̄),

so x̄ is a local minimizer of (1). ◀

▶ Corollary 11. If all functions F_i in (12) are convex, then x̄ ∈ X is d-stationary for (1) if and only if x̄ is a local minimizer of (1).

This corollary, apparently innocuous, conveys an original result in the area of DC programming [8]. Indeed, if all functions F_i in (12) are convex, then f can be decomposed as a DC function f(x) = f_1(x) − f_2(x), with f_1(x) = Σ_{i=1}^q F_i(x) and f_2(x) = max_{j=1,...,q} Σ_{ℓ≠j} F_ℓ(x). In this setting, the concept of d-stationarity of a point x̄ ∈ X for the problem min_{x∈X} f_1(x) − f_2(x) is equivalent to the inclusion ∂f_2(x̄) ⊂ ∂[f_1(x̄) + i_X(x̄)]. Hence, all local minimizers must satisfy this condition. However, the reverse implication is only known to hold if f_2 is (locally) polyhedral. Corollary 11 gives another framework where d-stationarity implies local optimality without requiring f_2 to be locally polyhedral.
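The DC decomposition above rests on a one-line identity (our own verification): for any reals a_1, . . . , a_q,

  max_{j=1,...,q} Σ_{ℓ≠j} a_ℓ = Σ_{i=1}^q a_i − min_{j=1,...,q} a_j,

so that min_j a_j = Σ_i a_i − max_j Σ_{ℓ≠j} a_ℓ. Applying this pointwise with a_i = F_i(x) gives f = f_1 − f_2, and both f_1 (a sum of convex functions) and f_2 (a finite maximum of sums of convex functions) are indeed convex.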
Suppose that θ̄_k ≥ ε for some ε > 0 and all k. Combining these two last inequalities, we conclude that the former cannot hold for all k sufficiently large, a contradiction. This proves that lim_{k→∞} θ̄_k = 0 at a convergence rate of O(1/k^{α/(α+1)}). ◀

A.3 Computing d-stationary points when the feasible set has known vertices
▶ Proposition 13. Let f : O → ℜ be an upper-C^{1,α} function and suppose that X has finitely many vertices v_1, . . . , v_m. Then x̄ is a d-stationary point of problem (1) if and only if f'(x̄; v_ι − x̄) ≥ 0 for all ι = 1, . . . , m.
Proof. The first implication follows directly from the definition of d-stationarity: f'(x̄; x − x̄) ≥ 0 for all x ∈ X. To prove the converse implication, note that for every x ∈ X = co{v_1, . . . , v_m} there exists a vector λ^x ∈ ℜ^m_+ such that Σ_{ι=1}^m λ^x_ι = 1 and x = Σ_{ι=1}^m λ^x_ι v_ι. Furthermore, recall that the directional derivative is positively homogeneous, i.e., r f'(x̄; d) = f'(x̄; r d) for all r ≥ 0. Then, using the expression in (6),

  0 ≤ Σ_{ι=1}^m λ^x_ι f'(x̄; v_ι − x̄) = Σ_{ι=1}^m min_{g∈∂_C f(x̄)} ⟨g, λ^x_ι (v_ι − x̄)⟩ ≤ min_{g∈∂_C f(x̄)} ⟨g, Σ_{ι=1}^m λ^x_ι (v_ι − x̄)⟩.

The latter is equal to min_{g∈∂_C f(x̄)} ⟨g, x − x̄⟩ = f'(x̄; x − x̄). As x ∈ X is arbitrary, d-stationarity of x̄ for (1) holds. ◀
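Proposition 13 thus reduces d-stationarity to finitely many scalar tests, one per vertex. A small Python sketch for the finite-min structure (12), where f'(x̄; d) = min_{i∈I(x̄)} ⟨∇F_i(x̄), d⟩ by (6); the activity tolerance is an assumption of ours:

import numpy as np

def is_d_stationary(grads, fvals, vertices, x, tol=1e-10):
    """Vertex test of Proposition 13 for f = min_i F_i on X = conv(vertices)."""
    vals = np.array([F(x) for F in fvals])
    active = np.where(vals <= vals.min() + tol)[0]     # active index set I(x)
    G = [grads[i](x) for i in active]
    # f'(x; v - x) = min over active gradients; d-stationary iff >= 0 at all vertices
    return all(min(float(np.dot(g, v - x)) for g in G) >= -tol for v in vertices)

On the hypothetical instance used in Section 3, is_d_stationary(grads, fvals, vertices, np.array([0.0])) returns False, since f'(0; 1 − 0) = min{4, −1} < 0, matching the fact that x̄ = 0 is not d-stationary.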