A note on approximate accelerated forward-backward methods with absolute and relative errors, and possibly strongly convex objectives

In this short note, we provide a simple version of an accelerated forward-backward method (a.k.a. Nesterov's accelerated proximal gradient method) that possibly relies on approximate proximal operators and allows exploiting strong convexity of the objective function. The method supports both relative and absolute errors, and its behavior is illustrated on a set of standard numerical experiments. Using the same developments, we further provide a version of the accelerated proximal hybrid extragradient method of [21] possibly exploiting strong convexity of the objective function.

Acknowledgments. MB acknowledges support from an AMX fellowship. The authors acknowledge support from the European Research Council (grant SEQUOIA 724063). This work was funded in part by the French government under the management of the Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).


Introduction
In this work, we consider a standard composite convex minimization problem of the form

min_{x∈R^d} { F(x) ≡ f(x) + g(x) }, (1)

where f : R^d → R is an L-smooth convex function (with 0 < L < ∞), and g : R^d → R ∪ {+∞} is a proper closed convex function. In addition, we allow either f or g to be µ-strongly convex. In this setting, we propose an inexact accelerated forward-backward method for solving (1), relying on access to the gradient of f and to an iterative routine for approximating the proximal operator of g.
Relation to previous works

The method proposed below supports inexact evaluations of the proximal operator of g (with absolute and relative errors of different types), while allowing to exploit strong convexity of f or g; see e.g. [11, 23, 24] for original analyses in the strongly convex case, when the proximal operator of g is readily available. The same developments allow obtaining a version of the accelerated hybrid proximal extragradient method (A-HPE), in the spirit of [21], for exploiting strong convexity of the problem at hand. The notion of an "approximate proximal point" used in this note (see Section 2.2) was used in a few previous works, starting with the hybrid extragradient method [32, 33]. It was also used for its accelerated version [21] and in the context of another forward-backward splitting method [20]. In these works, the primal-dual requirement is presented under a different formulation involving the notion of ε-subdifferentials [7, Section 3] (or ε-enlargements in the context of monotone operators [8, 9, 10]). Among others, a variant of the hybrid extragradient method was also studied in [10] under both absolute and relative errors, similar in spirit to the accelerated methods presented below. A survey on common notions of "approximate proximal point" used in the literature can be found in [4, Section 2].

Paper organization and contribution
This note is organized as follows. First, we give some basic results and notation in Section 2. We provide the inexact accelerated forward-backward method in Section 3, along with a worst-case analysis relying on a standard Lyapunov argument (for which we provide symbolic notebooks, helping the reader reproduce the algebraic part of the proof without pain). Numerical experiments illustrating the practical behavior of the method are then provided in Section 4. After that, Section 5 shows how to slightly modify the proof to obtain an accelerated hybrid proximal extragradient method [21], specifically for the case f = 0. We draw some conclusions in Section 6.

Notations
We refer to classical textbooks [16, 26] for standard elements of convex analysis. We use the notation F_{0,∞}(R^d) to denote the set of closed convex proper functions on R^d. The corresponding subset of closed convex proper functions that are µ-strongly convex and L-smooth is denoted by F_{µ,L}(R^d). We write ∂h(x) for the subdifferential of h at x ∈ R^d. When h ∈ F_{µ,L}(R^d) with L < ∞, we use h′(x) to denote the unique element h′(x) ∈ ∂h(x) (i.e., the gradient of h at x).

Codes
To help the reader reproduce the analytical results (via Mathematica notebooks) as well as the numerical experiments, our code is available at https://github.com/mathbarre/StronglyConvexForwardBackward.

Background results

Smooth strongly convex functions
We recall some standard inequalities satisfied by smooth convex and strongly convex functions, which we use in the sequel for exploiting strong convexity and smoothness, see e.g. [23].
For g ∈ F_{µ,∞}(R^d), all x, y ∈ R^d and all s_g(x) ∈ ∂g(x), it holds that

g(y) ≥ g(x) + ⟨s_g(x); y − x⟩ + (µ/2) ‖y − x‖².

In the sequel, the use of the inequalities provided by Theorem 1 and Theorem 2 is motivated by their interpolation (or extension) properties; that is, the analyses provided below were obtained following a principled approach to worst-case analyses of first-order methods, see e.g. [34], or [4] specifically for the case of methods relying on approximate proximal operations.

Proximal operations
The proximal operation is a basic primitive that is widely used in modern optimization methods; it is a central building block in many optimization algorithms, see e.g. [25, 29]. The proximal operator of a function g ∈ F_{0,∞}(R^d) with step size λ > 0 is defined as

prox_{λg}(z) := argmin_{x∈R^d} { λ g(x) + (1/2) ‖x − z‖² }, (2)

with z ∈ R^d. When g ∈ F_{0,∞}(R^d), the proximal operation is well defined and its solution is unique. A comprehensive list of cases where (2) has an analytical solution is provided in [13]. In other cases, the proximal operator has to be approximated. For doing that, one can define the following primal and dual problems associated with the proximal operation

Φ_p(x; z) := λ g(x) + (1/2) ‖x − z‖², (P)
Φ_d(v; z) := −λ g*(v) − (1/2) ‖z − λv‖² + (1/2) ‖z‖², (D)

where g* ∈ F_{0,∞}(R^d) is the Fenchel conjugate of g. Let us further note that prox_{λg}(z) is the unique minimizer of (P), and that prox_{g*/λ}(z/λ) is the unique maximizer of (D). In this context, the primal and dual solutions are linked by the well-known Moreau identity prox_{λg}(z) + λ prox_{g*/λ}(z/λ) = z. Under relatively weak conditions (such as ri(dom g) ≠ ∅, see e.g. [26, Corollary 31.2.1]), strong duality holds between (P) and (D), and hence Φ_p(prox_{λg}(z); z) = Φ_d(prox_{g*/λ}(z/λ); z). Motivated by those elements, we use the quantity

PD_{λg,z}(x, v) := Φ_p(x; z) − Φ_d(v; z) (PD)

for quantifying, in the sequel, how well (x, v) approximates the pair (prox_{λg}(z), prox_{g*/λ}(z/λ)).
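As a concrete sanity check of Moreau's identity and the primal-dual pair above, consider g = ‖·‖₁ (an example of ours, not taken from the note), for which both proximal operators are available in closed form:

```python
import numpy as np

def prox_l1(z, lam):
    # prox_{lam*||.||_1}(z): componentwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def proj_linf(z, radius):
    # g* is the indicator of the l-infinity ball when g = ||.||_1, so
    # prox_{g*/lam}(z) is the Euclidean projection onto that ball.
    return np.clip(z, -radius, radius)

rng = np.random.default_rng(0)
z, lam = rng.standard_normal(5), 0.7

x = prox_l1(z, lam)           # primal proximal point, prox_{lam*g}(z)
v = proj_linf(z / lam, 1.0)   # dual proximal point, prox_{g*/lam}(z/lam)

# Moreau's identity: prox_{lam*g}(z) + lam * prox_{g*/lam}(z/lam) = z.
assert np.allclose(x + lam * v, z)
```

When the proximal operator has no closed form, (x, v) is only computed approximately, and the gap (PD) between the primal and dual values quantifies the resulting inexactness.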

A notion of approximate proximal point
In this section, we define the notion of approximate proximal point of g ∈ F_{µ,∞}(R^d) used throughout the paper (see Section 1, §"Relation to previous works" for historical references for the case µ = 0). This notion features two parameters: a tolerance and a lower bound on the strong convexity parameter of g (possibly 0). The estimate of the strong convexity is used for relating proximal points of g(·) to those of g_µ(·) := g(·) − (µ/2)‖·‖² ∈ F_{0,∞}(R^d), and the tolerance is used for quantifying the quality of an approximate solution to the proximal problem on g_µ(·), which simplifies the analyses below. More precisely, for g ∈ F_{µ,∞}(R^d), it is relatively straightforward to verify that

prox_{λg}(z) = prox_{(λ/(1+λµ)) g_µ}(z/(1+λµ)). (3)

This observation motivates the introduction of the following inexactness criterion.

Definition 3. Let µ ≥ 0, g ∈ F_{µ,∞}(R^d), and let λ > 0 be a step size and ε ≥ 0 be a tolerance. For a triplet (x, v, z) ∈ R^d × R^d × R^d, we say that (x, v) is an ε-approximation of the pair (prox_{λg}(z), prox_{g*/λ}(z/λ)) when

PD_{(λ/(1+λµ)) g_µ, z/(1+λµ)}(x, v − µx) ≤ ε,

with g_µ(x) = g(x) − (µ/2)‖x‖², and where PD is the primal-dual gap of the proximal problem defined in (PD). In the following technical lemma, we provide an explicit expression for quantifying the quality of a triplet (x, v, z).
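The identity relating prox_{λg} to a proximal operator of g_µ can be checked numerically on a simple example of ours: the one-dimensional elastic net g(x) = (µ/2)x² + τ|x|, for which g_µ = τ|·| and both proximal operators have closed forms (all constants below are illustrative):

```python
import numpy as np

def soft(z, t):
    # Componentwise soft-thresholding: prox of t*|.|.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

mu, tau, lam = 0.5, 0.3, 2.0
z = np.array([1.7, -0.2, 0.05, -3.0])

# g(x) = (mu/2)*x^2 + tau*|x| is mu-strongly convex, and g_mu = tau*|.|.
lhs = soft(z, lam * tau) / (1.0 + lam * mu)  # closed form of prox_{lam*g}(z)
# prox of g_mu with shrunken step size, evaluated at the shifted point:
rhs = soft(z / (1.0 + lam * mu), lam * tau / (1.0 + lam * mu))

# prox_{lam*g}(z) = prox_{(lam/(1+lam*mu))*g_mu}(z/(1+lam*mu))
assert np.allclose(lhs, rhs)
```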

Lemma 4.
Let µ ≥ 0, g ∈ F_{µ,∞}(R^d), and let λ > 0 be a step size and (x, v, z) ∈ R^d × R^d × R^d. The following equality holds. In particular, using the expression of g*_µ(v − µx) in (4) leads to the desired result.
In the next section, we present an inexact accelerated forward-backward method in which inexactness of the proximal computations is measured using the primal-dual criterion from Definition 3.

An inexact accelerated forward-backward method
In this section, we provide the main contribution of this work, namely Algorithm 3.1. This method aims at solving problem (1) when the gradient of f is readily available and the proximal operator of g can be efficiently approximated within a target precision (e.g., by an iterative method). It further allows exploiting µ-strong convexity of g. In the case where f is strongly convex, one can shift this strong convexity to g instead (by removing the corresponding quadratic from f and adding it to g). Of course, any under-approximation of µ can be used within the method. The worst-case analysis is based on a simple Lyapunov (or potential) argument, following the now standard template for accelerated schemes as in [22], for which surveys are provided in e.g. [2, 38] and [14, Chapter 4]. As a byproduct of the analysis, the method does not require an accurate estimate of the smoothness constant L, whose estimate is improved on the fly using standard backtracking tricks, similar in spirit to [5, 22].
The algorithm below builds on approximations of the forward-backward operator (with step sizes λ_k) of problem (1). More precisely, it relies on primal-dual pairs (x_{k+1}, v_{k+1}) approximating the forward-backward operator evaluated at some iterates y_k, with an accuracy ε_k encoding the approximation level. In this work, this error term is parameterized by three sequences of nonnegative scalars {σ_k}_k, {ζ_k}_k, {ξ_k}_k that can be chosen by the user for possibly mixing both relative (or multiplicative) and absolute (or additive) error terms: {ξ_k}_k parameterizes the absolute error term, whereas {σ_k}_k and {ζ_k}_k parameterize two types of relative errors. Of course, convergence properties of the algorithm depend on the choice of those sequences of parameters, as provided in Corollary 8 and Corollary 9 below. Examples of simple rules for choosing them are provided in Section 4 (typically, {σ_k}_k and {ζ_k}_k can be chosen constant, whereas {ξ_k}_k should be either identically 0 or decrease fast enough). Before going into the algorithm itself, let us mention that the backtracking line-search strategy (Btr) for estimating the smoothness constant builds on the condition

f(x_{k+1}) ≤ f(y_k) + ⟨f′(y_k); x_{k+1} − y_k⟩ + ((1−σ_k²)/(2λ_k)) ‖x_{k+1} − y_k‖², (Smooth)

where the x_k's and y_k's are some iterates. In particular, picking λ_k ∈ (0, (1−σ_k²)/L] (hence depending on the true smoothness constant L) guarantees (Smooth) to be satisfied without backtracking, as, when f ∈ F_{0,L}(R^d), Theorem 2 holds.
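To fix ideas, the following sketch shows the overall structure in the simplest setting (exact proximal operator, σ_k = ζ_k = ξ_k = 0, and µ = 0): a standard FISTA-style accelerated forward-backward loop with the backtracking rule λ_k ← αλ_k whenever a (Smooth)-type test fails. It is a simplified stand-in of ours, not a verbatim transcription of Algorithm 3.1:

```python
import numpy as np

def accel_fb(f, grad_f, prox_g, x0, lam0=1.0, alpha=0.5, n_iter=300):
    # FISTA-style accelerated forward-backward method with backtracking on
    # the step size; exact prox, mu = 0 (simplified stand-in for Algorithm 3.1).
    x, y, t, lam = x0.copy(), x0.copy(), 1.0, lam0
    for _ in range(n_iter):
        while True:
            gy = grad_f(y)
            x_new = prox_g(y - lam * gy, lam)
            d = x_new - y
            # Quadratic upper bound of f at y must hold at the candidate point.
            if f(x_new) <= f(y) + gy @ d + d @ d / (2 * lam):
                break
            lam *= alpha  # backtrack: shrink the step size and retry
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)  # momentum (extrapolation) step
        x, t = x_new, t_new
    return x

# Usage on a small lasso instance: f(x) = 0.5*||Ax - b||^2, g = tau*||.||_1.
rng = np.random.default_rng(1)
A, b, tau = rng.standard_normal((20, 10)), rng.standard_normal(20), 0.5
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam * tau, 0.0)

x_out = accel_fb(f, grad_f, prox_g, np.zeros(10))
F = lambda x: f(x) + tau * np.sum(np.abs(x))
assert F(x_out) < F(np.zeros(10))
```

The inexact variant replaces the exact `prox_g` call by an inner iterative routine stopped once the primal-dual criterion of Definition 3 is met.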
Remark 5 (Related methods). When the objective function is not strongly convex (i.e. µ = 0), the update rules of Algorithm 3.1 are very similar to those of the accelerated inexact forward-backward methods from [20, Algorithm 3] (when ζ k = 0 and ξ k = 0) or [6, Algorithm 2] (when σ k = 0 and ξ k = 0). Compared to those works in this setup, Algorithm 3.1 allows using both relative and absolute errors while having a backtracking strategy. Note also the similarities with some inexact FISTA [31,36], although these methods do not re-use explicitly the dual direction v k+1 and focus on absolute error terms (i.e., σ k = ζ k = 0). Finally, when the computation of the proximal operator is exact, we recover one of the many variants of an accelerated forward-backward method; see for example [5,24,35]; we refer to [14,Chapter 4] and the references therein for further discussions.
(Btr): if (Smooth) is not satisfied, set λ_k ← αλ_k and go back to step (6).

The following theorem contains the main (Lyapunov-based) ingredient of the worst-case analysis.
Theorem 6 states the per-iteration Lyapunov inequality (7), with x⋆ ∈ argmin_x F(x), and where z_{k+1} and x_{k+1} are constructed by one iteration of Algorithm 3.1.
The proof of this theorem is deferred to Section 3.2. The following (classical) corollary establishes that the growth rate of the sequence {A_k}_k drives the convergence rate of the worst-case guarantee. The factors A_{k+1}, which control the convergence rate, were greedily chosen (as large as possible) while enforcing (7) to hold.
Corollary 7. Let λ_0 be a positive initial step size, α ∈ (0, 1) and β ≥ 1 be some backtracking parameters, {σ_k}_k and {ζ_k}_k be sequences of relative error parameters satisfying σ_k, ζ_k ∈ [0, 1), and {ξ_k}_k be a sequence of absolute error parameters. The worst-case guarantee after N iterations of Algorithm 3.1 is then driven by 1/A_N.

Proof. We denote by Φ_k the Lyapunov (potential) function for k ≥ 0. Theorem 6 allows nesting the Φ_k's together, and chaining the resulting inequalities yields the target conclusion.

Let us note that when µ = 0, we recover a composite version of the A-HPE method [21]. In that case, the A_k's grow as O(k²), assuming the existence of some η_min ≤ η_k for all k ≥ 0. Such a lower bound on η_k exists as soon as the parameters {σ_k}_k, {ζ_k}_k are well chosen (see for example Corollary 8 and Corollary 9 below), and due to the L-smoothness of the function. Similarly, when µ > 0, the A_k's grow exponentially, assuming again the existence of some η_min ≤ η_k for all k ≥ 0. The following corollaries provide more precise convergence bounds for Algorithm 3.1, by quantifying the growth rate of the A_k's for particular choices of the parameters {σ_k}_k (constant), {ζ_k}_k (constant), and {ξ_k}_k (a parameterized function of k), linking the decrease rate of the absolute errors ξ_k with the convergence bound.
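The O(k²) growth of the A_k's when µ = 0 can be checked numerically on the classical recurrence A_{k+1} = A_k + a_{k+1} with a_{k+1}² = λA_{k+1} (a standard stand-in of ours for the η_k-dependent recurrence, with a constant step size):

```python
# Growth of the A_k's under the classical recurrence (mu = 0, constant step lam):
#   a_{k+1}^2 = lam * A_{k+1},  A_{k+1} = A_k + a_{k+1},  A_0 = 0,
# whose positive root gives a_{k+1} = (lam + sqrt(lam^2 + 4*lam*A_k)) / 2.
lam = 0.3
A = 0.0
for k in range(1, 201):
    a = (lam + (lam**2 + 4 * lam * A) ** 0.5) / 2
    A += a
    assert A >= lam * k**2 / 4  # classical bound behind the O(1/k^2) rate
```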

Proof.
Starting from the conclusion of Corollary 7, we obtain the desired result using classical properties of the sequence {A_k}_k. When µ = 0, the proof remains valid, and 1/A_N = O(N^{-2}). In particular, we recover the same rates as those of [36, Theorem 4.4] (who used a particular choice of the absolute error sequence).
Corollary 9. Let f ∈ F_{0,L}(R^d), g ∈ F_{µ,∞}(R^d), and F ≡ f + g. Let x_0 ∈ R^d, λ_0 be a positive initial step size, σ_k = σ and ζ_k = ζ with σ, ζ ∈ [0, 1) be constant relative error parameters, and x⋆ ∈ argmin_x F(x). Let x_N ∈ R^d denote the output after N ∈ N* iterations of Algorithm 3.1 on F initiated at x_0. Two settings are covered: (i) backtracking parameters α ∈ (0, 1) and β = 1, with a sequence {ξ_k}_k of absolute error parameters, and (ii) backtracking parameters α ∈ (0, 1) and β ≥ 1, with ξ_k = 0 (no absolute error); in both settings an explicit bound on F(x_N) − F(x⋆) holds.

Proof. Starting from the conclusion of Corollary 7, we obtain the desired result in the case β = 1 using comparisons of sums with integrals, along with the bounds (η_min/4) k² ≤ A_k ≤ η_max k². In the second case, where β ≥ 1 and ξ_k = 0, the target result follows from (η_min/4) k² ≤ A_k.

Proof of Theorem 6
The following proof is presented in a purely algebraic form, consisting of a weighted sum of inequalities satisfied by the functions f and g, as well as of the inexactness requirements. Indeed, it has been obtained from a dual certificate of a performance estimation problem (see [4, Section 3] for more details on performance estimation in the context of inexact proximal operations). As mentioned in Section 1, the algebraic equivalences stated below can be verified either by hand or with the help of the Mathematica notebooks (see Section 1, §"Codes").
Proof. Let w_{k+1} ∈ R^d be such that v_{k+1} − µx_{k+1} + µw_{k+1} ∈ ∂g(w_{k+1}); this choice is allowed by (3). The proof consists in performing a weighted sum of the following inequalities:
- strong convexity of g between w_{k+1} and x⋆, with weight ν_1;
- strong convexity of g between w_{k+1} and x_k, with weight ν_2;
- strong convexity of g between w_{k+1} and x_{k+1}, with weight ν_3;
- convexity of f between y_k and x⋆, with weight ν_4;
- convexity of f between y_k and x_k, with weight ν_5;
- convexity and (1−σ_k²)/λ_k-smoothness of f between x_{k+1} and y_k, required by (Smooth), with weight ν_6.

Substituting y_k and z_{k+1} by their expressions in the weighted sum, the latter is equivalently reformulated as a sum of squared Euclidean norms, where the inequality in the second-to-last line comes from the fact that the factors in front of three squared Euclidean norms are nonpositive. In addition, the last equality follows from the particular choice of A_{k+1}, which implies that the factor in front of the last squared Euclidean norm vanishes.

Numerical examples
In this section, we present a few numerical experiments illustrating the behavior of the accelerated inexact forward-backward method (Algorithm 3.1) on two convex problems. More precisely, we apply the method to a factorization problem and to a total variation problem.
In both cases, we use a Tikhonov regularization, improving the conditioning and rendering the problems strongly convex, and we illustrate the numerical performance of the algorithm with different tunings, including the purely relative (ξ_k = 0) and purely absolute (σ_k = ζ_k = 0) accuracy setups, as well as the influence of the knowledge of the strong convexity parameter.

Factorization problem
Our first numerical experiment is a CUR-like factorization problem, introduced in [19]. Given a matrix W ∈ R^{m×p}, it consists in solving a minimization problem whose objective involves the Frobenius norm ‖·‖_F and row/column regularization terms, where X_i and X^j respectively denote the ith row and the jth column of the matrix X. This problem has already been used in [31] for illustrating convergence guarantees of an inexact accelerated proximal gradient method with absolute errors. As in [31], we use an inexact version of the proximal operator of the regularization part, which we solve via a dual block coordinate ascent method [17] (i.e., we solve the dual of the proximal problem). Our implementation (see the link in §"Codes" from Section 1) is based on that of [31], and our experiments are performed on the "a1a" dataset from the LIBSVM library [12]. The corresponding matrix W is normalized to have zero mean and unit norm. We also impose λ_col = (p/m) λ_row so that the row and column regularization parameters have a similar scaling. The choice of the error criteria and regularization parameters is detailed in Fig. 1, where we plot the gaps between the objective values at the iterates of Algorithm 3.1 and the optimal objective value, versus the number of iterations of Algorithm 3.1 (left) and versus the total number of dual block coordinate ascent iterations (right).

Total variation regularization
In this section, we compare the behaviors of the accelerated inexact forward-backward method (Algorithm 3.1) with different tunings on the classical problem of deblurring through total variation regularization [27, 28, 37]. Given a blurred image Y ∈ R^{n×n} and a blurring operator A, the problem consists in solving a least-squares problem regularized by the discrete total variation (where ∇ is the discrete gradient of an image, see e.g. [11, Equation (2.4)]) and a Tikhonov term. One way of dealing with this problem is to approximate the proximal operator of the discrete total variation plus the Tikhonov regularization. As in [20, 36], we apply FISTA [5] on the dual of the proximal subproblem (which is provided e.g. in [11, Example 3.1]), and we use the resulting approximate proximal points within the accelerated inexact forward-backward method.
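As a minimal illustration of running a first-order method on the dual of the proximal subproblem, the sketch below approximates the proximal operator of a one-dimensional total variation term by projected gradient ascent on its dual (a simplified analogue of ours for the 2D FISTA-based subroutine; the instance and all constants are illustrative), and monitors the primal-dual gap, i.e., the kind of quantity controlling inexactness in (PD):

```python
import numpy as np

# Approximate prox of lam*TV in 1D:  min_x 0.5*||x - z||^2 + lam*||D x||_1,
# via its dual:                      max_{||u||_inf <= lam} u^T D z - 0.5*||D^T u||^2,
# with the primal point recovered as x(u) = z - D^T u.
rng = np.random.default_rng(2)
n = 12
z = np.cumsum(rng.standard_normal(n))   # a rough 1D signal to denoise
lam = 1.0
D = np.diff(np.eye(n), axis=0)          # (n-1) x n forward-difference matrix

u = np.zeros(n - 1)
step = 0.25                             # ||D D^T|| <= 4, so 1/4 is a safe step size
for _ in range(5000):
    u = np.clip(u + step * (D @ z - D @ (D.T @ u)), -lam, lam)  # projected ascent

x = z - D.T @ u                         # primal iterate recovered from the dual one
primal = 0.5 * np.sum((x - z) ** 2) + lam * np.sum(np.abs(D @ x))
dual = u @ (D @ z) - 0.5 * np.sum((D.T @ u) ** 2)
gap = primal - dual                     # PD-type gap certifying the approximation
assert -1e-9 < gap < 1e-6
```

Stopping the inner loop once this gap passes below the tolerance ε_k is exactly how the approximate proximal points feeding Algorithm 3.1 are certified.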
In the experiments, Y is the popular 256 × 256 greyscale boat image (see e.g. http://sipi.usc.edu/database/). We blur Y via a 5 × 5 box blur kernel A, and add to the picture a Gaussian noise with zero mean and standard deviation equal to 0.01 times the mean of the blurred image. Some results are detailed in Fig. 2, where we plot the gaps between the objective values at the iterates of Algorithm 3.1 and the optimal objective value, versus the number of iterations of Algorithm 3.1 (left) and versus the total number of iterations of FISTA on the dual subproblem (right). In Fig. 2, the initial L and λ_reg are set to 1, and µ_reg to 10^{-2} L. Top: σ_k = 0.8, ζ_k = 0 and ξ_k = 0. Bottom: accelerated inexact forward-backward with µ = µ_reg and backtracking. When backtracking is used, α is set to 1/2 and β to 1.1. "Total iterations" refers to the total number of FISTA iterations used in the subroutine that computes the proximal steps approximately. F⋆ is approximated by the smallest objective value encountered in 2·10^4 total FISTA iterations.

An accelerated hybrid proximal extragradient method
In this section, we provide an improved analysis for the specific case f = 0 (no smooth convex term in (1)). This type of method is often used as a globalization strategy for higher-order methods, see [21]. The version presented in this section allows exploiting the possible strong convexity of the objective, which, to the best of our knowledge, was not incorporated in previous versions of the method.
Proof. The proof of this theorem is deferred to Section 5.2. Just as for its forward-backward version, one can obtain a final worst-case guarantee driven by the growth rate of the sequence {A_k}_k.

Corollary 11. Let g ∈ F_{µ,∞}(R^d), {λ_k}_k be a sequence of positive parameters, and {σ_k}_k be a sequence of relative error parameters satisfying σ_k ∈ [0, 1]. Let x_N ∈ R^d be the output after N ∈ N* iterations of Algorithm 5.1 on g initiated at x_0 ∈ R^d, and let x⋆ ∈ argmin_x g(x).
Proof. The proof follows the same lines as that of Corollary 7, using Theorem 10 instead of Theorem 6.

Proof of Theorem 10
The proof follows the same structure as that of Section 3.2, and simply consists in reformulating a weighted sum of inequalities.
Proof. We first consider σ_k ∈ (0, 1], as the case σ_k = 0 requires a particular treatment. Let w_{k+1} ∈ R^d be such that v_{k+1} − µx_{k+1} + µw_{k+1} ∈ ∂g(w_{k+1}); this choice is allowed by (3). The proof consists in performing a weighted sum of the following inequalities:
- strong convexity of g between w_{k+1} and x⋆, with weight ν_1;
- strong convexity of g between w_{k+1} and x_k, with weight ν_2 = A_k;
- strong convexity of g between w_{k+1} and x_{k+1}, with weight ν_3 = A_{k+1}(1 − σ_k + λ_k µ);
- approximation requirement on x_{k+1}, with weight ν_4 = A_{k+1}.

Substituting y_k and z_{k+1} by their expressions in the weighted sum then yields the desired reformulation.

Conclusion
In this note, we proposed an inexact accelerated forward-backward method for solving composite convex minimization problems, along with corresponding worst-case guarantees. The method supports inexact evaluations of the proximal subproblems and backtracking line-search on the smoothness parameter, and it allows exploiting the possible strong convexity of one of the components of the objective function. The analysis relies on a now standard Lyapunov argument of the same type as that of [22], and the numerical behavior is illustrated on a factorization problem and a total variation problem. We further provide a version of the hybrid proximal extragradient method [21] that can exploit strong convexity of the objective.