Quadratic error bound of the smoothed gap and the restarted averaged primal-dual hybrid gradient

We study the linear convergence of the primal-dual hybrid gradient method. After a review of current analyses, we show that they do not properly explain the behavior of the algorithm, even on the simplest problems. We thus introduce the quadratic error bound of the smoothed gap, a new regularity assumption that holds for a wide class of optimization problems. Equipped with this tool, we manage to prove tighter convergence rates. Then, we show that averaging and restarting the primal-dual hybrid gradient allows us to better leverage the regularity constant. Numerical experiments on linear and quadratic programs, ridge regression and image denoising illustrate the findings of the paper.


Introduction
Primal-dual algorithms are widely used for the resolution of optimization problems with constraints. Thanks to them, we can replace complex nonsmooth functions, like those encoding the constraints, by simpler, sometimes even separable, functions, at the expense of solving a saddle point problem instead of an optimization problem. This amounts to replacing a complex optimization problem by a sequence of simpler problems. In this paper, we shall consider more specifically
$$\min_{x \in \mathcal{X}} \; f(x) + f_2(x) + (g \,\square\, g_2)(Ax), \qquad (1)$$
where f and g are convex with easily computable proximal operators, $A : \mathcal{X} \to \mathcal{Y}$ is a linear operator, and $f_2$ and $g_2^*$ are differentiable with $L_f$- and $L_{g^*}$-Lipschitz gradients. Here, $(g \,\square\, g_2)(z) = \inf_y g(y) + g_2(z - y)$ is the infimal convolution of g and $g_2$. To encode constraints, we just need to take an indicator function for g. When using a primal-dual method, one looks for a saddle point of the Lagrangian, which is given by
$$L(x, y) = f(x) + f_2(x) + \langle Ax, y \rangle - g^*(y) - g_2^*(y).$$
Of course, we shall assume throughout this paper that saddle points do exist, which can be guaranteed using conditions like Slater's constraint qualification condition [4]. A natural question is then: at what speed do primal-dual algorithms converge? This is trickier for saddle point problems than for problems in primal form only. For instance, if we just assume convexity, methods like the Primal-Dual Hybrid Gradient (PDHG) [6] or the Alternating Directions Method of Multipliers (ADMM) [17] can be very slow, with a worst-case rate of convergence in O(1/√k) [10]. Yet, if we average the iterates, we obtain an ergodic rate in O(1/k). Nevertheless, it has been observed that, except for specially designed counter-examples, the averaged algorithms usually perform less well than the plain algorithm. This is not unexpected. Indeed, the problem one is interested in has no reason to be the most difficult convex problem. In order to get a more positive answer, we should understand what makes a given problem easier to solve than another. In the case of gradient descent, strong convexity of the objective function implies a linear rate of convergence, and the more strongly convex the function, the faster the algorithm. Strong convexity can be generalized to the quadratic error bound (QEB) of the objective and the Kurdyka-Lojasiewicz inequality in order to show improved rates for a large class of functions [5].
Before going further, let us discuss how one quantifies convergence speed for saddle point problems. Several measures of optimality have been considered in the literature. The most natural ones are the feasibility error and the optimality gap. They directly fit the definition of the optimization problem at stake. However, one cannot compute the optimality gap before the problem is solved. Hence, in algorithms, we usually use the Karush-Kuhn-Tucker (KKT) error instead. It is a computable quantity, and if the Lagrangian's gradient is metrically subregular [28], then a small KKT error implies that the current point is close to the set of saddle points. When the primal and dual domains are bounded, the duality gap is a very good way to measure optimality: it is often easily computable and it is an upper bound on the optimality gap. A generalization to unbounded domains has been proposed in [30]: the smoothed gap, based on the smoothing of nonsmooth functions [25], takes finite values even for constrained problems, unlike the duality gap. Moreover, if both the smoothing parameter and the smoothed gap are small, then the optimality gap and the feasibility error are both small. In the present paper, we shall reuse this concept not only to show a convergence speed but also to define a new regularity assumption that we believe is better suited to the study of primal-dual algorithms.
Regularity conditions for saddle point problems have been investigated more recently than for plain optimization problems. The most successful one is the metric subregularity of the Lagrangian's generalized gradient [22]. It holds, among others, for all linear-quadratic programs [21] and implies a linear convergence rate for PDHG and ADMM, as well as for the proximal point algorithm [24]. One can also show linear convergence if the objective is smooth and strongly convex and the constraints are affine [13, 2, 29]. If the function defined as the maximum of the objective gap and the constraint error has the error bound property, then we can also show improved rates [23]. These results can also be extended to the coordinate descent case [32, 1], as well as to the setup of distributed computations, where doing fewer communication steps is an important matter [20]. The other assumptions look more restrictive because they require some form of strong convexity. Yet, we will see that for a problem that satisfies two assumptions, the rate predicted by each theory may be different.
Our contribution is as follows.
• In Section 2, we formally review the main regularity assumptions and make first comparisons.
• In order to make deeper comparisons, we analyze PDHG in detail in Sections 3 and 4 under each assumption. This choice is motivated by the self-contained nature of the method, which does not require solving any subproblem.
• In Section 5, we show that the present regularity assumptions may not properly reflect the behavior of PDHG, even on a very simple optimization problem.
• We introduce a new regularity assumption in Section 6: the quadratic error bound of the smoothed gap. We then show its advantages over previous approaches. The smoothed gap was introduced in [30] as a tool to analyze and design primal-dual algorithms. Here, we use it directly in the definition of the regularity assumption. We analyze PDHG under this assumption in Section 7.
• We then present and analyze the Restarted Averaged Primal-Dual Hybrid Gradient (RAPDHG) in Section 8 and show that in some situations, it leads to a faster algorithm. An adaptive restart scheme is also presented for the cases where the regularity parameters are not known. This is a first step in leveraging our new understanding of saddle point problems to design more efficient algorithms.
• The theoretical results are illustrated in Section 9, devoted to numerical experiments.
We note striking similarities between this paper and the concurrent work of Applegate, Hinder, Lu and Lubin [3]. Although they focus on linear programs, the authors analyze PDHG and other first order methods thanks to the sharpness of the restricted duality gap. Indeed, in the case of linear programs, the restricted duality gap is a computable finite-valued measure of optimality and it is always sharp. The methodology is very similar, except that the arguments are tailored to linear programs.

Regularity assumptions for saddle point problems
In this section, we define three regularity assumptions for saddle point problems from the literature. We will then present their range of application.

Notation
We shall denote X the primal space and Y the dual space. We assume that those vector spaces are Hilbert spaces. Let us denote Z = X × Y the primal-dual space. Similarly, for a primal vector x and a dual vector y, we shall denote z = (x, y). This notation will be used throughout the paper: for instance, x̄ and ȳ will be the primal and dual parts of the vector z̄. For z = (x, y) ∈ Z and τ, σ > 0, we denote
$$\|z\|_V^2 = \frac{1}{\tau}\|x\|^2 + \frac{1}{\sigma}\|y\|^2 .$$
We will make use of the convex indicator function $\iota_C$ of a set C, equal to 0 on C and +∞ outside. In order to ease reading of the paper, we shall use a blue font for results that use the differentiable parts of the objective, f_2 and g_2, and an orange font for results that use strong convexity.

Definitions
The simplest regularity assumption is strong convexity.
The Lagrangian function is μ-strongly convex-concave, that is, (x ↦ L(x, y)) is μ-strongly convex for all y and (y ↦ L(x, y)) is μ-strongly concave for all x. This regularity assumption is used for instance in [6]. We can generalize strong convexity as follows.
Definition 2. We say that a function f : X → ℝ ∪ {+∞} has a quadratic error bound if there exist η > 0 and an open region R ⊆ X that contains arg min f such that for all x ∈ R,
$$f(x) - \min f \ge \frac{\eta}{2} \operatorname{dist}(x, \arg\min f)^2 .$$
We shall abbreviate this as: f has an η-QEB.
Although this is more general than strong convexity, the quadratic error bound is an assumption which is not general enough for saddle point problems. Indeed, for the fundamental class of problems with linear constraints, (y ↦ L(x, y)) is linear. Thus, it cannot satisfy a quadratic error bound in y. To resolve this issue, we may resort to metric subregularity. This assumption is used for instance in [13]. The indicator functions encode the constraint Ax = b. Assumption 4. Suppose that $g_2 = \iota_{\{0\}}$ and $g = \iota_{b + \mathbb{R}^m_-}$, so that we encode the constraints Ax − b ≤ 0. Denote x* a minimizer of (1) and X* the set of minimizers. The problem with inequality constraints satisfies the error bound if there exists μ > 0 such that … This regularity assumption is used to deal with functional inequality constraints in [23], but we restrict our study to linear inequalities to simplify the exposition of this paper. Yet, since it involves primal quantities only, it is not really adapted to a primal-dual algorithm and we will not discuss it much further in this paper.
The next two propositions show that, for the minimization of a convex function, the quadratic error bound of the objective is essentially equivalent to metric subregularity of the subgradient.
In Table 1, we can see that the situation is more complex for saddle point problems than for plain optimization problems. Indeed, none of the assumptions is a generalization of the others. Yet, metric subregularity seems to be the most general, since it holds for more types of problems. In particular, all linear programs and quadratic programs have a metrically subregular Lagrangian generalized gradient [21].
Basic inequalities for the study of PDHG

Primal-Dual Hybrid Gradient (also known as asymmetric forward-backward-adjoint) is the algorithm defined by Algorithm 1. We shall use the formulation of [21] rather than the analysis of [8, 31]. Note that the algorithm of Chambolle and Pock [6] can be recovered in the case f_2 = 0 by taking z̄_{k+1} as a state variable instead of z_{k+1}. PDHG is widely used for the resolution of large-dimensional convex-concave saddle point problems. Indeed, this algorithm only requires simple operations, namely matrix-vector multiplications, proximal operators and gradients, while keeping good convergence properties. We refer the reader to [9] for a review of variants of the algorithm and their analysis. As shown in [19], the proof techniques for all these variants share strong similarities and we believe that the results of the present paper could easily be adapted to them.
It can conveniently be seen as a fixed point algorithm z_{k+1} = T(z_k), where T is defined by (3). For z = (x, y) ∈ Z, we denote … We will first show that the fixed point operator T is an averaged operator [4] in this norm. Then, we will give an upper bound on the Lagrangian gap and a convergence result. All the results are small variations of already known facts, so we defer the proofs to the appendix. Note that we may have adapted the results for our purpose.
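To make the iteration concrete, here is a minimal sketch of one common special case, f_2 = g_2 = 0 (the Chambolle-Pock form mentioned above). The names `prox_tau_f` and `prox_sigma_gstar` are placeholders for user-supplied proximal operators; this is an illustration, not the exact variant analyzed in the paper.

```python
import numpy as np

def pdhg(prox_tau_f, prox_sigma_gstar, A, x0, y0, tau, sigma, iters):
    """PDHG sketch for min_x f(x) + g(Ax), assuming f2 = g2 = 0:
    primal proximal step, then dual proximal step at the extrapolated point."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_new = prox_tau_f(x - tau * (A.T @ y))
        # the dual update uses the extrapolated primal point 2*x_new - x
        y = prox_sigma_gstar(y + sigma * (A @ (2 * x_new - x)))
        x = x_new
    return x, y
```

For instance, for min ½‖x‖² subject to Ax = b, one takes prox_tau_f(v) = v/(1+τ) and prox_sigma_gstar(w) = w − σb, and the iterates converge to the saddle point provided στ‖A‖² < 1.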
Lemma 1 (Prop. 12.26 in [4]). Let p = prox_{τf}(x) and p′ = prox_{τf}(x′), where f is μ_f-strongly convex. For all x and x′, … The following lemma can mostly be found in [21, Theorem 2.5]. In comparison, we write everything in the same norm ‖·‖_V and we do not restrict z to being a saddle point of the Lagrangian. Lemma 2. Let T : X × Y → X × Y be defined for any (x, y) by (3). Suppose that ∇f_2 is L_f-Lipschitz continuous and ∇g_2^* is L_{g^*}-Lipschitz continuous. If the step sizes satisfy γ = στ‖A‖² < 1, … then T is 1/(1+λ)-averaged, where …, which means that for z = (x, y) and z′ = (x′, y′), … As a consequence, (z_k) converges to a saddle point of the Lagrangian. Moreover, if … A side result of independent interest, proved within Lemma 2, is as follows.
Lemma 3. For any z* ∈ Z*, Ṽ satisfies … As noted in [19], the case α_f > 1/2 is not covered by most of the convergence speed results in the literature. We propose here an extension of the results in the proof of [6, Theorem 1] that allows the larger step size range 0 ≤ α_f < 1 in which convergence is guaranteed.
For all k ∈ ℕ and for all z ∈ Z, … where … The next proposition is adapted from Theorem 1 in [6]. We shall show in Section 8 how to generalize it to τL_f < 2. Proposition 4. Let z_0 ∈ Z and let R ⊆ Z. If στ‖A‖² + σL_{g^*} ≤ 1 and τL_f ≤ 1, then we have the stability … and the restricted duality gap $G(z, \mathcal{R}) = \sup_{\bar z \in \mathcal{R}} L(x, \bar y) - L(\bar x, y)$. We have the sublinear iteration complexity …

Linear convergence of PDHG

In this section, we show that under the regularity assumptions stated in Section 2, the Primal-Dual Hybrid Gradient converges linearly. Most of the results were already known; we only slightly improve some constants. Hence, in this section also, we defer some of the proofs to Appendix B. We begin with a technical lemma showing that z̄_{k+1} is close to z_{k+1}.
Proof. We use the fact that for any z, z′, … and we get the result of the lemma.
The next proposition is a modification of [14, Theorem 4] in order to allow 1/2 < α_f < 1. Here, we also concentrate on the deterministic version of PDHG. We put the proof in the main text because the proof of Theorem 1 in Section 7 will reuse some of its arguments. Proposition 5. If L is μ-strongly convex-concave in the norm ‖·‖_V, then the iterates of PDHG satisfy for all k, … where z* is the unique saddle point of L, a_2 = max(…) and λ is defined in Lemma 2. Proof. From Lemma 4 applied at z = z*, we have … In order to deal with the case a_2 ≥ 0, we add to this inequality a times (4), where a ≥ 0, … Since L is μ-strongly convex-concave, (x ↦ L(x, y*)) is minimized at x* and (y ↦ L(x*, y)) is minimized at y*, so we have … We combine these two inequalities with Lemma 3 and Lemma 5 to get, for all α ∈ (0, 1) and a ≥ max(0, a_2), … We then choose α = μ/(λ(a − a_2) + μ), so that μ(α^{-1} − 1) = λ(a − a_2), and we choose a = a_2 + 1 ≥ 0. Thus … We next study the second case where some primal-dual methods have been proved to have a linear rate of convergence [13], [2, Theorem 1], [29, Theorem 6.2], that is, minimizing a strongly convex objective under affine equality constraints. Here also, we pay attention to allowing 1/2 < α_f < 1 in our proof. Proposition 6. If f + f_2 has an (L_f + L_{f_2})-Lipschitz gradient and is μ_f-strongly convex, and $g + g_2 = \iota_{\{b\}}$, then PDHG converges linearly with rate …, where λ is defined in Lemma 2 and a_2 ≥ −1 is defined in Lemma 4.
Note that this does not contradict the lower bound of [27]. In [27], the authors consider the setup where the number of iterations is smaller than the dimension of the problem and show that the convergence is necessarily sublinear in the worst case. On the other hand, our result becomes useful after a number of iterations that may be large for ill-conditioned problems, but beyond which it is more optimistic.
Finally, we show that if the Lagrangian's generalized gradient is metrically subregular, then PDHG converges linearly. Compared to [21, Theorem 5], we obtain a rate where the dependence on the norm is directly taken into account in the definition of metric subregularity and does not appear explicitly in the rate.
Proposition 7. If ∂L is metrically subregular at z* for 0, for all z* ∈ Z*, with constant η > 0 in the norm ‖·‖_V, then (I − T) is metrically subregular at z* for 0 for all z* ∈ Z*, with constant bounded below by …, and PDHG converges linearly with rate 1 − …

Coarseness of the analysis

Strongly convex-concave Lagrangian
Suppose that f is μ_f-strongly convex and that g* is μ_{g*}-strongly convex. Then L is μ_L-strongly convex-concave in the norm ‖·‖_V with μ_L = min(μ_f τ, μ_{g*} σ). Note that in this case, the objective is the sum of the differentiable term g(Ax) and the strongly convex proximable term f(x). We have seen that this implies a linear rate of convergence for PDHG with rate (1 − cμ_L), with c close to 1. We may wonder which choice of τ and σ leads to the best rate. We need μ_L = min(μ_f τ, μ_{g*} σ) as large as possible and στ‖A‖² ≤ 1. Hence, we take τ and σ such that μ_f τ = μ_{g*} σ and στ‖A‖² = 1. This rate is optimal for this class of problems [26], which is noticeable.
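This step size choice can be spelled out as a short derivation: maximizing μ_L under the step size constraint reads

```latex
\max_{\tau, \sigma > 0} \ \min(\mu_f \tau, \, \mu_{g^*} \sigma)
\quad \text{subject to} \quad \sigma \tau \|A\|^2 \le 1,
```

whose solution balances the two terms, μ_f τ = μ_{g*} σ with στ‖A‖² = 1, i.e.

```latex
\tau = \frac{1}{\|A\|} \sqrt{\frac{\mu_{g^*}}{\mu_f}}, \qquad
\sigma = \frac{1}{\|A\|} \sqrt{\frac{\mu_f}{\mu_{g^*}}}, \qquad
\mu_L = \frac{\sqrt{\mu_f \, \mu_{g^*}}}{\|A\|}.
```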
We have seen in Proposition 3 that having a strongly convex-concave Lagrangian implies the metric subregularity of the Lagrangian's gradient. However, applying Proposition 7 with η = μ_L leads to a rate equal to (1 − cμ_L²), which is much worse than what we can show using the more specialized assumption. This means that metric subregularity applies to more problems, but at the price of a coarser analysis.

Quadratic problem
We consider the toy problem $\min_{x \in \mathbb{R}} \{ \frac{\mu}{2} x^2 : ax = b \}$, where a, b ∈ ℝ and μ ≥ 0. The Lagrangian is given by $L(x, y) = \frac{\mu}{2} x^2 + y(ax - b)$. Since ∇L is affine, we can see using an eigenvalue decomposition that ∇L is globally metrically subregular, with a constant in the norm ‖·‖_V. We can also do a direct calculation. For all α > 0 and the unique primal-dual optimal pair (x*, y*), … We choose α > 0 such that …, which leads to … Let us now try to solve this (trivial) problem using PDHG:
$$x_{k+1} = \frac{x_k - \tau a y_k}{1 + \tau \mu}, \qquad y_{k+1} = y_k + \sigma\big(a(2x_{k+1} - x_k) - b\big).$$
This can be written $z_{k+1} - z^* = R\,(z_k - z^*)$ for a fixed matrix R. Hence, we can compute the exact rate of convergence, which is given by the largest modulus of an eigenvalue of R different from 1.
We shall compare this actual rate with what is predicted by Proposition 7, that is 1 − …, and with what is predicted by Proposition 6, …. In Figure 1, we can see that there can be a large difference between what is predicted and what is observed, even for the simplest problem. Moreover, although the actual rate improves when μ increases, the metric subregularity constant decreases, so that this theory suggests the opposite of what is actually observed. On the other hand, using strong convexity explains the improvement of the rate when μ increases but does not manage to capture the linear convergence for μ = 0.
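This computation can be checked numerically. The sketch below assumes the PDHG updates specialize on this problem to x_{k+1} = (x_k − τa y_k)/(1 + τμ) and y_{k+1} = y_k + σ(a(2x_{k+1} − x_k) − b); the error z_k − z* then evolves linearly, and the exact rate is the spectral radius of the iteration matrix.

```python
import numpy as np

def pdhg_error_matrix(mu, a, tau, sigma):
    """Error-propagation matrix R of PDHG on min { mu/2 x^2 : a x = b }:
    (x_{k+1}-x*, y_{k+1}-y*) = R (x_k-x*, y_k-y*)."""
    s = 1.0 / (1.0 + tau * mu)  # scaling from the prox of (mu/2) x^2
    return np.array([
        [s, -tau * a * s],
        [sigma * a * (2.0 * s - 1.0), 1.0 - 2.0 * sigma * tau * a**2 * s],
    ])

def exact_rate(mu, a=1.0, tau=0.9, sigma=0.9):
    """Exact asymptotic convergence rate: spectral radius of R."""
    return np.abs(np.linalg.eigvals(pdhg_error_matrix(mu, a, tau, sigma))).max()
```

For μ = 0 and γ = στa² < 1, the eigenvalues form a complex pair of modulus √(1 − γ), so PDHG converges linearly even without strong convexity, consistent with the discussion above.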

Quadratic error bound of the smoothed gap
We now introduce a new regularity assumption that truly generalizes strongly convex-concave Lagrangians and smooth strongly convex objectives with linear constraints, and is as broadly applicable as metric subregularity of the Lagrangian's gradient. We call the function (z ↦ G_β(z, ż)) the smoothed gap centered at ż.
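The smoothed gap of [30], centered at ż = (ẋ, ẏ) with smoothing parameters β = (β_x, β_y), takes the following form (restated here, following [30]):

```latex
G_\beta(z, \dot z) \;=\; \sup_{(x', y') \in \mathcal{Z}}\; L(x, y') - L(x', y)
\;-\; \frac{\beta_x}{2}\, \|x' - \dot x\|^2 \;-\; \frac{\beta_y}{2}\, \|y' - \dot y\|^2 .
```

For β = 0 this reduces to the usual duality gap, while the quadratic penalties keep it finite even for constrained problems with unbounded domains.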

Main assumption
Although the smoothed gap can be defined for any center ż, the next proposition shows that if ż = z* ∈ Z*, then the smoothed gap is a measure of optimality.
Proof. We first remark that G_0(z, z*) is the usual duality gap and that G_∞(z; … For the converse implication, we denote … By the strong convexity of the problem defining G_β(·; z*), we know that … With a similar argument for x_β(y), we get … Thus, if G_β(z; z*) = 0, then y_β(x) = y* and x_β(y) = x*.
This completes the proof of the proposition. Assumption 5. There exist β = (β_x, β_y) ∈ ]0, +∞]², η > 0 and a region R ⊆ Z such that for all z* ∈ Z*, G_β(·, z*) has a quadratic error bound with constant η in the region R and in the norm ‖·‖_V. Said otherwise, for all z ∈ R, … The next proposition, which is a simple consequence of [16, Prop. 1], says that even though the QEB is a local concept, it can be extended to any compact set at the expense of degrading the constant.

Problems with strong convexity
We now give a few examples to show that Assumption 5 is often satisfied.
Proposition 11. If (…) has a μ̄-QEB and f + f_2 is μ_f-strongly convex, then the smoothed gap has a QEB: … Note that we require either μ_f > 0 or μ̄ > 0.
Proof.The proof is a generalization of Proposition 6 and reuses most of the argument.
We decompose $x = x_A + x_{A^\perp}$ with $x_{A^\perp} = P_{\{x' : Ax' = b\}}(x)$, and … Moreover, by convexity of f + f_2 and the optimality condition ∇f …, where the last inequality comes from the assumption on the primal function and the smoothness of ∇(f + f_2). We combine this with … to get, for all λ ∈ [0, 1] and α > 0, … For the dual vector, we use the smoothness of the objective, the equality ∇f … For a ∈ ℝ, we restrict ourselves to $x = x^* + a A^\top (y^* - y)$, so that sup … Moreover, as in Proposition 6, we know that $\|A^\top y - A^\top y^*\| \ge \sigma_{\min}(A) \operatorname{dist}(y, Y^*)$, where $\sigma_{\min}(A)$ is the smallest singular value of A.
Combining this with (7) yields the result of the proposition.
Proposition 12. Suppose that X and Y are finite-dimensional. Suppose that f, f_2, g, g_2 are convex piecewise linear-quadratic, which means that their domain is a union of polyhedra and that on each of these polyhedra, they are quadratic functions. Then for all β ∈ [0, +∞[², there exist η(β) and R(β) such that $G_\beta(z, z^*) \ge \frac{\eta(\beta)}{2} \operatorname{dist}_V(z, Z^*)^2$ for all z ∈ R(β) and z* ∈ Z*. Proof. The proof follows the lines of [21]. The class of piecewise linear-quadratic functions is closed under scalar multiplication, addition, conjugation and Moreau envelope [28]. Hence, for all β ∈ [0, +∞[², G_β(·, z*) is piecewise linear-quadratic. As a consequence, its subgradient ∂_z G_β(·, z*) is piecewise polyhedral, and thus there exists η > 0 such that it satisfies metric subregularity with constant η at all z* ∈ Z* for 0 [11]. Since G_β(·, z*) is a convex function, this implies the result by Proposition 2.

Linear programs
In the rest of the section, we are going to show that linear programs do satisfy Assumption 5 and give the constant as a function of the Hoffman constant [18].
We consider the linear optimization problem … It happens that the set of primal-dual solutions of an LP is characterized by a system of linear equalities and inequalities. This holds true because a feasible primal-dual pair with equal objective values is necessarily optimal. We get the following system: … Let us denote by θ the Hoffman constant [18] of this system. This constant is the smallest positive number such that for all … It is known that the Lagrangian's subgradient of an LP satisfies metric subregularity with a constant proportional to θ [24]. We shall show that the same holds for the QEB of the smoothed gap centered at z*. Proposition 13. For any β ≥ 0, R > 0 and z* ∈ Z*, the linear program (8) satisfies the quadratic error bound: for all z such that G_β(z; z*) ≤ R, we have … Hence, for R of the order of 1/θ, G_{1/θ}(·, z*) has a (c/θ)-QEB with c independent of θ. Proof. See Appendix C.
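Assuming the LP is in the standard form $\min \{\langle c, x\rangle : Ax = b,\ x \ge 0\}$ (an assumption, since the precise form used above is not displayed), the optimality system characterizing the primal-dual solutions reads

```latex
Ax = b, \qquad x \ge 0, \qquad A^\top y \le c, \qquad \langle c, x \rangle - \langle b, y \rangle \le 0,
```

where the last inequality, together with weak duality, forces equality of the primal and dual values and hence optimality of the feasible pair (x, y).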

Analysis of PDHG under quadratic error bound of the smoothed gap
In this section, we show that under the new regularity assumption, PDHG converges linearly. Moreover, we give an explicit value for the rate. This result is central to the paper because it shows that the quadratic error bound of the smoothed gap is a fruitful assumption: not only is it as broadly applicable as the metric subregularity of the Lagrangian's generalized gradient, but the rates it predicts also reach the state of the art in all subcases of interest.
Theorem 1. Under Assumption 5, if R contains {z : …}, … where … Proof. In this proof, we will use the notation βz = (β_x x, β_y y) and $\|z\|_{\beta V}^2 = \frac{\beta_x}{\tau}\|x\|^2 + \frac{\beta_y}{\sigma}\|y\|^2$. By Lemma 4, we have … For z* = P_{Z*}(z_k), the projection of z_k onto the set of saddle points in the norm … For the right-hand side, we are looking for z such that …, where the last inequality comes from our choice of z*. We also have, by Lemma 2, … Using Assumption 5, this leads to: for all Λ ∈ [0, 1], … Using Lemma 5 and Lemma 3, we get, as soon as Λ … So, taking α = η/(λ + η) and Λ = λ / max((1 + a_2)λ + 1/β_x, (2 + a_2)λ + 1/β_y) ≤ 1 leads to …, and thus the algorithm enjoys a linear rate of convergence.

Strongly convex-concave Lagrangian
If the Lagrangian is strongly convex-concave, then we can take β = (+∞, +∞) and η = μ (Proposition 10), so that we recover the rate of Proposition 5. Note that in that case, the rate of order 1 − cμ given by Proposition 5, and hence by its generalized version Theorem 1, is much better than what Proposition 7 tells us: a rate of order 1 − cμ². Hence, we can see that for this important particular case, the rate predicted using the quadratic error bound of the smoothed gap is more informative than the one using the metric subregularity of the Lagrangian's gradient. Moreover, the new assumption applies to all piecewise linear-quadratic problems, making it at the same time accurate and general.
Back to the toy problem. We consider again the linearly constrained 1D problem $\min_{x \in \mathbb{R}} \{ \frac{\mu}{2} x^2 : ax = b \}$, where a, b ∈ ℝ and μ ≥ 0, introduced in Section 5.2, and we calculate the quadratic error bound of the smoothed gap.
As we have seen in Proposition 11, we can leverage the strong convexity of the objective.But also the smoothed gap may enjoy a quadratic error bound even if the objective is not strongly convex.
Since the algorithm does not depend on β_x or β_y, we can choose them so that they minimize the rate (or maximize ρ). In Figure 2, we can see that the rate of convergence obtained using the quadratic error bound of the smoothed gap is as good as the rate using strong convexity (Assumption 3) when μ is large, and it does not vanish when μ goes to 0. On top of this, for small values of μ, we obtain a much better rate than what is predicted using metric subregularity.
In Appendix D, Proposition 17, we derive a finer analysis in the case where we solve a linearly constrained problem whose objective function is strongly convex. Indeed, we can show that the largest singular value of the matrix R described in Section 5.2 is 1 − γ. Yet, its spectral radius is much smaller. This implies that a contraction on ‖z_k − z*‖²_V is not enough to account for the actual rate. We propose to combine it with a contraction on ‖z_{k+1} − z_k‖²_V. The rationale for this addition is that for large strong convexity parameters, the primal sequence behaves as if it were tracking arg min_{x′} L(x′, y_k). This is a kind of slow-fast system where the dual variable varies slowly and the primal variable is fast.
When we plot the curve of the rate as a function of μ_f (with the legend "slow-fast double concentration rate"), we can see that this more complex analysis manages to explain the improvement of the rate for an increasing strong convexity parameter, together with its degradation when the parameter becomes too large.

Restarted averaged primal-dual hybrid gradient

In this section, we will see how our new understanding of the rate of convergence of PDHG can help us design a faster algorithm.
Let averaged PDHG be given by Algorithm 2. On the class of convex functions, averaged PDHG has an improved worst-case convergence speed in O(1/k), while plain PDHG converges in O(1/√k) [10].
Algorithm 2 Averaged Primal-Dual Hybrid Gradient - APDHG(x_0, y_0, K). For k ∈ {0, . . . , K − 1}: … However, when averaging, we lose the linear convergence on well-behaved problems. We thus propose to restart the algorithm as in Algorithm 3. The following proposition shows that RAPDHG enjoys an improved rate of convergence where the product βη is replaced by min(β, η). Hence, for problems where η(β) is a decreasing function of β, like linear programs, we can expect an improved convergence rate from averaging and restarting.
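A minimal sketch of the restart-averaging idea follows: uniform averaging of the PDHG iterates over each cycle, then restarting the next cycle from the average. The paper's Algorithms 2-3 may use different weights or stopping rules, and the proximal operators are placeholder arguments.

```python
import numpy as np

def rapdhg(prox_tau_f, prox_sigma_gstar, A, x0, y0, tau, sigma,
           restart_period, n_restarts):
    """Restarted averaged PDHG sketch: run PDHG for `restart_period` steps,
    average the iterates, then restart the next cycle from the average."""
    x, y = x0.copy(), y0.copy()
    for _ in range(n_restarts):
        xc, yc = x.copy(), y.copy()
        x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
        for _ in range(restart_period):
            x_new = prox_tau_f(xc - tau * (A.T @ yc))
            yc = prox_sigma_gstar(yc + sigma * (A @ (2 * x_new - xc)))
            xc = x_new
            x_avg += xc
            y_avg += yc
        x, y = x_avg / restart_period, y_avg / restart_period
    return x, y
```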
Consider z* ∈ Z* and denote a₂⁺ = max(0, a₂). We combine (6) with a₂⁺/2 times (4) to get … Summing this inequality for k between 0 and K − 1, using the fact that the Lagrangian is convex-concave and that a₂ − a₂⁺ ≤ 0, we get …, which leads to …, and so, as soon as Kβ > 1, …, since the maximum of the right-hand side is attained at … We now use Assumption 5 to get … We choose z* = P_{Z*}(z_0) and K such that Kβ ≥ 2 and Kη ≥ 2(2 + a₂⁺) in order to get … If we choose K = max(2/β, 2(2 + a₂⁺)/η), we thus get linear convergence …, where sK is the total number of iterations.
The rate of convergence of RAPDHG has two nice features compared to plain PDHG. Indeed, there is a factor Λ in Theorem 1 in front of the quadratic error bound constant η, which is of order λβ when β is small. On the other hand, the rate of RAPDHG has no direct dependence on λ, which means that it will behave well even if στ‖A‖² is close to 1. Moreover, it replaces βη by min(β, η), which is orders of magnitude better in the case of linear programs, where η = O(β) for β = 1/θ (Proposition 13).

Self-centered smoothed gap
In this paper, we have shown that the smoothed gap is a convenient quantity for the analysis of PDHG and that assuming that it satisfies a quadratic error bound condition explains its behaviour well. However, since computing it requires the knowledge of a saddle point, one cannot use the smoothed gap for algorithmic design, and in particular for the tuning of RAPDHG.
We thus propose the following approximation, that we call the self-centered smoothed gap.
The motivation for this definition is the following lemma.
Lemma 6.For all z, ż ∈ Z and z * equal to the projection of ż onto Z * , Proof.
This shows that G β (z, ż) is a good approximation to the measure of optimality G 2β (z, z * ) as soon as β is small enough or ż is close enough to z * .It happens that for ż = z, we can prove even more.
For the second point, it is clear by Proposition 8 that G β (z * , z * ) = 0.For the converse implication, we shall do the proof only for β > 0 because G 0 (z, z) is the usual duality gap.
In the numerical experiment section, we shall use the self-centered smoothed gap as a stopping criterion with β = (0, δ) where δ is the dual infeasibility.

Adaptive restart
We now modify RAPDHG so that instead of using the unknown quantities β and η to set the restart period K, we monitor the self-centered smoothed gap and restart when this quantity has been halved. In order to take into account cases where averaging is detrimental, we then compare z_k and z̄_k and restart at the better of the two in terms of smoothed gap. This adaptive restart is formalized in Algorithm 4 and justified by the following proposition.
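The restart rule can be sketched generically as follows; `pdhg_step` and `gap` are placeholder oracles (in the paper the monitored quantity is the self-centered smoothed gap, abstracted away here), and a restart jumps to the better of the last iterate and the running average once the monitored gap has been halved.

```python
import numpy as np

def adaptive_restart(pdhg_step, gap, z0, max_iters):
    """Adaptive restart sketch: track a running average, and once the
    monitored gap has halved since the last restart, restart from the
    better of the last iterate and the average."""
    z = z0.copy()
    gap_at_restart = gap(z)
    z_sum, count = np.zeros_like(z0), 0
    for _ in range(max_iters):
        z = pdhg_step(z)
        z_sum += z
        count += 1
        z_bar = z_sum / count
        best = z_bar if gap(z_bar) < gap(z) else z
        if gap(best) <= gap_at_restart / 2:
            z = best
            gap_at_restart = gap(best)
            z_sum, count = np.zeros_like(z0), 0
    return z
```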
Proof. As in Proposition 14, we have for all z, … Summing (6) for l between s and k − 1 and using the fact that the Lagrangian is convex-concave, we get for all z, … We go on with … This supremum is attained at z = z̄_k + … because Lemma 2 implies that ‖z_k − z*‖ ≤ ‖z_s − z*‖ for all k ≥ s, and thus also ‖z̄_k − z*‖ ≤ ‖z_s − z*‖. We now use the quadratic error bound of the self-centered smoothed gap, which holds thanks to Proposition 15.

Numerical experiments
In this last section, we present numerical experiments to illustrate the linear convergence behaviour of PDHG and RAPDHG. We first look at two linear programs to show that the linear rate of RAPDHG can be much faster than PDHG's. Then, we exemplify the limits of the methods with a ridge regression problem, where restarted averaging does not help, and a non-polyhedral problem, where we do not observe a linear rate of convergence.

Small linear program
The first experiment is on a small LP where the dual optimal set is known: … To estimate the quadratic error bound constant, we compute, for several values of β, an estimate of η. We can do this because Z* is known for this small problem. Using a similar idea, we can also get an estimate of the metric subregularity constant of the Lagrangian's gradient, here η ≈ 0.0187.
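To illustrate how such empirical estimates can be obtained, here is a sketch on a different tiny LP (not the one from the paper) whose unique primal-dual solution x* = (1, 0), y* = −1 is known in closed form; the empirical per-iteration rate of PDHG is read off the tail of the distance curve.

```python
import numpy as np

# Illustrative LP (not the paper's): min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
z_star = np.array([1.0, 0.0, -1.0])            # known solution (x*, y*)
tau = sigma = 0.9 / np.linalg.norm(A, 2)

def run_pdhg_lp(iters):
    x, y = np.zeros(2), np.zeros(1)
    dists = []
    for _ in range(iters):
        x_new = np.maximum(x - tau * (c + A.T @ y), 0.0)  # prox of <c,.> + indicator(x >= 0)
        y = y + sigma * (A @ (2 * x_new - x) - b)         # prox of g* for g = indicator({b})
        x = x_new
        dists.append(np.linalg.norm(np.concatenate([x, y]) - z_star))
    return x, dists

x, dists = run_pdhg_lp(100)
# geometric-mean per-iteration rate over the last 50 iterations
rate = (dists[99] / dists[49]) ** (1.0 / 50.0)
```

On this example the measured rate is well below 1, consistent with the locally linear behaviour of PDHG on LPs; the estimation of η plays the analogous role in the text when Z* is known.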
In Figure 3, we can see that the actual rate of convergence is rather close to what is predicted by theory. Moreover, RAPDHG is much faster than PDHG. Yet, note that thousands of iterations for an LP with 4 variables and 3 constraints is not competitive with the state of the art.

Larger polyhedral problem
We then run an experiment on a more realistic problem. We run PDHG and RAPDHG with adaptive restart on the following sparse SVM problem: min … where $(y_i, x_{i,:})_{1 \le i \le n}$ are the data points from the a1a dataset [7] (d = 119 and n = 1,605). We normalized the data matrix so that $\|x_{:,j}\|_2 = 1$.
The convergence profile is given in Figure 4. The behaviour of the algorithms is similar to what was seen on the small problem. Here, however, we can clearly see two phases. In the beginning, we observe sublinear convergence, during which restart and averaging do not help. Then, the linear rate kicks in after a non-negligible time. We believe this is related to the condition G_β(z; z*) ≤ R in Proposition 13. Note that this cold-start phase is quite long: on our laptop computer with an Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (4 logical cores), it took 5.7s, while the adaptive proximal point method of [24] took 0.93s to solve the problem.

Ridge regression
In this experiment, we test a problem where restarting does not help. We consider least squares with ℓ2 regularization, where A and b are given by the real-sim dataset [7]. Since we know the strong convexity-concavity parameter of the Lagrangian, we choose the step sizes σ and τ as in Section 5.1. As a consequence, PDHG has a convergence rate that matches the theoretical lower bound for this class of problems and cannot be improved. We can see in Figure 5 that, as expected, restart and averaging do not help: z_k is consistently better than z̄_k, so that RAPDHG with adaptive restart selects the same sequence as PDHG and the two curves match. We added a comparison with restarted FISTA [15] to show that this choice of step sizes indeed suffices to obtain an algorithm with an accelerated rate.
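For concreteness, here is a minimal PDHG sketch for an ℓ2-regularized least squares problem, assuming the splitting f(x) = (λ/2)||x||² and g(z) = (1/2)||z − b||², so that both proximal operators are available in closed form; the function name `pdhg_ridge` and the chosen step sizes are illustrative, not the paper's code.

```python
import numpy as np

def pdhg_ridge(A, b, lam, tau, sigma, n_iters=5000):
    """PDHG for  min_x (lam/2)||x||^2 + (1/2)||Ax - b||^2, written as a
    saddle point with f(x) = (lam/2)||x||^2 and g(z) = (1/2)||z - b||^2,
    so that both proximal operators are in closed form:
      prox_{sigma g*}(v) = (v - sigma*b) / (1 + sigma)
      prox_{tau f}(u)    = u / (1 + tau*lam)
    """
    m, n = A.shape
    x = x_prev = np.zeros(n)
    y = np.zeros(m)
    for _ in range(n_iters):
        x_bar = 2 * x - x_prev                        # primal extrapolation
        y = (y + sigma * (A @ x_bar - b)) / (1 + sigma)
        x_prev = x
        x = (x - tau * (A.T @ y)) / (1 + tau * lam)
    return x, y
```

At a fixed point, y = Ax − b and (λI + AᵀA)x = Aᵀb, i.e. the ridge regression solution, which gives an easy correctness check on random data.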

TV-L1
We consider the minimization of the following non-polyhedral function, where I is the Cameraman image, D is the 2D discrete gradient, ||z||_{2,1} = Σ_{p∈P} √(z_{p,1}² + z_{p,2}²) and λ = 1.9. This problem is not piecewise linear-quadratic, so our linear convergence result does not hold. Yet, it is rather structured: it is equivalent to a second-order cone program. We can see in Figure 6 that this is a difficult problem for PDHG, but that RAPDHG does improve the convergence speed significantly. The solution we obtain is shown in Figure 7.
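The discrete gradient D and the ||·||_{2,1} norm used in this experiment can be implemented with forward differences; the sketch below (the names `grad2d`, `grad2d_adj` and `norm21` are ours) also provides the adjoint, whose correctness can be checked through the inner-product identity ⟨Du, g⟩ = ⟨u, Dᵀg⟩.

```python
import numpy as np

def grad2d(u):
    """Forward-difference discrete gradient D: returns shape (2, H, W),
    with zero (Neumann) boundary on the last row/column."""
    g = np.zeros((2,) + u.shape)
    g[0, :-1, :] = u[1:, :] - u[:-1, :]   # vertical differences
    g[1, :, :-1] = u[:, 1:] - u[:, :-1]   # horizontal differences
    return g

def grad2d_adj(g):
    """Adjoint of grad2d, i.e. <grad2d(u), g> = <u, grad2d_adj(g)>."""
    u = np.zeros(g.shape[1:])
    u[1:, :] += g[0, :-1, :]
    u[:-1, :] -= g[0, :-1, :]
    u[:, 1:] += g[1, :, :-1]
    u[:, :-1] -= g[1, :, :-1]
    return u

def norm21(g):
    """The mixed norm ||z||_{2,1} = sum_p sqrt(z_{p,1}^2 + z_{p,2}^2)."""
    return np.sqrt(g[0] ** 2 + g[1] ** 2).sum()
```

The adjoint test on random arrays is the standard way to validate such operator pairs before plugging them into a primal-dual solver.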

Conclusion
In this paper, we have tried to understand the linear rate of convergence of the primal-dual hybrid gradient method. Even on a very simple problem, we have seen that current regularity assumptions are not sufficient to explain the behavior of the algorithm. We have then introduced the quadratic error bound of the smoothed gap and argued that this new condition is more widely applicable and more precise than previous ones. Finally, we showed how this new knowledge can be used to improve the algorithm.
This work opens several perspectives:
• Can the quadratic error bound of the smoothed gap be used to better understand the convergence rate of other primal-dual algorithms? Interesting cases include ADMM, the augmented Lagrangian method and coordinate update methods, to cite a few.
• We have seen in (11) that the smoothed gap at a non-optimal point can approximate the smoothed gap at an optimal point. Considering it as a stopping criterion would be an alternative to the KKT error, which implicitly requires metric subregularity to make sense, and to the duality gap, which is +∞ nearly everywhere for linearly constrained problems.
• Our first attempt at designing a primal-dual algorithm with an improved linear rate of convergence has shown the usefulness of our regularity assumption. Would we be able to design an optimal algorithm for the class of problems with a given quadratic error bound of the smoothed gap function?
A Proofs of Section 3

Lemma 1 Let p = prox_{τf}(x) and p′ = prox_{τf}(x′), where f is µ_f-strongly convex.

Proof. For any x, the function h : z ↦ τf(z) + ½||x − z||² is strongly convex and 0 ∈ ∂h(p). This implies the first inequality by Fermat's rule.
We now apply the first inequality at (x, p′) and at (x′, p) and then sum.
Rearranging the squared norm terms, we get the result.

Lemma 2 Let T : X × Y → X × Y be defined for any (x, y) by (3). Suppose that ∇f_2 is L_f-Lipschitz continuous and ∇g*_2 is L_{g*}-Lipschitz continuous. If the step sizes satisfy the stated condition, then T is 1/(1+λ)-averaged, which means that for z = (x, y) and z′ = (x′, y′), the corresponding averagedness inequality holds. As a consequence, (z_k) converges to a saddle point of the Lagrangian.
Proof. Later in the appendix, we will slightly improve this result in the case where f or g* is strongly convex. Note that all of what follows works even if µ_f = µ_{g*} = 0.
Since the proximal operator of a convex function is firmly nonexpansive, for (x, y), (x′, y′) ∈ Z, we obtain the stated inequalities; similar bounds hold for the other terms. We then choose α_f = τL_f/2 < 1 and α_g = σL_{g*}/2 < 1, which proves (4). Next, we prove the averagedness inequality, where λ ∈ [0, 1 − α_f] and α > 0 are arbitrary; λ and α are chosen in terms of √γ when f_2 = 0 and g_2 = 0, and differently when f_2 and g_2 are nonzero. We get that T is β-averaged with (1 − β)/β = λ, that is, β = 1/(λ + 1). For the convergence, we use the Krasnosel'skii–Mann theorem [4].
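The Krasnosel'skii–Mann argument invoked at the end of this proof can be illustrated on a toy nonexpansive map of our own choosing (not the paper's operator T): a plane rotation is an isometry with fixed point 0, the plain iteration circles forever, but the averaged iteration contracts to the fixed point.

```python
import numpy as np

# A plane rotation is nonexpansive (an isometry) with unique fixed point 0.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def km_iterate(T, z0, beta=0.5, n_iters=200):
    """Krasnosel'skii-Mann iteration z_{k+1} = (1 - beta) z_k + beta T(z_k)."""
    z = z0.copy()
    for _ in range(n_iters):
        z = (1 - beta) * z + beta * T(z)
    return z

z0 = np.array([1.0, 0.0])
z_plain = np.linalg.matrix_power(R, 200) @ z0  # plain iteration: stays on the circle
z_km = km_iterate(lambda z: R @ z, z0)         # averaged iteration: contracts to 0
```

Here the averaged map (I + R)/2 has spectral radius √2/2 < 1, so the averaged iterates converge linearly while the plain iterates keep unit norm.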
Proof. The last part of the proof of Lemma 2 shows that for any z, z′ ∈ Z, the corresponding inequality holds, from which we get the desired result.
Lemma 4 For all k ∈ N and for all z ∈ Z, inequality (6) holds, where a_2 ≥ −1 may be positive or negative.
Proof. By the Taylor–Lagrange inequality and the convexity of f_2 and g*_2, and by the definitions of x̄_{k+1} and ȳ_{k+1}, for all x ∈ X and y ∈ Y we obtain the stated inequalities. Summing these inequalities and using the relations x_{k+1} = x̄_{k+1} − τA^⊤(ȳ_{k+1} − y_k) and y_{k+1} = ȳ_{k+1} yields (6), where a_2 ≥ −1 may be negative or positive.
We then sum (6) for k between 0 and K − 1 and use the convexity in x and the concavity in y of the Lagrangian.

B Proofs of Section 4
Proposition 6 If f + f_2 has an (L_f + L_{f_2})-Lipschitz gradient and is µ_f-strongly convex, and g □ g_2 = ι_{{b}}, then PDHG converges linearly, with a rate in which λ is defined in Lemma 2 and a_2 ≥ −1 is defined in Lemma 4.
Proof. We know by Lemmas 4 and 3 that for all z = (x, y), the stated inequality holds. For the dual vector, we use the smoothness of the objective and the optimality condition on ∇f. For a ∈ R, we choose x = x* + aA^⊤(y* − ȳ_{k+1}). Moreover, we can show that ||A^⊤ȳ − A^⊤y*|| ≥ σ_min(A) dist(ȳ, Y*), where σ_min(A) is the smallest nonzero singular value of A. Indeed, Y* = {y : A^⊤y = −∇(f + f_2)(x*)} = P_{Y*}(ȳ) + ker A^⊤ is an affine space, where we denote by P_{Y*} the orthogonal projection onto Y*. We can then decompose ȳ as ȳ = P_{Y*}(ȳ) + z with z ∈ ker A^⊤. We now develop the squared norm. Using the fact that B is Lipschitz continuous with constant 2 max(α_f, α_g) in the norm ||·||_V and that ||z||_V = ||D^{−1/2}z||, this leads to the next bound. Gathering these three inequalities gives the claimed estimate. Finally, to prove the linear rate of convergence, we recall that for all z* ∈ Z*, the fixed-point inequality holds. Combined with the metric subregularity of (I − T), we get a contraction, and thus the linear rate of PDHG follows directly from this contraction property of the operator T.

C Proof of Proposition 13
Proposition 13 For any β ≥ 0, R > 0 and z* ∈ Z*, the linear program (8) satisfies the quadratic error bound: for all z such that G_β(z; z*) ≤ R, the stated inequality holds. Hence, for R of the order of 1/θ, G_{1/θ}(·; z*) has a (c/θ)-QEB with c independent of θ.
Proof. First of all, we calculate the smoothed gap for (8); it is given in (10). Our goal is to upper bound this by a function of S^P_β(x, y*). First, we note that S^P_β(x, y*) is the sum of many nonnegative terms. Suppose that S^P_β(x, y*) ≤ ε. Then each of these terms is smaller than ε. The most complex term is the last one. We shall consider separately two sub-cases: I_− = {j ∈ I : y*_j + σ_β(A_{j,:}x − b_j) ≤ 0} and its complement I_+ = I \ I_−.
Proof. We shall write the proof for µ_g > 0, even though we state the proposition for µ_g = +∞ only. We apply Lemma 2 to z = z_k and z′ = z_{k−1}, so that T(z) = z_{k+1} and T(z′) = z_k. Note that we apply the appendix version of Lemma 2 in order to make the most of strong convexity.
We combine this with the previous bound. To get the rate, we then need λ_2 = 1. We leave the choice of C ∈ [0, 1 − α_2] to a 1D grid search, since the rate depends strongly on its value. This yields λ_2 η_x − 2λ_3 γ(1 − α_2) = C η_x.

Definition 3. A set-valued function F : Z ⇒ Z is metrically subregular at z for b if there exist η > 0 and a neighborhood N(z) of z such that, for all z′ ∈ N(z), dist(F(z′), b) ≥ η dist(z′, F^{−1}(b)).

Assumption 3. The problem is a smooth, strongly convex, linearly constrained problem. Said otherwise, f + f_2 is strongly convex and differentiable, f and f_2 both have a Lipschitz continuous gradient, g_2 = ι_{{0}} and g = ι_{{b}}, where b ∈ Y.
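For a linear map F(z) = Mz and b in the range of M, metric subregularity holds globally with η equal to the smallest nonzero singular value of M, since dist(z, F^{−1}(b)) is the norm of the component of z − z0 lying in the row space of M (as F^{−1}(b) = z0 + ker M). A small numerical check of this fact (illustrative, not from the paper):

```python
import numpy as np

# Check dist(Mz, b) >= eta * dist(z, F^{-1}(b)) for F(z) = Mz, with eta
# the smallest nonzero singular value of M and b in the range of M.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))                 # generic rank-3 matrix
b = M @ rng.standard_normal(5)                  # b in range(M)
eta = np.linalg.svd(M, compute_uv=False).min()  # smallest nonzero sv (rank 3)

z0 = np.linalg.lstsq(M, b, rcond=None)[0]       # one particular solution
violations = 0
for _ in range(200):
    z = rng.standard_normal(5)
    d = z - z0
    # distance to F^{-1}(b) = z0 + ker(M): project d onto the row space of M
    d_row = M.T @ np.linalg.solve(M @ M.T, M @ d)
    if np.linalg.norm(M @ z - b) < eta * np.linalg.norm(d_row) - 1e-9:
        violations += 1
```

Since ||Mz − b|| = ||M d_row|| ≥ η ||d_row|| exactly in this linear setting, no sampled point should violate the bound.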

Figure 1 :
Figure 1: Comparison of the true rate (top line) and the rates predicted by theory (two lines below) for a = 0.03, τ = σ = 1 and various values of µ.
E and I are disjoint sets of indices such that E ∪ I = {1, . . ., m}, and N, F are disjoint sets of indices such that N ∪ F = {1, . . ., n}. A dual of this problem is given by

max_{y ∈ R^m} −b^⊤y  subject to  (A_{:,F})^⊤ y + c_F = 0,  (A_{:,N})^⊤ y + c_N ≥ 0,  y_I ≥ 0.

Figure 2 :
Figure 2: Comparison of the true rate ρ (top line), the rate predicted by previous theories, and the rate predicted using the quadratic error bound of the smoothed gap, for a = 0.03, τ = σ = 1 and various values of µ. We plot 1 − ρ in logarithmic scale.

Figure 4 :
Figure 4: Comparison of PDHG and RAPDHG on the sparse SVM problem with the a1a dataset. We plot the optimality measure for the last iterate.

Figure 5 :
Figure 5: Solving ℓ2-regularized least squares on the real-sim dataset.

Figure 6 :
Figure 6: Comparison of PDHG and RAPDHG on the ℓ1 ROF problem.

Figure 7 :
Figure 7: Left: original image. Right: solution; 59% of the pixels are unchanged.

Table 1: Domain of applicability of each assumption. "Strongly convex & smooth" means that g □ g_2 is a differentiable function and f + f_2 is strongly convex.