Inertial-relaxed splitting for composite monotone inclusions

In the spirit of the extension of the proximal point method developed by Alves et al. [2], we propose in this work an Inertial-Relaxed primal-dual splitting method to solve, by decomposition, the minimization of the sum of three convex functions, one of them smooth, considering a general coupling subspace. A unified setting is formalized and applied to different averaged maps whose fixed points are related to the solutions of the inclusion problem associated with our extended model. An interesting feature of the resulting algorithms is that they come in two distinct versions, with a Gauss–Seidel or a Jacobi flavor, extending in that sense former proximal ADMM methods, both including inertial and relaxation parameters. Finally, we present computational experiments on a class of fused LASSO instances of medium size.

Digital Object Identifier 10.5802/ojmo.22


Introduction
We will propose in this paper new versions of existing splitting methods for the following general convex minimization model, involving the sum of three convex functions, one of them smooth and the other two coupled by linear operators:

min_{x ∈ R^n, z ∈ R^p} f(x) + h(x) + g(z) subject to Ax + Bz = 0, (1)

where f : R^n → R and g : R^p → R are proper convex lsc functions, A and B are (m × n) and (m × p) matrices, respectively, and h : R^n → R is convex and (1/β)-Lipschitz-differentiable, with β positive. The above model can be seen as a functional version of the general inclusion problem of finding a zero of the sum of three monotone operators defined on a Hilbert space, one of them cocoercive (see Davis and Yin [12]). The role of the linear operators A and B in the coupling subspace is intended to cover a broad scope of current applications, justifying the reference to "composite inclusions" in the title.
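To fix ideas, the backward (proximal) and forward (gradient) building blocks used by all the splitting schemes discussed below can be sketched numerically. The following Python fragment is illustrative only; the ℓ1 and least-squares choices of f and h are our own examples, not part of model (1):

```python
import numpy as np

def prox_l1(v, t):
    """Backward step: prox of t*||.||_1, i.e. componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grad_quadratic(Q, b, x):
    """Forward step: gradient of the smooth part h(x) = 0.5*||Qx - b||^2.

    This h is (1/beta)-Lipschitz-differentiable with 1/beta = ||Q||_2^2,
    i.e. its gradient is beta-cocoercive with beta = 1/||Q||_2^2.
    """
    return Q.T @ (Q @ x - b)
```

Splitting methods only ever touch f and g through such proximal maps and h through its gradient, which is what makes the decomposition possible.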
That model has received a lot of attention recently, most of the work aiming at extending known splitting schemes adapted to the two functions case where h = 0. Here again, we will explore the corresponding splitting issues, thus designing algorithms which involve forward or backward steps associated with each function separately.
The celebrated Alternating Direction Method of Multipliers (ADMM) is one of the most important first-order splitting methods to solve (1) when h = 0 (see [14] for the original introduction or [5] for a survey with potential applications to signal processing). It can be seen as a dual approach based on the composite Augmented Lagrangian function, where the dual multipliers are denoted by y ∈ R^m:

l(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz⟩ + (1/2)∥Ax + Bz∥²_M,

where M is a positive definite (m × m) symmetric matrix and ∥a∥²_M = a^t M a the corresponding elliptic norm. ADMM basically consists in alternately minimizing that Lagrangian w.r.t. x and z separately, followed by the dual update of the y variables.
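The alternating structure just described can be sketched on a toy scalar instance where both Lagrangian minimizations have closed form. This hypothetical example (our own choice, not from the paper) takes f(x) = ½(x − p)², g(z) = ½(z − q)², A = 1, B = −1 and M = ρ, so the minimizer is x = z = (p + q)/2; we use the scaled multiplier u = y/ρ:

```python
# ADMM sketch on a toy instance of min f(x) + g(z) s.t. x - z = 0,
# with f(x) = 0.5*(x - p)^2 and g(z) = 0.5*(z - q)^2 (scalars for simplicity).
p, q, rho = 0.0, 2.0, 1.0
z, u = 0.0, 0.0                            # u = y / rho is the scaled multiplier
for _ in range(50):
    x = (p + rho * (z - u)) / (1 + rho)    # argmin in x of the augmented Lagrangian
    z = (q + rho * (x + u)) / (1 + rho)    # argmin in z, using the fresh x
    u = u + (x - z)                        # dual update of the multiplier
# x and z both approach (p + q) / 2 = 1
```

The x-minimization uses the previous z (Gauss–Seidel flavor); a Jacobi-flavored variant would minimize in x and z simultaneously from the same previous point.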
Many variants of ADMM have been developed, including unified variants condensed in Shefi and Teboulle's algorithms [22] in the case B = −I_{p×p}, where additional primal proximal terms are added to the respective Lagrangian functions, as in the Proximal Method of Multipliers [21]. These can easily be extended to the general case with any matrix B and are called hereafter Proximal Primal-Dual Splitting (PPDS), with two distinct interpretations as Gauss–Seidel and Jacobi-like versions.
Choosing the intermediate multiplier η_k appropriately gives two algorithmic versions, one of Gauss–Seidel and one of Jacobi type. We note that both algorithms result from applying the preconditioned proximal point method (corresponding to two appropriate positive semidefinite matrices, see [19]) to the following Lagrangian inclusion problem associated with (1) (case h = 0): find (x, z, y) ∈ R^n × R^p × R^m such that 0 ∈ L(x, z, y), where L is the maximal monotone map defined on R^n × R^p × R^m by

L(x, z, y) = (∂f(x) + A^t y) × (∂g(z) + B^t y) × {−(Ax + Bz)}. (2)

The convergence of these algorithms is obtained from their relationship with fixed point iterations applied to particular 1/2-averaged maps (each version corresponding to a different averaged map), in the same way as ADMM is related to the Douglas–Rachford map (see [13] for instance).
We will use the same strategy below to propose new and generalized splitting algorithms by inspecting averaged maps associated with model (1).
When a smooth part, represented by a function h, is added to the model, the aim is to further improve these algorithms by inserting forward gradient steps without destroying the splitting strategy. Condat [9] (and independently Vũ [23]) has developed two forms of algorithms considering two different Primal-Dual Forward-Backward Splitting (PDFB) schemes, whose corresponding Lagrangian maps have fewer variables than the map L defined by (2).
One of these algorithms considers the simplified formulation with B = −I_{p×p}; the other one switches the roles of the primal and dual variables. Both algorithms include the relaxation parameter ρ_k, which is known to accelerate convergence for values in (1, 2) (see [10]).
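A primal-dual forward-backward iteration of this kind can be sketched on a small total-variation denoising instance; the instance (f = 0, h(x) = ½∥x − c∥², g = µ∥·∥₁ composed with a difference matrix) is our own illustrative choice. The dual proximal step uses the Moreau identity (the prox of σg* is the projection onto the ∥·∥_∞ ≤ µ ball), and the stepsizes satisfy the usual condition 1/τ − σ∥A∥² ≥ 1/(2β):

```python
import numpy as np

c = np.array([0.0, 0.0, 10.0, 10.0])   # noisy piecewise-constant signal
mu = 0.1                               # total-variation weight
A = np.diff(np.eye(4), axis=0)         # (n-1) x n finite-difference matrix
tau, sigma = 0.5, 0.25                 # 1/tau - sigma*||A||^2 >= 1/(2*beta), beta = 1
x, y = np.zeros(4), np.zeros(3)
for _ in range(20000):
    # forward-backward primal step (prox of f = 0 is the identity)
    x_new = x - tau * (x - c) - tau * A.T @ y
    # dual step: prox of sigma*g* = projection onto the ||.||_inf <= mu ball
    y = np.clip(y + sigma * A @ (2 * x_new - x), -mu, mu)
    x = x_new
# x approaches [0.05, 0.05, 9.95, 9.95], which satisfies the optimality conditions
```

The relaxation parameter ρ_k mentioned above would average each new iterate with the previous one; it is omitted here (ρ_k = 1) for brevity.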
On the other hand, and without the relaxation effect (ρ_k = 1), Chambolle and Pock [8] have proposed a Primal-Dual Splitting with Inertial step (IPDS) method, shown to be closely related to Condat–Vũ's algorithm, Form I, but with an inertial accelerating term. The inertial parameter λ_k has indeed a different effect than the relaxation strategy; it was introduced in [1] and shares many features with Güler's accelerated Proximal Point algorithm [15], the latter being related to Nesterov's early optimal gradient methods for convex minimization [18].
One of the features of the generalized primal-dual splitting methods proposed in this paper is the inclusion of both the relaxation (ρ_k) and inertial (λ_k) parameters in the primal updates, inspired by Alves et al. [2], where the authors consider a relative-error Inertial-Relaxed variant of the Proximal Point Algorithm to produce variants of ADMM. Analogously, we will consider Inertial-Relaxed variants of fixed point iterations applied to different averaged maps.
When h = 0, the Condat–Vũ and Chambolle–Pock algorithms without relaxed or inertial terms can be deduced from PPDS in its Gauss–Seidel version (see [22]). More precisely, starting from the Gauss–Seidel version of PPDS (in the case B = −I_{p×p}) and considering M = σI_{m×m}, V_1 = τ^{−1} I_{n×n} − σA^t A, V_2 = 0, then, after the change of variables (x̄_k, z̄_k, ȳ_k) = (x_{k+1}, z_k, y_k), we recover Condat–Vũ's algorithm, Form II. Analogously, interchanging the order of f and g in the model, we obtain the switched version, i.e. Condat–Vũ's algorithm, Form I.
Alternatively, another class of splitting algorithms for (1) (in the case B = −I_{p×p}, but with the three functions), called Primal-Dual Three-Operator splitting (PD3O), has been analyzed by Yan [24], extending a former work by Davis and Yin [12], who assumed A = I_{n×n}.
Observe that in the case h = 0, and after the change of variables (x̄_k, ȳ_k) = (x_k, y_{k+1}), PD3O reduces to Condat–Vũ's algorithm, Form I, and is thus again a consequence of the PPDS scheme. In the general case, Yan [24] showed that PD3O has a broader theoretical domain of convergence and a better numerical behavior than Condat–Vũ's algorithm. We therefore consider the extension of PD3O (instead of Condat–Vũ's algorithm) to the general model (1). The extended algorithm developed below includes in particular the switched version of PD3O (similar to Condat–Vũ's algorithm, Form II) and its parallel version (similar to the PPDS Jacobi version).
Davis–Yin's three-operator splitting [12] has recently been improved in [20], where an adaptive stepsize tuning is proposed to compensate for the difficulty of estimating the Lipschitz constant. In [7, 6], the authors consider an extension of Spingarn's Partial Inverse method to the three-function model coupled by a subspace constraint. Quite recently, a more intricate model with four operators, allowing inexact computations, has been analyzed in [3].
We should also cite [17], where the authors developed another class of splitting methods for a more general model including (1), which extends PPDS in its Gauss–Seidel version. Finally, more composite models and different extensions of the Chambolle–Pock and Condat–Vũ schemes can be found in the recent survey by Condat et al. [10].
In summary, associated with the extended model (1), we will construct first-order splitting algorithms that unify PD3O (obtaining switched and parallel versions) and the PPDS algorithms (inserting forward gradient steps), including in all of them relaxation and inertial parameters. To achieve that goal, in Section 2 we first construct two types of averaged maps associated with our extended sequential and parallel splitting algorithms, respectively. Then, in order to include inertial and relaxation parameters, in Section 3 we write an Inertial-Relaxed variant of the corresponding fixed point iteration applied to averaged maps. In Section 4, applying these fixed point variants to the averaged maps constructed in Section 2, we obtain the desired general splitting algorithms including Inertial-Relaxed terms. In Section 5, we choose special multidimensional scaling matrix parameters to better tune the general algorithm obtained before, in order to show the equivalence with existing algorithms. Finally, in the last section, a limited set of computational comparisons between the algorithms is presented.

Deriving three candidate averaged maps
Merging the sum of f and h into a single function in model (1), under regularity conditions the problem is equivalent to a saddle-point inclusion problem denoted (V_L). The difficulty here is the necessity to split the regularization steps between f, h, g together with the composite effect of the matrices A and B. A direct application of the approach studied in [19] does not resolve this difficulty since, for any matrix D, the map G^L_D = D(L + D^t D)^{−1} D^t, which is 1/2-averaged, does not separate the map ∇h(x). Alternatively, we can obtain a problem equivalent to (V_L) which is amenable to the application of the Davis–Yin α-averaged map [11], where α ∈ (1/2, 1), building formally two distinct α-averaged maps which provide a complete splitting even when ∇h ≠ 0. These new maps are variants of G^L_D, choosing D as the matrices Q and Q̄ defined below in (11). On the other hand, and independently of the Davis–Yin map, we can obtain another class of α-averaged maps with a parallel splitting structure, developed below in (16).

Two averaged maps related to the Davis-Yin map
Following the strategy given by Davis and Yin [11] to identify a single averaged map associated with an explicit three-operator inclusion, we will now reformulate (V_L) above as a single inclusion with three operators, as seen in the next proposition.
▶ Proposition 1. Let M be an m × m symmetric positive definite matrix and V_1 and V_2 symmetric positive semidefinite matrices of order n × n and p × p, respectively, such that V_1 + A^t M A is positive definite. Using the notation w = (x, z, y) for the primal-dual variables belonging to R^{n+p+m}, problem (V_L) can be rewritten in the form (3) below.
Proof. Following the lifting strategy introduced in [12], problem (1) is lifted by adding the dummy variables ξ ∈ R^p and χ ∈ R^n, and takes a condensed form. With the notations given in the hypothesis, and under regularity conditions, this lifted problem is equivalent to an inclusion problem in its dual form, written in terms of w* ∈ R^{n+p+m}. The key trick is then to observe that this composite inclusion is a valid dual formulation of a primal inclusion which splits S and C, associated with the composite transformation AQA^t, now posed in R^{n+p}. Taking the dual again, with different dual variables still denoted by w*, it is straightforward to perform a last dualization to obtain the desired equivalent inclusion problem. ◀
Observe now that the inclusion (3) obtained in Proposition 1 can be seen as a three-operator setting, amenable to the application of Davis–Yin's framework [11], summarized below with S, T, C being the three operators. The corresponding Davis–Yin operator with parameter γ > 0 associated with the above inclusion is defined as

D_γ = I − J_{γS} + J_{γT} ∘ (2J_{γS} − I − γ C ∘ J_{γS}),

where J_{γT} denotes the classical resolvent (I + γT)^{−1}. In that sense, problem (4) is equivalent to finding a fixed point of the operator D_γ. When S and T are maximal monotone operators and C is β-cocoercive, the map D_γ has nice properties: indeed, it is 2β/(4β − γ)-averaged (provided γ < 2β) and has full domain, properties ensuring the convergence of fixed point algorithms applied to this map.
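The Davis–Yin operator and a fixed point iteration on it can be sketched numerically. The one-dimensional toy instance below (f = |·|, g the indicator of (−∞, 1.5], h(x) = ½(x − 3)², whose minimizer is x* = 1.5) is our own illustrative choice; the operator form in the docstring is the standard one:

```python
import math

def soft(v, t):
    """Resolvent of t * (subdifferential of |.|), i.e. scalar soft-thresholding."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def davis_yin_step(w, J_gS, J_gT, C, gamma):
    """One application of D_gamma = I - J_{gS} + J_{gT} o (2 J_{gS} - I - gamma * C o J_{gS})."""
    s = J_gS(w)
    return w - s + J_gT(2 * s - w - gamma * C(s))

gamma = 0.5                                # gamma < 2*beta, with beta = 1 here
J_gS = lambda w: soft(w, gamma)            # S = subdifferential of f = |.|
J_gT = lambda v: min(v, 1.5)               # T = normal cone of (-inf, 1.5]
C = lambda x: x - 3.0                      # C = gradient of h(x) = 0.5*(x - 3)^2

w = 0.0
for _ in range(100):
    w = davis_yin_step(w, J_gS, J_gT, C, gamma)
x = J_gS(w)                                # primal iterate, approaches x* = 1.5
```

Each step uses one resolvent of S, one resolvent of T and one evaluation of C, which is exactly the splitting property the maps G_1 and G_2 below are designed to preserve.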
Back to the inclusion (3) obtained in Proposition 1, the Davis–Yin map with scalar parameter γ = 1 associated with that inclusion generates two distinct maps, denoted G_1 and G_2 (where G_2 is obtained by switching the positions of (AS^{−1}A^t)^{−1} and ((−B)T^{−1}(−B^t))^{−1}). The positive definiteness assumption on V_1 + A^t M A ensures that the map S is maximal monotone. Similarly, the map T is also maximal monotone under the additional assumption that V_2 + B^t M B is positive definite (equivalent to B being injective). So, under both conditions (which will be denoted by Hypo-M), we get an alternative way to write both G_1 and G_2, as shown in the next proposition.
▶ Proposition 2. Under Hypo-M, the maps G_1 and G_2 are single valued, they map R^n × R^p × R^m into itself, and admit the explicit expressions (8) and (9) below.
Proof. For a given operator Γ and an injective matrix K, a resolvent identity (7) can easily be deduced (see Proposition 23.25 (ii) in [4]). Under assumption Hypo-M, both matrices A and B are injective, so considering (7) we can rewrite the resolvents of S and T involved in (6), obtaining (8) and, on the other hand, (9). Combining (8) and (9) in (6), we deduce the result.

◀
Observe that these new maps G_1 and G_2 can also be seen as generalizations of the Davis–Yin map, since they retain its averagedness and splitting properties (the Davis–Yin map is recovered in the case A = I_{n×n} and B = −I_{p×p}, considering V_1 = 0, V_2 = 0 and M = λI_{m×m}, which allows us to restrict their domain to R^m). The fixed point sets of G_1 and G_2, which are related to sol(V_L), are given in (10), where Q and Q̄ are the matrices defined in (11). These matrices are also involved in the fixed point algorithms derived from G_1 and G_2 and in the corresponding splitting algorithms, as we will show in Section 4.
The next proposition summarizes the averagedness properties of G_1 and G_2.
Proof. The fullness of the domains of G_1 and G_2 follows from the maximality of ∂f and ∂g and from the fullness of the domain of ∇h.
We now show the α-averagedness of G_1 and G_2. As said before, since V_1 + A^t M A and V_2 + B^t M B are positive definite, A and B are injective, from which the maximal monotonicity of (AS^{−1}A^t)^{−1} and ((−B)T^{−1}(−B^t))^{−1} follows. On the other hand, the cocoercivity property of the lifted map follows from the β-cocoercivity of C. From these properties and the definition in (6), G_1 and G_2 inherit the averagedness properties of the Davis–Yin three-operator map, provided 1 < 2β (which is satisfied by hypothesis). ◀

A new averaged map with a parallel structure
We can easily obtain a splitting algorithm in a parallel form directly from the application of known algorithms to a special reformulation of the composite model. For instance, the principal model (1) can be rewritten as a sum of three blocks:

min_{(x,z)} [f(x) + g(z)] + δ_V(x, z) + h(x), (13)

where V = {(x, z) : Ax + Bz = 0} and δ_Ω denotes the indicator function of a set Ω, i.e. δ_Ω(x) = 0 if x ∈ Ω and δ_Ω(x) = +∞ otherwise. Then, applying the Davis–Yin algorithm to the functional setting (13), one obtains a splitting algorithm that computes the proximal terms on f and g in parallel, but whose iterations also require the computation of the projection onto the subspace {(x, z) : Ax + Bz = 0}.
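For reference, the projection onto the coupling subspace that this direct reformulation requires can be computed from the pseudo-inverse of the stacked matrix K = [A B]; a minimal sketch (with an A = I, B = −I example of our own choosing):

```python
import numpy as np

def coupling_projector(A, B):
    """Orthogonal projector onto V = {(x, z) : Ax + Bz = 0} = null([A B])."""
    K = np.hstack([A, B])                            # m x (n + p) stacked matrix
    return np.eye(K.shape[1]) - np.linalg.pinv(K) @ K

# Example with A = I, B = -I, so V = {(x, z) : x = z}.
P = coupling_projector(np.eye(2), -np.eye(2))
w = np.array([2.0, 0.0, 0.0, 0.0])                   # stacked point (x, z)
v = P @ w                                            # closest point with x = z
```

Forming this projector costs a pseudo-inverse of K, which is precisely the non-parallel step the construction of G_3 below avoids.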
Instead of using the methodology summarized in the previous section, and since the Jacobi version of PD3O does not require any implementation of a projection, we consider its extension in order to obtain a parallel algorithm. To that purpose, we will need to construct another averaged map related to our principal model whose proximal subproblems on f and g can also be computed in parallel. We will also see below that the computation of the projection step, which would break the parallel structure, can be avoided.
Unlike G_1 and G_2, which were obtained from Davis–Yin's three-operator splitting, the construction of the new required averaged map is deduced from P := S(L + S^t S)^{−1} S^t, a 1/2-averaged map as developed below, associated with the Jacobi version of PPDS (see [19]) with h = 0 (recall that L, defined in (2), is the maximal monotone operator associated with the subdifferential of the Lagrangian of the initial problem). In Proposition 4 below, one can notice the adaptation of P to our general problem (1).
Let M be an m × m symmetric positive definite matrix and R_1 and R_2 symmetric positive semidefinite matrices of order n × n and p × p, respectively, such that R_1 + 2A^t M A and R_2 + 2B^t M B are positive definite matrices (these hypotheses are denoted below by Hypo-M).
We consider the map G_3, mapping R^n × R^p × R^m × R^m into itself, defined in (14). Notice that evaluating this map at any point only requires the parallel computation of the subproblems related to f and g; it is not necessary to compute explicitly the projection onto the coupling subspace {(x, z) : Ax + Bz = 0}. The fixed points of G_3 are also related to sol(V_L) and are contained in the set given in (15), where S is the matrix defined there. In what follows, we show that G_3 is an averaged map, beginning with a partial result. Denote by G the map from R^n × R^p × R^m into itself defined in (16). After some calculations, this map has the following properties, which help us to show later the averagedness of G_3.

▶ Proposition 4. Under the Hypo-M hypotheses, it holds that G_3 = S G S^t. Moreover, for any u := (x̄, z̄, ȳ), the inclusion (17) below is valid.
Proof. The identity G_3 = S G S^t follows directly from the definitions of these maps. To show the second part, given u := (x̄, z̄, ȳ) and setting (x, z, ν) = G u, the defining relations of G yield the desired inclusion (17). ◀
Using the last proposition, we now show the averagedness properties of G_3.

▶ Proposition 5. Assume again that the Hypo-M hypotheses are satisfied; then, under the stated condition on the parameters, G_3 is α-averaged.
Proof. To ease the formulation, we use the notations µ_1 = (x̄_1, z̄_1, ȳ_1) and µ_2 = (x̄_2, z̄_2, ȳ_2), and evaluate G at S^t S µ_1 and S^t S µ_2. From the inclusion (17), taking u equal to S^t S µ_1 or S^t S µ_2, the monotonicity of the operator L together with expression (16) gives, after rewriting the first term, the inequality (20). We then bound the remaining term: since ∇h is cocoercive, the Cauchy–Schwarz inequality yields the estimate (22); on the other hand, expanding ∥S µ_1 − G_3 S µ_1 − S µ_2 + G_3 S µ_2∥ and using (23), we obtain from (22) the desired upper bound. Inserting this bound in (20) yields the claimed estimate. Finally, since for any s, s′ ∈ R^q there exist u_1, u_2 ∈ R^r such that S^t S u_1 = S^t s and S^t S u_2 = S^t s′, the last inequality together with (16) shows that G_3 is α-averaged. ◀

Inertial-relaxed fixed point algorithms
In practice, any variant of a basic fixed point algorithm applied to some averaged map will yield a valid variant of the splitting algorithm related to it. Considering a maximal monotone operator T, we describe here an Inertial-Relaxed variant of a fixed point algorithm with relative error, inspired by the recent work of Alves et al. [2]. We first recall their variant of the Proximal Point algorithm to solve the inclusion 0 ∈ T(x) for a given maximal monotone operator T, called Relative-error Inertial-Relaxed Hybrid Proximal Point (RIRHPP); it includes three parameters (θ driving the relative error measure, λ the inertial parameter, and ρ the relaxation parameter). ▶ Relative-error Inertial-Relaxed HPP (RIRHPP). Initialization: If v_k = 0, then STOP. Otherwise, choose ρ_k ∈ [0, ρ̄] (relaxation parameter) and set

end for
Observe first that, without relative error (θ = 0), one easily checks that z_k = J_{c_k T}(w_k). In the general case too, RIRHPP can be rewritten in terms of the resolvent of T, so that this algorithm can be interpreted as a variant of a fixed point method applied to the resolvent of T. We can then extend this algorithm to find a fixed point of a 1-cocoercive map with full domain since, by Minty's theorem, any 1-cocoercive map with full domain is the resolvent of a maximal monotone operator. This fixed point variant can also be applied to an α-averaged map. The following lemma shows that finding a fixed point of an α-averaged map is equivalent to finding a fixed point of a 1-cocoercive map, which is constructed easily from the α-averaged one, assuming that the parameter α is known. Proof. Let N be a nonexpansive map associated with F.
which means that the transformed map is 1/2-averaged (or firmly nonexpansive) and equivalently 1-co-coercive.
Thus, given the problem of finding a fixed point of an α-averaged map F with full domain, we consider instead (1 − 1/(2α))I + (1/(2α))F, a 1-cocoercive map with full domain which has the same fixed points. One can thus extend algorithm RIRHPP to an α-averaged map with full domain. In summary, one obtains the following algorithm where, for the sake of simplicity, we do not include the relative error feature of RIRHPP and consider fixed inertial and relaxation parameters: ▶ Inertial-Relaxed Fixed Point (IRFP).

end for
We consider conditions on the bounds λ̄ and ρ̄ similar to those given in [2] for RIRHPP, i.e.

H1. (λ̄, ρ̄) ∈ ]0, 1[ × ]0, 2[, where the upper bound ρ̄ is an explicit function of λ̄.
In the case where no inertial term is used (λ = 0), we bound the relaxation parameter by ρ < 2, so that ρ < 1/α (which is known to guarantee convergence in that case). The convergence of IRFP can be derived directly from [2], since the algorithm can be seen as an application of RIRHPP to a special 1-cocoercive map, as shown in Lemma 3.1 above; we nevertheless give an equivalent direct proof in the annex.
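The IRFP scheme can be sketched as follows: form the 1-cocoercive map (1 − 1/(2α))I + (1/(2α))F, extrapolate with the inertial parameter λ, then relax with ρ. This is a minimal sketch with fixed scalar parameters; the toy map F below (1/2-averaged, with unique fixed point 2) is our own example:

```python
def irfp(F, alpha, z0, lam=0.2, rho=1.0, n_iter=100):
    """Inertial-Relaxed Fixed Point sketch for an alpha-averaged map F."""
    z_prev = z = z0
    for _ in range(n_iter):
        w = z + lam * (z - z_prev)                          # inertial extrapolation
        g = (1 - 1 / (2 * alpha)) * w + F(w) / (2 * alpha)  # 1-cocoercive transform of F
        z_prev, z = z, (1 - rho) * w + rho * g              # relaxed fixed point step
    return z

F = lambda w: 0.5 * w + 1.0     # 1/2-averaged map, fixed point w* = 2
z = irfp(F, alpha=0.5, z0=0.0)  # converges to 2
```

With λ = 0 and ρ = 1 this reduces to the plain Krasnosel'skii–Mann iteration on the transformed map; the conditions H1 above restrict how large λ and ρ may be taken jointly.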

Inertial-relaxed splitting algorithms
Based on the averaged maps constructed in Section 2 and the variant IRFP of the fixed point algorithm developed in the previous section, we obtain generalized splitting algorithms, including the inertial and relaxation parameters, to solve the composite model (1).

Splitting algorithms in the Gauss-Seidel version
Considering first the map G_1, we obtain a splitting algorithm which, without inertial or relaxation tuning parameters, reduces to PD3O (as we will see in Section 5). We thus obtain an algorithm different from Condat–Vũ's Algorithm, Form I, when h ≠ 0. This new algorithm is termed Multi-scaling Inertial-Relaxed primal-dual algorithm, Form I: ▶ Multi-scaling Inertial-Relaxed primal-dual algorithm, Form I (MIRPD, Form I). Choose (x_0, z_0, y_0,
Setting ξ_k = (x_k − r_k, z_k, y_k) and ξ_k^w = (x_k^w − r_k^w, z_k^w, y_k^w), the relation of this algorithm with IRFP applied to the averaged map G_1 is given in (35). This relation allows us to obtain the following convergence result.
Proof. From relation (35), we observe that this algorithm is related to IRFP applied to the operator G_1 which, from Proposition 3, is α-averaged. Then, by Proposition 7 and relation (10), we deduce the convergence of the iterates, where (x*, z*, y*) ∈ sol(V_L) and W = (V_1 + A^t M A)^{−1}. Since (∂f + V_1 + A^t M A)^{−1} is single-valued and continuous, from (30) we deduce that {x̄_k} converges.
Proof. From relation (49), we observe that this algorithm is related to IRFP applied to the operator G_3 which, from the hypotheses, is α-averaged. Then, by Proposition 7 and relation (14), we deduce the corresponding convergence.
Without inertial and relaxation terms, and after the change of variables (x̄_k, ȳ_k, z̄_k) = (x_k, η_{k−1}, x_{k−1} − τA^t η_{k−1} − r_{k−1}), the last algorithm reduces to PD3O. Therefore this algorithm can be seen as resulting from the inclusion of inertial and relaxation terms into PD3O. Analogously, considering MIRPD, Form II, with matrix parameters satisfying (50) and V_2 = 0, we obtain the switched version of Inertial-Relaxed PD3O.
Compared to their source algorithm MIRPD, the two forms of the last algorithm, IRPD3O, avoid the use of the variable z, and the distinction with Condat–Vũ, Form I and Form II respectively, is clear when h ≠ 0. Comparing now the two forms of IRPD3O, Form II avoids using the previous values r_{k−1} and r_k.
Alternatively, considering MIRPPD with matrix parameters satisfying (51) and R_2 = 0, we obtain an algorithm that can be considered as the parallel version of PD3O, since MIRPPD is the parallel version of MIRPD and Form I of the latter is related to PD3O.
The convergence of the last algorithms follows from Propositions 8, 9 and 10 respectively.

Numerical Results
We consider the problem (commonly referred to as the fused LASSO) with the least squares loss, as in [24]:

min_x (1/2)∥Qx − b∥² + µ_1∥x∥_1 + µ_2∥Ax∥_1, (P_fl)

where Q is a q × n matrix, b ∈ R^q, and A is the (n − 1) × n finite-difference matrix, (Ax)_i = x_{i+1} − x_i. We take the values µ_1 = 20 and µ_2 = 200 for the weights in the objective function. Moreover, the matrix Q and the vector b are randomly generated, following Yan's paper [24]. We consider the dimensions n = 400 and q = 40. Since problem (P_fl) has the structure of problem (1) (case B = −I), we apply the algorithms IRPD3O and IRPPD3O and compare them with Condat–Vũ's and Chambolle–Pock's algorithms. For the evaluation, we consider the primal error ∥x_k − x*∥, where x* is approximated by the primal iterate at iteration 18000 of the PD3O algorithm.
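The experiment can be reproduced in miniature. The sketch below runs a PD3O-type iteration on a tiny fused LASSO instance with Q = I, so that the exact solution is checkable (for the identity design, it is the soft-thresholding by µ_1 of the total-variation solution, a known property of the one-dimensional fused LASSO signal approximator). The precise form of the gradient-correction term in the dual update is transcribed as we recall it from [24] and should be checked against that paper; all data values here are our own toy choices, not the paper's instances:

```python
import numpy as np

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

b = np.array([0.0, 0.0, 10.0, 10.0])    # observations (toy case Q = I)
mu1, mu2 = 0.1, 0.1                     # toy weights (the paper uses 20 and 200)
A = np.diff(np.eye(4), axis=0)          # (n-1) x n finite-difference matrix
tau, sigma = 1.0, 0.2                   # tau < 2*beta, tau*sigma*||A||^2 <= 1
grad = lambda x: x - b                  # gradient of h(x) = 0.5*||x - b||^2
x, s = np.zeros(4), np.zeros(3)
for _ in range(5000):
    # primal step: prox of tau*mu1*||.||_1 after a forward gradient step
    x_new = soft(x - tau * grad(x) - tau * A.T @ s, tau * mu1)
    # dual step with PD3O-style gradient correction; prox of sigma*g* is a clip
    s = np.clip(s + sigma * A @ (2 * x_new - x + tau * (grad(x) - grad(x_new))),
                -mu2, mu2)
    x = x_new
# x approaches [0, 0, 9.85, 9.85], the exact fused LASSO solution for this data
```

Inertial and relaxation terms (the λ_k and ρ_k of IRPD3O) would wrap this base iteration exactly as in the IRFP scheme of Section 3.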
We observe that the three variants of PD3O have a larger theoretical domain of convergence; even the parallel version has a slight advantage over the Condat–Vũ algorithms. Notice also that tuning the relaxation parameter has more impact on the convergence of the algorithms than the inertial parameter. Finally, the numerical results show the need to investigate a possible extension of this theoretical domain of convergence, similarly to the study given in [16] for the model without a smooth part.

Conclusions
Associated with the composite model based on three types of functions, we have obtained two new averaged splitting maps: a Gauss–Seidel type that generalizes the Davis–Yin averaged map, and a Jacobi type that is a generalized parallel version of it. Then, similarly to the construction of ADMM from the Douglas–Rachford map, but also considering a variant of the fixed point algorithm, we have obtained three new splitting algorithms from these averaged maps, all including Inertial-Relaxed parameters. Choosing special scaling matrix parameters allows us to obtain algorithmic variants of PD3O, which we have compared numerically, showing the high sensitivity of the rate of convergence with respect to the relaxation parameter, and also noticing the advantage of the PD3O variants over the Condat–Vũ algorithms.
Observe that parallel implementations of a few algorithms have been mentioned, but their respective speedups remain to be confirmed on real-life cases. Moreover, the numerical experiments have revealed the high sensitivity of the performance, in terms of number of iterations, to the tuning of the different parameters. Corresponding adaptive strategies allowing dynamic changes of the proximal and inertial parameters are currently under study.