Convergence Rates of Gradient Methods for Convex Optimization in the Space of Measures

We study the convergence rate of Bregman gradient methods for convex optimization in the space of measures on a $d$-dimensional manifold. Under basic regularity assumptions, we show that the suboptimality gap at iteration $k$ is in $O(\log(k)k^{-1})$ for multiplicative updates, while it is in $O(k^{-q/(d+q)})$ for additive updates for some $q \in \{1, 2, 4\}$ determined by the structure of the objective function. Our flexible proof strategy, based on approximation arguments, allows us to painlessly cover all Bregman Proximal Gradient Methods (PGM) and their acceleration (APGM) under various geometries such as the hyperbolic entropy and $L^p$ divergences. We also prove the tightness of our analysis with matching lower bounds and confirm the theoretical results with numerical experiments on low-dimensional problems. Note that all these optimization methods must additionally pay the computational cost of discretization, which can be exponential in $d$.


Introduction
Convex optimization in the space of measures is a theoretical framework that leads to fruitful points of view on a large variety of problems, ranging from sparse deconvolution [Bredies and Pikkarainen, 2013] and two-layer neural networks [Bengio et al., 2006] to global optimization [Lasserre, 2001] and many more [Boyd et al., 2017]. Various algorithms have been proposed to solve such problems, including moment methods [Lasserre, 2001], conditional gradient methods [Bredies and Pikkarainen, 2013, Denoyelle et al., 2019], (non-convex) particle gradient flows [Chizat, 2021] and noisy versions thereof [Mei et al., 2018, Nitanda et al., 2020].
In this paper, we consider perhaps the simplest methods: gradient descent and its extensions that handle non-smooth regularizers and non-Euclidean geometries, namely the Bregman Proximal Gradient Method (PGM) (an extension of mirror descent [Nemirovsky and Yudin, 1983] that handles composite objectives) and its acceleration (APGM) [Tseng, 2010]. Our aim is to establish well-posedness and convergence rates for these methods when minimizing, over the space of measures M(Θ) on a d-dimensional manifold Θ, composite functions of the form

F̄(µ) := Ḡ(µ) + H̄(µ),   with   Ḡ(µ) = R(∫_Θ Φ(θ) dµ(θ)),

where Φ is continuous and Hilbert space-valued, R is convex and smooth, and H̄ is convex and "simple" (see precise assumptions in Section 3.1). For such problems, minimizers are typically at an infinite (Bregman) distance from the initialization, and thus all the standard convergence bounds are inapplicable. Our contributions are the following:
• We recall and adapt (A)PGM in Section 3, taking care of the subtleties that appear in our context (definition of the iterates and lack of strong convexity of the divergence);
• We prove in Section 4 upper bounds on the convergence rate for (A)PGM under various structural assumptions, summarized in Table 2. These rates depend on the choice of the Bregman divergence and on the precise structure of the objective function;
• Tight lower bounds of two kinds are proved in Section 5: proof-technique-dependent lower bounds and algorithm-dependent lower bounds (the latter are stronger but do not cover all cases);
• Numerical experiments on synthetic toy problems in Section 6 often show an excellent agreement between the theoretical rates and the ones observed in practice. Even in cases with an apparent mismatch, a closer look at the structure of the problem shows that the theory still sheds light on the observed rates.
Our motivation for studying this problem is threefold. First, our results make a case for APGM with the hyperbolic geometry, instead of FISTA, to solve convex problems in the space of measures, as they show that the former enjoys a faster convergence rate. Second, we believe that a precise understanding of (A)PGM in this context is useful to develop and analyze more complex methods, such as the particle-based (a.k.a. moving grid) approaches mentioned above. Third, this setting offers a rich test case to deepen our understanding of Bregman gradient methods in Banach spaces, and of the behavior of optimization algorithms when all minimizers are at an infinite distance from the initialization, beyond the well-explored Hilbert space setting.

Related work
The comparison between additive updates (L² geometry) and multiplicative updates (entropy geometry) is well known in finite-dimensional spaces [Kivinen and Warmuth, 1997]. For instance, for convex optimization in the n-dimensional simplex, the two methods typically converge at the same rate but the "constant" factor is polynomial in n for additive updates while it is logarithmic in n for multiplicative updates, see [Bubeck, 2015, Section 4]. We obtain in this paper an infinite-dimensional (n = ∞) version of this separation, but one where the distinction appears directly in the rates rather than in the constants.
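To make this comparison concrete, here is a minimal numerical sketch of the two update rules on the n-dimensional simplex (our own illustration, not taken from the cited works; the linear objective, step-size schedule and constants are arbitrary choices):

```python
import numpy as np

def simplex_project(y):
    # Euclidean projection onto the probability simplex {x >= 0, sum x = 1}
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def additive_step(x, grad, s):
    # projected gradient step (L2 geometry)
    return simplex_project(x - s * grad)

def multiplicative_step(x, grad, s):
    # entropic mirror step: x_+ proportional to x * exp(-s * grad)
    w = x * np.exp(-s * grad)
    return w / w.sum()

rng = np.random.default_rng(0)
n = 100
c = rng.uniform(0.1, 1.0, size=n)   # linear objective F(x) = <c, x>
x_add = np.ones(n) / n
x_mul = np.ones(n) / n
for k in range(1, 2001):
    s = 1.0 / np.sqrt(k)
    x_add = additive_step(x_add, c, s)
    x_mul = multiplicative_step(x_mul, c, s)
gap_add = c @ x_add - c.min()
gap_mul = c @ x_mul - c.min()
```

On this toy instance both methods reach a small suboptimality gap; the point made in the text is about how the constants in the guarantees scale with n.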
Analysis of convex optimization in infinite-dimensional (Banach) spaces is a classical subject [Bauschke et al., 2001, 2003]. Here, we study a concrete class of problems defined on the space of measures which exhibits specific features. This problem-specific approach to infinite-dimensional problems has proved fruitful for the analysis of gradient methods for least-squares (e.g. Yao et al. [2007], Dieuleveut [2017] and references therein), for partly smooth problems [Liang et al., 2014] and for the Iterative Soft-Thresholding Algorithm (ISTA) in Hilbert spaces [Lorenz, 2008, Garrigos et al., 2020].
The latter is close to our subject since ISTA is in fact an instance of PGM with the L²-divergence, and FISTA [Beck and Teboulle, 2009] is analogous to APGM with the L²-divergence. These prior works perform the analysis in a Hilbert space, while we work in the space of measures or in L¹, which are non-reflexive Banach spaces. This is also the context of Chambolle and Tovey [2021] who, for a modified version of FISTA, obtained in particular the convergence rate of Table 2 when p = 2 and q = 1, and who also discuss discretization. Our analysis allows us to compare various algorithms and shows that FISTA is always slower than APGM with the hyperbolic entropy geometry [Ghai et al., 2020] when the solution is truly sparse, see the rates in Table 2. This is clearly observed in numerical experiments and suggests that the latter forms a stronger baseline for our class of problems.
To prove our upper bounds, we use the abstract proof strategy proposed by Jacobs et al. [2019], recalled in Section 2. In that paper, the authors study different classes of problems (total variation denoising of images and the earth mover's distance) under a Hilbertian geometry.
Notation. The domain of a function F is dom F := {x : F(x) < +∞}. Throughout, Θ is a compact d-dimensional manifold, M(Θ) (resp. M₊(Θ)) is the set of finite signed (resp. nonnegative) Borel measures on Θ and P(Θ) is the set of Borel probability measures. For µ ∈ M(Θ), ‖µ‖ is its total variation norm. For a Hilbert space F, C^p(Θ; F) is the set of p-times continuously differentiable functions from Θ to F. Lip(f) is the Lipschitz constant of a function f. For τ ∈ P(Θ) and p ≥ 1, L^p(τ) is the space of (equivalence classes of) measurable functions f : Θ → R such that ∫_Θ |f(θ)|^p dτ(θ) < +∞ or, for p = +∞, such that |f| is τ-almost everywhere bounded by some K > 0. The asymptotic notation a(k) ≲ b(k) means that there exists c > 0 independent of k such that a(k) ≤ c · b(k), and a(k) ≍ b(k) means [a ≲ b and b ≲ a].

Strategy to derive upper bounds on convergence rates
This section introduces the strategy, adapted from [Jacobs et al., 2019], that we adopt to derive upper bounds on the convergence rates.
Let F be a lower bounded convex function defined on a real vector space. Suppose that an iterative method designed to minimize F, initialized at x₀ ∈ dom F, generates a sequence x₁, x₂, · · · ∈ dom F that satisfies, for all x ∈ dom F,

F(x_k) − F(x) ≤ α_k D(x, x₀),    (2)

where (α_k)_{k∈N*} is a positive sequence converging to 0 and D is a divergence, i.e. D(x, x₀) ∈ [0, +∞] and D(x₀, x₀) = 0. Most first-order methods enjoy guarantees of this form. For instance, PGM and APGM enjoy such guarantees with respectively α_k ≍ k^{−1} and α_k ≍ k^{−2} under suitable assumptions, see Section 3.3. While Eq.
(2) is sometimes the endpoint of the analysis in the optimization literature, this is our starting point: we are interested in cases where, for any minimizer x* of F, the quantity D(x*, x₀) is infinite, which makes the bound of Eq. (2) inapplicable. Even if there exists a quasi-minimizer x with a small suboptimality gap and satisfying D(x, x₀) < +∞, choosing a fixed x independent of k in Eq. (2) leads to a poor upper bound which often does not match the observed practical behavior. Instead, we should exploit the flexibility offered by the guarantee of Eq. (2) and choose a different reference point at each time step. This means that we reformulate the guarantee in the equivalent form

F(x_k) − inf F ≤ ψ(α_k),   where   ψ(α) := inf_x { F(x) − inf F + α D(x, x₀) }.    (3)

Studying ψ is particularly fruitful to understand optimization algorithms satisfying Eq. (2). In particular, its behavior at 0 determines the asymptotic convergence rate. This function can be interpreted as the value at x₀ of the (Bregman) Moreau envelope [Kan and Song, 2012] of (F − inf F) with regularization parameter α, and it intervenes in many areas of applied mathematics. For instance, when D(·, x₀) is a squared Hilbertian norm, ψ has a variety of behaviors which characterize the performance of kernel ridge regression in machine learning (see e.g. [Bach, 2021, Chap. 7.5]).

Figure 1: Shape of ψ defined in Eq. (3) in our situation of interest, where D(x_k, x₀) explodes for any minimizing sequence (x_k)_{k∈N} (Prop. 2.1). When an optimization method satisfies Eq. (2) for some sequence (α_k)_{k∈N}, then ψ(α_k) bounds its convergence rate in objective values.

Before we head into a more concrete setting, let us gather a few relevant properties of the function ψ that hold in full generality.
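For intuition, ψ admits a closed form in a simple instance (an illustration of ours, not from the paper): take F(f) = ∫ Φ f dτ over probability densities on Θ = [−1/2, 1/2] with Φ(θ) = |θ|, and let D be the relative entropy with respect to f₀ ≡ 1. The Gibbs variational principle then gives ψ(α) = −α log ∫ e^{−Φ/α} dτ ≈ α log(1/(2α)), so ψ(α) → 0 while ψ(α)/α → ∞ as α → 0.

```python
import numpy as np

# Grid discretization of Theta = [-1/2, 1/2] with uniform tau.
m = 200001
theta = np.linspace(-0.5, 0.5, m)
dx = theta[1] - theta[0]
Phi = np.abs(theta)  # F(f) = int Phi * f dtau, minimized by a Dirac at 0

def psi(alpha):
    # Gibbs variational principle:
    # inf_f { <Phi, f> + alpha * KL(f || 1) } = -alpha * log(int exp(-Phi/alpha) dtau)
    z = (np.exp(-Phi / alpha) * dx).sum()
    return -alpha * np.log(z)

vals = {a: psi(a) for a in (0.1, 0.01, 0.001)}
```

The slope ψ(α)/α blows up as α → 0, which is exactly the regime where the fixed-reference bound of Eq. (2) becomes uninformative.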
Proposition 2.1. Assume that F(x₀) < +∞ and D(·, x₀) ≥ 0 with equality at x₀. Then the function ψ is concave on [0, +∞[ and satisfies 0 ≤ ψ(α) ≤ F(x₀) − inf F. Moreover, (i) ψ(α) → 0 as α → 0⁺ if and only if there exists a minimizing sequence (x_k) of F with D(x_k, x₀) < +∞ for all k; (ii) ψ′(0) := lim_{α→0⁺} ψ(α)/α is finite if and only if there exists a minimizing sequence (x_k) of F with sup_k D(x_k, x₀) < +∞.

Proof. The function ψ is concave as the pointwise infimum of affine functions. The lower bound is immediate and the upper bound is obtained by taking x₀ as a candidate in the infimum. Let us prove (ii) (the proof of (i) follows a similar scheme and is simpler). By concavity, the limit defining ψ′(0) always exists and belongs to ]0, +∞]. If a sequence (x_k) exists as in the statement, then for any α > 0, pick x_k such that F(x_k) − inf F ≤ α², and then ψ(α) ≤ α D(x_k, x₀) + α², so ψ(α)/α ≤ D(x_k, x₀) + α. Since the upper bound is uniformly bounded as α → 0, it follows that ψ′(0) is finite. Conversely, if ψ′(0) is finite, take a decreasing sequence (α_k) that converges to 0 and let (x_k) be a sequence of quasi-minimizers for Eq. (3) with α = α_k.

Figure 1 illustrates the general shape of the function ψ. Observe that if ψ′(0) < +∞ then, by concavity and since ψ(0) = 0, it holds ψ(α_k) ≤ ψ′(0) α_k, and thus the convergence rate given by Eq. (2) is not modified (only the constant changes). However, Proposition 2.1 shows that when every minimizing sequence (x_k) satisfies D(x_k, x₀) → +∞, then ψ′(0) = +∞ and thus the convergence rate is modified. This is the situation of interest in this paper, in the context of optimization in the space of measures.

Gradient methods for optimization in the space of measures
In the rest of this paper, we apply the general method of Section 2 to a class of optimization problems in the space of measures, where it leads to a zoo of convergence rates, often tight ones.

Objective function
Let Θ be a compact Riemannian manifold without boundary, with distance dist and with a reference probability measure τ ∈ P(Θ) that is proportional to the volume measure. We consider an objective function on the space of measures F̄ : M(Θ) → R ∪ {+∞} of the form F̄ = Ḡ + H̄. Typically, Ḡ is a data-fitting term and H̄ a regularizer. We make the following assumption, where ι_C is the convex indicator of a convex set C and λ ≥ 0 a regularization parameter:

(A1) Ḡ(µ) = R(∫_Θ Φ(θ) dµ(θ)) for some continuous Φ : Θ → F, where F is a Hilbert space, R : F → R is convex and differentiable with a Lipschitz gradient ∇R, and H̄ is a sum of functions from the following list: ι_{P(Θ)}, ι_{M₊(Θ)}, λ‖µ‖ and ι_{{µ ; λ‖µ‖ ≤ 1}}.
One specific property of H̄ that we use in our proof is that it should not increase under convolution with a probability kernel, but we prefer to work with these specific instances rather than giving abstract conditions. We finally denote by F : L¹(τ) → R ∪ {+∞} the function F(f) := F̄(f τ), and similarly H(f) := H̄(f τ) and G(f) := Ḡ(f τ), so that F = G + H. These "bar" notations convey the idea that F̄, Ḡ, H̄ are the lower-semicontinuous (l.s.c.) extensions of F, G, H for the weak* topology induced by C⁰(Θ) on M(Θ).
Here are examples of problems that fall under this setting:
• (Sparse deconvolution) The goal is, given a signal y* ∈ L²(τ), to find a sparse measure µ such that the convolution of µ with a filter φ ∈ C(Θ) approximately recovers y*. Here the domain is typically the d-dimensional torus Θ = T^d endowed with the Lebesgue measure τ, and the objective is [De Castro and Gamboa, 2012, Candès and Fernandez-Granda, 2014]

F̄(µ) = (1/2) ‖φ ∗ µ − y*‖²_{L²(τ)} + λ‖µ‖.

Adding the nonnegativity constraint ι_{M₊(Θ)} is also relevant in certain applications.
• (Two-layer ReLU neural networks) The goal is, given n observations (x_i, y_i) ∈ R^d × R, to find a regressor written as a linear combination of simple "ridge" functions. Consider a loss ℓ : R² → R convex and smooth in its second argument, let φ(s) = (s)₊ and let

Ḡ(µ) = (1/n) Σ_{i=1}^n ℓ(y_i, ∫_Θ φ([x_i; 1]^⊤ θ) dµ(θ)).

Key differences with the previous setting are that Θ = S^d is the sphere, with d potentially large, and that the object that is truly sought after is the regressor x ↦ ∫ φ([x; 1]^⊤ θ) dµ(θ) rather than the measure µ itself. Typical choices for ℓ are the logistic loss ℓ(y, z) = log(1 + exp(−yz)) when y_i ∈ {−1, 1}, or the square loss ℓ(y, z) = ½|y − z|². The signed setting with regularization H̄(µ) = λ‖µ‖ is the most common one [Bengio et al., 2006, Bach, 2017], but the regularization ι_{P(Θ)} also appears in the context of max-margin problems [Chizat and Bach, 2020].
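The first example above (sparse deconvolution) can be discretized in a few lines; the following sketch is ours, with an illustrative grid size, kernel cutoff and λ, and is only meant to make the objective concrete.

```python
import numpy as np

m = 200                        # grid points discretizing the torus T^1
grid = np.arange(m) / m        # tau = uniform measure on the grid

def dirichlet(theta, cutoff=2):
    # real form of the Dirichlet kernel: sum_{|k| <= cutoff} e^{2 i pi k theta}
    theta = np.asarray(theta)[..., None]
    ks = np.arange(1, cutoff + 1)
    return 1.0 + 2.0 * np.cos(2 * np.pi * theta * ks).sum(axis=-1)

# forward operator: (A mu)(theta_j) = int phi(theta_j - theta') dmu(theta')
A = dirichlet(grid[:, None] - grid[None, :])   # shape (m, m)

mu_true = np.zeros(m)
mu_true[0] = 1.0               # mu* = delta_0 on the grid
y = A @ mu_true                # noiseless observations y* = phi(. - 0)

lam = 0.1
def objective(mu):
    # F(mu) = 1/2 ||A mu - y||^2_{L2(tau)} + lam * ||mu||_TV
    return 0.5 * np.mean((A @ mu - y) ** 2) + lam * np.abs(mu).sum()

F_spike = objective(mu_true)   # data term vanishes, only lam * ||mu*|| remains
F_zero = objective(np.zeros(m))
```

With noiseless data, the true spike pays only the regularization term, while the zero measure pays the full data-fitting term.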
The following smoothness lemma will be useful to analyze optimization algorithms and is analogous to the usual "Lipschitz gradient" property in convex optimization. Since the dual of (M(Θ), ‖·‖) is a bit exotic, we avoid using the notion of gradient altogether.
Lemma 3.1 (Smoothness). Assume (A1) and let Ḡ′[µ] := ⟨∇R(∫_Θ Φ dµ), Φ(·)⟩. If Φ ∈ C^p(Θ; F), then µ ↦ Ḡ′[µ] is Lipschitz continuous as a function from M(Θ) to C^p(Θ, R). The following smoothness inequality holds with L := Lip(Ḡ′) ≤ ‖Φ‖²_∞ · Lip(∇R) and for all µ, ν ∈ M(Θ),

Ḡ(ν) ≤ Ḡ(µ) + ∫_Θ Ḡ′[µ] d(ν − µ) + (L/2) ‖ν − µ‖².

Those results hold true when replacing M(Θ) by L¹(τ).

Proof. For the first part, the differentiability of R implies that Ḡ′[µ] is the differential of Ḡ at µ. For the regularity of µ ↦ Ḡ′[µ], we have for µ, ν ∈ M(Θ),

‖Ḡ′[µ] − Ḡ′[ν]‖_∞ ≤ ‖Φ‖_∞ · ‖∇R(∫Φ dµ) − ∇R(∫Φ dν)‖ ≤ ‖Φ‖²_∞ · Lip(∇R) · ‖µ − ν‖.

The smoothness inequality can be shown by bounding a 1-dimensional integral as in the Euclidean case [Nesterov, 2003, Thm. 2]. Finally, the map f ↦ f τ from L¹(τ) to M(Θ) is an isometry, so those results hold mutatis mutandis in L¹(τ).

Bregman divergences
Let us consider η : R → [0, ∞] a differentiable function that we will refer to as the distance-generating function. For f ∈ L¹(τ) we write η(f) := η ∘ f and we define η̄(f) := ∫_Θ η(f(θ)) dτ(θ). Let D_η (resp. D_η̄) be the Bregman divergence associated to η (resp. η̄), given for f, g ∈ L¹(τ) by

D_η(f, g) := η(f) − η(g) − η′(g)(f − g)   and   D_η̄(f, g) := ∫_Θ D_η(f(θ), g(θ)) dτ(θ).

We consider assumptions on the distance-generating function η referred to as (A2) (with variants (A2)⁺ and (A2)^±). Specifying the values of η and η′ at a point in int(dom η) is just for convenience and is not restrictive, since D_η is not affected by affine perturbations of η. Also, the assumption η(cx) ≍ η(x) is only needed to simplify the statement of the results. Under assumption (A2)⁺, we have that η′(0) := lim_{t→0⁺} η′(t) = −∞, which automatically enforces a nonnegativity constraint in the methods of the next section.
Near a point of int(dom η), D_η behaves like ½ η″(s)(s′ − s)², so locally D_η is equivalent to a squared Riemannian metric on the real axis generated by η″. For the examples listed above, it holds η″_p(s) = |s|^{p−2}, η″_ent(s) = s^{−1} and η″_hyp(s) = (s² + β²)^{−1/2}, see Figure 2 for an illustration. The hyperbolic entropy η_hyp can be interpreted as a "signed" version of η_ent (see Proposition 3.6 for a precise version of this remark).
The next lemma states the strong convexity of these divergences with respect to the L¹(τ) norm, which is needed in the next section. It is a generalization of Pinsker's inequality, which is recovered for K = 1 and η_ent. Notice that when p < 2, the bound worsens as the norm increases.
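As a numerical sanity check (ours, with arbitrary test densities and an arbitrary β), the divergences D_η̄ associated with η₂, η_ent and η_hyp can be evaluated on a grid, and the Pinsker-type inequality D_η̄(f, g) ≥ ½‖f − g‖²_{L¹(τ)} for η_ent can be verified directly:

```python
import numpy as np

def eta_p2(s):                 # eta_2: quadratic potential, L2 geometry
    return 0.5 * s**2

def eta_ent(s):                # entropy, defined for s >= 0, with eta(0) = 1
    return np.where(s > 0, s * np.log(np.maximum(s, 1e-300)) - s + 1.0, 1.0)

def eta_hyp(s, beta=0.1):      # hyperbolic entropy: eta'' = (s^2 + beta^2)^(-1/2)
    return s * np.arcsinh(s / beta) - np.sqrt(s**2 + beta**2) + beta

def bregman(eta, f, g, dx, eps=1e-7):
    # D(f, g) = int [ eta(f) - eta(g) - eta'(g) (f - g) ] dtau,
    # with eta' approximated by a centered finite difference
    deta = (eta(g + eps) - eta(g - eps)) / (2 * eps)
    return ((eta(f) - eta(g) - deta * (f - g)) * dx).sum()

m = 10001
x = np.linspace(0.0, 1.0, m)
dx = 1.0 / m
f = 1.0 + 0.5 * np.sin(2 * np.pi * x)   # two probability densities on [0, 1]
g = np.ones(m)

D2 = bregman(eta_p2, f, g, dx)
Dent = bregman(eta_ent, f, g, dx)
Dhyp = bregman(eta_hyp, f, g, dx)
l1 = (np.abs(f - g) * dx).sum()
```

For η₂ the divergence reduces to half the squared L²(τ) distance, while the entropy divergence dominates ½ ‖f − g‖₁², as the lemma predicts.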

Gradient methods and their classical guarantees
We now detail two classical algorithms that enjoy guarantees of the form of Eq. (2) for a large class of composite optimization problems. Algorithm 1 (PGM) is closely related to mirror descent [Nemirovsky and Yudin, 1983] and is discussed in Bauschke et al. [2003], Auslender and Teboulle [2006]. Algorithm 2 (APGM) is taken from Tseng [2010], who presents it as a generalization of Auslender and Teboulle [2006], itself an extension of Nesterov's second method [Nesterov, 1988]. For the sake of concreteness, we instantiate these algorithms in the context of optimization in the space of measures, where small adaptations have to be made.
In the next proposition, we verify that the updates are well-defined under suitable assumptions. Table 1 lists some update formulas which are directly implementable, after discretization.
Algorithm 2: Accelerated (Bregman) Proximal Gradient Method (APGM)

Proof. Let J_k be the function to minimize. Thanks to our assumption that η′(int(dom η)) = R and since H is lower bounded, by [Rockafellar, 1971, Cor. 2B] the sublevel sets of J_k are compact with respect to the weak topology (induced on L¹(τ) by L^∞(τ)). Moreover, J_k is convex and l.s.c. for the same topology; in particular because G′[g_k], η′(h_k) ∈ L^∞(τ) and, for the term ∫ η(f) dτ, this follows from [Rockafellar, 1971, Cor. 2A]. Thus, by the direct method of the calculus of variations, there exists a minimizer f_{k+1} ∈ L¹(τ). Since η is strictly convex, so is J_k, and this minimizer is unique. The condition of Eq. (5) is always a sufficient optimality condition since, by the subdifferential inclusion rule, it implies that 0 ∈ ∂J_k(f). It thus remains to show that it is also necessary, in which case the property η′(h_{k+1}) ∈ L^∞(τ) immediately follows. This is done on a case-by-case basis for the functions H̄ admissible under Assumption (A1). Consider for instance the nonnegativity constraint H̄ = ι_{M₊(Θ)} and η satisfying Assumption (A2)^±. Then, with the update h_{k+1} given in Table 1 (take λ = 0), the function φ := −G′[g_k] − s^{−1}(η′(h_{k+1}) − η′(h_k)) is such that, clearly, φ ∈ L^∞(τ), φ ≤ 0 and ∫ φ h_{k+1} dτ = 0, and thus φ ∈ ∂H̄(h_{k+1}), which shows that h_{k+1} is a minimizer and satisfies Eq. (5). The other cases for H̄ and η admissible under (A1) and (A2) (such as those listed in Table 1) can be treated similarly and follow computations which are standard in the finite-dimensional setting.
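Two of the implementable updates of the kind listed in Table 1 can be sketched as follows (a hypothetical uniform-grid discretization of ours; step-sizes and λ are illustrative): the entropy geometry with ι_{P(Θ)} gives a normalized multiplicative update, and the L² geometry with λ‖µ‖ gives a soft-thresholding, as in ISTA.

```python
import numpy as np

def pgm_step_entropy_simplex(f, grad, s):
    # eta_ent with H = indicator of P(Theta):
    # f_+ proportional to f * exp(-s * grad), renormalized
    w = f * np.exp(-s * grad)
    return w / w.mean()          # densities w.r.t. uniform tau have mean 1

def pgm_step_l2_soft_threshold(f, grad, s, lam):
    # eta_2 with H = lam * ||mu||: f_+ = soft-threshold(f - s * grad, s * lam)
    y = f - s * grad
    return np.sign(y) * np.maximum(np.abs(y) - s * lam, 0.0)

m = 500
theta = np.linspace(-0.5, 0.5, m, endpoint=False)
grad = np.abs(theta)             # gradient of a linear data-fitting term
f0 = np.ones(m)                  # initialization: uniform density

f_ent = pgm_step_entropy_simplex(f0, grad, s=2.0)
f_l2 = pgm_step_l2_soft_threshold(f0, grad, s=1.0, lam=0.6)
```

The multiplicative update keeps the iterate strictly positive and normalized, while the soft-thresholding update creates exact zeros, consistently with the two geometries discussed in the text.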
Let us now recall the guarantees for these methods. We stress that, as discussed in Section 2, these guarantees do not necessarily lead to convergence rates.
Proposition 3.4. Assume (A1) and (A2) and that η satisfies the conclusion of Lemma 3.2 for some p ∈ [1, 2], β ≥ 0. Consider an initialization f₀ ∈ dom H such that η′(f₀) ∈ L^∞(τ) and an admissible step-size s > 0.

Table 1 caption (fragment): "... is a soft-thresholding. In (ii), κ ∈ R is the unique number such that the update satisfies the constraint. In (iv), κ ≥ 0 is the smallest number such that the update satisfies the constraint (see Condat [2016] for efficient algorithms to compute κ in practice)."
(i) Let (f_k)_{k∈N} be generated by Algorithm 1 (PGM). If sup_k ‖f_k‖_{L¹(τ)} ≤ K, then Eq. (2) holds with α_k ≍ k^{−1}; (ii) if (f_k)_{k∈N} is generated by Algorithm 2 (APGM) and sup_k ‖h_k‖_{L¹(τ)} ≤ K, then Eq. (2) holds with α_k ≍ k^{−2}.
Proof. By Proposition 3.3, the updates are well-defined. The proof of [Tseng, 2010, Thm. 1] goes through, in particular thanks to Lemma 3.1 (smoothness) and since D_η̄/(K + β)^{p−2} is 1-strongly convex with respect to ‖·‖_{L¹(τ)} whenever this property is needed in the proof. A particularly simple exposition of the proof for APGM can be found in [d'Aspremont et al., 2021, Thm. 4.24].
Remark 3.5. A difficulty in Proposition 3.4 is that when p < 2, one needs to assume a priori bounds on the L¹-norm of certain iterates to obtain convergence guarantees, because the metric induced by the divergence D_η̄ becomes weaker as the L¹-norm increases. Since Algorithm 1 (PGM) is a descent method, ‖f_k‖_{L¹} is bounded, uniformly in k, as soon as the objective is coercive for the L¹-norm. But for Algorithm 2 (APGM), even if variants exist where (F(f_k))_k is monotonous [d'Aspremont et al., 2021], this property does not seem to be sufficient to control ‖h_k‖_{L¹}, even for coercive objectives. Of course, uniform bounds are always trivially satisfied when H̄ includes the constraint ι_{‖µ‖≤K} or ι_{P(Θ)}.

Reparameterized gradient descent as a Bregman descent
In this paragraph, we recall a link between Bregman gradient descent, a.k.a. mirror descent (an instance of Algorithm 1), and L² gradient descent dynamics on certain reparameterized objectives. The purpose is to show that the convergence rates proved in Section 4 with η_ent and η_hyp are also relevant to understand L² gradient descent in certain contexts. While these remarks are well known [Amid and Warmuth, 2020, Vaskevicius et al., 2019, Azulay et al., 2021], we find it instructive to state them clearly in our context. In order to reduce the discussion to its simplest setting, we consider the continuous-time dynamics in the unregularized setting, and we assume that they are well-defined.
We can make the following remarks:
• Combining (i) and (ii), we find that if (h⁺_t, h⁻_t)_{t≥0} follows an η_ent-mirror flow for the objective (h⁺, h⁻) ↦ F(h⁺ − h⁻), then f_t := h⁺_t − h⁻_t is an η_hyp-mirror flow for F. This confirms the interpretation of η_hyp as a "signed" version of the entropy (see also [Ghai et al., 2020, Thm. 23]).
• These exact equivalences are lost in discrete time, with an error term that scales as the squared step-size. It is thus difficult to convert the most efficient guarantees for (Bregman) PGM into guarantees with the same convergence rate for gradient descent.

Proof. The L² gradient flow of f ↦ F(f²) reads ∂_t f_t = −2 f_t F′(f_t²). Thus the function h_t := f_t² evolves according to

∂_t h_t = 2 f_t ∂_t f_t = −4 h_t F′(h_t),

which is precisely the η_ent-mirror flow of 4F since η″_ent(s) = s^{−1} for s > 0.
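The equivalence just recalled can be checked numerically; the sketch below uses a quadratic toy objective of our own choosing, F(h) = ½‖h − a‖² coordinate-wise, and compares gradient descent on f ↦ F(f²) with the multiplicative (η_ent) update for 4F, with a small step-size so that both discretizations track the common continuous-time flow:

```python
import numpy as np

a = np.array([2.0, 1.0, 0.5])        # minimizer of F(h) = 0.5 * ||h - a||^2
def gradF(h):
    return h - a

s, T = 1e-3, 5000                    # small step-size, fixed time horizon s*T
f = np.ones(3)                       # gradient descent on f -> F(f^2)
h = np.ones(3)                       # eta_ent mirror descent on h for 4F
for _ in range(T):
    f = f - s * 2.0 * f * gradF(f**2)      # chain rule: d/df F(f^2) = 2 f F'(f^2)
    h = h * np.exp(-4.0 * s * gradF(h))    # multiplicative update, mirror flow of 4F

gap = np.max(np.abs(f**2 - h))       # O(step-size) discretization mismatch
```

As the step-size shrinks, both trajectories converge to the same flow; in discrete time the mismatch is of the order announced in the remark above.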

Upper bounds on the convergence rates
This section contains the main result of this paper, Theorem 4.1, summarized in Table 2. As discussed in Section 2 and thanks to Proposition 3.4, in order to derive convergence rates for Algorithms 1 and 2 it is sufficient to control the function ψ of Eq. (3). For the class of problems we consider, the behavior of ψ highly depends on the context. The simplest situation is when F admits a minimizer f* ∈ L^q(τ) with q > 1 (since τ is finite, it holds L^q(τ) ⊂ L¹(τ)). Then, for the distance-generating functions η_p with 1 < p ≤ q, or η_hyp, it is easy to see that D_η̄(f*, f₀) < +∞ (this further requires f* ≥ 0 for η_ent). Thus, by Proposition 3.4, convergence rates follow directly in this case.

Table 2: (a) Upper bounds on the convergence rates of F(f_k) − inf F for Algorithm 1 (PGM) and Algorithm 2 (APGM). (b) The value of q that appears in the rate depends on the regularity of Φ and on whether Ḡ′ vanishes at optimality or not: q = 1 when Φ is Lipschitz (I), q = 2 when moreover Ḡ′[µ*] = 0 (I*), q = 2 when ∇Φ is Lipschitz (II), and q = 4 when moreover Ḡ′[µ*] = 0 (II*). This defines 4 settings referred to as (I), (I*), (II) and (II*). Upper bounds are derived in Thm. 4.1, lower bounds are proved in Section 5.
In the more subtle case where the minimizer ofF is only assumed to be in M(Θ), the variety of behaviors is captured by the following result.
Remark 4.2 (Additive vs. multiplicative updates). A consequence of Theorem 4.1 is that algorithms with "additive updates" (obtained with η₂ as a distance-generating function, e.g. ISTA, FISTA) suffer from a "curse of dimensionality" in the convergence rates, see Table 2-(a). In comparison, algorithms with "multiplicative updates" (obtained with η_ent or η_hyp as a distance-generating function) always converge at a faster rate which is independent of the dimension d. Note that Theorem 4.1 only proves upper bounds on the rates, but we will see in Section 5 that they are tight.
Proof. The upper bound in Eq. (8) corresponds to an upper bound on F(f_ε) − inf F + α D_η̄(f_ε, f₀) for a specific family of candidates f_ε ∈ L¹(τ), indexed by a scale parameter ε > 0. A special case of this argument, for sparse µ*, η_ent and q = 1, appeared in Chizat [2021] and was extended to µ* ∈ M₊(Θ) in Domingo-Enrich et al. [2020]. In the following we write F, G, H for F̄, Ḡ, H̄ to lighten notations.
Step 2. Bounding F(µ_ε) − F(µ*). For our admissible regularizers, it is easy to verify that H(µ_ε) ≤ H(µ*). By convexity of G, we have

G(µ_ε) − G(µ*) ≤ ∫_Θ G′[µ_ε] d(µ_ε − µ*).

It is clear that the magnitude and regularity of G′[µ_ε] play a role in the magnitude of this quantity. To go further, let us consider the various cases of Table 2-(b) successively.
(I). If Φ is Lipschitz then for θ, θ′ ∈ Θ it holds ‖Φ(θ) − Φ(θ′)‖ ≤ Lip(Φ) dist(θ, θ′). Since ∇R is Lipschitz continuous, we deduce that G′[µ_ε] is Lipschitz with a constant independent of ε, which yields a bound of order ε, i.e. q = 1.

(I*). If Φ is Lipschitz and moreover G′[µ*] = 0, it holds G′[µ_ε](θ) = ⟨∇R(∫Φ dµ_ε) − ∇R(∫Φ dµ*), Φ(θ)⟩. Since ∇R is Lipschitz continuous, the first factor is bounded by Lip(∇R)‖∫Φ d(µ_ε − µ*)‖ = O(ε), which improves the bound to O(ε²), i.e. q = 2.

(II). Here we have that G′[µ_ε] ∈ C¹(Θ; R) and ∇_θ G′[µ_ε] is Lipschitz. By the Mean Value Theorem on Riemannian manifolds (see e.g. [Gray and Willmore, 1982, Thm. 4.6]), there exists a constant K ≥ 0 such that, for all θ ∈ Θ, the first-order term is controlled at order ε², and the bound O(ε²) follows, i.e. q = 2.

(II*). If ∇Φ is Lipschitz and moreover G′[µ*] = 0, then an improvement as in (I*) applies. The functions {θ ↦ ⟨Φ′, Φ(θ)⟩ ; ‖Φ′‖ ≤ 1} are differentiable with a uniformly Lipschitz derivative, so arguments as in the previous paragraph show that the relevant factor is of order ε². Thus, going through the argument for (II) with all the constants multiplied by ε², we obtain that the bound improves to O(ε⁴), i.e. q = 4.

Since η′(f₀) ∈ L^∞, all the remaining terms are bounded by a constant independent of ε except ∫ η(f_ε) dτ, so it remains to bound the latter. If µ* = 0 then this quantity is bounded by a constant independent of ε and we are done. Otherwise, let us first assume that µ* ∈ M₊(Θ). By Jensen's inequality, for all θ ∈ Θ the value η(f_ε(θ)) is bounded by the corresponding average over µ*. It follows, by Fubini's theorem, that ∫_Θ η(f_ε) dτ is controlled in terms of τ(B_ε(θ)) and η(1/τ(B_ε(θ))). Now we use the fact that τ(B_ε(θ)) ≍ ε^d uniformly in θ, and our assumption that η(cx) ≍ η(x) for any fixed c > 0 as x → ∞, to bound this quantity by O(ε^d η(ε^{−d})).
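Combining the two estimates of the proof (suboptimality ≲ ε^q for the smoothed candidate, divergence ≲ ε^d η(ε^{−d})), the resulting trade-off can be written compactly as follows; this is a sketch consistent with the rates of Table 2, not a verbatim restatement of Eq. (8).

```latex
\psi(\alpha) \;\lesssim\; \inf_{0<\varepsilon\le 1}
\Big\{ \varepsilon^{q} \;+\; \alpha\,\varepsilon^{d}\,\eta\big(\varepsilon^{-d}\big) \Big\}
\;\lesssim\;
\begin{cases}
\alpha^{q/(d+q)} & \text{for } \eta_{2},\ \text{since } \varepsilon^{d}\eta(\varepsilon^{-d}) = \varepsilon^{-d},
\ \text{optimized at } \varepsilon \asymp \alpha^{1/(d+q)},\\[4pt]
\alpha \log(1/\alpha) & \text{for } \eta_{\mathrm{ent}},\ \text{since } \varepsilon^{d}\eta(\varepsilon^{-d}) \asymp d\,\log(1/\varepsilon).
\end{cases}
```

Plugging α_k ≍ k^{−1} (PGM) or α_k ≍ k^{−2} (APGM) then yields dimension-dependent rates for η₂ and dimension-free rates for η_ent, as in Table 2-(a).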

Lower bounds
We will consider two types of lower bounds: (i) lower bounds on ψ(α) in order to confirm that the analysis in Thm. 4.1 is tight, and (ii) direct lower bounds on the convergence rates of Algorithms 1 and 2. Of course, the latter imply the former, but studying ψ directly has its own interest and makes it simpler to cover all the cases.

Tight lower bounds on ψ
Let us show that the bounds on ψ in Theorem 4.1 cannot be improved without additional assumptions.
(II). Consider Ḡ(µ) = ∫_Θ Φ̄(θ) dµ(θ), where Φ̄ is any function which is twice continuously differentiable, coincides with dist(θ₀, ·)² on B_{1/2}(θ₀) and is larger than 1/4 outside of this ball. We cannot directly take dist(θ₀, ·)² because this function is not smooth everywhere on Θ due to the existence of a cut locus, but it is smooth on B_{1/2}(θ₀). Assumptions (A1)-(II) are satisfied. Again µ* = δ_{θ₀} is the unique minimizer and F(µ*) = 0. Analogous computations yield the matching lower bound on ψ.

Remark 5.2 (Exact decay of ψ for a natural class of problems). There exists in fact a broad class of problems satisfying Assumption (A1)-(II) for which the bound on ψ with q = 2 is exact. These are problems with a sparse solution µ* that satisfy an additional non-degeneracy condition at optimality, which appear naturally in certain contexts [Poon et al., 2018]. For these problems, it is shown in [Chizat, 2021, Prop. 3.2] that the decay of ψ can be characterized via a dual quantity, where the supremum is over 1-Lipschitz functions g : Θ → R uniformly bounded by 1.
Reasoning as in the proof of Proposition 5.1, this implies a matching lower bound on the convergence rates for this class of problems.

Direct lower bounds on the convergence rates
In this section, we directly lower-bound the convergence rates of Algorithm 1 and Algorithm 2. We focus on the L 2 geometry (η 2 ) for which we prove all the lower bounds (there are 8 cases to consider) and on the relative entropy geometry (η ent ) for which we omit certain settings for the sake of conciseness. In all the cases considered, the lower bounds match the upper bounds (up to logarithmic terms for η ent ). Let us start with PGM and η 2 .
Proposition 5.3. For each of the settings (I), (I*), (II) and (II*) under Assumption (A1), there exists a function F such that the iterates (f_k)_{k≥0} of Algorithm 1 (PGM) with the distance-generating function η₂, initialized with f₀ = 1 and run with any step-size s > 0, satisfy

F(f_k) − inf F ≳ k^{−q/(d+q)},

where q is the constant associated to the setting via Table 2-(b).
Proof. (I). As in the proof of Proposition 5.1, we consider Θ = T^d, θ₀ ∈ T^d, H̄ = ι_{P(Θ)} and Ḡ(µ) = ∫ Φ dµ with Φ(θ) = dist(θ, θ₀). We set s = 1 as the step-size plays no role in what follows. In this case, the update equation of Algorithm 1 writes f_{k+1} = (f_k − Φ + κ_k)₊, with κ_k ∈ R the unique number such that f_{k+1} has unit mass. Thanks to the symmetries of the problem, a direct recursion shows that f_k is supported on a ball around θ₀ of radius r(k) ≍ k^{−1/(d+1)}. We can compute the objective F(f_k) ≍ r(k) ≍ k^{−1/(d+1)}, which proves the case (I) (here q = 1) since inf F = 0.
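The rate in this computation can be cross-checked by actually running the method (a sketch of ours; grid size, iteration counts and the projection routine are implementation choices): PGM with η₂ on a grid discretization of T¹ with Φ(θ) = |θ| and the simplex constraint should exhibit a log-log slope close to −1/(d+1) = −1/2.

```python
import numpy as np

def simplex_project(y):
    # Euclidean projection onto {w >= 0, sum w = 1}
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

m = 2001
grid = np.linspace(-0.5, 0.5, m, endpoint=False)
Phi = np.abs(grid)                  # linear objective F(f) = int Phi f dtau
f = np.ones(m)                      # density w.r.t. uniform tau (mean 1)
s = 1.0
ks, vals = [], []
for k in range(1, 1601):
    # PGM step with eta_2 and H = indicator of P(Theta):
    # f_+ = argmin <Phi, f> + 1/(2s) ||f - f_k||^2  s.t.  f >= 0, mean f = 1
    f = m * simplex_project(f / m - (s / m) * Phi)
    if k in (100, 200, 400, 800, 1600):
        ks.append(k)
        vals.append(np.mean(Phi * f))   # F(f_k) - inf F, since inf F = 0

slope = np.polyfit(np.log(ks), np.log(vals), 1)[0]
```

The fitted slope sits near −1/2, the continuum prediction for d = 1, q = 1, as long as the support radius stays well above the grid resolution.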
Let us now prove similar lower bounds for APGM, again for the specific choice of distancegenerating function η 2 .
Proposition 5.4. For each of the settings (I), (I*), (II) and (II*) under Assumption (A1), there exists a function F such that the iterates (f_k)_{k≥0} of Algorithm 2 (APGM) with the distance-generating function η₂, initialized with f₀ = 1 and run with any step-size s > 0, satisfy

F(f_k) − inf F ≳ k^{−2q/(d+q)},

where q is the constant associated to the setting via Table 2-(b).
Proof. For (I), we consider the same setup as in the proof of Prop. 5.3-(I) (also fixing s = 1 for conciseness). Since Ḡ is linear, the update of Algorithm 2 can be computed explicitly and follows a recursion analogous to that of Prop. 5.3. The proof for the case (II) follows exactly the same scheme but with the function Φ̄ considered in the proof of Proposition 5.3, and we omit the details.
It is also instructive to look at lower bounds with the entropy η_ent. We observe that there exist cases where the guarantee given by Prop. 3.4 is off by a log(k) factor (because this factor is present in the lower bound of Proposition 5.1).
Proposition 5.5. For the settings (I) and (II) under Assumption (A1), there exists a function F such that the iterates (f_k)_{k≥0} of Algorithm 1 (PGM) with the distance-generating function η_ent, initialized with f₀ = 1 and run with any step-size s > 0, satisfy F(f_k) − inf F ≍ k^{−1}.

Proof. We consider the same setting as in the proof of Proposition 5.3-(I) (and s = 1 for simplicity). In this case, the update reads f_{k+1} ∝ f_k exp(−Φ), so by an immediate recursion f_k ∝ exp(−kΦ). This is essentially a (multi-dimensional) Laplace distribution and, when k is large, up to exponentially small terms in k we can compute the integrals over R^d instead of T^d. For the normalizing factor, we have

∫_{R^d} e^{−k|θ|} dθ = σ_{d−1} Γ(d) k^{−d},

where Γ is the Gamma function and σ_{d−1} is the surface area of the unit sphere in R^d. For the (unnormalized) value of F(f_k), we have

∫_{R^d} |θ| e^{−k|θ|} dθ = σ_{d−1} Γ(d + 1) k^{−(d+1)}.

By computing the ratio, it follows that F(f_k) − inf F ≍ k^{−1}. In Setting (II), we take the function Φ̄ as before, which is equal to dist(·, θ₀)² near θ₀. Now f_k ∝ exp(−kΦ̄), which is essentially, when k is large, a Gaussian distribution of variance 1/k, and similarly F(f_k) − inf F ≍ k^{−1}.

Remark 5.6. Although the convergence rates obtained with η_ent and η_hyp are independent of the dimension d (see Table 2), this favorable behavior crucially relies on the assumption that F̄ admits a minimizer µ* ∈ M(Θ). When this is not the case, Wojtowytsch and E [2020] show that there is an example where the continuous-time dynamics induced by η_ent also suffer from the curse of dimensionality (our setting is slightly different but their argument would apply here). In addition, the discrete-time dynamics are not stable in this case because the norm of the iterates grows unbounded, see Remark 3.5.

Figure 3: Behavior of Algorithm 1 (PGM) for a nonnegative sparse deconvolution problem with solution µ* = δ₀, with d = 1 and for various Bregman divergences η_p (p = 1 stands for η_ent). We plot the function f_k for k ∈ {0, 6, 6², 6³, 6⁴}; we use the same step-size and the same axes in all cases. The associated convergence plots are in Figure 4. The objective has structure (II*) so q = 4 in the rates of Table 2.
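The computation in the proof of Proposition 5.5 is easy to reproduce numerically (a sketch of ours for d = 1; the grid is an implementation choice): with η_ent and step-size 1 the iterates are f_k ∝ exp(−kΦ), and k · (F(f_k) − inf F) should approach Γ(d + 1)/Γ(d) = d = 1.

```python
import numpy as np

m = 200001
grid = np.linspace(-0.5, 0.5, m)
dx = 1.0 / m
Phi = np.abs(grid)                # setting (I): F(f) = int Phi f dtau, inf F = 0

def gap(k):
    # PGM with eta_ent and step 1 gives f_k proportional to exp(-k * Phi);
    # the suboptimality is the Phi-average under the normalized iterate
    w = np.exp(-k * Phi)
    return (Phi * w * dx).sum() / (w * dx).sum()

rates = {k: k * gap(k) for k in (20, 50, 100)}
```

The products k · gap(k) stabilize near 1, independently of any dimension-dependent constant, in agreement with the k^{−1} rate of the proposition.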

Numerical experiments
In this section we compare our theoretical rates with the practical behavior of PGM (Algorithm 1) and APGM (Algorithm 2) on simple toy problems. The purpose is to show that, although our analysis is asymptotic (in k and in the spatial discretization), it describes well the convergence of those algorithms in certain practical scenarios. The code to reproduce these experiments can be found online.

Sparse deconvolution
We consider the sparse deconvolution problem introduced in Section 3.1, where φ is a Dirichlet kernel φ(θ) = Σ_{k∈{−2,−1,0,1,2}^d} exp(2√−1 π k⊤θ) and y*(θ) = ∫ φ(θ − θ′) dµ*(θ′) with µ* = δ_0. The domain T^1 is discretized into a regular grid of m = 300 points, and T^2 into a regular grid of m = 60 × 60 points. Figure 3 illustrates the behavior of the various Bregman divergences for this problem: the iterates f_k (weakly) converge to the Dirac solution faster when p is smaller (in the following discussion, we use p = 1 to refer to the entropy or hypentropy distance-generating function).

Figure 5: Convergence of PGM and APGM vs. theoretical rates (up to log factors) in a sparse deconvolution problem with H = λ|µ|(Θ) and d = 1. Here p refers to the parameter of η_p and p = 1 refers to η_hyp; the objective has structure (II), so q = 2 in the rates of Table 2. (The remaining convergence-plot captions: η_ent with structure (II*) and q = 4; η_hyp with structure (II) and q = 2; η_hyp with structure (I) and q = 1.)

Figures 4, 5, 6 and 7 report the convergence rates in a variety of settings, which we compare to our theoretical predictions (without the logarithmic factors, since they do not change the asymptotic slope on a log-log plot). In both cases, inf F̄ admits a closed form, so we can exactly plot F̄(f_k τ_m) − min F̄ and additionally observe the effect of the discretization (here τ_m is the discretized reference measure). Observe that in the 2D experiments, APGM with p = 1 quickly reaches the discretization error, and in Figure 7-(b) it does not have enough "time" to attain the theoretical asymptotic rate before the effect of the discretization comes in. While our analysis is asymptotic, it thus corresponds in practice to a non-asymptotic and transient behavior.
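The multiplicative (η_ent) iteration on this problem is simple to implement on the grid. A minimal sketch in the 1D case (the step-size s and regularization level λ are illustrative choices, not those used in the paper's experiments):

```python
import numpy as np

# Entropic PGM (multiplicative updates) for nonnegative sparse deconvolution
# on a grid of T^1, with Dirichlet kernel phi and ground truth mu* = delta_0.
m = 300
theta = np.arange(m) / m
# Dirichlet kernel: frequencies k in {-2, ..., 2}
phi = np.real(sum(np.exp(2j * np.pi * k * theta) for k in range(-2, 3)))
y = phi.copy()                  # y*(t) = ∫ phi(t - t') dmu*(t') with mu* = delta_0
# discretized forward operator: column j is phi(. - theta_j), weighted by 1/m
A = np.stack([np.roll(phi, j) for j in range(m)], axis=1) / m
lam, s = 1e-3, 0.05             # illustrative regularization and step size
f = np.ones(m)                  # initialization f_0 = 1
F = lambda g: 0.5 * np.sum((A @ g - y) ** 2) + lam * np.sum(g)
F0 = F(f)
for _ in range(2000):
    grad = A.T @ (A @ f - y) + lam
    f = f * np.exp(-s * grad)   # eta_ent update: f_{k+1} ∝ f_k exp(-s grad)
```

Consistent with Figure 3, the iterate concentrates its mass around θ = 0, and the multiplicative form keeps f nonnegative without any projection step.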

Two-layer neural networks
We consider a two-layer ReLU neural network with the objective function introduced in Section 3.1, where we take n = 10 input samples x_i on a regular grid on [−1, 1] and observed variables y_i = |x_i| − 1/2 + Z_i, where the Z_i are independent and uniform on [−1, 1] (see the samples on Figure 9-(b)). The domain is S^1, discretized into a regular grid of m = 2000 points. This setting gives an example where Φ does not have a Lipschitz gradient and is only Lipschitz (observe the irregularity of Ḡ[µ*] on Figure 9-(a)). Since we use the regularization H = λ|µ|(Θ), we are in setting (I) from Table 2-(b), and the parameter for the rate is q = 1. Figure 8 shows the convergence rates of PGM and APGM. Although the general picture is consistent with the theory, we observe that our guarantees are somewhat over-conservative. For PGM, we roughly measure (between iterations k = 10³ and k = 10⁵) the rate exponents (−1.00, −0.72, −0.58) for p = (1, 1.5, 2) respectively, which corresponds to a parameter q ≈ 1.5 rather than q = 1. For APGM, we roughly measure the rate exponents (−1.97, −1.71, −1.41) instead of the predicted (−2, −1.33, −1). Figure 9-(a) helps to understand this discrepancy: as can be seen from the proof of Theorem 4.1, what truly determines the asymptotic rate is how much the objective function increases when µ* is mollified, and we quantified this using the regularity and magnitude of Ḡ[µ] near µ*. Here it appears that Ḡ[µ*] is smooth near 2 of the 3 points in the support of µ*, while it is non-smooth at the third point (the one in the middle). The fact that we have a mix of both levels of regularity (smooth vs. merely Lipschitz) may explain why the convergence is somewhat faster than predicted with the parameter q = 1, which only takes the Lipschitz regularity into account.

Figure 9: Dynamics of PGM on a two-layer neural network, for various values of p (p = 1 corresponds to η_hyp).
Observe in (b) how the dynamics with η hyp fits the kinks of the optimal regressor much faster than with p > 1.
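The rate exponents quoted above are slopes on a log-log plot, measured between two iteration counts; a least-squares fit recovers them. A minimal sketch, with a synthetic gap sequence standing in for the measured values of F̄(f_k) − min F̄:

```python
import numpy as np

def loglog_slope(ks, gaps):
    """Least-squares slope of log(gap) against log(k): the empirical rate exponent."""
    slope, _ = np.polyfit(np.log(ks), np.log(gaps), 1)
    return slope

# synthetic gap decaying like k^{-q/(d+q)} with q = 1, d = 1, i.e. k^{-1/2};
# the fitted exponent should recover -0.5
ks = np.arange(1000, 100000)
gaps = ks ** -0.5
print(loglog_slope(ks, gaps))  # ≈ -0.5
```

Fitting over a window such as k ∈ [10³, 10⁵], as done above for PGM, avoids contaminating the slope with the transient behavior of the early iterations.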

Conclusion
We have studied the convergence rates of PGM and APGM for convex optimization in the space of measures. Our analysis exhibits the influence of the regularity of the objective function on the convergence rates. It also confirms that the geometries induced by η_ent and η_hyp are better suited than the L² geometry for solving such problems. An important question for future research is to better understand the unregularized case, where the phenomenon of algorithmic regularization is at play.