Cardinality-constrained structured data-fitting problems

A memory-efficient framework is described for the cardinality-constrained structured data-fitting problem. Dual-based atom-identification rules are proposed that reveal the structure of the optimal primal solution from near-optimal dual solutions. These rules allow for a simple and computationally cheap algorithm for translating any feasible dual solution to a primal solution that satisfies the cardinality constraint. Rigorous guarantees are provided for obtaining a near-optimal primal solution given any dual-based method that generates dual iterates converging to an optimal dual solution. Numerical experiments on real-world datasets confirm the analysis and demonstrate the efficiency of the proposed approach.


Introduction
Consider the problem of fitting a model to data by building up model parameters as the superposition of a limited set of atomic elements taken from a given dictionary. Versions of this cardinality-constrained problem appear in a number of statistical-learning applications in machine learning [4,38,46,53], data mining, and signal processing [11]. In these applications, common atomic dictionaries include the set of one-hot vectors or matrices (i.e., vectors and matrices that contain only a single nonzero element) and rank-1 unit matrices. The elements of the given dictionary encode a notion of parsimony in the definition of the model parameters.
The cardinality-constrained formulation that we consider aims to

find x ∈ R^n such that f(b − Mx) ≤ α and card_A(x) ≤ k, (P)

where f : R^m → R is an L-smooth and convex function, M : R^n → R^m is a linear operator with m < n, b ∈ R^m is the observation vector, and A ⊆ R^n is the atomic dictionary. The cardinality function card_A measures the complexity of x with respect to the atomic set A, and the loss term f(b − Mx) measures the quality of the fit. When A = { ±e_1, ±e_2, ..., ±e_n } is the set of signed canonical unit vectors, the function card_A(x) simply counts the number of nonzero elements in x. Typically k ≪ n, which indicates that we seek a feasible model parameter x that has an efficient representation in terms of k atoms from the set A.
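For the dictionary of signed canonical unit vectors, the cardinality function reduces to counting nonzero entries. A minimal Python sketch (illustrative helper name, not from the paper, whose implementation is in Julia):

```python
import numpy as np

def card_signed_units(x, tol=1e-12):
    """card_A(x) for A = {±e_1, ..., ±e_n}: the fewest signed unit
    vectors whose conic combination equals x, i.e. the number of
    nonzero entries of x (up to a tolerance)."""
    return int(np.count_nonzero(np.abs(x) > tol))

x = np.array([0.0, 3.0, 0.0, -1.5])
print(card_signed_units(x))  # 2
```

Here x = 3·e_2 + 1.5·(−e_4) uses two atoms, so its cardinality with respect to this dictionary is 2.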
For the application areas that we target, the two characteristics of this feasibility problem that pose the biggest challenge to efficient implementation are the combinatorial nature of the cardinality constraint and the high dimensionality of the parameter space. To address the combinatorial challenge, we follow van den Berg and Friedlander [47,48] and Chandrasekaran et al. [15], and use the convex gauge function as a tractable proxy for the cardinality function (1); see §3. In tandem with the convexity of the loss function, this function allows us to formulate three alternative relaxed convex optimization problems that, under certain conditions, have approximate solutions satisfying the feasibility problem; see problems (P1), (P2), and (P3) in §4.
The high dimensionality of the parameter space, however, may make it inefficient, and perhaps practically impossible, to solve these convex relaxations, because it is infeasible to store directly the approximations to a feasible solution x. Instead, we wish to develop methods that leverage the efficient representation that feasible solutions have in terms of the atoms in the dictionary A. For example, consider the case in which the dictionary is the set of symmetric n × n rank-one matrices, and M is the trace linear operator that maps these matrices into m-vectors. Any method that iterates directly on the parameters x requires O(n^2 + m) storage for the iterates and the data. An alternative is the widely used conditional gradient method [22], which requires O(nt + m) storage after t iterations [31], but often requires a substantial number of iterations t to converge. Instead of storing x directly, we apply a dual method to one of the convex relaxations (P1), (P2), and (P3) (defined below); these dual methods typically require only O(m) storage, and still allow us to collect information on which atoms in A participate in the construction of a feasible x. One of the aims of this paper is to describe how to collect and use this information.

Approach
We propose a unified, algorithm-agnostic strategy that uses any sequence of improving dual solutions to one of the convex relaxations. This dual sequence identifies an essential subset of atoms in A needed to construct an ε-infeasible solution x that satisfies the conditions f(b − Mx) ≤ α + ε and card_A(x) ≤ k for any positive tolerance ε. These atom-identification rules, described in §5, derive from the polar-alignment property [19] and apply to arbitrary dictionaries A. They generalize earlier approaches described by El Ghaoui [25] and Hare and Lewis [27]. Once an essential subset of k atoms is identified, an ε-feasible solution x can be computed by optimizing over all positive linear combinations of this subset. This relatively small k-variable problem can often be solved efficiently.
We prove that when the atomic dictionary is polyhedral, we can set ε to zero and still identify in polynomial time a set of feasible atoms; see Corollary 9. When the atomic dictionary is spectrahedral, we prove that an ε-feasible set of atoms can also be identified in polynomial time; see Corollary 15.
We demonstrate via numerical experiments on real-world datasets that this approach is effective in practice.
There are three important elements in our primal-retrieval algorithm. The first element is an atom-identifier function EssCone_{A,k} that maps M*y, where y is any feasible dual variable, to a cone generated by k essential atoms; optimizing over this cone yields an ε-feasible solution. The explicit definition of the essential cone depends on the particular atomic set A. In §6, we make it explicit for atomic sets that are polyhedral (§6.1) and spectral (§6.2).
The second element is an arbitrary oracle function oracle_{f,A,M,b} (such as an appropriate first-order iterative method) that generates dual iterates y^(t) converging to the optimal dual variable y* of any of the dual problems (D_i). It is this oracle that generates the dual estimates subsequently used by EssCone_{A,k}.
The third algorithmic component is the reduced convex optimization problem, which at each iteration constructs a primal estimate x^(t) using the atoms identified through the dual estimate y^(t). The detailed algorithm is shown in Algorithm 1. Note that our primal-retrieval strategy does not aim to recover the optimal solutions to (P1), (P2), or (P3); these problems serve only as guidance for our atom-identification rule.
The final output of Algorithm 1 may therefore be different from the optimal solutions to (P1), (P2), or (P3).

Related work
Many recent approaches for atomic-sparse optimization problems are based on algorithms [17,20]. These methods, however, still need to retrieve at some point a primal solution x, which may require a prohibitive amount of memory for its storage. A widely used heuristic applies the truncated singular value decomposition to obtain low-rank solutions, but this heuristic is unreliable in minimizing the model misfit [19, Algorithm 6.4]. Memory-efficient atomic-sparse optimization thus requires efficient and reliable methods to retrieve an atomic-sparse primal solution.
Dual approaches for nuclear-norm or trace-norm regularized problems are attractive because they enjoy optimal storage, meaning that they have space complexity O(m) instead of O(n^2) [17]. For example, the bundle method for solving the Lagrangian dual formulation of semidefinite programming [28], and the gauge dual formulation of general atomic-sparse optimization problems [20], exhibit promising results in practice.
A related line of research uses memory-efficient primal-based algorithms based on hard thresholding. Some examples include gradient hard-thresholding [54], periodical hard-thresholding [1], and many proximal-gradient or ADMM-based hard-thresholding algorithms [30,35,37]. These approaches are primal-based and tangential to our purposes, so we do not include them in our discussion.
The theoretical analysis of our primal-retrieval approach is related to optimal atom identification [10,26,27], and especially to recently developed safe-screening rules for various kinds of sparse optimization problems [5,6,9,25,34,36,40,43,50-52,55]. One of our main results, given by Theorem 5, generalizes the gap safe-screening rule developed by Ndiaye et al. [39] to general atomic-sparse problems and to more general problem formulations. Some of the techniques used in our analysis are related to the facial reduction strategy of Krislock and Wolkowicz [33].

Preliminaries
We introduce in this section the main tools of convex analysis used to understand atomic sparsity. The gauge function (2) is always convex, nonnegative, and positively homogeneous. This function is finite only at points contained within the cone

cone(A) := { x = Σ_{a∈A} c_a a | c_a ≥ 0 }

generated from the elements of the set A. The gauge is not necessarily a norm because it may not be symmetric (unless A is centrosymmetric), may vanish at points other than the origin, and may not be finite valued (unless A contains the origin in the interior of its convex hull). Throughout, we make the blanket assumption that the dictionary A ⊆ R^n is compact and that the origin is contained in its convex hull. The assumption on the origin ensures that the gauge function is continuous. The compactness assumption is not strictly necessary for many of our conclusions, but does considerably simplify the analysis. The set A may be nonconvex, which is the case, for example, if it consists of a discrete set of two or more items.
The definition of the gauge function makes explicit this function's role as a convex penalty for atomic sparsity. The atomic support of a vector x is defined to be the collection of atoms a ∈ A that contribute positively to the conic decomposition implied by the value γ_A(x).

[Table 1: Commonly used atomic sets and the corresponding gauge and support functions [20]. The indicator function δ_C(x) is zero if x is in the set C and +∞ otherwise. The commonly used group norm is also an atomic norm [19, Example 4.7]. The functions ||X||_* and ||X||_2, respectively, correspond to the nuclear norm (sum of singular values) and spectral norm (maximum singular value) of a matrix X.]
Definition 1 (Atomic support). A subset of atoms S_A(x) ⊂ A is a support set for x with respect to A if x = Σ_{a∈S_A(x)} c_a a with c_a > 0 for all a ∈ S_A(x).
For example, the support set S_A(x) for the atomic set of one-hot unit vectors A = { e_i | i = 1, 2, ..., n } coincides with the nonzero elements of x with the corresponding sign. The support function to the set A is given by σ_A(z) := sup_{a∈A} ⟨a, z⟩. Because A is compact, every direction d ∈ R^n generates a supporting hyperplane to the convex hull of A. The atoms contained in that supporting hyperplane are said to be exposed by d. The following definition also includes the notion of atoms that are approximately exposed.
Definition 2 (Exposed and ε-exposed atoms). The exposed atoms and ε-exposed atoms, respectively, of a set A ⊆ R^n in the direction z ∈ R^n are defined by the sets

E_A(z) := { a ∈ A | ⟨a, z⟩ = σ_A(z) } and E^ε_A(z) := { a ∈ A | ⟨a, z⟩ ≥ σ_A(z) − ε },

where σ_A(z) := sup_{a∈A} ⟨a, z⟩ is the support function with respect to A. When ε = 0, the ε-exposed atoms coincide with the exposed atoms.
We list in Table 1 commonly used atomic sets, their corresponding gauge and support functions, and atomic supports.
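For a finite atomic set, the support function and the (ε-)exposed atoms of Definition 2 can be computed directly. A minimal sketch (illustrative Python, with the atoms stored as rows of a matrix):

```python
import numpy as np

def support_and_exposed(atoms, z, eps=0.0):
    """For a finite atomic set (rows of `atoms`), return the support
    function sigma_A(z) = max_a <a, z> and the indices of the
    eps-exposed atoms, i.e. those with <a, z> >= sigma_A(z) - eps."""
    vals = atoms @ z
    sigma = vals.max()
    return sigma, np.flatnonzero(vals >= sigma - eps)

# Signed canonical unit vectors in R^3.
atoms = np.vstack([np.eye(3), -np.eye(3)])
z = np.array([0.5, -2.0, 1.9])
sigma, idx = support_and_exposed(atoms, z, eps=0.2)
# sigma = 2.0, attained by -e_2; with eps = 0.2 the atom e_3
# (inner product 1.9) is also eps-exposed.
```

With ε = 0 only the maximizing atom −e_2 is exposed; enlarging ε admits nearly aligned atoms, which is exactly the slack the atom-identification rules of §5 exploit.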

Atomic-sparse optimization
We introduce in this section convex relaxations to the structured data-fitting problem (P). In particular, we consider three related gauge-regularized optimization problems, (P1), (P2), and (P3). It is well known that under mild conditions, these three formulations are equivalent for appropriate choices of the positive parameters λ, τ, and α [23]. Practitioners often prefer one of these formulations depending on their application. For example, tasks related to machine learning, including feature selection and recommender systems, typically feature one of the first two formulations [38,46,53]. On the other hand, applications in signal processing and related fields, such as compressed sensing and phase retrieval, often use the third formulation [11,47].
Our primal-retrieval strategy relies on the hypothesis that the atomic-sparse optimization problems (P1), (P2), and (P3) are reasonable convex relaxations to the structured data-fitting problem (P), in the sense that the corresponding optimal solutions are feasible for (P). We formalize this in the following assumption.

Assumption 3. Let x* denote an optimal solution to (P_i), i = 1, 2, 3. Then x* is feasible for (P).

As described in §1, the Fenchel-Rockafellar duals for these problems typically have smaller space complexity. These dual problems can be formulated as minimization problems involving f*(y) = sup_w { ⟨y, w⟩ − f(w) }, the convex conjugate function of f, and M* : R^m → R^n, the adjoint operator of M, which satisfies ⟨Mx, y⟩ = ⟨x, M*y⟩ for all x ∈ R^n and y ∈ R^m. The derivation of these dual problems can be found in Appendix A.

Atom identification
We demonstrate in this section how an optimal dual solution can be used to identify essential atoms that participate in the primal solutions. In order to develop atom-identification rules that apply to arbitrary atomic sets A ⊆ R^n, even those that are uncountably infinite, we require generalized notions of active constraint sets. In linear programming, for example, the simplex multipliers give information about the optimal primal support. By analogy, our atom-identification rules give information about the essential atoms that participate in the support of the primal optimal solutions. In addition, we extend the identification rules to approximate the essential atoms from approximate dual solutions. We build on the following result, due to Fan et al.

Theorem 4 (Atom identification). Let x* and y* be optimal primal-dual solutions for problems (P_i) and (D_i), with i = 1, 2, 3. Then S_A(x*) ⊆ E_A(M*y*).

The following theorem generalizes this result to show similar atomic-support identification properties that also apply to approximate dual solutions. In particular, given a feasible dual variable y close to y*, the support of x* is contained in the set of ε-exposed atoms that includes E_A(M*y*).
Theorem 5 (Generalized atom identification). Let x_i and y_i be feasible primal and dual vectors, respectively, for problems (P_i) and (D_i), with i = 1, 2, 3. Then S_A(x*_i) ⊆ E^{ε_i}_A(M*y_i), where each ε_i is defined for problem i, β̲ and β̄ are positive lower and upper bounds, respectively, for β*, and ||M||_A := max_{a∈A} ||Ma||_2 is the induced atomic operator norm.
Theorem 5 asserts that the underlying atomic support of x*_i is contained in the set of ε-exposed atoms of M*y_i. Moreover, when y_i → y*_i (and, for problem (D3), the bounds β̲ → β* and β̄ → β*), each ε_i → 0, and thus (4) implies that we have a tighter containment for the optimal atomic support. The proofs for parts (a) and (b) of Theorem 5 depend on the strong convexity of f*, which is implied by the Lipschitz smoothness of f [29, Theorem 4.2.1]. This convenient property, however, is not available for part (c) because the dual objective of (D3) is the perspective map of f* + α, which is not strongly convex [2]. We resolve this technical difficulty by instead imposing the additional assumption that bounds are available on the dual optimal variable β*. Appendix C describes how to obtain these bounds during the course of the level-set method developed by Aravkin et al. [3].
The gap safe-screening rule developed by Ndiaye et al. [40] is a special case of Theorem 5 that applies only to (P1) for the particular case in which γ_A is the one-norm. We also note that Theorem 5 states a tighter bound on ε_1 that relies on the optimal value d_1(y*_1) rather than the primal value at an arbitrary primal-feasible point.

Primal retrieval
Theorem 5 serves mainly as a technical tool for error-bound analysis, in particular because it is impractical to compute or approximate ε_i. However, the inclusions (3) and (4), respectively, of Theorems 4 and 5 motivate us to define an atom-identifier function EssCone_{A,k} that depends on the dual variable y and satisfies the corresponding inclusions. The next two sections demonstrate how to construct such a function for polyhedral and spectral atomic sets, which are two important examples that appear frequently in practice. With this function we can thus implement the primal-recovery problem required by Step 5 of Algorithm 1. Moreover, we show how to use the error bounds of Theorem 5 to analyze the atom-identification properties of the resulting algorithm.

Primal-retrieval for polyhedral atomic sets
We formalize in this section a definition for the function EssCone_{A,k} for the case in which A is a finite set of vectors, which implies that its convex hull is polyhedral. Given a feasible dual vector y, consider the top-k atoms in A with respect to the inner product with the vector M*y; note that there may be many sets of k atoms that satisfy this property. We then construct the cone of essential atoms as the convex conic hull generated from this set of top-k atoms. Thus, the primal-retrieval computation in Step 5 of Algorithm 1 is obtained from the k-vector of coefficients ĉ, computed by minimizing the reduced objective over a k-dimensional polyhedron defined by the top-k atoms.
Example 6 (Sparse vector recovery). We consider the problem of recovering a sparse vector x from noisy observations b := Mx + η, where M : R^n → R^m is a given measurement matrix and η ∈ R^m is standard Gaussian noise. For some expected noise level α > 0, the sparse recovery problem can thus be expressed as (P) with f = ||·||_2 and with the atomic set A = { ±e_1, ..., ±e_n }, where each e_i is the ith canonical unit vector. The basis pursuit denoising (BPDN) approach approximates this problem by replacing the cardinality constraint with an optimization problem that minimizes the 1-norm of the solution; see Chen et al. [16]. This convex relaxation corresponds to problem (P3). There are many dual methods that generate iterates y^(t) converging to the optimal dual solution to (8), including the level-set method coupled with the dual conditional-gradient suboracle, as described by [3,19]. The resulting primal-retrieval strategy for Step 5 of Algorithm 1 can thus be implemented by solving the reduced problem (6), which in this case is a nonnegative least-squares problem for which many standard algorithms are available. For example, an accelerated projected gradient descent method requires O(mk log(1/ε)) iterations when M[s_{i_1} e_{i_1} ... s_{i_k} e_{i_k}] has full column rank.

(Termination)
Step 6 of Algorithm 1 is implemented simply by verifying that f_k(c^(t)) ≤ α. (As verified by Corollary 9, we may take ε = 0 in this polyhedral case.) If the test passes, we can terminate the algorithm and return the primal variable, which is the superposition of the top-k atoms. Otherwise, the algorithm proceeds to the next iteration.
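The retrieval step of Example 6 can be sketched in a few lines: select the k coordinates where |M^T y| is largest, form the signed atoms, and fit nonnegative coefficients. This is an illustrative Python sketch (the paper's code is Julia) using a plain projected-gradient loop in place of the accelerated method mentioned above; the helper name is ours:

```python
import numpy as np

def retrieve_bpdn(M, b, y, k, iters=500):
    """Polyhedral primal retrieval for BPDN: pick the top-k atoms
    s_i e_i exposed by z = M^T y, then solve the reduced nonnegative
    least-squares problem min_{c >= 0} 0.5 ||A c - b||^2 by projected
    gradient descent."""
    z = M.T @ y
    idx = np.argsort(np.abs(z))[-k:]       # indices of the top-k atoms
    s = np.sign(z[idx])                    # signs s_i = sign((M^T y)_i)
    A = M[:, idx] * s                      # columns s_i * (M e_i)
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    c = np.zeros(k)
    for _ in range(iters):
        c = np.maximum(0.0, c - (A.T @ (A @ c - b)) / L)
    x = np.zeros(M.shape[1])
    x[idx] = s * c                         # superposition of the top-k atoms
    return x
```

In the toy case M = I with y taken as the residual-free observation b, the top-k rule recovers the support and signs of b exactly, and the nonnegative fit returns b itself.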
We describe numerical experiments for the sparse vector recovery problem in §7.1.

Iteration complexity
In order to guarantee the quality of the recovered solution, we rely on a notion of degeneracy introduced by Nutini et al. [41].
Definition 7. Let x* and y*, respectively, be optimal primal and dual solutions for problems (P_i) and (D_i), where A is polyhedral, and let δ be a positive scalar. The problem pair ((P_i), (D_i)) is then said to be δ-nondegenerate.

The next proposition guarantees a finite-time atom-identification property when the atomic set is polyhedral.

Proposition 8 (Finite-time atom identification). For each problem i = 1, 2, 3, let {y^(t)}_{t=1}^∞ be a sequence that converges to an optimal dual solution y*_i. If the atomic set A is polyhedral and the problem pair ((P_i), (D_i)) is δ-nondegenerate, then there exists T > 0 such that EssCone_{A,k}(M*y^(t)) ⊇ E_A(M*y*_i) and x^(t) is feasible for (P) for all t > T.
It follows that Algorithm 1 will terminate in T iterations regardless of the tolerance ε.
Proposition 8 ensures that the atom-identification property described by Theorem 5 is guaranteed to discard superfluous atoms in a finite number of iterations as long as we have available an iterative solver that generates dual iterates converging to a solution.Thus, Algorithm 1 is guaranteed to generate a feasible solution to (P).The following corollary characterizes a bound on T in terms of the convergence rate of the dual method.

Corollary 9.
For each problem i = 1, 2, 3, suppose the dual oracle generates iterates y^(t) converging to the optimal variable y*_i at rate O(t^{−α}) for some α > 0. If the atomic set A is polyhedral and the problem pair ((P_i), (D_i)) is δ-nondegenerate, then Algorithm 1 with ε = 0 terminates in O(δ^{−2/α}) iterations.

Centrosymmetry and unconstrained primal recovery
Further computational savings are possible when the atomic set A is centrosymmetric, i.e., A = −A. Centrosymmetry is a common property, and perhaps the prototypical example is the set of signed canonical unit vectors given by (7). Whenever centrosymmetry holds, cone(A) = span(A). This motivates us to replace the function EssCone with the essential span function EssSpan, defined as the span of the collection A_k of top-k atoms given by (5). Thus, the primal-retrieval optimization problem (6) reduces to an unconstrained problem in the coefficient vector ĉ. The following corollary simply asserts that the complexity results described in §6.1.1 continue to hold for centrosymmetric atomic sets when using the essential span function.
Corollary 10 (Atom identification under centrosymmetry).If the atomic set A is centrosymmetric and polyhedral, then Proposition 8 and Corollary 9 hold with EssCone replaced by EssSpan.
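Under centrosymmetry the sign of an atom is absorbed into its coefficient, so the reduced fit is plain least squares. An illustrative Python sketch (names ours; we store one representative per ± pair of atoms as rows of `atoms`, and take f = 0.5||·||_2^2):

```python
import numpy as np

def retrieve_centrosymmetric(atoms, M, b, y, k):
    """Primal retrieval over the essential span: score atoms by
    |<a, M^T y>| (since +a and -a score symmetrically), keep the
    top-k, and solve the unconstrained least-squares problem
    min_c 0.5 ||M A_k c - b||^2."""
    scores = np.abs(atoms @ (M.T @ y))
    idx = np.argsort(scores)[-k:]
    Ak = atoms[idx].T                               # n x k basis of the essential span
    c, *_ = np.linalg.lstsq(M @ Ak, b, rcond=None)  # coefficients may be any sign
    return Ak @ c
```

Compared with the conic version, no nonnegativity projection is needed; a single `lstsq` call replaces the iterative nonnegative fit.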

Primal-retrieval for spectral atomic sets
We formalize in this section a definition for the function EssCone_{A,k} for the case in which A is a collection of rank-1 matrices, either asymmetric or symmetric, respectively. We mainly focus on the former atomic set of asymmetric matrices because all of our theoretical results easily specialize to the symmetric case. Note that this atomic set is centrosymmetric (cf. (9)), and, as we will see below, the recovery problem is unconstrained. Later, in §6.2.2, we describe the recovery problem for the atomic set of symmetric matrices, which is in fact a non-centrosymmetric set.
For this section only, we work with the linear operator M : R^{m×n} → R^{p×q}, and replace the vector of observations b with the p-by-q matrix B. In this context, the dual variable for one of the corresponding dual problems is a matrix of the same dimension.
Fix a feasible dual variable Y and define the singular value decomposition (SVD) of its product with the adjoint of M, where Σ_k is the diagonal matrix consisting of the top-k singular values of M*(Y), the matrices U_k and V_k contain the corresponding left and right singular vectors, and the matrices U_{−k}, V_{−k}, and Σ_{−k} contain the remaining singular vectors and values. The reduced atomic set induced by U_k and V_k can then be expressed in terms of these singular vectors. We construct the cone of essential atoms as the convex cone generated from the reduced atomic set A_k. Thus, the primal-retrieval computation in Step 5 of Algorithm 1 is given in terms of a k × k matrix Ĉ obtained by solving the reduced problem (PR), which is defined over the cone generated by the essential atoms identified through the top-k singular triples of M*(Y), described by (13).
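The construction above can be sketched for the matrix-completion setting of Example 11, with f = 0.5||·||_F^2 and M an entrywise mask. This is an illustrative Python sketch (helper names ours; `G` stands for M*(Y), and a plain gradient loop stands in for whichever solver is used on the reduced problem):

```python
import numpy as np

def retrieve_spectral(G, B, mask, k, iters=300):
    """Spectral primal retrieval: keep the top-k singular triples
    (U_k, V_k) of G = M^*(Y), then fit a k x k coefficient matrix C
    by gradient descent so that X = U_k C V_k^T matches the observed
    entries of B (mask has 1s on observed entries, 0s elsewhere)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T
    C = np.zeros((k, k))
    for _ in range(iters):
        R = mask * (Uk @ C @ Vk.T - B)   # residual on observed entries
        C -= Uk.T @ R @ Vk               # unit step: the gradient map has
                                         # Lipschitz constant <= 1 here, since
                                         # the mask is a projection and
                                         # Uk, Vk have orthonormal columns
    return Uk @ C @ Vk.T
```

When B is exactly rank k, fully observed, and G is aligned with it (e.g. G = B), the top-k subspaces are exact and the loop recovers B itself.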
Example 11 (Low-rank matrix completion). The low-rank matrix completion (LRMC) problem aims to recover a low-rank matrix from partial observations; it arises in many real applications, such as recommender systems [44] and a convex formulation of the phase retrieval problem [14]. The LRMC problem can be expressed as: find X ∈ R^{m×n} that matches the set of observations {B_{i,j} | (i, j) ∈ Ω} over the index set Ω. Problem (16) corresponds to (P) with the objective f = ½||·||_2^2, the atomic set A given by (10), and the linear operator M defined by the mask M(X)_{i,j} = X_{i,j} if (i, j) ∈ Ω, and 0 otherwise.
Fazel [21] popularized the convex relaxation of (16) that minimizes the sum of singular values of X. This problem corresponds to formulation (P3). As with Example 6, there are many dual methods that can generate dual feasible iterates Y^(t) converging to the dual solution of (17), such as a dual bundle method [20]. The resulting primal-retrieval strategy for Step 5 of Algorithm 1 can be implemented by executing the following steps: 1. (Top-k atoms) Compute the leading k singular vectors of the matrix M*(Y^(t)), as defined by the SVD (12). 2. (Retrieve coefficients) Solve the reduced problem (15), which in this case is a least-squares problem that can be solved to within ε-accuracy in O((k|Ω| + (m + n)k + k^3) ε^{−1/2}) iterations, for example, with the FISTA algorithm [7]. Typically, k ≪ min{m, n}, and so we expect this reduced problem to be significantly cheaper to solve than the original problem (17).

(Termination)
Step 6 of Algorithm 1 terminates when the value of the reduced objective satisfies the condition f_k(C^(t)) ≤ α + ε, where ε is some predefined tolerance. In that case, the algorithm returns the primal estimate constructed from the left and right singular vectors. We describe numerical experiments for the low-rank matrix completion problem in §7.2.

Iteration complexity
In the polyhedral case, we were able to assert through Proposition 8 that the optimal primal variable's atomic support can be identified in finite time. As we show here, however, finite-time identification is not possible in the spectral case. The following counterexample shows that the partial SVD of M*(Y), which we used in (12), cannot give a safe cover of the essential atoms in E_A(M*(Y*)) even when this set is a singleton and Y is arbitrarily close to a dual solution Y*.

Example 12 (Limitation of partial SVD). Consider a primal-dual problem pair (18)-(19) parameterized by some fixed ε ∈ (0, 1). In this pair of problems, the linear operator M is simply the identity map, and the cone of essential atoms described by (13) depends only on the dual variable Y. Let u_1 and v_1, respectively, be the first columns of U and V. Evidently, the support of X* coincides with the essential atoms of Y*, and moreover, the support is a unique singleton. We can, however, construct a dual feasible solution arbitrarily close to Y* whose top singular vectors fail to expose this atom for any ε ∈ (0, 1), which in effect implies that it is impossible to recover exactly the true solution even with an arbitrarily accurate dual approximation Y.
This last example motivates our study of the quality with which a partial SVD of M*(Y), for a given feasible dual solution Y, can approximate the support E_A(M*(Y*)). The next result measures the difference between S_A(x*) and EssCone_{A,k}(M*(Y)) using the one-sided Hausdorff distance.
where each ε_i is defined in Theorem 5 and dist(A_1, A_2) denotes the one-sided Hausdorff distance between sets A_1 and A_2.
Oustry [42, Theorem 2.11] developed a related result based on the two-sided Hausdorff distance. Directly applying Oustry's result in our context yields a bound of order O(ε/(σ_k − σ_{k+1})), which is looser than the bound shown in Proposition 13 because σ_1 ≥ σ_k ≥ σ_{k+1}.
Finally, we show the error bound for primal recovery.
Proposition 14 characterizes the error bound for our primal-retrieval strategy when A is spectral. Our next corollary shows that Algorithm 1 terminates in polynomial time for any tolerance ε > 0.
Corollary 15. For one of the problems (D_i), i = 1, 2, 3, suppose that a dual oracle generates iterates Y^(t) converging to the optimal variable Y* at rate O(t^{−α}) for some α > 0. If the atomic set A is spectral, then Algorithm 1 with ε > 0 terminates in O(ε^{−4/α}) iterations.

Non-centrosymmetry and constrained primal recovery
We now consider the case in which the atomic set A is given by (11), which is not centrosymmetric. As we show below, the corresponding primal-recovery problem is constrained.
Fix a feasible dual variable Y and define the eigenvalue decomposition of its product with the adjoint of M, where V_k and the diagonal matrix Σ_k, respectively, contain the top-k eigenvectors and eigenvalues of M*(Y), and V_{−k} and the diagonal matrix Σ_{−k}, respectively, contain the remaining eigenvectors and eigenvalues. The reduced atomic set induced by V_k can then be expressed in terms of these eigenvectors, and the convex cone of essential atoms is generated from the reduced atomic set A_k. The recovery problem (15) then becomes constrained: the coefficient matrix Ĉ must lie in the positive semidefinite cone.
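The constrained symmetric case can be sketched analogously to the asymmetric one, with the SVD replaced by an eigendecomposition and a positive-semidefinite projection on the coefficient matrix at each step. An illustrative Python sketch (helper names ours; `G` stands for M*(Y), the loss is 0.5||·||_F^2 over masked entries):

```python
import numpy as np

def retrieve_symmetric(G, B, mask, k, iters=300):
    """Symmetric primal retrieval: keep the top-k eigenvectors V_k of
    G = M^*(Y) and fit a k x k matrix C by projected gradient descent,
    projecting C onto the PSD cone each step so that X = V_k C V_k^T
    stays in the cone of essential atoms."""
    w, V = np.linalg.eigh(G)
    Vk = V[:, np.argsort(w)[-k:]]              # top-k eigenvectors
    C = np.zeros((k, k))
    for _ in range(iters):
        R = mask * (Vk @ C @ Vk.T - B)         # residual on observed entries
        C = C - Vk.T @ R @ Vk                  # gradient step (unit step size)
        C = 0.5 * (C + C.T)                    # keep C symmetric
        lam, Q = np.linalg.eigh(C)
        C = (Q * np.maximum(lam, 0.0)) @ Q.T   # project onto the PSD cone
    return Vk @ C @ Vk.T
```

The extra eigendecomposition of C per iteration is cheap because C is only k × k, with k far smaller than the ambient dimension.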

Numerical experiments
We conduct several numerical experiments on both synthetic and real-world datasets to empirically verify the effectiveness of our proposed primal-retrieval strategy. In §7.1, we describe experiments on the basis pursuit denoising problem (Example 6), which show the performance of our strategy on a polyhedral atomic set. In §7.2, we apply our primal-retrieval technique to the low-rank matrix completion problem (Example 11) and test the effectiveness of our proposed method on a spectral atomic set. In §7.3, we describe experiments on an image preprocessing problem, where the atomic set A is the sum of a polyhedral atomic set and a spectral atomic set; this shows that our strategy can be applied to more complicated cases. For all experiments, we implement the level-set method proposed by Aravkin et al. [3], storing only the dual variable y. We implement the level-set method and our primal-retrieval strategy in the Julia language [8], and our code is publicly available at https://github.com/MPF-Optimization-Laboratory/AtomicOpt.jl. All the experiments are carried out on a Linux server with 8 CPUs and 64 GB of memory.

Basis pursuit denoise
The experiments in this section include a selection of five relevant basis pursuit problems from the Sparco collection [49] of test problems. The chosen problems are all real-valued and suited to one-norm regularization. Each problem in the collection includes a linear operator M : R^n → R^m and a right-hand-side vector b ∈ R^m. Table 2 summarizes the selected problems. We compare the results with SPGL1 [47]. In all problems, we set α = 10^{−3}·||b||. The results are shown in Table 3, where nMat denotes the total number of matrix-vector products with M or M*. As we can observe from Table 3, the level-set algorithm equipped with our primal-retrieval technique obtains an ε-feasible solution within a small number of iterations, which is consistent with the finite-time identification property described by Proposition 8. We also observe that the level-set method coupled with the primal-retrieval strategy can converge faster than SPGL1 with its default stopping criterion. This suggests that our primal-retrieval technique is both a memory-efficient method for obtaining approximate primal solutions with provable error bounds and a practical technique that allows the optimization algorithm to stop early.

Low-rank matrix completion
For this atomic set, we conduct an experiment similar to that carried out by Candès and Plan [13]. The data, retrieved from the website of the National Centers for Environmental Information, are the daily average temperatures at 6798 different weather stations throughout the world in the year 2020. The temperature matrix X is approximately low rank in the sense that ||X − X_5||_F / ||X||_F ≈ 24%, where X_5 is the matrix created by truncating the SVD after the top 5 singular values.
To test the performance of our matrix completion algorithm, we subsampled 50% of X and then recovered an estimate X̂. The solution gives a relative error of ||X̂ − X||_F / ||X||_F ≈ 30%. The result is shown in Figure 1a. As we can see from Figure 1a, the recovery error exhibits a positive correlation with the duality gap: both the duality gap and the recovery error decrease as the number of iterations increases. The observations in this experiment are consistent with our theory (Proposition 14).

Robust principal component analysis
In this section, we show that our primal-retrieval strategy can be applied to atomic sets beyond the polyhedral and spectral cases. We conduct an experiment similar to the one in [12]. Face recognition algorithms are sensitive to shadows on faces, and it is therefore necessary to remove illumination variations and shadows from the face images. We obtained face images from the Yale B face database [24]. The original faces are shown in Figure 1b; each face image is of size 192 × 168 and was captured under 64 different lighting conditions. The images were reshaped into a matrix M ∈ R^{32256×64}. Because of the similarity between faces and the sparse structure of the shadows, the matrix M can be approximately decomposed into two components, M ≈ L + S, where L is a low-rank matrix corresponding to the clean faces and S is a sparse matrix corresponding to the shadows. Based on the work by Fan et al. [18], such a decomposition can be obtained by solving the convex optimization problem (20). By [19, Proposition 7.3], problem (20) is equivalent to a problem in the variable X = L + S to which our framework applies. The recovered low-rank component is shown in Figure 1c. As we can see from the figure, most of the shadows are successfully removed. This experiment suggests that our primal-retrieval technique can potentially be used for more complex atomic sets and allows the underlying dual algorithm to produce a satisfactory result within a reasonable number of iterations.
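The low-rank-plus-sparse split described above can be illustrated with a small sketch. This is not the exact convex program or solver from [12, 18]; instead it runs block coordinate descent on the penalized objective ½‖M − L − S‖²_F + τ‖L‖_* + τλ‖S‖₁, one standard way to compute such a decomposition, and all names (`svt`, `shrink`, `rpca_alternating`) are ours.

```python
import numpy as np

def svt(Y, tau):
    """Singular-value thresholding: prox of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(Y, tau):
    """Entrywise soft thresholding: prox of tau * (l1 norm)."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def rpca_alternating(M, lam, tau=1.0, iters=50):
    """Block coordinate descent for M ~ L + S: each step exactly
    minimizes the penalized objective in one block."""
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - S, tau)           # low-rank update
        S = shrink(M - L, tau * lam)  # sparse update
    return L, S

# toy example: rank-1 signal plus a few large sparse spikes
rng = np.random.default_rng(1)
L0 = 10.0 * np.outer(rng.standard_normal(40), rng.standard_normal(30))
S0 = np.zeros((40, 30))
S0[rng.random((40, 30)) < 0.05] = 25.0
M = L0 + S0
L, S = rpca_alternating(M, lam=0.25)
```

Because each block update is an exact proximal minimization, the penalized objective decreases monotonically, so the residual M − L − S shrinks relative to M.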

Conclusion
In this work, we proposed a simple primal-retrieval strategy for atomic-sparse optimization. We demonstrated both theoretically and empirically that the proposed strategy obtains good solutions to the cardinality-constrained problem, given a dual-based algorithm that converges to the optimal dual solution.
Further research opportunities remain, particularly in designing meaningful primal-retrieval strategies for non-polyhedral and non-spectral atomic sets. The primal-retrieval technique developed in this work is algorithm-agnostic, and it is an open question whether more efficient primal-retrieval approaches can be tailored to specific optimization algorithms, such as the conditional-gradient method.
For (P3), by the properties of conjugate functions and Proposition 17, we obtain the conjugate of the objective. Then, by [29, Example E.2.5.3], we obtain the support function of the sublevel set. Finally, by Theorem 16, we arrive at the dual problem for (P3), a minimization over y ∈ R^m and β > 0.

B Proof of Theorem 5
The proof of this theorem relies on the following duality between smoothness and strong convexity.
Lemma 18 ([32, Theorem 6]). If f is L-smooth, then f* is (1/L)-strongly convex.

Proof of Theorem 5. a) Let y* denote the optimal dual variable for (D1). First, we show that ‖y − y*‖ can be bounded by the duality gap. Let g(y) := f*(y) − ⟨b, y⟩. By Lemma 18, f* is (1/L)-strongly convex, and it follows that g is also (1/L)-strongly convex. By the definition of strong convexity, for any s ∈ ∂g(y*),

g(y) ≥ g(y*) + ⟨s, y − y*⟩ + (1/(2L)) ‖y − y*‖².

Optimality requires that there exists s ∈ ∂g(y*) such that ⟨s, y − y*⟩ ≥ 0 for all y with σ_A(M*y) ≤ λ.
Therefore, reordering the inequality gives ‖y − y*‖² ≤ 2L (g(y) − g(y*)). Next, we show that E_A(M*y*) ⊆ E_A(M*y, ε1). For any a ∈ E_A(M*y*), the required inequality follows, where the last inequality uses the definition of ε1 in Theorem 5 together with the facts that σ_A(M*y*) = λ and y is feasible for (D1).

b) Let y* denote the optimal dual variable for (D2). First, we show that ‖y − y*‖ can be bounded by the duality gap. Let g(y) := f*(y) − ⟨b, y⟩ + τ σ_A(M*y). By Lemma 18, f* is (1/L)-strongly convex, and it follows that g is also (1/L)-strongly convex. By the definition of strong convexity, for any s ∈ ∂g(y*),

g(y) ≥ g(y*) + ⟨s, y − y*⟩ + (1/(2L)) ‖y − y*‖².

By optimality, 0 ∈ ∂g(y*). Reordering the inequality, we deduce a bound on ‖y − y*‖, where the last inequality follows from the definition of ε2 in Theorem 5.

c) Let (y*, β*) denote the optimal dual variables for (D3). First, we show that ‖y − y*‖ can be bounded by the duality gap. Let g denote the objective of (D3) with β fixed at β*. By Lemma 18, f* is (1/L)-strongly convex, and it is straightforward to check that g is (1/(β*L))-strongly convex. By the definition of strong convexity and optimality, reordering the resulting inequality yields a bound on ‖y − y*‖². Since β* is unknown to us, we next derive an upper bound for d3(y, β*). Fix y and let h(β) = d3(y, β). By the properties of perspective functions, h is convex, and the claimed upper bound follows. Finally, we show the corresponding inclusion for ε3.

C Upper and lower bound for β *
First, we consider (D3). Let w = y/β; then (D3) can be equivalently expressed as the minimization in (21). Comparing (21) with (D1), we conclude that the two problems are equivalent when λ = β*. It thus follows that (P3) is equivalent to a corresponding constrained problem. Next, consider using the level-set method [3] with bisection to solve (P3). There exists τ* > 0 such that (P3) is equivalent to a constrained problem parameterized by τ*. With the level-set method, we can obtain (x1, τ1) and (x2, τ2) such that τ1 ≤ τ* ≤ τ2 and xi is optimal for the corresponding subproblem. Then there exist β1 and β2 such that β1 ≥ β* ≥ β2 and xi is optimal for the associated penalized problem. Finally, by [19, Theorem 5.1], we can conclude the stated relation. Therefore, we can obtain upper and lower bounds for β* via the level-set method with bisection. Moreover, by strong duality and the convergence of the bisection method, the gap between β1 and β2 converges to zero.
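The bisection step of the level-set method can be illustrated in isolation. In this sketch, `phi` stands for the value function τ ↦ v(τ) − α, which is assumed nonincreasing in τ; the function and parameter names are ours, not from [3].

```python
def bisection_bracket(phi, lo, hi, target=0.0, tol=1e-8):
    """Bisection on a nonincreasing function phi: returns (tau_lo, tau_hi)
    with tau_lo <= tau* <= tau_hi and tau_hi - tau_lo <= tol, where
    tau* is the crossing point phi(tau*) = target."""
    assert phi(lo) >= target >= phi(hi), "root must be bracketed"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) >= target:
            lo = mid   # crossing point is to the right of mid
        else:
            hi = mid   # crossing point is to the left of mid
    return lo, hi

# toy value function with crossing point tau* = 2
lo_t, hi_t = bisection_bracket(lambda t: 2.0 - t, 0.0, 10.0)
```

Each evaluation of `phi` corresponds to one subproblem solve, and the returned bracket (τ1, τ2) translates into the bounds β1 ≥ β* ≥ β2 discussed above.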

D Proof of Proposition 8
Proof. First, we show that the desired separation holds for any yi close enough to y*i. By Assumption 3 and the definition of δ-nondegeneracy, the exposed atoms are separated from the non-exposed atoms by a gap of at least δ. For any a ∈ F_A(M*y*i), the corresponding lower bound holds; on the other hand, for any a ∉ F_A(M*y*i), the corresponding upper bound holds. For any i ∈ {1, 2, 3}, we know that S_A(X*) ⊆ E_A(Z, εi) by Theorem 5. Then, by (23), we have

dist(S_A(X*), EssCone_{A,k}(Z)) ≤ dist(E_A(Z, εi), EssCone_{A,k}(Z)).
For any A1, A2 ⊆ A, the distance can be expressed atomwise, where the second equality holds since ‖a1‖_F = ‖a2‖_F = 1 by the definition of A. Define A1 = E_A(Z, εi) and A2 = EssCone_{A,k}(Z) = { U_r p q^T V_r^T : ‖p‖2 = ‖q‖2 = 1 }, where U_r and V_r are the top-r singular vectors of M*y. Let k̄ := min{n, m},

C1 = { (p, q) : Σ_{i=1}^{k̄} σi pi qi ≥ σ1 − εi, ‖p‖2 = ‖q‖2 = 1, p, q ∈ R^{k̄} }

and

C2 = { (p, q) : ‖p‖2 = ‖q‖2 = 1, p, q ∈ R^r };

then the distance between A1 and A2 can be expressed through C1 and C2, as in (24). Now we consider the subproblem in (24):

minimize_{p,q} ‖p_{1:r}‖2 ‖q_{1:r}‖2   (P_sub)
subject to Σ_{i=1}^{k̄} σi pi qi ≥ σ1 − εi, ‖p‖2 = ‖q‖2 = 1, p, q ∈ R^{k̄}.

F Proof of Proposition 14
We describe the following lemma before proceeding to the proof of Proposition 14.

Lemma 19 (Hausdorff error bound).
Given Â ⊆ A, there exists X̂ ∈ cone(Â) such that (25) holds. Proof. Let X* = Σ_{a ∈ S_A(X*)} c_a a with c_a > 0. By the definition of the one-sided Hausdorff distance dist(·, ·), for any a ∈ S_A(X*) there exists a corresponding â ∈ Â such that ‖â − a‖_F ≤ dist(S_A(X*), Â).
Let X̂ = Σ_{a ∈ S_A(X*)} c_a â; then it is straightforward to verify that X̂ ∈ cone(Â) and that the bound in (25) holds, where (i) follows from the orthonormal decomposition X* = Σ_{a ∈ S_A(X*)} c_a a with c_a > 0 and the fact that ‖X*‖²_F = Σ_a c_a² when the atomic set is the set of rank-one matrices.
Proof of Proposition 14. By Lemma 19, there exists X̂ satisfying (25). Then, by the L-smoothness of f, the corresponding bound holds. By the smoothness and convexity of f, we further obtain the stated inequality, where the last inequality is by Lemma 19. Combining the above with Proposition 13 yields the desired result.
Diag(1, 0.1, . . ., 0.1). Because Y is diagonal, its left and right singular vectors U and V are given by U = V = I = [e1, e2, . . ., en]. Note also that the top singular vector u1 = [√(1 − ε), 0, . . ., √ε]^T lies in the span of the basis vectors e1 and en that constitute the top and bottom singular vectors of Y. Therefore, any top-r SVD of Y, with r < n, cannot be used to recover a primal solution X* exactly. Moreover, ‖Y − Y*‖_F = O(√ε).

Figure 1: The left figure (1a) shows the result of the matrix completion experiment. The middle and right figures (1b, 1c) are for the robust principal component analysis experiment.

E Proof of Proposition 13

Proof.
First, we derive a monotonicity property of dist(·, ·). By the definition of dist(·, ·), it follows that

dist(A, C) ≤ dist(B, C) for all A, B, C ⊆ R^{n×m} such that A ⊆ B.   (23)

1. (Top-k atoms) Find the top k indices {i1, ..., ik} ⊂ [n] of the vector M*y^{(t)} with largest absolute value and gather their corresponding signs si := sign([M*y^{(t)}]i) for i ∈ {i1, ..., ik}. The top-k atoms are thus Âk = { s_{i1} e_{i1}, ..., s_{ik} e_{ik} }; see (5).
2. (Retrieve coefficients) Solve the reduced problem
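For the signed one-hot atomic set, the two steps above can be sketched as follows. This is a minimal illustration assuming a dense matrix M and a plain least-squares refit for the reduced problem (the reduced problem in the paper may involve the general loss f); the name `retrieve_primal_topk` is our own.

```python
import numpy as np

def retrieve_primal_topk(M, b, y, k):
    """Primal retrieval for signed coordinate atoms: pick the k entries
    of M^T y with largest absolute value, then refit coefficients."""
    z = M.T @ y
    idx = np.argsort(-np.abs(z))[:k]   # indices of the top-k atoms
    signs = np.sign(z[idx])            # orientations of the chosen atoms
    # reduced problem: least-squares refit over the selected atoms
    c, *_ = np.linalg.lstsq(M[:, idx] * signs, b, rcond=None)
    x = np.zeros(M.shape[1])
    x[idx] = signs * c
    return x

# toy example: M = I, so M^T y = y and the support is read off directly
M = np.eye(10)
b = np.zeros(10)
b[[1, 4, 7]] = [3.0, -2.0, 5.0]
x = retrieve_primal_topk(M, b, y=b, k=3)
```

The returned x is k-sparse by construction, so it satisfies the cardinality constraint of (P) regardless of the quality of the dual iterate y.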

Table 2: The Sparco test problems used.
⟨a*, M*yi⟩ > ⟨a, M*yi⟩ for all a* ∈ F_A(M*y*i) and a ∉ F_A(M*y*i). Notice that EssCone(A, M, yi, k) contains only the atoms that correspond to the k largest values of ⟨a, M*yi⟩. Therefore F_A(M*y*i) ⊆ EssCone(A, M, yi, k). Since y^{(t)}_i → y*_i, for each i ∈ {1, 2, 3} there exists Ti > 0 such that ‖y^{(t)}_i − y*_i‖ < δ/(4‖M‖_A) for all t > Ti. Therefore F_A(M*y*i) ⊆ EssCone(A, M, y^{(t)}_i, k) for all t > Ti, which completes the proof.