DISTRIBUTIONALLY ROBUST OPTIMIZATION: A REVIEW

The concepts of risk-aversion, chance-constrained optimization, and robust optimization have developed significantly over the last decade. The statistical learning community has also witnessed rapid theoretical and applied growth by relying on these concepts. A modeling framework, called distributionally robust optimization (DRO), has recently received significant attention in both the operations research and statistical learning communities. This paper surveys the main concepts and contributions to DRO, and its relationships with robust optimization, risk-aversion, chance-constrained optimization, and function regularization.

1. Introduction. Many real-world decision problems arising in engineering and management have uncertain parameters. This parameter uncertainty may be due to limited observability of data, noisy measurements, implementation errors, and prediction errors. Stochastic optimization (SO) and robust optimization (RO) frameworks have classically been used to model this uncertainty within a decision-making framework. Stochastic optimization assumes that the decision maker has complete knowledge about the underlying uncertainty through a known probability distribution and minimizes a functional of the cost, see, e.g., Shapiro et al. [295], Birge and Louveaux [45]. The probability distribution of the random parameters is inferred from prior beliefs, expert opinions, errors in predictions based on the historical data (e.g., Kim and Mehrotra [171]), or a mixture of these. In robust optimization, on the other hand, it is assumed that the decision maker has no distributional knowledge about the underlying uncertainty, except for its support, and the model minimizes the worst-case cost over an uncertainty set, see, e.g., El Ghaoui and Lebret [99], El Ghaoui et al. [100], Ben-Tal and Nemirovski [19], Bertsimas and Sim [34], Ben-Tal and Nemirovski [20], Ben-Tal et al. [26]. Robust optimization is also related to chance-constrained optimization: in certain cases, there is a direct relationship between a robust optimization model and a chance-constrained optimization model, see, e.g., Boyd and Vandenberghe [57, pp. 157-158].
We often have partial knowledge on the statistical properties of the model parameters. Specifically, the probability distribution quantifying the model parameter uncertainty is known ambiguously. A typical approach to handle this ambiguity, from a statistical point of view, is to estimate the probability distribution using statistical tools, such as the maximum likelihood estimator, minimum Hellinger distance estimator [311], or maximum entropy principle [126]. The decision-making process can then be performed with respect to the estimated distribution. Because such an estimation may be imprecise, the impact of inaccuracy in estimation (and the subsequent ambiguity in the underlying distribution) is widely studied in the literature through (1) the perturbation analysis of optimization problems, see, e.g., Bonnans and Shapiro [55], (2) stability analysis of an SO model with respect to a change in the probability distribution, see, e.g., Rachev [250], Römisch [263], or (3) input uncertainty analysis in stochastic simulation models, see, e.g., Lam [175] and references therein. The typical goals of these approaches are to quantify the sensitivity of the optimal value/solution(s) to the probability distribution and provide continuity and/or large-deviation-type results, see, e.g., Dupačová [96], Schultz [272], Heitsch et al. [142], Rachev and Römisch [248], Pflug and Pichler [230]. While these approaches quantify the input uncertainty, they do not provide a systematic modeling framework to hedge against the ambiguity in the underlying probability distribution.
Ambiguous stochastic optimization is a systematic modeling approach that bridges the gap between data and decision-making (that is, between statistics and optimization frameworks) to protect the decision maker from the ambiguity in the underlying probability distribution. The ambiguous stochastic optimization approach assumes that the underlying probability distribution is unknown and lies in an ambiguity set of probability distributions. As in robust optimization, this approach hedges against the ambiguity in the probability distribution by taking a worst-case approach. Scarf [271] is arguably the first to consider such an approach to obtain an order quantity for a newsvendor problem that maximizes the worst-case expected profit, where the worst case is taken with respect to all product demand probability distributions with a known mean and variance. Since the seminal work of Scarf, and particularly in the past few years, significant research has been done on ambiguous stochastic optimization problems. This paper provides a review of the theoretical, modeling, and computational developments in this area. Moreover, we review the applications of the ambiguous stochastic optimization model that have been developed in recent years. This paper also puts DRO in the context of risk-averse optimization, chance-constrained optimization, and robust optimization.

1.1. A General DRO Model.
We now formally introduce the model formulation that we discuss in this paper. Let $x \in \mathcal{X} \subseteq \mathbb{R}^n$ be the decision vector. On a measurable space $(\Xi, \mathcal{F})$, let us define a random vector $\tilde{\xi}: \Xi \to \Omega \subseteq \mathbb{R}^d$, a random cost function $h(x, \tilde{\xi}): \mathcal{X} \times \Xi \to \mathbb{R}$, and a vector of random functions $g(x, \tilde{\xi}): \mathcal{X} \times \Xi \to \mathbb{R}^m$, i.e., $g(x, \cdot) := [g_1(x, \cdot), \ldots, g_m(x, \cdot)]^\top$. Given this setup, a general stochastic optimization problem has the form

(SO)  $\inf_{x \in \mathcal{X}} \left\{ \mathcal{R}_P[h(x, \tilde{\xi})] \;\middle|\; \mathcal{R}_P[g(x, \tilde{\xi})] \le 0 \right\}$,

where $P$ denotes the (known) probability measure on $(\Xi, \mathcal{F})$ and $\mathcal{R}_P: \mathcal{Z} \to \mathbb{R}$ denotes a (componentwise) real-valued functional under $P$, where $\mathcal{Z}$ is a linear space of measurable functions on $(\Xi, \mathcal{F})$. The functional $\mathcal{R}_P$ quantifies the uncertainty in the outcomes of the decision for a given fixed probability measure $P$. This setup represents a broad range of problems in statistics, optimization, and control, such as regression and classification models [106,163], simulation-optimization [107,227], stochastic optimal control [31], Markov decision processes [246], and stochastic programming [45,295]. As a special case of (SO), in which $\mathcal{R}_P$ is the expectation $E_P$, we have the classical stochastic programming problems

(1.1)  $\inf_{x \in \mathcal{X}} E_P[h(x, \tilde{\xi})]$

and

(1.2)  $\inf_{x \in \mathcal{X}} \left\{ h(x) \;\middle|\; E_P[g(x, \tilde{\xi})] \le 0 \right\}$.

By taking $h(x, \cdot) := 1_{A(x)}(\cdot)$ in (1.1), where $1_{A(x)}(\cdot)$ denotes an indicator function for an arbitrary set $A(x) \in \mathcal{B}(\mathbb{R}^d)$ (we define the indicator function and $\mathcal{B}(\mathbb{R}^d)$ precisely in Section 2), we obtain the class of problems with a probabilistic objective function of the form $P\{\tilde{\xi} \in A(x)\}$, see, e.g., Prékopa [244]. The set $A(x)$ is called a safe region and may be of the form $a(x)^\top \xi \le b(x)$ or $a(\xi)^\top x \le b(\xi)$.$^1$ Similarly, by taking $h(x, \tilde{\xi}) := h(x)$ and $g(x, \cdot) := [1_{A_1(x)}(\cdot), \ldots, 1_{A_m(x)}(\cdot)]^\top$, for suitable indicator functions $1_{A_j(x)}(\cdot)$, $j = 1, \ldots, m$, (1.2) is in the form of probabilistic (i.e., chance) constraints $P\{\tilde{\xi} \in A_j(x)\} \le 0$, $j = 1, \ldots, m$, see, e.g., Charnes et al. [68], Charnes and Cooper [67], Prékopa [243,245], Dentcheva [86].
Note that the case where the event $\{\tilde{\xi} \in A_j(x)\}$ is formed via several constraints is called a joint chance constraint, as compared to an individual chance constraint, where the event $\{\tilde{\xi} \in A_j(x)\}$ is formed via one constraint. A robust optimization model is defined as

(RO)  $\inf_{x \in \mathcal{X}} \left\{ \sup_{\xi \in \mathcal{U}} h(x, \xi) \;\middle|\; \sup_{\xi \in \mathcal{U}} g(x, \xi) \le 0 \right\}$,

where $\mathcal{U} \subseteq \mathbb{R}^d$ denotes an uncertainty set for the parameters $\tilde{\xi}$. Similar to (SO), an ambiguous stochastic optimization model is formulated as

(DRO)  $\inf_{x \in \mathcal{X}} \left\{ \sup_{P \in \mathcal{P}} \mathcal{R}_P[h(x, \tilde{\xi})] \;\middle|\; \sup_{P \in \mathcal{P}} \mathcal{R}_P[g(x, \tilde{\xi})] \le 0 \right\}$.

Here, $\mathcal{P}$ denotes the ambiguity set of probability measures, i.e., a family of measures consistent with the prior knowledge about uncertainty. Note that if we consider the measurable space $(\Omega, \mathcal{B})$, where $\mathcal{B}$ denotes the Borel σ-field on $\Omega$, i.e., $\mathcal{B} = \Omega \cap \mathcal{B}(\mathbb{R}^d)$, then $\mathcal{P}$ can be viewed as an ambiguity set of probability distributions $\mathbb{P}$ defined on $(\Omega, \mathcal{B})$ and induced by $\tilde{\xi}$.$^2$ As discussed before, (DRO) finds a decision that minimizes the worst case of the functional $\mathcal{R}$ of the cost $h$ among all probability measures in the ambiguity set, provided that the (componentwise) worst case of the functional $\mathcal{R}$ of the function $g$ is non-positive. The ambiguous versions of (1.1) and (1.2) are formulated as

(1.5)  $\inf_{x \in \mathcal{X}} \sup_{P \in \mathcal{P}} E_P[h(x, \tilde{\xi})]$

and

(1.6)  $\inf_{x \in \mathcal{X}} \left\{ h(x) \;\middle|\; \sup_{P \in \mathcal{P}} E_P[g(x, \tilde{\xi})] \le 0 \right\}$.

Models (1.5) and (1.6) are discussed in the context of minimax stochastic optimization models, in which optimal solutions are evaluated under the worst-case expectation with respect to a family of probability distributions of the uncertain parameters, see, e.g., Scarf [271], Žáčková [313] (a.k.a. Dupačová), Dupačová [95], Breton and El Hachem [58], Shapiro and Kleywegt [292], Shapiro and Ahmed [291]. Delage and Ye [82] refer to this approach as distributionally robust optimization, in short DRO, and since then, this terminology has become dominant in the research community. We adopt this terminology, and for the rest of the paper, we refer to the ambiguous stochastic optimization of the form (DRO) as DRO.

$^1$ We say a safe region of the form $a(x)^\top \xi \le b(x)$ is bi-affine in $x$ and $\xi$ if $a(x)$ and $b(x)$ are both affine in $x$. Similarly, we say a safe region of the form $a(\xi)^\top x \le b(\xi)$ is bi-affine in $x$ and $\xi$ if $a(\xi)$ and $b(\xi)$ are both affine in $\xi$. Observe that a bi-affine safe region of the form $a(x)^\top \xi \le b(x)$ can be equivalently written as a bi-affine safe region of the form $a(\xi)^\top x \le b(\xi)$, and vice versa.

$^2$ In this paper, we use $\mathcal{P}$ to denote both an ambiguity set of probability measures and an ambiguity set of distributions induced by $\tilde{\xi}$. Which of the two is meant should be understood from the context and from the distinction we make between the notation of a probability measure and that of a probability distribution.
As mentioned before, (DRO) is a modeling approach that assumes only partial distributional information, whereas (SO) assumes complete distributional information. In fact, if P contains only the true distribution of the random vector ξ̃, (DRO) reduces to (SO). On the other hand, if P contains all probability distributions supported on the support U of the random vector ξ̃, then (DRO) reduces to (RO). Thus, a judicious choice of P can put (DRO) between (SO) and (RO). Consequently, (DRO) may not be as conservative as (RO), which ignores all distributional information except for the support U of the uncertain parameters. (DRO) can be viewed as a unifying framework for (SO) and (RO) (see also Qian et al. [247]).
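To make the interpolation between (SO) and (RO) concrete, the following minimal Python sketch evaluates a worst-case expected cost on a finite support. It assumes, purely for illustration, that the ambiguity set is a total-variation ball of radius `radius` around a nominal distribution; the function name `worst_case_expectation` and all numbers are hypothetical and are not taken from the surveyed literature. With radius 0 the ambiguity set is a singleton and the value is the nominal expectation, as in (SO); with radius 1 the ball contains every distribution on the support and the value is the largest cost, as in (RO).

```python
def worst_case_expectation(costs, nominal, radius):
    """Worst-case expected cost over the (assumed) ambiguity set
    {p in simplex : 0.5 * ||p - nominal||_1 <= radius} on a finite support.

    The maximizer moves as much probability mass as allowed onto the
    costliest atom and removes that mass from the cheapest atoms first.
    """
    p = list(nominal)
    top = max(range(len(costs)), key=lambda i: costs[i])
    budget = min(radius, 1.0 - p[top])  # mass that can be moved to the top atom
    p[top] += budget
    for i in sorted(range(len(costs)), key=lambda i: costs[i]):
        if i == top or budget <= 0.0:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    return sum(p_i * c_i for p_i, c_i in zip(p, costs))


# Hypothetical data: three outcomes of the cost h(x, xi) for a fixed decision x.
costs = [1.0, 4.0, 10.0]
nominal = [0.5, 0.3, 0.2]
so_value = worst_case_expectation(costs, nominal, 0.0)   # singleton set: (SO)
dro_value = worst_case_expectation(costs, nominal, 0.1)  # intermediate set
ro_value = worst_case_expectation(costs, nominal, 1.0)   # all distributions: (RO)
```

In this sketch the three values are ordered as `so_value <= dro_value <= ro_value`, which mirrors the statement that a suitable ambiguity set places (DRO) between (SO) and (RO).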

1.2. Motivation and Contributions.
In this paper, we provide an overview of the main contributions to DRO within both the operations research and machine learning communities. While there are separate review papers on RO, see, e.g., [40,108,124], to the best of our knowledge, there are only a few tutorials and survey papers on DRO within the operations research community. A tutorial on DRO, its connection to risk-averse optimization, and the use of φ-divergence to construct the ambiguity set is presented in Bayraksan and Love [13]. Shapiro [290] provides a general tutorial on DRO and its connection to risk-averse optimization. Postek et al. [240] survey different papers that address distributionally robust risk constraints, with a variety of risk functionals and ambiguity sets. Similar to [13,290,240], in this paper, we show the connection between DRO and risk aversion. However, the current review differs from those in the literature in a number of ways. We outline our contributions as follows:
• We bring together the research done on DRO within the operations research and machine learning communities. This motivation is materialized throughout the paper as we take a holistic view of DRO, from modeling, to solution techniques, and to applications.
• We provide a detailed discussion on how DRO models are connected to different concepts such as game theory, risk-averse optimization, chance-constrained optimization, robust optimization, and function regularization in statistical learning.
• From the algorithmic perspective, we review techniques to solve a DRO model.
• From the modeling and theoretical perspectives, we categorize different approaches to model the distributional ambiguity and discuss results for each of these ambiguity sets. Moreover, we discuss the calibration of different parameters used in these ambiguity sets of distributions.
1.3. Organization of this Paper. This paper is organized as follows. In Section 2, we introduce the notation and the basic definitions. Section 3 reviews the connection of DRO to different concepts: game theory in Section 3.1, robust optimization in Section 3.2, risk-aversion and chance-constrained optimization with its relationship to robust optimization in Section 3.3, and regularization in statistical learning in Section 3.4. In Section 4, we review two main solution techniques to solve a DRO model by introducing tools in semi-infinite programming and duality. In Section 5, we discuss different models to construct the ambiguity set of distributions. This includes discrepancy-based models in Section 5.1, moment-based models in Section 5.2, shape-preserving-based models in Section 5.3, and kernel-based models in Section 5.4. In Section 6, we discuss the calibration of different parameters used in the ambiguity set of distributions. In Section 7, we discuss different functionals that account for quantifying the uncertainty in the outcomes of a fixed decision. This includes regret functions in Section 7.1, risk measures in Section 7.2, and utility functions in Section 7.3. In Section 8, we introduce some modeling toolboxes for a DRO model.

2. Notation and Basic Definitions.
In this section, we introduce additional notation used throughout the paper. In order to keep the paper self-contained, we also introduce all definitions used in this paper in this section.
For a given space $\Xi$ and a σ-field $\mathcal{F}$ of that space, we define an underlying measurable space $(\Xi, \mathcal{F})$. In particular, let us define $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, where $\mathcal{B}(\mathbb{R}^d)$ is the Borel σ-field on $\mathbb{R}^d$. Let $1_A: \Xi \to \{0, 1\}$ denote the indicator function of a set $A \in \mathcal{F}$, where $1_A(s) = 1$ if $s \in A$, and 0 otherwise. Let $\mathcal{M}_+(\cdot, \cdot)$ and $\mathcal{M}(\cdot, \cdot)$ denote the set of all nonnegative measures and the set of all probability measures $Q: \mathcal{F} \to [0, 1]$ defined on $(\Xi, \mathcal{F})$, respectively. A measure $\nu_2$ is preferred over a measure $\nu_1$, denoted as $\nu_2 \succeq \nu_1$, if $\nu_2(A) \ge \nu_1(A)$ for all measurable sets $A \in \mathcal{F}$. We denote by $Q\{A\}$ the probability of event $A \in \mathcal{F}$ with respect to $Q \in \mathcal{M}(\Xi, \mathcal{F})$. A random vector $\tilde{\xi}: (\Xi, \mathcal{F}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is always denoted with a tilde sign, while a realization of the random vector $\tilde{\xi}$ is denoted by the same symbol without a tilde, i.e., $\xi$. For a probability measure $Q \in \mathcal{M}(\Xi, \mathcal{F})$, we define a probability space $(\Xi, \mathcal{F}, Q)$. We denote by $\mathbb{Q} := Q \circ \tilde{\xi}^{-1}$ the probability distribution induced by a random vector $\tilde{\xi}$ under $Q$, where $\tilde{\xi}^{-1}$ denotes the inverse image of $\tilde{\xi}$. That is, $\mathbb{Q}: \mathcal{B}(\mathbb{R}^d) \to [0, 1]$ is a probability distribution on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. Let $\mathcal{P}(\cdot, \cdot)$ denote the set of all such probability distributions. For example, $\mathcal{P}(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ denotes the set of all probability distributions of $\tilde{\xi}$. Note that in our notation, we make a distinction between a probability measure $Q \in \mathcal{M}(\Xi, \mathcal{F})$ and a probability distribution $\mathbb{Q} \in \mathcal{P}(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. Nevertheless, an appropriate transformation between the two always exists, so we may use the terms probability measure and probability distribution interchangeably. Given this, for a function $f: \mathbb{R}^d \to \mathbb{R}$, we may write $\int_{\Xi} f(\tilde{\xi}(s)) \, Q(ds)$ equivalently as $\int_{\mathbb{R}^d} f(s) \, \mathbb{Q}(ds)$ with a change of measure. As we shall see later, we may denote $f(\tilde{\xi}(s))$ with $f(s)$ in this transformation. For two random variables $Z, Z': \Xi \to \mathbb{R}$, we use $Z \ge Z'$ to denote $Z(s) \ge Z'(s)$ almost everywhere (a.e.) on $\Xi$. A random variable $Z$ is $Q$-integrable if $\|Z\|_1 := \int_{\Xi} |Z| \, dQ$ is finite. Two random variables $Z, Z'$ are distributionally equivalent, denoted by $Z \overset{d}{\sim} Z'$, if they induce the same distribution, i.e., $Q\{Z \le z\} = Q\{Z' \le z\}$ for all $z$. We also denote by $\mathcal{S}(\Xi, \mathcal{F})$ the collection of all $\mathcal{F}$-measurable functions $Z: (\Xi, \mathcal{F}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$, where $\overline{\mathbb{R}}$ denotes the extended real line $\mathbb{R} \cup \{-\infty, +\infty\}$.
For a finite space $\Xi$ with $M$ atoms $\Xi = \{s_1, \ldots, s_M\}$ and $\mathcal{F} = 2^{\Xi}$, let $\{q(s_1), \ldots, q(s_M)\}$ be the probabilities of the corresponding elementary events under probability measure $Q \in \mathcal{M}(\Xi, \mathcal{F})$. As a shorthand notation, we use $q = [q(s_1), \ldots, q(s_M)]^\top$.
Consider a linear space $\mathcal{V}$, paired with a dual linear space $\mathcal{V}^*$, in the sense that a (real-valued) bilinear form $\langle \cdot, \cdot \rangle: \mathcal{V} \times \mathcal{V}^* \to \mathbb{R}$ is defined. That is, for any $v \in \mathcal{V}$ and $v^* \in \mathcal{V}^*$, we have that $\langle \cdot, v^* \rangle: \mathcal{V} \to \mathbb{R}$ and $\langle v, \cdot \rangle: \mathcal{V}^* \to \mathbb{R}$ are linear functionals on $\mathcal{V}$ and $\mathcal{V}^*$, respectively. Similarly, we define $\mathcal{W}$ and $\mathcal{W}^*$. For a linear mapping $A: \mathcal{V} \to \mathcal{W}$, we define the adjoint mapping $A^*: \mathcal{W}^* \to \mathcal{V}^*$ by means of the equation $\langle w^*, Av \rangle = \langle A^* w^*, v \rangle$, $\forall v \in \mathcal{V}$. For two linear mappings defined by finite-dimensional matrices $A$ and $B$, $A \bullet B = \mathrm{Tr}(A^\top B)$ denotes the Frobenius inner product between matrices. Moreover, $A \circ B$ denotes the Hadamard (i.e., componentwise) product between matrices.
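For a function $f: \mathcal{V} \to \overline{\mathbb{R}}$, the (convex) conjugate $f^*: \mathcal{V}^* \to \overline{\mathbb{R}}$ is defined as $f^*(v^*) = \sup_{v \in \mathcal{V}} \{ \langle v^*, v \rangle - f(v) \}$. We denote by $\|\cdot\|_p: \mathbb{R}^d \to \mathbb{R}$ the $\ell_p$-norm on $\mathbb{R}^d$. That is, for a vector $u \in \mathbb{R}^d$, $\|u\|_p = \left( \sum_{i=1}^{d} |u_i|^p \right)^{1/p}$. We use $\Delta_d$ to denote the simplex in $\mathbb{R}^d$, i.e., $\Delta_d = \{ u \in \mathbb{R}^d \mid e^\top u = 1, \; u \ge 0 \}$, where $e$ is a vector of ones in $\mathbb{R}^d$. Let $(\cdot)_+$ denote $\max\{0, \cdot\}$.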
For a proper cone $K$, the relation $x \preceq_K y$ indicates that $y - x \in K$. For simplicity, we drop $K$ from the notation when $K$ is the positive semidefinite cone. Let $\mathbb{S}^n_+$ denote the cone of symmetric positive semidefinite matrices in the space of $n \times n$ matrices $\mathbb{R}^{n \times n}$. For a cone $K \subset \mathcal{V}$, we define its dual cone as $K' := \{ v^* \in \mathcal{V}^* \mid \langle v^*, v \rangle \ge 0, \; \forall v \in K \}$. The negative of the dual cone is called the polar cone and is denoted by $K^o$. The $K$-epigraph of a function $f: \mathbb{R}^N \to \mathbb{R}^M$ and a proper cone $K$ is conic-representable if the set $\{ (x, y) \in \mathbb{R}^N \times \mathbb{R}^M \mid f(x) \preceq_K y \}$ can be expressed via conic inequalities, possibly involving a cone different from $K$ and additional auxiliary variables.
For a set K, we use conv(K) and int (K) to denote the convex hull and the interior of K, respectively.
Because we also review DRO papers in the context of statistical learning in this paper, we introduce some terminology from statistical learning. For every approach that uses a set of (training) data to prescribe a solution or to predict an outcome, it is important to assess the out-of-sample quality of the prescriber/predictor under a new set of (test) data, independent from the training set. Consider a given set of (training) data $\{\xi^i\}_{i=1}^N$. Suppose that $\mathbb{P}_N$ is the empirical probability distribution on $\{\xi^i\}_{i=1}^N$. Data-driven approaches are interested in the performance of a data-driven solution (or, in-sample solution) $\hat{x}_N$ that is constructed using $\{\xi^i\}_{i=1}^N$. A primitive data-driven solution for a problem of the form (1.1) can be obtained by solving a sample average approximation (SAA) of that problem, where the underlying distribution is chosen to be $\mathbb{P}_N$ [295]. Assessing the quality of this solution is well-studied in the context of SO, see, e.g., Bayraksan and Morton [14,15], Homem-De-Mello and Bayraksan [147]. Here, we introduce the analogues of such performance measures that are used to assess the quality of a solution in the context of a DRO model. Let us focus on a DRO problem of the form (1.5) for ease of exposition. Consider a data-driven solution $\hat{x}_N \in \mathcal{X}$. Such a solution may be obtained by solving a data-driven version of the DRO model (1.5), where the ambiguity set is constructed using data, namely $\mathcal{P}_N$. The out-of-sample performance of $\hat{x}_N$ is defined as $E_{\mathbb{P}^{\mathrm{true}}}[h(\hat{x}_N, \tilde{\xi})]$, which is the expected cost of $\hat{x}_N$ given a new (test) sample that is independent of $\{\xi^i\}_{i=1}^N$, drawn from an unknown true distribution $\mathbb{P}^{\mathrm{true}} := P^{\mathrm{true}} \circ \tilde{\xi}^{-1}$. However, as $\mathbb{P}^{\mathrm{true}}$ is unknown, one needs to establish performance guarantees.
One such guarantee, referred to as a finite-sample performance guarantee or generalization bound, is defined as

$(\mathbb{P}^{\mathrm{true}})^N \left\{ E_{\mathbb{P}^{\mathrm{true}}}[h(\hat{x}_N, \tilde{\xi})] \le \hat{V}_N \right\} \ge 1 - \alpha$,

where $(\mathbb{P}^{\mathrm{true}})^N$ denotes the distribution of the training sample. This guarantees that an (in-sample) certificate $\hat{V}_N$ provides a $(1 - \alpha)$ confidence (with respect to the training sample) on the out-of-sample performance of $\hat{x}_N$. The certificate $\hat{V}_N$ may be chosen as the optimal value of the inner problem in DRO, where the worst case is taken within $\mathcal{P}_N$, evaluated at $\hat{x}_N$, see, e.g., [205]. The other guarantee, referred to as asymptotic consistency, guarantees that as $N$ increases, the certificate $\hat{V}_N$ and the data-driven solution $\hat{x}_N$ converge, in some sense, to the optimal value and an optimal solution of the true (unambiguous) problem of the form (1.1), see, e.g., [205].

3. Relationship with Game Theory, Risk-Aversion, Chance-Constrained Optimization, and Regularization.
3.1. Relationship with Game Theory. In this section, we present a game-theoretic interpretation of DRO. Indeed, a worst-case approach to SO may be viewed as having its roots in John von Neumann's game theory. For ease of exposition, let us consider a problem of the form (1.5).
The decision maker, the first player in this setup, makes a decision x ∈ X whose consequences (i.e., cost h) depend on the outcome of the random vector ξ̃. The decision maker assumes that ξ̃ follows some distribution P ∈ P. However, the decision maker does not know which distribution nature, the second player in this setup, will choose to represent the uncertainty in ξ̃. Thus, on the one hand, the decision maker is looking for a decision that minimizes the maximum expected cost with respect to P; on the other hand, nature is seeking a distribution that maximizes the minimum expected cost with respect to X. Under suitable conditions, it can be shown that these two problems are dual to each other and the solution to one problem provides the solution to the other problem. Such a solution (x*, P*) is called an equilibrium or saddle point. In other words, at this point, the decision maker would not change the decision x*, knowing that nature chose P*. Similarly, nature would not change its distribution P*, knowing that the decision maker chose x*. We state this result in the following theorem, which generalizes John von Neumann's minimax theorem.
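Theorem 3.1. (Sion [299, Theorem 3.4]) Suppose that (i) $\mathcal{X}$ and $\mathcal{P}$ are convex and compact spaces, (ii) $P \mapsto \mathcal{R}_P[h(x, \tilde{\xi})]$ is upper semicontinuous and quasiconcave on $\mathcal{P}$ for all $x \in \mathcal{X}$, and (iii) $x \mapsto \mathcal{R}_P[h(x, \tilde{\xi})]$ is lower semicontinuous and quasiconvex on $\mathcal{X}$ for all $P \in \mathcal{P}$. Then,

$\inf_{x \in \mathcal{X}} \sup_{P \in \mathcal{P}} \mathcal{R}_P[h(x, \tilde{\xi})] = \sup_{P \in \mathcal{P}} \inf_{x \in \mathcal{X}} \mathcal{R}_P[h(x, \tilde{\xi})]$.

According to the above theorem, under appropriate conditions, exchanging the order of inf and sup does not change the optimal value of $\inf_{x \in \mathcal{X}} \sup_{P \in \mathcal{P}} E_P[h(x, \tilde{\xi})]$. We refer to Grünwald and Dawid [126] for a variety of alternative regularity conditions for this to hold. The exchange of the order between inf and sup can be interpreted as follows [126]: a probability distribution $P^*$ that maximizes the generalized entropy $\inf_{x \in \mathcal{X}} \mathcal{R}_P[h(x, \tilde{\xi})]$ over $\mathcal{P}$ has an associated decision $x^*$, achieving $\inf_{x \in \mathcal{X}} \mathcal{R}_{P^*}[h(x, \tilde{\xi})]$, and this decision also achieves $\inf_{x \in \mathcal{X}} \sup_{P \in \mathcal{P}} \mathcal{R}_P[h(x, \tilde{\xi})]$.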
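As a small numerical illustration of the exchange of inf and sup, the following Python sketch compares the two values on a toy instance. Everything in it is an assumption made for illustration: the random variable takes the value 1 with probability q and 0 otherwise, the cost is the squared error (x − ξ)², the decision set is the interval [0, 1], the ambiguity set is {q ∈ [0.2, 0.8]}, and both sets are discretized on grids. The expected cost is convex in x and linear in q, so the conditions of the theorem hold for the continuous problem.

```python
def expected_cost(x, q):
    """E[(x - xi)^2] when xi equals 1 with probability q and 0 otherwise."""
    return q * (x - 1.0) ** 2 + (1.0 - q) * x ** 2


# Hypothetical discretizations of the decision set and of the ambiguity set.
X_GRID = [i / 100 for i in range(101)]              # x in [0, 1]
Q_GRID = [0.2 + 0.6 * j / 100 for j in range(101)]  # q in [0.2, 0.8]


def inf_sup():
    """Decision maker moves first and guards against the worst q."""
    return min(max(expected_cost(x, q) for q in Q_GRID) for x in X_GRID)


def sup_inf():
    """Nature moves first and the decision maker best-responds."""
    return max(min(expected_cost(x, q) for x in X_GRID) for q in Q_GRID)


minimax_value = inf_sup()
maximin_value = sup_inf()
```

On this instance both values coincide (up to floating-point error) at the saddle point x* = 0.5, q* = 0.5. On a grid, equality is not guaranteed in general; it holds here because the saddle point lies on both grids.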

3.2. Relationship between DRO and RO.
In Section 1, we mentioned that when the ambiguity set of probability distributions contains all probability distributions on the support of the uncertain parameters, DRO and RO are equivalent. In this section, we present a different perspective on the relationship between DRO and RO under the assumption that the sample space Ξ is finite. For ease of exposition, we focus on (1.5). A similar argument follows for (DRO).
Suppose that $\Xi$ is a finite sample space with $M$ atoms, $\Xi = \{s_1, \ldots, s_M\}$. Then, for a fixed $x \in \mathcal{X}$, $h(x, \tilde{\xi})$ has $M$ possible outcomes $\{h(x, \tilde{\xi}(s_1)), \ldots, h(x, \tilde{\xi}(s_M))\}$. For short, let us write these outcomes as a vector $h(x) \in \mathbb{R}^M$, where $h_m(x) := h(x, \tilde{\xi}(s_m))$. In (1.5), $\mathcal{P}$ is a subset of all probability measures on $\tilde{\xi}$. So, one can think of $\mathcal{P}$ as a subset of all discrete probability distributions $\mathbb{P}$ on $\mathbb{R}^d$ induced by $\tilde{\xi}$. That is, $\mathbb{P}$ can be identified with a vector $p \in \mathbb{R}^M$. Consequently, $\mathcal{P}$ may be interpreted as a subset of $\mathbb{R}^M$. With this interpretation, (1.5) is written as

(3.1)  $\inf_{x \in \mathcal{X}} \sup_{p \in \mathcal{P}} \sum_{m=1}^{M} p_m h_m(x)$.

By defining $f(x, p) := p^\top h(x)$, we can rewrite the above problem as $\inf_{x \in \mathcal{X}} \sup_{p \in \mathcal{P}} f(x, p)$. This problem has the form of (1.3), where the probability vector $p$ takes values in an "uncertainty set" $\mathcal{P}$. Techniques that are applicable for specifying the uncertainty set in an RO model may now be used to specify $\mathcal{P}$ in (3.1), see, e.g., Ben-Tal and Nemirovski [18,20], Bertsimas et al. [37], Chen et al. [74]. We also refer to Bertsimas et al. [42] and Section 3.3.2. For a thorough treatment of different nonlinear functions $f(x, p)$ and different uncertainty sets $\mathcal{P}$, we refer to Ben-Tal et al. [29]. However, as we shall see below, DRO has the richness that allows the use of techniques developed in the statistical literature to model the problem. Moreover, its framework allows $\Xi$ to be continuous. We also refer to Xu et al. [330] for a distributional interpretation of RO.
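The finite-sample-space reformulation can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the cost is a hypothetical newsvendor cost, the decision set is a finite list of candidate order quantities, and the ambiguity set is the convex hull of a finite list of probability vectors. Because p ↦ pᵀh(x) is linear, the supremum over such a convex hull is attained at one of the listed vectors, so enumerating them suffices. The names `newsvendor_cost` and `solve_finite_dro` and all data are made up for this example.

```python
def newsvendor_cost(order, demand, unit_cost=3.0, price=10.0):
    """Hypothetical cost h(x, xi): purchase cost minus sales revenue."""
    return unit_cost * order - price * min(order, demand)


def solve_finite_dro(orders, demands, candidate_pmfs):
    """Solve inf_x sup_p p^T h(x) by enumeration.

    orders:          finite list of candidate decisions x
    demands:         realizations xi(s_1), ..., xi(s_M)
    candidate_pmfs:  probability vectors whose convex hull is the
                     (assumed) ambiguity set
    Returns the pair (best decision, its worst-case expected cost).
    """
    best = None
    for x in orders:
        h = [newsvendor_cost(x, d) for d in demands]  # outcome vector h(x)
        worst = max(sum(p_m * h_m for p_m, h_m in zip(p, h))
                    for p in candidate_pmfs)
        if best is None or worst < best[1]:
            best = (x, worst)
    return best


demands = [0, 10, 20]
candidate_pmfs = [[0.2, 0.5, 0.3], [0.5, 0.4, 0.1]]
best_order, worst_case_cost = solve_finite_dro([0, 10, 20], demands, candidate_pmfs)
```

For richer ambiguity sets (for example, polyhedral or conic sets of probability vectors), the inner supremum would typically be handled by duality rather than by enumeration.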

3.3. Relationship with Risk-Aversion.
3.3.1. Relationship between DRO and Coherent and Law Invariant Risk Measures. Under mild conditions (e.g., real-valued cost functions, a convex and compact ambiguity set), the worst-case expectations given in (1.5) or (1.6) are equivalent to a coherent risk measure [7,258,270]. Furthermore, under mild conditions, the worst-case expectations given in (1.5) or (1.6) are equivalent to a law invariant risk measure [289]. These results imply that DRO models have an equivalent risk-averse optimization problem. In order to explain the relationship between (1.5) and (1.6) and risk-averse optimization more precisely, we present some definitions and fundamental results.
Definition 3.2. A risk measure $\rho: \mathcal{Z} \to \overline{\mathbb{R}}$ is called coherent if it satisfies the following axioms:
• Translation Equivariance: If $a \in \mathbb{R}$ and $Z \in \mathcal{Z}$, then $\rho(Z + a) = \rho(Z) + a$.
• Positive Homogeneity: If $t \ge 0$ and $Z \in \mathcal{Z}$, then $\rho(tZ) = t\rho(Z)$.
• Monotonicity: If $Z, Z' \in \mathcal{Z}$ and $Z \ge Z'$, then $\rho(Z) \ge \rho(Z')$.
• Convexity: $\rho(tZ + (1 - t)Z') \le t\rho(Z) + (1 - t)\rho(Z')$, for all $Z, Z' \in \mathcal{Z}$ and all $t \in [0, 1]$.
A risk measure $\rho$ is called convex if it satisfies all the above axioms besides the positive homogeneity condition.
Remark 3.3. In Definition 3.2, the convexity axiom can be replaced with the subadditivity axiom: $\rho(Z + Z') \le \rho(Z) + \rho(Z')$, for all $Z, Z' \in \mathcal{Z}$. This is true because the convexity and positive homogeneity axioms imply the subadditivity axiom, and conversely, the positive homogeneity and subadditivity axioms imply the convexity axiom.
Theorem 3.6. Let $\mathcal{Z}$ be the linear space of all essentially bounded $\mathcal{F}$-measurable functions $Z: \Xi \to \mathbb{R}$ that are $P$-integrable for all $P \in \mathcal{M}(\Xi, \mathcal{F})$. Let $\mathcal{Z}^*$ be the space of all signed measures $P$ on $(\Xi, \mathcal{F})$ such that $\int_{\Xi} |dP| < \infty$. Suppose that $\mathcal{Z}$ is paired with $\mathcal{Z}^*$ such that the bilinear form $E_P[Z]$ is well-defined. Moreover, suppose that $\mathcal{Z}$ and $\mathcal{Z}^*$ are equipped with the sup norm $\|\cdot\|_\infty$ and variation norm $\|\cdot\|_1$, respectively. Recall that $\mathcal{M}(\Xi, \mathcal{F})$ denotes the space of all probability measures on $(\Xi, \mathcal{F})$: $\mathcal{M}(\Xi, \mathcal{F}) = \{ P \in \mathcal{Z}^* \mid \int_{\Xi} dP = 1, \; P \succeq 0 \}$. Let $\rho: \mathcal{Z} \to \mathbb{R}$. Then, $\rho$ is a real-valued coherent risk measure if and only if there exists a convex compact set $\mathfrak{M} \subseteq \mathcal{M}(\Xi, \mathcal{F})$ (in the weak* topology of $\mathcal{Z}^*$) such that

(3.2)  $\rho(Z) = \sup_{P \in \mathfrak{M}} E_P[Z], \quad \forall Z \in \mathcal{Z}$.

Moreover, given a real-valued coherent risk measure, the set $\mathfrak{M}$ in (3.2) can be written in the form

$\mathfrak{M} = \{ P \in \mathcal{M}(\Xi, \mathcal{F}) \mid E_P[Z] \le \rho(Z), \; \forall Z \in \mathcal{Z} \}$.

Proof. First note that $\mathcal{Z}$ is a Banach space, paired with the dual space $\mathcal{Z}^*$, which is also a Banach space. Then, by a similar proof to Shapiro et al. [295, Theorem 6.7], we can show that if $\rho$ is a proper and lower semicontinuous coherent risk measure, then (3.2) holds when $\mathfrak{M}$ is equal to the subdifferential of $\rho$ at $0 \in \mathcal{Z}$, i.e., $\mathfrak{M} = \partial\rho(0)$, where $\partial\rho(Z) = \arg\max_{P \in \mathfrak{M}} E_P[Z]$. Conversely, suppose that (3.2) holds with the set $\mathfrak{M}$ being a convex and weakly* compact subset of $\mathcal{M}(\Xi, \mathcal{F})$. Then, $\rho$ is a real-valued coherent risk measure.
• When ρ(Z) = CVaR Q β [Z], we have The above risk measure is also equivalent to where a b denotes the componentwise product of two vectors a and b.
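Theorem 3.6 relates problems (1.5) and (1.6) to risk-averse optimization problems involving the coherent risk measure $\rho$. Consider a fixed $x \in \mathcal{X}$. With an appropriate transformation of measure $\mathbb{P} = P \circ \tilde{\xi}^{-1}$, we can write the inner problem $\sup_{\mathbb{P} \in \mathcal{P}} E_{\mathbb{P}}[h(x, \tilde{\xi})]$ as $\sup_{P \in \mathcal{P}} E_P[h(x, s)]$, where in the former, $\mathcal{P}$ is a set of probability distributions induced by $\tilde{\xi}$, while in the latter, $\mathcal{P}$ is a set of probability measures on $(\Xi, \mathcal{F})$. Then, by applying Theorem 3.6 and setting $Z = h(x, \tilde{\xi})$, $\sup_{P \in \mathcal{P}} E_P[h(x, s)]$ evaluates a (real-valued) coherent risk measure $\rho[h(x, s)]$, provided that $\mathcal{P} \subset \mathcal{M}(\Xi, \mathcal{F})$ is a convex compact set. It is easy to verify that such a function $\rho$ is coherent:
• Translation Equivariance: Consider $x \in \mathcal{X}$ and $a \in \mathbb{R}$. Then, $\rho[h(x, s) + a] = \sup_{P \in \mathcal{P}} E_P[h(x, s) + a] = \sup_{P \in \mathcal{P}} E_P[h(x, s)] + a = \rho[h(x, s)] + a$.
• Positive Homogeneity: Consider $x \in \mathcal{X}$ and $t \ge 0$. Then, $\rho[t h(x, s)] = \sup_{P \in \mathcal{P}} t E_P[h(x, s)] = t \rho[h(x, s)]$.
• Monotonicity: Consider $x, x' \in \mathcal{X}$ with $h(x, s) \ge h(x', s)$ almost everywhere. Then, $E_P[h(x, s)] \ge E_P[h(x', s)]$ for all $P \in \mathcal{P}$, and hence $\rho[h(x, s)] \ge \rho[h(x', s)]$.
• Convexity: Consider $x, x' \in \mathcal{X}$ and $t \in [0, 1]$. Then, we have $\rho[t h(x, s) + (1 - t) h(x', s)] = \sup_{P \in \mathcal{P}} \{ t E_P[h(x, s)] + (1 - t) E_P[h(x', s)] \} \le t \sup_{P \in \mathcal{P}} E_P[h(x, s)] + (1 - t) \sup_{P \in \mathcal{P}} E_P[h(x', s)] = t \rho[h(x, s)] + (1 - t) \rho[h(x', s)]$, where the inequality follows from the subadditivity of the supremum.
Consequently, (1.5) is equivalent to minimizing a coherent risk measure. Similarly, (1.6) is equivalent to a risk-averse optimization problem, subject to coherent risk constraints. Thus, a convex and compact ambiguity set of distributions gives rise to a coherent risk measure. Conversely, Theorem 3.6 implies that given a risk preference that can be expressed in the form of a coherent risk measure as a primitive, we can construct a corresponding convex and compact ambiguity set $\mathcal{P}$ of probability distributions in a DRO framework. Thus, the ambiguity set becomes a consequence of the particular risk measure the decision maker selects.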
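The four coherence axioms can also be checked numerically on a finite sample space. The following Python sketch is a sanity check under stated assumptions, not a proof: the ambiguity set is taken to be the convex hull of a short, hypothetical list of probability vectors (so the supremum of the linear map p ↦ pᵀz is attained at one of them), and the axioms are tested on randomly drawn outcome vectors. The names `rho` and `check_coherence_axioms` are illustrative.

```python
import random


def rho(z, pmfs):
    """Worst-case expectation of the outcome vector z over a finite list of pmfs."""
    return max(sum(p_m * z_m for p_m, z_m in zip(p, z)) for p in pmfs)


def check_coherence_axioms(pmfs, trials=200, seed=0, tol=1e-9):
    """Randomly test the four coherence axioms for rho(.) = max_p E_p[.]."""
    rng = random.Random(seed)
    m = len(pmfs[0])
    for _ in range(trials):
        z = [rng.uniform(-5.0, 5.0) for _ in range(m)]
        w = [rng.uniform(-5.0, 5.0) for _ in range(m)]
        a = rng.uniform(-3.0, 3.0)
        lam = rng.uniform(0.0, 4.0)
        t = rng.random()
        # Translation equivariance: rho(Z + a) = rho(Z) + a.
        assert abs(rho([z_i + a for z_i in z], pmfs) - (rho(z, pmfs) + a)) < tol
        # Positive homogeneity: rho(lam * Z) = lam * rho(Z) for lam >= 0.
        assert abs(rho([lam * z_i for z_i in z], pmfs) - lam * rho(z, pmfs)) < tol
        # Monotonicity: max(Z, W) >= Z pointwise implies a larger risk.
        upper = [max(z_i, w_i) for z_i, w_i in zip(z, w)]
        assert rho(upper, pmfs) >= rho(z, pmfs) - tol
        # Convexity: rho(tZ + (1-t)W) <= t rho(Z) + (1-t) rho(W).
        mix = [t * z_i + (1.0 - t) * w_i for z_i, w_i in zip(z, w)]
        assert rho(mix, pmfs) <= t * rho(z, pmfs) + (1.0 - t) * rho(w, pmfs) + tol
    return True


# Hypothetical ambiguity set on a sample space with three atoms.
example_pmfs = [[0.2, 0.5, 0.3], [0.5, 0.4, 0.1], [1 / 3, 1 / 3, 1 / 3]]
axioms_hold = check_coherence_axioms(example_pmfs)
```

A passing run only indicates that no violation was found on the sampled vectors; the general statement is the content of the argument above.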
It is worth noting that if $h$ is a convex random function in (1.5), i.e., $h(\cdot, \xi)$ is convex in $x$ for almost every $\xi$, then $\rho[h(\cdot, \tilde{\xi})]$ is convex in $x$. Convexity of $g$ in (1.6) also implies the convexity of the region induced by the risk constraints $\rho[g(\cdot, \tilde{\xi})] \le 0$.
In our setup, neither $h(\cdot, \xi)$ nor $g(\cdot, \xi)$ needs to be convex, as for example in the case where they are indicator functions. We now state the connection between the worst-case expectation with respect to a set of probability distributions induced by $\tilde{\xi}$ and law invariant risk measures.
Theorem. ([289, Theorem 2.3]) Consider $\mathcal{Z}$ and $\mathcal{Z}^*$ as defined in Theorem 3.6. Also, consider $\rho: \mathcal{Z} \to \mathbb{R}$, defined as $\rho(Z) = \sup_{P \in \mathcal{P}} E_P[Z]$, $\forall Z \in \mathcal{Z}$. If the set $\mathcal{P}$ is law invariant, then the corresponding risk measure $\rho$ is law invariant. Conversely, if the risk measure $\rho$ is law invariant, and the set $\mathcal{P}$ is convex and weakly* closed, then the set $\mathcal{P}$ is law invariant.
For the connection between a general multistage DRO model, risk-averse multistage programming with conditional coherent risk mappings, and the concept of time consistency of the problem and policies, we refer to Shapiro [286,288,290].

3.3.2. Relationship with Chance-Constrained Optimization.
In the previous section, we discussed how DRO is connected to risk-averse optimization. In this section, we present another perspective that connects DRO to risk-averse optimization through a proper choice of the uncertainty set of the random variablesξ, as in RO.
Many approaches in RO construct the uncertainty set for the parameters $\tilde{\xi}$ such that the uncertainty set implies a probabilistic guarantee with respect to the true unknown distribution. To explain how this construction is related to risk and DRO, consider the uncertain constraints $g(x, \tilde{\xi}) \le 0$ for a fixed $x$. Suppose that $\tilde{\xi}$ belongs to a bounded uncertainty set $\mathcal{U} \subseteq \mathbb{R}^d$, i.e., $\mathcal{U}$ is the support of $\tilde{\xi}$. The RO counterpart of this constraint can then be formulated as

(3.3)  $g(x, \xi) \le 0, \quad \forall \xi \in \mathcal{U}$.
Two criticisms of (3.3) are that (1) it treats all uncertain parameters $\xi \in \mathcal{U}$ with equal weights and (2) all the parametrized constraints are hard, i.e., no violation is accepted. An alternative framework to reduce the conservatism caused by this approach is to use a chance constraint framework that allows a small probability of violation (with respect to the probability distribution of $\tilde{\xi}$) instead of enforcing the constraint to be satisfied almost everywhere. Under the assumption that $\tilde{\xi}$ is defined on a probability space $(\Xi, \mathcal{F}, P^{\mathrm{true}})$, the chance constraint framework can be represented as follows:

(3.4)  $P^{\mathrm{true}}\{ g(x, \tilde{\xi}) \le 0 \} \ge 1 - \epsilon$,

for some $0 < \epsilon < 1$. The parameter $\epsilon$ controls the risk of violating the uncertain constraint $g(x, \tilde{\xi}) \le 0$. In fact, as $\epsilon$ goes to zero, the set $\{ x \mid P^{\mathrm{true}}\{ g(x, \tilde{\xi}) \le 0 \} \ge 1 - \epsilon \}$ of solutions feasible to (3.4) shrinks. Motivated by the chance constraint framework (3.4), many approaches in RO construct an uncertainty set $\mathcal{U}_\epsilon$ such that a feasible solution to a problem of the form (3.3) will also be feasible with probability at least $1 - \epsilon$ with respect to $P^{\mathrm{true}}$. More precisely, for any fixed $x$, these constructions guarantee that the following implication holds:

(C1)  If $g(x, \xi) \le 0$ for all $\xi \in \mathcal{U}_\epsilon$, then $P^{\mathrm{true}}\{ g(x, \tilde{\xi}) \le 0 \} \ge 1 - \epsilon$.

However, as we argued before, the probability measure $P^{\mathrm{true}}$ cannot be known with certainty. As far as it is relevant to the scope and interest of this paper, there are two streams of research that handle the ambiguity about the true probability distribution and obtain a safe (or, conservative) approximation$^5$ to (3.4)$^6$: (1) scenario approximation schemes of (3.3) based on Monte Carlo sampling, see, e.g., Campi and Calafiore [63], Calafiore and Campi [60], Nemirovski and Shapiro [214], Campi and Garatti [62], Luedtke and Ahmed [198], Ben-Tal and Nemirovski [21], and (2) the DRO approach to (3.4), see, e.g., Nemirovski and Shapiro [213], Erdogan and Iyengar [102]. Research on scenario approximation of (3.3) focuses on providing a probabilistic guarantee (with respect to the sample probability measure) that a solution to the sampled problem of (3.3) is feasible to (3.4) with a high probability.
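The DRO approach, on the other hand, forms a version of (3.4) as follows:
$$ \inf_{P \in \mathcal{P}} P\{ g(x, \tilde{\xi}) \le 0 \} \ge 1 - \epsilon. \tag{3.5} $$
Let $\hat{\mathcal{X}}_\epsilon$ denote the feasibility set induced by (3.5):
$$ \hat{\mathcal{X}}_\epsilon := \Big\{ x : \inf_{P \in \mathcal{P}} P\{ g(x, \tilde{\xi}) \le 0 \} \ge 1 - \epsilon \Big\}. $$
If $P^{\text{true}} \in \mathcal{P}$, then $x \in \hat{\mathcal{X}}_\epsilon$ implies $x \in \mathcal{X}_\epsilon$, where $\mathcal{X}_\epsilon$ denotes the feasibility set induced by (3.4). That is, $\hat{\mathcal{X}}_\epsilon$ provides a conservative approximation to $\mathcal{X}_\epsilon$ (footnote 7). By leveraging a goodness-of-fit test, Bertsimas et al. [42] construct a $(1 - \alpha)$-confidence region $\mathcal{P}(\alpha)$ for $P^{\text{true}}$. Such a construction leads to an uncertainty set $\mathcal{U}_\epsilon(\alpha)$ that guarantees the implication (C1) [42].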
Let us now assume that the sample space $\Xi$ is finite. By the relationship between RO and DRO, discussed in Section 3.2, one may think of the parameter $\xi$ in (3.3) as representing a probability distribution $p$ on $\mathbb{R}^d$, which is random. That said, we may define $f(x, p) := \mathcal{R}_p[g(x, \tilde{\xi})]$. By leveraging the results in Bertsimas et al. [42], we aim to construct a data-driven ambiguity set $\mathcal{P}$ that guarantees the counterpart of implication (C1) for the constraint $f(x, p) \le 0$, with $p$ in place of $\xi$; we refer to this implication as (C2).
Footnote 5. A set of constraints is called a safe or conservative approximation of the chance constraint if the feasible region induced by the approximation is a subset of the feasible region induced by the chance constraint.
Footnote 6. There is another stream of research that approximates (3.4) by CVaR or its approximations, see, e.g., Chen et al. [74], Chen and Sim [71], Chen et al. [72] and references therein.
Footnote 7. One can in turn seek a safe approximation to (3.5). For example, one stream of such approximations uses Chebyshev's inequality, see, e.g., Popescu [238], Bertsimas and Popescu [33], Bernstein's inequality, see, e.g., Nemirovski and Shapiro [213], or Hoeffding's inequality. We review such safe approximations to (3.5) in detail in Section 5.
Theorem 3.9. (Bertsimas et al. [42, Theorem 2]) Suppose that for any fixed $x$, $\mathcal{R}_p[g(x, \tilde{\xi})]$ is concave in $p$. Consider a set of data $\{\xi_i\}_{i=1}^N$, drawn independently and identically distributed (i.i.d.) according to $P^{\text{true}}$. Let $\mathcal{P}(\alpha)$ be a $(1 - \alpha)$-confidence region for $P^{\text{true}}$, constructed from a goodness-of-fit test on data. Moreover, for any $y \in \mathbb{R}^d$, let $l_\epsilon(y; \alpha)$ be a closed, convex, finite-valued, and positively homogeneous (in $y$) upper bound to the worst-case VaR of $y^\top p$ at level $1 - \epsilon$ over $\mathcal{P}(\alpha)$, i.e.,
$$ \sup_{P \in \mathcal{P}(\alpha)} \mathrm{VaR}^{P}_{1-\epsilon}\big( y^\top p \big) \le l_\epsilon(y; \alpha), \quad y \in \mathbb{R}^d. $$
Then, the closed, convex set $\mathcal{P}_\epsilon(\alpha)$ for which $\delta^*\big(y \mid \mathcal{P}_\epsilon(\alpha)\big) = l_\epsilon(y; \alpha)$, where $\delta^*(\cdot \mid S)$ denotes the support function of a set $S$, guarantees the implication (C2) with probability at least $1 - \alpha$ (with respect to the sample probability measure).
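As a byproduct of Theorem 3.9, $\delta^*\big(y \mid \mathcal{P}_\epsilon(\alpha)\big) \le b$ provides a safe approximation to $\inf_{P \in \mathcal{P}(\alpha)} P\{ y^\top p \le b \} \ge 1 - \epsilon$. That is, there is a one-to-one correspondence between the ambiguity sets $\mathcal{P}_\epsilon(\alpha)$ that satisfy the probabilistic guarantee (C2) and safe approximations to $\inf_{P \in \mathcal{P}(\alpha)} P\{ y^\top p \le b \} \ge 1 - \epsilon$.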

3.4. Relationship with Function Regularization. The goal of this section is to discuss the relationship of DRO/RO with the function regularization commonly used in machine learning.
3.4.1. DRO and Regularization. Several papers have shown that DRO problems formed via the optimal transport discrepancy and φ-divergences are connected to regularization. When the optimal transport discrepancy is used, as shown in Shafieezadeh-Abadeh et al. [273], Blanchet et al. [50], Gao and Kleywegt [110], many mainstream machine learning classification and regression models, including support vector machines (SVM), regularized logistic regression, and the Least Absolute Shrinkage and Selection Operator (LASSO), have a direct distributionally robust interpretation that connects regularization to protection against disturbances in the data. To state this result, we first present a duality theorem, due to Blanchet and Murthy [49], and we relegate the technical details and assumptions to Section 5. On the other hand, when φ-divergences are used, the DRO problem is connected to variance regularization, see, e.g., Duchi et al. [92], Namkoong and Duchi [209].
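Let us begin by defining the optimal transport discrepancy. Consider two probability measures $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$. Let $\Pi(P_1, P_2)$ denote the set of all probability measures on $(\Xi \times \Xi, \mathcal{F} \times \mathcal{F})$ whose marginals are $P_1$ and $P_2$:
$$ \Pi(P_1, P_2) := \big\{ \pi \in \mathcal{M}(\Xi \times \Xi, \mathcal{F} \times \mathcal{F}) : \pi(A \times \Xi) = P_1(A), \ \pi(\Xi \times A) = P_2(A), \ \forall A \in \mathcal{F} \big\}. $$
An element of the above set is called a coupling or transport plan. Furthermore, suppose that there is a lower semicontinuous function $c : \Xi \times \Xi \to \mathbb{R}_+ \cup \{\infty\}$ with $c(s_1, s_2) = 0$ if $s_1 = s_2$. Then, the optimal transport discrepancy between $P_1$ and $P_2$ is defined as (footnote 8):
$$ d^{W}_{c}(P_1, P_2) := \inf_{\pi \in \Pi(P_1, P_2)} \int_{\Xi \times \Xi} c(s_1, s_2) \, \pi(ds_1 \times ds_2). $$
Footnote 8. One can similarly define the optimal transport discrepancy between two probability distributions $P_1$ and $P_2$ induced by $\tilde{\xi}$.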
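Theorem 3.10. (Blanchet and Murthy [49]) Consider the ambiguity set
$$ \mathcal{P}^{W}(P_0; \epsilon) := \big\{ P \in \mathcal{M}(\Xi, \mathcal{F}) : d^{W}_{c}(P, P_0) \le \epsilon \big\}, $$
formed via the optimal transport discrepancy $d^{W}_{c}(P, P_0)$, where $c$ is the transportation cost function, $\epsilon$ is the size of the ambiguity set (i.e., level of robustness), and $P_0$ is a nominal probability measure. Then, for a fixed $x \in \mathcal{X}$, we have
$$ \sup_{P \in \mathcal{P}^{W}(P_0; \epsilon)} \mathbb{E}_P\big[ h(x, \tilde{\xi}) \big] = \inf_{\lambda \ge 0} \Big\{ \lambda \epsilon + \mathbb{E}_{P_0}\Big[ \sup_{s \in \Xi} \big\{ h(x, s) - \lambda c(\tilde{\xi}, s) \big\} \Big] \Big\}. $$
We can use Theorem 3.10 to explicitly state the connection between DRO and regularization. We adopt the following two theorems from Blanchet and Murthy [49], due to their generality. However, similar results are obtained in other papers, see, e.g., Shafieezadeh-Abadeh et al. [273], Gao and Kleywegt [110].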
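Theorem 3.11. Consider a set of data $\{\xi_i := (u_i, y_i)\}_{i=1}^N$, where $u_i \in \mathbb{R}^n$ is a vector of covariates and $y_i \in \mathbb{R}$ is the response variable. Suppose that $P_N$ is the empirical probability distribution on $\{\xi_i\}_{i=1}^N$, and let $\|\cdot\|_p$ and $\|\cdot\|_q$ be norms with $\frac{1}{p} + \frac{1}{q} = 1$.
• For a linear regression model with a square loss function $h_1(x, \xi) := (y - x^\top u)^2$, if $c\big((u, y), (u', y')\big) = \|u - u'\|_q^2$ when $y = y'$ and $\infty$ otherwise, then
$$ \inf_{x} \sup_{P \in \mathcal{P}^{W}(P_N; \epsilon)} \mathbb{E}_P\big[ h_1(x, \tilde{\xi}) \big] = \inf_{x} \Big( \sqrt{\mathbb{E}_{P_N}\big[ h_1(x, \tilde{\xi}) \big]} + \sqrt{\epsilon} \, \|x\|_p \Big)^2. $$
• For a logistic regression model with cost function $h_2(x, \xi) := \log\big(1 + \exp(-y \, x^\top u)\big)$, if $c\big((u, y), (u', y')\big) = \|u - u'\|_q$ when $y = y'$ and $\infty$ otherwise, then
$$ \inf_{x} \sup_{P \in \mathcal{P}^{W}(P_N; \epsilon)} \mathbb{E}_P\big[ h_2(x, \tilde{\xi}) \big] = \inf_{x} \Big\{ \mathbb{E}_{P_N}\big[ h_2(x, \tilde{\xi}) \big] + \epsilon \, \|x\|_p \Big\}. $$
As stated in Theorem 3.11, we can rewrite an unconstrained DRO model with the optimal transport discrepancy as a minimization problem in which the objective function includes, on one hand, an expected-cost term with respect to the empirical distribution and, on the other hand, a regularization term. Two other interesting results can be inferred from Theorem 3.11 about the connection between DRO and regularization: (i) the shape of the transportation cost $c$ in the definition of the optimal transport discrepancy directly implies the type of regularization, and (ii) the size of the ambiguity set is related to the regularization parameter. An important implication of these results is that one can judiciously choose an appropriate regularization parameter for the problem at hand by using the DRO equivalent reformulation. We review the papers that draw this conclusion in Section 5.1.
Now, let us focus on DRO problems formulated via φ-divergences. For two probability measures $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$, the φ-divergence between $P_1$ and $P_2$ is defined as
$$ d_{\phi}(P_1, P_2) := \int_{\Xi} \phi\Big( \frac{dP_1}{dP_2} \Big) \, dP_2, $$
where $\phi : \mathbb{R}_+ \to \mathbb{R}_+ \cup \{\infty\}$ is a convex function with $\phi(1) = 0$.
Theorem 3.12. Consider an ambiguity set formed via the φ-divergence $d_{\phi}(P, P_0)$, where $\epsilon$ is the size of the ambiguity set and $P_0$ is the empirical probability distribution on a set of data $\{\xi_i\}_{i=1}^N$, drawn independently and identically distributed (i.i.d.) according to $P^{\text{true}}$. Furthermore, suppose that $\mathcal{X}$ is compact, there exists a measurable function $M : \Omega \to \mathbb{R}_+$ such that for all $\xi \in \Omega$, $h(\cdot, \xi)$ is $M(\xi)$-Lipschitz with respect to some norm $\|\cdot\|$ on $\mathcal{X}$, $\mathbb{E}_{P^{\text{true}}}\big[ M(\tilde{\xi})^2 \big] < \infty$, and $\mathbb{E}_{P^{\text{true}}}\big[ |h(x_0, \tilde{\xi})| \big] < \infty$ for some $x_0 \in \mathcal{X}$.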
Then, as Theorem 3.12 states, the inner problem of a model of the form (1.5) with φ-divergences can be rewritten as the expected cost plus a regularization term that accounts for the standard deviation of the cost, both taken under the empirical distribution.
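The variance-regularization effect can be checked numerically in a simple special case. The sketch below is not the statement of Theorem 3.12; it assumes a particular divergence, $\phi(t) = \frac{1}{2}(t - 1)^2$, a ball of radius `rho / n` around the empirical distribution (a scaling borrowed from the variance-regularization literature), and a radius small enough that the nonnegativity constraints on the probabilities stay inactive. Under these assumptions the worst-case mean equals the empirical mean plus $\sqrt{2\rho \, \mathrm{Var}_n / n}$, and the function names and data are hypothetical.

```python
import math

def chi2_divergence_from_uniform(p):
    """D_phi(p || uniform) with phi(t) = (t - 1)^2 / 2 on n atoms."""
    n = len(p)
    return sum((n * pi - 1.0) ** 2 for pi in p) / (2.0 * n)

def chi2_worst_case_mean(z, rho):
    """Worst-case mean of costs z over {p : D_phi(p || uniform) <= rho / n}.

    Returns (value, p). Valid only while the maximizer stays nonnegative,
    which is checked explicitly.
    """
    n = len(z)
    mean = sum(z) / n
    dev = [zi - mean for zi in z]
    norm = math.sqrt(sum(d * d for d in dev))
    if norm == 0.0:
        return mean, [1.0 / n] * n
    # Cauchy-Schwarz: the maximizer tilts the weights along the deviations.
    scale = math.sqrt(2.0 * rho) / norm
    p = [(1.0 + scale * d) / n for d in dev]
    if min(p) < 0.0:
        raise ValueError("rho too large: nonnegativity constraints become active")
    variance = norm ** 2 / n
    value = mean + math.sqrt(2.0 * rho * variance / n)
    return value, p

costs = [1.0, 2.0, 4.0, 7.0]    # hypothetical cost samples h(x, xi_i) at a fixed x
rho = 0.5                       # hypothetical radius parameter
value, p_star = chi2_worst_case_mean(costs, rho)
print("empirical mean:", sum(costs) / len(costs))
print("worst-case mean:", value)
```

The gap between the two printed numbers is exactly the standard-deviation penalty, which is the mechanism behind the regularization interpretation.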
4. General Solution Techniques to Solve DRO Models. In this section, we discuss two approaches to solve (DRO). Let us first reformulate (DRO) as follows:
$$ \begin{aligned} \min_{x \in \mathcal{X},\, \theta} \quad & \theta && \text{(4.1a)} \\ \text{s.t.} \quad & \theta \ge \mathcal{R}_P\big[ h(x, \tilde{\xi}) \big], \quad \forall P \in \mathcal{P}, && \text{(4.1b)} \\ & \mathcal{R}_P\big[ g(x, \tilde{\xi}) \big] \le 0, \quad \forall P \in \mathcal{P}. && \text{(4.1c)} \end{aligned} $$
Reformulation (4.1) is a semi-infinite program (SIP), and at first glance, obtaining an optimal solution to this problem looks unreachable (footnote 10). It is well-known that even convex SIPs cannot be solved directly with numerical methods, and in particular are not amenable to the use of methods such as the interior point method. Therefore, a key step of the solution techniques to handle the semi-infinite quantifier (i.e., $\forall P \in \mathcal{P}$) is to reformulate (4.1) as an optimization problem that is amenable to the use of available optimization techniques and off-the-shelf solvers. Of course, the complexity and tractability of such SIPs and their reformulations depend on the geometry and properties of both the ambiguity set $\mathcal{P}$ and the functions $h(x, \tilde{\xi})$ and $g(x, \tilde{\xi})$. As we shall see in detail in Section 5, proper assumptions on $\mathcal{P}$ and these functions are important in most studies on DRO in order to obtain a solvable reformulation or approximation of (4.1).
In the context of DRO, there are two main approaches to handle the semi-infinite quantifier ∀P and to numerically solve (4.1). Both approaches have their roots in the SIP literature, and they both aim at getting rid of the quantifier ∀P , but in different ways.

4.1. Cutting-Surface Method. The first approach replaces the quantifier $\forall P \in \mathcal{P}$ by a quantifier over some finite subset of $\mathcal{P}$. The idea is to successively solve a relaxed problem of (4.1) over finitely generated inner approximations of the ambiguity set $\mathcal{P}$. To be precise, this approach approximates the semi-infinite constraints for all $P \in \mathcal{P}$ by finitely many ones over a finite set of probability distributions. In each iteration of this approach, a new probability distribution is added to this finite set until optimality criteria are met. We refer to this as a cutting-surface method (also known as the exchange method, following the terminology in the SIP literature, see, e.g., Mehrotra and Papp [202], Hettich and Kortanek [145]). We refer to Pflug and Wozabal [229], Rahimian et al. [251], Bansal et al. [9] as examples of this approach in the context of DRO.
The key requirements in order to use the cutting-surface method are the abilities to (i) solve a relaxation of (4.1) with a finite number of probability distributions to optimality and (ii) generate an $\epsilon$-optimal solution (footnote 11) to a distribution separation subproblem [200].
Theorem 4.1. (Luo and Mehrotra [200, Theorem 3.2]) Suppose that $\mathcal{X} \times \mathcal{P}$ is compact, and $\mathcal{R}_P[h(x, \tilde{\xi})]$ and $\mathcal{R}_P[g(x, \tilde{\xi})]$ are continuous on $\mathcal{X} \times \mathcal{P}$. Moreover, suppose that we have an oracle that generates an optimal solution $(x_k, \theta_k)$ to a relaxation of problem (4.1) for any finite set $\mathcal{P}_k \subseteq \mathcal{P}$, and an oracle that generates an $\epsilon$-optimal solution of the distribution separation subproblem for any $x \in \mathcal{X}$ and $\epsilon > 0$. Suppose that iteratively the relaxed master problem is solved to optimality and yields the solution $(x_k, \theta_k)$, and the distribution separation subproblem is solved to $\frac{\epsilon}{2}$-optimality and yields the solution $P_k$. Then, the stopping criteria $\mathcal{R}_{P_k}[h(x_k, \tilde{\xi})] \le \theta_k + \frac{\epsilon}{2}$ and $\mathcal{R}_{P_k}[g_j(x_k, \tilde{\xi})] \le \frac{\epsilon}{2}$, $j = 1, \ldots, m$, guarantee that an $\epsilon$-feasible solution (footnote 12) to problem (4.1), yielding an objective function value lower bounding the optimal value of (4.1), can be obtained in a finite number of iterations.
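The master/separation loop described above can be sketched on a toy instance. The code below is a minimal illustration under simplifying assumptions that are not from the original text: the risk measure is the expectation, there is no constraint function $g$, the decision set is a finite grid (so the master problem is solved by enumeration), and the ambiguity set is a finite family of probability mass functions (so the separation subproblem is solved exactly by enumeration). The newsvendor-style cost and all names are hypothetical.

```python
def expected_cost(x, pmf, support, cost):
    """E_P[h(x, xi)] for a discrete distribution given by pmf on support."""
    return sum(p * cost(x, xi) for p, xi in zip(pmf, support))

def cutting_surface(decisions, family, support, cost, eps=1e-9, max_iter=100):
    """Exchange-type loop: grow a finite set of distributions until no
    distribution in the family violates theta >= E_P[h(x, xi)] by more than eps."""
    active = [0]  # indices of the distributions kept in the relaxed master
    for iteration in range(1, max_iter + 1):
        # Master: minimize the worst case over the active distributions only.
        theta, x = min(
            (max(expected_cost(x, family[i], support, cost) for i in active), x)
            for x in decisions
        )
        # Separation: most violating distribution in the whole family.
        worst, idx = max(
            (expected_cost(x, pmf, support, cost), i) for i, pmf in enumerate(family)
        )
        if worst <= theta + eps:
            return x, theta, iteration
        active.append(idx)
    raise RuntimeError("no convergence within max_iter iterations")

def cost(x, xi):
    """Hypothetical newsvendor cost: underage penalty 2, overage penalty 1."""
    return 2.0 * max(xi - x, 0.0) + 1.0 * max(x - xi, 0.0)

support = [0.0, 5.0, 10.0]
family = [
    [0.2, 0.6, 0.2],
    [0.6, 0.3, 0.1],
    [0.1, 0.3, 0.6],
    [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0],
]
decisions = list(range(11))

x_star, theta_star, iterations = cutting_surface(decisions, family, support, cost)
print("decision:", x_star, "worst-case cost:", theta_star, "iterations:", iterations)
```

In a realistic setting the two enumerations would be replaced by an optimization solver for the master problem and by a (possibly nonconvex) separation oracle, which is where the assumptions of the convergence theorem come into play.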
It is worth noting that the distribution separation subproblem in the cutting-surface method may be a nonconvex optimization problem. One may efficiently solve (DRO) through the cutting-surface method if the ambiguity set $\mathcal{P}$ can be convexified without causing a change to the optimal value. The following lemma states that if $\mathcal{R}_P[\cdot]$ is convex in $P$ on $\mathcal{M}(\Xi, \mathcal{F})$, then it can be assumed without loss of generality that $\mathcal{P}$ is convex.
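Lemma 4.2. Suppose that $\mathcal{R}_P[\cdot]$ is convex in $P$ on $\mathcal{M}(\Xi, \mathcal{F})$. Then, $x^* \in \mathcal{X}$ is an optimal solution to (DRO) if and only if it is an optimal solution to the following problem:
$$ \inf_{x \in \mathcal{X}} \Big\{ \sup_{P \in \operatorname{conv}(\mathcal{P})} \mathcal{R}_P\big[ h(x, \tilde{\xi}) \big] \ : \ \sup_{P \in \operatorname{conv}(\mathcal{P})} \mathcal{R}_P\big[ g(x, \tilde{\xi}) \big] \le 0 \Big\}, \tag{4.2} $$
where $\operatorname{conv}(\mathcal{P})$ denotes the convex hull of $\mathcal{P}$.
Footnote 11. For an optimization problem of the form $z^* = \min\{ \alpha(x) \mid \beta(x) \le 0 \}$, a point $x_0$ is an $\epsilon$-optimal solution if $\beta(x_0) \le 0$ and $\alpha(x_0) \le z^* + \epsilon$.
Footnote 12. For an optimization problem of the form $z^* = \min\{ \alpha(x) \mid \beta(x) \le 0 \}$, a point $x_0$ is an $\epsilon$-feasible solution if $\beta(x_0) \le \epsilon$.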
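Proof. Problems (DRO) and (4.2) can be reformulated, respectively, as $\min\{ \theta \mid (x, \theta) \in G \}$ and $\min\{ \theta \mid (x, \theta) \in G' \}$, where
$$ G := \big\{ (x, \theta) \in \mathcal{X} \times \mathbb{R} : \theta \ge \mathcal{R}_P[h(x, \tilde{\xi})], \ \mathcal{R}_P[g(x, \tilde{\xi})] \le 0, \ \forall P \in \mathcal{P} \big\}, $$
$$ G' := \big\{ (x, \theta) \in \mathcal{X} \times \mathbb{R} : \theta \ge \mathcal{R}_P[h(x, \tilde{\xi})], \ \mathcal{R}_P[g(x, \tilde{\xi})] \le 0, \ \forall P \in \operatorname{conv}(\mathcal{P}) \big\}. $$
Because $\mathcal{P} \subseteq \operatorname{conv}(\mathcal{P})$, we have $G' \subseteq G$. We now show that $G \subseteq G'$, so that $G = G'$ and the two problems have the same optimal solutions. Consider an arbitrary $(x, \theta) \in G$. For an arbitrary $P \in \operatorname{conv}(\mathcal{P})$, there exists a finite collection $\{P_i\}_{i \in I}$ such that $P = \sum_{i \in I} \lambda_i P_i$, where $\sum_{i \in I} \lambda_i = 1$, $P_i \in \mathcal{P}$, $\lambda_i \ge 0$, $i \in I$. Now, by the convexity of $\mathcal{R}_P[\cdot]$ in $P$, we have
$$ \mathcal{R}_P\big[ h(x, \tilde{\xi}) \big] \le \sum_{i \in I} \lambda_i \mathcal{R}_{P_i}\big[ h(x, \tilde{\xi}) \big] \le \theta \quad \text{and} \quad \mathcal{R}_P\big[ g(x, \tilde{\xi}) \big] \le \sum_{i \in I} \lambda_i \mathcal{R}_{P_i}\big[ g(x, \tilde{\xi}) \big] \le 0. $$
Thus, it follows that $(x, \theta) \in G'$, and hence, $G \subseteq G'$.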

4.2. Dual Method. The second approach to solve (DRO) handles the quantifier $\forall P \in \mathcal{P}$ through the dualization of $\sup_{P \in \mathcal{P}} \mathcal{R}_P[h(x, \tilde{\xi})]$ and $\sup_{P \in \mathcal{P}} \mathcal{R}_P[g(x, \tilde{\xi})] \le 0$. Under suitable regularity conditions, there is no duality gap between the primal problem and its dual, i.e., strong duality holds. Hence, the supremum can be replaced by an infimum over the dual variables, and the resulting condition need only hold for at least one feasible solution in the dual space. We refer to this approach as a dual method. Most of the existing papers in the DRO literature are focused on the dual method, see, e.g., Delage and Ye [82], Bertsimas et al. [39], Wiesemann et al. [317], Ben-Tal et al. [28]. A situation where one benefits from the application of the dual method to solve (DRO) arises in cases where the ambiguity set of probability distributions depends on the decision $x$ as formulated below, see, e.g., Luo and Mehrotra [199], Noyan et al. [221]:
$$ \inf_{x \in \mathcal{X}} \Big\{ \sup_{P \in \mathcal{P}(x)} \mathcal{R}_P\big[ h(x, \tilde{\xi}) \big] \ : \ \sup_{P \in \mathcal{P}(x)} \mathcal{R}_P\big[ g(x, \tilde{\xi}) \big] \le 0 \Big\}, \tag{4.3} $$
where $\mathcal{P}(x)$ denotes a decision-dependent ambiguity set of probability distributions.
The papers that rely on the dual method exploit linear duality, Lagrangian duality, convex analysis (e.g., support function, conjugate duality, Fenchel duality), and conic duality. A fundamental question is then under what conditions the strong duality holds. One such condition is the existence of a probability measure that lies in the interior of the ambiguity set, i.e., the ambiguity set satisfies a Slater-type condition. We refer the readers to the optimization textbooks for results on linear and Lagrangian duality, see, e.g., Bazaraa et al. [16], Bertsekas [30], Ruszczyński [269], Rockafellar [257]. For detailed discussions of the duality theory in infinite-dimensional convex problems, we refer to Rockafellar [257], and we refer to Isii [162] and Shapiro [284] for duality theory in conic linear programs. Below, we briefly present the results from conic duality that are widely used in the dualization of DRO models.
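Theorem 4.3. Consider a conic linear optimization problem of the form
$$ \min_{v \in C} \ \langle c, v \rangle \quad \text{s.t.} \quad Av - b \in K, \tag{4.4} $$
where $C$ and $K$ are convex cones and subsets of linear spaces $V$ and $W$, respectively, and $A : V \to W$ is a linear map such that for any $w^* \in W^*$, there exists a unique $v^* \in V^*$ with $\langle w^*, Av \rangle = \langle v^*, v \rangle$ for all $v \in V$, which defines the adjoint map through $v^* = A^* w^*$. Then, the dual problem to (4.4) is written as
$$ \max_{w^* \in K^*} \ \langle w^*, b \rangle \quad \text{s.t.} \quad c - A^* w^* \in C^*, \tag{4.5} $$
where $C^* := \{ v^* \in V^* : \langle v^*, v \rangle \ge 0, \ \forall v \in C \}$ and $K^* := \{ w^* \in W^* : \langle w^*, w \rangle \ge 0, \ \forall w \in K \}$ denote the dual cones of $C$ and $K$. Moreover, there is no duality gap between (4.4) and (4.5) and both problems have optimal solutions if and only if there exists a feasible pair $(v, w^*)$ such that $\langle w^*, Av - b \rangle = 0$ and $\langle c - A^* w^*, v \rangle = 0$.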
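The complementarity conditions can be verified by hand on the simplest conic program, a linear program, where both cones are nonnegative orthants and hence self-dual. The sketch below assumes the sign convention "minimize $\langle c, v \rangle$ subject to $Av - b \in K$, $v \in C$" with dual "maximize $\langle w^*, b \rangle$ subject to $c - A^* w^* \in C^*$, $w^* \in K^*$"; the data and the candidate pair are hypothetical and chosen so that everything can be checked mentally.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(A, v):
    """Compute A v for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]

def mat_transpose_vec(A, w):
    """Compute A^T w, i.e., the adjoint map applied to w."""
    return [sum(A[i][j] * w[i] for i in range(len(A))) for j in range(len(A[0]))]

# Toy LP: minimize 2 v1 + 3 v2  subject to  v1 + v2 - 1 >= 0,  v >= 0.
c = [2.0, 3.0]
A = [[1.0, 1.0]]
b = [1.0]

v = [1.0, 0.0]   # candidate primal solution
w = [2.0]        # candidate dual solution

primal_slack = [s - bi for s, bi in zip(mat_vec(A, v), b)]        # A v - b, must lie in K
dual_slack = [ci - ai for ci, ai in zip(c, mat_transpose_vec(A, w))]  # c - A^T w, must lie in C*

print("primal value:", dot(c, v))
print("dual value:", dot(w, b))
print("complementarity:", dot(w, primal_slack), dot(dual_slack, v))
```

Both complementarity products vanish for this pair, so the primal and dual values coincide, which is the finite-dimensional instance of the no-duality-gap condition.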
It is worth noting that other numerical methods to solve a SIP, such as penalty methods, see, e.g., Lin et al. [190], Yang et al. [336], smooth approximation and projection methods, see, e.g., Xu et al. [332], and primal methods, see, e.g., Wang and Yuan [314], have not been popular in the DRO literature, although there are a few exceptions. Liu et al. [192] propose to discretize DRO by a min-max problem in a finite dimensional space, where the ambiguity set is replaced by a set of distributions on a discrete support set. Then, they consider lifting techniques to reformulate the discretized DRO as a saddle-point problem, if needed, and implement a primal-dual hybrid algorithm to solve the problem. They showcase this method for cases where the ambiguity set is formed via the moment constraints as in (5.23) or the Wasserstein metric, and they present the quantitative convergence of the optimal value and optimal solutions. Other iterative primal methods that have been proposed to solve a DRO model include Lam and Ghosh [177] for χ 2 -distance, and Ghosh et al. [115], Namkoong and Duchi [208], Ghosh and Lam [114] for general φ-divergences.

5. Choice of Ambiguity Set of Probability Distributions. The ambiguity set of distributions in a DRO model provides a flexible framework to model uncertainty by allowing the modelers to incorporate partial information about the uncertainty, obtained from historical data or domain-specific knowledge. This information includes, but is not limited to, the support of the uncertainty, discrepancy from the reference distribution, descriptive statistics, and structural properties, such as symmetry and unimodality. Early DRO models considered ambiguity sets based on the support and moment information, for which techniques in global optimization for polynomial optimization problems and the problem of moments are applied to obtain reformulations, see, e.g., Lasserre [182], Bertsimas et al. [38], Bertsimas and Popescu [33], Popescu [238,239], Gilboa and Schmeidler [117]. Since then, many researchers have incorporated information such as descriptive statistics as well as the structural properties of the underlying unknown true distribution into the ambiguity set.
There are usually two principles to choose the ambiguity set: (1) P should be chosen as small as possible, (2) P should contain the unknown true distribution with certainty (or at least, with a high confidence). Abiding by these two principles not only reduces the conservatism of the problem but it also robustifies the problem against the unknown true distribution. These two, in turn, give rise to two questions: (1) what should be the shape of the ambiguity set and (2) what should be the size of the ambiguity set. We discuss the latter in Section 6, and focus on the shape of the ambiguity set in this section.
With a few exceptions, the common practice in constructing the ambiguity set is that the shape of the set is first determined by decision makers/modelers. In this step, data does not directly affect the choice of the shape of the ambiguity set. Then, the parameters that control the size of the ambiguity set are chosen in a data-driven fashion. We emphasize that, although this is a common practice, the size and shape of the ambiguity set need not be chosen separately. To make the transition between Sections 5 and 6 somewhat smoother, we devote Section 5.4 to reviewing those papers that address these two questions simultaneously.
When dealing with the question of the shape of the ambiguity set, most researchers, on one hand, have focused on ambiguity sets that facilitate a tractable (exact or conservative approximate) formulation, such as a linear program (LP), second-order cone program (SOCP), or to a lesser degree, semidefinite program (SDP), so that efficient computational techniques can be developed. On the other hand, many researchers have focused on the expressiveness of the ambiguity set by incorporating information such as descriptive statistics as well as the structural properties of the underlying unknown true distribution.
In what follows in this section, we review different approaches to model the distributional ambiguity. We acknowledge that the ambiguity sets in the literature are typically categorized in two groups: moment-based and discrepancy-based ambiguity sets. In short, moment-based ambiguity sets contain distributions whose moments satisfy certain properties, while discrepancy-based ambiguity sets contain distributions that are close to a nominal distribution in the sense of some discrepancy measure. Within these two groups, some specific ambiguity sets have been given names, see, e.g., Hanasusanto et al. [134]. For example,
• Markov ambiguity set contains all distributions with known mean and support,
• Chebyshev ambiguity set contains all distributions with bounds on the first- and second-order moments,
• Gauss ambiguity set contains all unimodal distributions from within the Chebyshev ambiguity set,
• Median-absolute deviation ambiguity set contains all symmetric distributions with known median and mean absolute deviation,
• Huber ambiguity set contains all distributions with known upper bound on the expected Huber loss function,
• Hoeffding ambiguity set contains all componentwise independent distributions with a box support,
• Bernstein ambiguity set contains all distributions from within the Hoeffding ambiguity set subject to marginal moment bounds,
• Choquet ambiguity set contains all distributions that can be written as an infinite convex combination of extremal distributions of the set,
• Mixture ambiguity set contains all distributions that can be written as a mixture of a parametric family of distributions.
While we use the above terminology in this paper, we categorize DRO papers into four groups:
• Discrepancy-based ambiguity sets (Section 5.1),
• Moment-based ambiguity sets (Section 5.2),
• Shape-preserving ambiguity sets (Section 5.3),
• Kernel-based ambiguity sets (Section 5.4).
We briefly mentioned what is meant by discrepancy-based and moment-based ambiguity sets. In short, shape-preserving ambiguity sets contain distributions with similar structural properties (e.g., unimodality, symmetry). Kernel-based ambiguity sets contain distributions that are formed via a kernel whose parameters are close to the parameters of a nominal kernel function. The above groups are not necessarily disjoint from a modeling perspective and there are some overlaps between them. However, we try to assign papers to these categories as closely as possible to what the authors explicitly or implicitly might have stated in their work.
We review these four groups of ambiguity sets in Sections 5.1-5.4. Finally, we review the papers that are general and do not consider a specific form for the ambiguity set in Section 5.5.
5.1. Discrepancy-Based Ambiguity Sets. In many situations, we have a nominal or baseline estimate of the underlying probability distribution. A natural way to hedge against the distributional ambiguity is then to consider a neighborhood of the nominal probability distribution by allowing some perturbations around it. So, the ambiguity set can be formed with all probability distributions whose discrepancy or dissimilarity to the nominal probability distribution is sufficiently small. More precisely, such an ambiguity set has the following generic form:
$$ \mathcal{P}^{d}(P_0; \epsilon) := \big\{ P \in \mathcal{M}(\Xi, \mathcal{F}) \ \big| \ d(P, P_0) \le \epsilon \big\}, \tag{5.1} $$
where $P_0$ denotes the nominal probability measure, and $d : \mathcal{M}(\Xi, \mathcal{F}) \times \mathcal{M}(\Xi, \mathcal{F}) \to \mathbb{R}_+ \cup \{\infty\}$ is a functional that measures the discrepancy between two probability measures $P, P_0 \in \mathcal{M}(\Xi, \mathcal{F})$, dictating the shape of the ambiguity set. Moreover, the parameter $\epsilon \in [0, \infty]$ controls the size of the ambiguity set, and it can be interpreted as the decision maker's belief in $P_0$. The parameter $\epsilon$ is also referred to as the level of robustness.
A generic ambiguity set of the form (5.1) has been widely studied in the DRO literature. We relegate the discussion about $P_0$ and $\epsilon$ to Section 6. In this section, we review different discrepancy functionals $d(\cdot, \cdot)$ that are used in the literature. These include (i) optimal transport discrepancy, (ii) φ-divergences, (iii) total variation metric, (iv) goodness-of-fit test, (v) Prohorov metric, (vi) $\ell_p$-norm, (vii) ζ-structure metric, (viii) Levy metric, and (ix) contamination neighborhood. We emphasize that although all studied functionals $d$ can quantify the discrepancy between two probability measures, they may or may not be a metric. For example, Prohorov and total variation are probability metrics, see, e.g., Gibbs and Su [116], while the Kullback-Leibler divergence and the $\chi^2$-distance from the family of φ-divergences are not probability metrics. Thus, we refer to the models of the form (5.1) collectively as discrepancy-based ambiguity sets.
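The distinction between a metric and a general discrepancy is easy to see numerically. The sketch below evaluates three functionals on a pair of hypothetical discrete distributions. It assumes one common convention for each: total variation as half the $\ell_1$ distance, Kullback-Leibler as $\sum_i p_i \log(p_i / q_i)$, and the $\chi^2$-distance as $\sum_i (p_i - q_i)^2 / q_i$; other normalizations appear in the literature.

```python
import math

def total_variation(p, q):
    """Half the l1 distance between two probability mass functions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kullback_leibler(p, q):
    """KL divergence of p from q; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def chi_square(p, q):
    """Pearson chi-square distance of p from q; assumes q_i > 0."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

p = [0.5, 0.4, 0.1]   # hypothetical distribution
q = [0.2, 0.3, 0.5]   # hypothetical nominal distribution

for name, d in [("TV", total_variation), ("KL", kullback_leibler), ("chi2", chi_square)]:
    print(f"{name}: d(p, q) = {d(p, q):.4f}, d(q, p) = {d(q, p):.4f}")
```

Total variation returns the same value in both directions, while the two φ-divergences do not, which is one reason they fail to be metrics.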
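5.1.1. Optimal Transport Discrepancy. We begin this section by providing more details on the optimal transport discrepancy. Consider two probability measures $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$. Let $\Pi(P_1, P_2)$ denote the set of all probability measures on $(\Xi \times \Xi, \mathcal{F} \times \mathcal{F})$ whose marginals are $P_1$ and $P_2$:
$$ \Pi(P_1, P_2) := \big\{ \pi \in \mathcal{M}(\Xi \times \Xi, \mathcal{F} \times \mathcal{F}) : \pi(A \times \Xi) = P_1(A), \ \pi(\Xi \times A) = P_2(A), \ \forall A \in \mathcal{F} \big\}. $$
Furthermore, suppose that there is a lower semicontinuous function $c : \Xi \times \Xi \to \mathbb{R}_+ \cup \{\infty\}$ with $c(s_1, s_2) = 0$ if $s_1 = s_2$. Then, the optimal transport discrepancy between $P_1$ and $P_2$ is defined as:
$$ d^{W}_{c}(P_1, P_2) := \inf_{\pi \in \Pi(P_1, P_2)} \int_{\Xi \times \Xi} c(s_1, s_2) \, \pi(ds_1 \times ds_2). \tag{5.2} $$
If, in addition, the function $c$ is symmetric (i.e., $c(s_1, s_2) = c(s_2, s_1)$) and $c^{1/r}(\cdot)$ satisfies a triangle inequality for some $1 \le r < \infty$ (i.e., $c^{1/r}(s_1, s_2) \le c^{1/r}(s_1, s_3) + c^{1/r}(s_3, s_2)$), then $\big( d^{W}_{c}(P_1, P_2) \big)^{1/r}$ metricizes the weak convergence in $\mathcal{M}(\Xi, \mathcal{F})$, see, e.g., Villani [312, Theorem 6.9]. If $\Xi$ is equipped with a metric $d$ and $c(\cdot) = d^r(\cdot)$, then $\big( d^{W}_{c}(P_1, P_2) \big)^{1/r}$ is called the Wasserstein metric of order $r$, or the $r$-Wasserstein metric for short (footnote 13).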
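For intuition, the infimum over couplings can be computed explicitly in a very special case. The sketch below assumes two empirical distributions on the real line with the same number of equally weighted atoms and the cost $c(s_1, s_2) = |s_1 - s_2|$, so that the discrepancy is the 1-Wasserstein metric. In this setting an optimal coupling can be taken to be a permutation of the atoms, and matching the sorted samples is optimal; the brute-force search over permutations is included only as a cross-check, and the data are hypothetical.

```python
from itertools import permutations

def wasserstein1_sorted(xs, ys):
    """1-Wasserstein distance between two equal-size empirical distributions
    on the real line, computed by matching sorted atoms."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def wasserstein1_bruteforce(xs, ys):
    """Same quantity by enumerating all permutation couplings (tiny inputs only)."""
    n = len(xs)
    return min(
        sum(abs(a - b) for a, b in zip(xs, perm)) for perm in permutations(ys)
    ) / n

xs = [0.0, 1.0, 3.0, 7.0]   # hypothetical atoms of P_1
ys = [2.5, 0.5, 9.0, 4.0]   # hypothetical atoms of P_2

print("sorted matching:", wasserstein1_sorted(xs, ys))
print("brute force:    ", wasserstein1_bruteforce(xs, ys))
```

For general supports, unequal weights, or other cost functions, the infimum is a linear program over couplings and no such shortcut is available.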
The optimal transport discrepancy (5.2) can be used to form an ambiguity set of probability measures as follows:
$$ \mathcal{P}^{W}(P_0; \epsilon) := \big\{ P \in \mathcal{M}(\Xi, \mathcal{F}) \ \big| \ d^{W}_{c}(P, P_0) \le \epsilon \big\}. \tag{5.3} $$
Over the past few years, there has been a significant growth in the popularity of the optimal transport discrepancy to model the distributional ambiguity in DRO, in both operations research and machine learning communities, see, e.g., Pflug and Wozabal [229], Mehrotra and Zhang [203], Mohajerin Esfahani and Kuhn [205], Gao and Kleywegt [110], Chen et al. [76], Blanchet et al. [53], Lee and Mehrotra [183], Luo and Mehrotra [200], Shafieezadeh-Abadeh et al. [273], Sinha et al. [298], Lee and Raginsky [185], Shafieezadeh-Abadeh et al. [275], Singh and Póczos [297]. Pioneered by the work of Pflug and Wozabal [229], most of the literature has focused on the Wasserstein metric. Before we review these papers, we present a duality result on $\sup_{P \in \mathcal{P}^{W}(P_0; \epsilon)} \mathbb{E}_P[h(x, \tilde{\xi})]$, proved in a general form in Blanchet and Murthy [49].
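Because the infimum in the definition of (5.2) is attained for a lower semicontinuous function $c$ [312,249], we can rewrite $\sup_{P \in \mathcal{P}^{W}(P_0; \epsilon)} \mathbb{E}_P[h(x, \tilde{\xi})]$ as follows:
$$ \sup_{\pi \in \Phi_{P_0, \epsilon}} \int_{\Xi \times \Xi} h(x, s_2) \, \pi(ds_1 \times ds_2), \tag{5.4} $$
where
$$ \Phi_{P_0, \epsilon} := \Big\{ \pi \in \bigcup_{P \in \mathcal{M}(\Xi, \mathcal{F})} \Pi(P_0, P) : \int_{\Xi \times \Xi} c(s_1, s_2) \, \pi(ds_1 \times ds_2) \le \epsilon \Big\}. $$
Recall that $\mathcal{S}(\Xi, \mathcal{F})$ is the collection of all $\mathcal{F}$-measurable functions $Z : (\Xi, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. With the primal problem (5.4), we have a dual problem
$$ \inf_{(\lambda, \phi) \in \Lambda_{c, h(x, \cdot)}} \Big\{ \lambda \epsilon + \int_{\Xi} \phi(s) \, P_0(ds) \Big\}, \tag{5.5} $$
where
$$ \Lambda_{c, h(x, \cdot)} := \big\{ (\lambda, \phi) : \lambda \ge 0, \ \phi \in \mathcal{S}(\Xi, \mathcal{F}), \ \phi(s_1) + \lambda c(s_1, s_2) \ge h(x, s_2), \ \forall s_1, s_2 \in \Xi \big\}. $$
Theorem 5.1. (Blanchet and Murthy [49, Theorem 1]) For a fixed $x \in \mathcal{X}$, suppose that $h(x, \cdot)$ is upper semicontinuous and $P_0$-integrable, i.e., $\int_{\Xi} |h(x, \tilde{\xi}(s))| \, P_0(ds) < \infty$. Then,
$$ \sup_{\pi \in \Phi_{P_0, \epsilon}} \int_{\Xi \times \Xi} h(x, s_2) \, \pi(ds_1 \times ds_2) = \inf_{(\lambda, \phi) \in \Lambda_{c, h(x, \cdot)}} \Big\{ \lambda \epsilon + \int_{\Xi} \phi(s) \, P_0(ds) \Big\}. $$
Moreover, there exists a dual optimal solution of the form $(\lambda, \phi_\lambda)$, for some $\lambda \ge 0$, where $\phi_\lambda(s_1) := \sup_{s_2 \in \Xi} \{ h(x, s_2) - \lambda c(s_1, s_2) \}$. In addition, any feasible $\pi^* \in \Phi_{P_0, \epsilon}$ and $(\lambda^*, \phi_{\lambda^*}) \in \Lambda_{c, h(x, \cdot)}$ are primal and dual optimizers, satisfying
$$ \int_{\Xi \times \Xi} h(x, s_2) \, \pi^*(ds_1 \times ds_2) = \lambda^* \epsilon + \int_{\Xi} \phi_{\lambda^*}(s) \, P_0(ds), $$
if and only if
$$ h(x, s_2) - \lambda^* c(s_1, s_2) = \phi_{\lambda^*}(s_1), \ \ \pi^*\text{-almost surely}, \quad \text{and} \quad \lambda^* \Big( \int_{\Xi \times \Xi} c(s_1, s_2) \, \pi^*(ds_1 \times ds_2) - \epsilon \Big) = 0. $$
Corollary 5.2. For a fixed $x \in \mathcal{X}$, suppose that $h(x, \cdot)$ is upper semicontinuous and $P_0$-integrable. Then,
$$ \sup_{P \in \mathcal{P}^{W}(P_0; \epsilon)} \mathbb{E}_P\big[ h(x, \tilde{\xi}) \big] = \inf_{\lambda \ge 0} \Big\{ \lambda \epsilon + \mathbb{E}_{P_0}\Big[ \sup_{s \in \Xi} \big\{ h(x, s) - \lambda c(\tilde{\xi}, s) \big\} \Big] \Big\}. $$
The importance of Theorem 5.1 and Corollary 5.2 is that (1) the transportation cost $c(\cdot, \cdot)$ is only assumed to be lower semicontinuous, (2) the function $h(x, \cdot)$ is only assumed to be upper semicontinuous and integrable, and (3) $\Xi$ is a general Polish space. In fact, there are only mild conditions on $h(x, \cdot)$ and the function $c$, and $P_0$ can be any probability measure. Moreover, $\sup_{P \in \mathcal{P}^{W}(P_0; \epsilon)} \mathbb{E}_P[h(x, \tilde{\xi})]$ can be obtained by solving a univariate reformulation of the dual problem (5.5), which involves an expectation with respect to $P_0$ and a linear term in the level of robustness $\epsilon$. We shall shortly comment on similar results in the literature but under stronger assumptions. As shown in Section 3.4, by using Theorem 5.1 or its weaker forms, researchers have shown that many mainstream machine learning algorithms, such as regularized logistic regression and LASSO, have a DRO representation, see, e.g., Blanchet et al. [50], Blanchet and Kang [48,47], Gao et al. [112], Shafieezadeh-Abadeh et al. [273,274].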
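The univariate dual can be evaluated numerically on a small example. The sketch below is an illustration under assumptions that are not from the original text: the nominal measure is an empirical distribution on a few hypothetical scalar samples, the support is a finite grid on $[-10, 10]$ that also contains the samples, the cost is $c(s_1, s_2) = |s_1 - s_2|$, the function is $h(s) = s$, and the one-dimensional minimization over $\lambda$ is done by golden-section search (one of many possible choices for a convex univariate problem). For this linear $h$, moving probability mass to the right at unit cost per unit gain gives a worst-case mean equal to the empirical mean plus $\epsilon$, which the dual reproduces.

```python
import math

def dual_objective(lam, eps, samples, support, h, c):
    """lam * eps + E_{P0}[ sup_s { h(s) - lam * c(xi, s) } ] for empirical P0."""
    inner = [max(h(s) - lam * c(xi, s) for s in support) for xi in samples]
    return lam * eps + sum(inner) / len(samples)

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    x1 = b - invphi * (b - a)
    x2 = a + invphi * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 <= f2:
            b, x2, f2 = x2, x1, f1
            x1 = b - invphi * (b - a)
            f1 = f(x1)
        else:
            a, x1, f1 = x1, x2, f2
            x2 = a + invphi * (b - a)
            f2 = f(x2)
    x = 0.5 * (a + b)
    return x, f(x)

samples = [-1.0, 0.0, 0.5, 1.0]                       # hypothetical data
grid = [-10.0 + 0.5 * k for k in range(41)]           # discretized support
support = sorted(set(grid) | set(samples))
eps = 0.5                                             # radius of the ambiguity set

def h(s):
    return s

def c(s1, s2):
    return abs(s1 - s2)

lam_star, value = golden_section_min(
    lambda lam: dual_objective(lam, eps, samples, support, h, c), 0.0, 5.0
)
mean0 = sum(samples) / len(samples)
print("nominal mean:", mean0)
print("worst-case mean from the dual:", value, "at lambda =", lam_star)
```

The minimizing $\lambda$ coincides with the Lipschitz constant of $h$ here, which hints at how the dual variable turns into a regularization weight in the learning models mentioned above.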
While a strong duality result for DRO formed via the optimal transport discrepancy is provided in Blanchet and Murthy [49] under mild assumptions by utilizing Fenchel duality, Mohajerin Esfahani and Kuhn [205] and Gao and Kleywegt [110] are also among notable papers in this area. Below, we first highlight the main differences of Mohajerin Esfahani and Kuhn [205] and Gao and Kleywegt [110] with Blanchet and Murthy [49]. Then, we comment on their main contributions.
In Mohajerin Esfahani and Kuhn [205], it is assumed that the transportation cost c(·, ·) is a norm on R n , function h(x,ξ) has specific structures, and the nominal probability measure P 0 is the empirical distribution of data supported on R n . On the other hand, Gao and Kleywegt [110] consider a more general setting than the one in Mohajerin Esfahani and Kuhn [205], but slightly more restricted than that of Blanchet et al. [50]. More precisely, in contrast to Blanchet et al. [50], it is assumed in Gao and Kleywegt [110] that the transportation cost c(·, ·) forms a metric on the underlying Polish space.
Mohajerin Esfahani and Kuhn [205] study data-driven DRO problems formed via the 1-Wasserstein metric utilizing an arbitrary norm on $\mathbb{R}^n$. The main contribution of Mohajerin Esfahani and Kuhn [205] is to prove a strong duality result for the studied problem and to reformulate it as a finite-dimensional convex program for different cost functions, including a pointwise maximum of finitely many concave functions, convex functions, and sums of maxima of concave functions. This contribution is important because most of the previous research on DRO formed via Wasserstein ambiguity sets reformulates the problem as a finite-dimensional nonconvex program and relies on global optimization techniques, such as difference of convex programming, to solve the problem, see, e.g., Wozabal [319, Theorem 6]. In addition, Mohajerin Esfahani and Kuhn [205] propose a procedure to construct an extremal distribution (respectively, a sequence of distributions) that attains the worst-case expectation precisely (respectively, asymptotically). They further show that their solutions enjoy finite-sample and asymptotic consistency guarantees. The results were applied to mean-risk portfolio optimization and to uncertainty quantification problems.
Gao and Kleywegt [110] study DRO problems formed via the p-Wasserstein metric utilizing an arbitrary metric on a Polish space Ξ. Recognizing the fact that the ambiguity set should be chosen judiciously for the application at hand, they argue that by using the Wasserstein metric the resulting distributions hedged against are more reasonable than those resulting from other popular choices of sets, such as φ-divergence-based sets, see Section 5.1.2. They prove a strong duality result for the studied problem by utilizing Lagrangian duality and approximate the worst-case distributions (or obtain a worst-case distribution, if it exists) explicitly via the first-order optimality conditions of the dual reformulation. Using this, they show that data-driven DRO problems can be approximated by robust optimization problems.
In addition to the papers by Blanchet and Murthy [49], Mohajerin Esfahani and Kuhn [205], Gao and Kleywegt [110], there is other research on DRO problems formed via the optimal transport discrepancy that, although under more restricted assumptions, moves the frontier of research in this area. In the following review, we mention the properties of the transportation cost $c(\cdot, \cdot)$ in the definition of the optimal transport discrepancy, the function $g(x, \tilde{\xi})$ or $h(x, \tilde{\xi})$, and the nominal distribution $P_0$ and its underlying space as studied in these papers. Zhao and Guan [343] study a data-driven distributionally robust two-stage stochastic linear program over a Wasserstein ambiguity set, with the 1-Wasserstein metric utilizing the $\ell_1$-norm. By developing a strong duality result, they reformulate the problem as a semi-infinite linear two-stage robust optimization problem. In addition, under mild conditions, they derive a closed-form expression of the worst-case distribution whose parameters can be obtained by solving a traditional two-stage robust optimization model. They also show the convergence of the problem to the corresponding stochastic program under the true unknown probability distribution as the number of data points increases.
Hanasusanto and Kuhn [132] derive conic programming reformulation to distributionally robust two-stage stochastic linear programs formed via p-Wasserstein metric utilizing an arbitrary norm. In particular, by relying on the strong duality result from Mohajerin Esfahani and Kuhn [205] and Gao and Kleywegt [110], they show that when the ambiguity set is formed via the 2-Wasserstein metric around a discrete distribution, the resulting model is equivalent to a copositive program of polynomial size (if the problem has complete recourse) or it can be approximated by a sequence of copositive programs of polynomial size (if for any fixed x and ξ, the dual of the second-stage problem is feasible). Moreover, by using nested hierarchies of semidefinite approximations of the (intractable) copositive cones from the inside, they obtain sequences of tractable conservative approximations to the problem. They also show that the two-stage distributionally robust stochastic linear program with nonrandom cost function in the second stage, where the ambiguity set is formed via the 1-Wasserstein metric around a discrete distribution is equivalent to a linear program. They further extend their result to a case where optimized certainty equivalent (OCE) [22,23] is used as a risk measure. As applications, they demonstrate their results for the least absolute deviations regression and multitask learning problems.
For random variables supported on a compact set and a bounded continuous function h(x, ·), Luo and Mehrotra [200] study (1.5) formed via the 1-Wasserstein metric utilizing an arbitrary norm, around the empirical distribution of data. They present an equivalent SIP reformulation of the problem by reformulating the inner problem as a conic linear program. In order to solve the resulting SIP, they propose a finitely convergent exchange method when the cost function h is a general nonlinear function in x, and a central cutting-surface method with a linear rate of convergence when the cost function h(·, ξ) is convex in x and X is convex. They investigate a logistic regression model to exemplify their algorithmic ideas, and the benefits of using 1-Wasserstein metric.
Pflug and Pichler [231] study a DRO approach to single- and two-stage stochastic programs formed via the p-Wasserstein metric utilizing an arbitrary norm. They assume that all probability distributions in the ambiguity set are supported on discrete, fixed atoms, while only the probabilities of atoms are changing in the ambiguity set. Hence, the ambiguity set can be represented as a subset of a finite-dimensional space. To solve the resulting problem, they apply the exchange method, proposed in Pflug and Wozabal [229]. Mehrotra and Zhang [203] study a distributionally robust ordinary least squares problem, where the ambiguity set of probability distributions is formed via the 1-Wasserstein metric utilizing the $\ell_1$-norm. Similar to Pflug and Pichler [231], they restrict the ambiguity set of distributions to all discrete distributions and show that the resulting problem can be solved by using an equivalent SOCP reformulation.
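To illustrate why fixing the atoms makes the ambiguity set finite-dimensional, the following minimal sketch computes a worst-case expectation over a 1-Wasserstein ball when only the probabilities of the atoms may change. It is not the exchange method of Pflug and Wozabal [229]; it simply writes the inner problem as a linear program over transport plans. The function name `worst_case_expectation_fixed_atoms`, the $\ell_1$ ground cost, and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def worst_case_expectation_fixed_atoms(atoms, p0, h, eps):
    """Maximize E_P[h] over distributions P on the given atoms with W_1(P, P0) <= eps.

    Hypothetical helper for illustration only. The decision variable is a
    transport plan pi[i, j] >= 0 that moves nominal mass from atom i to atom j.
    """
    p0 = np.asarray(p0, dtype=float)
    h = np.asarray(h, dtype=float)
    n = p0.size
    atoms = np.asarray(atoms, dtype=float).reshape(n, -1)
    # cost[i, j] = l1 distance between atom i and atom j (assumed ground cost)
    cost = np.abs(atoms[:, None, :] - atoms[None, :, :]).sum(axis=2)
    # linprog minimizes, so negate: maximize sum_ij pi[i, j] * h[j]
    c = -np.tile(h, n)
    # each row of the plan must sum to the nominal probability of atom i
    A_eq = np.kron(np.eye(n), np.ones((1, n)))
    # total transportation cost is bounded by the radius eps
    A_ub = cost.reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=[eps], A_eq=A_eq, b_eq=p0,
                  bounds=(0, None), method="highs")
    plan = res.x.reshape(n, n)
    # the column sums of the plan give the worst-case probabilities
    return -res.fun, plan.sum(axis=0)


atoms = [0.0, 1.0, 2.0, 3.0]
p0 = [0.25, 0.25, 0.25, 0.25]
h = [0.0, 1.0, 4.0, 9.0]  # cost at each atom, here h = atom**2
value, q = worst_case_expectation_fixed_atoms(atoms, p0, h, eps=0.5)
print(value, q)
```

With a radius of zero the sketch returns the nominal expectation, and for a sufficiently large radius all mass moves to the atom with the highest cost.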
Unlike Pflug and Pichler [231] and Mehrotra and Zhang [203], which only allow varying the probabilities on atoms identical to those of the nominal distribution, the ambiguity set is allowed to contain an infinite-dimensional distribution in Wozabal [319]. Wozabal [319] studies a DRO approach to single-stage stochastic programs, where the distributional ambiguity in the constraints and objective function is modeled via the 1-Wasserstein metric utilizing the $\ell_1$-norm around the empirical distribution. Because such a model has a higher complexity than those considered in Pflug and Pichler [231] and Mehrotra and Zhang [203], they propose to reformulate the problem into an equivalent finite-dimensional, nonconvex saddle-point optimization problem, under appropriate conditions. The key ideas in Wozabal [319] to obtain such a reformulation are that (i) at any level of precision and in the sense of Kantorovich distance, every distribution in the ambiguity set can be approximated via a probability distribution supported on a uniform number of atoms, and (ii) considering only the extremal distributions in the ambiguity set suffices to obtain the equivalent reformulation. Furthermore, for a portfolio selection problem complemented via a broad class of convex risk measures appearing in the constraints, they obtain an equivalent finite-dimensional, nonconvex, semidefinite saddle-point optimization problem. They propose to solve such a reformulated problem via the exchange method, proposed in Pflug and Wozabal [229].
Pichler and Xu [236] study a DRO model with a distortion risk measure and form the ambiguity set of distributions via p-Wasserstein metric utilizing an arbitrary norm. They quantitatively investigate the effect of the variation of the ambiguity set on the optimal value and the optimal solution in the resulting optimization problem, as the number of data points increases. They illustrate their results in the context of a two-stage stochastic program with recourse.
A class of data-driven distributionally robust fractional optimization problems, representing a reward-risk ratio, is studied in Ji and Lejeune [164]. In their model, $R^1_P : Z \to \mathbb{R}$ is a reward measure and $R^2_P : Z \to \mathbb{R}_+$ is a nonnegative risk measure. Assuming that the underlying distribution is discrete, Ji and Lejeune [164] model the ambiguity about discrete distributions using the 1-Wasserstein metric utilizing the $\ell_1$-norm, around the empirical distribution. They provide a nonconvex reformulation for the resulting model and propose a bisection algorithm to obtain the optimal value by solving a sequence of convex programming problems. As in Postek et al. [240], the reformulation is obtained through investigating the support function of the ambiguity set and the convex conjugate of the ratio function. They further apply their results to portfolio optimization problems for the Sharpe ratio [296] and Omega ratio [170].
Motivated by the drawback of moment-based DRO problems, Gao and Kleywegt [111] study DRO formed via various ambiguity sets of probability distributions that incorporate the dependence structure between the uncertain parameters. In the case that there exists a linear dependence structure, they consider probability distributions around a nominal distribution, in the sense of p-Wasserstein metric utilizing an arbitrary norm, satisfying a second-order moment constraint. They also study cases with different rank dependencies between the uncertain parameters. They obtain tractable reformulations of these models and apply their results to a portfolio optimization problem. Along the same lines as Gao and Kleywegt [111], Pflug and Pohl [232] study a DRO approach to portfolio optimization via the 1-Wasserstein metric utilizing an arbitrary norm. They address the case where the dependence structure between the assets is uncertain while the marginal distributions of the assets are known.
Noyan et al. [221] study a DRO model with a decision-dependent ambiguity set, where the ambiguity set is formed via the p-Wasserstein metric utilizing the $\ell_p$-norm. They consider two types of ambiguity sets: (1) a continuous ambiguity set, where there is ambiguity in both the probability distribution of ξ and its realizations, and (2) a discrete ambiguity set, where there is only ambiguity in the probability distribution of ξ, while the realizations are fixed. They apply their results to problems in machine scheduling and humanitarian logistics. Rujeerapaiboon et al. [268] study continuous and discrete scenario reduction [97,139,140,141,6], where the p-Wasserstein metric utilizing the $\ell_p$-norm is used as a measure of discrepancy between distributions.

Discrete Problems.
We now review DRO models over Wasserstein ambiguity sets, with discrete decisions. Bansal et al. [9] study a distributionally robust integer program with pure binary first-stage and mixed-binary second-stage variables on a finite set of scenarios. They propose a decomposition-based L-shaped algorithm and a cutting surface algorithm to solve the resulting model. They investigate the conditions and the ambiguity sets under which the proposed algorithm is finitely convergent. They show that the ambiguity set of distributions formed via the 1-Wasserstein metric utilizing an arbitrary norm satisfies these conditions. Xu and Burer [327] study a mixed 0-1 linear program, where the coefficients of the objective function are affinely dependent on the random vector ξ. They seek a bound on the worst-case expected optimal value of this problem, where the worst case is taken with respect to an ambiguity set of discrete distributions formed via the 2-Wasserstein metric utilizing the $\ell_2$-norm around the empirical distribution of data. Under mild assumptions, they reformulate the problem into a copositive program, which leads to a tractable semidefinite-based approximation.

Chance Constraints.
In this section, we review distributionally robust chance-constrained programs over Wasserstein ambiguity sets, see, e.g., Jiang and Guan [166], Chen et al. [76], Xie [322], Yang [333]. Ji and Lejeune [165] study a distributionally robust individual chance constraint, where the ambiguity set of distributions is formed via the 1-Wasserstein metric utilizing the $\ell_1$-norm, and g(x,ξ) in (1.6) is defined as the indicator function of the safe region A(x). For the case that the underlying distribution is supported on the same atoms as those of the empirical distribution, they provide mixed-integer LP reformulations for the linear random right-hand side case, i.e., $g(x,\xi) := \mathbb{1}_{[a^\top x \le \xi]}(\xi)$, and the linear random technology matrix case, i.e., $g(x,\xi) := \mathbb{1}_{[\xi^\top x \le b]}(\xi)$, and provide techniques to strengthen the formulations. For the case that the underlying distribution is infinitely supported, they propose an exact mixed-integer SOCP reformulation for models with a random right-hand side, while a relaxation is proposed for constraints with a random technology matrix. They show that this mixed-integer SOCP relaxation is exact when the decision variables are binary or bounded general integer.
Chen et al. [76] study data-driven distributionally robust chance-constrained programs, where the ambiguity set of distributions is formed via the p-Wasserstein metric utilizing an arbitrary norm. For individual linear chance constraints with affine dependency on the uncertainty, and for joint chance constraints with right-hand side affine uncertainty, they provide an exact deterministic reformulation as a mixed-integer conic program. When the $\ell_1$-norm or the $\ell_\infty$-norm is used as the transportation cost in the definition of the Wasserstein metric, the chance-constrained program can be reformulated as a mixed-integer LP. They leverage the structural insights into the worst-case distributions, and show that both the CVaR and the Bonferroni approximation may give solutions that are inferior to the optimal solution of their proposed reformulation.

Statistical Learning.
DRO problems formed via the optimal transport discrepancy have been widely studied in the context of statistical learning. We already mentioned Mehrotra and Zhang [203] as an example in this area. Below, we review the latest developments of DRO in the context of statistical learning. A data-driven distributionally robust maximum likelihood estimation model to infer the inverse of the covariance matrix of a normal random vector is proposed in Nguyen et al. [215]. They form the ambiguity set of distributions with all normal distributions close enough to a nominal distribution characterized by the sample mean and sample covariance matrix, in the sense of the 2-Wasserstein metric utilizing the $\ell_1$-norm. By leveraging an analytical formula for the Wasserstein distance between two normal distributions, they obtain an equivalent SDP reformulation of the problem. When there is no prior sparsity information on the inverse covariance matrix, they propose a closed-form expression for the estimator that can be interpreted as a nonlinear shrinkage estimator. Otherwise, they propose a sequential quadratic approximation algorithm to obtain the estimator by solving the equivalent SDP. They apply their results to linear discriminant analysis, portfolio selection, and solar irradiation patterns inference problems.
Lee and Mehrotra [183] study a distributionally robust framework for finding support vector machines via the 1-Wasserstein metric. They provide an SIP formulation of the resulting model and propose a cutting-plane algorithm to solve the problem. Lee and Raginsky [184,185] study a distributionally robust statistical learning problem formed via the p-Wasserstein metric utilizing the $\ell_p$-norm, motivated by a domain (i.e., measure) adaptation problem. This problem arises when training data are generated according to an unknown source domain P, but the learned hypothesis is evaluated on another unknown but related target domain Q. In this problem, it is assumed that a set of labeled data (covariates and responses) is drawn from P and a set of unlabeled covariates is drawn from Q. It is further assumed that the domain drift is due to an unknown deterministic transformation on the covariate space that preserves the distribution of the response conditioned on the covariates. Under these assumptions and some further regularity conditions, they prove a generalization bound and generalization error guarantees for the problem.
Gao et al. [113] develop a novel distributionally robust framework for hypothesis testing, where the ambiguity set of distributions is constructed by the 1-Wasserstein metric utilizing an arbitrary norm, around the empirical distribution. The goal is to obtain the optimal decision rule as well as the least favorable distribution by minimizing the maximum of the worst-case type-I and type-II errors. They develop a convex safe approximation of the resulting problem and show that such an approximation renders a nearly-optimal decision rule among the family of all possible tests. By exploiting the structure of the least favorable distribution, they also develop a finite-dimensional convex programming reformulation of the safe approximation.
We now turn our attention to the connection between DRO and regularization in statistical learning. Pflug et al. [233], Pichler [235], Wozabal [320] draw the connection between robustification and regularization, where, as in Theorem 3.11, (i) the shape of the transportation cost in the definition of the optimal transport discrepancy directly implies the type of regularization, and (ii) the size of the ambiguity set dictates the regularization parameter. Pichler [235] studies worst-case values of lower semicontinuous and law-invariant risk measures, including spectral and distortion risk measures, over an ambiguity set of distributions formed via the p-Wasserstein metric utilizing an arbitrary norm around the empirical distribution. They show that when the function h(x,ξ) is linear in ξ, the worst-case value is the sum of the risk of h(x,ξ) under the nominal distribution and a regularization term. Pflug et al. [233] and Wozabal [320] show that the worst-case value of a convex law-invariant risk measure over an ambiguity set of distributions, formed via the p-Wasserstein metric utilizing the $\ell_p$-norm around the empirical distribution, reduces to the sum of the nominal risk and a regularization term whenever the function h(x,ξ) is affine in ξ. They provide closed-form expressions for risk measures such as expectation, sum of expectation and standard deviation, CVaR, distortion risk measure, Wang transform, proportional hazards transform, the Gini measure, and sum of expectation and mean absolute deviation from the median. They apply their results to a portfolio selection problem. Important parts of the derivation of results in Pflug et al. [233], Pichler [235], Wozabal [320] are Kusuoka's representation of risk measures [173,287] and the Fenchel-Moreau theorem [262,270].
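The simplest instance of this robustification-regularization link can be checked numerically. The sketch below takes the expectation as the risk measure and an affine cost $a^\top \xi + b$, and assumes that the support is all of $\mathbb{R}^d$ and that the transportation cost is an $\ell_q$-norm. Under these assumptions the worst-case expectation over a Wasserstein ball of radius ε equals the nominal expectation plus ε times the dual norm of $a$. The function name `worst_case_affine` and the sample data are hypothetical.

```python
import numpy as np


def dual_exponent(q):
    """Return q* with 1/q + 1/q* = 1 (q = 1 and q = inf are handled separately)."""
    if q == 1:
        return np.inf
    if np.isinf(q):
        return 1.0
    return q / (q - 1.0)


def worst_case_affine(samples, a, b, eps, q=2.0):
    """Nominal mean of a @ xi + b plus eps times the dual norm of a.

    Assumes unbounded support and an l_q-norm transportation cost.
    """
    samples = np.asarray(samples, dtype=float)
    a = np.asarray(a, dtype=float)
    nominal = float(np.mean(samples @ a + b))
    regularizer = eps * np.linalg.norm(a, ord=dual_exponent(q))
    return nominal + regularizer


rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 3))
a = np.array([3.0, -4.0, 0.0])
b, eps = 1.0, 0.2

bound = worst_case_affine(samples, a, b, eps, q=2.0)

# Shifting every sample by eps along a / ||a||_2 moves each atom a distance eps,
# so the shifted empirical distribution lies in the ball and attains the bound.
shifted = samples + eps * a / np.linalg.norm(a)
attained = float(np.mean(shifted @ a + b))
print(bound, attained)
```

The shifted empirical distribution is only one of possibly many worst-case distributions; the check shows that the regularized value is attained, not that the maximizer is unique.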
In the context of statistical learning, the connection between DRO and regularization was first made in Shafieezadeh-Abadeh et al. [273], to the best of our knowledge. In fact, they study a distributionally robust logistic regression, where an ambiguity set of probability distributions, supported on an open set, is formed around the empirical distribution of data and via the 1-Wasserstein metric utilizing an arbitrary norm. They show the resulting problem admits an equivalent reformulation as a tractable convex program. As stated in Theorem 3.11, this problem can be interpreted as a standard regularized logistic regression, where the size of the ambiguity set dictates the regularization parameter. They further propose a distributionally robust approach based on the Wasserstein metric to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier, based on the optimal values of two linear programs.
Shafieezadeh-Abadeh et al. [274] extend the work of Shafieezadeh-Abadeh et al. [273] and study distributionally robust supervised learning (regression and classification) models. They introduce a new generalization technique using ideas from DRO, whose ambiguity set contains all infinite-dimensional distributions in the Wasserstein neighborhood of the empirical distribution. They show that the classical robust and the distributionally robust learning models are equivalent if the data satisfies a dispersion condition (for regression) or a separability condition (for classification). By imposing a bound on the decision (i.e., hypothesis) space, they improve the upper confidence bound on the out-of-sample performance proposed in Mohajerin Esfahani and Kuhn [205] and prove a generalization bound that does not rely on the complexity of the hypothesis space. This is unlike the traditional generalization bounds that are derived by controlling the complexity of the hypothesis space, in terms of Vapnik-Chervonenkis (VC)-dimension, covering numbers, or Rademacher complexities [12,276], which are usually difficult to calculate and interpret in practice. They extend their results to the case that the unknown hypothesis is searched from the space of nonlinear functionals. Given a symmetric and positive definite kernel function, such a setting gives rise to a lifted DRO problem that searches for a linear hypothesis over a reproducing kernel Hilbert space (RKHS).
Gao et al. [112] study DRO problems formed via the p-Wasserstein metric utilizing an arbitrary norm, around the empirical distribution. They identify a broad class of cost functions for which such a DRO is asymptotically equivalent to a regularization problem with a gradient-norm penalty under the nominal distribution. For a linear function class, this equivalence is exact and results in a new interpretation for discrete choice models, including multinomial logit, nested logit, and generalized extreme value choice models. They also obtain lower and upper bounds on the worst-case expected cost in terms of regularization.
Mohajerin Esfahani et al. [206] study a data-driven inverse optimization problem to learn the objective function of the decision maker, given the historical data on uncertain parameters and decisions. In an environment with imperfect information, they propose a DRO model formed via the p-Wasserstein metric utilizing an arbitrary norm to minimize the worst-case risk of the prediction error. Such a model can be interpreted as a regularization of the corresponding empirical risk minimization problem. They present exact (or safe approximation) tractable convex programming reformulations for different combinations of risk measures and error functions.
Blanchet and Kang [48] study group-square-root LASSO (group LASSO focuses on variable selection in settings where some predictive variables, if selected, must be chosen as a group). They model this problem as a DRO problem formed via the p-Wasserstein metric utilizing an arbitrary norm. A method for (semi-)supervised learning based on data-driven DRO via the p-Wasserstein metric utilizing an arbitrary norm is proposed in Blanchet and Kang [47]. This method improves the generalization error by using the unlabeled data to restrict the support of the worst-case distribution in the resulting DRO. They select the level of robustness using cross-validation, and they discuss the nonparametric behavior of an optimal selection of the level of robustness.
Chen and Paschalidis [70] study a DRO approach to linear regression using an $\ell_1$-norm cost function, where the ambiguity set of distributions is formed via the p-Wasserstein metric utilizing an arbitrary norm. They show that this DRO formulation can be relaxed to a convex optimization problem. By selecting proper norm spaces for the Wasserstein metric, they are able to recover several commonly used regularized regression models. They establish performance guarantees on both the out-of-sample behavior (prediction bias) and the discrepancy between the estimated and true regression planes (estimation bias), which elucidate the role of the regularizer. They study the application of the proposed model to outlier detection, arising in an abnormally high radiation exposure in CT exams, and show that it achieves a higher performance than M-estimation [161].

Choice of the Transportation Cost.
When forming a Wasserstein ambiguity set, the transportation cost function c(·, ·) should be chosen in addition to the nominal probability measure $P_0$ and the size of the ambiguity set ε. Blanchet et al. [52] propose a comprehensive approach for designing the ambiguity set in a data-driven way, using the role of the transportation cost c(·, ·) in the definition of the p-Wasserstein metric. They apply various metric-learning procedures to estimate c(·, ·) from the training data, where they associate a relatively high transportation cost to two locations if transporting mass between these locations substantially impacts performance. This mechanism induces enhanced out-of-sample performance by focusing on regions of relevance, while improving the generalization error. Moreover, this approach connects the metric-learning procedure to estimating the parameters of adaptive regularized estimators. They select the level of robustness using cross-validation. Blanchet et al. [51] propose a data-driven robust optimization approach to optimally inform the transportation cost in the definition of the p-Wasserstein metric. This additional layer of robustification within a suitable parametric family of transportation costs does not exist in the metric-learning approach proposed in Blanchet et al. [52], and it makes it possible to enhance the generalization properties of regularized estimators while reducing the variability in the out-of-sample performance error.

Multistage Setting.
The single- and two-stage stochastic programs in Pflug and Pichler [231] are extended in Analui and Pflug [3] and Pflug and Pichler [231] to the multistage case, where the reference data and information structure are represented as a tree. In these papers it is assumed that the tree structure and scenario values are fixed, while the probabilities are changing only in an ambiguous neighborhood of the reference model by utilizing the multistage nested distance, formed via the Wasserstein metric. Both papers further apply their results to a multiperiod production/inventory control problem. Building upon the above results, Glanzer et al. [118] show that a scenario tree can be constructed out of data such that it converges (in terms of the nested distance) to the true model in probability at an exponential rate. Glanzer et al. [118] also study a DRO framework formed via the nested distance that allows for setting up bid and ask prices for acceptability pricing of contingent claims. Another study of multistage linear optimization can be found in Bazier-Mattea and Delage [17].
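Another popular way to model the distributional ambiguity is to use φ-divergences, a class of measures used in information theory. A φ-divergence measures the discrepancy between two probability measures $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$ as $d^{\phi}(P_1, P_2) := \int_{\Xi} \phi\left(\frac{dP_1}{dP_2}\right) dP_2$, where the φ-divergence function $\phi : \mathbb{R}_+ \to \mathbb{R}_+ \cup \{+\infty\}$ is convex and satisfies $\phi(1) = 0$, $0\,\phi(0/0) := 0$, and $a\,\phi(a/0) := a \lim_{t \to \infty} \phi(t)/t$ if $a > 0$. Note that a φ-divergence does not necessarily induce a metric on the underlying space. For detailed information on φ-divergences, we refer to Read and Cressie [254], Vajda [306], Pardo [226].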
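A φ-divergence can be used to model the distributional ambiguity as follows: $\mathcal{P}^{\phi}(P_0; \epsilon) := \{P \in \mathcal{M}(\Xi, \mathcal{F}) : d^{\phi}(P, P_0) \le \epsilon\}$ (5.9), where $d^{\phi}(P, P_0)$ denotes the φ-divergence between $P$ and $P_0$ and, as before, $P_0$ is a nominal probability measure and ε controls the size of the ambiguity set. Table 1 presents a list of commonly used φ-divergence functions in DRO and their conjugate functions $\phi^*$. Before we review the papers that model the distributional ambiguity via φ-divergences, we present a duality result on $\sup_{P \in \mathcal{P}^{\phi}(P_0; \epsilon)} \mathbb{E}_P[h(x,\xi)]$. Theorem 5.3. Suppose that $\epsilon > 0$ in (5.9). Then, for a fixed x ∈ X, we have $\sup_{P \in \mathcal{P}^{\phi}(P_0; \epsilon)} \mathbb{E}_P[h(x,\xi)] = \inf_{\lambda \ge 0,\ \mu \in \mathbb{R}} \left\{ \mu + \lambda \epsilon + \lambda\, \mathbb{E}_{P_0}\left[ \phi^*\left( \frac{h(x,\xi) - \mu}{\lambda} \right) \right] \right\}$. The above result can be obtained by taking the Lagrangian dual of $\sup_{P \in \mathcal{P}^{\phi}(P_0; \epsilon)} \mathbb{E}_P[h(x,\xi)]$, and we refer the readers to Ben-Tal et al. [28], Bayraksan and Love [13], Love and Bayraksan [196] for a detailed derivation.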
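As a concrete instance of such a Lagrangian dual, consider the Kullback-Leibler divergence on a finite sample space. In that case the dual of the worst-case expectation is commonly written as the one-dimensional problem $\inf_{\lambda > 0} \{\lambda \epsilon + \lambda \log \mathbb{E}_{P_0}[\exp(h/\lambda)]\}$, and a worst-case distribution is an exponential tilting of $P_0$. The sketch below evaluates this dual numerically; the function name `kl_worst_case`, the search bracket for λ, and the toy data are assumptions, and the nominal probabilities are assumed to be strictly positive.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp


def kl_worst_case(h, p0, eps):
    """Worst-case E_P[h] over {P : KL(P || P0) <= eps} on a finite sample space.

    Minimizes the dual lam * eps + lam * log E_P0[exp(h / lam)] over lam > 0.
    Assumes p0 > 0 componentwise and that the minimizer lies in (1e-6, 1e3).
    """
    h = np.asarray(h, dtype=float)
    p0 = np.asarray(p0, dtype=float)

    def dual(lam):
        # logsumexp with weights b=p0 evaluates log(sum_i p0_i * exp(h_i / lam)) stably
        return lam * eps + lam * logsumexp(h / lam, b=p0)

    res = minimize_scalar(dual, bounds=(1e-6, 1e3), method="bounded",
                          options={"xatol": 1e-10})
    lam = res.x
    # exponentially tilted nominal distribution, normalized in log space
    log_w = np.log(p0) + h / lam
    q = np.exp(log_w - logsumexp(log_w))
    return res.fun, lam, q


h = np.array([1.0, 2.0, 3.0, 4.0])
p0 = np.full(4, 0.25)
eps = 0.1

value, lam, q = kl_worst_case(h, p0, eps)
kl = float(np.sum(q * np.log(q / p0)))
print(value, lam, q, kl)
```

At the dual minimizer the tilted distribution uses the whole divergence budget, and its expected cost matches the dual value, which gives a simple numerical check of strong duality for this example.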
The robust counterpart of linear and nonlinear optimization problems with an uncertainty set of parameters defined via a general φ-divergence is studied in Ben-Tal et al. [28]. As presented in Table 1, when the uncertain parameter is a finite-dimensional probability vector, the robust counterpart is tractable for most of the choices of φ-divergence function considered in the literature. The use of φ-divergences to model the distributional ambiguity in DRO is systematically introduced in Bayraksan and Love [13] and Love and Bayraksan [197]. To elucidate the use of φ-divergences for models with different sources of data and decision makers with different risk preferences, they present a classification of φ-divergences based on the notions of suppressing and popping a scenario. The situation in which a scenario with a positive nominal probability ends up having a zero worst-case probability is called suppressing. On the contrary, the situation in which a scenario with a zero nominal probability ends up having a positive worst-case probability is called popping. These notions give rise to four categories of φ-divergences. For example, they show that the variation distance can both suppress and pop scenarios, while the Kullback-Leibler divergence can only suppress scenarios. Furthermore, they analyze the value of data and propose a decomposition algorithm to solve the dual of the resulting DRO model formed via a general φ-divergence.
Motivated by the difficulty in choosing the ambiguity set and the fact that all probability distributions in the set are treated equally (while those outside the set are completely ignored), Ben-Tal et al. [27] propose to minimize the expected cost under the nominal distribution while the maximum expected cost over an infinite nested family of ambiguity sets, parametrized by ε, is bounded from above. More specifically, they allow a varying level of feasibility for each family of probability distributions, where the maximum allowed expected cost for distributions in a set with parameter ε is proportional to ε. They refer to this approach as soft robust optimization and relate the feasibility region induced by this approach to the convex risk measures. They illustrate that the ambiguity sets formed via φ-divergences are related to an optimized certainty equivalent risk measure formed via φ-functions [23]. Furthermore, they show that the complexity of the soft robust approach is equivalent to that of solving a small number of standard corresponding DRO (i.e., DRO with one ambiguity set) problems. In fact, by showing that standard DRO is concave in ε, they solve the soft robust model by a bisection method. They also investigate how much larger a feasible region implied by the soft robust approach can be compared to that of the standard DRO, without compromising the objective value. Furthermore, they study the downside probability guarantees implied by both the soft robust and standard robust approaches. They also apply their results to portfolio optimization and asset allocation problems.
A data-driven DRO approach to chance-constrained problems modeled via φ-divergences is studied in Yanıkoglu and den Hertog [337]. They propose safe approximations to these ambiguous chance constraints. Their approach is capable of handling joint chance constraints, dependent uncertain parameters, and a general nonlinear function g(x,ξ).
Hu et al. [157] and Jiang and Guan [166] show that distributionally robust chance-constrained programs formed via φ-divergences can be transformed into a chance-constrained problem under the nominal distribution but with an adjusted risk level. For a general φ-divergence, a bisection line search algorithm to obtain the perturbed risk level is proposed in Hu et al. [157], Jiang and Guan [166]. In addition, closed-form expressions for the adjusted risk level are obtained for the case of the variation distance (see Hu et al. [157] and Jiang and Guan [166]), and the Kullback-Leibler divergence and $\chi^2$-distance (see Jiang and Guan [166]). For the ambiguous probabilistic programs formed via φ-divergences, similar results to the chance-constrained programs are shown in Hu et al. [157]. Hu et al. [157] show that the ambiguous probability minimization problem can be transformed into a corresponding problem under the nominal distribution. In particular, they show that these problems have the same complexity as the corresponding pure probabilistic programs.

5.1.2.1. Statistical Learning.
Hu et al. [155] study distributionally robust supervised learning, where the ambiguity set of distributions is formed via φ-divergences. They prove that such a DRO model for a classification problem gives a classifier that is optimal for the training set distribution rather than being robust against all distributions in the ambiguity set. They argue that such pessimism comes from two sources: the particular losses used in classification and the over-conservatism of the ambiguity set formed via φ-divergences. Motivated by this observation, they propose an ambiguity set that incorporates prior expert structural information on the distribution. More precisely, they introduce a latent variable from a prior distribution. While such a distribution can change in the ambiguity set, they leave the ambiguous joint distribution of data conditioned on the latent variable intact. Duchi et al. [92] show that the inner problem of a data-driven DRO formed around the empirical distribution, with $\epsilon = \chi^2_{1,1-\alpha}/N$, has an almost-sure asymptotic expansion. Such an expansion is equivalent to the expected cost under the empirical distribution plus a regularization term that accounts for the standard deviation of the objective function. They also show that the set of the optimal solutions of the DRO model converges to that of the stochastic program under the true underlying distribution, provided that h(x,ξ) is lower-semicontinuous.
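The standard-deviation interpretation can be illustrated on a finite sample under one particular normalization, which is an assumption here and may differ from the one used in Table 1: the divergence function $\phi(t) = \frac{1}{2}(t-1)^2$ and radius $\rho/N$. As long as the nonnegativity constraints on the probabilities are not binding, the inner maximization then has the closed form "empirical mean plus $\sqrt{2\rho\, s_N^2 / N}$", where $s_N^2$ is the empirical variance of the costs. The function name `chi2_worst_case` and the simulated costs below are hypothetical.

```python
import numpy as np


def chi2_worst_case(z, rho):
    """Worst-case mean of costs z over {p : (N/2) * sum((p - 1/N)**2) <= rho / N}.

    Assumes phi(t) = 0.5 * (t - 1)**2 and that the worst-case weights stay
    nonnegative; raises if that assumption fails.
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    centered = z - z.mean()
    norm = np.linalg.norm(centered)
    if norm == 0.0:
        # constant costs: every distribution gives the same expectation
        return float(z.mean()), np.full(n, 1.0 / n)
    # the ball is ||p - 1/N||_2 <= sqrt(2 * rho) / N; move along the centered costs
    radius = np.sqrt(2.0 * rho) / n
    p = np.full(n, 1.0 / n) + radius * centered / norm
    if p.min() < 0.0:
        raise ValueError("nonnegativity binds; the closed form does not apply")
    return float(p @ z), p


rng = np.random.default_rng(1)
z = rng.normal(loc=1.0, scale=1.0, size=500)  # simulated costs h(x, xi_i)
rho = 3.84  # roughly the 0.95 quantile of a chi-square with 1 degree of freedom

value, p = chi2_worst_case(z, rho)
expansion = z.mean() + np.sqrt(2.0 * rho * z.var() / z.size)
print(value, expansion)
```

In this finite-sample setting the two numbers agree exactly; the asymptotic statement in Duchi et al. [92] concerns what happens more generally as the sample size grows.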

Specific φ-Divergences.
In this section, we review papers that consider specific φ-divergences.
Kullback-Leibler Divergence. Calafiore [59] investigates the optimal robust portfolio and worst-case distribution for a data-driven distributionally robust portfolio optimization problem with a mean-risk objective. Motivated by the application, they consider the variance and absolute deviation as measures of risk.
Hu and Hong [156] study a variety of distributionally robust optimization problems, where the ambiguity is in either the objective function or constraints. They show that the ambiguous chance-constrained problem can be reformulated as a chance-constrained problem under the nominal distribution but with an adjusted risk level. They further show that when the chance safe region is bi-affine in x and ξ$^{16}$, and the nominal distribution belongs to the exponential families of distributions, both the nominal and worst-case distribution belong to the same distribution family.
Blanchet et al. [53] study a DRO approach to extreme value analysis in order to estimate the tail distributions and, consequently, extreme quantiles. They form the ambiguity set of distributions by the class of Rényi divergences [226], which includes Kullback-Leibler as a special case$^{17}$. Kullback-Leibler is also used for the DRO approach to hypothesis testing in Levy [186], Gül and Zoubir [128], Gül [127].
16 Recall the discussion following (1.1) and (1.2), where we gave a characterization of A(x) as $a(x)^\top \xi \le b(x)$ and $a(\xi)^\top x \le b(\xi)$. A safe region characterized by a bi-affine expression in ξ and x means that both a(x) and b(x) are affine in x for the form $a(x)^\top \xi \le b(x)$, and both a(ξ) and b(ξ) are affine in ξ for the form $a(\xi)^\top x \le b(\xi)$.
17 The class of Rényi divergences is defined as $d^{R}_{r}(P_1, P_2) := \frac{1}{r-1} \log \int_{\Xi} \left(\frac{dP_1}{dP_2}\right)^{r-1} dP_1$. This class is not a φ-divergence, but $d^{R}_{r}(P_1, P_2)$ can be rewritten as $h(D_{\phi}(P_1, P_2))$, where $h(t) = \frac{1}{r-1} \log[(r-1)t + 1]$, for a suitable φ-divergence function φ.
Burg Entropy. Wang et al. [316] model the distributional ambiguity via the Burg entropy to consider all probability distributions that make the observed data achieve a certain level of likelihood. They present statistical analyses of their model using Bayesian statistics and empirical likelihood theory. To test the performance of the model, they apply it to the newsvendor problem and the portfolio selection problem.
Wiesemann et al. [317] study Markov decision processes where the transition kernel is unknown. They use the Burg entropy to construct a confidence region that contains the unknown probability distribution with a high probability, based on an observation history. It is shown in Lam [176] that a DRO model formed via the Burg entropy around the empirical distribution of data gives rise to a confidence bound on the expected cost that recovers the exact asymptotic statistical guarantees provided by the Central Limit Theorem.
$\chi^2$-Distance. Hanasusanto and Kuhn [137] propose a robust data-driven dynamic programming approach which replaces the expectations in the dynamic programming recursions with worst-case expectations over an ambiguity set of distributions. Their motivation to propose such a scheme is to mitigate the poor out-of-sample performance of the data-driven dynamic programming approach under sparse training data. The proposed method combines convex parametric function approximation methods (to model the dependence on the endogenous state) with a nonparametric kernel regression method (to model the dependence on the exogenous state). They show the conditions under which the resulting DRO model, formed via the $\chi^2$-distance, reduces to a tractable conic program. They apply their results to problems arising in index tracking and wind energy commitment applications. Klabjan et al. [172] study optimal inventory control for a single-item multiperiod periodic review stochastic lot-sizing problem under uncertain demand, where the distributional ambiguity is modeled via the $\chi^2$-distance. They show that the resulting model generalizes the Bayesian model, and it can be interpreted as minimizing demand-history-dependent risk measures.
Modified $\chi^2$-Distance. A stochastic dual dynamic programming (SDDP) approach to solve a distributionally robust multistage optimization model formed via the modified $\chi^2$-distance is proposed in Philpott et al. [234].
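As a small illustration of the φ-divergences named above, the following sketch evaluates the Burg entropy, the $\chi^2$-distance, and the modified $\chi^2$-distance between two discrete distributions. It assumes the common convention $d_\phi(P, P_0) = \sum_i p_{0,i}\, \phi(p_i / p_{0,i})$ with $\phi(t) = -\log t + t - 1$, $(t-1)^2/t$, and $(t-1)^2$, respectively; these generators are assumed to agree with Table 1, and the function and variable names are hypothetical.

```python
import math

# Generator functions phi(t); assumed to follow the usual phi-divergence conventions.
PHI = {
    "burg": lambda t: -math.log(t) + t - 1.0,
    "chi2": lambda t: (t - 1.0) ** 2 / t,
    "modified_chi2": lambda t: (t - 1.0) ** 2,
}

def phi_divergence(p, p0, name):
    """d_phi(P, P0) = sum_i p0_i * phi(p_i / p0_i), for strictly positive p and p0."""
    phi = PHI[name]
    return sum(q * phi(pi / q) for pi, q in zip(p, p0))

p0 = [0.25, 0.25, 0.25, 0.25]   # nominal (e.g., empirical) distribution
p = [0.40, 0.30, 0.20, 0.10]    # candidate distribution
divs = {name: phi_divergence(p, p0, name) for name in PHI}
```

An ambiguity set of radius ε would then collect every distribution whose divergence from the nominal one is at most ε.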
Variation Distance. Variation distance, or $\ell_1$-norm, as defined in Table 1, can be used to safely approximate several ambiguity sets formed via φ-divergences, including χ-divergence of order 2, J-divergence, Kullback-Leibler divergence, and Hellinger distance. The following lemma states the above result more formally.

Total Variation Distance. For two probability measures $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$, the total variation distance is defined as $d_{TV}(P_1, P_2) := \sup_{A \in \mathcal{F}} |P_1(A) - P_2(A)|$. When $P_1$ and $P_2$ are absolutely continuous with respect to a measure $\nu \in \mathcal{M}(\Xi, \mathcal{F})$, with Radon-Nikodym derivatives $f_1$ and $f_2$, respectively, then $d_{TV}(P_1, P_2) = \frac{1}{2} \int_\Xi |f_1(s) - f_2(s)| \, \nu(ds)$. Note that the total variation distance can be obtained from other classes of probability metrics: (1) it is a φ-divergence with $\phi(t) = \frac{1}{2}|t - 1|$, (2) it is half of the $\ell_1$-norm, and (3) it is obtained from the optimal transport discrepancy (5.2) with $c(s_1, s_2) = \mathbb{1}_{\{s_1 \neq s_2\}}$. The total variation distance can be used to model the distributional ambiguity as follows: $\mathcal{P}^{TV}(P_0; \epsilon) := \{P \in \mathcal{M}(\Xi, \mathcal{F}) : d_{TV}(P, P_0) \le \epsilon\}$, where, as before, $P_0$ is a nominal probability measure and $\epsilon$ controls the size of the ambiguity set. The total variation distance between $P_1$ and $P_2$ is also related to the one-sided variation distances $\frac{1}{2} \int_\Xi (f_1(s) - f_2(s))_+ \, \nu(ds)$ and $\frac{1}{2} \int_\Xi (f_2(s) - f_1(s))_+ \, \nu(ds)$ [251], which are φ-divergences with $\phi(t) = \frac{1}{2}(t - 1)_+$ and $\phi(t) = \frac{1}{2}(1 - t)_+$, respectively. However, unlike the total variation distance, the one-sided variation distances are not probability metrics.
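To make the two characterizations of the total variation distance concrete, the sketch below computes it for two discrete distributions both as half of the $\ell_1$-distance between the mass functions and as a brute-force supremum of $|P_1(A) - P_2(A)|$ over all events $A$. The function names and the example distributions are hypothetical, and the brute-force version is only meant for very small sample spaces.

```python
from itertools import combinations

def tv_half_l1(p, q):
    """Total variation distance as half of the l1-distance between mass functions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_sup_events(p, q):
    """Total variation distance as sup over all events A of |P(A) - Q(A)| (brute force)."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for event in combinations(range(n), r):
            diff = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, diff)
    return best

p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]
d_l1 = tv_half_l1(p1, p2)
d_sup = tv_sup_events(p1, p2)
```

Both computations agree on this example, as the definitions above imply.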
Before we review the papers that model the distributional ambiguity via the total variation distance, we present a duality result on $\sup_{P \in \mathcal{P}^{TV}(P_0; \epsilon)} E_P[h(x, \tilde{\xi})]$.
where $\nu\text{-}\operatorname{ess\,sup}_{s \in \Xi} h(x, \tilde{\xi}(s)) = \inf\{a \in \mathbb{R} : \nu\{s \in \Xi : h(x, \tilde{\xi}(s)) > a\} = 0\}$.
(Shapiro [289]) Let $\mathcal{P}^{OTV}(P_0; \epsilon)$ denote the ambiguity set formed via either of the one-sided variation distances. Then, for a fixed $x \in X$, $\sup_{P \in \mathcal{P}^{OTV}(P_0; \epsilon/2)} E_P[h(x, \tilde{\xi})]$ can be obtained by the right-hand side of the result in Theorem 5.5.
Footnote 18: As shown, e.g., in Reiss [256] and [116], d φ h (P, P 0 ) ≤ d φ kl (P, P 0 ). However, in Jiang et al. [168, Lemma 1] this relationship has been shown incorrectly as d φ h (P, P 0 ) ≤ d φ kl (P, P 0 ).
Jiang and Guan [167] study distributionally robust two-stage stochastic programs formed via the total variation distance. They discuss how to find the nominal probability distribution and analyze the convergence of the problem to the corresponding stochastic program under the true unknown probability distribution.
Rahimian et al. [251] study distributionally robust convex optimization problems with a finite sample space. They study how the uncertain parameters affect the optimization. In order to do so, they define the notion of "effective" and "ineffective" scenarios. According to their definitions, a subset of scenarios is effective if their removal from the support of the worst-case distribution, by forcing their probabilities to zero in the ambiguity set, changes the optimal value of the DRO problem. They propose easy-to-check conditions to identify the effective and ineffective scenarios for the case that the distributional ambiguity is modeled via the total variation distance. Rahimian et al. [252] extend the work of Rahimian et al. [251] to distributionally robust newsvendor problems with a continuous sample space. They derive a closed-form expression for the optimal solution and identify the maximal effective subsets of demands.
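For a finite sample space, the worst-case expectation over a total variation ball can be computed greedily, which also hints at why only some scenarios matter for the optimal value. The sketch below assumes the half-$\ell_1$ form of the distance, $\frac{1}{2}\sum_i |p_i - p_{0,i}| \le \epsilon$, and moves up to ε of probability mass from the cheapest scenarios to the costliest one. It is an illustration under these assumptions, with hypothetical names, and not the procedure of any of the cited papers.

```python
def worst_case_expectation_tv(h, p0, eps):
    """Maximize sum_i h[i]*p[i] over distributions p with 0.5*sum|p - p0| <= eps."""
    p = list(p0)
    top = max(range(len(h)), key=lambda i: h[i])      # scenario receiving mass
    budget = min(eps, 1.0 - p[top])                   # mass that can be moved
    for i in sorted(range(len(h)), key=lambda i: h[i]):
        if i == top or budget <= 0.0:
            continue
        moved = min(p[i], budget)                     # drain cheapest scenarios first
        p[i] -= moved
        p[top] += moved
        budget -= moved
    return sum(hi * pi for hi, pi in zip(h, p)), p

h = [1.0, 2.0, 4.0, 10.0]                 # scenario costs h(x, xi_i) for a fixed x
p0 = [0.25, 0.25, 0.25, 0.25]             # nominal distribution
value, p_worst = worst_case_expectation_tv(h, p0, eps=0.3)
```

In this example the worst-case distribution puts zero mass on the cheapest scenario and extra mass on the costliest one.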

Goodness-of-Fit Test. Postek et al. [240] review and derive computationally tractable reformulations of distributionally robust risk constraints over discrete probability distributions for various risk measures and ambiguity sets formed using statistical goodness-of-fit tests or probability metrics, including φ-divergences, Kolmogorov-Smirnov, Wasserstein, Anderson-Darling, Cramer-von Mises, Watson, and Kuiper. They exemplify the results in portfolio optimization and antenna array design problems. Bertsimas et al. [42] and Bertsimas et al. [43] propose a systematic view on how to choose a statistical goodness-of-fit test to construct an ambiguity set of distributions that guarantees the implication (C1) (recall Theorem 3.9). They consider the situations that (i) $\mathbb{P}^{true} = P^{true} \circ \tilde{\xi}^{-1}$ may have continuous support, and the components of $\tilde{\xi}$ are independent, (ii) $\mathbb{P}^{true}$ may have continuous support, and data are drawn from its marginal distributions asynchronously, and (iii) $\mathbb{P}^{true}$ may have continuous support, and data are drawn from its joint distribution. They also study a wide range of statistical hypothesis tests, including $\chi^2$, G, Kolmogorov-Smirnov, Kuiper, Cramer-von Mises, Watson, and Anderson-Darling goodness-of-fit tests, and they characterize the geometric shape of the corresponding ambiguity sets.

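Prohorov Metric. For two probability measures $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$, the Prohorov metric is defined as [116] $d_p(P_1, P_2) := \inf\{\epsilon > 0 : P_1(A) \le P_2(A^\epsilon) + \epsilon, \ \forall A \in \mathcal{F}\}$, where $A^\epsilon := \{s \in \Xi : \inf_{s' \in A} d(s, s') \le \epsilon\}$ for a metric $d$ on $\Xi$. The Prohorov metric takes values in [0, 1] and can be used to model the distributional ambiguity as follows: $\{P \in \mathcal{M}(\Xi, \mathcal{F}) : d_p(P, P_0) \le \epsilon\}$, where, as before, $P_0$ is a nominal probability measure and $\epsilon$ controls the size of the ambiguity set. A specialization of the Prohorov metric to univariate distributions is called the Levy metric, which is defined as [116] $d_L(P_1, P_2) := \inf\{\epsilon > 0 : F_1(t - \epsilon) - \epsilon \le F_2(t) \le F_1(t + \epsilon) + \epsilon, \ \forall t \in \mathbb{R}\}$, where $F_1$ and $F_2$ denote the cumulative distribution functions of $P_1$ and $P_2$. The Levy metric can be used to model the distributional ambiguity as follows: $\{P \in \mathcal{M}(\Xi, \mathcal{F}) : d_L(P, P_0) \le \epsilon\}$.
Erdogan and Iyengar [102] study an optimization problem subject to a set of parameterized convex constraints. Similar to the argument in Section 3.3.2, they study a DRO approach to this problem, where the distributional ambiguity is modeled by the Prohorov metric. They also consider a scenario approximation scheme of the problem. By extending the work of [63,60], they provide an upper bound on the number of samples required to guarantee that the sampled problem is a good approximation for the associated ambiguous chance-constrained problem with a high probability.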
5.1.6. $\ell_p$-Norm. Calafiore and El Ghaoui [61] study a distributionally robust individual linear chance-constrained problem, and provide convex conditions that guarantee the satisfaction of the chance constraint within the family of radially-symmetric nonincreasing densities whose supports are defined by means of the $\ell_1$- and $\ell_\infty$-norm. Mevissen et al. [204] study distributionally robust polynomial optimization, where the distribution of the uncertain parameter is estimated using polynomial basis functions via the $\ell_p$-norm. They show that the optimal value of the problem is the limit of a sequence of tractable SDP relaxations of polynomial optimization problems. They also provide a finite-sample consistency guarantee for the data-driven uncertainty sets, and an asymptotic guarantee on the solutions of the SDP relaxations. They apply their techniques to a water network optimization problem.
Jiang and Guan [167] study distributionally robust two-stage stochastic programs formed via the $\ell_\infty$-norm. Huang et al. [158] extend the work of Jiang and Guan [167] to the multistage setting. They reformulate the problem so that the objective function of each stage contains a convex combination of expectation and CVaR, which removes the nested multistage minmax structure in the objective function. They analyze the convergence of the resulting DRO problem to the corresponding multistage stochastic program under the true unknown probability distribution. They test their results on the hydrothermal scheduling problem.
5.1.7. ζ-Structure Metrics. Consider $P_1, P_2 \in \mathcal{M}(\Xi, \mathcal{F})$ and let $Z$ be a family of real-valued measurable functions $z : (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d)) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The ζ-structure metric is defined as $d_Z(P_1, P_2) := \sup_{z \in Z} |E_{P_1}[z(\tilde{\xi})] - E_{P_2}[z(\tilde{\xi})]|$. A wide range of metrics in probability theory can be written as special cases of the above family of metrics [342,236]. Let us introduce them below.
• Uniform (Kolmogorov) metric $d_U(P_1, P_2)$: $Z = \{\mathbb{1}_{(-\infty, t]} : t \in \mathbb{R}^d\}$.
The class of ζ-structure metrics may be used to model the distributional ambiguity as follows: $\{P \in \mathcal{M}(\Xi, \mathcal{F}) : d_Z(P, P_0) \le \epsilon\}$, where, as before, $P_0$ is a nominal probability measure and $\epsilon$ controls the size of the ambiguity set.
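As an illustration of a ζ-structure metric, the sketch below computes the uniform (Kolmogorov) metric between the empirical distributions of two univariate samples. It assumes the standard characterization in which the family $Z$ consists of indicator functions of half-lines $(-\infty, t]$, so that the metric equals the largest gap between the two empirical distribution functions. The function name and the data are hypothetical.

```python
def kolmogorov_distance(sample1, sample2):
    """sup_t |F1(t) - F2(t)| for the empirical CDFs of two univariate samples."""
    points = sorted(set(sample1) | set(sample2))      # the sup is attained at a data point
    n1, n2 = len(sample1), len(sample2)
    best = 0.0
    for t in points:
        f1 = sum(1 for s in sample1 if s <= t) / n1   # E_P1[z] with z = indicator of (-inf, t]
        f2 = sum(1 for s in sample2 if s <= t) / n2   # E_P2[z] for the same z
        best = max(best, abs(f1 - f2))
    return best

d_u = kolmogorov_distance([1.0, 2.0, 3.0, 4.0], [2.5, 3.5, 4.5, 5.5])
```

Because both empirical distribution functions are step functions, it suffices to evaluate the gap at the observed data points.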
Zhao and Guan [342] study distributionally robust two-stage stochastic programs via ζ-structure metrics. They discuss how to construct the ambiguity set from historical data while utilizing a family of ζ-structure metrics. They propose solution approaches to solve the resulting problem, where the true unknown distribution is discrete or continuous. They further analyze the convergence of the DRO problem to the corresponding stochastic program under the true unknown probability distribution. They test their results on newsvendor and facility location problems.
Pichler and Xu [236] study a DRO model with an expectation as the risk measure and form the ambiguity set of distributions via a ζ-structure metric. They investigate how the variation of the ambiguity set would affect the optimal value and the optimal solution in the resulting optimization problem. They illustrate their results in the context of a two-stage stochastic program with recourse.

Contamination Neighborhood. The contamination neighborhood around a nominal probability measure $P_0$ is defined as $\{(1 - \epsilon) P_0 + \epsilon Q : Q \in \mathcal{Q}\}$, where $\mathcal{Q} \subseteq \mathcal{M}(\Xi, \mathcal{F})$ and $\epsilon \in [0, 1]$. This ambiguity set is extensively used in the context of robust statistics, see, e.g., Huber [160], Huber and Ronchetti [161], and it has also been used in the economics literature, see, e.g., Nishimura and Ozaki [219,220]. Bose and Daripa [56] study ambiguity aversion in a mechanism design problem using a maximin expected utility model of Gilboa and Schmeidler [117]. The contamination neighborhood is also used in the context of statistical learning, see, e.g., Duchi et al. [93], and hypothesis testing, see, e.g., Huber [159].
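A short sketch of the worst-case expectation over a contamination neighborhood may help. It assumes the usual form of the set, namely mixtures $(1 - \epsilon) P_0 + \epsilon Q$, and it further assumes that $Q$ ranges over all distributions on a finite set of scenarios, in which case the worst contaminating distribution is a point mass on the costliest scenario. The names and numbers are hypothetical.

```python
def worst_case_contamination(h, p0, eps):
    """sup of E_P[h] over P = (1 - eps) * P0 + eps * Q, with Q any distribution on the scenarios."""
    nominal = sum(hi * pi for hi, pi in zip(h, p0))
    # The worst Q is a point mass at the costliest scenario.
    return (1.0 - eps) * nominal + eps * max(h)

h = [1.0, 2.0, 4.0, 10.0]                 # scenario costs for a fixed decision
p0 = [0.25, 0.25, 0.25, 0.25]             # nominal distribution
value = worst_case_contamination(h, p0, eps=0.2)
```

The result is a convex combination of the nominal expected cost and the largest scenario cost, with weight ε on the latter.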
5.1.9. General Discrepancy-Based Ambiguity Sets. We devote this subsection to the papers that consider general discrepancy-based models. Postek et al. [240] review and derive tractable reformulations of distributionally robust risk constraints over discrete probability distributions and for function $g(x, \tilde{\xi})$ in $\tilde{\xi}$. They provide a comprehensive list of risk measures and ambiguity sets, formed using statistical goodness-of-fit tests or probability metrics. They consider risk measures such as (1) expectation, (2) sum of expectation and standard deviation/variance, (3) variance, (4) mean absolute deviation from the median, (5) Sharpe ratio, (6) lower partial moments, and (7) certainty equivalent, among others. They use statistical goodness-of-fit tests and probability metrics, namely (1) φ-divergences, (2) Kolmogorov-Smirnov, (3) Wasserstein, (4) Anderson-Darling, (5) Cramer-von Mises, (6) Watson, and (7) Kuiper, to model the distributional ambiguity. For each pair of risk measure and ambiguity set, they obtain a tractable reformulation by relying on the conjugate duality for the risk measure and the support function of the ambiguity set (i.e., the convex conjugate of the indicator function of the ambiguity set). They exemplify the results in portfolio optimization and antenna array design problems.
A connection between DRO models formed via discrepancy-based ambiguity sets and law invariant risk measures is made in Shapiro [289], as described in Theorem 3.8. They specifically derive law invariant risk measures for the cases in which the Wasserstein metric, φ-divergences, and the total variation distance are used to model the distributional ambiguity. They also propose an SAA approach to solve the corresponding dual of these problems, and establish the statistical properties of the optimal solutions and optimal value, similar to the results for the risk-neutral stochastic programs, see, e.g., Shapiro et al. [295], Shapiro [285].

Moment-Based Ambiguity Sets.
A common approach to model the ambiguity set is moment based, in which the ambiguity set contains all probability distributions whose moments satisfy certain properties. We categorize this type of models into several subgroups, although there are some overlaps.

Chebyshev. Scarf [271] models the distributional ambiguity in a newsvendor problem, where only the mean and variance of the random demand are known. He obtains a closed-form expression for the optimal order quantity and shows that the worst-case probability distribution is supported on only two points. Motivated by Scarf's seminal work, other researchers have investigated the Chebyshev ambiguity set in the context of the newsvendor model. Gallego and Moon [109] study multiple extensions of the problem studied in Scarf [271]. These include the situations where there is a recourse opportunity, a fixed ordering cost, a random production output, and a scarce resource for multiple competing products.
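The closed-form order quantity for the mean-variance newsvendor can be illustrated numerically. The sketch below is written in a cost form that is an assumption of this example: unit underage cost $b$, unit overage cost $h$, demand mean $\mu$ and standard deviation $\sigma$, no restriction on the demand support, and the case in which ordering is worthwhile. It uses the well-known tight bound $E(D - q)_+ \le \frac{1}{2}\big(\sqrt{\sigma^2 + (q - \mu)^2} - (q - \mu)\big)$ and cross-checks the resulting formula $q^* = \mu + \frac{\sigma}{2}\big(\sqrt{b/h} - \sqrt{h/b}\big)$ against a grid search. All names are hypothetical.

```python
import math

def scarf_order_quantity(mu, sigma, b, h):
    """Min-max order quantity for underage cost b and overage cost h (mean mu, std sigma)."""
    return mu + 0.5 * sigma * (math.sqrt(b / h) - math.sqrt(h / b))

def worst_case_cost(q, mu, sigma, b, h):
    """Worst-case expected cost, using E(D-q)^+ <= (sqrt(sigma^2 + (q-mu)^2) - (q-mu)) / 2."""
    shortfall = 0.5 * (math.sqrt(sigma ** 2 + (q - mu) ** 2) - (q - mu))
    # h*E(q-D)^+ + b*E(D-q)^+ = h*(q-mu) + (h+b)*E(D-q)^+
    return h * (q - mu) + (h + b) * shortfall

mu, sigma, b, h = 100.0, 20.0, 3.0, 1.0
q_star = scarf_order_quantity(mu, sigma, b, h)
# Numerical cross-check on a grid of candidate order quantities.
grid = [50.0 + 0.01 * k for k in range(10001)]
q_grid = min(grid, key=lambda q: worst_case_cost(q, mu, sigma, b, h))
```

When the underage cost exceeds the overage cost, the robust order quantity lies above the mean demand, and it shrinks back to the mean as the two costs become equal.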
Unlike the ambiguity sets studied in Scarf [271] and Gallego and Moon [109], the mean and covariance matrix can be unknown themselves and belong to some uncertainty sets. El Ghaoui et al. [101] study a distributionally robust one-period portfolio optimization, where the worst-case VaR over an ambiguity set of distributions with a known mean and covariance matrix is minimized. They show that this problem can be reformulated as a SOCP. Moreover, they show that minimizing worst-case VaR with respect to such an ambiguity set can be interpreted as a RO model where the worst-case portfolio loss with respect to an ellipsoid uncertainty set is minimized. They extend their study to the case that the first two order moments are only known to belong to a convex (bounded) uncertainty set, and they show the conditions under which the resulting model can be cast as a SDP. In particular, for independent polytopic uncertainty sets for the mean and covariance (so that the mean and covariance belong to the Cartesian product of these two sets), the problem can be reformulated as a SOCP. Also, for sets with componentwise bounds on the mean and covariance, they cast the problem as a SDP (see also Halldórsson and Tütüncü [131] for a similar result). Moreover, they show that in the presence of additional information on the distribution, besides the first two order moments, including constraints on the support and Kullback-Leibler divergence, an upper bound on the worst-case VaR can be obtained by solving a SDP. Motivated by the work in El Ghaoui et al. [101], Li [189] showcases the results in the context of a risk-averse portfolio optimization problem. Unlike El Ghaoui et al. [101], who consider polytopic and interval uncertainty sets for the mean and covariance, Lotfi and Zenios [195] assume that the unknown mean and covariance belong to an ellipsoidal uncertainty set. They study the worst-case VaR and worst-case CVaR optimization problems, subject to an expected return constraint. They show that both problems can be reformulated as SOCPs.
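The known-moments case admits a simple closed form that explains the second-order cone structure. Under the assumption that only the mean vector $\mu$ and covariance matrix $\Sigma$ of the returns are known, the worst-case VaR at level $\epsilon$ of the portfolio loss $-\tilde{\xi}^\top x$ equals $\kappa(\epsilon)\sqrt{x^\top \Sigma x} - \mu^\top x$ with $\kappa(\epsilon) = \sqrt{(1 - \epsilon)/\epsilon}$. The sketch below evaluates this expression for a two-asset portfolio; the function name and the data are hypothetical.

```python
import math

def worst_case_var(x, mu, cov, eps):
    """Worst-case VaR of the loss -r^T x at level eps over all distributions with mean mu and covariance cov."""
    n = len(x)
    mean_return = sum(xi * mi for xi, mi in zip(x, mu))
    variance = sum(x[i] * cov[i][j] * x[j] for i in range(n) for j in range(n))
    kappa = math.sqrt((1.0 - eps) / eps)
    return kappa * math.sqrt(variance) - mean_return

x = [0.5, 0.5]                            # portfolio weights
mu = [0.08, 0.04]                         # mean returns
cov = [[0.04, 0.01], [0.01, 0.02]]        # covariance matrix of returns
var_wc = worst_case_var(x, mu, cov, eps=0.05)
```

Bounding this quantity by a constant is a second-order cone constraint in $x$, which is consistent with the SOCP reformulations discussed above.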
Goldfarb and Iyengar [122] study a distributionally robust portfolio selection problem, where the asset returns $\tilde{\xi}$ are formed by a linear factor model of the form $\tilde{\xi} = \mu + A\tilde{f} + \tilde{\epsilon}$, where $\mu$ is the vector of mean returns, $\tilde{f} \sim N(0, \Sigma)$ is the vector of random returns that drives the market, $A$ is the factor loading matrix, and $\tilde{\epsilon} \sim N(0, B)$ is the vector of residual returns with a diagonal matrix $B$. It is assumed that $\tilde{\epsilon}$ is independent of $\tilde{f}$, F, and $B$. Thus, $\tilde{\xi} \sim N(\mu, A\Sigma A^\top + B)$; hence, the uncertainty in the mean is independent of the uncertainty in the covariance matrix of the returns. Under the assumption that the covariance matrix $\Sigma$ is known, Goldfarb and Iyengar [122] study three different models to form the uncertainty in $B$, $A$, and $\mu$ as follows: maximum expected return subject to a maximum variance constraint, where $\xi_0$ is a risk-free return rate, and (4) maximum expected return subject to a maximum VaR constraint. Note that the constraint $\mathrm{VaR}_\beta[\tilde{\xi}^\top x] \ge \alpha$ is equivalent to $P\{\tilde{\xi}^\top x \le \alpha\} \le \beta$. They show that all the above four classes of problems can be reformulated as SOCPs. They further assume the covariance matrix $\Sigma$ or its inverse is unknown and belongs to an ellipsoidal uncertainty set, and show that the above problems can be reformulated as SOCPs. El Ghaoui et al. [101] study a similar linear factor model as the one in Goldfarb and Iyengar [122], but they assume that the uncertainty in the mean is not independent of the uncertainty in the covariance matrix of the returns. When the factor matrix $A$ belongs to an ellipsoidal uncertainty set, they show that an upper bound on the worst-case VaR can be computed by solving a SDP. Li and Kwon [188] study a distributionally robust approach for a single-period portfolio selection problem. They consider a set of reference means and variances, and they form the ambiguity set by all distributions whose means and variances are within a pre-specified distance of the reference set (in the usual sense of the distance of a point from a set via a norm). For the case that moments take values outside the reference region, since evaluation based on the worst-case performance can be overly conservative, they consider a penalty term that further accounts for the discrepancy between the moments in and outside the reference region. Moreover, for the case that the reference region is a conic set, they obtain an equivalent SDP reformulation.
Grünwald and Dawid [126] confine the ambiguity set to distributions with fixed first order moments τ. By varying τ, they obtain a collection of maximum generalized entropy distributions and relate it to the exponential family of distributions.
Rujeerapaiboon et al. [267] derive Chebyshev-type bounds on the worst-case right and left tail of a product of nonnegative symmetric random variables. They assume that the mean is known, but the covariance matrix might be known or bounded above by a matrix inequality. They show that if both the mean and covariance matrix are known, these bounds can be obtained by solving a SDP. For the case that the covariance matrix is bounded above, they show that (i) the bound on the left tail is equal to the bound on the left tail under the known covariance setting, and (ii) the bound on the right tail is equal to the bound on the right tail under the known mean and covariance setting, for a sufficiently large tail. They extend their results to construct Chebyshev bounds for sums, minima, and maxima of nonnegative random variables.

Delage and Ye. Unlike the ambiguity sets studied in Scarf [271] and Gallego and Moon [109], Delage and Ye [82] allow the mean and covariance matrix to be unknown themselves. This ambiguity set is defined as follows [82]:
$$\mathcal{P}^{DY} := \Big\{ P \in \mathcal{M}(\Xi, \mathcal{F}) : P\{\tilde{\xi} \in \Omega\} = 1, \ \ (E_P[\tilde{\xi}] - \mu_0)^\top \Sigma_0^{-1} (E_P[\tilde{\xi}] - \mu_0) \le \epsilon_1, \ \ E_P[(\tilde{\xi} - \mu_0)(\tilde{\xi} - \mu_0)^\top] \preceq \epsilon_2 \Sigma_0 \Big\}. \qquad (5.22)$$
In the first constraint, $\Omega \subseteq \mathbb{R}^d$ denotes the smallest closed convex set that contains $\tilde{\xi}$ with probability one (w.p. 1), i.e., $\Omega$ is the support of $\mathbb{P} = P \circ \tilde{\xi}^{-1}$ w.p. 1. The second constraint ensures that the mean of $\tilde{\xi}$ lies in an ellipsoid of size $\epsilon_1$, centered around the nominal mean estimate $\mu_0$. Note that we can equivalently write this constraint as
$$\begin{bmatrix} \Sigma_0 & E_P[\tilde{\xi}] - \mu_0 \\ (E_P[\tilde{\xi}] - \mu_0)^\top & \epsilon_1 \end{bmatrix} \succeq 0.$$
The third constraint bounds the second central-moment matrix of $\tilde{\xi}$ by a matrix inequality. The parameters $\epsilon_1$ and $\epsilon_2$ control the level of confidence in $\mu_0$ and $\Sigma_0$, respectively. Note that the ambiguity sets with a known mean and covariance matrix can be seen as a special case of (5.22), with $\epsilon_1 = 0$ and $\epsilon_2 = 1$. Delage and Ye [82] propose data-driven methods to form confidence regions for the mean and the covariance matrix of the random vector $\tilde{\xi}$ using the concentration inequalities of McDiarmid [201], and provide probabilistic guarantees that the solution found using the resulting DRO model yields an upper bound on the out-of-sample performance with respect to the true distribution of the random vector. A conic generalization of the ambiguity set $\mathcal{P}^{DY}$, beyond the first and second moment information, is also studied in Delage [80]. Below, we present a duality result for $\sup_{P \in \mathcal{P}^{DY}} E_P[h(x, \tilde{\xi})]$ given a fixed $x \in X$, due to Delage and Ye [82].
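To make the three constraints tangible, the following sketch checks whether a discrete univariate distribution belongs to a set of this type. It assumes the scalar form of the constraints: the support lies in an interval $\Omega$, the mean satisfies $(E[\tilde{\xi}] - \mu_0)^2 / \sigma_0^2 \le \epsilon_1$, and the second moment about $\mu_0$ satisfies $E[(\tilde{\xi} - \mu_0)^2] \le \epsilon_2 \sigma_0^2$. The function name, arguments, and numbers are hypothetical.

```python
def in_moment_set(points, probs, mu0, var0, eps1, eps2, omega=(float("-inf"), float("inf"))):
    """Check the support, mean, and second-moment constraints for a discrete univariate distribution."""
    lo, hi = omega
    support_ok = all(lo <= s <= hi for s, p in zip(points, probs) if p > 0)
    mean = sum(p * s for s, p in zip(points, probs))
    mean_ok = (mean - mu0) ** 2 / var0 <= eps1                      # ellipsoid around mu0
    second = sum(p * (s - mu0) ** 2 for s, p in zip(points, probs))
    second_ok = second <= eps2 * var0                               # bound on moment about mu0
    return support_ok and mean_ok and second_ok

points = [-1.0, 0.0, 1.0]
inside = in_moment_set(points, [0.25, 0.5, 0.25], mu0=0.0, var0=0.5, eps1=0.1, eps2=1.2)
shifted = in_moment_set(points, [0.1, 0.5, 0.4], mu0=0.0, var0=0.5, eps1=0.1, eps2=1.2)
```

The second distribution is rejected because its mean falls outside the confidence ellipsoid around $\mu_0$, even though its second moment about $\mu_0$ is within the bound.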
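Theorem 5.8. (Delage and Ye [82, Lemma 1]) For a fixed $x \in X$, suppose that Slater's constraint qualification conditions are satisfied, i.e., there exists a strictly feasible $P$ to $\mathcal{P}^{DY}$, and $h(x, \tilde{\xi})$ is $P$-integrable for all $P \in \mathcal{P}^{DY}$. Then, $\sup_{P \in \mathcal{P}^{DY}} E_P[h(x, \tilde{\xi})]$ is equal to the optimal value of the following semi-infinite convex conic optimization problem:
$$\min_{Y, y, r, t} \ r + t \quad \text{s.t.} \quad r \ge h(x, \xi) - \xi^\top Y \xi - \xi^\top y, \ \forall \xi \in \Omega, \quad t \ge (\epsilon_2 \Sigma_0 + \mu_0 \mu_0^\top) \bullet Y + \mu_0^\top y + \sqrt{\epsilon_1} \, \big\| \Sigma_0^{1/2} (y + 2 Y \mu_0) \big\|, \quad Y \succeq 0,$$
where $Y \in \mathbb{R}^{d \times d}$, $y \in \mathbb{R}^d$, and $r, t \in \mathbb{R}$.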
The reformulated problem in Theorem 5.8 is polynomial-time solvable under the following assumptions [82]:
• The sets $X$ and $\Omega$ are convex and compact, and are both equipped with oracles that confirm the feasibility of a point $x$ and $\xi$, or provide a hyperplane that separates the infeasible point from its corresponding feasible set in time polynomial in the dimension of the set.
• Function $h(x, \xi) := \max_{k \in \{1, \ldots, K\}} h_k(x, \xi)$ is piecewise and is such that for each $k$, $h_k(x, \xi)$ is convex in $x$ and concave in $\xi$. In addition, for any given pair $(x, \xi)$, one can evaluate $h_k(x, \xi)$, find a supergradient of $h_k(x, \xi)$ in $\xi$, and find a subgradient of $h_k(x, \xi)$ in $x$, in time polynomial in the dimension of $X$ and $\Omega$.
As a special case where $\Omega$ is an ellipsoid, the resulting reformulation in Theorem 5.8 reduces to a SDP of finite size.
Motivated by the computational challenges of solving a semidefinite reformulation of (1.5) formed via (5.22), Cheng et al. [79] propose an approximation method to reduce the dimensionality of the resulting DRO. This approximation method relies on the principal component analysis for the optimal lower dimensional representation of the variability in random samples. They show that this approximation yields a relaxation of the original problem and give theoretical bounds on the gap between the original problem and its approximation.
Popescu [239] studies a class of stochastic optimization problems, where the objective function is characterized with one- or two-point support functions. They show that when the ambiguity set of distributions is formed with all distributions with known mean and covariance, the problem reduces to a deterministic parametric quadratic program. In particular, this result holds for increasing concave utilities with convex or concave-convex derivatives.
Goh and Sim [120] study a DRO approach to a stochastic linear optimization problem with expectation constraints, where the support and mean of the random parameters belong to a conic-representable set, while the covariance matrix is assumed to be known.

Discrete Problems. Under the assumption that the mean and covariance are known, Natarajan and Teo [210] investigate the worst-case expected value of the maximum of a linear function of random variables, i.e., the supremum of $E_P[Z(\tilde{\xi})]$ over all distributions $P$ with the given mean and covariance, where $Z(\tilde{\xi}) = \max\{\tilde{\xi}^\top x : x \in X\}$. The set $X$ is specified with either a finite number of points or a bounded feasible region to a mixed-integer LP. To obtain an upper bound, they approximate the copositive programming reformulation of the problem, presented in Natarajan et al. [211, Theorem 3.3], with a SDP. They show that the complexity of computing this bound is closely related to characterizing the convex hull of the quadratic forms of the points in the feasible region.
Xie and Ahmed [323] study a DRO approach to a two-stage stochastic program with a simple integer round-up recourse function. The ambiguity set is formed by the product of one-dimensional ambiguity sets for each component of the random parameter $\tilde{\xi}$, formed with marginal distributions with known support and mean. They obtain a closed-form expression for the inner problem corresponding to each component, and they reformulate the problem as a mixed-integer SOCP.
Ahipasaoǧlu et al. [2] study distributionally robust project crashing problems. They assume the underlying joint probability distribution of the activity durations lies in an ambiguity set of distributions with the given mean, standard deviation, and correlation information. The goal is to select the means and standard deviations to minimize the worst-case expected makespan for the project network with respect to the ambiguity set of distributions. Unlike the typical use of SDP solvers to directly solve the problem, they exploit the problem structure to reformulate it as a convex-concave saddle point problem over the first two moment variables in order to solve the formulation in polynomial time.
A distributionally robust approach to an individual chance constraint with binary decisions is studied in Zhang et al. [340]. They consider individual chance constraints of the form (1.6), with constraint functions $g_j(x, \tilde{\xi})$, $j = 1, \ldots, m$, and binary decisions $x \in \{0, 1\}^n$. They form the ambiguity set of distributions by all joint distributions whose marginal means and covariances satisfy the constraints in (5.22). They reformulate the chance constraints as binary second-order conic (SOC) constraints.

Risk and Chance Constraints. Risk-based DRO models formed via the ambiguity set (5.22) are also studied in the literature. Bertsimas et al. [39] study a risk-averse distributionally robust two-stage stochastic linear optimization problem where the mean and the covariance matrix are known, and a convex nondecreasing piecewise linear disutility function is used to model risk. When the second-stage objective function's coefficients are random, they obtain a tight polynomial-sized SDP formulation. They also provide an explicit construction for a sequence of (worst-case) distributions that asymptotically attain the optimal value. They prove that this problem is NP-hard when the right-hand side is random, and further show that under the special case that the extreme points of the dual of the second-stage problem are explicitly known, the problem admits a SDP reformulation. An explicit construction of the worst-case distributions is also given. The results are applied to the production-transportation problem and a single facility minimax distance problem. Li [189] obtains a closed-form expression for the worst case of the class of law invariant coherent risk measures, where the worst case is taken with respect to all distributions with the same mean and covariance matrix.
Zymler et al. [346] extend the work of El Ghaoui et al. [101] with known first and second order moments to a portfolio of derivatives, and develop two worst-case VaR models to capture the nonlinear dependencies between the derivative returns and the underlying asset returns. They introduce worst-case polyhedral VaR with convex piecewise-linear relationship between the derivative return and the asset returns. They also show that minimizing worst-case polyhedral VaR is equivalent to a convex SOCP. A worst-case quadratic VaR with (possibly nonconvex) quadratic relationships between the derivative return and the asset returns is also introduced, and they show that minimizing worst-case quadratic VaR is equivalent to a convex SDP. These worst-case VaR measures are equivalent to the worst-case CVaR of the underlying polyhedral or quadratic loss function, and they are coherent. As in El Ghaoui et al. [101], Zymler et al. [346] show that optimization of these new worst-case VaR has a RO interpretation over an uncertainty set, asymmetrically oriented around the mean values of the asset returns. Using the result from Zymler et al. [345], Rujeerapaiboon et al. [266] show that the worst-case VaR of the quadratic approximation of a portfolio growth rate can be expressed as the optimal value of a SDP.
Chen et al. [72] summarize and develop different approximations to the individual chance constraint used in the robust optimization as the consequence of applying different bounds on CVaR. These bounds, in turn, can be written as an optimization problem over an uncertainty set. For instance, they show that when the uncertainties are characterized only by their means and covariance, the corresponding uncertainty set is an ellipsoid. Calafiore and El Ghaoui [61] provide explicit results for enforcement of the individual chance constraint over an ambiguity set of distributions. When only the information on the mean and covariance are considered, the worst-case chance constraint is equivalent to a convex second-order conic (SOC) constraint. With additional information on the symmetry, the worst-case chance constraint can be safely approximated via a convex SOC constraint. Additionally, when the means are known and individual elements are known to belong with probability one to independent bounded intervals, the worst-case chance constraint can be safely approximated via a convex SOC constraint.
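The mean-covariance case can be illustrated in one dimension. For a scalar random variable $X$ (for instance, $X = \tilde{\xi}^\top x$ with mean $\mu^\top x$ and standard deviation $\sqrt{x^\top \Sigma x}$), the one-sided Chebyshev (Cantelli) bound is tight, so requiring $P\{X \le b\} \ge 1 - \epsilon$ for every distribution with mean $\mu$ and standard deviation $\sigma$ is equivalent to $\mu + \sqrt{(1 - \epsilon)/\epsilon}\, \sigma \le b$. The sketch below checks this condition and builds a two-point distribution that attains the bound; the function names are hypothetical.

```python
import math

def robust_chance_constraint_holds(mu, sigma, b, eps):
    """True if P{X <= b} >= 1 - eps for every distribution of X with mean mu and std sigma."""
    return mu + math.sqrt((1.0 - eps) / eps) * sigma <= b

def two_point_worst_case(mu, sigma, b):
    """Two-point distribution (value, prob) with mean mu, std sigma, and P{X >= b} = sigma^2/(sigma^2 + t^2), t = b - mu > 0."""
    t = b - mu
    p_high = sigma ** 2 / (sigma ** 2 + t ** 2)
    return [(mu + t, p_high), (mu - sigma ** 2 / t, 1.0 - p_high)]

ok_loose = robust_chance_constraint_holds(mu=0.0, sigma=1.0, b=3.5, eps=0.1)
ok_tight = robust_chance_constraint_holds(mu=0.0, sigma=1.0, b=2.5, eps=0.1)
dist = two_point_worst_case(mu=0.0, sigma=1.0, b=2.0)
mean = sum(v * p for v, p in dist)
var = sum(p * (v - mean) ** 2 for v, p in dist)
```

With $\epsilon = 0.1$ the safety factor is $\sqrt{9} = 3$, so a threshold of 3.5 standard deviations is sufficient while 2.5 is not.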
Zymler et al. [345] study a safe approximation to distributionally robust individual and joint chance constraints based on the worst-case CVaR. Under the assumptions that the ambiguity set is formed via distributions with fixed mean and covariance, and the chance safe regions are bi-affine in $x$ and $\tilde{\xi}$, they obtain an exact SDP reformulation of the worst-case CVaR. They show that the CVaR approximation is in fact exact for individual chance constraints whose constraint functions are either convex or (possibly nonconvex) quadratic in $\tilde{\xi}$ by relying on the nonlinear Farkas lemma and the S-lemma, see, e.g., Pólik and Terlaky [237].
Chen et al. [72] extend their idea to the joint chance constraint by using bounds for order statistics. They show that the resulting approximation for the joint chance constraint outperforms the Bonferroni approximation, and the constraints of the approximation are second-order conic-representable. Zymler et al. [345] show that the CVaR approximation is exact for joint chance constraints whose constraint functions depend linearly onξ. They evaluate the performance of their approximation for joint chance constraint in the context of a water reservoir control problem for hydro power generation and show it outperforms the Bonferroni approximation and the method of Chen et al. [72].
Motivated by the fact that chance constraints do not take into account the magnitude of the violation, Xu et al. [330] study a probabilistic envelope constraint. This approach can be interpreted as a continuum of chance constraints with nondecreasing target values and probabilities. They show that when the first two order moments are known, an ambiguous probabilistic envelope constraint is equivalent to a deterministic SIP, which is called a comprehensive robust optimization problem [25,27]. In other words, the ambiguous probabilistic envelope constraint alleviates the "all-or-nothing" view of the standard RO that ignores realizations outside of the uncertainty set. We refer to Yang and Xu [335] for an extension of the work in Xu et al. [330] to nonlinear inequalities.

Statistical Learning.
Lanckriet et al. [179] present a DRO approach to a binary classification problem that minimizes the worst-case probability of misclassification, where the mean and covariance matrix of each class are known. They show that for a linear hypothesis, the problem can be formulated as an SOCP. They also investigate the case where the mean and covariance are unknown and belong to convex uncertainty sets. They show that when the mean is unknown and belongs to an ellipsoid, the problem is an SOCP. On the other hand, when the mean is known and the covariance belongs to a matrix norm ball, the problem is an SOCP and adopts a regularization term. For a nonlinear hypothesis, they seek a kernel function to map into a higher-dimensional covariate-response space such that a linear hypothesis in that space corresponds to a nonlinear hypothesis in the original covariate-response space. Using this idea, the model is reformulated as an SOCP.

Multistage Setting.
Xin and Goldberg [326] study a multistage distributionally robust newsvendor problem where the support and the first two moments of the demand distribution are known at each stage. They provide a formal definition of the time consistency of the optimal policies and study this phenomenon in the context of the newsvendor problem. They further relate time consistency to rectangularity of measures, see, e.g., Shapiro [288], and provide sufficient conditions for time consistency. Unlike Xin and Goldberg [326], who suppose that the demand process is stage-wise independent, Xin and Goldberg [325] assume that the demand process is a martingale. They form the ambiguity set by all distributions with a known support and mean at each stage. They obtain, in closed form, the optimal policy and a two-point worst-case probability distribution, one of whose support points is zero. They also show that for any initial inventory level, the optimal policy and random demand (distributed according to the worst-case distribution) are such that, at every stage, either the demand is greater than or equal to the inventory or the demand is zero, in which case all future demands are also zero.
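The two-point structure of the worst-case distribution under known support and mean can be illustrated in a much simpler setting than the multistage martingale model of Xin and Goldberg [325]. The sketch below assumes a single-stage newsvendor whose cost is convex in demand, with demand supported on [0, U] and a known mean. The cost parameters, the grid, and all function names are hypothetical choices made for illustration. Because the inner problem is a linear program with one moment constraint plus normalization, it suffices to enumerate two-point distributions on the grid.

```python
from itertools import combinations

def newsvendor_cost(x, d, b=4.0, h=1.0):
    """Backorder cost b per unit of unmet demand, holding cost h per unit left over."""
    return b * max(d - x, 0.0) + h * max(x - d, 0.0)

def worst_case_cost_on_grid(x, mu, grid):
    """Maximize E[cost] over distributions on `grid` whose mean equals mu.

    An optimal solution with at most two support points exists, so we
    enumerate pairs (lo, hi) with lo <= mu <= hi and solve for the weights.
    """
    best_value, best_dist = float("-inf"), None
    for lo, hi in combinations(sorted(grid), 2):
        if not (lo <= mu <= hi):
            continue
        p_hi = (mu - lo) / (hi - lo)  # weight on hi that matches the mean
        value = (1 - p_hi) * newsvendor_cost(x, lo) + p_hi * newsvendor_cost(x, hi)
        if value > best_value:
            best_value, best_dist = value, {lo: 1 - p_hi, hi: p_hi}
    return best_value, best_dist

U, mu, x = 10.0, 3.0, 4.0                 # illustrative support bound, mean, order quantity
grid = [U * k / 20 for k in range(21)]    # 0, 0.5, ..., 10
value, dist = worst_case_cost_on_grid(x, mu, grid)

# Closed form for the two-point distribution on {0, U} with P(U) = mu / U
closed_form = (1 - mu / U) * newsvendor_cost(x, 0.0) + (mu / U) * newsvendor_cost(x, U)
print(value, closed_form, dist)
```

In this example the enumeration selects the endpoints 0 and U, so one of the two support points is zero, which mirrors the structure described above without reproducing the multistage analysis.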
Yang [334] and Van Parys et al. [307] study a stochastic optimal control model to minimize the worst-case probability that a system remains in a safe region for all stages. Yang [334] forms the ambiguity set at each stage by all distributions for which the componentwise mean of random parameters is within an interval, while the covariance is in a positive semidefinite cone. Van Parys et al. [307] form the ambiguity set by all distributions with a known mean and covariance.

Generalized Moment and Measure Inequalities.
In this section we review an ambiguity set that allows one to model the support of the random vector and to impose bounds on the probability measure as well as on functions of the random vector, as follows:
$$\mathcal{P}^{MM} := \left\{ P \in \mathfrak{M}_+(\Xi, \mathcal{F}) \;:\; \nu_1 \preceq P \preceq \nu_2, \ \ l \le \int_{\Xi} f \, dP \le u \right\}, \tag{5.23}$$
where $\nu_1, \nu_2 \in \mathfrak{M}_+(\Xi, \mathcal{F})$ are two given measures that impose lower and upper bounds on a measure $P \in \mathfrak{M}_+(\Xi, \mathcal{F})$, and $f := [f_1, \dots, f_m]$ is a vector of measurable functions on $(\Xi, \mathcal{F})$, with $m \ge 1$. The first constraint in (5.23) enforces a preference relationship between probability measures. To ensure that $P$ is a probability measure, i.e., $P \in \mathfrak{M}(\Xi, \mathcal{F})$, we set $l_1 = u_1 = 1$ and $f_1 = 1$ in the above definition of $\mathcal{P}^{MM}$. Shapiro and Ahmed [291] propose this framework, and special cases of it appear in Popescu [238], Bertsimas and Popescu [33], Perakis and Roels [228], Mehrotra and Papp [202], among others. Note that if the first constraint in (5.23) is disregarded (i.e., we only have $P \succeq 0$), then we obtain the constraints of a classical problem of moments, see, e.g., Landau [180]. Using this unified set, one can impose bounds on the standard moments by setting the $i$th entry of $f$ to have the form $f_i(\xi) := (\xi_1)^{k_{i1}} (\xi_2)^{k_{i2}} \cdots (\xi_d)^{k_{id}}$, where $k_{ij}$ is a nonnegative integer indicating the power of $\xi_j$ for the $i$th moment function. Other possible choices for the functions $f$ include the mean absolute deviation, the (co-)variances, the semi-variance, higher-order moments, and the Huber loss function. Moreover, proper choices of $f$ give the flexibility to impose structural properties on the probability distribution, see, e.g., Popescu [238] and Perakis and Roels [228], who model the unimodality and symmetry of distributions within this framework (see also Section 5.3).
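As a small illustration of the moment part of this ambiguity set, the sketch below builds monomial moment functions from exponent tuples and checks whether a discrete distribution satisfies the bounds l ≤ E_P[f] ≤ u. It deliberately ignores the measure bounds ν1 ⪯ P ⪯ ν2, and the helper names, atoms, and bounds are hypothetical values chosen only for the example.

```python
from math import prod

def monomial(exponents):
    """Return f(xi) = prod_j xi_j ** k_j for nonnegative integer exponents k_j."""
    return lambda xi: prod(x ** k for x, k in zip(xi, exponents))

def in_moment_set(atoms, probs, functions, lower, upper, tol=1e-9):
    """Check l_i <= E_P[f_i(xi)] <= u_i for a discrete distribution P."""
    for f, l, u in zip(functions, lower, upper):
        moment = sum(p * f(xi) for xi, p in zip(atoms, probs))
        if moment < l - tol or moment > u + tol:
            return False
    return True

# f_1 = 1 with l_1 = u_1 = 1 forces P to be a probability measure.
# The remaining entries bound E[xi_1], E[xi_2], and the cross moment E[xi_1 * xi_2].
functions = [monomial((0, 0)), monomial((1, 0)), monomial((0, 1)), monomial((1, 1))]
lower = [1.0, 0.9, 1.9, 2.5]
upper = [1.0, 1.1, 2.1, 3.5]

atoms = [(0.0, 1.0), (2.0, 3.0)]
print(in_moment_set(atoms, [0.5, 0.5], functions, lower, upper))  # True
print(in_moment_set(atoms, [0.9, 0.1], functions, lower, upper))  # False
```

Swapping a monomial for another measurable function (for example an absolute deviation) changes only the list `functions`, which is the flexibility the unified set is meant to provide.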
Below, we present a duality result on $\sup_{P \in \mathcal{P}^{MM}} \mathbb{E}_P[h(x, \tilde{\xi})]$, given a fixed $x \in \mathcal{X}$.
Moreover, suppose that $f$ is $\nu_2$-integrable, and there exists $\nu_1 \preceq P \preceq \nu_2$ such that $\int_{\Xi} f \, dP \in (l, u)$. If $\sup_{P \in \mathcal{P}^{MM}} \mathbb{E}_P[h(x, \tilde{\xi})]$ is finite, then it can be written as the optimal value of a dual problem.
Shapiro and Ahmed [291] focus on a special case of (5.23), where the first constraint is written as $(1 - \epsilon) P^* \preceq P \preceq (1 + \epsilon) P^*$, for some reference measure $P^*$, and they identify the coherent risk measure corresponding to the studied DRO. They further study the class of problems with a convex objective function $h$ and two-stage stochastic programs. Popescu [238], Bertsimas and Popescu [33], and Mehrotra and Papp [202] study the classical problem of moments, i.e., the ambiguity set is formed via only the second constraint in (5.23). When $f$ are moment functions, Mehrotra and Papp [202] show that under mild conditions (continuous function $h$ and compact support Ω), the optimal value of a sequence of problems of the form (1.5), where the ambiguity set is constructed via an increasing number of moments of the underlying probability distributions, with moments matched to those under a reference distribution, converges to the optimal value of a problem of the form (1.1) under the reference distribution. Moreover, using the SIP reformulation of (1.5), Mehrotra and Papp [202] propose a cutting surface method to solve a convex (1.5). This method can be applied to problems where bounds on moments of arbitrary order, and possibly bounds on nonpolynomial moments, are available.
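For the band-type measure constraint around a reference measure discussed above, the inner maximization becomes especially simple on a finite support. The sketch below assumes a discrete reference distribution and a single parameter ε (written `eps`); it starts every atom at its lower bound and then assigns the remaining mass ε greedily to the atoms with the largest cost. In this finite setting the resulting value coincides with (1 − ε)·E[h] + ε·CVaR at level 1/2 under the reference distribution. That identity is worked out for this sketch and is not quoted from [291]; all names are hypothetical.

```python
def worst_case_expectation_band(values, ref_probs, eps):
    """sup E_P[h] over P with (1-eps)*p_ref <= p <= (1+eps)*p_ref and sum(p) = 1."""
    assert 0.0 <= eps <= 1.0
    p = [(1 - eps) * q for q in ref_probs]      # every atom starts at its lower bound
    budget = eps                                # mass still to be assigned
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    for i in order:                             # largest costs receive mass first
        add = min(2 * eps * ref_probs[i], budget)
        p[i] += add
        budget -= add
        if budget <= 1e-15:
            break
    return sum(pi * v for pi, v in zip(p, values)), p

def cvar(values, probs, alpha):
    """CVaR_alpha of a discrete loss via the Rockafellar-Uryasev formula."""
    return min(
        eta + sum(p * max(v - eta, 0.0) for v, p in zip(values, probs)) / (1 - alpha)
        for eta in values
    )

values = [1.0, 2.0, 3.0, 4.0]
ref = [0.25, 0.25, 0.25, 0.25]
eps = 0.2

wc, p_star = worst_case_expectation_band(values, ref, eps)
mean = sum(q * v for q, v in zip(ref, values))
risk_form = (1 - eps) * mean + eps * cvar(values, ref, 0.5)
print(wc, risk_form, p_star)
```

The greedy rule is valid here because the problem is a linear program with box constraints and a single equality constraint, i.e., a continuous knapsack.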
Royset and Wets [265] study a DRO model with a decision-dependent ambiguity set, where the ambiguity set has the form of (5.23), without the second set of constraints, and the first constraint is formed via the decision-dependent cumulative distribution functions (cdf). They establish the convergence properties of the solutions to this problem by exploiting and refining results in variational analysis.
Besides Shapiro and Ahmed [291], there are other studies that focus on special types of the cost function h. Two-stage stochastic programs have received much attention in this class. Chen et al. [73] consider a two-stage stochastic linear complementarity problem, where the underlying random data are continuously distributed. They study a distributionally robust approach to this problem, where the ambiguity set of distributions is formed via (5.23) without the first constraint, and propose a discretization scheme to solve the problem. They investigate the asymptotic behavior of the approximate solution as the number of discrete partitions of the sample space Ξ increases. As an application, they study a robust game in a duopoly market where two players need to make strategic decisions on capacity for future production in anticipation of Nash-Cournot type competition after demand uncertainty is observed. There are studies that consider only lower-order moments, up to order 2. Ardestani-Jaafari and Delage [4] study a distributionally robust multi-item newsvendor problem, where the ambiguity set contains all distributions with a known budgeted support, mean, and partial first-order moments. To provide a reformulation of the problem, they propose a conservative approximation scheme for maximizing the sum of piecewise linear functions over a polyhedral uncertainty set, based on the relaxation of an associated mixed-integer LP. They show that for the newsvendor problem studied, this approximation is exact and is a linear program.

Discrete Problems.
Bansal et al. [9] study a (two-stage) distributionally robust integer program with pure binary first-stage and mixed-binary second-stage decisions on a finite set of scenarios. They propose a decomposition-based L-shaped algorithm and a cutting surface algorithm to solve the resulting model. They investigate the conditions and ambiguity sets of distributions under which the proposed algorithm is finitely convergent. They show that ambiguity sets of distributions formed via (5.23) without the first constraint satisfy these conditions. Hanasusanto et al. [135] study a finite adaptability scheme to approximate a two-stage distributionally robust linear program with binary recourse decisions and an optimized certainty equivalent as a risk measure, where $h(x, \xi) = \min_{y} \{ q\, Q\, y(\xi) : W y(\xi) \ge R\xi - Tx, \ y(\xi) \in \{0, 1\}^{q_2} \}$, and $\mathcal{R}_P[h(x, \tilde{\xi})]$ is an optimized certainty equivalent risk measure corresponding to the utility function $u$ [22,23]. As an alternative to the affine recourse approximation, they pre-determine a finite set of recourse decisions here-and-now, and implement the best among them after the realization is observed. They form the ambiguity set of distributions as in (5.23) but without the first constraint, where the support is assumed to be a polytope and the functions $f_i$ are convex piecewise linear in $\tilde{\xi}$. They derive an equivalent mixed-integer LP for the resulting model. They also obtain, as linear programs, upper and lower bounds on the probability with which any of these recourse decisions is chosen under any ambiguous distribution. Postek et al. [242] study a two-stage stochastic integer program, where the second-stage problem is a mixed-integer program. They model the distributional ambiguity by all distributions whose mean and mean-absolute deviation are known.
While they show that the problem reduces to a two-stage stochastic program when there are no discrete variables, they develop a general approximation framework for the DRO problem with integer variables. They apply their results to a surgery block allocation problem.

Risk and Chance Constraints.
Bertsimas and Popescu [33] study the worst-case bound on the probability of a multivariate random vector falling outside a semialgebraic confidence region (i.e., a set described via polynomial inequalities) over an ambiguity set of the form (5.23), where the functions f are represented by all polynomials of up to kth order. For the univariate case, they obtain the result as an SDP. In particular, they obtain closed-form bounds when k ≤ 3. For the multivariate case, they show that such a bound can be obtained via a family of SDP relaxations, yielding a sequence of increasingly stronger, asymptotically exact upper bounds, each of which is calculated via an SDP. A special case of Bertsimas and Popescu [33] appears in Vandenberghe et al. [310], where the confidence region is described via linear and quadratic inequalities, and the first two moments are assumed to be known within the ambiguity set.
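A familiar univariate instance of such moment-based tail bounds is the one-sided Chebyshev (Cantelli) inequality, which uses only the mean and variance. The sketch below assumes an unrestricted real-valued support and a threshold above the mean; it computes the bound and constructs a two-point distribution that attains it, which shows that the bound is tight. This is the classical result, shown for illustration, not a reproduction of the SDP formulations in [33]; the function names are hypothetical.

```python
def cantelli_bound(mu, sigma, a):
    """Tight upper bound on P(xi >= a) given mean mu and variance sigma**2, for a > mu."""
    if a <= mu:
        raise ValueError("this sketch only covers thresholds above the mean")
    k2 = ((a - mu) / sigma) ** 2
    return 1.0 / (1.0 + k2)

def extremal_two_point(mu, sigma, a):
    """Two-point distribution with the given mean and variance that attains the bound."""
    p = cantelli_bound(mu, sigma, a)
    low = mu - sigma ** 2 / (a - mu)   # second atom, placed below the mean
    return [low, a], [1.0 - p, p]

mu, sigma, a = 0.0, 1.0, 2.0
bound = cantelli_bound(mu, sigma, a)
atoms, probs = extremal_two_point(mu, sigma, a)

# Verify that the extremal distribution matches both moments and attains the bound.
mean = sum(p * x for x, p in zip(atoms, probs))
var = sum(p * (x - mean) ** 2 for x, p in zip(atoms, probs))
tail = sum(p for x, p in zip(atoms, probs) if x >= a)
print(bound, atoms, probs, mean, var, tail)
```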
Building on Chen et al. [73], Liu et al. [191] study a distributionally robust reward-risk ratio model, based on a variation of the Sharpe ratio. The ambiguity set contains all distributions whose componentwise means and covariances are restricted to intervals. They turn this problem into a model with a distributionally robust inequality constraint, and further reformulate this model as a nonconvex SIP. They approximate the semi-infinite constraint with an entropic risk measure approximation 20 and provide an iterative method to solve the resulting model. They provide a statistical analysis to assess the likelihood of the true probability distribution lying in the ambiguity set, and provide a convergence analysis of the optimal value and solutions of the data-driven distributionally robust reward-risk ratio problems. The results are applied to a portfolio optimization problem.
Nemirovski and Shapiro [213] study a convex approximation, referred to as the Bernstein approximation, to an ambiguous joint chance-constrained problem of the form (5.24). The approximation, in which $z_i(x) = (g_{i1}(x), \dots, g_{id}(x))$, is a conservative approximation of problem (5.24), i.e., every feasible solution to the approximation is feasible for the chance-constrained problem (5.24). This approximation is a convex program and is efficiently solvable, provided that all $g_{ij}$ and $\Psi$ are efficiently computable, and $\mathcal{X}$ is computationally tractable.
20 For a measurable function $Z \in \mathcal{Z}_\infty(Q)$, the entropic risk measure is defined as $\frac{1}{\gamma} \ln \mathbb{E}_Q\left[e^{\gamma Z}\right]$, where $\gamma > 0$ [191].
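To see why an entropic risk measure can serve as a smooth surrogate for a worst-case (semi-infinite) constraint, the sketch below evaluates it for a discrete random variable, assuming the standard form (1/γ) ln E[exp(γZ)]. The value lies between the mean and the maximum of Z, is nondecreasing in γ, and approaches the maximum as γ grows. The data and function name are hypothetical, and the computation uses a shifted exponent for numerical stability.

```python
import math

def entropic_risk(values, probs, gamma):
    """(1/gamma) * log E[exp(gamma * Z)] for a discrete Z, computed stably."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    m = max(values)  # shift so that every exponent is <= 0
    s = sum(p * math.exp(gamma * (z - m)) for z, p in zip(values, probs))
    return m + math.log(s) / gamma

values = [-1.0, 0.5, 2.0]
probs = [0.3, 0.5, 0.2]
mean = sum(p * z for z, p in zip(values, probs))

gammas = [0.1, 1.0, 10.0, 100.0, 1000.0]
risks = [entropic_risk(values, probs, g) for g in gammas]
for g, r in zip(gammas, risks):
    print(g, r)
```

The gap to the maximum at large γ is of order ln(1/p_max)/γ, where p_max is the probability of the largest atom, which is one way to judge how large γ must be in such an approximation.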
Hanasusanto et al. [136] study a distributionally robust joint chance-constrained stochastic program where each chance constraint is linear in $\tilde{\xi}$, and the technology matrix and right-hand side are affine in $x$. They form the ambiguity set of distributions as in (5.23) without the first constraint. They show that the pessimistic model (i.e., the chance constraint holds for every distribution in the set) is conic-representable if the technology matrix is constant in $x$, the support set is a cone, and each $f_i$ is positively homogeneous. They also show that the optimistic model (i.e., the chance constraint holds for at least one distribution in the set) is conic-representable if the technology matrix is constant in $x$. They apply their results to problems in project management and image reconstruction. While their formulation is exact for the distributionally robust chance-constrained project crashing problem, the size of the formulation grows with the number of paths in the network. For other research on chance-constrained optimization problems, we refer to Xie et al. [324], Xie and Ahmed [321].

Statistical Learning.
Fathony et al. [104] study a distributionally robust approach to graphical models for leveraging the graphical structure among the variables. The proposed model in Fathony et al. [104] seeks a predictor that makes a probabilistic prediction P(ŷ|u) over all possible label assignments so that it minimizes the worst-case conditional expectation of the prediction loss l(ŷ, ȳ) with respect to P(ȳ|u). The worst case in this formulation is taken with respect to all conditional distributions of the predictor, conditioned on the covariates. The conditional distribution P(ȳ|u) is such that the first-order moment of the feature function Φ(U, Y) matches the first-order moment under the empirical joint distribution of the covariates and labels. Fathony et al. [104] show that the DRO approach enjoys the consistency guarantees of probabilistic graphical models, see, e.g., Lafferty et al. [174], and has the advantage of incorporating customized loss metrics during training as in large margin models, see, e.g., Tsochantaridis et al. [302].

Moment Matrix Inequalities.
In this section we review an ambiguity set, denoted $\mathcal{P}^{MMI}$, that generalizes both the ambiguity set $\mathcal{P}^{DY}$ (5.22) and the ambiguity set $\mathcal{P}^{MM}$ (5.23); see Xu et al. [331], where the moment constraints are either in the form of an equality or an upper bound. Note that as a special case of $\mathcal{P}^{MMI}$, we can set $F_i$, $L_i$, and $U_i$ to be scalars, $i = 2, \dots, m$, to recover the second constraint in the ambiguity set $\mathcal{P}^{MM}$, defined in (5.23). Moreover, by setting $F_2$ to be a ma-
Below, we present a duality result on $\sup_{P \in \mathcal{P}^{MMI}} \mathbb{E}_P[h(x, \tilde{\xi})]$, given a fixed $x \in \mathcal{X}$.
Theorem 5.11. For a fixed $x \in \mathcal{X}$, suppose that $h(x, \tilde{\xi})$ and $F$ are integrable for all $P \in \mathcal{P}^{MMI}$. In addition, suppose that a Slater-type condition holds. If $\sup_{P \in \mathcal{P}^{MMI}} \mathbb{E}_P[h(x, \tilde{\xi})]$ is finite, then it can be written as the optimal value of a dual problem.
Proof. The dual is obtained from the conic duality results of Theorem 4.3, using the dual cone of $\mathfrak{M}_+(\Xi, \mathcal{F})$. The Slater-type condition ensures that strong duality holds [284].
Suppose that every finite subset of Ξ is $\mathcal{F}$-measurable, i.e., for every $s \in \Xi$, the corresponding Dirac measure $\delta(s)$ (of mass one at point $s$) belongs to $\mathfrak{M}_+(\Xi, \mathcal{F})$. Then, the first constraint in Theorem 5.11 can be rewritten [284].
Motivated by the difficulty in verifying the Slater-type conditions that guarantee strong duality for $\sup_{P \in \mathcal{P}^{MMI}} \mathbb{E}_P[h(x, \tilde{\xi})]$ and its dual, Xu et al. [331] investigate the duality conditions from the perspective of lower semicontinuity of the optimal value function of the inner maximization problem, with a perturbed ambiguity set. While these conditions are restrictive in general, they show that they are satisfied in the case of compact Ξ or bounded $F_i$. Xu et al. [331] present two discretization schemes to solve the resulting DRO model: (1) a cutting-plane-based exchange method that discretizes the ambiguity set $\mathcal{P}^{MMI}$ and (2) a cutting-plane-based dual method that discretizes the semi-infinite constraint of the dual problem. For both methods, they show the convergence of the optimal values and optimal solutions as the sample size increases. They illustrate their results for the portfolio optimization and multiproduct newsvendor problems.

Cross-Moment or Nested Moment.
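In an attempt to unify modeling and solving DRO models, Wiesemann et al. [318] propose a framework for modeling the ambiguity set of probability distributions as follows:
$$\mathcal{P}^{WKS} := \left\{ P \;:\; \mathbb{E}_P\left[A\tilde{\xi} + B\tilde{u}\right] = b, \ \ P\left\{(\tilde{\xi}, \tilde{u}) \in \mathcal{C}_i\right\} \in \left[\underline{p}_i, \overline{p}_i\right], \ i \in \mathcal{I} \right\}, \tag{5.26}$$
where $P$ represents a joint probability distribution of $\tilde{\xi}$ and some auxiliary random vector $\tilde{u} \in \mathbb{R}^r$. Moreover, $A \in \mathbb{R}^{s \times d}$, $B \in \mathbb{R}^{s \times r}$, $b \in \mathbb{R}^s$, and $\mathcal{I} = \{1, \dots, I\}$, while the confidence sets $\mathcal{C}_i$ are defined as
$$\mathcal{C}_i := \left\{ (\xi, u) \in \mathbb{R}^d \times \mathbb{R}^r \;:\; C_i \xi + D_i u \preceq_{\mathcal{K}_i} c_i \right\},$$
with $C_i \in \mathbb{R}^{L_i \times d}$, $D_i \in \mathbb{R}^{L_i \times r}$, $c_i \in \mathbb{R}^{L_i}$, and $\mathcal{K}_i$ being a proper cone. By setting $\underline{p}_I = \overline{p}_I = 1$, they ensure that $\mathcal{C}_I$ contains the support of the joint random vector $(\tilde{\xi}, \tilde{u})$. This set contains all distributions with prescribed conic-representable confidence sets and with mean values residing on an affine manifold. An important aspect of (5.26) is that the inclusion of an auxiliary random vector $\tilde{u}$ gives the flexibility to model a rich variety of structural information about the marginal distribution of $\tilde{\xi}$ in a unified manner. Using this framework, Wiesemann et al. [318] show that many ambiguity sets studied in the literature can be represented by a projection of the ambiguity set (5.26) on the space of $\tilde{\xi}$. In other words, these ambiguity sets are special cases of the ambiguity set $\mathcal{P}^{WKS}$. This development is based on the following lifting result.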
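Theorem 5.12. (Wiesemann et al. [318, Theorem 5]) Let $f \in \mathbb{R}^N$ and $l : \mathbb{R}^d \to \mathbb{R}^N$ be a function with a conic-representable $\mathcal{K}$-epigraph, and consider the ambiguity set $\mathcal{P}' := \{ P : \mathbb{E}_P[l(\tilde{\xi})] \preceq_{\mathcal{K}} f \}$, as well as the lifted ambiguity set $\bar{\mathcal{P}} := \{ P : \mathbb{E}_P[\tilde{u}] = f, \ P\{ l(\tilde{\xi}) \preceq_{\mathcal{K}} \tilde{u} \} = 1 \}$, which involves the auxiliary random vector $\tilde{u} \in \mathbb{R}^N$. We have that (i) $\mathcal{P}'$ is the union of all marginal distributions of $\tilde{\xi}$ under all $P \in \bar{\mathcal{P}}$ and (ii) $\bar{\mathcal{P}}$ can be formulated as an instance of the ambiguity set $\mathcal{P}^{WKS}$ in (5.26).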
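One direction of this lifting idea can be checked numerically in a scalar case. The sketch below assumes the lifting takes the form E[u] = f together with l(ξ) ≤ u almost surely (this form is an assumption made for the example), with l(ξ) = |ξ − m| so that the original constraint bounds the mean absolute deviation. Given a discrete distribution satisfying E[l(ξ)] ≤ f, it constructs an auxiliary variable u by adding the constant slack to l(ξ). All names and numbers are hypothetical.

```python
def lift(atoms, probs, l, f):
    """Given P on xi with E[l(xi)] <= f, return atoms (xi, u) with E[u] = f and l(xi) <= u."""
    slack = f - sum(p * l(x) for x, p in zip(atoms, probs))
    if slack < 0:
        raise ValueError("distribution violates E[l(xi)] <= f")
    # Adding the same nonnegative constant to l(xi) keeps u >= l(xi) and fixes E[u] = f.
    return [(x, l(x) + slack) for x in atoms]

m = 0.0
abs_dev = lambda x: abs(x - m)   # l(xi) = |xi - m|, so E[l] is the mean absolute deviation

atoms = [-2.0, 0.0, 1.0]
probs = [0.25, 0.5, 0.25]
f = 1.0                          # upper bound on the mean absolute deviation

lifted = lift(atoms, probs, abs_dev, f)
mean_u = sum(p * u for (_, u), p in zip(lifted, probs))
print(lifted, mean_u)
```

The converse direction is immediate: if l(ξ) ≤ u almost surely and E[u] = f, then E[l(ξ)] ≤ f, so the marginal of ξ belongs to the original set.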
Using Theorem 5.12, Wiesemann et al. [318] show how an ambiguity set of the form P WKS , defined in (5.26), with conic-representable expectation constraints and a collection of conic-representable confidence sets, can represent ambiguity sets formed via (1) φ-divergences, (2) mean, (3) mean and upper bound on the covariance matrix (i.e., a special case of the ambiguity set (5.22)), (4) coefficient of variation (i.e., the inverse of signal-to-noise ratio from information theory), (5) absolute mean spread, and (6) higher-order moment information. Moreover, they illustrate that (5.26) can capture information from robust statistics, such as (7) marginal median, (8) marginal median-absolute deviation, and (9) known upper bound on the expected Huber loss function. It is worth noting that (5.26) does not cover ambiguity sets that impose infinitely many moment restrictions that would be required to describe symmetry, independence, or unimodality characteristics of the distributions [78].
Wiesemann et al. [318] determine conditions under which distributionally robust expectation constraints, formed via the proposed ambiguity set (5.26), can be solved in polynomial time: (i) the cost function $g_j$, $j = 1, \dots, m$, is convex and piecewise affine in $x$ and $\tilde{\xi}$ (i.e., $g_j(x, \tilde{\xi}) := \max_{k \in \{1, \dots, K\}} g_{jk}(x, \tilde{\xi})$ with $g_{jk}(x, \tilde{\xi}) := s_{jk}(\tilde{\xi})^\top x + t_{jk}(\tilde{\xi})$ such that $s_{jk}(\tilde{\xi})$ and $t_{jk}(\tilde{\xi})$ are affine in $\tilde{\xi}$) and (ii) the confidence sets $\mathcal{C}_i$ satisfy a strict nesting condition. Below, we present a duality result under the above assumptions and additional regularity conditions.
Theorem 5.13. The distributionally robust expectation constraint is satisfied if and only if there exist $\beta \in \mathbb{R}^K$, $\kappa, \lambda \in \mathbb{R}^I_+$, and $\alpha_{ik} \in \mathcal{K}_i$, $i \in \mathcal{I}$ and $k \in \{1, \dots, K\}$, that satisfy a system of conic constraints in which $\mathcal{A}(i)$ denotes the set of all $i' \in \mathcal{I}$ such that $\mathcal{C}_i$ is strictly contained in the interior of $\mathcal{C}_{i'}$.
The tractability of the resulting system in Theorem 5.13 depends on how the confidence sets C i are described, and hence, they give rise to linear, conic-quadratic, or semidefinite programs for the corresponding confidence sets C i . Wiesemann et al. [318] also provide tight tractable conservative approximations for problems that violate the nesting condition by proposing an outer approximation of (5.26). They discuss several mild modifications of the conditions on g.
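The nesting requirement on the confidence sets is easy to check when the sets are intervals on the real line. The sketch below assumes the condition means that every pair of confidence sets is either strictly nested or disjoint; this reading is an assumption made for the example and should be checked against [318]. Intervals are hypothetical (lower, upper) tuples.

```python
def strictly_inside(inner, outer):
    """True if the closed interval `inner` lies in the interior of `outer`."""
    return outer[0] < inner[0] and inner[1] < outer[1]

def disjoint(a, b):
    """True if the closed intervals a and b do not intersect."""
    return a[1] < b[0] or b[1] < a[0]

def satisfies_nesting(intervals):
    """Check that every pair of intervals is strictly nested or disjoint."""
    for idx, a in enumerate(intervals):
        for b in intervals[idx + 1:]:
            if not (strictly_inside(a, b) or strictly_inside(b, a) or disjoint(a, b)):
                return False
    return True

nested = [(0.0, 10.0), (2.0, 5.0), (6.0, 8.0)]        # two disjoint sets inside the support
overlapping = [(0.0, 10.0), (4.0, 7.0), (6.0, 8.0)]   # (4, 7) and (6, 8) partially overlap

print(satisfies_nesting(nested))       # True
print(satisfies_nesting(overlapping))  # False
```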
There are several papers that use the ambiguity set (5.26) and consider its generalization or special cases. Chen et al. [78] introduce an ambiguity set of probability distributions that is characterized by conic-representable expectation constraints and a conic-representable support set, similar to the one studied in Wiesemann et al. [318]. However, unlike Wiesemann et al. [318], an infinite number of expectation constraints can be incorporated into the ambiguity set to describe stochastic dominance, entropic dominance, and dispersion, among others. A main result in this work is that for any ambiguity set, there exists an infinitely constrained ambiguity set such that the worst-case expectations of $h(x, \tilde{\xi})$ over both sets are equal, provided that the objective function $h(x, \tilde{\xi})$ is tractable and conic-representable in $\tilde{\xi}$ for any $x \in \mathcal{X}$. Reformulation of the resulting DRO model formed via this infinitely constrained ambiguity set yields a conic optimization problem. To solve the model, Chen et al. [78] propose a procedure that solves a sequence of relaxed DRO problems, each of which considers a finitely constrained ambiguity set and results in a conic optimization reformulation, and that converges to the optimal value of the original DRO model. When incorporating covariance and fourth-order moment information into the ambiguity set, they show that the relaxed DRO is an SOCP. This is different from Delage and Ye [82], who show that a DRO problem formed via a fixed mean and an upper bound on the covariance is reformulated as an SDP.
Postek et al. [241] derive an exact reformulation of the worst-case expected constraints when the function $g(x, \cdot)$ is convex in $\tilde{\xi}$, and the ambiguity set of distributions consists of all distributions of componentwise independent $\tilde{\xi}$ with known support, mean, and mean-absolute deviation information. They also obtain an exact reformulation of the resulting model when $g(x, \cdot)$ is concave in $\tilde{\xi}$ and there is additional information on the probability that a component is greater than or equal to its mean. These reformulations involve a number of terms that is exponential in the dimension of $\tilde{\xi}$. They show how upper bounds can be constructed that alleviate the independence restriction and require only a linear number of terms, by exploiting models in which random variables are linearly aggregated and the function $g(x, \cdot)$ is convex. Under the assumption of independent random variables, they use the above results for the worst-case expected constraints to derive safe approximations to the corresponding individual chance-constrained problems.
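The exponential number of terms can be made concrete with the classical three-point result for a scalar random variable with known support [a, b], mean μ, and mean-absolute deviation d: for a convex function, the worst-case expectation is attained by a distribution on {a, μ, b}. The sketch below assumes this classical result (attributed in the literature to Ben-Tal and Hochman) as the building block; with n independent components, a product of such distributions has 3^n atoms. Function names and numbers are hypothetical.

```python
def three_point_worst_case(a, b, mu, d):
    """Atoms and probabilities on {a, mu, b} with mean mu and mean-absolute deviation d."""
    if not (a < mu < b):
        raise ValueError("mean must lie strictly inside the support")
    p_a = d / (2.0 * (mu - a))
    p_b = d / (2.0 * (b - mu))
    p_mu = 1.0 - p_a - p_b
    if d < 0 or p_mu < 0:
        raise ValueError("d is not a feasible mean-absolute deviation")
    return [a, mu, b], [p_a, p_mu, p_b]

def worst_case_expectation(g, a, b, mu, d):
    """sup E[g(xi)] for convex g under the support, mean, and MAD information."""
    atoms, probs = three_point_worst_case(a, b, mu, d)
    return sum(p * g(x) for x, p in zip(atoms, probs))

a, b, mu, d = 0.0, 10.0, 4.0, 2.0
atoms, probs = three_point_worst_case(a, b, mu, d)
g = lambda x: max(x - 3.0, 0.0)          # a convex piecewise linear function
value = worst_case_expectation(g, a, b, mu, d)

mean = sum(p * x for x, p in zip(atoms, probs))
mad = sum(p * abs(x - mu) for x, p in zip(atoms, probs))
n = 5
print(probs, value, mean, mad, 3 ** n)   # 3**n atoms for n independent components
```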
To reduce the conservatism of robust optimization due to its constraint-wise approach and the assumption that all constraints are hard for all scenarios in the uncertainty set, Roos and den Hertog [264] propose an approach that bounds the worst-case expected total violation of constraints from above and condenses all constraints into a single constraint. They form the ambiguity set with all distributions of $\tilde{\xi}$ with known support, mean, and mean-absolute deviation information. When the right-hand side is uncertain, they use the results in Postek et al. [241] to show that the proposed formulation is tractable. When the left-hand side is uncertain, they use the aggregation approach introduced in Postek et al. [241] to derive tractable reformulations. We also refer to Sun et al. [301] for a two-stage quadratic stochastic optimization problem and DeMiguel and Nogales [84] for a portfolio optimization problem.
Bertsimas et al. [44] develop a modular and tractable framework for solving an adaptive distributionally robust two-stage linear optimization problem with recourse, where $h(x, \xi) = \min_{y} \{ q^\top y(\xi) : W y(\xi) \ge r(\xi) - T(\xi)x, \ y(\xi) \in \mathbb{R}^q \}$, and the functions $r(\xi)$ and $T(\xi)$ depend affinely on $\xi$. Both the ambiguity set of probability distributions $\mathcal{P}$ and the support set are assumed to be second-order conic-representable. Such an ambiguity set is a special case of the conic-representable ambiguity set (5.26). They show that the studied DRO model can be formulated as a classical RO problem with a second-order conic-representable uncertainty set. To obtain a tractable formulation, they replace the recourse decision functions $y(\xi)$ with generalized linear decision rules that have affine dependency on the uncertain parameters $\xi$ and some auxiliary random variables 21. By adopting the approach of Wiesemann et al. [318] to lift the ambiguity set to an extended one by introducing additional auxiliary random variables, they improve the quality of solutions and show that one can transform the adaptive DRO problem to a classical RO problem with a second-order conic-representable uncertainty set. Bertsimas et al. [44] discuss extensions to the conic-representable ambiguity set (5.26) and to multistage problems. They also apply their results to medical appointment scheduling and single-item multiperiod newsvendor problems.
Following the approach in Bertsimas et al. [44], Zhen et al. [344] reformulate an adaptive distributionally robust two-stage linear optimization problem with recourse into an adaptive robust two-stage optimization problem with recourse. Then, using Fourier-Motzkin elimination, they reformulate this problem into an equivalent problem with a reduced number of adjustable variables at the expense of an increased number of constraints. Although, from a theoretical perspective, every adaptive robust two-stage optimization problem with recourse admits an equivalent static reformulation, they propose to eliminate some of the adjustable variables, and for the remaining adjustable variables, they impose linear decision rules to obtain an approximate solution. They show that for problems with simplex uncertainty sets, linear decision rules are optimal, and for problems with box uncertainty sets, there exist convex two-piecewise affine functions that are optimal for the adjustable variables. By studying the medical appointment scheduling problem considered in Bertsimas et al. [44], they show that their approach improves the solutions obtained in Bertsimas et al. [44].
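One way to see why linear decision rules can be lossless on a simplex is that an affine function is fully determined by its values at the d + 1 vertices, so any collection of vertex-wise recourse decisions can be interpolated exactly. The sketch below assumes the standard simplex with vertices 0, e_1, ..., e_d (a simplification of the general simplex uncertainty sets mentioned above) and hypothetical recourse values. It illustrates the interpolation step only, not the full optimality argument.

```python
def affine_rule_from_vertices(y_origin, y_units):
    """Build y(xi) = y0 + sum_j Y_j * xi_j that matches given values at 0, e_1, ..., e_d."""
    d = len(y_units)
    # Slope along coordinate j is the change in recourse from the origin to e_j.
    slopes = [[yu - y0 for yu, y0 in zip(y_unit, y_origin)] for y_unit in y_units]

    def rule(xi):
        return [
            y0 + sum(slopes[j][k] * xi[j] for j in range(d))
            for k, y0 in enumerate(y_origin)
        ]

    return rule

# Hypothetical recourse decisions (two recourse variables) at the three vertices of a 2-simplex
y_origin = [1.0, 0.0]
y_units = [[3.0, 1.0], [0.0, 2.0]]
rule = affine_rule_from_vertices(y_origin, y_units)

xi = [0.2, 0.3]                       # convex weights: 0.5 on the origin, 0.2 on e_1, 0.3 on e_2
interpolated = rule(xi)
convex_combo = [0.5 * a + 0.2 * b + 0.3 * c for a, b, c in zip(y_origin, *y_units)]
print(interpolated, convex_combo)
```

Since the rule reproduces the convex combination of vertex decisions at every point of the simplex, constraints that are jointly linear in the uncertainty and the recourse and that hold at the vertices continue to hold under the affine rule.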

Statistical Learning.
Gong et al. [123] study a distributionally robust multiple linear regression model with the least absolute value cost function. They form the ambiguity set of distributions using expectation constraints over a conic-representable support set as in (5.26). They reformulate the resulting model as a conic optimization problem, based on the results in Wiesemann et al. [318].

5.2.5.2. Multistage Setting.
A Markov decision process with unknown distribution for the transition probabilities and rewards for each state is studied in Xu and Mannor [329,328]. It is assumed that the parameters are statewise independent and each state belongs to only one stage. Moreover, the parameters of each state are constrained to a sequence of nested sets, such that the parameters belong to the largest set with probability one, and there is a lower bound on the probability that they belong to the other sets, in an increasing manner. Yu and Xu [338] extend the work in Xu and Mannor [329,328] by forming the ambiguity set of distributions as in (5.26).

Marginals (Fréchet).
All the moment-based ambiguity sets discussed so far study the ambiguity of the joint probability distribution of the random vector ξ. Papers reviewed in this section assume that additional information on the marginal distributions is available. We refer to the class of joint distributions with fixed marginal distributions as the Fréchet class of distributions [91].

5.2.6.1. Discrete Problems.
Chen et al. [69] study a problem of the form (1.5), where the cost function $h(x, \tilde{\xi})$ denotes the optimal value of a linear or discrete optimization problem with random linear objective coefficients. They assume that the ambiguity set of distributions is formed by all distributions with known marginals. Using techniques from optimal transport theory, they identify a set of sufficient conditions for the polynomial time solvability of this class of problems. This generalizes the tractability results under marginal information from 0-1 polytopes, studied in Bertsimas et al. [36], to a class of integral polytopes. They discuss their results on four polynomial time solvable instances, arising in the appointment scheduling problem, the max flow problem with random arc capacities, the ranking problem with random utilities, and project scheduling problems with irregular random starting time costs.

5.2.6.2. Risk and Chance Constraints.
Dhara et al. [89] provide bounds on the worst-case CVaR over an ambiguity set of discrete distributions, where the ambiguity set contains all joint distributions whose univariate marginals are fixed and whose bivariate marginals are within a minimum Kullback-Leibler distance from the nominal bivariate marginals. They develop a convex reformulation for the resulting DRO. Doan et al. [91] study a DRO model of the form (1.5) with an objective function that is convex piecewise linear in $\tilde{\xi}$ and affine in $x$. They form the ambiguity set of joint distributions via a Fréchet class of discrete distributions with multivariate marginals, where the components of the random vector are partitioned such that they have overlaps. They show that the resulting DRO model for a portfolio optimization problem is efficiently solvable with linear programming. In particular, they develop a tight linear programming reformulation to find a bound on the worst-case CVaR over such an ambiguity set, provided that the structure of the marginals satisfies a regularity condition.
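To make the notion of a Fréchet class concrete, the sketch below constructs one particular member for two discrete univariate marginals: the comonotone coupling, obtained with the north-west corner rule on values sorted in increasing order. By a classical result, this coupling maximizes the expectation of supermodular functions such as the product of the two components over all joint distributions with the given marginals. The example is an illustration with hypothetical data and names, and it does not reproduce the overlapping multivariate marginals of [91].

```python
def comonotone_coupling(x_vals, p, y_vals, q, tol=1e-12):
    """Joint pmf with marginals (x_vals, p) and (y_vals, q); values sorted ascending."""
    p, q = list(p), list(q)
    joint = {}
    i = j = 0
    while i < len(p) and j < len(q):
        mass = min(p[i], q[j])          # move as much mass as both cells allow
        if mass > 0:
            key = (x_vals[i], y_vals[j])
            joint[key] = joint.get(key, 0.0) + mass
        p[i] -= mass
        q[j] -= mass
        if p[i] <= tol:                 # row exhausted, go to the next x value
            i += 1
        if j < len(q) and q[j] <= tol:  # column exhausted, go to the next y value
            j += 1
    return joint

x_vals, p = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]
y_vals, q = [0.0, 10.0], [0.6, 0.4]
joint = comonotone_coupling(x_vals, p, y_vals, q)

# Recover the marginals and compare E[XY] with the independent coupling.
marg_x = [sum(m for (x, _), m in joint.items() if x == xv) for xv in x_vals]
marg_y = [sum(m for (_, y), m in joint.items() if y == yv) for yv in y_vals]
exy_comonotone = sum(x * y * m for (x, y), m in joint.items())
exy_independent = sum(a * b for a, b in zip(x_vals, p)) * sum(a * b for a, b in zip(y_vals, q))
print(joint, exy_comonotone, exy_independent)
```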
Natarajan et al. [212] study a distributionally robust approach to minimize the worst-case CVaR of regret in combinatorial optimization problems with uncertainty in the objective function coefficients, where the regret is $h(x, \tilde{\xi}) = -\tilde{\xi}^\top x + \max_{y \in \{0,1\}^{q_1}} \tilde{\xi}^\top y$. It is assumed that the ambiguity set is formed with the knowledge of marginal distributions, where the ambiguity for each marginal distribution is formed via (5.23).
They reformulate the resulting problem as a polynomial-sized mixed-integer LP when (i) the support is known, (ii) the support and mean are known, and (iii) the support, mean, and mean absolute deviation are known; and as a mixed-integer SOCP when the support, mean, and standard deviation are known. They show that the maximum weight subset selection problem is polynomially solvable under (i) and (ii). They illustrate their results on subset selection and shortest path problems. Zhang et al. [341] study a distributionally robust approach to a stochastic bin-packing problem subject to chance constraints on the total item sizes in the bins. They form the ambiguity set by all discrete distributions with known marginal means and variances for each item size. By showing that there exists a worst-case distribution that is at most a three-point distribution, they obtain a closed-form expression for the chance constraint and reformulate the problem as a mixed-binary program. They present a branch-and-price algorithm to solve the problem, and apply their results to a surgery scheduling problem for operating rooms.

Statistical Learning.
Farnia and Tse [103] study a DRO approach in the context of supervised learning problems to infer a function (i.e., decision rule) that predicts a response variable given a set of covariates. Motivated by the game-theoretic interpretation of Grünwald and Dawid [126] and the principle of maximum entropy, they seek a decision rule that predicts the response based on a distribution that maximizes a generalized entropy function over a set of probability distributions. However, because the covariate information is available, they apply the principle of maximum entropy to the conditional distribution of the response given the covariates; see also Globerson and Tishby [119] for the case of Shannon entropy. Farnia and Tse [103] form the ambiguity set of distributions by matching the marginal of the covariates to the empirical marginal of the covariates while keeping the cross-moments between the response variables and covariates close enough (with respect to some norm) to those of the joint empirical distribution. They show that the DRO approach adopts a regularization interpretation for the maximum likelihood problem under the empirical distribution. As a result, Farnia and Tse [103] recover the regularized maximum likelihood problem for generalized linear models for the following loss functions: linear regression under the quadratic loss function, logistic regression under the logarithmic loss function, and SVM under the 0-1 loss function.
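The link between the principle of maximum entropy and exponential-family models can be seen in the simplest unconditional case. The sketch below assumes Shannon entropy, a finite support, and a single mean constraint, which is far simpler than the conditional, generalized-entropy setting of Farnia and Tse [103]. Under these assumptions the maximizer has the Gibbs form p_i ∝ exp(θ·x_i), and θ can be found by bisection because the mean is increasing in θ. Function names and data are hypothetical.

```python
import math

def gibbs(theta, xs):
    """Exponential-family distribution p_i proportional to exp(theta * x_i)."""
    shift = max(theta * x for x in xs)          # avoid overflow in exp
    weights = [math.exp(theta * x - shift) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]

def max_entropy_with_mean(xs, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on theta so that the Gibbs distribution has the target mean."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(p * x for p, x in zip(gibbs(mid, xs), xs))
        if mean < target:
            lo = mid
        else:
            hi = mid
    return gibbs(0.5 * (lo + hi), xs)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

xs = [0.0, 1.0, 2.0, 3.0]
target = 1.0
p_maxent = max_entropy_with_mean(xs, target)
mean_maxent = sum(p * x for p, x in zip(p_maxent, xs))

# Another distribution with the same mean, for comparison
p_other = [0.4, 0.3, 0.2, 0.1]
print(p_maxent, entropy(p_maxent), entropy(p_other))
```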
Eban et al. [98] study a DRO approach to a classification problem to minimize the worst-case hinge loss of misclassification, where the ambiguity set of the joint probability distributions of the discrete covariates and response contains all distributions that agree with nominal pairwise marginals. They show that the proposed classifier provides a 2-approximation upper bound on the worst-case expected loss using a zero-one hinge loss. Razaviyayn et al. [253] study a DRO approach to the binary classification problem, with an ambiguity set similar to that of Eban et al. [98], to minimize the worst-case misclassification probability. By changing the order of inf and sup, and smoothing the objective function, they obtain a probability distribution, based on which they propose a randomized classifier. They show that this randomized classifier enjoys a 2-approximation upper bound on the worst-case misclassification probability of the optimal solution to the studied DRO.

Mixture Distribution.
In this section, we study DRO models in which the ambiguity set is formed via a mixture distribution. A mixture distribution is defined as a convex combination of pdfs, known as the mixture components. The weights associated with the mixture components are called mixture probabilities [169]. For example, a mixture model can be defined as the set of all mixtures of normal distributions with mean µ and standard deviation σ, with parameter a = (µ, σ) in some compact set $A \subset \mathbb{R}^2$. In a more generic framework, the distribution P can be any mixture of probability distributions $Q_a \in \mathcal{M}(\Xi, \mathcal{F})$, for some family of distributions $\{Q_a\}_{a \in A} \subset \mathcal{M}(\Xi, \mathcal{F})$ that depends on the parameter vector $a \in A$, as follows: $P(B) = \int_A Q_a(B)\, M(da)$ for all $B \in \mathcal{F}$, where M is any probability distribution on A [181]. Hence, modeling the ambiguity in the mixture probabilities may give rise to a DRO model over the resultant or barycenter P of M [238].
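As a minimal illustration of the definition above, the following Python sketch evaluates the density of a finite mixture of normal distributions, that is, the special case in which the mixing distribution M puts mass on finitely many parameter values a = (µ, σ). The function names, the parameter values, and the weights are assumptions made purely for illustration; they do not come from any of the cited papers.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components, weights):
    """Density at x of the mixture sum_j weights[j] * N(mu_j, sigma_j^2).

    components: list of (mu, sigma) pairs, i.e., points a in the parameter set A.
    weights:    mixture probabilities (a discrete mixing distribution M on A).
    """
    if len(components) != len(weights):
        raise ValueError("one weight per component is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to one")
    return sum(w * normal_pdf(x, mu, s) for w, (mu, s) in zip(weights, components))

# Hypothetical two-component example.
components = [(-1.0, 0.5), (2.0, 1.0)]
weights = [0.3, 0.7]
density_at_zero = mixture_pdf(0.0, components, weights)
```

In a DRO model over such a mixture, the ambiguity would sit in `weights` (or, more generally, in M), while the components stay fixed.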

Risk and Chance Constraints. Lasserre and Weisser [181] study a distributionally robust (individual and joint) chance-constrained program with a polynomial objective function, over a mixture ambiguity set and a semi-algebraic deterministic set. They approximate the ambiguous chance constraint with a polynomial whose coefficient vector is an optimal solution of an SDP. They show that the feasibility set induced by a nested sequence of such polynomial optimization approximation problems converges to that of the ambiguous chance constraints as the degree of the approximating polynomials increases. Kapsos et al. [169] introduce a probability Omega ratio for portfolio optimization (i.e., a probability-weighted ratio of gains versus losses for some threshold return target). They study a distributionally robust counterpart of this ratio, where each distribution of the ratio can be represented through a mixture of some known prespecified distributions with unknown mixture probabilities. In particular, they study a mixture model for a nominal discrete distribution, where the mixture probabilities are modeled via the box uncertainty and ellipsoidal uncertainty models. In the former case, they reformulate the problem as a linear program, and in the latter case, they reformulate the problem as an SOCP.
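To make the role of box uncertainty on the mixture probabilities concrete, the sketch below solves a much simpler problem than the Omega-ratio model of Kapsos et al. [169]: it computes the worst-case (smallest) mixture mean when each mixture probability lies in an interval and the probabilities sum to one. This is a small linear program that a greedy rule solves exactly. The function name and the numbers are hypothetical.

```python
def worst_case_mixture_mean(values, lower, upper):
    """Minimize sum_j lam[j] * values[j] subject to
    lower[j] <= lam[j] <= upper[j] and sum_j lam[j] = 1.

    Greedy rule: start every lam[j] at its lower bound, then assign the
    remaining probability mass to the components with the smallest values.
    """
    n = len(values)
    if not (len(lower) == len(upper) == n):
        raise ValueError("values, lower, and upper must have the same length")
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("the box does not intersect the probability simplex")
    lam = list(lower)
    remaining = max(0.0, 1.0 - sum(lower))
    for j in sorted(range(n), key=lambda idx: values[idx]):
        step = min(upper[j] - lower[j], remaining)
        lam[j] += step
        remaining -= step
    worst = sum(l * v for l, v in zip(lam, values))
    return worst, lam

# Hypothetical component means and a box of half-width 0.1 around
# the nominal mixture probabilities (0.5, 0.3, 0.2).
values = [0.02, 0.05, -0.10]
lower = [0.4, 0.2, 0.1]
upper = [0.6, 0.4, 0.3]
worst_mean, worst_lam = worst_case_mixture_mean(values, lower, upper)
```

The worst case shifts as much probability as the box allows toward the component with the lowest mean, which is the mechanism that keeps the box-uncertainty model linear.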
Hanasusanto et al. [133] study a distributionally robust newsvendor model with a mean-risk objective, as a convex combination of the worst-case CVaR and the worst-case expectation. The worst case is taken over all demand distributions within a multimodal ambiguity set, i.e., a mixture of a finite number of modes, where the conditional information on the ellipsoidal support, mean, and covariance of each mode is known. The ambiguity in each mode is modeled via (5.22). They cast the resulting model as an exact SDP, and obtain a conservative semidefinite approximation by using quadratic decision rules to approximate the recourse decisions. Hanasusanto et al. [133] further robustify their model against ambiguity in estimating the mean-covariance information, caused by ambiguity about the mixture weights. They assume that the mixture weights are close to a nominal probability vector in the sense of the $\chi^2$-distance. For this case, they also obtain an exact SDP reformulation as well as a conservative SDP approximation.

Shape-Preserving Models.
A few papers propose to model the distributional ambiguity in a way that all distributions in the ambiguity set share similar structural properties. We refer to such models as shape-preserving models to form the ambiguity set of probability distributions.
Popescu [238] proposes to incorporate structural distributional information, such as symmetry, unimodality, and convexity, into a moment-based ambiguity set. Popescu [238] obtains upper and lower bounds on a generalized moment of a random vector (e.g., tail probabilities), given the moments and structural constraints in a convex subset of the proposed ambiguity set, whose generic form is given in (5.29). Popescu [238] uses conic duality to evaluate such lower and upper bounds via SDPs. The key to the development in Popescu [238] is to focus on ambiguity sets that possess a Choquet representation, where every distribution in the ambiguity set can be written as a mixture (i.e., an infinite convex combination) of measures in a generating set, by virtue of (5.28). For univariate distributions, it is assumed that the generating set is defined by a Markov kernel. It is shown that if the optimal value of the problem is attained, there exists a worst-case probability measure that is a convex combination of m + 1 extremal probability measures from the generating set (recall that m is the dimension of f). Popescu [238] uses the above result to obtain generalized Chebyshev inequality bounds for distributions of a univariate random variable that are (1) symmetric, (2) unimodal with a given mode, (3) unimodal with bounds on the mode, (4) unimodal and symmetric, or (5) convex/concave monotone densities with bounds on the slope of densities. Popescu [238] further derives generalized Chebyshev inequalities for symmetric and unimodal distributions of multivariate random variables. A notion related to unimodality is α-unimodality, which is defined as follows:
Definition 5.14 (Dharmadhikari and Joag-Dev [90]). For α > 0, a distribution P is called α-unimodal with mode 0 if $t^{\alpha} P(B/t)$ is nondecreasing in t > 0 for every Borel set B.
Van Parys et al. [308] further extend the work of Popescu [238] to obtain worst-case probability bounds over α-unimodal multivariate distributions with the same mode and within the class of distributions in $\mathcal{P}^{DY}$, defined in (5.22), and on a polytopic support. They show that when the support of the random vector is an open polyhedron, this generalized Gauss bound can be obtained via an SDP. Similar to Popescu [238], Van Parys et al. [308] derive semidefinite representations for worst-case probability bounds using a Choquet representation of the ambiguity set. They demonstrate that the classical generalized Chebyshev and Gauss bounds can be obtained as special cases of their result. They also show how to obtain an SDP reformulation for the worst-case bound over α-multimodal multivariate distributions, defined via a mixture distribution.
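The value of structural information can be seen already in the classical univariate case. The sketch below compares the textbook Chebyshev bound with the classical Gauss inequality for unimodal distributions; it is not the SDP-based multivariate bound of the papers above, only the familiar closed-form special cases. The function names are assumptions for illustration.

```python
import math

def chebyshev_bound(k):
    """Upper bound on P(|X - mean| >= k * std) for any X with finite variance."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))

def gauss_bound(k):
    """Upper bound on P(|X - mode| > k * tau) for a unimodal X,
    where tau^2 = E[(X - mode)^2] (classical Gauss inequality)."""
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= 2.0 / math.sqrt(3.0):
        return 4.0 / (9.0 * k * k)
    return 1.0 - k / math.sqrt(3.0)

# Compare the two bounds at a few multiples of the dispersion parameter.
comparison = [(k, chebyshev_bound(k), gauss_bound(k)) for k in (1.0, 2.0, 3.0)]
```

For large k the unimodality assumption tightens the bound by the factor 4/9, which is the kind of reduction in conservatism that shape-preserving ambiguity sets aim for. Note that the two bounds are centered differently (mean versus mode), so they coincide in interpretation only when the mode equals the mean.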
By relying on information from classical statistics as well as robust statistics, Hanasusanto et al. [134] propose a unifying canonical ambiguity set that contains many ambiguity sets studied in the literature as special cases, including Gauss and median-absolute deviation ambiguity sets. Such a canonical framework is characterized through intersecting the cross-moment ambiguity set, proposed in Wiesemann et al. [318], and a structural ambiguity set on the marginal distributions, representing information such as symmetry and α-unimodality. As in [238], the key to the development in Hanasusanto et al. [134] is to focus on structural ambiguity sets that possess a Choquet representation. They study distributionally robust uncertainty quantification (i.e., a probabilistic objective function) and chance-constrained programs over the proposed ambiguity sets, where the safe region is characterized by a bi-affine expression in ξ and x. They study the ambiguity sets over which the resulting problems can be reformulated as conic programs. A summary of these results can be found in Hanasusanto et al. [134, Table 2]. A by-product of their study is to recover some results from probability theory. For instance, by studying the worst-case probability of an event over the Chebyshev ambiguity set with a known mean and upper bound on the covariance matrix, they recover the generalized Chebyshev inequality, discovered in Popescu [238] and Vandenberghe et al. [310]. Similarly, they recover the generalized Gauss inequality, discovered in Van Parys et al. [308], by considering the Gauss ambiguity set. Furthermore, they propose computable conservative approximations for the chance-constrained problem. Recognizing that the uncertainty quantification problem is tractable over a broad range of ambiguity sets, their key idea for the proposed approximation scheme is to decompose the chance-constrained problem into an uncertainty quantification problem that evaluates the worst-case probability of the chance constraint for a fixed decision x, followed by a decision improvement procedure.
Li et al. [187] study distributionally robust chance- and CVaR-constrained stochastic programs, where the ambiguity set contains all α-unimodal distributions with the same first- and second-order moments, and the safe region is bi-affine in both ξ and x. They show that these two ambiguous risk constraints can be cast as an infinite set of SOC constraints. They propose a separation approach to find the violated SOC constraints algorithmically. They also derive conservative and relaxation approximations of the two SOC constraints by a finite number of constraints. These approximations for the CVaR-constrained problem are based on the results in Van Parys et al. [309].
Hu et al. [154] study a data-driven newsvendor problem to decide on the optimal order quantity and price. They assume that demand depends on the price; however, there is ambiguity about the price-demand function. To hedge against misspecification of the demand function, they introduce a novel approach to this problem, called the functionally robust approach, in which the demand-price function is only known to be decreasing convex or concave. The proposed modeling approach in Hu et al. [155] also provides a systematic view of the risk-reward trade-off of coordinating pricing and order quantity decisions based on the size of the ambiguity set. To solve the resulting minimax model, Hu et al. [155] reduce the problem to a univariate problem that seeks the optimal price and develop a two-sided cutting surface algorithm that generates function cuts to shrink the set of admissible functions.
To overcome the difficulty in evaluating extremal performance due to the lack of data, Lam and Mottet [178] study the computation of worst-case bounds under the geometric premise of the tail convexity. They show that the worst-case convex tail behavior is in a sense either extremely light-tailed or extremely heavy-tailed.

Kernel-Based Models.
In Sections 5.1-5.3, we discussed different sets to model the distributional ambiguity. In all the papers we reviewed in those sections, the form of the ambiguity set is endogenously chosen by decision makers. However, when facing high-dimensional uncertain parameters, it may not be practical to fix the form of the ambiguity set a priori, a task that is further complicated by the calibration of the different parameters describing the set (see Section 6). An alternative practice is to learn the form of the ambiguity set by using unsupervised learning algorithms on the historical data. Consider a given set of data $\{(u^i, \xi^i)\}_{i=1}^N$. Bertsimas and Kallus [32] propose a decision framework that incorporates the covariates u in addition to ξ into the optimization problem in the form of a conditional stochastic optimization problem, where the decision maker seeks a predictive prescription x(u) that minimizes the conditional expectation of h(x, ξ) in anticipation of the future, given the observation u. However, the conditional distribution of ξ given u is not known and should be learned from data. Given $\{(u^i, \xi^i)\}_{i=1}^N$, they suggest finding a data-driven predictive prescription that minimizes $\sum_{i=1}^N w_{N,i}(u)\, h(x, \xi^i)$ over x, where $w_{N,i}(u)$ are weights learned locally from the data, in a way that predictions are made based on the mean or mode of the past observations that are in some way similar to the one at hand. Bertsimas and Kallus [32] obtain these weight functions by methods that are motivated by k-nearest-neighbors regression, Nadaraya-Watson kernel regression, local linear regression (in particular, LOESS), classification and regression trees (in particular, CART), and random forests. For instance, the estimate of $\mathbb{E}_P[h(x, \xi) \mid u]$ using the Nadaraya-Watson kernel regression is obtained as $\sum_{i=1}^N \frac{K_b(u - u^i)}{\sum_{j=1}^N K_b(u - u^j)}\, h(x, \xi^i)$, where $K_b$ is a kernel function with bandwidth b. Common kernel smoothing functions are
• Tri-cubic: $K(a) = \frac{70}{81}\left(1 - |a|^3\right)^3 \mathbb{1}\{|a| \le 1\}$,
• Gaussian or radial basis function: $K(a) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{a^2}{2}\right)$.
The general framework of the proposed data-driven model in Bertsimas and Kallus [32] resembles SAA. They show that under mild conditions, the problem is polynomially solvable and the resulting predictive prescription is asymptotically optimal and consistent. However, it is worth noting that Bertsimas and Kallus [32] illustrate that direct usage of SAA on the data can result in suboptimal decisions that are neither asymptotically optimal nor consistent.
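The weighted-sum prescription can be sketched in a few lines. The Python example below computes Nadaraya-Watson weights for a scalar covariate with a Gaussian kernel and then picks, from a finite candidate list, the decision minimizing the weighted cost. The newsvendor-style cost, its penalty values, the scaling K((u − u_i)/b) of the kernel, and all data are assumptions chosen for illustration; this is not the implementation of Bertsimas and Kallus [32].

```python
import math

def gaussian_kernel(a):
    """Standard Gaussian kernel K(a)."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def nw_weights(u, covariates, bandwidth, kernel=gaussian_kernel):
    """Nadaraya-Watson weights w_i(u) = K((u - u_i)/b) / sum_j K((u - u_j)/b)."""
    raw = [kernel((u - ui) / bandwidth) for ui in covariates]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("all kernel values vanish; increase the bandwidth")
    return [r / total for r in raw]

def newsvendor_cost(x, demand, under=4.0, over=1.0):
    """Hypothetical cost h(x, xi): penalties for unmet demand and leftover stock."""
    return under * max(demand - x, 0.0) + over * max(x - demand, 0.0)

def predictive_prescription(u, covariates, outcomes, candidates, bandwidth):
    """Return the candidate x minimizing sum_i w_i(u) * h(x, xi_i)."""
    w = nw_weights(u, covariates, bandwidth)

    def weighted_cost(x):
        return sum(wi * newsvendor_cost(x, xi) for wi, xi in zip(w, outcomes))

    return min(candidates, key=weighted_cost)

# Hypothetical data: demand is low when the covariate is near 0 and high near 1.
covariates = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
outcomes = [10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
candidates = sorted(set(outcomes))
x_low = predictive_prescription(0.1, covariates, outcomes, candidates, bandwidth=0.1)
x_high = predictive_prescription(1.1, covariates, outcomes, candidates, bandwidth=0.1)
```

Plain SAA would assign weight 1/N to every observation and return the same order quantity regardless of u, whereas the locally weighted version adapts the decision to the observed covariate.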
A modeling framework similar to the conditional stochastic optimization problem studied in Bertsimas and Kallus [32] is investigated in other papers, see, e.g., Hannah et al. [138], Deng and Sen [85], Ban and Rudin [8], Pang Ho and Hanasusanto [225], to incorporate machine learning into decision making. Deng and Sen [85] use regression models such as k-nearest-neighbors regression to learn the conditional distribution of ξ given u. They study the statistical optimality of the resulting solution and its generalization error, and they provide hypothesis-based tests for model validation and selection. In Hannah et al. [138], Ban and Rudin [8], Pang Ho and Hanasusanto [225], the weights are obtained by the Nadaraya-Watson kernel regression method.
For a newsvendor problem, Ban and Rudin [8] show that the SAA decision does not converge to the true optimal decision. This motivates them to derive generalization bounds for the out-of-sample performance of the cost and the finite-sample bias from the true optimal decision. Ban and Rudin [8] apply their study to the staffing levels of nurses for a hospital emergency room.
Tulabandhula and Rudin [303] incorporate machine learning into decision making. However, different from Bertsimas and Kallus [32], they study a framework that simultaneously seeks a best statistical model and a corresponding decision policy. In their framework, in addition to $\{(u^i, \xi^i)\}_{i=1}^N$, a new set of unlabeled data is available that, in conjunction with the statistical model, affects the cost. The minimum of such a cost function over the set of possible decisions is cast as a regularization term in the objective function of the learning algorithm. Tulabandhula and Rudin [303] show that under some conditions this problem is equivalent to a robust optimization model, where the uncertainty set of the statistical model contains all models that are within ε-optimality of the predictive model describing the data. They illustrate the form of the uncertainty set for different loss functions used in the predictive statistical model, including least squares, 0-1, logistic, exponential, ramp, and hinge losses. Tulabandhula and Rudin [305] study the application of the framework studied in Tulabandhula and Rudin [303] to a travelling repairman problem, where a repair crew seeks an optimal route to repair the nodes on a graph while the failure probabilities are unknown.
Similar to Tulabandhula and Rudin [303], Tulabandhula and Rudin [304] use a new set of unlabeled data in addition to $\{(u^i, \xi^i)\}_{i=1}^N$ in order to combine machine learning and decision making. However, unlike Bertsimas and Kallus [32], Deng and Sen [85], Tulabandhula and Rudin [303], and Tulabandhula and Rudin [305], Tulabandhula and Rudin [304] study a robust optimization framework. Their idea to form the uncertainty set of ξ is to consider a class of "good" predictive models with low training error on the data set. Recognizing that the uncertainty can be decomposed into the predictive model uncertainty and residual uncertainty, they form the uncertainty set as the Minkowski sum of two sets: (1) predictions of the new data set with the class of "good" predictive models, and (2) residuals of the new data set with the class of "good" predictive models. To form the class of "good" predictive models, one can use loss functions such as least squares and hinge loss.
Similar to Bertsimas and Kallus [32], Bertsimas and Van Parys [35] consider the problem of finding an optimal solution to a data-driven stochastic optimization problem, where the uncertain parameter is affected by a large number of covariates. They study a distributionally robust approach to this problem formed via the Kullback-Leibler divergence. By borrowing ideas from the statistical bootstrap, they propose two prescriptive methods based on the Nadaraya-Watson and nearest-neighbors learning formulations, first introduced by Bertsimas and Kallus [32], which safeguard against overfitting and lead to an improved out-of-sample performance. Both resulting prescriptive methods reduce to tractable convex optimization problems.
Kernel density estimation (KDE) [88] in combination with principal component analysis (PCA) is also used in the RO literature to construct the uncertainty set [217]. PCA captures the correlation between uncertain parameters and transforms data into their corresponding uncorrelated principal components. KDE then captures the distributional information of the transformed, uncorrelated uncertain parameters along the principal components by using kernel smoothing methods. Ning and You [217] propose to use a Gaussian kernel K defined between the latent uncertainty along the principal component k, $w_k$, and the projected data along the principal component k, $t_k$. By incorporating forward and backward deviations to allow for asymmetry [74], Ning and You [217] propose a polytopic uncertainty set that resembles the intersection of a box, with the so-called budget, and polyhedral uncertainty sets. In this set, V is a square matrix consisting of all m eigenvectors (i.e., principal components) obtained from the eigenvalue decomposition of the sample covariance matrix $S = \frac{1}{N-1}(U - \mathbf{1}\mu_0^\top)^\top (U - \mathbf{1}\mu_0^\top)$. Moreover, $z^-$ is a backward deviation vector, $z^+$ is a forward deviation vector, and Γ is the uncertainty budget. In addition, $F_k^{-1}(\alpha) := \min\{w_k \mid F_k(w_k) \ge \alpha\}$, k = 1, ..., m, where $F_k$ is the cdf of $w_k$, whose density function is obtained using KDE. Ning and You [217] further extend their approach to data-driven static and adaptive robust optimization.
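The quantile $F_k^{-1}(\alpha)$ of a KDE-based cdf has no closed form in general, but it is easy to compute numerically. The following Python sketch assumes a Gaussian kernel with a fixed, user-supplied bandwidth and takes as input data that have already been projected onto one principal component; the PCA step itself is omitted. Function names, the bandwidth, and the sample values are hypothetical, and this is not the implementation of Ning and You [217].

```python
import math

def kde_cdf(w, samples, bandwidth):
    """CDF of a Gaussian KDE: the average of normal CDFs centered at the samples."""
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((w - t) / scale)) for t in samples) / len(samples)

def kde_quantile(alpha, samples, bandwidth, tol=1e-10):
    """Approximate min{w : F(w) >= alpha} by bisection.

    The KDE cdf F is continuous and increasing, so bisection applies. The
    bracket spans ten bandwidths beyond the data, which is assumed to be wide
    enough for the alpha of interest (it fails only for extremely small alpha).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    lo = min(samples) - 10.0 * bandwidth
    hi = max(samples) + 10.0 * bandwidth
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, bandwidth) >= alpha:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical projected data t_k along one principal component.
scores = [-1.0, -0.5, 0.0, 0.5, 1.0]
lower_q = kde_quantile(0.05, scores, bandwidth=0.3)
upper_q = kde_quantile(0.95, scores, bandwidth=0.3)
```

The pair (`lower_q`, `upper_q`) plays the role of data-driven lower and upper limits for the latent uncertainty along one component; asymmetric data would yield limits that are not symmetric around the mean.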
In the context of RO, support vector clustering (SVC) is proposed to form the uncertainty set, which seeks a sphere with the smallest radius that encloses all data mapped in the covariate space [282]. In SVC, to avoid overfitting, the violations of the data outside the sphere are penalized by a regularization term in the objective of a minimization problem over the variables δ, s, and c. Dualizing the problem of finding the smallest sphere using dual multipliers π results in a quadratic problem in which the kernel function appears in the objective function. It is shown that commonly used kernel functions in SVC, such as polynomial, radial basis function, and sigmoid kernels, lead to an intractable robust counterpart problem for the corresponding uncertainty set. Hence, Shang et al. [282] propose to use a piecewise linear kernel, referred to as a weighted generalized intersection kernel. Such a kernel not only incorporates covariance information, but also gives rise to the following results. (ii) The resulting uncertainty set U is a polytope; hence, the robust counterpart $\max_{u \in U} u^\top x \le b$ has the same complexity as the deterministic problem. (iii) The regularization parameter γ gives an upper bound on the fraction of the outliers; hence, a feasible solution x in the robust counterpart $\max_{u \in U} u^\top x \le b$ is also feasible to an SAA-based chance-constrained problem $P\{\tilde{u}^\top x \le b\} \ge 1 - \gamma$. (iv) As the number of data points increases, the fraction of outliers converges to the regularization parameter γ with probability one. (v) The regularization parameter γ gives a lower bound on the fraction of the support vectors.
Shang and You [280] further propose to calibrate the radius of the uncertainty set and provide a probabilistic guarantee of the proposed uncertainty set. Shang and You [278] use PCA in combination with SVC to construct the uncertainty set. By employing PCA, the data space is decomposed into the principal subspace and residual subspace. Then, they utilize the uncertainty set formed in Shang et al. [282] to explain the variation in the principal subspace, and utilize a polyhedral set to explain noise in the residual subspace. The proposed uncertainty set is then the intersection of the above two sets. Shang and You [279] adopt the ambiguity set proposed in Wiesemann et al. [318], and propose to use PCA to calibrate the moment functions. In fact, a moment function in their model is a piecewise linear function, which is defined as a first-order deviation of the uncertain parameter along a certain projection direction, truncated at certain points. They propose to use PCA to come up with the projection directions, and choose the truncation points symmetrically around the sample mean along the direction.
Applications of the proposed method in Ning and You [217] are studied in production scheduling [217] and in process network planning [217,218,216]. The proposed method in Shang et al. [282] is used in different application domains to construct the uncertainty set, see, e.g., control of irrigation systems [283] and chemical process network planning [282]. Applications of the proposed method in Shang and You [279] are studied in production scheduling [279,281] and in process network planning [279,277].

5.5. General Ambiguity Sets. In Sections 5.1-5.4, we reviewed papers with specific distributional and structural properties for the random parameters, captured via discrepancy-based, moment-based, shape-preserving, and kernel-based ambiguity sets. In this section, we review papers that either do not consider any specific form for the ambiguity set or provide some general results for a broad class of ambiguity sets.
A unified scenario-wise format for ambiguity sets that contains both the moment-based and discrepancy-based distributional information about the ambiguous distribution is proposed in Chen et al. [77]. It is shown that the ambiguity sets formed via generalized moments, mixture distributions, the Wasserstein metric, φ-divergences, and k-means clustering, among others, can all be represented under this unified ambiguity set. The key feature of this scenario-wise ambiguity set is the introduction of a discrete random variable, which represents a finite number of scenarios that would affect the distributional ambiguity of the underlying nominal random variable. This ambiguity set can be characterized by a finite number of (conditional) expectation constraints based on generalized moments (Wiesemann et al. [318]). For practical purposes, they restrict the ambiguity set to be second-order conic representable. Based on the scenario-wise ambiguity set, they introduce an adaptive robust optimization format that unifies the classical SP and (distributionally) RO models with recourse. They also introduce a scenario-wise affine recourse approximation to provide tractable solutions to the adaptive robust optimization model. Besides Chen et al. [77], there are some proposals for unified models in the context of discrepancy-based, moment-based, and shape-preserving models. As mentioned before, a broad class of moment-based ambiguity sets with conic-representable expectation constraints and a collection of nested conic-representable confidence sets is proposed in Wiesemann et al. [318], and a broad class of shape-preserving ambiguity sets is proposed in Hanasusanto et al. [134]. Luo and Mehrotra [199] study DRO problems in which the ambiguity sets of probability distributions can depend on the decision variables.
They consider a wide range of moment- and discrepancy-based ambiguity sets, formed via, e.g., (1) measure and moment inequalities (see Section 5.2.3), (2) bounds on moment constraints (see Section 5.2.1), (3) the 1-Wasserstein metric utilizing the $\ell_1$-norm, (4) φ-divergences, and (5) the Kolmogorov-Smirnov test. They present equivalent reformulations for these problems by relying on duality results.
Pflug and Wozabal [229] study a DRO problem, where the ambiguity exists in both the objective function and constraints as in (DRO). To solve the model, they propose an exchange method to successively generate a finite inner approximation of the ambiguity set of distributions. They show that when the ambiguity set is compact and convex, and the risk measure is jointly continuous in both x and P, then the proposed algorithm is finitely convergent.
Bansal and Zhang [11] introduce two-stage stochastic integer programs in which the second-stage problem has p-order conic constraints as well as integer variables. They present sufficient conditions under which the addition of parametric (non)linear cutting planes along with the linear relaxation of the integrality constraints provides a convex programming equivalent for the second-stage problem. They show that this result is also valid for the distributionally robust counterpart of this problem. This paper generalizes the results on two-stage mixed-binary linear programs studied in Bansal et al. [9].
Bansal and Mehrotra [10] introduce two-stage distributionally robust disjunctive programs with disjunctive constraints in both stages and a general ambiguity set for the probability distributions. To solve the resulting model, they develop decomposition algorithms, which utilize Balas' linear programming equivalent for deterministic disjunctive programs or his sequential convexification approach within the L-shaped method. They demonstrate that the proposed algorithms are finitely convergent if a distribution separation subproblem can be solved in a finite number of iterations, as in sets formed via $\mathcal{P}^{MM}$, defined in (5.23), the 1-Wasserstein metric utilizing an arbitrary norm, and the total variation distance. These algorithms generalize the distributionally robust integer L-shaped algorithm of Bansal et al. [9] for two-stage mixed-binary linear programs.
Wang et al. [315] study a distributionally robust chance-constrained bin-packing problem with a finite number of scenarios, where the safe region of the chance constraint is bi-affine in x and ξ, with a random technology matrix. They present a binary bilinear reformulation of the problem, where the feasible region is modeled as the intersection of multiple binary bilinear knapsack constraints, a cardinality constraint, and a general (probability) knapsack constraint. They propose lifted cover valid inequalities for the binary bilinear knapsack substructure induced by a given bin and scenario, and they further obtain lifted cover inequalities that are valid for the substructure induced by each bin. They obtain valid probability cuts and incorporate them with the lifted cover inequalities in a branch-and-cut framework to solve the model. They show that the proposed algorithm is finitely convergent if a distribution separation subproblem can be solved in a finite number of iterations. Wang et al. [315] apply their results to an operating room scheduling problem.
Guo et al. [129] study the impacts of the variation of the ambiguity set of probability distributions on the optimal value and optimal solution of stochastic programs with distributionally robust chance constraints. To establish the results, they present conditions under which a sequence of approximated ambiguity sets converges to the true ambiguity set, for some discrepancy measure, including the Kolmogorov and total variation distances. They apply their convergence results to the ambiguity sets formed via (5.23) and the Kullback-Leibler divergence.
Delage and Saif [81] study the value of using a randomized policy, as compared to a deterministic policy, for mixed-integer DRO problems. They show that the value of randomization for such DRO models with a convex cost function h and a convex risk measure is bounded by the difference between the optimal values of the nominal DRO problem and that of its convex relaxation. They show that when the risk measure is an expectation and the cost function is affine in the decision vector, this bound is tight. They also develop a column generation algorithm for solving a two-stage mixed-integer linear DRO problem, formed via (5.23) and the 1-Wasserstein metric utilizing an arbitrary norm. They test their results on an assignment problem, and on uncapacitated and capacitated facility location problems.
Long and Qi [193] study a distributionally robust binary stochastic program to minimize the entropic VaR, also known as Bernstein approximation for the chance constraint. They propose an approximation algorithm to solve the problem via solving a sequence of problems. They showcase their results for ambiguity set formed as in (5.23) for a stochastic shortest path problem.
Shapiro et al. [294] study a multistage stochastic program, where the data process can be naturally separated into two components: one can be modeled as a random process, with a known probability distribution, and the other can be treated as a random process, with a known support and no distributional information. They propose a variant of the stochastic dual dynamic programming (SDDP) method to solve this problem.
6. Calibration of the Ambiguity Set of Probability Distributions. 6.1. Choice of the Nominal Parameters. All discrepancy-based ambiguity sets, studied in Section 5.1, and some of the moment-based ambiguity sets, studied in Section 5.2, rely on some nominal input parameters, for instance, the nominal distribution $P_0$ in the ambiguity set $\mathcal{P}^{W}(P_0; \epsilon)$, defined in (5.3), and parameters $\mu_0$ and $\Sigma_0$ in the ambiguity set $\mathcal{P}^{DY}$, defined in (5.22). In this section, we discuss how these parameters are chosen in a data-driven setting.
The nominal distribution $P_0$ in the discrepancy-based ambiguity sets is usually obtained by the maximum likelihood estimator of the true unknown distribution. In the discrete case, $P_0$ is typically chosen as the empirical distribution on data. In the case that the true unknown distribution is continuous, Jiang and Guan [167] and Zhao and Guan [342] propose to obtain $P_0$ with nonparametric kernel density estimation methods, see, e.g., Devroye and Gyorfi [88].
Delage and Ye [82] propose to estimate $\mu_0$ and $\Sigma_0$ by their empirical estimates (see Section 6.2 for more details on how this choice of nominal parameters, in conjunction with other assumptions, ensures that the constructed ambiguity set $\mathcal{P}^{DY}$ contains the true unknown probability distribution with a high probability).
6.2. Choice of Robustness Parameters. In Section 5, we reviewed different approaches to form the ambiguity set of distributions. All discrepancy-based ambiguity sets, studied in Section 5.1, and some of the moment-based ambiguity sets, studied in Section 5.2, rely on parameters that control the size of the ambiguity set. For instance, parameter $\epsilon$ in the ambiguity set $\mathcal{P}^{W}(P_0; \epsilon)$, defined in (5.3), and parameters $\epsilon_1$ and $\epsilon_2$ in the ambiguity set $\mathcal{P}^{DY}$, defined in (5.22), control the size of their corresponding ambiguity sets. A judicious choice of these parameters reduces the level of conservatism of the resulting DRO. A natural question is then how to choose appropriate values for these parameters.
In this section, we review different approaches to choose the level-of-robustness parameters. To have a structured review, we make a distinction between data-driven DRO and non-data-driven DRO.
6.2.1. Data-Driven DROs. Data-driven DROs usually propose a robustness parameter that is inversely proportional to the number of available data points. This construction is motivated by the asymptotic convergence of the optimal value of DRO to that of the corresponding model under the true unknown distribution, with an increasing number of data points, see, e.g., [229,82,42].
An underlying assumption in data-driven methods is that data points are independently and identically distributed (i.i.d.) from the unknown distribution. Given this assumption, data-driven approaches for discrepancy-based ambiguity sets propose to choose the level of robustness by analyzing the discrepancy, with respect to some metric, between the empirical distribution and the true unknown distribution, either asymptotically, see, e.g., Ben-Tal et al. [28], Shafieezadeh-Abadeh et al. [273], or with a finite sample, see, e.g., Pflug and Wozabal [229]. A direct consequence of such analysis is that it establishes a finite-sample probabilistic guarantee on the discrepancy between the empirical distribution and the true unknown distribution. Hence, it gives rise to a probabilistic guarantee on the inclusion of the unknown distribution in the constructed set, with respect to the empirical distribution. By construction, such an ambiguity set can be interpreted as a confidence set on the true unknown distribution. Moreover, such a construction implies a finite-sample guarantee on the out-of-sample performance, so that the current optimal value provides an upper bound on the out-of-sample performance of the current solution with a high probability. A similar idea is used in moment-based ambiguity sets, see, e.g., Goldfarb and Iyengar [122] and Delage and Ye [82]. In a recent work, Gotoh et al. [125] propose to choose the level of robustness by trading off between the mean and variance of the out-of-sample objective function value. We refer the readers to that paper for a review of calibration approaches in DRO.
Below, we review the data-driven approaches to choose the level of robustness in more details. In this section, we suppose that a set {ξ i } N i=1 of i.i.d data, distributed according to P true , is available, where P N denotes the empirical probability distribution of data.
6.2.1.1. Optimal Transport Discrepancy. When the ambiguity set contains all discrete distributions around the empirical distribution in the sense of the Wasserstein metric, Pflug and Wozabal [229] and Pflug et al. [233] propose to choose the level of robustness based on a probabilistic statement on the Wasserstein metric between the empirical and true distributions, due to Dudley [94], as $\epsilon = \frac{C N^{-1/d}}{\alpha}$. This choice of $\epsilon$ guarantees that $\mathbb{P}\{d^{W}_{c}(P, P_N) \ge \epsilon\} \le \alpha$. In addition to the confidence level $1-\alpha$ and the number of available data points $N$, the proposed level of robustness in [229,233] depends on the dimension $d$ of $\tilde{\xi}$ and a constant $C$. For such a Wasserstein-based ambiguity set, one can also choose the size of the set by utilizing the probabilistic statement on the discrepancy between the empirical distribution and the true unknown distribution, established in Fournier and Guillin [105]. Nevertheless, because all the utilized probabilistic statements rely on the exogenous constant $C$, the size of the ambiguity set calculated from the theoretical analysis may be very conservative; hence, such proposals are not practical.
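To illustrate why such a radius can be conservative, the following minimal sketch evaluates it under the assumption that the radius has the form $\epsilon = C N^{-1/d}/\alpha$ discussed above. The function names and the default value `const = 1.0` are hypothetical and chosen only for illustration; in practice the constant $C$ is typically unknown.

```python
def dudley_radius(n_samples: int, dim: int, alpha: float, const: float = 1.0) -> float:
    """Radius eps = C * N**(-1/d) / alpha; `const` is a placeholder for the unknown C."""
    if n_samples <= 0 or dim <= 0 or const <= 0.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need N > 0, d > 0, C > 0 and 0 < alpha < 1")
    return const * n_samples ** (-1.0 / dim) / alpha


def samples_needed(target_radius: float, dim: int, alpha: float, const: float = 1.0) -> float:
    """Sample size N at which the radius equals `target_radius`: N = (C / (alpha * eps))**d."""
    if target_radius <= 0.0:
        raise ValueError("target_radius must be positive")
    return (const / (alpha * target_radius)) ** dim


if __name__ == "__main__":
    # The radius shrinks very slowly in N once the dimension d grows.
    for d in (1, 2, 5, 10):
        print(d, dudley_radius(n_samples=10_000, dim=d, alpha=0.05))
```

Under this form, halving the radius requires $2^d$ times as many data points, which is one way to read the conservatism noted above.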
By acknowledging the issue raised above, some researchers propose to choose the level of robustness without relying on exogenous constants. For cases that the ambiguity set contains all discrete distributions, supported on a compact space and around the empirical distribution, Ji and Lejeune [164] derive a closed-form expression for computing the size of the Wasserstein-based ambiguity set. Their result, stated as Theorem 6.1, is set in the following framework: $\Omega \subseteq \mathbb{R}^d$ and $d(\cdot,\cdot)$ is the $\ell_1$-norm; $c(\cdot,\cdot) = d(\cdot,\cdot)$ is chosen in the definition of the optimal transport discrepancy (5.2); $\log \int_{\Omega} e^{\lambda d(\xi,\xi_0)} P^{\text{true}}(d\xi) < \infty$ for all $\lambda > 0$ and some $\xi_0$; and $\theta := \sup\{d(\xi_1,\xi_2) : \xi_1, \xi_2 \in \Omega\}$ denotes the diameter of $\Omega$.
Unlike the result in Pflug and Wozabal [229], the proposed level of robustness in Ji and Lejeune [164], stated in Theorem 6.1, depends only on the confidence level $\alpha$, the number of available data points, and the diameter of the compact support $\Omega$. Ji and Lejeune [164] obtain this result by bounding the Wasserstein distance between two probability distributions from above, using the properties of the weighted total variation [54] and the weighted Csiszar-Kullback-Pinsker inequality [312], and then applying Sanov's large deviation theorem [83] to reach a probabilistic statement on the Wasserstein distance between two distributions. As stated in Theorem 6.1, such a result guarantees that the constructed set contains the unknown probability distribution with a high probability. Moreover, it implies a probabilistic guarantee on the true optimal value.
Another criticism of methods such as those proposed in Pflug and Wozabal [229] and Pflug et al. [233] is that they rely solely on the discrepancy between two probability distributions, and the optimization framework plays no role in the prescription. By making a connection between the regularization parameter and the size of the ambiguity set for Wasserstein-based sets, Blanchet et al. [50] aim to optimally choose the regularization parameter. A key component of their analysis is a robust Wasserstein profile (RWP) function. At a given solution $x$, this function calculates the minimum Wasserstein distance from the nominal distribution to the set of optimal probability distributions for the inner problem at $x$. For any confidence level $\alpha$, they show that the size of the ambiguity set should be chosen as the $(1-\alpha)$-quantile of RWP at the optimal solution to the minimization problem under the true unknown distribution. Using this selection of $\epsilon$, the optimal solution to the true problem belongs to the set of optimal solutions to the DRO problem, with $(1-\alpha)$ confidence for all $P \in \mathcal{P}^{W}(P_N, \epsilon)$. As such a result is based on the true optimal solution, they study the asymptotic behavior of the RWP function and discuss how to use it to optimally choose the regularization parameter without cross validation. The work in Blanchet et al. [50] is extended in Blanchet and Kang [48,46]. Blanchet and Kang [48] utilize the RWP function to introduce a data-driven (statistical) criterion for the optimal choice of the regularization parameter and study its asymptotic behavior. For a DRO approach to linear regression, Chen and Paschalidis [70] give guidance on the selection of the regularization parameter from the standpoint of a confidence region.
6.2.1.2. Goodness-of-Fit Test. Bertsimas et al. [42] propose to form the ambiguity set of distributions using the confidence set of the unknown distribution via goodness-of-fit tests. With such an approach, one chooses the level of robustness as the threshold value of the corresponding test, depending on the confidence level α, data, and the null hypothesis.
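As one illustration of a test-based threshold, consider a univariate sample and the Kolmogorov-Smirnov statistic. The sketch below uses the Dvoretzky-Kiefer-Wolfowitz inequality, $\mathbb{P}\{\sup_t |F_N(t) - F(t)| > \epsilon\} \le 2e^{-2N\epsilon^2}$, to obtain a conservative threshold. This choice is an assumption made for simplicity and is not necessarily the threshold used in Bertsimas et al. [42], where the quantile of the chosen test statistic plays this role. All function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def dkw_threshold(n_samples: int, alpha: float) -> float:
    """Solve 2 * exp(-2 * N * eps**2) = alpha for eps (conservative KS-type threshold)."""
    if n_samples <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need N > 0 and 0 < alpha < 1")
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n_samples))


def ks_distance(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """sup_t |F_N(t) - F(t)| between the empirical cdf and a continuous candidate cdf."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained just before or at a jump of the empirical cdf.
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


def in_confidence_set(sample: Sequence[float], cdf: Callable[[float], float], alpha: float) -> bool:
    """True if the candidate cdf is not rejected at the DKW threshold."""
    return ks_distance(sample, cdf) <= dkw_threshold(len(sample), alpha)
```

A candidate distribution belongs to the ambiguity set exactly when it is not rejected, so the threshold plays the role of the level of robustness.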
6.2.1.3. φ-Divergences. By noting that the class of φ-divergences can be used in statistical hypothesis tests, a similar approach to the one in Bertsimas et al. [42] can be used to choose the level of robustness for φ-divergence-based ambiguity sets. For the case that the distributional ambiguity in discrete distributions is modeled via φ-divergences, some papers propose to choose the level of robustness by relying on the asymptotic behavior of the discrepancy between the empirical distribution and true unknown distribution, see, e.g., Ben-Tal et al. [28], Bayraksan and Love [13], Yanıkoglu and den Hertog [337].
Suppose that $\Xi$ is a finite sample space of size $m$ and the $\phi$-divergence function in (5.9) is twice continuously differentiable in a neighborhood of 1, with $\phi''(1) > 0$. Then, it is shown in Pardo [226] that under the true distribution, the statistic $\frac{2N}{\phi''(1)} D_{\phi}(P^{\text{true}}, P_0)$ converges in distribution to a $\chi^2_{m-1}$-distribution, with $m-1$ degrees of freedom. Thus, at a given confidence level $\alpha$, one can set the level of robustness to $\frac{\phi''(1)}{2N}\chi^2_{m-1,1-\alpha}$, where $\chi^2_{m-1,1-\alpha}$ is the $(1-\alpha)$-quantile of $\chi^2_{m-1}$, to obtain an (approximate) confidence set on the true unknown distribution. Ben-Tal et al. [28] show that such a choice of the level of robustness gives a one-sided confidence interval with (asymptotically) inexact coverage on the true optimal value of $\inf_{x \in \mathcal{X}} \mathbb{E}_{P^{\text{true}}}[h(x,\tilde{\xi})]$.
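The asymptotic rule above, which sets the level of robustness to $\phi''(1)/(2N)$ times the $(1-\alpha)$-quantile of $\chi^2_{m-1}$, can be evaluated with the following minimal sketch. It uses only the standard library: the chi-square cdf is computed from the series of the regularized lower incomplete gamma function, and the quantile is found by bisection. The function names are hypothetical, and the numerical routine is an illustrative choice rather than the procedure of any cited paper.

```python
import math


def chi2_cdf(x: float, dof: int) -> float:
    """Chi-square cdf via the series for the regularized lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 100_000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))


def chi2_quantile(prob: float, dof: int) -> float:
    """Quantile of the chi-square distribution, obtained by bisection on the cdf."""
    if not 0.0 < prob < 1.0 or dof < 1:
        raise ValueError("need 0 < prob < 1 and dof >= 1")
    lo, hi = 0.0, float(dof) + 10.0
    while chi2_cdf(hi, dof) < prob:  # enlarge the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def phi_divergence_radius(n_samples: int, support_size: int, alpha: float,
                          phi_second_at_one: float) -> float:
    """Radius phi''(1) / (2N) * chi2_{m-1, 1-alpha} for a support of size m."""
    return phi_second_at_one / (2.0 * n_samples) * chi2_quantile(1.0 - alpha, support_size - 1)
```

For example, the Kullback-Leibler divergence with $\phi(t) = t\log t - t + 1$ has $\phi''(1) = 1$, so `phi_divergence_radius(100, 3, 0.05, 1.0)` evaluates the radius for $N = 100$ observations on a support of size $m = 3$.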
For corrections for small sample sizes, we refer readers to Pardo [226].
By generalizing the empirical likelihood framework [224] on a separable metric space (not necessarily finite), Duchi et al. [92] propose to choose the level of robustness such that a confidence interval $[l_N, u_N]$ on the true optimal value of $\inf_{x \in \mathcal{X}} \mathbb{E}_{P^{\text{true}}}[h(x,\tilde{\xi})]$ has an asymptotically exact coverage $1-\alpha$, i.e., $\lim_{N \to \infty} \mathbb{P}\{\inf_{x \in \mathcal{X}} \mathbb{E}_{P^{\text{true}}}[h(x,\tilde{\xi})] \in [l_N, u_N]\} = 1-\alpha$.
Theorem 6.2 establishes this guarantee under the following conditions: $h(\cdot,\xi)$ is $M(\xi)$-Lipschitz with respect to some norm $\|\cdot\|$ on $\mathcal{X}$, $\mathbb{E}_{P^{\text{true}}}[M(\tilde{\xi})^2] < \infty$, and $\mathbb{E}_{P^{\text{true}}}[|h(x_0,\tilde{\xi})|] < \infty$ for some $x_0 \in \mathcal{X}$; additionally, $h(\cdot,\xi)$ is proper and lower semicontinuous for $\xi$, $P^{\text{true}}$-almost surely. According to Theorem 6.2, if $\inf_{x \in \mathcal{X}} \mathbb{E}_{P^{\text{true}}}[h(x,\tilde{\xi})]$ has a unique solution, the desired asymptotic guarantee is achieved with the choice $\epsilon = \frac{\chi^2_{1,1-\alpha}}{N}$. Duchi et al. [92] also give rates at which $u_N - l_N \to 0$. Moreover, the upper confidence interval $(-\infty, u_N]$ is a one-sided confidence interval with an asymptotically exact coverage when $\epsilon = \chi^2_{1,1-2\alpha}$.
On another note, it can be seen from Table 1 that the $\phi$-divergence function corresponding to the variation distance is not twice differentiable at 1. Hence, one cannot use the above result. However, by utilizing the first inequality in Lemma 5.4, i.e., the relationship between the variation distance and the Hellinger distance, Jiang and Guan [167] propose to set the level of robustness to $\frac{1}{N}\chi^2_{m-1,1-\alpha}$ in order to obtain an (approximate) confidence set on the true unknown discrete distribution. The proposed choice of the level of robustness ensures that the unknown discrete distribution belongs to the ambiguity set with a high probability. For the case that $\tilde{\xi}$ follows a continuous distribution, the proposed level of robustness in [167] depends on some constants that appear in the probabilistic statement of the discrepancy between the empirical distribution and the true distribution.
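For the two-sided choice of Duchi et al. [92], read here as $\epsilon = \chi^2_{1,1-\alpha}/N$, the quantile has one degree of freedom and can be obtained from a standard normal quantile, since the square of a standard normal variable follows a $\chi^2_1$-distribution. The sketch below relies on this fact; the function names are hypothetical.

```python
from statistics import NormalDist


def chi2_one_dof_quantile(prob: float) -> float:
    """Quantile of chi-square with 1 degree of freedom: P{Z**2 <= q} = prob for Z ~ N(0, 1)."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie in (0, 1)")
    # P{|Z| <= sqrt(q)} = prob  <=>  sqrt(q) is the (1 + prob) / 2 quantile of N(0, 1).
    return NormalDist().inv_cdf(0.5 * (1.0 + prob)) ** 2


def empirical_likelihood_radius(n_samples: int, alpha: float) -> float:
    """Radius chi2_{1, 1-alpha} / N for the two-sided interval."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return chi2_one_dof_quantile(1.0 - alpha) / n_samples
```

Unlike a radius based on $\chi^2_{m-1}$, this choice does not depend on the size of the support, which is consistent with its use on general (not necessarily finite) spaces.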
6.2.1.4. $\ell_p$-Norm. For the case that the $\ell_\infty$-norm is used to model the distributional ambiguity, Jiang and Guan [167] propose to choose the level of robustness $\epsilon$ based on a probabilistic statement on the discrepancy between the empirical distribution and the true distribution. The resulting expression involves $z_{1-\frac{\alpha}{2}}$, the $(1-\frac{\alpha}{2})$-quantile of the standard normal distribution, and $p_0 := [p_0^1, \ldots, p_0^m]$, the empirical distribution of the data. The proposed choice of the level of robustness ensures that the unknown discrete distribution belongs to the ambiguity set with a high probability. Similar to the $\ell_1$-norm (i.e., the variation distance) case, when $\tilde{\xi}$ follows a continuous distribution, the proposed level of robustness depends on some constants that appear in the probabilistic statement of the discrepancy between the empirical distribution and the true distribution.
6.2.1.5. ζ-Structure. By exploiting the relationship between different metrics in the ζ-structure family, see, e.g., Lemma 5.7, Zhao and Guan [342] provide guidelines on how to choose the level of robustness for the ambiguity sets of the unknown discrete distribution formed via bounded Lipschitz, Kantorovich, and Fortet-Mourier metrics as follows.
Theorem 6.3. Suppose that the random vector $\tilde{\xi}$ is supported on a bounded finite space $\Omega$ and $\theta$ denotes the diameter of $\Omega$, as defined in Theorem 6.1.
(ii) if $\epsilon \ge \theta \max\{1, \theta^{q-1}\} \sqrt{\frac{-2\log\alpha}{N}}$, then $P^N\{d_{FM}(P^{\text{true}}, P_N) \le \epsilon\} \ge 1-\alpha$.
Proof. The proof is immediate from the relationship between $\zeta$-structure metrics, stated in Lemma 5.7, and the fact that $P^N\{d_K(P^{\text{true}}, P_N) \le \epsilon\} \ge 1 - \exp\{-\frac{\epsilon^2 N}{2\theta^2}\}$ due to Zhao and Guan [342, Proposition 3].
As can be seen from Theorem 6.3, the proposed levels of robustness for the case that the unknown distribution is discrete depend on the diameter of $\Omega$, the number of data points $N$, and the confidence level $1-\alpha$. However, the results in Zhao and Guan [342] for the continuous case suffer from similar practical issues as in [229,233,167].
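The radii in this discussion follow from inverting the Kantorovich bound quoted in the proof, $1 - \exp\{-\epsilon^2 N/(2\theta^2)\} \ge 1-\alpha$, which gives $\epsilon \ge \theta\sqrt{-2\log\alpha/N}$. The following minimal sketch performs this inversion and applies the factor $\max\{1,\theta^{q-1}\}$ for the Fortet-Mourier metric. The function names are hypothetical, and the sketch assumes the discrete setting of Theorem 6.3.

```python
import math


def kantorovich_coverage(eps: float, diameter: float, n_samples: int) -> float:
    """Lower bound 1 - exp(-eps**2 * N / (2 * theta**2)) on the coverage probability."""
    return 1.0 - math.exp(-(eps ** 2) * n_samples / (2.0 * diameter ** 2))


def kantorovich_radius(diameter: float, n_samples: int, alpha: float) -> float:
    """Smallest eps whose coverage bound reaches 1 - alpha: theta * sqrt(-2 log(alpha) / N)."""
    if diameter <= 0.0 or n_samples <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need theta > 0, N > 0 and 0 < alpha < 1")
    return diameter * math.sqrt(-2.0 * math.log(alpha) / n_samples)


def fortet_mourier_radius(diameter: float, n_samples: int, alpha: float, q: float) -> float:
    """Radius for the Fortet-Mourier metric of order q: max{1, theta**(q-1)} times the above."""
    return max(1.0, diameter ** (q - 1.0)) * kantorovich_radius(diameter, n_samples, alpha)
```

Both radii decay at the rate $N^{-1/2}$ and involve only $\theta$, $N$, and $\alpha$, so no exogenous constant has to be estimated.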
6.2.1.6. Chebyshev. A data-driven approach to construct a Chebyshev ambiguity set is proposed in Goldfarb and Iyengar [122]. Recall the linear model for the asset returns $\tilde{\xi}$ in Goldfarb and Iyengar [122]: $\tilde{\xi} = \mu + A\tilde{f} + \tilde{\epsilon}$, where $\mu$ is the vector of mean returns, $\tilde{f} \sim N(0, \Sigma)$ is the vector of random returns that drives the market, $A$ is the factor loading matrix, and $\tilde{\epsilon} \sim N(0, B)$ is the vector of residual returns with a diagonal covariance matrix $B$. Under the assumption that the covariance matrix $\Sigma$ is known, recall that Goldfarb and Iyengar [122] study three different models to form the uncertainty in $B$, $A$, and $\mu$ through the uncertainty sets $\mathcal{U}_B$, $\mathcal{U}_A$, and $\mathcal{U}_\mu$, respectively, in which $c_i$ denotes the $i$-th column of $C$, and $\|c_i\|_g = \sqrt{c_i^\top G c_i}$ denotes the elliptic norm of $c_i$ with respect to a symmetric positive definite matrix $G$. Calibrating the uncertainty sets $\mathcal{U}_B$, $\mathcal{U}_A$, and $\mathcal{U}_\mu$ involves choosing parameters $\underline{d}_i$, $\overline{d}_i$, $\rho_i$, $\gamma_i$, $i = 1, \ldots, d$, vector $\mu_0$, and matrices $A_0$ and $G$. Assuming that a set of data points is available on $\tilde{\xi}$ and $\tilde{f}$, by relying on multivariate linear regression, Goldfarb and Iyengar [122] obtain least-squares estimates $(\mu_0, A_0)$ of $(\mu, A)$, respectively, and construct a multidimensional confidence region of $(\mu, A)$ around $(\mu_0, A_0)$. Now, projecting this confidence region along matrix $A$ and vector $\mu$ gives the corresponding uncertainty sets $\mathcal{U}_A$ and $\mathcal{U}_\mu$, respectively. To form the uncertainty set $\mathcal{U}_B$, they propose to use a bootstrap confidence interval around the regression error of the residual.
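The regression step can be sketched for the simplest case of one asset and one factor, where the least-squares estimates have a closed form. This is a simplified illustration under that single-factor assumption, not the multivariate procedure of Goldfarb and Iyengar [122]; it does not construct the confidence region or the bootstrap interval, and the function name is hypothetical.

```python
from statistics import fmean
from typing import Sequence, Tuple


def fit_single_factor(returns: Sequence[float], factor: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares estimates (mu0, a0) in xi = mu + a * f + noise, plus the residual variance."""
    n = len(factor)
    if n != len(returns) or n < 3:
        raise ValueError("need at least three paired observations")
    f_bar, r_bar = fmean(factor), fmean(returns)
    sxx = sum((f - f_bar) ** 2 for f in factor)
    if sxx == 0.0:
        raise ValueError("factor observations must not all be equal")
    sxy = sum((f - f_bar) * (r - r_bar) for f, r in zip(factor, returns))
    a0 = sxy / sxx                  # estimated factor loading
    mu0 = r_bar - a0 * f_bar        # estimated mean return
    residuals = [r - mu0 - a0 * f for f, r in zip(factor, returns)]
    sigma2 = sum(e * e for e in residuals) / (n - 2)  # unbiased residual variance estimate
    return mu0, a0, sigma2
```

The estimates $(\mu_0, a_0)$ would serve as the center of the confidence region, while the residual variance is the quantity around which an interval for the corresponding diagonal entry of $B$ would be built.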
6.2.1.7. Delage and Ye. A data-driven method to construct the ambiguity set $\mathcal{P}^{DY}$ is proposed in Delage and Ye [82].
6.2.2. Non-Data-Driven DROs. As mentioned before, data-driven DROs typically assume that a set of i.i.d. sampled data is available from the unknown true distribution. In many situations, however, there is no guarantee that the future uncertainty is drawn from the same distribution. Recognizing this fact, some research is devoted to choosing the level of robustness in situations where the i.i.d. assumption is violated and data-driven methods to calibrate the level of robustness may be unsuitable.
Rahimian et al. [252] use the notions of maximal effective subsets and prices of optimism/pessimism and nominal/worst-case regrets to calibrate the level of robustness in discrepancy-based DRO models. The price of optimism is defined as the loss incurred by being too optimistic (i.e., using the SO model with the nominal distribution, and hence implementing the corresponding solution) while DRO accurately represents the ambiguity in the distribution. Similarly, the price of pessimism is defined as the loss incurred by being too pessimistic (i.e., using the RO model with no distributional information except for the support of uncertainty). Nominal/worst-case regret is defined as the loss incurred by being unnecessarily ambiguous/not being ambiguous enough, and hence implementing the corresponding solution, while DRO is ill-calibrated. Rahimian et al. [252] suggest balancing the prices of optimism and pessimism if the decision-maker is indifferent regarding the error from using too optimistic or too pessimistic solutions. They refer to the smallest level of robustness for which such a balance happens as the indifferent-to-solution level of robustness. On the other hand, Rahimian et al. [252] propose to balance the nominal and worst-case regrets if the decision-maker wants to be indifferent regarding the error from using an ill-calibrated DRO model in either the optimistic or the pessimistic scenarios. They refer to the smallest level of robustness for which such a balance happens as the indifferent-to-distribution level of robustness.
7. Cost Function of the Inner Problem. Recall formulation (DRO) and the functional $\mathcal{R}_P : \mathcal{Z} \to \mathbb{R}$. This functional quantifies the uncertainty in the outcomes of a fixed decision $x \in \mathcal{X}$ under a given fixed probability measure $P \in \mathfrak{M}(\Xi, \mathcal{F})$. As pointed out before in Section 1.1 for (1.1) and (1.2), one choice for this functional is the expectation operator. Other functionals, such as regret function, risk measure, and utility function, have also been used in the DRO literature. These functionals are closely related concepts and we refer to Ben-Tal and Teboulle [23] and [259] for a comprehensive treatment and how one can induce one from the other. In this section, we review some notable works, where regret function, risk measure, and utility function are used to capture the uncertainty in the outcomes of the decision.
7.1. Regret Function. Given a decision $x \in \mathcal{X}$ and a probability measure $P \in \mathfrak{M}(\Xi, \mathcal{F})$, a regret functional $\mathcal{V}_P$ may quantify the expected displeasure or disappointment of the current decision with respect to a possible mix of future outcomes as follows: $\mathcal{V}_P[h(x,\tilde{\xi})] := \mathbb{E}_P[h(x,\tilde{\xi}) - \min_{x' \in \mathcal{X}} h(x',\tilde{\xi})]$. In other words, $\mathcal{V}_P[h(x,\tilde{\xi})]$ calculates the expected additional loss that could have been avoided. This definition of regret function is used in Natarajan et al. [212] and Hu et al. [151] in the context of combinatorial optimization and multicriteria decision-making, respectively. Another way of formulating a regret function may be as $\mathcal{V}_P[h(x,\tilde{\xi})] := \mathbb{E}_P[h(x,\tilde{\xi})] - \min_{x' \in \mathcal{X}} \mathbb{E}_P[h(x',\tilde{\xi})]$.
This type of regret function is used in Perakis and Roels [228] in the context of the newsvendor problem. Perakis and Roels [228] obtain closed-form solutions to distributionally robust single-item newsvendor problems that minimize the worst-case expected regret of acting optimally, where only (1) support, (2) mean, (3) mean and median, and (4) mean and variance information is available. This information can be captured with the ambiguity set $\mathcal{P}^{MM}$, defined in (5.23). Perakis and Roels [228] also study the ambiguity sets that preserve the shape of the distribution, including information on (1) mean and symmetry, (2) support and unimodality with a given mode, (3) median and unimodality with a given mode, and (4) mean, symmetry, and unimodality with a given mode.
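The two notions of regret can be compared on a small discrete newsvendor instance. The sketch below reads the first notion as the expected ex-post regret $\mathbb{E}_P[h(x,\tilde{\xi}) - \min_{x'} h(x',\tilde{\xi})]$ (an interpretation assumed here) and the second as $\mathbb{E}_P[h(x,\tilde{\xi})] - \min_{x'} \mathbb{E}_P[h(x',\tilde{\xi})]$. The cost parameters, the finite family of candidate distributions, and all function names are hypothetical; Perakis and Roels [228] work with moment- and shape-based ambiguity sets and closed-form solutions instead.

```python
from typing import Dict, Iterable, Sequence

Pmf = Dict[int, float]  # demand value -> probability


def cost(order: int, demand: int, overage: float = 1.0, underage: float = 2.0) -> float:
    """Newsvendor cost: pay `overage` per unsold unit and `underage` per unit of unmet demand."""
    return overage * max(order - demand, 0) + underage * max(demand - order, 0)


def expected_cost(order: int, pmf: Pmf) -> float:
    return sum(p * cost(order, d) for d, p in pmf.items())


def ex_post_regret(order: int, pmf: Pmf, orders: Sequence[int]) -> float:
    """E_P[h(x, xi) - min_x' h(x', xi)]: the best order is chosen after demand is seen."""
    return sum(p * (cost(order, d) - min(cost(x, d) for x in orders)) for d, p in pmf.items())


def ex_ante_regret(order: int, pmf: Pmf, orders: Sequence[int]) -> float:
    """E_P[h(x, xi)] - min_x' E_P[h(x', xi)]: the best order is chosen knowing only P."""
    return expected_cost(order, pmf) - min(expected_cost(x, pmf) for x in orders)


def min_worst_case_regret(pmfs: Iterable[Pmf], orders: Sequence[int]) -> int:
    """Order minimizing the worst-case ex-ante regret over a finite family of distributions."""
    family = list(pmfs)
    return min(orders, key=lambda x: max(ex_ante_regret(x, pmf, orders) for pmf in family))
```

Because $\min_{x'} \mathbb{E}_P[h(x',\tilde{\xi})] \ge \mathbb{E}_P[\min_{x'} h(x',\tilde{\xi})]$, the ex-post regret is never smaller than the ex-ante regret for the same decision.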

Utility Function. An alternative to using risk measures to compare random variables is to evaluate their expected utility, see Gilboa and Schmeidler [117]. As before, let us consider a probability space $(\Xi, \mathcal{F}, P)$. A random variable $Z_1 \in \mathcal{Z}$ is preferred over a random variable $Z_2 \in \mathcal{Z}$ if $\mathbb{E}_P[u(Z_1)] \ge \mathbb{E}_P[u(Z_2)]$ for a given univariate utility function $u$. A bounded utility function $u$ can be normalized to take values between 0 and 1, and hence, it can be interpreted as a cdf of a random variable $\zeta$, i.e., $u(t) = P\{\zeta \le t\}$ for $t \in \mathbb{R}$. Under this interpretation, $Z_1$ is preferred over $Z_2$ if $P\{Z_1 \ge \zeta\} \ge P\{Z_2 \ge \zeta\}$ because $\mathbb{E}_P[u(Z)] = P\{Z \ge \zeta\}$ when $\zeta$ is independent of $Z$. However, just as it is difficult in decision theory to have complete knowledge of a decision maker's preference (i.e., utility function), it is also difficult to have complete knowledge of the cdf of $\zeta$. The notion of stochastic dominance handles this issue by comparing the expected utility of random variables for a given family $\mathcal{U}$ of utility functions, or equivalently, by comparing the probability of exceeding the target random variable $\zeta$ for a given family of cdfs. Consequently, to address the problem of ambiguity in the decision maker's utility or, equivalently, in the cdf of the random variable $\zeta$, one can study problems of the form (7.1) and (7.2), where $\mathcal{U}$ denotes a given family of normalized and nondecreasing utility functions, or equivalently, a given family of cdfs. Note that problems (7.1) and (7.2) have the form of problems (1.5) and (1.6), respectively. Hu and Mehrotra [150] study a problem of the form (7.1), where $\mathcal{U}$ is further restricted to include concave utility functions, or equivalently, cdfs, that satisfy functional bounds on the utility and marginal utility functions (cdf and pdf of $\zeta$) as in (5.23). They provide a linear programming formulation of a particular case where the bounds on the utility function are piecewise linear increasing concave functions, and the bounds on all other functions are step functions.
For the general continuous case, they study an approximation problem by discretizing the continuous functions, and analyze the convergence properties of the approximated problem. They apply their results to a portfolio optimization problem. Unlike Hu et al. [154], in Hu et al. [155], no shape restrictions on the utility function are assumed and only functional bounds on the utility function are enforced. Hu et al. [155] show that an SAA approach to the Lagrangian dual of the resulting problem can be used by solving a mixed-integer LP. They study the convergence properties of this SAA problem, and illustrate their results using examples in portfolio optimization and a streaming bandwidth allocation problem. Bertsimas et al. [39] study a DRO model of the form (1.5), where a convex nondecreasing disutility function is used to quantify the uncertainty in the decision. A utility function is closely related to risk measures [150]. For instance, for a given probability measure, the expected utility might have the form of a combination of expectation and expected excess beyond a target, or an optimized certainty equivalent risk measure. As shown in Ben-Tal and Teboulle [23], under appropriate choices of utility functions, an optimized certainty equivalent risk measure can be reduced to the mean-variance and the mean-CVaR formulations.
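The CVaR case can be illustrated in loss form through the Rockafellar-Uryasev representation $\mathrm{CVaR}_\alpha(Z) = \min_{\eta} \{\eta + \frac{1}{1-\alpha}\mathbb{E}[(Z-\eta)_+]\}$, which has the structure of an optimized certainty equivalent with a piecewise linear disutility. The sketch below evaluates it for a discrete loss distribution. The loss-based sign convention, the function names, and the weight `lam` in the mean-CVaR combination are assumptions made for illustration.

```python
from typing import Sequence


def cvar(losses: Sequence[float], probs: Sequence[float], alpha: float) -> float:
    """min over eta of eta + E[(Z - eta)_+] / (1 - alpha) for a discrete loss Z."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")

    def objective(eta: float) -> float:
        excess = sum(p * max(z - eta, 0.0) for z, p in zip(losses, probs))
        return eta + excess / (1.0 - alpha)

    # The objective is convex and piecewise linear in eta with kinks at the atoms of Z,
    # so it suffices to evaluate it at the atoms.
    return min(objective(eta) for eta in losses)


def mean_cvar(losses: Sequence[float], probs: Sequence[float], alpha: float, lam: float) -> float:
    """Convex combination lam * E[Z] + (1 - lam) * CVaR_alpha(Z) with 0 <= lam <= 1."""
    mean = sum(p * z for z, p in zip(losses, probs))
    return lam * mean + (1.0 - lam) * cvar(losses, probs, alpha)
```

With four equally likely losses 1, 2, 3, 4 and $\alpha = 0.5$, the function returns the average of the two largest losses, in line with the interpretation of CVaR as an expected tail loss.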
Wiesemann et al. [318] study a DRO model formed via (5.26), where the decision maker is risk-averse via a nondecreasing convex piecewise affine disutility function.
In particular, they investigate shortfall risk and optimized certainty equivalent risk measures.
Unlike the above discussion, many decision-making problems involve comparing random vectors. One can generalize the notion of utility-based comparison to random vectors by using multivariate utility functions [5]. Another approach to compare random vectors is based on the idea of the weighted scalarization of random vectors. For the case that the weights are deterministic and take values in an arbitrary set, we refer to Dentcheva and Ruszczyński [87] for unrestricted sets, Homem-De-Mello and Mehrotra [148], Hu et al. [151], Hu and Mehrotra [149] for polyhedral sets, and Hu et al. [152] for convex sets. For instance, Hu et al. [151] study a weighted sum approach to a multiobjective budget allocation problem under uncertain performance indicators of projects. They assume that the weights take values in the convex hull of the weights suggested by experts and study a minmax approach to the expected weighted sum problem, where the expectation is taken with respect to the uncertainty in the performance indicators and the worst case is taken with respect to the weights. Note that the problem studied in Hu et al. [151] is in the framework of RO as the weights are deterministic.
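A minmax weighted-sum problem with weights in the convex hull of expert suggestions can be sketched as follows. Since the expected weighted sum is linear in the weights, its worst case over the convex hull is attained at one of the expert weight vectors, so only these need to be enumerated. The cost-minimization convention, the finite set of candidate decisions, and all names are hypothetical and do not reproduce the budget allocation model of Hu et al. [151].

```python
from typing import Dict, Sequence


def worst_case_weighted_cost(expected_costs: Sequence[float],
                             expert_weights: Sequence[Sequence[float]]) -> float:
    """Maximum of w . E[cost] over the convex hull of the expert weights (attained at a vertex)."""
    return max(sum(w_j * c_j for w_j, c_j in zip(w, expected_costs)) for w in expert_weights)


def minmax_decision(candidates: Dict[str, Sequence[float]],
                    expert_weights: Sequence[Sequence[float]]) -> str:
    """Candidate whose worst-case expected weighted cost is smallest.

    `candidates` maps a decision label to its vector of expected costs, one per criterion.
    """
    return min(candidates, key=lambda name: worst_case_weighted_cost(candidates[name], expert_weights))
```

For instance, with two criteria and two experts, a balanced decision can be preferred to decisions that perform well only under one expert's weights.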
The idea of using stochastic weights, governed by a probability measure that determines the relative importance of each vector of weights, is also introduced in Hu and Mehrotra [149] and Hu et al. [153]. For instance, Hu and Mehrotra [149] study a DRO approach to stochastically weighted multiobjective deterministic and stochastic optimization problems, where the weights are perturbed along different rays from a reference weight vector. They study the reformulations of the deterministic problem for the cases where the weights take values in (1) a polyhedral set, including those induced by a simplex, $\ell_1$-norm, and $\ell_\infty$-norm, and (2) a conic-representable set, including those induced by a single cone (e.g., $\ell_p$-norm, ellipsoids), intersection of multiple cones, and union of multiple cones. They further study the stochastic optimization problem. For the case that the weights and random parameters are independent, and the ambiguity in the probability distribution of weights is modeled via (5.22), they obtain a reformulation of the problem using the result in Delage and Ye [82]. For the case that the weights and random parameters are dependent, they also obtain reformulations of the resulting problem by utilizing the result from the deterministic case. They illustrate the ideas set forth in the paper using examples from disaster planning and agriculture revenue management problems.
8. Modeling Toolboxes. Goh and Sim [121] develop a MATLAB-based algebraic modeling toolbox, named ROME, for a class of DRO problems with conic-representable sets for the support and mean, known covariance matrix, and upper bounds on the directional deviations studied in Goh and Sim [120]. Goh and Sim [121] elucidate the practicability of this toolbox in the context of (1) a service-constrained inventory management problem, (2) a project-crashing problem, and (3) a portfolio optimization problem. A C++-based algebraic modeling package, named ROC, is developed in Bertsimas et al. [44] to demonstrate the practicability and scalability of the studied adaptive DRO model. Some features of ROC include declaration of uncertain parameters and linear decision rules, transcriptions of ambiguity sets, and reformulation of DRO using the results obtained in Bertsimas et al. [44]. A brief introduction to ROC and some illustrative examples to declare the objects of a model, such as variables, constraints, and ambiguity sets, among others, are given in an early version of Bertsimas et al. [41]. XProg (http://xprog.weebly.com) is a MATLAB-based algebraic modeling package that also implements the proposed model in Bertsimas et al. [44]. Chen et al. [77] develop an algebraic modeling package, AROMA, to illustrate the modeling power of their proposed ambiguity set.