Cycle-based formulations in Distance Geometry

The distance geometry problem asks for a realization of a given simple edge-weighted graph in a Euclidean space of given dimension K, where the edges are realized as straight segments of lengths equal (or as close as possible) to the edge weights. The problem is often modelled as a mathematical programming formulation involving decision variables that determine the position of the vertices in the given Euclidean space. Solution algorithms are generally constructed using local or global nonlinear optimization techniques. We present a new modelling technique for this problem where, instead of deciding vertex positions, formulations decide the lengths of the segments representing the edges in each cycle of the graph, projected on every dimension. We propose an exact formulation and a relaxation based on an Eulerian cycle. We then compare computational results from protein conformation instances obtained with stochastic global optimization techniques on the new cycle-based formulation and on the existing edge-based formulation. While edge-based formulations take less time to reach termination, cycle-based formulations are generally better on solution quality measures.


Introduction
We consider the fundamental problem in Distance Geometry (DG):

Distance Geometry Problem (DGP). Given a positive integer K and a simple undirected graph G = (V, E) with an edge weight function d : E → R≥0, establish whether there exists a realization x : V → R^K of the vertices such that Eq. (1) below is satisfied:

∀{i, j} ∈ E   ‖x_i − x_j‖ = d_ij,   (1)

where x_i ∈ R^K for each i ∈ V and d_ij is the weight on edge {i, j} ∈ E.
Although the DGP is given above in the canonical decision form, we consider the corresponding search problem, where one has to actually find the realization x. The DGP is also known as the graph realization problem in geometric rigidity [28, 6, 17]. It belongs to a more general class of metric completion and embedding problems [7, 23, 50].
In its most general form, the DGP might be parametrized over any norm [11]. In practice, the ℓ2 norm is the most usual choice [39], and will also be employed in this paper. The DGP with the ℓ2 norm is sometimes called the Euclidean DGP (EDGP). For the EDGP, Eq. (1) is often reformulated to:

∀{i, j} ∈ E   ‖x_i − x_j‖² = d_ij²,   (2)

which is a system of quadratic polynomial equations with no linear terms [35, §2.4].
The EDGP is motivated by many scientific and technological applications. The clock synchronization problem, for example, aims at establishing the absolute time of a set of clocks when only the time difference between subsets of clocks can be exchanged [52]. The sensor network localization problem aims at finding the positions of moving wireless sensors on a 2D manifold given an estimation of some of the pairwise Euclidean distances [17, 2, 15]. The Molecular DGP (MDGP) aims at finding the positions of atoms in a protein, given some of the pairwise Euclidean distances [39, 35]. The position of autonomous underwater vehicles cannot be determined via GPS (since the GPS signal does not reach under water), but must rely on distances estimated using sonars: a DGP can then be solved in order to localize the fleet [3]. Applications of the DGP to data science are described in [33]; see [32] for an application to natural language processing. In general, the DGP is an inverse problem that occurs every time one can measure some of the pairwise distances in a set of entities, and needs to establish their position.
The DGP is weakly NP-hard even when restricted to simple cycle graphs (by reduction from Partition) and strongly NP-hard even when restricted to integer edge weights in {1, 2} in general graphs (by reduction from 3sat) [49]. It is in NP if K = 1, but not known to be in NP if K > 1 for general graphs [4], which is an interesting open question [36].
There are many approaches to solving the DGP. Generally speaking, application-specific solution algorithms exploit some of the graph structure, whenever it is induced by the application. For example, a condition often required when reconstructing the positions of sensor networks is that the realization should be unique (as one would not know how to choose between multiple realizations), a condition called global rigidity [10]. This condition can, at least generically, be ensured by a specific graph rigidity structure of the unweighted input graph, as shown in [20]. For protein structures, on the other hand, which are found in nature in several isomers, one is sometimes interested in finding all (incongruent) realizations of the given protein graph [30, 47, 37]. Since such graphs are rigid, one can devise an algorithm (called Branch-and-Prune) that, following a given vertex order, branches on reflections of the position of the next vertex, which is computed using trilateration [35]. It is also possible that DGP problems arise in their full generality, i.e. independently of any further knowledge of their structure or properties: for such cases, one can resort to Mathematical Programming (MP) formulations and corresponding solvers [40, 12, 14].
The MP formulation that is most often used reformulates Eq. (2) to the minimization of the sum of squared error terms:

min_x Σ_{{i,j}∈E} (‖x_i − x_j‖² − d_ij²)²   (3)

This formulation describes an unconstrained polynomial minimization problem. The polynomial in question has degree 4, is always nonnegative, and is generally nonconvex and multimodal. The decision variables are represented by an n × K rectangular matrix x such that x_ik is the k-th component of the vector x_i, which gives the position in R^K of vertex i ∈ V. Each solution x* ∈ R^{nK} having global minimum value equal to zero is a realization of the given graph. Solutions with small objective function value represent approximate solutions. Because of the nonconvexity of the formulation and the hardness of the problem, Eq. (3) is not usually solved to guaranteed ε-optimality (e.g. using a spatial Branch-and-Bound approach [5]); rather, heuristic approaches, such as MultiStart (MS) [29], Variable Neighbourhood Search (VNS) [38], or relaxation-based heuristics [14, 42], may be used.
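To make the edge-based model concrete, the following sketch (in Python with NumPy; the function name and data layout are ours, not from the paper) evaluates the objective of Eq. (3) for a candidate realization stored as an n × K array:

```python
import numpy as np

def edge_based_error(x, edges, d):
    """Sum of squared errors (||x_i - x_j||^2 - d_ij^2)^2 over all edges.

    x: (n, K) array of candidate positions; edges: list of (i, j) pairs;
    d: dict mapping (i, j) to the target edge length d_ij.
    """
    total = 0.0
    for (i, j) in edges:
        diff = x[i] - x[j]
        total += (diff @ diff - d[(i, j)] ** 2) ** 2
    return total

# A unit segment realized exactly in K = 2 has zero error.
x = np.array([[0.0, 0.0], [1.0, 0.0]])
print(edge_based_error(x, [(0, 1)], {(0, 1): 1.0}))  # 0.0
```

A realization is exact precisely when this objective reaches its global minimum value zero; any positive value measures the aggregate squared violation of Eq. (2).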
As far as we know, all existing MP formulations for the EDGP are edge-based, such as the one in Eq. (3). In this paper we discuss a new MP formulation for the EDGP based on the incidence of cycles and edges instead, a relaxation based on Eulerian cycles, and a computational comparison with Eq. (3). Although this paper is not about graph theory, a fair amount of graph-theoretical content is needed to prove the main reformulation result. Since the OJMO readership is supposed to be well versed in optimization but not necessarily in graph theory, we strove to achieve clarity and self-containment at the expense of compactness.

Some existing MP formulations
In this short section we give a minimal list of typical variants of Eq. (3) in order to motivate the claim that the cycle-based formulation of the DGP discussed in this paper is new. Of course, only a complete enumeration of DGP formulations in the literature could substantiate this claim. But even this short list shows that the typical modelling approach for the DGP is direct: namely, decision variables encode the realization of each vertex as a vector in R^K. Many more formulations of the DGP and its variants, all corresponding to this criterion, are given in [29, 40, 12].
The closest variant of Eq. (3) simply adds a constraint ensuring that the centroid of all of the points in the realization is at the origin (see Lemma 5.3 below). This removes the degrees of freedom given by translations:

min_x { Σ_{{i,j}∈E} (‖x_i − x_j‖² − d_ij²)²  |  Σ_{i∈V} x_i = 0 }   (4)

This formulation describes a linearly constrained polynomial minimization problem. Like Eq. (3), the polynomial in Eq. (4) has degree 4, is always nonnegative, and is generally nonconvex and multimodal. Another small variant of Eq. (4) is achieved by adding range bounds to the realization variables x; generally valid (but slack) bound values can be set to ±(1/2) Σ_{{i,j}∈E} d_ij. This corresponds to the worst case of a single path being arranged in a straight line with unknown orientation.
Another possible formulation, derived again from Eq. (3), is obtained by replacing the squared errors with absolute value errors (whose positive and negative parts are encoded by s⁺, s⁻). This yields the following formulation:

min_{s,x} Σ_{{i,j}∈E} (s⁺_ij + s⁻_ij)
s.t. ∀{i, j} ∈ E   ‖x_i − x_j‖² − d_ij² = s⁺_ij − s⁻_ij,   s⁺, s⁻ ≥ 0.   (5)

Note that, again, each solution (s*, x*) with zero optimal objective value makes x* an encoding of a realization of the given graph. Thus, global optima are preserved by this reformulation, while local optima may differ. Yet another reformulation derived from replacing squared errors with absolute values consists in observing that the "plus" and "minus" parts of each absolute value term correspond to a convex and a concave function. This yields a formulation called push-and-pull, since the objective pulls adjacent vertices apart, while the constraints push them back together:

min_x Σ_{{i,j}∈E} (d_ij² − ‖x_i − x_j‖²)
s.t. ∀{i, j} ∈ E   ‖x_i − x_j‖² ≤ d_ij².   (6)

Eq. (6) is a Quadratically Constrained Quadratic Program with concave objective and convex constraints.
It was used within a Multiplicative Weights Update algorithm for the DGP in [12], as well as a basis for Semidefinite Programming and Diagonally Dominant Programming relaxations [14, 42]. It can be shown that all constraints are active at global optima, which therefore correspond to realizations of the given graph [46].

A new formulation based on cycles
In this section we propose a new formulation for the EDGP, based on the fact that the quantities x_ik − x_jk sum up to zero over all edges of any cycle in the given graph, for each dimensional index k ≤ K. This idea was used in [49] for proving weak NP-hardness of the DGP on cycle graphs. For a subgraph H of a graph G = (V, E), we use V(H) and E(H) to denote the vertex and edge set of H explicitly; given a set F of edges, we use V(F) to denote the set of incident vertices. Let m = |E| and n = |V|. For a mapping x : V → R^K we denote by x[U] the restriction of x to a subset U ⊆ V. Furthermore, we let a closed trail be a sequence of vertices and of the edges joining them, which begins and ends at the same vertex, and is such that no edge is repeated.
Lemma 3.1. Given an integer K > 0, a simple undirected weighted graph G = (V, E, d) and a mapping x : V → R^K, then for each cycle C in G, each orientation of the edges in C given by a closed trail W(C) in the cycle, and each k ≤ K we have:

Σ_{(i,j)∈W(C)} (x_ik − x_jk) = 0.   (7)

Proof. We renumber the vertices in V(C) to 1, 2, …, γ = |V(C)| following the walk order in W(C). Then Eq. (7) can be explicitly written as:

(x_1k − x_2k) + (x_2k − x_3k) + ⋯ + (x_{γ−1,k} − x_{γ,k}) + (x_{γ,k} − x_1k) = 0,

which holds since the left hand side telescopes to zero.

We introduce new decision variables y_ijk replacing the terms x_ik − x_jk for each {i, j} ∈ E and k ≤ K. Eq. (2) then becomes:

∀{i, j} ∈ E   Σ_{k≤K} y_ijk² = d_ij².   (8)

We note that, with a slight abuse of notation, we index the sum in Eq. (8) with the shorthand k ≤ K instead of k ∈ {1, 2, …, K}. We will keep this notation throughout the paper, for ease of reading. Moreover, we remark that for the DGP with other norms this constraint changes. For the ℓ1 or ℓ∞ norms, for example, we would have:

∀{i, j} ∈ E   Σ_{k≤K} |y_ijk| = d_ij   or   ∀{i, j} ∈ E   max_{k≤K} |y_ijk| = d_ij.   (9)

Next, we adjoin the constraints on cycles: for every cycle C in G, with edges oriented along a closed trail W(C),

∀k ≤ K   Σ_{(i,j)∈W(C)} y_ijk = 0.   (10)

We also note that the feasible value of a y_ijk variable is the (oriented) length of the segment representing the edge {i, j} projected on the k-th coordinate. We can therefore infer bounds for y as follows:

∀{i, j} ∈ E, k ≤ K   −d_ij ≤ y_ijk ≤ d_ij.   (11)

Although Eq. (11) is not necessary to solve the cycle formulation, it may improve the performance of spatial Branch-and-Bound (sBB) algorithms [53, 5] and of various "matheuristics" [41] that need explicit bounds on all variables, as well as allow an exact linearization of variable products, should a y variable occur in a product with a binary variable in some DGP variant. We now give the following definition and state our main result, i.e., that Eq. (8) and (10) are a valid MP formulation for the EDGP.

Definition 3.2. Given a strictly positive K ∈ N and a graph G = (V, E), Y = {y ∈ R^{Km} | (8) ∧ (10)} is the set of vectors satisfying Eq. (8) and (10).
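Lemma 3.1 can be checked numerically: the oriented coordinate differences along any closed trail telescope to zero. A small sketch (the helper name is ours, assuming a realization stored as an n × K NumPy array):

```python
import numpy as np

def cycle_sums(x, cycle):
    """Sum of oriented coordinate differences x_ik - x_jk along a closed trail.

    cycle: vertex sequence [v1, ..., vq]; the closed trail is v1 -> ... -> vq -> v1.
    Returns a length-K vector; by Eq. (7) it is zero, since the sums telescope.
    """
    q = len(cycle)
    total = np.zeros(x.shape[1])
    for t in range(q):
        i, j = cycle[t], cycle[(t + 1) % q]
        total += x[i] - x[j]
    return total

# Any mapping x : V -> R^K satisfies Eq. (7); here a random one with K = 3.
x = np.random.default_rng(0).normal(size=(4, 3))
print(np.allclose(cycle_sums(x, [0, 1, 2, 3]), 0))  # True
```

This identity holds for arbitrary x, not only realizations, which is exactly why Eq. (10) is a valid constraint on the y variables.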
We emphasize that Y depends on the EDGP instance (K, G).

Theorem 3.3. The set Y is non-empty if and only if (K, G) is a YES instance of the EDGP.
The proof argues by recursion on a graph decomposition of G that a certain linear system related to the cycles of G (see Eq. (12) below) has a solution in the x variables if and only if the given EDGP instance is YES, as certified by the y variables.
We shall construct our proof by steps. The first step defines a graph decomposition based on the removal of a single vertex. Given a graph G = (V, E) and a subset U ⊂ V, the subgraph of G induced by U is G[U] = (U, {{u, v} ∈ E | u, v ∈ U}). With a slight abuse of notation we denote the vertices of a graph G′ by V(G′) and its edges by E(G′). We let γ(G) be the number of connected components of G. A vertex v of G with the property that γ(G[V ∖ {v}]) > γ(G) is called a cut vertex. A graph G is biconnected if, for any pair u, v of distinct vertices of G, there is a simple cycle in G incident to u and v. It is not hard to show that biconnectedness is equivalent to connectedness and the absence of cut vertices. To see this, we first introduce the concept of "1-decomposition", then prove some statements related to it.
Definition 3.4. Given a connected graph G = (V, E), a 1-decomposition of G is a set G = {G_1, …, G_r} of connected subgraphs of G such that: (a) each G_i is connected; (b) the union of the G_i is G; (c) |V(G_i) ∩ V(G_j)| ≤ 1 for all i < j. The 1-decomposition {G} is called trivial; G is 1-decomposable if it has a nontrivial 1-decomposition.
The 1-decomposition bears some relationship to the block-cutpoint tree defined by Harary in [22, p. 36]. However, subgraphs in the 1-decomposition may also be trees, which cannot appear in Harary's construction, since every vertex of a tree is a cutpoint by definition. Trees are important because they are easy to realize in R^K. Their realizations can then be pasted to the realizations of the other subgraphs by rotations and translations, a fact that is used in the proof of the main theorem. The same would not follow if we were to use Harary's block-cutpoint trees, since they contract blocks to a single vertex. We do, however, invoke [22, Thm. 3.1] to state that a connected graph G = (V, E) is 1-decomposable if and only if it has a cut vertex.

Lemma 3.5. Let G be 1-decomposable, with decomposition G = {G_1, …, G_r}, and let C be a cycle in G. Then there is an index i ≤ r s.t. C is a subgraph of G_i.
Proof. Suppose, to aim at a contradiction, that there are two distinct subgraphs G_i, G_j in G both incident to the edges of C. Then there is a nontrivial path p in C, with at least two edges, joining a vertex u in G_i to a vertex v in G_j. Therefore, by [22, Thm. 3.1], there must be a cut vertex of G on p, which implies that there is a cut vertex in C, which is impossible, since cycles are biconnected.
We note that no biconnected graph G is 1-decomposable. On the other hand, a tree with n vertices can always be 1-decomposed into its n − 1 single-edge subgraphs.

Proposition 3.6. Any connected component G = (V, E) of a simple graph has a (possibly trivial) 1-decomposition consisting of biconnected subgraphs and tree subgraphs.
Proof. We prove this result by induction on the number β of biconnected subgraphs in a 1-decomposition C = {G_1, …, G_r} of G for some r ∈ N. We first deal with the base case, where β = 0. We claim that G must be a tree: supposing G has a cycle G′, by biconnectedness of cycles and part (c) of Defn. 3.4, G′ must be one of the G_1, …, G_r. But then β ≥ 1, against the assumption. Therefore, the trivial 1-decomposition C = {G} is a valid 1-decomposition of G. We now tackle the induction step. Consider the largest biconnected subgraph B of G: then G′ = G[V ∖ V(B)] has one fewer biconnected subgraph than G, so, by induction, G′ has a 1-decomposition D′ = {G_1, …, G_t} consisting of biconnected subgraphs and trees. We prove that D = D′ ∪ {B} is a valid 1-decomposition of G. Condition (a) is verified since D′ is a valid 1-decomposition by induction, and B is biconnected; condition (b) is verified since the union of the graphs in D is G by construction; for condition (c), suppose there is i ≤ t s.t. |V(G_i) ∩ V(B)| ≥ 2: this means there are two distinct vertices u, v in both V(G_i) and V(B). Since G_i is connected, there must be a path p from u to v in G_i, hence G[V(B) ∪ V(p)] is a biconnected graph larger than B. But B was assumed to be largest, so this is not possible, and (c) holds, which concludes the proof.
The second step proves the easier (⇐) direction of Thm. 3.3.

Proposition 3.7. For any YES instance (K, G) of the EDGP there is a vector y* ∈ Y.
Proof. Let x* be a realization of G certifying that (K, G) is a YES instance, and let y*_ijk = x*_ik − x*_jk for all {i, j} ∈ E and k ≤ K. Since x* is a realization of G, by definition it satisfies Eq. (2), and, by substitution, Eq. (8). Moreover, any realization of G satisfies Eq. (7) over each cycle by Lemma 3.1. Hence, by replacement, it also satisfies Eq. (10).
In the third step, we lay the groundwork towards the more difficult (⇒) direction of Thm. 3.3. We proceed by contradiction: we assume that (K, G) is a NO instance of the EDGP, and suppose that the set Y for this instance is non-empty. For every y ∈ Y we consider the K linear systems

∀{i, j} ∈ E   x_ik − x_jk = y_ijk   (12)

for each k ≤ K, each with n variables and m equations. We square both sides then sum over k ≤ K:

∀{i, j} ∈ E   Σ_{k≤K} (x_ik − x_jk)² = Σ_{k≤K} y_ijk².   (13)

By Eq. (8) we have

∀{i, j} ∈ E   Σ_{k≤K} y_ijk² = d_ij²,   (14)

whence follows Eq. (2), contradicting the assumption that the EDGP instance is NO. So we only need to show that there is a solution x* to Eq. (12) for any given y ∈ Y. To this effect, we shall exploit the 1-decomposition of G into biconnected graphs and trees derived in Prop. 3.6. First, though, we have to show that Eq. (12) has a solution if Y ≠ ∅ in the "base cases" of the 1-decomposition, namely trees and biconnected graphs.
The following result essentially proves that the constraint matrix of Eq. (12) has full rank, which is an easy consequence of graphic matroid theory. We prove the result by elementary means for self-containment.

Lemma 3.8. Let G = (V, E) be a tree, and Y ≠ ∅. Then Eq. (12) has a solution for every k ≤ K.
Proof. Let M be the coefficient matrix of the system of equations (12), for a given k ≤ K, and let y^k be the vector (y_uvk | {u, v} ∈ E). We note that, since M is the (transposed) incidence matrix of G, only the right-hand side of the system changes for each k. We aim at proving that M and (M, y^k) have the same rank, and that this rank is full. We proceed by induction on the size |E| of the tree. The base case, where |E| = 1 and G consists of a single edge {u, v}, yields M = (1, −1), with rank 1, for each k ≤ K. By inspection, (M, y_uvk) also has rank 1 for any y_uvk. Consider a tree G′ with one fewer edge (say, {u, v}) than G, such that V ∖ V(G′) = {v}. Let the corresponding system Eq. (12), written M̃ x̃ = ỹ, satisfy rank(M̃) = rank(M̃, ỹ^k) for all k ≤ K. Then the shape of M is:

M = ( M̃    0
      e_u  −1 ),

where e_u = (0, …, 0, 1_u, 0, …, 0) and the last column corresponds to the new vertex v. This shows that rank(M) = rank(M̃) + 1, that this rank is full, and hence also that rank(M) = rank((M, y^k)).
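The rank argument can be illustrated computationally; the sketch below (our naming) builds the transposed incidence matrix used as M above and checks full row rank on a small path tree:

```python
import numpy as np

def incidence_matrix(n, edges):
    """Transposed incidence matrix of a graph on vertices 0..n-1:
    one row per edge {i, j}, with +1 in column i and -1 in column j."""
    M = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        M[r, i], M[r, j] = 1.0, -1.0
    return M

# A path tree on 4 vertices: rank equals |E| = 3 (full row rank), as in Lemma 3.8.
M = incidence_matrix(4, [(0, 1), (1, 2), (2, 3)])
print(np.linalg.matrix_rank(M))  # 3
```

Full row rank means the system M x = y^k is solvable for every right hand side, which is the substance of Lemma 3.8.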
Lemma 3.9. Let G = (V, E) be biconnected, and Y ≠ ∅. Then Eq. (12) has a solution for every k ≤ K.

Proof. We proceed by induction on the simple cycles of G. For the base case, we consider G to be a graph consisting of a single cycle, with corresponding y ∈ Y. Since G is a cycle, it has the same number of vertices and edges, say q. This implies that, for any fixed k ≤ K, Eq. (12) is a linear system M x = y^k (where y^k = (y_uvk | {u, v} ∈ E)) with a q × q coefficient matrix:

M = (  1 −1
          1 −1
             ⋱  ⋱
                1 −1
      −1            1 ).   (15)

We remark that M is the incidence matrix of G as in the proof of Lemma 3.8. By Eq. (7) and by inspection of Eq. (15) it is clear that rank(M) = q − 1: then Eq. (10) ensures that rank((M, y^k)) = rank(M), and therefore that Eq. (12) has a solution.
We now tackle the induction step. The incidence vectors in E of the cycles of any graph form a vector space of dimension m − n + 1 over the finite field F_2 = {0, 1} [51]. We consider a fundamental cycle basis B of G (see Sect. 4). We assume that (a) G′ is a union of fundamental cycles in B′ ⊊ B, for which Eq. (12) has a solution x′ by the induction hypothesis, and (b) that C is another fundamental cycle in B ∖ B′, with a solution x^C of Eq. (12) that exists by the base case. We aim at proving that Eq. (12) has a solution for G′ ∪ C. Since G is biconnected, the induction can proceed by ear decomposition [44], which means that G′ is also biconnected, and that C is such that F = C ∩ G′ is a nonempty path. By Eq. (10) applied to C, we have

∀k ≤ K   Σ_{{i,j}∈E(C)} y_ijk = 0.   (16)

Since x′ satisfies Eq. (12) by the induction hypothesis,

∀{i, j} ∈ E(F), k ≤ K   x′_ik − x′_jk = y_ijk.   (17)

We replace Eq. (17) in Eq. (16), obtaining

∀k ≤ K   Σ_{{i,j}∈E(F)} (x′_ik − x′_jk) + Σ_{{i,j}∈E(C)∖E(F)} y_ijk = 0.   (18)

Moreover, x^C also satisfies Eq. (12) over C, hence we can replace the right hand side of Eq. (18) with the corresponding terms in x^C_ik − x^C_jk to get:

∀k ≤ K   Σ_{{i,j}∈E(F)} (x′_ik − x′_jk) + Σ_{{i,j}∈E(C)∖E(F)} (x^C_ik − x^C_jk) = 0.   (19)

We now fix x′, and aim at modifying x^C so that: (a) x^C matches x′ on V(F), (b) the modified x^C is still a solution of Eq. (12) on C. We set x^C_ik to x′_ik for each i ∈ V(F), and consider the resulting linear system Eq. (12) given by M, as in Eq. (15), for each k ≤ K, where we assume without loss of generality that V(F) = {1, …, r} and V(C) ∖ V(F) = {r + 1, …, s}:

x_1k − x_2k = y_12k
   ⋮
x_{r−1,k} − x_rk = y_{r−1,r,k}
x_rk − x_{r+1,k} = y_{r,r+1,k}
   ⋮
x_{s−1,k} − x_sk = y_{s−1,s,k}
x_sk − x_1k = y_{s,1,k}.   (20)

The equations from (1) to (r − 1) in Eq. (20) are satisfied by the induction hypothesis, since they only depend on x′, so we can remove them from the system and assume x′ to be constant. We are left with:

−x_{r+1,k} = y_{r,r+1,k} − x′_rk
x_tk − x_{t+1,k} = y_{t,t+1,k}   (r + 1 ≤ t < s)
x_sk = y_{s,1,k} + x′_1k.   (21)

Summing up the left hand sides of Eq. (21), the terms in the unknowns x_{r+1,k}, …, x_sk telescope away to zero, for all k ≤ K, so the (s − r + 1) × (s − r) matrix M̄ of the k-th linear system Eq. (21) has rank ≤ s − r.
On the other hand, eliminating the first or last row makes it clear by inspection that the rest of the rows are linearly independent; therefore the rank of M̄ is exactly s − r. Summing up the components of the right hand side vector ȳ^k of Eq. (21), we obtain:

χ = Σ_{r≤t<s} y_{t,t+1,k} + y_{s,1,k} + x′_1k − x′_rk.

We remark that, by Eq. (17), x′_1k − x′_rk = Σ_{{i,j}∈E(F)} y_ijk, whence χ = 0 by Eq. (16). This implies that rank((M̄, ȳ^k)) = rank(M̄) = s − r. Therefore, Eq. (21) has a solution, which yields the modified x^C with properties (a) and (b) given above. This concludes the induction step and the proof.
We can finally give the proof of Thm. 3.3.

Proof of Thm. 3.3. The (⇐) part follows by Prop. 3.7. For the (⇒) part, we exploit a 1-decomposition of G into trees and biconnected subgraphs, derive solutions to Eq. (12) for each subgraph, and show that the solutions can be easily combined to yield a solution to Eq. (12) for the whole graph G.
We assume without loss of generality that G is connected (otherwise each connected component can be treated separately), and consider a 1-decomposition D = {G_1, …, G_r} of G. By Lemmata 3.8 and 3.9, there exist solutions x^1, …, x^r to Eq. (12) applied to G_1, …, G_r respectively. Consider the graph D on vertex set {1, …, r}, with an edge {i, j} whenever G_i and G_j share a (cut) vertex of G. By Lemma 3.5, D is a tree: otherwise, a cycle in D would be a contraction of a cycle in G not included in a single G_i, against Lemma 3.5. This allows us to reorder D so that, for each j > 1, there is a unique i < j such that {i, j} ∈ E(D).
We remark that, for each i ≤ r, x^i is a realization of G_i in R^K by Eq. (12)–(14). More precisely, x^i is a |V(G_i)| × K matrix x^i = (x^i_ℓk) so that x^i_ℓ = (x^i_ℓ1, …, x^i_ℓK) is the position of vertex ℓ ∈ V(G_i) in R^K. Note that the realizations x^1, …, x^r can be modified by translations without changing the values of y (by inspection of Eq. (12)).
We now construct a solution x of Eq. (12) for G by induction on D, ordered as described above. For the base case i = 1, we fix x^1 in any way (e.g. by taking the centroid of the rows of x^1 to be the origin), and initialize the first |V(G_1)| rows of x with those of x^1. For any i > 1, we identify the unique predecessor j of i in the order on D. The induction hypothesis ensures the existence of a solution x of the union of G_1, …, G_j. Consider the cut vertex v in V(G_j) ∩ V(G_i) guaranteed by the definition of the order on D, and let x̄_v ∈ R^K be its position. Then the translation x̄^i = x^i − 1 (x^i_v − x̄_v)^⊤ yields another valid solution of Eq. (12) applied to G_i by translation invariance, and this solution is such that x̄^i_v = x̄_v. Therefore, using the rows of x̄^i, x can be extended to a solution of Eq. (12) applied to the union of G_1, …, G_j and G_i, as claimed.
Thm. 3.3 can also be interpreted as a polynomial reduction of the EDGP to the problem of finding a solution of Eq. (8) and (10).
A remarkable consequence of Thm. 3.3 is that it allows a decomposition of the computation of the realization x into two stages: first, solve Eq. (8)–(10) to find a feasible y*; then solve

∀k ≤ K, {i, j} ∈ E   x_ik − x_jk = y*_ijk   (22)

to find a realization x*. We note that Eq. (22) is just a restatement of Eq. (12) universally quantified over k.
Corollary 3.11. Given an EDGP instance (K, G) and a solution y* ∈ Y, any solution x* of Eq. (22) is a valid realization of the given instance.

Proof. The feasibility of Eq. (22) with the right hand side replaced by y* ∈ Y follows directly from Thm. 3.3, since if such a y* exists then the EDGP instance is feasible.
The first stage is NP-hard by Cor. 3.10, while the second stage is tractable, since solving linear systems can be done in polynomial time.
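The tractable second stage amounts to solving the overdetermined but consistent linear system Eq. (22), e.g. by least squares, once per coordinate. A sketch under our own naming, with a hand-built feasible y for a triangle graph:

```python
import numpy as np

def realize_from_y(n, edges, y):
    """Second stage: recover x from a feasible y by solving x_ik - x_jk = y_ijk.

    y: (m, K) array, row r giving the oriented projections of edge r.
    Solved per coordinate by least squares; since the all-ones vector spans the
    null space (translations), the minimum-norm solution has zero centroid.
    """
    m, K = y.shape
    M = np.zeros((m, n))
    for r, (i, j) in enumerate(edges):
        M[r, i], M[r, j] = 1.0, -1.0
    x = np.zeros((n, K))
    for k in range(K):
        x[:, k] = np.linalg.lstsq(M, y[:, k], rcond=None)[0]
    return x

# A triangle: the cycle constraint forces y_02 = y_01 + y_12, which holds here.
edges = [(0, 1), (1, 2), (0, 2)]
y = np.array([[-1.0, 0.0], [0.5, -1.0], [-0.5, -1.0]])
x = realize_from_y(3, edges, y)
print(np.allclose(x[0] - x[1], y[0]))  # True
```

If y violated the cycle constraints, the system would be infeasible and the least-squares residual nonzero; this is exactly the situation addressed by Remark 3.12 and, later, by the Eulerian relaxation.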
Remark 3.12. Note that Eq. (22) has Km equations, but its rank may be lower, since there are only Kn variables: in particular, Eq. (22) may be an overdetermined linear system. The feasibility of this system is guaranteed by Cor. 3.11; in particular, the steps of the proof of Thm. 3.3 imply that Eq. (22) loses rank w.r.t. Km according to the incidence of the edges in the cycles of G. In other words, any solution y′ to Eq. (10) provides a right hand side to Eq. (22) that makes the system feasible.
The issue with Thm. 3.3 is that it relies on the exponentially large family of constraints Eq. (10). While this is sometimes addressed by algorithmic techniques such as row generation, we shall see in the following that it suffices to consider a polynomial set of cycles (which, moreover, can be found in polynomial time) in the quantifier of Eq. (10).

The cycle vector space and its bases
We recall that incidence vectors of cycles (in a Euclidean space having |E| dimensions) form a vector space over a field F, which means that every cycle can be expressed as a weighted sum of cycles in a basis. In this interpretation, a cycle in G is simply a subgraph of G where each vertex has even degree: we denote their set by C. This means that Eq. (10) is actually quantified over a subset of C, namely the simple connected cycles. Every basis has cardinality m − n + a, where a is the number of connected components of G. If G is connected, cycle bases have cardinality m − n + 1 [51].
Our interest in introducing cycle bases is that we would like to quantify Eq. (10) polynomially rather than exponentially in the size of G. Our goal is to replace "C is any simple connected cycle in C" by "C is a cycle in a cycle basis of G". In order to show that this limited quantification is enough to imply every constraint in Eq. (10), we have to show that, for each simple connected cycle C ∈ C, the corresponding constraint in Eq. (10) can be obtained as a weighted sum of constraints corresponding to the basis elements.
Another feature of Eq. (10) to keep in mind is that edges are implicitly given a direction: for each cycle, the term for the undirected edge {i, j} in Eq. (10) is (x_ik − x_jk). Note that while {i, j} is exactly the same vertex set as {j, i}, the corresponding term is either positive or not, depending on the direction (i, j) or (j, i). We deal with this issue by arbitrarily directing the edges in E to obtain a set A of arcs, and considering directed cycles in the directed graph Ḡ = (V, A). In this interpretation, the incidence vector of a directed cycle C of Ḡ is a vector c^C ∈ R^m satisfying the flow conservation conditions [27, §2, p. 201]:

∀i ∈ V   Σ_{(i,j)∈A} c^C_ij = Σ_{(j,i)∈A} c^C_ji.   (23)

A directed circuit D of Ḡ is obtained by applying the edge directions from Ḡ to a connected subgraph of G where each vertex has degree exactly 2 (note that a directed circuit need not be strongly connected, although its undirected version is connected). Its incidence vector c^D ∈ {−1, 0, 1}^m is defined as follows: whenever we walk over an arc (i, j) ∈ A(D) in the natural direction i → j we let the (i, j)-th component of c^D be 1; if we walk over (i, j) in the direction j → i we assign a −1, and otherwise a zero, where we have used A(D) to mean the arcs in the subgraph D.
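The incidence vector of a directed circuit can be computed mechanically from a closed walk; a small sketch (pure Python, our own naming):

```python
def circuit_incidence(arcs, walk):
    """Incidence vector of a directed circuit w.r.t. a fixed arc set.

    arcs: list of arcs (i, j) fixing the orientation of each edge;
    walk: closed vertex sequence, e.g. [0, 1, 2] for the circuit 0->1->2->0.
    Component is +1 when the walk traverses arc (i, j) as i -> j,
    -1 when it traverses it as j -> i, and 0 when the arc is not used.
    """
    c = [0] * len(arcs)
    q = len(walk)
    for t in range(q):
        u, v = walk[t], walk[(t + 1) % q]
        if (u, v) in arcs:
            c[arcs.index((u, v))] = 1
        elif (v, u) in arcs:
            c[arcs.index((v, u))] = -1
    return c

# Triangle with arcs (0,1), (1,2), (0,2): the walk 0->1->2->0 crosses
# the third arc against its orientation.
print(circuit_incidence([(0, 1), (1, 2), (0, 2)], [0, 1, 2]))  # [1, 1, -1]
```

These ±1 coefficients are exactly the signs with which the y variables enter the basis constraints of the next subsection.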

Constraints over cycle bases
The properties of undirected and directed cycle bases have been investigated in a sequence of papers by many authors, culminating with [27]. We now prove that it suffices to quantify Eq. (10) over a directed cycle basis.
Proposition 4.1. Let B be a directed cycle basis of Ḡ over Q. Then Eq. (10) holds if and only if:

∀B ∈ B, k ≤ K   Σ_{(i,j)∈A} c^B_ij y_ijk = 0.   (24)

Proof. Necessity (10) ⇒ (24) follows because Eq. (10) is quantified over all cycles: in particular, it follows for any undirected cycle in any undirected cycle basis. Moreover, the signs of all terms in the sum of Eq. (24) are consistent, by definition, with the arbitrary edge direction chosen for Ḡ.
Next, we claim sufficiency (24) ⇒ (10). Let C ∈ C be a simple cycle, and C̄ be its directed version with the directions inherited from Ḡ. Since B is a cycle basis, we know that there is a coefficient vector λ ∈ Q^{|B|} such that

c^C̄ = Σ_{B∈B} λ_B c^B.   (25)

We now consider the expression:

Σ_{B∈B} λ_B Σ_{(i,j)∈A} c^B_ij y_ijk.   (26)

On the one hand, by Eq. (25), Eq. (26) is identically equal to Σ_{(i,j)∈A(C̄)} c^C̄_ij y_ijk for each k ≤ K; on the other hand, each inner sum in Eq. (26) is equal to zero by Eq. (24). This implies Σ_{(i,j)∈A(C̄)} c^C̄_ij y_ijk = 0 for each k ≤ K. Since C is simple and connected, C̄ is a directed circuit. This implies that c^C̄ ∈ {−1, 0, 1}^m. Now it suffices to replace −y_ijk with y_jik to obtain

Σ_{(i,j)∈W(C)} y_ijk = 0,

where the edges on C are indexed in such a way as to ensure they appear in order of consecutive adjacency.
Obviously, if B has minimum (or just small) cardinality, Eq. (24) will be sparsest (or just sparse), which is often a desirable property of linear constraints occurring in MP formulations. Hence we should attempt to find short cycle bases B.
In summary, given a basis B of the directed cycle space of Ḡ, where c^B is the incidence vector of a cycle B ∈ B, the following:

min_{s≥0, y} Σ_{{i,j}∈E} (s⁺_ij + s⁻_ij)
s.t. ∀{i, j} ∈ E   Σ_{k≤K} y_ijk² − d_ij² = s⁺_ij − s⁻_ij
     ∀B ∈ B, k ≤ K   Σ_{(i,j)∈A} c^B_ij y_ijk = 0   (27)

is a valid formulation for the EDGP. The solution of Eq. (27) yields a feasible vector y*. As pointed out in Cor. 3.11, we must then solve Eq. (22) to obtain a realization x* for G.

How to find directed cycle bases
We require directed cycle bases over Q. By [27, Thm. 2.4], each undirected cycle basis gives rise to a directed cycle basis (so it suffices to find a cycle basis of G and then direct the cycles using the directions in Ḡ). Horton's algorithm [24] and its variants [19, 43] find a minimum cost cycle basis in polynomial time.
The most efficient deterministic variant is O(m³n) [43], and the most efficient randomized variant has the complexity of matrix multiplication. Existing approximation algorithms have marginally better complexity. It is not clear, however, that the provably sparsest constraint system will make the DGP actually easier to solve. We therefore consider a much simpler algorithm: starting from a spanning tree, we pick the m − n + 1 circuits that each chord (i.e., non-tree) edge defines with the rest of the tree. This algorithm [48] yields a fundamental cycle basis (FCB). Finding the minimum FCB is known to be NP-hard [13], but heuristics based on spanning trees prove to be very easy to implement and work reasonably well [13] (optionally, their cost can be improved by an edge-swapping phase [1, 31]).
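The spanning-tree FCB heuristic can be sketched in a few lines (pure Python, our own naming; the graph is assumed connected with vertices numbered 0, …, n − 1):

```python
from collections import deque

def fundamental_cycle_basis(n, edges):
    """Spanning-tree FCB heuristic: each non-tree edge (chord) closes one
    cycle with the unique tree path between its endpoints.
    Returns a list of cycles, each as a vertex sequence (closed by the chord)."""
    adj = {v: [] for v in range(n)}
    for (u, v) in edges:
        adj[u].append(v)
        adj[v].append(u)
    # BFS spanning tree rooted at vertex 0.
    parent = {0: None}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    tree = {frozenset((v, p)) for v, p in parent.items() if p is not None}
    basis = []
    for (u, v) in edges:
        if frozenset((u, v)) in tree:
            continue  # tree edge, not a chord
        # Paths from each endpoint up to the root.
        pu = [u]
        while parent[pu[-1]] is not None:
            pu.append(parent[pu[-1]])
        pv = [v]
        while parent[pv[-1]] is not None:
            pv.append(parent[pv[-1]])
        # Trim the common suffix down to the lowest common ancestor.
        while len(pu) > 1 and len(pv) > 1 and pu[-2] == pv[-2]:
            pu.pop()
            pv.pop()
        basis.append(pu + pv[-2::-1])  # u .. lca .. v; the chord closes it
    return basis

# K4: the FCB has m - n + 1 = 6 - 4 + 1 = 3 cycles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(fundamental_cycle_basis(4, edges)))  # 3
```

Each returned cycle yields one block of K constraints in Eq. (24), so this simple routine already produces the polynomially-sized constraint system.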
A relaxation based on an Eulerian cycle

i.e., the union of all the edges from the 2-path replacements. We note that, by construction,

E(Ĝ) ∖ Ê = E.   (28)

Let c^C_ij ∈ {1, −1} be the orientation of (i, j) in C w.r.t. G; let Ĉ be the simple Eulerian cycle in Ĝ corresponding to C.
We can now prove the main result of this section.

Proposition 5.1. Eq. (29) is a relaxation of Eq. (27) applied to G.
Proof. We first consider a variant of the cycle formulation in Eq. (27) applied to Ĝ, where, from the constraints corresponding to Eq. (8) (second line of Eq. (27)), we omit those indexed by Ê. We call this variant (⋆). We claim that (⋆) is an exact reformulation of Eq. (27) applied to G. The claim holds because E(Ĝ) ∖ Ê = E by Eq. (28), and because the signs of the y variables are irrelevant in Eq. (8), since they are squared. Now, since Ĉ is an Eulerian cycle in Ĝ, Eq. (†) must hold in G for any orientation of the edges of C, by Lemma 3.1. Therefore, Eq. (†) is an aggregation of the constraints in Eq. (24), which occur within the reformulation (⋆). So Eq. (29) is a relaxation of (⋆). The proposition follows because of the claim.
Note that Eq. (29) provides a solution ȳ that may not satisfy Eq. (24), which would also guarantee feasibility in Eq. (10) by Prop. 4.1. By Remark 3.12, this implies that Cor. 3.11 is no longer applicable. In other words, we cannot obtain a realization x of G from ȳ using the linear system in Eq. (22), since ȳ might well make Eq. (22) infeasible. We can fix this issue by adjoining Eq. (22) to Eq. (29) as additional constraints. For practical reasons we also propose to adjoin the centroid constraints

∀k ≤ K   Σ_{i∈V} x_ik = 0,   (30)

which provide a restriction of Eq. (27) by only keeping realizations of G having zero centroid (see Eq. (4)).
For a formulation P , we denote by val(P ) its optimal objective function value.
Lemma 5.2. Let P be Eq. (27), and let P′ be P with the x variables and the constraints in Eq. (22) adjoined. Then val(P′) = val(P).
Proof. This is a direct consequence of Cor. 3.11.
Lemma 5.3. For any reformulation (or relaxation) P of the EDGP involving the x variables, let P′ be P with the centroid constraints Eq. (30) adjoined. Then val(P) = val(P′).
Proof. Since P′ is a restriction of P, and the optimization direction is minimization, we have val(P) ≤ val(P′).
Let x be an optimal solution of P: then x′ = x − stack(centroid(x), n) (where the second term on the right-hand side is the centroid row K-vector stacked n times to yield an n × K matrix) is feasible in P′ by definition, which proves that val(P) ≥ val(P′). The result follows.
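The translation argument in this proof is easy to check numerically: subtracting the centroid changes no pairwise distance. A small NumPy sketch (illustrative, not part of the paper's code):

```python
import numpy as np

# Translating a realization by its centroid preserves all pairwise
# distances, so adjoining the centroid constraints does not change
# the optimal objective value (the argument of Lemma 5.3).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))          # n = 5 points in R^3
x_centered = x - x.mean(axis=0)          # subtract the centroid row vector

# pairwise distance matrices before and after the translation
D  = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
Dc = np.linalg.norm(x_centered[:, None] - x_centered[None, :], axis=-1)
```

The two distance matrices coincide, while the translated realization has zero centroid.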
Proof. Let us call Eq. (29) R and Eq. (27) P. By Prop. 5.1, R is a relaxation of P. By adjoining new variables x and Eq. (22) as constraints to both R and P, we obtain formulations R′, P′ such that R′ is a relaxation of P′. But by Lemma 5.2 we have val(P′) = val(P), so R′ is a relaxation of P, which is a valid formulation of the EDGP. Note that Eq. (31) is R′ with the centroid constraints Eq. (30) adjoined. By Lemma 5.3, therefore, val(R′) = val(Eq. (31)). Thus, Eq. (31) is a relaxation of the EDGP.

Computational experiments
The aim of this section is to compare the computational performance of the following EDGP formulations: (i) the cycle-based formulation in Eq. (27), where the realization is retrieved in a post-processing stage using Eq. (22) according to Cor. 3.11; (ii) the Eulerian cycle-based relaxation in Eq. (31); (iii) the classic edge-based formulation in Eq. (4).
All of these formulations are nonconvex Nonlinear Programs (NLPs).
More specifically, they are all as hard to solve as the EDGP itself, which is NP-hard.
As a solution algorithm, we used a very simple MultiStart (MS) heuristic, which calls a local NLP solver from a random initial starting point at each iteration and updates the best solution found so far as needed. Although there are better heuristics available [38,12,46], MS offers the best trade-off between implementation simplicity and efficiency. Moreover, more efficient heuristics often change the formulation during their execution, which would obscure the meaning of this computational comparison between formulations.
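The MS loop can be sketched as follows on the edge-based formulation. This is an illustrative stand-in only: the paper's experiments use IPOPT through AMPL, whereas here SciPy's L-BFGS-B plays the role of the local NLP solver, and `multistart_edge` is a hypothetical name.

```python
import numpy as np
from scipy.optimize import minimize

def multistart_edge(n, K, edges, d, iters=20, seed=0):
    """MultiStart on the edge-based EDGP objective:
    minimize sum over edges of (||x_i - x_j||^2 - d_ij^2)^2,
    restarting the local solver from random points and keeping the best."""
    rng = np.random.default_rng(seed)

    def obj(flat):
        x = flat.reshape(n, K)
        return sum((np.sum((x[i] - x[j]) ** 2) - d[i, j] ** 2) ** 2
                   for i, j in edges)

    best_val, best_x = np.inf, None
    for _ in range(iters):
        x0 = rng.uniform(-1.0, 1.0, n * K)      # random starting point
        res = minimize(obj, x0, method="L-BFGS-B")
        if res.fun < best_val:                  # update incumbent
            best_val, best_x = res.fun, res.x.reshape(n, K)
    return best_x, best_val
```

For a feasible instance (e.g., a unit triangle in K = 2) the incumbent objective value approaches zero after a few restarts.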
We evaluate the quality of a realization x of a graph G according to the mean distance error (MDE) and the largest distance error (LDE), defined as follows:

MDE(x, G) = (1/|E|) Σ_{{i,j}∈E} | ‖x_i − x_j‖ − d_ij |,
LDE(x, G) = max_{{i,j}∈E} | ‖x_i − x_j‖ − d_ij |.

Furthermore, for each realization x of a graph G found by the MS algorithm, we consider the value of the corresponding solution, solVal(x, G). We note that, due to the heuristic nature of MS, this value is not guaranteed to be globally optimal.
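Assuming the unscaled definitions above, both quality measures are a one-line computation over the edge set (`mde_lde` is an illustrative name, not from the paper's code):

```python
import numpy as np

def mde_lde(x, edges, d):
    """Mean and largest absolute distance error of realization x,
    where d maps each edge (i, j) to its target length d_ij."""
    errs = [abs(np.linalg.norm(x[i] - x[j]) - d[(i, j)]) for i, j in edges]
    return np.mean(errs), np.max(errs)

# unit square in R^2, with one deliberately wrong target length
x = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
d = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.1}
mde, lde = mde_lde(x, edges, d)   # errors: 0, 0, 0, 0.1
```

Here three edges are realized exactly and one is off by 0.1, so MDE = 0.025 and LDE = 0.1.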
The CPU time taken to find the solution may also be important, depending on the application.In the control of underwater vehicles [3], for example, DGP instances might need to be solved in real time.In other applications, such as finding protein structure from distance data [8,45] (our application of choice), the CPU time is not so important.
Our tests were carried out on a single CPU of a 2.1GHz 4-CPU 8-core-per-CPU machine with 64GB RAM running Linux. The local NLP solver used within the MS heuristic was IPOPT [9]. We observed in some preliminary tests that IPOPT was considerably slowed down by variants of Eq. (3) such as Eq. (5), which essentially move a nonconvexity in the objective to one in the constraints. The same holds for the cycle-based formulation in Eq. (27). We therefore reformulated Eq. (27) as follows: and Eq. (31) similarly.
Our implementation consists of a mixture of Python 3 [54] and AMPL [18], interfaced through amplpy. Cycle bases and Eulerian cycles are found using NetworkX [21]. Solutions to the feasible but possibly overdetermined linear systems in Eq. (22) are obtained using an ℓ1 error minimization approach, reformulated as a Linear Programming problem and solved with CPLEX [25].
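The ℓ1-to-LP reformulation is standard: minimize Σ t subject to −t ≤ Ay − b ≤ t. A minimal sketch, with SciPy's `linprog` standing in for CPLEX (illustrative only; our code uses CPLEX through AMPL):

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(A, b):
    """Solve min_y ||A y - b||_1 as an LP over variables (y, t):
    minimize sum(t)  s.t.  A y - t <= b  and  -A y - t <= -b."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])     # objective: sum of t
    A_ub = np.block([[A, -np.eye(m)],                 #  A y - t <= b
                     [-A, -np.eye(m)]])               # -A y - t <= -b
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m     # y free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]
```

On a consistent system the LP recovers the exact solution; on an inconsistent one it returns the ℓ1 (least absolute deviations) fit, which is what we use when Eq. (22) is overdetermined.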

Results
A benchmark on a diverse collection of randomly generated weighted graphs of small size and many different types, with a set-up very similar to the one discussed here, is presented in [34]. It was found that the cycle formulation finds better MDE values, while the edge formulation generally finds better LDE values and is faster. Some results on proteins, obtained with only 3 MS iterations, were also presented in [34].
The benchmark we consider here contains medium- to large-scale protein graph instances realized in R³, all of which contain cycles. W.r.t. the protein results presented in [34], we added one more instance, 1tii, which, at 69800 edges and 5684 vertices, is considerably larger than all the others. The results are given in Tables 1 and 2.
In Table 1, we report the instance name, the instance sizes m and n, and then the performance measures MDE, LDE and CPU for the cycle, Eulerian and edge-based formulations. In the last three lines we report the average, the standard deviation, and the number of instances on which each formulation performed best, for all performance measures. In all tested cases, finding the cycle basis, finding the Eulerian cycles, and solving Eq. (22) took a small fraction of the total solution time. The missing result for instance 100d on the Eulerian cycle reformulation is due to a failure in the NetworkX module, which occurred because the graph of 100d is not connected. It appears that, on average, there is relatively little difference between the quality performances of these three EDGP formulations on protein graphs of medium and large sizes. CPU-time wise, of course, the edge formulation is best. Cycle formulations, taken together, outperform the edge formulation on quality measures. The cycle-based formulation Eq. (27) is slightly better than the other formulations on both MDE and LDE. The number of instances on which Eq. (27) is best on quality measures is 13, against 11 for the edge-based formulation.
In Table 2, we report the instance name and solVal for the cycle and edge EDGP formulations. The three lines at the bottom of the table show the arithmetic and geometric means of each column ("arithmean" and "geomean"), and the percentage of instances on which the solution values of each formulation are smaller than those of the other ("best").
The cycle formulation reports better local optima more often than the edge formulation ("best" = 53.3%), while the latter is more stable on average, as its arithmetic mean is slightly smaller. However, since the solution values of the cycle formulation are sometimes much smaller than those of the edge formulation, the geometric mean of the former is about two orders of magnitude smaller than that of the latter. We observe that Eq. (27) was the only formulation with which a global optimum was found using MS (that of C0030pkl). Overall, the results reported in Table 2 follow those of Table 1, except in the case of instance hlx_amb.
We decided to exclude the Eulerian formulation from Table 2, as its objective function values were often larger than those of the corresponding cycle formulation, despite the fact that the former is a relaxation of the latter. This apparent anomaly is due to the heuristic nature of the MS solution algorithm.
All in all, we believe that our results show that cycle formulations are credible competitors to the well-established edge-based formulations, especially when CPU time is not an important performance measure (which is generally the case in the protein conformation application).

Table 1 :
Cycle formulation vs. Eulerian relaxation vs. edge formulation performances on protein graphs (realizations in K = 3 dimensions).

Table 2 :
MS solution values of cycle formulation vs. edge formulation (realizations in K = 3 dimensions)