Adaptive Cut Selection in Mixed-Integer Linear Programming

Cutting plane selection is a subroutine used in all modern mixed-integer linear programming solvers, with the goal of selecting the subset of generated cuts that induces the best solver performance. These solvers have millions of parameter combinations, and so are excellent candidates for parameter tuning. Cut selection scoring rules are usually weighted sums of different measurements, where the weights are parameters. We present a parametric family of mixed-integer linear programs together with infinitely many family-wide valid cuts. Some of these cuts can induce integer optimal solutions directly after being applied, while others fail to do so even if infinitely many are applied. We show, for a specific cut selection rule, that any finite grid search of the parameter space will always miss all parameter values that select integer-optimality-inducing cuts in infinitely many of our problems. We propose a variation on the design of existing graph convolutional neural networks, adapting them to learn cut selection rule parameters. We present a reinforcement learning framework for selecting cuts, and train our design using said framework over MIPLIB 2017 and a neural network verification data set. Our framework and design show that adaptive cut selection does substantially improve performance over a diverse set of instances, but that finding a single function describing such a rule is difficult. Code for reproducing all experiments is available at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP.


Introduction
A Mixed-Integer Linear Program (MILP) is an optimisation problem that is classically defined as:

min_x { c^T x : Ax ≤ b, l ≤ x ≤ u, x_j ∈ Z for all j ∈ J }.    (1)

Here, c ∈ R^n is the objective coefficient vector, A ∈ R^{m×n} is the constraint matrix, b ∈ R^m is the right-hand side constraint vector, l, u ∈ (R ∪ {−∞, ∞})^n are the lower and upper variable bound vectors, and J ⊆ {1, . . . , n} is the set of indices of integer variables. One of the main techniques for solving MILPs is the branch-and-cut algorithm, see [1] for an introduction. Generating cutting planes, abbreviated as cuts, is a major part of this algorithm, and is one of the most powerful techniques for quickly solving MILPs to optimality, see [3]. A cut is an inequality that does not remove any feasible solutions of (1) when added to the formulation. We restrict ourselves to linear cuts in this paper, denote a cut as α = (α_0, . . . , α_n) ∈ R^{n+1}, and denote the set of feasible solutions as X_I, so that a cut is formally defined by:

Σ_{i=1}^{n} α_i x_i ≤ α_0 for all x ∈ X_I.    (2)
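As a toy illustration (a two-variable instance of our own, not from the paper), the sketch below encodes a MILP as plain data and brute-force checks the cut-validity condition: an inequality is a valid cut only if it removes no integer-feasible point.

```python
from itertools import product

# Toy MILP: min -x1 - x2  s.t.  2*x1 + 2*x2 <= 3,  0 <= x1, x2 <= 2,  x1, x2 integer.
c = [-1.0, -1.0]
A = [[2.0, 2.0]]
b = [3.0]
lb, ub = [0, 0], [2, 2]

def integer_feasible_points():
    """Enumerate all integer points satisfying the bounds and A x <= b."""
    for x in product(*(range(l, u + 1) for l, u in zip(lb, ub))):
        if all(sum(a_i * x_i for a_i, x_i in zip(row, x)) <= rhs
               for row, rhs in zip(A, b)):
            yield x

def is_valid_cut(alpha, alpha0):
    """A cut alpha^T x <= alpha0 is valid if no integer-feasible point violates it."""
    return all(sum(a * v for a, v in zip(alpha, x)) <= alpha0
               for x in integer_feasible_points())

# x1 + x2 <= 1 is valid here (all integer-feasible points satisfy it),
# while x1 + x2 <= 0 cuts off e.g. (1, 0) and is therefore not a valid cut.
print(is_valid_cut([1.0, 1.0], 1.0))  # True
print(is_valid_cut([1.0, 1.0], 0.0))  # False
```

Brute-force enumeration is of course only viable on toy instances; it serves here to make the validity condition concrete.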
The purpose of cuts is to tighten the linear programming (LP) relaxation of (1), where the LP relaxation is obtained by removing all integrality requirements. Commonly, cuts are found that separate the current solution of the LP relaxation, referred to as x_LP, from the tightened relaxation, and for this reason algorithms that find cuts are often called separators. This separation property is defined as follows:

Σ_{i=1}^{n} α_i x_i^LP > α_0.    (3)

Within modern MILP solvers, the cut aspect of the branch-and-cut algorithm is divided into cut generation and cut selection subproblems. The goal of cut generation is to find cuts that both tighten the LP relaxation at the current node and improve overall solver performance. The cut selection subproblem is then concerned with deciding which of the generated cuts to add to the formulation (1). That is, given the set of generated cuts S' = {α^1, . . . , α^{|S'|}}, find a subset S ⊆ S' to add to the formulation (1). We focus on the cut selection subproblem in this paper, where we motivate the need for instance-dependent cut selection rules as opposed to fixed rules, and introduce a reinforcement learning (RL) framework for learning the parameters of such a rule. The cut selection subproblem is important, as adding either all or none of the generated cuts to the LP usually results in poor solver performance. This is due to the large computational burden of solving larger LPs at each node when all cuts are added, and the large increase in the number of nodes needed to solve MILPs when no cuts are added. For a summary on MILPs we refer readers to [1], for cutting planes to [23], for cut selection to [30], and for reinforcement learning to [28].
The rest of the paper is organised as follows. In Section 2, we summarise existing literature on learning cut selection. In Section 3, with an expanded proof in Appendix A, we motivate the need for adaptive cut selection by showing worst case performance of fixed cut selection rules. This section was inspired by [6], which proved complexity results for fixed branching rules. In Section 4 we summarise how cut selection is performed in the MILP solver SCIP [9]. In Section 5 we show how to formulate cut selection as a Markov decision process, and phrase cut selection as a reinforcement learning problem. This section was motivated by [15], which presented variable selection as a Markov decision process as well as experimental results of an imitation learning approach. Finally, in Section 6, we present a thorough computational experiment on learning cut selector parameters that improve root node performance, and study the generalisation of these parameters to the larger solving process. All experiments are done over MIPLIB 2017 [16] and a neural network verification data set [24] using the MILP solver SCIP version 8.0.1 [9].

Related Work
Several authors have proposed cut selection rules and carried out computational studies. The thesis [1] presents a linear weighted-sum cut selection rule, which drastically reduces the time to solve to optimality by selecting a reduced number of good cuts. This cut selection rule and algorithm, see [9], can still be considered the basis of what we use in this paper. A more in-depth guide to cutting plane management is given in [30].
Here, a large variety of cut measures are summarised, and additional computational results are given that show how a reduced subset of good cuts can drastically improve solution time. A further computational study, focusing on cut selection strategies for zero-half cuts, is presented in [4]. They hypothesise that generating a large number of cuts followed by a heuristic selection strategy is more effective than generating a few deep cuts. Note that the solvers and cut selection algorithms used in [1], [4], and [30] differ. More recently, [11] summarises the current state of separators and cut selection in the literature, and poses questions aimed at further developing the science of cut selection. The final remark of the paper ponders whether machine learning can be used to answer some of the posed questions.
Recently, the intersection of mixed-integer programming and machine learning has received a lot of attention, specifically when it comes to branching, see [6,15,24] for examples. To the best of our knowledge, however, there are currently only four publications on the intersection of cut selection and machine learning. Firstly, [7] shows how cut selection parameter spaces can be partitioned into regions, such that the highest ranking cut is invariant to parameter changes within each region. These results are extended to the class of Chvátal-Gomory cuts applied at the root, with a sample complexity guarantee on learning cut selection parameters w.r.t. the resultant branch-and-bound tree size. Secondly, [29] presents a reinforcement learning approach using evolutionary strategies for ranking Gomory cuts via neural networks. They show that their method outperforms standard measures, e.g. max violation, and generalises to larger problem sizes within the same class. Thirdly, [8] trains a neural network to rank linear cuts by expected objective value improvement when applied to a semi-definite relaxation. Their experiments show that substantial computational time can be saved when using their approximation, and that the gap after each cut selection round is very similar to that found when using the true objective value improvement. Most recently, [19] proposes a multiple instance learning approach for cut selection. They learn a scoring function parameterised as a neural network, which takes as input an aggregated feature vector over a bag of cuts. Their features are mostly composed of measures normally used to score cuts, e.g. norm violation. Cross entropy loss is used to train their network by labelling the bags of cuts before training starts.
Our contribution to the literature is three-fold. First, we provide motivation for instance-dependent cut selection by proving the existence of a family of parametric MILPs together with infinitely many family-wide valid cuts. Some of these cuts can induce integer optimal solutions directly after being applied, while others fail to do so even if infinitely many are applied. Using a basic cut selection strategy and a pure cutting plane approach, we show that any finite grid search of the cut selector's parameter space misses all parameter values that select integer-optimality-inducing cuts in infinitely many of our instances. An interactive version of this constructive proof is provided in Mathematica ® [31], and instance creation algorithms are provided using SCIP's Python API [9,22]. Second, we introduce an RL framework for learning instance-dependent cut selection rules, and present results on learning parameters of SCIP's default cut selection rule [1] over MIPLIB 2017 [16] and a neural network verification data set [24]. Third and finally, we implemented a new cut selector plugin, which is available from SCIP 8.0 [9], and enables users to include their own cut selection algorithms in the larger MILP solving process.

Motivating Adaptive Cut Selection
This section introduces a simplified cut scoring rule, and discusses how the parameters of such a rule are traditionally set in solvers. A theorem is then introduced that motivates the need for adaptive cut scoring rules, and is proven in Appendix A using a simulated pure cutting plane approach. Consider the following simplified version of SCIP's default cut scoring rule (see Section 4 for the default scoring rule):

score(α) := λ · isp(α) + (1 − λ) · obp(α), λ ∈ [0, 1].    (4)

Using the general MILP definition given in (1), we define the cut measures integer support (isp) and objective parallelism (obp) as follows:

isp(α) := |{i ∈ J : α_i ≠ 0}| / |{i ∈ {1, . . . , n} : α_i ≠ 0}|,    (5)

obp(α) := |Σ_{i=1}^{n} α_i c_i| / (‖(α_1, . . . , α_n)‖ · ‖c‖).    (6)

We now introduce Theorem 1, which refers to the λ parameter in (4).
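The two measures and the convex-combination score of the simplified rule can be sketched directly (toy cut data of our own; the measure definitions follow the descriptions above: integer support is the fraction of a cut's nonzero coefficients belonging to integer variables, and objective parallelism is the absolute cosine of the angle between the cut and the objective):

```python
import math

def isp(alpha, integer_idx):
    """Integer support: fraction of the cut's nonzero coefficients
    that belong to integer variables."""
    support = [i for i, a in enumerate(alpha) if a != 0.0]
    return sum(1 for i in support if i in integer_idx) / len(support)

def obp(alpha, c):
    """Objective parallelism: |cosine| of the angle between cut and objective."""
    dot = sum(a * ci for a, ci in zip(alpha, c))
    return abs(dot) / (math.hypot(*alpha) * math.hypot(*c))

def score(alpha, c, integer_idx, lam):
    """Simplified rule: a convex combination lam * isp + (1 - lam) * obp."""
    return lam * isp(alpha, integer_idx) + (1 - lam) * obp(alpha, c)
```

Sliding λ from 0 to 1 moves the rule from a purely objective-parallelism ranking to a purely integer-support ranking, which is exactly the degree of freedom Theorem 1 exploits.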

Theorem 1.
Given a finite discretisation of λ, an infinite family of MILP instances together with infinitely many family-wide valid cuts can be constructed. Using a pure cutting plane approach and applying a single cut per selection round, the infinite family of instances does not solve to optimality for any value in the discretisation, but does solve to optimality for infinitely many alternative λ values.
The general purpose of Theorem 1 is to motivate the need for instance-dependent parameters in the cut selection subroutine. The typical approach for finding the best choice of cut selector parameters, see previous SCIP computational studies [1,9,14], is to perform a parameter sweep, most often a grid search. A grid search, however, leaves regions of the parameter space unexplored. In our simplified cut scoring rule (4), we have a single parameter, namely λ, and these unexplored regions are simply intervals. We define Λ, the set of values in the finite grid search of λ, as follows:

Λ := {λ̂_1, . . . , λ̂_k} ⊂ [0, 1], with λ̂_1 < · · · < λ̂_k.

The set of unexplored intervals in the parameter space, denoted Λ̄, is then defined as:

Λ̄ := {(λ̂_i, λ̂_{i+1}) : i ∈ {1, . . . , k − 1}}.

Our goal is to show that for any Λ we can construct an infinite family of MILP instances from Theorem 1. Together with our infinitely many family-wide valid cuts and specific cut selection rule, we will show that the solving process does not finitely terminate for any choice of λ outside of some unexplored interval (λ_lb, λ_ub) ∈ Λ̄. In effect, this shows that using the same fixed λ value over all problems in a MILP solver could result in incredibly poor performance for many problems. This is somewhat expected, as a fixed parameter cannot be expected to perform well on all possible instances, and moreover, cut selection is only a small subroutine in the much larger MILP solving process. Additionally, the instance space of MILPs is non-uniform, and good performance over certain problems may be highly desirable as they occur more frequently in practice. Nevertheless, Theorem 1 provides important motivation for adaptive cut selection. See Appendix A for a complete proof.
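The unexplored intervals left by any finite grid are easy to enumerate; a small sketch (our own illustration, treating the parameter space as [0, 1]):

```python
def unexplored_intervals(grid):
    """Open subintervals of [0, 1] not covered by a finite grid of lambda values."""
    pts = sorted(set(grid))
    intervals = []
    if pts[0] > 0.0:
        intervals.append((0.0, pts[0]))          # gap before the first grid point
    intervals.extend(zip(pts, pts[1:]))          # gaps between consecutive points
    if pts[-1] < 1.0:
        intervals.append((pts[-1], 1.0))         # gap after the last grid point
    return intervals

# A grid search over {0, 0.5, 1} leaves (0, 0.5) and (0.5, 1) unexplored;
# Theorem 1 constructs instances solvable only for lambda inside such an interval.
print(unexplored_intervals([0.0, 0.5, 1.0]))  # [(0.0, 0.5), (0.5, 1.0)]
```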

Cut Selection in SCIP
Until now we have motivated adaptive cut selection in a theoretical manner, by simulating poor performance of fixed cut selector rules in a pure cutting plane approach. Using this motivation, we now present results on how the parameters of a cut selection scoring rule can be learnt and made to adapt to the input instance. We begin with an introduction to cut selection in SCIP [9]. The official SCIP cut scoring rule (7), in use since SCIP 6.0, is defined as:

score(α) := λ_1 · dcd(α) + λ_2 · eff(α) + λ_3 · isp(α) + λ_4 · obp(α), λ_i ≥ 0.    (7)

The measures integer support (isp) and objective parallelism (obp) are defined in (5) and (6). Using the general MILP definition (1), letting x_LP be the LP optimal solution of the current relaxation, and x' be the current best incumbent solution, we define the cut measures directed cutoff distance (dcd) and efficacy (eff) as follows:

dcd(α) := (Σ_{i=1}^{n} α_i x_i^LP − α_0) / |Σ_{i=1}^{n} α_i y_i|, where y := (x' − x_LP) / ‖x' − x_LP‖, and eff(α) := (Σ_{i=1}^{n} α_i x_i^LP − α_0) / ‖(α_1, . . . , α_n)‖.    (8)

We note that in SCIP the cut selector controls neither how many times it is called, nor which candidate cuts it is provided, nor the maximum number of cuts it may apply each round. We reiterate that each call to the selection subroutine is called an iteration or round. Algorithm 1 gives an outline of the SCIP cut selection rule.
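Before turning to the selection loop itself, the two distance measures can be computed directly; a sketch assuming the textbook definitions (efficacy as the Euclidean distance from x_LP to the cut hyperplane, and directed cutoff distance as that distance measured along the ray from x_LP towards the incumbent):

```python
import math

def efficacy(alpha, alpha0, x_lp):
    """Euclidean distance from x_lp to the hyperplane alpha^T x = alpha0
    (positive when the cut separates x_lp)."""
    violation = sum(a * x for a, x in zip(alpha, x_lp)) - alpha0
    return violation / math.hypot(*alpha)

def directed_cutoff_distance(alpha, alpha0, x_lp, x_incumbent):
    """Distance from x_lp to the cut hyperplane along the ray to the incumbent."""
    d = [xi - xl for xi, xl in zip(x_incumbent, x_lp)]
    norm_d = math.hypot(*d)
    y = [di / norm_d for di in d]                       # unit direction vector
    violation = sum(a * x for a, x in zip(alpha, x_lp)) - alpha0
    return violation / abs(sum(a * yi for a, yi in zip(alpha, y)))
```

For example, for the cut x_1 ≤ 1 with x_LP = (2, 0), the efficacy is 1; measured towards an incumbent at (0, 2) the directed cutoff distance grows to √2, since the ray meets the hyperplane at an angle.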
The SCIP cut selector rule in Algorithm 1 still follows the major principles presented in [1]. Cuts are greedily added in order of largest score according to the scoring rule (7). After a cut is added, all other candidate cuts that are deemed too parallel to the added cut are filtered out and can no longer be added to the formulation this round. Forced cuts, which are always added to the formulation, prefilter all candidate cuts for parallelism, and are most commonly one-dimensional cuts or user-defined cuts. We note that Algorithm 1 is a summarised version of the true algorithm, and abstracts away some procedures. Certain parameters have also been removed for the sake of simplicity, such as those which determine when two cuts are too parallel. We further note that (λ_1, λ_2, λ_3, λ_4) = (0.0, 1.0, 0.1, 0.1) as of SCIP 8.0.
Algorithm 1: SCIP Default Cut Selector (Summarised)
Input : cuts ∈ R^{s_1×n}, forced_cuts ∈ R^{s_2×n}, max_cuts ∈ Z≥0, (s_1, s_2) ∈ Z^2≥0
Return : Sorted array of selected cuts, the number of cuts selected
 1 n_cuts ← s_1                                // Size of cuts array
 2 for forced_cut in forced_cuts do
 3     cuts, n_cuts ← remove cuts from cuts too parallel to forced_cut
 4 end
 5 n_selected_cuts ← 0
 6 selected_cuts ← ∅
 7 while n_cuts > 0 and max_cuts > n_selected_cuts do
       // Scoring done with (7). If no primal solution exists, efficacy replaces directed cutoff distance
 8     best_cut ← select highest scoring cut remaining in cuts
 9     selected_cuts ← selected_cuts ∪ best_cut
10     n_selected_cuts ← n_selected_cuts + 1
11     cuts, n_cuts ← remove cuts from cuts too parallel to best_cut
12 end
13 return forced_cuts ∪ selected_cuts, s_2 + n_selected_cuts

Motivated by work from this paper, users can now define their own cut selection algorithms and include them in SCIP for all versions since SCIP 8.0 [9]. Users can do this with a single function interface, bypassing the previous need to modify SCIP source code. For example, users can introduce a cut selection rule with an entirely new scoring rule that replaces (7), or introduce a new filtering mechanism that is not based exclusively on parallelism. We hope that this leads to additional research on cut selection algorithms in modern MILP solvers.
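The greedy select-and-filter loop of Algorithm 1 can be sketched as follows (a simplification of our own: scores are precomputed, forced cuts are omitted, and "too parallel" means the absolute cosine between cut coefficient vectors exceeds a threshold):

```python
import math

def cos_parallelism(u, v):
    """Absolute cosine of the angle between two cut coefficient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return abs(dot) / (math.hypot(*u) * math.hypot(*v))

def greedy_select(cuts, scores, max_cuts, max_parallelism=0.9):
    """Greedily pick the highest-scoring cut, then drop remaining cuts
    too parallel to it; repeat until max_cuts are chosen or none remain."""
    remaining = list(range(len(cuts)))
    selected = []
    while remaining and len(selected) < max_cuts:
        best = max(remaining, key=lambda i: scores[i])
        selected.append(best)
        remaining = [i for i in remaining if i != best
                     and cos_parallelism(cuts[i], cuts[best]) <= max_parallelism]
    return selected

# The second cut is nearly parallel to the first and is filtered out,
# so the orthogonal third cut is chosen despite its lower score.
print(greedy_select([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]], [3.0, 2.0, 1.0], 2))
```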

Problem Representation and Solution Architecture
We now present our approach for learning cut selector parameters for MILPs. In Subsection 5.1 we describe our encoding of a general MILP instance into a bipartite graph. Subsection 5.2 introduces a framework for posing cut selection parameter choices as a RL problem, with Subsection 5.3 describing the graph convolutional neural network architecture used as our policy network. Subsection 5.4 outlines the training method to update our policy network.

Problem representation as a graph
The current standard for deep learning representation of a general MILP instance is the constraint-variable bipartite graph as described in [15]. Some extensions to this design have been proposed, see [12], as well as alternative non-graph embeddings, see [16] and [27]. We use the embedding as introduced in [15] and the accompanying graph convolutional neural network (GCNN) design, albeit with the removal of all LP-solution-specific features and a different interpretation of the output. The construction process for the bipartite graph can be seen in Figure 1.
The bipartite graph representation can be written as G = {V, C, E} ∈ G, where G is the set of all bipartite graph representations of MILP instances. V ∈ R^{n×7} is the feature matrix of the nodes on one side of the graph, which correspond one-to-one with the variables (columns) of the MILP. C ∈ R^{m×7} is the feature matrix of the nodes on the other side, which correspond one-to-one with the constraints (rows) of the MILP. An edge (i, j) ∈ E exists when variable x_i has a non-zero coefficient in constraint j, where i ∈ {1, . . . , n} and j ∈ {1, . . . , m}. We abuse notation slightly and say that E ∈ R^{m×n×1}, where E is the edge feature tensor. Note that we do not extend our MILP representation after every round of cuts is added, due to using single-step learning, see Subsection 5.2. The representation is extendable to multi-step learning, however, where the added cuts could become constraints. The exact set of features can be seen in Table 1.
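A bare-bones sketch of the extraction from the (c, A, b) data follows; the feature dictionaries here are illustrative placeholders, not the paper's exact Table 1 features:

```python
def milp_to_bipartite(c, A, b, integer_idx):
    """Build a minimal constraint-variable bipartite graph:
    one node per variable, one node per constraint,
    and one edge per non-zero coefficient of A."""
    var_nodes = [{"obj_coeff": ci, "is_integer": i in integer_idx}
                 for i, ci in enumerate(c)]
    cons_nodes = [{"rhs": bj} for bj in b]
    edges = [(i, j, A[j][i])                      # (variable, constraint, coefficient)
             for j, row in enumerate(A)
             for i, a in enumerate(row) if a != 0.0]
    return var_nodes, cons_nodes, edges

V, C, E = milp_to_bipartite([-1.0, -1.0], [[2.0, 0.0], [0.0, 3.0]], [3.0, 4.0], {0})
print(len(V), len(C), len(E))  # 2 2 2
```

Because only non-zero coefficients create edges, the graph inherits the sparsity of the constraint matrix, which is what makes message passing over it tractable for large instances.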

Reinforcement Learning Framework
We formulate our problem as a single-step Markov decision process. The initial state of our environment is s_0 = G_0 = G. An agent takes an action a_0 ∈ R^4, resulting in an instant reward r(s_0, a_0) ∈ R, and deterministically transitions to a terminal state s_1 = G_{N_r}, N_r ∈ Z. The action taken, a_0, is dictated by a policy π_θ(a_0|s_0) that maps any initial state to a distribution over our action space, i.e. a_0 ∼ π_θ( · |s_0).

[Figure 1: Construction of the constraint-variable bipartite graph from a MILP of the form min c_1 x_1 + · · · + c_n x_n subject to a_{j,1} x_1 + · · · + a_{j,n} x_n ≤ b_j for j ∈ {1, . . . , m}.]
The MILP solver in this framework is our environment, and the cut selector our agent. Let N_r be the number of paired separation and cut selection rounds we wish to apply, and G_i ∈ G be the bipartite graph representation of G ∈ G after i rounds have been applied. The action a_0 ∈ R^4 is the choice of cut selector parameters {λ_1, λ_2, λ_3, λ_4}, followed by N_r paired separation rounds. Applying action a_0 to state s_0 results in a deterministic transition to s_1 = G_{N_r}, defined by the function f : G × R^4 → G. The baseline function b : G → R maps an initial state s_0 to the primal-dual difference of the LP solution of f(s_0, a') ∈ G, where the solver is run with standard cut selector parameters a' ∈ R^4 and some pre-loaded primal solution. The primal-dual difference in this experiment can be thought of as a strict dual bound improvement, as the pre-loaded primal cannot be improved upon without a provably optimal solution itself. The pre-loaded primal also serves to make directed cutoff distance active from the beginning of the solving process. We do note that this is different from the normal solve process and introduces some bias, most notably for directed cutoff distance. Let g_{a_0}(s_0) be the primal-dual difference of the LP solution of f(s_0, a_0) if a_0 are the cut selector parameter values used. The reward r(s_0, a_0) can then be defined as:

r(s_0, a_0) := (b(s_0) − g_{a_0}(s_0)) / b(s_0).    (9)

Let (s_0, a_0, s_1) ∈ G × R^4 × G be a trajectory, also called a roll-out in the literature. The goal of reinforcement learning is to maximise the expected reward over all trajectories. That is, we want to find θ that parameterises:

θ* ∈ argmax_θ E_{s_0 ∼ p(s_0), a_0 ∼ π_θ(·|s_0)} [r(s_0, a_0)] = argmax_θ ∫_G p(s_0) ∫_{R^4} π_θ(a_0|s_0) r(s_0, a_0) da_0 ds_0.    (10)

Here, p(s_0) is the density function on instances s ∈ G evaluated at s = s_0. The pre-image f^{−1}(s_1) is defined as:

f^{−1}(s_1) := {(a_0, s_0) ∈ R^4 × G : f(s_0, a_0) = s_1}.    (11)

[Figure 2: The architecture of the policy network π_θ(a_0|s_0). H represents the hidden layers of the network.]
We note that equation (10) differs from the standard definition as seen in [28], and from those presented in similar research [15,29], as our action space is continuous. Additionally, as the set G is infinite and we do not know the density function p(s), we use sample average approximation, placing a uniform distribution over our input data set.

Policy Architecture
Our policy network, π_θ( · |s_0 ∈ G), is parameterised as a graph convolutional neural network, and follows the general design of [15], where θ fully describes the complete set of weights and biases in the GCNN. The changes in design are that we use 32-dimensional convolutions instead of 64 due to our lower-dimensional input, and output a 4-dimensional vector as we are interested in cut selector parameters. This technique of using the constraint-variable graph as an embedding for graph neural networks has gained recent popularity, see [10] for an overview of applications in combinatorial optimisation. Our policy network takes as input the constraint-variable bipartite graph representation s_0 = {V, C, E}. Two staggered half-convolutions are then applied, with messages being passed from the embedding V to C and then back. The result is a bipartite graph with the same topology but new feature matrices. Our policy is then obtained by normalising feature values over all variable nodes and averaging the result into a vector μ ∈ R^4. This vector represents the mean of a multivariate normal distribution, N_4(μ, γI), where γ ∈ R. We note that having the GCNN output only the mean was a design choice to simplify the learning process, and that our design can be extended to also output γ or additional distribution information. Any sample from the distribution N_4(μ, γI) can be considered an action a_0 ∈ R^4, which represents {λ_1, λ_2, λ_3, λ_4} with the non-negativity constraints relaxed. Figure 2 provides an overview of this architecture. For a walk-through of the GCNN, see Appendix C.
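The Gaussian head of the policy is simple enough to sketch with the standard library alone: sampling an action from N_4(μ, γI) and evaluating its log-density, which the training method later needs (the values of μ and γ below are arbitrary placeholders, not outputs of a trained network):

```python
import math
import random

def sample_action(mu, gamma, rng=random):
    """Draw a_0 ~ N(mu, gamma * I) by sampling each coordinate independently."""
    return [rng.gauss(m, math.sqrt(gamma)) for m in mu]

def log_prob(a, mu, gamma):
    """Log-density of action a under N(mu, gamma * I)."""
    k = len(mu)
    sq = sum((ai - mi) ** 2 for ai, mi in zip(a, mu))
    return -0.5 * (k * math.log(2 * math.pi * gamma) + sq / gamma)

mu = [0.2, 0.5, 0.2, 0.1]            # hypothetical GCNN output
a0 = sample_action(mu, gamma=0.01)   # one action: four cut selector multipliers
```

Since the covariance is isotropic, the density factorises over the four parameters; the log-density is maximised at a = μ and decays quadratically with the distance from the mean.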

Training Method
To train our GCNN we use policy gradient methods, specifically the REINFORCE algorithm with baseline and Gaussian exploration, see [28] for an overview. An outline of the algorithm is given in Algorithm 2.

Algorithm 2: Batch REINFORCE
Input : Policy network π_θ, MILP instance batch, n_samples ∈ N
 1 L ← 0
 2 for s_0 in batch do
 3     μ ← mean output of π_θ for s_0
 4     for i ∈ {1, . . . , n_samples} do
 5         a_0 ∼ N_4(μ, γI)
 6         s_1 ← Apply N_r rounds of separation and cut selection to s_0
 7         r ← Relative dual bound improvement of s_1 over some baseline
 8         L ← L + (−r × log(π_θ(a_0|s_0)))    // Use log probability for numeric stability
 9     end
10 end
11 θ ← θ − ∇_θ L                               // We use the Adam update rule in practice [20]

Algorithm 2 is used to update the weights and biases, θ, of our GCNN, π_θ( · |s_0 ∈ G). It does this for a batch of instances by minimising L, referred to as the loss function, see [17]. We used default parameter settings in the Adam update rule, aside from a learning rate of 5 × 10^−4. Our training approach is performed offline, and only the final GCNN is used for evaluation.
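To see the update of Algorithm 2 in isolation, the following self-contained toy run replaces the MILP solver with a one-dimensional reward that peaks at action 0.7, and the GCNN with a single scalar mean; all constants (learning rate, variance, epoch counts) are arbitrary choices of ours:

```python
import math
import random

random.seed(0)

theta = 0.0          # policy mean (stand-in for the network output)
gamma = 0.05         # fixed exploration variance
lr = 0.05            # learning rate
target = 0.7         # best action of the toy "environment"

def reward(a):
    """Toy stand-in for the relative dual bound improvement."""
    return -(a - target) ** 2

for epoch in range(400):
    grad = 0.0
    baseline = reward(theta)             # baseline: reward of the mean action
    for _ in range(20):                  # n_samples
        a = random.gauss(theta, math.sqrt(gamma))
        # d/dtheta log N(a; theta, gamma) = (a - theta) / gamma
        grad += (reward(a) - baseline) * (a - theta) / gamma
    theta += lr * grad / 20              # ascend the expected reward

print(round(theta, 2))  # close to 0.7
```

The baseline subtraction does not change the gradient in expectation but sharply reduces its variance, which is why the mean drifts steadily towards the reward-maximising action instead of wandering.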

Experiments
We use MIPLIB 2017 [16] as our first data set, which we simply refer to as MIPLIB, and a set of neural network verification instances [24] as our second data set, which we refer to as NN-Verification. For all subsections we run experiments on instances that have gone through SCIP's default presolve, see [2] for an overview of presolve techniques. Each individual run on a presolved instance consists of a single round of presolve (to remove fixed variables), then solving the root node, using 50 separation rounds with a limit of 10 cuts per round. Propagation, heuristics, and restarts are disabled for the runs, with a slightly modified version of SCIP's cut selector in Algorithm 1 being used, where λ is defined by the user for each run. A pre-loaded MIP start is also provided, which is the best solution found within 600s when solved with default settings. In the case of fewer than 10 cuts being selected due to parallelism filtering, the highest-scoring filtered cuts are added until the 10-cut-per-round limit is reached or no more cuts exist. We believe these conditions best represent a sandbox environment that allows cut selection to be the largest influence on solver performance. Additionally, all results are obtained by averaging over the SCIP random seeds {1, 2, 3}. All code for reproducing experiments can be found at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP. The modification of SCIP's default cut selector for our experiment is done to standardise the range of the individual cut measures, simplifying the learning process for those measures' coefficients. The measures isp and obp for any cut are in the range [0, 1], while the measures eff and dcd, under the assumption that x_LP is separated, are in the range [0, ∞).
We therefore substitute eff and dcd in the default SCIP cut scoring rule with normalised measures eff' and dcd', which rescale the original measures into the range [0, 1]. For all experiments SCIP 8.0.1 [9] is used, with PySCIPOpt [22] as the API, and Gurobi 9.5.1 [18] as the LP solver. PyTorch 1.7.0 [25] and PyTorch-Geometric 2.0.1 [13] are used to model the GCNN. All experiments for MIPLIB are run on a cluster equipped with Intel Xeon E5-2670 v2 CPUs with 2.50GHz and 128GB main memory, and for NN-Verification on a cluster equipped with Intel Xeon E5-2690 v4 CPUs with 2.60GHz and 128GB main memory.
For instance selection we discard instances from both instance sets that satisfy any of the criteria in Table 2. To minimise bias, instances were discarded if any criterion was triggered in an individual run on any seed under default conditions or those tested in Experiment 6.1. We believe that these conditions focus on instances where a good cut selection strategy can improve the dual bound in a reasonable amount of time. We note that improving the dual bound is a proxy for overall solver performance, and does not necessarily result in improved solution time. We additionally note that only 1000 randomly selected instances from the NN-Verification data set were used, as opposed to the entire data set. All instance sets following instance filtering are split into training and test subsets subject to an 80-20 split.

Lower Bounding Potential Improvement
To begin our experiments, we first perform a grid search to give a lower bound on the potential improvement that adaptive cut selection can provide, generating a grid of parameter scenarios over (λ_1, λ_2, λ_3, λ_4). Recall that λ_i for all i ∈ {1, 2, 3, 4} are respectively multipliers of the cut scoring measures normalised directed cutoff distance (dcd'), normalised efficacy (eff'), integer support (isp), and objective parallelism (obp). We solve the root node for all instances and parameter choices, and store the cut selector parameters that result in the smallest primal-dual difference, as well as their relative primal-dual difference improvement compared to that obtained with default cut selector parameter values. We remove all instances where the worst-case and best-case parameter choices differ by a relative primal-dual difference of less than 0.1%. Additionally, we remove instances where a quarter or more of the parameter choices result in identical best performance. These removals are made due to the sparse learning opportunities provided by such instances, as the best-case performance is either minimally different from the worst, or too common. This results in an additional 2.5% and 0.2% of instances being removed for MIPLIB under the two criteria respectively, leaving 87 (8.2%) instances. For NN-Verification no additional instances are removed under these conditions, leaving 231 (23.1%) instances. We note that all criteria for instance removal in Table 2 were applied using the grid search results as well as those under default conditions, to ensure no bias throughout instance selection.
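A grid of this shape can be enumerated directly; for illustration we assume each λ_i is a multiple of 0.1 and the four multipliers sum to one (this particular grid is our assumption, chosen only to show the mechanics):

```python
from itertools import product

# Hypothetical grid: each lambda_i in {0.0, 0.1, ..., 1.0},
# subject to lambda_1 + lambda_2 + lambda_3 + lambda_4 = 1.
scenarios = [tuple(k / 10 for k in ks)
             for ks in product(range(11), repeat=4)
             if sum(ks) == 10]
print(len(scenarios))  # 286
```

Working over integer tenths and dividing at the end avoids floating-point drift in the sum constraint; the 286 scenarios are exactly the compositions of 10 into four non-negative parts.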
We conclude from the results presented in Figure 3 that there exists a notable amount of improvement potential per instance from better cut selection rules. Specifically, we observe that the median relative primal-dual difference improvement compared to standard conditions is at least 7.7% over the training and test sets of both MIPLIB and NN-Verification. We consider this difference very large given that at most 500 cuts (50 rounds of 10 cuts) are added, and that this value is only a lower bound on the potential improvement, as the results come from a grid search of the parameter space. Instance-specific results for MIPLIB are available in Appendix D.
We draw attention to the aggregated best-performing parameter results from the grid search in Table 3. We see in both data sets that a distance-based metric has the largest mean value, namely λ_1 (the multiplier of dcd') for MIPLIB and λ_2 (the multiplier of eff') for NN-Verification. We also see that λ_3 and λ_4 take on much larger mean values than in the default SCIP scoring rule, where they have value 0.1, suggesting that the measures isp and obp are not only useful in distance-dominated scoring rules. These are aggregated results, however, and we note that they best summarise how every measure can be useful for some instances, further motivating the potential of instance-dependent cut selection. We stress that this motivation also holds for the homogeneous NN-Verification data set, where all parameters are still useful.

Random Seed Initialisation
Let θ_i be the initialised weights and biases using random seed i, where i ∈ N. To minimise the bias of our initialised policy with respect to λ = (λ_1, λ_2, λ_3, λ_4), the random seed that satisfies (13) is selected. We believe this random seed minimises bias, as the GCNN then initially outputs approximately equal values over the data set, allowing the GCNN to best decide the importance of each parameter. This was motivated by the observation that some random initialisations resulted in a cut measure always having an output value of 0 from the untrained GCNN. Different random seeds were used for the MIPLIB and NN-Verification experiments, and the random seeds were found using the combined training and test sets. The performance of the randomly initialised GCNN can be seen in Figure 4 and Table 4, with instance-specific results available in Appendix D. We observe a larger than expected mean improvement over the test set for MIPLIB; however, from the size of the test set and the much lower median improvement, we conclude that it is the result of outliers. Surprisingly, the median and mean relative improvement over the training and test sets are positive for MIPLIB, while they are negative for NN-Verification. We believe that the positive, albeit small, performance improvement of our random initialisation over default SCIP on MIPLIB stems from our slight modification of the cut scoring rule with eff' and dcd'. For NN-Verification, we believe the negative performance comes from the decrease in λ_2 (the multiplier of eff'), which is weighted highly in default SCIP, and is important for this instance set according to the results of Experiment 6.1.

Standard Learning Method
Before we attempt to determine the capability of our RL framework, policy architecture, and training method, we first design an experiment using SMAC (Sequential Model-based Algorithm Configuration), see [21]. SMAC is a standard package in the field of algorithm configuration, and is largely based on Bayesian optimisation. Unlike our approach, which returns instance-dependent cut selector parameters, SMAC returns a single set of parameter values that works over the entire instance set. It can therefore be thought of as a more intelligent approach than traditional grid searches, which have been used to define SCIP default parameter values. We therefore aim to outperform SMAC given the adaptive advantage of our algorithm. We use SMAC4BB, which is targeted at low-dimensional and continuous black-box functions, and provide SCIP 8.0.1's default values for λ. We run 250 epochs of SMAC (the same number as in Experiment 6.4); however, we note that our approach requires additional solver calls, as it takes more than one sample of cut selector parameters from the generated distributions during training. The function that SMAC attempts to minimise is the average primal-dual difference over all instance-seed pairs relative to that produced by SCIP with default cut selector parameter values. We observe an increase in performance over MIPLIB after using SMAC compared to that of the random initialisation, as seen in Figure 5. The median improvement over default SCIP for the training set increases to 1.6% from 0.2%, and to 2.5% from 2.4% for the test set, with the mean improvement over both sets increasing by at least 2%. For NN-Verification, we only observe a median increase to 0.5% from −0.02% for the training set and to −0.4% from −0.5% for the test set, with the mean performance of each set increasing by less than 2%.
From the best found constant parameter choices generated by SMAC, displayed in Table 5, we conclude that an efficacy-dominated cut scoring rule, such as default SCIP's, is likely the best choice for NN-Verification if restricted to a non-adaptive rule. We now show the performance of our RL framework, policy architecture, and training method compared to default SCIP parameter choices over MIPLIB and NN-Verification. To do so, we run 5000 iterations of Algorithm 2 (250 epochs), with n_samples set to 20, and allocate 10% of instances from the training set per batch. γ of the multivariate normal distribution, N_4(µ, γI), is defined by the following, where n_epochs is the total number of iterations of Algorithm 2 and i_epoch is the current epoch:
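The sampling step can be sketched as follows. Since the exact γ schedule is given by the paper's displayed formula, a simple linear decay is assumed here purely for illustration.

```python
import numpy as np

def gamma_schedule(i_epoch, n_epochs, gamma_start=0.1):
    """Illustrative exploration variance, decaying linearly to 0.

    The paper defines gamma by its own formula; a linear decay from
    gamma_start is an assumption made only for this sketch.
    """
    return gamma_start * (1.0 - i_epoch / n_epochs)

def sample_parameters(mu, gamma, n_samples, rng):
    """Draw n_samples cut-selector parameter vectors from N(mu, gamma*I)."""
    cov = gamma * np.eye(len(mu))
    return rng.multivariate_normal(mu, cov, size=n_samples)

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, 0.1, 0.1])          # e.g. a GCNN output
samples = sample_parameters(mu, gamma_schedule(0, 250), 20, rng)
```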

Learning Adaptive Parameters
We note that γ represents one of many opportunities (others being the GCNN structural design and the training algorithm) where a substantial amount of additional effort could be invested to (over)tune the learning experiment. We also note that a forward pass of the trained network takes on average less than 0.1s over both data sets, see Table 8 in Appendix D, and that updating the GCNN takes negligible time compared to solving the MILPs. The randomly initialised GCNN over the training set of MIPLIB has a median relative primal-dual difference improvement of 0.2% over default, as seen in Figure 4, compared to the 1.7% of our MIPLIB-trained GCNN, as seen in Figure 6. This is marginally better than the 1.6% improvement over default from SMAC in Figure 5, with our approach slightly improving over SMAC on the training set of MIPLIB and performing comparably on the test set, with a better median improvement but a worse mean. These results suggest that our approach works, in that it is comparable with other standard approaches and can improve over default parameter choices, but that it is unable to capture the full extent of the performance improvement shown to exist in Experiment 6.1. Interestingly, we note that over MIPLIB, Experiments 6.1-6.4 all on average set λ_1 (the multiplier for dcd) to be the largest coefficient, as seen in Tables 3, 5, and 6. This contrasts with the default parameter values of SCIP 8.0, where this multiplier is set to 0. We believe this difference is strengthened by our computational setup, in which we provide SCIP a good initial starting solution. This starting solution is often optimal and better than what initial heuristics would produce. Additionally, as we only add cuts at the root node, the distance to the cut in the direction of the primal solution reliably points inside the feasible region. This is not the case once the search space has been partitioned, as in branch and bound.
We also note that over all experiments and learning techniques, for the heterogeneous data set MIPLIB, every cut measure is useful for some instances. For specific instance results, see Appendix D.
For NN-Verification, we see from the data presented in Figure 6 that our framework, like SMAC, failed to perform on the homogeneous data set and to capture the performance improvement that was shown to exist in Experiment 6.1. We performed comparably to the random initialisation from which training began, and converged to an efficacy-dominated scoring rule featuring integer support, see Table 6, in a near identical manner to the constant scoring rule learned by SMAC. Interestingly, we see that the standard deviation of all measures is near 0, meaning that our framework converged to a constant output. We believe that this is due to a local minimum existing for an efficacy-dominated model, with the default parameters being quite good, and our restriction to static features, i.e. those available before the first LP solve, being insufficiently diverse for the more homogeneous NN-Verification.

Generalisation to Branch and Bound
Until this point we have focused on root node restricted experiments and used the primal-dual difference as a surrogate for solver performance. We now deploy the best instance-dependent parameter values from the grid search in Experiment 6.1 and from our approach in Experiment 6.4 to the full solving process. We keep the same sandbox environment that we have used until this point; however, we no longer limit ourselves to the root node, and we set a time limit of 7200s. From the results presented in Table 7, specifically Table 7 (a), we see that instance-dependent cut selector parameters that induce good root node performance do generalise to the larger solving process. This follows from the best grid search instance-dependent parameter values from Experiment 6.1 clearly outperforming the default parameter choice. We thus believe that, in general, the primal-dual difference (or gap) after applying cuts is an adequate surrogate of overall solver performance for a given set of parameters. We note, however, that there exist many solution paths where this statement is not true, and many instances where dual bound progression at the root is a poor surrogate. The generalisation of the improvement was not as clear for our framework, as observable in Table 7 (b), with our framework outperforming default SCIP on both data sets in terms of time, but losing in terms of nodes. Interestingly, over MIPLIB instance-seed pairs that time out, our framework has a better dual bound than default 66.11% of the time.
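The Wins and Ties percentages reported in Table 7 can be computed as follows, given the pair-filtering described in the table caption has already been applied; the solve times below are illustrative.

```python
def wins_and_ties(generated, default, lower_is_better=True):
    """Percentage of instance-seed pairs won / tied by generated parameters.

    generated / default map each (instance, seed) pair to a metric value,
    e.g. solve time (lower is better) or dual bound (higher is better).
    """
    wins = ties = 0
    for key in generated:
        g, d = generated[key], default[key]
        if g == d:
            ties += 1
        elif (g < d) == lower_is_better:
            wins += 1
    n = len(generated)
    return 100.0 * wins / n, 100.0 * ties / n

# Illustrative solve times (s) over four instance-seed pairs.
gen = {("a", 0): 50.0, ("a", 1): 80.0, ("b", 0): 60.0, ("b", 1): 70.0}
dft = {("a", 0): 60.0, ("a", 1): 80.0, ("b", 0): 55.0, ("b", 1): 90.0}
win_pct, tie_pct = wins_and_ties(gen, dft)  # two wins, one tie, one loss
```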

Conclusion
We presented a parametric family of MILPs together with infinitely many family-wide valid cuts. We showed, for a specific cut selection rule, that any finite grid search of the parameter space will always miss all parameter values that select integer optimal inducing cuts in infinitely many of our instances. We then presented a reinforcement learning framework for learning cut selection parameters, phrasing cut selection in MILP as a Markov decision process. By representing MILP instances as bipartite graphs, we used policy gradient methods to train a graph convolutional neural network.

Table 7 Results of generated cut selection parameters compared to SCIP default parameters. Time compares the solution time of a run, Nodes the number of nodes, and Dual bound the dual bound when the time limit is hit. For Time, instance-seed pairs are considered when at most one of the two runs (default parameters and generated parameters) hit the time limit. For Nodes, instance-seed pairs that always solved to optimality are considered, and for Dual bound, instance-seed pairs that always hit the time limit are considered. The columns Wins and Ties give the percentage of instance-seed pairs for which the generated parameters respectively outperformed or tied with the default parameters under the given metric.
(a) Generalisation to branch and bound of instance-dependent parameters from Experiment 6.1

The framework generates good performing, albeit sub-optimal, parameter values for a modified variant of SCIP's default cut scoring rule over MIPLIB 2017, with the performance being comparable to standard learning techniques, and clearly better than the random initialisation. Our framework, however, was subject to mode collapse over the NN-Verification data set, and failed to generate a diverse and well performing set of instance-dependent cut selector parameter values.

Results from our grid search experiments showed that there is substantial potential for improvement in adaptive cut selection, with a median relative primal-dual difference improvement of 7.77% over MIPLIB and 8.29% over NN-Verification with only 50 rounds of 10 cuts. The generalisation of these best performing instance-dependent parameter values to branch and bound then revealed a correlation between the primal-dual difference after cut rounds and overall solver performance in terms of both solution time and number of nodes.
We suggest three key areas of further research for those wanting to build on this work. Firstly, there is a dire need for more instance sets that are sufficiently diverse and non-trivial, yet not overly difficult. Secondly, throughout this paper we restricted ourselves to individual cut measures already featured in SCIP's default rule. Further research could explore rules containing non-linear combinations of additional measures. Third and finally, we suggest that a focus on the larger selection algorithm could lead to further improved performance. For all experiments the separator algorithm's parameters were set to constant values, we ignored other cut selector related parameters, and we restricted ourselves to a parallelism-based filtering method. We end by noting that a major contribution of this work, the new cut selector plugin for SCIP, enables the last two areas of further research via the easy inclusion of custom cut selection algorithms in a modern MILP solver.

A Proof of Theorem 1 from Section 3
For the following theorem, we will simulate a pure cutting plane approach to solving MILPs using scoring rule (4). We will use custom MILPs and cutting planes, and select exactly one cut per round. Each call to the selection subroutine is called an iteration or round. The theorem is intended to show how a fixed cut selection rule can consistently choose "bad" cuts. The parametric MILP we use to represent our infinite family of instances is defined as follows, where a ∈ R_{≥0} and d ∈ [0, 1]: The polytope of our MILP's LP relaxation is the convex hull of the following points: The convex hull of X is a 3-simplex, or alternatively a tetrahedron; see Figure 7 for a visualisation. For such a feasible region, we can exhaustively write out all integer feasible solutions: As we are dealing with linear constraints and objectives, we know that (1, x_2, 0), where 0 < x_2 < 1, cannot be optimal without both (1, 1, 0) and (1, 0, 0) also being optimal. We therefore simplify the integer feasible set, I_X, to: At each iteration of adding cuts we will always present exactly three candidate cuts. We name these cuts as follows. The "good cut", denoted GC: applying this cut immediately results in the next LP solution being integer optimal. The "integer support cut", denoted ISC^n: applying this cut results in a new LP solution barely better than the previous iteration's. The cut has very high integer support, as the name suggests, and would be selected if λ from (4) is set to a high value. The superscript n refers to the iteration number. The "objective parallelism cut", denoted OPC^n: applying this cut also results in a new LP solution barely better than the previous iteration's. The cut has very high objective parallelism, and would be selected if λ from (4) is set to a low value. The superscript n again refers to the iteration number.
The cuts are defined as follows, where GC has the additional property of being selected in the case of a scoring tie: We use ϵ_n here to denote a small shift of the cut, with a greater ϵ_n resulting in a deeper cut. We define ϵ_n as follows: An overview of the proof of Theorem 1 can now be given. At each cut selection round we present three cuts, where for high values of λ, ISC^n is selected, and for low values, OPC^n is selected. As the scoring rule (4) is linear w.r.t. λ, our aim is to controllably sandwich the intermediate values of λ that select GC. Specifically, for any given Λ, we aim to construct infinitely many parameter values for a and d s.t. none of the intermediate values of λ belong to Λ. The vertex sets of the LP relaxation of P(a, d), (a, d) ∈ R_{≥0} × [0, 1], after having individually applied cut (17), (18), or (19) are, respectively:

Lemma 3. The vertex set of the LP relaxation of
Proof. Apply GC, ISC^n, and OPC^n to P(a, d) individually and then compute the vertices of the convex hull of the LP relaxation. ◀ We note that because both integer support and objective parallelism do not depend on the current LP solution, iteratively applying deeper cuts of the same kind leaves the cuts' scores unchanged. Thus, provided they do not separate any integer points and continue to cut off the LP solution, deeper cuts of the same kind can be recursively applied. This is why both ISC^n and OPC^n carry a superscript.
After applying a cut of one kind, e.g. ISC^n, we cannot always simply increment n for the other kind, e.g. to OPC^{n+1}. This is because OPC^{n+1} does not guarantee separation of the new LP solution of the problem P(a, d) ∩ ISC^n for all sequences {ϵ_n, ϵ_{n+1}}. Instead, we create a variant of OPC^{n+1}, namely OPC̄^{n+1}, which always entirely removes the facet of the LP created by adding ISC^n, independent of how the series of ϵ_n values increases. Once again, we note that as only the RHS values change, the scores of all cuts within the same type remain unchanged, and thus no two cuts from different types are ever applied. We have proven our results using Mathematica [31], and a complete notebook containing step-by-step instructions can be found at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP. Below we outline the cumulative lemmas needed to prove Theorem 1, and summarise the calculations taken to achieve each step.

Lemma 4.
Having applied the cut ISC^n to P(a, d), (a, d) ∈ R_{≥0} × [0, 1], a new facet is created. Applying either GC or a deeper variant of OPC^{n+1} cuts off that facet. The deeper variant, denoted OPC̄^{n+1}, which differs from OPC^{n+1} only in its RHS value, is defined as: Proof. One can first verify that the vertex set of the facet is ISC^n_X \ I_X. One can then find the smallest ϵ′ s.t. the following cut is valid for all x ∈ ISC^n_X \ I_X: The statement is valid for all ϵ′ > 61ϵ_n/2, and we arbitrarily select ϵ′ = 31ϵ_n. One can also check that GC dominates ISC^n by seeing that it separates all vertices of ISC^n_X \ I_X for all n ∈ N. Finally, we need to ensure that no integer solution is cut off. We can verify this by checking that every x ∈ I_X satisfies (24). This statement holds whenever ϵ_n < 0.1. Therefore ϵ′ = 31ϵ_n is valid, and we arrive at the cut OPC̄^{n+1}. ◀
Proof. This follows the same structure as the proof of Lemma 4. We get that ϵ′ > 4ϵ_n/43, and that ϵ′ = ϵ_{n+1} is valid w.r.t. the integer constraints. ◀ Using our definitions of integer support and objective parallelism in (5)-(6), we derive the scores of each cut under the simple cut selection scoring rule (4). We let c_{P(a,d)} denote the vector of objective coefficients of P(a, d), (a, d) ∈ R_{≥0} × [0, 1]. The integer support and objective parallelism values of each cut are as follows:

obp(GC, c_{P(a,d)}) = (110 + a + 10d) / (√201 · √(1 + a² + (10 + d)²))   (28)
obp(ISC^n, c_{P(a,d)}) = (1 + a) / (‖ISC^n‖ · √(1 + a² + (10 + d)²))   (29)
obp(OPC^n, c_{P(a,d)}) = (101 + 10d) / (‖OPC^n‖ · √(1 + a² + (10 + d)²))   (30)

Using our simplified cut scoring rule as defined in (4), we derive the necessary conditions on the λ values that assign GC a score at least as large as that of the other cuts. Lemma 6. GC is selected and added to P(a, d), (a, d) ∈ R_{≥0} × [0, 1], using scoring rule (4) if and only if a λ is used that satisfies the following conditions: Proof. We know that the integer support and objective parallelism do not depend on ϵ_n, as seen in equations (25)-(30). Our cut selection rule also selects exactly one cut per iteration, namely the largest scoring cut. Therefore, whenever λ satisfies constraints (31) and (32), GC is selected over both OPC^n and ISC^n and applied to P(a, d). If λ does not satisfy constraints (31) and (32), then GC is not the largest scoring cut and will not be applied to P(a, d). ◀ The inequalities (31)-(32) define the region R_GC, which contains exactly the tuples (a, d, λ) ∈ R_{≥0} × [0, 1]² that result in GC being the best scoring cut. The region R_GC is visualised in Figure 8. We define the function r_GC(a, d) for all (a, d) ∈ R_{≥0} × [0, 1], which maps any pairing (a, d) to the set of λ values contained in R_GC for the corresponding fixed (a, d) values.
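The two cut measures and the simplified scoring rule (4) can be sketched directly from their definitions: integer support is the share of nonzero cut coefficients on integer variables, and objective parallelism is the cosine similarity between the cut coefficients and the objective. The convex-combination form of rule (4) and the example cut below are assumptions for illustration; the example is not a cut of P(a, d).

```python
import numpy as np

def isp(alpha, integer_idx):
    """Integer support: share of nonzero cut coefficients on integer vars."""
    nonzero = np.flatnonzero(alpha)
    if len(nonzero) == 0:
        return 0.0
    return np.isin(nonzero, integer_idx).sum() / len(nonzero)

def obp(alpha, c):
    """Objective parallelism: cosine similarity of cut coefficients and c."""
    return abs(alpha @ c) / (np.linalg.norm(alpha) * np.linalg.norm(c))

def score(alpha, c, integer_idx, lam):
    """Assumed convex-combination form of the simplified rule (4)."""
    return lam * isp(alpha, integer_idx) + (1 - lam) * obp(alpha, c)

c = np.array([1.0, 10.0, 0.0])        # illustrative objective
cut = np.array([2.0, 20.0, 0.0])      # cut parallel to the objective
```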
Here P refers to the power set. We are interested in R_GC as we believe that we can find a continuous function that contains all (a, d, λ) ∈ R_{≥0} × [0, 1]² pairings which score all cuts equally. Using this function, we can find, for a fixed d ∈ [0, 1], the bounding functions λ_ub(a, d) and λ_lb(a, d). Proof. We know from Lemma 8 that λ_ub(a, d) is continuous, and can conclude that λ_ub(a_max(d), d) is continuous. The different-valued endpoints can be derived by evaluating λ_ub(a_max(0), 0) and λ_ub(a_max(1), 1), which satisfy λ_ub(a_max(1), 1) > λ_ub(a_max(0), 0). ◀ Figure 9 visualises the function λ_ub(a_max(d), d) (identically λ_lb(a_max(d), d)) for 0 ≤ d ≤ 1. For any d ∈ [0, 1], these functions, alongside slight changes to a_max(d), will be used to generate intervals of λ values which score GC the largest and lie between points of a finite discretisation of [0, 1]. Lemma 11. λ_ub(a, d) − λ_lb(a, d) > 0 for all 0 ≤ d ≤ 1 and 0 ≤ a < a_max(d). That is, a = a_max(d) is the only point at which λ_ub(a, d) = λ_lb(a, d) for 0 ≤ a ≤ a_max(d).

Lemma 12. An interval
Proof. We define the following function, a_max(d, ϵ), representing a_max(d) with a shift of ϵ: We know from Lemma 11 that a = a_max(d) is the only point at which λ_lb(a, d) = λ_ub(a, d) for any d ∈ [0, 1]. We also know that λ_ub(a, d) and λ_lb(a, d) are defined over all 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d). Therefore the following holds for any d ∈ [0, 1] and ϵ ∈ (0, a_max(d)]: Additionally, by the definition of R_GC from the inequalities (31)-(32), we know that the following interval is connected: We can therefore construct a connected non-empty interval I(a, d). ◀ While we have shown the necessary methods to construct an interval of λ values, I(a, d), that results in GC being selected, we have yet to guarantee that at all stages of the solving process the desired LP optimal solution is attained for all 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d). Specifically, we need to show that the originally optimal point is always (−1/2, 3, 1/2), that after applying GC the integer solution (1, 1, 0) is optimal, and that after applying ISC^n (or OPC^n) a fractional solution from ISC^n_X (or OPC^n_X), for all n ∈ N, is optimal. Lemma 13. The fractional solution (−1/2, 3, 1/2) is LP optimal for P(a, d) for all 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d). Proof. This can be done by substituting all points from X \ {(−1/2, 3, 1/2)} into the objective, and then showing that the objective is strictly smaller when evaluated at (−1/2, 3, 1/2). This shows the claim for all 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d). ◀ Lemma 14. The integer solution (1, 1, 0) is LP optimal after applying GC to P(a, d) for all 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d).

Theorem 1. Given a finite discretisation of λ, an infinite family of MILP instances together with infinitely many family-wide valid cuts can be constructed. Using a pure cutting plane approach and applying a single cut per selection round, the instances of the infinite family do not solve to optimality for any value in the discretisation, but do solve to optimality for infinitely many alternative λ values.
Proof. From Lemmas 2-5, we know the exact vertex set of our feasible region at each stage of the solving process, as well as the exact set of candidate cuts at each round. Furthermore, as only the RHS value of each proposed cut changes between rounds, the scores of the cuts remain constant over rounds, and we can therefore completely describe the three scenarios of how cuts are added. Let S be the set containing all cuts added during the solving process of an instance P(a, d), where 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d): From Lemma 6 we know the sufficient conditions for a λ value that results in GC being scored at least as well as the other cuts. Lemmas 7-11 show how these sufficient conditions can be used to construct the region R_GC. Moreover, they show that R_GC is bounded, and that a = a_max(d), for all d ∈ [0, 1], is the only point at which the following occurs: We therefore conclude that R_GC is connected. We know from Lemma 10 that both λ_ub(a_max(d), d) and λ_lb(a_max(d), d) are continuous, where d ∈ [0, 1], and that λ_ub(a_max(1), 1) > λ_ub(a_max(0), 0). From the intermediate value theorem, we then know the following: From Lemma 12 we have an explicit way to construct an interval I(a, d) ⊆ r_GC(a, d) for all (a, d) ∈ [0, a_max(d)) × [0, 1]. We can therefore construct the following intervals: These intervals can be made arbitrarily small, as ϵ can be arbitrarily small. Moreover, as d′ values that satisfy (36) can be used, and λ_lb(a, d), λ_ub(a, d), and a_max(d, ϵ) are polynomials, we can generate infinitely many disjoint intervals. We therefore conclude that for any finite discretisation Λ of λ, an interval can be created that contains no values from {λ_1, . . . , λ_{|Λ|}}, but contains all values of λ for which P(a, d) solves to optimality.
Finally, Lemmas 13-16 ensure that at each stage of the solving process, all cuts are valid for the fractional LP optimal solution, for all P(a, d), where 0 ≤ d ≤ 1 and 0 ≤ a ≤ a_max(d). Moreover, these lemmas guarantee that an integer optimal solution is found only after applying GC. We have therefore shown how fixing λ to a global constant for the MILP solving process, disregarding all instance information, can result in infinitely worse performance on infinitely many instances. ◀ Corollary 17. There exists an infinite family of MILP instances together with infinitely many family-wide valid cuts, which do not solve to integer optimality for any λ when using a pure cutting plane approach and applying a single cut per selection round.
Proof. To show this we take the following function, where 0 ≤ d ≤ 1: Any value of a retrieved from this function lies outside of R_GC for all 0 ≤ d ≤ 1. There thus exists no λ value that results in finite termination, as GC is never scored at least as high as the other cuts.
Similarly to the proof of Theorem 1, we need to ensure that the LP optimal point at all times during the solving process is appropriate, and that the same integer optimal point remains integer optimal for all 0 ≤ d ≤ 1 and a_max(d) < a ≤ a_max(d, ϵ). We therefore redo the proofs of Lemmas 13-16 with this changed range of values of a. ◀
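The grid-avoidance argument at the heart of Theorem 1 can be illustrated numerically: between any two adjacent points of a finite discretisation there is room for an arbitrarily small open interval, and the proof places the GC-selecting interval I(a, d) inside such a gap. The grid below is illustrative.

```python
def interval_avoiding_grid(grid):
    """Return an open subinterval of [0, 1] containing no grid point.

    Finds the widest gap between consecutive discretisation points
    and returns an interval strictly inside it; the construction in
    the proof places the GC-selecting interval inside such a gap.
    """
    points = sorted(set(grid) | {0.0, 1.0})
    # Widest gap between consecutive grid points (first such gap on ties).
    lo, hi = max(zip(points, points[1:]), key=lambda p: p[1] - p[0])
    mid = (lo + hi) / 2
    half = (hi - lo) / 4
    return (mid - half, mid + half)

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
lo, hi = interval_avoiding_grid(grid)   # lies strictly inside a grid gap
```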

B Graph Convolutional Neural Network Design
This section provides an intuitive understanding of our policy network, which is parameterised as a GCNN. For a more complete introduction to graph neural networks, with helpful visualisations, we refer readers to [26]. Throughout this section we also refer to multi-layer perceptrons, which from now on we simply call (feed-forward) neural networks. A neural network is a function that passes its input through a series of alternating linear transformations and non-linear activation functions. Note that while our design makes use of standard neural networks, other architecture types could be used. We refer readers to [17] for a thorough overview.
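A feed-forward neural network in the sense used here can be sketched in a few lines; the dimensions and the ReLU activation are illustrative choices, not necessarily those of our design.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, layers):
    """Alternate linear transformations (W, b) with a non-linear activation."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]                 # no activation on the output layer
    return W @ x + b

rng = np.random.default_rng(0)
# Illustrative two-layer network mapping 7 input features to 4 outputs.
layers = [(rng.standard_normal((32, 7)), np.zeros(32)),
          (rng.standard_normal((4, 32)), np.zeros(4))]
out = mlp_forward(rng.standard_normal(7), layers)
```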
Figure 10 A visualisation of the initial state s_0. At each node the size of the corresponding feature vector is given, e.g. [7]. Note that the edges additionally have features, which are omitted for ease of visualisation.
Recall that we represent our MILP instance via a constraint-variable bipartite graph, where a variable and a constraint share an edge when the variable appears in the constraint with a non-zero coefficient. See Figure 10 for a representation of the initial state s_0 ∈ G. Recall also that the goal of the GCNN is to parameterise our policy, π_θ(· | s_0 ∈ G), outputting the mean, µ ∈ R^4, of a distribution, N_4(µ, γI), over the cut selector parameter space.
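Constructing the constraint-variable bipartite graph from the coefficient matrix A follows directly from this rule; the small matrix below is illustrative.

```python
import numpy as np

def bipartite_edges(A):
    """One edge (i, j) per nonzero coefficient A[j, i] of variable x_i in constraint j."""
    rows, cols = np.nonzero(A)
    return [(int(i), int(j)) for j, i in zip(rows, cols)]

A = np.array([[1.0, 0.0, 2.0],    # constraint 0 involves x_0 and x_2
              [0.0, 3.0, 0.0]])   # constraint 1 involves x_1
edges = bipartite_edges(A)
```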
We will now begin the forward pass of the GCNN. Consider a node of the bipartite graph that represents the variable x_i of the MILP. This node has an attached set of features, see Table 1 for a complete list, which form a vector. This feature vector gets transformed by a neural network, which in our design initially maps the 7-dimensional feature vector to a 32-dimensional vector. This operation is applied to all feature vectors representing variables, using the same neural network. The new bipartite graph is denoted H^1_V. We apply the same procedure to all feature vectors corresponding to constraints of the MILP, albeit with a different neural network, which results in H^1_C. The result is visualised in Figure 11. We note that the order of these two transformations is irrelevant.
Figure 11 The bipartite graph after the variable feature vectors have been transformed to dimension 32.

The key to graph neural networks, and our GCNN in particular, is using the same neural network to transform multiple feature vectors, e.g. all constraint feature vectors. This allows the GCNN to take arbitrarily sized bipartite graphs as input, and therefore to work on any MILP instance.
Until this point, no information has been shared between any two variables or constraints. In our design, information is gathered per node from its neighbours by summing the transformed feature vectors of all incident edges and adjacent nodes. Note that due to the bipartite nature of our graph, information either flows from variables to constraints, or vice-versa. This gathering of information for either all variable nodes or all constraint nodes is called a half-convolution, or alternatively message passing. To perform this half-convolution we require the transformed feature vectors to all have the same dimension, 32 being our choice. Note that the feature vectors on the edges are also transformed during the half-convolution. For a feature vector representing a variable, the half-convolution is defined as:

x''_i = nn( Σ_{j ∈ N(x_i)} nn'(x'_i, c'_j, e'_{(i,j)}) ),

where x'_i, c'_j, and e'_{(i,j)} are the transformed feature vectors of variable x_i, constraint j, and edge (i, j). The functions nn and nn' are neural networks, with nn' taking a concatenated input, and N(x_i) is the neighbourhood of the variable node of x_i. For a feature vector representing a constraint, the half-convolution is defined in a mirrored manner. See Figure 12 for H^2_C and H^2_V. As our design ultimately extracts µ from the transformed variable feature vectors, we first perform the half-convolution over all constraints, and then over all variables. This guarantees that in the representation H^2_V, all variables that feature together in a constraint have shared and received information. In a similar manner to the beginning of our forward pass, we now reduce all resulting variable feature vectors using a neural network to dimension 4, i.e. the number of cut selector parameters. The result H^3_V is shown in Figure 13. Finally, we average over all reduced variable feature vectors, resulting in the mean µ ∈ R^4 of our policy. This completes a forward pass of the GCNN.
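A sketch of the variable-side half-convolution and the final averaging, with fixed random linear maps standing in for the neural networks nn and nn′; all dimensions, the edge-feature size, and the tiny graph are illustrative, and the real design reduces the embeddings to dimension 4 before averaging.

```python
import numpy as np

def half_convolution_vars(H_V, H_C, E, edges, nn, nn_prime):
    """Update every variable embedding from its adjacent constraints.

    For each variable x_i, messages nn'(concat(x_i', c_j', e'_(i,j)))
    are summed over neighbouring constraints j, then passed through nn.
    """
    messages = np.zeros_like(H_V)
    for (i, j), e in zip(edges, E):
        messages[i] += nn_prime(np.concatenate([H_V[i], H_C[j], e]))
    return np.array([nn(m) for m in messages])

# Stand-in "networks": fixed random linear maps (the real nn / nn' are MLPs).
rng = np.random.default_rng(0)
d = 32                                         # shared embedding dimension
W_msg = rng.standard_normal((d, 2 * d + 3))    # edge features of size 3 assumed
W_out = rng.standard_normal((d, d))
nn_prime = lambda z: W_msg @ z
nn = lambda z: np.maximum(W_out @ z, 0.0)

H_V = rng.standard_normal((4, d))              # 4 variable embeddings
H_C = rng.standard_normal((2, d))              # 2 constraint embeddings
edges = [(0, 0), (1, 0), (2, 1), (3, 1)]       # (variable, constraint) pairs
E = rng.standard_normal((len(edges), 3))

H_V2 = half_convolution_vars(H_V, H_C, E, edges, nn, nn_prime)
mu = H_V2.mean(axis=0)   # averaging step; the real design first reduces to dim 4
```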
Figure 13 The variable feature vectors reduced to dimension 4.

We note that the decision to average the reduced variable feature vectors was inspired by the original design for branching, see [15]. It would, for instance, be possible to change the order of the half-convolutions and extract µ from the transformed constraint feature vectors. This is just one way to change the specific design, with other examples being the type of activation functions, the layer structure, and the dimension of each embedding. Our design also makes heavy use of layer normalisation, see [5], which followed from observations in [15] on improved generalisation capabilities. For our complete design we refer readers to https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP.