Move Selection


Foreign policy formation refers to decisions by agents about what moves to make in a given power structure. In quantitative realism, this move selection process uses a game engine built on principles similar to those used by chess-playing computers. We use a technique called Monte Carlo Tree Search (MCTS) to traverse the game tree. Minimax game tree search is not used because it assumes that agents move sequentially rather than simultaneously; adapting it to simultaneous moves would introduce additional complications.

Both of these techniques involve searching an immense game tree to find promising moves. If we assume that ternary tactics are used and that agents are only permitted to change one relationship per move (see "Social Inertia" below), then at any given time, each agent has [math]\displaystyle{ 2n-1 }[/math] moves. When agents move simultaneously, there are [math]\displaystyle{ (2 n - 1)^n }[/math] game states at the next ply and [math]\displaystyle{ (2 n - 1)^{nd} }[/math] states at a given search depth d.

[Figure: Principal variation]

Searching a modest tree with five agents at d=3 results in over 205 trillion leaf nodes, or paths through the game tree. Going one ply deeper leads to over 12 quintillion states. Given this enormous size, we draw upon clever results in game-playing artificial intelligence to produce relevant and consistent search results.
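
These counts follow directly from the formula above and can be checked in the Wolfram Language (the language used for the code examples on this page):

With[{n = 5, d = 3}, (2 n - 1)^(n d)]   (* 205891132094649, about 205 trillion *)
With[{n = 5, d = 4}, (2 n - 1)^(n d)]   (* 12157665459056928801, about 12 quintillion *)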

Monte Carlo Tree Search

Overview

Monte Carlo Tree Search is based on the idea that the value of a game state is the expected value of the game's outcome given random play from that state. Since this is a well-known algorithm, we do not describe it in detail but instead sketch our implementation of it:

  1. From the origin game state, play random lines of play, or "playouts", to a specified depth.
  2. Determine the PrinceRank vector of the leaf node of each playout.
  3. Find the average PrinceRank for each agent's unique moves and select the move with the highest value.
  4. Combine these best moves to form the new tactic matrix.

This approach differs from standard MCTS in that it plays out to a fixed depth rather than to a terminal game state, since quantitative realism has no terminal states.
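
As a rough illustration, a minimal sketch of this move-selection loop is shown below. It is not the actual engine code: Playout (a random playout to the given depth, returning the leaf power structure) and PrinceRank (returning the PrinceRank vector of a power structure) are assumed helpers here, and SymmetricTsNear is the move generator defined under "Social Inertia" below.

(* Sketch only: score each of agent i's available moves by the average PrinceRank of
   random playouts, then return the highest-scoring move. *)
BestMove[ps_, T_, i_, depth_, trials_] := Module[{moves, scores},
   moves = SymmetricTsNear[T, i, 1];   (* agent i's moves within a Hamming distance of 1 *)
   scores = Map[Mean[Table[PrinceRank[Playout[ps, #, depth]][[i]], {trials}]] &, moves];
   moves[[First[Ordering[scores, -1]]]]]

Applying BestMove to each agent and combining the chosen moves (step 4 above) yields the new tactic matrix.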

Evaluation function

MCTS requires an evaluation function that assigns a value to each player for any given game state, or position. In chess, for instance, the evaluation function looks at the pieces each player has and their positioning on the board, and reduces that information to a number between -1 and 1. In quantitative realism, PrinceRank is the logical choice for an evaluation function, because it intuitively reflects agent preferences in various power structures and provides a complete ordering of all possible power structures from the perspective of a given agent. The utility function could also conceivably serve as this evaluation function; however, it does not consider agent relationships and experiments suggest that it does not lead to hierarchy formation as reliably as PrinceRank does.

PrinceNet

In "pure" MCTS, random playouts are used to explore the game tree. With branching factors on the order described above, pure MCTS is unlikely to return much valuable information. For example, if one searches a tree where n=5 and d=4, 10,000 playouts will cover only 0.0000000005% of the tree. Clearly, some optimization technique is needed to find the most promising pathways. For this reason, we adopt an approach similar to the AlphaZero general game playing algorithm (Silvers 2017), which used a deep neural network to focus the MCTS search beam on more plausible lines of play, resulting in superhuman performance in the games of Chess, Go, and Shogi. We train a neural network, PrinceNet, to identify plausible next moves for an agent, given a power structure. PrinceNet then guides our Monte Carlo search.

Network Setup

Input. The input vector to the neural network is composed of: (1) a normalized size vector in which the focal agent is first and the other agents are reverse-sorted; (2) a flattened ternary tactic matrix (reordered to match) with the diagonal removed (because it is always 0); and (3) the parameter α, entered as a list. If needed, the power structure can be padded with 0s so that its size matches that of the network, for example, if the network assumes that n=5 but one wants to apply it to a power structure with four agents.
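
A sketch of this encoding is given below. The exact conventions used here, reverse-sorting the non-focal agents by size and normalizing sizes to sum to 1, are illustrative assumptions rather than the canonical implementation.

(* Sketch only: build the network input from a size vector, a ternary tactic matrix T,
   a focal agent index i, and a parameter list α. *)
EncodeInput[sizes_, T_, i_, α_] := Module[{order, s, t},
   order = Prepend[Reverse[SortBy[Delete[Range[Length[sizes]], i], sizes[[#]] &]], i];
   s = sizes[[order]]/Total[sizes];   (* focal agent first, others reverse-sorted *)
   t = T[[order, order]];             (* tactic matrix reordered to match *)
   Join[s, Flatten[MapIndexed[Delete[#1, First[#2]] &, t]], Flatten[{α}]]]   (* diagonal removed *)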

Output. The network outputs a vector indicating the probability or strength of the focal agent's available moves: those ternary moves within a Hamming distance of 1. This "policy" vector is normalized to sum to 1. It has 3n elements, one for each value in {-1, 0, 1} for each of the n agents, including the focal agent itself. Each element represents the probability that the current tactic should be changed to the tactic corresponding to that element; elements corresponding to the agent's current tactic follow the same convention. Elements that correspond to the self-tactic are set to 0.
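
For instance, if the 3n elements are grouped by agent in the value order {-1, 0, 1} (this grouping is an assumption, not the documented layout), the strongest policy entry can be decoded back into a move as follows:

(* Sketch only: map the highest-probability policy entry to {agent index, new tactic value}. *)
DecodePolicy[policy_] := Module[{k = First[Ordering[policy, -1]]},
   {Quotient[k - 1, 3] + 1, Mod[k - 1, 3] - 1}]

For n=3, DecodePolicy[{0, 0, 0, 0.1, 0.7, 0.2, 0, 0, 0}] returns {2, 0}, i.e. change the tactic toward agent #2 to 0.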

Architecture. The network has a simple feedforward structure with n^2+1 or more inputs (depending on the length of α), 3n outputs, and 5-10 hidden layers. The network uses Tanh activation functions, with Ramp and Softmax used in the final layers.
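
A network of this general shape could be declared roughly as follows; the hidden-layer count and widths are placeholders rather than PrinceNet's actual configuration (here n=5 with a single α value):

With[{n = 5},
   NetChain[
      Join[
         Flatten[Table[{LinearLayer[64], Tanh}, 6]],   (* hidden Tanh layers *)
         {LinearLayer[3 n], Ramp, SoftmaxLayer[]}],    (* policy head with 3n outputs *)
      "Input" -> n^2 + 1]]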

Training and Usage

Training. The network is trained on data generated by MCTS runs on random power structures. Training uses the Wolfram Language neural network framework. The network is inserted into the MCTS search algorithm to select more promising paths through the search tree, thus generating yet more training data in an iterative cycle called the training loop.
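
Schematically, a few cycles of the training loop might look as follows, where GenerateMCTSExamples is a hypothetical helper that runs network-guided MCTS on random power structures and returns input → policy training pairs:

(* Sketch only: alternate between generating self-play data and retraining the network. *)
Do[
   examples = GenerateMCTSExamples[princeNet, 1000];
   princeNet = NetTrain[princeNet, examples, MaxTrainingRounds -> 10],
   5]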

Usage. The network can be used to accelerate MCTS by suggesting best or promising moves. A homegrown version called PrinceNetJr, which uses weights extracted from PrinceNet, is used for performance reasons. However, training the network weights must occur on the PrinceNet version.
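
Extracting trained weights for such a reimplementation can be done with NetExtract; the layer index below is illustrative:

weights = NetExtract[princeNet, {1, "Weights"}];   (* weight matrix of the first LinearLayer *)
biases = NetExtract[princeNet, {1, "Biases"}];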

Limitations. Once trained, the network has a maximum power structure size that it can accommodate. The network also implicitly assumes the parameter values that were used when the training data was generated, except for those provided as network inputs.

Other techniques

Other potential performance improvements include parallelizing the playouts and caching the game tree.
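
For example, the playout averaging in the earlier BestMove sketch could be distributed across parallel kernels using the built-in parallel primitives (Playout and PrinceRank are the same assumed helpers as before):

scores = ParallelTable[PrinceRank[Playout[ps, T, depth]][[i]], {trials}];
Mean[scores]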

Social Inertia

Social inertia limits the tactical choices available to an agent, forcing it to choose tactic vectors that are, in some sense, near its current tactic. How we implement social inertia depends on whether continuous or ternary tactics are used.

Continuous Tactics

A natural way to define tactical distance for continuous tactics is to use the Euclidean distance between two tactic vectors, computed with either EuclideanDistance[a, b] or Norm[a - b]. The parameter σ sets a maximum acceptable distance, and we can then generate random tactics within that distance of some starting tactic:

RandomTacticNear[tactic_, i_, σ_] := NormalizeTactic[MapAt[Abs, RandomPoint[Sphere[tactic, σ]], i]]

The function above works by: (1) starting from a point represented by the original tactic, (2) picking a random point on a sphere of radius σ, centered at the original point, (3) ensuring that the self-allocation is nonnegative, and (4) normalizing the tactic. To give an example where n=4, this provides a way to find random tactics (black) near a given starting tactic (red):

[Figure: Cross-polytope social inertia]

The implementation above is admittedly quick-and-dirty. For one thing, it does not enforce the self-allocation percentage ρ, so any nonconforming tactic vectors have to be discarded. For another, as illustrated in the image above, the random points are not distributed evenly across the surface of the cross-polytope, but instead tend to cluster at the perimeter of the circle defined by σ. Finally, it is possible that making the self-allocation positive leaves the resulting tactic at a distance greater than σ from the original. At any rate, more complex versions of this function could be created that address these deficiencies.
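
One simple improvement along these lines is rejection sampling: redraw until the candidate passes a validity check and stays within σ of the original tactic. In the sketch below, ValidTacticQ is a hypothetical predicate that would enforce ρ and any other constraints.

StrictRandomTacticNear[tactic_, i_, σ_] :=
   Module[{candidate = RandomTacticNear[tactic, i, σ]},
      While[! ValidTacticQ[candidate, i] || Norm[candidate - tactic] > σ,
         candidate = RandomTacticNear[tactic, i, σ]];
      candidate]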

Ternary Tactics

To find ternary tactics near a given vector, a natural notion of distance is the number of elements that are altered. This can be computed as the Hamming distance; for example, HammingDistance[{-1, 0, 0, 1, 1}, {0, 0, -1, 0, 1}] is 3. Given a ternary tactic vector and an agent index i, the following function returns all ternary tactic vectors within a given Hamming distance dx:

TernaryTacticsNear[tactic_, i_, dx_] :=
   DeleteDuplicates[
      Flatten[
         Map[Tuples[ReplacePart[Partition[tactic, 1], Map[# → {-1, 0, 1} &, #]]] &,
            Subsets[Delete[Range[Length[tactic]], i], {dx}]],
         1]]

For example, TernaryTacticsNear[{0, 1, -1}, 1, 1] will return {{0, -1, -1}, {0, 0, -1}, {0, 1, -1}, {0, 1, 0}, {0, 1, 1}}. We use dx as opposed to σ, even though they both express notions of social inertia, so as not to confuse the two approaches.

Because of the symmetry imposed by Heuristic 1, we generally want to find nearby symmetric tactic matrices, which are generated by:

SymmetricTsNear[T_, i_, dx_] :=
   Map[ReplacePart[ReplaceColumn[T, i, #], i → #] &, TernaryTacticsNear[T[[All, i]], i, dx]]

For instance, starting from the zero matrix, the following symmetric tactic matrices are within a Hamming distance of 1 for agent #1.

[Figure: SymmetricTsNear example]

Other notions of distance are possible with ternary tactics. For example, we could start with all possible ternary tactics and then filter out ones that are beyond a Euclidean distance defined by σ. Or we could forbid changing a -1 to a 1, or vice versa, without first making the tactic a 0.
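
As a sketch of the first idea, one could enumerate all ternary tactics and keep only those within σ of the original; holding the i-th (self) entry fixed here is an assumption:

TernaryTacticsWithin[tactic_, i_, σ_] :=
   Select[Tuples[{-1, 0, 1}, Length[tactic]],
      #[[i]] == tactic[[i]] && Norm[# - tactic] <= σ &]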


