OCBP – “Introducing Darmok” part3
Plan Retrieval
To solve a complex planning task, several subproblems have to be solved. For instance, in our domain, the system has to solve problems such as how to build a proper base, how to gather the necessary resources, or how to destroy each of the units of the enemy.
The propose to organize the case base using two kinds of elements:
• Snippets: a snippet is a procedure composed of a collection of actions, and subgoals composed in sequence or parallel.
• Episodes: an episode is a tuple e = ( p, G, S, O ) where e.p is a snippet, e.G is a goal, e.S is a situation (i.e. a game state), and e.O is the outcome of applying e.p in e.S to achieve e.G. In particular e.O is a real number between 0 and 1 representing how well the snippet achieved its goal in the situation e.S.
Figure. 7 Figure 7 shows an example of a snippet with an associated episode, with the four elements: a goal (in this case, building the resource infrastructure of player “1”) a procedure to achieve the specified goal in the given map in the snippet, a game state (with general features about the map and information about each players) and the outcome for the episode. In particular, in Darmok, the game state is defined by 35 features that represent the state of a Wargus game. Twelve features represent the number of troops (number of fighters, number of peasants, and so on), four represent the resources the player owns (gold, oil, wood and food), fourteen represent the number of the buildings (number of town halls, number of barracks, and so on) and finally, five features represent the map (size in both dimensions, percentage of water, percentage of trees, number of gold mines).
Given a game state and a goal to retrieve the snippet which could have the highest performance in such a goal in the given game state. To predict the performance of a snippet p, the plan retrieval module uses the episodes in the case base associated with p. To retrieve episodes, Darmok uses the episode relevance measure ER(e, S, G) that computes the relevance of a given episode e given the current game state S and goal G, and is defined as:
where GS is the goal similarity of the goal in the snippet p and the goal G, and SS is the state similarity. α is a parameter that has been set to 0.75 in experiments. The distance between two goals g1 = name1 (p1 , …, pn ) and g2 = name2 (q1 , …, qm ) is measured as follows:
where Pi is the maximum value that the parameter i of a goal might take (we assume that all the parameters have positive values). Thus, when name1 = name2 , the two goals will always have the same number of parameters and the distance can be computed using an Euclidean distance among the parameters. The distance is maximum (1) otherwise.
SS(gs1 , gs2 ) computes the similarity between two given game states, and returns a number between 0 and 1 (where 0 means identical, and 1 means totally different). SS is computed as the inverse of a simple Euclidean distance among the game states, where each feature is normalized between 0 and 1.
Finally, to predict the performance of a snippet, we define the set of relevant episodes RE(p, S, G) = {e1 , …, ek }, as the set of k episodes associated with the snippet p, and that have the maximum relevance ER. In our experiments, Darnok have set k ≤ 5, i.e. if there is fewer or equal to 5 episodes associated with p in the case base, then RE(p, S, G) contains all of them, but if there is more than 5, then only the best 5 are included in RE(p, S, G). Using that definition we can now define the predicted performance of a snippet as:
The predicted performance is a weighted average of the outcomes that snippet p has had in similar game states to S for similar goals. We add 1 to the numerator and 2 to the denominator following the Laplace probability estimation rule (which biases the predicted performance towards 0.5 when the number of episodes we have to predict the performance is small.) The result of the retrieval process is the snippet p that has the highest predicted performance P P (p, S, G), i.e. the snippet that has had the best performance in the past for similar goals in a similar game state. The snippet retrieved then needs to go through the adaptation process.
However, real-time domains require delayed adaptation.The game state changes with time and it is important that adaptation is done with the most up to date game state (ideally with the game state just before the snippet starts execution). For that reason, in Darmok, when the plan execution module is just about to start the execution of a particular snippet, the snippet is sent to the plan adaptation module for adaptation. Darmok cannot guarantee that the snippet is adapted with the latest game state since in a real-time domain the domain changes continuously, however, delaying adaptation as much as possible minimizes the adaptation error of Darmok.
