Incremental Policy, Learning: An Equilibrium
Selection Algorithm for Reinforcement Learning
Arrents with Common Interests U
Nancy Fulda and Dan Ventura
Department of Computer Science
Brigham Young University
Provo, Utah, 84602
email: fulda@ hyu. edu, ventura@ cs. byu. edu
Abstract- We present an equilibrium selection algorithm for by its countemarts) and ( 2) they all m;
reinforcement learning agent; that incrementally adjusts the
probability of executing each action based on the desirability of
the outcome obtained in the last time step. The algorithm assumes
that at least one coordination equilibrium exists and requires that
the agents have a heuristic for determining whether or not the
equilibrium was obtained. In deterministic environments with one
or more strict coordination equilibria, the algorithm will learn to
play an optimal equilibrium as long as the heuristic is accurate.
Empirical data demonstrate that the algorithm is also eflective
in stochastic environments and is able to learn good joint policies
when the heuristic’s parameters are estimated during learning,
rather than known in advance.
I. INTRODUCTION
Learning to play an optimal equilibrium is a non- trivial task.
Non- communicating agents must both determine the location
of equilibria in the joint action space and learn which equilibria
are enabled by the other agents’ strategies. Wang and Sand-holm
describe this task in terms of two interrelated learning
problems: identifying the game and learning to play [ Y].
We use the phrase ngenis wirh common inreresis to describe
agents whose preferences coincide in at least one point in
the joint action space; that is, there is at least one joint
action that maximizes expected reward for all agents. We call
such a joint action a coordination equilibrium. A coordination
equilibrium is strict if the system contains no joint actions that
maximize expected reward for one agent without maximizing
it for all other players. Coordination equilibria are a special
case of opiimnl equilibria, which are defined as pareto- efficient
Nash equilibria. An optimal equilibrium does not necessarily
maximize payoff for all players. A coordination equilibrium
does.
Recent algorithms that address equilibrium selection in
multiagent reinforcement learning systems include Claus and
Boutilier’s Joint Action Learners [ I], Hu and Wellman’s Nash
Q- learning algorithm [ 3], Michael Littman’s Friend- or- Foe Q-learning
[ 6], and Wang and Sandholm’s Optimal Adaptive
Learning [ Y].
The algorithms described above share two characteristics in
common. ( I) They all rely on global perception of the joint
action space ( i. e. each agent can perceive the actions executed
assumptions about
the motives of the other agents. These two characteristics
correspond to Wang and Sandholm’s taxonomy: Perception
of other agents’ actions allows the game structure to be
identified, while assumptions about other agents’ motives help
in determining which potential equilibria the other agent is
willing to play.
Incremental Policy Learning ( IPL) is a novel approach
to selecting between multiple coordination equilibria. This
approach uses a weakened form of condition ( I ) above: The
agents do not need to know the entire structure of the joint
action space. Instead, they require a heuristic ( for example,
the reward associated with a coordination equilibrium) that
indicates whether the desired equilibrium was obtained. The
heuristic information can be inferred from the joint action
table if it is available, may be provided by an external
oracle, or can be estimated based on known characteristics
about the environment and observed individual rewards. One
of IPL‘ s greatest advantages is that, depending on how the
heuristic information is obtained, optimal solutions may be
found without learning the complete game structure.
In its most basic form, Incremental Policy Learning relies
on condition ( 2) by assuming that a coordination equilibrium
exists and by using the reward associated with this equilibrium
as a parameter for the heuristic. We show in Section IV- C
that this basic form can be augmented through the use of
other heuristics so that the algorithm can play optimally un-der
other assumptions about opponent motivations, including
Nash- seekers and Minimax players.
11. RELATED WORK
One of the simplest reinforcement learning techniques for
selecting between multiple equilibria is to use a pre- arranged
coordination mechanism such as that employed by Lauer and
Riedmiller, in which the agents retain as their optimal policy
the first action that successfully maximized reward [ 5]. This
approach is effective in deterministic environments, but ceases
to be effective when nondeterministic rewards are introduced
or when the environment is unpredictable.
0- 7803- 8359- 1/ 04/$ 20.00 02004 IEEE 1121
The equilibrium selection problem can also be addressed
through on- policy learning. The underlying principle is that
each agent’s individual utilities shift to reflect the frequency
with which the agent has achieved a desirable reward, causing
the agents to settle towards complementary policies in which
both agents benefit [ I]. Unfortunately, agents using this tech-nique
do not always settle to an optimal equilibrium. Some
environments particularly those resembling penalty games)
camoflauge desired equilibria so that the weighted sum of
received rewards still appears less desirable than some other
action.
Biased exploration techniques have been utilized to en-courage
convergence to an optimal equilibrium. For example,
Kapetanakis and Kudenko’s FQM heuristic biases exploration
based on the maximum reward received for a given action
and the frequency with which that reward has been observed
[ 4]. This approach increases the likelihood of convergence to
an optimal equilibrium in cooperative games, but does not
guarantee it.
Another option is to allow the agents to percieve the actions
selected by all other agents. However, this augmentation
destroys the power of on- policy learning as an equilibrium
selection technique. Unlike typical reinforcement learners,
whose utility estimates change in response to the biased
exploration of their counterparts, agents who perceive the joint
action space learn the same joint utilities regardless of the
behavior of the other agents. The frequency with which each
joint utility is updated is affected, but not the value to which
the utilities converge. To remedy this problem, researchers
have again resorted to biased- exploration techniques such as
Fictitious Play and Optimistic Boltzmann [ I]. This results in
improved performance hut does not guarantee convergence to
an optimal solution.
Incremental Policy Learning resembles the biased explo-ration
techniques described above, with the critical distinction
that IPL uses information about the optimal equilibrium to
guide the search. This results in guaranteed convergence to
an optimal joint policy if at least one strict coordination
equilibrium exists.
111. INCREMENTAL POL. lCY LEARNING
A. Terminology
Let A = { a1, ..., a,} be the action selections available to
an agent, and let P = { p( al), ..., p( a,)} be a probability
distribution over those actions Then 0 5 p( at) 5 1 and
We assume that the agents are repeatedly playing a single-stage
game in which the ( external) state of the agent never
changes. In any time step t, each reinforcement learning agenl
executes an action a( t) E A and receives an information vector
I ( t ) back from the environment. This information includes the
agent’s reward r( t) and may also include the action selections
of other agents, the payoffs received by other agents, or othet
information about the environment.
The IPL algorithm uses a binary heuristic H( t) that returns
a value of 0 or 1. This heuristic may use as parameters the
C , p ( a , ) = 1.
information vector I ( t ) , the agent’s utilities ( if it is maintaining
them), or any other information accessible to the agent.
One of the simplest heuristics a reinforcement learning
agent can use returns 1 if and only if the reward received by
the agent matches or exceeds the expected ‘ reward associated
with a coordination equilibrium, that is, if r( t) 2 T,,~.
An inequality is used in addition to the equality because
the maximum expected reward may he exceeded due noisy
payoffs. This heuristic is used for the basic form of IPL
because of its simple nature and because of the potential for
inferring T~~~ from observations or from a priori information
about the environment.
E. Algorirhnr
The basic principle of IPL is that, if the heuristic indicates
that the agent’s objectives were successfully obtained, then
the agent increases the probability of repeating the action just
executed while slightly decreasing the probability of executing
any other action. This algorithm is purely policy- based; the
agent’s utilities have no influence on the agent’s behavior
unless they are used by the heuristic function.
The basic form Incremental Policy Learning Algorithm
functions as follows:
. Initialization
Vi, p( ai) = wi, where w is an arbitrarily chosen
initialization vector that satisfies Cy= owi = 1 and
Vi, wi > 0.
Action Selection
In each time step t, the agent selects an action a ( t ) E A
such that pTob( a( t) = ai) = p( ai). The agent executes
this action, receives an information vector I ( t ) , and
updates P as described below. . Probability Updates
Let H( t) = 1 if r( t) 2 rmar. 0 othemise.
Let 0 < a 5 1.
If H( t) = 1 then V i :
if ( a( t) = ai) then p’( ai) = p( ai) + a( 1 - p( a;))
if ( a( t) # ai) then p’( ai) = p( ai) - ap( a,)
The update rule used preserves the probability distribution
P. Hence C i p ’ ( a i ) = 1 and for all $( a;), 0 5 p’( a,) 5 1.
The general form of the algorithm differs from the basic
form in that the heuristic H( t) is not specified. It is intended
that the general algorithm can he adapted to suit varied
situations by selecting an appropriate heuristic.
C. Convergence
Figure 1 shows a generalized payoff matrix for a deter-minstic
two player game. We will assume that the game is
constrained such that it contains at least one coordination
equilibrium and that all coordination equilibria are strict. This
means that there exist values x* and y* such that Vi, j, x* 2
xij, y* 2 yij, and xij = x* iff y, j = y’.
1122
Suppose that we have two IPL agents, a and b, repeatedly
playing a single- stage game with this payoff matrix using
x* and y* as their respective T,,,~= values. Under these
conditions, the agents' heuristics will always be positively
correlated. Hence, if ( a,, bj) is a coordination equilibrium,
then x'- 3 .= x*, yij = .' y and H( t) = 1 for both agents.
If ( U,, bj) is not a coordination equilibrium, then xij < XI,
yij < y'. and H ( t ) = 0 for both agents.
We wish to determine whether the agents will lean to
play an optimal equilibrium. We begin by noting that the
combination of the agents' IPL probability distributions creates
a probability distribution over the joint action space such that
p( a,, bj) = p( ai) p( bj) ( 1)
Each time a coordination equilibrium is played, the heuristic
of both agents is satisfied and each joint action takes on a new
probability
p'( ai, bj) = p( ai, bj) + Ap( ai, bj) ( 2)
The value to which p( a;, bj) can increase is bounded for all
joint actions that are not coordination equilibria, as shown in
Theorem 3.1.
Theorem 3.1: If ( a;, b j ) is not a coordination equilibrium
and if p ( a i , b j ) > 0.5, then Ap( ai, bj) 5 0.
Pro08 We, wish to establish the conditions under which
Ap( ai, bj) 5 0. Using equation ( 2) to derive Ap( ai, bj) and
substituting, we obtain p'( ai, bj) - p( ai, bj) 2 0, which by
equation ( I ) is equivalent to
p'( ai) p'( bj) - p( ai) p( bj) I 0 ( 3)
The values of p'( a,) and p'( bj) depend on the nature of the
action executed by the system at time t. Let ( as, b,) he this
joint action. If ( as, b,) is not a coordination equilibrium, then
the IPL probabilities are not adjusted. In this case, p'( ai) =
p( a,), p'( bj) = p( bj) and A ( a , , bj) = 0.
If ( as, b,) is a coordination equilibrium then joint action
( a,, bj) can be related to it in three ways: They can share
a row ( i = k , j # z), share a column ( i # k , j = z), or
be completely disjoint ( i # k , j # 2). They cannot share
both a row and a column because ( a s , b,) is a coordination
equilibrium and ( ai, bj) is not.
If the two actions are disjoint then p'( ai) = p( ai) - ap( ai)
and p'( b3) = p( bj) - ap( bj). This means that p'( a,, b j ) <
p( a;, bj). so Ap( ai, b j ) < 0.
If the actions share a row or a column the situation is less
clear because one term of p'( ai) p'( bj) is increasing while the
11
other is decreasing. Without loss of generality, assume ( i =
k, j # z). Then p'( a;) = p( ai) + a( 1 - p( a,)) and p'( bj) =
PVJj) - ap( b,).
Substituting into equation ( 3) we get
( p ( a i ) + a ( l - p ( a i ) ) ) ( p ( b j ) - a p ( b j ) ) - p( ai) p( bj) 5 0 ( 4)
which resolves to
p( a,)( a - 2) 5 ( a - 1) ( 5)
Dividing by a - 2, a negative quantity, we find that
Ap( ai, b j ) 2 0 whenever p( a;) > s. The right side of the
equation is maximized when a = 0, so as long as p( ai) > 0.5,
A ( a i , bj) 5 0.
The above proof demonstrates that the system can never
converge to a solution that is not a coordination equilibrium:
whenever the probability of a suboptimal joint action is greater
than 0.5, it will decrease on the next update. We also notice an
interesting phenomenon. In equation 4, the value p( bj) exists
in all multiplicative terms and factors out of the calculation.
Thus, if a suboptimal joint action ( a i , b j ) shares a column with
a coordination equilibrium, the sign of Ap( a;, b j ) depends
only on the value of p( ai).
Let us now consider the case where ( ai, bj) is a coordination
equilibrium. In this case we find that there is some critical
value of p( ai, bj) beyond which Ap( ai, bj) is always positive.
Theorem 3.2: If a system contains at least one strict coor-dination
equilibrium and at least one joint action that is not a
coordination equilibrium, then 3p* such that if p ( a i , b j ) > p'
then p( a,, bj) is more likely to increase over time than to
decrease.
We wish to determine under what conditions
Ap( ai, b j ) > 0. In order to do so, we examine two types of
update situations. Since ( a,, bj) is a coordination equilibrium,
p( ai) and p( bj) will increase each time it is executed. When
some other coordination equilibrium ( as+, b,+ j) is executed,
p( ai) and p( bj) both decrease. To reflect these situations, we
define
Pro08
p-( ai, bj) = ( p ( a i ) - a p ( a ; ) ) ( p ( b j ) - 4 b j ) ) ( 7)
Let p( c) represent the probability that the executed action
is some coordination equilibrium other than ( a;, b,). Then
the value of p'( a,, b j ) can be probabilistically described as
a weighted average of the above possibilities, using p( ai, bj)
and p( c) as weighting factors. When p'( ai, bj) - p( a;, b j ) > 0,
Ap( ai, b j ) is positive. We therefore seek the conditions that
satisfy
23
We note that the term p( c) p-( ai, bj) is somewhat pes-simistic,
since the executed coordination equilibrium might
share a row or column with ( a i , b j ) . In this case, either
p( ai) or p( bj) would increase, rather than decrease. However,
these cases only improve the chance that the inequality in
equation ( 8) will be satisfied. We therefore assume the worst-case
scenario that the executed coordination equilibrium shares
neither a row nor a column with ( a,, bj).
After substituting from equations ( 6) and ( 7), equation ( 8)
can be reduced to
j
In the worst- case scenario, p( c) = 1 - p( a,) p( b,). In this
case, equation ( 9) can be simplified to p( a,) + p( b,) > 2; an
impossible condition. However, p( c) = 1 - p( ai) p( bj) only if
all joint actions are coordination equilibria, which defies the
premises of the theorem. We quickly see that for any value of
p( c) = 1 - p( a;) p( b,) - E, where 0 < E < 1 - p( ai) p( b,), then
equation ( 9) resolves to
:
which is not an impossible condition. There are values of
p( a;) and p( b,) that will satisfy it, even though they may be
high. Thus there exist critical values for p( ai) and p( bj) ( and
correspondingly, for p( ai, b,), beyond which Ap( a;, b j ) tends
to he positive.
We can now proceed to examine the overall system behavior.
By Theorem 3.2, there exists some critical value p’ beybnd
which p( a;, bi) tends to be continually increasing. The greater
the amount by which p( a,, b j ) exceeds this threshhold, the less
likely it becomes that p( ai, b j) will decrease. The threshhold
effectively represents the point at which a single coordination
equilibrium dominates all other possibilities and begins to be
executed with steadily increasing frequency.
With continued positive updates, p( ai) and p( bj) converge
towards 1 ( because the iterative series z’ = z + cy( 1 - z)
converges to 1 for 0 < a 5 1). Consequently, p( ai, bj) also
converges to 1.
Getting the ball rolling on this convergence issue may take a
while. If p’ is relatively high, then it may take many iterations
before the probability of one of the coordination equilibria
happens to reach it. Fortunately, we have a guarantee from
Theorem 3.1 that no suboptimal equilibrium can maintain a
probability greater than 0.5, so there is no risk of converging
suboptimally while waiting for a coordination equilibrium to
dominate. In the degenerate case where all joint actions are
optimal, the agents will, of course, also play optimally, even
if their probabilities never converge.
IV. ALGORITHMBE HAVIOR
Most environments are not conveniently deterministic, and
most heuristics are not 100% accurate. How does the IPL
algorithm perform in the face of such uncertainties?
11
Fig. 2. IPL performance as a function of noise
A. Noisy Rewards
Figure 2 shows algorithm performance as a function of
noise. Two IPL agents, each having five actions, repeatedly
played a single- stage game until they executed the same joint
action 50 times in succession. Each cell of the payoff matrix
was randomly initialized to an. integer between 0 and 24
( different random payoffs were assigned to each agent), with
the exception of 5 randomly placed coordination equilibria
whose payoff was 25 for both agents.
Gaussian noise was simulated using Peitgen et. al’s equa-tion
[ 2]:
Using values A = 100 and n = 50. The result D was
multipled by a scaling factor of S to provide different degrees
of noise. The value SD was added to the agents’ reward
signals.
The agents used T~~~ = 25 and a = 0.1, and the results
of 100 trials were averaged for each data point.
We found that even for very large values of S with respect
to the range of possible rewards, the IPL algorithm was able to
focus in on a near- optimal solution with reasonable frequency.
B. Estimation of T,,,~=
A critical question for IPL is ‘ how the algorithm performs
when the value of T~~~ is not known in advance, but must
instead be inferred from known information.
We implemented a set of Q- leaming agents who were each
able to observe the action selections of the other agents. The
agents used an initial utility value of 0, a learning rate 9 = 0.1,
and a simplified joint Q- value update equation
&’( ai, b j ) = ( 1 - q) Q( ai> bj)+ q ( ~ ) ( 121
where &( ai, bi) is the estimated utility of performing joint
action ( a;, b,) and T is the reward received for executing joint
action ( a,, b j ) .
124
Fig. 3. IPL performance as a function of a with rmor estimation. Scaling
factor S = 0 was used for the deterministic environment, S = 5 for the
stochastic environment.
In each time step, each agent estimated
T , , , ~ ~ = max;, j Q( a,, bj) and exceuted the IPL update
equation. Payoff matrix generation and noise generation were
carried out as described in the previous sub- section, and the
results of 100 trials were averaged.
The results are shown in Figure 3. As one might expect,
IPL performance was better with lower values of a. This is
not surprising, since a high value for cy might cause the IPL
algorithm to converge before the joint Q- values accurately
reflected the relative magnitude of the expected rewards.
C. Extensions of the Algorithm
The IPL algorithm lends itself naturally to several possi-ble
heuristics. Here, we discuss only a few of those which
intuitively appear most useful and which seem to reflect
other results obtained in the field of multiagent reinforcement
learning.
If agents have access to the full payoff matrix of the game
being played ( either because it was provided a priori or
because they have inferred it by watching the action selections
and rewards of other actions), then they can determine whether
a given joint action represents a Nash equilibrium. By letting
H( t) = 1 if the executed action was a Nash equilibrium and
H( t) = 0 otherwise, the agents can select between multiple
Nash equilibria.
Similarly, agents interacting in strictly adversarial envi-ronments
can learn to play a Minimax strategy byletting
H( t) = 1 if the executed joint action is a Minimax solution
and H( t) = 0 otherwise. Naturally, both of these approaches
require that the agents be able to see the action selections of
their counterparts, and also rely on the assumption that all
agents in the system are using the same heuristic.
One potentially interesting application of the IPL algorithm
is in satisficing environments. In satisficing, agents seek ac-tions
that are “ good enough” rather than seeking actions that
are optimal 171, [ SI. A satisficing criterion could be chosen
( such as a threshhold value for T,) such that H( t) = 1
whenever the criterion is satisfied. In this way, the agents could
select between multiple satisficing solutions.
v. CONCLUSIONS AND FUTUREW ORK
We have presented an equilibrium selection method for
agents that are able to determine with reasonable accuracy
whether or not an optimal equilibrium was obtained in the
last time step. When heuristic information is provided by
an external source, the algorithm is able to search through
policy space directly, without learning ut
the actions of other agents. When the heuristic information
rred or approximated based on observed rewards,
and observation of the complete action space may
be useful tools for the acquisition of good heuristic data.
Incremental Policy Learning provably learns to play an
optimal solution if the heuristic is accurate, the environment is
deterministic, and at least one coordination equilibrium exists.
Empirical studies indicate that IPL performs well in the
presence of noise and when the heuristic information is
approximated rather than precisely known. Future work should
concentrate on theoretical bounds on the effectiveness of such
methods, as well as on an analysis of the effect of the joint
action space size on convergence speed.
Because IPL updates only when the heuristic’s requirements
are satisfied, convergence may take a very long time in large
joint action spaces with sparse equilibria. Compounding this
problem, if the environment is noisy and the joint action space
has several near- optimal solutions, the agents may learn IO play
one of these before a true optimum is discovered. Both of these
problems might be alleviated by adding a decay factor such
that the probability of an action’s execution decreases each
time an optimal solution is not found. This would encourage
more uniform exploration of the joint action space.
Finally, other possible heuristics and methods for approx-imating
them should be developed so that the algorithm’s
usefulness can he expanded to new situations.
REFERENCES
[ I] Caroline Claus and Craig Boutilier. The dynamics of reinforcement
learning in cooperative multiagent systems. In AAAINAAI, pages 746-
752. 1998.
[ 2] H. Jurgens H. 0. Peitgen and D. Supe. Chum and Frncrols. Springer-
Verlag, New York. 1992.
[ 3] J. Hu and M. Wellman. Nash q- learning for general- sum stochastic games.
Joumol ofMachim Learning Research. to appear, 2003.
[ 4] S. Kapetlnakis and D. Kudenko. Improving on the reinforcement leaming
of coordination in cwperative multi- agent systems. In Second AISB
Symposium on Adoptive Agent.$ and MulriGAgent Systems, 2W2.
[ 51 Marun Lauer and Manin Riedmiller. An algoithm for diswibuted rein-forcement
learning in cooperative multiLagent systems. In Proceedings of
the 17rh Inremtioml Conference on Machine Learning, pages 535- 542.
. San Francisco, CA, 2000. Morgan Kaufman.
[ 6] Michael Littman. Friend or foe q- learning in general- sum markov games.
In Proceedings of the Eighreenth Inrermriowl Conference on Machine
~ e ~ r n i npga. g es 3 2 2 - 3 2 8 , 2 ~ 1
[ 7] H. A. Simon. A behavioral model of rational choice.
[ 8] W. C. Stirling, M. A. Goodrich, and D. J. Packard. Satisficing equilibria:
A non- classical approach to games and decisions. Auronomous Agenrs
and Mulri- Agenr Sysrems Journal. 5: 305- 328. 2002.
[ 9] X. Wang and T. Sandholm. Reinforcement learning to play an optimal
nash equilibrium in team markov games. In Advonces in Neural
Informorion PWC~ S@ Sysrems IS ( NIPS- 2002). Vancouver, 2W2.
1125