Multi-armed bandits
From Theory
Notes for CS 8803 - Game Theory and Computer Science. Spring 2008
Contents |
The setting
The multi-armed bandits (MAB) setting is similar to the repeated n-decision game except that one does not find out the entire payoff vector each period. Again, a decision-maker has
fixed decision options. Each period, each decision pays off a bounded real-valued payoff, say in [ − M,M] for some
. Hence the sequence of payoffs can be modeled by payoff vectors
. The decision-maker must choose a
single decision
on each period t. The payoff for the decision-maker that period is
. The main difference between this setting and the repeated n-decision game is that the decision-maker only finds out its payoff -- it is not informed of the payoffs of other decisions. The reason this problem is called the multi-armed bandit problem is in analogy to a one-armed bandit.
Again for simplicity, let us assume that the number of periods, T, is known in advance (though this assumption may be removed).
The decision-maker chooses
. The environment chooses
. The decision-maker's payoff
on period t is
.
Note that we make no assumption about the sequence
, other than that it is unknown but bounded. It can be arbitrary and changing. (We do not assume it is drawn independently from some distribution).
Each period,
:
- The decision-maker chooses
, based on its previous payoffs and decisions
.
- The decision-maker receives payoff
, and only
is revealed to the decision maker.
The decision-maker's regret is defined to be:
|
This is the difference between her average payoff and the average payoff achievable by the best single decision decision
, where the best is chosen with the benefit of hindsight. Note that this definition does not take into account the fact that if the decision maker had chosen i * each period, then the environment might have altered its choices of pt. Many alternative notions of regret are natural, but the nice thing about the above definition is that, in many cases, we can bound our regret or expected regret.
A MAB algorithm
The MAB algorithm is going to build off of the weighted-majority algorithm. The algorithm has two parameters. The first,
is the exploration parameter. The second, ε > 0 is the same as that of the weighted majority.
Parameters:On period
: * Let
be the total payoff vector. * For
, let
, where
. * Choose
from probability distribution xt, i.e., dt = i with probability
.
Note that by construction
for each t.
MAB Theorem
Theorem. For any , and for any , the MAB algorithm run with achieves,
|
We note that regret on the order of
is achievable by a more careful analysis.
MAB Proof
Proof.
Let
be the best decision in hindsight, i.e., one that maximizes,
.
Note that because dt is chosen exactly according to xt, we have,
where expectation is taken over the randomness of the algorithm. Hence, it suffices to show that,
Define
, so
.
Next note that
for each i,t, so
for
.
Next note that the sequence yt is exactly what the weighted-majority algorithm would use on the payoff sequence
using the setting
, which is the setting chosen in the weighted majority theorem. Hence, we have (with certainty),
Next, we have,
Hence,
We also claim that,
for all
.
The expectation in the above is taken over the randomness in the algorithm. This follows from,
That is, if we fix (condition upon) any particular
, then
. Hence,
.
Similarly, we also have
and
.
To see these, note that if we fix
(thereby fixing
),
Hence, in expectation,
Now, for our choice of
, we have,
This is what we needed.
Q.E.D.
The production of this material was supported in part by NSF award SES-0734780.
On period
be the total payoff vector.
* For
, let
, where
.
* Choose
, and for any
achieves,
.
