Multi-armed bandits

From Theory

Jump to: navigation, search

Notes for CS 8803 - Game Theory and Computer Science. Spring 2008


Contents

The setting

The multi-armed bandits (MAB) setting is similar to the repeated n-decision game except that one does not find out the entire payoff vector each period. Again, a decision-maker has n\geq 1 fixed decision options. Each period, each decision pays off a bounded real-valued payoff, say in [ − M,M] for some M \geq 0. Hence the sequence of payoffs can be modeled by payoff vectors p^1,p^2,\ldots,p^T \in [-M,M]^n. The decision-maker must choose a single decision d^t \in [n] on each period t. The payoff for the decision-maker that period is p^t_{d^t} \in [-M,M]. The main difference between this setting and the repeated n-decision game is that the decision-maker only finds out its payoff -- it is not informed of the payoffs of other decisions. The reason this problem is called the multi-armed bandit problem is in analogy to a one-armed bandit.

Again for simplicity, let us assume that the number of periods, T, is known in advance (though this assumption may be removed). The decision-maker chooses d^1,d^2,\ldots,d^T \in [n]. The environment chooses p^1,p^2, \ldots,p^T \in [-M,M]^n. The decision-maker's payoff on period t is x^t \cdot p^t = \sum_i p^t_{d^t}.

Note that we make no assumption about the sequence p^1,p^2,\ldots,p^t \in [-M,M]^n, other than that it is unknown but bounded. It can be arbitrary and changing. (We do not assume it is drawn independently from some distribution).

Each period, t=1,2,\ldots,T:

  • The decision-maker chooses x^t \in \Delta_n, based on its previous payoffs and decisions d^1,p^1_{d^1},d^2,p^2_{d^2},\ldots,d^{t-1},p^{t-1}_{d^{t-1}}.
  • The decision-maker receives payoff p^t_{d^t}, and only p^t_{d^t} is revealed to the decision maker.

The decision-maker's regret is defined to be:

\mathrm{regret}=\max_{i \in [n]}\frac{1}{T}\sum_{t=1}^T p^t_i-\frac{1}{T}\sum_{t=1}^T p^t_{d^t}.\,

This is the difference between her average payoff and the average payoff achievable by the best single decision decision i^* \in \Delta_n, where the best is chosen with the benefit of hindsight. Note that this definition does not take into account the fact that if the decision maker had chosen i * each period, then the environment might have altered its choices of pt. Many alternative notions of regret are natural, but the nice thing about the above definition is that, in many cases, we can bound our regret or expected regret.

A MAB algorithm

The MAB algorithm is going to build off of the weighted-majority algorithm. The algorithm has two parameters. The first, \delta \in (0,1) is the exploration parameter. The second, ε > 0 is the same as that of the weighted majority.

  Parameters: \delta \in (0,1), \epsilon>0
  On period t=1,2,\ldots,T:
  * Let P^t=p^1+p^2+\ldots+p^{t-1} be the total payoff vector.
  * For i=1,2,\ldots,n, let x^t_i = \delta \frac{1}{n}+(1-\delta)\frac{e^{\epsilon P^t_i}}{Z^t},, where Z^t = \sum_{i=1}^n e^{\epsilon P^t_i}.
  * Choose d^t \in [n] from probability distribution xt, i.e., dt = i with probability x^t_i.


Note that by construction x^t \in \Delta_n for each t.

MAB Theorem

Theorem. For any n,T\geq 1, M \geq 0, and for any p^1,p^2,\ldots,p^t \in [-M,M]^n, the MAB algorithm run with \delta=\sqrt{n}/T^{1/4}, \epsilon=\frac{\delta}{2Mn}\sqrt{\frac{\log(n)}{T}} achieves,
{\mathbb E}[\mathrm{regret}] \leq 6M\frac{\sqrt{n\log(n)}}{T^{1/4}}.

We note that regret on the order of O\left(\sqrt{n \log(n)/T}\right) is achievable by a more careful analysis.

MAB Proof

Proof. Let d^*\in [n] be the best decision in hindsight, i.e., one that maximizes,

\frac{1}{T}\sum_{t=1}^T p^t_{d^t}.

Note that because dt is chosen exactly according to xt, we have,

{\mathbb E}[p^t_{d^t}]={\mathbb E}[x^t \cdot p^t],

where expectation is taken over the randomness of the algorithm. Hence, it suffices to show that,

\frac{1}{T}\sum_{t=1}^T  p^t_{d^*}- {\mathbb E}\left[\frac{1}{T}\sum_{t=1}^T x^t \cdot p^t \right]  \leq 6M\frac{\sqrt{n\log(n)}}{T^{1/4}}.

Define y^t_i = e^{\epsilon P^t_i}/Z^t, so x^t_i = \delta/n + (1-\delta) y^t_i. Next note that x^t_i \geq \delta/n for each i,t, so \tilde p^t \in \left[-M',M'\right] for M'=\frac{Mn}{\delta}. Next note that the sequence yt is exactly what the weighted-majority algorithm would use on the payoff sequence \tilde p^1,\tilde p^2,\ldots,\tilde p^T, using the setting \epsilon = \frac{1}{2M'}\sqrt{\frac{\log(n)}{T}}, which is the setting chosen in the weighted majority theorem. Hence, we have (with certainty),

\frac{1}{T}\sum_{t=1}^T  \tilde p^t_{d^*}-\frac{1}{T}\sum_{t=1}^T y^t \cdot \tilde p^t \leq 4M'\sqrt{\frac{\log(n)}{T}}

Next, we have,

\frac{1}{T}\sum_{t=1}^T x^t \cdot \tilde p^t =  \frac{1}{T} \sum_{t=1}^T \frac{\delta}{n} \sum_{i=1}^n \tilde p^t_i + (1-\delta) \frac{1}{T} \sum_{t=1}^T y^t \cdot \tilde p^t.

Hence,

\frac{1}{T}\sum_{t=1}^T  \tilde p^t_{d^*}- \frac{1}{T}\sum_{t=1}^T x^t \cdot \tilde p^t  \leq -\frac{1}{T} \sum_{t=1}^T \frac{\delta}{n} \sum_{i=1}^n \tilde p^t_i  + \delta \frac{1}{T} \sum_{t=1}^T y^t \cdot \tilde p^t + 4M'\sqrt{\frac{\log(n)}{T}}

We also claim that,

{\mathbb E}[\tilde p^t_i]=p^t_i for all t,i,p^1,\ldots,p^t.

The expectation in the above is taken over the randomness in the algorithm. This follows from,

{\mathbb E}[\tilde p^t_i ~|~x^t_i] = \Pr[i=d^t]\frac{p^t_i}{x^t_i} + \Pr[i \neq d^t] 0=p^t_i.

That is, if we fix (condition upon) any particular x^t_i, then {\mathbb E}[\tilde p^t_i ~|~x^t_i]=p^t_i. Hence, {\mathbb E}[\tilde p^t_i]=p^t_i. Similarly, we also have {\mathbb E}[x^t \cdot \tilde p^t_i] =  {\mathbb E}[x^t \cdot p^t_i] and {\mathbb E}[y^t \cdot \tilde p^t_i] =  {\mathbb E}[y^t \cdot p^t_i]. To see these, note that if we fix d^1,d^2,\ldots,d^{t-1} (thereby fixing \tilde p^1,\tilde p^2,\ldots,\tilde p^{t-1}),

Hence, in expectation,

\frac{1}{T}\sum_{t=1}^T  p^t_{d^*}- {\mathbb E}\left[\frac{1}{T}\sum_{t=1}^T x^t \cdot p^t \right]  \leq -\frac{1}{T} \sum_{t=1}^T \frac{\delta}{n} \sum_{i=1}^n p^t_i  + \delta \frac{1}{T} \sum_{t=1}^T {\mathbb E}\left[ y^t \cdot p^t  \right] + 4M'\sqrt{\frac{\log(n)}{T}} \leq 2\delta M + 4M'\sqrt{\frac{\log(n)}{T}}.

Now, for our choice of \delta=\sqrt{n}/T^{1/4}, we have,

2\delta M + 4\frac{Mn}{\delta}\sqrt{\frac{\log(n)}{T}}= 2M\frac{\sqrt{n}}{T^{1/4}}+4M\frac{\sqrt{n\log(n)}}{T^{1/4}} \leq 6M\frac{\sqrt{n\log(n)}}{T^{1/4}}.

This is what we needed.

Q.E.D.



The production of this material was supported in part by NSF award SES-0734780.

Personal tools