Proof of the weighted majority theorem

From Theory

Jump to: navigation, search

Notes for CS 8803 - Game Theory and Computer Science. Spring 2008


Here we prove the Weighted majority theorem. First we give some intuition.

Intuition and useful lemmas

We are in the repeated n-decision game. Consider first the natural follow-the-leader strategy in which the decision maker uses, on each period t, the single decision that would have worked best on the previous periods. That is, x^t \in \Delta_n that maximizes p^1 \cdot x + p^2 \cdot x + \ldots p^{t-1} \cdot x. This may be a very poor strategy in simple cases. For example, if n = 2, p1 = (1 / 2, − 1 / 2), and p^2=(-1,1),p^3=(1,-1),p^4=(-1,1),\ldots,p^T=(-(-1)^T,(-1)^T), then this approach will have x^2=(1,0),x^3=(0,1),\ldots, and the decision-maker would get a payoff of − 1 each period, while any single decision would achieve an average payoff of about 0.

The difficulty in the above is that the "leader" is changing each period. If it so happened that x^1=x^2=\ldots=x^T in the above algorithm, then we would have 0 regret. And if xt was close to xt + 1 for most periods, we would similarly have little regret. This is implied by the following lemma.

Stability lemma. For any set S, T\geq 1, sequence of functions f^1,f^2,\ldots,f^T:S \rightarrow \reals, and sequence x^2,\ldots,x^{T+1}, where x^t \in S maximizes f^1(x)+\ldots+f^{t-1}(x),
\sum_{t=1}^T f^t(x^{t+1}) \geq \max_{x \in S} \sum_{t=1}^T f^t(x)

What the above lemma says is that the hypothetical "be the leader" algorithm that, on each period t uses the decision that works best on periods 1,\ldots,t, would have no regret. Of course it's impossible to implement such a strategy since we don't know pt when we choose xt. But it does imply that if xt is close to xt + 1 for each t (i.e., the leader is stable), then "following the leader", i.e., using xt on period t would yield low regret.

Proof. The proof follows by induction on T. The base case T = 1 is trivial. Suppose it holds for T − 1 and we want to show it for T. So, by induction hypothesis, we have,

f^1(x^2)+f^2(x^3)+\ldots+f^{T-1}(x^T)\geq \max_{x \in S} f^1(x)+\ldots+f^{T-1}(x).

Now, \max_{x \in S} f^1(x)+\ldots+f^{T-1}(x)=f^1(x^T)+\ldots+f^{T-1}(x^T)\geq f^1(x^{T+1})+\ldots+f^{T-1}(x^{T+1}). Hence,

f^1(x^2)+f^2(x^3)+\ldots+f^T(x^{T+1})\geq f^1(x^{T+1})+f^2(x^{T+1})+\ldots+f^{T-1}(x^{T+1})+f^T(x^{T+1}).

The above is equivalent to the lemma. Q.E.D.

In general the leader may change quite often, as in our example above. Hence, the key idea is to add a regularization term to the maximization to make the leader more stable. That is, rather than maximizing payoff so far, one maximizes payoff so far plus regularization. (Such regularization is common in machine learning.) The following lemma says that if the regularization makes the decisions stable, then we will have low regret.

Stability-regularization lemma

Stability-regularization lemma. For any set S, T\geq 1, functions r,f^1,f^2,\ldots,f^T:S \rightarrow \reals, and sequence x^1,x^2,\ldots,x^{T+1}, where x^t \in S maximizes r(x)+f^1(x)+\ldots+f^{t-1}(x),
\sum_{t=1}^T f^t(x^t) \geq \max_{x \in S} \sum_{t=1}^T f^t(x)-\sum_{t=1}^T \left(f^t(x^{t+1})-f^t(x^t)\right)-\max_{x, x' \in S} \left|r(x)-r(x')\right|

Proof. Let x*\in S maximize f1(x) + ... + fT(x). By previous lemma,

r(x^1)+f^1(x^2)+...+f^T(x^{T+1}) \geq r(x^*)+f^1(x^*)+...+f^T(x^*)

Rewriting

f^1(x^2)+f^2(x^3)+...+f^T(x^{T+1}) \geq r(x^*)-r(x^1)+f^1(x^*)+...+f^T(x^*)
Adding and subtracting
ft(xt)
t
, we get
f^1(x^1)+...+f^T(x^T)\geq r(x^*)-r(x^1)+\sum_t f^t(x^*) -\sum_t (f^t(x^{t+1})-f^t(x^t))

Q.E.D.

The main missing ingredient is now to show that the weighted majority algorithm in fact maximizes payoff plus regularization term. In particular, the regularization function is,

r(x) = \frac{1}{\epsilon} H(x),\,

where H(x) is the "entropy" of the distribution x,

H(x) = \sum_{i=1}^n x_i \log_2 \frac{1}{x_i}\,, (where 0\log \frac{1}{0}=0 is taken by definition)

Entropy is a beautiful notion capturing how much uncertainty a distribution has. For example, it is easy to check that the entropy of the uniform distribution over n items has entropy log(n), while the distribution which assigns probability 1 to any single decision has entropy 0. Don't fear, the properties of entropy we will use follow from its definition.

Lemma. For any x \in \Delta_n, 0 \leq H(x)\leq \log n.

Proof. Since x_i \leq 1, -x_i \log x_i\geq 0. So we easily have H(x) \geq 0. For the other part, we will use Jensen's inequality for convex functions, since log(1 / x) = − log(x) is convex. This gives,

\sum_{i=1}^n x_i \log \frac{1}{x_i} \leq \log \sum_{i=1}^n \frac{x_i}{x_i} = \log n.\,

Q.E.D.

Analysis of weighted majority

Next, we argue that the Weighted Majority algorithm is a regularized maximizer for r(x) = \frac{1}{\epsilon} H(x),

Lemma Apple. For any ε > 0, the xt of the Weighted-majority algorithm WM(ε) maximizes \frac{1}{\epsilon} H(x) + p^1 \cdot x+\ldots+p^{t-1}\cdot x over x \in \Delta_n.

Proof. By definition, we have for any x \in \Delta_n,

\frac{1}{\epsilon} H(x) + p^1 \cdot x+\ldots+p^{t-1}\cdot x = \sum_i \frac{1}{\epsilon} x_i \log \frac{1}{x_i} + P^t_i x_i.

By simple algebra, the above is equal to,

\sum_i \frac{1}{\epsilon} \left(x_i \log \frac{1}{x_i} + \epsilon x_i P^t_i \right) = \frac{1}{\epsilon}\sum_i x_i \log \frac{e^{\epsilon P^t_i}}{x_i}.

For the xi chosen by the algorithm, the above expression is \frac{1}{\epsilon} \sum x_i \log Z^t = \frac{1}{\epsilon} \log Z^t, because \sum x_i=1. Again using our friend Jensen's inequality, we have:

\frac{1}{\epsilon} \sum_i x_i \log \frac{e^{\epsilon P^t_i}}{x_i} \leq \frac{1}{\epsilon}  \log \sum_i x_i \frac{e^{\epsilon P^t_i}}{x_i} =\frac{1}{\epsilon} \log Z^t.

Hence, the Weighted Majority indeed maximizes the quantity in the lemma. Q.E.D.

Next, we argue that the weighted majority algorithm is stable.

Lemma Banana. For any \epsilon,M>0, t\geq 1 and p^1,p^2,\ldots,p^t \in [-M,M]^n,
\left|p^t \cdot x^{t+1}-p^t \cdot x^t\right| \leq 4 \epsilon M^2

Proof. Note first that P^{t+1}_i - M \leq P^t_i \leq P^{t+1}_i + M and hence

e^{\epsilon P^{t+1}_i} e^{-\epsilon M} \leq e^{\epsilon P^t_i} \leq e^{\epsilon P^{t+1}_i} e^{\epsilon M}.

The left-hand side above implies that, Z^{t+1} e^{-\epsilon M} \leq Z^t, combined with the right-hand side, gives

x^t_i = \frac{e^{\epsilon P^t_i}}{Z^t} \geq e^{-2\epsilon M} \frac{e^{\epsilon P^{t+1}_i}}{Z^{t+1}} for all i \leq n.

Finally, since e^{-s}\geq 1-s for s \geq 0, we have that x^t_i \geq (1-2\epsilon M) x^{t+1}_i.

Let λ = 2εM. First, if λ > 1, notice that the lemma is trivial because M2 > 2M, and the difference in payoff between two decisions can never be greater than 2M. Hence, WLOG, we may assume that \lambda \in [0,1]. Let z^t \in \reals^n be the unique vector such that,

xt = (1 − λ)xt + 1 + λzt.

Then, we claim that z^t \in \Delta_n. The fact that z^t_i \geq 0 follows directly from the argument above. The fact that \sum z^t_i = 1 follows from the fact that \sum x^t_i = 1, \sum x^{t+1}_i=1 and that xt is a convex combination of xt + 1 and xt.

Finally,

x^t \cdot p^t -x^{t+1} \cdot p^t  = -\lambda x^{t+1} \cdot p^t + \lambda z^t \cdot p^t.

Since y\cdot p \in [-M,M] for all y \in \Delta_n, the magnitude of the above quantity is at most M = 4εM2, as required by the lemma. Q.E.D.

Finally, we can apply the stability-regularization lemma to bound the regret of Weighted Majority.

Theorem. For any n,T\geq 1, M \geq 0, and for any p^1,p^2,\ldots,p^t \in [-M,M]^n, the weighted-majority algorithm run with \epsilon=\frac{1}{2M}\sqrt{\frac{\log(n)}{T}} achieves,
\mathrm{regret} \leq 4M\sqrt{\frac{\log(n)}{T}}.

Proof. The stability-regularization lemma, combined with the lemma before the previous lemma, implies that,

\mathrm{regret}\leq \frac{1}{T}\sum_{t=1}^T \left|x^{t+1}\cdot p^t-x^t \cdot p^t\right| + \frac{1}{T}\max_{x,x'\in \Delta_n} \left|\frac{1}{\epsilon} H(x)-\frac{1}{\epsilon} H(x')\right|.

Since we have shown 0 \leq H(x) \leq \log(n), we have \frac{1}{T}\max_{x,x'\in \Delta_n} \frac{1}{\epsilon} H(x)-\frac{1}{\epsilon} H(x')\leq \frac{\log(n)}{T \epsilon}. The lemma above bounds the "stability" term. Putting these together gives,

\mathrm{regret}\leq \frac{1}{T}\sum_{t=1}^T 4\epsilon M^2 + \frac{\log(n)}{T \epsilon}=4\epsilon M^2 + \frac{\log(n)}{T \epsilon} .

Our choice of \epsilon=\sqrt{\log(n)/T}/(2M) gives the theorem. Q.E.D.


The production of this material was supported in part by NSF award SES-0734780.

Personal tools