Structured decision-making algorithm


Notes for CS 8803 - Game Theory and Computer Science. Spring 2008


The setup is similar to the repeated n-decision game, and also to the repeated n-dimensional concave decision-making game. However, there is now a (possibly infinite) set of decisions <math>S \subset \mathbb{R}^m</math>. (This set need not be convex, though by using mixed strategies we can achieve any point in the convex hull.) Each period the decision-maker chooses a decision <math>x^t \in S</math>. The payoff is assumed to be a linear function <math>p^t \cdot x^t</math>, where <math>p^t \in \mathbb{R}^m</math>. In the version we will cover, the full payoff vector is revealed to the decision-maker (though this assumption can be relaxed to just the payoff of the current decision). Hence, the decision-maker's choice on period <math>t</math> can be a function of <math>x^1,p^1,\ldots,x^{t-1},p^{t-1}</math>.

As usual,

<math>\mbox{regret}=\frac{1}{T}\max_{x \in S}\sum_{t=1}^T p^t \cdot x - \frac{1}{T}\sum_{t=1}^T p^t \cdot x^t.</math>

A simple example: the decision set is the set of paths from node <math>s</math> to node <math>t</math> in a graph with <math>m</math> edges. Each path can be represented as a vector in <math>\{0,1\}^m</math> indicating which edges it uses. Many natural payoff functions can then be written as <math>p^t \cdot x^t</math>, including the indicator of whether a particular edge is on the path, and the sum of edge delays along the path.
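To make this concrete, here is a minimal sketch: paths are 0/1 indicator vectors over edges, and a linear payoff is just a dot product. The 4-edge graph, its two s-t paths, and the delay values are invented for illustration.

```python
# Hypothetical toy example: represent s-t paths in a 4-edge graph as
# 0/1 indicator vectors and evaluate a linear payoff p . x.
# Suppose the two s-t paths use edges {0, 1} and {2, 3} respectively.

def dot(p, x):
    return sum(pi * xi for pi, xi in zip(p, x))

path_a = [1, 1, 0, 0]   # uses edges 0 and 1
path_b = [0, 0, 1, 1]   # uses edges 2 and 3

# Payoff vector = negated edge delays, so p . x is minus the path's delay.
delays = [2.0, 1.0, 4.0, 0.5]
p = [-d for d in delays]

print(dot(p, path_a))   # -3.0: path_a has total delay 3.0
print(dot(p, path_b))   # -4.5: path_b has total delay 4.5
```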


The algorithm is quite simple. We assume that we have a black-box function <math>F</math> that, given a payoff vector <math>p</math>, can efficiently find the best decision <math>F(p)\in S</math>, i.e., <math>F(p) \cdot p = \max_{x \in S} x \cdot p</math>, for all payoff vectors. In the path example, this is simply a shortest-path computation on the edge costs <math>-p</math> (assuming all payoffs are <math>\leq 0</math>).
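Under the assumption that all payoffs are nonpositive, the oracle for the path example reduces to Dijkstra's algorithm on the edge costs <math>-p</math>. A minimal sketch, where the function name, adjacency-list format, and example graph are assumptions for illustration:

```python
import heapq

def F(p, graph, s, t):
    """Oracle for the path example: given edge payoffs p (all <= 0),
    return the 0/1 indicator vector of the s-t path maximizing p . x,
    i.e. the shortest path under nonnegative edge costs -p.
    graph maps each node to a list of (neighbor, edge_index) pairs;
    assumes t is reachable from s."""
    dist = {s: 0.0}
    back = {}                        # node -> (previous node, edge index)
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                 # stale queue entry
        for v, e in graph[u]:
            nd = d + (-p[e])         # cost of edge e is -p[e] >= 0
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                back[v] = (u, e)
                heapq.heappush(pq, (nd, v))
    # Walk back from t to s to recover the path's indicator vector.
    x = [0] * len(p)
    node = t
    while node != s:
        node, e = back[node]
        x[e] = 1
    return x

# Tiny example: edge 0 is s->a, edge 1 is s->t, edge 2 is a->t,
# with payoffs equal to negated delays.
graph = {"s": [("a", 0), ("t", 1)], "a": [("t", 2)], "t": []}
print(F([-1.0, -5.0, -1.0], graph, "s", "t"))   # [1, 0, 1]: path s->a->t
```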

  Parameter: <math>R\geq 0</math>
  On period <math>t=1,2,\ldots,T</math>:
  * Let <math>R^t \in [-R,0]^m</math> be chosen uniformly at random.
  * Let <math>x^t = F(R^t+p^1+p^2+\ldots+p^{t-1})</math>.

Note: we chose <math>R^t \in [-R,0]^m</math> for convenience in the path example (so that all perturbed payoffs remain nonpositive, allowing <math>F</math> to be implemented with a shortest-path algorithm). We could equally well choose <math>R^t \in [0,R]^m</math> or <math>R^t \in [-R/2,R/2]^m</math> without significant change to the analysis.
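The loop above can be sketched directly, using brute force over a finite decision set as a stand-in for the black-box <math>F</math> (the decision set and payoff sequence below are invented for illustration):

```python
import random

def oracle(p, S):
    # Stand-in for the black-box F: best decision in a finite set S.
    return max(S, key=lambda x: sum(pi * xi for pi, xi in zip(p, x)))

def perturbed_leader(S, payoffs, R, rng):
    """Run the structured decision-making algorithm.
    S: finite list of decision vectors; payoffs: payoff vectors p^1..p^T.
    Returns the chosen decisions x^1, ..., x^T."""
    m = len(payoffs[0])
    cum = [0.0] * m                                    # p^1 + ... + p^(t-1)
    decisions = []
    for p in payoffs:
        r = [rng.uniform(-R, 0.0) for _ in range(m)]   # R^t in [-R, 0]^m
        decisions.append(oracle([ri + ci for ri, ci in zip(r, cum)], S))
        cum = [ci + pi for ci, pi in zip(cum, p)]
    return decisions

S = [(1, 0), (0, 1)]                                   # e.g. two paths
payoffs = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
xs = perturbed_leader(S, payoffs, R=2.0, rng=random.Random(0))
```

Every chosen <math>x^t</math> lies in <math>S</math>; only the perturbation is random.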


Lemma Star

Suppose that <math>S \subseteq [-A,A]^n</math> (writing <math>n</math> for the dimension of the decision space) and that <math>|p^t|_1 \leq B</math> for all <math>t</math>. Then the structured decision-making algorithm with <math>R=B\sqrt{T/n}</math> outputs a sequence <math>x^1,x^2,\ldots,x^T</math> with expected regret:
<math>{\mathbb E}_{R^1,R^2,\ldots,R^T}[\mbox{regret}] \leq 4AB\sqrt\frac{n}{T}</math>.

Proof. The proof is similar to the analysis of the weighted majority algorithm. Note first that all payoffs are in the range <math>[-AB,AB]</math>, since <math>|p^t|_1 \leq B</math> and each <math>|x_i| \leq A</math>. By linearity of expectation, we may assume WLOG that <math>R^1=R^2=\ldots=R^T</math>, i.e., that a single point drawn uniformly at random from <math>[-R,0]^n</math> is used in every period. By the stability-regularization lemma, we have that,

<math>\max_{x \in S}\sum_{t=1}^T p^t \cdot x - p^t \cdot x^t \leq \sum_{t=1}^T p^t \cdot x^{t+1}-p^t \cdot x^t + \max_{x,x' \in S} |R^1 \cdot x-R^1 \cdot x'|.</math>

Now, by our assumptions, it is straightforward to see that <math>R^1 \cdot x \in [-RAn,RAn]</math> for any <math>x \in S</math>. Hence,

<math>\max_{x,x' \in S} |R^1 \cdot x-R^1 \cdot x'| \leq 2RAn.</math>

Next, let <math>z^t=R^1+p^1+p^2+\ldots+p^{t-1}</math>, so that <math>x^t = F(z^t)</math>. For any particular value of <math>R^1</math>, <math>F(z^t)</math> may be quite different from <math>F(z^{t+1})</math>; however, we claim that <math>{\mathbb E}[p^t \cdot F(z^{t+1})-p^t \cdot F(z^t)]</math> is small. To see this, note that <math>z^t</math> and <math>z^{t+1}</math> are each uniformly distributed over a cube of side <math>R</math>, and the two cubes are translates of one another by the vector <math>p^t</math>, which satisfies <math>|p^t|_1 \leq B</math>. Along coordinate <math>i</math>, the two cubes overlap in an interval of length <math>\max(0, R-|p^t_i|)</math>, so the fraction of volume on which they overlap is at least <math>\prod_{i=1}^n \max\left(0, 1-\frac{|p^t_i|}{R}\right) \geq 1-\sum_{i=1}^n \frac{|p^t_i|}{R} \geq 1-\frac{B}{R}.</math>
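The overlap fraction of the two cubes can be checked numerically: it is the product over coordinates of the per-coordinate overlap, and it stays above the <math>1-|p^t|_1/R</math> lower bound (the vector and side length below are illustrative):

```python
def overlap_fraction(p, R):
    # Two axis-aligned cubes of side R, one translated by p: along
    # coordinate i they overlap in an interval of length max(0, R - |p_i|),
    # so the overlapping volume fraction is prod_i max(0, 1 - |p_i|/R).
    frac = 1.0
    for pi in p:
        frac *= max(0.0, 1.0 - abs(pi) / R)
    return frac

p = [0.5, 0.25, 0.25]          # |p|_1 = 1.0
R = 10.0
lower_bound = 1.0 - sum(abs(pi) for pi in p) / R     # 1 - |p|_1/R = 0.9
print(overlap_fraction(p, R))  # 0.95 * 0.975 * 0.975 = 0.90309375 > 0.9
```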


Hence, <math>{\mathbb E}[p^t \cdot F(z^{t+1})-p^t \cdot F(z^t)]\leq \frac{B}{R}2AB</math>: the random variables <math>z^t</math> and <math>z^{t+1}</math> can be coupled so that they are equal with probability at least <math>1-B/R</math>, and the rest of the time the two payoffs can differ by at most <math>2AB</math>, since all payoffs are in <math>[-AB,AB]</math>. Hence, we have, in expectation,

<math>{\mathbb E}\left[\max_{x \in S}\sum_{t=1}^T p^t \cdot x - p^t \cdot x^t\right] \leq \frac{1}{R}2AB^2T + 2RAn</math>.

For our choice of <math>R=B\sqrt{T/n}</math>, the two terms are equal and the above is <math>4AB\sqrt{nT}</math>; dividing by <math>T</math> gives expected regret at most <math>4AB\sqrt{n/T}</math>, which is what we need for the lemma.
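As a sanity check, the bound can be simulated on the n-decision (experts) special case, where <math>S</math> is the set of standard basis vectors (so <math>A=1</math>) and <math>F</math> is a coordinate argmax; the i.i.d. payoff sequence and parameter values here are invented for illustration:

```python
import math
import random

def fpl_experts_regret(n, T, seed=0):
    """Run the algorithm with S = standard basis vectors (so A = 1) and
    i.i.d. payoffs in [0,1]^n (so |p^t|_1 <= n = B).
    Returns (regret, bound) where bound = 4*A*B*sqrt(n/T)."""
    rng = random.Random(seed)
    A, B = 1.0, float(n)
    R = B * math.sqrt(T / n)                 # the choice R = B*sqrt(T/n)
    cum = [0.0] * n                          # p^1 + ... + p^(t-1)
    totals = [0.0] * n                       # total payoff of each decision
    alg_total = 0.0
    for _ in range(T):
        p = [rng.random() for _ in range(n)]
        r = [rng.uniform(-R, 0.0) for _ in range(n)]      # R^t in [-R,0]^n
        i = max(range(n), key=lambda j: r[j] + cum[j])    # x^t = F(R^t+cum)
        alg_total += p[i]
        for j in range(n):
            cum[j] += p[j]
            totals[j] += p[j]
    regret = (max(totals) - alg_total) / T
    return regret, 4 * A * B * math.sqrt(n / T)
```

With, say, <code>n=4, T=2000</code>, the observed regret comfortably sits below the <math>4AB\sqrt{n/T}</math> bound.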


The production of this material was supported in part by NSF award SES-0734780.