Structured decision-making algorithm

Setup

The setup is similar to the repeated n-decision game, and also to the repeated n-dimensional concave decision-making game. However, there is now a (possibly infinite) set of decisions $S \subset {\mathbb R}^m$. (This set need not be convex, though by using mixed strategies we can achieve any point in its convex hull.) Each period the decision-maker chooses a decision $x^t \in S$. The payoff is assumed to be a linear function $p^t \cdot x^t$, where $p^t \in {\mathbb R}^m$. In the version we will cover, the full payoff vector is revealed to the decision-maker (though this assumption can be relaxed to revealing only the payoff of the chosen decision). Hence, the decision-maker's choice in period $t$ can be a function of $x^1,p^1,\ldots,x^{t-1},p^{t-1}$.

As usual,

$\mbox{regret}=\frac{1}{T}\max_{x \in S}\sum_{t=1}^T p^t \cdot x - \frac{1}{T}\sum_{t=1}^T p^t \cdot x^t.$
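As a concrete illustration (not part of the notes), the regret of a play sequence over a finite decision set can be computed directly from this definition:

```python
def regret(S, payoffs, chosen):
    """Average regret of the plays `chosen` against the best fixed decision in S."""
    T = len(payoffs)
    dot = lambda p, x: sum(pi * xi for pi, xi in zip(p, x))
    # Best payoff achievable by playing a single fixed decision every period.
    best_fixed = max(sum(dot(p, x) for p in payoffs) for x in S)
    achieved = sum(dot(p, x) for p, x in zip(payoffs, chosen))
    return (best_fixed - achieved) / T

# Two decisions, two periods; the player picks the wrong one each time.
S = [[1, 0], [0, 1]]
payoffs = [[1, 0], [0, 1]]
print(regret(S, payoffs, [[0, 1], [1, 0]]))  # 0.5
```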

A simple example is the decisions being the set of paths from node $s$ to node $t$ in a graph with $m$ edges. Each path can be represented as a vector in $\{0,1\}^m$. Many functions can be represented as $p^t \cdot x^t$, including the function which is 1 if a certain edge is in the path and 0 otherwise, or the sum of edge delays on the path.
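The path encoding can be made concrete. Below is a toy sketch (the graph, edges, and delays are made up for illustration): each $s$-$t$ path is an indicator vector over the $m$ edges, and with $p$ the negated edge delays, $p \cdot x$ is minus the path's total delay.

```python
# A 4-edge graph with two s-t paths, each encoded as a 0/1 vector over edges.
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]  # m = 4
paths = {
    "s-a-t": [1, 1, 0, 0],
    "s-b-t": [0, 0, 1, 1],
}
delays = [2.0, 3.0, 1.0, 5.0]
p = [-d for d in delays]  # payoff vector: negated delays, so p . x = -(path delay)

def payoff(p, x):
    return sum(pi * xi for pi, xi in zip(p, x))

for name, x in paths.items():
    print(name, payoff(p, x))  # s-a-t: -5.0, s-b-t: -6.0
```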

Algorithm

The algorithm is quite simple. We assume that we have a black-box function $F$ that, given any payoff vector $p$, can efficiently find a best decision $F(p)\in S$, i.e., $F(p) \cdot p = \max_{x \in S} x \cdot p$. In the path example, this is simply a shortest-path algorithm (assuming all payoffs are $\leq 0$, so that maximizing payoff amounts to minimizing total delay).

Parameter: $R\geq 0$
On period $t=1,2,\ldots,T$:
* Let $R^t \in [-R,0]^m$ be chosen uniformly at random.
* Let $x^t = F(R^t+p^1+p^2+\ldots+p^{t-1})$.


Note: we chose $R^t \in [-R,0]^m$ for convenience in the path example (so that all edge payoffs would be negative, and we could use a shortest-path algorithm). We could equally well choose $R^t \in [0,R]^m$ or $R^t \in [-R/2,R/2]^m$ without significant change to the analysis.
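The two-line loop above can be sketched as follows. This is an illustrative implementation over a finite decision set, where a brute-force argmax stands in for the efficient black-box oracle $F$ (e.g., a shortest-path routine in the path example):

```python
import random

def fpl(S, payoffs, R, seed=0):
    """Structured decision-making (follow-the-perturbed-leader) sketch.

    S: list of decision vectors in R^m; payoffs: the vectors p^1, ..., p^T.
    """
    rng = random.Random(seed)
    m = len(payoffs[0])

    def F(p):  # oracle: a decision maximizing p . x over S (brute force here)
        return max(S, key=lambda x: sum(pi * xi for pi, xi in zip(p, x)))

    cum = [0.0] * m                  # running total p^1 + ... + p^{t-1}
    chosen = []
    for p in payoffs:
        r = [rng.uniform(-R, 0.0) for _ in range(m)]  # R^t uniform on [-R,0]^m
        chosen.append(F([c + ri for c, ri in zip(cum, r)]))
        cum = [c + pi for c, pi in zip(cum, p)]
    return chosen
```

With $R=0$ this degenerates to "follow the leader" (play the best decision for the cumulative past payoffs); the random perturbation is what makes the play stable from one period to the next.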

Analysis

Lemma Star

Assuming $|x|_\infty \leq A$ for all $x \in S$ and $|p^t|_1 \leq B$ for all $t$, the structured decision-making algorithm with $R=B\sqrt{T/m}$ outputs a sequence $x^1,x^2,\ldots,x^T$ with expected regret: ${\mathbb E}_{R^1,R^2,\ldots,R^T}[\mbox{regret}] \leq 4AB\sqrt{\frac{m}{T}}$.

Proof. The proof is similar to the analysis of the weighted majority algorithm. Note first that all payoffs $p^t \cdot x$ are in the range $[-AB,AB]$, since $|x|_\infty \leq A$ and $|p^t|_1 \leq B$. By linearity of expectation, we may assume WLOG that $R^1=R^2=\ldots=R^T$, i.e., that a single point is drawn uniformly at random from $[-R,0]^m$ and reused every period (the expected regret depends only on the marginal distribution of each $R^t$, which is unchanged). By the stability-regularization lemma, we have that,

$\max_{x \in S}\sum_{t=1}^T \left(p^t \cdot x - p^t \cdot x^t\right) \leq \sum_{t=1}^T \left(p^t \cdot x^{t+1}-p^t \cdot x^t\right) + \max_{x,x' \in S} |R^1 \cdot x-R^1 \cdot x'|.$

Now, by our assumptions, each coordinate of $R^1$ has absolute value at most $R$ and each coordinate of $x$ has absolute value at most $A$, so $R^1 \cdot x \in [-RAm,RAm]$ for any $x \in S$. Hence,

$\max_{x,x' \in S} |R^1 \cdot x-R^1 \cdot x'| \leq 2RAm.$

Next, let $z^t=R^1+p^1+p^2+\ldots+p^{t-1}$, so that $x^t = F(z^t)$. For any particular $R^1$, $F(z^t)$ may be quite different from $F(z^{t+1})$. However, we claim that ${\mathbb E}[p^t \cdot F(z^{t+1})-p^t \cdot F(z^t)]$ is small. To see this, note that $z^t$ and $z^{t+1}$ are each distributed uniformly over a cube of side $R$. Moreover, the two cubes are quite similar: they are parallel, and one is a translation of the other by the vector $p^t$, which satisfies $|p^t|_1 \leq B$. The volume of their intersection is $\prod_{i=1}^m \max(0, R - |p^t_i|)$, so the fraction of each cube that overlaps the other is at least $\prod_{i=1}^m (1-|p^t_i|/R) \geq 1-|p^t|_1/R\geq 1-\frac{B}{R}$.

Hence, ${\mathbb E}[p^t \cdot F(z^{t+1})-p^t \cdot F(z^t)]\leq \frac{B}{R}2AB$: we can couple $z^t$ and $z^{t+1}$ so that they agree with probability at least $1-B/R$, and the rest of the time the two payoffs can differ by at most $2AB$, since all payoffs are in $[-AB,AB]$. Hence, we have, in expectation,
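The cube-overlap fraction used in this step is easy to check numerically. The helper below (an illustration, not from the notes) computes the exact overlap fraction of a side-$R$ cube with its translate by a vector $v$ and compares it to the $1-|v|_1/R$ lower bound:

```python
def overlap_fraction(v, R):
    """Fraction of the cube [-R,0]^m that also lies in the cube translated by v.

    The intersection of an axis-aligned cube of side R with its translate
    by v has volume prod_i max(0, R - |v_i|).
    """
    frac = 1.0
    for vi in v:
        frac *= max(0.0, 1.0 - abs(vi) / R)
    return frac

v = [0.5, -0.25, 0.1]                        # plays the role of p^t; |v|_1 = 0.85
R = 10.0
exact = overlap_fraction(v, R)               # (1-.05)(1-.025)(1-.01)
bound = 1.0 - sum(abs(vi) for vi in v) / R   # the 1 - B/R lower bound
assert exact >= bound
```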

${\mathbb E}\left[\max_{x \in S}\sum_{t=1}^T \left(p^t \cdot x - p^t \cdot x^t\right)\right] \leq \frac{2AB^2T}{R} + 2RAm$.

For our choice of $R=B\sqrt{T/m}$, both terms above equal $2AB\sqrt{mT}$, so the bound is $4AB\sqrt{mT}$; dividing by $T$ gives the claimed expected regret bound of $4AB\sqrt{m/T}$.
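As a quick arithmetic sanity check (with made-up values of $A$, $B$, $m$, $T$), this choice of $R$ equalizes the two terms of the bound:

```python
import math

A, B, m, T = 2.0, 3.0, 5, 10_000      # arbitrary illustrative constants
R = B * math.sqrt(T / m)              # the balancing choice of R
term1 = 2 * A * B**2 * T / R          # stability term: (2AB^2 T) / R
term2 = 2 * R * A * m                 # perturbation-range term: 2RAm
assert math.isclose(term1, term2)
assert math.isclose(term1 + term2, 4 * A * B * math.sqrt(m * T))
assert math.isclose((term1 + term2) / T, 4 * A * B * math.sqrt(m / T))
```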

Q.E.D.

The production of this material was supported in part by NSF award SES-0734780.