Repeated n-decision game
From Theory
Notes for CS 8803 - Game Theory and Computer Science. Spring 2008
The setting here is that the decision-maker has
fixed decision options. Each period, each decision pays off a bounded real-valued payoff, say in [ − M,M] for some
. Hence the sequence of payoffs can be modeled by payoff vectors
. We allow the decision-maker to choose a probability distribution over the decisions, i.e., a member of
.
For applications in which one needs to choose a single action, this can be simulated by the randomized weighted majority, in which each period the decision maker chooses one of the decisions according to her specified distribution.
For simplicity, let us assume that the number of periods, T, is known in advance (though this assumption may be removed).
The decision-maker chooses
. The environment chooses
. The decision-maker's payoff
on period t is
.
Note that we make no assumption about the sequence
, other than that it is unknown but bounded. It can be arbitrary and changing. (We do not assume it is drawn independently from some distribution).
Each period,
:
- The decision-maker (player 1) chooses
, based on the history
.
- The decision-maker receives payoff
, and the entire vector pt is revealed to the decision maker.
Remark 1. This version is the perfect monitoring version, meaning that the decision-maker finds out the payoffs of all the decisions each period. Later we will talk about an imperfect monitoring version, in which the decision-maker finds out only the payoff of her choice that period.
Remark 2. The version is the mixed-decision version, in which the decision-maker outputs a probability distribution over decisions and achieves exactly the expected payoff.
Remark 3. Note that the in the above definition, the payoffs are not allowed to be adaptive in the sense that they cannot depend on any randomized choices that the decision-maker may make. However, in the mixed-decision version, we will consider only deterministic algorithms for making decisions. In this case, there is no need to consider adaptive payoffs.
The decision-maker's regret is defined to be:
|
This is the difference between her average payoff and the average payoff achievable by the best single decision
, where the best is chosen with the benefit of hindsight. Note that this definition does not take into account the fact that if the decision maker had chosen x * each period, then the environment might have altered its choices of pt. Many alternative notions of regret are natural, but the nice thing about the above definition is that, in many cases, we can bound our regret or expected regret.
The production of this material was supported in part by NSF award SES-0734780.
