Revision as of 09:54, 23 April 2008 by Atk (Talk | contribs)





Perceptron Algorithm
Input: <math>T \geq 1</math>, <math>(x_1,y_1),\ldots,(x_m,y_m) \in \reals^n \times \{0,1\}</math>.
Let <math>w^1=0</math>
For <math>t=1,2,\ldots,T</math>:
* Let <math>w^{t+1}:=w^t+\frac{1}{m} \sum_{i=1}^m \left(y_i-I(w^t \cdot x_i \geq 0)\right)x_i</math>
Output: <math>h(x)=I(w^{T+1} \cdot x \geq 0)</math>.
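As a quick sanity check, the batch perceptron loop above can be sketched in Python. This is a minimal sketch, not code from these notes: the vectorized update and function names are ours, and the loop runs through <math>t=T</math> so that the output weight <math>w^{T+1}</math> is defined.

```python
import numpy as np

def perceptron(X, y, T):
    """Batch perceptron from the pseudocode above: X is (m, n), y has {0, 1} entries."""
    m, n = X.shape
    w = np.zeros(n)                           # w^1 = 0
    for _ in range(T):                        # t = 1, ..., T
        preds = (X @ w >= 0).astype(float)    # I(w^t . x_i >= 0)
        w = w + (y - preds) @ X / m           # averaged correction
    return lambda x: float(x @ w >= 0)        # h(x) = I(w^{T+1} . x >= 0)
```

Once every point is classified correctly, the correction term <math>y_i-I(w^t \cdot x_i \geq 0)</math> vanishes and the weights stop changing.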
Regretron I
Input: <math>T \geq 1</math>, <math>(x_1,y_1),\ldots,(x_m,y_m) \in \reals^n \times [0,1]</math>.
Let <math>w^1=0</math>
For <math>t=1,2,\ldots,T</math>:
* Let <math>u^t:=\mathrm{PAV}\left((w^t \cdot x_1,y_1),\ldots,(w^t \cdot x_m,y_m)\right)</math>
* Let <math>w^{t+1}:=w^t+\frac{1}{m} \sum_{i=1}^m \left(y_i-u^t(w^t \cdot x_i)\right)x_i</math>
Output: <math>h(x)=u^T(w^T \cdot x)</math>.
Regretron II
Input: <math>T \geq 1</math>,
       <math>(x^1_1,y^1_1),\ldots,(x^1_m,y^1_m) \in \reals^n \times [0,1]</math>,
       <math>\vdots</math>
       <math>(x^T_1,y^T_1),\ldots,(x^T_m,y^T_m) \in \reals^n \times [0,1]</math>.
Let <math>w^1=0</math>
For <math>t=1,2,\ldots,T</math>:
* Let <math>u^t:=\mathrm{PAV}\left((w^t \cdot x^t_1,y^t_1),\ldots,(w^t \cdot x^t_m,y^t_m)\right)</math>
* Let <math>w^{t+1}:=w^t+\frac{1}{m} \sum_{i=1}^m \left(y^t_i-u^t(w^t \cdot x^t_i)\right)x^t_i</math>
Output: <math>h(x)=u^T(w^T \cdot x)</math>.
PAV Algorithm
Input: <math>(x_1,y_1),\ldots,(x_m,y_m) \in \reals \times \reals</math>.
Sort the data so that <math>x_1 \leq x_2 \leq \ldots \leq x_m</math> and, if <math>x_i=x_{i+1}</math>, then <math>y_i  \geq y_{i+1}</math>, for all <math>i=1,2,\ldots,m-1</math>. 
Let <math>\hat{y}_1=y_1</math>
* For <math>i=2,\ldots,m</math>:
** If <math>y_i \geq \hat{y}_{i-1}</math> then <math>\hat{y}_i=y_i</math>, else:
*** Let <math>j</math> be the largest index such that setting <math>\hat{y}_j=\hat{y}_{j+1}=\ldots=\hat{y}_i=\frac{1}{i-j+1}\sum_{k=j}^i y_k</math> leaves <math>\hat{y}_1 \leq \hat{y}_2 \leq \ldots \leq \hat{y}_i</math>, and pool: set <math>\hat{y}_j,\ldots,\hat{y}_i</math> to that average.
Output: the fitted values <math>\hat{y}_1,\ldots,\hat{y}_m</math>.
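Putting the pieces together, here is a hedged Python sketch of the PAV step and the Regretron loop. It is a sketch, not the authors' code: the stack-based pooling, the tie-breaking by descending <math>y</math> (matching the sort rule above), and the linear interpolation used to evaluate <math>u^T</math> between sample points are our implementation choices.

```python
import numpy as np

def pav(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to the sequence y."""
    blocks = []  # each block is [sum of y's, count]; its fitted value is sum / count
    for v in y:
        blocks.append([float(v), 1])
        # merge while the last two block averages violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return np.array(fit)

def regretron(X, y, T):
    """X: (m, n) array, y: (m,) array in [0, 1]; returns h(x) = u^T(w^T . x)."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(T):                    # t = 1, ..., T
        w_prev = w
        scores = X @ w
        order = np.lexsort((-y, scores))  # sort by w.x, ties broken by y descending
        z = np.empty(m)
        z[order] = pav(y[order])          # z_i = u^t(w^t . x_i)
        w = w + (y - z) @ X / m
    # evaluate u^T off the sample by interpolating the fitted (score, value) pairs
    return lambda x: np.interp(x @ w_prev, scores[order], z[order])
```

For example, `pav([3, 1, 2])` pools all three points into their mean, giving `[2, 2, 2]`, which is the least-squares nondecreasing fit.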

Perceptron model

Idealized regression model

Regression/probabilistic prediction model

Lemma Cherry

Lemma Cherry.

Let <math>y_1,y_2,\ldots,y_m \in \reals</math> be a sequence and let <math>z_1\leq z_2\leq \ldots\leq z_m</math> be the unique minimizer of

<math>\min_{z_1 \leq z_2 \leq \ldots \leq z_m} \sum_{i=1}^m (z_i - y_i)^2</math>.

Let <math>b_1 \leq b_2 \leq \ldots \leq b_m</math> be another nondecreasing sequence. Then,

<math>\sum_{i=1}^m (z_i-y_i)(b_i-z_i) \geq 0</math>.

Proof. Let <math>\lambda \in (0,1]</math>. Note that the sequence <math>\lambda b_i + (1-\lambda)z_i</math> is also nondecreasing and <math>z_i</math> minimizes the sum of squared differences. Hence,

<math>\sum_i (\lambda b_i + (1-\lambda) z_i-y_i)^2 - \sum_i (z_i-y_i)^2 \geq 0.</math>

Collecting sums and using difference of squares gives,

<math>\sum_i (\lambda b_i-\lambda z_i)(\lambda b_i + (1-\lambda) z_i-y_i + z_i-y_i ) \geq 0</math>

Dividing through by the positive quantity <math>2 \lambda</math> gives,

<math>\sum_i (b_i-z_i)\left(\frac{\lambda}{2}(b_i-z_i)+(z_i-y_i)\right)\geq 0</math>.

Since the above holds for all <math>\lambda \in (0,1]</math>, it must hold for <math>\lambda = 0</math> as well, by continuity. Q.E.D.
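A small numeric illustration of the lemma, with our own example and the isotonic fit computed by hand:

```python
# Sanity check of Lemma Cherry: y = (3, 1, 2) has isotonic
# (least-squares nondecreasing) fit z = (2, 2, 2).
def cherry_sum(y, z, b):
    """Returns sum_i (z_i - y_i)(b_i - z_i), which the lemma says is >= 0."""
    return sum((zi - yi) * (bi - zi) for yi, zi, bi in zip(y, z, b))

y = [3.0, 1.0, 2.0]
z = [2.0, 2.0, 2.0]
# try a few nondecreasing comparison sequences b
for b in ([0.0, 1.0, 3.0], [2.0, 2.0, 2.0], [-1.0, 5.0, 10.0]):
    assert cherry_sum(y, z, b) >= 0
```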

Lemma Blueberry

Lemma Blueberry.

<math>{\mathbb E}\left[(w^{t+1})^2- (w^t)^2 \right] \leq .</math>

Proof. Throughout, <math>z^t_i:=u^t(w^t \cdot x^t_i)</math> denotes the PAV fitted value at <math>x^t_i</math>.

By expansion,

<math>(w^{t+1})^2 = (w^t)^2+\frac{2}{m} \sum_{i=1}^m (y^t_i - z^t_i) x^t_i \cdot w^t + \left(\frac{1}{m}\sum_{i=1}^m (y^t_i-z^t_i) x^t_i\right)^2.</math>

Now, by Property Pumpernickle of the PAV algorithm, we have that <math>\sum_{i=1}^m (y^t_i-z^t_i) x^t_i \cdot w^t=0</math>. Hence it remains to show that,

<math>\left(\frac{1}{m}\sum_{i=1}^m (y^t_i-z^t_i) x^t_i\right)^2\leq .</math>


Write <math>\left(\frac{1}{m}\sum_{i=1}^m (y^t_i-z^t_i) x^t_i\right)^2=(a+b)^2</math>, where <math>a=\frac{1}{m}\sum_{i=1}^m (f(x^t_i)-z^t_i) x^t_i</math> and <math>b=\frac{1}{m}\sum_{i=1}^m(y^t_i-f(x^t_i))x^t_i</math>.


Expanding, <math>b^2 = \frac{1}{m^2} \sum_{1\leq i,j \leq m} (y^t_i-f(x^t_i))(y^t_j-f(x^t_j))x^t_i \cdot x^t_j</math>. Taking expectations, the cross terms (<math>i \neq j</math>) vanish by independence, so <math>{\mathbb E}[b^2] = \frac{1}{m^2}\sum_{i=1}^m {\mathbb E}\left[(y^t_i-f(x^t_i))^2\|x^t_i\|^2\right]</math>. Since <math>(y^t_i-f(x^t_i))^2\leq 1</math> and <math>\|x^t_i\|^2\leq 1</math>, we have that <math>{\mathbb E}[b^2] \leq 1/m</math> and, by Jensen's inequality, <math>{\mathbb E}[|b|] \leq 1/\sqrt{m}</math>.
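The bound <math>{\mathbb E}[b^2] \leq 1/m</math> can be checked by exact computation on a tiny example. The points and probabilities below are our own; <math>y_i \in \{0,1\}</math> is Bernoulli with <math>{\mathbb E}[y_i]=f(x_i)=p_i</math>:

```python
# Exact computation of E[b^2] for m = 2 one-dimensional points with |x_i| <= 1.
import itertools

x = [1.0, -0.5]
p = [0.3, 0.8]   # p_i = f(x_i) = E[y_i | x_i]
m = len(x)

Eb2 = 0.0
for ys in itertools.product([0.0, 1.0], repeat=m):   # enumerate all outcomes
    prob = 1.0
    for yi, pi in zip(ys, p):
        prob *= pi if yi == 1.0 else 1.0 - pi
    b = sum((yi - pi) * xi for yi, pi, xi in zip(ys, p, x)) / m
    Eb2 += prob * b * b

# matches (1/m^2) * sum_i p_i (1 - p_i) x_i^2, and is at most 1/m
assert Eb2 <= 1.0 / m
```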

Now let <math>\bar{y}_i</math> be the sequence that would have been output had we actually observed <math>(x^t_i,f(x^t_i))</math> instead of <math>(x^t_i,y^t_i)</math>. That is,

<math>\bar{y}:=\mathrm{PAV}\left((x^t_1 \cdot w^t,f(x^t_1)),\ldots,(x^t_m \cdot w^t,f(x^t_m))\right)</math>. By Lemma


Lemma Cantelope

Lemma Cantelope.

<math>{\mathbb E}\left[w^{t+1}\cdot w^* - w^t \cdot w^*\right] \geq
 \frac{1}{L} \overline{err_t}-\left(1+\frac{1}{L}\right)2\sqrt{\frac{2}{m}}.</math>


Proof. By the update rule,

<math>\left(w^{t+1}-w^t\right)\cdot w^* = \frac{1}{m}\sum_{i=1}^m\left(y^t_i-z^t_i\right) x^t_i \cdot w^*,</math>

which can be written as,

<math>\frac{1}{m}\sum_{i=1}^m\left(f(x^t_i)-z^t_i\right)x^t_i\cdot w^* + \frac{1}{m}\sum_{i=1}^m\left(y^t_i-f(x^t_i)\right) x^t_i \cdot w^*.</math> First, <math>{\mathbb E}\left[\left(y^t_i-f(x^t_i)\right) x^t_i \right]=0</math> because <math>{\mathbb E}\left[y^t_i \mid x^t_i\right]=f(x^t_i)</math> in the regression model. Next,

<math>\left(f(x^t_i)-z^t_i\right) x^t_i\cdot w^*=\left(u^*(w^*\cdot x^t_i)-z^t_i\right) x^t_i\cdot w^*.</math>

Now, as in the proof of Lemma Cucumber, we would like to consider the inverse of <math>u^*</math> on values <math>z^t_i</math>. Note that <math>z^t_i \in [0,1]</math>. There are two problems with this. First, the range of <math>u^*</math> is an interval <math>[a,b]\subseteq [0,1]</math>, but may not include <math>z^t_i</math>. To deal with this, we define a new function, <math>u^{**}:[-1-\frac{1}{L},1+\frac{1}{L}]\rightarrow [0,1]</math>. The properties of <math>u^{**}</math> that we require are:

* <math>u^{**}(t) = u^*(t)</math> for all <math>t \in [-1,1]</math>.
* <math>u^{**}</math> is <math>L</math>-Lipschitz and nondecreasing.
* <math>u^{**}(-1-1/L)=0,u^{**}(1+1/L)=1</math>, i.e., the range of <math>u^{**}</math> is the entire interval <math>[0,1]</math>.

It is not hard to see that, for the domain <math>[-1-1/L,1+1/L]</math> we have chosen, it is always possible to extend <math>u^*</math> (e.g., linearly) to such a function. The second problem is that the inverse of <math>u^{**}</math> (or <math>u^*</math>) is not necessarily unique at a point <math>z \in [0,1]</math>. We consider the following function (though any inverse would do equally well):

<math>g:[0,1]\rightarrow [-1-1/L,1+1/L], g(z) = \inf \{x \in [-1-1/L,1+1/L]~|~u^{**}(x)=z \}.</math>

Since <math>u^{**}</math> is continuous, we have that <math>u^{**}(g(z))=z</math> for all <math>z \in [0,1]</math>.

Now, by Observation Mango,

<math>\left(u^*(w^*\cdot x^t_i)-z^t_i\right) (x^t_i\cdot w^*-g(z^t_i)) \geq \frac{\left(u^*(w^*\cdot x^t_i)-z^t_i\right)^2}{L}.</math>

Putting this together with the above, we have,

<math>{\mathbb E}\left[\left(w^{t+1}-w^t\right)\cdot w^*\right] \geq {\mathbb E}\left[\frac{1}{Lm}\sum_{i=1}^m\left(u^*(w^*\cdot x_i^t)-z_i^t\right)^2\right]+{\mathbb E}\left[\frac{1}{m}\sum_{i=1}^m\left(f(x_i^t)-z_i^t\right)g(z_i^t)\right].</math>

The first term on the right-hand side is <math>\overline{err_t}/L</math>. We bound the second term from below using Lemma Sabra, below. First note that <math>\sum_{i=1}^m \left(y^t_i - z^t_i\right)g(z^t_i)=0</math>: the fitted values <math>z^t_i</math> are constant on each PAV block, hence so is <math>g(z^t_i)</math>, and within each block the residuals <math>y^t_i-z^t_i</math> sum to zero. Hence,

<math>{\mathbb E}\left[\frac{1}{m}\sum_{i=1}^m\left(f(x_i^t)-z_i^t\right)g(z_i^t)\right]={\mathbb E}\left[\frac{1}{m}\sum_{i=1}^m\left(f(x_i^t)-y_i^t\right)g(z_i^t)\right] \geq -\left(2+\frac{2}{L}\right)\sqrt{\frac{2}{m}},</math>

where the inequality is Lemma Sabra applied with <math>X_i=f(x^t_i)-y^t_i</math> and <math>A_i=g(z^t_i) \in [-1-1/L,1+1/L]</math>. Q.E.D.

Lemma Tangerine

This lemma bounds the expected maximum of the partial sums of bounded, independent, mean-zero random variables.

Lemma Tangerine.

Let <math>m\geq 1</math> be any integer and <math>X_1,X_2,\ldots,X_m</math> be independent random variables, with <math>{\mathbb E}[X_i]=0</math>, each in the bounded range <math>[-1,1]</math>. Then,

<math>{\mathbb E}\left[\max_{i \in \{0,1,\ldots,m\}} X_1+X_2+ \ldots+X_i\right] \leq \sqrt{2m}.</math>


Proof. Let <math>Y</math> be the random variable <math>Y = \max \{0,X_1,X_1+X_2,\ldots,X_1+X_2+ \ldots+X_m\}</math>. We claim that <math>\Pr[Y^2\geq t] \leq e^{-t/(2m)}</math> for all <math>t> 0</math>. Since <math>Y\geq 0</math>, this is equivalent to <math>\Pr[Y\geq \sqrt{t}] \leq e^{-t/(2m)}</math>. To see this, fix <math>t</math> and consider the martingale <math>Z_0,Z_1,\ldots,Z_m</math>, where <math>Z_0 = 0</math> and,

<math>Z_{i} = \begin{cases} Z_{i-1} & \mbox{if }Z_{i-1}\geq \sqrt{t}\\ Z_{i-1}+X_{i} & \mbox{otherwise}\end{cases} \mbox{ for }i=1,2,\ldots,m.</math>

Since <math>{\mathbb E}[X_i]=0</math>, the <math>Z_i</math>'s form a martingale with differences bounded by 1. Hence, Azuma's inequality implies that <math>\Pr[Z_m \geq \sqrt{t}] \leq e^{-t/(2m)}</math>. Moreover, if <math>Y \geq \sqrt{t}</math>, then the walk reaches <math>\sqrt{t}</math> at some step and is frozen from then on, so <math>Z_m \geq \sqrt{t}</math>. Hence, we have shown that <math>\Pr[Y^2\geq t] \leq e^{-t/(2m)}</math> for all <math>t> 0</math>. Now, for any nonnegative random variable <math>R</math>, we have <math>{\mathbb E}[R]=\int_0^\infty {\mathbb P}[R \geq t]dt</math>. Hence,

<math>{\mathbb E}[Y^2]=\int_0^\infty \Pr[Y^2\geq t] dt \leq \int_0^\infty e^{-t/(2m)} dt=2m.</math>

Finally, <math>{\mathbb E}[Y]^2 \leq {\mathbb E}\left[Y^2\right]</math> for any random variable (non-negativity of variance), so <math>{\mathbb E}[Y] \leq \sqrt{2m}</math>, which proves the lemma. Q.E.D.
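The bound can be checked exactly for small <math>m</math> by enumerating all <math>\pm 1</math> sign sequences. This is our own check of a Rademacher special case of the lemma, not from the text:

```python
# Exact E[max prefix sum] for m i.i.d. uniform +/-1 variables, by enumeration.
import itertools
import math

m = 8
total = 0.0
for signs in itertools.product((-1, 1), repeat=m):
    s = 0
    best = 0  # max over i in {0, 1, ..., m} of X_1 + ... + X_i (empty sum = 0)
    for xv in signs:
        s += xv
        best = max(best, s)
    total += best
expected_max = total / 2 ** m  # exact expectation: uniform over 2^m sequences
assert 0 < expected_max <= math.sqrt(2 * m)
```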

Lemma Sabra

Lemma Sabra. For any integer <math>m \geq 1</math> and reals <math>a<b</math>

let <math>X_1,X_2,\ldots,X_m</math> be independent random variables, with <math>{\mathbb E}[X_i]=0</math>, each in the bounded range <math>[-1,1]</math>. Let <math>A_1,A_2,\ldots,A_m</math> be a sequence of random variables such that

<math>a \leq A_1 \leq A_2 \leq \ldots\leq A_m \leq b.</math>

Note that the <math>A_i</math> sequence need not be independent, and may depend on the <math>X_i</math>'s as well. Then,

<math>(a-b)\sqrt{\frac{2}{m}}\leq{\mathbb E}\left[\frac{A_1X_1+\ldots + A_mX_m}{m}\right]\leq(b-a)\sqrt{\frac{2}{m}}.</math>

Proof. First consider the random variables <math>A_i'=\frac{A_i-a}{b-a}\in [0,1]</math>. Then, by linearity of expectation,

<math>{\mathbb E}\left[\frac{A_1X_1+\ldots + A_mX_m}{m}\right]=(b-a){\mathbb E}\left[\frac{A_1'X_1+\ldots + A_m'X_m}{m}\right]+a{\mathbb E}\left[\frac{X_1+\ldots + X_m}{m}\right].</math>

By the above and the fact that <math>{\mathbb E}\left[\frac{X_1+\ldots + X_m}{m}\right]=0</math>, it suffices to prove the lemma for <math>a=0,b=1</math>, so that is the case we consider. Moreover, it suffices to prove the upper bound; the lower bound follows by symmetry, replacing <math>X_i</math> by <math>-X_i</math>. We next claim that <math>\sum_{i=1}^m A_i X_i \leq \max_{1 \leq j \leq m+1} \sum_{i=j}^m X_i</math>, where the empty sum (<math>j=m+1</math>) is 0. To see this, note that:

<math>A_1 X_1 + \ldots + A_m X_m = \Delta_1 (X_1 +X_2+\ldots+ X_m) + \Delta_2 (X_2+X_3+\ldots+X_m) + \ldots + \Delta_m X_m + \Delta_{m+1} \cdot 0,</math> where <math>\Delta_1=A_1,\Delta_2=A_2-A_1,\ldots,\Delta_m=A_m-A_{m-1},\Delta_{m+1}=1-A_m</math>. Since <math>\Delta_i\geq0</math> and <math>\sum_{i=1}^{m+1}\Delta_i=1</math>, we have that <math>\sum_1^m A_i X_i</math> is a convex combination of the suffix sums <math>\sum_{i=j}^m X_i</math>, over <math>j=1,\ldots,m+1</math>. Hence, it is no larger than their maximum. Applying Lemma Tangerine to the reversed sequence <math>X_m,X_{m-1},\ldots,X_1</math> gives <math>{\mathbb E}\left[\max_{1 \leq j \leq m+1} \sum_{i=j}^m X_i\right]\leq \sqrt{2m}</math>; dividing by <math>m</math> gives the lemma. Q.E.D.
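The telescoping decomposition in the proof can be verified numerically; the example values below are our own:

```python
# Check of the decomposition in Lemma Sabra's proof: write sum_i A_i X_i as a
# convex combination (weights Delta_j) of suffix sums X_j + ... + X_m.
# A is nondecreasing in [0, 1]; X has entries in [-1, 1].
A = [0.2, 0.5, 0.9]
X = [1.0, -1.0, 0.5]
m = len(A)

lhs = sum(a * x for a, x in zip(A, X))

# Delta_1 = A_1, Delta_i = A_i - A_{i-1}, Delta_{m+1} = 1 - A_m: >= 0, sum to 1
deltas = [A[0]] + [A[i] - A[i - 1] for i in range(1, m)] + [1.0 - A[-1]]
# suffix sums X_j + ... + X_m for j = 1, ..., m+1 (the last one is empty, = 0)
suffix = [sum(X[j:]) for j in range(m)] + [0.0]

rhs = sum(d * s for d, s in zip(deltas, suffix))
assert abs(lhs - rhs) < 1e-12      # the telescoping identity holds
assert lhs <= max(suffix) + 1e-12  # a convex combination is at most the max
```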