An Identity For Kernel Ridge Regression

This paper describes an approach to prove theoretical guarantees on online kernelized Ridge Regression in the form of equality. It approaches the results described in Zhdanov and Vovk, 2009, from the probabilistic perspective.

The method of kernel Ridge Regression with a kernel $\mathcal{K}$ and a real regularisation parameter $a>0$ suggests the function $f_\mathrm{RR}(x)=Y'(K+aI)^{-1}k(x)$, where $Y=(y_1,y_2,\ldots,y_T)'$ is the column vector of outcomes, $$ K= \begin{pmatrix} \mathcal{K}(x_1,x_1) & \mathcal{K}(x_1,x_2) & \ldots & \mathcal{K}(x_1,x_T)\\ \mathcal{K}(x_2,x_1) & \mathcal{K}(x_2,x_2) & \ldots & \mathcal{K}(x_2,x_T)\\ \vdots & \vdots & \ddots & \ldots \\ \mathcal{K}(x_T,x_1) & \mathcal{K}(x_T,x_2) & \ldots & \mathcal{K}(x_T,x_T) \end{pmatrix}$$ is the kernel matrix and

$\displaystyle{ k(x)= \begin{pmatrix} \mathcal{K}(x_1,x)\\ \mathcal{K}(x_2,x)\\ \vdots \\ \mathcal{K}(x_T,x) \end{pmatrix}\enspace.}$

Theorem. Take a kernel $\mathcal{K}$ on a domain $X$ and a parameter $a>0$. Let $\mathcal{F}$ be the RKHS for the kernel $\mathcal{K}$. For a sample $(x_1,y_1),(x_2,y_2),\ldots,(x_T,y_T)$ let $\gamma_1^\mathrm{RR},\gamma_2^\mathrm{RR},\ldots,\gamma_T^\mathrm{RR}$ be the predictions output by ridge regression with the kernel $\mathcal{K}$ and the parameter $a$ in the on-line mode. Then

$\displaystyle{ \sum_{t=1}^T\frac{(\gamma^\mathrm{RR}_t-y_t)^2}{1+d_t/a}=\min_{f\in\mathcal{F}}\left( \sum_{t=1}^T(f(x_t)-y_t)^2+a\|f\|^2_\mathcal{F}\right) =aY'(K_{T+1}+aI)^{-1}Y\enspace,}$

where $d_t=\mathcal{K}(x_t,x_t)-k'_t(x_t)(K_t+aI)^{-1}k_t(x_t)>0$.

The idea of the proof is very elegant: the density of the joint distribution of the variables $(y_{x_1},y_{x_2},\ldots,y_{x_T})$ at the point $(y_1,y_2,\ldots,y_T)$ is calculated in three different ways: by decomposing the density into a chain of conditional densities, marginalisation, and, finally, direct calculation. Each method will give us a different expression corresponding to a term in the identity. Since all the three terms express the same density, they must be equal.

The following corollary is also proven on the cumulative square loss of the clipped kernel Ridge Regression.

Corollary. Take a kernel $\mathcal{K}$ on a domain $X$ and a parameter $a>0$. Let $\mathcal{F}$ be the RKHS for the kernel $\mathcal{K}$. For a sample $(x_1,y_1),(x_2,y_2),\ldots,(x_T,y_T)$ such that $y_t\in [-Y,Y]$ for all $t=1,2,\ldots,T$, let $\gamma_1^{\mathrm{RR},Y},\gamma_2^{\mathrm{RR},Y},\ldots,\gamma_T^{\mathrm{RR},Y}$ be the predictions output by clipped ridge regression with the kernel $\mathcal{K}$ and the parameter $a$ in the on-line mode. Then

$\displaystyle{\sum_{t=1}^T(\gamma^{\mathrm{RR},Y}_t-y_t)^2\le \min_{f\in\mathcal{F}}\left( \sum_{t=1}^T(f(x_t)-y_t)^2+a\|f\|^2_\mathcal{F}\right)+4Y^2\ln\det\left(I+\frac1a K_{T+1}\right)\enspace,}$

where $K_{T+1}$ is as above.

Fedor Zhdanov and Vladimir Vovk. Competing with Gaussian linear experts. Technical report, arXiv:0910.4683 [cs.LG], arXiv.org e-Print archive, 2009.
Fedor Zhdanov and Yuri Kalnishkan. An identity for kernel ridge regression. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, 2010.