Kernel Methods

Kernel methods are a powerful tool of modern learning. This article provides an introduction to kernel methods through a motivating example of kernel ridge regression, defines reproducing kernel Hilbert spaces (RKHS), and then sketches a proof of the fundamental existence theorem.

Some results that appear to be important in the context of learning are also discussed.

See also:

Sylvester identity

Upper Bound on a Determinant

1.  A Motivating Example: Kernel Ridge Regression

In this section we will introduce kernels in the context of ridge regression. The reader may skip this section and proceed straight to the next session if he is only interested in the formal theory of RKHSs.

A more detailed discussion of Ridge Regression and kernels can be found in Section 3 of Steve Busuttil's dissertation.

1.1  The Problem

Suppose we are given a set of $T$ examples $(x_1,y_1), (x_2,y_2), \ldots, (x_T,y_T)$, where $x_i\in\mathbb R^n$ are signals and $y_i\in\mathbb R$ are outcomes or labels. We want to find a dependency between signals and outcomes and to be able to predict $y$ given a new $x$. This problem is often referred to as the regression problem 1.

Let us start by restricting ourselves to linear dependencies of the form $y=\langle w, x\rangle = w'x$, where $w\in\mathbb R^n$, $\langle\cdot,\cdot\rangle$ is the standard scalar product in $\mathbb R^n$, and the prime stands for transposition (by default all vectors are assumed to be column vectors). The class of linear functions is not too reach, and we will need to progress to more sophisticated classes later.

1.2  Least Squares and Ridge Regression

The least squares is a natural, popular, and time-honoured (apparently going back to Legendre and Gauss) approach to finding $w$. Let us take an $w$ minimising the sum of squared discrepancies

$$ L_{\rm SQ}(w)=\sum_{i=1}^T\left(w'x_i-y_i\right)^2. $$
Minimising this expression is equivalent to projecting of the vector $Y=(y_1,y_2,\ldots,y_T)$ on the subspace of $\mathbb R^T$ generated by $n$ vectors $v_1,v_2,\ldots,v_n$, where $v_i$ consists of $i$-th coordinates of $x$s.

The projection on the subspace is always unique and well-defined. However if the vectors $v_1,v_2,\ldots,v_n$ are linearly dependent (which always happens if the sample is small and $T<n$) this unique projection has multiple representations as their linear combination. The method of least squares thus does not necessarily define a unique regressor.

Consider a more general problem. Let us take $a\ge 0$ and consider the expression

$$ L_{\rm RR}(w)=a\|w\|^2+\sum_{i=1}^T\left(w'x_i-y_i\right)^2. $$
A vector $w$ minimising this is called a solution of the ridge regression problem (for reasons that will become apparent later). The least squares approach is a special case of the ridge regression approach, namely, that of $a=0$.

Why would anyone want to use $a>0$? There are three main reasons. First, ridge regression with $a>0$ always specifies a unique regressor as will be shown below. Secondly, a positive value of $a$ makes the problem easier computationally. These properties will be shown later in this section. Thirdly, the term $a\|w\|^2$ performs the regularisation function. It penalises the growth of coefficients of $w$ and urges us to look for "simpler" solutions.

1.3  Solution: Primary Form

Let us find the solution of the ridge regression problem with $a>0$. It is convenient to introduce a $T\times n$ matrix $X=(x_1,x_2,\ldots,x_t)'$; the rows of $X$ are vectors $x_i$ (and the columns are vectors $v_i$ mentioned above). We get

$$L_{\rm RR}(w)=a\|w\|^2+\|Xw-Y\|^2.$$
By differentiating $L_{\rm RR}(w)$ w.r.t. $w$ and equating the result to 0 we get
$$ 2aw-2X'Y+X'Xw=0 $$
$$ w=(aI+X'X)^{-1}X'Y, $$
where $I$ is the identity matrix. This must be a solution; indeed, as coefficients of $w$ approach infinity, $a\|w\|^2$ and therefore $L_{\rm RR}(w)$ must go to infinity.

Let us analyse this expression. The matrices $X'X$, $I$, and $aI+X'X$ have the size $n\times n$. The matrix $X'X$ is positive semi-definite, i.e., $\xi'(X'X)\xi\ge 0$ for all $\xi\in\mathbb R^n$ (this follows from $\xi'(X'X)\xi= (X\xi)'(X\xi)=\|X\xi\|^2$). By adding $aI$ we make the matrix positive definite, i.e., we have $\xi'(aI+X'X)\xi>0$ for all $\xi\ne 0$ (indeed, $\xi'(aI+X'X)\xi=\|X\xi\|^2+a\|\xi\|^2$). Because every positive definite matrix is non-singular 2, $aI+X'X$ must have the inverse. If $a>0$, a solution to the ridge regression problem always exists and it is unique.

If $a=0$ and $\mathop{{\mathrm{rank}}}\nolimits X<n$, the matrix becomes singular 3. The geometrical interpretation of this situation was discussed above.

As $a$ approach 0, the matrix $aI+X'X$ may become close to singular. The numerical routines for finding $w$ will then become less and less stable: they will have to deal with very big or very small values and make large round-up errors. Taking a larger $a>0$ thus stabilises the computation as mentioned above.

Let us attempt a rough hypothetical analysis of the predictive performance of ridge regression for different values of $a$. If $a$ is very big, the term $aI$ completely overshadows $X'X$ and the predictive performance deteriorates. If $a$ is very small, we may encounter numerical problems. An optimal value should thus be neither too big no too small. In some sense it must be comparable in size to elements of $X'X$. The exact choice of $a$ depends on the particular dataset.

Finally, let us go back to the term "ridge regression". One of the versions of its etymology is that the diagonal of $aI$ forms a "ridge" added on top of the least squares matrix $X'X$.

1.4  Solution: Dual Form

Using the matrix identity $A(aI+A'A)^{-1}=(aI+AA')^{-1}A$ we can rewrite the ridge regression solution as follows. For an arbitrary $x\in \mathbb R^n$ the outcome suggested by ridge regression is $w'x$ and this can be rewritten as

\begin{align*} w'x &= ((aI+X'X)^{-1}X'Y)'x,\\ &= Y'X(aI+X'X)^{-1}x,\\ &= Y'(aI+XX')^{-1}Xx. \end{align*}
This formula is called the dual form of the ridge regression solution.

Similar arguments concerning non-singularity apply to $aI+XX'$. The matrix has the size $T\times T$. This might seem a disadvantage compared to the primary form: it is natural to expect that in practice $n$ would be fixed and not too big, while the size of the sample $T$ may be quite large. However this formula allows us to develop important generalisations.

We can say that $w=Y'(aI+XX')^{-1}X$ in the dual form. However there is a more interesting way to interpret the dual form formula. We have

$$ w'x = Y'(aI+K)^{-1}k, $$
where $K$ is the matrix of mutual scalar products
$$ K=\begin{pmatrix} \langle x_1,x_1\rangle &\langle x_1,x_2\rangle &\ldots&\langle x_1,x_T\rangle\\ \langle x_2,x_1\rangle &\langle x_2,x_2\rangle &\ldots&\langle x_2,x_T\rangle\\ \vdots                 &\vdots                 &\ddots&\vdots\\ \langle x_T,x_1\rangle &\langle x_T,x_2\rangle &\ldots&\langle x_T,x_T\rangle \end{pmatrix} $$
and $k=k(x)$ is the vector of scalar products of $x_i$ by $x$:
$$ k=\begin{pmatrix} \langle x_1,x\rangle\\ \langle x_2,x\rangle\\ \ldots\\ \langle x_T,x\rangle \end{pmatrix}. $$
Note that all $x_i$s and $x$ appear in this formula only in mutual scalar product. This observation has important consequences.

1.5  Non-linear Regression

Now let us try to extend the class of functions we use and consider a wider class. Suppose that $n=1$, i.e., all $x$s are numbers and we are interested in approximations by polynomials of degree $3$, i.e., functions of the form $w_0+w_1x+w_2x^2+w_3x^3$. Of course we can write down $L(w)$ for this case, perform the differentiation and find the solution as we did before. However there is a simpler argument based on the dual form.

Let us map $x$ into $\mathbb R^4$ as follows: $x\to (1,x,x^2,x^3)$. Once we have done this, we can do linear regression on new "long" signals. If we use the dual form, we do not even have to perform the transformations explicitly. Because we only need scalar products, we can compute all the necessary products $1+x_1x_2+x_1^2x_2^2+x_1^3x_2^3$ and substitute them into the dual form formula.

Let us write down a formal generalisation. The signals $x$ do not have to come from $\mathbb R^n$ any longer. Let them be drawn from some arbitrary set 4 $X$. Suppose that we have a mapping $\Phi: X\to {\cal S}$, where ${\cal S}$ is some vector space equipped with a scalar product $\langle\cdot,\cdot\rangle$ (dot-product space); the space ${\cal S}$ is sometimes referred to as the feature space. We can use ridge regression in the feature space. The prediction of ridge regression on a signal $x$ can be written as

$$ \gamma_{\rm RR} = Y'(aI+K)^{-1}k, $$
$$ K=\begin{pmatrix} {\cal K}(x_1,x_1) & {\cal K}(x_1,x_2) &\ldots& {\cal K}(x_1,x_T)\\ {\cal K}(x_2,x_1) & {\cal K}(x_2,x_2) &\ldots& {\cal K}(x_2,x_T)\\ \vdots                 &\vdots                 &\ddots&\vdots\\ {\cal K}(x_T,x_1) & {\cal K}(x_T,x_2) &\ldots& {\cal K}(x_T,x_T) \end{pmatrix} $$
$$ k=\begin{pmatrix} {\cal K}(x_1,x)\\ {\cal K}(x_2,x)\\ \ldots\\ {\cal K}(x_T,x) \end{pmatrix}; $$
the function $ {\cal K}: X^2\to \mathbb R$ is given by $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$.

The space ${\cal S}$ does not have to be finite-dimensional. However since every vector space with a scalar product can be embedded into a Hilbert space (see below for a definition) we can assume that it is Hilbert.

The transformation $\Phi$ is of no particular importance to us. Once we know the kernel $ {\cal K}$, we can perform regression with it. A justification for the ridge regression in the kernel case will be obtained below in Subsection 5.3.

1.6  Mappings and Kernels

It would be nice to have a characterisation of all $ {\cal K}$ without a reference to $\Phi$. A characterisation of this kind can be given.

It is easy to see that $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$ has the following properties:

  • it is symmetric: $ {\cal K}(x_1,x_2)= {\cal K}(x_2,x_1)$ for all $x_1,x_2\in X$; this follows from the symmetry of the scalar product $\langle\cdot,\cdot\rangle$;
  • it is positive semi-definite: for every positive integer $T$ and every $x_1,x_2,\ldots,x_T\in X$ the matrix
    $$ K=\begin{pmatrix} {\cal K}(x_1,x_1) & {\cal K}(x_1,x_2) &\ldots& {\cal K}(x_1,x_T)\\ {\cal K}(x_2,x_1) & {\cal K}(x_2,x_2) &\ldots& {\cal K}(x_2,x_T)\\ \vdots                 &\vdots                 &\ddots&\vdots\\ {\cal K}(x_T,x_1) & {\cal K}(x_T,x_2) &\ldots& {\cal K}(x_T,x_T) \end{pmatrix} $$
    is positive semi-definite 5; indeed, $K$ is the Gram matrix of the images $\Phi(x_1),\Phi(x_2),\ldots,\Phi(x_T)$ 6.

Surprisingly these two simple properties are sufficient. Let us call a function $ {\cal K}: X^2\to\mathbb R$ satisfying these two properties a kernel. Then the following theorem can be formulated.

Theorem 1. For any set $X$ a function $ {\cal K}: X^2\to\mathbb R$ is a kernel, i.e., it is symmetric and positive semi-definite, if and only if there is a mapping $\Phi$ from $X$ into a Hilbert space $ {\cal H}$ with a scalar product $\langle\cdot,\cdot\rangle$ such that $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$ for all $x_1,x_2\in X$.

We proved the "if" part when we defined kernels. The "only if" part follows from the results of the next sections, where we will show that the class of kernels coincides with the class of so called reproducing kernels.

2.  Reproducing Kernel Hilbert Spaces

In this section we introduce reproducing kernel Hilbert spaces (RKHS) and show some of their basic properties. The presentation is based mainly on [Aronszajn, 1943] and [Aronszajn, 1950] and the reader may consult these papers for more details; note that the former paper is in French.

2.1  Hilbert Space

A set of some elements $H$ is a Hilbert space if

  1. $H$ is a vector space over $\mathbb R$ (Hilbert spaces over the complex plain $ {\mathbb C}$ can also be considered, but we shall restrict ourselves to $\mathbb R$ in this article);
  2. $H$ is equipped with a scalar product $\langle\cdot,\cdot\rangle$ (i.e., with a symmetric positive definite bilinear form);
  3. $H$ is complete w.r.t. the metric generated by the scalar product, i.e., every fundamental sequence of elements of $H$ converges.

Some authors require a Hilbert space to be separable, t.e., to have a countable dense subset. For example, [Aronszajn, 1943] reserves the name "Hilbert". We shall not impose this requirement by default.

As a matter of fact all separable Hilbert spaces are isomorphic (the situation is similar to that with finite-dimensional spaces; the separable Hilbert space is "countable-dimensional").

A typical (though not particularly relevant to this article) example of a Hilbert space is provided by $L_2(X,\mu)$, which is the space of all real-valued functions

$f$ on $X$ such that $f^2$ is Lebesgue-integrable w.r.t. the measure $\mu$ on $X$ with the scalar product $\langle f,g\rangle=\int_X fg d\mu$. Another example is given by $l_2$, which is the set of infinite sequences $(x_1,x_2,\ldots)$, $x_i\in\mathbb R$, such that the sum $\sum_{i=1}^{+\infty}x_i^2$ converges. Both $l_2$ and $L_2$ on $\left[0,1\right]$ with the standard Lebesgue measure are separable; therefore they are isomorphic.

2.2  Reproducing Kernel Hilbert Spaces: a Definition

Let $ {\cal F}$ be a Hilbert space consisting of functions on a set $X$. A function $ {\cal K}(x_1,x_2)$ is a reproducing kernel (r.k.) for $ {\cal F}$ if

  • for every $x\in X$ the function $ {\cal K}(x,\cdot)$ (i.e., $ {\cal K}(x,x_2)$ as the function of the second argument with $x$ fixed) belongs to $ {\cal F}$
  • the reproducing property holds: for every $f\in {\cal F}$ and every $x\in X$ we have $f(x)=\langle f, {\cal K}(x,\cdot)\rangle$. A space $ {\cal F}$ admitting a reproducing kernel is called a reproducing kernel Hilbert space (RKHS).

2.3  Reproducing Kernel Hilbert Spaces: Some Properties

Let us formulate and prove some basic properties of reproducing kernels.

Theorem 2.
  1. If a r.k. for $ {\cal F}$ exists, it is unique.
  2. If $ {\cal K}$ is a reproducing kernel for $ {\cal F}$, then for all $x\in X$ and $f\in {\cal F}$ we have $|f(x)|\le \sqrt{ {\cal K}(x,x)}\|f\|_ {\cal F}$.
  3. If $ {\cal F}$ is a RKHS, then convergence in $ {\cal F}$ implies pointwise convergence of corresponding functions.

Proof: In order to prove (1) suppose that there are two r.k. $ {\cal K}_1$ and $ {\cal K}_2$ for the same space $ {\cal F}$. For every $x\in X$ the function $ {\cal K}_1(x,\cdot)- {\cal K}_2(x,\cdot)$ belongs to $ {\cal F}$ and, applying linearity and the reproducing property, we get

\begin{align*} \| {\cal K}_1(x,\cdot)- {\cal K}_2(x,\cdot)\|^2_ {\cal F} &= \langle  {\cal K}_1(x,\cdot)- {\cal K}_2(x,\cdot), {\cal K}_1(x,\cdot)- {\cal K}_2(x,\cdot)\rangle\\ &=\langle  {\cal K}_1(x,\cdot)- {\cal K}_2(x,\cdot), {\cal K}_1(x,\cdot)\rangle-\\ &{\:\:\:\:\:\:\:\:\:\:\:\:}\langle  {\cal K}_1(x,\cdot)- {\cal K}_2(x,\cdot), {\cal K}_2(x,\cdot)\rangle\\ &=( {\cal K}_1(x,x)- {\cal K}_2(x,x))-( {\cal K}_1(x,x)- {\cal K}_2(x,x))\\ &=0. \end{align*}
The definition of a Hilbert space implies that $ {\cal K}_1(x,\cdot)$ coincides with $ {\cal K}_2(x,\cdot)$ and therefore they are equal everywhere as functions.

Property (2) follows immediately from the reproducing property and the Cauchy(-Schwarz-Bunyakovsky) inequality.

Property (3) follows from (2). Indeed, for all $f_1,f_2\in {\cal F}$ and $x\in X$ we have

$$ |f_1(x)-f_2(x)| \le \sqrt{ {\cal K}(x,x)}\|f_1-f_2\|_ {\cal F}. $$

We shall now give an important "internal" characterisation of reproducing kernel Hilbert spaces.

Let $ {\cal F}$ consisting of real-valued functions on $X$ be a Hilbert space. Take $x\in X$ and consider the functional $ {\cal F}\to\mathbb R$ mapping $f\in {\cal F}$ into $f(x)$. It is linear (in $f$) and is called the evaluation functional.

Note that the evaluation functional is not defined on $L_2$: the elements of $L_2$ are in fact equivalence classes of functions that coincide everywhere up to a set of measure 0, and thus they are not really defined at every point 7.

Theorem 3. A Hilbert space $ {\cal F}$ consisting of real-valued functions on $X$ is a RKHS if and only if for every $x\in X$ the corresponding evaluation functional is continuous.

Proof: The "only if" part follows from (2) from the previous theorem.

In order to prove the "if" part we need the Riess-Fischer Representation Theorem, which states that every continuous linear functional on a Hilbert space can be represented as the scalar product by some element of the space.

Take $x\in X$. Because the evaluation functional is continuous, there is a unique $k_x\in {\cal F}$ such that $f(x)=\langle f,k_x\rangle$. We can define a mapping $F:X\to {\cal F}$ by $F(x)=k_x$. Let $ {\cal K}(x_1,x_2)=\langle F(x_1),F(x_2)\rangle$.

We have

\begin{align*} {\cal K}(x_1,x_2)&=\langle k_{x_1}, k_{x_2}\rangle\\ &= k_{x_1}(x_2) \end{align*}
and thus $ {\cal K}(x_1,\cdot)=k_{x_1}(\cdot)\in {\cal F}$. On the other hand for every $f\in {\cal F}$ and $x\in X$ we have
\begin{align*} f(x)&=\langle f,k_x\rangle\\ &=\langle f, {\cal K}(x,\cdot)\rangle. \end{align*}
Therefore $ {\cal K}$ is a r.k. for $ {\cal F}$. $\Box$

This criterion is quite important. The continuity of the evaluation functional means that it is consistent with the norm: functions $f_1$ and $f_2$ that are close with respect to the norm evaluate to values $f_1(x)$ and $f_2(x)$ that are close. If we consider functions from some space as hypotheses in machine learning and the norm on the space as a measure of complexity, it is natural to require the continuity of the evaluation functional. The theorem shows that all "natural" Hilbert spaces of functions are in fact reproducing kernel Hilbert spaces.

2.4  Existence Theorem

We have shown that a r.k. $ {\cal K}(x_1,x_2)$ can be represented as $\langle F(x_1),F(x_2)\rangle$. This implies that $ {\cal K}$ is

  • symmetric due to the symmetry of the scalar product;
  • positive semi-definite, i.e., for all $x_1,x_2,\ldots,x_T\in X$ the matrix
    $$ K=\begin{pmatrix} {\cal K}(x_1,x_1) & {\cal K}(x_1,x_2) &\ldots& {\cal K}(x_1,x_T)\\ {\cal K}(x_2,x_1) & {\cal K}(x_2,x_2) &\ldots& {\cal K}(x_2,x_T)\\ \vdots                 &\vdots                 &\ddots&\vdots\\ {\cal K}(x_T,x_1) & {\cal K}(x_T,x_2) &\ldots& {\cal K}(x_T,x_T) \end{pmatrix} $$
    is positive semi-definite; this holds since $K$ is the Gram matrix. Thus $ {\cal K}$ is a kernel according to the definition from the previous section. The following theorem shows that the classes of kernels and reproducing kernels coincide.
Theorem 4. Let $ {\cal K}: X^2\to\mathbb R$ is a real-valued function of two arguments on $X$. Then $ {\cal K}$ is a reproducing kernel for some Hilbert space of functions $ {\cal F}$ on $X$ if and only if
  • $ {\cal K}$ is symmetric
  • $ {\cal K}$ is positive semi-definite. If there is a space admitting $ {\cal K}$ as its reproducing kernel, it is unique.

3.  Proof of the Existence Theorem

In this section we will prove the existence theorem. Let $ {\cal K}:X^2\to\mathbb R$ be a kernel.

3.1  Linear Combinations: A Dot Product Space

We start the proof by constructing a linear space of functions $ {\cal F}_1$ consisting of linear combinations $\sum_{i=1}^na_i {\cal K}(x_i,\cdot)$, where $n$ is a positive integer, $a_i\in\mathbb R$ and $x_i\in X$. The linearity follows by construction.

The scalar product is defined after the following fashion. Let

$$  \left\langle\sum_{i=1}^na_i {\cal K}(x_i,\cdot), \sum_{i=1}^nb_i {\cal K}(x_i,\cdot)\right\rangle = \sum_{i,j=1}^na_ib_j {\cal K}(x_i,x_j) $$
(by adding terms with zero coefficients we can ensure that the linear combinations have equal numbers of terms and that all $x_i$ in the combinations are the same). We need to prove that the scalar product is well-defined, i.e., to show that it is independent of particular representations of factors (recall that we are constructing a space of functions rather than formal linear combinations).

Let $f(x)=\sum_{i=1}^na_i {\cal K}(x_i,x)$ and $g(x)=\sum_{i=1}^nb_i {\cal K}(x_i,x)$. We have

\begin{align*} \langle f,g\rangle &=\sum_{i,j=1}^na_ib_j {\cal K}(x_i,x_j)\\ &= \sum_{i=1}^n a_i\left(\sum_{j=1}^n b_j {\cal K}(x_i,x_j)\right)\\ &= \sum_{i=1}^n a_i g(x_j). \end{align*}
We see that the scalar product can be expressed in terms of values of $g$ and thus is independent of a particular representation of $g$ as a linear combination. A similar argument can be applied to $f$. The independence follows.

The function $\langle\cdot,\cdot\rangle$ is symmetric because $ {\cal K}$ is symmetric. For $f$ from above we have

$$ \langle f,f\rangle = \sum_{i,j=1}^na_ia_j {\cal K}(x_i,x_j)\ge 0 $$
because $ {\cal K}$ is positive semi-definite. Therefore $\langle\cdot,\cdot\rangle$ is positive semi-definite. We have shown that it is a positive semi-definite symmetric bilinear form. One final step is necessary to prove that it is positive definite and therefore a scalar product.

Let us evaluate $\langle f(\cdot),  {\cal K}(x,\cdot)\rangle$, where $f\in {\cal F}_1$ and $x$ is some element from $X$. We get

$$ \langle f(\cdot),  {\cal K}(x,\cdot)\rangle = f(x). $$
The form $\langle\cdot,\cdot\rangle$ and $ {\cal K}$ thus satisfy the reproducing property.

Because the form $\langle\cdot,\cdot\rangle$ is positive semi-definite, the Cauchy(-Schwarz-Bunyakovsky) inequality holds for it and

$$ \langle f, g\rangle \le \|f\|\cdot\|g\|, $$
where $\|f\|$ is defined as $\sqrt{\langle f,f\rangle}$. Combining this with the reproducing property yields
\begin{align*} \langle f(\cdot),  {\cal K}(x,\cdot)\rangle &\le \|f\|\cdot\| {\cal K}(x,\cdot)\|\\ &= \|f\|\sqrt{ {\cal K}(x,x)}. \end{align*}
Therefore $\|f\|=0$ implies that $f(x)=0$ for an arbitrary $x\in X$. We have thus shown that $\langle\cdot,\cdot\rangle$ is actually positive definite and therefore a scalar product.

The construction is not finished yet because $ {\cal F}_1$ is not necessarily complete. It remains to construct a completion of $ {\cal F}_1$. It is well known that every linear space with a scalar product has a completion, which is a Hilbert space. However this argument cannot be applied here 8: we need a completion of a specific form, namely, consisting of functions $X\to\mathbb R$. Note that, however, we have already proved Theorem 1 from Section 1: we can map $X$ into some Hilbert space $H$ so that the value of the kernel is given by the scalar product of images. The mapping $\Phi:X\to {\cal F}_1$ is given by the obvious $\Phi(x)= {\cal K}(x,\cdot)$.

3.2  Completion

In this subsection we will construct a completion of $ {\cal F}_1$.

Let $f_1,f_2,\ldots\in {\cal F}_1$ be a fundamental sequence. For every $x\in X$ the inequalities

\begin{align*} |f_n(x)-f_m(x)|& = |\langle f_n-f_m, {\cal K}(x,\cdot)\rangle|\\ & \le \|f_n-f_m\|\sqrt{ {\cal K}(x,x)}, \end{align*}
which follow from the previous subsection, imply that the sequence of values $f_1(x),f_2(x),\ldots$ is fundamental and therefore has a limit. We can define a function $f:X\to\mathbb R$ by $f(x)=\lim_{n\to\infty}f_n(x)$.

Let $ {\cal F}$ consist of all functions thus obtained. Clearly, $ {\cal F}_1\subseteq {\cal F}$ since each $f$ from $ {\cal F}_1$ is the pointwise limit of the sequence $f,f,\ldots$.

The scalar product on $ {\cal F}$ can be introduced as follows. If $f$ is the pointwise limit of $f_1,f_2,\ldots\in {\cal F}_1$ and $g$ is the pointwise limit of $g_1,g_2,\ldots\in {\cal F}_1$, then $\langle f,g\rangle_ {\cal F}=\lim_{n\to\infty}\langle f_n,g_n\rangle_{ {\cal F}_1}$.

Let us show that this limit exists. For all positive integers $n_1$, $n_2$, $m_1$ and $m_2$ we have

\begin{multline*} |\langle f_{n_1},g_{m_1}\rangle - \langle f_{n_2},g_{m_2}\rangle|\le\\ |\langle f_{n_1},g_{m_1}\rangle - \langle f_{n_1},g_{m_2}\rangle|+ |\langle f_{n_1},g_{m_2}\rangle - \langle f_{n_2},g_{m_2}\rangle| \le\\ \|f_{n_1}\|\cdot\|g_{m_1}-g_{m_2}\|+\|g_{m_2}\|\cdot\|f_{n_1}-f_{n_2}\|. \end{multline*}
Because the norms of elements of a fundamental sequence are uniformly bounded, the difference can be made as close to 0 as necessary for sufficiently large $n_1$, $n_2$, $m_1$ and $m_2$. Thus there is even a double limit $\lim_{n,m\to\infty}\langle f_n,g_m\rangle=s$ in the sense that for all sufficiently big $n$ and $m$ the difference $|\langle f_n,g_m\rangle-s|$ becomes arbitrarily small.

Let us show that the scalar product is independent of a choice of fundamental sequences converging to $f$ and $g$. Consider two pairs of fundamental sequences, $f_1,f_2,\ldots$ and $f'_1,f'_2,\ldots$ converging to $f$ and $g_1,g_2,\ldots$ and $g'_1,g'_2,\ldots$ converging to $g$.

Consider the expression $\langle f_m-f'_m,g_n\rangle$. The sequence consisting of functions $f_n-f'_n$ is clearly fundamental, therefore, as shown above, there must exist a limit $\lim_{n,m\to\infty} \langle f_m-f'_m,g_n\rangle$. Let us evaluate this limit. There are coefficients $b^m_1,b^m_2,\ldots,b^m_p$ and elements $z^m_1,z^m_2,\ldots,z^m_p$ such that $f(\cdot)=\sum_{i=1}^pb^m_i {\cal K}(z^m_i,\cdot)$ ($p=p(m)$ may change as $m$ varies). We have

\begin{align*} \langle f_m-f'_m,g_n\rangle  &= \langle f_m-f'_m,\sum_{i=1}^pb^m_i {\cal K}(z^m_i,\cdot)\rangle\\  &= \sum_{i=1}^mb^m_i(f_m(z_i)-f'_m(z_i)). \end{align*}
Since $f_m$ and $f'_m$ converge pointwise to 0, this expression converges to zero as $m\to\infty$. Thus
$$ \lim_{n,m\to\infty} \langle f_m-f'_m,g_n\rangle=0. $$
$$ \lim_{n,m\to\infty} \langle f'_m,g_n-g'_n\rangle=0. $$
Therefore the difference
$$ \langle f_m,g_n\rangle-\langle f'_m,g'_n\rangle= \langle f_m-f'_m,g_n\rangle+\langle f'_m,g_n-g'_n\rangle $$
converges to 0 as $n,m\to\infty$. Our definition is thus independent of a particular choice of fundamental sequences.

The bilinearity of $\langle\cdot,\cdot\rangle$ on $ {\cal F}$ is easy to check. The number $\|f\|=\langle f,f\rangle$ is non-negative as a limit of non-negative numbers. More precisely, let $f_1,f_2,\ldots\in {\cal F}_1$ be a fundamental sequence converging to $f$ pointwise. Because

\begin{align*} |f(x)| &= \lim_{n\to\infty}|f_n(x)|\\ &\le \lim_{n\to\infty} \sqrt{ {\cal K}(x,x}\|f_n\|\\ &= \sqrt{ {\cal K}(x,x)}\|f\| \end{align*}
the equality $\|f\|$ implies that $f(x)=0$ for all $x\in X$.

We have shown that $ {\cal F}$ is indeed a linear space with a scalar product. Clearly, $ {\cal F}_1\subseteq {\cal F}$ and the scalar product on $ {\cal F}$ extends that on $ {\cal F}_1$.

Let us show that $ {\cal F}$ is complete. First, let $f_1,f_2\ldots$ be a fundamental sequence of elements of $ {\cal F}_1$ converging pointwise to $f$. We have

\begin{align*} \|f-f_n\|&=\sqrt{\langle f-f_n,f-f_n\rangle}\\ &=\sqrt{\lim_{m\to\infty} \langle f_m-f_n,f_m-f_n\rangle}\\ &=\sqrt{\lim_{m\to\infty}\|f_m-f_n\|}. \end{align*}
This converges to 0 as $n\to 0$ and thus $f$ is the limit of $f_n$ in $ {\cal F}$. Secondly, consider a fundamental sequence $f_1,f_2\ldots$ of elements of $ {\cal F}$. For each $n$ there is $g_n\in {\cal F}_1$ such that $\|f_n-g_n\|\le 1\slash{}2^n$. The sequence $g_1,g_2,\ldots$ is fundamental in $ {\cal F}_1$ and therefore has a limit in $ {\cal F}$. It must be the limit of $f_1,f_2\ldots$ too.

It remains to show that the reproducing property holds of $ {\cal F}$. It follows by continuity. Let $f_1,f_2\ldots$ be a fundamental sequence of elements of $ {\cal F}_1$ converging pointwise to $f$. We have

\begin{align*} f(x) &= \lim_{n\to\infty} f_n(x)\\ &= \lim_{n\to\infty} \langle f_n(\cdot), {\cal K}(x,\cdot)\\ &= \langle f(\cdot), {\cal K}(x,\cdot)\rangle  \end{align*}

We have constructed a RKHS for $ {\cal K}$. Note that $ {\cal F}_1$ constructed in the previous subsection is dense in it.

3.3  Uniqueness

Let us show that the RKHS for a particular kernel $ {\cal K}$ is unique. Let $ {\cal F}$ be the RKHS constructed above and $ {\cal H}$ be some other RKHS for the same kernel $ {\cal K}$.

The definition of an RKHS implies that all functions $ {\cal K}(x,\cdot)$ must belong to $ {\cal H}$. The same must be true of their linear combinations $\sum_{i=1}^na_i {\cal K}(x_i,\cdot)$. Thus $ {\cal F}_1\subseteq {\cal H}$ as a set.

Since the reproducing property holds on $ {\cal H}$, on elements of $ {\cal F}_1$ the scalar product $\langle\cdot,\cdot\rangle_ {\cal H}$ must coincide with scalar product we constructed above. Thus $ {\cal F}_1$ is a subspace of $ {\cal H}$.

Because $ {\cal H}$ is complete, all fundamental sequences from $ {\cal F}_1$ should have limits in $ {\cal H}$. In RKHSs convergence implies pointwise convergence and thus all pointwise limits of fundamental sequences from $ {\cal F}_1$ belong to $ {\cal H}$. Thus $ {\cal F}\subseteq {\cal H}$ as a set. Because the scalar product is continuous w.r.t. itself, we have

$$ \lim_{n\to\infty}\langle f_n,g_n\rangle= \langle f,g\rangle $$
for all sequences $f_1,f_2,\ldots$ and $g_1,g_2,\ldots$ such that $f_n\to f$ and $g_n\to g$ in $ {\cal H}$ as $n\to\infty$. Thus the scalar product on $ {\cal F}$ coincides with that on $ {\cal H}$, or, in other terms, $ {\cal F}$ is a closed subspace of $ {\cal H}$.

Let $h\in {\cal H}$. We can represent it as $h=h_ {\cal F}+h^\bot$, where $h_ {\cal F}\in {\cal F}$ and $h^\bot$ is orthogonal to $ {\cal F}$ and therefore to all functions $ {\cal K}(x,\cdot)$, which belong to $ {\cal F}_1\subseteq {\cal F}$. Because the reproducing property holds on $ {\cal H}$, we get

\begin{align*} h(x) &= \langle h, {\cal K}(x,\cdot)\rangle\\ &= \langle h_ {\cal F}, {\cal K}(x,\cdot)\rangle+\langle h^\bot, {\cal K}(x,\cdot)\rangle\\ &= \langle h_ {\cal F}, {\cal K}(x,\cdot)\rangle\\ &= h_ {\cal F}(x). \end{align*}
Thus $h$ coincides with $h_ {\cal F}$ everywhere on $X$ and $ {\cal H}= {\cal F}$.

4.  What RKHSs are and What They are not

In this section we provide a selection of minor results and counterexamples concerning RKHSs. They will help the reader to get a better intuition of an RKHS and perhaps dissolve some misconceptions.

4.1  Continuity and Separability

Suppose that $X$ is a topological space and $ {\cal K}$ is continuous (in both the arguments, i.e., on $X^2$). One can claim that any feature mapping $\Phi\to H$ associated with $ {\cal K}$ is continuous. Indeed,

\begin{align*} \|\Phi(x)-\Phi(y)\|^2 &= \langle \Phi(x)-\Phi(y),\Phi(x)-\Phi(y)\rangle\\ &= \langle\Phi(x),\Phi(x)\rangle+\langle\Phi(y),\Phi(y)\rangle- 2\langle\Phi(x),\Phi(y)\rangle\\ &=  {\cal K}(x,x)+ {\cal K}(y,y)-2 {\cal K}(x,y). \end{align*}
The continuity of $ {\cal K}$ implies that as $y$ approach $x$, the expression $ {\cal K}(x,x)+ {\cal K}(y,y)-2 {\cal K}(x,y)$ approaches 0. We have shown that $\Phi$ is continuous 9.

If moreover $X$ is separable, then the RKHS $ {\cal F}$ is separable. Indeed we have shown in the proof of the existence theorem that finite linear combinations $\sum_{i=1}^na_i {\cal K}(x_i,\cdot)$ are dense in $ {\cal F}$. The coefficients $a_i$ can be taken to be rational. Take a countable dense subset $\{\tilde x_1,\tilde x_2,\ldots\}\subseteq X$. Because the mapping $x\to {\cal K}(x,\cdot)$ is a feature mapping, it is continuous, and we can approximate any $ {\cal K}(x,\cdot)$ with some $ {\cal K}(\tilde x_i,\cdot)$.

In most natural cases signals are strings of reals, i.e., $X\subseteq\mathbb R^n$ and the kernel $ {\cal K}$ is continuous and therefore the RKHS is separable. Note however that inseparable RKHSs exist. Indeed, consider the space $l_2([0,1])$. It consists of functions $f:[0,1]\to\mathbb R$ such that $f(x)=0$ everywhere except, perhaps, for a countable set and

$$ \sum_{x:f(x)\ne 0}f^2(x)<+\infty. $$
The scalar product is defined by
$$ \langle f,g\rangle=\sum_{x:f(x)\ne 0\text{~and~} g(x)\ne 0}f(x)g(x). $$
It is easy to see that this space is Hilbert. The only non-trivial bit is completeness. Let $f_i$ be a fundamental sequence. By taking the union of all supports we can obtain a countable set $S=(s_1,s_2,\ldots)\subseteq [0,1]$ such that all $f_i$ vanish outside $S$. The sequences $(f_i(s_1),f_i(s_2),\ldots)$ belong to the "usual" $l_2$ and we can reduce the question of completeness of $l_2([0,1])$ to completeness of $l_2$.

The space $l_2([0,1])$ is an RKHS. Indeed, consider

$$ {\cal K}(x,y)=I_{\{x\}}y=\begin{cases} 1,       & \text{ if } x=y;\\ 0, & \text{ otherwise } \end{cases} $$
(here $I_A$ is the indicator function of a set $A$). This is clearly a kernel.

On the other hand $l_2([0,1])$ is not separable. Indeed, the functions $I_{x}$, $x\in [0,1]$, form an uncountable orthonormal system.

4.2  Pointwise Convergence in RKHSs

As we know from Theorem 2, convergence in an RKHS implies pointwise convergence of functions. Let us show that the opposite is not true.

A simple counterexample is provided by $l_2$ considered as a set of functions on the set of positive integers $ {\mathbb N}=\{1,2,\ldots\}$. It is an RKHS with

$$ {\cal K}(x,y)=I_{\{x\}}y=\begin{cases} 1,       & \text{ if } x=y;\\ 0, & \text{ otherwise } \end{cases} $$
as above. Consider the sequence of functions
$$ f_n(i)= \begin{cases} 0,       & \text{ if } n^2<i;\\ \frac 1n, & \text{ if } i\le n^2. \end{cases} $$
We have $f_n(i)\le 1\slash{}n\to 0$ as $n\to+\infty$, which implies pointwise and even uniform convergence to $f_0$ equal to 0 everywhere. On the other hand $\|f_n\|=1$.

Requiring $ {\cal K}$ to be continuous and the domain $X$ to be connected will not change the situation; the counterexample can be easily extended to cover this possibility. Indeed, take $X=\left[0,+\infty\right)$. For each positive integer $n$ consider the "tooth" function $t_n:\left[0,+\infty\right)\to\mathbb R$ equal to 0 outside $(n-1\slash{}2,n+1\slash{}2)$, 1 at $n$ and linear on $\left[n-1\slash{}2,n\right]$ and $\left[n,n+1\slash{}2\right]$. Let the space $T$ consist of functions $\sum_{n=1}^{+\infty}a_nf_n(x)$, where $(a_1,a_2,\ldots)\in l_2$. A function $f\in T$ is continuous on $\left[0,+\infty\right)$, its values at integer points form a sequence from $l_2$ and the value at $x\in \left[n-1\slash{}2,n+1\slash{}2\right]$ is given by $f(n)(1-2|x-n|)$, $n=1,2,\ldots$. Let the scalar product be induced by $l_2$: $\langle f,g\rangle=\sum_{n=1}^{+\infty}f(n)g(n)$. This space is isomorphic to $l_2$.

The space $T$ is an RKHS with the following kernel. For a positive integer $n$ we have $ {\cal K}(n,y)=f_n(y)$ and $ {\cal K}(n\pm 1\slash{}2,y)=0$ for all $y\in \left[0,+\infty\right)$. For $x\in [n-1\slash{}2,n+1\slash{}2]$, where $n$ is a positive integer, $ {\cal K}(x,y)=f_n(y)(1-2|x-n|)$. This kernel is continuous on $X^2$. On the other hand pointwise (and uniform) convergence does not imply convergence in the norm.

It is easy to do a further step and show that $X$ can be connected and compact. Note that for any sequence $(a_1,a_2,\ldots)\in l_2$ we have $a_i\to 0$ as $i\to +\infty$ and therefore for any $f\in T$ we have $f(x)\to 0$ as $x\to +\infty$. We can define $f(+\infty)=0$ by continuity. The set $\left[0,+\infty\right]$ can be mapped onto $\left[0,1\right]$ by $x\to x\slash{}(1+x)$.

The following intuition may be helpful. As explained above, an RKHS may be thought of as a Hilbert space of hypotheses and the norm as the hypothesis complexity. The functions uniformly close to zero but having non-zero norms correspond to overcomplicated hypotheses with no real explanatory power.

4.3  RKHSs with the Same Sets of Functions

We know from part 1 of Theorem 2 that for each RKHS there is a unique kernel, i.e., if two kernels, $ {\cal K}_1$ and $ {\cal K}_2$, specify the same RKHS, then they coincide. Here the words "the same RKHS" mean the same set of functions with the same scalar product on them.

What if the scalar product is different? Can two different kernels specify RKHSs with the same sets of functions?

It is easy to construct a trivial example. The kernel $ {\cal K}\slash{}2$ specifies the same set of functions as the kernel $ {\cal K}$. Indeed, recall the construction from the existence theorem. The kernels ${\cal K}\slash{}2$ and $ {\cal K}$ specify the same set of finite linear combinations $\sum_{i=1}^na_1 {\cal K}(x_i,\cdot)$ with equivalent norms. A sequence is fundamental in one space if and only if it is fundamental in another and the completions, consisting of pointwise limits, will coincide.

A slightly less trivial example may be constructed as follows. Let $X=[0,1]$ and $\Phi_1: X\to \mathbb R^2$ is given by $\Phi_1(x)=(1,x)$. If $\mathbb R^2$ is equipped with the standard Euclidean scalar product $\langle u_1,u_2,\rangle=u_1'u_2$, we get a kernel $ {\cal K}_1(x_1,x_2)=\langle \Phi_1(x_1),\Phi_1(x_2)\rangle=1+x_1x_2$. The RKHS corresponding to this kernel consists of functions of the form $\langle\Phi(x),h\rangle=h_1+h_2x$, where $h=(h_1,h_2)'\in \mathbb R^2$.

Let $\Phi_2: X\to \mathbb R^2$ is given by $\Phi_1(x)=(1,1+x)$. It specifies the kernel $ {\cal K}_2(x_1,x_2)=\langle \Phi_2(x_1),\Phi_2(x_2)\rangle=1+(1+x_1)(1+x_2)=2+x_1+x_2+x_1x_2$, which is not a multiple of $ {\cal K}_1$. The RKHS corresponding to $\Phi_2$ consists of functions of the form $\langle\Phi(x),h\rangle=h_1+h_2(1+x)=(h_1+h_2)+h_2x$, where $h=(h_1,h_2)'\in \mathbb R^2$. It is easy to see that the set of functions is the set of polynomials of degree 1, just as in the first case. The kernels specifying the RKHSs with the same sets of functions thus do not have to be multiples of each other.

This argument can be generalised further as follows. Let $\Phi: X\to H$ specify kernel $ {\cal K}(x_1,x_2)=\langle(\Phi_1(x_1),\Phi(x_2)\rangle$ and let $A:H\to H$ be a bounded linear operator in the Hilbert space $H$. Consider the mapping $\Phi_A(x)=A\Phi(x): X\to H$ and the kernel $ {\cal K}_A(x_1,x_2)=\langle(A\Phi_1(x_1),A\Phi(x_2)\rangle$. The RKHS corresponding to this kernel $ {\cal K}_A$ consists of functions of the form $\langle A\Phi(x),h\rangle$. This can be rewritten as $\langle \Phi(x),A^*h\rangle$, where $A^*$ is the operator adjoint to $A$. If $A$ (and therefore $A^*$) is invertible, $A^*h$ ranges over the whole space $H$ and the RKHS for $ {\cal K}_A$ consists of the same functions as that for $ {\cal K}$.

The following observation can also be made.

Theorem 5. The set of kernels on $X$ specifying the RKHS with the same set of functions is closed under multiplication by a positive constant and addition, i.e.,
  1. for every kernel $ {\cal K}$ and $\alpha>0$, the kernel $\alpha {\cal K}$ specifies the RKHS with the same set of functions as the RKHS for $ {\cal K}$;
  2. for every two kernels $ {\cal K}_1$ and $ {\cal K}_2$ on $X$ specifying the RKHSs with the same set of functions, the kernel $ {\cal K}(x_1,x_2)= {\cal K}_1(x_1,x_2)+ {\cal K}_2(x_1,x_2)$ specifies the RKHS with the same set of functions.

Proof: The multiplicative part has been already proved above. Let us prove the additive part (the multiplicative part can be proved in the same fashion).

Let $\Phi_1: X\to H_1$ and $\Phi_2: X\to H_2$ be feature mappings corresponding to $ {\cal K}_1$ and $ {\cal K}_2$. Consider the Hilbert space $H=H_1\times H_2$ with the "componentwise" scalar product $\langle((h_1',h_2'),(h_1'',h_2''))\rangle=\langle h_1',h_1''\rangle_1+\langle h_2',h_2''\rangle_2$ (here $\langle\cdot,\cdot\rangle_1$ and $\langle\cdot,\cdot\rangle_2$ are the scalar products on $H_1$ and $H_2$, respectively). Clearly the mapping $\Phi: X\to H$ given by $\Phi(x)=(\Phi_1(x),\Phi_2(x))$ specifies the kernel $ {\cal K}$. The corresponding RKHS consists of functions of the form $\langle \Phi(x),h_1\rangle_1+\langle \Phi(x),h_2\rangle_2$, where $h_1\in H_1$ and $h_2\in H_2$. Since $h_1$ or $h_2$ can be equal to 0, the RKHS contains the RKHSs for $K_1$ and $K_2$ as sets. Since the two sets of functions are equal, adding them up does not produce anything new. $\Box$

5.  RKHSs and Prediction in Feature Spaces

We have shown that the three definitions of a kernel $ {\cal K}: X^2\to\mathbb R$ are equivalent:

  • a positive semi-definite symmetric function;
  • a reproducing kernel;
  • the scalar product in a feature space, i.e., $\langle\Phi(x_1),\Phi(x_2)\rangle$, where $\Phi$ maps $X$ into a Hilbert space $H$. The RKHS for a particular kernel is unique. Note that uniqueness holds in a very strong sense: it is a unique set of functions with a uniquely defined scalar product; there are no isomorphisms or equivalences involved.

The mapping $\Phi$ in the third definition is by no means unique. Indeed, let $H=l_2$. Consider a right shift $R:l_2\to l_2$ defined by $R(x_1x_2\ldots)=0x_1x_2\ldots$. The composition $R(\Phi)$ will produce the same kernel as $\Phi$.

However there is some degree of uniqueness. Let $S\subseteq H$ be the closure of the linear span of all images $\Phi(x)$, $x\in X$. It is isomorphic to the RKHS.

5.1  RKHS Inside a Feature Space

Theorem 6. For every mapping $\Phi: X\to H$, where $H$ is a Hilbert space, the closure of the linear span of the image of $X$, i.e., $\overline{\mathop{{\mathrm{span}}}\nolimits}(\Phi(X))\subseteq H$, is isomorphic to the RKHS of the kernel $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$. There is a canonical isomorphism mapping $\Phi(x)\in H$ onto $ {\cal K}(x,\cdot)$ from the RKHS.

Proof: Let us denote the closure of the span by $S$ and the RKHS by $ {\cal F}$. Let $ {\cal F}_1\subseteq {\cal F}$ be the set of finite sums of the form $\sum_ia_i {\cal K}(x_i,\cdot)$, where $x_i\in X$ and $a_i$ are some coefficients, as in the construction above.

We start by constructing the isomorphism $L$ of $\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$ and $ {\cal F}_1$. Put $L(\Phi(x))= {\cal K}(x,\cdot)$ and, by linearity, $L(\sum_{i=1}^na_i\Phi(x_i))= \sum_{i=1}^na_i {\cal K}(x_i,\cdot)$. We need to show that $L$ is well-defined. Let $\sum_{i=1}^na_i\Phi(x_i)=\sum_{j=1}^mb_j\Phi(z_i)$ for some coefficients $a_i$ and $b_i$ and elements $x_i,z_i\in X$. Then for every $x\in X$ we have

$$ \left\langle \sum_{i=1}^na_i\Phi(x_i),x\right\rangle = \left\langle \sum_{j=1}^mb_j\Phi(z_i),x\right\rangle, $$
$$ \sum_{i=1}^na_i {\cal K}(x_i,x)=\sum_{j=1}^mb_j {\cal K}(z_j,x) $$
by the definition of $ {\cal K}$. This means that the functions $\sum_{i=1}^na_i {\cal K}(x_i,\cdot)$ and $\sum_{j=1}^mb_j {\cal K}(z_j,\cdot)$ coincide everywhere and thus $L$ is well-defined.

The mapping $L$ preserves the scalar product:

\begin{align*} \left\langle\sum_{i=1}^na_i {\cal K}(x_i,\cdot),  \sum_{j=1}^mb_j {\cal K}(z_i,\cdot)\right\rangle &=\sum_{i,j}a_ib_j {\cal K}(x_i,z_j)\\ &=\left\langle\sum_{i=1}^na_i\Phi(x_i),\sum_{j=1}^mb_j\Phi(z_j)\right\rangle.  \end{align*}

The mapping $L:\mathop{{\mathrm{span}}}\nolimits(\Phi(X))\to {\cal F}_1$ is surjective. Indeed, each sum $\sum_{i=1}^na_i {\cal K}(x_i,\cdot)$ has an inverse image. The mapping is also injective. Assume the converse. Then there is a point $z=\sum_{i=1}^na_i\Phi(x_i)\in \mathop{{\mathrm{span}}}\nolimits(\Phi(X))$ such that $z\ne 0$ but $L(z)=0$ in the RKHS. This is a contradiction because $L$ preserves the scalar product and therefore the norm. Thus $L$ is a bijection.

Let us extend $L$ to the isomorphism of $S$ and $ {\cal F}$. Let $h_1,h_2,\ldots\in\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$ converge to $h\in S$. The sequence $h_1,h_2,\ldots$ is fundamental in $\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$. Since $L$ preserves the scalar product on $\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$, the images $L(h_1),L(h_2),\ldots$ form a fundamental sequence in $ {\cal F}$. It should converge. Put $L(h)=\lim_{i\to\infty}L(h_i)$.

Suppose that there are two sequences $h_1,h_2,\ldots\in\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$ and $g_1,g_2,\ldots\in\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$ converging to $h$. Let us mix the sequences into $u_i$ (e.g., by letting $u_{2i}=h_i$ and $u_{2i-1}=g_i$, $i=1,2,\ldots$). The sequence $u_1,u_2,\ldots$ converges to $h$ and is therefore fundamental. The images of $u_i$ must form a fundamental sequence in $ {\cal F}$ and must have a limit. All its subsequences should converge to the same limit. Thus $\lim_{i\to\infty}L(h_i)=\lim_{i\to\infty}L(g_i)$ and $L$ is well-defined.

The scalar product is preserved by continuity. The surjectivity can be shown as follows. Let $f\in {\cal F}$ and let $f_1,f_2,\ldots\in {\cal F}_1$ converge to $f$. The inverse images of $f_i$ must form a fundamental sequence in $\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$ and must have a limit $h$. It follows from the definition that $L(h)=f$. The injectivity on $\overline{\mathop{{\mathrm{span}}}\nolimits}(\Phi(X))$ follows from the same argument as on $\mathop{{\mathrm{span}}}\nolimits(\Phi(X))$.

The theorem follows. $\Box$

The mapping $L$ can be extended to the mapping of the whole $H$ by letting $L_H(h)=L(\mathop{\mathrm{pr}_S}(h))$, where $\mathop{\mathrm{pr}_S}: H\to S$ is the projection operator. The mapping $L_H$ is no longer injective (unless $H$ coincides with $S$) and no longer preserves the scalar product. However we have $\|L(h)\|=\min_{g\in H: L(g)=L(h)}\|L(g)\|$, where the minimum is attained on the projection $\mathop{\mathrm{pr}_S}(h)$.

5.2  Another Definition of RKHS

The above construction allows us to construct an interpretation of RKHS important for machine learning.

In competitive prediction we prove consistency results of the following type. We do not assume the existence of a "correct" hypothesis but rather show that our method competes well with a class of some other predictors, such as all linear regressors. Therefore identifying and describing such natural classes is an important task.

In Hilbert spaces we have a natural equivalent of linear regressors. Those are linear functionals, or, as the Riess-Fischer Representation Theorem shows, scalar products by an element $h\in H$. After we have mapped $X$ into a Hilbert space $H$, we can consider predictors $f:X\to\mathbb R$ of the form $f(x)=\langle h,\Phi(x)\rangle$.

Theorem 6 immediately implies that the class of such functions coincides with the RKHS. Indeed, there is a unique decomposition $h=h_0+h^\bot$, where $h_0=\mathop{\mathrm{pr}_S}(h)$ is the projection of $h$ on $S=\overline{\mathop{{\mathrm{span}}}\nolimits}(\Phi(X))$ and $h^\bot$ is orthogonal to $S$. We have

\begin{align*} f(x)&=\langle h,\Phi(x)\rangle\\ &=\langle h_0,\Phi(x)\rangle\\ &=\langle L(h_0),L(\Phi(x))\rangle_ {\cal F}\\ &=\langle L(h_0), {\cal K}(x,\cdot)\rangle_ {\cal F}\\ &=g(x), \end{align*}
where $g=L(h_0)\in {\cal F}$ belongs to the RKHS.

We may want to assign the norm $\|h\|$ to the predictor $f(x)=\langle h,\Phi(x)\rangle$; clearly $\|h\|\ge\|h_0\|=\|g\|_ {\cal F}$. The space of predictors thus obtained does not exceed the RKHS and the norms of predictors are equal to or greater than those of the elements of the RKHS. Thus regressors in the feature space have no more power than functions from the RKHS.

We get the following theorem as a bonus.

Theorem 7. Let $\Phi: X\to H$ be a mapping into a Hilbert space $H$. The space of functions $f:X\to\mathbb R$ defined by $f(x)=\langle h,\Phi(x)\rangle$, where $h\in H$, equipped with the norm $\|f\|=\min_{h\in H: f(\cdot)=\langle h,\Phi(\cdot)\rangle}\|h\|$ coincides with the reproducing kernel Hilbert space for the kernel defined by $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$.

5.3  Ridge Regression in Feature Spaces; the Representer Theorem

In this section we revisit the ridge regression problem from Section 1 and present one argument of great importance for competitive prediction.

Suppose that we have a sequence of signals and outcomes as in Section 1. On top of that suppose that we have a mapping $\Phi: X\to H$ from the set of signals $X$ into a Hilbert feature space $H$. Take $h\in H$; as we said before, it can be considered as a regressor yielding the dependency $y=\langle h,\Phi(x)\rangle$. We may ask if there is $h$ minimising the expression

$$  L_{\rm RR}(h)=a\|h\|^2+\sum_{i=1}^T \left(\langle h,\Phi(x_i)\rangle-y_i\right)^2. $$
with $a>0$.

Consider the predictor

$$ \gamma_{\rm RR} = Y'(aI+K)^{-1}k, $$
$$ K=\begin{pmatrix} {\cal K}(x_1,x_1) & {\cal K}(x_1,x_2) &\ldots& {\cal K}(x_1,x_T)\\ {\cal K}(x_2,x_1) & {\cal K}(x_2,x_2) &\ldots& {\cal K}(x_2,x_T)\\ \vdots                 &\vdots                 &\ddots&\vdots\\ {\cal K}(x_T,x_1) & {\cal K}(x_T,x_2) &\ldots& {\cal K}(x_T,x_T) \end{pmatrix} $$
$$ k=\begin{pmatrix} {\cal K}(x_1,x)\\ {\cal K}(x_2,x)\\ \ldots\\ {\cal K}(x_T,x) \end{pmatrix}; $$
It qualifies as a regressor of the above type. Indeed, it is a linear combination of $ {\cal K}(x_i,\cdot)=\langle \Phi(x_i),\Phi(\cdot)\rangle$. Let us show that it minimises $L_{\rm RR}$.

Let $H_0\subset H$ be a subspace of $H$ consisting of all linear combinations of $\Phi(x_i)$ (it is closed because it is finite-dimensional). Take $h\in H$. It can be decomposed into a sum $h=h_0+h^\bot$, where $h^\bot$ is orthogonal to $H$. For all $i=1,2,\ldots,T$ we have $\langle h,\Phi(x_i)\rangle=\langle h_0,\Phi(x_i)\rangle$; we also have $\|h_0\|\le\|h\|$. Therefore $L_{\rm RR}(h_0)\le L_{\rm RR}(h_0)$. When minimising $L_{\rm RR}$ we can restrict ourselves to predictors from $H_0$, i.e., linear combinations of $\Phi(x_i)$!

Because $H_0$ is finite-dimensional, the arguments from Section 1 apply and the ridge regression turns out to provide the minimum of $L_{\rm RR}(h)$ over $H$ (or the minimum of $L_{\rm RR}(f)$ over the corresponding RKHS).

This argument can be generalised to the representer theorem. We shall formulate two variants of it.

Theorem 8. Let $\Phi: X\to H$ be a mapping into a Hilbert space $H$ and $\alpha: H\to\mathbb R$ be given by
\begin{equation*} \alpha(h)=\delta(\langle h,\Phi(x_1)\rangle,\langle h,\Phi(x_2)\rangle,\ldots,\langle h,\Phi(x_T)\rangle,\|h\|), \end{equation*}
where $\delta$ is a function from $\mathbb R^{T+1}$ to $\mathbb R$ nondecreasing in the last argument and $x_1,x_2,\ldots,x_T\in X$ are some fixed elements. Then for every $h\in H$ there is a linear combination $h'=\sum_{t=1}^Ta_t\Phi(x_t)$, where $a_t\in\mathbb R$ are constants, such that $\alpha(h')\le\alpha(h)$. If $\delta$ is strictly increasing in the last argument and $h$ does not itself have the form $\sum_{t=1}^Ta_t\Phi(x_t)$, there is a linear combination $h'$ such that $\alpha(h')<\alpha(h)$.

Proof: The proof is by observing that the projection $h'$ of $h$ on the subspace $\mathop{{\mathrm{span}}}\nolimits(\{\Phi(x_1),\Phi(x_2),\ldots,\Phi(x_T)\})$ has the same scalar products $\langle h',\Phi(x_t)\rangle=\langle h,\Phi(x_t)\rangle$ with elements $\Phi(x_t)$, $t=1,2,\ldots,T$ and a smaller norm $\|h'\|\le\|h\|$. $\Box$

Corollary 1. Let $ {\cal K}:X\times X\to\mathbb R$ be a kernel and $ {\cal F}$ the corresponding RHKS; let $\alpha:  {\cal F}\to\mathbb R$ be given by

\begin{equation*} \alpha(f)=\delta(f(x_1),f(x_2),\ldots,f(x_T),\|f\|_ {\cal F}), \end{equation*}
where $\delta$ is a function from $\mathbb R^{T+1}$ to $\mathbb R$ nondecreasing in the last argument and $x_1,x_2,\ldots,x_T\in X$. Then for every $f\in {\cal F}$ there is a linear combination $\tilde f(\cdot)=\sum_{t=1}^Ta_t {\cal K}(x_t,\cdot)$, where $a_t\in\mathbb R$ are some constants, such that $\alpha(\tilde f)\le\alpha(f)$. If $\delta$ is strictly increasing in the last argument and $f$ does not itself have the form $\sum_{t=1}^Ta_t {\cal K}(x_t,\cdot)$, there is a linear combination $\tilde f$ such that $\alpha(\tilde f)<\alpha(f)$.

6.  Hierarchies of Kernels

6.1  Subspaces of RKHSs and Sums of Kernels

Consider an RKHS $ {\cal F}$ corresponding to a kernel $ {\cal K}$. Let $ {\cal F}'\subseteq {\cal F}$ be a subspace of $ {\cal F}$. Clearly, $ {\cal F}'$ is a RKHS. This can be shown as follows. The evaluation functional is continuous on $ {\cal F}$. Its restriction on $ {\cal F}'$ should remain continuous and therefore $ {\cal F}'$ is a RKHS. This does not contradict the uniqueness theorem. If $ {\cal F}'$ is a proper subspace of $ {\cal F}$, it is an RKHS for a different kernel $ {\cal K}'$.

Suppose that $ {\cal F}$ itself is a subspace of some Hilbert space of functions $ {\cal F}''$. As we discussed above, in applications such as machine learning it does not make much sense to consider spaces where the evaluation functional is not continuous, and therefore $ {\cal F}''$ should be an RKHS with its own kernel too.

One can say that RKHSs form hierarchies: larger spaces have more power than smaller spaces. However each of them has its own kernel. In competitive prediction the natural competitors of a kernel method are the functions from the corresponding RKHS. Other RKHSs require the use of a different kernel.

The rest of this section contains some results clarifying the structure of this hierarchy.

Theorem 9. Let a space $ {\cal F}$ of real-valued functions on $X$ be the RKHS corresponding to a kernel $ {\cal K}: X^2\to\mathbb R$. If $ {\cal F}'\subseteq {\cal F}$ is a subspace of $ {\cal F}$, then $ {\cal F}'$ is an RKHS and has a kernel $ {\cal K}': X^2\to\mathbb R$. If $ {\cal F}''$ is the orthogonal complement of $ {\cal F}'$, then it is also an RKHS and it has the kernel $ {\cal K}'': X^2\to\mathbb R$ such that $ {\cal K}'(x_1,x_2)+ {\cal K}''(x_1,x_2)= {\cal K}(x_1,x_2)$ for all $x_1,x_2\in X$.

Proof: Let $\mathop{{\mathrm{pr}}}\nolimits'$ and $\mathop{{\mathrm{pr}}}\nolimits'$ be the projection operators from $ {\cal F}$ to $ {\cal F}'$ and $ {\cal F}''$, respectively.

Take a point $x\in X$. The evaluation functional on $ {\cal F}$ equals the scalar product by $ {\cal K}(x,\cdot)$. It is easy to see that $\mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x,\cdot))$ plays the same role in $ {\cal F}'$. Indeed, $\mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x,\cdot))\in {\cal F}'$ and for every function $f\in {\cal F}'$ we have

\begin{align*} f(x) &= \langle  {\cal K}(x,\cdot),f(\cdot)\rangle\\ &= \langle \mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x,\cdot)),f(\cdot)\rangle. \end{align*}
Put $ {\cal K}'(x_1,x_2)=\langle\mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x_1,\cdot)), \mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x_2,\cdot))\rangle$. Let us prove that it is the kernel for $ {\cal F}'$

We do this by showing that $ {\cal K}'(x_1,x_2)$ as a function of $x_2$ coincides with $\mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x_1,\cdot))$. Fix $x_1$ and denote the function $\mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x_1,\cdot))$ by $f(\cdot)$. We have $f\in {\cal F}'$. The above argument implies that $\langle f(\cdot),\mathop{{\mathrm{pr}}}\nolimits'( {\cal K}(x_2,\cdot))\rangle=f(x_2)$ for every $x_2\in X$ and thus $ {\cal K}'(x_1,\cdot)=f(\cdot)\in {\cal F}$. The reproducing property follows.

Similarly $ {\cal K}''(x_1,x_2)=\langle\mathop{{\mathrm{pr}}}\nolimits''( {\cal K}(x_1,\cdot)), \mathop{{\mathrm{pr}}}\nolimits''( {\cal K}(x_2,\cdot))\rangle$ is the kernel for $ {\cal F}''$.

Let $f,g\in {\cal F}$. We have $f=\mathop{{\mathrm{pr}}}\nolimits'(f)+\mathop{{\mathrm{pr}}}\nolimits''(f)$ and $g=\mathop{{\mathrm{pr}}}\nolimits'(g)+\mathop{{\mathrm{pr}}}\nolimits''(g)$; therefore

\begin{align*} \langle f,g\rangle &= \left\langle \mathop{{\mathrm{pr}}}\nolimits'(f)+ \mathop{{\mathrm{pr}}}\nolimits''(f), \mathop{{\mathrm{pr}}}\nolimits'(g)+ \mathop{{\mathrm{pr}}}\nolimits''(g)\right\rangle\\ &= \left\langle \mathop{{\mathrm{pr}}}\nolimits'(f), \mathop{{\mathrm{pr}}}\nolimits'(g)\right\rangle+ \left\langle \mathop{{\mathrm{pr}}}\nolimits''(f), \mathop{{\mathrm{pr}}}\nolimits''(g)\right\rangle. \end{align*}
By taking $f= {\cal K}(x_1,\cdot)$ and $g= {\cal K}(x_2,\cdot)$ we get $ {\cal K}'(x_1,x_2)+ {\cal K}''(x_1,x_2)= {\cal K}(x_1,x_2)$. $\Box$

Theorem 10. Let $ {\cal K}, {\cal K}', {\cal K}'':X^2\to\mathbb R$ be three kernels on $X$ such that $ {\cal K}(x_1,x_2)= {\cal K}'(x_1,x_2)+ {\cal K}''(x_1,x_2)$ for all $x_1,x_2\in X$. Then the RKHSs $ {\cal F}'$ and $ {\cal F}''$ corresponding to the kernels $ {\cal K}'$ and $ {\cal K}''$ are subsets (but not necessarily subspaces) of the RKHS $ {\cal F}$ corresponding to the kernel $ {\cal K}$. For each $f\in {\cal F}$ there are functions $f_1\in {\cal F}'$ and $f_2\in {\cal F}''$ such that $f=f_1+f_2$ and for its norm we have the equality
$$ \|f\|^2_{ {\cal F}}=\min\nolimits_{f_1\in {\cal F}_1, f_2\in {\cal F}_2: f=f_1+f_2} \left(\|f_1\|^2_{ {\cal F}'}+\|f_2\|^2_{ {\cal F}''}\right). $$
If $ {\cal F}'$ and $ {\cal F}''$ have only the identical zero function in common (i.e., $ {\cal F}'\cap {\cal F}''=\{0\}$), then they are subspaces of $ {\cal F}$ and each one is the orthogonal complement of the other.

It is easy to see that $ {\cal F}'$ does not have to be a subspace of $ {\cal F}$. Indeed, let $ {\cal K}'= {\cal K}''$ and $ {\cal K}=2 {\cal K}'$. Clearly if we take the set of functions constituting $ {\cal F}$ and equip it with the scalar product $\langle\cdot,\cdot\rangle_ {\cal F}\slash{}2$, we get the RKHS for $ {\cal F}'$. It is a subset but not a subspace of $ {\cal F}$.

Proof: Let $\Phi'$ and $\Phi''$ mapping $X$ into Hilbert spaces $H'$ and $H''$, respectively, be feature maps giving rise to the kernels $ {\cal K}'$ and $ {\cal K}''$, i.e.,

\begin{align*} {\cal K}'(x_1,x_2)  &= \langle \Phi'(x_1),\Phi'(x_2)\rangle\\ {\cal K}''(x_1,x_2) &= \langle \Phi''(x_1),\Phi''(x_2)\rangle. \end{align*}
Let $H$ be the Hilbert space consisting of pairs $(h',h'')$ such that $h'\in H'$ and $h''\in H''$ with the scalar product given by
$$ \left\langle (h'_1,h''_1), (h'_2,h''_2)\right\rangle_H = \langle h'_1,h'_2 \rangle_{H_1}+\langle h''_1,h''_2 \rangle_{H_2}. $$
Take $\Phi:X\to H$ defined by
$$ \Phi(x)=(\Phi'(x),\Phi''(x)). $$
It is easy to see that
$$ {\cal K}(x_1,x_2)  = \langle \Phi(x_1),\Phi(x_2)\rangle. $$

The results of Subsect. 5.2 imply that the RKHS for $ {\cal K}$ coincides with the set of functions $\langle h,\Phi(\cdot)\rangle$, where $h\in H$. Similarly, the RKHSs for $ {\cal K}'$ and $ {\cal K}''$ consist of all functions $\langle h',\Phi'(\cdot)\rangle$ and $\langle h'',\Phi''(\cdot)\rangle$, respectively, where $h'\in H'$ and $h''\in H''$.

For every $h'\in H'$ the element $(h',0)$ belongs to $H$ ($0$ is the zero element of $H''$ here). We have

$$ \langle(h',0),\Phi(\cdot)\rangle_H=\langle(h',\Phi'(\cdot))\rangle_{H_1} $$
and therefore $ {\cal F}'\subseteq {\cal F}$ as a set; the same argument applies to $ {\cal F}''$. The decomposition
$$ \left\langle (h',h''),\Phi(\cdot)\right\rangle = \langle h',\Phi'(\cdot)\rangle + \langle h'',\Phi''(\cdot)\rangle $$
implies that each $f\in {\cal F}$ can be decomposed into the sum of $f_1\in {\cal F}'$ and $f_2\in {\cal F}''$.

We have

\begin{align*} \|f\|^2_ {\cal F} &= \min\nolimits_{h\in H:  f(\cdot)=\langle h,\Phi(\cdot)\rangle} \|h\|^2_H\\ &=\min\nolimits_{h'\in H, h''\in H'':  f(\cdot)=\langle h',\Phi'(\cdot)\rangle+\langle h'',\Phi''(\cdot)\rangle}  \left(\|h'\|^2_{H_1}+\|h''\|^2_{H_2}\right) \end{align*}

The minimum is taken over pairs of $(h',h'')$; clearly, we can take the minimum over all pairs of $f_1\in {\cal F}'$ and $f_2\in {\cal F}''$ such that $f=f_1+f_2$; indeed, $\|f_1\|_{ {\cal F}'}=\min\nolimits_{h'\in H_1: f_1(\cdot)=\langle h', {\cal K}'(\cdot)\rangle_{H_1}}\|h'\|_{H_1}$.

Now let $ {\cal F}'\cap {\cal F}''=\{0\}$. Under this assumption every $f\in {\cal F}$ has a unique decomposition $f=f_1+f_2$, where $f_1\in {\cal F}'$ and $f_2\in {\cal F}''$. Indeed, if $f=f_1+f_2=g_1+g_2$, then $f_1-g_1=f_2-g_2$. The function of the left-hand side belongs to $ {\cal F}'$ and the function on the right-hand side belongs to $ {\cal F}''$ and therefore they are both equal to zero. Thus for every pair $f_1\in {\cal F}'$ and $f_2\in {\cal F}''$ we have $\|f_1+f_2\|^2_{ {\cal F}}=\|f_1\|^2_{ {\cal F}'}+\|f_2\|^2_{ {\cal F}''}$.

Take $f_2=0$. Then this equality implies that $\|f_1\|_{ {\cal F}}=\|f_1\|_{ {\cal F}'}$. Taking $f_1=0$ leads to $\|f_2\|_{ {\cal F}}=\|f_2\|_{ {\cal F}''}$. The norms on $ {\cal F}'$ and $ {\cal F}''$ coincide with the norm on $ {\cal F}$; the same should apply to the scalar product and thus $ {\cal F}'$ and $ {\cal F}''$ are subspaces rather than just subsets of $ {\cal F}$.

Picking arbitrary $f_1$ and $f_2$ and applying Pythagoras theorem yields $\|f_1+f_2\|^2_{ {\cal F}}=\|f_1\|^2_{ {\cal F}}+\|f_2\|^2_{ {\cal F}}+2\langle f_1,f_2\rangle_{ {\cal F}}$. Comparing this with the above equality implies that $\langle f_1,f_2\rangle_{ {\cal F}}=0$, i.e., $ {\cal F}'$ and $ {\cal F}''$ are orthogonal subspaces.


Let us introduce a relation on kernels on a set $X$. We will write $ {\cal K}'\ll {\cal K}$ if the difference $ {\cal K}''(x_1,x_2)= {\cal K}(x_1,x_2)- {\cal K}'(x_1,x_2)$ is a kernel.

If this relation holds, then $ {\cal K}'(x,x)\le  {\cal K}(x,x)$ for all $x\in X$. Indeed, since $ {\cal K}''$ is a kernel, the $1\times 1$ matrix $ {\cal K}''(x,x)$ is positive semi-definite, i.e., $ {\cal K}''(x,x)\ge 0$. This observation justifies the notation to some extent and implies that $\ll$ is antisymmetric. Indeed, if $ {\cal K}'\ll {\cal K}$ and $ {\cal K}\ll {\cal K}'$ then for $ {\cal K}''= {\cal K}- {\cal K}'$ we get $ {\cal K}''(x,x)=0$ for all $x\in X$. Theorem 2 implies that for every $f$ from the RKHS of $ {\cal K}''$ and every $x\in X$ we have $f(x)=0$ and thus $f$ is identically zero. This implies that $ {\cal K}''=0$.

Clearly, $\ll$ is transitive: if $ {\cal K}'\ll {\cal K}''\ll {\cal K}''$, then $ {\cal K}'\ll {\cal K}'''$.

The theorems above imply that $\ll$ for kernels is closely linked with the relation $\subseteq$ on their RKHSs. However no direct correspondence has been shown. The following results close the gap.

Theorem 11. Let $ {\cal K}$ and $ {\cal K}'$ be two kernels on the same set and let $ {\cal F}$ and $ {\cal F}'$ be their RKHSs. If $ {\cal K}'\ll {\cal K}$, then $ {\cal F}'\subseteq {\cal F}$ as a set and for every $f_1\in {\cal F}'$ we have
$$\|f_1\|_{ {\cal F}}\le \|f_1\|_{ {\cal F}'}.$$

This theorem follows from our previous results. Indeed, the square $\|f\|_{ {\cal F}}^2$ is given by the minimum of $\|f_1\|_{ {\cal F}'}^2+ \|f_2\|_{ {\cal F}''}^2$ taken over all decompositions $f=f_1+f_2$, where $ {\cal F}''$ is the RKHS corresponding to the difference $ {\cal K}- {\cal K}'$. Every $f_1\in {\cal F}'$ can be represented as $f_1+0$, which implies the inequality in the theorem.

The opposite result holds.

Theorem 12. Let $ {\cal K}$ and $ {\cal F}$ be its RKHSs. If $ {\cal F}'\subseteq {\cal F}$ as a set and $ {\cal F}'$ forms a Hilbert space w.r.t. a norm $\|\cdot\|_{ {\cal F}_1}$ such that
$$\|f_1\|_{ {\cal F}}\le \|f_1\|_{ {\cal F}'}$$
for every $f_1\in {\cal F}'$, then $ {\cal F}'$ is an RKHS and its reproducing kernel $ {\cal K}'$ satisfies $ {\cal K}'\ll {\cal K}$.

It is easy to see that the inequality on the norms cannot be omitted. Consider some kernel $ {\cal K}'$ on $X$ and let $ {\cal K}= {\cal K}'+ {\cal K}'=2 {\cal K}$. Let $ {\cal F}'$ be the RKHS for $ {\cal K}'$. For every $f\in {\cal F}'$ we have

\begin{align*} f(x) &= \langle f(\cdot), {\cal K}'(x,\cdot)\rangle_{ {\cal F}'}\\ &= \frac 12 \langle f(\cdot),2 {\cal K}'(x,\cdot)\rangle_{ {\cal F}'} \end{align*}
and therefore $ {\cal F}$ that coincides with $ {\cal F}'$ as a set and has the scalar product $\langle\cdot,\cdot\rangle_{ {\cal F}}=\frac 12  \langle\cdot,\cdot\rangle_{ {\cal F}'}$ is the RKHS for $ {\cal K}$. We have $\|f\|_{ {\cal F}'}=\sqrt{2}\|f\|_{ {\cal F}}\ge \|f\|_{ {\cal F}}$. However let us consider $ {\cal F}$ as a subset of $ {\cal F}'$. It satisfies the conditions of the theorem apart from the norm clause but $ {\cal K}\not\ll {\cal K}'$.

The proof of the theorem is beyond the scope of this article and can be found in [Aronszajn, 1950], pp.355-356.

6.2  Subsets of the Domain

The representer theorem provides a motivation to the following analysis important for learning.

Suppose that we have a kernel $ {\cal K}: X\times X\to\mathbb R$ and let $ {\cal F}$ be its RKHS. Consider a subset $X'\subseteq X$ and let $ {\cal K}'$ be the restriction of $ {\cal K}$ to $X'\times X'$. It is easy to see that $ {\cal K}'$ is still a kernel. What can one say about the RKHS $ {\cal F}'$ corresponding to $ {\cal F}$?

This question is important for learning for the following reason. Suppose that we are given a training sample (or a history of past signals and outcomes, if we are using an on-line protocol) $(x_1,y_1),(x_2,y_2),\ldots,(x_{T-1},y_{T-1})$ and a new signal $x_T$. One can consider different sets $X$ containing the observed signals $x_1,x_2,\ldots,x_T$. What effect will the choice of a particular $X$ have on the consistency analysis of a learning method? The representer theorem above suggests that the choice of $X$ is essentially irrelevant. Padding $X$ does not really change much. (Of course the situation becomes different if we need to make a statement valid for all $x$s that can possibly occur.) We shall see that this is generally true.

Let us try and clarify the situation. First, does $ {\cal F}$ contain more functions on $X'$? Are restrictions of functions from $ {\cal F}$ form a set richer than $ {\cal F}'$?

This is easy to answer if we use the definition of a RKHS provided by Theorem 7. Let $\Phi:X\to H$ be a mapping such that $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$ for all $x_1,x_2\in X$ (here $H$ is a Hilbert space as usual). Then $ {\cal F}$ coincides with the set of functions of the form $\langle h,\Phi(\cdot)\rangle$, where $h$ ranges over $H$ and the norm of $f\in {\cal F}$ is the minimum of the norms of all elements $h\in H$ corresponding to $f$. Clearly $\Phi$ represents $ {\cal K}'$ as well as $ {\cal K}$: we have $ {\cal K}'(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$ for all $x_1,x_2\in X'$. Thus $ {\cal F}'$ consists of functions of the form $\langle h,\Phi(\cdot)\rangle$ restricted to $X'$.

This implies that the restrictions of all functions from $ {\cal F}$ to $X'$ belong to $ {\cal F}'$. On the other hand for every $f'\in {\cal F}'$ there is $h\in H$ such that $f'=\langle h,\Phi(\cdot)\rangle$ and we can extend $f'$ to $X$ by considering scalar products with the same $h$. Therefore each function from $ {\cal F}'$ is a restriction of some function from $ {\cal F}$. The norm of $f'$ does not exceed that of each extension; on the other hand it equals to the norm of some $h'\in H$ and therefore to the norm of some extension.

Theorem 13. Let a space $ {\cal F}$ of real-valued functions on $X$ be the RKHS corresponding to a kernel $ {\cal K}: X^2\to\mathbb R$ and let $X'\subseteq X$. Then the RKHS $ {\cal F}'$ corresponding to the kernel $ {\cal K}'= {\cal K}|_{X'^2}$ coincides with the set of restrictions to $X'$ of functions from $ {\cal F}$ and the norm on it is given by $\|f'\|_{ {\cal F}'}=\min_{f\in {\cal F}: f|_{X'}=f'}\|f\|_ {\cal F}$.

The mapping from $ {\cal F}$ to $ {\cal F}'$ provided by the restriction operator does not have to be injective. Indeed two elements $h_1,h_2$ can have the same scalar products with all elements from $\Phi(X')$ but different scalar products with some elements of $\Phi(X\setminus X')$. One can easily identify a subspace of $ {\cal F}$ isomorphic to $ {\cal F}'$ though.

Theorem 14. Let a space $ {\cal F}$ of real-valued functions on $X$ be the RKHS corresponding to a kernel $ {\cal K}: X^2\to\mathbb R$ and let $X'\subseteq X$. Then the closure of the set of finite linear combinations of the form $\sum_{i=1}^n a_i {\cal K}(x_i,\cdot)$, where $n=1,2,\ldots$ and $x_i\in X'$, i.e., the subspace $ {\cal F}''=\overline{\mathop{{\mathrm{span}}}\nolimits(\{ {\cal K}(x,\cdot)\mid x\in X'\})}\subseteq {\cal F}$ is isomorphic to the RKHS corresponding to the kernel $ {\cal K}'= {\cal K}|_{X'^2}$ and an isomorphism is given by the restriction of functions from $ {\cal F}''$ to $X'$.

Proof: Let us denote the restriction operator by $R: {\cal F}\to {\cal F}'$. It maps each $f\in {\cal F}$ into its restriction $f'=f|_{X'}\in {\cal F}'$.

We already know that $R$ maps $ {\cal F}''$ into $ {\cal F}'$ and it does not increase the norm. It is easy to see that it is linear. We need to show that $R$ as a mapping from $ {\cal F}''$ into $ {\cal F}'$ is injective, surjective, and actually preserves the norm.

Consider the set of finite linear combinations of the form $\sum_{i=1}^n a_i {\cal K}(x_i,\cdot)$, where $x_i\in X'$. We can consider them as elements of $ {\cal F}''$ and elements of $ {\cal F}'$. It is easy to see that $R$ "" these linear combinations, i.e., $R(\sum_{i=1}^n a_i {\cal K}(x_i,\cdot))= \sum_{i=1}^n a_i {\cal K}(x_i,\cdot)$ (note that on the left the combination is considered as an element of $ {\cal F}''$ and on the right the same combination is an element of $ {\cal F}'$).

Let us show that $R$ restricted to functions that are finite linear combinations of the above form is an isomorphism. Its surjectivity follows from the above. On the other hand we have

\begin{align*} \left\|\sum_{i=1}^n a_i {\cal K}(x_i,\cdot)\right\|_{ {\cal F}} &= \sum_{i,j=1}^n a_ia_j {\cal K}(x_i,x_j) \\ &= \left\|\sum_{i=1}^n a_i {\cal K}(x_i,\cdot)\right\|_{ {\cal F}'}. \end{align*}
The mapping $R$ thus preserves the norm of a linear combination. This implies that $R$ is injective because $R$ cannot map a non-zero element into zero.

Now consider $R$ on the subspace $ {\cal F}''$. Let $f\in {\cal F}''$; it can be approximated by a linear combination $g$ to any degree of precision w.r.t. the norm $\|\cdot\|_{ {\cal F}}$. However $R$ does not increase the norm and we therefore have

\begin{align*} \|R(f)-R(g)\|_{ {\cal F}'} &= \| R(f-g)\|_{ {\cal F}'} \\ &\le \|f-g\|_{ {\cal F}}. \end{align*}
Because $\|g\|_{ {\cal F}}=\|R(g)\|_{ {\cal F}'}$, the norm of $f$ is also preserved, which immediately implies that $R$ is an injection on $ {\cal F}''$.

To prove the surjectivity we need to remember that the finite linear combinations are dense in $ {\cal F}'$. If $f\in {\cal F}'$, it is the limit of a sequence of finite linear combinations. Their inverse images (i.e., the same combinations) in $ {\cal F}''$ will form a fundamental sequence because the norm is preserved and therefore will converge. The image of the limit has to be the original $f\in {\cal F}'$.


Corollary 2. The orthogonal complement of $ {\cal F}''$ w.r.t. $ {\cal F}$ consists of all functions $g\in {\cal F}$ such that $g(x)=0$ for all $x\in X'$.

Proof: Consider $g\in {\cal F}$ and $x\in X'$. We have $g(x)=\langle g, {\cal K}(x,\cdot)\rangle_{ {\cal F}}$. If $g$ is orthogonal to $ {\cal F}''$, it is orthogonal to each $ {\cal K}(x,\cdot)$ and therefore $g(x)=0$ for all $x\in X'$. If $g\equiv 0$ on $X'$, then it is orthogonal to all $ {\cal K}(x,\cdot)$, where $x\in X'$, all their linear combinations and therefore to all elements of $ {\cal F}''$. $\Box$

7.  Tensor Products of Kernels

Suppose that we have two kernels $ {\cal K}_1: X_1\times X_1\to\mathbb R$ and $ {\cal K}_2: X_2\times X_2\to\mathbb R$. Let $X$ be the Cartesian product $X_1\times X_2$. We can consider the function $ {\cal K}: X\times X\to\mathbb R$ defined by

$$ {\cal K}((x'_1,x'_2),(x''_1,x''_2))= {\cal K}_1(x'_1,x''_1) {\cal K}_2(x'_2,x''_2). $$
We will call it the tensor product 10 of $ {\cal K}_1$ and $ {\cal K}_2$ and use the notation $ {\cal K}= {\cal K}_1\otimes {\cal K}_2$. We will show that the tensor product is also a kernel and construct the corresponding RKHS.

Theorem 15. If $ {\cal K}_1: X_1\times X_1\to\mathbb R$ and $ {\cal K}_2: X_2\times X_2\to\mathbb R$ are kernels then their tensor product $ {\cal K}= {\cal K}_1\otimes {\cal K}_2: X\times X\to\mathbb R$, where $X=X_1\times X_2$, defined by
$$ {\cal K}((x'_1,x'_2),(x''_1,x''_2))= {\cal K}_1(x'_1,x''_1) {\cal K}_2(x'_2,x''_2) $$
is also a kernel.

We shall prove this theorem later. Let us first derive some corollaries.

Corollary 3. Let $ {\cal K}_1:X\times X\to\mathbb R$ and $ {\cal K}_2:X\times X\to\mathbb R$ be two kernels defined on the same domain $X$. Then their componentwise product $ {\cal K}= {\cal K}_1 {\cal K}_2: X\times X\to\mathbb R$ defined by

$$ {\cal K}(x_1,x_2)= {\cal K}_1(x_1,x_2) {\cal K}_2(x_1,x_2) $$
is also a kernel.

Proof: Inside the direct product $(X\times X)$ consider the "diagonal" set $D$ defined by $\{((x,x)\mid x\in X\}$. The restriction of the tensor product $ {\cal K}_1\otimes {\cal K}_2$ coincides with the componentwise product on $D\times D$. Clearly, it is therefore a kernel. $\Box$

Theorem 13 implies that the RKHS corresponding to the componentwise product consists of restrictions of functions from the RKHS corresponding to the tensor product to $D$.

We obtain the following linear algebra result as a bonus.

Corollary 4. The componentwise product of two symmetric positive-definite matrices is symmetric and positive-definite.

The corollary can be proved by considering a symmetric positive-definite matrix as a kernel on a finite set.

Let us prove the theorem about the tensor product. We shall prove it by constructing the RKHS $ {\cal F}$ for the product $ {\cal K}_1\otimes {\cal K}_2$. The space is the tensor product of RKHSs $ {\cal F}_1$ with the scalar product $\langle\cdot,\cdot\rangle_1$ corresponding to $ {\cal K}_1$ and $ {\cal F}_2$ with the scalar product $\langle\cdot,\cdot\rangle_2$ corresponding to $ {\cal K}_2$. We cannot employ the standard construction (see e.g., [Weidmann, 1980], Section 3.4) directly because we need to ensure that the resulting space is still a space of functions rather than a generic Hilbert space.

Consider the set of functions on $X=X_1\times X_2$ of the form $f(x_1,x_2)=f_1(x_1)f_2(x_2)$, where $f_1\in {\cal F}_1$ and $f_2\in {\cal F}_2$. As often happens when we try to construct a tensor product of some spaces, this set is not necessarily closed under addition. Therefore let us consider the space of finite sums $f(x_1,x_2)=\sum_{i=1}^nf_1^{\left(i\right)}(x_1)f_2^{\left(i\right)}(x_2)$, where $n$ is a positive integer, $f_1^{\left(1\right)},f_1^{\left(2\right)},\ldots,f_1^{\left(n\right)}\in {\cal F}_1$ and $f_2^{\left(1\right)},f_2^{\left(2\right)},\ldots,f_2^{\left(n\right)}\in {\cal F}_2$. It is clearly a vector space.

Let us define the scalar product of $f(x_1,x_2)=\sum_{i=1}^nf_1^{\left(i\right)}(x_1)f_2^{\left(i\right)}(x_2)$ and $g(x_1,x_2)=\sum_{j=1}^mg_1^{\left(j\right)}(x_1)g_2^{\left(j\right)}(x_2)$ by

\begin{equation*} \langle f,g\rangle_\otimes = \sum_{i,j} \langle f_1^{\left(i\right)},g_1^{\left(j\right)}\rangle_1 \langle f_2^{\left(i\right)},g_2^{\left(j\right)}\rangle_2\enspace. \end{equation*}
We need to show that the scalar product is well-defined, i.e., it depends on functions rather than their representations in the form of sums of products. We have
\begin{align*} \langle f,g\rangle_\otimes &= \sum_{i=1}^n\sum_{j=1}^m\langle f_1^{\left(i\right)},g_1^{\left(j\right)}\rangle_1 \langle f_2^{\left(i\right)},g_2^{\left(j\right)}\rangle_2 \\  &=\sum_{j=1}^m \left\langle \sum_{i=1}^n\langle f_1^{\left(i\right)},g_1^{\left(j\right)}\rangle_1 f_2^{\left(i\right)},g_2^{\left(j\right)}\right\rangle_2 \end{align*}
The inner sum $\sum_{i=1}^n\langle f_1^{\left(i\right)},g_1^{\left(j\right)}\rangle_1 f_2^{\left(i\right)}= \sum_{i=1}^n\langle f_1^{\left(i\right)}(\cdot),g_1^{\left(j\right)}(\cdot)\rangle_1 f_2^{\left(i\right)}(x_2)$ is a function on $X_2$. We have
\begin{align*} \sum_{i=1}^n\langle f_1^{\left(i\right)}(\cdot),g_1^{\left(j\right)}(\cdot)\rangle_1 f_2^{\left(i\right)}(x_2)  &= \left\langle \sum_{i=1}^nf_1^{\left(i\right)}(\cdot)f_2^{\left(i\right)}(x_2),g_1^{\left(j\right)}(\cdot)\right\rangle_1\\ &= \langle f(\cdot,x_2),g_1^{\left(j\right)}(\cdot)\rangle_1\enspace. \end{align*}
We can thus get rid of a particular representation of $f$ as a sum of products; only the values of $f$ matter. Clearly, the same argument applies to $g$. The scalar product is thus well-defined.

It still remains to show that $\langle\cdot,\cdot\rangle_\otimes$ satisfies the properties of a scalar product. It is easy to check that it is a bilinear form. We need to show that $\langle f,f\rangle_\otimes\ge 0$ and vanishes only for $f=0$. Let us take $f(x_1,x_2)=\sum_{i=1}^nf_1^{\left(i\right)}(x_1)f_2^{\left(i\right)}(x_2)$ and apply an orthonormalisation procedure (e.g., the Gram-Schmidt method) to $f_1^{\left(1\right)},f_1^{\left(2\right)},\ldots,f_1^{\left(n\right)}$ in $ {\cal F}_1$ and to $f_2^{\left(1\right)},f_2^{\left(2\right)},\ldots,f_2^{\left(n\right)}$ in $ {\cal F}_2$. After renaming the functions, we get a representation $f(x_1,x_2)=\sum_{i,j}\alpha_{i,j} f_1^{\left(i\right)}(x_1)f_2^{\left(i\right)}(x_2)$, where, for both $s=1$ and $s=2$, the equality $\langle f_s^{\left(i\right)},f_s^{\left(i\right)}\rangle_s=1$ holds for all $i$ and $\langle f_s^{\left(i\right)},f_s^{\left(j\right)}\rangle_s=0$ holds for all $i\ne j$. For the scalar product we have

\begin{align*} \langle f,f\rangle_\otimes &=\sum_{i,j,k,l}\alpha_{i,j}\alpha_{k,l}\langle f_1^{\left(i\right)},f_1^{\left(k\right)}\rangle_1 \langle f_2^{\left(j\right)},f_2^{\left(l\right)}\rangle_2\\ &=\sum_{i,j}\alpha_{i,j}^2\\ &\ge 0\enspace. \end{align*}
If $\langle f,f\rangle_\otimes$, all $\alpha_{i,j}$ are 0 and therefore $f=0$.

Let us now find out the role of the product of kernels defined by

$$  {\cal K}= {\cal K}((x'_1,x'_2),(x''_1,x''_2))= {\cal K}_1(x'_1,x''_1) {\cal K}_2(x'_2,x''_2)\enspace. $$
Fix $x'_1\in X_1$ and $x'_2\in X_2$. The function
$$ {\cal K}((x'_1,x'_2),(\cdot,\cdot))= {\cal K}_1(x'_1,\cdot) {\cal K}_2(x'_2,\cdot): X_1\times X_2\to\mathbb R $$
is a product of functions from $ {\cal F}_1$ and $ {\cal F}_2$, i.e., has the form $f^{\left(1\right)}(\cdot)f^{\left(2\right)}(\cdot)$, where $f^{\left(1\right)}\in {\cal F}_1$ and $f^{\left(2\right)}\in {\cal F}_2$.

For $f(x_1,x_2)=\sum_{i=1}^nf_1^{\left(i\right)}(x_1)f_2^{\left(i\right)}(x_2)$ we get

\begin{align*} \langle f, {\cal K}\rangle_\otimes &= \langle f(\cdot,\cdot), {\cal K}_1(\cdot,x_1) {\cal K}_2(\cdot,x_2)\rangle_\otimes\\ &= \left\langle \sum_{i=1}^nf_1^{\left(i\right)}(\cdot)f_2^{\left(i\right)}(\cdot), {\cal K}_1(\cdot,x_1) {\cal K}_2(\cdot,x_2)\right\rangle_\otimes\\ &= \sum_{i=1}^n\langle f_1^{\left(i\right)}(\cdot), {\cal K}_1(\cdot,x_1)\rangle_1 \langle f_2^{\left(i\right)}(\cdot), {\cal K}_2(\cdot,x_2)\rangle_2\\ &= \sum_{i=1}^n f_1^{\left(i\right)}(x_1)f_2^{\left(i\right)}(x_2)\\ &= f(x_1,x_2)\enspace. \end{align*}

The resulting space with a scalar product can be incomplete, so we need to construct its completion; the situation is somehow similar to that with the existence theorem. We cannot use the standard result about completions of pre-Hilbert spaces, because we need to obtain a space of functions.

In order to construct the completion, we need to recall some facts about orthonormal bases in Hilbert spaces. Let $H$ be a (pre-)Hilbert space with a scalar product $\langle\cdot,\cdot\rangle$. An orthonormal system is a system of vectors $ {\cal S}=\{e_i,i\in {\cal I}\}$, where $ {\cal I}$ is some index set, not necessarily countable, such that $\langle e_i,e_j\rangle$ equals 0 for $i\ne j$ and 1 for $i=j$. It is easy to check that all finite subsets of an orthonormal system are linearly independent.

If $ {\cal S}$ is an orthonormal system, then for each $h\in H$ only countably many scalar products $\langle h,e_i\rangle$ are non-zero and the Bessel inequality $\|h\|^2\ge\sum_{i\in {\cal I}}|\langle h,e_i\rangle|^2$ holds.

An orthonormal system $ {\cal S}$ is an orthonormal base, if its span (the set of all finite linear combinations of its elements) is dense in $H$, i.e., $\overline{\mathop{{\mathrm{span}}}\nolimits}( {\cal S})=H$. An orthonormal system is an orthonormal base if and only if for every $h\in H$ the Parseval equality $\|h\|^2=\sum_{i\in {\cal I}}|\langle h,e_i\rangle|^2$ holds. The Parseval equality implies that $h=\sum_{i\in {\cal I}}\langle h,e_i\rangle e_i$. (The convergence of this series can be understood as follows. If $\alpha_i\in {\cal I}$, $i=1,2,\ldots$ is a sequence of (pairwise different) indexes including all $\alpha$ such that $\langle h,e_\alpha\rangle\ne 0$, then $\|h-\sum_{i=1}^n\langle h,e_{\alpha_i}\rangle e_{\alpha_i}\|\to 0$ as $n\to+\infty$.) The scalar product of $h_1,h_2\in H$ is given by $\sum_{i\in {\cal I}}\langle h_1,e_i\rangle\langle h_2,e_i\rangle$. This generalises the finite-dimensional facts about the coordinate decomposition of a vector and its length.

Each Hilbert space $H$ has an orthonormal base; this follows from Zorn's lemma. This system is countable if and only if $H$ is separable.

Let vectors $e_i$, $i=1,2,\ldots$ form an orthonormal system in a Hilbert space. The series $\sum_{i=1}^{+\infty}c_ie_i$ converges if and only if the sequence $(c_1,c_2,\ldots)$ belongs to $l_2$. Each separable Hilbert space is thus isomorphic to $l_2$. As a matter of fact, any Hilbert space is isomorphic to $l_2( {\cal I})$, where $ {\cal I}$ is the index set of its orthonormal base. The space $l_2( {\cal I})$ consists of systems of reals $\{c_i,i\in {\cal I}\}$, where only countably many elements are non-zero and $\sum_{i\in {\cal I}}|c_i|^2<+\infty$. The addition and multiplication by a constant are defined componentwise. The scalar product of $\{c^{\left(1\right)}_i,i\in {\cal I}\}$ and $\{c^{\left(2\right)}_i,i\in {\cal I}\}$ is $\sum_{i\in {\cal I}}c^{\left(1\right)}_ic^{\left(2\right)}_i$.

Let us now build the completion. Take an orthonormal base $e^{\left(1\right)}_i,i\in {\cal I}_1$ in $ {\cal F}_1$ and an orthonormal base $e^{\left(2\right)}_i,i\in {\cal I}_2$ in $ {\cal F}_2$. The index sets $ {\cal I}_1$ and $ {\cal I}_2$ do not have to be countable 11 Consider the class of functions of the form $g(x_1,x_2)=\sum_{i\in {\cal I}_1,j\in {\cal I}_2} c_{i,j}e^{\left(1\right)}_i(x_1)e^{\left(2\right)}_j(x_2)$, where only countably many coefficients $c_{i,j}$ are non-zero and $\sum_{i\in {\cal I}_1,j\in {\cal I}_2} c_{i,j}^2<+\infty$. Let us show that this series is absolutely convergent for each $(x_1,x_2)\in X_1\times X_2$. Fix $i\in {\cal I}_1$; the Cauchy(-Schwarz-Bunyakovsky) inequality implies that

\begin{equation*} \sum_{j\in {\cal I}_2}|c_{i,j}||e^{\left(2\right)}_j(x_2)|\le \sqrt{\sum_{j\in {\cal I}_2}|c_{i,j}|^2} \sqrt{\sum_{j\in {\cal I}_2}|e^{\left(2\right)}_j(x_2)|^2}\enspace, \end{equation*}
where the first sum on the left-hand side must converge and second is still to be analysed. Recall that $ {\cal F}_2$ is an RKHS and the evaluation functional at a point $x_2\in X_2$ corresponds to the scalar product by $ {\cal K}_2(x_2,\cdot)$. Thus $e^{\left(2\right)}_j(x_2)=\langle  e^{\left(2\right)}_j,  {\cal K}_2(x_2,\cdot)\rangle_2$. The Bessel inequality implies that
\begin{align*} \sum_{j\in {\cal I}_2}e^{\left(2\right)}_j(x_2))^2 &= \sum_{j\in {\cal I}_2}|\langle  e^{\left(2\right)}_j,  {\cal K}_2(x_2,\cdot)\rangle_2|^2\\ &\le \| {\cal K}_2(x_2,\cdot)\|^2_2\\ &=  {\cal K}_2(x_2,x_2) \end{align*}
and therefore
\begin{equation*} \sum_{j\in {\cal I}_2}|c_{i,j}||e^{\left(2\right)}_j(x_2)|\le \sqrt{\sum_{j\in {\cal I}_2}|c_{i,j}|^2}\sqrt{ {\cal K}_2(x_2,x_2)}\enspace. \end{equation*}
Repeating the same argument, we get
\begin{multline*} \sum_{i\in {\cal I}_1,j\in {\cal I}_2} |c_{i,j}||e^{\left(1\right)}_i(x_1)||e^{\left(2\right)}_j(x_2)|\le \\ \sqrt{\sum_{i\in {\cal I}_1, j\in {\cal I}_2}|c_{i,j}|^2}\sqrt{ {\cal K}_1(x_1,x_1)}\sqrt{ {\cal K}_2(x_2,x_2)} \enspace. \end{multline*}

Take $g(x_1,x_2)=\sum_{i\in {\cal I}_1,j\in {\cal I}_2} c_{i,j}e^{\left(1\right)}_i(x_1)e^{\left(2\right)}_j(x_2)$ and fix the first coordinate. The function $g(x_1,\cdot)=\sum_{i\in {\cal I}_1,j\in {\cal I}_2} c_{i,j}e^{\left(1\right)}_i(x_1)e^{\left(2\right)}_j(\cdot)$ belongs to $ {\cal F}_2$ as a converging sum of $e^{\left(2\right)}_j(\cdot)\in {\cal F}_2$. (Indeed, the sum $\sum c_{i,j}^2$ converges and $|e^{\left(1\right)}_i(x_1)|^2\le \|e^{\left(1\right)}_i\|^2_1\| {\cal K}_1(x_1,\cdot)\|^2_1=\| {\cal K}_1(x_1,\cdot)\|^2_1$.) We can calculate the scalar product

\begin{align*} \langle g(x_1,\cdot),e^{\left(2\right)}_k(\cdot)\rangle_2 &= \sum_{i\in {\cal I}_1,j\in {\cal I}_2}\langle c_{i,j}e^{\left(1\right)}_i(x_1)e^{\left(2\right)}_j(\cdot), e^{\left(2\right)}_k(\cdot)\rangle_2\\ &= \sum_{i\in {\cal I}_1}c_{i,k}e^{\left(1\right)}_i(x_1)\enspace. \end{align*}
This expression considered as a function of $x_1$ belongs to $ {\cal F}_1$. Clearly its scalar product with $e^{\left(1\right)}_l(\cdot)$ equals $c_{l,k}$. The coordinates $c_{i,j}$ are thus uniquely defined by the function $g$. We can define a scalar product as the sum of the products of corresponding coordinates. The resulting space is isomorphic to $l_2( {\cal I}_1\times {\cal I}_2)$. Let us denote if by $ {\cal F}_\otimes$. It is easy to see that it is complete and therefore Hilbert.

Let us show that $ {\cal F}_\otimes$ contains all products of the form $f_1(x_1)f_2(x_2)$, where $f_1\in {\cal F}_1$ and $f_2\in {\cal F}_2$. Indeed, if $f_1=\sum_{i\in {\cal I}_1}c^{\left(1\right)}_ie^{\left(1\right)}_i$ and $f_2=\sum_{i\in {\cal I}_2}c^{\left(2\right)}_ie^{\left(2\right)}_i$ then $f_1(x_1)f_2(x_2)= \sum_{i\in {\cal I}_1,j\in {\cal I}_2}c^{\left(1\right)}_ic^{\left(2\right)}_je^{\left(1\right)}_i(x_1)e^{\left(2\right)}_j(x_2)$ and the sum

\begin{equation*} \sum_{i\in {\cal I}_1,j\in {\cal I}_2}(c^{\left(1\right)}_ic^{\left(2\right)}_j)^2= \sum_{i\in {\cal I}_1}(c^{\left(1\right)}_i)^2\sum_{j\in {\cal I}_2}(c^{\left(2\right)}_j)^2<+\infty \end{equation*}
converges. Finite linear combinations of functions of this form are dense in $ {\cal F}_\otimes$; indeed, the expression for $g(x_1,x_2)=\sum_{i\in {\cal I}_1,j\in {\cal I}_2} c_{i,j}e^{\left(1\right)}_i(x_1)e^{\left(2\right)}_j(x_2)$ can be truncated to a finite number of terms. The scalar product we introduced on functions $f_1(x_1)f_2(x_2)$ coincides with the product in $ {\cal F}_\otimes$. Indeed, let $f_1=\sum_{i\in {\cal I}_1}c^{\left(1\right)}_ie^{\left(1\right)}_i$, $f_2=\sum_{i\in {\cal I}_2}c^{\left(2\right)}_ie^{\left(2\right)}_i$, $g_1=\sum_{i\in {\cal I}_1}d^{\left(1\right)}_ie^{\left(1\right)}_i$, and $g_2=\sum_{i\in {\cal I}_2}d^{\left(2\right)}_ie^{\left(2\right)}_i$. We have
\begin{align*} \langle f_1f_2,g_1,g_2\rangle_{ {\cal F}_\otimes} &= \langle \sum_{i\in {\cal I}_1,j\in {\cal I}_2}c^{\left(1\right)}_ic^{\left(2\right)}_je^{\left(1\right)}e^{\left(2\right)}_j, \sum_{i\in {\cal I}_1,j\in {\cal I}_2}d^{\left(1\right)}_id^{\left(2\right)}_je^{\left(1\right)}e^{\left(2\right)}_j \rangle_{ {\cal F}_\otimes}\\ &= \sum_{i\in {\cal I}_1,j\in {\cal I}_2}c^{\left(1\right)}_ic^{\left(2\right)}_jd^{\left(1\right)}_id^{\left(2\right)}_j \\ &= \sum_{i\in {\cal I}_1}c^{\left(1\right)}_id^{\left(1\right)}_i\sum_{j\in {\cal I}_2}c^{\left(2\right)}_jd^{\left(2\right)}_j\\ &= \langle f_1,g_1\rangle_1\langle f_2,g_2\rangle_2\enspace. \end{align*}
By linearly this property can be extended to finite linear combinations.

One can conclude, by continuity, that

$$ {\cal K}((x'_1,x'_2),(\cdot,\cdot))= {\cal K}_1(x'_1,\cdot) {\cal K}_2(x'_2,\cdot) $$
is a reproducing kernel for the Hilbert space of functions $ {\cal F}_\otimes$.

8.  Some Popular Kernels

In this section we summarise some sufficient conditions for a function to be a kernel and obtain some popular kernels.

Let us recall the facts we know and formulate simple corollaries.

Theorem 16. A function $ {\cal K}: X\times X\to\mathbb R$ is a kernel if and only if it is symmetric and positive-semidefinite.

This is Theorem 4 or the existence theorem.

Corollary 5. If $ {\cal K}$ is a kernel and $a>0$, then $ a{\cal K}$ is a kernel.

Corollary 6. If $ {\cal K}_1$ and $ {\cal K}_2$ are kernels on the same set, then their sum is a kernel on the set.

Corollary 7. The pointwise limit of kernels is a kernel.

Proof: Let $ {\cal K}: X\times X\to\mathbb R$ be the pointwise limit of $ {\cal K}_n: X\times X\to\mathbb R$ and each $ {\cal K}_n$ is a kernel.

Because $ {\cal K}_n$ is symmetric, $ {\cal K}$ is symmetric too. For any finite array $x_1,x_2,\ldots,x_k\in X$ and $v_1,v_2,\ldots,v_k\in\mathbb R$ we have $\sum_{i,j=1}^n v_i {\cal K}_n(x_i,x_j)v_j\ge 0$. As $n\to+\infty$, we get $\sum_{i,j=1}^n v_i {\cal K}(x_i,x_j)v_j\ge 0$ and thus $ {\cal K}$ is positive semi-definite. $\Box$

Theorem 17. If $ {\cal K}_1$ and $ {\cal K}_2$ are kernels on the same set, then their product is a kernel on the set.

This was proven as Corollary 3.

Corollary 8. Let $ {\cal K}$ be a kernel and $p(x)$ is a polynomial with positive coefficients. Then $p( {\cal K})$ is a kernel.

Note the requirement that the coefficients should be positive. Indeed, $- {\cal K}$ is not a kernel unless $ {\cal K}$ is identically equal to zero.

Corollary 9. If $ {\cal K}$ is a kernel, then $e^ {\cal K}$ is a kernel.

Proof: The exponent can be decomposed into Taylor's series on $\mathbb R$. The function $e^x$ is a pointwise limit of polynomials $\sum_{i=1}^nx^n\slash{}n!$ on $\mathbb R$ and all these polynomials have positive coefficients. $\Box$

Theorem 18. A function $ {\cal K}:X\times X$ is a kernel if and only if there is a mapping $\Phi: X\to H$, where $H$ is a vector space with a scalar product $\langle\cdot,\cdot\rangle$, such that $ {\cal K}(x_1,x_2)=\langle\Phi(x_1),\Phi(x_2)\rangle$.

Corollary 10. If $f:X\to\mathbb R^n$ is a real vector-function on $X$, then

$$ {\cal K}(x_1,x_2)=(f(x_1))'f(x_2) $$
is a kernel on $X$.

Corollary 11. The function $ {\cal K}(x_1,x_2)=x'_1x_2$ is a kernel on $\mathbb R^n$, $n>0$.

This fact is obvious and can be derived by many different ways.

We can now obtain Vapnik's inhomogeneous polynomial kernel.

Corollary 12. For any positive integer $d$ the function $ {\cal K}(x_1,x_2)=(1+x'_1x_2)^d$ is a kernel on $\mathbb R^n$.

Proof: Indeed, $x'_1x_2$ is a kernel and $(1+x)^d$ is a polynomial with positive coefficients. $\Box$

The following corollary introduces radial-basis function (rbf) kernel.

Corollary 13. For any $\sigma\ne 0$, the function $ {\cal K}(x_1,x_2)=e^{-\|x_1-x_2\|^2\slash{}(2\sigma^2)}$, where $\|\cdot\|$ is the Euclidean norm, is a kernel on $\mathbb R^n$.

Proof: We have

$$ e^{-\frac{\|x_1-x_2\|}{2\sigma^2}}=e^{-\|x_1\|^2\slash{}(2\sigma^2)}e^{-\|x_2\|^2\slash{}\left(2\sigma^2\right)} e^{x'_1x_2\slash{}\sigma^2}x\enspace. $$
The product of the first two terms $e^{-\|x_1\|^2\slash{}(2\sigma^2)}e^{-\|x_2\|^2\slash{}(2\sigma^2)}$ is a kernel generated by the mapping $f(x)=e^{-\|x\|^2\slash{}(2\sigma^2)}$. The third term $e^{x'_1x_2\slash{}\sigma^2}$ is a kernel as the exponent of a kernel. The product of two kernels is also a kernel. $\Box$

1 As different, for example, to the classification problem, where $y$s belong to a finite set.

2 Indeed, let $A$ be positive definite. If $A$ is singular, $Av=0$ for some $v\ne 0$, but this implies $v'Av=0$.

3 The condition $\mathop{{\mathrm{rank}}}\nolimits X<n$ is not just sufficient, but necessary for $X'X$ to be singular. Indeed, if $X'X$ is singular, it is not positive definite and $\xi'X'X\xi=(X\xi)'(X\xi)=0$ for some $\xi\ne 0$. This implies $X\xi=0$, i.e., $n$ columns of $X$ are linearly dependent.

4 It is important that no particular structure is postulated on $X$; throughout the most of this article it is just a set.

5 A definition and a discussion were given above.

6 We have $\xi'K\xi=\|\sum_{i=1}^T\xi_i\Phi(x_i)\|^2\ge 0$.

7 Thus $L_2$ is not an RKHS according to our definition: it is disqualified without a substantive consideration.

8 The book [Schölkopf and Smola 2002] uses this argument in Section 2.2.3, p. 35-36, in a rather misleading way.

9 Proposition 2.14 from [Schölkopf and Smola 2002] states that continuous maps exist with a reference to [Parthasary and Smith, 1972], where the stronger statement is proven, but the weaker statement is claimed.

10 Aronszajn uses the term "direct product". The reason why we use the term "tensor product" will become apparent later

11 Paper [Aronszajn, 1950] considers only the countable case and its proof is formally valid only for separable RKHSs. However minor amendments can make it valid everywhere.