Conformal Predictor

1.  Definition

Suppose we have a training sequence of pairs

$\displaystyle{ (z_1,z_2,\ldots,z_n) = ((x_1, y_1), (x_2, y_2), \ldots, (x_n,y_n)) }$

called observations. The objects $x_i$ are elements of a measurable space $\mathbf{X}$ and the labels $y_i$ are elements of a measurable space $\mathbf{Y}$.

We call $\mathbf{Z}:=\mathbf{X}\times\mathbf{Y}$ the observation space, $\epsilon\in(0,1)$ the significance level, and the complimentary value $1 - \epsilon$ the confidence level.

A conformity measure is a measurable mapping \[ A: \mathbf{Z}^{(*)} \times \mathbf{Z} \to [-\infty, +\infty], \] where $\mathbf{Z}^{(*)}$ is the set of all bags (multisets) of elements of $\mathbf{Z}$. Intuitively, this function assigns a numerical score (sometimes called the conformity score) indicating how similar a new observation is to a multiset of old ones.

The conformal predictor determined by the conformity measure $A$ is a confidence predictor $\Gamma$ whose value on the training sequence $z_1,\ldots,z_n$ and a test object $x_{n+1}\in\mathbf{X}$ at a significance level $\epsilon$ is obtained by setting \begin{equation} \Gamma^\epsilon(z_1,\ldots,z_n,x_{n+1}) := \left\{ y\in\mathbf{Y} \mid \frac{|{i = 1, \ldots,n,n+1: \alpha_i^y \le \alpha^y_{n+1} }|}{n+1} > \epsilon \right\},\end{equation} where \[ \alpha^y_i := A(\{z_1,\ldots,z_{i-1}, z_{i+1},\ldots,z_n,(x_{n+1},y)\}, z_i), \quad i = 1,\ldots,n,\] \[ \alpha^y_{n+1} := A(\{z_1,\ldots,z_n\}, (x_{n+1},y)),\] and $\{\ldots\}$ designates the bag (multiset) of observations.

The standard assumption for conformal predictors is the randomness assumption (also called the i.i.d. assumption).

Conformal predictors can be generalized by inductive conformal predictors or Mondrian conformal predictors to a wider class of confidence predictors.

2.  Desiderata

2.1  Validity

All the statements in the section are given under the randomness assumption in the on-line prediction protocol.

The statement of validity is easiest for smoothed conformal predictors, for which \[ \Gamma^\epsilon(z_1,\ldots,z_n,x_{n+1}) := \left\{ y\in\mathbf{Y} \mid \frac{|\{i=1,\ldots,n+1:\alpha^y_i>\alpha^y_{n+1}\}|+\eta_y|\{i=1,\ldots,n+1:\alpha^y_i=\alpha^y_{n+1}\}|}{n+1} > \epsilon \right\},\] where the nonconformity scores $\alpha^y_i$ are defined as before and $\eta_y\in[0,1]$ is a random number (chosen from the uniform distribution on [0,1]. In the rest of this section we will assume that the random numbers $\eta_y$ used at different trials $n$ are independent among themselves and of the observations.

Theorem All smoothed conformal predictors are exactly valid, in the sense that, for any exchangeable probability distribution $P$ on $\mathbf{Z}^{\infty}$ and any significance level $\epsilon\in(0,1)$, the random variables $\text{err}_n^{\epsilon}(\Gamma)$, $n = 1, 2, \ldots$, are independent Bernoulli variables with parameter $\epsilon$, where $\text{err}_{n} ^{\epsilon}(\Gamma)$ is the random variable

$\displaystyle{ \text{err}_{n}^{\epsilon}(\Gamma)(z_1,z_2,\ldots) := \begin{cases} 1 & \text{if } y_{n+1} \not \in \Gamma^\epsilon(z_1,\ldots,z_n,x_{n+1})\\ 0 & \text{otherwise}. \end{cases}}$

The idea of the proof is quite simple.

Corollary All smoothed conformal predictors are asymptotically exact, in the sense that for any exchangeable probability distribution $P$ on $\mathbf{Z}^{\infty}$ and any significance level $\epsilon$, \[ \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \text{err}_n^{\epsilon}(\Gamma) = \epsilon\] with probability one.

Corollary All conformal predictors are asymptotically conservative, in the sense that for any exchangeable probability distribution $P$ on $\mathbf{Z}^{\infty}$ and any significance level $\epsilon$, \[ \limsup_{n\to \infty} \frac{1}{n} \sum_{i=1}^{n} \text{err}_n^{\epsilon}(\Gamma) \le \epsilon\] with probability one.

To put it simply, in the long run the frequency of erroneous predictions does not exceed $\epsilon$ at each confidence level $1 - \epsilon$.

One can also give a formal notion of validity for conformal predictors, although the usefulness of this notion is limited: it is intuitively clear that conformal predictors are valid or "more than valid" since the number of errors made by a conformal predictor never exceeds the number of errors made by the corresponding smoothed conformal predictor.

2.2  Efficiency

As conformal predictors are automatically valid, the main goal is to improve their efficiency (also known as predictive efficiency): to make the prediction sets conformal predictors output as small as possible. In classification problem, a natural measure of efficiency of conformal predictors is the number of multiple predictions - the number of prediction sets containing more than one label. In regression problems, the prediction set is often an interval of values, hence, a natural measure of efficiency of such predictors is the length of the interval. There are many possible criteria of efficiency.

There are separate articles about validity, efficiency, and a third desideratum, conditionality:

3.  Conformalizing specific machine-learning algorithms

Suppose we have a prediction algorithm that outputs a prediction $\hat y\in\mathbf{Y}$ for a test object $x\in\mathbf{X}$ given a training set $\{z_1,\ldots,z_n\}\in\mathbf{Z}^{(*)}$ (notice that it is assumed that $\hat y$ does not depend on the ordering of $(z_1,\ldots,z_n)$). The function \[ A(\{z_1,\ldots,z_n\},(x,y)) := y - \hat y\] is a conformity measure, for any method of computing $\hat y$ from $\{z_1,\ldots,z_n\}$, $x$, and $y$. The conformal predictor determined by this conformity measure can be referred to as the conformalization of the original algorithm. More generally, we can set \[ A(\{z_1,\ldots,z_n\},(x,y)) := (y - \hat y) / \sigma_y,\] where $\sigma_y$ is a measure of precision of $\hat y$ (computed from $\{z_1,\ldots,z_n\}$, $x$, and $y$).

For many non-trivial machine-learning algorithms, their conformalizations will be computationally inefficient (and then alternative methods such as inductive conformal predictors or aggregated conformal predictors) should be used. However, for some important algorithms their conformalizations can be computed efficiently (at least to some degree): see, e.g.,

It is also known that K-nearest neighbours classification and regression are conformalizable.

4.  Conformal transducers and predictive systems

In many cases, it is more convenient to consider conformal transducers which output, for each training set $z_1,\ldots,z_n$, each test object $x_{n+1}\in\mathbf{X}$, and each potential label $y\in\mathbf{Y}$ for $x_{n+1}$ the p-value given by \[ p^y := \frac{|\{i=1,\ldots,n+1:\alpha^y_i>\alpha^y_{n+1}\}|+\eta_y|\{i=1,\ldots,n+1:\alpha^y_i=\alpha^y_{n+1}\}|}{n+1}\] (cf. the definition of the smoothed conformal predictor). Conformal predictors and conformal transducers can be regarded as different ways of packaging the same object. For further details, see Vovk et al. (2005).

Conformal transducers can be used for solving problems of probabilistic regression, giving rise to conformal predictive systems.