The AA And Laissez-Faire Investment

The aggregating algorithm was proposed by V.Vovk in [Vovk, 1990] (see also [Vovk and Watkins, 1998] and [Vovk, 1998]. This note aims to explain some intuition behind the aggregating algorithm and to obtain it as an extension of a natural investment strategy.

1.  Sequential Investment

1.1  Sequential Investment: the Setup

Suppose that there are $M$ shares (numbered $0,1,\ldots,M-1$) we can invest into. The prices of shares change in discrete time. Let the vector $x_t$ represents the proportional changes of the values of the shares between the moments $t-1$ and $t$; namely, let $x_{t,j}$ be the ratio of the prices of share $i$ at the moments $t$ and $t-1$, $t=1,2,\ldots$ We assume that all components of $x_t$ are non-negative, $x_{t,j}\ge 0$.

We have a sum of money to invest in the shares. An investment decision can be represented by a vector showing how we split the capital among the shares. On step $t-1$ we spend the fraction $\gamma_{t,j}\in [0,1]$ of our capital to buy share $j$, $j=1,2,\ldots,M$. Assume that we do not keep any money in cash (or that cash is actually one of the shares) and therefore $\sum_{j=0}^{M-1}\gamma_{t,i}=1$. Possible investment decisions thus fill in the simplex.

Suppose that we have capital $W_{t-1}$ on step $t-1$. We buy $W_{t-1}\gamma_{t,j}$ worth 1 of share $j$. As the price changes between times $t-1$ and $t$, this sum becomes $W_{t-1}\gamma_{t,j}x_{t,j}$. The total amount of money at our disposal thus changes from $W_{t-1}$ to $W_t=W_{t-1}\langle \gamma_t,x_t\rangle$. We can sell the shares, consolidate the capital, and then make a new investment decision 2. If the initial capital is $W_0=1$, then after $T$ steps we get $W_T=\prod_{t=1}^T\langle \gamma_t,x_t\rangle$.

It is typical to make assumptions about the behaviour of the prices, i.e., price changes $x_t$. The Black-Scholes theory postulates that a share price follows a geometric Brownian motion. The Capital Asset Price Model is less restrictive, but it still assumes that prices are random variables with means and variances. We will not be making any assumptions of this kind. Instead we are interested in properties of investment strategies that hold in the worst case, i.e., for all possible changes of stock prices.

1.2  Experts

Suppose that there are $N$ experts $ {\cal E}^{\left(1\right)}, {\cal E}^{\left(2\right)},\ldots, {\cal E}^{\left(N\right)}$ that suggest investment decisions to us. Before deciding on $\gamma_t$, we can observe decisions $\gamma_t^{\left(i\right)}$, $i=1,2,\ldots,N$, output by experts. We make no assumptions as to how the experts arrive at their decisions. They may be algorithms of arbitrary nature and complexity, rely on side information etc. We treat them as black boxes producing decisions. Our goal is to merge their decisions in such a way so that our capital $W_t$ is not much less than the capital of any expert $W_t^{\left(i\right)}$, $i=1,2,\ldots,N$ (by $W_t^{\left(i\right)}$ we mean the capital we would have earned by starting from 1 and following the suggestions of expert $i$ all the time); in other terms we want an inequality of the type $W_t\gtrsim W_t^{\left(i\right)}$ to hold uniformly for all $i$ and possibly for all $t$.

We can think of a merging strategy in the following terms. At time $t-1$ we split our capital into $N$ parts and invest the $i$th part as suggested by expert $i$. Let $p_t^{\left(i\right)}$ represent the share of the wealth invested according to the suggestions of expert $i$.

We need to decide what share of the capital to entrust to expert $i$ on step $t$. If an expert is doing well, it is natural to increase its share, and if an expert is doing badly then to decrease it. This can be achieved in the following natural way. Let us give each expert a share of the capital at the start and let it operate on what it has earned. This laissez-faire approach ensures that successful experts will have a higher influence.

If expert $i$ gets the share $p^{\left(i\right)}$ of the initial capital $W_0=1$ and $W_t^{\left(i\right)}$ is as above, our capital at time $t$ equals $W_t=\sum_{i=1}^Np^{\left(i\right)}W_t^{\left(i\right)}$. We thus get

\begin{equation*} W_t\ge p^{\left(i\right)}W_t^{\left(i\right)}  \end{equation*}
for all $i=1,2,\ldots,N$. If we give each expert an equal share $1\slash{}N$, then we get
\begin{equation*} W_t\ge \frac 1NW_t^{\left(i\right)}\enspace. \end{equation*}
Note that the bounds hold for all times $t$ and all possible changes in stock prices. We do not even need to know $t$ in advance; the strategy is the same no matter at what point in the future we are going to check its performance.

It is easy to get a formula for $p_t^{\left(i\right)}$, the share of our wealth in possession of expert $i$ after time $t$. We have

\begin{align*} p_t^{\left(i\right)} &= p^{\left(i\right)}W_t^{\left(i\right)}\slash{}W_t\\ &= p^{\left(i\right)}W_t^{\left(i\right)}\slash{}\sum_{i=1}^Np^{\left(i\right)}W_t^{\left(i\right)}\enspace. \end{align*}
In fact giving a share of the wealth to an expert is a metaphor. We need to produce one investment decision, $\gamma_t$, and the formula for it is $\gamma_t=\sum_{i=1}^Np_{t-1}^{\left(i\right)}\gamma_t^{\left(i\right)}$.

One can update the weights in a simple way. To get the weights for step $t$ we multiply each $p_{t-1}^{\left(i\right)}$ by $\langle \gamma_{t}^{\left(i\right)},x_{t}\rangle$ and then normalise the vector to ensure the weights sum up to 1.

1.3  Cover's Universal Portfolios

The bound we have obtained is rather trivial. However it is possible to obtain more involved results arguing in this way.

We can consider some class of investment strategies as experts and obtain a universal strategy for this class. Interesting classes of strategies are infinite. One can easily expand the above argument to a countable class of experts. The extension to uncountable classes is less straightforward. The sum will be replaced by an integral and we will not be able to obtain a lower bound by simply dropping all terms except for one from the sum. One will need to estimate the integral instead.

Interesting results were obtained in this fashion in [Cover and Ordentlich, 1996], where the class of constant rebalanced portfolios is considered.

A constant rebalanced portfolio is a strategy that sticks to a fixed partitioning of capital among $M$ shares, which is maintained all the time. Consider a partitioning $\gamma=(\gamma_1,\gamma_2,\ldots,\gamma_M)$, $\sum_{j=1}^M\gamma_j=1$. The constant rebalanced portfolio using this $\gamma$ requires that on each step the share $\gamma_i$ of our total wealth is invested in share $i$. Suppose that we have invested the money accordingly at time $t-1$. By the time $t$ the share prices will change and this requirement will no longer hold, so we will need to buy and sell some shares to ensure that the share $\gamma_i$ of our total wealth is invested in share $i$ for all $i$.

We thus need to rebalance the portfolio on each step, hence the word rebalanced. However the rebalancing seeks to achieve a fixed allocation of money to shares, hence the word constant.

The constant rebalanced portfolios fill the simplex of dimension $M-1$. In [Cover and Ordentlich, 1996] two distributions on the simplex are considered. If the distribution is uniform, we get the bound

\begin{equation*} W_t\ge \frac 1{(t+1)^{M-1}}W_\gamma \end{equation*}
and for the Dirichlet distribution with parameters $(1\slash{}2,\ldots,1\slash{}2)$ we get
\begin{equation*} W_t\ge \frac 1{2(t+1)^{\left(M-1\right)\slash{}2}}W_\gamma \end{equation*}
for all $t$ and all constant rebalanced portfolios $\gamma$.

Note that the coefficient depends on $t$ and decreases as $t$ increases. This is the price to pay for having a very large class of experts.

2.  Bayesian Estimation

The investment scenario and our laissez-faire strategy turn out to have an important special case.

Let us consider a restricted version of the above scenario. Suppose that there are two `shares' and that the vector of proportional changes can take two values, $(0,1)$ and $(1,0)$. Within the investment metaphor this means that one part of our investment always disappears and another part is returned to us intact.

There is a different interpretation of this scenario though. Suppose we need to output the probability of some event. The outcome $x_t=(0,1)$ corresponds to the event happening and $x_t=(1,0)$ corresponds to it not happening. The `investment' $\gamma_{t,1}$ can be thought of as the probability of it happening and $\gamma_{t,0}$ as the probability of it not happening (recall that $\gamma_{t,0}+\gamma_{t,1}=1$). Our `capital' changes by the factor equal to the probability we assigned to the outcome that actually occurs. We shall use the notation

$$ k_t= \begin{cases} 1, & \text{if~}x_t=(0,1);\\ 0, & \text{if~}x_t=(1,0). \end{cases} $$
This variable tells us whether the event has happened ($k_t=1$) or not ($k_t=0$).

We can think of experts $ {\cal E}^{\left(1\right)}, {\cal E}^{\left(2\right)},\ldots, {\cal E}^{\left(N\right)}$ as of $N$ hypotheses. Each hypothesis suggests a probability distribution: $\Pr(k_t=1\mid  {\cal E}_i)=\gamma_{t,1}^{\left(i\right)}$ and $\Pr(k_t=0\mid  {\cal E}_i)=\gamma_{t,0}^{\left(i\right)}$.

It is remarkable that the laissez-faire investment in this special case corresponds to the Bayesian approach. Let us think of the weights $p_{t-1}^{\left(1\right)},p_{t-1}^{\left(2\right)},\ldots,p_{t-1}^{\left(N\right)}$ as of prior probabilities we have for the hypothesis before step $t$. The Bayes rule gives us posterior probabilities

\begin{align*} p^{\left(i\right)}_{t}=\Pr( {\cal E}^{\left(i\right)}\mid x_t)&= \frac{\Pr(x_t\mid {\cal E}^{\left(i\right)})\Pr( {\cal E}^{\left(i\right)})}{\Pr(x_t)}\\ &\sim \gamma_{t,k_t}p_{t-1}^{\left(i\right)}\\ &= \langle \gamma_t^{\left(i\right)},x_t\rangle p_{t-1}^{\left(i\right)} \end{align*}
where $k_t$ is as above and $\sim$ denotes proportionality. To get $p_{t}$ one should normalise the products $\langle\gamma_t^{\left(i\right)},x_t\rangle p_{t-1}^{\left(i\right)}$ to ensure the results sum up to 1. Note that this is exactly the same procedure as described above for investments.

The distribution $\gamma_t$ is obtained from $\gamma_t^{\left(i\right)}$ be means of the product rule:

\begin{align*} \Pr(k_t=k)&=\sum_{i=1}^N\Pr(k_t=k\mid {\cal E}^{\left(i\right)})\Pr( {\cal E}^{\left(i\right)})\\ &=\sum_{i=1}^N\gamma_{t,k}p_{t-1}^{\left(i\right)}\enspace, \end{align*}
where $k=0,1$, i.e., $\gamma_t=\sum_{i=1}^Np_{t-1}^{\left(i\right)}\gamma_t^{\left(i\right)}$.

3.  General Case

In this section we will generalise the laissez-faire investment to obtain the aggregating algorithm.

3.1  Games and Losses

Let us introduce a general prediction scenario. Suppose that outcomes $\omega_t$ occur sequentially in discrete time: $\omega_1,\omega_2,\ldots$ We assume that outcomes are drawn from an outcome space $\Omega$. The learner outputs a prediction $\gamma_t$ before seeing $\omega_t$. Predictions may be taken from a prediction space $\Gamma$. The quality of the predictions is measured by a loss function $\lambda:\Omega\times\Gamma\to [0,+\infty)$. We want to minimise the cumulative loss $\mathop{{\rm{Loss}}}\nolimits(T)=\sum_{i=1}^T\lambda(\omega_t,\gamma_t)$. The triple $(\Omega,\Gamma,\lambda)$ describes a prediction environment; we will refer to it as a game.

One may think of prediction $\gamma_t$ as of an action taken in a situation of uncertainty. As the uncertainty is lifted, the learner confronts the reality $\omega_t$ and faces the consequences of its action represented by $\lambda(\omega_t,\gamma_t)$. We will however keep the prediction terminology.

The investment scenario with $M$ shares can be interpreted in this framework. Let the space of outcomes be $\Omega=[0,+\infty)^M$, the space of predictions be the simplex $\Gamma=\{\gamma=(\gamma_0,\gamma_1,\ldots,\gamma_M\in\mathbb R^M\mid \sum_{i=0}^{M-1}\gamma_j=1\}$, and the loss function be $\lambda(\omega,\gamma)=-\ln\langle\omega,\gamma\rangle$. This game is called Cover's game. The cumulative loss is the negative logarithm of the wealth: $\mathop{{\rm{Loss}}}\nolimits(T)=-\ln W_t$.

Simpler examples are provided by games with finite outcome spaces. In binary games there are two possible outcomes, $\Omega=\{0,1\}$, and predictions can be drawn from the unit interval $\left[0,1\right]$. The square-loss game has the loss function $\lambda(\omega,\gamma)=(\omega-\gamma)^2$, the absolute-loss game has the loss function $\lambda(\omega,\gamma)=|\omega-\gamma|^2$, and the logarithmic-loss game has

$$ \lambda(\omega,\gamma)= \begin{cases} -\ln\gamma, &  \text{if} \omega=1;\\ -\ln(1-\gamma), &  \text{if~}\omega=0. \end{cases} $$
The logarithmic-loss game corresponds to sequential estimation discussed above. The outcome $\omega_t$ can be interpreted at $k_t$ and the prediction $\gamma_t$ as $\gamma_{t,1}$.

In the problem of prediction with expert advice we have access to predictions of $N$ experts $ {\cal E}^{\left(1\right)}, {\cal E}^{\left(2\right)},\ldots, {\cal E}^{\left(N\right)}$ and want to suffer loss not much greater than the loss of each expert, i.e., to achieve $\mathop{{\rm{Loss}}}\nolimits(T)\lesssim \mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(T)$.

3.2  Extension of Laissez-Faire Investment

For the sequential investment scenario the laissez-faire strategy achieves the wealth $W_t\ge p^{\left(i\right)}W_t^{\left(i\right)}$ for all $i=1,2,\ldots,N$ and $t=1,2,\ldots$ Taking the logarithm yields $\mathop{{\rm{Loss}}}\nolimits(t)\le\mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t) +\ln (1\slash{}p^{\left(i\right)})$.

Let us try and extend the algorithm to an arbitrary game $(\Omega,\Gamma,\lambda)$. We need to consider notional wealth instead of loss. It is natural to define it as the exponent of $-\mathop{{\rm{Loss}}}\nolimits$. What shall be the basis of this exponent however? Let us postpone this decision till later and consider a parameter $\eta>0$. Take $W_t^{\left(i\right)}=e^{-\eta\mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t)}$ for expert $ {\cal E}_i$. After step $t+1$ it changes by the factor $e^{-\eta\lambda(\omega_{t+1},\gamma_{t+1}^{\left(i\right)})}$.

Let the learner share its initial wealth of $e^{-\eta\cdot 0}=1$ among the experts so that expert $ {\cal E}_i$ gets $p^{\left(i\right)}$. The laissez-faire investment achieves $W_t=\sum_{i=1}^Np^{\left(i\right)}W_t^{\left(i\right)}$ (cf. the pseudo-aggregating algorithm from [Vovk and Watkins, 1998] and Lemma 1). Let us turn this notional method into a real practical strategy by translating its terms into the prediction language.

In laissez-faire investment we maintain a set of weights for the experts. The weights before step $t$ correspond to the share of our wealth currently possessed by expert $ {\cal E}_i$ and can be obtained by normalising $p^{\left(i\right)}W_{t-1}^{\left(i\right)}$ so that they sum up to 1. This can be easily expressed in prediction terms: $p_{t-1}^{\left(i\right)}$ are obtained by normalising $p^{\left(i\right)}e^{-\eta\mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t-1)}$ so that $\sum_{i=1}^Np_{t-1}^{\left(i\right)}=1$.

In the sequential investment scenario we simply took our investment decision to be the weighted sum of experts' decisions: $\gamma_t=\sum_{i=1}^Np^{\left(i\right)}_{t-1}\gamma_t^{\left(i\right)}$. We do not have this linearity here. Let us write down a suitable replacement.

What we really need is to achieve the wealth greater than or equal to $W_t$, i.e., $e^{-\eta\mathop{{\rm{Loss}}}\nolimits(t)}\ge W_t$. One way to ensure this is to maintain the inequality $e^{-\eta\lambda(\omega_t,\gamma_t)}\ge W_t\slash{}W_{t-1}$ for all $t$. We have $W_{t-1}=\sum_{i=1}^Np_{t-1}^{\left(i\right)}W_{t-1}$, where $p_{t-1}^{\left(i\right)}W_{t-1}$ is the money in possession of expert $ {\cal E}_i$; on the next step this share of the money grows to $p_{t-1}^{\left(i\right)}W_{t-1}e^{-\eta\lambda(\omega_t,\gamma_t^{\left(i\right)})}$ and therefore

\begin{align*} \frac{W_t}{W_{t-1}} &=\frac{\sum_{i=1}^Np_{t-1}^{\left(i\right)}W_{t-1}e^{-\eta\lambda(\omega_t,\gamma_t^{\left(i\right)})}}{W_{t-1}}\\ &=\sum_{i=1}^Np_{t-1}^{\left(i\right)}e^{-\eta\lambda(\omega_t,\gamma_t^{\left(i\right)})}\enspace. \end{align*}
The inequality $e^{-\eta\lambda(\omega_t,\gamma_t)}\ge W_t\slash{}W_{t-1}$ thus transforms into
\begin{equation*} e^{-\eta\lambda(\omega_t,\gamma_t)}\ge \sum_{i=1}^Np_{t-1}^{\left(i\right)}e^{-\eta\lambda(\omega_t,\gamma_t^{\left(i\right)})} \end{equation*}
or, equivalently, into
\begin{equation*} \lambda(\omega_t,\gamma_t)\le -\frac 1\eta\ln\sum_{i=1}^Np_{t-1}^{\left(i\right)}e^{-\eta\lambda(\omega_t,\gamma_t^{\left(i\right)})}\enspace. \end{equation*}
Note that we do not know $\omega_t$ when we choose $\gamma_t$ so we must ensure that the inequality holds for all $\omega_t\in\Omega$.

Depending on the game and particular values of the variables, this may be possible or not. One can easily formulate a sufficient condition for the existence of $\gamma_t$.

Consider $\left[0,+\infty\right]^\Omega$, the linear space of all functions from $\Omega$ to $\left[0,+\infty\right]$ (if $\Omega$ is finite, it is finite-dimensional). The game $(\Omega,\Gamma,\lambda)$ defines subsets $P=\{f:\Omega\to[0,+\infty]\mid \exists\gamma\in\Gamma:\: f(\cdot)=\lambda(\omega,\cdot)\}$ and $S=\{f:\Omega\to[0,+\infty]\mid \exists g\in P \forall\omega\in\Omega:\: f(\omega)\ge g(\omega)\}$. The elements of $P$ may be identified with predictions $\gamma$: each element of $P$ may be thought as possible losses of a prediction. The elements of $S$ are called superpredictions: they majorise elements of $P$. Now consider the set $e^{-\eta S}$; it consists of functions $e^{-\eta f}$, where $f\in S$. If this set is convex, the game is called $\eta$-mixable.

Suppose that our game is $\eta$-mixable. One can easily check that for every array of non-negative weights $p^{\left(1\right)},p^{\left(2\right)},\ldots,p^{\left(N\right)}$ summing to 1 and every array of experts' predictions $\gamma^{\left(1\right)},\gamma^{\left(2\right)},\ldots,\gamma^{\left(N\right)}$ there is $\gamma\in\Gamma$ such that

\begin{equation*} \lambda(\omega,\gamma)\le -\frac 1\eta\ln\sum_{i=1}^Np^{\left(i\right)}e^{-\eta\lambda(\omega,\gamma^{\left(i\right)})}\enspace. \end{equation*}
for all $\omega\in\Omega$.

We can now formulate the aggregating algorithm for $\eta$-mixable games. Assign to the experts initial weights $p^{\left(i\right)}$. On step $t$ calculate the weights $p_{t-1}^{\left(i\right)}$ for the experts and find any $\gamma_t$ satisfying

\begin{equation*} \lambda(\omega,\gamma_t)\le -\frac 1\eta\ln\sum_{i=1}^Np_{t-1}^{\left(i\right)}e^{-\eta\lambda(\omega,\gamma_t^{\left(i\right)})} \end{equation*}
for all $\omega\in\Omega$. By reverting the above calculations one can easily check that the loss of this algorithm satisfies
\begin{align*} e^{-\eta\mathop{{\rm{Loss}}}\nolimits(t)}&\ge\sum_{i=1}^Np^{\left(i\right)}e^{-\eta\mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t)}\\ &\ge p^{\left(i\right)}e^{-\eta\mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t)} \end{align*}
\begin{equation*} \mathop{{\rm{Loss}}}\nolimits(t)\le \mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t)+\frac 1\eta\ln\frac 1{p^{\left(i\right)}}  \end{equation*}
for all $i=1,2,\ldots,N$.

Suppose that the game is not $\eta$-mixable, i.e., the convex hull $ {\cal H}(e^{-\eta S})$ is not a subset of $e^{-\eta S}$, or, equivalently, that $-\frac 1\eta \ln {\cal H}(e^{-\eta S})$ is not a subset of $S$.

The set $S$ is defined in such a way that for every $c>1$ we have $cS\subseteq S$. The set $\frac 1c S$ includes $S$ and may be larger than $S$. It may even include $-\frac 1\eta \ln {\cal H}(e^{-\eta S})$.

Let as assume that $S$ is closed and we can take the minimum of all $c\ge 1$ such that $-c\frac 1\eta  {\cal H}(e^{-\eta S})\subseteq S$. This minimum is denoted by $c(\eta)$. Now for every array of non-negative weights $p^{\left(1\right)},p^{\left(2\right)},\ldots,p^{\left(N\right)}$ summing to 1 and every array of experts' predictions $\gamma^{\left(1\right)},\gamma^{\left(2\right)},\ldots,\gamma^{\left(N\right)}$ there is $\gamma\in\Gamma$ such that

\begin{equation*} \lambda(\omega,\gamma)\le -c(\eta)\frac 1\eta\ln\sum_{i=1}^Np^{\left(i\right)}e^{-\eta\lambda(\omega,\gamma^{\left(i\right)})}\enspace. \end{equation*}
for all $\omega\in\Omega$ and the aggregating algorithm with a natural modification achieves the loss
\begin{equation*} \mathop{{\rm{Loss}}}\nolimits(t)\le c(\eta)\mathop{{\rm{Loss}}}\nolimits_{ {\cal E}_i}(t)+c(\eta)\frac 1\eta\ln\frac 1{p^{\left(i\right)}}  \end{equation*}
for all $i=1,2,\ldots,N$.

One can show (cf. [Vovk, 1998]) that for the square-loss game $c(\eta)=1$, i.e., the game is $\eta$-mixable, for $0<\eta\le 2$. The absolute-loss game is not mixable for any $\eta>0$; we have

\begin{equation*} c(\eta)=\eta\slash{}\left(2\ln\frac 2{1+e^{-\eta}}\right)>1\enspace. \end{equation*}
The logarithmic-loss and Cover's game are mixable for $0<\eta\le 1$ and the laissez-faire bound cannot be improved by the aggregating algorithm.

3.3  Optimality

We have constructed some algorithms so far and obtained performance bounds for them. We have not explained why those bounds and the algorithms are good in any sense. In [Vovk, 1998] an optimality result is obtained.

1 We assume that we can buy an arbitrary fraction of a share; in real life one can only buy multiples of some fixed amount, but we ignore this.

2 In real life buying and selling shares incurs transaction costs, and shares should not be bought or sold too often. We ignore this consideration in our analysis.