Hedge Algorithm

Hedge Algorithm is an algorithm for prediction with expert advice. Initially it was formulated for the different framework, than usual framework for the prediction game.

The outcome space $\Omega=[0,1]^K$. At each step gambler has a capital 1, which he distributes between $K$ of his friends in correspondence with fractions $\gamma_k: \sum_{k=1}^K \gamma_k = 1$. For each step each element $\omega_k$ of the oucome space is the loss of $k$-th gambler's friend, if the gambler gave him all his money. Thus the loss of the gambler is $\sum_{i=1}^K \gamma_k \omega_k$. For the start of the game the weights are initialized by some probability distribution, like $w_k=1\slash{}K$. For each step the protocol of the algorithm is the following for $\beta \in [0,1]$:

- Receive vector $\omega_k$ from the environment.
- Suffer loss $\sum_{i=1}^K \gamma_k \omega_k$.
- Set the new weights to be $w_k := w_k \beta^{\omega_k}$, and normalize them.

In Freund and Schapire (1997, theorem 2) it is proven that for any sequence of outcomes

\[
  L_T \le \frac{\ln 1\slash{}\beta}{1-\beta} L_T(k) + \frac{\ln K}{1-\beta}
\]

for all $T$ and $k$, where $L_T$ is the loss suffered by the algorithm over the first $T$ trials, and $L_T(k)=\sum_{t=1}^T \omega_k^t$ is the loss (or expected loss) suffered by the $k$th expert over the first $T$ trials.

The inequality above can be improved using Strong Aggregating Algorithm, as in Vovk (1998):

\[
  L_T \le c(\beta) L_T(k) + \frac{c(\beta)}{\ln(1\slash{}\beta)}\ln K,
\]

where $c(\beta)=(\ln \frac{1}{\beta})\slash{}(K \ln \frac{K}{K+\beta-1})$. It is interesting whether it can be improved further.

The same method can be applied if the experts and the algorithm provide the probability distribution over the outcome space, and they suffer the expected loss of a decision randomly selected according to this distribution. In this case the algorithm predicts simply the weighted average of the experts predictions. In this case the loss function should be only bounded (one can compare with Weak Aggregating Algorithm). The theoretical bound for the loss of the Hedge algorithm is

\[
  L_T \le L_T(k)+ \sqrt{2L\ln K} + \ln K
\]

for all $T$ and $k$, where $L_T$ is the loss (or expected loss) suffered by the algorithm over the first $T$ trials, and $L_T(k)$ is the loss (or expected loss) suffered by the $k$th expert over the first $T$ trials. The constant $L$ is a prior upper bound on the loss of the best strategy, that in the worst case is $T\widetilde{L}$, where $\widetilde{L}$ is the bound for the loss used. This algorithm differs from the Weighted Average Algorithm by the learning rate for updating weights. The weights are updated by the rule: $w_k:=w_k \beta^{\lambda_k}$, where $\beta=\frac{1}{1+\sqrt{2 \ln K \slash{}L }}$, and then they are normalized. The bounds for some particular loss functions, like square-loss function, can be easily derived from the bound described above using geometrical inequalities.

Bibliography

  • Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139, (1997).
  • Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56:153 - 173, 1998.