# Prediction With Expert Advice Under Discounted Loss

The paper applies the idea of Shortcut Defensive Forecasting to prediction under discounted loss and proposes a corresponding modification of the Aggregating Algorithm. The arXiv technical report gives a more detailed description of the modification and of the theorems that can be proved.

The paper considers the problem of prediction with expert advice under discounted loss. The cumulative discounted loss $\mathcal{L}_t$ is defined recursively as $\mathcal{L}_t = \alpha_{t-1} \mathcal{L}_{t-1} + \lambda(\gamma_t,\omega_t)$ for predictions $\gamma_t\in\Gamma$, outcomes $\omega_t \in \Omega$, a loss function $\lambda$ (mixable or not), and discounting factors $\alpha_t\in[0,1]$. When $\alpha_t=1$ for all $t$, there is no discounting. The case $\alpha_t = \alpha\in(0,1)$ is the most popular and is called exponential discounting. The following theorem is the main result of the paper (it is stated here only for $\eta$-mixable loss functions; the paper treats the general case). It holds when the number of experts is finite.
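The recursion for $\mathcal{L}_t$ can be illustrated with a minimal Python sketch. The function name `discounted_loss`, the list layout of `alphas`, and the starting value $\mathcal{L}_0 = 0$ are assumptions made here for illustration; they are not taken from the paper.

```python
def discounted_loss(losses, alphas):
    """Cumulative discounted loss L_T from L_t = alpha_{t-1} * L_{t-1} + loss_t.

    losses[i] is the loss suffered at step i + 1.
    alphas[i] is the discounting factor alpha_{i+1}; only alpha_1..alpha_{T-1}
    are used. L_0 = 0 is an assumption of this sketch.
    """
    total = 0.0
    for i, loss in enumerate(losses):
        if i > 0:
            total *= alphas[i - 1]  # discount everything accumulated so far
        total += loss               # add the current, undiscounted loss
    return total


# Exponential discounting with alpha = 0.5 and unit losses at three steps:
# L_1 = 1, L_2 = 0.5 * 1 + 1 = 1.5, L_3 = 0.5 * 1.5 + 1 = 1.75
print(discounted_loss([1.0, 1.0, 1.0], [0.5, 0.5]))
```

With all factors equal to 1 the function returns the ordinary cumulative loss, which matches the no-discounting case described above.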

**Theorem.**
Learner has a strategy guaranteeing that, for any $T$ and for any $k\in\{1,\ldots, K\}$, it holds

The algorithm works as follows.

For non-mixable but convex and bounded games, a modification of the Weak Aggregating Algorithm is used to prove an upper bound on the discounted loss of the learner.

**Theorem.** Suppose that $(\Omega,\Gamma,\lambda)$ is a non-empty convex game
and $\lambda(\gamma,\omega)\in[0,1]$ for all $\gamma\in\Gamma$ and $\omega\in\Omega$.
Learner has a strategy guaranteeing that, for any $T$ and for any $k\in\{1,\ldots, K\}$, it holds

where $\beta_t=1/(\alpha_1\cdots\alpha_{t-1})$ and $B_T=\sum_{t=1}^T \beta_t$.

For the discounted square loss function, it is possible to compete with linear and kernelized linear experts in the online regression framework. Denote by $X$ the $T\times n$ matrix whose rows are the transposed input vectors $x_1',\ldots,x_T'$. Let also $W_T = \mathrm{diag}(\beta_1/\beta_T,\beta_2/\beta_T,\ldots,\beta_T/\beta_T)$, i.e., $W_T$ is a $T \times T$ diagonal matrix.

**Theorem.**
For any $a > 0$,
there exists a prediction strategy for Learner in the online regression protocol
achieving, for every $T$
and for any linear predictor $\theta \in \mathbb{R}^n$,

If, in addition, $\|x_t\|_\infty \le Z$ for all $t$, then

Interestingly, the time variable $T$ that appears in the bound for Aggregating Algorithm Regression is replaced here by $\frac{\sum_{t=1}^T \beta_t}{\beta_T}$. For exponential discounting, $\alpha_t = \alpha\in(0,1)$ for all $t$, this gives $\beta_t = \alpha^{-(t-1)}$ and hence $\frac{\sum_{t=1}^T \beta_t}{\beta_T} = \sum_{t=1}^T \alpha^{T-t} = \frac{1-\alpha^{T}}{1-\alpha}$.
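The ratio $\frac{\sum_{t=1}^T \beta_t}{\beta_T}$ can be evaluated numerically from the definition $\beta_t=1/(\alpha_1\cdots\alpha_{t-1})$. The sketch below uses hypothetical helper names (`betas`, `effective_horizon`) that do not come from the paper. It suggests how the ratio behaves: it equals $T$ without discounting, while for a constant $\alpha<1$ it is a geometric sum and stays below $1/(1-\alpha)$.

```python
def betas(alphas, T):
    """Return [beta_1, ..., beta_T] with beta_t = 1 / (alpha_1 * ... * alpha_{t-1}).

    alphas[i] holds alpha_{i+1}; beta_1 = 1 is the empty product.
    All factors used must be non-zero.
    """
    out = [1.0]
    for t in range(1, T):
        out.append(out[-1] / alphas[t - 1])  # beta_{t+1} = beta_t / alpha_t
    return out


def effective_horizon(alphas, T):
    """The quantity sum_{t=1}^T beta_t / beta_T that replaces T in the bound."""
    b = betas(alphas, T)
    return sum(b) / b[-1]


# No discounting: the ratio is just the number of steps.
print(effective_horizon([1.0] * 9, 10))

# Exponential discounting: the ratio saturates as T grows.
for T in (1, 10, 100):
    print(T, effective_horizon([0.9] * (T - 1), T))
```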

The method that achieves the bound for linear regression is similar to online weighted least squares.
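For orientation, here is a minimal sketch of plain discounted ridge regression, that is, weighted least squares with the weights $\beta_t/\beta_T$ from $W_T$ and a ridge term $a\|\theta\|^2$. It is not the paper's prediction strategy, only the classical estimator it is said to resemble. The function names and the choice to leave the ridge term undiscounted are assumptions of this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def discounted_ridge(xs, ys, alphas, a):
    """Return theta = (a I + X' W_T X)^{-1} X' W_T y.

    xs[i], ys[i] are the input vector and outcome at step i + 1;
    alphas[i] holds alpha_{i+1}. The sums X' W_T X and X' W_T y are
    maintained online by discounting the running totals at each step.
    """
    n = len(xs[0])
    S = [[0.0] * n for _ in range(n)]  # running X' W_t X
    v = [0.0] * n                      # running X' W_t y
    for i, (x, y) in enumerate(zip(xs, ys)):
        d = alphas[i - 1] if i > 0 else 1.0  # alpha_{t-1}; nothing to discount at t = 1
        for r in range(n):
            v[r] = d * v[r] + y * x[r]
            for c in range(n):
                S[r][c] = d * S[r][c] + x[r] * x[c]
    A = [[S[r][c] + (a if r == c else 0.0) for c in range(n)] for r in range(n)]
    return solve(A, v)


# The data follow y = 2x for 20 steps and then switch to y = -x for 20 steps.
xs = [[1.0]] * 40
ys = [2.0] * 20 + [-1.0] * 20
print(discounted_ridge(xs, ys, [1.0] * 39, 0.1))  # no discounting: averages both regimes
print(discounted_ridge(xs, ys, [0.5] * 39, 0.1))  # discounting: follows the recent regime
```

Keeping only the discounted sums `S` and `v` makes each update cost $O(n^2)$ regardless of $T$, which is the usual reason for writing weighted least squares in this online form.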

- Alexey Chernov and Fedor Zhdanov. Prediction with expert advice under discounted loss. In
*Proceedings of the 21st International Conference on Algorithmic Learning Theory*, 2010.