Prediction With Expert Advice Under Discounted Loss

The paper applies the idea of Shortcut Defensive Forecasting to competing under discounted loss and suggests a corresponding modification of the Aggregating Algorithm. The arXiv technical report contains a more detailed description of the modification and of the theorems that can be proved.

The problem of prediction with expert advice under discounted loss is considered. The cumulative discounted loss $L_t$ is defined by $L_t = \alpha_{t-1} L_{t-1} + \lambda(\gamma_t, \omega_t)$ for predictions $\gamma_t \in \Gamma$, outcomes $\omega_t \in \Omega$, a loss function $\lambda$ (mixable or not), and discounting factors $\alpha_t \in (0,1]$; the discounted loss $L_t^k$ of expert $k$ is defined analogously from his predictions $\gamma_t^k$. In the case when $\alpha_t = 1$ for all $t$ there is no discounting. The case when $\alpha_t = \alpha \in (0,1)$ for all $t$ is the most popular and is called exponential discounting. The following theorem is the main result of the paper (we state it here only for $\eta$-mixable loss functions, but the paper describes the general case). It holds when the number of experts $K$ is finite.
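
Unrolling this recursion (with $L_0 = 0$) makes the weighting explicit: the most recent loss always enters with weight one, while older losses are progressively damped,

$$L_T = \sum_{t=1}^{T}\Bigl(\prod_{s=t}^{T-1}\alpha_s\Bigr)\,\lambda(\gamma_t,\omega_t),$$

so under exponential discounting the loss of step $t$ is taken with weight $\alpha^{T-t}$.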

Theorem. Learner has a strategy guaranteeing that, for any $T$ and for any $k \in \{1, \ldots, K\}$, it holds

$$L_T \le L_T^k + \frac{\ln K}{\eta}.$$
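
For example, the square-loss game with predictions and outcomes in $[0,1]$ is $\eta$-mixable for $\eta = 2$, so the theorem gives $L_T \le L_T^k + \frac{\ln K}{2}$ for every $T$, whatever the sequence of discounts.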

The algorithm works as follows (here $\sigma$ is a substitution function, as in the standard Aggregating Algorithm: it maps the generalized prediction $g_t$ to a permitted prediction $\gamma_t \in \Gamma$ satisfying $\lambda(\gamma_t, \omega) \le g_t(\omega)$ for all $\omega \in \Omega$).

Initialize weights of experts $w_0^k := 1/K$, $k = 1, \ldots, K$.
FOR $t = 1, 2, \ldots$
Get discount $\alpha_{t-1} \in (0,1]$.
Get Experts' predictions $\gamma_t^k \in \Gamma$, $k = 1, \ldots, K$.
Calculate $g_t(\omega) = -\frac{1}{\eta}\ln\left(\sum_{k=1}^{K} \frac{1}{K}\,(w_{t-1}^k)^{\alpha_{t-1}}\, e^{-\eta\lambda(\gamma_t^k,\omega)}\right)$, for all $\omega \in \Omega$.
Output $\gamma_t := \sigma(g_t) \in \Gamma$.
Get $\omega_t \in \Omega$.
Update the weights $w_t^k := (w_{t-1}^k)^{\alpha_{t-1}}\, e^{\eta\lambda(\gamma_t,\omega_t) - \eta\lambda(\gamma_t^k,\omega_t)}$, $k = 1, \ldots, K$.
ENDFOR
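
As an illustration, the following is a minimal Python sketch of the algorithm for the square-loss game with predictions and outcomes in $[0,1]$ (mixable with $\eta = 2$). The NumPy implementation, the function name, and the endpoint-based substitution function $\gamma_t = 1/2 + (g_t(0) - g_t(1))/2$ clipped to $[0,1]$ are choices made for the sketch, not prescriptions of the paper.

    import numpy as np

    def discounted_aa_square_loss(expert_preds, outcomes, discounts, eta=2.0):
        """Discounted Aggregating Algorithm for the square-loss game on [0, 1].

        expert_preds[t, k] -- prediction of expert k at step t,
        outcomes[t]        -- outcome omega_t,
        discounts[t]       -- discount alpha_{t-1} revealed at step t.
        """
        T, K = expert_preds.shape
        w = np.full(K, 1.0 / K)                  # w_0^k := 1/K
        preds = np.zeros(T)
        for t in range(T):
            alpha = discounts[t]                 # alpha_{t-1} in (0, 1]
            gamma = expert_preds[t]              # gamma_t^k, k = 1, ..., K
            # generalized prediction g_t(omega) at the endpoints omega = 0 and omega = 1
            g0 = -np.log(np.sum(w**alpha / K * np.exp(-eta * gamma**2))) / eta
            g1 = -np.log(np.sum(w**alpha / K * np.exp(-eta * (gamma - 1.0)**2))) / eta
            # substitution function for the square-loss game on [0, 1]
            preds[t] = min(max(0.5 + (g0 - g1) / 2.0, 0.0), 1.0)
            omega = outcomes[t]
            # weight update: w_t^k := (w_{t-1}^k)^alpha * e^{eta*(learner's loss - expert's loss)}
            w = w**alpha * np.exp(eta * ((preds[t] - omega)**2 - (gamma - omega)**2))
        return preds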

For non-mixable but convex and bounded games, a modification of the Weak Aggregating Algorithm is used to prove an upper bound on the discounted loss of the learner.

Theorem. Suppose that $(\Omega, \Gamma, \lambda)$ is a non-empty convex game and $\lambda(\gamma, \omega) \in [0,1]$ for all $\gamma \in \Gamma$ and $\omega \in \Omega$. Learner has a strategy guaranteeing that, for any $T$ and for any $k \in \{1, \ldots, K\}$, it holds

$$L_T \le L_T^k + \sqrt{\ln K \cdot \frac{B_T}{\beta_T}}\,,$$

where $\beta_t = 1/(\alpha_1 \cdots \alpha_{t-1})$ and $B_T = \sum_{t=1}^{T} \beta_t$.
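
For exponential discounting $\alpha_t = \alpha \in (0,1)$ these quantities are explicit: $\beta_t = \alpha^{-(t-1)}$, so

$$\frac{B_T}{\beta_T} = \sum_{t=1}^{T} \alpha^{T-t} = \frac{1-\alpha^T}{1-\alpha} \le \frac{1}{1-\alpha},$$

and the regret term stays bounded as $T$ grows, in contrast with the order-$\sqrt{T \ln K}$ regret of the undiscounted Weak Aggregating Algorithm (the case $\alpha_t = 1$, where $B_T/\beta_T = T$).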

For the discounted square loss function, it is possible to compete with linear and kernelized linear experts in the online regression framework. Denote by $X$ the $T \times n$ matrix whose rows are the transposed input vectors $x_1', \ldots, x_T'$. Let also $W_T = \mathrm{diag}(\beta_1/\beta_T, \beta_2/\beta_T, \ldots, \beta_T/\beta_T)$, i.e., $W_T$ is a $T \times T$ diagonal matrix.
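
In this notation the weighted Gram matrix appearing in the bound below is simply the discounted sum of outer products of the inputs:

$$X' W_T X = \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}\, x_t x_t'.$$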

Theorem. For any $a > 0$, there exists a prediction strategy for Learner in the online regression protocol (with outcomes $y_t \in [Y_1, Y_2]$) achieving, for every $T$ and for any linear predictor $\theta \in \mathbb{R}^n$,

$$\sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2 \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\theta' x_t - y_t)^2 + a\|\theta\|^2 + \frac{(Y_2 - Y_1)^2}{4}\ln\det\left(\frac{X' W_T X}{a} + I\right).$$

If, in addition, $\|x_t\|_\infty \le Z$ for all $t$, then

$$\sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2 \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\theta' x_t - y_t)^2 + a\|\theta\|^2 + \frac{n(Y_2 - Y_1)^2}{4}\ln\left(\frac{Z^2}{a}\sum_{t=1}^{T} \frac{\beta_t}{\beta_T} + 1\right).$$

Interestingly, instead of the time variable $T$ (as in the bound for the Aggregating Algorithm Regression), we have $\sum_{t=1}^{T} \beta_t/\beta_T$. For exponential discounting, $\alpha_t = \alpha \in (0,1)$ for all $t$, this sum equals $\sum_{t=1}^{T} \frac{\beta_t}{\beta_T} = \frac{1-\alpha^T}{1-\alpha}$.

The method that achieves the bound for linear regression is similar to online weighted least squares.
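
To illustrate the similarity, here is a minimal Python sketch of online weighted (ridge) least squares with discounting; the function name, the NumPy implementation, and the choice to predict from past examples only are assumptions of the sketch, not the exact strategy of the paper.

    import numpy as np

    def discounted_ridge(xs, ys, discounts, a=1.0):
        """Online weighted least squares with discounting and ridge regularization a.

        xs[t]        -- input vector x_t,
        ys[t]        -- outcome y_t,
        discounts[t] -- discount alpha_{t-1} revealed at step t.
        """
        T, n = xs.shape
        S = np.zeros((n, n))       # discounted sum of x_s x_s'
        b = np.zeros(n)            # discounted sum of y_s x_s
        preds = np.zeros(T)
        for t in range(T):
            alpha = discounts[t]
            S, b = alpha * S, alpha * b                    # discount the past
            theta = np.linalg.solve(a * np.eye(n) + S, b)  # weighted ridge estimate
            preds[t] = xs[t] @ theta                       # predict before seeing y_t
            S += np.outer(xs[t], xs[t])                    # absorb the new example
            b += ys[t] * xs[t]
        return preds

The strategy in the paper is of the Aggregating Algorithm Regression type, so it differs in details; for instance, in the standard Aggregating Algorithm Regression the current input $x_t$ also enters the regularized Gram matrix used to form the prediction.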

  • Alexey Chernov and Fedor Zhdanov. Prediction with expert advice under discounted loss. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, 2010.