21 To see how this works mathematically, consider the definition of RPE (given in footnote 20 above) in the case where the obtained reward is zero. It makes sense to consider zero reward because it is generally agreed that rewards are temporally sparse (mostly zero), often extremely sparse, so most of the time the RPE is simply the difference between the state-values of two states (before and after, or past and present), possibly with the latter discounted. Recall that the state-value is nothing other than the predicted total future reward. Thus, recalling that the sign in this conventional definition of RPE is the wrong way around for our purposes, RPE defines frustration as V before - V after, which is exactly the decrease in predicted total future reward between the previous time step and the present one. Such a decrease is possible when the agent receives new information (which implies, in the basic formalism, that it finds itself in a new state incorporating that information) and that information makes it revise its prediction downward (it switches to the prediction given by the new state it finds itself in). Thus, when the prediction decreases in the absence of any obtained reward, the reward loss, that is, the negative part of RPE, is equal to the decrease in the prediction of total future reward. This is how RPE can define frustration from predictions alone, without any reward currently expected. One might think that reward loss could do the same if we simply changed the time scale: in the robot example, if we take the expected and obtained reward over, say, one whole hour, that would arguably yield a reward loss, since the robot expected to collect dust during that hour but did not.
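The computation described above can be sketched in a few lines of code. This is a minimal illustration, not the author's formalism: the function implements the conventional TD-style RPE, and all numerical values (the robot's predictions of 10 and 4 units of future reward) are hypothetical.

```python
# Minimal sketch of a TD-style reward prediction error (RPE) in the
# zero-reward case discussed above. Numbers are hypothetical.

def rpe(r, v_before, v_after, gamma=1.0):
    """Conventional RPE: obtained reward plus the (discounted) new
    prediction of total future reward, minus the old prediction."""
    return r + gamma * v_after - v_before

# The robot predicted 10 units of future reward (dust) in its old state;
# new information (say, an empty room) revises the prediction down to 4.
delta = rpe(r=0.0, v_before=10.0, v_after=4.0)  # -> -6.0

# With the sign flipped for our purposes, frustration is exactly the
# decrease in predicted total future reward (V before - V after),
# even though no reward was expected at this very moment.
frustration = -delta  # -> 6.0
```

Note that with `r = 0` and no discounting, the RPE reduces to `v_after - v_before`, so a negative RPE is precisely a downward revision of the prediction.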
However, RPE makes its computations independently of any such time scale (since it looks at the total expected future reward, it in effect takes the whole future into account), and moreover, such long-term reward loss would not occur before that hour has passed, whereas RPE signals frustration the very moment the new information has arrived and been processed. (As a minor terminological point, it may be slightly misleading to speak of “reward prediction error” here, since RPE in this case is rather a change in predictions due to new observations; a non-zero RPE does not necessarily imply that there was any error, but simply a change, an update of the prediction based on new information.)