A multitude of algorithms for learning the state-values exist; see Sutton and Barto (2018) for a comprehensive treatment. Typical algorithms proceed by a recursion in which the value of a state is defined in terms of the values of the states to which the agent can go from that state, following the theory of dynamic programming and, in particular, what is called the Bellman equation. As an illustration, consider a simple world with three states, A, B, and C, where C is the (only) goal state; suppose you can move from A to B and from B to C. The first part of the recursion says that the value of state A must be the value of state B minus a small quantity. That is because from state A you could go to state B in a single step, and subtracting the small quantity expresses discounting: reaching B still costs one step. Likewise, the value of state B must be a bit less than the value of C. Now, the value of C is fixed (to some numerical value whose exact magnitude is irrelevant) by the fact that it is the goal, so it needs no recursion or computation. Thus, once the agent encounters the state C even once, it knows the value of C. Based on that knowledge and its model of the world, it can apply the recursion above (the value of B equals the value of C minus a small constant, and likewise for A) to compute the values of B and A. If we fix the value of the goal to 1, the state-values could be 0.8, 0.9, and 1 for A, B, and C, respectively. Note that in this example, we computed the state-values of the optimal policy, i.e., assuming the agent always takes the smartest possible actions. You could also compute the state-values for a very dumb policy (say, always taking random actions), and they would be lower because by taking less smart actions, the agent would collect less reward.
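To make the recursion concrete, here is a minimal sketch in Python of the three-state example, under the simplifying assumptions made above: the goal value is fixed to 1, and discounting is expressed as subtracting a small quantity (here 0.1) per step, which reproduces the values 0.8, 0.9, and 1. The state names, the step cost, and the function state_value are illustrative choices for this sketch, not part of any particular library or standard algorithm.

```python
# A sketch of the three-state recursion described in the text.
# Assumptions (illustrative): states A -> B -> C, C is the goal,
# goal value 1.0, and a fixed "small quantity" of 0.1 subtracted per step.

GOAL_VALUE = 1.0   # value of the goal state C, fixed by definition
STEP_COST = 0.1    # discounting expressed as a per-step subtraction

# Deterministic successor of each non-goal state under the optimal policy.
NEXT_STATE = {"A": "B", "B": "C"}


def state_value(state: str) -> float:
    """Value of a state: the goal's fixed value, or the successor's
    value minus the per-step cost (the recursion in the text)."""
    if state == "C":                        # goal state: no recursion needed
        return GOAL_VALUE
    return state_value(NEXT_STATE[state]) - STEP_COST


for s in ("A", "B", "C"):
    print(s, round(state_value(s), 2))      # prints A 0.8, B 0.9, C 1.0
```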