Temporal difference learning


Temporal difference learning works in the same way as Q-learning but models the V-function rather than the Q-function: instead of learning the rewards associated with performing particular actions from particular states, temporal difference learning learns the rewards associated with being in particular states. The two algorithms are equivalent when the Markov decision process is deterministic (not stochastic); for stochastic MDPs, however, when the agent is deciding how to behave in a given situation it must compute the expected value of each action from the distribution over possible next states, which requires a model of the transition probabilities. This makes the temporal difference algorithm more cumbersome to use than Q-learning, so Q-learning is normally preferred.
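
The state-value update at the heart of this can be sketched as follows. This is a minimal illustration on a hypothetical toy problem (a five-state chain with a rewarding goal at the right-hand end); the problem setup and all names are assumptions, not part of the original text.

```python
import random

# Hypothetical toy MDP: states 0..4 form a chain; moving right from
# state 4 reaches the goal (reward 1); every other step yields reward 0.
N_STATES = 5
GOAL = N_STATES  # terminal state index

def step(state, action):
    """action is -1 (left) or +1 (right); returns (next_state, reward, done)."""
    nxt = max(0, state + action)
    if nxt == GOAL:
        return nxt, 1.0, True
    return nxt, 0.0, False

def td0(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    V = [0.0] * (N_STATES + 1)  # V[GOAL] stays 0 (terminal state)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice([-1, 1])  # random behaviour policy
            s2, r, done = step(s, a)
            # TD(0) update: nudge V(s) toward the bootstrapped target
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return V

values = td0()
```

Note that the learned table holds values of *states*, not state–action pairs; to act greedily from it, the agent would have to evaluate each action's expected next-state value, which is exactly the extra step described above.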

The simple version of temporal difference learning described in the previous paragraph is also referred to as TD(0). TD(λ) is a variant in which, when a goal is reached, not only the value of the goal state node is updated, but also the values of all previously visited state nodes along the path that led to it. This speeds up learning because the values for the whole path are updated in one go rather than waiting for the information to propagate backwards along the path over subsequent training episodes. Lambda (λ) here acts as a decay factor that reduces the credit assigned to each state node as its distance from the goal state node increases. Setting λ to 1 (no decay) is equivalent to performing a Monte Carlo evaluation of the whole episode.
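
One common way to implement TD(λ) is with eligibility traces, which record how recently each state was visited and decay by γλ at every step. The sketch below reuses the same hypothetical five-state chain as a toy problem; the setup and names are assumptions for illustration only.

```python
import random

# Hypothetical toy MDP: states 0..4 form a chain; moving right from
# state 4 reaches the goal (reward 1); every other step yields reward 0.
N_STATES = 5
GOAL = N_STATES  # terminal state index

def step(state, action):
    """action is -1 (left) or +1 (right); returns (next_state, reward, done)."""
    nxt = max(0, state + action)
    return (nxt, 1.0, True) if nxt == GOAL else (nxt, 0.0, False)

def td_lambda(lam=0.8, episodes=2000, alpha=0.05, gamma=0.9, seed=0):
    rng = random.Random(seed)
    V = [0.0] * (N_STATES + 1)
    for _ in range(episodes):
        e = [0.0] * (N_STATES + 1)  # eligibility traces, reset per episode
        s, done = 0, False
        while not done:
            a = rng.choice([-1, 1])  # random behaviour policy
            s2, r, done = step(s, a)
            delta = r + gamma * V[s2] - V[s]
            e[s] += 1.0  # accumulating trace for the current state
            for i in range(len(V)):
                V[i] += alpha * delta * e[i]  # credit all recently visited states
                e[i] *= gamma * lam           # traces decay by gamma * lambda
            s = s2
    return V

v = td_lambda()
```

With lam close to 0 this reduces to the one-step TD(0) update; with lam equal to 1 the traces decay only by γ, so each state receives credit for the full (discounted) remainder of the episode, in the Monte Carlo style.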

Note that the term “temporal difference learning” is also sometimes used to refer to an entire class of reinforcement learning algorithms including Q-learning and SARSA.

TD-learning TD(0)
- has functional building block: FBB_Behavioural modelling
- has input data type: IDT_Vector of categorical variables, IDT_Binary vector
- has internal model: INM_Markov decision process
- has output data type:
- has learning style:
- has parametricity:
- has relevance:
- sometimes supports:
- mathematically similar to: