Q-learning

Algorithm

Q-learning is a model-free, off-policy reinforcement learning method in which the agent maintains a table of the actions that can be taken in each state and learns an estimate of the expected reward to be gained by performing each action. The table can then be used to determine the best action to take in any given state.
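
A minimal sketch of such a table, assuming states and actions are simple hashable labels; the names below are illustrative only:

```python
# The table maps (state, action) pairs to the estimated reward for taking
# that action in that state; unseen pairs default to 0.
q_table = {}


def best_action(state, actions):
    """Pick the action with the highest recorded value in the given state."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```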

When the system is learning, it is in some state and performs some action. If the action results in a goal being achieved, a numeric reward (e.g. 100) is recorded in the table; in environments where unwanted end states, such as the death of the agent, are possible, these are captured with a reward of 0. In every other case, the system observes the new state that resulted from the action and looks in the table for the highest numeric reward recorded for any action that is executable from that new state. This next-step reward is then multiplied by a discounting factor and the result is recorded in the table for the state-action pair currently being processed. In this way, a path from a given state to a goal is built up over time, with the actions along it having gradually increasing reward values (e.g. 51, 64, 80, 100 if the discounting factor is 0.8).
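
A sketch of this table update under the simplified scheme described above (goal reward 100, discounting factor 0.8, no intermediate rewards); the function and parameter names are assumptions for illustration:

```python
GOAL_REWARD = 100.0   # reward recorded when the goal is achieved
DISCOUNT = 0.8        # discounting factor from the example above


def record(q_table, state, action, new_state, new_state_actions,
           reached_goal=False, reached_bad_end=False):
    """Record a value for (state, action) following the scheme described above."""
    if reached_goal:
        q_table[(state, action)] = GOAL_REWARD
    elif reached_bad_end:
        q_table[(state, action)] = 0.0
    else:
        # Highest value recorded for any action executable from the new state,
        # multiplied by the discounting factor.
        best_next = max((q_table.get((new_state, a), 0.0)
                         for a in new_state_actions), default=0.0)
        q_table[(state, action)] = DISCOUNT * best_next
```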

Q-learning-λ is a variant, analogous to TD-λ, in which the values for the whole path are updated in one go when a goal is reached. Dyna-Q is a model-based variant in which short phases, during which actions are tried out in the environment, alternate with longer, ‘hallucinating’ phases in which the algorithm updates numeric rewards on the basis of what the existing table predicts would happen if certain courses of action were followed. Both variants allow for faster learning but are also more vulnerable to overfitting.
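
A compact Dyna-Q sketch under the same simplified reward scheme: each real step is stored in a transition model, and a longer ‘hallucinating’ phase replays randomly chosen remembered transitions through the same table update. The environment interface and the number of planning steps are assumptions for illustration:

```python
import random

DISCOUNT = 0.8
PLANNING_STEPS = 50   # illustrative: many simulated updates per real step


def dyna_q_step(q_table, model, env, state, actions):
    """One short real phase followed by a longer 'hallucinating' phase."""
    # Real phase: try an action in the environment and remember what it did.
    action = random.choice(actions)
    new_state, reward, done = env.step(state, action)   # assumed interface
    model[(state, action)] = (new_state, reward, done)

    # Hallucinating phase: update values using remembered transitions only.
    for _ in range(PLANNING_STEPS):
        (s, a), (s2, r, d) = random.choice(list(model.items()))
        if d:
            q_table[(s, a)] = r               # goal (e.g. 100) or unwanted end (0)
        else:
            best_next = max((q_table.get((s2, b), 0.0) for b in actions),
                            default=0.0)
            q_table[(s, a)] = DISCOUNT * best_next

    return new_state
```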

alias
subtype
Q-learning-λ, Dyna-Q
has functional building block
FBB_Behavioural modelling
has input data type
IDT_Vector of categorical variables, IDT_Binary vector
has internal model
INM_Markov decision process
has output data type
ODT_Classification
has learning style
LST_Reinforcement
has parametricity
PRM_Parametric
has relevance
REL_Relevant
uses
sometimes supports
mathematically similar to