In reinforcement learning, an agent is trained to develop a behavioural strategy that allows it to achieve one or more goals within a defined environment. Reinforcement learning is not unsupervised, because training is guided by feedback in the form of rewards, but neither is it supervised, because the data scientist does not supply pre-calculated examples to facilitate learning.
The environment in which the agent operates is modelled as a Markov decision process (MDP) and the aim of training is to learn to prefer paths through the MDP that lead to the goal or goals being met and to avoid paths that terminate the MDP without the goals being met.
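To make the idea of paths through an MDP concrete, the following is a minimal sketch of a toy environment. Everything in it is a hypothetical illustration rather than part of any library: a corridor of five cells with a goal at one end and a pit at the other, a `step` function that returns the next state, the reward and whether the episode has terminated, and a `slip` parameter that makes actions occasionally have the opposite effect.

```python
import random

# Hypothetical toy MDP: a corridor of 5 cells.
# Reaching cell 4 meets the goal; reaching cell 0 terminates without meeting it.
N_STATES = 5
GOAL, PIT = 4, 0
ACTIONS = ("left", "right")


def step(state, action, slip=0.1, rng=random):
    """Return (next_state, reward, done) for one transition.

    With probability `slip` the agent moves in the opposite direction,
    which makes the effect of an action stochastic.
    """
    move = 1 if action == "right" else -1
    if rng.random() < slip:
        move = -move
    next_state = state + move
    if next_state == GOAL:
        return next_state, 1.0, True
    if next_state == PIT:
        return next_state, -1.0, True
    return next_state, 0.0, False


def run_episode(policy, start=2, max_steps=100, rng=random):
    """Follow `policy` (a function from state to action) until termination."""
    state, path = start, [start]
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state), rng=rng)
        path.append(state)
        if done:
            return path, reward
    return path, 0.0  # step limit reached without terminating
```

Calling `run_episode(lambda s: "right")` traces one path through this MDP. Training amounts to finding a policy whose paths tend to end at the goal rather than the pit.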
There are three different functions that a reinforcement learning algorithm can use to determine its behaviour, and algorithms differ mainly in terms of which function or functions they use. The distinctions between the functions can be hard to grasp at first because they are often largely irrelevant when considering simple examples:
- Value function: the value function or V-function expresses the reward expected when the agent is in a certain state within its environment: the ‘value of being in a certain place’.
- Quality function: the quality function or Q-function expresses the reward expected from performing a certain action from the context of a certain state: the ‘value of doing a certain thing in a certain place’. In a situation where it is known what the effect of a given action will be on the environment, it is largely irrelevant whether the Q-function or the V-function is modelled. However, in some systems there is a random (stochastic) element to the effect of a given action that can only be modelled using the Q-function.
The V-function and the Q-function are referred to together as value functions. Algorithms that use value functions typically try out many or all possibilities to see which yield the best V-function and/or Q-function values: they associate a state or action that moves the system in the direction of a final goal with a numerical reward so that the state or action is preferred over competitors that move it in the wrong direction. However, the numerical reward is generally lower than the reward for actually reaching the goal: it is discounted according to the estimated number of steps between the state or action and the aimed-for goal. Discounting is important: without it an algorithm would stop learning as soon as it had reached some solution and ignore the possibility of better, faster solutions.
- Policy function: the policy function works out the action or sequence of actions to perform from the context of a given state: ‘what to do in a certain place’. Algorithms that learn a policy function directly are known as policy-based, while those that rely solely on value functions are known as value-based. (The terms on-policy and off-policy describe a different distinction: whether an algorithm learns about the same policy it uses to choose its actions, or about a different one.) Reinforcement learning algorithms have to manage a trade-off (expressed in ε-greedy methods as the ε hyperparameter) between exploiting what has already been learned and exploring alternatives: just like a value function without discounting, an algorithm that only exploited and never explored would stop learning as soon as it had found some means of reaching its goal(s) and would never have any chance of finding better, faster routes to them.
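The interplay of Q-values, discounting and the ε trade-off can be sketched with tabular Q-learning, one well-known value-function method. The corridor environment, the constants `GAMMA`, `ALPHA` and `EPSILON`, and the helper names below are all assumptions chosen for illustration. The sketch learns a Q-table and then reads off a V-function and a greedy policy from it.

```python
import random

# Hypothetical corridor: pit at cell 0, goal at cell 4, deterministic moves.
N_STATES, GOAL, PIT = 5, 4, 0
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9         # discount factor: reward shrinks with each step of distance
ALPHA = 0.5         # learning rate
EPSILON = 0.2       # probability of exploring instead of exploiting


def step(state, action):
    """Return (next_state, reward, done)."""
    nxt = state + action
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == PIT:
        return nxt, -1.0, True
    return nxt, 0.0, False


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = rng.choice([1, 2, 3]), False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            # Discounted value of the best action available afterwards.
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = nxt
    return q


def summarise(q):
    """Derive a V-function and a greedy policy from a Q-table."""
    states = (1, 2, 3)
    v = {s: max(q[(s, a)] for a in ACTIONS) for s in states}
    policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}
    return v, policy


q = train()
v, policy = summarise(q)
```

Under these assumptions the learned values should fall off with distance from the goal (approaching 1, 0.9 and 0.81 for cells 3, 2 and 1), which is the discounting described above. Setting `EPSILON` to 0 would remove exploration entirely.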
Most general-purpose reinforcement learning algorithms are model-free, i.e. they simply ‘learn what works’. However, some model-based reinforcement learning algorithms build up a model of their environment to try to predict the effects of performing each action. Of particular note here is the Dyna paradigm, in which a model is built by alternating real experience with simulated experience (‘what the algorithm thinks would happen if certain actions were performed’).
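The alternation between real and simulated experience can be sketched along the lines of Dyna-Q, a tabular variant of the Dyna idea. This is an illustrative sketch under stated assumptions, not a reference implementation: the corridor environment, the constants, and the names `dyna_q`, `q_update` and `planning_steps` are hypothetical, and the learned model simply remembers the last observed outcome of each state-action pair, which is only adequate because this toy environment is deterministic.

```python
import random

# Hypothetical deterministic corridor: pit at cell 0, goal at cell 4.
N_STATES, GOAL, PIT = 5, 4, 0
ACTIONS = (-1, +1)
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2


def step(state, action):
    """The real environment: return (next_state, reward, done)."""
    nxt = state + action
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == PIT:
        return nxt, -1.0, True
    return nxt, 0.0, False


def q_update(q, s, a, r, s2, done):
    """One Q-learning update, used for real and simulated transitions alike."""
    best_next = 0.0 if done else max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])


def dyna_q(episodes=200, planning_steps=10, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # (state, action) -> (reward, next_state, done)
    for _ in range(episodes):
        s, done = rng.choice([1, 2, 3]), False
        while not done:
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: q[(s, b)])
            s2, r, done = step(s, a)
            q_update(q, s, a, r, s2, done)   # learn from real experience
            model[(s, a)] = (r, s2, done)    # record what actually happened
            for _plan in range(planning_steps):
                # Simulated experience: replay a remembered state-action pair
                # and ask the model, not the environment, what would happen.
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(q, ps, pa, pr, ps2, pdone)
            s = s2
    return q, model


q, model = dyna_q()
```

Each real step is followed by `planning_steps` updates drawn from the model, so the agent gets several learning updates out of every interaction with the environment. With `planning_steps=0` the sketch reduces to plain model-free Q-learning.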