Policy gradient estimation


In policy gradient estimation, an actor learns its behaviour through a parameterized policy function that maps features describing the current state directly to actions. It is particularly useful when the inputs describing the environment state are continuous rather than categorical. The policy is learned without modelling the value function, i.e. without explicitly estimating the expected return of being in each state or of taking each action. In practice, the policy is almost always represented by a neural network. The method uses gradient ascent to gradually adjust the policy weights so as to maximize the expected return (the cumulative reward collected by following the policy to the end of an episode). Gradient ascent follows the same principle as the gradient descent used to minimize error in many other areas of machine learning, except that the weights are moved in the direction that increases the objective rather than decreases it.
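The update described above can be sketched with the simplest policy gradient estimator, REINFORCE. The example below is a minimal illustration, not a production implementation: it assumes a toy one-step environment (a contextual bandit with a continuous scalar state, where action 1 is rewarded when the state is non-negative and action 0 otherwise) and a linear softmax policy in place of a neural network, so that the gradient-ascent update on the policy weights is easy to follow.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def features(s):
    # state feature vector: the continuous state plus a bias term
    return np.array([s, 1.0])

# Policy weights: one row of scores per action (2 actions, 2 features).
W = np.zeros((2, 2))
lr = 0.5

for episode in range(2000):
    s = rng.uniform(-1.0, 1.0)          # continuous state
    x = features(s)
    probs = softmax(W @ x)              # action probabilities under the policy
    a = rng.choice(2, p=probs)          # sample an action from the policy
    r = 1.0 if a == int(s >= 0) else 0.0  # reward for the correct action

    # REINFORCE / gradient ascent step:
    #   grad log pi(a|s) = (onehot(a) - probs) outer x  for a softmax policy
    grad_log_pi = (np.eye(2)[a] - probs)[:, None] * x[None, :]
    W += lr * r * grad_log_pi           # move weights to increase return

# Evaluate the greedy policy on a grid of states.
states = np.linspace(-1.0, 1.0, 201)
acc = np.mean([np.argmax(W @ features(s)) == int(s >= 0) for s in states])
```

No value function is estimated anywhere: the weights `W` are nudged directly in the direction of the log-probability gradient of each sampled action, scaled by the return that followed it, which is the core of the policy gradient approach.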

has functional building block
FBB_Behavioural modelling
has input data type
IDT_Vector of quantitative variables
has internal model
INM_Markov decision process
INM_Neural network
has output data type
has learning style
has parametricity
PRM_Nonparametric with hyperparameter(s)
has relevance
sometimes supports
mathematically similar to