In policy gradient estimation, an actor learns its behaviour using a parameterized policy function that maps features describing the current state to actions. It is useful when the inputs describing the environment state are continuous rather than categorical. The policy is learned without modelling the value function, i.e. without explicitly estimating the expected return from being in each state or from taking each action. In practice, the policy is almost always modelled using a neural network. The method uses gradient ascent to gradually learn the policy weights that maximize the expected return (the cumulative, possibly discounted, reward collected over an episode if the policy is followed). Gradient ascent follows the same principle as the gradient descent used to minimize error in many other areas of machine learning.
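As a rough illustration of the idea, the sketch below implements the REINFORCE estimator, one specific policy-gradient method, under simplifying assumptions: a hypothetical toy two-action environment defined inline, a softmax policy with linear scores rather than a neural network, and arbitrary hyperparameter values. None of these details are part of the catalogued method; they are only meant to show gradient ascent on policy weights driven by the return of each episode.

```python
# Minimal REINFORCE sketch (assumptions: toy environment, softmax-linear
# policy, illustrative hyperparameters). Not a definitive implementation.
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 2, 2
theta = np.zeros((N_FEATURES, N_ACTIONS))    # policy weights to be learned


def policy(state, theta):
    """Softmax action probabilities for a continuous state vector."""
    logits = state @ theta
    logits -= logits.max()                   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def run_episode(theta, length=10):
    """Roll out one episode in a hypothetical toy environment: the
    rewarding action is 1 when the first state feature is positive,
    otherwise 0."""
    states, actions, rewards = [], [], []
    for _ in range(length):
        state = rng.normal(size=N_FEATURES)  # continuous state features
        action = rng.choice(N_ACTIONS, p=policy(state, theta))
        reward = 1.0 if action == int(state[0] > 0) else 0.0
        states.append(state)
        actions.append(action)
        rewards.append(reward)
    return states, actions, rewards


def reinforce_update(theta, states, actions, rewards, lr=0.05, gamma=0.99):
    """One gradient-ascent step: raise the log-probability of each action
    taken, weighted by the discounted return that followed it."""
    # return-to-go for each step, accumulated backwards
    returns, G = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    grad = np.zeros_like(theta)
    for state, action, G in zip(states, actions, returns):
        probs = policy(state, theta)
        dlog = -np.outer(state, probs)       # grad of log softmax, all actions
        dlog[:, action] += state             # extra term for the taken action
        grad += G * dlog
    return theta + lr * grad                 # ascent, not descent


for episode in range(500):
    s, a, r = run_episode(theta)
    theta = reinforce_update(theta, s, a, r)

print("mean reward per step after training:",
      np.mean(run_episode(theta, length=1000)[2]))
```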
- alias
- subtype
- has functional building block
- FBB_Behavioural modelling
- has input data type
- IDT_Vector of quantitative variables
- has internal model
- INM_Markov decision process
- INM_Neural network
- has output data type
- ODT_Classification
- has learning style
- LST_Reinforcement
- has parametricity
- PRM_Nonparametric with hyperparameter(s)
- has relevance
- REL_Relevant
- uses
- sometimes supports
- mathematically similar to