A deep Q-network (DQN) is a neural network used to learn a Q-function. As most reinforcement learning is associated with complex (typically visual) inputs, the initial layers of a DQN are normally convolutional. There are two ways of using a neural network to calculate expected rewards for actions:
- the network accepts the environment state and a possible action as input and outputs the expected reward;
- the network accepts the environment state as input and outputs a vector of possible actions weighted according to the expected reward of each one.
The second of these options has been found to work better because it allows for more rapid training and operation of the network.
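The advantage of the second architecture can be sketched in a few lines: a single forward pass produces a Q-value for every action at once, so greedy action selection needs only one network evaluation instead of one per action. The layer sizes and random weights below are purely illustrative assumptions (a real DQN would learn its weights and, for visual input, use convolutional layers).

```python
import numpy as np

# Minimal sketch of the second architecture: state in, one Q-value per
# action out. Sizes and weights are illustrative, not a trained model.
rng = np.random.default_rng(0)

STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3
W1 = rng.normal(size=(STATE_DIM, HIDDEN))
W2 = rng.normal(size=(HIDDEN, N_ACTIONS))

def q_values(state):
    """One forward pass returns the expected reward of every action."""
    hidden = np.maximum(0.0, state @ W1)   # ReLU hidden layer
    return hidden @ W2                     # vector of Q-values

state = rng.normal(size=STATE_DIM)
q = q_values(state)                        # shape: (N_ACTIONS,)
best_action = int(np.argmax(q))            # greedy selection, one pass
```

With the first architecture, scoring all actions would instead require `N_ACTIONS` separate forward passes, one per candidate action.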
Recall that Q-learning involves increasing the expected rewards for actions that turned out to lead to positive outcomes and reducing them for actions that turned out to lead to negative outcomes. Naïve approaches to Q-learning with neural networks fail because a sequence of observations of the environment necessarily contains many input vectors that are strongly correlated with one another but that will probably never be exactly repeated in the future, which leads to overfitting and general learning instability. This problem can be reduced to an acceptable level using the following techniques:
- In experience replay, training involves alternating between steps where the system performs whatever task is to be learned a number of times and steps where the network weights are updated. The observations made during a task performance step are recorded; only a small random selection from these observations is used as input to the weight-update step that follows.
- During a weight-update step, the network is trained by tweaking the weights so that the Q-values the network predicts for the observed input vectors and performed actions better fit the outcomes that were observed following on from these actions. A target network is a copy of the online (main) network in which the weights are fixed for extended periods during training and which is used instead of the online network as a reference for what the old version of the network would have predicted. Using the older predictions from the target network as the baseline for weight updates leads to much more stable learning.
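The two techniques above can be sketched together. The buffer size, batch size, and synchronization interval below are illustrative assumptions, and plain lists stand in for the networks' weight tensors; the point is the mechanics of storing transitions, sampling a small random selection of them, and only periodically refreshing the frozen target weights.

```python
import random
from collections import deque

# Illustrative constants; real values are tuned per task.
BUFFER_SIZE, BATCH_SIZE, SYNC_EVERY = 10_000, 32, 1_000

replay_buffer = deque(maxlen=BUFFER_SIZE)  # old transitions fall off the end

def store(state, action, reward, next_state, done):
    """Record one observation made during a task-performance step."""
    replay_buffer.append((state, action, reward, next_state, done))

def sample_batch():
    """A small random selection breaks the correlation between
    consecutive observations before the weight-update step."""
    return random.sample(replay_buffer, BATCH_SIZE)

step = 0
def maybe_sync(online_weights, target_weights):
    """Copy the online weights into the target network only every
    SYNC_EVERY steps; in between, the target's predictions stay fixed
    and serve as the stable baseline for weight updates."""
    global step
    step += 1
    if step % SYNC_EVERY == 0:
        target_weights[:] = online_weights
```

Training alternates between calling `store` while acting in the environment and calling `sample_batch` (plus `maybe_sync`) during weight updates.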
A double deep Q-network is a variant that has sometimes been understandably, but erroneously, confused with the target-network paradigm explained above. It actually refers to decoupling two roles between two copies of the network during training: one copy selects the best action (the one with the highest expected return, which is then also carried out), while the other copy evaluates the Q-value of that selected action. In the original double Q-learning formulation the roles of the two copies are regularly swapped during training; in the deep variant the target network typically takes the evaluation role. This further optimization improves stability and counteracts the systematic overestimation of Q-values that standard Q-learning suffers from, because the same network no longer both chooses the maximum and supplies its value.
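The decoupling can be shown in the computation of the training target. In this sketch, plain arrays of Q-values stand in for the outputs of the two network copies, and the reward and discount values are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.99  # illustrative discount factor

def double_dqn_target(reward, q_online_next, q_target_next, done):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)) for
    non-terminal next states; one copy selects, the other evaluates."""
    if done:
        return reward
    a = int(np.argmax(q_online_next))         # selection by the online copy
    return reward + GAMMA * q_target_next[a]  # evaluation by the other copy

# Example: the online copy prefers action 1, so the target copy's
# estimate for action 1 (not its own maximum, action 2) is used.
q_online_next = np.array([1.0, 2.0, 0.5])
q_target_next = np.array([0.8, 1.5, 3.0])
y = double_dqn_target(reward=1.0,
                      q_online_next=q_online_next,
                      q_target_next=q_target_next,
                      done=False)
# y = 1.0 + 0.99 * 1.5 = 2.485
```

A single network would have used its own maximum (3.0 here), illustrating how taking both the argmax and the value from the same estimator inflates targets.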
A duelling deep Q-network recalls actor-critic architectures in that two separate estimates are made from the environment state and then combined to inform what to do. Whereas actor-critic combines a policy function and a value function, however, a duelling network combines two value functions. The two value functions well known from other algorithms are the Q-function and the V-function; a duelling deep Q-network combines results from the V-function and a new function, the advantage function, which is obtained by subtracting V from Q and is a relative measure of the importance of each action. Logically, adding V to Q − V yields Q again, but decomposing the Q-function in this way has been found to yield more accurate estimates of it.
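The recombination step can be sketched directly. One common detail, which is an implementation convention rather than part of the definition above, is subtracting the mean advantage before adding V back, so that the split between the two streams is uniquely determined; the values below are illustrative.

```python
import numpy as np

def duelling_q(v, advantages):
    """Recombine the two value streams: Q(s, a) = V(s) + A(s, a),
    with the mean advantage subtracted to make the split identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return v + (advantages - advantages.mean())

# Example with an illustrative state value and per-action advantages.
q = duelling_q(v=2.0, advantages=[1.0, -1.0, 0.0])
# The mean advantage is 0 here, so Q = V + A directly: [3.0, 1.0, 2.0]
```

Note that adding any constant to all advantages and subtracting it from V would leave Q unchanged; the mean-subtraction removes that ambiguity.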
Asynchronous one-step Q-learning is a deep Q-network implementation that, like the asynchronous versions of SARSA and actor-critic, is trained using several parallel actors that pool their results, which serves to reduce overlearning; asynchronous n-step Q-learning includes the additional innovation that the Q-function is calculated for sequences of actions rather than for one action at a time.
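The n-step variant's target can be sketched as follows. Instead of bootstrapping from a Q-value estimate after a single action, the rewards from a short sequence of actions are accumulated before the discounted estimate of the final state is added; the reward values and discount factor here are illustrative assumptions.

```python
GAMMA = 0.99  # illustrative discount factor

def n_step_return(rewards, bootstrap_q):
    """sum_k gamma^k * r_k over the observed sequence, plus
    gamma^n times the Q-value estimate at the final state."""
    g = bootstrap_q
    for r in reversed(rewards):  # fold backwards through the sequence
        g = r + GAMMA * g
    return g

# Three observed rewards, then a bootstrap estimate of 5.0:
g = n_step_return(rewards=[1.0, 0.0, 2.0], bootstrap_q=5.0)
# g = 1 + 0.99 * (0 + 0.99 * (2 + 0.99 * 5))
```

With `rewards` of length one this reduces to the ordinary one-step target `r + GAMMA * bootstrap_q`.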
- Double deep Q-network
- Duelling deep Q-network
- Asynchronous one-step Q-learning
- Asynchronous n-step Q-learning
- has functional building block
- FBB_Behavioural modelling
- has input data type
- IDT_Vector of quantitative variables
- IDT_Vector of categorical variables
- IDT_Binary vector
- has internal model
- INM_Neural network
- INM_Markov decision process
- has output data type
- ODT_Classification
- ODT_Vector of quantitative variables
- has learning style
- has parametricity
- PRM_Nonparametric with hyperparameter(s)
- has relevance
- sometimes supports
- mathematically similar to