Deep Q-network

Algorithm

A deep Q-network (DQN) is a neural network used to learn a Q-function in reinforcement learning.

Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. Q-learning is a popular reinforcement learning algorithm that aims to learn the optimal action-selection policy by estimating the Q-values, which represent the expected cumulative reward of taking a particular action in a given state.
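To make the Q-value update concrete, the following is a minimal sketch of tabular Q-learning in Python; the state and action counts, learning rate and discount factor are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Tabular Q-learning update. The state/action counts, learning rate and
# discount factor below are hypothetical values chosen for illustration.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.99  # learning rate and discount factor

def q_update(state, action, reward, next_state, done):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    target = reward if done else reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

# Example: a single transition observed while acting in the environment.
q_update(state=3, action=1, reward=1.0, next_state=7, done=False)
```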

A DQN combines Q-learning with deep neural networks to handle high-dimensional input spaces, such as images. Because DQNs are usually applied to complex (typically visual) inputs, their initial layers are normally convolutional. There are two ways of using a neural network to calculate expected rewards for actions:

  1. The network accepts the environment state and a possible action as input and outputs the expected reward.
  2. The network accepts the environment state as input and outputs a vector containing the expected reward (Q-value) of each possible action.

The second option has been found to work better because a single forward pass scores every possible action at once, making both training and action selection faster, as sketched below.
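The sketch below illustrates the second option in PyTorch: a convolutional network that maps a stack of image frames to one Q-value per action. The layer sizes follow the commonly used Atari-style setup and should be read as assumptions rather than a fixed specification.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of image frames to one Q-value per action (option 2)."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per possible action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# A single forward pass scores every action at once.
q_values = DQN(n_actions=6)(torch.zeros(1, 4, 84, 84))  # shape: (1, 6)
```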

Recall that Q-learning involves increasing the expected rewards for actions that lead to positive outcomes and reducing the expected rewards for actions that lead to negative outcomes. Naïve approaches to Q-learning with neural networks fail because consecutive observations of the environment are highly correlated and contain many similar input vectors that will probably never be exactly repeated, which leads to overfitting and unstable learning. This problem can be reduced to an acceptable level using the following techniques:

  • In experience replay, training alternates between steps where the system performs the task to be learned and steps where the network weights are updated. Observations made during task performance are recorded in a buffer, and each weight-update step takes as input only a small random sample drawn from that buffer.
  • During a weight-update step, the network is trained by adjusting its weights so that the Q-values it predicts for the observed input vectors and performed actions better fit the observed outcomes. A target network is a copy of the online (main) network whose weights are held fixed for extended periods of training, serving as a reference for what an older version of the network would have predicted. Using these older predictions as the baseline for weight updates leads to more stable learning. A minimal sketch of both techniques follows this list.
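The sketch below combines experience replay with a target network. It assumes that each transition is stored as a tuple of tensors and that online_net and target_net are two copies of a network mapping states to per-action Q-values (such as the sketch above); the buffer size, batch size, discount factor and synchronisation interval are illustrative assumptions.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Replay buffer: transitions recorded while performing the task; each
# weight-update step samples a small random minibatch from it.
replay_buffer = deque(maxlen=100_000)

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def train_step(online_net, target_net, optimizer, gamma=0.99, batch_size=32):
    """One weight update using a random minibatch and a frozen target network."""
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))

    # Q-values the online network currently predicts for the actions taken.
    q_pred = online_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Targets are computed from the older, fixed target network for stability.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones.float())

    loss = F.smooth_l1_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically synchronise the target network with the online network:
# target_net.load_state_dict(online_net.state_dict())
```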

A double deep Q-network uses two copies of the network during training: one copy selects the most promising action in each state, and the other copy evaluates the Q-value of the selected action. The weights of the evaluating copy are periodically updated from the selecting copy. Decoupling action selection from action evaluation in this way reduces the overestimation of Q-values and improves training stability.
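As a sketch of this idea under the commonly used formulation, the target below lets the online network select the action and the target network evaluate it; the function and parameter names are hypothetical.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Online network selects the action; target network evaluates its Q-value."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
        return rewards + gamma * q_eval * (1.0 - dones.float())
```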

A duelling deep Q-network estimates two separate quantities, a state-value function V and an advantage function (defined as the Q-function minus V), and recombines them to yield more accurate estimates of the Q-function.
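A minimal sketch of a duelling head follows, assuming a feature vector produced by earlier (for example convolutional) layers; the hidden-layer size is an assumption. Subtracting the mean advantage keeps the decomposition into V and A identifiable.

```python
import torch
import torch.nn as nn

class DuellingHead(nn.Module):
    """Recombines a state-value stream and an advantage stream into Q-values."""

    def __init__(self, in_features: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU(),
                                   nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU(),
                                       nn.Linear(256, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                    # shape: (batch, 1)
        a = self.advantage(features)                # shape: (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # Q = V + A - mean(A)
```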

Asynchronous one-step Q-learning is a DQN implementation trained using several parallel actors that pool their updates; because the actors explore different parts of the environment simultaneously, their combined experience is less correlated, which reduces overfitting. Asynchronous n-step Q-learning includes the additional innovation that the training target is computed from the rewards of a short sequence of actions (an n-step return) rather than from one action at a time.
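The sketch below shows how such an n-step target can be assembled from a short sequence of rewards plus a bootstrapped Q-value of the final state; the reward values and discount factor are illustrative.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Accumulate rewards r_1..r_n, then bootstrap from max_a Q(s_n, a)."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Example: three observed rewards, then the estimated value of the last state.
print(n_step_return([0.0, 1.0, 0.5], bootstrap_value=2.0))
```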

In summary, DQNs enable reinforcement learning to handle high-dimensional input spaces effectively. They are widely used in applications such as game playing, robotics, and autonomous systems due to their ability to learn complex policies from raw sensory data.

Alias
DQN
Related terms
Reinforcement Learning, Q-learning, Neural Network, Convolutional Neural Network, Experience Replay, Target Network, Double deep Q-network, Duelling deep Q-network, Asynchronous one-step Q-learning, Asynchronous n-step Q-learning