Q-learning

Algorithm

Q-learning is a model-free, off-policy reinforcement learning method.

It is used in environments where an agent must learn which actions to take in each state to maximize cumulative reward. The agent maintains a table with one entry per state-action pair, the Q-value, which estimates the expected cumulative reward of taking that action in that state, and updates this table based on the rewards received from the environment.
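
As a concrete picture of this table, the sketch below stores the estimates in a NumPy array for a hypothetical environment with 16 states and 4 actions; the sizes and the library choice are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

# Hypothetical small environment: 16 discrete states, 4 discrete actions.
n_states, n_actions = 16, 4

# Q[s, a] holds the current estimate of the cumulative discounted reward
# obtained by taking action a in state s and acting greedily afterwards.
Q = np.zeros((n_states, n_actions))

# The policy implied by the table: in each state, take the highest-valued action.
greedy_policy = Q.argmax(axis=1)
```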

When the agent performs an action in a state, it receives a reward and transitions to a new state. The agent then updates the table entry for that state-action pair, moving it a small step (set by a learning rate) toward the reward received plus the maximum expected future reward from the new state, discounted by a factor. Repeating this process lets the agent learn the optimal policy over time.
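
A minimal sketch of one such update, with illustrative values for the learning rate and discount factor, might look like this:

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))

alpha = 0.1    # learning rate: how far each update moves the estimate
gamma = 0.99   # discount factor: how much future rewards are worth today

def q_update(state, action, reward, next_state):
    """One Q-learning update after observing (state, action, reward, next state)."""
    # Target: immediate reward plus the discounted best value the current
    # table promises from the new state.
    target = reward + gamma * Q[next_state].max()
    # Move the old estimate part of the way toward that target.
    Q[state, action] += alpha * (target - Q[state, action])

q_update(state=0, action=2, reward=1.0, next_state=5)  # example transition
```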

For example, if an agent in a grid world moves from one cell to another and receives a reward, it updates its table to reflect the new knowledge about the expected rewards of actions in that cell. Over time, the agent learns the best path to reach a goal.
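
To make the grid-world example concrete, here is a minimal sketch that trains a Q-table on a hypothetical 4×4 grid where the agent starts in the top-left cell and earns a reward of 1 for reaching the bottom-right cell; the layout, reward scheme, and hyperparameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

SIZE = 4                                      # 4x4 grid, states numbered 0..15
GOAL = SIZE * SIZE - 1                        # bottom-right cell
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Apply a move; actions that would leave the grid keep the agent in place."""
    row, col = divmod(state, SIZE)
    dr, dc = MOVES[action]
    row = min(max(row + dr, 0), SIZE - 1)
    col = min(max(col + dc, 0), SIZE - 1)
    next_state = row * SIZE + col
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = np.zeros((SIZE * SIZE, len(MOVES)))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(500):
    state, done = 0, False
    for _ in range(200):                      # cap episode length
        if rng.random() < epsilon:
            action = int(rng.integers(len(MOVES)))        # explore
        else:
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))                # exploit, break ties at random
        next_state, reward, done = step(state, action)
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                     - Q[state, action])
        state = next_state
        if done:
            break

# After training, following the greedy actions from the start cell traces a
# shortest path to the goal.
print(Q.argmax(axis=1).reshape(SIZE, SIZE))
```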

Q-learning-λ is a variant that uses eligibility traces, so that a reward received at the end of a path also updates the values of the state-action pairs visited along that path rather than only the most recent one; it is closely related to Temporal Difference Learning with Lambda. Dyna-Q is another variant that learns a model of the environment from real experience and interleaves real updates with simulated updates generated from that model, allowing faster learning per real interaction at the risk of being misled when the learned model is inaccurate.
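
As a rough sketch of the Dyna-Q idea, the function below performs the ordinary Q-learning update on a real transition, records the transition in a simple deterministic one-step model, and then replays a few remembered transitions as simulated experience; the model structure, the helper for available actions, and the number of planning steps are assumptions for illustration.

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.95
n_planning_steps = 5                 # simulated updates per real step (illustrative)

Q = defaultdict(float)               # Q[(state, action)] -> value estimate
model = {}                           # model[(state, action)] -> (reward, next_state)

def available_actions(state):
    # Hypothetical helper: the actions the environment allows in this state.
    return [0, 1, 2, 3]

def best_value(state):
    return max(Q[(state, a)] for a in available_actions(state))

def dyna_q_update(state, action, reward, next_state):
    # 1. Ordinary Q-learning update from the real transition.
    Q[(state, action)] += alpha * (reward + gamma * best_value(next_state)
                                   - Q[(state, action)])
    # 2. Record the transition in a deterministic one-step model.
    model[(state, action)] = (reward, next_state)
    # 3. Planning: replay a few remembered transitions as simulated experience.
    for _ in range(n_planning_steps):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        Q[(s, a)] += alpha * (r + gamma * best_value(s_next) - Q[(s, a)])

dyna_q_update(state=0, action=1, reward=0.0, next_state=4)  # example real step
```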

Q-learning is related to Markov Decision Processes because it relies on the concepts of states, actions, and rewards that define an MDP. The agent’s goal is to learn the optimal policy that maximizes cumulative reward, which is the central objective in solving MDPs.

In summary, Q-learning is crucial for learning optimal policies in reinforcement learning tasks, especially when the model of the environment is unknown.

Alias
Q-learning-λ, Dyna-Q
Related terms
Reinforcement Learning, Markov Decision Process, Temporal Difference Learning, Q-learning-λ, Dyna-Q, Temporal Difference Learning with Lambda