SARSA stands for State-Action-Reward-State-Action and is a model-free, on-policy reinforcement learning method. It works in a similar fashion to Q-learning. The difference lies in how the value of a state-action pair is updated once the action has been carried out: Q-learning updates the value of an action taken in a state using the highest-valued action available in the new, resulting state, whereas SARSA selects a second action from the second state according to the policy it is actually following and updates the value of the first state-action pair using the value of that second action.
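The two update rules can be put side by side in a minimal tabular sketch. The state names, action names, starting values, learning rate `alpha` and discount `gamma` below are illustrative assumptions, not taken from any particular implementation.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    target = r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action the policy actually chose in s_next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def fresh_q():
    # Hypothetical value table: in s1, "left" looks good and "right" looks bad.
    return {
        "s0": {"left": 0.0, "right": 0.0},
        "s1": {"left": 10.0, "right": -10.0},
    }


# Same transition (s0, right, reward 1, s1) seen by both algorithms.
q_ql = fresh_q()
q_learning_update(q_ql, "s0", "right", r=1.0, s_next="s1")

# Assume the behaviour policy happened to explore and picked "right" in s1.
q_sa = fresh_q()
sarsa_update(q_sa, "s0", "right", r=1.0, s_next="s1", a_next="right")

print(q_ql["s0"]["right"], q_sa["s0"]["right"])
```

In this sketch Q-learning ignores the exploratory choice in `s1` and raises its estimate, while SARSA lowers its estimate because the action really taken next was a poor one.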

The ways SARSA and Q-learning work sound very similar, and indeed the policy SARSA follows will normally be to choose the most promising available action, which would also be the highest-rewarded action in Q-learning. However, the essential difference is that exploration means that the policy in SARSA will not always be to choose the most promising available action; it will sometimes be to choose some other action to check that the stored policy information is correct.
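One common way to mix exploitation with occasional exploration is an epsilon-greedy rule. The text does not say which exploration scheme is meant, so the following is only a sketch under that assumption; the function name, action names and values are hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Pick the most promising action, except with probability epsilon."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all available actions.
        return rng.choice(sorted(action_values))
    # Exploit: choose the action with the highest estimated value.
    return max(action_values, key=action_values.get)


rng = random.Random(0)  # seeded so the run is repeatable
values = {"up": 0.2, "right": 1.0, "down": -5.0}

picks = [epsilon_greedy(values, 0.1, rng) for _ in range(1000)]
greedy_share = picks.count("right") / len(picks)
print(f"share of greedy choices: {greedy_share:.2f}")
```

With a small `epsilon` most choices are the greedy one, but the remaining exploratory choices are exactly the ones that enter SARSA's updates and are ignored by Q-learning's.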

Presuming that the actions SARSA performs when exploring are confirmed as being less optimal than the ones specified by its existing policy, SARSA, unlike Q-learning, is able to take into account how much less optimal they are, preferring alternative options that are only slightly sub-optimal to alternative options that turn out to be catastrophic. In a classic example, a mouse actor learning to walk along the top of a virtual cliff towards some cheese will learn to walk right along the cliff edge (the shortest path) if Q-learning is used; with SARSA, on the other hand, the mouse will learn to move away from the edge when walking so that a single move in the wrong direction does not lead to death on the rocks below.
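The cliff effect can be illustrated with a single update target. The sketch below compares Q-learning's target with the average of SARSA's sampled target, assuming an epsilon-greedy behaviour policy; every number, state and action name is a made-up illustration rather than the output of a trained agent.

```python
def q_learning_target(reward, next_values, gamma=1.0):
    """Target that assumes the best next action will always be taken."""
    return reward + gamma * max(next_values.values())


def mean_sarsa_target(reward, next_values, epsilon, gamma=1.0):
    """Average of SARSA's sampled target when the next action is epsilon-greedy."""
    n = len(next_values)
    greedy = max(next_values, key=next_values.get)
    expected = 0.0
    for action, value in next_values.items():
        # Every action gets epsilon / n; the greedy one gets the remainder too.
        prob = epsilon / n + (1.0 - epsilon if action == greedy else 0.0)
        expected += prob * value
    return reward + gamma * expected


# Hypothetical action values in two candidate next states.
edge = {"forward": -5.0, "away": -7.0, "off_cliff": -100.0}
inland = {"forward": -7.0, "toward_edge": -6.0, "away": -9.0}
epsilon = 0.3
step_reward = -1.0

for name, state in (("edge", edge), ("inland", inland)):
    print(name,
          q_learning_target(step_reward, state),
          round(mean_sarsa_target(step_reward, state, epsilon), 2))
```

Under these assumed values Q-learning rates the step onto the edge higher, because it only looks at the best follow-up action, whereas the SARSA average rates the inland step higher, because an occasional exploratory step off the cliff drags the edge state's value down.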

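SARSA-λ is a variant analogous to TD-λ that keeps an eligibility trace for each recently visited state-action pair, so that a single update, for example the one made when a goal is reached, adjusts the values along the whole preceding path in one go, with earlier pairs receiving progressively less credit.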
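A minimal tabular sketch of this idea, assuming accumulating eligibility traces, is shown here. The three-step path, the reward of 1 at the goal and the parameters `alpha`, `gamma` and `lam` are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_lambda_step(q, traces, s, a, r, s_next, a_next, done,
                      alpha=0.5, gamma=0.9, lam=0.8):
    """One SARSA(lambda) update with accumulating eligibility traces."""
    next_value = 0.0 if done else q[(s_next, a_next)]
    td_error = r + gamma * next_value - q[(s, a)]
    traces[(s, a)] += 1.0  # mark the current pair as just visited
    for key in list(traces):
        # Every recently visited pair shares in the same TD error ...
        q[key] += alpha * td_error * traces[key]
        # ... and its eligibility then decays.
        traces[key] *= gamma * lam


q = defaultdict(float)
traces = defaultdict(float)  # should be reset at the start of each episode

# Hypothetical episode: three steps to the right, reward only at the goal.
path = [("s0", "right"), ("s1", "right"), ("s2", "right")]
for i, (s, a) in enumerate(path):
    done = i == len(path) - 1
    s_next, a_next = (None, None) if done else path[i + 1]
    reward = 1.0 if done else 0.0
    sarsa_lambda_step(q, traces, s, a, reward, s_next, a_next, done)

for pair in path:
    print(pair, round(q[pair], 4))
```

In this run nothing changes until the goal is reached; the final step then updates all three pairs at once, with the pair closest to the goal receiving the largest share.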

Asynchronous one-step SARSA is a neural-network implementation of SARSA that is trained using several parallel actors that pool their results, which serves to reduce overlearning.

SARSA-λ Asynchronous one-step SARSA
has functional building block
FBB_Behavioural modelling
has input data type
IDT_Vector of categorical variables IDT_Binary vector
has internal model
INM_Markov decision process