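Temporal-difference learning for action values:

$$
q_{t+1}(S_t, A_t) \leftarrow q_t(S_t, A_t) + \alpha \underbrace{\Big( \overbrace{R_{t+1} + \gamma\, q_t(S_{t+1}, A_{t+1})}^{\text{target}} - q_t(S_t, A_t) \Big)}_{\text{TD error}}
$$

It is known as SARSA because each update uses the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.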
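To make the update concrete, here is a minimal tabular sketch of a single SARSA step. The function name `sarsa_update`, the dictionary-based table `q`, the state and action labels, and the values of `alpha` and `gamma` are all illustrative assumptions, not part of the original notes. It also assumes a non-terminal next state.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update to q in place and return the TD error.

    q maps (state, action) pairs to action-value estimates.
    Assumes s_next is non-terminal (hypothetical helper for illustration).
    """
    target = r + gamma * q[(s_next, a_next)]  # R_{t+1} + gamma * q(S_{t+1}, A_{t+1})
    td_error = target - q[(s, a)]             # target minus current estimate
    q[(s, a)] += alpha * td_error             # move the estimate toward the target
    return td_error


# Illustrative values only: unseen pairs default to 0.0.
q = defaultdict(float)
q[("s1", "right")] = 2.0

td_error = sarsa_update(
    q, s="s0", a="right", r=1.0, s_next="s1", a_next="right",
    alpha=0.5, gamma=0.5,
)
print(td_error)            # 2.0, since target = 1.0 + 0.5 * 2.0 and q(s0, right) was 0.0
print(q[("s0", "right")])  # 1.0, since 0.0 + 0.5 * 2.0
```

Note that the next action `a_next` is passed in rather than chosen inside the function: the update uses the action actually selected in $S_{t+1}$, which is where the second "A" in SARSA comes from.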