How is double DQN better than DQN?
I started looking into the double DQN (DDQN). Apparently, the difference between DDQN and DQN is that in DDQN we use the main value network for action selection and the target network for outputting the Q values. However, I don't understand why this would be beneficial, compared to the standard DQN. So, in simple terms, what exactly is the advantage of DDQN over DQN?
In Q-learning there is what is known as a maximization bias. That is because the update target is $r + \gamma \max_a Q(s', a)$. If you slightly overestimate your Q-values, then this error gets compounded (there is a nice example in the Sutton and Barto book that illustrates this, which I go through below). The idea behind tabular double Q-learning is to maintain two Q-functions, $Q_1$ and $Q_2$, and to choose actions based on both of them, e.g. greedily with respect to $Q_1 + Q_2$. You then flip a coin to decide which one to update. If you choose to update $Q_1$, then the update target becomes $r + \gamma Q_2(s', \arg\max_a Q_1(s', a))$.
The idea is that if you overshoot your estimate with one Q-function, the second one will hopefully keep this bias in check, because the action is selected with one function but evaluated with the other.
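To make the tabular update concrete, here is a minimal sketch of a single double Q-learning update in Python (the dictionary-of-arrays representation of the Q-tables and the `alpha`/`gamma` defaults are my own illustrative choices, not anything from a specific source):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular double Q-learning update.

    Q1, Q2 : dicts (or lists) mapping state -> np.array of action values.
    """
    # Flip a coin to decide which table to update this step.
    if np.random.rand() < 0.5:
        Q_update, Q_eval = Q1, Q2
    else:
        Q_update, Q_eval = Q2, Q1

    if done:
        target = r
    else:
        # Select the greedy next action with the table being updated ...
        a_star = np.argmax(Q_update[s_next])
        # ... but evaluate that action with the other table.
        target = r + gamma * Q_eval[s_next][a_star]

    Q_update[s][a] += alpha * (target - Q_update[s][a])
```

Behaviour (e.g. ε-greedy exploration) would then be derived from $Q_1 + Q_2$, as described above.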
In Double DQN the idea is essentially the same, but instead of having to maintain and train two separate Q-networks, the target network from vanilla DQN is reused to evaluate the selected action. To make this more concrete, the update target used is
$$r + \gamma Q\big(s', \arg\max_a Q(s', a; \theta); \theta^-\big),$$
where $Q(s, a; \theta^-)$ denotes the target network, whose parameters are only copied from the current network every $C$ time steps.
As before, the idea is that if we have overestimated the value of being in state $s'$ with our current network when taking the max action, using the target network to provide the target will help control for this bias.
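As an illustration, here is a minimal sketch of how the DQN and Double DQN targets differ for a batch of transitions, written with PyTorch (the names `q_online` and `q_target` and the batch layout are my own assumptions for the example, not from any particular codebase):

```python
import torch

@torch.no_grad()
def td_targets(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """Compute the DQN and Double DQN targets for a batch of transitions.

    q_online, q_target : networks mapping states -> Q-values of shape (batch, n_actions)
    rewards, dones     : tensors of shape (batch,)
    """
    # Vanilla DQN: the target network both selects and evaluates the next action.
    dqn_next = q_target(next_states).max(dim=1).values

    # Double DQN: the online network selects the action ...
    best_actions = q_online(next_states).argmax(dim=1, keepdim=True)
    # ... and the target network evaluates it.
    ddqn_next = q_target(next_states).gather(1, best_actions).squeeze(1)

    not_done = 1.0 - dones.float()
    return (rewards + gamma * not_done * dqn_next,
            rewards + gamma * not_done * ddqn_next)
```

The only change relative to vanilla DQN is which network supplies the argmax; the rest of the training loop stays the same.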
Maximization Bias
Here I will explain the maximization bias using the simple example given in the Sutton and Barto book.
[Figure: the simple MDP from Sutton and Barto that illustrates maximization bias.]

The Markov Decision Process in the image is defined as follows: we start in state A and can take the 'right' action, which gives 0 reward and immediately leads to termination. If we choose 'left', we also get 0 immediate reward and then move to state B. From there, we have an arbitrary number of actions we can take, all of which lead to the terminal state with a reward drawn from a Normal(-0.1, 1) distribution.
Clearly, the optimal action is always to move right from state A, as this gives 0 expected future return. Taking the left action instead gives an expected future return of $\gamma \times -0.1$ (where $\gamma$ is our discount factor).
Now, if we get into state B and take some random action, our initial reward could well be bigger than 0 -- after all, it is drawn from a Normal(-0.1, 1) distribution.
Now, consider updating our Q-function for state A with the left action. Our update target will be $0 + \gamma \max_a Q(B, a)$. Because we are taking the max over all possible actions, this will tend to be positive (some of the sampled rewards from B will have been above 0), and so we back up the belief that the expected future return of taking the left action in state A is positive -- clearly this is wrong, since we know it should be $\gamma \times -0.1$. This is what is known as the maximization bias, because it gives us a kind of 'optimistic' estimate of the action values! I've attached an image below that shows the percentage of the time the agent chose the left action, which it shouldn't be choosing. As you can see, it takes normal Q-learning a long time to even start to correct itself, whereas double Q-learning corrects the mistake almost immediately.
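If you want to reproduce the effect yourself, here is a small self-contained sketch of the MDP above with ε-greedy tabular Q-learning versus double Q-learning (the number of actions in B, the hyperparameters and the number of runs are my own illustrative choices, not the exact settings from the book):

```python
import numpy as np

def run(double_q, episodes=300, eps=0.1, alpha=0.1, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_b = 10  # number of actions available in state B (illustrative choice)
    # state 0 = A (actions: 0 = left, 1 = right), state 1 = B
    Q1 = [np.zeros(2), np.zeros(n_b)]
    Q2 = [np.zeros(2), np.zeros(n_b)] if double_q else Q1  # plain Q-learning uses one table
    chose_left = np.zeros(episodes)

    for ep in range(episodes):
        s = 0  # every episode starts in A
        while True:
            q = Q1[s] + Q2[s] if double_q else Q1[s]
            a = rng.integers(len(q)) if rng.random() < eps else int(np.argmax(q))

            if s == 0:
                chose_left[ep] = float(a == 0)
                r, s_next, done = 0.0, 1, (a == 1)   # left -> B, right -> terminal
            else:
                r, s_next, done = rng.normal(-0.1, 1.0), None, True

            # Coin flip decides which table to update (always Q1 for plain Q-learning).
            if double_q and rng.random() < 0.5:
                Q_upd, Q_eval = Q2, Q1
            else:
                Q_upd, Q_eval = Q1, Q2

            if done:
                target = r
            else:
                a_star = int(np.argmax(Q_upd[s_next]))
                target = r + gamma * Q_eval[s_next][a_star]
            Q_upd[s][a] += alpha * (target - Q_upd[s][a])

            if done:
                break
            s = s_next

    return chose_left

q = np.mean([run(False, seed=i) for i in range(200)], axis=0)
dq = np.mean([run(True, seed=i) for i in range(200)], axis=0)
print("P(left from A), last 100 episodes: "
      f"Q-learning {q[-100:].mean():.2f} vs double Q-learning {dq[-100:].mean():.2f}")
```

Averaged over many runs, plain Q-learning keeps picking 'left' from A well above the rate a correctly-informed ε-greedy agent would, while double Q-learning drops to near the exploration floor much sooner, which matches the behaviour the book's figure describes.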