Experience Replay in Deep Q Networks

Ninesouls · December 22, 2019, 1:28pm

Hi there! Great to have a forum like this.

I have a question about DQN reinforcement learning. I’ve just started looking into it, and one thing I’m not sure I understand correctly is experience replay.

In experience replay, the state, the action, the next state and the reward are stored at each timestep. Later on they are sampled and used for training. This is supposed to decorrelate the data, which is generally strongly correlated in consecutive game timesteps. While this is true, it seems to me that it also breaks the correlations that are necessary for effective learning, because the true reward for an action usually lags the action by multiple timesteps.
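To check that I have the mechanism right, here is a minimal sketch of the buffer as I understand it. The names `Transition` and `ReplayBuffer` and the capacity are my own placeholders, not taken from any particular library.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration only.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Store one timestep of experience.
        self.memory.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement; time order is ignored,
        # which is what decorrelates the minibatch.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Fill the buffer with dummy transitions (integers stand in for states).
buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    buffer.push(state=t, action=t % 2, reward=0.0, next_state=t + 1)

batch = buffer.sample(8)
```

Each element of `batch` is a single isolated timestep, so neighbouring timesteps only end up in the same minibatch by chance.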

Suppose, for example, that we have a game where firing a missile costs 10 points, but if it hits you gain 100 points. In any sample you draw in which the missile is fired, the network will see an immediate reward of -10. The samples in which the missile hits (several timesteps later) and you get the +100 reward might have a completely different action associated with them. Moreover, the sample in which the missile hits would probably not be selected for the same minibatch, so the two would not be used for training at the same time. So how is the network supposed to learn to associate the firing with the reward?
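To make the concern concrete, here is how I understand the one-step Q-learning target that each sampled transition is trained towards. This is only a sketch: `GAMMA`, `td_target` and the Q-value numbers are made up for illustration.

```python
GAMMA = 0.99  # assumed discount factor


def td_target(reward, next_q_values, done):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if done:
        return reward
    return reward + GAMMA * max(next_q_values)


# Early in training the Q-value estimates for the next state are all zero
# (hypothetical numbers), so each sample only "sees" its own immediate reward.
fire_target = td_target(reward=-10.0, next_q_values=[0.0, 0.0], done=False)
hit_target = td_target(reward=100.0, next_q_values=[0.0, 0.0], done=False)
```

With zero estimates, `fire_target` is just the -10 and `hit_target` is just the +100. The +100 never appears in the firing sample directly; it could only reach it through the estimated value of the next state.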

One might suggest that, in the state just before the hit, the network sees the missile close to the target and learns to associate this with a reward; then it gradually learns to associate greater and greater distances between missile and target with the reward, and so on. This might be possible, but I see two problems. First, there might be very little signal drowning in a huge amount of noise, since many missiles miss, and if there are many timesteps between firing and hitting you might need an incredibly complex model. Second, even if it’s feasible, it seems to me a very roundabout way to learn something, one that would require a tremendous amount of training to distill the signal from the noise.

Why not use as the reward the difference between the current score and the score after the maximum missile flight time? Sure, that would also add noise around the signal, but at least it would capture both the firing and the hitting (or missing!) in the same reward. To me that seems a far easier way of getting the learning done. Alternatively, why not use a recurrent network such as an LSTM, which can store multiple states and learn from their combined results? What am I missing?
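For concreteness, the score-difference reward I have in mind would look roughly like the sketch below. The score difference over a window equals the sum of the per-step rewards in that window, so that is what it computes (undiscounted). The function name `windowed_rewards`, the horizon and the reward sequence are all hypothetical.

```python
def windowed_rewards(rewards, horizon):
    """For each timestep t, sum the rewards from t to t + horizon - 1.

    This equals the change in score over the next `horizon` steps.
    Windows near the end of the episode are simply truncated.
    """
    out = []
    for t in range(len(rewards)):
        out.append(sum(rewards[t:t + horizon]))
    return out


# Made-up episode: fire at t=0 (-10), missile hits at t=4 (+100).
rewards = [-10, 0, 0, 0, 100, 0]
windowed = windowed_rewards(rewards, horizon=5)
```

Here the firing step gets a windowed reward of 90, covering both the cost and the hit. The step after firing gets 100 even though nothing was fired there, which is the extra noise I mentioned.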

Thanks!