AlphaGo / Reinforcement learning


I have a question concerning reinforcement learning:

If I understand correctly, both Q-learning and policy gradient methods in the end rely on predicting an action a given the current state s (Q-learning by estimating a value for each action, policy gradients by outputting action probabilities directly).

For games like Pong or the Atari games (I read the DeepMind paper) I fully understand that, but my question is how that is supposed to work when the set of available actions depends on the state.

Take chess as an example: every state has (at least partly) different legal moves. If I just apply a softmax in my action-prediction neural network (like they do with the Atari games), wouldn't I have to output a very large number of probabilities, a large part of which would correspond to illegal moves? Or do you encode the moves as relative moves (e.g. move the knight one square forward relative to its current position)?
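One common way to handle this, which might clarify the question: keep a fixed-size output over some move encoding, and mask out the illegal entries before the softmax. This is a minimal sketch, assuming a hypothetical 64×64 from-square/to-square encoding (the exact encoding and shapes are illustrative, not from any particular paper):

```python
import numpy as np

def masked_softmax(logits, legal_mask):
    """Softmax restricted to legal actions: illegal logits are set to -inf,
    so they receive exactly zero probability after normalization."""
    masked = np.where(legal_mask, logits, -np.inf)
    # Subtract the max over legal entries for numerical stability.
    exp = np.exp(masked - masked[legal_mask].max())
    return exp / exp.sum()

# Hypothetical fixed move encoding: 64 from-squares x 64 to-squares = 4096.
rng = np.random.default_rng(0)
logits = rng.standard_normal(4096)

legal_mask = np.zeros(4096, dtype=bool)
legal_mask[[100, 200, 300]] = True  # pretend only 3 moves are legal here

probs = masked_softmax(logits, legal_mask)
```

With this trick the network's output size stays fixed regardless of the position, and the policy only ever assigns probability to moves that are actually legal in the current state.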

I’d be thankful for any help, or recommendations of papers that discuss this action-space problem.



Interesting question. I don’t know the answer, but if you read this paper you might be able to find out - and then tell us what you learn! 🙂

Hi Jeremy,

I read the paper (I was a little scared of it at first: 30+ pages and part of a master’s degree), but it turned out to be quite readable.

The short answer to the question is that they don’t just train a neural network to select the next action from the current state (like you’d do in Pong), but rather use a (heavily modified) version of von Neumann’s minimax algorithm. It performs a tree search, and in theory would always find the best move, but since chess has far too many possible moves/positions a full tree search is infeasible, so they perform a static evaluation at the leaves of the tree at a certain depth (all chess engines do that). Usually this evaluation is a hand-coded function, fine-tuned by chess masters. In Giraffe they instead train a neural network to evaluate the position, using reinforcement learning.
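The structure described above can be sketched in a few lines. This is generic depth-limited minimax, not Giraffe's actual implementation: `evaluate`, `legal_moves`, and `apply_move` are placeholder callables, and `evaluate` is where a hand-coded function (or, in Giraffe, a neural network) would plug in:

```python
def minimax(state, depth, maximizing, evaluate, legal_moves, apply_move):
    """Depth-limited minimax: search the game tree, but stop at `depth`
    (or at a terminal state) and fall back to a static evaluation of the
    leaf position. In Giraffe that static evaluation is a trained neural
    network instead of a hand-coded function."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    if maximizing:
        return max(minimax(apply_move(state, m), depth - 1, False,
                           evaluate, legal_moves, apply_move)
                   for m in moves)
    return min(minimax(apply_move(state, m), depth - 1, True,
                       evaluate, legal_moves, apply_move)
               for m in moves)
```

For example, in a toy game where the state is a number, each move adds 1 or 2, and the evaluation is the number itself, a depth-2 search from 0 returns 3 (the maximizer picks +2, then the minimizer picks +1).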

They also change the minimax algorithm by introducing probability-based search instead of depth-limited search, and they train another neural net to estimate those probabilities. That leads to a more selective search.
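To make the contrast with plain depth-limited search concrete, here is a rough sketch of the idea (my own simplification, not Giraffe's code): instead of a fixed depth cutoff, each line carries an estimated probability of being relevant, and a line is only expanded while that probability stays above a threshold. The `move_probs` callable stands in for the neural net that scores each move:

```python
def prob_search(state, prob, threshold, maximizing,
                evaluate, legal_moves, move_probs, apply_move):
    """Probability-limited search: a line's probability is the product of
    the move probabilities along it. Once it drops below `threshold`, stop
    expanding and statically evaluate, so promising lines are searched
    deeper than unlikely ones."""
    moves = legal_moves(state)
    if prob < threshold or not moves:
        return evaluate(state)
    probs = move_probs(state, moves)
    children = [prob_search(apply_move(state, m), prob * p, threshold,
                            not maximizing, evaluate, legal_moves,
                            move_probs, apply_move)
                for m, p in zip(moves, probs)]
    return max(children) if maximizing else min(children)
```

With uniform move probabilities this reduces to an ordinary fixed-depth search; the selectivity only appears once the net assigns different probabilities to different moves, so that strong-looking lines keep their probability high and get searched deeper.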

All in all, I think the paper is a great example of how deep learning can enhance algorithms and replace hand-coded components, while the algorithm overall stays quite similar.

I think I will write a short blog post on chess/board-game AI in the next few weeks, as @rachel suggested. I will post it in the forum for others to proof-read as soon as I’m finished.

Happy Easter!