I am trying to create an environment for simulating the beer game to train a reinforcement learning agent. In the beer game there are 4 players: manufacturer, distributor, wholesaler, and retailer. I am only modelling the retailer.
I have a single state variable, inventory minus backlog, and 5 actions: order [100, 200, 300, 400, 500] units. I have binned the state into 10 bins, so my Q-table is of size (10 x 5), i.e. (states x actions).
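To make the setup concrete, here is a minimal sketch of the binning and the Q-table. The state range (`STATE_MIN`, `STATE_MAX`) is an assumption I made up for illustration; you would set it from your own simulation's range:

```python
import numpy as np

N_BINS = 10
ACTIONS = [100, 200, 300, 400, 500]

# Assumed range for (inventory - backlog); adjust to your simulation.
STATE_MIN, STATE_MAX = -500.0, 500.0
bin_edges = np.linspace(STATE_MIN, STATE_MAX, N_BINS + 1)

def state_to_bin(inv_minus_backlog: float) -> int:
    """Map the continuous state to a bin index in [0, N_BINS - 1]."""
    idx = np.digitize(inv_minus_backlog, bin_edges) - 1
    return int(np.clip(idx, 0, N_BINS - 1))

Q = np.zeros((N_BINS, len(ACTIONS)))  # Q-table: states x actions
```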
Now, in the traditional Q-learning approach, we take an action in the current state, reach the next state, and receive a reward for that transition. We update the Q-table with the formula:
Q[cur_state][action] += alpha*(reward + gamma*max(Q[new_state]) - Q[cur_state][action])
where alpha (the learning rate) and gamma (the discount factor) are constants.
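As a sketch, the update above in runnable form (the alpha and gamma values here are just illustrative, not prescribed):

```python
import numpy as np

def q_update(Q, cur_state, action, reward, new_state, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update, matching the formula above."""
    td_target = reward + gamma * np.max(Q[new_state])
    Q[cur_state][action] += alpha * (td_target - Q[cur_state][action])
    return Q

# Example: a 10 x 5 zero-initialized table, one update with reward -50.
Q = np.zeros((10, 5))
Q = q_update(Q, cur_state=3, action=2, reward=-50.0, new_state=4)
```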
I have to choose an action every week, and the action I choose may not change my state immediately, but perhaps only after a few weeks, since there is a delay in receiving the order.
So, is it right to update the Q-table immediately after taking the action, even though that action has a delayed effect on the state? Or should I wait for the action to affect the state and only then update the Q-table, which could be a few weeks later?
Also, what should the current state and the next state be in a week-level model of the beer game? Any examples?
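For context, this is roughly how I am sketching the delayed-order dynamics in my environment. The 2-week lead time, the starting inventory, and the holding/backlog cost coefficients are all assumptions I picked for illustration:

```python
from collections import deque

class RetailerEnv:
    """Week-level retailer environment with a fixed order lead time."""

    def __init__(self, lead_time=2, holding_cost=0.5, backlog_cost=1.0):
        self.lead_time = lead_time
        self.holding_cost = holding_cost
        self.backlog_cost = backlog_cost
        self.reset()

    def reset(self):
        self.inventory = 400
        self.backlog = 0
        # Orders placed but not yet received, oldest first.
        self.pipeline = deque([0] * self.lead_time, maxlen=self.lead_time)
        return self.inventory - self.backlog

    def step(self, order_qty, demand):
        # Receive the shipment ordered lead_time weeks ago.
        arriving = self.pipeline.popleft()
        self.pipeline.append(order_qty)
        self.inventory += arriving

        # Serve outstanding backlog first, then this week's demand.
        total_demand = self.backlog + demand
        shipped = min(self.inventory, total_demand)
        self.inventory -= shipped
        self.backlog = total_demand - shipped

        # Reward: negative of holding plus backlog costs.
        reward = -(self.holding_cost * self.inventory
                   + self.backlog_cost * self.backlog)
        next_state = self.inventory - self.backlog
        return next_state, reward
```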