Learning Deep Reinforcement Learning

xariusdrake · December 21, 2022, 9:05am

I am currently learning deep reinforcement learning, which builds on the foundations of deep learning. I think it can be especially helpful for those who already have a foundation in deep learning.

As part of this, I am implementing MuZero and RLHF (a technique used to train ChatGPT) from scratch. I would like to share my learning progress here.

MuZero’s repo: GitHub - xrsrke/muzero: Implement MuZero from scratch [WORK IN PROGRESS]

InstructGPT’s repo: GitHub - xrsrke/instructGOOSE: Implementation of Reinforcement Learning from Human Feedback (RLHF)

P.S. If posting about reinforcement learning is not allowed in this channel, please let me know and I will move my posts to a different location.

EDIT 1: Added GitHub repos

xariusdrake · December 21, 2022, 9:05am

TIL: how to calculate the loss for the REINFORCE algorithm (policy gradient)

xariusdrake · December 22, 2022, 8:27am

TIL: Fixed my REINFORCE training loop (previously I had the loss function wrong), and understood how the advantage function works in actor-critic (I will implement it from scratch very soon)

GitHub: reinforcement-learning/05_policy_gradient.ipynb at main · xrsrke/reinforcement-learning
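For readers following along, here is a minimal sketch of the REINFORCE loss for a single episode. It is not taken from the notebook above: the function names, the discount factor `gamma`, and the use of plain Python lists instead of tensors are all assumptions made for illustration. The `advantages` helper shows one common advantage estimate (return minus a critic's value prediction), which may differ from the form used in the author's notes.

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    g = 0.0
    # Walk backwards so each return reuses the one after it.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Negative sum over the episode of log pi(a_t | s_t) * G_t.

    Minimizing this value increases the log-probability of actions
    that were followed by high returns.
    """
    returns = discounted_returns(rewards, gamma)
    return -sum(lp * g for lp, g in zip(log_probs, returns))

def advantages(returns, values):
    """Hypothetical helper: A_t = G_t - V(s_t), one common estimate."""
    return [g - v for g, v in zip(returns, values)]

# Toy episode: three steps, each action taken with probability 0.5.
log_probs = [math.log(0.5)] * 3
rewards = [1.0, 1.0, 1.0]
loss = reinforce_loss(log_probs, rewards, gamma=0.5)
```

In a deep learning framework the log-probabilities would be tensors produced by the policy network, so that the loss can be backpropagated. Replacing `returns` with `advantages(returns, values)` in the loss gives the actor-critic style update, assuming a critic supplies `values`.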

xariusdrake · December 26, 2022, 9:26am

TIL: how soft actor-critic optimizes its policy. Also warming up my math muscles.

xariusdrake · December 27, 2022, 8:39am

TIL: Yay, I finally understand the update rule for the policy in soft actor-critic (an updated version of my previous note)

xariusdrake · December 28, 2022, 8:46am

TIL: I now understand why the three loss functions in soft actor-critic are designed the way they are

I have also started learning RLHF
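The post does not say which three loss functions it refers to. Assuming they are the ones from the original soft actor-critic formulation (a state-value network $V_\psi$, a soft Q-network $Q_\theta$, and a policy $\pi_\phi$, with replay buffer $\mathcal{D}$ and target value network $V_{\bar\psi}$), they can be written as below. The symbols are an assumption here and may not match the author's notes. Later versions of the algorithm drop the value network and add a temperature loss instead.

```latex
% Value loss: fit V to the expected soft Q-value under the current policy
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( V_\psi(s_t)
  - \mathbb{E}_{a_t \sim \pi_\phi} \big[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \big] \Big)^2 \right]

% Q loss: soft Bellman residual against the target value network
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( Q_\theta(s_t, a_t)
  - \hat{Q}(s_t, a_t) \Big)^2 \right],
\qquad
\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \big[ V_{\bar\psi}(s_{t+1}) \big]

% Policy loss: KL divergence to the Boltzmann distribution induced by Q
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ D_{\mathrm{KL}} \left( \pi_\phi(\cdot \mid s_t)
  \,\Big\|\, \frac{\exp\big(Q_\theta(s_t, \cdot)\big)}{Z_\theta(s_t)} \right) \right]
```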

xariusdrake · December 31, 2022, 8:33am

In the last few days I learned some basics of RLHF and PPO

xariusdrake · January 2, 2023, 8:17am

In the last two days I learned the idea behind MuZero, some basics of PPO, and how MCTS works (I will post my notes on the last two topics once I finish going through them)

xariusdrake · January 6, 2023, 8:41am

In the last four days I implemented PPO from scratch
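As a point of reference, the core of PPO is its clipped surrogate objective. The sketch below is not the author's implementation: the function name `ppo_clip_loss`, the default `clip_eps=0.2`, and the use of plain Python lists are assumptions for illustration, and the value loss and entropy bonus that a full PPO implementation usually includes are left out.

```python
import math

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss, averaged over samples.

    ratio = pi_new(a | s) / pi_old(a | s), computed from log-probabilities.
    The objective takes the minimum of the unclipped and clipped terms,
    which removes the incentive to move the ratio outside
    [1 - clip_eps, 1 + clip_eps].
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate because optimizers minimize, while the objective is maximized.
    return -total / len(advantages)

# Toy example: the new policy makes the action 1.5x more likely
# and the advantage is positive, so the ratio is clipped at 1.2.
example_loss = ppo_clip_loss([math.log(1.5)], [0.0], [1.0])
```

With a positive advantage the gain from raising the ratio is capped at `1 + clip_eps`, and with a negative advantage the gain from lowering it is capped at `1 - clip_eps`, which keeps each update close to the policy that collected the data.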

xariusdrake · January 11, 2023, 8:35am

In the last five days I learned how RLHF works and implemented a vanilla representation network for MuZero (I will fix it once all the components are put together)