Learning Deep Reinforcement Learning

I am currently learning deep reinforcement learning, which builds on the foundations of deep learning. I think it can be especially helpful for those who already have a deep learning background.

As part of this, I am implementing MuZero and RLHF (a technique used to train ChatGPT) from scratch, and I would like to share my learning progress here.

MuZero’s repo: GitHub - xrsrke/muzero: Implement MuZero from scratch [WORK IN PROGRESS]

InstructGPT’s repo: GitHub - xrsrke/instructGOOSE: Implementation of Reinforcement Learning from Human Feedback (RLHF)

P.S. If posting about reinforcement learning is not allowed in this channel, please let me know and I will move my posts to a different location.

EDIT 1: Added GitHub repo


TIL: how to calculate the loss for the REINFORCE algorithm (policy gradient)
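
For reference, a minimal sketch of the REINFORCE loss in plain Python. It assumes one finished episode, a list of log-probabilities of the actions actually taken, and a discount factor `gamma`; the function names are made up for illustration and are not taken from the notebook. In a real training loop the log-probabilities would be autograd tensors, so that minimizing this loss raises the probability of actions followed by high returns.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the next one.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Negative sum over the episode of log pi(a_t | s_t) * G_t."""
    returns = discounted_returns(rewards, gamma)
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

A common refinement, not shown here, is to subtract a baseline from the returns to reduce variance.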

TIL: Fixed my REINFORCE training loop (I previously had the loss function wrong), and understood how the advantage function works in actor-critic (will implement it from scratch very soon)

GitHub: reinforcement-learning/05_policy_gradient.ipynb at main · xrsrke/reinforcement-learning · GitHub
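
A rough sketch of the advantage idea mentioned above, assuming the simplest one-step (TD) estimate, where the critic's value predictions stand in for the full return. The names `td_advantage` and `actor_critic_losses` are hypothetical, and other estimators (n-step, GAE) are just as valid.

```python
def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def actor_critic_losses(log_prob, reward, value, next_value, gamma=0.99, done=False):
    """Per-step actor and critic losses built from the same advantage."""
    advantage = td_advantage(reward, value, next_value, gamma, done)
    # Actor: push up log pi(a|s) when the action did better than expected.
    # (With autograd, the advantage would be detached in this term.)
    actor_loss = -log_prob * advantage
    # Critic: squared TD error, so V(s) moves toward r + gamma * V(s').
    critic_loss = advantage ** 2
    return actor_loss, critic_loss
```

Compared with REINFORCE, the only change to the policy term is that the raw return is replaced by "how much better than the critic expected".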

TIL: how soft actor-critic optimizes its policy, and warming up my math muscles

TIL: yay, I finally understand the update rule for the policy in soft actor-critic (updated version of the previous note)

TIL: Understood why the three loss functions in soft actor-critic are designed the way they are
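
For readers following along, here are the objectives written out. This assumes the "three loss functions" are the value, Q, and policy losses of the original soft actor-critic formulation (with the temperature folded into the reward scale); later versions drop the value network and learn the temperature instead, so the notes may use a different set. Here $\mathcal{D}$ is the replay buffer, $\bar{\psi}$ the target value parameters, and $a_t = f_\phi(\epsilon_t; s_t)$ the reparameterized action.

```latex
% Value loss: V should match the soft value (expected Q plus entropy bonus).
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \tfrac{1}{2}
  \left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t)
  - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]

% Q loss: soft Bellman residual against the target value network.
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[ \tfrac{1}{2}
  \left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right],
\qquad
\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]

% Policy loss: KL projection onto the Boltzmann distribution induced by Q ...
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\left(
  \pi_\phi(\cdot \mid s_t) \,\middle\|\, \frac{\exp Q_\theta(s_t, \cdot)}{Z_\theta(s_t)} \right) \right]

% ... which, after reparameterization and dropping the constant \log Z_\theta, becomes
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[
  \log \pi_\phi\big(f_\phi(\epsilon_t; s_t) \mid s_t\big)
  - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big) \right]
```

Read this way, the policy update rule is "move the policy toward $\exp Q$", and the $-\log \pi_\phi$ terms are where the entropy bonus enters.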

Also learning RLHF

The last few days I learned: some basics of RLHF and PPO

The last two days I learned: the idea behind MuZero, some basics of PPO, and how MCTS works (I will post my notes on the last two topics once I finish going through them)
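
As a small illustration of the selection step in MCTS, here is a sketch of the classic UCB1 (UCT) rule in plain Python. This is an assumption about which rule to show: MuZero itself uses a pUCT variant that also weights each child by the policy prior, which is not reproduced here. The function names and the `(value_sum, visits)` tuple layout are made up for the example.

```python
import math


def ucb1_score(value_sum, visits, parent_visits, c=1.41):
    """Mean value plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    mean_value = value_sum / visits
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=1.41):
    """children: list of (value_sum, visits). Returns the index to descend into."""
    scores = [ucb1_score(v, n, parent_visits, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

The remaining MCTS phases (expansion, evaluation, backup) update `value_sum` and `visits` along the path chosen by repeated calls to `select_child`.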

The last four days I learned: implemented PPO from scratch
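
The core of PPO is its clipped surrogate objective. Below is a minimal plain-Python sketch of just that term, assuming per-sample log-probabilities under the new and old policies and precomputed advantages; it leaves out the value loss and entropy bonus that a full implementation usually adds. `ppo_clip_loss` is an illustrative name, not the function from the repo.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum makes the objective a pessimistic bound.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)
```

The clipping removes the incentive to move the ratio far from 1 in a single update, which is what keeps the new policy close to the one that collected the data.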

The last five days I learned: how RLHF works, and implemented a vanilla representation network in MuZero (will fix it when all the components are put together)

The last three days I learned: implemented a pairwise dataset, a reward model, and the reward loss in RLHF, plus 1/4 of MuZero’s self-play
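
For context, a sketch of a pairwise reward loss in plain Python. It assumes the commonly used form from the InstructGPT paper, the negative log-sigmoid of the reward gap between the preferred ("chosen") and the other ("rejected") completion; the loss in the repo may differ in details. The function and argument names are illustrative only.

```python
import math


def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs."""
    losses = []
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        diff = r_chosen - r_rejected
        # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow.
        losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses)
```

The loss is small when the reward model scores the chosen completion well above the rejected one, and equals log 2 when it cannot tell them apart.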

The last four days I learned: implemented the agent objective, a reward trainer, and 1/3 of the agent trainer in RLHF, plus 3/10 of MCTS in MuZero

Nice! Good work


The last 6 days I learned (yes, I was stuck that long): figured out how to use PPO to train RLHF

Next I’m working on implementing the RLHFTrainer: Actions · xrsrke/instructGOOSE · GitHub


Six days is not a long time to be stuck with something, trust me :slight_smile:

Technically speaking, I’ve been stuck on this for two weeks. But I learn different subjects every day, so it balances out. I count it as 6 days :slight_smile:


Great work, keep it up

The last three days I learned: created a custom gym environment for RLHF

The last three days I learned :partying_face: : implemented the RLHF Trainer
