stephen wan

Understanding PPO for LLMs

This past week I spent some time trying to understand reinforcement learning (RL) in the context of LLMs and found it super hard to get a mental model. I've written up these notes as a way you could think about deriving Proximal Policy Optimization (PPO) as used by the InstructGPT paper from scratch.

This writing assumes you have some background on pre- and post-training techniques, but if that's not the case, this article might be a good preread. InstructGPT was the first RLHF-trained LLM from OpenAI, released in 2022 and trained with the PPO RL algorithm from 2017. This topic originally came up for me while watching the Alignment and SFT/RLHF lecture from Stanford CS336.

Why do we need to do RLHF at all?

After pretraining, the model is an “internet autocompleter” but isn’t good at doing tasks like following instructions. For instance, if you prompt it “What is the capital of France?”, it might decide to autocomplete with “What is the capital of Germany? Where is London?”

Example output from pre (left) and post (right) training. The pretrained model doesn’t know how to autocomplete in the form of an assistant response.

We could try to do Supervised Fine Tuning (SFT), but SFT will only get us so far because high quality SFT training data was not available for this kind of text in large quantities. SFT also seems to have a hard time generalizing outside of the data you fine-tune over.

RLHF takes advantage of the fact that it’s much easier to have human labelers look at existing model outputs and rank them.

PPO and RLHF Overview

To apply reinforcement learning (RL) to LLMs, we need to think about how LLM concepts map to RL. RL problems have an agent interacting in an environment based on a policy. A rollout is one example interaction in the environment. A rollout is composed of the actions the agent makes. The agent collects rewards for certain actions (good or bad).

For LLMs, the mapping looks roughly like this:

- The policy is the LLM we're training (the policy model).
- An action is generating the next token.
- A rollout is a full completion the model generates for a prompt.
- The reward is a score for that completion (we'll see below where it comes from).

To train the policy, we want to sample rollouts from the policy and figure out what actions give the best reward. The specific algorithm the InstructGPT paper used for this is called PPO (proximal policy optimization).

Training complexities

This setup sounds simple (i.e. can't we just set $\text{objective} = \max(\text{reward})$, backprop, and call it a day?) but ends up with a lot of complexities.

To help with RLHF training, we will end up with three helper models (the value model, reward model, and reference model) that are used during training but aren’t the final result. We'll start with simple reward maximization and step through introducing why each of these additional models is needed.

Only the policy model is the final trained network. Blue models are static; yellow models can make parameter updates via backprop. This image is taken from the DeepSeekMath paper.

Training the reward model

Before actually running the RL training loop above, we first need a way to decide what reward to give the LLM for its responses. This is called the Reward Model (RM).

Generally, the RM could be any arbitrary model that gives rewards for certain actions, i.e. you could imagine a massive lookup table for every possible sentence (exact solution method), but that lookup table would be massive and impossible to actually make. Instead, we can approximate a good reward model by using the pretrained LLM and fine-tuning it to do approximate reward prediction instead.

Training the reward model from Bradley-Terry

To train the reward model, the InstructGPT paper had labelers take a set of responses for a given prompt and rank them from best to worst. From there, they take each pair combination and use them as pairwise examples. They then train the reward model with the examples to take a (prompt, completion) pair as an input and output a scalar score (i.e. maybe 3.23 for a good output or -1.98 for a bad one).

See this appendix section for more details. tl;dr: the reward model spits out rewards (scalar value) for a given input prompt. Higher reward, better answer.
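As a concrete sketch of that data prep step (the variable names and dict format here are illustrative assumptions, not the paper's actual pipeline), expanding one ranked list into pairwise examples might look like:

```python
from itertools import combinations

# Hypothetical example: a labeler ranked 3 completions for one prompt, best first.
prompt = "Who is the King of England?"
ranked_completions = ["King Charles", "Queen Elizabeth", "potato"]

# Every (higher-ranked, lower-ranked) pair becomes one training example.
pairwise_examples = [
    {"prompt": prompt, "chosen": better, "rejected": worse}
    for better, worse in combinations(ranked_completions, 2)
]
# 3 completions -> 3 pairs: (Charles, Elizabeth), (Charles, potato), (Elizabeth, potato)
```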

Value function and GAE

Now that we have a way to calculate rewards, we could imagine doing our simple proposed training loop as above:

$$\text{objective} = \max(\text{reward}_{\text{policy}})$$

and we maximize this.

One big problem with this is that our reward model may give reward to suboptimal outputs that are still generally okay. For instance, if the prompt is “Who is the King of England?”, “King Charles” is a better answer than “Queen Elizabeth”, which is a better answer than “potato”. However, the reward model will reward something like “Queen Elizabeth” much more than “potato” even though it’s still a really bad answer (the model might say: “at least it’s an English monarch?”).

To mitigate this, we should only reward the model for when its outputs are “better than expected”. In PPO, this is done using Advantage.

Calculating Advantage

The high-level idea here is that we want to give the reward model a baseline to compare to. If our untrained model were to spit out “Queen Elizabeth”, we’d be really excited! But if we have a trained model, we should expect it to do better.

Advantage is somewhat simple: we take the reward and subtract out the baseline performance we’d expect. But how do we get the baseline? Enter: the value function1 - a copy of the reward model that gives us a baseline to compare to. At the start of RL training, it should be returning the same results as the RM. However, during training the VF weights are not frozen and can update alongside the policy. This is done so that our "expectation" for the policy's performance can get better as the policy gets better, i.e. we should only reward improvement.

The practical computation for advantage is pretty complicated; I’ve put the details in another appendix section.

Okay, so with advantage, our objective function looks more like:

$$\text{objective} = \max(\text{advantage}_{\text{policy}})$$

We can now account for the policy network and the value model portions of this diagram.

We now only give the model rewards for outperforming itself!

Avoiding reward hacking

The next issue we’ll likely find is that this training loop gets really good at reward hacking. Maybe the model realizes that writing “I figured out the answer!” or “asdh92h;;awep2iuh” as part of the response will give it a lot of reward2.

Giving direct rewards like this is too unconstrained and the model will do whatever it takes to obtain reward.

Instead, we want to constrain the new policy to stay as close as possible to the baseline SFT’d model. For this, we bring in the reference model, our final helper model in the diagram. The reference model is a copy of the policy model before training started, i.e. the SFT baseline.

We compute the KL divergence between the new policy and the old (reference) policy, i.e. over their per-token output distributions, and penalize it in our advantage3:

$$\text{advantage}_{\text{policy}} = \text{reward}_{\text{policy}} - \text{value}_{\text{baseline}} - \text{KL}(\text{policy}_{\text{new}},\, \text{policy}_{\text{old}})$$
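As a rough sketch of how that penalty often shows up in RLHF implementations (an assumption about implementation style, not the InstructGPT code; the names are placeholders), the KL term can be estimated per token from the gap between the policy's and the reference model's log-probabilities and subtracted from the reward before computing advantage:

```python
import torch

def kl_penalized_rewards(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         rewards: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: subtract a per-token KL estimate from the reward.

    policy_logprobs, ref_logprobs: [batch, seq] log-probs of the sampled tokens
    rewards: [batch, seq] reward (typically nonzero only on the last token)
    beta: hyperparameter weighting the KL penalty
    """
    # Per-token estimate of KL(policy || reference) at the sampled tokens
    kl_estimate = policy_logprobs - ref_logprobs
    return rewards - beta * kl_estimate
```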

If we take a look at the InstructGPT paper, we’re pretty much there!

Okay, so that's kind of terrible to read, but let's look at just the first line:

$$\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\text{RL}}}}\left[\, r_\theta(x, y) - \beta \log\!\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right) \right]$$

The E at the start means "the expected reward given this policy, input, and outputs". Expected reward for what? For the reward (from our advantage calculation) minus the KL penalty4 as described above:

$$\text{objective}(\phi) = \underbrace{\mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\text{RL}}}}}_{\text{expected value of}}\bigg[\;\underbrace{r_\theta(x, y)}_{\text{advantage-based reward}} \;-\; \beta \log\!\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\bigg]$$

Fighting the alignment tax

What's going on with the last term, $\gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\!\left[\log(\pi_\phi^{\text{RL}}(x))\right]$? The InstructGPT paper has this additional term to fight the “alignment tax”.

Running RL on the model aligns it to being able to respond more helpfully, harmlessly, and honestly, but this comes at a cost. When they first trained models with PPO, they found that while the new models got better at question answering, they got worse at being “internet autocompleters” the way the pretrained models were.

To fight this, they essentially mix some pretraining (literally, running the pretraining process) back into the RL process as they do PPO. Section C.4 of the paper has a tiny bit of detail:

In other words, during the backprop for PPO, they fight the RL process a little bit by also nudging the gradients back towards the distribution of outputs for the original pretrained model.
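A minimal sketch of what that mixing could look like in a training step, assuming a `ppo_loss` and a batch of pretraining text are already available (both names and the function signature are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def combined_loss(ppo_loss: torch.Tensor,
                  lm_logits: torch.Tensor,
                  pretrain_tokens: torch.Tensor,
                  pretrain_coef: float) -> torch.Tensor:
    """Hypothetical sketch: PPO loss plus a pretraining language-modeling loss.

    lm_logits: [batch, seq, vocab] policy logits on a batch of pretraining text
    pretrain_tokens: [batch, seq] token ids of that pretraining text
    pretrain_coef: weight on the pretraining term (the gamma in the InstructGPT objective)
    """
    # Standard next-token prediction loss on the pretraining mix
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        pretrain_tokens[:, 1:].reshape(-1),
    )
    # Nudge the policy back toward the pretraining distribution
    return ppo_loss + pretrain_coef * lm_loss
```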

This feels vaguely crazy to me - total engineering hack just to get things to work? I am curious if this approach is still SOTA or we’ve figured out something better.

Putting it all together

Going back to our original PPO diagram, we can now piece together what’s going on here.

PPO training for LLMs, taken from the DeepSeekMath paper.

For every training step, we evaluate the LLM (policy model). We take the input and output and ask the reward model to score it. We then penalize the reward using the KL divergence against the baseline SFT’d model. Finally, we want to only reward the model if it does better than our expectation (the value model) so we compute the Advantage. We train the models by maximizing Advantage and as we train, both the policy model and the value model update their weights.

Not pictured: we take our final A term from this diagram and run it through PPO’s clipped objective function. For InstructGPT, we also tack on a term to mix in gradients from the pretrained model to fight the alignment tax.

At the end of training, we take the policy network and use it as our chat assistant!

Other thoughts

Interestingly, my impression of PPO was that it and TRPO were developed to help us take multiple gradient steps for a single rollout. For LLMs, it seems like we still only take a single gradient step, but the “trust region” formulation helps with training.

I wonder why the InstructGPT paper doesn’t talk about the PPO clipped objective function much. I guess it’s an exercise left to the reader? The paper is already super long without touching on PPO or GAE.

The example outputs from the paper (Appendix F) are worth a skim. There are a lot of cool bugs you can see that seem like they might be a direct result of some of the choices here.

Both the pretrained and post-trained models can’t help but give a last step for the recipe. Is this from the pretrain mix or the KL divergence from the SFT model, perhaps? Or maybe just not enough alignment?

references and further reading

appendix: training the reward model using bradley-terry

Let’s look quickly at how to use Bradley-Terry to train the reward model. We have a bunch of pairwise rankings that we want to convert to scores to use as rewards:

$$\text{RM}(\text{prompt}, \text{response}) = \text{score}$$

How do we train this model to output the right scores? As with everything else in ML, the general trick is to shape the problem as something we can optimize using gradient descent and a loss function. We’ll take our existing SFT-trained model and rip off the last linear layer (that outputs token predictions) and we’ll replace it with a linear layer that outputs a single scalar value5.
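A minimal sketch of that surgery, assuming a transformer trunk that returns per-token hidden states (the class and attribute names here are illustrative assumptions, not a specific library's API):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Hypothetical sketch: transformer trunk + scalar head in place of the LM head."""

    def __init__(self, transformer_trunk: nn.Module, hidden_size: int):
        super().__init__()
        self.trunk = transformer_trunk                # the SFT model minus its LM head
        self.score_head = nn.Linear(hidden_size, 1)   # replaces the vocab projection

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.trunk(input_ids)    # assumed to return [batch, seq, hidden]
        scores = self.score_head(hidden)  # [batch, seq, 1]
        # Use the score at the final token as the reward for the whole response
        return scores[:, -1, 0]           # [batch]
```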

We’ll look at how to formulate our loss next. To contextualize things, let’s bring back our “Who is the King of England?” prompt and think about the pair of outputs: “King Charles” and “Queen Elizabeth”. From our human feedback, we know that “King Charles” is preferred.

Bradley-Terry Model

Bradley-Terry says that you can form a probability that one of the responses is better than the other given each response's score, i.e. $P(i > j) = \frac{x_i}{x_i + x_j}$, where $x$ is the strength score for the response.

You can also use an exponential parameterization of this equation:

$$\Pr(i > j) = \frac{e^{x_i}}{e^{x_i} + e^{x_j}}$$

This form is generally preferable for us because it’s nice to have scores always be positive and on a logarithmic scale. If we stare at this long enough (or ask Claude), we see that this formula can be rewritten as the sigmoid function between the two scores:

$$\Pr(i > j) = \frac{e^{x_i}}{e^{x_i} + e^{x_j}} = \frac{e^{x_i}}{e^{x_i}\left(1 + e^{x_j - x_i}\right)} = \frac{1}{1 + e^{-(x_i - x_j)}} = \sigma(x_i - x_j)$$

If we squint even harder, we can see that this is pretty much doing a softmax across the two scores6.

Coming back to our example, if we had two scores for outputs of “King Charles” and “Queen Elizabeth” as 3 and 5, we can take the softmax over these scores to get the distribution (.12,.88), i.e. 12% chance we should output “King Charles” and 88% chance for Queen Elizabeth”.
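A quick way to sanity-check those numbers (this snippet is just illustrative arithmetic, not part of any training code):

```python
import torch

scores = torch.tensor([3.0, 5.0])            # ("King Charles", "Queen Elizabeth")
print(torch.softmax(scores, dim=0))          # tensor([0.1192, 0.8808])
print(torch.sigmoid(scores[0] - scores[1]))  # tensor(0.1192), same as the softmax entry
```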

But how do we get these scores and why are they wrong? Before training, this newly minted RM will spit out arbitrarily bad scores for whatever (prompt,response) we show it.

Tying in cross-entropy loss

Now, we know from our human feedback that the target probability distribution should be (1, 0), i.e. “King Charles” should be 100% preferred as an answer over “Queen Elizabeth”. We need to teach the model that instead of the scores 3 and 5, it should output scores that give us better probabilities, so that “King Charles” always wins against “Queen Elizabeth”.

Seeing that we’ve now boiled our problem down to comparing two probability distributions, we can model the loss for this model as minimizing the cross-entropy between “the prediction the model makes about which response is better” and “the one-hot of which response we actually prefer”.

From here, we can set up a pretty standard training loop optimizing cross entropy loss; we toss in all of our pairs from human feedback and use this as our reward model.
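Putting the pieces together, a minimal sketch of that loss under the assumptions above (a reward model that maps token ids to a scalar score, as in the earlier sketch): with a one-hot target of (1, 0), the cross-entropy reduces to $-\log \sigma(x_{\text{chosen}} - x_{\text{rejected}})$.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor,
                     rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(score_chosen - score_rejected).

    chosen_scores, rejected_scores: [batch] scalar rewards from the reward model.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Hypothetical usage with the RewardModel sketch from above:
# chosen = reward_model(chosen_input_ids)      # [batch]
# rejected = reward_model(rejected_input_ids)  # [batch]
# loss = pairwise_rm_loss(chosen, rejected)
# loss.backward()
```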

appendix: computing advantage and the policy gradient for ppo

Advantage is computed through Generalized Advantage Estimation7, which has quite a bit of nuance. If we break down what the term means, you can think of advantage as the reward for the current token minus the baseline:

$$\text{Advantage} = \text{Reward} - \text{Baseline}$$

Generalizing advantage

Unfortunately, the actual math is not so simple. We need some scheme for crediting the reward partially to all of the tokens in the response, so we have a more complicated formula:

$$\text{Advantage}_t = \sum_{l=0}^{T-1-t} (\gamma\lambda)^{l}\, \delta_{t+l}$$

where δt (the TD error, or temporal difference error) is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This is a bit of a mess. Some terms:

- $r_t$ is the reward given at token $t$ (for LLMs, usually nonzero only on the last token).
- $V(s_t)$ is the value model's estimate of how well we expect to do from token $t$ onward.
- $\gamma$ is the discount factor.
- $\lambda$ is the GAE parameter.

Together, γ and λ control how much of the reward is spread across future tokens. These are usually set to roughly 1 (maybe γ = 1, λ = 0.95), so you can mostly ignore them for your intuition's sake.

Zooming out a bit, we’re trying to compute the Advantage at a per-token level, spreading the reward across the tokens in the response.

Advantage in pseudocode

Looking at this math is hard, so let’s write it out as code instead9:

```python
import torch

# In the real training loop these would come from the reward and value models:
#   rewards = reward_model(input_and_response)  # reward only on the last token
#   values  = value_model(input_and_response)   # value estimate at every token
# Here we plug in a toy example for a single response of 4 generated tokens.
rewards = torch.tensor([0.0, 0.0, 0.0, 1.0])
values = torch.tensor([0.5, 0.6, 0.7, 0.8, 0.0])  # extra trailing 0 for the state after the last token
gamma = 0.99  # discount factor
lam = 0.95    # GAE parameter ("lambda" is a reserved word in Python)

# Calculate TD error for each token
td_errors = rewards + gamma * values[1:] - values[:-1]
# δ_0 = 0 + 0.99*0.6 - 0.5 = 0.094
# δ_1 = 0 + 0.99*0.7 - 0.6 = 0.093
# δ_2 = 0 + 0.99*0.8 - 0.7 = 0.092
# δ_3 = 1.0 + 0.99*0.0 - 0.8 = 0.2

# Calculate advantage for each token
advantages = torch.zeros_like(rewards)
gae = 0.0

# Work backwards because each token's advantage depends on all subsequent tokens as well
for t in reversed(range(len(rewards))):
    gae = td_errors[t] + gamma * lam * gae
    advantages[t] = gae

# A_3 = δ_3 = 0.2
# A_2 = δ_2 + γλ*A_3 = 0.092 + 0.9405*0.2   ≈ 0.280
# A_1 = δ_1 + γλ*A_2 = 0.093 + 0.9405*0.280 ≈ 0.356
# A_0 = δ_0 + γλ*A_1 = 0.094 + 0.9405*0.356 ≈ 0.429
```

From here, we now have our Advantage term to use for each token generated. Note that the reward is only given for the last token for the LLM use of PPO, but other RL setups might give per-token reward. In other words, credit assignment to the other tokens in the answer is not done by the reward model directly, but by GAE and the value network moving the model to prefer the tokens that lead up to the preferred output.

Writing out the gradient for the policy

As a last wrinkle, to have a proper gradient to learn from, we need to multiply the advantage term by the probability of each generated token (measured by the log probability of the token under the policy) to figure out if we should be upweighting or downweighting the advantage:

$$\text{objective} = A \cdot \text{Probability}(\pi_\theta)$$

Why multiply by the probability?

This part confused me for a bit - if we’re just trying to maximize reward, aren’t we essentially doing $\text{objective} = \max(\text{Advantage})$? Why do we need to multiply by anything?

The main problem here is that our computation of Advantage doesn’t involve the policy at all. Looking at the code above, Advantage is computed from the RM and VF. If we think in terms of the autograd engine, there is no relationship in the computation graph between the parameters of the policy and the objective function as formulated. By instead doing $\text{objective} = A \cdot \text{Probability}(\pi_\theta)$, there is a way to backprop into the policy in terms of the advantage10.

This is not mathematically how the problem is typically formulated, but I found this reasoning to make the most sense to me.
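A tiny sketch of that idea in the autograd sense (a simplification using the standard policy-gradient form with log-probabilities, not PPO's full objective; the tensor names are assumptions): the advantages are treated as constants, and the policy's log-probabilities are what carry gradient back to the policy weights.

```python
import torch

def naive_policy_objective(policy_logprobs: torch.Tensor,
                           advantages: torch.Tensor) -> torch.Tensor:
    """Sketch of objective = A * (log) probability of each sampled token.

    policy_logprobs: [batch, seq] log-probs of the sampled tokens under the policy
                     (these require grad, so backprop reaches the policy weights)
    advantages: [batch, seq] GAE advantages, treated as constants
    """
    return (advantages.detach() * policy_logprobs).mean()

# Maximizing the objective = minimizing its negative:
# loss = -naive_policy_objective(policy_logprobs, advantages)
# loss.backward()
```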

Correcting for mini-batches in PPO

Basic REINFORCE-style RL algorithms can't do mini-batches, i.e. for a given batch of sampled actions you can only take a single gradient step of backprop. Otherwise, the gradient you compute “goes stale” because it’s the gradient relative to the old policy.

PPO uses a ratio of the policies ($r = \frac{\pi_{\theta_{\text{new}}}}{\pi_{\theta_{\text{old}}}}$) instead of just the probability under the current policy ($\pi_{\theta_{\text{new}}}$) to correct for this when doing mini-batches.

This ratio is called the importance sampling ratio. It’s not clear to me why the math for this works (“importance sampling theory?”11), but we can at least build some intuitive sense for what it’s doing.

With this in mind, our advantaged PPO objective looks like:

$$\text{objective}_{\text{per token}} = \text{Advantage} \cdot \frac{\pi_{\theta_{\text{new}}}}{\pi_{\theta_{\text{old}}}}$$

When the new policy prefers (has a higher probability for) the output, the ratio is > 1 and will boost the gradient. When the new policy doesn’t prefer the output, the ratio is < 1 and will diminish the gradient. If the new and old policies have the same preference for the output, we neither boost nor diminish the gradient.

Applying PPO’s clipped objective to the ratio

The last bit of nuance here is that PPO uses a clipped objective function12, i.e. when we compute our advantaged reward, we don’t want it to drastically change the update. PPO takes the policy ratio from above and "clips" it:

$$\text{objective}_{\text{per token}} = L^{\text{CLIP}} = \min\!\left(r \cdot A,\;\; \text{clip}(r,\, 1-\varepsilon,\, 1+\varepsilon) \cdot A\right)$$

This looks complicated but is essentially saying that we take the importance sampling ratio from above and bound it to be somewhere around 1. A reasonable value for the ε hyperparameter might be 0.2, i.e. the clipped ratio can only be between 0.8 and 1.2.
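A minimal sketch of that clipped objective in PyTorch, under the same assumptions as the earlier snippets (per-token log-probabilities and GAE advantages are already computed; the names are placeholders):

```python
import torch

def ppo_clipped_objective(new_logprobs: torch.Tensor,
                          old_logprobs: torch.Tensor,
                          advantages: torch.Tensor,
                          epsilon: float = 0.2) -> torch.Tensor:
    """Per-token clipped PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A).

    new_logprobs: [batch, seq] log-probs of sampled tokens under the current policy
    old_logprobs: [batch, seq] log-probs under the policy that generated the rollout
    advantages: [batch, seq] GAE advantages
    """
    # Importance sampling ratio r = pi_new / pi_old, computed in log space
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the more pessimistic of the two per token, then average
    return torch.min(unclipped, clipped).mean()

# To train, maximize this objective (i.e. minimize its negative):
# loss = -ppo_clipped_objective(new_logprobs, old_logprobs.detach(), advantages.detach())
```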

footnotes

From the original Learning to summarize from human feedback paper, the KL term also helps the model explore during training by introducing some variation.

  1. also called the value network or critic network

  2. recall that our reward model is our best approximation for what kinds of rewards we should give! it can only do its best

  3. note that the beta is a hyperparameter to control how much we want to prioritize the KL divergence

  4. Implementations seem to only look at the last token (EOS). See this section.

  5. it’s also very similar to how ELO scores are calculated, which is kind of solving the same problem: if all you have is pairwise chess games, how can you make a globally ranked list?

  6. Interestingly, in the InstructGPT paper they don’t go into the details of GAE or even show the advantage term in their objective function - not sure why

  7. Interestingly, the discount factor (λ) is used twice! I’m not really sure why this is okay or desirable.

  8. thank you claude for helping me write this out

  9. The RLHF Book has a good section on how this is formulated from the math

  10. Some [more detail here](https://rlhfbook.com/c/11-policy-gradients.html#proximal-policy-optimization-1) from the RLHF book

  11. oddly, the InstructGPT paper talks about using PPO but the $L^{\text{CLIP}}$ objective never shows up in its formulas.

#article