
PPO & GRPO

These are the notes for the blog: A vision researcher’s guide to some RL stuff: PPO & GRPO

LLM Pre-training and Post-training

  1. Pre-training: Next token prediction on large-scale web data.
  2. Post-training: Two-stage process to improve reasoning:
    • Stage 1: SFT (Supervised Finetuning): Fine-tune on high-quality expert reasoning data (instruction-following, QA, chain-of-thought). Models learn to mimic expert demonstrations.
    • Stage 2: RLHF (Reinforcement Learning from Human Feedback): When expert data is limited, use human feedback to train a reward model that guides RL training, aligning the model with human preferences.

DeepSeek R1 Innovation

DeepSeek’s R1-Zero model completely skips the SFT stage and applies RL directly to the base model (DeepSeek V3). This helps in:

  • Computational efficiency: Skipping an entire stage of post-training reduces the overall training cost
  • Open-ended learning: Allows the model to “self-evolve” reasoning capabilities through exploration
  • Alignment: Avoiding biases introduced by human-curated SFT data
  • Memory optimization: Replacing PPO with GRPO in RLHF eliminates a separate critic model (typically as large as the policy model), reducing memory and compute overhead by ~50%

RLHF

Let’s break down the workflow of RLHF into steps:

  • Step 1: For each prompt, sample multiple responses from the model
  • Step 2: Humans rank these outputs by quality
  • Step 3: Train a reward model to predict human preferences/ranking, given any model responses
  • Step 4: Use RL (e.g. PPO, GRPO) to fine-tune the model to maximise the reward model’s scores

As we can see, the process here is relatively simple, with two learnable components: the reward model and “the RL” part. Now let’s dive into each component in more detail.

Reward Model

We can’t have humans rank all of the model’s outputs. Instead, annotators rate a small portion of the LLM outputs, and we train a reward model to predict these annotators’ preferences.

Let’s denote our learnable reward model as $R_\phi$. Given a prompt $p$, the LLM generates $N$ responses $\{r_1, r_2, \dots, r_N\}$.

Note: The reward for a partial response is always 0; only for complete responses from the LLM would the reward model return a non-zero scalar score.

Given that a response $r_i$ is preferable to $r_j$ according to the human rater, the reward model is trained to minimise the following objective:

\[\mathcal{L}(\phi) = -\log \sigma(R_\phi(p, r_i) - R_\phi(p, r_j))\]

where $\sigma$ denotes the sigmoid function.

Side note: The objective is derived from the Bradley-Terry model, which defines the probability that a rater prefers $r_i$ over $r_j$ as:
\(P(r_i \succ r_j) = \frac{\exp\big(R_\phi(p, r_i)\big)}{\exp\big(R_\phi(p, r_i)\big) + \exp\big(R_\phi(p, r_j)\big)}\)
Taking the negative log-likelihood of this probability gives the loss $\mathcal{L}(\phi)$ above. The sigmoid $\sigma$ emerges naturally from rearranging the Bradley-Terry ratio.
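As a rough sketch (not code from the blog), this pairwise loss can be written in PyTorch as follows, assuming a hypothetical reward_model callable that returns a scalar tensor score for a (prompt, response) pair:

import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    # R_phi(p, r_i) and R_phi(p, r_j): scores for the preferred and rejected responses
    # (reward_model is a hypothetical callable, not from the blog)
    score_chosen = reward_model(prompt, chosen)
    score_rejected = reward_model(prompt, rejected)
    # -log sigmoid(R_i - R_j); F.logsigmoid is numerically stabler than log(sigmoid(.))
    return -F.logsigmoid(score_chosen - score_rejected)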

Negative log-likelihood measures how well a model fits data. For probability P, it’s simply -log(P). We use it because:

  1. Numerical stability - Probabilities can be very small numbers (like 0.0001), which computers struggle with. Taking the log converts these to larger, more manageable numbers
  2. Converts products to sums - When you have multiple independent events, their combined probability is the product P₁ × P₂ × P₃. Taking logs converts this to log(P₁) + log(P₂) + log(P₃), which is much easier to compute
  3. Optimization - The negative sign makes it a loss function that we want to minimize (since maximizing likelihood = minimizing negative log-likelihood)

Rare events get high negative log-likelihood (bad fit), common events get low values (good fit).

Sigmoid Function: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)

PPO: Proximal Policy Optimization

  • Policy ($\pi_\theta$): LLM obtained after Pretraining or SFT
  • Reward model ($R_\phi$): Frozen reward model
  • Critic ($V_\gamma$, the value function): A learnable network that takes in a partial response to a prompt and predicts the final scalar reward

The Five Stages

  1. Generate responses: LLM produces multiple responses for a given prompt
  2. Score responses: The reward model assigns reward for each response
  3. Compute advantages: Use GAE to compute advantages
  4. Optimise policy: Update the LLM by optimising the total objective
  5. Update critic: Train the critic to be better at predicting the rewards given partial responses

State and Action Definitions

We have state $s_t$ and action $a_t$.

  • Subscript $t$ is used to denote the state and action at a token level
  • Subscript $i$ is used to denote the response at an instance level, used in responses $r_i$

LLM for a prompt $p$ generates $r_i$ of length $T$ one token at a time:

  • $t=0$: state = prompt, i.e. $s_0 = \{p\}$, and the first action $a_0$ is the first word token generated by the LLM
  • $t=1$: $s_1 = \{p, a_0\}$, and the next action is $a_1$ (2nd word token)
  • $t=T-1$: $s_{T-1} = \{p, a_{0:T-2}\}$, and the final action is $a_{T-1}$

Therefore: $r_i = \{a_0, a_1, \dots, a_{T-1}\}$.

Generalized Advantage Estimation (GAE)

What is Advantage?

Think of advantage as asking: “How much better is this specific choice compared to what I normally do?”

  • State ($s_t$): The prompt plus any words generated so far
  • Action ($a_t$): The next word to generate
  • Advantage: How much better choosing this specific word is compared to the average action the policy would take in state $s_t$

Formally:

\[A_t = Q(s_t, a_t) - V(s_t)\]
  • $Q(s_t, a_t)$: “If I choose this specific word, how good will my final response be?”
  • $V(s_t)$: “On average, how good will my response be with my usual word choices?”
  • $A$: The difference - how much this specific word improves things

Two Main Estimation Approaches

1) Monte-Carlo (MC):

  • Generate the complete response, then judge how good it was
  • Like writing an entire essay before getting feedback
  • Pro: Very accurate assessment of final quality, low bias as we can accurately model the reward
  • Con: Very unpredictable results (high variance due to the sparse reward), expensive to run many times
  • Play the entire game, then analyze if that opening move was good

2) Temporal Difference (TD):

  • Evaluate each word immediately as it’s generated
  • Like getting feedback on each sentence while writing
  • Pro: More consistent results (low variance), cheaper to compute
  • Con: Hard to predict final quality from partial text (higher bias)
  • Evaluate each move immediately based on current board position

Bias and Variance - The Dartboard Analogy 🎯

Imagine you’re throwing darts at a bullseye (the “true advantage value”):

  • Low Bias: Your throws are centered around the bullseye on average
  • High Bias: Your throws consistently miss in one direction (systematic error)
  • Low Variance: Your throws are tightly clustered together
  • High Variance: Your throws are scattered all over the place

In Advantage Estimation:

Monte-Carlo (MC) Method:

  • Low Bias: When you wait for the complete response, you get the “true” reward. It’s like actually finishing the chess game to see if your opening move was good
  • High Variance: But each complete response can be very different! One response might be great, another terrible, even from the same starting point. It’s like your dart throws being all over the dartboard

Temporal Difference (TD) Method:

  • Low Variance: Since you’re judging each word immediately, your estimates are more consistent. Like throwing darts that land in a tight cluster
  • High Bias: But you’re guessing the final outcome from incomplete information. It’s like judging a movie from just the first 10 minutes - you might consistently over or underestimate

Real Example:

Say you’re generating “The weather today is…”

MC approach:

  • Complete response 1: “The weather today is sunny and perfect for hiking!” (Score: 9/10)
  • Complete response 2: “The weather today is terrible and ruined my day.” (Score: 3/10)
  • Average advantage estimate varies wildly (high variance)

TD approach:

  • After “sunny”: Estimates final score as 7/10
  • After “terrible”: Estimates final score as 4/10
  • More consistent estimates, but might miss that “sunny” leads to an amazing hiking story (bias)

The Trade-off: You can’t easily have both low bias AND low variance - it’s like trying to be both very accurate and very consistent when the underlying process is inherently noisy.

The Critic (Value Function)

The critic is like a “fortune teller” for your language model. It looks at a partially generated response and tries to predict: “How good will the final, complete response be?”

Training the critic $V_\gamma$ is fairly straightforward:

Given a partial state $s_t$, we want to predict the reward model’s output given the full state $s_T = \{p, r\}$. The objective for the critic can be written as:

\[L(\gamma) = \mathbb{E}_t \left[(V_\gamma(s_t) - \text{sg}(R_\phi(s_T)))^2\right]\]

where $\text{sg}$ denotes the stop gradient operation.
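As a minimal sketch (assuming values holds the critic’s predictions $V_\gamma(s_t)$ for every partial state of a response and final_reward is the frozen reward model’s score $R_\phi(s_T)$ for the complete response; the names are illustrative):

import torch

def critic_loss(values, final_reward):
    # values: tensor of shape (T,) with V_gamma(s_t) for each partial state
    # final_reward: scalar tensor R_phi(s_T) from the frozen reward model
    target = final_reward.detach()            # stop gradient: the reward model is not updated
    return ((values - target) ** 2).mean()    # MSE between predictions and the final reward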

Real-World Analogy: Imagine training a movie critic to predict final movie ratings from just watching trailers:

  • Show trailer (partial state)
  • Critic predicts: “This will be 4/5 stars”
  • Full movie gets actual rating: “3/5 stars”
  • Train critic to be closer to 3/5 next time
  • But don’t change how the rating system works (stop gradient)

Stop Gradient: It is an operation that prevents gradients from flowing backward through a specific part of a neural network during backpropagation.

Code Example (PyTorch):

import torch

x = torch.tensor([1.0], requires_grad=True)
y = x * 2
z = y.detach()          # Stop gradient here: z is cut off from the graph
loss = z * 3 + y * 0    # keep a (zero) path through y so backward() is valid
loss.backward()
print(x.grad)           # tensor([0.]) - no gradient flowed through z back to x

We update the critic to better predict the reward, but we DON’T update the reward model based on this training. It’s like saying “the reward model’s score is the ‘correct answer’ - don’t change it”.

Why do we train the critic alongside the LLM while the reward model stays frozen? Because the value function must estimate the reward for a partial response under the current policy; as a result, it must be updated alongside the LLM so that its predictions do not become outdated and misaligned.

Back to GAE

GAE is a clever way to estimate advantage that balances between the TD (fast but biased) and MC (accurate but noisy) approaches we discussed earlier.

What is TD Error ($\delta_t$)?

  • TD error measures: “How much better did things get after generating one more word?”
  • $\delta_t$ denotes the TD error at step $t$, and is computed as:
\[\delta_t = V_\gamma(s_{t+1}) - V_\gamma(s_t)\]

Example:

  • State $s_t$: “The weather today is”
  • Critic predicts: $V_\gamma(s_t) = 6.0$
  • After adding “sunny”: “The weather today is sunny”
  • Critic now predicts: $V_\gamma(s_{t+1}) = 7.2$
  • TD error: $\delta_t = 7.2 - 6.0 = +1.2$ (good improvement!)

GAE Formula

Instead of using just one TD step, GAE looks at multiple steps and combines them with decreasing weights:

\[A^{\text{GAE}}_K = \delta_0 + \lambda \delta_1 + \lambda^2 \delta_2 + \dots + \lambda^{K-1} \delta_{K-1} = \sum^{K-1}_{t=0} \lambda^t \delta_t\]

where $K$ denotes the number of TD steps and $K<T$ (because obviously you can’t compute TD beyond the length of the trajectory).
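A minimal sketch of this simplified sum in plain Python (the deltas would come from the critic’s TD errors as described above):

def gae_advantage(deltas, lam):
    # deltas: [delta_0, ..., delta_{K-1}], the TD errors for K steps
    # lam: the lambda decay factor
    return sum((lam ** t) * d for t, d in enumerate(deltas))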

The λ Parameter Controls the Trade-off:

λ = 0 (Pure TD):

  • Only looks at immediate next step: $A = \delta_0$
  • Fast, consistent, but potentially wrong about future

λ = 1 (Pure MC):

  • Looks at all steps equally: $A = \delta_0 + \delta_1 + \delta_2 + \dots$ (since $\lambda = 1$)
  • More accurate but noisier

λ = 0.9 (Balanced):

  • Recent steps matter more, but still considers future
  • Good compromise between speed and accuracy

Why This Helps:

  • Reduces variance: By weighting recent steps more heavily
  • Maintains some accuracy: By still considering multiple future steps
  • Computational efficiency: Don’t need to generate complete responses

In RLHF Context: The model learns to generate words that maximize this GAE advantage, meaning it learns to pick words that improve the expected final reward not just immediately, but over the next several steps too.

Analogy: It’s like a chess player who considers not just the immediate move, but also the next 3-4 moves, with each future move being slightly less important than the current one.

Side note: Originally there is also a discount factor $\eta$ in GAE: \(A^{\text{GAE}}_K = \sum^{K-1}_{t=0} (\lambda\eta)^t \delta_t\), which also appears in the TD error together with an extra reward term: \(\delta_t = R_\phi(s_t) + \eta V_\gamma(s_{t+1}) - V_\gamma(s_t)\). But since we almost always have $\eta=1$, and $R_\phi(s_t)=0$ for $t<T$ (the reward model only scores complete responses), the author took a shortcut and omitted those terms to simplify.

$\eta$ (Discount Factor):

  • Standard RL concept - future rewards are worth less than immediate rewards
  • $\eta = 0.99$ means a reward tomorrow is worth 99% of the same reward today
  • $\eta = 1.0$ means future rewards are equally valuable (no discounting)

Understanding K vs λ in GAE

K (Number of Steps) controls how far into the future we look when estimating advantage.

Example with different K values:

Starting from “The weather today is”:

  • Step 0: “sunny” → $\delta_0 = +1.2$
  • Step 1: “and” → $\delta_1 = +0.3$
  • Step 2: “warm” → $\delta_2 = +0.8$
  • Step 3: “today” → $\delta_3 = -0.2$
  • Step 4: “which” → $\delta_4 = +0.5$

K=1: $A^{\text{GAE}}_1 = \delta_0 = 1.2$

K=3 (with $\lambda=0.9$): $A^{\text{GAE}}_3 = \delta_0 + \lambda\delta_1 + \lambda^2\delta_2 = 1.2 + 0.9 \times 0.3 + 0.81 \times 0.8 = 2.118$

K=5: $A^{\text{GAE}}_5 = \delta_0 + \lambda\delta_1 + \lambda^2\delta_2 + \lambda^3\delta_3 + \lambda^4\delta_4$

Effect of K:

  • Larger K: More future context, potentially more accurate, but more computation
  • Smaller K: Less future context, faster, but might miss important future effects

λ (Lambda - Decay Factor) controls how much we trust future predictions vs. immediate ones.

Same example with K=4, different λ:

λ = 0.0: $A = 1.2$ (only immediate step)

λ = 0.5: $A = 1.2 + 0.5 \times 0.3 + 0.25 \times 0.8 + 0.125 \times (-0.2) = 1.525$

λ = 0.9: $A = 1.2 + 0.9 \times 0.3 + 0.81 \times 0.8 + 0.729 \times (-0.2) = 1.972$

λ = 1.0: $A = 1.2 + 0.3 + 0.8 + (-0.2) = 2.1$

Effect of λ:

  • λ → 0: Trust only immediate predictions (low variance, high bias)
  • λ → 1: Trust all future predictions equally (high variance, low bias)
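The worked numbers above can be reproduced with a few lines of Python (the deltas are the illustrative values from the example):

deltas = [1.2, 0.3, 0.8, -0.2]   # delta_0 .. delta_3 from the example (K=4)
for lam in (0.0, 0.5, 0.9, 1.0):
    advantage = sum((lam ** t) * d for t, d in enumerate(deltas))
    print(f"lambda = {lam}: A = {advantage:.3f}")   # 1.200, 1.525, 1.972, 2.100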

Interactive Effects:

Conservative (K=2, λ=0.3):

  • Look only 2 steps ahead, don’t trust future much
  • Fast, stable, but might miss long-term patterns

Aggressive (K=10, λ=0.95):

  • Look far ahead, trust future predictions a lot
  • Slower, noisier, but captures long-term dependencies

Balanced (K=5, λ=0.9):

  • Reasonable lookahead with moderate future trust
  • Good compromise for most applications

Practical Tuning:

For language models:

  • Short responses: Smaller K (fewer tokens to look ahead)
  • Long responses: Larger K (more context matters)
  • Stable domains: Higher λ (future predictions more reliable)
  • Noisy domains: Lower λ (don’t trust uncertain futures)

Putting it Together – PPO Objective

There are a few components to the PPO objective: 1) the clipped surrogate objective, 2) the KL divergence penalty, and 3) the entropy bonus.

1. The Clipped Surrogate Objective

We want to update our policy to pick tokens with higher advantage, but we don’t want the policy to change too drastically in a single update (to avoid instability). The clipped surrogate objective constrains policy updates with a probability ratio $c_t(\pi_\theta)$:

\[L^{\text{clip}}(\theta) = \mathbb{E}_t \left[ \min(c_t(\pi_\theta)A^{GAE}_t, \text{clip}(c_t(\pi_\theta),1-\epsilon, 1+\epsilon)A^{GAE}_t)\right]\]

where $\epsilon$ controls the clipping range and $c_t(\pi_\theta)$ is the probability ratio of predicting a specific token $a_t$ given the cumulative state $s_t$, before and after the update:

\[c_t(\pi_\theta) = \frac{\pi_\theta (a_t | s_t)}{\pi_{\theta_{\text{old}}} (a_t | s_t)}\]

“How much more likely is the new policy to pick this token compared to the old policy?”

Concrete Example:

  • Let’s say the LLM assigns the word unlimited with the following probabilities:
    • Before update: 0.1
    • After update: 0.3
  • Then the probability ratio $c_t=0.3/0.1=3$
  • If we take $\epsilon=0.2$, $c_t$ gets clipped to 1.2
    • The final clipped surrogate objective is $L^{\text{clip}}(\pi_\theta) = 1.2A_K^{\text{GAE}}$, assuming a positive advantage so that the min selects the clipped term (see the sketch below)
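A quick sketch to check these numbers, which also shows what happens when the advantage is negative (the advantage values here are illustrative):

ratio = 0.3 / 0.1                               # c_t = 3.0
eps = 0.2
clipped = max(min(ratio, 1 + eps), 1 - eps)     # clip to [0.8, 1.2] -> 1.2

for advantage in (1.0, -1.0):                   # illustrative advantage values
    objective = min(ratio * advantage, clipped * advantage)
    print(advantage, objective)                 # 1.0 -> 1.2 (clipped), -1.0 -> -3.0 (unclipped)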

Why minimum: min()?

  • When advantage is positive (good action): clipping prevents over-optimization
  • When advantage is negative (bad action): we want to discourage it, but not too aggressively

You can think of clipping as a way to prevent overconfidence.

Without clipping:

  • Model sees one good example → becomes 100x more likely to use that pattern
  • Leads to repetitive, overconfident generations

With clipping:

  • Model can only increase probability by max (1+ε) factor per update
  • Requires multiple training steps to make large changes
  • Results in more stable, gradual learning

2. KL Divergence Penalty

Additionally, we have the KL divergence penalty which prevents the current policy $\theta$ from deviating too far from the original model that we are finetuning from $\theta_{\text{orig}}$:

\[\text{KL}(\theta) = \mathbb{E}_{s_t} \left[ \mathbb{D}_{\text{KL}}(\pi_{\theta_{\text{orig}}}(\cdot | s_t) || \pi_{\theta}(\cdot | s_t)) \right]\]

The KL is simply estimated by taking the average over sequence and batch.

Pseudocode:

# Compute KL divergence between the original (reference) and current policy
# original_model / current_model are assumed to return per-token logits over the vocabulary
import torch.nn.functional as F

logits_orig = original_model(states)      # Original model's logits
logits_current = current_model(states)    # Current model's logits

probs_orig = F.softmax(logits_orig, dim=-1)
log_probs_orig = F.log_softmax(logits_orig, dim=-1)
log_probs_current = F.log_softmax(logits_current, dim=-1)

# KL(orig || current) per token position, summed over the vocabulary
kl_div = (probs_orig * (log_probs_orig - log_probs_current)).sum(dim=-1)
kl_penalty = kl_div.mean()  # Average over sequence and batch

What is KL Divergence?

KL Divergence measures how different two probability distributions are.

Formula

\[\text{KL}(P \| Q) = \sum_x P(x) \times \log\left(\frac{P(x)}{Q(x)}\right)\]

Where P is the “true” distribution and Q is your approximation.

Example

Weather Models:

  • Model A: Sunny 70%, Rainy 20%, Cloudy 10%
  • Model B: Sunny 60%, Rainy 30%, Cloudy 10%

KL(B || A) calculation:

  • Sunny: 0.6 × log(0.6/0.7) = -0.092
  • Rainy: 0.3 × log(0.3/0.2) = 0.122
  • Cloudy: 0.1 × log(0.1/0.1) = 0
  • Total: 0.030 (very similar models)
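This arithmetic (using the natural log) can be checked with a short snippet:

import math

A = {"sunny": 0.7, "rainy": 0.2, "cloudy": 0.1}   # Model A
B = {"sunny": 0.6, "rainy": 0.3, "cloudy": 0.1}   # Model B

# KL(B || A) = sum_x B(x) * log(B(x) / A(x))
kl = sum(B[x] * math.log(B[x] / A[x]) for x in B)
print(round(kl, 2))   # 0.03, matching the total above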

Key Properties

  • KL = 0: Identical distributions
  • KL > 0: Always positive, measures “surprise”
  • Asymmetric: KL(P ‖ Q) ≠ KL(Q ‖ P)

Why in RLHF?

KL_penalty = KL(π_original || π_current)

Prevents the model from:

  • Generating gibberish to game rewards
  • Forgetting original language abilities
  • Deviating too far from human-like text

3. Entropy bonus

The entropy bonus encourages exploration in the LLM’s generation by penalising low entropy:

\[\begin{align} H(\theta) = - \mathbb{E}_{a_t} [\log \pi_\theta (a_t | s_t)]. \end{align}\]

Pseudocode:

# Compute entropy of current policy
probs_current = F.softmax(logits_current, dim=-1)
log_probs_current = F.log_softmax(logits_current, dim=-1)

entropy = -(probs_current * log_probs_current).sum(dim=-1)
entropy_bonus = entropy.mean()  # Average over sequence and batch

What is Entropy?

Entropy measures “How much uncertainty/surprise is in a situation?” Think of it as quantifying how unpredictable something is.

The Basic Intuition

Fair Coin (50% heads, 50% tails):

  • High uncertainty → High entropy
  • You learn a lot when you see the result

Rigged Coin (99% heads, 1% tails):

  • Low uncertainty → Low entropy
  • You barely learn anything when you see “heads”

The Formula

\[H(P) = -\sum_x P(x) \times \log(P(x))\]

Quick Examples

Fair Coin:

  • H = -[0.5×log(0.5) + 0.5×log(0.5)] = 0.69 nats
  • Maximum uncertainty for 2 outcomes

Rigged Coin:

  • H = -[0.99×log(0.99) + 0.01×log(0.01)] = 0.06 nats
  • Very predictable, low entropy
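A quick check of these values (natural log, i.e. nats):

import math

def entropy(probs):
    # H(P) = -sum_x P(x) * log(P(x)), in nats
    return -sum(p * math.log(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 2))    # 0.69 (fair coin)
print(round(entropy([0.99, 0.01]), 2))  # 0.06 (rigged coin)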

In Language Models

High Entropy Generation:

  • Model is uncertain about next word
  • More diverse, creative outputs
  • “The weather today is… [sunny/rainy/cloudy/windy all equally likely]”

Low Entropy Generation:

  • Model is confident about next word
  • More predictable, repetitive outputs
  • “The weather today is… [sunny with 95% confidence]”

Why Entropy Bonus in RL?

Without entropy bonus:

  • Model becomes overconfident
  • Generates repetitive, boring text
  • Gets stuck in local optima

With entropy bonus:

  • Encourages exploration of different word choices
  • Maintains diversity in generations
  • Prevents mode collapse

The trade-off: Balance between reward maximization (exploitation) and maintaining diversity (exploration).

Finally, the PPO objective

Given the three terms above, in addition to the value function MSE loss (recall it is optimised along with the LLM), the PPO objective is defined as follows:

\[\mathcal{L}_{\text{PPO}}(\theta, \gamma) = \underbrace{\mathcal{L}_{\text{clip}}(\theta)}_{\text{Maximise reward}} + \underbrace{w_1 H(\theta)}_{\text{Maximise entropy}} - \underbrace{w_2 \text{KL}(\theta)}_{\text{Penalise KL divergence}} - \underbrace{w_3 \mathcal{L}(\gamma)}_{\text{Critic L2}}\]

A summary of the different terms in this objective is as follows:

  • $\mathcal{L}_{\text{clip}}(\theta)$: Maximize rewards for high-advantage actions (clipped to avoid instability).
  • $H(\theta)$: Maximize entropy to encourage exploration.
  • $\text{KL}(\theta)$: Penalize deviations from the reference policy (stability).
  • $\mathcal{L}(\gamma)$: Minimize error in value predictions (critic L2 loss).
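Putting these together in code, here is a hedged sketch of how the total training loss might be assembled (the four term tensors are assumed to be computed as in the snippets above; the weights are illustrative, not values from the blog):

def ppo_total_loss(clip_term, entropy_bonus, kl_penalty, value_loss,
                   w1=0.01, w2=0.1, w3=0.5):
    # w1, w2, w3 are illustrative weights, not the blog's values.
    # The PPO objective is maximised, so the training loss is its negative.
    objective = clip_term + w1 * entropy_bonus - w2 * kl_penalty - w3 * value_loss
    return -objective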

GRPO: Group Relative Policy Optimization

The key difference lies in how the two algorithms estimate the advantage $A$: instead of estimating it through a critic as in PPO, GRPO does so by drawing multiple samples from the LLM for the same prompt.

Workflow:

  1. For each prompt $p$, sample a group of $N$ responses $\mathcal{G}=\{r_1, r_2, \dots, r_N\}$ from the LLM policy $\pi_\theta$;
  2. Compute rewards $\{R_\phi(r_1), R_\phi(r_2), \dots, R_\phi(r_N)\}$ for each response using the reward model $R_\phi$;
  3. Calculate the group-normalised advantage for each response: \(\begin{align} A_i = \frac{R_\phi(r_i) - \text{mean}(\mathcal{G})}{\text{std}(\mathcal{G})}, \end{align}\) where $\text{mean}(\mathcal{G})$ and $\text{std}(\mathcal{G})$ denote the within-group mean and standard deviation of the rewards, respectively.

In GRPO, the advantage is approximated as the normalised reward of each response within its group of responses. This removes the need for a critic network to estimate per-step values, and it is also mathematically simpler and more elegant.
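A minimal sketch of the group-normalised advantage, assuming rewards is a tensor holding the reward model’s scores for the $N$ responses to one prompt:

import torch

def group_advantages(rewards, eps=1e-8):
    # rewards: shape (N,), one reward per sampled response for the same prompt
    # eps guards against division by zero when all rewards are equal (an extra
    # safeguard, not part of the formula above)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. group_advantages(torch.tensor([1.0, 3.0, 2.0, 4.0]))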

Why has no one tried GRPO before?

  1. Computational Cost Barrier:
    • Traditional approach: 1 response per prompt
    • GRPO approach: 1000 responses per prompt
    • Historical constraint: Generating 1000 responses was prohibitively expensive
    • Modern reality: LLM inference became much cheaper/faster
  2. REINFORCE Baseline Tradition:
    • Classic REINFORCE baseline: Average of current batch (maybe 32 trajectories)
    • GRPO baseline: Average of 1000+ responses for SAME prompt
    • Same-prompt comparison is much more informative!
  3. Scale Wasn’t Available:
    • Earlier RL work: Small action spaces, short episodes
    • Language models: Massive action spaces, long sequences
    • GRPO needs scale to work well (law of large numbers)

The GRPO objective

Similar to PPO, GRPO still makes use of a clipped surrogate loss as well as the KL penalty. The entropy bonus term is not used here, as the group-based sampling already encourages exploration. The clipped surrogate loss is identical to the one used in PPO, but for completeness’ sake here it is: \(\begin{align*} & \mathcal{L}_{\text{clip}}(\theta) = \\ &\frac{1}{N} \sum_{i=1}^N \left( \min\left( \frac{\pi_\theta(r_i|p)}{\pi_{\theta_{\text{old}}}(r_i|p)} A_i, \ \text{clip}\left( \frac{\pi_\theta(r_i|p)}{\pi_{\theta_{\text{old}}}(r_i|p)}, 1-\epsilon, 1+\epsilon \right) A_i \right) \right), \end{align*}\)

then with the KL penalty term, the final GRPO objective can be written as:

\[\begin{align} \mathcal{L}_{\text{GRPO}}(\theta) &= \underbrace{\mathcal{L}_{\text{clip}}(\theta)}_{\text{Maximise reward}} - \underbrace{w_1\mathbb{D}_{\text{KL}}(\pi_\theta || \pi_{\text{orig}})}_{\text{Penalise KL divergence}} \end{align}\]
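For completeness, here is a rough sketch of how the GRPO loss could be assembled at the sequence level, with the per-response ratio computed from summed token log-probabilities (all names and the weight value are illustrative, not taken from the blog):

import torch

def grpo_loss(logp_new, logp_old, advantages, kl_penalty, eps=0.2, w1=0.04):
    # logp_new / logp_old: summed log-probs of each response under the current
    # and old policies, shape (N,); advantages: group-normalised A_i, shape (N,)
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    # Maximise the clipped surrogate, penalise KL from the original model
    return -(surrogate - w1 * kl_penalty)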