In RL, KL estimators should not be judged only by how accurately they estimate KL values, but also by what objective their gradients actually optimize. This post compares three estimators $k_1, k_2, k_3$ in both on-policy and off-policy settings, and shows how the answer changes depending on whether KL is used as a differentiable loss term or as a detached reward-shaping term.
1. Introduction: The Role of KL Divergence in Reinforcement Learning
This post is really about one implementation-level question: when code says "add a KL penalty", why can changing the estimator, changing the sampling distribution, or adding a single .detach() quietly change the optimization target? In policy optimization (PPO, GRPO, etc.) and alignment training (RLHF/RLAIF), a KL penalty looks like a straightforward regularizer that keeps the current policy close to a reference. But once you implement it, several choices immediately branch apart: which estimator ($k_1$, $k_2$, $k_3$), which sampling distribution (on-policy vs. off-policy), and how the KL term enters optimization (as a differentiable loss term vs. as a detached reward-shaping term).
To make that question precise, we need to separate two issues: estimating the value of a KL term, and understanding which objective its gradient actually optimizes. In many implementations those are not the same thing.
1.1 The Distinction Between Forward KL and Reverse KL
Let $q_\theta$ be the current actor policy, $p$ the reference policy. The two directions of KL divergence are:
$$D_{\mathrm{KL}}(q_\theta \,\|\, p) = \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right] \quad \text{(reverse KL)}, \qquad D_{\mathrm{KL}}(p \,\|\, q_\theta) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right] \quad \text{(forward KL)}.$$
Reverse KL is mode-seeking: the policy concentrates on high-probability regions of $p$, possibly sacrificing diversity.
Forward KL is mass-covering: the policy tries to cover the support of $p$.
RLHF typically uses reverse KL because we want the actor not to stray too far from the reference, rather than requiring it to cover every mode.
1.2 Three Choices That Change the Answer
It helps to think of KL implementation as three choices: who the samples come from, which KL direction you care about, and whether KL is backpropagated directly or only used as a reward coefficient. Change any one of them, and the recommended estimator may change too.
Who to sample from? Do samples come from the current policy $q_\theta$ (on-policy), or from a behavior policy $\mu$ (off-policy)?
What to estimate? Are we trying to estimate reverse KL $D_{\mathrm{KL}}(q_\theta \| p)$ or forward KL $D_{\mathrm{KL}}(p \| q_\theta)$?
How to use it? Is the KL term used as a differentiable loss term, or as a detached reward-shaping term (stop-gradient)?
Scope: This post focuses on token/sample-level KL terms and their behavior inside the main policy-gradient term. I only comment briefly on learned critics, GAE, baseline normalization, and fully rigorous off-policy correction for general multi-step MDPs.
Unlike classic notes that mainly discuss KL approximation as a value-estimation problem, this post is closer to the recent LLM-RL question of gradient correctness: once the same estimator moves from a reward coefficient to a differentiable loss term, is it still optimizing the objective you think it is?
Bottom line first (only for the token/sample-level KL terms discussed here)
If the target is reverse KL and KL is a differentiable loss term: in the naive on-policy implementation, use $k_2$; if you explicitly construct $\rho$, prefer $\rho k_3$ or $\mathrm{sg}(\rho)k_2$.
If KL is used as stop-gradient reward shaping: for the policy-gradient term itself, only $k_1$ stays aligned with reverse-KL regularization.
In the other common configurations, the issue is often not a slightly worse value estimate, but that the gradient is optimizing a different objective.
2. Preliminaries: Notation and Basic Concepts
Before getting into the analysis, let's fix the notation and write down two basic results that will be used throughout.
2.1 Notation, Sampling Distribution, and True Gradients
Notation:
$q_\theta$: Current actor policy (parameterized by $\theta$)
$q$: When unambiguous, we write $q := q_\theta$
$p$: Reference policy (independent of $\theta$)
$\mu$: Behavior policy for off-policy sampling (independent of $\theta$)
$s_\theta(x) = \nabla_\theta \log q_\theta(x)$: Score function
$\text{sg}(\cdot)$: Stop-gradient operation (.detach() in code)
A Unified Perspective on Sampling Policies: Introducing the $\rho$ Notation
When analyzing KL-estimator gradients, on-policy and off-policy may look like separate cases, but they can be handled in one framework.
Introduce the sampling policy $\mu$, meaning data are drawn from $x \sim \mu$. Define the unified ratio:
$$\rho := \frac{q_\theta}{\text{sg}(\mu)}.$$
The key insight is: in both on-policy and off-policy analyses, we treat the sampling policy $\mu$ as a gradient constant (i.e., apply stop-gradient to $\mu$).
Off-policy ($\mu \neq q_\theta$): $\mu$ is inherently independent of $\theta$, so $\text{sg}(\mu) = \mu$, giving $\rho = \frac{q_\theta}{\mu}$
On-policy ($\mu = q_\theta$): Set $\mu = q_\theta$ but stop its gradient, so $\rho = \frac{q_\theta}{\text{sg}(q_\theta)} \equiv 1$ (numerically always 1), while still having $\nabla_\theta \rho = s_\theta \neq 0$
In the on-policy case, even though $\rho \equiv 1$ numerically, you must explicitly construct $\rho = \frac{q_\theta}{\text{sg}(q_\theta)}$ (or equivalently $\rho = \exp(\log q_\theta - \text{sg}(\log q_\theta))$) in the computation graph. If you replace it with the literal constant 1, you cut off the score-function path, causing the derivation to degenerate to the "naive on-policy implementation" described later.
What $\rho$ restores is the gradient path coming from the sampling distributionâs dependence on $\theta$. In the on-policy case, that missing path is exactly why expect-then-differentiate and differentiate-then-expect can disagree.
With this notation in place, the rest of the derivation no longer needs two separate tracks.
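A minimal PyTorch sketch of this construction (PyTorch assumed; the tensor values are made up) shows that $\rho$ is numerically 1 while still carrying the score-function gradient:

```python
import torch

# Sketch (PyTorch assumed): on-policy rho = q_theta / sg(q_theta), built as
# exp(log q - sg(log q)), so rho == 1 numerically but still carries the
# score-function gradient path d(rho)/d(theta) = s_theta.
log_q = torch.tensor([-1.3, -0.7], requires_grad=True)  # stand-in for log q_theta(x)

rho = torch.exp(log_q - log_q.detach())
print(rho)  # numerically all ones

# Backprop through rho.sum(): d(rho_i)/d(log_q_i) = rho_i = 1, so the
# score-function path survives. Replacing rho with the literal constant 1
# would make this gradient identically zero.
rho.sum().backward()
print(log_q.grad)  # all ones as well
```

Note that dropping the `.detach()` gives `exp(log_q - log_q)`, whose two gradient paths cancel exactly: that is the same degeneration as writing the literal constant 1.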
Score Function and True KL Gradients
The score function has an important property: $\mathbb{E}_{q_\theta}[s_\theta] = 0$ (since $\int \nabla_\theta q_\theta dx = \nabla_\theta \int q_\theta dx = \nabla_\theta 1 = 0$).
Using this property, we can derive the true gradients of forward and reverse KL divergences with respect to $\theta$:
$$\nabla_\theta D_{\mathrm{KL}}(q_\theta \| p) = \mathbb{E}_{q_\theta}\!\left[s_\theta \log \frac{q_\theta}{p}\right], \qquad \nabla_\theta D_{\mathrm{KL}}(p \| q_\theta) = \mathbb{E}_{q_\theta}\!\left[\left(1 - \frac{p}{q_\theta}\right) s_\theta\right].$$
Preview: We will later define $k_1 := -\log\frac{p}{q_\theta}$, so the above can be written concisely as $\nabla_\theta D_{\mathrm{KL}}(q_\theta \| p) = \mathbb{E}_{q_\theta}[s_\theta \cdot k_1]$; this form appears repeatedly in gradient analysis.
Preview: We will later derive $\nabla_\theta k_3 = (1-\frac{p}{q_\theta}) s_\theta$, so $\mathbb{E}_{q_\theta}[\nabla_\theta k_3] = \nabla_\theta D_{\mathrm{KL}}(p \| q_\theta)$ (forward KL). This is why directly backpropagating through $k_3$ produces the "wrong" gradient direction when you intend reverse KL.
With these two results, we can later determine which KLâs true gradient each estimatorâs gradient expectation corresponds to.
3. Three Estimators: Definitions and Design Principles
Let $\frac{p(x)}{q_\theta(x)}$ denote the ratio. In a classic note, John Schulman compared three single-sample KL estimators that keep showing up in RLHF and LLM-RL implementations.
3.1 The Three Estimators: Definitions and Intuition
The most direct definition is the negative log-ratio:
$$k_1 := -\log\frac{p(x)}{q_\theta(x)} = \log\frac{q_\theta(x)}{p(x)}.$$
It is unbiased for reverse KL, but the main issue is not that it targets the wrong thing. The issue is that a single sample can be positive or negative. Even when the true KL is small, samples can still swing in both directions, which often leads to large relative variance.
Design motivation: $k_1$ can be either positive or negative; squaring yields an estimator where every sample is non-negative:
$$k_2 := \frac{1}{2}\left(\log\frac{p}{q_\theta}\right)^2 = \frac{1}{2}k_1^2.$$
Each sample measures the magnitude of mismatch between $p$ and $q$.
Why is the bias often small? More precisely, $\mathbb{E}_{q_\theta}[k_2]$ is not reverse KL itself, but near $q_\theta \approx p$ it shares the same second-order local expansion. So it is best understood as a locally useful surrogate; once you leave that small-KL neighborhood, the approximation need not stay reliable.
Technical note: why does $k_2$ share the same second-order local behavior as reverse KL?
Under the usual regularity conditions, if $q_{\theta_0}=p$ and we perturb by a small $\Delta\theta$, then both quantities share the same leading term:
$$\mathbb{E}_{q_\theta}[k_2] \approx \tfrac{1}{2}\,\Delta\theta^\top F(\theta_0)\,\Delta\theta \approx D_{\mathrm{KL}}(q_\theta \| p),$$
where $F$ is the Fisher information matrix; the two differ only beyond second order.
Design motivation: We want an estimator that is both unbiased and low variance. A standard approach is to add a control variate to $k_1$: a term with zero expectation that (ideally) is negatively correlated with $k_1$.
Note that $\mathbb{E}_{q_\theta}\left[\frac{p}{q_\theta} - 1\right] = \mathbb{E}_{q_\theta}\left[\frac{p}{q_\theta}\right] - 1 = 1 - 1 = 0$, so for any $\lambda$, $k_1 + \lambda\left(\frac{p}{q_\theta} - 1\right)$ is still an unbiased estimator of reverse KL. Choosing $\lambda = 1$ gives
$$k_3 := \frac{p}{q_\theta} - 1 - \log\frac{p}{q_\theta}.$$
It is always non-negative. This ensures every sample contributes "positively" to the estimate, eliminating the cancellation problem of $k_1$.
Geometric intuition: $k_3$ is actually a Bregman divergence. Consider the convex function $\phi(x) = -\log x$, whose tangent at $x=1$ is $y = 1 - x$. The Bregman divergence is defined as the difference between the function value and the tangent value:
$$B_\phi(x) = \phi(x) - \phi(1) - \phi'(1)(x-1) = -\log x + (x - 1), \qquad k_3 = B_\phi\!\left(\frac{p}{q_\theta}\right).$$
Since a convex function always lies above its tangent, this gap is naturally non-negative. More importantly, as $\frac{p}{q_\theta} \to 1$, the gap shrinks at a second-order rate $\left(\frac{p}{q_\theta} - 1\right)^2$, which is the fundamental reason why $k_3$ tends to have lower variance when the policies are close.
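Both properties are easy to check numerically; here is a small pure-Python sketch (the sample ratios are arbitrary):

```python
import math

# Pure-Python check of the two k3 properties; sample ratios are arbitrary.
def k1(r):  # r = p/q
    return -math.log(r)

def k3(r):
    return r - 1 - math.log(r)

# Non-negativity: the convex function -log x sits above its tangent at 1.
for r in [0.1, 0.5, 0.9, 1.0, 1.1, 2.0, 10.0]:
    assert k3(r) >= 0.0

# Second-order decay near r = 1: k3(1+eps) ~ eps^2/2, while k1(1+eps) ~ -eps.
for eps in [1e-2, 1e-3]:
    assert abs(k3(1 + eps) - eps**2 / 2) < eps**3
    assert abs(k1(1 + eps) + eps) < eps**2
print("k3 is non-negative and second-order near 1")
```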
When KL is large ($p = \mathcal{N}(1,1)$, true KL $= 0.5$):

| Estimator | bias/true | stdev/true |
| --- | --- | --- |
| $k_1$ | 0 | 2 |
| $k_2$ | 0.25 | 1.73 |
| $k_3$ | 0 | 1.7 |
Intuitively:
$k_1 = -\log \frac{p}{q}$ has a first-order term; when $\frac{p}{q}$ is close to 1 it can fluctuate substantially and can be negative.
$k_3 = \frac{p}{q} - 1 - \log \frac{p}{q}$ is second-order around $\frac{p}{q}=1$ and is always non-negative, which typically yields lower variance when the policies are close.
Once you leave the small-KL, well-covered regime and $\frac{p}{q}$ can become very large, the variance of $k_3$ can also blow up. At that point the comparison between $k_1$ and $k_3$ is no longer one-sided.
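The table's numbers can be reproduced with a quick Monte Carlo sketch, assuming $q = \mathcal{N}(0,1)$ and $p = \mathcal{N}(1,1)$ so that $\log\frac{p}{q}(x) = x - \tfrac{1}{2}$:

```python
import math
import random
import statistics

random.seed(0)
# Monte Carlo sketch of the table, assuming q = N(0,1), p = N(1,1):
# log(p/q)(x) = x - 1/2 for x ~ q, and the true reverse KL is 0.5.
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]
log_r = [x - 0.5 for x in xs]
k1 = [-t for t in log_r]
k2 = [0.5 * t * t for t in log_r]
k3 = [math.exp(t) - 1 - t for t in log_r]

true_kl = 0.5
for name, vals in [("k1", k1), ("k2", k2), ("k3", k3)]:
    bias = statistics.fmean(vals) / true_kl - 1
    rel_std = statistics.pstdev(vals) / true_kl
    print(name, round(bias, 2), round(rel_std, 2))
# Expect roughly: k1 -> bias 0, std 2;  k2 -> bias 0.25, std 1.73;  k3 -> bias 0, std 1.7
```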
If you only care about KL values, this is the quick picture:

| Estimator | Bias for value | Variance characteristics |
| --- | --- | --- |
| $k_1$ | Unbiased | High (can be +/-) |
| $k_2$ | Biased (often small near KL=0) | Low (always positive) |
| $k_3$ | Unbiased | Low (always positive) |
So at the pure value-estimation level, $k_3$ is often the safest choice in the common small-KL, well-covered regime.
Note: to estimate the forward KL value $D_{\mathrm{KL}}(p \| q_\theta) = \mathbb{E}_p\left[\log \frac{p}{q_\theta}\right]$ while sampling only from $q_\theta$, use importance sampling: $\mathbb{E}_{q_\theta}\left[\frac{p}{q_\theta} \log \frac{p}{q_\theta}\right]$.
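As a sanity check, here is a Monte Carlo sketch (assuming $q = \mathcal{N}(0,1)$, $p = \mathcal{N}(1,1)$, for which the forward KL is exactly $0.5$) that estimates the forward KL both directly and via the importance-sampling form:

```python
import math
import random
import statistics

random.seed(0)
# Forward KL D(p || q) estimated two ways, assuming q = N(0,1), p = N(1,1):
# log(p/q)(x) = x - 1/2, and D(p || q) = 0.5 exactly.
n = 300_000

# (a) sample from p directly: E_p[log(p/q)]
direct = statistics.fmean(random.gauss(1.0, 1.0) - 0.5 for _ in range(n))

# (b) sample from q, importance-weighted: E_q[(p/q) * log(p/q)]
def is_term():
    x = random.gauss(0.0, 1.0)
    log_r = x - 0.5
    return math.exp(log_r) * log_r

importance = statistics.fmean(is_term() for _ in range(n))
print(round(direct, 2), round(importance, 2))  # both near 0.5
```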
5. Two Ways to Use a KL Penalty
The next fork in the road is simply how the KL penalty is used in code. That choice determines whether value-estimation properties are enough, or whether gradient properties become the real issue.
Recall the objective for KL-regularized reinforcement learning (where $\tau \sim q_\theta$ denotes the trajectory distribution induced by policy $q_\theta$):
$$J(\theta) = \mathbb{E}_{\tau \sim q_\theta}\!\left[R(\tau)\right] - \beta\, D_{\mathrm{KL}}(q_\theta \,\|\, p).$$
This mathematical form looks unified, but in actor-critic algorithms (e.g., PPO) it gives rise to two fundamentally different implementation paradigms. They often differ by only a few lines of code, yet correspond to different optimization semantics.
Notation: In this section, we use $\text{KL}_t$ or $\text{KL}(s)$ to generically refer to a token/state-level KL estimator (such as $k_1, k_2, k_3$), with specific definitions from the earlier section "Three Estimators: Definitions and Design Principles".
5.1 As a Loss Term: KL Backpropagates Directly
```python
actor_loss = -advantage * log_prob + beta * kl  # kl participates in the gradient
```
The critic learns only the environment value function; the KL term acts as an explicit regularizer for the actor and participates directly in backpropagation.
5.2 As a Reward-Shaping Term: KL Changes Reward but Does Not Backpropagate
KL is treated as part of the reward via reward shaping, and the actor-critic update is performed on the shaped reward. The KL term itself is detached and does not backpropagate.
In many implementations, these two forms differ by just a .detach(). But optimization-wise they are not the same algorithm. The basic split is:
KL as a loss term: Requires correct gradients for the KL component, including which objective those gradients correspond to.
KL as a reward-shaping term: Requires accurate KL values, and also requires that the induced policy-gradient update matches the intended objective.
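The split is easiest to see in a minimal single-step sketch (PyTorch assumed; the tensor values, the generic `kl_est`, and the absence of critic/GAE are all simplifying assumptions):

```python
import torch

# Minimal single-step sketch (PyTorch assumed; no critic/GAE/baseline,
# all tensor values illustrative).
log_q = torch.tensor([-1.0, -2.0], requires_grad=True)  # log q_theta of taken actions
log_p = torch.tensor([-1.1, -1.8])                      # reference policy log-probs
advantage = torch.tensor([0.5, -0.3])
reward = torch.tensor([1.0, 0.2])
beta = 0.1

kl_est = log_q - log_p  # a generic per-sample KL estimator (here k1)

# (a) KL as a differentiable loss term: kl_est participates in the gradient.
loss_a = (-advantage * log_q + beta * kl_est).mean()

# (b) KL as detached reward shaping: kl_est only shifts the shaped reward.
shaped = reward - beta * kl_est.detach()
loss_b = (-shaped * log_q).mean()

ga = torch.autograd.grad(loss_a, log_q, retain_graph=True)[0]
gb = torch.autograd.grad(loss_b, log_q)[0]
print(ga, gb)  # a single .detach() changes the update direction
```

Which estimator belongs in each slot is exactly the subject of the next two sections; the point here is only that the two paradigms differ by one `.detach()` yet produce different gradients.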
6. Gradient Analysis for Differentiable KL Losses
When KL serves as a differentiable loss term, the key question is which objective each estimator actually optimizes through its gradient. This is subtle but central in practice.
We will keep using the same unified framework, so on-policy and off-policy can be handled in one derivation. Recall the ratio $\rho = \frac{q_\theta}{\text{sg}(\mu)}$, and the pointwise gradients of the three estimators:
$$\nabla_\theta k_1 = s_\theta, \qquad \nabla_\theta k_2 = k_1\, s_\theta, \qquad \nabla_\theta k_3 = \left(1 - \frac{p}{q_\theta}\right) s_\theta.$$
These basic gradients will be used repeatedly in the unified framework analysis that follows.
"Expect-then-Differentiate" vs. "Differentiate-then-Expect"
When analyzing estimator gradients, there is a common pitfall: "expect-then-differentiate" and "differentiate-then-expect" need not agree.
If we treat $\mathbb{E}_{q_\theta}[k_i]$ as a function of $\theta$ and differentiate analytically (i.e., "expect-then-differentiate"), then because $\mathbb{E}_{q_\theta}[k_1] = \mathbb{E}_{q_\theta}[k_3] = D_{\mathrm{KL}}(q_\theta \| p)$, we have:
$$\nabla_\theta\, \mathbb{E}_{q_\theta}[k_1] = \nabla_\theta\, \mathbb{E}_{q_\theta}[k_3] = \nabla_\theta D_{\mathrm{KL}}(q_\theta \| p).$$
Both yield the reverse-KL gradient. However, when you backpropagate through the sample mean of $k_i$ in code, autograd effectively computes "differentiate-then-expect", i.e., $\mathbb{E}_{q_\theta}[\nabla_\theta k_i]$, which can differ.
The root cause is that the sampling distribution $q_\theta$ depends on $\theta$, so expectation and differentiation cannot be exchanged naively. This is exactly the subtlety in the on-policy case, and why we introduce the unified $\rho$ framework.
6.2 Gradient Analysis Under the Unified Framework
Now, we use the $\rho$ framework to uniformly handle on-policy and off-policy scenarios. Consider the loss function form $L = \rho \cdot k$, where $\rho = \frac{q_\theta}{\text{sg}(\mu)}$.
From here on, every expectation is with respect to a fixed sampling distribution $\mu$. Under that condition, because $\text{sg}(\mu)$ does not depend on $\theta$, for any differentiable $f_\theta(x)$ we have
$$\nabla_\theta\, \mathbb{E}_{x \sim \mu}\!\left[f_\theta(x)\right] = \mathbb{E}_{x \sim \mu}\!\left[\nabla_\theta f_\theta(x)\right].$$
So under the $\rho$ framework, âexpect-then-differentiateâ and âdifferentiate-then-expectâ are equivalent for $\mathbb{E}_\mu[\cdot]$. This does not mean you can freely do the same under $\mathbb{E}_{q_\theta}[\cdot]$.
Gradient Derivations for the Three Estimators Under the Unified Framework
Using $\nabla_\theta \rho = \rho \cdot s_\theta$ (since $\rho = q_\theta / \text{sg}(\mu)$), combined with the previously derived $\nabla_\theta k_i$, applying the product rule:
$$\nabla_\theta(\rho k_1) = \rho\, s_\theta\,(k_1 + 1), \qquad \nabla_\theta(\rho k_2) = \rho\, s_\theta\,(k_2 + k_1), \qquad \nabla_\theta(\rho k_3) = \rho\, s_\theta\Big(k_3 + 1 - \frac{p}{q_\theta}\Big) = \rho\, s_\theta\, k_1.$$
In other words, the gradient expectation of $\rho k_2$ corresponds to âminimizing $\mathbb{E}_{q_\theta}[k_2]$â (a surrogate with the same local second-order behavior as KL), not the true gradient of reverse KL $D_{\mathrm{KL}}(q_\theta\|p)$; therefore, when the goal is reverse KL, avoid using $\rho k_2$.
Gradient Equivalence: Which Methods Produce Identical Gradient Random Variables
This is also why $\text{sg}(\rho) k_2$ and $\rho k_3$ keep appearing together: for the same sample $x$, they backpropagate the exact same gradient vector, namely $\rho s_\theta k_1$. Not merely the same expectation, but the same sample-level gradient. So their means, variances, and higher-order statistics all match.
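The identity itself is elementary and worth verifying once; a pure-Python sketch with arbitrary ratios:

```python
import math
import random

random.seed(0)
# For any ratio r = p/q > 0: k3 + 1 - r == k1 exactly, since
#   k1 = -log r,  k3 = r - 1 - log r.
# This is the algebra that collapses grad(rho * k3) to rho * s_theta * k1,
# the same vector that sg(rho) * k2 backpropagates.
for _ in range(1000):
    r = math.exp(random.uniform(-3.0, 3.0))
    k1 = -math.log(r)
    k3 = r - 1 - math.log(r)
    assert abs((k3 + 1 - r) - k1) < 1e-12
print("sample-level identity holds")
```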
| Loss Writing | Gradient Random Variable | Expected Gradient | Optimization Objective |
| --- | --- | --- | --- |
| $\rho k_1$ | $\rho s_\theta (k_1+1)$ | $\nabla D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL ✅ |
| $\rho k_2$ | $\rho s_\theta (k_2 + k_1)$ | $\nabla_\theta \mathbb{E}_{q_\theta}[k_2]$ | Surrogate (not reverse KL) ❌ |
| $\text{sg}(\rho) k_2$ | $\rho s_\theta k_1$ | $\nabla D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL ✅ |
| $\rho k_3$ | $\rho s_\theta k_1$ | $\nabla D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL ✅ |
6.3 A Unified View of On-Policy and Off-Policy
With that in place, the on-policy/off-policy relationship becomes much easier to read. On-policy, $\rho \equiv 1$ numerically, so the loss values of $\rho k_i$ and plain $k_i$ coincide. But the gradients differ, because $\nabla_\theta \rho = s_\theta \neq 0$.
This explains why naive direct backpropagation (i.e., without explicitly constructing $\rho$) fails when using $k_1$ or $k_3$ as the KL loss term in the on-policy case:
Directly using $k_1$ (without $\rho$): $\mathbb{E}_{q_\theta}[\nabla k_1] = \mathbb{E}_{q_\theta}[s_\theta] = 0$, so the KL term is ineffective.
Directly using $k_3$ (without $\rho$): $\mathbb{E}_{q_\theta}[\nabla k_3] = \nabla D_{\mathrm{KL}}(p \| q_\theta)$ (forward KL), i.e., the wrong direction for reverse-KL regularization.
Directly using $k_2$: $\mathbb{E}_{q_\theta}[\nabla k_2] = \nabla D_{\mathrm{KL}}(q_\theta \| p)$ (reverse KL), which makes it the only correct choice under the naive implementation.
If you explicitly construct $\rho = \frac{q_\theta}{\text{sg}(q_\theta)}$, then:
Usable: $\rho k_1$ (higher variance), $\text{sg}(\rho) k_2$ (recommended), and $\rho k_3$ (recommended) all yield reverse-KL gradients.
Not usable: $\rho k_2$ (where $\rho$ participates in the gradient) optimizes a local second-order surrogate rather than the reverse KL.
The reason $k_2$ works directly in the on-policy case is not a general principle; it is a special degeneration when $\rho \equiv 1$. In that setting, $\nabla_\theta k_2 = k_1 s_\theta$ happens to land on the correct reverse-KL gradient. That should not be extrapolated to general off-policy settings.
Earlier we saw that three choices give unbiased gradients for reverse KL: $\rho k_1$, $\text{sg}(\rho) k_2$, $\rho k_3$. Their gradient random variables are (note that $s_\theta$ is a vector, so the gradient is also a vector):
$$g_1 = \rho\, s_\theta\,(k_1 + 1), \qquad g_\star = \rho\, s_\theta\, k_1,$$
where $g_\star$ corresponds to both $\text{sg}(\rho) k_2$ and $\rho k_3$ (they are identical).
To avoid ambiguity in "variance of a vector gradient", we compare the projection variance in any direction: take any unit vector $u$, and define the scalar random variables $X_1 = u^\top g_1$ and $X_\star = u^\top g_\star$. Since $\mathbb{E}[X_1] = \mathbb{E}[X_\star]$, it suffices to compare second moments:
$$\mathbb{E}[X_1^2] - \mathbb{E}[X_\star^2] = \mathbb{E}\!\left[(u^\top \rho\, s_\theta)^2\,(2k_1 + 1)\right].$$
In the small-KL regime $k_1 \approx 0$, so $2k_1 + 1 > 0$ and $g_1$ has the larger projection variance in every direction.
Once you leave this small-KL neighborhood, however, the sign of $2k_1+1$ is no longer guaranteed. At that point the comparison also depends on the $\rho^2$ weighting and the score-function term, so the local expansion above should not be over-interpreted.
Intuitively:
$g_1 = \rho s_\theta (k_1 + 1)$ contains a zero-mean noise term of magnitude $O(1)$: $\rho s_\theta$
$g_\star = \rho s_\theta k_1$ has eliminated this constant noise term, leaving only terms proportional to $k_1$, which is itself small in the small-KL regime
$\text{sg}(\rho) k_2$ and $\rho k_3$ give the same gradient random variable. By contrast, $\rho k_1$ carries an extra zero-mean constant-noise term, which is why it is usually noisier in the small-KL regime.
Practical recommendation: For optimizing reverse KL, prefer $\rho k_3$ or $\text{sg}(\rho) k_2$ (both have equivalent gradients and low variance); $\rho k_1$ is unbiased but has higher variance, and can serve as a fallback with clipping/regularization.
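The variance gap can be seen in a toy Monte Carlo sketch (all specifics are assumptions: on-policy, so $\rho = 1$, with $q = \mathcal{N}(0, s^2)$ parameterized by the scale $s$ and $p = \mathcal{N}(0,1)$):

```python
import math
import random
import statistics

random.seed(0)
# Toy on-policy comparison (all specifics are assumptions): rho = 1,
# q = N(0, s^2) with scalar parameter s, p = N(0, 1).
# Closed form: d/ds KL(q || p) = s - 1/s.
s = 1.2
true_grad = s - 1.0 / s

g1, gstar = [], []
for _ in range(200_000):
    x = random.gauss(0.0, s)
    score = (x * x - s * s) / s**3                         # d/ds log q(x)
    k1 = -math.log(s) + 0.5 * x * x * (1.0 - 1.0 / s**2)   # log(q/p)(x)
    gstar.append(score * k1)        # from sg(rho)*k2 or rho*k3
    g1.append(score * (k1 + 1.0))   # from rho*k1: extra zero-mean term s_theta

print(round(statistics.fmean(gstar), 2), round(statistics.fmean(g1), 2), round(true_grad, 2))
print(statistics.pstdev(g1) > statistics.pstdev(gstar))  # True: g1 is noisier
```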
Warning (extreme off-policy mismatch):
When $\mu$ differs greatly from $q_\theta$ â for example, when $\mu$ has almost no samples in high-density regions of $q_\theta$, or when $\rho = q_\theta / \mu$ explodes in the tails â any $\rho$-based method will suffer from severe variance issues. In such cases, the advantage of $\rho k_3$ (or $\text{sg}(\rho) k_2$) over $\rho k_1$ is no longer theoretically guaranteed, and strategies like clipping and regularization must be combined.
However, in RL practice we typically control KL constraints and limit the degree of off-policy sampling (e.g., using a nearby policy $\mu = q_{\theta_\text{old}}$). In this common regime, we can say with confidence:
If youâve decided to use importance sampling to optimize reverse KL, we recommend using $\rho k_3$ or $\text{sg}(\rho) k_2$ (both have equivalent gradients and low variance); in comparison, $\rho k_1$ has higher variance.
This is also how the DeepSeek v3.2 technical report's "unbiased KL estimate" lines up with the notation here: they use $\frac{q_\theta}{\mu} k_3$, i.e. $\rho k_3$, to recover both unbiased KL estimation and the correct reverse-KL gradient.
Under the unified framework, the gradient targets are:

| Sampling Type | Loss | Expected $\nabla_\theta$ Loss | Optimization Objective | Usable for Reverse KL? |
| --- | --- | --- | --- | --- |
| on/off-policy | $\rho k_1$ | $\nabla_\theta D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL | ✅ (higher variance) |
| on/off-policy | $\rho k_2$ | $\nabla_\theta \mathbb{E}_{q_\theta}[k_2]$ | Surrogate (not reverse KL) | ❌ |
| on/off-policy | $\text{sg}(\rho) k_2$ | $\nabla_\theta D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL | ✅ (recommended, low variance) |
| on/off-policy | $\rho k_3$ | $\nabla_\theta D_{\mathrm{KL}}(q \Vert p)$ | Reverse KL | ✅ (recommended, low variance) |
where $\rho = \frac{q_\theta}{\text{sg}(\mu)}$. When on-policy ($\mu = q_\theta$), $\rho \equiv 1$.
It must be emphasized: the conclusions in the table above apply to the unified framework where the loss is written as $L=\rho\,k$ and $\rho$ retains its gradient path in the computation graph. In the on-policy case, although $\rho \equiv 1$ numerically, since $\rho=\frac{q_\theta}{\text{sg}(q_\theta)}$, we still have $\nabla_\theta\rho=s_\theta\neq 0$, so $\rho k$ and directly backpropagating through the sample mean of $k$ are not equivalent in terms of gradients.
If you use the naive on-policy implementation (i.e., after sampling from $q_\theta$, treat $\{k_i(x)\}$ as ordinary scalars and directly backpropagate through their sample mean; without explicitly constructing $\rho=\frac{q_\theta}{\text{sg}(q_\theta)}$ to restore the score-function path), then it degenerates to:
Directly using $k_1$: $\mathbb{E}_{q_\theta}[\nabla k_1]=0$ (ineffective)
Directly using $k_2$: $\mathbb{E}_{q_\theta}[\nabla k_2]=\nabla D_{\mathrm{KL}}(q_\theta\|p)$ (reverse KL) ✅
Directly using $k_3$: $\mathbb{E}_{q_\theta}[\nabla k_3]=\nabla D_{\mathrm{KL}}(p\|q_\theta)$ (forward KL) ❌
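These three degenerate behaviors can be checked numerically in a toy model (a sketch; $q = \mathcal{N}(0, s^2)$ with scalar parameter $s$ and $p = \mathcal{N}(0,1)$ are assumptions, with closed-form KL gradients in $s$):

```python
import math
import random
import statistics

random.seed(0)
# Toy model (assumptions): q = N(0, s^2) with scalar parameter s, p = N(0, 1).
# Closed-form gradients w.r.t. s:
#   reverse: d/ds KL(q || p) = s - 1/s
#   forward: d/ds KL(p || q) = 1/s - 1/s^3
s = 1.2
rev = s - 1.0 / s
fwd = 1.0 / s - 1.0 / s**3

gk1, gk2, gk3 = [], [], []
for _ in range(300_000):
    x = random.gauss(0.0, s)
    score = (x * x - s * s) / s**3                         # d/ds log q(x)
    k1 = -math.log(s) + 0.5 * x * x * (1.0 - 1.0 / s**2)   # log(q/p)(x)
    ratio = math.exp(-k1)                                  # p/q
    gk1.append(score)                  # naive grad of k1
    gk2.append(k1 * score)             # naive grad of k2
    gk3.append((1.0 - ratio) * score)  # naive grad of k3

print(round(statistics.fmean(gk1), 3), 0.0)            # ineffective
print(round(statistics.fmean(gk2), 3), round(rev, 3))  # tracks reverse KL
print(round(statistics.fmean(gk3), 3), round(fwd, 3))  # tracks forward KL
```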
Compressed into one short list:
On-policy optimization of reverse KL (naive direct backprop implementation): The only correct choice is $k_2$
Off-policy optimization of reverse KL: Three correct options:
$\rho k_1$: Unbiased but higher variance
$\text{sg}(\rho) k_2$: Unbiased, gradient identical to $\rho k_3$
$\rho k_3$: Unbiased and lower variance (equivalent to the above, both recommended)
$\rho k_2$ (weight participates in gradient) fails: This is an easily overlooked pitfall
7. Gradient Analysis for KL Reward Shaping
This is where the easiest mistake happens. Since both $k_1$ and $k_3$ are unbiased as reverse-KL value estimators, it is tempting to think that either one should be fine once detached and used inside reward shaping.
That inference is wrong. Value unbiasedness does not imply gradient correctness inside reward shaping. Once KL becomes part of shaped reward, the relevant object is $\mathbb{E}[s_\theta \hat{k}]$, not $\mathbb{E}[\hat{k}]$.
7.1 The True KL-Regularized Policy Gradient
Consider the KL-regularized reinforcement learning objective:
$$J(\theta) = \mathbb{E}_{\tau \sim q_\theta}\!\left[R(\tau)\right] - \beta\, D_{\mathrm{KL}}(q_\theta \,\|\, p).$$
For the next few paragraphs, focus only on the policy-gradient term itself. I am not folding in extra effects from learned critics, GAE, baseline fitting error, or normalization. In that simplified setting, when we use some estimator $\hat{k}$ (with stop-gradient) inside reward shaping, the shaped reward is $\tilde{R} = R - \beta \cdot \text{sg}(\hat{k})$, and the policy gradient becomes:
$$\mathbb{E}_{q_\theta}\!\left[s_\theta\,\tilde{R}\right] = \mathbb{E}_{q_\theta}[s_\theta R] - \beta\,\mathbb{E}_{q_\theta}\!\left[s_\theta\,\hat{k}\right].$$
The KL part matches true reverse-KL regularization exactly when $\mathbb{E}_{q_\theta}[s_\theta \hat{k}] = \mathbb{E}_{q_\theta}[s_\theta k_1] = \nabla_\theta D_{\mathrm{KL}}(q_\theta \| p)$.
For $\hat{k} = k_3$, using $\mathbb{E}_{q_\theta}[s_\theta] = 0$:
$$\mathbb{E}_{q_\theta}[s_\theta k_3] = \mathbb{E}_{q_\theta}[s_\theta k_1] + \mathbb{E}_{q_\theta}\!\left[s_\theta \frac{p}{q_\theta}\right] = \nabla_\theta D_{\mathrm{KL}}(q_\theta \| p) - \nabla_\theta D_{\mathrm{KL}}(p \| q_\theta).$$
When $k_3$ is used in reward shaping, the gradient is therefore biased, with the bias term equal to the negative of the forward KL gradient.
More precisely, using $k_3$ inside reward shaping mixes an extra bias term related to the forward-KL gradient into the reverse-KL update. So the resulting update no longer corresponds to pure reverse-KL regularization, which helps explain why it can become unstable.
Empirical evidence: Shah et al. (2025) report that in on-policy RL fine-tuning of LLMs:
$k_1$ in reward: Training is stable
$k_3$ in reward: Training becomes unstable and can collapse
This is consistent with the theoretical analysis above.
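The biased expectation can also be verified directly in a toy model (a sketch; the 1-D Gaussian setup, $q = \mathcal{N}(0, s^2)$ with parameter $s$ and $p = \mathcal{N}(0,1)$, is an assumption):

```python
import math
import random
import statistics

random.seed(0)
# Toy model (assumptions): q = N(0, s^2) with scalar parameter s, p = N(0, 1).
# Claim to check: E_q[s_theta * k3] = rev_grad - fwd_grad, i.e. the
# reward-shaping update with k3 is biased by minus the forward-KL gradient.
s = 1.2
rev = s - 1.0 / s            # d/ds KL(q || p)
fwd = 1.0 / s - 1.0 / s**3   # d/ds KL(p || q)

vals = []
for _ in range(300_000):
    x = random.gauss(0.0, s)
    score = (x * x - s * s) / s**3
    k1 = -math.log(s) + 0.5 * x * x * (1.0 - 1.0 / s**2)  # log(q/p)(x)
    k3 = math.exp(-k1) - 1.0 + k1                          # p/q - 1 - log(p/q)
    vals.append(score * k3)

est = statistics.fmean(vals)
print(round(est, 3), round(rev - fwd, 3), round(rev, 3))  # est tracks rev - fwd, not rev
```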
Using $k_2$ as Penalty: Also Biased
When $\hat{k} = k_2 = \frac{1}{2}k_1^2$, the KL part of the induced policy-gradient term becomes
$$\mathbb{E}_{q_\theta}[s_\theta k_2] = \tfrac{1}{2}\,\mathbb{E}_{q_\theta}\!\left[s_\theta k_1^2\right],$$
which in general is not equal to $\mathbb{E}_{q_\theta}[s_\theta k_1] = \nabla_\theta D_{\mathrm{KL}}(q_\theta \| p)$, so this configuration is biased as well.
The unbiasedness condition remains $\mathbb{E}_{q_\theta}[s_\theta \cdot k] = \mathbb{E}_{q_\theta}[s_\theta \cdot k_1]$, exactly the same as on-policy.
One point is worth making explicit here: in the token/sample-level off-policy policy-gradient term discussed in this post, the importance weight $\frac{q_\theta}{\mu}$ multiplies the whole policy-gradient estimator. There is no need to additionally importance-weight the KL scalar inside the shaped reward. Therefore:
Shaped reward keeps its original form: $\tilde{R} = R - \beta \cdot k_1$ (not $R - \beta \cdot \frac{q_\theta}{\mu} k_1$)
Under the stop-gradient reward shaping ($\tilde{R}=R-\beta\,\text{sg}(k)$) with the reverse-KL regularization setting discussed in this post, the conclusion is the same as in the on-policy case: use $k_1$, not $k_3$.
Note: This discussion assumes the usual current-sample / current-token reward-shaping form. In a general multi-step MDP, a fully rigorous off-policy derivation also needs per-step importance weighting or the corresponding value-function correction.
7.2 The Conclusion of This Section: Only $k_1$ Stays Unbiased
| Estimator | Value unbiased? | Policy-gradient term unbiased under stop-grad reward shaping? | Actual performance |
| --- | --- | --- | --- |
| $k_1$ | ✅ | ✅ | Stable |
| $k_2$ | ❌ | ❌ | Not advised |
| $k_3$ | ✅ | ❌ | Can collapse |
Stepping back, value unbiasedness and gradient correctness are two separate axes. For the stop-gradient reward-shaping setup discussed here, only $k_1$ gives the correct policy-gradient term for reverse-KL regularization. Even though $k_3$ is value-unbiased and often lower variance, using it in reward shaping introduces a biased update and can destabilize training.
Scope reminder: once you add a learned critic, GAE, baseline normalization, and other implementation details, additional bias sources appear. The conclusion here is intentionally about the policy-gradient term itself.
At this point, an apparent tension may arise:
In reward shaping we emphasize âonly use $k_1$â;
But in the earlier loss-term backpropagation discussion (especially off-policy), we recommend using $\rho k_3$ or $\text{sg}(\rho)k_2$ for lower-variance reverse-KL gradients.
The next section explains why these are not contradictory: for the KL regularization termâs contribution to the policy-gradient update, the two implementations can be sample-wise equivalent. The practical differences arise mainly from whether the KL term enters the advantage/baseline and from the resulting credit-assignment pathway.
8. $k_1$ in Reward Shaping vs. Low-Variance KL Losses
At this point the natural question is: in what sense is âKL in lossâ equivalent to âKL in rewardâ, and in what sense is it not?
8.1 Sample-Level Equivalence of the KL Gradient Term
The equivalence discussed here is only about the gradient random variable coming from the KL regularization term itself. Once you add learned critics, baselines, GAE, or batch centering, the overall update semantics split again. We write everything in the ascent direction $\nabla_\theta J$ (if you minimize a loss in code, that is just a global sign flip), and keep the same unified notation: samples come from $x \sim \mu$, and the importance weight $\rho = \frac{q_\theta}{\text{sg}(\mu)}$ multiplies the policy-gradient estimator.
KL as a loss term (low-variance choice): We proved earlier that when using $\text{sg}(\rho) k_2$ or $\rho k_3$ as the regularization term, the gradient random variable simplifies to $\rho\, s_\theta\, k_1$; with the $-\beta$ coefficient it carries in $J$, the KL contribution to the ascent direction is $-\beta\,\rho\, s_\theta\, k_1$.
KL as reward shaping ($k_1$ in reward): The shaped reward is $\tilde{R} = R - \beta \cdot k_1$ (applying stop-gradient to $k_1$ just avoids direct KL backpropagation in implementation; it does not change the numerical penalty itself). In the policy-gradient term $\rho\, s_\theta\, \tilde{R}$, the KL contribution is
$$\rho\, s_\theta \cdot (-\beta\, k_1) = -\beta\,\rho\, s_\theta\, k_1.$$
This is why the previous two sections are not actually in conflict: at this level, the KL gradient terms from both approaches are identical sample by sample.
In other words, ignoring the specific construction details of baseline/advantage:
âWriting KL into loss with low-variance implementation ($\text{sg}(\rho)k_2$ or $\rho k_3$)â
and âWriting KL into reward with $k_1$ (stop-gradient shaped reward)â
can exert exactly the same KL regularization âforceâ on policy updates.
Specifically, if we only look at the gradient term contributed by KL penalty when âmaximizing $J$â (the penalty term carries a negative sign in $J$, so the ascent direction naturally carries $-\beta$):
Loss implementation (low-variance form): $-\beta \cdot \rho s_\theta k_1$
Reward-shaping implementation ($k_1$ in the shaped reward): $-\beta \cdot \rho s_\theta k_1$
They are the same random variable, not merely equal in expectation.
Where the Two Implementations Still Differ
Although the KL gradient terms are sample-level equivalent, the overall update semantics of the two approaches still differ. The differences mainly manifest in the following aspects:
1. Whether KL Enters Advantage/Baseline
KL as a loss term (equivalent to maximizing $J(\theta) = \mathbb{E}[R] - \beta\,\mathrm{KL}$, but implementing the KL term as an independent, controllable force):
KL is an independent regularization term, completely decoupled from advantage. The magnitude of the KL gradient depends only on $k_1$ itself, unaffected by critic quality or baseline choice.
KL as reward shaping:
$$
\nabla_\theta J_{\text{reward-impl}} = \mathbb{E}_\mu[\rho s_\theta \tilde{A}], \quad \tilde{A} \text{ based on } (R - \beta \cdot k_1)
$$
KL enters advantage computation through shaped reward and gets processed by the baseline. This means:
KLâs influence is modulated by how advantage is constructed
If using a value function baseline, KLâs influence is partially absorbed
From an implementation perspective, the difference can be understood as: the Loss approach estimates âenvironment returnâ and âKL regularizationâ separately; the Reward approach treats KL as part of the return, so it follows all the processing you do to returns (baseline, normalization, clipping, etc.).
2. Credit Assignment: Explicit Regularization vs. Shaped-Reward Coupling
KL as a loss term: Each token/state KL gradient is local, directly affecting the update at that position.
KL as reward shaping: The KL penalty is folded into return/advantage computation and can influence earlier decisions depending on how returns are propagated.
3. Reward-Centered KL: Impact on Gradient Unbiasedness
In LLM RL (such as GRPO, PPO for LLM), a common advantage computation is $A = r - \text{mean}(r)$. When KL is used as reward shaping, whether to include KL in the mean affects gradient unbiasedness.
Let samples be $x_1, \dots, x_n \overset{iid}{\sim} q_\theta$, denote $g_i = \nabla_\theta \log q_\theta(x_i)$, and use $\mathrm{kl}_i$ for the KL penalty scalar of the $i$-th sample, $\bar{\mathrm{kl}} = \frac{1}{n}\sum_j \mathrm{kl}_j$.
No centering ($-\beta\,\mathrm{kl}_i$): The expected KL gradient term is
$$\mathbb{E}\!\left[-\beta\, g_i\, \mathrm{kl}_i\right] = -\beta\,\mathbb{E}_{q_\theta}\!\left[s_\theta\, \mathrm{kl}\right].$$
This is an unbiased gradient of $-\beta \mathbb{E}[\mathrm{KL}]$.
Same-batch mean centering ($-\beta(\mathrm{kl}_i - \bar{\mathrm{kl}})$, including self): Since $\bar{\mathrm{kl}}$ depends on all samples (including $x_i$ itself), and $\mathbb{E}[g_i\,\mathrm{kl}_j] = 0$ for $j \neq i$, the expected gradient becomes
$$\mathbb{E}\!\left[-\beta\, g_i\,(\mathrm{kl}_i - \bar{\mathrm{kl}})\right] = -\beta\left(1 - \tfrac{1}{n}\right)\mathbb{E}\!\left[g_i\, \mathrm{kl}_i\right].$$
The KL regularization gradient is shrunk by $\frac{1}{n}$, equivalent to a smaller effective $\beta$. This is not strictly unbiased.
Leave-one-out centering ($-\beta(\mathrm{kl}_i - \bar{\mathrm{kl}}_{-i})$): If we use $\bar{\mathrm{kl}}_{-i} = \frac{1}{n-1}\sum_{j \neq i} \mathrm{kl}_j$ instead, then $\bar{\mathrm{kl}}_{-i}$ is independent of $g_i$, giving $\mathbb{E}[g_i \bar{\mathrm{kl}}_{-i}] = 0$, therefore
$$\mathbb{E}\!\left[-\beta\, g_i\,(\mathrm{kl}_i - \bar{\mathrm{kl}}_{-i})\right] = -\beta\,\mathbb{E}\!\left[g_i\, \mathrm{kl}_i\right].$$
This remains an unbiased gradient, while enjoying variance reduction from centering.
Conclusion: Same-batch mean centering induces an $O(1/n)$ shrinkage of the KL gradient term (equivalently, a slight reduction in the effective $\beta$). This is typically negligible for large group sizes (e.g., GRPO); for strict unbiasedness while retaining variance reduction, use a leave-one-out mean.
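The $1/n$ shrinkage is visible in simulation; a sketch under toy-model assumptions ($q = \mathcal{N}(0, s^2)$ with parameter $s$, $p = \mathcal{N}(0,1)$, $\mathrm{kl}_i$ given by $k_1$):

```python
import math
import random
import statistics

random.seed(0)
# Toy model (assumptions): q = N(0, s^2) with parameter s, p = N(0, 1);
# kl_i is the k1 value and g_i = d/ds log q(x_i). Compare the KL-gradient
# term with and without same-batch mean centering, batch size n = 4.
s, n, batches = 1.2, 4, 100_000
uncentered, centered = [], []
for _ in range(batches):
    pts = []
    for _ in range(n):
        x = random.gauss(0.0, s)
        g = (x * x - s * s) / s**3
        kl = -math.log(s) + 0.5 * x * x * (1.0 - 1.0 / s**2)
        pts.append((g, kl))
    kbar = sum(kl for _, kl in pts) / n
    uncentered.append(sum(g * kl for g, kl in pts) / n)
    centered.append(sum(g * (kl - kbar) for g, kl in pts) / n)

u = statistics.fmean(uncentered)
c = statistics.fmean(centered)
print(round(c / u, 2))  # near 1 - 1/n = 0.75
```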
8.2 When Should You Choose Which Approach?
| Dimension | KL as a loss term | KL as reward shaping |
|---|---|---|
| KL gradient form | $\rho s_\theta k_1$ (low-variance choice) | $\rho s_\theta k_1$ |
| Coupling w/ advantage | Fully decoupled | Coupled through the shaped reward |
| KL centering | None (absolute penalty) | Yes ($\text{KL} - \text{mean}(\text{KL})$) |
| Credit assignment | Local, per-token | May have temporal backprop (implementation-dependent) |
| Suitable for | KL as an explicit regularizer | KL flowing through shaped reward / advantage |
Practical recommendations:
If you want KL to stay an explicit regularizer, separate from advantage construction and less entangled with critic quality, choose KL as a loss term, using $\text{sg}(\rho) k_2$ or $\rho k_3$. For on-policy scenarios, if you prefer not to explicitly construct $\rho = \frac{q_\theta}{\text{sg}(q_\theta)}$, directly using $k_2$ is simpler and less error-prone.
If you want KL to be part of the shaped reward, so that it flows through return / advantage construction together with the task reward, choose KL as reward shaping, using $k_1$.
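A minimal PyTorch sketch of the two recommendations (the function names, the $\beta$ value, and the on-policy $k_2$ choice for the loss variant are illustrative, not from any particular library):

```python
import torch

beta = 0.1  # illustrative KL coefficient

def kl_loss_term(logq, logp):
    # KL as an explicit loss term (on-policy): k2 = 0.5 * (log q - log p)^2.
    # Its gradient w.r.t. logq is log(q/p), i.e. the reverse-KL direction.
    return beta * (0.5 * (logq - logp.detach()) ** 2).mean()

def kl_shaped_reward(logq, logp, task_reward):
    # KL as reward shaping: k1 = log(q/p), fully detached, folded into the
    # reward so it flows through return/advantage construction downstream.
    k1 = (logq - logp).detach()
    return task_reward - beta * k1
```

In the first wiring the penalty backpropagates directly at each position; in the second, only the score-function path (through the advantage) carries it, together with whatever baseline and normalization the returns receive.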
Based on the above conclusions about "value unbiasedness vs. gradient correctness" and the "differences between the Loss and Reward implementations", we now turn to a quick-reference guide and common pitfalls that can be applied directly in code.
9. Practical Guide and Common Pitfalls
9.1 Quick Reference for the Three Estimator Definitions

With $x \sim q_\theta$ and ratio $r = \frac{p(x)}{q_\theta(x)}$, the three estimators are:

| Estimator | Definition | Value for reverse KL |
|---|---|---|
| $k_1$ | $\log \frac{q_\theta}{p} = -\log r$ | Unbiased |
| $k_2$ | $\frac{1}{2}\left(\log \frac{q_\theta}{p}\right)^2$ | Biased (small when $p \approx q_\theta$) |
| $k_3$ | $\frac{p}{q_\theta} - 1 - \log \frac{p}{q_\theta} = r - 1 - \log r$ | Unbiased |

Note: $k_2$ and $\frac{q_\theta}{\text{sg}(q_\theta)} k_3$ have identical gradients (sample-level equivalent). For on-policy, directly using $k_2$ is recommended as the simplest approach.
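The sample-level equivalence in this note can be checked directly with autograd. Here `logq` and `logp` are arbitrary stand-in values for per-token log-probs (a sketch, not a training loop):

```python
import torch

torch.manual_seed(0)
logq = torch.randn(5, dtype=torch.float64, requires_grad=True)  # toy log q_theta(x_i)
logp = torch.randn(5, dtype=torch.float64)                      # toy log p(x_i), reference

logr = logp - logq  # log(p / q_theta)

# k2 as a loss: 0.5 * (log(q/p))^2
g_k2, = torch.autograd.grad((0.5 * logr ** 2).sum(), logq, retain_graph=True)

# rho * k3, with rho = q_theta / sg(q_theta): value 1, but it carries the score gradient
rho = (logq - logq.detach()).exp()
k3 = logr.exp() - 1 - logr
g_rk3, = torch.autograd.grad((rho * k3).sum(), logq)

print(torch.allclose(g_k2, g_rk3))  # True: per-sample gradients coincide (both equal k1)
```

Analytically, both gradients reduce to $\log \frac{q_\theta}{p} = k_1$ per sample, which is why either form optimizes reverse KL on-policy.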
9.2 Off-Policy Reverse KL as a Loss Term

| Loss writing | Pros | Cons | Rec. |
|---|---|---|---|
| $\frac{q_\theta}{\mu} k_1$ | Correct gradient (reverse KL), value unbiased | Higher variance | ⚠️ |
| $\frac{q_\theta}{\mu} k_2$ | (none) | Gradient matches a local second-order surrogate, not reverse KL | ❌ |
| $\text{sg}\left(\frac{q_\theta}{\mu}\right) k_2$ | Correct gradient (reverse KL), low variance | Value biased (but the bias is minimal) | ✅✅ |
| $\frac{q_\theta}{\mu} k_3$ | Correct gradient (reverse KL), low variance, value unbiased | (none) | ✅✅ |
Note: $\text{sg}\left(\frac{q_\theta}{\mu}\right) k_2$ and $\frac{q_\theta}{\mu} k_3$ have identical gradients (sample-level equivalent). Both are recommended choices.
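The off-policy equivalence in this note can be verified the same way with autograd. Here `logq`, `logp`, and `logmu` are arbitrary stand-ins for per-token log-probs under $q_\theta$, $p$, and the behavior policy $\mu$ (a sketch, not a training loop):

```python
import torch

torch.manual_seed(0)
logq  = torch.randn(5, dtype=torch.float64, requires_grad=True)  # log q_theta(x_i), x_i ~ mu
logp  = torch.randn(5, dtype=torch.float64)                      # reference policy log-probs
logmu = torch.randn(5, dtype=torch.float64)                      # behavior policy log-probs

w = (logq - logmu).exp()   # importance weight q_theta / mu
logr = logp - logq         # log(p / q_theta)

# sg(q/mu) * k2: importance weight detached
g_a, = torch.autograd.grad((w.detach() * 0.5 * logr ** 2).sum(), logq, retain_graph=True)

# (q/mu) * k3: importance weight NOT detached
k3 = logr.exp() - 1 - logr
g_b, = torch.autograd.grad((w * k3).sum(), logq)

print(torch.allclose(g_a, g_b))  # True: identical per-sample gradients
```

Both reduce analytically to $\frac{q_\theta}{\mu} \log \frac{q_\theta}{p}$ per sample, matching the table's claim that the two rows share the recommended gradient.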
9.3 KL as Reward Shaping (stop-gradient)

| Estimator | Pros | Cons | Rec. |
|---|---|---|---|
| $k_1$ | Value unbiased, policy-gradient term unbiased | Higher variance | ✅✅ |
| $k_2$ | (none) | Value biased; policy-gradient term biased | ❌ |
| $k_3$ | Value unbiased, low variance | Policy-gradient term biased, with bias term $-\nabla_\theta D_{\mathrm{KL}}(p\|q_\theta)$; may cause training collapse | ❌ |
Note: For the stop-gradient reward-shaping setup analyzed here, only $k_1$ keeps the policy-gradient term aligned with reverse-KL regularization. Both $k_2$ and $k_3$ introduce bias in that term; $k_3$ is especially treacherous because it looks attractive at the value-estimation level yet can still destabilize training.
Legend
✅✅: Strongly recommended, theoretically correct with good practical performance
✅: Recommended, theoretically correct but slightly more complex or with minor drawbacks
⚠️: Usable with caution; has issues such as high variance
❌: Not recommended for reverse KL; does not match the optimization objective discussed here or behaves poorly in practice
9.4 Common Pitfalls
Using $k_1 = \log \frac{q_\theta}{p}$ directly as a loss term (on-policy): the per-sample gradient is just the score $\nabla_\theta \log q_\theta$, whose expectation is zero, so the penalty is completely ineffective.
Using $k_3 = \frac{p}{q_\theta} - 1 - \log \frac{p}{q_\theta}$ as a loss term to optimize reverse KL (on-policy): Its gradient corresponds to forward KL $D_{\mathrm{KL}}(p \| q_\theta)$, i.e. the wrong direction.
Using $\frac{q_\theta}{\mu} k_2$ (importance weight not detached) off-policy: Gradient corresponds to a local second-order surrogate, not reverse KL.
Using $k_3$ inside reward shaping: Although it is value-unbiased, it induces a biased policy-gradient update and may lead to training collapse.
Setting $\rho$ to the constant 1 in on-policy code: you must explicitly construct $\rho = \frac{q_\theta}{\text{sg}(q_\theta)}$ (or equivalently $\exp(\log q_\theta - \text{sg}(\log q_\theta))$); otherwise the score-function gradient path is lost, and $\rho k_1$ and $\rho k_3$ degenerate to their naive forms and fail.
Confusing âvalue unbiasednessâ with âgradient correctnessâ: $k_3$ is value-unbiased for reverse KL, but when used in reward shaping, the induced policy gradient is biased; both dimensions matter.
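To make the $\rho$ pitfall concrete, here is a hypothetical three-sample example: the constant-1 version silently drops the KL term from the computation graph, while the explicitly constructed $\rho$ keeps the score path alive (toy values throughout):

```python
import torch

logq = torch.tensor([0.2, -1.0, 0.5], requires_grad=True)  # toy per-sample log q_theta
logp = torch.tensor([0.0, -0.5, 0.3])                      # toy reference log-probs
k1 = (logq - logp).detach()                                # detached KL penalty per sample

# Wrong: rho as the literal constant 1 -- no gradient path at all.
loss_wrong = (1.0 * k1).mean()
print(loss_wrong.requires_grad)  # False: the optimizer never sees this term

# Right: rho = exp(log q - sg(log q)) evaluates to 1 but keeps the score path.
rho = (logq - logq.detach()).exp()
loss_right = (rho * k1).mean()
g, = torch.autograd.grad(loss_right, logq)
print(g)  # equals k1 / n: each sample's score is weighted by its detached KL
```

The two losses have identical values, which is exactly why this bug is easy to miss: only the gradient reveals the difference.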
10. Summary
If you only remember four lines, make them these:
Value unbiasedness does not imply gradient correctness. Choosing a KL estimator means checking not only how well it estimates KL values, but also what objective its gradient is actually optimizing.
If KL is a differentiable loss term: in the naive on-policy implementation, $k_2$ is the simplest correct choice; if you explicitly construct $\rho$, or if you are off-policy, prefer $\rho k_3$ or $\mathrm{sg}(\rho)k_2$.
If KL is used as stop-gradient reward shaping: in the policy-gradient term analyzed here, only $k_1$ stays aligned with reverse-KL regularization.
A low-variance KL loss and $k_1$ in reward shaping can be sample-wise equivalent on the KL term, but the two algorithms still have different semantics. In the former, KL is an explicit regularizer; in the latter, KL flows through advantage, baselines, and credit assignment.
Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu. "Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization". arXiv:2510.01555. https://arxiv.org/abs/2510.01555
Yifan Zhang, Yiping Ji, Gavin Brown, et al. "On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning". arXiv:2505.17508. https://arxiv.org/abs/2505.17508
Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro. "A Comedy of Estimators: On KL Regularization in RL Training of LLMs". arXiv:2512.21852. https://arxiv.org/abs/2512.21852
```bibtex
@misc{WangZhang2025KLEstimators,
  author = {Wang, Xihuai and Zhang, Shao},
  title  = {Choosing KL Estimators in RL: From Value Unbiasedness to Gradient Correctness},
  year   = {2025},
  month  = dec,
  day    = {01},
  url    = {https://xihuai18.github.io/reinforcement-learning/2025/12/01/kl-estimators-en.html}
}
```