This post studies a recurring question in large-scale LLM reinforcement learning: when a training batch mixes data generated by multiple historical policy versions, can one still write down an explicit monotonic-improvement lower bound for PPO-style updates?
The short answer is yes: under dynamic mixture sampling, the basic bound can be summarized as “surrogate objective - update-shift penalty - sampling-staleness penalty”; if the actual advantage estimate is included in the analysis, an additional advantage-replacement error term must also be subtracted.
1. Introduction: Why Should We Care About Off-Policy Training?
When training a large language model with reinforcement learning, the most direct setup is on-policy training: generate a batch of data, update on that batch, then sample again from the updated model.
In large-scale distributed training, though, hundreds of GPUs sample in parallel and model updates take time. By the time a new version is deployed, data generated by older versions are often still sitting in the queue: throwing them away is wasteful, but using them means training on stale data.
That is the central off-policy question: when can data collected by older policies still support an analyzable monotonic-improvement lower bound for a newer one?
We will ultimately see that the lower bound is governed by three pieces: a surrogate objective we try to maximize, an update-shift penalty controlled on the optimization side, and a sampling-staleness penalty controlled on the data side.
In many RLHF / online alignment setups, if we view the prompt as context and the response as action while ignoring long-horizon environment evolution, the problem is often well approximated as a contextual bandit. I still start from the discounted-MDP setting because it lets us write multi-version behavior mixing, sampling staleness, and clipping in one unified language. Section 6 returns to which terms disappear, and which conclusions remain, in the bandit limit.
Related work has already touched neighboring parts of this picture: GePPO studies off-policy sample reuse with policy-improvement guarantees, while Decoupled PPO explicitly separates the behavior policy from the proximal policy. The emphasis here is different: I expand the behavior side into a dynamic mixture of historical policy versions and then split the risk into update increment shift and sampling staleness. You can also read this post as a continuation of the earlier three-policy perspective: here the behavior side is no longer a single policy $\mu$, but a mixture over historical policies $\{\pi^{(i)}\}$, while $\pi_k$ and $\pi_{k+1}$ play the roles of the current reference policy and update target. Even without that earlier post, the only principle needed here is to separate what the current update can control from what comes from behavior-distribution mismatch.
1.1 Which Off-Policy Problems Are Being Analyzed?
To avoid using “stale data” as a catch-all phrase, I separate the theoretical mismatches covered in this post:
| Type | Mathematical form | What it breaks |
|---|---|---|
| Version staleness | samples come from $\pi_{k-m}$ while the update targets $\pi_{k+1}$ | behavior distribution vs. current proximal distribution |
| Behavior-proximal mismatch | $\mu \neq \pi_k$ | the PPO ratio denominator is no longer the true sampling distribution |
| Multi-epoch sample reuse | the same data are reused across several updates | later epochs become increasingly off-policy |
| Mixed behavior policies | $\mu=\sum_i w_i\pi^{(i)}$ or an extended-state $\beta$ | a batch is not drawn from a single old policy |
| Support mismatch | some $\mu(a\mid s)=0$ while $\pi(a\mid s)>0$ | importance ratios are undefined |
The bounds below mainly handle the first four. The fifth is a prerequisite for any importance-ratio argument, so I treat it as an explicit support assumption.
2. Theoretical Foundations
2.1 Basic Setup
We consider a standard Markov Decision Process (MDP) comprising a state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $p(s'\mid s,a)$, reward function $r(s,a)$, initial distribution $\rho_0$, and discount factor $\gamma \in (0,1)$.
The expected cumulative discounted return of policy $\pi$ is:
$$ J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid \pi\right] $$
Discounted State Visitation Distribution
We define it as:
$$ d_\pi(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi) $$
Advantage Function
We define it as:
$$ A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s) $$
Total Variation Distance (TV Distance)
We define it as:
$$ D_{\mathrm{TV}}(\pi, \pi'; s) := \frac{1}{2} \sum_{a \in \mathcal{A}} |\pi(a \mid s) - \pi'(a \mid s)| $$
Throughout, we use $\mid$ for conditional probability (e.g., $\pi(a\mid s)$) and reserve $\|\cdot\|$ for norms.
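As a concrete reading of these definitions, here is a minimal NumPy sketch (the function name and array layout are illustrative, not from the post) that evaluates $D_{\mathrm{TV}}(\pi,\pi';s)$ for every state when both policies are given as state-by-action probability tables:

```python
import numpy as np

def tv_distance(pi: np.ndarray, pi_prime: np.ndarray) -> np.ndarray:
    """Per-state total variation distance between two policies.

    pi, pi_prime: arrays of shape (num_states, num_actions) whose rows
    are probability distributions over actions. Returns an array of shape
    (num_states,) with D_TV(pi, pi'; s) = 0.5 * sum_a |pi(a|s) - pi'(a|s)|.
    """
    return 0.5 * np.abs(pi - pi_prime).sum(axis=-1)

# Tiny usage example with 2 states and 3 actions.
pi = np.array([[0.7, 0.2, 0.1],
               [0.4, 0.4, 0.2]])
pi_prime = np.array([[0.6, 0.3, 0.1],
                     [0.1, 0.6, 0.3]])
print(tv_distance(pi, pi_prime))  # [0.1, 0.3]
```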
2.2 Core Tool: Policy Performance Difference Lemma
The starting point of the analysis is the classic performance difference lemma, which writes $J(\pi)-J(\pi_k)$ exactly as an expectation of old-policy advantage under the new policy’s occupancy measure. This identity goes back to Kakade-Langford style analysis and is also the starting point of TRPO.
Lemma 2.1 (Policy Performance Difference Lemma)
For any policies $\pi_k$ (old) and $\pi$ (new), the performance difference can be expressed as:
$$ J(\pi) - J(\pi_k) = \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_\pi}\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi_k}(s,a)] \right] $$
Intuitive understanding: How much better the new policy is than the old equals the “average advantage” obtained by selecting actions according to the new policy under the state distribution visited by the new policy.
2.3 Assumptions Behind the Bounds
All lower bounds below should be read as structural guarantees under standard assumptions. To avoid repeating them after every theorem, I state the important ones up front:
- Common support: whenever the target policy can choose an action, the behavior policy assigns it positive probability. A typical statement is $\pi(a\mid s)>0 \Rightarrow \mu(a\mid s)>0$.
- Bounded advantage: there exists $A_{\max}$ such that $|A^{\pi_k}(s,a)|\le A_{\max}$. This lets TV / KL distances control distribution-replacement errors.
- Bounded or constrained ratios: the importance ratio $\pi(a\mid s)/\mu(a\mid s)$ must exist and cannot be arbitrarily large in the theoretical statement; clipping produces a biased surrogate rather than an unbiased estimate of the original objective.
- Well-defined mixture index: for trajectory-level mixtures, each trajectory keeps a fixed policy index; for step/segment-level mixtures, one must model the index transition explicitly.
- Controlled advantage replacement error: the actual $\hat A$ used in the loss must stay close enough to $A^{\pi_k}$; Section 5.4 isolates this term.
These assumptions are not technical bookkeeping. They mark the boundary where off-policy theory can make a meaningful statement. In particular, if common support fails, the importance ratio itself is not a legitimate object.
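To make the support and ratio assumptions slightly more tangible, here is a small sketch (the function name and threshold are mine, not the post's) that checks them as far as they can be checked from logged samples: it refuses to form importance ratios when a recorded behavior probability is zero and reports how heavy-tailed the ratios are.

```python
import numpy as np

def check_ratio_assumptions(target_prob, behavior_prob, max_ratio=100.0):
    """Sanity-check importance ratios pi(a|s) / mu(a|s) on sampled actions.

    target_prob, behavior_prob: probabilities assigned by the target and
    behavior policies to the *sampled* actions. Raises on a zero behavior
    probability (support violation) and reports how heavy-tailed the
    ratios are relative to max_ratio.
    """
    target_prob = np.asarray(target_prob, dtype=float)
    behavior_prob = np.asarray(behavior_prob, dtype=float)
    if np.any(behavior_prob <= 0.0):
        raise ValueError("common-support violated: behavior probability is 0 "
                         "for an action the target policy can reach")
    ratio = target_prob / behavior_prob
    return {"max_ratio": float(ratio.max()),
            "frac_above_threshold": float(np.mean(ratio > max_ratio))}

print(check_ratio_assumptions([0.2, 0.05, 0.4], [0.25, 0.001, 0.5]))
```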
3. Performance Improvement Bounds for Single-Policy Sampling
3.1 Distribution Mismatch and Controlling State Shift
The Policy Performance Difference Lemma has a practical issue: the expectation on the right-hand side is computed under $d_\pi$ (the new policy’s state distribution), while we can only sample from $d_{\pi_k}$ (the old policy).
The solution is to decompose the expectation into “expectation under the old distribution + bias term,” then control the bias. The key question is: What is the quantitative relationship between the difference in state distributions and the difference in policies?
Controlling State Distribution Differences
Lemma 3.1 (Relationship Between State Distribution Difference and Policy TV Distance)
$$ \|d_\pi - d_{\pi_k}\|_1 \leq \frac{2\gamma}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$
This is an average-divergence / average-TV style bound. It is closer in spirit to CPO / Achiam et al. (2017) than to the classic TRPO presentation based on $\max_s D_{\mathrm{TV}}$. I use it here because it lands more naturally on sample averages and is easier to carry into the multi-source setting. Appendix A includes a short proof sketch for how the state-distribution difference is reduced to average TV.
Physical Interpretation
Small differences in policies in action space are “amplified” through environment dynamics into differences in state visitation distributions. The coefficient $\frac{2\gamma}{1-\gamma}$ reflects the temporal accumulation effect—in long-horizon tasks ($\gamma$ close to 1), the amplification is stronger.
Proof Sketch
By deriving the fixed-point equation for discounted visitation distributions and exploiting the $\ell_1$ non-expansiveness of stochastic matrices, one can show that state-distribution differences are amplified by policy differences through transition dynamics, yielding the bound in Lemma 3.1. Since the main point of this post is not to re-derive that chain in full, I keep only the idea here and move the proof sketch to Appendix A.
3.2 Policy Performance Improvement Lower Bound
Theorem 3.2 (Policy Performance Improvement Lower Bound)
Define the expected advantage upper bound coefficient $C_{\pi,\pi_k} := \max_{s} \lvert \mathbb{E}_{a \sim \pi}[A^{\pi_k}(s,a)] \rvert$. Then:
$$ J(\pi) - J(\pi_k) \geq L_{\pi_k}(\pi) - \frac{2\gamma C_{\pi,\pi_k}}{(1-\gamma)^2} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$
where the surrogate objective is:
$$ L_{\pi_k}(\pi) := \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}, a \sim \pi_k} \left[ \frac{\pi(a \mid s)}{\pi_k(a \mid s)} A^{\pi_k}(s,a) \right] $$
Here $L_{\pi_k}(\pi)$ omits the additive constant $J(\pi_k)$, so in textbook TRPO notation one could also write the objective as $J(\pi_k)+L_{\pi_k}(\pi)$. Also note that $C_{\pi,\pi_k}$ depends on the new policy $\pi$, so it should be read as a structural coefficient in the bound rather than a tunable hyperparameter.
This lower bound consists of two parts:
- Surrogate objective $L_{\pi_k}(\pi)$: can be directly estimated from old-policy data via importance sampling; this is the optimization objective of TRPO/PPO.
- Policy shift penalty: increases with the TV distance between the new and old policies, explaining why PPO needs to constrain the update magnitude.
Core conclusion: this is an explicit improvement lower bound. In particular, when the right-hand side is positive, improvement is guaranteed.
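The surrogate $L_{\pi_k}(\pi)$ is exactly the quantity one can estimate from old-policy samples. A minimal importance-sampling sketch (the function name and the advantage inputs are placeholders of mine, not the post's):

```python
import numpy as np

def surrogate_estimate(logp_new, logp_old, advantages, gamma=0.99):
    """Monte Carlo estimate of L_{pi_k}(pi) from samples (s, a) ~ d_{pi_k}, pi_k.

    logp_new, logp_old: log pi(a|s) and log pi_k(a|s) on the sampled actions.
    advantages: estimates of A^{pi_k}(s, a) on the same samples.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(advantages)) / (1.0 - gamma))

# Usage with made-up numbers.
print(surrogate_estimate([-1.0, -2.1, -0.5], [-1.1, -2.0, -0.7], [0.3, -0.1, 0.5]))
```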
3.3 Finite-Sequence LLM Form and the Support Condition
In the prompt-response form of LLM RL, let the prompt be $x$ and the response be
$$ y=(a_1,\ldots,a_T). $$
The behavior and target sequence probabilities are
$$
\mu(y\mid x)=\prod_{t=1}^T \mu(a_t\mid x,a_{<t}),
\qquad
\pi(y\mid x)=\prod_{t=1}^T \pi(a_t\mid x,a_{<t}).
$$
The sequence-level importance ratio is
$$
\rho(y\mid x)=\frac{\pi(y\mid x)}{\mu(y\mid x)}
=\prod_{t=1}^T \frac{\pi(a_t\mid x,a_{<t})}{\mu(a_t\mid x,a_{<t})}.
$$
Therefore
$$
\log \rho(y\mid x)
=
\sum_{t=1}^T
\left[\log \pi(a_t\mid x,a_{<t})-\log \mu(a_t\mid x,a_{<t})\right].
$$
Small token-level log-ratio errors accumulate linearly along the sequence, while the sequence ratio itself amplifies multiplicatively. Long responses, low-probability tokens, truncated sampling, and MoE routing can all make this ratio heavy-tailed.
This also exposes the support condition. If top-$k$ / top-$p$ / masks / EOS rules give zero probability to a token under the behavior policy while the target policy still assigns it positive probability, then $\rho$ is undefined. In that case one cannot simply "add an importance ratio"; one must first redefine the target support or introduce smoothing / mixture distributions that guarantee common support.
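A small numerical sketch of this accumulation (the array names and toy numbers are mine): given per-token log-probabilities under the behavior and target policies, the sequence log-ratio is just the summed difference, and exponentiating it shows how quickly the sequence ratio drifts for long responses.

```python
import numpy as np

def sequence_log_ratio(target_token_logps, behavior_token_logps):
    """log rho(y|x) = sum_t [log pi(a_t|x,a_<t) - log mu(a_t|x,a_<t)].

    Both inputs are 1-D arrays of per-token log-probabilities for the same
    sampled response. Returns (log_ratio, ratio).
    """
    diff = np.asarray(target_token_logps) - np.asarray(behavior_token_logps)
    log_ratio = float(diff.sum())
    return log_ratio, float(np.exp(log_ratio))

# Even a small average per-token gap compounds over a long response:
T = 500
rng = np.random.default_rng(0)
behavior = rng.normal(-2.0, 0.5, size=T)
target = behavior + 0.01          # target is 0.01 nats "more confident" per token
print(sequence_log_ratio(target, behavior))  # log-ratio = 5.0, ratio ~ 148
```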
4. Multi-Policy Static Mixture Sampling
4.1 Setup and Unified Modeling (Static Mixture)
In practice, a batch of data may come from multiple policy versions $\{\pi^{(1)}, \ldots, \pi^{(M)}\}$, with respective proportions $\alpha_1, \ldots, \alpha_M$. How do we extend Theorem 3.2 to this setting?
Core idea: augmented state space
The solution is an elegant modeling technique: treat the policy version index as part of the state. Define the augmented state space $\tilde{\mathcal{S}} := \mathcal{S} \times \mathcal{I}$, where $\mathcal{I} = \{1, \ldots, M\}$ is the policy index set. Under augmented state $(s, i)$, the mixture behavior policy is defined as $\beta(a \mid s, i) := \pi^{(i)}(a \mid s)$. The evolution of indices is characterized by the index transition kernel $q(i' \mid i)$. The augmented MDP inherits the original MDP's rewards and environment transitions, with indices evolving independently according to $q(i'\mid i)$. This technique works because the new policy $\pi$ has the same return in the augmented MDP as in the original one, so Theorem 3.2 can be applied directly.
4.2 Trajectory-Level Mixture: Simplification and Improvement Bound
The most common scenario is using a single old policy per trajectory: at trajectory start, sample index $I_0 \sim \alpha$, and use $\pi^{(I_0)}$ throughout. In this case, the index transition kernel is the identity: $q(i' \mid i) = \mathbf{1}_{i'=i}$.
From an engineering perspective, many asynchronous actor-learner systems organize data so that an entire trajectory belongs to a particular policy snapshot, and the learner then mixes trajectories produced by different versions. That setup approximately corresponds to what I call trajectory-level mixture. The word "approximately" matters because different systems need not agree on the exact boundary of a trajectory or sampling unit.
Lemma 4.1 (Structural Simplification for Trajectory-Level Mixture)
(a) The augmented state visitation distribution decomposes as $d_{\beta}(s, i) = \alpha_i \cdot d_{\pi^{(i)}}(s)$.
(b) The advantage function reduces to $A^{\beta}((s, i), a) = A^{\pi^{(i)}}(s, a)$.
Intuition for (b): since the index never changes, all future trajectories starting from augmented state $(s,i)$ are generated by the same policy $\pi^{(i)}$. Therefore, future cumulative returns are entirely determined by $\pi^{(i)}$, and value functions and advantage functions naturally reduce to their $\pi^{(i)}$ counterparts.
Consequently, the mixture policy's return is the weighted average of the individual old policies' returns: $J_{\mathrm{mix}} = \sum_{i=1}^{M} \alpha_i J(\pi^{(i)})$.
Improvement bound
Corollary 4.2 (Performance Improvement Lower Bound for Trajectory-Level Mixture)
$$
J(\pi) - \sum_{i=1}^{M} \alpha_i J(\pi^{(i)}) \geq \sum_{i=1}^{M} \alpha_i L_{\pi^{(i)}}(\pi) - \frac{2\gamma \max_i C_{\pi, \pi^{(i)}}}{(1-\gamma)^2} \sum_{i=1}^{M} \alpha_i \mathbb{E}_{s \sim d_{\pi^{(i)}}} \big[ D_{\mathrm{TV}}(\pi, \pi^{(i)}; s) \big]
$$
This result shows that when mixing trajectories from multiple old policy versions for training, if we construct the loss using importance ratios corresponding to each trajectory's source policy while controlling the new policy's deviation from each old policy, the new policy's performance has a clear improvement lower bound.
Using $\max_i C_{\pi,\pi^{(i)}}$ is just a compact way to write the result. A slightly tighter version would keep a separate $C_i$ for each component and then average them with the mixture weights.
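As a sketch of how the surrogate term on the right-hand side can be estimated from a mixed batch (the grouping logic and variable names are mine, not the post's): keep each sample's version index and use that version's probability in the ratio denominator; a plain batch mean then implements the $\alpha$-weighted sum, because each source appears with frequency proportional to its weight.

```python
import numpy as np

def mixture_surrogate(logp_new, logp_source, source_idx, advantages, gamma=0.99):
    """Estimate sum_i alpha_i * L_{pi^(i)}(pi) from a mixed batch.

    logp_new:    log pi(a|s) for each sample under the candidate policy.
    logp_source: log pi^(i)(a|s) under the sample's own source policy.
    source_idx:  integer version index i of each sample (used for reporting).
    advantages:  estimates of A^{pi^(i)}(s, a) for each sample.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_source))
    per_sample = ratio * np.asarray(advantages) / (1.0 - gamma)
    per_source = {int(i): float(per_sample[np.asarray(source_idx) == i].mean())
                  for i in np.unique(source_idx)}
    return float(per_sample.mean()), per_source

total, by_source = mixture_surrogate(
    logp_new=[-1.0, -0.8, -2.0, -1.5],
    logp_source=[-1.1, -0.9, -1.8, -1.6],
    source_idx=[0, 0, 1, 1],
    advantages=[0.2, -0.1, 0.4, 0.1],
)
print(total, by_source)
```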
5. Dynamic Mixture Sampling and Monotonic Improvement Conditions
5.1 Problem and Unified Modeling (Dynamic Mixture)
Section 4 discussed static mixture, where the mixture weights $\alpha_i$ remain fixed. This section considers the more general dynamic mixture, where sampling gradually transitions to the new policy after it is released.
The previous results characterize improvement of "the new policy relative to the mixture behavior policy." However, in actual training, what we truly care about is: does the latest policy $\pi_{k+1}$ after each update monotonically improve over the previous latest policy $\pi_k$?
$$
J(\pi_{k+1}) \geq J(\pi_k)
$$
Unified Modeling Framework
Two typical forms of dynamic mixture sampling can be uniformly characterized by the index transition kernel $q(i'\mid i)$:
- Trajectory-level mixture (an abstraction of conventional asynchronous training; identity index transition): $q(i'\mid i) = \mathbf{1}\{i'=i\}$
- Step/segment-level mixture (an abstraction of partial rollout, or segment-based sampling; allows switching): $q(i'\mid i) = (1-\sigma(i))\mathbf{1}\{i'=i\} + \sigma(i)\kappa(i'\mid i)$, where $\sigma(i)$ is the switching probability and $\kappa(\cdot\mid i)$ is the target index distribution.
5.2 Decomposition and Monotonic Improvement Bound
By introducing the mixture return $J_{\mathrm{mix}}^{(k)}$ as an intermediate bridge, the performance difference decomposes as:
$$
J(\pi_{k+1}) - J(\pi_k) = \underbrace{[J(\pi_{k+1}) - J_{\mathrm{mix}}^{(k)}]}_{\text{improvement over mixture policy}} + \underbrace{[J_{\mathrm{mix}}^{(k)} - J(\pi_k)]}_{\text{mixture bias term}}
$$
The first term is handled using Theorem 3.2. The second is the mixture bias term. The way to handle it is to expand $J_{\mathrm{mix}}^{(k)} - J(\pi_k)$ into a weighted sum of $J(\pi^{(i)}) - J(\pi_k)$ terms, apply a TV-based two-policy lower bound to each component, and then collect everything using $\|A^{\pi_k}\|_\infty$. This gives:
$$
J_{\mathrm{mix}}^{(k)} - J(\pi_k) \geq -\frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big]
$$
Monotonic Improvement Bound
Combining the above results yields the core theorem:
Theorem 5.1 (Monotonic Improvement Lower Bound Under Dynamic Mixture Sampling)
$$
\begin{aligned}
J(\pi_{k+1}) - J(\pi_k) \geq\;& L_{\beta^{(k)}}(\pi_{k+1}) \\
&- \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \big] \\
&- \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big]
\end{aligned}
$$
Here $L_{\beta^{(k)}}(\pi_{k+1})$ denotes the surrogate objective relative to the behavior policy $\beta^{(k)}$ (the same shape as $L_{\pi_k}(\pi)$ in Section 3, but with the behavior policy generalized from a single $\pi_k$ to the mixture $\beta^{(k)}$). More explicitly, one can write
$$
L_{\beta^{(k)}}(\pi_{k+1}) := \frac{1}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}},\, a\sim \pi^{(i)}(\cdot\mid s)}\left[\frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}\,A^{\beta^{(k)}}((s,i),a)\right].
$$
Similarly, define
$$
C_{\pi_{k+1},\beta^{(k)}} := \max_{(s,i)}\left|\mathbb{E}_{a\sim \pi_{k+1}(\cdot\mid s)}\big[A^{\beta^{(k)}}((s,i),a)\big]\right|.
$$
This lower bound contains two penalties, so it naturally points to two separate control problems.
5.3 Why Direct Constraints Are Infeasible: Triangle Inequality Decomposition
Before going further, one important qualifier: the infeasibility discussed here concerns an interpretation in which we try to impose the same hard trust-region constraint against every historical source policy. That is not the same as saying PPO-Clip itself explicitly implements such a constraint.
The update-shift penalty in Theorem 5.1 may look controllable through constraints on $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s)$, but that interpretation quickly runs into a practical infeasibility.
Observation 5.2 (Infeasibility of a Uniform Hard Trust Region)
Suppose the mixture sampling includes two old policies $\pi^{(1)}$ and $\pi^{(2)}$. If there exists some state $s$ such that $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) > 2\delta$, then no policy $\pi_{k+1}$ can simultaneously satisfy $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(1)}; s) \leq \delta$ and $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(2)}; s) \leq \delta$.
Proof
By the triangle inequality, if both constraints were satisfied, then $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) \leq 2\delta$, a contradiction.
Root Cause
The update shift penalty directly couples $\pi_{k+1}$ with the historical policy family $\{\pi^{(i)}\}$, whose internal structure is a product of historical training and not controllable by the current update.
Triangle Inequality Decomposition
The solution leverages the triangle inequality of TV distance:
$$
D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \leq D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s) + D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)
$$
This decomposes the coupled constraint into two independent parts. Define:
$$
U_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)\big], \quad S_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)\big]
$$
Corollary 5.3 (Decomposed Monotonic Improvement Lower Bound)
$$
J(\pi_{k+1}) - J(\pi_k) \geq L_{\beta^{(k)}}(\pi_{k+1}) - \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} U_k - \left( \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \right) S_k
$$
Why Does Decomposition Solve the Problem?
The key is that after decomposition, $U_k$ only involves the new policy $\pi_{k+1}$ and the current policy $\pi_k$, completely independent of the structure of the old policy family $\{\pi^{(i)}\}$. So no matter how far apart the old policies are, constraining $U_k$ remains feasible. This is exactly how the infeasibility in Observation 5.2 is avoided.
Operationally, this leads to a simple separation of concerns, sketched in code below:

| Control Term | Responsible Party | Control Mechanism |
|---|---|---|
| $U_k$ (update increment shift) | Optimization algorithm | Policy clipping |
| $S_k$ (sampling staleness) | Sampling system | Data filtering, version window |
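A schematic of this division of labor (everything here, from function names to thresholds, is an illustrative sketch rather than a prescribed implementation): the data side filters staleness before batching, and the optimization side clips the increment ratio during the update.

```python
import numpy as np

def filter_stale(samples, eps_stale=0.5):
    """Data side: keep samples whose |pi_k/pi^(i) - 1| is small (controls S_k)."""
    kept = []
    for s in samples:
        rho_k = np.exp(s["logp_current"] - s["logp_source"])  # pi_k / pi^(i)
        if abs(rho_k - 1.0) <= eps_stale:
            kept.append(s)
    return kept

def incremental_clip_update(samples, logp_new_fn, eps=0.2):
    """Optimization side: clip pi_{k+1}/pi_k around 1 (controls U_k)."""
    losses = []
    for s in samples:
        r = np.exp(logp_new_fn(s) - s["logp_current"])        # pi_{k+1} / pi_k
        rho_k = np.exp(s["logp_current"] - s["logp_source"])  # importance weight
        a_hat = rho_k * s["advantage"]
        losses.append(min(r * a_hat, np.clip(r, 1 - eps, 1 + eps) * a_hat))
    return -float(np.mean(losses))  # negative: a loss to minimize

# Toy usage: one fresh and one very stale sample.
batch = [
    {"logp_source": -1.0, "logp_current": -1.1, "advantage": 0.5},
    {"logp_source": -3.0, "logp_current": -1.0, "advantage": 0.2},  # very stale
]
kept = filter_stale(batch)
print(len(kept), incremental_clip_update(kept, lambda s: s["logp_current"] + 0.05))
```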
5.4 Advantage Replacement Error: Off-Policy Is Not Only a Ratio Problem
The bounds above assume that the surrogate uses the theoretical advantage $A^{\pi_k}(s,a)$. In LLM RL, however, the loss often uses a batch estimate, critic / GAE output, normalized verifier reward, or group-relative advantage. Let the actual advantage be $\hat A(s,a)$. Even if the importance ratio is correct, there is an additional advantage-replacement error:
$$
\mathbb{E}_{\mu}\left[\rho(s,a)\hat A(s,a)\right]
-
\mathbb{E}_{\mu}\left[\rho(s,a)A^{\pi_k}(s,a)\right]
=
\mathbb{E}_{\mu}\left[\rho(s,a)(\hat A(s,a)-A^{\pi_k}(s,a))\right].
$$
If $|\rho(s,a)|\le M$ and $|\hat A(s,a)-A^{\pi_k}(s,a)|\le \epsilon_A$, then
$$
\left|
\mathbb{E}_{\mu}\left[\rho(s,a)(\hat A(s,a)-A^{\pi_k}(s,a))\right]
\right|
\le M\epsilon_A.
$$
Thus the off-policy error has at least three components: update shift $U_k$, sampling staleness $S_k$, and advantage replacement error $\epsilon_A$. Focusing only on the actor ratio misses the third term; theoretically, whether $\hat A$ remains close to $A^{\pi_k}$ is part of the monotonic-improvement condition.
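To make the third component concrete, a small sketch (the variable names are mine) that reports both the empirical gap $\mathbb{E}_\mu[\rho(\hat A - A)]$, using a reference advantage estimate standing in for $A^{\pi_k}$, and the worst-case bound $M\epsilon_A$:

```python
import numpy as np

def advantage_replacement_terms(ratio, a_hat, a_ref):
    """Empirical E_mu[rho * (A_hat - A_ref)] and the bound M * eps_A.

    ratio: importance ratios rho(s, a) on the sampled actions.
    a_hat: advantages actually used in the loss.
    a_ref: reference advantage estimates standing in for A^{pi_k}.
    """
    ratio, a_hat, a_ref = map(np.asarray, (ratio, a_hat, a_ref))
    empirical_gap = float(np.mean(ratio * (a_hat - a_ref)))
    worst_case = float(np.max(np.abs(ratio)) * np.max(np.abs(a_hat - a_ref)))
    return empirical_gap, worst_case

print(advantage_replacement_terms([1.0, 1.3, 0.8], [0.5, -0.2, 0.1], [0.4, -0.1, 0.3]))
```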
6. Comparison of Trajectory-Level and Step/Segment-Level Mixture
6.1 Mechanism Differences and Estimation Implications
The essential difference between the two mixture mechanisms lies in the structure of the index transition kernel: trajectory-level mixture keeps the index fixed within a trajectory ($q(i'\mid i)=\mathbf{1}\{i'=i\}$), while step/segment-level mixture allows the index to switch during a trajectory. In common engineering terminology, trajectory-level mixture corresponds to conventional asynchronous training, and step/segment-level mixture corresponds to partial rollout or segment-based sampling.
The key dividing line is whether Lemma 4.1's structural simplification holds: trajectory-level mixture satisfies the advantage reduction, while step/segment-level mixture generally does not, because future returns depend on the index transition kernel.
Differences in Sampling Staleness $S_k$
Trajectory-level mixture's staleness arises from the mixture weights $\alpha_i^{(k)}$ retaining probability mass on old policies after the new policy is released.
Step/segment-level mixture has an exponential compression effect in a simplified model: suppose that once the index switches from an old version to the new version it never switches back, and that the switch happens with probability $\sigma$ at each step. Then the marginal mass on the old index under the discounted visitation distribution is
$$
(1-\gamma)\sum_{t\ge 0}[\gamma(1-\sigma)]^t = \frac{1-\gamma}{1-\gamma(1-\sigma)}.
$$
As long as $\sigma \gg 1-\gamma$, the old-policy mass is significantly compressed.
Differences in Surrogate Objective Estimation
Trajectory-level mixture: the advantage function reduces to $A^{\pi^{(i)}}(s,a)$, with a clear estimation path.
Advantage substitution bias in step/segment-level mixture: if single-policy advantage estimates are used, systematic bias will arise. The reason is that $A^{\beta^{(k)}}((s,i),a)$ requires taking expectations over future index switching, while $A^{\pi^{(i)}}(s,a)$ implicitly assumes "the future always follows $\pi^{(i)}$."
Unification Under Bandit Setting
In single-step-episode LLM training, with no subsequent state transitions, the estimation problems of the two mechanisms unify, and no such bias arises.
6.2 Risks and Applicable Scenarios
Step/segment-level mixture has another hidden concern: even if single-step importance ratios are clipped, multi-step noise accumulation over long trajectories can still amplify gradient estimation variance. When policy changes per update are large, "behavioral discontinuities" within trajectories may induce heavier-tailed ratio distributions. This is also why Table 6.1 recommends trajectory-level mixture for scenarios with large policy change per update.
Applicable Scenarios
Table 6.1 Applicable Scenarios for Two Mixture Mechanisms

| Scenario Characteristics | Recommended Mechanism | Rationale |
|---|---|---|
| Long trajectories, high-frequency updates, strong asynchrony | Step/segment-level | Can significantly compress $S_k$ |
| Short trajectories (non-bandit) | Trajectory-level | $S_k$ is naturally low |
| Large policy change per update | Trajectory-level | Avoids variance amplification |
| Single-step episode (bandit) | Either | Choose based on implementation convenience |
| Need for compromise | Segment-level | Switch at natural boundaries |

Core trade-off: step/segment-level mixture is stronger on the sampling side (fast staleness removal), while trajectory-level mixture is more stable on the estimation side (easier surrogate objective estimation).
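A worked numeric reading of the staleness comparison (the values simply evaluate the geometric-compression formula from Section 6.1 for a few hypothetical switching probabilities):

```python
# Residual mass on old-policy data under the discounted visitation distribution.
def step_level_old_mass(gamma, sigma):
    """(1 - gamma) / (1 - gamma * (1 - sigma)): geometric compression from switching."""
    return (1 - gamma) / (1 - gamma * (1 - sigma))

gamma = 0.99
for sigma in (0.0, 0.05, 0.2):
    print(f"sigma={sigma:.2f}: old-policy mass = {step_level_old_mass(gamma, sigma):.3f}")
# sigma=0.00 -> 1.000 (no switching: behaves like trajectory-level with alpha_old = 1)
# sigma=0.05 -> 0.168, sigma=0.20 -> 0.048: sigma >> 1 - gamma compresses the mass quickly
```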
7. Theoretical Foundations of Clipping Mechanisms
7.1 From TV Distance to Sample-Controllable Quantities
Corollary 5.3 tells us that to guarantee monotonic improvement, we need to control the update increment shift $U_k = \mathbb{E}[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)]$. But TV distance is a distribution-level quantity, so how do we control it with samples? The bridge from theory to samples is the following identity.
Lemma 7.1 (Ratio Difference Representation of TV Distance)
Suppose policy $\pi_1$'s support covers the supports of $\pi$ and $\pi_2$. Then for any state distribution $\mu$:
$$
\mathbb{E}_{s\sim \mu} \big[D_{\mathrm{TV}}(\pi, \pi_2; s)\big] = \frac{1}{2} \mathbb{E}_{s\sim \mu, a\sim\pi_1(\cdot\mid s)} \left| \frac{\pi(a\mid s)}{\pi_1(a\mid s)} - \frac{\pi_2(a\mid s)}{\pi_1(a\mid s)} \right|
$$
Note the support-coverage requirement: the behavior policy used in the denominator must assign nonzero mass to the actions that appear in training. For LLMs, this means hard top-k / top-p truncation without smoothing can make some ratios undefined. Section 8 returns to that issue.
Intuitive Understanding
The left side is the TV distance between two distributions (requiring enumeration over all actions), while the right side is the absolute difference of two importance ratios when sampling under $\pi_1$. This enables us to estimate and control TV distance using samples.
Sample Representation of $U_k$
Using Lemma 7.1, setting $\pi = \pi_{k+1}$, $\pi_2 = \pi_k$, $\pi_1 = \pi^{(i)}$ (the sampling-source policy), we obtain:
$$
U_k = \frac{1}{2} \mathbb{E}_{(s,i) \sim d_{\beta^{(k)}}, a \sim \pi^{(i)}(\cdot\mid s)} \left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right|
$$
Denoting $\rho_{k+1} := \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}$ and $\rho_k := \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)}$, we have:
$$
U_k = \frac{1}{2} \mathbb{E}_{(s,i,a) \sim \text{training data}} \big| \rho_{k+1} - \rho_k \big|
$$
This means: if we can ensure $\lvert\rho_{k+1} - \rho_k\rvert \leq \epsilon$ for each sample, we can guarantee $U_k \leq \epsilon/2$.
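This sample representation is directly computable from logged log-probabilities. A minimal sketch (the names are illustrative, not the post's):

```python
import numpy as np

def estimate_update_shift(logp_new, logp_current, logp_source):
    """Estimate U_k = 0.5 * E|rho_{k+1} - rho_k| from logged log-probabilities.

    logp_new:     log pi_{k+1}(a|s) on the sampled actions (candidate update).
    logp_current: log pi_k(a|s).
    logp_source:  log pi^(i)(a|s) under each sample's own source policy.
    """
    logp_new, logp_current, logp_source = map(np.asarray,
                                              (logp_new, logp_current, logp_source))
    rho_new = np.exp(logp_new - logp_source)      # pi_{k+1} / pi^(i)
    rho_cur = np.exp(logp_current - logp_source)  # pi_k / pi^(i)
    return 0.5 * float(np.mean(np.abs(rho_new - rho_cur)))

print(estimate_update_shift([-1.0, -2.0], [-1.05, -1.9], [-1.1, -1.8]))
```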
7.2 Constraining $U_k$: Two Clipping Options
Method 1: Direct Constraint on Ratio Difference
For each sample $(s, i, a)$, require:
$$
\left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right| \leq \epsilon
$$
The clipping interval is $\left[\frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} - \epsilon, \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} + \epsilon\right]$, with clipping center at $\rho_k$ rather than 1.
Method 2: Constraint on Incremental Ratio
Noting that $\rho_{k+1} - \rho_k = \rho_k \cdot \left(\frac{\pi_{k+1}}{\pi_k} - 1\right)$, we have:
$$
|\rho_{k+1} - \rho_k| = \rho_k \cdot \left|\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right|
$$
If we constrain $\left\lvert\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right\rvert \leq \epsilon$, then
$$
|\rho_{k+1} - \rho_k| \leq \epsilon\,\rho_k.
$$
Taking expectations gives $\mathbb{E}[|\rho_{k+1} - \rho_k|] \leq \epsilon\,\mathbb{E}_{a\sim\pi^{(i)}}[\rho_k] = \epsilon$, hence $U_k \leq \epsilon/2$.
This method clips $\pi_{k+1}/\pi_k$ with center at 1, meaning the clipping constraint itself does not depend on the old policy family $\pi^{(i)}$. However, if we use the weighted advantage $\hat{A}=\rho_k\cdot A^{\beta^{(k)}}$ below, we still need per-sample behavior probabilities (or recorded logprobs) to compute $\rho_k$.
Objective Functions (Three Clipping Mechanisms)
Before writing down the clipped objectives, one caveat is worth making explicit: the two inequalities above are hard per-sample constraints at the theory level. The clipped surrogates below are practical approximations meant to keep $U_k$ in a manageable range; they are not literal guarantees that every optimization step exactly satisfies the hard constraint.
For comparison, we present the complete objective functions for three clipping mechanisms. Suppose the current sample comes from old policy $\pi^{(i)}$, and denote $\rho_{k+1} := \pi_{k+1}(a\mid s)/\pi^{(i)}(a\mid s)$, $\rho_k := \pi_k(a\mid s)/\pi^{(i)}(a\mid s)$, and $r := \pi_{k+1}(a\mid s)/\pi_k(a\mid s)$.
Note: under trajectory-level mixture (index fixed), $A^{\beta^{(k)}}((s,i),a)=A^{\pi^{(i)}}(s,a)$, so per-trajectory advantages from the corresponding old policy are consistent; under step/segment-level mixture, replacing $A^{\beta^{(k)}}$ with $A^{\pi^{(i)}}$ introduces advantage-substitution bias (discussed in Section 6), so the advantage/value estimator must reflect future index switching.
Standard PPO (Trajectory-Level Mixture)
Clip $\rho_{k+1}$ with center at 1:
$$
L^{\mathrm{PPO}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\pi^{(i)}}, \; \mathrm{clip}(\rho_{k+1}, 1-\epsilon, 1+\epsilon) \cdot A^{\pi^{(i)}} \right) \right]
$$
Method 1
Clip $\rho_{k+1}$ with center at $\rho_k$:
$$
L^{\mathrm{M1}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\beta^{(k)}}, \; \mathrm{clip}(\rho_{k+1}, \rho_k-\epsilon, \rho_k+\epsilon) \cdot A^{\beta^{(k)}} \right) \right]
$$
Method 2
Clip the incremental ratio $r$ with center at 1:
$$
L^{\mathrm{M2}} = \mathbb{E} \left[ \min\left( r \cdot \hat{A}, \; \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot \hat{A} \right) \right]
$$
where $\hat{A} = \rho_k \cdot A^{\beta^{(k)}}$ is the importance-weighted advantage estimate.
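For concreteness, a side-by-side sketch of the three clipped per-sample terms (a toy NumPy version with names of my choosing; in a real trainer these would be computed from model log-probabilities, and the advantage inputs follow the note above). The toy numbers show the qualitative difference on a stale sample: PPO's clip caps the term against the stale source policy, while Methods 1 and 2 allow the same move because it is small relative to $\pi_k$.

```python
import numpy as np

def ppo_term(rho_new, adv, eps=0.2):
    """Standard PPO: clip rho_{k+1} = pi_{k+1}/pi^(i) around 1."""
    return np.minimum(rho_new * adv, np.clip(rho_new, 1 - eps, 1 + eps) * adv)

def m1_term(rho_new, rho_cur, adv, eps=0.2):
    """Method 1: clip rho_{k+1} around rho_k = pi_k/pi^(i)."""
    return np.minimum(rho_new * adv, np.clip(rho_new, rho_cur - eps, rho_cur + eps) * adv)

def m2_term(r, rho_cur, adv, eps=0.2):
    """Method 2: clip the increment r = pi_{k+1}/pi_k around 1, weight adv by rho_k."""
    a_hat = rho_cur * adv
    return np.minimum(r * a_hat, np.clip(r, 1 - eps, 1 + eps) * a_hat)

# One stale sample: the source policy is far from pi_k (rho_k = 1.8).
rho_cur, r, adv = 1.8, 1.1, 0.5
rho_new = rho_cur * r
print(ppo_term(rho_new, adv), m1_term(rho_new, rho_cur, adv), m2_term(r, rho_cur, adv))
# PPO clips the term to 0.6; Methods 1 and 2 both keep 0.99.
```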
7.3 Method Comparison and Selection
For the hard per-sample constraints discussed earlier, Methods 1 and 2 do directly control $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$. For the clipped surrogates used in practice, however, the more accurate reading is that they exert optimization pressure on different shift objects rather than explicitly imposing a TV constraint.
Table 7.1 Comparison of Three Clipping Mechanisms

| Method | Clipped Variable | Clipping Center | Clipping Interval | More Natural Shift Object |
|---|---|---|---|---|
| Standard PPO | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | New policy relative to $\pi^{(i)}$ |
| Method 1 | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $\rho_k = \pi_k/\pi^{(i)}$ | $[\rho_k-\epsilon, \rho_k+\epsilon]$ | New policy relative to $\pi_k$ |
| Method 2 | $r = \pi_{k+1}/\pi_k$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | New policy relative to $\pi_k$ |

The Fundamental Problem with Standard PPO Under Multi-Policy Mixture
If we carry over a single-source trust-region intuition, standard PPO's clipped objective is most naturally read as suppressing further deviation of the new policy from each sampling-source policy $\pi^{(i)}$. But PPO-Clip does not explicitly impose TV / KL constraints; rather, clipping removes gains from moving farther away from the behavior policy. When the old policies $\pi^{(1)}, \pi^{(2)}, \ldots$ differ substantially, that optimization pressure is easily dominated by the stalest sources.
Common Advantages of Methods 1 and 2
For the hard-constraint versions introduced earlier, Methods 1 and 2 directly control $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$. For the clipped-surrogate versions used in practice, the more accurate statement is that they redirect optimization pressure from "stay close to every behavior policy at once" toward "control the update around the current policy $\pi_k$." Since $\pi_k$ is unique, that target is shared across all sample sources and avoids the structural difficulty behind the infeasibility result.
Method 1 vs Method 2

| Comparison Dimension | Method 1 (Adaptive Clipping) | Method 2 (Incremental Clipping) |
|---|---|---|
| Stale samples ($\rho_k \gg 1$) | Automatically tightens constraints, more conservative | May produce large gradient variance |
| LLM large-vocabulary low-probability tokens | Allows larger absolute changes (additive) | Absolute changes are limited (multiplicative) |
| Implementation complexity | Requires storing $\pi^{(i)}(a\mid s)$ and $\pi_k(a\mid s)$ | Needs $\pi_k(a\mid s)$ and $\pi^{(i)}(a\mid s)$ (or stored logprobs) to compute $\rho_k$; clipping itself uses only $\pi_{k+1}/\pi_k$ |
| Advantage function | Uses $A^{\beta^{(k)}}$ | Uses weighted advantage $\rho_k \cdot A^{\beta^{(k)}}$ |

Detailed Explanations
(1) Handling Stale Samples
When samples come from very old policies, $\rho_k = \pi_k/\pi^{(i)}$ can be large. Method 1's interval $[\rho_k-\epsilon, \rho_k+\epsilon]$ keeps a fixed absolute width around that large center, so the allowed relative change is automatically tightened and the update is more conservative; Method 2 instead weights the advantage by $\rho_k$, so very stale samples can inflate gradient variance.
(2) LLM Large Vocabulary Issue
Large language models have many tokens with very small probabilities. Method 1's additive window permits an absolute change of up to $\epsilon\,\pi^{(i)}(a\mid s)$ in the new policy's probability, while Method 2's multiplicative band $[(1-\epsilon)\pi_k, (1+\epsilon)\pi_k]$ limits low-probability tokens to tiny absolute movements.
7.4 Staleness Control and Operational Meaning
The discussion so far has focused on optimization-side clipping, but the monotonic-improvement lower bound also contains a sampling-staleness term $S_k$. This subsection handles the sampling side's responsibility and then returns to what clipping, as an overall operation, actually means.
Controlling Sampling Staleness
Corollary 5.3 shows that $S_k$ also enters the monotonic-improvement lower bound, but it cannot be controlled from the optimization side. It has to be handled by the sampling system:
(1) Discarding Stale Data
Set a threshold $\epsilon_{\mathrm{stale}}$. For each sample, compute $\lvert\rho_k - 1\rvert = \lvert\pi_k(a\mid s)/\pi^{(i)}(a\mid s) - 1\rvert$, and discard samples exceeding the threshold.
(2) Controlling Policy Version Window
Limit the number of old policy versions in the mixture sampling, e.g., using only data from the most recent $W$ versions.
Operational Meaning of Clipping
Finally, we clarify the relationship between clipping and the theoretical lower bound. In Corollary 5.3, the coefficient of $U_k$, namely $C_{\pi_{k+1},\beta^{(k)}}$, depends on the new policy $\pi_{k+1}$, so the penalty term cannot simply be replaced by a fixed constant. Operationally, the clipped objective should be read as an approximation to a constrained update, not as a literal theorem with a hand-picked scalar penalty:
Maximize the surrogate objective $L_{\beta^{(k)}}(\pi_{k+1})$ subject to the constraint $U_k \leq \epsilon/2$.
The clipping objective can be read as a practical approximation to this constrained optimization: clipping keeps the update magnitude in check so that $U_k$ stays manageable, and then gradient ascent pushes up the surrogate objective within that approximation.
Section Summary
This section established the theoretical foundations of clipping mechanisms: Lemma 7.1 turns the TV-based quantity $U_k$ into a sample-computable ratio difference; the two clipping options (around $\rho_k$, or around 1 on the increment $\pi_{k+1}/\pi_k$) keep $U_k$ controllable regardless of how far apart the old policies are; the staleness term $S_k$ is left to the sampling system; and the clipped objective as a whole approximates the constrained update "maximize the surrogate subject to $U_k \leq \epsilon/2$."
8. Handling Training-Inference Inconsistency
8.1 Background and Effective Staleness
In large-scale distributed training, the policies on the inference side and the training side may be inconsistent: let the behavior policy modeled on the training side be $\pi^{(i)}$, while the policy actually sampling on the inference side is $\hat{\pi}^{(i)}$.
The mismatch discussed here is between the policy that actually generated the samples and the policies the training side works with; it is distinct from the policy-vs-reference-model KL regularization that is common in RLHF, which is a different regularization axis.
Effective Staleness
Define effective staleness:
$$
\hat{S}_k := \mathbb{E}_{(s,i) \sim d_{\hat{\beta}^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_k, \hat{\pi}^{(i)}; s) \big]
$$
This definition simultaneously covers version staleness and training-inference implementation differences. By Lemma 7.1, $\hat{S}_k$ can be written in a sample-computable form.
8.2 Theoretical Control of Effective Staleness
Key Theoretical Conditions
Given threshold $\epsilon_{\mathrm{stale}}$, if training only uses samples satisfying $\lvert\pi_k(a\mid s)/\hat{\pi}^{(i)}(a\mid s) - 1\rvert \leq \epsilon_{\mathrm{stale}}$, then the effective staleness on the conditional distribution of retained samples (which we may denote by $\hat{S}_k^{\mathrm{eff}}$) can be controlled to at most $\epsilon_{\mathrm{stale}}/2$. This controls the filtered training distribution, not the original sampling distribution's $\hat{S}_k$.
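A sketch of this filtering condition (the sampler-side log-probabilities would come from whatever the inference engine actually reports; the function name and threshold are mine):

```python
import numpy as np

def effective_staleness_filter(logp_current, logp_sampler, eps_stale=0.3):
    """Keep samples with |pi_k(a|s) / hat_pi^(i)(a|s) - 1| <= eps_stale.

    logp_current: log pi_k(a|s) recomputed on the training side.
    logp_sampler: log hat_pi^(i)(a|s) reported by the inference-side sampler.
    Returns a boolean mask of retained samples.
    """
    ratio = np.exp(np.asarray(logp_current) - np.asarray(logp_sampler))
    return np.abs(ratio - 1.0) <= eps_stale

mask = effective_staleness_filter([-1.00, -2.00, -0.50], [-1.05, -1.20, -0.52])
print(mask)  # the middle sample is dropped: its version/implementation gap is too large
```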
9. Summary: Theoretical Checklist
Core Theoretical Framework
The structure of the monotonic improvement lower bound is:
$$
J(\pi_{k+1}) - J(\pi_k) \geq \underbrace{L_{\beta^{(k)}}(\pi_{k+1})}_{\text{surrogate objective}} - \underbrace{C_1 \cdot U_k}_{\text{update shift penalty}} - \underbrace{C_2 \cdot S_k}_{\text{sampling staleness penalty}}
$$
Here $C_1$ and $C_2$ are just compressed notation for the theoretical coefficients introduced earlier. Concretely, one can read them as
$$
C_1 = \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2},
\qquad
C_2 = C_1 + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma},
$$
up to the exact naming convention used in the intermediate statements. They are not free hyperparameters.
If the advantage-replacement error from Section 5.4 is included explicitly, the structure becomes
$$
J(\pi_{k+1}) - J(\pi_k)
\gtrsim
L_{\beta^{(k)}}(\pi_{k+1})
- C_1 U_k
- C_2 S_k
- C_3 \epsilon_A,
$$
where $C_3$ is determined by conditions such as the bound on the importance ratio. This is not a new algorithmic term; it is a reminder that monotonic improvement also depends on the actual advantage estimate being close enough to the theoretical advantage.
Theoretical Separation of Concerns

| Control Term | Theoretical Meaning | Constraint Type | Distribution Object |
|---|---|---|---|
| $U_k$ | Current update shift | Constrain $\pi_{k+1}$ relative to $\pi_k$ | Target vs. proximal policy |
| $S_k$ | Sampling staleness shift | Constrain behavior-proximal distance | Behavior vs. proximal distribution |
| $\epsilon_A$ | Advantage replacement error | Bound $\hat A-A^{\pi_k}$ | Estimated vs. theoretical advantage |

Theoretical Role of Clipping Terms

| Clipped object | Theoretical role | Cost |
|---|---|---|
| Directly constrain $\pi_{k+1}/\pi^{(i)}$ | Controls update shift and part of staleness jointly | Stronger dependence on each behavior version |
| Constrain increment $\pi_{k+1}/\pi_k$ | Controls only current update shift $U_k$ | Requires separate control of $S_k$ |
| Hard clipping or filtering of ratios | Produces a more conservative surrogate bound | Introduces a biased objective that must be reinterpreted |

Theoretical Treatment of Training-Inference Inconsistency
Effective staleness $\hat{S}_k$ replaces $S_k$ when the inference-side sampler $\hat{\pi}^{(i)}$ differs from the training-side model of the behavior policy; the sample-filtering condition of Section 8.2 controls it on the retained-data distribution.
Appendix
A. Proof Sketch: From State-Distribution Difference to Average TV
A standard starting point for Lemma 3.1 is the fixed-point equation of discounted state visitation distributions:
$$
d_\pi = (1-\gamma)\rho_0 + \gamma P_\pi^\top d_\pi,
\qquad
d_{\pi_k} = (1-\gamma)\rho_0 + \gamma P_{\pi_k}^\top d_{\pi_k}.
$$
Subtracting the two equations and rearranging expresses $d_\pi-d_{\pi_k}$ in terms of the difference between policy-induced transition kernels acting on the old distribution. Taking an $\ell_1$ upper bound, using the non-expansiveness of Markov kernels in $\ell_1$, and then applying
$$
\|(P_\pi-P_{\pi_k})(\cdot\mid s)\|_1 \le 2D_{\mathrm{TV}}(\pi,\pi_k;s),
$$
one obtains the average-TV control
$$
\|d_\pi-d_{\pi_k}\|_1 \le \frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d_{\pi_k}}\big[D_{\mathrm{TV}}(\pi,\pi_k;s)\big].
$$
I omit the linear-operator algebra and constant bookkeeping here because the post only uses the final average-TV form.
B. Quick Reference for Key Symbols

| Symbol | Meaning |
|---|---|
| $\pi_k$, $\pi^{(i)}$ | Latest policy at round $k$; $i$-th old policy |
| $d_\pi(s)$, $A^\pi(s,a)$ | Discounted state visitation distribution; advantage function |
| $D_{\mathrm{TV}}(\pi, \pi'; s)$ | TV distance between two policies at state $s$ |
| $\beta^{(k)}(a \mid s, i) := \pi^{(i)}(a \mid s)$ | Mixture behavior policy at round $k$ |
| $q(i' \mid i)$, $\alpha_i^{(k)}$ | Index transition kernel; initial index distribution |
| $U_k$, $S_k$ | Update increment shift; sampling staleness |
| $\epsilon$, $\epsilon_{\mathrm{stale}}$, $W$ | Clipping radius; staleness threshold; version window |
| $C_{\pi,\pi_k}$ | Expected advantage upper bound coefficient |

References
Jacob Hilton, Karl Cobbe, John Schulman. "Batch size-invariance for policy optimization" (Decoupled PPO). arXiv:2110.00641. https://arxiv.org/abs/2110.00641