This post studies a recurring question in large-scale LLM reinforcement learning: when a training batch mixes data generated by multiple historical policy versions, can one still write down an explicit monotonic-improvement lower bound for PPO-style updates?

The short answer is yes: under dynamic mixture sampling, the bound can be summarized as “surrogate objective, minus an update-shift penalty, minus a sampling-staleness penalty.”

1. Introduction: Why Should We Care About Off-Policy Training?

When training a large language model with reinforcement learning, the most direct setup is on-policy training: generate a batch of data, update on that batch, then sample again from the updated model.

In large-scale distributed training, though, hundreds of GPUs sample in parallel and model updates take time. By the time a new version is deployed, data generated by older versions are often still sitting in the queue: throwing them away is wasteful, but using them means training on stale data.

That is the central off-policy question: when can data collected by older policies still support an analyzable monotonic-improvement lower bound for a newer one?

We will ultimately see that the lower bound is governed by three pieces: a surrogate objective we try to maximize, an update-shift penalty controlled on the optimization side, and a sampling-staleness penalty controlled on the data side.

In many RLHF / online alignment setups, if we view the prompt as context and the response as action while ignoring long-horizon environment evolution, the problem is often well approximated as a contextual bandit. I still start from the discounted-MDP setting because it lets us write multi-version behavior mixing, sampling staleness, and clipping in one unified language. Section 7 returns to which terms disappear, and which conclusions remain, in the bandit limit.

Related work has already touched neighboring parts of this picture: GePPO studies off-policy sample reuse with policy-improvement guarantees, while Decoupled PPO explicitly separates the behavior policy from the proximal policy. The emphasis here is different: I expand the behavior side into a dynamic mixture of historical policy versions and then split the risk into update increment shift and sampling staleness. You can also read this post as a continuation of the earlier three-policy perspective: here the behavior side is no longer a single policy $\mu$, but a mixture over historical policies $\{\pi^{(i)}\}$, while $\pi_k$ and $\pi_{k+1}$ play the roles of the current reference policy and update target. Even without that earlier post, the only principle needed here is to separate what the current update can control from what comes from behavior-distribution mismatch.

2. Theoretical Foundations

2.1 Basic Setup

We consider a standard Markov Decision Process (MDP) comprising a state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $p(s'\mid s,a)$, reward function $r(s,a)$, initial distribution $\rho_0$, and discount factor $\gamma \in (0,1)$.

The expected cumulative discounted return of policy $\pi$ is:

$$ J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid \pi\right] $$

Discounted State Visitation Distribution

We define it as:

$$ d_\pi(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi) $$

Advantage Function

We define it as:

$$ A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s) $$

Total Variation Distance (TV Distance)

We define it as:

$$ D_{\mathrm{TV}}(\pi, \pi'; s) := \frac{1}{2} \sum_{a \in \mathcal{A}} |\pi(a \mid s) - \pi'(a \mid s)| $$

Throughout, we use $\mid$ for conditional probability (e.g., $\pi(a\mid s)$) and reserve $\|\cdot\|$ for norms.
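For concreteness, the TV distance at a fixed state is a one-liner over the action distribution. A minimal NumPy sketch with made-up action probabilities:

```python
import numpy as np

def tv_distance(p, q):
    """D_TV(p, q) = 0.5 * sum_a |p(a) - q(a)| for categorical distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Two action distributions at the same state s (illustrative values)
pi  = [0.7, 0.2, 0.1]
pi2 = [0.5, 0.3, 0.2]
print(tv_distance(pi, pi2))   # ≈ 0.2
```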

2.2 Core Tool: Policy Performance Difference Lemma

The starting point of the analysis is the classic performance difference lemma, which writes $J(\pi)-J(\pi_k)$ exactly as an expectation of old-policy advantage under the new policy’s occupancy measure. This identity goes back to Kakade-Langford style analysis and is also the starting point of TRPO.

Lemma 2.1 (Policy Performance Difference Lemma)

For any policies $\pi_k$ (old) and $\pi$ (new), the performance difference can be expressed as:

$$ J(\pi) - J(\pi_k) = \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_\pi}\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi_k}(s,a)] \right] $$

Intuitive understanding: How much better the new policy is than the old equals the “average advantage” obtained by selecting actions according to the new policy under the state distribution visited by the new policy.
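The lemma is easy to verify numerically. Below is a self-contained check on a small random tabular MDP, computing both sides exactly via linear solves; the MDP, policies, and seed are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Random tabular MDP: rewards r(s,a), transitions P[s,a,s'], initial dist rho0.
r = rng.uniform(size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
rho0 = rng.dirichlet(np.ones(nS))

def analyze(pi):
    """Exact V^pi, A^pi, d_pi, J(pi) for a tabular policy pi[s, a]."""
    P_pi = np.einsum('sa,sab->sb', pi, P)       # state transitions under pi
    r_pi = np.einsum('sa,sa->s', pi, r)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('sab,b->sa', P, V)
    A = Q - V[:, None]
    # d_pi^T = (1-gamma) rho0^T (I - gamma P_pi)^{-1}
    d = (1 - gamma) * np.linalg.solve((np.eye(nS) - gamma * P_pi).T, rho0)
    return A, d, rho0 @ V

pi_k  = rng.dirichlet(np.ones(nA), size=nS)     # old policy
pi_new = rng.dirichlet(np.ones(nA), size=nS)    # new policy
A_k, _, J_k = analyze(pi_k)
_, d_new, J_new = analyze(pi_new)

# Lemma 2.1: J(pi) - J(pi_k) = 1/(1-gamma) E_{s~d_pi, a~pi}[A^{pi_k}(s,a)]
rhs = np.einsum('s,sa,sa->', d_new, pi_new, A_k) / (1 - gamma)
print(J_new - J_k, rhs)   # the two numbers agree up to float error
```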

3. Performance Improvement Bounds for Single-Policy Sampling

3.1 Distribution Mismatch and Controlling State Shift

The Policy Performance Difference Lemma has a practical issue: the expectation on the right-hand side is computed under $d_\pi$ (the new policy’s state distribution), while we can only sample from $d_{\pi_k}$ (the old policy).

The solution is to decompose the expectation into “expectation under the old distribution + bias term,” then control the bias. The key question is: What is the quantitative relationship between the difference in state distributions and the difference in policies?

Controlling State Distribution Differences

Lemma 3.1 (Relationship Between State Distribution Difference and Policy TV Distance)

$$ \|d_\pi - d_{\pi_k}\|_1 \leq \frac{2\gamma}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$

This is an average-divergence / average-TV style bound. It is closer in spirit to CPO / Achiam et al. (2017) than to the classic TRPO presentation based on $\max_s D_{\mathrm{TV}}$. I use it here because it lands more naturally on sample averages and is easier to carry into the multi-source setting. Appendix A includes a short proof sketch for how the state-distribution difference is reduced to average TV.

Physical Interpretation

Small differences in policies in action space are “amplified” through environment dynamics into differences in state visitation distributions. The coefficient $\frac{2\gamma}{1-\gamma}$ reflects the temporal accumulation effect—in long-horizon tasks ($\gamma$ close to 1), the amplification is stronger.

Proof Sketch

By deriving the fixed-point equation for discounted visitation distributions and exploiting the $\ell_1$ non-expansiveness of stochastic matrices, one can show that state-distribution differences are amplified by policy differences through transition dynamics, yielding the bound in Lemma 3.1. Since the main point of this post is not to re-derive that chain in full, I keep only the idea here and move the proof sketch to Appendix A.

3.2 Policy Performance Improvement Lower Bound

Theorem 3.2 (Policy Performance Improvement Lower Bound)

Define the expected advantage upper bound coefficient $C_{\pi,\pi_k} := \max_{s} \lvert \mathbb{E}_{a \sim \pi}[A^{\pi_k}(s,a)] \rvert$. Then:

$$ J(\pi) - J(\pi_k) \geq L_{\pi_k}(\pi) - \frac{2\gamma C_{\pi,\pi_k}}{(1-\gamma)^2} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$

where the surrogate objective is:

$$ L_{\pi_k}(\pi) := \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}, a \sim \pi_k} \left[ \frac{\pi(a \mid s)}{\pi_k(a \mid s)} A^{\pi_k}(s,a) \right] $$

Here $L_{\pi_k}(\pi)$ omits the additive constant $J(\pi_k)$, so in textbook TRPO notation one could also write the objective as $J(\pi_k)+L_{\pi_k}(\pi)$. Also note that $C_{\pi,\pi_k}$ depends on the new policy $\pi$, so it should be read as a structural coefficient in the bound rather than a tunable hyperparameter.

This lower bound consists of two parts:

  1. Surrogate objective $L_{\pi_k}(\pi)$: Can be directly estimated from old policy data via importance sampling; this is the optimization objective of TRPO/PPO.

  2. Policy shift penalty: Increases with the TV distance between new and old policies, explaining why PPO needs to constrain update magnitude.

Core conclusion: this is an explicit improvement lower bound. In particular, when the right-hand side is positive, improvement is guaranteed.
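That the surrogate $L_{\pi_k}(\pi)$ is estimable from old-policy samples is worth seeing once concretely. A minimal single-state (bandit) sketch, with made-up probabilities and advantages:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_k   = np.array([0.2, 0.2, 0.2, 0.2, 0.2])    # behavior (old) policy
pi_new = np.array([0.4, 0.3, 0.1, 0.1, 0.1])    # candidate new policy
A      = np.array([1.0, -0.5, 0.2, 0.0, -1.0])  # A^{pi_k} at this state

# Exact value: E_{a~pi_k}[(pi/pi_k) A] = E_{a~pi_new}[A]
exact = pi_new @ A

# Importance-sampling estimate using only actions drawn from pi_k
a = rng.choice(5, size=200_000, p=pi_k)
est = np.mean((pi_new[a] / pi_k[a]) * A[a])
print(exact, est)   # the estimate converges to the exact value
```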

4. Multi-Policy Static Mixture Sampling

4.1 Setup and Unified Modeling (Static Mixture)

In practice, a batch of data may come from multiple policy versions $\{\pi^{(1)}, \ldots, \pi^{(M)}\}$, with respective proportions $\alpha_1, \ldots, \alpha_M$. How do we extend Theorem 3.2 to this setting?

Core idea: augmented state space

The solution is an elegant modeling technique: treat the policy version index as part of the state.

Define the augmented state space $\tilde{\mathcal{S}} := \mathcal{S} \times \mathcal{I}$, where $\mathcal{I} = \{1, \ldots, M\}$ is the policy index set. Under augmented state $(s, i)$, the mixture behavior policy is defined as $\beta(a \mid s, i) := \pi^{(i)}(a \mid s)$.

The evolution of indices is characterized by the index transition kernel $q(i' \mid i)$. The augmented MDP inherits the original MDP’s rewards and environment transitions, with indices evolving independently according to $q(i'\mid i)$.

This technique works because the new policy $\pi$ has the same return in the augmented MDP as in the original one, so Theorem 3.2 can be applied directly.

4.2 Trajectory-Level Mixture: Simplification and Improvement Bound

The most common scenario is using a single old policy per trajectory: at trajectory start, sample index $I_0 \sim \alpha$, and use $\pi^{(I_0)}$ throughout. In this case, the index transition kernel is the identity: $q(i' \mid i) = \mathbf{1}_{i'=i}$.

From an engineering perspective, many asynchronous actor-learner systems organize data so that an entire trajectory belongs to a particular policy snapshot, and the learner then mixes trajectories produced by different versions. That setup approximately corresponds to what I call trajectory-level mixture. The word “approximately” matters because different systems need not agree on the exact boundary of a trajectory or sampling unit.

Lemma 4.1 (Structural Simplification for Trajectory-Level Mixture)

(a) The augmented state visitation distribution decomposes as: $d_{\beta}(s, i) = \alpha_i \cdot d_{\pi^{(i)}}(s)$

(b) The advantage function reduces to: $A^{\beta}((s, i), a) = A^{\pi^{(i)}}(s, a)$

Intuition for (b): Since the index never changes, all future trajectories starting from augmented state $(s,i)$ are generated by the same policy $\pi^{(i)}$. Therefore, future cumulative returns are entirely determined by $\pi^{(i)}$, and value functions and advantage functions naturally reduce to their $\pi^{(i)}$ counterparts.

Consequently, the mixture policy’s return is the weighted average of individual old policies’ returns: $J_{\mathrm{mix}} = \sum_{i=1}^{M} \alpha_i J(\pi^{(i)})$.

Improvement bound

Corollary 4.2 (Performance Improvement Lower Bound for Trajectory-Level Mixture)

$$ J(\pi) - \sum_{i=1}^{M} \alpha_i J(\pi^{(i)}) \geq \sum_{i=1}^{M} \alpha_i L_{\pi^{(i)}}(\pi) - \frac{2\gamma \max_i C_{\pi, \pi^{(i)}}}{(1-\gamma)^2} \sum_{i=1}^{M} \alpha_i \mathbb{E}_{s \sim d_{\pi^{(i)}}} \big[ D_{\mathrm{TV}}(\pi, \pi^{(i)}; s) \big] $$

This result shows that when mixing trajectories from multiple old policy versions for training, if we construct the loss using importance ratios corresponding to each trajectory’s source policy while controlling the new policy’s deviation from each old policy, the new policy’s performance has a clear improvement lower bound.

Using $\max_i C_{\pi,\pi^{(i)}}$ is just a compact way to write the result. A slightly tighter version would keep a separate $C_i$ for each component and then average them with the mixture weights.

5. Dynamic Mixture Sampling and Monotonic Improvement Conditions

5.1 Problem and Unified Modeling (Dynamic Mixture)

Section 4 discussed static mixture—where mixture weights $\alpha_i$ remain fixed. This section considers the more general dynamic mixture—where sampling gradually transitions to the new policy after it is released.

The previous results characterize improvement of “the new policy relative to the mixture behavior policy.” However, in actual training, what we truly care about is: Does the latest policy $\pi_{k+1}$ after each update monotonically improve over the previous latest policy $\pi_k$?

$$ J(\pi_{k+1}) \geq J(\pi_k) $$

Unified Modeling Framework

Two typical forms of dynamic mixture sampling can be uniformly characterized by the index transition kernel $q(i'\mid i)$:

Trajectory-level mixture (can be viewed as an abstraction of conventional asynchronous training; identity index transition): $q(i'\mid i) = \mathbf{1}\{i'=i\}$

Step/segment-level mixture (an abstraction of partial rollout, or segment-based sampling; allows switching): $q(i'\mid i) = (1-\sigma(i))\mathbf{1}\{i'=i\} + \sigma(i)\kappa(i'\mid i)$

where $\sigma(i)$ is the switching probability and $\kappa(\cdot\mid i)$ is the target index distribution.
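A minimal sketch of drawing from this kernel; the `sigma` and `kappa` callables below are illustrative placeholders, not a fixed interface:

```python
import numpy as np

def next_index(i, sigma, kappa, rng):
    """One draw from q(.|i) = (1 - sigma(i)) 1{i'=i} + sigma(i) kappa(.|i)."""
    if rng.random() < sigma(i):
        return kappa(i, rng)
    return i

rng = np.random.default_rng(0)
# Trajectory-level mixture: sigma = 0, so the index never moves.
assert all(next_index(3, lambda i: 0.0, None, rng) == 3 for _ in range(100))
# Step/segment-level mixture: switch w.p. 0.3, here always to the latest version M.
M = 7
samples = [next_index(3, lambda i: 0.3, lambda i, r: M, rng) for _ in range(10_000)]
frac = np.mean([s == M for s in samples])
print(frac)   # ≈ 0.3
```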

5.2 Decomposition and Monotonic Improvement Bound

By introducing the mixture return $J_{\mathrm{mix}}^{(k)}$ as an intermediate bridge, the performance difference decomposes as:

$$ J(\pi_{k+1}) - J(\pi_k) = \underbrace{[J(\pi_{k+1}) - J_{\mathrm{mix}}^{(k)}]}_{\text{improvement over mixture policy}} + \underbrace{[J_{\mathrm{mix}}^{(k)} - J(\pi_k)]}_{\text{mixture bias term}} $$

The first term is handled using Theorem 3.2. The second is the mixture bias term. The way to handle it is to expand $J_{\mathrm{mix}}^{(k)} - J(\pi_k)$ into a weighted sum of $J(\pi^{(i)}) - J(\pi_k)$ terms, apply a TV-based two-policy lower bound to each component, and then collect everything using $\|A^{\pi_k}\|_\infty$. This gives:

$$ J_{\mathrm{mix}}^{(k)} - J(\pi_k) \geq -\frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big] $$

Monotonic Improvement Bound

Combining the above results yields the core theorem:

Theorem 5.1 (Monotonic Improvement Lower Bound Under Dynamic Mixture Sampling)

$$ \begin{aligned} J(\pi_{k+1}) - J(\pi_k) \geq\;& L_{\beta^{(k)}}(\pi_{k+1}) \\ &- \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \big] \\ &- \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big] \end{aligned} $$

Here $L_{\beta^{(k)}}(\pi_{k+1})$ denotes the surrogate objective relative to the behavior policy $\beta^{(k)}$ (the same shape as $L_{\pi_k}(\pi)$ in Section 3, but with the behavior policy generalized from a single $\pi_k$ to the mixture $\beta^{(k)}$).

More explicitly, one can write

$$ L_{\beta^{(k)}}(\pi_{k+1}) := \frac{1}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}},\, a\sim \pi^{(i)}(\cdot\mid s)}\left[\frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}\,A^{\beta^{(k)}}((s,i),a)\right]. $$

Similarly, define

$$ C_{\pi_{k+1},\beta^{(k)}} := \max_{(s,i)}\left|\mathbb{E}_{a\sim \pi_{k+1}(\cdot\mid s)}\big[A^{\beta^{(k)}}((s,i),a)\big]\right|. $$

This lower bound contains two penalties, so it naturally points to two separate control problems:

  • Update shift penalty: Deviation of the new policy $\pi_{k+1}$ from the sampling source policy $\pi^{(i)}$
  • Sampling staleness penalty: Staleness of the sampling source policy $\pi^{(i)}$ relative to the current policy $\pi_k$

5.3 Why Direct Constraints Are Infeasible: Triangle Inequality Decomposition

Before going further, one important qualifier: the infeasibility discussed here concerns an interpretation in which we try to impose the same hard trust-region constraint against every historical source policy. That is not the same as saying PPO-Clip itself explicitly implements such a constraint.

The update-shift penalty in Theorem 5.1 may look controllable through constraints on $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s)$, but that interpretation quickly runs into a practical infeasibility:

Observation 5.2 (Infeasibility of a Uniform Hard Trust Region)

Suppose the mixture sampling includes two old policies $\pi^{(1)}$ and $\pi^{(2)}$. If there exists some state $s$ such that $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) > 2\delta$, then no policy $\pi_{k+1}$ can simultaneously satisfy $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(1)}; s) \leq \delta$ and $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(2)}; s) \leq \delta$.

Proof

By the triangle inequality, if both constraints were satisfied, then $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) \leq 2\delta$, a contradiction.

Root Cause

The update shift penalty directly couples $\pi_{k+1}$ with the historical policy family $\{\pi^{(i)}\}$, whose internal structure is a product of historical training and not controllable by the current update.

Triangle Inequality Decomposition

The solution leverages the triangle inequality of TV distance:

$$ D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \leq D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s) + D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s) $$

This decomposes the coupled constraint into two independent parts:

  • Update increment shift $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)$: Deviation of the new policy from the current policy, controllable by the optimization side
  • Sampling staleness $D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)$: Deviation of the current policy from each old policy, must be controlled by the sampling side

Define:

$$ U_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)\big], \quad S_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)\big] $$

Corollary 5.3 (Decomposed Monotonic Improvement Lower Bound)

$$ J(\pi_{k+1}) - J(\pi_k) \geq L_{\beta^{(k)}}(\pi_{k+1}) - \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} U_k - \left( \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \right) S_k $$

Why Does Decomposition Solve the Problem?

The key is that after decomposition, $U_k$ only involves the new policy $\pi_{k+1}$ and the current policy $\pi_k$, completely independent of the structure of the old policy family $\{\pi^{(i)}\}$. So no matter how far apart the old policies are, constraining $U_k$ remains feasible. This is exactly how the infeasibility in Observation 5.2 is avoided.

Operationally, this leads to a simple separation of concerns:

| Control Term | Responsible Party | Control Mechanism |
|---|---|---|
| $U_k$ (update increment shift) | Optimization algorithm | Policy clipping |
| $S_k$ (sampling staleness) | Sampling system | Data filtering, version window |

6. Theoretical Foundations of Clipping Mechanisms

6.1 From TV Distance to Sample-Controllable Quantities

Corollary 5.3 tells us that to guarantee monotonic improvement, we need to control the update increment shift $U_k = \mathbb{E}[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)]$. But TV distance is a distribution-level quantity, so how do we control it with samples?

The bridge from theory to samples is the following identity:

Lemma 6.1 (Ratio Difference Representation of TV Distance)

Suppose policy $\pi_1$’s support covers the supports of $\pi$ and $\pi_2$. Then for any state distribution $\mu$:

$$ \mathbb{E}_{s\sim \mu} \big[D_{\mathrm{TV}}(\pi, \pi_2; s)\big] = \frac{1}{2} \mathbb{E}_{s\sim \mu, a\sim\pi_1(\cdot\mid s)} \left| \frac{\pi(a\mid s)}{\pi_1(a\mid s)} - \frac{\pi_2(a\mid s)}{\pi_1(a\mid s)} \right| $$

Note the support-coverage requirement: the behavior policy used in the denominator must assign nonzero mass to the actions that appear in training. For LLMs, this means hard top-k / top-p truncation without smoothing can make some ratios undefined. Section 8 returns to that issue.

Intuitive Understanding

The left side is the TV distance between two distributions (requiring enumeration over all actions), while the right side is the absolute difference of two importance ratios when sampling under $\pi_1$. This enables us to estimate and control TV distance using samples.
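The identity is purely algebraic, so it can be checked exactly at a single state with made-up categorical policies:

```python
import numpy as np

pi  = np.array([0.5, 0.3, 0.2])   # new policy at state s
pi2 = np.array([0.2, 0.5, 0.3])   # second policy at state s
pi1 = np.array([0.4, 0.4, 0.2])   # behavior policy; full support required

lhs = 0.5 * np.abs(pi - pi2).sum()                      # D_TV(pi, pi2; s)
# RHS of Lemma 6.1, with the expectation over a ~ pi1 written as an exact sum
rhs = 0.5 * (pi1 * np.abs(pi / pi1 - pi2 / pi1)).sum()
print(lhs, rhs)   # identical
```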

Sample Representation of $U_k$

Using Lemma 6.1, setting $\pi = \pi_{k+1}$, $\pi_2 = \pi_k$, $\pi_1 = \pi^{(i)}$ (the sampling-source policy), we obtain:

$$ U_k = \frac{1}{2} \mathbb{E}_{(s,i) \sim d_{\beta^{(k)}}, a \sim \pi^{(i)}(\cdot\mid s)} \left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right| $$

Denoting $\rho_{k+1} := \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}$ and $\rho_k := \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)}$, we have:

$$ U_k = \frac{1}{2} \mathbb{E}_{(s,i,a) \sim \text{training data}} \big| \rho_{k+1} - \rho_k \big| $$

This means: If we can ensure $\lvert\rho_{k+1} - \rho_k\rvert \leq \epsilon$ for each sample, we can guarantee $U_k \leq \epsilon/2$.
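In code, this sample representation needs nothing beyond logged log-probabilities. A minimal sketch (the argument names are mine, not a fixed API):

```python
import numpy as np

def estimate_U(logp_new, logp_cur, logp_beh):
    """U_k estimate: 0.5 * mean |rho_{k+1} - rho_k|, from per-sample log-probs
    under pi_{k+1}, pi_k, and the sampling-source policy pi^{(i)}."""
    rho_new = np.exp(np.asarray(logp_new) - np.asarray(logp_beh))
    rho_cur = np.exp(np.asarray(logp_cur) - np.asarray(logp_beh))
    return 0.5 * np.mean(np.abs(rho_new - rho_cur))

# Sanity check: if pi_{k+1} = pi_k on every sample, the estimate is exactly 0.
lp = np.log([0.4, 0.1, 0.25])
print(estimate_U(lp, lp, np.log([0.3, 0.2, 0.25])))   # 0.0
```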

6.2 Constraining $U_k$: Two Clipping Options

Method 1: Direct Constraint on Ratio Difference

For each sample $(s, i, a)$, require:

$$ \left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right| \leq \epsilon $$

The clipping interval is $\left[\frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} - \epsilon, \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} + \epsilon\right]$, with clipping center at $\rho_k$ rather than 1.

Method 2: Constraint on Incremental Ratio

Noting that $\rho_{k+1} - \rho_k = \rho_k \cdot \left(\frac{\pi_{k+1}}{\pi_k} - 1\right)$, we have:

$$ |\rho_{k+1} - \rho_k| = \rho_k \cdot \left|\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right| $$

If we constrain $\left\lvert\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right\rvert \leq \epsilon$, then

$$ |\rho_{k+1} - \rho_k| \leq \epsilon\,\rho_k. $$

Taking expectations gives $\mathbb{E}[|\rho_{k+1} - \rho_k|] \leq \epsilon\,\mathbb{E}_{a\sim\pi^{(i)}}[\rho_k] = \epsilon$, hence $U_k \leq \epsilon/2$.

This method clips $\pi_{k+1}/\pi_k$ with center at 1, meaning the clipping constraint itself does not depend on the old policy family $\pi^{(i)}$. However, if we use the weighted advantage $\hat{A}=\rho_k\cdot A^{\beta^{(k)}}$ below, we still need per-sample behavior probabilities (or recorded logprobs) to compute $\rho_k$.

Before writing down the clipped objectives, one caveat is worth making explicit: the two inequalities above are hard per-sample constraints at the theory level. The clipped surrogates below are practical approximations meant to keep $U_k$ in a manageable range; they are not literal guarantees that every optimization step exactly satisfies the hard constraint.

Objective Functions (Three Clipping Mechanisms)

For comparison, we present the complete objective functions for three clipping mechanisms. Suppose the current sample comes from old policy $\pi^{(i)}$, and denote:

  • $\rho_{k+1} = \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}$ (new policy’s ratio relative to sampling policy)
  • $\rho_k = \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)}$ (current policy’s ratio relative to sampling policy)
  • $r = \frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)}$ (new policy’s incremental ratio relative to current policy)

Note: under trajectory-level mixture (index fixed), $A^{\beta^{(k)}}((s,i),a)=A^{\pi^{(i)}}(s,a)$, so per-trajectory advantages from the corresponding old policy are consistent; under step/segment-level mixture, replacing $A^{\beta^{(k)}}$ with $A^{\pi^{(i)}}$ introduces advantage-substitution bias (discussed in Section 7), so the advantage/value estimator must reflect future index switching.

Standard PPO (Trajectory-Level Mixture)

Clip $\rho_{k+1}$ with center at 1

$$ L^{\mathrm{PPO}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\pi^{(i)}}, \; \mathrm{clip}(\rho_{k+1}, 1-\epsilon, 1+\epsilon) \cdot A^{\pi^{(i)}} \right) \right] $$

Method 1

Clip $\rho_{k+1}$ with center at $\rho_k$

$$ L^{\mathrm{M1}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\beta^{(k)}}, \; \mathrm{clip}(\rho_{k+1}, \rho_k-\epsilon, \rho_k+\epsilon) \cdot A^{\beta^{(k)}} \right) \right] $$

Method 2

Clip incremental ratio $r$ with center at 1

$$ L^{\mathrm{M2}} = \mathbb{E} \left[ \min\left( r \cdot \hat{A}, \; \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot \hat{A} \right) \right] $$

where $\hat{A} = \rho_k \cdot A^{\beta^{(k)}}$ is the importance-weighted advantage estimate.
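For reference, the three per-sample clipped terms can be put side by side in a minimal NumPy sketch (scalar or array inputs; these are the objective integrands, to be averaged and then maximized):

```python
import numpy as np

def ppo_clip(rho_new, A, eps):
    """Standard PPO: clip rho_{k+1} around 1."""
    return np.minimum(rho_new * A, np.clip(rho_new, 1 - eps, 1 + eps) * A)

def m1_clip(rho_new, rho_cur, A, eps):
    """Method 1: clip rho_{k+1} around the adaptive center rho_k."""
    return np.minimum(rho_new * A,
                      np.clip(rho_new, rho_cur - eps, rho_cur + eps) * A)

def m2_clip(r, rho_cur, A, eps):
    """Method 2: clip the incremental ratio r = pi_{k+1}/pi_k around 1,
    using the importance-weighted advantage A_hat = rho_k * A."""
    A_hat = rho_cur * A
    return np.minimum(r * A_hat, np.clip(r, 1 - eps, 1 + eps) * A_hat)
```

For example, with $\rho_{k+1}=1.5$, $\rho_k=1.4$, $A=1$, $\epsilon=0.2$: standard PPO clips the term down to $1.2$, while Methods 1 and 2 leave it at $1.5$ because the increment over $\rho_k$ is small.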

6.3 Comparison and Practical Controls

Table 6.1 Comparison of Three Clipping Mechanisms

| Method | Clipped Variable | Clipping Center | Clipping Interval | More Natural Shift Object |
|---|---|---|---|---|
| Standard PPO | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | New policy relative to $\pi^{(i)}$ |
| Method 1 | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $\rho_k = \pi_k/\pi^{(i)}$ | $[\rho_k-\epsilon, \rho_k+\epsilon]$ | New policy relative to $\pi_k$ |
| Method 2 | $r = \pi_{k+1}/\pi_k$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | New policy relative to $\pi_k$ |

For the hard per-sample constraints discussed earlier, Methods 1 and 2 do directly control $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$. For the clipped surrogates used in practice, however, the more accurate reading is that they exert optimization pressure on different shift objects rather than explicitly imposing a TV constraint.

The Fundamental Problem with Standard PPO Under Multi-Policy Mixture

If we carry over a single-source trust-region intuition, standard PPO’s clipped objective is most naturally read as suppressing further deviation of the new policy from each sampling-source policy $\pi^{(i)}$. But PPO-Clip does not explicitly impose TV / KL constraints; rather, clipping removes gains from moving farther away from the behavior policy. When the old policies $\pi^{(1)}, \pi^{(2)}, \ldots$ differ substantially, that optimization pressure is easily dominated by the stalest sources.

Common Advantages of Methods 1 and 2

For the hard-constraint versions introduced earlier, Methods 1 and 2 directly control $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$. For the clipped-surrogate versions used in practice, the more accurate statement is that they redirect optimization pressure from “stay close to every behavior policy at once” toward “control the update around the current policy $\pi_k$.” Since $\pi_k$ is unique, that target is shared across all sample sources and avoids the structural difficulty behind the infeasibility result.

Method 1 vs Method 2

| Comparison Dimension | Method 1 (Adaptive Clipping) | Method 2 (Incremental Clipping) |
|---|---|---|
| Stale samples ($\rho_k \gg 1$) | Automatically tightens constraints, more conservative | May produce large gradient variance |
| LLM large-vocabulary low-probability tokens | Allows larger absolute changes (additive) | Absolute changes are limited (multiplicative) |
| Implementation complexity | Requires storing $\pi^{(i)}(a\mid s)$ and $\pi_k(a\mid s)$ | Needs $\pi_k(a\mid s)$ and $\pi^{(i)}(a\mid s)$ (or stored logprobs) to compute $\rho_k$; clipping itself uses only $\pi_{k+1}/\pi_k$ |
| Advantage function | Uses $A^{\beta^{(k)}}$ | Uses weighted advantage $\rho_k \cdot A^{\beta^{(k)}}$ |

Detailed Explanations

(1) Handling Stale Samples

When samples come from very old policies, $\rho_k = \pi_k/\pi^{(i)}$ can be large.

  • Method 2’s integrand is $\rho_k \cdot \lvert r - 1\rvert$; even if $\lvert r-1\rvert \leq \epsilon$, the integrand can reach $\epsilon \cdot \rho_k$, producing spikes.
  • Method 1 directly constrains $\lvert\rho_{k+1} - \rho_k\rvert \leq \epsilon$; the integrand’s upper bound is always $\epsilon$, unaffected by $\rho_k$ amplification.

(2) LLM Large Vocabulary Issue

In a large language model’s vocabulary, many tokens carry very small probabilities.

  • Method 2 constrains $\pi_{k+1} \in [(1-\epsilon)\pi_k, (1+\epsilon)\pi_k]$, which is a multiplicative constraint: if $\pi_k(a\mid s) = 10^{-6}$, the allowed absolute change is only $\epsilon \times 10^{-6}$.
  • Method 1 constrains $\lvert\pi_{k+1} - \pi_k\rvert \leq \epsilon \cdot \pi^{(i)}$, which is an additive constraint: if that token has higher probability under the old policy (e.g., $\pi^{(i)}(a\mid s) = 0.1$), faster improvement is allowed even when the current probability is very low, provided the token still has enough observable mass under the sampling policy.
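A two-line numeric comparison makes the difference concrete (the probabilities here are made up):

```python
# Upper end of the allowed probability for a low-probability token, eps = 0.2:
pi_cur, pi_beh, eps = 1e-6, 0.1, 0.2

m2_cap = (1 + eps) * pi_cur        # Method 2 (multiplicative): ≈ 1.2e-06
m1_cap = pi_cur + eps * pi_beh     # Method 1 (additive):       ≈ 0.02
print(m2_cap, m1_cap)
```

Under Method 2 the token can barely move; under Method 1 it can recover quickly because the cap is set by the sampling policy’s mass, not the current policy’s.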

Controlling Sampling Staleness

Corollary 5.3 shows that $S_k$ also enters the monotonic-improvement lower bound, but it cannot be controlled from the optimization side. It has to be handled by the sampling system:

(1) Discarding Stale Data

Set a threshold $\epsilon_{\mathrm{stale}}$. For each sample, compute $\lvert\rho_k - 1\rvert = \lvert\pi_k(a\mid s)/\pi^{(i)}(a\mid s) - 1\rvert$, and discard samples exceeding the threshold.

(2) Controlling Policy Version Window

Limit the number of old policy versions in the mixture sampling, e.g., using only data from the most recent $W$ versions.
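Both controls can be sketched as a single boolean mask over a logged batch. The field names and thresholds below are hypothetical, not a reference implementation:

```python
import numpy as np

def keep_mask(logp_cur, logp_beh, version, latest, eps_stale=0.5, window=4):
    """Keep samples whose staleness ratio |rho_k - 1| is below a threshold
    AND whose behavior-policy version lies inside the recent-W window."""
    rho_k = np.exp(np.asarray(logp_cur) - np.asarray(logp_beh))
    fresh_ratio = np.abs(rho_k - 1) <= eps_stale
    fresh_version = (latest - np.asarray(version)) < window
    return fresh_ratio & fresh_version

# Sample 0: mildly stale (rho_k = 1.2, version lag 1) -> kept.
# Sample 1: very stale (rho_k = 6.0, version lag 6)  -> dropped.
mask = keep_mask(np.log([0.3, 0.3]), np.log([0.25, 0.05]),
                 np.array([9, 4]), latest=10)
print(mask)   # [ True False]
```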

Operational Meaning of Clipping

Finally, we clarify the relationship between clipping and the theoretical lower bound.

In Corollary 5.3, the coefficient of $U_k$, namely $C_{\pi_{k+1},\beta^{(k)}}$, depends on the new policy $\pi_{k+1}$, so the penalty term cannot simply be replaced by a fixed constant. Operationally, the clipped objective should be read as an approximation to a constrained update, not as a literal theorem with a hand-picked scalar penalty:

Maximize the surrogate objective $L_{\beta^{(k)}}(\pi_{k+1})$ subject to the constraint $U_k \leq \epsilon/2$

The clipping objective can be read as a practical approximation to this constrained optimization: clipping keeps the update magnitude in check so that $U_k$ stays manageable, and then gradient ascent pushes up the surrogate objective within that approximation.

Section Summary

This section established the theoretical foundations of clipping mechanisms:

  1. Lemma 6.1 converts TV distance to sample-level ratio differences, serving as the bridge between theory and implementation
  2. Two constraint methods: the hard-constraint versions of Method 1 (adaptive clipping center) and Method 2 (fixed incremental clipping) both imply $U_k \leq \epsilon/2$; the clipped surrogates used in practice are approximations to this idea
  3. Comparison with standard PPO: under a single-source trust-region intuition, standard PPO applies optimization pressure around the new-vs-behavior shift; Methods 1/2 redirect that pressure to the current policy $\pi_k$, which avoids the structural difficulty caused by multiple behavior sources
  4. Method selection: Method 1 (adaptive) is recommended for high staleness or LLM large vocabulary scenarios; Method 2 (incremental) is attractive when you want the clipping center to avoid depending on the old policy family (but data still needs to provide behavior logprobs to compute $\rho_k$)
  5. $S_k$ control is the sampling side’s responsibility, implemented through data filtering and version windows
  6. Clipping is constrained optimization: Maximize the surrogate objective subject to $U_k$ constraints

7. Comparison of Trajectory-Level and Step/Segment-Level Mixture

7.1 Mechanism Differences and Estimation Implications

The essential difference between the two mixture mechanisms lies in the structure of the index transition kernel:

  • Trajectory-level mixture: $q(i'\mid i) = \mathbf{1}\{i'=i\}$, index never changes
  • Step/segment-level mixture: $\sigma(i) > 0$, allows within-trajectory switching

The correspondence with common engineering terminology is:

  • Trajectory-level mixture here can be roughly understood as an idealized abstraction of “conventional asynchronous training”: data is organized by entire trajectories/episodes belonging to a certain policy version;
  • Step/segment-level mixture here can be roughly understood as an abstraction of “partial rollout”: due to asynchrony between actors and learners, and possible refresh to new policy versions at segment boundaries, using an index transition kernel that allows “within-trajectory version switching” can better approximate this phenomenon. APRIL is a representative systems example of this design pattern, but its main contribution is reducing long-tail rollout bottlenecks rather than providing the monotonic-improvement theory developed here.

The key dividing line is whether Lemma 4.1’s structural simplification holds: trajectory-level mixture satisfies the advantage reduction, while step/segment-level mixture generally does not because future returns depend on the index transition kernel.

Differences in Sampling Staleness $S_k$

Trajectory-level mixture’s staleness arises because the mixture weights $\alpha_i^{(k)}$ retain probability mass on old policies after a new policy is released.

Step/segment-level mixture has an exponential compression effect in a simplified model: suppose that once the index switches from an old version to the new version it never switches back, and that the switch happens with probability $\sigma$ at each step. Then the marginal mass on the old index under the discounted visitation distribution is

$$ (1-\gamma)\sum_{t\ge 0}[\gamma(1-\sigma)]^t = \frac{1-\gamma}{1-\gamma(1-\sigma)}. $$

As long as $\sigma \gg 1-\gamma$, the old-policy mass is significantly compressed.
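A quick numerical check of this compression effect, comparing the closed form against the truncated geometric series (all names illustrative):

```python
def old_index_mass(gamma, sigma, n_terms=10_000):
    """Discounted visitation mass on the old index in the simplified
    one-way-switching model: (1-gamma) * sum_t [gamma*(1-sigma)]^t."""
    closed = (1 - gamma) / (1 - gamma * (1 - sigma))
    series = (1 - gamma) * sum((gamma * (1 - sigma)) ** t for t in range(n_terms))
    return closed, series

# gamma = 0.99 gives 1 - gamma = 0.01; sigma = 0.1 >> 0.01
closed, series = old_index_mass(gamma=0.99, sigma=0.1)
```

With these numbers the old-policy mass drops to roughly 0.09, an order of magnitude below the mass it would keep without within-trajectory switching.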

Differences in Surrogate Objective Estimation

Trajectory-level mixture: The advantage function reduces to $A^{\pi^{(i)}}(s,a)$, with a clear estimation path.

Advantage substitution bias in step/segment-level mixture: If single-policy advantage estimates are used, systematic bias will arise. The reason is that $A^{\beta^{(k)}}((s,i),a)$ requires taking expectations over future index switching, while $A^{\pi^{(i)}}(s,a)$ implicitly assumes “the future always follows $\pi^{(i)}$.”

Unification Under Bandit Setting

In single-step-episode LLM training there are no subsequent state transitions, so the estimation problems of the two mechanisms coincide and this bias disappears.

7.2 Risks and Applicable Scenarios

Step/segment-level mixture has another hidden concern: even if single-step importance ratios are clipped, multi-step noise accumulation over long trajectories can still amplify gradient estimation variance. When policy changes per update are large, “behavioral discontinuities” within trajectories may induce heavier-tailed ratio distributions. This is also why Table 7.1 recommends trajectory-level mixture for scenarios with large policy change per update.

Applicable Scenarios

Table 7.1 Applicable Scenarios for Two Mixture Mechanisms

| Scenario Characteristics | Recommended Mechanism | Rationale |
| --- | --- | --- |
| Long trajectories, high-frequency updates, strong asynchrony | Step/segment-level | Can significantly compress $S_k$ |
| Short trajectories (non-bandit) | Trajectory-level | $S_k$ is naturally low |
| Large policy change per update | Trajectory-level | Avoids variance amplification |
| Single-step episode (bandit) | Either | Choose based on implementation convenience |
| Need for compromise | Segment-level | Switch at natural boundaries |

Core trade-off: Step/segment-level mixture is stronger on the sampling side (fast staleness removal), while trajectory-level mixture is more stable on the estimation side (easier surrogate objective estimation).

8. Handling Training-Inference Inconsistency

8.1 Background and Effective Staleness

In large-scale distributed training, policies on the inference side and training side may be inconsistent:

  • Numerical implementation differences: softmax normalization, quantization, kernel fusion
  • Decoding rule differences: temperature scaling, top-p/top-k sampling

Let the behavior policy modeled on the training side be $\pi^{(i)}$, while the policy actually sampling on the inference side is $\hat{\pi}^{(i)}$.

The mismatch discussed here is between the behavior policy and the current training policy. It is distinct from the policy-vs-reference-model KL regularization that is common in RLHF; that is a different regularization axis.

Effective Staleness

Define effective staleness:

$$ \hat{S}_k := \mathbb{E}_{(s,i) \sim d_{\hat{\beta}^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_k, \hat{\pi}^{(i)}; s) \big] $$

This definition simultaneously covers version staleness and training-inference implementation differences.

8.2 Actionable Control

By Lemma 6.1, $\hat{S}_k$ can be written in a sample-computable form. Given threshold $\epsilon_{\mathrm{stale}}$, if training only uses samples satisfying $\lvert\pi_k(a\mid s)/\hat{\pi}^{(i)}(a\mid s) - 1\rvert \leq \epsilon_{\mathrm{stale}}$, then the effective staleness on the conditional distribution of retained samples (which we may denote by $\hat{S}_k^{\mathrm{eff}}$) can be controlled to at most $\epsilon_{\mathrm{stale}}/2$. This controls the filtered training distribution, not the original sampling distribution’s $\hat{S}_k$.
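A minimal sketch of this filtering rule, assuming (per Lemma 6.1’s sample-computable form) that staleness is estimated as half the mean absolute ratio deviation; function and variable names are illustrative:

```python
import numpy as np

def filter_stale(ratios, eps_stale=0.1):
    """Keep samples with |pi_k/pi_hat - 1| <= eps_stale and report the
    effective staleness of the retained (conditional) distribution."""
    ratios = np.asarray(ratios, dtype=float)
    keep = np.abs(ratios - 1.0) <= eps_stale
    # On the retained samples, 0.5 * E|ratio - 1| <= eps_stale / 2 by construction
    s_eff = 0.5 * np.mean(np.abs(ratios[keep] - 1.0)) if keep.any() else 0.0
    return keep, s_eff
```

Note that this bounds the staleness of the filtered training distribution only; the original sampling distribution’s $\hat{S}_k$ is unchanged, as stated above.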

Key Implementation Points

  1. Behavior denominator alignment: The behavior probability in the loss should use the inference-side recorded $\hat{\pi}^{(i)}(a\mid s)$
  2. Probability smoothing: If the inference side has truncation (e.g., top-k), ensure ratios are valid and the support-coverage condition required by Lemma 6.1 still holds

9. Practical Guidelines

Core Theoretical Framework

The structure of the monotonic improvement lower bound is:

$$ J(\pi_{k+1}) - J(\pi_k) \geq \underbrace{L_{\beta^{(k)}}(\pi_{k+1})}_{\text{surrogate objective}} - \underbrace{C_1 \cdot U_k}_{\text{update shift penalty}} - \underbrace{C_2 \cdot S_k}_{\text{sampling staleness penalty}} $$

Here $C_1$ and $C_2$ are just compressed notation for the theoretical coefficients introduced earlier. Concretely, one can read them as

$$ C_1 = \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2}, \qquad C_2 = C_1 + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma}, $$

up to the exact naming convention used in the intermediate statements. They are not free hyperparameters.
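Evaluating the bound from these coefficients is mechanical; a small sketch (all input values are placeholders, not recommended settings):

```python
def improvement_lower_bound(L_surr, U_k, S_k, gamma, C_adv, A_inf):
    """Lower bound L - C1*U_k - C2*S_k with C1, C2 read off as above.
    C_adv stands in for C_{pi_{k+1},beta^{(k)}}, A_inf for ||A^{pi_k}||_inf."""
    C1 = 2 * gamma * C_adv / (1 - gamma) ** 2
    C2 = C1 + 2 * A_inf / (1 - gamma)
    return L_surr - C1 * U_k - C2 * S_k
```

The $(1-\gamma)^{-2}$ factor in $C_1$ is why even small $U_k$ and $S_k$ matter at long horizons: with $\gamma = 0.9$, $C_{\pi_{k+1},\beta^{(k)}} = 1$, and $\|A^{\pi_k}\|_\infty = 1$, the coefficients are already $C_1 = 180$ and $C_2 = 200$.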

Separation of Concerns Principle

| Control Term | Responsible Party | Control Mechanism | Specific Operation |
| --- | --- | --- | --- |
| $U_k$ | Optimization algorithm | Policy clipping | Clip update increments (e.g., clip $\pi_{k+1}/\pi_k$) |
| $S_k$ | Sampling system | Data filtering | Discard stale samples |
| $S_k$ | Sampling system | Version window | Use only the most recent $W$ versions |
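The version-window control can be sketched as a one-line filter (the `(version_index, payload)` data layout is an assumption for illustration):

```python
def version_window_filter(samples, current_version, W=4):
    """Keep only samples generated by one of the most recent W policy versions."""
    return [(v, x) for (v, x) in samples if current_version - v < W]
```

Data filtering and the version window compose naturally: the window bounds worst-case version lag, while per-sample ratio filtering handles outliers inside the window.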

Clipping Method Selection

| Scenario | Recommended Method | Rationale |
| --- | --- | --- |
| High staleness | Method 1 (adaptive) | Automatically tightens constraints for stale samples |
| Implementation simplicity prioritized | Method 2 (incremental) | Clipping form is simpler, though behavior logprobs are still needed for $\rho_k$ |
| LLM large vocabulary | Method 1 | Avoids slow updates for low-probability tokens |
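As a hedged sketch of Method 2’s structure: the increment ratio $\pi_{k+1}/\pi_k$ is clipped around 1 while $\rho_k = \pi_k/\text{behavior}$ reweights the off-policy data. This is an illustrative reading of the incremental scheme, not the post’s exact loss:

```python
import numpy as np

def incremental_clip_loss(logp_new, logp_k, logp_behavior, advantages, eps=0.2):
    """Incremental clipping sketch: clipping center is pi_k, not the behavior
    mixture, but behavior logprobs are still needed to form rho_k."""
    rho_k = np.exp(logp_k - logp_behavior)   # rho_k = pi_k / behavior
    r = np.exp(logp_new - logp_k)            # update increment pi_{k+1}/pi_k
    unclipped = rho_k * r * advantages
    clipped = rho_k * np.clip(r, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

Because the clip is centered on $\pi_k$, the clipping region does not depend on which historical version generated a sample; only the reweighting factor $\rho_k$ does.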

Handling Training-Inference Inconsistency

  • Use inference-side recorded $\hat{\pi}^{(i)}$ as the behavior denominator
  • Compress effective staleness through sample filtering

Appendix

A. Proof Sketch: From State-Distribution Difference to Average TV

A standard starting point for Lemma 3.1 is the fixed-point equation of discounted state visitation distributions:

$$ d_\pi = (1-\gamma)\rho_0 + \gamma P_\pi^\top d_\pi, \qquad d_{\pi_k} = (1-\gamma)\rho_0 + \gamma P_{\pi_k}^\top d_{\pi_k}. $$

Subtracting the two equations and rearranging expresses $d_\pi-d_{\pi_k}$ in terms of the difference between policy-induced transition kernels acting on the old distribution. Taking an $\ell_1$ upper bound, using the non-expansiveness of Markov kernels in $\ell_1$, and then applying

$$ \|(P_\pi-P_{\pi_k})(\cdot\mid s)\|_1 \le 2D_{\mathrm{TV}}(\pi,\pi_k;s), $$

one obtains the average-TV control

$$ \|d_\pi-d_{\pi_k}\|_1 \le \frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d_{\pi_k}}\big[D_{\mathrm{TV}}(\pi,\pi_k;s)\big]. $$

I omit the linear-operator algebra and constant bookkeeping here because the post only uses the final average-TV form.
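The omitted algebra can be sanity-checked numerically on a small random MDP by solving the fixed-point equation for $d_\pi$ directly (a check of the final inequality, not part of the proof; all sizes and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random transition kernel P[s, a, s'] and two nearby policies
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
pi_k = rng.random((nS, nA))
pi_k /= pi_k.sum(axis=1, keepdims=True)
pi = pi_k + 0.1 * rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)
rho0 = np.full(nS, 1.0 / nS)

def visitation(policy):
    """Solve d = (1-gamma) rho0 + gamma P_pi^T d for the visitation distribution."""
    P_pi = np.einsum('sap,sa->sp', P, policy)  # state-to-state kernel under policy
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)

d_pi, d_pik = visitation(pi), visitation(pi_k)
lhs = np.abs(d_pi - d_pik).sum()                       # ||d_pi - d_{pi_k}||_1
tv = 0.5 * np.abs(pi - pi_k).sum(axis=1)               # D_TV(pi, pi_k; s) per state
rhs = 2 * gamma / (1 - gamma) * float(d_pik @ tv)      # average-TV upper bound
# On this instance lhs <= rhs, as the lemma guarantees.
```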

B. Quick Reference for Key Symbols

| Symbol | Meaning |
| --- | --- |
| $\pi_k$, $\pi^{(i)}$ | Latest policy at round $k$; $i$-th old policy |
| $d_\pi(s)$, $A^\pi(s,a)$ | Discounted state visitation distribution; advantage function |
| $D_{\mathrm{TV}}(\pi, \pi'; s)$ | TV distance between two policies at state $s$ |
| $\beta^{(k)}(a \mid s, i) := \pi^{(i)}(a \mid s)$ | Mixture behavior policy at round $k$ |
| $q(i' \mid i)$, $\alpha_i^{(k)}$ | Index transition kernel; initial index distribution |
| $U_k$, $S_k$ | Update increment shift; sampling staleness |
| $\epsilon$, $\epsilon_{\mathrm{stale}}$, $W$ | Clipping radius; staleness threshold; version window |
| $C_{\pi,\pi_k}$ | Expected advantage upper bound coefficient |

References

  1. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. “Trust Region Policy Optimization” (TRPO). arXiv:1502.05477. https://arxiv.org/abs/1502.05477
  2. Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel. “Constrained Policy Optimization” (CPO). arXiv:1705.10528. https://arxiv.org/abs/1705.10528
  3. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. “Proximal Policy Optimization Algorithms” (PPO). arXiv:1707.06347. https://arxiv.org/abs/1707.06347
  4. James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras. “Generalized Proximal Policy Optimization with Sample Reuse” (GePPO). arXiv:2111.00072. https://arxiv.org/abs/2111.00072
  5. Yuzhen Zhou, Jiajun Li, Yusheng Su, et al. “APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation” (APRIL; partial rollout). arXiv:2509.18521. https://arxiv.org/abs/2509.18521
  6. Jacob Hilton, Karl Cobbe, John Schulman. “Batch size-invariance for policy optimization” (Decoupled PPO). arXiv:2110.00641. https://arxiv.org/abs/2110.00641
  7. Sham Kakade, John Langford. “Approximately Optimal Approximate Reinforcement Learning”. ICML 2002. https://dl.acm.org/doi/10.5555/645531.657706
Citation

@misc{WangZhang2025OffPolicyLLMRL,
	author       = {Wang, Xihuai and Zhang, Shao},
	title        = {Taming Stale Data: Off-Policy Reinforcement Learning for LLMs with Monotonic Improvement Guarantees},
	year         = {2025},
	month        = dec,
	day          = {17},
	url          = {https://xihuai18.github.io/reinforcement-learning/2025/12/17/offpolicy-en.html},
	urldate      = {2025-12-17}
}