This post studies a recurring question in large-scale LLM reinforcement learning: when a training batch mixes data generated by multiple historical policy versions, can one still write down an explicit monotonic-improvement lower bound for PPO-style updates?
The short answer is yes: under dynamic mixture sampling, the basic bound can be summarized as “surrogate objective - update-shift penalty - sampling-staleness penalty”; if the actual advantage estimate is included in the analysis, an additional advantage-replacement error term must also be subtracted.
1. Introduction: Why Should We Care About Off-Policy Training?
When training a large language model with reinforcement learning, the most direct setup is on-policy training: generate a batch of data, update on that batch, then sample again from the updated model.
In large-scale distributed training, though, hundreds of GPUs sample in parallel and model updates take time. By the time a new version is deployed, data generated by older versions are often still sitting in the queue: throwing them away is wasteful, but using them means training on stale data.
That is the central off-policy question: when can data collected by older policies still support an analyzable monotonic-improvement lower bound for a newer one?
We will ultimately see that the lower bound is governed by three pieces: a surrogate objective we try to maximize, an update-shift penalty controlled on the optimization side, and a sampling-staleness penalty controlled on the data side.
In many RLHF / online alignment setups, if we view the prompt as context and the response as action while ignoring long-horizon environment evolution, the problem is often well approximated as a contextual bandit. I still start from the discounted-MDP setting because it lets us write multi-version behavior mixing, sampling staleness, and clipping in one unified language. Section 6 returns to which terms disappear, and which conclusions remain, in the bandit limit.
Related work has already touched neighboring parts of this picture: GePPO studies off-policy sample reuse with policy-improvement guarantees, while Decoupled PPO explicitly separates the behavior policy from the proximal policy. The emphasis here is different: I expand the behavior side into a dynamic mixture of historical policy versions and then split the risk into update increment shift and sampling staleness. You can also read this post as a continuation of the earlier three-policy perspective: here the behavior side is no longer a single policy $\mu$, but a mixture over historical policies $\{\pi^{(i)}\}$, while $\pi_k$ and $\pi_{k+1}$ play the roles of the current reference policy and update target. Even without that earlier post, the only principle needed here is to separate what the current update can control from what comes from behavior-distribution mismatch.
1.1 Which Off-Policy Problems Are Being Analyzed?
To avoid using “stale data” as a catch-all phrase, I separate the theoretical mismatches covered in this post:
| Type | Mathematical form | What it breaks |
|---|---|---|
| Version staleness | samples come from $\pi_{k-m}$ while the update targets $\pi_{k+1}$ | behavior distribution vs. current proximal distribution |
| Behavior-proximal mismatch | $\mu \neq \pi_k$ | the PPO ratio denominator is no longer the true sampling distribution |
| Multi-epoch sample reuse | the same data are reused across several updates | later epochs become increasingly off-policy |
| Mixed behavior policies | $\mu=\sum_i w_i\pi^{(i)}$ or an extended-state $\beta$ | a batch is not drawn from a single old policy |
| Support mismatch | some $\mu(a\mid s)=0$ while $\pi(a\mid s)>0$ | importance ratios are undefined |
The bounds below mainly handle the first four. The fifth is a prerequisite for any importance-ratio argument, so I treat it as an explicit support assumption.
2. Theoretical Foundations
2.1 Basic Setup
We consider a standard Markov Decision Process (MDP) comprising a state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $p(s'\mid s,a)$, reward function $r(s,a)$, initial distribution $\rho_0$, and discount factor $\gamma \in (0,1)$.
The expected cumulative discounted return of policy $\pi$ is:
$$ J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid \pi\right] $$
Discounted State Visitation Distribution
We define it as:
$$ d_\pi(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi) $$
Advantage Function
We define it as:
$$ A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s) $$
Total Variation Distance (TV Distance)
We define it as:
$$ D_{\mathrm{TV}}(\pi, \pi'; s) := \frac{1}{2} \sum_{a \in \mathcal{A}} |\pi(a \mid s) - \pi'(a \mid s)| $$
Throughout, we use $\mid$ for conditional probability (e.g., $\pi(a\mid s)$) and reserve $\|\cdot\|$ for norms.
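As a concrete reading of these definitions, here is a minimal NumPy sketch (the function name and array layout are illustrative, not from the post) that evaluates $D_{\mathrm{TV}}(\pi,\pi';s)$ for every state when both policies are given as state-by-action probability tables:

```python
import numpy as np

def tv_distance(pi: np.ndarray, pi_prime: np.ndarray) -> np.ndarray:
    """Per-state total variation distance between two policies.

    pi, pi_prime: arrays of shape (num_states, num_actions) whose rows
    are probability distributions over actions. Returns an array of shape
    (num_states,) with D_TV(pi, pi'; s) = 0.5 * sum_a |pi(a|s) - pi'(a|s)|.
    """
    return 0.5 * np.abs(pi - pi_prime).sum(axis=-1)

# Tiny usage example with 2 states and 3 actions.
pi = np.array([[0.7, 0.2, 0.1],
               [0.4, 0.4, 0.2]])
pi_prime = np.array([[0.6, 0.3, 0.1],
                     [0.1, 0.6, 0.3]])
print(tv_distance(pi, pi_prime))  # [0.1, 0.3]
```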
2.2 Core Tool: Policy Performance Difference Lemma
The starting point of the analysis is the classic performance difference lemma, which writes $J(\pi)-J(\pi_k)$ exactly as an expectation of old-policy advantage under the new policy’s occupancy measure. This identity goes back to Kakade-Langford style analysis and is also the starting point of TRPO.
Lemma 2.1 (Policy Performance Difference Lemma)
For any policies $\pi_k$ (old) and $\pi$ (new), the performance difference can be expressed as:
$$ J(\pi) - J(\pi_k) = \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_\pi}\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi_k}(s,a)] \right] $$
Intuitive understanding: How much better the new policy is than the old equals the “average advantage” obtained by selecting actions according to the new policy under the state distribution visited by the new policy.
2.3 Assumptions Behind the Bounds
All lower bounds below should be read as structural guarantees under standard assumptions. To avoid repeating them after every theorem, I state the important ones up front:
- Common support: whenever the target policy can choose an action, the behavior policy assigns it positive probability. A typical statement is $\pi(a\mid s)>0 \Rightarrow \mu(a\mid s)>0$.
- Bounded advantage: there exists $A_{\max}$ such that $|A^{\pi_k}(s,a)|\le A_{\max}$. This lets TV / KL distances control distribution-replacement errors.
- Bounded or constrained ratios: the importance ratio $\pi(a\mid s)/\mu(a\mid s)$ must exist and cannot be arbitrarily large in the theoretical statement; clipping produces a biased surrogate rather than an unbiased estimate of the original objective.
- Well-defined mixture index: for trajectory-level mixtures, each trajectory keeps a fixed policy index; for step/segment-level mixtures, one must model the index transition explicitly.
- Controlled advantage replacement error: the actual $\hat A$ used in the loss must stay close enough to $A^{\pi_k}$; Section 5.4 isolates this term.
These assumptions are not technical bookkeeping. They mark the boundary where off-policy theory can make a meaningful statement. In particular, if common support fails, the importance ratio itself is not a legitimate object.
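To make the support and ratio assumptions slightly more tangible, here is a small sketch (the function name and threshold are mine, not the post's) that checks them as far as they can be checked from logged samples: it refuses to form importance ratios when a recorded behavior probability is zero and reports how heavy-tailed the ratios are.

```python
import numpy as np

def check_ratio_assumptions(target_prob, behavior_prob, max_ratio=100.0):
    """Sanity-check importance ratios pi(a|s) / mu(a|s) on sampled actions.

    target_prob, behavior_prob: probabilities assigned by the target and
    behavior policies to the *sampled* actions. Raises on a zero behavior
    probability (support violation) and reports how heavy-tailed the
    ratios are relative to max_ratio.
    """
    target_prob = np.asarray(target_prob, dtype=float)
    behavior_prob = np.asarray(behavior_prob, dtype=float)
    if np.any(behavior_prob <= 0.0):
        raise ValueError("common-support violated: behavior probability is 0 "
                         "for an action the target policy can reach")
    ratio = target_prob / behavior_prob
    return {"max_ratio": float(ratio.max()),
            "frac_above_threshold": float(np.mean(ratio > max_ratio))}

print(check_ratio_assumptions([0.2, 0.05, 0.4], [0.25, 0.001, 0.5]))
```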
3. Performance Improvement Bounds for Single-Policy Sampling
3.1 Distribution Mismatch and Controlling State Shift
The Policy Performance Difference Lemma has a practical issue: the expectation on the right-hand side is computed under $d_\pi$ (the new policy’s state distribution), while we can only sample from $d_{\pi_k}$ (the old policy).
The solution is to decompose the expectation into “expectation under the old distribution + bias term,” then control the bias. The key question is: What is the quantitative relationship between the difference in state distributions and the difference in policies?
Controlling State Distribution Differences
Lemma 3.1 (Relationship Between State Distribution Difference and Policy TV Distance)
$$ \|d_\pi - d_{\pi_k}\|_1 \leq \frac{2\gamma}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$
This is an average-divergence / average-TV style bound. It is closer in spirit to CPO / Achiam et al. (2017) than to the classic TRPO presentation based on $\max_s D_{\mathrm{TV}}$. I use it here because it lands more naturally on sample averages and is easier to carry into the multi-source setting. Appendix A includes a short proof sketch for how the state-distribution difference is reduced to average TV.
Physical Interpretation
Small differences in policies in action space are “amplified” through environment dynamics into differences in state visitation distributions. The coefficient $\frac{2\gamma}{1-\gamma}$ reflects the temporal accumulation effect—in long-horizon tasks ($\gamma$ close to 1), the amplification is stronger.
Proof Sketch
By deriving the fixed-point equation for discounted visitation distributions and exploiting the $\ell_1$ non-expansiveness of stochastic matrices, one can show that state-distribution differences are amplified by policy differences through transition dynamics, yielding the bound in Lemma 3.1. Since the main point of this post is not to re-derive that chain in full, I keep only the idea here and move the proof sketch to Appendix A.
3.2 Policy Performance Improvement Lower Bound
Theorem 3.2 (Policy Performance Improvement Lower Bound)
Define the expected advantage upper bound coefficient $C_{\pi,\pi_k} := \max_{s} \lvert \mathbb{E}_{a \sim \pi}[A^{\pi_k}(s,a)] \rvert$. Then:
$$ J(\pi) - J(\pi_k) \geq L_{\pi_k}(\pi) - \frac{2\gamma C_{\pi,\pi_k}}{(1-\gamma)^2} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$
where the surrogate objective is:
$$ L_{\pi_k}(\pi) := \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}, a \sim \pi_k} \left[ \frac{\pi(a \mid s)}{\pi_k(a \mid s)} A^{\pi_k}(s,a) \right] $$
Here $L_{\pi_k}(\pi)$ omits the additive constant $J(\pi_k)$, so in textbook TRPO notation one could also write the objective as $J(\pi_k)+L_{\pi_k}(\pi)$. Also note that $C_{\pi,\pi_k}$ depends on the new policy $\pi$, so it should be read as a structural coefficient in the bound rather than a tunable hyperparameter.
This lower bound consists of two parts:
- Surrogate objective $L_{\pi_k}(\pi)$: can be directly estimated from old-policy data via importance sampling; this is the optimization objective of TRPO/PPO.
- Policy shift penalty: increases with the TV distance between the new and old policies, explaining why PPO needs to constrain the update magnitude.
Core conclusion: this is an explicit improvement lower bound. In particular, when the right-hand side is positive, improvement is guaranteed.
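The surrogate $L_{\pi_k}(\pi)$ is exactly the quantity one can estimate from old-policy samples. A minimal importance-sampling sketch (the function name and the advantage inputs are placeholders of mine, not the post's):

```python
import numpy as np

def surrogate_estimate(logp_new, logp_old, advantages, gamma=0.99):
    """Monte Carlo estimate of L_{pi_k}(pi) from samples (s, a) ~ d_{pi_k}, pi_k.

    logp_new, logp_old: log pi(a|s) and log pi_k(a|s) on the sampled actions.
    advantages: estimates of A^{pi_k}(s, a) on the same samples.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(advantages)) / (1.0 - gamma))

# Usage with made-up numbers.
print(surrogate_estimate([-1.0, -2.1, -0.5], [-1.1, -2.0, -0.7], [0.3, -0.1, 0.5]))
```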
3.3 Finite-Sequence LLM Form and the Support Condition
In the prompt-response form of LLM RL, let the prompt be $x$ and the response be
$$ y=(a_1,\ldots,a_T). $$
The behavior and target sequence probabilities are
$$
\mu(y\mid x)=\prod_{t=1}^T \mu(a_t\mid x,a_{<t}),
\qquad
\pi(y\mid x)=\prod_{t=1}^T \pi(a_t\mid x,a_{<t}).
$$
The sequence-level importance ratio is
$$
\rho(y\mid x)=\frac{\pi(y\mid x)}{\mu(y\mid x)}
=\prod_{t=1}^T \frac{\pi(a_t\mid x,a_{<t})}{\mu(a_t\mid x,a_{<t})}.
$$
Therefore
$$
\log \rho(y\mid x)
=
\sum_{t=1}^T
\left[\log \pi(a_t\mid x,a_{<t})-\log \mu(a_t\mid x,a_{<t})\right].
$$
Small token-level log-ratio errors accumulate linearly along the sequence, while the sequence ratio itself amplifies multiplicatively. Long responses, low-probability tokens, truncated sampling, and MoE routing can all make this ratio heavy-tailed.
This also exposes the support condition. If top-$k$ / top-$p$ / masks / EOS rules give zero probability to a token under the behavior policy while the target policy still assigns it positive probability, then $\rho$ is undefined. In that case one cannot simply "add an importance ratio"; one must first redefine the target support or introduce smoothing / mixture distributions that guarantee common support.
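A small numerical sketch of this accumulation (the array names and toy numbers are mine): given per-token log-probabilities under the behavior and target policies, the sequence log-ratio is just the summed difference, and exponentiating it shows how quickly the sequence ratio drifts for long responses.

```python
import numpy as np

def sequence_log_ratio(target_token_logps, behavior_token_logps):
    """log rho(y|x) = sum_t [log pi(a_t|x,a_<t) - log mu(a_t|x,a_<t)].

    Both inputs are 1-D arrays of per-token log-probabilities for the same
    sampled response. Returns (log_ratio, ratio).
    """
    diff = np.asarray(target_token_logps) - np.asarray(behavior_token_logps)
    log_ratio = float(diff.sum())
    return log_ratio, float(np.exp(log_ratio))

# Even a small average per-token gap compounds over a long response:
T = 500
rng = np.random.default_rng(0)
behavior = rng.normal(-2.0, 0.5, size=T)
target = behavior + 0.01          # target is 0.01 nats "more confident" per token
print(sequence_log_ratio(target, behavior))  # log-ratio = 5.0, ratio ~ 148
```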
4. Multi-Policy Static Mixture Sampling
4.1 Setup and Unified Modeling (Static Mixture)
In practice, a batch of data may come from multiple policy versions $\{\pi^{(1)}, \ldots, \pi^{(M)}\}$, with respective proportions $\alpha_1, \ldots, \alpha_M$. How do we extend Theorem 3.2 to this setting?
Core idea: augmented state space
The solution is an elegant modeling technique: treat the policy version index as part of the state. Define the augmented state space $\tilde{\mathcal{S}} := \mathcal{S} \times \mathcal{I}$, where $\mathcal{I} = \{1, \ldots, M\}$ is the policy index set. Under augmented state $(s, i)$, the mixture behavior policy is defined as $\beta(a \mid s, i) := \pi^{(i)}(a \mid s)$. The evolution of indices is characterized by the index transition kernel $q(i' \mid i)$. The augmented MDP inherits the original MDP's rewards and environment transitions, with indices evolving independently according to $q(i'\mid i)$. This technique works because the new policy $\pi$ has the same return in the augmented MDP as in the original one, so Theorem 3.2 can be applied directly.
4.2 Trajectory-Level Mixture: Simplification and Improvement Bound
The most common scenario is using a single old policy per trajectory: at trajectory start, sample index $I_0 \sim \alpha$, and use $\pi^{(I_0)}$ throughout. In this case, the index transition kernel is the identity: $q(i' \mid i) = \mathbf{1}_{i'=i}$.
From an engineering perspective, many asynchronous actor-learner systems organize data so that an entire trajectory belongs to a particular policy snapshot, and the learner then mixes trajectories produced by different versions. That setup approximately corresponds to what I call trajectory-level mixture. The word "approximately" matters because different systems need not agree on the exact boundary of a trajectory or sampling unit.
Lemma 4.1 (Structural Simplification for Trajectory-Level Mixture)
(a) The augmented state visitation distribution decomposes as $d_{\beta}(s, i) = \alpha_i \cdot d_{\pi^{(i)}}(s)$.
(b) The advantage function reduces to $A^{\beta}((s, i), a) = A^{\pi^{(i)}}(s, a)$.
Intuition for (b): since the index never changes, all future trajectories starting from augmented state $(s,i)$ are generated by the same policy $\pi^{(i)}$. Therefore, future cumulative returns are entirely determined by $\pi^{(i)}$, and value functions and advantage functions naturally reduce to their $\pi^{(i)}$ counterparts.
Consequently, the mixture policy's return is the weighted average of the individual old policies' returns: $J_{\mathrm{mix}} = \sum_{i=1}^{M} \alpha_i J(\pi^{(i)})$.
Improvement bound
Corollary 4.2 (Performance Improvement Lower Bound for Trajectory-Level Mixture)
$$
J(\pi) - \sum_{i=1}^{M} \alpha_i J(\pi^{(i)}) \geq \sum_{i=1}^{M} \alpha_i L_{\pi^{(i)}}(\pi) - \frac{2\gamma \max_i C_{\pi, \pi^{(i)}}}{(1-\gamma)^2} \sum_{i=1}^{M} \alpha_i \mathbb{E}_{s \sim d_{\pi^{(i)}}} \big[ D_{\mathrm{TV}}(\pi, \pi^{(i)}; s) \big]
$$
This result shows that when mixing trajectories from multiple old policy versions for training, if we construct the loss using importance ratios corresponding to each trajectory's source policy while controlling the new policy's deviation from each old policy, the new policy's performance has a clear improvement lower bound.
Using $\max_i C_{\pi,\pi^{(i)}}$ is just a compact way to write the result. A slightly tighter version would keep a separate $C_i$ for each component and then average them with the mixture weights.
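As a sketch of how the surrogate term on the right-hand side can be estimated from a mixed batch (the grouping logic and variable names are mine, not the post's): keep each sample's version index and use that version's probability in the ratio denominator; a plain batch mean then implements the $\alpha$-weighted sum, because each source appears with frequency proportional to its weight.

```python
import numpy as np

def mixture_surrogate(logp_new, logp_source, source_idx, advantages, gamma=0.99):
    """Estimate sum_i alpha_i * L_{pi^(i)}(pi) from a mixed batch.

    logp_new:    log pi(a|s) for each sample under the candidate policy.
    logp_source: log pi^(i)(a|s) under the sample's own source policy.
    source_idx:  integer version index i of each sample (used for reporting).
    advantages:  estimates of A^{pi^(i)}(s, a) for each sample.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_source))
    per_sample = ratio * np.asarray(advantages) / (1.0 - gamma)
    per_source = {int(i): float(per_sample[np.asarray(source_idx) == i].mean())
                  for i in np.unique(source_idx)}
    return float(per_sample.mean()), per_source

total, by_source = mixture_surrogate(
    logp_new=[-1.0, -0.8, -2.0, -1.5],
    logp_source=[-1.1, -0.9, -1.8, -1.6],
    source_idx=[0, 0, 1, 1],
    advantages=[0.2, -0.1, 0.4, 0.1],
)
print(total, by_source)
```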
5. Dynamic Mixture Sampling and Monotonic Improvement Conditions
5.1 Problem and Unified Modeling (Dynamic Mixture)
Section 4 discussed static mixture, where the mixture weights $\alpha_i$ remain fixed. This section considers the more general dynamic mixture, where sampling gradually transitions to the new policy after it is released.
The previous results characterize improvement of "the new policy relative to the mixture behavior policy." However, in actual training, what we truly care about is: does the latest policy $\pi_{k+1}$ after each update monotonically improve over the previous latest policy $\pi_k$?
$$
J(\pi_{k+1}) \geq J(\pi_k)
$$
Unified Modeling Framework
Two typical forms of dynamic mixture sampling can be uniformly characterized by the index transition kernel $q(i'\mid i)$:
- Trajectory-level mixture (an abstraction of conventional asynchronous training; identity index transition): $q(i'\mid i) = \mathbf{1}\{i'=i\}$
- Step/segment-level mixture (an abstraction of partial rollout, or segment-based sampling; allows switching): $q(i'\mid i) = (1-\sigma(i))\mathbf{1}\{i'=i\} + \sigma(i)\kappa(i'\mid i)$, where $\sigma(i)$ is the switching probability and $\kappa(\cdot\mid i)$ is the target index distribution.
5.2 Decomposition and Monotonic Improvement Bound
By introducing the mixture return $J_{\mathrm{mix}}^{(k)}$ as an intermediate bridge, the performance difference decomposes as:
$$
J(\pi_{k+1}) - J(\pi_k) = \underbrace{[J(\pi_{k+1}) - J_{\mathrm{mix}}^{(k)}]}_{\text{improvement over mixture policy}} + \underbrace{[J_{\mathrm{mix}}^{(k)} - J(\pi_k)]}_{\text{mixture bias term}}
$$
The first term is handled using Theorem 3.2. The second is the mixture bias term. The way to handle it is to expand $J_{\mathrm{mix}}^{(k)} - J(\pi_k)$ into a weighted sum of $J(\pi^{(i)}) - J(\pi_k)$ terms, apply a TV-based two-policy lower bound to each component, and then collect everything using $\|A^{\pi_k}\|_\infty$. This gives:
$$
J_{\mathrm{mix}}^{(k)} - J(\pi_k) \geq -\frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big]
$$
Monotonic Improvement Bound
Combining the above results yields the core theorem:
Theorem 5.1 (Monotonic Improvement Lower Bound Under Dynamic Mixture Sampling)
$$
\begin{aligned}
J(\pi_{k+1}) - J(\pi_k) \geq\;& L_{\beta^{(k)}}(\pi_{k+1}) \\
&- \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \big] \\
&- \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big]
\end{aligned}
$$
Here $L_{\beta^{(k)}}(\pi_{k+1})$ denotes the surrogate objective relative to the behavior policy $\beta^{(k)}$ (the same shape as $L_{\pi_k}(\pi)$ in Section 3, but with the behavior policy generalized from a single $\pi_k$ to the mixture $\beta^{(k)}$). More explicitly, one can write
$$
L_{\beta^{(k)}}(\pi_{k+1}) := \frac{1}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}},\, a\sim \pi^{(i)}(\cdot\mid s)}\left[\frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}\,A^{\beta^{(k)}}((s,i),a)\right].
$$
Similarly, define
$$
C_{\pi_{k+1},\beta^{(k)}} := \max_{(s,i)}\left|\mathbb{E}_{a\sim \pi_{k+1}(\cdot\mid s)}\big[A^{\beta^{(k)}}((s,i),a)\big]\right|.
$$
This lower bound contains two penalties, so it naturally points to two separate control problems.
5.3 Why Direct Constraints Are Infeasible: Triangle Inequality Decomposition
Before going further, one important qualifier: the infeasibility discussed here concerns an interpretation in which we try to impose the same hard trust-region constraint against every historical source policy. That is not the same as saying PPO-Clip itself explicitly implements such a constraint.
The update-shift penalty in Theorem 5.1 may look controllable through constraints on $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s)$, but that interpretation quickly runs into a practical infeasibility.
Observation 5.2 (Infeasibility of a Uniform Hard Trust Region)
Suppose the mixture sampling includes two old policies $\pi^{(1)}$ and $\pi^{(2)}$. If there exists some state $s$ such that $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) > 2\delta$, then no policy $\pi_{k+1}$ can simultaneously satisfy $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(1)}; s) \leq \delta$ and $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(2)}; s) \leq \delta$.
Proof
By the triangle inequality, if both constraints were satisfied, then $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) \leq 2\delta$, a contradiction.
Root Cause
The update shift penalty directly couples $\pi_{k+1}$ with the historical policy family $\{\pi^{(i)}\}$, whose internal structure is a product of historical training and not controllable by the current update.
Triangle Inequality Decomposition
The solution leverages the triangle inequality of TV distance:
$$
D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \leq D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s) + D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)
$$
This decomposes the coupled constraint into two independent parts. Define:
$$
U_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)\big], \quad S_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)\big]
$$
Corollary 5.3 (Decomposed Monotonic Improvement Lower Bound)
$$
J(\pi_{k+1}) - J(\pi_k) \geq L_{\beta^{(k)}}(\pi_{k+1}) - \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} U_k - \left( \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \right) S_k
$$
Why Does Decomposition Solve the Problem?
The key is that after decomposition, $U_k$ only involves the new policy $\pi_{k+1}$ and the current policy $\pi_k$, completely independent of the structure of the old policy family $\{\pi^{(i)}\}$. So no matter how far apart the old policies are, constraining $U_k$ remains feasible. This is exactly how the infeasibility in Observation 5.2 is avoided.
Operationally, this leads to a simple separation of concerns, sketched in code below:

| Control Term | Responsible Party | Control Mechanism |
|---|---|---|
| $U_k$ (update increment shift) | Optimization algorithm | Policy clipping |
| $S_k$ (sampling staleness) | Sampling system | Data filtering, version window |
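A schematic of this division of labor (everything here, from function names to thresholds, is an illustrative sketch rather than a prescribed implementation): the data side filters staleness before batching, and the optimization side clips the increment ratio during the update.

```python
import numpy as np

def filter_stale(samples, eps_stale=0.5):
    """Data side: keep samples whose |pi_k/pi^(i) - 1| is small (controls S_k)."""
    kept = []
    for s in samples:
        rho_k = np.exp(s["logp_current"] - s["logp_source"])  # pi_k / pi^(i)
        if abs(rho_k - 1.0) <= eps_stale:
            kept.append(s)
    return kept

def incremental_clip_update(samples, logp_new_fn, eps=0.2):
    """Optimization side: clip pi_{k+1}/pi_k around 1 (controls U_k)."""
    losses = []
    for s in samples:
        r = np.exp(logp_new_fn(s) - s["logp_current"])        # pi_{k+1} / pi_k
        rho_k = np.exp(s["logp_current"] - s["logp_source"])  # importance weight
        a_hat = rho_k * s["advantage"]
        losses.append(min(r * a_hat, np.clip(r, 1 - eps, 1 + eps) * a_hat))
    return -float(np.mean(losses))  # negative: a loss to minimize

# Toy usage: one fresh and one very stale sample.
batch = [
    {"logp_source": -1.0, "logp_current": -1.1, "advantage": 0.5},
    {"logp_source": -3.0, "logp_current": -1.0, "advantage": 0.2},  # very stale
]
kept = filter_stale(batch)
print(len(kept), incremental_clip_update(kept, lambda s: s["logp_current"] + 0.05))
```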
5.4 Advantage Replacement Error: Off-Policy Is Not Only a Ratio Problem
The bounds above assume that the surrogate uses the theoretical advantage $A^{\pi_k}(s,a)$. In LLM RL, however, the loss often uses a batch estimate, critic / GAE output, normalized verifier reward, or group-relative advantage. Let the actual advantage be $\hat A(s,a)$. Even if the importance ratio is correct, there is an additional advantage-replacement error:
$$
\mathbb{E}_{\mu}\left[\rho(s,a)\hat A(s,a)\right]
-
\mathbb{E}_{\mu}\left[\rho(s,a)A^{\pi_k}(s,a)\right]
=
\mathbb{E}_{\mu}\left[\rho(s,a)(\hat A(s,a)-A^{\pi_k}(s,a))\right].
$$
If $|\rho(s,a)|\le M$ and $|\hat A(s,a)-A^{\pi_k}(s,a)|\le \epsilon_A$, then
$$
\left|
\mathbb{E}_{\mu}\left[\rho(s,a)(\hat A(s,a)-A^{\pi_k}(s,a))\right]
\right|
\le M\epsilon_A.
$$
Thus the off-policy error has at least three components: update shift $U_k$, sampling staleness $S_k$, and advantage replacement error $\epsilon_A$. Focusing only on the actor ratio misses the third term; theoretically, whether $\hat A$ remains close to $A^{\pi_k}$ is part of the monotonic-improvement condition.
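To make the third component concrete, a small sketch (the variable names are mine) that reports both the empirical gap $\mathbb{E}_\mu[\rho(\hat A - A)]$, using a reference advantage estimate standing in for $A^{\pi_k}$, and the worst-case bound $M\epsilon_A$:

```python
import numpy as np

def advantage_replacement_terms(ratio, a_hat, a_ref):
    """Empirical E_mu[rho * (A_hat - A_ref)] and the bound M * eps_A.

    ratio: importance ratios rho(s, a) on the sampled actions.
    a_hat: advantages actually used in the loss.
    a_ref: reference advantage estimates standing in for A^{pi_k}.
    """
    ratio, a_hat, a_ref = map(np.asarray, (ratio, a_hat, a_ref))
    empirical_gap = float(np.mean(ratio * (a_hat - a_ref)))
    worst_case = float(np.max(np.abs(ratio)) * np.max(np.abs(a_hat - a_ref)))
    return empirical_gap, worst_case

print(advantage_replacement_terms([1.0, 1.3, 0.8], [0.5, -0.2, 0.1], [0.4, -0.1, 0.3]))
```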
6. Comparison of Trajectory-Level and Step/Segment-Level Mixture
6.1 Mechanism Differences and Estimation Implications
The essential difference between the two mixture mechanisms lies in the structure of the index transition kernel: trajectory-level mixture keeps the index fixed within a trajectory ($q(i'\mid i)=\mathbf{1}\{i'=i\}$), while step/segment-level mixture allows the index to switch during a trajectory. In common engineering terminology, trajectory-level mixture corresponds to conventional asynchronous training, and step/segment-level mixture corresponds to partial rollout or segment-based sampling.
The key dividing line is whether Lemma 4.1's structural simplification holds: trajectory-level mixture satisfies the advantage reduction, while step/segment-level mixture generally does not, because future returns depend on the index transition kernel.
Differences in Sampling Staleness $S_k$
Trajectory-level mixture's staleness arises from the mixture weights $\alpha_i^{(k)}$ retaining probability mass on old policies after the new policy is released.
Step/segment-level mixture has an exponential compression effect in a simplified model: suppose that once the index switches from an old version to the new version it never switches back, and that the switch happens with probability $\sigma$ at each step. Then the marginal mass on the old index under the discounted visitation distribution is
$$
(1-\gamma)\sum_{t\ge 0}[\gamma(1-\sigma)]^t = \frac{1-\gamma}{1-\gamma(1-\sigma)}.
$$
As long as $\sigma \gg 1-\gamma$, the old-policy mass is significantly compressed.
Differences in Surrogate Objective Estimation
Trajectory-level mixture: the advantage function reduces to $A^{\pi^{(i)}}(s,a)$, with a clear estimation path.
Advantage substitution bias in step/segment-level mixture: if single-policy advantage estimates are used, systematic bias will arise. The reason is that $A^{\beta^{(k)}}((s,i),a)$ requires taking expectations over future index switching, while $A^{\pi^{(i)}}(s,a)$ implicitly assumes "the future always follows $\pi^{(i)}$."
Unification Under Bandit Setting
In single-step-episode LLM training, with no subsequent state transitions, the estimation problems of the two mechanisms unify, and no such bias arises.
6.2 Risks and Applicable Scenarios
Step/segment-level mixture has another hidden concern: even if single-step importance ratios are clipped, multi-step noise accumulation over long trajectories can still amplify gradient estimation variance. When policy changes per update are large, "behavioral discontinuities" within trajectories may induce heavier-tailed ratio distributions. This is also why Table 6.1 recommends trajectory-level mixture for scenarios with large policy change per update.
Applicable Scenarios
Table 6.1 Applicable Scenarios for Two Mixture Mechanisms

| Scenario Characteristics | Recommended Mechanism | Rationale |
|---|---|---|
| Long trajectories, high-frequency updates, strong asynchrony | Step/segment-level | Can significantly compress $S_k$ |
| Short trajectories (non-bandit) | Trajectory-level | $S_k$ is naturally low |
| Large policy change per update | Trajectory-level | Avoids variance amplification |
| Single-step episode (bandit) | Either | Choose based on implementation convenience |
| Need for compromise | Segment-level | Switch at natural boundaries |

Core trade-off: step/segment-level mixture is stronger on the sampling side (fast staleness removal), while trajectory-level mixture is more stable on the estimation side (easier surrogate objective estimation).
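A worked numeric reading of the staleness comparison (the values simply evaluate the geometric-compression formula from Section 6.1 for a few hypothetical switching probabilities):

```python
# Residual mass on old-policy data under the discounted visitation distribution.
def step_level_old_mass(gamma, sigma):
    """(1 - gamma) / (1 - gamma * (1 - sigma)): geometric compression from switching."""
    return (1 - gamma) / (1 - gamma * (1 - sigma))

gamma = 0.99
for sigma in (0.0, 0.05, 0.2):
    print(f"sigma={sigma:.2f}: old-policy mass = {step_level_old_mass(gamma, sigma):.3f}")
# sigma=0.00 -> 1.000 (no switching: behaves like trajectory-level with alpha_old = 1)
# sigma=0.05 -> 0.168, sigma=0.20 -> 0.048: sigma >> 1 - gamma compresses the mass quickly
```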
7. Theoretical Foundations of Clipping Mechanisms
7.1 From TV Distance to Sample-Controllable Quantities
Corollary 5.3 tells us that to guarantee monotonic improvement, we need to control the update increment shift $U_k = \mathbb{E}[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)]$. But TV distance is a distribution-level quantity, so how do we control it with samples? The bridge from theory to samples is the following identity.
Lemma 7.1 (Ratio Difference Representation of TV Distance)
Suppose policy $\pi_1$'s support covers the supports of $\pi$ and $\pi_2$. Then for any state distribution $\mu$:
$$
\mathbb{E}_{s\sim \mu} \big[D_{\mathrm{TV}}(\pi, \pi_2; s)\big] = \frac{1}{2} \mathbb{E}_{s\sim \mu, a\sim\pi_1(\cdot\mid s)} \left| \frac{\pi(a\mid s)}{\pi_1(a\mid s)} - \frac{\pi_2(a\mid s)}{\pi_1(a\mid s)} \right|
$$
Note the support-coverage requirement: the behavior policy used in the denominator must assign nonzero mass to the actions that appear in training. For LLMs, this means hard top-k / top-p truncation without smoothing can make some ratios undefined. Section 8 returns to that issue.
Intuitive Understanding
The left side is the TV distance between two distributions (requiring enumeration over all actions), while the right side is the absolute difference of two importance ratios when sampling under $\pi_1$. This enables us to estimate and control TV distance using samples.
Sample Representation of $U_k$
Using Lemma 7.1, setting $\pi = \pi_{k+1}$, $\pi_2 = \pi_k$, $\pi_1 = \pi^{(i)}$ (the sampling-source policy), we obtain:
$$
U_k = \frac{1}{2} \mathbb{E}_{(s,i) \sim d_{\beta^{(k)}}, a \sim \pi^{(i)}(\cdot\mid s)} \left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right|
$$
Denoting $\rho_{k+1} := \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}$ and $\rho_k := \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)}$, we have:
$$
U_k = \frac{1}{2} \mathbb{E}_{(s,i,a) \sim \text{training data}} \big| \rho_{k+1} - \rho_k \big|
$$
This means: if we can ensure $\lvert\rho_{k+1} - \rho_k\rvert \leq \epsilon$ for each sample, we can guarantee $U_k \leq \epsilon/2$.
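This sample representation is directly computable from logged log-probabilities. A minimal sketch (the names are illustrative, not the post's):

```python
import numpy as np

def estimate_update_shift(logp_new, logp_current, logp_source):
    """Estimate U_k = 0.5 * E|rho_{k+1} - rho_k| from logged log-probabilities.

    logp_new:     log pi_{k+1}(a|s) on the sampled actions (candidate update).
    logp_current: log pi_k(a|s).
    logp_source:  log pi^(i)(a|s) under each sample's own source policy.
    """
    logp_new, logp_current, logp_source = map(np.asarray,
                                              (logp_new, logp_current, logp_source))
    rho_new = np.exp(logp_new - logp_source)      # pi_{k+1} / pi^(i)
    rho_cur = np.exp(logp_current - logp_source)  # pi_k / pi^(i)
    return 0.5 * float(np.mean(np.abs(rho_new - rho_cur)))

print(estimate_update_shift([-1.0, -2.0], [-1.05, -1.9], [-1.1, -1.8]))
```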
7.2 Constraining $U_k$: Two Clipping Options
Method 1: Direct Constraint on Ratio Difference
For each sample $(s, i, a)$, require:
$$
\left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right| \leq \epsilon
$$
The clipping interval is $\left[\frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} - \epsilon, \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} + \epsilon\right]$, with clipping center at $\rho_k$ rather than 1.
Method 2: Constraint on Incremental Ratio
Noting that $\rho_{k+1} - \rho_k = \rho_k \cdot \left(\frac{\pi_{k+1}}{\pi_k} - 1\right)$, we have:
$$
|\rho_{k+1} - \rho_k| = \rho_k \cdot \left|\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right|
$$
If we constrain $\left\lvert\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right\rvert \leq \epsilon$, then
$$
|\rho_{k+1} - \rho_k| \leq \epsilon\,\rho_k.
$$
Taking expectations gives $\mathbb{E}[|\rho_{k+1} - \rho_k|] \leq \epsilon\,\mathbb{E}_{a\sim\pi^{(i)}}[\rho_k] = \epsilon$, hence $U_k \leq \epsilon/2$.
This method clips $\pi_{k+1}/\pi_k$ with center at 1, meaning the clipping constraint itself does not depend on the old policy family $\pi^{(i)}$. However, if we use the weighted advantage $\hat{A}=\rho_k\cdot A^{\beta^{(k)}}$ below, we still need per-sample behavior probabilities (or recorded logprobs) to compute $\rho_k$.
Objective Functions (Three Clipping Mechanisms)
Before writing down the clipped objectives, one caveat is worth making explicit: the two inequalities above are hard per-sample constraints at the theory level. The clipped surrogates below are practical approximations meant to keep $U_k$ in a manageable range; they are not literal guarantees that every optimization step exactly satisfies the hard constraint.
For comparison, we present the complete objective functions for three clipping mechanisms. Suppose the current sample comes from old policy $\pi^{(i)}$, and denote $\rho_{k+1} := \pi_{k+1}(a\mid s)/\pi^{(i)}(a\mid s)$, $\rho_k := \pi_k(a\mid s)/\pi^{(i)}(a\mid s)$, and $r := \pi_{k+1}(a\mid s)/\pi_k(a\mid s)$.
Note: under trajectory-level mixture (index fixed), $A^{\beta^{(k)}}((s,i),a)=A^{\pi^{(i)}}(s,a)$, so per-trajectory advantages from the corresponding old policy are consistent; under step/segment-level mixture, replacing $A^{\beta^{(k)}}$ with $A^{\pi^{(i)}}$ introduces advantage-substitution bias (discussed in Section 6), so the advantage/value estimator must reflect future index switching.
Standard PPO (Trajectory-Level Mixture)
Clip $\rho_{k+1}$ with center at 1:
$$
L^{\mathrm{PPO}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\pi^{(i)}}, \; \mathrm{clip}(\rho_{k+1}, 1-\epsilon, 1+\epsilon) \cdot A^{\pi^{(i)}} \right) \right]
$$
Method 1
Clip $\rho_{k+1}$ with center at $\rho_k$:
$$
L^{\mathrm{M1}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\beta^{(k)}}, \; \mathrm{clip}(\rho_{k+1}, \rho_k-\epsilon, \rho_k+\epsilon) \cdot A^{\beta^{(k)}} \right) \right]
$$
Method 2
Clip the incremental ratio $r$ with center at 1:
$$
L^{\mathrm{M2}} = \mathbb{E} \left[ \min\left( r \cdot \hat{A}, \; \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot \hat{A} \right) \right]
$$
where $\hat{A} = \rho_k \cdot A^{\beta^{(k)}}$ is the importance-weighted advantage estimate.
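For concreteness, a side-by-side sketch of the three clipped per-sample terms (a toy NumPy version with names of my choosing; in a real trainer these would be computed from model log-probabilities, and the advantage inputs follow the note above). The toy numbers show the qualitative difference on a stale sample: PPO's clip caps the term against the stale source policy, while Methods 1 and 2 allow the same move because it is small relative to $\pi_k$.

```python
import numpy as np

def ppo_term(rho_new, adv, eps=0.2):
    """Standard PPO: clip rho_{k+1} = pi_{k+1}/pi^(i) around 1."""
    return np.minimum(rho_new * adv, np.clip(rho_new, 1 - eps, 1 + eps) * adv)

def m1_term(rho_new, rho_cur, adv, eps=0.2):
    """Method 1: clip rho_{k+1} around rho_k = pi_k/pi^(i)."""
    return np.minimum(rho_new * adv, np.clip(rho_new, rho_cur - eps, rho_cur + eps) * adv)

def m2_term(r, rho_cur, adv, eps=0.2):
    """Method 2: clip the increment r = pi_{k+1}/pi_k around 1, weight adv by rho_k."""
    a_hat = rho_cur * adv
    return np.minimum(r * a_hat, np.clip(r, 1 - eps, 1 + eps) * a_hat)

# One stale sample: the source policy is far from pi_k (rho_k = 1.8).
rho_cur, r, adv = 1.8, 1.1, 0.5
rho_new = rho_cur * r
print(ppo_term(rho_new, adv), m1_term(rho_new, rho_cur, adv), m2_term(r, rho_cur, adv))
# PPO clips the term to 0.6; Methods 1 and 2 both keep 0.99.
```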
7.3 Method Comparison and Selection
For the hard per-sample constraints discussed earlier, Methods 1 and 2 do directly control $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$. For the clipped surrogates used in practice, however, the more accurate reading is that they exert optimization pressure on different shift objects rather than explicitly imposing a TV constraint.
Table 7.1 Comparison of Three Clipping Mechanisms

| Method | Clipped Variable | Clipping Center | Clipping Interval | More Natural Shift Object |
|---|---|---|---|---|
| Standard PPO | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | New policy relative to $\pi^{(i)}$ |
| Method 1 | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $\rho_k = \pi_k/\pi^{(i)}$ | $[\rho_k-\epsilon, \rho_k+\epsilon]$ | New policy relative to $\pi_k$ |
| Method 2 | $r = \pi_{k+1}/\pi_k$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | New policy relative to $\pi_k$ |

The Fundamental Problem with Standard PPO Under Multi-Policy Mixture
If we carry over a single-source trust-region intuition, standard PPO's clipped objective is most naturally read as suppressing further deviation of the new policy from each sampling-source policy $\pi^{(i)}$. But PPO-Clip does not explicitly impose TV / KL constraints; rather, clipping removes gains from moving farther away from the behavior policy. When the old policies $\pi^{(1)}, \pi^{(2)}, \ldots$ differ substantially, that optimization pressure is easily dominated by the stalest sources.
Common Advantages of Methods 1 and 2
For the hard-constraint versions introduced earlier, Methods 1 and 2 directly control $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$. For the clipped-surrogate versions used in practice, the more accurate statement is that they redirect optimization pressure from "stay close to every behavior policy at once" toward "control the update around the current policy $\pi_k$." Since $\pi_k$ is unique, that target is shared across all sample sources and avoids the structural difficulty behind the infeasibility result.
Method 1 vs Method 2

| Comparison Dimension | Method 1 (Adaptive Clipping) | Method 2 (Incremental Clipping) |
|---|---|---|
| Stale samples ($\rho_k \gg 1$) | Automatically tightens constraints, more conservative | May produce large gradient variance |
| LLM large-vocabulary low-probability tokens | Allows larger absolute changes (additive) | Absolute changes are limited (multiplicative) |
| Implementation complexity | Requires storing $\pi^{(i)}(a\mid s)$ and $\pi_k(a\mid s)$ | Needs $\pi_k(a\mid s)$ and $\pi^{(i)}(a\mid s)$ (or stored logprobs) to compute $\rho_k$; clipping itself uses only $\pi_{k+1}/\pi_k$ |
| Advantage function | Uses $A^{\beta^{(k)}}$ | Uses weighted advantage $\rho_k \cdot A^{\beta^{(k)}}$ |

Detailed Explanations
(1) Handling Stale Samples
When samples come from very old policies, $\rho_k = \pi_k/\pi^{(i)}$ can be large. Method 1's interval $[\rho_k-\epsilon, \rho_k+\epsilon]$ keeps a fixed absolute width around that large center, so the allowed relative change is automatically tightened and the update is more conservative; Method 2 instead weights the advantage by $\rho_k$, so very stale samples can inflate gradient variance.
(2) LLM Large Vocabulary Issue
Large language models have many tokens with very small probabilities. Method 1's additive window permits an absolute change of up to $\epsilon\,\pi^{(i)}(a\mid s)$ in the new policy's probability, while Method 2's multiplicative band $[(1-\epsilon)\pi_k, (1+\epsilon)\pi_k]$ limits low-probability tokens to tiny absolute movements.
7.4 Staleness Control and Operational Meaning
The discussion so far has focused on optimization-side clipping, but the monotonic-improvement lower bound also contains a sampling-staleness term $S_k$. This subsection handles the sampling side's responsibility and then returns to what clipping, as an overall operation, actually means.
Controlling Sampling Staleness
Corollary 5.3 shows that $S_k$ also enters the monotonic-improvement lower bound, but it cannot be controlled from the optimization side. It has to be handled by the sampling system:
(1) Discarding Stale Data
Set a threshold $\epsilon_{\mathrm{stale}}$. For each sample, compute $\lvert\rho_k - 1\rvert = \lvert\pi_k(a\mid s)/\pi^{(i)}(a\mid s) - 1\rvert$, and discard samples exceeding the threshold.
(2) Controlling Policy Version Window
Limit the number of old policy versions in the mixture sampling, e.g., using only data from the most recent $W$ versions.
Operational Meaning of Clipping
Finally, we clarify the relationship between clipping and the theoretical lower bound. In Corollary 5.3, the coefficient of $U_k$, namely $C_{\pi_{k+1},\beta^{(k)}}$, depends on the new policy $\pi_{k+1}$, so the penalty term cannot simply be replaced by a fixed constant. Operationally, the clipped objective should be read as an approximation to a constrained update, not as a literal theorem with a hand-picked scalar penalty:
Maximize the surrogate objective $L_{\beta^{(k)}}(\pi_{k+1})$ subject to the constraint $U_k \leq \epsilon/2$.
The clipping objective can be read as a practical approximation to this constrained optimization: clipping keeps the update magnitude in check so that $U_k$ stays manageable, and then gradient ascent pushes up the surrogate objective within that approximation.
Section Summary
This section established the theoretical foundations of clipping mechanisms: Lemma 7.1 turns the TV-based quantity $U_k$ into a sample-computable ratio difference; the two clipping options (around $\rho_k$, or around 1 on the increment $\pi_{k+1}/\pi_k$) keep $U_k$ controllable regardless of how far apart the old policies are; the staleness term $S_k$ is left to the sampling system; and the clipped objective as a whole approximates the constrained update "maximize the surrogate subject to $U_k \leq \epsilon/2$."
8. Handling Training-Inference Inconsistency
8.1 Background and Effective Staleness
In large-scale distributed training, the policies on the inference side and the training side may be inconsistent: let the behavior policy modeled on the training side be $\pi^{(i)}$, while the policy actually sampling on the inference side is $\hat{\pi}^{(i)}$.
The mismatch discussed here is between the policy that actually generated the samples and the policies the training side works with; it is distinct from the policy-vs-reference-model KL regularization that is common in RLHF, which is a different regularization axis.
Effective Staleness
Define effective staleness:
$$
\hat{S}_k := \mathbb{E}_{(s,i) \sim d_{\hat{\beta}^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_k, \hat{\pi}^{(i)}; s) \big]
$$
This definition simultaneously covers version staleness and training-inference implementation differences. By Lemma 7.1, $\hat{S}_k$ can be written in a sample-computable form.
8.2 Theoretical Control of Effective Staleness
Key Theoretical Conditions
Given threshold $\epsilon_{\mathrm{stale}}$, if training only uses samples satisfying $\lvert\pi_k(a\mid s)/\hat{\pi}^{(i)}(a\mid s) - 1\rvert \leq \epsilon_{\mathrm{stale}}$, then the effective staleness on the conditional distribution of retained samples (which we may denote by $\hat{S}_k^{\mathrm{eff}}$) can be controlled to at most $\epsilon_{\mathrm{stale}}/2$. This controls the filtered training distribution, not the original sampling distribution's $\hat{S}_k$.
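A sketch of this filtering condition (the sampler-side log-probabilities would come from whatever the inference engine actually reports; the function name and threshold are mine):

```python
import numpy as np

def effective_staleness_filter(logp_current, logp_sampler, eps_stale=0.3):
    """Keep samples with |pi_k(a|s) / hat_pi^(i)(a|s) - 1| <= eps_stale.

    logp_current: log pi_k(a|s) recomputed on the training side.
    logp_sampler: log hat_pi^(i)(a|s) reported by the inference-side sampler.
    Returns a boolean mask of retained samples.
    """
    ratio = np.exp(np.asarray(logp_current) - np.asarray(logp_sampler))
    return np.abs(ratio - 1.0) <= eps_stale

mask = effective_staleness_filter([-1.00, -2.00, -0.50], [-1.05, -1.20, -0.52])
print(mask)  # the middle sample is dropped: its version/implementation gap is too large
```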
9. Summary: Theoretical Checklist
Core Theoretical Framework
The structure of the monotonic improvement lower bound is:
$$
J(\pi_{k+1}) - J(\pi_k) \geq \underbrace{L_{\beta^{(k)}}(\pi_{k+1})}_{\text{surrogate objective}} - \underbrace{C_1 \cdot U_k}_{\text{update shift penalty}} - \underbrace{C_2 \cdot S_k}_{\text{sampling staleness penalty}}
$$
Here $C_1$ and $C_2$ are just compressed notation for the theoretical coefficients introduced earlier. Concretely, one can read them as
$$
C_1 = \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2},
\qquad
C_2 = C_1 + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma},
$$
up to the exact naming convention used in the intermediate statements. They are not free hyperparameters.
If the advantage-replacement error from Section 5.4 is included explicitly, the structure becomes
$$
J(\pi_{k+1}) - J(\pi_k)
\gtrsim
L_{\beta^{(k)}}(\pi_{k+1})
- C_1 U_k
- C_2 S_k
- C_3 \epsilon_A,
$$
where $C_3$ is determined by conditions such as the bound on the importance ratio. This is not a new algorithmic term; it is a reminder that monotonic improvement also depends on the actual advantage estimate being close enough to the theoretical advantage.
Theoretical Separation of Concerns

| Control Term | Theoretical Meaning | Constraint Type | Distribution Object |
|---|---|---|---|
| $U_k$ | Current update shift | Constrain $\pi_{k+1}$ relative to $\pi_k$ | Target vs. proximal policy |
| $S_k$ | Sampling staleness shift | Constrain behavior-proximal distance | Behavior vs. proximal distribution |
| $\epsilon_A$ | Advantage replacement error | Bound $\hat A-A^{\pi_k}$ | Estimated vs. theoretical advantage |

Theoretical Role of Clipping Terms

| Clipped object | Theoretical role | Cost |
|---|---|---|
| Directly constrain $\pi_{k+1}/\pi^{(i)}$ | Controls update shift and part of staleness jointly | Stronger dependence on each behavior version |
| Constrain increment $\pi_{k+1}/\pi_k$ | Controls only current update shift $U_k$ | Requires separate control of $S_k$ |
| Hard clipping or filtering of ratios | Produces a more conservative surrogate bound | Introduces a biased objective that must be reinterpreted |

Theoretical Treatment of Training-Inference Inconsistency
Effective staleness $\hat{S}_k$ replaces $S_k$ when the inference-side sampler $\hat{\pi}^{(i)}$ differs from the training-side model of the behavior policy; the sample-filtering condition of Section 8.2 controls it on the retained-data distribution.
Appendix
A. Proof Sketch: From State-Distribution Difference to Average TV
A standard starting point for Lemma 3.1 is the fixed-point equation of discounted state visitation distributions:
$$
d_\pi = (1-\gamma)\rho_0 + \gamma P_\pi^\top d_\pi,
\qquad
d_{\pi_k} = (1-\gamma)\rho_0 + \gamma P_{\pi_k}^\top d_{\pi_k}.
$$
Subtracting the two equations and rearranging expresses $d_\pi-d_{\pi_k}$ in terms of the difference between policy-induced transition kernels acting on the old distribution. Taking an $\ell_1$ upper bound, using the non-expansiveness of Markov kernels in $\ell_1$, and then applying
$$
\|(P_\pi-P_{\pi_k})(\cdot\mid s)\|_1 \le 2D_{\mathrm{TV}}(\pi,\pi_k;s),
$$
one obtains the average-TV control
$$
\|d_\pi-d_{\pi_k}\|_1 \le \frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d_{\pi_k}}\big[D_{\mathrm{TV}}(\pi,\pi_k;s)\big].
$$
I omit the linear-operator algebra and constant bookkeeping here because the post only uses the final average-TV form.
B. Quick Reference for Key Symbols

| Symbol | Meaning |
|---|---|
| $\pi_k$, $\pi^{(i)}$ | Latest policy at round $k$; $i$-th old policy |
| $d_\pi(s)$, $A^\pi(s,a)$ | Discounted state visitation distribution; advantage function |
| $D_{\mathrm{TV}}(\pi, \pi'; s)$ | TV distance between two policies at state $s$ |
| $\beta^{(k)}(a \mid s, i) := \pi^{(i)}(a \mid s)$ | Mixture behavior policy at round $k$ |
| $q(i' \mid i)$, $\alpha_i^{(k)}$ | Index transition kernel; initial index distribution |
| $U_k$, $S_k$ | Update increment shift; sampling staleness |
| $\epsilon$, $\epsilon_{\mathrm{stale}}$, $W$ | Clipping radius; staleness threshold; version window |
| $C_{\pi,\pi_k}$ | Expected advantage upper bound coefficient |

References
Jacob Hilton, Karl Cobbe, John Schulman. "Batch size-invariance for policy optimization" (Decoupled PPO). arXiv:2110.00641. https://arxiv.org/abs/2110.00641