Introduction: Why Should We Care About “Off-Policy”?
Consider the following scenario: you are training a large language model with reinforcement learning to improve its question-answering capabilities. Ideally, each time the model generates a batch of responses, you would immediately update the model with this data, then use the updated model to generate new data, and so on. This approach of “updating with data from the same policy that generated it” is called on-policy training.
Reality, however, is not so simple. In large-scale distributed training, hundreds of GPUs generate data in parallel, while model updates take time. When a new model is deployed, much data generated by “older versions” of the model remains unused—discarding it seems wasteful, yet using it raises concerns about whether “stale data” might harm training effectiveness.
This is the core problem faced by off-policy training: Can we guarantee continued performance improvement when using data collected by older policies to update newer policies?
This article systematically addresses this question. Starting from foundational theory, we progressively derive actionable conditions that specify when mixing data from multiple policy versions can still guarantee monotonic training improvement.
Part I: Theoretical Foundations
1.1 Basic Setup
We consider a standard Markov Decision Process (MDP) comprising a state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $p(s'\mid s,a)$, reward function $r(s,a)$, initial distribution $\rho_0$, and discount factor $\gamma \in (0,1)$.
The expected cumulative discounted return of policy $\pi$ is:
$$ J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid \pi\right] $$
Discounted State Visitation Distribution
Represents the weighted frequency of visiting each state during long-term policy execution:
$$ d_\pi(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi) $$
Advantage Function
Measures how much better action $a$ is compared to the policy’s average:
$$ A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s) $$
Total Variation Distance (TV Distance)
Measures the difference between two policies’ action distributions at state $s$:
$$ D_{\mathrm{TV}}(\pi, \pi'; s) := \frac{1}{2} \sum_{a \in \mathcal{A}} |\pi(a \mid s) - \pi'(a \mid s)| $$
Throughout, we use $\mid$ for conditional probability (e.g., $\pi(a\mid s)$) and reserve $\|\cdot\|$ for norms.
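For concreteness, here is a minimal numerical sketch (ours, not part of the original derivation; the function name and tensor layout are illustrative) that evaluates this per-state TV distance for tabular/categorical policies:

```python
import torch

def tv_distance(pi_probs: torch.Tensor, pi_prime_probs: torch.Tensor) -> torch.Tensor:
    """Per-state TV distance D_TV(pi, pi'; s) = 0.5 * sum_a |pi(a|s) - pi'(a|s)|.

    Both inputs have shape (num_states, num_actions) with rows summing to 1;
    returns a tensor of shape (num_states,).
    """
    return 0.5 * (pi_probs - pi_prime_probs).abs().sum(dim=-1)

# Example: two action distributions at a single state.
pi = torch.tensor([[0.7, 0.2, 0.1]])
pi_prime = torch.tensor([[0.5, 0.3, 0.2]])
print(tv_distance(pi, pi_prime))  # tensor([0.2000])
```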
1.2 Core Tool: Policy Performance Difference Lemma
The cornerstone of the entire theory is this elegant result:
Lemma 1.1 (Policy Performance Difference Lemma)
For any policies $\pi_k$ (old) and $\pi$ (new), the performance difference can be expressed as:
$$ J(\pi) - J(\pi_k) = \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_\pi}\left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi_k}(s,a)] \right] $$
Intuitive understanding: How much better the new policy is than the old equals the “average advantage” obtained by selecting actions according to the new policy under the state distribution visited by the new policy.
Part II: Performance Improvement Bounds for Single-Policy Sampling
2.1 Distribution Mismatch and Controlling State Shift
The Policy Performance Difference Lemma has a practical issue: the expectation on the right-hand side is computed under $d_\pi$ (the new policy’s state distribution), while we can only sample from $d_{\pi_k}$ (the old policy).
The solution is to decompose the expectation into “expectation under the old distribution + bias term,” then control the bias. The key question is: What is the quantitative relationship between the difference in state distributions and the difference in policies?
Controlling State Distribution Differences
Lemma 1.2 (Relationship Between State Distribution Difference and Policy TV Distance)
$$ \|d_\pi - d_{\pi_k}\|_1 \leq \frac{2\gamma}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$
Physical Interpretation
Small differences between policies in action space are “amplified” through the environment dynamics into differences in state visitation distributions. The factor $\frac{\gamma}{1-\gamma}$ reflects temporal accumulation: in long-horizon tasks ($\gamma$ close to 1), the amplification is stronger.
Proof Sketch
By deriving the fixed-point equation for discounted visitation distributions and exploiting the $\ell_1$ non-expansiveness of stochastic matrices, one can show that state distribution differences are amplified by policy differences through transition dynamics, with the amplification factor being precisely $\frac{\gamma}{1-\gamma}$.
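One way to make this sketch concrete (a standard derivation in the TRPO/CPO line of analysis; the matrix notation below is ours): writing $P_\pi(s'\mid s) := \sum_a \pi(a\mid s)\,p(s'\mid s,a)$, the visitation distribution satisfies the fixed-point equation $d_\pi = (1-\gamma)\rho_0 + \gamma P_\pi^\top d_\pi$, hence
$$ d_\pi - d_{\pi_k} = \gamma\,(I - \gamma P_\pi^\top)^{-1}\,(P_\pi - P_{\pi_k})^\top d_{\pi_k}. $$
Taking $\ell_1$ norms, $\|(I - \gamma P_\pi^\top)^{-1}\|_1 \leq \frac{1}{1-\gamma}$ by the $\ell_1$ non-expansiveness of stochastic matrices, while $\|(P_\pi - P_{\pi_k})^\top d_{\pi_k}\|_1 \leq 2\,\mathbb{E}_{s\sim d_{\pi_k}}\big[D_{\mathrm{TV}}(\pi, \pi_k; s)\big]$, which together yield Lemma 1.2.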
2.2 Policy Performance Improvement Lower Bound
Theorem 1.1 (Policy Performance Improvement Lower Bound)
Define the expected advantage upper bound constant $C_{\pi,\pi_k} := \max_{s} \lvert \mathbb{E}_{a \sim \pi}[A^{\pi_k}(s,a)] \rvert$. Then:
$$ J(\pi) - J(\pi_k) \geq L_{\pi_k}(\pi) - \frac{2\gamma C_{\pi,\pi_k}}{(1-\gamma)^2} \mathbb{E}_{s \sim d_{\pi_k}} \big[ D_{\mathrm{TV}}(\pi, \pi_k; s) \big] $$
where the surrogate objective is:
$$ L_{\pi_k}(\pi) := \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_k}, a \sim \pi_k} \left[ \frac{\pi(a \mid s)}{\pi_k(a \mid s)} A^{\pi_k}(s,a) \right] $$
This lower bound consists of two parts:
- Surrogate objective $L_{\pi_k}(\pi)$: Can be directly estimated from old-policy data via importance sampling; this is the optimization objective of TRPO/PPO.
- Policy shift penalty: Increases with the TV distance between the new and old policies, explaining why PPO needs to constrain the update magnitude.
Core conclusion: Maximizing the surrogate objective while controlling policy shift guarantees performance improvement.
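To ground how the surrogate is computed in practice, here is a minimal sketch (ours; the function name, tensor names, and shapes are assumptions) of a Monte Carlo estimate of $L_{\pi_k}(\pi)$, up to the constant $1/(1-\gamma)$ factor, from data logged under $\pi_k$:

```python
import torch

def surrogate_objective(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of L_{pi_k}(pi) (up to the 1/(1-gamma) scale)
    from samples (s, a) drawn under the old policy pi_k.

    logp_new:   log pi(a|s) under the policy being optimized
    logp_old:   log pi_k(a|s), recorded at sampling time
    advantages: estimates of A^{pi_k}(s, a)
    All tensors share the shape (batch,).
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio pi(a|s) / pi_k(a|s)
    return (ratio * advantages).mean()
```

Maximizing this estimate alone is not enough; Theorem 1.1 says the policy shift penalty must be kept small at the same time, which is what the clipping mechanisms of Part V implement.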
Part III: Multi-Policy Static Mixture Sampling
3.1 Setup and Unified Modeling (Static Mixture)
In practice, a batch of data may come from multiple policy versions $\{\pi^{(1)}, \ldots, \pi^{(M)}\}$, with respective proportions $\alpha_1, \ldots, \alpha_M$. How do we extend Theorem 1.1 to this setting?
Core idea: augmented state space
The solution is an elegant modeling technique: treat the policy version index as part of the state.
Define the augmented state space $\tilde{\mathcal{S}} := \mathcal{S} \times \mathcal{I}$, where $\mathcal{I} = \{1, \ldots, M\}$ is the policy index set. Under augmented state $(s, i)$, the mixture behavior policy is defined as $\beta(a \mid s, i) := \pi^{(i)}(a \mid s)$.
The evolution of indices is characterized by the index transition kernel $q(i' \mid i)$. The augmented MDP inherits the original MDP’s rewards and environment transitions, with indices evolving independently according to $q(i'\mid i)$.
This technique works because the new policy $\pi$’s return in the augmented MDP equals its return in the original MDP, allowing direct application of Theorem 1.1.
3.2 Trajectory-Level Mixture: Simplification and Improvement Bound
The most common scenario is using a single old policy per trajectory: at trajectory start, sample index $I_0 \sim \alpha$, and use $\pi^{(I_0)}$ throughout. In this case, the index transition kernel is the identity: $q(i' \mid i) = \mathbf{1}_{i'=i}$.
From an engineering perspective, in many actor-learner asynchronous training setups (when sampling and training organize data by “entire trajectories/complete episodes belonging to a certain policy version”), this approximately corresponds to what we call trajectory-level mixture: actors use a fixed policy snapshot within a sampling unit to generate data, while learners mix trajectories from different versions for updates. We say “approximately” because different systems may not have identical boundaries for “trajectory/sampling unit.”
Lemma 2.1 (Structural Simplification for Trajectory-Level Mixture)
(a) The augmented state visitation distribution decomposes as: $d_{\beta}(s, i) = \alpha_i \cdot d_{\pi^{(i)}}(s)$
(b) The advantage function reduces to: $A^{\beta}((s, i), a) = A^{\pi^{(i)}}(s, a)$
Intuition for (b): Since the index never changes, all future trajectories starting from augmented state $(s,i)$ are generated by the same policy $\pi^{(i)}$. Therefore, future cumulative returns are entirely determined by $\pi^{(i)}$, and value functions and advantage functions naturally reduce to their $\pi^{(i)}$ counterparts.
Consequently, the mixture policy’s return is the weighted average of individual old policies’ returns: $J_{\mathrm{mix}} = \sum_{i=1}^{M} \alpha_i J(\pi^{(i)})$.
Improvement bound
Corollary 2.1 (Performance Improvement Lower Bound for Trajectory-Level Mixture)
$$ J(\pi) - \sum_{i=1}^{M} \alpha_i J(\pi^{(i)}) \geq \sum_{i=1}^{M} \alpha_i L_{\pi^{(i)}}(\pi) - \frac{2\gamma \max_i C_{\pi, \pi^{(i)}}}{(1-\gamma)^2} \sum_{i=1}^{M} \alpha_i \mathbb{E}_{s \sim d_{\pi^{(i)}}} \big[ D_{\mathrm{TV}}(\pi, \pi^{(i)}; s) \big] $$
This result shows that when mixing trajectories from multiple old policy versions for training, if we construct the loss using importance ratios corresponding to each trajectory’s source policy while controlling the new policy’s deviation from each old policy, the new policy’s performance has a clear improvement lower bound.
Part IV: Dynamic Mixture Sampling and Monotonic Improvement Conditions
4.1 Problem and Unified Modeling (Dynamic Mixture)
Part III discussed static mixture—where mixture weights $\alpha_i$ remain fixed. This section considers the more general dynamic mixture—where sampling gradually transitions to the new policy after it is released.
The previous results characterize improvement of “the new policy relative to the mixture behavior policy.” However, in actual training, what we truly care about is: Does the latest policy $\pi_{k+1}$ after each update monotonically improve over the previous latest policy $\pi_k$?
$$ J(\pi_{k+1}) \geq J(\pi_k) $$
Unified Modeling Framework
Two typical forms of dynamic mixture sampling can be uniformly characterized by the index transition kernel $q(i'\mid i)$:
Trajectory-level mixture (can be viewed as an abstraction of conventional asynchronous training; identity index transition): $q(i'\mid i) = \mathbf{1}\{i'=i\}$
Step/segment-level mixture (an abstraction of partial rollout / segment-based sampling; allows switching): $q(i'\mid i) = (1-\sigma(i))\mathbf{1}\{i'=i\} + \sigma(i)\kappa(i'\mid i)$
where $\sigma(i)$ is the switching probability and $\kappa(\cdot\mid i)$ is the target index distribution.
4.2 Decomposition and Monotonic Improvement Bound
By introducing the mixture return $J_{\mathrm{mix}}^{(k)}$ as an intermediate bridge, the performance difference decomposes as:
$$ J(\pi_{k+1}) - J(\pi_k) = \underbrace{[J(\pi_{k+1}) - J_{\mathrm{mix}}^{(k)}]}_{\text{improvement over mixture policy}} + \underbrace{[J_{\mathrm{mix}}^{(k)} - J(\pi_k)]}_{\text{mixture bias term}} $$
The first term can be handled using Theorem 1.1. The second term is the mixture bias term, which can be shown to satisfy:
$$ J_{\mathrm{mix}}^{(k)} - J(\pi_k) \geq -\frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big] $$
Monotonic Improvement Bound
Combining the above results yields the core theorem:
Theorem 3.1 (Monotonic Improvement Lower Bound Under Dynamic Mixture Sampling)
$$ \begin{aligned} J(\pi_{k+1}) - J(\pi_k) \geq\;& L_{\beta^{(k)}}(\pi_{k+1}) \\ &- \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \big] \\ &- \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[ D_{\mathrm{TV}}(\pi^{(i)}, \pi_k; s) \big] \end{aligned} $$
Here $L_{\beta^{(k)}}(\pi_{k+1})$ denotes the surrogate objective relative to the behavior policy $\beta^{(k)}$ (the same form as $L_{\pi_k}(\pi)$ in Part II, but with the behavior policy generalized from a single $\pi_k$ to the mixture $\beta^{(k)}$).
More explicitly, one can write $$ L_{\beta^{(k)}}(\pi_{k+1}) := \frac{1}{1-\gamma} \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}},\, a\sim \pi^{(i)}(\cdot\mid s)}\left[\frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}\,A^{\beta^{(k)}}((s,i),a)\right]. $$
Similarly, define $$ C_{\pi_{k+1},\beta^{(k)}} := \max_{(s,i)}\left|\mathbb{E}_{a\sim \pi_{k+1}(\cdot\mid s)}\big[A^{\beta^{(k)}}((s,i),a)\big]\right|. $$
This lower bound reveals the necessity of dual control:
- Update shift penalty: Deviation of the new policy $\pi_{k+1}$ from the sampling source policy $\pi^{(i)}$
- Sampling staleness penalty: Staleness of the sampling source policy $\pi^{(i)}$ relative to the current policy $\pi_k$
4.3 Why Direct Constraints Are Infeasible: Triangle Inequality Decomposition
The update shift penalty term in Theorem 3.1 might appear controllable by constraining $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s)$, but this is actually infeasible:
Observation 3.1 (Infeasibility of Update Shift Constraints)
Suppose the mixture sampling includes two old policies $\pi^{(1)}$ and $\pi^{(2)}$. If there exists some state $s$ such that $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) > 2\delta$, then no policy $\pi_{k+1}$ can simultaneously satisfy $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(1)}; s) \leq \delta$ and $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(2)}; s) \leq \delta$.
Proof
By the triangle inequality, if both constraints were satisfied, then $D_{\mathrm{TV}}(\pi^{(1)}, \pi^{(2)}; s) \leq 2\delta$, a contradiction.
Root Cause
The update shift penalty directly couples $\pi_{k+1}$ with the historical policy family $\{\pi^{(i)}\}$, whose internal structure is a product of historical training and not controllable by the current update.
Triangle Inequality Decomposition
The solution leverages the triangle inequality of TV distance:
$$ D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)}; s) \leq D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s) + D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s) $$
This decomposes the coupled constraint into two independent parts:
- Update increment shift $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)$: Deviation of the new policy from the current policy, controllable by the optimization side
- Sampling staleness $D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)$: Deviation of the current policy from each old policy, must be controlled by the sampling side
Define:
$$ U_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)\big], \quad S_k := \mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}} \big[D_{\mathrm{TV}}(\pi_k, \pi^{(i)}; s)\big] $$
Corollary 3.2 (Decomposed Monotonic Improvement Lower Bound)
$$ J(\pi_{k+1}) - J(\pi_k) \geq L_{\beta^{(k)}}(\pi_{k+1}) - \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} U_k - \left( \frac{2\gamma C_{\pi_{k+1},\beta^{(k)}}}{(1-\gamma)^2} + \frac{2\|A^{\pi_k}\|_\infty}{1-\gamma} \right) S_k $$
Why Does Decomposition Solve the Problem?
The key is that after decomposition, $U_k$ only involves the new policy $\pi_{k+1}$ and the current policy $\pi_k$, completely independent of the structure of the old policy family $\{\pi^{(i)}\}$. Therefore, regardless of how different the old policies are from each other, constraining $U_k$ is always feasible—this is precisely the resolution to the infeasibility issue revealed in Observation 3.1.
This reveals an important engineering principle—separation of concerns:
| Control Term | Responsible Party | Control Mechanism |
|---|---|---|
| $U_k$ (update increment shift) | Optimization algorithm | Policy clipping |
| $S_k$ (sampling staleness) | Sampling system | Data filtering, version window |
Part V: Theoretical Foundations of Clipping Mechanisms
5.1 From TV Distance to Sample-Controllable Quantities
Corollary 3.2 tells us that to guarantee monotonic improvement, we need to control the update increment shift $U_k = \mathbb{E}[D_{\mathrm{TV}}(\pi_{k+1}, \pi_k; s)]$. However, TV distance is a distribution-level quantity—how can we control it using samples?
The key bridge is the following identity:
Lemma 3.3 (Ratio Difference Representation of TV Distance)
Suppose policy $\pi_1$’s support covers the supports of $\pi$ and $\pi_2$. Then for any state distribution $\mu$:
$$ \mathbb{E}_{s\sim \mu} \big[D_{\mathrm{TV}}(\pi, \pi_2; s)\big] = \frac{1}{2} \mathbb{E}_{s\sim \mu, a\sim\pi_1(\cdot\mid s)} \left| \frac{\pi(a\mid s)}{\pi_1(a\mid s)} - \frac{\pi_2(a\mid s)}{\pi_1(a\mid s)} \right| $$
Intuitive Understanding
The left side is the TV distance between two distributions (requiring enumeration over all actions), while the right side is the absolute difference of two importance ratios when sampling under $\pi_1$. This enables us to estimate and control TV distance using samples.
Sample Representation of $U_k$
Using Lemma 3.3, setting $\pi = \pi_{k+1}$, $\pi_2 = \pi_k$, $\pi_1 = \pi^{(i)}$ (the sampling source policy), we obtain:
$$ U_k = \frac{1}{2} \mathbb{E}_{(s,i) \sim d_{\beta^{(k)}}, a \sim \pi^{(i)}(\cdot\mid s)} \left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right| $$
Denoting $\rho_{k+1} := \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}$ and $\rho_k := \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)}$, we have:
$$ U_k = \frac{1}{2} \mathbb{E}_{(s,i,a) \sim \text{training data}} \big| \rho_{k+1} - \rho_k \big| $$
This means: If we can ensure $\lvert\rho_{k+1} - \rho_k\rvert \leq \epsilon$ for each sample, we can guarantee $U_k \leq \epsilon/2$.
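A minimal sketch (ours; the logged quantities and names are assumptions) of the sample-level estimators of $U_k$ and $S_k$ implied by Lemma 3.3, given per-sample log-probabilities under the sampling policy $\pi^{(i)}$, the current policy $\pi_k$, and the candidate update $\pi_{k+1}$:

```python
import torch

def estimate_Uk_Sk(logp_new: torch.Tensor,      # log pi_{k+1}(a|s)
                   logp_cur: torch.Tensor,      # log pi_k(a|s)
                   logp_behav: torch.Tensor):   # log pi^{(i)}(a|s), sampling policy
    """Estimate U_k and S_k from samples drawn under the mixture behavior policy.

    Lemma 3.3 with pi_1 = pi^{(i)} gives
      U_k ~= 0.5 * E| pi_{k+1}/pi^{(i)} - pi_k/pi^{(i)} |
      S_k ~= 0.5 * E| pi_k/pi^{(i)} - 1 |
    """
    rho_new = torch.exp(logp_new - logp_behav)  # pi_{k+1} / pi^{(i)}
    rho_cur = torch.exp(logp_cur - logp_behav)  # pi_k / pi^{(i)}
    u_k = 0.5 * (rho_new - rho_cur).abs().mean()
    s_k = 0.5 * (rho_cur - 1.0).abs().mean()
    return u_k, s_k
```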
5.2 Constraining $U_k$: Two Clipping Options
Method 1: Direct Constraint on Ratio Difference
For each sample $(s, i, a)$, require:
$$ \left| \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)} - \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} \right| \leq \epsilon $$
The clipping interval is $\left[\frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} - \epsilon, \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)} + \epsilon\right]$, with clipping center at $\rho_k$ rather than 1.
Method 2: Constraint on Incremental Ratio
Noting that $\rho_{k+1} - \rho_k = \rho_k \cdot \left(\frac{\pi_{k+1}}{\pi_k} - 1\right)$, we have:
$$ |\rho_{k+1} - \rho_k| = \rho_k \cdot \left|\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right| $$
If we constrain $\left\lvert\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right\rvert \leq \epsilon$, since $\mathbb{E}_{a\sim\pi^{(i)}}[\rho_k] = 1$, one can show $U_k \leq \epsilon/2$.
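Spelling out that one line, using $\left\lvert\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right\rvert \leq \epsilon$ on every sample:
$$ U_k = \frac{1}{2}\,\mathbb{E}\left[\rho_k \left|\frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)} - 1\right|\right] \leq \frac{\epsilon}{2}\,\mathbb{E}_{(s,i)\sim d_{\beta^{(k)}}}\Big[\mathbb{E}_{a\sim\pi^{(i)}(\cdot\mid s)}[\rho_k]\Big] = \frac{\epsilon}{2}. $$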
This method clips $\pi_{k+1}/\pi_k$ with center at 1, meaning the clipping constraint itself does not depend on the old policy family $\pi^{(i)}$. However, if we use the weighted advantage $\hat{A}=\rho_k\cdot A^{\beta^{(k)}}$ below, we still need per-sample behavior probabilities (or recorded logprobs) to compute $\rho_k$.
Objective Functions (Three Clipping Mechanisms)
For comparison, we present the complete objective functions for three clipping mechanisms. Suppose the current sample comes from old policy $\pi^{(i)}$, and denote:
- $\rho_{k+1} = \frac{\pi_{k+1}(a\mid s)}{\pi^{(i)}(a\mid s)}$ (new policy’s ratio relative to sampling policy)
- $\rho_k = \frac{\pi_k(a\mid s)}{\pi^{(i)}(a\mid s)}$ (current policy’s ratio relative to sampling policy)
- $r = \frac{\pi_{k+1}(a\mid s)}{\pi_k(a\mid s)}$ (new policy’s incremental ratio relative to current policy)
Note: under trajectory-level mixture (index fixed), $A^{\beta^{(k)}}((s,i),a)=A^{\pi^{(i)}}(s,a)$, so per-trajectory advantages from the corresponding old policy are consistent; under step/segment-level mixture, replacing $A^{\beta^{(k)}}$ with $A^{\pi^{(i)}}$ introduces advantage-substitution bias (discussed in Part VI), so the advantage/value estimator must reflect future index switching.
Standard PPO
Clip $\rho_{k+1}$ with center at 1
$$ L^{\mathrm{PPO}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\pi^{(i)}}, \; \mathrm{clip}(\rho_{k+1}, 1-\epsilon, 1+\epsilon) \cdot A^{\pi^{(i)}} \right) \right] $$
Method 1
Clip $\rho_{k+1}$ with center at $\rho_k$
$$ L^{\mathrm{M1}} = \mathbb{E} \left[ \min\left( \rho_{k+1} \cdot A^{\beta^{(k)}}, \; \mathrm{clip}(\rho_{k+1}, \rho_k-\epsilon, \rho_k+\epsilon) \cdot A^{\beta^{(k)}} \right) \right] $$
Method 2
Clip incremental ratio $r$ with center at 1
$$ L^{\mathrm{M2}} = \mathbb{E} \left[ \min\left( r \cdot \hat{A}, \; \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) \cdot \hat{A} \right) \right] $$
where $\hat{A} = \rho_k \cdot A^{\beta^{(k)}}$ is the importance-weighted advantage estimate.
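For concreteness, a minimal sketch (ours; function and tensor names are assumptions, and the stored behavior/current log-probabilities are treated as constants) of the three clipped objectives side by side, written as per-sample values to be maximized:

```python
import torch

def clipped_objectives(logp_new, logp_cur, logp_behav, adv, eps: float = 0.2):
    """Per-sample clipped surrogate objectives (maximize; use the negative mean as a loss).

    logp_new:   log pi_{k+1}(a|s), from the policy being optimized (has gradients)
    logp_cur:   log pi_k(a|s), stored constant for the current policy
    logp_behav: log pi^{(i)}(a|s), stored constant for the sampling policy
    adv:        advantage estimate A^{beta^{(k)}} (equals A^{pi^{(i)}} under
                trajectory-level mixture)
    """
    rho_new = torch.exp(logp_new - logp_behav)  # pi_{k+1} / pi^{(i)}
    rho_cur = torch.exp(logp_cur - logp_behav)  # pi_k / pi^{(i)}
    r = torch.exp(logp_new - logp_cur)          # pi_{k+1} / pi_k

    # Standard PPO: clip rho_new around 1.
    l_ppo = torch.min(rho_new * adv, torch.clamp(rho_new, 1 - eps, 1 + eps) * adv)

    # Method 1: clip rho_new around the adaptive center rho_cur.
    l_m1 = torch.min(rho_new * adv, torch.clamp(rho_new, rho_cur - eps, rho_cur + eps) * adv)

    # Method 2: clip the incremental ratio r around 1, with weighted advantage rho_cur * adv.
    a_hat = rho_cur * adv
    l_m2 = torch.min(r * a_hat, torch.clamp(r, 1 - eps, 1 + eps) * a_hat)

    return l_ppo.mean(), l_m1.mean(), l_m2.mean()
```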
5.3 Comparison and Practical Controls
Table 5.1 Comparison of Three Clipping Mechanisms
| Method | Clipped Variable | Clipping Center | Clipping Interval | Constrained TV Distance |
|---|---|---|---|---|
| Standard PPO | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)})$ |
| Method 1 | $\rho_{k+1} = \pi_{k+1}/\pi^{(i)}$ | $\rho_k = \pi_k/\pi^{(i)}$ | $[\rho_k-\epsilon, \rho_k+\epsilon]$ | $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$ |
| Method 2 | $r = \pi_{k+1}/\pi_k$ | $1$ | $[1-\epsilon, 1+\epsilon]$ | $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$ |
The Fundamental Problem with Standard PPO Under Multi-Policy Mixture
Standard PPO constrains $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)})$, requiring the new policy to be simultaneously close to all sampling source policies. By Observation 3.1, when the old policies $\pi^{(1)}, \pi^{(2)}, \ldots$ differ significantly from each other, no $\pi_{k+1}$ can simultaneously satisfy all constraints. This causes the trust region intersection to shrink or even become empty, with updates being limited by the most stale policy.
Common Advantages of Methods 1 and 2
Both methods constrain $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$—the deviation of the new policy from the current policy (rather than the sampling policy). Since $\pi_k$ is uniquely determined, this constraint is consistent across all sample sources, completely avoiding the infeasibility problem.
Method 1 vs Method 2
| Comparison Dimension | Method 1 (Adaptive Clipping) | Method 2 (Incremental Clipping) |
|---|---|---|
| Stale samples ($\rho_k \gg 1$) | Automatically tightens constraints, more conservative | May produce large gradient variance |
| LLM large vocabulary low-probability tokens | Allows larger absolute changes (additive) | Absolute changes are limited (multiplicative) |
| Implementation complexity | Requires storing $\pi^{(i)}(a\mid s)$ and $\pi_k(a\mid s)$ | Needs $\pi_k(a\mid s)$ and $\pi^{(i)}(a\mid s)$ (or stored logprobs) to compute $\rho_k$; clipping itself uses only $\pi_{k+1}/\pi_k$ |
| Advantage function | Uses $A^{\beta^{(k)}}$ | Uses weighted advantage $\rho_k \cdot A^{\beta^{(k)}}$ |
Detailed Explanations
(1) Handling Stale Samples
When samples come from very old policies, $\rho_k = \pi_k/\pi^{(i)}$ can be large.
- Method 2’s integrand is $\rho_k \cdot \lvert r - 1\rvert$; even if $\lvert r-1\rvert \leq \epsilon$, the integrand can reach $\epsilon \cdot \rho_k$, producing spikes.
- Method 1 directly constrains $\lvert\rho_{k+1} - \rho_k\rvert \leq \epsilon$; the integrand’s upper bound is always $\epsilon$, unaffected by $\rho_k$ amplification.
(2) LLM Large Vocabulary Issue
In large language models, the vocabulary is large and many tokens have very small probabilities.
- Method 2 constrains $\pi_{k+1} \in [(1-\epsilon)\pi_k, (1+\epsilon)\pi_k]$, which is a multiplicative constraint: if $\pi_k(a\mid s) = 10^{-6}$, the allowed absolute change is only $\epsilon \times 10^{-6}$.
- Method 1 constrains $\lvert\pi_{k+1} - \pi_k\rvert \leq \epsilon \cdot \pi^{(i)}$, which is an additive constraint: if that token has higher probability under the old policy (e.g., $\pi^{(i)}(a\mid s) = 0.1$), even if the current probability is very low, faster improvement is allowed.
Controlling Sampling Staleness
Corollary 3.2 shows that $S_k$ also affects the monotonic improvement lower bound, but it cannot be controlled through optimization-side clipping and must be implemented by the sampling system:
(1) Discarding Stale Data
Set a threshold $\epsilon_{\mathrm{stale}}$. For each sample, compute $\lvert\rho_k - 1\rvert = \lvert\pi_k(a\mid s)/\pi^{(i)}(a\mid s) - 1\rvert$, and discard samples exceeding the threshold.
(2) Controlling Policy Version Window
Limit the number of old policy versions in the mixture sampling, e.g., using only data from the most recent $W$ versions.
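A minimal sketch (ours; the data layout, field names, and thresholds are assumptions) of sampling-side staleness control, combining the per-sample ratio filter with a version window:

```python
import torch

def filter_stale_samples(batch: dict, current_version: int,
                         eps_stale: float = 0.25, window: int = 4) -> dict:
    """Drop samples whose source policy is too stale.

    Expects a dict of equally sized tensors containing at least:
      'logp_cur'   : log pi_k(a|s) under the current policy
      'logp_behav' : log pi^{(i)}(a|s) recorded at sampling time
      'version'    : integer version index i of the policy that generated each sample
    Keeps samples with |pi_k/pi^{(i)} - 1| <= eps_stale whose version lies in the
    most recent `window` versions.
    """
    rho_cur = torch.exp(batch['logp_cur'] - batch['logp_behav'])
    fresh_ratio = (rho_cur - 1.0).abs() <= eps_stale
    fresh_version = batch['version'] >= current_version - window + 1
    keep = fresh_ratio & fresh_version
    return {key: value[keep] for key, value in batch.items()}
```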
Operational Meaning of Clipping
Finally, we clarify the relationship between clipping and the theoretical lower bound.
In Corollary 3.2, the coefficient of $U_k$, namely $C_{\pi_{k+1},\beta^{(k)}}$, depends on the new policy $\pi_{k+1}$, so the penalty term cannot be simply replaced by a constant. The correct operational meaning is:
Maximize the surrogate objective $L_{\beta^{(k)}}(\pi_{k+1})$ subject to the constraint $U_k \leq \epsilon/2$
The clipping objective function is precisely an implementation of this constrained optimization—clipping hard limits the update magnitude to ensure $U_k$ is controllable; under this premise, gradient ascent improves the surrogate objective, thereby providing guarantees for monotonic policy improvement.
Section Summary
This section established the theoretical foundations of clipping mechanisms:
- Lemma 3.3 converts TV distance to sample-level ratio differences, serving as the bridge between theory and implementation
- Two constraint methods: Method 1 (adaptive clipping center) and Method 2 (fixed incremental clipping), both guaranteeing $U_k \leq \epsilon/2$
- Comparison with standard PPO: Standard PPO constrains $D_{\mathrm{TV}}(\pi_{k+1}, \pi^{(i)})$, which is infeasible under multi-policy mixture; Methods 1/2 constrain $D_{\mathrm{TV}}(\pi_{k+1}, \pi_k)$, avoiding this issue
- Method selection: Method 1 (adaptive) is recommended for high staleness or LLM large vocabulary scenarios; Method 2 (incremental) is attractive when you want the trust-region/clipping constraint to avoid depending on the old policy family (but data still needs to provide behavior logprobs to compute $\rho_k$)
- $S_k$ control is the sampling side’s responsibility, implemented through data filtering and version windows
- Clipping is constrained optimization: Maximize the surrogate objective subject to $U_k$ constraints
Part VI: Comparison of Trajectory-Level and Step/Segment-Level Mixture
6.1 Mechanism Differences and Estimation Implications
The essential difference between the two mixture mechanisms lies in the structure of the index transition kernel:
- Trajectory-level mixture: $q(i'\mid i) = \mathbf{1}\{i'=i\}$, index never changes
- Step/segment-level mixture: $\sigma(i) > 0$, allows within-trajectory switching
The correspondence with common engineering terminology is:
- Trajectory-level mixture here can be roughly understood as an idealized abstraction of “conventional asynchronous training”: data is organized by entire trajectories/episodes belonging to a certain policy version;
- Step/segment-level mixture here can be roughly understood as an abstraction of “partial rollout”: due to asynchrony between actors and learners, and possible refresh to new policy versions at segment boundaries, using an index transition kernel that allows “within-trajectory version switching” can better approximate this phenomenon.
The key watershed is whether Lemma 2.1’s structural simplification holds: trajectory-level mixture satisfies advantage function reduction; step/segment-level mixture generally does not, because future returns are affected by the index transition kernel.
Differences in Sampling Staleness $S_k$
Trajectory-level mixture’s staleness arises from: mixture weights $\alpha_i^{(k)}$ retaining probability mass on old policies after new policy release.
Step/segment-level mixture has an exponential compression effect: Consider a simplified model with switching probability $\sigma$ from old to new. The marginal probability mass on old indices under the discounted visitation distribution is $\frac{1-\gamma}{1-\gamma(1-\sigma)}$. As long as $\sigma \gg 1-\gamma$, the old policy weight can be significantly compressed.
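For instance (illustrative numbers only), with $\gamma = 0.99$ and a per-step switching probability $\sigma = 0.1$:
$$ \frac{1-\gamma}{1-\gamma(1-\sigma)} = \frac{0.01}{1 - 0.99 \times 0.9} = \frac{0.01}{0.109} \approx 0.092, $$
so only about 9% of the discounted visitation mass of a trajectory started under an old index stays on old policies, compared with all of it under trajectory-level mixture, where the index never switches.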
Differences in Surrogate Objective Estimation
Trajectory-level mixture: The advantage function reduces to $A^{\pi^{(i)}}(s,a)$, with a clear estimation path.
Advantage substitution bias in step/segment-level mixture: If single-policy advantage estimates are used, systematic bias will arise. The reason is that $A^{\beta^{(k)}}((s,i),a)$ requires taking expectations over future index switching, while $A^{\pi^{(i)}}(s,a)$ implicitly assumes “the future always follows $\pi^{(i)}$.”
Unification Under Bandit Setting
In single-step-episode (bandit-style) LLM training there are no subsequent state transitions, so the estimation problems of the two mechanisms coincide and the advantage-substitution bias disappears.
6.2 Risks and Applicable Scenarios
Step/segment-level mixture has another hidden concern: even if single-step importance ratios are clipped, multi-step noise accumulation over long trajectories can still amplify gradient estimation variance. When policy changes per update are large, “behavioral discontinuities” within trajectories may induce heavier-tailed ratio distributions. This is why trajectory-level mixture is recommended for “large policy change per update” scenarios in the table below.
Applicable Scenarios
Table 6.1 Applicable Scenarios for Two Mixture Mechanisms
| Scenario Characteristics | Recommended Mechanism | Rationale |
|---|---|---|
| Long trajectories, high-frequency updates, strong asynchrony | Step/segment-level | Can significantly compress $S_k$ |
| Short trajectories (non-bandit) | Trajectory-level | $S_k$ is naturally low |
| Large policy change per update | Trajectory-level | Avoids variance amplification |
| Single-step episode (bandit) | Either | Choose based on implementation convenience |
| Need for compromise | Segment-level | Switch at natural boundaries |
Core trade-off: Step/segment-level mixture is stronger on the sampling side (fast staleness removal), while trajectory-level mixture is more stable on the estimation side (easier surrogate objective estimation).
Part VII: Handling Training-Inference Inconsistency
7.1 Background and Effective Staleness
In large-scale distributed training, policies on the inference side and training side may be inconsistent:
- Numerical implementation differences: softmax normalization, quantization, kernel fusion
- Decoding rule differences: temperature scaling, top-p/top-k sampling
Let the behavior policy modeled on the training side be $\pi^{(i)}$, while the policy actually sampling on the inference side is $\hat{\pi}^{(i)}$.
Effective Staleness
Define effective staleness:
$$ \hat{S}_k := \mathbb{E}_{(s,i) \sim d_{\hat{\beta}^{(k)}}} \big[ D_{\mathrm{TV}}(\pi_k, \hat{\pi}^{(i)}; s) \big] $$
This definition simultaneously covers version staleness and training-inference implementation differences.
7.2 Actionable Control
By Lemma 3.3, $\hat{S}_k$ can be expressed in sample-level computable form. Given threshold $\epsilon_{\mathrm{stale}}$, if training only uses samples satisfying $\lvert\pi_k(a\mid s)/\hat{\pi}^{(i)}(a\mid s) - 1\rvert \leq \epsilon_{\mathrm{stale}}$, then $\hat{S}_k \leq \epsilon_{\mathrm{stale}}/2$.
Key Implementation Points
- Behavior denominator alignment: The behavior probability in the loss should use the inference-side recorded $\hat{\pi}^{(i)}(a\mid s)$
- Probability smoothing: If the inference side truncates the distribution (e.g., top-p/top-k), ensure the recorded behavior probabilities keep the importance ratios well-defined
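A minimal sketch (ours; names are assumptions) of how these points combine: the ratio uses the inference-side recorded log-probability as the behavior denominator, and samples violating the effective-staleness threshold are dropped:

```python
import torch

def effective_staleness_mask(logp_cur: torch.Tensor,
                             logp_behav_inference: torch.Tensor,
                             eps_stale: float = 0.25) -> torch.Tensor:
    """Boolean mask selecting samples with |pi_k(a|s)/hat_pi^{(i)}(a|s) - 1| <= eps_stale.

    logp_behav_inference is the log-probability recorded by the inference engine
    at sampling time (hat_pi^{(i)}); using training-side recomputed probabilities
    here would hide the training-inference mismatch.
    """
    rho = torch.exp(logp_cur - logp_behav_inference)
    return (rho - 1.0).abs() <= eps_stale
```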
Summary: Practical Guidelines
Core Theoretical Framework
The structure of the monotonic improvement lower bound is:
$$ J(\pi_{k+1}) - J(\pi_k) \geq \underbrace{L_{\beta^{(k)}}(\pi_{k+1})}_{\text{surrogate objective}} - \underbrace{C_1 \cdot U_k}_{\text{update shift penalty}} - \underbrace{C_2 \cdot S_k}_{\text{sampling staleness penalty}} $$
Separation of Concerns Principle
| Control Term | Responsible Party | Control Mechanism | Specific Operation |
|---|---|---|---|
| $U_k$ | Optimization algorithm | Policy clipping | Clip update increments (e.g., clip $\pi_{k+1}/\pi_k$) |
| $S_k$ | Sampling system | Data filtering | Discard stale samples |
| $S_k$ | Sampling system | Version window | Use only most recent $W$ versions |
Clipping Method Selection
| Scenario | Recommended Method | Rationale |
|---|---|---|
| High staleness | Method 1 (adaptive) | Automatically tightens constraints for stale samples |
| Implementation simplicity prioritized | Method 2 (incremental) | Clipping itself depends only on $\pi_{k+1}/\pi_k$; behavior logprobs are still needed to form $\rho_k$ |
| LLM large vocabulary | Method 1 | Avoids slow updates for low-probability tokens |
Handling Training-Inference Inconsistency
- Use inference-side recorded $\hat{\pi}^{(i)}$ as the behavior denominator
- Compress effective staleness through sample filtering
Appendix: Quick Reference for Key Symbols
| Symbol | Meaning |
|---|---|
| $\pi_k$, $\pi^{(i)}$ | Latest policy at round $k$, $i$-th old policy |
| $d_\pi(s)$, $A^\pi(s,a)$ | Discounted state visitation distribution, advantage function |
| $D_{\mathrm{TV}}(\pi, \pi'; s)$ | TV distance between two policies at state $s$ |
| $\beta^{(k)}(a \mid s, i) := \pi^{(i)}(a \mid s)$ | Mixture behavior policy at round $k$ |
| $q(i' \mid i)$, $\alpha_i^{(k)}$ | Index transition kernel, initial index distribution |
| $U_k$, $S_k$ | Update increment shift, sampling staleness |
| $\epsilon$, $\epsilon_{\mathrm{stale}}$, $W$ | Clipping radius, staleness threshold, version window |
| $C_{\pi,\pi_k}$ | Expected advantage upper bound constant |
References
- John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. “Trust Region Policy Optimization” (TRPO). arXiv:1502.05477. https://arxiv.org/abs/1502.05477
- Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel. “Constrained Policy Optimization” (CPO). arXiv:1705.10528. https://arxiv.org/abs/1705.10528
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. “Proximal Policy Optimization Algorithms” (PPO). arXiv:1707.06347. https://arxiv.org/abs/1707.06347
- James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras. “Generalized Proximal Policy Optimization with Sample Reuse” (GePPO). arXiv:2111.00072. https://arxiv.org/abs/2111.00072
- Yuzhen Zhou, Jiajun Li, Yusheng Su, et al. “APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation” (APRIL; partial rollout). arXiv:2509.18521. https://arxiv.org/abs/2509.18521
- Jacob Hilton, Karl Cobbe, John Schulman. “Batch size-invariance for policy optimization” (Decoupled PPO). arXiv:2110.00641. https://arxiv.org/abs/2110.00641
@misc{WangZhang2025OffPolicyLLMRL,
author = {Wang, Xihuai and Zhang, Shao},
title = {Off-Policy Training in LLM Reinforcement Learning: From Theory to Practice},
year = {2025},
month = dec,
day = {17},
url = {https://xihuai18.github.io/reinforcement-learning/2025/12/17/offpolicy-en.html},
urldate = {2025-12-17}
}