-
Understanding KL Divergence Estimators in RL: From Value Approximation to Gradient Estimation
How the KL divergence is estimated directly affects training stability. This post dissects three classic estimators, k1, k2, and k3, covering both the on-policy and off-policy settings, and gives practical rules for choosing among them when the KL term is used as a reward penalty versus as a loss that backpropagates.
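For quick reference, here is a minimal PyTorch sketch of the standard per-sample k1, k2, k3 estimators of KL(pi_current || pi_ref), computed from actions sampled under the current policy; the function and tensor names (`kl_estimates`, `logp_current`, `logp_ref`) are illustrative, not taken from the post.

```python
import torch

def kl_estimates(logp_current: torch.Tensor, logp_ref: torch.Tensor):
    """Monte Carlo estimates of KL(pi_current || pi_ref).

    Both inputs hold log-probabilities of the same actions, which are
    assumed to have been sampled from pi_current (the on-policy case).
    """
    log_ratio = logp_ref - logp_current   # log r, with r = pi_ref / pi_current
    ratio = log_ratio.exp()
    k1 = -log_ratio                       # unbiased, but can be negative; high variance
    k2 = 0.5 * log_ratio ** 2             # biased, always non-negative, lower variance
    k3 = ratio - 1.0 - log_ratio          # unbiased and always non-negative
    return k1.mean(), k2.mean(), k3.mean()
```

When the KL term only enters the reward as a penalty, only its value matters; when it is added to the loss and backpropagated, the gradient of whichever estimator is chosen matters as well, which is the value-versus-gradient distinction the title refers to.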
-
A Simple Guide to KL Divergence Estimators in RL: From Value Estimation to Gradient Estimation
In reinforcement learning, how the KL divergence is estimated directly affects training stability. This post systematically dissects the differing properties of three classic estimators, k1, k2, and k3, covers both the on-policy and off-policy settings, and gives selection guidelines for using them as reward penalties versus as losses that backpropagate.
-
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
Modern LLM RL pipelines often train under an "old policy" that silently drifts away from the behavior policy that actually generates rollouts, breaking the usual on-policy assumptions. This post rewrites the classic TRPO lower bound in a three-policy form — behavior, reference, and target — so that the performance gap cleanly decomposes into two TV distances that we can reason about and control. Seen through this lens, methods like Decoupled PPO, AReaL, TIS, IcePop, sequence-level MIS, Worst Token Reject Sampling (WTRS), MoE routing replay, and common engineering tricks for training–inference alignment all become different ways of shrinking these two deviations.
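For orientation, the two-policy ingredient this builds on is the classic TRPO lower bound; a LaTeX sketch is given below, together with the triangle inequality for total variation that hints at how a behavior-versus-reference deviation can be split off. The exact three-policy bound is the post's own result and is not reproduced here, and identifying \pi_old with the post's "reference" policy is an assumption made only for illustration.

```latex
% Classic two-policy TRPO lower bound (Schulman et al., 2015):
%   \eta(\pi)              expected discounted return of the target policy \pi
%   L_{\pi_{old}}(\pi)     surrogate objective built from \pi_{old}'s advantages
%   \epsilon = \max_{s,a} |A_{\pi_{old}}(s,a)|
\[
\eta(\pi) \;\ge\; L_{\pi_{\mathrm{old}}}(\pi)
  - \frac{4\epsilon\gamma}{(1-\gamma)^2}
    \Big(\max_s D_{\mathrm{TV}}\big(\pi_{\mathrm{old}}(\cdot\mid s),\,\pi(\cdot\mid s)\big)\Big)^2
\]

% If rollouts are generated by a behavior policy \mu \neq \pi_{old}
% (with \pi_{old} playing the role of the post's reference policy),
% the triangle inequality for total variation distance splits the
% relevant deviation into two separately controllable terms:
\[
D_{\mathrm{TV}}\big(\mu(\cdot\mid s),\,\pi(\cdot\mid s)\big)
  \;\le\; D_{\mathrm{TV}}\big(\mu(\cdot\mid s),\,\pi_{\mathrm{old}}(\cdot\mid s)\big)
        + D_{\mathrm{TV}}\big(\pi_{\mathrm{old}}(\cdot\mid s),\,\pi(\cdot\mid s)\big)
\]
```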
-
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
Modern LLM RL pipelines often train while the "old policy" has quietly drifted away from the behavior policy that actually generates the rollouts, breaking the usual on-policy assumption. This post rewrites the classic TRPO lower bound in a three-policy form (behavior, reference, and target policies) so that the performance gap decomposes into two TV distances that can be reasoned about and controlled. From this perspective, Decoupled PPO, AReaL, TIS, IcePop, sequence-level MIS, Worst Token Reject Sampling (WTRS), MoE routing replay, and common engineering tricks for training–inference alignment can all be seen as different ways of shrinking these two deviations.