-
Understanding KL Divergence Estimators in RL: From Value Approximation to Gradient Estimation
How the KL divergence is estimated directly affects training stability. This post dissects three classic estimators, k1, k2, and k3, covering both the on-policy and off-policy settings, and gives practical rules for choosing among them when the KL term is used as a reward penalty versus as a loss that backpropagates.
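For quick reference, here is a minimal PyTorch sketch of the standard per-sample k1, k2, k3 estimators of KL(pi_current || pi_ref), computed from actions sampled under the current policy; the function and tensor names (`kl_estimates`, `logp_current`, `logp_ref`) are illustrative, not taken from the post.

```python
import torch

def kl_estimates(logp_current: torch.Tensor, logp_ref: torch.Tensor):
    """Monte Carlo estimates of KL(pi_current || pi_ref).

    Both inputs hold log-probabilities of the same actions, which are
    assumed to have been sampled from pi_current (the on-policy case).
    """
    log_ratio = logp_ref - logp_current   # log r, with r = pi_ref / pi_current
    ratio = log_ratio.exp()
    k1 = -log_ratio                       # unbiased, but can be negative; high variance
    k2 = 0.5 * log_ratio ** 2             # biased, always non-negative, lower variance
    k3 = ratio - 1.0 - log_ratio          # unbiased and always non-negative
    return k1.mean(), k2.mean(), k3.mean()
```

When the KL term only enters the reward as a penalty, only its value matters; when it is added to the loss and backpropagated, the gradient of whichever estimator is chosen matters as well, which is the value-versus-gradient distinction the title refers to.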
-
A Simple Guide to KL Divergence Estimators in RL: From Value Estimation to Gradient Estimation
In reinforcement learning, how the KL divergence is estimated directly affects training stability. This post systematically dissects the differing properties of three classic estimators, k1, k2, and k3, covers both the on-policy and off-policy settings, and gives selection guidelines for using them as reward penalties versus as losses that backpropagate.
-
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
Modern LLM RL pipelines often train under an "old policy" that silently drifts away from the behavior policy that actually generates rollouts, breaking the usual on-policy assumptions. This post rewrites the classic TRPO lower bound in a three-policy form — behavior, reference, and target — so that the performance gap cleanly decomposes into two TV distances that we can reason about and control. Seen through this lens, methods like Decoupled PPO, AReaL, TIS, IcePop, sequence-level MIS, Worst Token Reject Sampling (WTRS), MoE routing replay, and common engineering tricks for training–inference alignment all become different ways of shrinking these two deviations.
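For orientation, the two-policy ingredient this builds on is the classic TRPO lower bound; a LaTeX sketch is given below, together with the triangle inequality for total variation that hints at how a behavior-versus-reference deviation can be split off. The exact three-policy bound is the post's own result and is not reproduced here, and identifying \pi_old with the post's "reference" policy is an assumption made only for illustration.

```latex
% Classic two-policy TRPO lower bound (Schulman et al., 2015):
%   \eta(\pi)              expected discounted return of the target policy \pi
%   L_{\pi_{old}}(\pi)     surrogate objective built from \pi_{old}'s advantages
%   \epsilon = \max_{s,a} |A_{\pi_{old}}(s,a)|
\[
\eta(\pi) \;\ge\; L_{\pi_{\mathrm{old}}}(\pi)
  - \frac{4\epsilon\gamma}{(1-\gamma)^2}
    \Big(\max_s D_{\mathrm{TV}}\big(\pi_{\mathrm{old}}(\cdot\mid s),\,\pi(\cdot\mid s)\big)\Big)^2
\]

% If rollouts are generated by a behavior policy \mu \neq \pi_{old}
% (with \pi_{old} playing the role of the post's reference policy),
% the triangle inequality for total variation distance splits the
% relevant deviation into two separately controllable terms:
\[
D_{\mathrm{TV}}\big(\mu(\cdot\mid s),\,\pi(\cdot\mid s)\big)
  \;\le\; D_{\mathrm{TV}}\big(\mu(\cdot\mid s),\,\pi_{\mathrm{old}}(\cdot\mid s)\big)
        + D_{\mathrm{TV}}\big(\pi_{\mathrm{old}}(\cdot\mid s),\,\pi(\cdot\mid s)\big)
\]
```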
-
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
Modern LLM RL pipelines often train while the "old policy" has quietly drifted away from the behavior policy that actually generates the rollouts, breaking the usual on-policy assumption. This post rewrites the classic TRPO lower bound in a three-policy form (behavior, reference, and target policies) so that the performance gap decomposes into two TV distances that can be reasoned about and controlled. From this perspective, Decoupled PPO, AReaL, TIS, IcePop, sequence-level MIS, Worst Token Reject Sampling (WTRS), MoE routing replay, and common engineering tricks for training–inference alignment can all be seen as different ways of shrinking these two deviations.