Xihuai’s Blog
-
Taming Stale Data: Off-Policy Reinforcement Learning for LLMs with Monotonic Improvement Guarantees
This post derives an off-policy view of LLM reinforcement learning: starting from single-policy performance bounds, it extends to multi-policy mixture sampling, states monotonic-improvement conditions, splits the bound via a triangle inequality into an update-shift term and a sampling-staleness term, and ends with practical clipping and filtering strategies.
Taming Stale Data: Off-Policy Training for LLM Reinforcement Learning and Monotonic-Improvement Conditions
Starting from the performance-improvement lower bound under single-policy sampling, this post derives the off-policy training problem in LLM reinforcement learning, extends it to static/dynamic multi-policy mixture sampling, splits the monotonic-improvement condition into an update-shift term and a sampling-staleness term, and ends with actionable clipping and filtering strategies.
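The clipping and filtering strategies mentioned above can be illustrated with a minimal token-level sketch. This is not the post's actual implementation; the function names, the PPO-style clip, and the log-ratio staleness filter are illustrative assumptions:

```python
import math

def clipped_token_loss(logp_new, logp_behavior, advantage, eps=0.2):
    """PPO-style clipped surrogate for one token (illustrative sketch).

    ratio = pi_new / pi_behavior importance-corrects for sampling from a
    stale behavior policy; clipping to [1 - eps, 1 + eps] bounds how far
    a single update can move on any token.
    """
    ratio = math.exp(logp_new - logp_behavior)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Pessimistic (lower) surrogate, negated so it can be minimized.
    return -min(unclipped, clipped)

def keep_token(logp_new, logp_behavior, max_abs_log_ratio=2.0):
    """Staleness filter: drop tokens whose importance ratio drifted too far."""
    return abs(logp_new - logp_behavior) <= max_abs_log_ratio
```

With identical policies the ratio is 1 and the loss reduces to `-advantage`; as the log-ratio grows, clipping caps the gradient signal and the filter discards the token entirely.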
-
Choosing KL Estimators in RL: From Value Unbiasedness to Gradient Correctness
In RL, KL estimators should not be judged only by how accurately they estimate KL values, but also by what objective their gradients actually optimize. This post compares k1, k2, k3 in on-policy and off-policy settings, and turns the result into a practical selection guide.
Choosing KL Estimators in RL: From Unbiased Values to Correct Gradients
In reinforcement learning, a KL estimator should not be judged only by how accurately it estimates the KL value; what matters is which objective it actually optimizes once written into the loss or reward. This post compares k1, k2, and k3 in on-policy and off-policy settings and turns the comparison into directly applicable selection advice.
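The k1/k2/k3 naming above follows Schulman's single-sample estimators of KL(q‖p) from samples x ~ q, with r = p(x)/q(x). A minimal pure-Python sanity check (the Gaussian pair is an illustrative assumption, not from the post):

```python
import math
import random

def kl_estimators(logp, logq):
    """Per-sample estimators of KL(q || p) for x ~ q, r = p(x)/q(x):
      k1 = -log r           (unbiased, high variance, can go negative)
      k2 = 0.5 * (log r)^2  (biased, low variance, always >= 0)
      k3 = (r - 1) - log r  (unbiased, low variance, always >= 0)
    """
    log_r = logp - logq
    r = math.exp(log_r)
    return -log_r, 0.5 * log_r ** 2, (r - 1.0) - log_r

def log_normal(x, mu):
    """Log-density of N(mu, 1)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

# Monte Carlo check with q = N(0, 1), p = N(0.5, 1):
# true KL(q || p) = 0.5 * 0.5**2 = 0.125.
random.seed(0)
n = 200_000
sums = [0.0, 0.0, 0.0]
for _ in range(n):
    x = random.gauss(0.0, 1.0)  # sample from q
    for i, k in enumerate(kl_estimators(log_normal(x, 0.5), log_normal(x, 0.0))):
        sums[i] += k
est = [s / n for s in sums]
print(est)  # k1 and k3 should land near 0.125; k2 is slightly biased upward
```

Matching values here only confirms the estimators agree in expectation; the post's point is that their gradients, when plugged into a loss, can still optimize different objectives.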
-
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
In modern LLM RL pipelines, the policy used as the "old policy" in training can differ from the behavior policy that actually generated the rollouts, breaking the usual on-policy assumption. This note rewrites the TRPO lower bound in a three-policy form — behavior, reference, and target — and argues that the surrogate gap is jointly controlled by two mismatch sources.
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
In modern LLM RL pipelines, the "old policy" used during training may already differ from the behavior policy that actually generated the rollouts, breaking the common on-policy assumption. This note rewrites the classical TRPO lower bound in a three-policy form (behavior, reference, and target) and shows that the surrogate gap is jointly controlled by two sources of mismatch.
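The two mismatch sources factor cleanly at the level of per-token log-probabilities, since the total importance log-ratio telescopes. A minimal sketch (function and variable names are illustrative assumptions):

```python
import math

def three_policy_ratios(logp_target, logp_ref, logp_behavior):
    """Decompose the total importance log-ratio for one token:

        log(pi_target / pi_behavior)
          = log(pi_target / pi_ref) + log(pi_ref / pi_behavior)

    The first term is the update shift the optimizer introduces; the
    second is the staleness between the training-time reference ("old")
    policy and the behavior policy that generated the rollout.
    """
    update_shift = logp_target - logp_ref
    staleness = logp_ref - logp_behavior
    total_ratio = math.exp(update_shift + staleness)
    return update_shift, staleness, total_ratio
```

In the on-policy special case the reference and behavior log-probs coincide, the staleness term vanishes, and the decomposition collapses back to the usual two-policy TRPO ratio.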