Xihuai’s Blog
-
Taming Stale Data: Off-Policy Reinforcement Learning for LLMs with Monotonic Improvement Guarantees
This post derives an off-policy view of LLM reinforcement learning: starting from single-policy performance bounds, it extends to multi-policy mixture sampling, states monotonic-improvement conditions, splits the bound via a triangle inequality into an update-shift term and a sampling-staleness term, and ends with practical clipping and filtering strategies.
Taming Stale Data: Off-Policy Training for LLM Reinforcement Learning and Monotonic-Improvement Conditions
Starting from the performance-improvement lower bound under single-policy sampling, this post derives the off-policy training problem in LLM reinforcement learning, extends it to static/dynamic multi-policy mixture sampling, splits the monotonic-improvement condition into an update-shift term and a sampling-staleness term, and ends with actionable clipping and filtering strategies.
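The clipping and filtering strategies mentioned above can be illustrated with a minimal token-level sketch. This is not the post's actual implementation; the function names, the PPO-style clip, and the log-ratio staleness filter are illustrative assumptions:

```python
import math

def clipped_token_loss(logp_new, logp_behavior, advantage, eps=0.2):
    """PPO-style clipped surrogate for one token (illustrative sketch).

    ratio = pi_new / pi_behavior importance-corrects for sampling from a
    stale behavior policy; clipping to [1 - eps, 1 + eps] bounds how far
    a single update can move on any token.
    """
    ratio = math.exp(logp_new - logp_behavior)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Pessimistic (lower) surrogate, negated so it can be minimized.
    return -min(unclipped, clipped)

def keep_token(logp_new, logp_behavior, max_abs_log_ratio=2.0):
    """Staleness filter: drop tokens whose importance ratio drifted too far."""
    return abs(logp_new - logp_behavior) <= max_abs_log_ratio
```

With identical policies the ratio is 1 and the loss reduces to `-advantage`; as the log-ratio grows, clipping caps the gradient signal and the filter discards the token entirely.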
-
Choosing KL Estimators in RL: From Value Unbiasedness to Gradient Correctness
In RL, KL estimators should not be judged only by how accurately they estimate KL values, but also by what objective their gradients actually optimize. This post compares k1, k2, k3 in on-policy and off-policy settings, and turns the result into a practical selection guide.
Choosing KL Estimators in RL: From Unbiased Values to Correct Gradients
In reinforcement learning, a KL estimator should not be judged only by how accurately it estimates the KL value; what matters is which objective it actually optimizes once written into the loss or reward. This post compares k1, k2, and k3 in on-policy and off-policy settings and turns the comparison into directly applicable selection advice.
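The k1/k2/k3 naming above follows Schulman's single-sample estimators of KL(q‖p) from samples x ~ q, with r = p(x)/q(x). A minimal pure-Python sanity check (the Gaussian pair is an illustrative assumption, not from the post):

```python
import math
import random

def kl_estimators(logp, logq):
    """Per-sample estimators of KL(q || p) for x ~ q, r = p(x)/q(x):
      k1 = -log r           (unbiased, high variance, can go negative)
      k2 = 0.5 * (log r)^2  (biased, low variance, always >= 0)
      k3 = (r - 1) - log r  (unbiased, low variance, always >= 0)
    """
    log_r = logp - logq
    r = math.exp(log_r)
    return -log_r, 0.5 * log_r ** 2, (r - 1.0) - log_r

def log_normal(x, mu):
    """Log-density of N(mu, 1)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

# Monte Carlo check with q = N(0, 1), p = N(0.5, 1):
# true KL(q || p) = 0.5 * 0.5**2 = 0.125.
random.seed(0)
n = 200_000
sums = [0.0, 0.0, 0.0]
for _ in range(n):
    x = random.gauss(0.0, 1.0)  # sample from q
    for i, k in enumerate(kl_estimators(log_normal(x, 0.5), log_normal(x, 0.0))):
        sums[i] += k
est = [s / n for s in sums]
print(est)  # k1 and k3 should land near 0.125; k2 is slightly biased upward
```

Matching values here only confirms the estimators agree in expectation; the post's point is that their gradients, when plugged into a loss, can still optimize different objectives.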
-
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
In modern LLM RL pipelines, the policy used as the "old policy" in training can differ from the behavior policy that actually generated the rollouts, breaking the usual on-policy assumption. This note rewrites the TRPO lower bound in a three-policy form — behavior, reference, and target — and argues that the surrogate gap is jointly controlled by two mismatch sources.
From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
In modern LLM RL pipelines, the "old policy" used during training may already differ from the behavior policy that actually generated the rollouts, breaking the common on-policy assumption. This note rewrites the classical TRPO lower bound in a three-policy form (behavior, reference, and target) and shows that the surrogate gap is jointly controlled by two sources of mismatch.
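The two mismatch sources factor cleanly at the level of per-token log-probabilities, since the total importance log-ratio telescopes. A minimal sketch (function and variable names are illustrative assumptions):

```python
import math

def three_policy_ratios(logp_target, logp_ref, logp_behavior):
    """Decompose the total importance log-ratio for one token:

        log(pi_target / pi_behavior)
          = log(pi_target / pi_ref) + log(pi_ref / pi_behavior)

    The first term is the update shift the optimizer introduces; the
    second is the staleness between the training-time reference ("old")
    policy and the behavior policy that generated the rollout.
    """
    update_shift = logp_target - logp_ref
    staleness = logp_ref - logp_behavior
    total_ratio = math.exp(update_shift + staleness)
    return update_shift, staleness, total_ratio
```

In the on-policy special case the reference and behavior log-probs coincide, the staleness term vanishes, and the decomposition collapses back to the usual two-policy TRPO ratio.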