
Xihuai’s Blog

Notes on reinforcement learning, multi-agent systems, and LLM reasoning — written to clarify my own thinking, shared in case they help yours.

2025 (3 posts)

  1. Taming Stale Data: Off-Policy Reinforcement Learning for LLMs with Monotonic Improvement Guarantees

    This post derives an off-policy view of LLM reinforcement learning: from single-policy performance bounds to multi-policy mixture sampling, with monotonic-improvement conditions that separate out update shift, sampling staleness, advantage-replacement error, and support assumptions.

    30 min read · Reinforcement learning · 51 views
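To make "multi-policy mixture sampling" concrete, here is a minimal sketch (not from the post; all numbers and names are illustrative) of the standard importance weight for a rollout drawn from an equal-weight mixture of stale behavior policies:

```python
import math

# Hypothetical log-probabilities of one sampled response.
# pi_target is the policy being optimized; the two stale policies
# generated the rollouts sitting in the replay buffer.
logp_target = -12.0
logp_stale = [-11.5, -13.0]  # log-prob under each stale behavior policy

# When sampling from the equal-weight mixture of stale policies, the
# correct importance weight is pi_target(y|x) / mean_k pi_stale_k(y|x),
# not a weight against any single stale policy.
mixture_prob = sum(math.exp(lp) for lp in logp_stale) / len(logp_stale)
weight = math.exp(logp_target) / mixture_prob
print(weight)
```

Weighting against the mixture rather than a single stale policy is what keeps the off-policy estimate unbiased as the buffer accumulates rollouts of different ages.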

  2. Choosing KL Estimators in RL: From Value Unbiasedness to Gradient Correctness

    In RL, KL estimators should not be judged only by how accurately they estimate KL values, but also by what objective their gradients actually optimize. This post compares k1, k2, k3 in on-policy and off-policy settings, and turns the result into a practical selection guide.

    34 min read · Reinforcement learning · 844 views
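For readers unfamiliar with the k1/k2/k3 naming, a small self-contained sketch of the three single-sample KL estimators (the setup with two Gaussians is assumed here for illustration; it is not from the post):

```python
import math
import random

random.seed(0)

# Estimate KL(q || p) between two unit-variance Gaussians from samples
# x ~ q, using the log-ratio log r = log p(x) - log q(x).
mu_q, mu_p, sigma = 0.0, 0.6, 1.0

def logpdf(x, mu):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

xs = [random.gauss(mu_q, sigma) for _ in range(100_000)]
log_r = [logpdf(x, mu_p) - logpdf(x, mu_q) for x in xs]

# The three single-sample estimators the post compares:
k1 = [-lr for lr in log_r]                    # unbiased in value, high variance
k2 = [0.5 * lr ** 2 for lr in log_r]         # biased in value, low variance
k3 = [math.exp(lr) - 1 - lr for lr in log_r] # unbiased in value, low variance

mean = lambda v: sum(v) / len(v)
true_kl = (mu_p - mu_q) ** 2 / (2 * sigma ** 2)  # closed form for Gaussians: 0.18
print(round(true_kl, 3), round(mean(k1), 3), round(mean(k3), 3))
```

Matching values, as this sketch checks, is only half the story; the post's point is that the gradients of these estimators optimize different objectives, which is what should drive the choice in practice.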

  3. From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL

    In modern LLM RL pipelines, the policy used as the "old policy" in training can differ from the behavior policy that actually generated the rollouts, breaking the usual on-policy assumption. This note rewrites the TRPO lower bound in a three-policy form — behavior, reference, and target — and argues that the surrogate gap is jointly controlled by two mismatch sources.

    31 min read · Reinforcement learning · 382 views