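Dec 1, 2025 Understanding KL Divergence Estimators in RL: From Value Approximation to Gradient Estimation
Dec 1, 2025 A Simple Understanding of KL Divergence Estimators in RL: From Value Estimation to Gradient Estimation (Chinese version)
Nov 15, 2025 From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL
Nov 15, 2025 From Two Policies to Three: Extending TRPO under Behavior–Reference Policy Mismatch in LLM RL (Chinese version)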