Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Abstract
GOLF leverages natural language feedback in reinforcement learning to improve exploration efficiency and sample utilization through grouped critique aggregation and joint generation-refinement optimization.
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedback signals are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with 2.2× gains in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
Community
Key idea:
- Instead of discarding the rich information in NL feedback, GOLF aggregates two complementary signals — external critiques (pinpointing errors and fixes) and intra-group attempts (diverse partial solutions and failure patterns) — to produce high-quality refinements that serve as off-policy scaffolds during training.
Highlights:
- Adaptive injection of refinements to provide targeted guidance in sparse-reward regions
- Joint optimization of generation and refinement in a unified RL loop, forming a virtuous improvement cycle
- 2.2× sample efficiency gain over scalar-reward-only RL methods
- Consistent improvements on both verifiable and non-verifiable benchmarks
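The mechanism described above — aggregating external critiques with sibling attempts from the same group, then adaptively injecting a refinement as an off-policy scaffold when rewards are sparse — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the aggregation format, the injection rule (`should_inject_refinement`), and the `refine_fn` callable are all assumptions made here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Attempt:
    text: str      # a sampled rollout from the policy
    reward: float  # scalar reward from the verifier/environment
    critique: str  # external NL critique of this attempt

def aggregate_feedback(group: List[Attempt]) -> str:
    """Combine the two group-level signals from the paper: external
    critiques plus the partial ideas visible in sibling attempts.
    (The actual aggregation format in GOLF is not specified here.)"""
    critiques = "; ".join(a.critique for a in group)
    attempts = " | ".join(a.text for a in group)
    return f"Critiques: {critiques}. Prior attempts: {attempts}."

def should_inject_refinement(group: List[Attempt], threshold: float = 0.0) -> bool:
    """Assumed adaptive-injection rule: only add an off-policy
    scaffold when the whole group failed, i.e. the prompt sits in a
    sparse-reward region for the current policy."""
    return max(a.reward for a in group) <= threshold

def build_training_batch(
    group: List[Attempt],
    refine_fn: Callable[[str], str],
) -> List[Tuple[str, float, str]]:
    """Return (sample, reward, source) triples: the on-policy group
    plus, when warranted, one refinement produced from aggregated
    group-level feedback."""
    batch = [(a.text, a.reward, "on_policy") for a in group]
    if should_inject_refinement(group):
        refined = refine_fn(aggregate_feedback(group))
        # Reward for the scaffold would come from re-scoring the
        # refinement; 1.0 is a placeholder for a verified success.
        batch.append((refined, 1.0, "off_policy_scaffold"))
    return batch
```

In GOLF the refinement model and the generator are the same policy optimized jointly in one RL loop; here `refine_fn` stands in for that refinement call.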