Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Abstract
GOLF leverages natural language feedback in reinforcement learning to improve exploration efficiency and sample utilization through grouped critique aggregation and joint generation-refinement optimization.
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedback signals are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with 2.2× gains in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
Community
Key idea:
- Instead of discarding the rich information in NL feedback, GOLF aggregates two complementary signals — external critiques (pinpointing errors and fixes) and intra-group attempts (diverse partial solutions and failure patterns) — to produce high-quality refinements that serve as off-policy scaffolds during training.
Highlights:
- Adaptive injection of refinements to provide targeted guidance in sparse-reward regions
- Joint optimization of generation and refinement in a unified RL loop, forming a virtuous improvement cycle
- 2.2× sample efficiency gain over scalar-reward-only RL methods
- Consistent improvements on both verifiable and non-verifiable benchmarks
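The mechanism described above — aggregating external critiques with sibling attempts from the same group, then adaptively injecting a refinement as an off-policy scaffold when rewards are sparse — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the aggregation format, the injection rule (`should_inject_refinement`), and the `refine_fn` callable are all assumptions made here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Attempt:
    text: str      # a sampled rollout from the policy
    reward: float  # scalar reward from the verifier/environment
    critique: str  # external NL critique of this attempt

def aggregate_feedback(group: List[Attempt]) -> str:
    """Combine the two group-level signals from the paper: external
    critiques plus the partial ideas visible in sibling attempts.
    (The actual aggregation format in GOLF is not specified here.)"""
    critiques = "; ".join(a.critique for a in group)
    attempts = " | ".join(a.text for a in group)
    return f"Critiques: {critiques}. Prior attempts: {attempts}."

def should_inject_refinement(group: List[Attempt], threshold: float = 0.0) -> bool:
    """Assumed adaptive-injection rule: only add an off-policy
    scaffold when the whole group failed, i.e. the prompt sits in a
    sparse-reward region for the current policy."""
    return max(a.reward for a in group) <= threshold

def build_training_batch(
    group: List[Attempt],
    refine_fn: Callable[[str], str],
) -> List[Tuple[str, float, str]]:
    """Return (sample, reward, source) triples: the on-policy group
    plus, when warranted, one refinement produced from aggregated
    group-level feedback."""
    batch = [(a.text, a.reward, "on_policy") for a in group]
    if should_inject_refinement(group):
        refined = refine_fn(aggregate_feedback(group))
        # Reward for the scaffold would come from re-scoring the
        # refinement; 1.0 is a placeholder for a verified success.
        batch.append((refined, 1.0, "off_policy_scaffold"))
    return batch
```

In GOLF the refinement model and the generator are the same policy optimized jointly in one RL loop; here `refine_fn` stands in for that refinement call.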