Continuous Learning GRPO Implementation Guide
Based on the Tencent youtu-agent paper "Training-Free Group Relative Policy Optimization" (arXiv:2510.08191v1)
This document provides implementation specifications for Continuous Learning GRPO in zoo/gym.
1. Algorithm Overview
Vanilla GRPO vs Continuous Learning GRPO
Vanilla GRPO (Current zoo/gym):
# For each query q:
outputs = [π_θ(o_i|q) for i in range(G)] # Generate G outputs
rewards = [R(q, o_i) for o_i in outputs] # Score each
advantages = [(r_i - mean(r))/std(r) for r_i in rewards] # Numerical advantage
# Update parameters θ via gradient ascent on PPO objective
Continuous Learning GRPO (Target):
# Initialize: E = {} (experience library)
# For each query q:
outputs = [π_θ(o_i|q, E) for i in range(G)] # Inject experiences
rewards = [R(q, o_i) for o_i in outputs] # Score each
# Only for groups with std(rewards) > 0:
summaries = [LLM.summarize(q, o_i, r_i, groundtruth) for o_i, r_i in zip(outputs, rewards)]
semantic_advantage = LLM.extract_insights(q, summaries, E)
# Batch update (after all queries in epoch):
operations = LLM.consolidate([all semantic_advantages], E)
E = apply_operations(E, operations) # Add/Delete/Modify experiences
# Next epoch uses updated E
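
# For contrast, a minimal sketch (not from the paper) of the numerical group
# advantage that the continuous-learning variant replaces with textual
# experience operations:
import statistics

def group_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

print(group_advantages([1.0, 0.0, 1.0, 1.0, 0.0]))  # ~[0.82, -1.22, 0.82, 0.82, -1.22]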
2. Core Components
2.1 Experience Manager
File: src/gym/train/grpo/experience_manager.py
from typing import Dict, List, Literal, Optional
import json
from pathlib import Path
Operation = Literal["add", "delete", "modify", "merge", "keep"]
class ExperienceManager:
"""Manages the experience library E for Continuous Learning GRPO."""
    def __init__(self, checkpoint_path: Optional[str] = None):
"""
Args:
checkpoint_path: Path to load/save experiences
"""
self.experiences: Dict[str, str] = {}
self._next_id = 0
self.checkpoint_path = checkpoint_path
if checkpoint_path and Path(checkpoint_path).exists():
self.load(checkpoint_path)
def add(self, experience: str) -> str:
"""Add new experience, return assigned ID."""
exp_id = f"G{self._next_id}"
self.experiences[exp_id] = experience
self._next_id += 1
return exp_id
def delete(self, exp_id: str) -> bool:
"""Delete experience by ID."""
if exp_id in self.experiences:
del self.experiences[exp_id]
return True
return False
def modify(self, exp_id: str, new_experience: str) -> bool:
"""Modify existing experience."""
if exp_id in self.experiences:
self.experiences[exp_id] = new_experience
return True
return False
def merge(self, exp_ids: List[str], merged_experience: str) -> str:
"""Merge multiple experiences into one."""
# Delete old experiences
for exp_id in exp_ids:
self.delete(exp_id)
# Add merged experience
return self.add(merged_experience)
def apply_operations(self, operations: List[Dict]) -> None:
"""
Apply batch of operations from LLM.
Args:
operations: List of dicts with keys: "option", "experience", etc.
Example: [
{"option": "add", "experience": "When solving..."},
{"option": "modify", "experience": "...", "modified_from": "G17"},
{"option": "delete", "delete_id": "G5"},
{"option": "merge", "experience": "...", "merged_from": ["G1", "G3"]}
]
"""
for op in operations:
option = op.get("option", "keep")
if option == "add":
self.add(op["experience"])
elif option == "delete":
self.delete(op["delete_id"])
elif option == "modify":
self.modify(op["modified_from"], op["experience"])
elif option == "merge":
self.merge(op["merged_from"], op["experience"])
def format_for_prompt(self) -> str:
"""Format experiences for injection into prompts."""
if not self.experiences:
return "None"
formatted = []
for exp_id, exp_text in self.experiences.items():
formatted.append(f"[{exp_id}]. {exp_text}")
return "\n".join(formatted)
def save(self, path: str) -> None:
"""Save experiences to JSON."""
with open(path, 'w') as f:
json.dump({
"experiences": self.experiences,
"next_id": self._next_id
}, f, indent=2)
def load(self, path: str) -> None:
"""Load experiences from JSON."""
with open(path) as f:
data = json.load(f)
self.experiences = data["experiences"]
self._next_id = data["next_id"]
def __len__(self) -> int:
return len(self.experiences)
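
# Example usage (illustrative, not part of the module):
if __name__ == "__main__":
    manager = ExperienceManager()
    manager.add("For geometry problems, sketch the figure before picking coordinates.")
    manager.apply_operations([
        {"option": "add", "experience": "For modular arithmetic, check small cases first."},
        {"option": "modify", "modified_from": "G0",
         "experience": "For geometry problems, sketch the figure and label all knowns first."},
    ])
    print(manager.format_for_prompt())
    # [G0]. For geometry problems, sketch the figure and label all knowns first.
    # [G1]. For modular arithmetic, check small cases first.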
2.2 Semantic Extractor
File: src/gym/train/grpo/semantic_extractor.py
import json
import re
from dataclasses import dataclass
from typing import Dict, List, Optional
@dataclass
class Trajectory:
"""Single rollout trajectory."""
query: str
output: str
reward: float
groundtruth: Optional[str] = None
summary: Optional[str] = None
class SemanticExtractor:
"""
Extracts semantic advantages from groups of trajectories.
    Implements the three-stage LLM process from the Continuous Learning GRPO paper.
"""
def __init__(self, llm_client, max_operations: int = 3):
"""
Args:
llm_client: LLM client for introspection (e.g., OpenAI, DeepSeek)
max_operations: Max operations per group critique
"""
self.llm = llm_client
self.max_operations = max_operations
# STAGE 1: Trajectory Summarization
def summarize_trajectory(
self,
trajectory: Trajectory,
use_groundtruth: bool = True
) -> str:
"""
Summarize a single trajectory.
        Implements Figure 11 from the paper.
"""
prompt = f"""An agent system may be provided with some experiences, and then it produces the following trajectory to solve the given problem. Please summarize the trajectory step-by-step:
1. For each step, describe what action is being taken, and which experience has been used in this step.
2. Given the grading of this rollout and the correct answer, identify and explain any steps that represent detours, errors, or backtracking, highlighting why they might have occurred and what their impact was on the trajectory's progress.
3. Maintain all the core outcome of each step, even if it was part of a flawed process.
<trajectory>
{trajectory.output}
</trajectory>
<evaluation>
{'This trajectory delivers **correct** answer' if trajectory.reward > 0 else 'This trajectory delivers **wrong** answer'}
</evaluation>
{f'<groundtruth>{trajectory.groundtruth}</groundtruth>' if use_groundtruth else ''}
Only return the trajectory summary of each step, e.g.,
1. what happened in the first step and the core outcomes
2. what happened in the second step and the core outcomes
3. ..."""
response = self.llm.chat(prompt)
return response
# STAGE 2: Group Advantage Extraction
def extract_group_advantage(
self,
trajectories: List[Trajectory],
experiences: str, # Formatted experience library
use_groundtruth: bool = True
    ) -> List[Dict]:
"""
Extract semantic advantage from a group of trajectories.
        Implements Figure 12 from the paper.
Returns operations: [{"option": "add", "experience": "..."}, ...]
"""
# Check if group has variation (std > 0)
rewards = [t.reward for t in trajectories]
if len(set(rewards)) <= 1:
return [] # Skip homogeneous groups
# Format trajectories with summaries
formatted_trajectories = []
for i, traj in enumerate(trajectories):
status = "correct" if traj.reward > 0 else "wrong"
formatted_trajectories.append(
f"Attempt {i+1} (Answer {status}):\n{traj.summary or traj.output}"
)
trajectories_text = "\n\n".join(formatted_trajectories)
prompt = f"""An agent system is provided with a set of experiences and has tried to solve the problem multiple times with both successful and wrong solutions. Review these problem-solving attempt and extract generalizable experiences. Follow these steps:
1. Trajectory Analysis:
- For successful steps: Identify key correct decisions and insights
- For errors: Pinpoint where and why the reasoning went wrong
- Note any important patterns or strategies used/missed
- Review why some trajectories fail: were any existing experiences missed, or did the experiences not provide enough guidance?
2. Update Existing Experiences
- Some trajectories may be correct and others may be wrong; you should ensure the experiences can help future attempts succeed
- You have three options: [modify, add, delete]
* modify: You can modify current experiences to make them more helpful
* add: You can introduce new experiences to improve future performance
* delete: You can delete existing experiences
- You can update at most {self.max_operations} clear, generalizable lessons for this case
- Before updating each experience, you need to:
* Specify when it would be most relevant
* List key problem features that make this experience applicable
* Identify similar problem patterns where this advice applies
3. Requirements for each experience that is modified or added:
- Begin each experience with a few words of general background
- Focus on strategic thinking patterns, not specific calculations
- Emphasize decision points that could apply to similar problems
Please provide detailed reasoning following the 3 steps above. After the step-by-step reasoning, finish by returning this JSON format:
```json
[
{{
"option": "modify",
"experience": "the modified experience",
"modified_from": "G17"
}},
{{
"option": "add",
"experience": "the added experience"
}},
{{
"option": "delete",
"delete_id": "G5"
}}
]
```
Note that your updated experiences may not need to cover all the options.

<problem>
{trajectories[0].query}
</problem>

<existing_experiences>
{experiences}
</existing_experiences>

<attempts>
{trajectories_text}
</attempts>
{f'<groundtruth>{trajectories[0].groundtruth}</groundtruth>' if use_groundtruth else ''}"""
        response = self.llm.chat(prompt)
        # Parse the JSON operations block from the response
        json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
        if json_match:
            try:
                operations = json.loads(json_match.group(1))
                return operations[:self.max_operations]
            except json.JSONDecodeError:
                pass  # skip this group if the LLM returned malformed JSON
        return []
# STAGE 3: Batch Consolidation
def consolidate_batch(
self,
all_group_operations: List[List[Dict]],
experiences: str
) -> List[Dict]:
"""
Consolidate all group advantages into final experience updates.
        Implements Figure 13 from the paper.
"""
# Flatten all operations
all_ops = []
for group_ops in all_group_operations:
all_ops.extend(group_ops)
if not all_ops:
return []
prompt = f"""An agent system is provided with a set of experiences and has tried to solve the problem multiple times. From the reflections, some suggestions on the existing experiences have been posed. Your task is to collect and think for the final experience revision plan. Each final experience must satisfy the following requirements:
- It must be clear, generalizable lessons for this case, with no more than 32 words
- Begin with general background with several words in the experience
- Focus on strategic thinking patterns, not specific calculations
- Emphasize decision points that could apply to similar problems
- Avoid repeating saying similar experience in multiple different experiences
Please provide reasoning in each of the suggestions, and think for how to update existing experiences. You have three update options: [modify, merge, delete]
- modify: You can modify current experiences to make it helpful
- merge: You can merge some similar experiences into a more general forms to reduce duplication
- delete: You can delete an experience
After generating the step-by-step reasoning, you need to give the final experience revision details by returning in this JSON format as follows:
[
{{
"option": "modify",
"experience": "the modified experience",
"modified_from": "G17"
}},
{{
"option": "merge",
"experience": "the merged experience",
"merged_from": ["C1", "C3", "S4"]
}},
{{
"option": "delete",
"delete_id": "G5"
}}
]
```

<existing_experiences>
{experiences}
</existing_experiences>

<suggestions>
{json.dumps(all_ops, indent=2)}
</suggestions>"""
        response = self.llm.chat(prompt)
        # Parse the final JSON revision plan from the response
        json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
        if json_match:
            try:
                return json.loads(json_match.group(1))
            except json.JSONDecodeError:
                pass  # keep the library unchanged if the LLM returned malformed JSON
        return []
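
# Smoke test with a stubbed LLM client (illustrative only); any object that
# exposes chat(prompt) -> str works as the llm_client.
if __name__ == "__main__":
    class EchoClient:
        def chat(self, prompt: str) -> str:
            # Pretend the LLM proposed one new experience in the expected JSON format
            return '```json\n[{"option": "add", "experience": "Check units before finalizing."}]\n```'

    extractor = SemanticExtractor(llm_client=EchoClient(), max_operations=3)
    group = [
        Trajectory(query="2+2?", output="...steps... 4", reward=1.0, groundtruth="4"),
        Trajectory(query="2+2?", output="...steps... 5", reward=0.0, groundtruth="4"),
    ]
    ops = extractor.extract_group_advantage(group, experiences="None")
    print(ops)  # [{'option': 'add', 'experience': 'Check units before finalizing.'}]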
2.3 Integration with GRPOTrainer
File: src/gym/train/grpo/trainer.py (modifications)
# Add to the existing GRPOTrainer class
import torch
from pathlib import Path

from .experience_manager import ExperienceManager
from .semantic_extractor import SemanticExtractor, Trajectory
class GRPOTrainer:
def __init__(self, ...):
# Existing initialization
...
# NEW: Continuous Learning GRPO components
self.use_training_free = self.args.continuous_learning_grpo # New arg
if self.use_training_free:
self.experience_manager = ExperienceManager(
checkpoint_path=self.args.experience_checkpoint_path
)
self.semantic_extractor = SemanticExtractor(
llm_client=self._get_llm_client(), # OpenAI/DeepSeek client
max_operations=self.args.max_experience_operations
)
    def _get_llm_client(self):
        """Return a client exposing chat(prompt) -> str, as SemanticExtractor expects."""
        # Wrap an OpenAI-compatible client in a thin adapter; the model name below
        # ("deepseek-chat") is an assumption and should be made configurable if needed.
        from openai import OpenAI
        client = OpenAI(api_key=self.args.llm_api_key, base_url=self.args.llm_base_url)

        class _ChatAdapter:
            def chat(self, prompt: str, model: str = "deepseek-chat") -> str:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                return response.choices[0].message.content

        return _ChatAdapter()
def training_step(self, model, inputs):
"""Modified training step for Continuous Learning GRPO."""
if not self.use_training_free:
# Use vanilla GRPO (existing code)
return super().training_step(model, inputs)
# Continuous Learning GRPO
queries = inputs["query"]
batch_size = len(queries)
# Inject experiences into prompts
experiences_text = self.experience_manager.format_for_prompt()
enhanced_queries = [
self._inject_experiences(q, experiences_text)
for q in queries
]
# Generate G rollouts per query
G = self.args.group_size # e.g., 5
        all_trajectories = []
        # Ground truths are assumed to be provided per query in the batch inputs
        groundtruths = inputs.get("groundtruth", [None] * batch_size)
        for query, enhanced_query, groundtruth in zip(queries, enhanced_queries, groundtruths):
group_trajectories = []
for _ in range(G):
# Generate response
response = self._generate_response(
model,
enhanced_query,
temperature=self.args.rollout_temperature
)
# Compute reward
reward = self.compute_rewards([query], [response])[0]
# Create trajectory
traj = Trajectory(
query=query,
output=response,
reward=reward.item(),
                    groundtruth=groundtruth
)
group_trajectories.append(traj)
all_trajectories.append(group_trajectories)
# Extract semantic advantages
all_group_operations = []
for group in all_trajectories:
# Stage 1: Summarize each trajectory
for traj in group:
traj.summary = self.semantic_extractor.summarize_trajectory(
traj,
use_groundtruth=self.args.use_groundtruth
)
# Stage 2: Extract group advantage
operations = self.semantic_extractor.extract_group_advantage(
group,
experiences_text,
use_groundtruth=self.args.use_groundtruth
)
if operations:
all_group_operations.append(operations)
# Stage 3: Consolidate batch
if all_group_operations:
final_operations = self.semantic_extractor.consolidate_batch(
all_group_operations,
experiences_text
)
# Apply updates to experience library
self.experience_manager.apply_operations(final_operations)
# NO PARAMETER UPDATES - model stays frozen
# Return dummy loss for compatibility
return torch.tensor(0.0, requires_grad=True)
def _inject_experiences(self, query: str, experiences: str) -> str:
"""Inject experiences into query prompt."""
template = f"""Please solve the problem:
{query}
When solving problems, you MUST first carefully read and understand the helpful instructions and experiences:
{experiences}"""
return template
def save_model(self, output_dir: str, **kwargs):
"""Save model and experiences."""
super().save_model(output_dir, **kwargs)
if self.use_training_free:
# Save experience library
exp_path = Path(output_dir) / "experiences.json"
self.experience_manager.save(str(exp_path))
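
    # Hypothetical helper assumed by training_step above; a minimal sketch using
    # Hugging Face generate(). A production trainer may route generation through
    # vLLM or an inference server instead. `self.tokenizer` is assumed to exist.
    def _generate_response(self, model, prompt: str, temperature: float = 0.7) -> str:
        encoded = self.tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():  # the policy stays frozen, so no gradients are needed
            output_ids = model.generate(
                **encoded,
                max_new_tokens=1024,
                do_sample=True,
                temperature=temperature,
            )
        # Decode only the newly generated continuation (drop the prompt tokens)
        new_tokens = output_ids[0, encoded["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)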
2.4 Training Arguments
File: src/gym/train/arguments.py (additions)
@dataclass
class FinetuningArguments:
# Existing arguments
...
# NEW: Continuous Learning GRPO arguments
continuous_learning_grpo: bool = field(
default=False,
metadata={"help": "Use Continuous Learning GRPO instead of vanilla GRPO"}
)
experience_checkpoint_path: Optional[str] = field(
default=None,
metadata={"help": "Path to load/save experience library"}
)
    llm_api_key: Optional[str] = field(
default=None,
metadata={"help": "API key for LLM (e.g., DeepSeek, OpenAI)"}
)
llm_base_url: str = field(
default="https://api.deepseek.com/v1",
metadata={"help": "Base URL for LLM API"}
)
max_experience_operations: int = field(
default=3,
metadata={"help": "Max operations per group critique"}
)
rollout_temperature: float = field(
default=0.7,
metadata={"help": "Temperature for rollout generation"}
)
use_groundtruth: bool = field(
default=True,
metadata={"help": "Use ground truth in semantic extraction"}
)
group_size: int = field(
default=5,
metadata={"help": "Number of rollouts per query (G in paper)"}
)
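
# Illustrative only (not part of arguments.py): the new flags map onto the dataclass
# above when parsed with transformers' HfArgumentParser, e.g.
if __name__ == "__main__":
    from transformers import HfArgumentParser

    parser = HfArgumentParser(FinetuningArguments)
    (finetuning_args,) = parser.parse_args_into_dataclasses(args=[
        "--continuous_learning_grpo", "true",
        "--group_size", "5",
        "--llm_base_url", "https://api.deepseek.com/v1",
    ])
    print(finetuning_args.continuous_learning_grpo, finetuning_args.group_size)  # True 5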
3. Usage Example
# Training script
python src/train.py \
--model_name_or_path deepseek-ai/DeepSeek-V3 \
--dataset_name custom_math_dataset \
--output_dir ./output/continuous_learning_grpo \
--continuous_learning_grpo \
--group_size 5 \
--rollout_temperature 0.7 \
--num_train_epochs 3 \
--per_device_train_batch_size 8 \
--llm_api_key $DEEPSEEK_API_KEY \
--llm_base_url https://api.deepseek.com/v1 \
--use_groundtruth \
--max_experience_operations 3
4. Performance Expectations
Based on paper results with DeepSeek-V3.1-Terminus:
| Metric | Baseline | Continuous Learning GRPO | Improvement |
|---|---|---|---|
| AIME24 | 80.0% | 82.7% | +2.7% |
| AIME25 | 67.9% | 73.3% | +5.4% |
| Training Cost | N/A | ~$18 (100 samples) | 500x cheaper than fine-tuning |
| Training Data | N/A | 100 samples | 100x less than vanilla RL |
Key Success Factors:

- Group size G > 1 (ablation shows G=1 degrades performance)
- Multi-epoch training (3 epochs recommended)
- High-quality base model (works best on 100B+ models)
- Domain-appropriate reward functions
5. Testing Checklist
- ExperienceManager: CRUD operations work correctly (see the test sketch after this checklist)
- ExperienceManager: Serialization/deserialization preserves state
- SemanticExtractor: LLM responses parse correctly to JSON
- SemanticExtractor: Handles empty/homogeneous groups gracefully
- Trainer: Experiences inject into prompts correctly
- Trainer: Rollout generation produces G outputs per query
- Trainer: No parameter updates occur (verify gradients not computed)
- Trainer: Experience library grows across epochs
- End-to-end: Performance improves on validation set across epochs
- End-to-end: Learned experiences are human-readable and generalizable
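
A minimal pytest sketch for the first two ExperienceManager items, assuming the module path `gym.train.grpo.experience_manager` from Section 2.1 is importable:

```python
from gym.train.grpo.experience_manager import ExperienceManager  # import path assumed

def test_crud_operations():
    mgr = ExperienceManager()
    exp_id = mgr.add("Prefer exact arithmetic over floating point for number-theory problems.")
    assert mgr.modify(exp_id, "Prefer exact arithmetic for number-theory problems.")
    assert len(mgr) == 1
    assert mgr.delete(exp_id) and len(mgr) == 0

def test_save_load_preserves_state(tmp_path):
    mgr = ExperienceManager()
    mgr.add("Validate parsed JSON before applying operations.")
    path = tmp_path / "experiences.json"
    mgr.save(str(path))
    restored = ExperienceManager(checkpoint_path=str(path))
    assert restored.experiences == mgr.experiences and len(restored) == 1
```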
6. Next Steps
Phase 1 (Weeks 1-2): Core Implementation

- Implement ExperienceManager
- Implement SemanticExtractor
- Add unit tests

Phase 2 (Weeks 3-4): Integration

- Modify GRPOTrainer
- Add command-line arguments
- Test on a small dataset (10-20 samples)

Phase 3 (Weeks 5-6): Evaluation

- Run on the full dataset (100+ samples)
- Compare against the vanilla GRPO baseline
- Analyze learned experiences

Phase 4 (Weeks 7-8): Optimization

- Parallelize LLM calls (see the sketch below)
- Add caching for repeated queries
- Optimize prompt templates
- Add IPFS/on-chain storage (optional)
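
A sketch of one way to parallelize the Stage 1 summarization calls (Phase 4), assuming the LLM client is thread-safe; the helper name is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_group_parallel(extractor, group, max_workers: int = 8):
    """Run SemanticExtractor.summarize_trajectory over a trajectory group concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(extractor.summarize_trajectory, group))
    for trajectory, summary in zip(group, summaries):
        trajectory.summary = summary
    return group
```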
References
- Paper: Training-Free Group Relative Policy Optimization (arXiv:2510.08191v1)
- Code: https://github.com/TencentCloudADP/youtu-agent/tree/training_free_GRPO
- DeepSeek API: https://api-docs.deepseek.com/