Continuous Learning GRPO Integration Strategy for Gym¶
Date: October 28, 2025
Overview¶
Gym already has an excellent foundation for Continuous Learning GRPO. The integration should leverage the existing infrastructure (the GRPO/GSPO trainers) and extend it cleanly.
Current Infrastructure ✅¶
1. Configuration Parameters (Already Exist!)¶
Location: src/gym/hparams/finetuning_args.py lines 277-312
# Continuous Learning GRPO parameters
continuous_learning_grpo: bool = False
experience_lib_path: Optional[str] = None
experience_max_size: int = 100
llm_api_key: Optional[str] = None
llm_base_url: str = "https://api.deepseek.com/v1"
llm_model: str = "deepseek-chat"
semantic_max_operations: int = 3
rollout_temperature: float = 0.7
use_groundtruth: bool = True
✅ Status: Complete, ready to use
2. Core Components (Fully Implemented)¶
- ✅ ExperienceManager - src/gym/train/grpo/experience_manager.py
- ✅ SemanticExtractor - src/gym/train/grpo/semantic_extractor.py
- ✅ APIModelAdapter - src/gym/train/grpo/api_model_adapter.py
- ✅ Comprehensive unit tests - tests/train/test_continuous_learning_grpo.py
3. Trainer Infrastructure¶
- ✅ GRPOTrainer - Group-based advantage computation (see the sketch below)
- ✅ GSPOTrainer - Sequence-level optimization
- ✅ Unified training pipeline - src/gym/train/grpo/workflow.py
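For reference, the group-based advantage computation mentioned above is the standard GRPO normalization of rewards within each group of G rollouts for the same query. A minimal sketch (not Gym's actual implementation):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one group of G rollouts for the same query.

    rewards: shape (G,), one scalar reward per rollout.
    Returns zero-mean, unit-variance advantages for that group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```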
Recommended Integration Approach¶
Option 1: Extend GRPOTrainer (RECOMMENDED)¶
Why this approach:

- ✅ Maintains unified codebase
- ✅ Leverages existing infrastructure
- ✅ Clean flag-based behavior switching
- ✅ Consistent with Gym's design philosophy
- ✅ Easy to test and maintain
Implementation:
# src/gym/train/grpo/trainer.py
class GRPOTrainer(Trainer):
def __init__(self, finetuning_args, ...):
super().__init__(...)
# Training-free mode setup
if finetuning_args.continuous_learning_grpo:
self._setup_training_free_mode(finetuning_args)
def _setup_training_free_mode(self, args):
"""Initialize Continuous Learning GRPO components."""
from .experience_manager import ExperienceManager
from .semantic_extractor import SemanticExtractor, LLMClient
from .api_model_adapter import DeepSeekAdapter
# Experience library
self.experience_manager = ExperienceManager(
checkpoint_path=args.experience_lib_path
)
# LLM for semantic extraction
llm_client = LLMClient(
api_key=args.llm_api_key,
base_url=args.llm_base_url,
model=args.llm_model
)
# Semantic extractor
self.semantic_extractor = SemanticExtractor(
llm_client=llm_client,
max_operations=args.semantic_max_operations
)
# Freeze model
self.model.eval()
for param in self.model.parameters():
param.requires_grad = False
def training_step(self, model, inputs):
"""Dispatch to appropriate training mode."""
if self.finetuning_args.continuous_learning_grpo:
return self._training_step_free(model, inputs)
else:
return self._training_step_standard(model, inputs)
def _training_step_free(self, model, inputs):
"""Continuous Learning GRPO: no parameter updates."""
# 1. Inject experiences into prompts
enhanced_inputs = self._inject_experiences(inputs)
# 2. Generate G rollouts per query
rollouts = self._generate_rollouts(enhanced_inputs)
# 3. Compute rewards (via verify function)
for rollout in rollouts:
rollout["reward"] = self._verify_rollout(rollout)
# 4. Extract semantic advantages
operations = self._extract_semantic_advantages(rollouts)
# 5. Update experience library
self.experience_manager.apply_operations(operations)
# 6. Save checkpoint (experiences only)
self._save_experiences()
# 7. Return zero loss (no gradient update!)
return torch.tensor(0.0, device=model.device)
def _inject_experiences(self, inputs):
"""Inject experience library into prompts."""
experiences = self.experience_manager.format_for_prompt()
# Enhance prompts with experiences
# Implementation depends on tokenizer/template
...
return enhanced_inputs
def _generate_rollouts(self, inputs):
"""Generate G rollouts per query."""
rollouts = []
for i in range(self.group_size):
# Generate with temperature for diversity
output = self.model.generate(
**inputs,
temperature=self.finetuning_args.rollout_temperature,
do_sample=True
)
rollouts.append(output)
return rollouts
def _extract_semantic_advantages(self, rollouts):
"""Extract semantic advantages using LLM."""
# Group rollouts by query
query_groups = self._group_by_query(rollouts)
# Stage 1: Summarize trajectories
for group in query_groups:
for rollout in group:
rollout["summary"] = self.semantic_extractor.summarize_trajectory(
rollout,
use_groundtruth=self.finetuning_args.use_groundtruth
)
# Stage 2: Extract group advantages
all_operations = []
for group in query_groups:
experiences_str = self.experience_manager.format_for_prompt()
operations = self.semantic_extractor.extract_group_advantage(
group,
experiences_str,
use_groundtruth=self.finetuning_args.use_groundtruth
)
all_operations.append(operations)
# Stage 3: Consolidate batch
experiences_str = self.experience_manager.format_for_prompt()
final_operations = self.semantic_extractor.consolidate_batch(
all_operations,
experiences_str
)
return final_operations
def _save_experiences(self):
"""Save experience library to checkpoint."""
if self.experience_manager.checkpoint_path:
self.experience_manager.save(
self.experience_manager.checkpoint_path
)
Benefits:

- Single unified trainer
- Clean separation of concerns
- Easy to switch between modes
- Maintains all existing functionality
- No code duplication
Option 2: Separate ContinuousLearningGRPOTrainer (NOT RECOMMENDED)¶
Why NOT recommended:

- ❌ Code duplication (group logic, advantage computation)
- ❌ Harder to maintain consistency
- ❌ More complex for users (which trainer to use?)
- ❌ Breaks DRY principle
Only use if:

- Training-free logic is drastically different
- A completely separate workflow is needed
- Standard GRPO should eventually be deprecated
Integration Steps¶
Phase 1: Extend GRPOTrainer (Week 1)¶
Files to modify:

- src/gym/train/grpo/trainer.py (main changes)
  - Add _setup_training_free_mode()
  - Add _training_step_free()
  - Add _inject_experiences()
  - Add _generate_rollouts()
  - Add _extract_semantic_advantages()
  - Update training_step() to dispatch between standard and training-free modes

- src/gym/data/template.py (experience injection; a usage sketch follows after this list)

  ```python
  def format_with_experiences(
      self,
      query: str,
      experiences: str,
      system: Optional[str] = None,
  ) -> str:
      """Format query with experiences prepended."""
      if experiences and experiences != "None":
          experience_section = f"\n\nHelpful experiences:\n{experiences}"
      else:
          experience_section = ""
      return f"{system or ''}{experience_section}\n\nProblem: {query}"
  ```

- src/gym/train/grpo/workflow.py (checkpoint management)
  - Update to save experiences alongside the model
  - Load experiences at training start
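As referenced above, here is a sketch of how _inject_experiences could use the new template helper. It assumes the trainer holds the active template as self.template and that the batch exposes raw query text; the exact integration depends on the tokenizer and data pipeline:

```python
# Sketch only: the field names ("query", "system", "prompt") and self.template are
# assumptions about how the data pipeline exposes raw text to the trainer.
def _inject_experiences(self, inputs):
    experiences = self.experience_manager.format_for_prompt()
    enhanced = []
    for example in inputs:
        prompt = self.template.format_with_experiences(
            query=example["query"],
            experiences=experiences,
            system=example.get("system"),
        )
        enhanced.append({**example, "prompt": prompt})
    return enhanced
```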
Testing:
# tests/train/test_grpo_trainer_integration.py
import copy

import torch


def test_training_free_mode():
trainer = GRPOTrainer(
finetuning_args=FinetuningArguments(
continuous_learning_grpo=True,
llm_api_key="sk-test",
experience_lib_path="./test_exp.json"
),
...
)
# Verify no parameter updates
initial_params = copy.deepcopy(trainer.model.state_dict())
# Run training step
loss = trainer.training_step(model, inputs)
# Check model unchanged
final_params = trainer.model.state_dict()
assert torch.allclose(initial_params['...'], final_params['...'])
# Check experiences updated
assert len(trainer.experience_manager) > 0
Phase 2: Domain Modules (Week 2)¶
Create domain structure:
src/gym/train/grpo/domains/
├── __init__.py
├── base.py # Base classes
├── math/
│ ├── __init__.py
│ ├── dataset.py # AIME dataset loader
│ ├── verify.py # Math correctness checker
│ └── prompts.py # Math-specific templates
└── web/
├── __init__.py
├── dataset.py # WebWalker dataset loader
├── verify.py # Navigation success checker
└── prompts.py # Web-specific templates
Base interface:
# src/gym/train/grpo/domains/base.py
from abc import ABC, abstractmethod
from typing import Dict, List


class DomainAdapter(ABC):
"""Base class for domain-specific adapters."""
@abstractmethod
def load_dataset(self, name: str) -> List[Dict]:
"""Load domain-specific dataset."""
pass
@abstractmethod
def verify(self, output: str, groundtruth: str) -> float:
"""Verify correctness, return reward."""
pass
@abstractmethod
def format_prompt(self, problem: str, experiences: str) -> str:
"""Format domain-specific prompt."""
pass
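How the `domain` config value (e.g. `math`, `web`) resolves to one of these adapters is not pinned down above; a small registry is one option. The names below (DOMAIN_REGISTRY, register_domain, get_domain_adapter) are hypothetical and shown only as a sketch:

```python
# src/gym/train/grpo/domains/__init__.py (hypothetical sketch)
from typing import Dict, Type

from .base import DomainAdapter

DOMAIN_REGISTRY: Dict[str, Type[DomainAdapter]] = {}


def register_domain(name: str):
    """Class decorator that registers a DomainAdapter under its config name."""
    def wrapper(cls: Type[DomainAdapter]) -> Type[DomainAdapter]:
        DOMAIN_REGISTRY[name] = cls
        return cls
    return wrapper


def get_domain_adapter(name: str) -> DomainAdapter:
    """Resolve the `domain` config value to an adapter instance."""
    if name not in DOMAIN_REGISTRY:
        raise ValueError(f"Unknown domain '{name}'; known: {sorted(DOMAIN_REGISTRY)}")
    return DOMAIN_REGISTRY[name]()
```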
Math domain example:
# src/gym/train/grpo/domains/math/dataset.py
from typing import Dict, List

from datasets import load_dataset  # HuggingFace datasets


def load_aime_dataset(split: str = "2024") -> List[Dict]:
"""Load AIME competition problems."""
# Load from HuggingFace datasets or local JSON
dataset = load_dataset("zooai/aime", split=split)
return [
{
"problem": sample["problem"],
"groundtruth": sample["answer"],
"domain": "math"
}
for sample in dataset
]
# src/gym/train/grpo/domains/math/verify.py
import math


def verify_math_answer(output: str, groundtruth: str) -> float:
"""Verify math answer correctness."""
# Extract answer from output
predicted = extract_answer(output)
expected = extract_answer(groundtruth)
# Compare with tolerance
if math.isclose(float(predicted), float(expected), rel_tol=1e-5):
return 1.0
else:
return 0.0
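The verifier above calls an extract_answer helper that is not shown. A minimal sketch, assuming answers appear either inside \boxed{...} or as the last number in the text (both assumptions about the answer format):

```python
import re


def extract_answer(text: str) -> str:
    """Pull the final numeric answer from a solution string (sketch).

    Prefers a \\boxed{...} expression; otherwise falls back to the last
    number in the text. Raises ValueError if nothing is found.
    """
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if numbers:
        return numbers[-1]
    raise ValueError("No answer found in output")
```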
Phase 3: Example Scripts (Week 3)¶
Create user-facing scripts:
# scripts/train_grpo_free_math.py
"""Continuous Learning GRPO on AIME math problems."""
import os

from gym import run_train
from gym.hparams import get_train_args
def main():
# Configure training
config = {
# Model (API-based)
"model_name_or_path": "deepseek-chat",
"api_mode": True,
"api_base_url": "https://api.deepseek.com/v1",
# Continuous Learning GRPO
"stage": "grpo",
"continuous_learning_grpo": True,
"grpo_group_size": 5,
# Experience management
"experience_lib_path": "./output/math_experiences.json",
"llm_api_key": os.getenv("DEEPSEEK_API_KEY"),
"semantic_max_operations": 3,
"use_groundtruth": True,
# Dataset
"dataset": "aime24",
"domain": "math",
# Training
"num_train_epochs": 3,
"per_device_train_batch_size": 64,
"rollout_temperature": 0.7,
# Output
"output_dir": "./output/grpo_free_math",
"save_steps": 1,
"logging_steps": 1
}
# Parse args
model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(config)
# Run training
run_train(model_args, data_args, training_args, finetuning_args, generating_args)
if __name__ == "__main__":
main()
Shell script wrapper:
#!/bin/bash
# scripts/train_grpo_free_math.sh
export DEEPSEEK_API_KEY="sk-xxx"
python scripts/train_grpo_free_math.py \
--dataset aime24 \
--dataset_truncate 100 \
--num_train_epochs 3 \
--grpo_group_size 5 \
--output_dir ./output/grpo_free_math_test
API vs Local Model Support¶
API Mode (Recommended for Training-Free)¶
Pros:

- ✅ Matches Tencent paper (DeepSeek-V3)
- ✅ No GPU required
- ✅ Faster inference
- ✅ Lower total cost (~$18 for 100 samples)
- ✅ Easier setup
Implementation:
if finetuning_args.api_mode:
from .api_model_adapter import DeepSeekAdapter
self.model = DeepSeekAdapter(
api_key=finetuning_args.llm_api_key,
model=finetuning_args.model_name_or_path
)
Local Model Mode (Optional)¶
Pros:

- ✅ No API dependency
- ✅ Full control
- ✅ Better for privacy

Cons:

- ❌ Requires GPU (32GB+ for 32B models)
- ❌ Slower inference
- ❌ Higher infrastructure cost
Implementation:
if not finetuning_args.api_mode:
# Use standard HuggingFace loading
self.model = AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path,
...
)
Configuration Examples¶
Math Domain (API-based)¶
# configs/grpo_free_math_api.yaml
model_name_or_path: deepseek-chat
api_mode: true
api_base_url: https://api.deepseek.com/v1
stage: grpo
continuous_learning_grpo: true
grpo_group_size: 5
rollout_temperature: 0.7
dataset: aime24
domain: math
use_groundtruth: true
experience_lib_path: ./output/math_exp/experiences.json
semantic_max_operations: 3
num_train_epochs: 3
per_device_train_batch_size: 64
output_dir: ./output/grpo_free_math
save_steps: 1
logging_steps: 1
Web Domain (Local Qwen3-32B)¶
# configs/grpo_free_web_local.yaml
model_name_or_path: Qwen/Qwen3-32B-Instruct
api_mode: false
stage: grpo
continuous_learning_grpo: true
grpo_group_size: 5
dataset: webwalker
domain: web
use_groundtruth: false # Self-discrimination
experience_lib_path: ./output/web_exp/experiences.json
llm_api_key: ${DEEPSEEK_API_KEY} # For semantic extraction only
num_train_epochs: 3
per_device_train_batch_size: 32
output_dir: ./output/grpo_free_web
Testing Strategy¶
Unit Tests (Already Complete ✅)¶
- ExperienceManager: 18/18 passing
- SemanticExtractor: 10/10 passing
- Trajectory: 2/2 passing
Integration Tests (To Add)¶
# tests/train/test_grpo_trainer_training_free.py
def test_training_free_mode_initialization():
"""Test training-free mode setup."""
...
def test_experience_injection():
"""Test experience injection into prompts."""
...
def test_no_parameter_updates():
"""Verify model weights unchanged."""
...
def test_experience_library_growth():
"""Check experiences accumulate across epochs."""
...
def test_checkpoint_loading():
"""Test loading experiences from checkpoint."""
...
def test_domain_adapters():
"""Test math/web domain modules."""
...
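As an illustration, the experience-injection test could be fleshed out roughly as follows. make_test_trainer is a hypothetical fixture, and the enhanced-input structure matches the injection sketch earlier in this document:

```python
def test_experience_injection_sketch(monkeypatch):
    """Sketch: injected prompts should contain the formatted experience library."""
    trainer = make_test_trainer(continuous_learning_grpo=True)  # hypothetical fixture
    # Stub format_for_prompt so the test does not depend on library contents.
    monkeypatch.setattr(
        trainer.experience_manager,
        "format_for_prompt",
        lambda: "Always check edge cases such as n = 0.",
    )
    enhanced = trainer._inject_experiences([{"query": "Compute 2 + 2."}])
    assert "edge cases" in enhanced[0]["prompt"]
```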
End-to-End Tests¶
# tests/train/test_grpo_free_e2e.py
import json
from pathlib import Path


def test_full_training_workflow():
"""Run 2 epochs on toy dataset."""
config = {...}
run_train(config)
# Verify results
assert Path("output/experiences.json").exists()
experiences = json.load(open("output/experiences.json"))
assert len(experiences) > 0
Migration from Tencent Code¶
What to Keep¶
- ✅ 3-stage LLM process (already implemented)
- ✅ Experience format (already implemented)
- ✅ Prompts (already implemented)
- ✅ Domain structure (keep Tencent's layout; still to be added in Gym)
What to Adapt¶
- Training loop → Integrate with GRPOTrainer
- Async rollouts → Add to trainer (see the sketch after this list)
- SimpleAgent → Optional, support API and HF models
- Directory structure → Use Gym's output format
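For the async-rollout item above, a sketch of what concurrent rollout generation against an API model could look like. adapter.agenerate is a hypothetical async method, not the current APIModelAdapter interface; the point is only that API rollouts are I/O-bound and parallelize well:

```python
import asyncio


async def generate_rollouts_async(adapter, prompt: str, group_size: int, temperature: float):
    """Fire off G rollout requests concurrently for one query (sketch)."""
    tasks = [
        adapter.agenerate(prompt, temperature=temperature)  # hypothetical async API call
        for _ in range(group_size)
    ]
    return await asyncio.gather(*tasks)
```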
What to Skip¶
- ❌ Tencent's config system (use Gym's)
- ❌ UTU agent framework (use Gym's trainer)
- ❌ Custom CLI (use gym-cli)
Timeline¶
Week 1: Trainer Integration¶
- Add training_free mode to GRPOTrainer
- Implement experience injection
- Add rollout generation
- Basic testing
Week 2: Domain Modules¶
- Create domains/ directory structure
- Implement math domain
- Implement web domain
- Domain-specific tests
Week 3: Scripts & Examples¶
- Create train_grpo_free_math.py
- Create train_grpo_free_web.py
- Add YAML configs
- User documentation
Week 4: Testing & Benchmarks¶
- End-to-end tests
- AIME24 benchmark
- Cost analysis
- Performance comparison
Success Criteria¶
- Functional:
  - Runs 3-epoch training on AIME24 (100 samples)
  - Experience library grows (50-200 experiences)
  - Zero parameter updates (model frozen)
  - Checkpoint save/load works
- Performance:
  - AIME24 accuracy ≥75% (target: 82.7%)
  - Training cost ≤$25 (target: $18)
  - Training time ≤8 hours (target: 6 hours)
- Usability:
  - Simple config YAML
  - Works with gym-cli train
  - Clear error messages
  - Example scripts
- Quality:
  - All tests passing
  - Documentation complete
  - Code review approved
Recommendation¶
Use Option 1: Extend GRPOTrainer
This approach:

- ✅ Leverages existing GRPO infrastructure
- ✅ Maintains clean separation with the continuous_learning_grpo flag
- ✅ Consistent with Gym's design philosophy
- ✅ Easy to test and maintain
- ✅ Minimal code duplication
- ✅ Works with existing configs (the parameters already exist!)
Next immediate actions:

1. Modify src/gym/train/grpo/trainer.py to add training-free mode
2. Test with an API model (DeepSeek)
3. Add the math domain module
4. Run a toy benchmark
ETA: 2-3 weeks for full integration and testing.
Questions to resolve:

1. Should async rollout generation be mandatory or optional?
2. Should we support Tencent's SimpleAgent or focus on API/HF models?
3. Should domain modules be required or optional (generic fallback)?
Recommendation: Start simple (API + math domain + sync rollouts), add advanced features later.