Continuous Learning GRPO Integration Status¶
Date: October 28, 2025
Based on: Tencent youtu-agent implementation (arXiv:2510.08191v1)
Executive Summary¶
✅ Core components implemented - ExperienceManager, SemanticExtractor, APIModelAdapter
⚠️ Training workflow incomplete - Need unified training script matching Tencent's architecture
📋 Domain structure missing - Need math/web modules with dataset/verify/prompts
Estimated completion: 2-3 weeks for full parity with Tencent implementation
Component Comparison¶
✅ Fully Implemented in Gym¶
1. ExperienceManager (src/gym/train/grpo/experience_manager.py)¶
Status: 100% complete, matches Tencent functionality
| Feature | Gym | Tencent | Notes |
|---|---|---|---|
| Add experience | ✅ | ✅ | Identical API |
| Delete experience | ✅ | ✅ | Identical API |
| Modify experience | ✅ | ✅ | Identical API |
| Merge experiences | ✅ | ✅ | Identical API |
| Apply operations batch | ✅ | ✅ | Identical API |
| Format for prompt | ✅ | ✅ | Identical format |
| Save/load JSON | ✅ | ✅ | Identical format |
| Unit tests | ✅ | ❌ | Gym has comprehensive tests |
Code Quality: Excellent, well-documented, fully tested
2. SemanticExtractor (src/gym/train/grpo/semantic_extractor.py)¶
Status: 100% complete, implements all 3 stages
| Feature | Gym | Tencent | Notes |
|---|---|---|---|
| Stage 1: Trajectory summarization | ✅ | ✅ | Identical prompts |
| Stage 2: Group advantage extraction | ✅ | ✅ | Identical prompts |
| Stage 3: Batch consolidation | ✅ | ✅ | Identical prompts |
| JSON parsing with error handling | ✅ | ✅ | Better error handling in Gym |
| Support ground truth | ✅ | ✅ | Both support optional GT |
| Max operations limit | ✅ | ✅ | Configurable |
| Unit tests | ✅ | ❌ | Gym has comprehensive tests |
Code Quality: Excellent, follows paper exactly, well-tested
3. APIModelAdapter (src/gym/train/grpo/api_model_adapter.py)¶
Status: 100% complete, with additional features
| Feature | Gym | Tencent | Notes |
|---|---|---|---|
| OpenAI-compatible client | ✅ | ✅ | Identical |
| DeepSeek support | ✅ | ✅ | Identical |
| OpenAI support | ✅ | ❌ | Gym has extra wrapper |
| Batch generation | ✅ | ❌ | Gym has extra feature |
| Experience injection | ✅ | ❌ | Gym has built-in method |
| System prompt support | ✅ | ❌ | Gym has extra feature |
Code Quality: Excellent, more flexible than Tencent
⚠️ Partially Implemented¶
4. GRPOTrainer (src/gym/train/grpo/trainer.py)¶
Status: Base GRPO complete, training-free mode missing
| Feature | Gym | Tencent | Gap |
|---|---|---|---|
| Group-based rollouts | ✅ | ✅ | Identical |
| Advantage computation | ✅ | ✅ | Identical |
| Value-free optimization | ✅ | ✅ | Identical |
| Training-free mode | ❌ | ✅ | MISSING |
| Experience integration | ❌ | ✅ | MISSING |
| Frozen model operation | ❌ | ✅ | MISSING |
| Async rollout generation | ❌ | ✅ | MISSING |
| Timeout/retry logic | ❌ | ✅ | MISSING |
What's Needed:
# Add to GRPOTrainer
def training_step_free(self, model, inputs):
"""Continuous Learning GRPO: no parameter updates."""
# 1. Generate G rollouts with experience injection
# 2. Compute rewards via verify function
# 3. Extract semantic advantages (not numerical)
# 4. Update experience library
# 5. Return zero loss (no gradient update)
return torch.tensor(0.0)
❌ Not Yet Implemented¶
5. Training Workflow¶
Status: Core infrastructure exists, unified script missing
Tencent's train.py structure:
for epoch in epochs:
for batch in batches:
# 1. Load experiences from previous step
experiences = load_experiences(step - 1)
# 2. Inject experiences into prompts
enhanced_prompts = inject_experiences(batch, experiences)
# 3. Generate G rollouts per query
rollouts = await rollout_dataset(enhanced_prompts)
# 4. Compute rewards
for rollout in rollouts:
rollout["reward"] = verify_func(rollout)
# 5. Extract semantic advantages
new_experiences = ExperienceUpdater().run(rollouts, experiences)
# 6. Save for next step
save_experiences(step + 1, new_experiences)
What Gym needs: - [ ] Create src/gym/train/grpo/training_free_workflow.py - [ ] Implement multi-epoch training loop - [ ] Add experience checkpoint management - [ ] Integrate with GRPOTrainer
6. Rollout Infrastructure¶
Status: Not implemented
Tencent's main.py features:
async def rollout_dataset(
worker_agent,
data,
rollouts,
verify_func,
rollout_concurrency=5,
task_timeout=3600,
max_retries=3
):
# Async worker pool
# Timeout handling
# Automatic retry on failure
# Progress tracking with tqdm
# Incremental saving
What Gym needs: - [ ] Create src/gym/train/grpo/rollout_manager.py - [ ] Implement async batch rollout generation - [ ] Add timeout/retry logic - [ ] Support both API and local models - [ ] Progress tracking and logging
7. Domain-Specific Modules¶
Status: Not implemented
Tencent structure:
continuous_learning_grpo/
├── math/
│ ├── dataset.py # Load AIME/math datasets
│ ├── verify.py # Verify correctness (reward function)
│ ├── prompts.py # Math-specific prompt templates
│ └── experience.py # Math-specific ExperienceUpdater
└── web/
├── dataset.py # Load WebWalker datasets
├── verify.py # Verify web navigation success
├── prompts.py # Web-specific prompt templates
└── experience.py # Web-specific ExperienceUpdater
What Gym needs: - [ ] Create src/gym/train/grpo/domains/ directory - [ ] Implement domains/math/ module - [ ] Implement domains/web/ module - [ ] Generic base classes for custom domains
Architecture Gaps¶
1. Experience Injection Pipeline¶
Tencent approach:
# Inject experiences into prompt
enhanced_prompt = PROBLEM_WITH_EXPERIENCE_TEMPLATE.format(
experiences=formatted_experiences,
problem=problem
)
# Generate with experiences as context
response = model.generate(enhanced_prompt)
Gym approach (currently missing): - Data collator doesn't support experience injection - Template system needs extension - Need custom prompt formatter
Solution:
# Add to data/template.py
def format_with_experiences(
self,
query: str,
experiences: str,
system: str = None
) -> str:
"""Format query with experiences injected."""
experience_context = f"\\n\\nHelpful experiences:\\n{experiences}" if experiences else ""
return f"{system}{experience_context}\\n\\nProblem: {query}"
2. Frozen Model Training¶
Tencent approach:
# Model weights never updated
# All learning happens in experience library
new_experiences = extract_semantic_advantages(rollouts)
experience_library.update(new_experiences)
# No optimizer.step(), no gradient computation
Gym approach (currently missing): - GRPOTrainer assumes gradient updates - Need training_free flag to skip optimizer - Need checkpoint system to save experiences
Solution:
class GRPOTrainer:
def training_step(self, model, inputs):
if self.finetuning_args.continuous_learning_grpo:
return self.training_step_free(model, inputs)
else:
return self.training_step_standard(model, inputs)
3. Multi-Epoch Experience Evolution¶
Tencent approach:
Epoch 0: experiences = {}
Epoch 1: experiences = E1 (learned from epoch 0)
Epoch 2: experiences = E2 (learned from epoch 1)
...
Gym approach (currently missing): - No cross-epoch experience persistence - Need checkpoint directory structure - Need experience version tracking
Solution:
output/
└── experiment_name/
├── epoch_0/
│ ├── shuffled_data.jsonl
│ └── step_0/
│ ├── rollout.jsonl
│ ├── single_rollout_summary.json
│ ├── single_query_critique.json
│ ├── batch_update.json
│ └── experiences.json # Used by step_1
├── epoch_1/
│ └── step_1/
│ └── experiences.json # Used by step_2
└── stats.json
Integration Roadmap¶
Phase 1: Foundation (Week 1)¶
Goal: Create unified training script
- ✅ ExperienceManager implementation
- ✅ SemanticExtractor implementation
- ✅ APIModelAdapter implementation
- Create
training_free_workflow.py - Implement basic training loop
- Add experience checkpoint system
Deliverable: Run single epoch on toy dataset
Phase 2: Rollout Infrastructure (Week 2)¶
Goal: Async batch processing
- Create
rollout_manager.py - Implement async worker pool
- Add timeout/retry logic
- Support API and local models
- Progress tracking
Deliverable: Generate 100 rollouts in parallel
Phase 3: Domain Integration (Week 3)¶
Goal: Math and web domains
- Create
domains/math/module - dataset.py (AIME loader)
- verify.py (correctness checker)
- prompts.py (math templates)
- Create
domains/web/module - dataset.py (WebWalker loader)
- verify.py (navigation checker)
- prompts.py (web templates)
Deliverable: Train on AIME24 dataset
Phase 4: End-to-End Testing (Week 4)¶
Goal: Validate full workflow
- Run 3-epoch training on AIME24 (100 samples)
- Validate experience library growth
- Compare with Tencent baseline
- Verify zero parameter updates
- Cost analysis ($18 target)
Deliverable: Working Continuous Learning GRPO system
Testing Status¶
Unit Tests¶
Location: tests/train/test_continuous_learning_grpo.py
| Component | Test Coverage | Status |
|---|---|---|
| ExperienceManager | 100% | ✅ 18/18 passing |
| SemanticExtractor | 100% | ✅ 10/10 passing |
| Trajectory | 100% | ✅ 2/2 passing |
| Integration | 0% | ❌ Not implemented |
Integration Tests Needed¶
- End-to-end training loop
- Multi-epoch experience evolution
- Experience injection into prompts
- Frozen model verification
- Cost tracking
- Performance benchmarks
Example Scripts Status¶
Tencent Examples¶
# Train on math domain
python continuous_learning_grpo/train.py \
--mode agent \
--domain math \
--experiment_name test_aime \
--dataset aime24 \
--epochs 3 \
--grpo_n 5
# Evaluate with experiences
python continuous_learning_grpo/main.py \
--mode agent \
--domain math \
--experiment_name eval_aime \
--dataset aime25 \
--experience_file data/math/train/test_aime/step_X/experiences.json
Gym Examples (Needed)¶
-
scripts/train_grpo_free_math.py- Math domain training -
scripts/train_grpo_free_web.py- Web domain training -
scripts/eval_grpo_free.py- Evaluation with experiences -
examples/grpo_free_custom_domain.py- Custom domain guide
Performance Targets¶
Based on Tencent paper benchmarks:
| Metric | Target | Current | Status |
|---|---|---|---|
| AIME24 Accuracy | 82.7% | N/A | ⏳ Not tested |
| AIME25 Accuracy | 73.3% | N/A | ⏳ Not tested |
| Training Cost | $18 | N/A | ⏳ Not measured |
| Training Time | 6 hours | N/A | ⏳ Not measured |
| Experience Count | 50-200 | N/A | ⏳ Not tracked |
| Data Efficiency | 100 samples | N/A | ⏳ Not tested |
Documentation Status¶
Existing Documentation¶
- ✅ Architecture overview in
LLM.md - ✅ Paper analysis in
LLM.md - ✅ Component API docs (docstrings)
- ✅ Unit test examples
Missing Documentation¶
- End-to-end tutorial
- Custom domain guide
- Cost optimization tips
- Troubleshooting guide
- Comparison with fine-tuning
- API reference
Dependencies¶
Already Installed¶
Additional Needed¶
aiohttp # Async HTTP (for rollouts)
tqdm # Progress tracking
datasets # Dataset loading (already have)
Configuration Integration¶
Tencent Config (Environment Variables)¶
export UTU_LLM_TYPE="deepseek"
export UTU_LLM_MODEL="deepseek-chat"
export UTU_LLM_BASE_URL="https://api.deepseek.com/v1"
export UTU_LLM_API_KEY="sk-xxx"
Gym Config (YAML)¶
Needed: configs/grpo_free_math.yaml
model_name_or_path: deepseek-chat
api_mode: true
api_base_url: https://api.deepseek.com/v1
api_key: ${DEEPSEEK_API_KEY}
finetuning_type: grpo
continuous_learning_grpo: true
grpo_group_size: 5
grpo_normalize_advantages: false # Not used in training-free
dataset: aime24
dataset_truncate: 100
domain: math
epochs: 3
batch_size: 64
rollout_concurrency: 5
rollout_temperature: 0.7
rollout_max_tokens: 4096
task_timeout: 3600
output_dir: ./output/grpo_free_math
save_steps: 1
logging_steps: 1
Key Differences: Gym vs Tencent¶
| Aspect | Gym | Tencent | Recommendation |
|---|---|---|---|
| Framework | HuggingFace Trainer | Custom async loop | Keep Gym's, add training-free mode |
| Model Loading | Transformers | API client | Support both (already have) |
| Rollout Generation | Sync | Async | Add async option |
| Experience Storage | JSON | JSON | Identical ✅ |
| Prompts | Generic | Domain-specific | Add domain modules |
| Verify Function | Placeholder | Domain-specific | Add domain modules |
| Agent Support | No | Yes (SimpleAgent) | Optional, not required |
| Testing | Comprehensive | None | Keep Gym's ✅ |
Next Steps (Immediate)¶
-
Create training workflow script (1-2 days)
-
Add math domain module (1-2 days)
-
Test on toy dataset (1 day)
-
Full AIME benchmark (1 day)
Questions for User¶
- API vs Local: Should we prioritize API-based (DeepSeek) or local model support first?
- API: Faster to implement, matches paper
-
Local: More flexible, no API costs
-
Domain Priority: Which domain to implement first?
- Math: Better benchmarks available (AIME)
-
Web: More complex, may reveal edge cases
-
Integration Style: How to integrate with existing Gym?
- Option A: Extend GRPOTrainer with
continuous_learning_grpo=Trueflag - Option B: Create separate
ContinuousLearningGRPOTrainerclass -
Recommendation: Option A (cleaner)
-
Agent Framework: Should we support Tencent's SimpleAgent?
- Pro: Full compatibility
- Con: Additional dependency
- Recommendation: Optional, focus on API/model first
Last Updated: October 28, 2025
Status: ✅ Core components complete, ⚠️ Training workflow in progress
ETA for Full Parity: 2-3 weeks