Continuous Learning GRPO Integration Status¶

Date: October 28, 2025
Based on: Tencent youtu-agent implementation (arXiv:2510.08191v1)

Executive Summary¶

✅ Core components implemented - ExperienceManager, SemanticExtractor, APIModelAdapter
⚠️ Training workflow incomplete - Need unified training script matching Tencent's architecture
📋 Domain structure missing - Need math/web modules with dataset/verify/prompts

Estimated completion: 2-3 weeks for full parity with Tencent implementation

Component Comparison¶

✅ Fully Implemented in Gym¶

1. ExperienceManager (`src/gym/train/grpo/experience_manager.py`)¶

Status: 100% complete, matches Tencent functionality

Feature	Gym	Tencent	Notes
Add experience	✅	✅	Identical API
Delete experience	✅	✅	Identical API
Modify experience	✅	✅	Identical API
Merge experiences	✅	✅	Identical API
Apply operations batch	✅	✅	Identical API
Format for prompt	✅	✅	Identical format
Save/load JSON	✅	✅	Identical format
Unit tests	✅	❌	Gym has comprehensive tests

Code Quality: Excellent, well-documented, fully tested
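
To make the operations in the table concrete, here is a minimal sketch of an experience library. It is not the actual `ExperienceManager` code: the class name `ExperienceLibrary`, the operation keys (`op`, `id`, `ids`, `text`), the `G0`-style ids, and the prompt format are all assumptions made for illustration.

```python
import json
from pathlib import Path


class ExperienceLibrary:
    """Toy experience store; names and formats are illustrative assumptions."""

    def __init__(self):
        self.experiences = {}  # experience id -> text
        self._next_id = 0

    def add(self, text):
        exp_id = f"G{self._next_id}"
        self._next_id += 1
        self.experiences[exp_id] = text
        return exp_id

    def delete(self, exp_id):
        # Unknown ids are ignored so one stale operation cannot abort a batch.
        self.experiences.pop(exp_id, None)

    def modify(self, exp_id, text):
        if exp_id in self.experiences:
            self.experiences[exp_id] = text

    def merge(self, exp_ids, text):
        # Replace several related experiences with one consolidated entry.
        for exp_id in exp_ids:
            self.delete(exp_id)
        return self.add(text)

    def apply_operations(self, operations):
        for op in operations:
            kind = op.get("op")
            if kind == "add":
                self.add(op["text"])
            elif kind == "delete":
                self.delete(op["id"])
            elif kind == "modify":
                self.modify(op["id"], op["text"])
            elif kind == "merge":
                self.merge(op["ids"], op["text"])

    def format_for_prompt(self):
        return "\n".join(f"[{i}] {t}" for i, t in self.experiences.items())

    def save(self, path):
        Path(path).write_text(json.dumps(self.experiences, indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path):
        lib = cls()
        lib.experiences = json.loads(Path(path).read_text(encoding="utf-8"))
        # Continue numbering after the highest existing id to avoid collisions.
        used = [int(k[1:]) for k in lib.experiences if k[1:].isdigit()]
        lib._next_id = max(used, default=-1) + 1
        return lib
```

Keeping every mutation behind `apply_operations` means a batch produced by the extractor can be replayed or logged as plain JSON.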

2. SemanticExtractor (`src/gym/train/grpo/semantic_extractor.py`)¶

Status: 100% complete, implements all 3 stages

Feature	Gym	Tencent	Notes
Stage 1: Trajectory summarization	✅	✅	Identical prompts
Stage 2: Group advantage extraction	✅	✅	Identical prompts
Stage 3: Batch consolidation	✅	✅	Identical prompts
JSON parsing with error handling	✅	✅	Better error handling in Gym
Support ground truth	✅	✅	Both support optional GT
Max operations limit	✅	✅	Configurable
Unit tests	✅	❌	Gym has comprehensive tests

Code Quality: Excellent, follows paper exactly, well-tested
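
The "JSON parsing with error handling" and "Max operations limit" rows can be illustrated with a small, defensive parser. This is a sketch under assumptions, not the `SemanticExtractor` code: the function name `parse_operations`, the `op` key, and the default limit of 3 are hypothetical.

```python
import json

VALID_OPS = {"add", "delete", "modify", "merge"}


def parse_operations(response, max_operations=3):
    """Extract a list of operation dicts from raw model output.

    Returns an empty list instead of raising when no valid JSON array is found.
    """
    # Take the outermost [...] span so surrounding commentary or code fences are skipped.
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []

    # Keep only well-formed operations, then enforce the per-call limit.
    operations = [
        op for op in parsed
        if isinstance(op, dict) and op.get("op") in VALID_OPS
    ]
    return operations[:max_operations]
```

Returning an empty list on malformed output lets the training loop treat a bad extraction as "no update" rather than a crash.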

3. APIModelAdapter (`src/gym/train/grpo/api_model_adapter.py`)¶

Status: 100% complete, with additional features

Feature	Gym	Tencent	Notes
OpenAI-compatible client	✅	✅	Identical
DeepSeek support	✅	✅	Identical
OpenAI support	✅	❌	Gym has extra wrapper
Batch generation	✅	❌	Gym has extra feature
Experience injection	✅	❌	Gym has built-in method
System prompt support	✅	❌	Gym has extra feature

Code Quality: Excellent, more flexible than Tencent

⚠️ Partially Implemented¶

4. GRPOTrainer (`src/gym/train/grpo/trainer.py`)¶

Status: Base GRPO complete, training-free mode missing

Feature	Gym	Tencent	Gap
Group-based rollouts	✅	✅	Identical
Advantage computation	✅	✅	Identical
Value-free optimization	✅	✅	Identical
Training-free mode	❌	✅	MISSING
Experience integration	❌	✅	MISSING
Frozen model operation	❌	✅	MISSING
Async rollout generation	❌	✅	MISSING
Timeout/retry logic	❌	✅	MISSING

What's Needed:

import torch

# Add to GRPOTrainer
def training_step_free(self, model, inputs):
    """Continuous Learning GRPO: no parameter updates."""
    # 1. Generate G rollouts with experience injection
    # 2. Compute rewards via verify function
    # 3. Extract semantic advantages (not numerical)
    # 4. Update experience library
    # 5. Return zero loss (no gradient update)
    return torch.tensor(0.0)
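
The five steps in the stub above could be fleshed out roughly as follows, kept framework-free so it runs without a model. Everything here is a sketch: `generate`, `verify`, and `extract_operations` are hypothetical callables, `library` is any object with `format_for_prompt()` and `apply_operations()`, and skipping groups whose rewards are all equal is an assumption (such groups carry no relative signal to compare). Inside a Trainer subclass the method would still return a zero loss.

```python
def training_free_step(queries, library, generate, verify, extract_operations, group_size=5):
    """Run one training-free step; returns the number of groups that produced updates.

    Hypothetical callables:
      generate(query, experiences_text) -> response string
      verify(query, response) -> float reward
      extract_operations(query, rollouts, experiences_text) -> list of operation dicts
    """
    experiences_text = library.format_for_prompt()
    informative_groups = 0
    pending = []
    for query in queries:
        # Steps 1-2: sample G rollouts with experiences in context and score each one.
        rollouts = []
        for _ in range(group_size):
            response = generate(query, experiences_text)
            rollouts.append({"response": response, "reward": verify(query, response)})

        # Step 3: a group with identical rewards has nothing to contrast (assumption: skip it).
        if len({r["reward"] for r in rollouts}) < 2:
            continue
        informative_groups += 1
        pending.extend(extract_operations(query, rollouts, experiences_text))

    # Step 4: update the library once per step; no gradients or optimizer are involved.
    library.apply_operations(pending)
    return informative_groups
```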

❌ Not Yet Implemented¶

5. Training Workflow¶

Status: Core infrastructure exists, unified script missing

Tencent's train.py structure:

for epoch in epochs:
    for batch in batches:
        # 1. Load experiences from previous step
        experiences = load_experiences(step - 1)
        # 2. Inject experiences into prompts
        enhanced_prompts = inject_experiences(batch, experiences)
        # 3. Generate G rollouts per query
        rollouts = await rollout_dataset(enhanced_prompts)
        # 4. Compute rewards
        for rollout in rollouts:
            rollout["reward"] = verify_func(rollout)
        # 5. Extract semantic advantages
        new_experiences = ExperienceUpdater().run(rollouts, experiences)
        # 6. Save for next step
        save_experiences(step + 1, new_experiences)

What Gym needs:

- [ ] Create src/gym/train/grpo/training_free_workflow.py
- [ ] Implement multi-epoch training loop
- [ ] Add experience checkpoint management
- [ ] Integrate with GRPOTrainer
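
A synchronous sketch of the multi-epoch loop with experience checkpoints is shown here as one possible starting point. The helper names mirror the pseudocode but their signatures are assumptions, and `update_fn` is a hypothetical stand-in for rollout generation, verification, and semantic extraction. The step convention used here (step k reads `step_k`, writes `step_{k+1}`) is also an assumption.

```python
import json
import random
from pathlib import Path


def load_experiences(root, step):
    """Return the experiences saved for `step`, or an empty dict if none exist yet."""
    path = Path(root) / f"step_{step}" / "experiences.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def save_experiences(root, step, experiences):
    path = Path(root) / f"step_{step}" / "experiences.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(experiences, indent=2), encoding="utf-8")


def run_workflow(data, root, update_fn, epochs=3, batch_size=2, seed=0):
    """Multi-epoch loop; `update_fn(batch, experiences) -> experiences` is hypothetical."""
    rng = random.Random(seed)
    step = 0
    for _ in range(epochs):
        shuffled = list(data)
        rng.shuffle(shuffled)  # reshuffle the data at the start of each epoch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            experiences = load_experiences(root, step)  # written by the previous step
            new_experiences = update_fn(batch, experiences)
            save_experiences(root, step + 1, new_experiences)  # read by the next step
            step += 1
    return load_experiences(root, step)
```

Because each step persists its output before the next one starts, an interrupted run can resume from the last written `step_N` directory.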

6. Rollout Infrastructure¶

Status: Not implemented

Tencent's main.py features:

async def rollout_dataset(
    worker_agent,
    data,
    rollouts,
    verify_func,
    rollout_concurrency=5,
    task_timeout=3600,
    max_retries=3
):
    # Async worker pool
    # Timeout handling
    # Automatic retry on failure
    # Progress tracking with tqdm
    # Incremental saving

What Gym needs:

- [ ] Create src/gym/train/grpo/rollout_manager.py
- [ ] Implement async batch rollout generation
- [ ] Add timeout/retry logic
- [ ] Support both API and local models
- [ ] Progress tracking and logging
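
The concurrency, timeout, and retry requirements can be sketched with the standard library alone. This is an illustration, not Tencent's `rollout_dataset`: the simplified signature, the `worker(task)` coroutine, and the choice to return `None` for a task that exhausts its retries are assumptions. Progress tracking and incremental saving are left out.

```python
import asyncio


async def rollout_with_retry(worker, task, timeout, max_retries):
    """Run one rollout, retrying on timeout or error; returns None if every attempt fails."""
    for _ in range(max_retries):
        try:
            return await asyncio.wait_for(worker(task), timeout=timeout)
        except Exception:
            # Broad on purpose for the sketch; a real client should catch
            # only its transient errors (timeouts, rate limits, network faults).
            continue
    return None


async def rollout_dataset(worker, tasks, concurrency=5, timeout=3600, max_retries=3):
    """Run all tasks with at most `concurrency` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(task):
        async with semaphore:
            return await rollout_with_retry(worker, task, timeout, max_retries)

    return await asyncio.gather(*(bounded(t) for t in tasks))
```

A semaphore is used instead of a fixed worker pool because it keeps result ordering trivial: `asyncio.gather` returns results in the order the tasks were passed in.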

7. Domain-Specific Modules¶

Status: Not implemented

Tencent structure:

continuous_learning_grpo/
├── math/
│   ├── dataset.py      # Load AIME/math datasets
│   ├── verify.py       # Verify correctness (reward function)
│   ├── prompts.py      # Math-specific prompt templates
│   └── experience.py   # Math-specific ExperienceUpdater
└── web/
    ├── dataset.py      # Load WebWalker datasets
    ├── verify.py       # Verify web navigation success
    ├── prompts.py      # Web-specific prompt templates
    └── experience.py   # Web-specific ExperienceUpdater

What Gym needs:

- [ ] Create src/gym/train/grpo/domains/ directory
- [ ] Implement domains/math/ module
- [ ] Implement domains/web/ module
- [ ] Generic base classes for custom domains
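
One possible shape for the "generic base classes" item is an abstract `Domain` that bundles the dataset, prompt, and verify roles. The class names, method signatures, record format, and the `\boxed{...}` answer convention below are assumptions for illustration; the toy math domain does not load AIME data.

```python
import re
from abc import ABC, abstractmethod


class Domain(ABC):
    """Hypothetical base class: one subclass per domain (math, web, ...)."""

    @abstractmethod
    def load_dataset(self, name):
        """Return a list of {"problem": ..., "answer": ...} records."""

    @abstractmethod
    def format_prompt(self, problem, experiences=""):
        """Build the prompt sent to the model."""

    @abstractmethod
    def verify(self, response, ground_truth):
        """Return a reward in [0, 1]."""


class ToyMathDomain(Domain):
    def load_dataset(self, name):
        # A real loader would read the named dataset; this returns a fixed toy sample.
        return [{"problem": "What is 2 + 2?", "answer": "4"}]

    def format_prompt(self, problem, experiences=""):
        context = f"Helpful experiences:\n{experiences}\n\n" if experiences else ""
        return f"{context}Problem: {problem}\nGive the final answer as \\boxed{{...}}."

    def verify(self, response, ground_truth):
        # Compare the last \boxed{...} value with the ground truth, ignoring outer whitespace.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
        return 1.0 if matches and matches[-1].strip() == ground_truth.strip() else 0.0
```

Exact string matching is the simplest possible reward; a production math verifier would likely need to normalize equivalent forms such as `0.5` and `1/2`.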
 
Architecture Gaps¶

1. Experience Injection Pipeline¶

Tencent approach:

# Inject experiences into prompt
enhanced_prompt = PROBLEM_WITH_EXPERIENCE_TEMPLATE.format(
    experiences=formatted_experiences,
    problem=problem
)

# Generate with experiences as context
response = model.generate(enhanced_prompt)

Gym approach (currently missing):

- Data collator doesn't support experience injection
- Template system needs extension
- Need custom prompt formatter

Solution:

# Add to data/template.py
from typing import Optional

def format_with_experiences(
    self,
    query: str,
    experiences: str,
    system: Optional[str] = None,
) -> str:
    """Format query with experiences injected."""
    # Fall back to an empty prefix so a missing system prompt is not rendered as "None".
    prefix = system or ""
    experience_context = f"\n\nHelpful experiences:\n{experiences}" if experiences else ""
    return f"{prefix}{experience_context}\n\nProblem: {query}"

2. Frozen Model Training¶

Tencent approach:

# Model weights never updated
# All learning happens in experience library
new_experiences = extract_semantic_advantages(rollouts)
experience_library.update(new_experiences)
# No optimizer.step(), no gradient computation

Gym approach (currently missing):

- GRPOTrainer assumes gradient updates
- Need a training-free flag (continuous_learning_grpo in the code below) to skip the optimizer
- Need checkpoint system to save experiences

Solution:

class GRPOTrainer:
    def training_step(self, model, inputs):
        if self.finetuning_args.continuous_learning_grpo:
            return self.training_step_free(model, inputs)
        else:
            return self.training_step_standard(model, inputs)

3. Multi-Epoch Experience Evolution¶

Tencent approach:

Epoch 0: experiences = {}
Epoch 1: experiences = E1 (learned from epoch 0)
Epoch 2: experiences = E2 (learned from epoch 1)
...

Gym approach (currently missing):

- No cross-epoch experience persistence
- Need checkpoint directory structure
- Need experience version tracking

Solution:

output/
└── experiment_name/
    ├── epoch_0/
    │   ├── shuffled_data.jsonl
    │   └── step_0/
    │       ├── rollout.jsonl
    │       ├── single_rollout_summary.json
    │       ├── single_query_critique.json
    │       ├── batch_update.json
    │       └── experiences.json  # Used by step_1
    ├── epoch_1/
    │   └── step_1/
    │       └── experiences.json  # Used by step_2
    └── stats.json
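
To show how this layout might be navigated in code, the sketch below builds a step directory path and finds the most recent `experiences.json` for resuming. Both helper names are hypothetical, and picking the highest step number as "latest" assumes step numbers increase monotonically across epochs, as in the tree above.

```python
import re
from pathlib import Path


def step_dir(output_dir, experiment, epoch, step):
    """Directory for one step, following the epoch_N/step_M layout."""
    return Path(output_dir) / experiment / f"epoch_{epoch}" / f"step_{step}"


def latest_experiences(output_dir, experiment):
    """Return the experiences.json path with the highest step number, or None."""
    root = Path(output_dir) / experiment
    if not root.exists():
        return None
    best_step, best_path = -1, None
    for path in root.glob("epoch_*/step_*/experiences.json"):
        # Compare step numbers numerically so step_10 sorts after step_2.
        match = re.fullmatch(r"step_(\d+)", path.parent.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```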
 
Integration Roadmap¶

Phase 1: Foundation (Week 1)¶

Goal: Create unified training script

- [x] ✅ ExperienceManager implementation
- [x] ✅ SemanticExtractor implementation
- [x] ✅ APIModelAdapter implementation
- [ ] Create training_free_workflow.py
- [ ] Implement basic training loop
- [ ] Add experience checkpoint system

Deliverable: Run single epoch on toy dataset

Phase 2: Rollout Infrastructure (Week 2)¶

Goal: Async batch processing

- [ ] Create rollout_manager.py
- [ ] Implement async worker pool
- [ ] Add timeout/retry logic
- [ ] Support API and local models
- [ ] Progress tracking

Deliverable: Generate 100 rollouts in parallel

Phase 3: Domain Integration (Week 3)¶

Goal: Math and web domains

- [ ] Create domains/math/ module
  - [ ] dataset.py (AIME loader)
  - [ ] verify.py (correctness checker)
  - [ ] prompts.py (math templates)
- [ ] Create domains/web/ module
  - [ ] dataset.py (WebWalker loader)
  - [ ] verify.py (navigation checker)
  - [ ] prompts.py (web templates)

Deliverable: Train on AIME24 dataset

Phase 4: End-to-End Testing (Week 4)¶

Goal: Validate full workflow

- [ ] Run 3-epoch training on AIME24 (100 samples)
- [ ] Validate experience library growth
- [ ] Compare with Tencent baseline
- [ ] Verify zero parameter updates
- [ ] Cost analysis ($18 target)

Deliverable: Working Continuous Learning GRPO system

Testing Status¶

Unit Tests¶

Location: tests/train/test_continuous_learning_grpo.py

Component	Test Coverage	Status
ExperienceManager	100%	✅ 18/18 passing
SemanticExtractor	100%	✅ 10/10 passing
Trajectory	100%	✅ 2/2 passing
Integration	0%	❌ Not implemented

Integration Tests Needed¶

- [ ] End-to-end training loop
- [ ] Multi-epoch experience evolution
- [ ] Experience injection into prompts
- [ ] Frozen model verification
- [ ] Cost tracking
- [ ] Performance benchmarks

Example Scripts Status¶

Tencent Examples¶

# Train on math domain
python continuous_learning_grpo/train.py \
  --mode agent \
  --domain math \
  --experiment_name test_aime \
  --dataset aime24 \
  --epochs 3 \
  --grpo_n 5

# Evaluate with experiences
python continuous_learning_grpo/main.py \
  --mode agent \
  --domain math \
  --experiment_name eval_aime \
  --dataset aime25 \
  --experience_file data/math/train/test_aime/step_X/experiences.json

Gym Examples (Needed)¶

- [ ] scripts/train_grpo_free_math.py - Math domain training
- [ ] scripts/train_grpo_free_web.py - Web domain training
- [ ] scripts/eval_grpo_free.py - Evaluation with experiences
- [ ] examples/grpo_free_custom_domain.py - Custom domain guide

Performance Targets¶

Based on Tencent paper benchmarks:

Metric	Target	Current	Status
AIME24 Accuracy	82.7%	N/A	⏳ Not tested
AIME25 Accuracy	73.3%	N/A	⏳ Not tested
Training Cost	$18	N/A	⏳ Not measured
Training Time	6 hours	N/A	⏳ Not measured
Experience Count	50-200	N/A	⏳ Not tracked
Data Efficiency	100 samples	N/A	⏳ Not tested

Documentation Status¶

Existing Documentation¶

- ✅ Architecture overview in LLM.md
- ✅ Paper analysis in LLM.md
- ✅ Component API docs (docstrings)
- ✅ Unit test examples

Missing Documentation¶

- [ ] End-to-end tutorial
- [ ] Custom domain guide
- [ ] Cost optimization tips
- [ ] Troubleshooting guide
- [ ] Comparison with fine-tuning
- [ ] API reference

Dependencies¶

Already Installed¶

openai>=1.0.0          # API client
torch>=2.0.0           # Core framework
transformers>=4.40.0   # Model loading

Additional Needed¶

aiohttp                # Async HTTP (for rollouts)
tqdm                   # Progress tracking
datasets               # Dataset loading (already have)

Configuration Integration¶

Tencent Config (Environment Variables)¶

export UTU_LLM_TYPE="deepseek"
export UTU_LLM_MODEL="deepseek-chat"
export UTU_LLM_BASE_URL="https://api.deepseek.com/v1"
export UTU_LLM_API_KEY="sk-xxx"

Gym Config (YAML)¶

Needed: configs/grpo_free_math.yaml

model_name_or_path: deepseek-chat
api_mode: true
api_base_url: https://api.deepseek.com/v1
api_key: ${DEEPSEEK_API_KEY}

finetuning_type: grpo
continuous_learning_grpo: true
grpo_group_size: 5
grpo_normalize_advantages: false  # Not used in training-free

dataset: aime24
dataset_truncate: 100
domain: math

epochs: 3
batch_size: 64
rollout_concurrency: 5
rollout_temperature: 0.7
rollout_max_tokens: 4096
task_timeout: 3600

output_dir: ./output/grpo_free_math
save_steps: 1
logging_steps: 1
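
A typed view of this config could look like the following sketch. `GRPOFreeConfig` is a hypothetical helper covering only a subset of the keys above, and it takes an already-parsed mapping (for example, the output of a YAML parser) so that the example stays dependency-free. The validation rules are assumptions, not existing Gym behavior.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class GRPOFreeConfig:
    """Subset of the YAML keys above; the class itself is a hypothetical helper."""
    model_name_or_path: str
    api_base_url: str
    api_key: str
    continuous_learning_grpo: bool = True
    grpo_group_size: int = 5
    epochs: int = 3
    batch_size: int = 64
    rollout_concurrency: int = 5
    rollout_temperature: float = 0.7
    task_timeout: int = 3600

    @classmethod
    def from_dict(cls, raw):
        # Ignore keys this sketch does not model (dataset, domain, output_dir, ...).
        known = {f.name for f in fields(cls)}
        values = {k: v for k, v in raw.items() if k in known}
        # Expand ${VAR} references such as ${DEEPSEEK_API_KEY} from the environment.
        values["api_key"] = os.path.expandvars(values["api_key"])
        config = cls(**values)
        # expandvars leaves unknown variables untouched, so a leftover "${" means unset.
        if not config.api_key or "${" in config.api_key:
            raise ValueError("api_key is unset; export the referenced environment variable")
        if config.grpo_group_size < 2:
            raise ValueError("grpo_group_size must be at least 2 to compare rollouts")
        return config
```

Failing early on an unexpanded `${...}` placeholder avoids sending a literal placeholder string to the API as a key.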
 
Key Differences: Gym vs Tencent¶

Aspect	Gym	Tencent	Recommendation
Framework	HuggingFace Trainer	Custom async loop	Keep Gym's, add training-free mode
Model Loading	Transformers	API client	Support both (already have)
Rollout Generation	Sync	Async	Add async option
Experience Storage	JSON	JSON	Identical ✅
Prompts	Generic	Domain-specific	Add domain modules
Verify Function	Placeholder	Domain-specific	Add domain modules
Agent Support	No	Yes (SimpleAgent)	Optional, not required
Testing	Comprehensive	None	Keep Gym's ✅

Next Steps (Immediate)¶

1. Create training workflow script (1-2 days)
   src/gym/train/grpo/training_free_workflow.py
2. Add math domain module (1-2 days)
   src/gym/train/grpo/domains/math/
3. Test on toy dataset (1 day)
   python scripts/test_grpo_free_mini.py
4. Full AIME benchmark (1 day)
   python scripts/train_grpo_free_math.py --dataset aime24

Questions for User¶

1. API vs Local: Should we prioritize API-based (DeepSeek) or local model support first?
   - API: Faster to implement, matches paper
   - Local: More flexible, no API costs
2. Domain Priority: Which domain to implement first?
   - Math: Better benchmarks available (AIME)
   - Web: More complex, may reveal edge cases
3. Integration Style: How to integrate with existing Gym?
   - Option A: Extend GRPOTrainer with continuous_learning_grpo=True flag
   - Option B: Create separate ContinuousLearningGRPOTrainer class
   - Recommendation: Option A (cleaner)
4. Agent Framework: Should we support Tencent's SimpleAgent?
   - Pro: Full compatibility
   - Con: Additional dependency
   - Recommendation: Optional, focus on API/model first

Last Updated: October 28, 2025
Status: ✅ Core components complete, ⚠️ Training workflow in progress
ETA for Full Parity: 2-3 weeks