Continuous Learning GRPO Integration Summary¶
Executive Summary¶
The Zoo Gym training infrastructure has solid foundational support for GRPO but requires semantic layer enhancements to enable training-free operation. Current GRPO is numerical (reward-based); Continuous Learning GRPO requires semantic experience library management.
Current State Analysis¶
✅ What Works Today¶
GRPOTrainer (trainer.py)
├── ✅ Group advantage computation (relative scaling)
├── ✅ Response generation (inference loop)
├── ✅ Policy loss calculation (PPO-style clipping)
├── ✅ Reference model integration
├── ✅ Checkpoint/saving system
├── ✅ Multi-GPU training (DDP, FSDP, DeepSpeed)
└── ❌ Reward computation (placeholder/random)
Data Pipeline
├── ✅ Dataset loading (JSON, CSV, Parquet)
├── ✅ Template-based formatting
├── ✅ Feedback processor (preference pairs)
├── ✅ Batch collation and padding
└── ❌ Experience persistence across epochs
Training Loop
├── ✅ Standard trainer callbacks
├── ✅ Checkpoint system
├── ✅ Hyperparameter management
├── ✅ Logging and metrics
└── ❌ Experience curation updates
Evaluation
├── ✅ Task-based metrics (MMLU, CEVAL, CMMLU)
├── ✅ Accuracy tracking
└── ❌ Experience library quality metrics
❌ What's Missing for Training-Free¶
Experience Library
├── ❌ Experience storage/persistence
├── ❌ Semantic advantage extraction
├── ❌ Experience deduplication
├── ❌ Embedding generation
└── ❌ Retrieval mechanisms
Semantic Layer
├── ❌ Natural language insight generation
├── ❌ Confidence scoring
├── ❌ Domain classification
└── ❌ Context-based filtering
Reward Model Enhancement
├── ❌ LLM introspection hook
├── ❌ Semantic advantage computation
└── ❌ Knowledge consolidation
Integration
├── ❌ Context injection into prompts
├── ❌ Experience-aware inference
├── ❌ Governance voting system
└── ❌ IPFS/Arweave storage
Architecture Overview¶
Current GRPO Flow¶
Input Prompts
↓
Generate Responses (k per prompt)
↓
Compute Rewards [PLACEHOLDER - returns random]
↓
Group-Relative Advantages
↓
Policy Loss (PPO clipping)
↓
Gradient Update
↓
Checkpoint Saved
(Experiences discarded)
Continuous Learning GRPO Flow (Needed)¶
Input Prompts + Experience Context
↓
Generate Responses (k per prompt)
↓
Compute Semantic Advantages [NEW]
│
├─→ Extract natural language insights [NEW]
│
└─→ Score by confidence [NEW]
↓
Group-Relative Advantages
↓
Policy Loss (PPO clipping)
↓
Gradient Update
↓
Update Experience Library [NEW]
├─→ Add new experiences
├─→ Deduplicate
└─→ Rank by quality
↓
Checkpoint + Experience Library Saved [NEW]
Code Structure (Current)¶
src/gym/train/
├── grpo/
│ ├── trainer.py [FULL: 401 lines]
│ │ ├── GRPOTrainer
│ │ ├── compute_group_advantages() ✅
│ │ ├── compute_policy_loss() ✅
│ │ ├── generate_responses() ✅
│ │ ├── compute_rewards() ❌ [RANDOM]
│ │ └── compute_loss() ✅
│ ├── trainer_simple.py [MINIMAL: 153 lines]
│ │ └── Simplified version (Go style)
│ ├── workflow.py [RUNNER: 175 lines]
│ │ └── run_grpo() - orchestrates training
│ └── __init__.py
│
├── ppo/
│ ├── trainer.py [VALUE-BASED PPO]
│ ├── ppo_utils.py
│ └── workflow.py
│
├── dpo/
│ ├── trainer.py [PAIRWISE COMPARISON]
│ └── workflow.py
│
├── gspo/
│ ├── trainer.py [SEQUENCE-LEVEL]
│ └── trainer_simple.py
│
├── sft/ [SFT BASE]
├── rm/ [REWARD MODEL]
├── kto/ [PREFERENCE]
├── pt/ [PRE-TRAINING]
│
├── trainer_utils.py [SHARED UTILITIES: 640 lines]
│ ├── create_optimizer()
│ ├── create_scheduler()
│ ├── create_ref_model()
│ ├── create_reward_model()
│ └── [optimizer variants: Adam, AdamW, GaLore, Apollo, BAdam]
│
├── callbacks.py [TRAINING CALLBACKS]
│ ├── SaveProcessorCallback
│ ├── FixValueHeadModelCallback
│ └── LogCallback
│
└── test_utils.py [TEST HELPERS]
src/gym/data/
├── loader.py [DATASET LOADING: 400 lines]
│ └── get_dataset()
├── processor/
│ ├── __init__.py
│ ├── supervised.py [SFT DATA]
│ ├── feedback.py [PREFERENCE PAIRS: 130 lines]
│ │ └── FeedbackDatasetProcessor
│ │ ├── _encode_data_example()
│ │ └── preprocess_dataset()
│ ├── pairwise.py [DPO/ORPO DATA]
│ ├── pretrain.py
│ └── unsupervised.py
├── template.py [FORMATTING: 1850 lines]
│ ├── Template class
│ └── encode_oneturn()
└── collator.py [BATCHING: 400 lines]
src/gym/hparams/
├── finetuning_args.py [PARAMETERS: 1200+ lines]
│ ├── FreezeArguments
│ ├── LoraArguments
│ ├── ...
│ ├── GRPOArguments [CURRENT: ~40 lines]
│ │ ├── grpo_group_size (default: 8)
│ │ ├── grpo_beta (default: 0.01)
│ │ ├── grpo_clip_range (default: 0.2)
│ │ └── grpo_normalize_advantages (default: True)
│ └── RLHFArguments
└── ...
Implementation Roadmap¶
Phase 1: Foundation (Weeks 1-2) - HIGH PRIORITY¶
Create core semantic infrastructure
File: src/gym/train/grpo/experience_manager.py (NEW - ~200 lines)
class ExperienceManager:
def __init__(self, library_path: str, max_size: int = 100):
self.experiences = []
self.embeddings = []
self.library_path = library_path
self.max_size = max_size
def add_experience(self, text: str, advantage: float, domain: str):
"""Add experience to library with deduplication"""
def retrieve_relevant(self, query: str, top_k: int = 5):
"""Similarity-based retrieval"""
def to_prompt(self) -> str:
"""Format as context string"""
def save(self):
"""JSON persistence"""
def load(self):
"""Load from disk"""
# Tests: test_experience_manager.py
File: src/gym/train/grpo/semantic_extractor.py (NEW - ~150 lines)
class SemanticExtractor:
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def extract_insight(self, better_output, worse_output, advantage_magnitude):
"""LLM introspection to explain advantage"""
# Generates natural language insight
# Scores confidence based on advantage_magnitude
return experience_text, confidence_score
# Tests: test_semantic_extractor.py
Phase 2: Integration (Weeks 3-4) - MEDIUM PRIORITY¶
Connect semantic layer to training loop
File: src/gym/train/grpo/trainer.py (MODIFICATIONS - ~50 lines)
# In GRPOTrainer.__init__():
if self.finetuning_args.grpo_training_free:
self.experience_manager = ExperienceManager(...)
self.semantic_extractor = SemanticExtractor(...)
# In compute_loss():
if self.finetuning_args.grpo_training_free:
experience = self.extract_experience_from_group(...)
self.experience_manager.add_experience(...)
# In save_model():
if hasattr(self, 'experience_manager'):
self.experience_manager.save(...)
File: src/gym/data/template.py (MODIFICATIONS - ~30 lines)
def encode_oneturn_with_experiences(
self, tokenizer, messages, system, tools, experiences: Optional[str] = None
):
"""Prepend experience context to system message"""
if experiences:
system = f"{system}\n\n# Learned Experiences\n{experiences}"
return self.encode_oneturn(tokenizer, messages, system, tools)
File: src/gym/hparams/finetuning_args.py (MODIFICATIONS - ~40 lines)
# Add after existing GRPO parameters:
grpo_training_free: bool = False
grpo_experience_lib_path: str = "./experience_lib"
grpo_experience_max_size: int = 100
grpo_semantic_extraction: bool = True
grpo_context_injection: bool = True
grpo_experience_update_interval: int = 1 # epochs
Phase 3: Optimization (Weeks 5-6) - MEDIUM PRIORITY¶
Performance and quality enhancements
File: src/gym/train/grpo/embedding_service.py (NEW - ~100 lines)
class EmbeddingService:
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
def embed(self, text: str) -> np.ndarray:
"""Fast embedding generation"""
def similarity(self, text1: str, text2: str) -> float:
"""Cosine similarity"""
File: src/gym/eval/experience_eval.py (NEW - ~150 lines)
def evaluate_experience_library(library: ExperienceManager):
"""Quality metrics for experience library"""
return {
"total": len(library.experiences),
"avg_confidence": np.mean([e.confidence for e in ...]),
"coverage_domains": len(set(e.domain for e in ...)),
"deduplication_ratio": (initial - final) / initial
}
Phase 4: Decentralization (Weeks 7-8) - LOW PRIORITY (Future)¶
On-chain governance and storage
File: src/gym/train/grpo/ipfs_integration.py (NEW - ~200 lines)
class IPFSExperienceStorage:
def __init__(self, ipfs_client_endpoint: str):
self.client = ipfshttpclient.connect(ipfs_client_endpoint)
def upload_library(self, experiences: List[Experience]) -> str:
"""Returns IPFS CID"""
def download_library(self, cid: str) -> List[Experience]:
"""Load from IPFS"""
File-by-File Changes Summary¶
| File | Lines | Change Type | Priority | Status |
|---|---|---|---|---|
src/gym/train/grpo/trainer.py | 401 → 450 | Modify | HIGH | Phase 2 |
src/gym/train/grpo/experience_manager.py | - → 200 | NEW | HIGH | Phase 1 |
src/gym/train/grpo/semantic_extractor.py | - → 150 | NEW | HIGH | Phase 1 |
src/gym/data/template.py | 1850 → 1880 | Modify | MEDIUM | Phase 2 |
src/gym/hparams/finetuning_args.py | 1200 → 1240 | Modify | MEDIUM | Phase 2 |
src/gym/train/grpo/embedding_service.py | - → 100 | NEW | MEDIUM | Phase 3 |
src/gym/eval/experience_eval.py | - → 150 | NEW | MEDIUM | Phase 3 |
src/gym/train/grpo/ipfs_integration.py | - → 200 | NEW | LOW | Phase 4 |
tests/train/test_grpo_training_free.py | - → 300 | NEW | HIGH | Phase 1 |
Total new code: ~1,100 lines Total modifications: ~120 lines
Testing Strategy¶
Unit Tests (Phase 1)¶
tests/train/test_experience_manager.py
├── test_add_experience()
├── test_deduplication()
├── test_serialize_deserialize()
└── test_persistence()
tests/train/test_semantic_extractor.py
├── test_extract_insight()
├── test_confidence_scoring()
└── test_empty_handling()
Integration Tests (Phase 2)¶
tests/train/test_grpo_training_free.py
├── test_training_loop_with_experience_tracking()
├── test_experience_injection_into_context()
├── test_checkpoint_saves_experience_library()
└── test_end_to_end_with_qwen3_mini()
Benchmark Tests (Phase 3)¶
tests/train/test_grpo_performance.py
├── test_training_speed_comparison()
├── test_memory_usage_vs_standard_grpo()
├── test_experience_retrieval_latency()
└── test_downstream_task_improvement()
Expected Outcomes¶
After Phase 1¶
- Experience library can be created and persisted
- Semantic extraction generates natural language insights
- Unit tests pass with 90%+ coverage
After Phase 2¶
- Full integration with training loop
- Experience context injection working
- Training runs without errors
- Checkpoints include experience library
After Phase 3¶
- Similarity-based retrieval operational
- Quality metrics reported during training
- 5-10x improvement in context relevance
- Performance benchmarks documented
After Phase 4¶
- IPFS integration for decentralized storage
- On-chain merkle proofs
- DAO governance system
- Experience marketplace functional
Compatibility Notes¶
✅ No Breaking Changes¶
All enhancements are backwards compatible: - Existing GRPO code unchanged (only additions) - Feature flag grpo_training_free=False by default - Existing checkpoints still loadable - No modification to core loss computation
⚠️ Dependencies¶
New packages needed: - sentence-transformers - For embeddings - ipfshttpclient - For IPFS (Phase 4 only) - numpy - Already included
🔄 Configuration¶
# Example: config.yaml
model_name_or_path: Qwen/Qwen3-4B-Instruct
stage: grpo
finetuning_type: lora
output_dir: ./output/grpo_tf
# NEW: Continuous Learning GRPO settings
grpo_training_free: true
grpo_experience_lib_path: ./output/experience_lib
grpo_experience_max_size: 100
grpo_semantic_extraction: true
grpo_context_injection: true
Risk Assessment¶
| Risk | Impact | Mitigation | Status |
|---|---|---|---|
| Breaking changes | HIGH | Feature flags, backward compatibility tests | ✅ Mitigated |
| Performance regression | MEDIUM | Benchmark tests, optional disabling | ✅ Mitigated |
| Memory overhead | MEDIUM | Configurable library size, cleanup logic | ✅ Mitigated |
| LLM introspection latency | MEDIUM | Batch processing, caching | ✅ Mitigated |
| Embedding quality | MEDIUM | Pre-trained models, validation | ✅ Mitigated |
Success Metrics¶
- Code Quality
- 90%+ test coverage
- No regression in existing GRPO tests
-
Type hints on all functions
-
Functionality
- Experience library creation/update working
- Semantic extraction producing valid insights
-
Context injection improving task accuracy by 5%+
-
Performance
- <100ms experience retrieval latency
- <5% training speed overhead
-
<500MB additional memory with max_size=100
-
Documentation
- API documentation complete
- Usage examples provided
- Integration guide written
Resources¶
Key Files to Review: - /Users/z/work/zoo/gym/src/gym/train/grpo/trainer.py - Main GRPO trainer - /Users/z/work/zoo/gym/src/gym/train/ppo/trainer.py - Reference (value-based RL) - /Users/z/work/zoo/gym/src/gym/train/dpo/trainer.py - Reference (preference-based) - /Users/z/work/zoo/gym/src/gym/data/processor/feedback.py - Preference data handling - /Users/z/work/zoo/gym/LLM.md - Full architecture documentation
Research Papers: - DeepSeek GRPO: https://arxiv.org/abs/2502.01155 - Alibaba GSPO: https://arxiv.org/abs/2507.18071
Document Status: ✅ Complete
Last Updated: October 28, 2025
Author: Claude Code (AI Assistant)
Version: 1.0