Continuous Learning GRPO - Training-Free AI Model Improvement¶
Executive Summary¶
Continuous Learning GRPO enables AI models to improve through experience accumulation rather than parameter updates. Instead of traditional fine-tuning (which requires expensive GPU compute and thousands of examples), this approach learns from 50-100 examples by extracting and curating semantic experiences - human-readable insights that guide future problem-solving.
Key Results (from Tencent Research, arXiv:2510.08191):
- Cost: $18 vs $10,000+ for fine-tuning
- Performance: 82.7% on AIME24 (+2.7% over the untrained baseline)
- Data efficiency: 50-100 examples vs thousands
- Training time: Minutes vs hours/days
- Interpretability: Human-readable experiences vs black-box weights
Table of Contents¶
- What is Continuous Learning GRPO?
- Core Concept: Semantic Experiences
- Architecture Overview
- When to Use vs Traditional Fine-Tuning
- Three-Stage Learning Process
- Cost & Performance Analysis
- Integration with Gym
- Quick Start
- Advanced Usage
- Limitations & Best Practices
What is Continuous Learning GRPO?¶
Traditional AI model improvement requires updating billions of parameters through gradient descent - an expensive, opaque process. Continuous Learning GRPO takes a fundamentally different approach:
Traditional Fine-Tuning:
Model weights → Gradient updates → Modified weights
Cost: $10,000+ | Time: Hours/Days | Interpretability: None
Continuous Learning:
Experiences → Semantic extraction → Updated experience library
Cost: $18 | Time: Minutes | Interpretability: Full
The Innovation¶
Instead of changing what the model is (its weights), we change what the model sees (its context). The base model remains frozen (verifiable via cryptographic hash), while a curated library of experiences guides its reasoning.
Analogy: Traditional fine-tuning is like brain surgery - modifying neural connections. Continuous Learning is like education - providing better examples and guidance.
Core Concept: Semantic Experiences¶
What is a Semantic Experience?¶
A semantic experience is a concise (≤32 words) natural language statement that captures a generalizable problem-solving strategy.
Examples from Mathematics Domain:
[G0]. When solving geometry problems with intersections, validate solutions
lie within bounded regions or segments, not on extensions, to avoid
extraneous answers.
[G1]. For expected extreme statistics in combinatorial problems, use direct
enumeration for small sizes.
[G10]. When using mathematical invariants to prove impossibility, always
validate them against known achievable states or small cases.
[G21]. For complex polynomials with real parameters, separate real and
imaginary parts to find when real roots exist.
Characteristics of Good Experiences¶
- Strategic, not computational: "Check boundary conditions" not "Calculate derivative"
- Context-aware: Begins with "When [condition]..."
- Actionable: Provides clear guidance
- Generalizable: Applies to similar problem classes
- Concise: ≤32 words
- Domain-agnostic: Focuses on reasoning patterns
Traditional Numerical Advantages vs Semantic Advantages¶
| Aspect | Numerical (GRPO) | Semantic (Continuous Learning) |
|---|---|---|
| Format | Scalar: 0.73 | Text: "When solving equations..." |
| Interpretability | Opaque | Human-readable |
| Persistence | Lost after batch | Accumulated in library |
| Composability | Single value | Combinable experiences |
| Governance | Not applicable | DAO-votable |
| Auditability | None | Full trail |
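To make the contrast concrete, here is a small illustration. The normalization follows standard GRPO practice and the reward values are made up; the semantic entry is taken from the math examples above.

import numpy as np

# Numerical GRPO advantage: one scalar per rollout, consumed by a gradient
# update and then discarded (illustrative rewards for G=5 rollouts).
rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(2))  # [ 1.22 -0.82  1.22 -0.82 -0.82]

# Semantic advantage: a persistent, human-readable library entry.
semantic_advantage = (
    "When solving geometry problems with intersections, validate solutions "
    "lie within bounded regions or segments, not on extensions."
)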
Architecture Overview¶
System Components¶
┌─────────────────────────────────────────────────────────────┐
│ Continuous Learning GRPO │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ ┌───────────────────────┐ │
│ │ User Query │─────→│ Experience Manager │ │
│ └────────────────┘ │ - Load experiences │ │
│ │ - Format for context │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Base Model (Frozen) │ │
│ ┌────────────────┐ │ + Experience Context │ │
│ │ Generate G │◀────┤ → Generate response │ │
│ │ Rollouts │ └───────────────────────┘ │
│ └────────┬───────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────┐ │
│ │ Semantic Extractor (3-Stage LLM Process) │ │
│ │ 1. Summarize trajectories │ │
│ │ 2. Extract group advantages │ │
│ │ 3. Consolidate batch updates │ │
│ └────────────────┬──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Experience Library Update │ │
│ │ - Add new experiences │ │
│ │ - Modify existing ones │ │
│ │ - Delete obsolete entries │ │
│ │ - Merge similar experiences │ │
│ └────────────────────────────────┘ │
│ │
│ [Model weights never change - verifiable via hash] │
└─────────────────────────────────────────────────────────────┘
Key Modules¶
1. ExperienceManager (src/gym/train/grpo/experience_manager.py)¶
Purpose: Manages the experience library E - the core knowledge base.
Operations:
- add(experience): Add a new experience
- modify(exp_id, new_text): Update an existing experience
- delete(exp_id): Remove an obsolete experience
- merge(exp_ids, merged_text): Combine similar experiences
- format_for_prompt(): Convert the library to a context string
- save(path) / load(path): Persistence
Data Structure:
{
"experiences": {
"G0": "When solving equations, verify by substitution...",
"G1": "For optimization, check boundary conditions...",
...
},
"next_id": 42
}
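For intuition, here is a minimal in-memory sketch that matches this layout. The method names mirror the operations listed above, but this is a sketch, not the shipped ExperienceManager.

import json

class MinimalExperienceManager:
    """Toy version of the experience library (sketch only)."""

    def __init__(self):
        self.experiences = {}   # "G0" -> experience text
        self.next_id = 0

    def add(self, experience: str) -> str:
        exp_id = f"G{self.next_id}"
        self.experiences[exp_id] = experience
        self.next_id += 1
        return exp_id

    def modify(self, exp_id: str, new_text: str):
        self.experiences[exp_id] = new_text

    def delete(self, exp_id: str):
        self.experiences.pop(exp_id, None)

    def merge(self, exp_ids: list, merged_text: str) -> str:
        for exp_id in exp_ids:
            self.delete(exp_id)
        return self.add(merged_text)

    def format_for_prompt(self) -> str:
        return "\n".join(f"[{k}]. {v}" for k, v in self.experiences.items())

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump({"experiences": self.experiences, "next_id": self.next_id}, f, indent=2)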
2. SemanticExtractor (src/gym/train/grpo/semantic_extractor.py)¶
Purpose: Extracts semantic advantages using a three-stage LLM process.
Stage 1 - Trajectory Summarization:
def summarize_trajectory(trajectory, use_groundtruth=True):
"""
Input: Full trajectory, correctness label, ground truth
Output: Step-by-step natural language summary
Example:
1. Applied quadratic formula with a=1, b=2, c=5
2. Calculated discriminant: b²-4ac = 4-20 = -16
3. Concluded no real solutions (correct)
"""
Stage 2 - Group Advantage Extraction:
def extract_group_advantage(trajectories, experiences):
"""
Input: G trajectories (both correct/incorrect), current experiences
Output: JSON operations [{"option": "add", "experience": "..."}, ...]
Process:
- Compare successful vs failed trajectories
- Identify what made successful ones succeed
- Propose max 3 operations: add/modify/delete
"""
Stage 3 - Batch Consolidation:
def consolidate_batch(all_group_operations, experiences):
"""
Input: All group operations from batch, current experiences
Output: Final consolidated operations
Process:
- Merge similar suggestions
- Ensure ≤32 words per experience
- Eliminate redundancy
"""
3. APIModelAdapter (src/gym/train/grpo/api_model_adapter.py)¶
Purpose: Enable using cloud-hosted models (DeepSeek, OpenAI) instead of local GPUs.
Benefits:
- No GPU required
- Faster inference (optimized infrastructure)
- Better base models (DeepSeek-V3, GPT-4o)
- Lower total cost (pay per use)
Usage:
from gym.train.grpo.api_model_adapter import DeepSeekAdapter
adapter = DeepSeekAdapter(api_key="sk-xxx", model="deepseek-chat")
response = adapter.generate_with_experiences(
query="Solve: x² + 2x + 5 = 0",
experiences=experience_library.format_for_prompt()
)
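How generate_with_experiences turns the library into context is not spelled out above. One plausible assembly, with build_messages as a hypothetical helper, is to place the experiences in the system message and the query in the user turn:

# Hypothetical sketch of how an adapter might inject experiences as context.
def build_messages(query: str, experiences: str) -> list:
    system = (
        "You may use the following experiences as guidance when solving "
        "the problem:\n" + experiences
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]

# messages = build_messages("Solve: x² + 2x + 5 = 0",
#                           experience_library.format_for_prompt())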
When to Use vs Traditional Fine-Tuning¶
Continuous Learning is Better When:¶
✅ Limited training data (50-1000 examples)
- Traditional methods need 10K+ examples
- Continuous Learning learns from dozens

✅ Tight budget ($18 vs $10,000+)
- No GPU infrastructure needed
- Pay only for API calls

✅ Fast iteration (minutes vs days)
- Immediate experimentation
- No waiting for training convergence

✅ Interpretability required
- Experiences are human-readable
- Can audit why the model made decisions
- Stakeholders can review/vote on experiences

✅ Domain adaptation (medical, legal, specialized)
- Transfer to a new domain without retraining
- Domain experts can write experiences directly
- Cross-domain performance maintained

✅ Governance/compliance
- Need an audit trail
- Community-driven model evolution
- Verifiable model weights (frozen)
Traditional Fine-Tuning is Better When:¶
❌ Massive training data (100K+ examples)
- Fine-tuning scales better with data
- Parameter updates can capture complex patterns

❌ Latency-critical inference
- Fine-tuned models have no context overhead
- Continuous Learning adds ~500 tokens of context

❌ Extremely specialized tasks
- Deep domain knowledge encoded in weights
- When semantic experiences can't capture nuance

❌ One-time deployment (static model)
- No need for continuous updates
- Fixed deployment constraints
Hybrid Approach: Best of Both Worlds¶
You can combine both:
1. Base model: Fine-tuned on a large general dataset
2. Continuous Learning: Adapt to specific use cases
3. Result: General competence + specific expertise
Three-Stage Learning Process¶
Overview¶
Input: G rollouts per query, ground truth labels
Output: Updated experience library E
For each epoch:
For each batch:
For each query in batch:
Stage 1: Summarize each of G trajectories
Stage 2: Extract group advantage (max 3 operations)
Stage 3: Consolidate all group operations
Apply consolidated operations to E
Stage 1: Trajectory Summarization (Figure 11 in Paper)¶
Goal: Analyze what happened in each rollout step-by-step.
Prompt Template:
An agent system may be provided with some experiences, and then it produces
the following trajectory to solve the given problem. Please summarize the
trajectory step-by-step:
1. For each step, describe what action is being taken, and which experience
has been used in this step.
2. Given the grading of this rollout and the correct answer, identify and
explain any steps that represent detours, errors, or backtracking.
3. Maintain all the core outcome of each step.
<trajectory>{full_output}</trajectory>
<evaluation>{correct/wrong}</evaluation>
<groundtruth>{answer}</groundtruth>
Only return the trajectory summary of each step.
Example Output:
1. Applied quadratic formula with a=1, b=2, c=5 (Used experience G21)
2. Calculated discriminant: b²-4ac = 4-20 = -16
3. Correctly concluded no real solutions exist (discriminant < 0)
4. Verified answer matches ground truth
Why This Stage Matters:
- Reduces trajectory length (models produce verbose reasoning)
- Identifies which experiences were actually used
- Highlights errors/detours for learning
- Provides clean input for Stage 2
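As an illustration, filling this template for one rollout might look like the sketch below; build_stage1_prompt and its arguments are hypothetical, and the template string is abbreviated.

# Sketch: filling the Stage 1 template for a single rollout.
# The template below is abbreviated; see the full prompt text above.
STAGE1_TEMPLATE = (
    "An agent system may be provided with some experiences, and then it produces "
    "the following trajectory to solve the given problem. Please summarize the "
    "trajectory step-by-step. [...]\n"
    "<trajectory>{full_output}</trajectory>\n"
    "<evaluation>{evaluation}</evaluation>\n"
    "<groundtruth>{answer}</groundtruth>\n"
    "Only return the trajectory summary of each step."
)

def build_stage1_prompt(output: str, correct: bool, groundtruth: str) -> str:
    return STAGE1_TEMPLATE.format(
        full_output=output,
        evaluation="correct" if correct else "wrong",
        answer=groundtruth,
    )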
Stage 2: Group Advantage Extraction (Figure 12 in Paper)¶
Goal: Compare G trajectories to identify what made successful ones succeed.
Prompt Template:
An agent system is provided with a set of experiences and has tried to solve
the problem multiple times with both successful and wrong solutions. Review
these problem-solving attempts and extract generalizable experiences.
1. Trajectory Analysis:
- For successful steps: Identify key correct decisions
- For errors: Pinpoint where/why reasoning went wrong
- Note patterns or strategies used/missed
2. Update Existing Experiences:
- Options: [modify, add, delete]
- Max 3 operations per group
- Requirements: Begin with context, focus on strategic patterns
<problem>{query}</problem>
<trajectories>
Attempt 1 (correct): {summary_1}
Attempt 2 (wrong): {summary_2}
...
</trajectories>
<groundtruth>{answer}</groundtruth>
<experience>{current_library}</experience>
Return JSON: [{"option": "add", "experience": "..."}, ...]
Example Output:
[
{
"option": "add",
"experience": "For quadratic equations, always check discriminant sign before attempting to find roots."
},
{
"option": "modify",
"experience": "When solving polynomial equations, verify solution count matches degree minus multiplicity.",
"modified_from": "G17"
}
]
Why This Stage Matters:
- Extracts why one approach succeeded over another
- Focuses on decision points (not calculations)
- Proposes concrete updates to the experience library
- Limits to 3 operations to maintain quality
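In practice the extractor also has to tolerate replies that wrap the JSON in code fences or extra prose. A defensive parsing sketch follows; parse_operations is a hypothetical helper, not the shipped parser.

import json
import re

def parse_operations(llm_output: str, max_operations: int = 3) -> list:
    """Pull the first JSON array out of an LLM reply and cap it at 3 operations."""
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if not match:
        return []  # No JSON found: skip this group rather than crash
    try:
        operations = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed operations, at most max_operations per group
    valid = [op for op in operations
             if isinstance(op, dict) and op.get("option") in {"add", "modify", "delete"}]
    return valid[:max_operations]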
Stage 3: Batch Consolidation (Figure 13 in Paper)¶
Goal: Merge all group operations into final, non-redundant updates.
Prompt Template:
An agent system has tried to solve multiple problems. From the reflections,
some suggestions on experiences have been posed. Consolidate these into
final experience revisions.
Requirements:
1. Clear, generalizable, ≤32 words each
2. Focus on strategic thinking (not calculations)
3. Avoid duplication between experiences
<experience>{current_library}</experience>
<suggested_updates>
[
{from group 1 operations},
{from group 2 operations},
...
]
</suggested_updates>
Options: [modify, merge, delete]
Return JSON with final operations.
Example Output:
[
{
"option": "merge",
"experience": "For polynomial equations, verify discriminant before solving and ensure solution count matches theoretical maximum.",
"merged_from": ["G3", "G17", "S4"]
},
{
"option": "modify",
"experience": "When encountering complex numbers, separate real and imaginary parts early in the solution process.",
"modified_from": "G21"
}
]
Why This Stage Matters:
- Prevents experience library explosion
- Ensures consistency across the batch
- Eliminates near-duplicates
- Enforces quality standards (≤32 words)
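Applying the final operations is a small in-place update of the library. A sketch against the JSON layout shown earlier follows; the exp_id key for delete and the simplified ID scheme are assumptions, while modified_from and merged_from follow the example outputs above.

def apply_operations(library: dict, operations: list) -> None:
    """Apply consolidated add/modify/delete/merge operations in place.
    `library` maps experience IDs (e.g. "G21") to text (sketch only)."""
    for op in operations:
        text = op.get("experience", "")
        if text and len(text.split()) > 32:
            continue  # Enforce the ≤32-word limit by skipping oversized entries
        if op["option"] == "add":
            library[f"G{len(library)}"] = text   # Simplified ID scheme for the sketch
        elif op["option"] == "modify":
            library[op["modified_from"]] = text
        elif op["option"] == "delete":
            library.pop(op.get("exp_id", ""), None)   # delete key is an assumption
        elif op["option"] == "merge":
            for old_id in op.get("merged_from", []):
                library.pop(old_id, None)
            library[f"G{len(library)}"] = text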
Cost & Performance Analysis¶
Cost Breakdown (100 training samples, 3 epochs)¶
| Component | API Calls | Tokens (avg) | Cost per Call | Total Cost |
|---|---|---|---|---|
| Rollout Generation (G=5) | 1,500 | 2,000 | $0.002 | $3.00 |
| Stage 1: Summarization | 1,500 | 1,500 | $0.002 | $3.00 |
| Stage 2: Group Extraction | 300 | 4,000 | $0.004 | $1.20 |
| Stage 3: Batch Consolidation | 30 | 6,000 | $0.006 | $0.18 |
| Evaluation | 300 | 1,000 | $0.001 | $0.30 |
| Total | 3,630 | - | - | $7.68 |
Actual costs vary by provider:
- DeepSeek: $0.14/M input, $0.28/M output → $8-12 total
- OpenAI (GPT-4o-mini): $0.15/M input, $0.60/M output → $12-18 total
- OpenAI (GPT-4o): $2.50/M input, $10/M output → $60-100 total
Comparison to Traditional Fine-Tuning:
- LoRA (7B model): $500-1,000 (GPU hours + setup)
- QLoRA (32B model): $2,000-5,000 (memory-efficient)
- Full Fine-Tuning (70B): $10,000-50,000 (days of GPU time)
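As a sanity check, the totals in the cost table above can be reproduced in a few lines; the per-call prices are the table's illustrative figures, not any provider's rate card.

# Rough cost model for 100 samples x 3 epochs, using the illustrative
# per-call prices from the table above.
pipeline = {
    "rollouts (G=5)":        (1500, 0.002),
    "stage 1 summarization": (1500, 0.002),
    "stage 2 extraction":    (300,  0.004),
    "stage 3 consolidation": (30,   0.006),
    "evaluation":            (300,  0.001),
}

total_calls = sum(calls for calls, _ in pipeline.values())
total_cost = sum(calls * price for calls, price in pipeline.values())
print(f"{total_calls} calls, ~${total_cost:.2f}")   # 3630 calls, ~$7.68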
Performance Metrics (from Paper)¶
AIME Math Competition (100 training samples):
- Baseline (no training): 67.9% (AIME25), 80.0% (AIME24)
- After Continuous Learning: 73.3% (+5.4%), 82.7% (+2.7%)
- vs Traditional Fine-Tuning: +2-5% improvement at 1000x lower cost
Cross-Domain Generalization:

| Method | AIME24 (Math) | WebWalker (Web) |
|---|---|---|
| ReTool (math-tuned) | 67.0% | 18.3% ❌ |
| MiroThinker (web-tuned) | 43.5% ❌ | 53.6% |
| Continuous Learning | 82.7% ✅ | 67.8% ✅ |
Key Insight: Frozen base model + domain experiences outperforms specialized fine-tuning on both domains.
Time Comparison¶
| Stage | Time (100 samples) |
|---|---|
| Rollout generation | 30-45 min |
| Summarization | 20-30 min |
| Group extraction | 10-15 min |
| Batch consolidation | 2-3 min |
| Total (1 epoch) | ~1.5 hours |
| 3 epochs | ~4-5 hours |
vs Fine-Tuning:
- LoRA (7B): 4-8 hours
- Full (32B): 24-72 hours
- Full (70B): 3-7 days
Integration with Gym¶
Installation¶
# Install Gym with Continuous Learning dependencies
pip install zoo-gym[grpo]
# Or install from source
git clone https://github.com/zooai/gym.git
cd gym
pip install -e ".[grpo]"
Additional dependencies:
Configuration¶
Create a YAML config file (configs/continuous_learning.yaml):
# Base model configuration
model_name_or_path: "Qwen/Qwen3-4B-Instruct" # Or use API adapter
template: "qwen3"
# Continuous Learning GRPO parameters
training_free_grpo: true
grpo_group_size: 5 # Number of rollouts per query
grpo_use_groundtruth: true # Use ground truth for advantage extraction
grpo_beta: 0.01 # KL penalty coefficient (unused in training-free)
grpo_experience_lib_path: "./output/experiences"
# API Model Adapter (optional - for cloud models)
use_api_model: true
api_provider: "deepseek" # or "openai"
api_key: "${DEEPSEEK_API_KEY}" # Use environment variable
api_model: "deepseek-chat"
# LLM for semantic extraction
llm_api_provider: "deepseek"
llm_api_key: "${DEEPSEEK_API_KEY}"
llm_model: "deepseek-chat"
# Dataset
dataset: "alpaca_en_demo" # Your custom dataset
dataset_dir: "./data"
# Training
output_dir: "./output/continuous_learning"
num_train_epochs: 3
per_device_train_batch_size: 8
learning_rate: 0 # No gradient updates
logging_steps: 10
save_steps: 100
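If you load this config outside the CLI, the ${DEEPSEEK_API_KEY} placeholders need to be expanded. A minimal sketch using PyYAML and os.path.expandvars follows; how the gym CLI resolves these internally is not specified here.

import os
import yaml

def load_config(path: str) -> dict:
    """Load a YAML config and expand ${VAR} references from the environment."""
    with open(path) as f:
        raw = f.read()
    expanded = os.path.expandvars(raw)  # replaces ${DEEPSEEK_API_KEY} etc.
    return yaml.safe_load(expanded)

config = load_config("configs/continuous_learning.yaml")
if config["api_key"].startswith("${"):
    raise RuntimeError("DEEPSEEK_API_KEY is not set in the environment")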
Python API Usage¶
from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient
from gym.train.grpo.api_model_adapter import DeepSeekAdapter
# 1. Initialize components
experience_manager = ExperienceManager(checkpoint_path="./experiences.json")
llm_client = LLMClient(api_key="sk-xxx", model="deepseek-chat")
extractor = SemanticExtractor(llm_client, max_operations=3)
model_adapter = DeepSeekAdapter(api_key="sk-xxx")
# 2. Training loop
queries = [
"Solve: x² + 2x + 5 = 0",
"Find derivative of x³ + 2x",
...
]
groundtruths = ["No real solutions", "3x² + 2", ...]
for epoch in range(3):
print(f"\n=== Epoch {epoch+1}/3 ===")
for query, gt in zip(queries, groundtruths):
# Generate G rollouts
experiences = experience_manager.format_for_prompt()
trajectories = []
for _ in range(5): # G=5
output = model_adapter.generate_with_experiences(query, experiences)
reward = evaluate_answer(output, gt) # Your reward function
trajectories.append(Trajectory(query, output, reward, gt))
# Stage 1: Summarize
for traj in trajectories:
traj.summary = extractor.summarize_trajectory(traj)
# Stage 2: Extract group advantage
operations = extractor.extract_group_advantage(
trajectories, experiences, use_groundtruth=True
)
# Apply operations
experience_manager.apply_operations(operations)
# Save checkpoint
experience_manager.save(f"./experiences_epoch{epoch+1}.json")
print(f"Experience library size: {len(experience_manager)}")
# 3. Final evaluation
print("\n=== Final Experience Library ===")
print(experience_manager.format_for_prompt())
CLI Usage¶
# Quick start with example config
gym train \
--config configs/continuous_learning.yaml \
--dataset alpaca_en_demo
# With custom dataset
gym train \
--config configs/continuous_learning.yaml \
--dataset custom_math_problems \
--dataset_dir ./my_data
# Resume from checkpoint
gym train \
--config configs/continuous_learning.yaml \
--resume_from_checkpoint ./output/continuous_learning/checkpoint-100
Advanced Usage¶
Custom Reward Functions¶
By default, Gym uses simple correctness (0/1). You can provide custom rewards:
def custom_reward_function(query: str, output: str, groundtruth: str) -> float:
"""
Custom reward that considers partial correctness.
Returns:
reward: Float in [0, 1], higher is better
"""
# Exact match
if output.strip().lower() == groundtruth.strip().lower():
return 1.0
# Partial credit for methodology
if "discriminant" in output and "quadratic" in query:
return 0.5
# Wrong answer
return 0.0
# Use in training
reward = custom_reward_function(query, output, gt)
Multi-Domain Experiences¶
Organize experiences by domain for better retrieval:
class DomainAwareExperienceManager(ExperienceManager):
def __init__(self, checkpoint_path=None):
super().__init__(checkpoint_path)
self.domains = {} # exp_id -> domain
def add_with_domain(self, experience: str, domain: str) -> str:
exp_id = self.add(experience)
self.domains[exp_id] = domain
return exp_id
def format_for_domain(self, domain: str) -> str:
"""Return only experiences for specific domain."""
filtered = {
exp_id: text
for exp_id, text in self.experiences.items()
if self.domains.get(exp_id) == domain
}
return "\n".join([f"[{k}]. {v}" for k, v in filtered.items()])
# Usage
manager = DomainAwareExperienceManager()
manager.add_with_domain("For calculus problems...", domain="math.calculus")
manager.add_with_domain("When parsing JSON...", domain="coding.json")
# Retrieve domain-specific experiences
math_exp = manager.format_for_domain("math.calculus")
Embedding-Based Retrieval¶
For large experience libraries (100+ experiences), use semantic search:
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticExperienceManager(ExperienceManager):
    def __init__(self, checkpoint_path=None):
        super().__init__(checkpoint_path)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = {}  # exp_id -> embedding

    def add(self, experience: str) -> str:
        exp_id = super().add(experience)
        self.embeddings[exp_id] = self.embedder.encode(experience)
        return exp_id

    @staticmethod
    def _cosine_similarity(a, b) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def retrieve_relevant(self, query: str, top_k: int = 5) -> str:
        """Retrieve top-k most relevant experiences."""
        query_emb = self.embedder.encode(query)
        # Compute cosine similarity between the query and every stored experience
        similarities = {
            exp_id: self._cosine_similarity(query_emb, emb)
            for exp_id, emb in self.embeddings.items()
        }
        # Get top-k experience IDs by similarity
        top_ids = sorted(similarities, key=similarities.get, reverse=True)[:top_k]
        # Format in the library's "[ID]. text" style
        relevant = {exp_id: self.experiences[exp_id] for exp_id in top_ids}
        return "\n".join([f"[{k}]. {v}" for k, v in relevant.items()])
# Usage
manager = SemanticExperienceManager()
# ... add experiences ...
relevant_exp = manager.retrieve_relevant("Solve quadratic equation", top_k=3)
Experience Governance (DAO-style)¶
Enable community voting on experiences:
class GovernedExperienceManager(ExperienceManager):
def __init__(self, checkpoint_path=None):
super().__init__(checkpoint_path)
self.votes = {} # exp_id -> {"upvotes": int, "downvotes": int}
self.vote_threshold = 0.66 # 2/3 majority
def propose_experience(self, experience: str) -> str:
"""Propose experience for community vote."""
exp_id = f"P{self._next_id}" # Pending
self.votes[exp_id] = {"upvotes": 0, "downvotes": 0, "text": experience}
self._next_id += 1
return exp_id
def vote(self, exp_id: str, upvote: bool):
"""Cast vote on proposed experience."""
if upvote:
self.votes[exp_id]["upvotes"] += 1
else:
self.votes[exp_id]["downvotes"] += 1
def finalize_votes(self):
"""Accept experiences with 2/3 majority."""
for exp_id, vote_data in list(self.votes.items()):
total = vote_data["upvotes"] + vote_data["downvotes"]
if total == 0:
continue
approval_rate = vote_data["upvotes"] / total
if approval_rate >= self.vote_threshold:
# Accept experience
self.add(vote_data["text"])
print(f"✅ Accepted: {exp_id} ({approval_rate:.1%})")
else:
print(f"❌ Rejected: {exp_id} ({approval_rate:.1%})")
del self.votes[exp_id]
Limitations & Best Practices¶
Known Limitations¶
1. Base Model Capability Threshold
   - Works best with strong base models (7B+ parameters)
   - Weaker models struggle with introspection
   - Recommendation: Qwen3-4B+, DeepSeek-V3, GPT-4o-mini+
2. Context Window Constraints
   - Large experience libraries (200+) may exceed the context limit
   - Solution: Use embedding-based retrieval (top-k relevant)
3. Cost Scales with Group Size
   - G=8 requires 8x API calls per query
   - Trade-off: G=1 (cheap but ineffective) vs G=8 (expensive but better)
   - Sweet spot: G=5
4. Ground Truth Dependency
   - Best results with ground truth labels
   - Without ground truth: -2% to -5% performance
   - Can use self-discrimination via majority voting
5. Domain Specificity
   - Experiences are domain-specific; math experiences don't help coding tasks
   - Solution: Multi-domain libraries or domain tagging
Best Practices¶
1. Start Small, Scale Gradually¶
# Phase 1: Prototype (10 examples, 1 epoch)
# Goal: Validate setup, check experience quality
# Phase 2: Pilot (50 examples, 2 epochs)
# Goal: Measure performance improvement
# Phase 3: Production (100-500 examples, 3 epochs)
# Goal: Deploy and monitor
2. Monitor Experience Quality¶
def audit_experience_library(manager: ExperienceManager):
"""Check for common quality issues."""
issues = []
for exp_id, exp_text in manager.experiences.items():
# Check length
word_count = len(exp_text.split())
if word_count > 32:
issues.append(f"{exp_id}: Too long ({word_count} words)")
# Check if too specific (mentions numbers)
if any(char.isdigit() for char in exp_text):
issues.append(f"{exp_id}: Contains numbers (too specific)")
# Check if actionable (has verb)
if not any(word in exp_text.lower() for word in ["use", "check", "verify", "apply", "calculate"]):
issues.append(f"{exp_id}: No action verb (not actionable)")
return issues
# Run after each epoch
issues = audit_experience_library(experience_manager)
if issues:
print("⚠️ Quality Issues:")
for issue in issues:
print(f" - {issue}")
3. Use Checkpointing¶
import json
from datetime import datetime

# Save after every epoch
for epoch in range(num_epochs):
# ... training ...
# Checkpoint
checkpoint_path = f"{output_dir}/experiences_epoch{epoch+1}.json"
experience_manager.save(checkpoint_path)
# Also save metadata
metadata = {
"epoch": epoch + 1,
"num_experiences": len(experience_manager),
"timestamp": datetime.now().isoformat(),
"performance": evaluate_on_validation_set()
}
with open(f"{output_dir}/metadata_epoch{epoch+1}.json", 'w') as f:
json.dump(metadata, f, indent=2)
4. Combine with Traditional Methods¶
# Hybrid approach:
# 1. Fine-tune base model on large general dataset (one-time)
# 2. Use Continuous Learning for task-specific adaptation (ongoing)
# Step 1: Fine-tune (traditional)
gym train \
--model_name_or_path Qwen/Qwen3-4B \
--finetuning_type lora \
--dataset large_math_corpus \
--output_dir ./output/math_base
# Step 2: Continuous Learning (on top of fine-tuned model)
gym train \
--model_name_or_path ./output/math_base \
--adapter_name_or_path ./output/math_base/lora_adapter \
--config configs/continuous_learning.yaml \
--dataset specific_geometry_problems \
--output_dir ./output/geometry_adapted
5. Validate on Out-of-Domain Tasks¶
# Test generalization
def cross_domain_evaluation(experience_manager):
"""Test if experiences help on unseen domains."""
domains = ["math", "coding", "logic", "creative"]
results = {}
for domain in domains:
test_set = load_test_set(domain)
accuracy = evaluate_with_experiences(
test_set,
experience_manager.format_for_prompt()
)
results[domain] = accuracy
return results
# Run after training
results = cross_domain_evaluation(experience_manager)
print("Cross-Domain Performance:")
for domain, acc in results.items():
print(f" {domain}: {acc:.1%}")
Troubleshooting¶
Common Issues¶
Issue: Experience library not growing¶
Symptoms: len(experience_manager) stays at 0 or very small
Causes:
1. Group size too small (G=1)
2. Homogeneous rewards (all correct or all wrong)
3. LLM not returning valid JSON
Solutions:
# 1. Increase group size
grpo_group_size: 5 # or higher
# 2. Check reward distribution
rewards = [t.reward for t in trajectories]
print(f"Reward std: {np.std(rewards)}") # Should be > 0
# 3. Debug JSON parsing
response = extractor.extract_group_advantage(...)
if not response:
print("No operations extracted - check LLM output")
Issue: Performance not improving¶
Symptoms: Validation accuracy flat or decreasing
Causes:
1. Base model too weak
2. Experiences too specific (not generalizable)
3. Not enough epochs
Solutions:
# 1. Use stronger base model
model_name_or_path: "Qwen/Qwen3-7B-Instruct" # or DeepSeek-V3 via API
# 2. Audit experience quality (see Best Practices)
# 3. Increase epochs
num_train_epochs: 5 # instead of 3
Issue: API rate limits¶
Symptoms: RateLimitError from OpenAI/DeepSeek
Solutions:
import time
from openai import RateLimitError  # adjust if your provider raises a different exception

# Add retry logic with exponential backoff
def generate_with_retry(adapter, prompt, max_retries=3):
for i in range(max_retries):
try:
return adapter.generate(prompt)
except RateLimitError:
if i < max_retries - 1:
time.sleep(2 ** i) # Exponential backoff
else:
raise
# Or use batch processing with delays
for batch in batches:
process_batch(batch)
time.sleep(1) # 1 second between batches
Summary¶
Continuous Learning GRPO represents a paradigm shift in AI model improvement:
✅ Learn from dozens, not thousands of examples
✅ $18 vs $10,000+ cost reduction
✅ Human-readable experiences instead of black-box weights
✅ Minutes, not days of training time
✅ Cross-domain transfer maintained
✅ Verifiable model weights (frozen, cryptographically hashable)
Next Steps:
1. Read the Chat-to-Experience Tutorial
2. Read the Custom Agent Guide
3. Check the API Reference
4. Try the Quick Start
Join the Community:
- GitHub: github.com/zooai/gym
- Discord: discord.gg/zooai
- Twitter: @zoolabsfdn
Last Updated: October 28, 2025
Gym v0.9.4 - Democratizing AI Training
Zoo Labs Foundation Inc - 501(c)(3) Non-Profit