Continuous Learning GRPO - Training-Free AI Model Improvement¶
Executive Summary¶
Continuous Learning GRPO enables AI models to improve through experience accumulation rather than parameter updates. Instead of traditional fine-tuning (which requires expensive GPU compute and thousands of examples), this approach learns from 50-100 examples by extracting and curating semantic experiences - human-readable insights that guide future problem-solving.
Key Results (from Tencent Research, arXiv:2510.08191):
- Cost: $18 vs $10,000+ for fine-tuning
- Performance: 82.7% on AIME24 (+2.7% over the untrained baseline)
- Data efficiency: 50-100 examples vs thousands
- Training time: Minutes vs hours/days
- Interpretability: Human-readable experiences vs black-box weights
Table of Contents¶
- What is Continuous Learning GRPO?
- Core Concept: Semantic Experiences
- Architecture Overview
- When to Use vs Traditional Fine-Tuning
- Three-Stage Learning Process
- Cost & Performance Analysis
- Integration with Gym
- Quick Start
- Advanced Usage
- Limitations & Best Practices
What is Continuous Learning GRPO?¶
Traditional AI model improvement requires updating billions of parameters through gradient descent - an expensive, opaque process. Continuous Learning GRPO takes a fundamentally different approach:
Traditional Fine-Tuning:
Model weights → Gradient updates → Modified weights
Cost: $10,000+ | Time: Hours/Days | Interpretability: None
Continuous Learning:
Experiences → Semantic extraction → Updated experience library
Cost: $18 | Time: Minutes | Interpretability: Full
The Innovation¶
Instead of changing what the model is (its weights), we change what the model sees (its context). The base model remains frozen (verifiable via cryptographic hash), while a curated library of experiences guides its reasoning.
Analogy: Traditional fine-tuning is like brain surgery - modifying neural connections. Continuous Learning is like education - providing better examples and guidance.
Core Concept: Semantic Experiences¶
What is a Semantic Experience?¶
A semantic experience is a concise (≤32 words) natural language statement that captures a generalizable problem-solving strategy.
Examples from Mathematics Domain:
[G0]. When solving geometry problems with intersections, validate solutions
lie within bounded regions or segments, not on extensions, to avoid
extraneous answers.
[G1]. For expected extreme statistics in combinatorial problems, use direct
enumeration for small sizes.
[G10]. When using mathematical invariants to prove impossibility, always
validate them against known achievable states or small cases.
[G21]. For complex polynomials with real parameters, separate real and
imaginary parts to find when real roots exist.
Characteristics of Good Experiences¶
- Strategic, not computational: "Check boundary conditions" not "Calculate derivative"
- Context-aware: Begins with "When [condition]..."
- Actionable: Provides clear guidance
- Generalizable: Applies to similar problem classes
- Concise: ≤32 words
- Domain-agnostic: Focuses on reasoning patterns
Traditional Numerical Advantages vs Semantic Advantages¶
| Aspect | Numerical (GRPO) | Semantic (Continuous Learning) |
|---|---|---|
| Format | Scalar: 0.73 | Text: "When solving equations..." |
| Interpretability | Opaque | Human-readable |
| Persistence | Lost after batch | Accumulated in library |
| Composability | Single value | Combinable experiences |
| Governance | Not applicable | DAO-votable |
| Auditability | None | Full trail |
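To make the contrast concrete, here is a small illustration. The normalization follows standard GRPO practice and the reward values are made up; the semantic entry is taken from the math examples above.

import numpy as np

# Numerical GRPO advantage: one scalar per rollout, consumed by a gradient
# update and then discarded (illustrative rewards for G=5 rollouts).
rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(2))  # [ 1.22 -0.82  1.22 -0.82 -0.82]

# Semantic advantage: a persistent, human-readable library entry.
semantic_advantage = (
    "When solving geometry problems with intersections, validate solutions "
    "lie within bounded regions or segments, not on extensions."
)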
Architecture Overview¶
System Components¶
┌─────────────────────────────────────────────────────────────┐
│ Continuous Learning GRPO │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ ┌───────────────────────┐ │
│ │ User Query │─────→│ Experience Manager │ │
│ └────────────────┘ │ - Load experiences │ │
│ │ - Format for context │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Base Model (Frozen) │ │
│ ┌────────────────┐ │ + Experience Context │ │
│ │ Generate G │◀────┤ → Generate response │ │
│ │ Rollouts │ └───────────────────────┘ │
│ └────────┬───────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────┐ │
│ │ Semantic Extractor (3-Stage LLM Process) │ │
│ │ 1. Summarize trajectories │ │
│ │ 2. Extract group advantages │ │
│ │ 3. Consolidate batch updates │ │
│ └────────────────┬──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Experience Library Update │ │
│ │ - Add new experiences │ │
│ │ - Modify existing ones │ │
│ │ - Delete obsolete entries │ │
│ │ - Merge similar experiences │ │
│ └────────────────────────────────┘ │
│ │
│ [Model weights never change - verifiable via hash] │
└─────────────────────────────────────────────────────────────┘
Key Modules¶
1. ExperienceManager (src/gym/train/grpo/experience_manager.py)¶
Purpose: Manages the experience library E - the core knowledge base.
Operations:
- add(experience): Add a new experience
- modify(exp_id, new_text): Update an existing experience
- delete(exp_id): Remove an obsolete experience
- merge(exp_ids, merged_text): Combine similar experiences
- format_for_prompt(): Convert the library to a context string
- save(path) / load(path): Persistence
Data Structure:
{
"experiences": {
"G0": "When solving equations, verify by substitution...",
"G1": "For optimization, check boundary conditions...",
...
},
"next_id": 42
}
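For intuition, here is a minimal in-memory sketch that matches this layout. The method names mirror the operations listed above, but this is a sketch, not the shipped ExperienceManager.

import json

class MinimalExperienceManager:
    """Toy version of the experience library (sketch only)."""

    def __init__(self):
        self.experiences = {}   # "G0" -> experience text
        self.next_id = 0

    def add(self, experience: str) -> str:
        exp_id = f"G{self.next_id}"
        self.experiences[exp_id] = experience
        self.next_id += 1
        return exp_id

    def modify(self, exp_id: str, new_text: str):
        self.experiences[exp_id] = new_text

    def delete(self, exp_id: str):
        self.experiences.pop(exp_id, None)

    def merge(self, exp_ids: list, merged_text: str) -> str:
        for exp_id in exp_ids:
            self.delete(exp_id)
        return self.add(merged_text)

    def format_for_prompt(self) -> str:
        return "\n".join(f"[{k}]. {v}" for k, v in self.experiences.items())

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump({"experiences": self.experiences, "next_id": self.next_id}, f, indent=2)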
2. SemanticExtractor (src/gym/train/grpo/semantic_extractor.py)¶
Purpose: Extracts semantic advantages using a three-stage LLM process.
Stage 1 - Trajectory Summarization:
def summarize_trajectory(trajectory, use_groundtruth=True):
"""
Input: Full trajectory, correctness label, ground truth
Output: Step-by-step natural language summary
Example:
1. Applied quadratic formula with a=1, b=2, c=5
2. Calculated discriminant: b²-4ac = 4-20 = -16
3. Concluded no real solutions (correct)
"""
Stage 2 - Group Advantage Extraction:
def extract_group_advantage(trajectories, experiences):
"""
Input: G trajectories (both correct/incorrect), current experiences
Output: JSON operations [{"option": "add", "experience": "..."}, ...]
Process:
- Compare successful vs failed trajectories
- Identify what made successful ones succeed
- Propose max 3 operations: add/modify/delete
"""
Stage 3 - Batch Consolidation:
def consolidate_batch(all_group_operations, experiences):
"""
Input: All group operations from batch, current experiences
Output: Final consolidated operations
Process:
- Merge similar suggestions
- Ensure ≤32 words per experience
- Eliminate redundancy
"""
3. APIModelAdapter (src/gym/train/grpo/api_model_adapter.py)¶
Purpose: Enable using cloud-hosted models (DeepSeek, OpenAI) instead of local GPUs.
Benefits:
- No GPU required
- Faster inference (optimized infrastructure)
- Better base models (DeepSeek-V3, GPT-4o)
- Lower total cost (pay per use)
Usage:
from gym.train.grpo.api_model_adapter import DeepSeekAdapter
adapter = DeepSeekAdapter(api_key="sk-xxx", model="deepseek-chat")
response = adapter.generate_with_experiences(
query="Solve: x² + 2x + 5 = 0",
experiences=experience_library.format_for_prompt()
)
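How generate_with_experiences turns the library into context is not spelled out above. One plausible assembly, with build_messages as a hypothetical helper, is to place the experiences in the system message and the query in the user turn:

# Hypothetical sketch of how an adapter might inject experiences as context.
def build_messages(query: str, experiences: str) -> list:
    system = (
        "You may use the following experiences as guidance when solving "
        "the problem:\n" + experiences
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]

# messages = build_messages("Solve: x² + 2x + 5 = 0",
#                           experience_library.format_for_prompt())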
When to Use vs Traditional Fine-Tuning¶
Continuous Learning is Better When:¶
✅ Limited training data (50-1000 examples)
- Traditional methods need 10K+ examples
- Continuous Learning learns from dozens

✅ Tight budget ($18 vs $10,000+)
- No GPU infrastructure needed
- Pay only for API calls

✅ Fast iteration (minutes vs days)
- Immediate experimentation
- No waiting for training convergence

✅ Interpretability required
- Experiences are human-readable
- Can audit why the model made decisions
- Stakeholders can review/vote on experiences

✅ Domain adaptation (medical, legal, specialized)
- Transfer to a new domain without retraining
- Domain experts can write experiences directly
- Cross-domain performance maintained

✅ Governance/compliance
- Need an audit trail
- Community-driven model evolution
- Verifiable model weights (frozen)
Traditional Fine-Tuning is Better When:¶
❌ Massive training data (100K+ examples)
- Fine-tuning scales better with data
- Parameter updates can capture complex patterns

❌ Latency-critical inference
- Fine-tuned models have no context overhead
- Continuous Learning adds ~500 tokens of context

❌ Extremely specialized tasks
- Deep domain knowledge encoded in weights
- When semantic experiences can't capture nuance

❌ One-time deployment (static model)
- No need for continuous updates
- Fixed deployment constraints
Hybrid Approach: Best of Both Worlds¶
You can combine both:
1. Base model: Fine-tuned on a large general dataset
2. Continuous Learning: Adapt to specific use cases
3. Result: General competence + specific expertise
Three-Stage Learning Process¶
Overview¶
Input: G rollouts per query, ground truth labels
Output: Updated experience library E
For each epoch:
For each batch:
For each query in batch:
Stage 1: Summarize each of G trajectories
Stage 2: Extract group advantage (max 3 operations)
Stage 3: Consolidate all group operations
Apply consolidated operations to E
Stage 1: Trajectory Summarization (Figure 11 in Paper)¶
Goal: Analyze what happened in each rollout step-by-step.
Prompt Template:
An agent system may be provided with some experiences, and then it produces
the following trajectory to solve the given problem. Please summarize the
trajectory step-by-step:
1. For each step, describe what action is being taken, and which experience
has been used in this step.
2. Given the grading of this rollout and the correct answer, identify and
explain any steps that represent detours, errors, or backtracking.
3. Maintain all the core outcome of each step.
<trajectory>{full_output}</trajectory>
<evaluation>{correct/wrong}</evaluation>
<groundtruth>{answer}</groundtruth>
Only return the trajectory summary of each step.
Example Output:
1. Applied quadratic formula with a=1, b=2, c=5 (Used experience G21)
2. Calculated discriminant: b²-4ac = 4-20 = -16
3. Correctly concluded no real solutions exist (discriminant < 0)
4. Verified answer matches ground truth
Why This Stage Matters:
- Reduces trajectory length (models produce verbose reasoning)
- Identifies which experiences were actually used
- Highlights errors/detours for learning
- Provides clean input for Stage 2
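As an illustration, filling this template for one rollout might look like the sketch below; build_stage1_prompt and its arguments are hypothetical, and the template string is abbreviated.

# Sketch: filling the Stage 1 template for a single rollout.
# The template below is abbreviated; see the full prompt text above.
STAGE1_TEMPLATE = (
    "An agent system may be provided with some experiences, and then it produces "
    "the following trajectory to solve the given problem. Please summarize the "
    "trajectory step-by-step. [...]\n"
    "<trajectory>{full_output}</trajectory>\n"
    "<evaluation>{evaluation}</evaluation>\n"
    "<groundtruth>{answer}</groundtruth>\n"
    "Only return the trajectory summary of each step."
)

def build_stage1_prompt(output: str, correct: bool, groundtruth: str) -> str:
    return STAGE1_TEMPLATE.format(
        full_output=output,
        evaluation="correct" if correct else "wrong",
        answer=groundtruth,
    )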
Stage 2: Group Advantage Extraction (Figure 12 in Paper)¶
Goal: Compare G trajectories to identify what made successful ones succeed.
Prompt Template:
An agent system is provided with a set of experiences and has tried to solve
the problem multiple times with both successful and wrong solutions. Review
these problem-solving attempts and extract generalizable experiences.
1. Trajectory Analysis:
- For successful steps: Identify key correct decisions
- For errors: Pinpoint where/why reasoning went wrong
- Note patterns or strategies used/missed
2. Update Existing Experiences:
- Options: [modify, add, delete]
- Max 3 operations per group
- Requirements: Begin with context, focus on strategic patterns
<problem>{query}</problem>
<trajectories>
Attempt 1 (correct): {summary_1}
Attempt 2 (wrong): {summary_2}
...
</trajectories>
<groundtruth>{answer}</groundtruth>
<experience>{current_library}</experience>
Return JSON: [{"option": "add", "experience": "..."}, ...]
Example Output:
[
{
"option": "add",
"experience": "For quadratic equations, always check discriminant sign before attempting to find roots."
},
{
"option": "modify",
"experience": "When solving polynomial equations, verify solution count matches degree minus multiplicity.",
"modified_from": "G17"
}
]
Why This Stage Matters:
- Extracts why one approach succeeded over another
- Focuses on decision points (not calculations)
- Proposes concrete updates to the experience library
- Limits to 3 operations to maintain quality
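In practice the extractor also has to tolerate replies that wrap the JSON in code fences or extra prose. A defensive parsing sketch follows; parse_operations is a hypothetical helper, not the shipped parser.

import json
import re

def parse_operations(llm_output: str, max_operations: int = 3) -> list:
    """Pull the first JSON array out of an LLM reply and cap it at 3 operations."""
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if not match:
        return []  # No JSON found: skip this group rather than crash
    try:
        operations = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed operations, at most max_operations per group
    valid = [op for op in operations
             if isinstance(op, dict) and op.get("option") in {"add", "modify", "delete"}]
    return valid[:max_operations]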
Stage 3: Batch Consolidation (Figure 13 in Paper)¶
Goal: Merge all group operations into final, non-redundant updates.
Prompt Template:
An agent system has tried to solve multiple problems. From the reflections,
some suggestions on experiences have been posed. Consolidate these into
final experience revisions.
Requirements:
1. Clear, generalizable, ≤32 words each
2. Focus on strategic thinking (not calculations)
3. Avoid duplication between experiences
<experience>{current_library}</experience>
<suggested_updates>
[
{from group 1 operations},
{from group 2 operations},
...
]
</suggested_updates>
Options: [modify, merge, delete]
Return JSON with final operations.
Example Output:
[
{
"option": "merge",
"experience": "For polynomial equations, verify discriminant before solving and ensure solution count matches theoretical maximum.",
"merged_from": ["G3", "G17", "S4"]
},
{
"option": "modify",
"experience": "When encountering complex numbers, separate real and imaginary parts early in the solution process.",
"modified_from": "G21"
}
]
Why This Stage Matters:
- Prevents experience library explosion
- Ensures consistency across the batch
- Eliminates near-duplicates
- Enforces quality standards (≤32 words)
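Applying the final operations is a small in-place update of the library. A sketch against the JSON layout shown earlier follows; the exp_id key for delete and the simplified ID scheme are assumptions, while modified_from and merged_from follow the example outputs above.

def apply_operations(library: dict, operations: list) -> None:
    """Apply consolidated add/modify/delete/merge operations in place.
    `library` maps experience IDs (e.g. "G21") to text (sketch only)."""
    for op in operations:
        text = op.get("experience", "")
        if text and len(text.split()) > 32:
            continue  # Enforce the ≤32-word limit by skipping oversized entries
        if op["option"] == "add":
            library[f"G{len(library)}"] = text   # Simplified ID scheme for the sketch
        elif op["option"] == "modify":
            library[op["modified_from"]] = text
        elif op["option"] == "delete":
            library.pop(op.get("exp_id", ""), None)   # delete key is an assumption
        elif op["option"] == "merge":
            for old_id in op.get("merged_from", []):
                library.pop(old_id, None)
            library[f"G{len(library)}"] = text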
Cost & Performance Analysis¶
Cost Breakdown (100 training samples, 3 epochs)¶
| Component | API Calls | Tokens (avg) | Cost per Call | Total Cost |
|---|---|---|---|---|
| Rollout Generation (G=5) | 1,500 | 2,000 | $0.002 | $3.00 |
| Stage 1: Summarization | 1,500 | 1,500 | $0.002 | $3.00 |
| Stage 2: Group Extraction | 300 | 4,000 | $0.004 | $1.20 |
| Stage 3: Batch Consolidation | 30 | 6,000 | $0.006 | $0.18 |
| Evaluation | 300 | 1,000 | $0.001 | $0.30 |
| Total | 3,630 | - | - | $7.68 |
Actual costs vary by provider:
- DeepSeek: $0.14/M input, $0.28/M output → $8-12 total
- OpenAI (GPT-4o-mini): $0.15/M input, $0.60/M output → $12-18 total
- OpenAI (GPT-4o): $2.50/M input, $10/M output → $60-100 total
Comparison to Traditional Fine-Tuning:
- LoRA (7B model): $500-1,000 (GPU hours + setup)
- QLoRA (32B model): $2,000-5,000 (memory-efficient)
- Full Fine-Tuning (70B): $10,000-50,000 (days of GPU time)
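As a sanity check, the totals in the cost table above can be reproduced in a few lines; the per-call prices are the table's illustrative figures, not any provider's rate card.

# Rough cost model for 100 samples x 3 epochs, using the illustrative
# per-call prices from the table above.
pipeline = {
    "rollouts (G=5)":        (1500, 0.002),
    "stage 1 summarization": (1500, 0.002),
    "stage 2 extraction":    (300,  0.004),
    "stage 3 consolidation": (30,   0.006),
    "evaluation":            (300,  0.001),
}

total_calls = sum(calls for calls, _ in pipeline.values())
total_cost = sum(calls * price for calls, price in pipeline.values())
print(f"{total_calls} calls, ~${total_cost:.2f}")   # 3630 calls, ~$7.68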
Performance Metrics (from Paper)¶
AIME Math Competition (100 training samples):
- Baseline (no training): 67.9% (AIME25), 80.0% (AIME24)
- After Continuous Learning: 73.3% (+5.4%), 82.7% (+2.7%)
- vs Traditional Fine-Tuning: +2-5% improvement at 1000x lower cost
Cross-Domain Generalization:

| Method | AIME24 (Math) | WebWalker (Web) |
|---|---|---|
| ReTool (math-tuned) | 67.0% | 18.3% ❌ |
| MiroThinker (web-tuned) | 43.5% ❌ | 53.6% |
| Continuous Learning | 82.7% ✅ | 67.8% ✅ |
Key Insight: Frozen base model + domain experiences outperforms specialized fine-tuning on both domains.
Time Comparison¶
| Stage | Time (100 samples) |
|---|---|
| Rollout generation | 30-45 min |
| Summarization | 20-30 min |
| Group extraction | 10-15 min |
| Batch consolidation | 2-3 min |
| Total (1 epoch) | ~1.5 hours |
| 3 epochs | ~4-5 hours |
vs Fine-Tuning:
- LoRA (7B): 4-8 hours
- Full (32B): 24-72 hours
- Full (70B): 3-7 days
Integration with Gym¶
Installation¶
# Install Gym with Continuous Learning dependencies
pip install zoo-gym[grpo]
# Or install from source
git clone https://github.com/zooai/gym.git
cd gym
pip install -e ".[grpo]"
Additional dependencies:
Configuration¶
Create a YAML config file (configs/continuous_learning.yaml):
# Base model configuration
model_name_or_path: "Qwen/Qwen3-4B-Instruct" # Or use API adapter
template: "qwen3"
# Continuous Learning GRPO parameters
training_free_grpo: true
grpo_group_size: 5 # Number of rollouts per query
grpo_use_groundtruth: true # Use ground truth for advantage extraction
grpo_beta: 0.01 # KL penalty coefficient (unused in training-free)
grpo_experience_lib_path: "./output/experiences"
# API Model Adapter (optional - for cloud models)
use_api_model: true
api_provider: "deepseek" # or "openai"
api_key: "${DEEPSEEK_API_KEY}" # Use environment variable
api_model: "deepseek-chat"
# LLM for semantic extraction
llm_api_provider: "deepseek"
llm_api_key: "${DEEPSEEK_API_KEY}"
llm_model: "deepseek-chat"
# Dataset
dataset: "alpaca_en_demo" # Your custom dataset
dataset_dir: "./data"
# Training
output_dir: "./output/continuous_learning"
num_train_epochs: 3
per_device_train_batch_size: 8
learning_rate: 0 # No gradient updates
logging_steps: 10
save_steps: 100
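If you load this config outside the CLI, the ${DEEPSEEK_API_KEY} placeholders need to be expanded. A minimal sketch using PyYAML and os.path.expandvars follows; how the gym CLI resolves these internally is not specified here.

import os
import yaml

def load_config(path: str) -> dict:
    """Load a YAML config and expand ${VAR} references from the environment."""
    with open(path) as f:
        raw = f.read()
    expanded = os.path.expandvars(raw)  # replaces ${DEEPSEEK_API_KEY} etc.
    return yaml.safe_load(expanded)

config = load_config("configs/continuous_learning.yaml")
if config["api_key"].startswith("${"):
    raise RuntimeError("DEEPSEEK_API_KEY is not set in the environment")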
Python API Usage¶
from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient
from gym.train.grpo.api_model_adapter import DeepSeekAdapter
# 1. Initialize components
experience_manager = ExperienceManager(checkpoint_path="./experiences.json")
llm_client = LLMClient(api_key="sk-xxx", model="deepseek-chat")
extractor = SemanticExtractor(llm_client, max_operations=3)
model_adapter = DeepSeekAdapter(api_key="sk-xxx")
# 2. Training loop
queries = [
"Solve: x² + 2x + 5 = 0",
"Find derivative of x³ + 2x",
...
]
groundtruths = ["No real solutions", "3x² + 2", ...]
for epoch in range(3):
print(f"\n=== Epoch {epoch+1}/3 ===")
for query, gt in zip(queries, groundtruths):
# Generate G rollouts
experiences = experience_manager.format_for_prompt()
trajectories = []
for _ in range(5): # G=5
output = model_adapter.generate_with_experiences(query, experiences)
reward = evaluate_answer(output, gt) # Your reward function
trajectories.append(Trajectory(query, output, reward, gt))
# Stage 1: Summarize
for traj in trajectories:
traj.summary = extractor.summarize_trajectory(traj)
# Stage 2: Extract group advantage
operations = extractor.extract_group_advantage(
trajectories, experiences, use_groundtruth=True
)
# Apply operations
experience_manager.apply_operations(operations)
# Save checkpoint
experience_manager.save(f"./experiences_epoch{epoch+1}.json")
print(f"Experience library size: {len(experience_manager)}")
# 3. Final evaluation
print("\n=== Final Experience Library ===")
print(experience_manager.format_for_prompt())
CLI Usage¶
# Quick start with example config
gym train \
--config configs/continuous_learning.yaml \
--dataset alpaca_en_demo
# With custom dataset
gym train \
--config configs/continuous_learning.yaml \
--dataset custom_math_problems \
--dataset_dir ./my_data
# Resume from checkpoint
gym train \
--config configs/continuous_learning.yaml \
--resume_from_checkpoint ./output/continuous_learning/checkpoint-100
Advanced Usage¶
Custom Reward Functions¶
By default, Gym uses simple correctness (0/1). You can provide custom rewards:
def custom_reward_function(query: str, output: str, groundtruth: str) -> float:
"""
Custom reward that considers partial correctness.
Returns:
reward: Float in [0, 1], higher is better
"""
# Exact match
if output.strip().lower() == groundtruth.strip().lower():
return 1.0
# Partial credit for methodology
if "discriminant" in output and "quadratic" in query:
return 0.5
# Wrong answer
return 0.0
# Use in training
reward = custom_reward_function(query, output, gt)
Multi-Domain Experiences¶
Organize experiences by domain for better retrieval:
class DomainAwareExperienceManager(ExperienceManager):
def __init__(self, checkpoint_path=None):
super().__init__(checkpoint_path)
self.domains = {} # exp_id -> domain
def add_with_domain(self, experience: str, domain: str) -> str:
exp_id = self.add(experience)
self.domains[exp_id] = domain
return exp_id
def format_for_domain(self, domain: str) -> str:
"""Return only experiences for specific domain."""
filtered = {
exp_id: text
for exp_id, text in self.experiences.items()
if self.domains.get(exp_id) == domain
}
return "\n".join([f"[{k}]. {v}" for k, v in filtered.items()])
# Usage
manager = DomainAwareExperienceManager()
manager.add_with_domain("For calculus problems...", domain="math.calculus")
manager.add_with_domain("When parsing JSON...", domain="coding.json")
# Retrieve domain-specific experiences
math_exp = manager.format_for_domain("math.calculus")
Embedding-Based Retrieval¶
For large experience libraries (100+ experiences), use semantic search:
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticExperienceManager(ExperienceManager):
    def __init__(self, checkpoint_path=None):
        super().__init__(checkpoint_path)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = {}  # exp_id -> embedding

    def add(self, experience: str) -> str:
        exp_id = super().add(experience)
        self.embeddings[exp_id] = self.embedder.encode(experience)
        return exp_id

    @staticmethod
    def _cosine_similarity(a, b) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def retrieve_relevant(self, query: str, top_k: int = 5) -> str:
        """Retrieve top-k most relevant experiences."""
        query_emb = self.embedder.encode(query)
        # Compute cosine similarity between the query and every stored experience
        similarities = {
            exp_id: self._cosine_similarity(query_emb, emb)
            for exp_id, emb in self.embeddings.items()
        }
        # Get top-k experience IDs by similarity
        top_ids = sorted(similarities, key=similarities.get, reverse=True)[:top_k]
        # Format in the library's "[ID]. text" style
        relevant = {exp_id: self.experiences[exp_id] for exp_id in top_ids}
        return "\n".join([f"[{k}]. {v}" for k, v in relevant.items()])
# Usage
manager = SemanticExperienceManager()
# ... add experiences ...
relevant_exp = manager.retrieve_relevant("Solve quadratic equation", top_k=3)
Experience Governance (DAO-style)¶
Enable community voting on experiences:
class GovernedExperienceManager(ExperienceManager):
def __init__(self, checkpoint_path=None):
super().__init__(checkpoint_path)
self.votes = {} # exp_id -> {"upvotes": int, "downvotes": int}
self.vote_threshold = 0.66 # 2/3 majority
def propose_experience(self, experience: str) -> str:
"""Propose experience for community vote."""
exp_id = f"P{self._next_id}" # Pending
self.votes[exp_id] = {"upvotes": 0, "downvotes": 0, "text": experience}
self._next_id += 1
return exp_id
def vote(self, exp_id: str, upvote: bool):
"""Cast vote on proposed experience."""
if upvote:
self.votes[exp_id]["upvotes"] += 1
else:
self.votes[exp_id]["downvotes"] += 1
def finalize_votes(self):
"""Accept experiences with 2/3 majority."""
for exp_id, vote_data in list(self.votes.items()):
total = vote_data["upvotes"] + vote_data["downvotes"]
if total == 0:
continue
approval_rate = vote_data["upvotes"] / total
if approval_rate >= self.vote_threshold:
# Accept experience
self.add(vote_data["text"])
print(f"✅ Accepted: {exp_id} ({approval_rate:.1%})")
else:
print(f"❌ Rejected: {exp_id} ({approval_rate:.1%})")
del self.votes[exp_id]
Limitations & Best Practices¶
Known Limitations¶
1. Base Model Capability Threshold
   - Works best with strong base models (7B+ parameters)
   - Weaker models struggle with introspection
   - Recommendation: Qwen3-4B+, DeepSeek-V3, GPT-4o-mini+
2. Context Window Constraints
   - Large experience libraries (200+) may exceed the context limit
   - Solution: Use embedding-based retrieval (top-k relevant)
3. Cost Scales with Group Size
   - G=8 requires 8x API calls per query
   - Trade-off: G=1 (cheap but ineffective) vs G=8 (expensive but better)
   - Sweet spot: G=5
4. Ground Truth Dependency
   - Best results with ground truth labels
   - Without ground truth: -2% to -5% performance
   - Can use self-discrimination via majority voting
5. Domain Specificity
   - Experiences are domain-specific; math experiences don't help coding tasks
   - Solution: Multi-domain libraries or domain tagging
Best Practices¶
1. Start Small, Scale Gradually¶
# Phase 1: Prototype (10 examples, 1 epoch)
# Goal: Validate setup, check experience quality
# Phase 2: Pilot (50 examples, 2 epochs)
# Goal: Measure performance improvement
# Phase 3: Production (100-500 examples, 3 epochs)
# Goal: Deploy and monitor
2. Monitor Experience Quality¶
def audit_experience_library(manager: ExperienceManager):
"""Check for common quality issues."""
issues = []
for exp_id, exp_text in manager.experiences.items():
# Check length
word_count = len(exp_text.split())
if word_count > 32:
issues.append(f"{exp_id}: Too long ({word_count} words)")
# Check if too specific (mentions numbers)
if any(char.isdigit() for char in exp_text):
issues.append(f"{exp_id}: Contains numbers (too specific)")
# Check if actionable (has verb)
if not any(word in exp_text.lower() for word in ["use", "check", "verify", "apply", "calculate"]):
issues.append(f"{exp_id}: No action verb (not actionable)")
return issues
# Run after each epoch
issues = audit_experience_library(experience_manager)
if issues:
print("⚠️ Quality Issues:")
for issue in issues:
print(f" - {issue}")
3. Use Checkpointing¶
import json
from datetime import datetime

# Save after every epoch
for epoch in range(num_epochs):
# ... training ...
# Checkpoint
checkpoint_path = f"{output_dir}/experiences_epoch{epoch+1}.json"
experience_manager.save(checkpoint_path)
# Also save metadata
metadata = {
"epoch": epoch + 1,
"num_experiences": len(experience_manager),
"timestamp": datetime.now().isoformat(),
"performance": evaluate_on_validation_set()
}
with open(f"{output_dir}/metadata_epoch{epoch+1}.json", 'w') as f:
json.dump(metadata, f, indent=2)
4. Combine with Traditional Methods¶
# Hybrid approach:
# 1. Fine-tune base model on large general dataset (one-time)
# 2. Use Continuous Learning for task-specific adaptation (ongoing)
# Step 1: Fine-tune (traditional)
gym train \
--model_name_or_path Qwen/Qwen3-4B \
--finetuning_type lora \
--dataset large_math_corpus \
--output_dir ./output/math_base
# Step 2: Continuous Learning (on top of fine-tuned model)
gym train \
--model_name_or_path ./output/math_base \
--adapter_name_or_path ./output/math_base/lora_adapter \
--config configs/continuous_learning.yaml \
--dataset specific_geometry_problems \
--output_dir ./output/geometry_adapted
5. Validate on Out-of-Domain Tasks¶
# Test generalization
def cross_domain_evaluation(experience_manager):
"""Test if experiences help on unseen domains."""
domains = ["math", "coding", "logic", "creative"]
results = {}
for domain in domains:
test_set = load_test_set(domain)
accuracy = evaluate_with_experiences(
test_set,
experience_manager.format_for_prompt()
)
results[domain] = accuracy
return results
# Run after training
results = cross_domain_evaluation(experience_manager)
print("Cross-Domain Performance:")
for domain, acc in results.items():
print(f" {domain}: {acc:.1%}")
Troubleshooting¶
Common Issues¶
Issue: Experience library not growing¶
Symptoms: len(experience_manager) stays at 0 or very small
Causes:
1. Group size too small (G=1)
2. Homogeneous rewards (all correct or all wrong)
3. LLM not returning valid JSON
Solutions:
# 1. Increase group size
grpo_group_size: 5 # or higher
# 2. Check reward distribution
rewards = [t.reward for t in trajectories]
print(f"Reward std: {np.std(rewards)}") # Should be > 0
# 3. Debug JSON parsing
response = extractor.extract_group_advantage(...)
if not response:
print("No operations extracted - check LLM output")
Issue: Performance not improving¶
Symptoms: Validation accuracy flat or decreasing
Causes:
1. Base model too weak
2. Experiences too specific (not generalizable)
3. Not enough epochs
Solutions:
# 1. Use stronger base model
model_name_or_path: "Qwen/Qwen3-7B-Instruct" # or DeepSeek-V3 via API
# 2. Audit experience quality (see Best Practices)
# 3. Increase epochs
num_train_epochs: 5 # instead of 3
Issue: API rate limits¶
Symptoms: RateLimitError from OpenAI/DeepSeek
Solutions:
import time
from openai import RateLimitError  # adjust if your provider raises a different exception

# Add retry logic with exponential backoff
def generate_with_retry(adapter, prompt, max_retries=3):
for i in range(max_retries):
try:
return adapter.generate(prompt)
except RateLimitError:
if i < max_retries - 1:
time.sleep(2 ** i) # Exponential backoff
else:
raise
# Or use batch processing with delays
for batch in batches:
process_batch(batch)
time.sleep(1) # 1 second between batches
Summary¶
Continuous Learning GRPO represents a paradigm shift in AI model improvement:
✅ Learn from dozens, not thousands of examples
✅ $18 vs $10,000+ cost reduction
✅ Human-readable experiences instead of black-box weights
✅ Minutes, not days of training time
✅ Cross-domain transfer maintained
✅ Verifiable model weights (frozen, cryptographically hashable)
Next Steps:
1. Read the Chat-to-Experience Tutorial
2. Read the Custom Agent Guide
3. Check the API Reference
4. Try the Quick Start
Join the Community:
- GitHub: github.com/zooai/gym
- Discord: discord.gg/zooai
- Twitter: @zoolabsfdn
Last Updated: October 28, 2025
Gym v0.9.4 - Democratizing AI Training
Zoo Labs Foundation Inc - 501(c)(3) Non-Profit