
Continuous Learning GRPO - Training-Free AI Model Improvement

Executive Summary

Continuous Learning GRPO enables AI models to improve through experience accumulation rather than parameter updates. Instead of traditional fine-tuning (which requires expensive GPU compute and thousands of examples), this approach learns from 50-100 examples by extracting and curating semantic experiences - human-readable insights that guide future problem-solving.

Key Results (from Tencent Research, arXiv:2510.08191):

- Cost: $18 vs $10,000+ for fine-tuning
- Performance: 82.7% on AIME24 (+2.7% over fine-tuning)
- Data efficiency: 50-100 examples vs 1000s
- Training time: Minutes vs hours/days
- Interpretability: Human-readable experiences vs black-box weights

Table of Contents

  1. What is Continuous Learning GRPO?
  2. Core Concept: Semantic Experiences
  3. Architecture Overview
  4. When to Use vs Traditional Fine-Tuning
  5. Three-Stage Learning Process
  6. Cost & Performance Analysis
  7. Integration with Gym
  8. Quick Start
  9. Advanced Usage
  10. Limitations & Best Practices

What is Continuous Learning GRPO?

Traditional AI model improvement requires updating billions of parameters through gradient descent - an expensive, opaque process. Continuous Learning GRPO takes a fundamentally different approach:

Traditional Fine-Tuning:
  Model weights → Gradient updates → Modified weights
  Cost: $10,000+ | Time: Hours/Days | Interpretability: None

Continuous Learning:
  Experiences → Semantic extraction → Updated experience library
  Cost: $18 | Time: Minutes | Interpretability: Full

The Innovation

Instead of changing what the model is (its weights), we change what the model sees (its context). The base model remains frozen (verifiable via cryptographic hash), while a curated library of experiences guides its reasoning.
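Because the weights never change, anyone can verify them by hashing the checkpoint files before and after a learning run. A minimal sketch of that check, assuming a directory of safetensors shards (the path and helper name are illustrative, not part of Gym's API):

import hashlib
from pathlib import Path

def hash_model_weights(model_dir: str) -> str:
    """Compute a single SHA-256 digest over all weight shards in a checkpoint directory."""
    digest = hashlib.sha256()
    for path in sorted(Path(model_dir).glob("*.safetensors")):  # deterministic order
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MB chunks
                digest.update(chunk)
    return digest.hexdigest()

# Hash before and after a Continuous Learning run; the digests should be identical.
before = hash_model_weights("./models/qwen3-4b")
# ... run continuous learning ...
after = hash_model_weights("./models/qwen3-4b")
assert before == after, "Base model weights changed unexpectedly"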

Analogy: Traditional fine-tuning is like brain surgery - modifying neural connections. Continuous Learning is like education - providing better examples and guidance.


Core Concept: Semantic Experiences

What is a Semantic Experience?

A semantic experience is a concise (≤32 words) natural language statement that captures a generalizable problem-solving strategy.

Examples from Mathematics Domain:

[G0]. When solving geometry problems with intersections, validate solutions
      lie within bounded regions or segments, not on extensions, to avoid
      extraneous answers.

[G1]. For expected extreme statistics in combinatorial problems, use direct
      enumeration for small sizes.

[G10]. When using mathematical invariants to prove impossibility, always
       validate them against known achievable states or small cases.

[G21]. For complex polynomials with real parameters, separate real and
       imaginary parts to find when real roots exist.

Characteristics of Good Experiences

  1. Strategic, not computational: "Check boundary conditions" not "Calculate derivative"
  2. Context-aware: Begins with "When [condition]..."
  3. Actionable: Provides clear guidance
  4. Generalizable: Applies to similar problem classes
  5. Concise: ≤32 words
  6. Domain-agnostic: Focuses on reasoning patterns

Traditional Numerical Advantages vs Semantic Advantages

| Aspect | Numerical (GRPO) | Semantic (Continuous Learning) |
|--------|------------------|--------------------------------|
| Format | Scalar: 0.73 | Text: "When solving equations..." |
| Interpretability | Opaque | Human-readable |
| Persistence | Lost after batch | Accumulated in library |
| Composability | Single value | Combinable experiences |
| Governance | Not applicable | DAO-votable |
| Auditability | None | Full trail |

Architecture Overview

System Components

┌─────────────────────────────────────────────────────────────┐
│                    Continuous Learning GRPO                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────────┐      ┌───────────────────────┐         │
│  │  User Query    │─────→│  Experience Manager   │         │
│  └────────────────┘      │  - Load experiences   │         │
│                          │  - Format for context │         │
│                          └───────────┬───────────┘         │
│                                      │                      │
│                                      ▼                      │
│                          ┌───────────────────────┐         │
│                          │   Base Model (Frozen) │         │
│  ┌────────────────┐     │  + Experience Context │         │
│  │ Generate G     │◀────┤   → Generate response │         │
│  │ Rollouts       │     └───────────────────────┘         │
│  └────────┬───────┘                                        │
│           │                                                 │
│           ▼                                                 │
│  ┌───────────────────────────────────────────┐            │
│  │  Semantic Extractor (3-Stage LLM Process) │            │
│  │  1. Summarize trajectories                │            │
│  │  2. Extract group advantages              │            │
│  │  3. Consolidate batch updates             │            │
│  └────────────────┬──────────────────────────┘            │
│                   │                                         │
│                   ▼                                         │
│  ┌────────────────────────────────┐                        │
│  │  Experience Library Update     │                        │
│  │  - Add new experiences         │                        │
│  │  - Modify existing ones        │                        │
│  │  - Delete obsolete entries     │                        │
│  │  - Merge similar experiences   │                        │
│  └────────────────────────────────┘                        │
│                                                              │
│  [Model weights never change - verifiable via hash]        │
└─────────────────────────────────────────────────────────────┘

Key Modules

1. ExperienceManager (src/gym/train/grpo/experience_manager.py)

Purpose: Manages the experience library E - the core knowledge base.

Operations:

- add(experience): Add new experience
- modify(exp_id, new_text): Update existing experience
- delete(exp_id): Remove obsolete experience
- merge(exp_ids, merged_text): Combine similar experiences
- format_for_prompt(): Convert library to context string
- save(path) / load(path): Persistence

Data Structure:

{
  "experiences": {
    "G0": "When solving equations, verify by substitution...",
    "G1": "For optimization, check boundary conditions...",
    ...
  },
  "next_id": 42
}
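The actual implementation lives in src/gym/train/grpo/experience_manager.py. As an illustration of the interface listed above, here is a minimal sketch (method names follow the list; the internals are assumptions, not Gym's exact code):

import json

class ExperienceManagerSketch:
    """Minimal illustration of the experience-library interface described above."""

    def __init__(self, checkpoint_path=None):
        self.experiences = {}  # exp_id -> text, e.g. "G0" -> "When solving equations..."
        self.next_id = 0
        if checkpoint_path:
            self.load(checkpoint_path)

    def add(self, experience: str) -> str:
        exp_id = f"G{self.next_id}"
        self.experiences[exp_id] = experience
        self.next_id += 1
        return exp_id

    def modify(self, exp_id: str, new_text: str) -> None:
        self.experiences[exp_id] = new_text

    def delete(self, exp_id: str) -> None:
        self.experiences.pop(exp_id, None)

    def merge(self, exp_ids, merged_text: str) -> str:
        for exp_id in exp_ids:
            self.delete(exp_id)
        return self.add(merged_text)

    def format_for_prompt(self) -> str:
        # Matches the "[G0]. text" layout shown in the examples above
        return "\n".join(f"[{k}]. {v}" for k, v in self.experiences.items())

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"experiences": self.experiences, "next_id": self.next_id}, f, indent=2)

    def load(self, path: str) -> None:
        with open(path) as f:
            data = json.load(f)
        self.experiences = data.get("experiences", {})
        self.next_id = data.get("next_id", 0)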

2. SemanticExtractor (src/gym/train/grpo/semantic_extractor.py)

Purpose: Extracts semantic advantages using 3-stage LLM process.

Stage 1 - Trajectory Summarization:

def summarize_trajectory(trajectory, use_groundtruth=True):
    """
    Input: Full trajectory, correctness label, ground truth
    Output: Step-by-step natural language summary

    Example:
      1. Applied quadratic formula with a=1, b=2, c=5
      2. Calculated discriminant: b²-4ac = 4-20 = -16
      3. Concluded no real solutions (correct)
    """

Stage 2 - Group Advantage Extraction:

def extract_group_advantage(trajectories, experiences):
    """
    Input: G trajectories (both correct/incorrect), current experiences
    Output: JSON operations [{"option": "add", "experience": "..."}, ...]

    Process:
      - Compare successful vs failed trajectories
      - Identify what made successful ones succeed
      - Propose max 3 operations: add/modify/delete
    """

Stage 3 - Batch Consolidation:

def consolidate_batch(all_group_operations, experiences):
    """
    Input: All group operations from batch, current experiences
    Output: Final consolidated operations

    Process:
      - Merge similar suggestions
      - Ensure ≤32 words per experience
      - Eliminate redundancy
    """

3. APIModelAdapter (src/gym/train/grpo/api_model_adapter.py)

Purpose: Enable using cloud-hosted models (DeepSeek, OpenAI) instead of local GPUs.

Benefits:

- No GPU required
- Faster inference (optimized infrastructure)
- Better base models (DeepSeek-V3, GPT-4o)
- Lower total cost (pay per use)

Usage:

from gym.train.grpo.api_model_adapter import DeepSeekAdapter

adapter = DeepSeekAdapter(api_key="sk-xxx", model="deepseek-chat")
response = adapter.generate_with_experiences(
    query="Solve: x² + 2x + 5 = 0",
    experiences=experience_library.format_for_prompt()
)
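Under the hood, such an adapter can be implemented against DeepSeek's OpenAI-compatible endpoint. A hedged sketch (the system-prompt layout is an assumption, not necessarily what Gym's adapter does):

from openai import OpenAI

class DeepSeekAdapterSketch:
    """Illustrative adapter that prepends the experience library to every query."""

    def __init__(self, api_key: str, model: str = "deepseek-chat"):
        # DeepSeek exposes an OpenAI-compatible API
        self.client = OpenAI(api_key=api_key, base_url="https://api.deepseek.com")
        self.model = model

    def generate_with_experiences(self, query: str, experiences: str) -> str:
        system_prompt = (
            "You may draw on the following experiences when solving the problem:\n"
            + experiences
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
        )
        return response.choices[0].message.content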


When to Use vs Traditional Fine-Tuning

Continuous Learning is Better When:

Limited training data (50-1000 examples)
  - Traditional methods need 10K+ examples
  - Continuous Learning learns from dozens

Tight budget ($18 vs $10,000+)
  - No GPU infrastructure needed
  - Pay only for API calls

Fast iteration (minutes vs days)
  - Immediate experimentation
  - No waiting for training convergence

Interpretability required
  - Experiences are human-readable
  - Can audit why the model made decisions
  - Stakeholders can review/vote on experiences

Domain adaptation (medical, legal, specialized)
  - Transfer to new domains without retraining
  - Domain experts can write experiences directly
  - Cross-domain performance maintained

Governance/compliance
  - Need an audit trail
  - Community-driven model evolution
  - Verifiable model weights (frozen)

Traditional Fine-Tuning is Better When:

Massive training data (100K+ examples)
  - Fine-tuning scales better with data
  - Parameter updates can capture complex patterns

Latency-critical inference
  - Fine-tuned models have no context overhead
  - Continuous Learning adds ~500 tokens of context

Extremely specialized tasks
  - Deep domain knowledge encoded in weights
  - When semantic experiences can't capture the nuance

One-time deployment (static model)
  - No need for continuous updates
  - Fixed deployment constraints

Hybrid Approach: Best of Both Worlds

You can combine both:

  1. Base model: Fine-tuned on a large general dataset
  2. Continuous Learning: Adapt to specific use cases
  3. Result: General competence + specific expertise


Three-Stage Learning Process

Overview

Input: G rollouts per query, ground truth labels
Output: Updated experience library E

For each epoch:
  For each batch:
    For each query in batch:
      Stage 1: Summarize each of G trajectories
      Stage 2: Extract group advantage (max 3 operations)
    Stage 3: Consolidate all group operations
    Apply consolidated operations to E

Stage 1: Trajectory Summarization (Figure 11 in Paper)

Goal: Analyze what happened in each rollout step-by-step.

Prompt Template:

An agent system may be provided with some experiences, and then it produces
the following trajectory to solve the given problem. Please summarize the
trajectory step-by-step:

1. For each step, describe what action is being taken, and which experience
   has been used in this step.
2. Given the grading of this rollout and the correct answer, identify and
   explain any steps that represent detours, errors, or backtracking.
3. Maintain all the core outcome of each step.

<trajectory>{full_output}</trajectory>
<evaluation>{correct/wrong}</evaluation>
<groundtruth>{answer}</groundtruth>

Only return the trajectory summary of each step.

Example Output:

1. Applied quadratic formula with a=1, b=2, c=5 (Used experience G21)
2. Calculated discriminant: b²-4ac = 4-20 = -16
3. Correctly concluded no real solutions exist (discriminant < 0)
4. Verified answer matches ground truth

Why This Stage Matters:

- Reduces trajectory length (models produce verbose reasoning)
- Identifies which experiences were actually used
- Highlights errors/detours for learning
- Provides clean input for Stage 2
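Filling the template is mechanical: substitute each rollout, its grading, and the ground truth, then call the extraction LLM once per rollout. A brief sketch, assuming the template above is stored with its {full_output}, {correct/wrong}, and {answer} placeholders and that llm_client.complete(prompt) returns plain text (the helper name and trajectory attributes are assumptions):

def summarize_rollouts(llm_client, template: str, trajectories) -> list[str]:
    """Produce one Stage 1 summary per rollout."""
    summaries = []
    for traj in trajectories:
        prompt = (
            template
            .replace("{full_output}", traj.output)
            .replace("{correct/wrong}", "correct" if traj.reward == 1.0 else "wrong")
            .replace("{answer}", traj.groundtruth)
        )
        summaries.append(llm_client.complete(prompt))
    return summaries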

Stage 2: Group Advantage Extraction (Figure 12 in Paper)

Goal: Compare G trajectories to identify what made successful ones succeed.

Prompt Template:

An agent system is provided with a set of experiences and has tried to solve
the problem multiple times with both successful and wrong solutions. Review
these problem-solving attempts and extract generalizable experiences.

1. Trajectory Analysis:
   - For successful steps: Identify key correct decisions
   - For errors: Pinpoint where/why reasoning went wrong
   - Note patterns or strategies used/missed

2. Update Existing Experiences:
   - Options: [modify, add, delete]
   - Max 3 operations per group
   - Requirements: Begin with context, focus on strategic patterns

<problem>{query}</problem>
<trajectories>
Attempt 1 (correct): {summary_1}
Attempt 2 (wrong): {summary_2}
...
</trajectories>
<groundtruth>{answer}</groundtruth>
<experience>{current_library}</experience>

Return JSON: [{"option": "add", "experience": "..."}, ...]

Example Output:

[
  {
    "option": "add",
    "experience": "For quadratic equations, always check discriminant sign before attempting to find roots."
  },
  {
    "option": "modify",
    "experience": "When solving polynomial equations, verify solution count matches degree minus multiplicity.",
    "modified_from": "G17"
  }
]

Why This Stage Matters:

- Extracts why one approach succeeded over another
- Focuses on decision points (not calculations)
- Proposes concrete updates to the experience library
- Limits to 3 operations to maintain quality
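Before these operations can be applied, the LLM's reply has to be parsed back into a list of add/modify/delete entries; malformed JSON is a common failure mode (see Troubleshooting). A defensive parsing sketch (an assumed helper, not Gym's exact code):

import json
import re

def parse_operations(llm_response: str, max_operations: int = 3) -> list[dict]:
    """Pull the JSON operation list out of an LLM reply, tolerating extra prose or code fences."""
    match = re.search(r"\[.*\]", llm_response, re.DOTALL)
    if not match:
        return []  # No JSON array found; skip this group rather than crash
    try:
        operations = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries and cap at max_operations
    valid = [
        op for op in operations
        if isinstance(op, dict) and op.get("option") in {"add", "modify", "delete"}
    ]
    return valid[:max_operations]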

Stage 3: Batch Consolidation (Figure 13 in Paper)

Goal: Merge all group operations into final, non-redundant updates.

Prompt Template:

An agent system has tried to solve multiple problems. From the reflections,
some suggestions on experiences have been posed. Consolidate these into
final experience revisions.

Requirements:
1. Clear, generalizable, ≤32 words each
2. Focus on strategic thinking (not calculations)
3. Avoid duplication between experiences

<experience>{current_library}</experience>
<suggested_updates>
[
  {from group 1 operations},
  {from group 2 operations},
  ...
]
</suggested_updates>

Options: [modify, merge, delete]
Return JSON with final operations.

Example Output:

[
  {
    "option": "merge",
    "experience": "For polynomial equations, verify discriminant before solving and ensure solution count matches theoretical maximum.",
    "merged_from": ["G3", "G17", "S4"]
  },
  {
    "option": "modify",
    "experience": "When encountering complex numbers, separate real and imaginary parts early in the solution process.",
    "modified_from": "G21"
  }
]

Why This Stage Matters:

- Prevents experience library explosion
- Ensures consistency across the batch
- Eliminates near-duplicates
- Enforces quality standards (≤32 words)


Cost & Performance Analysis

Cost Breakdown (100 training samples, 3 epochs)

| Component | API Calls | Tokens (avg) | Cost per Call | Total Cost |
|-----------|-----------|--------------|---------------|------------|
| Rollout Generation (G=5) | 1,500 | 2,000 | $0.002 | $3.00 |
| Stage 1: Summarization | 1,500 | 1,500 | $0.002 | $3.00 |
| Stage 2: Group Extraction | 300 | 4,000 | $0.004 | $1.20 |
| Stage 3: Batch Consolidation | 30 | 6,000 | $0.006 | $0.18 |
| Evaluation | 300 | 1,000 | $0.001 | $0.30 |
| Total | 3,630 | - | - | $7.68 |

Actual costs vary by provider:

- DeepSeek: $0.14/M input, $0.28/M output → $8-12 total
- OpenAI (GPT-4o-mini): $0.15/M input, $0.60/M output → $12-18 total
- OpenAI (GPT-4o): $2.50/M input, $10/M output → $60-100 total

Comparison to Traditional Fine-Tuning:

- LoRA (7B model): $500-1,000 (GPU hours + setup)
- QLoRA (32B model): $2,000-5,000 (memory-efficient)
- Full Fine-Tuning (70B): $10,000-50,000 (days of GPU time)
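The per-stage totals in the cost-breakdown table are just API calls multiplied by cost per call; a quick arithmetic check (the per-call figures are the table's assumptions, not provider list prices):

# Rows mirror the cost-breakdown table: (component, api_calls, cost_per_call)
rows = [
    ("Rollout Generation (G=5)", 1500, 0.002),
    ("Stage 1: Summarization", 1500, 0.002),
    ("Stage 2: Group Extraction", 300, 0.004),
    ("Stage 3: Batch Consolidation", 30, 0.006),
    ("Evaluation", 300, 0.001),
]
total_calls = sum(calls for _, calls, _ in rows)
total_cost = sum(calls * cost for _, calls, cost in rows)
print(f"Total API calls: {total_calls}")   # 3630
print(f"Total cost: ${total_cost:.2f}")    # $7.68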

Performance Metrics (from Paper)

AIME Math Competition (100 training samples):

- Baseline (no training): 67.9% (AIME25), 80.0% (AIME24)
- After Continuous Learning: 73.3% (+5.4%), 82.7% (+2.7%)
- vs Traditional Fine-Tuning: +2-5% improvement at 1000x lower cost

Cross-Domain Generalization:

| Method | AIME24 (Math) | WebWalker (Web) |
|--------|---------------|-----------------|
| ReTool (math-tuned) | 67.0% | 18.3% ❌ |
| MiroThinker (web-tuned) | 43.5% ❌ | 53.6% |
| Continuous Learning | 82.7% ✅ | 67.8% ✅ |

Key Insight: Frozen base model + domain experiences outperforms specialized fine-tuning on both domains.

Time Comparison

| Stage | Time (100 samples) |
|-------|--------------------|
| Rollout generation | 30-45 min |
| Summarization | 20-30 min |
| Group extraction | 10-15 min |
| Batch consolidation | 2-3 min |
| Total (1 epoch) | ~1.5 hours |
| 3 epochs | ~4-5 hours |

vs Fine-Tuning:

- LoRA (7B): 4-8 hours
- Full (32B): 24-72 hours
- Full (70B): 3-7 days


Integration with Gym

Installation

# Install Gym with Continuous Learning dependencies
pip install zoo-gym[grpo]

# Or install from source
git clone https://github.com/zooai/gym.git
cd gym
pip install -e ".[grpo]"

Additional dependencies:

pip install openai  # For API model adapters

Configuration

Create a YAML config file (configs/continuous_learning.yaml):

# Base model configuration
model_name_or_path: "Qwen/Qwen3-4B-Instruct"  # Or use API adapter
template: "qwen3"

# Continuous Learning GRPO parameters
training_free_grpo: true
grpo_group_size: 5  # Number of rollouts per query
grpo_use_groundtruth: true  # Use ground truth for advantage extraction
grpo_beta: 0.01  # KL penalty coefficient (unused in training-free)
grpo_experience_lib_path: "./output/experiences"

# API Model Adapter (optional - for cloud models)
use_api_model: true
api_provider: "deepseek"  # or "openai"
api_key: "${DEEPSEEK_API_KEY}"  # Use environment variable
api_model: "deepseek-chat"

# LLM for semantic extraction
llm_api_provider: "deepseek"
llm_api_key: "${DEEPSEEK_API_KEY}"
llm_model: "deepseek-chat"

# Dataset
dataset: "alpaca_en_demo"  # Your custom dataset
dataset_dir: "./data"

# Training
output_dir: "./output/continuous_learning"
num_train_epochs: 3
per_device_train_batch_size: 8
learning_rate: 0  # No gradient updates
logging_steps: 10
save_steps: 100

Python API Usage

from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient
from gym.train.grpo.api_model_adapter import DeepSeekAdapter

# 1. Initialize components
experience_manager = ExperienceManager(checkpoint_path="./experiences.json")
llm_client = LLMClient(api_key="sk-xxx", model="deepseek-chat")
extractor = SemanticExtractor(llm_client, max_operations=3)
model_adapter = DeepSeekAdapter(api_key="sk-xxx")

# 2. Training loop
queries = [
    "Solve: x² + 2x + 5 = 0",
    "Find derivative of x³ + 2x",
    ...
]
groundtruths = ["No real solutions", "3x² + 2", ...]

for epoch in range(3):
    print(f"\n=== Epoch {epoch+1}/3 ===")

    for query, gt in zip(queries, groundtruths):
        # Generate G rollouts
        experiences = experience_manager.format_for_prompt()
        trajectories = []

        for _ in range(5):  # G=5
            output = model_adapter.generate_with_experiences(query, experiences)
            reward = evaluate_answer(output, gt)  # Your reward function
            trajectories.append(Trajectory(query, output, reward, gt))

        # Stage 1: Summarize
        for traj in trajectories:
            traj.summary = extractor.summarize_trajectory(traj)

        # Stage 2: Extract group advantage
        operations = extractor.extract_group_advantage(
            trajectories, experiences, use_groundtruth=True
        )

        # Apply operations
        experience_manager.apply_operations(operations)

    # Save checkpoint
    experience_manager.save(f"./experiences_epoch{epoch+1}.json")
    print(f"Experience library size: {len(experience_manager)}")

# 3. Final evaluation
print("\n=== Final Experience Library ===")
print(experience_manager.format_for_prompt())

CLI Usage

# Quick start with example config
gym train \
  --config configs/continuous_learning.yaml \
  --dataset alpaca_en_demo

# With custom dataset
gym train \
  --config configs/continuous_learning.yaml \
  --dataset custom_math_problems \
  --dataset_dir ./my_data

# Resume from checkpoint
gym train \
  --config configs/continuous_learning.yaml \
  --resume_from_checkpoint ./output/continuous_learning/checkpoint-100

Advanced Usage

Custom Reward Functions

By default, Gym uses simple correctness (0/1). You can provide custom rewards:

def custom_reward_function(query: str, output: str, groundtruth: str) -> float:
    """
    Custom reward that considers partial correctness.

    Returns:
        reward: Float in [0, 1], higher is better
    """
    # Exact match
    if output.strip().lower() == groundtruth.strip().lower():
        return 1.0

    # Partial credit for methodology
    if "discriminant" in output and "quadratic" in query:
        return 0.5

    # Wrong answer
    return 0.0

# Use in training
reward = custom_reward_function(query, output, gt)

Multi-Domain Experiences

Organize experiences by domain for better retrieval:

class DomainAwareExperienceManager(ExperienceManager):
    def __init__(self, checkpoint_path=None):
        super().__init__(checkpoint_path)
        self.domains = {}  # exp_id -> domain

    def add_with_domain(self, experience: str, domain: str) -> str:
        exp_id = self.add(experience)
        self.domains[exp_id] = domain
        return exp_id

    def format_for_domain(self, domain: str) -> str:
        """Return only experiences for specific domain."""
        filtered = {
            exp_id: text
            for exp_id, text in self.experiences.items()
            if self.domains.get(exp_id) == domain
        }
        return "\n".join([f"[{k}]. {v}" for k, v in filtered.items()])

# Usage
manager = DomainAwareExperienceManager()
manager.add_with_domain("For calculus problems...", domain="math.calculus")
manager.add_with_domain("When parsing JSON...", domain="coding.json")

# Retrieve domain-specific experiences
math_exp = manager.format_for_domain("math.calculus")

Embedding-Based Retrieval

For large experience libraries (100+ experiences), use semantic search:

from sentence_transformers import SentenceTransformer, util

class SemanticExperienceManager(ExperienceManager):
    def __init__(self, checkpoint_path=None):
        super().__init__(checkpoint_path)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = {}  # exp_id -> embedding

    def add(self, experience: str) -> str:
        exp_id = super().add(experience)
        self.embeddings[exp_id] = self.embedder.encode(experience)
        return exp_id

    def retrieve_relevant(self, query: str, top_k: int = 5) -> str:
        """Retrieve top-k most relevant experiences."""
        query_emb = self.embedder.encode(query)

        # Compute similarities
        similarities = {
            exp_id: float(util.cos_sim(query_emb, emb))
            for exp_id, emb in self.embeddings.items()
        }

        # Get top-k
        top_ids = sorted(similarities, key=similarities.get, reverse=True)[:top_k]

        # Format
        relevant = {exp_id: self.experiences[exp_id] for exp_id in top_ids}
        return "\n".join([f"[{k}]. {v}" for k, v in relevant.items()])

# Usage
manager = SemanticExperienceManager()
# ... add experiences ...
relevant_exp = manager.retrieve_relevant("Solve quadratic equation", top_k=3)

Experience Governance (DAO-style)

Enable community voting on experiences:

class GovernedExperienceManager(ExperienceManager):
    def __init__(self, checkpoint_path=None):
        super().__init__(checkpoint_path)
        self.votes = {}  # exp_id -> {"upvotes": int, "downvotes": int}
        self.vote_threshold = 0.66  # 2/3 majority

    def propose_experience(self, experience: str) -> str:
        """Propose experience for community vote."""
        exp_id = f"P{self._next_id}"  # Pending
        self.votes[exp_id] = {"upvotes": 0, "downvotes": 0, "text": experience}
        self._next_id += 1
        return exp_id

    def vote(self, exp_id: str, upvote: bool):
        """Cast vote on proposed experience."""
        if upvote:
            self.votes[exp_id]["upvotes"] += 1
        else:
            self.votes[exp_id]["downvotes"] += 1

    def finalize_votes(self):
        """Accept experiences with 2/3 majority."""
        for exp_id, vote_data in list(self.votes.items()):
            total = vote_data["upvotes"] + vote_data["downvotes"]
            if total == 0:
                continue

            approval_rate = vote_data["upvotes"] / total

            if approval_rate >= self.vote_threshold:
                # Accept experience
                self.add(vote_data["text"])
                print(f"✅ Accepted: {exp_id} ({approval_rate:.1%})")
            else:
                print(f"❌ Rejected: {exp_id} ({approval_rate:.1%})")

            del self.votes[exp_id]

Limitations & Best Practices

Known Limitations

  1. Base Model Capability Threshold
     - Works best with strong base models (7B+ parameters)
     - Weaker models struggle with introspection
     - Recommendation: Qwen3-4B+, DeepSeek-V3, GPT-4o-mini+

  2. Context Window Constraints
     - Large experience libraries (200+ entries) may exceed the context limit
     - Solution: Use embedding-based retrieval (top-k relevant)

  3. Cost Scales with Group Size
     - G=8 requires 8x API calls per query
     - Trade-off: G=1 (cheap but ineffective) vs G=8 (expensive but better)
     - Sweet spot: G=5

  4. Ground Truth Dependency
     - Best results with ground truth labels
     - Without GT: -2% to -5% performance
     - Can use self-discrimination via majority voting (see the sketch after this list)

  5. Domain Specificity
     - Experiences are domain-specific
     - Math experiences don't help coding tasks
     - Solution: Multi-domain libraries or domain tagging
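When ground truth labels are unavailable, a simple self-discrimination scheme treats the most common final answer across the G rollouts as a pseudo-label and grades each rollout against it. A hedged sketch of that idea (the answer-extraction helper and trajectory attributes are assumptions):

from collections import Counter

def pseudo_label_by_majority(trajectories, extract_final_answer) -> str:
    """Use the most common final answer across G rollouts as a pseudo ground truth."""
    answers = [extract_final_answer(t.output) for t in trajectories]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    for t in trajectories:
        # Rollouts that agree with the majority count as "correct" for advantage extraction
        t.reward = 1.0 if extract_final_answer(t.output) == majority_answer else 0.0
    return majority_answer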

Best Practices

1. Start Small, Scale Gradually

# Phase 1: Prototype (10 examples, 1 epoch)
# Goal: Validate setup, check experience quality

# Phase 2: Pilot (50 examples, 2 epochs)
# Goal: Measure performance improvement

# Phase 3: Production (100-500 examples, 3 epochs)
# Goal: Deploy and monitor

2. Monitor Experience Quality

def audit_experience_library(manager: ExperienceManager):
    """Check for common quality issues."""
    issues = []

    for exp_id, exp_text in manager.experiences.items():
        # Check length
        word_count = len(exp_text.split())
        if word_count > 32:
            issues.append(f"{exp_id}: Too long ({word_count} words)")

        # Check if too specific (mentions numbers)
        if any(char.isdigit() for char in exp_text):
            issues.append(f"{exp_id}: Contains numbers (too specific)")

        # Check if actionable (has verb)
        if not any(word in exp_text.lower() for word in ["use", "check", "verify", "apply", "calculate"]):
            issues.append(f"{exp_id}: No action verb (not actionable)")

    return issues

# Run after each epoch
issues = audit_experience_library(experience_manager)
if issues:
    print("⚠️  Quality Issues:")
    for issue in issues:
        print(f"  - {issue}")

3. Use Checkpointing

import json
from datetime import datetime

# Save after every epoch
for epoch in range(num_epochs):
    # ... training ...

    # Checkpoint
    checkpoint_path = f"{output_dir}/experiences_epoch{epoch+1}.json"
    experience_manager.save(checkpoint_path)

    # Also save metadata
    metadata = {
        "epoch": epoch + 1,
        "num_experiences": len(experience_manager),
        "timestamp": datetime.now().isoformat(),
        "performance": evaluate_on_validation_set()
    }
    with open(f"{output_dir}/metadata_epoch{epoch+1}.json", 'w') as f:
        json.dump(metadata, f, indent=2)

4. Combine with Traditional Methods

# Hybrid approach:
# 1. Fine-tune base model on large general dataset (one-time)
# 2. Use Continuous Learning for task-specific adaptation (ongoing)

# Step 1: Fine-tune (traditional)
gym train \
  --model_name_or_path Qwen/Qwen3-4B \
  --finetuning_type lora \
  --dataset large_math_corpus \
  --output_dir ./output/math_base

# Step 2: Continuous Learning (on top of fine-tuned model)
gym train \
  --model_name_or_path ./output/math_base \
  --adapter_name_or_path ./output/math_base/lora_adapter \
  --config configs/continuous_learning.yaml \
  --dataset specific_geometry_problems \
  --output_dir ./output/geometry_adapted

5. Validate on Out-of-Domain Tasks

# Test generalization
def cross_domain_evaluation(experience_manager):
    """Test if experiences help on unseen domains."""
    domains = ["math", "coding", "logic", "creative"]
    results = {}

    for domain in domains:
        test_set = load_test_set(domain)
        accuracy = evaluate_with_experiences(
            test_set,
            experience_manager.format_for_prompt()
        )
        results[domain] = accuracy

    return results

# Run after training
results = cross_domain_evaluation(experience_manager)
print("Cross-Domain Performance:")
for domain, acc in results.items():
    print(f"  {domain}: {acc:.1%}")

Troubleshooting

Common Issues

Issue: Experience library not growing

Symptoms: len(experience_manager) stays at 0 or very small

Causes:

1. Group size too small (G=1)
2. Homogeneous rewards (all correct or all wrong)
3. LLM not returning valid JSON

Solutions:

# 1. Increase group size
grpo_group_size: 5  # or higher

# 2. Check reward distribution
import numpy as np

rewards = [t.reward for t in trajectories]
print(f"Reward std: {np.std(rewards)}")  # Should be > 0

# 3. Debug JSON parsing
response = extractor.extract_group_advantage(...)
if not response:
    print("No operations extracted - check LLM output")

Issue: Performance not improving

Symptoms: Validation accuracy flat or decreasing

Causes:

1. Base model too weak
2. Experiences too specific (not generalizable)
3. Not enough epochs

Solutions:

# 1. Use stronger base model
model_name_or_path: "Qwen/Qwen3-7B-Instruct"  # or DeepSeek-V3 via API

# 2. Audit experience quality (see Best Practices)

# 3. Increase epochs
num_train_epochs: 5  # instead of 3

Issue: API rate limits

Symptoms: RateLimitError from OpenAI/DeepSeek

Solutions:

import time

from openai import RateLimitError

# Add retry logic
def generate_with_retry(adapter, prompt, max_retries=3):
    for i in range(max_retries):
        try:
            return adapter.generate(prompt)
        except RateLimitError:
            if i < max_retries - 1:
                time.sleep(2 ** i)  # Exponential backoff
            else:
                raise

# Or use batch processing with delays
for batch in batches:
    process_batch(batch)
    time.sleep(1)  # 1 second between batches


Summary

Continuous Learning GRPO represents a paradigm shift in AI model improvement:

✅ Learn from dozens, not thousands of examples
✅ $18 vs $10,000 cost reduction
✅ Human-readable experiences instead of black-box weights
✅ Minutes, not days of training time
✅ Cross-domain transfer maintained
✅ Verifiable model weights (frozen, cryptographically hashable)

Next Steps:

1. Read the Chat-to-Experience Tutorial
2. Read the Custom Agent Guide
3. Check the API Reference
4. Try the Quick Start

Join the Community:

- GitHub: github.com/zooai/gym
- Discord: discord.gg/zooai
- Twitter: @zoolabsfdn


Last Updated: October 28, 2025
Gym v0.9.4 - Democratizing AI Training
Zoo Labs Foundation Inc - 501(c)(3) Non-Profit