
Continuous Learning GRPO Implementation Guide

Based on Tencent youtu-agent Paper (arXiv:2510.08191v1)

This document provides exact implementation specifications for Continuous Learning GRPO in zoo/gym.


1. Algorithm Overview

Vanilla GRPO vs Continuous Learning GRPO

Vanilla GRPO (Current zoo/gym):

# For each query q:
outputs = [π_θ(o_i|q) for i in range(G)]  # Generate G outputs
rewards = [R(q, o_i) for o_i in outputs]   # Score each
advantages = [(r_i - mean(r))/std(r) for r_i in rewards]  # Numerical advantage
# Update parameters θ via gradient ascent on PPO objective

Continuous Learning GRPO (Target):

# Initialize: E = {} (experience library)
# For each query q:
outputs = [π_θ(o_i|q, E) for i in range(G)]  # Inject experiences
rewards = [R(q, o_i) for o_i in outputs]      # Score each

# Only for groups with std(rewards) > 0:
summaries = [LLM.summarize(q, o_i, r_i, groundtruth) for o_i, r_i in zip(outputs, rewards)]
semantic_advantage = LLM.extract_insights(q, summaries, E)

# Batch update (after all queries in epoch):
operations = LLM.consolidate([all semantic_advantages], E)
E = apply_operations(E, operations)  # Add/Delete/Modify experiences

# Next epoch uses updated E
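To make the two advantage notions concrete, here is a small self-contained sketch (plain Python, standard library only, toy rewards invented for illustration). It computes the numerical group-relative advantage used by vanilla GRPO and shows why groups with zero reward variance carry no relative signal, which is also why the semantic stage above only runs when std(rewards) > 0.

# Toy illustration of group-relative advantages; not part of zoo/gym.
from statistics import mean, pstdev

def group_advantages(rewards):
    """(r_i - mean) / std within one group; a flat group yields no signal."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # Homogeneous group: all rollouts are equally good or bad, so there is
        # nothing to learn relative to the group. Both variants skip it.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

print(group_advantages([1.0, 0.0, 0.0, 1.0, 0.0]))  # mixed outcomes -> useful signal
print(group_advantages([1.0, 1.0, 1.0, 1.0, 1.0]))  # all correct -> skipped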


2. Core Components

2.1 Experience Manager

File: src/gym/train/grpo/experience_manager.py

from typing import Dict, List, Literal, Optional
import json
from pathlib import Path

Operation = Literal["add", "delete", "modify", "keep"]

class ExperienceManager:
    """Manages the experience library E for Continuous Learning GRPO."""

    def __init__(self, checkpoint_path: Optional[str] = None):
        """
        Args:
            checkpoint_path: Path to load/save experiences
        """
        self.experiences: Dict[str, str] = {}
        self._next_id = 0
        self.checkpoint_path = checkpoint_path

        if checkpoint_path and Path(checkpoint_path).exists():
            self.load(checkpoint_path)

    def add(self, experience: str) -> str:
        """Add new experience, return assigned ID."""
        exp_id = f"G{self._next_id}"
        self.experiences[exp_id] = experience
        self._next_id += 1
        return exp_id

    def delete(self, exp_id: str) -> bool:
        """Delete experience by ID."""
        if exp_id in self.experiences:
            del self.experiences[exp_id]
            return True
        return False

    def modify(self, exp_id: str, new_experience: str) -> bool:
        """Modify existing experience."""
        if exp_id in self.experiences:
            self.experiences[exp_id] = new_experience
            return True
        return False

    def merge(self, exp_ids: List[str], merged_experience: str) -> str:
        """Merge multiple experiences into one."""
        # Delete old experiences
        for exp_id in exp_ids:
            self.delete(exp_id)
        # Add merged experience
        return self.add(merged_experience)

    def apply_operations(self, operations: List[Dict]) -> None:
        """
        Apply batch of operations from LLM.

        Args:
            operations: List of dicts with keys: "option", "experience", etc.
                Example: [
                    {"option": "add", "experience": "When solving..."},
                    {"option": "modify", "experience": "...", "modified_from": "G17"},
                    {"option": "delete", "delete_id": "G5"},
                    {"option": "merge", "experience": "...", "merged_from": ["G1", "G3"]}
                ]
        """
        for op in operations:
            option = op.get("option", "keep")

            if option == "add":
                self.add(op["experience"])

            elif option == "delete":
                self.delete(op["delete_id"])

            elif option == "modify":
                self.modify(op["modified_from"], op["experience"])

            elif option == "merge":
                self.merge(op["merged_from"], op["experience"])

    def format_for_prompt(self) -> str:
        """Format experiences for injection into prompts."""
        if not self.experiences:
            return "None"

        formatted = []
        for exp_id, exp_text in self.experiences.items():
            formatted.append(f"[{exp_id}]. {exp_text}")

        return "\n".join(formatted)

    def save(self, path: str) -> None:
        """Save experiences to JSON."""
        with open(path, 'w') as f:
            json.dump({
                "experiences": self.experiences,
                "next_id": self._next_id
            }, f, indent=2)

    def load(self, path: str) -> None:
        """Load experiences from JSON."""
        with open(path) as f:
            data = json.load(f)
            self.experiences = data["experiences"]
            self._next_id = data["next_id"]

    def __len__(self) -> int:
        return len(self.experiences)
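A quick usage sketch of the class above (IDs follow the G<n> scheme; the experience strings and file path are made up for illustration):

# Usage sketch for ExperienceManager; experience texts and path are illustrative only.
manager = ExperienceManager()

g0 = manager.add("For multi-step algebra, verify each intermediate result before substituting.")
manager.add("When a geometry problem mentions tangency, consider the radius-tangent right angle.")

manager.apply_operations([
    {"option": "modify", "modified_from": g0,
     "experience": "For multi-step algebra, re-derive key intermediate results once before substituting."},
    {"option": "add", "experience": "For counting problems, check small cases before generalizing."},
])

print(manager.format_for_prompt())
# [G0]. For multi-step algebra, re-derive key intermediate results once before substituting.
# [G1]. When a geometry problem mentions tangency, consider the radius-tangent right angle.
# [G2]. For counting problems, check small cases before generalizing.

manager.save("experiences.json")  # later: ExperienceManager(checkpoint_path="experiences.json")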

2.2 Semantic Extractor

File: src/gym/train/grpo/semantic_extractor.py

import json
import re
from typing import List, Dict, Optional
from dataclasses import dataclass

@dataclass
class Trajectory:
    """Single rollout trajectory."""
    query: str
    output: str
    reward: float
    groundtruth: Optional[str] = None
    summary: Optional[str] = None


class SemanticExtractor:
    """
    Extracts semantic advantages from groups of trajectories.
    Implements the 3-stage LLM process from Continuous Learning GRPO paper.
    """

    def __init__(self, llm_client, max_operations: int = 3):
        """
        Args:
            llm_client: LLM client for introspection (e.g., OpenAI, DeepSeek)
            max_operations: Max operations per group critique
        """
        self.llm = llm_client
        self.max_operations = max_operations

    # STAGE 1: Trajectory Summarization
    def summarize_trajectory(
        self, 
        trajectory: Trajectory,
        use_groundtruth: bool = True
    ) -> str:
        """
        Summarize a single trajectory.

        Implements Figure 11 from paper.
        """
        prompt = f"""An agent system may be provided with some experiences, and then it produces the following trajectory to solve the given problem. Please summarize the trajectory step-by-step:

1. For each step, describe what action is being taken, and which experience has been used in this step.
2. Given the grading of this rollout and the correct answer, identify and explain any steps that represent detours, errors, or backtracking, highlighting why they might have occurred and what their impact was on the trajectory's progress.
3. Maintain all the core outcomes of each step, even if it was part of a flawed process.

<trajectory>
{trajectory.output}
</trajectory>

<evaluation>
{'This trajectory delivers **correct** answer' if trajectory.reward > 0 else 'This trajectory delivers **wrong** answer'}
</evaluation>

{f'<groundtruth>{trajectory.groundtruth}</groundtruth>' if use_groundtruth else ''}

Only return the trajectory summary of each step, e.g.,
1. what happened in the first step and the core outcomes
2. what happened in the second step and the core outcomes
3. ..."""

        response = self.llm.chat(prompt)
        return response

    # STAGE 2: Group Advantage Extraction
    def extract_group_advantage(
        self,
        trajectories: List[Trajectory],
        experiences: str,  # Formatted experience library
        use_groundtruth: bool = True
    ) -> List[Dict]:
        """
        Extract semantic advantage from a group of trajectories.

        Implements Figure 12 from paper.
        Returns operations: [{"option": "add", "experience": "..."}, ...]
        """
        # Check if group has variation (std > 0)
        rewards = [t.reward for t in trajectories]
        if len(set(rewards)) <= 1:
            return []  # Skip homogeneous groups

        # Format trajectories with summaries
        formatted_trajectories = []
        for i, traj in enumerate(trajectories):
            status = "correct" if traj.reward > 0 else "wrong"
            formatted_trajectories.append(
                f"Attempt {i+1} (Answer {status}):\n{traj.summary or traj.output}"
            )

        trajectories_text = "\n\n".join(formatted_trajectories)

        prompt = f"""An agent system is provided with a set of experiences and has tried to solve the problem multiple times with both successful and wrong solutions. Review these problem-solving attempt and extract generalizable experiences. Follow these steps:

1. Trajectory Analysis:
   - For successful steps: Identify key correct decisions and insights
   - For errors: Pinpoint where and why the reasoning went wrong
   - Note any important patterns or strategies used/missed
   - Review why some trajectories fail: are any existing experiences missed, or do the experiences not provide enough guidance?

2. Update Existing Experiences
   - Some trajectories may be correct and others may be wrong; you should ensure the experiences can help the agent solve the problem correctly
   - You have three options: [modify, add, delete]
      * modify: You can modify current experiences to make it helpful
      * add: You can introduce new experiences to improve future performance
      * delete: You can delete existing experiences
   - You can update at most {self.max_operations} clear, generalizable lessons for this case
   - Before updating each experience, you need to:
      * Specify when it would be most relevant
      * List key problem features that make this experience applicable
      * Identify similar problem patterns where this advice applies

3. Requirements for each experience that is modified or added.
   - Begin with general background with several words in the experience
   - Focus on strategic thinking patterns, not specific calculations
   - Emphasize decision points that could apply to similar problems

Please provide reasoning in details under the guidance of the above 3 steps. After the step-by-step reasoning, you will finish by returning in this JSON format as follows:

```json
[
    {{
        "option": "modify",
        "experience": "the modified experience",
        "modified_from": "G17"
    }},
    {{
        "option": "add",
        "experience": "the added experience"
    }},
    {{
        "option": "delete",
        "delete_id": "G5"
    }}
]
```

Note that your updated experiences may not need to cover all the options.

{trajectories[0].query}

{trajectories_text}

{trajectories[0].groundtruth if use_groundtruth else ''}

{experiences} """

        response = self.llm.chat(prompt)

        # Extract and parse the JSON operations block from the LLM response
        json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
        if json_match:
            operations = json.loads(json_match.group(1))
            return operations[:self.max_operations]

        return []

    # STAGE 3: Batch Consolidation
    def consolidate_batch(
        self,
        all_group_operations: List[List[Dict]],
        experiences: str
    ) -> List[Dict]:
        """
        Consolidate all group advantages into final experience updates.

        Implements Figure 13 from paper.
        """
        # Flatten the operations collected from all groups
        all_ops = []
        for group_ops in all_group_operations:
            all_ops.extend(group_ops)

        if not all_ops:
            return []

        prompt = f"""An agent system is provided with a set of experiences and has tried to solve the problem multiple times. From the reflections, some suggestions on the existing experiences have been posed. Your task is to collect them and decide on the final experience revision plan. Each final experience must satisfy the following requirements:
  1. It must be clear, generalizable lessons for this case, with no more than 32 words
  2. Begin with general background with several words in the experience
  3. Focus on strategic thinking patterns, not specific calculations
  4. Emphasize decision points that could apply to similar problems
  5. Avoid repeating similar content across multiple different experiences

{experiences}

{json.dumps(all_ops, indent=2)}

Please provide reasoning for each of the suggestions, and think about how to update the existing experiences. You have three update options: [modify, merge, delete]

  • modify: You can modify current experiences to make it helpful
  • merge: You can merge some similar experiences into a more general form to reduce duplication
  • delete: You can delete an experience

After generating the step-by-step reasoning, you need to give the final experience revision details by returning in this JSON format as follows:

```json
[
    {{
        "option": "modify",
        "experience": "the modified experience",
        "modified_from": "G17"
    }},
    {{
        "option": "merge",
        "experience": "the merged experience",
        "merged_from": ["C1", "C3", "S4"]
    }},
    {{
        "option": "delete",
        "delete_id": "G5"
    }}
]
```"""

        response = self.llm.chat(prompt)

        # Extract and parse the JSON block from the LLM response

        json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
        if json_match:
            return json.loads(json_match.group(1))

        return []
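The extractor only needs an object exposing .chat(prompt) -> str, so it can be exercised end to end with a canned stub before wiring in a real API client. Everything below (StubLLM, the toy trajectories, the canned JSON reply) is invented purely for illustration:

# Drives all three stages with a stub LLM; not a real client, outputs are canned.
class StubLLM:
    def chat(self, prompt: str) -> str:
        # Always answers with one "add" operation inside a ```json block,
        # matching the format the prompts request.
        return ('```json\n'
                '[{"option": "add", "experience": "For AIME-style counting, enumerate small cases first."}]\n'
                '```')

extractor = SemanticExtractor(llm_client=StubLLM(), max_operations=3)

group = [
    Trajectory(query="Toy problem", output="...rollout A...", reward=1.0, groundtruth="42"),
    Trajectory(query="Toy problem", output="...rollout B...", reward=0.0, groundtruth="42"),
]

for traj in group:                                                          # Stage 1
    traj.summary = extractor.summarize_trajectory(traj)

group_ops = extractor.extract_group_advantage(group, experiences="None")   # Stage 2
final_ops = extractor.consolidate_batch([group_ops], experiences="None")   # Stage 3
print(final_ops)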

2.3 Integration with GRPOTrainer

File: src/gym/train/grpo/trainer.py (modifications)

# Add to existing GRPOTrainer class

from pathlib import Path

import torch

from .experience_manager import ExperienceManager
from .semantic_extractor import SemanticExtractor, Trajectory

class GRPOTrainer:
    def __init__(self, ...):
        # Existing initialization
        ...

        # NEW: Continuous Learning GRPO components
        self.use_training_free = self.args.continuous_learning_grpo  # New arg

        if self.use_training_free:
            self.experience_manager = ExperienceManager(
                checkpoint_path=self.args.experience_checkpoint_path
            )
            self.semantic_extractor = SemanticExtractor(
                llm_client=self._get_llm_client(),  # OpenAI/DeepSeek client
                max_operations=self.args.max_experience_operations
            )

    def _get_llm_client(self):
        """Return a client exposing `.chat(prompt) -> str`, which SemanticExtractor expects."""
        from types import SimpleNamespace
        from openai import OpenAI

        client = OpenAI(api_key=self.args.llm_api_key, base_url=self.args.llm_base_url)
        # A model-name arg is assumed here; "deepseek-chat" is only a default suggestion.
        model = getattr(self.args, "llm_model_name", "deepseek-chat")

        def chat(prompt: str) -> str:
            out = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            return out.choices[0].message.content

        return SimpleNamespace(chat=chat)

    def training_step(self, model, inputs):
        """Modified training step for Continuous Learning GRPO."""

        if not self.use_training_free:
            # Use vanilla GRPO (existing code)
            return super().training_step(model, inputs)

        # Continuous Learning GRPO
        queries = inputs["query"]
        batch_size = len(queries)

        # Inject experiences into prompts
        experiences_text = self.experience_manager.format_for_prompt()
        enhanced_queries = [
            self._inject_experiences(q, experiences_text) 
            for q in queries
        ]

        # Generate G rollouts per query
        G = self.args.group_size  # e.g., 5
        all_trajectories = []

        for idx, (query, enhanced_query) in enumerate(zip(queries, enhanced_queries)):
            group_trajectories = []

            for _ in range(G):
                # Generate response
                response = self._generate_response(
                    model, 
                    enhanced_query,
                    temperature=self.args.rollout_temperature
                )

                # Compute reward
                reward = self.compute_rewards([query], [response])[0]

                # Create trajectory
                traj = Trajectory(
                    query=query,
                    output=response,
                    reward=reward.item(),
                    groundtruth=groundtruths[idx]
                )

                group_trajectories.append(traj)

            all_trajectories.append(group_trajectories)

        # Extract semantic advantages
        all_group_operations = []

        for group in all_trajectories:
            # Stage 1: Summarize each trajectory
            for traj in group:
                traj.summary = self.semantic_extractor.summarize_trajectory(
                    traj,
                    use_groundtruth=self.args.use_groundtruth
                )

            # Stage 2: Extract group advantage
            operations = self.semantic_extractor.extract_group_advantage(
                group,
                experiences_text,
                use_groundtruth=self.args.use_groundtruth
            )

            if operations:
                all_group_operations.append(operations)

        # Stage 3: Consolidate batch
        if all_group_operations:
            final_operations = self.semantic_extractor.consolidate_batch(
                all_group_operations,
                experiences_text
            )

            # Apply updates to experience library
            self.experience_manager.apply_operations(final_operations)

        # NO PARAMETER UPDATES - model stays frozen
        # Return dummy loss for compatibility
        return torch.tensor(0.0, requires_grad=True)

    def _inject_experiences(self, query: str, experiences: str) -> str:
        """Inject experiences into query prompt."""
        template = f"""Please solve the problem:
{query}

When solving problems, you MUST first carefully read and understand the helpful instructions and experiences:
{experiences}"""
        return template

    def save_model(self, output_dir: str, **kwargs):
        """Save model and experiences."""
        super().save_model(output_dir, **kwargs)

        if self.use_training_free:
            # Save experience library
            exp_path = Path(output_dir) / "experiences.json"
            self.experience_manager.save(str(exp_path))
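The training_step above calls a _generate_response helper that is not shown; how rollouts are produced is an integration detail of the existing trainer (e.g. vLLM or HF generate). A minimal sketch, assuming a Hugging Face causal LM and an available self.tokenizer, could look like the following. It is an assumption of this guide, not existing zoo/gym code:

    # Hypothetical helper (not in the existing trainer); assumes `self.tokenizer` and a
    # Hugging Face causal LM with the standard `generate` API.
    def _generate_response(self, model, prompt: str, temperature: float = 0.7) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():  # rollouts never need gradients in the training-free setting
            output_ids = model.generate(
                **inputs,
                max_new_tokens=getattr(self.args, "max_new_tokens", 2048),
                do_sample=True,
                temperature=temperature,
            )
        # Decode only the newly generated continuation, not the prompt.
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)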

2.4 Training Arguments

File: src/gym/train/arguments.py (additions)

@dataclass
class FinetuningArguments:
    # Existing arguments
    ...

    # NEW: Continuous Learning GRPO arguments
    continuous_learning_grpo: bool = field(
        default=False,
        metadata={"help": "Use Continuous Learning GRPO instead of vanilla GRPO"}
    )

    experience_checkpoint_path: Optional[str] = field(
        default=None,
        metadata={"help": "Path to load/save experience library"}
    )

    llm_api_key: Optional[str] = field(
        default=None,
        metadata={"help": "API key for LLM (e.g., DeepSeek, OpenAI)"}
    )

    llm_base_url: str = field(
        default="https://api.deepseek.com/v1",
        metadata={"help": "Base URL for LLM API"}
    )

    max_experience_operations: int = field(
        default=3,
        metadata={"help": "Max operations per group critique"}
    )

    rollout_temperature: float = field(
        default=0.7,
        metadata={"help": "Temperature for rollout generation"}
    )

    use_groundtruth: bool = field(
        default=True,
        metadata={"help": "Use ground truth in semantic extraction"}
    )

    group_size: int = field(
        default=5,
        metadata={"help": "Number of rollouts per query (G in paper)"}
    )
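The dataclass does not enforce that the extractor LLM is configured when the training-free path is enabled. An optional __post_init__ guard along these lines (a suggestion, not existing zoo/gym behavior, and it assumes FinetuningArguments does not already define __post_init__) would surface misconfiguration early:

    def __post_init__(self):
        # Only the Continuous Learning GRPO path needs the external LLM for semantic extraction.
        if self.continuous_learning_grpo and not self.llm_api_key:
            raise ValueError(
                "continuous_learning_grpo=True requires --llm_api_key for the semantic-extraction LLM"
            )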

3. Usage Example

# Training script
python src/train.py \
    --model_name_or_path deepseek-ai/DeepSeek-V3 \
    --dataset_name custom_math_dataset \
    --output_dir ./output/continuous_learning_grpo \
    --continuous_learning_grpo \
    --group_size 5 \
    --rollout_temperature 0.7 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --llm_api_key $DEEPSEEK_API_KEY \
    --llm_base_url https://api.deepseek.com/v1 \
    --use_groundtruth \
    --max_experience_operations 3

4. Performance Expectations

Based on paper results with DeepSeek-V3.1-Terminus:

Metric          Baseline    Continuous Learning GRPO    Improvement
AIME24          80.0%       82.7%                       +2.7%
AIME25          67.9%       73.3%                       +5.4%
Training Cost   N/A         ~$18 (100 samples)          500x cheaper than fine-tuning
Training Data   N/A         100 samples                 100x less than vanilla RL

Key Success Factors:

  • Group size G > 1 (ablation shows G=1 degrades performance)
  • Multi-epoch training (3 epochs recommended)
  • High-quality base model (works best on 100B+ models)
  • Domain-appropriate reward functions


5. Testing Checklist

  • ExperienceManager: CRUD operations work correctly (see the pytest sketch after this checklist)
  • ExperienceManager: Serialization/deserialization preserves state
  • SemanticExtractor: LLM responses parse correctly to JSON
  • SemanticExtractor: Handles empty/homogeneous groups gracefully
  • Trainer: Experiences inject into prompts correctly
  • Trainer: Rollout generation produces G outputs per query
  • Trainer: No parameter updates occur (verify gradients not computed)
  • Trainer: Experience library grows across epochs
  • End-to-end: Performance improves on validation set across epochs
  • End-to-end: Learned experiences are human-readable and generalizable
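A minimal pytest sketch covering the first two checklist items (module path and test names are suggestions, assuming src/ is on PYTHONPATH):

# tests/test_experience_manager.py -- sketch for the ExperienceManager checklist items.
from gym.train.grpo.experience_manager import ExperienceManager

def test_crud_operations():
    mgr = ExperienceManager()
    exp_id = mgr.add("Check units before the final answer.")
    assert exp_id == "G0" and len(mgr) == 1
    assert mgr.modify(exp_id, "Always check units before the final answer.")
    assert mgr.delete(exp_id) and len(mgr) == 0
    assert not mgr.delete("G999")  # deleting a missing ID fails gracefully

def test_save_load_roundtrip(tmp_path):
    mgr = ExperienceManager()
    mgr.add("Prefer casework for small discrete problems.")
    path = tmp_path / "experiences.json"
    mgr.save(str(path))

    restored = ExperienceManager(checkpoint_path=str(path))
    assert restored.experiences == mgr.experiences
    assert restored.add("new") == "G1"  # ID counter is preserved across reloads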

6. Next Steps

Phase 1 (Weeks 1-2): Core Implementation

  • Implement ExperienceManager
  • Implement SemanticExtractor
  • Add unit tests

Phase 2 (Weeks 3-4): Integration

  • Modify GRPOTrainer
  • Add command-line arguments
  • Test on small dataset (10-20 samples)

Phase 3 (Weeks 5-6): Evaluation

  • Run on full dataset (100+ samples)
  • Compare with vanilla GRPO baseline
  • Analyze learned experiences

Phase 4 (Weeks 7-8): Optimization

  • Parallelize LLM calls
  • Add caching for repeated queries
  • Optimize prompt templates
  • Add IPFS/on-chain storage (optional)


References

  1. Paper: Training-Free Group Relative Policy Optimization (arXiv:2510.08191v1)
  2. Code: https://github.com/TencentCloudADP/youtu-agent/tree/training_free_GRPO
  3. DeepSeek API: https://api-docs.deepseek.com/