Continuous Learning GRPO Integration Strategy for Gym¶
Date: October 28, 2025
Overview¶
Gym already has an excellent foundation for Continuous Learning GRPO. The integration should leverage the existing infrastructure (the GRPO/GSPO trainers) and extend it cleanly.
Current Infrastructure ✅¶
1. Configuration Parameters (Already Exist!)¶
Location: src/gym/hparams/finetuning_args.py lines 277-312
# Continuous Learning GRPO parameters
continuous_learning_grpo: bool = False
experience_lib_path: Optional[str] = None
experience_max_size: int = 100
llm_api_key: Optional[str] = None
llm_base_url: str = "https://api.deepseek.com/v1"
llm_model: str = "deepseek-chat"
semantic_max_operations: int = 3
rollout_temperature: float = 0.7
use_groundtruth: bool = True
✅ Status: Complete, ready to use
2. Core Components (Fully Implemented)¶
- ✅ ExperienceManager - src/gym/train/grpo/experience_manager.py
- ✅ SemanticExtractor - src/gym/train/grpo/semantic_extractor.py
- ✅ APIModelAdapter - src/gym/train/grpo/api_model_adapter.py
- ✅ Comprehensive unit tests - tests/train/test_continuous_learning_grpo.py
3. Trainer Infrastructure¶
- ✅ GRPOTrainer - Group-based advantage computation (see the sketch below)
- ✅ GSPOTrainer - Sequence-level optimization
- ✅ Unified training pipeline - src/gym/train/grpo/workflow.py
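For reference, the group-based advantage computation mentioned above is the standard GRPO normalization of rewards within each group of G rollouts for the same query. A minimal sketch (not Gym's actual implementation):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one group of G rollouts for the same query.

    rewards: shape (G,), one scalar reward per rollout.
    Returns zero-mean, unit-variance advantages for that group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```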
Recommended Integration Approach¶
Option 1: Extend GRPOTrainer (RECOMMENDED)¶
Why this approach:

- ✅ Maintains unified codebase
- ✅ Leverages existing infrastructure
- ✅ Clean flag-based behavior switching
- ✅ Consistent with Gym's design philosophy
- ✅ Easy to test and maintain
Implementation:
# src/gym/train/grpo/trainer.py
class GRPOTrainer(Trainer):
def __init__(self, finetuning_args, ...):
super().__init__(...)
# Training-free mode setup
if finetuning_args.continuous_learning_grpo:
self._setup_training_free_mode(finetuning_args)
def _setup_training_free_mode(self, args):
"""Initialize Continuous Learning GRPO components."""
from .experience_manager import ExperienceManager
from .semantic_extractor import SemanticExtractor, LLMClient
from .api_model_adapter import DeepSeekAdapter
# Experience library
self.experience_manager = ExperienceManager(
checkpoint_path=args.experience_lib_path
)
# LLM for semantic extraction
llm_client = LLMClient(
api_key=args.llm_api_key,
base_url=args.llm_base_url,
model=args.llm_model
)
# Semantic extractor
self.semantic_extractor = SemanticExtractor(
llm_client=llm_client,
max_operations=args.semantic_max_operations
)
# Freeze model
self.model.eval()
for param in self.model.parameters():
param.requires_grad = False
def training_step(self, model, inputs):
"""Dispatch to appropriate training mode."""
if self.finetuning_args.continuous_learning_grpo:
return self._training_step_free(model, inputs)
else:
return self._training_step_standard(model, inputs)
def _training_step_free(self, model, inputs):
"""Continuous Learning GRPO: no parameter updates."""
# 1. Inject experiences into prompts
enhanced_inputs = self._inject_experiences(inputs)
# 2. Generate G rollouts per query
rollouts = self._generate_rollouts(enhanced_inputs)
# 3. Compute rewards (via verify function)
for rollout in rollouts:
rollout["reward"] = self._verify_rollout(rollout)
# 4. Extract semantic advantages
operations = self._extract_semantic_advantages(rollouts)
# 5. Update experience library
self.experience_manager.apply_operations(operations)
# 6. Save checkpoint (experiences only)
self._save_experiences()
# 7. Return zero loss (no gradient update!)
return torch.tensor(0.0, device=model.device)
def _inject_experiences(self, inputs):
"""Inject experience library into prompts."""
experiences = self.experience_manager.format_for_prompt()
# Enhance prompts with experiences
# Implementation depends on tokenizer/template
...
return enhanced_inputs
def _generate_rollouts(self, inputs):
"""Generate G rollouts per query."""
rollouts = []
for i in range(self.group_size):
# Generate with temperature for diversity
output = self.model.generate(
**inputs,
temperature=self.finetuning_args.rollout_temperature,
do_sample=True
)
rollouts.append(output)
return rollouts
def _extract_semantic_advantages(self, rollouts):
"""Extract semantic advantages using LLM."""
# Group rollouts by query
query_groups = self._group_by_query(rollouts)
# Stage 1: Summarize trajectories
for group in query_groups:
for rollout in group:
rollout["summary"] = self.semantic_extractor.summarize_trajectory(
rollout,
use_groundtruth=self.finetuning_args.use_groundtruth
)
# Stage 2: Extract group advantages
all_operations = []
for group in query_groups:
experiences_str = self.experience_manager.format_for_prompt()
operations = self.semantic_extractor.extract_group_advantage(
group,
experiences_str,
use_groundtruth=self.finetuning_args.use_groundtruth
)
all_operations.append(operations)
# Stage 3: Consolidate batch
experiences_str = self.experience_manager.format_for_prompt()
final_operations = self.semantic_extractor.consolidate_batch(
all_operations,
experiences_str
)
return final_operations
def _save_experiences(self):
"""Save experience library to checkpoint."""
if self.experience_manager.checkpoint_path:
self.experience_manager.save(
self.experience_manager.checkpoint_path
)
Benefits:

- Single unified trainer
- Clean separation of concerns
- Easy to switch between modes
- Maintains all existing functionality
- No code duplication
Option 2: Separate ContinuousLearningGRPOTrainer (NOT RECOMMENDED)¶
Why NOT recommended:

- ❌ Code duplication (group logic, advantage computation)
- ❌ Harder to maintain consistency
- ❌ More complex for users (which trainer to use?)
- ❌ Breaks DRY principle
Only use if:

- Training-free logic is drastically different
- A completely separate workflow is needed
- Standard GRPO should eventually be deprecated
Integration Steps¶
Phase 1: Extend GRPOTrainer (Week 1)¶
Files to modify:

- src/gym/train/grpo/trainer.py (main changes)
  - Add _setup_training_free_mode()
  - Add _training_step_free()
  - Add _inject_experiences()
  - Add _generate_rollouts()
  - Add _extract_semantic_advantages()
  - Update training_step() to dispatch between standard and training-free modes

- src/gym/data/template.py (experience injection; a usage sketch follows after this list)

  ```python
  def format_with_experiences(
      self,
      query: str,
      experiences: str,
      system: Optional[str] = None,
  ) -> str:
      """Format query with experiences prepended."""
      if experiences and experiences != "None":
          experience_section = f"\n\nHelpful experiences:\n{experiences}"
      else:
          experience_section = ""
      return f"{system or ''}{experience_section}\n\nProblem: {query}"
  ```

- src/gym/train/grpo/workflow.py (checkpoint management)
  - Update to save experiences alongside the model
  - Load experiences at training start
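As referenced above, here is a sketch of how _inject_experiences could use the new template helper. It assumes the trainer holds the active template as self.template and that the batch exposes raw query text; the exact integration depends on the tokenizer and data pipeline:

```python
# Sketch only: the field names ("query", "system", "prompt") and self.template are
# assumptions about how the data pipeline exposes raw text to the trainer.
def _inject_experiences(self, inputs):
    experiences = self.experience_manager.format_for_prompt()
    enhanced = []
    for example in inputs:
        prompt = self.template.format_with_experiences(
            query=example["query"],
            experiences=experiences,
            system=example.get("system"),
        )
        enhanced.append({**example, "prompt": prompt})
    return enhanced
```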
Testing:
# tests/train/test_grpo_trainer_integration.py
import copy

import torch


def test_training_free_mode():
trainer = GRPOTrainer(
finetuning_args=FinetuningArguments(
continuous_learning_grpo=True,
llm_api_key="sk-test",
experience_lib_path="./test_exp.json"
),
...
)
# Verify no parameter updates
initial_params = copy.deepcopy(trainer.model.state_dict())
# Run training step
loss = trainer.training_step(model, inputs)
# Check model unchanged
final_params = trainer.model.state_dict()
assert torch.allclose(initial_params['...'], final_params['...'])
# Check experiences updated
assert len(trainer.experience_manager) > 0
Phase 2: Domain Modules (Week 2)¶
Create domain structure:
src/gym/train/grpo/domains/
├── __init__.py
├── base.py # Base classes
├── math/
│ ├── __init__.py
│ ├── dataset.py # AIME dataset loader
│ ├── verify.py # Math correctness checker
│ └── prompts.py # Math-specific templates
└── web/
├── __init__.py
├── dataset.py # WebWalker dataset loader
├── verify.py # Navigation success checker
└── prompts.py # Web-specific templates
Base interface:
# src/gym/train/grpo/domains/base.py
from abc import ABC, abstractmethod
from typing import Dict, List


class DomainAdapter(ABC):
"""Base class for domain-specific adapters."""
@abstractmethod
def load_dataset(self, name: str) -> List[Dict]:
"""Load domain-specific dataset."""
pass
@abstractmethod
def verify(self, output: str, groundtruth: str) -> float:
"""Verify correctness, return reward."""
pass
@abstractmethod
def format_prompt(self, problem: str, experiences: str) -> str:
"""Format domain-specific prompt."""
pass
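How the `domain` config value (e.g. `math`, `web`) resolves to one of these adapters is not pinned down above; a small registry is one option. The names below (DOMAIN_REGISTRY, register_domain, get_domain_adapter) are hypothetical and shown only as a sketch:

```python
# src/gym/train/grpo/domains/__init__.py (hypothetical sketch)
from typing import Dict, Type

from .base import DomainAdapter

DOMAIN_REGISTRY: Dict[str, Type[DomainAdapter]] = {}


def register_domain(name: str):
    """Class decorator that registers a DomainAdapter under its config name."""
    def wrapper(cls: Type[DomainAdapter]) -> Type[DomainAdapter]:
        DOMAIN_REGISTRY[name] = cls
        return cls
    return wrapper


def get_domain_adapter(name: str) -> DomainAdapter:
    """Resolve the `domain` config value to an adapter instance."""
    if name not in DOMAIN_REGISTRY:
        raise ValueError(f"Unknown domain '{name}'; known: {sorted(DOMAIN_REGISTRY)}")
    return DOMAIN_REGISTRY[name]()
```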
Math domain example:
# src/gym/train/grpo/domains/math/dataset.py
from typing import Dict, List

from datasets import load_dataset  # HuggingFace datasets


def load_aime_dataset(split: str = "2024") -> List[Dict]:
"""Load AIME competition problems."""
# Load from HuggingFace datasets or local JSON
dataset = load_dataset("zooai/aime", split=split)
return [
{
"problem": sample["problem"],
"groundtruth": sample["answer"],
"domain": "math"
}
for sample in dataset
]
# src/gym/train/grpo/domains/math/verify.py
import math


def verify_math_answer(output: str, groundtruth: str) -> float:
"""Verify math answer correctness."""
# Extract answer from output
predicted = extract_answer(output)
expected = extract_answer(groundtruth)
# Compare with tolerance
if math.isclose(float(predicted), float(expected), rel_tol=1e-5):
return 1.0
else:
return 0.0
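The verifier above calls an extract_answer helper that is not shown. A minimal sketch, assuming answers appear either inside \boxed{...} or as the last number in the text (both assumptions about the answer format):

```python
import re


def extract_answer(text: str) -> str:
    """Pull the final numeric answer from a solution string (sketch).

    Prefers a \\boxed{...} expression; otherwise falls back to the last
    number in the text. Raises ValueError if nothing is found.
    """
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if numbers:
        return numbers[-1]
    raise ValueError("No answer found in output")
```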
Phase 3: Example Scripts (Week 3)¶
Create user-facing scripts:
# scripts/train_grpo_free_math.py
"""Continuous Learning GRPO on AIME math problems."""
import os

from gym import run_train
from gym.hparams import get_train_args
def main():
# Configure training
config = {
# Model (API-based)
"model_name_or_path": "deepseek-chat",
"api_mode": True,
"api_base_url": "https://api.deepseek.com/v1",
# Continuous Learning GRPO
"stage": "grpo",
"continuous_learning_grpo": True,
"grpo_group_size": 5,
# Experience management
"experience_lib_path": "./output/math_experiences.json",
"llm_api_key": os.getenv("DEEPSEEK_API_KEY"),
"semantic_max_operations": 3,
"use_groundtruth": True,
# Dataset
"dataset": "aime24",
"domain": "math",
# Training
"num_train_epochs": 3,
"per_device_train_batch_size": 64,
"rollout_temperature": 0.7,
# Output
"output_dir": "./output/grpo_free_math",
"save_steps": 1,
"logging_steps": 1
}
# Parse args
model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(config)
# Run training
run_train(model_args, data_args, training_args, finetuning_args, generating_args)
if __name__ == "__main__":
main()
Shell script wrapper:
#!/bin/bash
# scripts/train_grpo_free_math.sh
export DEEPSEEK_API_KEY="sk-xxx"
python scripts/train_grpo_free_math.py \
--dataset aime24 \
--dataset_truncate 100 \
--num_train_epochs 3 \
--grpo_group_size 5 \
--output_dir ./output/grpo_free_math_test
API vs Local Model Support¶
API Mode (Recommended for Training-Free)¶
Pros:

- ✅ Matches Tencent paper (DeepSeek-V3)
- ✅ No GPU required
- ✅ Faster inference
- ✅ Lower total cost (~$18 for 100 samples)
- ✅ Easier setup
Implementation:
if finetuning_args.api_mode:
from .api_model_adapter import DeepSeekAdapter
self.model = DeepSeekAdapter(
api_key=finetuning_args.llm_api_key,
model=finetuning_args.model_name_or_path
)
Local Model Mode (Optional)¶
Pros:

- ✅ No API dependency
- ✅ Full control
- ✅ Better for privacy

Cons:

- ❌ Requires GPU (32GB+ for 32B models)
- ❌ Slower inference
- ❌ Higher infrastructure cost
Implementation:
if not finetuning_args.api_mode:
# Use standard HuggingFace loading
self.model = AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path,
...
)
Configuration Examples¶
Math Domain (API-based)¶
# configs/grpo_free_math_api.yaml
model_name_or_path: deepseek-chat
api_mode: true
api_base_url: https://api.deepseek.com/v1
stage: grpo
continuous_learning_grpo: true
grpo_group_size: 5
rollout_temperature: 0.7
dataset: aime24
domain: math
use_groundtruth: true
experience_lib_path: ./output/math_exp/experiences.json
semantic_max_operations: 3
num_train_epochs: 3
per_device_train_batch_size: 64
output_dir: ./output/grpo_free_math
save_steps: 1
logging_steps: 1
Web Domain (Local Qwen3-32B)¶
# configs/grpo_free_web_local.yaml
model_name_or_path: Qwen/Qwen3-32B-Instruct
api_mode: false
stage: grpo
continuous_learning_grpo: true
grpo_group_size: 5
dataset: webwalker
domain: web
use_groundtruth: false # Self-discrimination
experience_lib_path: ./output/web_exp/experiences.json
llm_api_key: ${DEEPSEEK_API_KEY} # For semantic extraction only
num_train_epochs: 3
per_device_train_batch_size: 32
output_dir: ./output/grpo_free_web
Testing Strategy¶
Unit Tests (Already Complete ✅)¶
- ExperienceManager: 18/18 passing
- SemanticExtractor: 10/10 passing
- Trajectory: 2/2 passing
Integration Tests (To Add)¶
# tests/train/test_grpo_trainer_training_free.py
def test_training_free_mode_initialization():
"""Test training-free mode setup."""
...
def test_experience_injection():
"""Test experience injection into prompts."""
...
def test_no_parameter_updates():
"""Verify model weights unchanged."""
...
def test_experience_library_growth():
"""Check experiences accumulate across epochs."""
...
def test_checkpoint_loading():
"""Test loading experiences from checkpoint."""
...
def test_domain_adapters():
"""Test math/web domain modules."""
...
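As an illustration, the experience-injection test could be fleshed out roughly as follows. make_test_trainer is a hypothetical fixture, and the enhanced-input structure matches the injection sketch earlier in this document:

```python
def test_experience_injection_sketch(monkeypatch):
    """Sketch: injected prompts should contain the formatted experience library."""
    trainer = make_test_trainer(continuous_learning_grpo=True)  # hypothetical fixture
    # Stub format_for_prompt so the test does not depend on library contents.
    monkeypatch.setattr(
        trainer.experience_manager,
        "format_for_prompt",
        lambda: "Always check edge cases such as n = 0.",
    )
    enhanced = trainer._inject_experiences([{"query": "Compute 2 + 2."}])
    assert "edge cases" in enhanced[0]["prompt"]
```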
End-to-End Tests¶
# tests/train/test_grpo_free_e2e.py
import json
from pathlib import Path


def test_full_training_workflow():
"""Run 2 epochs on toy dataset."""
config = {...}
run_train(config)
# Verify results
assert Path("output/experiences.json").exists()
experiences = json.load(open("output/experiences.json"))
assert len(experiences) > 0
Migration from Tencent Code¶
What to Keep¶
- ✅ 3-stage LLM process (already implemented)
- ✅ Experience format (already implemented)
- ✅ Prompts (already implemented)
- ✅ Domain structure (keep Tencent's layout; still to be added in Gym)
What to Adapt¶
- Training loop → Integrate with GRPOTrainer
- Async rollouts → Add to trainer (see the sketch after this list)
- SimpleAgent → Optional, support API and HF models
- Directory structure → Use Gym's output format
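For the async-rollout item above, a sketch of what concurrent rollout generation against an API model could look like. adapter.agenerate is a hypothetical async method, not the current APIModelAdapter interface; the point is only that API rollouts are I/O-bound and parallelize well:

```python
import asyncio


async def generate_rollouts_async(adapter, prompt: str, group_size: int, temperature: float):
    """Fire off G rollout requests concurrently for one query (sketch)."""
    tasks = [
        adapter.agenerate(prompt, temperature=temperature)  # hypothetical async API call
        for _ in range(group_size)
    ]
    return await asyncio.gather(*tasks)
```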
What to Skip¶
- ❌ Tencent's config system (use Gym's)
- ❌ UTU agent framework (use Gym's trainer)
- ❌ Custom CLI (use gym-cli)
Timeline¶
Week 1: Trainer Integration¶
- Add training_free mode to GRPOTrainer
- Implement experience injection
- Add rollout generation
- Basic testing
Week 2: Domain Modules¶
- Create domains/ directory structure
- Implement math domain
- Implement web domain
- Domain-specific tests
Week 3: Scripts & Examples¶
- Create train_grpo_free_math.py
- Create train_grpo_free_web.py
- Add YAML configs
- User documentation
Week 4: Testing & Benchmarks¶
- End-to-end tests
- AIME24 benchmark
- Cost analysis
- Performance comparison
Success Criteria¶
- Functional:
  - Runs 3-epoch training on AIME24 (100 samples)
  - Experience library grows (50-200 experiences)
  - Zero parameter updates (model frozen)
  - Checkpoint save/load works
- Performance:
  - AIME24 accuracy ≥75% (target: 82.7%)
  - Training cost ≤$25 (target: $18)
  - Training time ≤8 hours (target: 6 hours)
- Usability:
  - Simple config YAML
  - Works with gym-cli train
  - Clear error messages
  - Example scripts
- Quality:
  - All tests passing
  - Documentation complete
  - Code review approved
Recommendation¶
Use Option 1: Extend GRPOTrainer
This approach:

- ✅ Leverages existing GRPO infrastructure
- ✅ Maintains clean separation with the continuous_learning_grpo flag
- ✅ Consistent with Gym's design philosophy
- ✅ Easy to test and maintain
- ✅ Minimal code duplication
- ✅ Works with existing configs (the parameters already exist!)
Next immediate actions:

1. Modify src/gym/train/grpo/trainer.py to add training-free mode
2. Test with an API model (DeepSeek)
3. Add the math domain module
4. Run a toy benchmark
ETA: 2-3 weeks for full integration and testing.
Questions to resolve:

1. Should async rollout generation be mandatory or optional?
2. Should we support Tencent's SimpleAgent or focus on API/HF models?
3. Should domain modules be required or optional (generic fallback)?
Recommendation: Start simple (API + math domain + sync rollouts), add advanced features later.