Converting Chat History to Experiences - Tutorial

Introduction

One of the most powerful features of Continuous Learning GRPO is the ability to learn from real user interactions. Instead of relying on manually curated training datasets, you can extract experiences directly from chat logs, user feedback, and production conversations.

This tutorial shows you how to convert chat history into semantic experiences that improve your AI model.

Why Chat History?

Traditional Training Data:

{
  "instruction": "Solve: x² + 2x + 5 = 0",
  "output": "No real solutions (discriminant < 0)"
}
- Manually curated
- Expensive to create
- May not reflect real usage

Chat History:

{
  "messages": [
    {"role": "user", "content": "Help me solve x² + 2x + 5 = 0"},
    {"role": "assistant", "content": "Let me calculate the discriminant..."},
    {"role": "user", "content": "Wait, there's no real solution right?"},
    {"role": "assistant", "content": "You're absolutely right! b²-4ac = -16 < 0, so no real solutions."}
  ],
  "feedback": "helpful"
}
- Organic user queries
- Real conversation patterns
- Includes corrections and clarifications
- Free (already exists in production logs)

Overview of the Process

Chat Logs → Trajectory Extraction → Reward Assignment → Experience Extraction → Library Update

Step 1: Parse chat history into individual conversations
Step 2: Extract query, response, and user feedback
Step 3: Assign reward based on feedback (thumbs up/down, corrections)
Step 4: Run 3-stage semantic extraction
Step 5: Update experience library
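
As a rough outline, the whole flow looks like this (the classes and helper functions are introduced step by step below; file names are placeholders):

# Minimal outline of the pipeline (see the numbered steps below for the full code)
parser = ChatHistoryParser(format_type="sharegpt")
conversations = parser.parse("chat_logs.json")                    # parse chat logs
filtered = ChatDataFilter(min_reward=0.7).filter(conversations)   # quality filter
trajectories = conversations_to_trajectories(filtered)            # reward-annotated trajectories
# ...then run the 3-stage semantic extraction loop (Step 6) to update the experience library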

Prerequisites

# Install Gym with Continuous Learning support
pip install zoo-gym[grpo]

# Additional dependencies for chat parsing
pip install pandas jsonlines

Step 1: Chat History Format

Supported Input Formats

Format 1: ShareGPT Format

[
  {
    "id": "conv_001",
    "conversations": [
      {"from": "human", "value": "How do I solve x² + 2x + 5 = 0?"},
      {"from": "gpt", "value": "Using the quadratic formula: x = (-b ± √(b²-4ac)) / 2a\nWith a=1, b=2, c=5:\nx = (-2 ± √(4-20)) / 2\nx = (-2 ± √(-16)) / 2\nSince the discriminant is negative, there are no real solutions."}
    ],
    "metadata": {
      "feedback": "positive",
      "rating": 5
    }
  },
  {
    "id": "conv_002",
    "conversations": [
      {"from": "human", "value": "What's the derivative of x³?"},
      {"from": "gpt", "value": "The derivative is x²."},
      {"from": "human", "value": "No, that's wrong. It should be 3x²."},
      {"from": "gpt", "value": "You're absolutely right, I apologize. The derivative of x³ is 3x²."}
    ],
    "metadata": {
      "feedback": "corrected",
      "rating": 3
    }
  }
]

Format 2: OpenAI Chat Format

[
  {
    "messages": [
      {"role": "user", "content": "Solve quadratic: x² + 2x + 5 = 0"},
      {"role": "assistant", "content": "No real solutions (discriminant = -16 < 0)"}
    ],
    "feedback": {"thumbs_up": true}
  }
]

Format 3: Custom CSV

conversation_id,user_message,assistant_message,feedback,rating
conv_001,"Solve x² + 2x + 5 = 0","No real solutions",positive,5
conv_002,"Derivative of x³","x²",negative,2

Step 2: Parsing Chat History

Basic Parser

import json
from pathlib import Path
from typing import List, Dict

class ChatHistoryParser:
    """Parse various chat history formats into unified structure."""

    def __init__(self, format_type: str = "sharegpt"):
        """
        Args:
            format_type: One of ["sharegpt", "openai", "csv"]
        """
        self.format_type = format_type

    def parse(self, file_path: str) -> List[Dict]:
        """
        Parse chat history file.

        Returns:
            conversations: List of conversation dictionaries with:
                - query: User's question/prompt
                - response: Assistant's answer
                - reward: Numerical reward (0-1)
                - metadata: Additional info (timestamp, user_id, etc.)
        """
        if self.format_type == "sharegpt":
            return self._parse_sharegpt(file_path)
        elif self.format_type == "openai":
            return self._parse_openai(file_path)
        elif self.format_type == "csv":
            return self._parse_csv(file_path)
        else:
            raise ValueError(f"Unknown format: {self.format_type}")

    def _parse_sharegpt(self, file_path: str) -> List[Dict]:
        """Parse ShareGPT format."""
        with open(file_path) as f:
            data = json.load(f)

        conversations = []
        for item in data:
            # Extract first user message as query
            user_msgs = [m for m in item["conversations"] if m["from"] == "human"]
            asst_msgs = [m for m in item["conversations"] if m["from"] == "gpt"]

            if not user_msgs or not asst_msgs:
                continue

            # Combine multi-turn into single trajectory
            query = user_msgs[0]["value"]
            response = "\n".join([m["value"] for m in asst_msgs])

            # Extract reward from feedback
            reward = self._feedback_to_reward(item.get("metadata", {}))

            conversations.append({
                "query": query,
                "response": response,
                "reward": reward,
                "metadata": item.get("metadata", {})
            })

        return conversations

    def _parse_openai(self, file_path: str) -> List[Dict]:
        """Parse OpenAI chat format."""
        with open(file_path) as f:
            data = json.load(f)

        conversations = []
        for item in data:
            messages = item["messages"]

            # Extract query and response
            user_msgs = [m["content"] for m in messages if m["role"] == "user"]
            asst_msgs = [m["content"] for m in messages if m["role"] == "assistant"]

            if not user_msgs or not asst_msgs:
                continue

            query = user_msgs[0]
            response = asst_msgs[-1]  # Last assistant message

            # Extract reward
            reward = self._feedback_to_reward(item.get("feedback", {}))

            conversations.append({
                "query": query,
                "response": response,
                "reward": reward,
                "metadata": item.get("feedback", {})
            })

        return conversations

    def _parse_csv(self, file_path: str) -> List[Dict]:
        """Parse CSV format."""
        import pandas as pd

        df = pd.read_csv(file_path)

        conversations = []
        for _, row in df.iterrows():
            conversations.append({
                "query": row["user_message"],
                "response": row["assistant_message"],
                "reward": self._feedback_to_reward({"feedback": row.get("feedback"), "rating": row.get("rating")}),
                "metadata": {"conversation_id": row.get("conversation_id")}
            })

        return conversations

    def _feedback_to_reward(self, metadata: Dict) -> float:
        """
        Convert user feedback to numerical reward.

        Args:
            metadata: Dictionary with feedback info

        Returns:
            reward: Float in [0, 1]
        """
        # Check for explicit feedback
        if metadata.get("feedback") == "positive":
            return 1.0
        elif metadata.get("feedback") == "negative":
            return 0.0
        elif metadata.get("feedback") == "corrected":
            return 0.3  # Partial credit for eventually getting it right

        # Check for thumbs up/down
        if metadata.get("thumbs_up"):
            return 1.0
        elif metadata.get("thumbs_down"):
            return 0.0

        # Check for rating (1-5 scale)
        if "rating" in metadata:
            rating = metadata["rating"]
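            # e.g. rating 1 → 0.0, rating 3 → 0.5, rating 5 → 1.0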
            return (rating - 1) / 4  # Normalize to [0, 1]

        # Default: neutral reward (the Step 3 filter can drop these)
        return 0.5

# Usage
parser = ChatHistoryParser(format_type="sharegpt")
conversations = parser.parse("chat_logs.json")

print(f"Parsed {len(conversations)} conversations")
print(f"Average reward: {sum(c['reward'] for c in conversations) / len(conversations):.2f}")

Step 3: Filtering and Quality Control

Not all chat history is suitable for learning. Apply filters to ensure quality:

class ChatDataFilter:
    """Filter chat history for quality and relevance."""

    def __init__(
        self,
        min_reward: float = 0.5,
        min_length: int = 10,
        max_length: int = 2000,
        require_feedback: bool = True
    ):
        """
        Args:
            min_reward: Minimum reward threshold (keep only positive examples)
            min_length: Minimum response length (chars)
            max_length: Maximum response length (chars)
            require_feedback: Only keep conversations with explicit feedback
        """
        self.min_reward = min_reward
        self.min_length = min_length
        self.max_length = max_length
        self.require_feedback = require_feedback

    def filter(self, conversations: List[Dict]) -> List[Dict]:
        """Apply all filters."""
        filtered = []

        for conv in conversations:
            # Check reward
            if conv["reward"] < self.min_reward:
                continue

            # Check length
            response_len = len(conv["response"])
            if response_len < self.min_length or response_len > self.max_length:
                continue

            # Require some feedback metadata
            if self.require_feedback and not conv.get("metadata"):
                continue

            # Check for toxic content (optional)
            if self._is_toxic(conv["response"]):
                continue

            filtered.append(conv)

        return filtered

    def _is_toxic(self, text: str) -> bool:
        """Check for toxic/inappropriate content."""
        # Simple keyword-based check (use proper moderation API in production)
        toxic_keywords = ["offensive", "inappropriate", "harmful"]
        return any(kw in text.lower() for kw in toxic_keywords)

# Usage
data_filter = ChatDataFilter(min_reward=0.7, require_feedback=True)
filtered_convs = data_filter.filter(conversations)

print(f"After filtering: {len(filtered_convs)}/{len(conversations)} conversations")

Step 4: Creating Trajectories

Convert filtered conversations into Trajectory objects:

from gym.train.grpo.semantic_extractor import Trajectory

def conversations_to_trajectories(conversations: List[Dict]) -> List[Trajectory]:
    """
    Convert parsed conversations to Trajectory objects.

    Args:
        conversations: List of conversation dictionaries

    Returns:
        trajectories: List of Trajectory objects
    """
    trajectories = []

    for conv in conversations:
        traj = Trajectory(
            query=conv["query"],
            output=conv["response"],
            reward=conv["reward"],
            groundtruth=None,  # Chat logs usually don't have ground truth
            summary=None  # Will be filled by Stage 1
        )
        trajectories.append(traj)

    return trajectories

# Usage
trajectories = conversations_to_trajectories(filtered_convs)
print(f"Created {len(trajectories)} trajectories")

Step 5: Extracting Ground Truth (Optional)

If you have user corrections, extract them as ground truth:

def extract_ground_truth_from_corrections(conversations: List[Dict]) -> List[Dict]:
    """
    Extract ground truth from multi-turn conversations with corrections.

    Args:
        conversations: Parsed conversations

    Returns:
        enhanced_conversations: Conversations with ground truth added
    """
    enhanced = []

    for conv in conversations:
        # Check for correction pattern in metadata
        if conv.get("metadata", {}).get("feedback") == "corrected":
            # Look for correction in later messages
            # (This is format-specific - adjust for your data)
            # Example: "No, the answer should be X" → groundtruth = "X"

            # Simple heuristic: extract text after "should be" in the combined
            # assistant response (adapt this if corrections appear in user messages)
            response = conv["response"]
            if "should be" in response.lower():
                parts = response.lower().split("should be")
                if len(parts) > 1:
                    groundtruth = parts[1].strip().split(".")[0]
                    conv["groundtruth"] = groundtruth

        enhanced.append(conv)

    return enhanced

# Usage
enhanced_convs = extract_ground_truth_from_corrections(conversations)

Step 6: Running Continuous Learning

Now run the full Continuous Learning pipeline:

from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient

# 1. Initialize components
experience_manager = ExperienceManager(checkpoint_path="./chat_experiences.json")
llm_client = LLMClient(api_key="sk-xxx", model="deepseek-chat")
extractor = SemanticExtractor(llm_client, max_operations=3)

# 2. Parse and filter chat history
parser = ChatHistoryParser(format_type="sharegpt")
conversations = parser.parse("production_chat_logs.json")

data_filter = ChatDataFilter(min_reward=0.7, require_feedback=True)
filtered_convs = data_filter.filter(conversations)

trajectories = conversations_to_trajectories(filtered_convs)

print(f"Training on {len(trajectories)} conversations from chat logs")

# 3. Run Continuous Learning
batch_size = 10  # Process 10 conversations at a time
num_epochs = 3

for epoch in range(num_epochs):
    print(f"\n=== Epoch {epoch+1}/{num_epochs} ===")

    # Process in batches
    for i in range(0, len(trajectories), batch_size):
        batch = trajectories[i:i+batch_size]

        # Stage 1: Summarize each trajectory
        for traj in batch:
            traj.summary = extractor.summarize_trajectory(
                traj,
                use_groundtruth=traj.groundtruth is not None
            )

        # Form groups of trajectories (simulating GRPO's group structure)
        # In real GRPO, we'd generate multiple rollouts per query;
        # here we simply chunk the batch into fixed-size groups
        groups = []
        current_group = []

        for traj in batch:
            current_group.append(traj)
            if len(current_group) == 5:  # Group size
                groups.append(current_group)
                current_group = []

        if current_group:
            groups.append(current_group)  # Keep the trailing partial group

        # Process each group
        all_operations = []
        for group in groups:
            # Check if group has variation
            rewards = [t.reward for t in group]
            if len(set(rewards)) <= 1:
                continue  # Skip homogeneous groups

            # Stage 2: Extract group advantage
            experiences_str = experience_manager.format_for_prompt()
            operations = extractor.extract_group_advantage(
                group,
                experiences_str,
                use_groundtruth=any(t.groundtruth for t in group)
            )
            all_operations.append(operations)

        # Stage 3: Batch consolidation
        if all_operations:
            experiences_str = experience_manager.format_for_prompt()
            final_ops = extractor.consolidate_batch(all_operations, experiences_str)

            # Apply operations
            experience_manager.apply_operations(final_ops)

    # Save checkpoint
    experience_manager.save(f"./chat_experiences_epoch{epoch+1}.json")
    print(f"Experience library size: {len(experience_manager)}")

# 4. View results
print("\n=== Final Experience Library ===")
print(experience_manager.format_for_prompt())

Step 7: Privacy and Compliance

When using production chat logs, ensure compliance:

Anonymization

import re
import hashlib

class ChatAnonymizer:
    """Anonymize sensitive information in chat logs."""

    def __init__(self):
        self.patterns = {
            "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
            "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
            "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
            "credit_card": r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
        }

    def anonymize(self, text: str) -> str:
        """Remove or hash sensitive information."""
        anonymized = text

        # Replace emails
        anonymized = re.sub(
            self.patterns["email"],
            lambda m: self._hash_value(m.group()),
            anonymized
        )

        # Replace phone numbers
        anonymized = re.sub(self.patterns["phone"], "[PHONE]", anonymized)

        # Replace SSNs
        anonymized = re.sub(self.patterns["ssn"], "[SSN]", anonymized)

        # Replace credit cards
        anonymized = re.sub(self.patterns["credit_card"], "[CARD]", anonymized)

        return anonymized

    def _hash_value(self, value: str) -> str:
        """Hash sensitive value for consistency."""
        return hashlib.sha256(value.encode()).hexdigest()[:8]

# Usage
anonymizer = ChatAnonymizer()

for conv in conversations:
    conv["query"] = anonymizer.anonymize(conv["query"])
    conv["response"] = anonymizer.anonymize(conv["response"])

Consent Filtering

# Include only conversations with explicit consent
def filter_by_consent(conversations: List[Dict]) -> List[Dict]:
    """Keep only conversations where user consented to training."""
    return [
        conv for conv in conversations
        if conv.get("metadata", {}).get("consent_to_training", False)
    ]

consented_convs = filter_by_consent(conversations)

Step 8: Evaluation

Measure impact of chat-derived experiences:

def evaluate_chat_learning(
    experience_manager: ExperienceManager,
    validation_set: List[Dict]
) -> Dict[str, float]:
    """
    Evaluate model with chat-derived experiences.

    Args:
        experience_manager: Experience manager with chat-derived experiences
        validation_set: Validation conversations

    Returns:
        metrics: Dictionary of evaluation metrics
    """
    from gym.train.grpo.api_model_adapter import DeepSeekAdapter

    # Initialize model
    adapter = DeepSeekAdapter(api_key="sk-xxx")

    # Evaluate with and without experiences
    results_with = []
    results_without = []

    for val_conv in validation_set:
        query = val_conv["query"]
        expected_reward = val_conv["reward"]

        # With experiences
        experiences = experience_manager.format_for_prompt()
        response_with = adapter.generate_with_experiences(query, experiences)
        reward_with = evaluate_response(response_with, val_conv["response"])
        results_with.append(reward_with)

        # Without experiences
        response_without = adapter.generate(query)
        reward_without = evaluate_response(response_without, val_conv["response"])
        results_without.append(reward_without)

    # Calculate metrics
    metrics = {
        "accuracy_with_experiences": sum(results_with) / len(results_with),
        "accuracy_without_experiences": sum(results_without) / len(results_without),
        "improvement": (sum(results_with) - sum(results_without)) / len(results_with)
    }

    return metrics

def evaluate_response(response: str, reference: str) -> float:
    """Simple similarity-based reward."""
    # Use more sophisticated evaluation in production
    # (BLEU, ROUGE, semantic similarity, etc.)
    return 1.0 if response.strip() == reference.strip() else 0.0

# Usage
validation_convs = parser.parse("validation_chat_logs.json")
metrics = evaluate_chat_learning(experience_manager, validation_convs)

print(f"Accuracy with experiences: {metrics['accuracy_with_experiences']:.1%}")
print(f"Accuracy without: {metrics['accuracy_without_experiences']:.1%}")
print(f"Improvement: {metrics['improvement']:+.1%}")

Complete Example

Here's a full end-to-end script:

#!/usr/bin/env python3
"""
Convert production chat logs to semantic experiences.

Usage:
    python chat_to_experiences.py \
        --chat_logs production_logs.json \
        --format sharegpt \
        --output_dir ./chat_experiences \
        --api_key sk-xxx
"""

import argparse
from pathlib import Path

from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient

# Helpers defined earlier in this tutorial; import them from wherever you saved
# them (e.g. a local chat_utils.py module; the module name is a placeholder)
from chat_utils import ChatHistoryParser, ChatDataFilter, conversations_to_trajectories

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--chat_logs", required=True, help="Chat history file")
    parser.add_argument("--format", default="sharegpt", choices=["sharegpt", "openai", "csv"])
    parser.add_argument("--output_dir", default="./chat_experiences")
    parser.add_argument("--api_key", required=True, help="API key for LLM")
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=10)
    parser.add_argument("--min_reward", type=float, default=0.7)
    args = parser.parse_args()

    # Setup output directory
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Parse chat history
    print("📖 Parsing chat history...")
    chat_parser = ChatHistoryParser(format_type=args.format)
    conversations = chat_parser.parse(args.chat_logs)
    print(f"   Loaded {len(conversations)} conversations")

    # Filter
    print("🔍 Filtering for quality...")
    data_filter = ChatDataFilter(min_reward=args.min_reward)
    filtered_convs = data_filter.filter(conversations)
    print(f"   Kept {len(filtered_convs)}/{len(conversations)} high-quality conversations")

    # Convert to trajectories
    trajectories = conversations_to_trajectories(filtered_convs)

    # Initialize components
    print("🛠️  Initializing Continuous Learning components...")
    experience_manager = ExperienceManager()
    llm_client = LLMClient(api_key=args.api_key)
    extractor = SemanticExtractor(llm_client)

    # Run Continuous Learning
    print(f"\n🚀 Starting Continuous Learning ({args.num_epochs} epochs)...")

    for epoch in range(args.num_epochs):
        print(f"\n=== Epoch {epoch+1}/{args.num_epochs} ===")

        # Process in batches
        for i in range(0, len(trajectories), args.batch_size):
            batch = trajectories[i:i+args.batch_size]

            # Stage 1: Summarize
            for traj in batch:
                traj.summary = extractor.summarize_trajectory(traj)

            # Group and extract advantages
            # (Simplified - see full implementation above)
            groups = [batch[j:j+5] for j in range(0, len(batch), 5)]
            all_operations = []

            for group in groups:
                if len(set(t.reward for t in group)) > 1:
                    experiences_str = experience_manager.format_for_prompt()
                    ops = extractor.extract_group_advantage(group, experiences_str)
                    all_operations.append(ops)

            # Stage 3: Consolidate
            if all_operations:
                experiences_str = experience_manager.format_for_prompt()
                final_ops = extractor.consolidate_batch(all_operations, experiences_str)
                experience_manager.apply_operations(final_ops)

        # Save checkpoint
        checkpoint_path = output_dir / f"experiences_epoch{epoch+1}.json"
        experience_manager.save(str(checkpoint_path))
        print(f"   💾 Saved checkpoint: {checkpoint_path}")
        print(f"   📊 Experience library size: {len(experience_manager)}")

    # Final output
    print("\n✅ Continuous Learning complete!")
    print(f"\n=== Final Experience Library ({len(experience_manager)} experiences) ===")
    print(experience_manager.format_for_prompt())

    # Save final version
    final_path = output_dir / "experiences_final.json"
    experience_manager.save(str(final_path))
    print(f"\n💾 Saved final library: {final_path}")

if __name__ == "__main__":
    main()

Best Practices

1. Incremental Updates

Don't retrain from scratch each time - load existing experiences and add new ones:

# Load existing experiences
experience_manager = ExperienceManager(checkpoint_path="./existing_experiences.json")
print(f"Loaded {len(experience_manager)} existing experiences")

# Add new chat data
new_conversations = parser.parse("new_chat_logs.json")
# ... run continuous learning ...

# Result: experiences from both old and new data

2. Domain Separation

Keep separate experience libraries for different topics:

# Classify conversations by domain
def classify_domain(query: str) -> str:
    """Classify query into domain."""
    keywords = {
        "math": ["equation", "derivative", "integral", "solve"],
        "coding": ["function", "error", "debug", "code"],
        "general": []
    }

    for domain, kws in keywords.items():
        if any(kw in query.lower() for kw in kws):
            return domain

    return "general"

# Create domain-specific managers
managers = {
    "math": ExperienceManager("./math_experiences.json"),
    "coding": ExperienceManager("./coding_experiences.json"),
    "general": ExperienceManager("./general_experiences.json")
}

# Route conversations to appropriate manager
for conv in conversations:
    domain = classify_domain(conv["query"])
    # ... extract experiences using managers[domain] ...

3. Continuous Monitoring

Track experience quality over time:

import json
from datetime import datetime

def track_experience_metrics(experience_manager, epoch):
    """Log metrics for monitoring."""
    metrics = {
        "epoch": epoch,
        "num_experiences": len(experience_manager),
        "avg_word_count": sum(
            len(exp.split()) for exp in experience_manager.experiences.values()
        ) / max(len(experience_manager), 1),
        "timestamp": datetime.now().isoformat()
    }

    # Log to file
    with open("experience_metrics.jsonl", "a") as f:
        f.write(json.dumps(metrics) + "\n")

    return metrics
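
# Usage (call once per epoch inside the training loop)
metrics = track_experience_metrics(experience_manager, epoch)
print(f"Epoch {metrics['epoch']}: {metrics['num_experiences']} experiences")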

Troubleshooting

Issue: No ground truth available

Solution: Use self-discrimination via multiple rollouts

# Generate multiple responses, use majority vote as "ground truth"
responses = [adapter.generate(query) for _ in range(5)]
majority_response = max(set(responses), key=responses.count)
# Use as pseudo-ground-truth
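# Attach it to the matching trajectory before Stage 1 (hypothetical `traj` variable)
traj.groundtruth = majority_response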

Issue: Chat logs too large to process

Solution: Sample strategically

# Prioritize high-reward conversations
sorted_convs = sorted(conversations, key=lambda c: c["reward"], reverse=True)
sampled = sorted_convs[:1000]  # Top 1000

# Or random sample
import random
sampled = random.sample(conversations, 1000)

Issue: Experiences too chat-specific

Solution: Use consolidation to generalize

# Force consolidation after each epoch
final_ops = extractor.consolidate_batch(
    all_operations=[],  # Empty - just consolidate existing
    experiences=experience_manager.format_for_prompt()
)
# This will merge similar experiences and remove specifics

Summary

Converting chat history to experiences enables:

✅ Learn from production - Real user interactions
✅ Zero annotation cost - Use existing logs
✅ Continuous improvement - Update as users interact
✅ Privacy-friendly - Experiences are generalizations, not memorized data

Next Steps:

1. Try the Custom Agent Tutorial
2. Check API Reference
3. Read Main Documentation


Tutorial Last Updated: October 28, 2025