Converting Chat History to Experiences - Tutorial¶
Introduction¶
One of the most powerful features of Continuous Learning GRPO is the ability to learn from real user interactions. Instead of curated training datasets, you can extract experiences directly from chat logs, user feedback, and production conversations.
This tutorial shows you how to convert chat history into semantic experiences that improve your AI model.
Why Chat History?¶
Traditional Training Data:
- Manually curated
- Expensive to create
- May not reflect real usage

Chat History:
{
"messages": [
{"role": "user", "content": "Help me solve x² + 2x + 5 = 0"},
{"role": "assistant", "content": "Let me calculate the discriminant..."},
{"role": "user", "content": "Wait, there's no real solution right?"},
{"role": "assistant", "content": "You're absolutely right! b²-4ac = -16 < 0, so no real solutions."}
],
"feedback": "helpful"
}
Overview of the Process¶
Chat Logs → Trajectory Extraction → Reward Assignment → Experience Extraction → Library Update
Step 1: Parse chat history into individual conversations
Step 2: Extract query, response, and user feedback
Step 3: Assign reward based on feedback (thumbs up/down, corrections)
Step 4: Run 3-stage semantic extraction
Step 5: Update experience library
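In code, these five steps boil down to a handful of calls. The snippet below is only a preview; it assumes the ChatHistoryParser, ChatDataFilter, and conversations_to_trajectories helpers built in the sections that follow.
# Preview of the full pipeline (each piece is built in the sections below)
parser = ChatHistoryParser(format_type="sharegpt")
conversations = parser.parse("chat_logs.json")               # parse + assign rewards from feedback
data_filter = ChatDataFilter(min_reward=0.7)                 # quality filtering
trajectories = conversations_to_trajectories(data_filter.filter(conversations))
# ...then run the 3-stage semantic extraction loop over `trajectories` (Step 6)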
Prerequisites¶
# Install Gym with Continuous Learning support
pip install zoo-gym[grpo]
# Additional dependencies for chat parsing
pip install pandas jsonlines
Step 1: Chat History Format¶
Supported Input Formats¶
Format 1: ShareGPT Style (Recommended)¶
[
{
"id": "conv_001",
"conversations": [
{"from": "human", "value": "How do I solve x² + 2x + 5 = 0?"},
{"from": "gpt", "value": "Using the quadratic formula: x = (-b ± √(b²-4ac)) / 2a\nWith a=1, b=2, c=5:\nx = (-2 ± √(4-20)) / 2\nx = (-2 ± √(-16)) / 2\nSince the discriminant is negative, there are no real solutions."}
],
"metadata": {
"feedback": "positive",
"rating": 5
}
},
{
"id": "conv_002",
"conversations": [
{"from": "human", "value": "What's the derivative of x³?"},
{"from": "gpt", "value": "The derivative is x²."},
{"from": "human", "value": "No, that's wrong. It should be 3x²."},
{"from": "gpt", "value": "You're absolutely right, I apologize. The derivative of x³ is 3x²."}
],
"metadata": {
"feedback": "corrected",
"rating": 3
}
}
]
Format 2: OpenAI Chat Format¶
[
{
"messages": [
{"role": "user", "content": "Solve quadratic: x² + 2x + 5 = 0"},
{"role": "assistant", "content": "No real solutions (discriminant = -16 < 0)"}
],
"feedback": {"thumbs_up": true}
}
]
Format 3: Custom CSV¶
conversation_id,user_message,assistant_message,feedback,rating
conv_001,"Solve x² + 2x + 5 = 0","No real solutions",positive,5
conv_002,"Derivative of x³","x²",negative,2
Step 2: Parsing Chat History¶
Basic Parser¶
import json
from pathlib import Path
from typing import List, Dict
class ChatHistoryParser:
"""Parse various chat history formats into unified structure."""
def __init__(self, format_type: str = "sharegpt"):
"""
Args:
format_type: One of ["sharegpt", "openai", "csv"]
"""
self.format_type = format_type
def parse(self, file_path: str) -> List[Dict]:
"""
Parse chat history file.
Returns:
conversations: List of conversation dictionaries with:
- query: User's question/prompt
- response: Assistant's answer
- reward: Numerical reward (0-1)
- metadata: Additional info (timestamp, user_id, etc.)
"""
if self.format_type == "sharegpt":
return self._parse_sharegpt(file_path)
elif self.format_type == "openai":
return self._parse_openai(file_path)
elif self.format_type == "csv":
return self._parse_csv(file_path)
else:
raise ValueError(f"Unknown format: {self.format_type}")
def _parse_sharegpt(self, file_path: str) -> List[Dict]:
"""Parse ShareGPT format."""
with open(file_path) as f:
data = json.load(f)
conversations = []
for item in data:
# Extract first user message as query
user_msgs = [m for m in item["conversations"] if m["from"] == "human"]
asst_msgs = [m for m in item["conversations"] if m["from"] == "gpt"]
if not user_msgs or not asst_msgs:
continue
# Combine multi-turn into single trajectory
query = user_msgs[0]["value"]
response = "\n".join([m["value"] for m in asst_msgs])
# Extract reward from feedback
reward = self._feedback_to_reward(item.get("metadata", {}))
conversations.append({
"query": query,
"response": response,
"reward": reward,
"metadata": item.get("metadata", {})
})
return conversations
def _parse_openai(self, file_path: str) -> List[Dict]:
"""Parse OpenAI chat format."""
with open(file_path) as f:
data = json.load(f)
conversations = []
for item in data:
messages = item["messages"]
# Extract query and response
user_msgs = [m["content"] for m in messages if m["role"] == "user"]
asst_msgs = [m["content"] for m in messages if m["role"] == "assistant"]
if not user_msgs or not asst_msgs:
continue
query = user_msgs[0]
response = asst_msgs[-1] # Last assistant message
# Extract reward
reward = self._feedback_to_reward(item.get("feedback", {}))
conversations.append({
"query": query,
"response": response,
"reward": reward,
"metadata": item.get("feedback", {})
})
return conversations
def _parse_csv(self, file_path: str) -> List[Dict]:
"""Parse CSV format."""
import pandas as pd
df = pd.read_csv(file_path)
conversations = []
for _, row in df.iterrows():
conversations.append({
"query": row["user_message"],
"response": row["assistant_message"],
"reward": self._feedback_to_reward({"feedback": row.get("feedback"), "rating": row.get("rating")}),
"metadata": {"conversation_id": row.get("conversation_id")}
})
return conversations
def _feedback_to_reward(self, metadata: Dict) -> float:
"""
Convert user feedback to numerical reward.
Args:
metadata: Dictionary with feedback info
Returns:
reward: Float in [0, 1]
"""
# Check for explicit feedback
if metadata.get("feedback") == "positive":
return 1.0
elif metadata.get("feedback") == "negative":
return 0.0
elif metadata.get("feedback") == "corrected":
return 0.3 # Partial credit for eventually getting it right
# Check for thumbs up/down
if metadata.get("thumbs_up"):
return 1.0
elif metadata.get("thumbs_down"):
return 0.0
# Check for rating (1-5 scale)
if "rating" in metadata:
rating = metadata["rating"]
return (rating - 1) / 4 # Normalize to [0, 1]
# No explicit feedback signal: return a neutral reward (downstream filters can drop these)
return 0.5
# Usage
parser = ChatHistoryParser(format_type="sharegpt")
conversations = parser.parse("chat_logs.json")
print(f"Parsed {len(conversations)} conversations")
print(f"Average reward: {sum(c['reward'] for c in conversations) / len(conversations):.2f}")
Step 3: Filtering and Quality Control¶
Not all chat history is suitable for learning. Apply filters to ensure quality:
class ChatDataFilter:
"""Filter chat history for quality and relevance."""
def __init__(
self,
min_reward: float = 0.5,
min_length: int = 10,
max_length: int = 2000,
require_feedback: bool = True
):
"""
Args:
min_reward: Minimum reward threshold (keep only positive examples)
min_length: Minimum response length (chars)
max_length: Maximum response length (chars)
require_feedback: Only keep conversations with explicit feedback
"""
self.min_reward = min_reward
self.min_length = min_length
self.max_length = max_length
self.require_feedback = require_feedback
def filter(self, conversations: List[Dict]) -> List[Dict]:
"""Apply all filters."""
filtered = []
for conv in conversations:
# Check reward
if conv["reward"] < self.min_reward:
continue
# Check length
response_len = len(conv["response"])
if response_len < self.min_length or response_len > self.max_length:
continue
# Check for explicit feedback (a feedback label, rating, or thumbs signal)
metadata = conv.get("metadata") or {}
if self.require_feedback and not any(
    k in metadata for k in ("feedback", "rating", "thumbs_up", "thumbs_down")
):
    continue
# Check for toxic content (optional)
if self._is_toxic(conv["response"]):
continue
filtered.append(conv)
return filtered
def _is_toxic(self, text: str) -> bool:
"""Check for toxic/inappropriate content."""
# Simple keyword-based check (use proper moderation API in production)
toxic_keywords = ["offensive", "inappropriate", "harmful"]
return any(kw in text.lower() for kw in toxic_keywords)
# Usage
data_filter = ChatDataFilter(min_reward=0.7, require_feedback=True)
filtered_convs = data_filter.filter(conversations)
print(f"After filtering: {len(filtered_convs)}/{len(conversations)} conversations")
Step 4: Creating Trajectories¶
Convert filtered conversations into Trajectory objects:
from gym.train.grpo.semantic_extractor import Trajectory
def conversations_to_trajectories(conversations: List[Dict]) -> List[Trajectory]:
"""
Convert parsed conversations to Trajectory objects.
Args:
conversations: List of conversation dictionaries
Returns:
trajectories: List of Trajectory objects
"""
trajectories = []
for conv in conversations:
traj = Trajectory(
query=conv["query"],
output=conv["response"],
reward=conv["reward"],
groundtruth=conv.get("groundtruth"),  # Usually None; Step 5 can fill this from user corrections
summary=None # Will be filled by Stage 1
)
trajectories.append(traj)
return trajectories
# Usage
trajectories = conversations_to_trajectories(filtered_convs)
print(f"Created {len(trajectories)} trajectories")
Step 5: Extracting Ground Truth (Optional)¶
If you have user corrections, extract them as ground truth:
def extract_ground_truth_from_corrections(conversations: List[Dict]) -> List[Dict]:
"""
Extract ground truth from multi-turn conversations with corrections.
Args:
conversations: Parsed conversations
Returns:
enhanced_conversations: Conversations with ground truth added
"""
enhanced = []
for conv in conversations:
# Check for correction pattern in metadata
if conv.get("metadata", {}).get("feedback") == "corrected":
# Look for correction in later messages
# (This is format-specific - adjust for your data)
# Example: "No, the answer should be X" → groundtruth = "X"
# Simple heuristic: extract text after "should be"
response = conv["response"]
if "should be" in response.lower():
parts = response.lower().split("should be")
if len(parts) > 1:
groundtruth = parts[1].strip().split(".")[0]
conv["groundtruth"] = groundtruth
enhanced.append(conv)
return enhanced
# Usage
enhanced_convs = extract_ground_truth_from_corrections(conversations)
Step 6: Running Continuous Learning¶
Now run the full Continuous Learning pipeline:
from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient
# 1. Initialize components
experience_manager = ExperienceManager(checkpoint_path="./chat_experiences.json")
llm_client = LLMClient(api_key="sk-xxx", model="deepseek-chat")
extractor = SemanticExtractor(llm_client, max_operations=3)
# 2. Parse and filter chat history
parser = ChatHistoryParser(format_type="sharegpt")
conversations = parser.parse("production_chat_logs.json")
data_filter = ChatDataFilter(min_reward=0.7, require_feedback=True)
filtered_convs = data_filter.filter(conversations)
trajectories = conversations_to_trajectories(filtered_convs)
print(f"Training on {len(trajectories)} conversations from chat logs")
# 3. Run Continuous Learning
batch_size = 10 # Process 10 conversations at a time
num_epochs = 3
for epoch in range(num_epochs):
print(f"\n=== Epoch {epoch+1}/{num_epochs} ===")
# Process in batches
for i in range(0, len(trajectories), batch_size):
batch = trajectories[i:i+batch_size]
# Stage 1: Summarize each trajectory
for traj in batch:
traj.summary = extractor.summarize_trajectory(
traj,
use_groundtruth=traj.groundtruth is not None
)
# Group trajectories by reward (simulate group structure)
# In real GRPO, we'd generate multiple rollouts per query
# Here, we group similar reward levels
groups = []
current_group = []
for traj in batch:
    current_group.append(traj)
    if len(current_group) == 5:  # Group size
        groups.append(current_group)
        current_group = []
if current_group:  # Keep the leftover partial group so no trajectories are dropped
    groups.append(current_group)
# Process each group
all_operations = []
for group in groups:
# Check if group has variation
rewards = [t.reward for t in group]
if len(set(rewards)) <= 1:
continue # Skip homogeneous groups
# Stage 2: Extract group advantage
experiences_str = experience_manager.format_for_prompt()
operations = extractor.extract_group_advantage(
group,
experiences_str,
use_groundtruth=any(t.groundtruth for t in group)
)
all_operations.append(operations)
# Stage 3: Batch consolidation
if all_operations:
experiences_str = experience_manager.format_for_prompt()
final_ops = extractor.consolidate_batch(all_operations, experiences_str)
# Apply operations
experience_manager.apply_operations(final_ops)
# Save checkpoint
experience_manager.save(f"./chat_experiences_epoch{epoch+1}.json")
print(f"Experience library size: {len(experience_manager)}")
# 4. View results
print("\n=== Final Experience Library ===")
print(experience_manager.format_for_prompt())
Step 7: Privacy and Compliance¶
When using production chat logs, ensure compliance:
Anonymization¶
import re
import hashlib
class ChatAnonymizer:
"""Anonymize sensitive information in chat logs."""
def __init__(self):
self.patterns = {
"email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
"phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
"ssn": r'\b\d{3}-\d{2}-\d{4}\b',
"credit_card": r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
}
def anonymize(self, text: str) -> str:
"""Remove or hash sensitive information."""
anonymized = text
# Replace emails
anonymized = re.sub(
self.patterns["email"],
lambda m: self._hash_value(m.group()),
anonymized
)
# Replace phone numbers
anonymized = re.sub(self.patterns["phone"], "[PHONE]", anonymized)
# Replace SSNs
anonymized = re.sub(self.patterns["ssn"], "[SSN]", anonymized)
# Replace credit cards
anonymized = re.sub(self.patterns["credit_card"], "[CARD]", anonymized)
return anonymized
def _hash_value(self, value: str) -> str:
"""Hash sensitive value for consistency."""
return hashlib.sha256(value.encode()).hexdigest()[:8]
# Usage
anonymizer = ChatAnonymizer()
for conv in conversations:
conv["query"] = anonymizer.anonymize(conv["query"])
conv["response"] = anonymizer.anonymize(conv["response"])
User Consent¶
# Include only conversations with explicit consent
def filter_by_consent(conversations: List[Dict]) -> List[Dict]:
"""Keep only conversations where user consented to training."""
return [
conv for conv in conversations
if conv.get("metadata", {}).get("consent_to_training", False)
]
consented_convs = filter_by_consent(conversations)
Step 8: Evaluation¶
Measure impact of chat-derived experiences:
def evaluate_chat_learning(
experience_manager: ExperienceManager,
validation_set: List[Dict]
) -> Dict[str, float]:
"""
Evaluate model with chat-derived experiences.
Args:
experience_manager: Experience manager with chat-derived experiences
validation_set: Validation conversations
Returns:
metrics: Dictionary of evaluation metrics
"""
from gym.train.grpo.api_model_adapter import DeepSeekAdapter
# Initialize model
adapter = DeepSeekAdapter(api_key="sk-xxx")
# Evaluate with and without experiences
results_with = []
results_without = []
for val_conv in validation_set:
query = val_conv["query"]
expected_reward = val_conv["reward"]
# With experiences
experiences = experience_manager.format_for_prompt()
response_with = adapter.generate_with_experiences(query, experiences)
reward_with = evaluate_response(response_with, val_conv["response"])
results_with.append(reward_with)
# Without experiences
response_without = adapter.generate(query)
reward_without = evaluate_response(response_without, val_conv["response"])
results_without.append(reward_without)
# Calculate metrics
metrics = {
"accuracy_with_experiences": sum(results_with) / len(results_with),
"accuracy_without_experiences": sum(results_without) / len(results_without),
"improvement": (sum(results_with) - sum(results_without)) / len(results_with)
}
return metrics
def evaluate_response(response: str, reference: str) -> float:
"""Simple similarity-based reward."""
# Use more sophisticated evaluation in production
# (BLEU, ROUGE, semantic similarity, etc.)
return 1.0 if response.strip() == reference.strip() else 0.0
# Usage
validation_convs = parser.parse("validation_chat_logs.json")
metrics = evaluate_chat_learning(experience_manager, validation_convs)
print(f"Accuracy with experiences: {metrics['accuracy_with_experiences']:.1%}")
print(f"Accuracy without: {metrics['accuracy_without_experiences']:.1%}")
print(f"Improvement: {metrics['improvement']:+.1%}")
Complete Example¶
Here's a full end-to-end script:
#!/usr/bin/env python3
"""
Convert production chat logs to semantic experiences.
Usage:
python chat_to_experiences.py \
--chat_logs production_logs.json \
--format sharegpt \
--output_dir ./chat_experiences \
--api_key sk-xxx
"""
import argparse
from pathlib import Path
from gym.train.grpo.experience_manager import ExperienceManager
from gym.train.grpo.semantic_extractor import SemanticExtractor, LLMClient

# ChatHistoryParser, ChatDataFilter, and conversations_to_trajectories are the
# helpers defined in Steps 2-4; import them or paste them into this script.
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--chat_logs", required=True, help="Chat history file")
parser.add_argument("--format", default="sharegpt", choices=["sharegpt", "openai", "csv"])
parser.add_argument("--output_dir", default="./chat_experiences")
parser.add_argument("--api_key", required=True, help="API key for LLM")
parser.add_argument("--num_epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=10)
parser.add_argument("--min_reward", type=float, default=0.7)
args = parser.parse_args()
# Setup output directory
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
# Parse chat history
print("📖 Parsing chat history...")
chat_parser = ChatHistoryParser(format_type=args.format)
conversations = chat_parser.parse(args.chat_logs)
print(f" Loaded {len(conversations)} conversations")
# Filter
print("🔍 Filtering for quality...")
data_filter = ChatDataFilter(min_reward=args.min_reward)
filtered_convs = data_filter.filter(conversations)
print(f" Kept {len(filtered_convs)}/{len(conversations)} high-quality conversations")
# Convert to trajectories
trajectories = conversations_to_trajectories(filtered_convs)
# Initialize components
print("🛠️ Initializing Continuous Learning components...")
experience_manager = ExperienceManager()
llm_client = LLMClient(api_key=args.api_key)
extractor = SemanticExtractor(llm_client)
# Run Continuous Learning
print(f"\n🚀 Starting Continuous Learning ({args.num_epochs} epochs)...")
for epoch in range(args.num_epochs):
print(f"\n=== Epoch {epoch+1}/{args.num_epochs} ===")
# Process in batches
for i in range(0, len(trajectories), args.batch_size):
batch = trajectories[i:i+args.batch_size]
# Stage 1: Summarize
for traj in batch:
traj.summary = extractor.summarize_trajectory(traj)
# Group and extract advantages
# (Simplified - see full implementation above)
groups = [batch[j:j+5] for j in range(0, len(batch), 5)]
all_operations = []
for group in groups:
if len(set(t.reward for t in group)) > 1:
experiences_str = experience_manager.format_for_prompt()
ops = extractor.extract_group_advantage(group, experiences_str)
all_operations.append(ops)
# Stage 3: Consolidate
if all_operations:
experiences_str = experience_manager.format_for_prompt()
final_ops = extractor.consolidate_batch(all_operations, experiences_str)
experience_manager.apply_operations(final_ops)
# Save checkpoint
checkpoint_path = output_dir / f"experiences_epoch{epoch+1}.json"
experience_manager.save(str(checkpoint_path))
print(f" 💾 Saved checkpoint: {checkpoint_path}")
print(f" 📊 Experience library size: {len(experience_manager)}")
# Final output
print("\n✅ Continuous Learning complete!")
print(f"\n=== Final Experience Library ({len(experience_manager)} experiences) ===")
print(experience_manager.format_for_prompt())
# Save final version
final_path = output_dir / "experiences_final.json"
experience_manager.save(str(final_path))
print(f"\n💾 Saved final library: {final_path}")
if __name__ == "__main__":
main()
Best Practices¶
1. Incremental Updates¶
Don't retrain from scratch each time - load existing experiences and add new ones:
# Load existing experiences
experience_manager = ExperienceManager(checkpoint_path="./existing_experiences.json")
print(f"Loaded {len(experience_manager)} existing experiences")
# Add new chat data
new_conversations = parser.parse("new_chat_logs.json")
# ... run continuous learning ...
# Result: experiences from both old and new data
2. Domain Separation¶
Keep separate experience libraries for different topics:
# Classify conversations by domain
def classify_domain(query: str) -> str:
"""Classify query into domain."""
keywords = {
"math": ["equation", "derivative", "integral", "solve"],
"coding": ["function", "error", "debug", "code"],
"general": []
}
for domain, kws in keywords.items():
if any(kw in query.lower() for kw in kws):
return domain
return "general"
# Create domain-specific managers
managers = {
"math": ExperienceManager("./math_experiences.json"),
"coding": ExperienceManager("./coding_experiences.json"),
"general": ExperienceManager("./general_experiences.json")
}
# Route conversations to appropriate manager
for conv in conversations:
domain = classify_domain(conv["query"])
# ... extract experiences using managers[domain] ...
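For batch processing you might bucket conversations up front rather than routing one at a time. A minimal sketch, assuming the classify_domain function above plus the data_filter and conversations_to_trajectories helpers from Steps 3-4:
from collections import defaultdict

# Bucket conversations by domain, then run the Step 6 extraction loop once per bucket
by_domain = defaultdict(list)
for conv in conversations:
    by_domain[classify_domain(conv["query"])].append(conv)

for domain, convs in by_domain.items():
    trajectories = conversations_to_trajectories(data_filter.filter(convs))
    print(f"{domain}: {len(trajectories)} trajectories")
    # ... run the Step 6 extraction loop, writing into managers[domain] ...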
3. Continuous Monitoring¶
Track experience quality over time:
import json
from datetime import datetime

def track_experience_metrics(experience_manager, epoch):
"""Log metrics for monitoring."""
metrics = {
"epoch": epoch,
"num_experiences": len(experience_manager),
"avg_word_count": sum(
len(exp.split()) for exp in experience_manager.experiences.values()
) / max(len(experience_manager), 1),
"timestamp": datetime.now().isoformat()
}
# Log to file
with open("experience_metrics.jsonl", "a") as f:
f.write(json.dumps(metrics) + "\n")
return metrics
Troubleshooting¶
Issue: No ground truth available¶
Solution: Use self-discrimination via multiple rollouts
# Generate multiple responses and take the majority vote as a pseudo ground truth
# (assumes the DeepSeekAdapter from Step 8 as `adapter`)
responses = [adapter.generate(query) for _ in range(5)]
majority_response = max(set(responses), key=responses.count)
traj.groundtruth = majority_response  # use as pseudo-ground-truth
Issue: Chat logs too large to process¶
Solution: Sample strategically
# Prioritize high-reward conversations
sorted_convs = sorted(conversations, key=lambda c: c["reward"], reverse=True)
sampled = sorted_convs[:1000] # Top 1000
# Or random sample
import random
sampled = random.sample(conversations, 1000)
Issue: Experiences too chat-specific¶
Solution: Use consolidation to generalize
# Force consolidation after each epoch
final_ops = extractor.consolidate_batch(
all_operations=[], # Empty - just consolidate existing
experiences=experience_manager.format_for_prompt()
)
# This will merge similar experiences and remove specifics
Summary¶
Converting chat history to experiences enables:
✅ Learn from production - Real user interactions
✅ Zero annotation cost - Use existing logs
✅ Continuous improvement - Update as users interact
✅ Privacy-friendly - Experiences are generalizations, not memorized data
Next Steps:
1. Try the Custom Agent Tutorial
2. Check the API Reference
3. Read the Main Documentation
Tutorial Last Updated: October 28, 2025