Skip to content

Message Tracing

Overview

MASEval provides message tracing to capture agent conversations during benchmark execution. This is useful for:

  • Debugging: Inspect what agents actually said and which tools they called
  • Analysis: Understand agent behavior patterns across tasks
  • Dataset Creation: Extract conversations for further analysis or training

Tracing vs Logging

Tracing = Collecting execution data during task runs (messages, tool calls, metrics)

Logging = Persisting traces and evaluation results to disk/databases after benchmarks complete

This guide covers tracing. Logging functionality for persisting results is planned for future releases.

Core Concepts

MessageHistory: OpenAI-compatible message storage that all agent adapters use internally.

AgentAdapter.get_messages(): Standard method to retrieve conversation history from any wrapped agent.

MessageTracingAgentCallback: Optional callback that automatically collects all agent conversations in memory.

Basic Usage

Accessing Message History

Every agent adapter exposes message history through get_messages():

from maseval.interface.agents import SmolAgentAdapter

# Create and run your agent
agent_adapter = SmolAgentAdapter(agent, name="researcher")
result = agent_adapter.run("What's the capital of France?")

# Get the conversation
messages = agent_adapter.get_messages()

# Inspect messages
for msg in messages:
    print(f"{msg['role']}: {msg.get('content', '')}")
    if 'tool_calls' in msg:
        print(f"  Tools called: {[tc['function']['name'] for tc in msg['tool_calls']]}")

Fresh Conversations for Multiple Tasks

In benchmarks, you typically want a fresh agent instance for each task:

# In your benchmark loop
for task in benchmark.tasks:
    # Create a new adapter instance for each task
    agent_adapter = YourAgentAdapter(agent_instance=agent, name="task_agent")
    result = agent_adapter.run(task.query)
    evaluate(result, task.ground_truth)

This ensures each task starts with a clean slate and avoids conversation history contamination.

Using the Tracing Callback

For multi-agent systems or when you need to collect conversations from many runs, use MessageTracingAgentCallback:

from maseval.core.callbacks import MessageTracingAgentCallback

# Create tracer
tracer = MessageTracingAgentCallback()

# Attach to your agent(s)
agent_adapter = SmolAgentAdapter(agent, name="assistant", callbacks=[tracer])

# Run tasks
agent_adapter.run("Task 1")
agent_adapter.run("Task 2")
agent_adapter.run("Task 3")

# Get all conversations
conversations = tracer.get_all_conversations()

# Each conversation contains:
# - agent_name: which agent ran
# - query: the input query
# - messages: full conversation history
# - message_count: number of messages

Multi-Agent Tracing

Share one tracer across multiple agents to collect all conversations:

tracer = MessageTracingAgentCallback()

# Attach to multiple agents
agent1 = SmolAgentAdapter(agent1, name="researcher", callbacks=[tracer])
agent2 = SmolAgentAdapter(agent2, name="writer", callbacks=[tracer])

# Run both agents
agent1.run("Research topic X")
agent2.run("Write about topic X")

# Get conversations by agent
research_convs = tracer.get_conversations_by_agent("researcher")
writer_convs = tracer.get_conversations_by_agent("writer")

# Or get statistics
stats = tracer.get_statistics()
print(f"Total conversations: {stats['total_conversations']}")
print(f"Total messages: {stats['total_messages']}")

Memory Management

For long-running benchmarks, periodically clear the tracer's memory:

tracer = MessageTracingAgentCallback()

for batch in task_batches:
    for task in batch:
        agent_adapter.run(task.query)

    # Process this batch
    conversations = tracer.get_all_conversations()
    save_to_disk(conversations)

    # Clear memory for next batch
    tracer.clear()

Configuration

Tracer Options

MessageTracingAgentCallback(
    include_metadata=True,   # Include timestamps and metadata (default: True)
    verbose=False            # Print trace info to console (default: False)
)

Typical configurations:

# Debugging - see what's happening
tracer = MessageTracingAgentCallback(verbose=True)

# Production - minimal overhead
tracer = MessageTracingAgentCallback(include_metadata=False, verbose=False)

Message Format

Messages use OpenAI's chat completion format:

{
    "role": "user" | "assistant" | "system" | "tool",
    "content": str,
    "tool_calls": [...],      # Present when assistant calls tools
    "tool_call_id": str,      # Present in tool responses
    "name": str,              # Tool name (for tool role)
}

Tool Call Example

# Assistant calling a tool
{
    "role": "assistant",
    "content": "",
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": '{"city": "NYC"}'
        }
    }]
}

# Tool response
{
    "role": "tool",
    "tool_call_id": "call_abc123",
    "name": "get_weather",
    "content": "72°F, Sunny"
}

Custom Agent Adapters

If you're implementing a custom adapter, the framework handles message storage automatically via get_messages(). Just ensure your _run_agent() method returns a MessageHistory:

from maseval import AgentAdapter, MessageHistory

class MyAgentAdapter(AgentAdapter):
    def _run_agent(self, query: str) -> MessageHistory:
        # Run your agent
        result = self.agent.run(query)

        # Convert to MessageHistory
        history = MessageHistory()
        history.add_message(role="user", content=query)
        history.add_message(role="assistant", content=result.text)

        # Framework automatically stores this
        return history

See the AgentAdapter guide for details on implementing custom adapters.

Tips

For debugging: Use verbose=True to see traces in real-time.

For benchmarks: Create a new adapter instance for each task to ensure clean conversation history.

For multi-agent systems: Use a shared tracer and get_conversations_by_agent() to analyze each agent separately.

For memory efficiency: Periodically clear() the tracer and save conversations to disk.