Skip to content

Agents

Agent adapters wrap agents from any framework to provide a unified interface for benchmarking. They handle execution, message history tracking, and callback hooks.

View source

AgentAdapter

Bases: ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin

Wraps an agent from any framework to provide a standard interface.

This Adapter provides:

  • Unified execution interface via run()
  • Callback hooks for monitoring
  • Message history management via getter/setter
  • Framework-agnostic tracing
  • Automatic cost calculation from token usage (when a cost calculator is available)
Cost Tracking

Agent adapters track token usage from the underlying framework. To also compute cost, you can pass a cost_calculator and optionally a model_id.

Most framework adapters auto-detect both the model ID (from the framework's agent object) and the cost calculator (using LiteLLMCostCalculator if litellm is installed). This means cost tracking often works with zero configuration.

To override auto-detection, pass explicit values::

adapter = SmolAgentAdapter(
    agent, name="researcher",
    cost_calculator=StaticPricingCalculator({...}),
    model_id="my-custom-model",
)

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this agent.

Collects comprehensive configuration information about the agent including its name, type, and callback configuration.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • name - Agent name
  • agent_type - Underlying agent framework class name
  • adapter_type - The specific adapter class (e.g., SmolAgentAdapter)
  • callbacks - List of callback class names attached to this agent
RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing agent configuration.

How to use

This method is automatically called by Benchmark during config collection. Framework-specific adapters can extend this to include additional data:

def gather_config(self) -> Dict[str, Any]:
    return {
        **super().gather_config(),
        "framework_specific_setting": self.agent.some_setting
    }

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • name - Agent name
  • agent_type - Underlying agent framework class name
  • message_count - Number of messages in history
  • messages - Full message history as list of dicts
  • callbacks - List of callback class names attached to this agent
RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

  1. Explicit model_id passed to __init__
  2. Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS DESCRIPTION
Usage

Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get the current message history as an iterable MessageHistory object.

The returned MessageHistory can be: - Iterated: for msg in agent.get_messages(): ... - Indexed: agent.get_messages()[0] - Converted to list: list(agent.get_messages()) or agent.get_messages().to_list() - Checked for emptiness: if agent.get_messages(): ...

RETURNS DESCRIPTION
MessageHistory

MessageHistory object (empty if no messages yet)

Example
# Iterate directly
for msg in agent.get_messages():
    print(msg['role'], msg['content'])

# Convert to list
messages = agent.get_messages().to_list()
messages = list(agent.get_messages())

# Check if empty
if agent.get_messages():
    print("Agent has messages")

run

run(query: str) -> Any

Executes the agent and returns the result.

options: show_source: false

Interfaces

Adapters that integrate MASEval with agent frameworks:

View source

SmolAgentAdapter

Bases: AgentAdapter

An AgentAdapter for HuggingFace smolagents MultiStepAgent.

This adapter integrates smolagents' MultiStepAgent with MASEval's benchmarking framework, converting smolagents' internal message format to OpenAI-compatible MessageHistory format. It automatically tracks tool calls, tool responses, agent reasoning steps, and provides comprehensive execution monitoring through smolagents' built-in memory system.

The adapter leverages smolagents' native memory storage as the source of truth, dynamically fetching messages, logs, and execution traces from the agent's internal state. This ensures accurate tracking of tool usage, timing, and token consumption without additional overhead.

How to use
  1. Create a smolagents agent with tools and configuration
  2. Wrap with SmolAgentAdapter to enable MASEval integration
  3. Use in benchmarks or call directly for testing
  4. Access traces and config for analysis and debugging

Example workflow:

from maseval.interface.agents.smolagents import SmolAgentAdapter
from smolagents import MultiStepAgent, ToolCallingAgent
from smolagents.tools import DuckDuckGoSearchTool

# Create a smolagents agent
agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool()],
    model="gpt-4",
    max_steps=10
)

# Wrap with adapter
agent_adapter = SmolAgentAdapter(agent, name="search_agent")

# Run agent
result = agent_adapter.run("What's the latest news on AI?")

# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
    print(f"{msg['role']}: {msg['content']}")

# Gather aggregated usage
usage = agent_adapter.gather_usage()
print(f"Total tokens: {usage.total_tokens}")

# Gather execution traces with timing
traces = agent_adapter.gather_traces()
print(f"Total duration: {traces['total_duration_seconds']}s")

# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)

The adapter automatically converts smolagents' ActionStep and PlanningStep objects into structured logs, preserving timing, token usage, tool calls, and error information.

Message Format

smolagents uses its own message format. The adapter converts to maseval / OpenAI format.

Tool calls are preserved with their IDs, names, and arguments.

Execution Monitoring

The adapter provides comprehensive monitoring through gather_traces():

  • Token usage: Input, output, and total tokens per step and aggregated
  • Timing: Duration per step and total execution time
  • Tool calls: Complete tool call history with arguments and results
  • Errors: Error tracking with type and message
  • Observations: Tool outputs and agent observations
Requires

smolagents to be installed: pip install maseval[smolagents]

logs property

logs: List[Dict[str, Any]]

Dynamically generate logs from smolagents' internal memory.

Converts smolagents' ActionStep and PlanningStep objects into log entries compatible with the AgentAdapter contract, including all available properties.

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of log dictionaries with comprehensive step information

__init__

__init__(
    agent_instance: Any,
    name: str,
    callbacks: Any = None,
    cost_calculator: Optional[CostCalculator] = None,
    model_id: Optional[str] = None,
)

Initialize the Smolagent adapter.

Note: We don't call super().init() to avoid initializing self.logs as a list, since we override it as a property that dynamically fetches from agent.memory.

PARAMETER DESCRIPTION
agent_instance

smolagents MultiStepAgent or similar

TYPE: Any

name

Agent name for identification

TYPE: str

callbacks

Optional list of AgentCallback instances

TYPE: Any DEFAULT: None

cost_calculator

Optional cost calculator. If not provided, a LiteLLMCostCalculator is created automatically when litellm is available.

TYPE: Optional[CostCalculator] DEFAULT: None

model_id

Optional model ID for cost calculation. If not provided, auto-detected from agent.model.model_id.

TYPE: Optional[str] DEFAULT: None

gather_config

gather_config() -> dict[str, Any]

Gather configuration from this SmolAgent.

Integrates with smolagents' native configuration system by accessing the agent's to_dict() method which includes comprehensive config data.

RETURNS DESCRIPTION
dict[str, Any]

Dictionary containing:

dict[str, Any]
  • type: Component class name
dict[str, Any]
  • gathered_at: ISO timestamp
dict[str, Any]
  • name: Agent name
dict[str, Any]
  • agent_type: Underlying agent class name
dict[str, Any]
  • adapter_type: SmolAgentAdapter
dict[str, Any]
  • callbacks: List of callback class names
dict[str, Any]
  • smolagents_config: Full configuration from agent.to_dict() including:
  • model: Model configuration with class and parameters
  • tools: List of tool configurations
  • max_steps: Maximum number of steps
  • planning_interval: Planning interval (if set)
  • verbosity_level: Logging verbosity
  • additional_authorized_imports: Additional imports (CodeAgent only)
  • executor_type: Code executor type (CodeAgent only)
  • managed_agents: List of managed agent configs (if any)

gather_traces

gather_traces() -> dict

Gather traces including message history and monitoring data.

Extends the base class to include smolagents' per-step monitoring data (token usage, timing, actions, observations). Aggregated usage totals are available via gather_usage().

RETURNS DESCRIPTION
dict

Dict containing messages and per-step monitoring statistics.

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

  1. Explicit model_id passed to __init__
  2. Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS DESCRIPTION
Usage

Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get message history by converting from smolagents memory.

This method dynamically fetches messages from the agent's internal memory and converts them to MASEval format.

RETURNS DESCRIPTION
MessageHistory

MessageHistory with converted messages from smolagents

run

run(query: str) -> Any

Executes the agent and returns the result.

View source

LangGraphAgentAdapter

Bases: AgentAdapter

An AgentAdapter for LangGraph CompiledGraph agents.

This adapter integrates LangGraph's compiled graphs with MASEval's benchmarking framework, converting LangChain/LangGraph message types to OpenAI-compatible MessageHistory format. It preserves tool calls, tool responses, multi-modal content, and supports both stateless and stateful (checkpointed) graph execution.

LangGraph graphs can operate in two modes:

  • Stateless: Messages from invoke() result are cached in the adapter for access
  • Stateful: With checkpointer and thread_id, messages are fetched from persistent state

The adapter automatically handles both modes, preferring persistent state when available and falling back to cached results for stateless graphs.

How to use
  1. Create a LangGraph graph with state and nodes
  2. Compile the graph (optionally with checkpointer for state persistence)
  3. Wrap with LangGraphAgentAdapter to enable MASEval integration
  4. Use in benchmarks or call directly for testing
  5. Access traces and config for analysis and debugging

Example workflow:

from maseval.interface.agents.langgraph import LangGraphAgentAdapter
from langgraph.graph import StateGraph, MessagesState
from langgraph.checkpoint.memory import MemorySaver

# Define your graph
def chatbot(state: MessagesState):
    # Your agent logic
    return {"messages": [response]}

# Build graph
graph = StateGraph(MessagesState)
graph.add_node("chatbot", chatbot)
graph.set_entry_point("chatbot")
graph.set_finish_point("chatbot")

# Compile (stateless)
compiled_graph = graph.compile()
agent_adapter = LangGraphAgentAdapter(compiled_graph, "agent_name")

# Or compile with checkpointer (stateful)
memory = MemorySaver()
compiled_graph = graph.compile(checkpointer=memory)
config = {"configurable": {"thread_id": "session_1"}}
agent_adapter = LangGraphAgentAdapter(
    compiled_graph,
    "agent_name",
    config=config
)

# Run agent
result = agent_adapter.run("What's the weather?")

# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
    print(f"{msg['role']}: {msg['content']}")

# Gather execution traces
traces = agent_adapter.gather_traces()
if 'total_tokens' in traces:
    print(f"Total tokens: {traces['total_tokens']}")

# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)

For stateful graphs, the adapter preserves conversation context across multiple calls using the same thread_id, enabling multi-turn interactions.

Message Format

LangGraph uses LangChain message types. The adapter converts to maseval / OpenAI format.

Tool calls are preserved with metadata and converted to OpenAI's tool call format.

Token Usage

If LangChain messages include usage_metadata, the adapter automatically extracts and aggregates token counts. This is available for models that provide usage information.

Requires

langgraph to be installed: pip install maseval[langgraph]

__init__

__init__(
    agent_instance: Any,
    name: str,
    callbacks: Optional[List[Any]] = None,
    config: Optional[Dict[str, Any]] = None,
    cost_calculator: Optional[CostCalculator] = None,
    model_id: Optional[str] = None,
)

Initialize the LangGraph adapter.

PARAMETER DESCRIPTION
agent_instance

Compiled LangGraph graph

TYPE: Any

name

Agent name

TYPE: str

callbacks

Optional list of callbacks

TYPE: Optional[List[Any]] DEFAULT: None

config

Optional LangGraph config dict (for stateful graphs with checkpointer). Should include 'configurable': {'thread_id': '...'} for persistent state.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

cost_calculator

Optional cost calculator. If not provided, a LiteLLMCostCalculator is created automatically when litellm is available.

TYPE: Optional[CostCalculator] DEFAULT: None

model_id

Model ID for cost calculation. LangGraph graphs can contain multiple models across nodes, so the model ID cannot be auto-detected. Pass the primary model's ID here to enable cost tracking::

LangGraphAgentAdapter(
    graph, "agent",
    model_id="gpt-4o-mini",
)

TYPE: Optional[str] DEFAULT: None

gather_config

gather_config() -> dict[str, Any]

Gather configuration from this LangGraph agent.

RETURNS DESCRIPTION
dict[str, Any]

Dictionary containing:

dict[str, Any]
  • type: Component class name
dict[str, Any]
  • gathered_at: ISO timestamp
dict[str, Any]
  • name: Agent name
dict[str, Any]
  • agent_type: CompiledGraph or similar
dict[str, Any]
  • adapter_type: LangGraphAgentAdapter
dict[str, Any]
  • callbacks: List of callback class names
dict[str, Any]
  • has_checkpointer: Whether the graph has state persistence
dict[str, Any]
  • config: LangGraph config dict (with sensitive data removed)
dict[str, Any]
  • graph_info: Information about the graph structure (if available)

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • name - Agent name
  • agent_type - Underlying agent framework class name
  • message_count - Number of messages in history
  • messages - Full message history as list of dicts
  • callbacks - List of callback class names attached to this agent
RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

  1. Explicit model_id passed to __init__
  2. Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS DESCRIPTION
Usage

Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get message history from LangGraph.

For stateful graphs (with checkpointer and thread_id), fetches from graph state. For stateless graphs, returns cached messages from last run.

RETURNS DESCRIPTION
MessageHistory

MessageHistory with converted messages

run

run(query: str) -> Any

Executes the agent and returns the result.

View source

LlamaIndexAgentAdapter

Bases: AgentAdapter

An AgentAdapter for LlamaIndex workflow-based agents.

This adapter integrates LlamaIndex's workflow-based agent system with MASEval's benchmarking framework, converting LlamaIndex's ChatMessage format to OpenAI-compatible MessageHistory format. It handles both AgentWorkflow and BaseWorkflowAgent instances, automatically managing async execution in synchronous contexts.

LlamaIndex agents are async-first, using workflows that must be awaited. This adapter handles the async-to-sync conversion automatically, supporting both agents with persistent memory and stateless execution modes. It seamlessly integrates with MASEval's synchronous benchmarking API.

How to use
  1. Create a LlamaIndex workflow agent with tools and LLM
  2. Wrap with LlamaIndexAgentAdapter to enable MASEval integration
  3. Use in benchmarks or call directly for testing
  4. Access traces and config for analysis and debugging

Example workflow:

from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.llms import OpenAI
from llama_index.core.tools import FunctionTool

# Define a tool
def search(query: str) -> str:
    """Search for information."""
    return f"Results for: {query}"

search_tool = FunctionTool.from_defaults(fn=search)

# Create a LlamaIndex workflow
workflow = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[search_tool],
    llm=OpenAI(model="gpt-4"),
    system_prompt="You are a helpful research assistant"
)

# Wrap with adapter
agent_adapter = LlamaIndexAgentAdapter(workflow, "research_agent")

# Run agent (async handled automatically)
result = agent_adapter.run("What are the latest developments in quantum computing?")

# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
    print(f"{msg['role']}: {msg['content']}")

# Gather configuration including tools and system prompt
config = agent_adapter.gather_config()
print(f"System prompt: {config['llamaindex_config']['system_prompt']}")
print(f"Tools: {config['llamaindex_config']['tools']}")

# Gather execution traces with timing
traces = agent_adapter.gather_traces()
if 'total_tokens' in traces:
    print(f"Total tokens: {traces['total_tokens']}")

# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)

The adapter works with various LlamaIndex agent types including AgentWorkflow, FunctionAgent (tool calling), ReActAgent, and CodeActAgent.

Message Format

LlamaIndex uses ChatMessage objects with MessageRole enums. The adapter converts to maseval / OpenAI format.

Tool calls are preserved in the additional_kwargs field and converted to OpenAI's tool call format when available.

Async Handling

LlamaIndex agents return a WorkflowHandler from .run() which must be awaited. The adapter handles this automatically:

  • Checks for run_sync() method first (for compatibility)
  • Falls back to asyncio.run() to execute the async run() method
  • Works seamlessly in synchronous benchmarking contexts

This allows you to use async-first LlamaIndex agents in MASEval's sync API without any additional configuration.

Supported Agent Types
  • AgentWorkflow: Multi-agent workflow orchestrator
  • FunctionAgent: Function-calling based agent (for LLMs with tool calling)
  • ReActAgent: ReAct prompting pattern agent
  • CodeActAgent: Code execution based agent
Token Usage

Token usage is extracted from LLM responses when available. If the LLM response includes usage metadata, it's automatically captured in execution traces.

Requires

llama-index-core to be installed: pip install maseval[llamaindex]

__init__

__init__(
    agent_instance: Any,
    name: str,
    callbacks: Optional[List[Any]] = None,
    max_iterations: Optional[int] = None,
    cost_calculator: Optional[CostCalculator] = None,
    model_id: Optional[str] = None,
)

Initialize the LlamaIndex adapter.

PARAMETER DESCRIPTION
agent_instance

LlamaIndex AgentWorkflow or BaseWorkflowAgent instance

TYPE: Any

name

Agent name

TYPE: str

callbacks

Optional list of callbacks

TYPE: Optional[List[Any]] DEFAULT: None

max_iterations

Maximum number of agent iterations for AgentWorkflow.run(). If None, LlamaIndex's DEFAULT_MAX_ITERATIONS (20) is used. Bug fix: FunctionAgent does NOT have a max_steps constructor parameter — passing max_steps to it is silently swallowed by **kwargs. The actual iteration limit must be passed here so the adapter forwards it to AgentWorkflow.run(max_iterations=...).

TYPE: Optional[int] DEFAULT: None

cost_calculator

Optional cost calculator. If not provided, a LiteLLMCostCalculator is created automatically when litellm is available.

TYPE: Optional[CostCalculator] DEFAULT: None

model_id

Optional model ID for cost calculation. If not provided, auto-detected from agent.llm.metadata.model_name.

TYPE: Optional[str] DEFAULT: None

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this LlamaIndex agent.

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing:

Dict[str, Any]
  • type: Component class name
Dict[str, Any]
  • gathered_at: ISO timestamp
Dict[str, Any]
  • name: Agent name
Dict[str, Any]
  • agent_type: Underlying agent class name
Dict[str, Any]
  • adapter_type: LlamaIndexAgentAdapter
Dict[str, Any]
  • callbacks: List of callback class names
Dict[str, Any]
  • llamaindex_config: LlamaIndex-specific configuration (if available)

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • name - Agent name
  • agent_type - Underlying agent framework class name
  • message_count - Number of messages in history
  • messages - Full message history as list of dicts
  • callbacks - List of callback class names attached to this agent
RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

  1. Explicit model_id passed to __init__
  2. Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS DESCRIPTION
Usage

Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get message history from LlamaIndex.

For agents with accessible memory, fetches from the agent's memory. Otherwise, returns cached messages from the last run.

RETURNS DESCRIPTION
MessageHistory

MessageHistory with converted messages

run

run(query: str) -> Any

Executes the agent and returns the result.