Agents

Agent adapters wrap agents from any framework to provide a unified interface for benchmarking. They handle execution, message history tracking, and callback hooks.

View source

AgentAdapter

Bases: ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin

Wraps an agent from any framework to provide a standard interface.

This Adapter provides:

Unified execution interface via run()
Callback hooks for monitoring
Message history management via getter/setter
Framework-agnostic tracing
Automatic cost calculation from token usage (when a cost calculator is available)

Cost Tracking

Agent adapters track token usage from the underlying framework. To also compute cost, you can pass a cost_calculator and optionally a model_id.

Most framework adapters auto-detect both the model ID (from the framework's agent object) and the cost calculator (using LiteLLMCostCalculator if litellm is installed). This means cost tracking often works with zero configuration.

To override auto-detection, pass explicit values::

adapter = SmolAgentAdapter(
    agent, name="researcher",
    cost_calculator=StaticPricingCalculator({...}),
    model_id="my-custom-model",
)

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this agent.

Collects comprehensive configuration information about the agent including its name, type, and callback configuration.

Output fields:

type - Component class name
gathered_at - ISO timestamp
name - Agent name
agent_type - Underlying agent framework class name
adapter_type - The specific adapter class (e.g., SmolAgentAdapter)
callbacks - List of callback class names attached to this agent

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing agent configuration.

How to use

This method is automatically called by Benchmark during config collection. Framework-specific adapters can extend this to include additional data:

def gather_config(self) -> Dict[str, Any]:
    return {
        **super().gather_config(),
        "framework_specific_setting": self.agent.some_setting
    }

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

type - Component class name
gathered_at - ISO timestamp
name - Agent name
agent_type - Underlying agent framework class name
message_count - Number of messages in history
messages - Full message history as list of dicts
callbacks - List of callback class names attached to this agent

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

Explicit model_id passed to __init__
Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS	DESCRIPTION
`Usage`	Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get the current message history as an iterable MessageHistory object.

The returned MessageHistory can be: - Iterated: for msg in agent.get_messages(): ... - Indexed: agent.get_messages()[0] - Converted to list: list(agent.get_messages()) or agent.get_messages().to_list() - Checked for emptiness: if agent.get_messages(): ...

RETURNS	DESCRIPTION
`MessageHistory`	MessageHistory object (empty if no messages yet)

Example

# Iterate directly
for msg in agent.get_messages():
    print(msg['role'], msg['content'])

# Convert to list
messages = agent.get_messages().to_list()
messages = list(agent.get_messages())

# Check if empty
if agent.get_messages():
    print("Agent has messages")

run

run(query: str) -> Any

Executes the agent and returns the result.

options: show_source: false

Interfaces

Adapters that integrate MASEval with agent frameworks:

View source

SmolAgentAdapter

Bases: AgentAdapter

An AgentAdapter for HuggingFace smolagents MultiStepAgent.

This adapter integrates smolagents' MultiStepAgent with MASEval's benchmarking framework, converting smolagents' internal message format to OpenAI-compatible MessageHistory format. It automatically tracks tool calls, tool responses, agent reasoning steps, and provides comprehensive execution monitoring through smolagents' built-in memory system.

The adapter leverages smolagents' native memory storage as the source of truth, dynamically fetching messages, logs, and execution traces from the agent's internal state. This ensures accurate tracking of tool usage, timing, and token consumption without additional overhead.

How to use

Create a smolagents agent with tools and configuration
Wrap with SmolAgentAdapter to enable MASEval integration
Use in benchmarks or call directly for testing
Access traces and config for analysis and debugging

Example workflow:

from maseval.interface.agents.smolagents import SmolAgentAdapter
from smolagents import MultiStepAgent, ToolCallingAgent
from smolagents.tools import DuckDuckGoSearchTool

# Create a smolagents agent
agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool()],
    model="gpt-4",
    max_steps=10
)

# Wrap with adapter
agent_adapter = SmolAgentAdapter(agent, name="search_agent")

# Run agent
result = agent_adapter.run("What's the latest news on AI?")

# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
    print(f"{msg['role']}: {msg['content']}")

# Gather aggregated usage
usage = agent_adapter.gather_usage()
print(f"Total tokens: {usage.total_tokens}")

# Gather execution traces with timing
traces = agent_adapter.gather_traces()
print(f"Total duration: {traces['total_duration_seconds']}s")

# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)

The adapter automatically converts smolagents' ActionStep and PlanningStep objects into structured logs, preserving timing, token usage, tool calls, and error information.

Message Format

smolagents uses its own message format. The adapter converts to maseval / OpenAI format.

Tool calls are preserved with their IDs, names, and arguments.

Execution Monitoring

The adapter provides comprehensive monitoring through gather_traces():

Token usage: Input, output, and total tokens per step and aggregated
Timing: Duration per step and total execution time
Tool calls: Complete tool call history with arguments and results
Errors: Error tracking with type and message
Observations: Tool outputs and agent observations

Requires

smolagents to be installed: pip install maseval[smolagents]

logs `property`

logs: List[Dict[str, Any]]

Dynamically generate logs from smolagents' internal memory.

Converts smolagents' ActionStep and PlanningStep objects into log entries compatible with the AgentAdapter contract, including all available properties.

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of log dictionaries with comprehensive step information

init

__init__(
    agent_instance: Any,
    name: str,
    callbacks: Any = None,
    cost_calculator: Optional[CostCalculator] = None,
    model_id: Optional[str] = None,
)

Initialize the Smolagent adapter.

Note: We don't call super().init() to avoid initializing self.logs as a list, since we override it as a property that dynamically fetches from agent.memory.

PARAMETER	DESCRIPTION
`agent_instance`	smolagents MultiStepAgent or similar TYPE: `Any`
`name`	Agent name for identification TYPE: `str`
`callbacks`	Optional list of AgentCallback instances TYPE: `Any` DEFAULT: `None`
`cost_calculator`	Optional cost calculator. If not provided, a `LiteLLMCostCalculator` is created automatically when litellm is available. TYPE: `Optional[CostCalculator]` DEFAULT: `None`
`model_id`	Optional model ID for cost calculation. If not provided, auto-detected from `agent.model.model_id`. TYPE: `Optional[str]` DEFAULT: `None`

gather_config

gather_config() -> dict[str, Any]

Gather configuration from this SmolAgent.

Integrates with smolagents' native configuration system by accessing the agent's to_dict() method which includes comprehensive config data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary containing:
`dict[str, Any]`	type: Component class name
`dict[str, Any]`	gathered_at: ISO timestamp
`dict[str, Any]`	name: Agent name
`dict[str, Any]`	agent_type: Underlying agent class name
`dict[str, Any]`	adapter_type: SmolAgentAdapter
`dict[str, Any]`	callbacks: List of callback class names
`dict[str, Any]`	smolagents_config: Full configuration from agent.to_dict() including: model: Model configuration with class and parameters tools: List of tool configurations max_steps: Maximum number of steps planning_interval: Planning interval (if set) verbosity_level: Logging verbosity additional_authorized_imports: Additional imports (CodeAgent only) executor_type: Code executor type (CodeAgent only) managed_agents: List of managed agent configs (if any)

gather_traces

gather_traces() -> dict

Gather traces including message history and monitoring data.

Extends the base class to include smolagents' per-step monitoring data (token usage, timing, actions, observations). Aggregated usage totals are available via gather_usage().

RETURNS	DESCRIPTION
`dict`	Dict containing messages and per-step monitoring statistics.

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

Explicit model_id passed to __init__
Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS	DESCRIPTION
`Usage`	Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get message history by converting from smolagents memory.

This method dynamically fetches messages from the agent's internal memory and converts them to MASEval format.

RETURNS	DESCRIPTION
`MessageHistory`	MessageHistory with converted messages from smolagents

run

run(query: str) -> Any

Executes the agent and returns the result.

View source

LangGraphAgentAdapter

Bases: AgentAdapter

An AgentAdapter for LangGraph CompiledGraph agents.

This adapter integrates LangGraph's compiled graphs with MASEval's benchmarking framework, converting LangChain/LangGraph message types to OpenAI-compatible MessageHistory format. It preserves tool calls, tool responses, multi-modal content, and supports both stateless and stateful (checkpointed) graph execution.

LangGraph graphs can operate in two modes:

Stateless: Messages from invoke() result are cached in the adapter for access
Stateful: With checkpointer and thread_id, messages are fetched from persistent state

The adapter automatically handles both modes, preferring persistent state when available and falling back to cached results for stateless graphs.

How to use

Create a LangGraph graph with state and nodes
Compile the graph (optionally with checkpointer for state persistence)
Wrap with LangGraphAgentAdapter to enable MASEval integration
Use in benchmarks or call directly for testing
Access traces and config for analysis and debugging

Example workflow:

from maseval.interface.agents.langgraph import LangGraphAgentAdapter
from langgraph.graph import StateGraph, MessagesState
from langgraph.checkpoint.memory import MemorySaver

# Define your graph
def chatbot(state: MessagesState):
    # Your agent logic
    return {"messages": [response]}

# Build graph
graph = StateGraph(MessagesState)
graph.add_node("chatbot", chatbot)
graph.set_entry_point("chatbot")
graph.set_finish_point("chatbot")

# Compile (stateless)
compiled_graph = graph.compile()
agent_adapter = LangGraphAgentAdapter(compiled_graph, "agent_name")

# Or compile with checkpointer (stateful)
memory = MemorySaver()
compiled_graph = graph.compile(checkpointer=memory)
config = {"configurable": {"thread_id": "session_1"}}
agent_adapter = LangGraphAgentAdapter(
    compiled_graph,
    "agent_name",
    config=config
)

# Run agent
result = agent_adapter.run("What's the weather?")

# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
    print(f"{msg['role']}: {msg['content']}")

# Gather execution traces
traces = agent_adapter.gather_traces()
if 'total_tokens' in traces:
    print(f"Total tokens: {traces['total_tokens']}")

# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)

For stateful graphs, the adapter preserves conversation context across multiple calls using the same thread_id, enabling multi-turn interactions.

Message Format

LangGraph uses LangChain message types. The adapter converts to maseval / OpenAI format.

Tool calls are preserved with metadata and converted to OpenAI's tool call format.

Token Usage

If LangChain messages include usage_metadata, the adapter automatically extracts and aggregates token counts. This is available for models that provide usage information.

Requires

langgraph to be installed: pip install maseval[langgraph]

init

__init__(
    agent_instance: Any,
    name: str,
    callbacks: Optional[List[Any]] = None,
    config: Optional[Dict[str, Any]] = None,
    cost_calculator: Optional[CostCalculator] = None,
    model_id: Optional[str] = None,
)

Initialize the LangGraph adapter.

PARAMETER	DESCRIPTION
`agent_instance`	Compiled LangGraph graph TYPE: `Any`
`name`	Agent name TYPE: `str`
`callbacks`	Optional list of callbacks TYPE: `Optional[List[Any]]` DEFAULT: `None`
`config`	Optional LangGraph config dict (for stateful graphs with checkpointer). Should include `'configurable': {'thread_id': '...'}` for persistent state. TYPE: `Optional[Dict[str, Any]]` DEFAULT: `None`
`cost_calculator`	Optional cost calculator. If not provided, a `LiteLLMCostCalculator` is created automatically when litellm is available. TYPE: `Optional[CostCalculator]` DEFAULT: `None`
`model_id`	Model ID for cost calculation. LangGraph graphs can contain multiple models across nodes, so the model ID cannot be auto-detected. Pass the primary model's ID here to enable cost tracking:: `LangGraphAgentAdapter( graph, "agent", model_id="gpt-4o-mini", )` TYPE: `Optional[str]` DEFAULT: `None`

gather_config

gather_config() -> dict[str, Any]

Gather configuration from this LangGraph agent.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary containing:
`dict[str, Any]`	type: Component class name
`dict[str, Any]`	gathered_at: ISO timestamp
`dict[str, Any]`	name: Agent name
`dict[str, Any]`	agent_type: CompiledGraph or similar
`dict[str, Any]`	adapter_type: LangGraphAgentAdapter
`dict[str, Any]`	callbacks: List of callback class names
`dict[str, Any]`	has_checkpointer: Whether the graph has state persistence
`dict[str, Any]`	config: LangGraph config dict (with sensitive data removed)
`dict[str, Any]`	graph_info: Information about the graph structure (if available)

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

type - Component class name
gathered_at - ISO timestamp
name - Agent name
agent_type - Underlying agent framework class name
message_count - Number of messages in history
messages - Full message history as list of dicts
callbacks - List of callback class names attached to this agent

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

Explicit model_id passed to __init__
Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS	DESCRIPTION
`Usage`	Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get message history from LangGraph.

For stateful graphs (with checkpointer and thread_id), fetches from graph state. For stateless graphs, returns cached messages from last run.

RETURNS	DESCRIPTION
`MessageHistory`	MessageHistory with converted messages

run

run(query: str) -> Any

Executes the agent and returns the result.

View source

LlamaIndexAgentAdapter

Bases: AgentAdapter

An AgentAdapter for LlamaIndex workflow-based agents.

This adapter integrates LlamaIndex's workflow-based agent system with MASEval's benchmarking framework, converting LlamaIndex's ChatMessage format to OpenAI-compatible MessageHistory format. It handles both AgentWorkflow and BaseWorkflowAgent instances, automatically managing async execution in synchronous contexts.

LlamaIndex agents are async-first, using workflows that must be awaited. This adapter handles the async-to-sync conversion automatically, supporting both agents with persistent memory and stateless execution modes. It seamlessly integrates with MASEval's synchronous benchmarking API.

How to use

Create a LlamaIndex workflow agent with tools and LLM
Wrap with LlamaIndexAgentAdapter to enable MASEval integration
Use in benchmarks or call directly for testing
Access traces and config for analysis and debugging

Example workflow:

from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.llms import OpenAI
from llama_index.core.tools import FunctionTool

# Define a tool
def search(query: str) -> str:
    """Search for information."""
    return f"Results for: {query}"

search_tool = FunctionTool.from_defaults(fn=search)

# Create a LlamaIndex workflow
workflow = AgentWorkflow.from_tools_or_functions(
    tools_or_functions=[search_tool],
    llm=OpenAI(model="gpt-4"),
    system_prompt="You are a helpful research assistant"
)

# Wrap with adapter
agent_adapter = LlamaIndexAgentAdapter(workflow, "research_agent")

# Run agent (async handled automatically)
result = agent_adapter.run("What are the latest developments in quantum computing?")

# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
    print(f"{msg['role']}: {msg['content']}")

# Gather configuration including tools and system prompt
config = agent_adapter.gather_config()
print(f"System prompt: {config['llamaindex_config']['system_prompt']}")
print(f"Tools: {config['llamaindex_config']['tools']}")

# Gather execution traces with timing
traces = agent_adapter.gather_traces()
if 'total_tokens' in traces:
    print(f"Total tokens: {traces['total_tokens']}")

# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)

The adapter works with various LlamaIndex agent types including AgentWorkflow, FunctionAgent (tool calling), ReActAgent, and CodeActAgent.

Message Format

LlamaIndex uses ChatMessage objects with MessageRole enums. The adapter converts to maseval / OpenAI format.

Tool calls are preserved in the additional_kwargs field and converted to OpenAI's tool call format when available.

Async Handling

LlamaIndex agents return a WorkflowHandler from .run() which must be awaited. The adapter handles this automatically:

Checks for run_sync() method first (for compatibility)
Falls back to asyncio.run() to execute the async run() method
Works seamlessly in synchronous benchmarking contexts

This allows you to use async-first LlamaIndex agents in MASEval's sync API without any additional configuration.

Supported Agent Types

AgentWorkflow: Multi-agent workflow orchestrator
FunctionAgent: Function-calling based agent (for LLMs with tool calling)
ReActAgent: ReAct prompting pattern agent
CodeActAgent: Code execution based agent

Token Usage

Token usage is extracted from LLM responses when available. If the LLM response includes usage metadata, it's automatically captured in execution traces.

Requires

llama-index-core to be installed: pip install maseval[llamaindex]

init

__init__(
    agent_instance: Any,
    name: str,
    callbacks: Optional[List[Any]] = None,
    max_iterations: Optional[int] = None,
    cost_calculator: Optional[CostCalculator] = None,
    model_id: Optional[str] = None,
)

Initialize the LlamaIndex adapter.

PARAMETER	DESCRIPTION
`agent_instance`	LlamaIndex AgentWorkflow or BaseWorkflowAgent instance TYPE: `Any`
`name`	Agent name TYPE: `str`
`callbacks`	Optional list of callbacks TYPE: `Optional[List[Any]]` DEFAULT: `None`
`max_iterations`	Maximum number of agent iterations for AgentWorkflow.run(). If None, LlamaIndex's DEFAULT_MAX_ITERATIONS (20) is used. Bug fix: FunctionAgent does NOT have a max_steps constructor parameter — passing max_steps to it is silently swallowed by kwargs. The actual iteration limit must be passed here so the adapter forwards it to AgentWorkflow.run(max_iterations=...). TYPE: `Optional[int]` DEFAULT:** `None`
`cost_calculator`	Optional cost calculator. If not provided, a `LiteLLMCostCalculator` is created automatically when litellm is available. TYPE: `Optional[CostCalculator]` DEFAULT: `None`
`model_id`	Optional model ID for cost calculation. If not provided, auto-detected from `agent.llm.metadata.model_name`. TYPE: `Optional[str]` DEFAULT: `None`

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this LlamaIndex agent.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing:
`Dict[str, Any]`	type: Component class name
`Dict[str, Any]`	gathered_at: ISO timestamp
`Dict[str, Any]`	name: Agent name
`Dict[str, Any]`	agent_type: Underlying agent class name
`Dict[str, Any]`	adapter_type: LlamaIndexAgentAdapter
`Dict[str, Any]`	callbacks: List of callback class names
`Dict[str, Any]`	llamaindex_config: LlamaIndex-specific configuration (if available)

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

type - Component class name
gathered_at - ISO timestamp
name - Agent name
agent_type - Underlying agent framework class name
message_count - Number of messages in history
messages - Full message history as list of dicts
callbacks - List of callback class names attached to this agent

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

Explicit model_id passed to __init__
Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS	DESCRIPTION
`Usage`	Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Get message history from LlamaIndex.

For agents with accessible memory, fetches from the agent's memory. Otherwise, returns cached messages from the last run.

RETURNS	DESCRIPTION
`MessageHistory`	MessageHistory with converted messages

run

run(query: str) -> Any

Executes the agent and returns the result.

Agents

AgentAdapter

gather_config

gather_traces

gather_usage

get_messages

run

Interfaces

SmolAgentAdapter

logs property

__init__

gather_config

gather_traces

gather_usage

get_messages

run

LangGraphAgentAdapter

__init__

gather_config

gather_traces

gather_usage

get_messages

run

LlamaIndexAgentAdapter

__init__

gather_config

gather_traces

gather_usage

get_messages

run

logs `property`

init

init

init