Agents
Agent adapters wrap agents from any framework to provide a unified interface for benchmarking. They handle execution, message history tracking, and callback hooks.
AgentAdapter
Bases: ABC, TraceableMixin, ConfigurableMixin, UsageTrackableMixin
Wraps an agent from any framework to provide a standard interface.
This Adapter provides:
- Unified execution interface via
run() - Callback hooks for monitoring
- Message history management via getter/setter
- Framework-agnostic tracing
- Automatic cost calculation from token usage (when a cost calculator is available)
Cost Tracking
Agent adapters track token usage from the underlying framework. To also
compute cost, you can pass a cost_calculator and optionally a model_id.
Most framework adapters auto-detect both the model ID (from the framework's
agent object) and the cost calculator (using LiteLLMCostCalculator if
litellm is installed). This means cost tracking often works with zero
configuration.
To override auto-detection, pass explicit values::
adapter = SmolAgentAdapter(
agent, name="researcher",
cost_calculator=StaticPricingCalculator({...}),
model_id="my-custom-model",
)
gather_config
gather_config() -> Dict[str, Any]
Gather configuration from this agent.
Collects comprehensive configuration information about the agent including its name, type, and callback configuration.
Output fields:
type- Component class namegathered_at- ISO timestampname- Agent nameagent_type- Underlying agent framework class nameadapter_type- The specific adapter class (e.g.,SmolAgentAdapter)callbacks- List of callback class names attached to this agent
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dictionary containing agent configuration. |
How to use
This method is automatically called by Benchmark during config collection. Framework-specific adapters can extend this to include additional data:
def gather_config(self) -> Dict[str, Any]:
return {
**super().gather_config(),
"framework_specific_setting": self.agent.some_setting
}
gather_traces
gather_traces() -> Dict[str, Any]
Gather execution traces from this agent.
Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.
Output fields:
type- Component class namegathered_at- ISO timestampname- Agent nameagent_type- Underlying agent framework class namemessage_count- Number of messages in historymessages- Full message history as list of dictscallbacks- List of callback class names attached to this agent
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dictionary containing agent execution traces. |
How to use
This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:
def gather_traces(self) -> Dict[str, Any]:
return {
**super().gather_traces(),
"framework_specific_metric": self.agent.some_metric
}
gather_usage
gather_usage() -> Usage
Gather usage with automatic cost calculation.
Calls _gather_usage() for raw token counts, then applies
the cost calculator if one is available and cost is still 0.0.
The model_id used for cost calculation is resolved in order:
- Explicit
model_idpassed to__init__ - Auto-detected from the framework agent via
_resolve_model_id()
Subclasses should override _gather_usage() (not this method)
to provide framework-specific token extraction.
| RETURNS | DESCRIPTION |
|---|---|
Usage
|
Usage (or TokenUsage) with cost filled in when possible. |
get_messages
get_messages() -> MessageHistory
Get the current message history as an iterable MessageHistory object.
The returned MessageHistory can be:
- Iterated: for msg in agent.get_messages(): ...
- Indexed: agent.get_messages()[0]
- Converted to list: list(agent.get_messages()) or agent.get_messages().to_list()
- Checked for emptiness: if agent.get_messages(): ...
| RETURNS | DESCRIPTION |
|---|---|
MessageHistory
|
MessageHistory object (empty if no messages yet) |
Example
# Iterate directly
for msg in agent.get_messages():
print(msg['role'], msg['content'])
# Convert to list
messages = agent.get_messages().to_list()
messages = list(agent.get_messages())
# Check if empty
if agent.get_messages():
print("Agent has messages")
run
run(query: str) -> Any
Executes the agent and returns the result.
options: show_source: false
Interfaces
Adapters that integrate MASEval with agent frameworks:
SmolAgentAdapter
Bases: AgentAdapter
An AgentAdapter for HuggingFace smolagents MultiStepAgent.
This adapter integrates smolagents' MultiStepAgent with MASEval's benchmarking framework, converting smolagents' internal message format to OpenAI-compatible MessageHistory format. It automatically tracks tool calls, tool responses, agent reasoning steps, and provides comprehensive execution monitoring through smolagents' built-in memory system.
The adapter leverages smolagents' native memory storage as the source of truth, dynamically fetching messages, logs, and execution traces from the agent's internal state. This ensures accurate tracking of tool usage, timing, and token consumption without additional overhead.
How to use
- Create a smolagents agent with tools and configuration
- Wrap with SmolAgentAdapter to enable MASEval integration
- Use in benchmarks or call directly for testing
- Access traces and config for analysis and debugging
Example workflow:
from maseval.interface.agents.smolagents import SmolAgentAdapter
from smolagents import MultiStepAgent, ToolCallingAgent
from smolagents.tools import DuckDuckGoSearchTool
# Create a smolagents agent
agent = ToolCallingAgent(
tools=[DuckDuckGoSearchTool()],
model="gpt-4",
max_steps=10
)
# Wrap with adapter
agent_adapter = SmolAgentAdapter(agent, name="search_agent")
# Run agent
result = agent_adapter.run("What's the latest news on AI?")
# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
print(f"{msg['role']}: {msg['content']}")
# Gather aggregated usage
usage = agent_adapter.gather_usage()
print(f"Total tokens: {usage.total_tokens}")
# Gather execution traces with timing
traces = agent_adapter.gather_traces()
print(f"Total duration: {traces['total_duration_seconds']}s")
# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)
The adapter automatically converts smolagents' ActionStep and PlanningStep objects into structured logs, preserving timing, token usage, tool calls, and error information.
Execution Monitoring
The adapter provides comprehensive monitoring through gather_traces():
- Token usage: Input, output, and total tokens per step and aggregated
- Timing: Duration per step and total execution time
- Tool calls: Complete tool call history with arguments and results
- Errors: Error tracking with type and message
- Observations: Tool outputs and agent observations
Requires
smolagents to be installed: pip install maseval[smolagents]
logs
property
logs: List[Dict[str, Any]]
Dynamically generate logs from smolagents' internal memory.
Converts smolagents' ActionStep and PlanningStep objects into log entries compatible with the AgentAdapter contract, including all available properties.
| RETURNS | DESCRIPTION |
|---|---|
List[Dict[str, Any]]
|
List of log dictionaries with comprehensive step information |
__init__
__init__(
agent_instance: Any,
name: str,
callbacks: Any = None,
cost_calculator: Optional[CostCalculator] = None,
model_id: Optional[str] = None,
)
Initialize the Smolagent adapter.
Note: We don't call super().init() to avoid initializing self.logs as a list, since we override it as a property that dynamically fetches from agent.memory.
| PARAMETER | DESCRIPTION |
|---|---|
agent_instance
|
smolagents MultiStepAgent or similar
TYPE:
|
name
|
Agent name for identification
TYPE:
|
callbacks
|
Optional list of AgentCallback instances
TYPE:
|
cost_calculator
|
Optional cost calculator. If not provided, a
TYPE:
|
model_id
|
Optional model ID for cost calculation. If not provided,
auto-detected from
TYPE:
|
gather_config
gather_config() -> dict[str, Any]
Gather configuration from this SmolAgent.
Integrates with smolagents' native configuration system by accessing the agent's to_dict() method which includes comprehensive config data.
| RETURNS | DESCRIPTION |
|---|---|
dict[str, Any]
|
Dictionary containing: |
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
gather_traces
gather_traces() -> dict
Gather traces including message history and monitoring data.
Extends the base class to include smolagents' per-step monitoring data
(token usage, timing, actions, observations). Aggregated usage totals
are available via gather_usage().
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Dict containing messages and per-step monitoring statistics. |
gather_usage
gather_usage() -> Usage
Gather usage with automatic cost calculation.
Calls _gather_usage() for raw token counts, then applies
the cost calculator if one is available and cost is still 0.0.
The model_id used for cost calculation is resolved in order:
- Explicit
model_idpassed to__init__ - Auto-detected from the framework agent via
_resolve_model_id()
Subclasses should override _gather_usage() (not this method)
to provide framework-specific token extraction.
| RETURNS | DESCRIPTION |
|---|---|
Usage
|
Usage (or TokenUsage) with cost filled in when possible. |
get_messages
get_messages() -> MessageHistory
Get message history by converting from smolagents memory.
This method dynamically fetches messages from the agent's internal memory and converts them to MASEval format.
| RETURNS | DESCRIPTION |
|---|---|
MessageHistory
|
MessageHistory with converted messages from smolagents |
run
run(query: str) -> Any
Executes the agent and returns the result.
LangGraphAgentAdapter
Bases: AgentAdapter
An AgentAdapter for LangGraph CompiledGraph agents.
This adapter integrates LangGraph's compiled graphs with MASEval's benchmarking framework, converting LangChain/LangGraph message types to OpenAI-compatible MessageHistory format. It preserves tool calls, tool responses, multi-modal content, and supports both stateless and stateful (checkpointed) graph execution.
LangGraph graphs can operate in two modes:
- Stateless: Messages from invoke() result are cached in the adapter for access
- Stateful: With checkpointer and thread_id, messages are fetched from persistent state
The adapter automatically handles both modes, preferring persistent state when available and falling back to cached results for stateless graphs.
How to use
- Create a LangGraph graph with state and nodes
- Compile the graph (optionally with checkpointer for state persistence)
- Wrap with LangGraphAgentAdapter to enable MASEval integration
- Use in benchmarks or call directly for testing
- Access traces and config for analysis and debugging
Example workflow:
from maseval.interface.agents.langgraph import LangGraphAgentAdapter
from langgraph.graph import StateGraph, MessagesState
from langgraph.checkpoint.memory import MemorySaver
# Define your graph
def chatbot(state: MessagesState):
# Your agent logic
return {"messages": [response]}
# Build graph
graph = StateGraph(MessagesState)
graph.add_node("chatbot", chatbot)
graph.set_entry_point("chatbot")
graph.set_finish_point("chatbot")
# Compile (stateless)
compiled_graph = graph.compile()
agent_adapter = LangGraphAgentAdapter(compiled_graph, "agent_name")
# Or compile with checkpointer (stateful)
memory = MemorySaver()
compiled_graph = graph.compile(checkpointer=memory)
config = {"configurable": {"thread_id": "session_1"}}
agent_adapter = LangGraphAgentAdapter(
compiled_graph,
"agent_name",
config=config
)
# Run agent
result = agent_adapter.run("What's the weather?")
# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
print(f"{msg['role']}: {msg['content']}")
# Gather execution traces
traces = agent_adapter.gather_traces()
if 'total_tokens' in traces:
print(f"Total tokens: {traces['total_tokens']}")
# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)
For stateful graphs, the adapter preserves conversation context across multiple calls using the same thread_id, enabling multi-turn interactions.
Token Usage
If LangChain messages include usage_metadata, the adapter automatically extracts
and aggregates token counts. This is available for models that provide usage information.
Requires
langgraph to be installed: pip install maseval[langgraph]
__init__
__init__(
agent_instance: Any,
name: str,
callbacks: Optional[List[Any]] = None,
config: Optional[Dict[str, Any]] = None,
cost_calculator: Optional[CostCalculator] = None,
model_id: Optional[str] = None,
)
Initialize the LangGraph adapter.
| PARAMETER | DESCRIPTION |
|---|---|
agent_instance
|
Compiled LangGraph graph
TYPE:
|
name
|
Agent name
TYPE:
|
callbacks
|
Optional list of callbacks
TYPE:
|
config
|
Optional LangGraph config dict (for stateful graphs with checkpointer).
Should include
TYPE:
|
cost_calculator
|
Optional cost calculator. If not provided, a
TYPE:
|
model_id
|
Model ID for cost calculation. LangGraph graphs can contain multiple models across nodes, so the model ID cannot be auto-detected. Pass the primary model's ID here to enable cost tracking::
TYPE:
|
gather_config
gather_config() -> dict[str, Any]
Gather configuration from this LangGraph agent.
| RETURNS | DESCRIPTION |
|---|---|
dict[str, Any]
|
Dictionary containing: |
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
dict[str, Any]
|
|
gather_traces
gather_traces() -> Dict[str, Any]
Gather execution traces from this agent.
Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.
Output fields:
type- Component class namegathered_at- ISO timestampname- Agent nameagent_type- Underlying agent framework class namemessage_count- Number of messages in historymessages- Full message history as list of dictscallbacks- List of callback class names attached to this agent
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dictionary containing agent execution traces. |
How to use
This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:
def gather_traces(self) -> Dict[str, Any]:
return {
**super().gather_traces(),
"framework_specific_metric": self.agent.some_metric
}
gather_usage
gather_usage() -> Usage
Gather usage with automatic cost calculation.
Calls _gather_usage() for raw token counts, then applies
the cost calculator if one is available and cost is still 0.0.
The model_id used for cost calculation is resolved in order:
- Explicit
model_idpassed to__init__ - Auto-detected from the framework agent via
_resolve_model_id()
Subclasses should override _gather_usage() (not this method)
to provide framework-specific token extraction.
| RETURNS | DESCRIPTION |
|---|---|
Usage
|
Usage (or TokenUsage) with cost filled in when possible. |
get_messages
get_messages() -> MessageHistory
Get message history from LangGraph.
For stateful graphs (with checkpointer and thread_id), fetches from graph state. For stateless graphs, returns cached messages from last run.
| RETURNS | DESCRIPTION |
|---|---|
MessageHistory
|
MessageHistory with converted messages |
run
run(query: str) -> Any
Executes the agent and returns the result.
LlamaIndexAgentAdapter
Bases: AgentAdapter
An AgentAdapter for LlamaIndex workflow-based agents.
This adapter integrates LlamaIndex's workflow-based agent system with MASEval's benchmarking framework, converting LlamaIndex's ChatMessage format to OpenAI-compatible MessageHistory format. It handles both AgentWorkflow and BaseWorkflowAgent instances, automatically managing async execution in synchronous contexts.
LlamaIndex agents are async-first, using workflows that must be awaited. This adapter handles the async-to-sync conversion automatically, supporting both agents with persistent memory and stateless execution modes. It seamlessly integrates with MASEval's synchronous benchmarking API.
How to use
- Create a LlamaIndex workflow agent with tools and LLM
- Wrap with LlamaIndexAgentAdapter to enable MASEval integration
- Use in benchmarks or call directly for testing
- Access traces and config for analysis and debugging
Example workflow:
from maseval.interface.agents.llamaindex import LlamaIndexAgentAdapter
from llama_index.core.agent.workflow import AgentWorkflow
from llama_index.core.llms import OpenAI
from llama_index.core.tools import FunctionTool
# Define a tool
def search(query: str) -> str:
"""Search for information."""
return f"Results for: {query}"
search_tool = FunctionTool.from_defaults(fn=search)
# Create a LlamaIndex workflow
workflow = AgentWorkflow.from_tools_or_functions(
tools_or_functions=[search_tool],
llm=OpenAI(model="gpt-4"),
system_prompt="You are a helpful research assistant"
)
# Wrap with adapter
agent_adapter = LlamaIndexAgentAdapter(workflow, "research_agent")
# Run agent (async handled automatically)
result = agent_adapter.run("What are the latest developments in quantum computing?")
# Access message history in OpenAI format
for msg in agent_adapter.get_messages():
print(f"{msg['role']}: {msg['content']}")
# Gather configuration including tools and system prompt
config = agent_adapter.gather_config()
print(f"System prompt: {config['llamaindex_config']['system_prompt']}")
print(f"Tools: {config['llamaindex_config']['tools']}")
# Gather execution traces with timing
traces = agent_adapter.gather_traces()
if 'total_tokens' in traces:
print(f"Total tokens: {traces['total_tokens']}")
# Use in benchmark
benchmark = MyBenchmark(agent_data={"agent": agent_adapter})
results = benchmark.run(tasks)
The adapter works with various LlamaIndex agent types including AgentWorkflow, FunctionAgent (tool calling), ReActAgent, and CodeActAgent.
Async Handling
LlamaIndex agents return a WorkflowHandler from .run() which must be awaited.
The adapter handles this automatically:
- Checks for
run_sync()method first (for compatibility) - Falls back to
asyncio.run()to execute the asyncrun()method - Works seamlessly in synchronous benchmarking contexts
This allows you to use async-first LlamaIndex agents in MASEval's sync API without any additional configuration.
Supported Agent Types
- AgentWorkflow: Multi-agent workflow orchestrator
- FunctionAgent: Function-calling based agent (for LLMs with tool calling)
- ReActAgent: ReAct prompting pattern agent
- CodeActAgent: Code execution based agent
Token Usage
Token usage is extracted from LLM responses when available. If the LLM response includes usage metadata, it's automatically captured in execution traces.
Requires
llama-index-core to be installed: pip install maseval[llamaindex]
__init__
__init__(
agent_instance: Any,
name: str,
callbacks: Optional[List[Any]] = None,
max_iterations: Optional[int] = None,
cost_calculator: Optional[CostCalculator] = None,
model_id: Optional[str] = None,
)
Initialize the LlamaIndex adapter.
| PARAMETER | DESCRIPTION |
|---|---|
agent_instance
|
LlamaIndex AgentWorkflow or BaseWorkflowAgent instance
TYPE:
|
name
|
Agent name
TYPE:
|
callbacks
|
Optional list of callbacks
TYPE:
|
max_iterations
|
Maximum number of agent iterations for AgentWorkflow.run(). If None, LlamaIndex's DEFAULT_MAX_ITERATIONS (20) is used. Bug fix: FunctionAgent does NOT have a max_steps constructor parameter — passing max_steps to it is silently swallowed by **kwargs. The actual iteration limit must be passed here so the adapter forwards it to AgentWorkflow.run(max_iterations=...).
TYPE:
|
cost_calculator
|
Optional cost calculator. If not provided, a
TYPE:
|
model_id
|
Optional model ID for cost calculation. If not provided,
auto-detected from
TYPE:
|
gather_config
gather_config() -> Dict[str, Any]
Gather configuration from this LlamaIndex agent.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dictionary containing: |
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
gather_traces
gather_traces() -> Dict[str, Any]
Gather execution traces from this agent.
Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.
Output fields:
type- Component class namegathered_at- ISO timestampname- Agent nameagent_type- Underlying agent framework class namemessage_count- Number of messages in historymessages- Full message history as list of dictscallbacks- List of callback class names attached to this agent
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dictionary containing agent execution traces. |
How to use
This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:
def gather_traces(self) -> Dict[str, Any]:
return {
**super().gather_traces(),
"framework_specific_metric": self.agent.some_metric
}
gather_usage
gather_usage() -> Usage
Gather usage with automatic cost calculation.
Calls _gather_usage() for raw token counts, then applies
the cost calculator if one is available and cost is still 0.0.
The model_id used for cost calculation is resolved in order:
- Explicit
model_idpassed to__init__ - Auto-detected from the framework agent via
_resolve_model_id()
Subclasses should override _gather_usage() (not this method)
to provide framework-specific token extraction.
| RETURNS | DESCRIPTION |
|---|---|
Usage
|
Usage (or TokenUsage) with cost filled in when possible. |
get_messages
get_messages() -> MessageHistory
Get message history from LlamaIndex.
For agents with accessible memory, fetches from the agent's memory. Otherwise, returns cached messages from the last run.
| RETURNS | DESCRIPTION |
|---|---|
MessageHistory
|
MessageHistory with converted messages |
run
run(query: str) -> Any
Executes the agent and returns the result.