MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)
Beta
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
The MultiAgentBench benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
MultiAgentBench (from the MARBLE framework, where the original work was done) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. We use a bug-fixed fork for MASEval integration. The benchmark features:
- 6 diverse domains: research, bargaining, coding, database, werewolf, minecraft (minecraft is untested)
- Multiple coordination modes: cooperative, star, tree, hierarchical
- LLM-based evaluation: Matches MARBLE's evaluation methodology
- Framework-agnostic: Use with any agent framework or MARBLE's native agents
Reference Paper: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Check out the BENCHMARKS.md file for more information including licenses.
MultiAgentBenchBenchmark
Bases: Benchmark
Abstract base class for framework-agnostic MultiAgentBench evaluation.
This benchmark provides the infrastructure for evaluating multi-agent systems
on MARBLE's MultiAgentBench tasks. Subclasses implement setup_agents() with
their specific agent framework.
The benchmark supports: - Multiple coordination modes (star, cooperative, tree, hierarchical) - Multiple domains (research, bargaining, coding, database, etc.) - LLM-based evaluation matching MARBLE's metrics - Comprehensive tracing of agent interactions
Warning
communication_score is only computed when agents use
MarbleAgentAdapter, which populates the communication_log trace
key from BaseAgent.act(). Custom setup_agents() implementations
using other adapters must explicitly populate communication_log in
each adapter's gather_traces() output for communication evaluation
to work. See MultiAgentBenchEvaluator._extract_communications() for
the expected format.
Example
class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
def setup_agents(self, agent_data, environment, task, user, seed_generator):
# Derive seeds for agents (returns None if seeding disabled)
agents_gen = seed_generator.child("agents")
agent_seeds = {}
for config in task.environment_data.get("agents", []):
agent_id = config.get("agent_id")
agent_seeds[agent_id] = agents_gen.derive_seed(agent_id)
# Create agents using your framework with seeds
agents_list = []
agents_dict = {}
for config in task.environment_data.get("agents", []):
agent_id = config.get("agent_id")
model = self.get_model_adapter(
agent_data.get("model_id", "gpt-4o"),
register_name=f"agent_{agent_id}",
seed=agent_seeds.get(agent_id),
)
# Create your agent with the seeded model...
...
return agents_list, agents_dict
def get_model_adapter(self, model_id, **kwargs):
seed = kwargs.pop("seed", None)
adapter = MyModelAdapter(model_id, seed=seed)
if "register_name" in kwargs:
self.register("models", kwargs["register_name"], adapter)
return adapter
benchmark = MyMultiAgentBenchmark(seed=42) # Enable seeding
results = benchmark.run(tasks, agent_data={"model_id": "gpt-4o"})
seed_generator
property
seed_generator: SeedGenerator
The seed generator for this benchmark.
The seed generator is configured at benchmark initialization via the seed
or seed_generator parameters. When seed=None (the default), the generator's
derive_seed() method returns None, effectively disabling seeding while
maintaining a uniform interface.
| RETURNS | DESCRIPTION |
|---|---|
SeedGenerator
|
The root |
usage
property
usage: Usage
Running usage total across all task repetitions.
Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.
usage_by_component
property
usage_by_component: Dict[str, Usage]
Per-component running usage totals across all repetitions.
Keys are registry keys (e.g., "models:main_model").
__init__
__init__(
callbacks: Optional[List[BenchmarkCallback]] = None,
n_task_repeats: int = 1,
max_invocations: int = 10,
num_workers: int = 1,
fail_on_setup_error: bool = False,
fail_on_task_error: bool = False,
fail_on_evaluation_error: bool = False,
progress_bar: bool | str = True,
seed: Optional[int] = None,
seed_generator: Optional[SeedGenerator] = None,
)
Initialize the benchmark.
| PARAMETER | DESCRIPTION |
|---|---|
callbacks
|
Optional list of callbacks
TYPE:
|
n_task_repeats
|
Number of times to repeat each task
TYPE:
|
max_invocations
|
Maximum agent invocations per task
TYPE:
|
num_workers
|
Number of parallel workers
TYPE:
|
fail_on_setup_error
|
Raise on setup errors
TYPE:
|
fail_on_task_error
|
Raise on task errors
TYPE:
|
fail_on_evaluation_error
|
Raise on evaluation errors
TYPE:
|
progress_bar
|
Progress bar configuration
TYPE:
|
seed
|
Global seed for reproducible benchmark runs
TYPE:
|
seed_generator
|
Custom seed generator (takes precedence over seed)
TYPE:
|
add_callback
add_callback(callback: BenchmarkCallback) -> None
Register a callback handler to monitor benchmark execution.
| PARAMETER | DESCRIPTION |
|---|---|
callback
|
A BenchmarkCallback instance that will receive execution events.
TYPE:
|
How to use
Callbacks receive notifications at key lifecycle points for tracing, progress tracking,
or custom metrics collection. See BenchmarkCallback
for available hooks and their signatures.
from maseval.core.callbacks import MessageTracingCallback
benchmark = MyBenchmark(tasks=tasks, agent_data=config)
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run()
clear_registry
clear_registry() -> None
Clear the component registry after a task repetition completes.
This method is called automatically by run() after each task repetition
to ensure components are not carried over between repetitions. The
reports list persists across all repetitions for aggregated analysis.
collect_all_configs
collect_all_configs() -> Dict[str, Any]
Collect configuration from all registered components for the current task repetition.
This method is called automatically by run() after each task repetition completes
and before evaluation begins. It gathers comprehensive configuration from all registered
components (agents, models, tools, simulators, callbacks, etc.) for that specific
repetition. After collection, the registry is cleared for the next repetition.
The collected configs are stored in benchmark.reports list along with traces
for persistent access across all task repetitions.
Output fields:
metadata- Collection timestamp and thread infoagents- Dict mapping agent names to their config (settings, parameters)models- Dict mapping model names to their config (model IDs, parameters)tools- Dict mapping tool names to their config (specifications, settings)simulators- Dict mapping simulator names to their config (parameters, templates)callbacks- Dict mapping callback names to their config (settings)environment- Direct config from the environment (not nested), orNoneif not presentuser- Direct config from the user simulator (not nested), orNoneif not presentother- Dict for any other registered componentsbenchmark- Benchmark-level configuration (git, system, packages)
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Structured dictionary containing configuration from all registered components. |
How to use
This method is called automatically by run() after each task repetition:
# Automatic collection (recommended)
results = benchmark.run()
# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
# Agents is a dict: agent_name -> config
print(f"Agent config: {report['config']['agents']['my_agent']}")
# Environment and user are direct (not nested)
print(f"Environment config: {report['config']['environment']}")
print(f"User config: {report['config']['user']}")
# Benchmark-level config
print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")
The collected configs are available in the results for reproducibility analysis.
collect_all_traces
collect_all_traces() -> Dict[str, Any]
Collect execution traces from all registered components for the current task repetition.
This method is called automatically by run() after each task repetition completes
and before evaluation begins. It gathers comprehensive traces from all registered
components (agents, models, tools, simulators, callbacks, etc.) for that specific
repetition. After collection, the registry is cleared for the next repetition.
The collected traces are stored in benchmark.reports list along with configs
for persistent access across all task repetitions.
Output fields:
metadata- Collection timestamp and thread infoagents- Dict mapping agent names to their traces (messages, execution data)models- Dict mapping model names to their traces (API calls, timing, errors)tools- Dict mapping tool names to their traces (invocations, parameters)simulators- Dict mapping simulator names to their traces (attempts, outcomes)callbacks- Dict mapping callback names to their traces (custom data)environment- Direct traces from the environment (not nested), orNoneif not presentuser- Direct traces from the user simulator (not nested), orNoneif not presentother- Dict for any other registered components
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Structured dictionary containing execution traces from all registered components. |
How to use
This method is called automatically by run() after each task repetition:
# Automatic collection (recommended)
results = benchmark.run()
# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
# Agents is a dict: agent_name -> traces
print(f"Agent messages: {report['traces']['agents']['my_agent']}")
# Environment and user are direct (not nested)
print(f"Environment state: {report['traces']['environment']}")
print(f"User interactions: {report['traces']['user']}")
The collected traces are passed to the evaluator's evaluate() method
and stored in benchmark.reports for later analysis.
collect_all_usage
collect_all_usage() -> Dict[str, Any]
Collect usage from all registered components for the current task repetition.
This method is called automatically by run() after each task repetition
completes. It gathers usage from all registered UsageTrackableMixin
components and also accumulates into persistent running totals accessible
via usage and usage_by_component.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Structured dictionary containing usage from all registered components. |
evaluate
evaluate(
evaluators: Sequence[Evaluator],
agents: Dict[str, AgentAdapter],
final_answer: Any,
traces: Dict[str, Any],
) -> List[Dict[str, Any]]
Execute evaluators on the results.
| PARAMETER | DESCRIPTION |
|---|---|
evaluators
|
The evaluators
TYPE:
|
agents
|
Dict of all agents
TYPE:
|
final_answer
|
The combined agent outputs
TYPE:
|
traces
|
Execution traces
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[Dict[str, Any]]
|
List of evaluation results |
execution_loop
execution_loop(
agents: Sequence[AgentAdapter],
task: Task,
environment: Environment,
user: Optional[User],
) -> Any
Execute agents in a single pass.
MultiAgentBench uses multi-agent coordination instead of user interaction.
The base class execution_loop breaks after one call when user is None,
so this override makes the single-pass behavior explicit.
Subclasses (e.g. MarbleMultiAgentBenchBenchmark) override this with
multi-iteration coordination loops matching their framework's orchestration.
get_failed_tasks
get_failed_tasks(
status_filter: Optional[
Union[
TaskExecutionStatus, List[TaskExecutionStatus]
]
] = None,
reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue
Get tasks that failed during benchmark execution.
This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.
| PARAMETER | DESCRIPTION |
|---|---|
status_filter
|
Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples: - TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution - TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed - [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup
TYPE:
|
reports
|
Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
SequentialTaskQueue
|
SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter. |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If reports is None and run() has not been executed yet. |
How to use
# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)
# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")
# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)
# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
TaskExecutionStatus.TASK_EXECUTION_FAILED,
reports=reports
)
# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
status_filter=[
TaskExecutionStatus.SETUP_FAILED,
TaskExecutionStatus.TASK_EXECUTION_FAILED
],
reports=reports
)
# Retry failed tasks elegantly - this is the key use case!
if len(failed) > 0:
retry_reports = benchmark.run(tasks=failed)
# Or more concisely
reports = benchmark.run(tasks=tasks)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks())
get_model_adapter
abstractmethod
get_model_adapter(
model_id: str, **kwargs: Any
) -> ModelAdapter
Provide a model adapter (implement in subclass).
| PARAMETER | DESCRIPTION |
|---|---|
model_id
|
Model identifier
TYPE:
|
**kwargs
|
Additional arguments including register_name
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ModelAdapter
|
ModelAdapter instance |
register
register(
category: str,
name: str,
component: RegisterableComponent,
) -> RegisterableComponent
Register a component for comprehensive trace and configuration collection.
All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.
Note: Most components are automatically registered when returned from
setup methods (setup_environment, setup_user, setup_agents). You only
need to manually register additional components like models, simulators, or
tools that aren't automatically captured.
| PARAMETER | DESCRIPTION |
|---|---|
category
|
Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs().
TYPE:
|
name
|
Unique identifier for this component within its category
TYPE:
|
component
|
Any object inheriting from TraceableMixin and/or ConfigurableMixin
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
RegisterableComponent
|
The component (for chaining convenience) |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the component is already registered under a different name |
How to use
Most components are auto-registered. Manual registration is only needed for additional components:
def setup_agents(self, agent_data, environment, task, user):
# Create model (needs manual registration)
model = MyModelAdapter(...)
self.register("models", "main_model", model)
# Create agent (auto-registered when returned)
agent = MyAgent(model=model)
agent_adapter = AgentAdapter(agent, "agent1")
# Environment and user are also auto-registered
return [agent_adapter], {"agent1": agent_adapter}
Traces and configs are automatically collected before evaluation via
collect_all_traces() and collect_all_configs() which are called
internally by the run() method.
run
run(
tasks: Union[
Task, BaseTaskQueue, Iterable[Union[Task, dict]]
],
agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]
Initialize and execute the complete benchmark loop across all tasks.
| PARAMETER | DESCRIPTION |
|---|---|
tasks
|
Task source for execution. Can be: - A single Task object - A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue) - An iterable of Task objects or dicts that will be converted to Tasks When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.
TYPE:
|
agent_data
|
Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[Dict[str, Any]]
|
List of report dictionaries, one per task repetition. Each report contains: |
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If agent_data length doesn't match number of tasks (when agent_data is an iterable). |
How to use
This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.
By default, the benchmark will continue executing remaining tasks even if some fail.
You can change this behavior by setting fail_on_task_error=True,
fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating
the benchmark. Each task execution returns a status indicating success or the specific
failure type (see TaskExecutionStatus).
For each task execution, the framework:
- Calls your setup methods to initialize components
- Calls your
run_agents()method to execute the task - Collects message histories and calls evaluators
- Stores results and triggers callbacks
Pseudocode structure:
for task in tasks:
for repeat in range(n_task_repeats):
# Setup stage
environment = setup_environment(agent_data, task)
user = setup_user(agent_data, environment, task)
agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
evaluators = setup_evaluators(environment, task, agents_to_run, user)
# Run stage (execution_loop handles multi-turn if user exists)
agents_output = execution_loop(agents_to_run, task, environment, user)
# Evaluate stage
traces = collect_message_histories(agents_dict)
eval_results = evaluate(evaluators, traces, agents_dict)
# Store results
store_result(task_id, traces, eval_results)
Callback hooks are triggered at these points:
- on_run_start: Before processing any tasks
- on_task_start: Before processing a task (once per task, not per repeat)
- on_task_repeat_start: Before each repetition of a task
- on_task_repeat_end: After each repetition completes
- on_task_end: After all repetitions of a task complete
- on_run_end: After all tasks complete
# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)
# Analyze results
for report in reports:
print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
print(f"Config: {report['config']}")
print(f"Traces: {report['traces']}")
# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)
# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})
# Task-specific agent configs (must match task count)
reports = benchmark.run(
tasks=tasks,
agent_data=[
{"model": "gpt-4", "difficulty": "easy"},
{"model": "gpt-4", "difficulty": "hard"},
]
)
# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)
# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue) # queue receives on_task_complete callbacks
run_agents
run_agents(
agents: Sequence[AgentAdapter],
task: Task,
environment: Environment,
query: str,
) -> Dict[str, Any]
Execute the multi-agent system.
For MultiAgentBench, this runs all agents on the task and collects their outputs.
| PARAMETER | DESCRIPTION |
|---|---|
agents
|
Agents to run
TYPE:
|
task
|
The task
TYPE:
|
environment
|
The environment
TYPE:
|
query
|
The query/task content
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with agent_results, communications, and coordination_mode |
setup_agents
abstractmethod
setup_agents(
agent_data: Dict[str, Any],
environment: Environment,
task: Task,
user: Optional[User],
seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]
Create agents for the task (implement in subclass).
Subclasses should: 1. Read agent specifications from task.environment_data["agents"] 2. Derive seeds from seed_generator for each agent's model 3. Create agents using their framework with seeded models 4. Wrap them in AgentAdapter 5. Set up relationships from task.environment_data["relationships"]
| PARAMETER | DESCRIPTION |
|---|---|
agent_data
|
Agent configuration (model IDs, etc.)
TYPE:
|
environment
|
The environment instance
TYPE:
|
task
|
The task containing agent specs
TYPE:
|
user
|
User simulator (None for MultiAgentBench)
TYPE:
|
seed_generator
|
Seed generator for deriving deterministic seeds.
Use
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]
|
Tuple of (agents_to_run, agents_dict) |
Example
def setup_agents(self, agent_data, environment, task, user, seed_generator):
agents_gen = seed_generator.child("agents")
for config in task.environment_data.get("agents", []):
agent_id = config.get("agent_id")
seed = agents_gen.derive_seed(agent_id) # Returns None if seeding disabled
model = self.get_model_adapter(model_id, seed=seed)
# Create agent with seeded model...
setup_environment
setup_environment(
agent_data: Dict[str, Any],
task: Task,
seed_generator: SeedGenerator,
) -> Environment
Create the MultiAgentBench environment.
| PARAMETER | DESCRIPTION |
|---|---|
agent_data
|
Agent configuration
TYPE:
|
task
|
The task to set up
TYPE:
|
seed_generator
|
Seed generator for reproducibility
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Environment
|
MultiAgentBenchEnvironment instance |
setup_evaluators
setup_evaluators(
environment: Environment,
task: Task,
agents: Sequence[AgentAdapter],
user: Optional[User],
seed_generator: SeedGenerator,
) -> Sequence[Evaluator]
Create evaluators for the task.
| PARAMETER | DESCRIPTION |
|---|---|
environment
|
The environment
TYPE:
|
task
|
The task with evaluation data
TYPE:
|
agents
|
The agents
TYPE:
|
user
|
User simulator (None for MultiAgentBench)
TYPE:
|
seed_generator
|
Seed generator for reproducibility
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Sequence[Evaluator]
|
List of evaluators |
setup_user
setup_user(
agent_data: Dict[str, Any],
environment: Environment,
task: Task,
seed_generator: SeedGenerator,
) -> Optional[User]
MultiAgentBench tasks don't use user simulators.
The multi-agent coordination replaces user interaction.
| PARAMETER | DESCRIPTION |
|---|---|
agent_data
|
Agent configuration
TYPE:
|
environment
|
The environment instance
TYPE:
|
task
|
The task
TYPE:
|
seed_generator
|
Seed generator (unused)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Optional[User]
|
None |
MarbleMultiAgentBenchBenchmark
Bases: MultiAgentBenchBenchmark
MARBLE reproduction mode for MultiAgentBench.
This benchmark uses MARBLE's native agents and engine for exact reproduction of published results. It wraps MARBLE components in MASEval adapters for unified tracing.
Example
from maseval.benchmark.multiagentbench import (
MarbleMultiAgentBenchBenchmark,
load_tasks,
configure_model_ids,
)
class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
def get_model_adapter(self, model_id, **kwargs):
from maseval.interface.openai import OpenAIModelAdapter
adapter = OpenAIModelAdapter(model_id)
if "register_name" in kwargs:
self.register("models", kwargs["register_name"], adapter)
return adapter
tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")
benchmark = MyMarbleBenchmark()
results = benchmark.run(tasks, agent_data={})
seed_generator
property
seed_generator: SeedGenerator
The seed generator for this benchmark.
The seed generator is configured at benchmark initialization via the seed
or seed_generator parameters. When seed=None (the default), the generator's
derive_seed() method returns None, effectively disabling seeding while
maintaining a uniform interface.
| RETURNS | DESCRIPTION |
|---|---|
SeedGenerator
|
The root |
usage
property
usage: Usage
Running usage total across all task repetitions.
Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.
usage_by_component
property
usage_by_component: Dict[str, Usage]
Per-component running usage totals across all repetitions.
Keys are registry keys (e.g., "models:main_model").
__init__
__init__(
callbacks: Optional[List[BenchmarkCallback]] = None,
n_task_repeats: int = 1,
max_invocations: int = 10,
num_workers: int = 1,
fail_on_setup_error: bool = False,
fail_on_task_error: bool = False,
fail_on_evaluation_error: bool = False,
progress_bar: bool | str = True,
seed: Optional[int] = None,
seed_generator: Optional[SeedGenerator] = None,
)
Initialize the benchmark.
| PARAMETER | DESCRIPTION |
|---|---|
callbacks
|
Optional list of callbacks
TYPE:
|
n_task_repeats
|
Number of times to repeat each task
TYPE:
|
max_invocations
|
Maximum agent invocations per task
TYPE:
|
num_workers
|
Number of parallel workers
TYPE:
|
fail_on_setup_error
|
Raise on setup errors
TYPE:
|
fail_on_task_error
|
Raise on task errors
TYPE:
|
fail_on_evaluation_error
|
Raise on evaluation errors
TYPE:
|
progress_bar
|
Progress bar configuration
TYPE:
|
seed
|
Global seed for reproducible benchmark runs
TYPE:
|
seed_generator
|
Custom seed generator (takes precedence over seed)
TYPE:
|
add_callback
add_callback(callback: BenchmarkCallback) -> None
Register a callback handler to monitor benchmark execution.
| PARAMETER | DESCRIPTION |
|---|---|
callback
|
A BenchmarkCallback instance that will receive execution events.
TYPE:
|
How to use
Callbacks receive notifications at key lifecycle points for tracing, progress tracking,
or custom metrics collection. See BenchmarkCallback
for available hooks and their signatures.
from maseval.core.callbacks import MessageTracingCallback
benchmark = MyBenchmark(tasks=tasks, agent_data=config)
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run()
clear_registry
clear_registry() -> None
Clear the component registry after a task repetition completes.
This method is called automatically by run() after each task repetition
to ensure components are not carried over between repetitions. The
reports list persists across all repetitions for aggregated analysis.
collect_all_configs
collect_all_configs() -> Dict[str, Any]
Collect configuration from all registered components for the current task repetition.
This method is called automatically by run() after each task repetition completes
and before evaluation begins. It gathers comprehensive configuration from all registered
components (agents, models, tools, simulators, callbacks, etc.) for that specific
repetition. After collection, the registry is cleared for the next repetition.
The collected configs are stored in benchmark.reports list along with traces
for persistent access across all task repetitions.
Output fields:
metadata- Collection timestamp and thread infoagents- Dict mapping agent names to their config (settings, parameters)models- Dict mapping model names to their config (model IDs, parameters)tools- Dict mapping tool names to their config (specifications, settings)simulators- Dict mapping simulator names to their config (parameters, templates)callbacks- Dict mapping callback names to their config (settings)environment- Direct config from the environment (not nested), orNoneif not presentuser- Direct config from the user simulator (not nested), orNoneif not presentother- Dict for any other registered componentsbenchmark- Benchmark-level configuration (git, system, packages)
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Structured dictionary containing configuration from all registered components. |
How to use
This method is called automatically by run() after each task repetition:
# Automatic collection (recommended)
results = benchmark.run()
# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
# Agents is a dict: agent_name -> config
print(f"Agent config: {report['config']['agents']['my_agent']}")
# Environment and user are direct (not nested)
print(f"Environment config: {report['config']['environment']}")
print(f"User config: {report['config']['user']}")
# Benchmark-level config
print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")
The collected configs are available in the results for reproducibility analysis.
collect_all_traces
collect_all_traces() -> Dict[str, Any]
Collect execution traces from all registered components for the current task repetition.
This method is called automatically by run() after each task repetition completes
and before evaluation begins. It gathers comprehensive traces from all registered
components (agents, models, tools, simulators, callbacks, etc.) for that specific
repetition. After collection, the registry is cleared for the next repetition.
The collected traces are stored in benchmark.reports list along with configs
for persistent access across all task repetitions.
Output fields:
metadata- Collection timestamp and thread infoagents- Dict mapping agent names to their traces (messages, execution data)models- Dict mapping model names to their traces (API calls, timing, errors)tools- Dict mapping tool names to their traces (invocations, parameters)simulators- Dict mapping simulator names to their traces (attempts, outcomes)callbacks- Dict mapping callback names to their traces (custom data)environment- Direct traces from the environment (not nested), orNoneif not presentuser- Direct traces from the user simulator (not nested), orNoneif not presentother- Dict for any other registered components
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Structured dictionary containing execution traces from all registered components. |
How to use
This method is called automatically by run() after each task repetition:
# Automatic collection (recommended)
results = benchmark.run()
# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
# Agents is a dict: agent_name -> traces
print(f"Agent messages: {report['traces']['agents']['my_agent']}")
# Environment and user are direct (not nested)
print(f"Environment state: {report['traces']['environment']}")
print(f"User interactions: {report['traces']['user']}")
The collected traces are passed to the evaluator's evaluate() method
and stored in benchmark.reports for later analysis.
collect_all_usage
collect_all_usage() -> Dict[str, Any]
Collect usage from all registered components for the current task repetition.
This method is called automatically by run() after each task repetition
completes. It gathers usage from all registered UsageTrackableMixin
components and also accumulates into persistent running totals accessible
via usage and usage_by_component.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Structured dictionary containing usage from all registered components. |
evaluate
evaluate(
evaluators: Sequence[Evaluator],
agents: Dict[str, AgentAdapter],
final_answer: Any,
traces: Dict[str, Any],
) -> List[Dict[str, Any]]
Execute evaluators on the results.
| PARAMETER | DESCRIPTION |
|---|---|
evaluators
|
The evaluators
TYPE:
|
agents
|
Dict of all agents
TYPE:
|
final_answer
|
The combined agent outputs
TYPE:
|
traces
|
Execution traces
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[Dict[str, Any]]
|
List of evaluation results |
execution_loop
execution_loop(
agents: Sequence[AgentAdapter],
task: Task,
environment: Environment,
user: Optional[User],
) -> Any
Execute MARBLE's multi-iteration coordination loop.
Dispatches to the appropriate coordination handler based on the task's
coordinate_mode. Replicates Engine.start() from
marble/engine/engine.py:1034-1055.
| PARAMETER | DESCRIPTION |
|---|---|
agents
|
MARBLE agents wrapped in MarbleAgentAdapter
TYPE:
|
task
|
The task being solved
TYPE:
|
environment
|
The environment
TYPE:
|
user
|
Always None for MultiAgentBench
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Any
|
Dict with agent_results, communications, and coordination_mode |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If coordinate_mode is not supported |
get_failed_tasks
get_failed_tasks(
status_filter: Optional[
Union[
TaskExecutionStatus, List[TaskExecutionStatus]
]
] = None,
reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue
Get tasks that failed during benchmark execution.
This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.
| PARAMETER | DESCRIPTION |
|---|---|
status_filter
|
Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples: - TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution - TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed - [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup
TYPE:
|
reports
|
Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
SequentialTaskQueue
|
SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter. |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If reports is None and run() has not been executed yet. |
How to use
# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)
# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")
# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)
# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
TaskExecutionStatus.TASK_EXECUTION_FAILED,
reports=reports
)
# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
status_filter=[
TaskExecutionStatus.SETUP_FAILED,
TaskExecutionStatus.TASK_EXECUTION_FAILED
],
reports=reports
)
# Retry failed tasks elegantly - this is the key use case!
if len(failed) > 0:
retry_reports = benchmark.run(tasks=failed)
# Or more concisely
reports = benchmark.run(tasks=tasks)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks())
get_model_adapter
abstractmethod
get_model_adapter(
model_id: str, **kwargs: Any
) -> ModelAdapter
Provide a model adapter (implement in subclass).
| PARAMETER | DESCRIPTION |
|---|---|
model_id
|
Model identifier
TYPE:
|
**kwargs
|
Additional arguments
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ModelAdapter
|
ModelAdapter instance |
register
register(
category: str,
name: str,
component: RegisterableComponent,
) -> RegisterableComponent
Register a component for comprehensive trace and configuration collection.
All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.
Note: Most components are automatically registered when returned from
setup methods (setup_environment, setup_user, setup_agents). You only
need to manually register additional components like models, simulators, or
tools that aren't automatically captured.
| PARAMETER | DESCRIPTION |
|---|---|
category
|
Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs().
TYPE:
|
name
|
Unique identifier for this component within its category
TYPE:
|
component
|
Any object inheriting from TraceableMixin and/or ConfigurableMixin
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
RegisterableComponent
|
The component (for chaining convenience) |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the component is already registered under a different name |
How to use
Most components are auto-registered. Manual registration is only needed for additional components:
def setup_agents(self, agent_data, environment, task, user):
# Create model (needs manual registration)
model = MyModelAdapter(...)
self.register("models", "main_model", model)
# Create agent (auto-registered when returned)
agent = MyAgent(model=model)
agent_adapter = AgentAdapter(agent, "agent1")
# Environment and user are also auto-registered
return [agent_adapter], {"agent1": agent_adapter}
Traces and configs are automatically collected before evaluation via
collect_all_traces() and collect_all_configs() which are called
internally by the run() method.
run
run(
tasks: Union[
Task, BaseTaskQueue, Iterable[Union[Task, dict]]
],
agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]
Initialize and execute the complete benchmark loop across all tasks.
| PARAMETER | DESCRIPTION |
|---|---|
tasks
|
Task source for execution. Can be: - A single Task object - A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue) - An iterable of Task objects or dicts that will be converted to Tasks When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.
TYPE:
|
agent_data
|
Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[Dict[str, Any]]
|
List of report dictionaries, one per task repetition. Each report contains: |
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
List[Dict[str, Any]]
|
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If agent_data length doesn't match number of tasks (when agent_data is an iterable). |
How to use
This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.
By default, the benchmark will continue executing remaining tasks even if some fail.
You can change this behavior by setting fail_on_task_error=True,
fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating
the benchmark. Each task execution returns a status indicating success or the specific
failure type (see TaskExecutionStatus).
For each task execution, the framework:
- Calls your setup methods to initialize components
- Calls your
run_agents()method to execute the task - Collects message histories and calls evaluators
- Stores results and triggers callbacks
Pseudocode structure:
for task in tasks:
for repeat in range(n_task_repeats):
# Setup stage
environment = setup_environment(agent_data, task)
user = setup_user(agent_data, environment, task)
agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
evaluators = setup_evaluators(environment, task, agents_to_run, user)
# Run stage (execution_loop handles multi-turn if user exists)
agents_output = execution_loop(agents_to_run, task, environment, user)
# Evaluate stage
traces = collect_message_histories(agents_dict)
eval_results = evaluate(evaluators, traces, agents_dict)
# Store results
store_result(task_id, traces, eval_results)
Callback hooks are triggered at these points:
- on_run_start: Before processing any tasks
- on_task_start: Before processing a task (once per task, not per repeat)
- on_task_repeat_start: Before each repetition of a task
- on_task_repeat_end: After each repetition completes
- on_task_end: After all repetitions of a task complete
- on_run_end: After all tasks complete
# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)
# Analyze results
for report in reports:
print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
print(f"Config: {report['config']}")
print(f"Traces: {report['traces']}")
# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)
# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})
# Task-specific agent configs (must match task count)
reports = benchmark.run(
tasks=tasks,
agent_data=[
{"model": "gpt-4", "difficulty": "easy"},
{"model": "gpt-4", "difficulty": "hard"},
]
)
# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)
# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue) # queue receives on_task_complete callbacks
run_agents
run_agents(
agents: Sequence[AgentAdapter],
task: Task,
environment: Environment,
query: str,
) -> Dict[str, Any]
Execute the multi-agent system.
For MultiAgentBench, this runs all agents on the task and collects their outputs.
| PARAMETER | DESCRIPTION |
|---|---|
agents
|
Agents to run
TYPE:
|
task
|
The task
TYPE:
|
environment
|
The environment
TYPE:
|
query
|
The query/task content
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with agent_results, communications, and coordination_mode |
setup_agents
setup_agents(
agent_data: Dict[str, Any],
environment: Environment,
task: Task,
user: Optional[User],
seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]
Create MARBLE agents wrapped in MASEval adapters.
Also creates MARBLE's orchestration components (EnginePlanner,
SharedMemory, AgentGraph) needed by execution_loop to
replicate MARBLE's multi-iteration coordination.
Note
MARBLE agents use their own internal LLM handling with a model ID string,
not MASEval's ModelAdapter. This means seed_generator cannot be applied
to agent LLM calls in this implementation. For reproducible agent behavior,
use MultiAgentBenchBenchmark with a custom setup_agents that creates
agents using seeded MASEval ModelAdapters.
| PARAMETER | DESCRIPTION |
|---|---|
agent_data
|
Agent configuration
TYPE:
|
environment
|
The environment
TYPE:
|
task
|
The task with agent specifications
TYPE:
|
user
|
User simulator (None)
TYPE:
|
seed_generator
|
Seed generator (not used for MARBLE agents, but seeding is applied to evaluators)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]
|
Tuple of (agents_to_run, agents_dict) |
setup_environment
setup_environment(
agent_data: Dict[str, Any],
task: Task,
seed_generator: SeedGenerator,
) -> Environment
Create the MultiAgentBench environment.
| PARAMETER | DESCRIPTION |
|---|---|
agent_data
|
Agent configuration
TYPE:
|
task
|
The task to set up
TYPE:
|
seed_generator
|
Seed generator for reproducibility
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Environment
|
MultiAgentBenchEnvironment instance |
setup_evaluators
setup_evaluators(
environment: Environment,
task: Task,
agents: Sequence[AgentAdapter],
user: Optional[User],
seed_generator: SeedGenerator,
) -> Sequence[Evaluator]
Create a thin evaluator for MARBLE reproduction mode.
All LLM-based evaluation happens inside the coordination loop via
MARBLE's Evaluator (imported directly). The MarbleReproductionEvaluator
only reformats pre-computed metrics into MASEval's result format.
No ModelAdapter is needed — evaluation LLM calls are handled by
MARBLE's model_prompting() in the coordination loop.
| PARAMETER | DESCRIPTION |
|---|---|
environment
|
The environment
TYPE:
|
task
|
The task with evaluation data
TYPE:
|
agents
|
The agents
TYPE:
|
user
|
User simulator (None for MultiAgentBench)
TYPE:
|
seed_generator
|
Seed generator for reproducibility
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Sequence[Evaluator]
|
List containing a single MarbleReproductionEvaluator |
setup_user
setup_user(
agent_data: Dict[str, Any],
environment: Environment,
task: Task,
seed_generator: SeedGenerator,
) -> Optional[User]
MultiAgentBench tasks don't use user simulators.
The multi-agent coordination replaces user interaction.
| PARAMETER | DESCRIPTION |
|---|---|
agent_data
|
Agent configuration
TYPE:
|
environment
|
The environment instance
TYPE:
|
task
|
The task
TYPE:
|
seed_generator
|
Seed generator (unused)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Optional[User]
|
None |
MultiAgentBenchEnvironment
Bases: Environment
MASEval Environment wrapper for MARBLE environments.
This environment wraps MARBLE's domain-specific environments (Research, Bargaining, Coding, etc.) and exposes their tools through MASEval's tracing infrastructure.
| ATTRIBUTE | DESCRIPTION |
|---|---|
domain |
The domain name (e.g., "research", "bargaining")
|
marble_env |
The underlying MARBLE environment instance
|
__init__
__init__(task_data: Dict[str, Any])
Initialize the environment.
| PARAMETER | DESCRIPTION |
|---|---|
task_data
|
Task data containing environment configuration
TYPE:
|
| RAISES | DESCRIPTION |
|---|---|
EnvironmentError
|
If required infrastructure is unavailable |
ImportError
|
If MARBLE is not available |
apply_action
apply_action(
agent_id: Optional[str],
action_name: str,
arguments: Dict[str, Any],
) -> Dict[str, Any]
Execute an action in the MARBLE environment.
| PARAMETER | DESCRIPTION |
|---|---|
agent_id
|
ID of the agent performing the action
TYPE:
|
action_name
|
Name of the action to execute
TYPE:
|
arguments
|
Arguments for the action
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Action result dictionary |
create_tools
create_tools() -> Dict[str, Callable]
Create tools from MARBLE environment for MASEval tracing.
MARBLE environments expose tools via action_handler_descriptions. This method wraps them for MASEval's tracing infrastructure.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Callable]
|
Dict mapping tool names to wrapped callables |
gather_config
gather_config() -> Dict[str, Any]
Gather environment configuration.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with environment configuration |
gather_traces
gather_traces() -> Dict[str, Any]
Gather traces including tool invocations.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with environment traces |
get_marble_state
get_marble_state() -> Dict[str, Any]
Get the current MARBLE environment state.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
State dictionary from MARBLE environment |
get_tool
get_tool(name: str) -> Optional[Callable]
Get a specific tool by name.
| PARAMETER | DESCRIPTION |
|---|---|
name
|
Tool name
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Optional[Callable]
|
Tool callable if found, None otherwise |
get_tool_descriptions
get_tool_descriptions() -> Dict[str, Any]
Get tool descriptions in OpenAI function format.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict mapping tool names to their OpenAI-format descriptions |
get_tools
get_tools() -> Dict[str, Any]
Get all tools as a dict.
is_done
is_done() -> bool
Check if the environment has reached a terminal state.
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if done, False otherwise |
is_task_completed
is_task_completed() -> bool
Check if the task has been completed successfully.
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if task completed, False otherwise |
setup_state
setup_state(task_data: Dict[str, Any]) -> Dict[str, Any]
Initialize state and optionally create MARBLE environment.
| PARAMETER | DESCRIPTION |
|---|---|
task_data
|
Task data containing environment configuration
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Initial state dictionary |
| RAISES | DESCRIPTION |
|---|---|
EnvironmentError
|
If required infrastructure is unavailable |
MultiAgentBenchEvaluator
Bases: Evaluator
Evaluator for MultiAgentBench tasks matching MARBLE's methodology.
This evaluator implements MARBLE's LLM-based evaluation metrics: - Task completion assessment - Communication quality scoring - Planning/coordination scoring - Domain-specific task evaluation (research, bargaining, etc.)
| ATTRIBUTE | DESCRIPTION |
|---|---|
domain |
The benchmark domain (research, bargaining, etc.)
|
model_adapter |
Model adapter for LLM-based evaluation
|
metrics_config |
Configuration for metrics to evaluate
|
__call__
__call__(
traces: Dict[str, Any], final_answer: Any
) -> Dict[str, Any]
Evaluate the task execution.
| PARAMETER | DESCRIPTION |
|---|---|
traces
|
Filtered execution traces
TYPE:
|
final_answer
|
Final output from agents (dict, list, str, or None)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Evaluation results dictionary |
__init__
__init__(
domain: str,
model_adapter: ModelAdapter,
metrics_config: Optional[Dict[str, Any]] = None,
output_format: str = "",
result_truncation_length: Optional[
int
] = DEFAULT_RESULT_TRUNCATION_LENGTH,
)
Initialize the evaluator.
| PARAMETER | DESCRIPTION |
|---|---|
domain
|
Benchmark domain (research, bargaining, etc.)
TYPE:
|
model_adapter
|
Model adapter for LLM evaluation
TYPE:
|
metrics_config
|
Configuration for evaluation metrics
TYPE:
|
output_format
|
Expected output format for task evaluation
TYPE:
|
result_truncation_length
|
Maximum characters per agent result before LLM
summarization. Matches MARBLE's
TYPE:
|
filter_traces
filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]
Filter traces for evaluation.
| PARAMETER | DESCRIPTION |
|---|---|
traces
|
All collected traces
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Filtered traces relevant for evaluation |
MarbleAgentAdapter
Bases: AgentAdapter
Adapter wrapping a MARBLE BaseAgent for MASEval tracing.
This adapter provides a unified interface to MARBLE agents while capturing all relevant traces for evaluation.
| ATTRIBUTE | DESCRIPTION |
|---|---|
agent_id |
Unique identifier for the agent
TYPE:
|
marble_agent |
The underlying MARBLE BaseAgent instance
TYPE:
|
profile |
Agent's role profile from MARBLE config
TYPE:
|
agent_id
property
agent_id: str
Return the agent's unique identifier.
marble_agent
property
marble_agent: Any
Return the underlying MARBLE agent.
profile
property
profile: str
Return the agent's profile.
__init__
__init__(marble_agent: Any, agent_id: str)
Initialize the adapter.
| PARAMETER | DESCRIPTION |
|---|---|
marble_agent
|
MARBLE BaseAgent instance
TYPE:
|
agent_id
|
Unique identifier for this agent
TYPE:
|
gather_config
gather_config() -> Dict[str, Any]
Gather agent configuration.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with agent configuration |
gather_traces
gather_traces() -> Dict[str, Any]
Gather traces including agent-specific data.
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with all agent traces |
gather_usage
gather_usage() -> Usage
Gather usage with automatic cost calculation.
Calls _gather_usage() for raw token counts, then applies
the cost calculator if one is available and cost is still 0.0.
The model_id used for cost calculation is resolved in order:
- Explicit
model_idpassed to__init__ - Auto-detected from the framework agent via
_resolve_model_id()
Subclasses should override _gather_usage() (not this method)
to provide framework-specific token extraction.
| RETURNS | DESCRIPTION |
|---|---|
Usage
|
Usage (or TokenUsage) with cost filled in when possible. |
get_memory_str
get_memory_str() -> str
Get the agent's memory as a string.
| RETURNS | DESCRIPTION |
|---|---|
str
|
Serialized memory string |
get_messages
get_messages() -> MessageHistory
Get the current message history as an iterable MessageHistory object.
The returned MessageHistory can be:
- Iterated: for msg in agent.get_messages(): ...
- Indexed: agent.get_messages()[0]
- Converted to list: list(agent.get_messages()) or agent.get_messages().to_list()
- Checked for emptiness: if agent.get_messages(): ...
| RETURNS | DESCRIPTION |
|---|---|
MessageHistory
|
MessageHistory object (empty if no messages yet) |
Example
# Iterate directly
for msg in agent.get_messages():
print(msg['role'], msg['content'])
# Convert to list
messages = agent.get_messages().to_list()
messages = list(agent.get_messages())
# Check if empty
if agent.get_messages():
print("Agent has messages")
get_serialized_messages
get_serialized_messages(session_id: str = '') -> str
Get serialized inter-agent messages.
| PARAMETER | DESCRIPTION |
|---|---|
session_id
|
Optional session ID filter
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Serialized message string |
get_token_usage
get_token_usage() -> int
Get the total token usage from the MARBLE agent.
| RETURNS | DESCRIPTION |
|---|---|
int
|
Total tokens used by the agent |
run
run(query: str) -> Any
Executes the agent and returns the result.
load_tasks
load_tasks(
domain: str,
data_dir: Optional[Path] = None,
limit: Optional[int] = None,
) -> List[Task]
Load MultiAgentBench tasks for a domain.
Most domains load from JSONL files. Werewolf uses config-based task loading since it has no JSONL data (it uses a game engine).
| PARAMETER | DESCRIPTION |
|---|---|
domain
|
Domain name (one of: coding, database, minecraft, research, bargaining, werewolf)
TYPE:
|
data_dir
|
Optional path to MARBLE data directory
TYPE:
|
limit
|
Maximum number of tasks to load (None for all)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[Task]
|
List of Task objects |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If domain is invalid |
FileNotFoundError
|
If data files not found |
Example
tasks = load_tasks("research", limit=5) len(tasks) 5 tasks[0].metadata["domain"] 'research'
configure_model_ids
configure_model_ids(
tasks: List[Task],
*,
agent_model_id: str,
evaluator_model_id: Optional[str] = None,
) -> List[Task]
Configure model IDs for MARBLE agents and evaluator.
Modifies tasks in-place to set the LLM model IDs used by agents and optionally the evaluator.
| PARAMETER | DESCRIPTION |
|---|---|
tasks
|
List of Tasks to configure
TYPE:
|
agent_model_id
|
Model ID for all MARBLE agents (e.g., "gpt-4o")
TYPE:
|
evaluator_model_id
|
Optional model ID for LLM-based evaluation
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
List[Task]
|
The input tasks (modified in-place) |
Example
tasks = load_tasks("research", limit=5) configure_model_ids(tasks, agent_model_id="gpt-4o") tasks[0].environment_data["llm"] 'gpt-4o'
ensure_marble_exists
ensure_marble_exists(auto_download: bool = True) -> Path
Ensure MARBLE is available, optionally downloading it.
This function checks if MARBLE is installed and optionally downloads it if not present.
| PARAMETER | DESCRIPTION |
|---|---|
auto_download
|
If True, automatically download MARBLE if not found. If False, raise an error if MARBLE is not found.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Path
|
Path to the MARBLE directory |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If MARBLE is not found and auto_download=False |
Example
marble_dir = ensure_marble_exists()
MARBLE is now available at marble_dir
download_marble
download_marble(
target_dir: Optional[Path] = None,
commit: Optional[str] = None,
force: bool = False,
) -> Path
Clone MARBLE repository to the specified directory.
| PARAMETER | DESCRIPTION |
|---|---|
target_dir
|
Directory to clone into. Defaults to marble/ relative to this module.
TYPE:
|
commit
|
Specific commit hash to checkout. Defaults to MARBLE_DEFAULT_COMMIT or latest.
TYPE:
|
force
|
If True, remove existing directory and re-clone.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Path
|
Path to the cloned MARBLE directory |
| RAISES | DESCRIPTION |
|---|---|
RuntimeError
|
If git clone fails |
FileExistsError
|
If directory exists and force=False |
get_domain_info
get_domain_info(domain: str) -> Dict[str, Any]
Get information about a domain.
| PARAMETER | DESCRIPTION |
|---|---|
domain
|
Domain name
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any]
|
Dict with domain information including: |
Dict[str, Any]
|
|
Dict[str, Any]
|
|
Dict[str, Any]
|
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If domain is invalid |