MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)

Beta

This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The MultiAgentBench benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.

MultiAgentBench (from the MARBLE framework, where the original work was done) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. We use a bug-fixed fork for MASEval integration. The benchmark features:

6 diverse domains: research, bargaining, coding, database, werewolf, minecraft (minecraft is untested)
Multiple coordination modes: cooperative, star, tree, hierarchical
LLM-based evaluation: Matches MARBLE's evaluation methodology
Framework-agnostic: Use with any agent framework or MARBLE's native agents

Reference Paper: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Check out the BENCHMARKS.md file for more information including licenses.

MultiAgentBenchBenchmark

Bases: Benchmark

Abstract base class for framework-agnostic MultiAgentBench evaluation.

This benchmark provides the infrastructure for evaluating multi-agent systems on MARBLE's MultiAgentBench tasks. Subclasses implement setup_agents() with their specific agent framework.

The benchmark supports: - Multiple coordination modes (star, cooperative, tree, hierarchical) - Multiple domains (research, bargaining, coding, database, etc.) - LLM-based evaluation matching MARBLE's metrics - Comprehensive tracing of agent interactions

Warning

communication_score is only computed when agents use MarbleAgentAdapter, which populates the communication_log trace key from BaseAgent.act(). Custom setup_agents() implementations using other adapters must explicitly populate communication_log in each adapter's gather_traces() output for communication evaluation to work. See MultiAgentBenchEvaluator._extract_communications() for the expected format.

Example

class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
    def setup_agents(self, agent_data, environment, task, user, seed_generator):
        # Derive seeds for agents (returns None if seeding disabled)
        agents_gen = seed_generator.child("agents")
        agent_seeds = {}
        for config in task.environment_data.get("agents", []):
            agent_id = config.get("agent_id")
            agent_seeds[agent_id] = agents_gen.derive_seed(agent_id)

        # Create agents using your framework with seeds
        agents_list = []
        agents_dict = {}
        for config in task.environment_data.get("agents", []):
            agent_id = config.get("agent_id")
            model = self.get_model_adapter(
                agent_data.get("model_id", "gpt-4o"),
                register_name=f"agent_{agent_id}",
                seed=agent_seeds.get(agent_id),
            )
            # Create your agent with the seeded model...
            ...
        return agents_list, agents_dict

    def get_model_adapter(self, model_id, **kwargs):
        seed = kwargs.pop("seed", None)
        adapter = MyModelAdapter(model_id, seed=seed)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

benchmark = MyMultiAgentBenchmark(seed=42)  # Enable seeding
results = benchmark.run(tasks, agent_data={"model_id": "gpt-4o"})

seed_generator `property`

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.

RETURNS	DESCRIPTION
`SeedGenerator`	The root `SeedGenerator` instance.

usage `property`

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component `property`

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").

init

__init__(
    callbacks: Optional[List[BenchmarkCallback]] = None,
    n_task_repeats: int = 1,
    max_invocations: int = 10,
    num_workers: int = 1,
    fail_on_setup_error: bool = False,
    fail_on_task_error: bool = False,
    fail_on_evaluation_error: bool = False,
    progress_bar: bool | str = True,
    seed: Optional[int] = None,
    seed_generator: Optional[SeedGenerator] = None,
)

Initialize the benchmark.

PARAMETER	DESCRIPTION
`callbacks`	Optional list of callbacks TYPE: `Optional[List[BenchmarkCallback]]` DEFAULT: `None`
`n_task_repeats`	Number of times to repeat each task TYPE: `int` DEFAULT: `1`
`max_invocations`	Maximum agent invocations per task TYPE: `int` DEFAULT: `10`
`num_workers`	Number of parallel workers TYPE: `int` DEFAULT: `1`
`fail_on_setup_error`	Raise on setup errors TYPE: `bool` DEFAULT: `False`
`fail_on_task_error`	Raise on task errors TYPE: `bool` DEFAULT: `False`
`fail_on_evaluation_error`	Raise on evaluation errors TYPE: `bool` DEFAULT: `False`
`progress_bar`	Progress bar configuration TYPE: `bool \| str` DEFAULT: `True`
`seed`	Global seed for reproducible benchmark runs TYPE: `Optional[int]` DEFAULT: `None`
`seed_generator`	Custom seed generator (takes precedence over seed) TYPE: `Optional[SeedGenerator]` DEFAULT: `None`

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER	DESCRIPTION
`callback`	A BenchmarkCallback instance that will receive execution events. TYPE: `BenchmarkCallback`

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark(tasks=tasks, agent_data=config)
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run()

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their config (settings, parameters)
models - Dict mapping model names to their config (model IDs, parameters)
tools - Dict mapping tool names to their config (specifications, settings)
simulators - Dict mapping simulator names to their config (parameters, templates)
callbacks - Dict mapping callback names to their config (settings)
environment - Direct config from the environment (not nested), or None if not present
user - Direct config from the user simulator (not nested), or None if not present
other - Dict for any other registered components
benchmark - Benchmark-level configuration (git, system, packages)

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their traces (messages, execution data)
models - Dict mapping model names to their traces (API calls, timing, errors)
tools - Dict mapping tool names to their traces (invocations, parameters)
simulators - Dict mapping simulator names to their traces (attempts, outcomes)
callbacks - Dict mapping callback names to their traces (custom data)
environment - Direct traces from the environment (not nested), or None if not present
user - Direct traces from the user simulator (not nested), or None if not present
other - Dict for any other registered components

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and also accumulates into persistent running totals accessible via usage and usage_by_component.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Execute evaluators on the results.

PARAMETER	DESCRIPTION
`evaluators`	The evaluators TYPE: `Sequence[Evaluator]`
`agents`	Dict of all agents TYPE: `Dict[str, AgentAdapter]`
`final_answer`	The combined agent outputs TYPE: `Any`
`traces`	Execution traces TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of evaluation results

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute agents in a single pass.

MultiAgentBench uses multi-agent coordination instead of user interaction. The base class execution_loop breaks after one call when user is None, so this override makes the single-pass behavior explicit.

Subclasses (e.g. MarbleMultiAgentBenchBenchmark) override this with multi-iteration coordination loops matching their framework's orchestration.

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION

status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples: - TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution - TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed - [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS	DESCRIPTION
`SequentialTaskQueue`	SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES	DESCRIPTION
`RuntimeError`	If reports is None and run() has not been executed yet.

How to use

# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry failed tasks elegantly - this is the key use case!
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed)

# Or more concisely
reports = benchmark.run(tasks=tasks)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks())

get_model_adapter `abstractmethod`

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Provide a model adapter (implement in subclass).

PARAMETER	DESCRIPTION
`model_id`	Model identifier TYPE: `str`
`**kwargs`	Additional arguments including register_name TYPE: `Any` DEFAULT: `{}`

RETURNS	DESCRIPTION
`ModelAdapter`	ModelAdapter instance

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER	DESCRIPTION
`category`	Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs(). TYPE: `str`
`name`	Unique identifier for this component within its category TYPE: `str`
`component`	Any object inheriting from TraceableMixin and/or ConfigurableMixin TYPE: `RegisterableComponent`

RETURNS	DESCRIPTION
`RegisterableComponent`	The component (for chaining convenience)

RAISES	DESCRIPTION
`ValueError`	If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION

tasks

Task source for execution. Can be: - A single Task object - A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue) - An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of report dictionaries, one per task repetition. Every report carries the
`List[Dict[str, Any]]`	same keys (consistent schema) regardless of success or failure:
`List[Dict[str, Any]]`	task_id: Task identifier (UUID)
`List[Dict[str, Any]]`	repeat_idx: Repetition index (0 to n_task_repeats-1)
`List[Dict[str, Any]]`	status: Execution status (one of TaskExecutionStatus enum values)
`List[Dict[str, Any]]`	traces: Execution traces from all registered components (`{}` if unavailable, e.g. setup failure)
`List[Dict[str, Any]]`	config: Configuration from all registered components and benchmark level (`{}` if unavailable)
`List[Dict[str, Any]]`	usage: Aggregated usage from all registered components (`None` if not collected)
`List[Dict[str, Any]]`	eval: Evaluation results (None if task or evaluation failed)
`List[Dict[str, Any]]`	task: Task summary dict with `query`, `metadata`, and `protocol`
`List[Dict[str, Any]]`	error: Error details dict — `None` only when status is SUCCESS; otherwise always populated, containing: error_type: Exception class name error_message: Exception message traceback: Full traceback string (plus any error-specific extras, e.g. `component`, `elapsed`, `timeout`)

RAISES	DESCRIPTION
`ValueError`	If agent_data length doesn't match number of tasks (when agent_data is an iterable).
`Exception`	If a `fail_on_setup_error` / `fail_on_task_error` / `fail_on_evaluation_error` flag is set and the corresponding failure occurs, the original exception is re-raised and the run is aborted (this applies to both sequential and parallel execution).

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

Calls your setup methods to initialize components
Calls your run_agents() method to execute the task
Collects message histories and calls evaluators
Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task)
        user = setup_user(agent_data, environment, task)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
        evaluators = setup_evaluators(environment, task, agents_to_run, user)

        # Run stage (execution_loop handles multi-turn if user exists)
        agents_output = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_message_histories(agents_dict)
        eval_results = evaluate(evaluators, traces, agents_dict)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

on_run_start: Before processing any tasks
on_task_start: Before processing a task (once per task, not per repeat)
on_task_repeat_start: Before each repetition of a task
on_task_repeat_end: After each repetition completes
on_task_end: After all repetitions of a task complete
on_run_end: After all tasks complete

# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue)  # queue receives on_task_complete callbacks

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    query: str,
) -> Dict[str, Any]

Execute the multi-agent system.

For MultiAgentBench, this runs all agents on the task and collects their outputs.

PARAMETER	DESCRIPTION
`agents`	Agents to run TYPE: `Sequence[AgentAdapter]`
`task`	The task TYPE: `Task`
`environment`	The environment TYPE: `Environment`
`query`	The query/task content TYPE: `str`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with agent_results, communications, and coordination_mode

setup_agents `abstractmethod`

setup_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Create agents for the task (implement in subclass).

Subclasses should: 1. Read agent specifications from task.environment_data["agents"] 2. Derive seeds from seed_generator for each agent's model 3. Create agents using their framework with seeded models 4. Wrap them in AgentAdapter 5. Set up relationships from task.environment_data["relationships"]

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration (model IDs, etc.) TYPE: `Dict[str, Any]`
`environment`	The environment instance TYPE: `Environment`
`task`	The task containing agent specs TYPE: `Task`
`user`	User simulator (None for MultiAgentBench) TYPE: `Optional[User]`
`seed_generator`	Seed generator for deriving deterministic seeds. Use `seed_generator.child("agents")` to create a namespace, then `derive_seed(agent_id)` for each agent's model. Returns None if seeding is disabled. TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]`	Tuple of (agents_to_run, agents_dict)

Example

def setup_agents(self, agent_data, environment, task, user, seed_generator):
    agents_gen = seed_generator.child("agents")

    for config in task.environment_data.get("agents", []):
        agent_id = config.get("agent_id")
        seed = agents_gen.derive_seed(agent_id)  # Returns None if seeding disabled
        model = self.get_model_adapter(model_id, seed=seed)
        # Create agent with seeded model...

setup_environment

setup_environment(
    agent_data: Dict[str, Any],
    task: Task,
    seed_generator: SeedGenerator,
) -> Environment

Create the MultiAgentBench environment.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration TYPE: `Dict[str, Any]`
`task`	The task to set up TYPE: `Task`
`seed_generator`	Seed generator for reproducibility TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Environment`	MultiAgentBenchEnvironment instance

setup_evaluators

setup_evaluators(
    environment: Environment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Sequence[Evaluator]

Create evaluators for the task.

PARAMETER	DESCRIPTION
`environment`	The environment TYPE: `Environment`
`task`	The task with evaluation data TYPE: `Task`
`agents`	The agents TYPE: `Sequence[AgentAdapter]`
`user`	User simulator (None for MultiAgentBench) TYPE: `Optional[User]`
`seed_generator`	Seed generator for reproducibility TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Sequence[Evaluator]`	List of evaluators

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    seed_generator: SeedGenerator,
) -> Optional[User]

MultiAgentBench tasks don't use user simulators.

The multi-agent coordination replaces user interaction.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration TYPE: `Dict[str, Any]`
`environment`	The environment instance TYPE: `Environment`
`task`	The task TYPE: `Task`
`seed_generator`	Seed generator (unused) TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Optional[User]`	None

MarbleMultiAgentBenchBenchmark

Bases: MultiAgentBenchBenchmark

MARBLE reproduction mode for MultiAgentBench.

This benchmark uses MARBLE's native agents and engine for exact reproduction of published results. It wraps MARBLE components in MASEval adapters for unified tracing.

Example

from maseval.benchmark.multiagentbench import (
    MarbleMultiAgentBenchBenchmark,
    load_tasks,
    configure_model_ids,
)

class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
    def get_model_adapter(self, model_id, **kwargs):
        from maseval.interface.openai import OpenAIModelAdapter
        adapter = OpenAIModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")

benchmark = MyMarbleBenchmark()
results = benchmark.run(tasks, agent_data={})

seed_generator `property`

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.

RETURNS	DESCRIPTION
`SeedGenerator`	The root `SeedGenerator` instance.

usage `property`

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component `property`

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").

init

__init__(
    callbacks: Optional[List[BenchmarkCallback]] = None,
    n_task_repeats: int = 1,
    max_invocations: int = 10,
    num_workers: int = 1,
    fail_on_setup_error: bool = False,
    fail_on_task_error: bool = False,
    fail_on_evaluation_error: bool = False,
    progress_bar: bool | str = True,
    seed: Optional[int] = None,
    seed_generator: Optional[SeedGenerator] = None,
)

Initialize the benchmark.

PARAMETER	DESCRIPTION
`callbacks`	Optional list of callbacks TYPE: `Optional[List[BenchmarkCallback]]` DEFAULT: `None`
`n_task_repeats`	Number of times to repeat each task TYPE: `int` DEFAULT: `1`
`max_invocations`	Maximum agent invocations per task TYPE: `int` DEFAULT: `10`
`num_workers`	Number of parallel workers TYPE: `int` DEFAULT: `1`
`fail_on_setup_error`	Raise on setup errors TYPE: `bool` DEFAULT: `False`
`fail_on_task_error`	Raise on task errors TYPE: `bool` DEFAULT: `False`
`fail_on_evaluation_error`	Raise on evaluation errors TYPE: `bool` DEFAULT: `False`
`progress_bar`	Progress bar configuration TYPE: `bool \| str` DEFAULT: `True`
`seed`	Global seed for reproducible benchmark runs TYPE: `Optional[int]` DEFAULT: `None`
`seed_generator`	Custom seed generator (takes precedence over seed) TYPE: `Optional[SeedGenerator]` DEFAULT: `None`

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER	DESCRIPTION
`callback`	A BenchmarkCallback instance that will receive execution events. TYPE: `BenchmarkCallback`

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark(tasks=tasks, agent_data=config)
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run()

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their config (settings, parameters)
models - Dict mapping model names to their config (model IDs, parameters)
tools - Dict mapping tool names to their config (specifications, settings)
simulators - Dict mapping simulator names to their config (parameters, templates)
callbacks - Dict mapping callback names to their config (settings)
environment - Direct config from the environment (not nested), or None if not present
user - Direct config from the user simulator (not nested), or None if not present
other - Dict for any other registered components
benchmark - Benchmark-level configuration (git, system, packages)

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their traces (messages, execution data)
models - Dict mapping model names to their traces (API calls, timing, errors)
tools - Dict mapping tool names to their traces (invocations, parameters)
simulators - Dict mapping simulator names to their traces (attempts, outcomes)
callbacks - Dict mapping callback names to their traces (custom data)
environment - Direct traces from the environment (not nested), or None if not present
user - Direct traces from the user simulator (not nested), or None if not present
other - Dict for any other registered components

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and also accumulates into persistent running totals accessible via usage and usage_by_component.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Execute evaluators on the results.

PARAMETER	DESCRIPTION
`evaluators`	The evaluators TYPE: `Sequence[Evaluator]`
`agents`	Dict of all agents TYPE: `Dict[str, AgentAdapter]`
`final_answer`	The combined agent outputs TYPE: `Any`
`traces`	Execution traces TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of evaluation results

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute MARBLE's multi-iteration coordination loop.

Dispatches to the appropriate coordination handler based on the task's coordinate_mode. Replicates Engine.start() from marble/engine/engine.py:1034-1055.

PARAMETER	DESCRIPTION
`agents`	MARBLE agents wrapped in MarbleAgentAdapter TYPE: `Sequence[AgentAdapter]`
`task`	The task being solved TYPE: `Task`
`environment`	The environment TYPE: `Environment`
`user`	Always None for MultiAgentBench TYPE: `Optional[User]`

RETURNS	DESCRIPTION
`Any`	Dict with agent_results, communications, and coordination_mode

RAISES	DESCRIPTION
`ValueError`	If coordinate_mode is not supported

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION

status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples: - TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution - TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed - [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS	DESCRIPTION
`SequentialTaskQueue`	SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES	DESCRIPTION
`RuntimeError`	If reports is None and run() has not been executed yet.

How to use

# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry failed tasks elegantly - this is the key use case!
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed)

# Or more concisely
reports = benchmark.run(tasks=tasks)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks())

get_model_adapter `abstractmethod`

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Provide a model adapter (implement in subclass).

PARAMETER	DESCRIPTION
`model_id`	Model identifier TYPE: `str`
`**kwargs`	Additional arguments TYPE: `Any` DEFAULT: `{}`

RETURNS	DESCRIPTION
`ModelAdapter`	ModelAdapter instance

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER	DESCRIPTION
`category`	Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs(). TYPE: `str`
`name`	Unique identifier for this component within its category TYPE: `str`
`component`	Any object inheriting from TraceableMixin and/or ConfigurableMixin TYPE: `RegisterableComponent`

RETURNS	DESCRIPTION
`RegisterableComponent`	The component (for chaining convenience)

RAISES	DESCRIPTION
`ValueError`	If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION

tasks

Task source for execution. Can be: - A single Task object - A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue) - An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of report dictionaries, one per task repetition. Every report carries the
`List[Dict[str, Any]]`	same keys (consistent schema) regardless of success or failure:
`List[Dict[str, Any]]`	task_id: Task identifier (UUID)
`List[Dict[str, Any]]`	repeat_idx: Repetition index (0 to n_task_repeats-1)
`List[Dict[str, Any]]`	status: Execution status (one of TaskExecutionStatus enum values)
`List[Dict[str, Any]]`	traces: Execution traces from all registered components (`{}` if unavailable, e.g. setup failure)
`List[Dict[str, Any]]`	config: Configuration from all registered components and benchmark level (`{}` if unavailable)
`List[Dict[str, Any]]`	usage: Aggregated usage from all registered components (`None` if not collected)
`List[Dict[str, Any]]`	eval: Evaluation results (None if task or evaluation failed)
`List[Dict[str, Any]]`	task: Task summary dict with `query`, `metadata`, and `protocol`
`List[Dict[str, Any]]`	error: Error details dict — `None` only when status is SUCCESS; otherwise always populated, containing: error_type: Exception class name error_message: Exception message traceback: Full traceback string (plus any error-specific extras, e.g. `component`, `elapsed`, `timeout`)

RAISES	DESCRIPTION
`ValueError`	If agent_data length doesn't match number of tasks (when agent_data is an iterable).
`Exception`	If a `fail_on_setup_error` / `fail_on_task_error` / `fail_on_evaluation_error` flag is set and the corresponding failure occurs, the original exception is re-raised and the run is aborted (this applies to both sequential and parallel execution).

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

Calls your setup methods to initialize components
Calls your run_agents() method to execute the task
Collects message histories and calls evaluators
Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task)
        user = setup_user(agent_data, environment, task)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
        evaluators = setup_evaluators(environment, task, agents_to_run, user)

        # Run stage (execution_loop handles multi-turn if user exists)
        agents_output = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_message_histories(agents_dict)
        eval_results = evaluate(evaluators, traces, agents_dict)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

on_run_start: Before processing any tasks
on_task_start: Before processing a task (once per task, not per repeat)
on_task_repeat_start: Before each repetition of a task
on_task_repeat_end: After each repetition completes
on_task_end: After all repetitions of a task complete
on_run_end: After all tasks complete

# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue)  # queue receives on_task_complete callbacks

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    query: str,
) -> Dict[str, Any]

Execute the multi-agent system.

For MultiAgentBench, this runs all agents on the task and collects their outputs.

PARAMETER	DESCRIPTION
`agents`	Agents to run TYPE: `Sequence[AgentAdapter]`
`task`	The task TYPE: `Task`
`environment`	The environment TYPE: `Environment`
`query`	The query/task content TYPE: `str`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with agent_results, communications, and coordination_mode

setup_agents

setup_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Create MARBLE agents wrapped in MASEval adapters.

Also creates MARBLE's orchestration components (EnginePlanner, SharedMemory, AgentGraph) needed by execution_loop to replicate MARBLE's multi-iteration coordination.

Note

MARBLE agents use their own internal LLM handling with a model ID string, not MASEval's ModelAdapter. This means seed_generator cannot be applied to agent LLM calls in this implementation. For reproducible agent behavior, use MultiAgentBenchBenchmark with a custom setup_agents that creates agents using seeded MASEval ModelAdapters.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration TYPE: `Dict[str, Any]`
`environment`	The environment TYPE: `Environment`
`task`	The task with agent specifications TYPE: `Task`
`user`	User simulator (None) TYPE: `Optional[User]`
`seed_generator`	Seed generator (not used for MARBLE agents, but seeding is applied to evaluators) TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]`	Tuple of (agents_to_run, agents_dict)

setup_environment

setup_environment(
    agent_data: Dict[str, Any],
    task: Task,
    seed_generator: SeedGenerator,
) -> Environment

Create the MultiAgentBench environment.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration TYPE: `Dict[str, Any]`
`task`	The task to set up TYPE: `Task`
`seed_generator`	Seed generator for reproducibility TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Environment`	MultiAgentBenchEnvironment instance

setup_evaluators

setup_evaluators(
    environment: Environment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Sequence[Evaluator]

Create a thin evaluator for MARBLE reproduction mode.

All LLM-based evaluation happens inside the coordination loop via MARBLE's Evaluator (imported directly). The MarbleReproductionEvaluator only reformats pre-computed metrics into MASEval's result format.

No ModelAdapter is needed — evaluation LLM calls are handled by MARBLE's model_prompting() in the coordination loop.

PARAMETER	DESCRIPTION
`environment`	The environment TYPE: `Environment`
`task`	The task with evaluation data TYPE: `Task`
`agents`	The agents TYPE: `Sequence[AgentAdapter]`
`user`	User simulator (None for MultiAgentBench) TYPE: `Optional[User]`
`seed_generator`	Seed generator for reproducibility TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Sequence[Evaluator]`	List containing a single MarbleReproductionEvaluator

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    seed_generator: SeedGenerator,
) -> Optional[User]

MultiAgentBench tasks don't use user simulators.

The multi-agent coordination replaces user interaction.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration TYPE: `Dict[str, Any]`
`environment`	The environment instance TYPE: `Environment`
`task`	The task TYPE: `Task`
`seed_generator`	Seed generator (unused) TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Optional[User]`	None

MultiAgentBenchEnvironment

Bases: Environment

MASEval Environment wrapper for MARBLE environments.

This environment wraps MARBLE's domain-specific environments (Research, Bargaining, Coding, etc.) and exposes their tools through MASEval's tracing infrastructure.

ATTRIBUTE	DESCRIPTION
`domain`	The domain name (e.g., "research", "bargaining")
`marble_env`	The underlying MARBLE environment instance

init

__init__(environment_data: Dict[str, Any])

Initialize the environment.

PARAMETER	DESCRIPTION
`environment_data`	Task data containing environment configuration TYPE: `Dict[str, Any]`

RAISES	DESCRIPTION
`EnvironmentError`	If required infrastructure is unavailable
`ImportError`	If MARBLE is not available

apply_action

apply_action(
    agent_id: Optional[str],
    action_name: str,
    arguments: Dict[str, Any],
) -> Dict[str, Any]

Execute an action in the MARBLE environment.

PARAMETER	DESCRIPTION
`agent_id`	ID of the agent performing the action TYPE: `Optional[str]`
`action_name`	Name of the action to execute TYPE: `str`
`arguments`	Arguments for the action TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Action result dictionary

create_tools

create_tools() -> Dict[str, Callable]

Create tools from MARBLE environment for MASEval tracing.

MARBLE environments expose tools via action_handler_descriptions. This method wraps them for MASEval's tracing infrastructure.

RETURNS	DESCRIPTION
`Dict[str, Callable]`	Dict mapping tool names to wrapped callables

gather_config

gather_config() -> Dict[str, Any]

Gather environment configuration.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with environment configuration

gather_traces

gather_traces() -> Dict[str, Any]

Gather traces including tool invocations.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with environment traces

get_marble_state

get_marble_state() -> Dict[str, Any]

Get the current MARBLE environment state.

RETURNS	DESCRIPTION
`Dict[str, Any]`	State dictionary from MARBLE environment

get_tool

get_tool(name: str) -> Optional[Callable]

Get a specific tool by name.

PARAMETER	DESCRIPTION
`name`	Tool name TYPE: `str`

RETURNS	DESCRIPTION
`Optional[Callable]`	Tool callable if found, None otherwise

get_tool_descriptions

get_tool_descriptions() -> Dict[str, Any]

Get tool descriptions in OpenAI function format.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict mapping tool names to their OpenAI-format descriptions

get_tools

get_tools() -> Dict[str, Any]

Get all tools as a dict.

is_done

is_done() -> bool

Check if the environment has reached a terminal state.

RETURNS	DESCRIPTION
`bool`	True if done, False otherwise

is_task_completed

is_task_completed() -> bool

Check if the task has been completed successfully.

RETURNS	DESCRIPTION
`bool`	True if task completed, False otherwise

setup_state

setup_state(
    environment_data: Dict[str, Any],
) -> Dict[str, Any]

Initialize state and optionally create MARBLE environment.

PARAMETER	DESCRIPTION
`environment_data`	Task data containing environment configuration TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Initial state dictionary

RAISES	DESCRIPTION
`EnvironmentError`	If required infrastructure is unavailable

MultiAgentBenchEvaluator

Bases: Evaluator

Evaluator for MultiAgentBench tasks matching MARBLE's methodology.

This evaluator implements MARBLE's LLM-based evaluation metrics: - Task completion assessment - Communication quality scoring - Planning/coordination scoring - Domain-specific task evaluation (research, bargaining, etc.)

ATTRIBUTE	DESCRIPTION
`domain`	The benchmark domain (research, bargaining, etc.)
`model_adapter`	Model adapter for LLM-based evaluation
`metrics_config`	Configuration for metrics to evaluate

call

__call__(
    traces: Dict[str, Any], final_answer: Any
) -> Dict[str, Any]

Evaluate the task execution.

PARAMETER	DESCRIPTION
`traces`	Filtered execution traces TYPE: `Dict[str, Any]`
`final_answer`	Final output from agents (dict, list, str, or None) TYPE: `Any`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Evaluation results dictionary

init

__init__(
    domain: str,
    model_adapter: ModelAdapter,
    metrics_config: Optional[Dict[str, Any]] = None,
    output_format: str = "",
    result_truncation_length: Optional[
        int
    ] = DEFAULT_RESULT_TRUNCATION_LENGTH,
)

Initialize the evaluator.

PARAMETER	DESCRIPTION
`domain`	Benchmark domain (research, bargaining, etc.) TYPE: `str`
`model_adapter`	Model adapter for LLM evaluation TYPE: `ModelAdapter`
`metrics_config`	Configuration for evaluation metrics TYPE: `Optional[Dict[str, Any]]` DEFAULT: `None`
`output_format`	Expected output format for task evaluation TYPE: `str` DEFAULT: `''`
`result_truncation_length`	Maximum characters per agent result before LLM summarization. Matches MARBLE's `_summarize_results()` which truncates each result to 1000 chars, then passes the truncated output through an LLM summarization call (`planner.summarize_output()`). Set to `None` to disable both truncation and LLM summarization, passing raw agent results directly to the evaluator (not recommended for domains with large outputs like research). TYPE: `Optional[int]` DEFAULT: `DEFAULT_RESULT_TRUNCATION_LENGTH`

filter_traces

filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]

Filter traces for evaluation.

PARAMETER	DESCRIPTION
`traces`	All collected traces TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Filtered traces relevant for evaluation

MarbleAgentAdapter

Bases: AgentAdapter

Adapter wrapping a MARBLE BaseAgent for MASEval tracing.

This adapter provides a unified interface to MARBLE agents while capturing all relevant traces for evaluation.

ATTRIBUTE	DESCRIPTION
`agent_id`	Unique identifier for the agent TYPE: `str`
`marble_agent`	The underlying MARBLE BaseAgent instance TYPE: `Any`
`profile`	Agent's role profile from MARBLE config TYPE: `str`

agent_id `property`

agent_id: str

Return the agent's unique identifier.

marble_agent `property`

marble_agent: Any

Return the underlying MARBLE agent.

profile `property`

profile: str

Return the agent's profile.

init

__init__(marble_agent: Any, agent_id: str)

Initialize the adapter.

PARAMETER	DESCRIPTION
`marble_agent`	MARBLE BaseAgent instance TYPE: `Any`
`agent_id`	Unique identifier for this agent TYPE: `str`

gather_config

gather_config() -> Dict[str, Any]

Gather agent configuration.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with agent configuration

gather_traces

gather_traces() -> Dict[str, Any]

Gather traces including agent-specific data.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with all agent traces

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

Explicit model_id passed to __init__
Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS	DESCRIPTION
`Usage`	Usage (or TokenUsage) with cost filled in when possible.

get_memory_str

get_memory_str() -> str

Get the agent's memory as a string.

RETURNS	DESCRIPTION
`str`	Serialized memory string

get_messages

get_messages() -> MessageHistory

Get the current message history as an iterable MessageHistory object.

The returned MessageHistory can be: - Iterated: for msg in agent.get_messages(): ... - Indexed: agent.get_messages()[0] - Converted to list: list(agent.get_messages()) or agent.get_messages().to_list() - Checked for emptiness: if agent.get_messages(): ...

RETURNS	DESCRIPTION
`MessageHistory`	MessageHistory object (empty if no messages yet)

Example

# Iterate directly
for msg in agent.get_messages():
    print(msg['role'], msg['content'])

# Convert to list
messages = agent.get_messages().to_list()
messages = list(agent.get_messages())

# Check if empty
if agent.get_messages():
    print("Agent has messages")

get_serialized_messages

get_serialized_messages(session_id: str = '') -> str

Get serialized inter-agent messages.

PARAMETER	DESCRIPTION
`session_id`	Optional session ID filter TYPE: `str` DEFAULT: `''`

RETURNS	DESCRIPTION
`str`	Serialized message string

get_token_usage

get_token_usage() -> int

Get the total token usage from the MARBLE agent.

RETURNS	DESCRIPTION
`int`	Total tokens used by the agent

run

run(query: str) -> Any

Executes the agent and returns the result.

load_tasks

load_tasks(
    domain: str,
    data_dir: Optional[Path] = None,
    limit: Optional[int] = None,
) -> List[Task]

Load MultiAgentBench tasks for a domain.

Most domains load from JSONL files. Werewolf uses config-based task loading since it has no JSONL data (it uses a game engine).

PARAMETER	DESCRIPTION
`domain`	Domain name (one of: coding, database, minecraft, research, bargaining, werewolf) TYPE: `str`
`data_dir`	Optional path to MARBLE data directory TYPE: `Optional[Path]` DEFAULT: `None`
`limit`	Maximum number of tasks to load (None for all) TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`List[Task]`	List of Task objects

RAISES	DESCRIPTION
`ValueError`	If domain is invalid
`FileNotFoundError`	If data files not found

Example

tasks = load_tasks("research", limit=5) len(tasks) 5 tasks[0].metadata["domain"] 'research'

configure_model_ids

configure_model_ids(
    tasks: List[Task],
    *,
    agent_model_id: str,
    evaluator_model_id: Optional[str] = None,
) -> List[Task]

Configure model IDs for MARBLE agents and evaluator.

Modifies tasks in-place to set the LLM model IDs used by agents and optionally the evaluator.

PARAMETER	DESCRIPTION
`tasks`	List of Tasks to configure TYPE: `List[Task]`
`agent_model_id`	Model ID for all MARBLE agents (e.g., "gpt-4o") TYPE: `str`
`evaluator_model_id`	Optional model ID for LLM-based evaluation TYPE: `Optional[str]` DEFAULT: `None`

RETURNS	DESCRIPTION
`List[Task]`	The input tasks (modified in-place)

Example

tasks = load_tasks("research", limit=5) configure_model_ids(tasks, agent_model_id="gpt-4o") tasks[0].environment_data["llm"] 'gpt-4o'

ensure_marble_exists

ensure_marble_exists(auto_download: bool = True) -> Path

Ensure MARBLE is available, optionally downloading it.

This function checks if MARBLE is installed and optionally downloads it if not present.

PARAMETER	DESCRIPTION
`auto_download`	If True, automatically download MARBLE if not found. If False, raise an error if MARBLE is not found. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Path`	Path to the MARBLE directory

RAISES	DESCRIPTION
`FileNotFoundError`	If MARBLE is not found and auto_download=False

Example

marble_dir = ensure_marble_exists()

MARBLE is now available at marble_dir

download_marble

download_marble(
    target_dir: Optional[Path] = None,
    commit: Optional[str] = None,
    force: bool = False,
) -> Path

Clone MARBLE repository to the specified directory.

PARAMETER	DESCRIPTION
`target_dir`	Directory to clone into. Defaults to marble/ relative to this module. TYPE: `Optional[Path]` DEFAULT: `None`
`commit`	Specific commit hash to checkout. Defaults to MARBLE_DEFAULT_COMMIT or latest. TYPE: `Optional[str]` DEFAULT: `None`
`force`	If True, remove existing directory and re-clone. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Path`	Path to the cloned MARBLE directory

RAISES	DESCRIPTION
`RuntimeError`	If git clone fails
`FileExistsError`	If directory exists and force=False

get_domain_info

get_domain_info(domain: str) -> Dict[str, Any]

Get information about a domain.

PARAMETER	DESCRIPTION
`domain`	Domain name TYPE: `str`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dict with domain information including:
`Dict[str, Any]`	requires_infrastructure: Whether external services needed
`Dict[str, Any]`	description: Brief domain description
`Dict[str, Any]`	coordination_mode: Default coordination mode

RAISES	DESCRIPTION
`ValueError`	If domain is invalid

MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)

MultiAgentBenchBenchmark

seed_generator property

usage property

usage_by_component property

__init__

add_callback

clear_registry

collect_all_configs

collect_all_traces

collect_all_usage

evaluate

execution_loop

get_failed_tasks

get_model_adapter abstractmethod

register

run

run_agents

setup_agents abstractmethod

setup_environment

setup_evaluators

setup_user

MarbleMultiAgentBenchBenchmark

seed_generator property

usage property

usage_by_component property

__init__

add_callback

clear_registry

collect_all_configs

collect_all_traces

collect_all_usage

evaluate

execution_loop

get_failed_tasks

get_model_adapter abstractmethod

register

run

run_agents

setup_agents

setup_environment

setup_evaluators

setup_user

MultiAgentBenchEnvironment

__init__

apply_action

create_tools

gather_config

gather_traces

get_marble_state

get_tool

get_tool_descriptions

get_tools

is_done

is_task_completed

setup_state

MultiAgentBenchEvaluator

__call__

__init__

filter_traces

MarbleAgentAdapter

agent_id property

marble_agent property

profile property

__init__

gather_config

gather_traces

gather_usage

get_memory_str

get_messages

get_serialized_messages

get_token_usage

run

load_tasks

configure_model_ids

ensure_marble_exists

MARBLE is now available at marble_dir

download_marble

get_domain_info

seed_generator `property`

usage `property`

usage_by_component `property`

init

get_model_adapter `abstractmethod`

setup_agents `abstractmethod`

seed_generator `property`

usage `property`

usage_by_component `property`

init

get_model_adapter `abstractmethod`

init

call

init

agent_id `property`

marble_agent `property`

profile `property`

init