
MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)

Beta

This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The MultiAgentBench benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.

MultiAgentBench originates from the MARBLE framework and is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. MASEval integrates it through a bug-fixed fork of MARBLE. The benchmark features:

  • 6 diverse domains: research, bargaining, coding, database, werewolf, minecraft (minecraft is untested)
  • Multiple coordination modes: cooperative, star, tree, hierarchical
  • LLM-based evaluation: Matches MARBLE's evaluation methodology
  • Framework-agnostic: Use with any agent framework or MARBLE's native agents

Reference Paper: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Check out the BENCHMARKS.md file for more information including licenses.

MultiAgentBenchBenchmark

Bases: Benchmark

Abstract base class for framework-agnostic MultiAgentBench evaluation.

This benchmark provides the infrastructure for evaluating multi-agent systems on MARBLE's MultiAgentBench tasks. Subclasses implement setup_agents() with their specific agent framework.

The benchmark supports:

  • Multiple coordination modes (star, cooperative, tree, hierarchical)
  • Multiple domains (research, bargaining, coding, database, etc.)
  • LLM-based evaluation matching MARBLE's metrics
  • Comprehensive tracing of agent interactions

Warning

communication_score is only computed when agents use MarbleAgentAdapter, which populates the communication_log trace key from BaseAgent.act(). Custom setup_agents() implementations using other adapters must explicitly populate communication_log in each adapter's gather_traces() output for communication evaluation to work. See MultiAgentBenchEvaluator._extract_communications() for the expected format.

Example
class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
    def setup_agents(self, agent_data, environment, task, user, seed_generator):
        # Derive seeds for agents (returns None if seeding disabled)
        agents_gen = seed_generator.child("agents")
        agent_seeds = {}
        for config in task.environment_data.get("agents", []):
            agent_id = config.get("agent_id")
            agent_seeds[agent_id] = agents_gen.derive_seed(agent_id)

        # Create agents using your framework with seeds
        agents_list = []
        agents_dict = {}
        for config in task.environment_data.get("agents", []):
            agent_id = config.get("agent_id")
            model = self.get_model_adapter(
                agent_data.get("model_id", "gpt-4o"),
                register_name=f"agent_{agent_id}",
                seed=agent_seeds.get(agent_id),
            )
            # Create your agent with the seeded model...
            ...
        return agents_list, agents_dict

    def get_model_adapter(self, model_id, **kwargs):
        seed = kwargs.pop("seed", None)
        adapter = MyModelAdapter(model_id, seed=seed)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

benchmark = MyMultiAgentBenchmark(seed=42)  # Enable seeding
results = benchmark.run(tasks, agent_data={"model_id": "gpt-4o"})

seed_generator property

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.

RETURNS DESCRIPTION
SeedGenerator

The root SeedGenerator instance.
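The namespacing behaviour described above (child generators, per-name derived seeds, and None when seeding is disabled) can be illustrated with a toy stand-in. This is a minimal sketch, not MASEval's SeedGenerator: the class name ToySeedGenerator and the SHA-256 derivation scheme are assumptions made purely for illustration.

```python
import hashlib
from typing import Optional


class ToySeedGenerator:
    """Illustrative stand-in only; MASEval's SeedGenerator may work differently."""

    def __init__(self, seed: Optional[int], path: str = "root") -> None:
        self.seed = seed
        self.path = path

    def child(self, name: str) -> "ToySeedGenerator":
        # A child keeps the root seed and extends the namespace path.
        return ToySeedGenerator(self.seed, f"{self.path}/{name}")

    def derive_seed(self, name: str) -> Optional[int]:
        # With seeding disabled (seed=None), every derived seed is None.
        if self.seed is None:
            return None
        # Hash the root seed together with the full namespace path (assumed scheme).
        key = f"{self.seed}:{self.path}/{name}".encode()
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:4], "big")


seeded = ToySeedGenerator(42).child("agents")
unseeded = ToySeedGenerator(None).child("agents")

# The same root seed and path always yield the same derived seed.
print(seeded.derive_seed("agent1") == seeded.derive_seed("agent1"))  # True
print(unseeded.derive_seed("agent1"))  # None
```

Because callers receive either an int or None through the same method, code such as setup_agents() does not need a separate branch for the unseeded case.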

usage property

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component property

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").
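To make the relationship between the per-component totals and the grand total concrete, here is a small sketch of how running totals could be accumulated across repetitions. ToyUsage, its token fields, ToyUsageTotals, and the "models:judge" key are all hypothetical; the real Usage type and its fields are not shown in this reference.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyUsage:
    """Hypothetical usage record; MASEval's Usage fields may differ."""

    input_tokens: int = 0
    output_tokens: int = 0

    def __add__(self, other: "ToyUsage") -> "ToyUsage":
        return ToyUsage(
            self.input_tokens + other.input_tokens,
            self.output_tokens + other.output_tokens,
        )


class ToyUsageTotals:
    """Keeps running totals keyed by "category:name" registry keys."""

    def __init__(self) -> None:
        self.by_component: Dict[str, ToyUsage] = defaultdict(ToyUsage)

    def add_repetition(self, usage: Dict[str, ToyUsage]) -> None:
        # Fold one repetition's usage into the persistent totals.
        for key, value in usage.items():
            self.by_component[key] = self.by_component[key] + value

    @property
    def total(self) -> ToyUsage:
        # Grand total is the sum over all components seen so far.
        result = ToyUsage()
        for value in self.by_component.values():
            result = result + value
        return result


totals = ToyUsageTotals()
totals.add_repetition({"models:main_model": ToyUsage(100, 20)})
totals.add_repetition(
    {"models:main_model": ToyUsage(50, 10), "models:judge": ToyUsage(5, 1)}
)
print(totals.by_component["models:main_model"])
print(totals.total)
```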

__init__

__init__(
    callbacks: Optional[List[BenchmarkCallback]] = None,
    n_task_repeats: int = 1,
    max_invocations: int = 10,
    num_workers: int = 1,
    fail_on_setup_error: bool = False,
    fail_on_task_error: bool = False,
    fail_on_evaluation_error: bool = False,
    progress_bar: bool | str = True,
    seed: Optional[int] = None,
    seed_generator: Optional[SeedGenerator] = None,
)

Initialize the benchmark.

PARAMETER DESCRIPTION
callbacks

Optional list of callbacks

TYPE: Optional[List[BenchmarkCallback]] DEFAULT: None

n_task_repeats

Number of times to repeat each task

TYPE: int DEFAULT: 1

max_invocations

Maximum agent invocations per task

TYPE: int DEFAULT: 10

num_workers

Number of parallel workers

TYPE: int DEFAULT: 1

fail_on_setup_error

Raise on setup errors

TYPE: bool DEFAULT: False

fail_on_task_error

Raise on task errors

TYPE: bool DEFAULT: False

fail_on_evaluation_error

Raise on evaluation errors

TYPE: bool DEFAULT: False

progress_bar

Progress bar configuration

TYPE: bool | str DEFAULT: True

seed

Global seed for reproducible benchmark runs

TYPE: Optional[int] DEFAULT: None

seed_generator

Custom seed generator (takes precedence over seed)

TYPE: Optional[SeedGenerator] DEFAULT: None

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER DESCRIPTION
callback

A BenchmarkCallback instance that will receive execution events.

TYPE: BenchmarkCallback

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark()
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run(tasks=tasks, agent_data=config)

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

  • metadata - Collection timestamp and thread info
  • agents - Dict mapping agent names to their config (settings, parameters)
  • models - Dict mapping model names to their config (model IDs, parameters)
  • tools - Dict mapping tool names to their config (specifications, settings)
  • simulators - Dict mapping simulator names to their config (parameters, templates)
  • callbacks - Dict mapping callback names to their config (settings)
  • environment - Direct config from the environment (not nested), or None if not present
  • user - Direct config from the user simulator (not nested), or None if not present
  • other - Dict for any other registered components
  • benchmark - Benchmark-level configuration (git, system, packages)
RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run(tasks=tasks, agent_data=config)

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

  • metadata - Collection timestamp and thread info
  • agents - Dict mapping agent names to their traces (messages, execution data)
  • models - Dict mapping model names to their traces (API calls, timing, errors)
  • tools - Dict mapping tool names to their traces (invocations, parameters)
  • simulators - Dict mapping simulator names to their traces (attempts, outcomes)
  • callbacks - Dict mapping callback names to their traces (custom data)
  • environment - Direct traces from the environment (not nested), or None if not present
  • user - Direct traces from the user simulator (not nested), or None if not present
  • other - Dict for any other registered components
RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run(tasks=tasks, agent_data=config)

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and also accumulates into persistent running totals accessible via usage and usage_by_component.

RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Execute evaluators on the results.

PARAMETER DESCRIPTION
evaluators

The evaluators

TYPE: Sequence[Evaluator]

agents

Dict of all agents

TYPE: Dict[str, AgentAdapter]

final_answer

The combined agent outputs

TYPE: Any

traces

Execution traces

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of evaluation results

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute agents in a single pass.

MultiAgentBench uses multi-agent coordination instead of user interaction. The base class execution_loop breaks after one call when user is None, so this override makes the single-pass behavior explicit.

Subclasses (e.g. MarbleMultiAgentBenchBenchmark) override this with multi-iteration coordination loops matching their framework's orchestration.
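The single-pass behaviour when user is None can be pictured with a toy loop. This is only a sketch of the control flow described above: ToyUser, its respond() method, and toy_execution_loop are hypothetical names, and the real base-class loop may differ in its details.

```python
from typing import Callable, List, Optional


class ToyUser:
    """Hypothetical user simulator that runs out of replies after a while."""

    def __init__(self, replies: List[str]) -> None:
        self._replies = list(replies)

    def respond(self, agent_output: str) -> Optional[str]:
        # Return the next scripted reply, or None when the dialogue is over.
        return self._replies.pop(0) if self._replies else None


def toy_execution_loop(
    run_agents: Callable[[str], str],
    query: str,
    user: Optional[ToyUser],
    max_invocations: int = 10,
) -> List[str]:
    outputs: List[str] = []
    for _ in range(max_invocations):
        outputs.append(run_agents(query))
        if user is None:
            break  # No user to continue the dialogue: single pass.
        reply = user.respond(outputs[-1])
        if reply is None:
            break
        query = reply
    return outputs


def echo_agent(query: str) -> str:
    return f"answer to {query!r}"


print(len(toy_execution_loop(echo_agent, "task", user=None)))  # 1
print(len(toy_execution_loop(echo_agent, "task", user=ToyUser(["more", "ok"]))))  # 3
```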

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION
status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples:

  • TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution
  • TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed
  • [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS DESCRIPTION
SequentialTaskQueue

SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES DESCRIPTION
RuntimeError

If reports is None and run() has not been executed yet.

How to use
# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry failed tasks (the main use case)
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed, agent_data=config)

# Or more concisely
reports = benchmark.run(tasks=tasks, agent_data=config)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks(), agent_data=config)
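The filtering semantics of status_filter (None means any non-SUCCESS status; a single status or a list narrows the selection) can be sketched on plain report dictionaries. ToyStatus is a stand-in enum limited to the members named in this section, with made-up values, and failed_task_ids is a hypothetical helper; the real method returns a SequentialTaskQueue rather than a list of IDs.

```python
from enum import Enum
from typing import Any, Dict, Iterable, List, Optional, Union


class ToyStatus(Enum):
    """Stand-in for TaskExecutionStatus; the values here are invented."""

    SUCCESS = "success"
    SETUP_FAILED = "setup_failed"
    TASK_EXECUTION_FAILED = "task_execution_failed"
    EVALUATION_FAILED = "evaluation_failed"


def failed_task_ids(
    reports: Iterable[Dict[str, Any]],
    status_filter: Optional[Union[ToyStatus, List[ToyStatus]]] = None,
) -> List[str]:
    # Normalise the filter to a set, or None for "any failure".
    if status_filter is None:
        wanted = None
    elif isinstance(status_filter, ToyStatus):
        wanted = {status_filter}
    else:
        wanted = set(status_filter)

    ids: List[str] = []
    for report in reports:
        status = report["status"]
        if status is ToyStatus.SUCCESS:
            continue
        if wanted is not None and status not in wanted:
            continue
        # A task that failed in several repetitions is listed once.
        if report["task_id"] not in ids:
            ids.append(report["task_id"])
    return ids


toy_reports = [
    {"task_id": "a", "repeat_idx": 0, "status": ToyStatus.SUCCESS},
    {"task_id": "b", "repeat_idx": 0, "status": ToyStatus.SETUP_FAILED},
    {"task_id": "c", "repeat_idx": 0, "status": ToyStatus.EVALUATION_FAILED},
    {"task_id": "c", "repeat_idx": 1, "status": ToyStatus.EVALUATION_FAILED},
]
print(failed_task_ids(toy_reports))  # ['b', 'c']
print(failed_task_ids(toy_reports, ToyStatus.EVALUATION_FAILED))  # ['c']
```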

get_model_adapter abstractmethod

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Provide a model adapter (implement in subclass).

PARAMETER DESCRIPTION
model_id

Model identifier

TYPE: str

**kwargs

Additional arguments including register_name

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
ModelAdapter

ModelAdapter instance

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER DESCRIPTION
category

Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs().

TYPE: str

name

Unique identifier for this component within its category

TYPE: str

component

Any object inheriting from TraceableMixin and/or ConfigurableMixin

TYPE: RegisterableComponent

RETURNS DESCRIPTION
RegisterableComponent

The component (for chaining convenience)

RAISES DESCRIPTION
ValueError

If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user, seed_generator):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.
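The registration rules described in this section (category plus name as the key, the component returned for chaining, a ValueError when the same object is registered under a different name, and clearing between repetitions) can be pictured with a toy registry. ToyRegistry is a hypothetical stand-in written for illustration; it is not the data structure MASEval uses internally.

```python
from typing import Any, Dict


class ToyRegistry:
    """Illustrative registry keyed by "category:name"."""

    def __init__(self) -> None:
        self._components: Dict[str, Any] = {}
        self._key_by_id: Dict[int, str] = {}

    def register(self, category: str, name: str, component: Any) -> Any:
        key = f"{category}:{name}"
        existing = self._key_by_id.get(id(component))
        # The same object may not appear under two different keys.
        if existing is not None and existing != key:
            raise ValueError(f"component already registered as {existing!r}")
        self._components[key] = component
        self._key_by_id[id(component)] = key
        return component  # Returned for chaining convenience.

    def clear(self) -> None:
        # Mirrors the idea of clear_registry(): nothing carries over.
        self._components.clear()
        self._key_by_id.clear()

    def keys(self):
        return list(self._components)


registry = ToyRegistry()
model = registry.register("models", "main_model", object())
print(registry.keys())  # ['models:main_model']

try:
    registry.register("models", "second_name", model)
except ValueError as exc:
    print(f"rejected: {exc}")

registry.clear()
print(registry.keys())  # []
```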

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION
tasks

Task source for execution. Can be:

  • A single Task object
  • A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue)
  • An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of report dictionaries, one per task repetition. Each report contains:

  • task_id: Task identifier (UUID)
  • repeat_idx: Repetition index (0 to n_task_repeats-1)
  • status: Execution status (one of TaskExecutionStatus enum values)
  • traces: Execution traces from all registered components
  • config: Configuration from all registered components and benchmark level
  • eval: Evaluation results (None if task or evaluation failed)
  • error: Error details dict (only present if status is not SUCCESS), containing:
      • error_type: Exception class name
      • error_message: Exception message
      • traceback: Full traceback string

RAISES DESCRIPTION
ValueError

If agent_data length doesn't match number of tasks (when agent_data is an iterable).
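The two accepted shapes of agent_data, and the length check behind the ValueError, can be sketched as a small normalisation step. The helper name normalize_agent_data is hypothetical and the error message is invented; this only illustrates the rule stated above.

```python
from typing import Any, Dict, Iterable, List, Union


def normalize_agent_data(
    agent_data: Union[Dict[str, Any], Iterable[Dict[str, Any]]],
    n_tasks: int,
) -> List[Dict[str, Any]]:
    # A single dict is applied to every task.
    if isinstance(agent_data, dict):
        return [agent_data] * n_tasks
    # Otherwise expect exactly one configuration per task.
    configs = list(agent_data)
    if len(configs) != n_tasks:
        raise ValueError(
            f"agent_data has {len(configs)} entries but there are {n_tasks} tasks"
        )
    return configs


print(len(normalize_agent_data({"model_id": "gpt-4o"}, n_tasks=3)))  # 3

try:
    normalize_agent_data([{"model_id": "gpt-4o"}], n_tasks=2)
except ValueError as exc:
    print(f"rejected: {exc}")
```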

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

  1. Calls your setup methods to initialize components
  2. Calls your run_agents() method to execute the task
  3. Collects message histories and calls evaluators
  4. Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task, seed_generator)
        user = setup_user(agent_data, environment, task, seed_generator)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user, seed_generator)
        evaluators = setup_evaluators(environment, task, agents_to_run, user, seed_generator)

        # Run stage (execution_loop handles multi-turn if user exists)
        final_answer = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_all_traces()
        eval_results = evaluate(evaluators, agents_dict, final_answer, traces)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

  • on_run_start: Before processing any tasks
  • on_task_start: Before processing a task (once per task, not per repeat)
  • on_task_repeat_start: Before each repetition of a task
  • on_task_repeat_end: After each repetition completes
  • on_task_end: After all repetitions of a task complete
  • on_run_end: After all tasks complete
# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)  # queue receives on_task_complete callbacks
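The ordering of the callback hooks listed above, for a sequential run, can be traced with a short simulation. This sketch only reproduces the documented nesting (run, task, repetition); the function simulate_hooks and the event string format are hypothetical, and the order in which events from different tasks interleave under num_workers > 1 is not covered.

```python
from typing import List


def simulate_hooks(task_ids: List[str], n_task_repeats: int) -> List[str]:
    events = ["on_run_start"]
    for task_id in task_ids:
        # Fired once per task, not once per repetition.
        events.append(f"on_task_start:{task_id}")
        for repeat_idx in range(n_task_repeats):
            events.append(f"on_task_repeat_start:{task_id}:{repeat_idx}")
            events.append(f"on_task_repeat_end:{task_id}:{repeat_idx}")
        events.append(f"on_task_end:{task_id}")
    events.append("on_run_end")
    return events


for event in simulate_hooks(["t1"], n_task_repeats=2):
    print(event)
```

For one task with two repetitions this yields eight events: one run start, one task start, two start/end pairs for the repetitions, one task end, and one run end.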

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    query: str,
) -> Dict[str, Any]

Execute the multi-agent system.

For MultiAgentBench, this runs all agents on the task and collects their outputs.

PARAMETER DESCRIPTION
agents

Agents to run

TYPE: Sequence[AgentAdapter]

task

The task

TYPE: Task

environment

The environment

TYPE: Environment

query

The query/task content

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

Dict with agent_results, communications, and coordination_mode

setup_agents abstractmethod

setup_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Create agents for the task (implement in subclass).

Subclasses should:

  1. Read agent specifications from task.environment_data["agents"]
  2. Derive seeds from seed_generator for each agent's model
  3. Create agents using their framework with seeded models
  4. Wrap them in AgentAdapter
  5. Set up relationships from task.environment_data["relationships"]

PARAMETER DESCRIPTION
agent_data

Agent configuration (model IDs, etc.)

TYPE: Dict[str, Any]

environment

The environment instance

TYPE: Environment

task

The task containing agent specs

TYPE: Task

user

User simulator (None for MultiAgentBench)

TYPE: Optional[User]

seed_generator

Seed generator for deriving deterministic seeds. Use seed_generator.child("agents") to create a namespace, then derive_seed(agent_id) for each agent's model. Returns None if seeding is disabled.

TYPE: SeedGenerator

RETURNS DESCRIPTION
Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Tuple of (agents_to_run, agents_dict)

Example
def setup_agents(self, agent_data, environment, task, user, seed_generator):
    agents_gen = seed_generator.child("agents")
    model_id = agent_data.get("model_id", "gpt-4o")

    for config in task.environment_data.get("agents", []):
        agent_id = config.get("agent_id")
        seed = agents_gen.derive_seed(agent_id)  # Returns None if seeding disabled
        model = self.get_model_adapter(model_id, seed=seed)
        # Create agent with seeded model...

setup_environment

setup_environment(
    agent_data: Dict[str, Any],
    task: Task,
    seed_generator: SeedGenerator,
) -> Environment

Create the MultiAgentBench environment.

PARAMETER DESCRIPTION
agent_data

Agent configuration

TYPE: Dict[str, Any]

task

The task to set up

TYPE: Task

seed_generator

Seed generator for reproducibility

TYPE: SeedGenerator

RETURNS DESCRIPTION
Environment

MultiAgentBenchEnvironment instance

setup_evaluators

setup_evaluators(
    environment: Environment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Sequence[Evaluator]

Create evaluators for the task.

PARAMETER DESCRIPTION
environment

The environment

TYPE: Environment

task

The task with evaluation data

TYPE: Task

agents

The agents

TYPE: Sequence[AgentAdapter]

user

User simulator (None for MultiAgentBench)

TYPE: Optional[User]

seed_generator

Seed generator for reproducibility

TYPE: SeedGenerator

RETURNS DESCRIPTION
Sequence[Evaluator]

List of evaluators

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    seed_generator: SeedGenerator,
) -> Optional[User]

MultiAgentBench tasks don't use user simulators.

The multi-agent coordination replaces user interaction.

PARAMETER DESCRIPTION
agent_data

Agent configuration

TYPE: Dict[str, Any]

environment

The environment instance

TYPE: Environment

task

The task

TYPE: Task

seed_generator

Seed generator (unused)

TYPE: SeedGenerator

RETURNS DESCRIPTION
Optional[User]

None

MarbleMultiAgentBenchBenchmark

Bases: MultiAgentBenchBenchmark

MARBLE reproduction mode for MultiAgentBench.

This benchmark uses MARBLE's native agents and engine for exact reproduction of published results. It wraps MARBLE components in MASEval adapters for unified tracing.

Example
from maseval.benchmark.multiagentbench import (
    MarbleMultiAgentBenchBenchmark,
    load_tasks,
    configure_model_ids,
)

class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
    def get_model_adapter(self, model_id, **kwargs):
        from maseval.interface.openai import OpenAIModelAdapter
        adapter = OpenAIModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")

benchmark = MyMarbleBenchmark()
results = benchmark.run(tasks, agent_data={})

seed_generator property

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.

RETURNS DESCRIPTION
SeedGenerator

The root SeedGenerator instance.

usage property

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component property

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").

__init__

__init__(
    callbacks: Optional[List[BenchmarkCallback]] = None,
    n_task_repeats: int = 1,
    max_invocations: int = 10,
    num_workers: int = 1,
    fail_on_setup_error: bool = False,
    fail_on_task_error: bool = False,
    fail_on_evaluation_error: bool = False,
    progress_bar: bool | str = True,
    seed: Optional[int] = None,
    seed_generator: Optional[SeedGenerator] = None,
)

Initialize the benchmark.

PARAMETER DESCRIPTION
callbacks

Optional list of callbacks

TYPE: Optional[List[BenchmarkCallback]] DEFAULT: None

n_task_repeats

Number of times to repeat each task

TYPE: int DEFAULT: 1

max_invocations

Maximum agent invocations per task

TYPE: int DEFAULT: 10

num_workers

Number of parallel workers

TYPE: int DEFAULT: 1

fail_on_setup_error

Raise on setup errors

TYPE: bool DEFAULT: False

fail_on_task_error

Raise on task errors

TYPE: bool DEFAULT: False

fail_on_evaluation_error

Raise on evaluation errors

TYPE: bool DEFAULT: False

progress_bar

Progress bar configuration

TYPE: bool | str DEFAULT: True

seed

Global seed for reproducible benchmark runs

TYPE: Optional[int] DEFAULT: None

seed_generator

Custom seed generator (takes precedence over seed)

TYPE: Optional[SeedGenerator] DEFAULT: None

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER DESCRIPTION
callback

A BenchmarkCallback instance that will receive execution events.

TYPE: BenchmarkCallback

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark()
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run(tasks=tasks, agent_data=config)

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

  • metadata - Collection timestamp and thread info
  • agents - Dict mapping agent names to their config (settings, parameters)
  • models - Dict mapping model names to their config (model IDs, parameters)
  • tools - Dict mapping tool names to their config (specifications, settings)
  • simulators - Dict mapping simulator names to their config (parameters, templates)
  • callbacks - Dict mapping callback names to their config (settings)
  • environment - Direct config from the environment (not nested), or None if not present
  • user - Direct config from the user simulator (not nested), or None if not present
  • other - Dict for any other registered components
  • benchmark - Benchmark-level configuration (git, system, packages)
RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run(tasks=tasks, agent_data=config)

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.
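
As a rough illustration of the reproducibility analysis mentioned above, the sketch below groups reports by task and checks whether every repetition ran on the same git commit. The helper name `commits_by_task` and the sample reports are hypothetical; only the `report['config']['benchmark']['git']['commit_hash']` path is taken from the example above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


def commits_by_task(reports: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each task_id to the set of git commit hashes its repetitions ran on."""
    commits: Dict[str, Set[str]] = defaultdict(set)
    for report in reports:
        commit = report["config"]["benchmark"]["git"]["commit_hash"]
        commits[report["task_id"]].add(commit)
    return dict(commits)


def _report(task_id: str, repeat_idx: int, commit: str) -> Dict[str, Any]:
    # Hand-written stand-in for one entry of benchmark.reports (illustrative values).
    return {
        "task_id": task_id,
        "repeat_idx": repeat_idx,
        "config": {"benchmark": {"git": {"commit_hash": commit}}},
    }


sample_reports = [
    _report("t1", 0, "abc123"),
    _report("t1", 1, "abc123"),
    _report("t2", 0, "def456"),
]

# Tasks whose repetitions were produced by more than one commit.
mixed = {task for task, hashes in commits_by_task(sample_reports).items() if len(hashes) > 1}
print(mixed)
```

An empty result suggests that all repetitions of each task were run from a single code revision.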

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

  • metadata - Collection timestamp and thread info
  • agents - Dict mapping agent names to their traces (messages, execution data)
  • models - Dict mapping model names to their traces (API calls, timing, errors)
  • tools - Dict mapping tool names to their traces (invocations, parameters)
  • simulators - Dict mapping simulator names to their traces (attempts, outcomes)
  • callbacks - Dict mapping callback names to their traces (custom data)
  • environment - Direct traces from the environment (not nested), or None if not present
  • user - Direct traces from the user simulator (not nested), or None if not present
  • other - Dict for any other registered components
RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run(tasks=tasks, agent_data=config)

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.
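
The output fields listed above mix two shapes: dict-valued categories keyed by component name, and direct categories (environment, user) that hold one component's traces or None. A minimal sketch of a summary that handles both follows; `summarize_traces` and the inner keys of the sample data (`messages`, `invocations`, `state`, `timestamp`) are assumptions for illustration.

```python
from typing import Any, Dict, List

DICT_CATEGORIES = ("agents", "models", "tools", "simulators", "callbacks", "other")
DIRECT_CATEGORIES = ("environment", "user")


def summarize_traces(traces: Dict[str, Any]) -> Dict[str, List[str]]:
    """List component names per category; direct categories only report presence."""
    summary: Dict[str, List[str]] = {}
    for category in DICT_CATEGORIES:
        # Missing or empty categories yield an empty list.
        summary[category] = sorted(traces.get(category) or {})
    for category in DIRECT_CATEGORIES:
        # Direct categories hold the traces themselves, or None if not present.
        summary[category] = [category] if traces.get(category) is not None else []
    return summary


# Hand-written stand-in for report["traces"]; inner keys are illustrative only.
sample_traces = {
    "metadata": {"timestamp": "2024-01-01T00:00:00"},
    "agents": {"agent2": {"messages": []}, "agent1": {"messages": []}},
    "models": {},
    "tools": {"search": {"invocations": []}},
    "environment": {"state": {}},
    "user": None,
}

summary = summarize_traces(sample_traces)
print(summary["agents"], summary["user"])
```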

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and adds it to persistent running totals, accessible via usage and usage_by_component.

RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Execute evaluators on the results.

PARAMETER DESCRIPTION
evaluators

The evaluators

TYPE: Sequence[Evaluator]

agents

Dict of all agents

TYPE: Dict[str, AgentAdapter]

final_answer

The combined agent outputs

TYPE: Any

traces

Execution traces

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of evaluation results

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute MARBLE's multi-iteration coordination loop.

Dispatches to the appropriate coordination handler based on the task's coordinate_mode. Replicates Engine.start() from marble/engine/engine.py:1034-1055.

PARAMETER DESCRIPTION
agents

MARBLE agents wrapped in MarbleAgentAdapter

TYPE: Sequence[AgentAdapter]

task

The task being solved

TYPE: Task

environment

The environment

TYPE: Environment

user

Always None for MultiAgentBench

TYPE: Optional[User]

RETURNS DESCRIPTION
Any

Dict with agent_results, communications, and coordination_mode

RAISES DESCRIPTION
ValueError

If coordinate_mode is not supported
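
The dispatch behavior described above can be pictured with a small handler table. This is only a sketch: the handler functions, the `HANDLERS` table, and the use of plain agent names are hypothetical stand-ins, and the shape of `agent_results` is an assumption. The actual method replicates MARBLE's engine logic.

```python
from typing import Any, Callable, Dict, List


def _run_star(agents: List[str]) -> Dict[str, str]:
    # Placeholder for a star-mode coordination handler.
    return {name: f"{name} done (star)" for name in agents}


def _run_tree(agents: List[str]) -> Dict[str, str]:
    # Placeholder for a tree-mode coordination handler.
    return {name: f"{name} done (tree)" for name in agents}


# Hypothetical handler table; only two modes are stubbed for brevity.
HANDLERS: Dict[str, Callable[[List[str]], Dict[str, str]]] = {
    "star": _run_star,
    "tree": _run_tree,
}


def dispatch(coordinate_mode: str, agents: List[str]) -> Dict[str, Any]:
    """Pick a handler by mode and return a result dict with the documented keys."""
    handler = HANDLERS.get(coordinate_mode)
    if handler is None:
        raise ValueError(f"Unsupported coordinate_mode: {coordinate_mode!r}")
    return {
        "agent_results": handler(agents),
        "communications": [],
        "coordination_mode": coordinate_mode,
    }


output = dispatch("star", ["agent1", "agent2"])
print(output["coordination_mode"], sorted(output["agent_results"]))
```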

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION
status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples:

  • TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution
  • TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed
  • [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS DESCRIPTION
SequentialTaskQueue

SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES DESCRIPTION
RuntimeError

If reports is None and run() has not been executed yet.
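
The filtering rule described above (no filter means any status except SUCCESS, otherwise one status or a list) might look roughly like the sketch below. `Status` is a stand-in for TaskExecutionStatus containing only the members named on this page, its string values are invented, and storing the enum member directly under `report["status"]` is an assumption.

```python
from enum import Enum
from typing import Any, Dict, List, Optional, Union


class Status(Enum):
    # Stand-in for TaskExecutionStatus; values are invented for this sketch.
    SUCCESS = "success"
    SETUP_FAILED = "setup_failed"
    TASK_EXECUTION_FAILED = "task_execution_failed"
    EVALUATION_FAILED = "evaluation_failed"


def failed_task_ids(
    reports: List[Dict[str, Any]],
    status_filter: Optional[Union[Status, List[Status]]] = None,
) -> List[str]:
    """Return ids of failed tasks in report order, without duplicates."""
    if status_filter is None:
        wanted = None  # any non-SUCCESS status counts as a failure
    elif isinstance(status_filter, Status):
        wanted = {status_filter}
    else:
        wanted = set(status_filter)

    failed: List[str] = []
    for report in reports:
        status = report["status"]
        if status is Status.SUCCESS:
            continue
        if (wanted is None or status in wanted) and report["task_id"] not in failed:
            failed.append(report["task_id"])
    return failed


sample_reports = [
    {"task_id": "t1", "status": Status.SUCCESS},
    {"task_id": "t2", "status": Status.TASK_EXECUTION_FAILED},
    {"task_id": "t3", "status": Status.EVALUATION_FAILED},
    {"task_id": "t4", "status": Status.SETUP_FAILED},
]

print(failed_task_ids(sample_reports))
print(failed_task_ids(sample_reports, Status.EVALUATION_FAILED))
```

Unlike this sketch, the real method returns a SequentialTaskQueue of Task objects rather than a list of ids.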

How to use
# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry failed tasks (the key use case)
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed, agent_data=config)

# Or more concisely
reports = benchmark.run(tasks=tasks, agent_data=config)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks(), agent_data=config)

get_model_adapter abstractmethod

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Provide a model adapter (implement in subclass).

PARAMETER DESCRIPTION
model_id

Model identifier

TYPE: str

**kwargs

Additional arguments

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
ModelAdapter

ModelAdapter instance

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER DESCRIPTION
category

Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs().

TYPE: str

name

Unique identifier for this component within its category

TYPE: str

component

Any object inheriting from TraceableMixin and/or ConfigurableMixin

TYPE: RegisterableComponent

RETURNS DESCRIPTION
RegisterableComponent

The component (for chaining convenience)

RAISES DESCRIPTION
ValueError

If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user, seed_generator):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.
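
To make the registration contract concrete (the component is returned for chaining, and a ValueError is raised when the same object is registered again under a different name), here is a minimal stand-alone sketch. `ComponentRegistry` is hypothetical and tracks objects by identity; it is not MASEval's implementation.

```python
from typing import Any, Dict, Tuple


class ComponentRegistry:
    """Toy registry keyed by (category, name); a sketch, not MASEval's code."""

    def __init__(self) -> None:
        self._components: Dict[Tuple[str, str], Any] = {}
        self._key_by_object: Dict[int, Tuple[str, str]] = {}

    def register(self, category: str, name: str, component: Any) -> Any:
        key = (category, name)
        existing = self._key_by_object.get(id(component))
        if existing is not None and existing != key:
            raise ValueError(
                f"Component already registered as {existing[0]}/{existing[1]}"
            )
        self._components[key] = component
        self._key_by_object[id(component)] = key
        return component  # returned for chaining convenience

    def clear(self) -> None:
        # Mirrors the idea of clearing the registry between task repetitions.
        self._components.clear()
        self._key_by_object.clear()


registry = ComponentRegistry()
model = registry.register("models", "main_model", object())
try:
    registry.register("models", "other_name", model)
    conflict_detected = False
except ValueError:
    conflict_detected = True
print(conflict_detected)
```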

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION
tasks

Task source for execution. Can be:

  • A single Task object
  • A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue)
  • An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of report dictionaries, one per task repetition. Each report contains:

  • task_id: Task identifier (UUID)
  • repeat_idx: Repetition index (0 to n_task_repeats-1)
  • status: Execution status (one of TaskExecutionStatus enum values)
  • traces: Execution traces from all registered components
  • config: Configuration from all registered components and benchmark level
  • eval: Evaluation results (None if task or evaluation failed)
  • error: Error details dict (only present if status is not SUCCESS), containing:
      • error_type: Exception class name
      • error_message: Exception message
      • traceback: Full traceback string
RAISES DESCRIPTION
ValueError

If agent_data length doesn't match number of tasks (when agent_data is an iterable).
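
The agent_data rule above (one dict shared by all tasks, or one dict per task with matching length) can be sketched as a small normalization step. `expand_agent_data` is a hypothetical helper written for illustration; the framework's internal handling may differ.

```python
from typing import Any, Dict, Iterable, List, Sequence, Union


def expand_agent_data(
    tasks: Sequence[Any],
    agent_data: Union[Dict[str, Any], Iterable[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Return one agent config per task, validating per-task lists."""
    if isinstance(agent_data, dict):
        # A single dict is applied to every task.
        return [agent_data for _ in tasks]
    configs = list(agent_data)
    if len(configs) != len(tasks):
        raise ValueError(
            f"agent_data has {len(configs)} entries but there are {len(tasks)} tasks"
        )
    return configs


tasks = ["task-a", "task-b"]  # placeholders for Task objects
shared = expand_agent_data(tasks, {"model": "gpt-4"})
per_task = expand_agent_data(
    tasks,
    [{"model": "gpt-4", "difficulty": "easy"}, {"model": "gpt-4", "difficulty": "hard"}],
)
print(len(shared), per_task[1]["difficulty"])
```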

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle (setup, run, evaluate) for each execution. You don't implement this method; instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

  1. Calls your setup methods to initialize components
  2. Calls execution_loop(), which runs your agents on the task
  3. Collects traces and configs from all registered components and calls the evaluators
  4. Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task)
        user = setup_user(agent_data, environment, task)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
        evaluators = setup_evaluators(environment, task, agents_to_run, user)

        # Run stage (execution_loop handles multi-turn if user exists)
        agents_output = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_all_traces()
        eval_results = evaluate(evaluators, agents_dict, agents_output, traces)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

  • on_run_start: Before processing any tasks
  • on_task_start: Before processing a task (once per task, not per repeat)
  • on_task_repeat_start: Before each repetition of a task
  • on_task_repeat_end: After each repetition completes
  • on_task_end: After all repetitions of a task complete
  • on_run_end: After all tasks complete
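
The nesting of these hooks can be illustrated by generating the expected event order for a small run. The sketch assumes sequential execution (with several workers, events from different tasks may interleave), and `hook_sequence` is a hypothetical helper, not part of the framework.

```python
from typing import List


def hook_sequence(task_ids: List[str], n_task_repeats: int) -> List[str]:
    """List hook names in the order a sequential run would trigger them."""
    events = ["on_run_start"]
    for task_id in task_ids:
        events.append(f"on_task_start:{task_id}")  # once per task
        for repeat_idx in range(n_task_repeats):
            events.append(f"on_task_repeat_start:{task_id}:{repeat_idx}")
            events.append(f"on_task_repeat_end:{task_id}:{repeat_idx}")
        events.append(f"on_task_end:{task_id}")  # after all repetitions
    events.append("on_run_end")
    return events


events = hook_sequence(["t1"], n_task_repeats=2)
for event in events:
    print(event)
```
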
# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)  # queue receives task completion notifications

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    query: str,
) -> Dict[str, Any]

Execute the multi-agent system.

For MultiAgentBench, this runs all agents on the task and collects their outputs.

PARAMETER DESCRIPTION
agents

Agents to run

TYPE: Sequence[AgentAdapter]

task

The task

TYPE: Task

environment

The environment

TYPE: Environment

query

The query/task content

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

Dict with agent_results, communications, and coordination_mode

setup_agents

setup_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Create MARBLE agents wrapped in MASEval adapters.

Also creates MARBLE's orchestration components (EnginePlanner, SharedMemory, AgentGraph) needed by execution_loop to replicate MARBLE's multi-iteration coordination.

Note

MARBLE agents use their own internal LLM handling with a model ID string, not MASEval's ModelAdapter. This means seed_generator cannot be applied to agent LLM calls in this implementation. For reproducible agent behavior, use MultiAgentBenchBenchmark with a custom setup_agents that creates agents using seeded MASEval ModelAdapters.

PARAMETER DESCRIPTION
agent_data

Agent configuration

TYPE: Dict[str, Any]

environment

The environment

TYPE: Environment

task

The task with agent specifications

TYPE: Task

user

User simulator (None)

TYPE: Optional[User]

seed_generator

Seed generator (not used for MARBLE agents, but seeding is applied to evaluators)

TYPE: SeedGenerator

RETURNS DESCRIPTION
Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Tuple of (agents_to_run, agents_dict)

setup_environment

setup_environment(
    agent_data: Dict[str, Any],
    task: Task,
    seed_generator: SeedGenerator,
) -> Environment

Create the MultiAgentBench environment.

PARAMETER DESCRIPTION
agent_data

Agent configuration

TYPE: Dict[str, Any]

task

The task to set up

TYPE: Task

seed_generator

Seed generator for reproducibility

TYPE: SeedGenerator

RETURNS DESCRIPTION
Environment

MultiAgentBenchEnvironment instance

setup_evaluators

setup_evaluators(
    environment: Environment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Sequence[Evaluator]

Create a thin evaluator for MARBLE reproduction mode.

All LLM-based evaluation happens inside the coordination loop via MARBLE's Evaluator (imported directly). The MarbleReproductionEvaluator only reformats pre-computed metrics into MASEval's result format.

No ModelAdapter is needed — evaluation LLM calls are handled by MARBLE's model_prompting() in the coordination loop.

PARAMETER DESCRIPTION
environment

The environment

TYPE: Environment

task

The task with evaluation data

TYPE: Task

agents

The agents

TYPE: Sequence[AgentAdapter]

user

User simulator (None for MultiAgentBench)

TYPE: Optional[User]

seed_generator

Seed generator for reproducibility

TYPE: SeedGenerator

RETURNS DESCRIPTION
Sequence[Evaluator]

List containing a single MarbleReproductionEvaluator

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    seed_generator: SeedGenerator,
) -> Optional[User]

MultiAgentBench tasks don't use user simulators.

The multi-agent coordination replaces user interaction.

PARAMETER DESCRIPTION
agent_data

Agent configuration

TYPE: Dict[str, Any]

environment

The environment instance

TYPE: Environment

task

The task

TYPE: Task

seed_generator

Seed generator (unused)

TYPE: SeedGenerator

RETURNS DESCRIPTION
Optional[User]

None

MultiAgentBenchEnvironment

Bases: Environment

MASEval Environment wrapper for MARBLE environments.

This environment wraps MARBLE's domain-specific environments (Research, Bargaining, Coding, etc.) and exposes their tools through MASEval's tracing infrastructure.

ATTRIBUTE DESCRIPTION
domain

The domain name (e.g., "research", "bargaining")

marble_env

The underlying MARBLE environment instance

__init__

__init__(task_data: Dict[str, Any])

Initialize the environment.

PARAMETER DESCRIPTION
task_data

Task data containing environment configuration

TYPE: Dict[str, Any]

RAISES DESCRIPTION
EnvironmentError

If required infrastructure is unavailable

ImportError

If MARBLE is not available

apply_action

apply_action(
    agent_id: Optional[str],
    action_name: str,
    arguments: Dict[str, Any],
) -> Dict[str, Any]

Execute an action in the MARBLE environment.

PARAMETER DESCRIPTION
agent_id

ID of the agent performing the action

TYPE: Optional[str]

action_name

Name of the action to execute

TYPE: str

arguments

Arguments for the action

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, Any]

Action result dictionary

create_tools

create_tools() -> Dict[str, Callable]

Create tools from MARBLE environment for MASEval tracing.

MARBLE environments expose tools via action_handler_descriptions. This method wraps them for MASEval's tracing infrastructure.

RETURNS DESCRIPTION
Dict[str, Callable]

Dict mapping tool names to wrapped callables
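
One way to picture "wrapping for tracing" is a decorator that records every invocation before returning the tool's result. The sketch below is an assumption about the general technique only; `traced`, `search_papers`, and the record keys are hypothetical and do not reflect MASEval's actual wrapper.

```python
import functools
from typing import Any, Callable, Dict, List


def traced(name: str, fn: Callable[..., Any], log: List[Dict[str, Any]]) -> Callable[..., Any]:
    """Wrap fn so each keyword-argument call is appended to log."""

    @functools.wraps(fn)
    def wrapper(**kwargs: Any) -> Any:
        record: Dict[str, Any] = {"tool": name, "arguments": dict(kwargs)}
        try:
            record["result"] = fn(**kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call is logged.
            log.append(record)

    return wrapper


def search_papers(query: str) -> List[str]:
    # Hypothetical tool used only for this illustration.
    return [f"paper about {query}"]


call_log: List[Dict[str, Any]] = []
tools = {"search_papers": traced("search_papers", search_papers, call_log)}
result = tools["search_papers"](query="multi-agent systems")
print(call_log[0]["tool"], result)
```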

gather_config

gather_config() -> Dict[str, Any]

Gather environment configuration.

RETURNS DESCRIPTION
Dict[str, Any]

Dict with environment configuration

gather_traces

gather_traces() -> Dict[str, Any]

Gather traces including tool invocations.

RETURNS DESCRIPTION
Dict[str, Any]

Dict with environment traces

get_marble_state

get_marble_state() -> Dict[str, Any]

Get the current MARBLE environment state.

RETURNS DESCRIPTION
Dict[str, Any]

State dictionary from MARBLE environment

get_tool

get_tool(name: str) -> Optional[Callable]

Get a specific tool by name.

PARAMETER DESCRIPTION
name

Tool name

TYPE: str

RETURNS DESCRIPTION
Optional[Callable]

Tool callable if found, None otherwise

get_tool_descriptions

get_tool_descriptions() -> Dict[str, Any]

Get tool descriptions in OpenAI function format.

RETURNS DESCRIPTION
Dict[str, Any]

Dict mapping tool names to their OpenAI-format descriptions
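
For readers unfamiliar with the OpenAI function format, a tool description generally pairs a name and description with a JSON Schema for the parameters. The sketch below builds one such entry; `describe_tool` and the `search_papers` tool are hypothetical, and the exact shape returned for a given MARBLE environment is not confirmed here.

```python
from typing import Any, Dict, List


def describe_tool(
    name: str,
    description: str,
    properties: Dict[str, Dict[str, Any]],
    required: List[str],
) -> Dict[str, Any]:
    """Build a tool description in the OpenAI function-calling layout."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


descriptions = {
    "search_papers": describe_tool(
        name="search_papers",
        description="Search for papers matching a query.",
        properties={"query": {"type": "string", "description": "Search terms"}},
        required=["query"],
    )
}
print(descriptions["search_papers"]["function"]["name"])
```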

get_tools

get_tools() -> Dict[str, Any]

Get all tools as a dict.

is_done

is_done() -> bool

Check if the environment has reached a terminal state.

RETURNS DESCRIPTION
bool

True if done, False otherwise

is_task_completed

is_task_completed() -> bool

Check if the task has been completed successfully.

RETURNS DESCRIPTION
bool

True if task completed, False otherwise

setup_state

setup_state(task_data: Dict[str, Any]) -> Dict[str, Any]

Initialize state and optionally create MARBLE environment.

PARAMETER DESCRIPTION
task_data

Task data containing environment configuration

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, Any]

Initial state dictionary

RAISES DESCRIPTION
EnvironmentError

If required infrastructure is unavailable

MultiAgentBenchEvaluator

Bases: Evaluator

Evaluator for MultiAgentBench tasks matching MARBLE's methodology.

This evaluator implements MARBLE's LLM-based evaluation metrics:

  • Task completion assessment
  • Communication quality scoring
  • Planning/coordination scoring
  • Domain-specific task evaluation (research, bargaining, etc.)

ATTRIBUTE DESCRIPTION
domain

The benchmark domain (research, bargaining, etc.)

model_adapter

Model adapter for LLM-based evaluation

metrics_config

Configuration for metrics to evaluate

__call__

__call__(
    traces: Dict[str, Any], final_answer: Any
) -> Dict[str, Any]

Evaluate the task execution.

PARAMETER DESCRIPTION
traces

Filtered execution traces

TYPE: Dict[str, Any]

final_answer

Final output from agents (dict, list, str, or None)

TYPE: Any

RETURNS DESCRIPTION
Dict[str, Any]

Evaluation results dictionary

__init__

__init__(
    domain: str,
    model_adapter: ModelAdapter,
    metrics_config: Optional[Dict[str, Any]] = None,
    output_format: str = "",
    result_truncation_length: Optional[
        int
    ] = DEFAULT_RESULT_TRUNCATION_LENGTH,
)

Initialize the evaluator.

PARAMETER DESCRIPTION
domain

Benchmark domain (research, bargaining, etc.)

TYPE: str

model_adapter

Model adapter for LLM evaluation

TYPE: ModelAdapter

metrics_config

Configuration for evaluation metrics

TYPE: Optional[Dict[str, Any]] DEFAULT: None

output_format

Expected output format for task evaluation

TYPE: str DEFAULT: ''

result_truncation_length

Maximum characters per agent result before LLM summarization. Matches MARBLE's _summarize_results() which truncates each result to 1000 chars, then passes the truncated output through an LLM summarization call (planner.summarize_output()). Set to None to disable both truncation and LLM summarization, passing raw agent results directly to the evaluator (not recommended for domains with large outputs like research).

TYPE: Optional[int] DEFAULT: DEFAULT_RESULT_TRUNCATION_LENGTH
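
The truncation half of this behavior is simple to sketch: each agent result is cut to a maximum number of characters, and None leaves results untouched. The helper `truncate_results` is hypothetical, it assumes results are plain strings keyed by agent id, and it deliberately omits the LLM summarization call that follows truncation.

```python
from typing import Dict, Optional


def truncate_results(results: Dict[str, str], max_chars: Optional[int]) -> Dict[str, str]:
    """Cut each agent result to max_chars characters; None disables truncation."""
    if max_chars is None:
        return dict(results)
    return {agent: text[:max_chars] for agent, text in results.items()}


raw = {"agent1": "x" * 1500, "agent2": "short answer"}
# 1000 is the per-result limit the description above attributes to MARBLE.
truncated = truncate_results(raw, max_chars=1000)
untouched = truncate_results(raw, max_chars=None)
print(len(truncated["agent1"]), len(untouched["agent1"]))
```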

filter_traces

filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]

Filter traces for evaluation.

PARAMETER DESCRIPTION
traces

All collected traces

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, Any]

Filtered traces relevant for evaluation

MarbleAgentAdapter

Bases: AgentAdapter

Adapter wrapping a MARBLE BaseAgent for MASEval tracing.

This adapter provides a unified interface to MARBLE agents while capturing all relevant traces for evaluation.

ATTRIBUTE DESCRIPTION
agent_id

Unique identifier for the agent

TYPE: str

marble_agent

The underlying MARBLE BaseAgent instance

TYPE: Any

profile

Agent's role profile from MARBLE config

TYPE: str

agent_id property

agent_id: str

Return the agent's unique identifier.

marble_agent property

marble_agent: Any

Return the underlying MARBLE agent.

profile property

profile: str

Return the agent's profile.

__init__

__init__(marble_agent: Any, agent_id: str)

Initialize the adapter.

PARAMETER DESCRIPTION
marble_agent

MARBLE BaseAgent instance

TYPE: Any

agent_id

Unique identifier for this agent

TYPE: str

gather_config

gather_config() -> Dict[str, Any]

Gather agent configuration.

RETURNS DESCRIPTION
Dict[str, Any]

Dict with agent configuration

gather_traces

gather_traces() -> Dict[str, Any]

Gather traces including agent-specific data.

RETURNS DESCRIPTION
Dict[str, Any]

Dict with all agent traces

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

  1. Explicit model_id passed to __init__
  2. Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS DESCRIPTION
Usage

Usage (or TokenUsage) with cost filled in when possible.

get_memory_str

get_memory_str() -> str

Get the agent's memory as a string.

RETURNS DESCRIPTION
str

Serialized memory string

get_messages

get_messages() -> MessageHistory

Get the current message history as an iterable MessageHistory object.

The returned MessageHistory can be:

  • Iterated: for msg in agent.get_messages(): ...
  • Indexed: agent.get_messages()[0]
  • Converted to list: list(agent.get_messages()) or agent.get_messages().to_list()
  • Checked for emptiness: if agent.get_messages(): ...

RETURNS DESCRIPTION
MessageHistory

MessageHistory object (empty if no messages yet)

Example
# Iterate directly
for msg in agent.get_messages():
    print(msg['role'], msg['content'])

# Convert to list
messages = agent.get_messages().to_list()
messages = list(agent.get_messages())

# Check if empty
if agent.get_messages():
    print("Agent has messages")

get_serialized_messages

get_serialized_messages(session_id: str = '') -> str

Get serialized inter-agent messages.

PARAMETER DESCRIPTION
session_id

Optional session ID filter

TYPE: str DEFAULT: ''

RETURNS DESCRIPTION
str

Serialized message string

get_token_usage

get_token_usage() -> int

Get the total token usage from the MARBLE agent.

RETURNS DESCRIPTION
int

Total tokens used by the agent

run

run(query: str) -> Any

Executes the agent and returns the result.

load_tasks

load_tasks(
    domain: str,
    data_dir: Optional[Path] = None,
    limit: Optional[int] = None,
) -> List[Task]

Load MultiAgentBench tasks for a domain.

Most domains load from JSONL files. Werewolf uses config-based task loading since it has no JSONL data (it uses a game engine).

PARAMETER DESCRIPTION
domain

Domain name (one of: coding, database, minecraft, research, bargaining, werewolf)

TYPE: str

data_dir

Optional path to MARBLE data directory

TYPE: Optional[Path] DEFAULT: None

limit

Maximum number of tasks to load (None for all)

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[Task]

List of Task objects

RAISES DESCRIPTION
ValueError

If domain is invalid

FileNotFoundError

If data files not found
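
For the JSONL-backed domains, loading with a limit amounts to reading one JSON object per line and stopping early. The sketch below shows that pattern with a temporary file; `read_jsonl` and the record contents are hypothetical, and the real loader additionally converts records into Task objects.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def read_jsonl(path: Path, limit: Optional[int] = None) -> List[Dict[str, Any]]:
    """Read up to limit JSON records from a JSONL file (all records if limit is None)."""
    if not path.exists():
        raise FileNotFoundError(path)
    records: List[Dict[str, Any]] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if limit is not None and len(records) >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text('{"task_id": 1}\n{"task_id": 2}\n{"task_id": 3}\n', encoding="utf-8")
    first_two = read_jsonl(demo, limit=2)
    everything = read_jsonl(demo)

print(len(first_two), len(everything))
```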

Example

>>> tasks = load_tasks("research", limit=5)
>>> len(tasks)
5
>>> tasks[0].metadata["domain"]
'research'

configure_model_ids

configure_model_ids(
    tasks: List[Task],
    *,
    agent_model_id: str,
    evaluator_model_id: Optional[str] = None,
) -> List[Task]

Configure model IDs for MARBLE agents and evaluator.

Modifies tasks in-place to set the LLM model IDs used by agents and optionally the evaluator.

PARAMETER DESCRIPTION
tasks

List of Tasks to configure

TYPE: List[Task]

agent_model_id

Model ID for all MARBLE agents (e.g., "gpt-4o")

TYPE: str

evaluator_model_id

Optional model ID for LLM-based evaluation

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
List[Task]

The input tasks (modified in-place)
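
Because the tasks are modified in place, the returned list is the same object that was passed in. The sketch below demonstrates that pattern with a stand-in task class. `DemoTask`, `set_model_ids`, and the `evaluation_data["model_id"]` key are hypothetical; only the `environment_data["llm"]` key appears in the example for this function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DemoTask:
    # Minimal stand-in for a Task; real tasks carry more fields.
    environment_data: Dict[str, Any] = field(default_factory=dict)
    evaluation_data: Dict[str, Any] = field(default_factory=dict)


def set_model_ids(
    tasks: List[DemoTask],
    *,
    agent_model_id: str,
    evaluator_model_id: Optional[str] = None,
) -> List[DemoTask]:
    """Set model ids on each task in place and return the same list."""
    for task in tasks:
        task.environment_data["llm"] = agent_model_id
        if evaluator_model_id is not None:
            # Hypothetical key: where the evaluator id is stored is an assumption.
            task.evaluation_data["model_id"] = evaluator_model_id
    return tasks


demo_tasks = [DemoTask(), DemoTask()]
returned = set_model_ids(demo_tasks, agent_model_id="gpt-4o")
print(returned is demo_tasks, demo_tasks[0].environment_data["llm"])
```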

Example

>>> tasks = load_tasks("research", limit=5)
>>> configure_model_ids(tasks, agent_model_id="gpt-4o")
>>> tasks[0].environment_data["llm"]
'gpt-4o'

ensure_marble_exists

ensure_marble_exists(auto_download: bool = True) -> Path

Ensure MARBLE is available, optionally downloading it.

This function checks if MARBLE is installed and optionally downloads it if not present.

PARAMETER DESCRIPTION
auto_download

If True, automatically download MARBLE if not found. If False, raise an error if MARBLE is not found.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Path

Path to the MARBLE directory

RAISES DESCRIPTION
FileNotFoundError

If MARBLE is not found and auto_download=False
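
The check-then-download contract described above can be sketched independently of git. In the sketch below, `ensure_dir` and the injected `downloader` callable are hypothetical; the real function delegates to download_marble rather than a callback.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_dir(
    path: Path,
    downloader: Callable[[Path], None],
    auto_download: bool = True,
) -> Path:
    """Return path if it exists; otherwise download it or raise FileNotFoundError."""
    if path.is_dir():
        return path
    if not auto_download:
        raise FileNotFoundError(f"{path} not found and auto_download=False")
    downloader(path)
    return path


def fake_download(target: Path) -> None:
    # Stand-in for a real clone step: just create the directory.
    target.mkdir(parents=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "marble"
    try:
        ensure_dir(target, fake_download, auto_download=False)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
    created = ensure_dir(target, fake_download).is_dir()

print(missing_raised, created)
```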

Example

>>> marble_dir = ensure_marble_exists()
>>> # MARBLE is now available at marble_dir

download_marble

download_marble(
    target_dir: Optional[Path] = None,
    commit: Optional[str] = None,
    force: bool = False,
) -> Path

Clone MARBLE repository to the specified directory.

PARAMETER DESCRIPTION
target_dir

Directory to clone into. Defaults to marble/ relative to this module.

TYPE: Optional[Path] DEFAULT: None

commit

Specific commit hash to checkout. Defaults to MARBLE_DEFAULT_COMMIT or latest.

TYPE: Optional[str] DEFAULT: None

force

If True, remove existing directory and re-clone.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Path

Path to the cloned MARBLE directory

RAISES DESCRIPTION
RuntimeError

If git clone fails

FileExistsError

If directory exists and force=False

get_domain_info

get_domain_info(domain: str) -> Dict[str, Any]

Get information about a domain.

PARAMETER DESCRIPTION
domain

Domain name

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

Dict with domain information including:

  • requires_infrastructure: Whether external services are needed
  • description: Brief domain description
  • coordination_mode: Default coordination mode
RAISES DESCRIPTION
ValueError

If domain is invalid