MACS: Multi-Agent Collaboration Scenarios

The Multi-Agent Collaboration Scenarios (MACS) benchmark evaluates how well multi-agent systems collaborate to solve complex enterprise tasks across multiple domains.

Overview

Multi-Agent Collaboration Scenarios (MACS) is designed to test collaborative problem-solving in realistic enterprise scenarios. The benchmark includes tasks spanning multiple domains such as travel planning, retail, and more. Each task involves multiple agents that must coordinate their actions to achieve user goals.

Check out the BENCHMARKS.md file for more information including licenses.

Quick Start

from maseval.benchmark.macs import (
    MACSBenchmark, MACSEnvironment, MACSEvaluator, MACSGenericTool,
    load_tasks, load_agent_config,
)

# Load data
tasks = load_tasks("travel", limit=5)
agent_config = load_agent_config("travel")

# Create your framework-specific benchmark subclass
class MyMACSBenchmark(MACSBenchmark):
    def setup_agents(self, agent_data, environment, task, user, seed_generator):
        # Your framework-specific agent creation
        ...

    def get_model_adapter(self, model_id, **kwargs):
        # Return a ModelAdapter for tool simulators, the user simulator, and evaluators
        ...

# Run
benchmark = MyMACSBenchmark()
results = benchmark.run(tasks, agent_data=agent_config)

MACSBenchmark

Bases: Benchmark

MACS Benchmark - Framework-agnostic base class.

This base class handles:

  • Environment setup with MACSEnvironment
  • Dual evaluator setup (user-side + system-side)
  • GSR metric aggregation

Users must subclass and implement:

  • setup_agents() for their agent framework
  • get_model_adapter() to provide model adapters

Model IDs for components (tools, user, evaluators) are read from task data:

  • task.environment_data["model_id"] for tool simulators
  • task.user_data["model_id"] for user simulator
  • task.evaluation_data["model_id"] for evaluators

Use configure_model_ids() to set these values after loading tasks:

from maseval.benchmark.macs import load_tasks, configure_model_ids

tasks = load_tasks("travel")
configure_model_ids(
    tasks,
    tool_model_id="gemini-2.5-flash",
    user_model_id="gemini-2.5-flash",
    evaluator_model_id="gemini-2.5-flash",
)
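To illustrate where these IDs end up, the sketch below mimics the effect with a stand-in SketchTask dataclass and a hypothetical sketch_configure_model_ids helper. Both names are assumptions made for illustration. Only the three model_id locations come from the description above, and the real configure_model_ids() may behave differently, for example in how it treats values that are already set.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class SketchTask:
    """Stand-in for a task; only the three data dicts matter here."""

    environment_data: Dict[str, Any] = field(default_factory=dict)
    user_data: Dict[str, Any] = field(default_factory=dict)
    evaluation_data: Dict[str, Any] = field(default_factory=dict)


def sketch_configure_model_ids(
    tasks: Iterable[SketchTask],
    tool_model_id: Optional[str] = None,
    user_model_id: Optional[str] = None,
    evaluator_model_id: Optional[str] = None,
) -> None:
    """Write each provided model ID into the dict its component reads from."""
    for task in tasks:
        if tool_model_id is not None:
            task.environment_data["model_id"] = tool_model_id  # tool simulators
        if user_model_id is not None:
            task.user_data["model_id"] = user_model_id  # user simulator
        if evaluator_model_id is not None:
            task.evaluation_data["model_id"] = evaluator_model_id  # evaluators


# Hypothetical usage: set tool and evaluator models, leave the user model unset.
demo_tasks = [SketchTask(), SketchTask()]
sketch_configure_model_ids(
    demo_tasks,
    tool_model_id="gemini-2.5-flash",
    evaluator_model_id="gemini-2.5-flash",
)
```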

seed_generator property

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.
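The "uniform interface" idea can be sketched as follows. SketchSeedGenerator is a hypothetical stand-in, and both the name argument of derive_seed() and the hash-based derivation are assumptions chosen for illustration, not the library's actual scheme. The point is only that callers can always ask for a seed and receive None when seeding is disabled.

```python
import hashlib
from typing import Optional


class SketchSeedGenerator:
    """Stand-in generator: derives per-component seeds from one root seed."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self.seed = seed

    def derive_seed(self, name: str) -> Optional[int]:
        # With no root seed, seeding is disabled: every derived seed is None.
        if self.seed is None:
            return None
        # Assumed derivation: hash the root seed together with the component name.
        digest = hashlib.sha256(f"{self.seed}:{name}".encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")


unseeded = SketchSeedGenerator()
seeded = SketchSeedGenerator(seed=42)
```

Components can call derive_seed() unconditionally and pass the result on, without checking whether seeding is enabled.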

RETURNS DESCRIPTION
SeedGenerator

The root SeedGenerator instance.

usage property

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component property

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").

__init__

__init__(
    callbacks: Optional[List[Any]] = None,
    n_task_repeats: int = 1,
    max_invocations: int = 5,
    **kwargs: Any,
)

Initialize benchmark.

PARAMETER DESCRIPTION
callbacks

Benchmark callbacks

TYPE: Optional[List[Any]] DEFAULT: None

n_task_repeats

Repetitions per task

TYPE: int DEFAULT: 1

max_invocations

Maximum agent-user interaction rounds (default: 5 per MACS paper)

TYPE: int DEFAULT: 5

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER DESCRIPTION
callback

A BenchmarkCallback instance that will receive execution events.

TYPE: BenchmarkCallback

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark()
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run(tasks=tasks, agent_data=config)

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in the benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

  • metadata - Collection timestamp and thread info
  • agents - Dict mapping agent names to their config (settings, parameters)
  • models - Dict mapping model names to their config (model IDs, parameters)
  • tools - Dict mapping tool names to their config (specifications, settings)
  • simulators - Dict mapping simulator names to their config (parameters, templates)
  • callbacks - Dict mapping callback names to their config (settings)
  • environment - Direct config from the environment (not nested), or None if not present
  • user - Direct config from the user simulator (not nested), or None if not present
  • other - Dict for any other registered components
  • benchmark - Benchmark-level configuration (git, system, packages)
RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run(tasks=tasks, agent_data=config)

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in the benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

  • metadata - Collection timestamp and thread info
  • agents - Dict mapping agent names to their traces (messages, execution data)
  • models - Dict mapping model names to their traces (API calls, timing, errors)
  • tools - Dict mapping tool names to their traces (invocations, parameters)
  • simulators - Dict mapping simulator names to their traces (attempts, outcomes)
  • callbacks - Dict mapping callback names to their traces (custom data)
  • environment - Direct traces from the environment (not nested), or None if not present
  • user - Direct traces from the user simulator (not nested), or None if not present
  • other - Dict for any other registered components
RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run(tasks=tasks, agent_data=config)

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and also accumulates into persistent running totals accessible via usage and usage_by_component.

RETURNS DESCRIPTION
Dict[str, Any]

Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Evaluate using both evaluators and aggregate GSR metrics.

Uses each evaluator's filter_traces() method to extract relevant data, then calls the evaluator with the filtered traces.

Returns results in the format of the AWS paper:

  • user_gsr, system_gsr, overall_gsr, supervisor_gsr
  • user_partial_gsr, system_partial_gsr, overall_partial_gsr
  • report: Combined assertion judgments
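As a rough illustration of the aggregation step, the sketch below merges one user-side and one system-side result into the keys listed above. Everything in it is an assumption: the helper name sketch_combine_gsr, the "passed" field on each judgment, and the rule that the overall scores are computed over the union of both assertion sets. supervisor_gsr is left out because its definition is not given here.

```python
from typing import Any, Dict, List


def sketch_combine_gsr(
    user_result: Dict[str, Any], system_result: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge two evaluator results (each with gsr, partial_gsr, report)."""
    report: List[Dict[str, Any]] = user_result["report"] + system_result["report"]
    total = len(report)
    passed = sum(1 for judgment in report if judgment["passed"])
    return {
        "user_gsr": user_result["gsr"],
        "system_gsr": system_result["gsr"],
        # Assumed rule: overall success requires every assertion on both sides.
        # An empty report scores 0.0 here; that is a choice of this sketch.
        "overall_gsr": 1.0 if total > 0 and passed == total else 0.0,
        "user_partial_gsr": user_result["partial_gsr"],
        "system_partial_gsr": system_result["partial_gsr"],
        "overall_partial_gsr": passed / total if total > 0 else 0.0,
        "report": report,
    }


# Made-up judgments: the user side passes fully, the system side passes half.
demo_user = {"gsr": 1.0, "partial_gsr": 1.0,
             "report": [{"passed": True}, {"passed": True}]}
demo_system = {"gsr": 0.0, "partial_gsr": 0.5,
               "report": [{"passed": True}, {"passed": False}]}
demo_combined = sketch_combine_gsr(demo_user, demo_system)
```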

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute agents with optional user interaction loop.

This method orchestrates the agent-user interaction pattern. When a user is present, the user initiates the conversation using user.get_initial_query(). If no user is present, task.query is used as the initial query.

Interaction Flow

In the base Benchmark class, agents execute once by default (max_invocations=1). MACSBenchmark instead defaults to max_invocations=5, following the MACS paper. To change the number of rounds, pass max_invocations to __init__ or set self.max_invocations in your subclass. The loop continues until max_invocations is reached or user.is_done() returns True (e.g., max turns reached or stop token detected).

Note

Override this method in your benchmark subclass to implement custom interaction patterns (e.g., agent-initiated conversations, different termination conditions, or specialized query routing).

PARAMETER DESCRIPTION
agents

Agents to execute (typically the orchestrator).

TYPE: Sequence[AgentAdapter]

task

The task being solved.

TYPE: Task

environment

The environment providing tools and state.

TYPE: Environment

user

Optional user simulator. If provided, the user initiates and drives the conversation. If None, a single agent execution with task.query.

TYPE: Optional[User]

RETURNS DESCRIPTION
Any

Final answer from the last agent execution.

Example

For interactive benchmarks, enable multi-turn interaction:

def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.max_invocations = 5  # Up to 5 agent-user exchanges
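The interaction flow described for execution_loop() can be sketched with plain Python stand-ins. ScriptedUser and sketch_execution_loop are hypothetical names, the agent is reduced to a callable, and the exact order of the is_done() check relative to respond() is an assumption. This is not the library's implementation.

```python
from typing import Callable, List, Optional


class ScriptedUser:
    """Stand-in user that replies from a fixed script, then is done."""

    def __init__(self, initial_query: str, replies: List[str]) -> None:
        self._initial_query = initial_query
        self._replies = list(replies)

    def get_initial_query(self) -> str:
        return self._initial_query

    def respond(self, message: str) -> str:
        return self._replies.pop(0)

    def is_done(self) -> bool:
        return not self._replies


def sketch_execution_loop(
    run_agents: Callable[[str], str],
    task_query: str,
    user: Optional[ScriptedUser],
    max_invocations: int = 1,
) -> Optional[str]:
    # The user opens the conversation when present; otherwise use the task query.
    query = user.get_initial_query() if user is not None else task_query
    answer = None
    for _ in range(max_invocations):
        answer = run_agents(query)
        # Without a user there is exactly one agent execution.
        if user is None or user.is_done():
            break
        query = user.respond(answer)
    return answer


calls: List[str] = []


def demo_agent(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"


demo_user = ScriptedUser("book a flight", ["to Paris", "thanks"])
final_answer = sketch_execution_loop(demo_agent, "unused", demo_user, max_invocations=5)
```

In this run the agent is invoked three times, once for the initial query and once per scripted reply, and the loop stops early because the user is done before the limit of five is reached.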

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION
status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples:

  • TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution
  • TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed
  • [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS DESCRIPTION
SequentialTaskQueue

SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES DESCRIPTION
RuntimeError

If reports is None and run() has not been executed yet.

How to use
# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry only the failed tasks (the key use case)
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed, agent_data=config)

# Or more concisely
reports = benchmark.run(tasks=tasks, agent_data=config)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks(), agent_data=config)
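The filtering logic behind status_filter can be approximated as below. SketchStatus is a stand-in enum that reuses the member names mentioned above with made-up values, and sketch_failed_task_ids is a hypothetical helper that returns task IDs rather than a SequentialTaskQueue. How the real method treats a task whose repetitions have mixed outcomes is not specified here; this sketch counts a task as failed if any repetition matches.

```python
from enum import Enum
from typing import Any, Dict, List, Sequence, Union


class SketchStatus(Enum):
    SUCCESS = "success"
    SETUP_FAILED = "setup_failed"
    TASK_EXECUTION_FAILED = "task_execution_failed"
    EVALUATION_FAILED = "evaluation_failed"


def sketch_failed_task_ids(
    reports: Sequence[Dict[str, Any]],
    status_filter: Union[None, SketchStatus, Sequence[SketchStatus]] = None,
) -> List[str]:
    # Normalize the filter: None means "any status except SUCCESS".
    if status_filter is None:
        wanted = None
    elif isinstance(status_filter, SketchStatus):
        wanted = {status_filter}
    else:
        wanted = set(status_filter)

    failed: List[str] = []
    for report in reports:
        status = report["status"]
        if status is SketchStatus.SUCCESS:
            continue
        if wanted is not None and status not in wanted:
            continue
        if report["task_id"] not in failed:  # one entry per task, not per repeat
            failed.append(report["task_id"])
    return failed


demo_reports = [
    {"task_id": "t1", "status": SketchStatus.SUCCESS},
    {"task_id": "t2", "status": SketchStatus.SETUP_FAILED},
    {"task_id": "t3", "status": SketchStatus.TASK_EXECUTION_FAILED},
    {"task_id": "t3", "status": SketchStatus.TASK_EXECUTION_FAILED},
    {"task_id": "t4", "status": SketchStatus.EVALUATION_FAILED},
]
```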

get_model_adapter abstractmethod

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Provide a ModelAdapter for benchmark components that require LLM access.

Many benchmark components beyond the agents themselves require access to language models. Common examples include:

  • Tool simulators: Simulating tool responses when real APIs aren't available
  • User simulators: Generating realistic user responses in multi-turn dialogues
  • Judges/Evaluators: Using LLMs to assess agent performance against criteria
  • Reward models: Computing scores for reinforcement learning

This method centralizes model provisioning, giving you control over which models are used throughout the benchmark. Implement this to return a configured ModelAdapter for the requested model.

PARAMETER DESCRIPTION
model_id

The model identifier to use (e.g., "gemini-2.5-flash", "openrouter/google/gemini-2.5-flash", "gpt-4o"). This is passed by the benchmark when setting up components that need model access.

TYPE: str

**kwargs

Additional arguments for adapter creation or registration. Common kwargs:

  • register_category: Category for trace registration (e.g., "models")
  • register_name: Name for trace registration (e.g., "evaluator_user_gsr")

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
ModelAdapter

A ModelAdapter instance configured for the specified model. For proper tracing, return a fresh adapter for each call rather than reusing instances. You can still share the underlying API client for efficiency.

How to use

For proper tracing, register the adapter after creation using the kwargs:

def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
    adapter = GoogleGenAIModelAdapter(self.client, model_id=model_id)

    # Register for tracing; defaults apply when no registration info is provided
    category = kwargs.get("register_category", "models")
    name = kwargs.get("register_name", model_id)
    self.register(category, name, adapter)

    return adapter

The benchmark calls this method when setting up tools, user simulators, and evaluators. Each call creates a fresh adapter with its own trace log.
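The "fresh adapter, shared client" pattern can be demonstrated without any real API. All classes below (SketchClient, SketchAdapter, SketchBenchmark) are hypothetical stand-ins, and the "category:name" key format is borrowed from the registry key example given for usage_by_component. The sketch only shows that each call yields a separate adapter with its own log while the client object is reused.

```python
from typing import Any, Dict, List


class SketchClient:
    """Stand-in for an API client that is expensive to create."""

    def complete(self, model_id: str, prompt: str) -> str:
        return f"[{model_id}] {prompt}"


class SketchAdapter:
    """Stand-in adapter that keeps its own per-instance call log."""

    def __init__(self, client: SketchClient, model_id: str) -> None:
        self.client = client
        self.model_id = model_id
        self.logs: List[Dict[str, str]] = []

    def generate(self, prompt: str) -> str:
        response = self.client.complete(self.model_id, prompt)
        self.logs.append({"prompt": prompt, "response": response})
        return response


class SketchBenchmark:
    def __init__(self) -> None:
        self.client = SketchClient()  # created once, shared by all adapters
        self.registry: Dict[str, Any] = {}

    def register(self, category: str, name: str, component: Any) -> Any:
        self.registry[f"{category}:{name}"] = component
        return component

    def get_model_adapter(self, model_id: str, **kwargs: Any) -> SketchAdapter:
        adapter = SketchAdapter(self.client, model_id)  # fresh adapter per call
        category = kwargs.get("register_category", "models")
        name = kwargs.get("register_name", model_id)
        return self.register(category, name, adapter)


bench = SketchBenchmark()
user_eval = bench.get_model_adapter("gemini-2.5-flash", register_name="evaluator_user_gsr")
system_eval = bench.get_model_adapter("gemini-2.5-flash", register_name="evaluator_system_gsr")
user_eval.generate("Did the agent confirm the booking?")
```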

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER DESCRIPTION
category

Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs().

TYPE: str

name

Unique identifier for this component within its category

TYPE: str

component

Any object inheriting from TraceableMixin and/or ConfigurableMixin

TYPE: RegisterableComponent

RETURNS DESCRIPTION
RegisterableComponent

The component (for chaining convenience)

RAISES DESCRIPTION
ValueError

If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user, seed_generator):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.
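The registration contract (return the component for chaining, raise ValueError when the same component is registered under a different name) can be sketched as follows. SketchRegistry is a hypothetical stand-in, and tracking components by object identity is an assumption about how such a check might be done.

```python
from typing import Any, Dict


class SketchRegistry:
    """Stand-in registry keyed by "category:name"."""

    def __init__(self) -> None:
        self._by_key: Dict[str, Any] = {}
        self._key_by_id: Dict[int, str] = {}

    def register(self, category: str, name: str, component: Any) -> Any:
        key = f"{category}:{name}"
        existing = self._key_by_id.get(id(component))
        if existing is not None and existing != key:
            raise ValueError(f"Component already registered as {existing!r}")
        self._by_key[key] = component
        self._key_by_id[id(component)] = key
        return component  # returned so calls can be chained

    def clear(self) -> None:
        # Mirrors clear_registry(): start empty for the next repetition.
        self._by_key.clear()
        self._key_by_id.clear()

    def keys(self):
        return list(self._by_key)


registry = SketchRegistry()
model = object()
returned = registry.register("models", "main_model", model)
```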

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION
tasks

Task source for execution. Can be:

  • A single Task object
  • A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue)
  • An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of report dictionaries, one per task repetition. Each report contains:

  • task_id: Task identifier (UUID)
  • repeat_idx: Repetition index (0 to n_task_repeats-1)
  • status: Execution status (one of TaskExecutionStatus enum values)
  • traces: Execution traces from all registered components
  • config: Configuration from all registered components and benchmark level
  • eval: Evaluation results (None if task or evaluation failed)
  • error: Error details dict (only present if status is not SUCCESS), containing:
      • error_type: Exception class name
      • error_message: Exception message
      • traceback: Full traceback string

RAISES DESCRIPTION
ValueError

If agent_data length doesn't match number of tasks (when agent_data is an iterable).

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

  1. Calls your setup methods to initialize components
  2. Calls execution_loop(), which runs your agents via run_agents(), to execute the task
  3. Collects traces and configs, then calls the evaluators
  4. Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task)
        user = setup_user(agent_data, environment, task)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
        evaluators = setup_evaluators(environment, task, agents_to_run, user)

        # Run stage (execution_loop handles multi-turn if user exists)
        agents_output = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_all_traces()
        eval_results = evaluate(evaluators, agents_dict, agents_output, traces)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

  • on_run_start: Before processing any tasks
  • on_task_start: Before processing a task (once per task, not per repeat)
  • on_task_repeat_start: Before each repetition of a task
  • on_task_repeat_end: After each repetition completes
  • on_task_end: After all repetitions of a task complete
  • on_run_end: After all tasks complete
# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)  # queue receives on_task_complete callbacks
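The two accepted shapes of agent_data, and the ValueError raised on a length mismatch, can be sketched with a small hypothetical helper. sketch_expand_agent_data is an illustrative name, not part of the library.

```python
from typing import Any, Dict, Iterable, List, Union


def sketch_expand_agent_data(
    agent_data: Union[Dict[str, Any], Iterable[Dict[str, Any]]],
    n_tasks: int,
) -> List[Dict[str, Any]]:
    """Return one agent config per task."""
    if isinstance(agent_data, dict):
        # A single dict applies to every task (the same object is reused).
        return [agent_data for _ in range(n_tasks)]
    configs = list(agent_data)
    if len(configs) != n_tasks:
        raise ValueError(f"Got {len(configs)} agent configs for {n_tasks} tasks")
    return configs


shared = sketch_expand_agent_data({"model": "gpt-4"}, n_tasks=3)
per_task = sketch_expand_agent_data(
    [
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ],
    n_tasks=2,
)
```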

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: MACSEnvironment,
    query: str = "",
) -> Any

Execute agents and return final answer.

setup_agents abstractmethod

setup_agents(
    agent_data: Dict[str, Any],
    environment: MACSEnvironment,
    task: Task,
    user: Optional[User],
    seed_generator: DefaultSeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Create agents for this task. Must be implemented by subclass.

PARAMETER DESCRIPTION
agent_data

Agent configuration with hierarchy spec

TYPE: Dict[str, Any]

environment

MACSEnvironment with tools

TYPE: MACSEnvironment

task

Current task

TYPE: Task

user

Optional user simulator

TYPE: Optional[User]

seed_generator

Seed generator for deriving agent seeds

TYPE: DefaultSeedGenerator

RETURNS DESCRIPTION
Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Tuple of (ordered agent list, agent dict keyed by ID)

setup_environment

setup_environment(
    agent_data: Dict[str, Any], task: Task, seed_generator
) -> MACSEnvironment

Create environment for a task.

Uses get_model_adapter() to create separate model adapters for each tool, enabling independent tracing per tool.

Model ID is read from task.environment_data["model_id"].

setup_evaluators

setup_evaluators(
    environment: MACSEnvironment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator,
) -> Sequence[Evaluator]

Create user-side and system-side evaluators.

Each evaluator gets its own model adapter for separate tracing. Model ID is read from task.evaluation_data["model_id"].

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: MACSEnvironment,
    task: Task,
    seed_generator: DefaultSeedGenerator,
) -> MACSUser

Create MACS user simulator.

Creates a MACSUser with scenario and query from the task. The user profile is automatically extracted from the scenario text. Model ID is read from task.user_data["model_id"].

Note: MACSUser.get_tool() raises NotImplementedError. Framework-specific subclasses in examples should wrap this user or override setup_user() to return a user with get_tool() implemented.

PARAMETER DESCRIPTION
agent_data

Agent configuration

TYPE: Dict[str, Any]

environment

The task environment

TYPE: MACSEnvironment

task

Current task with scenario and user profile

TYPE: Task

RETURNS DESCRIPTION
MACSUser

MACSUser instance

MACSUser

Bases: LLMUser

MACS-specific user simulator with conversation limits.

Extends the LLMUser class with MACS-specific behavior:

  • Maximum 5 turns of interaction (as per MACS paper)
  • Stop-token detection for natural conversation ending
  • User profile and scenario-aware responses
  • LLM-based satisfaction evaluation

The simulator maintains a conversation history and uses an LLM to generate responses that are consistent with the user's profile and scenario.

Note: This is a base class. Framework-specific subclasses should override get_tool() to return a compatible tool (e.g., SmolAgentUserSimulationInputTool).

termination_reason property

termination_reason: TerminationReason

Get the reason why the user interaction terminated.

RETURNS DESCRIPTION
TerminationReason

Why is_done() returns True, or NOT_TERMINATED if still ongoing.

__init__

__init__(
    model: ModelAdapter,
    scenario: str,
    initial_query: str,
    name: str = "Simulated User",
    template: Optional[str] = None,
    max_turns: int = DEFAULT_MAX_TURNS,
    stop_tokens: Optional[List[str]] = None,
    early_stopping_condition: str = DEFAULT_EARLY_STOPPING_CONDITION,
    exhausted_response: Optional[str] = None,
)

Initialize MACS user simulator.

PARAMETER DESCRIPTION
model

ModelAdapter for LLM-based response generation

TYPE: ModelAdapter

scenario

Full scenario text (contains goals and user background)

TYPE: str

initial_query

The initial query to the agent

TYPE: str

name

User name for identification (default: "Simulated User")

TYPE: str DEFAULT: 'Simulated User'

template

Optional custom prompt template (if None, the base UserLLMSimulator template is used)

TYPE: Optional[str] DEFAULT: None

max_turns

Maximum conversation turns (default: 5, per MACS paper)

TYPE: int DEFAULT: DEFAULT_MAX_TURNS

stop_tokens

Tokens indicating user satisfaction (default: [""])

TYPE: Optional[List[str]] DEFAULT: None

early_stopping_condition

Description of when to emit stop token (default: "ALL goals have been satisfactorily addressed by the assistant")

TYPE: str DEFAULT: DEFAULT_EARLY_STOPPING_CONDITION

exhausted_response

Message to return when respond() is called after the user is done. If None (default), raises UserExhaustedError instead.

TYPE: Optional[str] DEFAULT: None

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this user.

Output fields:

  • name - User identifier
  • profile - User profile data
  • scenario - Task scenario description
  • max_turns - Maximum interaction turns
  • stop_tokens - Early stopping tokens (empty list if disabled)
  • exhausted_response - Message returned when user is done, or None
RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing user configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather traces with MACS-specific information.

get_initial_query

get_initial_query() -> str

Get the initial query for the conversation.

If an initial_query was provided at construction, returns it. Otherwise, generates one using the LLM simulator based on the user's profile and scenario.

This method:

  • Returns the existing initial query if one was provided
  • Or calls the LLM simulator to generate one
  • Ensures the query is in the message history
  • Counts the initial query as the first turn

RETURNS DESCRIPTION
str

The initial query (either pre-set or LLM-generated).

RAISES DESCRIPTION
RuntimeError

If called after conversation has progressed beyond the initial message.

get_tool

get_tool() -> Any

Return a tool that agents can invoke to interact with this user.

Subclasses must override this to wrap the user interaction logic in a tool object compatible with their agentic framework.

RETURNS DESCRIPTION
Any

A tool instance for agent-user interaction.

RAISES DESCRIPTION
NotImplementedError

Always; must be implemented by subclass.

increment_turn

increment_turn() -> None

Increment the turn counter.

Call this after recording a user response in the message history.

is_done

is_done() -> bool

Check if the user interaction should end.

Checks:

  1. If max_turns has been reached
  2. If the user previously indicated termination (via stop_token)

Subclasses can override to add custom termination logic (e.g., LLM-based satisfaction checks) by calling super().is_done() first.

RETURNS DESCRIPTION
bool

True if the user is done interacting, False to continue.

reset

reset() -> None

Reset the conversation state for a new interaction.

respond

respond(message: str) -> str

Respond to a message from the agent using LLM simulation.

This method appends the agent's message to the conversation history, generates a response using the LLM simulator, appends the response to the history, and returns it.

If a stop_token is detected in the response, triggers early stopping.

PARAMETER DESCRIPTION
message

The message from the agent to which the user should respond.

TYPE: str

RETURNS DESCRIPTION
str

The user's response, or exhausted_response if done and configured.

RAISES DESCRIPTION
UserExhaustedError

If the user is already done and no exhausted_response is configured.
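The interplay of respond(), is_done(), termination_reason, and exhausted_response can be sketched with a stand-in user. All names below are hypothetical, including the "<END>" stop token and every enum member other than NOT_TERMINATED. A plain callable replaces the LLM, and the sketch counts only replies as turns, whereas the real class also counts the initial query.

```python
from enum import Enum
from typing import Callable, List, Optional


class SketchTermination(Enum):
    NOT_TERMINATED = "not_terminated"
    MAX_TURNS = "max_turns"
    STOP_TOKEN = "stop_token"


class SketchUserExhausted(Exception):
    """Raised when respond() is called on a finished user."""


class SketchUser:
    def __init__(
        self,
        reply_fn: Callable[[str], str],
        max_turns: int = 5,
        stop_tokens: Optional[List[str]] = None,
        exhausted_response: Optional[str] = None,
    ) -> None:
        self.reply_fn = reply_fn
        self.max_turns = max_turns
        self.stop_tokens = stop_tokens or []
        self.exhausted_response = exhausted_response
        self.turns = 0
        self.stopped = False

    @property
    def termination_reason(self) -> SketchTermination:
        if self.stopped:
            return SketchTermination.STOP_TOKEN
        if self.turns >= self.max_turns:
            return SketchTermination.MAX_TURNS
        return SketchTermination.NOT_TERMINATED

    def is_done(self) -> bool:
        return self.termination_reason is not SketchTermination.NOT_TERMINATED

    def respond(self, message: str) -> str:
        if self.is_done():
            if self.exhausted_response is None:
                raise SketchUserExhausted("user is done")
            return self.exhausted_response
        reply = self.reply_fn(message)
        self.turns += 1
        # A stop token in the reply triggers early stopping.
        if any(token in reply for token in self.stop_tokens):
            self.stopped = True
        return reply


satisfied = SketchUser(lambda m: "All good <END>", stop_tokens=["<END>"],
                       exhausted_response="(no further input)")
first_reply = satisfied.respond("Your flight is booked.")
```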

MACSEnvironment

Bases: Environment

Unified environment for all MACS domains.

Creates MACSGenericTool instances from task's environment_data. Tools are stored in a dict keyed by name for efficient lookup.

__init__

__init__(
    task_data: Dict[str, Any],
    model_factory: Callable[[str], ModelAdapter],
    callbacks: Optional[List[Any]] = None,
)

Initialize environment.

PARAMETER DESCRIPTION
task_data

Task data containing environment_data with tool specs

TYPE: Dict[str, Any]

model_factory

Factory function that creates a ModelAdapter for a given model_name

TYPE: Callable[[str], ModelAdapter]

callbacks

Optional callbacks

TYPE: Optional[List[Any]] DEFAULT: None

create_tools

create_tools() -> Dict[str, MACSGenericTool]

Create tools from task specifications.

Each tool gets its own ModelAdapter instance for separate tracing.

RETURNS DESCRIPTION
Dict[str, MACSGenericTool]

Dict mapping tool names to MACSGenericTool instances

gather_config

gather_config() -> dict[str, Any]

Gather configuration from this environment.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • tool_count - Number of tools
  • tool_names - List of tool names
RETURNS DESCRIPTION
dict[str, Any]

Dictionary containing environment configuration.

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this environment and its tools.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • tool_count - Number of tools in environment
  • tools - Dictionary of tool traces keyed by tool name

RETURNS DESCRIPTION
dict[str, Any]

Dictionary containing environment execution traces.

get_tool

get_tool(name: str) -> Optional[Any]

Get a tool by name.

PARAMETER DESCRIPTION
name

Tool name

TYPE: str

RETURNS DESCRIPTION
Optional[Any]

The tool, or None if not found

get_tools

get_tools() -> Dict[str, Any]

Get all tools as a dict.

get_tools_for_agent

get_tools_for_agent(
    agent_spec: Dict[str, Any],
) -> Dict[str, MACSGenericTool]

Get tools for a specific agent based on its configuration.

PARAMETER DESCRIPTION
agent_spec

Agent specification dict with 'tools' key containing tool group names

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, MACSGenericTool]

Dict of MACSGenericTool instances assigned to this agent, keyed by name
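
One way to picture the selection by tool group is the sketch below. It assumes a mapping from group name to tool names (`tool_groups`), which is a hypothetical structure; the reference does not say how groups are stored. The function name, group names, and tool names are likewise illustrative only.

```python
from typing import Any, Dict, List


def tools_for_agent_sketch(
    agent_spec: Dict[str, Any],
    tool_groups: Dict[str, List[str]],  # assumed: group name -> tool names
    tools: Dict[str, Any],
) -> Dict[str, Any]:
    selected: Dict[str, Any] = {}
    for group in agent_spec.get("tools", []):
        for name in tool_groups.get(group, []):
            if name in tools:
                selected[name] = tools[name]
    return selected


all_tools = {
    "search_flights": "flight tool",
    "book_hotel": "hotel tool",
    "get_weather": "weather tool",
}
tool_groups = {
    "travel_booking": ["search_flights", "book_hotel"],
    "weather": ["get_weather"],
}
agent_tools = tools_for_agent_sketch(
    {"name": "booking_agent", "tools": ["travel_booking"]},
    tool_groups,
    all_tools,
)
```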

setup_state

setup_state(task_data: Dict[str, Any]) -> Dict[str, Any]

Initialize state from task data.

MACSEvaluator

Bases: Evaluator

LLM-based assertion evaluator for GSR metrics.

Follows the AWS paper methodology for Goal Success Rate (GSR) evaluation:

  • user - Evaluates user-observable behaviors (conversation only)
  • system - Evaluates internal behaviors (tool calls, agent actions)

__call__

__call__(
    traces: Dict[str, Any],
    final_answer: Optional[str] = None,
) -> Dict[str, Any]

Evaluate the trace against assertions.

PARAMETER DESCRIPTION
traces

Filtered traces dict containing 'messages' and optionally 'tool_traces'

TYPE: Dict[str, Any]

final_answer

Final answer from agents (unused in MACS evaluation)

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Any]

Dict with: gsr, partial_gsr, report (list of assertion judgments)
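
The relationship between the three returned fields can be sketched as follows, assuming gsr is 1.0 only when every assertion is judged true and partial_gsr is the fraction of assertions judged true. The `passed` key, the function name, and the handling of an empty report are assumptions made for this illustration, not details confirmed by the reference.

```python
from typing import Any, Dict, List


def summarize_report(report: List[Dict[str, Any]]) -> Dict[str, Any]:
    if not report:
        # Assumption: with no assertions there is nothing to fail.
        return {"gsr": 1.0, "partial_gsr": 1.0, "report": []}
    passed = sum(1 for judgment in report if judgment["passed"])
    return {
        "gsr": 1.0 if passed == len(report) else 0.0,  # all-or-nothing
        "partial_gsr": passed / len(report),  # fraction of assertions met
        "report": report,
    }


result = summarize_report(
    [
        {"assertion": "Agent confirms the booking", "passed": True},
        {"assertion": "Agent states the total price", "passed": True},
        {"assertion": "Agent offers a hotel option", "passed": False},
    ]
)
```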

__init__

__init__(
    model: ModelAdapter,
    task: Task,
    gsr_type: Literal["user", "system"] = "user",
    template: Optional[str] = None,
)

Initialize the evaluator.

PARAMETER DESCRIPTION
model

ModelAdapter for LLM evaluation

TYPE: ModelAdapter

task

Task being evaluated (contains assertions)

TYPE: Task

gsr_type

Either "user" or "system"

TYPE: Literal['user', 'system'] DEFAULT: 'user'

template

Optional custom prompt template (uses default if None)

TYPE: Optional[str] DEFAULT: None

filter_traces

filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]

Filter traces based on gsr_type.

For user evaluation: uses the user trace, which by construction contains the user-observable conversation (what the user sees: queries, agent questions, user answers, and final answers).

For system evaluation: uses the full traces, including all agent messages and tool invocations (internal behaviors not visible to users).

PARAMETER DESCRIPTION
traces

Full execution traces dict containing 'agents', 'tools', 'user', etc.

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, Any]

Filtered dict with 'messages' and optionally 'tool_traces'
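
The two filtering modes can be sketched like this. The nested layout used here (`traces["user"]["messages"]`, per-agent `"messages"` lists, and a `"tools"` dict) is an assumption for illustration; the reference only names the top-level keys 'agents', 'tools', and 'user' and the output keys 'messages' and 'tool_traces'.

```python
from typing import Any, Dict, List


def filter_traces_sketch(traces: Dict[str, Any], gsr_type: str) -> Dict[str, Any]:
    if gsr_type == "user":
        # Only the conversation the user could observe.
        return {"messages": list(traces.get("user", {}).get("messages", []))}
    # System view: all agent messages plus tool invocations.
    messages: List[Any] = []
    for agent_trace in traces.get("agents", {}).values():
        messages.extend(agent_trace.get("messages", []))
    return {"messages": messages, "tool_traces": dict(traces.get("tools", {}))}


example_traces = {
    "user": {"messages": ["Book me a flight", "Booked."]},
    "agents": {
        "supervisor": {"messages": ["delegate to booking agent"]},
        "booking": {"messages": ["call search_flights"]},
    },
    "tools": {"search_flights": {"invocations": 1}},
}
user_view = filter_traces_sketch(example_traces, "user")
system_view = filter_traces_sketch(example_traces, "system")
```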

MACSGenericTool

Bases: TraceableMixin, ConfigurableMixin

Framework-agnostic tool with LLM-based response simulation.

This tool does not inherit from any framework-specific Tool class. Users wrap it for their framework using composition. Example for smolagents:

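import smolagents

from maseval.benchmark.macs import MACSGenericTool


class MySmolagentsTool(smolagents.Tool):
    # forward() takes **kwargs, so skip smolagents' signature check.
    skip_forward_signature_validation = True

    def __init__(self, generic_tool: MACSGenericTool):
        # Copy the metadata smolagents expects before calling super().__init__().
        self.generic_tool = generic_tool
        self.name = generic_tool.name
        self.description = generic_tool.description
        self.inputs = generic_tool.inputs
        self.output_type = "string"
        super().__init__()

    def forward(self, **kwargs) -> str:
        # Delegate execution to the wrapped framework-agnostic tool.
        return self.generic_tool(**kwargs)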

Error Classification

  • AgentError: Raised when the agent provides invalid arguments (wrong types, missing required args, constraint violations). This is the agent's fault.
  • EnvironmentError: Raised when the tool infrastructure fails after input validation (LLM simulator fails, internal error). This is not the agent's fault.

__call__

__call__(**kwargs: Any) -> str

Execute the tool with simulated response.

PARAMETER DESCRIPTION
**kwargs

Tool arguments provided by the agent.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
str

Simulated tool response string.

RAISES DESCRIPTION
AgentError

If the agent provides invalid arguments (wrong types, missing required args). This is the agent's fault.

EnvironmentError

If tool infrastructure fails after validation (LLM simulator fails, internal error). Not the agent's fault.
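
The split between the two error classes can be sketched as "validate first, then simulate". Everything in this block is a hypothetical stand-in: the `Sketch*` exception classes mirror the documented names, the `required` key assumes a JSON-Schema-like input_schema, and `simulate` represents the LLM simulator. The real validation rules (types, constraints) are not reproduced here.

```python
from typing import Any, Callable, Dict


class SketchAgentError(Exception):
    """Stand-in for AgentError: the agent sent bad arguments."""


class SketchEnvironmentError(Exception):
    """Stand-in for EnvironmentError: the tool infrastructure failed."""


def call_tool_sketch(
    input_schema: Dict[str, Any],
    simulate: Callable[[Dict[str, Any]], str],
    **kwargs: Any,
) -> str:
    # Step 1: input validation. Failures here are attributed to the agent.
    missing = [key for key in input_schema.get("required", []) if key not in kwargs]
    if missing:
        raise SketchAgentError(f"missing required arguments: {missing}")
    # Step 2: simulation. Failures after validation are attributed to the environment.
    try:
        return simulate(kwargs)
    except Exception as exc:
        raise SketchEnvironmentError("simulator failed") from exc


schema = {"required": ["city"]}  # assumed schema shape


def ok(args: Dict[str, Any]) -> str:
    return f"flights to {args['city']}"


def broken(args: Dict[str, Any]) -> str:
    raise RuntimeError("simulator down")


def outcome(simulate: Callable[[Dict[str, Any]], str], **kwargs: Any) -> str:
    # Return the tool output, or the name of the error class that was raised.
    try:
        return call_tool_sketch(schema, simulate, **kwargs)
    except (SketchAgentError, SketchEnvironmentError) as exc:
        return type(exc).__name__
```

Keeping the two failure modes in separate exception types lets a benchmark avoid counting infrastructure failures against the agent under test.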

__init__

__init__(spec: Dict[str, Any], model: ModelAdapter)

Initialize tool from specification.

PARAMETER DESCRIPTION
spec

Tool specification with 'name', 'description', 'input_schema'

TYPE: Dict[str, Any]

model

ModelAdapter for LLM-based response simulation

TYPE: ModelAdapter

gather_config

gather_config() -> Dict[str, Any]

Gather configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces.