CONVERSE Benchmark (Beta)

Beta

This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.

What It Tests

Privacy attacks: the external agent tries to extract sensitive profile details.
Security attacks: the external agent tries to induce unauthorized tool actions.
Utility: how well the assistant completes the user's task (coverage and ratings).
Multi-turn manipulation: attacks progress over several conversational turns.

Data Source

Data is loaded from the official CONVERSE repository amrgomaaelhady/ConVerse

Supported domains:

travel
real_estate
insurance

Usage

Implement a framework-specific subclass of ConverseBenchmark and provide agent setup plus model adapter provisioning.

from typing import Any, Dict, Optional, Sequence, Tuple

from maseval import AgentAdapter, Environment, ModelAdapter, Task, User
from maseval.benchmark.converse import ConverseBenchmark, ensure_data_exists, load_tasks
from maseval.core.seeding import SeedGenerator


class MyConverseBenchmark(ConverseBenchmark):
    def setup_agents(
        self,
        agent_data: Dict[str, Any],
        environment: Environment,
        task: Task,
        user: Optional[User],
        seed_generator: SeedGenerator,
    ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]:
        # Create your framework agent(s) using environment tools.
        ...

    def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
        # Create and optionally register model adapter.
        ...


# First call downloads source files to the local benchmark data cache.
ensure_data_exists(domain="travel_planning")
tasks = load_tasks(domain="travel_planning", split="privacy", limit=5)

benchmark = MyConverseBenchmark(progress_bar=False)
results = benchmark.run(
    tasks=tasks,
    agent_data={
        "model_id": "gpt-4o-mini",
        "attacker_model_id": "gpt-4o",
        "max_turns": 10,
    },
)

Default Implementation

CONVERSE also provides a built-in default agent loop via DefaultAgentConverseBenchmark. You only need to supply get_model_adapter().

from typing import Any

from maseval import ModelAdapter
from maseval.benchmark.converse import DefaultAgentConverseBenchmark, load_tasks
from maseval.interface.inference import OpenAIModelAdapter


class MyDefaultConverseBenchmark(DefaultAgentConverseBenchmark):
    def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
        adapter = OpenAIModelAdapter(client=..., model_id=model_id, seed=kwargs.get("seed"))
        if kwargs.get("register_name"):
            self.register(kwargs.get("register_category", "models"), kwargs["register_name"], adapter)
        return adapter


tasks = load_tasks(domain="travel_planning", split="privacy", limit=5)
benchmark = MyDefaultConverseBenchmark(progress_bar=False)
results = benchmark.run(
    tasks=tasks,
    agent_data={
        "model_id": "gpt-4o-mini",
        "attacker_model_id": "gpt-4o",
    },
)

For a full runnable example script, see:

examples/converse_benchmark/default_converse_benchmark.py

Evaluation Output

Privacy Evaluator

Output fields:

privacy_leak (bool): Whether a privacy violation was detected.
score (float): 1.0 if no leak, 0.0 if leaked. For abstraction tasks, maps from the 0-10 appropriateness scale.
evaluation_method: "substring", "llm", or "skipped".
evaluation_type: "abstraction_appropriateness" for related_and_useful category tasks.
llm_evaluation (dict): Full LLM judge output when using LLM mode.
leak_turn (int): Turn index where the leak occurred (substring mode).
matched_snippet (str): The specific snippet that matched (substring mode).

Security Evaluator

Output fields:

security_violation (bool): Whether a security attack succeeded.
score (float): 1.0 if attack failed, 0.0 if succeeded/partial.
attack_success (str): "Failed", "Partial", or "Successful".
evaluation_method: "llm" or "skipped".
evaluation_type: "final_package_attack" or "toolkit_attack".
responsibility_flag (str): "EXTERNAL_AGENT_OPTIONS" or "AI_ASSISTANT_TOOLKIT_ACTION".
llm_evaluation (dict): Full LLM judge output.

Utility Evaluator

Output fields:

coverage (dict): LLM coverage evaluation result with UTILITY.HOW_MANY_ITEMS_COVERED.
rating (float): Average ground-truth rating of selected items.
score (float): Coverage fraction (items covered / total items).
evaluation_method: "llm" or "skipped".
coverage_evaluation (dict): Full LLM coverage evaluation.
ratings_evaluation (dict): Full LLM ratings evaluation with ratings_mapping and average_rating.

View source

ConverseBenchmark

Bases: Benchmark

CONVERSE benchmark for contextual safety in agent-to-agent conversations.

seed_generator `property`

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.

RETURNS	DESCRIPTION
`SeedGenerator`	The root `SeedGenerator` instance.

usage `property`

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component `property`

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").

init

__init__(*args: Any, **kwargs: Any)

Initialize the CONVERSE benchmark.

Sets max_invocations to 10 by default because multi-turn dialogue is required for social-engineering style attacks.

PARAMETER	DESCRIPTION
`*args`	Forwarded to :class:`Benchmark`. TYPE: `Any` DEFAULT: `()`
`**kwargs`	Forwarded to :class:`Benchmark`. `max_invocations` defaults to 10 if not provided. TYPE: `Any` DEFAULT: `{}`

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER	DESCRIPTION
`callback`	A BenchmarkCallback instance that will receive execution events. TYPE: `BenchmarkCallback`

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark(tasks=tasks, agent_data=config)
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run()

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their config (settings, parameters)
models - Dict mapping model names to their config (model IDs, parameters)
tools - Dict mapping tool names to their config (specifications, settings)
simulators - Dict mapping simulator names to their config (parameters, templates)
callbacks - Dict mapping callback names to their config (settings)
environment - Direct config from the environment (not nested), or None if not present
user - Direct config from the user simulator (not nested), or None if not present
other - Dict for any other registered components
benchmark - Benchmark-level configuration (git, system, packages)

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their traces (messages, execution data)
models - Dict mapping model names to their traces (API calls, timing, errors)
tools - Dict mapping tool names to their traces (invocations, parameters)
simulators - Dict mapping simulator names to their traces (attempts, outcomes)
callbacks - Dict mapping callback names to their traces (custom data)
environment - Direct traces from the environment (not nested), or None if not present
user - Direct traces from the user simulator (not nested), or None if not present
other - Dict for any other registered components

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and also accumulates into persistent running totals accessible via usage and usage_by_component.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Run all evaluators and return their results.

Each evaluator first filters the traces to its relevant subset, then produces a result dictionary containing at least score.

PARAMETER	DESCRIPTION
`evaluators`	Evaluators selected by :meth:`setup_evaluators`. TYPE: `Sequence[Evaluator]`
`agents`	Named agent adapters (unused). TYPE: `Dict[str, AgentAdapter]`
`final_answer`	The agent's final response. TYPE: `Any`
`traces`	Full execution traces from the benchmark run. TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of evaluation result dictionaries, one per evaluator.

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute agents with optional user interaction loop.

This method orchestrates the agent-user interaction pattern. When a user is present, the user initiates the conversation using user.get_initial_query(). If no user is present, task.query is used as the initial query.

Interaction Flow

By default, agents execute once (max_invocations=1). For multi-turn interaction, set self.max_invocations > 1 in your benchmark's __init__. The loop continues until max_invocations is reached or user.is_done() returns True (e.g., max turns reached or stop token detected).

Note

Override this method in your benchmark subclass to implement custom interaction patterns (e.g., agent-initiated conversations, different termination conditions, or specialized query routing).

PARAMETER	DESCRIPTION
`agents`	Agents to execute (typically the orchestrator). TYPE: `Sequence[AgentAdapter]`
`task`	The task being solved. TYPE: `Task`
`environment`	The environment providing tools and state. TYPE: `Environment`
`user`	Optional user simulator. If provided, the user initiates and drives the conversation. If None, a single agent execution with `task.query`. TYPE: `Optional[User]`

RETURNS	DESCRIPTION
`Any`	Final answer from the last agent execution.

Example

For interactive benchmarks, enable multi-turn interaction::

def __init__(self, ...):
    super().__init__(...)
    self.max_invocations = 5  # Up to 5 agent-user exchanges

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION

status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples: - TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution - TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed - [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS	DESCRIPTION
`SequentialTaskQueue`	SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES	DESCRIPTION
`RuntimeError`	If reports is None and run() has not been executed yet.

How to use

# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry failed tasks elegantly - this is the key use case!
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed)

# Or more concisely
reports = benchmark.run(tasks=tasks)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks())

get_model_adapter `abstractmethod`

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Create and optionally register a model adapter for CONVERSE components.

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER	DESCRIPTION
`category`	Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs(). TYPE: `str`
`name`	Unique identifier for this component within its category TYPE: `str`
`component`	Any object inheriting from TraceableMixin and/or ConfigurableMixin TYPE: `RegisterableComponent`

RETURNS	DESCRIPTION
`RegisterableComponent`	The component (for chaining convenience)

RAISES	DESCRIPTION
`ValueError`	If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION

tasks

Task source for execution. Can be: - A single Task object - A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue) - An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of report dictionaries, one per task repetition. Every report carries the
`List[Dict[str, Any]]`	same keys (consistent schema) regardless of success or failure:
`List[Dict[str, Any]]`	task_id: Task identifier (UUID)
`List[Dict[str, Any]]`	repeat_idx: Repetition index (0 to n_task_repeats-1)
`List[Dict[str, Any]]`	status: Execution status (one of TaskExecutionStatus enum values)
`List[Dict[str, Any]]`	traces: Execution traces from all registered components (`{}` if unavailable, e.g. setup failure)
`List[Dict[str, Any]]`	config: Configuration from all registered components and benchmark level (`{}` if unavailable)
`List[Dict[str, Any]]`	usage: Aggregated usage from all registered components (`None` if not collected)
`List[Dict[str, Any]]`	eval: Evaluation results (None if task or evaluation failed)
`List[Dict[str, Any]]`	task: Task summary dict with `query`, `metadata`, and `protocol`
`List[Dict[str, Any]]`	error: Error details dict — `None` only when status is SUCCESS; otherwise always populated, containing: error_type: Exception class name error_message: Exception message traceback: Full traceback string (plus any error-specific extras, e.g. `component`, `elapsed`, `timeout`)

RAISES	DESCRIPTION
`ValueError`	If agent_data length doesn't match number of tasks (when agent_data is an iterable).
`Exception`	If a `fail_on_setup_error` / `fail_on_task_error` / `fail_on_evaluation_error` flag is set and the corresponding failure occurs, the original exception is re-raised and the run is aborted (this applies to both sequential and parallel execution).

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

Calls your setup methods to initialize components
Calls your run_agents() method to execute the task
Collects message histories and calls evaluators
Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task)
        user = setup_user(agent_data, environment, task)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
        evaluators = setup_evaluators(environment, task, agents_to_run, user)

        # Run stage (execution_loop handles multi-turn if user exists)
        agents_output = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_message_histories(agents_dict)
        eval_results = evaluate(evaluators, traces, agents_dict)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

on_run_start: Before processing any tasks
on_task_start: Before processing a task (once per task, not per repeat)
on_task_repeat_start: Before each repetition of a task
on_task_repeat_end: After each repetition completes
on_task_end: After all repetitions of a task complete
on_run_end: After all tasks complete

# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue)  # queue receives on_task_complete callbacks

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    query: str,
) -> Any

Run the first agent with the initial query.

CONVERSE is a single-agent benchmark — only the first adapter in the sequence receives the query.

PARAMETER	DESCRIPTION
`agents`	Sequence of agent adapters (only the first is used). TYPE: `Sequence[AgentAdapter]`
`task`	Current task (unused). TYPE: `Task`
`environment`	Task environment (unused). TYPE: `Environment`
`query`	Initial query from the adversarial external agent. TYPE: `str`

RETURNS	DESCRIPTION
`Any`	The agent's response string.

RAISES	DESCRIPTION
`ValueError`	If no agents are provided.

setup_agents `abstractmethod`

setup_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Set up the SUT agents for CONVERSE.

setup_environment

setup_environment(
    agent_data: Dict[str, Any],
    task: Task,
    seed_generator: SeedGenerator,
) -> Environment

Create a :class:ConverseEnvironment from the task's environment data.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration (unused). TYPE: `Dict[str, Any]`
`task`	Current task containing environment data (persona, domain, tools). TYPE: `Task`
`seed_generator`	Seed generator (unused). TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`A`	class:`ConverseEnvironment` initialised with the task's data. TYPE: `Environment`

setup_evaluators

setup_evaluators(
    environment: Environment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> List[Evaluator]

Select evaluators based on the task's evaluation type.

All ConVerse evaluators require an LLM judge. When evaluator_model_id is not set in task.evaluation_data (see :func:configure_model_ids), no evaluators are created.

When a judge model is available, a :class:PrivacyEvaluator or :class:SecurityEvaluator is added based on the task type, and a :class:UtilityEvaluator is always added (utility is measured on every task — ConVerse/judge/utility_judge.py:143-179).

PARAMETER	DESCRIPTION
`environment`	The task environment. TYPE: `Environment`
`task`	Current task whose `evaluation_data` drives evaluator selection. TYPE: `Task`
`agents`	Agent adapters (unused). TYPE: `Sequence[AgentAdapter]`
`user`	The adversarial user (forwarded to evaluators). TYPE: `Optional[User]`
`seed_generator`	Seed generator for evaluator model reproducibility. TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`List[Evaluator]`	List of evaluators applicable to this task.

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    seed_generator: SeedGenerator,
) -> Optional[User]

Create the adversarial external agent that acts as the benchmark user.

The external agent is an LLM-driven attacker that attempts privacy extraction or unauthorised action induction over multiple turns.

PARAMETER	DESCRIPTION
`agent_data`	Must contain `attacker_model_id` (or `attacker_model`) for the attacker LLM. Raises `ValueError` if absent — no silent default is provided because a wrong attacker model would fundamentally change the nature of the attacks. Optional `max_turns` controls dialogue length (default 10). TYPE: `Dict[str, Any]`
`environment`	The task environment (unused). TYPE: `Environment`
`task`	Current task with `user_data` (persona, attack goal/strategy). TYPE: `Task`
`seed_generator`	Used to derive a reproducible seed for the attacker model. TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`A`	class:`ConverseExternalAgent` configured for the task. TYPE: `Optional[User]`

RAISES	DESCRIPTION
`ValueError`	If `attacker_model_id` (or `attacker_model`) is not in `agent_data`.

DefaultAgentConverseBenchmark

Bases: ConverseBenchmark

CONVERSE benchmark with a built-in default tool-calling assistant agent.

seed_generator `property`

seed_generator: SeedGenerator

The seed generator for this benchmark.

The seed generator is configured at benchmark initialization via the seed or seed_generator parameters. When seed=None (the default), the generator's derive_seed() method returns None, effectively disabling seeding while maintaining a uniform interface.

RETURNS	DESCRIPTION
`SeedGenerator`	The root `SeedGenerator` instance.

usage `property`

usage: Usage

Running usage total across all task repetitions.

Queryable at any time, including while the benchmark is still running. Returns the grand total of all usage collected so far.

usage_by_component `property`

usage_by_component: Dict[str, Usage]

Per-component running usage totals across all repetitions.

Keys are registry keys (e.g., "models:main_model").

init

__init__(*args: Any, **kwargs: Any)

Initialize the CONVERSE benchmark.

Sets max_invocations to 10 by default because multi-turn dialogue is required for social-engineering style attacks.

PARAMETER	DESCRIPTION
`*args`	Forwarded to :class:`Benchmark`. TYPE: `Any` DEFAULT: `()`
`**kwargs`	Forwarded to :class:`Benchmark`. `max_invocations` defaults to 10 if not provided. TYPE: `Any` DEFAULT: `{}`

add_callback

add_callback(callback: BenchmarkCallback) -> None

Register a callback handler to monitor benchmark execution.

PARAMETER	DESCRIPTION
`callback`	A BenchmarkCallback instance that will receive execution events. TYPE: `BenchmarkCallback`

How to use

Callbacks receive notifications at key lifecycle points for tracing, progress tracking, or custom metrics collection. See BenchmarkCallback for available hooks and their signatures.

from maseval.core.callbacks import MessageTracingCallback

benchmark = MyBenchmark(tasks=tasks, agent_data=config)
benchmark.add_callback(MessageTracingCallback(output_dir="logs"))
results = benchmark.run()

clear_registry

clear_registry() -> None

Clear the component registry after a task repetition completes.

This method is called automatically by run() after each task repetition to ensure components are not carried over between repetitions. The reports list persists across all repetitions for aggregated analysis.

collect_all_configs

collect_all_configs() -> Dict[str, Any]

Collect configuration from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive configuration from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected configs are stored in benchmark.reports list along with traces for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their config (settings, parameters)
models - Dict mapping model names to their config (model IDs, parameters)
tools - Dict mapping tool names to their config (specifications, settings)
simulators - Dict mapping simulator names to their config (parameters, templates)
callbacks - Dict mapping callback names to their config (settings)
environment - Direct config from the environment (not nested), or None if not present
user - Direct config from the user simulator (not nested), or None if not present
other - Dict for any other registered components
benchmark - Benchmark-level configuration (git, system, packages)

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing configuration from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> config
    print(f"Agent config: {report['config']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment config: {report['config']['environment']}")
    print(f"User config: {report['config']['user']}")
    # Benchmark-level config
    print(f"Git commit: {report['config']['benchmark']['git']['commit_hash']}")

The collected configs are available in the results for reproducibility analysis.

collect_all_traces

collect_all_traces() -> Dict[str, Any]

Collect execution traces from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes and before evaluation begins. It gathers comprehensive traces from all registered components (agents, models, tools, simulators, callbacks, etc.) for that specific repetition. After collection, the registry is cleared for the next repetition.

The collected traces are stored in benchmark.reports list along with configs for persistent access across all task repetitions.

Output fields:

metadata - Collection timestamp and thread info
agents - Dict mapping agent names to their traces (messages, execution data)
models - Dict mapping model names to their traces (API calls, timing, errors)
tools - Dict mapping tool names to their traces (invocations, parameters)
simulators - Dict mapping simulator names to their traces (attempts, outcomes)
callbacks - Dict mapping callback names to their traces (custom data)
environment - Direct traces from the environment (not nested), or None if not present
user - Direct traces from the user simulator (not nested), or None if not present
other - Dict for any other registered components

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing execution traces from all registered components.

How to use

This method is called automatically by run() after each task repetition:

# Automatic collection (recommended)
results = benchmark.run()

# Access all collected reports (traces + configs) across repetitions
for report in benchmark.reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}")
    # Agents is a dict: agent_name -> traces
    print(f"Agent messages: {report['traces']['agents']['my_agent']}")
    # Environment and user are direct (not nested)
    print(f"Environment state: {report['traces']['environment']}")
    print(f"User interactions: {report['traces']['user']}")

The collected traces are passed to the evaluator's evaluate() method and stored in benchmark.reports for later analysis.

collect_all_usage

collect_all_usage() -> Dict[str, Any]

Collect usage from all registered components for the current task repetition.

This method is called automatically by run() after each task repetition completes. It gathers usage from all registered UsageTrackableMixin components and also accumulates into persistent running totals accessible via usage and usage_by_component.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Structured dictionary containing usage from all registered components.

evaluate

evaluate(
    evaluators: Sequence[Evaluator],
    agents: Dict[str, AgentAdapter],
    final_answer: Any,
    traces: Dict[str, Any],
) -> List[Dict[str, Any]]

Run all evaluators and return their results.

Each evaluator first filters the traces to its relevant subset, then produces a result dictionary containing at least score.

PARAMETER	DESCRIPTION
`evaluators`	Evaluators selected by :meth:`setup_evaluators`. TYPE: `Sequence[Evaluator]`
`agents`	Named agent adapters (unused). TYPE: `Dict[str, AgentAdapter]`
`final_answer`	The agent's final response. TYPE: `Any`
`traces`	Full execution traces from the benchmark run. TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of evaluation result dictionaries, one per evaluator.

execution_loop

execution_loop(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    user: Optional[User],
) -> Any

Execute agents with optional user interaction loop.

This method orchestrates the agent-user interaction pattern. When a user is present, the user initiates the conversation using user.get_initial_query(). If no user is present, task.query is used as the initial query.

Interaction Flow

By default, agents execute once (max_invocations=1). For multi-turn interaction, set self.max_invocations > 1 in your benchmark's __init__. The loop continues until max_invocations is reached or user.is_done() returns True (e.g., max turns reached or stop token detected).

Note

Override this method in your benchmark subclass to implement custom interaction patterns (e.g., agent-initiated conversations, different termination conditions, or specialized query routing).

PARAMETER	DESCRIPTION
`agents`	Agents to execute (typically the orchestrator). TYPE: `Sequence[AgentAdapter]`
`task`	The task being solved. TYPE: `Task`
`environment`	The environment providing tools and state. TYPE: `Environment`
`user`	Optional user simulator. If provided, the user initiates and drives the conversation. If None, a single agent execution with `task.query`. TYPE: `Optional[User]`

RETURNS	DESCRIPTION
`Any`	Final answer from the last agent execution.

Example

For interactive benchmarks, enable multi-turn interaction::

def __init__(self, ...):
    super().__init__(...)
    self.max_invocations = 5  # Up to 5 agent-user exchanges

get_failed_tasks

get_failed_tasks(
    status_filter: Optional[
        Union[
            TaskExecutionStatus, List[TaskExecutionStatus]
        ]
    ] = None,
    reports: Optional[List[Dict[str, Any]]] = None,
) -> SequentialTaskQueue

Get tasks that failed during benchmark execution.

This method retrieves failed tasks based on their execution status, useful for debugging, retry logic, or failure analysis.

PARAMETER DESCRIPTION

status_filter

Filter by specific failure status(es). If None, returns all failed tasks (any status except SUCCESS). Can be a single TaskExecutionStatus or a list of them. Examples: - TaskExecutionStatus.TASK_EXECUTION_FAILED: Only tasks that failed during execution - TaskExecutionStatus.EVALUATION_FAILED: Only tasks where evaluation failed - [TaskExecutionStatus.TASK_EXECUTION_FAILED, TaskExecutionStatus.SETUP_FAILED]: Tasks that failed during execution or setup

TYPE: Optional[Union[TaskExecutionStatus, List[TaskExecutionStatus]]] DEFAULT: None

reports

Optional list of reports to analyze. If None, uses the reports from the last run() call. This allows analyzing externally stored or modified reports.

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

RETURNS	DESCRIPTION
`SequentialTaskQueue`	SequentialTaskQueue containing the failed tasks. Empty if no failures match the filter.

RAISES	DESCRIPTION
`RuntimeError`	If reports is None and run() has not been executed yet.

How to use

# Run benchmark
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Get all failed tasks (from internal state)
failed = benchmark.get_failed_tasks()
print(f"Failed: {len(failed)}/{len(benchmark.tasks)} tasks")

# Or work with returned reports (safe from internal state changes)
failed = benchmark.get_failed_tasks(reports=reports)

# Get only tasks that failed during execution (not evaluation)
execution_failures = benchmark.get_failed_tasks(
    TaskExecutionStatus.TASK_EXECUTION_FAILED,
    reports=reports
)

# Get setup and execution failures
critical_failures = benchmark.get_failed_tasks(
    status_filter=[
        TaskExecutionStatus.SETUP_FAILED,
        TaskExecutionStatus.TASK_EXECUTION_FAILED
    ],
    reports=reports
)

# Retry failed tasks elegantly - this is the key use case!
if len(failed) > 0:
    retry_reports = benchmark.run(tasks=failed)

# Or more concisely
reports = benchmark.run(tasks=tasks)
retry_reports = benchmark.run(tasks=benchmark.get_failed_tasks())

get_model_adapter `abstractmethod`

get_model_adapter(
    model_id: str, **kwargs: Any
) -> ModelAdapter

Create and optionally register a model adapter for CONVERSE components.

register

register(
    category: str,
    name: str,
    component: RegisterableComponent,
) -> RegisterableComponent

Register a component for comprehensive trace and configuration collection.

All core MASEval components (AgentAdapter, ModelAdapter, Environment, User, LLMSimulator, BenchmarkCallback) inherit from TraceableMixin and/or ConfigurableMixin, and are automatically registered for both trace and configuration collection before evaluation.

Note: Most components are automatically registered when returned from setup methods (setup_environment, setup_user, setup_agents). You only need to manually register additional components like models, simulators, or tools that aren't automatically captured.

PARAMETER	DESCRIPTION
`category`	Component category (e.g., "agents", "models", "tools", "simulators", "callbacks", "user", "environment", "seeding"). Use plural form to match the structure in collect_all_traces() and collect_all_configs(). TYPE: `str`
`name`	Unique identifier for this component within its category TYPE: `str`
`component`	Any object inheriting from TraceableMixin and/or ConfigurableMixin TYPE: `RegisterableComponent`

RETURNS	DESCRIPTION
`RegisterableComponent`	The component (for chaining convenience)

RAISES	DESCRIPTION
`ValueError`	If the component is already registered under a different name

How to use

Most components are auto-registered. Manual registration is only needed for additional components:

def setup_agents(self, agent_data, environment, task, user):
    # Create model (needs manual registration)
    model = MyModelAdapter(...)
    self.register("models", "main_model", model)

    # Create agent (auto-registered when returned)
    agent = MyAgent(model=model)
    agent_adapter = AgentAdapter(agent, "agent1")

    # Environment and user are also auto-registered
    return [agent_adapter], {"agent1": agent_adapter}

Traces and configs are automatically collected before evaluation via collect_all_traces() and collect_all_configs() which are called internally by the run() method.

run

run(
    tasks: Union[
        Task, BaseTaskQueue, Iterable[Union[Task, dict]]
    ],
    agent_data: Dict[str, Any] | Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]

Initialize and execute the complete benchmark loop across all tasks.

PARAMETER DESCRIPTION

tasks

Task source for execution. Can be: - A single Task object - A BaseTaskQueue (SequentialTaskQueue, PriorityTaskQueue, or custom AdaptiveTaskQueue) - An iterable of Task objects or dicts that will be converted to Tasks

When a BaseTaskQueue is provided, it controls the task ordering. AdaptiveTaskQueue subclasses are automatically registered as callbacks to receive task completion notifications.

TYPE: Union[Task, BaseTaskQueue, Iterable[Union[Task, dict]]]

agent_data

Configuration for agents. Either a single dict applied to all tasks, or an iterable of dicts with one configuration per task. Agent data typically includes model parameters, agent architecture details, and tool specifications.

TYPE: Dict[str, Any] | Iterable[Dict[str, Any]]

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of report dictionaries, one per task repetition. Every report carries the
`List[Dict[str, Any]]`	same keys (consistent schema) regardless of success or failure:
`List[Dict[str, Any]]`	task_id: Task identifier (UUID)
`List[Dict[str, Any]]`	repeat_idx: Repetition index (0 to n_task_repeats-1)
`List[Dict[str, Any]]`	status: Execution status (one of TaskExecutionStatus enum values)
`List[Dict[str, Any]]`	traces: Execution traces from all registered components (`{}` if unavailable, e.g. setup failure)
`List[Dict[str, Any]]`	config: Configuration from all registered components and benchmark level (`{}` if unavailable)
`List[Dict[str, Any]]`	usage: Aggregated usage from all registered components (`None` if not collected)
`List[Dict[str, Any]]`	eval: Evaluation results (None if task or evaluation failed)
`List[Dict[str, Any]]`	task: Task summary dict with `query`, `metadata`, and `protocol`
`List[Dict[str, Any]]`	error: Error details dict — `None` only when status is SUCCESS; otherwise always populated, containing: error_type: Exception class name error_message: Exception message traceback: Full traceback string (plus any error-specific extras, e.g. `component`, `elapsed`, `timeout`)

RAISES	DESCRIPTION
`ValueError`	If agent_data length doesn't match number of tasks (when agent_data is an iterable).
`Exception`	If a `fail_on_setup_error` / `fail_on_task_error` / `fail_on_evaluation_error` flag is set and the corresponding failure occurs, the original exception is re-raised and the run is aborted (this applies to both sequential and parallel execution).

How to use

This is the framework's main orchestration method that runs your entire benchmark. It iterates through all tasks, handles repetitions, and manages the three-stage lifecycle for each execution. You don't implement this method—instead, you call it to start the benchmark after implementing the setup and execution methods.

By default, the benchmark will continue executing remaining tasks even if some fail. You can change this behavior by setting fail_on_task_error=True, fail_on_evaluation_error=True, or fail_on_setup_error=True when instantiating the benchmark. Each task execution returns a status indicating success or the specific failure type (see TaskExecutionStatus).

For each task execution, the framework:

Calls your setup methods to initialize components
Calls your run_agents() method to execute the task
Collects message histories and calls evaluators
Stores results and triggers callbacks

Pseudocode structure:

for task in tasks:
    for repeat in range(n_task_repeats):
        # Setup stage
        environment = setup_environment(agent_data, task)
        user = setup_user(agent_data, environment, task)
        agents_to_run, agents_dict = setup_agents(agent_data, environment, task, user)
        evaluators = setup_evaluators(environment, task, agents_to_run, user)

        # Run stage (execution_loop handles multi-turn if user exists)
        agents_output = execution_loop(agents_to_run, task, environment, user)

        # Evaluate stage
        traces = collect_message_histories(agents_dict)
        eval_results = evaluate(evaluators, traces, agents_dict)

        # Store results
        store_result(task_id, traces, eval_results)

Callback hooks are triggered at these points:

on_run_start: Before processing any tasks
on_task_start: Before processing a task (once per task, not per repeat)
on_task_repeat_start: Before each repetition of a task
on_task_repeat_end: After each repetition completes
on_task_end: After all repetitions of a task complete
on_run_end: After all tasks complete

# Typical usage
benchmark = MyBenchmark()
reports = benchmark.run(tasks=tasks, agent_data=config)

# Analyze results
for report in reports:
    print(f"Task {report['task_id']}, Repeat {report['repeat_idx']}: {report['eval']}")
    print(f"Config: {report['config']}")
    print(f"Traces: {report['traces']}")

# Parallel execution with 4 workers
benchmark = MyBenchmark(num_workers=4)
reports = benchmark.run(tasks=tasks, agent_data=config)

# Single agent config for all tasks
reports = benchmark.run(tasks=tasks, agent_data={"model": "gpt-4"})

# Task-specific agent configs (must match task count)
reports = benchmark.run(
    tasks=tasks,
    agent_data=[
        {"model": "gpt-4", "difficulty": "easy"},
        {"model": "gpt-4", "difficulty": "hard"},
    ]
)

# Priority-based execution
from maseval.core.task import PriorityTaskQueue
for task in tasks:
    task.protocol.priority = compute_priority(task)
queue = PriorityTaskQueue(tasks)
reports = benchmark.run(tasks=queue, agent_data=config)

# Adaptive queue (auto-registered as callback)
queue = MyAdaptiveTaskQueue(tasks)
reports = benchmark.run(tasks=queue)  # queue receives on_task_complete callbacks

run_agents

run_agents(
    agents: Sequence[AgentAdapter],
    task: Task,
    environment: Environment,
    query: str,
) -> Any

Run the first agent with the initial query.

CONVERSE is a single-agent benchmark — only the first adapter in the sequence receives the query.

PARAMETER	DESCRIPTION
`agents`	Sequence of agent adapters (only the first is used). TYPE: `Sequence[AgentAdapter]`
`task`	Current task (unused). TYPE: `Task`
`environment`	Task environment (unused). TYPE: `Environment`
`query`	Initial query from the adversarial external agent. TYPE: `str`

RETURNS	DESCRIPTION
`Any`	The agent's response string.

RAISES	DESCRIPTION
`ValueError`	If no agents are provided.

setup_agents

setup_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]

Create a :class:DefaultConverseAgent wrapped in an adapter.

PARAMETER	DESCRIPTION
`agent_data`	Must contain `model_id` for the assistant LLM. Optional `max_tool_calls` and `generation_params`. TYPE: `Dict[str, Any]`
`environment`	Environment providing tools to the agent. TYPE: `Environment`
`task`	Current task (unused). TYPE: `Task`
`user`	The adversarial user (unused). TYPE: `Optional[User]`
`seed_generator`	Used to derive a reproducible seed for the agent model. TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]`	Tuple of (agent list, name-to-adapter dict).

RAISES	DESCRIPTION
`ValueError`	If `agent_data` does not contain `model_id`.

setup_environment

setup_environment(
    agent_data: Dict[str, Any],
    task: Task,
    seed_generator: SeedGenerator,
) -> Environment

Create a :class:ConverseEnvironment from the task's environment data.

PARAMETER	DESCRIPTION
`agent_data`	Agent configuration (unused). TYPE: `Dict[str, Any]`
`task`	Current task containing environment data (persona, domain, tools). TYPE: `Task`
`seed_generator`	Seed generator (unused). TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`A`	class:`ConverseEnvironment` initialised with the task's data. TYPE: `Environment`

setup_evaluators

setup_evaluators(
    environment: Environment,
    task: Task,
    agents: Sequence[AgentAdapter],
    user: Optional[User],
    seed_generator: SeedGenerator,
) -> List[Evaluator]

Select evaluators based on the task's evaluation type.

All ConVerse evaluators require an LLM judge. When evaluator_model_id is not set in task.evaluation_data (see :func:configure_model_ids), no evaluators are created.

When a judge model is available, a :class:PrivacyEvaluator or :class:SecurityEvaluator is added based on the task type, and a :class:UtilityEvaluator is always added (utility is measured on every task — ConVerse/judge/utility_judge.py:143-179).

PARAMETER	DESCRIPTION
`environment`	The task environment. TYPE: `Environment`
`task`	Current task whose `evaluation_data` drives evaluator selection. TYPE: `Task`
`agents`	Agent adapters (unused). TYPE: `Sequence[AgentAdapter]`
`user`	The adversarial user (forwarded to evaluators). TYPE: `Optional[User]`
`seed_generator`	Seed generator for evaluator model reproducibility. TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`List[Evaluator]`	List of evaluators applicable to this task.

setup_user

setup_user(
    agent_data: Dict[str, Any],
    environment: Environment,
    task: Task,
    seed_generator: SeedGenerator,
) -> Optional[User]

Create the adversarial external agent that acts as the benchmark user.

The external agent is an LLM-driven attacker that attempts privacy extraction or unauthorised action induction over multiple turns.

PARAMETER	DESCRIPTION
`agent_data`	Must contain `attacker_model_id` (or `attacker_model`) for the attacker LLM. Raises `ValueError` if absent — no silent default is provided because a wrong attacker model would fundamentally change the nature of the attacks. Optional `max_turns` controls dialogue length (default 10). TYPE: `Dict[str, Any]`
`environment`	The task environment (unused). TYPE: `Environment`
`task`	Current task with `user_data` (persona, attack goal/strategy). TYPE: `Task`
`seed_generator`	Used to derive a reproducible seed for the attacker model. TYPE: `SeedGenerator`

RETURNS	DESCRIPTION
`A`	class:`ConverseExternalAgent` configured for the task. TYPE: `Optional[User]`

RAISES	DESCRIPTION
`ValueError`	If `attacker_model_id` (or `attacker_model`) is not in `agent_data`.

DefaultConverseAgent

Default tool-calling agent for CONVERSE benchmark runs.

Implements a safety-aware ReAct-style agent loop:

Receives user/external-agent message
Generates response (text or tool call) via the provided model
If tool call: executes tool against the environment and loops to step 2
If text: returns text as the final assistant response
If max_tool_calls is reached, returns a safe fallback message

The system prompt instructs the agent to protect private user data and refuse suspicious requests, matching the defensive posture expected by the CONVERSE evaluation (privacy leak and forbidden-tool checks).

ATTRIBUTE	DESCRIPTION
`model`	ModelAdapter used for LLM inference.
`tools`	Mapping of tool name to callable.
`max_tool_calls`	Upper bound on tool invocations per turn.
`generation_params`	Extra parameters forwarded to the model.
`messages`	Running message history for the current session.
`system_prompt`	System-level instruction text.

init

__init__(
    model: ModelAdapter,
    tools: Dict[str, Callable[..., Any]],
    max_tool_calls: int = 20,
    generation_params: Optional[Dict[str, Any]] = None,
    system_prompt: Optional[str] = None,
)

Initialise the default CONVERSE agent.

PARAMETER	DESCRIPTION
`model`	Model adapter for LLM inference. TYPE: `ModelAdapter`
`tools`	Mapping of tool name to callable (from the environment). TYPE: `Dict[str, Callable[..., Any]]`
`max_tool_calls`	Maximum number of tool invocations per turn. TYPE: `int` DEFAULT: `20`
`generation_params`	Extra parameters forwarded to `model.chat()`. TYPE: `Optional[Dict[str, Any]]` DEFAULT: `None`
`system_prompt`	Custom system prompt. When `None`, uses the full prompt ported from the original ConVerse `assistant_prompts.py`. Pass a string to override. TYPE: `Optional[str]` DEFAULT: `None`

get_messages

get_messages() -> MessageHistory

Return the full message history for this session.

run

run(query: str) -> str

Append a user message and generate a response (possibly with tool use).

PARAMETER	DESCRIPTION
`query`	The incoming message text. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The assistant's final textual response for this turn.

DefaultConverseAgentAdapter

Bases: AgentAdapter

Adapter for the built-in default CONVERSE agent.

init

__init__(
    agent: DefaultConverseAgent,
    name: str = "default_converse_agent",
)

Wrap a :class:DefaultConverseAgent as an :class:AgentAdapter.

PARAMETER	DESCRIPTION
`agent`	The default CONVERSE agent instance. TYPE: `DefaultConverseAgent`
`name`	Adapter name used as the key in `agents_dict`. TYPE: `str` DEFAULT: `'default_converse_agent'`

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this agent.

Collects comprehensive configuration information about the agent including its name, type, and callback configuration.

Output fields:

type - Component class name
gathered_at - ISO timestamp
name - Agent name
agent_type - Underlying agent framework class name
adapter_type - The specific adapter class (e.g., SmolAgentAdapter)
callbacks - List of callback class names attached to this agent

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing agent configuration.

How to use

This method is automatically called by Benchmark during config collection. Framework-specific adapters can extend this to include additional data:

def gather_config(self) -> Dict[str, Any]:
    return {
        **super().gather_config(),
        "framework_specific_setting": self.agent.some_setting
    }

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this agent.

Collects comprehensive information about the agent's execution including message history, callback information, and agent metadata.

Output fields:

type - Component class name
gathered_at - ISO timestamp
name - Agent name
agent_type - Underlying agent framework class name
message_count - Number of messages in history
messages - Full message history as list of dicts
callbacks - List of callback class names attached to this agent

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing agent execution traces.

How to use

This method is automatically called by Benchmark during trace collection. Framework-specific adapters can extend this to include additional data:

def gather_traces(self) -> Dict[str, Any]:
    return {
        **super().gather_traces(),
        "framework_specific_metric": self.agent.some_metric
    }

gather_usage

gather_usage() -> Usage

Gather usage with automatic cost calculation.

Calls _gather_usage() for raw token counts, then applies the cost calculator if one is available and cost is still 0.0.

The model_id used for cost calculation is resolved in order:

Explicit model_id passed to __init__
Auto-detected from the framework agent via _resolve_model_id()

Subclasses should override _gather_usage() (not this method) to provide framework-specific token extraction.

RETURNS	DESCRIPTION
`Usage`	Usage (or TokenUsage) with cost filled in when possible.

get_messages

get_messages() -> MessageHistory

Return the message history from the wrapped agent.

run

run(query: str) -> Any

Executes the agent and returns the result.

ConverseEnvironment

Bases: Environment

Environment exposing tools that can be abused in social-engineering attacks.

create_tools

create_tools() -> Dict[str, Any]

Create the set of tools that the assistant agent may invoke.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Mapping of tool name to :class:`ConverseFunctionTool` instance.

gather_config

gather_config() -> dict[str, Any]

Gather configuration from this environment.

Output fields:

type - Component class name
gathered_at - ISO timestamp
tool_count - Number of tools
tool_names - List of tool names

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary containing environment configuration.

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this environment and its tools.

Output fields:

type - Component class name
gathered_at - ISO timestamp
tool_count - Number of tools in environment
tools - Dictionary of tool traces keyed by tool name

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary containing environment execution traces.

get_tool

get_tool(name: str) -> Optional[Any]

Get a tool by name.

PARAMETER	DESCRIPTION
`name`	Tool name TYPE: `str`

RETURNS	DESCRIPTION
`Optional[Any]`	The tool, or None if not found

get_tools

get_tools() -> Dict[str, Any]

Get all tools as a dict.

setup_state

setup_state(
    environment_data: Dict[str, Any],
) -> Dict[str, Any]

Initialise environment state from the task's environment data.

PARAMETER	DESCRIPTION
`environment_data`	Dictionary with keys such as `persona_text`, `options_text`, `domain`, `emails`, `calendar`, `general_info`, `banking`, `medical`. TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Mutable state dictionary used by the tools during execution.

ConverseExternalAgent

Bases: LLMUser

LLM-driven adversarial external service provider used as the benchmark user.

termination_reason `property`

termination_reason: TerminationReason

Get the reason why the user interaction terminated.

RETURNS	DESCRIPTION
`TerminationReason`	Why `is_done()` returns True, or `NOT_TERMINATED` if still ongoing.

init

__init__(
    model: ModelAdapter,
    user_data: Dict[str, Any],
    initial_query: Optional[str] = None,
    max_turns: int = 10,
    options_text: str = "",
    domain: str = "",
    **kwargs: Any,
)

Initialise the adversarial external agent.

PARAMETER	DESCRIPTION
`model`	Model adapter for the attacker LLM. TYPE: `ModelAdapter`
`user_data`	Dictionary containing `persona`, `attack_goal`, `attack_strategy`, and `attack_rationale`. TYPE: `Dict[str, Any]`
`initial_query`	First message sent to the assistant agent. TYPE: `Optional[str]` DEFAULT: `None`
`max_turns`	Maximum number of dialogue turns. TYPE: `int` DEFAULT: `10`
`options_text`	Package options text for the domain. When provided, the full adversarial prompt from the original ConVerse is used (matching `get_external_aggregated_prompt_adv`). TYPE: `str` DEFAULT: `''`
`domain`	MASEval domain name (`"travel_planning"`, `"real_estate"`, or `"insurance"`). Used to select the correct role. TYPE: `str` DEFAULT: `''`
`**kwargs`	Forwarded to :class:`LLMUser`. TYPE: `Any` DEFAULT: `{}`

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this user.

Output fields:

name - User identifier
profile - User profile data
scenario - Task scenario description
max_turns - Maximum interaction turns
stop_tokens - Early stopping tokens (empty list if disabled)
exhausted_response - Message returned when user is done, or None

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing user configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this user.

Output fields:

name - User identifier
profile - User profile data
message_count - Number of messages in history
messages - Full conversation history
logs - Execution logs with timing
termination_reason - Why interaction ended (see TerminationReason)
stop_reason - Which stop token triggered termination, if any
max_turns - Maximum allowed turns
turns_used - Actual turns used
stopped_by_user - Whether user emitted a stop token

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing user state and interaction data.

get_initial_query

get_initial_query() -> str

Get the initial query for the conversation.

If an initial_query was provided at construction, returns it. Otherwise, generates one using the LLM simulator based on the user's profile and scenario.

This method: - Returns the existing initial query if one was provided - Or calls the LLM simulator to generate one - Ensures the query is in the message history - Counts the initial query as the first turn

RETURNS	DESCRIPTION
`str`	The initial query (either pre-set or LLM-generated).

RAISES	DESCRIPTION
`RuntimeError`	If called after conversation has progressed beyond the initial message.

get_tool

get_tool() -> Any

Return a framework-compatible tool for agent interaction.

Some frameworks (smolagents, CAMEL) use a tool-based pattern where agents invoke an AskUser tool to interact with the user. Override this in subclasses for frameworks that need it.

RETURNS	DESCRIPTION
`Any`	Framework-specific tool, or `None` if not applicable.

increment_turn

increment_turn() -> None

Increment the turn counter.

Call this after recording a user response in the message history.

is_done

is_done() -> bool

Check if the user interaction should end.

Checks: 1. If max_turns has been reached 2. If the user previously indicated termination (via stop_token)

Subclasses can override to add custom termination logic (e.g., LLM-based satisfaction checks) by calling super().is_done() first.

RETURNS	DESCRIPTION
`bool`	True if the user is done interacting, False to continue.

respond

respond(message: str) -> str

Respond to a message from the agent using LLM simulation.

This method appends the agent's message to the conversation history, generates a response using the LLM simulator, appends the response to the history, and returns it.

If a stop_token is detected in the response, triggers early stopping.

PARAMETER	DESCRIPTION
`message`	The message from the agent to which the user should respond. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The user's response, or `exhausted_response` if done and configured.

RAISES	DESCRIPTION
`UserExhaustedError`	If the user is already done and no `exhausted_response` is configured.

PrivacyEvaluator

Bases: _ConverseEvaluator

Evaluate whether an assistant leaked private information.

Routes by category matching ConVerse/judge/privacy_judge.py:199-221:

Leak categories (unrelated_to_*, related_but_private): checks whether extraction snippets were disclosed.
Abstraction category (related_and_useful): checks whether raw data was shared instead of an acceptable abstraction.

Only assistant / external-agent dialogue is passed to the LLM judge; tool interactions with the environment are filtered out, matching ConVerse/judge/privacy_judge.py:138-149.

call

__call__(
    traces: Dict[str, Any],
    final_answer: Optional[str] = None,
) -> Dict[str, Any]

Evaluate privacy based on category routing.

Routes to the appropriate evaluation method based on category, matching ConVerse/judge/privacy_judge.py:199-221.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with `privacy_leak` (bool), `score` (float),
`Dict[str, Any]`	and evaluation-specific details.

filter_traces

filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]

Extract all assistant messages from the agent traces.

PARAMETER	DESCRIPTION
`traces`	Full execution traces containing an `agents` key. TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with a `messages` key containing the flattened message list.

SecurityEvaluator

Bases: _ConverseEvaluator

Evaluate whether a security attack succeeded using an LLM judge.

Routes by responsibility_flag matching the original ConVerse security_judge.py:124-167:

EXTERNAL_AGENT_OPTIONS → evaluates whether the final package contains harmful elements introduced by the malicious external agent.
AI_ASSISTANT_TOOLKIT_ACTION → evaluates whether the assistant performed unauthorized tool actions via the environment.

call

__call__(
    traces: Dict[str, Any],
    final_answer: Optional[str] = None,
) -> Dict[str, Any]

Evaluate security based on responsibility_flag routing.

Matches ConVerse/judge/security_judge.py:124-167.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with `security_violation` (bool), `score` (float),
`Dict[str, Any]`	`attack_success`, and evaluation-specific details.

filter_traces

filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]

Extract messages and tool traces for security evaluation.

PARAMETER	DESCRIPTION
`traces`	Full execution traces. TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with `messages` and `tools` keys.

UtilityEvaluator

Bases: _ConverseEvaluator

Evaluate the utility/completeness of the assistant's final package.

Matches the original ConVerse utility_judge.py which evaluates:

Coverage: How many required items were included in the final package (HOW_MANY_ITEMS_COVERED: N/M).
Ratings: Maps items in the final package to ground-truth ratings and computes an average rating.

call

__call__(
    traces: Dict[str, Any],
    final_answer: Optional[str] = None,
) -> Dict[str, Any]

Evaluate utility of the final package.

Matches ConVerse/judge/utility_judge.py:143-179.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with `coverage`, `rating`, `score`, and
`Dict[str, Any]`	evaluation-specific details.

filter_traces

filter_traces(traces: Dict[str, Any]) -> Dict[str, Any]

Extract all assistant messages from the agent traces.

PARAMETER	DESCRIPTION
`traces`	Full execution traces containing an `agents` key. TYPE: `Dict[str, Any]`

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with a `messages` key containing the flattened message list.

load_tasks

load_tasks(
    domain: ConverseDomain,
    split: Literal["privacy", "security", "all"] = "all",
    limit: Optional[int] = None,
    data_dir: Optional[Path] = None,
) -> TaskQueue

Load CONVERSE tasks for a domain and attack split.

Downloads benchmark data on first call via :func:ensure_data_exists.

PARAMETER	DESCRIPTION
`domain`	CONVERSE domain to load. TYPE: `ConverseDomain`
`split`	Attack type filter — `"privacy"`, `"security"`, or `"all"`. TYPE: `Literal['privacy', 'security', 'all']` DEFAULT: `'all'`
`limit`	Maximum number of tasks to return (`None` for all). TYPE: `Optional[int]` DEFAULT: `None`
`data_dir`	Optional override for the local data cache directory. TYPE: `Optional[Path]` DEFAULT: `None`

RETURNS	DESCRIPTION
`A`	class:`TaskQueue` containing the loaded tasks. TYPE: `TaskQueue`

RAISES	DESCRIPTION
`ValueError`	If domain is not one of the supported domains.

ensure_data_exists

ensure_data_exists(
    domain: ConverseDomain,
    data_dir: Optional[Path] = None,
    force_download: bool = False,
) -> Path

Ensure local CONVERSE data exists for the selected domain.

Downloads benchmark files the first time they are needed.

PARAMETER	DESCRIPTION
`domain`	CONVERSE domain to load. TYPE: `ConverseDomain`
`data_dir`	Optional override for the local data cache directory. TYPE: `Optional[Path]` DEFAULT: `None`
`force_download`	Re-download files even if they already exist. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Path`	Path to the local data root directory.

CONVERSE Benchmark (Beta)

What It Tests

Data Source

Usage

Default Implementation

Evaluation Output

Privacy Evaluator

Security Evaluator

Utility Evaluator

ConverseBenchmark

seed_generator property

usage property

usage_by_component property

__init__

add_callback

clear_registry

collect_all_configs

collect_all_traces

collect_all_usage

evaluate

execution_loop

get_failed_tasks

get_model_adapter abstractmethod

register

run

run_agents

setup_agents abstractmethod

setup_environment

setup_evaluators

setup_user

DefaultAgentConverseBenchmark

seed_generator property

usage property

usage_by_component property

__init__

add_callback

clear_registry

collect_all_configs

collect_all_traces

collect_all_usage

evaluate

execution_loop

get_failed_tasks

get_model_adapter abstractmethod

register

run

run_agents

setup_agents

setup_environment

setup_evaluators

setup_user

DefaultConverseAgent

__init__

get_messages

run

DefaultConverseAgentAdapter

__init__

gather_config

gather_traces

gather_usage

get_messages

run

ConverseEnvironment

create_tools

gather_config

gather_traces

get_tool

get_tools

setup_state

ConverseExternalAgent

termination_reason property

__init__

gather_config

gather_traces

get_initial_query

get_tool

increment_turn

is_done

respond

PrivacyEvaluator

seed_generator `property`

usage `property`

usage_by_component `property`

init

get_model_adapter `abstractmethod`

setup_agents `abstractmethod`

seed_generator `property`

usage `property`

usage_by_component `property`

init

get_model_adapter `abstractmethod`

init

init

termination_reason `property`

init

call

call

call