Exception Handling

Overview

When running benchmarks, tasks can fail for different reasons. MASEval provides an exception hierarchy that distinguishes agent failures from infrastructure failures. This distinction lets you analyze failure modes separately, which helps with fair scoring and with debugging.

Why Distinguish Failure Types?

Consider a scenario where the agent provides correct inputs but the database connection times out. Without distinguishing failure types, this would appear as an agent failure. The exception hierarchy allows separating these cases when analysis requires it.

Error Types

MASEval defines four error categories:

| Exception | Source | Default Scoring | Example |
| --- | --- | --- | --- |
| AgentError | Agent input | Included | Agent passed wrong type to tool |
| EnvironmentError | Infrastructure | Excluded | Database connection failed |
| UserError | User simulator | Excluded | LLM API unreachable |
| TaskTimeoutError | Timeout | Excluded | Task exceeded deadline |

AgentError

Indicates the agent violated a contract at a controlled boundary:

from maseval import AgentError

def calculate(a: int, b: int, operation: str) -> int:
    # Validate inputs
    if not isinstance(a, int):
        raise AgentError(
            f"Expected int for 'a', got {type(a).__name__}",
            component="calculate",
            suggestion="Provide a as an integer, e.g., a=10"
        )

    if operation not in ("add", "subtract", "multiply"):
        raise AgentError(
            f"Unknown operation: {operation}",
            component="calculate",
            suggestion="Use one of: add, subtract, multiply"
        )

    # ... execution logic

The optional suggestion field provides agent-friendly hints. Some agent frameworks use error messages for automatic retry attempts.
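
As an illustration of how a framework might use the suggestion for retries, here is a minimal sketch. It assumes nothing about MASEval internals: `DemoAgentError` is a local stand-in that mirrors the `component` and `suggestion` keyword arguments shown above, and `call_with_feedback`, `add_tool`, and `coerce_ints` are hypothetical names invented for this example.

```python
class DemoAgentError(Exception):
    """Local stand-in for AgentError; not imported from maseval."""

    def __init__(self, message, component=None, suggestion=None):
        super().__init__(message)
        self.component = component
        self.suggestion = suggestion


def add_tool(a, b):
    """Toy tool that rejects non-integer input at its boundary."""
    for name, value in (("a", a), ("b", b)):
        if not isinstance(value, int):
            raise DemoAgentError(
                f"Expected int for '{name}', got {type(value).__name__}",
                component="add_tool",
                suggestion=f"Provide {name} as an integer, e.g., {name}=10",
            )
    return a + b


def call_with_feedback(tool, args, repair, max_attempts=3):
    """Call tool(**args); on DemoAgentError, let `repair` revise the arguments.

    `repair` stands in for the agent: it receives the current arguments and
    the error (message plus suggestion) and returns new arguments.
    """
    for attempt in range(max_attempts):
        try:
            return tool(**args)
        except DemoAgentError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the agent failure
            args = repair(args, err)


def coerce_ints(args, err):
    """Hypothetical repair step; a real agent would read err.suggestion."""
    return {key: int(value) for key, value in args.items()}


result = call_with_feedback(add_tool, {"a": "2", "b": 3}, coerce_ints)
```

The first call fails validation, the repair step converts the string argument, and the second call succeeds. If the attempts run out, the last error is re-raised so the task is still recorded as an agent failure.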

EnvironmentError

Indicates infrastructure failure after input validation passed:

from maseval import EnvironmentError

def fetch_data(query: str) -> dict:
    # Input validation passed, now execute
    try:
        return database.query(query)
    except DatabaseTimeoutError as e:
        raise EnvironmentError(
            "Database query timed out",
            component="fetch_data",
            details={"timeout": 30, "query_length": len(query)}
        ) from e

The details dict can include debugging information for developers.

UserError

Indicates user simulation infrastructure failure:

from maseval import UserError

class SimulatedUser:
    def respond(self, agent_message: str) -> str:
        try:
            return self.llm.generate(agent_message)
        except APIError as e:
            raise UserError(
                "User simulator LLM failed",
                component="user_simulator",
                details={"error": str(e)}
            ) from e

The Boundary Pattern

One approach to exception handling places the boundary between agent responsibility and infrastructure responsibility at input validation:

flowchart TD
    subgraph TOOL_EXECUTION[" "]
        A[Agent passes arguments] --> B{INPUT VALIDATION}
        B -->|Fails| C[AgentError]
        B -->|Passes| D{EXECUTION}
        D -->|Fails| E[EnvironmentError]
        D -->|Passes| F[Result]
    end

    style TOOL_EXECUTION fill:none,stroke:#888
    style B fill:#f5f5f5,stroke:#333
    style D fill:#f5f5f5,stroke:#333
    style C fill:#ffebee,stroke:#c62828
    style E fill:#ffebee,stroke:#c62828
    style F fill:#e8f5e9,stroke:#2e7d32

With this pattern:

  • Validation failures indicate agent-provided bad input (AgentError)
  • Execution failures after validation indicate infrastructure issues (EnvironmentError)
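
The pattern above can be sketched in plain Python. This is a sketch under stated assumptions, not MASEval's implementation: `DemoAgentError` and `DemoEnvironmentError` are local stand-ins for the library's exceptions, and `lookup_price`, `classify`, and the two backends are hypothetical names. The returned strings follow the status names used on this page.

```python
class DemoAgentError(Exception):
    """Local stand-in for AgentError; not imported from maseval."""


class DemoEnvironmentError(Exception):
    """Local stand-in for EnvironmentError; not imported from maseval."""


def lookup_price(item, backend):
    # Validation phase: a failure here is attributed to the agent.
    if not isinstance(item, str) or not item:
        raise DemoAgentError("Expected a non-empty string for 'item'")
    # Execution phase: input was valid, so a failure here is infrastructure.
    try:
        return backend(item)
    except ConnectionError as exc:
        raise DemoEnvironmentError("Price backend unreachable") from exc


def classify(call):
    """Run a zero-argument callable and map the outcome to a status string."""
    try:
        call()
    except DemoAgentError:
        return "agent_error"
    except DemoEnvironmentError:
        return "environment_error"
    return "success"


def healthy_backend(item):
    return {"apple": 1.5}.get(item)


def broken_backend(item):
    raise ConnectionError("simulated outage")


statuses = [
    classify(lambda: lookup_price("apple", healthy_backend)),
    classify(lambda: lookup_price(42, healthy_backend)),
    classify(lambda: lookup_price("apple", broken_backend)),
]
```

The same valid input (`"apple"`) produces `success` or `environment_error` depending only on the backend, while the invalid input is rejected before the backend is ever called.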

Validation Helpers

MASEval provides optional utilities for input validation:

from maseval import (
    validate_argument_type,
    validate_required_arguments,
    validate_arguments_from_schema,
)

SCHEMA = {
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}

def search(**kwargs):
    validate_arguments_from_schema(kwargs, SCHEMA, component="search")
    # Execution logic...

These helpers raise AgentError with automatic suggestions:

AgentError: [search] Argument 'limit' expected integer, got string.
Suggestion: Provide limit as an integer, e.g., 10
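
To make the idea of schema-driven validation concrete, here is a simplified sketch of what such a check could look like. It is not the code behind `validate_arguments_from_schema`: `check_arguments`, `JSON_TYPES`, and `DemoAgentError` are hypothetical names, only a handful of JSON Schema types are handled, and the message format is an approximation.

```python
class DemoAgentError(Exception):
    """Local stand-in for AgentError; not imported from maseval."""


# JSON Schema type names mapped to Python types (small subset, for illustration).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}

SCHEMA = {
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}


def check_arguments(arguments, schema, component):
    """Raise DemoAgentError on missing or mistyped arguments."""
    for name in schema.get("required", []):
        if name not in arguments:
            raise DemoAgentError(f"[{component}] Missing required argument '{name}'.")
    for name, spec in schema.get("properties", {}).items():
        expected = spec.get("type")
        if name not in arguments or expected not in JSON_TYPES:
            continue  # optional argument not supplied, or type not covered here
        value = arguments[name]
        # bool is a subclass of int in Python, so reject it for other types.
        bool_mismatch = isinstance(value, bool) and expected != "boolean"
        if bool_mismatch or not isinstance(value, JSON_TYPES[expected]):
            raise DemoAgentError(
                f"[{component}] Argument '{name}' expected {expected}, "
                f"got {type(value).__name__}."
            )


check_arguments({"query": "cats", "limit": 5}, SCHEMA, component="search")
```

Calling `check_arguments({"query": "cats", "limit": "10"}, SCHEMA, component="search")` raises the stand-in error, because `limit` is a string where the schema asks for an integer.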

Task Execution Status

Each completed task has a status indicating what happened:

| Status | Description |
| --- | --- |
| success | Task completed normally |
| agent_error | AgentError was raised |
| environment_error | EnvironmentError was raised |
| user_error | UserError was raised |
| task_timeout | Task exceeded configured timeout |
| evaluation_failed | Evaluator raised an exception |
| setup_failed | Task setup raised an exception |
| unknown_execution_error | Unclassified exception |

Scoring Considerations

When computing benchmark metrics, distinguishing between failure modes makes it possible to exclude infrastructure failures from scoring.

The recommended approach is to count agent_error as an agent failure and the other error statuses as benchmarking failures, i.e., include the former in scoring and exclude the latter.

results = benchmark.run(tasks)
summary = compute_benchmark_metrics(results)

Example output:

Total Tasks: 100
Scored Tasks: 92
Success Rate: 65.22%

Status Breakdown:
  success                    60
  agent_error                 8
  environment_error           5
  user_error                  2
  ...

The success rate (65.22%) reflects 60 / 92 rather than 60 / 100.
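
The arithmetic behind such a summary can be sketched as follows. This is not the implementation of `compute_benchmark_metrics`: `summarize` is a hypothetical helper, results are assumed to be plain dicts with a `"status"` key, and the set of excluded statuses is an assumption chosen for the example.

```python
from collections import Counter

# Assumption: statuses treated as infrastructure failures and left unscored.
EXCLUDED_STATUSES = frozenset(
    {"environment_error", "user_error", "task_timeout"}
)


def summarize(results, excluded=EXCLUDED_STATUSES):
    """Return totals, scored count, success rate, and a status breakdown."""
    breakdown = Counter(r["status"] for r in results)
    scored = sum(n for status, n in breakdown.items() if status not in excluded)
    # Guard against division by zero when every task was excluded.
    success_rate = breakdown["success"] / scored if scored else 0.0
    return {
        "total": len(results),
        "scored": scored,
        "success_rate": success_rate,
        "breakdown": dict(breakdown),
    }


demo_results = (
    [{"status": "success"}] * 6
    + [{"status": "agent_error"}] * 2
    + [{"status": "environment_error"}]
    + [{"status": "user_error"}]
)
summary = summarize(demo_results)
print(f"Scored {summary['scored']} of {summary['total']}: {summary['success_rate']:.2%}")
```

In this toy data set, 6 of the 8 scored tasks succeeded, so the rate is 75% rather than the 60% you would get by dividing by all 10 tasks.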

Rerunning Failed Tasks

Infrastructure errors are often transient. Tasks with infrastructure failures can be rerun:

results = benchmark.run(tasks)

# Identify infrastructure failures
infra_failed_ids = {
    r["task_id"] for r in results
    if r["status"] in ("environment_error", "user_error", "unknown_execution_error")
}

if infra_failed_ids:
    # Filter and rerun
    retry_tasks = tasks.filter(lambda t: t.id in infra_failed_ids)
    retry_results = benchmark.run(retry_tasks)

    # Merge results
    final_results = [
        r for r in results if r["task_id"] not in infra_failed_ids
    ] + retry_results

Error Message Audiences

Different exception types serve different audiences:

| Exception | Primary Audience | Message Characteristics |
| --- | --- | --- |
| AgentError | Agent/Framework | Actionable, with suggestion field |
| EnvironmentError | Developer | Technical, debugging-oriented |
| UserError | Developer | Identifies simulator issue |

Examples:

# AgentError - agent-facing
AgentError(
    "Expected string for 'query', got int",
    suggestion="Provide query as a string"
)

# EnvironmentError - developer-facing
EnvironmentError(
    "Connection failed after 3 retries",
    details={"host": "api.example.com", "timeout": 30}
)

Summary

MASEval's exception hierarchy provides:

  • AgentError: Signals agent input violations
  • EnvironmentError: Signals infrastructure failures
  • UserError: Signals user simulator failures
  • TaskTimeoutError: Signals that a task exceeded its deadline

This distinction enables:

  • Separating failure analysis by source
  • Optional exclusion of infrastructure failures from scoring
  • Targeted rerunning of transient failures
  • Different error message styles for different audiences