Exception Handling
Overview
When running benchmarks, tasks can fail for different reasons. MASEval provides an exception hierarchy that distinguishes between agent failures and infrastructure failures. This distinction gives you the option to analyze different failure modes separately, which can be useful for fair scoring or debugging.
Why Distinguish Failure Types?
Consider a scenario where the agent provides correct inputs but the database connection times out. Without distinguishing failure types, this would appear as an agent failure. The exception hierarchy allows separating these cases when analysis requires it.
Error Types
MASEval defines the following error categories:
| Exception | Source | Default Scoring | Example |
|---|---|---|---|
| AgentError | Agent input | Included | Agent passed wrong type to tool |
| EnvironmentError | Infrastructure | Excluded | Database connection failed |
| UserError | User simulator | Excluded | LLM API unreachable |
| TaskTimeoutError | Timeout | Excluded | Task exceeded deadline |
AgentError
Indicates the agent violated a contract at a controlled boundary:
from maseval import AgentError

def calculate(a: int, b: int, operation: str) -> int:
    # Validate inputs
    if not isinstance(a, int):
        raise AgentError(
            f"Expected int for 'a', got {type(a).__name__}",
            component="calculate",
            suggestion="Provide a as an integer, e.g., a=10"
        )
    if operation not in ("add", "subtract", "multiply"):
        raise AgentError(
            f"Unknown operation: {operation}",
            component="calculate",
            suggestion="Use one of: add, subtract, multiply"
        )
    # ... execution logic
The optional suggestion field provides agent-friendly hints. Some agent frameworks use error messages for automatic retry attempts.
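As a rough sketch of how a framework might act on the suggestion field, the snippet below retries a tool call after letting the agent revise its arguments. It assumes a locally defined stand-in for AgentError (so it runs without MASEval installed); call_with_retry, the fix_arguments callback, and coerce_a are hypothetical names for illustration, not part of MASEval.

```python
from typing import Any, Callable, Dict, Optional


class AgentError(Exception):
    """Local stand-in mirroring the fields used above; not MASEval's class."""

    def __init__(self, message: str, component: Optional[str] = None,
                 suggestion: Optional[str] = None):
        super().__init__(message)
        self.component = component
        self.suggestion = suggestion


def calculate(a: int, b: int) -> int:
    # Reject non-integer input at the boundary, with a hint for the agent
    if not isinstance(a, int):
        raise AgentError(
            f"Expected int for 'a', got {type(a).__name__}",
            component="calculate",
            suggestion="Provide a as an integer, e.g., a=10",
        )
    return a + b


def call_with_retry(tool: Callable[..., Any], args: Dict[str, Any],
                    fix_arguments: Callable[[Dict[str, Any], Optional[str]], Dict[str, Any]],
                    max_attempts: int = 2) -> Any:
    """Call tool(**args); on AgentError, let the agent revise args and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except AgentError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the agent failure
            # Hypothetical hook: the agent reads the hint and revises its call
            args = fix_arguments(args, err.suggestion)


def coerce_a(args: Dict[str, Any], suggestion: Optional[str]) -> Dict[str, Any]:
    # Toy "agent" that reacts to the hint by converting 'a' to int
    return {**args, "a": int(args["a"])}


result = call_with_retry(calculate, {"a": "10", "b": 5}, coerce_a)
print(result)  # 15
```

If the revised call still fails on the last attempt, the AgentError propagates unchanged, so the task is still attributed to the agent.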
EnvironmentError
Indicates infrastructure failure after input validation passed:
from maseval import EnvironmentError

def fetch_data(query: str) -> dict:
    # Input validation passed, now execute
    try:
        return database.query(query)
    except DatabaseTimeoutError as e:
        raise EnvironmentError(
            "Database query timed out",
            component="fetch_data",
            details={"timeout": 30, "query_length": len(query)}
        ) from e
The details dict can include debugging information for developers.
UserError
Indicates user simulation infrastructure failure:
from maseval import UserError

class SimulatedUser:
    def respond(self, agent_message: str) -> str:
        try:
            return self.llm.generate(agent_message)
        except APIError as e:
            raise UserError(
                "User simulator LLM failed",
                component="user_simulator",
                details={"error": str(e)}
            ) from e
The Boundary Pattern
One approach to exception handling places the boundary between agent responsibility and infrastructure responsibility at input validation:
flowchart TD
subgraph TOOL_EXECUTION[" "]
A[Agent passes arguments] --> B{INPUT VALIDATION}
B -->|Fails| C[AgentError]
B -->|Passes| D{EXECUTION}
D -->|Fails| E[EnvironmentError]
D -->|Passes| F[Result]
end
style TOOL_EXECUTION fill:none,stroke:#888
style B fill:#f5f5f5,stroke:#333
style D fill:#f5f5f5,stroke:#333
style C fill:#ffebee,stroke:#c62828
style E fill:#ffebee,stroke:#c62828
style F fill:#e8f5e9,stroke:#2e7d32
With this pattern:
- Validation failures indicate agent-provided bad input (AgentError)
- Execution failures after validation indicate infrastructure issues (EnvironmentError)
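The pattern can be sketched in a single tool function. This is an illustration only: the two exception classes are minimal local stand-ins (not MASEval's classes), and lookup_stock, healthy_backend, broken_backend, and outcome are hypothetical names.

```python
class AgentError(Exception):
    """Stand-in: the agent supplied bad input."""


class EnvironmentError(Exception):  # shadows the builtin alias of OSError
    """Stand-in: the infrastructure failed after validation passed."""


def lookup_stock(item, backend):
    # 1. Input validation: failures here are attributed to the agent
    if not isinstance(item, str) or not item:
        raise AgentError(f"Expected non-empty string for 'item', got {item!r}")

    # 2. Execution: failures here are attributed to the infrastructure
    try:
        return backend(item)
    except ConnectionError as exc:
        raise EnvironmentError("Stock backend unreachable") from exc


def healthy_backend(item):
    return {"apple": 3}.get(item, 0)


def broken_backend(item):
    raise ConnectionError("connection refused")


def outcome(item, backend):
    """Report which side of the boundary a call ended up on."""
    try:
        return f"result={lookup_stock(item, backend)}"
    except AgentError:
        return "agent_error"
    except EnvironmentError:
        return "environment_error"


print(outcome("apple", healthy_backend))  # result=3
print(outcome(42, healthy_backend))       # agent_error
print(outcome("apple", broken_backend))   # environment_error
```

The same valid input ("apple") yields either a result or an environment error depending only on the backend, which is the case the boundary is meant to keep out of the agent's record.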
Validation Helpers
MASEval provides optional utilities for input validation:
from maseval import (
    validate_argument_type,
    validate_required_arguments,
    validate_arguments_from_schema,
)

SCHEMA = {
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}

def search(**kwargs):
    validate_arguments_from_schema(kwargs, SCHEMA, component="search")
    # Execution logic...
These helpers raise AgentError with automatic suggestions:
AgentError: [search] Argument 'limit' expected integer, got string.
Suggestion: Provide limit as an integer, e.g., 10
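To make the idea concrete, here is a minimal sketch of what schema-driven validation could look like. It is not MASEval's implementation: check_against_schema and JSON_TYPES are hypothetical, AgentError is a local stand-in, and the message wording is only illustrative.

```python
class AgentError(Exception):
    """Local stand-in so the sketch runs without MASEval."""

    def __init__(self, message, component=None, suggestion=None):
        super().__init__(message)
        self.component = component
        self.suggestion = suggestion


# JSON Schema type names mapped to Python types (subset, for illustration)
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

SCHEMA = {
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}


def check_against_schema(arguments, schema, component):
    """Hypothetical helper: raise AgentError on missing or mistyped arguments."""
    for name in schema.get("required", []):
        if name not in arguments:
            raise AgentError(
                f"Missing required argument '{name}'",
                component=component,
                suggestion=f"Provide a value for '{name}'",
            )
    for name, value in arguments.items():
        expected = schema.get("properties", {}).get(name, {}).get("type")
        if expected not in JSON_TYPES:
            continue  # undeclared or unsupported type: not checked in this sketch
        # bool is a subclass of int, so reject it unless a boolean is expected
        matches = isinstance(value, JSON_TYPES[expected]) and (
            expected == "boolean" or not isinstance(value, bool)
        )
        if not matches:
            raise AgentError(
                f"Argument '{name}' expected {expected}, got {type(value).__name__}",
                component=component,
                suggestion=f"Provide {name} as a value of type {expected}",
            )


check_against_schema({"query": "cats", "limit": 10}, SCHEMA, "search")  # passes silently
```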
Task Execution Status
Each completed task has a status indicating what happened:
| Status | Description |
|---|---|
| success | Task completed normally |
| agent_error | AgentError was raised |
| environment_error | EnvironmentError was raised |
| user_error | UserError was raised |
| task_timeout | Task exceeded configured timeout |
| evaluation_failed | Evaluator raised an exception |
| setup_failed | Task setup raised an exception |
| unknown_execution_error | Unclassified exception |
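One way to picture how an exception could map to a status is the sketch below. It is an assumption about the mapping, not MASEval's runner: the exception classes are local stand-ins, run_and_classify and STATUS_BY_EXCEPTION are hypothetical, and the setup_failed and evaluation_failed statuses (which depend on the phase in which the exception occurs) are left out.

```python
class AgentError(Exception):
    """Stand-in for illustration."""


class EnvironmentError(Exception):  # shadows the builtin alias of OSError
    """Stand-in for illustration."""


class UserError(Exception):
    """Stand-in for illustration."""


class TaskTimeoutError(Exception):
    """Stand-in for illustration."""


# Checked in order; the first matching exception type decides the status
STATUS_BY_EXCEPTION = [
    (AgentError, "agent_error"),
    (EnvironmentError, "environment_error"),
    (UserError, "user_error"),
    (TaskTimeoutError, "task_timeout"),
]


def run_and_classify(task_fn):
    """Run a zero-argument callable and return a status string for it."""
    try:
        task_fn()
    except Exception as exc:
        for exc_type, status in STATUS_BY_EXCEPTION:
            if isinstance(exc, exc_type):
                return status
        return "unknown_execution_error"  # anything not classified above
    return "success"


def raises(exc):
    """Build a task that fails with the given exception."""
    def task():
        raise exc
    return task


print(run_and_classify(lambda: None))                       # success
print(run_and_classify(raises(AgentError("bad input"))))    # agent_error
print(run_and_classify(raises(ValueError("unexpected"))))   # unknown_execution_error
```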
Scoring Considerations
When computing benchmark metrics, distinguishing between failure modes lets you exclude infrastructure failures from scoring.
The recommended approach is to count agent_error as an agent failure and all other failure statuses as benchmarking failures, i.e., include the former in scoring and exclude the latter.
results = benchmark.run(tasks)
summary = compute_benchmark_metrics(results)
Example output:
Total Tasks: 100
Scored Tasks: 92
Success Rate: 65.22%
Status Breakdown:
success 60
agent_error 8
environment_error 5
user_error 2
...
The success rate (65.22%) reflects 60 / 92 rather than 60 / 100.
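The arithmetic behind such a summary can be sketched as follows. This is not the implementation of compute_benchmark_metrics: success_rate and SCORED_STATUSES are hypothetical, and the choice to score only success and agent_error is an assumption based on the recommendation above.

```python
from collections import Counter

# Assumption: only these statuses count toward the score
SCORED_STATUSES = {"success", "agent_error"}


def success_rate(statuses, scored_statuses=SCORED_STATUSES):
    """Return (scored_count, rate), where rate = successes / scored tasks."""
    counts = Counter(statuses)
    scored = sum(n for status, n in counts.items() if status in scored_statuses)
    if scored == 0:
        return 0, None  # nothing to score, so the rate is undefined
    return scored, counts["success"] / scored


# Toy run: 3 successes, 1 agent failure, 1 infrastructure failure
statuses = ["success"] * 3 + ["agent_error"] + ["environment_error"]
scored, rate = success_rate(statuses)
print(scored, f"{rate:.2%}")  # 4 75.00%

# The figure in the example output above: 60 successes over 92 scored tasks
print(f"{60 / 92:.2%}")  # 65.22%
```

In the toy run the infrastructure failure drops out of the denominator, so the rate is 3 / 4 rather than 3 / 5.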
Rerunning Failed Tasks
Infrastructure errors are often transient. Tasks with infrastructure failures can be rerun:
results = benchmark.run(tasks)

# Identify infrastructure failures
infra_failed_ids = [
    r["task_id"] for r in results
    if r["status"] in ("environment_error", "user_error", "unknown_execution_error")
]

if infra_failed_ids:
    # Filter and rerun
    retry_tasks = tasks.filter(lambda t: t.id in infra_failed_ids)
    retry_results = benchmark.run(retry_tasks)

    # Merge results
    final_results = [
        r for r in results if r["task_id"] not in infra_failed_ids
    ] + retry_results
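Because a single rerun may hit the same transient failure again, the idea can be extended to a bounded number of rounds. The sketch below is hypothetical and independent of the MASEval API: run_with_reruns takes any run callable that maps a list of task IDs to result dicts with task_id and status keys, and flaky_run is a toy runner that fails one task once.

```python
INFRA_STATUSES = {"environment_error", "user_error", "unknown_execution_error"}


def run_with_reruns(run, task_ids, max_rounds=3):
    """Run tasks, rerunning infrastructure failures; max_rounds includes the first run."""
    latest = {}  # task_id -> most recent result
    pending = list(task_ids)
    for _ in range(max_rounds):
        if not pending:
            break
        for result in run(pending):
            latest[result["task_id"]] = result
        # Only tasks whose latest result is an infrastructure failure go again
        pending = [tid for tid in pending if latest[tid]["status"] in INFRA_STATUSES]
    return [latest[tid] for tid in task_ids]


attempts = {}


def flaky_run(ids):
    """Toy runner: task 't2' fails with environment_error on its first attempt only."""
    results = []
    for tid in ids:
        attempts[tid] = attempts.get(tid, 0) + 1
        failed = tid == "t2" and attempts[tid] == 1
        status = "environment_error" if failed else "success"
        results.append({"task_id": tid, "status": status})
    return results


final = run_with_reruns(flaky_run, ["t1", "t2", "t3"])
print([r["status"] for r in final])  # ['success', 'success', 'success']
print(attempts)                      # {'t1': 1, 't2': 2, 't3': 1}
```

Tasks that still fail after the last round keep their infrastructure status, so they can be excluded from scoring as described earlier.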
Error Message Audiences
Different exception types serve different audiences:
| Exception | Primary Audience | Message Characteristics |
|---|---|---|
| AgentError | Agent/Framework | Actionable, with suggestion field |
| EnvironmentError | Developer | Technical, debugging-oriented |
| UserError | Developer | Identifies simulator issue |
Examples:
# AgentError - agent-facing
AgentError(
    "Expected string for 'query', got int",
    suggestion="Provide query as a string"
)

# EnvironmentError - developer-facing
EnvironmentError(
    "Connection failed after 3 retries",
    details={"host": "api.example.com", "timeout": 30}
)
Summary
MASEval's exception hierarchy provides:
- AgentError: Signals agent input violations
- EnvironmentError: Signals infrastructure failures
- UserError: Signals user simulator failures
This distinction enables:
- Separating failure analysis by source
- Optional exclusion of infrastructure failures from scoring
- Targeted rerunning of transient failures
- Different error message styles for different audiences