5 A Day Benchmark¶

This page is available as a Jupyter notebook: clone the repo and run it yourself!

What This Notebook Demonstrates

This is a complete, production-ready example of evaluating multi-agent systems using the MASEval library. By the end, you'll understand how to:

Build multi-agent systems with orchestrators and specialists
Create framework-agnostic tools that work across agent libraries
Organize systematic evaluation using Tasks, Environments, and Benchmarks
Implement custom evaluators (unit tests, LLM judges, pattern matching)
Run reproducible benchmarks with automatic tool tracing

Prerequisites: Familiarity with LLM agents and basic Python programming.

The 5-A-Day Benchmark¶

We implement 5 diverse tasks representing real-world agent scenarios:

Task	Scenario	Tools	Evaluation
0	Email & Banking	email, banking	LLM judge, pattern matching
1	Finance Calculation	stock_price, calculator, family_info	Arithmetic verification
2	Code Generation	python_executor	Unit tests, complexity analysis
3	Calendar Scheduling	calendar	Slot matching, logic validation
4	Hotel Optimization	hotel_search	Ranking, search strategy

We'll use a multi-agent architecture: an orchestrator delegates to specialized agents (banking specialist, email specialist, etc.), demonstrating MASEval's support for complex agent systems.

Part 1: Understanding Multi-Agent Systems¶

Before diving into MASEval, let's build the multi-agent system we'll be evaluating.

1.1 Imports and Setup¶

We use smolagents as our agent framework (MASEval also supports LangGraph and LlamaIndex).

The imports below include:

Standard libraries: json, os, sys, and pathlib.Path for file handling and path setup
Helper utilities: Functions from this example's utils.py and tools.py
smolagents: The agent framework we'll use
MASEval core: The evaluation orchestration library

# ruff: noqa E402
# Setup: Set working directory to project root for proper imports
# This must happen FIRST before any other imports
import os
import sys
from pathlib import Path
import json
from typing import Any, Dict, List, Optional, Sequence
from rich.console import Console
from rich.panel import Panel

# Determine notebook directory and set working directory to project root
_notebook_dir = Path(__file__).parent if "__file__" in dir() else Path.cwd()
if _notebook_dir.name == "five_a_day_benchmark":
    _project_root = _notebook_dir.parent.parent
    os.chdir(_project_root)
    # Add project root to path so `examples.five_a_day_benchmark.*` imports work
    if str(_project_root) not in sys.path:
        sys.path.insert(0, str(_project_root))
    # Also add the example directory for local imports (utils, tools, evaluators)
    if str(_notebook_dir) not in sys.path:
        sys.path.insert(0, str(_notebook_dir))
    print(f"Working directory set to: {os.getcwd()}")


# Utility functions from this example
# - sanitize_name(): Cleans agent names for framework compatibility
from utils import sanitize_name

# Tool collection classes and helpers
# - EmailToolCollection, BankingToolCollection: Pre-built tool groups
# - filter_tool_adapters_by_prefix(): Selects tools by name prefix
# - get_states(): Initializes tool state objects (email inboxes, bank accounts, etc.)
from tools import (
    EmailToolCollection,
    BankingToolCollection,
    CalculatorToolCollection,
    CodeExecutionToolCollection,
    FamilyInfoToolCollection,
    StockPriceToolCollection,
    CalendarToolCollection,
    HotelSearchToolCollection,
    MCPCalendarToolCollection,
    filter_tool_adapters_by_prefix,
    get_states,
)

# smolagents: Our chosen agent framework
from smolagents import ToolCallingAgent, LiteLLMModel, FinalAnswerTool

# MASEval core components
from maseval import Benchmark, Environment, Task, TaskQueue, AgentAdapter, Evaluator, ModelAdapter, SeedGenerator
from maseval.interface.agents.smolagents import SmolAgentAdapter

# Import evaluators module (dynamically loaded later)
import evaluators


def load_benchmark_data(
    config_type: str = "multi",
    framework: str = "smolagents",
    model_id: str = "gemini-2.5-flash",
    temperature: float = 0.7,
    limit: int | None = None,
    task_indices: list[int] | None = None,
) -> tuple[TaskQueue, list[Dict[str, Any]]]:
    """Load tasks and agent configurations.

    Args:
        config_type: 'single' or 'multi' agent configuration
        framework: Agent framework to use
        model_id: Model identifier
        temperature: Model temperature
        limit: Optional limit on number of tasks (None = all 5)
        task_indices: Optional list of task indices to load (e.g., [0, 2, 4])

    Returns:
        Tuple of (TaskQueue, list of agent configs)
    """
    data_dir = Path("examples/five_a_day_benchmark/data")

    with open(data_dir / "tasks.json", "r") as f:
        tasks_raw = json.load(f)
    with open(data_dir / f"{config_type}agent.json", "r") as f:
        configs_raw = json.load(f)

    # Apply limit first (an explicit limit=0 yields no tasks)
    if limit is not None:
        tasks_raw = tasks_raw[:limit]
        configs_raw = configs_raw[:limit]

    # Then apply task_indices filter if specified
    if task_indices is not None:
        tasks_raw = [tasks_raw[i] for i in task_indices if i < len(tasks_raw)]
        configs_raw = [configs_raw[i] for i in task_indices if i < len(configs_raw)]

    tasks_data = []
    configs_data = []

    for task_dict, config in zip(tasks_raw, configs_raw):
        task_id = task_dict["metadata"]["task_id"]
        task_dict["environment_data"]["agent_framework"] = framework

        # Create Task object with id from metadata
        tasks_data.append(
            Task(
                query=task_dict["query"],
                id=task_id,
                environment_data=task_dict["environment_data"],
                evaluation_data=task_dict["evaluation_data"],
                metadata=task_dict["metadata"],
            )
        )

        # Enrich config with framework and model info
        config["framework"] = framework
        config["model_config"] = {"model_id": model_id, "temperature": temperature}

        configs_data.append(config)

    return TaskQueue(tasks_data), configs_data
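
The interaction between limit and task_indices is easy to misread, so here is a minimal sketch of the same order of operations on a plain list. The helper name select_items is hypothetical and not part of the example's code; it only illustrates that indices are resolved against the already-truncated list and that out-of-range indices are skipped silently.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_items(
    items: Sequence[T],
    limit: Optional[int] = None,
    task_indices: Optional[Sequence[int]] = None,
) -> list:
    """Truncate first, then pick indices (hypothetical helper for illustration)."""
    selected = list(items)
    if limit is not None:
        selected = selected[:limit]
    if task_indices is not None:
        # Indices refer to positions in the truncated list;
        # indices past its end are dropped without an error.
        selected = [selected[i] for i in task_indices if i < len(selected)]
    return selected


tasks = ["task_0", "task_1", "task_2", "task_3", "task_4"]
print(select_items(tasks, limit=3, task_indices=[0, 2, 4]))  # ['task_0', 'task_2']
print(select_items(tasks, task_indices=[4, 1]))              # ['task_4', 'task_1']
```

In practice this means that combining limit=3 with task_indices=[0, 2, 4] returns only two tasks, because index 4 no longer exists after truncation.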

1.2 Model Factory¶

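We need LLMs to power our agents. This factory function creates models using LiteLLM, which provides a unified interface to many providers (OpenAI, Anthropic, Google, etc.). As written, the factory is wired to Gemini: it prepends the gemini/ provider prefix to the model name and reads the API key from the GOOGLE_API_KEY environment variable.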

import litellm

# Tell litellm to drop unsupported params (like 'seed' for Gemini)
litellm.drop_params = True


def get_model(model_id: str, temperature: float = 0.7, seed: int | None = None):
    """Create a model instance compatible with smolagents.

    Args:
        model_id: Gemini model name without the provider prefix (e.g., 'gemini-2.5-flash')
        temperature: Sampling temperature (0.0 = near-deterministic, higher = more varied)
        seed: Random seed for reproducible outputs (dropped by litellm for providers that don't support it)

    Returns:
        LiteLLMModel configured for smolagents
    """
    return LiteLLMModel(
        model_id=f"gemini/{model_id}",  # Prefix determines provider
        api_key=os.getenv("GOOGLE_API_KEY"),
        temperature=temperature,
        seed=seed,  # Will be dropped by litellm for providers that don't support it
    )


# Test the model factory
model = get_model("gemini-2.5-flash", temperature=0.7, seed=42)
print(f"Created model: {model.model_id}")
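
If you want to point the factory at a different provider, one possible approach is to look up the prefix and API-key variable from a small table instead of hard-coding them. The sketch below is an assumption, not part of the example: PROVIDERS and build_model_kwargs are hypothetical names, and the function only assembles the keyword arguments that get_model() passes to LiteLLMModel, so it runs without any agent library installed.

```python
import os
from typing import Any, Dict, Optional

# Hypothetical provider table for illustration; adjust to the providers you use.
PROVIDERS: Dict[str, Dict[str, str]] = {
    "gemini": {"prefix": "gemini", "api_key_env": "GOOGLE_API_KEY"},
    "openai": {"prefix": "openai", "api_key_env": "OPENAI_API_KEY"},
}


def build_model_kwargs(
    provider: str,
    model_id: str,
    temperature: float = 0.7,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments in the shape get_model() uses above."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    spec = PROVIDERS[provider]
    kwargs: Dict[str, Any] = {
        "model_id": f"{spec['prefix']}/{model_id}",  # provider prefix + model name
        "api_key": os.getenv(spec["api_key_env"]),   # None if the variable is unset
        "temperature": temperature,
    }
    if seed is not None:
        kwargs["seed"] = seed  # only forwarded when explicitly requested
    return kwargs


kwargs = build_model_kwargs("gemini", "gemini-2.5-flash", seed=42)
print(kwargs["model_id"])  # gemini/gemini-2.5-flash
```

The resulting dictionary could then be unpacked into the model constructor, for example LiteLLMModel(**kwargs).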

1.3 Loading Task Data¶

We use the load_benchmark_data() function to load tasks and agent configurations. The call below loads all five tasks; we then pick Task 0 (Email & Banking) to examine its structure.

# Load the tasks and agent configs (Part 1 only uses Task 0)
tasks, agent_configs = load_benchmark_data(
    config_type="multi",
    framework="smolagents",
    model_id="gemini-2.5-flash",
    temperature=0.7,
)

# Extract the first task and its config
task_0: Task = tasks[0]
config_0: Dict[str, Any] = agent_configs[0]

print("=" * 60)
print("TASK 0: Email & Banking")
print("=" * 60)
print(f"\nUser Query:\n{task_0.query}\n")
print(f"Required Tools: {task_0.environment_data['tools']}")
print(f"\nEvaluators: {task_0.evaluation_data['evaluators']}")
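
To make the task structure concrete, here is a sketch of the fields that load_benchmark_data() and the cell above read from each entry in tasks.json. Only the key names come from the code in this notebook; every value is a placeholder, and missing_fields is a hypothetical helper rather than part of the example.

```python
from typing import Any, Dict, List

# Top-level keys that load_benchmark_data() passes to Task(...).
REQUIRED_KEYS = ("query", "environment_data", "evaluation_data", "metadata")


def missing_fields(task_dict: Dict[str, Any]) -> List[str]:
    """Return the fields the notebook relies on that are absent from task_dict."""
    missing = [key for key in REQUIRED_KEYS if key not in task_dict]
    if "tools" not in task_dict.get("environment_data", {}):
        missing.append("environment_data.tools")
    if "evaluators" not in task_dict.get("evaluation_data", {}):
        missing.append("evaluation_data.evaluators")
    if "task_id" not in task_dict.get("metadata", {}):
        missing.append("metadata.task_id")
    return missing


# Placeholder values only; the real content lives in data/tasks.json.
example_task = {
    "query": "Placeholder user request",
    "environment_data": {"tools": ["email", "banking"]},
    "evaluation_data": {"evaluators": ["PlaceholderEvaluator"]},
    "metadata": {"task_id": "placeholder_task"},
}

print(missing_fields(example_task))  # []
```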

Multi-Agent Configuration¶

For this task, we use 3 agents:

Orchestrator - Coordinates specialists
Banking Specialist - Handles financial data
Email Specialist - Manages email operations

print("Multi-Agent Setup:")
print(f"Agent Type: {config_0['agent_type']}")
print(f"Primary Agent: {config_0['primary_agent_id']}\n")

for i, agent_spec in enumerate(config_0["agents"], 1):
    print(f"{i}. {agent_spec['agent_name']} (ID: {agent_spec['agent_id']})")
    print(f"   Tools: {agent_spec['tools'] if agent_spec['tools'] else 'None (delegates only)'}")
    print(f"   Role: {agent_spec['agent_instruction'][:80]}...")
    print()
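
The configuration printed above can be pictured as a dictionary with one primary agent and a list of agent specs. The following sketch uses the key names that appear in this notebook (agent_type, primary_agent_id, agents, agent_id, agent_name, tools, agent_instruction); the IDs, the agent_type value, and the instructions are placeholders, and split_agents is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def split_agents(config: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Separate the primary (orchestrator) spec from the specialist specs."""
    primary_id = config["primary_agent_id"]
    primary = next(a for a in config["agents"] if a["agent_id"] == primary_id)
    specialists = [a for a in config["agents"] if a["agent_id"] != primary_id]
    return primary, specialists


# Placeholder configuration; real values come from the multi-agent JSON file.
example_config = {
    "agent_type": "placeholder_type",
    "primary_agent_id": "orchestrator",
    "agents": [
        {"agent_id": "orchestrator", "agent_name": "Orchestrator",
         "tools": [], "agent_instruction": "Coordinate the specialists."},
        {"agent_id": "banking", "agent_name": "Banking Specialist",
         "tools": ["banking"], "agent_instruction": "Handle financial data."},
        {"agent_id": "email", "agent_name": "Email Specialist",
         "tools": ["email"], "agent_instruction": "Manage email operations."},
    ],
}

primary, specialists = split_agents(example_config)
print(primary["agent_name"])                    # Orchestrator
print([s["agent_name"] for s in specialists])   # ['Banking Specialist', 'Email Specialist']
```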

1.4 Creating Tools for Specialist Agents¶

Tools are functions agents can call. We'll create email and banking tools for our specialists.

Key insight: Our tools are framework-agnostic BaseTool objects that can be converted to each supported framework (smolagents, LangGraph, LlamaIndex).

# Initialize state objects from task data
# These hold the actual data (emails, bank transactions) that tools operate on
env_data = task_0.environment_data.copy()
states = get_states(env_data["tools"], env_data)

# Create tool collections (tools are examples)
email_tools = EmailToolCollection(states["email_state"])
banking_tools = BankingToolCollection(states["banking_state"])

# Convert to smolagents format (returns tool adapters with tracing support)
email_adapters = [tool.to_smolagents() for tool in email_tools.get_sub_tools()]
banking_adapters = [tool.to_smolagents() for tool in banking_tools.get_sub_tools()]

# Extract raw smolagents Tool objects
all_tool_adapters = email_adapters + banking_adapters
all_tools = [adapter.tool for adapter in all_tool_adapters]

print(f"Created {len(all_tools)} tools:")
for tool in all_tools:
    print(f"  - {tool.name}: {tool.description[:60]}...")
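
The idea of a framework-agnostic tool wrapped in a tracing adapter can be sketched in a few lines. This is not MASEval's implementation: SketchTool, TracedAdapter, and to_traced are hypothetical names, and the example only shows the general pattern of keeping the tool logic in a plain callable while the adapter records every invocation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class SketchTool:
    """Framework-agnostic tool: a name, a description, and a plain callable."""

    name: str
    description: str
    func: Callable[..., Any]

    def to_traced(self) -> "TracedAdapter":
        # A real conversion would also build the framework's own tool object.
        return TracedAdapter(tool=self)


@dataclass
class TracedAdapter:
    """Wraps a tool and records each call so it can be inspected later."""

    tool: SketchTool
    calls: List[Dict[str, Any]] = field(default_factory=list)

    def __call__(self, **kwargs: Any) -> Any:
        result = self.tool.func(**kwargs)
        self.calls.append({"tool": self.tool.name, "inputs": kwargs, "output": result})
        return result


balance_tool = SketchTool(
    name="banking_get_balance",
    description="Return a fixed placeholder balance.",
    func=lambda account: 100.0,
)
adapter = balance_tool.to_traced()
adapter(account="checking")
print(adapter.calls)
```

Because the trace lives on the adapter rather than in the tool, the same tool logic can be reused across frameworks while evaluators read the recorded calls.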

1.5 Building the Multi-Agent System¶

Now we build our 3 agents:

Specialist agents get tools + FinalAnswerTool() (to return results)
Orchestrator gets specialists as managed_agents (can delegate to them)

# Build the orchestrator and its specialist agents
def build_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    seeds: Optional[Dict[str, int]] = None,
) -> tuple[list[ToolCallingAgent], Dict[str, ToolCallingAgent]]:
    """Create multi-agent system with orchestrator and specialists.

    Args:
        agent_data: Agent configuration data
        environment: Environment with tools
        seeds: Optional dict mapping agent_id to seed values

    Returns:
        Tuple of (agents_to_run, agents_to_monitor)
    """
    model_id = agent_data["model_config"]["model_id"]
    temperature = agent_data["model_config"]["temperature"]

    primary_agent_id = agent_data["primary_agent_id"]
    agents_specs = agent_data["agents"]
    all_tool_adapters = environment.get_tools()  # Dict mapping tool name -> tool adapter

    # Build specialists first
    specialist_agents = []
    for agent_spec in agents_specs:
        if agent_spec["agent_id"] == primary_agent_id:
            continue

        seed = seeds.get(agent_spec["agent_id"]) if seeds else None
        model = get_model(model_id, temperature, seed)
        spec_tool_adapters = filter_tool_adapters_by_prefix(all_tool_adapters, agent_spec["tools"])
        spec_tools = [adapter.tool for adapter in spec_tool_adapters.values()]
        spec_tools.append(FinalAnswerTool())

        specialist = ToolCallingAgent(
            model=model,
            tools=spec_tools,
            name=sanitize_name(agent_spec["agent_name"]),
            description=agent_spec["agent_instruction"],
            instructions=agent_spec["agent_instruction"],
            verbosity_level=0,
        )
        specialist_agents.append(specialist)

    # Build orchestrator
    primary_spec = next(a for a in agents_specs if a["agent_id"] == primary_agent_id)
    primary_seed = seeds.get(primary_agent_id) if seeds else None
    primary_model = get_model(model_id, temperature, primary_seed)

    orchestrator = ToolCallingAgent(
        model=primary_model,
        tools=[FinalAnswerTool()],
        managed_agents=specialist_agents if specialist_agents else None,
        name=sanitize_name(primary_spec["agent_name"]),
        instructions=primary_spec["agent_instruction"],
        verbosity_level=0,
    )

    return [orchestrator], {agent.name: agent for agent in specialist_agents}
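
build_agents() relies on filter_tool_adapters_by_prefix() from tools.py to give each specialist only its own tools. Its implementation is not shown here, so the sketch below is an assumption about its behavior: it keeps every adapter whose name starts with one of the given prefixes. The function name filter_by_prefix and the tool names are hypothetical.

```python
from typing import Any, Dict, Sequence


def filter_by_prefix(adapters: Dict[str, Any], prefixes: Sequence[str]) -> Dict[str, Any]:
    """Keep adapters whose name starts with any prefix (assumed behavior)."""
    return {
        name: adapter
        for name, adapter in adapters.items()
        if any(name.startswith(prefix) for prefix in prefixes)
    }


# Placeholder adapters keyed by hypothetical tool names.
all_adapters = {
    "email_send": object(),
    "email_read": object(),
    "banking_get_balance": object(),
}

print(list(filter_by_prefix(all_adapters, ["email"])))  # ['email_send', 'email_read']
print(list(filter_by_prefix(all_adapters, [])))         # []
```

Under this assumption, an orchestrator with an empty tools list receives no tool adapters at all and can only delegate.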

Part 2: Organizing Evaluation with MASEval¶

We've built a multi-agent system. Now let's see how MASEval helps us evaluate it systematically across multiple tasks.

MASEval provides three key abstractions:

Task - A single evaluation scenario (query + environment + evaluation criteria)
Environment - Manages tools and state for a task
Benchmark - Orchestrates running agents on tasks and collecting results

2.1 The Environment Class¶

An Environment creates and manages tools for each task. It also enables automatic tool tracing.

Key methods:

setup_state(): Initialize tool state (email inboxes, bank accounts, etc.)
create_tools(): Create and convert tools to framework-specific format

class FiveADayEnvironment(Environment):
    """Environment that creates framework-specific tools from environment data."""

    def __init__(self, environment_data: Dict[str, Any], callbacks: List | None = None):
        super().__init__(environment_data, callbacks)

    def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from environment data."""
        env_data = environment_data.copy()
        tool_names = env_data.get("tools", [])

        # Create state objects (e.g., email inboxes, bank accounts)
        states = get_states(tool_names, env_data)
        env_data.update(states)

        return env_data

    def create_tools(self) -> Dict[str, Any]:
        """Create and convert tools to framework-specific format, keyed by name."""
        tools_dict: Dict[str, Any] = {}

        # Map tool names to their collection classes
        tool_mapping = {
            "email": (EmailToolCollection, lambda: (self.state["email_state"],)),
            "banking": (BankingToolCollection, lambda: (self.state["banking_state"],)),
            "calculator": (CalculatorToolCollection, lambda: ()),
            "python_executor": (CodeExecutionToolCollection, lambda: (self.state["python_executor_state"],)),
            "family_info": (FamilyInfoToolCollection, lambda: (self.state["family_info"],)),
            "stock_price": (StockPriceToolCollection, lambda: (self.state["stock_price_lookup"],)),
            "calendar": (CalendarToolCollection, lambda: (self.state["calendar_state"],)),
            "hotel_search": (HotelSearchToolCollection, lambda: (self.state["hotel_search_state"],)),
            "my_calendar_mcp": (MCPCalendarToolCollection, lambda: (self.state["my_calendar_mcp_state"],)),
            "other_calendar_mcp": (MCPCalendarToolCollection, lambda: (self.state["other_calendar_mcp_state"],)),
        }

        for tool_name in self.state["tools"]:
            if tool_name in tool_mapping:
                ToolClass, get_init_args = tool_mapping[tool_name]
                tool_instance = ToolClass(*get_init_args())

                # Get base tools and convert to framework format
                for base_tool in tool_instance.get_sub_tools():
                    framework_tool = base_tool.to_smolagents()
                    tool_key = getattr(base_tool, "name", None) or str(type(base_tool).__name__)
                    tools_dict[tool_key] = framework_tool

        return tools_dict
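
One detail in create_tools() is worth a closer look: the constructor arguments in tool_mapping are wrapped in lambdas. A likely reason is that a task's state only contains entries for the tools it uses, so looking up every state key eagerly would raise a KeyError. The sketch below illustrates that deferred lookup with hypothetical stand-in classes (EmailTools, BankingTools) and a placeholder state.

```python
from typing import Any, Callable, Dict, Tuple


class EmailTools:
    def __init__(self, email_state: Dict[str, Any]):
        self.email_state = email_state


class BankingTools:
    def __init__(self, banking_state: Dict[str, Any]):
        self.banking_state = banking_state


# This task only uses email, so there is no "banking_state" key.
state: Dict[str, Any] = {"tools": ["email"], "email_state": {"inbox": []}}

# Each lambda defers the state lookup until the tool is actually requested.
tool_mapping: Dict[str, Tuple[type, Callable[[], tuple]]] = {
    "email": (EmailTools, lambda: (state["email_state"],)),
    "banking": (BankingTools, lambda: (state["banking_state"],)),  # not evaluated below
}

created: Dict[str, Any] = {}
for tool_name in state["tools"]:
    if tool_name in tool_mapping:
        tool_class, get_init_args = tool_mapping[tool_name]
        created[tool_name] = tool_class(*get_init_args())

print(list(created))  # ['email']
```

Writing the mapping with direct lookups such as state["banking_state"] would fail while the dictionary is being built, even though the banking tool is never requested.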

2.2 Check Agents¶

Let's verify our agent setup by building agents for the first task and inspecting their configuration.

First, we check the agent config.

print(f"{config_0['task_description']}")

for i, agent_spec in enumerate(config_0["agents"], 1):
    print(f"{i}. {agent_spec['agent_name']} (ID: {agent_spec['agent_id']})")
    print(f"   Tools: {agent_spec['tools'] if agent_spec['tools'] else 'None (delegates only)'}")
    print(f"   Role: {agent_spec['agent_instruction'][:80]}...")
    print()

Now we build the agents for the first task.

# Build the agents for task 0
# Note: model_config is already set by load_benchmark_data()

# Create environment from environment data
environment_0 = FiveADayEnvironment(task_0.environment_data)

# Build agents using the build_agents function (no seeds for this demo)
agents_to_run, agents_to_monitor = build_agents(config_0, environment_0)

print(f"\nBuilt Agents for Task: {task_0.id}")
print(f"{'=' * 60}")
print(f"\nAgents to run: {[agent.name for agent in agents_to_run]}")
print(f"Agents to monitor: {list(agents_to_monitor.keys())}")

# Print details for each agent
for agent in agents_to_run:
    print(f"\n  Agent: {agent.name}")
    # smolagents stores tools as a dict with string keys
    print(f"    Tools: {list(agent.tools.keys())}")
    if hasattr(agent, "managed_agents") and agent.managed_agents:
        # managed_agents is also a dict with string keys
        print(f"    Managed agents: {list(agent.managed_agents.keys())}")
        for agent_name, managed in agent.managed_agents.items():
            print(f"      - {managed.name}: {list(managed.tools.keys())}")

print("\nAll agents built successfully.")

2.3 The Benchmark Class¶

A Benchmark orchestrates the entire evaluation process. It implements 5 key methods:

setup_environment() - Create tools for a task
setup_agents() - Build agents with appropriate tools
setup_evaluators() - Create task-specific evaluators
run_agents() - Execute agents and collect responses
evaluate() - Run evaluators on agent outputs

class FiveADayBenchmark(Benchmark):
    """5-A-Day benchmark with multi-agent support."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task, seed_generator: SeedGenerator) -> Environment:
        """Create environment from environment data."""
        environment = FiveADayEnvironment(task.environment_data)

        # Register all tools for tracing
        for tool_name, tool_adapter in environment.get_tools().items():
            self.register("tools", tool_name, tool_adapter)

        return environment

    def setup_agents(
        self,
        agent_data: Dict[str, Any],
        environment: Environment,
        task: Task,
        user,
        seed_generator: SeedGenerator,
    ) -> tuple[list[SmolAgentAdapter], Dict[str, SmolAgentAdapter]]:
        """Create multi-agent system with orchestrator and specialists.

        Seeds are derived for each agent using the benchmark's seeding system
        with hierarchical paths. derive_seed() returns None if seeding is disabled.
        """
        # Build seeds dict using seed_generator
        # Use child("agents") to create logical paths like "agents/primary_agent"
        agent_gen = seed_generator.child("agents")
        seeds = {}
        for agent_spec in agent_data["agents"]:
            seeds[agent_spec["agent_id"]] = agent_gen.derive_seed(agent_spec["agent_id"])

        agents_to_run, agents_to_monitor = build_agents(agent_data, environment, seeds)

        # Create adapters for the primary agent(s) to run
        adapters_to_run = [SmolAgentAdapter(agent, agent.name) for agent in agents_to_run]

        # This ensures all agent traces are collected by the benchmark
        all_agents = {agent.name: agent for agent in agents_to_run} | agents_to_monitor
        adapters_to_monitor = {name: SmolAgentAdapter(agent, name) for name, agent in all_agents.items()}
        return adapters_to_run, adapters_to_monitor

    def setup_evaluators(self, environment, task, agents, user, seed_generator: SeedGenerator) -> Sequence[Evaluator]:
        """Create evaluators based on task's evaluation criteria."""
        if not task.evaluation_data["evaluators"]:
            return []

        evaluator_instances = []
        for name in task.evaluation_data["evaluators"]:
            evaluator_class = getattr(evaluators, name)
            evaluator_instances.append(evaluator_class(task, environment, user))

        return evaluator_instances

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Sequence[Any]:
        """Execute agents and return their final answers."""
        answers = [agent.run(query) for agent in agents]
        return answers

    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:
        """Return a model adapter for benchmark components that need LLM access.

        This benchmark doesn't use simulated tools, user simulators, or LLM judges,
        so this method is not called during execution.
        """
        raise NotImplementedError("This benchmark doesn't use model adapters for tools/users/evaluators.")

    def evaluate(
        self,
        evaluators: Sequence[Evaluator],
        agents: Dict[str, AgentAdapter],
        final_answer: Any,
        traces: Dict[str, Any],
    ) -> list[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            filtered_traces = evaluator.filter_traces(traces)
            results.append(evaluator(filtered_traces, final_answer))
        return results
class FiveADayBenchmark(Benchmark):
    """5-A-Day benchmark with multi-agent support."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task, seed_generator: SeedGenerator) -> Environment:
        """Create environment from environment data."""
        environment = FiveADayEnvironment(task.environment_data)

        # Register all tools for tracing
        for tool_name, tool_adapter in environment.get_tools().items():
            self.register("tools", tool_name, tool_adapter)

        return environment

    def setup_agents(
        self,
        agent_data: Dict[str, Any],
        environment: Environment,
        task: Task,
        user,
        seed_generator: SeedGenerator,
    ) -> tuple[list[SmolAgentAdapter], Dict[str, SmolAgentAdapter]]:
        """Create multi-agent system with orchestrator and specialists.

        Seeds are derived for each agent using the benchmark's seeding system
        with hierarchical paths. derive_seed() returns None if seeding is disabled.
        """
        # Build seeds dict using seed_generator
        # Use child("agents") to create logical paths like "agents/primary_agent"
        agent_gen = seed_generator.child("agents")
        seeds = {}
        for agent_spec in agent_data["agents"]:
            seeds[agent_spec["agent_id"]] = agent_gen.derive_seed(agent_spec["agent_id"])

        agents_to_run, agents_to_monitor = build_agents(agent_data, environment, seeds)

        # Create adapters for the primary agent(s) to run
        adapters_to_run = [SmolAgentAdapter(agent, agent.name) for agent in agents_to_run]

        # Wrap every agent (primary and managed) in an adapter so the benchmark collects all of their traces
        all_agents = {agent.name: agent for agent in agents_to_run} | agents_to_monitor
        adapters_to_monitor = {name: SmolAgentAdapter(agent, name) for name, agent in all_agents.items()}
        return adapters_to_run, adapters_to_monitor

    def setup_evaluators(self, environment, task, agents, user, seed_generator: SeedGenerator) -> Sequence[Evaluator]:
        """Create evaluators based on task's evaluation criteria."""
        if not task.evaluation_data["evaluators"]:
            return []

        evaluator_instances = []
        for name in task.evaluation_data["evaluators"]:
            evaluator_class = getattr(evaluators, name)
            evaluator_instances.append(evaluator_class(task, environment, user))

        return evaluator_instances

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Sequence[Any]:
        """Execute agents and return their final answers."""
        answers = [agent.run(query) for agent in agents]
        return answers

    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:
        """Return a model adapter for benchmark components that need LLM access.

        This benchmark doesn't use simulated tools, user simulators, or LLM judges,
        so this method is not called during execution.
        """
        raise NotImplementedError("This benchmark doesn't use model adapters for tools/users/evaluators.")

    def evaluate(
        self,
        evaluators: Sequence[Evaluator],
        agents: Dict[str, AgentAdapter],
        final_answer: Any,
        traces: Dict[str, Any],
    ) -> list[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            filtered_traces = evaluator.filter_traces(traces)
            results.append(evaluator(filtered_traces, final_answer))
        return results
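
The hierarchical seeding used in `setup_agents()` can be pictured with a small sketch. This is not MASEval's `DefaultSeedGenerator`: `PathSeedGenerator` below is a hypothetical class that assumes seeds come from hashing the root seed together with a slash-separated path, and the agent IDs are made up. It only illustrates why `child("agents")` followed by `derive_seed(agent_id)` can give each agent a stable, distinct seed, and why `None` comes back when seeding is disabled.

```python
import hashlib
from typing import Optional


class PathSeedGenerator:
    """Hypothetical path-based seed generator (illustration only)."""

    def __init__(self, root_seed: Optional[int], path: str = "") -> None:
        self.root_seed = root_seed
        self.path = path

    def child(self, name: str) -> "PathSeedGenerator":
        # Extend the logical path, e.g. "" -> "agents"
        new_path = f"{self.path}/{name}" if self.path else name
        return PathSeedGenerator(self.root_seed, new_path)

    def derive_seed(self, name: str) -> Optional[int]:
        if self.root_seed is None:
            return None  # seeding disabled
        full_path = f"{self.path}/{name}" if self.path else name
        digest = hashlib.sha256(f"{self.root_seed}:{full_path}".encode()).digest()
        return int.from_bytes(digest[:4], "big")  # truncate to a 32-bit seed


agent_gen = PathSeedGenerator(42).child("agents")
seeds = {
    agent_id: agent_gen.derive_seed(agent_id)
    for agent_id in ["orchestrator", "banking_specialist"]  # made-up agent IDs
}
print(seeds)
```

Because the seed depends only on the root seed and the path, adding or reordering agents does not change the seeds of the others, which is the property a path-based scheme is usually chosen for.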

2.4 Loading All Tasks

Now let's load all 5 tasks to run the full benchmark. We reuse load_benchmark_data() without specifying task_indices to get all tasks.

# Reload all 5 tasks for the benchmark
tasks, agent_configs = load_benchmark_data(
    config_type="multi",
    framework="smolagents",
    model_id="gemini-2.5-flash",
    temperature=0.7,
    # No task_indices = load all tasks
)

print(f"Loaded {len(tasks)} tasks:")
for i, task in enumerate(tasks):
    print(f"  {i}. {task.id}: {task.metadata['description']}")

2.5 Running the Benchmark

Now we can run the complete benchmark! MASEval will:

Create environments for each task
Build multi-agent systems with appropriate tools
Run agents and collect traces (tool calls, messages, etc.)
Evaluate results using task-specific evaluators
Log everything to a file

# Create and run benchmark (will take approx. 2 min)
# Pass seed=42 to enable reproducible seeding via the benchmark's seeding system
benchmark = FiveADayBenchmark(
    seed=42,  # Uses DefaultSeedGenerator for reproducible agent seeds
    fail_on_setup_error=True,
    fail_on_task_error=True,
    fail_on_evaluation_error=True,
)

results = benchmark.run(tasks=tasks, agent_data=agent_configs)

2.6 Examining Results

Let's look at the results for the first two tasks to understand the output structure: first the orchestrator's message trace, then the evaluator outputs.

console = Console()

# Show the tail of the orchestrator's conversation for the first two tasks
for task in results[:2]:
    task_id = task["task_id"]
    print("=" * 60)
    print(f"Results for Task ID: {task_id}")
    print("=" * 60)
    agent_traces = task["traces"]["agents"]
    print(f"Traces available for agents: {list(agent_traces.keys())}")
    # Assumes the orchestrator is the first agent listed in the traces
    orchestrator_name = next(iter(agent_traces))
    print(f"Last 5 messages for '{orchestrator_name}'")
    messages = agent_traces[orchestrator_name]["messages"]
    for msg in messages[-5:]:
        role = msg.get("role", "unknown")
        content = msg.get("content") or ""
        # Content may be a plain string or a list of {"text": ...} parts;
        # an empty list no longer raises IndexError
        if isinstance(content, list):
            content = "\n".join(part.get("text", "") for part in content if isinstance(part, dict))
        panel = Panel.fit(
            content,
            title=f" {role} ",
            title_align="left",
        )
        console.print(panel)

# Print the evaluator results for the first two tasks
for task in results[:2]:
    task_id = task["task_id"]
    print("=" * 60)
    print(f"Results for Task ID: {task_id}")
    print("=" * 60)
    eval_results = task["eval"]
    for evals in eval_results:
        for k, v in evals.items():
            print(f"{k:<35} {v}")
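
Printing every key is useful for inspection, but a compact per-task summary is often easier to compare across runs. The sketch below assumes each entry in a report's `"eval"` list is a flat dict of metric names to values and that pass/fail metrics are stored as booleans. Both the helper `summarize_boolean_metrics` and the sample reports are hypothetical, so adapt the field handling to what your evaluators actually return.

```python
from typing import Any, Dict, List


def summarize_boolean_metrics(reports: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return, per task, the fraction of boolean metrics that are True."""
    summary: Dict[str, float] = {}
    for report in reports:
        flags = [
            value
            for evaluator_result in report.get("eval") or []
            for value in evaluator_result.values()
            if isinstance(value, bool)  # ignore scores, strings, and other non-boolean values
        ]
        if flags:  # skip tasks that report no boolean metrics
            summary[report["task_id"]] = sum(flags) / len(flags)
    return summary


# Hypothetical reports shaped like the ones printed above
sample_reports = [
    {"task_id": "task_a", "eval": [{"passed": True, "score": 0.9}, {"format_ok": False}]},
    {"task_id": "task_b", "eval": [{"passed": True}]},
]
print(summarize_boolean_metrics(sample_reports))  # → {'task_a': 0.5, 'task_b': 1.0}
```

With real results you would call `summarize_boolean_metrics(results)` instead of passing the sample data.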

2.7 Usage & Cost Tracking

MASEval automatically tracks token usage for every LLM call made during benchmark execution. Each report includes a "usage" key with per-component breakdowns, and the benchmark maintains running totals across all tasks.

For cost estimation, pass a CostCalculator to your model adapters. MASEval ships two built-in calculators:

StaticPricingCalculator — user-supplied per-token rates (no dependencies)
LiteLLMCostCalculator — automatic pricing via LiteLLM's model database (requires litellm)

Since this benchmark uses smolagents with LiteLLM models (which don't go through MASEval's ModelAdapter), token usage is tracked at the tool level. In benchmarks that use MASEval's model adapters directly, token-level usage and cost are captured automatically.
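
As a rough illustration of what user-supplied per-token rates amount to, the arithmetic is a weighted sum of input and output tokens. The function name, its parameters, and the rates below are hypothetical. They do not reproduce `StaticPricingCalculator`'s actual interface or any provider's real prices.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Cost in USD given token counts and per-million-token rates."""
    total = input_tokens * input_rate_per_million + output_tokens * output_rate_per_million
    return total / 1_000_000


# Made-up rates: $0.50 per million input tokens, $2.00 per million output tokens
cost = estimate_cost(
    input_tokens=12_000,
    output_tokens=3_000,
    input_rate_per_million=0.50,
    output_rate_per_million=2.00,
)
print(f"cost=${cost:.6f}")  # → cost=$0.012000
```

Input and output tokens are priced separately because providers typically charge different rates for each.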

from collections import defaultdict
from maseval import UsageReporter, TokenUsage


def _fmt_usage(usage):
    """Format a Usage record for display."""
    parts = [f"cost=${usage.cost:.6f}"]
    if isinstance(usage, TokenUsage):
        parts.append(f"in={usage.input_tokens}  out={usage.output_tokens}")
    if usage.units:
        parts.append(f"units={dict(usage.units)}")
    return "  ".join(parts)


# --- Live totals (available during and after execution) ---
print("Live Usage Totals")
print("=" * 60)
total = benchmark.usage
print(f"  Total: {_fmt_usage(total)}")

# Group components by category
by_category = defaultdict(dict)
for key, usage in benchmark.usage_by_component.items():
    category, name = key.split(":", 1)
    by_category[category][name] = usage

for category in ["agents", "models", "tools", "simulators", "callbacks"]:
    if category not in by_category:
        continue
    print(f"\n{category.capitalize()}:")
    for name, usage in by_category[category].items():
        print(f"  {name:<35} {_fmt_usage(usage)}")

# Print any remaining categories not in the standard list
for category, components in by_category.items():
    if category in {"agents", "models", "tools", "simulators", "callbacks"}:
        continue
    print(f"\n{category.capitalize()}:")
    for name, usage in components.items():
        print(f"  {name:<35} {_fmt_usage(usage)}")

# --- Post-hoc analysis with UsageReporter ---
print()
reporter = UsageReporter.from_reports(results)

print("Per-Task Usage")
print("-" * 60)
for task_id, usage in reporter.by_task().items():
    print(f"  {task_id:<35} {_fmt_usage(usage)}")

print()
print("Summary dict (for JSON export):")
print(json.dumps(reporter.summary(), indent=2, default=str))

Summary and Key Takeaways

What You've Learned

You now understand how to build production agent benchmarks with MASEval:

Part 1: Multi-Agent Systems

Model creation with LiteLLM for framework compatibility
Framework-agnostic tools that convert to any agent library
Multi-agent architecture with orchestrators and specialists
Tool state management for realistic task environments

Part 2: MASEval Framework

Task abstraction packages queries, environments, and evaluation criteria
Environment class creates tools and enables automatic tracing
Benchmark class orchestrates evaluation across multiple tasks
Custom evaluators for diverse evaluation approaches (unit tests, LLM judges, etc.)
Automatic tracing captures all tool calls and agent interactions
Usage & cost tracking monitors token consumption and computes cost across providers

Key Design Patterns

Separation of Concerns:
- Tasks define WHAT to evaluate
- Environments provide the world in which the agents act (tools and state)
- Benchmarks orchestrate WHEN and WHERE
- Evaluators determine SUCCESS
Framework Agnostic:
- Same tasks work with smolagents, LangGraph, LlamaIndex
- Tools convert automatically to framework-specific formats
- Easy to compare frameworks on identical tasks
Reproducibility:
- Seeds derived systematically from task_id + agent_id
- All parameters logged automatically
- Results saved in structured JSONL format
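
For readers who have not worked with JSONL before, the format is simply one JSON object per line, which makes results easy to append to and to stream back. The sketch below round-trips two hypothetical reports through a temporary file using only the standard library. The file name and report contents are made up, and it does not show how MASEval itself writes its logs.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical reports; real ones also carry traces and usage data
reports = [
    {"task_id": "task_a", "eval": [{"passed": True}]},
    {"task_id": "task_b", "eval": [{"passed": False}]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "results.jsonl"  # hypothetical file name

    # Write one JSON object per line
    with log_path.open("w", encoding="utf-8") as handle:
        for report in reports:
            handle.write(json.dumps(report) + "\n")

    # Read the file back line by line, skipping blank lines
    with log_path.open(encoding="utf-8") as handle:
        loaded = [json.loads(line) for line in handle if line.strip()]

print([report["task_id"] for report in loaded])  # → ['task_a', 'task_b']
```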

Next Steps

Explore evaluators — Check evaluators/ for different evaluation strategies
Try single-agent mode — Load data/singleagent.json to compare architectures
Run from CLI — Use five_a_day_benchmark.py for scripted runs with different frameworks
Add custom tasks — Create your own task definitions and evaluators
Compare frameworks — Run the same benchmark with LangGraph or LlamaIndex

Resources

MASEval Documentation
Example code: examples/five_a_day_benchmark/
Example data: examples/five_a_day_benchmark/data/
Tool implementations: examples/five_a_day_benchmark/tools/
Evaluator implementations: examples/five_a_day_benchmark/evaluators/