5 A Day Benchmark
This notebook is available as a Jupyter notebook — clone the repo and run it yourself!
What This Notebook Demonstrates
This is a complete, production-ready example of evaluating multi-agent systems using the MASEval library. By the end, you'll understand how to:
- Build multi-agent systems with orchestrators and specialists
- Create framework-agnostic tools that work across agent libraries
- Organize systematic evaluation using Tasks, Environments, and Benchmarks
- Implement custom evaluators (unit tests, LLM judges, pattern matching)
- Run reproducible benchmarks with automatic tool tracing
Prerequisites: Familiarity with LLM agents and basic Python programming.
The 5-A-Day Benchmark
We implement 5 diverse tasks representing real-world agent scenarios:
| Task | Scenario | Tools | Evaluation |
|---|---|---|---|
| 0 | Email & Banking | email, banking | LLM judge, pattern matching |
| 1 | Finance Calculation | stock_price, calculator, family_info | Arithmetic verification |
| 2 | Code Generation | python_executor | Unit tests, complexity analysis |
| 3 | Calendar Scheduling | calendar | Slot matching, logic validation |
| 4 | Hotel Optimization | hotel_search | Ranking, search strategy |
We'll use a multi-agent architecture: an orchestrator delegates to specialized agents (banking specialist, email specialist, etc.), demonstrating MASEval's support for complex agent systems.
Part 1: Understanding Multi-Agent Systems
Before diving into MASEval, let's build the multi-agent system we'll be evaluating.
1.1 Imports and Setup
We use smolagents as our agent framework (MASEval also supports LangGraph and LlamaIndex).
The imports below include:
- Standard libraries: json, os, Path for file handling
- Helper utilities: Functions from this example's utils.py and tools.py
- smolagents: The agent framework we'll use
- MASEval core: The evaluation orchestration library
# ruff: noqa: E402
# Setup: Set working directory to project root for proper imports
# This must happen FIRST before any other imports
import os
import sys
from pathlib import Path
import json
from typing import Any, Dict, List, Optional, Sequence
from rich.console import Console
from rich.panel import Panel
# Determine notebook directory and set working directory to project root
_notebook_dir = Path(__file__).parent if "__file__" in dir() else Path.cwd()
if _notebook_dir.name == "five_a_day_benchmark":
    _project_root = _notebook_dir.parent.parent
    os.chdir(_project_root)
    # Add project root to path so `examples.five_a_day_benchmark.*` imports work
    if str(_project_root) not in sys.path:
        sys.path.insert(0, str(_project_root))
    # Also add the example directory for local imports (utils, tools, evaluators)
    if str(_notebook_dir) not in sys.path:
        sys.path.insert(0, str(_notebook_dir))
print(f"Working directory set to: {os.getcwd()}")
# Utility functions from this example
# - sanitize_name(): Cleans agent names for framework compatibility
from utils import sanitize_name
# Tool collection classes and helpers
# - EmailToolCollection, BankingToolCollection: Pre-built tool groups
# - filter_tool_adapters_by_prefix(): Selects tools by name prefix
# - get_states(): Initializes tool state objects (email inboxes, bank accounts, etc.)
from tools import (
    EmailToolCollection,
    BankingToolCollection,
    CalculatorToolCollection,
    CodeExecutionToolCollection,
    FamilyInfoToolCollection,
    StockPriceToolCollection,
    CalendarToolCollection,
    HotelSearchToolCollection,
    MCPCalendarToolCollection,
    filter_tool_adapters_by_prefix,
    get_states,
)
# smolagents: Our chosen agent framework
from smolagents import ToolCallingAgent, LiteLLMModel, FinalAnswerTool
# MASEval core components
from maseval import Benchmark, Environment, Task, TaskQueue, AgentAdapter, Evaluator, ModelAdapter, SeedGenerator
from maseval.interface.agents.smolagents import SmolAgentAdapter
# Import evaluators module (dynamically loaded later)
import evaluators
def load_benchmark_data(
    config_type: str = "multi",
    framework: str = "smolagents",
    model_id: str = "gemini-2.5-flash",
    temperature: float = 0.7,
    limit: int | None = None,
    task_indices: list[int] | None = None,
) -> tuple[TaskQueue, list[Dict[str, Any]]]:
    """Load tasks and agent configurations.

    Args:
        config_type: 'single' or 'multi' agent configuration
        framework: Agent framework to use
        model_id: Model identifier
        temperature: Model temperature
        limit: Optional limit on number of tasks (None = all 5)
        task_indices: Optional list of task indices to load (e.g., [0, 2, 4])

    Returns:
        Tuple of (TaskQueue, list of agent configs)
    """
    data_dir = Path("examples/five_a_day_benchmark/data")
    with open(data_dir / "tasks.json", "r") as f:
        tasks_raw = json.load(f)
    with open(data_dir / f"{config_type}agent.json", "r") as f:
        configs_raw = json.load(f)

    # Apply limit first
    if limit:
        tasks_raw = tasks_raw[:limit]
        configs_raw = configs_raw[:limit]

    # Then apply task_indices filter if specified
    if task_indices is not None:
        tasks_raw = [tasks_raw[i] for i in task_indices if i < len(tasks_raw)]
        configs_raw = [configs_raw[i] for i in task_indices if i < len(configs_raw)]

    tasks_data = []
    configs_data = []
    for task_dict, config in zip(tasks_raw, configs_raw):
        task_id = task_dict["metadata"]["task_id"]
        task_dict["environment_data"]["agent_framework"] = framework

        # Create Task object with id from metadata
        tasks_data.append(
            Task(
                query=task_dict["query"],
                id=task_id,
                environment_data=task_dict["environment_data"],
                evaluation_data=task_dict["evaluation_data"],
                metadata=task_dict["metadata"],
            )
        )

        # Enrich config with framework and model info
        config["framework"] = framework
        config["model_config"] = {"model_id": model_id, "temperature": temperature}
        configs_data.append(config)

    return TaskQueue(tasks_data), configs_data
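Because limit is applied before task_indices, the indices refer to positions in the already-truncated list, and any index at or beyond the limit is silently skipped. The sketch below reproduces only that selection logic on plain lists. select_items is a hypothetical helper written for illustration; it is not part of MASEval or of this example's code.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def select_items(
    items: Sequence[T],
    limit: Optional[int] = None,
    task_indices: Optional[Sequence[int]] = None,
) -> list:
    """Mirror the limit-then-indices filtering used in load_benchmark_data()."""
    selected = list(items)
    # Step 1: truncate to the first `limit` entries (None or 0 keeps everything)
    if limit:
        selected = selected[:limit]
    # Step 2: pick indices from the truncated list, skipping out-of-range ones
    if task_indices is not None:
        selected = [selected[i] for i in task_indices if i < len(selected)]
    return selected


tasks = ["task0", "task1", "task2", "task3", "task4"]
print(select_items(tasks, task_indices=[0, 2, 4]))
print(select_items(tasks, limit=3, task_indices=[0, 2, 4]))
```

In the second call, index 4 falls outside the three remaining tasks, so only two tasks are returned. If you combine both parameters, choose indices smaller than the limit.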
1.2 Model Factory
We need LLMs to power our agents. This factory function creates models using LiteLLM, which provides a unified interface to many providers (OpenAI, Anthropic, Google, etc.).
import litellm

# Tell litellm to drop unsupported params (like 'seed' for Gemini)
litellm.drop_params = True


def get_model(model_id: str, temperature: float = 0.7, seed: int | None = None):
    """Create a model instance compatible with smolagents.

    Args:
        model_id: Gemini model name without the provider prefix (e.g., 'gemini-2.5-flash')
        temperature: Randomness (0.0 = least random, 1.0 = more creative)
        seed: Random seed for reproducible outputs (ignored for models that don't support it)

    Returns:
        LiteLLMModel configured for smolagents
    """
    return LiteLLMModel(
        model_id=f"gemini/{model_id}",  # Prefix determines provider (hard-coded to Gemini here)
        api_key=os.getenv("GOOGLE_API_KEY"),
        temperature=temperature,
        seed=seed,  # Will be dropped by litellm for providers that don't support it
    )


# Test the model factory
model = get_model("gemini-2.5-flash", temperature=0.7, seed=42)
print(f"Created model: {model.model_id}")
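LiteLLM selects the provider from a "provider/model" prefix, which is why the factory prepends gemini/ to the model name. To target other providers, the prefix and the API key lookup would need to vary together. The following is a minimal sketch of that idea, not part of this example: qualify_model_id, lookup_api_key, and the PROVIDER_ENV_VARS table are hypothetical names, and only the gemini entry mirrors the factory above.

```python
import os
from typing import Optional

# Hypothetical provider table; only the "gemini" entry mirrors the factory above.
PROVIDER_ENV_VARS = {
    "gemini": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def qualify_model_id(model_id: str, provider: str = "gemini") -> str:
    """Return a 'provider/model' string, leaving ids that already contain '/' untouched."""
    if "/" in model_id:
        return model_id
    return f"{provider}/{model_id}"


def lookup_api_key(provider: str) -> Optional[str]:
    """Read the API key for a provider from the environment (None if unset)."""
    return os.getenv(PROVIDER_ENV_VARS[provider])


print(qualify_model_id("gemini-2.5-flash"))
print(qualify_model_id("gpt-4", provider="openai"))
```

Treating any id that contains a slash as already qualified is a simplification that keeps the sketch short.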
1.3 Loading Task Data
We use the load_benchmark_data() function to load tasks and agent configurations. Let's load Task 0 (Email & Banking) to examine its structure.
# Load Task 0 for demonstration in Part 1
task_data, agent_configs = load_benchmark_data(
    config_type="multi",
    framework="smolagents",
    model_id="gemini-2.5-flash",
    temperature=0.7,
    task_indices=[0],  # Load only Task 0
)
# Extract the first (and only) task and config
task_0: Task = task_data[0]
config_0: Dict[str, Any] = agent_configs[0]
print("=" * 60)
print("TASK 0: Email & Banking")
print("=" * 60)
print(f"\nUser Query:\n{task_0.query}\n")
print(f"Required Tools: {task_0.environment_data['tools']}")
print(f"\nEvaluators: {task_0.evaluation_data['evaluators']}")
Multi-Agent Configuration
For this task, we use 3 agents:
- Orchestrator - Coordinates specialists
- Banking Specialist - Handles financial data
- Email Specialist - Manages email operations
print("Multi-Agent Setup:")
print(f"Agent Type: {config_0['agent_type']}")
print(f"Primary Agent: {config_0['primary_agent_id']}\n")
for i, agent_spec in enumerate(config_0["agents"], 1):
    print(f"{i}. {agent_spec['agent_name']} (ID: {agent_spec['agent_id']})")
    print(f" Tools: {agent_spec['tools'] if agent_spec['tools'] else 'None (delegates only)'}")
    print(f" Role: {agent_spec['agent_instruction'][:80]}...")
    print()
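For orientation, here is a sketch of what such a configuration might look like and how it could be sanity-checked before agents are built. The keys (agent_type, primary_agent_id, agents, agent_id, agent_name, agent_instruction, tools) are the ones read in this notebook. The ids, names, and instructions are invented, and validate_config is a hypothetical helper rather than something provided by MASEval.

```python
from typing import Any, Dict, List

# Hypothetical configuration: keys match those read in this notebook,
# values are invented for illustration.
example_config: Dict[str, Any] = {
    "agent_type": "multi",
    "primary_agent_id": "orchestrator",
    "agents": [
        {
            "agent_id": "orchestrator",
            "agent_name": "Orchestrator",
            "agent_instruction": "Coordinate the specialists and return the final answer.",
            "tools": [],
        },
        {
            "agent_id": "banking_specialist",
            "agent_name": "Banking Specialist",
            "agent_instruction": "Look up balances and transactions.",
            "tools": ["banking"],
        },
        {
            "agent_id": "email_specialist",
            "agent_name": "Email Specialist",
            "agent_instruction": "Read and send emails.",
            "tools": ["email"],
        },
    ],
}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Check the agent ids and return the ids of the specialists."""
    ids = [spec["agent_id"] for spec in config["agents"]]
    if len(set(ids)) != len(ids):
        raise ValueError("agent_id values must be unique")
    if config["primary_agent_id"] not in ids:
        raise ValueError("primary_agent_id must match one of the agents")
    return [agent_id for agent_id in ids if agent_id != config["primary_agent_id"]]


print(validate_config(example_config))
```

A check like this fails early with a readable message instead of surfacing later as a StopIteration or KeyError while the agents are being constructed.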
1.4 Creating Tools for Specialist Agents
Tools are functions agents can call. We'll create email and banking tools for our specialists.
Key insight: Our tools are "framework-agnostic" BaseTool objects that convert to each supported framework (smolagents, LangGraph, LlamaIndex).
# Initialize state objects from task data
# These hold the actual data (emails, bank transactions) that tools operate on
env_data = task_0.environment_data.copy()
states = get_states(env_data["tools"], env_data)
# Create tool collections (tools are examples)
email_tools = EmailToolCollection(states["email_state"])
banking_tools = BankingToolCollection(states["banking_state"])
# Convert to smolagents format (returns tool adapters with tracing support)
email_adapters = [tool.to_smolagents() for tool in email_tools.get_sub_tools()]
banking_adapters = [tool.to_smolagents() for tool in banking_tools.get_sub_tools()]
# Extract raw smolagents Tool objects
all_tool_adapters = email_adapters + banking_adapters
all_tools = [adapter.tool for adapter in all_tool_adapters]
print(f"Created {len(all_tools)} tools:")
for tool in all_tools:
    print(f" - {tool.name}: {tool.description[:60]}...")
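The pattern behind this cell can be illustrated without any agent framework: a neutral tool definition plus an adapter that forwards calls and records them for tracing. The classes below are a hypothetical sketch of that idea. They are not MASEval's BaseTool or its adapters, and the tool name banking_get_balance is invented.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class SketchTool:
    """Hypothetical framework-neutral tool: a name, a description, and a function."""

    name: str
    description: str
    func: Callable[..., Any]

    def to_traced(self) -> "TracedTool":
        """Wrap the tool in an adapter that records every call."""
        return TracedTool(self)


@dataclass
class TracedTool:
    """Hypothetical adapter: forwards calls to the tool and logs inputs and outputs."""

    tool: SketchTool
    invocations: List[Dict[str, Any]] = field(default_factory=list)

    def __call__(self, **kwargs: Any) -> Any:
        result = self.tool.func(**kwargs)
        self.invocations.append({"tool": self.tool.name, "inputs": kwargs, "output": result})
        return result


balance_tool = SketchTool(
    name="banking_get_balance",
    description="Return the balance of an account.",
    func=lambda account: {"checking": 1200.0}.get(account, 0.0),
)
adapter = balance_tool.to_traced()
print(adapter(account="checking"))
print(adapter.invocations)
```

Keeping the adapter separate from the tool is what lets the same tool logic be wrapped once per framework while the recorded invocations stay in one place.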
1.5 Building the Multi-Agent System
Now we build our 3 agents:
- Specialist agents get tools + FinalAnswerTool() (to return results)
- Orchestrator gets specialists as managed_agents (can delegate to them)
# Build the orchestrator and its specialist agents
def build_agents(
    agent_data: Dict[str, Any],
    environment: Environment,
    seeds: Optional[Dict[str, int]] = None,
) -> tuple[list[ToolCallingAgent], Dict[str, ToolCallingAgent]]:
    """Create multi-agent system with orchestrator and specialists.

    Args:
        agent_data: Agent configuration data
        environment: Environment with tools
        seeds: Optional dict mapping agent_id to seed values

    Returns:
        Tuple of (agents_to_run, agents_to_monitor)
    """
    model_id = agent_data["model_config"]["model_id"]
    temperature = agent_data["model_config"]["temperature"]
    primary_agent_id = agent_data["primary_agent_id"]
    agents_specs = agent_data["agents"]
    all_tool_adapters = environment.get_tools()  # Dict of tool adapters keyed by tool name

    # Build specialists first
    specialist_agents = []
    for agent_spec in agents_specs:
        if agent_spec["agent_id"] == primary_agent_id:
            continue
        seed = seeds.get(agent_spec["agent_id"]) if seeds else None
        model = get_model(model_id, temperature, seed)
        spec_tool_adapters = filter_tool_adapters_by_prefix(all_tool_adapters, agent_spec["tools"])
        spec_tools = [adapter.tool for adapter in spec_tool_adapters.values()]
        spec_tools.append(FinalAnswerTool())
        specialist = ToolCallingAgent(
            model=model,
            tools=spec_tools,
            name=sanitize_name(agent_spec["agent_name"]),
            description=agent_spec["agent_instruction"],
            instructions=agent_spec["agent_instruction"],
            verbosity_level=0,
        )
        specialist_agents.append(specialist)

    # Build orchestrator
    primary_spec = next(a for a in agents_specs if a["agent_id"] == primary_agent_id)
    primary_seed = seeds.get(primary_agent_id) if seeds else None
    primary_model = get_model(model_id, temperature, primary_seed)
    orchestrator = ToolCallingAgent(
        model=primary_model,
        tools=[FinalAnswerTool()],
        managed_agents=specialist_agents if specialist_agents else None,
        name=sanitize_name(primary_spec["agent_name"]),
        instructions=primary_spec["agent_instruction"],
        verbosity_level=0,
    )
    return [orchestrator], {agent.name: agent for agent in specialist_agents}
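Each specialist receives only the tools whose names match the prefixes listed in its configuration. The real filter_tool_adapters_by_prefix lives in this example's tools.py and is not shown here, so the function below is only a guess at its behavior based on the name and on how it is called above. The tool names are invented and the string values stand in for adapter objects.

```python
from typing import Any, Dict, Sequence


def filter_by_prefix(adapters: Dict[str, Any], prefixes: Sequence[str]) -> Dict[str, Any]:
    """Keep the adapters whose name starts with any of the given prefixes."""
    return {
        name: adapter
        for name, adapter in adapters.items()
        if any(name.startswith(prefix) for prefix in prefixes)
    }


# Hypothetical tool names; the values stand in for tool adapters.
all_adapters = {
    "email_send": "adapter-1",
    "email_read_inbox": "adapter-2",
    "banking_get_balance": "adapter-3",
}
print(sorted(filter_by_prefix(all_adapters, ["email"])))
print(filter_by_prefix(all_adapters, []))
```

Under this assumption an empty prefix list selects nothing, which matches an orchestrator that has no tools of its own and only delegates.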
Part 2: Organizing Evaluation with MASEval
We've built a multi-agent system. Now let's see how MASEval helps us evaluate it systematically across multiple tasks.
MASEval provides three key abstractions:
- Task - A single evaluation scenario (query + environment + evaluation criteria)
- Environment - Manages tools and state for a task
- Benchmark - Orchestrates running agents on tasks and collecting results
2.1 The Environment Class
An Environment creates and manages tools for each task. It also enables automatic tool tracing.
Key methods:
- setup_state(): Initialize tool state (email inboxes, bank accounts, etc.)
- create_tools(): Create and convert tools to framework-specific format
class FiveADayEnvironment(Environment):
    """Environment that creates framework-specific tools from task data."""

    def __init__(self, task_data: Dict[str, Any], callbacks: List | None = None):
        super().__init__(task_data, callbacks)

    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from task data."""
        env_data = task_data["environment_data"].copy()
        tool_names = env_data.get("tools", [])
        # Create state objects (e.g., email inboxes, bank accounts)
        states = get_states(tool_names, env_data)
        env_data.update(states)
        return env_data

    def create_tools(self) -> Dict[str, Any]:
        """Create and convert tools to framework-specific format, keyed by name."""
        tools_dict: Dict[str, Any] = {}
        # Map tool names to their collection classes
        tool_mapping = {
            "email": (EmailToolCollection, lambda: (self.state["email_state"],)),
            "banking": (BankingToolCollection, lambda: (self.state["banking_state"],)),
            "calculator": (CalculatorToolCollection, lambda: ()),
            "python_executor": (CodeExecutionToolCollection, lambda: (self.state["python_executor_state"],)),
            "family_info": (FamilyInfoToolCollection, lambda: (self.state["family_info"],)),
            "stock_price": (StockPriceToolCollection, lambda: (self.state["stock_price_lookup"],)),
            "calendar": (CalendarToolCollection, lambda: (self.state["calendar_state"],)),
            "hotel_search": (HotelSearchToolCollection, lambda: (self.state["hotel_search_state"],)),
            "my_calendar_mcp": (MCPCalendarToolCollection, lambda: (self.state["my_calendar_mcp_state"],)),
            "other_calendar_mcp": (MCPCalendarToolCollection, lambda: (self.state["other_calendar_mcp_state"],)),
        }
        for tool_name in self.state["tools"]:
            if tool_name in tool_mapping:
                ToolClass, get_init_args = tool_mapping[tool_name]
                tool_instance = ToolClass(*get_init_args())
                # Get base tools and convert to framework format
                for base_tool in tool_instance.get_sub_tools():
                    framework_tool = base_tool.to_smolagents()
                    tool_key = getattr(base_tool, "name", None) or str(type(base_tool).__name__)
                    tools_dict[tool_key] = framework_tool
        return tools_dict
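The lambdas in tool_mapping defer each state lookup until a tool is actually requested. Without them, building the mapping would read every state key up front and raise KeyError for tools the current task does not use. The sketch below shows the same pattern with stand-in factories (dict and list) and a hypothetical state dictionary. It makes no claim about how the real tool collections behave.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical state for a task that only uses the calculator and email tools.
state: Dict[str, Any] = {"tools": ["calculator", "email"], "email_state": {"inbox": []}}

# Each entry pairs a factory with a zero-argument function that fetches its arguments.
# Nothing is read from `state` until that function is called.
lazy_mapping: Dict[str, Tuple[Callable[..., Any], Callable[[], tuple]]] = {
    "email": (dict, lambda: (state["email_state"],)),
    "banking": (dict, lambda: (state["banking_state"],)),  # key is absent from `state`
    "calculator": (list, lambda: ()),
}

built = {}
for tool_name in state["tools"]:
    factory, get_args = lazy_mapping[tool_name]
    built[tool_name] = factory(*get_args())

print(sorted(built))
```

The banking entry never fails here because its lambda is never called. Calling it directly would raise KeyError, which is the failure the deferred lookup avoids.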
2.2 Check Agents
Let's verify our agent setup by building agents for the first task and inspecting their configuration.
First, we check the agent config.
print(f"{config_0['task_description']}")
for i, agent_spec in enumerate(config_0["agents"], 1):
    print(f"{i}. {agent_spec['agent_name']} (ID: {agent_spec['agent_id']})")
    print(f" Tools: {agent_spec['tools'] if agent_spec['tools'] else 'None (delegates only)'}")
    print(f" Role: {agent_spec['agent_instruction'][:80]}...")
    print()
Now we build the agents for the first task.
# Build the agents for task 0
# Note: model_config is already set by load_benchmark_data()
# Create environment from task data
environment_0 = FiveADayEnvironment(
    {
        "environment_data": task_0.environment_data,
        "query": task_0.query,
        "evaluation_data": task_0.evaluation_data,
        "metadata": task_0.metadata,
    }
)
# Build agents using the build_agents function (no seeds for this demo)
agents_to_run, agents_to_monitor = build_agents(config_0, environment_0)
print(f"\nBuilt Agents for Task: {task_0.id}")
print(f"{'=' * 60}")
print(f"\nAgents to run: {[agent.name for agent in agents_to_run]}")
print(f"Agents to monitor: {list(agents_to_monitor.keys())}")
# Print details for each agent
for agent in agents_to_run:
    print(f"\n Agent: {agent.name}")
    # smolagents stores tools as a dict with string keys
    print(f" Tools: {list(agent.tools.keys())}")
    if hasattr(agent, "managed_agents") and agent.managed_agents:
        # managed_agents is also a dict with string keys
        print(f" Managed agents: {list(agent.managed_agents.keys())}")
        for agent_name, managed in agent.managed_agents.items():
            print(f" - {managed.name}: {list(managed.tools.keys())}")
print("\nAll agents built successfully.")
2.3 The Benchmark Class
A Benchmark orchestrates the entire evaluation process. It implements 5 key methods:
- setup_environment() - Create tools for a task
- setup_agents() - Build agents with appropriate tools
- setup_evaluators() - Create task-specific evaluators
- run_agents() - Execute agents and collect responses
- evaluate() - Run evaluators on agent outputs
class FiveADayBenchmark(Benchmark):
    """5-A-Day benchmark with multi-agent support."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task, seed_generator: SeedGenerator) -> Environment:
        """Create environment from task data."""
        task_data = {
            "environment_data": task.environment_data,
            "query": task.query,
            "evaluation_data": task.evaluation_data,
            "metadata": task.metadata,
        }
        environment = FiveADayEnvironment(task_data)
        # Register all tools for tracing
        for tool_name, tool_adapter in environment.get_tools().items():
            self.register("tools", tool_name, tool_adapter)
        return environment

    def setup_agents(
        self,
        agent_data: Dict[str, Any],
        environment: Environment,
        task: Task,
        user,
        seed_generator: SeedGenerator,
    ) -> tuple[list[SmolAgentAdapter], Dict[str, SmolAgentAdapter]]:
        """Create multi-agent system with orchestrator and specialists.

        Seeds are derived for each agent using the benchmark's seeding system
        with hierarchical paths. derive_seed() returns None if seeding is disabled.
        """
        # Build seeds dict using seed_generator
        # Use child("agents") to create logical paths like "agents/primary_agent"
        agent_gen = seed_generator.child("agents")
        seeds = {}
        for agent_spec in agent_data["agents"]:
            seeds[agent_spec["agent_id"]] = agent_gen.derive_seed(agent_spec["agent_id"])
        agents_to_run, agents_to_monitor = build_agents(agent_data, environment, seeds)

        # Create adapters for the primary agent(s) to run
        adapters_to_run = [SmolAgentAdapter(agent, agent.name) for agent in agents_to_run]
        # Wrap every agent (orchestrator and specialists) for monitoring;
        # this ensures all agent traces are collected by the benchmark
        all_agents = {agent.name: agent for agent in agents_to_run} | agents_to_monitor
        adapters_to_monitor = {name: SmolAgentAdapter(agent, name) for name, agent in all_agents.items()}
        return adapters_to_run, adapters_to_monitor

    def setup_evaluators(self, environment, task, agents, user, seed_generator: SeedGenerator) -> Sequence[Evaluator]:
        """Create evaluators based on task's evaluation criteria."""
        if not task.evaluation_data["evaluators"]:
            return []
        evaluator_instances = []
        for name in task.evaluation_data["evaluators"]:
            # Look up the evaluator class by name in the evaluators module
            evaluator_class = getattr(evaluators, name)
            evaluator_instances.append(evaluator_class(task, environment, user))
        return evaluator_instances

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Sequence[Any]:
        """Execute agents and return their final answers."""
        answers = [agent.run(query) for agent in agents]
        return answers

    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:
        """Return a model adapter for benchmark components that need LLM access.

        This benchmark doesn't use simulated tools, user simulators, or LLM judges,
        so this method is not called during execution.
        """
        raise NotImplementedError("This benchmark doesn't use model adapters for tools/users/evaluators.")

    def evaluate(
        self,
        evaluators: Sequence[Evaluator],
        agents: Dict[str, AgentAdapter],
        final_answer: Any,
        traces: Dict[str, Any],
    ) -> list[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            filtered_traces = evaluator.filter_traces(traces)
            results.append(evaluator(filtered_traces, final_answer))
        return results
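The evaluate() loop relies on two things from every evaluator: a filter_traces() method that narrows the traces, and a call that takes the filtered traces plus the final answer and returns a dict. The class below is a hypothetical, dependency-free sketch of that calling convention. It does not subclass MASEval's Evaluator, it takes a keyword instead of the (task, environment, user) arguments used above, and the trace layout and tool names are invented.

```python
from typing import Any, Dict


class KeywordEvaluator:
    """Hypothetical evaluator: passes when the final answer mentions a keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Keep only the tool traces, the part this evaluator inspects."""
        return {"tools": traces.get("tools", {})}

    def __call__(self, traces: Dict[str, Any], final_answer: Any) -> Dict[str, Any]:
        """Return a flat dict of metrics, one entry per check."""
        return {
            "keyword_found": self.keyword.lower() in str(final_answer).lower(),
            "tools_used": sorted(traces["tools"]),
        }


traces = {
    "agents": {"orchestrator": {}},
    "tools": {"email_send": {}, "banking_get_balance": {}},
}
evaluator = KeywordEvaluator("deposit")
result = evaluator(evaluator.filter_traces(traces), "The deposit was confirmed by email.")
print(result)
```

Filtering first keeps each evaluator focused on the slice of the trace it needs, so adding new trace categories does not affect existing evaluators.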
2.4 Loading All Tasks
Now let's load all 5 tasks to run the full benchmark. We reuse load_benchmark_data() without specifying task_indices to get all tasks.
# Reload all 5 tasks for the benchmark
tasks, agent_configs = load_benchmark_data(
    config_type="multi",
    framework="smolagents",
    model_id="gemini-2.5-flash",
    temperature=0.7,
    # No task_indices = load all tasks
)
print(f"Loaded {len(tasks)} tasks:")
for i, task in enumerate(tasks):
    print(f" {i}. {task.id}: {task.metadata['description']}")
2.5 Running the Benchmark
Now we can run the complete benchmark! MASEval will:
- Create environments for each task
- Build multi-agent systems with appropriate tools
- Run agents and collect traces (tool calls, messages, etc.)
- Evaluate results using task-specific evaluators
- Log everything to a file
# Create and run benchmark (will take approx. 2 min)
# Pass seed=42 to enable reproducible seeding via the benchmark's seeding system
benchmark = FiveADayBenchmark(
    seed=42,  # Uses DefaultSeedGenerator for reproducible agent seeds
    fail_on_setup_error=True,
    fail_on_task_error=True,
    fail_on_evaluation_error=True,
)
results = benchmark.run(tasks=tasks, agent_data=agent_configs)
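To build intuition for why a single seed=42 can give every agent its own reproducible seed, the sketch below derives child seeds by hashing the root seed together with a path such as "agents/orchestrator". This is an illustrative scheme only. It is not the algorithm used by MASEval's DefaultSeedGenerator, and the function name and paths are hypothetical.

```python
import hashlib


def derive_seed(root_seed: int, path: str) -> int:
    """Map (root seed, path) to a stable 32-bit integer via SHA-256."""
    digest = hashlib.sha256(f"{root_seed}:{path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


seed_a = derive_seed(42, "agents/orchestrator")
seed_b = derive_seed(42, "agents/banking_specialist")

# Re-deriving with the same inputs gives the same seed on every run
print(seed_a == derive_seed(42, "agents/orchestrator"))
# Different paths give independent-looking seeds
print(seed_a, seed_b)
```

Deriving seeds from names rather than from call order means that adding or removing one agent does not shift the seeds of the others.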
2.6 Examining Results
Let's look at the results for two tasks to understand the output structure.
console = Console()
for task in results[:2]:
    task_id = task["task_id"]
    print("=" * 60)
    print(f"Results for Task ID: {task_id}")
    print("=" * 60)
    traces = task["traces"]
    agent_traces = traces["agents"]
    print(f"Traces available for agents: {list(agent_traces.keys())}")
    # The orchestrator is registered first, so it is the first key
    orchestrator_name = list(agent_traces.keys())[0]
    print(f"Last 5 messages for '{orchestrator_name}'")
    messages = agent_traces[orchestrator_name]["messages"]
    for msg in messages[-5:]:
        role = msg.get("role", "unknown")
        # Content may be a plain string or a list of parts; show the first text part
        content = msg.get("content") or ""
        if isinstance(content, list):
            content = content[0].get("text", "") if content else ""
        panel = Panel.fit(
            content,
            title=f" {role} ",
            title_align="left",
        )
        console.print(panel)

# Print evaluation results for the first two tasks
for task in results[:2]:
    task_id = task["task_id"]
    print("=" * 60)
    print(f"Results for Task ID: {task_id}")
    print("=" * 60)
    eval_results = task["eval"]
    for evals in eval_results:
        for k, v in evals.items():
            print(f"{k:<35} {v}")
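Once the per-evaluator dictionaries are available, a compact summary per task is often more useful than the raw printout. The sketch below assumes reports shaped like the ones iterated above (a "task_id" plus an "eval" list with one dict per evaluator). The sample data and metric names are invented, and boolean_pass_rates is a hypothetical helper.

```python
from typing import Any, Dict, List

# Hypothetical reports shaped like the entries printed above.
# Task ids and metric names are invented for illustration.
sample_results: List[Dict[str, Any]] = [
    {"task_id": "task_0", "eval": [{"amount_correct": True}, {"email_sent": False}]},
    {"task_id": "task_1", "eval": [{"arithmetic_correct": True, "error": None}]},
]


def boolean_pass_rates(reports: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return, per task, the fraction of boolean metrics that are True."""
    rates: Dict[str, float] = {}
    for report in reports:
        flags = [
            value
            for evaluation in report["eval"]
            for value in evaluation.values()
            if isinstance(value, bool)  # ignore scores, messages, and None
        ]
        if flags:
            rates[report["task_id"]] = sum(flags) / len(flags)
    return rates


print(boolean_pass_rates(sample_results))
```

Only boolean metrics are counted, so numeric scores or free-text explanations returned by an evaluator do not distort the rate.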
2.7 Usage & Cost Tracking
MASEval automatically tracks token usage for every LLM call made during benchmark execution. Each report includes a "usage" key with per-component breakdowns, and the benchmark maintains running totals across all tasks.
For cost estimation, pass a CostCalculator to your model adapters. MASEval ships two built-in calculators:
- StaticPricingCalculator: user-supplied per-token rates (no dependencies)
- LiteLLMCostCalculator: automatic pricing via LiteLLM's model database (requires litellm)
Since this benchmark uses smolagents with LiteLLM models (which don't go through MASEval's ModelAdapter), token usage is tracked at the tool level. In benchmarks that use MASEval's model adapters directly, token-level usage and cost are captured automatically.
from collections import defaultdict
from maseval import UsageReporter, TokenUsage
def _fmt_usage(usage):
    """Format a Usage record for display."""
    parts = [f"cost=${usage.cost:.6f}"]
    if isinstance(usage, TokenUsage):
        parts.append(f"in={usage.input_tokens} out={usage.output_tokens}")
    if usage.units:
        parts.append(f"units={dict(usage.units)}")
    return " ".join(parts)


# --- Live totals (available during and after execution) ---
print("Live Usage Totals")
print("=" * 60)
total = benchmark.usage
print(f" Total: {_fmt_usage(total)}")

# Group components by category
by_category = defaultdict(dict)
for key, usage in benchmark.usage_by_component.items():
    category, name = key.split(":", 1)
    by_category[category][name] = usage

for category in ["agents", "models", "tools", "simulators", "callbacks"]:
    if category not in by_category:
        continue
    print(f"\n{category.capitalize()}:")
    for name, usage in by_category[category].items():
        print(f" {name:<35} {_fmt_usage(usage)}")

# Print any remaining categories not in the standard list
for category, components in by_category.items():
    if category in {"agents", "models", "tools", "simulators", "callbacks"}:
        continue
    print(f"\n{category.capitalize()}:")
    for name, usage in components.items():
        print(f" {name:<35} {_fmt_usage(usage)}")

# --- Post-hoc analysis with UsageReporter ---
print()
reporter = UsageReporter.from_reports(results)
print("Per-Task Usage")
print("-" * 60)
for task_id, usage in reporter.by_task().items():
    print(f" {task_id:<35} {_fmt_usage(usage)}")
print()
print("Summary dict (for JSON export):")
print(json.dumps(reporter.summary(), indent=2, default=str))
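Static pricing comes down to multiplying token counts by user-supplied rates. The sketch below shows that arithmetic in isolation. TokenRates and estimate_cost are hypothetical names, not the interface of MASEval's StaticPricingCalculator, and the rates are placeholders rather than real prices for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical price table entry, in US dollars per one million tokens."""

    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, rates: TokenRates) -> float:
    """Multiply token counts by their per-token rates and add the two parts."""
    return (
        input_tokens * rates.input_per_million / 1_000_000
        + output_tokens * rates.output_per_million / 1_000_000
    )


# Placeholder rates, not real prices for any model.
rates = TokenRates(input_per_million=1.0, output_per_million=4.0)
print(f"${estimate_cost(12_000, 3_000, rates):.6f}")
```

Input and output tokens are priced separately because providers commonly charge different rates for each.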
Summary and Key Takeaways
What You've Learned
You now understand how to build production agent benchmarks with MASEval:
Part 1: Multi-Agent Systems
- Model creation with LiteLLM for framework compatibility
- Framework-agnostic tools that convert to any agent library
- Multi-agent architecture with orchestrators and specialists
- Tool state management for realistic task environments
Part 2: MASEval Framework
- Task abstraction packages queries, environments, and evaluation criteria
- Environment class creates tools and enables automatic tracing
- Benchmark class orchestrates evaluation across multiple tasks
- Custom evaluators for diverse evaluation approaches (unit tests, LLM judges, etc.)
- Automatic tracing captures all tool calls and agent interactions
- Usage & cost tracking monitors token consumption and computes cost across providers
Key Design Patterns
Separation of Concerns:
- Tasks define WHAT to evaluate
- Environments provide the world in which the agents act (tools and state)
- Benchmarks orchestrate WHEN and WHERE
- Evaluators determine SUCCESS
Framework Agnostic:
- Same tasks work with smolagents, LangGraph, LlamaIndex
- Tools convert automatically to framework-specific formats
- Easy to compare frameworks on identical tasks
Reproducibility:
- Seeds derived systematically from task_id + agent_id
- All parameters logged automatically
- Results saved in structured JSONL format
Next Steps
- Explore evaluators: Check evaluators/ for different evaluation strategies
- Try single-agent mode: Load data/singleagent.json to compare architectures
- Run from CLI: Use five_a_day_benchmark.py for scripted runs with different frameworks
- Add custom tasks: Create your own task definitions and evaluators
- Compare frameworks: Run the same benchmark with LangGraph or LlamaIndex
Resources
- MASEval Documentation
- Example code: examples/five_a_day_benchmark/
- Example data: examples/five_a_day_benchmark/data/
- Tool implementations: examples/five_a_day_benchmark/tools/
- Evaluator implementations: examples/five_a_day_benchmark/evaluators/