Tiny Tutorial

This tutorial is available as a Jupyter notebook: clone the repo and run it yourself!

What You'll Learn

Build your first agent — Create tools and agents with smolagents
Run a minimal benchmark — One task, one agent, end-to-end
Understand the core abstractions — Tasks, Environments, Evaluators working together

This tutorial first uses smolagents to introduce agents, then builds a minimal single-task benchmark with MASEval.

Setup

First, let's install the required dependencies and import the libraries we need.

# Install dependencies (uncomment if needed)
# !pip install "maseval[smolagents]"
# !pip install litellm

import os
import json
from pathlib import Path
from typing import Any, Dict, List, Optional

# Set your API key
# os.environ["GOOGLE_API_KEY"] = "your-api-key-here"
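
The agent in Part 1 reads the key with os.getenv("GOOGLE_API_KEY"), so a missing key would likely only show up later as a failed model call. A minimal sketch of an early check follows; require_env is a hypothetical helper name, not part of smolagents or MASEval, and the demo uses a throwaway variable so it runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Demo with a placeholder variable (hypothetical name) instead of a real API key.
os.environ["DEMO_API_KEY"] = "placeholder"
print(require_env("DEMO_API_KEY"))
```

In the notebook itself, the same helper could be called once with "GOOGLE_API_KEY" before creating the model.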

Part 1: Agent Initialization with smolagents

Let's start by building an agent using smolagents. We'll create a simple agent that can handle email and banking tasks.

1.1 Define Custom Tools

For this example, we'll create simplified versions of email and banking tools. In the full benchmark, these tools are more sophisticated and stateful.

from smolagents import Tool


class SimpleBankingTool(Tool):
    """A simple tool to retrieve banking transactions."""

    name = "get_transactions"
    description = "Retrieve recent banking transactions. Returns a list of transactions with date, description, amount, and type."
    inputs = {}
    output_type = "string"

    def __init__(self, transactions: List[Dict], **kwargs):
        super().__init__(**kwargs)
        self.transactions = transactions

    def forward(self) -> str:
        """Return all transactions as formatted string."""
        if not self.transactions:
            return "No transactions found."

        result = "Recent Transactions:\n"
        for txn in self.transactions:
            result += f"- {txn['date']}: {txn['description']} - ${txn['amount']} ({txn['type']})\n"
        return result


class SimpleInboxTool(Tool):
    """A simple tool to read the email inbox."""

    name = "get_inbox"
    description = "Retrieve all emails in the inbox. Returns sender, subject, and body for each email."
    inputs = {}
    output_type = "string"

    def __init__(self, inbox: List[Dict], **kwargs):
        super().__init__(**kwargs)
        self.inbox = inbox

    def forward(self) -> str:
        """Return all emails in inbox as formatted string."""
        if not self.inbox:
            return "Inbox is empty."

        result = "Email Inbox:\n"
        for i, email in enumerate(self.inbox, 1):
            result += f"\n--- Email {i} ---\n"
            result += f"From: {email['from']}\n"
            result += f"Subject: {email['subject']}\n"
            result += f"Body: {email['body']}\n"
        return result


class SimpleEmailTool(Tool):
    """A simple tool to send emails."""

    name = "send_email"
    description = "Send an email to a recipient. Provide the recipient email, subject, and body text."
    inputs = {
        "to": {"type": "string", "description": "Recipient email address"},
        "subject": {"type": "string", "description": "Email subject line"},
        "body": {"type": "string", "description": "Email body text"},
    }
    output_type = "string"

    def __init__(self, sent_emails: List, **kwargs):
        super().__init__(**kwargs)
        self.sent_emails = sent_emails  # Store sent emails for tracking

    def forward(self, to: str, subject: str, body: str) -> str:
        """Send an email and store it."""
        email = {"to": to, "subject": subject, "body": body}
        self.sent_emails.append(email)
        return f"Email sent successfully to {to}"


print("Tools defined successfully!")
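
The tools above each hold their own data, and only send_email changes anything (it appends to a list). As a rough illustration of what "stateful" can mean, here is a plain-Python sketch in which reading and sending share one state object. MailboxState is a hypothetical name, and this is not how the full benchmark implements its tools.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MailboxState:
    """Hypothetical shared state that an inbox tool and a send tool could both use."""

    inbox: List[Dict[str, str]] = field(default_factory=list)
    sent: List[Dict[str, str]] = field(default_factory=list)

    def send(self, to: str, subject: str, body: str) -> str:
        # Record the outgoing message so later steps can see it.
        self.sent.append({"to": to, "subject": subject, "body": body})
        return f"Email sent successfully to {to}"

    def summary(self) -> str:
        # One-line view of the mailbox after any number of tool calls.
        return f"{len(self.inbox)} received, {len(self.sent)} sent"


state = MailboxState(
    inbox=[{"from": "sarah.johnson@email.com", "subject": "Rental Payment Confirmation", "body": "Please confirm."}]
)
state.send("sarah.johnson@email.com", "Re: Rental Payment Confirmation", "Payment received.")
print(state.summary())  # 1 received, 1 sent
```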

1.2 Create Tool Instances with Data

Now let's instantiate our tools with the actual data from the benchmark task.

# Sample banking data from the benchmark
banking_transactions = [
    {"date": "2025-11-15", "description": "Tenant Deposit - Sarah Johnson", "amount": 2000, "type": "deposit"},
    {"date": "2025-11-17", "description": "Rent Payment - Sarah Johnson", "amount": 1500, "type": "deposit"},
    {"date": "2025-11-16", "description": "Property Maintenance", "amount": -450, "type": "expense"},
]

# Sample email inbox
email_inbox = [
    {
        "from": "sarah.johnson@email.com",
        "to": "sean.crane85@mymail-online.biz",
        "subject": "Rental Payment Confirmation",
        "body": "Hi Sean, I just transferred the deposit ($2,000) and first month's rent ($1,500) to your account. Can you please confirm you received it? Thanks, Sarah",
    }
]

# List to track sent emails
sent_emails = []

# Create tool instances
banking_tool = SimpleBankingTool(transactions=banking_transactions)
inbox_tool = SimpleInboxTool(inbox=email_inbox)
email_tool = SimpleEmailTool(sent_emails=sent_emails)

print(f"Created {len([banking_tool, inbox_tool, email_tool])} tools")
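
For reference, the amounts the agent should end up confirming can be worked out directly from this data. The sketch below repeats the sample transactions so it runs on its own; deposits_from is a hypothetical helper that matches on the description text, which is an assumption about how one would identify Sarah's payments.

```python
from typing import Dict, List

transactions: List[Dict] = [
    {"date": "2025-11-15", "description": "Tenant Deposit - Sarah Johnson", "amount": 2000, "type": "deposit"},
    {"date": "2025-11-17", "description": "Rent Payment - Sarah Johnson", "amount": 1500, "type": "deposit"},
    {"date": "2025-11-16", "description": "Property Maintenance", "amount": -450, "type": "expense"},
]


def deposits_from(name: str, txns: List[Dict]) -> List[int]:
    """Return deposit amounts whose description mentions the given name."""
    return [t["amount"] for t in txns if t["type"] == "deposit" and name in t["description"]]


amounts = deposits_from("Sarah Johnson", transactions)
print(amounts, sum(amounts))  # [2000, 1500] 3500
```

Both payments Sarah mentions in her email ($2,000 and $1,500) appear as deposits, so a correct reply confirms receipt of both.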

1.3 Initialize the Agent

Now we'll create a smolagents agent with our custom tools and give it clear instructions.

from smolagents import ToolCallingAgent, LiteLLMModel

# Initialize the model
model = LiteLLMModel(model_id="gemini/gemini-2.5-flash", api_key=os.getenv("GOOGLE_API_KEY"), temperature=0.7)

# Create the agent with tools and instructions
agent = ToolCallingAgent(
    tools=[banking_tool, inbox_tool, email_tool],
    model=model,
    instructions="""You are a helpful assistant that helps users with email and banking tasks.
Use the available tools to retrieve information and take appropriate actions.
Be professional and thorough in your responses.""",
)

print("Agent initialized successfully!")

1.4 Test the Agent

Let's test our agent with the actual task query from the benchmark.

# The task query from the benchmark
query = """Sarah Johnson emailed me to confirm that I received her payment for the deposit 
and first month's rent. Please check my transactions and send an email reply accordingly."""

# Run the agent
response = agent.run(query)

print("\n" + "=" * 60)
print("AGENT RESPONSE:")
print("=" * 60)
print(response)
print("=" * 60)

1.5 Inspect What Happened

Let's check if the agent sent an email and what it contained.

print("Emails sent by the agent:")
print("\n")

if sent_emails:
    for i, email in enumerate(sent_emails, 1):
        print(f"Email #{i}")
        print(f"To: {email['to']}")
        print(f"Subject: {email['subject']}")
        print(f"Body:\n{email['body']}")
        print("\n" + "-" * 60 + "\n")
else:
    print("No emails were sent.")

Part 2: Evaluating Agents with MASEval

Now that we understand how the agent works, let's see how MASEval helps us systematically evaluate agent performance across multiple tasks.

MASEval provides:

Tasks: Define queries, environments, and evaluation criteria
Environments: Manage tool state and provide context
Evaluators: Measure agent performance using various metrics
Benchmarks: Orchestrate execution and collect results

2.1 Import MASEval Components

Let's import the core MASEval components we'll need.

from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue
from maseval.interface.agents.smolagents import SmolAgentAdapter

print("MASEval components imported successfully!")

2.2 Load Task Data

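The Five-A-Day benchmark defines its tasks in JSON files. Let's load the first task (Email & Banking) from data/tasks.json; the path is relative to the current working directory, so run the notebook from its own folder.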

# Load task data from JSON
data_dir = Path("data")

with open(data_dir / "tasks.json", "r") as f:
    tasks_data = json.load(f)

# Get the first task (Email & Banking)
task_dict = tasks_data[0]

print("Task Query:")
print(task_dict["query"])
print("\nTools Required:")
print(task_dict["environment_data"]["tools"])
print("\nEvaluators:")
print(task_dict["evaluation_data"]["evaluators"])
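
To make the expected file layout concrete, here is a sketch of what one entry in tasks.json might look like. Only the key names are taken from the code in this tutorial; every value, including the task ID, is invented for illustration, and the real file may contain further fields. missing_keys is a hypothetical helper for a quick sanity check after loading.

```python
import json
from typing import Any, Dict, List

# Hypothetical entry: key names follow the tutorial's code, values are made up.
example_task: Dict[str, Any] = {
    "query": "Check my transactions and reply to Sarah Johnson.",
    "environment_data": {
        "tools": ["get_transactions", "get_inbox", "send_email"],
        "banking": {"bank_transactions": []},
        "email_inbox": [],
    },
    "evaluation_data": {
        "evaluators": ["FinancialAccuracyEvaluator", "EmailSentEvaluator"],
        "expected_deposit_amount": 2000,
        "expected_rent_amount": 1500,
    },
    "metadata": {"task_id": "example-task", "complexity": "easy", "skills_tested": ["email", "banking"]},
}

REQUIRED_KEYS = ("query", "environment_data", "evaluation_data", "metadata")


def missing_keys(task_dict: Dict[str, Any]) -> List[str]:
    """Return the top-level keys that the tutorial's code expects but the entry lacks."""
    return [key for key in REQUIRED_KEYS if key not in task_dict]


# Round-trip through JSON to mimic reading a file that holds a list of tasks.
loaded = json.loads(json.dumps([example_task]))
print(missing_keys(loaded[0]))  # []
```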

2.3 Create a Task Object

MASEval uses Task objects to encapsulate all information about a benchmark task.

# Create a Task instance
task = Task(
    query=task_dict["query"],
    id=task_dict["metadata"]["task_id"],
    environment_data=task_dict["environment_data"],
    evaluation_data=task_dict["evaluation_data"],
    metadata=task_dict["metadata"],
)

print(f"Created task: {task.id}")
print(f"Complexity: {task.metadata['complexity']}")
print(f"Skills tested: {', '.join(task.metadata['skills_tested'])}")

2.4 Define a Custom Environment

The Environment class manages tool state and provides tools to the agent. Here's a simplified version of the FiveADayEnvironment.

class SimpleEnvironment(Environment):
    """Simplified environment for the Email & Banking task."""

    def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from environment data."""
        return environment_data.copy()

    def create_tools(self) -> Dict[str, Any]:
        """Create tool instances from environment data, keyed by name."""
        # Get banking transactions and inbox from environment data
        transactions = self.state.get("banking", {}).get("bank_transactions", [])
        inbox = self.state.get("email_inbox", [])

        # Create tool instances - track sent emails for evaluation
        self.sent_emails: List[Dict] = []
        banking_tool = SimpleBankingTool(transactions=transactions)
        inbox_tool = SimpleInboxTool(inbox=inbox)
        email_tool = SimpleEmailTool(sent_emails=self.sent_emails)

        return {"get_transactions": banking_tool, "get_inbox": inbox_tool, "send_email": email_tool}


print("Environment class defined!")
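
One detail worth knowing: setup_state returns environment_data.copy(), which is a shallow copy, so nested lists and dicts are still shared with the task's original data. That is harmless here because the tools only read that data, but an environment whose tools modify nested state may want copy.deepcopy instead. The standard-library sketch below shows the difference on a small made-up dictionary.

```python
import copy

# Made-up data with the same nesting the environment reads.
environment_data = {"banking": {"bank_transactions": [{"amount": 2000}]}}

shallow = environment_data.copy()
deep = copy.deepcopy(environment_data)

# Mutating a nested list through the shallow copy also changes the original...
shallow["banking"]["bank_transactions"].append({"amount": 1500})
print(len(environment_data["banking"]["bank_transactions"]))  # 2

# ...while the deep copy keeps its own, unchanged list.
print(len(deep["banking"]["bank_transactions"]))  # 1
```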

2.5 Create Custom Evaluators

Evaluators measure agent performance. Let's create two evaluators:

FinancialAccuracyEvaluator: Reads the expected deposit and rent amounts from the task and, in this simplified version, scores the run on whether an email was sent
EmailSentEvaluator: Checks if the agent sent an email and reports its recipient and subject

class FinancialAccuracyEvaluator(Evaluator):
    """Reports the expected payment amounts and scores whether a confirmation email was sent."""

    def __init__(self, task: Task, environment: Environment, user=None):
        """Initialize with task, environment, and optional user."""
        super().__init__(task, environment, user)
        self.task = task
        self.environment = environment

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Filter to environment traces to check tool usage."""
        return traces.get("environment", {})

    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        """Look up the expected amounts and check whether any email was sent."""
        # Expected values from task evaluation data
        expected_deposit = self.task.evaluation_data["expected_deposit_amount"]
        expected_rent = self.task.evaluation_data["expected_rent_amount"]

        # Check if emails were sent by looking at environment state
        sent_emails = getattr(self.environment, "sent_emails", [])
        email_sent = len(sent_emails) > 0

        return {
            "evaluator": "FinancialAccuracyEvaluator",
            "email_sent": email_sent,
            "emails_count": len(sent_emails),
            "expected_deposit": expected_deposit,
            "expected_rent": expected_rent,
            "score": 1.0 if email_sent else 0.0,
            "message": "Agent sent confirmation email" if email_sent else "No email was sent",
        }


class EmailSentEvaluator(Evaluator):
    """Evaluates whether the agent sent an email (the content is not inspected)."""

    def __init__(self, task: Task, environment: Environment, user=None):
        """Initialize with task, environment, and optional user."""
        super().__init__(task, environment, user)
        self.task = task
        self.environment = environment

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Filter to environment traces."""
        return traces.get("environment", {})

    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        """Check that an email was sent and report its recipient and subject."""
        sent_emails = getattr(self.environment, "sent_emails", [])

        if not sent_emails:
            return {"evaluator": "EmailSentEvaluator", "email_sent": False, "score": 0.0, "error": "No email was sent"}

        # Get the last email that was sent
        email_data = sent_emails[-1]

        return {
            "evaluator": "EmailSentEvaluator",
            "email_sent": True,
            "score": 1.0,
            "recipient": email_data.get("to"),
            "subject": email_data.get("subject"),
            "message": "Agent successfully sent an email",
        }


print("Evaluators defined!")
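
Both evaluators score a run on whether any email was sent; neither reads the email body. One possible way to tighten the financial check is to look for the expected amounts in the text of the sent email. The sketch below is a standalone illustration under that assumption: mentions_amounts is a hypothetical helper, not part of MASEval, and simple number matching like this will miss amounts written out in words.

```python
import re
from typing import Dict, Iterable


def mentions_amounts(body: str, amounts: Iterable[int]) -> Dict[int, bool]:
    """Report, per expected amount, whether the email body mentions it."""
    # Collect every number in the text, ignoring thousands separators and any decimal part.
    found = {
        int(match.replace(",", "").split(".")[0])
        for match in re.findall(r"\d[\d,]*(?:\.\d+)?", body)
    }
    return {amount: amount in found for amount in amounts}


body = "Hi Sarah, I received the $2,000 deposit and the $1,500.00 rent payment."
print(mentions_amounts(body, [2000, 1500]))  # {2000: True, 1500: True}
```

Inside an evaluator, the body would come from the last entry of the environment's sent_emails list, and the amounts from the task's evaluation data.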

2.6 Create a Custom Benchmark

The Benchmark class orchestrates task execution and evaluation. We'll create a simplified version.

from maseval import AgentAdapter, ModelAdapter
from typing import Sequence, Tuple


class SimpleBenchmark(Benchmark):
    """Simplified benchmark for the tutorial."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task) -> Environment:
        """Create an environment for the task."""
        return SimpleEnvironment(task.environment_data)

    def setup_agents(
        self, agent_data: Dict[str, Any], environment: Environment, task: Task, user=None
    ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]:
        """Create an agent for the task."""
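        # Initialize the model from agent_data, falling back to the tutorial defaults
        model = LiteLLMModel(
            model_id=agent_data.get("model_id", "gemini/gemini-2.5-flash"),
            api_key=os.getenv("GOOGLE_API_KEY"),
            temperature=agent_data.get("temperature", 0.7),
        )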

        # Create agent with environment tools (convert dict values to list for smolagents)
        agent = ToolCallingAgent(
            tools=list(environment.get_tools().values()),
            model=model,
            instructions="""You are a helpful assistant. Help users with email and banking tasks 
by using the available tools to retrieve information and take appropriate actions. 
Be professional and thorough in your responses.""",
        )

        # Wrap agent in adapter for MASEval
        agent_adapter = SmolAgentAdapter(agent, "main_agent")

        # Return (agents_to_run, agents_dict)
        return [agent_adapter], {"main_agent": agent_adapter}

    def setup_evaluators(self, environment: Environment, task: Task, agents: Sequence[AgentAdapter], user=None) -> Sequence[Evaluator]:
        """Create evaluators for the task."""
        return [FinancialAccuracyEvaluator(task, environment, user), EmailSentEvaluator(task, environment, user)]

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Any:
        """Execute the agent and return the final answer."""
        # Run the main agent with the task query
        agent = agents[0]
        result = agent.run(query)
        return result

    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:
        """Return a model adapter for benchmark components that need LLM access.

        This tutorial doesn't use simulated tools, user simulators, or LLM judges,
        so this method is not called during execution.
        """
        raise NotImplementedError("This tutorial doesn't use model adapters for tools/users/evaluators.")

    def evaluate(
        self, evaluators: Sequence[Evaluator], agents: Dict[str, AgentAdapter], final_answer: Any, traces: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            # Filter traces for this evaluator
            filtered_traces = evaluator.filter_traces(traces)
            # Run evaluation
            result = evaluator(filtered_traces, final_answer)
            results.append(result)
        return results


print("Benchmark class defined!")
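
The evaluate method applies one pattern to every evaluator: narrow the traces with filter_traces, then call the evaluator on the result. Here is a minimal standalone sketch of that pattern. StubEvaluator and the trace layout are hypothetical stand-ins and nothing here calls MASEval, so treat it as an illustration of the loop rather than of the library's internals.

```python
from typing import Any, Dict, List, Optional


class StubEvaluator:
    """Stand-in exposing the same two methods the tutorial's evaluators implement."""

    def __init__(self, name: str, trace_key: str):
        self.name = name
        self.trace_key = trace_key

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        # Keep only the slice of the traces this evaluator cares about.
        return traces.get(self.trace_key, {})

    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        # Score 1.0 when the filtered slice is non-empty, otherwise 0.0.
        return {"evaluator": self.name, "score": 1.0 if traces else 0.0}


def evaluate(evaluators: List[StubEvaluator], final_answer: Any, traces: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Filter the traces for each evaluator, then score them."""
    return [ev(ev.filter_traces(traces), final_answer) for ev in evaluators]


# Made-up traces: the environment slice has content, the agents slice is empty.
traces = {"environment": {"send_email": ["call-1"]}, "agents": {}}
results = evaluate([StubEvaluator("env", "environment"), StubEvaluator("agents", "agents")], "done", traces)
print(results)
```

Keeping the filtering step separate means each evaluator sees only the slice of the traces it declared an interest in.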

2.7 Run the Benchmark

Now let's run the benchmark on our task and see the results!

# Create benchmark instance
agent_data = {"model_id": "gemini/gemini-2.5-flash", "temperature": 0.7}

benchmark = SimpleBenchmark(progress_bar=False)

# Create task queue
tasks = TaskQueue([task])

# Run the benchmark
print("Running benchmark...\n")
reports = benchmark.run(tasks=tasks, agent_data=agent_data)

print("\n" + "=" * 60)
print("BENCHMARK COMPLETE")
print("=" * 60)

2.8 Analyze the Results

Let's examine the evaluation results to see how well our agent performed.

# Get results for the first (and only) task
report = reports[0]

print(f"Task ID: {report['task_id']}")
print(f"Status: {report['status']}")
print("\nEvaluation Results:")
print("-" * 60)

if report.get("eval"):
    for eval_result in report["eval"]:
        print(f"\nEvaluator: {eval_result.get('evaluator', 'Unknown')}")
        print(f"Score: {eval_result.get('score', 'N/A')}")

        # Print relevant details
        for key, value in eval_result.items():
            if key not in ["evaluator", "score"]:
                print(f"  {key}: {value}")
else:
    print("No evaluation results available.")
    if report.get("error"):
        print(f"\nError: {report['error']}")

print("\n" + "=" * 60)
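
With more than one task, it is often handy to average each evaluator's score across all reports. The sketch below assumes reports are dictionaries with an "eval" list of results carrying "evaluator" and "score" keys, as in the printing code above. The sample report and its values are invented, and mean_scores is a hypothetical helper.

```python
from statistics import mean
from typing import Any, Dict, List

# Hypothetical report shaped like the fields printed above; real reports may carry more keys.
sample_reports: List[Dict[str, Any]] = [
    {
        "task_id": "example-task",
        "status": "success",
        "eval": [
            {"evaluator": "FinancialAccuracyEvaluator", "score": 1.0},
            {"evaluator": "EmailSentEvaluator", "score": 0.0},
        ],
    },
]


def mean_scores(reports: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average each evaluator's score across all reports that contain it."""
    by_evaluator: Dict[str, List[float]] = {}
    for report in reports:
        # Reports without evaluation results contribute nothing.
        for result in report.get("eval") or []:
            name = result.get("evaluator", "Unknown")
            by_evaluator.setdefault(name, []).append(float(result.get("score", 0.0)))
    return {name: mean(scores) for name, scores in by_evaluator.items()}


print(mean_scores(sample_reports))
```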

Summary

In this tutorial, you learned:

Part 1: Agent Development

How to create custom tools for smolagents
How to initialize and configure a ToolCallingAgent
How to test your agent with queries

Part 2: Systematic Evaluation with MASEval

How to structure tasks with queries, environments, and evaluation criteria
How to create custom environments that manage tool state
How to write evaluators that measure specific aspects of agent performance
How to run benchmarks and analyze results

Next Steps

Try the Five-A-Day Benchmark notebook — A production-ready example with multi-agent systems and diverse evaluators
Create your own custom evaluators for your specific use case
Experiment with different agent frameworks (LangGraph, LlamaIndex)
Add callbacks for logging and tracing

For more information, visit the MASEval documentation.