Tiny Tutorial¶
This tutorial is available as a Jupyter notebook: clone the repo and run it yourself!
What You'll Learn¶
- Build your first agent — Create tools and agents with smolagents
- Run a minimal benchmark — One task, one agent, end-to-end
- Understand the core abstractions — Tasks, Environments, Evaluators working together
This tutorial first uses smolagents to introduce agents, then runs a minimal single-task benchmark with MASEval.
Setup¶
First, let's install the required dependencies and import the libraries we need.
# Install dependencies (uncomment if needed)
# !pip install maseval[smolagents]
# !pip install litellm
import os
import json
from pathlib import Path
from typing import Any, Dict, List, Optional
# Set your API key
# os.environ["GOOGLE_API_KEY"] = "your-api-key-here"
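If the key is missing, the first model call fails with a less obvious error, so it can help to check up front. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper name, not part of smolagents or MASEval.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; set it before running the notebook.")
    return value


# Demonstrate the failure path with a variable name that is assumed to be unset
try:
    require_env("MASEVAL_TUTORIAL_DEMO_UNSET")
except RuntimeError as exc:
    print(exc)
```

In the notebook itself, the call would be `require_env("GOOGLE_API_KEY")`, placed before the model is created.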
Part 1: Agent Initialization with smolagents¶
Let's start by building an agent using smolagents. We'll create a simple agent that can handle email and banking tasks.
1.1 Define Custom Tools¶
For this example, we'll create simplified versions of email and banking tools. In the full benchmark, these tools are more sophisticated and stateful.
from smolagents import Tool
class SimpleBankingTool(Tool):
    """A simple tool to retrieve banking transactions."""

    name = "get_transactions"
    description = "Retrieve recent banking transactions. Returns a list of transactions with date, description, amount, and type."
    inputs = {}
    output_type = "string"

    def __init__(self, transactions: List[Dict], **kwargs):
        super().__init__(**kwargs)
        self.transactions = transactions

    def forward(self) -> str:
        """Return all transactions as a formatted string."""
        if not self.transactions:
            return "No transactions found."

        result = "Recent Transactions:\n"
        for txn in self.transactions:
            result += f"- {txn['date']}: {txn['description']} - ${txn['amount']} ({txn['type']})\n"
        return result
class SimpleInboxTool(Tool):
    """A simple tool to read the email inbox."""

    name = "get_inbox"
    description = "Retrieve all emails in the inbox. Returns sender, subject, and body for each email."
    inputs = {}
    output_type = "string"

    def __init__(self, inbox: List[Dict], **kwargs):
        super().__init__(**kwargs)
        self.inbox = inbox

    def forward(self) -> str:
        """Return all emails in the inbox as a formatted string."""
        if not self.inbox:
            return "Inbox is empty."

        result = "Email Inbox:\n"
        for i, email in enumerate(self.inbox, 1):
            result += f"\n--- Email {i} ---\n"
            result += f"From: {email['from']}\n"
            result += f"Subject: {email['subject']}\n"
            result += f"Body: {email['body']}\n"
        return result
class SimpleEmailTool(Tool):
    """A simple tool to send emails."""

    name = "send_email"
    description = "Send an email to a recipient. Provide the recipient email, subject, and body text."
    inputs = {
        "to": {"type": "string", "description": "Recipient email address"},
        "subject": {"type": "string", "description": "Email subject line"},
        "body": {"type": "string", "description": "Email body text"},
    }
    output_type = "string"

    def __init__(self, sent_emails: List, **kwargs):
        super().__init__(**kwargs)
        self.sent_emails = sent_emails  # Store sent emails for tracking

    def forward(self, to: str, subject: str, body: str) -> str:
        """Send an email and store it."""
        email = {"to": to, "subject": subject, "body": body}
        self.sent_emails.append(email)
        return f"Email sent successfully to {to}"
print("Tools defined successfully!")
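The email tool keeps a reference to a list owned by the caller, which is what later lets the notebook (and the evaluators) see what the agent sent. The pattern can be sketched without smolagents; `EmailRecorder` below is a hypothetical plain-Python stand-in for `SimpleEmailTool`, not a smolagents class.

```python
from typing import Dict, List


class EmailRecorder:
    """Plain-Python stand-in for SimpleEmailTool (no smolagents dependency)."""

    def __init__(self, sent_emails: List[Dict[str, str]]):
        # Keep a reference to the caller's list rather than a copy,
        # so appended emails are visible outside this object.
        self.sent_emails = sent_emails

    def forward(self, to: str, subject: str, body: str) -> str:
        self.sent_emails.append({"to": to, "subject": subject, "body": body})
        return f"Email sent successfully to {to}"


outbox: List[Dict[str, str]] = []
recorder = EmailRecorder(sent_emails=outbox)
message = recorder.forward("sarah.johnson@email.com", "Payment received", "Thanks, both payments arrived.")

print(message)      # Email sent successfully to sarah.johnson@email.com
print(len(outbox))  # 1
```

Because the list is shared rather than copied, the caller needs no extra accessor on the tool to inspect its side effects.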
1.2 Create Tool Instances with Data¶
Now let's instantiate our tools with the actual data from the benchmark task.
# Sample banking data from the benchmark
banking_transactions = [
    {"date": "2025-11-15", "description": "Tenant Deposit - Sarah Johnson", "amount": 2000, "type": "deposit"},
    {"date": "2025-11-17", "description": "Rent Payment - Sarah Johnson", "amount": 1500, "type": "deposit"},
    {"date": "2025-11-16", "description": "Property Maintenance", "amount": -450, "type": "expense"},
]
# Sample email inbox
email_inbox = [
    {
        "from": "sarah.johnson@email.com",
        "to": "sean.crane85@mymail-online.biz",
        "subject": "Rental Payment Confirmation",
        "body": "Hi Sean, I just transferred the deposit ($2,000) and first month's rent ($1,500) to your account. Can you please confirm you received it? Thanks, Sarah",
    }
]
# List to track sent emails
sent_emails = []
# Create tool instances
banking_tool = SimpleBankingTool(transactions=banking_transactions)
inbox_tool = SimpleInboxTool(inbox=email_inbox)
email_tool = SimpleEmailTool(sent_emails=sent_emails)
print(f"Created {len([banking_tool, inbox_tool, email_tool])} tools")
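Before handing this data to an agent, it is worth working out by hand what a correct reply should confirm. The sketch below repeats the transaction list from above and sums the deposits attributed to Sarah; `total_from` is a hypothetical helper introduced only for this check.

```python
from typing import Dict, List

banking_transactions: List[Dict] = [
    {"date": "2025-11-15", "description": "Tenant Deposit - Sarah Johnson", "amount": 2000, "type": "deposit"},
    {"date": "2025-11-17", "description": "Rent Payment - Sarah Johnson", "amount": 1500, "type": "deposit"},
    {"date": "2025-11-16", "description": "Property Maintenance", "amount": -450, "type": "expense"},
]


def total_from(transactions: List[Dict], name: str) -> int:
    """Sum the deposit amounts whose description mentions the given name."""
    return sum(t["amount"] for t in transactions if t["type"] == "deposit" and name in t["description"])


sarah_total = total_from(banking_transactions, "Sarah Johnson")
print(sarah_total)  # 3500
```

The total of 3500 matches the $2,000 deposit plus $1,500 rent that Sarah mentions in her email, so a correct agent reply should confirm both payments.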
1.3 Initialize the Agent¶
Now we'll create a smolagents agent with our custom tools and give it clear instructions.
from smolagents import ToolCallingAgent, LiteLLMModel
# Initialize the model
model = LiteLLMModel(model_id="gemini/gemini-2.5-flash", api_key=os.getenv("GOOGLE_API_KEY"), temperature=0.7)
# Create the agent with tools and instructions
agent = ToolCallingAgent(
    tools=[banking_tool, inbox_tool, email_tool],
    model=model,
    instructions="""You are a helpful assistant that helps users with email and banking tasks.
Use the available tools to retrieve information and take appropriate actions.
Be professional and thorough in your responses.""",
)
print("Agent initialized successfully!")
1.4 Test the Agent¶
Let's test our agent with the actual task query from the benchmark.
# The task query from the benchmark
query = """Sarah Johnson emailed me to confirm that I received her payment for the deposit
and first month's rent. Please check my transactions and send an email reply accordingly."""
# Run the agent
response = agent.run(query)
print("\n" + "=" * 60)
print("AGENT RESPONSE:")
print("=" * 60)
print(response)
print("=" * 60)
1.5 Inspect What Happened¶
Let's check if the agent sent an email and what it contained.
print("Emails sent by the agent:")
print("\n")
if sent_emails:
    for i, email in enumerate(sent_emails, 1):
        print(f"Email #{i}")
        print(f"To: {email['to']}")
        print(f"Subject: {email['subject']}")
        print(f"Body:\n{email['body']}")
        print("\n" + "-" * 60 + "\n")
else:
    print("No emails were sent.")
Part 2: Evaluating Agents with MASEval¶
Now that we understand how the agent works, let's see how MASEval helps us systematically evaluate agent performance across multiple tasks.
MASEval provides:
- Tasks: Define queries, environments, and evaluation criteria
- Environments: Manage tool state and provide context
- Evaluators: Measure agent performance using various metrics
- Benchmarks: Orchestrate execution and collect results
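To make the division of labour concrete before touching the real classes, here is a framework-free sketch of the same four roles. Every name in it (`ToyTask`, `ToyEnvironment`, `toy_agent`, `reply_sent_evaluator`, `run_toy_benchmark`) is hypothetical and only illustrates the idea; none of it is the MASEval API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToyTask:
    """Task: the query plus the data needed to set up and score a run."""
    query: str
    environment_data: Dict[str, Any]
    evaluation_data: Dict[str, Any] = field(default_factory=dict)


class ToyEnvironment:
    """Environment: owns the mutable state the agent acts on."""

    def __init__(self, data: Dict[str, Any]):
        self.inbox: List[str] = list(data.get("inbox", []))
        self.outbox: List[str] = []


def toy_agent(query: str, env: ToyEnvironment) -> str:
    """Stand-in for an LLM agent: replies once to every inbox message."""
    for message in env.inbox:
        env.outbox.append(f"Re: {message}")
    return f"Handled {len(env.inbox)} message(s)"


def reply_sent_evaluator(task: ToyTask, env: ToyEnvironment, final_answer: str) -> Dict[str, Any]:
    """Evaluator: scores the run by inspecting the environment afterwards."""
    expected = task.evaluation_data.get("expected_replies", 1)
    return {"evaluator": "reply_sent", "score": 1.0 if len(env.outbox) >= expected else 0.0}


def run_toy_benchmark(tasks: List[ToyTask], agent: Callable, evaluators: List[Callable]) -> List[Dict[str, Any]]:
    """Benchmark: builds a fresh environment per task, runs the agent, then evaluates."""
    reports = []
    for task in tasks:
        env = ToyEnvironment(task.environment_data)
        answer = agent(task.query, env)
        reports.append({"answer": answer, "eval": [evaluate(task, env, answer) for evaluate in evaluators]})
    return reports


toy_tasks = [
    ToyTask(
        query="Reply to my inbox",
        environment_data={"inbox": ["Payment confirmation?"]},
        evaluation_data={"expected_replies": 1},
    )
]
toy_reports = run_toy_benchmark(toy_tasks, toy_agent, [reply_sent_evaluator])
print(toy_reports[0]["eval"][0]["score"])  # 1.0
```

The point of the split is that the agent never scores itself: the evaluator reads the environment state after the run, which is the same approach the evaluators later in this tutorial take with `sent_emails`.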
2.1 Import MASEval Components¶
Let's import the core MASEval components we'll need.
from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue
from maseval.interface.agents.smolagents import SmolAgentAdapter
print("MASEval components imported successfully!")
2.2 Load Task Data¶
The Five-A-Day benchmark uses JSON files to define tasks. Let's load the first task (Email & Banking).
# Load task data from JSON
data_dir = Path("data")
with open(data_dir / "tasks.json", "r") as f:
    tasks_data = json.load(f)
# Get the first task (Email & Banking)
task_data = tasks_data[0]
print("Task Query:")
print(task_data["query"])
print("\nTools Required:")
print(task_data["environment_data"]["tools"])
print("\nEvaluators:")
print(task_data["evaluation_data"]["evaluators"])
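The contents of `data/tasks.json` are not reproduced here, but the key names this tutorial reads from each entry can be collected into a quick shape check. In the sketch below, the key paths come from the tutorial's own code, while every value in `example_task` is a placeholder and `missing_paths` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# Placeholder entry: key names mirror what the tutorial accesses,
# values are illustrative and NOT the real contents of data/tasks.json.
example_task: Dict[str, Any] = {
    "query": "Check my transactions and reply to the tenant.",
    "environment_data": {
        "tools": ["get_transactions", "get_inbox", "send_email"],
        "banking": {"bank_transactions": []},
        "email_inbox": [],
    },
    "evaluation_data": {
        "evaluators": ["FinancialAccuracyEvaluator", "EmailSentEvaluator"],
        "expected_deposit_amount": 2000,
        "expected_rent_amount": 1500,
    },
    "metadata": {"task_id": "example-task", "complexity": "low", "skills_tested": ["email", "banking"]},
}

# Key paths that the tutorial code indexes directly (a KeyError otherwise)
REQUIRED_PATHS: List[Tuple[str, ...]] = [
    ("query",),
    ("environment_data", "tools"),
    ("evaluation_data", "evaluators"),
    ("evaluation_data", "expected_deposit_amount"),
    ("evaluation_data", "expected_rent_amount"),
    ("metadata", "task_id"),
    ("metadata", "complexity"),
    ("metadata", "skills_tested"),
]


def missing_paths(task_data: Dict[str, Any]) -> List[str]:
    """Return the dotted key paths that are absent from a task entry."""
    missing = []
    for path in REQUIRED_PATHS:
        node: Any = task_data
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


print(missing_paths(example_task))  # []
```

Running `missing_paths(task_data)` on the loaded entry gives an early, readable list of gaps instead of a `KeyError` several cells later.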
2.3 Create a Task Object¶
MASEval uses Task objects to encapsulate all information about a benchmark task.
# Create a Task instance
task = Task(
    query=task_data["query"],
    id=task_data["metadata"]["task_id"],
    environment_data=task_data["environment_data"],
    evaluation_data=task_data["evaluation_data"],
    metadata=task_data["metadata"],
)
print(f"Created task: {task.id}")
print(f"Complexity: {task.metadata['complexity']}")
print(f"Skills tested: {', '.join(task.metadata['skills_tested'])}")
2.4 Define a Custom Environment¶
The Environment class manages tool state and provides tools to the agent. Here's a simplified version of the FiveADayEnvironment.
class SimpleEnvironment(Environment):
    """Simplified environment for the Email & Banking task."""

    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from task data."""
        return task_data.copy()

    def create_tools(self) -> Dict[str, Any]:
        """Create tool instances from environment data, keyed by name."""
        # Get banking transactions and inbox from environment data
        transactions = self.state.get("banking", {}).get("bank_transactions", [])
        inbox = self.state.get("email_inbox", [])

        # Create tool instances - track sent emails for evaluation
        self.sent_emails: List[Dict] = []
        banking_tool = SimpleBankingTool(transactions=transactions)
        inbox_tool = SimpleInboxTool(inbox=inbox)
        email_tool = SimpleEmailTool(sent_emails=self.sent_emails)

        return {"get_transactions": banking_tool, "get_inbox": inbox_tool, "send_email": email_tool}
print("Environment class defined!")
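Note that `setup_state` uses `dict.copy()`, which is a shallow copy: nested lists such as the inbox are still shared with the original task data. That is harmless in this tutorial because the banking and inbox tools only read their data, but a tool that mutated the inbox would also change the task. The difference can be seen with the standard library alone (the data below is a trimmed illustration):

```python
import copy

original_data = {"email_inbox": [{"from": "sarah.johnson@email.com", "subject": "Rental Payment Confirmation"}]}

shallow = original_data.copy()       # new outer dict, same nested list
deep = copy.deepcopy(original_data)  # fully independent copy

# Mutating nested data through the shallow copy also changes the original
shallow["email_inbox"].append({"from": "other@example.com", "subject": "Hello"})

print(len(original_data["email_inbox"]))  # 2: the list is shared with the shallow copy
print(len(deep["email_inbox"]))           # 1: the deep copy is isolated
```

For environments whose tools modify state, returning `copy.deepcopy(task_data)` from `setup_state` would be the safer choice when the same task is run more than once.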
2.5 Create Custom Evaluators¶
Evaluators measure agent performance. Let's create two evaluators:
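- FinancialAccuracyEvaluator: Reports the expected deposit and rent amounts from the task data. In this simplified version the score depends only on whether an email was sent; the amounts in the email body are not checked.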
- EmailSentEvaluator: Checks if the agent sent an email
class FinancialAccuracyEvaluator(Evaluator):
    """Reports the expected payment amounts and whether a confirmation email was sent."""

    def __init__(self, task: Task, environment: Environment, user=None):
        """Initialize with task, environment, and optional user."""
        super().__init__(task, environment, user)
        self.task = task
        self.environment = environment

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Filter to environment traces to check tool usage."""
        return traces.get("environment", {})

    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        """Score the run on whether an email was sent (amounts are reported, not verified)."""
        # Expected values from task evaluation data
        expected_deposit = self.task.evaluation_data["expected_deposit_amount"]
        expected_rent = self.task.evaluation_data["expected_rent_amount"]

        # Check if emails were sent by looking at environment state
        sent_emails = getattr(self.environment, "sent_emails", [])
        email_sent = len(sent_emails) > 0

        return {
            "evaluator": "FinancialAccuracyEvaluator",
            "email_sent": email_sent,
            "emails_count": len(sent_emails),
            "expected_deposit": expected_deposit,
            "expected_rent": expected_rent,
            "score": 1.0 if email_sent else 0.0,
            "message": "Agent sent confirmation email" if email_sent else "No email was sent",
        }
class EmailSentEvaluator(Evaluator):
    """Evaluates whether the agent sent an email and reports its recipient and subject."""

    def __init__(self, task: Task, environment: Environment, user=None):
        """Initialize with task, environment, and optional user."""
        super().__init__(task, environment, user)
        self.task = task
        self.environment = environment

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Filter to environment traces."""
        return traces.get("environment", {})

    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        """Check whether an email was sent and describe the last one."""
        sent_emails = getattr(self.environment, "sent_emails", [])

        if not sent_emails:
            return {"evaluator": "EmailSentEvaluator", "email_sent": False, "score": 0.0, "error": "No email was sent"}

        # Get the last email that was sent
        email_data = sent_emails[-1]

        return {
            "evaluator": "EmailSentEvaluator",
            "email_sent": True,
            "score": 1.0,
            "recipient": email_data.get("to"),
            "subject": email_data.get("subject"),
            "message": "Agent successfully sent an email",
        }
print("Evaluators defined!")
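Both evaluators score a run purely on whether an email exists. A stricter check could look for the expected amounts in the email body. The sketch below shows one way to do that with a regular expression; `amounts_in_text` and `mentions_all_amounts` are hypothetical helpers, and the pattern only recognises amounts written with a dollar sign, so a body that says "2000 USD" would not match.

```python
import re
from typing import Iterable, List


def amounts_in_text(text: str) -> List[int]:
    """Extract whole-dollar amounts such as "$2,000" or "$1500" from text."""
    return [int(match.replace(",", "")) for match in re.findall(r"\$\s?(\d[\d,]*)", text)]


def mentions_all_amounts(body: str, expected: Iterable[int]) -> bool:
    """Return True if every expected amount appears in the body."""
    found = set(amounts_in_text(body))
    return all(amount in found for amount in expected)


sample_body = "Hi Sarah, I received the $2,000 deposit and the $1,500 rent payment."
print(mentions_all_amounts(sample_body, [2000, 1500]))                  # True
print(mentions_all_amounts("Payment received, thanks!", [2000, 1500]))  # False
```

Inside an evaluator's `__call__`, the same check could be applied to `sent_emails[-1]["body"]` with the expected deposit and rent amounts from `task.evaluation_data`.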
2.6 Create a Custom Benchmark¶
The Benchmark class orchestrates task execution and evaluation. We'll create a simplified version.
from maseval import AgentAdapter, ModelAdapter
from typing import Sequence, Tuple
class SimpleBenchmark(Benchmark):
    """Simplified benchmark for the tutorial."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task) -> Environment:
        """Create an environment for the task."""
        return SimpleEnvironment(task.environment_data)

    def setup_agents(
        self, agent_data: Dict[str, Any], environment: Environment, task: Task, user=None
    ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]:
        """Create an agent for the task."""
        # Initialize model
        model = LiteLLMModel(model_id="gemini/gemini-2.5-flash", api_key=os.getenv("GOOGLE_API_KEY"), temperature=0.7)

        # Create agent with environment tools (convert dict values to list for smolagents)
        agent = ToolCallingAgent(
            tools=list(environment.get_tools().values()),
            model=model,
            instructions="""You are a helpful assistant. Help users with email and banking tasks
by using the available tools to retrieve information and take appropriate actions.
Be professional and thorough in your responses.""",
        )

        # Wrap agent in adapter for MASEval
        agent_adapter = SmolAgentAdapter(agent, "main_agent")

        # Return (agents_to_run, agents_dict)
        return [agent_adapter], {"main_agent": agent_adapter}

    def setup_evaluators(self, environment: Environment, task: Task, agents: Sequence[AgentAdapter], user=None) -> Sequence[Evaluator]:
        """Create evaluators for the task."""
        return [FinancialAccuracyEvaluator(task, environment, user), EmailSentEvaluator(task, environment, user)]

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Any:
        """Execute the agent and return the final answer."""
        # Run the main agent with the task query
        agent = agents[0]
        result = agent.run(query)
        return result

    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:
        """Return a model adapter for benchmark components that need LLM access.

        This tutorial doesn't use simulated tools, user simulators, or LLM judges,
        so this method is not called during execution.
        """
        raise NotImplementedError("This tutorial doesn't use model adapters for tools/users/evaluators.")

    def evaluate(
        self, evaluators: Sequence[Evaluator], agents: Dict[str, AgentAdapter], final_answer: Any, traces: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            # Filter traces for this evaluator
            filtered_traces = evaluator.filter_traces(traces)
            # Run evaluation
            result = evaluator(filtered_traces, final_answer)
            results.append(result)
        return results
print("Benchmark class defined!")
2.7 Run the Benchmark¶
Now let's run the benchmark on our task and see the results!
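# Agent configuration handed to the benchmark run.
# Note: SimpleBenchmark.setup_agents above hard-codes the same model settings
# rather than reading them from agent_data.
agent_data = {"model_id": "gemini/gemini-2.5-flash", "temperature": 0.7}
# Create benchmark instance
benchmark = SimpleBenchmark(progress_bar=False)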
# Create task queue
tasks = TaskQueue([task])
# Run the benchmark
print("Running benchmark...\n")
reports = benchmark.run(tasks=tasks, agent_data=agent_data)
print("\n" + "=" * 60)
print("BENCHMARK COMPLETE")
print("=" * 60)
2.8 Analyze the Results¶
Let's examine the evaluation results to see how well our agent performed.
# Get results for the first (and only) task
report = reports[0]
print(f"Task ID: {report['task_id']}")
print(f"Status: {report['status']}")
print("\nEvaluation Results:")
print("-" * 60)
if report.get("eval"):
    for eval_result in report["eval"]:
        print(f"\nEvaluator: {eval_result.get('evaluator', 'Unknown')}")
        print(f"Score: {eval_result.get('score', 'N/A')}")
        # Print relevant details
        for key, value in eval_result.items():
            if key not in ["evaluator", "score"]:
                print(f"  {key}: {value}")
else:
    print("No evaluation results available.")
if report.get("error"):
    print(f"\nError: {report['error']}")
print("\n" + "=" * 60)
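With more evaluators or tasks, a single summary number per report is often easier to compare than the raw dictionaries. The sketch below averages the `score` fields of one report; `example_report` is a hypothetical stand-in that reuses the keys read in the cell above (its values, including the status string, are made up), and `mean_score` is an illustrative helper.

```python
from statistics import mean
from typing import Any, Dict, Optional

# Hypothetical report with the same keys the analysis cell reads
example_report: Dict[str, Any] = {
    "task_id": "example-task",
    "status": "success",
    "eval": [
        {"evaluator": "FinancialAccuracyEvaluator", "score": 1.0},
        {"evaluator": "EmailSentEvaluator", "score": 0.0},
    ],
}


def mean_score(report: Dict[str, Any]) -> Optional[float]:
    """Average the numeric scores in a report, or return None if there are none."""
    results = report.get("eval") or []
    scores = [r["score"] for r in results if isinstance(r.get("score"), (int, float))]
    return mean(scores) if scores else None


print(mean_score(example_report))  # 0.5
```

Returning `None` rather than `0.0` for a report without scores keeps failed or unevaluated runs distinguishable from runs that genuinely scored zero.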
Summary¶
In this tutorial, you learned:
Part 1: Agent Development¶
- How to create custom tools for smolagents
- How to initialize and configure a ToolCallingAgent
- How to test your agent with queries
Part 2: Systematic Evaluation with MASEval¶
- How to structure tasks with queries, environments, and evaluation criteria
- How to create custom environments that manage tool state
- How to write evaluators that measure specific aspects of agent performance
- How to run benchmarks and analyze results
Next Steps¶
- Try the Five-A-Day Benchmark notebook — A production-ready example with multi-agent systems and diverse evaluators
- Create your own custom evaluators for your specific use case
- Experiment with different agent frameworks (LangGraph, LlamaIndex)
- Add callbacks for logging and tracing
For more information, visit the MASEval documentation.