Callback

Callbacks allow you to hook into benchmark execution at various points. Use them for logging, monitoring, tracing, or custom side effects during agent runs.

BenchmarkCallback

Bases: ABC, TraceableMixin

Base class for benchmark callbacks.

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.
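
As a rough illustration of how a subclass might use these two hooks, the sketch below counts the events it receives and exposes the counts through `gather_traces`. To stay runnable without MASEval installed it uses a stand-in base class: `_StandInCallback`, `EventCountingCallback`, the event names, and the `"type"` key are all hypothetical, and the dictionary returned by the real base `gather_traces` may have different keys.

```python
from collections import Counter
from typing import Any


class _StandInCallback:
    """Hypothetical stand-in for maseval's BenchmarkCallback (assumption)."""

    def on_event(self, event_name: str, **data: Any) -> None:
        pass  # default behaviour: ignore the event

    def gather_traces(self) -> dict[str, Any]:
        # Assumed shape of the "basic callback information".
        return {"type": type(self).__name__}


class EventCountingCallback(_StandInCallback):
    """Counts events by name and keeps the most recent payload."""

    def __init__(self) -> None:
        self.counts = Counter()
        self.last_payload = {}

    def on_event(self, event_name: str, **data: Any) -> None:
        self.counts[event_name] += 1
        self.last_payload = dict(data)

    def gather_traces(self) -> dict[str, Any]:
        # Extend the base information instead of replacing it.
        traces = super().gather_traces()
        traces["event_counts"] = dict(self.counts)
        return traces


callback = EventCountingCallback()
callback.on_event("task_started", task_id="t1")
callback.on_event("task_started", task_id="t2")
callback.on_event("task_finished", task_id="t1")
print(callback.gather_traces())
```

With the real library, the subclass would inherit from `BenchmarkCallback` instead of the stand-in.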

EnvironmentCallback

Bases: ABC, TraceableMixin

Base class for environment callbacks.

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

AgentCallback

Bases: ABC, TraceableMixin

Base class for agent callbacks.

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

Built-in Callbacks

MASEval provides built-in callback implementations:

Message Tracing

MessageTracingAgentCallback

Bases: AgentCallback

Callback that traces all agent messages to memory.

This callback is useful for:

- Frameworks that don't provide built-in message history
- Debugging agent behavior
- Creating datasets from agent runs
- Monitoring multi-agent systems

The callback collects all message history from agents after each run.

Example

from maseval import AgentAdapter
from maseval.core.callbacks.message_tracing import MessageTracingAgentCallback

# Create callback
tracer = MessageTracingAgentCallback(include_metadata=True, verbose=True)

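# Use with an agent adapter (MyAgentAdapter is a placeholder for your own
# AgentAdapter subclass; `agent` is your framework's agent object)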
agent_adapter = MyAgentAdapter(agent, name="agent1", callbacks=[tracer])
agent_adapter.run("What's the weather?")

# Access traced conversations
for conversation in tracer.get_all_conversations():
    print(f"Agent: {conversation['agent_name']}")
    print(f"Query: {conversation['query']}")
    print(f"Messages: {len(conversation['messages'])}")

__init__

__init__(
    include_metadata: bool = True, verbose: bool = False
)

Initialize the message tracing callback.

PARAMETER	DESCRIPTION
`include_metadata`	If True, include timestamps and metadata in traces TYPE: `bool` DEFAULT: `True`
`verbose`	If True, print tracing information to console TYPE: `bool` DEFAULT: `False`

clear

clear() -> None

Clear all conversations from memory.

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

get_all_conversations

get_all_conversations() -> List[Dict[str, Any]]

Get all traced conversations from memory.

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of conversation dictionaries

get_conversations_by_agent

get_conversations_by_agent(
    agent_name: str,
) -> List[Dict[str, Any]]

Get all conversations for a specific agent.

PARAMETER	DESCRIPTION
`agent_name`	Name of the agent to filter by TYPE: `str`

RETURNS	DESCRIPTION
`List[Dict[str, Any]]`	List of conversation dictionaries for the specified agent

get_statistics

get_statistics() -> Dict[str, Any]

Get statistics about traced conversations.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary with statistics
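
The keys of the statistics dictionary are not listed here. As a hedged sketch of the kind of summary that could be derived from traced conversations, the helper below works on dictionaries shaped like those in the class example (`agent_name`, `query`, `messages`). The function `summarize_conversations` and its output keys are hypothetical and are not the library's actual statistics.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_conversations(conversations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: summary counts over traced conversations."""
    per_agent: Dict[str, int] = defaultdict(int)
    total_messages = 0
    for conversation in conversations:
        per_agent[conversation["agent_name"]] += 1
        total_messages += len(conversation["messages"])
    return {
        "total_conversations": len(conversations),
        "total_messages": total_messages,
        "conversations_per_agent": dict(per_agent),
    }


# Sample data shaped like the conversation dictionaries (assumed for illustration).
sample = [
    {"agent_name": "agent1", "query": "What's the weather?", "messages": ["q", "a"]},
    {"agent_name": "agent2", "query": "Book a flight", "messages": ["q", "a", "a2"]},
    {"agent_name": "agent1", "query": "And tomorrow?", "messages": ["q"]},
]
stats = summarize_conversations(sample)
print(stats)
```

In practice the input would come from `tracer.get_all_conversations()` or `tracer.get_conversations_by_agent(...)`.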

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

on_run_end

on_run_end(agent: AgentAdapter, result: Any) -> None

Called when agent execution completes.

PARAMETER	DESCRIPTION
`agent`	The agent adapter instance TYPE: `AgentAdapter`
`result`	The result returned by the agent (usually MessageHistory) TYPE: `Any`

on_run_start

on_run_start(agent: AgentAdapter) -> None

Called when agent execution starts.

Note: The query is not available at this point in the current implementation, so it is captured in on_run_end from the result instead.

Result Logging

ResultLogger

Bases: BenchmarkCallback, ABC

Abstract base class for logging benchmark results to various backends.

This class provides a framework for implementing result loggers that:

- Write results incrementally after each task iteration (repeat)
- Track expected vs. actual logged iterations
- Validate completeness at benchmark end
- Support selective logging of traces, config, and eval results

Subclasses implement specific backends (file, wandb, opentelemetry, etc.) by overriding the abstract methods.

ATTRIBUTE	DESCRIPTION
`include_traces`	Whether to include execution traces in logged results
`include_config`	Whether to include configuration in logged results
`include_eval`	Whether to include evaluation results in logged results
`include_usage`	Whether to include API usage data in logged results
`validate_on_completion`	Whether to validate all iterations were logged

Example

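from typing import Dict

# Assumption: ResultLogger lives in the same module as FileResultLogger
from maseval.core.callbacks.result_logger import ResultLogger

class MyLogger(ResultLogger):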
    def log_iteration(self, report: Dict) -> None:
        # Write report to backend
        pass

    def finalize(self) -> None:
        # Close connections, flush buffers
        pass

    def validate(self) -> bool:
        # Check all iterations present
        return True

logger = MyLogger(include_traces=True)
benchmark = MyBenchmark(tasks, agent_data, callbacks=[logger])
benchmark.run()

__init__

__init__(
    include_traces: bool = True,
    include_config: bool = True,
    include_eval: bool = True,
    include_task: bool = True,
    include_usage: bool = True,
    validate_on_completion: bool = True,
)

Initialize the result logger.

PARAMETER	DESCRIPTION
`include_traces`	If True, include execution traces in logged results TYPE: `bool` DEFAULT: `True`
`include_config`	If True, include configuration in logged results TYPE: `bool` DEFAULT: `True`
`include_eval`	If True, include evaluation results in logged results TYPE: `bool` DEFAULT: `True`
`include_task`	If True, include task data (query, metadata, protocol) in logged results TYPE: `bool` DEFAULT: `True`
`include_usage`	If True, include API usage data in logged results TYPE: `bool` DEFAULT: `True`
`validate_on_completion`	If True, validate all iterations were logged at end TYPE: `bool` DEFAULT: `True`

finalize `abstractmethod`

finalize() -> None

Finalize logging operations.

Called at benchmark end. Implementations should:

- Close file handles
- Flush buffers
- Close network connections
- Write metadata files
- Perform any cleanup operations

RAISES	DESCRIPTION
`Exception`	If finalization fails (will be caught and re-raised by base class)

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

log_iteration `abstractmethod`

log_iteration(report: Dict) -> None

Log a single task iteration to the backend.

This method is called after each task repeat completes. Implementations should write the report to their specific backend (file, API, etc.).

PARAMETER	DESCRIPTION
`report`	Filtered report dict containing task_id, repeat_idx, and optionally traces, config, and eval based on include flags TYPE: `Dict`

RAISES	DESCRIPTION
`Exception`	If logging fails (will be caught and re-raised by base class)

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

on_run_end

on_run_end(
    benchmark: Benchmark, results: List[Dict]
) -> None

Called when benchmark execution completes.

Finalizes logging and optionally validates completeness.

PARAMETER	DESCRIPTION
`benchmark`	The benchmark instance TYPE: `Benchmark`
`results`	List of all result reports from the benchmark TYPE: `List[Dict]`

on_run_start

on_run_start(benchmark: Benchmark) -> None

Called when benchmark execution starts.

Records the expected number of tasks and repeats for validation.

PARAMETER	DESCRIPTION
`benchmark`	The benchmark instance TYPE: `Benchmark`

on_task_repeat_end

on_task_repeat_end(
    benchmark: Benchmark, report: Dict
) -> None

Called after each task iteration completes.

Filters the report based on include flags, logs it, and tracks the iteration.

PARAMETER	DESCRIPTION
`benchmark`	The benchmark instance TYPE: `Benchmark`
`report`	The complete report dict with task_id, repeat_idx, traces, config, eval TYPE: `Dict`
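
The filtering step described above can be pictured roughly as follows. This is a minimal sketch, not the base class's implementation: `filter_report` is a hypothetical helper, only three of the include flags are modeled, and the report keys are taken from the parameter description (`task_id`, `repeat_idx`, `traces`, `config`, `eval`).

```python
from typing import Any, Dict


def filter_report(
    report: Dict[str, Any],
    include_traces: bool = True,
    include_config: bool = True,
    include_eval: bool = True,
) -> Dict[str, Any]:
    """Hypothetical filter: keep identifiers, drop sections whose flag is off."""
    optional = {
        "traces": include_traces,
        "config": include_config,
        "eval": include_eval,
    }
    return {
        key: value
        for key, value in report.items()
        if optional.get(key, True)  # keys without a flag are always kept
    }


report = {
    "task_id": "task-1",
    "repeat_idx": 0,
    "traces": {"agent1": ["step"]},
    "config": {"model": "example"},
    "eval": {"score": 1.0},
}
filtered = filter_report(report, include_traces=False)
print(sorted(filtered))
```

The input report is left unmodified, so other callbacks receiving the same dictionary still see the full data.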

validate `abstractmethod`

validate() -> bool

Validate that all expected iterations were logged correctly.

Called at benchmark end if validate_on_completion is True. Implementations should verify:

- All expected iterations are present
- No duplicate iterations exist
- Data integrity is maintained

RETURNS	DESCRIPTION
`bool`	True if validation passes, False otherwise
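
One way a subclass might implement the presence and duplicate checks is to compare the logged `(task_id, repeat_idx)` pairs against the full set of expected pairs. The sketch below assumes iterations are identified by that pair; `check_iterations` is a hypothetical helper, not part of MASEval.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List, Tuple

Iteration = Tuple[str, int]  # (task_id, repeat_idx)


def check_iterations(
    task_ids: Iterable[str], n_repeats: int, logged: List[Iteration]
) -> Tuple[List[Iteration], List[Iteration]]:
    """Return (missing, duplicated) iterations; both empty means valid."""
    expected = set(product(task_ids, range(n_repeats)))
    counts = Counter(logged)
    missing = sorted(expected - set(counts))
    duplicated = sorted(key for key, n in counts.items() if n > 1)
    return missing, duplicated


# Two tasks with two repeats each; ("b", 1) was never logged, ("b", 0) twice.
logged = [("a", 0), ("a", 1), ("b", 0), ("b", 0)]
missing, duplicated = check_iterations(["a", "b"], n_repeats=2, logged=logged)
is_valid = not missing and not duplicated
print(missing, duplicated, is_valid)
```

A `validate()` implementation could return `is_valid` and report the two lists in its log output.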

FileResultLogger

Bases: ResultLogger

Logger that writes benchmark results incrementally to JSONL files.

This logger writes each task iteration to a JSONL file (one JSON object per line) as soon as it completes. This provides:

- Recovery from crashes: partial results are preserved
- Streaming analysis: results can be read while benchmark is running
- Safe concurrent reads: JSONL format is line-atomic
- Validation: ensures all expected iterations were written

The logger uses atomic writes (write to temp file, then rename) to prevent file corruption from crashes or interruptions.
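
The write-to-temp-then-rename pattern can be sketched with the standard library alone. This is an illustration of the general technique, not FileResultLogger's actual code: `write_jsonl_atomically` is a hypothetical helper that rewrites the whole file on each call, which the real logger may not do.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_jsonl_atomically(path: Path, records: List[Dict[str, Any]]) -> None:
    """Write records as JSONL via a temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before the rename
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        # Remove the temp file if it was not renamed; the target stays untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "results" / "benchmark.jsonl"
    write_jsonl_atomically(target, [{"task_id": "t1", "repeat_idx": 0}])
    write_jsonl_atomically(
        target, [{"task_id": "t1", "repeat_idx": 0}, {"task_id": "t1", "repeat_idx": 1}]
    )
    lines = target.read_text(encoding="utf-8").splitlines()
    print(len(lines))
```

Creating the temp file in the target directory keeps both paths on the same filesystem, which is what allows the rename to replace the target in a single step.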

ATTRIBUTE	DESCRIPTION
`output_dir`	Directory where result files will be written
`filename_pattern`	Pattern for result filename (supports {timestamp})
`write_metadata`	Whether to write a metadata file with benchmark info
`atomic_writes`	Whether to use atomic writes (recommended)

Example

from maseval.core.callbacks.result_logger import FileResultLogger

# Basic usage
logger = FileResultLogger(output_dir="./results")

# Custom configuration
logger = FileResultLogger(
    output_dir="./results",
    filename_pattern="benchmark_{timestamp}.jsonl",
    include_traces=True,
    include_config=True,
    validate_on_completion=True
)

# Use with benchmark
benchmark = MyBenchmark(
    tasks=tasks,
    agent_data=agent_data,
    callbacks=[logger]
)
results = benchmark.run()

# Results are written to: ./results/benchmark_20251028_143022.jsonl
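
Because each line of the output is a standalone JSON object, results can be read back with the standard library. The sketch below is one possible reader; `iter_results` is a hypothetical helper, and the demo writes its own throwaway file with assumed fields rather than reading real benchmark output.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_results(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one report per non-empty line of a JSONL results file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo with a throwaway file; the record fields are assumed for illustration.
with tempfile.TemporaryDirectory() as workdir:
    results_file = Path(workdir) / "benchmark_20251028_143022.jsonl"
    results_file.write_text(
        '{"task_id": "t1", "repeat_idx": 0}\n{"task_id": "t2", "repeat_idx": 0}\n',
        encoding="utf-8",
    )
    task_ids = [report["task_id"] for report in iter_results(results_file)]
    print(task_ids)
```

Reading lazily with a generator keeps memory use flat even when reports contain large traces.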

__init__

__init__(
    output_dir: Path | str = "./results",
    filename_pattern: str = "benchmark_{timestamp}.jsonl",
    write_metadata: bool = True,
    atomic_writes: bool = True,
    overwrite: bool = False,
    include_traces: bool = True,
    include_config: bool = True,
    include_eval: bool = True,
    include_task: bool = True,
    include_usage: bool = True,
    validate_on_completion: bool = True,
)

Initialize the file logger.

PARAMETER	DESCRIPTION
`output_dir`	Directory where result files will be written (created if needed). Accepts either a Path object or a string path. TYPE: `Path \| str` DEFAULT: `'./results'`
`filename_pattern`	Pattern for result filename. Use {timestamp} for automatic timestamp insertion (format: YYYYMMDD_HHMMSS) TYPE: `str` DEFAULT: `'benchmark_{timestamp}.jsonl'`
`write_metadata`	If True, write a metadata file alongside results TYPE: `bool` DEFAULT: `True`
`atomic_writes`	If True, use atomic writes (write to temp, then rename) TYPE: `bool` DEFAULT: `True`
`overwrite`	If True, overwrite existing files. If False, raise an error when the output file already exists. TYPE: `bool` DEFAULT: `False`
`include_traces`	If True, include execution traces in logged results TYPE: `bool` DEFAULT: `True`
`include_config`	If True, include configuration in logged results TYPE: `bool` DEFAULT: `True`
`include_eval`	If True, include evaluation results in logged results TYPE: `bool` DEFAULT: `True`
`include_task`	If True, include task data (query, metadata, protocol) in logged results TYPE: `bool` DEFAULT: `True`
`include_usage`	If True, include API usage data in logged results TYPE: `bool` DEFAULT: `True`
`validate_on_completion`	If True, validate all iterations were logged TYPE: `bool` DEFAULT: `True`
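
To make the `{timestamp}` substitution concrete, the sketch below resolves a filename pattern using the documented YYYYMMDD_HHMMSS format. `resolve_filename` is a hypothetical helper written for illustration; how the logger performs the substitution internally is not shown in this reference.

```python
from datetime import datetime
from pathlib import Path


def resolve_filename(output_dir: str, filename_pattern: str, now: datetime) -> Path:
    """Hypothetical helper: fill {timestamp} using the YYYYMMDD_HHMMSS format."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / filename_pattern.format(timestamp=timestamp)


path = resolve_filename(
    "./results", "benchmark_{timestamp}.jsonl", datetime(2025, 10, 28, 14, 30, 22)
)
print(path.name)  # benchmark_20251028_143022.jsonl
```

A pattern without `{timestamp}` is returned unchanged by `str.format`, in which case repeated runs would target the same file and the `overwrite` setting becomes relevant.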

finalize

finalize() -> None

Finalize logging by closing files and writing metadata.

RAISES	DESCRIPTION
`IOError`	If file operations fail

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

log_iteration

log_iteration(report: Dict) -> None

Log a single task iteration to the JSONL file.

PARAMETER	DESCRIPTION
`report`	Filtered report dict to write TYPE: `Dict`

RAISES	DESCRIPTION
`IOError`	If writing to file fails

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

on_run_end

on_run_end(
    benchmark: Benchmark, results: List[Dict]
) -> None

Called when benchmark execution completes.

Finalizes logging and optionally validates completeness.

PARAMETER	DESCRIPTION
`benchmark`	The benchmark instance TYPE: `Benchmark`
`results`	List of all result reports from the benchmark TYPE: `List[Dict]`

on_run_start

on_run_start(benchmark: Benchmark) -> None

Called when benchmark execution starts.

Records the expected number of tasks and repeats for validation.

PARAMETER	DESCRIPTION
`benchmark`	The benchmark instance TYPE: `Benchmark`

on_task_repeat_end

on_task_repeat_end(
    benchmark: Benchmark, report: Dict
) -> None

Called after each task iteration completes.

Filters the report based on include flags, logs it, and tracks the iteration.

PARAMETER	DESCRIPTION
`benchmark`	The benchmark instance TYPE: `Benchmark`
`report`	The complete report dict with task_id, repeat_idx, traces, config, eval TYPE: `Dict`

validate

validate() -> bool

Validate that all expected iterations were written to file.

Checks:

1. Number of lines matches number of logged iterations
2. All expected iterations are present
3. No duplicate iterations exist

RETURNS	DESCRIPTION
`bool`	True if validation passes, False otherwise

Progress Bars

ProgressBarCallback

Bases: BenchmarkCallback, ABC

Abstract base class for progress bar callbacks.

Displays benchmark execution progress including overall completion, success rate, time elapsed/remaining, and custom metrics. Automatically tracks benchmark execution and updates the progress bar as tasks complete.

Use TqdmProgressBarCallback or RichProgressBarCallback directly, or subclass them to customize metric display.

User-facing methods:

- set_metrics(**metrics): Manually update displayed metrics
- update_metrics(report): Override to automatically extract metrics from task reports

Example

from maseval.core.callbacks.progress_bar import TqdmProgressBarCallback

# Option 1: Use directly with manual metric updates
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)
progress_bar.set_metrics(accuracy="95.2%", avg_score="0.87")

# Option 2: Subclass to automatically extract metrics from reports
class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        if "evaluation_result" in report:
            return {"accuracy": f"{report['evaluation_result']['acc']:.1%}"}
        return {}

progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task

PARAMETER	DESCRIPTION
`desc`	Custom description. Defaults to "Running {BenchmarkClassName}" TYPE: `Optional[str]` DEFAULT: `None`
`show_status`	Whether to display success counter (X/Y Successful) TYPE: `bool` DEFAULT: `True`

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

on_run_end

on_run_end(
    benchmark: Benchmark, results: List[Dict]
) -> None

Called by benchmark framework when run completes.

on_run_start

on_run_start(benchmark: Benchmark) -> None

Called by benchmark framework when run starts.

on_task_repeat_end

on_task_repeat_end(
    benchmark: Benchmark, report: Dict
) -> None

Called by benchmark framework when a task repeat completes.

set_metrics

set_metrics(**metrics: str) -> None

Manually update custom metrics displayed in the progress bar.

Call this method to set or update metrics at any time during benchmark execution. The progress bar will immediately reflect the changes.

PARAMETER	DESCRIPTION
`**metrics`	Key-value pairs to display (e.g., accuracy="95%", loss="0.23") TYPE: `str` DEFAULT: `{}`

Example

progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])

# Update metrics during or after execution
progress_bar.set_metrics(accuracy="95%", f1="0.87")
progress_bar.set_metrics(avg_loss="0.23")  # Updates/adds metrics
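
The "updates/adds" behaviour can be modeled as a dictionary merge. The sketch below is a toy model only: `MetricsState`, `record`, `render`, and the rendered text format are hypothetical and do not reflect how the tqdm or rich callbacks actually draw their output.

```python
from typing import Dict


class MetricsState:
    """Hypothetical model of the metrics shown next to a progress bar."""

    def __init__(self) -> None:
        self.metrics: Dict[str, str] = {}
        self.completed = 0
        self.successful = 0

    def set_metrics(self, **metrics: str) -> None:
        # New keys are added; existing keys are overwritten in place.
        self.metrics.update(metrics)

    def record(self, success: bool) -> None:
        self.completed += 1
        self.successful += int(success)

    def render(self) -> str:
        parts = [f"{self.successful}/{self.completed} Successful"]
        parts += [f"{name}={value}" for name, value in self.metrics.items()]
        return ", ".join(parts)


state = MetricsState()
state.record(success=True)
state.record(success=False)
state.set_metrics(accuracy="95%", f1="0.87")
state.set_metrics(accuracy="96%", avg_loss="0.23")
print(state.render())
```

Metrics that are not mentioned in a later call (here `f1`) keep their previous value.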

update_metrics

update_metrics(report: Dict) -> Dict[str, str]

Extract and return custom metrics from task execution reports.

Override this method in a subclass to automatically display metrics extracted from benchmark task reports. Called by the framework after each task completes.

The default implementation returns an empty dict (no automatic metrics). Use set_metrics() instead if you prefer manual metric updates.

PARAMETER	DESCRIPTION
`report`	Task execution report containing status, results, and evaluation data. Common keys include "status", "evaluation_result", "agent_response". TYPE: `Dict`

RETURNS	DESCRIPTION
`Dict[str, str]`	Dictionary mapping metric names to string values for display. Return an empty dict `{}` if no metrics should be added.

Example

class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        # Extract metrics from evaluation results
        if "evaluation_result" in report:
            result = report["evaluation_result"]
            return {
                "accuracy": f"{result['accuracy']:.1%}",
                "f1": f"{result['f1']:.2f}"
            }
        return {}  # No metrics for this report

progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task

TqdmProgressBarCallback

Bases: ProgressBarCallback

Progress bar callback using tqdm (recommended default).

Simple text-based progress bar that works in terminals and Jupyter notebooks. Displays task completion, success rate, and custom metrics.

Example

from maseval.core.callbacks.progress_bar import TqdmProgressBarCallback

# Basic usage
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)

# With custom description and metrics
progress_bar = TqdmProgressBarCallback(desc="Evaluating agents")
benchmark = MyBenchmark(callbacks=[progress_bar])  # attach the new bar to a benchmark
progress_bar.set_metrics(accuracy="95%", f1="0.87")
benchmark.run(tasks)

PARAMETER	DESCRIPTION
`desc`	Custom description (defaults to "Running {BenchmarkClassName}") TYPE: `Optional[str]` DEFAULT: `None`
`show_status`	Show success counter (default: True) TYPE: `bool` DEFAULT: `True`
`leave`	Keep bar visible after completion (default: True) TYPE: `bool` DEFAULT: `True`
`ncols`	Width in characters (default: auto) TYPE: `Optional[int]` DEFAULT: `None`
`bar_format`	Custom tqdm format string (default: None) TYPE: `Optional[str]` DEFAULT: `None`

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

on_run_end

on_run_end(
    benchmark: Benchmark, results: List[Dict]
) -> None

Called by benchmark framework when run completes.

on_run_start

on_run_start(benchmark: Benchmark) -> None

Called by benchmark framework when run starts.

on_task_repeat_end

on_task_repeat_end(
    benchmark: Benchmark, report: Dict
) -> None

Called by benchmark framework when a task repeat completes.

set_metrics

set_metrics(**metrics: str) -> None

Manually update custom metrics displayed in the progress bar.

Call this method to set or update metrics at any time during benchmark execution. The progress bar will immediately reflect the changes.

PARAMETER	DESCRIPTION
`**metrics`	Key-value pairs to display (e.g., accuracy="95%", loss="0.23") TYPE: `str` DEFAULT: `{}`

Example

progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])

# Update metrics during or after execution
progress_bar.set_metrics(accuracy="95%", f1="0.87")
progress_bar.set_metrics(avg_loss="0.23")  # Updates/adds metrics

update_metrics

update_metrics(report: Dict) -> Dict[str, str]

Extract and return custom metrics from task execution reports.

Override this method in a subclass to automatically display metrics extracted from benchmark task reports. Called by the framework after each task completes.

The default implementation returns an empty dict (no automatic metrics). Use set_metrics() instead if you prefer manual metric updates.

PARAMETER	DESCRIPTION
`report`	Task execution report containing status, results, and evaluation data. Common keys include "status", "evaluation_result", "agent_response". TYPE: `Dict`

RETURNS	DESCRIPTION
`Dict[str, str]`	Dictionary mapping metric names to string values for display. Return an empty dict `{}` if no metrics should be added.

Example

class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        # Extract metrics from evaluation results
        if "evaluation_result" in report:
            result = report["evaluation_result"]
            return {
                "accuracy": f"{result['accuracy']:.1%}",
                "f1": f"{result['f1']:.2f}"
            }
        return {}  # No metrics for this report

progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task

RichProgressBarCallback

Bases: ProgressBarCallback

Progress bar callback using rich library.

Visually enhanced progress bar with color formatting, rich text support, and improved aesthetics. Requires the rich library to be installed.

Example

from maseval.core.callbacks.progress_bar import RichProgressBarCallback

# Basic usage
progress_bar = RichProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)

# With custom metrics
progress_bar = RichProgressBarCallback(desc="Benchmarking")
benchmark = MyBenchmark(callbacks=[progress_bar])  # attach the new bar to a benchmark
progress_bar.set_metrics(avg_score="0.89", correct="42/50")
benchmark.run(tasks)

PARAMETER	DESCRIPTION
`desc`	Custom description (defaults to "Running {BenchmarkClassName}") TYPE: `Optional[str]` DEFAULT: `None`
`show_status`	Show colored success counter (default: True) TYPE: `bool` DEFAULT: `True`
`transient`	Remove bar after completion (default: False) TYPE: `bool` DEFAULT: `False`

gather_traces

gather_traces() -> dict[str, Any]

Gather execution traces from this callback.

By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.

RETURNS	DESCRIPTION
`dict[str, Any]`	Dictionary with basic callback information. Subclasses should extend this with their own data.

on_event

on_event(event_name: str, **data) -> None

Handle a generic event.

on_run_end

on_run_end(
    benchmark: Benchmark, results: List[Dict]
) -> None

Called by benchmark framework when run completes.

on_run_start

on_run_start(benchmark: Benchmark) -> None

Called by benchmark framework when run starts.

on_task_repeat_end

on_task_repeat_end(
    benchmark: Benchmark, report: Dict
) -> None

Called by benchmark framework when a task repeat completes.

set_metrics

set_metrics(**metrics: str) -> None

Manually update custom metrics displayed in the progress bar.

Call this method to set or update metrics at any time during benchmark execution. The progress bar will immediately reflect the changes.

PARAMETER	DESCRIPTION
`**metrics`	Key-value pairs to display (e.g., accuracy="95%", loss="0.23") TYPE: `str` DEFAULT: `{}`

Example

progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])

# Update metrics during or after execution
progress_bar.set_metrics(accuracy="95%", f1="0.87")
progress_bar.set_metrics(avg_loss="0.23")  # Updates/adds metrics

update_metrics

update_metrics(report: Dict) -> Dict[str, str]

Extract and return custom metrics from task execution reports.

Override this method in a subclass to automatically display metrics extracted from benchmark task reports. Called by the framework after each task completes.

The default implementation returns an empty dict (no automatic metrics). Use set_metrics() instead if you prefer manual metric updates.

PARAMETER	DESCRIPTION
`report`	Task execution report containing status, results, and evaluation data. Common keys include "status", "evaluation_result", "agent_response". TYPE: `Dict`

RETURNS	DESCRIPTION
`Dict[str, str]`	Dictionary mapping metric names to string values for display. Return an empty dict `{}` if no metrics should be added.

Example

class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        # Extract metrics from evaluation results
        if "evaluation_result" in report:
            result = report["evaluation_result"]
            return {
                "accuracy": f"{result['accuracy']:.1%}",
                "f1": f"{result['f1']:.2f}"
            }
        return {}  # No metrics for this report

progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task
