Callback
Callbacks allow you to hook into benchmark execution at various points. Use them for logging, monitoring, tracing, or custom side effects during agent runs.
BenchmarkCallback
Bases: ABC, TraceableMixin
Base class for benchmark callbacks.
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
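A custom callback usually overrides on_event and extends gather_traces. The following is a minimal sketch of that pattern, not MASEval's implementation: `_StandInCallback` is a hypothetical stand-in for BenchmarkCallback so the snippet runs without MASEval installed, and the event names and trace keys are assumptions made for illustration.

```python
from collections import Counter
from typing import Any


class _StandInCallback:
    """Hypothetical stand-in for BenchmarkCallback (not the real class)."""

    def on_event(self, event_name: str, **data) -> None:
        pass

    def gather_traces(self) -> dict[str, Any]:
        # Assumed shape of the "basic callback information".
        return {"type": type(self).__name__}


class EventCounter(_StandInCallback):
    """Counts how often each event name is seen."""

    def __init__(self) -> None:
        self.counts = Counter()

    def on_event(self, event_name: str, **data) -> None:
        self.counts[event_name] += 1

    def gather_traces(self) -> dict[str, Any]:
        # Extend the base information with this callback's own data.
        traces = super().gather_traces()
        traces["event_counts"] = dict(self.counts)
        return traces


counter = EventCounter()
counter.on_event("task_started", task_id="t1")  # hypothetical event name
counter.on_event("task_started", task_id="t2")
counter.on_event("task_finished", task_id="t1")
traces = counter.gather_traces()
```

With the real base class, the same subclass body should carry over, provided the base gather_traces returns a dict as documented above.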
EnvironmentCallback
Bases: ABC, TraceableMixin
Base class for environment callbacks.
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
AgentCallback
Bases: ABC, TraceableMixin
Base class for agent callbacks.
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
Built-in Callbacks
MASEval provides built-in callback implementations:
Message Tracing
MessageTracingAgentCallback
Bases: AgentCallback
Callback that traces all agent messages to memory.
This callback is useful for:
- Frameworks that don't provide built-in message history
- Debugging agent behavior
- Creating datasets from agent runs
- Monitoring multi-agent systems
The callback collects all message history from agents after each run.
Example
from maseval import AgentAdapter
from maseval.core.callbacks.message_tracing import MessageTracingAgentCallback
# Create callback
tracer = MessageTracingAgentCallback(include_metadata=True, verbose=True)
# Use with agent (MyAgentAdapter stands for a user-defined AgentAdapter subclass)
agent_adapter = MyAgentAdapter(agent, name="agent1", callbacks=[tracer])
agent_adapter.run("What's the weather?")
# Access traced conversations
for conversation in tracer.get_all_conversations():
    print(f"Agent: {conversation['agent_name']}")
    print(f"Query: {conversation['query']}")
    print(f"Messages: {len(conversation['messages'])}")
__init__
__init__(
include_metadata: bool = True, verbose: bool = False
)
Initialize the message tracing callback.
| PARAMETER | DESCRIPTION |
|---|---|
| include_metadata | If True, include timestamps and metadata in traces. TYPE: bool |
| verbose | If True, print tracing information to console. TYPE: bool |
clear
clear() -> None
Clear all conversations from memory.
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
get_all_conversations
get_all_conversations() -> List[Dict[str, Any]]
Get all traced conversations from memory.
| RETURNS | DESCRIPTION |
|---|---|
| List[Dict[str, Any]] | List of conversation dictionaries |
get_conversations_by_agent
get_conversations_by_agent(
agent_name: str,
) -> List[Dict[str, Any]]
Get all conversations for a specific agent.
| PARAMETER | DESCRIPTION |
|---|---|
| agent_name | Name of the agent to filter by. TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| List[Dict[str, Any]] | List of conversation dictionaries for the specified agent |
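To illustrate what filtering by agent amounts to, here is a small sketch that works on plain conversation dictionaries. It assumes each conversation carries the `agent_name`, `query`, and `messages` keys used in the class example above; the sample data and the `conversations_for` helper are hypothetical and do not reproduce the callback's internals.

```python
from typing import Any, Dict, List

# Hypothetical traced conversations, using the keys from the class example.
conversations: List[Dict[str, Any]] = [
    {"agent_name": "agent1", "query": "What's the weather?", "messages": ["hi", "sunny"]},
    {"agent_name": "agent2", "query": "Book a flight", "messages": ["ok"]},
    {"agent_name": "agent1", "query": "And tomorrow?", "messages": ["rain", "umbrella", "bye"]},
]


def conversations_for(agent_name: str, items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the conversations recorded for the given agent."""
    return [c for c in items if c.get("agent_name") == agent_name]


agent1_conversations = conversations_for("agent1", conversations)
total_messages = sum(len(c["messages"]) for c in agent1_conversations)
```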
get_statistics
get_statistics() -> Dict[str, Any]
Get statistics about traced conversations.
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dictionary with statistics |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
on_run_end
on_run_end(agent: AgentAdapter, result: Any) -> None
Called when agent execution completes.
| PARAMETER | DESCRIPTION |
|---|---|
| agent | The agent adapter instance. TYPE: AgentAdapter |
| result | The result returned by the agent (usually MessageHistory). TYPE: Any |
on_run_start
on_run_start(agent: AgentAdapter) -> None
Called when agent execution starts.
Note: In the current implementation the query is not available at this point, so it is captured in on_run_end from the result.
Result Logging
ResultLogger
Bases: BenchmarkCallback, ABC
Abstract base class for logging benchmark results to various backends.
This class provides a framework for implementing result loggers that:
- Write results incrementally after each task iteration (repeat)
- Track expected vs actual logged iterations
- Validate completeness at benchmark end
- Support selective logging of traces, config, and eval results
Subclasses implement specific backends (file, wandb, opentelemetry, etc.) by overriding the abstract methods.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| include_traces | Whether to include execution traces in logged results |
| include_config | Whether to include configuration in logged results |
| include_eval | Whether to include evaluation results in logged results |
| include_usage | Whether to include API usage data in logged results |
| validate_on_completion | Whether to validate all iterations were logged |
Example
from typing import Dict
class MyLogger(ResultLogger):
    def log_iteration(self, report: Dict) -> None:
        # Write report to backend
        pass
    def finalize(self) -> None:
        # Close connections, flush buffers
        pass
    def validate(self) -> bool:
        # Check all iterations present
        return True
logger = MyLogger(include_traces=True)
benchmark = MyBenchmark(tasks, agent_data, callbacks=[logger])
benchmark.run()
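For a more concrete picture of the three abstract methods, the sketch below implements an in-memory backend. It is an illustration under assumptions: `_StandInResultLogger` is a hypothetical stand-in that only mirrors the abstract method names so the snippet runs without MASEval, and the `expected` set of (task_id, repeat_idx) pairs is supplied by hand, whereas the real base class records expectations itself in on_run_start.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Set, Tuple


class _StandInResultLogger(ABC):
    """Hypothetical stand-in exposing only the three abstract methods."""

    @abstractmethod
    def log_iteration(self, report: Dict) -> None: ...

    @abstractmethod
    def finalize(self) -> None: ...

    @abstractmethod
    def validate(self) -> bool: ...


class InMemoryLogger(_StandInResultLogger):
    """Keeps reports in a list, which can be handy in tests."""

    def __init__(self, expected: Set[Tuple[str, int]]) -> None:
        self.expected = expected  # (task_id, repeat_idx) pairs we expect to see
        self.reports: List[Dict] = []
        self.closed = False

    def log_iteration(self, report: Dict) -> None:
        self.reports.append(report)

    def finalize(self) -> None:
        self.closed = True  # nothing to flush for an in-memory backend

    def validate(self) -> bool:
        seen = [(r["task_id"], r["repeat_idx"]) for r in self.reports]
        no_duplicates = len(seen) == len(set(seen))
        return no_duplicates and set(seen) == self.expected


memory_logger = InMemoryLogger(expected={("task-a", 0), ("task-a", 1)})
memory_logger.log_iteration({"task_id": "task-a", "repeat_idx": 0})
memory_logger.log_iteration({"task_id": "task-a", "repeat_idx": 1})
memory_logger.finalize()
is_complete = memory_logger.validate()
```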
__init__
__init__(
include_traces: bool = True,
include_config: bool = True,
include_eval: bool = True,
include_task: bool = True,
include_usage: bool = True,
validate_on_completion: bool = True,
)
Initialize the result logger.
| PARAMETER | DESCRIPTION |
|---|---|
| include_traces | If True, include execution traces in logged results. TYPE: bool |
| include_config | If True, include configuration in logged results. TYPE: bool |
| include_eval | If True, include evaluation results in logged results. TYPE: bool |
| include_task | If True, include task data (query, metadata, protocol) in logged results. TYPE: bool |
| include_usage | If True, include API usage data in logged results. TYPE: bool |
| validate_on_completion | If True, validate all iterations were logged at end. TYPE: bool |
finalize
abstractmethod
finalize() -> None
Finalize logging operations.
Called at benchmark end. Implementations should:
- Close file handles
- Flush buffers
- Close network connections
- Write metadata files
- Perform any cleanup operations

| RAISES | DESCRIPTION |
|---|---|
| Exception | If finalization fails (will be caught and re-raised by base class) |
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
log_iteration
abstractmethod
log_iteration(report: Dict) -> None
Log a single task iteration to the backend.
This method is called after each task repeat completes. Implementations should write the report to their specific backend (file, API, etc.).
| PARAMETER | DESCRIPTION |
|---|---|
| report | Filtered report dict containing task_id, repeat_idx, and optionally traces, config, and eval based on include flags. TYPE: Dict |

| RAISES | DESCRIPTION |
|---|---|
| Exception | If logging fails (will be caught and re-raised by base class) |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
on_run_end
on_run_end(
benchmark: Benchmark, results: List[Dict]
) -> None
Called when benchmark execution completes.
Finalizes logging and optionally validates completeness.
| PARAMETER | DESCRIPTION |
|---|---|
| benchmark | The benchmark instance. TYPE: Benchmark |
| results | List of all result reports from the benchmark. TYPE: List[Dict] |
on_run_start
on_run_start(benchmark: Benchmark) -> None
Called when benchmark execution starts.
Records the expected number of tasks and repeats for validation.
| PARAMETER | DESCRIPTION |
|---|---|
| benchmark | The benchmark instance. TYPE: Benchmark |
on_task_repeat_end
on_task_repeat_end(
benchmark: Benchmark, report: Dict
) -> None
Called after each task iteration completes.
Filters the report based on include flags, logs it, and tracks the iteration.
| PARAMETER | DESCRIPTION |
|---|---|
| benchmark | The benchmark instance. TYPE: Benchmark |
| report | The complete report dict with task_id, repeat_idx, traces, config, eval. TYPE: Dict |
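The filtering step described for on_task_repeat_end can be pictured roughly as follows. This is a sketch under assumptions, not the library's code: it presumes the optional sections sit under the top-level keys `traces`, `config`, and `eval` named in the parameter description, and `filter_report` is a hypothetical helper.

```python
from typing import Any, Dict


def filter_report(
    report: Dict[str, Any],
    include_traces: bool = True,
    include_config: bool = True,
    include_eval: bool = True,
) -> Dict[str, Any]:
    """Return a copy of the report without the sections that are switched off."""
    optional = {"traces": include_traces, "config": include_config, "eval": include_eval}
    return {
        key: value
        for key, value in report.items()
        if optional.get(key, True)  # keys such as task_id are always kept
    }


full_report = {
    "task_id": "task-a",
    "repeat_idx": 0,
    "traces": {"steps": 3},
    "config": {"seed": 1},
    "eval": {"score": 0.5},
}
slim_report = filter_report(full_report, include_traces=False, include_config=False)
```

Building a new dict rather than deleting keys in place leaves the complete report untouched for any other callback that receives it.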
validate
abstractmethod
validate() -> bool
Validate that all expected iterations were logged correctly.
Called at benchmark end if validate_on_completion is True. Implementations should verify:
- All expected iterations are present
- No duplicate iterations exist
- Data integrity is maintained

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if validation passes, False otherwise |
FileResultLogger
Bases: ResultLogger
Logger that writes benchmark results incrementally to JSONL files.
This logger writes each task iteration to a JSONL file (one JSON object per line) as soon as it completes. This provides:
- Recovery from crashes: partial results are preserved
- Streaming analysis: results can be read while benchmark is running
- Safe concurrent reads: JSONL format is line-atomic
- Validation: ensures all expected iterations were written
The logger uses atomic writes (write to temp file, then rename) to prevent file corruption from crashes or interruptions.
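The write-to-temp-then-rename idea can be sketched with the standard library alone. The helper below is a generic illustration of the technique and is not taken from FileResultLogger; `atomic_write_text` and the file names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to path so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # the rename is atomic on POSIX systems
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "results.jsonl"
    atomic_write_text(target, json.dumps({"task_id": "task-a", "repeat_idx": 0}) + "\n")
    written = target.read_text(encoding="utf-8")
    leftovers = [p.name for p in Path(workdir).iterdir() if p.suffix == ".tmp"]
```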
| ATTRIBUTE | DESCRIPTION |
|---|---|
| output_dir | Directory where result files will be written |
| filename_pattern | Pattern for result filename (supports {timestamp}) |
| write_metadata | Whether to write a metadata file with benchmark info |
| atomic_writes | Whether to use atomic writes (recommended) |
Example
from maseval.core.callbacks.result_logger import FileResultLogger
# Basic usage
logger = FileResultLogger(output_dir="./results")
# Custom configuration
logger = FileResultLogger(
    output_dir="./results",
    filename_pattern="benchmark_{timestamp}.jsonl",
    include_traces=True,
    include_config=True,
    validate_on_completion=True,
)
# Use with benchmark
benchmark = MyBenchmark(
    tasks=tasks,
    agent_data=agent_data,
    callbacks=[logger],
)
results = benchmark.run()
# Results are written to: ./results/benchmark_20251028_143022.jsonl
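Because each line of the output is a standalone JSON object, a results file can be consumed with a few lines of standard-library code, even while a run is still appending to it. The reader below is a sketch: `read_results` is a hypothetical helper, and the sample file and its contents are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator


def read_results(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one report per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "benchmark_sample.jsonl"
    sample.write_text(
        '{"task_id": "task-a", "repeat_idx": 0}\n'
        '{"task_id": "task-a", "repeat_idx": 1}\n',
        encoding="utf-8",
    )
    reports = list(read_results(sample))
```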
__init__
__init__(
output_dir: Path | str = "./results",
filename_pattern: str = "benchmark_{timestamp}.jsonl",
write_metadata: bool = True,
atomic_writes: bool = True,
overwrite: bool = False,
include_traces: bool = True,
include_config: bool = True,
include_eval: bool = True,
include_task: bool = True,
include_usage: bool = True,
validate_on_completion: bool = True,
)
Initialize the file logger.
| PARAMETER | DESCRIPTION |
|---|---|
| output_dir | Directory where result files will be written (created if needed). Accepts either a Path object or a string path. TYPE: Path or str |
| filename_pattern | Pattern for result filename. Use {timestamp} for automatic timestamp insertion (format: YYYYMMDD_HHMMSS). TYPE: str |
| write_metadata | If True, write a metadata file alongside results. TYPE: bool |
| atomic_writes | If True, use atomic writes (write to temp, then rename). TYPE: bool |
| overwrite | If True, overwrite existing files. If False, raise an error when the output file already exists. TYPE: bool |
| include_traces | If True, include execution traces in logged results. TYPE: bool |
| include_config | If True, include configuration in logged results. TYPE: bool |
| include_eval | If True, include evaluation results in logged results. TYPE: bool |
| include_task | If True, include task data (query, metadata, protocol) in logged results. TYPE: bool |
| include_usage | If True, include API usage data in logged results. TYPE: bool |
| validate_on_completion | If True, validate all iterations were logged. TYPE: bool |
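As a quick illustration of the {timestamp} placeholder, the sketch below expands a pattern using the documented YYYYMMDD_HHMMSS format. `resolve_filename` is a hypothetical helper and only mimics the documented behaviour; how FileResultLogger resolves the name internally is not shown in this reference.

```python
from datetime import datetime
from pathlib import Path


def resolve_filename(output_dir: str, filename_pattern: str, now: datetime) -> Path:
    """Fill {timestamp} in the pattern using the YYYYMMDD_HHMMSS format."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / filename_pattern.format(timestamp=timestamp)


result_path = resolve_filename(
    "./results",
    "benchmark_{timestamp}.jsonl",
    datetime(2025, 10, 28, 14, 30, 22),
)
print(result_path.name)  # benchmark_20251028_143022.jsonl
```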
finalize
finalize() -> None
Finalize logging by closing files and writing metadata.
| RAISES | DESCRIPTION |
|---|---|
| IOError | If file operations fail |
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
log_iteration
log_iteration(report: Dict) -> None
Log a single task iteration to the JSONL file.
| PARAMETER | DESCRIPTION |
|---|---|
| report | Filtered report dict to write. TYPE: Dict |

| RAISES | DESCRIPTION |
|---|---|
| IOError | If writing to file fails |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
on_run_end
on_run_end(
benchmark: Benchmark, results: List[Dict]
) -> None
Called when benchmark execution completes.
Finalizes logging and optionally validates completeness.
| PARAMETER | DESCRIPTION |
|---|---|
| benchmark | The benchmark instance. TYPE: Benchmark |
| results | List of all result reports from the benchmark. TYPE: List[Dict] |
on_run_start
on_run_start(benchmark: Benchmark) -> None
Called when benchmark execution starts.
Records the expected number of tasks and repeats for validation.
| PARAMETER | DESCRIPTION |
|---|---|
| benchmark | The benchmark instance. TYPE: Benchmark |
on_task_repeat_end
on_task_repeat_end(
benchmark: Benchmark, report: Dict
) -> None
Called after each task iteration completes.
Filters the report based on include flags, logs it, and tracks the iteration.
| PARAMETER | DESCRIPTION |
|---|---|
| benchmark | The benchmark instance. TYPE: Benchmark |
| report | The complete report dict with task_id, repeat_idx, traces, config, eval. TYPE: Dict |
validate
validate() -> bool
Validate that all expected iterations were written to file.
Checks:
1. Number of lines matches number of logged iterations
2. All expected iterations are present
3. No duplicate iterations exist

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if validation passes, False otherwise |
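The three checks listed above can be expressed compactly over the lines of a JSONL file. The function below is a sketch, assuming each line holds `task_id` and `repeat_idx` keys; `check_lines` and the sample lines are hypothetical and only mirror the documented checks.

```python
import json
from typing import List, Set, Tuple

Iteration = Tuple[str, int]  # (task_id, repeat_idx)


def check_lines(lines: List[str], logged_count: int, expected: Set[Iteration]) -> List[str]:
    """Return a list of problems; an empty list means validation passed."""
    problems: List[str] = []
    # Check 1: number of lines matches the number of logged iterations.
    if len(lines) != logged_count:
        problems.append(f"line count {len(lines)} != logged {logged_count}")
    seen = [(r["task_id"], r["repeat_idx"]) for r in map(json.loads, lines)]
    # Check 2: all expected iterations are present.
    missing = expected - set(seen)
    if missing:
        problems.append(f"missing iterations: {sorted(missing)}")
    # Check 3: no duplicate iterations exist.
    if len(seen) != len(set(seen)):
        problems.append("duplicate iterations found")
    return problems


sample_lines = [
    '{"task_id": "task-a", "repeat_idx": 0}',
    '{"task_id": "task-a", "repeat_idx": 0}',
]
problems = check_lines(sample_lines, logged_count=2, expected={("task-a", 0), ("task-a", 1)})
```

Returning the list of problems instead of a bare bool is a design choice of this sketch; a validate() implementation could log the list and return `not problems`.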
Progress Bars
ProgressBarCallback
Bases: BenchmarkCallback, ABC
Abstract base class for progress bar callbacks.
Displays benchmark execution progress including overall completion, success rate, time elapsed/remaining, and custom metrics. Automatically tracks benchmark execution and updates the progress bar as tasks complete.
Use TqdmProgressBarCallback or RichProgressBarCallback directly, or subclass them to customize metric display.
User-facing methods:
- set_metrics(**metrics): Manually update displayed metrics
- update_metrics(report): Override to automatically extract metrics from task reports
Example
from maseval.core.callbacks.progress_bar import TqdmProgressBarCallback
# Option 1: Use directly with manual metric updates
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)
progress_bar.set_metrics(accuracy="95.2%", avg_score="0.87")
# Option 2: Subclass to automatically extract metrics from reports
class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        if "evaluation_result" in report:
            return {"accuracy": f"{report['evaluation_result']['acc']:.1%}"}
        return {}
progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task
| PARAMETER | DESCRIPTION |
|---|---|
| desc | Custom description. Defaults to "Running {BenchmarkClassName}" |
| show_status | Whether to display success counter (X/Y Successful) |
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
on_run_end
on_run_end(
benchmark: Benchmark, results: List[Dict]
) -> None
Called by benchmark framework when run completes.
on_run_start
on_run_start(benchmark: Benchmark) -> None
Called by benchmark framework when run starts.
on_task_repeat_end
on_task_repeat_end(
benchmark: Benchmark, report: Dict
) -> None
Called by benchmark framework when a task repeat completes.
set_metrics
set_metrics(**metrics: str) -> None
Manually update custom metrics displayed in the progress bar.
Call this method to set or update metrics at any time during benchmark execution. The progress bar will immediately reflect the changes.
| PARAMETER | DESCRIPTION |
|---|---|
| **metrics | Key-value pairs to display (e.g., accuracy="95%", loss="0.23"). TYPE: str |
Example
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
# Update metrics during or after execution
progress_bar.set_metrics(accuracy="95%", f1="0.87")
progress_bar.set_metrics(avg_loss="0.23") # Updates/adds metrics
update_metrics
update_metrics(report: Dict) -> Dict[str, str]
Extract and return custom metrics from task execution reports.
Override this method in a subclass to automatically display metrics extracted from benchmark task reports. Called by the framework after each task completes.
The default implementation returns an empty dict (no automatic metrics).
Use set_metrics() instead if you prefer manual metric updates.
| PARAMETER | DESCRIPTION |
|---|---|
| report | Task execution report containing status, results, and evaluation data. Common keys include "status", "evaluation_result", "agent_response". TYPE: Dict |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | Dictionary mapping metric names to string values for display. Return an empty dict if there are no metrics for the report. |
Example
class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        # Extract metrics from evaluation results
        if "evaluation_result" in report:
            result = report["evaluation_result"]
            return {
                "accuracy": f"{result['accuracy']:.1%}",
                "f1": f"{result['f1']:.2f}",
            }
        return {}  # No metrics for this report
progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task
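Reports from failed or partially evaluated tasks may lack some of the keys an override expects. As a hedged variant of the extraction logic, the standalone function below skips any metric that is absent instead of raising KeyError. It assumes `evaluation_result`, when present, is a dict of numbers; `extract_metrics` and the metric names are illustrative, not part of MASEval.

```python
from typing import Any, Dict


def extract_metrics(report: Dict[str, Any]) -> Dict[str, str]:
    """Format whichever numeric metrics are present; skip the rest."""
    result = report.get("evaluation_result") or {}
    formats = {"accuracy": "{:.1%}", "f1": "{:.2f}"}
    return {
        name: fmt.format(result[name])
        for name, fmt in formats.items()
        if isinstance(result.get(name), (int, float))
    }


partial = extract_metrics({"status": "success", "evaluation_result": {"accuracy": 0.952}})
empty = extract_metrics({"status": "error"})
```

The body of an update_metrics override could delegate to a function like this and return its result directly.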
TqdmProgressBarCallback
Bases: ProgressBarCallback
Progress bar callback using tqdm (recommended default).
Simple text-based progress bar that works in terminals and Jupyter notebooks. Displays task completion, success rate, and custom metrics.
Example
from maseval.core.callbacks.progress_bar import TqdmProgressBarCallback
# Basic usage
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)
# With custom description and metrics
progress_bar = TqdmProgressBarCallback(desc="Evaluating agents")
progress_bar.set_metrics(accuracy="95%", f1="0.87")
benchmark.run(tasks)
| PARAMETER | DESCRIPTION |
|---|---|
| desc | Custom description (defaults to "Running {BenchmarkClassName}") |
| show_status | Show success counter (default: True) |
| leave | Keep bar visible after completion (default: True) |
| ncols | Width in characters (default: auto) |
| bar_format | Custom tqdm format string (default: None) |
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
on_run_end
on_run_end(
benchmark: Benchmark, results: List[Dict]
) -> None
Called by benchmark framework when run completes.
on_run_start
on_run_start(benchmark: Benchmark) -> None
Called by benchmark framework when run starts.
on_task_repeat_end
on_task_repeat_end(
benchmark: Benchmark, report: Dict
) -> None
Called by benchmark framework when a task repeat completes.
set_metrics
set_metrics(**metrics: str) -> None
Manually update custom metrics displayed in the progress bar.
Call this method to set or update metrics at any time during benchmark execution. The progress bar will immediately reflect the changes.
| PARAMETER | DESCRIPTION |
|---|---|
| **metrics | Key-value pairs to display (e.g., accuracy="95%", loss="0.23"). TYPE: str |
Example
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
# Update metrics during or after execution
progress_bar.set_metrics(accuracy="95%", f1="0.87")
progress_bar.set_metrics(avg_loss="0.23") # Updates/adds metrics
update_metrics
update_metrics(report: Dict) -> Dict[str, str]
Extract and return custom metrics from task execution reports.
Override this method in a subclass to automatically display metrics extracted from benchmark task reports. Called by the framework after each task completes.
The default implementation returns an empty dict (no automatic metrics).
Use set_metrics() instead if you prefer manual metric updates.
| PARAMETER | DESCRIPTION |
|---|---|
| report | Task execution report containing status, results, and evaluation data. Common keys include "status", "evaluation_result", "agent_response". TYPE: Dict |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | Dictionary mapping metric names to string values for display. Return an empty dict if there are no metrics for the report. |
Example
class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        # Extract metrics from evaluation results
        if "evaluation_result" in report:
            result = report["evaluation_result"]
            return {
                "accuracy": f"{result['accuracy']:.1%}",
                "f1": f"{result['f1']:.2f}",
            }
        return {}  # No metrics for this report
progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task
RichProgressBarCallback
Bases: ProgressBarCallback
Progress bar callback using rich library.
Visually enhanced progress bar with color formatting, rich text support,
and improved aesthetics. Requires the rich library to be installed.
Example
from maseval.core.callbacks.progress_bar import RichProgressBarCallback
# Basic usage
progress_bar = RichProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)
# With custom metrics
progress_bar = RichProgressBarCallback(desc="Benchmarking")
progress_bar.set_metrics(avg_score="0.89", correct="42/50")
benchmark.run(tasks)
| PARAMETER | DESCRIPTION |
|---|---|
| desc | Custom description (defaults to "Running {BenchmarkClassName}") |
| show_status | Show colored success counter (default: True) |
| transient | Remove bar after completion (default: False) |
gather_traces
gather_traces() -> dict[str, Any]
Gather execution traces from this callback.
By default, callbacks don't store traces, but subclasses can override this to provide custom tracing data.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dictionary with basic callback information. Subclasses should extend this with their own data. |
on_event
on_event(event_name: str, **data) -> None
Handle a generic event.
on_run_end
on_run_end(
benchmark: Benchmark, results: List[Dict]
) -> None
Called by benchmark framework when run completes.
on_run_start
on_run_start(benchmark: Benchmark) -> None
Called by benchmark framework when run starts.
on_task_repeat_end
on_task_repeat_end(
benchmark: Benchmark, report: Dict
) -> None
Called by benchmark framework when a task repeat completes.
set_metrics
set_metrics(**metrics: str) -> None
Manually update custom metrics displayed in the progress bar.
Call this method to set or update metrics at any time during benchmark execution. The progress bar will immediately reflect the changes.
| PARAMETER | DESCRIPTION |
|---|---|
| **metrics | Key-value pairs to display (e.g., accuracy="95%", loss="0.23"). TYPE: str |
Example
progress_bar = TqdmProgressBarCallback()
benchmark = MyBenchmark(callbacks=[progress_bar])
# Update metrics during or after execution
progress_bar.set_metrics(accuracy="95%", f1="0.87")
progress_bar.set_metrics(avg_loss="0.23") # Updates/adds metrics
update_metrics
update_metrics(report: Dict) -> Dict[str, str]
Extract and return custom metrics from task execution reports.
Override this method in a subclass to automatically display metrics extracted from benchmark task reports. Called by the framework after each task completes.
The default implementation returns an empty dict (no automatic metrics).
Use set_metrics() instead if you prefer manual metric updates.
| PARAMETER | DESCRIPTION |
|---|---|
| report | Task execution report containing status, results, and evaluation data. Common keys include "status", "evaluation_result", "agent_response". TYPE: Dict |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | Dictionary mapping metric names to string values for display. Return an empty dict if there are no metrics for the report. |
Example
class MyProgressBar(TqdmProgressBarCallback):
    def update_metrics(self, report):
        # Extract metrics from evaluation results
        if "evaluation_result" in report:
            result = report["evaluation_result"]
            return {
                "accuracy": f"{result['accuracy']:.1%}",
                "f1": f"{result['f1']:.2f}",
            }
        return {}  # No metrics for this report
progress_bar = MyProgressBar()
benchmark = MyBenchmark(callbacks=[progress_bar])
benchmark.run(tasks)  # Metrics auto-update after each task