Usage & Cost Tracking

Usage and cost tracking provides data classes for recording resource consumption, a mixin for automatic collection, and pluggable cost calculators.

See the Usage & Cost Tracking guide for usage patterns and examples.

Usage `dataclass`

Generic usage record for any billable resource.

Represents accumulated cost and countable units for a component or aggregated group. All fields default to zero, so Usage() can be used as a starting value for accumulation with + and sum().

Note

cost defaults to 0.0, so adding a Usage() to another record never changes its cost: Usage() + Usage(cost=0.05) gives cost=0.05. Components that track cost start at 0.0 and accumulate upward. Components that do not track cost (e.g., agent adapters that only count tokens) also report 0.0, which has no effect when summed with components that do report cost.

Grouping fields (provider, category, component_name, kind) identify what scope the record covers. When two records are summed, matching grouping fields are preserved; mismatches become None (meaning "aggregated over").

ATTRIBUTE	DESCRIPTION
`cost`	Total cost in USD (or whatever unit your calculator uses). Defaults to `0.0`. TYPE: `float`
`units`	Arbitrary countable units (e.g., `{"api_calls": 3}`). TYPE: `Dict[str, int \| float]`
`provider`	Provider identifier (e.g., `"anthropic"`, `"bloomberg"`). TYPE: `Optional[str]`
`category`	Registry category (e.g., `"models"`, `"tools"`). TYPE: `Optional[str]`
`component_name`	Component name within category (e.g., `"main_model"`). TYPE: `Optional[str]`
`kind`	Component kind (e.g., `"llm"`, `"service"`, `"local"`). TYPE: `Optional[str]`

Example

usage = Usage(cost=0.05, units={"api_calls": 1}, provider="bloomberg", kind="service")

# Summing preserves matching fields
total = usage + Usage(cost=0.03, units={"api_calls": 2}, provider="bloomberg", kind="service")
assert abs(total.cost - 0.08) < 1e-9  # compare floats with a tolerance
assert total.units == {"api_calls": 3}
assert total.provider == "bloomberg"

# Usage() is the zero element
assert (usage + Usage()).cost == 0.05

# Accumulate with sum()
records = [Usage(cost=0.10), Usage(cost=0.20), Usage(cost=0.05)]
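# 0.10 + 0.20 + 0.05 is not exactly 0.35 in binary floating point, so compare with a tolerance
assert abs(sum(records, Usage()).cost - 0.35) < 1e-9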

# Mismatched grouping fields become None
mixed = usage + Usage(cost=0.10, provider="anthropic", kind="llm")
assert mixed.provider is None  # aggregated over
assert mixed.kind is None      # aggregated over

__radd__

__radd__(other: object) -> Usage

Support sum() by handling 0 + Usage.
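The merge rules and the role of `__radd__` can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: `MiniUsage` is a hypothetical dataclass with only `cost`, `units`, and `provider`, and it assumes that units are summed key by key and that a grouping field survives only when both operands agree.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MiniUsage:
    """Hypothetical stand-in for Usage; not the library class."""

    cost: float = 0.0
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None

    def __add__(self, other: object) -> "MiniUsage":
        if not isinstance(other, MiniUsage):
            return NotImplemented
        # Sum countable units key by key.
        units = dict(self.units)
        for key, value in other.units.items():
            units[key] = units.get(key, 0) + value
        # Keep a grouping field only when both sides agree on it.
        provider = self.provider if self.provider == other.provider else None
        return MiniUsage(self.cost + other.cost, units, provider)

    def __radd__(self, other: object) -> "MiniUsage":
        # sum() starts from the integer 0, so 0 + MiniUsage lands here.
        if other == 0:
            return self
        return NotImplemented


records = [
    MiniUsage(cost=0.25, units={"api_calls": 1}, provider="bloomberg"),
    MiniUsage(cost=0.5, units={"api_calls": 2}, provider="bloomberg"),
]
total = sum(records)  # works without a start value because of __radd__
```

Passing an explicit start value, as in `sum(records, Usage())`, avoids the `0 + Usage` step and also returns an empty record for an empty list.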

to_dict

to_dict() -> Dict[str, Any]

Serialize to a JSON-compatible dictionary.

TokenUsage `dataclass`

Bases: Usage

LLM-specific usage record with token counts.

Extends Usage with token fields reported by LLM providers. Use from_chat_response_usage() to create an instance from the usage dict returned by model adapters.

ATTRIBUTE	DESCRIPTION
`input_tokens`	Number of input/prompt tokens. TYPE: `int`
`output_tokens`	Number of output/completion tokens. TYPE: `int`
`total_tokens`	Total tokens (input + output). TYPE: `int`
`cached_input_tokens`	Tokens served from cache (Anthropic `cache_read_input_tokens`, OpenAI `cached_tokens`). TYPE: `int`
`cache_creation_input_tokens`	Tokens used to create a new cache entry (Anthropic `cache_creation_input_tokens`). Billed at a higher rate. TYPE: `int`
`reasoning_tokens`	Tokens used for reasoning (OpenAI `reasoning_tokens`, Google `thoughts_token_count`). TYPE: `int`
`audio_tokens`	Tokens for audio processing (OpenAI). TYPE: `int`

Example

token_usage = TokenUsage.from_chat_response_usage({
    "input_tokens": 100,
    "output_tokens": 50,
    "total_tokens": 150,
})
assert token_usage.input_tokens == 100

__radd__

__radd__(other: object) -> Usage

Support sum() by handling 0 + Usage.

from_chat_response_usage `classmethod`

from_chat_response_usage(
    usage_dict: Dict[str, Any],
    *,
    cost: float = 0.0,
    provider: Optional[str] = None,
    category: Optional[str] = None,
    component_name: Optional[str] = None,
    kind: str = "llm",
) -> TokenUsage

Create a TokenUsage from a ChatResponse.usage dict.

Maps provider-specific key names to the canonical fields.

PARAMETER	DESCRIPTION
`usage_dict`	The usage dict from `ChatResponse.usage`. TYPE: `Dict[str, Any]`
`cost`	Cost in USD (e.g., from provider-reported cost). Defaults to `0.0`. TYPE: `float` DEFAULT: `0.0`
`provider`	Provider identifier. TYPE: `Optional[str]` DEFAULT: `None`
`category`	Registry category. TYPE: `Optional[str]` DEFAULT: `None`
`component_name`	Component name. TYPE: `Optional[str]` DEFAULT: `None`
`kind`	Component kind, defaults to `"llm"`. TYPE: `str` DEFAULT: `'llm'`

RETURNS	DESCRIPTION
`TokenUsage`	A TokenUsage instance with mapped fields.
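The key mapping can be pictured as an alias table. The sketch below is illustrative only: `normalize_usage_keys` and `KEY_ALIASES` are hypothetical names, the aliases are the provider keys listed in the TokenUsage attribute table, and the real method may handle nested or additional provider fields differently.

```python
from typing import Any, Dict

# Canonical field -> provider-specific aliases (taken from the attribute table;
# the lookup order and flat-key layout are assumptions of this sketch).
KEY_ALIASES: Dict[str, tuple] = {
    "cached_input_tokens": ("cached_input_tokens", "cache_read_input_tokens", "cached_tokens"),
    "reasoning_tokens": ("reasoning_tokens", "thoughts_token_count"),
}


def normalize_usage_keys(usage_dict: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical helper: map provider key names onto canonical token fields."""
    normalized = {
        "input_tokens": int(usage_dict.get("input_tokens", 0)),
        "output_tokens": int(usage_dict.get("output_tokens", 0)),
    }
    # Fall back to input + output when the provider omits the total.
    normalized["total_tokens"] = int(
        usage_dict.get("total_tokens", normalized["input_tokens"] + normalized["output_tokens"])
    )
    for canonical, aliases in KEY_ALIASES.items():
        # Use the first alias present in the dict; default to zero.
        normalized[canonical] = next(
            (int(usage_dict[alias]) for alias in aliases if alias in usage_dict), 0
        )
    return normalized


anthropic_style = normalize_usage_keys(
    {"input_tokens": 100, "output_tokens": 50, "cache_read_input_tokens": 40}
)
```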

to_dict

to_dict() -> Dict[str, Any]

Serialize to a JSON-compatible dictionary.

UsageTrackableMixin

Mixin that provides usage tracking capability to any component.

Classes that inherit from UsageTrackableMixin can be registered with a Benchmark instance and will have their usage automatically collected by the registry via collect_usage().

The gather_usage() method provides a default implementation that returns an empty Usage. Subclasses should override this to return their accumulated usage data.

How to use

For custom components that incur billable costs, inherit from UsageTrackableMixin and override gather_usage():

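from typing import List

class MyPaidService(TraceableMixin, UsageTrackableMixin):
    def __init__(self):
        self._usage_records: List[Usage] = []

    def call_api(self, query):
        # `api` stands in for your own billable client
        result = api.call(query)
        # Record one Usage per call; totals are computed in gather_usage()
        self._usage_records.append(Usage(
            cost=result.cost,
            units={"api_calls": 1},
        ))
        return result

    def gather_usage(self) -> Usage:
        # Usage() is the zero element, so an empty list yields an empty Usage
        return sum(self._usage_records, Usage())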

Then register it with your benchmark:

service = MyPaidService()
benchmark.register("tools", "my_service", service)

Thread Safety

Usage collection happens synchronously in the main thread after task execution completes. Components should use thread-safe data structures when accumulating usage during concurrent execution, but gather_usage() itself is called sequentially.
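One way to follow this advice is to guard the record list with a lock. The sketch below is a minimal illustration under stated assumptions: `ThreadSafeCallCounter` is a hypothetical component that stores plain cost floats and returns a plain dict, so that the example runs without the library; a real component would store `Usage` records and return a `Usage`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ThreadSafeCallCounter:
    """Hypothetical component; stores plain cost floats instead of Usage records."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._costs: List[float] = []

    def call_api(self, cost: float) -> None:
        # Worker threads may call this concurrently, so guard the append.
        with self._lock:
            self._costs.append(cost)

    def gather_usage(self) -> dict:
        # Called sequentially after execution; copy under the lock anyway.
        with self._lock:
            costs = list(self._costs)
        return {"cost": sum(costs), "units": {"api_calls": len(costs)}}


counter = ThreadSafeCallCounter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # 100 concurrent calls, each costing 0.5.
    list(pool.map(counter.call_api, [0.5] * 100))

usage = counter.gather_usage()
```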

gather_usage

gather_usage() -> Usage

Gather accumulated usage from this component.

Provides a default implementation that returns an empty Usage. Subclasses should override this to return their accumulated usage data.

RETURNS	DESCRIPTION
`Usage`	Accumulated usage for this component.

How to use

Override this method to return your component's usage:

def gather_usage(self) -> Usage:
    return sum(self._usage_records, Usage())

CostCalculator

Bases: Protocol

Protocol for computing cost from token usage.

Implementations receive a TokenUsage and the model ID, and return the cost in whatever unit the calculator declares (typically USD).

Example

class MyCostCalculator:
    def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]:
        # MY_PRICING is your own table of per-token rates, keyed by model ID
        rate = MY_PRICING.get(model_id)
        if rate is None:
            return None  # pricing unknown for this model
        return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens

calculate_cost

calculate_cost(
    usage: TokenUsage, model_id: str
) -> Optional[float]

Compute cost for a single chat call.

PARAMETER	DESCRIPTION
`usage`	Token usage from the call. TYPE: `TokenUsage`
`model_id`	The model identifier (e.g., `"gpt-4"`, `"claude-sonnet-4-5"`). TYPE: `str`

RETURNS	DESCRIPTION
`Optional[float]`	Cost as a float, or `None` if pricing is unknown for this model.
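Because CostCalculator is a Protocol, an implementation only needs a matching `calculate_cost` method; it does not have to inherit from anything. The sketch below shows that idea with stand-ins: `CostCalculatorLike` and `FlatRateCalculator` are hypothetical, a plain dict replaces `TokenUsage`, and the use of `runtime_checkable` is an assumption made so that the `isinstance` check works here.

```python
from typing import Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class CostCalculatorLike(Protocol):
    """Hypothetical stand-in for the CostCalculator protocol."""

    def calculate_cost(self, usage: Dict[str, int], model_id: str) -> Optional[float]: ...


class FlatRateCalculator:
    """Satisfies the protocol structurally; note there is no base class."""

    def __init__(self, rate_per_token: float) -> None:
        self.rate_per_token = rate_per_token

    def calculate_cost(self, usage: Dict[str, int], model_id: str) -> Optional[float]:
        if model_id != "demo-model":  # hypothetical model ID
            return None  # pricing unknown
        return self.rate_per_token * (usage["input_tokens"] + usage["output_tokens"])


calculator = FlatRateCalculator(rate_per_token=0.5)
tokens = {"input_tokens": 100, "output_tokens": 50}
known = calculator.calculate_cost(tokens, "demo-model")
unknown = calculator.calculate_cost(tokens, "other-model")
```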

StaticPricingCalculator

Cost calculator using user-supplied per-model pricing.

Pricing is specified as cost per token (not per 1K or 1M tokens). If a model is not in the pricing table, calculate_cost returns None.

PARAMETER	DESCRIPTION
`pricing`	Dict mapping model IDs to their per-token rates. TYPE: `Dict[str, Dict[str, float]]`

Each value in `pricing` is a dict with these keys:

`"input"`: cost per input token (required)
`"output"`: cost per output token (required)
`"cached_input"`: cost per cached input token (optional, defaults to the `"input"` rate)
`"cache_creation_input"`: cost per cache creation token (optional, defaults to the `"input"` rate)

Example

calculator = StaticPricingCalculator({
    "gpt-4": {"input": 0.00003, "output": 0.00006},
    "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
})

model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator)

For university clusters or custom credit systems, the "cost" unit is whatever the pricing values represent (credits, EUR, etc.):

```python
calculator = StaticPricingCalculator({
    "llama-3-70b": {"input": 0.5, "output": 1.0},  # credits per token
})
```
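Providers often quote prices per million tokens, while this calculator expects per-token rates, so a conversion step is usually needed. The sketch below shows the arithmetic only: `per_token_rates` is a hypothetical helper, not part of the library, and the cost formula assumes rate times token count for input and output, as in the CostCalculator example.

```python
from typing import Dict

TOKENS_PER_MILLION = 1_000_000


def per_token_rates(per_million: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical helper: convert per-1M-token prices to per-token rates."""
    return {key: price / TOKENS_PER_MILLION for key, price in per_million.items()}


# 30 and 60 per 1M tokens correspond to the "gpt-4" entry in the example above.
rates = per_token_rates({"input": 30.0, "output": 60.0})

# Cost of one call with 1,000 input tokens and 500 output tokens.
cost = rates["input"] * 1_000 + rates["output"] * 500
```

The resulting `rates` dict has the shape expected for one entry of the pricing table, so it could be passed as the value for a model ID.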

models `property`

models: List[str]

List of model IDs with pricing configured.

add_model

add_model(model_id: str, rates: Dict[str, float]) -> None

Add or update pricing for a model.

PARAMETER	DESCRIPTION
`model_id`	The model identifier. TYPE: `str`
`rates`	Per-token rates (`"input"`, `"output"`, optionally `"cached_input"`). TYPE: `Dict[str, float]`

calculate_cost

calculate_cost(
    usage: TokenUsage, model_id: str
) -> Optional[float]

Compute cost from static per-token rates.

PARAMETER	DESCRIPTION
`usage`	Token usage from the call. TYPE: `TokenUsage`
`model_id`	The model identifier to look up in the pricing table. TYPE: `str`

RETURNS	DESCRIPTION
`Optional[float]`	Computed cost, or `None` if the model is not in the pricing table.

gather_config

gather_config() -> Dict[str, Any]

Return pricing configuration for reproducibility.

UsageReporter

Post-hoc utility for analyzing usage across benchmark reports.

Walks report["usage"] across all reports to produce breakdowns by task, component, model, etc.

Example

reporter = UsageReporter.from_reports(benchmark.reports)
print(reporter.total())
print(reporter.by_task())
print(reporter.by_component())

__init__

__init__(entries: List[Dict[str, Any]])

Initialize with raw entries extracted from reports.

PARAMETER	DESCRIPTION
`entries`	List of dicts, each with `"task_id"`, `"repeat_idx"`, and `"usage_items"` (list of `(key, usage_dict)` tuples). TYPE: `List[Dict[str, Any]]`
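The breakdowns amount to grouping the same entries by different keys. The sketch below works on plain dicts in the entry shape described above and only sums cost; the `"cost"` key inside each usage dict and the sample task and component names are assumptions of this illustration, and the real reporter returns `Usage` objects instead of floats.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Entries in the documented shape; the contents of each usage dict are assumed.
entries: List[Dict[str, Any]] = [
    {"task_id": "task-a", "repeat_idx": 0,
     "usage_items": [("models:main_model", {"cost": 0.5}), ("tools:search", {"cost": 0.25})]},
    {"task_id": "task-a", "repeat_idx": 1,
     "usage_items": [("models:main_model", {"cost": 0.5})]},
    {"task_id": "task-b", "repeat_idx": 0,
     "usage_items": [("models:main_model", {"cost": 1.0})]},
]

cost_by_task: Dict[str, float] = defaultdict(float)
cost_by_component: Dict[str, float] = defaultdict(float)

for entry in entries:
    for key, usage_dict in entry["usage_items"]:
        cost = usage_dict.get("cost", 0.0)
        cost_by_task[entry["task_id"]] += cost  # summed across repetitions
        cost_by_component[key] += cost          # summed across tasks
```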

by_component

by_component() -> Dict[str, Usage]

Aggregate usage by registry key (e.g., "models:main_model").

by_task

by_task() -> Dict[str, Usage]

Aggregate usage by task_id across all repetitions.

from_reports `staticmethod`

from_reports(
    reports: List[Dict[str, Any]],
) -> UsageReporter

Create a UsageReporter from benchmark reports.

PARAMETER	DESCRIPTION
`reports`	The `benchmark.reports` list. TYPE: `List[Dict[str, Any]]`

RETURNS	DESCRIPTION
`UsageReporter`	A UsageReporter ready for analysis.

summary

summary() -> Dict[str, Any]

Nested dict with all breakdowns.

total

total() -> Usage

Grand total across all tasks and components.

LiteLLMCostCalculator

Cost calculator using LiteLLM's bundled pricing database.

LiteLLM maintains a comprehensive pricing file, model_prices_and_context_window.json (https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json), that covers most major LLM providers. This calculator delegates to litellm.cost_per_token for per-token rates and computes the total.

This is the recommended calculator for most users — it covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more without requiring manual pricing tables.

Note

If you're already using the LiteLLMModelAdapter, it extracts provider-reported cost from response._hidden_params.response_cost automatically. This calculator is useful as a fallback when using other adapters (OpenAI, Anthropic, Google) directly.

Example

from maseval.interface.usage import LiteLLMCostCalculator
from maseval.interface.inference import OpenAIModelAdapter

calculator = LiteLLMCostCalculator()
model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator)

# Cost is now computed automatically after each chat() call
response = model.chat([{"role": "user", "content": "Hello"}])
print(model.gather_usage().cost)  # e.g., 0.00123

__init__

__init__(
    custom_pricing: Optional[
        Dict[str, Dict[str, float]]
    ] = None,
    model_id_map: Optional[Dict[str, str]] = None,
)

Initialize the LiteLLM cost calculator.

PARAMETER	DESCRIPTION
`custom_pricing`	Optional overrides for specific models. Keys are model IDs; values are dicts with `"input_cost_per_token"` and `"output_cost_per_token"`. These take precedence over LiteLLM's built-in pricing. TYPE: `Optional[Dict[str, Dict[str, float]]]` DEFAULT: `None`
`model_id_map`	Optional mapping from adapter model IDs to LiteLLM model IDs. Use this when your adapter's `model_id` doesn't match LiteLLM's naming convention, for example when using Google's OpenAI-compatible endpoint, where the adapter sees `"gemini-2.0-flash"` but LiteLLM expects `"gemini/gemini-2.0-flash"`. TYPE: `Optional[Dict[str, str]]` DEFAULT: `None`

Example

LiteLLMCostCalculator(model_id_map={
    "gemini-2.0-flash": "gemini/gemini-2.0-flash",
})

calculate_cost

calculate_cost(
    usage: TokenUsage, model_id: str
) -> Optional[float]

Compute cost using LiteLLM's pricing database.

PARAMETER	DESCRIPTION
`usage`	Token usage from the call. TYPE: `TokenUsage`
`model_id`	The model identifier. Remapped via `model_id_map` if configured, then looked up in custom pricing and LiteLLM's database. TYPE: `str`

RETURNS	DESCRIPTION
`Optional[float]`	Cost in USD, or `None` if LiteLLM doesn't have pricing for this model and no custom pricing was provided.
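The lookup order described for `model_id` (remap, then custom pricing, then the built-in database) can be sketched with plain dicts. Everything here is hypothetical: `resolve_cost` is an illustrative helper, `BUILTIN_PRICING` stands in for LiteLLM's database with made-up rates chosen for easy arithmetic, and keying custom pricing by the remapped ID is an assumption.

```python
from typing import Dict, Optional

# Hypothetical stand-in for LiteLLM's pricing database (rates are not real prices).
BUILTIN_PRICING: Dict[str, Dict[str, float]] = {
    "gemini/gemini-2.0-flash": {"input_cost_per_token": 0.25, "output_cost_per_token": 0.5},
}


def resolve_cost(
    model_id: str,
    input_tokens: int,
    output_tokens: int,
    custom_pricing: Optional[Dict[str, Dict[str, float]]] = None,
    model_id_map: Optional[Dict[str, str]] = None,
) -> Optional[float]:
    """Hypothetical helper mirroring the documented lookup order."""
    # 1. Remap the adapter's model ID if a mapping is configured.
    resolved_id = (model_id_map or {}).get(model_id, model_id)
    # 2. Custom pricing takes precedence over the built-in table.
    rates = (custom_pricing or {}).get(resolved_id) or BUILTIN_PRICING.get(resolved_id)
    if rates is None:
        return None  # pricing unknown
    return (
        rates["input_cost_per_token"] * input_tokens
        + rates["output_cost_per_token"] * output_tokens
    )


remapped = resolve_cost(
    "gemini-2.0-flash", 4, 2,
    model_id_map={"gemini-2.0-flash": "gemini/gemini-2.0-flash"},
)
overridden = resolve_cost(
    "gemini/gemini-2.0-flash", 4, 2,
    custom_pricing={
        "gemini/gemini-2.0-flash": {"input_cost_per_token": 1.0, "output_cost_per_token": 1.0}
    },
)
missing = resolve_cost("unknown-model", 4, 2)
```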

gather_config

gather_config() -> Dict[str, Any]

Return calculator configuration for reproducibility.
