Usage & Cost Tracking

Usage and cost tracking provides data classes for recording resource consumption, a mixin for automatic collection, and pluggable cost calculators.

See the Usage & Cost Tracking guide for usage patterns and examples.

Usage dataclass

Generic usage record for any billable resource.

Represents accumulated cost and countable units for a component or aggregated group. All fields default to zero, so Usage() can be used as a starting value for accumulation with + and sum().

Note

cost defaults to 0.0. This means adding a Usage() to another record never changes the cost: Usage() + Usage(cost=0.05) gives cost=0.05. Components that track cost start at 0.0 and accumulate upward. Components that do not track cost (e.g., agent adapters that only count tokens) also default to 0.0 — their cost simply has no effect when summed with components that do report cost.

Grouping fields (provider, category, component_name, kind) identify what scope the record covers. When two records are summed, matching grouping fields are preserved; mismatches become None (meaning "aggregated over").

ATTRIBUTE DESCRIPTION
cost

Total cost in USD (or whatever unit your calculator uses). Defaults to 0.0.

TYPE: float

units

Arbitrary countable units (e.g., {"api_calls": 3}).

TYPE: Dict[str, int | float]

provider

Provider identifier (e.g., "anthropic", "bloomberg").

TYPE: Optional[str]

category

Registry category (e.g., "models", "tools").

TYPE: Optional[str]

component_name

Component name within category (e.g., "main_model").

TYPE: Optional[str]

kind

Component kind (e.g., "llm", "service", "local").

TYPE: Optional[str]

Example
usage = Usage(cost=0.05, units={"api_calls": 1}, provider="bloomberg", kind="service")

# Summing preserves matching fields
total = usage + Usage(cost=0.03, units={"api_calls": 2}, provider="bloomberg", kind="service")
assert total.cost == 0.08
assert total.units == {"api_calls": 3}
assert total.provider == "bloomberg"

# Usage() is the zero element
assert (usage + Usage()).cost == 0.05

# Accumulate with sum(); compare with a tolerance, since float addition is inexact
records = [Usage(cost=0.10), Usage(cost=0.20), Usage(cost=0.05)]
assert abs(sum(records, Usage()).cost - 0.35) < 1e-9

# Mismatched grouping fields become None
mixed = usage + Usage(cost=0.10, provider="anthropic", kind="llm")
assert mixed.provider is None  # aggregated over
assert mixed.kind is None      # aggregated over
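
The merge rule for grouping fields and units can be sketched in isolation. This is a minimal illustration, not the library's implementation: `merge_field` and `merge_units` are hypothetical helper names. It assumes that units sharing a key are summed, as in the example above, and that a field set on only one side counts as a mismatch.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def merge_field(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Keep a grouping value only when both records agree on it."""
    return a if a == b else None


def merge_units(a: Dict[str, Number], b: Dict[str, Number]) -> Dict[str, Number]:
    """Sum countable units key by key; a key missing on one side counts as 0."""
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged


same = merge_field("bloomberg", "bloomberg")      # both sides agree
different = merge_field("bloomberg", "anthropic")  # disagreement -> None
units = merge_units({"api_calls": 1}, {"api_calls": 2, "rows": 5})
```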

__radd__

__radd__(other: object) -> Usage

Support sum() by handling 0 + Usage.
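
Why this matters: when `sum()` is called without a start value it begins from the integer `0`, so the first addition is `0 + record`, which Python dispatches to `record.__radd__(0)`. The sketch below shows the pattern on `CostRecord`, a hypothetical stand-in that models only the cost field; the real `Usage` class may implement this differently.

```python
from dataclasses import dataclass


@dataclass
class CostRecord:
    """Hypothetical stand-in for Usage; only the cost field is modelled."""

    cost: float = 0.0

    def __add__(self, other: object) -> "CostRecord":
        if not isinstance(other, CostRecord):
            return NotImplemented
        return CostRecord(cost=self.cost + other.cost)

    def __radd__(self, other: object) -> "CostRecord":
        # sum() evaluates 0 + first_item; treat the int 0 as "nothing yet".
        if isinstance(other, int) and other == 0:
            return self
        return NotImplemented


# No explicit start value is needed because __radd__ handles the initial 0.
total = sum([CostRecord(0.25), CostRecord(0.5)])
```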

to_dict

to_dict() -> Dict[str, Any]

Serialize to a JSON-compatible dictionary.

TokenUsage dataclass

Bases: Usage

LLM-specific usage record with token counts.

Extends Usage with token fields reported by LLM providers. Use from_chat_response_usage() to create from the dict returned by model adapters.

ATTRIBUTE DESCRIPTION
input_tokens

Number of input/prompt tokens.

TYPE: int

output_tokens

Number of output/completion tokens.

TYPE: int

total_tokens

Total tokens (input + output).

TYPE: int

cached_input_tokens

Tokens served from cache (Anthropic cache_read_input_tokens, OpenAI cached_tokens).

TYPE: int

cache_creation_input_tokens

Tokens used to create a new cache entry (Anthropic cache_creation_input_tokens). Typically billed at a higher rate than regular input tokens.

TYPE: int

reasoning_tokens

Tokens used for reasoning (OpenAI reasoning_tokens, Google thoughts_token_count).

TYPE: int

audio_tokens

Tokens for audio processing (OpenAI).

TYPE: int

Example
token_usage = TokenUsage.from_chat_response_usage({
    "input_tokens": 100,
    "output_tokens": 50,
    "total_tokens": 150,
})
assert token_usage.input_tokens == 100

__radd__

__radd__(other: object) -> Usage

Support sum() by handling 0 + Usage.

from_chat_response_usage classmethod

from_chat_response_usage(
    usage_dict: Dict[str, Any],
    *,
    cost: float = 0.0,
    provider: Optional[str] = None,
    category: Optional[str] = None,
    component_name: Optional[str] = None,
    kind: str = "llm",
) -> TokenUsage

Create a TokenUsage from a ChatResponse.usage dict.

Maps provider-specific key names to the canonical fields.

PARAMETER DESCRIPTION
usage_dict

The usage dict from ChatResponse.usage.

TYPE: Dict[str, Any]

cost

Cost in USD (e.g., from provider-reported cost). Defaults to 0.0.

TYPE: float DEFAULT: 0.0

provider

Provider identifier.

TYPE: Optional[str] DEFAULT: None

category

Registry category.

TYPE: Optional[str] DEFAULT: None

component_name

Component name.

TYPE: Optional[str] DEFAULT: None

kind

Component kind, defaults to "llm".

TYPE: str DEFAULT: 'llm'

RETURNS DESCRIPTION
TokenUsage

A TokenUsage instance with mapped fields.
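
The key mapping can be pictured as an alias table. The sketch below is an assumption-laden illustration, not the library's code: `ALIASES` and `normalize_usage` are hypothetical names, the aliases are limited to the provider key names quoted in the attribute descriptions above, and the input is assumed to be a flat dict (the real method may also read nested provider structures).

```python
from typing import Any, Dict, Tuple

# Hypothetical alias table: canonical field -> provider key names to try, in order.
ALIASES: Dict[str, Tuple[str, ...]] = {
    "input_tokens": ("input_tokens",),
    "output_tokens": ("output_tokens",),
    "cached_input_tokens": ("cached_input_tokens", "cache_read_input_tokens", "cached_tokens"),
    "reasoning_tokens": ("reasoning_tokens", "thoughts_token_count"),
}


def normalize_usage(usage_dict: Dict[str, Any]) -> Dict[str, int]:
    """Return canonical token counts, defaulting absent fields to 0."""
    normalized: Dict[str, int] = {}
    for canonical, names in ALIASES.items():
        # Take the first alias that is present and not None.
        normalized[canonical] = next(
            (int(usage_dict[name]) for name in names if usage_dict.get(name) is not None),
            0,
        )
    return normalized


anthropic_style = normalize_usage(
    {"input_tokens": 100, "output_tokens": 50, "cache_read_input_tokens": 20}
)
```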

to_dict

to_dict() -> Dict[str, Any]

Serialize to a JSON-compatible dictionary.

UsageTrackableMixin

Mixin that provides usage tracking capability to any component.

Classes that inherit from UsageTrackableMixin can be registered with a Benchmark instance and will have their usage automatically collected by the registry via collect_usage().

The gather_usage() method provides a default implementation that returns an empty Usage. Subclasses should override this to return their accumulated usage data.

How to use

For custom components that incur billable costs, inherit from UsageTrackableMixin and override gather_usage():

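from typing import List

class MyPaidService(TraceableMixin, UsageTrackableMixin):
    def __init__(self):
        # One Usage record is appended per billable call
        self._usage_records: List[Usage] = []

    def call_api(self, query):
        # `api` stands in for your own billable client
        result = api.call(query)
        self._usage_records.append(Usage(
            cost=result.cost,
            units={"api_calls": 1},
        ))
        return result

    def gather_usage(self) -> Usage:
        # Usage() is the zero element, so an empty list yields an empty Usage
        return sum(self._usage_records, Usage())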

Then register it with your benchmark:

service = MyPaidService()
benchmark.register("tools", "my_service", service)

Thread Safety

Usage collection happens synchronously in the main thread after task execution completes. Components should use thread-safe data structures when accumulating usage during concurrent execution, but gather_usage() itself is called sequentially.
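
One way to follow this advice is to guard the shared record list with a lock. The sketch below is a minimal illustration with hypothetical names (`ThreadSafeCostLog`, `record`, `gather`); it tracks plain float costs rather than `Usage` objects so that it runs on its own.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ThreadSafeCostLog:
    """Hypothetical accumulator: a lock guards the shared list of per-call costs."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._costs: List[float] = []

    def record(self, cost: float) -> None:
        # Called from worker threads while tasks run concurrently.
        with self._lock:
            self._costs.append(cost)

    def gather(self) -> float:
        # Called once, sequentially, after the workers have finished.
        with self._lock:
            return sum(self._costs)


log = ThreadSafeCostLog()
with ThreadPoolExecutor(max_workers=4) as pool:
    # list() drains the iterator so every record() call has completed.
    list(pool.map(log.record, [0.25] * 8))

total_cost = log.gather()
```

In a real component, `record` would append a `Usage` and `gather_usage()` would sum the list under the same lock.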

gather_usage

gather_usage() -> Usage

Gather accumulated usage from this component.

Provides a default implementation that returns an empty Usage. Subclasses should override this to return their accumulated usage data.

RETURNS DESCRIPTION
Usage

Accumulated usage for this component.

How to use

Override this method to return your component's usage:

def gather_usage(self) -> Usage:
    return sum(self._usage_records, Usage())

CostCalculator

Bases: Protocol

Protocol for computing cost from token usage.

Implementations receive a TokenUsage and the model ID, and return the cost in whatever unit the calculator declares (typically USD).

Example
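# MY_PRICING is your own table: model_id -> {"input": ..., "output": ...} per-token rates
class MyCostCalculator:
    def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]:
        rate = MY_PRICING.get(model_id)
        if rate is None:
            return None  # pricing unknown for this model
        return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens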

calculate_cost

calculate_cost(
    usage: TokenUsage, model_id: str
) -> Optional[float]

Compute cost for a single chat call.

PARAMETER DESCRIPTION
usage

Token usage from the call.

TYPE: TokenUsage

model_id

The model identifier (e.g., "gpt-4", "claude-sonnet-4-5").

TYPE: str

RETURNS DESCRIPTION
Optional[float]

Cost as a float, or None if pricing is unknown for this model.
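
Because this is a Protocol, any class with a matching `calculate_cost` method qualifies without inheriting from it. The sketch below illustrates that, plus one possible way for a caller to handle the `None` case. All names are hypothetical stand-ins (`TokenCounts`, `CostCalculatorLike`, `FlatRateCalculator`, `cost_or_zero`), and falling back to 0.0 is an assumption chosen to mirror the default `Usage.cost`, not documented adapter behavior.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Set


@dataclass
class TokenCounts:
    """Hypothetical stand-in for TokenUsage with only two fields."""

    input_tokens: int = 0
    output_tokens: int = 0


class CostCalculatorLike(Protocol):
    """Stand-in for the CostCalculator protocol described above."""

    def calculate_cost(self, usage: TokenCounts, model_id: str) -> Optional[float]:
        ...


class FlatRateCalculator:
    """Satisfies the protocol structurally; no inheritance is needed."""

    def __init__(self, rate_per_token: float, known_models: Set[str]) -> None:
        self.rate_per_token = rate_per_token
        self.known_models = known_models

    def calculate_cost(self, usage: TokenCounts, model_id: str) -> Optional[float]:
        if model_id not in self.known_models:
            return None  # pricing unknown
        return self.rate_per_token * (usage.input_tokens + usage.output_tokens)


def cost_or_zero(calculator: CostCalculatorLike, usage: TokenCounts, model_id: str) -> float:
    """Assumed policy: treat unknown pricing as 0.0 so totals stay summable."""
    cost = calculator.calculate_cost(usage, model_id)
    return 0.0 if cost is None else cost


calc = FlatRateCalculator(rate_per_token=0.5, known_models={"model-a"})
known = cost_or_zero(calc, TokenCounts(2, 2), "model-a")
unknown = cost_or_zero(calc, TokenCounts(2, 2), "model-b")
```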

StaticPricingCalculator

Cost calculator using user-supplied per-model pricing.

Pricing is specified as cost per token (not per 1K or 1M tokens). If a model is not in the pricing table, calculate_cost returns None.

PARAMETER DESCRIPTION
pricing

Dict mapping model IDs to their per-token rates. Each value is a dict with keys:

  • "input" — cost per input token (required)
  • "output" — cost per output token (required)
  • "cached_input" — cost per cached input token (optional, defaults to "input" rate)
  • "cache_creation_input" — cost per cache creation token (optional, defaults to "input" rate)

TYPE: Dict[str, Dict[str, float]]

Example
calculator = StaticPricingCalculator({
    "gpt-4": {"input": 0.00003, "output": 0.00006},
    "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
})

model = LiteLLMModelAdapter(model_id="gpt-4", cost_calculator=calculator)

For university clusters or custom credit systems, the "cost" unit is whatever the pricing values represent (credits, EUR, etc.):

```python
calculator = StaticPricingCalculator({
    "llama-3-70b": {"input": 0.5, "output": 1.0},  # credits per token
})
```
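
Providers usually publish prices per 1M tokens, while this calculator expects per-token rates. The sketch below shows the conversion and the documented fallback of optional cache keys to the "input" rate. The helper names `per_million_to_per_token` and `rate_for` are hypothetical and not part of the library.

```python
from typing import Dict


def per_million_to_per_token(rates_per_million: Dict[str, float]) -> Dict[str, float]:
    """Convert published per-1M-token prices into per-token rates."""
    return {key: price / 1_000_000 for key, price in rates_per_million.items()}


def rate_for(rates: Dict[str, float], key: str) -> float:
    """Look up a rate; intended for the optional cache keys, which fall back to "input"."""
    return rates.get(key, rates["input"])


# 30 and 60 per 1M tokens correspond to per-token rates of 0.00003 and 0.00006.
gpt4_rates = per_million_to_per_token({"input": 30.0, "output": 60.0})

# 1,000 input tokens and 500 output tokens: roughly 0.03 + 0.03.
cost = 1_000 * gpt4_rates["input"] + 500 * gpt4_rates["output"]

# No "cached_input" rate was supplied, so the "input" rate applies.
cached_rate = rate_for(gpt4_rates, "cached_input")
```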

models property

models: List[str]

List of model IDs with pricing configured.

add_model

add_model(model_id: str, rates: Dict[str, float]) -> None

Add or update pricing for a model.

PARAMETER DESCRIPTION
model_id

The model identifier.

TYPE: str

rates

Per-token rates ("input", "output", optionally "cached_input" and "cache_creation_input").

TYPE: Dict[str, float]

calculate_cost

calculate_cost(
    usage: TokenUsage, model_id: str
) -> Optional[float]

Compute cost from static per-token rates.

PARAMETER DESCRIPTION
usage

Token usage from the call.

TYPE: TokenUsage

model_id

The model identifier to look up in the pricing table.

TYPE: str

RETURNS DESCRIPTION
Optional[float]

Computed cost, or None if the model is not in the pricing table.

gather_config

gather_config() -> Dict[str, Any]

Return pricing configuration for reproducibility.

UsageReporter

Post-hoc utility for analyzing usage across benchmark reports.

Walks report["usage"] across all reports to produce breakdowns by task, component, model, etc.

Example
reporter = UsageReporter.from_reports(benchmark.reports)
print(reporter.total())
print(reporter.by_task())
print(reporter.by_component())

__init__

__init__(entries: List[Dict[str, Any]])

Initialize with raw entries extracted from reports.

PARAMETER DESCRIPTION
entries

List of dicts, each with "task_id", "repeat_idx", and "usage_items" (list of (key, usage_dict) tuples).

TYPE: List[Dict[str, Any]]

by_component

by_component() -> Dict[str, Usage]

Aggregate usage by registry key (e.g., "models:main_model").

by_task

by_task() -> Dict[str, Usage]

Aggregate usage by task_id across all repetitions.
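
The aggregation can be sketched over the entry structure that `__init__` accepts. This is an illustration only: `cost_by_task` is a hypothetical function, it assumes each usage dict carries a "cost" key (as a serialized `Usage` plausibly would), and it sums costs only, whereas the real method returns full `Usage` objects.

```python
from collections import defaultdict
from typing import Any, Dict, List


def cost_by_task(entries: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum cost per task_id across repetitions and components."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in entries:
        for _registry_key, usage_dict in entry["usage_items"]:
            # Assumed: a missing "cost" contributes nothing.
            totals[entry["task_id"]] += usage_dict.get("cost", 0.0)
    return dict(totals)


entries = [
    {"task_id": "t1", "repeat_idx": 0, "usage_items": [("models:main_model", {"cost": 0.5})]},
    {"task_id": "t1", "repeat_idx": 1, "usage_items": [("models:main_model", {"cost": 0.25})]},
    {"task_id": "t2", "repeat_idx": 0, "usage_items": [("tools:my_service", {"cost": 1.0})]},
]
per_task = cost_by_task(entries)
```

Grouping on the registry key instead of `task_id` gives the analogous per-component breakdown.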

from_reports staticmethod

from_reports(
    reports: List[Dict[str, Any]],
) -> UsageReporter

Create a UsageReporter from benchmark reports.

PARAMETER DESCRIPTION
reports

The benchmark.reports list.

TYPE: List[Dict[str, Any]]

RETURNS DESCRIPTION
UsageReporter

A UsageReporter ready for analysis.

summary

summary() -> Dict[str, Any]

Nested dict with all breakdowns.

total

total() -> Usage

Grand total across all tasks and components.

LiteLLMCostCalculator

Cost calculator using LiteLLM's bundled pricing database.

LiteLLM maintains a comprehensive model_prices_and_context_window.json (https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) that covers most major LLM providers. This calculator delegates to litellm.cost_per_token for per-token rates and computes the total.

This is the recommended calculator for most users — it covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more without requiring manual pricing tables.

Note

If you're already using the LiteLLMModelAdapter, it extracts provider-reported cost from response._hidden_params.response_cost automatically. This calculator is useful as a fallback when using other adapters (OpenAI, Anthropic, Google) directly.

Example
from maseval.interface.usage import LiteLLMCostCalculator
from maseval.interface.inference import OpenAIModelAdapter

calculator = LiteLLMCostCalculator()
model = OpenAIModelAdapter(client=client, model_id="gpt-4", cost_calculator=calculator)

# Cost is now computed automatically after each chat() call
response = model.chat([{"role": "user", "content": "Hello"}])
print(model.gather_usage().cost)  # e.g., 0.00123

__init__

__init__(
    custom_pricing: Optional[
        Dict[str, Dict[str, float]]
    ] = None,
    model_id_map: Optional[Dict[str, str]] = None,
)

Initialize the LiteLLM cost calculator.

PARAMETER DESCRIPTION
custom_pricing

Optional overrides for specific models. Keys are model IDs, values are dicts with "input_cost_per_token" and "output_cost_per_token". These take precedence over LiteLLM's built-in pricing.

TYPE: Optional[Dict[str, Dict[str, float]]] DEFAULT: None

model_id_map

Optional mapping from adapter model IDs to LiteLLM model IDs. Use this when your adapter's model_id doesn't match LiteLLM's naming convention — e.g., when using Google's OpenAI-compatible endpoint where the adapter sees "gemini-2.0-flash" but LiteLLM expects "gemini/gemini-2.0-flash".

Example:

LiteLLMCostCalculator(model_id_map={
    "gemini-2.0-flash": "gemini/gemini-2.0-flash",
})

TYPE: Optional[Dict[str, str]] DEFAULT: None

calculate_cost

calculate_cost(
    usage: TokenUsage, model_id: str
) -> Optional[float]

Compute cost using LiteLLM's pricing database.

PARAMETER DESCRIPTION
usage

Token usage from the call.

TYPE: TokenUsage

model_id

The model identifier. Remapped via model_id_map if configured, then looked up in custom pricing and LiteLLM's database.

TYPE: str

RETURNS DESCRIPTION
Optional[float]

Cost in USD, or None if LiteLLM doesn't have pricing for this model and no custom pricing was provided.
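
The lookup order described for `model_id` can be sketched as follows. This is a simplified illustration, not the library's code: `resolve_cost` is a hypothetical function, `builtin_pricing` is a plain dict standing in for LiteLLM's pricing database, and custom pricing is assumed to be keyed by the remapped ID (the library may key it by the adapter's ID instead).

```python
from typing import Dict, Optional

Pricing = Dict[str, Dict[str, float]]


def resolve_cost(
    model_id: str,
    input_tokens: int,
    output_tokens: int,
    custom_pricing: Pricing,
    builtin_pricing: Pricing,
    model_id_map: Optional[Dict[str, str]] = None,
) -> Optional[float]:
    """Remap the ID, prefer custom pricing, then fall back to the built-in table."""
    resolved = (model_id_map or {}).get(model_id, model_id)
    if resolved in custom_pricing:
        rates = custom_pricing[resolved]
    elif resolved in builtin_pricing:
        rates = builtin_pricing[resolved]
    else:
        return None  # no pricing known for this model
    return (
        input_tokens * rates["input_cost_per_token"]
        + output_tokens * rates["output_cost_per_token"]
    )


# Illustrative rates only, chosen for easy arithmetic.
builtin = {"gemini/gemini-2.0-flash": {"input_cost_per_token": 0.5, "output_cost_per_token": 1.0}}
id_map = {"gemini-2.0-flash": "gemini/gemini-2.0-flash"}

remapped = resolve_cost("gemini-2.0-flash", 2, 3, {}, builtin, id_map)
missing = resolve_cost("unknown-model", 1, 1, {}, builtin)
```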

gather_config

gather_config() -> Dict[str, Any]

Return calculator configuration for reproducibility.