Model Scorers

Model Scorers provide a uniform interface for log-likelihood computation across model providers. Unlike ModelAdapter (which handles text generation and chat), a scorer evaluates how likely a model considers a continuation to be, given some context.

Note

ModelScorer is the scoring counterpart to ModelAdapter. Use it when you need log-likelihood evaluation (e.g., multiple-choice benchmarks) rather than text generation.

ModelScorer

Bases: ABC, TraceableMixin, ConfigurableMixin

Abstract base class for model scorers.

ModelScorer provides a consistent interface for computing token-level log-likelihoods from language models. All scorers implement the same methods, so you can swap providers without changing evaluation code.

To use a scorer:

Create an instance with provider-specific configuration
Call loglikelihood() for single context-continuation pairs
Call loglikelihood_batch() for efficient batch computation
Call loglikelihood_choices() for MCQ evaluation

Implementing a custom scorer:

Subclass ModelScorer and implement:

model_id property: Return the model identifier string
_loglikelihood_impl(): Score a single (context, continuation) pair

Optionally override:

_loglikelihood_batch_impl(): Optimised batch scoring
loglikelihood_choices(): MCQ-specific optimisations (e.g. shared-context single-pass)
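
The subclassing contract above can be sketched roughly as follows. Since the package's import path is not shown on this page, the sketch defines a stripped-down stand-in base class, `ScorerBase`, that mirrors only the documented method names; the real ModelScorer also carries the tracing and configuration mixins, and its internals may differ. `UnigramScorer` and its probability table are purely illustrative.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class ScorerBase(ABC):
    """Hypothetical stand-in for ModelScorer (tracing/config mixins omitted)."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._seed = seed

    @property
    def seed(self) -> Optional[int]:
        return self._seed

    @property
    @abstractmethod
    def model_id(self) -> str:
        """Return the model identifier string."""

    @abstractmethod
    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
        """Score a single (context, continuation) pair."""

    def _loglikelihood_batch_impl(
        self, pairs: List[Tuple[str, str]]
    ) -> List[float]:
        # Default behaviour: loop over the single-pair implementation.
        return [self._loglikelihood_impl(ctx, cont) for ctx, cont in pairs]

    def loglikelihood(self, context: str, continuation: str) -> float:
        return self._loglikelihood_impl(context, continuation)

    def loglikelihood_batch(self, pairs: List[Tuple[str, str]]) -> List[float]:
        return self._loglikelihood_batch_impl(pairs)


class UnigramScorer(ScorerBase):
    """Toy scorer: ignores the context and uses fixed per-word probabilities."""

    def __init__(self, probs: Dict[str, float], seed: Optional[int] = None) -> None:
        super().__init__(seed=seed)
        self._probs = probs

    @property
    def model_id(self) -> str:
        return "toy/unigram"

    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
        # Sum of per-word log-probabilities; unknown words get a small floor.
        return sum(math.log(self._probs.get(w, 1e-6)) for w in continuation.split())


scorer = UnigramScorer({"yes": 0.5, "no": 0.25})
print(scorer.model_id)
print(scorer.loglikelihood("Is water wet?", " yes"))
print(scorer.loglikelihood_batch([("Q", " yes"), ("Q", " no")]))
```

Only the two required members are implemented in the subclass; batch scoring comes for free from the default loop and can be overridden later if the provider supports true batching.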

model_id `abstractmethod` `property`

model_id: str

The identifier for the underlying model.

RETURNS	DESCRIPTION
`str`	A string identifying the model (e.g., `"meta-llama/Llama-2-7b-hf"`).

seed `property`

seed: Optional[int]

Seed for deterministic scoring, or None if unseeded.

__init__

__init__(seed: Optional[int] = None)

Initialize the model scorer.

PARAMETER	DESCRIPTION
`seed`	Seed for deterministic scoring. Passed to the underlying model if supported. TYPE: `Optional[int]` DEFAULT: `None`

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this scorer.

Output fields:

type - Component class name
gathered_at - ISO timestamp
model_id - Model identifier
scorer_type - The specific scorer class name
seed - Seed for deterministic scoring, or None if unseeded

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing scorer configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this scorer.

Output fields:

type - Component class name
gathered_at - ISO timestamp
model_id - Model identifier
total_calls - Number of scoring calls
successful_calls - Number of successful calls
failed_calls - Number of failed calls
total_duration_seconds - Total time spent in calls
logs - List of individual call records

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing scorer execution traces.
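
To make the shape of the trace dictionary concrete, here is a minimal sketch that assembles one from a list of call records. The top-level keys follow the field list above; the per-record keys (`"ok"`, `"duration_seconds"`), the helper name `summarise_calls`, and the placeholder class name are assumptions for illustration, not the library's actual record format.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def summarise_calls(model_id: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a trace dictionary using the documented top-level field names.

    The record keys "ok" and "duration_seconds" are hypothetical.
    """
    successful = sum(1 for record in logs if record["ok"])
    return {
        "type": "ExampleScorer",  # placeholder component class name
        "gathered_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "total_calls": len(logs),
        "successful_calls": successful,
        "failed_calls": len(logs) - successful,
        "total_duration_seconds": sum(r["duration_seconds"] for r in logs),
        "logs": logs,
    }


traces = summarise_calls(
    "toy/unigram",
    [
        {"ok": True, "duration_seconds": 0.25},
        {"ok": False, "duration_seconds": 0.5},
    ],
)
print(traces["total_calls"], traces["failed_calls"])
```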

loglikelihood

loglikelihood(context: str, continuation: str) -> float

Compute the log-likelihood of continuation given context.

PARAMETER	DESCRIPTION
`context`	The conditioning text (prompt). TYPE: `str`
`continuation`	The text whose likelihood is scored. TYPE: `str`

RETURNS	DESCRIPTION
`float`	Log-likelihood of the continuation (a non-positive float; higher means more likely).
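
As a reminder of what the returned number represents: under the usual convention (assumed here, since this page does not spell out the formula), a continuation's log-likelihood is the sum of the natural-log conditional probabilities of its tokens. The probabilities below are made up.

```python
import math

# Made-up conditional probabilities for a three-token continuation,
# i.e. p(t1 | context), p(t2 | context, t1), p(t3 | context, t1, t2).
token_probs = [0.5, 0.25, 0.8]

# Log-likelihood of the whole continuation: sum of token log-probabilities.
loglik = sum(math.log(p) for p in token_probs)

# Exponentiating recovers the joint probability of the continuation.
joint = math.exp(loglik)
print(loglik, joint)
```

Because every token probability is at most 1, each term is at most 0, which is why longer continuations tend to receive lower scores.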

loglikelihood_batch

loglikelihood_batch(
    pairs: List[Tuple[str, str]],
) -> List[float]

Compute log-likelihoods for a batch of (context, continuation) pairs.

Override _loglikelihood_batch_impl for provider-specific batching optimisations. The default loops over _loglikelihood_impl.

PARAMETER	DESCRIPTION
`pairs`	List of (context, continuation) tuples. TYPE: `List[Tuple[str, str]]`

RETURNS	DESCRIPTION
`List[float]`	List of log-likelihoods, one per pair.

loglikelihood_choices

loglikelihood_choices(
    context: str, choices: List[str], delimiter: str = " "
) -> List[float]

Compute log-likelihoods for multiple-choice continuations.

Convenience method for MCQ evaluation. The delimiter is prepended to each choice before scoring (e.g. " A", " B").

Subclasses may override this for optimised shared-context scoring (e.g. single forward pass for single-token choices).

PARAMETER	DESCRIPTION
`context`	The question/prompt text. TYPE: `str`
`choices`	Answer choice strings (e.g. `["A", "B", "C", "D"]`). TYPE: `List[str]`
`delimiter`	String prepended to each choice (default `" "`). TYPE: `str` DEFAULT: `' '`

RETURNS	DESCRIPTION
`List[float]`	List of log-likelihoods, one per choice.
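
The documented behaviour (prepend the delimiter, score each choice, return one value per choice) can be sketched as a free function. `score_choices` and `toy_loglikelihood` are hypothetical names used for illustration; the base-class implementation may differ in detail.

```python
from typing import Callable, List


def score_choices(
    loglikelihood: Callable[[str, str], float],
    context: str,
    choices: List[str],
    delimiter: str = " ",
) -> List[float]:
    """Score each choice as the continuation `delimiter + choice`."""
    return [loglikelihood(context, delimiter + choice) for choice in choices]


def toy_loglikelihood(context: str, continuation: str) -> float:
    # Hypothetical stand-in for a model call: favours " B" over everything else.
    return -0.1 if continuation == " B" else -2.0


choices = ["A", "B", "C", "D"]
scores = score_choices(toy_loglikelihood, "Q: 2 + 2 = ?\nAnswer:", choices)

# The predicted answer is the choice with the highest log-likelihood.
best = choices[max(range(len(choices)), key=scores.__getitem__)]
print(scores, best)
```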

Interfaces

The following scorer classes implement the ModelScorer interface for specific providers.

HuggingFaceModelScorer

Bases: ModelScorer

Log-likelihood scorer backed by a HuggingFace causal language model.

Loads the model lazily on first use. Supports:

Single-token optimisation: when all continuations map to a single token, one forward pass scores every choice.
Multi-token fallback: separate forward pass per continuation.
loglikelihood_choices() override that picks the optimal path automatically.

The tokenisation strategy matches lm-evaluation-harness: context and continuation are encoded separately, then concatenated to handle tokenisation-boundary effects correctly.
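
The encode-separately-then-concatenate idea can be illustrated with a toy tokenizer. This is only a sketch: `toy_encode` is a hypothetical whitespace-based tokenizer standing in for the real HuggingFace tokenizer, and the scorer's actual handling of special tokens or edge cases is not shown on this page.

```python
import re
from typing import List, Tuple


def toy_encode(text: str) -> List[str]:
    """Hypothetical tokenizer: each token is a word plus its leading whitespace."""
    return re.findall(r"\s*\S+", text)


def encode_pair(context: str, continuation: str) -> Tuple[List[str], int]:
    """Encode both parts separately, then concatenate.

    Returns the full token sequence and the number of continuation tokens,
    so the scored positions are exactly the last `n_cont` tokens.
    """
    ctx_tokens = toy_encode(context)
    cont_tokens = toy_encode(continuation)
    return ctx_tokens + cont_tokens, len(cont_tokens)


tokens, n_cont = encode_pair("The capital of France is", " Paris")
print(tokens[-n_cont:])
```

Encoding the two parts separately keeps the continuation's token span unambiguous; encoding the joined string instead could merge tokens across the boundary and blur which positions belong to the continuation.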

seed `property`

seed: Optional[int]

Seed for deterministic scoring, or None if unseeded.

__init__

__init__(
    model_id: str,
    device: str = "cuda:0",
    trust_remote_code: bool = True,
    seed: Optional[int] = None,
)

Initialize HuggingFace model scorer.

PARAMETER	DESCRIPTION
`model_id`	HuggingFace model identifier (e.g. `"meta-llama/Llama-2-7b-hf"`). TYPE: `str`
`device`	Torch device string (e.g. `"cuda:0"`, `"cpu"`). TYPE: `str` DEFAULT: `'cuda:0'`
`trust_remote_code`	Trust remote code when loading the model. TYPE: `bool` DEFAULT: `True`
`seed`	Seed for deterministic scoring. TYPE: `Optional[int]` DEFAULT: `None`

gather_config

gather_config() -> Dict[str, Any]

Gather configuration including device and model settings.

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing scorer configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this scorer.

Output fields:

type - Component class name
gathered_at - ISO timestamp
model_id - Model identifier
total_calls - Number of scoring calls
successful_calls - Number of successful calls
failed_calls - Number of failed calls
total_duration_seconds - Total time spent in calls
logs - List of individual call records

RETURNS	DESCRIPTION
`Dict[str, Any]`	Dictionary containing scorer execution traces.

loglikelihood

loglikelihood(context: str, continuation: str) -> float

Compute the log-likelihood of continuation given context.

PARAMETER	DESCRIPTION
`context`	The conditioning text (prompt). TYPE: `str`
`continuation`	The text whose likelihood is scored. TYPE: `str`

RETURNS	DESCRIPTION
`float`	Log-likelihood of the continuation (a non-positive float; higher means more likely).

loglikelihood_batch

loglikelihood_batch(
    pairs: List[Tuple[str, str]],
) -> List[float]

Compute log-likelihoods for a batch of (context, continuation) pairs.

Override _loglikelihood_batch_impl for provider-specific batching optimisations. The default loops over _loglikelihood_impl.

PARAMETER	DESCRIPTION
`pairs`	List of (context, continuation) tuples. TYPE: `List[Tuple[str, str]]`

RETURNS	DESCRIPTION
`List[float]`	List of log-likelihoods, one per pair.

loglikelihood_choices

loglikelihood_choices(
    context: str, choices: List[str], delimiter: str = " "
) -> List[float]

Score multiple-choice continuations with shared-context optimisation.

When every delimiter + choice string maps to a single continuation token, all choices are scored in one forward pass. Otherwise, scoring falls back to one call per choice via _loglikelihood_impl.

PARAMETER	DESCRIPTION
`context`	The question/prompt text. TYPE: `str`
`choices`	Answer choice strings (e.g. `["A", "B", "C", "D"]`). TYPE: `List[str]`
`delimiter`	String prepended to each choice (default `" "`). TYPE: `str` DEFAULT: `' '`

RETURNS	DESCRIPTION
`List[float]`	List of log-likelihoods, one per choice.
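
The single-forward-pass path can be sketched without a real model: one forward pass over the context yields a vector of next-token logits, a log-softmax turns it into log-probabilities, and each single-token choice is read off by its token id. The vocabulary and logits below are made up, and the real scorer's internals may differ.

```python
import math
from typing import Dict, List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Made-up next-token logits over a five-token toy vocabulary.
vocab: Dict[str, int] = {" A": 0, " B": 1, " C": 2, " D": 3, " the": 4}
next_token_logits = [1.0, 3.0, 0.5, 0.0, 2.0]

# One log-softmax over the shared context's output covers every choice.
log_probs = log_softmax(next_token_logits)
choice_scores = [log_probs[vocab[" " + c]] for c in ["A", "B", "C", "D"]]
print(choice_scores)
```

With multi-token choices this lookup no longer applies, because later tokens depend on earlier ones, which is why a per-choice fallback is needed.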
