Model Scorers
Model Scorers provide a uniform interface for log-likelihood computation across model providers. Unlike ModelAdapter (which handles text generation and chat), a scorer measures how likely a model considers a continuation to be, given some context.
Note
ModelScorer is the scoring counterpart to ModelAdapter. Use it when you need log-likelihood evaluation (e.g., multiple-choice benchmarks) rather than text generation.
ModelScorer
Bases: ABC, TraceableMixin, ConfigurableMixin
Abstract base class for model scorers.
ModelScorer provides a consistent interface for computing token-level
log-likelihoods from language models. All scorers implement the same
methods, so you can swap providers without changing evaluation code.
To use a scorer:
- Create an instance with provider-specific configuration
- Call loglikelihood() for single context-continuation pairs
- Call loglikelihood_batch() for efficient batch computation
- Call loglikelihood_choices() for MCQ evaluation
Implementing a custom scorer:
Subclass ModelScorer and implement:
- model_id property: Return the model identifier string
- _loglikelihood_impl(): Score a single (context, continuation) pair
Optionally override:
- _loglikelihood_batch_impl(): Optimised batch scoring
- loglikelihood_choices(): MCQ-specific optimisations (e.g. shared-context single-pass)
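As a rough illustration of the subclassing contract above, the sketch below uses a stand-in base class. SketchScorer and UniformCharScorer are hypothetical names invented for this example: the stand-in mirrors only the documented method names and does not reproduce the real ModelScorer, its mixins, or its tracing.

```python
import math
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class SketchScorer(ABC):
    """Stand-in for ModelScorer (assumption: mirrors only the documented names)."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._seed = seed

    @property
    def seed(self) -> Optional[int]:
        return self._seed

    @property
    @abstractmethod
    def model_id(self) -> str:
        """Return the model identifier string."""

    @abstractmethod
    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
        """Score a single (context, continuation) pair."""

    def _loglikelihood_batch_impl(self, pairs: List[Tuple[str, str]]) -> List[float]:
        # Default batching: loop over the single-pair implementation.
        return [self._loglikelihood_impl(ctx, cont) for ctx, cont in pairs]

    def loglikelihood(self, context: str, continuation: str) -> float:
        return self._loglikelihood_impl(context, continuation)

    def loglikelihood_batch(self, pairs: List[Tuple[str, str]]) -> List[float]:
        return self._loglikelihood_batch_impl(pairs)


class UniformCharScorer(SketchScorer):
    """Toy scorer: every character has probability 1 / alphabet_size."""

    def __init__(self, alphabet_size: int = 27, seed: Optional[int] = None) -> None:
        super().__init__(seed=seed)
        self._alphabet_size = alphabet_size

    @property
    def model_id(self) -> str:
        return f"uniform-char-{self._alphabet_size}"

    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
        # This toy model ignores the context entirely.
        return -len(continuation) * math.log(self._alphabet_size)
```

Only the two required members are implemented in the subclass; batch scoring falls through to the looping default.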
model_id
abstractmethod
property
model_id: str
The identifier for the underlying model.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string identifying the model (e.g., …). |
seed
property
seed: Optional[int]
Seed for deterministic scoring, or None if unseeded.
__init__
__init__(seed: Optional[int] = None)
Initialize the model scorer.
| PARAMETER | DESCRIPTION |
|---|---|
| seed | Seed for deterministic scoring. Passed to the underlying model if supported. TYPE: Optional[int]. DEFAULT: None. |
gather_config
gather_config() -> Dict[str, Any]
Gather configuration from this scorer.
Output fields:
- type - Component class name
- gathered_at - ISO timestamp
- model_id - Model identifier
- scorer_type - The specific scorer class name
- seed - Seed for deterministic scoring, or None if unseeded
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dictionary containing scorer configuration. |
gather_traces
gather_traces() -> Dict[str, Any]
Gather execution traces from this scorer.
Output fields:
- type - Component class name
- gathered_at - ISO timestamp
- model_id - Model identifier
- total_calls - Number of scoring calls
- successful_calls - Number of successful calls
- failed_calls - Number of failed calls
- total_duration_seconds - Total time spent in calls
- logs - List of individual call records
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dictionary containing scorer execution traces. |
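To show how the output fields listed above relate to each other, here is a minimal sketch that aggregates a list of call records into a dictionary with the same field names. The helper summarise_calls, the per-call keys "ok" and "duration_seconds", and the "SketchScorer" type value are all assumptions made for this example; the real structure of the individual call records is not specified here.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def summarise_calls(model_id: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a trace dictionary using the documented field names.

    Assumption: each call record has "ok" (bool) and "duration_seconds" (float).
    """
    successful = sum(1 for record in logs if record["ok"])
    return {
        "type": "SketchScorer",  # hypothetical component class name
        "gathered_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "total_calls": len(logs),
        "successful_calls": successful,
        "failed_calls": len(logs) - successful,
        "total_duration_seconds": sum(r["duration_seconds"] for r in logs),
        "logs": logs,
    }


traces = summarise_calls(
    "toy-model",
    [
        {"ok": True, "duration_seconds": 0.25},
        {"ok": False, "duration_seconds": 0.5},
    ],
)
```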
loglikelihood
loglikelihood(context: str, continuation: str) -> float
Compute the log-likelihood of continuation given context.
| PARAMETER | DESCRIPTION |
|---|---|
| context | The conditioning text (prompt). TYPE: str. |
| continuation | The text whose likelihood is scored. TYPE: str. |
| RETURNS | DESCRIPTION |
|---|---|
| float | Log-likelihood (a non-positive float; higher means more likely). |
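To make the return value concrete: under the usual definition, the log-likelihood of a multi-token continuation is the sum of the per-token conditional log-probabilities, so exponentiating it recovers the joint probability. The helper sequence_loglikelihood below is hypothetical and takes token probabilities directly rather than calling a model.

```python
import math
from typing import Sequence


def sequence_loglikelihood(token_probs: Sequence[float]) -> float:
    """Sum of per-token log-probabilities (hypothetical helper, not part of the API)."""
    return sum(math.log(p) for p in token_probs)


# A two-token continuation whose tokens have conditional probabilities 0.5 and 0.25.
ll = sequence_loglikelihood([0.5, 0.25])

# Exponentiating the sum recovers the joint probability 0.5 * 0.25.
joint = math.exp(ll)
```

Because every probability is at most 1, each term is at most 0, which is why scores are non-positive and longer continuations tend to score lower.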
loglikelihood_batch
loglikelihood_batch(
pairs: List[Tuple[str, str]],
) -> List[float]
Compute log-likelihoods for a batch of (context, continuation) pairs.
Override _loglikelihood_batch_impl for provider-specific batching
optimisations. The default loops over _loglikelihood_impl.
| PARAMETER | DESCRIPTION |
|---|---|
| pairs | List of (context, continuation) tuples. TYPE: List[Tuple[str, str]]. |
| RETURNS | DESCRIPTION |
|---|---|
| List[float] | List of log-likelihoods, one per pair. |
loglikelihood_choices
loglikelihood_choices(
context: str, choices: List[str], delimiter: str = " "
) -> List[float]
Compute log-likelihoods for multiple-choice continuations.
Convenience method for MCQ evaluation. Each choice is prepended with
delimiter before scoring (e.g. " A", " B").
Subclasses may override this for optimised shared-context scoring (e.g. single forward pass for single-token choices).
| PARAMETER | DESCRIPTION |
|---|---|
| context | The question/prompt text. TYPE: str. |
| choices | Answer choice strings (e.g. …). TYPE: List[str]. |
| delimiter | String prepended to each choice (default " "). TYPE: str. DEFAULT: " ". |
| RETURNS | DESCRIPTION |
|---|---|
| List[float] | List of log-likelihoods, one per choice. |
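The behaviour described above (prepend the delimiter, score each choice, then compare) can be sketched as follows. Both choices_loglikelihood and toy_score are hypothetical: toy_score stands in for a scorer's loglikelihood() and returns hard-coded values, and picking the answer with an argmax is a common evaluation convention rather than something this method does itself.

```python
from typing import Callable, List


def choices_loglikelihood(
    score_fn: Callable[[str, str], float],
    context: str,
    choices: List[str],
    delimiter: str = " ",
) -> List[float]:
    """Score delimiter + choice for every choice, preserving input order."""
    return [score_fn(context, delimiter + choice) for choice in choices]


def toy_score(context: str, continuation: str) -> float:
    # Hypothetical stand-in for loglikelihood(): hard-coded scores favouring " B".
    table = {" A": -2.3, " B": -0.4, " C": -3.1}
    return table.get(continuation, -10.0)


scores = choices_loglikelihood(toy_score, "Q: 1 + 1 = ?\nAnswer:", ["A", "B", "C"])

# Index of the most likely choice (highest log-likelihood).
best = max(range(len(scores)), key=scores.__getitem__)
```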
Interfaces
The following scorer classes implement the ModelScorer interface for specific providers.
HuggingFaceModelScorer
Bases: ModelScorer
Log-likelihood scorer backed by a HuggingFace causal language model.
Loads the model lazily on first use. Supports:
- Single-token optimisation: when all continuations map to a single token, one forward pass scores every choice.
- Multi-token fallback: separate forward pass per continuation.
- loglikelihood_choices() override that picks the optimal path automatically.
The tokenisation strategy matches lm-evaluation-harness: context and
continuation are encoded separately, then concatenated to handle
tokenisation-boundary effects correctly.
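The encode-separately-then-concatenate strategy can be sketched without any model dependencies. Everything named below (build_inputs, continuation_loglikelihood, the character-level toy tokenizer, and the uniform toy distribution) is hypothetical and only illustrates the indexing: in a causal language model, the prediction at position t scores the token at position t + 1.

```python
import math
from typing import Callable, List, Sequence, Tuple


def build_inputs(
    encode: Callable[[str], List[int]], context: str, continuation: str
) -> Tuple[List[int], int]:
    """Encode both parts separately, concatenate, and return the continuation offset."""
    context_ids = encode(context)
    continuation_ids = encode(continuation)
    return context_ids + continuation_ids, len(context_ids)


def continuation_loglikelihood(
    next_token_logprobs: Sequence[Sequence[float]],
    input_ids: Sequence[int],
    cont_start: int,
) -> float:
    """Sum log p(token_i | tokens before i) over the continuation positions.

    next_token_logprobs[t][v] is the log-probability that token v follows
    input_ids[: t + 1].
    """
    if cont_start < 1:
        raise ValueError("context must contain at least one token")
    return sum(
        next_token_logprobs[i - 1][input_ids[i]]
        for i in range(cont_start, len(input_ids))
    )


# Toy character-level tokenizer over a three-symbol vocabulary (hypothetical).
VOCAB = {"a": 0, "b": 1, " ": 2}


def encode(text: str) -> List[int]:
    return [VOCAB[ch] for ch in text]


input_ids, cont_start = build_inputs(encode, "ab", " a")

# Toy "model": a uniform next-token distribution at every position.
uniform = [[math.log(1 / 3)] * 3 for _ in input_ids]
ll = continuation_loglikelihood(uniform, input_ids, cont_start)
```

Only the continuation positions contribute to the sum; the context tokens condition the prediction but are never scored themselves.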
seed
property
seed: Optional[int]
Seed for deterministic scoring, or None if unseeded.
__init__
__init__(
model_id: str,
device: str = "cuda:0",
trust_remote_code: bool = True,
seed: Optional[int] = None,
)
Initialize HuggingFace model scorer.
| PARAMETER | DESCRIPTION |
|---|---|
| model_id | HuggingFace model identifier (e.g. …). TYPE: str. |
| device | Torch device string (e.g. …). TYPE: str. DEFAULT: "cuda:0". |
| trust_remote_code | Trust remote code when loading the model. TYPE: bool. DEFAULT: True. |
| seed | Seed for deterministic scoring. TYPE: Optional[int]. DEFAULT: None. |
gather_config
gather_config() -> Dict[str, Any]
Gather configuration including device and model settings.
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dictionary containing scorer configuration. |
gather_traces
gather_traces() -> Dict[str, Any]
Gather execution traces from this scorer.
Output fields:
- type - Component class name
- gathered_at - ISO timestamp
- model_id - Model identifier
- total_calls - Number of scoring calls
- successful_calls - Number of successful calls
- failed_calls - Number of failed calls
- total_duration_seconds - Total time spent in calls
- logs - List of individual call records
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dictionary containing scorer execution traces. |
loglikelihood
loglikelihood(context: str, continuation: str) -> float
Compute the log-likelihood of continuation given context.
| PARAMETER | DESCRIPTION |
|---|---|
| context | The conditioning text (prompt). TYPE: str. |
| continuation | The text whose likelihood is scored. TYPE: str. |
| RETURNS | DESCRIPTION |
|---|---|
| float | Log-likelihood (a non-positive float; higher means more likely). |
loglikelihood_batch
loglikelihood_batch(
pairs: List[Tuple[str, str]],
) -> List[float]
Compute log-likelihoods for a batch of (context, continuation) pairs.
Override _loglikelihood_batch_impl for provider-specific batching
optimisations. The default loops over _loglikelihood_impl.
| PARAMETER | DESCRIPTION |
|---|---|
| pairs | List of (context, continuation) tuples. TYPE: List[Tuple[str, str]]. |
| RETURNS | DESCRIPTION |
|---|---|
| List[float] | List of log-likelihoods, one per pair. |
loglikelihood_choices
loglikelihood_choices(
context: str, choices: List[str], delimiter: str = " "
) -> List[float]
Score multiple-choice continuations with shared-context optimisation.
When every delimiter + choice maps to a single continuation token,
all choices are scored in one forward pass. Otherwise falls back to
per-choice scoring via _loglikelihood_impl.
| PARAMETER | DESCRIPTION |
|---|---|
| context | The question/prompt text. TYPE: str. |
| choices | Answer choice strings (e.g. …). TYPE: List[str]. |
| delimiter | String prepended to each choice (default " "). TYPE: str. DEFAULT: " ". |
| RETURNS | DESCRIPTION |
|---|---|
| List[float] | List of log-likelihoods, one per choice. |
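The dispatch between the single-pass path and the per-choice fallback can be sketched in plain Python. The names score_choices and log_softmax, the toy logits, and the fallback callables are hypothetical; the actual implementation presumably operates on model tensors, which are not reproduced here.

```python
import math
from typing import Callable, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def score_choices(
    last_position_logits: Sequence[float],
    choice_token_ids: List[List[int]],
    fallback: Callable[[int], float],
) -> List[float]:
    """Use one shared log-softmax if every choice is a single token, else fall back."""
    if all(len(ids) == 1 for ids in choice_token_ids):
        # Single forward pass: read each choice's log-probability from one vector.
        logprobs = log_softmax(last_position_logits)
        return [logprobs[ids[0]] for ids in choice_token_ids]
    # Multi-token case: score each choice separately (fallback receives its index).
    return [fallback(i) for i in range(len(choice_token_ids))]


# Toy four-token vocabulary; the logits are for the position right after the context.
logits = [2.0, 1.0, 0.0, -1.0]

# Every choice is one token, so the shared path is used and fallback is never called.
single = score_choices(logits, [[0], [1], [3]], fallback=lambda i: float("nan"))

# The first choice spans two tokens, so all choices go through the fallback.
multi = score_choices(logits, [[0, 2], [1]], fallback=lambda i: -5.0 - i)
```

The single-pass path is valid only because all choices share the same context, so one next-token distribution covers every candidate.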