HuggingFace Inference Adapters

This page documents the HuggingFace model adapters for MASEval.

Pipeline Model Adapter (Text Generation)

HuggingFacePipelineModelAdapter

Bases: ModelAdapter

Adapter for HuggingFace transformers pipelines and callables.

Wraps a HuggingFace pipeline() object (or any text-generation callable) for use with the ModelAdapter interface (chat(), generate()).

For log-likelihood scoring, see HuggingFaceModelScorer.

Works with:

  • transformers.pipeline() objects
  • Any callable that accepts a prompt and returns text
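As a minimal sketch of the second case, any plain function with the shape Callable[[str], str] fits the documented model argument. The name echo_model is hypothetical, and the commented-out constructor call assumes the adapter has been imported from wherever MASEval exposes it (the import path is not shown on this page).

```python
from typing import Callable


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in model: returns the prompt in upper case."""
    return prompt.upper()


# The adapter's `model` parameter is typed Callable[[str], str],
# so a plain function like this one matches the documented signature.
model_fn: Callable[[str], str] = echo_model
print(model_fn("hello"))

# Assumed usage (import path not shown on this page):
# adapter = HuggingFacePipelineModelAdapter(model=echo_model, model_id="echo")
```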

For chat functionality, the adapter applies the tokenizer's chat template when one is available, so that instruction-tuned models receive correctly formatted prompts.

Tool calling support

Tool calling is only supported if the model's chat template explicitly supports it. If you pass tools and the model doesn't support them, a ToolCallingNotSupportedError is raised. For reliable tool calling, consider using LiteLLMModelAdapter instead.

seed property

seed: Optional[int]

Seed for deterministic generation, or None if unseeded.

__init__

__init__(
    model: Callable[[str], str],
    model_id: Optional[str] = None,
    default_generation_params: Optional[
        Dict[str, Any]
    ] = None,
    seed: Optional[int] = None,
    cost_calculator: Optional[CostCalculator] = None,
)

Initialize HuggingFace model adapter.

PARAMETER DESCRIPTION
model

A callable that generates text. Can be:

  • A transformers pipeline (e.g., pipeline("text-generation", ...))
  • Any callable that takes a prompt string and returns text

TYPE: Callable[[str], str]

model_id

Identifier for the model. If not provided, attempts to extract from the model's name_or_path attribute.

TYPE: Optional[str] DEFAULT: None

default_generation_params

Default parameters for all calls. Common parameters: max_new_tokens, temperature, top_p, do_sample.

TYPE: Optional[Dict[str, Any]] DEFAULT: None
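The sketch below shows one plausible way defaults and per-call parameters could combine. It assumes that per-call generation_params take precedence over default_generation_params; this page does not state the precedence, so treat that as an assumption. The helper merge_params is hypothetical and not part of the adapter.

```python
from typing import Any, Dict, Optional


def merge_params(
    defaults: Optional[Dict[str, Any]],
    per_call: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: per-call values override defaults (assumed precedence)."""
    merged = dict(defaults or {})
    merged.update(per_call or {})
    return merged


defaults = {"max_new_tokens": 128, "temperature": 0.7}
per_call = {"temperature": 0.0}
print(merge_params(defaults, per_call))
```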

seed

Seed for deterministic generation. Sets the random seed before each generation call using transformers.set_seed().

TYPE: Optional[int] DEFAULT: None
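To illustrate why reseeding before each call makes sampling repeatable, here is a small sketch that uses Python's random module as a stand-in for the model's sampler. The function sample_with_seed is hypothetical; the adapter itself is documented to call transformers.set_seed() instead.

```python
import random
from typing import Optional


def sample_with_seed(seed: Optional[int]) -> float:
    """Stand-in for one generation call that reseeds before sampling."""
    if seed is not None:
        # The adapter is documented to call transformers.set_seed() at this point;
        # random.seed() plays that role in this self-contained sketch.
        random.seed(seed)
    return random.random()


first = sample_with_seed(1234)
second = sample_with_seed(1234)
print(first == second)  # identical because the generator was reset before each call
```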

cost_calculator

Optional cost calculator for computing cost from token counts when the provider doesn't report cost directly.

TYPE: Optional[CostCalculator] DEFAULT: None

chat

chat(
    messages: Union[List[Dict[str, Any]], MessageHistory],
    generation_params: Optional[Dict[str, Any]] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    tool_choice: Optional[
        Union[str, Dict[str, Any]]
    ] = None,
    **kwargs: Any,
) -> ChatResponse

Send messages to the model and get a response.

This is the primary method for interacting with the model. Pass a conversation history and receive the model's response.

PARAMETER DESCRIPTION
messages

The conversation history. Either a list of message dicts in OpenAI format, or a MessageHistory object. Each message has 'role' ('system', 'user', 'assistant', 'tool') and 'content' keys.

TYPE: Union[List[Dict[str, Any]], MessageHistory]

generation_params

Model parameters like temperature, max_tokens, top_p, etc. Provider-specific parameters are also accepted.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

tools

Tool definitions the model can use. Each tool is a dict with 'type' (usually 'function') and 'function' containing 'name', 'description', and 'parameters' (JSON Schema).

TYPE: Optional[List[Dict[str, Any]]] DEFAULT: None

tool_choice

How the model should use tools:

  • "auto": Model decides whether to use tools (default)
  • "none": Model won't use tools
  • "required": Model must use a tool
  • {"type": "function", "function": {"name": "..."}}: Use specific tool

TYPE: Optional[Union[str, Dict[str, Any]]] DEFAULT: None
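The dict form of tool_choice can be built from a tool name, as sketched below. The helper force_tool is hypothetical, and the commented-out chat() call assumes a model adapter instance whose chat template supports tools.

```python
from typing import Any, Dict


def force_tool(name: str) -> Dict[str, Any]:
    """Hypothetical helper: build the dict form of tool_choice for one tool."""
    return {"type": "function", "function": {"name": name}}


calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate math expressions",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}

tool_choice = force_tool(calculator_tool["function"]["name"])
print(tool_choice)

# Assumed usage with an existing adapter instance named `model`:
# response = model.chat(
#     messages=[{"role": "user", "content": "What's 2+2?"}],
#     tools=[calculator_tool],
#     tool_choice=tool_choice,
# )
```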

**kwargs

Additional provider-specific arguments.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
ChatResponse

ChatResponse containing the model's response (text and/or tool calls).

RAISES DESCRIPTION
Exception

Provider-specific errors are logged and re-raised.

Example
# Simple conversation
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response.content)

# With system prompt
response = model.chat([
    {"role": "system", "content": "You are a pirate."},
    {"role": "user", "content": "Hello!"}
])

# With tools
response = model.chat(
    messages=[{"role": "user", "content": "What's 2+2?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate math expressions",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"]
            }
        }
    }]
)

gather_config

gather_config() -> Dict[str, Any]

Gather configuration from this HuggingFace model adapter.

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing model configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this model adapter.

Called automatically by Benchmark to collect execution data for evaluation. Returns comprehensive statistics about all calls made to this adapter.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • model_id - Model identifier
  • total_calls - Number of chat/generate calls
  • successful_calls - Number of successful calls
  • failed_calls - Number of failed calls
  • total_duration_seconds - Total time spent in calls
  • average_duration_seconds - Average time per call
  • logs - List of individual call records

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing model execution traces.
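The aggregate fields listed above can be derived from the individual call records. The following is a rough sketch of that arithmetic, not the adapter's implementation: the record keys success and duration_seconds, the helper summarise_calls, and the "ExampleAdapter" type name are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def summarise_calls(model_id: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: compute trace-style aggregates from call records."""
    total = len(logs)
    successful = sum(1 for record in logs if record["success"])
    total_duration = sum(record["duration_seconds"] for record in logs)
    return {
        "type": "ExampleAdapter",  # placeholder for the component class name
        "gathered_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "total_calls": total,
        "successful_calls": successful,
        "failed_calls": total - successful,
        "total_duration_seconds": total_duration,
        # Guard against division by zero when no calls were recorded.
        "average_duration_seconds": total_duration / total if total else 0.0,
        "logs": logs,
    }


example_logs = [
    {"success": True, "duration_seconds": 0.5},
    {"success": False, "duration_seconds": 1.5},
]
traces = summarise_calls("example-model", example_logs)
print(traces["total_calls"], traces["average_duration_seconds"])
```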

gather_usage

gather_usage() -> Usage

Gather accumulated token usage from all chat calls.

RETURNS DESCRIPTION
Usage

Summed TokenUsage across all calls, or empty TokenUsage if no calls were made.

generate

generate(
    prompt: str,
    generation_params: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> str

Generate text from a simple prompt.

This is a convenience method that wraps the prompt in a user message and calls chat(). Use this for simple text-in/text-out scenarios.

For conversations or tool use, use chat() directly.

PARAMETER DESCRIPTION
prompt

The input prompt.

TYPE: str

generation_params

Generation parameters (temperature, max_tokens, etc.).

TYPE: Optional[Dict[str, Any]] DEFAULT: None

**kwargs

Additional provider-specific arguments.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
str

The model's text response.

Example
response = model.generate("What is the capital of France?")
print(response)  # "Paris"
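The relationship between generate() and chat() described above can be sketched with a stand-in class. SketchAdapter and SketchResponse are hypothetical and only mimic the delegation; the real adapter returns a ChatResponse from chat() and runs an actual model.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SketchResponse:
    """Hypothetical stand-in for ChatResponse, holding only the text."""
    content: str


class SketchAdapter:
    """Hypothetical adapter whose 'model' simply reverses the last message."""

    def chat(
        self,
        messages: List[Dict[str, Any]],
        generation_params: Optional[Dict[str, Any]] = None,
    ) -> SketchResponse:
        last_text = messages[-1]["content"]
        return SketchResponse(content=last_text[::-1])

    def generate(
        self,
        prompt: str,
        generation_params: Optional[Dict[str, Any]] = None,
    ) -> str:
        # Wrap the prompt in a single user message and delegate to chat().
        response = self.chat([{"role": "user", "content": prompt}], generation_params)
        return response.content


print(SketchAdapter().generate("abc"))
```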

Model Scorer (Log-Likelihood)

HuggingFaceModelScorer

Bases: ModelScorer

Log-likelihood scorer backed by a HuggingFace causal language model.

Loads the model lazily on first use. Supports:

  • Single-token optimisation: when all continuations map to a single token, one forward pass scores every choice.
  • Multi-token fallback: separate forward pass per continuation.
  • loglikelihood_choices() override that picks the optimal path automatically.
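Lazy loading generally means deferring the expensive load until the first call that needs it. The sketch below shows the general pattern only; LazyScorer, _ensure_loaded, and load_count are hypothetical names, and the string assigned to _model stands in for a real model object.

```python
from typing import Optional


class LazyScorer:
    """Hypothetical scorer that loads its model on first use."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self._model: Optional[str] = None  # nothing is loaded at construction time
        self.load_count = 0

    def _ensure_loaded(self) -> str:
        if self._model is None:
            self.load_count += 1
            self._model = f"loaded:{self.model_id}"  # stand-in for the real load
        return self._model

    def score(self, text: str) -> float:
        self._ensure_loaded()
        return -float(len(text))  # placeholder score, not a real log-likelihood


scorer = LazyScorer("example-model")
print(scorer.load_count)  # still zero: construction does not load anything
scorer.score("a")
scorer.score("b")
print(scorer.load_count)  # one: the load happened once, on the first call
```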

The tokenisation strategy matches lm-evaluation-harness: context and continuation are encoded separately, then concatenated to handle tokenisation-boundary effects correctly.
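A toy example may help show why the boundary matters. With the hypothetical greedy tokenizer and vocabulary below, encoding the joined string merges characters across the context/continuation boundary, whereas encoding the two parts separately keeps the continuation tokens cleanly identifiable. This is an illustration only and does not use a real HuggingFace tokenizer.

```python
from typing import List, Tuple

VOCAB = ["ab", "bc", "a", "b", "c"]  # hypothetical toy vocabulary


def encode(text: str) -> List[str]:
    """Greedy longest-match tokenizer over VOCAB (assumes all characters are covered)."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


def encode_pair(context: str, continuation: str) -> Tuple[List[str], List[str]]:
    """Encode the two parts separately; the caller concatenates the results."""
    return encode(context), encode(continuation)


context, continuation = "a", "bc"
joint = encode(context + continuation)
ctx_tokens, cont_tokens = encode_pair(context, continuation)

print(joint)                     # the token "ab" straddles the boundary
print(ctx_tokens + cont_tokens)  # the continuation is exactly the trailing tokens
```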

seed property

seed: Optional[int]

Seed for deterministic scoring, or None if unseeded.

__init__

__init__(
    model_id: str,
    device: str = "cuda:0",
    trust_remote_code: bool = True,
    seed: Optional[int] = None,
)

Initialize HuggingFace model scorer.

PARAMETER DESCRIPTION
model_id

HuggingFace model identifier (e.g. "meta-llama/Llama-2-7b-hf").

TYPE: str

device

Torch device string (e.g. "cuda:0", "cpu").

TYPE: str DEFAULT: 'cuda:0'

trust_remote_code

Trust remote code when loading the model.

TYPE: bool DEFAULT: True

seed

Seed for deterministic scoring.

TYPE: Optional[int] DEFAULT: None

gather_config

gather_config() -> Dict[str, Any]

Gather configuration including device and model settings.

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing scorer configuration.

gather_traces

gather_traces() -> Dict[str, Any]

Gather execution traces from this scorer.

Output fields:

  • type - Component class name
  • gathered_at - ISO timestamp
  • model_id - Model identifier
  • total_calls - Number of scoring calls
  • successful_calls - Number of successful calls
  • failed_calls - Number of failed calls
  • total_duration_seconds - Total time spent in calls
  • logs - List of individual call records

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary containing scorer execution traces.

loglikelihood

loglikelihood(context: str, continuation: str) -> float

Compute the log-likelihood of continuation given context.

PARAMETER DESCRIPTION
context

The conditioning text (prompt).

TYPE: str

continuation

The text whose likelihood is scored.

TYPE: str

RETURNS DESCRIPTION
float

Log-likelihood (a non-positive float; values closer to zero mean the continuation is more likely).
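For intuition, the log-likelihood of a multi-token continuation is the sum of the log-probabilities of its tokens, each conditioned on what came before. The sketch below uses made-up per-token probabilities rather than a real model; sequence_loglikelihood is a hypothetical helper.

```python
import math
from typing import List


def sequence_loglikelihood(token_probs: List[float]) -> float:
    """Hypothetical helper: sum of log-probabilities of the continuation tokens."""
    return sum(math.log(p) for p in token_probs)


# Made-up conditional probabilities for a two-token continuation.
token_probs = [0.5, 0.25]
ll = sequence_loglikelihood(token_probs)
print(ll)  # equals log(0.5 * 0.25), a negative number
```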

loglikelihood_batch

loglikelihood_batch(
    pairs: List[Tuple[str, str]],
) -> List[float]

Compute log-likelihoods for a batch of (context, continuation) pairs.

Override _loglikelihood_batch_impl for provider-specific batching optimisations. The default loops over _loglikelihood_impl.

PARAMETER DESCRIPTION
pairs

List of (context, continuation) tuples.

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
List[float]

List of log-likelihoods, one per pair.
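The override hook described above follows a common template-method pattern, sketched here with stand-in classes. SketchScorer and BatchedScorer are hypothetical, and the scores are placeholder values derived from string length rather than model output.

```python
from typing import List, Tuple


class SketchScorer:
    """Hypothetical base scorer: the batch path loops over the single-pair path."""

    def _loglikelihood_impl(self, context: str, continuation: str) -> float:
        return -float(len(continuation))  # placeholder score

    def _loglikelihood_batch_impl(self, pairs: List[Tuple[str, str]]) -> List[float]:
        return [self._loglikelihood_impl(ctx, cont) for ctx, cont in pairs]

    def loglikelihood_batch(self, pairs: List[Tuple[str, str]]) -> List[float]:
        return self._loglikelihood_batch_impl(pairs)


class BatchedScorer(SketchScorer):
    """Hypothetical subclass that scores all pairs in one step."""

    def _loglikelihood_batch_impl(self, pairs: List[Tuple[str, str]]) -> List[float]:
        lengths = [len(cont) for _, cont in pairs]  # stand-in for one batched pass
        return [-float(n) for n in lengths]


pairs = [("Q: 2+2=", " 4"), ("Q: capital of France?", " Paris")]
print(SketchScorer().loglikelihood_batch(pairs))
print(BatchedScorer().loglikelihood_batch(pairs))
```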

loglikelihood_choices

loglikelihood_choices(
    context: str, choices: List[str], delimiter: str = " "
) -> List[float]

Score multiple-choice continuations with shared-context optimisation.

When every delimiter + choice maps to a single continuation token, all choices are scored in one forward pass. Otherwise, the method falls back to per-choice scoring via _loglikelihood_impl.

PARAMETER DESCRIPTION
context

The question/prompt text.

TYPE: str

choices

Answer choice strings (e.g. ["A", "B", "C", "D"]).

TYPE: List[str]

delimiter

String prepended to each choice (default " ").

TYPE: str DEFAULT: ' '

RETURNS DESCRIPTION
List[float]

List of log-likelihoods, one per choice.
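The single-token path can be pictured as reading several entries out of one next-token distribution. The sketch below uses made-up logits keyed by token string instead of a real forward pass; log_softmax here is a local helper, not a library call, and the logit values are invented for illustration.

```python
import math
from typing import Dict, List


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a small token-to-logit mapping."""
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {token: value - log_z for token, value in logits.items()}


# Made-up next-token logits after the context; one "forward pass" worth of output.
next_token_logits = {" A": 2.0, " B": 1.0, " C": 0.0, " D": -1.0, " the": 0.5}

choices: List[str] = ["A", "B", "C", "D"]
delimiter = " "

log_probs = log_softmax(next_token_logits)
# Each delimiter + choice is a single token here, so one distribution scores them all.
scores = [log_probs[delimiter + choice] for choice in choices]
best = choices[scores.index(max(scores))]
print(best)
```

If any delimiter + choice spanned more than one token, this lookup would no longer apply and each choice would need its own scoring pass, which is the fallback described above.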