Seeding for Reproducibility

Overview

MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade from a global seed through all components, enabling deterministic behavior when model providers support seeding.

This guide covers:

  • Basic usage: Enabling seeding in benchmarks
  • Selective variance: Controlling which components vary per repetition
  • Custom generators: Implementing alternative seeding strategies
  • Provider support: Which model providers support seeding

When to use seeding

Seeding is most useful when:

  • Comparing agent architectures under controlled conditions
  • Running ablation studies where you need reproducibility
  • Debugging issues that only appear intermittently
  • Creating reproducible baselines for publication

Note that not all model providers support seeding, and even those that do may only offer "best-effort" determinism.

Basic Usage

Enabling Seeding

Pass a seed parameter when creating your benchmark:

from maseval import Benchmark

# MyBenchmark is your own Benchmark subclass; tasks and config are defined elsewhere.
# Simple: pass a seed integer
benchmark = MyBenchmark(seed=42)
results = benchmark.run(tasks, agent_data=config)

This creates a DefaultSeedGenerator internally and passes it to all setup methods.

Disabling Seeding

Seeding is optional. You do not need if/else checks to support both seeded and unseeded runs in the same workflow: the seed generator is always present, and it returns None when seeding is disabled.

To disable seeding, omit the seed parameter when creating your Benchmark or DefaultSeedGenerator (or pass seed=None). In that case:

  1. A DefaultSeedGenerator(global_seed=None) is still created internally
  2. Setup methods still receive a seed_generator parameter
  3. derive_seed() returns None instead of an integer

class MyBenchmark(Benchmark):
    ...
    def setup_agents(self, agent_data, environment, task, user, seed_generator):
        # Always works - seed_generator is never None
        agent = MyAgent(seed=seed_generator.derive_seed("agents/orchestrator"))
        ...

# No seed = seeding disabled
benchmark = MyBenchmark(seed=None)
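
Because a derived seed may be None, components are easiest to write when they accept an optional seed. The following is a minimal standard-library sketch of that pattern; NoisyTool is a hypothetical component used for illustration and is not part of MASEval.

```python
import random
from typing import Optional


class NoisyTool:
    """Hypothetical component that accepts an optional seed."""

    def __init__(self, seed: Optional[int] = None):
        # random.Random(None) seeds itself from system entropy, so a None
        # seed simply means "not reproducible"; no branching is needed.
        self._rng = random.Random(seed)

    def sample(self) -> float:
        return self._rng.random()


# Seeded: two instances built with the same seed produce the same value.
a, b = NoisyTool(seed=42), NoisyTool(seed=42)
seeded_match = a.sample() == b.sample()

# Unseeded: construction and sampling still work.
unseeded_value = NoisyTool(seed=None).sample()
```

The same constructor handles both cases, which is what lets the setup method above stay free of conditionals.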

Using Seeds in Setup Methods

All setup methods receive a seed_generator parameter. Use it to derive seeds for your components. When seeding is disabled (no seed passed to benchmark), derive_seed() returns None:

from maseval import Benchmark, SeedGenerator

class MyBenchmark(Benchmark):
    def setup_agents(
        self,
        agent_data,
        environment,
        task,
        user,
        seed_generator: SeedGenerator,
    ):
        # Derive a seed for your agent using hierarchical paths
        # Returns None if seeding is disabled (global_seed=None)
        # Use child() to create logical namespaces - results in "agents/orchestrator"
        agent_gen = seed_generator.child("agents")
        agent_seed = agent_gen.derive_seed("orchestrator")

        # Pass seed directly to your agent
        agent = MyAgent(seed=agent_seed)
        # ... rest of setup

Seeds are derived from hierarchical paths: derive_seed("orchestrator") on a child("agents") generator uses the path "agents/orchestrator", which yields a different seed than "agents/baseline".

Selective Variance with per_repetition

When running multiple repetitions of the same task, you may want some components to vary while others remain constant. The per_repetition flag controls this:

def setup_agents(self, agent_data, environment, task, user, seed_generator):
    # Use child() to group agent seeds under "agents/" namespace
    agent_gen = seed_generator.child("agents")

    # Varies per repetition - different seed for rep 0, 1, 2, ...
    # Results in path: "agents/experimental"
    experimental_seed = agent_gen.derive_seed("experimental", per_repetition=True)

    # Constant across repetitions - same seed for rep 0, 1, 2, ...
    # Results in path: "agents/baseline"
    baseline_seed = agent_gen.derive_seed("baseline", per_repetition=False)

Use cases:

| per_repetition | Behavior | Use case |
|---|---|---|
| True (default) | Seed varies per repetition | Experimental agents, ablation studies |
| False | Seed constant across repetitions | Baseline agents, control conditions |
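
One way to picture the flag is that the repetition index is only mixed into the hash when per_repetition is true. The sketch below is a toy illustration under that assumption; toy_derive and the way it joins components are hypothetical and are not MASEval's actual implementation.

```python
import hashlib
from typing import Optional


def toy_derive(global_seed: Optional[int], task_id: str, rep_index: int,
               path: str, per_repetition: bool = True) -> Optional[int]:
    """Toy derivation for illustration; not MASEval's real algorithm."""
    if global_seed is None:
        return None  # seeding disabled
    components = [global_seed, task_id]
    if per_repetition:
        # The repetition index enters the hash only in this branch.
        components.append(rep_index)
    components.append(path)
    text = ":".join(str(c) for c in components)
    digest = hashlib.sha256(text.encode()).digest()
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


experimental = [toy_derive(42, "task_1", rep, "agents/experimental")
                for rep in range(3)]
baseline = [toy_derive(42, "task_1", rep, "agents/baseline", per_repetition=False)
            for rep in range(3)]
```

Under this assumption, baseline holds the same value three times, while experimental changes with the repetition index.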

Hierarchical Namespacing

For complex systems with many components, use child() to create hierarchical namespaces:

def setup_environment(self, agent_data, task, seed_generator):
    # Create a child generator for environment components
    env_gen = seed_generator.child("environment")

    # Further nest tools under "environment/tools/"
    tools_gen = env_gen.child("tools")
    weather_seed = tools_gen.derive_seed("weather")  # "environment/tools/weather"
    search_seed = tools_gen.derive_seed("search")    # "environment/tools/search"

def setup_agents(self, agent_data, environment, task, user, seed_generator):
    # Create a child generator for agents
    agent_gen = seed_generator.child("agents")

    orchestrator_seed = agent_gen.derive_seed("orchestrator")  # "agents/orchestrator"

    # Nest workers under "agents/workers/"
    worker_gen = agent_gen.child("workers")
    analyst_seed = worker_gen.derive_seed("analyst")           # "agents/workers/analyst"

Child generators share the same seed log, so all derived seeds are recorded together.

Flat paths work too

You can use flat paths directly without child():

seed_generator.derive_seed("environment/tools/weather")
seed_generator.derive_seed("agents/orchestrator")

Both approaches produce identical seeds. Use child() when it makes your code cleaner.
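
The equivalence follows naturally if a child generator does nothing more than prepend its name to the path before hashing. The toy class below sketches that idea; ToyPathGenerator is hypothetical, and its hashing scheme is an assumption rather than the one DefaultSeedGenerator uses.

```python
import hashlib
from typing import Dict, Optional


class ToyPathGenerator:
    """Toy path-based generator; not MASEval's DefaultSeedGenerator."""

    def __init__(self, global_seed: int, prefix: str = "",
                 log: Optional[Dict[str, int]] = None):
        self._global_seed = global_seed
        self._prefix = prefix
        self._log = log if log is not None else {}

    def child(self, name: str) -> "ToyPathGenerator":
        # A child only extends the prefix and shares the parent's log.
        return ToyPathGenerator(self._global_seed, f"{self._prefix}{name}/", self._log)

    def derive_seed(self, name: str) -> int:
        full_path = f"{self._prefix}{name}"
        digest = hashlib.sha256(f"{self._global_seed}:{full_path}".encode()).digest()
        seed = int.from_bytes(digest[:4], "big") & 0x7FFFFFFF
        self._log[full_path] = seed
        return seed


root = ToyPathGenerator(global_seed=0)
nested = root.child("environment").child("tools").derive_seed("weather")
flat = root.derive_seed("environment/tools/weather")
```

Here nested and flat are equal because both hash the same full path, and the shared log ends up with a single entry for it.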

How Seed Derivation Works

This section demonstrates the core mechanics of seed derivation with concrete examples.

Basic Example

from maseval import DefaultSeedGenerator

# Create a generator with a global seed
gen = DefaultSeedGenerator(global_seed=0)

# Scope to a task and repetition (required before deriving seeds)
task_gen = gen.for_task("task_1").for_repetition(0)

# Derive seeds for components
agent_seed = task_gen.derive_seed("agent")
print(agent_seed)  # 778051139

# Different paths produce different seeds
env_seed = task_gen.derive_seed("environment")
print(env_seed)  # 1348051591

# Child generators extend the path
tools_gen = task_gen.child("tools")
weather_seed = tools_gen.derive_seed("weather")  # Path: "tools/weather"
print(weather_seed)  # 1528663065

Determinism: Same Path = Same Seed

The key property of the seed generator is determinism: the same path always produces the same derived seed, even when called multiple times.

gen = DefaultSeedGenerator(global_seed=0).for_task("task_1").for_repetition(0)

# Call the same path twice on the same generator
seed1 = gen.derive_seed("agent")
seed2 = gen.derive_seed("agent")

print(seed1)  # 778051139
print(seed2)  # 778051139
assert seed1 == seed2  # Always true

This also works across separate generator instances with the same configuration:

# Two separate generators with identical configuration
gen1 = DefaultSeedGenerator(global_seed=0).for_task("task_1").for_repetition(0)
gen2 = DefaultSeedGenerator(global_seed=0).for_task("task_1").for_repetition(0)

seed1 = gen1.derive_seed("agent")
seed2 = gen2.derive_seed("agent")

assert seed1 == seed2  # Always true

This is what enables reproducibility: if you record the global seed used in an experiment, you can recreate the exact same derived seeds later.
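
A derived seed only helps if the component that receives it uses it consistently. The sketch below shows one common pattern, assuming a component that draws its randomness from a local random.Random instance; simulate_tool_latency is a hypothetical function, and the seed value is the one shown above for the path "agent".

```python
import random

derived_seed = 778051139  # value shown above for path "agent"


def simulate_tool_latency(seed, n=3):
    """Hypothetical seeded component: draws n pseudo-random latencies."""
    # A local RNG keeps this component independent of the global random state.
    rng = random.Random(seed)
    return [round(rng.uniform(0.1, 0.5), 3) for _ in range(n)]


first = simulate_tool_latency(derived_seed)
random.random()  # unrelated use of the global RNG in between
second = simulate_tool_latency(derived_seed)
```

Both calls return the same list, regardless of what other code does with the global random module in between.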

Different Global Seeds = Different Results

Changing the global seed changes all derived seeds:

# With seed=0
gen_0 = DefaultSeedGenerator(global_seed=0).for_task("task_1").for_repetition(0)
print(gen_0.derive_seed("agent"))        # 778051139
print(gen_0.derive_seed("environment"))  # 1348051591

# With seed=1 - same paths, different seeds
gen_1 = DefaultSeedGenerator(global_seed=1).for_task("task_1").for_repetition(0)
print(gen_1.derive_seed("agent"))        # 1297896250
print(gen_1.derive_seed("environment"))  # 886012105

This allows you to run multiple independent experiments by simply changing the global seed.

Seed Log

The generator tracks all derived seeds, which is useful for debugging and reproducibility:

gen = DefaultSeedGenerator(global_seed=42).for_task("task_1").for_repetition(0)

# Derive several seeds
gen.derive_seed("agent")
tools_gen = gen.child("tools")
tools_gen.derive_seed("weather")
tools_gen.derive_seed("search")

# Inspect what was derived
print(gen.seed_log)
# {'agent': 1608637542, 'tools/weather': 353148029, 'tools/search': 906566780}

The seed log is included in benchmark reports automatically, so you always have a record of which seeds were used.
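
A recorded seed log can also be compared against a later run to spot paths that were added, removed, or changed. The helper below is a small sketch of such a comparison; diff_seed_logs is a hypothetical function, and the "tools/calendar" entry is made up for illustration.

```python
import json
from typing import Dict, Optional, Tuple


def diff_seed_logs(old: Dict[str, int],
                   new: Dict[str, int]) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return paths whose seeds differ, mapped to (old_seed, new_seed)."""
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    }


# Seed log from the example above, round-tripped through JSON as if stored on disk.
recorded = json.loads(json.dumps(
    {"agent": 1608637542, "tools/weather": 353148029, "tools/search": 906566780}
))

# A later run that derived one extra seed (value made up for illustration).
rerun = dict(recorded)
rerun["tools/calendar"] = 123

changes = diff_seed_logs(recorded, rerun)
print(changes)  # {'tools/calendar': (None, 123)}
```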

Model Provider Support

Not all providers support seeding. Here's the current status:

| Provider | Support | Notes |
|---|---|---|
| OpenAI | Best-effort | Seed parameter accepted, but determinism not guaranteed |
| Google GenAI | Supported | Seed parameter passed to generation config |
| LiteLLM | Pass-through | Passes seed to underlying provider |
| HuggingFace | Supported | Uses transformers.set_seed() |
| Anthropic | Not supported | Raises SeedingError if seed provided |

If you pass a seed to an adapter that doesn't support seeding, it raises SeedingError at creation time:

from maseval import SeedingError

# AnthropicModelAdapter and client are assumed to be imported and created elsewhere.

try:
    adapter = AnthropicModelAdapter(client, model_id="claude-3", seed=42)
except SeedingError as e:
    print(f"Provider doesn't support seeding: {e}")
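
If a benchmark mixes providers, one option is to fall back to an unseeded adapter and record that the seed was dropped. The sketch below illustrates that pattern only. build_with_seed_fallback and unseedable_adapter are hypothetical, and the local SeedingError class is a stand-in so the example runs without MASEval installed; real code would import SeedingError from maseval.

```python
from typing import Any, Callable, Optional, Tuple


class SeedingError(Exception):
    """Local stand-in so this sketch runs without MASEval installed."""


def build_with_seed_fallback(factory: Callable[..., Any],
                             seed: Optional[int],
                             **kwargs: Any) -> Tuple[Any, bool]:
    """Build an adapter with a seed; retry unseeded on SeedingError.

    Returns (adapter, seeded), where seeded says whether the seed was applied.
    """
    if seed is None:
        return factory(**kwargs), False
    try:
        return factory(seed=seed, **kwargs), True
    except SeedingError:
        return factory(**kwargs), False


def unseedable_adapter(model_id: str, seed: Optional[int] = None) -> dict:
    """Hypothetical factory mimicking a provider without seed support."""
    if seed is not None:
        raise SeedingError("seed not supported")
    return {"model_id": model_id, "seed": None}


adapter, seeded = build_with_seed_fallback(unseedable_adapter, seed=42,
                                           model_id="demo-model")
```

Silently dropping a seed weakens reproducibility, so the seeded flag should be stored alongside the results rather than discarded.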

Seed Logging

All derived seeds are automatically logged and included in results:

results = benchmark.run(tasks, agent_data=config)

for report in results:
    seed_config = report["config"]["seeding"]["seed_generator"]
    print(f"Global seed: {seed_config['global_seed']}")
    print(f"Task: {seed_config['task_id']}")
    print(f"Repetition: {seed_config['rep_index']}")
    print(f"Seeds used: {seed_config['seeds']}")
    # Output: {"agents/orchestrator": 12345, "agents/workers/analyst": 67890, ...}

This enables exact reproduction of benchmark runs and debugging of seed-related issues.
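
For analysis it can be handy to flatten the per-report seed information into rows. The sketch below assumes reports have the nested keys shown above; collect_seed_rows is a hypothetical helper, and sample_reports is hand-written data that mirrors that structure rather than real benchmark output.

```python
from typing import Any, Dict, List


def collect_seed_rows(reports: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten each report's seed log into one row per derived seed."""
    rows = []
    for report in reports:
        cfg = report["config"]["seeding"]["seed_generator"]
        for path, seed in cfg["seeds"].items():
            rows.append({
                "global_seed": cfg["global_seed"],
                "task_id": cfg["task_id"],
                "rep_index": cfg["rep_index"],
                "path": path,
                "seed": seed,
            })
    return rows


# Hand-written sample mirroring the structure shown above.
sample_reports = [{"config": {"seeding": {"seed_generator": {
    "global_seed": 42,
    "task_id": "task_1",
    "rep_index": 0,
    "seeds": {"agents/orchestrator": 12345, "agents/workers/analyst": 67890},
}}}}]

rows = collect_seed_rows(sample_reports)
```

The resulting rows can be written to CSV or loaded into a data frame to compare seeds across tasks and repetitions.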

Custom Seed Generators

Using a Different Hash Algorithm

Subclass DefaultSeedGenerator and override _compute_seed():

import hashlib
from maseval import DefaultSeedGenerator

class MD5SeedGenerator(DefaultSeedGenerator):
    """Uses MD5 instead of SHA-256."""

    def _compute_seed(self, full_path: str, components: list) -> int:
        seed_string = ":".join(str(c) for c in components)
        hash_bytes = hashlib.md5(seed_string.encode()).digest()
        return int.from_bytes(hash_bytes[:4], "big") & 0x7FFFFFFF

# Use it
benchmark = MyBenchmark(seed_generator=MD5SeedGenerator(global_seed=42))
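
The last line of _compute_seed() truncates the digest to four bytes and clears the top bit. The standalone sketch below shows the effect of that step: every result fits in 31 bits, which is one plausible reason for the mask, since many seed parameters expect a non-negative 32-bit signed integer. digest_to_seed and the input strings are hypothetical.

```python
import hashlib


def digest_to_seed(data: bytes) -> int:
    """Map arbitrary bytes to an integer in [0, 2**31 - 1]."""
    digest = hashlib.md5(data).digest()
    # The first 4 bytes give an unsigned 32-bit value; the mask clears the top bit.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# Hypothetical seed strings in a "component:component" style.
seeds = [digest_to_seed(f"42:task_{i}:0:agent".encode()) for i in range(1000)]
in_range = all(0 <= s <= 2**31 - 1 for s in seeds)
```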

Implementing a Custom Generator

For completely custom seeding strategies (e.g., database-backed seeds), implement the SeedGenerator ABC:

from maseval import SeedGenerator
from typing import Dict
from typing_extensions import Self

class DatabaseSeedGenerator(SeedGenerator):
    """Looks up seeds from a database for exact reproducibility."""

    def __init__(self, db_connection, run_id: str):
        super().__init__()
        self._db = db_connection
        self._run_id = run_id
        self._task_id = None
        self._rep_index = None
        self._log = {}

    @property
    def global_seed(self) -> int:
        return self._db.get_run_seed(self._run_id)

    def derive_seed(self, name: str, per_repetition: bool = True) -> int:
        key = (self._run_id, self._task_id, self._rep_index if per_repetition else None, name)
        seed = self._db.get_or_create_seed(key)
        self._log[name] = seed
        return seed

    def for_task(self, task_id: str) -> Self:
        new_gen = DatabaseSeedGenerator(self._db, self._run_id)
        new_gen._task_id = task_id
        return new_gen

    def for_repetition(self, rep_index: int) -> Self:
        new_gen = DatabaseSeedGenerator(self._db, self._run_id)
        new_gen._task_id = self._task_id
        new_gen._rep_index = rep_index
        new_gen._log = self._log  # Share log within task
        return new_gen

    @property
    def seed_log(self) -> Dict[str, int]:
        return dict(self._log)
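
To try out a generator like this without a real database, an in-memory object with the same two methods is enough. The sketch below provides get_run_seed() and get_or_create_seed(), the only calls the class above makes on db_connection. InMemorySeedStore is a hypothetical stand-in, and drawing new seeds with secrets is an arbitrary choice for illustration.

```python
import secrets
from typing import Dict, Hashable


class InMemorySeedStore:
    """Hypothetical stand-in for db_connection, backed by dictionaries."""

    def __init__(self, run_seeds: Dict[str, int]):
        self._run_seeds = dict(run_seeds)
        self._seeds: Dict[Hashable, int] = {}

    def get_run_seed(self, run_id: str) -> int:
        return self._run_seeds[run_id]

    def get_or_create_seed(self, key: tuple) -> int:
        # key is (run_id, task_id, rep_index or None, name).
        # A seed is drawn once and then returned unchanged on every later
        # lookup, so reproducibility comes from storage rather than hashing.
        if key not in self._seeds:
            self._seeds[key] = secrets.randbelow(2**31)
        return self._seeds[key]


store = InMemorySeedStore({"run-001": 42})
key = ("run-001", "task_1", 0, "agents/orchestrator")
first = store.get_or_create_seed(key)
second = store.get_or_create_seed(key)
```

Passing such a store as db_connection would let the generator's task and repetition scoping be exercised in unit tests before wiring up persistent storage.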

Tips

For reproducibility: Always log and report the global seed used. Include it in publications and experiment tracking.

For debugging: Use seed_generator.seed_log to inspect which seeds were derived during a run.

For baselines: Use per_repetition=False for baseline agents that should remain constant while you vary experimental agents.

For provider compatibility: Check provider support before relying on seeding. OpenAI's seeding is "best-effort" and may not be perfectly deterministic.