halftrace
pre-alpha · first study in progress


A halftrace is the trajectory length, in tool calls, at which an agent capability has degraded by 50%.

halftrace is a Python library for measuring this across the four failure modes that bite agents on long-horizon tasks.

fig. 1 · capability(t) for the four probes. Each curve crosses 50% at its halftrace. Schematic; real values will be empirical.

Four failure modes

Each probe targets a distinct kind of decay. halftrace measures them separately so you can tell which one is dominating a given run.

  • 01 · state_amnesia

    The agent loses track of facts it established earlier in the trajectory — variable bindings, prior tool outputs, the user's original constraints.

    recall ~ exp(−t / τ) · steepest among the four

    halftrace ≈ 28 tool calls
  • 02 · instruction_decay

    The agent stops adhering to system-prompt instructions as the context grows: format rules slip, forbidden actions return, hedges accumulate.

    adherence ~ exp(−t / τ) · gentler, slower failure

    halftrace ≈ 59 tool calls
  • 03 · tool_repetition

    The agent begins re-calling tools with identical or near-identical arguments instead of progressing — the classic "stuck in a loop" failure.

    novelty(args) ~ 1 − ½ (t / τ)^1.4 · sudden collapse

    halftrace ≈ 45 tool calls
  • 04 · premature_termination

    The agent's tendency to claim completion before the task is actually done — flat early, then a sharp rise once the trajectory gets long.

    P(stop | incomplete) ~ σ((t − μ) / s) · sigmoid cliff

    halftrace ≈ 70 tool calls
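The four schematic curves above all have closed-form halftraces. A minimal sketch of solving each for its 50% crossing — the parameter values are illustrative assumptions chosen to reproduce the figure, not measured constants:

```python
import math

# Closed-form halftrace for each schematic curve in fig. 1.
# All parameter values below are illustrative, not empirical.

def halftrace_exponential(tau):
    # capability ~ exp(-t / tau) crosses 0.5 at t = tau * ln 2
    return tau * math.log(2)

def halftrace_power(tau):
    # novelty ~ 1 - 0.5 * (t / tau)**1.4 crosses 0.5 exactly at t = tau
    return tau

def halftrace_sigmoid(mu):
    # P(stop | incomplete) ~ sigmoid((t - mu) / s) reaches 0.5 at t = mu,
    # independent of the slope parameter s
    return mu

# e.g. a state-amnesia tau of ~40.4 steps yields the ~28-call halftrace above
print(round(halftrace_exponential(40.4)))  # prints 28
```

Note that the sigmoid's halftrace depends only on its midpoint μ, which is why premature_termination reads as flat-then-cliff: the slope s changes how sharp the cliff is, not where it sits.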

Install and run

Bring your own trajectories from whatever runtime you use. halftrace doesn't run agents — it scores trajectories you already have.

$ pip install halftrace
# example.py
from halftrace import Suite, probes

suite = Suite(
    probes=[
        probes.StateAmnesia(),
        probes.InstructionDecay(),
        probes.ToolRepetition(),
        probes.PrematureTermination(),
    ],
)

# trajectory: a list of (step, tool_call, response) tuples
# from whichever agent runtime you use
scores = suite.run(trajectory)

for name, halftrace in scores.items():
    print(f"{name}: {halftrace} tool calls")
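The (step, tool_call, response) tuple shape in the comment is all the example assumes. A toy trajectory might look like this — the dict layout of each tool call is purely illustrative, not a halftrace requirement:

```python
# A toy trajectory in the assumed (step, tool_call, response) shape.
# The tool-call dict layout here is illustrative only.
trajectory = [
    (0, {"tool": "read_file", "args": {"path": "main.py"}}, "def main(): ..."),
    (1, {"tool": "run_tests", "args": {}}, "3 passed, 0 failed"),
    (2, {"tool": "run_tests", "args": {}}, "3 passed, 0 failed"),
]
```

The identical calls at steps 1 and 2 are exactly the pattern the tool_repetition probe is looking for.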

Expected output

A halftrace per probe, in units of tool calls. Lower is worse — the capability is half-degraded sooner.

probe                  halftrace      notes
state_amnesia          52 tool calls  recall of established facts
instruction_decay      38 tool calls  adherence to system prompt
tool_repetition        71 tool calls  argument novelty across calls
premature_termination  44 tool calls  stop-when-incomplete rate

Illustrative. Real numbers will land with the first study in roughly three weeks at ruairidh.dev/blog.

What halftrace is not

A short list of things it would be easy to mistake this for, and which it deliberately is not.

  • not a benchmark

    It will not rank models against each other or produce a leaderboard. It scores trajectories, not model identities.

  • not an eval framework

    There are no tasks, no graders, no rubric. You bring trajectories; halftrace measures decay within them.

  • not an agent framework

    halftrace doesn't run agents or call models. It is read-only on the trajectories produced by your existing runtime.

  • not a substitute for inspection

    A halftrace number tells you which failure mode is dominating, not why. Use it to direct attention, then read the trace.

It is a measurement tool. You supply trajectories from whatever runtime you use — Anthropic, OpenAI, an in-house harness — and halftrace tells you which failure mode is degrading capability fastest, in a unit ("tool calls until half-degraded") that's comparable across runs.
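Because every probe reports in the same unit, finding the fastest-degrading failure mode is a one-line comparison. A sketch using the illustrative numbers from the table above:

```python
# Per-probe halftraces in tool calls (illustrative numbers from the
# expected-output table, not real measurements).
scores = {
    "state_amnesia": 52,
    "instruction_decay": 38,
    "tool_repetition": 71,
    "premature_termination": 44,
}

# The dominant failure mode is the probe with the smallest halftrace:
# its capability is half-degraded soonest.
dominant = min(scores, key=scores.get)
print(dominant)  # prints instruction_decay
```

That number points you at a failure mode; per the section above, the next step is still to read the trace.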