halftrace
pre-alpha · first study in progress


A halftrace is the trajectory length, in tool calls, at which an agent capability has degraded by 50%.

halftrace is a Python library for measuring this across the four failure modes that bite agents on long-horizon tasks.

fig. 1 · capability(t) for the four probes. Each curve crosses 50% at its halftrace. Schematic; real values will be empirical.

Four failure modes

Each probe targets a distinct kind of decay. halftrace measures them separately so you can tell which one is dominating a given run.

  • 01 · state_amnesia

    The agent loses track of facts it established earlier in the trajectory — variable bindings, prior tool outputs, the user's original constraints.

    recall ~ exp(−t / τ) · steepest among the four

    halftrace ≈ 28 tool calls
  • 02 · instruction_decay

    The agent stops adhering to system-prompt instructions as the context grows: format rules slip, forbidden actions return, hedges accumulate.

    adherence ~ exp(−t / τ) · gentler, slower failure

    halftrace ≈ 59 tool calls
  • 03 · tool_repetition

    The agent begins re-calling tools with identical or near-identical arguments instead of progressing — the classic "stuck in a loop" failure.

    novelty(args) ~ 1 − ½ (t / τ)^1.4 · sudden collapse

    halftrace ≈ 45 tool calls
  • 04 · premature_termination

    The agent's tendency to claim completion before the task is actually done — flat early, then a sharp rise once the trajectory gets long.

    P(stop | incomplete) ~ σ((t − μ) / s) · sigmoid cliff

    halftrace ≈ 70 tool calls
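The four schematic curves above all have closed-form halftraces. A minimal sketch of solving each for its 50% crossing — the parameter values are illustrative assumptions chosen to reproduce the figure, not measured constants:

```python
import math

# Closed-form halftrace for each schematic curve in fig. 1.
# All parameter values below are illustrative, not empirical.

def halftrace_exponential(tau):
    # capability ~ exp(-t / tau) crosses 0.5 at t = tau * ln 2
    return tau * math.log(2)

def halftrace_power(tau):
    # novelty ~ 1 - 0.5 * (t / tau)**1.4 crosses 0.5 exactly at t = tau
    return tau

def halftrace_sigmoid(mu):
    # P(stop | incomplete) ~ sigmoid((t - mu) / s) reaches 0.5 at t = mu,
    # independent of the slope parameter s
    return mu

# e.g. a state-amnesia tau of ~40.4 steps yields the ~28-call halftrace above
print(round(halftrace_exponential(40.4)))  # prints 28
```

Note that the sigmoid's halftrace depends only on its midpoint μ, which is why premature_termination reads as flat-then-cliff: the slope s changes how sharp the cliff is, not where it sits.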

Install and run

Bring your own trajectories from whatever runtime you use. halftrace doesn't run agents — it scores trajectories you already have.

$ pip install halftrace
# example.py
from halftrace import Suite, probes

suite = Suite(
    probes=[
        probes.StateAmnesia(),
        probes.InstructionDecay(),
        probes.ToolRepetition(),
        probes.PrematureTermination(),
    ],
)

# trajectory: a list of (step, tool_call, response) tuples
# from whichever agent runtime you use
scores = suite.run(trajectory)

for name, halftrace in scores.items():
    print(f"{name}: {halftrace} tool calls")
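The (step, tool_call, response) tuple shape in the comment is all the example assumes. A toy trajectory might look like this — the dict layout of each tool call is purely illustrative, not a halftrace requirement:

```python
# A toy trajectory in the assumed (step, tool_call, response) shape.
# The tool-call dict layout here is illustrative only.
trajectory = [
    (0, {"tool": "read_file", "args": {"path": "main.py"}}, "def main(): ..."),
    (1, {"tool": "run_tests", "args": {}}, "3 passed, 0 failed"),
    (2, {"tool": "run_tests", "args": {}}, "3 passed, 0 failed"),
]
```

The identical calls at steps 1 and 2 are exactly the pattern the tool_repetition probe is looking for.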

Expected output

A halftrace per probe, in units of tool calls. Lower is worse — the capability is half-degraded sooner.

probe                  halftrace      notes
state_amnesia          52 tool calls  recall of established facts
instruction_decay      38 tool calls  adherence to system prompt
tool_repetition        71 tool calls  argument novelty across calls
premature_termination  44 tool calls  stop-when-incomplete rate

Illustrative. Real numbers will land with the first study in roughly three weeks at ruairidh.dev/blog.

What halftrace is not

A short list of things it would be easy to mistake this for, and which it deliberately is not.

  • not a benchmark

    It will not rank models against each other or produce a leaderboard. It scores trajectories, not model identities.

  • not an eval framework

    There are no tasks, no graders, no rubric. You bring trajectories; halftrace measures decay within them.

  • not an agent framework

    halftrace doesn't run agents or call models. It is read-only on the trajectories produced by your existing runtime.

  • not a substitute for inspection

    A halftrace number tells you which failure mode is dominating, not why. Use it to direct attention, then read the trace.

It is a measurement tool. You supply trajectories from whatever runtime you use — Anthropic, OpenAI, an in-house harness — and halftrace tells you which failure mode is degrading capability fastest, in a unit ("tool calls until half-degraded") that's comparable across runs.
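Because every probe reports in the same unit, finding the fastest-degrading failure mode is a one-line comparison. A sketch using the illustrative numbers from the table above:

```python
# Per-probe halftraces in tool calls (illustrative numbers from the
# expected-output table, not real measurements).
scores = {
    "state_amnesia": 52,
    "instruction_decay": 38,
    "tool_repetition": 71,
    "premature_termination": 44,
}

# The dominant failure mode is the probe with the smallest halftrace:
# its capability is half-degraded soonest.
dominant = min(scores, key=scores.get)
print(dominant)  # prints instruction_decay
```

That number points you at a failure mode; per the section above, the next step is still to read the trace.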