half/trace
A halftrace is the trajectory length at which an agent capability is degraded by 50%.
halftrace is a Python library for measuring this across the four failure modes that bite agents on long-horizon tasks.
Four failure modes
Each probe targets a distinct kind of decay. halftrace measures them separately so you can tell which one is dominating a given run.
-
01state_amnesia
The agent loses track of facts it established earlier in the trajectory — variable bindings, prior tool outputs, the user's original constraints.
halftrace ≈ 28 tool calls -
02instruction_decay
The agent stops adhering to system-prompt instructions as the context grows: format rules slip, forbidden actions return, hedges accumulate.
halftrace ≈ 59 tool calls -
03tool_repetition
The agent begins re-calling tools with identical or near-identical arguments instead of progressing — the classic "stuck in a loop" failure.
halftrace ≈ 45 tool calls -
04premature_termination
The agent's tendency to claim completion before the task is actually done — flat early, then a sharp rise once the trajectory gets long.
halftrace ≈ 70 tool calls
Install and run
Bring your own trajectories from whatever runtime you use. halftrace doesn't run agents — it scores trajectories you already have.
pip install halftrace
from halftrace import Suite, probes
suite = Suite(
probes=[
probes.StateAmnesia(),
probes.InstructionDecay(),
probes.ToolRepetition(),
probes.PrematureTermination(),
],
)
# trajectory: a list of (step, tool_call, response) tuples
# from whichever agent runtime you use
scores = suite.run(trajectory)
for name, halftrace in scores.items():
print(f"{name}: {halftrace} tool calls")
Expected output
A halftrace per probe, in units of tool calls. Lower is worse — the capability is half-degraded sooner.
| probe | halftrace | notes | |
|---|---|---|---|
| state_amnesia | 52tool calls | recall of established facts | |
| instruction_decay | 38tool calls | adherence to system prompt | |
| tool_repetition | 71tool calls | argument novelty across calls | |
| premature_termination | 44tool calls | stop-when-incomplete rate |
Illustrative. Real numbers will land with the first study in roughly three weeks at ruairidh.dev/blog.
What halftrace is not
A short list of things it would be easy to mistake this for, and which it deliberately is not.
a benchmark
It will not rank models against each other or produce a leaderboard. It scores trajectories, not model identities.
an eval framework
There are no tasks, no graders, no rubric. You bring trajectories; halftrace measures decay within them.
an agent framework
halftrace doesn't run agents or call models. It is read-only on the trajectories produced by your existing runtime.
a substitute for inspection
A halftrace number tells you which failure mode is dominating, not why. Use it to direct attention, then read the trace.
It is a measurement tool. You supply trajectories from whatever runtime you use — Anthropic, OpenAI, an in-house harness — and halftrace tells you which failure mode is degrading capability fastest, in a unit ("tool calls until half-degraded") that's comparable across runs.