Story points are dead.

A framework for estimating software work in dollars of LLM inference — not hours, not abstract points. Honest, falsifiable, calibrated to 2026.

● v0.1 — open source ● CC BY 4.0 ● Community-driven

Hours lied. Story points were a costume.

For 20 years we estimated software in points — an abstract, unfalsifiable unit invented to dodge the obvious failures of hour estimates. It worked, sort of, when humans wrote every line of code.

That world is gone. In 2026, agents write the first draft of most production code. The bottleneck is no longer "how long will a developer type this?" — it's "how many turns of inference, with which model, against which codebase, will it take to ship?"

That question has a real answer in dollars. The token cost is already on the API invoice every month; today it goes ignored. TokenPoints is a planning vocabulary built around that number.

A sizing scale anchored in USD.

Five buckets, plus a "spike first" escape hatch for unknown work. The numbers are starting anchors — every team recalibrates with its own data after two sprints.

| Size | Pattern | Cost (USD) |
| --- | --- | --- |
| XS | Pinpoint edit, autocomplete-heavy | < $1 |
| S | Single-file feature or bug, 5–15 turns | $1 – $8 |
| M | Multi-file feature, 15–40 turns | $8 – $40 |
| L | Refactor, deep debug, cross-module | $40 – $160 |
| XL | Architectural change, multi-system | $160 – $400 |
| ?? | Spike first — investigate before sizing | time-boxed |

Anything above XL must be decomposed. If you can't decompose it, you don't understand it yet — that's a spike.
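To make the anchors concrete, here is a minimal sketch of the default scale as data, with a helper that maps an observed cost back to a bucket. The boundaries mirror the table above; the names (`SIZE_BUCKETS`, `bucket_for_cost`) are illustrative, not part of any official TokenPoints tooling.

```python
# Default anchors from the table above. Every team should recalibrate these
# boundaries with its own data after a couple of sprints.
SIZE_BUCKETS = [
    ("XS", 0.0, 1.0),      # pinpoint edit, autocomplete-heavy
    ("S", 1.0, 8.0),       # single-file feature or bug, 5-15 turns
    ("M", 8.0, 40.0),      # multi-file feature, 15-40 turns
    ("L", 40.0, 160.0),    # refactor, deep debug, cross-module
    ("XL", 160.0, 400.0),  # architectural change, multi-system
]

def bucket_for_cost(actual_usd: float) -> str:
    """Map an observed inference cost to a size bucket; '??' means decompose or spike."""
    for name, low, high in SIZE_BUCKETS:
        if low <= actual_usd < high:
            return name
    return "??"  # above XL: you don't understand it yet

print(bucket_for_cost(23.50))  # -> "M"
```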

FAQ

Real questions from the community. If yours isn't here, open an issue.

Token usage varies wildly between runs. How can you estimate it?
The framework doesn't try to predict cost exactly. It gives teams a unit to track variance against. A task that varies 30x between runs is telling you something story points hid: the work has hidden complexity. Variance is the signal, not the noise.
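As a rough illustration of treating variance as a signal, the sketch below compares the cheapest and most expensive runs of the same task and flags a large spread. The 10x threshold is an assumption for the example, not a framework constant.

```python
def variance_ratio(run_costs_usd: list[float]) -> float:
    """Ratio of the most expensive run to the cheapest run of the same task."""
    return max(run_costs_usd) / min(run_costs_usd)

runs = [0.80, 3.10, 24.00]  # three attempts at the "same" task: a 30x spread
if variance_ratio(runs) > 10:
    print("high variance: likely hidden complexity, consider a spike")
```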
Different models have different prices. How does the framework handle that?
With an explicit model multiplier baked in at estimation. Opus is roughly 5x Sonnet on input; Haiku roughly 0.2x. Calibration absorbs the rest based on your team's actual model mix. The methodology is universal — the multipliers are local to your context and the current pricing.
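A hedged sketch of what an explicit model multiplier can look like in practice, using only the rough ratios quoted above (Opus roughly 5x Sonnet on input, Haiku roughly 0.2x). The dictionary keys and function name are illustrative; recalibrate the numbers against your provider's current price sheet.

```python
# Rough, time-sensitive ratios from the answer above; treat as placeholders.
MODEL_MULTIPLIER = {
    "sonnet": 1.0,  # baseline
    "opus": 5.0,    # roughly 5x Sonnet on input tokens
    "haiku": 0.2,   # roughly 0.2x Sonnet
}

def normalized_estimate(base_estimate_usd: float, model: str) -> float:
    """Scale a Sonnet-denominated estimate to the model you actually plan to run."""
    return base_estimate_usd * MODEL_MULTIPLIER[model]

print(normalized_estimate(8.0, "opus"))  # an $8 Sonnet-sized task -> ~$40 on Opus
```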
Is this only useful for large (XL) features?
No. The headline isn't the absolute dollar amount — it's the variance between estimate and actual. A $4 task that lands at $20 tells you what story points hid. Even XS tasks accumulate calibration data that compounds over sprints.
Who calculates the cost — a human or an AI?
Neither, at least not manually. Your tooling already does it. Claude Code, Cursor, Aider, Copilot, the API consoles — they all log tokens per session. You read the number at PR-merge time and drop it into your tracking sheet: roughly 30 seconds of overhead per task. The data collection is essentially free.
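For teams that keep the tracking sheet as a file in the repo, one possible shape of that 30-second step is sketched below. The file name and column order are invented for illustration; match whatever your existing sheet expects.

```python
import csv
from datetime import date

def log_task(path: str, task_id: str, size: str,
             estimate_usd: float, actual_usd: float) -> None:
    """Append one task's estimate and the cost your tooling reported at PR-merge time."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [date.today().isoformat(), task_id, size, estimate_usd, actual_usd]
        )

# Hypothetical task ID and numbers, purely for illustration.
log_task("tokenpoints.csv", "PROJ-142", "M", estimate_usd=12.0, actual_usd=19.4)
```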
Doesn't this only work after a tech spec exists?
Correct. That's exactly what the ?? size is for. If you can't decompose the work, you don't know its shape yet — forcing a number is fiction. Time-box a spike, produce sized sub-tasks, then estimate. Spike-first is the framework's release valve for genuine uncertainty.
Why not just budget at the project level (CapEx-style)?
The framework supports that — sprint capacity is a budget cap. The per-task layer just zooms in so you can investigate variance, which project-level budgets hide. Both work together. CapEx-style budgeting tells you how much; TokenPoints tells you where it went and why.
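A small sketch of how the two layers can sit together: per-task actuals roll up into a sprint total checked against a cap, while the per-task rows stay available for variance investigation. All numbers here are invented.

```python
SPRINT_BUDGET_USD = 600.0  # the CapEx-style cap

tasks = [
    {"id": "PROJ-140", "estimate": 8.0, "actual": 6.5},
    {"id": "PROJ-141", "estimate": 40.0, "actual": 95.0},  # worth investigating
    {"id": "PROJ-142", "estimate": 12.0, "actual": 19.4},
]

spend = sum(t["actual"] for t in tasks)
print(f"sprint spend: ${spend:.2f} of ${SPRINT_BUDGET_USD:.2f} cap")

# The per-task layer is what tells you *where* any overage came from.
for t in tasks:
    if t["actual"] > 2 * t["estimate"]:
        print(f"{t['id']}: actual ${t['actual']} vs estimate ${t['estimate']}")
```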
Can managers weaponize this for individual performance reviews?
Yes — and that's the framework's biggest risk. The manifesto explicitly forbids it. Comparing $/task between developers is the modern lines-of-code: it will be gamed, it will damage trust, and it measures the wrong thing. This is a hard rule, defended actively. Aggregate at the team level only.
How is this different from FinOps for engineering?
FinOps tracks aggregate spend for finance reporting. TokenPoints connects spend to tasks for planning, not just reporting after the fact. They're complementary — a calibrated TokenPoints practice generates the per-task data that makes engineering FinOps actually meaningful.
Does this work if my team uses OpenAI, Gemini, or other providers?
Yes. The framework is provider-agnostic. Swap in model multipliers derived from your provider's pricing. The sizing scale, calibration playbook, and tracking templates work identically. Calibration absorbs the rest.
What if my team isn't using agents yet?
Then this framework solves a problem you don't have. Use whatever you used before. TokenPoints is only useful in proportion to how much of your work is actually agent-driven. Honest answer: if you're sub-30%, story points or hours are still fine.