A Benchmark & Framework for Evaluating Next Action Predictions in Spreadsheets

Live & interactive

Watch a spreadsheet predict itself

Two views, one engine. See the assistant suggest & the user accept/reject — then replay a real benchmark trajectory action-by-action.


The data

A benchmark built like real spreadsheets

52 workbooks, reconstructed into the step-by-step action sequences a person would actually take — then validated by hand.

52 spreadsheets

12K total actions

164 median ops / sheet

action types

How each trajectory is generated

🗎 Static workbook: public corpora (Singh et al., 2023)
→ 👁 VLM annotation: regions · dependencies · pasted ranges
→ ⚙ Symbolic heuristics: cell→range merge · sampled order
→ 🤖 LLM judge–editor: make natural · revalidate to target
→ ✅ Human annotation: reorder · correct · most reworked

Annotators don't just rubber-stamp the machine drafts — they substantially rewrite them. The mean normalized edit distance between the draft and the final trajectory is 0.69 (median 0.77), so most sequences reflect a genuinely human build order rather than the model's first guess.
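As a rough illustration of the statistic above, the sketch below computes a normalized edit distance between two action sequences. It assumes a plain Levenshtein distance over whole actions, divided by the length of the longer sequence; the page does not say which normalization was actually used, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Assumed normalization: divide by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


draft = ['INPUT | A1 | "Task"', "FONT_BOLD | A1:D1 | true", "FILL_COLOR | A1:D1 | #0D2B4E"]
final = ['INPUT | A1 | "Task"', "FILL_COLOR | A1:D1 | #0D2B4E", "FONT_BOLD | A1:D1 | true", "MERGE | A1:G1 | true"]
score = normalized_edit_distance(draft, final)
```

Under this assumed definition, 0 means the annotator kept the draft unchanged and 1 means no action survived in place.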

One DSL for every edit

Each action is a single pipe-delimited line, so trajectories are readable, diffable, and executable.

INPUT      | A1     | "Task"
FONT_BOLD  | A1:D1  | true
FILL_COLOR | A1:D1  | #0D2B4E
BORDER_ALL | A1:C3  | Thin, #0083AC
MERGE      | A1:G1  | true
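A minimal parsing sketch for lines like the ones above, assuming every action has exactly three fields (operation, target range, value) and that only the first two pipes are separators. The `Action` class and both function names are hypothetical, not part of the benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str      # e.g. "INPUT", "FONT_BOLD"
    target: str  # a cell or range, e.g. "A1" or "A1:D1"
    value: str   # raw value text, kept unparsed


def parse_action(line):
    """Split one DSL line into its three fields, trimming the padding."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 'OP | TARGET | VALUE', got: {line!r}")
    return Action(*parts)


def format_action(action):
    """Serialize back to a single pipe-delimited line."""
    return f"{action.op} | {action.target} | {action.value}"


trajectory = [parse_action(line) for line in [
    'INPUT      | A1     | "Task"',
    "FONT_BOLD  | A1:D1  | true",
    "BORDER_ALL | A1:C3  | Thin, #0083AC",
]]
```

Limiting the split to two separators keeps a value that itself contains a pipe intact, which matters for free-text `INPUT` values.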

Operation mix

Sequence length: ops per spreadsheet · min 35 · median 164 · mean 229 · max 821

Most sheets run well under 300 actions — median 164 — with a long tail of large, complex workbooks out past 800.

The method

Online evaluation, not “given x, predict y”

A real assistant changes the very state it operates in. We evaluate on-policy: predictions reshape the future the user still has to complete.

[Figure: one step of the online evaluation loop. The system predicts, the user accepts or rejects, and the ground-truth future is rewritten.]

The accept & adapt loop

Predict. After each user action the solver proposes ≥0 next actions.
Decide. An acceptance heuristic accepts or rejects (modeling the user).
Adapt. On accept, drop satisfied ops, prepend inverses for false positives, and patch so the target is still reached.
Repair. The system must fix its own mistakes — impossible to capture offline.
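The adapt step can be sketched in a toy setting where every op simply writes a value to a cell. This is an illustration under that assumption, not the framework's implementation: `adapt` and the `(cell, value)` encoding are hypothetical, and real ops (formatting, merges, pastes) would need richer inverses and patching.

```python
from typing import Optional

Op = tuple[str, Optional[str]]  # (cell, value); None means "clear the cell"


def adapt(state, remaining, accepted):
    """Apply accepted predictions and rewrite the user's remaining ops.

    Correct predictions are dropped from `remaining`; each false positive
    gets an inverse op prepended so the user can still reach the target.
    """
    state = dict(state)
    remaining = list(remaining)
    inverses = []
    for cell, value in accepted:
        previous = state.get(cell)
        state[cell] = value
        if (cell, value) in remaining:
            remaining.remove((cell, value))   # satisfied: user no longer types it
        else:
            inverses.append((cell, previous))  # false positive: must be undone
    # Undo in reverse order of application, then continue with the original plan.
    return state, inverses[::-1] + remaining


state = {"A1": "Task"}
remaining = [("B1", "Owner"), ("C1", "Due")]
accepted = [("B1", "Owner"), ("D1", "Oops")]
new_state, new_remaining = adapt(state, remaining, accepted)
```

Here one accepted op saves the user a step and the other costs one, which is why accepting a bad block can leave the user with as much work as before, or more.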

State-level metrics

UAS: % user actions saved (primary)
AR: predictions accepted
PREC: correct predicted edits
pCov: of predictable edits hit

Scored on the resulting workbook state — different action sequences can reach the same sheet.
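The sketch below illustrates why scoring on state is order-independent: two different action sequences are replayed into a per-cell state and compared. It makes two assumptions that the page does not confirm. First, every op is treated as a per-cell property (a simplification; a merge, for example, is not really per-cell). Second, UAS is read as the relative reduction in user actions against the unassisted trajectory, which at least agrees with UAS being able to go negative. All names are hypothetical.

```python
import re

CELL = re.compile(r"([A-Z]+)(\d+)")


def col_to_index(col):
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def index_to_col(n):
    s = ""
    while n:
        n, r = divmod(n - 1, 26)
        s = chr(ord("A") + r) + s
    return s


def expand(target):
    """Expand 'A1' or 'A1:B2' into a list of cell names."""
    start, _, end = target.partition(":")
    c1, r1 = CELL.fullmatch(start).groups()
    c2, r2 = CELL.fullmatch(end or start).groups()
    return [f"{index_to_col(c)}{r}"
            for r in range(int(r1), int(r2) + 1)
            for c in range(col_to_index(c1), col_to_index(c2) + 1)]


def replay(actions):
    """Fold (op, target, value) triples into a {(cell, op): value} state."""
    state = {}
    for op, target, value in actions:
        for cell in expand(target):
            state[(cell, op)] = value
    return state


def uas(baseline_len, user_actions):
    """Assumed definition: percent of baseline actions the user did not perform."""
    return 100.0 * (baseline_len - user_actions) / baseline_len


seq_a = [("FONT_BOLD", "A1:B1", "true"), ("INPUT", "A1", '"Task"')]
seq_b = [("INPUT", "A1", '"Task"'), ("FONT_BOLD", "A1", "true"), ("FONT_BOLD", "B1", "true")]
same_sheet = replay(seq_a) == replay(seq_b)
```

In this sketch, a prediction that bolds `A1:B1` in one op counts as correct even if the reference trajectory bolded the two cells separately.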

[Figure: single-action vs multi-action prediction settings]

Two prediction settings

Single-action (k=1, re-predict): one op at a time, re-queried after each accept; this levels the field for small models.
Multi-action (k≥1): emit a whole block and advance by what's accepted; this is natural for LLMs.

What “actions saved” looks like
[Figure: operations remaining vs user steps, for low / medium / high UAS trajectories]
Each accepted prediction removes work the user would otherwise do. The blue line drops below the no-assist baseline; the green area is the saved effort, small on weak trajectories and large on strong ones.

68%

How much is even predictable?

An oracle of four reasoning LLMs recovers 68% of all edits from history alone (median 66% per sheet; 44/52 sheets > 50%). This sets a ceiling on how much any predictor can save.

What we learned

Results & insights

Baselines span zero-shot LLMs, fine-tuned small models, and classical sequence models — each surfaces a different lesson.

Model comparison: single-action · greedy · reasoning = low effort

Model	UAS	AR	PREC	pCov
Zero-shot LLMs
GPT-5 (reasoning)	32.7	29.4	41.6	24.8
GPT-5 mini (reasoning)	28.2	25.5	37.0	20.9
GPT-5	27.4	30.9	44.8	20.7
GPT-5 mini	18.0	16.8	21.9	10.7
SmolLM2 — base → fine-tuned
360M fine-tuned	26.8	26.8	33.7	13.7
360M base	21.7	22.3	29.7	9.6
135M fine-tuned	23.2	23.1	30.6	13.0
135M base	18.3	19.0	24.8	8.9
Classical sequence models
Online n-gram	12.0	14.7	20.4	11.1
LSTM	5.7	5.5	12.4	2.4
Trained n-gram	3.8	3.9	11.9	0.7
XGBoost	2.9	2.3	6.5	1.0

Learnable task. Fine-tuning lifts SmolLM2-360M from 21.7 → 26.8 UAS, within 0.6 points of non-reasoning GPT-5 (27.4) at a fraction of the size.

Acceptance heuristics: multi-action · GPT-5

Heuristic	Rule	UAS	AR	PREC	pCov
Greedy	ℓ≥1	22.3	20.0	36.0	19.6
Hybrid-1	p≥.9, ℓ≥1	21.8	16.6	39.4	17.3
Greedy-2	ℓ≥2	20.3	10.5	37.6	15.9
Hybrid-2	p=1, ℓ≥2	17.5	7.9	40.1	10.7
P100	p=1	19.9	21.8	38.2	20.1
P90	p≥.9	17.0	23.3	36.0	25.5
P60	p≥.6	13.3	26.9	32.9	33.0
Always	—	−19.2	100	9.3	8.1

Abstention is key. Accepting everything collapses utility to −19.2 UAS, and precision-only thresholds (P100 at 19.9, P60 at 13.3) also save less than Greedy (22.3).
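The rules in the table can be read as thresholds on a predicted block. The sketch below is one possible interpretation, since the page does not define the symbols: it assumes p is the fraction of predicted ops that appear among the user's remaining ground-truth ops, and ℓ is the number of leading predicted ops that are correct. `Rule`, `block_stats`, and `accepts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    min_precision: float = 0.0  # assumed meaning of p
    min_prefix: int = 0         # assumed meaning of ℓ


RULES = {
    "Greedy": Rule(min_prefix=1),
    "Hybrid-1": Rule(min_precision=0.9, min_prefix=1),
    "Greedy-2": Rule(min_prefix=2),
    "Hybrid-2": Rule(min_precision=1.0, min_prefix=2),
    "P100": Rule(min_precision=1.0),
    "P90": Rule(min_precision=0.9),
    "P60": Rule(min_precision=0.6),
    "Always": Rule(),
}


def block_stats(predicted, remaining):
    """Return (precision, correct-prefix length) of a predicted block."""
    wanted = set(remaining)
    correct = [op in wanted for op in predicted]
    prefix = 0
    for ok in correct:
        if not ok:
            break
        prefix += 1
    precision = sum(correct) / len(correct) if correct else 0.0
    return precision, prefix


def accepts(rule, predicted, remaining):
    """Simulated user decision; an empty block is never accepted."""
    if not predicted:
        return False
    precision, prefix = block_stats(predicted, remaining)
    return precision >= rule.min_precision and prefix >= rule.min_prefix


decisions = {name: accepts(rule, ["a", "b", "x"], ["a", "b", "c"])
             for name, rule in RULES.items()}
```

Under these assumptions the example block (two correct ops followed by one wrong one) is accepted by Greedy, Greedy-2, P60, and Always, and rejected by the stricter rules.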

Multi-action prediction: a block per call · greedy

Model	UAS	AR	PREC	pCov
GPT-5 (reasoning)	26.6	13.3	27.8	30.7
GPT-5	22.3	20.0	36.0	19.6
GPT-5 mini (reasoning)	21.3	9.6	23.2	24.3
GPT-5 mini	20.1	10.1	19.0	17.3

Knowing when to stop. Emitting a whole block rewards solvers that self-limit. Reasoning lifts UAS and coverage for both models (precision rises only for GPT-5 mini, and falls for GPT-5 from 36.0 to 27.8), but its blocks are accepted less often.

Hyperparameter ablations: GPT-5 · greedy · defaults marked “def”

Single-action repredict: one op re-predicted after each user action

Param	UAS	AR	PREC	pCov
Prediction stride (s)
1 def	27.4	30.9	44.8	20.7
2	22.6	36.5	48.4	15.9
4	16.8	42.3	53.2	9.4
8	10.6	43.7	55.1	7.1
Context window (c)
8	19.9	24.1	39.7	13.6
32 def	27.4	30.9	44.8	20.7
128	30.0	32.5	47.8	28.5
512	30.8	33.7	47.9	31.0
Context shortening
on def	27.4	30.9	44.8	20.7
off	27.7	31.2	44.3	21.4
Re-prediction
on def	27.4	30.9	44.8	20.7
off	20.3	30.6	44.1	15.2

Multi-action: a whole block predicted per call

Param	UAS	AR	PREC	pCov
Prediction stride (s)
1 def	22.3	20.0	36.0	19.6
2	19.5	24.4	39.4	16.0
4	14.7	33.4	45.1	12.2
8	9.8	36.5	49.4	6.8
Context window (c)
8	16.2	17.5	33.8	12.3
32 def	22.3	20.0	36.0	19.6
128	27.6	19.7	37.3	30.0
512	26.2	19.0	35.9	32.0
2048	27.4	19.4	36.6	33.0
Context shortening
on def	22.3	20.0	36.0	19.6
off	24.2	21.5	38.4	21.4
Max ops / call (m)
1	20.3	30.6	44.1	15.2
4	20.8	23.1	39.2	19.4
8	22.1	19.4	36.1	18.9
16	21.8	20.6	36.2	20.2
∞ def	22.3	20.0	36.0	19.6

Cheap, frequent triggers win; context saturates. Stride 1 maximises savings (sparser triggers raise acceptance but cut UAS); context helps up to ~128 ops then flattens; and re-prediction is essential (single-action UAS drops 27.4 → 20.3 without it).
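To make the two knobs concrete, the sketch below shows one way a trigger schedule could work. It assumes stride s means "query the solver after every s-th user action" and context window c means "pass only the last c actions"; the defaults of 1 and 32 mirror the table, while `prediction_points` itself is a hypothetical helper.

```python
def prediction_points(actions, stride=1, window=32):
    """Yield (step, context) at each point where the solver would be queried."""
    if stride < 1 or window < 1:
        raise ValueError("stride and window must be positive")
    history = []
    for step, action in enumerate(actions, start=1):
        history.append(action)
        if step % stride == 0:
            # Only the most recent `window` actions are shown to the solver.
            yield step, history[-window:]


points = list(prediction_points(list("abcdefghij"), stride=4, window=3))
queries_default = sum(1 for _ in prediction_points(list("abcdefghij")))
```

With stride 4 the solver is queried twice over ten actions instead of ten times, which illustrates the trade-off above: fewer calls, but fewer chances to save the user a step.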

Where it works — acceptance by operation category

Content beats chrome. Content ops (input / paste / fill) are accepted far more than presentational ones (align / number-format / border) — and in-domain fine-tuning lifts exactly the categories the base model handled worst.

[Figure: distribution of acceptance streak lengths for GPT-5]

Why modest acceptance still saves a lot

An acceptance streak is a run of consecutive accepted predictions between two user actions. Most are short (60% length 1), but a heavy tail chains 7+ accepts.

Savings come in bursts. The mean streak of 2.21 means each successful suggestion commits roughly twice the work the user would type — reconciling a ~32% per-step acceptance rate with much larger total savings.
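Streak lengths as defined above can be computed from an event log with a single pass. The log format here, a list of `"user"` and `"accept"` strings, is an assumption made for illustration, and `streak_lengths` is a hypothetical helper.

```python
from itertools import groupby


def streak_lengths(events):
    """Lengths of each run of consecutive 'accept' events."""
    return [sum(1 for _ in run)
            for kind, run in groupby(events)
            if kind == "accept"]


events = ["user", "accept", "user", "accept", "accept", "accept",
          "user", "user", "accept"]
lengths = streak_lengths(events)
mean_streak = sum(lengths) / len(lengths)
share_len_1 = lengths.count(1) / len(lengths)
```

In this toy log, two of the three streaks have length 1, yet the single streak of three contributes more than half of all accepted ops, the same pattern the heavy tail produces at scale.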

🎯

Task is learnable

A small fine-tuned model comes within a point of non-reasoning GPT-5, and solvers show a clear capability gradient.

✋

Know when not to act

Abstention & stopping criteria matter more than raw generative power.

⚡

Cheap triggers win

Predicting after every action saves most; sparser triggers raise acceptance but cut savings.

🧭

Context saturates

Most predictive signal sits in the last ~128 actions; beyond that, returns flatten.

❄

Cold start

Acceptance climbs from ~12% early to ~24% late as patterns emerge — adaptive triggering could help.

📈

Savings come in bursts

Accepted predictions chain (mean streak 2.21), amplifying modest per-step acceptance.

Autocomplete, but for spreadsheets.