Code editors predict your next line. Spreadsheets — used by hundreds of millions — predict almost nothing. NAPE is the first benchmark and online-evaluation framework for systems that watch a user's edits and suggest the next actions.
Building a sheet takes hundreds of tiny actions. There are no public edit histories to learn from, and the action space (values, formats, borders, merges) is huge and composite.
Treat it like code autocomplete. After every action, a solver predicts the next ones; a simulated user accepts or rejects, and the remaining work is rewritten on the fly.
The task is learnable: a fine-tuned 360M model rivals GPT-5, saving ~27% of user actions. Abstention and cheap triggers matter more than raw model size.
Two views, one engine. Watch the assistant suggest and the user accept or reject, then replay a real benchmark trajectory action by action.
52 workbooks, reconstructed into the step-by-step action sequences a person would actually take, then validated by hand.
Annotators don't just rubber-stamp the machine drafts — they substantially rewrite them. The mean normalized edit distance between the draft and the final trajectory is 0.69 (median 0.77), so most sequences reflect a genuinely human build order rather than the model's first guess.
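As a rough illustration of the statistic above, the sketch below computes a normalized edit distance between two action sequences. It rests on two assumptions not stated here: each whole action line counts as one token, and the Levenshtein distance is divided by the length of the longer sequence. The function names and the toy `draft` and `final` lists are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each element is one whole action line."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: divide by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Toy data: the annotator inserts a fill and drops the merge.
draft = ['INPUT | A1 | "Task"', 'FONT_BOLD | A1:D1 | true', 'MERGE | A1:G1 | true']
final = ['INPUT | A1 | "Task"', 'FILL_COLOR | A1:D1 | #0D2B4E', 'FONT_BOLD | A1:D1 | true']

print(edit_distance(draft, final))             # 2
print(normalized_edit_distance(draft, final))  # 2/3, about 0.67
```

Under this normalization, 0 means the draft was kept unchanged and 1 means no action survived in place.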
Each action is a single pipe-delimited line, so trajectories are readable, diffable, and executable.
INPUT | A1 | "Task"
FONT_BOLD | A1:D1 | true
FILL_COLOR | A1:D1 | #0D2B4E
BORDER_ALL | A1:C3 | Thin, #0083AC
MERGE | A1:G1 | true
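A minimal parsing sketch for lines in this shape follows. It assumes every action has exactly three fields (operation, target range, arguments), which holds for the examples shown but is not confirmed as the full grammar. The `Action` class and both helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str      # e.g. INPUT, FONT_BOLD
    target: str  # a cell or range such as A1 or A1:D1
    args: str    # raw argument text, left unparsed


def parse_action(line: str) -> Action:
    """Split on the first two pipes only, so a pipe inside the value survives."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 'OP | TARGET | ARGS', got: {line!r}")
    return Action(*parts)


def format_action(action: Action) -> str:
    """Serialize back to the one-line form used in trajectories."""
    return f"{action.op} | {action.target} | {action.args}"


border = parse_action("BORDER_ALL | A1:C3 | Thin, #0083AC")
print(border.op, border.target, border.args)  # BORDER_ALL A1:C3 Thin, #0083AC
```

Keeping the arguments as raw text keeps the round trip lossless; an executor would interpret them per operation.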
Most sheets run well under 300 actions — median 164 — with a long tail of large, complex workbooks out past 800.
A real assistant changes the very state it operates in. We evaluate on-policy: predictions reshape the future the user still has to complete.
Predictions are scored on the resulting workbook state, because different action sequences can reach the same sheet.
Single-action (k=1, re-predict): one op at a time, re-queried after each accept; this levels the field for small models.
Multi-action (k≥1): emit a whole block and advance by what is accepted; this is natural for LLMs.
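The on-policy loop can be sketched as below, under several simplifications that are assumptions rather than the benchmark's actual procedure. The simulated user accepts a suggestion only when it exactly matches the next remaining action (the real scoring compares workbook state), and the solver is re-queried after every accept in both modes. `simulate`, `next_row_solver`, and the toy trajectory are hypothetical, and the returned tallies are raw counts, not the UAS or AR metrics reported below.

```python
from typing import Callable, Dict, List, Sequence

Solver = Callable[[Sequence[str]], List[str]]


def simulate(trajectory: Sequence[str], solver: Solver, max_ops: int = 1) -> Dict[str, int]:
    """Replay a trajectory on-policy and tally typed vs. accepted actions."""
    history: List[str] = []
    remaining = list(trajectory)
    typed = accepted = shown = 0
    while remaining:
        # The user performs the next action by hand.
        history.append(remaining.pop(0))
        typed += 1
        # Re-query the solver after every accept until it abstains or is rejected.
        while remaining:
            block = solver(history)[:max_ops]
            if not block:
                break  # solver abstained
            shown += 1
            n = 0  # length of the accepted prefix of the block
            while n < len(block) and n < len(remaining) and block[n] == remaining[n]:
                n += 1
            if n == 0:
                break  # suggestion rejected
            history.extend(remaining[:n])
            del remaining[:n]
            accepted += n
    return {"typed": typed, "accepted": accepted, "shown": shown}


def next_row_solver(history: Sequence[str]) -> List[str]:
    """Toy solver: if the last two ops match, guess the same op one row lower."""
    if len(history) < 2:
        return []
    parts = [p.strip() for p in history[-1].split("|", 2)]
    prev_op = history[-2].split("|", 1)[0].strip()
    if len(parts) != 3 or parts[0] != prev_op:
        return []
    op, cell, arg = parts
    if not (cell[:1].isalpha() and cell[1:].isdigit()):
        return []  # abstain on ranges and multi-letter columns
    return [f"{op} | {cell[0]}{int(cell[1:]) + 1} | {arg}"]


trajectory = [
    'INPUT | A1 | "Task"',
    "FONT_BOLD | A1 | true",
    "FONT_BOLD | A2 | true",
    "FONT_BOLD | A3 | true",
    "FONT_BOLD | A4 | true",
    "FILL_COLOR | A1 | #0D2B4E",
]
print(simulate(trajectory, next_row_solver))
# {'typed': 4, 'accepted': 2, 'shown': 3}
```

With `max_ops=1` this behaves like the single-action mode; a larger `max_ops` lets a solver emit a block of which only the matching prefix is kept.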
An oracle of four reasoning LLMs recovers 68% of all edits from history alone (median 66% per sheet; 44/52 sheets > 50%). This sets a ceiling on how much any predictor can save.
Baselines span zero-shot LLMs, fine-tuned small models, and classical sequence models — each surfaces a different lesson.
| Model | UAS | AR | PREC | pCov |
|---|---|---|---|---|
| Zero-shot LLMs | ||||
| GPT-5 (reasoning) | 32.7 | 29.4 | 41.6 | 24.8 |
| GPT-5 mini (reasoning) | 28.2 | 25.5 | 37.0 | 20.9 |
| GPT-5 | 27.4 | 30.9 | 44.8 | 20.7 |
| GPT-5 mini | 18.0 | 16.8 | 21.9 | 10.7 |
| SmolLM2 — base → fine-tuned | ||||
| 360M fine-tuned | 26.8 | 26.8 | 33.7 | 13.7 |
| 360M base | 21.7 | 22.3 | 29.7 | 9.6 |
| 135M fine-tuned | 23.2 | 23.1 | 30.6 | 13.0 |
| 135M base | 18.3 | 19.0 | 24.8 | 8.9 |
| Classical sequence models | ||||
| Online n-gram | 12.0 | 14.7 | 20.4 | 11.1 |
| LSTM | 5.7 | 5.5 | 12.4 | 2.4 |
| Trained n-gram | 3.8 | 3.9 | 11.9 | 0.7 |
| XGBoost | 2.9 | 2.3 | 6.5 | 1.0 |
Learnable task. Fine-tuning lifts SmolLM2-360M from 21.7 → 26.8 UAS — matching GPT-5 (27.4) at a fraction of the size.
| Heuristic | Rule | UAS | AR | PREC | pCov |
|---|---|---|---|---|---|
| Greedy | ℓ≥1 | 22.3 | 20.0 | 36.0 | 19.6 |
| Hybrid-1 | p≥.9, ℓ≥1 | 21.8 | 16.6 | 39.4 | 17.3 |
| Greedy-2 | ℓ≥2 | 20.3 | 10.5 | 37.6 | 15.9 |
| Hybrid-2 | p=1, ℓ≥2 | 17.5 | 7.9 | 40.1 | 10.7 |
| P100 | p=1 | 19.9 | 21.8 | 38.2 | 20.1 |
| P90 | p≥.9 | 17.0 | 23.3 | 36.0 | 25.5 |
| P60 | p≥.6 | 13.3 | 26.9 | 32.9 | 33.0 |
| Always | — | −19.2 | 100 | 9.3 | 8.1 |
Abstention is key. Accepting everything collapses utility to −19% UAS; pure precision under-saves too.
| Model | UAS | AR | PREC | pCov |
|---|---|---|---|---|
| GPT-5 (reasoning) | 26.6 | 13.3 | 27.8 | 30.7 |
| GPT-5 | 22.3 | 20.0 | 36.0 | 19.6 |
| GPT-5 mini (reasoning) | 21.3 | 9.6 | 23.2 | 24.3 |
| GPT-5 mini | 20.1 | 10.1 | 19.0 | 17.3 |
Knowing when to stop. Emitting a whole block rewards solvers that self-limit; reasoning lifts UAS and coverage (and, for GPT-5 mini, precision) but is choosier, so its acceptance rate is lower.

Single-action ablations (default rows match the GPT-5 single-action result):

| Param | UAS | AR | PREC | pCov |
|---|---|---|---|---|
| Prediction stride (s) | ||||
| 1 (default) | 27.4 | 30.9 | 44.8 | 20.7 |
| 2 | 22.6 | 36.5 | 48.4 | 15.9 |
| 4 | 16.8 | 42.3 | 53.2 | 9.4 |
| 8 | 10.6 | 43.7 | 55.1 | 7.1 |
| Context window (c) | ||||
| 8 | 19.9 | 24.1 | 39.7 | 13.6 |
| 32 (default) | 27.4 | 30.9 | 44.8 | 20.7 |
| 128 | 30.0 | 32.5 | 47.8 | 28.5 |
| 512 | 30.8 | 33.7 | 47.9 | 31.0 |
| Context shortening | ||||
| on (default) | 27.4 | 30.9 | 44.8 | 20.7 |
| off | 27.7 | 31.2 | 44.3 | 21.4 |
| Re-prediction | ||||
| on (default) | 27.4 | 30.9 | 44.8 | 20.7 |
| off | 20.3 | 30.6 | 44.1 | 15.2 |

Multi-action ablations (default rows match the GPT-5 multi-action result):

| Param | UAS | AR | PREC | pCov |
|---|---|---|---|---|
| Prediction stride (s) | ||||
| 1 (default) | 22.3 | 20.0 | 36.0 | 19.6 |
| 2 | 19.5 | 24.4 | 39.4 | 16.0 |
| 4 | 14.7 | 33.4 | 45.1 | 12.2 |
| 8 | 9.8 | 36.5 | 49.4 | 6.8 |
| Context window (c) | ||||
| 8 | 16.2 | 17.5 | 33.8 | 12.3 |
| 32 (default) | 22.3 | 20.0 | 36.0 | 19.6 |
| 128 | 27.6 | 19.7 | 37.3 | 30.0 |
| 512 | 26.2 | 19.0 | 35.9 | 32.0 |
| 2048 | 27.4 | 19.4 | 36.6 | 33.0 |
| Context shortening | ||||
| on (default) | 22.3 | 20.0 | 36.0 | 19.6 |
| off | 24.2 | 21.5 | 38.4 | 21.4 |
| Max ops / call (m) | ||||
| 1 | 20.3 | 30.6 | 44.1 | 15.2 |
| 4 | 20.8 | 23.1 | 39.2 | 19.4 |
| 8 | 22.1 | 19.4 | 36.1 | 18.9 |
| 16 | 21.8 | 20.6 | 36.2 | 20.2 |
| ∞ (default) | 22.3 | 20.0 | 36.0 | 19.6 |
Cheap, frequent triggers win; context saturates. Stride 1 maximises savings (sparser triggers raise acceptance but cut UAS); context helps up to ~128 ops then flattens; and re-prediction is essential (single-action UAS drops 27.4 → 20.3 without it).
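To make the two knobs concrete, here is a small sketch of how a prediction stride s and a context window c might gate solver calls. It assumes the solver fires whenever the count of completed actions is a multiple of s and sees only the last c actions; the exact trigger offset and both function names are assumptions.

```python
from typing import List, Sequence


def prediction_points(n_actions: int, stride: int) -> List[int]:
    """Counts of completed actions at which the solver would be queried."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [t for t in range(1, n_actions + 1) if t % stride == 0]


def context_for(history: Sequence[str], window: int) -> List[str]:
    """Keep only the most recent `window` actions as solver input."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return list(history[-window:])


# On a median-length sheet of 164 actions:
print(len(prediction_points(164, 1)))  # 164 solver calls
print(len(prediction_points(164, 8)))  # 20 solver calls

history = [f"op{i}" for i in range(100)]
print(context_for(history, 32)[0])     # op68
```

Each step from stride 1 to stride 8 cuts the number of solver calls, which fits the pattern in the tables: fewer, better-informed suggestions are accepted more often but leave more actions for the user to type.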
Content beats chrome. Content ops (input / paste / fill) are accepted far more than presentational ones (align / number-format / border) — and in-domain fine-tuning lifts exactly the categories the base model handled worst.

An acceptance streak is a run of consecutive accepted predictions between two user actions. Most are short (60% length 1), but a heavy tail chains 7+ accepts.
Savings come in bursts. The mean streak of 2.21 means each successful suggestion commits roughly twice the work the user would type — reconciling a ~32% per-step acceptance rate with much larger total savings.
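The streak definition above can be computed from an event log as in the following sketch. The log encoding is an assumption made for illustration: `U` marks an action the user performed by hand and `A` marks an accepted prediction. The event string is toy data, not benchmark output.

```python
from itertools import groupby
from typing import Iterable, List


def acceptance_streaks(events: Iterable[str]) -> List[int]:
    """Lengths of maximal runs of accepted predictions ('A') between user actions."""
    return [len(list(run)) for kind, run in groupby(events) if kind == "A"]


events = "UUAUAAAUUAU"  # toy log
streaks = acceptance_streaks(events)
mean_streak = sum(streaks) / len(streaks)
share_len_1 = sum(1 for s in streaks if s == 1) / len(streaks)

print(streaks)  # [1, 3, 1]
```

In this toy log the mean streak is 5/3 and two of the three streaks have length 1, the same shape as the reported distribution: mostly single accepts, with occasional longer chains pulling the mean up.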
A small fine-tuned model matches a frontier LLM, and solvers show a clear capability gradient.
Abstention and stopping criteria matter more than raw generative power.
Predicting after every action saves the most; sparser triggers raise acceptance but cut savings.
Most predictive signal sits in the last ~128 actions; beyond that, returns flatten.
Acceptance climbs from ~12% early to ~24% late as patterns emerge — adaptive triggering could help.
Accepted predictions chain (mean streak 2.21), amplifying modest per-step acceptance.
@inproceedings{agrawal2026nape,
title = {A Benchmark and Framework for Evaluating
Next Action Predictions in Spreadsheets},
author = {Agrawal, Tejas and Le, Vu and
Gulwani, Sumit and Verbruggen, Gust},
booktitle = {Proceedings of the 43rd International
Conference on Machine Learning (ICML)},
year = {2026}
}