ICML 2026  ·  Microsoft  ·  Benchmark + Framework  ·  arXiv:2606.13802

Autocomplete, but for spreadsheets.

Code editors predict your next line. Spreadsheets — used by hundreds of millions — predict almost nothing. NAPE is the first benchmark and online-evaluation framework for systems that watch a user's edits and suggest the next actions.

A Benchmark & Framework for Evaluating Next Action Predictions in Spreadsheets

Tejas Agrawal  ·  Vu Le  ·  Sumit Gulwani  ·  Gust Verbruggen
Microsoft

The problem

Building a sheet takes hundreds of tiny actions. There are no public edit histories to learn from, and the action space (values, formats, borders, merges) is huge and composite.

The idea

Treat it like code autocomplete. After every action, a solver predicts the next ones; a simulated user accepts or rejects, and the remaining work is rewritten on the fly.

The result

The task is learnable: a fine-tuned 360M model rivals GPT-5, saving ~27% of user actions. Abstention and cheap triggers matter more than raw model size.

Live & interactive

Watch a spreadsheet predict itself

Two views, one engine. See the assistant suggest & the user accept/reject — then replay a real benchmark trajectory action-by-action.

The data

A benchmark built like real spreadsheets

52 workbooks, reconstructed into the step-by-step action sequences a person would actually take — then validated by hand.

52 spreadsheets  ·  12K total actions  ·  164 median ops / sheet  ·  9 action types

How each trajectory is generated

  1. Static workbook: public corpora (Singh et al., 2023)
  2. VLM annotation: regions · dependencies · pasted ranges
  3. Symbolic heuristics: cell→range merge · sampled order
  4. LLM judge–editor: make natural · revalidate to target
  5. Human annotation: reorder · correct · most reworked

Annotators don't just rubber-stamp the machine drafts — they substantially rewrite them. The mean normalized edit distance between the draft and the final trajectory is 0.69 (median 0.77), so most sequences reflect a genuinely human build order rather than the model's first guess.
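
As a rough illustration of the statistic above, here is a minimal sketch of a normalized edit distance between a draft and a final trajectory. It assumes Levenshtein distance over whole actions, divided by the length of the longer sequence; the page does not state the exact normalization, so both the function name and that choice are assumptions.

```python
from typing import Sequence


def normalized_edit_distance(draft: Sequence[str], final: Sequence[str]) -> float:
    """Levenshtein distance over whole actions, scaled to [0, 1].

    Assumption: normalization divides by the longer sequence length.
    """
    if not draft and not final:
        return 0.0
    # prev[j] holds the distance between the draft prefix seen so far and final[:j].
    prev = list(range(len(final) + 1))
    for i, a in enumerate(draft, start=1):
        curr = [i]
        for j, b in enumerate(final, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a draft action
                            curr[j - 1] + 1,      # insert a final action
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1] / max(len(draft), len(final))


draft = ['INPUT | A1 | "Task"', "FONT_BOLD | A1:D1 | true",
         "FILL_COLOR | A1:D1 | #0D2B4E", "MERGE | A1:G1 | true"]
final = [draft[0], draft[2], draft[1], draft[3]]  # annotator swaps two steps
print(normalized_edit_distance(draft, final))  # 0.5
```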

One DSL for every edit

Each action is a single pipe-delimited line, so trajectories are readable, diffable, and executable.

INPUT      | A1     | "Task"
FONT_BOLD  | A1:D1  | true
FILL_COLOR | A1:D1  | #0D2B4E
BORDER_ALL | A1:C3  | Thin, #0083AC
MERGE      | A1:G1  | true
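
A minimal sketch of reading and writing these lines, assuming every action has exactly three pipe-separated fields (operation, target range, arguments). The `Action` class and both function names are hypothetical and are not part of the released framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str      # e.g. "INPUT", "FONT_BOLD"
    target: str  # a cell or range such as "A1" or "A1:D1"
    args: str    # raw argument text, left uninterpreted here


def parse_action(line: str) -> Action:
    """Split one DSL line into its three fields."""
    # Split at most twice so a "|" inside the argument text is preserved.
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3 or not parts[0] or not parts[1]:
        raise ValueError(f"not a valid action line: {line!r}")
    return Action(*parts)


def format_action(action: Action) -> str:
    """Render an action back to a single pipe-delimited line."""
    return f"{action.op} | {action.target} | {action.args}"


trajectory = [parse_action(line) for line in (
    'INPUT      | A1     | "Task"',
    "BORDER_ALL | A1:C3  | Thin, #0083AC",
)]
```

Because each action is one line, two trajectories can be compared with an ordinary line-based diff after round-tripping through `format_action`.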

Operation mix

Sequence length ops per spreadsheet · min 35 · median 164 · mean 229 · max 821

Most sheets run well under 300 actions — median 164 — with a long tail of large, complex workbooks out past 800.

The method

Online evaluation, not “given x, predict y”

A real assistant changes the very state it operates in. We evaluate on-policy: predictions reshape the future the user still has to complete.

One step of the loop: the system predicts, the user accepts/rejects, and the ground-truth future is rewritten.

The accept & adapt loop

  1. Predict. After each user action the solver proposes ≥0 next actions.
  2. Decide. An acceptance heuristic accepts or rejects (modeling the user).
  3. Adapt. On accept, drop satisfied ops, prepend inverses for false positives, and patch so the target is still reached.
  4. Repair. The system must fix its own mistakes — impossible to capture offline.
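
The Adapt step can be sketched as follows. This is a simplified model under stated assumptions, not the framework's implementation: a sheet is a dict from cell to value, an op is a `(cell, value)` pair where `None` clears the cell, and each cell appears at most once in the remaining ground-truth ops. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Op = Tuple[str, Optional[str]]  # (cell, value); None clears the cell
Sheet = Dict[str, str]


def apply_op(state: Sheet, op: Op) -> None:
    cell, value = op
    if value is None:
        state.pop(cell, None)
    else:
        state[cell] = value


def accept_and_adapt(state: Sheet, remaining: List[Op],
                     predicted: List[Op], target: Sheet) -> List[Op]:
    """Accept `predicted`, then rewrite the ops the user still has to do."""
    before = dict(state)
    for op in predicted:
        apply_op(state, op)
    # Drop ground-truth ops whose effect is already present in the sheet.
    kept = [op for op in remaining if state.get(op[0]) != op[1]]
    pending = {cell for cell, _ in kept}
    # Prepend inverses for false positives: accepted edits that disagree
    # with the target and that no remaining op would overwrite.
    inverses: List[Op] = []
    for cell in dict.fromkeys(cell for cell, _ in predicted):
        if cell not in pending and state.get(cell) != target.get(cell):
            inverses.append((cell, before.get(cell)))
    return inverses + kept


state = {"A1": "Task"}
target = {"A1": "Task", "B1": "Owner", "C1": "Due"}
remaining = [("B1", "Owner"), ("C1", "Due")]
predicted = [("B1", "Owner"), ("D1", "Status")]  # one hit, one false positive
todo = accept_and_adapt(state, remaining, predicted, target)
# todo == [("D1", None), ("C1", "Due")]: undo D1, then finish C1.
```

In this toy run the accepted block saves one user action (B1) but costs one (clearing D1), which is why accepting imprecise predictions can leave the user no better off.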

State-level metrics

UAS: % user actions saved (primary)
AR: predictions accepted
PREC: correct predicted edits
pCov: of predictable edits hit

Scored on the resulting workbook state — different action sequences can reach the same sheet.
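
To make state-level scoring concrete, the sketch below replays two differently ordered sequences to the same sheet and then computes a savings percentage. The formula is an assumption (user actions with assistance compared against the unassisted trajectory length); the page does not give the exact definition of UAS, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def replay(ops: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Apply (cell, value) writes in order and return the final sheet state."""
    state: Dict[str, str] = {}
    for cell, value in ops:
        state[cell] = value
    return state


def user_actions_saved(baseline_ops: int, user_ops: int) -> float:
    """Assumed UAS: percent of the unassisted actions the user no longer performs.

    The value is negative when repairs make the user do more work than
    building the sheet alone.
    """
    if baseline_ops <= 0:
        raise ValueError("baseline_ops must be positive")
    return 100.0 * (baseline_ops - user_ops) / baseline_ops


# Two build orders, one resulting sheet: both count as reaching the target.
a = replay([("A1", "Task"), ("B1", "Owner")])
b = replay([("B1", "Owner"), ("A1", "Task")])
print(a == b)                        # True
print(user_actions_saved(200, 150))  # 25.0
print(user_actions_saved(100, 119))  # -19.0
```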

single-action vs multi-action prediction settings

Two prediction settings

Single-action (k=1, re-predict): one op at a time, re-queried after each accept — levels the field for small models. Multi-action (k≥1): emit a whole block and advance by what's accepted — natural for LLMs.

What “actions saved” looks like

Each accepted prediction removes work the user would otherwise do. The blue line drops below the no-assist baseline; the green area is the saved effort — small on weak trajectories, large on strong ones.
operations remaining vs user steps, for low / medium / high UAS trajectories

How much is even predictable?

An oracle of four reasoning LLMs recovers 68% of all edits from history alone (median 66% per sheet; 44/52 sheets > 50%). This sets a ceiling on how much any predictor can save.

What we learned

Results & insights

Baselines span zero-shot LLMs, fine-tuned small models, and classical sequence models — each surfaces a different lesson.

Model comparison (single-action · greedy · reasoning = low effort)

Model                      UAS    AR     PREC   pCov
Zero-shot LLMs
  GPT-5 (reasoning)        32.7   29.4   41.6   24.8
  GPT-5 mini (reasoning)   28.2   25.5   37.0   20.9
  GPT-5                    27.4   30.9   44.8   20.7
  GPT-5 mini               18.0   16.8   21.9   10.7
SmolLM2 (base → fine-tuned)
  360M  fine-tuned         26.8   26.8   33.7   13.7
  360M  base               21.7   22.3   29.7    9.6
  135M  fine-tuned         23.2   23.1   30.6   13.0
  135M  base               18.3   19.0   24.8    8.9
Classical sequence models
  Online n-gram            12.0   14.7   20.4   11.1
  LSTM                      5.7    5.5   12.4    2.4
  Trained n-gram            3.8    3.9   11.9    0.7
  XGBoost                   2.9    2.3    6.5    1.0

Learnable task. Fine-tuning lifts SmolLM2-360M from 21.7 → 26.8 UAS, within 0.6 points of non-reasoning GPT-5 (27.4) at a fraction of the size.

Acceptance heuristics (multi-action · GPT-5)

Heuristic   Rule          UAS     AR     PREC   pCov
Greedy      ℓ≥1           22.3    20.0   36.0   19.6
Hybrid-1    p≥.9, ℓ≥1     21.8    16.6   39.4   17.3
Greedy-2    ℓ≥2           20.3    10.5   37.6   15.9
Hybrid-2    p=1, ℓ≥2      17.5     7.9   40.1   10.7
P100        p=1           19.9    21.8   38.2   20.1
P90         p≥.9          17.0    23.3   36.0   25.5
P60         p≥.6          13.3    26.9   32.9   33.0
Always                   −19.2   100      9.3    8.1

Abstention is key. Accepting everything collapses utility to −19% UAS; pure precision under-saves too.
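
The rules in the table can be read as thresholds on a predicted block. The sketch below is one possible reading, not the paper's definition: it assumes p is the fraction of predicted edits that match the target and ℓ is the number of matching edits, neither of which this page defines. The `Rule` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    min_precision: float = 0.0  # assumed meaning of p
    min_correct: int = 0        # assumed meaning of ℓ

    def accepts(self, correct: int, predicted: int) -> bool:
        """Decide on a block with `correct` matching edits out of `predicted`."""
        if predicted == 0:
            return False  # nothing was proposed, so there is nothing to accept
        precision = correct / predicted
        return precision >= self.min_precision and correct >= self.min_correct


RULES = {
    "Greedy":   Rule(min_correct=1),
    "Hybrid-1": Rule(min_precision=0.9, min_correct=1),
    "Greedy-2": Rule(min_correct=2),
    "Hybrid-2": Rule(min_precision=1.0, min_correct=2),
    "P100":     Rule(min_precision=1.0),
    "P90":      Rule(min_precision=0.9),
    "P60":      Rule(min_precision=0.6),
    "Always":   Rule(),
}

# A block of 4 predicted edits, 1 of which matches the target:
accepted_by = [name for name, rule in RULES.items() if rule.accepts(1, 4)]
# accepted_by == ["Greedy", "Always"]
```

Under this reading, "Always" accepts blocks with no correct edits at all, which would be consistent with its negative UAS in the table.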

Multi-action prediction (a block per call · greedy)

Model                      UAS    AR     PREC   pCov
GPT-5 (reasoning)          26.6   13.3   27.8   30.7
GPT-5                      22.3   20.0   36.0   19.6
GPT-5 mini (reasoning)     21.3    9.6   23.2   24.3
GPT-5 mini                 20.1   10.1   19.0   17.3

Knowing when to stop. Emitting a whole block rewards solvers that self-limit. Reasoning raises UAS and coverage for both models, but fewer of its predictions are accepted (AR 20.0 → 13.3 for GPT-5).

Hyperparameter ablations (GPT-5 · greedy · defaults marked "def")

Single-action repredict: one op re-predicted after each user action

Param                    UAS    AR     PREC   pCov
Prediction stride (s)
  1 (def)                27.4   30.9   44.8   20.7
  2                      22.6   36.5   48.4   15.9
  4                      16.8   42.3   53.2    9.4
  8                      10.6   43.7   55.1    7.1
Context window (c)
  8                      19.9   24.1   39.7   13.6
  32 (def)               27.4   30.9   44.8   20.7
  128                    30.0   32.5   47.8   28.5
  512                    30.8   33.7   47.9   31.0
Context shortening
  on (def)               27.4   30.9   44.8   20.7
  off                    27.7   31.2   44.3   21.4
Re-prediction
  on (def)               27.4   30.9   44.8   20.7
  off                    20.3   30.6   44.1   15.2

Multi-action: a whole block predicted per call

Param                    UAS    AR     PREC   pCov
Prediction stride (s)
  1 (def)                22.3   20.0   36.0   19.6
  2                      19.5   24.4   39.4   16.0
  4                      14.7   33.4   45.1   12.2
  8                       9.8   36.5   49.4    6.8
Context window (c)
  8                      16.2   17.5   33.8   12.3
  32 (def)               22.3   20.0   36.0   19.6
  128                    27.6   19.7   37.3   30.0
  512                    26.2   19.0   35.9   32.0
  2048                   27.4   19.4   36.6   33.0
Context shortening
  on (def)               22.3   20.0   36.0   19.6
  off                    24.2   21.5   38.4   21.4
Max ops / call (m)
  1                      20.3   30.6   44.1   15.2
  4                      20.8   23.1   39.2   19.4
  8                      22.1   19.4   36.1   18.9
  16                     21.8   20.6   36.2   20.2
  def                    22.3   20.0   36.0   19.6

Cheap, frequent triggers win; context saturates. Stride 1 maximises savings (sparser triggers raise acceptance but cut UAS); context helps up to ~128 ops then flattens; and re-prediction is essential (single-action UAS drops 27.4 → 20.3 without it).
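
The two knobs discussed here can be pictured with a small driver loop. This is a sketch under assumptions: stride s means the solver is queried after every s-th user action, and context window c means it sees only the last c actions. The function and parameter names are hypothetical; the defaults mirror the values marked "def" in the ablation (s = 1, c = 32).

```python
from typing import Callable, List, Sequence


def predict_on_stride(actions: Sequence[str],
                      solver: Callable[[List[str]], List[str]],
                      stride: int = 1,
                      context: int = 32) -> List[List[str]]:
    """Query `solver` every `stride` actions with the last `context` actions."""
    if stride < 1 or context < 1:
        raise ValueError("stride and context must be at least 1")
    history: List[str] = []
    predictions: List[List[str]] = []
    for step, action in enumerate(actions, start=1):
        history.append(action)
        if step % stride == 0:
            predictions.append(solver(history[-context:]))
    return predictions


def echo_solver(window: List[str]) -> List[str]:
    """Stand-in solver that returns its input, to expose what it was shown."""
    return list(window)


calls = predict_on_stride(list("abcdefgh"), echo_solver, stride=4, context=2)
# calls == [["c", "d"], ["g", "h"]]: two triggers, each seeing two actions.
```

A larger stride gives the solver fewer chances to help, which is one way to read the drop in UAS as s grows.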

Where it works — acceptance by operation category

per-category acceptance across solvers

Content beats chrome. Content ops (input / paste / fill) are accepted far more than presentational ones (align / number-format / border) — and in-domain fine-tuning lifts exactly the categories the base model handled worst.

distribution of acceptance streak lengths for GPT-5

Why modest acceptance still saves a lot

An acceptance streak is a run of consecutive accepted predictions between two user actions. Most are short (60% length 1), but a heavy tail chains 7+ accepts.

Savings come in bursts. The mean streak of 2.21 means each successful suggestion commits roughly twice the work the user would type — reconciling a ~32% per-step acceptance rate with much larger total savings.
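
Streak lengths are straightforward to compute from an event log. The sketch below assumes a trajectory recorded as a flat list of "user" and "accept" events, which is a hypothetical format chosen for illustration.

```python
from itertools import groupby
from statistics import mean
from typing import List, Sequence


def streak_lengths(events: Sequence[str]) -> List[int]:
    """Lengths of runs of consecutive "accept" events between user actions."""
    return [sum(1 for _ in run)
            for kind, run in groupby(events)
            if kind == "accept"]


events = ["user", "accept", "accept", "user", "accept",
          "user", "user", "accept", "accept", "accept"]
lengths = streak_lengths(events)
print(lengths)        # [2, 1, 3]
print(mean(lengths))  # 2
```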

🎯

Task is learnable

A small fine-tuned model comes close to a frontier LLM, and solvers show a clear capability gradient.

Know when not to act

Abstention & stopping criteria matter more than raw generative power.

Cheap triggers win

Predicting after every action saves most; sparser triggers raise acceptance but cut savings.

🧭

Context saturates

Most predictive signal sits in the last ~128 actions; beyond that, returns flatten.

Cold start

Acceptance climbs from ~12% early to ~24% late as patterns emerge — adaptive triggering could help.

📈

Savings come in bursts

Accepted predictions chain (mean streak 2.21), amplifying modest per-step acceptance.

Use it

Citation & resources

BibTeX
@inproceedings{agrawal2026nape,
  title     = {A Benchmark and Framework for Evaluating
               Next Action Predictions in Spreadsheets},
  author    = {Agrawal, Tejas and Le, Vu and
               Gulwani, Sumit and Verbruggen, Gust},
  booktitle = {Proceedings of the 43rd International
               Conference on Machine Learning (ICML)},
  year      = {2026}
}