# AI Leaderboard
GridPonder levels are fully deterministic with verified gold paths, making them a clean benchmark for evaluating AI planning and spatial reasoning.
## What is measured
Each turn the model receives a text-based board rendered with Unicode symbols and a legend, the goal description, the last action taken and how the board changed, and the full list of valid actions. The model can also maintain a persistent memory string it updates freely across turns — its only way to retain knowledge between steps.
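The per-turn payload described above might be sketched as follows. The field names and symbols here are illustrative assumptions, not the engine's actual schema:

```python
# Hypothetical shape of one turn's input; names are assumptions,
# not GridPonder's real schema.
turn_prompt = {
    # Text-based board rendered with Unicode symbols
    "board": "▓▓▓▓▓\n▓·◉·▓\n▓·☗·▓\n▓▓▓▓▓",
    # Legend mapping each symbol to its meaning
    "legend": {"▓": "wall", "·": "floor", "◉": "goal", "☗": "player"},
    "goal": "Move the player onto the goal tile.",
    # Last action taken and how the board changed
    "last_action": {"action": "move", "direction": "up"},
    "board_delta": "Player moved up one tile.",
    # Full list of valid actions this turn
    "valid_actions": [
        {"action": "move", "direction": d}
        for d in ("up", "down", "left", "right")
    ],
    # Persistent memory string the model rewrites freely each turn
    "memory": "Tried left first; dead end on the west side.",
}
```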
## How scores work
Two metrics per run:
- Accuracy — did the model solve the level within the step limit? Any valid solution counts, not just the gold path.
- Efficiency — total actions across all attempts (including restarts) divided by the gold path length. Lower is better.
The model can give up and restart an attempt at any time. An auto-reset triggers if it exceeds 3× the gold path length in a single attempt.
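The two metrics and the auto-reset rule can be expressed directly; this is a minimal sketch of the arithmetic as stated above, with hypothetical function names:

```python
def efficiency(total_actions: int, gold_path_len: int) -> float:
    """Total actions across all attempts (including restarts)
    divided by gold path length. A perfect run scores 1.0;
    lower is better."""
    return total_actions / gold_path_len

def should_auto_reset(actions_this_attempt: int, gold_path_len: int) -> bool:
    """Auto-reset triggers once a single attempt exceeds
    3x the gold path length."""
    return actions_this_attempt > 3 * gold_path_len

# Example: a 10-step gold path solved in 25 total actions
assert efficiency(25, 10) == 2.5
assert should_auto_reset(31, 10)
assert not should_auto_reset(30, 10)  # exactly 3x does not trigger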
## Which models
The built-in agent supports Claude Haiku, Sonnet, and Opus (with optional extended thinking for Sonnet and Opus). The agent interface is open — any model that reads a text prompt and outputs a JSON action object can be wired up. The prompt is plain text; no vision or special modalities required.
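Since any model that reads a text prompt and emits a JSON action object can be wired up, the contract is small. Below is a hedged sketch of what such an interface could look like in Python; the real GridPonderAgent interface lives in the engine and may differ in names and signatures:

```python
from typing import Protocol

class AgentProtocol(Protocol):
    """Illustrative stand-in for the engine's GridPonderAgent
    interface; method name and signature are assumptions."""
    def act(self, prompt: str) -> dict:
        """Read the plain-text turn prompt, return a JSON-style
        action object."""
        ...

class EchoAgent:
    """Toy agent: always returns a fixed action. A real agent would
    send `prompt` to a model and parse its JSON reply."""
    def act(self, prompt: str) -> dict:
        return {"action": "move", "direction": "up"}
```

No vision or special modalities are involved: the whole exchange is the plain-text prompt in and one JSON object out.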
## Rankings
| # | Model | Pack | Level | Solved | Total actions | Gold path | Efficiency | Date |
|---|---|---|---|---|---|---|---|---|

No entries yet. Submit a result via GitHub.
## How to submit a result
Run the built-in agent from the settings screen in the GridPonder app, or wire up your own model via the GridPonderAgent interface in the engine.
Then open a GitHub issue with:
- Model name and version (e.g. claude-sonnet-4-6, extended thinking on/off)
- Pack ID and level ID
- Whether the level was solved
- Total actions taken across all attempts
- Gold path length for reference
- Optional: step-by-step transcript or memory trace
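A submission covering the checklist above might look like this (all values, pack, and level IDs are illustrative placeholders):

```markdown
**Model:** claude-sonnet-4-6 (extended thinking: on)
**Pack ID:** pack-01
**Level ID:** level-07
**Solved:** yes
**Total actions (all attempts):** 25
**Gold path length:** 10
**Efficiency:** 2.5
**Transcript:** attached below (optional)
```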