🏆 AI Leaderboard

GridPonder levels are fully deterministic with verified gold paths, making them a clean benchmark for evaluating AI planning and spatial reasoning.

🧩 What is measured

Each turn the model receives a text-based board rendered with Unicode symbols and a legend, the goal description, the last action taken and how the board changed, and the full list of valid actions. The model can also maintain a persistent memory string it updates freely across turns — its only way to retain knowledge between steps.
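The per-turn input can be pictured as a structured payload like the following. This is a sketch for illustration only; the field names are assumptions, not the engine's actual schema.

```typescript
// Hypothetical shape of the per-turn observation the model receives.
// Field names are illustrative assumptions, not the engine's real schema.
interface TurnObservation {
  board: string;              // Unicode-rendered grid
  legend: string;             // what each symbol means
  goal: string;               // goal description for the level
  lastAction: string | null;  // previous action, null on the first turn
  boardDelta: string | null;  // how the board changed after lastAction
  validActions: string[];     // full list of currently valid actions
  memory: string;             // persistent string the model rewrites each turn
}

const example: TurnObservation = {
  board: "⬜⬜🔶\n⬜🤖⬜",
  legend: "🤖 = player, 🔶 = goal, ⬜ = floor",
  goal: "Reach the 🔶 tile.",
  lastAction: null,
  boardDelta: null,
  validActions: ["move_up", "move_right"],
  memory: "",
};
```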

📊 How scores work

Two metrics per run:

  • Accuracy — did the model solve the level within the step limit? Any valid solution counts, not just the gold path.
  • Efficiency — total actions across all attempts (including restarts) divided by the gold path length. Lower is better.

The model can give up and restart an attempt at any time. An auto-reset triggers if it exceeds 3× the gold path length in a single attempt.
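The scoring rules above are simple enough to sketch directly. The function names here are illustrative, not part of the engine:

```typescript
// Efficiency: total actions across all attempts (including restarts)
// divided by the gold path length. Lower is better; 1.0 is a perfect run.
function efficiency(totalActions: number, goldPathLength: number): number {
  return totalActions / goldPathLength;
}

// Auto-reset fires once a single attempt exceeds 3x the gold path length.
function shouldAutoReset(actionsThisAttempt: number, goldPathLength: number): boolean {
  return actionsThisAttempt > 3 * goldPathLength;
}

// Example: 24 total actions on a level with a 12-step gold path.
// efficiency(24, 12) → 2.0
```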

🤖 Which models

The built-in agent supports Claude Haiku, Sonnet, and Opus (with optional extended thinking for Sonnet and Opus). The agent interface is open — any model that reads a text prompt and outputs a JSON action object can be wired up. The prompt is plain text; no vision or special modalities required.
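Wiring up a custom model could look roughly like this. The `GridPonderAgent` method name and signature shown here are assumptions for illustration; check the engine source for the real interface.

```typescript
// A sketch of a custom agent. The interface below is a guess at what
// GridPonderAgent looks like; consult the engine for the real signature.
interface GridPonderAgent {
  act(prompt: string, memory: string): Promise<{ action: string; memory: string }>;
}

// Stub model: picks the first valid action named in the prompt.
// Replace callModel with a real LLM API call in practice.
async function callModel(prompt: string, memory: string): Promise<string> {
  const match = prompt.match(/valid actions: (\w+)/);
  return JSON.stringify({ action: match ? match[1] : "wait", memory });
}

class MyModelAgent implements GridPonderAgent {
  async act(prompt: string, memory: string) {
    const raw = await callModel(prompt, memory);
    // The engine expects a JSON action object; parse and return it.
    return JSON.parse(raw) as { action: string; memory: string };
  }
}
```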

Rankings

No entries yet. Submit a result via GitHub.

Leaderboard columns, once populated: #, Model, Pack, Level, Solved, Total actions, Gold path, Efficiency, Date.

How to submit a result

Run the built-in agent from the settings screen in the GridPonder app, or wire up your own model via the GridPonderAgent interface in the engine. Then open a GitHub issue with:

  • Model name and version (e.g. claude-sonnet-4-6, extended thinking on/off)
  • Pack ID and level ID
  • Whether the level was solved
  • Total actions taken across all attempts
  • Gold path length for reference
  • Optional: step-by-step transcript or memory trace
Open an issue on GitHub →