arxiv:2604.18394

OpenGame: Open Agentic Coding for Games

Published on Apr 20 · Submitted by Jiaming Han on Apr 21
#3 Paper of the day

Abstract

OpenGame is an open-source agentic framework for end-to-end web game creation that uses specialized code models and evaluation benchmarks to overcome challenges in interactive application development.

AI-generated summary

Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes, together enabling the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state of the art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.
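The three evaluation axes above suggest a simple per-game aggregation. The sketch below is purely illustrative: the `BenchResult` fields, the equal weights, the 0-1 score ranges, and the `overall_score` name are my assumptions, not the paper's actual scoring formula.

```python
from dataclasses import dataclass

@dataclass
class BenchResult:
    """Scores on OpenGame-Bench's three axes (0-1 ranges assumed)."""
    build_health: float      # did the game build and run without runtime errors?
    visual_usability: float  # VLM judgment of rendered screenshots
    intent_alignment: float  # VLM judgment of fit to the original game prompt

def overall_score(r: BenchResult, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical aggregation: weighted mean of the three axes."""
    axes = (r.build_health, r.visual_usability, r.intent_alignment)
    return sum(w * a for w, a in zip(weights, axes))

print(round(overall_score(BenchResult(1.0, 0.8, 0.6)), 3))  # 0.8
```

In practice the weighting would likely be tuned per use case; a benchmark that gates on Build Health first (a broken build zeroes the other axes) would be an equally plausible design.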

Community


the most interesting bit for me is how the debug skill turns a living protocol of verified fixes into real cross-file coherence for a multi-file game project. i'd love to see an ablation where you disable the debug skill and rely on the template skill alone, because i suspect the long-horizon integration stability mostly comes from that patch library. the three-stage training for GameCoder-27B is neat, but execution-grounded rl on engine APIs will wobble with API drift, so i'm curious how you keep the patch knowledge in sync across engine versions. btw the arxivlens breakdown helped me parse the method details; there's a solid walkthrough here https://arxivlens.com/PaperView/Details/opengame-open-agentic-coding-for-games-3108-2a06e2e2


Thank you for the incredibly insightful feedback! You are exactly right about the Debug Skill; our ablation study confirms that removing it causes a catastrophic drop in Build Health, proving that while the Template Skill establishes the initial scaffolding, it is the Debug Skill's iterative patching that truly enforces intricate cross-file coherence. Regarding your excellent point on API drift, we mitigate this by relying on the framework's dynamic execution loop rather than static syntax memorization. Because GameCoder acquired deep, specialized latent knowledge of Phaser during its continual pre-training phase, it treats API drift as a standard debugging task. When an outdated API throws a runtime exception, the Debug Skill parses the exact browser traceback, hypothesizes an alternative syntax based on its internal engine priors, and tests it in the live environment. This execution-grounded trial-and-error allows the system to empirically "rediscover" the correct API pattern and self-heal across engine versions without needing external retrieval or weight updates.
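The self-healing loop described in this reply can be sketched roughly as follows. Everything here is a guess at the mechanism: the `run_game` callable (standing in for headless-browser execution), the fix-protocol format, and the substring signature matching are hypothetical, not OpenGame's actual interfaces.

```python
# Hypothetical sketch of the Debug Skill's execution-grounded repair loop.
# run_game(code) returns the first runtime traceback as a string, or None
# on a clean run; fix_protocol maps error signatures to verified rewrites.

def debug_loop(code: str, run_game, fix_protocol: dict, max_rounds: int = 5):
    """Repeatedly run the game, match errors against verified fixes, patch."""
    for _ in range(max_rounds):
        error = run_game(code)
        if error is None:
            return code, True  # game runs cleanly
        # Match the traceback against the living protocol of verified fixes.
        for signature, (old_api, new_api) in fix_protocol.items():
            if signature in error:
                code = code.replace(old_api, new_api)
                break
        else:
            return code, False  # no known fix: would fall back to the LLM
    return code, run_game(code) is None

# Toy example: an outdated Phaser-style API call raising a runtime error.
protocol = {"setTint is not a function": ("sprite.setTint(", "sprite.tint = (")}

def fake_runner(code):
    return "TypeError: sprite.setTint is not a function" if "setTint" in code else None

patched, ok = debug_loop("sprite.setTint(0xff0000)", fake_runner, protocol)
print(ok, patched)
```

The point of the sketch is the shape of the loop, not the patching itself: each fix is validated by re-executing in the live environment, which is what lets the protocol of fixes stay trustworthy across engine versions.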

Hi! Congrats for the great work!

Did you perform CPT+SFT+RL on top of the instruction-tuned Qwen 3.5 27B? Or did you have access to the base model which has not been published? Did you merge models in any part of the pipeline to avoid catastrophic forgetting?

Thank you


Thanks for your questions! We performed the entire CPT+SFT+RL pipeline directly on the open-source version of Qwen-3.5-27B. Because our framework targets a highly specialized domain, we intentionally accepted a drop in performance on unrelated general tasks. As long as the model's core coding, reasoning, and agentic abilities remained robust for game development, trading off general knowledge for deep engine specialization was an acceptable and necessary compromise for us.

This model hasn't been released yet, right? I can't seem to find the model card.



