Wayne Goodchild Senior Editor
Fact checked by: Jorgen Johansson
Updated: April 16, 2025
Hao AI Lab Test AI Language Models on Phoenix Wright: Ace Attorney

Large language models (LLMs) have been tested on video games before, but this time Hao AI Lab has used Ace Attorney, and for an intriguing reason. The core idea is that if an AI can follow the clues in a case and reach the right conclusion, that suggests the LLM has actually understood what it read.

In a recent post on X (formerly Twitter), Hao AI Lab outlined the inspiration for its latest experiment: a March 2023 conversation between Ilya Sutskever, a co-founder of OpenAI, and Jensen Huang, the CEO of NVIDIA, which touched on the implications of next-word prediction.

“When Ilya Sutskever once explained why next-word prediction leads to intelligence, he made a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story,” Hao AI Lab said on X. 

“Inspired by that idea, we turned to Ace Attorney to test AI’s reasoning. It’s the perfect stage: the AI plays as a detective to collect clues, expose contradictions, and uncover the truth.”

Objection!

Phoenix Wright: Ace Attorney is a visual novel that revolves around complex court cases. It was first released in 2001 for the Game Boy Advance but has since been ported to other platforms, such as the Nintendo Switch and PC.

“In our setup, models are tested on the intense cross-examination stage. It must spot contradictions and present the correct evidence to challenge witness testimony. Each level grants 5 lives, allowing limited tolerance for mistakes.”
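To make that setup concrete, here is a minimal Python sketch of a cross-examination loop with five lives. It is an illustration only, not Hao AI Lab's actual harness: the function name `run_cross_examination`, the `choose` callback standing in for the model, and the placeholder testimony and evidence are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a guess pairs a testimony statement index
# with the name of the evidence presented against it.
Guess = Tuple[int, str]


def run_cross_examination(
    choose: Callable[[List[str], List[str]], Guess],
    testimony: List[str],
    evidence: List[str],
    contradiction: Guess,
    lives: int = 5,
) -> Dict[str, object]:
    """Ask the model for guesses until it finds the contradiction or runs out of lives."""
    attempts = 0
    while lives > 0:
        attempts += 1
        if choose(testimony, evidence) == contradiction:
            return {"passed": True, "lives_left": lives, "attempts": attempts}
        lives -= 1  # a wrong presentation costs one life
    return {"passed": False, "lives_left": 0, "attempts": attempts}


# Placeholder case data, invented for this sketch (not taken from the game).
testimony = ["statement A", "statement B"]
evidence = ["item X", "item Y"]

# A scripted stand-in for a model: wrong on the first try, right on the second.
scripted = iter([(0, "item X"), (1, "item Y")])
result = run_cross_examination(
    lambda t, e: next(scripted), testimony, evidence, contradiction=(1, "item Y")
)
print(result)
```

In a real test the `choose` callback would wrap a call to the model, so every guess would also count as at least one API request.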

Edgeworth, Phoenix Wright’s nemesis.

Hao AI Lab noted that it used O1, Gemini 2.5 Pro, Claude 3.7-thinking, and Llama-4 Maverick for this experiment. These are multimodal models, meaning they can take in more than one type of input, such as images and text, and turn it into a suitable output. For example, shown an image of a cake, a multimodal model can generate a recipe for making that cake.

Hao AI Lab’s test measures three core skills: long-context reasoning, visual understanding, and strategic decision-making. While the Lab noted that O1 and Gemini 2.5 Pro performed the best, the overall game design proved a tough challenge.

The AI models in action.

“Game design pushes AI beyond pure textual and visual tasks by requiring it to convert understanding into context-aware actions. It is harder to overfit because success here demands reasoning over context-aware action space – not just memorization.”

AI Benchmarks

Hao AI Lab is part of the University of California, San Diego, and aims to democratize AI models. It does this primarily through open-source code, but also by investigating how AI models work, including their energy consumption and related factors.

As part of its attempt to create a robust benchmark, Hao AI Lab looks at things like how many API calls a model makes while it works. An API call is, basically, a request sent from one program to another, along with the response that comes back. Calls to a hosted model are billed by usage, so the more calls a model makes, the more a run tends to cost. The number of calls is not the whole story, though, because the price of each call also varies from model to model.

The API cost of the models used in the experiment.

“In our table for models that passed Level 1, O1 made the fewest API calls but still had the highest cost. The call count reflects strategy, not reasoning strength, as models that dig deeper into testimony naturally trigger more requests,” Hao AI Lab said on X.

“Beyond Level 1, as conversations get longer, O1’s cost skyrockets. In Level 2, which is a really long case, O1 cost over $45.75, while Gemini 2.5 Pro handled it for $7.89. That’s a massive gap!”
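The gap between call count and cost is easy to see with a little arithmetic. The Python sketch below is a rough illustration under stated assumptions: the `run_cost` helper, the call counts, the token counts, and the per-token prices are all invented for the example and are not the real figures for any model. Only the two Level 2 dollar amounts come from Hao AI Lab's post.

```python
def run_cost(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Total cost of a run, assuming every call uses the same number of tokens."""
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Invented numbers: a model that makes fewer calls at a higher price per token...
few_calls_pricey = run_cost(calls=20, tokens_per_call=10_000, usd_per_million_tokens=60.0)
# ...versus a model that makes more calls at a lower price per token.
many_calls_cheap = run_cost(calls=50, tokens_per_call=10_000, usd_per_million_tokens=10.0)

print(f"Fewer calls, pricier tokens: ${few_calls_pricey:.2f}")
print(f"More calls, cheaper tokens:  ${many_calls_cheap:.2f}")

# Ratio of the Level 2 costs quoted by Hao AI Lab. O1's figure is given as
# "over $45.75", so this is a lower bound on the gap.
level2_ratio = 45.75 / 7.89
print(f"O1 cost at least {level2_ratio:.1f}x as much as Gemini 2.5 Pro on Level 2")
```

With these made-up inputs, the model that makes fewer calls still ends up more expensive, which is the same pattern the Lab describes for O1.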

Hao AI Lab previously tested LLMs on the original Super Mario Bros. and is behind Game Arena, a site that shows how AI models perform in other video games.

Wayne Goodchild

Senior Editor

Editor, occasional game dev, constant dad, horror writer, noisy musician. I love games that put effort into fun mechanics, even if there’s a bit of jank here and there. I’m also really keen on indie dev news. My first experience with video games was through the Game and Watch version of Donkey Kong, because I’m older than I look.