I used Meta Llama 4, Qwen 3-Coder and Gemma 4 to develop a Python application, and only one model is worth keeping for developers.

Local LLMs are more capable than ever and, unsurprisingly, are part of more workflows than ever. From coding and brainstorming to research and agent automation, running models locally has allowed entire communities of professionals to get more out of their time, at no cost.

Although the defining feature of local AI is that it is free to use, it is certainly not unlimited. Storage, system memory, and especially VRAM are finite resources, which means that each model occupying your machine is effectively competing for a spot there. To find out which one deserved the spot most, I tested Meta’s Llama 4 Scout 17B, Qwen3-Coder 30B, and Gemma 4 26B against each other by asking them to create the same Python application. Here’s the story behind the model that earned its keep.

Pygame was the benchmark

“Why a game and not a script?” Well, there is an answer.

I subscribe to the idea that games are the best test of user experience. They are the most demanding category of everyday software that most people use, because the cost of a small error in the code is immediately and viscerally obvious. A slow app menu can be annoying, but a game with inverted controls or a scoreboard that is off by just a few pixels feels unplayable the instant you touch it. That makes a game a much better test of instruction following than most coding prompts. While a narrow coding task only tests whether a model can follow an instruction, a game tests whether it can follow the instruction and also make sound decisions in the space the instruction did not cover.

I decided to check each model’s capabilities by asking it to build a Pygame game inspired by an old game you’d find pre-installed on a Nokia 3310 phone. As always, to keep the playing field level, all models received the same prompt, word for word.

“Build a simple side-scrolling shooter in Python using Pygame. The player controls a ship on the left side of the screen that moves up, down, left, and right within the boundaries of the screen and fires projectiles to the right with the space bar. Enemies appear from the right edge at random heights and move left toward the player. Destroying an enemy with a projectile increases the score, and an enemy that collides with the player eliminates a life.”

Llama 4 did not survive its first frame

A typo, and a bigger problem underneath

If there’s one thing that tells you a model isn’t particularly tuned for coding, it’s what happened when I launched Llama 4’s Pygame submission. In short, the game crashed on launch, period. Curious to see what went wrong, I went into the code. It turned out to be a single bad dictionary lookup: a line meant to read the player’s vertical position referenced a value that was never created, causing the game to throw an error before anything had a chance to appear on the screen. Fortunately, it was a one-line fix.

But that was not the end of the fiasco. The movement controls were reversed, with the left arrow sending the ship to the right and vice versa. On top of that, an enemy that collided with the player was not removed from the game as expected and, as a result, gave the player no respite after the hit. This game was incredibly broken.
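To make those two bugs concrete, here is a minimal sketch of what correct behavior could look like. It is not the code any of the models produced. Every name in it (DIRECTIONS, move_player, handle_player_collisions) is hypothetical, and rectangles are plain (x, y, width, height) tuples so the logic runs without Pygame installed.

```python
# Hypothetical names throughout; rectangles are (x, y, width, height) tuples.
SPEED = 5
SCREEN_W, SCREEN_H = 800, 600

# Each direction maps to the change it should apply: "left" must decrease x.
DIRECTIONS = {
    "left": (-SPEED, 0),
    "right": (SPEED, 0),
    "up": (0, -SPEED),   # screen y grows downward, so "up" is negative
    "down": (0, SPEED),
}


def overlaps(a, b):
    """Axis-aligned overlap test for two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def move_player(player, pressed):
    """Apply every pressed direction, then clamp the ship to the screen."""
    x, y, w, h = player
    for name in pressed:
        dx, dy = DIRECTIONS[name]
        x, y = x + dx, y + dy
    x = max(0, min(x, SCREEN_W - w))
    y = max(0, min(y, SCREEN_H - h))
    return (x, y, w, h)


def handle_player_collisions(player, enemies, lives):
    """Drop each enemy that touches the player and charge one life for it."""
    survivors = []
    for enemy in enemies:
        if overlaps(player, enemy):
            lives -= 1  # the enemy is not kept, so it cannot hit again next frame
        else:
            survivors.append(enemy)
    return survivors, lives
```

Because the colliding enemy is dropped from the list, calling handle_player_collisions again on the next frame costs no further lives, which is the respite the broken version never gave.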

Gemma 4 made a solid attempt, with two bugs under the hood

It looked good but played a little poorly.

Perhaps the most visually disciplined attempt came from Google’s open model, Gemma 4 26B. The parallax effect on the star field added depth: distant stars in the background appeared smaller, dimmer, and slower, while nearby stars moved faster and appeared brighter. The player’s ship was also the simplest of the three attempts, with the model opting for a plain triangular silhouette rather than a more developed design.

However, the gameplay was another story, and the code arrived with two bugs under the hood. The first allowed a single projectile to destroy two overlapping enemies and award points for both, because the collision-checking loop did not end after the initial impact. If that sounds like a minor complaint, the second problem certainly isn’t. Enemy projectiles could hit the player’s ship without reducing lives at all, effectively removing the fail state from the game. This means you can play all day, aimlessly, as there are no stakes and nothing to lose. That makes it a bad game.
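The first of those bugs usually comes down to a missing early exit. The sketch below is an assumption about the shape of such a fix, not Gemma 4's actual code. The function name resolve_projectile_hits is hypothetical, and rectangles are again plain (x, y, width, height) tuples so that no Pygame install is needed.

```python
def overlaps(a, b):
    """Axis-aligned overlap test for two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def resolve_projectile_hits(projectiles, enemies, score):
    """Let each projectile destroy at most one enemy."""
    enemies = list(enemies)  # work on a copy so the caller's list is untouched
    remaining = []
    for shot in projectiles:
        for enemy in enemies:
            if overlaps(shot, enemy):
                enemies.remove(enemy)  # safe: we leave the inner loop right away
                score += 1
                break  # the shot is spent, so stop checking other enemies
        else:
            remaining.append(shot)  # no enemy was hit; the shot keeps flying
    return remaining, enemies, score
```

Without the break, the inner loop would keep testing the same shot against the remaining enemies and score every overlap, which matches the double award described above.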

Qwen3-Coder behaved as if it was built for coding

“Claude, is that you?” — My honest reaction

After spending some time debugging the other models’ submissions, I wasn’t expecting much by the time I got to Qwen3-Coder’s attempt. When I finally ran it, the result seemed almost surreal. It was the only entry that ran correctly without needing a single fix. Every part of the game state, from the enemy and projectile lists to the score counter and collision loops, was initialized and wired into a functional game loop that bordered on perfection. This was a model designed for coding, and the results showed it.

I put the finer details of the game under the microscope, and they surprised me too. After spending some time getting the timing of my shots right, I deliberately forced two projectiles to hit the same enemy in the same frame and found that only one registered. The model had dealt with the duplicate scoring problem, which sounds trivial until you realize that another equally capable model failed under similar conditions. The controls behaved as expected, collisions made sense, and the gameplay loop had a lot going on. I may have spent a few minutes forgetting about bug hunting and just enjoying the game for its own sake, which in itself demonstrates the smooth experience Pygame is capable of delivering when programmed correctly.

Qwen3-Coder doesn’t need you to clean up its mess

The three models I tested are among the front-runners in the local AI race, but only one of them produced software that didn’t immediately send me hunting for bugs. Qwen3-Coder was undoubtedly the most capable, but surprisingly it was also the least demanding. When it comes to coding questions, there is an obvious answer. At least when you run AI locally.
