
Local Qwen isn't a worse Opus, it's a different tool

blog.alexellis.io|409 points|220 comments|by alphabettsy|Jun 18, 2026

Many claim that local models like Qwen 27B or 35-A3B are "near-Opus level." My view doesn't rest on a cursory glance, a random X post about canceling a Claude subscription, or a hobbyist's report of a model crawling at single-digit tokens per second with a tiny 32K context window. Nor is this a tweet from a celebrity CEO coding on a plane.

Instead, this is a transparent account from a founder of a small software business where local models have provided actual, albeit caveated, value. I have skin in the game, but no motive to shill cloud or local solutions; I simply want local models to be reliable.

What this exploration covers:

  • How the hardware investment paid for itself in 2-3 months.
  • The specific business use cases it serves.
  • Why unsupervised trust is still impossible.
  • Qwen's biggest flaws: infinite loops and hallucinations (especially when quantized for consumer GPUs).

RTX 6000 Pro Power Connectors


My AI Use Case & Background

My path as a founder and maintainer began with OpenFaaS, which I built entirely by hand back in 2016. I laid the foundation alone and then grew it through community involvement—not because I lacked the ability to solo it, but because I wanted a successful open-source ecosystem.

My professional trajectory looked like this:

  1. 2017: Joined VMware to fund my time.
  2. 2019: Shifted toward an open-core, bootstrapped company model due to market changes.

The Product Ecosystem

Our lean team currently manages a suite of tools focused on efficiency, control, and autonomy.

These products rely on low-level Linux primitives: containers, Firecracker microVMs, network protocols, and Kubernetes. They are primarily written in Go, with some React for UIs and documentation. Because we are small, we provide high-touch, "non-scalable" support to our users.

I've used AI tools since their inception: starting with VS Code tab completion, moving to ChatGPT for bug hunting, and eventually spending 12 hours a day in tmux. I even built Superterm.dev to manage my sessions and visualize coding agents. I've watched AI evolve from "boilerplate reduction" to "end-to-end architecture." While I still handle my own writing, I rarely write code by hand now and rely heavily on Claude or Codex.


The Frontier Intelligence Shift

Between November 2025 and January 2026, a paradigm shift occurred. Developers on X began reporting that Claude Opus had evolved to the point of handling nearly all their professional workloads.

  • Manual coding became as obsolete as milk left in the sun.
  • Cost: Top-tier plans settled around $200 / mo for individuals.
  • Limits: If you manage your unattended tasks, you can stretch the 5-hour and weekly limits.

The Case for Local Models

One might ask: "Why use anything less than the absolute best?"

In 2026, we are in a strange era where any idea can be cloned overnight by an unknown competitor using a subscription in a developing nation. I've seen this happen to SlicerVM (hand-written in 2022) and Superterm (created in 2026 via agents).

While a "vibecoded" clone isn't the same as a well-architected solution backed by an experienced team, in a market where the cost of software drops to near zero, "free and good enough" often wins.

The Capacity Gap

There is a massive difference in scale between frontier and local models. Frontier models are estimated at: $\text{Parameters} \approx 0.5\text{T to } 2\text{T}$

This isn't just a marginal increase; it's a different league of reasoning and knowledge. Yet, a dense model like Qwen 3.6 27B performs surprisingly well.

| Model | SWE-Bench Verified Score |
| --- | --- |
| Claude Opus 4.8 | 88.6% |
| Qwen 3.6 27B | 77.2% |

This gap of roughly 11.4 percentage points leads people to shout that "local is nearly SOTA," claiming a 6-year-old GPU can replace a $200/mo subscription.
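One way to read the two scores in the table is to flip them into unresolved rates instead of pass rates. The sketch below is only arithmetic on the quoted figures; the helper name `unresolvedRate` is made up for illustration, and it assumes both scores were measured on the same task set.

```go
package main

import "fmt"

// unresolvedRate converts a benchmark pass rate (in percent) into the
// share of tasks left unresolved (also in percent).
func unresolvedRate(passPercent float64) float64 {
	return 100 - passPercent
}

func main() {
	opus := 88.6 // SWE-Bench Verified score quoted in the table
	qwen := 77.2 // SWE-Bench Verified score quoted in the table

	gap := opus - qwen
	ratio := unresolvedRate(qwen) / unresolvedRate(opus)

	fmt.Printf("gap: %.1f points\n", gap)
	fmt.Printf("unresolved: %.1f%% vs %.1f%%\n", unresolvedRate(opus), unresolvedRate(qwen))
	fmt.Printf("ratio of unresolved tasks: %.1fx\n", ratio)
}
```

Seen this way, the smaller model leaves roughly twice as many benchmark tasks unresolved, which is a less flattering framing than "only 11 points behind."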


The Trap of "Benchmaxxing"

Benchmarks are moving targets. Since they are public, models can be tuned specifically to score higher on them.

The SWE-Bench Verified test focuses on Python issues. While Python supports async and threading, most of the code the benchmark exercises is single-threaded and synchronous. This is fundamentally different from our work: distributed systems written in Go.

In Go, we deal with:

  • channels
  • contexts
  • structs
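For readers less familiar with Go, here is a minimal sketch of how these three pieces tend to meet in one function. It is not taken from our products: the `job` type, `consume`, and `runDemo` are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// job is a hypothetical unit of work, used only for illustration.
type job struct {
	ID      int
	Payload string
}

// consume reads jobs from the channel until it is closed or the context
// is cancelled, and returns how many jobs it handled.
func consume(ctx context.Context, jobs <-chan job) (int, error) {
	handled := 0
	for {
		select {
		case <-ctx.Done():
			return handled, ctx.Err()
		case j, ok := <-jobs:
			if !ok {
				return handled, nil // channel closed: normal exit
			}
			_ = j.Payload // real work would happen here
			handled++
		}
	}
}

// runDemo queues n jobs and consumes them under a timeout given in
// milliseconds. closeChannel controls whether the producer signals completion.
func runDemo(n int, timeoutMillis int, closeChannel bool) (int, error) {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	jobs := make(chan job, n)
	for i := 0; i < n; i++ {
		jobs <- job{ID: i, Payload: "demo"}
	}
	if closeChannel {
		close(jobs)
	}
	return consume(ctx, jobs)
}

func main() {
	handled, err := runDemo(3, 100, true)
	fmt.Println(handled, err)

	// Without the close, only the context deadline ends the loop.
	handled, err = runDemo(3, 100, false)
	fmt.Println(handled, errors.Is(err, context.DeadlineExceeded))
}
```

The second call shows why this style of code is unforgiving: a single forgotten `close` turns a clean exit into a wait that only the deadline rescues, and nothing in a single-threaded Python benchmark exercises that kind of mistake.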

When a local model fails, it doesn't just give a wrong answer; it often enters a failure state like this:

// Example of a hallucinated infinite loop risk
for {
    // The model might forget the exit condition 
    // or hallucinate a variable that never changes
    if condition { 
        break 
    }
    // ... logic that never triggers 'condition'
}

This risk of infinite loops and hallucinations spikes significantly when you quantize the model to fit it onto consumer-grade hardware.
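To see why quantization is hard to avoid on consumer hardware, a rough back-of-the-envelope sketch helps. It counts the weights only, at bits per weight divided by eight bytes each, and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The helper `weightGB` is a made-up name for illustration.

```go
package main

import "fmt"

// weightGB estimates the memory needed for model weights alone, in
// gigabytes (10^9 bytes), from a parameter count and bits per weight.
// It deliberately ignores KV cache, activations, and runtime overhead.
func weightGB(params float64, bitsPerWeight float64) float64 {
	return params * bitsPerWeight / 8 / 1e9
}

func main() {
	const params = 27e9 // a 27B dense model

	for _, bits := range []float64{16, 8, 4} {
		fmt.Printf("%2.0f-bit weights: ~%.1f GB\n", bits, weightGB(params, bits))
	}
}
```

Under these assumptions the weights alone come to about 54 GB at 16-bit, 27 GB at 8-bit, and 13.5 GB at 4-bit, so only the most aggressive of the three leaves room for context on, say, a 24 GB card.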