The Open-Weight Escape Valve (Part 3 of 3)

The Quiet Half Of The Story

In Part 1 of this series I laid out what happened to flat-rate AI subscriptions in March and April. In Part 2 I walked through the arithmetic that made all of those events inevitable. Both posts were, frankly, the gloomy half of the picture. Pricing emails landing, capex-versus-revenue gaps, postmortems from your favourite vendor.

Now I want to do the half I think most kafkai.ai readers underestimate.

While the frontier subscription market was being repriced in front of us, something else was happening underneath. A real open-weight tier matured. Not in a research-press-release way. In a "you can ship a product on this today" way. I have been writing about this corner of the market since the state of local LLMs piece in January. Four months later, the picture is meaningfully better than it was then.

The point of this third post is not to tell you to abandon Claude or ChatGPT. Frontier still matters. The point is that the gap that justified premium pricing is narrower than it was a year ago, and the routing decision, not the subscription decision, is now the one that pays the bills.


Ollama Is The Platform Layer Now

The number to anchor on is Ollama. In Q1 2026, Ollama hit roughly 52 million monthly downloads. That is up something like 520x from Q1 2023. Hugging Face now hosts around 135,000 GGUF models. The local LLM ecosystem has gone from a hobbyist corner to a deployable platform layer in under three years.

What that means in practice, for a small business, is that you can now run inference on your own hardware at zero marginal cost per request, and the quality of what you can run is somewhere in the 70 to 85 percent range of frontier quality on most tasks. That is not "almost as good." It is "good enough for a meaningful slice of the workload that flat-rate Claude was carrying for you yesterday." The slice is not all of it. It is enough.
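To make "inference on your own hardware" concrete, here is a minimal sketch of calling a locally running Ollama server over its HTTP API, using only the Python standard library. It assumes the server is listening on Ollama's default local port, and the model tag is whatever you have already pulled. Treat it as a starting point under those assumptions, not a production client.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object
    # whose "response" field holds the full completion.
    return body["response"]
```

Calling generate("your-local-model", "Summarise this email in one sentence: ...") with a tag shown by `ollama list` returns the completion as a string. No per-request fee is involved; the cost is your hardware and electricity.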

Ollama themselves moved up the stack this year with Ollama Cloud. The free tier has daily quotas. Pro is $20 a month, Max is $100. Local deployment, the thing most kafkai.ai readers will care about, remains free regardless. The structural point about Ollama Cloud is not the price. The price is fine. The structural point is that Ollama does not bear the training cost of frontier models the way Anthropic and OpenAI do. Their margin pressure is much lower because their cost stack is much lower. They are reselling and orchestrating someone else's training investment, which is a different and easier business than carrying it.

If you have not set up Ollama yet, the running LLMs locally piece is a reasonable starting point. The basics have not changed. The model selection has expanded. That is the next part.

The February-To-April Wave

Between February and the first week of April 2026, the open-weight coding model space received a series of releases that, taken together, change what you should plan for. I am going to go through them quickly. The point is not a deep technical comparison. The point is that "open weights" is no longer a synonym for "second tier."

A useful single source for the comparison is the Atlas Cloud roundup of Kimi K2.6, GLM 5.1, Qwen 3.6 Plus and MiniMax M2.7. The headline claims, distilled:

GLM-5.1, released by Z.AI on April 7, 2026. A 754B-parameter mixture-of-experts model, MIT licensed. It tops SWE-Bench Pro at 58.4%, beating GPT-5.4 and Claude Opus 4.6 on that specific benchmark. The MIT licence is the part many small businesses will care about most. You can use it commercially without negotiating with anyone.
Qwen 3.6 Plus, from Alibaba, late March 2026. 1 million token context window. Leads Terminal-Bench 2.0 at 61.6% against Claude Opus 4.6's 57.5%. Available as a free preview on OpenRouter at the time of writing.
Kimi K2.6, from Moonshot, April 2026. Hits 80.2% on SWE-Bench Verified, just under Claude Opus 4.6's 80.8%. Open source. The agentic loop reliability on Kimi K2.6 has been getting noticed publicly, including reports of it sustaining tool calls over very long sessions.
DeepSeek V4 Flash. Best performance-to-cost in the bunch when self-hosted. Roughly $0.01 per benchmark task run. If you have ever wanted a self-hosted model where the cost-per-call is genuinely a rounding error, this is the one.
Qwen3-Coder-Next, February 2026. 80B total parameters with 3B active. Roughly Claude Sonnet 4.5 quality on coding tasks. Runs on a Mac Studio. That last sentence is the one I want you to hold onto. A model in the same neighbourhood as a paid frontier tier from late last year, running on hardware that fits under your desk.

Five names, five concrete capability claims, five different licences and architectures. Two years ago this paragraph would have been "DeepSeek and Llama, and good luck." That is no longer the situation.

Claude Code Is Already Set Up For This

The best part is that Claude Code, the same Claude Code that the entire first half of this series has been describing as a flat-rate pressure point, is interoperable with these other models out of the box.

You just run Ollama and specify the model you want to use like this:

ollama launch claude --model qwen3.6

The same skills, the same MCP servers, the same subagents, the same hooks all keep working.

I am going to be careful with this claim because it is easy to oversell. The behaviour is not literally identical. Some skills assume Claude-specific quirks. Some agentic patterns are tuned for Claude's tool-call style. You will hit edges. But the core loop, "give an instruction, watch a model take steps, read tool output, edit files, ask for review", works against multiple model backends with very little plumbing. There are now developers publishing step-by-step guides showing zero monthly Anthropic spend while continuing to use the Claude Code interface.

This is not, to be clear, an instruction to do that. I run a small AI company. We still pay all the major AI companies for fixed plans because, for a meaningful share of our work, the frontier model still matters and the time saved is worth more than the subscription. What I want kafkai.ai readers to understand is that the lock-in story has weakened. If your bill triples on June 1 because of how a particular agentic workflow runs, you have meaningful options. The options were not there in 2024.
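The reason the core loop ports across backends is easier to see in code. The sketch below is purely illustrative and is not how Claude Code is implemented. Every name in it (Backend, ScriptedBackend, run_loop, the "DONE" marker, the "tool argument" step format) is a hypothetical I made up for the example. The point is only that the loop depends on a narrow interface, so the model behind it can be swapped.

```python
from typing import Callable, Protocol


class Backend(Protocol):
    """Anything that can propose the next step given the transcript so far."""

    def next_step(self, transcript: list[str]) -> str: ...


class ScriptedBackend:
    """Stand-in for a real model: replays fixed steps, then says DONE."""

    def __init__(self, steps: list[str]) -> None:
        self._steps = list(steps)

    def next_step(self, transcript: list[str]) -> str:
        return self._steps.pop(0) if self._steps else "DONE"


def run_loop(
    backend: Backend,
    instruction: str,
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> list[str]:
    """Instruction in, model steps and tool output recorded, stop on DONE."""
    transcript = [f"instruction: {instruction}"]
    for _ in range(max_steps):
        step = backend.next_step(transcript)
        transcript.append(f"model: {step}")
        if step == "DONE":
            break
        # Hypothetical step format: "<tool name> <argument>".
        name, _, argument = step.partition(" ")
        tool = tools.get(name)
        output = tool(argument) if tool else f"unknown tool: {name}"
        transcript.append(f"tool: {output}")
    return transcript
```

Swapping the model means passing a different object with a next_step method. The edges I mentioned above live in how well each model produces steps the tools can actually parse.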

Three Things SMEs Should Do

This is the part I want every reader to take away, in the same shape as the closing list from the April 9 piece that started this whole thread.

Treat frontier AI as a premium SKU. Reserve Claude Opus, OpenAI GPT, and the other top-of-stack models for the 10 to 15 percent of your work that genuinely benefits from frontier reasoning. Hard contract drafting. Strategic analysis. Multi-step research. Anything where the answer is the product. Route the rest, classification, summarisation, drafting, simple agents, to Claude Sonnet, Haiku, or open-weight models via OpenRouter or Ollama. The bill drops by a meaningful percentage immediately and the quality on the rerouted work does not noticeably change.
Build a routing strategy now, not after the next price change. The next twelve months will see more usage-based billing rollouts, more "small tests on 2% of users," more silent throttling, and almost certainly at least one more pricing-page incident that reverses inside a day. If you already know which model handles which workload in your business, those changes pass through your invoice without a budget shock. If you do not know, the next change in your inbox will hurt twice. Once because you pay more, and once because you spend a week reorganising under deadline pressure.
Don't forget the ground. This is the same point I closed the April 9 piece with. Practical AI tools at $200 to $500 a month already deliver real ROI for small businesses. Local LLMs on consumer hardware now deliver 70 to 85 percent of frontier quality. Japan has its own open-weight option in NII's llm-jp-3-172b-instruct3, trained on Japanese data with over 1,900 researchers contributing through GENIAC. A small business that pairs a vibe coding workflow with practical AI tools and a routing layer that hits the frontier only when needed is in a much stronger position than one that has built everything around a single $20 subscription. Most of the 78% of companies seeing real AI results are getting them through this kind of practical execution, not through frontier brand loyalty. Tools like our own competitive intelligence platform are built for that economics, not against it.
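A routing strategy can start as something as small as a lookup table. The sketch below maps workload types to tiers, reports what share of requests hits the frontier, and compares an everything-on-frontier bill against a routed one. The workload labels, tier names, and per-request prices are all assumptions picked for illustration, not vendor figures. The open-weight tier is priced at zero to mirror the zero-marginal-cost point about local inference, which ignores hardware and electricity.

```python
# All labels and prices here are illustrative assumptions, not vendor data.
FRONTIER = "frontier"
MID_TIER = "mid-tier"
OPEN_WEIGHT = "open-weight"

ROUTES = {
    "contract_drafting": FRONTIER,
    "strategic_analysis": FRONTIER,
    "multi_step_research": FRONTIER,
    "drafting": MID_TIER,
    "simple_agent": MID_TIER,
    "classification": OPEN_WEIGHT,
    "summarisation": OPEN_WEIGHT,
}

# Assumed cost per request in cents (integers avoid float rounding).
ASSUMED_CENTS_PER_REQUEST = {FRONTIER: 10, MID_TIER: 2, OPEN_WEIGHT: 0}


def route(workload: str) -> str:
    """Return the tier for a workload; unknown workloads stay off the frontier."""
    return ROUTES.get(workload, OPEN_WEIGHT)


def frontier_share(request_counts: dict[str, int]) -> float:
    """Fraction of all requests that the table sends to the frontier tier."""
    total = sum(request_counts.values())
    if total == 0:
        return 0.0
    frontier = sum(n for w, n in request_counts.items() if route(w) == FRONTIER)
    return frontier / total


def monthly_cost_cents(request_counts: dict[str, int], routed: bool = True) -> int:
    """Total cost in cents; routed=False sends every request to the frontier."""
    total = 0
    for workload, count in request_counts.items():
        tier = route(workload) if routed else FRONTIER
        total += count * ASSUMED_CENTS_PER_REQUEST[tier]
    return total
```

With 100 contract drafting, 300 drafting, and 600 classification requests in a month, the frontier share is 10 percent, and under these assumed prices the bill falls from 10,000 cents to 1,600 cents. Your own numbers will differ. The useful part is having the table written down before the next pricing email arrives.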

Closing The Loop

Three weeks ago I wrote that data centres in space looked like a bubble and that small businesses should be looking at the ground. Two posts ago I followed up with what was happening to flat-rate AI subscriptions. One post ago I walked through the arithmetic. This post is the one that says what to do.

Here is the summary, as plain as I can make it. The bubble is starting to deflate. It is deflating in your terminal first, not in orbit. Small businesses that built their entire AI strategy around flat-rate frontier subscriptions are going to feel the next twelve months. Small businesses that build a routing-first, hybrid-aware stack, with frontier as a premium SKU and open weights as the default for everything that does not need the top tier, are not. The tools to build that hybrid stack are mature, today, in a way they were not last year.

I am still on the ground. What looks like a bubble is still deflating. There is still no shortage of problems to solve down here.

That is the end of this short series. If you read all three posts, thank you. If you only read this one, the action items above are the part I most want to land. The next time you read a headline about an AI provider tightening limits or repricing a subscription, I would suggest a different reaction than panic. Open your config. Check which workloads are pointed at the frontier. Move the ones that do not need it. Then go back to the work that pays the bills.
