Published in AI & Technology
The State of Local LLMs (2024/2025): What Actually Changed
A little over a year ago, I wrote about running LLMs locally with llama.cpp, Llamafile, and Ollama. That post became our most-accessed technical article, which tells me one thing: the people tinkering with LLMs, engineers and technical founders alike, are still trying to figure out how to run these models without selling a kidney for GPU time.
Since then, the landscape has shifted. Not revolutionized. It shifted. The tools matured, new models arrived, and the practical constraints of running AI locally became clearer. So, at the start of 2026, I'm taking the opportunity to update what I've learned since that post.
The Tools: llama.cpp and Ollama, One Year Later
llama.cpp: Still the Engine, Now with Better Ergonomics
If you’re an engineer who lives in the terminal, llama.cpp remains the foundation. The project didn’t chase hype; it chased performance and compatibility. The key developments:
- Vulkan backend matured: When I first started playing with LLMs on my own GPU, I had so many problems with my AMD setup that I spent around $1,000 swapping the motherboard and CPU over to Intel. AMD GPU support has since gone from “technically works” to “actually usable.” llama.cpp's Vulkan backend has matured significantly and now provides reliable AMD support, including for the RX 6700 XT, with reported benchmarks of roughly 83 tokens/s for generation (tg128) and 1,051 tokens/s for prompt processing (pp512) on Q4_0 models. That is far beyond CPU-only inference (older CPU baselines sit around 10-30 t/s), and in some AMD setups Vulkan has been reported to be up to 50% faster than ROCm. (What is the Vulkan backend? Vulkan is a cross-platform graphics and compute API that llama.cpp uses to accelerate LLM inference on a wide variety of hardware.) There's a build sketch after this list.
- Metal optimizations: Metal is Apple's rough equivalent of NVIDIA's CUDA. I've read that Apple Silicon users saw 20-30% better tokens/s on M2/M3 chips, especially with larger context windows (16k+). I don't use a Mac, so don't take my word for it; verify against published benchmarks for your specific chip.
- GGUF became the standard: Before 2024, the community was plagued by "breaking changes" where a new software update would make your old model files useless (the old GGML format). GGUF (GPT-Generated Unified Format), introduced by Georgi Gerganov, solved this by embedding metadata into the file itself. In 2024, the industry effectively abandoned the older GGML and secondary formats in favor of GGUF, largely due to its native support on Hugging Face and within llama.cpp. By 2025, GGUF is simply the format: you download a file and run it locally.
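If you want to try the Vulkan path yourself, here is a minimal build-and-run sketch, assuming a recent llama.cpp checkout and a GGUF file you have already downloaded (the CMake option has been renamed across versions, so check the build docs if yours rejects it):

```bash
# Build llama.cpp with the Vulkan backend enabled
# (recent trees use GGML_VULKAN; older ones used LLAMA_VULKAN)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Run a quantized model, offloading as many layers as possible to the GPU
./build/bin/llama-cli \
  -m ./llama-3.1-8b-instruct-Q4_K_M.gguf \
  -ngl 99 \
  -p "Explain GGUF in one sentence."
```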
Ollama: The Developer Experience Winner
I haven't used llama.cpp directly since the last blog post. My workflow now runs mostly through Simon Willison's llm tool, and most of the time Ollama is the backend. Ollama acts as the manager for my models: I can download models, remove them, or even have it use very powerful cloud-hosted models when I need to.
Ollama doubled down on what it does best: making local models just work. You can also use cloud models (yes, I know: You'll need an internet connection for this) if your hardware can't handle the 100 billion (or even 1 trillion) parameter models.
- Build multi-adapter setups: Combine a base model with LoRA adapters for specific tasks (e.g., code generation + Japanese legal text).
- Pre-load models: Keep models warm in VRAM with Ollama's keep-alive setting (the `OLLAMA_KEEP_ALIVE` environment variable or the `keep_alive` API parameter), cutting first-token latency from seconds to milliseconds. In my setup I actually turn this off, because I use different models for different parts of the workflow.
- REST API parity: The API now mirrors OpenAI’s closely enough that many applications need only a base URL change (see the curl sketch below).
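To illustrate the base-URL point: Ollama exposes an OpenAI-compatible endpoint under /v1, so a plain curl call against a model you have pulled locally looks like this:

```bash
# Chat completion against a local Ollama instance via the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "user", "content": "Explain GGUF in two sentences."}
    ]
  }'
```

Existing OpenAI SDK clients generally only need their base URL pointed at http://localhost:11434/v1 (with a dummy API key), and most chat-completion code works unchanged.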
Performance-wise, Ollama still sits on top of llama.cpp, so raw speed differences are minimal, I think. I've seen people online ditching Ollama to work directly with llama.cpp (which now has its own server you can access through the browser), and that is appealing in itself, but Ollama also offers cloud models, which I think differentiates it from llama.cpp.
The New Models: What Actually Matters
When I wrote the original post, Llama 2 was state-of-the-art for local use. The world moved on. Here’s what’s worth your disk space in 2025:
Llama 3.1/3.2: The Workhorse Evolved
- Llama 3.1 8B: The 8B variant is what Llama 2 13B wanted to be. Smaller, faster, and better at instruction following. Quantized to Q4_K_M, it runs at ~40 tokens/s on an M2 MacBook Air.
- Llama 3.2 3B: This is said to be the model that “runs on a Raspberry Pi.” The quality drop is noticeable for creative tasks, but for structured extraction and classification I think it’s sufficient. This is the territory of so-called "edge AI."
- Context windows: 128k tokens in 3.1 means you can feed it your entire product manual. In practice, quality degrades after ~32k, but that’s still 4x more than you had last year.
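One practical note on using those long contexts: runtimes typically default to a much smaller window than the model supports, so you have to ask for it explicitly. A minimal llama.cpp sketch (the model filename and the manual text file are placeholders; Ollama users can pass num_ctx in the request options instead):

```bash
# Ask for a 32k window explicitly; defaults are far smaller, and quality
# tends to degrade past ~32k anyway, as noted above
./build/bin/llama-cli \
  -m ./llama-3.1-8b-instruct-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -f ./product-manual.txt
```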
Mistral Nemo: The Overlooked Performer
Mistral’s 12B model hits a sweet spot: better reasoning than Llama 3.1 8B, faster than Llama 3.1 70B. The trade-off is VRAM: a 12B Q4_K_M needs ~8GB, though that still fits in consumer GPUs.
If you're a cash-strapped startup in Japan, you might want to use it for contract analysis, because its bilingual training handles code-switching better than Meta’s models.
Google Gemma 2: The Surprise Contender
Google open-sourced Gemma 2 (9B and 27B), and it’s good.
The 9B model rivals Llama 3.1 8B on benchmarks, but quantizes better: Q4_K_M retains more of its original capability. The catch: it’s more sensitive to prompt formatting. You need to follow the system prompt template exactly, or quality drops sharply.
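To make the formatting point concrete, here is a sketch of what Gemma's turn structure looks like if you build the prompt by hand, using Ollama's raw mode (the gemma2:9b tag and the exact markers are my reading of the Gemma 2 instruct format, so verify against the model card). Normally Ollama and llama.cpp apply the chat template stored in the GGUF for you; the sensitivity bites when you construct prompts yourself:

```bash
# Raw-mode request: Ollama applies no template, so the turn markers must be exact
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:9b",
  "raw": true,
  "stream": false,
  "prompt": "<start_of_turn>user\nSummarize GGUF in one sentence.<end_of_turn>\n<start_of_turn>model\n"
}'
```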
Moonshot’s Kimi K2 Series: The Joker Card from China
If you’ve been tracking the local LLM space, you’ve probably heard about Moonshot AI and its Kimi K2 models.
While US and Western European developers were busy debating Llama vs. Mistral, Moonshot quietly released a series of models that punch well above their weight, especially for multilingual and long-context tasks.
Here’s what I think:
What’s Special About Kimi K2?
Alibaba-backed Moonshot’s Kimi K2 series (released in July 2025) is optimized for long-context understanding and multilingual performance, particularly in Chinese, Japanese, and English. Unlike many Western models that treat non-English languages as an afterthought, Kimi K2 was trained with a balanced multilingual dataset, making it a strong contender for global applications.
Key strengths:
- 256k Context Window: The initial Kimi K2 releases (July 2025) shipped with a 128k context window; the September 2025 updates (the kimi-k2-0905 version and the Kimi K2 Thinking model) extended support to the full 256k.
- Native INT4 Quantization: It was built with "Quantization-Aware Training," meaning it can run at half the memory cost (INT4) with almost zero loss in intelligence, doubling its generation speed.
- 1 Trillion Parameters: It has 1,000 billion total parameters, but as a mixture-of-experts model it activates only about 32 billion per token. For comparison, OpenAI's biggest open-weight model so far is `gpt-oss-120B`.
- Multilingual Mastery: Benchmarks show Kimi K2 outperforming Llama 3.1 and Mistral in Chinese and Japanese while holding its own in English. If your use case involves code-switching (e.g., Japanese contracts with English technical terms), this is the model to beat.
The Kimi K2 Thinking model is so far my go-to for most tasks, especially anything that involves multiple tool calls or Japanese.
What Didn’t Work Out
- Phi-3: Impressive on paper, but the small size shows in real tasks. Fine for classification, not for generation. I admittedly haven't read much about how people are using this model, so I may be biased here.
- Qwen-2: Strong for Chinese, but, at least in my anecdotal experience, it lags behind Llama/Mistral for English and Japanese. Qwen3 is better, although a bit slower.
- Mixtral 8x22B: Too large for most local setups. If you have the hardware, it’s great, but that’s only a small percentage of users.
The FAQs Everyone Actually Needs
Here are some of the questions I get most often when I talk to people about my setup, Ollama, and llama.cpp.
"What is 'Quantization' and which one should I download?"
Quantization is lossy compression for neural networks. You’re trading model size and some accuracy for the ability to run it on your hardware. The naming looks cryptic (Q4_K_M, Q5_0, Q8_0), but the pattern is simple:
- Q4_K_M: The universal recommendation. 4-bit quantization with importance-aware compression. You lose ~2% accuracy vs. full precision, but the model is 1/4 the size. On an 8GB GPU, this is often your only choice.
- Q5_K_M: Slightly better quality, ~20% larger. Use this if you have VRAM headroom and need that last bit of accuracy for sensitive tasks (medical, legal).
- Q8_0: Nearly indistinguishable from full precision, but almost as large. Only useful if you’re benchmarking quantization impact.
- IQ2_XS: Extreme compression. Models become conversational but incoherent. Avoid unless you’re experimenting.
Rule of thumb: Download Q4_K_M first. If you see obvious quality issues in your specific use case, move up to Q5_K_M. Don’t overthink it.
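In practice that means either grabbing a build with the quant already baked in, or making your own. A sketch, assuming the tag naming used on the Ollama library page (check the model's tags) and a full-precision GGUF you already have for the second option:

```bash
# Option 1: pull a specific quantization tag from the Ollama library
ollama pull llama3.1:8b-instruct-q4_K_M

# Option 2: quantize a full-precision GGUF yourself with llama.cpp
./build/bin/llama-quantize ./model-f16.gguf ./model-Q4_K_M.gguf Q4_K_M
```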
"Can I use these models without an internet connection?"
Yes. That’s the entire point.
Once downloaded, both llama.cpp and Ollama are 100% offline. I’ve run inference on a flight 10 km above the sea without paying for satellite data. I can't fine-tune in the air yet, though, because I'd need a seat socket that delivers enough power (some flights cap the wattage).
Caveats:
- Initial download needs internet (obviously). A 7B Q4_K_M model is ~4GB; a 70B is ~40GB.
- Ollama’s model library browsing requires internet, and `ollama pull` needs a connection, but once you've downloaded a model you’re set (see the sketch below).
- Some advanced features (like Ollama’s `ollama_web_fetch` or `ollama_web_search`) call external APIs, but core inference is local.
- Don't use the cloud models! You'll need an internet connection for those.
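The offline workflow really is just "pull first, fly later." Something like this before you lose the connection (the tags are simply the models I happen to use):

```bash
# While you still have internet: cache the models you want
ollama pull llama3.1:8b
ollama pull qwen3:8b

# Later, fully offline: confirm what is cached and run inference locally
ollama list
ollama run llama3.1:8b "Summarize the following meeting notes in three bullet points."
```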
Practical Setup for 2026
Here’s what I’m running now, updated from last year’s recommendations:
For Development/Prototyping (RTX 2000 Ada GPU, 8GB VRAM):
- Ollama + Llama 3.1 8B Q4_K_M
- Claude Code
- Handles code completion, docstring generation, and debugging explanations
I'm currently running a ThinkPad P1 with 96GB RAM and an RTX 2000 Ada 8GB GPU on Ubuntu 24.04. For coding, I use Claude Code with Anthropic by default, but I switch to Claude Code Router to connect to other models depending on the use case. All of these are cloud models, because they remain the most powerful and useful option for my coding projects.
Local models like gemma3:27b-cloud are still useful for things such as quick meta descriptions or alt text for images and posts. qwen3:8b is also surprisingly good at summarizing, and at Japanese and Chinese.
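For those small jobs I usually go through Simon Willison's llm CLI rather than raw curl. A minimal sketch, assuming the llm-ollama plugin (plugin name as I recall it; check the llm plugin directory) and a model already pulled in Ollama:

```bash
# Install the Ollama plugin for the llm CLI, then talk to a local model
llm install llm-ollama

# qwen3:8b here is just the model I happen to have pulled
llm -m qwen3:8b "Write a 150-character meta description for a post about running LLMs offline."
```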
For Production (Dedicated mini-PC, RTX 4060 16GB):
- llama.cpp server mode + Mistral Nemo 12B Q4_K_M (launch sketch after this list)
- Nginx reverse proxy
- Runs 24/7 for internal tools (support ticket classification, FAQ generation)
- Kafkai runs something similar to this setup, although with different logic that routes across several models, some in the cloud and some local.
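For the curious, the production box is not much more than this. A sketch of the server launch (model filename, port, and sizes are illustrative; Nginx just proxies external traffic to the local port and handles TLS):

```bash
# llama.cpp's built-in HTTP server, bound to localhost behind an Nginx reverse proxy
./build/bin/llama-server \
  -m ./mistral-nemo-12b-instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 16384 \
  -ngl 99 \
  --parallel 4
```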
For Experimentation (Raspberry Pi 5, 8GB):
- llama.cpp CPU-only + Llama 3.2 3B Q4_K_M
- ~8 tokens/s, but enough for testing automation scripts
The Bottom Line
The local LLM ecosystem solidified. The tools stopped trying to be everything and focused on being reliable. The models got smaller, faster, and more capable. The community converged on standards (GGUF, Q4_K_M) that just work.
If you read the original post and thought “this is promising but painful,” now is the time to revisit. The setup friction dropped from “weekend project” to “coffee break task.”
For engineers, the key insight is this: local models aren’t replacing cloud APIs for everything, but they’ve become essential for specific workflows where latency, privacy, or cost matter. Think of them as another tool in your DevOps kit, like Redis or PostgreSQL: specialized, powerful, and worth understanding.
And for the love of all things beautiful and kind, don't download models using the airport's lounge WiFi!