Tiny Weights

Small language models — Gemma, Phi, SmolLM, Qwen, and the rest — are getting surprisingly capable. Benchmarks, local deployment, quantization, and hands-on guides for running small LLMs on real hardware.

What Can You Actually Do With a Local Small LLM? A Practical Guide

Running a local small LLM takes five minutes with Ollama. Knowing what to actually do with one takes longer to figure out — the use cases that make local AI genuinely useful are different from what most people expect after years of cloud AI, and the cases where a local model is the wrong tool entirely are just as important to know upfront. This post answers the question directly. Here is what a local small LLM does well, what it does badly, and which model and hardware configuration makes sense for each job. Links throughout point to the deeper guides on this blog for each model and runtime. ...

Running LLMs on Raspberry Pi 5: A Practical Guide with Real Benchmarks

Running a language model locally on a Raspberry Pi 5 is practical in 2026 — if you pick the right model. The Pi 5 (8 GB) handles 1–3B parameter models at speeds that work for interactive tasks without a cloud connection or dedicated AI hardware. This is CPU-only inference, which sets a hard ceiling: expect 5–15 tokens per second for 1.5B models, and 2–5 tokens per second for 3B models. This guide covers setup with both Ollama and llama.cpp, real benchmark data, and which models are worth running on Pi hardware. ...
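To put those speeds in practical terms, here is a rough sketch that converts the quoted tokens-per-second ranges into wait times. The 200-token reply length and the helper names are assumptions chosen for illustration, and the estimate ignores prompt processing time.

```python
# Tokens-per-second ranges quoted in the excerpt above (CPU-only, Pi 5 8 GB).
SPEED_RANGES = {"1.5B": (5.0, 15.0), "3B": (2.0, 5.0)}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady rate (prompt processing ignored)."""
    return num_tokens / tokens_per_second


def wait_range(model_size: str, num_tokens: int = 200) -> tuple[float, float]:
    """Return (best case, worst case) seconds for a reply of num_tokens.

    num_tokens=200 is an assumed reply length, not a figure from the guide.
    """
    slowest, fastest = SPEED_RANGES[model_size]
    return (
        generation_seconds(num_tokens, fastest),
        generation_seconds(num_tokens, slowest),
    )


if __name__ == "__main__":
    for size in SPEED_RANGES:
        best, worst = wait_range(size)
        print(f"{size}: {best:.0f} to {worst:.0f} seconds for a 200-token reply")
```

Under these assumptions a 1.5B model takes roughly 13 to 40 seconds for a 200-token reply, and a 3B model takes 40 to 100 seconds.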

The Complete Guide to Running Small LLMs on Apple Silicon (2026)

Apple Silicon Macs are the best consumer hardware for running small language models locally in 2026. The reason is architectural: Apple’s unified memory architecture (UMA) lets the GPU and CPU share a single physical memory pool, so model size is bounded by total system memory rather than by the separate VRAM capacity that limits discrete GPUs in Windows laptops and desktops. A Mac with 32 GB of unified memory can load a model that fills most of that memory and run full GPU-accelerated inference on it, with no swapping and no offloading. ...

How to Run Phi-4-mini Locally: Microsoft's 3.8B Model with 128K Context

Phi-4-mini is Microsoft’s 3.8B parameter model and one of the most practical models to run locally right now. The GGUF download is 2.49 GB at Q4_K_M, it handles a 128,000-token context window (rare at this scale), and it scores 88.6% on the GSM8K math benchmark, about 30 points ahead of models it competes against. This guide covers exactly how to run Phi-4-mini locally with Ollama, llama.cpp, and Python’s Transformers library, with verified RAM requirements and benchmark context throughout. ...

How Much RAM Do You Actually Need to Run Local LLMs?

The short answer: for a 3B model at Q4_K_M quantization you need about 4 GB of free RAM. An 8B model needs closer to 7 GB. A 32B model won’t move at all without 22+ GB available. The longer answer depends on which runtime you’re using, how much context you need, and whether you have a GPU. This post breaks down exactly where that memory goes, with verified file size data pulled from HuggingFace, so you can size your hardware against a specific model rather than guess. ...
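As a rough illustration of where that memory goes, the sketch below scales the one Q4_K_M file size quoted elsewhere on this page (Phi-4-mini: 3.8B parameters, 2.49 GB) to other model sizes and compares the result with the RAM figures above. Linear scaling from a single model is an assumption, and the helper names are hypothetical. Real GGUF sizes vary by architecture, so treat the output as a ballpark rather than a sizing rule.

```python
# Reference point taken from the Phi-4-mini excerpt on this page.
REFERENCE_PARAMS_B = 3.8   # parameter count, in billions
REFERENCE_FILE_GB = 2.49   # Q4_K_M GGUF download size

# RAM figures quoted in the excerpt above, keyed by parameter count (billions).
QUOTED_RAM_GB = {3: 4.0, 8: 7.0, 32: 22.0}


def estimate_weights_gb(params_b: float) -> float:
    """Hypothetical helper: scale the reference file size linearly (an assumption)."""
    return params_b * REFERENCE_FILE_GB / REFERENCE_PARAMS_B


def headroom_gb(params_b: float, ram_gb: float) -> float:
    """Memory left for context (KV cache) and runtime overhead after the weights."""
    return ram_gb - estimate_weights_gb(params_b)


if __name__ == "__main__":
    for params_b, ram_gb in QUOTED_RAM_GB.items():
        weights = estimate_weights_gb(params_b)
        spare = headroom_gb(params_b, ram_gb)
        print(f"{params_b}B: ~{weights:.1f} GB weights, ~{spare:.1f} GB left of {ram_gb:.0f} GB")
```

With this scaling the weights come to roughly 2.0, 5.2, and 21.0 GB, which leaves between 1 and 2 GB of the quoted RAM for context and runtime overhead.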

GGUF vs ONNX vs MLX: Which Model Format Should You Use for Local Inference?

When you search for a small language model on HuggingFace, you’ll typically find the same model offered in three formats: GGUF, ONNX, and MLX. The names are not self-explanatory, the documentation is scattered, and picking the wrong one wastes time. This guide cuts through it. The short answer: GGUF for almost everything, MLX if you’re on Apple Silicon and want maximum speed, ONNX if you’re building a production app on Windows or deploying to a phone. Everything below explains why. ...

Ollama vs LM Studio vs llama.cpp: Which Local AI Runtime Should You Use?

If you want to run a language model locally, three tools handle the vast majority of real-world setups: Ollama, LM Studio, and llama.cpp. Choosing between them is the first decision most people get wrong — not because any of them is bad, but because they solve meaningfully different problems. Here is the short answer before the full breakdown: Ollama if you are a developer building something. LM Studio if you want a GUI and are not writing code. llama.cpp if you need maximum throughput or fine-grained control and are comfortable on the command line. Everything below is the reasoning behind those calls. ...

The Best Small Language Models in 2026: A Practical Comparison

The small model space in 2026 is genuinely crowded. Microsoft, Google, HuggingFace, and Alibaba have all shipped competitive sub-10B models in the last year, and picking the right one for your setup is no longer obvious. This post cuts through the noise: real benchmarks sourced from official model cards and technical reports, real VRAM requirements, and use-case recommendations you can act on today. This is the reference comparison post for the model family reviews on this blog. Each model covered here has a dedicated deep-dive linked below. As we publish more, they’ll link back here. ...

Qwen3.5-0.8B: A Multimodal Thinking Model That Fits in 1 Gigabyte

800 million parameters. 262,000-token context window. Images, video, and text, all handled natively. Thinking mode on demand. Apache 2.0 license. And the entire model weighs in at 1 GB on Ollama. That’s Qwen3.5-0.8B, the smallest member of Alibaba’s Qwen3.5 family, released in February 2026. It is not a general-purpose language model pretending to be multimodal: it was trained with early fusion on multimodal tokens from the start, covering 201 languages and dialects. At sub-billion-parameter scale, very little competes with its feature set. ...

Qwen3-Coder-Next: Run a Frontier-Level Coding Agent Locally on Consumer Hardware

There is a certain irony in spending $200 a month on a cloud coding assistant for a codebase you’ll never let leave your machine. Your intellectual property stays on-premises, but every line you paste into a chat window makes a round trip to a server you don’t control. Until recently, the performance gap between local models and frontier cloud assistants made that trade-off feel unavoidable. Qwen3-Coder-Next, released by Alibaba’s Qwen team on February 3, 2026, is the clearest argument yet that the trade-off is closing. With 80 billion total parameters but only 3 billion activated per token, it scores 70.6% on SWE-Bench Verified — matching or beating models with 10–20× more active parameters — and it runs on hardware you can buy today. ...