Three on-device inference benchmarks worth your time

Short summaries of this week's three picks, plus our take.

"INT4 inference on Snapdragon 8 Gen 3 — a vendor-neutral comparison" (March 2026). A 30-page industry-association report comparing four Q4 quantisation schemes on identical hardware. Our take: the comparison methodology is solid; the conclusions on Q4_K vs Q4_0 match what we found in our own pipeline (see issue #33). One small quibble — the perplexity eval uses an English-only dataset where multilingual deployments will see different cliff edges.

"Llama.cpp memory layout: the 4 GB question" (Hacker Engineer, March 2026). A long-read on the engineering tradeoffs of running an 8B model in 4 GB of mobile RAM. Particularly good on the cache-line alignment work that lets the inference loop avoid the L3 thrash we saw early on. The author shares concrete profiler output from an iPhone 15 Pro that lines up almost exactly with what we measured on Snapdragon 8 Gen 3.

"Why Phi-3 mini is the most under-appreciated model for radio kits" (substack, Feb 2026). Counter-intuitive but well-argued: for the ≤1B parameter band where on-device truly shines, Phi-3-mini's instruction-tuned variant beats every Llama-fork the author tested. We disagree with the conclusion on overall flexibility — Llama-3 8B at Q4_K is still our preference where the headroom exists — but the data on Phi-3-mini's instruction-following at sub-1B parameters is clean and worth a serious read if you're targeting truly constrained hardware.