We started this exercise with a question that comes up in nearly every edge-AI conversation we have: what does it actually cost — in latency, accuracy, and engineer hours — to run an 8-billion-parameter model on a phone? Six weeks later we have a shipping pipeline and a number of opinions about which steps were worth the effort and which were not.
The target was a Snapdragon 8 Gen 3 reference device with 8 GB of RAM, of which roughly 4 GB is realistically available for our model and inference scratch. The starting point was Llama-3 8B in FP16, which weighs in at 16 GB on disk and refuses to even load on the device. That gap — 16 GB to under 4 GB — was the problem.
The standard answer is quantisation. The question is which scheme. We ended up evaluating four: straight INT8, plain INT4, the Q4_0 layout from llama.cpp, and the Q4_K mixed-precision scheme that spends extra bits on the attention layers and saves bits elsewhere. INT8 is half the size of FP16, which gets us to 8 GB, still too big. Plain INT4 gets us to 4 GB, which fits, but perplexity on our internal eval rises by 6.2 points, which is too large a regression to ship.
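The size arithmetic here is nothing more exotic than bits per weight times parameter count. A minimal sketch, assuming an 8-billion-parameter model, decimal gigabytes, and no scale overhead for the uniform schemes (all of which are simplifications, not numbers from our pipeline):

```cpp
// Back-of-the-envelope model size per quantisation scheme.
// Assumptions: 8.0e9 parameters, decimal GB, scale/zero-point overhead ignored.
#include <cstdio>

double model_gb(double n_params, double bits_per_weight) {
    return n_params * bits_per_weight / 8.0 / 1e9;  // bits -> bytes -> GB
}

int main() {
    const double n = 8.0e9;  // 8B-class model
    std::printf("FP16 : %.1f GB\n", model_gb(n, 16.0));  // ~16 GB, won't load
    std::printf("INT8 : %.1f GB\n", model_gb(n, 8.0));   // ~8 GB, still too big
    std::printf("INT4 : %.1f GB\n", model_gb(n, 4.0));   // ~4 GB, fits, accuracy suffers
    // Q4_K's effective bits-per-weight varies by tensor (more bits on attention,
    // fewer elsewhere), so we don't hard-code a single number for it here.
    return 0;
}
```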
Plain INT4 fits in the budget. Q4_K fits in the budget AND keeps accuracy within a perplexity point of FP16.
Q4_K turned out to be the sweet spot. The model lands at 3.2 GB, which leaves comfortable headroom for the inference KV cache. Perplexity on our eval rises by only 0.9 points relative to FP16, within the noise of what we'd see from conservative dropout. Throughput, measured as tokens per second on a single prompt, lands at 17 t/s on the Snapdragon's NPU path, comfortably above the 12 t/s we set as our minimum useful UX threshold.
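If you want to sanity-check that headroom for your own setup, the KV cache is easy to budget. A rough sketch, using typical 8B-class shapes (32 layers, 8 KV heads under GQA, head dimension 128, FP16 cache entries); these shapes are illustrative assumptions, not figures from our pipeline:

```cpp
// Rough KV-cache budget check for an 8B-class model (assumed shapes).
#include <cstdio>

int main() {
    const long n_layers   = 32;
    const long n_kv_heads = 8;
    const long head_dim   = 128;
    const long bytes_elem = 2;     // FP16 K/V entries
    const long n_ctx      = 4096;  // context length we want to support

    // K and V, per layer, per token
    long bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_elem;
    double cache_gib = (double)bytes_per_token * n_ctx / (1024.0 * 1024.0 * 1024.0);

    std::printf("KV cache per token : %ld KiB\n", bytes_per_token / 1024);
    std::printf("KV cache @ %ld ctx : %.2f GiB\n", n_ctx, cache_gib);
    // With ~3.2 GB of weights inside a ~4 GB budget, roughly 0.8 GB remains,
    // so the cache has to stay comfortably under that line.
    return 0;
}
```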
The surprise wasn't the quantisation itself; that part is now well-trodden territory. The surprise was how much of the engineer-hours budget went to the surrounding plumbing. Memory layout for the KV cache had to be reworked to align with the SoC's L3 cache lines; without that, the inner inference loop spent a large fraction of its time stalled on cache misses. The tokeniser had to be rebuilt to ship as a self-contained binary rather than relying on a Python runtime. The model file format had to change to allow memory-mapping for fast cold-load.
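To make the first and last of those concrete, here is a sketch of what that plumbing looks like: a cache-line-aligned allocation for the KV cache and a read-only memory map for the weight file. It is illustrative, not our production code; the 64-byte line size, the file name, and the sizes are placeholders.

```cpp
// Plumbing sketches: aligned KV-cache allocation and mmap'd weight loading.
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// (1) Cache-line-aligned allocation; 64 bytes is typical on Arm SoCs, but the
// real value should come from the target's cache geometry.
void *alloc_kv_cache(size_t bytes) {
    const size_t line = 64;
    size_t rounded = (bytes + line - 1) / line * line;  // size must be a multiple of alignment
    return std::aligned_alloc(line, rounded);
}

// (2) Map the weight file read-only; the format must lay tensors out so they
// can be used directly from the mapping, with no per-tensor fixups on load.
void *map_weights(const char *path, size_t *out_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return nullptr;
    madvise(p, st.st_size, MADV_WILLNEED);  // hint: fault pages in ahead of use
    *out_size = (size_t)st.st_size;
    return p;
}

int main() {
    size_t sz = 0;
    void *kv = alloc_kv_cache(512u * 1024 * 1024);   // e.g. a 512 MiB KV budget
    void *w  = map_weights("model-q4_k.bin", &sz);   // placeholder filename
    // ... hand kv and w to the inference runtime ...
    if (w) munmap(w, sz);
    std::free(kv);
    return 0;
}
```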
Three takeaways:
Q4_K is the on-device default until something better ships. If you have 4 GB of headroom, you can run an 8B-class model with throughput that holds up to interactive use, and accuracy that holds up to most production tasks.
Quantising the model is a small fraction of the engineering work. Memory layout, tokeniser packaging, the model file format, lifecycle (loading, unloading, hot-swapping): each of these took as long as the quantisation itself.
Benchmark on the device, not on a workstation. The synthetic-throughput numbers we got from our laptop GPU bore no useful relation to what the Snapdragon NPU actually delivered. Every decision needs to be re-validated on real hardware, and a timing harness as small as the sketch below is enough to do it.
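A minimal version of that harness: wrap whatever single-step decode call your runtime exposes (stubbed here as generate_one_token, a placeholder name) and time it on the phone itself.

```cpp
// Minimal on-device throughput harness sketch.
#include <chrono>
#include <cstdio>

// Stand-in for the runtime's single-step decode; replace with the real call.
void generate_one_token() { /* ... run one decode step on the NPU path ... */ }

int main() {
    const int warmup = 8;     // let caches, clocks and the NPU settle
    const int steps  = 128;   // measured decode steps

    for (int i = 0; i < warmup; ++i) generate_one_token();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < steps; ++i) generate_one_token();
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.1f tokens/sec over %d steps\n", steps / secs, steps);
    return 0;
}
```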
Subscribe at edgesignal.example — next week we look at Phi-3-mini for radio kits.