Why we abandoned WireGuard for our cross-border IoT mesh

Two issues ago we wrote about replacing TCP with the Bundle Protocol for an IoT pilot that crosses two borders. That story has a footnote we deliberately left out then because it deserved its own piece: before we moved to the Bundle Protocol, we tried WireGuard. WireGuard is excellent, we are fans of it, and it is the wrong tool for this particular job. Here is why.

WireGuard's elegance is partly in its assumption set. It assumes both endpoints have stable, simultaneous reachability. It uses a small handshake-and-rekey state machine that, by design, drops sessions which violate that assumption. The replay-window logic, in particular, treats any received packet whose sequence number (WireGuard calls it a counter) is too far behind the highest seen so far as a replay attempt and discards it.

That replay window is exactly correct for the threat model WireGuard was designed against: a determined attacker with packet capture and replay capabilities on a modern internet link. It is exactly wrong for a sensor that comes back online after six hours of solar-battery dropout and finds that its peer's view of the sequence numbers has moved well past anything the sensor's last-known state can resynchronise with.
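To make the failure concrete, here is a minimal Python sketch of a sliding-window replay check of the general kind described above. It is not WireGuard's implementation: the ReplayWindow class, its accept method, and the 64-packet window size are hypothetical choices for illustration, and the counter values in the usage lines are made up.

```python
class ReplayWindow:
    """Illustrative sliding-window replay filter (not WireGuard's actual code)."""

    def __init__(self, size: int = 64) -> None:
        self.size = size      # hypothetical window width, in packets
        self.highest = -1     # highest counter accepted so far
        self.bitmap = 0       # bit i set => counter (highest - i) already seen

    def accept(self, counter: int) -> bool:
        """Return True if the packet should be accepted, False if discarded."""
        if counter > self.highest:
            # Newer than anything seen: slide the window forward and mark it.
            shift = counter - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = counter
            return True
        offset = self.highest - counter
        if offset >= self.size:
            return False      # too far behind the window: treated as a replay
        if self.bitmap & (1 << offset):
            return False      # already seen: a genuine replay
        self.bitmap |= 1 << offset
        return True


window = ReplayWindow()
window.accept(10_000)         # the peer's view has moved on
print(window.accept(9_990))   # True: late, but still inside the window
print(window.accept(9_990))   # False: same counter again is a replay
print(window.accept(120))     # False: a long-stale counter looks like a replay
```

The last call is the sensor's situation: from the receiver's side, a legitimately delayed packet with a stale counter cannot be told apart from an attacker replaying old traffic, so it is dropped.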

WireGuard's replay window is the right answer for hostile internet links. It's the wrong answer for sensors that come back from the dead.

What we observed in the field, repeatedly, was a death spiral. Sensor goes offline (battery, weather, generator). Comes back online. Tries to send a queued packet using its old WireGuard state. Peer rejects it because the sequence number is now classified as a replay. Sensor takes that as a transport error, retries, and the retry is also rejected. The sensor's monitoring agent, designed for this kind of failure, attempts a clean handshake reset. Sometimes that succeeds. Sometimes, particularly when many sensors are coming back online together after a regional outage, a handshake message is lost on a link with a 700 ms satellite RTT, the peer marks the connection invalid before the handshake completes, and the sensor enters a retry loop.

Same four nodes, same physical links — different tolerance for intermittent connectivity.

The bundle-protocol replacement makes none of these assumptions. Bundles are sequence-number-free at the protocol level; ordering is reconstructed from per-bundle timestamps at the receiver. A sensor that wakes up after six hours simply hands its locally buffered bundles to whichever gateway it can reach now, and the gateway accepts them without caring about sequence state. The replay resistance we give up is real, but for our threat model (sensors mostly transmitting low-value telemetry, with any high-value control path on a separate authenticated channel) the tradeoff is acceptable.
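The store-and-forward behaviour described above can be sketched in a few lines of Python. This is a toy model, not a Bundle Protocol implementation: the Bundle, SensorBuffer, and Gateway names and their methods are hypothetical, the timestamps are made up, and it assumes the sensors' clocks are close enough that creation timestamps give a usable ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    created_at: float   # creation timestamp stamped by the sensor
    source: str         # hypothetical sensor identifier
    payload: bytes


class SensorBuffer:
    """Holds bundles locally while no gateway is reachable."""

    def __init__(self) -> None:
        self._queue: list[Bundle] = []

    def record(self, bundle: Bundle) -> None:
        self._queue.append(bundle)

    def flush_to(self, gateway: "Gateway") -> int:
        """Hand every buffered bundle to whichever gateway is reachable now."""
        sent = len(self._queue)
        for bundle in self._queue:
            gateway.receive(bundle)
        self._queue.clear()
        return sent


class Gateway:
    """Accepts bundles in any arrival order; keeps no per-sender sequence state."""

    def __init__(self) -> None:
        self._received: list[Bundle] = []

    def receive(self, bundle: Bundle) -> None:
        self._received.append(bundle)

    def in_order(self) -> list[Bundle]:
        # Ordering is reconstructed from creation timestamps, not arrival order.
        return sorted(self._received, key=lambda b: b.created_at)


gateway = Gateway()
sleepy, steady = SensorBuffer(), SensorBuffer()
sleepy.record(Bundle(100.0, "sleepy", b"t=21.5"))
sleepy.record(Bundle(200.0, "sleepy", b"t=21.9"))
steady.record(Bundle(150.0, "steady", b"t=19.0"))

steady.flush_to(gateway)   # arrives first
sleepy.flush_to(gateway)   # arrives hours later, still accepted
print([b.created_at for b in gateway.in_order()])   # [100.0, 150.0, 200.0]
```

Nothing in the gateway depends on what it last heard from a given sensor, which is why a six-hour gap costs nothing here. The same property is what gives up replay resistance: a re-sent bundle would be accepted too unless something above this layer deduplicates it.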

Three takeaways:

WireGuard's elegance is in what it assumes. Stable simultaneous reachability is a great assumption for most VPN use cases, and a fatal assumption for intermittent IoT.

Replay-window logic is the silent killer. The handshake works in your monitoring; the data flow doesn't. The bug is invisible until you watch packet-by-packet.

Pick the protocol whose assumption set matches your link. Don't fight WireGuard's design; just don't deploy it where you need delay tolerance.