Cyrano Security
10 min read
AI edge, the continuity definition

AI edge is not a latency claim. It is a continuity claim. Here is the file that proves it.

Every top-ranked explainer defines AI edge by where inference runs and sells it on latency, bandwidth, and privacy. The question that definition never answers is what happens to detection and alerting when the upstream link goes dark for six hours. This page is about that test, and about the specific append-only NDJSON file on a Cyrano unit that a field technician can tail to watch the answer.

See the outbox on a live unit
4.9 from 50+ properties
Detection continues during a WAN outage, events append to an on-device file
Monotonic local_seq counter survives reboot and NTP corrections
Drain is strictly in order; server dedupes by (device_id, local_seq)
Retry backoff state is persisted, not reset on reboot

The definition on page one, and the test it skips

Read the current top-ranked pages for the phrase AI edge and the framing is nearly word-for-word identical. Move inference out of a cloud region, run it on hardware near the data, ship less data over the wire, react faster, keep raw data local for privacy. Every bullet is true. None of them describe what the system does between the moment the ISP router reboots and the moment it comes back up.

The reason that question gets skipped is that the canonical reference implementation of edge AI in those vendor articles is a smart camera or a robot with a neural accelerator that hands results to a cloud backend for storage and alerting. Detection lives near the data; the rest of the pipeline does not. Pull the uplink and the usable part of the product disappears.

The version of edge AI that survives that test has to keep three things on the device itself: the model, the event state, and the retry state. Cyrano keeps all three on the unit at the property. Everything on this page falls out of that.

The four phases of a WAN outage on an edge AI unit

This is the narrative operators actually ask about. It is not about inference speed; it is about what the box does when the network goes away and comes back.

How the unit behaves across a partition

  1. Before

    Detect, filter, deliver. Live path, local_seq advancing, ack cursor keeping up.

  2. Partition

    Delivery 5xx or DNS fail. Events still append to the outbox. Ack cursor stalls.

  3. Buffer

    Backoff bumps on disk. Inference keeps running. Outbox file grows line by line.

  4. Drain

    Link returns. Walk forward from ack cursor, post in order, dedupe by local_seq.

Delivery failures that do not break detection

Failure sources: ISP router reboot, DNS resolver failure, TLS handshake failure, WhatsApp API 5xx.

Cyrano edge unit response, in every case: append to outbox, persist backoff, queue thumbnails, drain on reconnect.

The uncopyable artifact: one NDJSON line in the outbox

Every event a Cyrano unit records during a WAN outage is one line appended to this file. The drain worker reads these forward on reconnect, and the server collapses duplicates by (device_id, local_seq). That is the whole contract that makes continuity a solved problem instead of a story.

/var/lib/cyrano/outbox/2026-04-20.jsonl
8 bytes of local_seq state on disk
1 file-system call per event append
0 cloud dependencies for detection
100% of events survive uplink blackout

0 events dropped across a measured six-hour uplink outage on a 16-camera test property, because every triggered event is committed to the file before any send attempt.
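The shape of one such line can be sketched in a few lines of Python. Every field name below comes from the outbox schema described in the FAQ at the bottom of this page; every value is invented for illustration, and the snippet is a sketch of the commit-before-send contract, not Cyrano's actual writer.

```python
import json

# Illustrative event record. Field names follow the outbox schema described
# in the FAQ below; the values here are invented for the example.
event = {
    "device_id": "CYR-0000-EXAMPLE",        # hardware serial (example value)
    "local_seq": 48211,                      # monotonic u64, never resets
    "captured_at": "2026-04-20T14:03:11Z",   # ISO-8601 capture time, UTC
    "layout_id": "4x4-std",
    "camera_name": "Lobby North",
    "event_class": "loitering_dwell_exceeded",
    "zone_id": "z-entry-1",
    "dwell_seconds": 42,
    "thumb_path": "/var/lib/cyrano/thumbs/<sha256>.jpg",
    "model_tag": "person-v7@a1f2c9",
}

# One event, one newline-terminated line, appended to the day's file before
# any send attempt is made.
line = json.dumps(event, separators=(",", ":")) + "\n"
```

Because the line is committed first, a delivery failure costs nothing: the event is already durable, and the send loop only decides when, not whether, it reaches the server.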

The drain loop, in 25 lines of Python

This is the loop that runs on every retry tick. It is short on purpose. The only state that matters sits in two files: ack_local_seq (how far we have acknowledged) and retry_state (the backoff schedule). Both are written to disk. Both survive a power cut. Either can be inspected with cat.

cyrano/outbox/drain.py
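A minimal sketch of such a drain loop, assuming a `post()` callable that returns True on a successful delivery. The function signature, helper names, and file handling here are illustrative, not the contents of Cyrano's actual drain.py; the contract it implements is the one described above: read forward from the ack cursor, post in order, advance the cursor only on success, stop at the first failure.

```python
import json
from pathlib import Path

def drain(outbox: Path, ack_file: Path, post) -> int:
    """Walk the outbox forward from the last acknowledged local_seq."""
    acked = int(ack_file.read_text() or "0") if ack_file.exists() else 0
    sent = 0
    for raw in outbox.read_text().splitlines():
        try:
            event = json.loads(raw)      # truncated final line fails to parse
        except json.JSONDecodeError:
            continue                     # and is discarded
        if event["local_seq"] <= acked:
            continue                     # already acknowledged on a prior tick
        if not post(event):              # delivery failed: stop, preserve order
            break
        acked = event["local_seq"]
        ack_file.write_text(str(acked))  # persist the cursor before moving on
        sent += 1
    return sent
```

A False from `post()` leaves the cursor exactly where it was, so the next retry tick resumes at the same event; because the server dedupes by (device_id, local_seq), re-posting an event whose ack was lost is harmless.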

What a real partition + recovery looks like in the log

The partition drill below follows a Cyrano event stream from the minute the uplink drops, across the buffering window, and through the drain when the ISP router comes back. Every step of the pipeline is visible; nothing is a black box.

cyrano edge unit, partition drill on a 16-camera property
55 / 55

During the drill above, 55 events entered the outbox while the link was down. All 55 drained in strict order in 2.14 seconds once the router came back. Zero dropped, zero server-side duplicates. That is the number that separates edge AI from cloud AI with a local accelerator.

Cyrano field notes, partition drill on a production multifamily property

What has to live on the unit for this to work

There is no continuity without the right three pieces of state on the device itself. None of them are exotic; all of them are the difference between an edge AI unit and a thin client that happens to decode frames locally.

The three local stores and what they buy

1

The model and layout cache

Detector weights plus the per-layout overlay mask files live on /var/lib/cyrano/. The unit never calls out to load them, so inference is stateless with respect to the WAN.

2

The outbox

One NDJSON file per day at /var/lib/cyrano/outbox/YYYY-MM-DD.jsonl. Every triggered event is appended before any send attempt. The ack cursor is a separate file so the outbox is never rewritten.

3

The retry state

Backoff schedule serialized to /var/lib/cyrano/meta/retry_state. A reboot during an outage does not reset the schedule, so a bad uplink does not cause a send storm on the next power cycle.

What survives a partition on a Cyrano unit

survives:

  • HDMI capture, layout router, overlay mask subtract, forward pass
  • Zone and dwell filters, including per-window arming schedules
  • Thumbnail crop and local write to /var/lib/cyrano/thumbs/
  • Outbox append with monotonic local_seq
  • Backoff state across power cycles
  • On-device event log for forensic review

deferred, not dropped:

  • WhatsApp or SMS delivery (queued in the outbox)
  • Dashboard push updates (backfill on reconnect)
  • Cloud thumbnail backup (catches up via drain)
  • Remote OTA model update (next reconnect window)

The two definitions of AI edge, side by side

The table below is the shape of the question an operator should actually ask. Latency and bandwidth are fine. The hard axis is the behavior across a network partition, because that is the one that separates a product you can deploy to a property with a flaky DSL line from a product that silently becomes a recorder.

| Feature | Edge AI as latency (SERP framing) | Edge AI as continuity (Cyrano) |
| --- | --- | --- |
| Where inference runs | On-device accelerator near the camera | On a unit at the property, reading DVR HDMI |
| What happens during a WAN outage | Usually unspecified; often loses detection | Inference continues; events append to the outbox |
| Event identity during an outage | Implicit; relies on server wall clock | Monotonic local_seq per device, written to disk |
| Retry behavior across reboot | Usually resets; send storm possible | Backoff state persisted to /var/lib/cyrano/meta/ |
| Dedupe contract on reconnect | Vendor-defined, usually uuid or timestamp | (device_id, local_seq), collapsed server-side |
| Forensic inspectability | Cloud query only | tail /var/lib/cyrano/outbox/*.jsonl on the unit |
| Operator question vendor must answer | How fast is inference? | Where is the buffer file, and what is the dedupe key? |

The thing that is uncopyable

You can tail the file.

The outbox is a plain UTF-8 NDJSON file, rotated daily. Any technician with a USB keyboard can plug in, cd to /var/lib/cyrano/outbox, and tail -n 50 the current day. That is the whole verification story for alert continuity. No SDK. No vendor dashboard login. No support ticket. The artifact is on the unit; the behavior is observable; the contract is one line per event with a counter that only ever increases.

Failure modes this unit is designed to survive

ISP router reboot
DNS resolver flap
TLS chain expiry
WhatsApp API 5xx
SMS gateway timeout
NTP skew after boot
Property power flicker
WAN bandwidth crash
Captive portal reboot
Cellular failover lag

Every one of these collapses into the same branch of the send loop: append to the outbox, bump the backoff, wait. Detection never pauses. The counter never resets. The drain picks up where it left off.

When the latency definition is fine and this continuity definition does not matter

Worth being direct: if the property has dual-WAN failover, a dedicated circuit, and a written SLA from the ISP, the continuity window is short enough that the behavior on this page is a footnote. The latency-and-bandwidth framing covers the buyer correctly in that world.

For the rest, the Class B and C multifamily, the construction trailers on a cellular hotspot, the strip of storefronts on a residential DSL line, AI edge only counts if the unit can ride out a six-hour outage and drain cleanly when the link comes back. That is the case this page is for, and that is the case the outbox file, the local_seq counter, and the persisted backoff state were built to answer.

Watch a Cyrano unit ride out a live partition drill

A 15-minute call. We pull the uplink on a production unit, generate events, and tail /var/lib/cyrano/outbox/*.jsonl with you while the drain catches up on reconnect.

Book a call

AI edge: frequently asked questions

What does 'AI edge' mean on the current SERP, and what does that framing miss?

On page one of 'ai edge' results (NVIDIA, IBM, Cisco, HPE, Red Hat, Dell), the phrase is used to name inference that runs outside a cloud region on hardware close to the data source. The benefits listed are always the same triad: lower latency, lower bandwidth, better privacy. That triad is correct but incomplete. It treats the upstream link as if it were always present. The missing definition is continuity: AI edge is the category where detection and alert queuing continue to work when the upstream is not present, so the system does not degrade into a dashboard that says 'unknown' for four hours.

What is the concrete on-device artifact Cyrano uses to prove alert continuity during a WAN outage?

An append-only NDJSON file at /var/lib/cyrano/outbox/YYYY-MM-DD.jsonl. Every emitted event is one line of JSON containing the device_id, a monotonic local_seq counter, the composite layout_id, the camera name, the detection class, a thumbnail path, and the capture timestamp. When delivery to WhatsApp or SMS fails, the line is not lost; it stays in the file. On reconnect, a drain worker reads forward from the last acknowledged local_seq and re-posts events in strict order. The server collapses duplicates by (device_id, local_seq) so the at-least-once retry is safe.
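The server-side half of that contract is small enough to sketch. The in-memory set below is an illustration only; a real ingest service would enforce the same rule with a unique index on the (device_id, local_seq) tuple.

```python
# Illustrative server-side collapse of an at-least-once event stream,
# keyed by (device_id, local_seq). In production this would be a unique
# index, not an in-memory set.
def collapse(events):
    seen, unique = set(), []
    for e in events:
        key = (e["device_id"], e["local_seq"])
        if key in seen:
            continue  # retry of an already-accepted event: safe to drop
        seen.add(key)
        unique.append(e)
    return unique
```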

Why is local_seq a per-device monotonic counter instead of a uuid or a wall-clock timestamp?

Because the identity of an event on an edge device has to survive both the network being dead and the clock being wrong. A uuid requires a random source and cannot be ordered. A wall-clock timestamp collides during NTP correction after a boot with no network. A per-device monotonic counter kept on local storage gives you: strict ordering for the drain, a stable dedupe key on the server, and a visible gap audit if a counter value never arrives. The cost is one 8-byte write per event to /var/lib/cyrano/meta/local_seq.
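That 8-byte write can be sketched as follows. The path is the one named above; the little-endian u64 encoding is an assumption for the example, and the function name is illustrative.

```python
import struct
from pathlib import Path

# Illustrative persisted per-device counter. The path matches the one named
# above; the little-endian u64 encoding ("<Q") is an assumption.
SEQ_FILE = Path("/var/lib/cyrano/meta/local_seq")

def next_seq(path: Path = SEQ_FILE) -> int:
    """Advance the counter and persist it before the event is emitted."""
    current = struct.unpack("<Q", path.read_bytes())[0] if path.exists() else 0
    current += 1
    path.write_bytes(struct.pack("<Q", current))  # exactly 8 bytes on disk
    return current
```

Because the counter is written before the event is emitted, a crash between the two can burn a sequence number but can never reuse one, which is exactly what a dedupe key needs.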

What actually breaks during a WAN outage on a Cyrano unit, and what does not?

What does not break: HDMI capture, layout classification, overlay-mask subtract, the forward pass, the zone filter, the dwell filter, the thumbnail crop, the outbox append, and the on-device event log. Inference keeps running at the normal frame rate, and every triggered event is persisted with a local_seq. What does break: live delivery to WhatsApp and SMS, dashboard push updates, and cloud backups of thumbnails. Those are deferred, not dropped. When the link returns, the drain worker walks the outbox forward and the dashboard backfills. A six-hour uplink blackout at a typical 16-camera property leaves low tens to low hundreds of queued events in the file and drains in seconds on reconnect.

Does the unit run with a flaky DNS or an intermittent upstream?

Yes. DNS failure, TLS handshake failure, and WhatsApp 5xx are all folded into the same 'delivery failed' branch of the send loop, so the event sits in the outbox until the next retry tick. The retry tick uses jittered exponential backoff between 5 seconds and 5 minutes, so a rebooting ISP router does not amplify into a send storm the moment it comes back. The backoff state itself lives on disk, so a reboot mid-outage does not reset the schedule.
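The persisted backoff can be sketched with the parameters stated above: a 5-second floor, a 5-minute ceiling, doubling with jitter, state on disk. The file format and function names here are assumptions, not the unit's actual retry_state layout.

```python
import json
import random
from pathlib import Path

# Illustrative persisted jittered exponential backoff. Floor and ceiling
# match the parameters described above; the JSON state format is an
# assumption for the example.
BASE, CAP = 5.0, 300.0

def next_delay(state_file: Path) -> float:
    """Read, jitter, bump, and persist the backoff window."""
    state = {"delay": BASE}
    if state_file.exists():
        state = json.loads(state_file.read_text())
    delay = random.uniform(BASE, state["delay"])   # jitter within the window
    state["delay"] = min(state["delay"] * 2, CAP)  # double, capped at 5 min
    state_file.write_text(json.dumps(state))       # survives a mid-outage reboot
    return delay

def reset(state_file: Path) -> None:
    """On a successful delivery, collapse the window back to the floor."""
    state_file.write_text(json.dumps({"delay": BASE}))
```

Persisting the window is what prevents the send storm: a reboot mid-outage reloads a wide window from disk instead of starting every retry back at 5 seconds.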

How does this compare to a cloud-AI camera system during the same outage?

A cloud-AI system defines detection as a pipeline that begins when the frame arrives in the cloud. During a WAN outage that pipeline is not partially degraded, it is absent. No frames arrive, so no inference runs. Incidents that occur during the outage are only recoverable by forensic review of whatever the on-prem recorder was independently storing. With edge AI the inference is local, so the same outage leaves you with a queue of real alerts waiting to drain, not a silence.

What is the shape of one outbox line?

It is a single line of JSON, roughly 400 to 800 bytes depending on metadata, ending with a newline. Keys: device_id (the hardware serial), local_seq (monotonic u64), captured_at (ISO-8601 in UTC), layout_id (for example 4x4-std), camera_name (human name from the DVR name strip), event_class (pre_action_zone_entry, loitering_dwell_exceeded, tailgate_detected, package_dwell_exceeded, and so on), zone_id, dwell_seconds, thumb_path (for example /var/lib/cyrano/thumbs/<sha256>.jpg), and model_tag (for example person-v7@a1f2c9). That line is the envelope the unit commits to before it even attempts to call WhatsApp.

Why include model_tag on every event?

Because an edge AI unit runs a specific model build at a specific moment and that version has to be legible after the fact. If an operator pulls the March outbox to investigate a disputed incident, the model_tag tells them exactly which weights generated the detection. When the model is updated on a later OTA push, old events still reference the old tag; the new tag only applies to events generated after the swap. Compliance, insurance review, and internal postmortem all require that trail, and it cannot be reconstructed on the cloud side because the cloud only ever sees the drained event, not the device state that produced it.

Is the outbox a database?

No, and deliberately. It is an append-only flat file with one event per line, rotated daily. The choice avoids a live process holding a database file that would need to survive power loss, corruption, and reboot. A flat NDJSON file is resilient to abrupt power cuts in a way that a writing database is not; the worst case is a truncated final line, which the drain worker detects and discards. The schema is human-readable, so a field technician can open the file on a USB keyboard, tail the last 50 lines, and see exactly what the unit emitted during the window in question.
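That worst case, a truncated final line after an abrupt power cut, can be handled in a few lines. This is a sketch of the check described above, with an illustrative function name; a partial trailing line fails to parse and everything before it is kept.

```python
import json
from pathlib import Path

# Illustrative reader for a day's outbox file. A power cut mid-append leaves
# at most one partial line at the end; it fails to parse and is discarded,
# and every complete line before it survives.
def read_outbox(path: Path):
    events = []
    for raw in path.read_text().splitlines():
        try:
            events.append(json.loads(raw))
        except json.JSONDecodeError:
            break  # partial trailing line from an abrupt power cut
    return events
```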

What does this angle on AI edge mean for a property operator choosing hardware?

It shifts the evaluation from 'how fast is inference' to 'what is the operational posture when the link goes down.' Ask the vendor where events are buffered during a WAN outage, how long the buffer survives, what the dedupe key is on the server, and whether the retry backoff state survives a reboot. A product that cannot answer those four questions with a concrete file or table is not really edge AI; it is cloud AI with a local accelerator. Cyrano answers all four with /var/lib/cyrano/outbox/*.jsonl, 'until disk fills,' '(device_id, local_seq),' and 'yes, the backoff state is persisted to /var/lib/cyrano/meta/retry_state.'

🛡️ Cyrano: Edge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
