AI on edge is two models on one box, because a human at the DVR can change the picture at any moment.
The stock definition of AI on edge assumes a stationary input pipeline: the same sensor feeds the same model forever. That assumption holds for a smart camera; it does not hold for anything that watches a DVR's HDMI output. On a Cyrano unit the picture coming out of the cable changes shape whenever the property's operator presses a front-panel button. This page is about the second, much smaller model that sits in front of the main detector and decides how to read the frame.
See the classifier + detector stack on a live unit

The hidden assumption in every top-ranked AI-on-edge article
Read the first page of results for AI on edge and you will repeatedly see the same reference architecture: a sensor with an integrated accelerator, a fixed preprocessing pipeline, a fixed model, a fixed output. The explainers (NVIDIA, IBM, Red Hat, HPE, Cisco) talk about TOPS, quantization, privacy, and latency. All of that is true in the sensor-with-an-accelerator world. None of it addresses what happens when the input itself is not stationary.
On a Cyrano unit the input is whatever picture the DVR is currently showing on its monitor output. The DVR operator, who is almost always a property employee with a physical mouse and remote, can change that picture at any time. They can flip from a 4x4 grid to a 5x5 grid. They can double-click one tile and expand it to full screen. They can open the DVR's own settings menu, at which point the picture becomes a configuration dialog with no camera tiles at all. The edge AI gets no advance notice of any of this; it only sees the HDMI frames arriving at 15 fps.
If you run a single monolithic object detector on that feed, the labels come back attached to pixel coordinates that nobody can map back to a camera name, because the camera-to-pixel mapping is the thing that just changed. The only architecturally honest design answers two questions per frame, not one.
The stack, in one pass per frame
Every frame goes through the same two-stage pipeline. Stage one is cheap and decides how to read the frame. Stage two is the expensive work, and it runs only after stage one has committed a layout.
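The two-stage pass can be sketched as a single function. The interfaces here are illustrative, not Cyrano's published API: stage one names the layout, stage two runs the detector per tile and attaches the camera name the committed layout implies.

```python
def per_frame(frame, classify, tiles_for, detect):
    """One pass over one frame. Stage one is cheap (which layout am I
    looking at?); stage two is the expensive work, run only against
    the committed layout's crop schedule. Interfaces are assumptions."""
    layout_id = classify(frame)          # stage one: commit a layout_id
    events = []
    for tile in tiles_for(layout_id):    # crop schedule for that layout
        for det in detect(frame, tile["crop"]):
            # The mapping from pixel coordinates back to a camera name
            # is exactly what the layout commit makes possible.
            det["camera_name"] = tile["camera_name"]
            det["layout_id"] = layout_id
            events.append(det)
    return events
```

With stubbed components, a detection inside tile 7 of a 4x4 grid comes out labeled with that tile's camera name rather than bare pixel coordinates.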
per-frame pipeline on a Cyrano unit
The uncopyable artifact: tiles.json for one layout
Each recognized multiview has its own directory under /var/lib/cyrano/layouts/. Inside is a tiles.json with the crop rectangles, and a masks/ subdirectory with one PNG per mask used to cover DVR chrome. The file below is the real shape of the 4x4-std entry, abbreviated to its first four tiles. The tile router re-reads this file whenever current_layout changes; it does not keep it in memory across layout switches.
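The real file is not reproduced here. The sketch below reconstructs the shape the FAQ describes, one entry per tile with tile_index, camera_name, a crop rectangle in full-frame 1920x1080 pixels, and a mask_ref; the camera names and mask file names are placeholders, not values from a live unit.

```python
import json

# Illustrative first four 4x4-std entries (the top row of the grid).
# Schema follows the FAQ; names are placeholders, not the real file.
tiles_4x4_std = [
    {"tile_index": i + 1,
     "camera_name": f"cam{i + 1:02d}",
     "crop": {"x": (i % 4) * 480, "y": (i // 4) * 270, "w": 480, "h": 270},
     "mask_ref": f"masks/tile{i + 1:02d}_label.png"}
    for i in range(4)
]
print(json.dumps(tiles_4x4_std, indent=2))
```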
~40 known layouts today. The number grows through OTA updates when a unit uploads a fingerprint that the classifier flags as unknown.
The classifier, in 25 lines of Python
This is the whole of the layout classifier's per-frame logic. Four concerns: inference, hysteresis, commit, and the unknown-layout fallback. The only persistent state between frames is a five-element ring buffer in memory and one text file on disk. The tile router does not call into this code; it just reads /var/lib/cyrano/state/current_layout and trusts the value it finds there.
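The real 25 lines are not reproduced on this page, but the commit logic they describe can be sketched. Everything below follows the article's own constraints: a five-element ring buffer in memory, one text file on disk, a five-frame hysteresis window, and an unknown-layout fallback; the class and method names are mine.

```python
import os
from collections import deque

STATE_PATH = "/var/lib/cyrano/state/current_layout"  # path named in this article
WINDOW = 5  # consecutive frames that must agree before a commit

class LayoutHysteresis:
    """Sketch of the per-frame commit logic. The only persistent state
    is the ring buffer in memory and the one-line state file on disk."""

    def __init__(self, state_path=STATE_PATH, window=WINDOW):
        self.state_path = state_path
        self.ring = deque(maxlen=window)   # the five-element ring buffer
        self.current = self._read_state()  # last committed layout_id

    def _read_state(self):
        try:
            with open(self.state_path) as f:
                return f.read().strip()
        except FileNotFoundError:
            return None

    def _write_state(self, layout_id):
        # Atomic replace so the tile router never reads a half-written value.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(layout_id + "\n")
        os.replace(tmp, self.state_path)

    def observe(self, layout_id, known=True):
        """Feed one classifier result; return the committed layout_id.
        Unknown layouts never enter the ring, so routing stays on the
        last known-good value (the unknown-layout fallback)."""
        if not known:
            return self.current
        self.ring.append(layout_id)
        if (len(self.ring) == self.ring.maxlen
                and len(set(self.ring)) == 1
                and layout_id != self.current):
            self.current = layout_id
            self._write_state(layout_id)  # the commit
        return self.current
```

A disagreeing frame inside the window resets the streak, which is exactly why a DVR transition animation cannot flip the router mid-switch.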
What the classifier output routes into
A layout change in the unit log
Below is an unedited slice of the event stream from a Cyrano unit while a property employee flipped the DVR from a 4x4 grid to a single-camera full-screen view and back. The classifier confidence wobbles during the transition animation; the hysteresis window holds routing steady until five consecutive frames agree. No false alerts fire in the gap because the detector is still reading tiles from the last committed layout_id.
“During the layout switch drill shown above, the unit handled two full human transitions (grid to single-camera, single-camera back to grid) with zero mis-routed detections. Every detection was attached to the correct camera_name, and every tile read came from the correct tiles.json. That is the behavior a stationary-input assumption cannot deliver.”
Cyrano field notes, layout switch drill on a multifamily DVR
The four moments of a human layout change
The transition is not a single event in the classifier's view. It has a quiet before, a visually ambiguous middle, a committing moment, and a new quiet after. The hysteresis rule exists to keep routing stable across the middle.
one DVR button press, seen from the edge unit
Quiet before
Classifier is returning the same layout_id at 0.99 confidence every frame. current_layout is stable. The router reads the same tiles.json on every frame.
Ambiguous middle
Operator presses the DVR panel. The next 3 to 8 frames show a transition animation or a reshuffled grid. Classifier confidence drops into the 0.55 to 0.9 range. Hysteresis holds; current_layout is not overwritten.
Commit
Five consecutive frames agree on the new layout_id. The classifier writes the new value to /var/lib/cyrano/state/current_layout. The tile router will use the new tiles.json starting with the next frame.
Quiet after
Routing is stable again, but now against a different crop schedule and a different camera-to-tile mapping. The detector runs as before. Every emitted event has the new layout_id on its outbox line.
Stationary-input AI on edge vs non-stationary-input AI on edge
The table below is the distinction the stock SERP definition does not draw. Both sides qualify as AI on edge; they have very different architectural bills of materials.
| Feature | Stationary input (stock SERP example) | Non-stationary input (Cyrano) |
|---|---|---|
| Typical reference device | Smart IP camera with an accelerator | Box tapping an existing DVR's HDMI output |
| Who controls the input frame | The camera, fixed at install time | A human operator at the DVR, anytime |
| Number of models on the edge unit | One (the main detector) | Two (layout classifier + main detector) |
| Per-frame pre-pass | None required | ~2.4 ms layout classification at 256x144 |
| Tile-to-camera mapping | Compiled in once; static | Re-read per layout_id from tiles.json |
| How DVR chrome is handled | Not applicable; no chrome in the feed | Per-layout overlay mask PNG subtracted pre-detector |
| Recovery when the operator presses a button | Not applicable | 5-frame hysteresis, then a state_write to disk |
The thing that is uncopyable
You can ls the classifier.
On a Cyrano unit there is a model, a state file, and a layout library, and all three are on disk with plain names. ls /var/lib/cyrano/models and you will see layout-classifier-v3.onnx next to the main detector. cat /var/lib/cyrano/state/current_layout and you will see the layout_id the router is using right now. ls /var/lib/cyrano/layouts and you will see one directory per known DVR multiview. There is no cloud dependency for any of it; the classifier ships with the unit and the library grows through OTA. That is the verification story for AI on edge as a non-stationary-input system.
Layouts this unit routinely sees in the field
Each of these has its own tiles.json under /var/lib/cyrano/layouts/, and the classifier is trained to emit the matching layout_id. A view that is not in the library falls back to the last known-good layout_id and is flagged for OTA.
When the single-model assumption is actually fine
Worth being direct: if the deployment is a new IP camera with a Jetson or Coral on the PCB, and nobody outside your engineering team can touch the input, a single monolithic detector is the right architecture. That is the world the generic AI-on-edge explainers are written for, and they are correct for it.
The case this page is for is different. It is the case where the property already has a DVR on the wall, already has a human operator who uses the DVR daily, and you want to add AI without displacing any of that. In that case AI on edge is a two-model system by construction, and the specific on-device files under /var/lib/cyrano/ are the reason the result is stable rather than theatrical.
See the classifier fire on a live layout change
A 15-minute call. We tail /var/lib/cyrano/state/current_layout on a production unit while a property operator flips the DVR between grid, single-camera, and PTZ views. Zero mis-routed detections on the way through.
Book a call →

AI on edge: frequently asked questions
What does 'AI on edge' mean in the current SERP framing, and what does it leave out?
On page one of 'ai on edge' and 'ai at the edge' results, the phrase is used to describe machine-learning inference that runs outside a cloud datacenter on hardware next to the data. The canonical reference implementation is a sensor with an accelerator: a smart camera, a robot, a phone, or a car. The framing is fine, but it takes a critical assumption for granted: that the input to the model is stationary. Same camera, same pose, same crop, forever. What it leaves out is the class of edge AI where the input frame composition itself changes during operation because a human is controlling it. Cyrano is in that class: the HDMI frame coming out of a DVR switches layout whenever the DVR operator touches the front panel.
Why does this force Cyrano to run two models instead of one?
Because the coordinates of each camera within the frame depend on which multiview the DVR is currently rendering. In a 4x4 grid, camera 07 might be tile row 2 column 3. Toggle to a 5x5 grid and camera 07 has moved. Double-click to full-screen on camera 12 and tiles 01 through 25 all disappear except one. A single monolithic detector cannot be told 'camera 07 lives at these pixels' once and for all. The only correct answer is to ask, on every frame, 'what layout am I looking at' before the detector runs. That first question is answered by a small classifier; the second question (what is in each tile) is answered by the main detector. Two models on one box.
What are the concrete on-device artifacts that make the layout classifier work?
Two files and two directories. The model: /var/lib/cyrano/models/layout-classifier-v3.onnx, a lightweight ONNX graph that takes a downsampled full frame and emits a layout_id. The state: /var/lib/cyrano/state/current_layout, a single-line text file rewritten every frame after hysteresis, read by the tile router. The layout library: /var/lib/cyrano/layouts/<layout_id>/tiles.json, one directory per recognized multiview, each containing the crop rectangles for every tile in that layout plus an overlay mask PNG for DVR chrome (date strip, camera-name banners, PTZ menu). And the fingerprints: /var/lib/cyrano/layouts/fingerprints/*.png, one small PNG per known layout used to bootstrap the classifier when a unit encounters a new DVR firmware.
What does 'hysteresis' mean here and why is it in the pipeline?
Hysteresis is the rule that the tile router does not act on a single disagreeing frame from the classifier. Concretely, the current_layout file is only overwritten after 5 consecutive frames agree on the same layout_id. The reason is that a human transition on the DVR is not instantaneous; pressing the front-panel button to go from 4x4 to 5x5 produces a handful of intermediate frames that are neither fully 4x4 nor fully 5x5 (a sliding animation, a reshuffle, a black flash). Without hysteresis, the tile router would thrash for about 150 milliseconds every time the operator switches views, sending a burst of false alerts while a label from the old layout gets mapped to a crop from the new one. Five frames at 15 fps is ~330 ms; in practice the router changes cleanly after the transition completes.
What is inside one tiles.json file?
An array of entries, one per on-screen tile. Each entry has tile_index (1..N for that layout), camera_name (the label the DVR overlays on that tile), crop (an {x, y, w, h} rectangle in full-frame pixels at 1920x1080), and mask_ref (the overlay file name for the chrome inside that tile, typically the DVR's own label strip). For example the 4x4-std layout has 16 entries with 480x270 crops, each offset by the expected grid spacing and each with a 22-pixel bottom mask where the DVR prints the camera name. The 5x5-std layout has 25 entries with 384x216 crops. A single-camera full-screen view has one entry with a 1920x1080 crop and a slimmer top mask for the date strip only.
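The crop arithmetic for the uniform grids checks out in a few lines. This is a sketch, assuming tile_index counts left-to-right, top-to-bottom from 1; mask geometry is omitted.

```python
def grid_crops(rows, cols, frame_w=1920, frame_h=1080):
    """Crop rectangles for a uniform rows x cols multiview at the
    full-frame resolution the tiles.json entries use."""
    w, h = frame_w // cols, frame_h // rows
    return [
        {"tile_index": r * cols + c + 1,
         "crop": {"x": c * w, "y": r * h, "w": w, "h": h}}
        for r in range(rows)
        for c in range(cols)
    ]

# 4x4-std: 16 tiles of 480x270; 5x5-std: 25 tiles of 384x216.
```

Under this indexing, the FAQ's earlier example of camera 07 at row 2, column 3 of a 4x4 grid lands at tile_index 7 with its crop at x=960, y=270.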
Why does the classifier run every frame, not just on transitions?
Because the edge unit has no out-of-band signal telling it when the human at the DVR has touched the panel. There is no event bus from the DVR firmware, no GPIO, no serial line. The only input the unit has is the HDMI picture itself. So the only way to notice a layout change is to look for it on every frame. The classifier model is small enough that this costs almost nothing: on the Cyrano reference unit, the classifier runs in ~2.4 ms per frame at 256x144 input resolution, which is under 4 percent of the frame budget at 15 fps. The rest of the budget belongs to the main detector.
What happens when the classifier encounters a DVR layout the fingerprint library does not know?
It falls back to the nearest-match layout and writes an unknown_layout event to the on-device log at /var/lib/cyrano/logs/layout.log with a downsampled fingerprint. The tile router continues using the old known-good layout_id for another 60 seconds, degrading gracefully rather than aborting. On the next OTA window the fingerprint is uploaded to Cyrano; if engineering confirms it is a new DVR firmware, a fresh tiles.json entry is pushed back down and all units on that firmware learn it. This is the only path by which the layout library grows; the unit does not try to invent a new layout from scratch on its own.
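That fallback branch can be sketched as follows. The log record's field names are assumptions here; the article only names the file path and the unknown_layout event type.

```python
import json

def resolve_layout(probs, known_layouts, last_good, log_lines):
    """If the classifier's best label is in the on-device library, use it.
    Otherwise append an unknown_layout record (field names assumed) and
    keep routing on the last known-good layout_id."""
    best = max(probs, key=probs.get)
    if best in known_layouts:
        return best
    log_lines.append(json.dumps({
        "event": "unknown_layout",
        "nearest_match": best,
        "confidence": probs[best],
        "fallback": last_good,
    }))
    return last_good
```

The key property is that the return value is always a layout_id the tile router can act on; the unit degrades to stale-but-valid routing rather than inventing a layout.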
Could this be done with one bigger model that outputs bounding boxes directly on the full frame?
In principle yes, and some cloud pipelines do exactly that. The practical problem on an edge unit is that the same human at the DVR is free to choose the picture that the detector sees, including picture-in-picture views, PTZ pop-outs, menu overlays, and the occasional full-screen camera-setup dialog from the DVR firmware itself. A single detector trained on every such permutation has to cover the joint distribution of 'multiview x DVR make x DVR firmware,' which is large, unstable, and expensive to maintain. The two-model split localizes the problem: the classifier has a tiny, cheap label space (about 40 known layouts today), and the detector has a stable input (a known-good crop with its chrome masked out) regardless of what the operator did upstream.
What does this mean for the event lines Cyrano writes to the outbox?
Every outbox line carries the layout_id the unit saw at capture time (for example layout_id: '4x4-std'), so the downstream system can later reason about which DVR view produced each event. It also means that a human operator changing the DVR view mid-incident does not break the incident record; the outbox entries that precede and follow the layout change both have correct camera_name values, because the router honored the classifier's call. If you pull a day of outbox and group by layout_id, you can see exactly which views the DVR spent its time in; this is also how Cyrano detects that a property's operator has started using a layout the fingerprint library does not know yet.
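Grouping a day of outbox by layout_id is a short pass. The line format here is an assumption (one JSON object per line carrying a layout_id field); the article does not publish the outbox schema.

```python
import json
from collections import Counter

def views_by_layout(outbox_lines):
    """Tally events per DVR view. Assumes one JSON object per outbox
    line with a layout_id field; the real schema is not published."""
    return Counter(json.loads(line)["layout_id"] for line in outbox_lines)
```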
What is the operational takeaway for someone evaluating 'AI on edge' for a multifamily property?
Ask the vendor how their edge AI handles a non-stationary input pipeline. If the demo is a stock IP camera feeding a fixed crop to a fixed detector, that is one narrow shape of edge AI and it will not survive a property where an operator is allowed to change the DVR view. If the answer is 'we run a classifier in front of the detector and route tiles based on the current layout,' ask to see the three artifacts: the classifier model file, the current_layout state file, and the tiles.json directory. On a Cyrano unit those are at /var/lib/cyrano/models/layout-classifier-v3.onnx, /var/lib/cyrano/state/current_layout, and /var/lib/cyrano/layouts/. If a vendor cannot show you those three things, they are running cloud AI with an accelerator, not AI on edge.