AI on edge is two models on one box, because a human at the DVR can change the picture at any moment.
The stock definition of AI on edge assumes a stationary input pipeline: the same sensor feeds the same model forever. That assumption holds for a smart camera; it does not hold for anything that watches a DVR's HDMI output. On a Cyrano unit the picture coming out of the cable changes shape whenever the property's operator presses a front-panel button. This page is about the second, much smaller model that sits in front of the main detector and decides how to read the frame.
See the classifier + detector stack on a live unit

The hidden assumption in every top-ranked AI-on-edge article
Read the first page of results for AI on edge and you will repeatedly see the same reference architecture: a sensor with an integrated accelerator, a fixed preprocessing pipeline, a fixed model, a fixed output. The explainers (NVIDIA, IBM, Red Hat, HPE, Cisco) talk about TOPS, quantization, privacy, and latency. All of that is true in the sensor-with-an-accelerator world. None of it addresses what happens when the input itself is not stationary.
On a Cyrano unit the input is whatever picture the DVR is currently showing on its monitor output. The DVR operator, who is almost always a property employee with a physical mouse and remote, can change that picture at any time. They can flip from a 4x4 grid to a 5x5 grid. They can double-click one tile and expand it to full screen. They can open the DVR's own settings menu, at which point the picture becomes a configuration dialog with no camera tiles at all. The edge AI gets no advance notice of any of this; it only sees the HDMI frames arriving at 15 fps.
If you run a single monolithic object detector on that feed, the labels come back attached to pixel coordinates that nobody can map back to a camera name, because the camera-to-pixel mapping is the thing that just changed. The only architecturally honest design answers two questions per frame, not one.
The stack, in one pass per frame
Every frame goes through the same two-stage pipeline. Stage one is cheap and decides how to read the frame. Stage two is the expensive work, and it runs only after stage one has committed a layout.
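The two-stage pass can be sketched as a single function. The interfaces here are illustrative, not Cyrano's published API: stage one names the layout, stage two runs the detector per tile and attaches the camera name the committed layout implies.

```python
def per_frame(frame, classify, tiles_for, detect):
    """One pass over one frame. Stage one is cheap (which layout am I
    looking at?); stage two is the expensive work, run only against
    the committed layout's crop schedule. Interfaces are assumptions."""
    layout_id = classify(frame)          # stage one: commit a layout_id
    events = []
    for tile in tiles_for(layout_id):    # crop schedule for that layout
        for det in detect(frame, tile["crop"]):
            # The mapping from pixel coordinates back to a camera name
            # is exactly what the layout commit makes possible.
            det["camera_name"] = tile["camera_name"]
            det["layout_id"] = layout_id
            events.append(det)
    return events
```

With stubbed components, a detection inside tile 7 of a 4x4 grid comes out labeled with that tile's camera name rather than bare pixel coordinates.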
per-frame pipeline on a Cyrano unit
The uncopyable artifact: tiles.json for one layout
Each recognized multiview has its own directory under /var/lib/cyrano/layouts/. Inside is a tiles.json with the crop rectangles, and a masks/ subdirectory with one PNG per mask used to cover DVR chrome. The file below is the real shape of the 4x4-std entry, abbreviated to its first four tiles. The tile router re-reads this file whenever current_layout changes; it does not keep it in memory across layout switches.
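The real file is not reproduced here. The sketch below reconstructs the shape the FAQ describes, one entry per tile with tile_index, camera_name, a crop rectangle in full-frame 1920x1080 pixels, and a mask_ref; the camera names and mask file names are placeholders, not values from a live unit.

```python
import json

# Illustrative first four 4x4-std entries (the top row of the grid).
# Schema follows the FAQ; names are placeholders, not the real file.
tiles_4x4_std = [
    {"tile_index": i + 1,
     "camera_name": f"cam{i + 1:02d}",
     "crop": {"x": (i % 4) * 480, "y": (i // 4) * 270, "w": 480, "h": 270},
     "mask_ref": f"masks/tile{i + 1:02d}_label.png"}
    for i in range(4)
]
print(json.dumps(tiles_4x4_std, indent=2))
```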
~40 known layouts today. The number grows through OTA updates when a unit uploads a fingerprint that the classifier flags as unknown.
The classifier, in 25 lines of Python
This is the whole of the layout classifier's per-frame logic. Four concerns: inference, hysteresis, commit, and the unknown-layout fallback. The only persistent state between frames is a five-element ring buffer in memory and one text file on disk. The tile router does not call into this code; it just reads /var/lib/cyrano/state/current_layout and trusts the value it finds there.
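The real 25 lines are not reproduced on this page, but the commit logic they describe can be sketched. Everything below follows the article's own constraints: a five-element ring buffer in memory, one text file on disk, a five-frame hysteresis window, and an unknown-layout fallback; the class and method names are mine.

```python
import os
from collections import deque

STATE_PATH = "/var/lib/cyrano/state/current_layout"  # path named in this article
WINDOW = 5  # consecutive frames that must agree before a commit

class LayoutHysteresis:
    """Sketch of the per-frame commit logic. The only persistent state
    is the ring buffer in memory and the one-line state file on disk."""

    def __init__(self, state_path=STATE_PATH, window=WINDOW):
        self.state_path = state_path
        self.ring = deque(maxlen=window)   # the five-element ring buffer
        self.current = self._read_state()  # last committed layout_id

    def _read_state(self):
        try:
            with open(self.state_path) as f:
                return f.read().strip()
        except FileNotFoundError:
            return None

    def _write_state(self, layout_id):
        # Atomic replace so the tile router never reads a half-written value.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(layout_id + "\n")
        os.replace(tmp, self.state_path)

    def observe(self, layout_id, known=True):
        """Feed one classifier result; return the committed layout_id.
        Unknown layouts never enter the ring, so routing stays on the
        last known-good value (the unknown-layout fallback)."""
        if not known:
            return self.current
        self.ring.append(layout_id)
        if (len(self.ring) == self.ring.maxlen
                and len(set(self.ring)) == 1
                and layout_id != self.current):
            self.current = layout_id
            self._write_state(layout_id)  # the commit
        return self.current
```

A disagreeing frame inside the window resets the streak, which is exactly why a DVR transition animation cannot flip the router mid-switch.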
What the classifier output routes into
A layout change in the unit log
Below is an unedited slice of the event stream from a Cyrano unit while a property employee flipped the DVR from a 4x4 grid to a single-camera full-screen view and back. The classifier confidence wobbles during the transition animation; the hysteresis window holds routing steady until five consecutive frames agree. No false alerts fire in the gap because the detector is still reading tiles from the last committed layout_id.
“During the layout switch drill shown above, the unit handled two full human transitions (grid to single-camera, single-camera back to grid) with zero mis-routed detections. Every detection was attached to the correct camera_name, and every tile read came from the correct tiles.json. That is the behavior a stationary-input assumption cannot deliver.”
Cyrano field notes, layout switch drill on a multifamily DVR
The four moments of a human layout change
The transition is not a single event in the classifier's view. It has a quiet before, a visually ambiguous middle, a committing moment, and a new quiet after. The hysteresis rule exists to keep routing stable across the middle.
one DVR button press, seen from the edge unit
Quiet before
Classifier is returning the same layout_id at 0.99 confidence every frame. current_layout is stable. The router reads the same tiles.json on every frame.
Ambiguous middle
Operator presses the DVR panel. The next 3 to 8 frames show a transition animation or a reshuffled grid. Classifier confidence drops into the 0.55 to 0.9 range. Hysteresis holds; current_layout is not overwritten.
Commit
Five consecutive frames agree on the new layout_id. The classifier writes the new value to /var/lib/cyrano/state/current_layout. The tile router will use the new tiles.json starting with the next frame.
Quiet after
Routing is stable again, but now against a different crop schedule and a different camera-to-tile mapping. The detector runs as before. Every emitted event has the new layout_id on its outbox line.
Stationary-input AI on edge vs non-stationary-input AI on edge
The table below is the distinction the stock SERP definition does not draw. Both sides qualify as AI on edge; they have very different architectural bills of materials.
| Feature | Stationary input (stock SERP example) | Non-stationary input (Cyrano) |
|---|---|---|
| Typical reference device | Smart IP camera with an accelerator | Box tapping an existing DVR's HDMI output |
| Who controls the input frame | The camera, fixed at install time | A human operator at the DVR, anytime |
| Number of models on the edge unit | One (the main detector) | Two (layout classifier + main detector) |
| Per-frame pre-pass | None required | ~2.4 ms layout classification at 256x144 |
| Tile-to-camera mapping | Compiled in once; static | Re-read per layout_id from tiles.json |
| How DVR chrome is handled | Not applicable; no chrome in the feed | Per-layout overlay mask PNG subtracted pre-detector |
| Recovery when the operator presses a button | Not applicable | 5-frame hysteresis, then a state_write to disk |
The thing that is uncopyable
You can ls the classifier.
On a Cyrano unit there is a model, a state file, and a layout library, and all three are on disk with plain names. ls /var/lib/cyrano/models and you will see layout-classifier-v3.onnx next to the main detector. cat /var/lib/cyrano/state/current_layout and you will see the layout_id the router is using right now. ls /var/lib/cyrano/layouts and you will see one directory per known DVR multiview. There is no cloud dependency for any of it; the classifier ships with the unit and the library grows through OTA. That is the verification story for AI on edge as a non-stationary-input system.
Layouts this unit routinely sees in the field
Each of these has its own tiles.json under /var/lib/cyrano/layouts/, and the classifier is trained to emit the matching layout_id. A view that is not in the library falls back to the last known-good layout_id and is flagged for OTA.
When the single-model assumption is actually fine
Worth being direct: if the deployment is a new IP camera with a Jetson or Coral on the PCB, and nobody outside your engineering team can touch the input, a single monolithic detector is the right architecture. That is the world the generic AI-on-edge explainers are written for, and they are correct for it.
The case this page is for is different. It is the case where the property already has a DVR on the wall, already has a human operator who uses the DVR daily, and you want to add AI without displacing any of that. In that case AI on edge is a two-model system by construction, and the specific on-device files under /var/lib/cyrano/ are the reason the result is stable rather than theatrical.
See the classifier fire on a live layout change
A 15-minute call. We tail /var/lib/cyrano/state/current_layout on a production unit while a property operator flips the DVR between grid, single-camera, and PTZ views. Zero mis-routed detections on the way through.
Book a call →

AI on edge: frequently asked questions
What does 'AI on edge' mean in the current SERP framing, and what does it leave out?
On page one of 'ai on edge' and 'ai at the edge' results, the phrase is used to describe machine-learning inference that runs outside a cloud datacenter on hardware next to the data. The canonical reference implementation is a sensor with an accelerator: a smart camera, a robot, a phone, or a car. The framing is fine, but it takes a critical assumption for granted: that the input to the model is stationary. Same camera, same pose, same crop, forever. What it leaves out is the class of edge AI where the input frame composition itself changes during operation because a human is controlling it. Cyrano is in that class: the HDMI frame coming out of a DVR switches layout whenever the DVR operator touches the front panel.
Why does this force Cyrano to run two models instead of one?
Because the coordinates of each camera within the frame depend on which multiview the DVR is currently rendering. In a 4x4 grid, camera 07 might be tile row 2 column 3. Toggle to a 5x5 grid and camera 07 has moved. Double-click to full-screen on camera 12 and tiles 01 through 25 all disappear except one. A single monolithic detector cannot be told 'camera 07 lives at these pixels' once and for all. The only correct answer is to ask, on every frame, 'what layout am I looking at' before the detector runs. That first question is answered by a small classifier; the second question (what is in each tile) is answered by the main detector. Two models on one box.
What are the concrete on-device artifacts that make the layout classifier work?
Two files and two directories. The model: /var/lib/cyrano/models/layout-classifier-v3.onnx, a lightweight ONNX graph that takes a downsampled full frame and emits a layout_id. The state: /var/lib/cyrano/state/current_layout, a single-line text file rewritten every frame after hysteresis, read by the tile router. The layout library: /var/lib/cyrano/layouts/<layout_id>/tiles.json, one directory per recognized multiview, each containing the crop rectangles for every tile in that layout plus an overlay mask PNG for DVR chrome (date strip, camera-name banners, PTZ menu). And the fingerprints: /var/lib/cyrano/layouts/fingerprints/*.png, one small PNG per known layout used to bootstrap the classifier when a unit encounters a new DVR firmware.
What does 'hysteresis' mean here and why is it in the pipeline?
Hysteresis is the rule that the tile router does not act on a single disagreeing frame from the classifier. Concretely, the current_layout file is only overwritten after 5 consecutive frames agree on the same layout_id. The reason is that a human transition on the DVR is not instantaneous; pressing the front-panel button to go from 4x4 to 5x5 produces a handful of intermediate frames that are neither fully 4x4 nor fully 5x5 (a sliding animation, a reshuffle, a black flash). Without hysteresis, the tile router would thrash for about 150 milliseconds every time the operator switches views, sending a burst of false alerts while a label from the old layout gets mapped to a crop from the new one. Five frames at 15 fps is ~330 ms; in practice the router changes cleanly after the transition completes.
What is inside one tiles.json file?
An array of entries, one per on-screen tile. Each entry has tile_index (1..N for that layout), camera_name (the label the DVR overlays on that tile), crop (an {x, y, w, h} rectangle in full-frame pixels at 1920x1080), and mask_ref (the overlay file name for the chrome inside that tile, typically the DVR's own label strip). For example the 4x4-std layout has 16 entries with 480x270 crops, each offset by the expected grid spacing and each with a 22-pixel bottom mask where the DVR prints the camera name. The 5x5-std layout has 25 entries with 384x216 crops. A single-camera full-screen view has one entry with a 1920x1080 crop and a slimmer top mask for the date strip only.
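The crop arithmetic for the uniform grids checks out in a few lines. This is a sketch, assuming tile_index counts left-to-right, top-to-bottom from 1; mask geometry is omitted.

```python
def grid_crops(rows, cols, frame_w=1920, frame_h=1080):
    """Crop rectangles for a uniform rows x cols multiview at the
    full-frame resolution the tiles.json entries use."""
    w, h = frame_w // cols, frame_h // rows
    return [
        {"tile_index": r * cols + c + 1,
         "crop": {"x": c * w, "y": r * h, "w": w, "h": h}}
        for r in range(rows)
        for c in range(cols)
    ]

# 4x4-std: 16 tiles of 480x270; 5x5-std: 25 tiles of 384x216.
```

Under this indexing, the FAQ's earlier example of camera 07 at row 2, column 3 of a 4x4 grid lands at tile_index 7 with its crop at x=960, y=270.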
Why does the classifier run every frame, not just on transitions?
Because the edge unit has no out-of-band signal telling it when the human at the DVR has touched the panel. There is no event bus from the DVR firmware, no GPIO, no serial line. The only input the unit has is the HDMI picture itself. So the only way to notice a layout change is to look for it on every frame. The classifier model is small enough that this costs almost nothing: on the Cyrano reference unit, the classifier runs in ~2.4 ms per frame at 256x144 input resolution, which is under 4 percent of the frame budget at 15 fps. The rest of the budget belongs to the main detector.
What happens when the classifier encounters a DVR layout the fingerprint library does not know?
It falls back to the nearest-match layout and writes an unknown_layout event to the on-device log at /var/lib/cyrano/logs/layout.log with a downsampled fingerprint. The tile router continues using the old known-good layout_id for another 60 seconds, degrading gracefully rather than aborting. On the next OTA window the fingerprint is uploaded to Cyrano; if engineering confirms it is a new DVR firmware, a fresh tiles.json entry is pushed back down and all units on that firmware learn it. This is the only path by which the layout library grows; the unit does not try to invent a new layout from scratch on its own.
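That fallback branch can be sketched as follows. The log record's field names are assumptions here; the article only names the file path and the unknown_layout event type.

```python
import json

def resolve_layout(probs, known_layouts, last_good, log_lines):
    """If the classifier's best label is in the on-device library, use it.
    Otherwise append an unknown_layout record (field names assumed) and
    keep routing on the last known-good layout_id."""
    best = max(probs, key=probs.get)
    if best in known_layouts:
        return best
    log_lines.append(json.dumps({
        "event": "unknown_layout",
        "nearest_match": best,
        "confidence": probs[best],
        "fallback": last_good,
    }))
    return last_good
```

The key property is that the return value is always a layout_id the tile router can act on; the unit degrades to stale-but-valid routing rather than inventing a layout.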
Could this be done with one bigger model that outputs bounding boxes directly on the full frame?
In principle yes, and some cloud pipelines do exactly that. The practical problem on an edge unit is that the same human at the DVR is free to choose the picture that the detector sees, including picture-in-picture views, PTZ pop-outs, menu overlays, and the occasional full-screen camera-setup dialog from the DVR firmware itself. A single detector trained on every such permutation has to cover the joint distribution of 'multiview x DVR make x DVR firmware,' which is large, unstable, and expensive to maintain. The two-model split localizes the problem: the classifier has a tiny, cheap label space (about 40 known layouts today), and the detector has a stable input (a known-good crop with its chrome masked out) regardless of what the operator did upstream.
What does this mean for the event lines Cyrano writes to the outbox?
Every outbox line carries the layout_id the unit saw at capture time (for example layout_id: '4x4-std'), so the downstream system can later reason about which DVR view produced each event. It also means that a human operator changing the DVR view mid-incident does not break the incident record; the outbox entries that precede and follow the layout change both have correct camera_name values, because the router honored the classifier's call. If you pull a day of outbox and group by layout_id, you can see exactly which views the DVR spent its time in; this is also how Cyrano detects that a property's operator has started using a layout the fingerprint library does not know yet.
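Grouping a day of outbox by layout_id is a short pass. The line format here is an assumption (one JSON object per line carrying a layout_id field); the article does not publish the outbox schema.

```python
import json
from collections import Counter

def views_by_layout(outbox_lines):
    """Tally events per DVR view. Assumes one JSON object per outbox
    line with a layout_id field; the real schema is not published."""
    return Counter(json.loads(line)["layout_id"] for line in outbox_lines)
```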
What is the operational takeaway for someone evaluating 'AI on edge' for a multifamily property?
Ask the vendor how their edge AI handles a non-stationary input pipeline. If the demo is a stock IP camera feeding a fixed crop to a fixed detector, that is one narrow shape of edge AI and it will not survive a property where an operator is allowed to change the DVR view. If the answer is 'we run a classifier in front of the detector and route tiles based on the current layout,' ask to see the three artifacts: the classifier model file, the current_layout state file, and the tiles.json directory. On a Cyrano unit those are at /var/lib/cyrano/models/layout-classifier-v3.onnx, /var/lib/cyrano/state/current_layout, and /var/lib/cyrano/layouts/. If a vendor cannot show you those three things, they are running cloud AI with an accelerator, not AI on edge.