Cyrano Security
11 min read
AI at the Edge, the retrofit form factor

AI at the edge has two form factors. Every explainer on the SERP describes only one of them.

The vendor explainers for AI at the edge all describe the same pattern: a new sensor with new silicon, deployed greenfield, processing one stream per device. There is a second pattern that never shows up on that first SERP page: a side-channel box that taps the HDMI output of a DVR already on the wall and runs one forward pass per composite frame of up to 25 tiles. This guide is about that second pattern, what it can do, what it cannot, and the one operational artifact that makes it work.

See the HDMI-tap form factor live
4.9 from 50+ properties
One forward pass per HDMI composite, up to 25 tiles
Per-layout overlay mask subtracts DVR clock and name strip
No RTSP, no ONVIF, no cloud hop
Under 2 minutes of wiring per retrofit

The SERP definition, and what it leaves out

The dominant definition of AI at the edge, the one you see on NVIDIA, IBM, Cisco, Dell, Red Hat, SUSE, Scale Computing, STMicroelectronics, Built In, and Computer Weekly, reads almost word for word the same across every page. AI is migrated out of a cloud datacenter onto a device near the data. The device is new. The sensor is new. The processor inside the sensor is new. The examples are autonomous vehicles, factory robots, smart refrigerators, drones, phones. The pitch is privacy, latency, bandwidth, cost.

None of that is wrong. It is just not the whole category. The greenfield-sensor framing skips over a different deployment pattern that matters for a lot of operators: the several hundred thousand properties that already have a DVR or NVR on the wall, already have cameras cabled in, and already pay to record 24/7. That hardware does not need to be torn out to benefit from edge AI. It needs a side-channel.

The side-channel form factor reads the composite frame the DVR is already driving out of its HDMI port, runs a single forward pass on that composite on a local accelerator, and emits events to a messaging endpoint. No new cameras, no replaced cabling, no cloud hop. It is still edge AI; it is just edge AI on hardware nobody bought new for it.

How a side-channel edge AI box wedges onto existing infrastructure

DVR HDMI out
Installed cameras
Existing cabling
Guard monitor
Cyrano edge unit
Per-tile detections
Overlay-mask cache
Zone + dwell filter
WhatsApp / SMS alert

The two edge AI form factors, side by side

Greenfield first: every camera is new, and each one has silicon on board. Cables and NVRs are replaced, the local network is re-segmented, and a separate dashboard is stood up. A typical multifamily retrofit costs $10,000 to $25,000 per property, not counting labor. The AI benefits are real; so is the capex.

  • New sensor, new processor, new firmware
  • Rip-and-replace cabling and recorder
  • Per-camera RTSP or proprietary stream
  • Capex dominates; timeline is months

One composite frame, up to 25 tiles, one forward pass

The arithmetic below is the specific thing that makes a side-channel box cheap enough to run on a $450 unit: you run the detector once per HDMI frame, not once per camera, because the DVR has already composited the cameras for you. On a 4x4 layout, each tile lands at roughly 480x270 pixels inside a 1920x1080 frame. The tile-to-camera-id mapping is a spatial lookup, not a second inference call.

1 forward pass per HDMI frame
25 tiles resolved per pass
480x270 px per tile on 4x4-std
0 RTSP sessions required

0 cloud hops per event, because the model and the layout cache both live on the unit.
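The spatial lookup described above is small enough to sketch in full. A minimal illustration, assuming a 1920x1080 composite on a row-major 4x4 grid; the function and camera names are hypothetical, not Cyrano's actual API:

```python
# Map a detection's box center to a camera id on a 4x4 composite.
# Hypothetical helpers; grid geometry matches the 1920x1080 / 4x4 example.

FRAME_W, FRAME_H = 1920, 1080
GRID = 4                                            # 4x4-std layout
TILE_W, TILE_H = FRAME_W // GRID, FRAME_H // GRID   # 480 x 270 per tile

def tile_index(cx: float, cy: float) -> int:
    """Row-major tile index for a box center (cx, cy) in frame pixels."""
    col = min(int(cx // TILE_W), GRID - 1)
    row = min(int(cy // TILE_H), GRID - 1)
    return row * GRID + col

def camera_for_box(box, camera_ids):
    """box = (x1, y1, x2, y2); camera_ids = 16 names in tile order."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return camera_ids[tile_index(cx, cy)]
```

A detection centered at (1000, 600) falls in column 2, row 2, which is tile 10; resolving it to a name is one list index, not a second model call.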

The single-pass pipeline, one step at a time

The thing that lets a single model pass stand in for 25 is the order of the first three steps: capture the composite, classify the layout, then mask the burned-in chrome. Only then does the detector see pixels.

Frame to alert: six stages on the Cyrano edge unit

1. HDMI capture: Single 1920x1080 composite from the DVR's guard-monitor port, 15 to 30 fps.

2. Layout classify: Walk tile edges and corner anchors to decide 4x4-std, 5x5-std, 1x1-std, or a new layout_id.

3. Mask subtract: Load the polygon mask file for this layout_id and multiply the frame by the mask to blank the clock, name strip, and channel bug.

4. One forward pass: The detector sees the masked composite once and returns bounding boxes with tile-relative coordinates.

5. Tile-to-camera map: Spatial lookup from box coordinate to tile index to camera_id. No second inference call.

6. Filter and deliver: Apply zone, dwell, and window-armed filters, crop a thumbnail from the composite, and send a WhatsApp message with the payload.
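Step 3, the mask subtract, reduces to an element-wise multiply once the polygons have been rasterized. A sketch with NumPy, under the assumption that the cached mask is a per-layout array of 0s and 1s; the array shape is an illustration, not Cyrano's on-disk format:

```python
import numpy as np

def apply_overlay_mask(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blank burned-in DVR chrome (clock, name strips) before inference.

    frame: HxWx3 uint8 composite from the HDMI capture
    mask:  HxW uint8, 1 = keep pixel, 0 = overlay region to blank
    """
    return frame * mask[:, :, None]   # broadcast mask across color channels

# Toy example: a 4x4 'frame' with a 2x2 overlay region blanked out.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.ones((4, 4), dtype=np.uint8)
mask[0:2, 2:4] = 0                    # pretend the clock sits top-right
masked = apply_overlay_mask(frame, mask)
```

Because the mask is precomputed per layout_id, the per-frame cost is one multiply, constant regardless of how many overlay regions the DVR burns in.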

The 25 tile to 1 pass reduction in code

Two shapes of the same pipeline. The first is the shape assumed by the greenfield edge AI articles on page one of the SERP: one stream, one decode, one model call per camera per tick. The second is the shape that falls out of treating the DVR composite as a single frame.

Per-stream vs composite-frame inference

# Conventional greenfield edge AI: one model pass per camera
# 25 cameras = 25 RTSP sessions, 25 decode threads, 25 forward passes

for camera in cameras:            # 25 iterations per tick
    frame = rtsp_read(camera.url) # credentials, ONVIF, keepalive
    frame = decode(frame)         # per-camera H.264 decode
    boxes = model(frame)          # forward pass #N of 25
    for b in boxes:
        emit(camera.id, b)

# Cost: 25x decode, 25x inference, 25x credential surface
# Operational reality: one camera offline = one silent gap
96% fewer model calls per tick (1 pass instead of 25)
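The composite-frame counterpart, in the same pseudocode style; hdmi_read, classify_layout, masks, and tile_to_camera are illustrative names, not a documented API:

# Side-channel composite inference: one model pass per HDMI frame
# 25 tiles = 1 capture, 1 decode, 1 forward pass

frame  = hdmi_read()                 # single 1920x1080 composite
layout = classify_layout(frame)      # e.g. "4x4-std"
masked = frame * masks[layout]       # cached per-layout overlay mask
boxes  = model(masked)               # forward pass #1 of 1
for b in boxes:
    emit(tile_to_camera(b, layout), b)

# Cost: 1x decode, 1x inference, 0 credential surface
# Operational reality: the guard monitor shows exactly what the model saw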

What the side-channel form factor gives up, and what it buys

Bought: 2 minutes of wiring, no camera replacement

The install is one HDMI pass-through and one ethernet drop. The cameras, the cabling, and the NVR stay exactly as they were. The guard monitor keeps displaying the same image. That is the entire physical change.

Bought: brand agnosticism

A composite HDMI multiview looks the same whether the DVR is Hikvision, Dahua, Lorex, or a rebrand nobody has heard of. The detector does not care, because it sees pixels, not a firmware API.

Bought: no credential surface

There are no RTSP usernames, no ONVIF port scans, no per-camera tokens to rotate. The only interface is the HDMI cable, which by definition has no authentication plane.

Given up: per-camera native resolution

Each tile of a 4x4 composite is about 480x270 px. That is enough for person detection, zone crossings, tailgating, loitering, package dwell. It is not enough for plate reads at distance or facial recognition. The tradeoff is deliberate.

Given up: the ability to pan-tilt-zoom from the AI

The side-channel box observes; it does not command. PTZ controls stay with the operator. For ops teams that already control PTZ by hand, this is no loss; for SOC integrations that expect the AI to drive the camera, it is.

Required: the per-layout overlay mask

The single non-obvious artifact is a mask file per layout_id that blanks the DVR's live clock, per-tile name strip, and recording indicator before inference runs. Without it the model fires phantom boxes on burned-in glyphs. With it, the composite pipeline works.

What the edge device prints during a live frame tick

The log extract below is a 20-line slice of a Cyrano event stream during one composite frame of an actual 16-camera multifamily property. Every step of the single-pass pipeline shows up as a readable line; nothing is a black box.

cyrano edge tick, 16-camera property, single HDMI frame
1 pass / 25 tiles
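A hypothetical sketch of such a per-stage tick log, generated by a short Python snippet; every stage label and value below is invented for illustration, not captured from a real unit:

```python
# Illustrative per-stage tick log. All stage names, camera names, and
# numbers are made up; the real unit's log format may differ.
STAGES = [
    ("hdmi_capture",    "1920x1080 composite @ 15fps"),
    ("layout_classify", "layout_id=4x4-std (cache hit)"),
    ("mask_subtract",   "mask=4x4-std.mask regions=3"),
    ("forward_pass",    "boxes=2 pass=1/1"),
    ("tile_map",        "tile=10 -> camera=garage_east"),
    ("deliver",         "event=pre_action_zone_entry latency_ms=412"),
]

def tick_log(stages):
    """Render one readable line per pipeline stage for a single frame tick."""
    return [f"[tick] {i}/6 {stage}: {detail}"
            for i, (stage, detail) in enumerate(stages, 1)]

for line in tick_log(STAGES):
    print(line)
```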

The thing that makes a side-channel edge AI box shippable is not the model; it is the layout router plus the overlay mask. The model is the easy part. The artifacts that let one forward pass stand in for twenty-five are what the SERP definition of AI at the edge leaves out.

Cyrano field notes, 16-camera multifamily baseline

DVRs this form factor runs against

Hikvision DS
Dahua XVR and NVR
Lorex LNR and LHD
Swann DVR
Uniview NVR
Annke NVR
Reolink NVR
Night Owl DVR
Amcrest NVR
Q-See and rebrands

The layout router and overlay mask make the form factor device-agnostic: any recorder that drives a standard HDMI multiview to a guard monitor is a valid input. That covers roughly every DVR and NVR shipped to multifamily in the last ten years.

Greenfield edge AI vs side-channel edge AI, on the axes that actually choose the install

The two form factors are not ranked; they solve different deployment problems. The table below is the set of questions an operator should actually ask before picking one. Per-camera resolution is a fair greenfield win. Installed-base coverage and time-to-value are fair side-channel wins.

| Feature | Greenfield edge AI | Side-channel edge AI |
| --- | --- | --- |
| Requires new cameras | Yes, every camera replaced | No, existing cameras stay in place |
| Requires new cabling | Usually; switch to PoE or fiber | No; existing coax or ethernet untouched |
| Inference call pattern | One pass per stream per tick | One pass per composite HDMI frame |
| Per-camera native resolution available to AI | Yes, full sensor feed | No; composite tile resolution (~480x270 on 4x4) |
| Capex per property | $10,000 to $25,000 | $450 one-time unit plus $200/month |
| Time to first live alert | Weeks to months | Under 30 minutes on site |
| Works across DVR brands already installed | No; forces a single ecosystem | Yes; layout router is brand-agnostic |
| Day-one operational artifact that must exist | Camera firmware + cloud dashboard | Per-layout overlay mask file by layout_id |

When the side-channel form factor is the wrong answer

Worth being direct about the failure cases, because they define the boundary of the category. If the property needs license plate reads at 60 feet, facial recognition at distance, or any analytic that relies on native camera resolution, a 480x270 tile is not going to carry it and a greenfield sensor is the right answer. If the SOC workflow expects the AI to drive PTZ, the side-channel box does not close that loop, because it observes HDMI rather than commanding the camera. If the buyer is installing a brand-new building and has the budget for a full smart-camera rollout, the tradeoff math flips.

For the rest: Class B and C multifamily, construction trailers, older commercial sites, anything already recording 24/7 on a DVR that nobody wants to tear out, the side-channel form factor is the edge AI the SERP forgets to mention.

Walk the HDMI-tap form factor on a live DVR

A 15-minute call. We pull the layout cache and overlay mask off an actual Cyrano unit running on a production DVR and you read the files for yourself.

Book a call

AI at the Edge: Frequently Asked Questions

What does 'AI at the edge' usually mean in 2026?

On the current SERP, it means machine-learning inference that runs on a device rather than in a cloud datacenter, almost always on new silicon shipped inside a new sensor (a smart camera, a lidar, a robot, a phone, a car). The NVIDIA, IBM, Cisco, Dell, Red Hat, SUSE, STMicroelectronics, and Scale Computing explainers all frame the category that way. The framing is correct as far as it goes, but it describes exactly one deployment pattern: greenfield, with the intelligence baked into a device you buy new.

What does 'AI at the edge' also mean, that the SERP leaves out?

A side-channel edge AI box that retrofits intelligence onto hardware already on the wall. Cyrano is one example. It is a physical unit that plugs into the HDMI output of an existing DVR or NVR, reads the composite multiview frame the DVR is already driving out to the guard monitor, and runs inference locally on the Cyrano device itself. No new cameras, no RTSP credentials, no ONVIF handshake, no cloud round trip. The cameras, the cabling, and the recorder do not change; only the HDMI line now has a witness on it that happens to be a neural network.

Why 'composite multiview' instead of one stream per camera?

Because the DVR has already composited the streams for you. A 16-camera property's DVR outputs a single 1920x1080 HDMI frame with 16 tiles arranged in a 4x4 grid (or 5x5, or 1x1 when a guard drills into a single camera). Cyrano treats that composite as the input. One frame, one forward pass, up to 25 tiles resolved at once, then a per-tile spatial map back to camera names. You avoid opening 25 RTSP sessions and you avoid running the model 25 times. You also avoid 25 chances for a camera credential to be wrong.

What is the per-layout overlay mask and why is it the hidden artifact of composite-frame edge AI?

A DVR burns graphics into its HDMI output that are not part of any camera feed: a live clock in the top-right corner, a per-tile name strip, a recording indicator. If you run person detection on that composite without subtracting those glyphs, the model fires phantom bounding boxes on the text and the recording dot. Cyrano computes a mask once per DVR layout at install (for example 4x4-std.mask and 5x5-std.mask) and caches the polygon coordinates by a layout_id key. On every frame, inference multiplies the composite by the cached mask in constant time. The mask is the artifact that lets composite-frame edge AI work at all; it does not exist in the greenfield-sensor edge AI story because the greenfield sensor has no overlay to subtract.

Doesn't side-channel edge AI give up resolution or framerate compared to processing camera streams directly?

Yes, and intentionally. The HDMI composite is whatever the DVR decided the guard monitor should see: typically 1920x1080 at 15 to 30 fps, with each of 16 tiles landing around 480x270. That is lower than the native camera feed of, say, 2560x1440 at 20 fps. For the incident classes Cyrano targets (person in a zone, dwell threshold crossed, tailgate at a vestibule, package left too long) 480x270 per tile is enough. For applications that genuinely need full per-camera resolution (license plate read at 60 feet, facial recognition at distance) the side-channel form factor is the wrong tool and a greenfield-sensor system is the right one. The tradeoff is the point; it buys a 2-minute retrofit on hardware you already own.

Where does the inference actually run?

On the Cyrano unit itself, at the property. The device holds the model weights, the filter stack, the layout cache, and the event queue on local storage. Frames do not leave the property during inference; only the already-compressed delivered events (thumbnail plus metadata) leave the device, and they go to the customer's WhatsApp or SMS endpoint. That is what lets the product honor the 'AI at the edge' privacy narrative on hardware that was never designed for edge AI.

How does this form factor compare to Verkada, Rhombus, or any other smart-camera system that is also 'edge AI'?

Verkada and Rhombus are greenfield edge AI. They put a new processor inside a new camera and require ripping out the existing install. That works if the property is new construction or if the owner was going to replace cameras anyway; it costs $10,000 to $25,000 per property in hardware alone, plus labor. Side-channel edge AI skips the replacement and accepts the lower composite resolution in exchange. The two forms are not direct competitors; they serve buyers with different constraints. A Class A high-rise installing a new building chooses Verkada. A Class B/C multifamily operator with 16 DVRs already on the wall across 40 properties chooses side-channel.

How is the side-channel device calibrated to a specific DVR model?

It is not calibrated to the DVR model; it is calibrated to the layout. When the Cyrano unit sees a new composite layout on HDMI, it walks the frame edges to find tile boundaries, looks for the clock glyph region at known corner anchors, looks for the name-strip region at the tile header, and builds a polygon mask for all three categories. The mask is stored as a file keyed by a hash of the tile geometry plus corner anchors, with a human-readable alias such as 4x4-std or 5x5-std. After a guard switches between multiview and fullscreen a few times, the cache ends up with both 4x4-std and 1x1-std entries, which is the healthy shape. That per-layout calibration is what makes the retrofit device-agnostic across Hikvision, Dahua, Lorex, Swann, Uniview, Annke, Reolink, Night Owl, Amcrest, Q-See, and the long tail of rebranded DVRs shipped in the last decade.
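The keying scheme described above can be sketched in a few lines; the hashing recipe and data shapes here are assumptions for illustration, not Cyrano's actual on-disk format:

```python
import hashlib

def layout_key(tile_bounds, corner_anchors):
    """Stable key for a composite layout.

    tile_bounds:    iterable of (x, y, w, h) tiles found by edge walking
    corner_anchors: iterable of detected chrome anchor points
    Hypothetical scheme; the real keying may differ.
    """
    # Sort so the same geometry always hashes the same, regardless of
    # the order the edge walker discovered the tiles in.
    blob = repr((sorted(tile_bounds), sorted(corner_anchors))).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

# A 4x4 grid of 480x270 tiles plus two invented anchor points.
tiles = tuple((c * 480, r * 270, 480, 270) for r in range(4) for c in range(4))
anchors = ((1700, 20), (40, 20))        # e.g. clock glyph, name strip
key = layout_key(tiles, anchors)
aliases = {key: "4x4-std"}              # human-readable alias per hash
```

Because the key depends only on geometry, a guard flipping between multiview and fullscreen simply populates the cache with a second entry (1x1-std) rather than invalidating the first.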

What is the installation time?

Under 30 minutes per property in the field, and under two minutes for the physical wiring. You unplug the HDMI cable from the guard monitor, run it through the Cyrano box, run a second HDMI from the Cyrano pass-through to the monitor, and plug the unit's ethernet into the property network. The monitor keeps showing exactly what it was showing before. Zones and dwell thresholds are configured from the dashboard afterward. That is the whole physical install.

What does the event output look like?

A WhatsApp or SMS message containing a tile thumbnail cropped from the composite (not the full multiview), a zone label, a dwell in seconds, a camera name, the layout_id that was active at detection, an end-to-end latency_ms from frame capture to message send, and an event class such as pre_action_zone_entry or loitering_dwell_exceeded. The responder can triage the message in under a minute without opening the dashboard. That last part is the half of edge AI that the greenfield-sensor narrative rarely discusses: what the human receives at the other end.
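The fields listed above fit in a small structured payload. A sketch of one delivered event; the field names follow the article, and every value is illustrative:

```python
import json

# Illustrative event payload; values are made up, not from a real property.
event = {
    "event_class": "pre_action_zone_entry",
    "camera": "lobby_north",      # resolved from tile index
    "layout_id": "4x4-std",       # layout active at detection time
    "zone": "vestibule",
    "dwell_s": 14,
    "latency_ms": 480,            # frame capture -> message send
    "thumbnail": "tile_crop.jpg", # cropped from the composite, not full frame
}

message = json.dumps(event, indent=2)
```

A payload this small is why the responder can triage from the message itself without opening a dashboard.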

🛡️CyranoEdge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
