Every guide on AI powered surveillance skips the capture point. It is the detail that decides whether the system fits cameras you already own.
The first page of Google for this phrase describes what AI surveillance models do: object detection, anomaly classification, facial recognition, behavior prediction. None of those pages describe where the AI physically reads pixels from. That is the architectural decision. A system that reads per-camera RTSP streams needs one thing. A system that replaces your cameras needs another. A system that reads the DVR’s HDMI multiview output, the same port the guard watches, needs one HDMI cable. This guide is about the third one, why it exists, how the install works, and what bytes actually leave the building when it fires.
The SERP for this keyword has a specific blind spot
If you look at the first-page Google results for this phrase, they divide into three piles. Pile one is policy and civil-liberties coverage (ACLU, Brookings, academic papers on public surveillance). Pile two is vendor product pages (Spot AI, Eagle Eye Networks, Volt.ai, Pelco). Pile three is generic explainer blog posts (Omnilert, LitsLink, ResearchGate) walking through computer vision, anomaly detection, and facial recognition as concepts.
Across all three piles, a specific question is never answered. “Where does the AI physically get its pixels from?” The policy pieces do not need to answer that. The vendor pages quietly assume their cameras or their cloud ingest. The explainers treat it as plumbing beneath the model. But for a property manager who owns 16 cameras on a five-year-old DVR, the capture point is the entire question. If the answer is “we read RTSP from each camera,” they may be blocked by their DVR’s firmware. If the answer is “you install our cameras,” they pay for rip-and-replace. If the answer is “we plug into the HDMI output of your DVR,” they are probably done in 120 seconds.
That third answer is what this page is about.
Three capture architectures, three different buildings
An AI powered surveillance system is a stack. The part the top-ranking guides describe (the model, the dashboard, the alert) is the top of the stack. The part that decides whether you can deploy at all is the bottom: where the pixels come from. These are the three capture architectures on the market.
A. Per-camera RTSP pull to cloud or on-prem AI
An AI backend opens an RTSP connection to each camera and continuously pulls frames. Requires the cameras to expose RTSP, the DVR to allow outbound sessions, and firewall/network config to permit the traffic. Works beautifully when the DVR cooperates. In retrofit reality, older units, rebranded OEMs, and analog-hybrid systems break this path more often than they support it.
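The cost model of architecture A is worth making concrete. The sketch below is illustrative only: the IP addresses, credentials, and URL path are hypothetical (RTSP path conventions vary per DVR vendor, which is exactly the compatibility problem), and a real implementation would hand each URL to a decoder such as OpenCV's `VideoCapture`.

```python
# Architecture A: one RTSP session, one decode loop, and one inference
# budget PER CAMERA. All addresses and credentials below are illustrative.

def rtsp_url(ip: str, user: str, password: str, channel: int = 1) -> str:
    """Build a typical RTSP URL. The path convention ("/stream1" here)
    differs across DVR vendors and firmware generations."""
    return f"rtsp://{user}:{password}@{ip}:554/stream{channel}"

cameras = [f"192.168.1.{10 + i}" for i in range(16)]  # hypothetical LAN IPs
sessions = [rtsp_url(ip, "admin", "secret") for ip in cameras]

# 16 cameras means 16 concurrent sessions the DVR must allow, 16 decode
# loops, and 16x the per-frame inference cost of a single composite frame.
print(len(sessions))
print(sessions[0])
```

The point of the sketch is the multiplier: every camera added is another session the DVR firmware has to tolerate and another stream someone has to pay to decode.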
B. Rip-and-replace AI cameras
Replace every camera with the vendor’s purpose-built AI sensor. Inference runs on the camera or in the vendor cloud. Clean architecture, predictable spec. The price is a six- to seven-figure capital project at a large multifamily portfolio, and weeks to months of installation per property.
C. HDMI multiview adapter on the existing DVR
A device taps the DVR’s HDMI monitor output, reads the composite multiview frame, runs inference locally. Works with any feed the DVR records (analog, PoE, wireless). Does not need RTSP, does not need firmware flags, does not need a new camera. Install is a one-cable physical swap under two minutes.
Why C is the retrofit path
Every DVR made in the last fifteen years has to emit HDMI, because a guard has to watch a monitor. HDMI is the one interface that is universal across vendors, firmware generations, and camera types. The capture point with the largest compatibility envelope in the category is, by construction, the monitor port.
How the HDMI multiview adapter actually sees the world
The flow below is the physical path of a single composite frame through the adapter. There is nothing in here that a property staffer cannot point at on the back of a rack. It is meant to be inspectable, not magical.
Capture path: DVR multiview to event packet
The hub in the middle is not the cloud. It is a literal HDMI signal on a cable between the DVR and the adapter. Everything to the right of the adapter is a small event payload; the cameras themselves never establish a session with anything outside the property.
The physical install, as five motions
This is the anchor fact, and it is the uncopyable part of the page, because it is a claim about what the install procedure looks like when a Cyrano technician is standing in front of a running DVR. If the steps below do not match what you see, something is wrong.
Install on a running DVR (no reboot, no camera touched)
1. Unplug monitor. Disconnect the HDMI cable that runs from the DVR to the guard monitor.
2. Cable into Cyrano. Plug that HDMI cable into the Cyrano unit's HDMI input port.
3. Passthrough to monitor. Run a short HDMI cable from the unit's passthrough port to the guard monitor.
4. Ethernet. Connect an ethernet drop to the unit. Any LAN segment with outbound internet works.
5. Power. Plug in the DC brick. The passthrough is live immediately; the guard does not lose picture.
One inference pass, twenty-five tiles: how the compute is shaped
The single composite frame is the reason the compute works at a property-wide price point. Instead of running one detection model per camera stream (the cost model of per-camera cloud AI), the adapter runs one detection model on one frame, and the per-tile zone mask slices the result. The frame budget is shared, not multiplied.
The latency budget breaks down roughly as 2 ms decode, 18 ms detect, and 4 ms zone and event packaging, followed by network RTT to the notification backend and WhatsApp delivery. The dominant cost is the last hop, not the inference.
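The budget above sums in one line; the arithmetic below uses the on-device figures from this page plus the ~1.1 s end-to-end figure quoted later for typical deployments.

```python
# On-device latency budget (milliseconds), from the figures above.
budget_ms = {"decode": 2, "detect": 18, "zone_and_package": 4}
on_device_ms = sum(budget_ms.values())
print(on_device_ms)   # 24 ms before anything leaves the box

# Everything else is network: RTT to the notification backend plus
# WhatsApp delivery. Against a ~1.1 s typical end-to-end figure:
end_to_end_ms = 1100
network_ms = end_to_end_ms - on_device_ms
print(network_ms)     # ~1076 ms: the last hop dominates, not inference
```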
The detection loop, as a sequence
A single detection fires an exact sequence of calls. The sequence matters because it is what determines whether a person who has entered a restricted zone at 2 a.m. is still in that zone when on-call staff open their phone. If the sequence is slow, you are reviewing a recording. If it is fast, you are still inside the pre-action window.
Detection loop: DVR frame to on-call phone
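The sequence can be sketched end to end. Everything here is a stub with hypothetical names (`detect`, `tile_for`, the sender); the point is the order of calls, and that only the small event packet crosses the final step.

```python
import json
import time

def detect(frame):
    """Stub detector: returns (class, confidence, box) tuples for the
    whole composite frame. The real pass runs once on a local accelerator."""
    return [("person", 0.91, (400, 300, 60, 140))]

def tile_for(box, tile_w=384, tile_h=216, grid=5):
    """Map a detection box back to its tile index in a 5x5 grid on 1080p."""
    x, y, _, _ = box
    return (y // tile_h) * grid + (x // tile_w)

def run_once(frame, send):
    for cls, conf, box in detect(frame):      # 1. one inference pass
        tile = tile_for(box)                  # 2. slice result by tile
        event = {                             # 3. package the event
            "class": cls, "confidence": conf,
            "tile": tile, "ts": time.time(),
        }
        send(json.dumps(event))               # 4. only the packet leaves

sent = []
run_once(frame=None, send=sent.append)        # frame unused by the stub
print(sent[0])
```

A real loop replaces the stubs with the HDMI capture chip, the accelerator, and the notification backend; the call order is what keeps the 2 a.m. intruder inside the pre-action window.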
What every good capture point needs to handle
An AI powered surveillance system is only as good as its worst-case frame. The checklist below is the set of things a capture architecture must actually get right to work in buildings that exist, not buildings in a datasheet. Any system that fails on one of these quietly breaks in deployment.
Checklist a retrofit capture point has to pass
- Works on DVRs that do not expose RTSP, or expose it only through a proprietary app (most rebranded Hikvision / Dahua / Lorex / Uniview units do one or the other).
- Works on analog BNC cameras, which never expose a network stream at all. The DVR is the only component that sees them as pixels.
- Does not require rebooting the DVR, editing firmware, or adding a camera credential.
- Handles DVR on-screen overlays (clock, channel bug, per-tile camera name strip) without the detector firing on them.
- Re-scopes automatically when the operator zooms the DVR to fullscreen on a single camera during an active incident.
- Bounded off-property bandwidth per event. Full-frame continuous upload disqualifies the architecture from most residential bandwidth contracts.
- A single physical failure mode: unplug the adapter, the guard monitor still works. The adapter is never in-line with DVR recording.
The three capture architectures, side by side
The feature-level differences downstream of the capture point (model quality, dashboard, alert channel) are important, but they do not decide whether the system can exist on your cameras. The capture architecture does. This is the comparison at that layer, not at the model layer.
HDMI multiview vs RTSP-to-cloud vs rip-and-replace
Same marketing phrase on every homepage. Three different buildings underneath.
| Feature | RTSP-to-cloud or AI camera replacement | HDMI multiview adapter (Cyrano) |
|---|---|---|
| Capture point | Per-camera RTSP session, or purpose-built AI sensor | DVR HDMI monitor output |
| Works with analog BNC cameras | No (no IP stream exists) | Yes (DVR normalizes them) |
| Works when RTSP is locked to vendor app | No, blocked | Yes |
| DVR vendor compatibility | Per-vendor and per-firmware matrix | Any DVR with HDMI out |
| Cameras to replace | All of them (rip-and-replace) or none (RTSP) | 0 |
| Cables touched during install | Per-camera, plus network reconfig | 1 HDMI (plus passthrough) |
| Physical install time | Days to weeks for replacement; variable for RTSP | Under 2 minutes |
| Inference location | Usually in a shared vendor cloud | On the adapter at the property |
| Full-frame continuous upload | Yes, for most cloud-AI shapes | None |
| Off-property bytes per event | Continuous stream (gigabytes per camera per day) | ~240 KB |
| Per-camera monthly price | $20 to $120 per camera per month | ~$13 at 16 cameras (whole-property $200) |
| Failure mode if adapter dies | Recording workflow may be affected | Guard monitor keeps showing DVR output |
The capture point, mapped to the buildings it fits
Capture architecture is not a matter of taste. It is a matter of what building you are standing in. The vertical slices below show which architecture fits which deployment reality.
Which capture architecture fits which building
1. New-construction Class A, fresh wire, clean network
Any architecture works. RTSP pull is fine because the DVR is new, the firmware is cooperative, and the IT team is on-site. A rip-and-replace AI camera project also fits because the install is part of the build. HDMI multiview works too, and is the cheapest of the three. Choose on budget and speed.
2. Class B or Class C multifamily, 5-to-15-year-old DVR, mixed analog and IP
HDMI multiview is the only architecture that reliably works here. RTSP is blocked by the firmware on half the units. Analog BNC cameras have no IP stream to pull. A rip-and-replace project does not pencil at this building class. The DVR is already drawing the multiview for a guard; the adapter reads that drawing.
3. Construction jobsite with a trailer-mounted DVR and cellular uplink
Capture has to be bounded-bandwidth, because cellular is the backhaul. HDMI multiview plus on-device inference plus event packets fits within the bandwidth envelope. Continuous per-camera cloud upload does not.
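The envelope difference is easy to put numbers on. The comparison below assumes a modest 1 Mbps per camera for continuous upload and a hypothetical 50 events per day against the ~240 KB per-event figure from this page; both assumptions are illustrative.

```python
# Daily off-property bytes: continuous per-camera upload vs event packets.
cameras = 16
stream_mbps = 1                      # assumed modest per-camera bitrate
seconds_per_day = 24 * 3600

bits_per_day = cameras * stream_mbps * 1e6 * seconds_per_day
continuous_gb = bits_per_day / 8 / 1e9

events_per_day = 50                  # hypothetical busy property
event_kb = 240
event_mb = events_per_day * event_kb / 1000

print(round(continuous_gb, 1))   # ~172.8 GB/day: hopeless on cellular
print(event_mb)                  # 12.0 MB/day: fits a cellular uplink
```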
4. Small commercial (law firm, clinic, small office) with a 4- or 8-channel DVR
HDMI multiview works identically; the DVR still renders a 2x2 or 3x3 multiview to a monitor. RTSP pull is possible if the DVR cooperates, but often does not. Rip-and-replace is a capital project no one wants to fund for four cameras.
5. Portfolio operator with 50 properties on different DVR vendors
HDMI multiview is the only architecture that does not force you to harmonize DVR vendors first. Every DVR speaks HDMI. The adapter is the standardization layer.
The architectural numbers, in one row
Each number below is a constant of the HDMI multiview shape, not a marketing claim. The tile count is what the DVR already draws. The install time is what happens when you follow the five-step procedure. The event size is what actually crosses the property boundary.
Inference budget
A typical detection pass across a 25-tile composite frame on the adapter’s local accelerator takes about 18 ms, roughly 24 ms including decode and event packaging. Compute is paid once per frame, not 25 times.
End-to-end latency
Composite frame to phone buzz is about 1.1 seconds in typical deployments. The dominant cost is the WhatsApp delivery hop, not detection. Well inside the pre-action intervention window.
The DVR brands the capture point works on
Because the adapter reads HDMI and HDMI is the one universal requirement for a DVR to be sellable, the compatibility matrix is essentially the entire category. In production deployments we have fielded rebranded Hikvision, Dahua, Lorex, Uniview, Swann, and LTS units, among others. The pattern that matters is not the brand, it is the port.
The adapter does not talk to the DVR’s management interface, so firmware version and rebrand status are irrelevant. If the HDMI port emits a composite multiview, the adapter reads it.
“Across 50+ live Cyrano deployments, zero units have required camera replacement, zero have needed RTSP or ONVIF credentials, and zero have uploaded a continuous full-frame stream off the property. Every install followed the same five-step HDMI procedure and emitted only event packets.”
Cyrano deployment fleet, April 2026
The cable is the decision
“AI powered surveillance system” reads like a model problem. Most of the category’s marketing is about the model: which objects it classifies, which anomalies it flags, how many faces it recognizes, how many license plates it reads. That is the interesting half of the system to write about. It is not the interesting half of the system to deploy.
The deploy half is the capture point. And the capture point, for any building where the cameras already exist, is not RTSP and it is not a new sensor. It is the monitor cable on the back of the DVR. Tap that cable, run inference on the frame the guard is already watching, and you have an AI powered surveillance system that fits the building instead of asking the building to fit the system.
Your DVR is already drawing the picture. We read it.
15-minute live demo. We plug a Cyrano unit into a running DVR on the call, show the 25-tile composite, fire a real detection, and walk the event packet from HDMI to phone.
Frequently asked questions
What is an AI powered surveillance system, in one concrete sentence?
It is a system that applies a detection model to camera pixels in real time, then routes the resulting events to a human or a rule engine. The parts that get discussed in articles are the model (person, vehicle, package, anomaly) and the events (alerts, dashboards, escalations). The part that almost never gets discussed, and which decides whether the system can exist on cameras you already own, is the capture point: the physical place where the AI actually reads pixels from. Per-camera IP streams, purpose-built AI sensors, and DVR HDMI multiview are three different capture architectures with radically different install cost, compute shape, and compatibility envelope.
Why is the HDMI multiview capture point unusual, and why does it matter?
Every DVR and NVR made in the last fifteen years renders a mosaic of camera tiles to an HDMI port so a guard can watch a monitor. The signal at that port is a 1080p composite frame with every camera already drawn into a grid (typically 2x2, 3x3, 4x4, or 5x5). A device that taps that port gets access to all 25 cameras at once with a single HDMI cable. It does not need to know the cameras' IP addresses, it does not need ONVIF credentials, it does not need the DVR vendor to expose an RTSP endpoint. That is why retrofit AI is even possible on the vast majority of existing apartment, jobsite, and commercial DVRs: you stop fighting the network and you start reading the picture the DVR is already drawing.
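The tile arithmetic in that answer is simple enough to verify by hand. For a 1920x1080 composite, the per-tile resolution at each common grid size:

```python
# Per-tile pixel dimensions for common DVR multiview grids
# on a 1920x1080 composite frame.
W, H = 1920, 1080
tiles = {n: (W // n, H // n) for n in (2, 3, 4, 5)}   # 2x2 .. 5x5

for n, (tw, th) in tiles.items():
    print(f"{n}x{n}: {tw}x{th} per tile, {n * n} cameras")
```

At 5x5 each camera occupies a 384x216 region, which is the resolution the detector actually sees per feed in multiview; this is why fullscreen re-scoping (discussed below in the grid FAQ) raises accuracy during incidents.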
How is the install actually done, step by step, on a running DVR?
Five physical motions. One: unplug the HDMI cable going from the DVR to the guard monitor. Two: plug that end of the cable into the Cyrano unit's HDMI input. Three: plug a short HDMI cable from the Cyrano unit's HDMI passthrough into the guard monitor. Four: connect ethernet to the Cyrano unit. Five: connect power. The guard monitor keeps showing the DVR's composite multiview continuously because the unit's passthrough is live. Total clock time on a running DVR, with everything handy, is under 2 minutes. The DVR is not rebooted. No camera is touched. No firmware is flashed. No network credential is entered anywhere.
Does the AI run on the device at the property, or in a cloud?
On the device. The HDMI signal goes into a capture chip, the composite frame is decoded in RAM, and a detection model runs on the unit's local accelerator. The original full frame and the 24 other tiles of unrelated feeds never leave the building. Only event artifacts leave: a ~18 KB thumbnail crop of the triggering tile, a ~220 KB six-second clip of that tile around the detection, and a small structured metadata blob (zone id, dwell seconds, time, class, confidence). Average off-property traffic per event is around 240 KB. There is no continuous upload of full frames, and there is no biometric index (no face embeddings, no license plate strings, no gait vectors) constructed at any stage.
How does one HDMI input cover up to 25 cameras at once?
The DVR has already done the hardest part. It decodes every camera's stream, scales it to a tile, positions the tile in a grid, and re-encodes the grid as one HDMI frame. The Cyrano unit receives that composite frame and treats each tile as a region of interest. One inference pass per composite frame processes all 25 tiles in parallel. A per-tile zone mask is applied so the on-screen clock, per-tile camera name strip, and DVR channel bug do not confuse the detector. If the operator switches the DVR to fullscreen on one camera during an incident, the unit re-scopes to that single camera at full resolution, so accuracy actually climbs during active events rather than dropping.
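The region-of-interest bookkeeping behind that answer can be sketched. The rectangles are pure arithmetic on the grid; when the operator goes fullscreen, the grid collapses to 1x1 and the single tile's rectangle becomes the whole frame.

```python
def tile_rects(grid: int, w: int = 1920, h: int = 1080):
    """Pixel rectangle (x, y, w, h) for each tile in an NxN multiview,
    in row-major order matching the DVR's channel layout."""
    tw, th = w // grid, h // grid
    return [(c * tw, r * th, tw, th)
            for r in range(grid) for c in range(grid)]

multiview = tile_rects(5)    # 25 ROIs; one inference pass covers all of them
fullscreen = tile_rects(1)   # operator zoomed: one tile at full resolution

print(len(multiview), multiview[6])   # tile 6 sits at (384, 216)
print(fullscreen[0])                  # (0, 0, 1920, 1080)
```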
Why not just pull RTSP from each camera and bypass the DVR?
Because in the real buildings Cyrano deploys into, RTSP is unreliable as a capture architecture. The DVRs are a mix of rebranded Hikvision, Dahua, Lorex, Uniview, Swann, and LTS units. Some expose RTSP only when a specific proprietary app has negotiated a session. Some put RTSP behind a firmware flag the integrator forgot to enable. Some rotate credentials on reboot. Some cap concurrent RTSP sessions so low that adding an AI client bumps the guard's phone app off. Analog cameras on BNC inputs do not expose RTSP at all; the DVR is the only component that ever sees them as pixels. HDMI multiview bypasses all of that. If the guard can see the feed on the monitor, the adapter can run inference on it.
What specifically leaves the property per detection?
Exactly three artifacts, and nothing else. One thumbnail crop, roughly 320x240, of the tile that fired, roughly 18 KB. One short clip, six seconds, of that same tile bracketing the detection, roughly 220 KB. One JSON metadata object describing the event: zone identifier, dwell in seconds, wall-clock time, class (person, vehicle, package), confidence score, and the tile index within the DVR layout. Nothing else leaves. No full-resolution multiview frame, no continuous stream, no face embedding, no plate string, no audio, no telemetry from the 24 other tiles that did not fire. Average total bytes per event is about 240 KB. Delivery target is a WhatsApp thread with on-call staff; end-to-end latency from composite frame to phone buzz is about 1.1 seconds in typical deployments.
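The three artifacts in that answer, assembled as a sketch. Field names follow the list above; the thumbnail and clip payloads here are placeholder byte strings sized to the quoted figures.

```python
import json

def build_event_packet(tile, zone, dwell_s, cls, conf, ts):
    """The only three artifacts that leave the property per detection."""
    thumbnail = b"\x00" * 18_000    # ~18 KB 320x240 crop (placeholder bytes)
    clip = b"\x00" * 220_000        # ~220 KB six-second clip (placeholder)
    metadata = json.dumps({
        "zone": zone, "dwell_s": dwell_s, "time": ts,
        "class": cls, "confidence": conf, "tile": tile,
    }).encode()
    return thumbnail, clip, metadata

thumb, clip, meta = build_event_packet(
    tile=6, zone="pool-gate", dwell_s=45,
    cls="person", conf=0.91, ts="2026-04-02T02:13:07Z")

total_kb = (len(thumb) + len(clip) + len(meta)) / 1000
print(round(total_kb))   # ~238 KB, matching the ~240 KB per-event figure
```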
What happens if the DVR changes its on-screen overlays or grid layout?
The install procedure includes a one-time calibration pass: the unit captures a reference composite frame, the installer confirms which tile is which camera (a dropdown per tile), and the system records the pixel rectangles for the DVR clock, the per-tile camera name strip, and the channel bug. Those rectangles become zero-mask regions before the detector sees the frame. If the DVR layout is later changed (3x3 to 4x4, or a channel is renamed), the calibration is re-run in under a minute from the property staff's phone. The model itself is unchanged; only the per-tile coordinates move.
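The text above describes zeroing the masked pixels before inference; an equivalent effect can be sketched as plain rectangle filtering, dropping any detection whose box overlaps a calibrated overlay rectangle. The overlay coordinates below are hypothetical examples of what a calibration pass might record.

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rects are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

# Hypothetical calibration output: pixel rects of DVR-drawn overlays.
overlay_rects = [
    (1600, 0, 320, 40),     # DVR clock, top-right of the composite
    (0, 196, 120, 20),      # per-tile camera name strip (one tile shown)
]

detections = [
    ("person", (400, 300, 60, 140)),   # real detection inside a tile
    ("person", (1610, 5, 50, 30)),     # clock digits misread as an object
]

kept = [d for d in detections
        if not any(overlaps(d[1], r) for r in overlay_rects)]
print(kept)   # only the real detection survives the mask
```

Re-running calibration after a layout change only swaps the rectangle lists; the detector itself is untouched, which is why the re-calibration fits in under a minute.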
Why is this cheaper than per-camera AI cloud platforms?
Two reasons. First, no frames are being shipped to a cloud ingest, so the vendor is not paying for ingest, storage, GPU inference, and egress per camera per month. Second, a single edge device covers up to 25 cameras off one HDMI input, which amortizes the device cost across the whole property instead of per camera. At a typical 16-camera property, Cyrano works out to roughly $13 per camera per month on a whole-property price of $200. Per-camera AI cloud platforms are usually priced between $20 and $120 per camera per month, because the per-camera cloud compute is the product.
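The per-camera figure in that answer is just the whole-property price divided across the cameras, checked against the quoted cloud range:

```python
# Whole-property pricing amortized per camera (figures from this page).
property_monthly = 200
cameras = 16
per_camera = property_monthly / cameras
print(per_camera)   # 12.5, i.e. roughly $13 per camera per month

# Versus per-camera cloud pricing at the quoted $20-$120 range:
cloud_low, cloud_high = 20, 120
print(cloud_low * cameras, cloud_high * cameras)   # $320 to $1920 per month
```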
Does this work with analog BNC cameras, IP cameras, or both?
Both. The adapter is agnostic to what the DVR is recording. If a feed makes it into the DVR's HDMI multiview (and essentially all feeds do, that is the point of a DVR), the adapter can run inference on it. That includes analog 960H and HD-TVI cameras on BNC inputs, PoE IP cameras on the NVR's built-in switch, and wireless cameras paired to the recorder. The DVR is the normalization layer. The adapter only has to understand HDMI, and HDMI is the one signal every modern DVR has to emit to be sellable.
Worth saying plainly
An AI powered surveillance system is, at its deploy layer, a choice about where the AI reads pixels from. RTSP pull works on the buildings where the DVR cooperates. Rip-and-replace works on the buildings where the capital budget is already approved. HDMI multiview works on every other building, which is most of them.
If your cameras are already installed, already recording, and already drawing a multiview onto a guard monitor, the adapter that reads that multiview is the path that does not ask you to replace the cameras, renegotiate with the DVR vendor, or re-architect your network. One HDMI in, one HDMI out to the monitor, ethernet, power. Under two minutes. $450 one-time, $200 a month, whole property.