Matthew Diakonov, Written with AI

Published April 19, 2026 · 12 min read

The input, not the model

An AI surveillance system is decided by what it reads, not what it detects.

Two systems can run the same model on the same hardware and behave like completely different products in the field. The variable is input architecture. A system that reads individual camera RTSP streams needs credentials, a managed network, and often a rip-and-replace. A system that reads the monitor (the composite HDMI frame the DVR already renders for the security guard) needs a power outlet and two minutes. Same detector, completely different install. This page walks the monitor-first input, tile by tile, on one deployed unit.


4.9 from 50+ properties

Per-tile pixel budget on a 5x5 grid: 384 x 216

Physical install wall-clock: under 2 minutes

DVR brands verified: Hikvision, Dahua, Lorex, Swann, Uniview, Amcrest, Reolink, Q-See, Annke, ZOSI

Network load on the property LAN: zero

The AI surveillance system that reads the monitor, not the cameras

Same detector. Different input. Different install.

Input: the DVR's HDMI multiview frame

1920 x 1080 composite, up to 25 tiles

384 x 216 pixels per camera on a 5x5

Install: under 2 minutes, no RTSP, no network

Egress: about 240 KB per delivered event


The first decision on an AI surveillance system is not which model. It is which video path.

Search for “ai surveillance system” and the first page of results tells you about detection capabilities. Face recognition. License plate reading. Weapon detection. Crowd counting. Anomaly flagging. Loitering. Every one of those capabilities is downstream of a question the guides skip: how does the system see the cameras at all.

There are only two real answers. Read the cameras directly (RTSP or ONVIF streams over the property network, one connection per camera) or read the monitor (the composite HDMI multiview frame the existing DVR already renders, one signal for all cameras). Those two choices look almost identical in a product datasheet and are radically different on install day. One is a week of scheduled IT work and credentials hunting. The other is a person with a screwdriver and a power outlet.

Our system reads the monitor. Everything about the deployed behavior, including the per-camera pixel budget, the model class selection, and the egress contract, follows from that one choice.

Pixels per camera tile on a 5x5 multiview: 384 x 216

Camera tiles handled per unit: up to 25

Physical install wall-clock: under 2 min

Outbound payload per delivered event: about 240 KB

The pixel budget, grid by grid

The DVR renders its multiview to a fixed 1920 x 1080 HDMI output. Tiles share that frame equally. The more cameras on the grid, the fewer pixels each one gets. These are the numbers that dictate which detections are real and which are marketing.

1x1 grid (1 camera)

Full 1920 x 1080 per tile. A person at 30 feet is roughly 180 pixels tall. Suitable for plate and face workloads on a dedicated gate or entry camera. This is the pattern when a property pairs a multiview unit with a second unit routed to a single critical camera.

2x2 grid (4 cameras)

960 x 540 per tile. A person at 30 feet is roughly 90 pixels tall. Plate reading starts to get unreliable at distance. Person and vehicle detection are solid.

3x3 grid (9 cameras)

640 x 360 per tile. A person at 30 feet is roughly 60 pixels tall. Class-level detection (person, vehicle, package) is the right workload. Face ID on distant tiles is not.

4x4 grid (16 cameras)

480 x 270 per tile. A person at 30 feet is roughly 45 pixels tall. Comfortable range for mid-market multifamily and jobsite trailers.

5x5 grid (25 cameras)

384 x 216 per tile. A person at 30 feet is roughly 36 pixels tall. This is the maximum density one unit supports, and the cost point where the detector still resolves the silhouette classes reliably.

Why 25 is the ceiling

Beyond 5x5 on a 1080p output, the tiles shrink faster than the detector can keep up with at property distances. A 6x6 grid is 320 x 180 per tile, and a person at 30 feet is roughly 30 pixels tall, which sits right at the empirical floor for reliable class-level detection and leaves no margin for longer distances. We cap one unit at 25 tiles and add a second unit for larger deployments.
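The grid-by-grid figures above are plain division of the 1920 x 1080 frame. Here is a minimal Python sketch of that arithmetic. It assumes equal tiles with no borders or on-screen overlays, and it assumes a person at 30 feet spans about one-sixth of the tile height, a ratio inferred from the numbers on this page rather than from a published calibration.

```python
FRAME_W, FRAME_H = 1920, 1080  # fixed DVR HDMI output, per this page

# Assumption inferred from the figures above: a person at 30 feet fills
# about 1/6 of the tile height (180 px on a full 1080-line tile).
PERSON_DIVISOR_AT_30FT = 6


def tile_size(n: int) -> tuple[int, int]:
    """Width and height in pixels of one tile on an n x n multiview grid."""
    return FRAME_W // n, FRAME_H // n


def person_height_px(n: int) -> int:
    """Approximate pixel height of a person at 30 feet on an n x n grid."""
    _, tile_h = tile_size(n)
    return round(tile_h / PERSON_DIVISOR_AT_30FT)


for n in range(1, 7):
    w, h = tile_size(n)
    print(f"{n}x{n}: {w} x {h} per tile, person ~{person_height_px(n)} px")
```

Running it for grids 1x1 through 6x6 reproduces the tile sizes and person heights listed above, including the 30-pixel figure at 6x6.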

How the input flows from DVR to detector

The pixel path is short and boring, and that is the point. The DVR renders the multiview it has always rendered, the signal passes through our system on its way to the monitor, and the detector runs on the composite frame on the way through.

Monitor-first input: from DVR HDMI to per-tile detection

The physical install, step by step

The whole point of the monitor-first input is that the install is something one person can do with no IT involvement. These are the exact steps on a typical property.

From box to running, under two minutes

1. Unplug: Disconnect the HDMI cable from the DVR's HDMI OUT port. The monitor goes dark. Residents and staff do not see this, because this runs in the gatehouse or the back office.
2. Insert: Plug that cable into our system's HDMI OUT. Plug the short included cable from the DVR's HDMI OUT into our system's HDMI IN. The monitor comes back with the same multiview it always showed.
3. Power: Plug the 12V power adapter into an outlet. The unit boots in about 20 seconds. Cellular connects on eSIM. No Wi-Fi credentials, no Ethernet, no DVR password.
4. Calibrate: On the property dashboard, capture a reference frame and tag each tile with its camera name. Draw the zone polygons. The detector is live on the first inference tick.

What the first calibration run actually prints

The first thing our system does after it boots is introspect the HDMI signal it is seeing and decide the grid. Here is what that produces on a 16-camera Class B property running a Hikvision DVR.

cyrano:install-probe property=austin-meadow-16

Monitor-first input versus camera-first input, side by side

Both architectures end at the same place: a detector running on frames. They differ everywhere else, and the differences show up on install day, on the property LAN, and in the monthly operating cost.

Two ways an AI surveillance system can read a 25-camera property

Our monitor-first system versus the typical RTSP camera-first stack.

Feature	Camera-first AI surveillance (RTSP per camera)	Our system (HDMI monitor-first)
Physical install wall-clock	1 to 3 hours per property	under 2 minutes
DVR admin password required	yes	no
New devices on property network	1 to N (inference box, switch, sometimes a VLAN)	zero
Sustained LAN bandwidth used by AI	40 to 60 Mbps on 25 cameras at 1080p	0 Mbps
Per-camera pixel budget	1920 x 1080 per stream	384 x 216 on a 5x5 tile
Native support for mixed DVR brands	depends on ONVIF profile compliance	yes, any HDMI-capable DVR
Property video that leaves the building	often continuous cloud upload	event clips only, about 240 KB each
Rollback if the AI fails	reconfigure network and cameras	unplug our system, monitor goes back to direct
Hardware cost floor	$3K to $25K per property (often with camera replacement)	$450 per unit (25 cameras)
Upgrade path on a new camera	add camera to VMS, provision RTSP, update inference config	installer assigns tile in dashboard

What a typical property already has versus what we add

Monitor-first means the existing DVR, cameras, cabling, and storage keep doing their job. Here is what the integration looks like in code-shape, as two config snapshots: the camera-first stack being replaced and the monitor-first stack that takes its place.

Property integration config, before and after

# ai-surveillance.yaml
property: austin-meadow-16
mode: rtsp

# requires DVR admin access, network changes,
# and credentials for every camera
dvr:
  host: 10.0.4.2
  admin_user: admin
  admin_pass: <from-installer-binder>
  rtsp_port: 554

cameras:
  - id: cam01
    rtsp: "rtsp://admin:<pw>@10.0.4.11:554/ch1/main"
    resolution: 1920x1080
  - id: cam02
    rtsp: "rtsp://admin:<pw>@10.0.4.12:554/ch1/main"
  # ... 14 more ...

network:
  vlan: security
  bandwidth_reserved_mbps: 60
  firewall_changes: yes

inference_box:
  lan_ip: 10.0.4.50
  gpu: required

33% fewer config lines

DVR brands the HDMI tap has been verified against

Monitor-first works on anything that emits a standard HDMI multiview. The list below is every brand we have confirmed on a deployed property. HDCP is the only hard stop, and it is almost never present on security-grade DVRs.

Hikvision

Dahua

Lorex

Swann

Uniview

Amcrest

Reolink

Q-See

Annke

ZOSI

Night Owl

Defender

LaView

SV3C

ONWOTE

Samsung SDH

FLIR DNR

Hikvision rebadges (Hikua / LTS)

Dahua rebadges (Honeywell / FLIR)

Generic NVR w/ HDMI out

Procurement questions that surface the input architecture

Before committing to any AI surveillance system, these are the four questions to ask. The answers separate a monitor-first architecture from a camera-first one and tell you what install day will feel like.

Four questions for any AI surveillance vendor

Do you require the DVR admin password, and do you need to put a device on the same LAN as my cameras? A monitor-first system answers no to both. A camera-first system has to answer yes to at least the first and almost always the second.
What is the exact physical install path on an existing DVR, in steps, in minutes? If the answer includes an IT ticket, a VLAN, or a scheduled maintenance window, you are buying a camera-first system. If it is 'unplug the monitor HDMI and plug in our box,' you are buying a monitor-first system.
What is the per-camera pixel budget your detector is trained and tuned for? A camera-first system will quote 1920x1080 per stream. A monitor-first system will quote the tile size at your chosen grid (384x216 at 5x5, 480x270 at 4x4, 640x360 at 3x3). The pixel budget decides which detections are real.
What leaves my building over the network per hour of normal operation? A monitor-first system answers in kilobytes per event. A camera-first system that runs cloud inference answers in gigabytes per hour per camera. Pick the egress contract that matches your privacy posture.

1m 42s

“On a 16-camera Class B property in Austin, the physical install took one minute and forty-two seconds from unplug to power-on, and an additional seven minutes of per-tile calibration in the dashboard. No DVR password was requested. No port was opened on the property firewall. The monitor showed the same multiview it had shown the day before, and has shown every day since. The detector was live on the first inference tick.”

Our system deployment log, Austin, TX

Three failure modes that monitor-first input avoids

These are the recurring failures on camera-first deployments we have replaced. Each one traces back to the input architecture, not the model.

Failure 1

The missing password

The camera installer left in 2019. The DVR password is in a binder nobody has opened since. An RTSP-based AI surveillance system cannot start without it. A monitor-first install never asks.

Failure 2

The network strain

Twenty-five concurrent RTSP streams at 1080p add 40 to 60 Mbps to a shared building LAN. Wi-Fi degrades. The package room IoT sensor drops. The residents notice. A monitor-first unit adds zero megabits to the LAN.

Failure 3

The ONVIF mismatch

Two cameras on the same property from different brands implement ONVIF Profile S differently. The camera-first AI works on twenty-three of twenty-five streams. Two cameras need firmware updates the manufacturer no longer ships. A monitor-first unit sees both the same way: as tiles on the DVR multiview.

See the HDMI tap on a live DVR.

On a 15-minute demo we bring a unit, intercept the monitor HDMI on your existing DVR, and show the grid-detect, overlay-find, and first inference tick in real time. You see the physical install, the per-tile pixel budget on your actual camera set, and the egress contract. No RTSP credentials. No network changes.


When camera-first is the right architecture instead

Monitor-first is not universally better. It is a specific tradeoff. If the deployment needs plate reading at distance, face identification across a crowd, or sub-second inference on every camera simultaneously, the 384 x 216 tile of a 5x5 multiview is not enough pixels. Those workloads want a camera-first system running at the full 1920 x 1080 per stream, and the install and network costs come with the territory.

For the broader market (mid-market multifamily, jobsites with existing DVRs, small commercial that already has a guard monitor in the back office), the workload is person / vehicle / package at property distances, the zones are varied, and the DVR already exists. Monitor-first turns that deployment from a three-week integration into a two-minute install, because the AI surveillance system reads the signal the property has been rendering for the guard monitor all along.

Frequently asked questions

What does 'monitor-first input architecture' actually mean on an AI surveillance system?

It means the system's input is not the set of individual camera streams. It is the single composite frame that the DVR already renders to HDMI for the guard monitor. A standard DVR draws a 1x1, 2x2, 3x3, 4x4, or 5x5 grid of tiles onto a single 1920x1080 frame and pushes that out its HDMI port so a human can watch all the cameras at once. On a monitor-first system, the AI watches the same frame the human would watch. It does not open an RTSP stream, ask for camera credentials, or touch the DVR's network configuration. The video path is a passthrough: the DVR's HDMI out goes into the unit, the unit's HDMI out goes to the existing monitor, and the monitor keeps showing what it has always shown. The AI reads the signal along the way.

What is the exact per-camera pixel budget on a 25-tile multiview?

On a 1920x1080 HDMI output rendered as a 5x5 grid, each tile is 384 by 216 pixels. That is the per-camera resolution the detector gets. On a 4x4 grid of 16 cameras the tile is 480 by 270 pixels. On a 3x3 grid of 9 cameras the tile is 640 by 360 pixels. The pixel budget is why the detector is optimized for person, vehicle, and package at typical property distances, and it is why the system deliberately does not advertise face identification on distant tiles. A person at 30 feet on a 384x216 tile is roughly 36 pixels tall, so the face alone is only a handful of pixels. That is not enough pixels for reliable identification, and pretending otherwise is how capability-first AI surveillance systems oversell.

If the pixel budget is so tight, how does the detector perform at all?

Because person, vehicle, and package are silhouette-class detections, not fine-grained identification. A person 36 pixels tall is still detectable as a person; you just cannot tell which person. A vehicle 40 pixels wide is still detectable as a vehicle; you just cannot read its plate. For property security, the question is almost always whether a person or vehicle is in a zone that should be empty at a given hour, not who that person is. The monitor-first input architecture is a deliberate tradeoff: you give up per-camera resolution and get a system that runs on virtually any HDMI-capable DVR, with zero camera replacement cost. For properties that actually need plate-level or face-level recognition on a specific gate, the supported pattern is to route that one camera's HDMI output to a unit directly (1x1 grid, full 1080p), in addition to the multiview unit covering the rest of the property.
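The "zone that should be empty at a given hour" question can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production logic. The zone layout, the field names, and the idea of testing a detection's foot point are all hypothetical. The polygon test is the standard ray-casting method.

```python
from datetime import time

# Hypothetical zone: polygon vertices are pixel coordinates inside one
# 384 x 216 tile; the zone should be empty between quiet_start and quiet_end.
LOADING_DOCK = {
    "polygon": [(100, 50), (300, 50), (300, 200), (100, 200)],
    "classes": {"person", "vehicle"},
    "quiet_start": time(22, 0),
    "quiet_end": time(6, 0),
}


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def in_quiet_hours(now, start, end):
    """True if now falls in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_alert(label, foot_xy, now, zone):
    """Alert only for a watched class, inside the zone, during quiet hours."""
    if label not in zone["classes"]:
        return False
    if not in_quiet_hours(now, zone["quiet_start"], zone["quiet_end"]):
        return False
    x, y = foot_xy
    return point_in_polygon(x, y, zone["polygon"])


print(should_alert("person", (200, 150), time(2, 30), LOADING_DOCK))
```

Note that nothing in this check needs to know who the person is, which is why a silhouette-class detection on a small tile is sufficient input for it.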

What is the step-by-step install path on a real property?

Step 1: Find the HDMI cable that runs from the existing DVR or NVR to the guard monitor. On most properties this is a single cable behind the desk or inside the gatehouse.
Step 2: Unplug that cable from the DVR side.
Step 3: Plug it into the unit's HDMI out port.
Step 4: Plug a short included cable from the DVR's HDMI out into the unit's HDMI in port.
Step 5: Plug in power.
Step 6: On the property dashboard, capture a reference frame and confirm the tile layout (2x2, 3x3, 4x4, 5x5) and which camera is in which tile.

Total wall-clock time is usually under two minutes for the physical install, and an additional five to ten minutes for the per-tile calibration. The monitor keeps showing the same multiview it always showed. Residents and staff do not notice anything changed.

Which DVR and NVR brands does this work with, and are there any that break it?

Any DVR or NVR that emits a standard HDMI multiview to a monitor, which is essentially every unit shipped in the last decade. Verified on Hikvision, Dahua, Lorex, Swann, Uniview, Amcrest, Q-See, Reolink, Annke, ZOSI, and a long tail of rebadges of those platforms. The failure modes are narrow: HDCP-protected outputs (almost never present on security DVRs, but occasionally on hybrid retail-and-signage devices) and analog-only BNC outputs on very old DVRs with no HDMI port at all. For the BNC-only case the fix is a $40 BNC-to-HDMI converter in front of the unit. HDCP is the only hard stop, and in four years of deployments we have not encountered one on a property-grade DVR.

Why not just open the RTSP streams from each camera like a typical AI surveillance system does?

Three concrete reasons on deployed properties. First, most property managers do not know the DVR admin password. The camera installer set it up, left the building, and either the password is on a sticky note that has since been thrown away or it is written in a contract binder nobody has opened since 2019. RTSP requires that password. HDMI does not. Second, opening RTSP streams requires putting a new device on the same network segment as the DVR, which often requires IT or landlord involvement on properties that have ten different tenants on the same building network. HDMI is electrically isolated from the network. Third, RTSP per camera means 16 or 25 network connections at 1080p each, which is 40 to 60 Mbps sustained on the LAN. HDMI is a single wire with zero network load. Monitor-first solves all three problems by refusing to be on the network at all.
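The third reason is simple arithmetic, sketched below in Python. The 40 to 60 Mbps aggregate for 25 cameras is the figure quoted on this page. The per-stream bitrate and the gigabytes-per-hour conversion are derived from it for illustration and are not measurements.

```python
def per_stream_mbps(total_mbps: float, cameras: int) -> float:
    """Average bitrate per camera implied by an aggregate LAN load."""
    return total_mbps / cameras


def gb_per_hour(mbps: float) -> float:
    """Convert a sustained bitrate in megabits/s to gigabytes per hour."""
    return mbps * 3600 / 8 / 1000


CAMERAS = 25
for total in (40, 60):  # aggregate range quoted on this page
    each = per_stream_mbps(total, CAMERAS)
    print(f"{total} Mbps total -> {each:.1f} Mbps per stream, "
          f"{gb_per_hour(each):.2f} GB per camera-hour")

# The monitor-first unit reads HDMI, so its contribution to the LAN is zero.
MONITOR_FIRST_LAN_MBPS = 0
```

The same conversion shows why a camera-first system that also uploads its streams to the cloud measures its egress in gigabytes per hour rather than kilobytes per event.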

Does the unit phone home? What actually crosses the property boundary?

The unit has a 4G cellular modem with an eSIM, which it uses to send event payloads and receive configuration updates. No property Wi-Fi, no property Ethernet, no camera network access. What crosses the property boundary: roughly 240 KB per delivered HIGH event (an 18 KB JPEG thumbnail, a 220 KB H.264 clip of about six seconds bracketing the detection, and a 612-byte JSON metadata object). Nothing else. No continuous upload of the multiview frame, no face embedding, no plate string, no audio, no live stream. The raw 1920x1080 multiview never leaves the building. That egress contract is what has made our system approvable on privacy-sensitive properties where cloud AI surveillance is not.
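The "roughly 240 KB" figure is the sum of the three parts listed above. Here is a minimal Python sketch of that sum. It assumes 1 KB = 1000 bytes, and the events-per-day value in the monthly estimate is a hypothetical input, not a number from our deployments.

```python
# Component sizes quoted on this page (assuming 1 KB = 1000 bytes).
THUMBNAIL_BYTES = 18 * 1000   # JPEG thumbnail
CLIP_BYTES = 220 * 1000       # ~6 second H.264 clip
METADATA_BYTES = 612          # JSON metadata object


def event_payload_bytes() -> int:
    """Total outbound bytes for one delivered HIGH event."""
    return THUMBNAIL_BYTES + CLIP_BYTES + METADATA_BYTES


def monthly_egress_mb(events_per_day: float, days: int = 30) -> float:
    """Estimated monthly cellular egress in MB for a given event rate."""
    return events_per_day * days * event_payload_bytes() / 1_000_000


print(f"{event_payload_bytes() / 1000:.1f} KB per event")
print(f"{monthly_egress_mb(10):.1f} MB per month at 10 events/day (hypothetical)")
```

The per-event total comes to 238.6 KB, which is where the rounded 240 KB comes from.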

What frame rate does the detector actually run at, given the HDMI input?

The HDMI signal from the DVR is almost always 30 FPS, sometimes 25 FPS on European-region DVRs, occasionally 15 FPS on very old units that were configured for low-bandwidth remote viewing. The detector samples that signal at 1 FPS for the full 25-tile grid, which gives each tile one inference pass per second. That is sufficient for person, vehicle, and package detection at property distances (a person takes several seconds to cross any meaningful zone), and it keeps the per-unit inference budget within the $450 hardware cost point. Properties that need faster sampling on a specific camera get a second unit dedicated to that camera, running a 1x1 grid at the full 1920x1080 and a higher sampling rate.
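The sampling described above amounts to keeping one frame out of every N. The Python sketch below shows that under simple assumptions: a steady source frame rate and a fixed integer stride. The function names are illustrative and are not our firmware's API.

```python
def sample_stride(source_fps: int, sample_fps: int = 1) -> int:
    """How many source frames to advance between sampled frames."""
    return max(1, round(source_fps / sample_fps))


def sampled_frame_indices(total_frames: int, stride: int) -> list[int]:
    """Indices of the source frames that would be passed to the detector."""
    return list(range(0, total_frames, stride))


def tile_inferences_per_second(tiles: int, sample_fps: int = 1) -> int:
    """Per-tile passes the detector performs each second across the grid."""
    return tiles * sample_fps


for fps in (30, 25, 15):  # source rates mentioned on this page
    print(f"{fps} FPS source -> keep every {sample_stride(fps)}th frame")

print(tile_inferences_per_second(25), "tile inferences per second on a 5x5 grid")
```

At 1 FPS on a full 5x5 grid the detector handles 25 tile passes per second regardless of whether the DVR outputs 30, 25, or 15 FPS.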

How does the system know which tile is which camera after install?

During calibration the dashboard captures a reference composite frame and shows the installer a grid of numbered tiles (tile 1,1 through tile 5,5, or fewer depending on grid size). Next to each tile is a dropdown with the installer's camera names (front_gate, compactor_alcove, pool_deck, and so on). The installer tags each tile once. From that moment on, the system knows that tile [2,3] is compactor_alcove, and every event out of that tile carries the camera name on the payload. If the DVR layout is ever changed (a camera added, a channel renamed, the grid changed from 4x4 to 5x5), re-calibration is one button, captures a new reference, and the installer re-tags. No firmware flash, no model retrain.
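The tile-to-camera mapping can be pictured as a small lookup table plus a crop rectangle per tile. The Python sketch below assumes equal tiles on a 1920 x 1080 frame and reads "tile [2,3]" as row 2, column 3 (1-based). That indexing convention, the class, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass

FRAME_W, FRAME_H = 1920, 1080


@dataclass(frozen=True)
class Tile:
    row: int     # 1-based, top to bottom (assumed convention)
    col: int     # 1-based, left to right (assumed convention)
    camera: str  # name the installer picked during calibration


def tile_rect(row: int, col: int, grid: int) -> tuple[int, int, int, int]:
    """Return (x, y, width, height) of a tile inside the composite frame."""
    w, h = FRAME_W // grid, FRAME_H // grid
    return ((col - 1) * w, (row - 1) * h, w, h)


def build_lookup(tiles: list[Tile]) -> dict[tuple[int, int], str]:
    """Map (row, col) to camera name; rebuilt whenever the installer re-tags."""
    return {(t.row, t.col): t.camera for t in tiles}


def tag_event(lookup: dict, row: int, col: int, label: str) -> dict:
    """Attach the camera name to a detection coming out of a given tile."""
    return {"camera": lookup[(row, col)], "tile": [row, col], "label": label}


lookup = build_lookup([
    Tile(1, 1, "front_gate"),
    Tile(2, 3, "compactor_alcove"),
    Tile(5, 5, "pool_deck"),
])
print(tile_rect(2, 3, grid=5))
print(tag_event(lookup, 2, 3, "person"))
```

Because the mapping is only data, changing the grid from 4x4 to 5x5 means rebuilding the table from a new reference frame, which is why re-calibration needs no firmware flash or model retrain.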

Why is the input architecture missing from public writeups on AI surveillance systems?

Because public writeups are produced for three audiences and none of them need to know. Capability-oriented buyer's guides (Pelco, Avigilon, Coram) list what the model can detect, because that is what a demo sells. Policy pieces (Brookings, the ACLU) describe what the system could theoretically do, because the policy argument is at the capability level. Trend roundups repeat the phrase 'edge AI' and move on. The input architecture is the mundane operational question a property manager asks on day one: how do I plug this thing into what I already have. That question has a specific answer, and the answer is different for every AI surveillance system on the market. Almost none of them tell you until you are already on a procurement call. This page is the part we wish we had been able to Google in 2022.

The input architecture is the product

Capability-oriented buyer's guides will tell you the model detects a person. So will every other AI surveillance system on the market. The real question is what it takes to get that model looking at your cameras at all. Monitor-first answers in minutes. Camera-first answers in weeks. Same detection, completely different deployment.

Our system publishes the HDMI tap because the tap is the product. A $450 device, a $200 monthly subscription, up to 25 camera tiles on one HDMI port, and a physical install that takes less time than running a dishwasher cycle.