Cyrano Security
12 min read
Video surveillance AI, the pixel budget behind the buzzwords

Video surveillance AI is a tile-resolution budget problem. Nobody on the SERP prints the budget.

Every top-ranking article on video surveillance AI lists the same features: anomaly detection, licence plate recognition, facial recognition, behavioural analytics, reduced false positives. None of them publish the one thing that actually decides whether any of those features work on a given property: how many pixels each camera gets inside the frame the model is allowed to see. This page prints the pixel budget, per layout, per detection class, and shows the operational artifact that has to exist before a composite-input model can read a DVR at all.

See the pixel budget on your DVR
4.9 from 50+ properties
1 forward pass per composite HDMI frame, up to 25 tiles
Tile resolution between 384x216 and 1920x1080 depending on layout
Per-layout mask blanks clock, name strip, channel bug before detection
Zero frames leave the property during inference

The question the SERP skips: what does the model actually see?

Read the first ten results for video surveillance AI and you will come away with the same list of capabilities every time. Real-time analytics. Behavioural detection. Anomaly scoring. Proactive alerts. Reduced operator fatigue. The articles describe outcomes. They do not describe inputs. A technical reader never finds out what shape the input tensor is, at what resolution each camera lands inside it, or how many times per second the model actually runs.

That omission hides the one constraint that decides whether an AI deployment will work on a given property. A model is a function from pixels to boxes. If the pixels of interest are not in the input, no amount of model quality recovers them. On a retrofit install where the AI is reading a DVR's composite multiview, the tile resolution is fixed by whichever multiview layout the guard chose. 16 cameras in a 4x4 grid at 1920x1080 gives each camera about 480 by 270 pixels. 25 cameras in a 5x5 gives each one about 384 by 216. A fullscreen 1x1 gives one camera the full 1920x1080. That budget is the entire technical story.
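The tile arithmetic above can be sketched in a few lines. This is an illustrative calculation, not Cyrano code, and integer division stands in for whatever rounding the DVR compositor actually applies:

```python
def tile_budget(frame_w: int, frame_h: int, grid: int) -> tuple[int, int, int]:
    """Pixels each camera gets in an n-by-n multiview of a composite frame."""
    tile_w = frame_w // grid
    tile_h = frame_h // grid
    return tile_w, tile_h, tile_w * tile_h

# 4x4 on a 1080p composite: 480x270 per camera (~0.13 MP)
print(tile_budget(1920, 1080, 4))   # (480, 270, 129600)
# 5x5: 384x216 per camera (~0.08 MP)
print(tile_budget(1920, 1080, 5))   # (384, 216, 82944)
```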

This guide prints the budget, maps detection classes onto it, and describes the one configuration artifact that has to be present before the model can read a composite at all.

The anchor numbers, printed plainly

These four numbers define the compute and visual budget of a composite-input video surveillance AI pipeline on a 16 to 25 camera property. Everything else is downstream of them.

1920 pixels wide per composite frame
1080 pixels tall per composite frame
30 fps ceiling from DVR HDMI out
25 tiles resolved per forward pass

Zero RTSP sessions, zero cloud hops, and one forward pass per tick, regardless of whether 4, 16, or 25 cameras are inside the composite.

The per-layout pixel budget, every cell the model ever sees

The DVR decides how many tiles it packs into one 1920x1080 HDMI frame. Each tile gets a deterministic slice. The table below is the complete budget a composite-input model has to work with, before any compression or subsampling the detector does internally. If a camera of interest is in a 5x5 layout, it gets 384x216 pixels and no more.

Layout id | Cameras in frame | Tile width (px) | Tile height (px) | Pixels per camera
1x1-std   | 1                | 1920            | 1080             | 2.07 MP
2x2-std   | 4                | 960             | 540              | 0.52 MP
3x3-std   | 9                | 640             | 360              | 0.23 MP
4x4-std   | 16               | 480             | 270              | 0.13 MP
5x5-std   | 25               | 384             | 216              | 0.08 MP

The layout ids above are the exact strings Cyrano writes to its layout cache on the edge unit. A freshly installed 16-camera site will typically end up with 4x4-std and 1x1-std entries in the cache after the first shift, because guards drill into individual cameras when motion catches their eye, and the cache picks up both shapes within an hour of live traffic.

Where a composite-input video surveillance AI pipeline reads its pixels from

Installed cameras → DVR compositor → HDMI guard-monitor out → layout cache → Cyrano edge unit → tile-resolved bounding boxes → mask-suppressed glyphs → zone and dwell filter → WhatsApp / SMS alert

Detection class vs tile resolution, a feasibility matrix

Every detection class has a minimum pixel footprint its features rely on. Person silhouettes need around 60 pixels of height to classify reliably at middle distance. Licence plate character reads need around 20 pixels of character height. Facial recognition at distance needs inter-eye spacing on the order of 30 to 40 pixels. Once you know the footprint and the tile size, feasibility is just arithmetic.

Feature | Needs a dedicated tile (1x1 fullscreen) | Survives a composite tile
Person detection at middle distance | Not required | Yes on 2x2 through 5x5 (384x216 and up)
Vehicle presence in fire lane or parking | Not required | Yes on 2x2 through 5x5
Package drop and dwell | Not required | Yes on 3x3 through 4x4 (480x270 comfortable)
Tailgate at a vestibule door | Not required | Yes on 2x2 through 4x4
Loitering dwell in a defined zone | Not required | Yes on 3x3 through 5x5
Crowd density estimate | Not required | Yes on 4x4 and 5x5 (blob-scale features)
Licence plate character recognition | Yes on 1x1-std with camera framed for plate depth | No on any multi-tile layout
Facial recognition at distance | Yes on 1x1-std at close range, not at 30+ feet | No on any multi-tile layout
Reading small printed text on a package | Yes on 1x1-std with camera framed close | No on any multi-tile layout
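The matrix reduces to arithmetic. A hedged sketch, using the approximate footprints from the cards further down as constants; the names, fractions, and thresholds here are illustrative, not Cyrano's:

```python
# Approximate minimum on-tile pixel heights, per the public detection
# literature this page summarises; illustrative constants only.
MIN_HEIGHT_PX = {"person": 60, "plate_char": 20, "package": 40}

def survives(tile_h_px: int, cls: str, subject_fraction: float) -> bool:
    """True if the subject's projected height clears the class footprint."""
    return tile_h_px * subject_fraction >= MIN_HEIGHT_PX[cls]

# Person filling a quarter of a 4x4 tile (270 px tall): 67.5 px, detectable
print(survives(270, "person", 0.25))       # True
# Plate character at ~2% of the same tile: ~5 px, unreadable
print(survives(270, "plate_char", 0.02))   # False
```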

What each detection class actually needs, in pixels

The cards below are the pixel footprints the public detection literature converges on for each class. They are not Cyrano numbers; they are what the underlying detector families want. If a tile does not clear the footprint, the class fails there.

Person detection: ~60 px of height

A person silhouette classifies reliably once they occupy about 60 pixels of tile height. On a 4x4 tile (270 px tall) that is roughly 22 percent of the tile, which is what a person at 30 to 40 feet of depth lands at. Survives every composite layout.

Vehicle presence: ~80 px of length

A car-shaped blob needs about 80 pixels of horizontal extent. Every composite layout clears this for vehicles in parking lots and fire lanes.

Package drop and dwell: ~40 px square

Small-object detection for packages needs about 40x40 pixels of signal. Clears a 4x4 tile comfortably when the camera is framed toward a mailroom or leasing lobby floor.

Tailgate at vestibule: two person silhouettes in one frame

Needs two roughly 60 px tall silhouettes visible simultaneously. A 2x2 or 3x3 tile frames this easily for a standard apartment vestibule camera.

Loitering dwell: person + zone polygon + timer

Only needs a person detection plus a configured zone mask. The expensive part is the timer, not the pixels. Survives anything down to 5x5 (384x216).
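The card above reduces to a point-in-zone test plus a timer. A minimal sketch, with hypothetical names, a rectangular zone standing in for a polygon, and an assumed 30-second threshold:

```python
def point_in_rect(x: float, y: float, rect: tuple) -> bool:
    """Zone test; real zones are polygons, a rectangle keeps the sketch short."""
    x0, y0, x1, y1 = rect
    return x0 <= x <= x1 and y0 <= y <= y1

class DwellTimer:
    """Fires once a subject has stayed inside the zone past the threshold."""
    def __init__(self, zone: tuple, threshold_s: float = 30.0):
        self.zone = zone
        self.threshold_s = threshold_s
        self.entered_at = None

    def update(self, box_center: tuple, now_s: float) -> bool:
        x, y = box_center
        if point_in_rect(x, y, self.zone):
            if self.entered_at is None:
                self.entered_at = now_s   # subject just entered the zone
            return now_s - self.entered_at >= self.threshold_s
        self.entered_at = None            # subject left, reset the timer
        return False
```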

Plate recognition: ~20 px of character height

A licence plate character must be about 20 pixels tall for reliable OCR. On any 4x4 or 5x5 tile, a plate at 20 feet depth lands at around 4 to 6 pixels per character. That is why plate reads require a 1x1 layout or a dedicated LPR camera framed specifically for plate depth.

The compute budget, in twelve lines of pseudocode

Two ways to shape the same video surveillance AI pipeline. The left-hand shape is what per-stream vendors ship: 25 RTSP connections, 25 decoders, 25 forward passes per tick. The right-hand shape is what a composite-input edge unit runs: one capture, one decode, one forward pass. Same camera count. The operational difference is where the install cost goes.

Per-stream vs composite-frame video surveillance AI

# Per-stream video surveillance AI, 25 cameras
# 25 RTSP sessions, 25 decoders, 25 forward passes, 25 credential surfaces

for camera in cameras:                 # 25 iterations per tick
    frame = rtsp_open(camera.url)      # needs creds, ONVIF, VLAN
    frame = decode(frame)              # 25 H.264 decoders
    boxes = model(frame)               # forward pass N of 25
    emit(camera.id, boxes)

# Compute per tick:  25x decode + 25x inference
# Credential surface: 25 per-camera tokens
# Failure shape:     one bad camera = one silent gap
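# The right-hand shape, for comparison. Same illustrative pseudocode style
# as above; function names such as hdmi_capture, layout_classify, apply_mask,
# and tile_lookup are placeholders, not Cyrano's actual API.

# Composite-frame video surveillance AI, 25 cameras
# 1 HDMI capture, 1 decoder, 1 forward pass, 0 per-camera credentials

frame  = hdmi_capture()                 # one 1920x1080 composite
frame  = decode(frame)                  # 1 decoder, not 25
layout = layout_classify(frame)         # e.g. 5x5-std from the cache
frame  = apply_mask(frame, layout)      # blank clock, name strips, bugs
boxes  = model(frame)                   # forward pass 1 of 1
for box in boxes:
    cam = tile_lookup(layout, box)      # map box back to its tile
    emit(cam.id, box)

# Compute per tick:  1x decode + 1x inference
# Credential surface: none
# Failure shape:     one bad camera = one visibly dark tile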
25x fewer forward passes per tick

What a live frame tick actually prints

The log slice below is one composite frame on a 16-camera multifamily property, with the 4x4-std layout active. Every step of the composite-input pipeline produces a readable line. If a vendor will not show you equivalent output on their own hardware, that is the signal to ask what their model actually sees.

cyrano edge tick, 16-camera property, 4x4-std layout
1 pass / 16 tiles

The question you ask a video surveillance AI vendor is not 'what is your accuracy number.' It is 'what tile resolution is my camera feeding your model at, and at that resolution, which of your advertised detections actually works here.' Everything else is marketing.

Cyrano field notes, tile-budget audit

DVRs and NVRs this composite-input pipeline reads

Hikvision DS
Dahua XVR and NVR
Lorex LNR and LHD
Swann DVR
Uniview NVR
Annke NVR
Reolink NVR
Night Owl DVR
Amcrest NVR
Q-See and rebrands

The layout router matches on tile geometry and corner anchors, not on DVR firmware, so the list above is illustrative rather than exhaustive. Any recorder that drives a standard HDMI multiview out of its guard-monitor port is a valid video surveillance AI input.

How to audit the pixel budget of a deployed system

Four checks, in order. Each one is a file or a reading on the edge unit itself. If any check fails, the deployment is not ready; nothing downstream will be correct.

1. Read the current active layout_id

On a Cyrano unit this is the top line of the live tick log: layout_classify result=4x4-std. If the layout_id keeps flipping faster than once a minute, a guard is repeatedly drilling into fullscreen and back out. Note the distribution.

2. Confirm the mask file exists for the active layout

ls /var/lib/cyrano/layouts/. There must be an entry for every layout_id the log has ever emitted. A layout_id with no mask file will fire phantom detections on clock digits and name-strip text. That is the single most common misconfiguration.

3. Print the tile resolution for the active layout

tile_w and tile_h are written in the log. Cross-reference them against the detection classes you are relying on. If any enabled class requires more pixels than the active tile delivers, that class is effectively off for the duration of the layout.

4. Audit the 24-hour compression ratio of events

Count raw forward-pass detections against delivered alerts for a day. A healthy 16-camera property compresses 200 to 500 raw fires into 5 to 20 delivered alerts. Anything outside that range is either phantom firing (mask or zones misconfigured) or missing real events (tile resolution too low for the class).
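Check 4 is two range tests. A hypothetical helper using the healthy bands quoted above; this is an illustration of the audit logic, not shipped Cyrano tooling:

```python
def compression_health(raw_fires_24h: int, delivered_24h: int) -> str:
    """Classify a 24h window against the article's healthy bands
    for a 16-camera property (200-500 raw fires, 5-20 alerts)."""
    if 200 <= raw_fires_24h <= 500 and 5 <= delivered_24h <= 20:
        return "healthy"
    if raw_fires_24h > 500 or delivered_24h > 20:
        return "phantom firing: audit mask and zone config"
    return "missing events: audit tile resolution for enabled classes"

print(compression_health(350, 12))   # healthy
print(compression_health(900, 40))   # phantom firing: audit mask and zone config
```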

Pre-buy checklist for any video surveillance AI vendor

  • Ask what the input tensor to your model is. They should answer in pixels.
  • Ask how many forward passes per tick they run on a 25-camera property.
  • Ask whether frames leave the property during inference.
  • Ask for a sample log line showing layout, tile resolution, and mask binding.
  • Ask for the 24-hour raw-to-delivered compression ratio from a real deployment.
  • Ask which of their detection classes survive each multiview layout their input supports.
  • If the answer to any of the above is marketing language, keep asking.

Print the pixel budget for your DVR, on a live call

15 minutes. We attach a Cyrano unit to a production DVR, read the live layout_id, the mask file, the tile resolution, and the compression ratio, and you see the actual input tensor the model is reading.

Book a call

Video Surveillance AI: Frequently Asked Questions

What actually is video surveillance AI, in one sentence a technician can verify?

It is a model that reads pixel frames off a camera feed (or off a recorder's composite multiview) and emits bounding boxes labelled with a class such as person, vehicle, or package. Everything marketed as video surveillance AI reduces to that loop: a frame goes in, bounding boxes plus labels come out, and a downstream filter decides whether a bounding box becomes an alert. The interesting question is not the loop; it is the frame the model is given in the first place.

Why is tile resolution the budget that decides what video surveillance AI can and cannot do?

A detection model cannot classify pixels that were never in the input. If a person needs at least 60 pixels of height to register reliably and the tile they appear in is only 216 pixels tall, they have to occupy at least 28 percent of the tile's vertical space. If a licence plate needs 20 pixels of character height, and the tile is 270 pixels tall at 50 feet of depth, the plate glyphs land at roughly 3 to 5 pixels tall and no model, however good, will read them. Tile resolution caps the category of detections before any AI runs.

What does the AI see on a 4x4 composite multiview at 1920x1080?

It sees a 1920x1080 frame made of sixteen tiles, each roughly 480x270 pixels, with a live clock burned into the top-right corner, a per-tile name strip burned along the top of each tile, and a channel indicator in one corner of each tile. That is the entire input tensor to the model. If the mask file for this layout (conventionally 4x4-std.mask on the Cyrano edge unit) is not loaded, the detector will fire phantom boxes on the clock digits and the name-strip letters. With the mask loaded, the detector sees only the sixteen camera crops.

How many forward passes per tick does video surveillance AI actually run on a 25 camera property?

It depends entirely on where inference lives. A per-stream architecture opens 25 RTSP sessions, decodes 25 H.264 streams, runs 25 forward passes per tick, and fans the boxes back to a dashboard. A composite-frame architecture (the one Cyrano uses) opens one HDMI capture, runs one forward pass on the single composite frame, and maps each bounding box back to its tile using cached coordinates. For the same 25 cameras the compute budget differs by roughly 25x. That ratio is the reason a side-channel edge unit can run on a $450 accelerator instead of a rack.

What detection classes actually survive a 480x270 tile, and which ones do not?

Person detection, vehicle presence, package drop, loitering-dwell, tailgate at a vestibule, zone entry, and crowd density all survive 480x270 tiles. They rely on blob-scale features. Detection classes that need fine-grained per-pixel reads (licence plate character recognition at distance, facial recognition at distance, reading small printed text on a package label) do not survive. For those, the operator has to drill the DVR to a 1x1 layout so the full 1920x1080 is dedicated to one camera, and the model re-reads that single tile at full resolution for the duration of the fullscreen.

Does video surveillance AI need cloud compute?

Not architecturally, though many vendors choose it for convenience. A detector small enough to run at 30 fps on a $300 accelerator does not need a datacenter. The reason cloud-hosted video surveillance AI exists is that sending frames to a GPU in a datacenter is easier to sell than shipping a box. The reason edge-hosted video surveillance AI exists is that shipping a box is easier to operate: frames never leave the property, the privacy story is simpler, the bandwidth bill is zero, and a dropped internet link does not break detection.

What is the overlay mask and why is it the unlisted prerequisite of composite-input video surveillance AI?

DVRs burn graphics into their HDMI output that are not part of any camera feed: a live clock, a per-tile name strip, a recording indicator, sometimes a weather widget. If a detector sees that composite without subtracting those glyphs, it fires phantom detections on the text. The overlay mask is a polygon file computed once per layout at install (for example 4x4-std.mask, 5x5-std.mask, 1x1-std.mask) that blanks those regions in constant time. On a Cyrano unit the masks live at /var/lib/cyrano/layouts and are keyed by a hash of tile geometry plus corner anchors. Without this artifact a composite-input model is unusable. With it, the composite is cleaner than a raw per-stream input, because the model never sees any rendered chrome.
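The keying scheme described here, a hash of tile geometry plus corner anchors, can be sketched as follows. The actual hash Cyrano uses is not published, so treat this as an illustration of the idea, not its implementation:

```python
import hashlib

def mask_key(tile_w: int, tile_h: int, anchors: list) -> str:
    """Illustrative layout-cache key: hash tile geometry plus
    corner-anchor coordinates into a short stable identifier."""
    blob = f"{tile_w}x{tile_h}:" + ",".join(f"{x}:{y}" for x, y in anchors)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

# Identical geometry always resolves to the same mask entry,
# regardless of DVR brand or firmware.
print(mask_key(480, 270, [(0, 0), (1919, 0)]))
```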

How is an operator supposed to audit the pixel budget of a deployed video surveillance AI system?

Four artifacts give a complete picture. First, the current active layout_id of the input (4x4-std, 5x5-std, 1x1-std). Second, the tile resolution that falls out of that layout. Third, the list of detection classes enabled at that tile resolution. Fourth, the mask file bound to the current layout_id. If all four are present and mutually consistent, the deployment is healthy. If any is missing (for instance a layout_id with no mask), detections will either be phantom-prone or entirely wrong. Cyrano publishes these artifacts as readable files under /var/lib/cyrano on the edge unit; a vendor that refuses to show them does not want you to know what its model is actually looking at.

Which DVR and NVR brands drive the composite multiview the AI reads from?

Any recorder that outputs a standard HDMI multiview to a guard monitor is a valid input. In practice that covers Hikvision, Dahua, Lorex, Swann, Uniview, Annke, Reolink, Night Owl, Amcrest, Q-See, and the long tail of rebranded recorders shipped to multifamily and commercial in the last ten years. The layout router on the Cyrano unit does not match on DVR brand; it matches on tile geometry and corner anchors. If the DVR drives 1920x1080 with a recognisable tile grid, the AI works.

Where does inference run in the specific case of Cyrano?

On the Cyrano edge unit itself, which sits on the HDMI line between the DVR and the guard monitor. Model weights, layout cache, mask files, the filter stack, and the event queue all live on local storage. The only thing that leaves the unit is an already-filtered event payload (thumbnail, metadata, camera name, zone, dwell seconds, layout_id, latency_ms) sent over the customer's network to their WhatsApp or SMS endpoint. Raw frames never leave the property.
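The payload fields listed here suggest a shape like the following; key names and every value are illustrative, not Cyrano's wire format:

```python
# Hypothetical shape of the only data that leaves the property:
# an already-filtered event, never a raw frame.
alert_payload = {
    "camera_name": "lobby-2",        # illustrative camera label
    "zone": "package-room",          # configured zone the box landed in
    "dwell_seconds": 34,
    "layout_id": "4x4-std",          # layout active when the event fired
    "latency_ms": 41,
    "thumbnail_jpeg": b"...",        # cropped tile thumbnail, placeholder bytes
}

print(sorted(alert_payload))
```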

🛡️ Cyrano · Edge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
