Cyrano Security
One forward pass, 25 cameras, 0.72 ms per camera

AI powered surveillance at 25 cameras is a batching problem, not an inference problem.

The first-page results for this phrase argue about better models, richer object classes, and dual-lens 180-degree panoramas. At 25 cameras on a real property, the cost that actually decides your price, latency, and hardware footprint is not the model. It is how the feeds get batched before inference runs. Your DVR already solved batching for you, in dedicated silicon, decades ago. Cyrano is built around harvesting that free work. This page is the math.

See one forward pass cover 25 cameras live
4.9 from 50+ properties
One forward pass across 25 tiles: ~18.4 ms per frame
Per-camera amortized compute cost: ~0.72 ms
Capture point is DVR HDMI composite, not N RTSP streams
One unit covers up to 25 cameras, mixed analog and IP

The top search results argue about the wrong thing

Run the query and read page one. You get trend pieces about subscription-free home cameras, explainers on dual-lens stitching to 180-degree panoramas, think pieces on predictive behavioral AI, market-size decks claiming a $6.8B industry, and a long tail of privacy op-eds. Every one of them treats AI powered surveillance as a model question: which object classes does it recognize, how accurate is it, what new behavior does it predict, how many pixels does it ingest.

Not one of them works through the actual compute math of a 16 or 25 camera property. Nobody explains where the inference runs, how the N-to-1 camera-to-inference ratio is achieved, or what the per-camera amortized cost is in milliseconds. The model question is treated as the whole question.

On a deployed property, the model question is roughly five percent of the whole question. The other ninety-five percent is where the frames enter the AI, how many separate inference calls you pay for per second, and whether the compositing was done for you by dedicated silicon or has to be recreated in software inside a cloud ingest. That is the hidden spec that determines whether the system costs $12.50 per camera or $120 per camera. And that spec is set by the capture point, not the model.

The numbers that actually set the price and latency

Four numbers from a healthy Cyrano deployment on a 25-camera property. They come from the live inference pipeline, not from a spec sheet.

25 cameras per unit (max)
18.4 ms one forward pass, all tiles
0.72 ms per-camera amortized compute
30 frames per second across all cameras

Values reflect typical 5x5 composite layouts at 1920x1080 on a small edge accelerator. The 0.72 ms figure is 18.4 ms divided by 25. It is an architectural consequence of batching, not a benchmark win.

Where the batching actually happens (and why it is free)

Cyrano does not stitch frames in software. The DVR already did it, in dedicated compositing silicon, because that is what DVRs have to do to drive a guard monitor. The beam below is the physical signal path. The hub in the middle is a literal HDMI cable from the DVR to the Cyrano unit; the compositing work happens upstream of that cable at zero additional cost to the AI pipeline.

Capture point: DVR HDMI composite, one frame covers all cameras

Analog BNC cam
PoE IP cam
Wireless bridge cam
PTZ / fisheye
Mixed brand / age
DVR HDMI composite
One forward pass
25 ROI output slices
Zone + dwell rules
Alert packet to phone

The DVR is already doing the compositing whether anyone taps the HDMI output or not. The guard monitor is where that output normally terminates. Cyrano inserts itself in that path and harvests the composite as its inference input. No frames are stitched in software on the AI device.

Two architectures, same model, same silicon, 25x wall-clock difference

This is the branch point. The pseudocode below shows both paths side by side: one that inherits DVR batching, one that runs N separate inferences per second. The model and the hardware are identical. Only the capture point changes.

detect.ts
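A minimal TypeScript sketch of the two paths under comparison. The helper names in the comments (captureHdmiFrame, runDetector, decodeRtspFrame) are hypothetical, and the millisecond constants are the figures quoted on this page, not measurements taken by this snippet:

```typescript
// Sketch of the two capture-point architectures. Same detector, same silicon;
// only where the pixels enter changes.

const COMPOSITE_PASS_MS = 18.4; // one pass over the full 1920x1080 composite
const PER_STREAM_PASS_MS = 30;  // low end of the 30-50 ms per-stream range

interface RoundCost { inferenceCalls: number; wallClockMs: number }

// Path A: inherit the DVR's batching. One HDMI frame, one inference call.
function detectComposite(): RoundCost {
  // const frame = captureHdmiFrame();  // arrives pre-composited, 25 tiles
  // const boxes = runDetector(frame);  // one call covers every tile
  return { inferenceCalls: 1, wallClockMs: COMPOSITE_PASS_MS };
}

// Path B: N separate streams. Every camera pays its own kernel launches,
// memory bindings, and output decode.
function detectPerStream(cameras: number): RoundCost {
  // for (const cam of streams) runDetector(decodeRtspFrame(cam));
  return { inferenceCalls: cameras, wallClockMs: cameras * PER_STREAM_PASS_MS };
}

console.log(detectComposite());   // { inferenceCalls: 1, wallClockMs: 18.4 }
console.log(detectPerStream(25)); // { inferenceCalls: 25, wallClockMs: 750 }
```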

Per-stream overhead is not about model size; it is about kernel launches, memory bindings, and output decodes. At the resolutions property surveillance actually uses, those fixed costs dominate. Batching amortizes them across all cameras in a single call.

The packet, as it is built on the unit

Stdout of a single event on a live Cyrano unit running a 5x5 composite. Every line is annotated with wall-clock ms. The detect line is the single forward pass. Everything before it is setup; everything after it is packing and dispatch.

detect.emit — one pass, 25 tiles, one event

What the batching gets you at the property level

Four consequences that show up on deployment, not on benchmark pages. Each one is a direct downstream effect of running one inference pass across 25 cameras instead of 25 inference passes across one camera each.

Per-camera cost collapses

25 cameras sharing one inference pipeline means one GPU slice, one power budget, one thermal envelope. The per-camera math lands around $12.50 per camera per month on a 16-camera property on the $200 per property plan. Per-camera cloud architectures sit at $20 to $120 per camera per month because each camera owns its own inference pipeline. The cost gap is structural.

Latency stays under 2 seconds

Frame to phone buzz is about 1.1 seconds. Inference is 18 ms of that. The dominant cost is the last-mile push delivery. Shaving the model would not change the user-visible number. Batching makes inference cheap enough to disappear into noise.

Runs on a small device

One edge accelerator at the property is enough for 25 cameras because the inference load is one pass per frame, not 25. Thermal and power budgets stay within what a passively cooled, fanless unit can sustain continuously.

Mixed camera generations just work

Analog BNC from 2014, IP PoE from 2024, a wireless bridge, a fisheye, whatever. By the time the signal hits the HDMI composite, it is a tile in a grid. Cyrano does not negotiate credentials with any camera because it never talks to any camera. It watches the guard monitor feed.

~18.4 ms

One forward pass. Twenty-five tiles in a 1920x1080 composite. That is the total inference budget per frame across every camera on a 25-camera property. Per-camera cost is 18.4 divided by 25. The only reason the number is that small is that the DVR composited the feeds before they reached the AI.

Cyrano inference profile, 5x5-std layout on a live 25-camera deployment

Four capture points, four compute bills

The question that sorts every AI powered surveillance vendor in the market is not which model they run. It is where the pixels enter the AI. This table maps capture point to the inference pattern it forces.

Feature | Per-stream RTSP / cloud ingest / new cams | DVR HDMI composite (Cyrano)
Where pixels enter the AI | One RTSP or cloud stream per camera | One HDMI composite covering all cameras
Who did the batching | Nobody; must be done in software per round | The DVR's hardware compositor did it at frame-time
Inference calls per round | N (one per camera) | 1 (one for the composite)
Per-frame compute cost | ~30 to 50 ms per camera | ~18.4 ms across all cameras (0.72 ms each)
Per-camera monthly price implied | $20 to $120 per camera | ~$12.50 per camera (16-cam property)
Existing cameras reused | Sometimes (RTSP) or never (rip-and-replace) | Always (analog, IP, any brand, any age)
Bytes leaving the property continuously | Up to full-frame streams, 24/7 | Zero between events; ~240 KB per event

What actually rides on one Cyrano unit

Up to 25 cameras, all sharing one inference pipeline. The orbits below are what a real property typically puts on a single unit. They all enter the same HDMI composite and get detected in the same forward pass.

Cyrano unit
25 tiles
one forward pass
Lobby
Mailroom
Main gate
Parking NW
Parking SE
Stairwell A
Stairwell B
Pool deck
Gym entry
Dumpster bay
Loading dock
Rear entry

How the pass unfolds across a single frame

Five steps on the unit, top to bottom. The only step that touches the model is step three. Everything else is framing, masking, and slicing.

1

1. Inherit the composite

HDMI decodes to RGB in ~2 ms. No stitching, no waiting on 25 RTSP flows. The composite arrives already assembled because the DVR's compositor is always running.

2

2. Apply the overlay mask

The DVR burns a clock, a channel bug, and a tile-name strip into the output. The mask, cached per layout_id, multiplies those glyphs to zero so they do not produce phantom detections. Constant-time per frame.

3

3. One forward pass

Fed the masked 1920x1080 frame, the detector runs once. Output tensor carries bounding boxes in composite coordinates across all 25 tiles. Wall-clock: ~18.4 ms on the target silicon.

4

4. Slice boxes by tile ROI

TILES[layout_id] is a precomputed list of rectangles in composite space. Each output box is assigned to a tile by coordinate clip. 25 tiles in, 25 lists of boxes out. Constant-time per box.

5

5. Apply zone and dwell rules

Per-tile zones (mailroom_after_21, gate_no_vehicle, etc.) and dwell windows convert the ~200 raw daily detections into ~3 to 8 delivered alerts. This is where the 50-to-1 compression happens.
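Steps 4 and 5 can be sketched roughly as follows. The grid geometry, the rule shape, and the function names are illustrative assumptions, not the unit's actual schema:

```typescript
// Illustrative sketch of steps 4 and 5: clip composite-space boxes into
// per-tile lists, then let per-tile zone rules pick which become alerts.

interface Rect { x: number; y: number; w: number; h: number }
interface Detection extends Rect { label: string }
interface ZoneRule { tile: number; label: string }

// 5x5 grid of 384x216 tiles inside a 1920x1080 composite
const TILES: Rect[] = Array.from({ length: 25 }, (_, i) => ({
  x: (i % 5) * 384,
  y: Math.floor(i / 5) * 216,
  w: 384,
  h: 216,
}));

// Step 4: a box belongs to the tile that contains its center point.
function sliceByTile(boxes: Detection[], tiles: Rect[]): Detection[][] {
  const perTile: Detection[][] = tiles.map(() => []);
  for (const b of boxes) {
    const cx = b.x + b.w / 2;
    const cy = b.y + b.h / 2;
    const i = tiles.findIndex(
      t => cx >= t.x && cx < t.x + t.w && cy >= t.y && cy < t.y + t.h,
    );
    if (i >= 0) perTile[i].push(b);
  }
  return perTile;
}

// Step 5: only detections matching a rule's tile and label survive.
function applyZoneRules(perTile: Detection[][], rules: ZoneRule[]): Detection[] {
  return rules.flatMap(r => perTile[r.tile].filter(d => d.label === r.label));
}
```

A person box centered in the second tile of the top row, for example, routes to tile index 1 and fires only if a rule watches that tile for persons.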

Checklist: what to ask any AI powered surveillance vendor

Six questions that reveal the capture-point architecture, which is the spec that actually sets the price. None of them require an NDA to answer.

Capture-point questions every vendor should be able to answer in one sentence

  • Where physically does the inference pipeline read its pixels from? HDMI composite, RTSP streams, cloud ingest, or your own cameras?
  • How many forward passes per second does the pipeline run for an N-camera property? If it scales with N, the per-camera cost will too.
  • What is the per-camera amortized inference wall-clock cost, in milliseconds, at the largest supported property size?
  • Does the architecture require replacing cameras, replacing the DVR, or exposing RTSP credentials on each camera? If yes, the truck-roll cost is hidden in the install, not the spec sheet.
  • What bytes leave the property continuously, if any? A system that streams full frames to a cloud region pays for that bandwidth month after month, whether anything fires or not.
  • When a DVR operator zooms to fullscreen on one camera, does the inference pipeline adapt to the new layout, and is the active layout recorded in every event payload so integrators can audit later?

The unit economics of AI powered surveillance are set before any model runs.

If the capture point is one HDMI composite, one inference pass covers the property. Per-camera cost falls to sub-millisecond compute and roughly $12.50 per camera per month on a flat property plan. If the capture point is N separate streams, inference cost scales with N, and every vendor in that architecture prices at $20 to $120 per camera per month. Same cameras, same property, same detector. The difference is where the batching happens. The DVR did it for free; you just have to plug into the HDMI port.

See one forward pass cover 25 cameras, live, on a real property

Twenty minutes. A running DVR, a Cyrano unit on the HDMI port, and stdout showing the single inference pass every frame.

Book a call

Frequently asked questions

Why does batching matter for AI powered surveillance at 25 cameras?

Inference is almost never bound by raw work; it is bound by per-call overhead. On small models (the person, vehicle, package detectors most surveillance systems actually use), a single 1920x1080 forward pass costs roughly the same wall-clock time as a single 640x384 forward pass because the fixed costs (kernel launch, memory binding, output decode) dwarf the flop count at these resolutions. So a system that runs 25 separate inferences on 25 streams pays 25x the overhead. A system that runs one inference on a frame that already has 25 tiles composited into it pays that overhead once. On a Cyrano unit, one forward pass across a 5x5 composite completes in ~18.4 ms. Run sequentially as 25 small inferences on the same silicon, that same scene costs ~200 to 450 ms. Same model, same silicon, same cameras. The only difference is where the batching happened.

Where does the batching actually happen in a Cyrano deployment?

Inside your existing DVR or NVR, before Cyrano is even involved. Every DVR on the market drives a guard monitor over HDMI by rendering a composite multiview: for a 16-camera property, that is typically a 4x4 grid at 1920x1080; for a 25-camera property, a 5x5 grid with tiles around 380x216 each. The DVR's composite is the batch. Cyrano plugs into the HDMI output, decodes the composite frame to RGB, feeds it into one forward pass, and slices the output detection tensor by tile coordinates to figure out which camera each bounding box belongs to. The batching cost is paid by the DVR's hardware compositor, which is a dedicated chip optimized to do exactly that. The AI device spends zero cycles stitching frames; they arrive already stitched.

Why can't a cloud AI surveillance vendor do the same batching?

They can in principle, but the deployment shape forbids it. A cloud vendor ingests each camera as a separate RTSP stream, arriving at the data center as independent TCP flows with independent timestamps. To batch them, the vendor would have to buffer each stream, wait for frames at matching wall-clock moments across all N streams, composite them into a grid in software, run one inference, then demux the results. That adds buffering latency (typically 100 to 400 ms), fails the moment any one camera drops or stutters, and wastes GPU memory staging intermediate composites. The DVR-HDMI path avoids all of that because the compositing was done deterministically by the DVR at frame-time, on silicon designed for it. Cyrano inherits that work for free. A cloud pipeline cannot.

What is the concrete per-camera compute cost on a 25-camera property?

About 0.72 milliseconds per camera per frame. The math: one forward pass covers all 25 tiles in ~18.4 ms. Divide by 25 cameras and the amortized per-camera cost is 18.4 / 25 = 0.736 ms. Compare that to a per-stream architecture running the same detector on the same hardware: each stream costs ~30 to 50 ms of wall-clock time because each pays its own kernel launch, memory binding, and output decode overhead. Twenty-five streams at 30 ms each is 750 ms of compute per round, or a hard cap of ~1.3 frames per second if serialized. The same silicon in the Cyrano deployment runs at roughly 30 frames per second across all 25 cameras. The compute cost difference is not a model choice; it is an architecture choice.
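The arithmetic in this answer, spelled out; the constants are the figures quoted above:

```typescript
// Per-camera amortization math from the answer above, made explicit.
const passMs = 18.4;                    // one forward pass, all 25 tiles
const cameras = 25;
const perStreamMs = 30;                 // low end of the 30-50 ms range

const amortizedMs = passMs / cameras;              // 0.736 ms per camera
const serializedRoundMs = cameras * perStreamMs;   // 750 ms per round
const serializedFpsCap = 1000 / serializedRoundMs; // ~1.33 fps hard cap

console.log(amortizedMs.toFixed(3), serializedRoundMs, serializedFpsCap.toFixed(2));
// 0.736 750 1.33
```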

Does this mean a single Cyrano unit is sufficient even on larger properties?

Yes, up to 25 cameras per unit, which covers the overwhelming majority of Class B and C multifamily properties and nearly every standalone construction site. The 25-camera ceiling is not a compute ceiling; it is a pixel-density ceiling. When tiles drop below roughly 380x216 each (the size at which a person at typical camera distances is still a detectable number of pixels tall), detection accuracy starts to degrade. For properties with more than 25 cameras, the right answer is usually a second Cyrano unit tapped into a second DVR output or a second DVR, not a bigger single unit. Per-camera cost stays constant, because each unit is doing one-pass batching across its own 25-tile composite.

Does the batching approach work with existing analog BNC cameras, or only IP?

Both. The approach is agnostic to camera protocol because Cyrano never reads camera streams directly. It reads the DVR's HDMI output, which is a post-decoding, post-compositing signal. The DVR has already converted whatever arrived on its BNC or RJ-45 inputs (analog NTSC, PoE H.264, ONVIF RTSP, whatever mix the property happens to have) into raw RGB pixels in the composite grid. That means the inference pipeline does not care whether a given tile originated from a 2014 analog BNC dome or a 2024 PoE 4K bullet; by the time it hits the HDMI output it is just pixels in a rectangle. Mixed-brand, mixed-generation camera installations (common on older multifamily) work without firmware changes, without RTSP credentials, and without ONVIF negotiation.

What happens to the batching when the operator zooms to fullscreen on one camera?

The frame changes from a 5x5 grid (25 tiles) to a 1x1 grid (1 tile), and the inference pass becomes a one-camera high-resolution pass instead of a 25-camera low-resolution pass. Cyrano detects the layout change from the HDMI signal, switches the tile-slicing coordinates accordingly, and continues to run one forward pass per frame. The per-frame cost stays roughly the same (~18 ms); what changes is the effective camera coverage during the zoom. The layout change is recorded in every event payload as layout_id (for example 5x5-std vs 1x1-std), so an integrator can always reconstruct which compositing mode was active when an event fired. The batching never stops; it just operates on a different grid shape.
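A rough sketch of the layout switch, assuming a hypothetical gridTiles helper and event shape; the layout_id names mirror the ones used in this answer:

```typescript
// Layout-aware tile slicing: the active layout_id selects the grid that
// detection boxes are clipped against, and rides along in every event.

interface Rect { x: number; y: number; w: number; h: number }

function gridTiles(cols: number, rows: number): Rect[] {
  const w = 1920 / cols;
  const h = 1080 / rows;
  return Array.from({ length: cols * rows }, (_, i) => ({
    x: (i % cols) * w,
    y: Math.floor(i / cols) * h,
    w,
    h,
  }));
}

const LAYOUTS = {
  "5x5-std": gridTiles(5, 5), // 25 tiles, the normal guard-monitor grid
  "1x1-std": gridTiles(1, 1), // operator zoomed to a single camera
} as const;

// Every event payload records the compositing mode active when it fired.
function buildEvent(layoutId: keyof typeof LAYOUTS, tile: number) {
  return { layout_id: layoutId, tile, tiles_in_frame: LAYOUTS[layoutId].length };
}
```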

How does this compare to a per-camera AI cloud service like Verkada, Rhombus, or Coram?

Per-camera cloud services run one inference pipeline per stream, in a data-center region, paid for by the camera. That model prices at $20 to $120 per camera per month, because the vendor has to pay for a GPU slice that never sees another camera's pixels. Cyrano runs one inference pipeline for up to 25 cameras, on a device already physically sitting at the property, plugged into the DVR. The per-camera math falls to roughly $12.50 per camera per month at a 16-camera property on the $200 per month plan. The cost compression is directly downstream of the batching architecture; it is not a temporary promotional price. You cannot get to that price point if every camera triggers its own inference.

Does running fewer inferences sacrifice accuracy compared to per-camera models?

No, for the classes this market actually uses. The dominant detectors for property surveillance (person, vehicle, package, and a small set of pose/action heads) are trained and evaluated on crops at roughly the same scale a DVR tile produces (~380x216 or larger). A 4K per-camera inference does not help detect a trespasser any earlier; it helps read a license plate, which is a narrow subset of deployments. For the 95 percent of multifamily, industrial, and small-business surveillance workloads that are about 'is there a person in a restricted zone right now,' the composite tile is already above the accuracy-saturation resolution of the model. Extra pixels upstream are wasted on a detection task the DVR tile already solves.

Is the one-forward-pass architecture open or proprietary?

The compositing step is not proprietary — every DVR already does it because guard monitors need it. The inference step uses standard detectors running in a standard runtime on a small edge accelerator; it is not a custom model architecture. What Cyrano owns is the installed integration: the HDMI capture path, the overlay-mask handling that zeroes out the DVR's clock and channel-bug glyphs, the layout_id cache that remembers each DVR compositing mode, and the zone and dwell logic that converts raw detections into a 50-to-1 compressed alert stream. The batching insight itself is architectural and available to anyone willing to build against an HDMI capture point instead of 25 separate RTSP streams.

What is the end-to-end latency from camera frame to a human-readable alert?

About 1.1 seconds in typical deployments. Budget: ~2 ms HDMI frame decode to RGB, ~18 ms inference across all 25 tiles, ~4 ms zone-rule application and packet building, then the dominant remaining time is the network hop to the cloud control plane and the WhatsApp push delivery. The inference is already the cheapest part of the end-to-end path; shaving milliseconds off it would not change the user-visible latency because the last hop (mobile push delivery) dominates. The batching is what lets the inference be cheap enough that it does not even register in the total latency.

Do I need new cameras or a new DVR to get the 25-camera batching benefit?

No. You need a DVR or NVR that drives an HDMI output at 1920x1080 (effectively every DVR shipped in the last decade) and an HDMI cable. That is it. The Cyrano device plugs into the DVR's HDMI out and passes the signal through to the guard monitor with negligible latency on that pass-through path, so no existing workflow changes. Install time is about two minutes on a running DVR. Cameras stay, wiring stays, the DVR stays, the guard monitor stays. The only thing that gets added is the device that harvests the free batching the DVR was already doing.

🛡️ Cyrano: Edge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
