The only spec of an AI powered surveillance camera that matters is the 18 KB thumbnail on the on-call phone.
The first-page Google results for this phrase sell you 4K sensors, Wi-Fi 6, HDR, night vision, and face-recognition feature lists. Almost none of those specs reach the phone of the person who has to act on an alert. The output of a surveillance camera, once AI is in the loop, is a cropped JPEG, a six-second clip, and a JSON object. That three-artifact packet is the product. This page describes what is in it, why each field is there, and why the upstream sensor specs are not the thing you should be buying on.
The first-page SERP sells the sensor. It forgets the phone.
Search this keyword and read the first ten results. You get Defender’s 4K Guard Pro page, GW Security’s face-recognition feature list, Pelco’s corporate explainer, a couple of Amazon listings, thedatascientist.com’s “best AI cameras” roundup. Every one of them is structured the same way: a sensor spec table (resolution, frame rate, lens field of view, night-vision range, IP rating), a feature list (human detection, vehicle detection, face recognition, LPR, perimeter intrusion), a pricing mention, a buy button.
Read carefully and you will notice something missing: no result on page one shows you what the alert looks like when it reaches the person on call. No one documents the push notification. No one publishes the image size. No one lists the metadata fields. No one quotes the end-to-end latency. You are shown specs on the sensor and feature checkboxes, and then the article ends.
That blind spot is the entire deployment question. An AI camera whose thumbnail is a 4K multiview screenshot is a camera whose alerts get muted in week two. An AI camera whose thumbnail is a tightly cropped 320x240 JPEG of just the firing tile gets read at 2 a.m. on a phone lockscreen. The sensor did not change. The crop did.
The three artifacts that leave the building per detection
A Cyrano event is a packet of exactly three things. Nothing else moves. The sizes below are representative averages across deployed properties; the shapes are fixed.
1. The 18 KB thumbnail
A 320x240 JPEG crop of just the DVR tile that fired. Not the full multiview. Not the full source frame. Just the rectangle the detector said contained the triggering subject, with the zone label printed at the bottom and the dwell seconds in the corner. This is the image that fills a phone push-notification preview. It is tuned for readability on a 5-inch screen, not for archival resolution.
2. The 220 KB six-second clip
A six-second H.264 clip of the same tile, bracketing the detection moment with about three seconds pre-roll and three seconds post-roll. 320x240 at the DVR's native frame rate. Delivered as a linked asset, not inline in the notification, so the push still renders fast.
3. The 612-byte JSON
Structured metadata, not decoration. Fields: zone_id, dwell_seconds, wall_clock, class, confidence, tile_index, layout_id, latency_ms, event_class. This object is what the responder reads in the caption ("Mailroom, person, 17s dwell, after-hours") and what the integrator queries in logs.
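As a sketch of what that object looks like on the wire (the field names are exactly the ones listed above; every value here is invented for the example, not taken from a real event):

```python
import json

# Illustrative event payload. Field names follow the schema above;
# the values are made up for this example.
event = {
    "zone_id": "mailroom",
    "dwell_seconds": 17,
    "wall_clock": "02:14:09",
    "class": "person",
    "confidence": 0.91,
    "tile_index": 7,
    "layout_id": "5x5-std",
    "latency_ms": 1080,
    "event_class": "loitering_dwell_exceeded",
}

payload = json.dumps(event).encode("utf-8")
print(len(payload), "bytes")  # a few hundred bytes; real payloads average ~612 B
```

The caption the responder reads ("Mailroom, person, 17s dwell, after-hours") is rendered from exactly these fields.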
What does not leave
The full 1080p composite multiview. The raw YUV. The 24 other tiles that did not fire. Any continuous stream. Any face embedding. Any license-plate string. Any biometric vector. Audio. Off-property bandwidth per event averages roughly 240 KB total and drops to zero between events, which makes the system tolerable on residential cable uplinks and on cellular fallback.
The numbers, as they appear in the event log
These are the four numbers a property staffer or integrator can verify by pulling the last 24 hours of events off a live Cyrano unit. They are not marketing targets; they are what a healthy deployment actually produces.

| Number | Typical value |
|---|---|
| Thumbnail size | ~18 KB |
| Clip size | ~220 KB |
| JSON payload | ~612 bytes |
| End-to-end latency | ~1.1 s |
Numbers are averages on 16- to 25-camera multifamily properties. Thumbnail sizes drift with scene complexity (a person on a busy lobby floor compresses larger than a person on a flat concrete driveway). The JSON object is fixed-shape; its size varies by a few dozen bytes depending on dwell and confidence values.
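A quick back-of-envelope check on those averages (the eight-alerts-per-day figure is the top of the delivered band quoted later on this page; treat this as a sanity check, not a measurement):

```python
# Per-event off-property bytes, from the averages quoted on this page:
# 18 KB thumbnail + 220 KB clip + 612 B JSON.
THUMB_KB, CLIP_KB, JSON_B = 18, 220, 612

per_event_kb = THUMB_KB + CLIP_KB + JSON_B / 1024   # ~238.6 KB, i.e. "~240 KB"
daily_mb = per_event_kb * 8 / 1024                  # worst case: 8 alerts/day

print(f"{per_event_kb:.1f} KB per event, {daily_mb:.2f} MB per day")
```

Under two megabytes a day at the busy end of the band is why the system tolerates residential uplinks and cellular fallback.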
How one DVR frame becomes one 18 KB thumbnail
The diagram below traces the physical path of a single event, left to right. The sensor feeds the DVR. The DVR feeds a composite frame out over HDMI. The Cyrano unit taps that frame, runs one inference pass, and emits the three-artifact packet. There is no branch where the full frame leaves the property.
Frame path: sensor to phone, one event
The hub in the center is a literal HDMI signal on a cable between the DVR and the Cyrano unit. The four destinations on the right are the only bytes that leave the property. The 24 other tiles on the composite frame that did not fire never make it into any artifact.
The packet, as it is built on the unit
A single event on a running Cyrano unit traces through the production emit pipeline in five stages, each logged to stdout: the composite frame arrives, the mask is applied, the detection fires, the packet is assembled, and the WhatsApp delivery completes.
Why the crop is not the same as a “screenshot”
The difference between a full-frame vendor alert and a Cyrano event is not a resolution choice. It is an attention choice. Compare the two alert formats to see what lands on the on-call phone in each case.
Same detection, two different phone alerts
Push notification: 'Motion detected, Cam 12'. Image preview: a shrunken 1920x1080 screenshot of the whole scene, scaled to fit a 5-inch lockscreen thumbnail. Text caption is a timestamp. To see what triggered it, the operator has to unlock the phone, open the vendor app, tap the event, wait for the video to load, and scrub the clip until they find the person.
- Full-frame image, triggering object is a 2% subregion of the thumbnail
- No zone label, no dwell, no class printed on the image
- Caption is a timestamp and camera number, not a decision
- Requires opening an app to understand the event
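The "2% subregion" figure above is easy to reproduce. On a 5x5 multiview (grid size assumed from the 25-tile layout this page describes), the firing tile is only 4% of the frame, and the person inside the tile is a fraction of that:

```python
# Fraction of a full-frame preview occupied by the firing tile on a
# 5x5 multiview. Frame and grid dimensions are the ones quoted on this page.
frame_w, frame_h = 1920, 1080
grid = 5
tile_w, tile_h = frame_w // grid, frame_h // grid   # 384 x 216 per tile

tile_fraction = (tile_w * tile_h) / (frame_w * frame_h)
print(f"firing tile = {tile_fraction:.0%} of the preview")  # 4%
# A person occupies only part of that tile, which is where the ~2%
# subregion figure comes from.
```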
What the first-page SERP does and does not give you
Side by side with a typical top-ranking AI camera product page, this is what changes when the page is written around the delivered artifact instead of the sensor.
| Feature | Typical top-ranking AI camera page | Cyrano (phone-output) |
|---|---|---|
| What spec is headlined | Sensor resolution, lens, night-vision range, IP rating | Thumbnail size, payload bytes, end-to-end latency |
| Alert image shown | Full-frame vendor dashboard screenshot | 320x240 crop of the firing tile, zone label burned in |
| Alert channel | Dedicated vendor app (install, log in, keep open) | WhatsApp thread staff already check |
| Bytes off-property per event | Not published; typically multi-MB | ~240 KB (18 KB thumb + 220 KB clip + 612 B JSON) |
| Metadata on the event | Timestamp + camera name | zone_id, dwell_seconds, class, confidence, tile_index, layout_id, latency_ms, event_class |
| Cameras required | Vendor's cameras (rip-and-replace) | Existing cameras on existing DVR via HDMI |
| Per-camera cost | $20 to $120 per camera per month | ~$12.50 per camera per month (16-camera property) |
The checklist for a readable AI camera alert
These are the properties a push notification has to have for an on-call person to read it on a phone at 2 a.m. and act on it without opening an app. Any AI powered surveillance camera product can be evaluated against this list; using it does not require buying Cyrano.
What a phone-readable AI camera alert needs
- The image preview shows the triggering subject as the dominant object, not as a 2% region of a full-frame screenshot.
- The zone label (mailroom, gate, lot NW) is printed on the image itself, so the responder reads location without unlocking.
- The caption reads as a decision cue (class + dwell + time band), not a timestamp + camera number.
- The thumbnail is small enough (under 30 KB) to render in the push preview on cellular fallback.
- A structured metadata field exists for every event, with at minimum zone_id, dwell_seconds, class, confidence, and wall_clock.
- A tile or layout identifier is present so the integrator can diagnose whether the operator was zoomed in at the time.
- Latency from frame capture to push delivery is published and under 2 seconds.
- The alert channel is something the on-call person already checks (SMS or WhatsApp), not a dedicated app they have to remember.
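Most of that checklist is machine-checkable. A minimal validator sketch, assuming the JSON field names described earlier on this page (the function name and its exact failure strings are illustrative):

```python
REQUIRED_FIELDS = {"zone_id", "dwell_seconds", "class", "confidence", "wall_clock"}
MAX_THUMB_BYTES = 30 * 1024   # "under 30 KB" from the checklist above
MAX_LATENCY_MS = 2000         # "under 2 seconds"

def alert_failures(event: dict, thumb_bytes: int) -> list[str]:
    """Return the checklist items an alert fails; an empty list means it passes."""
    failures = []
    if not REQUIRED_FIELDS <= event.keys():
        failures.append("missing required metadata fields")
    if "tile_index" not in event and "layout_id" not in event:
        failures.append("no tile or layout identifier")
    if thumb_bytes > MAX_THUMB_BYTES:
        failures.append("thumbnail too large for a push preview")
    if event.get("latency_ms", MAX_LATENCY_MS + 1) > MAX_LATENCY_MS:
        failures.append("latency over 2 s or not published")
    return failures
```

Run it against a vendor's event records: an alert that returns an empty list is phone-readable by the definition above; the image-legibility and caption-wording items still need a human glance.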
“Frame on the DVR HDMI input to push notification on the on-call phone. 2 ms decode, 18 ms detect, 4 ms zone and packaging, then the dominant cost is the WhatsApp hop. Nothing in that budget scales with camera count; one device covers 25 cameras on one pass.”
Cyrano field latency, typical 16-camera multifamily deployment
How to audit the delivered artifact on whatever AI camera you already own
This is vendor-agnostic. If you are already running an AI powered surveillance camera system and you want to know whether it is phone-readable, these are the four things to check in under ten minutes.
1. Pull five representative push notifications
Open the on-call phone. Screenshot five recent alerts: two during daylight, two after dark, one during an active incident if you have one. You are auditing what the on-call person actually saw on their lockscreen, not what the dashboard shows.
2. Measure the image preview
For each of the five, answer: does the triggering subject fill the preview, or is it a speck inside a multi-tile view? If the subject is smaller than a key on the phone keyboard, the alert format is wrong, regardless of sensor resolution.
3. Read the caption like a decision
Does the caption tell the responder where (zone), how long (dwell), and what class (person, vehicle, package)? Or does it just say 'Motion detected, Cam 12, 23:41:17'? The first is a decision cue. The second forces the responder to unlock, open the app, and re-derive the event.
4. Inspect the structured payload
Find one event in the integrator console. Confirm the event record carries zone_id, dwell_seconds, class, confidence, wall_clock, and some kind of tile or layout identifier. If those fields are missing, the system is not deployment-ready, no matter what the sensor spec sheet says.
The reframe
An AI powered surveillance camera is not a sensor. It is a pipeline that ends in a phone push notification. The product is the push. The sensor is the feeder.
You already own the feeder: your existing DVR and cameras. What is missing is the part that turns raw detections into a readable notification at 2 a.m. on cellular. That part is what Cyrano ships, and it is why the page you are reading is about an 18 KB thumbnail instead of 4K sensor specs. The thumbnail is the thing the operator sees. The thing the operator sees is the product. The rest is upstream plumbing.
Where the upstream sensor still matters (and where it does not)
The sensor is not irrelevant, but its relevance caps out below what the spec sheet implies. The deciding question for any sensor property is whether it actually changes the delivered 18 KB thumbnail on the phone.
Rule of thumb: if the sensor property changes what pixels the DVR sees, it matters. If it changes how the camera’s own app renders a notification, it does not, because the alert is delivered from the Cyrano unit, not from the camera.
See the 18 KB thumbnail on a real property
Twenty minutes. A live DVR, a live Cyrano unit, real push notifications landing on a phone during the call.
Book a call →

Frequently asked questions
Why is the thumbnail size the most important spec of an AI powered surveillance camera?
Because it is the only frame a human actually looks at. An AI surveillance camera produces one output that matters: a push notification on an on-call phone with one image that the operator glances at while walking, driving, or at dinner. If that image is a full-frame 4K screenshot of a 25-tile DVR multiview, the tile that fired is one twenty-fifth of the picture, unreadable on a phone lockscreen. If it is a tightly cropped 320x240 ~18 KB JPEG of just the tile that fired with the zone label printed, the operator can make a decision in the 2 seconds it takes to peek at the phone. Sensor megapixels upstream do not change that. The crop does.
What exactly leaves the property per detection with Cyrano, and how much is that?
Exactly three artifacts, ~240 KB total. One 320x240 JPEG thumbnail crop of the tile that fired, roughly 18 KB. One six-second clip of that same tile around the detection, roughly 220 KB. One JSON metadata object of about 612 bytes containing zone_id, dwell_seconds, wall_clock, class (person, vehicle, package), confidence, tile_index, layout_id, latency_ms, and event_class (for example pre_action_zone_entry or loitering_dwell_exceeded). Nothing else leaves: no full 1080p multiview frame, no raw video of the 24 other tiles, no continuous upload, no face embeddings, no plate strings. The event packet is the product boundary.
What is the end-to-end latency from camera frame to phone buzz?
About 1.1 seconds in typical deployments. The budget breaks down roughly as 2 ms to decode the composite HDMI frame to RGB, 18 ms to run the one detection pass across all 25 tiles, 4 ms to apply per-tile zone rules and build the event packet, then the rest is network RTT to the cloud control plane and WhatsApp push delivery. The dominant cost is the last hop, not the model. If you are shopping AI surveillance cameras and someone quotes you 5 seconds latency, they are probably uploading full frames and running inference in a vendor cloud.
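The budget in that answer sums as follows (the three on-device figures are the ones quoted above; the remainder attributed to the network hop is inferred from the ~1.1 s end-to-end figure):

```python
# On-device latency budget, per the figures quoted in this answer.
budget_ms = {
    "hdmi_decode": 2,         # composite HDMI frame to RGB
    "detect_pass": 18,        # one inference pass across all 25 tiles
    "zone_and_packaging": 4,  # per-tile zone rules + packet assembly
}
on_device_ms = sum(budget_ms.values())   # 24 ms
network_ms = 1100 - on_device_ms         # remainder of the ~1.1 s end to end

print(f"on-device: {on_device_ms} ms, network + push: ~{network_ms} ms")
```

The split makes the claim concrete: the model accounts for about 2% of the end-to-end latency; the delivery hop accounts for the rest.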
Why a crop of one tile and not the full multiview frame?
Three reasons. First, attention. On a phone lockscreen, a 320x240 crop of the firing tile fills the entire preview area; a 1080p multiview shrinks that tile to an unreadable 216x216 thumbnail. Second, bandwidth. Residential and small-business internet has upload caps; shipping the full 1080p frame per event adds up to tens of megabytes per day and gets rate-limited on cellular fallback. Third, privacy. The 24 tiles that did not fire may be showing a neighbor's hallway, a resident's door, or the inside of the mailroom. There is no reason to transmit those pixels off-property just because a different tile triggered. The crop keeps the event-of-interest and leaves the rest in the building.
What fields are in the 612-byte JSON, and why do they matter?
zone_id (which named rectangle was crossed), dwell_seconds (how long the subject stayed in the zone before the event fired), wall_clock (the DVR's time), class (person, vehicle, package), confidence (the model score), tile_index (which of the 25 tiles), layout_id (which DVR multiview configuration was active, for example 5x5-std or 1x1-std after a fullscreen zoom), latency_ms (frame capture to message dispatch), and event_class (pre_action_zone_entry, loitering_dwell_exceeded, motion_post_curfew). These fields are what a responder uses to decide whether to ignore, talk down over the intercom, call, or escalate; and what an integrator uses to diagnose regressions. A system that delivers a picture with no metadata forces the human to re-interpret the scene every time, which is why most AI camera alert streams get muted within two weeks.
Does this work with my existing cameras, or do I have to buy new ones?
Existing. Cyrano is not a camera; it is a small edge device that plugs into the HDMI output of your existing DVR or NVR and runs inference on the composite multiview frame the DVR already draws for the guard monitor. That means the sensor you bought in 2018 (analog BNC, PoE IP, whatever) becomes an AI powered surveillance camera without any firmware change, credential setup, or network reconfiguration. One HDMI cable in, one HDMI cable out to the guard monitor, ethernet, power. Install takes under two minutes on a running DVR. The thumbnail that ends up on the on-call phone is cropped from the DVR tile of your existing camera.
How does this compare to a per-camera AI cloud product like Verkada or Rhombus?
The pricing shape is different because the output shape is different. Per-camera cloud products ship full camera frames (or heavily compressed streams) to the vendor's cloud continuously, pay for ingest, storage, GPU inference, and egress per camera, and render alerts with full-frame context. That model prices at $20 to $120 per camera per month. Cyrano processes the composite multiview locally, ships only the ~240 KB event packet off-property, and a single edge device covers up to 25 cameras off one HDMI input. At a 16-camera property on Cyrano's $200 per month per property price, the per-camera math is roughly $12.50. The delivered artifact on the phone is shaped to a phone, not to a cloud dashboard.
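The per-camera figure in that answer is straight division, using the prices quoted above:

```python
# Per-camera monthly cost at the quoted per-property price.
property_price = 200.0   # Cyrano, $/month per property
cameras = 16             # camera count of the example property
per_camera = property_price / cameras
print(per_camera)        # 12.5 -> the "~$12.50 per camera" figure
```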
How many detected events does a typical property produce per day, and how many alerts should reach the phone?
On a 16-camera multifamily property, a typical day produces 200 to 300 raw person-detections from residents, vendors, Amazon drivers, and passersby. A deployment-ready AI powered surveillance camera setup compresses that to 3 to 8 delivered alerts per day, a roughly 50-to-1 ratio. Below 25-to-1 and the channel gets muted in two weeks because the alert stream is noise. Above 200-to-1 and the system is dropping real incidents. Target a compression band of 40 to 100 raw events per delivered alert. The compression is done by zone rules, dwell thresholds, and event classes applied to the detection stream, not by a secondary model. The 3 to 8 that make it through are the events the phone shows a thumbnail for.
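The ratio arithmetic in that answer, as a check (the band midpoints chosen here are illustrative; the endpoints are the ones quoted):

```python
# Alert-compression arithmetic from the figures quoted in this answer.
raw_detections = 250   # mid of the 200-300 daily band, 16-camera property
delivered = 5          # mid of the 3-to-8 delivered band

ratio = raw_detections / delivered
print(f"{ratio:.0f}-to-1 compression")   # 50-to-1, inside the 40-100 target band
assert 40 <= ratio <= 100, "outside the healthy compression band"
```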
What happens when the operator zooms the DVR to fullscreen on one camera during an active incident?
The adapter re-scopes. Every event payload carries a layout_id field; when the DVR operator double-clicks a tile on the guard monitor to zoom, the HDMI signal changes from the grid layout (for example 5x5-std) to a single-tile layout (1x1-std) at full resolution. The Cyrano unit detects the layout change, re-runs the zone-mask calibration for the new layout, and the thumbnail that gets delivered to the phone during fullscreen is now a 320x240 crop of the full-resolution single-camera frame. The layout_id in the JSON payload records which mode was active, so the integrator can tell from the event log whether the operator was zoomed in or not. The phone output stays the same size; only the source pixels changed.
Why WhatsApp instead of a dedicated app?
Because responders already have WhatsApp open. The cost of an AI surveillance camera alert system is not the detection, it is the probability that the alert gets read within 30 seconds. A dedicated vendor app is a notification channel the on-call person has to remember to install, keep logged in, and not snooze. A WhatsApp thread with on-call staff is a channel they already check every few minutes. The ~240 KB event packet is formatted to render as a single push notification with an inline image preview, a caption that reads camera-name / zone-label / dwell-seconds / class, and a link to the six-second clip. The thumbnail renders in the push preview; the operator does not need to open the app to triage most events.