Cyrano Security
Measured at the byte, not debated at the policy

AI and surveillance is a debate about capability. It should be a debate about the byte budget.

The top results for “AI and surveillance” frame the category at the level of policy: what the technology can do, who it might target, what norms should constrain it. That frame is useful for a regulator. It is the wrong frame for a property manager picking between two vendors who both say the word “AI.” The right frame is smaller and concrete. What exactly leaves the property, what never does, and how big is the payload per event. This page publishes the byte budget for one deployed AI surveillance pipeline.

See the per-event payload on a live system
4.9 from 50+ properties
Per-event egress: about 240 KB (thumbnail + clip + metadata)
Inference: on-device, on the composite HDMI multiview frame
Biometric index: none (no face, plate, or gait vectors)
Frames at rest off-property: zero

The policy literature is about what the system could do. The byte budget is about what the system actually emits.

Search “AI and surveillance” and the first page of results is almost entirely Wikipedia, Brookings, the ACLU, the Algorithmic Justice League, Carnegie, and Deloitte. All of them frame the category as a capability question. Can the system recognize faces. Can it predict behavior. Can it enable authoritarian monitoring at scale. These are valid questions at the level of a law, a norm, or an international treaty. They are the wrong level for a property manager or a small business owner deciding whether to put AI in front of the cameras they already own.

At the property level the question is not “can AI watch cameras.” It can, obviously. The question is what exactly leaves the property when it does. Two AI surveillance systems can have identical marketing pages (package detection, loitering alerts, real-time dashboards, analytics) and emit byte budgets two to three orders of magnitude apart. One ships every pixel to a cloud region. The other ships a 240 KB event payload and discards the source frame on the device. The first is mass surveillance by construction. The second cannot be, because the bytes required are never sent.
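That gap can be sized with back-of-envelope arithmetic. A minimal sketch, assuming the figures quoted elsewhere on this page (4 Mbps continuous upload per cloud camera, ~240 KB per event and ~50 events per day on the edge system):

```python
# Back-of-envelope egress comparison, using this page's example figures.

SECONDS_PER_DAY = 86_400

# Cloud-AI system: every camera streams continuously off-property.
cloud_mbps_per_camera = 4                       # 4 Mbps continuous upload
cloud_bytes_per_day = cloud_mbps_per_camera * 1_000_000 / 8 * SECONDS_PER_DAY
print(f"cloud, per camera: {cloud_bytes_per_day / 1e9:.1f} GB/day")   # ~43 GB/day

# Edge-AI system: only event payloads leave, regardless of camera count.
event_payload_bytes = 240 * 1024                # ~240 KB per event
events_per_day = 50
edge_bytes_per_day = event_payload_bytes * events_per_day
print(f"edge, whole property: {edge_bytes_per_day / 1e6:.1f} MB/day")  # ~12 MB/day

ratio = cloud_bytes_per_day / edge_bytes_per_day
print(f"one cloud camera vs whole edge property: ~{ratio:,.0f}x")
```

Multiply the cloud figure by camera count to size a whole property's uplink; the edge figure does not grow with cameras, only with events.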

The byte budget is the part of the system a privacy officer, a tenant, or a property lawyer can actually audit. Features live in marketing prose. Bytes live on the wire.

~240 KB: Off-property payload per event
~1.1 s: Detection to operator notification
0: Biometric vectors uploaded
25: Camera tiles on one composite frame

The event payload, enumerated

This is the exact shape of the data that crosses the property boundary when the pipeline fires. Three files, three purposes, one JSON envelope. Nothing else leaves. The composite source frame is overwritten on-device on the next tick; it is never uploaded or retained off-property.
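The JSON envelope can be sketched concretely. Field names below are hypothetical, but the contents match the metadata enumerated on this page: event id, property id, timestamp, detection class, bounding box, confidence, and the tile index on the composite multiview frame.

```python
import json
from datetime import datetime, timezone

# Illustrative event envelope; field names are assumptions, not the
# deployed schema. The attachments travel as separate files; only their
# sizes are recorded here.
event = {
    "event_id": "evt_00001",          # per-property id, no cross-site namespace
    "property_id": "prop_ftw_214",
    "ts": datetime.now(timezone.utc).isoformat(),
    "class": "person",                # class-plus-zone, not person-plus-history
    "zone": "trash_area",
    "bbox": [412, 96, 488, 231],      # pixels on the composite frame
    "confidence": 0.91,
    "tile": 14,                       # which of the 25 camera tiles fired
    "attachments": {
        "thumbnail_jpeg_bytes": 18_432,   # ~18 KB still
        "clip_h264_bytes": 225_280,       # ~220 KB, ~6 s window
    },
}

envelope = json.dumps(event).encode("utf-8")
print(f"JSON envelope: {len(envelope)} bytes")  # a few hundred bytes; ~612 B in the deployed shape
```

Summing the three parts (thumbnail + clip + envelope) lands under 250 KB, which is the whole off-property footprint of one detection.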

Per-event egress, measured on a running deployment

How the pipeline moves pixels (and where it stops)

Every AI surveillance system has to answer one question about pixels: where does inference run. On a cloud-AI system, pixels leave the property, cross an encrypted link to a cloud region, and inference runs on a shared GPU there. Every frame is a byte on the wire. On an edge-AI system, pixels stop at the device that is physically plugged into the DVR, inference runs locally, and the only thing that ever leaves is a small payload per detected event.

Where pixels live in an AI surveillance pipeline

Analog cameras / IP cameras / Wireless cameras → DVR HDMI out → On-device inference → Event thumbnail + Event clip + Event metadata

Six things the pipeline does not compute, by design

The civil-liberties literature on AI and surveillance spends most of its word count on six specific capabilities. Each of them requires bytes that this pipeline never produces. The decisions are architectural, not rhetorical. A capability that has no backing vector cannot be turned on by a future product update.

Face matching against a gallery

Requires a face embedding vector per detected face, stored and compared against an enterprise gallery. The pipeline does not produce face embeddings. Source frames are discarded on-device before any embedding model could read them.

License plate OCR index

Requires a plate text extraction per vehicle, indexed across time and sites. The pipeline does not run OCR on plates. Vehicle-class detection fires as a zone event; the plate pixels are not extracted into a searchable index.

Gait and pose re-identification

Requires a gait or pose signature vector per person across frames and sites, to re-identify the same person across cameras. The pipeline does not build cross-frame identity vectors.

Cross-site correlation

Requires a shared id namespace across properties so the same individual or vehicle can be tracked from one site to another. The pipeline issues per-property event ids only; no shared identity exists across properties.

Indefinite off-property retention

Requires continuous frame upload and cloud storage of all camera feeds over weeks or months. The pipeline never uploads source frames. Only event thumbnails, clips, and metadata are retained, in the operator's own channel.

Real-time behavior prediction

Requires a longitudinal behavior model across identified individuals. Without identity vectors (see above), the pipeline does not and cannot build one. Detection is class-plus-zone, not person-plus-history.
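The "no backing vector" guarantee can also be enforced mechanically at the egress boundary. A sketch of an allowlist gate, with hypothetical field names: any field outside the published payload schema, such as a face embedding added by a future model update, is rejected before it can leave the device.

```python
# Egress-side allowlist sketch. Field names are illustrative; the idea is
# that the payload schema is closed, so a new capability cannot silently
# widen the byte budget.

ALLOWED_FIELDS = {"event_id", "property_id", "ts", "class", "zone",
                  "bbox", "confidence", "tile", "attachments"}

def gate_egress(payload: dict) -> dict:
    unexpected = set(payload) - ALLOWED_FIELDS
    if unexpected:
        raise ValueError(f"blocked egress of non-allowlisted fields: {unexpected}")
    return payload

ok = gate_egress({"event_id": "evt_1", "class": "person", "tile": 3})
try:
    gate_egress({"event_id": "evt_2", "face_embedding": [0.12] * 512})
except ValueError as e:
    print("blocked:", e)
```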

Byte budget, side by side

A typical cloud-AI overlay vendor against the edge-AI adapter shape, on a 25-camera property. Both are sold under the category “AI and surveillance.” The byte budgets differ by roughly four orders of magnitude.

AI surveillance: byte budget, retention, and biometric surface

Identical marketing categories. Very different data contracts.

Feature | Typical cloud AI overlay vendor | Cyrano (edge AI on HDMI tap)
Continuous outbound bandwidth per camera | 2 to 6 Mbps, 24 by 7 | 0 (no stream upload)
Daily outbound bytes, 25-camera property | ~40 GB at 4 Mbps per cam | ~12 MB at 50 events per day
Per-event payload shape | alert + cloud clip link (source retained) | thumbnail + 6 s clip + JSON (~240 KB)
Frames at rest off-property | all, for the retention tier | none
Face embedding index built | often, as a feature | no
License plate OCR index built | often, as a feature | no
Cross-site identity vector | vendor-defined | no
Subpoena surface on vendor | full frame archive | event thumbnails and metadata only
Detection to operator latency | ~1 to 3 s | ~1.1 s
Per-camera cloud subscription required | yes | no

Audit your own AI surveillance system in five steps

Run this before the vendor call or the budget meeting. The goal is to turn “AI and surveillance” from a category word into a measurable byte contract for your property.

The byte-budget audit


Step 1. Ask the vendor for the per-event payload schema

In bytes. Thumbnail size, clip duration and codec, metadata fields, any other attachments. A vendor that cannot quantify their own payload has not measured it. Budget-check this against the simple Cyrano shape: ~18 KB thumbnail plus ~220 KB clip plus ~612 B JSON.


Step 2. Ask for the per-camera continuous upload rate

In bits per second, averaged and peak. This is the number that decides whether source frames leave the property. Zero on an edge-AI system. Two to six Mbps per camera on a cloud-AI system. Multiply by camera count to size the uplink hit.


Step 3. Read the DPA for five specific words

Biometric, embedding, gallery, retention, correlation. The presence of any of them describes a capability the byte budget must support. The absence is informative too: a vendor that cannot promise the absence in writing has not committed to it.


Step 4. Wireshark the device for one hour

Capture the outbound bytes on the device's uplink, grouped by destination. An edge-AI device shows small event-shaped bursts. A cloud-AI device shows a steady stream. The graph is unambiguous.


Step 5. Ask whether frames at rest can be pulled later

If the vendor can retrieve a frame from two weeks ago that was not tied to an event, the byte budget is effectively continuous even if alerts look small. If no frames at rest exist off-property, the budget is event-bounded and the surveillance surface is small.
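Step 4 reduces to a shape test on the capture. A sketch on synthetic per-second outbound byte counts (assuming you have already exported totals from your capture tool, e.g. a tshark summary); the thresholds are illustrative:

```python
import statistics

# Classify an uplink capture as "continuous stream" vs "event bursts"
# from per-second outbound byte counts. Thresholds are illustrative.

def classify_uplink(bytes_per_second: list[int]) -> str:
    mean = statistics.mean(bytes_per_second)
    idle = sum(1 for b in bytes_per_second if b < 1_000) / len(bytes_per_second)
    if idle > 0.9:
        return "event-shaped (edge AI): mostly idle, rare bursts"
    if mean > 100_000:
        return "continuous stream (cloud AI): steady upload"
    return "inconclusive: inspect destinations"

# Synthetic hour: an edge device is silent except two ~240 KB event bursts.
edge_hour = [0] * 3600
edge_hour[600] = edge_hour[2400] = 240 * 1024

# A cloud device pushes ~500 KB/s (one 4 Mbps camera) every second.
cloud_hour = [500_000] * 3600

print(classify_uplink(edge_hour))
print(classify_uplink(cloud_hour))
```

As the audit says, the graph is unambiguous: the edge trace is flat with isolated spikes, the cloud trace is a plateau.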

~240 KB

On a 25-camera Class C multifamily property in Fort Worth, the full off-property data emission from Cyrano over the first 30 days was the per-event payload for 20 incidents plus health-check metadata. No continuous camera stream was uploaded. The operator WhatsApp channel was the only egress path, and every event in it was an actionable detection, not a frame dump.

Cyrano deployment, Fort Worth, TX

What leaves, what stays

The same list in chip form, for the skim. This is the entire public interface of the pipeline.

Leaves: event thumbnail (~18 KB)
Leaves: 6 s event clip (~220 KB)
Leaves: event JSON (~612 B)
Leaves: device health heartbeat
Stays: composite multiview frame
Stays: per-camera raw feed
Stays: RTSP session (never opened)
Stays: ONVIF discovery (never run)
Stays: face embedding (never computed)
Stays: plate OCR (never computed)
Stays: gait signature (never computed)
Stays: cross-site identity (never issued)

Three things the policy frame gets wrong for property buyers

Not wrong as policy. Wrong as a decision aid for a single property. These are the failure modes we hear from operators who tried to reason about AI surveillance from the policy frame and ended up locked into a system whose byte budget looked nothing like what they thought they bought.

Miss 1

Capability absence is not a design choice

A vendor that does not currently offer face matching can add it in a firmware update, if the pipeline is already uploading the frames. “Does not have that feature” is not the same statement as “cannot compute it because the bytes are not present.” Only the second is enforceable.

Miss 2

Marketing parity hides byte-budget divergence

Two vendors listed side by side with “AI analytics, loitering alerts, real-time dashboard” can emit byte budgets of 240 KB per event and 40 GB per day respectively. The marketing page does not distinguish them. The data contract does.

Miss 3

Retention policy without byte accounting is prose

A retention policy of “30 days” is not a promise about the byte budget. The system can retain events for a week and still have uploaded every source frame to a vendor cloud. The right question is not how long frames are kept; it is whether frames ever left the property.

See the 240 KB payload delivered live on a 15-minute demo.

We walk through the HDMI tap on a running DVR, fire a real detection, and deliver the event thumbnail, 6-second clip, and JSON metadata to a WhatsApp you control. You see the exact bytes that crossed the property boundary. No frames leave the device. Nothing is retained in our cloud.

Book the byte-budget demo

When the small-byte budget is the wrong pick

A 240 KB per-event budget is a constraint. It is the wrong pick when the use case actually requires identity. A casino that has to match faces against a banned list, a retailer running a loss prevention program across a national store count with centralized analytics, a law enforcement agency with a valid warrant for continuous surveillance: all three require bytes this pipeline does not emit. For those buyers, a first-party cloud AI vendor is the right shape, and the byte budget is accepted along with the capability.

For the rest of the mid-market (multifamily, construction, small commercial, HOA, single-tenant office) the detections that matter are class-plus-zone. Person in restricted area, package left unattended, tailgating, loitering, after-hours presence. Every one of those fires from the composite HDMI frame, inside a 240 KB event payload, in about 1.1 seconds. The byte budget is not a limitation here; it is the product.
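Each of those detections is a class label, a zone test, and a schedule, with no identity and no history. A minimal sketch, with hypothetical zone names and coordinates:

```python
from datetime import time

# Minimal class-plus-zone rule. A detection fires when a class label
# lands inside a zone rectangle during the armed window. Zone geometry
# and names are illustrative.

ZONES = {
    "pool": {"rect": (100, 50, 300, 200), "armed": (time(22, 0), time(6, 0))},
}

def in_rect(cx: int, cy: int, rect: tuple) -> bool:
    x1, y1, x2, y2 = rect
    return x1 <= cx <= x2 and y1 <= cy <= y2

def armed(now: time, window: tuple) -> bool:
    start, end = window
    # The window may wrap past midnight (e.g. 22:00 to 06:00).
    return (start <= now or now <= end) if start > end else (start <= now <= end)

def fires(cls: str, bbox: tuple, now: time, zone: str) -> bool:
    z = ZONES[zone]
    cx, cy = (bbox[0] + bbox[2]) // 2, (bbox[1] + bbox[3]) // 2
    return cls == "person" and in_rect(cx, cy, z["rect"]) and armed(now, z["armed"])

print(fires("person", (150, 80, 200, 160), time(23, 30), "pool"))  # True: pool after hours
print(fires("person", (150, 80, 200, 160), time(14, 0), "pool"))   # False: daytime
```

Nothing in the rule needs to know who the person is, which is exactly why it can run inside a 240 KB event budget.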

Frequently asked questions

What is the byte budget of an AI surveillance system, and why is it the thing that decides whether it is mass surveillance?

The byte budget is the concrete volume and content of data that leaves the watched property per unit time. It is the only part of an AI surveillance system that is actually visible from outside the property, and it is what decides whether a system can be subpoenaed, correlated across sites, retained by a cloud vendor, or leaked in a breach. A system that uploads every camera's continuous stream to a cloud region at 2 to 6 Mbps per camera has a byte budget measured in tens of gigabytes per day and retains frames indefinitely. A system that emits only a per-event thumbnail, a short clip, and a JSON metadata blob has a byte budget measured in hundreds of kilobytes per event and retains nothing at rest off-property. The first is mass surveillance by construction. The second cannot be, because the bytes required to do mass surveillance never left the building. Features do not decide this. The byte budget does.

What is the actual per-event payload that Cyrano sends off the property?

Approximately 240 KB, in three parts. One JPEG event thumbnail at roughly 18 KB (a single still frame captured at the moment of detection, labeled with the tile coordinate it came from). One H.264 event clip of roughly 6 seconds at roughly 220 KB (the window around the detection, not continuous footage). One JSON metadata blob of roughly 612 bytes (event id, property id, timestamp, detection class, bounding box, confidence, tile index). That payload lands in an operator's WhatsApp or the Cyrano dashboard. No frames before or after the event window leave the device. No full-resolution multiview frame leaves the device. No biometric embedding of any person or vehicle in that frame leaves the device. The 240 KB is the entire off-property footprint of one detection.

What does the Cyrano pipeline specifically refuse to compute or send?

Five things are out of scope by design. One, face embeddings against an enterprise gallery (no face-matching index is built or uploaded). Two, license plate OCR vectors (plate text is not extracted into a searchable index). Three, gait or pose signatures across time (no re-identification vectors are produced). Four, cross-site correlation identifiers (no shared person id is issued across properties). Five, indefinite retention of frames at rest off-property (no full-frame archive is kept in a cloud region). The detections the pipeline does run (person present, person in a restricted zone, loiterer, package left unattended, tailgating at an entry, after-hours presence) are class and zone rules on the composite HDMI multiview frame. These run, they trigger the 240 KB event payload, and the source frames are discarded on-device.

How does an edge-inference AI surveillance system actually produce a smaller byte budget than a cloud-AI system?

Inference location decides where pixels live. A cloud-AI system opens an RTSP or ONVIF session per camera, uploads every frame continuously to a cloud region, and runs inference on the cloud. Every pixel crosses the property boundary, and every frame is at rest somewhere in the vendor's storage. An edge-AI system runs inference on-device, on-property, on the composite HDMI frame the DVR already renders. The source pixels never cross the property boundary. Only the small, human-readable derivatives of an actual detection (a still, a short clip, a metadata blob) cross the boundary. The byte budget is smaller because inference is local. That is the single architectural choice that converts the shape of the system from mass surveillance to event detection.

How can a property manager or privacy officer verify an AI surveillance system's byte budget?

Five checks. One, ask the vendor to publish the per-event payload schema in bytes and the per-frame upload rate in bits per second. A vendor that cannot quantify both has not measured its own byte budget. Two, measure the upstream bandwidth on the property's WAN before and after install; an edge-inference system should not move the baseline by more than the event rate times the event size. Three, read the vendor's data processing agreement for the words biometric, embedding, gallery, retention, and correlation; these are the words that separate event detection from mass surveillance. Four, run a Wireshark capture on the device's uplink for one hour and sum the outbound bytes; compare against the vendor's claimed budget. Five, ask whether the frames at rest in the cloud can be retrieved on demand. If yes, the byte budget is effectively continuous. If the vendor says no frames are retained, the budget is event-bounded.

Why does the civil-liberties literature not discuss byte budgets?

Because the literature on AI and surveillance is mostly produced by policy researchers, not by people who install the systems. Brookings, the ACLU, the AJL, Carnegie, and the Decision Lab frame the debate at the level of capability (can the system recognize faces, can it predict behavior, can it target a population). That is the right level for a law or a norm. It is the wrong level for a property-scale procurement decision, because every vendor in the AI surveillance category has roughly the same capabilities on paper, and what actually separates them is whether they exfiltrate the source pixels that a capability would need. The byte budget makes the abstract capability argument concrete. A system that does not upload face pixels cannot match faces, regardless of what its marketing page claims.

Is an AI surveillance system with a 240 KB per-event byte budget still useful for a property manager?

Yes, for the detections that a mid-market multifamily, construction, or small commercial property actually needs. Those are class and zone detections, not identity detections. Was there a person in the trash area at 3 AM, was a package left unattended for 40 minutes, did someone tailgate through the gate, was the pool occupied after hours, is the construction site awake when it should be asleep. All of those are answerable from the composite multiview frame with class detection plus a zone rule, and all of them trigger a 240 KB event payload that is useful to the operator in real time. The detections that require a full-identity pipeline (matching a specific individual by face across thirty sites, building a retail customer profile from gait) are out of scope, because they are the detections that require mass surveillance bytes.

Where can I verify the byte budget figures on Cyrano?

The apartment-security-cameras.com/llms.txt spec publishes the hardware and software contract, including the edge-AI data path. The install write-up on /t/ai-surveillance-company and /t/ai-surveillance-companies shows the five-cable install sequence on a running DVR and identifies the exact uplink path used (ethernet on the device's LAN port, for the control plane and event egress only). The public demo at cal.com/cyranohq/s4l-demo walks through a live event delivery end to end, from composite frame to WhatsApp message, in roughly 1.1 seconds of latency. The event payload is visible in the delivered notification itself (thumbnail plus short clip plus metadata); there is no hidden second channel.

How does latency interact with the byte budget on an edge-AI surveillance system?

Latency and byte budget trade off against each other on cloud-AI systems (more frames uploaded faster means more bytes leaving the property) and are independent on edge-AI systems. Cyrano's end-to-end detection to operator notification latency is roughly 1.1 seconds, because inference runs on-device and only the 240 KB event payload needs to egress. That is measured from the moment a person enters a zone to the moment a WhatsApp notification buzzes on an operator's phone. A cloud-AI system that moves frames continuously also achieves low alert latency, but at a byte budget two to three orders of magnitude larger, because the inference happens on the other side of the upload. Edge inference is the architecture that produces both low latency and a small byte budget at the same time.

The one rule that makes AI surveillance legible

Before you argue about AI and surveillance as a category, measure the byte budget of the specific system you are considering. One number settles most of the argument. If source frames leave the property, the capability surface is effectively unbounded, no matter what the marketing page says the vendor does not do. If source frames never leave, the capability surface is bounded by the derivatives that do (a thumbnail, a clip, a metadata blob), and every capability that requires more bytes than those is architecturally off the table.

Cyrano publishes its byte budget because the budget is the product. A $450 device, a $200 monthly subscription, up to 25 camera tiles on one HDMI port, on-device inference, and about 240 KB off the property per event. If that contract fits your threat model, the rest of the AI surveillance conversation is short.

🛡️CyranoEdge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
