Cyrano Security
13 min read
The part of surveillance AI the SERP never documents

Surveillance AI is not a detection problem. It is a suppression problem.

Every top-ranking article on surveillance AI describes the capability ceiling of the detector: what it can classify, how many frames per second it can run, which objects it recognizes. None of them describe the harder half of a deployed system, which is the four-stage suppression stack that sits between the detector and a human on call. On Cyrano's edge device, a raw detection has to survive a zone polygon, a dwell threshold, an active-hour schedule, and a three-tier threat classifier before it earns a 240 KB WhatsApp notification. In a typical day on a 16-camera property, the detector evaluates billions of tile-frames and the suppression stack lets four through. That compression ratio, not the detector, is the actual product.

See the suppression stack running on a live DVR feed
4.9 rating from 50+ properties
Four filter stages between detection and notification: zone, dwell, active-hours, threat-tier
Only HIGH-tier events fire a WhatsApp delivery; LOW is discarded, MED is retained locally
Delivery payload: ~240 KB per event (612-byte JSON schema + one thumbnail)
All four suppression stages run on the device; no raw frames leave the property
Typical compression: billions of tile-frames per day to a single-digit number of notifications

The four-stage suppression stack, in order


Stage 0: the detector sees everything

A 16-camera DVR paints a 1920x1080 composite at 30fps. Cyrano slices that into 16 tiles, 480x270 each, and runs an object/behavior detector on every tile every frame. That is 28,800 tile-frames of inference opportunity per second. The detector is honest about what it sees. Most of what it sees is normal life on the property.
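The tile geometry above is integer division over the composite. A minimal sketch of the slicing, assuming a square grid with one camera per cell; the function name and crop representation are illustrative, not Cyrano's API:

```python
def tile_rects(width: int, height: int, grid: int):
    """Split a composite frame into grid x grid tile rectangles.

    Returns (x, y, w, h) crop boxes in raster order, matching a DVR
    composite that paints one camera per cell.
    """
    tw, th = width // grid, height // grid
    return [(c * tw, r * th, tw, th)
            for r in range(grid)
            for c in range(grid)]

# 16-camera case: 4x4 grid of 480x270 tiles
tiles16 = tile_rects(1920, 1080, 4)
# 25-camera case mentioned later on this page: 5x5 grid of 384x216 tiles
tiles25 = tile_rects(1920, 1080, 5)
```

Each rectangle is then fed to the detector independently, which is why the rest of the stack can reason per tile.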

What the SERP documents, and what it does not

Run the query “surveillance ai” and open the top ten results. You will see four categories of content. Category one, market-size reports: the global AI surveillance market sits at roughly 6.8 billion dollars in 2026 and is projected to roughly double. Category two, policy and civil-liberties coverage: Anthropic's stated red line against AI-enabled mass surveillance, the Pentagon standoff, ICE data-broker purchases. Category three, capability roundups: vendors explaining that AI can detect weapons, identify faces, track crowds, count people, flag suspicious behavior. Category four, smart-home subscription trends: 2026 cameras that do local inference without a monthly fee.

What is missing from all four categories is the operational spec of a live system. Not what the model can do in isolation, but what reaches the on-call phone per property per day, and how the system gets from tens of thousands of raw detections to a human-readable number. That pipeline is the suppression stack. It is where deployed surveillance AI lives. It is also where the difference between a usable product and a muted alert thread is decided.

The rest of this page is the spec of Cyrano's suppression stack, stage by stage, with the config fields, the boot log, the event envelope, and a day of traced output. Nothing in this page is about what the detector can see. It is all about what the system chooses not to escalate.


At one Class C multifamily property in Fort Worth, a 16-camera deployment delivered 20 notifications in its first month and caught a break-in attempt among them. The detector evaluated on the order of forty billion tile-frames over the same month. That compression ratio is the product.

Fort Worth, TX property deployment, first month

The pipeline, as a single flow

Each filter operates on the surviving output of the previous one. Everything left of “deliver” stays on the device. The only bytes that leave the property are the envelopes that exit the last stage.

Detector to WhatsApp: four suppression stages

1. Detector: runs on every tile, every frame. 28,800 tile-frames/sec in the 16-camera case.

2. Stage 1, zone polygon: drops detections outside any drawn polygon on the tile. Typically removes ~99% of activity.

3. Stage 2, dwell seconds: drops detections that do not persist past the zone's dwell threshold. Removes transit.

4. Stage 3, active hours: drops detections outside the zone's scheduled window. A zone can be permissive by day, strict by night.

5. Stage 4, threat tier: classifier head outputs LOW / MED / HIGH. Only HIGH survives.

6. Deliver: ~240 KB envelope posted to the WhatsApp on-call thread. The only bytes that leave the building.

The anchor fact: the four filter fields that define a deployed surveillance AI

Every Cyrano unit runs a binary-identical detector. The suppression stack varies per property, and that variance is entirely captured by four fields in the per-property config file. Below is the real shape of those fields on one tile of a Fort Worth multifamily deployment. The fields are inspectable by the operator and portable across device swaps.

per-property-config.json (extract)

Tile 7 is the rear dumpster bay. Its single zone is a 94x108 pixel polygon on a 480x270 tile, dwell 4 seconds, active 22:00-06:00, weight 0.85. Tile 3 is the lobby entrance: no zones, fully permissive. Tile 11 is the pool gate: after-hours restricted. A detection has to match one of those zones, exceed its dwell, fall in its active hours, and classify HIGH to leave the device. Everything else on the property is the detector doing its job and the suppression stack doing its job.
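The extract above can be sketched as a plain data structure. The field names follow the ones quoted elsewhere on this page (tiles[].zones, zones[].dwell_sec, zones[].active_hours); the full schema, the polygon coordinates, and the property identifier are illustrative assumptions, not Cyrano's shipped file:

```python
# Illustrative extract only -- not the real per-property-config.json.
PER_PROPERTY_CONFIG = {
    "property_id": "ftw-demo",          # hypothetical identifier
    "tiles": [
        {
            "tile": 7,                  # rear dumpster bay
            "zones": [{
                "label": "dumpster_alcove",
                # hand-drawn polygon, pixel coords on the tile
                # (bounding box 94 x 108 px, as described above)
                "polygon": [[140, 60], [234, 60], [234, 168], [140, 168]],
                "dwell_sec": 4,
                "active_hours": "22:00-06:00",
                "weight": 0.85,
            }],
        },
        {"tile": 3, "zones": []},       # lobby entrance: fully permissive
        {
            "tile": 11,                 # pool gate: after-hours restricted
            "zones": [{
                "label": "pool_gate",
                "polygon": [[0, 0], [120, 0], [120, 90], [0, 90]],
                "dwell_sec": 6,
                "active_hours": "22:00-06:00",
                "weight": 0.6,
            }],
        },
    ],
}
```

The point of the shape is that every suppression knob is an inspectable field, so the operator can diff two properties or two versions of the same property.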

The four filters, one card each

Every suppression stage exists to solve a specific failure mode of naive surveillance AI. Read the failure mode backwards: if you stripped the stage out of the stack, the thread would drown in the thing that stage suppresses.

Zone polygon

Spatial filter. A hand-drawn polygon on each tile that marks where the system should be strict. Most tiles have zero zones because most camera feeds look at public space. Stripping this stage out means every person walking through every hallway becomes an event.

Dwell threshold

Temporal filter. A per-zone count of seconds the detection must persist to count. Stripping this out means the system fires on joggers, delivery drivers, and leaf blowers.

Active-hours schedule

Window filter. The zone can be permissive by day, strict by night, or vice versa. Stripping this out means the resident in the pool at 2 p.m. becomes an event the same as the stranger at 3 a.m.

Threat tier head

Classifier filter. Small head that reads object class, box size, dwell elapsed, zone weight, and time-of-day, then assigns LOW / MED / HIGH. Only HIGH escalates. Stripping this out means loiter events fire at the same rate as break-in attempts.
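The four cards above compose into one short-circuiting predicate. A minimal sketch, assuming a ray-casting point-in-polygon test, an active-hours window expressed as (start_hour, end_hour), and a detection that already carries its track's dwell and the classifier's tier; none of these names are Cyrano's internals:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float            # box center, tile pixel coords
    y: float
    dwell_sec: float    # seconds the track has persisted in the zone
    tier: str           # "LOW" | "MED" | "HIGH" from the classifier head

def in_polygon(x, y, poly):
    """Ray-casting test against a hand-drawn zone polygon."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def in_active_hours(hour, window):
    """window = (start_h, end_h); handles midnight-crossing spans like 22-06."""
    start, end = window
    return start <= hour < end if start < end else (hour >= start or hour < end)

def escalate(det, zone, hour):
    """A detection survives only if all four stages agree, in order."""
    if not in_polygon(det.x, det.y, zone["polygon"]):    # Stage 1: zone
        return False
    if det.dwell_sec < zone["dwell_sec"]:                # Stage 2: dwell
        return False
    if not in_active_hours(hour, zone["active_hours"]):  # Stage 3: schedule
        return False
    return det.tier == "HIGH"                            # Stage 4: threat tier

# Dumpster-alcove zone, person dwelling 12 s at 3 a.m.
zone = {"polygon": [[140, 60], [234, 60], [234, 168], [140, 168]],
        "dwell_sec": 4, "active_hours": (22, 6)}
escalate(Detection(180, 100, 12.0, "HIGH"), zone, hour=3)   # True
```

The ordering matters operationally: the cheap spatial test runs first, so the expensive stages only ever see the small fraction of detections that land inside a drawn zone.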

The compression

Billions of tile-frames in, single-digit notifications out per day. The number is tunable per property via the config, but the shape of the pipeline is the same.

The detector is the shared part. The suppression is the per-property part.

Every Cyrano unit across every deployment ships with the same detector binary. The per-property divergence is entirely in the suppression config. The picture below is the literal architecture: one center (the detector) surrounded by the four filters that make one deployed property's surveillance AI behave differently from the next.

Shared detector: binary-identical across every unit, 28,800 tile-frames/sec inference (16-cam case).

Stage 1: Zone polygon
Stage 2: Dwell threshold
Stage 3: Active hours
Stage 4: Threat tier
Exit: WhatsApp on-call

A traced day of suppression on one property

Below is the shape of one 24-hour window on a 16-camera Fort Worth multifamily property. The detector ran flat out on every tile. The suppression stack dropped everything except four events, which posted to WhatsApp. The compression ratio is the product.

Traced day of the suppression stack

28,800 tile-frames/sec evaluated by the detector (16-cam case)
4 stages between detector and notification
~240 KB envelope per HIGH event (612 B JSON + 1 thumbnail)
20 HIGH events delivered in month 1 at the Fort Worth property

The delivered envelope: what actually reaches the human on call

When a HIGH event survives all four filters, the device posts a compact JSON envelope plus one JPEG thumbnail over outbound HTTPS to the property's on-call WhatsApp thread. The JSON part is roughly 612 bytes. The thumbnail is the only pixel data that ever leaves the building. Below is the real shape of a delivered envelope from a dumpster-bay event.

whatsapp-delivery.json (one HIGH event)

Notice what is not in the envelope: no raw frame, no detection stream, no tracker state, no camera stream URL, no identity information. The property manager on WhatsApp sees the camera label they typed during install, the zone they drew, the exact dwell seconds that triggered the event, and a cropped thumbnail of the object. That is the product surface area. Everything upstream of the envelope stays on the device.
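A sketch of the envelope's shape, assembled from the fields this page lists (timestamp, property identifier, camera label, zone label, threat class, dwell duration, content hash); the exact key names, values, and identifiers are illustrative assumptions, not the real whatsapp-delivery.json:

```python
import hashlib
import json

# The cropped bounding-box JPEG is the bulk of the ~240 KB payload;
# stub bytes stand in for it here.
thumbnail_jpeg = b"\xff\xd8 (cropped JPEG bytes)"

# Illustrative envelope only -- key names are assumptions.
envelope = {
    "ts": "2026-02-11T03:14:09-06:00",
    "property": "ftw-demo",              # hypothetical identifier
    "camera": "dumpster bay",            # label typed during install
    "zone": "dumpster_alcove",           # zone drawn during install
    "tier": "HIGH",
    "dwell_sec": 12,
    "sha256": hashlib.sha256(thumbnail_jpeg).hexdigest(),
}

# The structured part is a few hundred bytes of JSON; everything else
# in the delivery is the one thumbnail.
payload = json.dumps(envelope, separators=(",", ":")).encode()
```

The content hash lets the receiver verify the thumbnail matches the event record without the device ever transmitting a raw frame.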

The four questions to ask any surveillance AI vendor

If a vendor cannot answer these four, they do not have a suppression stack and their deployment will fail the same way every under-filtered AI product fails: the on-call mutes the thread.

1. Where is the spatial filter?

   How are in-zone and out-of-zone pixels distinguished per tile? Who draws the polygons, where are they stored, and can the operator edit them after install?

2. Where is the temporal filter?

   What is the per-zone dwell knob, in seconds? Does it read from a config file I can open, or is it a hidden model parameter?

3. Where is the schedule filter?

   Can a zone be permissive by day and strict by night? Where is the active-hours window stored, and who can change it?

4. Where is the classifier head?

   What is the output tier structure? How many levels? Which level fires a notification, and can that threshold be tuned per property?

The honest summary

Surveillance AI in 2026 is an over-indexed market, a contested policy space, and a crowded model bench. The part of the picture almost nobody writes about is the four filters that have to run between a raw detection and a human on call for the system to be usable on a live property. The detector is the open part. The suppression stack is the deployed part. Without the second half, the system is a motion sensor with a GPU.

Cyrano's spec for the second half is explicit: spatial zones on each tile, dwell thresholds in seconds, active-hour schedules per zone, and a LOW/MED/HIGH classifier head where only HIGH escalates. All four stages run on the device in the same room as the DVR. The only bytes that leave the property are the 240 KB envelopes on HIGH events, posted over outbound HTTPS to the property's WhatsApp thread.

If you want to see the suppression stack running live, we can pull the composite HDMI signal off a working Hikvision or Dahua DVR in a 15-minute demo, run the detector on every tile, show the four filters dropping the vast majority of detections in real time, and fire a test HIGH event to a WhatsApp thread you can read yourself. That is what deployed surveillance AI looks like.

See the suppression stack drop 99% of detections in real time

Fifteen minutes. We pull a live composite off an existing DVR, run the detector on every tile, and trace a HIGH event through all four filters to a WhatsApp thread you can read.

Book a call

Frequently asked questions

What is surveillance AI, in one sentence, once you strip out the marketing?

A closed-loop pipeline that reads a camera signal, runs an object/behavior detector on every frame, applies a suppression stack (where does it count, for how long, during what hours, how serious), and emits a small, human-readable notification only when all four filters agree. The detector is the part every vendor demos. The suppression stack is the part that decides whether the system is usable on a live property with a human on call. Cyrano's spec is explicit: a detection becomes a notification only if it is inside a drawn zone polygon, persists past a dwell threshold in seconds, falls within that zone's active-hour window, and gets classified into the HIGH threat tier. The detector runs on every frame on every tile, 30 frames per second across 25 tiles in the worst case. The notifications leaving the building are vastly fewer because the suppression drops 99 percent plus of detections by design.

Why is the suppression stack the real engineering problem, not the detector?

Because the detector is a commodity and the suppression is the deployment. Every serious computer-vision object detector in 2026 can find a person, a vehicle, a package, a crowd cluster, a loiter pattern in 480x270 pixels. The model is open, the weights are downloadable, the class set is standardized. A property manager with 16 cameras already gets thousands of qualifying detections per day from a model like that. What the manager needs is roughly three to five notifications per day, the ones that matter, delivered within seconds of the event. That compression from thousands to five is the suppression stack. Every knob that determines the compression, where zones are drawn, how long dwell must last, what hours are active, which threat tier escalates, lives in the config file that ships with each installed unit. The detector is the free part. The per-property config is the hard-won part.

What are the four stages of the suppression stack, in order?

Stage 1 is spatial: the detection must be inside a zone polygon drawn on the tile image during install. A person walking through a public hallway is not inside any zone; a person standing in a dumpster alcove polygon at 2 a.m. is. Stage 2 is temporal: the detection must persist across the zone's dwell threshold, measured in seconds. A delivery driver walking past the alcove for 0.8 seconds does not exceed a 4-second dwell; a person standing there for 12 seconds does. Stage 3 is scheduled: the detection must fall within the zone's active-hours window (example: pool amenity, active-hours 22:00-06:00 on weeknights). A resident in the pool at 2 p.m. is not an event; the same person at 3 a.m. is. Stage 4 is the threat-tier classifier head: LOW, MED, or HIGH. Only HIGH gets delivered. MED is logged locally for the operator to review the next morning. LOW is discarded. A detection has to agree with all four filters to escape the device as a WhatsApp message.

How big is the notification that actually leaves the building when a HIGH event fires?

Around 240 kilobytes total. The structured part is roughly 612 bytes of JSON carrying the timestamp, the property identifier, the camera label (the one the operator typed during install: 'mailroom', 'dumpster bay', 'rear alley'), the zone label, the threat class, the dwell duration that triggered it, and a content hash. The bulk of the 240 KB is one JPEG thumbnail cropped to the detected object's bounding box on the offending tile. That thumbnail is the only pixel data that leaves the property. The full composite frame that produced it, 6 megabytes of raw RGB, stays on the Cyrano device and is discarded in memory after inference. The ratio is roughly 6,000,000 bytes in, 240,000 bytes out, or about 4 percent. And that only happens on HIGH events, which in typical deployments are a handful per property per day.

How much of what the detector finds is actually dropped by the suppression stack?

At a well-tuned property, the large majority. Here is a worked example. A 16-camera property runs the detector on 16 tiles at 30 fps, or 28,800 detections-per-second of opportunity. Most of those frames contain a person or a vehicle in some form during business hours. If the detector fires at anything moving, a conservative rate, you get thousands of qualifying detections per day. After stage 1 (zone polygons drawn only on the five or six tiles that have actual restricted areas), the count drops by an order of magnitude because most of the frame does not matter. After stage 2 (dwell thresholds of 3 to 10 seconds on each zone), transient pass-throughs are eliminated, another order of magnitude. After stage 3 (active hours), day-time traffic through after-hours zones is cut. After stage 4 (the HIGH threat tier), only the classifier's top severity bucket makes it through. The result at deployed properties is usually three to ten WhatsApp events per day. The system's job is to throw away the 99 percent plus that does not need a human.
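The worked example above can be written out as arithmetic. The per-stage pass rates below are hypothetical, chosen only to mirror the order-of-magnitude drops described; they are not measured Cyrano figures:

```python
# Hypothetical pass rates per stage -- illustrative, not measured.
raw_detections_per_day = 5_000   # "thousands of qualifying detections"
stage_pass = {
    "zone":  0.10,   # polygons cover only the restricted parts of a few tiles
    "dwell": 0.10,   # 3-10 s thresholds eliminate transient pass-throughs
    "hours": 0.50,   # active-hour windows cut daytime traffic
    "tier":  0.20,   # only the HIGH bucket escalates
}

survivors = raw_detections_per_day
for stage, p in stage_pass.items():
    survivors *= p

print(round(survivors))  # -> 5, inside the three-to-ten band quoted above
```

The exact rates vary per property; the structural point is that four multiplicative filters turn thousands into single digits.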

Why is this called a suppression stack instead of just 'filters' or 'rules'?

Because the four stages are ordered, each operates on the output of the previous, and the entire pipeline is designed to suppress. A filter implies bidirectional shaping. A rule implies a yes/no decision at one point. A suppression stack is explicit about its purpose: every stage exists to remove detections that do not deserve a human. That framing matters because it inverts how most 'AI surveillance' marketing is written. The SERP is full of pages that brag about how many objects the AI can classify or how fast the model runs. Those are specs of the detector. Suppression is the specification of how few notifications reach the on-call phone per day. If you cannot state the suppression ratio at a deployed property, the AI is not running in production, it is running in a demo loop.

Does Cyrano's suppression stack run on the device or in the cloud?

On the device. All four stages execute inside the Cyrano unit sitting in the same room as the DVR. Zone polygon tests are pixel-coordinate comparisons against each detection's bounding box. Dwell tracking is a short per-zone counter held in memory, indexed by object track id. Active-hour tests are a modulo against the on-device clock. The threat-tier head is a small classifier that runs alongside the main detector and costs roughly 0.3 ms per detection. The only step that involves the network is the final WhatsApp delivery, which is an outbound HTTPS POST to the WhatsApp Cloud API. That means the suppression logic keeps working if the property's internet is down: events queue locally and post when the link returns. It also means the only bytes that ever leave the property are the 240 KB envelopes on HIGH events; no raw frames, no detection stream, no dwell tracker state.
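The per-zone dwell counter described above is small enough to sketch in full. This is an illustrative sketch of the bookkeeping, keyed by (zone, track id) as the paragraph describes; the class and method names are assumptions, not Cyrano's internals:

```python
class DwellTracker:
    """In-memory dwell counter, indexed by zone and object track id."""

    def __init__(self):
        self._first_seen = {}   # (zone, track_id) -> timestamp of first sighting

    def update(self, zone: str, track_id: int, now: float) -> float:
        """Record a sighting; return seconds of continuous dwell so far."""
        key = (zone, track_id)
        self._first_seen.setdefault(key, now)
        return now - self._first_seen[key]

    def drop(self, zone: str, track_id: int) -> None:
        """Forget a track once its object leaves the zone."""
        self._first_seen.pop((zone, track_id), None)

tracker = DwellTracker()
t0 = 1_000.0
tracker.update("dumpster_alcove", 42, t0)            # first sighting: 0.0 s
dwell = tracker.update("dumpster_alcove", 42, t0 + 12.0)   # 12.0 s of dwell
```

Because the state is a single small dict per device, the tracker survives network outages for free: nothing about dwell accounting touches the cloud.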

Where are the suppression parameters stored, and can a property manager see them?

Inside the per-property config record on the device, mirrored to a cloud document store for versioning. The record is a JSON document of roughly 6 to 20 kilobytes and any operator can export it from the dashboard. The important fields for suppression are tiles[].zones (the polygon vertices drawn on each tile), zones[].dwell_sec (the temporal hold), zones[].active_hours (the schedule), and the threat-tier thresholds on the classifier head. A property manager can open that file in a text editor, read the exact polygon coordinates for the restricted alcove behind their building, adjust a dwell from 3 seconds to 6 seconds if they are getting false positives on joggers, and push the change back from the dashboard. The inference loop picks up the new config on the next tick; no device restart.

How does this change when the building has a DVR with 25 cameras instead of 16?

The suppression stack is identical; only the count of tiles changes. The 25-camera case means the detector evaluates a 5x5 grid at 30 fps, each tile 384x216 pixels, same inference pipeline. The config file gains nine more tile entries. Each tile has its own zone polygons, dwell thresholds, and active-hour schedules. Most of those tiles will have zero zones because most camera feeds point at public space. The detector still runs on every tile every frame, but the stage 1 spatial filter will silently drop almost every detection that falls outside any polygon. That is a deliberate architectural choice: the suppression stack is aware of where the system should be strict and where it should be permissive, per tile, per zone. A 25-camera property does not generate 25/16ths as many events as a 16-camera property; it generates roughly the same number, because the zones, not the camera count, set the event rate.

How does the threat-tier classifier decide LOW vs MED vs HIGH?

The classifier head reads a small feature vector extracted from the primary detector's output: object class, bounding-box size relative to the tile, motion vector, dwell elapsed, zone sensitivity weight, and time-of-day relative to the zone's active hours. It produces a three-bucket softmax: LOW, MED, HIGH. LOW is the normal case: a person walked through a zone but the combination of shallow dwell, low zone weight, and mid-day timing does not escalate. MED is the review case: the event is worth keeping locally for morning review but not worth waking the on-call. HIGH is the on-call case: deep dwell, high-sensitivity zone, during active hours, with an object class the operator flagged as serious (for example, person-in-dumpster-alcove at 3 a.m. with dwell over 30 seconds). The tier mapping is tunable per property through the config, because what counts as HIGH at a Class A high-rise is different from a Class C garden-style community; the tuning sits alongside the zones.
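The tier head's shape can be sketched as a linear layer over that feature vector followed by a three-way softmax. The weights and biases below are made up to produce plausible behavior for the three cases described; the real head's parameters, and even its exact feature encoding, are assumptions here:

```python
import math

# Feature vector: [box_area_frac, dwell_sec, zone_weight, after_hours(0/1)]
# (bias, weights) per tier -- hypothetical values, not the trained head.
W = {
    "LOW":  ( 2.0, [0.0, 0.00, 0.0, 0.0]),
    "MED":  ( 0.5, [1.0, 0.05, 1.0, 0.5]),
    "HIGH": (-3.0, [2.0, 0.10, 2.0, 2.0]),
}

def tier(features):
    """Linear scores per tier, then a numerically stable softmax."""
    scores = {t: b + sum(w * f for w, f in zip(ws, features))
              for t, (b, ws) in W.items()}
    m = max(scores.values())
    exp = {t: math.exp(s - m) for t, s in scores.items()}
    total = sum(exp.values())
    probs = {t: e / total for t, e in exp.items()}
    return max(probs, key=probs.get), probs

# Large box, 30 s dwell, 0.85-weight zone, after hours:
label, probs = tier([0.3, 30.0, 0.85, 1.0])   # -> "HIGH" under these weights
```

Swapping the (bias, weights) table per property is one way to realize the per-property tier tuning the paragraph describes without touching the shared detector.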

What is the failure mode of a surveillance AI without a suppression stack?

Alert fatigue followed by a disabled system. The common pattern: a building installs AI video analytics that fire on every human detection, every vehicle, every motion event. The first week the on-call phone buzzes every few minutes. The second week the on-call mutes the thread. By the third week the alerts are no longer read and the system is operating as a very expensive motion sensor. The detector is doing its job correctly; it is finding everything. The suppression stack, missing, is why the deployment fails. Any AI surveillance product you evaluate should answer one question directly: what is the expected number of notifications per property per day, and what are the four (or however many) filter stages that produce that number? If the answer is 'depends on the model', the answer is that there is no suppression stack.

What does a typical day of suppression look like on one property?

At a 16-camera Class C multifamily property in Fort Worth, a first-month Cyrano deployment fired 20 notifications over 30 days and caught a break-in attempt among them. Working backward: the detector evaluated roughly 28,800 tiles per second times 16 hours of active day times 30 days, on the order of 40 billion tile-frames. It dropped essentially all of them. Twenty notifications reached the on-call thread. Of those twenty, all were worth a read; at least one was a live incident the property would not otherwise have caught. The compression ratio is the product. If every tile-frame that contained something moving had fired a notification, the thread would have been unusable by end of day one. The suppression stack is not a quality-of-life feature; it is what makes the system work at all.

Can the suppression stack miss real incidents (false negatives), and if so, what is the knob?

Yes, and the knob is the dwell threshold plus the zone polygon shape. A dwell threshold that is too long will suppress a fast smash-and-grab. A zone polygon drawn too small will let the attacker walk the edges. The knob for tightening either is in the per-property config. A property that had a near-miss on a dumpster break-in might tighten dumpster-alcove dwell from 4 seconds to 2 seconds and extend the polygon two meters further out from the bins. That change pushes more MED events to HIGH for that zone, which is the correct outcome. The design does not eliminate false negatives by raising all thresholds; it gives the operator a per-zone knob because the right strictness varies. What a suppression stack does eliminate is the failure category where the property stops reading alerts because there are too many; that failure is the worse one, because it makes the detector useless regardless of its recall.

🛡️ Cyrano: Edge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
