Matthew Diakonov
14 min read
For ops, integrators, and people who actually have to ship one

An AI agent for security camera monitoring is either real or it isn't, and the test is which tier of the loop the LLM sits in.

Almost every product marketed as an AI agent for camera monitoring today is one of three things: a CNN detector with a chat box, a VLM running on every frame, or a planner sitting above a perception tier it didn't build. The first two are not agents in any useful sense, and the third one only works if a few specific architectural decisions go the right way. This is the shape of those decisions, the four tools the planner is allowed to call, and the three files those tools read and write on a Cyrano unit.

Nothing here is novel as agent design; it is the camera-monitoring specialization that nobody else publishes. If you are evaluating a vendor in this category, the four questions at the bottom of this page are the ones that separate a real agent from a marketing wrapper.

What "AI agent for security camera monitoring" usually means in vendor copy

Open the homepages of the products that come up first when you ask this question and you will see three flavors. The first is a CNN object detector with a chat box: the cameras run a person, vehicle, and package detector around the clock, the chat box lets a human ask "show me everyone who walked past the rear gate last Saturday between 11 and 1," and clips come back. That is a search interface on top of a detector. It is useful, but the agent is not watching the cameras; the operator is, after the fact.

The second is a vision-language model running on every frame. The pitch is "the AI watches your cameras and tells you when something is wrong." The math does not work: a VLM call is hundreds of milliseconds, sometimes seconds, and a single property with 25 cameras at 15 fps is asking for 375 inferences per second. Even at the cheapest hosted endpoint that is a five-figure monthly bill per property, and the latency means alerts arrive well after the moment they describe. Vendors who advertise this either do not actually run a VLM on every frame (they run one occasionally and round it up in the marketing) or they run a tiny one that is more error-prone than a domain-specific detector at the same compute budget.

The third flavor is a planner: a small, structured agent that reasons about events that have already been filtered down to the point where a planner can fit them in its context. This one is real, but only if the perception tier and the state tier underneath it were built to feed it. Otherwise the planner inherits the same problems as the VLM-on-every-frame product, only with extra latency.

The loop, drawn honestly: three tiers, two of them with no LLM in them

What follows is the actual data path on a Cyrano edge unit. The shape is generic; any honest agent for this domain looks like this. Note that two of the three tiers never call an LLM at all, and the third one calls it on a stream of structured events that has already been compressed by roughly five orders of magnitude relative to the input frames.

Three tiers, one LLM call per real event (at most)

HDMI multiview (1080p tile grid, 15 fps)
→ Perception: object detect + tracker (no LLM)
→ Detection {track_id, class, bbox, ts}
→ State tier: zone polygon test, dwell counter (no LLM)
→ Transition {track, zone, threshold_crossed}
→ Planner: tool calls (zone, dwell, router, outbox); outcome is drop, low_threat, or dedup against an open incident
→ On-call ops: phone + 10s clip on HIGH only

The planner is invoked once per transition, not once per frame. For a Class C multifamily property with 25 cameras, the transition stream comes out in the low double digits per camera per day, for a total on the order of a few hundred planner invocations per device per day. That is the budget that makes an LLM-backed planner financially viable.

The four tools the planner is allowed to call, and nothing else

Most agent frameworks expose dozens of tools and let the model decide which to call. That works for code agents because the cost of a wrong tool call is a failed test or a wrong commit. The cost of a wrong tool call here is a phone call to a tenant who is already asleep. So the toolset is small on purpose. Four tools, no shell, no arbitrary HTTP, no "browse the internet."

Each one has a single named input shape and a single named output shape, both serializable to a JSON line so every call is auditable. The planner's entire decision graph is a sequence of these four, in some order, terminating in either drop or alert_outbox_append.

planner-tools.ts
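What that file exposes, sketched from the tool descriptions later on this page; the exact type and field names here are illustrative, not the shipped source.

```ts
// planner-tools.ts -- illustrative signatures for the four tools.
// Reconstructed from the descriptions on this page; field names are assumptions.

export interface TransitionEvent {
  track_id: number;
  camera_id: string;
  zone_id: string;
  ts: string;                    // ISO timestamp
  threshold_crossed: boolean;
}

export interface ZoneLookupResult {
  inside: boolean;               // is the point inside any armed zone for this camera?
  zone_id?: string;              // e.g. "mailroom-01"
  dwell_threshold_sec?: number;  // configured dwell threshold for that zone
  armed?: boolean;               // per the zone's arming schedule
}

export interface DwellResult {
  accumulated_sec: number;       // total dwell for this (track, zone) pair
  threshold_crossed: boolean;    // did this call push it over the threshold?
}

export type ThreatClass = "drop" | "low_threat" | "high_threat";

export interface RoutingDecision {
  classification: ThreatClass;
  clip_path?: string;            // optional 10-second clip path on HIGH
}

export interface PlannerTools {
  zone_polygon_lookup(camera_id: string, point: [number, number]): ZoneLookupResult;
  dwell_timer_increment(track_id: number, zone_id: string, seconds: number): DwellResult;
  threat_router_classify(event: TransitionEvent): RoutingDecision;
  alert_outbox_append(event: TransitionEvent): number;  // NDJSON line number, for in-order replay
}
```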

zone_polygon_lookup answers "is this happening somewhere I care about." dwell_timer_increment answers "has it lasted long enough to matter." threat_router_classify answers "how loud should the alert be." alert_outbox_append answers "how do I make sure the alert actually leaves the device." That is the entire useful decision space at this tier, and there is nothing else worth giving the planner access to.

What the agent remembers, and where that memory lives

Memory in this domain is not a vector store and it is not the model's context window. It is three plain files on the device's local disk. Every tool call reads from or writes to one of these files, and a field technician with a USB keyboard can open them and read the entire history of every alert decision the agent has ever made on this property. There is no opaque cloud blob. There is no "the model just remembers."

cyrano@unit-fortworth-01
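Sketched record shapes for those three files, based on the field descriptions in the FAQ further down this page; the exact field names are assumptions.

```ts
// Record shapes for the three files under /var/lib/cyrano/meta/ -- a sketch
// based on the descriptions in the FAQ; exact field names are assumptions.

interface TrackRegistryLine {    // track_registry.ndjson: one line per track seen in the last 6 hours
  track_id: number;
  first_seen: string;            // ISO timestamp
  last_seen: string;
  last_bbox: [number, number, number, number];
  camera_id: string;
}

interface DwellStateEntry {      // dwell_state: one accumulator per (track, zone) pair
  track_id: number;
  zone_id: string;
  accumulated_sec: number;
}

interface OutboxLine {           // outbox.ndjson: append-only, one alert per line, replayed in order
  ts: string;
  track_id: number;
  zone_id: string;
  classification: "low_threat" | "high_threat";
  clip_path?: string;
  delivery_state: "pending" | "sent" | "acknowledged";
}
```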

The implication: every alert is auditable from the device itself, including the dwell that triggered it and the exact moment the on-call team acknowledged it. If a tenant disputes an incident the on-site team can answer in seconds, not by combing footage. If the on-call team wants to know whether the agent is missing events, they grep the registry against the dwell state and the answer is right there.

Where the LLM goes, and where it absolutely doesn't

The LLM does not run inside perception. It does not run inside the state tier. Both of those tiers run at frame rate or sub-second, both have well-shaped numerical inputs (pixels, coordinates, integers), and both have decades of fast classical and neural implementations that an LLM cannot improve on at the right cost point. Putting a language model in either of those places is the mistake the buzzwordiest products make.

The LLM, if you use one at all, runs at the planner tier, and only on the small structured stream of transitions that has cleared the zone and dwell filters. Its job is to handle the messy free-form context that a deterministic router cannot express: a text from the property manager that says "contractor at the rear gate at 2 a.m. tonight, do not page me on this," an on-call schedule that rotates weekly, a known pattern of 1 a.m. delivery drivers that should be silenced for the next 20 minutes only. The four tools above are enough to express any decision the LLM needs to make, and the structured event format is enough to keep the LLM's prompt small and its hallucination space small.

On most production Cyrano deployments today the planner is a deterministic TypeScript function, not an LLM. The LLM-backed planner is a config switch, not a different product. Both share the same toolset, the same state files, and the same outbox.
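A minimal sketch of what that switch can look like, assuming the tool and event shapes sketched earlier on this page; callModel, loadOperatorNotes, and the config flag name are placeholders, not the shipped code.

```ts
import { PlannerTools, TransitionEvent, ThreatClass } from "./planner-tools";

// One planner interface, two implementations behind a config flag -- a sketch of
// "config switch, not a different product". Names are placeholders.
declare function callModel(input: unknown): Promise<ThreatClass>;  // hosted endpoint or local model
declare function loadOperatorNotes(): string[];                    // free-form context for the LLM mode
declare const config: { planner_mode: "deterministic" | "llm" };

interface Planner {
  decide(event: TransitionEvent, tools: PlannerTools): Promise<ThreatClass>;
}

const deterministicPlanner: Planner = {
  // Per-class router: microseconds, fully on device, no outbound call.
  async decide(event, tools) {
    return tools.threat_router_classify(event).classification;
  },
};

const llmPlanner: Planner = {
  // The prompt carries only the structured event, the four tool descriptors, and any
  // operator notes -- never video. Same tools, same state files, same outbox.
  async decide(event, tools) {
    return callModel({
      event,
      tools: ["zone_polygon_lookup", "dwell_timer_increment", "threat_router_classify", "alert_outbox_append"],
      notes: loadOperatorNotes(),
    });
  },
};

export const planner: Planner =
  config.planner_mode === "llm" ? llmPlanner : deterministicPlanner;
```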

Per-frame VLM vs. planner agent, side by side

The honest comparison. A vendor who is selling a per-frame VLM has the left column. A vendor who is selling a planner agent over a structured perception layer has the right column. The wide gulf between the two is the whole point of this page.

| Feature | Per-frame VLM agent | Planner-tier agent (Cyrano) |
| --- | --- | --- |
| LLM/VLM call rate per device | 375+ per second (25 cams × 15 fps) | Low double digits per camera per day; a few hundred per device per day |
| Latency between event and alert | Hundreds of milliseconds to multiple seconds per frame | One detection cycle (sub-second) plus one tool-call sequence |
| Cost driver | Frame rate × camera count × token price | Real-event count per property per day; bounded and small |
| Hallucination surface | Open: VLM narrates whatever the frame suggests | Closed: planner picks among drop / low / high; free text only in incident notes |
| Auditability of a single decision | Re-prompt the model and hope for stability | Three plain files on local disk: registry, dwell, outbox |
| Behavior during a 6-hour WAN outage | Agent stops; alerts pause until cloud returns | Perception, state, deterministic planner all run; outbox replays in order on reconnect |
| Where video has to travel | Each frame, to the VLM endpoint | Nowhere; inference and planner run on device |
| What the operator can change without a redeploy | Almost nothing; behavior is in the prompt | Zone polygons, dwell thresholds, arming schedules, all plain JSON files |

The left column is what most products marketed as agents in this category are actually shipping today, even when the homepage copy implies the right column.

The four questions to ask any vendor in this category

None of these are gotchas. They are the questions someone who has actually shipped a camera-monitoring agent will be able to answer in two sentences each. Someone who has not will give you a tour instead of an answer.

  1. Where in the loop does the LLM run, and at what input rate? A real answer names the tier (planner, never perception or state) and gives a per-event budget (low double digits per camera per day). A non-answer is "continuously" or "on every frame."
  2. What is the explicit list of tools the agent can call? A real answer is short and names each tool with its inputs and outputs. A non-answer is "it can do many things" or a vague reference to plugins.
  3. What state does the agent persist, and where does it live? A real answer is specific files on local disk with named fields. A non-answer is "the model has memory" or "we use a vector store."
  4. What happens during a six-hour WAN outage? A real answer describes the local-only operation of perception and state tiers, the append-only outbox, and the in-order replay on reconnect. A non-answer is "the cloud handles failover."

A vendor who can answer those four with a tier name, a list of tools, a list of files, and a buffering behavior is selling an agent. A vendor who cannot is selling an object detector with a chat interface.

See the four tools and three state files on a real device

Ten minutes on a call. We will show you the agent loop running on a 25-camera unit at a real Class C property, including the outbox replay during a simulated WAN outage.

Specific questions, specific answers

Why does it matter which tier of the loop the LLM sits in?

Because compute and latency budgets at the perception tier and the planner tier are different by two orders of magnitude. The perception tier runs at the camera's frame rate, on the order of 10 to 30 frames per second per tile, multiplied by 25 tiles per device. That is roughly 250 to 750 inferences per second, all of which need to finish in under 33 milliseconds to keep up. A modern VLM call is hundreds of milliseconds end to end, sometimes seconds, and costs cents per call from a hosted endpoint. Multiply that by 750 calls per second per device and a single property runs the cost of a small SaaS company per day before any alerts fire. So the LLM cannot live in the perception tier. It has to live above it, on a much smaller stream of events the perception tier and the state tier have already filtered down to. On a real Cyrano deployment the LLM is invoked at most once per (track id, zone, transition) tuple, which on a Class C multifamily property comes out in the low double digits per camera per day, not 750 per second.
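The arithmetic behind that gap, as a quick back-of-envelope using the numbers above and assuming 15 events per camera per day as a representative "low double digits" figure:

```ts
// Back-of-envelope from the numbers in this answer (25 cameras, 15 fps).
const cams = 25;
const fps = 15;
const vlmCallsPerDay = cams * fps * 86_400;       // per-frame VLM: ~32.4 million calls per device per day
const plannerCallsPerDay = cams * 15;             // planner tier: ~375 calls per device per day
console.log(vlmCallsPerDay / plannerCallsPerDay); // ~86,400x -- roughly five orders of magnitude of compression
```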

Most products say they have an 'AI agent' for cameras. What does that usually mean in practice?

Three different things, and the marketing copy almost never tells you which one. The first is a CNN object detector with a chat interface bolted on top so an operator can ask 'show me people near the gate after midnight last weekend' and get clips back. That is a search agent on top of a detector; the agent itself is not watching the cameras. The second is a frame-by-frame VLM running on every camera, asking 'is anything suspicious in this frame.' That is the most expensive and least reliable option, because VLMs hallucinate suspicious activity in long-tailed scenes and the cost scales with frame rate. The third is a planner agent: a tight perception layer surfaces structured events to a planner that decides what to do with each one, with the LLM only entering the loop when an event has already cleared the state-tier filters. Buyers should ask which of the three a vendor is shipping. Most are shipping the first or the second.

What are the three tiers of a real camera-watching agent loop?

Tier one is perception, on the order of tens of milliseconds per inference. It runs object detection (and on some platforms a small classifier) over every frame of the input, emits bounding boxes with class labels, and runs a tracker so the same person across frames keeps the same track id. No LLM, no agent reasoning, no language calls. Tier two is state, sub-second. It takes detections from tier one, applies the per-camera zone polygon, increments dwell counters, dedups tracks, and emits transitions of the form 'track 4811 entered zone mailroom-01 at 02:14:33' or 'track 4811 has been in zone parcel-shelf for 920 seconds.' Still no LLM. Tier three is the planner, sub-30 seconds. It takes the small stream of transitions from tier two, decides whether to drop the event, route it as a LOW THREAT, escalate to HIGH THREAT, send a 10-second clip, retry a failed alert, or fold the event into an open incident. The planner is the only place where an LLM call (if you use one) belongs.
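To make tier two concrete, here is a minimal sketch of the zone test and dwell counter, assuming the Detection and Transition shapes from the diagram above; the helper names are illustrative.

```ts
// Tier two (state) in miniature: a polygon test plus a counter, no LLM.
// Shapes follow the diagram above; helper names are illustrative.
interface Detection { track_id: number; cls: string; bbox: [number, number, number, number]; ts: number; camera_id: string; }
interface Transition { track_id: number; zone_id: string; ts: number; threshold_crossed: boolean; }
interface Zone { id: string; threshold_sec: number; }

const dwell = new Map<string, number>();   // key `${track_id}:${zone_id}` -> seconds accumulated

function onDetection(
  d: Detection,
  zoneAt: (camera_id: string, x: number, y: number) => Zone | null,  // armed-zone polygon test
  frameIntervalSec: number,
): Transition | null {
  const [x, y, w, h] = d.bbox;                          // bbox assumed as [x, y, width, height]
  const zone = zoneAt(d.camera_id, x + w / 2, y + h);   // test the foot point of the box
  if (!zone) return null;                               // not in an armed zone: nothing to report

  const key = `${d.track_id}:${zone.id}`;
  const prev = dwell.get(key) ?? 0;
  const next = prev + frameIntervalSec;
  dwell.set(key, next);

  // Emit a transition only on first entry or when the dwell threshold is crossed,
  // so the planner sees a handful of events per camera per day, not one per frame.
  const crossed = prev < zone.threshold_sec && next >= zone.threshold_sec;
  if (prev === 0 || crossed) {
    return { track_id: d.track_id, zone_id: zone.id, ts: d.ts, threshold_crossed: crossed };
  }
  return null;
}
```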

What tools can the planner actually call on a Cyrano unit?

Four, and only four. zone_polygon_lookup(camera_id, point) returns whether a coordinate falls inside any armed zone for that camera, with the zone name, dwell threshold, and arming schedule. dwell_timer_increment(track_id, zone_id, seconds) reads and writes the dwell counter file at /var/lib/cyrano/meta/dwell_state and returns the current accumulated dwell, plus a flag for whether the dwell threshold has been crossed in this call. threat_router_classify(event) takes a structured event and returns one of drop, low_threat, high_threat, with an optional 10-second clip path. alert_outbox_append(event) appends a JSON line to the NDJSON outbox and returns the line number, so the alert can be retried in order if the outbound link is down. That is the entire toolset. There is no shell tool, no arbitrary API, no 'browse the internet.' The planner cannot do anything outside this set, which is also what makes its behavior auditable: every call is logged to a file the operator can read.

Why expose only four tools? Modern agent frameworks expose dozens.

Two reasons. First, every additional tool is another surface for the planner to make a mistake on, and 'mistake' here means 'sent the wrong family a 2 a.m. text message about a possible break-in.' The cost of an agent error in this domain is not a wrong code change, it is a phone call to a tenant who is already asleep. Second, the four tools cover the entire decision space the planner needs. zone_polygon_lookup answers 'is this happening somewhere I care about.' dwell_timer_increment answers 'has it lasted long enough to matter.' threat_router_classify answers 'how loud should the alert be.' alert_outbox_append answers 'how do I make sure the alert actually leaves the device.' Anything outside those four is either a different layer's job (perception, state) or somebody else's job (the property manager, the on-call ops team, the cops). Keeping the toolset minimal is what makes the agent's behavior describable on one page.

What state does the agent persist on disk, and where does it live?

Three files, all under /var/lib/cyrano/meta/. track_registry.ndjson is one JSON line per track id observed in the last six hours, with first-seen timestamp, last-seen timestamp, last bounding box, and the camera id the track was last seen on. dwell_state is a per-(track, zone) accumulator: how many seconds track 4811 has been inside zone parcel-shelf since it first entered. outbox.ndjson is the append-only alert log, one event per line, with a delivery_state field that is updated when the alert is acknowledged by the on-call ops team. All three files survive a process restart and a reboot. None of them ever leaves the device. If an operator wants to know why an alert fired or did not fire on a particular night, they grep the outbox file by timestamp and the dwell_state file by track id, and the answer is right there.
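A sketch of the audit query that answer describes, assuming the outbox record shape sketched earlier on this page; the timestamps are placeholders.

```ts
import { readFileSync } from "node:fs";

// "Why did (or didn't) an alert fire that night?" -- a sketch of the audit read.
// Field names follow the earlier sketch; the time window is a placeholder.
const events = readFileSync("/var/lib/cyrano/meta/outbox.ndjson", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const from = Date.parse("2026-02-03T02:00:00Z");
const to = Date.parse("2026-02-03T03:00:00Z");

for (const event of events) {
  const ts = Date.parse(event.ts);
  if (ts >= from && ts <= to) {
    console.log(event.track_id, event.zone_id, event.classification, event.delivery_state);
  }
}
```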

How does this compare to running a VLM on every frame, which is what some products advertise?

A VLM on every frame is the wrong shape for two reasons. First, it conflates perception and reasoning into a single expensive call, which means you pay LLM costs for every frame even when the answer is 'no person, no vehicle, nothing.' The vast majority of frames at a residential property are exactly that, which means the vast majority of the inference budget is wasted. Second, VLMs are unreliable on long-tailed surveillance scenes (low light, partial occlusion, weird angles, motion blur) in a different way than detectors are. Detectors miss objects. VLMs hallucinate plausible-sounding stories about objects, and a 'person carrying what may be a weapon near the rear gate' alert that turns out to be a maintenance technician with a flashlight is much harder to recover trust from than a missed detection. The agent loop solves both: tier one and tier two never call an LLM, so you only pay LLM cost when something has already cleared a structured filter, and tier three's prompt is tightly scoped (here is the event, here are the four tools, decide an action) so hallucination space is small.

What questions should I ask a vendor to verify their AI agent is real?

Four. First, where in the loop does the LLM run, and on what input rate? A real answer names the tier (planner) and gives a per-event budget (low double digits per camera per day). A non-answer is 'continuously, on every frame.' Second, what is the explicit list of tools the agent can call? A real answer is short and names each tool, what it reads, and what it writes. A non-answer is 'it can do many things' or a vague reference to plug-ins. Third, what state does the agent persist, and where does it live? A real answer is specific files on local disk with named fields. A non-answer is 'the model has memory.' Fourth, what happens during a six-hour WAN outage? A real answer describes the outbox file, the order of replay, and the local-only operation of perception and state tiers. A non-answer is 'the cloud handles failover.' A vendor that cannot answer those four questions with file paths, integers, and behaviors is selling a marketing wrapper around an object detector, not an agent.

Does the planner really need an LLM at all? Could a rule engine do this?

For a fixed property with a stable rule set, yes, a rule engine is enough, and on most production Cyrano deployments today the planner is a deterministic TypeScript function (the per-class router shown on this site) rather than an LLM call. The LLM earns its keep when the rules need to handle messy free-form context: an operator who just sent a text saying 'expect a contractor at the rear gate at 2 a.m. tonight, do not page me on this one,' or a property where the on-call schedule changes weekly, or a building with a known pattern of 1 a.m. delivery drivers that should be silenced for the next 20 minutes only. The deterministic router cannot express those without code changes. An LLM-backed planner can take the operator note, the on-call schedule, and the event and produce the right routing, then write back to the outbox. So the answer to 'do you need an LLM' is 'only when the rules need natural-language context the operator wants to set and forget.' Detection, state, and the four tools never need one.
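A sketch of how the LLM-backed planner's output stays constrained even when its input is free-form; the shapes are illustrative, not the shipped schema.

```ts
import { TransitionEvent } from "./planner-tools";

// What the LLM-backed planner reasons over, and the only actions it is allowed
// to return. Shapes are illustrative, not the shipped schema.
interface PlannerContext {
  event: TransitionEvent;                                    // from the state tier
  operator_notes: { text: string; expires_at: string }[];    // "contractor at the rear gate at 2 a.m. ..."
  on_call: { name: string; phone: string; until: string }[]; // weekly-rotating schedule
  open_incidents: { id: string; zone_id: string }[];         // for dedup / fold-in
}

type PlannerAction =
  | { action: "drop"; reason: string }
  | { action: "low_threat"; reason: string }
  | { action: "high_threat"; clip: boolean; reason: string };

// Anything the model returns outside the three actions is ignored, which is what
// keeps the hallucination space small.
function parseAction(raw: string): PlannerAction | null {
  try {
    const parsed = JSON.parse(raw);
    if (parsed.action === "drop" || parsed.action === "low_threat" || parsed.action === "high_threat") {
      return parsed as PlannerAction;
    }
  } catch { /* malformed model output falls through to null */ }
  return null;
}
```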

Can a Cyrano unit run the planner tier locally without sending events to a cloud LLM?

Yes. Both modes are supported. In the deterministic mode the planner is the per-class Typescript function and runs in microseconds, fully on device, with no outbound network call. In the LLM-backed mode the planner can be wired to either a hosted endpoint (in which case only the structured event and the four tool descriptors leave the device, never video) or a local model running on the same hardware (in which case nothing leaves the device at all). The mode is a config switch, not a separate product. Most multifamily deployments today run the deterministic mode because the rules are stable. Construction site and commercial deployments where context shifts daily benefit more from the LLM-backed mode, often paired with a local 7B-class model so the privacy story stays clean.
