A cost-shaped argument

Vision language models for CCTV are a description and search engine, not a real-time alerter. The cost math is what tells you that, before anyone gets to talk about accuracy.

Most pages on this topic describe what VLMs can do and stop there. They show a CLIP demo classifying a webcam frame as “loitering” and call it a security system. The thing they skip is the only number that decides whether the architecture is real on an installed property: how many frames per day actually reach the model. Once you write that number down, the design space collapses to one shape, and the use case the VLM is actually for stops being “watch the live feed” and starts being “describe and index the events the gate already let through.”

Matthew Diakonov
11 min read

Direct answer (verified 2026-05-08)

Yes, vision language models can do CCTV surveillance, but only as a description and search layer behind a cheap detector. A 25-camera property emits roughly 32.4 million frames a day. Sending all of them to a hosted VLM at current frontier-model pricing lands near $97,000 per property per day. A cascade gate that filters more than 99.99 percent of frames before the VLM call lands at roughly $1.20 per property per month, a ratio of about 2.4 million to one.

The model’s job is to write a one-line sentence about each surviving event and emit a strict-JSON verdict. The sentence becomes a searchable index of the property’s footage in plain English. The verdict is what the operator sees. The model is never the thing watching the live feed; the on-device detector is.

The thesis: a VLM is too expensive to be the eyes and too good to waste on detection

The marketing pages on this topic talk about VLMs as if they replace the detector. They do not. They sit one layer behind the detector and they do a different job. The detector decides whether anything is in the frame at all. The VLM decides what the something means once you have already paid for it to be there. Confusing those two roles is what breaks the math.

Once the roles are kept apart, the architecture writes itself. A small detector runs on every frame, locally, on an edge accelerator. A dwell timer holds the candidate track until it has earned an escalation. A 5-frame burst, stitched into one image, goes to the VLM with a class-specific prompt and comes back as JSON. A separate text index stores the VLM’s one-line description so the property manager can search the footage in English later. None of those layers is interchangeable. Each one is doing the job the layer above it cannot afford to do at the rate the layer below it produces work.

32.4M frames per day, 25 cameras at 15 fps
~$97,000 naive daily cost at Sonnet pricing
~250 events per day after the gate
~$1.20/mo gated cost per property

The arithmetic that decides everything else

A 1024 by 768 image is on the order of 1,048 image tokens to a frontier VLM. A typical 25-camera property recording at 15 frames per second produces 25 multiplied by 15 multiplied by 86,400, which is 32,400,000 frames per day. Multiply that by image tokens per frame and the per-day input-token volume sits in the tens of billions. At Claude Sonnet input pricing of about three dollars per million input tokens, the naive daily cost is in the high five figures per property. Even with batching, prompt caching, and aggressive resolution downsampling, the regime does not move enough to matter.

The number that actually has to drop is not the per-token price. It is the number of frames that reach the model. The detector throws away the frames where nothing relevant is happening. The dwell timer throws away the candidates that did not stay long enough to be worth a sentence. By the time a frame burst is on its way to the VLM, the gate has already discarded more than 99.99 percent of the day’s frames. That is the only architecture where the math works at the per-property prices an installed multifamily operator is willing to pay.

On a Cyrano unit, that math lands at roughly 250 candidate events per day across 25 cameras. Each one becomes a single multimodal request carrying a 5-frame stitched image, a short metadata block, and a class-specific prompt. The combined daily token volume is on the order of 1.31 million image tokens plus a thousand or so text tokens. After the local/cloud split described below keeps most of those events off the hosted model, the cloud cost is in the low single digits of cents per property per day, or roughly $1.20 per property per month in the steady state.
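
For readers who want to reproduce the arithmetic, here is a minimal sketch using the per-frame token count and pricing assumed above. The steady-state monthly figure additionally depends on the local/cloud escalation split described later, which keeps most of the gated events off the hosted model entirely.

```python
# Back-of-envelope reproduction of the naive vs. gated token volumes above.
# The per-frame token count and pricing are the article's assumptions, not quotes.
CAMERAS, FPS, SECONDS_PER_DAY = 25, 15, 86_400
TOKENS_PER_FRAME = 1_048           # ~1024x768 frame as image tokens
PRICE_PER_M_TOKENS = 3.00          # USD per million input tokens, Sonnet-class

frames_per_day = CAMERAS * FPS * SECONDS_PER_DAY              # 32,400,000
naive_cost = frames_per_day * TOKENS_PER_FRAME / 1e6 * PRICE_PER_M_TOKENS

EVENTS_PER_DAY, FRAMES_PER_EVENT = 250, 5                     # after the gate
gated_tokens = EVENTS_PER_DAY * FRAMES_PER_EVENT * TOKENS_PER_FRAME
gated_cost = gated_tokens / 1e6 * PRICE_PER_M_TOKENS          # before the local/cloud split

print(f"frames/day      {frames_per_day:,}")
print(f"naive cost/day  ${naive_cost:,.0f}")   # roughly $100k/day, the high-five-figure regime
print(f"gated tokens    {gated_tokens:,}")     # ~1.31M image tokens/day
print(f"gated cost/day  ${gated_cost:,.2f}")   # a few dollars, before most events stay local
```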

The four-layer cascade between cameras and the VLM

1. Cameras: 32.4M frames/day, 25 feeds
2. Detector: drops frames with no person
3. Dwell + zone: drops candidates under threshold
4. VLM: ~250 events/day, JSON verdict
5. Index: plain-English search of footage

What the VLM is actually for: the sentence and the index

Once a candidate event has cleared the gate, the VLM does two things, and the rest of the system is built around its outputs.

The first output is a strict-JSON verdict. Three fields: threat_level (LOW or HIGH), one_line_summary (an English sentence the operator can read), and policy_match (whether the activity breaks a posted property rule). That JSON is what the dispatcher reads to decide whether to send an SMS, place an outbound call, or just write the event to the morning digest. The reason this field set is small is that the operator’s phone is the slowest part of the chain; the JSON has to be parseable in microseconds and routable in single-digit milliseconds. Anything richer than three fields is wasted on the dispatch path.
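
As a concrete illustration, a minimal dispatcher sketch is below. The three field names come from the verdict described above; the routing rules (call on HIGH, SMS on a policy match, digest otherwise) are assumptions for the example, not Cyrano's actual dispatch policy.

```python
# Hypothetical sketch of the three-field verdict and the dispatch decision it feeds.
import json
from typing import TypedDict

class Verdict(TypedDict):
    threat_level: str        # "LOW" or "HIGH"
    one_line_summary: str    # operator-facing sentence, also persisted to the index
    policy_match: bool       # does the activity break a posted property rule?

def dispatch(raw: str) -> str:
    v: Verdict = json.loads(raw)   # strict JSON: anything else is a model error
    if v["threat_level"] == "HIGH":
        return "call"              # outbound call to the on-call operator
    if v["policy_match"]:
        return "sms"               # policy breach, but not an active threat
    return "digest"                # routine event: morning digest only

raw = '{"threat_level": "HIGH", "one_line_summary": '
raw += '"Person forcing the side gate at 02:14", "policy_match": true}'
print(dispatch(raw))   # -> call
```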

The second output is the natural-language description that goes into the index. This is the use case nobody on the marketing pages writes about, and it is the one that justifies the model in production. The description is a sentence in plain English, attached to the event, persisted to disk. Six months later, when a tenant asks why a package went missing on a Tuesday afternoon, the property manager types “person in the lobby with a delivery uniform last Tuesday between 2 and 4 pm” and the index returns the matching events. The retrieval is a text search over sentences, not a vision inference over frames. The VLM was paid for once, at the moment of the event. After that the index is plain text.

That is the operational economics of the search use case: the cost is bounded by the number of events, not the number of queries. A property that has 250 events a day pays for 250 VLM calls a day. The same property can search its index 10,000 times without paying for a single additional model call. That is the regime where putting a VLM in the surveillance stack pays for itself, and it has nothing to do with watching the live feed.
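
A minimal sketch of that index, using SQLite's FTS5 full-text extension as a stand-in for whatever store a real unit uses: the sentence is written once, at event time, and every later query is plain text retrieval with no model call.

```python
# Persisted-sentence index sketch: one row per event, text search at query time.
import sqlite3

db = sqlite3.connect("events.db")
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS event_index
    USING fts5(camera, ts, one_line_summary)
""")

# Written once, at the moment the VLM described the event.
db.execute(
    "INSERT INTO event_index VALUES (?, ?, ?)",
    ("lobby-03", "2026-05-05T14:32:00",
     "Person in a delivery uniform placing a package by the mailboxes"),
)
db.commit()

# Queried as often as the manager likes, for free: text retrieval only.
hits = db.execute(
    "SELECT camera, ts, one_line_summary FROM event_index "
    "WHERE event_index MATCH ?",
    ("delivery uniform",),
).fetchall()
print(hits)
```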

What flows through the gate, and where the model actually sits

[Diagram: Cameras 1-25 feed the DVR / NVR; the Cyrano cascade sits behind it with zone polygons and per-class prompts; outputs flow to operator SMS / call, the incident feed, the search index, and the outbox queue.]

Local VLM versus cloud VLM, decided by one integer

Inside the cascade, there is a second split that the marketing pages collapse into a single “the VLM.” In production there are usually two: a small open-source VLM on the device (LLaVA-class, Qwen-VL-class, or a quantized derivative of one of them), and a frontier hosted VLM on a remote API. The local model writes the routine sentences for free. The hosted model writes the high-stakes ones for cents.

The handoff is governed by a single threshold. The local model returns a verdict and a confidence score. If the score is above the threshold, the verdict is final and the event is dispatched. If the score is below, or the event class is one of the small set that always escalates (tailgate, force entry, after-hours pre-action zone entry), the same 5-frame burst is also sent to the hosted model and the verdict is the union of the two. That threshold is one integer in a config file. Tuning it is the lever that moves the monthly cloud bill up and down without changing any code.
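
In code, the handoff is small enough to show whole. This is a hedged sketch with stand-in model calls and assumed names; only the threshold comparison and the always-escalate class set come from the description above.

```python
# Sketch of the local/cloud handoff. describe_local and describe_cloud are
# placeholders for the two model calls; the threshold is the single tunable
# integer mentioned above, expressed here as a percent.
ESCALATION_THRESHOLD = 85
ALWAYS_ESCALATE = {"tailgate", "force_entry", "afterhours_zone_entry"}

def describe_local(burst: bytes) -> dict:
    # Placeholder for the on-device open-source VLM (LLaVA / Qwen-VL class).
    return {"confidence": 92,
            "verdict": {"threat_level": "LOW",
                        "one_line_summary": "Person crossing the parking lot",
                        "policy_match": False}}

def describe_cloud(burst: bytes) -> dict:
    # Placeholder for the hosted frontier VLM call, billed per request.
    return {"verdict": {"threat_level": "LOW",
                        "one_line_summary": "Person crossing the parking lot, no tools visible",
                        "policy_match": False}}

def verdict_for(event_class: str, burst: bytes) -> dict:
    local = describe_local(burst)
    if local["confidence"] >= ESCALATION_THRESHOLD and event_class not in ALWAYS_ESCALATE:
        return local["verdict"]                      # final locally: zero cloud cost
    cloud = describe_cloud(burst)                    # low-confidence or high-stakes class
    return {**local["verdict"], **cloud["verdict"]}  # cloud fields win on conflict

print(verdict_for("loiter", b""))      # stays local: confidence above threshold
print(verdict_for("tailgate", b""))    # always escalates regardless of confidence
```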

This is the part of the architecture that pays the dividend on the choice of a small local model. The 80 percent of events that are unambiguous (a person in a parking lot is a person in a parking lot) are described locally at zero per-event cloud cost. The 20 percent that are ambiguous or high-stakes are described by the model that can actually read intent. The combined per-property cost is dominated by the share of escalations, not the total event count, which is the regime where the architecture is tunable to the operator’s budget rather than fixed by the camera count.

One event from frame to JSON, in five beats

Beat 1 of 5, t=0: Side-gate camera. Frame on the bus. Detector runs in parallel across the whole HDMI tile grid. One tile fires: person, confidence 0.92, inside the after-hours zone polygon.

Where current VLMs still get it wrong

An evaluation of vision-LLMs on UCF-Crime and RWF-2000, published on arXiv in October 2025 (paper 2510.23190), reports that frontier VLMs do well on simple, spatially salient events and struggle with noisy spatial cues, identity obfuscation, and dense crowd scenes. They also degrade when full-body GAN privacy filters are applied, because the transform breaks the spatial statistics the model was pre-trained on. The paper’s conclusion is that VLMs are not a drop-in replacement for a structured surveillance pipeline; they are most useful when the inputs have already been narrowed to a single track and a small temporal window.

That conclusion lines up exactly with what the cost arithmetic forces you to build anyway. The detector and the dwell timer narrow the input to a single track and a 5-frame window before the VLM ever sees the event. The model is then operating in the regime where the published papers say it is reliable, not the regime where it fails. The cost gate and the accuracy gate point in the same direction; both want a cheap front end and a smart back end, not the other way around.

The mistake on the marketing pages is to imply that a VLM can be the front end. It can, in a lab, on a single feed, for a research demo. On 25 feeds at 15 frames per second across a real property, the same architecture pays a model bill in the high five figures per day and produces results no better than the gated version. Both arguments converge on the cascade.

The counterargument: when the live-feed VLM is actually the right answer

There are three regimes where pointing a VLM at the live feed is defensible. First, post-incident analysis. A single event has already happened, you have a 30-second clip, you want a structured description. There is no rate problem because there is one clip. Send it to a frontier VLM and read the output. The cost is bounded by the analyst’s patience, not by the camera count.

Second, very low frame rate situations. A doorbell that captures one frame per second is a fundamentally different cost regime than a 15-fps property feed. At 1 fps with a single camera, the frame volume is 375 times smaller than the 25-camera property, which puts naive every-frame submission in the low hundreds of dollars per day at the same resolution and pricing, and lower still with smaller frames or a cheaper model, not tens of thousands. For a single-door consumer product, the cascade complexity may not be worth the savings.

Third, research and red-team work. Running a VLM continuously over a real feed to study where it fails is a perfectly good use of the model. The output is a research artifact, not a paged operator. Cost per insight is what matters, not cost per property.

None of those is the multifamily-property use case. None of them ships with an SLA an on-call manager has to live with. None of them has an installed DVR fleet on the other side of the equation. The cascade is the answer for the operational case; the live-feed VLM is the answer for the analyst case. Confusing the two is the source of the bad architectures on the market.

What to ask before buying anything that says “VLM-powered surveillance”

Five questions, in order. Each one is answered in seconds by an honest vendor and met with a fog of marketing language by a dishonest one.

  1. How many frames per day actually reach the VLM on a 25-camera property? The honest answer is in the low hundreds. If the answer is “all of them, we have efficient batching,” ask for the daily cost number and watch the conversation change.
  2. What runs on the device, and what runs in the cloud? The detector and the dispatcher have to be on the device or the latency story is fiction. The hosted VLM may live in the cloud; the decision to fire an alert cannot.
  3. What does the VLM return and where does the JSON go? Three fields max on the dispatch path. If the API contract is freeform paragraphs, the dispatcher is a regex pile and will break in production.
  4. Can a property manager edit the prompt without a platform deploy? The prompt is a policy artifact, not source code. Hardcoded prompts are a ticket-driven deployment surface that no operator wants to touch.
  5. Is the natural-language search of footage real, and is it free at query time? If “search” means another VLM call per query, the architecture has not understood why the index exists. The search has to read the persisted sentences, not re-watch the frames.

A vendor whose system passes those five questions has thought about the regime carefully. A vendor whose system fails any of them is selling the marketing-page architecture and will produce the marketing-page bill.


At one Class C multifamily property in Fort Worth, Cyrano caught 20 incidents including a break-in attempt in the first month. Customer renewed after 30 days.

Fort Worth, TX deployment, 25-camera DVR install, 2026

The resolution

Vision language models belong in CCTV pipelines. They do not belong on every frame, and they are not the thing watching the live feed. They are the layer that turns a small number of high-value events into structured verdicts, plain-English descriptions, and a searchable index of the property’s footage. The detector and the dwell timer are what make that economically possible, and the cascade is the only architecture where the per-property costs are within an order of magnitude of what a property operator will actually pay.

The Cyrano version of this is an edge box that plugs into the existing DVR or NVR over HDMI, runs the detector and dwell layer locally, sends a small number of stitched bursts to the VLM with class-specific prompts, and exposes the resulting sentences as a search index the property manager can query in English. Same cameras as last week. Same DVR. Different stack behind it.

Want to see the cascade running on a real DVR?

A 10-minute demo: live HDMI feed in, detector + dwell + VLM in the middle, JSON verdict and English-search out. We will show the per-property cost number, not just the model output.

Frequently asked questions

Can a vision language model just watch a CCTV feed and send me alerts?

Mechanically yes; economically no. A 25-camera property running at 15 frames per second emits 32,400,000 frames per day. Each 1024 by 768 frame is roughly 1,048 image tokens. At Claude Sonnet input pricing (about $3 per million input tokens), naive frame-by-frame submission lands at roughly $97,000 per property per day before adding any output tokens or retries. The arithmetic kills the architecture before any accuracy question is even asked. The only viable shape is a cheap on-device gate that drops more than 99.99 percent of frames and only escalates the surviving candidate events to the VLM.

What is a vision language model actually good for in a surveillance pipeline?

Two things a smaller object detector cannot do. First, intent and context classification: 'is this person attempting to force the door,' 'is this a delivery hand-off or a theft,' 'is the person concealing their face deliberately.' Those are language-shaped questions and a 90 megabyte detector cannot answer them. Second, structured natural-language descriptions for an incident feed and a searchable index: 'two adults in dark hooded jackets, one carrying a crowbar-shaped object, approaching the rear gate at 02:14.' The detector says 'person.' The VLM writes the sentence. Everything downstream of the sentence (search, summary, post-incident review) is what the VLM uniquely unlocks.

Why is the natural-language search of footage the strongest case for VLMs in CCTV?

Because it is the use case where the cost arithmetic actually agrees with the user. Search runs on the descriptions, not on the frames. Once an event has been described as a sentence ('masked man near the side gate, 02:14, dark jacket, approaching the door'), every later query against that sentence is plain text retrieval, not vision inference. A property manager can type 'masked person near the gate after midnight last week' and get hits in milliseconds without any VLM call at query time. The model paid its cost once, when the event happened. After that the index is just text.

What makes the per-frame VLM call so expensive compared to a regular detector?

Two layers. First, the per-frame token cost: a single 1024 by 768 image is on the order of 1,000 image tokens, and a hosted VLM is priced per million input tokens, not per frame. Multiply by frames per second per camera and the total is dominated by image tokens, not text. Second, the latency and rate-limit ceiling: hosted VLMs are batch-friendly but live-feed-hostile, with first-token latencies in the hundreds of milliseconds and request budgets that quickly cap out. A small detector running on an edge accelerator does inference in tens of milliseconds per tile at a fixed energy cost; it is in a different cost regime entirely.

Do small open-source VLMs (LLaVA, Qwen-VL, MiniCPM-V) change the picture?

They change one regime, not the whole picture. A small open-source VLM running on a local GPU or edge accelerator can take over the routine description work that does not need a frontier model: 'a person walked through the parking lot in a delivery uniform.' That handles a large share of events at zero per-event API cost. What it cannot do is the high-stakes interpretive call where misclassifying a force-entry as a delivery is operationally bad. For those, the cascade escalates to a frontier model. The decision is a single integer threshold on the local model's confidence: above it, the local VLM's verdict is final; below it, the event also goes to the cloud model. The cost lever is that integer.

Where do current VLMs still fail on surveillance video?

An evaluation of vision-LLMs on UCF-Crime and RWF-2000 published in October 2025 found that models do well on simple, spatially salient events (a person enters a frame, a vehicle stops) and falter on noisy spatial cues, identity obfuscation, and crowd scenes. They also degrade when privacy filters like full-body GAN transforms are applied, because the transform breaks the spatial statistics the model was pre-trained on. The takeaway is that a VLM should not be the only thing in the pipeline; it should be downstream of a detector that has already isolated a single track and a small dwell window. Asking a VLM to do everything from raw footage is exactly the regime where the published papers report it failing.

What does a working VLM-in-CCTV architecture look like end to end?

Four layers in cascade. Layer 1, the camera and recorder produce frames. Layer 2, an on-device object detector runs on every frame and discards anything that does not contain a relevant detection inside an armed zone polygon. Layer 3, a dwell timer on the surviving track advances; the candidate is only escalated when the track has held past a per-zone threshold (90 seconds for loiter, 1.2 seconds for tailgate, 900 seconds for package). Layer 4, the surviving 5-frame burst is stitched into one image and sent to the VLM with a prompt that returns strict JSON: threat_level, one_line_summary, policy_match. Every layer except the first is local on a Cyrano unit; only the burst plus the prompt leaves the property if a frontier model is in the loop.
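
A sketch of the Layer 3 gate, with the per-zone dwell thresholds quoted above expressed as plain config; the function name and shape are illustrative, not the actual Cyrano interface.

```python
# Per-class dwell thresholds from the answer above, as a minimal gate sketch.
DWELL_THRESHOLDS_S = {
    "loiter": 90.0,
    "tailgate": 1.2,
    "package": 900.0,
}

def should_escalate(event_class: str, dwell_seconds: float, zone_armed: bool) -> bool:
    # A track only earns a VLM call once it has held past its class threshold
    # inside an armed zone polygon.
    return zone_armed and dwell_seconds >= DWELL_THRESHOLDS_S[event_class]

print(should_escalate("loiter", dwell_seconds=95.0, zone_armed=True))   # True
print(should_escalate("tailgate", dwell_seconds=0.4, zone_armed=True))  # False
```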

Why send 5 frames instead of 1 frame or the whole clip?

One frame loses motion: a person standing still and a person caught mid-stride look identical in a single still. A whole clip is wasteful: 30 seconds at 15 frames per second is 450 frames, or roughly 470,000 image tokens for a single verdict. Five frames spaced at 1 frame per second around the dwell-threshold instant gives the VLM enough temporal context to read intent (approaching, walking away, putting something down, tampering) without inflating the cost. On a Cyrano unit the five frames are stitched into one 2x3 grid image plus a short metadata block, so the request is a single multimodal call, not five.
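
A sketch of the stitch step, assuming Pillow and 1024 by 768 frames. The grid orientation and the blank sixth cell are assumptions, since the answer above only specifies a 2x3 grid plus a metadata block.

```python
# Illustrative 5-frame stitch into a single grid image (assumed: 2 columns,
# 3 rows, sixth cell left black). Requires Pillow.
from PIL import Image

FRAME_W, FRAME_H = 1024, 768

def stitch_burst(frames: list) -> Image.Image:
    grid = Image.new("RGB", (FRAME_W * 2, FRAME_H * 3), "black")
    for i, frame in enumerate(frames[:5]):
        col, row = i % 2, i // 2
        grid.paste(frame.resize((FRAME_W, FRAME_H)), (col * FRAME_W, row * FRAME_H))
    return grid

burst = [Image.new("RGB", (FRAME_W, FRAME_H), "gray") for _ in range(5)]
stitch_burst(burst).save("burst.jpg")   # one image, one multimodal request
```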

How does the cascade behave when the link to the cloud VLM is down?

The on-device verdict (zone hit, dwell threshold met, event class) does not depend on a cloud VLM and is dispatched to the operator immediately with whatever fields the local model can fill. The 5-frame burst plus the prompt are written to a local outbox as an append-only record. When the link returns, the queue drains in order; the cloud VLM's verdict is appended to the event record after the fact and the operator's incident view updates. The natural-language search index is rebuilt for the lagged events as they come back. The detection layer never blocks on a remote API, which is the correct way to run anything on a residential or commercial uplink.
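
A minimal outbox sketch matching that behavior, with assumed file names: the record is appended at event time, and the queue is replayed in order when the uplink returns.

```python
# Append-only outbox sketch. Names and paths are illustrative; a real unit
# would checkpoint a cursor rather than delete the file after draining.
import json, time
from pathlib import Path

OUTBOX = Path("outbox.jsonl")

def enqueue(event_id: str, burst_path: str, prompt_path: str) -> None:
    record = {"event_id": event_id, "burst": burst_path,
              "prompt": prompt_path, "queued_at": time.time()}
    with OUTBOX.open("a") as f:                   # append-only at event time
        f.write(json.dumps(record) + "\n")

def drain(send_to_cloud) -> None:
    # Replay queued events in order once the link is back; the caller appends
    # the cloud verdict to each event record after the fact.
    if not OUTBOX.exists():
        return
    for line in OUTBOX.read_text().splitlines():
        send_to_cloud(json.loads(line))
    OUTBOX.unlink()
```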

How is the prompt that goes with the frame burst maintained over time?

On a Cyrano unit the prompts live as plain text files on disk, one per event class. The loitering prompt is different from the tailgate prompt, which is different from the package prompt. Each prompt asks the VLM to return strict JSON with three fields: threat_level (LOW or HIGH), one_line_summary (the operator-facing sentence), and policy_match (whether the activity violates a posted property policy). The text is editable on the device; no model retrain, no platform deploy. A property manager who decides 'we treat anyone in the pool after 22:00 as HIGH regardless' edits the pool prompt and the rule takes effect on the next event.
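
A sketch of that layout, with an assumed prompt directory: one editable text file per event class, read at event time, with the three-field schema attached to the request.

```python
# Prompt-per-class sketch. The directory path and request shape are assumptions;
# the three response fields are the ones named above.
from pathlib import Path

PROMPT_DIR = Path("/etc/cyrano/prompts")        # assumed location, editable on device

def build_request(event_class: str, metadata: dict) -> dict:
    prompt = (PROMPT_DIR / f"{event_class}.txt").read_text()   # e.g. loitering.txt
    return {
        "prompt": prompt,
        "metadata": metadata,                   # zone, dwell seconds, timestamp
        "response_fields": ["threat_level", "one_line_summary", "policy_match"],
    }
```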
