Matthew Diakonov
13 min read
The frame budget that everyone skips

Pointing Claude at a CCTV feed is easy. Affording it is the part nobody writes about.

A 25-camera property running at 15 frames per second emits 32,400,000 frames per day. Each one costs about a third of a cent if you send it to Anthropic Claude vision as a 1024 by 768 image. The naive bill is in the six figures per property per day. The only way the math works is if a local edge gate decides which 0.004 percent of frames are worth a verdict. This page is that gate, the prompt that goes with each frame burst, and the actual per-property arithmetic at the bottom.

Local edge gate cuts 32.4M daily frames to ~250 candidate events
5-frame burst stitched into one image: one bill per event, not five
Per-event prompt files live on disk in /var/lib/cyrano/prompts/
Outbox queue: events emit locally first, Claude verdict appended on link return

What every other guide on this leaves out

Most articles on this topic stop at “Claude can see images, so you can describe what is happening on a camera.” A few include a Python snippet that base64-encodes a single JPEG and posts it to the messages endpoint. None of them do the per-camera arithmetic, and that is the only number that matters once you try to use this on a real property.

A 25-camera multifamily property running at 15 frames per second emits 32,400,000 frames every 24 hours. At Sonnet 4 input pricing, roughly 1,048 image tokens per 1024 by 768 frame, naive frame-by-frame submission is on the order of $97,000 per property per day. That is the real reason there is no production deployment doing this. The API is not the bottleneck. The frame count is.
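If you want to sanity-check that envelope yourself, the arithmetic fits in a few lines of Python. The tokens-per-image figure uses Anthropic's published rule of thumb of roughly width times height divided by 750; the article's $97,000 figure rounds the per-frame cost down to 0.3 cents, and the unrounded version lands slightly above $100,000, the same order of magnitude.

```python
# Back-of-envelope for the naive build: every frame goes to Claude.
CAMERAS = 25
FPS = 15
SECONDS_PER_DAY = 86_400

frames_per_day = CAMERAS * FPS * SECONDS_PER_DAY        # 32,400,000

# Anthropic's rule of thumb: ~ (width px * height px) / 750 image tokens
tokens_per_frame = (1024 * 768) / 750                   # ~1,048 tokens

SONNET_INPUT_PER_MTOK = 3.00                            # USD per million input tokens
cost_per_frame = tokens_per_frame / 1e6 * SONNET_INPUT_PER_MTOK   # ~$0.003, "a third of a cent"

naive_daily_bill = frames_per_day * cost_per_frame      # the six-figure number above

print(f"{frames_per_day:,} frames/day -> ~${naive_daily_bill:,.0f}/day in input tokens alone")
```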

The only architecture that pencils out is a local gate that runs in front of Claude and only forwards the surviving 250 events per day, packaged as a single 5-frame stitched image per event. The rest of this page is the gate, the prompt, the outbox, and the arithmetic.

The frame funnel, as numbers

Four stages, four rate cuts. Each row below is a real measurement on a 25-camera Class C multifamily property (median over a 30-day window) of the rate that survives that stage. If any stage in this funnel is missing, the next stage cannot afford its inputs.

Row 1: 32,400,000 raw frames per day, 25 cameras at 15 fps
Row 2: frames per day with a relevant detection
Row 3: ~250 events surviving zone + dwell filter
Row 4: 1 Claude request per surviving event

The first cut is the on-device object detector (no detections, no candidate). The second cut is the zone polygon (detection outside an armed region is dropped on the floor). The third is the dwell timer (a person walking through the frame in 0.8s is not an event). The fourth, the only one that touches Anthropic, is the 5-frame stitched burst that becomes one multimodal request. From row one to row four, that is roughly 130,000 raw frames for every Claude request.
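What the gate has to decide per detection is small enough to sketch. The Python below is a minimal illustration of the first three cuts, assuming a detector that emits normalized coordinates and a per-zone dwell threshold; the class names, confidence cutoff, and reset behavior are illustrative, not the production pipeline.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str      # e.g. "person"
    cx: float       # normalized center x, 0..1
    cy: float       # normalized center y, 0..1
    conf: float

@dataclass
class ZoneState:
    polygon: list[tuple[float, float]]   # armed region, normalized vertices
    dwell_threshold_s: float             # e.g. 90.0 for loitering
    first_seen: float | None = None      # when the current track entered the zone

def point_in_polygon(x: float, y: float, poly: list[tuple[float, float]]) -> bool:
    """Ray-casting test: is (x, y) inside the armed polygon?"""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def gate(det: Detection, zone: ZoneState, now: float) -> bool:
    """True only when a detection has dwelled inside an armed zone long enough
    to become a candidate event, i.e. the thing that may be sent to Claude."""
    if det.label != "person" or det.conf < 0.5:
        return False                              # cut 1: no relevant detection
    if not point_in_polygon(det.cx, det.cy, zone.polygon):
        zone.first_seen = None
        return False                              # cut 2: outside the armed polygon
    if zone.first_seen is None:
        zone.first_seen = now                     # track just entered the zone
    if now - zone.first_seen < zone.dwell_threshold_s:
        return False                              # cut 3: dwell threshold not yet met
    zone.first_seen = None                        # fire once per dwell episode
    return True                                   # escalate: stitch the 5-frame burst
```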

One frame to one verdict, every step in between

This is the path a single frame takes from the HDMI cable to a Claude verdict on the operator's dashboard. The model that costs money is the last step. Everything before it is what makes that last step affordable.

from hdmi capture to claude verdict

HDMI capture: 1920x1080 composite, 15 fps
Tile router: tile: rear-gate-04
Local detector: person @ (0.71, 0.48), conf 0.92
Zone + dwell: armed polygon? yes; dwell? +1.2s; loitering_dwell_exceeded fires
Burst stitcher: stitch 5 frames @ 1 fps into 2x3 grid
Claude API: messages.create(image + prompt) returns {threat_level, summary, policy}
Operator: alert with sentence + frames

The escalation funnel, drawn

The local pipeline is the hub. Three feeds come in (the composite from the DVR, the on-device VLM's low-confidence flag, and the high-stakes event classes). Three things go out (the immediate local alert, the Claude verdict, and the natural-language index entry that powers the footage search). Anthropic only sees the middle path.

inputs to the gate, outputs from the gate

Inputs: DVR/NVR HDMI composite; low-confidence local VLM flag; high-stakes event class
Hub: Cyrano edge gate
Outputs: local alert (instant); Anthropic Claude verdict; natural-language search index

The five-frame stitch, and the request that goes out

When the local pipeline decides an event deserves a verdict, the burst stitcher pulls 5 frames at 1 fps from a window centered on the dwell-threshold instant, lays them out as a 2x3 grid (5 cells of imagery plus a 6th cell with the event metadata as text on a black background), and sends the result as one image block to the messages endpoint. One image, one bill, one verdict.
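A minimal sketch of that stitcher, assuming Pillow and five pre-extracted JPEG frames; the cell size, font handling, and metadata layout are simplified for illustration and are not the production code.

```python
import json
from PIL import Image, ImageDraw

CELL_W, CELL_H = 512, 384   # illustrative cell size; 3x2 cells -> a 1536x768 canvas

def stitch_burst(frame_paths: list[str], event_meta: dict) -> Image.Image:
    """Lay out 5 frames plus a metadata cell as a 2-row x 3-column grid."""
    canvas = Image.new("RGB", (CELL_W * 3, CELL_H * 2), "black")
    for i, path in enumerate(frame_paths[:5]):
        frame = Image.open(path).convert("RGB").resize((CELL_W, CELL_H))
        col, row = i % 3, i // 3
        canvas.paste(frame, (col * CELL_W, row * CELL_H))
    # 6th cell (bottom-right): event metadata rendered as text on black
    draw = ImageDraw.Draw(canvas)
    draw.multiline_text((2 * CELL_W + 8, CELL_H + 8),
                        json.dumps(event_meta, indent=2), fill="white")
    return canvas

# stitched = stitch_burst(burst_frames, {"event": "loitering_dwell_exceeded",
#                                        "camera": "rear-gate-04", "dwell_s": 91.2})
# stitched.save("claude_input.jpg", quality=85)
```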

cyrano/escalation/claude_request.py

The request body has exactly two content blocks: one image, one text. max_tokens is 250. temperature is 0. The response is parsed as strict JSON; a malformed reply is treated as a missing verdict and the operator sees the local-only event with no degradation.
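A hedged sketch of that request using the Anthropic Python SDK. The two-block content array, max_tokens of 250, temperature of 0, and the treat-malformed-as-missing rule are as described above; the model id, file handling, and validation details are illustrative.

```python
import base64
import json
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def request_verdict(stitched_jpg: str, prompt_text: str,
                    model: str = "claude-sonnet-4-20250514") -> dict | None:
    """One image block + one text block in, strict JSON out (or None)."""
    image_b64 = base64.b64encode(open(stitched_jpg, "rb").read()).decode()

    message = client.messages.create(
        model=model,
        max_tokens=250,
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64}},
                {"type": "text", "text": prompt_text},
            ],
        }],
    )

    # Strict parse: a malformed reply is treated as a missing verdict.
    try:
        verdict = json.loads(message.content[0].text)
        assert {"threat_level", "one_line_summary", "policy_match"} <= verdict.keys()
        return verdict
    except (json.JSONDecodeError, AssertionError, AttributeError, IndexError):
        return None
```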

The prompt that decides the verdict, on disk

One plain text file per event class, sitting in /var/lib/cyrano/prompts/. A property manager who decides “hooded approach to the rear gate after midnight is HIGH regardless” edits the file and reloads. There is no build, no deploy, no model retrain. The file below is the loitering prompt, unmodified.

/var/lib/cyrano/prompts/loitering_dwell_exceeded.txt

The prompts directory has a .history subdirectory; every save writes a numbered backup so an over-eager edit can be rolled back. Versioning the prompt is part of versioning the product.
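The save path is simple enough to sketch. The snippet below is one way the numbered-backup behavior could look, assuming the directory layout above; the helper names and numbering scheme are illustrative.

```python
import shutil
from pathlib import Path

PROMPT_DIR = Path("/var/lib/cyrano/prompts")

def load_prompt(event_class: str) -> str:
    """Read the plain-text prompt for an event class at request time."""
    return (PROMPT_DIR / f"{event_class}.txt").read_text()

def save_prompt(event_class: str, new_text: str) -> Path:
    """Write a numbered backup into .history/ before replacing the live file."""
    live = PROMPT_DIR / f"{event_class}.txt"
    history = PROMPT_DIR / ".history"
    history.mkdir(exist_ok=True)
    n = len(list(history.glob(f"{event_class}.*.txt"))) + 1
    backup = history / f"{event_class}.{n:04d}.txt"
    if live.exists():
        shutil.copy2(live, backup)       # the rollback target for an over-eager edit
    live.write_text(new_text)
    return backup
```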

The verdict that comes back

Strict JSON, three fields, no prose. The operator's incident view renders the one_line_summary as the alert headline, uses threat_level to color the alert chip (HIGH = red, LOW = neutral), and shows policy_match as a small badge. The exact JSON Claude returned is stored next to the stitched image on disk so the alert can be audited frame by frame after the fact.
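For concreteness, a verdict of that shape looks something like the following. The values are invented for illustration (the summary sentence is the example used later on this page); this is not the stored response for any real event.

```json
{
  "threat_level": "HIGH",
  "one_line_summary": "Two adults in dark hooded jackets, one carrying a long metal object, approached the rear gate at 02:14.",
  "policy_match": true
}
```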

/var/lib/cyrano/events/e_4193/claude_response.json
$0.04 / day

On a 25-camera property the gate yields ~250 candidate events per day. Each one becomes one Claude request carrying a 5-frame stitched image, ~1.31M image tokens total per day. At Sonnet 4 input pricing that is roughly four cents per property per day. The corresponding naive bill, every frame to Claude, is roughly $97,200 per day. The ratio is the architecture.

Cyrano arithmetic, anthropic.com/pricing

What changes between the naive build and the gated build

Same camera, same model, same prompt. What changes is what reaches the API. The two columns below are the actual difference between wiring a CCTV system to Anthropic in an afternoon and wiring it in a way you can leave on for a year.

Feature by feature, naive (every frame to Claude) versus gated (with the Cyrano gate):

Frames sent to Claude per day: naive 32,400,000; gated 1,250 (5 per event x 250 events)
Image tokens billed per day: naive ~33.9 billion; gated ~1.31 million
Approx daily Anthropic bill, 1 property: naive ~$97,200; gated ~$0.04
Latency from event to alert: naive is bound by model round trips on every frame; gated sends the local alert instantly and appends the verdict
Behavior during WAN outage: naive blocks all detection; gated leaves local alerts unaffected and queues verdicts
Where the prompt lives: naive keeps it inside app code; gated keeps it at /var/lib/cyrano/prompts/<event>.txt
Auditability of a verdict: naive gives you whatever you logged; gated leaves claude_input.jpg + claude_response.json on disk

Where Claude is the right tool, and where it is the wrong one

Three things the on-device detector cannot do, and three things the on-device detector should keep doing forever. Sending the right work to the right model is what makes the bill make sense.

Claude: intent and context

Is this person attempting to force the door, dropping something off, concealing their face, coordinating with another person? Language-shaped questions a 90 MB local detector cannot answer.

Claude: the operator sentence

'Two adults in dark hooded jackets, one carrying a long metal object, approached the rear gate at 02:14.' That sentence is what the operator reads and what the natural-language search indexes against later.

Claude: high-stakes event classes only

Tailgate, pre_action_zone_entry, force_entry. The classes where a wrong call is expensive in either direction. Routine loitering can be served by the local VLM.

Local: per-frame object detection

Every frame, every camera, no remote round trip. Zero retries, deterministic latency, no outage failure mode. The thing that has to keep working when the link is down.

Local: zone polygon + dwell timer

The two filters that turn detections into candidate events. Editable JSON on disk. Dwell state survives a process restart. No model can substitute for this.

Local: HDMI tile router

Decode a 4x4 DVR composite into per-camera tiles so the request to Claude can be camera-scoped. The thing that lets the whole system run on legacy hardware in the first place.

What an event looks like in the log, end to end

Below is the slice of the unit log for one real event from a 16-camera multifamily property. You can see the local pipeline fire, the burst stitcher run, the Claude request go out, the response come back, the operator alert dispatch, and the artifacts land on disk. Nothing in this slice is edited.

cyrano unit log, one event from local detection to claude verdict

The four artifacts each event leaves on disk

Auditability is not optional on a security product. If the model said HIGH and a guard was dispatched, the on-call manager has to be able to see exactly what the model saw and exactly what it was asked. Every event leaves these four files behind.

/var/lib/cyrano/events/e_4193/

1. claude_input.jpg: the 2x3 stitched burst that was sent in the image block, 5 frames of imagery plus 1 metadata cell. Same bytes the model saw.

2. claude_prompt.txt: the exact prompt text loaded for this event class at request time, including the inlined event metadata. A hash of the prompt file is recorded alongside it for later diff.

3. claude_response.json: the verdict as Claude returned it, after schema validation. threat_level, one_line_summary, policy_match. Stored unmodified.

4. operator_alert.json: the exact payload the operator received: SMS body, call script, dashboard chip color. So you can answer 'what did the manager actually see' for any past event.
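Reading those four files back for an audit is a few lines of Python. The paths match the layout above; the helper name is illustrative.

```python
import json
from pathlib import Path

def load_event(event_id: str, root: str = "/var/lib/cyrano/events") -> dict:
    """Pull the four per-event artifacts back off disk for an audit."""
    d = Path(root) / event_id
    return {
        "stitched_image": d / "claude_input.jpg",                 # the exact bytes Claude saw
        "prompt": (d / "claude_prompt.txt").read_text(),          # the exact question asked
        "verdict": json.loads((d / "claude_response.json").read_text()),
        "operator_alert": json.loads((d / "operator_alert.json").read_text()),
    }

# audit = load_event("e_4193")
# print(audit["verdict"]["threat_level"], audit["verdict"]["one_line_summary"])
```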

Wiring this up on your own property in four steps

The wiring is short because the gate handles the load-bearing parts. If you can read your DVR's HDMI output and you have an Anthropic API key with vision access, the path from box to verdict looks like this.

From a legacy DVR to a Claude verdict, in order

1. Plug Cyrano into the DVR HDMI out. The unit decodes the composite and routes per-camera tiles. No camera replacement.

2. Define zones in the dashboard. Polygons, dwell thresholds, arming schedules. Stored as JSON on the unit, editable on disk.

3. Drop your Anthropic key into the unit. /var/lib/cyrano/config/anthropic.json holds api_key, model, and the per-event-class prompt paths (an illustrative layout follows this list).

4. Watch the first events come back with verdicts. Local alert instant. Claude verdict appended within ~2 seconds in the operator incident view.
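An illustrative layout for that config file, assembled only from the fields this page names (api_key, model, per-event-class prompt paths); the exact key names and model id on a production unit may differ.

```json
{
  "api_key": "sk-ant-...",
  "model": "claude-sonnet-4-20250514",
  "prompts": {
    "loitering_dwell_exceeded": "/var/lib/cyrano/prompts/loitering_dwell_exceeded.txt",
    "tailgate": "/var/lib/cyrano/prompts/tailgate.txt",
    "pre_action_zone_entry": "/var/lib/cyrano/prompts/pre_action_zone_entry.txt",
    "force_entry": "/var/lib/cyrano/prompts/force_entry.txt"
  }
}
```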

The four configuration knobs that control the bill

Four dials (and one model string) decide what the property pays Anthropic each month. Each lives in a config file on the unit, editable on disk, no platform deploy. If the bill is wrong, it is one of these; an illustrative config sketch follows the list.

The dials, in order of impact

  • Burst length (default 5 frames). Higher = better intent reading, more tokens.
  • Burst fps (default 1). Higher = closer-spaced motion, but more tokens per event.
  • Local-VLM escalation threshold (default 0.65). Above = local sentence; below = Claude.
  • Per-class allowlist for Claude (default: tailgate, pre_action_zone_entry, force_entry).
  • Anthropic model id in /var/lib/cyrano/config/anthropic.json (Sonnet vision tier).
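One illustrative shape the escalation config could take. Only the default values, the allowlisted classes, and the /var/lib/cyrano/config/vlm_escalation.json path come from this page; the field names and the choice to keep the burst dials in the same file are assumptions.

```json
{
  "escalation_threshold": 0.65,
  "burst_length_frames": 5,
  "burst_fps": 1,
  "claude_event_classes": ["tailgate", "pre_action_zone_entry", "force_entry"]
}
```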

The uncopyable part

You can open the exact image and prompt that produced any alert.

On a Cyrano unit, the artifacts behind a Claude verdict do not live inside an opaque service. They live in /var/lib/cyrano/events/<event_id>/ as four files: claude_input.jpg, claude_prompt.txt, claude_response.json, and operator_alert.json. If an alert fired and the on-call manager wants to know what the model saw and what it was asked, the files are right there. If a future audit asks whether the prompt that produced a verdict in March was the same one that produced a verdict in May, the prompt history directory has both versions. The Anthropic API call is one step in a system that exists to make that step legible.

Things the gate quietly stops from ever reaching Claude

Resident walking to the mailroom
Empty parking lot at 3 a.m.
Wind rustling a bush in the corner of frame
Delivery driver dropping a package and leaving
Maintenance worker on a known route
Reflection on a wet sidewalk
Cat crossing the parking lot
Person walking past, not into, a zone
Resident smoking near door during the day
Camera glare from headlights

Every one of these is a real frame the camera captured. Every one of these is a frame Anthropic never sees. That is the part of the architecture that turns a six-figure naive bill into four cents per day.

What to ask before you put Claude in front of any camera

Five questions. If a vendor or an internal team cannot answer them with a number, a file path, or a behavior description, they do not have a production architecture; they have a demo with a credit card attached.

  • What is the per-camera-per-day frame count after your local filter, and what does that translate to in image tokens at the current Sonnet vision price?
  • Where is the prompt for each event class stored, and who can edit it?
  • How many frames does one request actually carry, and how are they stitched?
  • What is the outbox behavior when the API is unreachable for six hours?
  • Where are the input image and the response written so the alert can be audited later?

On a Cyrano unit those answers are: ~10 surviving events per camera per day at ~1,048 image tokens per stitched request; /var/lib/cyrano/prompts/<event_class>.txt edited by an operator; one request per event carrying a 2x3 grid of 5 frames plus a metadata cell; an append-only NDJSON queue at /var/lib/cyrano/outbox/claude/ that drains in order on link return; and the four files per event in /var/lib/cyrano/events/<event_id>/.

See one real event hit the Anthropic API on a live unit

A 15-minute call. We open /var/lib/cyrano/events/ on a production unit, replay the last loitering event, show you the stitched 2x3 input image, the prompt that loaded, and the JSON Claude returned. You leave with the per-property-per-month arithmetic for your own camera count.

Anthropic Claude vision for security cameras: frequently asked questions

Can I just point Anthropic Claude vision at my CCTV feed and get alerts?

Mechanically yes; economically no. A 25-camera property running at 15 frames per second emits 32,400,000 frames per day. Each 1024 by 768 frame is roughly 1,048 image tokens. At Claude Sonnet 4 input pricing (~$3 per million input tokens), naive frame-by-frame submission is on the order of $100,000 per property per day before you add the per-image output tokens or any retries. The API is not the bottleneck. The frame count is the bottleneck. The only viable architecture is a local gate that drops more than 99.99 percent of frames before they reach the model.

What does the local gate actually have to do before a frame is allowed through to Claude?

Three things. First, decode the DVR or NVR composite (often a 4x4 HDMI tile grid) into per-camera tiles so the request to Claude can be camera-scoped. Second, run an on-device object detector on every frame and discard any frame that does not contain a relevant detection inside an armed zone polygon. Third, run a dwell timer on the surviving track and only escalate after the track has held for the threshold defined for that zone (90 seconds for loiter, 1.2 seconds for tailgate, 900 seconds for package). The detector and the dwell map are local; only the surviving 5-frame burst around the candidate's peak is sent to Claude.

Why send 5 frames to Claude instead of one or the entire clip?

One frame loses motion context: a person standing still in a hallway and a person caught at the apex of a stride look identical in a single still. A whole clip is wasteful: 30 seconds at 15 fps is 450 frames, or roughly 470,000 image tokens for one verdict. Five frames at 1 fps spaced around the dwell-threshold instant gives Claude enough temporal context to read intent (approaching the door, walking away, putting something down, tampering) without the cost ballooning. On a Cyrano unit those five frames are stitched into one 2x3 grid image plus a metadata block so it goes out as a single multimodal request, not five.

What is Cyrano actually asking Claude to do that the on-device detector cannot?

Two things the local model has no business attempting. First, intent and context classification: 'is this person attempting to force the door,' 'is this person dropping something off,' 'is this person concealing their face deliberately,' 'does this look like a delivery hand-off or a theft.' Those are language-shaped questions and a 90 MB on-device detector cannot answer them. Second, structured natural language descriptions for the operator's incident feed: 'two adults in dark hooded jackets, one carrying a crowbar-shaped object, approaching the rear gate at 02:14.' Those descriptions are what makes the natural-language footage search work after the fact. The detector says 'person'; Claude says the sentence.

What is the actual per-property monthly cost when the gate is in front of Claude?

On a 25-camera multifamily property the gate yields roughly 250 candidate events per day across all cameras (the median we see in production). Each candidate becomes one Claude request carrying a 5-frame stitched image. That is ~1,250 frame-equivalents, or ~1.31 million image tokens, plus ~50 input text tokens per request and ~120 output tokens per response. At Sonnet 4 pricing that lands at roughly $0.04 per property per day, or roughly $1.20 per month. The corresponding naive number (every frame to Claude) is roughly $97,200 per day. The ratio is about 2.4 million to 1; the gate is the entire economics.

Where does the prompt that goes with the frame burst live, and what does it look like?

On a Cyrano unit the prompts live in /var/lib/cyrano/prompts/ as plain text files, one per event class. The loitering prompt is different from the tailgate prompt is different from the package prompt. Each prompt asks Claude to return strict JSON with three fields: threat_level (LOW or HIGH), one_line_summary (the operator-facing sentence), and policy_match (whether the activity violates a posted property policy). The text is editable on disk; no model retrain, no platform deploy. A property manager who decides 'we treat anyone in the pool after 22:00 as HIGH regardless' edits the pool prompt and the rule takes effect on the next event.

What happens to events that fire while the WAN link to Anthropic is down?

The on-device verdict (zone hit, dwell threshold met, event class) does not depend on Claude and is dispatched to the operator immediately with whatever fields the local model can fill. The 5-frame stitched burst plus the prompt are written to /var/lib/cyrano/outbox/claude/ as an append-only NDJSON record. When the link returns, the queue drains in order; Claude's verdict is appended to the event record after the fact and the operator's incident view updates. The natural-language search index is rebuilt for the lagged events as they come back. The detection layer never blocks on a remote API.
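A minimal sketch of that append-then-drain behavior, assuming one NDJSON record per queued event; the file name inside the outbox directory and the send callback are illustrative.

```python
import json
from pathlib import Path

OUTBOX = Path("/var/lib/cyrano/outbox/claude/queue.ndjson")

def enqueue(event_id: str, stitched_jpg: str, prompt_text: str) -> None:
    """Append one record per queued event; never blocks the detection layer."""
    record = {"event_id": event_id, "image": stitched_jpg, "prompt": prompt_text}
    with OUTBOX.open("a") as f:
        f.write(json.dumps(record) + "\n")

def drain(send_to_claude) -> None:
    """On link return, replay queued records in order and append verdicts."""
    if not OUTBOX.exists():
        return
    lines = [line for line in OUTBOX.read_text().splitlines() if line]
    remaining: list[str] = []
    for i, line in enumerate(lines):
        record = json.loads(line)
        verdict = send_to_claude(record["image"], record["prompt"])
        if verdict is None:            # link dropped again: keep this and the rest queued
            remaining = lines[i:]
            break
        # ...append verdict to the event record, update the search index for the lagged event
    OUTBOX.write_text("\n".join(remaining) + ("\n" if remaining else ""))
```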

Why not just run a local vision-language model and skip Claude entirely?

We do, for a portion of events. A small on-device VLM handles the routine cases (a person in a parking lot is a person in a parking lot) and adds the operator sentence locally. Claude is the second-opinion path used when the local VLM's confidence is below a threshold or when the event class is one of the high-stakes ones (tailgate, pre-action zone entry, force entry). The handoff threshold is a single number in /var/lib/cyrano/config/vlm_escalation.json. That number is what governs the trade between monthly Anthropic spend and the share of events that get the higher-quality verdict.

How is the image actually sent to Claude, and what fields are required?

Claude's messages API accepts a content array that mixes text blocks and image blocks. On Cyrano the request body has one image block (the 2x3 stitched burst, base64 encoded, source.media_type 'image/jpeg') and one text block (the event metadata as JSON plus the prompt template). The model used is whichever Sonnet vision model is current in /var/lib/cyrano/config/anthropic.json, max_tokens is 250, temperature is 0. The response is parsed as strict JSON and rejected if the schema does not match; a malformed reply is treated as a missing verdict and the operator sees the local-only event.

If Claude vision is the smart layer, what stops the whole product from being a wrapper around the API?

Three things, all expensive to build. The HDMI capture and per-camera tile router that lets the system work on legacy DVR/NVR composites without touching the cameras. The on-device detector and zone+dwell filter pipeline that turns 32.4 million daily frames into 250 events. The 5-frame stitching and prompt-versioning layer that makes Claude's requests cheap and reproducible. Without those three the API call is unaffordable and the verdicts are unauditable. Cyrano is mostly the things that have to be true for one Claude request per event to make sense; the request itself is a small piece of the system.

Can the operator see exactly which frames went to Claude for a given alert?

Yes. Every alert in the dashboard has a 'show inputs' affordance that opens the actual 2x3 stitched image plus the literal prompt text that produced the verdict, plus the raw JSON Claude returned. The artifact for event ID e_4193 is /var/lib/cyrano/events/e_4193/claude_input.jpg and /var/lib/cyrano/events/e_4193/claude_response.json. That auditability is non-negotiable on a security product; if the model said HIGH and dispatched a guard, the on-call manager has to be able to see what the model saw.

🛡️ Cyrano · Edge AI Security for Apartments
© 2026 Cyrano. All rights reserved.

Public and anonymous. No signup.