The April 2026 open source AI CCTV ecosystem split into two architectures. Most lists do not say which project belongs to which.
Frigate, Viseron, Shinobi, Kerberos.io, and Tokay Lite are now clearly classifier-on-live-path. DeepCamera is now clearly VLM-on-retrieval-path with Qwen, DeepSeek, SmolVLM, LLaVA, and YOLO26. ZoneMinder is the previous generation. The two camps want different hardware, different model latencies, and different jobs. This guide places each April 2026 project on that map and shows the frame-budget math that decides which architecture belongs on which path.
Why the split exists, written in milliseconds
A 30 frame-per-second camera feed gives the live path 33.3 milliseconds per frame to do everything it needs to do: decode the frame, run inference, walk the bounding boxes through zone and dwell filters, check the arming schedule, append to the event log, and hand control back so the next frame is not dropped. That number is fixed by physics, not by preference.
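As arithmetic, the budget is just the reciprocal of the frame rate; a two-assert sketch:

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame before the next one arrives."""
    return 1000.0 / fps

assert round(frame_budget_ms(30), 1) == 33.3  # the 30 fps budget above
assert frame_budget_ms(20) == 50.0            # the 20 fps budget cited later in this guide
```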
A quantized YOLOv8-nano on a small NPU lands inside that budget with room to spare. A small VLM on the same hardware tier does not, and is not built to. VLMs are general-purpose models doing multi-token text decoding on top of a vision backbone. They are the right tool for “tell me what happened in the lobby yesterday between 9 and 10pm.” They are the wrong tool for “is that person tailgating right now.” The April 2026 ecosystem has actually figured this out at the project level; the roundups have not caught up.
The two paths, drawn
One unit, two pipelines. The trigger path is cheap and runs every frame. The retrieval path is expensive and runs on demand. They share two artifacts on disk: the outbox and the thumbnail directory. Nothing else couples them.
Live trigger and retrieval, sharing a disk and nothing else
April 2026 lineup, sorted by architecture
Each card below names a project, the camp it now sits in, and the specific reason it sits there. The classifier-on-live-path camp is crowded and mature. The VLM-on-retrieval-path camp is led by one project (DeepCamera) with the rest watching. The previous generation is included so the comparison is honest.
Frigate
Classifier on live path. MobileDet SSD on Coral TPU by default; YOLO via ONNX, TensorRT, OpenVINO, HailoRT for GPU/Hailo users. Tightly integrated with Home Assistant. The de facto reference for the live-path camp in 2026.
DeepCamera (SharpAI)
VLM on retrieval path. Local VLM video analysis with Qwen, DeepSeek, SmolVLM, LLaVA. YOLO26 for cleaner light triggers. Agentic, talks back through Telegram, Discord, Slack. The clearest example of the retrieval-path camp in 2026.
Viseron
Classifier on live path. Python NVR with pluggable backends: CodeProject.AI, DeepStack, YOLO via ONNX. Strong configurability, weaker default ergonomics than Frigate.
Shinobi
Classifier on live path. Node.js NVR with TensorFlow plugin pipeline. Cleaner UI than ZoneMinder, lighter community than Frigate.
Kerberos.io
Classifier on live path. Containerized agent + machinery + hub. Strong if you already operate Docker; agent runs the live ML.
Tokay Lite
Classifier on live path, smallest box. ESP32-S3 board with image sensor and onboard MCU. Built for one or two camera deployments, not 16-25 channel sites.
ZoneMinder
Previous generation. Pixel-diff motion detection at the core, optional YOLO via a Python helper plugin. Stable, slow-moving, fine for pure NVR use; not architecturally an AI-first project.
The frame-budget math, in numbers
Four numbers are the whole argument for why the split exists: a 33.3 millisecond frame budget at 30 fps, a 12.1 millisecond classifier forward pass, a roughly 750 millisecond VLM call on the same hardware tier, and a 2-second tailgate window. They are not benchmarks chosen to flatter Cyrano; they are the published latency profile of the model classes the respective camps actually use today.
At 30 fps a 2-second tailgate event is about 60 frames. A classifier that costs 12 ms per frame fits in the budget with room to spare. A VLM that costs 750 ms finishes its first inference after the door has already latched. The split is forced by the clock.
The trigger loop, in 25 lines of Python
This is the live path. Notice what is not here: there is no VLM call, no LLM hop, no asynchronous wait. The forward pass returns in roughly 12 milliseconds. Everything that follows is local state inspection (zone, dwell, arming) and an append to the on-device outbox. The assert at the bottom is the contract: if a frame takes longer than the frame budget, something is wrong on the unit and we want a loud failure, not a silent dropped frame.
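A minimal sketch of that loop, with the model and the zone/dwell/arming checks injected as hypothetical callables (`infer`, `route_tile`, `passes_filters`, `write_thumb`), since the real implementations are device-specific:

```python
import json
import time
from pathlib import Path

FRAME_BUDGET_MS = 1000 / 30  # 33.3 ms per frame at 30 fps

def trigger_loop(frames, infer, route_tile, passes_filters, write_thumb, outbox_dir):
    """Live trigger path: one classifier forward pass per composite frame,
    then local state checks and an outbox append. No VLM, no LLM hop,
    no asynchronous wait. The four callables are hypothetical stand-ins
    for the model, the tile router, the zone/dwell/arming checks, and
    the thumbnail writer."""
    local_seq = 0
    outbox = Path(outbox_dir)
    for frame in frames:
        t0 = time.monotonic()
        for box in infer(frame):               # ~12 ms quantized classifier
            camera_id = route_tile(box)        # composite coords -> camera id
            if not passes_filters(camera_id, box):
                continue                       # zone mask, dwell, arming window
            local_seq += 1
            event = {"local_seq": local_seq, "camera": camera_id, "ts": time.time()}
            day = time.strftime("%Y-%m-%d")
            with (outbox / f"{day}.jsonl").open("a") as f:
                f.write(json.dumps(event) + "\n")  # shared artifact 1: the outbox
            write_thumb(frame, box, local_seq)     # shared artifact 2: the thumbnail
        elapsed_ms = (time.monotonic() - t0) * 1000
        # The contract: a frame over budget is a loud failure,
        # never a silently dropped frame.
        assert elapsed_ms < FRAME_BUDGET_MS, f"frame overran: {elapsed_ms:.1f} ms"
```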
The retrieval loop, in 15 lines of Python
This is the other path. It runs out of band, on a different schedule, against the thumbnails the trigger path already persisted. It is allowed to spend seconds per query because no door is closing while the operator types. The VLM here is a direct analog of what DeepCamera ships as its primary loop, just moved out of the live path so it never blocks an alert.
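A minimal sketch under the same assumption, with `vlm` as a hypothetical callable wrapping a local vision-language model (SmolVLM, Qwen-VL, LLaVA class) behind whatever runtime is in use:

```python
from pathlib import Path

def retrieval_query(question: str, thumbs_dir: str, vlm):
    """Retrieval path: runs out of band against thumbnails the trigger
    path already persisted. Seconds per query are acceptable because
    nothing on the live path waits on it."""
    hits = []
    for thumb in sorted(Path(thumbs_dir).glob("*.jpg")):
        # Each call may take hundreds of ms to seconds per image.
        answer = vlm(image_path=str(thumb), prompt=question)
        if answer.get("match"):
            hits.append((thumb.name, answer.get("caption", "")))
    return hits
```

Invoked as something like `retrieval_query("anyone matching this person from earlier today", "/var/lib/cyrano/thumbs", vlm)`, and never from the trigger loop.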
What the active model on a Cyrano unit looks like
On a production unit, the model inspection command reports two model entries: the trigger and the fallback for low-light tiles. Forward pass time is bound to the frame budget by design. The runtime line names the actual on-device path: TensorRT int8 on the NPU, no GPU rental, no uplink call.
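An illustrative rendering of those fields, with the layout assumed and the values taken from the figures stated elsewhere in this guide:

```text
model         yolov8n-int8-psv-v4.2.1    classes: person, vehicle, package
fallback      mobilenetv3-small          role: low-light tiles
forward_pass  12.1 ms                    budget: 33.3 ms @ 30 fps
runtime       TensorRT int8, on-device NPU (no GPU rental, no uplink)
```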
A 2-second tailgate, second by second
Below is what the live path looks like when a real tailgate fires. Every line is one event from the trigger loop. The total elapsed wall-clock time from frame ingest to outbox commit on the second person is 47 milliseconds. The door takes about another second to latch. The alert is already in flight.
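A hypothetical excerpt of what lands in the outbox for such an event; the path, the JSONL format, and the monotonic local_seq are from this guide, while the field names, values, and relative timestamps are illustrative:

```jsonl
{"local_seq": 4811, "camera": 7, "class": "person", "rule": "entry", "t_offset_ms": 0}
{"local_seq": 4812, "camera": 7, "class": "person", "rule": "tailgate", "t_offset_ms": 47}
```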
“The live trigger forward pass is 12.1 milliseconds. A general-purpose VLM call on the same hardware tier is roughly 750 milliseconds. That single ratio is the entire argument for why open source AI CCTV in 2026 has split into two architectures, and why a property-grade unit has to run both paths in parallel rather than choosing one.”
Cyrano on-device benchmarks, production unit, April 2026
How to read each project, in one row
The table below is the quick way to slot any of the active April 2026 projects into the right architectural camp. The questions on the left are the ones an honest evaluation answers before comparing UIs or feature lists.
| Feature | VLM on retrieval path | Classifier on live path |
|---|---|---|
| Reference open source project | DeepCamera (SharpAI), 2026 onward | Frigate, Viseron, Shinobi, Kerberos.io, Tokay Lite |
| Default model class | Qwen-VL, SmolVLM, LLaVA, DeepSeek-VL | MobileDet, YOLOv8-nano, YOLO26-small at int8 |
| Per-frame inference budget | Hundreds of ms to seconds per image | Tens of ms, fits inside a 30 fps frame budget |
| Job it does well | Answer 'what happened' in natural language | Fire alerts on tailgate, package, loiter in real time |
| Job it does badly | Real-time triggers under 2 seconds | Open-ended retrospective questions |
| Hardware tier | Mac Mini, AI PC, mid-range GPU box | Mini-PC + Coral TPU, Jetson Orin Nano, NPU box |
| Right deployment | Operator review, incident write-up, search | Live alerting, on-call manager paged |
| Right architecture choice for a property | Optional but useful for the retrieval path | Mandatory for the trigger path |
How a real April 2026 stack composes the two camps
The pattern that works at a property is not picking one camp over the other. It is wiring both, with the trigger path as the authoritative source of alerts and the retrieval path as the tool the operator opens after an alert lands. The steps below describe that wiring in the order it actually happens.
Live + retrieval, end to end
Frame arrives
The DVR HDMI composite is captured at 1920x1080 / 30 fps. One frame, all 16-25 cameras as tiles. The classifier runs once per frame, not once per camera.
Classifier fires (12 ms)
yolov8n-int8 returns bounding boxes in composite coordinates. The tile router maps each box back to a camera id (sketched after these steps). Zone, dwell, and arming filters apply per camera.
Outbox commit
On a triggering event the unit appends to /var/lib/cyrano/outbox/YYYY-MM-DD.jsonl with a monotonic local_seq, captures a thumbnail to /var/lib/cyrano/thumbs/, and hands off to the delivery channel (WhatsApp, SMS, phone call).
Operator gets a phone call
Total elapsed from frame to phone in a typical event is 2 to 4 seconds. The trigger path is done. The VLM has not run yet and that is intentional.
Operator opens the retrieval layer
After the alert, the operator can ask 'show me anyone matching this person from earlier today' in natural language. The retrieval pipeline runs the VLM against thumbnails, takes seconds per query, never blocks new triggers.
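The tile router in step 2 is the one piece of this wiring that is easy to get subtly wrong. A minimal sketch, assuming a uniform 4x4 grid for a 16-channel DVR; real layouts vary per DVR model and would be detected rather than hard-coded:

```python
def route_box(box, frame_w=1920, frame_h=1080, grid_cols=4, grid_rows=4):
    """Map a bounding box in DVR-composite coordinates to a camera tile.
    `box` is (x, y, w, h) in composite pixels; the grid shape is an
    assumption for illustration."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2               # route by the box centre
    tile_w, tile_h = frame_w / grid_cols, frame_h / grid_rows
    col = min(int(cx / tile_w), grid_cols - 1)  # clamp the right/bottom edge
    row = min(int(cy / tile_h), grid_rows - 1)
    camera_id = row * grid_cols + col + 1       # 1-indexed channel number
    tile_x, tile_y = cx - col * tile_w, cy - row * tile_h
    return camera_id, (tile_x, tile_y)          # id plus tile-local centre
```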
What lives on the live path, what lives on the retrieval path
The two checklists below are the practical version of the split. If a feature is on the live list, it has to fit inside the frame budget. If a feature is on the retrieval list, it gets to be slow. The April 2026 mistake is putting a retrieval-class feature on the live path because the demo looked impressive in a video.
On the live trigger path:
- Person / vehicle / package classification per tile
- Per-camera zone mask (where in the frame the alert is allowed)
- Per-zone dwell filter (multi-frame state, in milliseconds; see the sketch after this list)
- Arming-window schedule (per camera, per day-of-week)
- Tailgate / loiter / package-dwell rule evaluation
- Outbox append with monotonic local_seq
- Thumbnail crop and write to local disk
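The dwell filter is the one item on the live list that carries multi-frame state. A minimal sketch, assuming a consecutive-presence timer per zone:

```python
import time

class DwellFilter:
    """Per-zone dwell filter: fire only after a class has been present in
    the zone for `min_dwell_ms` of consecutive frames. A sketch; a real
    filter would also track per-class state and tolerate short frame gaps."""
    def __init__(self, min_dwell_ms: float):
        self.min_dwell_ms = min_dwell_ms
        self.first_seen = None  # monotonic time of first consecutive hit

    def update(self, detected: bool, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        if not detected:
            self.first_seen = None  # streak broken, reset the timer
            return False
        if self.first_seen is None:
            self.first_seen = now
        return (now - self.first_seen) * 1000 >= self.min_dwell_ms
```

`DwellFilter(1500)` fires once the class has been present for 1.5 seconds of consecutive frames; the thresholds themselves are per-zone configuration, not shown here.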
On the retrieval path (out of band):
- Natural language search across the day's thumbnails
- Cross-camera person re-identification
- Long-form incident summary for an operator handoff
- Behavior pattern analysis across days
- Chat-with-the-cameras agent over Telegram, Discord, Slack
- VLM-driven thumbnail captioning for the dashboard
Project names you will see in any April 2026 list
The first six are real April 2026 choices: Frigate, DeepCamera, Viseron, Shinobi, Kerberos.io, and Tokay Lite. The last three are context: ZoneMinder, MotionEye, and iSpy/AgentDVR. Pick from the first group, place each project in the right camp, and the comparison table on a roundup page becomes an architecture diagram instead of a feature checklist.
The thing that is uncopyable
You can run both architectures without choosing one.
The mistake every roundup makes is presenting the open source AI CCTV ecosystem as a list of competitors that share one job. They do not. Frigate and DeepCamera are not competitors; they are the trigger and retrieval halves of the same property security stack. The right unit runs a fast classifier on the live path and a slower VLM on the retrieval path, and neither one waits on the other. On a Cyrano unit those two pipelines share two files on disk (the outbox and the thumbnail directory) and nothing else. Tail /var/lib/cyrano/outbox/*.jsonl during a live event and you can watch the trigger path commit without the VLM ever entering the loop. That is the split, on disk, in real time.
When the open source path is the right answer
Worth being direct: if your cameras are IP-native and expose RTSP per channel, you have a server you can keep up, and you have engineering time to spend on per-camera tuning, the open source path is genuinely good. Frigate plus DeepCamera composes into the same two-path architecture this guide describes. You just have to wire it yourself, and you carry the operations burden when a Coral TPU disconnects, when a Home Assistant update breaks the integration, or when a model swap requires re-tuning detection zones.
The open source path is wrong when you have a DVR with cameras over coax (the common case in Class B and C multifamily, gas stations, and small commercial), or when the engineering time required to operate the open source stack outweighs a flat subscription. In those cases a unit that ships both paths pre-wired and plugs into the existing DVR over HDMI is the honest choice. The architecture is the same; the build-vs-buy line is what moves.
Watch a Cyrano unit run both architectures live
A 15-minute call. We trigger a real tailgate against a production unit, show the trigger path commit in 12 ms, then run a VLM retrieval query against the same day's thumbnails so you can see the split on a single device.
Open source AI CCTV, April 2026: frequently asked questions
What are the active open source AI security camera projects as of April 2026?
The April 2026 lineup splits into two camps. Classifier-on-live-path projects: Frigate (the most popular self-hosted AI NVR, MobileDet on Coral TPU by default, YOLO via ONNX or TensorRT, integrated with Home Assistant), Viseron (Python, configurable detector backends including CodeProject.AI and DeepStack), Shinobi (Node.js NVR with TensorFlow plugin pipeline), Kerberos.io (containerized agent + machinery + hub with on-agent ML), and Tokay Lite (an ESP32-based AI camera dev board for very small deployments). VLM-on-retrieval-path projects: DeepCamera (SharpAI, January 2026 onward heavily VLM-centered with Qwen, DeepSeek, SmolVLM, LLaVA, plus YOLO26 for lighter triggers, agentic via Telegram, Discord, Slack). ZoneMinder still exists, still works, still uses pixel-diff with optional ML plugins, but it is not really an AI-first project. The split below is what determines which one fits a given site.
What is the trigger-vs-retrieval architectural split in plain terms?
Trigger path = the code that decides 'fire an alert right now.' This has to run inside the frame budget of the camera feed, which on a 30 fps feed is about 33 milliseconds per frame. If the model on this path takes longer than the frame budget, you start dropping frames, and dropped frames during a fast event mean the event is missed. Retrieval path = the code that answers 'find me the clip where the person in the red coat walks past the mailroom.' This is asynchronous, runs against stored footage or stored thumbnails, and is allowed to take seconds per call because no one is waiting on the door. The two paths want different models. The trigger path wants a small quantized classifier (YOLO-nano-class, MobileNetV3, EfficientDet-D0 at int8). The retrieval path wants a VLM with broad semantics (Qwen-VL, SmolVLM, LLaVA). Most April 2026 roundups conflate them by listing 'AI CCTV' as a single capability, so a property operator who picks DeepCamera for the live path is silently mis-architected.
Why does a tailgate event in particular force this split?
Because the event window is roughly 2 seconds. A tailgater enters the frame, the door is already swinging shut, and either the system fires an alert before the door latches or the unauthorized person is inside the secured space. At 30 fps that is about 60 frames. To detect the entry and emit a routable alert in under 60 frames you need a forward pass that is much smaller than the frame budget so there is room for downstream zone, dwell, and dedupe logic. A quantized YOLOv8-nano runs in roughly 12 milliseconds per forward pass on a small NPU, which leaves about 21 milliseconds of headroom inside the 33 millisecond slot. A SmolVLM-256M call on the same hardware tier needs hundreds of milliseconds per image. The VLM finishes its first inference after the door has already latched, and its second inference after the tailgater is already on the stairwell. That is the architectural force, not a benchmark preference.
Where does DeepCamera fit then? Is it 'wrong' for property security?
Not wrong. Mis-applied. DeepCamera in 2026 is the strongest open source agentic-VLM project for video. It is built so that you can ask 'what happened at my front door this morning' in natural language and have an LLM-driven agent reason over recent footage, remember context across days, and chat back through Telegram, Discord, or Slack. That is the right tool for retrieval and post-hoc investigation. It is the wrong tool for the live trigger on a tailgate, a package theft in progress, or a parking lot loiter ramp because the model latency is wrong by an order of magnitude. The honest deployment is DeepCamera (or any VLM-on-retrieval project) running alongside a fast classifier-on-live-path project, not in place of it.
What does Frigate use on its live trigger path?
Frigate's default detector is MobileDet SSD running on a Google Coral TPU at low single-digit milliseconds per inference, well inside the frame budget. Frigate also supports YOLO variants via ONNX, TensorRT, OpenVINO, and HailoRT for users with a small GPU or a Hailo accelerator. The relevant property is that the live path is bounded to a small fast classifier; the slow stuff (clip indexing, semantic search) sits in separate processes. This is the architecture every classifier-on-live-path project converges on. The differences between Frigate, Viseron, Shinobi, and Kerberos.io are in plumbing and ergonomics, not in the architectural choice.
What does an honest classifier-on-live-path setup look like at a property?
On Cyrano specifically, the live path is one quantized classifier doing one forward pass over the DVR's 1920x1080 HDMI composite frame, then mapping the bounding boxes back to per-camera tile coordinates. The active model on the unit is yolov8n-int8-psv-v4.2.1, person/vehicle/package classes, 12.1 milliseconds per forward pass on the on-device NPU, with a MobileNetV3-small classifier head as a fallback for low-light tiles. One forward pass covers all 16 to 25 cameras because the DVR has already arranged them as tiles. After the box, the pipeline runs a per-tile zone filter, a per-zone dwell filter, an arming-window check, then appends to the on-device outbox NDJSON file. None of that pipeline involves a VLM. The VLM, when it runs, runs offline against thumbnails for retrospective natural-language search.
Can I run Frigate or DeepCamera at a multifamily property today?
Yes, with caveats. Frigate is a great fit if your property's cameras are IP cameras that expose RTSP per camera, you have a server you maintain, and you have someone in-house who is comfortable tuning detector configs. That setup is common in newer Class A buildings and rare in Class B and C buildings where the existing camera plant is analog or HD-TVI on coax wired to a DVR. DeepCamera is a great fit when you want a chat-with-your-cameras agent and you can tolerate retrospective rather than real-time alerting. Most multifamily operators want both: a fast trigger so the on-call manager gets a phone call, and a retrieval layer for incident review. Cyrano ships that pair as a single device that plugs into the DVR HDMI; Frigate plus DeepCamera ships as two open source projects you stitch together.
What hardware do these projects run on in April 2026?
Frigate is most often deployed on a small x86 mini-PC or NUC with a Google Coral USB or M.2 TPU, or on a Jetson Orin Nano. Viseron and Shinobi run on similar hardware and increasingly on a Jetson or a Hailo-8L PCIe card. Kerberos.io is container-first and runs anywhere Docker runs, with optional GPU. Tokay Lite is the smallest of the bunch, ESP32-S3 plus an OV-class image sensor, intended for one or two camera deployments, not 16-25 channel sites. DeepCamera in 2026 targets Mac Mini and AI PC class machines because the local VLMs need 8 to 16 GB of memory and either Apple Silicon unified memory, a small dedicated GPU, or an NPU rated for large-model inference. The hardware ladder maps to the architecture: the smaller the model, the smaller the box; VLMs push the box class up.
What about the open source projects that show up on older lists, like ZoneMinder, MotionEye, or iSpy?
ZoneMinder still works and is still maintained at a slow cadence. Its default analytics layer is pixel-diff motion detection from before the deep learning era, with optional plugins for YOLO via a Python helper. It is fine if you want a pure NVR with light AI and you accept the ergonomics of an early-2010s Perl plus PHP project. MotionEye is largely unmaintained as of 2026. iSpy / AgentDVR is only partially open source and the AI features sit behind a paid SaaS tier. None of these projects sit cleanly in either the classifier-on-live-path or the VLM-on-retrieval-path camp; they are the previous generation. They are worth knowing about for context but they should not be the architecture you choose in 2026.
How do I decide between rolling open source and using a commercial unit like Cyrano?
Two questions. First, what is your camera input layer? If it is per-camera RTSP or ONVIF, the open source category is available and Frigate plus DeepCamera is a credible stack you can self-host. If it is a DVR with cameras over coax (common in Class B/C multifamily, gas stations, small commercial), the open source category is mostly unavailable because none of those projects ingest a DVR HDMI composite at production reliability. Second, what is your engineering tolerance? Frigate plus DeepCamera at 16 cameras is roughly 60-90 hours of standup, plus per-property tuning, plus an ongoing operations burden when models update or a Coral TPU gets disconnected. Cyrano is a $450 unit that auto-detects the DVR over HDMI EDID, ships the live classifier and the retrieval layer pre-wired, and posts to WhatsApp and SMS. Open source is the right answer when the engineering time is free; a flat-priced unit is the right answer when it is not.
Is the classifier-versus-VLM split going to collapse as VLMs get faster?
Eventually, partly. The 2026 trajectory is real: small VLMs (SmolVLM, MobileVLM, MiniCPM-V) are getting faster, and quantized inference on consumer NPUs is closing the gap. But the frame budget on the live trigger path is fixed by the camera feed (33 ms at 30 fps, 50 ms at 20 fps), and VLM inference is doing fundamentally more work per frame because the model is general-purpose. A purpose-trained nano-scale classifier will stay an order of magnitude faster than a general-purpose VLM at the same hardware tier for years. The realistic 2027-2028 picture is small VLMs taking over the dwell and the zone classification (still after the trigger), while the trigger itself remains a small classifier. The split moves; it does not collapse.