RTSP camera AI detection: how it works, and where per-camera pipelines break down
Adding AI detection to an RTSP camera is, mechanically, a solved problem: point a tool at the stream, run a model on the frames, get an alert. The interesting part is not the model. It is the one configuration choice that decides whether your setup carries one camera or twenty, and the point where the per-camera approach stops being a clean answer and becomes a part-time job. This guide covers both.
Direct answer (verified 2026-05-16)
AI detection on an RTSP camera works by pointing a detection tool at the camera's RTSP stream URL. The tool decodes the video, runs an object-detection model on the frames, and fires an alert when it sees a person, vehicle, or package.
Tools that do this include Frigate, Blue Iris paired with CodeProject.AI, Scrypted, and Roboflow Inference. They run on local hardware (a Google Coral accelerator, a consumer GPU) or a hosted endpoint. The one decision that makes or breaks the setup: run detection on the camera's low-resolution substream and reserve the full-resolution main stream for recording. Detecting on the main stream wastes most of your compute and slows every alert down. The approach scales as one pipeline per camera, which is perfect for a house but becomes the constraint on a property with 16-plus cameras.
What RTSP gives you, and what it does not
RTSP, the Real Time Streaming Protocol, is a transport. It is the standard way a camera says "here is my video feed, connect and take it." That is the entire job RTSP does. It has no intelligence, no detection, no notion of a person or a car. A camera that "supports RTSP" is a camera you can read a live video stream from using software you choose, instead of being locked to the manufacturer's app.
That is exactly what makes RTSP the foundation for AI detection. The detection is a separate layer that you bolt on. A detection tool opens the RTSP stream the way a media player would, pulls frames out of it, and runs each frame (or every Nth frame) through an object-detection model. The model returns bounding boxes and labels: person at 0.91 confidence, car at 0.84. Rules on top of that decide whether the detection is worth an alert.
So "RTSP camera AI detection" really means three things stacked: a camera that exposes an RTSP stream, a tool that decodes that stream, and a model that classifies what is in the frames. The next five steps are how you actually assemble that stack.
Adding AI detection to an RTSP camera, step by step
Get both RTSP URLs for every camera
Almost every modern IP camera publishes at least two RTSP streams: a full-resolution main stream and a low-resolution substream. On Hikvision the paths are /Streaming/Channels/101 for the main stream and /102 for the substream; behind an NVR, channel 2 is /201 and /202, and so on. On Dahua the same camera answers cam/realmonitor?channel=1&subtype=0 for main and subtype=1 for sub. You need both URLs plus the camera or NVR username and password. Test each one in VLC before you touch any AI tool: if VLC will not open the stream, no detector will either.
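To keep the URLs straight across a fleet, it helps to generate them from the channel number. The sketch below builds the common Hikvision and Dahua path patterns; the host, port, and credentials are placeholders, and exact paths can vary by firmware, so always verify each URL in VLC.

```python
# Sketch: build the default Hikvision and Dahua RTSP URLs for a camera
# or NVR channel. Host, port, and credentials are placeholders; the
# path patterns are common factory defaults and can vary by firmware.

def hikvision_urls(host, user, password, channel=1, port=554):
    # Hikvision: /Streaming/Channels/<channel>01 = main, <channel>02 = sub
    base = f"rtsp://{user}:{password}@{host}:{port}/Streaming/Channels"
    return {"main": f"{base}/{channel}01", "sub": f"{base}/{channel}02"}

def dahua_urls(host, user, password, channel=1, port=554):
    # Dahua: cam/realmonitor?channel=<n>&subtype=<0|1>, 0 = main, 1 = sub
    base = f"rtsp://{user}:{password}@{host}:{port}/cam/realmonitor"
    return {"main": f"{base}?channel={channel}&subtype=0",
            "sub": f"{base}?channel={channel}&subtype=1"}

print(hikvision_urls("192.168.1.64", "admin", "secret", channel=2)["sub"])
# → rtsp://admin:secret@192.168.1.64:554/Streaming/Channels/202
```

The same two functions cover a camera addressed directly (channel 1) and a camera re-published behind an NVR (channel 2, 3, and so on), which matters again in the NVR question further down.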
Pick where the model runs
The detector has to run somewhere. Three common choices: a Google Coral USB or M.2 accelerator (cheap, handles a handful of substreams comfortably), a consumer NVIDIA GPU (more headroom, more power draw and cost), or a hosted inference endpoint (no local hardware, but every frame leaves the building). For security footage most people keep inference local for latency and privacy. Frigate pairs well with a Coral, Blue Iris pairs with a local CodeProject.AI server, and Scrypted or Roboflow Inference can do either.
Detect on the substream, record on the main stream
This is the decision that quietly determines whether the whole setup is usable. A 4K main stream is roughly 8.3 million pixels per frame; a 640x480 substream is about 0.3 million. Object detectors resize every input frame down to a small square tensor before they run, so feeding them 4K buys you nothing but decode cost. Run detection on the substream and keep the detector cheap; reserve the main stream for recording, where resolution actually matters for evidence. In Frigate this is the detect role versus the record role. In Blue Iris, it is leaving "use main stream if available" unchecked in the AI tab. Get this wrong and one camera can saturate a CPU that should have handled a dozen.
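The arithmetic behind that claim is worth seeing once. Decode cost scales with pixel count, and the detector resizes every frame to a small square tensor regardless of the input, so the main stream pays roughly 27x the decode work of the substream for zero accuracy gain (the 320x320 model input below is a typical nano-detector size, used here as an assumption):

```python
# Back-of-envelope: decode cost scales with pixel count, but the
# detector resizes every frame to a small square tensor before inference,
# so feeding it 4K buys nothing except decode cost.

main_px = 3840 * 2160      # 4K main stream: ~8.3 MP per frame
sub_px = 640 * 480         # 640x480 substream: ~0.3 MP per frame
model_px = 320 * 320       # typical nano-detector input tensor (assumed)

print(main_px / sub_px)    # 27.0 — 27x more pixels to decode per frame
print(sub_px >= model_px)  # True — the substream still oversamples the model
```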
Add zones and an object filter
A raw detector fires on every person, car, and animal in frame, all day. That is noise, not signal. Draw zones (a polygon over the gate, the parking row, the loading dock) and filter to the object classes you care about. A person in the driveway zone after dark is an event; a car passing on the public street behind it is not. Most tools also let you set a minimum confidence and a minimum dwell time, which kills the single-frame false positives.
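In Frigate, for example, a zone plus an object filter looks like the fragment below. The camera name, polygon coordinates, and thresholds are placeholders to adapt; the coordinate format (relative floats versus pixels) varies by Frigate version, so check the docs for yours.

```yaml
cameras:
  driveway:
    zones:
      gate:
        # Placeholder polygon over the gate; relative coordinates
        # in recent Frigate versions, pixel coordinates in older ones
        coordinates: 0.10,0.55,0.48,0.52,0.50,0.95,0.05,0.98
    objects:
      track:
        - person
        - car
      filters:
        person:
          min_score: 0.60   # drop weak single-frame detections
          threshold: 0.72   # confidence required to report an event
```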
Route the alert to something that acts
Detection produces a notification. A notification is not a response. Decide, before you call the setup finished, who or what receives the alert and what they do with it: a phone push you will actually check, a webhook into a chat channel, a siren, a guard. The setup is only as strong as the last link in this chain, and this is the step almost every tutorial skips.
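As a sketch of that last link, here is a minimal Python handler that formats a detection event and posts it to a webhook. The event shape is illustrative (loosely modeled on what tools like Frigate publish over MQTT), and the webhook URL would be whatever endpoint your chat channel or dispatch system exposes; nothing here is a fixed API.

```python
# Sketch: turn a detection event into a webhook call something can act on.
# Event fields and the webhook URL are illustrative placeholders.
import json
from urllib import request

def build_alert(event):
    # Format a human-readable line a chat channel or dispatcher can read
    return {
        "text": (f"{event['label']} in zone '{event['zone']}' "
                 f"on {event['camera']} ({event['score']:.0%} confidence)")
    }

def send_alert(event, webhook_url):
    # Fire the HTTP POST to the webhook endpoint
    payload = json.dumps(build_alert(event)).encode()
    req = request.Request(webhook_url, data=payload,
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

event = {"label": "person", "zone": "gate", "camera": "driveway", "score": 0.91}
print(build_alert(event)["text"])
# → person in zone 'gate' on driveway (91% confidence)
```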
The config line that matters most
Here is step 3 made concrete in a minimal Frigate camera config. One camera, two RTSP inputs: the substream carries the detect role, the main stream carries the record role. The model never touches the 4K feed.
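(The host, credentials, and 640x480 substream size below are placeholders; substitute your camera's values.)

```yaml
cameras:
  front_gate:
    ffmpeg:
      inputs:
        # Low-res substream: the only feed the model ever sees
        - path: rtsp://user:pass@192.168.1.64:554/Streaming/Channels/102
          roles:
            - detect
        # Full-res main stream: recording only, never decoded for inference
        - path: rtsp://user:pass@192.168.1.64:554/Streaming/Channels/101
          roles:
            - record
    detect:
      width: 640
      height: 480
      fps: 5
    record:
      enabled: true
```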
The two paths shown are the Hikvision pattern (102 sub, 101 main). The role split is the same idea in Blue Iris, Scrypted, and any other RTSP detector: cheap stream for the model, full stream for the archive. Frigate documents the full role list in its camera configuration docs.
Where a per-camera RTSP pipeline hits a ceiling
The five steps above are the right answer for a few cameras. The approach starts to strain in three specific places, and all three show up the moment you stop thinking about a house and start thinking about a property someone else installed cameras on.
1. One decode pipeline per camera
RTSP detection scales linearly. Every camera is its own connection, its own decode thread, its own slice of the detector's budget. Two cameras is trivial. Sixteen cameras on one box means sixteen ffmpeg processes, sixteen reconnect-on-drop loops, and an inference queue that has to keep up with all of them. The substream trick from step 3 buys real headroom, but the architecture is still N independent pipelines, and N is the camera count.
2. The credentials are usually missing
On a home setup you set the camera passwords yourself last month. On a commercial property the cameras were installed years ago by a contractor who is no longer reachable, the NVR admin password is on a sticky note that is long gone, and nobody has the per-camera RTSP credentials. You cannot point a detector at a stream you cannot authenticate to. This single problem stalls more "add AI to the existing cameras" projects than any compute limit does.
3. Detection still ends at a notification
Even a perfectly tuned per-camera pipeline produces the same final artifact: an alert. On a property with nobody watching, an alert at 2 AM that lands in an app is hard to distinguish from no alert at all. The hard part of property security was never the detection. It is turning a detection into someone actually showing up.
The other way in: one HDMI composite instead of N RTSP pulls
There is a second ingestion path that sidesteps all three of those problems, and almost no guide on this topic mentions it because it does not look like "AI on an RTSP stream" at all.
Every DVR and NVR built in the last decade already paints a single combined picture of all its cameras: the 1920x1080 mosaic it sends out the HDMI port to a wall monitor in the back office. A 4x4 grid, a 5x5 grid, every camera in one frame. That signal exists whether or not anyone is watching the monitor.
An edge appliance can tap that one HDMI output through a small capture board and run a single inference pass per frame across the entire mosaic. One pass covers all 16 to 25 cameras at once, at roughly 5 frames per second. At a 4x4 grid each camera tile is 480x270 pixels; at 5x5 it is 384x216. Both sit inside the input range a nano-class detector works at, so the model reads the whole property from one frame instead of from N separate streams.
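The tile sizes quoted above fall straight out of dividing the 1080p composite by the grid:

```python
# Mosaic tile math: a 1920x1080 composite divided into an N x N grid
# gives each camera a tile of (1920/N) x (1080/N) pixels.
def tile_size(grid, width=1920, height=1080):
    return width // grid, height // grid

print(tile_size(4))  # (480, 270) per camera on a 4x4 grid
print(tile_size(5))  # (384, 216) per camera on a 5x5 grid
```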
It needs no per-camera RTSP credentials, because it never touches the cameras or the network paths to them; it reads the same composite a guard would have watched on the wall monitor. It needs no PoE switch reconfiguration and no NVR firmware change. The recorder keeps recording. The install is plugging one HDMI cable and a network jack into one box, which is why it takes minutes rather than an afternoon of credential recovery.
This is exactly how the appliance behind this site works. It is not a smarter RTSP detector. It is a different place to stand: one signal that already aggregates every camera, instead of one authenticated pipeline per camera.
Two ingestion paths, side by side
Per-camera RTSP
- Connects by: one authenticated RTSP stream per camera.
- Needs from you: every camera's URL and credentials, plus a decode pipeline per feed.
- Scales: linearly. Compute cost rises with the camera count.
- Best for: 2 to 8 cameras you own and can tune, run by someone technical.
HDMI composite
- Connects by: one HDMI tap on the recorder's existing wall-monitor output.
- Needs from you: nothing per camera. No RTSP credentials, no rewiring.
- Scales: one inference pass covers all 16 to 25 feeds at once.
- Best for: a property or portfolio with cameras someone else installed.
“Caught 40 trespass incidents at a Texas residential property in one deployment, including a break-in attempt, and helped the property file a successful police report. No cameras were replaced.”
Apartment Security Cameras deployment, Texas residential property
Which path fits your situation
If you have two to eight cameras, you are comfortable in a config file, and you enjoy tuning a system, per-camera RTSP detection with Frigate or Blue Iris is excellent and cheap. You get total control: custom zones, custom models, custom automations. The credential problem does not exist because you own the cameras and set the passwords. The scaling ceiling does not exist because you are nowhere near it. Build that, and the step-by-step section above is your whole map.
If you are responsible for a property, or a portfolio of them, with 16 to 25 or more cameras on a recorder someone else installed years ago, the per-camera path quietly turns into a part-time job: recovering credentials, standing up decode pipelines, babysitting reconnects. And at the end of all that work you still have a notification, not a response. The HDMI-composite path trades the per-camera control for one cable, one inference pass, and a setup that can carry the alert past the phone to a guard who is actually dispatched to the scene.
Neither path is "the AI camera." Both are software watching video the cameras already produce. The honest question is not which tool has the smartest model. It is whether you are wiring up a hobby or covering a property, because that is what decides which of these two architectures you should have been on from the start.
Have cameras you cannot get RTSP credentials for?
10-minute call. Bring your camera count and DVR or NVR brand. Leave with a straight answer on whether an HDMI-composite tap would get AI detection onto every feed without recovering a single password.
Frequently asked questions
Do I need special cameras for RTSP AI detection?
No. Any IP camera that publishes an RTSP stream works, regardless of brand or age. AI detection is software that reads the stream the camera already produces; it does not run on the camera itself. On-camera AI (some Hikvision and Dahua models ship with built-in person and vehicle detection) is a separate and more limited thing: it runs on the camera's own small chip and you cannot change the model or the rules. RTSP-based detection runs on hardware you control, so you pick the model, the zones, and the alert logic.
Should AI detection run on the main stream or the substream?
The substream. Object detectors downscale every frame to a small square tensor before they run, typically somewhere between 224x224 and 640x640 pixels, so a 4K main stream just costs decode time without improving accuracy. Detect on the substream (usually 640x480 or 720p) and reserve the full-resolution main stream for recording, where the detail actually matters as evidence. In Frigate this is the detect role versus the record role; in Blue Iris it is leaving "use main stream if available" unchecked. Detecting on the main stream is the single most common reason a setup that should handle a dozen cameras chokes on three.
How many RTSP cameras can one machine handle?
It depends on substream resolution, detector hardware, and detect frame rate, not a fixed number. A single Google Coral accelerator running 5fps substreams comfortably covers a handful of cameras; pushing past a dozen usually means a GPU or splitting the load across boxes. The cost is per camera because each stream is its own decode-and-inference pipeline. This linear scaling is fine for a house and becomes the planning constraint once you are covering a property with 16 to 25 or more cameras.
What if I lost the RTSP username and password for an old camera install?
Then a per-camera RTSP tool cannot connect, because it has to authenticate to each stream individually. Your options are to factory-reset the camera or NVR (which you may also lack the password to), get the original installer back on site, or use an ingestion path that does not need per-camera credentials at all, such as reading the recorder's HDMI composite output. On commercial properties with installs more than five years old, missing credentials are common enough that they decide the architecture.
Do RTSP AI detection tools work behind a Hikvision or Dahua NVR?
Yes. An NVR re-publishes each connected camera on its own RTSP channel path. On Hikvision NVRs, channel 1 is /Streaming/Channels/101 (main) and /102 (sub), channel 2 is /201 and /202, and so on. On Dahua NVRs you change the channel number in cam/realmonitor?channel=2&subtype=0. Point the detector at the NVR's IP address and the right channel path instead of at the camera directly. The substream-versus-main-stream rule still applies to each channel.
Does running AI detection add latency to my cameras?
It adds a small amount to the alert, not to recording or live view. The detector reads a copy of the RTSP stream; the camera and the recorder are untouched and keep doing exactly what they did before. End-to-end alert latency is mostly decode time plus one inference pass, typically well under a couple of seconds on local hardware. Cloud inference adds the network round trip on top, which is one more reason security setups tend to keep the model local.
Why isn't a phone notification enough?
Because a notification only matters if someone acts on it. On a home, you are the responder: the alert reaches your phone and you decide what to do. On a property with nobody watching at 2 AM, an alert that lands in an unwatched app changes nothing on the ground. Detection is the easy half of the problem now that good models are free and run on cheap hardware. The half that actually protects a property is routing the detection to a person who shows up, which is why the last step of any setup matters more than the model.
Adjacent guides on camera AI and detection
AI Object Detection for Security Cameras
What object, behavior, and event detection actually mean on a camera feed, and how each level performs from a single home camera to a 25-camera property.
CCTV Vendor Lock-In: ONVIF and RTSP as a Fallback
Why RTSP and ONVIF support keep your cameras vendor-independent, and how to test whether your current system is quietly locked to one manufacturer.
Cutting False Positives in CCTV Computer Vision
The false-positive flood is the number one reason camera AI gets switched off. How zones, dwell time, and confidence thresholds bring the signal back.