Smart camera alert filtering is a four-stage pipeline. Every top search result talks about stage three.
The usual advice is: turn down sensitivity, draw a detection zone, toggle person-vs-vehicle, set a confidence threshold. That is one stage of four, and it only works if your input is a clean per-camera IP stream. It is not. If you are running this against a DVR, your input is an HDMI multiview carrying 16 or 25 tiles at once, stamped with a running clock, a per-tile camera name strip, and a channel bug that animates. This guide is the four stages that actually filter that input.
Stage one masks the overlays. Stage two turns grid coordinates into per-tile zones. Stage three does the confidence and persistence work every vendor talks about. Stage four deduplicates on tile.label so one loiterer produces one alert, not 18,000.
What the top SERP results all quietly assume
Scylla, Actuate, March Networks, Eufy, Reolink, TP-Link, and the IP Cam Talk threads that dominate the first page on this keyword share one assumption: your cameras deliver clean, per-camera RTSP or ONVIF streams to whatever is doing the filtering. Under that assumption, every piece of advice is reasonable. Tune sensitivity per camera. Draw a polygon zone in the camera app. Toggle person vs vehicle. Set a confidence threshold.
That world does not match a real commercial DVR closet. The cameras are analog or proprietary HD-over-coax. The only output anyone ever looks at is the HDMI multiview going to a wall monitor. The recorder is a black box running firmware from 2019 that refuses to give you RTSP, or gives it to you after two hours on a forum. Every top result skips past that reality because it is inconvenient for their pitch.
The angle of this page is that smart filtering on the HDMI multiview is genuinely different. You need a filter at stage one that nobody talks about (overlay masking), and you need a dedup key at stage four that is keyed on tile.label, not on camera ID, because the multiview does not give you camera IDs. Those two stages are where Cyrano does most of its noise reduction.
The four stages, end to end
This is the pipeline between the HDMI cable you unplugged from the wall monitor and the WhatsApp message that lands on your on-call person's phone. Four filters, in order. Each one drops a different class of noise.
HDMI multiview -> filtered alert
Stage one: overlay_mask
Every DVR stamps three classes of pixels on top of the multiview before it leaves the HDMI port. A running clock (changes every second). A per-tile camera name strip (usually at the top or bottom edge of each tile). A channel bug or mode badge (often animated, in a corner). A naive object detector run on the raw composite frame will lock onto those regions because they are the only pixels that change reliably. Half the detections will be clock digits.
Cyrano ships with mask templates for every supported recorder. At boot, it reads the recorder model and layout_id, pulls the matching template, and blanks those regions before the frame ever reaches the detector. The regions that got blanked are recorded on the event itself as overlay_mask. You can grep any event to see exactly which overlays were dropped, per frame, per tile. Here is what one looks like.
overlay_mask: ["clock", "cam_name_strip", "channel_bug"]
That literal array shows up on every Cyrano event that comes off an HDMI multiview. It is the per-event audit trail proving the detector never saw the pixels that change every second. No IP-camera vendor emits this field, because no IP-camera vendor has an HDMI overlay to mask. If you run a general-purpose detector on an unmasked multiview you will generate one false detection per second per tile, forever.
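To make the pre-inference masking concrete, here is a minimal sketch in Python with NumPy. The region coordinates and the `MASK_TEMPLATE` name are hypothetical; real Cyrano templates are keyed on recorder model plus layout_id, and the exact rectangles vary per recorder.

```python
import numpy as np

# Hypothetical mask template: overlay name -> (x, y, w, h) rectangle in
# composite-frame coordinates. Real templates come from the library keyed
# on recorder model + layout_id; these numbers are illustrative only.
MASK_TEMPLATE = {
    "clock":          (8, 8, 200, 32),
    "cam_name_strip": (0, 238, 480, 20),
    "channel_bug":    (1840, 8, 72, 40),
}

def apply_overlay_mask(frame, template):
    """Blank overlay regions before the frame reaches the detector.
    Returns the masked frame and the overlay_mask audit list."""
    masked = frame.copy()
    applied = []
    for name, (x, y, w, h) in template.items():
        masked[y:y + h, x:x + w] = 0   # zero the pixels the detector must never see
        applied.append(name)
    return masked, applied

# A flat gray 1080p composite stands in for a captured HDMI frame.
frame = np.full((1080, 1920, 3), 128, dtype=np.uint8)
masked, overlay_mask = apply_overlay_mask(frame, MASK_TEMPLATE)
# overlay_mask is exactly the audit array stamped on the event:
# ["clock", "cam_name_strip", "channel_bug"]
```

The point of returning `applied` alongside the frame is that the audit trail and the masking are produced by the same code path, so the event field can never drift from what the detector actually saw.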
Stage two: tile-grid zones
Zones on an IP camera are drawn on that one camera's frame. Zones on an HDMI multiview have to be drawn on the composite and then resolved per tile. Cyrano does that through a two-key lookup: layout_id gives you the grid (4x4-std, 5x5-std, 3x3-std, and so on), and tile.index gives you the row-major cell position. Together they pin a zone to a specific rectangle on the composite frame.
The trick is what happens when the recorder switches layouts (which maintenance staff do more often than you would believe). tile.index shifts, but tile.label (the camera name the DVR stamps on the strip) does not. Cyrano keys zones on tile.label, so a layout change automatically re-anchors every zone. This is the opposite of how IP-camera VMS software works, where zones are keyed on camera ID and every layout change requires a support ticket.
How a zone resolves on the composite frame:
1. Read layout_id at boot.
2. Look up the tile rectangle for each tile.index.
3. Match tile.label from the name strip.
4. Apply the zone in tile coordinates.
5. Run detection, zone-gated.
Stage three: confidence hysteresis and multi-frame persistence
This is the stage every top-ranking page covers. A detection only counts if it survives across 3 to 5 consecutive frames. A new alert only fires above 0.70 confidence, but an existing detection stays alive down to 0.55 so it does not flicker at the threshold. Single-frame noise from shadows, compression, insects, and headlight sweeps never makes it out of stage three.
This is the stage where the familiar advice actually applies. Tune persistence window. Tune confidence upper and lower bounds. Tune minimum bounding-box size in tile coordinates. Cyrano ships reasonable defaults and exposes the tunables in the dashboard.
Stage four: dedup on property + tile.label + event_class
A person who walks into a detection zone and stands there for ten minutes will produce tens of thousands of true-positive detections from stage three. Every one of them is a real person on a real tile. Every one of them is also the same alert. The dedup stage collapses them.
The dedup key is property + tile.label + event_class, plus a cooldown window. Default cooldowns: 30 seconds for person_in_zone, 5 minutes for loiter, 2 minutes for vehicle_dwell, 10 minutes for tamper. Inside the window, new detections with the same key are logged as suppressed but not emitted as alerts. Outside the window, the key resets and the next detection fires a new alert. This is the single biggest reduction in operator-facing noise.
What each stage actually drops
Every suppressed detection gets logged with a reason code so you can audit the funnel. Here is what the stages drop on a representative 25-tile multiview over a 24-hour window at one apartment property.
Before the pipeline, after the pipeline
Same 24-hour window, same recorder, same 25 tiles. What the operator actually sees in their inbox.
Operator inbox on a single apartment property
Sensitivity-based motion on the HDMI composite with no overlay handling. Every clock tick, every channel-bug frame, every compression block on a dark tile triggers. The operator's phone becomes useless within an afternoon and gets silenced by the end of the week.
- ~412,000 candidate detections per 24 hours
- ~280,000 fired on clock / name strip / bug pixels
- One true loitering event = tens of thousands of alerts
- Phone gets muted, real break-in attempts get missed
Filtering on HDMI multiview vs. filtering on a per-camera IP stream
If you have clean RTSP streams for every camera, most of the work on this page does not apply. But most commercial DVRs do not give you clean RTSP streams. This is what changes when the only output is HDMI.
| Feature | Per-camera IP filter | HDMI multiview filter |
|---|---|---|
| Input signal | One RTSP stream per camera, clean pixels | One HDMI composite with 16-25 tiles and overlays |
| Overlay handling | Not needed, no overlays | overlay_mask blanks clock + strip + bug pre-inference |
| Zones | Drawn in camera coords, keyed on camera ID | Drawn in tile coords, keyed on tile.label |
| Layout changes | Not applicable | tile.index shifts, tile.label does not, zones survive |
| Dedup key | camera_id + event_class | property + tile.label + event_class |
| Per-frame inference cost | N streams, N detector passes | 1 composite, 1 detector pass, scatter by tile.index |
| Install effort | Per-camera RTSP setup, IT involvement | Unplug HDMI from monitor, plug into Cyrano, done |
Try the pipeline on your own DVR
15-minute demo on the recorder already in your office closet. We run the four-stage filter live on your multiview and show you the suppression audit log.
The four stages as a tuning order
When alerts feel wrong, tune in this order. Each stage filters a different class of noise, and the later stages can only compensate so much for a miscalibrated earlier stage.
Re-verify overlay_mask on the noisy tile
If you see a steady firehose on one tile, pull a recent event JSON and confirm overlay_mask is ["clock", "cam_name_strip", "channel_bug"]. If the clock position is non-standard, recalibrate that tile's mask in the dashboard. This fixes the 70 percent case in one minute.
Tighten the zone on that tile.label
If overlays are masked but you still get daytime-traffic noise, the zone is too wide. Redraw the zone in the dashboard. Because zones are keyed on tile.label, a DVR layout change will not break it.
Raise persistence or fire threshold
If the zone is tight but single-frame artifacts still slip through, push persistence from 3 to 5 frames or raise the fire threshold from 0.70 to 0.80. Watch the dropped:persistence_failed counter climb.
Lengthen dedup cooldown on dwell events
If you are seeing multiple alerts for the same person standing in the same zone over ten minutes, extend cooldown on loiter from 5 minutes to 10. Dedup keys reset when tracked identity changes, so this does not suppress a different person entering the zone.
Why each stage belongs where it is
Stage one is first because overlays never sleep
The clock changes every second, forever. If overlay_mask runs after inference instead of before, the detector has already wasted its budget on the pixels that were going to be thrown away. Masking pre-inference is cheaper and cleaner.
Stage two is before the detector
Zones gate which tile-rectangles the detector runs on, not which detections it emits. Pre-inference zoning cuts compute in half on properties where half the tiles are parking-lot views you do not care about at night.
Stage three is after the detector
Persistence and hysteresis only make sense once you have bounding boxes to track frame-to-frame. They do not belong earlier in the pipeline.
Stage four is last because dedup needs event_class
You cannot deduplicate until you know the detection type. A person_in_zone and a loiter on the same tile.label at the same time are different alerts and should both fire.
Every drop is reversible
Stages one through four all log suppressed detections with a reason code. You can replay them offline to see what would have fired under different parameters. No drop is silent.
Works with the recorder already in your closet
The overlay_mask template library ships with coverage for the DVR and NVR brands Cyrano sees in the field. Each entry is a clock position, a cam-name-strip region, and a channel-bug region per supported layout_id.
A representative 24-hour funnel
One apartment property in Fort Worth. One recorder. 16 active tiles on a 4x4-std layout. 24 hours of traffic. The funnel numbers come from a single suppression_audit.log export, with percentages relative to candidate_detections_total.
“At one Class C multifamily property in Fort Worth, Cyrano caught 20 incidents including a break-in attempt in the first month. Customer renewed after 30 days.”
Fort Worth, TX property deployment
Frequently asked questions
What exactly is smart camera alert filtering on an HDMI multiview, and why is it different from filtering on IP cameras?
Smart camera alert filtering on an HDMI multiview means applying object detection and gating rules to a single composite frame that the DVR has already tiled into a 4x4 or 5x5 grid, with a running clock, per-tile camera name strip, and a channel bug overlaid on top. It is different from filtering on IP cameras because the input is one decoded stream carrying 16 or 25 scenes, not 16 or 25 independent streams. That one difference changes every filter. Zones are grid coordinates, not camera coordinates. Object detection runs once per composite frame, then results get scattered back out by tile.index. And every detection has to survive an overlay mask step, because otherwise the model will chase the clock digits flipping once per second.
What is the overlay_mask field and why does it show up on every Cyrano event?
overlay_mask is a JSON array field on every event Cyrano emits. Its value is typically ["clock", "cam_name_strip", "channel_bug"]. It lists the three classes of DVR overlay pixels that got blanked out before inference ran. Cyrano records it on every event so you have a per-event audit trail proving the detector never saw the pixels that change every second. No IP-camera vendor emits this field because they have no DVR overlay to mask. On an HDMI multiview, skipping this step produces a firehose of motion alerts pinned to the running clock.
How does Cyrano define detection zones on a tiled multiview?
Zones are drawn on the composite frame but resolved per tile. Cyrano reads layout_id at boot (for example 4x4-std or 5x5-std), computes the pixel rectangle for each tile.index, and stores your zone in tile-relative coordinates. When the recorder switches layouts, tile.index shifts but tile.label (the camera name the DVR overlays on the strip) does not, so your zones re-anchor automatically. This means a property manager can draw one zone once, on the monitor they are already looking at, and have it follow that camera through layout changes and firmware updates.
What is the multi-frame persistence filter and how many frames does it require?
Multi-frame persistence means a detection has to survive across 3 to 5 consecutive frames of the multiview stream before it can fire an alert. At 30 fps that is roughly 100 to 170 ms. Single-frame detections are intrinsically noisy: a headlight glare, a shadow sweep across a tile, a compression artifact, or the channel-bug animation cycle can all produce a confident bounding box on one frame. Requiring the bounding box to stay on the same tile across several frames collapses those artifacts to zero without losing real people or vehicles, who persist for seconds.
How does event dedup work when the same person stays in frame for ten minutes?
Every candidate alert gets a dedup key of property + tile.label + event_class. If that key has already fired within a configurable cooldown window (default 5 minutes for loiter, 30 seconds for person_in_zone), the new detection is suppressed rather than emitted. Practically, this means one loiterer standing at a mailbox for ten minutes produces one alert, not 18,000 frames of alerts. The cooldown is per key, so a second person walking into the same tile while the first is still dwelling still fires, because the dedup key resets when the tracked object identity changes.
What does one Cyrano event payload actually look like after filtering?
Nine fields plus a 480x270 JPEG thumbnail. tile.label is the camera name the DVR stamps on the strip (for example "Loading Dock NE"). tile.index is the row-major grid position starting at 0. tile.coords is x, y, w, h in the composite frame. property is the site identifier. layout_id is the recorder layout. overlay_mask is the array of overlays that were blanked. event_class is the detection label (person_in_zone, vehicle_dwell, loiter, tamper). iso8601_ts is the recorder clock timestamp. latency_ms is capture-to-delivery time. The thumbnail is a crop of just the triggering tile. Same shape from every Cyrano unit in the portfolio.
How fast is the end-to-end filtered alert on a 25-tile multiview?
Median capture to delivery is 7 to 8 seconds, with a 5 to 15 second envelope across the portfolio. That includes HDMI capture, overlay masking, batched inference across all visible tiles, multi-frame persistence, dedup, thumbnail crop, and WhatsApp or webhook delivery. Adding tiles does not linearly add latency because the detector runs once on the composite frame, not once per tile. Every event records latency_ms, so you can chart p50 and p95 per property and catch a slow uplink before staff notice missed alerts.
Does the filter pipeline require per-camera configuration or does it calibrate automatically?
Most of it auto-calibrates. layout_id detection, overlay_mask template matching, and the persistence filter are all derived from the multiview itself at boot. The only thing that benefits from manual tuning is zones. Out of the box, Cyrano treats every tile as a full-frame zone and runs person, vehicle, and loiter detection on all of them. You narrow zones later ("only alert on person_in_zone inside the dashed box on Loading Dock NE") once you have seen a week of what the cameras actually see. Calibration is done through the dashboard, not by SSHing into anything.
Can this filtering approach work if my DVR runs a weird custom layout?
Yes, because the overlay_mask template is keyed on recorder model plus layout_id, not on Cyrano's guess. Supported recorders include Hikvision DS-7xxx, Dahua XVR and NVR, Lorex, Amcrest, Reolink NVR, Uniview, Swann, Night Owl, Q-See, ANNKE, EZVIZ, Bosch DIVAR, Honeywell Performance, Panasonic WJ-NX, and most white-label rebrands. Each has its own conventions for where the clock sits, how the camera-name strip renders, and what the channel bug looks like. Cyrano ships with templates for all of them. A custom layout that is not in the library just gets a one-time mask calibration and then behaves the same.
Why not just buy new AI cameras instead of filtering the HDMI output of the old ones?
Cost, time, and downtime. A full camera replacement at a single apartment property runs $50,000 to $100,000+ including new cameras, new wiring, and a new NVR, plus months of installation and an ongoing cloud subscription. One Cyrano unit plugs into the HDMI output of whatever DVR you already have in the office closet. Hardware is $450 one-time, software is $200/month, install is under 2 minutes, and the guard monitor on the wall keeps showing the same multiview because Cyrano passes the HDMI signal through. The filter pipeline described on this page is what makes that HDMI tap useful instead of noisy.
What happens to alerts that the filter pipeline drops? Is there an audit trail?
Dropped alerts are logged with the reason they got dropped: overlay_masked, persistence_failed, zone_miss, deduped_within_cooldown. You can query that log by property, tile.label, and event_class to see what would have fired before filtering and how much noise was removed. In practice this is how property managers tune zones in weeks 2 through 4 of a deployment. They see that overlay_masked drops fire every second on the south camera because its clock position is non-standard, and they recalibrate that one tile's mask in a minute.
Adjacent reading
Security Camera False Alarm Reduction
Where AI processing happens determines alert accuracy. A companion to this pipeline deep dive.
Live Monitoring vs Automated Alert Fatigue
200 alerts a day means zero alerts get read. How hybrid verification actually works.
Security Video Monitoring Systems: The Plural Problem
What running eight recorders at eight properties through one event stream looks like.