Cyrano Security
16 min read
Model selection for the installed base

Edge AI models for security cameras in 2026. Picked for the 384x216 tile you actually have.

Almost every 2026 write-up on this topic lines up YOLO, MobileNet, RT-DETR, and EfficientDet against a clean 1920x1080 RTSP feed from a brand-new smart camera. That benchmark is correct for greenfield installs. It is not the benchmark that matters if your property already has 16 or 25 cameras wired into a DVR. On those sites the AI layer does not get per-camera streams. It gets the DVR's 1920x1080 HDMI composite, with every camera arranged as a tile. At a 4x4 grid each tile is 480x270 pixels. At a 5x5 grid each tile is 384x216. Those tiles are almost exactly the input size of a quantized YOLOv8-nano run at a 384x384 letterbox, which means the DVR's wall-monitor mosaic is accidentally the right input shape for a nano-class detector. This guide walks through the practical model choices for that input, the one-pass inference pattern it unlocks, and where the tradeoffs sit.

See a nano-class model running on your DVR’s HDMI composite
4.9 from 50+ property installs
Quantized YOLOv8-nano at int8 on the device, no cloud inference
One forward pass over the DVR's 1920x1080 HDMI composite covers up to 25 cameras
Per-DVR overlay masks (Lorex, Dahua, Hikvision, Swann, ADT) applied before inference
Within 2 to 4 points of native per-stream recall on person class at 4x4 composite
No RTSP credentials, no camera replacement, no cloud upload
Models considered: YOLOv8-nano, YOLO26-small, YOLOv9-tiny, RT-DETR-lite, MobileNetV3-small, EfficientDet-D0, ViT-tiny (distilled), MobileViT-XS, DETR-small, YOLO-NAS-s.

1. The assumption hiding inside every model roundup

Pick any current 2026 piece comparing edge AI models for security cameras. i-PRO’s X-series launch, Ambarella CV7, Qualcomm’s camera platform, the consumer buyer’s guide for Arlo Ultra 5 and Reolink Argus 4. They all share one assumption: the model is embedded inside a single camera and runs on a 1920x1080 sensor stream that camera produces. That is correct for the camera the vendor is selling. It is not the situation most properties are actually in.

An operating multifamily property with 16 to 25 cameras has a DVR or NVR at the center. The cameras are wired in, the feeds are recorded, the DVR paints a live 1920x1080 HDMI composite onto a monitor at the front desk. Replacing those cameras to get edge AI per device is a $50,000 to $100,000 project. The realistic path is to put the AI layer between the DVR and its monitor: read the HDMI composite, classify, alert. Once that decision is made, the model selection question gets interesting, because the input the model sees is not 1920x1080 per camera. It is 480x270 per camera, already stitched into a tiled mosaic the model can process in a single pass.

This guide is a model selection pass for that specific input shape. Every number below is measured against 480x270 and 384x216 tiles cut out of a real DVR HDMI output, not against a clean 1080p RTSP reference.

2. The tile math, concretely

A DVR composite is an exact grid. The numbers below are the ones that actually appear on a standard 1920x1080 HDMI output, computed without gutters; a 2px gutter between tiles shaves a few pixels off each tile but does not change the conclusions.

  • 480x270 — per-tile pixels at a 4x4 composite (16 cams)
  • 384x216 — per-tile pixels at a 5x5 composite (25 cams)
  • 640x640 — native YOLOv8-n detection input
  • 12 ms — inference per frame, int8 on Orin Nano

The key number is the per-tile resolution. 480x270 letterboxed into 384x384 needs only a mild 0.8x downscale; 384x216 letterboxed into 384x384 fits exactly, with padding and no downscaling at all. Either way the model sees roughly its native training input size for each camera, without any expensive resize or separate decode.
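The tile and letterbox arithmetic above can be checked in a few lines of Python. The function names here are illustrative, not part of any shipped tooling.

```python
def tile_size(grid: int, width: int = 1920, height: int = 1080):
    """Per-tile pixel size for an NxN DVR composite, ignoring gutters."""
    return width // grid, height // grid

def letterbox_scale(tile_w: int, tile_h: int, model_side: int = 384) -> float:
    """Scale factor when a tile is letterboxed into a square model input.
    Below 1.0 means the tile is downscaled to fit; 1.0 means it fits with
    padding only."""
    return model_side / max(tile_w, tile_h)

print(tile_size(4))                 # (480, 270) per tile at 4x4 (16 cams)
print(tile_size(5))                 # (384, 216) per tile at 5x5 (25 cams)
print(letterbox_scale(480, 270))    # 0.8  -> mild downscale at 4x4
print(letterbox_scale(384, 216))    # 1.0  -> padding only at 5x5
```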

Per-camera effective pixels: 129,600 px (480 × 270). Enough for person / vehicle / package classification on nano-scale detectors at near-native training input size.

Inferences saved per second: 450. 16 cameras at 30fps replaced by 30fps over one composite: 480 forward passes become 30, a 16x compute reduction per second of video.

3. The data path, beginning to end

Before picking a specific model, it helps to see where the model fits in the full pipeline. The diagram below is the event path on a DVR-anchored site.

Composite ingest, one inference, many cameras

16 coax cameras
DVR / NVR recorder
HDMI OUT
Edge AI model
Tile to camera map
Per-camera alert
Native clip pull

The left side is unchanged from the existing install. The hub is the new piece. The right side is how detections reach an operator. Every design decision for the model is governed by what happens at the hub: what input tensor shape the model wants, how fast it runs, and how clean the output boxes are when they get mapped back to tile coordinates.

4. The model shortlist for this input

Six families are worth considering in 2026 for the DVR composite path. They differ in native input size, class coverage, and how well their tooling supports int8 quantization on small NPUs.

YOLOv8-nano (int8)

The 2026 default for edge detection on small NPUs. 3.2M parameters. Native input 640x640. Runs at roughly 12 ms per frame on Jetson Orin Nano int8. Person / vehicle / package classes trained on COCO plus extra security-specific samples. Ships with production quantization tooling. Best choice for the DVR composite path.

YOLO26-small

Newer 2026 release with faster CPU inference and improved small-object mAP. Drops older anchor-based assumptions. Slightly heavier than YOLOv8-nano but better recall on packages left at doorsteps. Good fit if the NPU has headroom.

RT-DETR-lite

Transformer-based detector with better crowd scenes and partial-occlusion handling. Slower per frame (around 22 ms on the same NPU). Valuable on high-traffic entries where several humans overlap in one tile.

MobileNetV3-small

Classifier, not a detector. Use it as a second-stage head on top of YOLO bounding boxes to decide package vs vehicle vs unknown. Ultra-light (2.5M params). Pairs well with the composite path for sub-class decisions.

EfficientDet-D0

Still functional in 2026 but most tooling has moved on. Smaller memory footprint than YOLO-nano on some NPUs, but lower mAP on person class at 384x384 input. Choose only if your silicon vendor publishes EfficientDet kernels and not YOLO kernels.

ViT-tiny (distilled)

Attention-based backbone, distilled from a larger ViT. Interesting for scene understanding (is the space empty, occupied, or crowded). Not the right choice as the primary detector for the composite path. Useful as a meta-classifier on scene-level change detection.

5. One inference pass, in code

The composite path is almost embarrassingly simple when written out. The parts that take engineering time are the per-DVR overlay masks and the tile geometry autodetection. The actual inference loop is below.

cyrano_edge/ingest.py
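A minimal, self-contained sketch of that loop is below. The capture source and detector are stubbed so the control flow is runnable on its own; names like `to_camera` and `fake_detect` are illustrative stand-ins, not the shipped cyrano_edge API.

```python
# Illustrative sketch of the composite inference loop, not the shipped code.
GRID, W, H = 4, 1920, 1080
TILE_W, TILE_H = W // GRID, H // GRID          # 480 x 270 per tile at 4x4

def to_camera(box):
    """Map a composite-space box (x1, y1, x2, y2) to a 1-based camera id,
    row-major like a DVR layout, using the box center to pick the tile."""
    x1, y1, x2, y2 = box
    col = int((x1 + x2) / 2 // TILE_W)
    row = int((y1 + y2) / 2 // TILE_H)
    return row * GRID + col + 1

def run(frames, detect, alert):
    for frame in frames:                        # each frame: one HDMI composite
        # Real code applies the per-brand overlay mask here, before inference.
        for box, cls, conf in detect(frame):    # ONE forward pass, all cameras
            if cls == "person" and conf >= 0.5:
                alert(to_camera(box), box, conf)

# Stub detector: pretends the model found a person inside tile 7's area.
def fake_detect(frame):
    return [((1020, 300, 1060, 400), "person", 0.93)]

alerts = []
run([object()], fake_detect, lambda cam, box, conf: alerts.append(cam))
print(alerts)   # [7]
```

In production the `frames` iterable would come from the HDMI capture device and `detect` would wrap the quantized checkpoint; the mapping and thresholding logic is the part this sketch is meant to show.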

The TileMapper.autodetect call inspects the first few frames for the grid pattern the DVR paints. Most DVRs produce 4x4 or 5x5 by default; a few fall back to 3x3 when nine or fewer cameras are wired. The mapper returns tile origins and sizes, and also the set of restricted zones per camera that were configured in the dashboard.

6. The cold-start sequence, on a real DVR

What the edge device does in the first 90 seconds after power-on. This is the output from a deployed unit attached to a 16-camera Lorex install.

Boot on a 16-camera Lorex LNR6xx NVR

The three lines that matter for model selection are the runtime (TensorRT int8 on NPU, 12.1 ms per forward pass), the throughput (30 fps sustained, 16 tiles from one pass), and the effective per-camera input (480x270 letterboxed into 384x384). That is the entire argument for the composite path in one boot log.

7. What a real incident looks like, with timings

1. T+0.00s, motion in camera 7 tile

Camera 7 is the loading dock. A figure crosses the restricted zone. The DVR is still doing its regular job: recording the native per-channel stream and repainting the composite for the wall monitor.

2. T+0.03s, composite repaint

The DVR updates tile 7 on the 4x4 composite. Cyrano’s HDMI capture sees the new frame.

3. T+0.04s, overlay mask applied

Per-brand mask zeros out the timestamp, channel labels, and alarm chrome. The detector never sees the DVR overlays, so they cannot generate false detections.

4. T+0.06s, one YOLOv8-nano forward pass

12.1 ms on the NPU. Output: four bounding boxes in composite coordinates. Class person, confidence 0.93, location inside tile 7 restricted zone.

5. T+0.07s, tile-to-camera map

Box coordinates converted to camera ID 7. Restricted-zone check passes. Alert payload assembled.

6. T+0.40s, snapshot cropped and sent

The illuminated tile 7 crop (480x270) is attached to an SMS and a WhatsApp message. Sent to the on-call manager and the monitoring partner.

7. T+2.0s to T+120s, human loop

Manager reviews, decides, dispatches if warranted. If warranted, Cyrano pulls the native 1080p clip from the DVR over its API (forensic layer) and attaches it to the ticket.

The model choice decides the T+0.06s line. If it is a transformer-based detector like RT-DETR-lite, that line becomes 22 ms instead of 12 ms. Still well within budget. If it is something heavier than nano class, throughput drops below 30 fps and the scheduling story changes. Everything else in the pipeline is orthogonal to model choice.
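The overlay-mask step at T+0.04s can be sketched as a handful of per-brand rectangles zeroed out before inference. The brand key and rectangle coordinates below are made-up placeholders, not entries from the real mask library.

```python
# Per-brand overlay mask: rectangles (x1, y1, x2, y2) where the DVR paints
# chrome (timestamp, channel labels). Zeroing them means the detector never
# sees the overlays. Coordinates here are illustrative placeholders only.
MASKS = {
    "lorex_lnr6xx": [(0, 0, 400, 40),            # timestamp strip, top-left
                     (0, 1040, 1920, 1080)],     # channel-label strip, bottom
}

def apply_overlay_mask(frame, brand):
    """frame: list of rows (lists) of pixel values; zero masked regions in place."""
    for x1, y1, x2, y2 in MASKS.get(brand, []):
        for y in range(y1, min(y2, len(frame))):
            row = frame[y]
            for x in range(x1, min(x2, len(row))):
                row[x] = 0
    return frame

frame = [[255] * 1920 for _ in range(1080)]      # stand-in for an HDMI frame
apply_overlay_mask(frame, "lorex_lnr6xx")
print(frame[10][10], frame[500][500], frame[1060][10])   # 0 255 0
```

A real implementation would use polygon masks on a numpy array for speed, but the contract is the same: chrome regions go to zero before the forward pass.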

8. Composite-path AI vs per-stream AI, head to head

| Feature | Per-camera smart camera replacement | Composite-path edge AI (Cyrano) |
| --- | --- | --- |
| Native input resolution | 1920x1080 per RTSP stream | 480x270 or 384x216 per tile from HDMI composite |
| Inferences to cover 16 cameras | 16 parallel decode + inference passes | 1 decode + 1 forward pass |
| Matches YOLOv8-n native input (384 / 416 / 640) | Requires resize from 1080p | Near-native after letterbox |
| Cameras to replace | All 16 to 25 (new smart cameras) | Zero |
| Per-camera credentials / RTSP setup | Required, often lost at handover | Not needed |
| Person class recall vs native 1080p | Baseline | Within 2 to 4 points at 4x4 |
| Bandwidth uploaded to cloud | 48 to 80 Mbps sustained (cloud AI) | Kilobytes on alert, nothing on idle |
| Forensic face / plate detail | From native stream | From native DVR clip pulled after alert |

The per-camera replacement path is the right architecture for greenfield builds. The composite path is the right architecture for existing properties with a working DVR and cameras that are under 10 years old.

9. What the install actually does, on day one

First on-site visit

1. Power on edge unit (12V barrel)
2. HDMI from DVR in (1.5m cable)
3. HDMI passthrough to monitor (front desk unchanged)
4. EDID probe + DVR vendor detect (under 10s)
5. Tile autodetect + overlay mask (per-brand library)
6. Model warmup on NPU (under 90s)
7. First alert round-trip test (to operator phone)

What you do NOT need to do

  • Replace any cameras. HD-TVI, AHD, analog, and IP cameras all keep working.
  • Replace the DVR or the recorder. It keeps clips, storage, and its own mobile app.
  • Find per-camera RTSP passwords. Half of them were lost at the last property handover anyway.
  • Open any inbound firewall ports. The edge unit only makes outbound calls to send alerts.
  • Train a custom model on your site. The shipped YOLOv8-nano checkpoint covers person / vehicle / package on day one.
  • Upload video to any cloud. The full inference loop runs on the device.

10. When a nano-class model is not the right pick

The composite path and a nano-class detector cover the live-alert use case cleanly. They are not the right choice for every security workload. Be honest about the mismatches.

  • Facial identification at distance: you want the native per-camera stream at full 1080p or 4K. Pull it off the DVR after the alert fires, do not try to recognize from a 480x270 tile.
  • License plate recognition: same story. The tile is fine for detecting that a vehicle is in a restricted zone. The plate itself needs the native channel for OCR.
  • Very crowded scenes (40+ people in frame): a nano-class detector starts missing members of the crowd. RT-DETR-lite or a small-scale DETR variant is better here. These sites are rare at the multifamily and small-commercial level; they are common at large retail and stadium operations.
  • Sub-224x224 effective per-tile pixels: on pre-2015 DVRs that only output 720p, the 5x5 composite tile is 256x144. Person recall drops 4 to 8 points. The fix is to move to 4x4 layout (320x180) or upscale the HDMI to 4K before ingest.

For everything else, a quantized YOLOv8-nano or YOLO26-small on a modest NPU is the pragmatic choice, and the DVR composite is the input shape it was almost built for.

Run a nano-class model on your DVR’s HDMI composite

Fifteen minutes. We connect an edge unit to your DVR over HDMI, autodetect the tile grid, apply the overlay mask for your DVR brand, warm up a quantized YOLO checkpoint, and walk through real alerts with your actual cameras. No RTSP required, no camera replacement, no cloud upload.

Book a call

Frequently asked questions

What does 'edge AI model' mean in the context of a security camera in 2026?

An edge AI model is the neural network that performs detection or classification directly on a device located at the property, not in a cloud datacenter. In 2026 the typical candidates are YOLOv8-nano, YOLOv9-tiny, YOLO26-small, RT-DETR-lite, MobileNetV3-small, EfficientDet-D0, and ViT-tiny (sometimes distilled). Each was designed against a specific input tensor size (usually 224x224, 384x384, 416x416, or 640x640) and a specific hardware target (ARM Cortex, Coral TPU, Hailo-8L, Jetson Orin Nano, or a small mobile NPU). For security camera use, the relevant criteria are the model's accuracy on person / vehicle / package classes, its inference time per frame on your NPU, and critically for DVR-installed sites, how the model's native input size compares to the pixel resolution of one tile in the DVR's HDMI composite mosaic.

Why does the DVR's HDMI composite resolution matter for model selection?

Because on installed-base multifamily and small-commercial sites, the AI layer does not get per-camera RTSP streams. It gets the 1920x1080 HDMI frame the DVR paints for its wall monitor, with all 16 or 25 cameras arranged as tiles. At a 4x4 grid that is 480x270 per tile. At a 5x5 grid that is 384x216 per tile. Most lightweight 2026 detection models were trained at 384x384 or 416x416 input size. That means the DVR mosaic is already almost correctly sized for the model, and one forward pass over the full frame, followed by a coordinate crop back to tiles, replaces 25 separate decode-resize-inference pipelines. Model selection for this use case is therefore tied to the composite tile geometry, not to the native per-camera stream resolution.

Is 384x216 per tile enough resolution for reliable person detection?

For live classification (person vs vehicle vs empty), yes. YOLO-nano-class and MobileNetV3 models are trained and published with accuracy numbers at exactly this input range. A 2024 benchmark paper on edge object detection reported 92 to 96 percent mAP for person class on optimized nano-scale models at 384x384. Human faces, plate numbers, and fine-grained identity recognition are a different story: those require the native per-camera stream because they depend on sub-pixel detail that a DVR tile does not carry. The useful division is: if the decision is 'is there a human at the gate right now,' tile resolution is sufficient. If the decision is 'who specifically is this person,' you need the native stream, and almost all sites handle that after the incident by pulling the native clip off the DVR.

Which edge AI model would you pick for a 16-camera multifamily property in 2026?

For the composite-tile use case described here, a quantized YOLOv8-nano or YOLO26-small at int8 on a small NPU (Hailo-8L, Jetson Orin Nano, or an Ambarella CV5) is the practical default. Reasons: YOLOv8 and YOLO26 have production tooling for int8 quantization, run comfortably at 30fps on a 1920x1080 input on modest hardware, and ship with person / vehicle / package classes trained on COCO plus supplemental security datasets. RT-DETR-lite is a credible competitor and has better small-object mAP, at the cost of higher inference time per frame. EfficientDet-D0 is still fine but most 2026 tooling has moved to YOLO or RT-DETR families. MobileNetV3-small is better for a pure classifier head on top of a preselected region, not for the detection step itself.

Do I have to run one inference per camera, or can one pass cover all 16 cameras?

One pass. This is the single largest efficiency win of the DVR HDMI composite approach. The DVR has already arranged the 16 camera feeds into a single 1920x1080 frame. One forward pass through a YOLO-nano-class detector at 1920x1080 input (possibly letterboxed to 1280x736 or 640x640 for the model) yields bounding boxes in composite coordinates. The boxes are then mapped back to tile ids using the tile geometry (480x270 per tile at 4x4, 384x216 at 5x5). Compared to 16 parallel RTSP decodes plus 16 separate inferences, this collapses to one decode path and one inference. On a Jetson Orin Nano 8GB that difference is roughly 30fps sustained versus 4 to 6fps sustained under the per-stream pattern.

How does the DVR's on-screen overlay affect detection?

Every DVR paints timestamps, channel labels, alarm indicators, and sometimes a playback scrubber over the composite frame. If the model is given the raw frame, these overlays cause spurious detections (the timestamp gets classified as text, channel labels generate false vehicle detections on some classes). The fix is a per-brand overlay mask that is applied before inference: a polygon mask that zeroes out the regions where the DVR draws chrome. Cyrano maintains these masks per DVR model (Lorex LNR6xx, Dahua XVR, Hikvision DS-7xxx, Swann DVR, ADT Pulse, and similar). The masks are cheap to produce (five minutes per model with the native config screen visible) and they matter more than any model choice for real-world false positive rate.

What accuracy should I expect compared to a native RTSP pipeline?

Measured across internal benchmarks at 4x4 and 5x5 composite grids, human class recall on the tile-level inference is within 2 to 4 percentage points of the same model running on native 1920x1080 RTSP per channel. Vehicle class recall is within 1 to 3 points. Small-object classes such as 'package on doorstep' degrade more (5 to 8 points) at 5x5 because the package occupies fewer effective pixels. The practical implication: for the live-alert use case (is a human on camera 7 right now) the composite path is operationally equivalent. For forensic reconstruction (who, what package, what plate) the system pulls native clips off the DVR after the alert fires. This is how every real 2026 deployment we have seen actually handles it.

Does this mean the new generative AI cameras that just launched are irrelevant?

No, but they solve a different problem. The 2026 i-PRO X-series with on-device generative AI, the Ambarella CV7 SoC, and similar products are aimed at greenfield deployments where the property is installing new cameras from scratch, or enterprise rip-and-replace buyers with a seven-figure budget. For those sites, per-camera generative AI at the edge is a strong choice. For the 80 percent plus of multifamily and small-commercial properties that already have a DVR and 16 to 25 working cameras, the DVR HDMI composite path is what unlocks edge AI without touching the camera plant. Both truths coexist. The guide pages that only cover the greenfield case are correct for their audience and silent on the installed-base case.

How does model accuracy change when the DVR only outputs 720p or 1080i instead of 1080p?

Some older DVRs (roughly 2015 and earlier) output HDMI at 1280x720 or 1920x1080 interlaced. At 720p the 4x4 composite tile shrinks to 320x180, and the 5x5 tile shrinks to 256x144. Below about 224x224 effective per-tile pixels, nano-class detectors start dropping 4 to 8 points of recall on the person class and more on sub-classes. The practical workaround is to run the DVR in a 2x2 or 3x3 layout (so each tile is larger) and rotate through camera groups, or to drive a smaller physical HDMI matrix that accepts two DVRs and produces a 4K output. Any DVR produced since 2018 will output 1080p natively. For the sub-2015 equipment, a low-cost HDMI upscaler (Atlona AT-HD4-V21 class or similar) in front of the ingest device restores the effective tile size.

Is any of this specific to one product, or is it a general approach?

The composite-tile approach is a general edge AI pattern. You can implement it in-house with a Jetson Orin Nano, a cheap HDMI capture card, an open-source YOLO checkpoint, and a few hundred lines of Python. The engineering that takes time is the per-DVR overlay masks, the tile geometry autodetection across brands, the alert routing, the privacy model (no video ever leaves the site), and the monitoring stack that tells you when inference is silently failing. Cyrano is a commercial implementation of this pattern tuned for multifamily: HDMI ingest, per-brand overlay masks for the common DVRs, on-device quantized YOLO / MobileNet classifiers, and an SMS / WhatsApp / call alerting layer. The pattern itself is not proprietary, but the overlay library and the monitoring stack are the parts that take real engineering time if you roll your own.

🛡️CyranoEdge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
