Matthew Diakonov
12 min read
For ML engineers and security architects

RT-DETR-Lite is interesting because of the tile-grid constraint, not the COCO benchmark.

RT-DETR-Lite is the lightweight variant of Baidu's RT-DETR (Real-Time DEtection TRansformer), an end-to-end transformer object detector that runs without anchor boxes and without non-maximum suppression. The reference implementation lives at github.com/lyuwenyu/RT-DETR under the Apache 2.0 license.

The published benchmarks compare it to YOLOv8 on COCO. Useful, but not what makes the lite variant interesting for surveillance. What makes it interesting is a constraint nobody writes about: when you retrofit AI onto a legacy DVR by classifying its HDMI multiview output, one model has to cover up to 25 tiled cameras inside a single 1080p composite frame. Latency, not mAP, decides whether the deployment ships.

Direct answer (verified 2026-05-08)

RT-DETR-Lite = the small-backbone configuration of Baidu's RT-DETR family. End-to-end transformer object detector. Anchor-free, NMS-free. Real-time on a small GPU. Apache 2.0. Reference repo: lyuwenyu/RT-DETR.

For multiview surveillance retrofits the relevant fact is that the lite variant's per-image latency leaves headroom on a 33 ms / 30 fps budget when the model has to detect across 25 tiles inside one 1080p composite. That headroom is what the rest of this page is about.

What RT-DETR-Lite actually is

The original DETR paper (Carion et al., 2020) reframed object detection as set prediction with a transformer. The model emits a fixed number of object queries, each of which becomes either a detection or a no-object slot, and there is no NMS pass and no anchor box matching. The architecture was clean, the latency was not. DETR was not a real-time model.

RT-DETR (Lv et al., 2023, from Baidu) reworks the encoder so the model meets real-time targets while keeping the no-NMS, no-anchor properties. It introduces an efficient hybrid encoder that separates intra-scale interaction from cross-scale fusion, and a query selection mechanism that initializes queries from high-quality encoder features. The result is a family of detectors (R18, R34, R50, R101 backbones, plus scaled and pruned configurations) that hit real-time speeds without giving up the transformer head.

"RT-DETR-Lite" in the wild usually refers to the small backbone variants (R18 / R34) or to v2 lite configurations from the RT-DETRv2 line. The exact name floats a bit because the project has been renamed and forked across releases. What is stable is the property set: lightweight backbone, real-time inference, no NMS, transformer queries.

The constraint, drawn out

One model, one composite, many tiles. The diagram below is the mental model for why model latency dominates everything else in this stack.

HDMI multiview composite -> RT-DETR-Lite -> per-camera detections

[Diagram: five sample tiles (Lobby, Mailroom, Rear gate, Loading bay, Parking) feed one RT-DETR-Lite pass; its person and vehicle detections flow into the per-class router.]

The five sources on the left are five of up to 25 tiles in the same 1080p composite. The single hub in the middle is one model forward pass over the whole composite, not 25 separate calls. That is the point of the architecture: the model sees every tile at once, and the per-camera attribution happens after the fact.

The 33 ms budget, line by line

The number that pins model choice on a multiview retrofit is the per-frame budget. At 30 fps it is roughly 33 ms. The pipeline does five things inside that envelope: HDMI capture and decode, model forward pass, de-tile to per-camera coordinates, per-class routing decision, and async paging. Only the first four are on the critical path. The forward pass is the largest line item, which is why the difference between a lite DETR-class model and a heavy one is the difference between shipping and not shipping.

tile-budget.ts
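A sketch of what that budget file might contain. Every per-stage figure here is an assumption (typical edge-GPU numbers for an ONNX or TensorRT export of an R18-class backbone), not a measurement; the structure of the budget, not the constants, is the point.

```typescript
// tile-budget.ts (sketch): per-stage latency budget for a 30 fps multiview
// retrofit. All per-stage figures are ASSUMED typical edge-GPU numbers,
// not measured benchmarks.

const FRAME_BUDGET_MS = 1000 / 30; // ~33.3 ms per composite at 30 fps

// Critical-path stages only; async paging happens off this path.
const stages = {
  captureAndDecode: 3, // HDMI capture + frame decode
  modelForward: 9,     // lite RT-DETR-class forward pass over the composite
  deTile: 1,           // snap composite boxes to tiles, attach camera IDs
  router: 2,           // per-class zone + time-of-day decision
};

const spent = Object.values(stages).reduce((a, b) => a + b, 0);
const headroom = FRAME_BUDGET_MS - spent;

console.log(`spent ${spent} ms, headroom ${headroom.toFixed(1)} ms`);
// → spent 15 ms, headroom 18.3 ms
```

With these assumed constants the headroom lands near the ~18 ms figure this page quotes; swap in measured numbers from your own export and the arithmetic tells you whether the deployment ships.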

The numbers above are typical figures for a small CUDA edge target running an ONNX or TensorRT export of an R18-class backbone, not vendor benchmarks. Actual latencies depend on the export path and the input resolution.

~18 ms

Headroom left on a 33 ms / 30 fps budget after a lite RT-DETR-class forward pass over a 1080p multiview, on an edge GPU target. That headroom is the part that gets spent on de-tiling, routing, and the next frame arriving on time.

Why the no-NMS property is load-bearing on a tile grid

Non-maximum suppression assumes that two boxes with high overlap and the same class are the same object, and one of them gets silenced. On a single full-frame image of a real scene this is mostly correct. On a 1080p composite that is actually a 5x5 grid of unrelated sub-scenes, the assumption fails in a small but annoying number of cases. A person standing at the edge of one tile and a different person standing at the edge of the adjacent tile can land in NMS-overlap on the composite even though they are at different cameras and have nothing to do with each other. One of them gets suppressed.

DETR-class detectors do not have this problem because they do not run NMS. The transformer head emits a fixed set of object queries that have already been resolved against each other globally. Two queries pointing at two different tiles produce two detections, and the de-tile layer downstream attaches the right camera ID to each. The no-NMS property is not just a speed argument; it is a correctness argument on the multiview case.

You can patch around this with a YOLO-class detector by running NMS separately within each tile's bounds, but that requires the NMS step to know the tile grid, which couples the model layer to the de-tile layer and makes the pipeline harder to swap. Choosing a no-NMS detector keeps the layers clean.
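A minimal numeric illustration of the failure mode, with made-up box coordinates: two distinct people at opposite edges of adjacent tiles whose composite-space IoU clears a typical 0.5 NMS threshold.

```typescript
// Cross-tile NMS failure, illustrated. Box coordinates are invented for
// the example; the tile boundary sits at x = 384 on a 5x5 1080p grid.
interface Box { x: number; y: number; w: number; h: number; }

function iou(a: Box, b: Box): number {
  // Intersection-over-union of two axis-aligned boxes.
  const ix = Math.max(0, Math.min(a.x + a.w, b.x + b.w) - Math.max(a.x, b.x));
  const iy = Math.max(0, Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y));
  const inter = ix * iy;
  return inter / (a.w * a.h + b.w * b.h - inter);
}

// One person near the right edge of tile 0, a different person near the
// left edge of tile 1; both boxes spill a little over the seam at x = 384.
const personTile0: Box = { x: 360, y: 100, w: 40, h: 90 };
const personTile1: Box = { x: 368, y: 102, w: 40, h: 90 };

console.log(iou(personTile0, personTile1).toFixed(2)); // → 0.64
// 0.64 is well above a typical 0.5 NMS threshold, so composite-wide NMS
// would silence one of the two real people.
```

The boxes belong to different cameras, but nothing in a composite-wide NMS pass knows that; a no-NMS detector never has to make the call.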

Where the model sits in the actual pipeline

The detector is one of four boxes. Capture, model, de-tile, router. The DVR's own motion engine is still running in parallel for its on-site monitor display, but it is not on this data path.

HDMI multiview retrofit, with RT-DETR-Lite as the detector

DVR (HDMI out)
  -> 1080p composite @ 30 fps -> Edge capture
  -> decoded frame buffer -> RT-DETR-Lite (transformer forward pass)
  -> detections in composite coords -> De-tile + router (snap to tile, attach camera ID, apply zone+time rules)
  -> phone + 10s clip on HIGH -> On-call ops

What this looks like in production

The model layer is the cheapest part of the stack to swap. The install, the capture path, the de-tile layer, the per-class router and the alerting layer outlive any specific detector choice. What the operator pays for is everything around the model.

Tiles per device
Up to 25
One HDMI cable into one edge device, one model forward pass per composite, all 25 cameras at once.
Per-composite budget
~33 ms
At 30 fps. The lite DETR variant fits with headroom; a heavy backbone does not.
On-site install
~2 min
Plug HDMI into the back of the DVR. Capture the tile-grid template once. Done.

What model choice does not solve

A better detector does not paper over an upstream constraint. Three failure modes that RT-DETR-Lite (or any other detector) cannot fix on its own:

  • The DVR draws the multiview at a frame rate the operator configured years ago. If that rate is 5 fps, the model is running at 5 fps on this site and no model swap fixes it. The fix is at the DVR.
  • The DVR's tile name strip is the source of camera labels. If the DVR draws a tile in a position the operator never renamed ("Camera 7"), the OCR'd label is the one the router sees, which can make alerts harder for humans to interpret. Naming hygiene at the DVR is upstream of the model.
  • On dim tiles with very low signal-to-noise, no detector is reliable. The honest answer is to fix the camera or the lighting at that location, not to retune confidence thresholds until ghosts stop firing.

When a YOLO-class detector is the right call instead

I want to be honest about the cases where the lite DETR family is not the obvious choice. There are real ones.

On CPU-only edge boxes a quantized YOLOv8-N or NanoDet variant is still the pragmatic answer. RT-DETR-Lite assumes a small GPU or an NPU that supports the encoder ops well, and on commodity CPUs the latency picture flips. If your form factor cannot host CUDA or a proper NPU runtime, do not fight that battle, ship the CPU-friendly detector and accept the recall hit on dim tiles.

On single-camera deployments (one IP camera, RTSP available, no multiview) the small-target stress test goes away. The frame is larger per object, the budget is per-frame instead of per-25-tile-composite, and a lightweight YOLO is fine. The DETR family wins specifically on the multiview case, and outside that case the win narrows to the NMS-vs-no-NMS argument, which is real but not always decisive.

On training data your team owns and has labeled in the YOLO format already, switching detector families costs more than it saves. The right move is to keep the existing pipeline and put the engineering hours into the per-class router downstream, which is the layer that compounds.

Want the multiview retrofit running on your DVR?

15-minute call. We will walk through the tile-grid template, the per-class router, and what the detector actually emits on a 25-camera property.

Frequently asked questions

What is RT-DETR-Lite, in one paragraph?

RT-DETR-Lite is the smaller member of Baidu's RT-DETR (Real-Time DEtection TRansformer) family, an end-to-end transformer object detector designed to run at real-time speeds on commodity GPUs. RT-DETR's whole pitch is that it drops the two parts of the YOLO-class pipeline that get expensive at higher quality, hand-tuned anchor boxes and post-hoc non-maximum suppression, and replaces them with a transformer encoder-decoder that emits a fixed set of object queries directly. The lite variants (RT-DETR-R18, RT-DETR-R34, and the smaller pruned configurations sometimes called RT-DETR-Lite or RT-DETRv2-S in derivative work) trade some COCO mAP for a tighter compute envelope so the model can actually finish in time on edge hardware. The reference implementation lives at github.com/lyuwenyu/RT-DETR.

Why would a surveillance retrofit care about RT-DETR-Lite specifically over a YOLO variant?

Two reasons, and they are not the reasons the COCO benchmarks emphasize. First, the multiview retrofit case classifies a single composite frame containing up to 25 tiled sub-views, and the model has to find people and vehicles on every tile, including the ones with hardly any pixels. RT-DETR's transformer attention is more forgiving of small objects in cluttered backgrounds than NMS-based heads, which often suppress overlapping small detections in busy tiles. Second, RT-DETR's no-NMS output lets the per-class router downstream treat the detections as a stable stream rather than a flickering set, which matters for state classes like loitering or tailgating that are computed across frames. None of this is unique to RT-DETR-Lite specifically; the entire DETR family has these properties. But the lite variant is the one that actually fits the latency budget.

What is the latency budget for a multiview pipeline, and why does it pin model choice?

A typical legacy DVR draws a 1080p HDMI multiview at 30 fps. The retrofit classifier has roughly 33 ms per composite frame to do the entire pipeline: decode, model forward pass, de-tile to per-camera coordinates, and the per-class router. A 5x5 tile grid means each tile is 384x216 inside the 1080p composite, and the model must see all 25 simultaneously. If the model takes 25 ms by itself you have 8 ms left for everything else. That is why a model in the sub-10 ms per image class, like RT-DETR-Lite or a small YOLOv8-N, gets shortlisted, and why a heavier RT-DETR-R50 or DETR-R101 does not. The lite variant is not chosen for accuracy points; it is chosen because it leaves headroom.

How does RT-DETR-Lite handle the tiled-input case, where one frame is a grid of unrelated sub-images?

There is nothing model-architectural that handles the tile grid; the tiling is handled by the surrounding pipeline. The model treats the 1080p composite as a single image and emits detections in composite coordinates. The pipeline records the tile-grid template at install time (5x5, 4x4, 3x3, or 2x2 against a known DVR resolution), then snaps every output bounding box to its tile by integer division on the composite resolution, and looks up the camera ID from the OCR'd tile name strip. The model never knows the tiles exist. What RT-DETR-Lite buys you on top of this is that the transformer's global attention sees the whole composite at once, so a person partially crossing a tile boundary is still a single detection, where an NMS-based detector with overlapping anchor proposals can split the same person into two boxes that get suppressed against each other.
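The snapping step described above can be sketched in a few lines. The grid constants, the `tileIndexFor` name, and the centre-of-box snap rule are illustrative assumptions, not the shipped implementation:

```typescript
// Hypothetical de-tile step: map a detection box in composite coordinates
// to its tile index on a known grid. Constants are assumptions for a 5x5
// multiview on a 1080p DVR output.
interface Box { x: number; y: number; w: number; h: number; }

const COMPOSITE = { width: 1920, height: 1080 }; // DVR output resolution
const GRID = { cols: 5, rows: 5 };               // tile-grid template

function tileIndexFor(box: Box): number {
  // Snap by the box centre; clamp so edge boxes stay on the grid.
  const tileW = COMPOSITE.width / GRID.cols;   // 384 px
  const tileH = COMPOSITE.height / GRID.rows;  // 216 px
  const col = Math.min(GRID.cols - 1, Math.floor((box.x + box.w / 2) / tileW));
  const row = Math.min(GRID.rows - 1, Math.floor((box.y + box.h / 2) / tileH));
  return row * GRID.cols + col; // 0..24, row-major
}

// A detection centred at (500, 700) lands in column 1, row 3.
console.log(tileIndexFor({ x: 480, y: 680, w: 40, h: 40 })); // → 16
```

The tile index is then a lookup key into the stored template: camera ID (from the OCR'd name strip), zone label, and rules.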

Does RT-DETR-Lite need a GPU on the edge box?

Yes, in any practical configuration. The reference implementation targets CUDA, and the published latencies (around 4 to 9 ms per image at 640x640 on a T4 for the R18 backbone variant) are GPU numbers. On a CPU-only box, RT-DETR-Lite is not real-time at multiview composite resolutions. The pragmatic edge target is a small CUDA card, an NVIDIA Jetson Orin module, or an integrated NPU with an ONNX or TensorRT export of the model. The official repo ships TensorRT export instructions and the speed claims in the paper assume that path. If your edge form factor is CPU-only, you are looking at a quantized YOLOv8-N or NanoDet variant instead, and you will pay for it in small-target recall on the dim tiles.

How does RT-DETR-Lite compare to YOLOv8 and to the original DETR on this workload?

RT-DETR's own paper benchmarks against YOLOv8 and the DETR variants on COCO, where the small RT-DETR-R18 beats YOLOv8-N on mAP at comparable latency. For the multiview case the more interesting axis is small-object recall on cluttered composites, which neither paper measures, because nobody runs that benchmark publicly. In practice on real DVR multiview frames, the lite RT-DETR variants and a tuned YOLOv8-S trade blows: RT-DETR-Lite is more stable across tile boundaries and better on partial occlusions, while YOLOv8-S has a slight edge on sharp small objects in well-lit tiles. The original DETR is not a candidate; it is too slow at multiview composite resolutions, which is the whole reason RT-DETR exists.

Where does RT-DETR-Lite sit in the pipeline I would actually deploy?

It is one of three boxes. Box one is the HDMI capture, the multiview composite enters the edge device over a single HDMI cable from the DVR. Box two is the model, RT-DETR-Lite or a comparable lite variant produces detections in composite coordinates. Box three is the de-tile-and-route layer, which snaps every detection to its tile using the stored tile-grid template, attaches a camera ID and zone label, and sends the labeled detection into the per-class router (the part that decides whether anything fires based on class plus zone plus time of day). The model is the cheapest part of this stack to swap. If a better lite detection model ships next year, the rest of the pipeline does not change.
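The per-class router's decision shape (class plus zone plus time of day) can be sketched as follows. The class names, zone names, and the single HIGH rule are hypothetical examples, not the production rule set:

```typescript
// Hypothetical per-class router: decide whether a labeled detection fires.
// Classes, zones, and the one HIGH rule below are illustrative only.
type DetClass = "person" | "vehicle";
type Zone = "perimeter" | "lobby" | "parking";

interface LabeledDetection {
  cls: DetClass;
  cameraId: string; // attached by the de-tile layer
  zone: Zone;       // from the stored tile-grid template
  hour: number;     // local hour, 0-23
}

function threatLevel(d: LabeledDetection): "HIGH" | "LOW" {
  const afterHours = d.hour >= 22 || d.hour < 6;
  // Example rule: a person on the perimeter after hours is HIGH;
  // everything else (e.g. a vehicle in parking at noon) is routine.
  if (d.cls === "person" && d.zone === "perimeter" && afterHours) return "HIGH";
  return "LOW";
}

const hit: LabeledDetection = { cls: "person", cameraId: "rear-gate", zone: "perimeter", hour: 2 };
console.log(threatLevel(hit)); // → HIGH
```

The point of the shape is that the model never sees any of this: swap the detector and the router, zones, and rules carry over unchanged.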

Is RT-DETR-Lite open source, and what license is the reference implementation under?

Yes. The official RT-DETR repository at github.com/lyuwenyu/RT-DETR is published under the Apache 2.0 license, which matches what most production teams want from a model dependency, permissive for commercial use with the standard attribution and patent clauses. The PaddlePaddle reference and the PyTorch reference both ship there. Pretrained weights for the various backbone variants (R18, R34, R50, R101, and the lite/scaled configurations) are linked from the repo readme. As of mid-2026 the v2 line is the actively maintained branch and the lite variants in v2 inherit the same license.

Can I run RT-DETR-Lite on top of the existing DVR's motion-alert stream instead of the HDMI multiview?

You can, but you should not, and the reason has nothing to do with the model and everything to do with the input. The DVR's motion engine is a per-tile pixel-difference threshold. By the time a clip is in the motion-alert stream it has already been filtered against that threshold, so any sub-threshold event (a slow approach in low light, a small distant person) is gone before the model sees anything. RT-DETR-Lite running on top of this stream cleans up false positives but inherits the DVR's recall ceiling. The HDMI direct architecture, where RT-DETR-Lite runs on every multiview composite frame at output rate, is what actually raises recall.

How does Cyrano use a model in this family in production?

Cyrano runs a lite real-time detection transformer in the same family, plus a per-class context router on top, on an edge device that plugs into the back of an existing DVR over HDMI. One device handles up to 25 camera feeds (the 5x5 multiview), inference runs locally with no cloud round trip, and operators get a phone call with a 10-second clip when the router decides an event is HIGH THREAT given class plus zone plus time of day. At a 180-unit Class C property in Fort Worth the same architecture caught 20 incidents including a break-in attempt in the first month. Hardware is $450 one-time, software is $200 per month per property starting month two, and install on site takes about two minutes.

🛡️ Cyrano · Edge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
