What “type a sentence, get clips” really requires

Natural language search of DVR footage works against a description index, not the video. Stock DVRs do not write one.

Typing “masked person near the loading dock after midnight last week” into a search box and getting a strip of thumbnails feels magical. Mechanically, it is the opposite of magical. The query never touches the video file. It hits a parallel index that has one row per event, with a one-line natural language sentence and nine structured fields. If that index does not exist on your recorder, the answer to the English question is silently “nothing,” and you go back to scrubbing the timeline.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-08)

Yes, you can search DVR footage in plain English, but only if a per-event description index has been written next to the recorder. A stock DVR (Hikvision, Dahua, Lorex, Amcrest, Reolink, Uniview, Swann, Q-See, Night Owl, ANNKE, EZVIZ, Bosch, Honeywell, Panasonic) does not write one. Its search panel takes a date range, a channel number, and a motion mask, because that is all its on-disk format can support.

To turn English search on, an edge layer reads the recorder’s HDMI output, runs detection on each tile, writes a one-line sentence plus nine structured fields per surviving event, and exposes that index to a search box. The query then hits the index in milliseconds and the video file is never opened. Cyrano does this off the HDMI port without changing the recorder or the cameras.

The four moving parts of an English query against footage

Most pages about this topic stop at the marketing claim (“AI search!”) and never name the parts. There are exactly four. If any one of them is missing, the search either returns garbage, returns nothing, or quietly falls back to a date-range scrub.

Sentence in, thumbnail out: the four parts that have to exist

  1. Description: a model writes one_line_summary at event time.
  2. Structured row: nine fields sit next to the sentence.
  3. Parser: splits the typed query into a structured filter plus a text match.
  4. Result strip: thumbnails sorted by recency.

Part 1: a sentence is written for every event, exactly once

The first thing English search needs is something English to match against. A raw H.264 stream is not English. Pixel diffs from a motion detector are not English. The only way to get English is for a model to write it. On a Cyrano unit the description is one line of plain text, generated by a vision-language model that runs on the 5-frame burst that survived the cascade. The prompt asks for strict JSON with a few fields; one of those fields is the sentence.

That sentence is written one time, at the moment of the event, and never re-generated. Every later query is a plain text retrieval against the column where the sentence lives. The model never runs at query time. This is the cost trick that makes English search affordable. Sending every frame to a model costs roughly $97,000 per property per day at hosted-VLM pricing. Sending only the events that survived the cascade (a few hundred a day) and querying the resulting text costs in the low single digits per month per property.
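
A minimal sketch of that write-once step, in Python. The vlm_describe callable, the prompt text, and every JSON key other than one_line_summary are stand-ins for illustration, not Cyrano's actual API; the point is that the model runs here, once, and never at query time.

import datetime
import json

DESCRIBE_PROMPT = (
    "Return strict JSON with keys one_line_summary, event_class, subject_count. "
    "Describe only what is visible in the image."
)

def describe_event(burst_image: bytes, vlm_describe) -> dict:
    # vlm_describe is a stand-in for whatever VLM client the cascade is configured
    # with; it takes (image_bytes, prompt) and returns the model's raw text output.
    raw = vlm_describe(burst_image, DESCRIBE_PROMPT)
    fields = json.loads(raw)        # strict JSON, or the event is retried
    fields["iso8601_ts"] = datetime.datetime.now().astimezone().isoformat()
    return fields                   # written to the index once; never regenerated

# Every later query reads fields["one_line_summary"] as plain text.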

Part 2: a structured row sits next to the sentence

A sentence alone is not enough. “After midnight last week” is not a thing a text match can resolve cleanly. Neither is “the loading dock” (which the operator types as natural language but which refers to one specific camera). For the query to behave the way a human expects, the sentence has to be paired with structured columns: the timestamp, the camera name as it appears on the multiview, the event class, the property, the layout id, and the overlay mask used at capture. Below is what one event row actually looks like on disk.

{
  "event_id": "01HXR9-3F2K-7T",
  "tile": {
    "label": "Loading Dock NE",
    "index": 7,
    "coords": { "x": 960, "y": 0, "w": 480, "h": 270 }
  },
  "property": "Maple Ridge Apartments",
  "layout_id": "4x4-std",
  "overlay_mask": ["clock", "cam_name_strip", "channel_bug"],
  "event_class": "loiter",
  "iso8601_ts": "2026-05-02T02:14:08-05:00",
  "latency_ms": 7204,
  "one_line_summary": "two adults in dark hooded jackets, one carrying a crowbar-shaped object, approaching the rear loading dock gate at 02:14",
  "thumbnail_uri": "/idx/thumbs/01HXR9-3F2K-7T.jpg"
}

One row per event. tile.label is the human name the DVR paints onto its multiview strip; that is the stable key, not the channel number. overlay_mask records which DVR chrome regions were blanked before inference so the detector did not score boxes on the clock glyph or the channel bug. one_line_summary is what an English query searches against.
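
One way to persist that row so the sentence stays text-searchable, sketched with Python's built-in sqlite3 and an FTS5 full-text table. The schema mirrors the JSON above, but the table layout itself is an assumption for illustration, not the on-device format.

import json
import sqlite3

db = sqlite3.connect("events.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS events (
    event_id    TEXT PRIMARY KEY,
    tile_label  TEXT,        -- painted camera name, the stable key
    property    TEXT,
    event_class TEXT,
    iso8601_ts  TEXT,
    row_json    TEXT         -- the full event row, as received
);
CREATE VIRTUAL TABLE IF NOT EXISTS summaries
    USING fts5(event_id UNINDEXED, one_line_summary);
""")

def write_event(row: dict) -> None:
    # One insert per surviving event; the sentence is written exactly once.
    db.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
        (row["event_id"], row["tile"]["label"], row["property"],
         row["event_class"], row["iso8601_ts"], json.dumps(row)),
    )
    db.execute(
        "INSERT INTO summaries VALUES (?, ?)",
        (row["event_id"], row["one_line_summary"]),
    )
    db.commit()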

Part 3: a parser splits the typed query into a filter plus a text match

When the operator types a sentence, the search box does not throw the whole string at the model or at a vector index. It runs a small parser first that recognizes time phrases, camera names, and event classes, and pulls them out as structured filters. Whatever is left over becomes the text match. This is the single most important step, and it is the part most consumer “AI search” products skip, which is why their results feel arbitrary.

# query the operator typed
masked person near loading dock after midnight last week

# parser splits into three parts

structured_filter:
  iso8601_ts:  [2026-04-25T00:00, 2026-05-01T05:00]   # "after midnight last week"
  tile.label:  "Loading Dock NE"                       # "loading dock"

text_match:
  tokens:    ["masked", "person"]                      # full-text index
  embedding: vec("masked person")                      # semantic match

# resolved as
SELECT event_id, iso8601_ts, tile.label, one_line_summary, thumbnail_uri
  FROM events
 WHERE iso8601_ts BETWEEN $start AND $end
   AND tile.label = $camera
   AND (
        MATCH(one_line_summary) AGAINST ('masked person' IN BOOLEAN MODE)
     OR cosine(embedding, $vec) > 0.78
   )
 ORDER BY iso8601_ts DESC
 LIMIT 50;

“After midnight last week” resolves as a hard time window. “Loading dock” resolves as a tile.label match. “Masked person” is the only piece that hits the sentence column, scored by a token match plus an embedding similarity. The intersection of all three is the result set.
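
A toy version of that split, assuming a fixed list of painted camera labels and only two hard-coded time phrasings. A real parser covers far more language than this; the shape is what matters: structured filters come out first, and only the leftover tokens go to the text match.

import re
from datetime import datetime, timedelta

CAMERA_LABELS = ["Loading Dock NE", "Mailroom Interior", "Lobby W"]   # read off the multiview at install
STOPWORDS = {"near", "the", "a", "at", "on", "in"}

def parse_query(q: str, now: datetime):
    filters, text = {}, q.lower()

    # 1. Time phrases become a hard timestamp window (only two phrasings handled here).
    if "last week" in text:
        week_start = (now - timedelta(days=now.weekday() + 7)).replace(hour=0, minute=0)
        filters["ts_range"] = (week_start.isoformat(),
                               (week_start + timedelta(days=7)).isoformat())
        text = text.replace("last week", "")
    if "after midnight" in text:
        filters["hour_min"] = 0            # later ANDed with the window above
        text = text.replace("after midnight", "")

    # 2. Camera phrases match painted tile labels, never channel numbers.
    for label in CAMERA_LABELS:
        stem = label.lower().rsplit(" ", 1)[0]          # "loading dock"
        if stem in text:
            filters["tile_label"] = label
            text = text.replace(stem, "")

    # 3. Whatever is left over becomes the text match against one_line_summary.
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]
    return filters, tokens

# parse_query("masked person near loading dock after midnight last week", datetime.now())
# -> ({"ts_range": (...), "hour_min": 0, "tile_label": "Loading Dock NE"}, ["masked", "person"])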

Part 4: a result strip the operator can actually click

The output of the search is not a list of timestamps. It is a strip of thumbnails sorted in reverse chronological order, each one tied to an event row. The thumbnail is rendered once at event time alongside the sentence, so the result strip does not have to seek into the video file to render. A click on a thumbnail opens the corresponding clip on the recorder using the iso8601 timestamp; that is the one moment the underlying video file is touched. Everything else is a database query.
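
A sketch of what the result strip consumes. The rows come straight from the index query, already carrying the thumbnail rendered at event time; playback_url is a hypothetical helper standing in for however the dashboard turns a timestamp and camera name into a recorder seek, which is the only step that opens the video.

def playback_url(ts: str, tile_label: str) -> str:
    # Hypothetical seek helper: timestamp + camera name back to a recorder clip.
    return f"/playback?camera={tile_label}&t={ts}"

def result_strip(rows: list[dict]) -> list[dict]:
    # Nothing here touches the video file; thumbnails were rendered at event time.
    return [
        {
            "thumb": r["thumbnail_uri"],
            "caption": f"{r['tile_label']} · {r['iso8601_ts']}",
            "on_click": playback_url(r["iso8601_ts"], r["tile_label"]),
        }
        for r in sorted(rows, key=lambda x: x["iso8601_ts"], reverse=True)
    ]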

Wall-clock timing of one English query against a property index: sub-second against a year or more of indexed events (see the latency question in the FAQ below).

At one Class C multifamily property in Fort Worth, Cyrano caught 20 incidents, including a break-in attempt, in the first month. The customer renewed after 30 days.

Fort Worth, TX deployment

Why a stock DVR cannot do this even if you bolt a text box onto it

If the recorder never wrote per-event descriptions, no amount of UI work can produce English search. There is nothing to match. Adding a text input to a stock DVR’s playback panel and routing it through a clever scoring function still gives you zero hits, because the on-disk format is keyed by (channel, timestamp), not by event meaning. The work has to happen at write time, not at read time.

That is the whole reason the description index lives on a separate device. Recorder firmware is a closed embedded stack tuned for the storage and motion-mask job it has done since 2010. Adding a vision-language model and an embedding index to that firmware would require a board-level redesign, a chip with enough on-device compute to run inference, and a new firmware release. Reading the same recorder’s HDMI output from a small box next to it is the same effect with none of those costs.

What an English search will and will not return

Returns hits

  • Anything covered by an event row whose one_line_summary contains or semantically matches your phrasing.
  • Anything inside a time window after the unit was installed and indexing started.
  • Anything inside an armed zone where the cascade was allowed to escalate the event.
  • Cross-property queries when the operator is signed in to multiple sites; the property column is just another structured filter.
  • Mixed-brand fleets, because the index keys on tile.label (the painted camera name) instead of the recorder's proprietary channel number.

Returns nothing (and why)

  • Queries about attributes the model never described. If a license plate prompt is not in the cascade, no description carries the plate; add the prompt, do not re-run the search.
  • Queries against zones that were never armed. The cascade does not write event rows for unarmed regions; the back fence with no zone produces no hits.
  • Queries against a window before the unit was installed. The index is forward-looking and cannot retroactively describe last March's footage.
  • Queries that drift into legal facts (was this person trespassing? was this consent?). The index records what was visible, not what is enforceable.
  • Queries about audio. The HDMI tile carries video; audio is a separate channel and is not currently in the index shape on this device.

What this looks like when it is sitting on top of an existing recorder

The physical setup is one HDMI cable in from the recorder, one HDMI cable out to whatever monitor was there before, one network cable, and one power plug. Under two minutes of install on a running DVR. Nothing on the recorder changes; the cameras stay where they are; the office monitor still shows the same multiview as it did before. What is new is a small dashboard on the property manager’s laptop or phone that exposes the search box, the result strip, and the alert tier. The index is built continuously as events occur, so on day two of the install the operator can already type a sentence and get hits for things that happened on day one.

The supported recorder list covers Hikvision DS-7xxx, Dahua XVR and NVR, Lorex, Amcrest, Reolink NVR, Uniview, Swann, Night Owl, Q-See, ANNKE, EZVIZ, Bosch DIVAR, Honeywell Performance, and Panasonic WJ-NX, plus the long tail of rebrands. The brand of the recorder is a non-event because the index does not depend on any vendor SDK; it reads the composite frame the recorder paints to its own monitor.

See the description index running on a real property

Ten minutes, one screen-share. Type a sentence, watch the index resolve to thumbnails, see the underlying event row.

Frequently asked questions

Can I really type a sentence into my DVR search box and find footage?

Almost never on the recorder itself. Stock DVR firmware (Hikvision, Dahua, Lorex, Amcrest, Reolink, Uniview, Swann, Q-See, Night Owl, ANNKE, EZVIZ, Bosch, Honeywell, Panasonic) ships a Search panel that takes a date range, a channel number, and a motion-region mask. There is no English-language text input because there is nothing for English text to match against. The recorder stored H.264 frames keyed by (channel, timestamp), not sentences. To make English search work, a separate layer has to write a one-line description per event into a parallel index. The recorder then keeps doing what it does (storing pixels), and the index does what the recorder cannot (storing meaning).

Where does the description actually come from? Who writes the sentence?

A model writes it once, at capture time, when an event survives the cascade. On a Cyrano unit the chain is: the HDMI tile is grabbed off the recorder, an on-device detector flags a candidate (person, vehicle, unusual dwell), a dwell timer holds the track until it crosses a per-zone threshold, and a 5-frame burst stitched into one image is sent to a vision-language model with a prompt that returns strict JSON. One field of that JSON is one_line_summary, which is the sentence that goes into the search index. Example: 'two adults in dark hooded jackets, one carrying a crowbar-shaped object, approaching the rear gate at 02:14.' That sentence is written exactly once. Every later search hits the sentence as plain text and never re-runs the model.

Why is the cost of natural language search dominated by the write, not the query?

Because the model only fires on events that survived the cascade, not on every frame. A 25-camera property emits roughly 32.4 million frames a day. Sending all of them to a hosted vision-language model lands near $97,000 per property per day at Sonnet input pricing. The cascade drops more than 99.99% of those frames before they reach the model, so the day's bill is a few hundred descriptions, not 32 million. Once written, those descriptions are text, and a text query against text is effectively free at the millisecond level. The heavy work happened on the day of the event. The search you run six days later is just a string match plus a structured filter.
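
The back-of-envelope behind those figures, with every input treated as an assumption chosen to reproduce the article's numbers (15 fps per camera, roughly 1,000 image tokens per frame, $3 per million input tokens):

cameras          = 25
fps              = 15                       # assumed per-camera frame rate
frames_per_day   = cameras * fps * 86_400   # = 32,400,000 frames

tokens_per_frame = 1_000                    # rough image-token cost per frame (assumption)
usd_per_mtok     = 3.00                     # hosted-VLM input price used for the estimate

every_frame_usd = frames_per_day * tokens_per_frame / 1_000_000 * usd_per_mtok
# ~= $97,200 per property per day if every frame were described

surviving_events  = 300                     # descriptions on a typical day (assumption)
after_cascade_usd = surviving_events * tokens_per_frame / 1_000_000 * usd_per_mtok
# under a dollar a day at the same hosted price, and near zero on a small local VLM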

What happens behind the scenes when I type 'masked person near loading dock after midnight last week'?

The query is split into three parts before anything is matched. (1) A time window: 'after midnight last week' becomes an iso8601 range that bounds the rows considered. (2) A structured filter: 'loading dock' is matched against tile.label, the human-readable camera name the DVR already paints onto its multiview strip and that the index keys events by. (3) A text match: 'masked person' becomes a token plus an embedding lookup that scores the one_line_summary column. The intersection is the result set. The video file is never opened during this process. Only when the operator clicks a thumbnail does the timestamp open the corresponding clip on the recorder.

Why does the index key on the camera name and not the channel number?

Because channel numbers move and camera names do not. When a tech swaps cameras on a re-cable, or maintenance rearranges the multiview to put the loading dock on a bigger tile, channel 6 stops being the loading dock. Every saved English query that referenced 'loading dock' would silently start returning hits from the wrong camera if the index keyed on channel. The DVR already stamps a human-readable name onto each tile (Loading Dock NE, Mailroom Interior, Lobby W). Reading that name off the composite frame at install time, then keying every event by that name, is what keeps an English query stable across recabling, channel swaps, and layout changes.
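
A small illustration of why that keying choice matters, with a hypothetical channel map before and after a re-cable; the saved query references the painted label, so the swap never touches it.

# Channel numbers move; painted labels do not.
channel_map_before = {6: "Loading Dock NE", 7: "Mailroom Interior"}
channel_map_after  = {6: "Mailroom Interior", 7: "Loading Dock NE"}  # tech swapped the runs

saved_query = {"tile_label": "Loading Dock NE", "text": "masked person"}

def matches(event: dict, query: dict) -> bool:
    # Events are keyed by the label read off the multiview, so the swap above is invisible here.
    return event["tile_label"] == query["tile_label"]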

What kinds of queries fail, and why?

Three failure modes. First, queries about things the model never described. If 'license plate' was never asked of the model at event time, no description carries the plate and a search for 'red Camaro plate ABC123' returns nothing. The fix is to add a license plate prompt to the cascade, not to re-run a search differently. Second, queries about things outside any armed zone. The cascade only escalates events inside zones, so an English query about the back fence when there is no armed zone there returns nothing because no event row was ever written. Third, queries against a window before the unit was installed. The index is forward-looking. It cannot retroactively write descriptions for footage from last March if Cyrano went live in May.

Does the natural language search use a vector database, a SQL table, or both?

Both, joined on the event id. The structured fields (tile.label, event_class, iso8601_ts, property, layout_id, overlay_mask) live in a relational table and are queried with WHERE clauses. The one_line_summary lives next to it and has both a full-text index for keyword matches and an embedding index for semantic matches. A query is parsed into a SQL WHERE for the structured part and a vector + text search for the sentence part. Results are joined back on the event id and sorted by recency. Splitting the index this way is what makes 'after midnight last week' resolve as a hard filter instead of being thrown into the embedding search and confusing it.
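
A sketch of that hybrid lookup, assuming the sqlite schema from the Part 2 sketch above, an in-memory dict of embedding vectors keyed by event_id, and an embed() callable standing in for whatever embedding model built the index. The structured WHERE runs first; the sentence scoring only ever sees the survivors.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(db, vec_index, filters, tokens, embed, threshold=0.78):
    # 1. Structured side: hard filters trim the candidate set to a few hundred rows.
    rows = db.execute(
        "SELECT event_id, one_line_summary FROM events "
        "WHERE iso8601_ts BETWEEN ? AND ? AND tile_label = ?",
        (filters["ts_start"], filters["ts_end"], filters["tile_label"]),
    ).fetchall()

    # 2. Sentence side: keyword hit OR embedding similarity, over the candidates only.
    qvec, hits = embed(" ".join(tokens)), []
    for event_id, summary in rows:
        keyword  = any(t in summary for t in tokens)
        semantic = cosine(qvec, vec_index[event_id]) > threshold
        if keyword or semantic:
            hits.append(event_id)       # joined back to the full row on event_id
    return hits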

What is the latency of a natural language query against a property's index?

Sub-second on a property with 12 to 24 months of indexed events. The structured filter trims the candidate set to a few hundred or low thousands of rows in milliseconds. The text and embedding match runs over those candidates, not the whole index, so the scoring step is cheap. The thumbnail strip renders progressively as the rows resolve. The dominant time is the network hop between the operator's browser and the dashboard, not the search itself. Compare this with the alternative (scrubbing four channels in 4x playback for an hour) and the gap is three orders of magnitude.

Can the index be exported, queried by SQL, or fed into another tool?

Yes. The event table is a normal columnar store on the device with the nine structured fields plus the one_line_summary plus the embedding vector. Read access is exposed over a local SQL surface for integrators who want to script against it, build a custom dashboard, or feed events into a CMMS or insurance reporting flow. Write access is restricted to the Cyrano cascade so the index cannot be polluted by a misconfigured client. Exports are row-by-row JSON, the same shape the dashboard uses internally, so what you see in the search UI and what you pull into a script are exactly the same record.
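
A sketch of what that read path could look like, assuming the sqlite layout from the Part 2 sketch (a read-only connection and a row_json column holding the full record). The device's actual SQL surface and table names may differ; the point is row-by-row JSON out, with no write path exposed.

import sqlite3

def export_events(db_path: str, since_iso: str, out_path: str) -> int:
    # Read-only connection; write access to the index stays with the cascade.
    db = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    rows = db.execute(
        "SELECT row_json FROM events WHERE iso8601_ts >= ? ORDER BY iso8601_ts",
        (since_iso,),
    )
    count = 0
    with open(out_path, "w") as out:
        for (row_json,) in rows:
            out.write(row_json.strip() + "\n")  # one JSON object per line, same shape the dashboard uses
            count += 1
    return count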

How does this differ from the AI search I already see in some camera apps?

Two structural differences. First, most app-level AI search is keyed to one camera brand and runs in that brand's cloud, so a property with mixed cameras (a few Hikvision, a few Dahua, a couple of Reolinks) has three separate search experiences and no portfolio-wide query. The DVR-side approach keys on the human-readable tile.label that any recorder paints, which makes the index brand-agnostic. Second, app-level AI search typically runs the model at query time on cached clips, which is why it feels slow and why coverage gaps appear when the cache misses. The description-at-write-time approach pre-pays the model cost on the event, so query time is plain retrieval. Different latency profile, different cost profile, different surface area.

Does natural language search require uploading footage to the cloud?

Not for the index itself. On a Cyrano unit the detector, the dwell logic, the description model (when running on a small local VLM), and the index all live on the device at the property. The cloud is only on the egress path: the dashboard fetches the index over an authenticated channel so an operator on their phone can query a property they are not physically at. Footage stays on the recorder under whatever retention policy was already in place. If a frontier model is in the cascade, only the 5-frame burst plus the prompt leaves the property, not raw video. The index is queryable from anywhere; the video stays where it always lived.
