Cyrano Security
13 min read
Cross-domain edge AI guide

Hearing aids were the first device that could not afford a cloud round trip. Security cameras are the next one.

Edge AI hearing aids run their deep neural network on a chip behind your ear because the speech you hear from the hearing aid has to arrive within 10 milliseconds of the speech your ear canal hears directly. There is no version of that pipeline that touches the cloud. This guide is about what the engineering response to that constraint (a fully integrated NPU on the chip, sub-1W power, zero cloud dependency) tells us about every other always-on edge AI category that came after, including the one Cyrano works in: security cameras.

See the security camera version of the same pattern
4.9 from 50+ properties
On-device inference, no cloud round trip required
Up to 25 camera tiles per $450 unit, ~7s median latency
Reads HDMI output of the existing DVR, no RTSP or ONVIF
9-field event schema works across 14+ recorder brands

The latency physics that started it all

The human auditory cortex fuses sound from two paths: the direct acoustic path through your ear canal, and the electrically processed path coming out of the hearing aid receiver. If those two paths arrive within roughly 10 milliseconds of each other, your brain treats them as the same event. Outside that window, the two paths decouple. You hear an echo, a smear, or a phasey hollowness that audiologists call comb filtering. The hearing aid stops being useful and starts being actively uncomfortable.
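The comb-filtering effect is simple enough to sketch. Summing a signal with a delayed copy of itself cancels at odd multiples of half the inverse delay. A minimal illustration (equal-amplitude paths assumed; the function name is invented for this sketch, not audiology tooling):

```python
# Sketch: where comb-filter notches land when a direct acoustic path is
# summed with a delayed processed path. Equal-amplitude paths assumed
# for simplicity; real hearing aid paths differ in level and phase.

def comb_notch_frequencies_hz(delay_ms, count=4):
    """Destructive-interference frequencies for a path mismatch of
    delay_ms: f_k = (2k + 1) / (2 * tau), with tau = delay_ms / 1000."""
    return [(2 * k + 1) * 1000.0 / (2 * delay_ms) for k in range(count)]

# A 10 ms mismatch puts the first notch at 50 Hz, then every 100 Hz,
# right across the speech band.
print(comb_notch_frequencies_hz(10))  # [50.0, 150.0, 250.0, 350.0]
```

The notches march straight through the frequencies that carry vowels and consonant energy, which is why the mismatch reads as hollowness rather than a discrete echo.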

That 10 ms is not a software target. It is biological. It is the same number whether the year is 1985 or 2026, whether the hearing aid is a $300 OTC unit or a $7,000 top-of-the-line prescription device. The number is fixed by the brain.

That single number is what made on-chip edge AI non-negotiable in hearing aids. A Bluetooth round trip to a phone is 30 to 100 ms in the best case. A round trip to a cloud server is hundreds of milliseconds. Both are far past the perception window. So the only place the deep neural network can live is on the same silicon that drives the receiver.

The four constraints that define always-on edge AI

Edge AI hearing aids exist because four constraints collide. Every other always-on edge AI category that has shipped since hits at least two of the same four. Read the constraint list and you can predict whether a category will end up on the cloud or on the chip.

Latency budget below the cloud floor

If the action window is shorter than a cloud round trip (typically 200 to 600 ms in real conditions), the cloud is not an option for the inference itself. Hearing aids: 10 ms. Wake-word chips: 100 ms. Pacemaker arrhythmia: ~50 ms.

Power envelope below 1 watt

The chip has to run on the power a battery the size of a fingernail can sustain for 20 hours, which in practice means well under a watt. That rules out general-purpose CPUs and forces a purpose-built NPU integrated with the audio I/O.

No graceful fallback

The device cannot punt the workload to a phone or to the cloud when the network is bad. If the NPU stalls, the user gets silence (hearing aids) or a missed event (security). The DNN has to ship on the device or not ship at all.

Continuous, steady-state workload

The model runs every frame, every sample, every minute the device is on. There is no batching, no off-peak. That allows hardware designers to co-locate the NPU with the sensor and skip the system bus.
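The constraint list above can be sketched as a quick edge-vs-cloud predictor. This is an illustration, not anyone's real decision code: the 200 ms floor is the optimistic end of the guide's 200 to 600 ms real-world cloud round trip, and the `Workload` fields and function names are invented for the sketch.

```python
# Sketch: predicting edge vs. cloud from the constraints above.
# Thresholds and field names are illustrative; the 200 ms floor is the
# optimistic end of the guide's 200-600 ms real-world cloud round trip.
from dataclasses import dataclass

CLOUD_FLOOR_MS = 200

@dataclass
class Workload:
    action_window_ms: float  # perception or action window
    has_fallback: bool       # can it punt to a phone or the cloud?
    continuous: bool         # always-on, every-frame workload?

def must_run_on_device(w: Workload) -> bool:
    if w.action_window_ms < CLOUD_FLOOR_MS:
        return True  # window closes before a cloud round trip returns
    if w.continuous and not w.has_fallback:
        return True  # no graceful degradation path when the network drops
    return False

hearing_aid = Workload(10, has_fallback=False, continuous=True)
camera      = Workload(7_000, has_fallback=False, continuous=True)
phone_app   = Workload(2_000, has_fallback=True, continuous=False)

print(must_run_on_device(hearing_aid))  # True  (latency)
print(must_run_on_device(camera))       # True  (no fallback)
print(must_run_on_device(phone_app))    # False (cloud is an option)
```

Note the two distinct routes to the same answer: hearing aids are forced on-chip by latency alone, security cameras by the no-fallback and continuous-workload pair.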

Same architecture, different latency budgets

Once you accept the architecture (integrated NPU, no cloud round trip), the question becomes: what window am I working inside? Hearing aids work inside the brain's auditory fusion window. Security cameras work inside the human-action window for an alerted operator. The numbers are very different, but the design pressure is the same.

The same edge AI pattern, applied to three different perception windows

  • Hearing aid: microphone array → integrated NPU → speech enhanced (~10 ms)
  • Security camera: DVR HDMI output → integrated NPU → event indexed (~7 s)
  • Wake-word chip: far-field mic → integrated NPU → wake fired (~100 ms)

The latency numbers on the right are not a ranking. They are the perception or action window the device is operating inside. Hearing aids must finish inside the biological fusion window. Security cameras must finish inside the human response window. Wake-word chips must finish inside the user-patience window for a missed trigger.

Anchor fact

10 ms vs. 7 seconds: same architecture, two different physics

Starkey Edge AI runs its deep neural network on a G2 Neuro Processor with a fully integrated NPU, drawing under 1 watt, finishing each inference in under 10 milliseconds. The number is fixed because the human auditory fusion window is biological.

Cyrano runs per-tile classification on up to 25 parallel camera tiles ingested from a single HDMI cable on a $450 appliance, with a median capture-to-index latency of about 7 seconds. The number is fixed because the human security action window is operational, not biological. Both numbers are inside their domain's perception envelope. Both ship without a cloud round trip in the inference path. Same edge architecture, two different physics.

At one Class C multifamily property in Fort Worth, Cyrano caught 20 incidents including a break-in attempt in the first month, with no cloud streaming of the underlying video.

Fort Worth, TX property deployment

The hearing aid checklist for any edge AI buy

Audiology has spent a decade refining the questions you ask before you buy an edge AI hearing aid. Most of those questions translate directly to any other edge AI category. If a vendor cannot answer them, the device is either not actually edge or not actually production.

Questions that work for hearing aids and for cameras

  • What is the inference latency, in milliseconds, on the actual device under continuous load?
  • Is there an integrated NPU, or is the DNN running on a general-purpose CPU?
  • What is the power draw of the NPU specifically, not the device idle figure?
  • Does the inference path require an internet connection? What happens when the connection is down?
  • How do model updates ship, and how often?
  • What sensor stream is the NPU consuming, and what is the schema of the events it emits?
  • What is the failure mode if the NPU stalls or hits an unrecognized input?

Hearing aids vs. security cameras: where the pattern stays, where it diverges

Most of the engineering DNA from edge AI hearing aids ports straight over to edge AI security cameras. Two things do not. Knowing which is which is the real cross-domain insight.

Feature | Starkey Edge AI (hearing aids) | Cyrano (security cameras)
Inference location | On-die NPU, behind the ear | Dedicated appliance, on the DVR shelf
Sensor input | Microphone array, ~16 kHz audio | HDMI multiview from the DVR, 1080p video tiles
Latency budget | ~10 ms (auditory fusion window) | ~30 sec (human security action window)
Power envelope | Sub-1W, button-cell battery | Sub-15W, USB-C wall power
Model size | Single-digit MB, very tight | Tens of MB, plug-in power affords more
Cloud dependency in inference path | None | None
Update cadence | Monthly to quarterly via clinic or app | Continuous, OTA, model + overlay_mask library
Cross-vendor input compatibility | Single-vendor signal chain (own receiver) | Reads HDMI from 14+ DVR/NVR brands via overlay_mask

How Cyrano applies the pattern to a real building

The hearing aid version of edge AI lives on the chip. The security camera version of edge AI lives in the office closet, plugged into the cable that already drives the wall monitor. The install is short because the architectural decision (no cloud round trip in the inference path) does most of the work.

Edge AI on a security camera fleet, end to end

1. Identify the HDMI port on the existing DVR

Every DVR drives a wall monitor in the office. That HDMI cable already carries the multiview tile grid the operator watches.

2. Plug the Cyrano edge unit inline

The $450 appliance sits between the DVR and the monitor. The wall monitor still shows the same picture. The unit captures the same picture for inference.

3. Decode the multiview into per-tile streams

The unit knows the layout (4x4, 5x5, 3x3) and the brand-specific UI chrome (clock, channel name strip, channel bug). It masks the chrome and emits one logical video per tile.
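The geometry in step 3 is plain integer arithmetic. A sketch of slicing a 1080p multiview frame into per-tile rectangles; this is an illustration, not Cyrano's actual decoder, and the brand-specific chrome masking is left out:

```python
# Sketch: slicing a 1920x1080 multiview frame into per-tile rectangles.
# The grid sizes (4x4, 5x5, 3x3) come from the guide; this is plain
# geometry, not Cyrano's decoder, and chrome masking is omitted.

FRAME_W, FRAME_H = 1920, 1080

def tile_rects(rows, cols):
    """Return (index, x, y, w, h) per tile, row-major."""
    w, h = FRAME_W // cols, FRAME_H // rows
    return [(r * cols + c, c * w, r * h, w, h)
            for r in range(rows) for c in range(cols)]

grid = tile_rects(4, 4)  # 16 tiles, each 480x270 pixels
print(grid[0])   # (0, 0, 0, 480, 270)
print(grid[5])   # (5, 480, 270, 480, 270)
```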

4. Run on-device DNN per tile

Up to 25 tiles classified in parallel on the integrated NPU. No frame leaves the device for inference. Median capture to indexed event: ~7 seconds.

5. Emit a 9-field event record

Each event writes tile.label, tile.index, tile.coords, property, layout_id, overlay_mask, event_class, iso8601_ts, latency_ms, plus a 480x270 thumbnail. Same shape regardless of recorder brand.

6. Forward only events, never raw video

The cloud sees the 9-field record and the small thumbnail. The full video stays on the property's existing DVR. The bandwidth cost is a few kilobytes per event, not megabits per camera per second.
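The bandwidth claim in step 6 is easy to check with back-of-envelope arithmetic. The event size (4 kB), event count (500 per month), and 2 Mbps per-camera stream rate below are illustrative assumptions, not measured figures; only the events-not-video architecture is Cyrano's:

```python
# Sketch: 30-day upload volume for event-forwarding vs. raw cloud
# streaming. Event size, event count, and per-camera bitrate are
# illustrative assumptions.

SECONDS_30D = 30 * 24 * 3600  # 2,592,000 seconds

def events_only_bytes(events_per_month, kb_per_event=4):
    """Total upload if only event records + thumbnails leave the property."""
    return events_per_month * kb_per_event * 1024

def raw_streaming_bytes(cameras, mbps_per_camera=2.0):
    """Total upload if every camera streamed to the cloud 24/7."""
    return int(cameras * mbps_per_camera * 1e6 / 8 * SECONDS_30D)

print(events_only_bytes(500) / 1e6)    # ~2 MB for the month
print(raw_streaming_bytes(16) / 1e12)  # ~10.4 TB for the month
```

Under these assumptions the two paths differ by six to seven orders of magnitude, which is why the events-only design survives on whatever uplink the property already has.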

One event record, written entirely on-device

The record Cyrano writes for every indexed event (cyrano.event.json) carries nine fields. tile.label is the camera name the DVR stamps on its own UI strip. The overlay_mask field records which brand template was applied, so the same schema works across recorder brands. The whole record is generated on the device; nothing in the payload requires a cloud round trip to produce.
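A minimal sketch of what a record with this schema might look like. The field names follow the list above; every value is invented for illustration, not taken from a real deployment:

```python
# Sketch: one event record with the 9-field schema described above.
# Field names follow the guide; every value is invented for illustration.
import json

event = {
    "tile.label":   "LOBBY-CAM-03",        # name the DVR stamps on its UI strip
    "tile.index":   5,                     # position in the multiview grid
    "tile.coords":  [480, 270, 480, 270],  # x, y, w, h inside the HDMI frame
    "property":     "example-property",    # placeholder identifier
    "layout_id":    "4x4",
    "overlay_mask": "brand-template-07",   # which brand chrome mask was applied
    "event_class":  "person_detected",
    "iso8601_ts":   "2026-01-15T03:12:44Z",
    "latency_ms":   6800,                  # capture-to-index, ~7 s median
}

payload = json.dumps(event)
print(len(payload.encode()), "bytes")  # a few hundred bytes, thumbnail aside
```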

Numbers that anchor the comparison

Two columns. Both real. Both edge. Both built around the constraint that the inference cannot leave the device.

10 ms: hearing aid inference budget
~7 sec: Cyrano median capture-to-index
25: camera tiles per Cyrano unit
$450: Cyrano hardware, one-time

The always-on edge AI category, by latency budget

Pull the existing always-on edge AI products into one row and sort them by their perception or action window. The list is short, but it tells you exactly where the architecture has already won and where it is still arriving.

  • Hearing aids: 10 ms
  • Pacemakers: ~50 ms
  • Wake-word chips: ~100 ms
  • Automotive radar: ~200 ms
  • Wearable fall detection: ~1 s
  • Security camera events: ~7 s
  • Industrial vision QA: ~30 s
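Expressed as data, the list splits cleanly at the cloud floor. Latency alone forces only the fastest categories onto the chip; the slower ones land there through the no-fallback, bandwidth, and privacy constraints rather than raw latency. A small sketch using the guide's own numbers:

```python
# Sketch: the category list as data, split at the guide's optimistic
# 200 ms cloud-round-trip floor. Names and numbers are from the list above.
CLOUD_FLOOR_MS = 200

windows_ms = {
    "hearing aids": 10,
    "pacemakers": 50,
    "wake-word chips": 100,
    "automotive radar": 200,
    "wearable fall detection": 1_000,
    "security camera events": 7_000,
    "industrial vision QA": 30_000,
}

forced_to_edge = [name for name, window in
                  sorted(windows_ms.items(), key=lambda kv: kv[1])
                  if window < CLOUD_FLOOR_MS]
print(forced_to_edge)  # ['hearing aids', 'pacemakers', 'wake-word chips']
```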
"The thing nobody told me when I started shopping for AI cameras is that 'AI' has two completely different meanings. Cloud AI means you're paying for someone to upload your video. Edge AI means the box does the thinking and only the answer leaves the building. The hearing aid people figured this out a decade before the camera people did."

Property technology lead
Multifamily operator, Southeast

What the count actually looks like

One device. Multiple tiles. No cloud streaming of raw video. The numbers below are what a typical Cyrano deployment produces in a 30-day window on a 16-tile multiview, all classified on-device.

Indexed events / 30 days: 16-tile multiview, common areas
Median capture-to-index: ~7 sec, on-device, no cloud in the loop
Raw video uploaded: 0 bytes; only events leave the property

Want the security camera version of the hearing aid pattern?

15 minute walkthrough. We plug into a live DVR's HDMI output and show you the per-tile event log running entirely on-device, with no cloud video upload.

Book a call

Frequently asked questions

What does 'edge AI hearing aids' actually mean, in plain language?

It means the AI that classifies the soundscape and enhances speech runs on a chip inside the hearing aid itself, not on your phone and not in a cloud datacenter. Starkey's Edge AI line, for example, ships with a G2 Neuro Processor that has a fully integrated NPU (neural processing unit) on-die, dedicated to running deep neural networks for noise reduction and speech enhancement in real time. The 'edge' in edge AI is not a marketing term here. It is a physical place: the silicon behind the receiver, behind the wax guard, behind your ear.

Why does the AI have to run on the device? Why not stream sound to a phone?

Because of latency physics. The human auditory system fuses the direct acoustic path through your ear canal with the processed path from the hearing aid; if those two paths arrive more than roughly 10 milliseconds apart, you perceive an echo or a smear that defeats the purpose of the device. A round trip from the hearing aid to a phone over Bluetooth and back is on the order of 30 to 100 ms even on best-in-class transports. A round trip to a cloud server is hundreds of milliseconds. Neither is fast enough. The DNN has to live on the same chip as the receiver, draw less than a watt, and finish each inference inside the perception window.

How does an NPU on a hearing aid differ from a generic mobile NPU?

Three ways. First, the workload is steady-state and known: a small set of audio models running continuously, not a parade of different mobile apps each demanding their share of compute. That lets the chip designer co-locate the NPU with audio I/O so there is no round trip across a system bus. Second, the power envelope is brutal. A hearing aid runs ~20 hours on a battery the size of a watch button; the NPU must contribute less than a few hundred milliwatts. Third, there is no fallback. A phone NPU can punt to the CPU or the cloud. A hearing aid NPU cannot. If it can't classify the room in 10 ms, the user hears nothing useful.

Hearing aids are not the only always-on edge AI device. What else is in this category?

The category is anything that has to make sense of a continuous sensor stream within a perception or action window, where shipping the stream to the cloud is impossible or unwise. Hearing aids do it for ~10 ms speech enhancement. Smart cameras (Cyrano, Wyze, Eufy, Reolink) do it for second-scale event detection on continuous video. Voice-activated wake word chips (Amazon Echo, Google Nest) do it for ~100 ms keyword spotting. Wearables doing fall detection, pacemakers doing arrhythmia classification, automotive radar doing collision-avoidance: all of them share the same constraint pattern. The latency budget changes (10 ms for hearing, 30 seconds for security action), but the architectural pressure is the same.

How does this connect to security cameras?

The same engineering constraints. A security camera doing real-time event detection cannot stream 24/7 video to the cloud (bandwidth, cost, privacy), cannot tolerate cloud latency on a developing incident, and cannot trust an internet connection at the moment a power-cut intruder kicks the panel. So it has to run inference locally. The latency budget is wider than hearing (you have ~30 seconds, not 10 ms, before a human cannot meaningfully act on an alert), but the no-cloud-round-trip rule is identical. Cyrano's edge AI device for security cameras applies the hearing aid pattern to commercial video: an NPU on a small box, fed continuous video, classifying events, no cloud streaming required.

What is Cyrano's specific edge AI architecture for security cameras?

A single $450 device that ingests up to 25 camera tiles in parallel. The novel input path: instead of pulling RTSP streams or replacing analog cameras, Cyrano plugs into the HDMI output of the existing DVR or NVR (the cable that drives the wall monitor in the leasing office). It captures that HDMI signal, splits it back into the multiview tile grid, and runs per-tile inference on each camera pane. Median capture-to-index latency is roughly 7 seconds. Every event writes a 9-field record (tile.label, tile.index, tile.coords, property, layout_id, overlay_mask, event_class, iso8601_ts, latency_ms). No DVR credentials are used. No cloud round trip happens for the inference itself.

Why 7 seconds for security and 10 ms for hearing? Isn't 7 seconds slow?

It is slow on the speech timescale. It is fast on the human-action timescale. A hearing aid has to beat the brain's auditory fusion window, which is biological and non-negotiable. A security alert has to beat the time it takes for a monitoring agent to notice, classify, and dispatch a response (typically 30 to 90 seconds end-to-end on a well-run service). Inside that window, 7 seconds of edge classification leaves a comfortable margin. The lesson from edge AI hearing aids is not 'all edge AI must be 10 ms,' it is 'edge AI must be inside the perception or action window for its own domain.' Cyrano's window is human security response. Starkey's window is auditory fusion. Both are edge.

Are there any edge AI hearing aid lessons that do NOT translate to security cameras?

Yes, two. First, model size. Hearing aid DNNs are tiny (single-digit megabyte) because power is the binding constraint; security camera edge AI can spend tens to hundreds of megabytes because the appliance plugs into wall power. Second, model freshness. Hearing aid firmware updates ship monthly to quarterly because audiology is a stable domain; security event classifiers benefit from more frequent updates because the 'what does a tailgate look like' question is influenced by recorder UI changes, lighting, and seasonality. Cyrano's overlay_mask library, which lets one event schema work across 14+ DVR brands, is the security-side answer to a problem hearing aids do not have because they own the entire signal path.

Is the Starkey Edge AI hearing aid the only product with this architecture?

It is the most prominent. Starkey shipped Edge AI in 2024 with the G2 Neuro Processor, the first hearing aid with a fully integrated NPU. Starkey claimed roughly 100x DNN throughput vs. their prior Genesis AI platform. As of January 2026, a firmware update added Auracast assistant support. Edge AI was succeeded as the flagship by Starkey Omega AI in October 2025, but the underlying NPU pattern is now industry-standard; competitors like Phonak and Oticon have parallel chips with on-device DNN paths. The pattern (integrated NPU, sub-10ms inference, no cloud) is no longer novel within hearing aids; it is now the baseline.

If I'm researching 'edge AI hearing aids' as a hearing aid buyer, what should I actually compare?

Five things, in this order. (1) Inference latency at the receiver, in milliseconds, not the marketing 'real time.' (2) DNN throughput on the integrated NPU, typically expressed as 'X times the prior generation.' (3) Battery life on continuous DNN load (most spec sheets quote idle). (4) Whether the device has a fallback path if the DNN fails (it should not need one, but knowing the failure mode matters). (5) Frequency of model updates and how they ship (over-the-air vs. clinic visit). The Starkey product pages cover #1 through #4 well but are quiet on #5. The cross-domain 'why edge' framing in this guide is meant to give you the vocabulary to ask the audiologist the right questions, not to replace their fitting.

🛡️CyranoEdge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
