Cyrano Security
14 min read
The architecture behind the quote

Larry Ellison described an outcome. The part nobody is mapping is the architecture it actually requires.

“Citizens will be on their best behavior, because we are constantly recording and reporting everything that is going on.” That line went viral in September 2024 and has been re-quoted across Fortune, 404 Media, Futurism, Interface, and Yahoo Finance. Every piece of coverage stayed at the level of the quote. None of them asked the engineering question underneath it: to deliver the outcome Ellison described, what does the system have to look like? The answer is a specific stack of four architectural choices. And a local, monitor-first edge system like Cyrano cannot implement any of them, as a matter of physics, not policy.

See the opposite architecture on a live DVR
4.9 from 50+ properties
Per-event egress on a Cyrano unit: about 240 KB
Per-tile pixel budget on a 5x5 multiview: 384 x 216
Continuous cloud video upload: 0 bytes
Cross-property centralized identity index: none

What Ellison actually said, and what the coverage left on the table

At the Oracle financial analyst meeting in September 2024, Larry Ellison described a future in which AI continuously watches data from security cameras, police body cameras, doorbell cameras, and dashboard cameras, with a central system flagging problems and keeping everyone, including police officers, on what he called their “best behavior.” Fortune and 404 Media quoted it. Futurism and Interface critiqued it. Yahoo Finance tied it to Oracle's then-pending TikTok role. The political critique has been done thoroughly.

What has not been done is the architecture pass. The Ellison outcome is not a single decision. It is a stack of four design choices. If any one of them is absent, the outcome is not reachable. Most of the AI surveillance systems sold into properties today sit somewhere on that stack; a few refuse all of it. A property manager picking a vendor is, whether the procurement page says so or not, picking a point on this stack.

Egress per delivered event on a Cyrano unit: ~240 KB
Continuous cloud video upload from a Cyrano property: 0 Mbps
Person height at 30 ft on a 5x5 multiview tile: ~36 px
JSON metadata sent per event: 612 B

The four architectural primitives Ellison's pitch requires

Break the “AI watches everything” vision into the smallest pieces it can be built from. Four primitives fall out. Each one is load-bearing. Remove any single one and the Ellison outcome collapses into something much narrower. A local edge deployment removes all four at the architectural level.

Primitive 1: continuous cloud upload

Reporting 'everything that is going on' means the central system has to see everything, which means every camera uplinks full-frame video continuously. At 1080p30 per camera, a citywide fleet generates petabytes per day, all of it crossing the network. The edge refusal: a monitor-first unit uplinks only ~240 KB per delivered event. The raw multiview never leaves the building.
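The arithmetic behind "petabytes per day" is easy to check. A minimal sketch, assuming a conservative 4 Mbps H.264 encode per 1080p30 camera and a hypothetical fleet of 100,000 cameras (both figures are illustrative, not from the article):

```python
# Back-of-envelope bandwidth for the two architectures. The per-camera
# bitrate, fleet size, and per-property event rate are assumptions.
CAMERA_BITRATE_MBPS = 4          # assumed 1080p30 H.264 encode rate
CAMERAS = 100_000                # hypothetical citywide fleet
SECONDS_PER_DAY = 86_400

bits_per_day = CAMERA_BITRATE_MBPS * 1e6 * SECONDS_PER_DAY * CAMERAS
petabytes_per_day = bits_per_day / 8 / 1e15
print(f"cloud: {petabytes_per_day:.1f} PB/day")   # ~4.3 PB/day, continuous

# Edge alternative: event-only egress at ~240 KB per delivered event.
EVENT_KB = 240
EVENTS_PER_DAY = 50              # assumed busy-property event count
edge_mb_per_day = EVENT_KB * EVENTS_PER_DAY / 1000
print(f"edge:  {edge_mb_per_day:.0f} MB/day per property")   # ~12 MB/day
```

At this scale the gap per endpoint is several orders of magnitude, which is why the uplink medium itself (building fiber versus a metered eSIM) is an architectural tell.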

Primitive 2: identification models

A surveillance state that names who is on camera is different from one that only counts them. Identification needs face-ID or plate-ID models running at the pixel budget those models expect (roughly 80-100 px interocular for a face, 40-60 px per plate character). The edge refusal: a 384 x 216 tile never delivers that budget, so identification is ruled out at the input.

Primitive 3: centralized cross-camera index

The citywide 'where is person X right now' query requires a single index that all cameras feed. Edge systems that keep event metadata at the property cannot answer cross-property questions. That is a feature, not a gap, if the trust model says so.

Primitive 4: per-camera network access

Cloud inference has to reach the cameras. Either RTSP per camera or a VMS gateway. That is what forces the network device, the firewall exception, and the credentials hunt. Monitor-first reads the DVR's HDMI monitor signal instead, which is electrically isolated from the network.

Why four, not one or seven

Two primitives (upload + centralized index) would give you a warehouse of footage but no real-time 'best behavior' loop. Three (add models) gives you offline forensics, not continuous supervision. The Ellison outcome needs all four running at once, at scale. Which is exactly the architecture a local edge system explicitly rejects.

The shape of the two architectures, side by side

The top-down view of Ellison's stack is a hub-and-spoke: every camera uplinks to a central analysis cluster, and the cluster writes back to a cross-camera index. The edge stack is an anti-hub: each property terminates its own analysis loop on-site and emits only the events it chose to emit. Drawn out:

Two ways to wire an AI surveillance system

Every camera → RTSP stream → face / plate model → centralized cloud → behavior flags, long-term footage, correlation queries

The opposite wiring: property-terminated edge

DVR HDMI out → Cyrano HDMI in → on-device detector (edge unit at property) → ~240 KB event → property dashboard. No cross-property index.

The primitive-by-primitive comparison

Map each of the four Ellison primitives onto what the edge architecture does instead, concretely.

Ellison-style cloud AI surveillance vs. monitor-first edge

Each row is one of the four primitives, expanded into what the two architectures actually emit, require, and guarantee.

Feature | Ellison-style cloud surveillance | Monitor-first edge (Cyrano)
Video path leaving the property | continuous 1080p per camera, 40-60 Mbps per 25-camera site | event-only, ~240 KB per delivered event
Uplink medium | fixed building fiber or dedicated circuit | cellular eSIM
Model class run against a given camera | face identification, plate reading, behavior inference | person / vehicle / package at class level
Per-camera pixel budget at the detector | 1920 x 1080 per stream (sufficient for identification) | 384 x 216 at 5x5 tile (below face-ID floor)
Cross-camera / cross-property correlation | central index across all cameras in deployment | not supported; events terminate at property
Network presence on property LAN | RTSP credentials + firewall exceptions required | zero; HDMI is electrically isolated
What a subpoena can ask the vendor to produce | the full continuous archive of all cameras | only the events the property already saw
Capability ceiling change by firmware update | possible; cloud models can be retargeted server-side | impossible; input pipe bounds capability
Install path on an existing DVR | weeks; requires DVR access, VLAN, credentials | under 2 minutes, no credentials, no IT ticket
Who gets to read the raw multiview | the vendor's cloud cluster, by contract | no one outside the property, ever

Why the pixel budget, not the model, is what rules out identification

A frequent rebuttal is that any AI surveillance system can do anything given enough software updates. That is false for a specific, measurable reason: the model can only see what the input pipe delivers. On a monitor-first edge unit the input is a composite HDMI frame with a fixed 1920 x 1080 resolution. The cameras share that budget. Below are the real per-tile resolutions at each grid size and what they actually resolve on a person standing 30 feet from a camera (a typical property distance).

Grid | Per-tile resolution | Person height at 30 ft
2x2 | 960 x 540 | ~90 px
3x3 | 640 x 360 | ~60 px
4x4 | 480 x 270 | ~45 px
5x5 | 384 x 216 | ~36 px

(Tile resolutions follow from dividing the 1920 x 1080 frame; person heights scale linearly from the 36 px figure at 5x5.)

This is the piece of the Ellison debate that treating it as a pure policy question misses. You cannot graft face identification onto a 36-pixel-tall person by shipping a better model. The model never gets more pixels than the input pipe delivers. A monitor-first system caps identification at the input, not at the model. That cap is verifiable by anyone with a sample frame and a ruler.
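That cap is checkable with arithmetic alone. A sketch, using the article's 36 px person height at 5x5/30 ft and an assumed 5:1 ratio between person height and interocular span (the ratio is an assumption for illustration):

```python
# Per-tile pixel budget on a 1920x1080 multiview, by grid size, versus
# the article's stated face-ID floor of ~80-100 px interocular.
FRAME_W, FRAME_H = 1920, 1080
PERSON_PX_5X5 = 36               # article's figure: person at 30 ft, 5x5 tile
FACE_ID_FLOOR_PX = 80            # low end of the stated interocular floor

for grid in (2, 3, 4, 5):
    tile_h = FRAME_H // grid
    person_px = PERSON_PX_5X5 * tile_h // (FRAME_H // 5)   # scales linearly
    interocular = person_px // 5     # assumed height-to-eye-span ratio
    verdict = "ok for face ID" if interocular >= FACE_ID_FLOOR_PX else "below floor"
    print(f"{grid}x{grid}: tile {FRAME_W // grid}x{tile_h}, "
          f"person ~{person_px}px, eyes ~{interocular}px -> {verdict}")
```

Even the 2x2 grid's 960 x 540 tile leaves a 30 ft person far below the identification floor, which is the point: the ceiling is set by the input pipe, not the model.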

The capability ceiling, before and after

Two different systems sold under the same keyword, quite often with similar screenshots. The ceiling of what each one can do in the field is completely different.

Cloud AI surveillance vs. monitor-first edge

Ellison-style cloud surveillance

Continuous upload, identification models, a centralized index, and per-camera network access. Capability ceiling is high; the privacy posture follows from cloud policy, not architecture.

  • Continuous cloud upload of every camera
  • Face identification on usable tiles
  • License plate reading where resolution allows
  • Cross-property correlation in a central index
  • Full footage archive the vendor can retrieve
  • Requires DVR credentials and a LAN presence

Monitor-first edge (Cyrano)

Event-only egress, class-level detection, no centralized index, no LAN presence. Capability ceiling is capped by the input pipe; the privacy posture follows from architecture, not policy.

  • Event-only egress, ~240 KB per delivered event
  • Class-level detection only: person / vehicle / package
  • No face identification at the 384 x 216 tile budget
  • No license plate reading at tile resolution
  • No cross-property correlation; events terminate at the property
  • No DVR credentials and no LAN presence

How a monitor-first property boundary actually holds

One way to evaluate an architectural claim is to walk through exactly what happens when a single event fires. Here is what the egress looks like for one delivered HIGH detection on a monitor-first unit. Every byte is accounted for. The absence of continuous video is what separates this from Ellison's architecture.

One event, start to delivered

Step 1 (t=0.0s): composite frame arrives at sampler

HDMI in from DVR, 1920 x 1080, one frame sampled per second. The full frame never leaves the device.

The sampler reads the HDMI signal coming out of the DVR's OUT port. The raw frame is held in device memory only long enough to run inference and the per-tile crops.

Step 2 (t=0.047s): per-tile inference completes

Detector runs on each of up to 25 tiles. A person silhouette fires on tile [2,3] (compactor alcove).

Class detection only: person / vehicle / package. No identification pass. Output is a bounding box, a class label, and a confidence score. Latency is ~47 ms per tile on current hardware.

Step 3 (t=0.6s): zone match and suppression

The bounding box is compared against the property's zone polygon for that tile. It falls inside an after-hours-restricted zone.

Suppression rules fire first: dedupe against events in the last 30 s, check the hour-of-day window, check resident-opted-in exclusions. This is what prevents the "every passing car is an event" cascade.

Step 4 (t=1.2s): event payload assembled

18 KB JPEG thumbnail, 220 KB six-second H.264 clip bracketing the detection, 612 B JSON with camera name, zone, class, confidence, and timestamps.

The clip is the six-second window centered on the detection (t minus 3 s to t plus 3 s), encoded at a low bitrate. Nothing else from the multiview is included. Audio is never attached.

Step 5 (t=1.4s): cellular uplink

~240 KB total leaves the device over the 4G eSIM. No property Wi-Fi, no Ethernet, no DVR network.

If the cellular modem is unplugged, the device holds events locally and reconciles when the link returns. If it never returns, no event ever leaves the property. That is the load-bearing property.

Step 6 (t=1.8s): event visible on property dashboard

Property staff see the thumbnail and can play the six-second clip. No cross-property correlation. No third party gets a feed.

The event is scoped to this property's account. It is not indexed against events at other properties. This is the architectural refusal of Ellison primitive 3 (the centralized cross-camera index).
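The step-3 suppression pass can be sketched in a few lines. Every name below (Detection, should_deliver, the rule parameters) is illustrative, not the shipped API; the rules themselves are the ones the step describes: a 30 s dedupe, an hour-of-day window, and opted-out zones.

```python
# Hypothetical sketch of the suppression pass, not the shipped code.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Detection:
    camera: str
    zone: str
    cls: str            # "person" / "vehicle" / "package"
    when: datetime

def should_deliver(det, recent, restricted_hours, excluded_zones):
    """True only if the detection survives every suppression rule."""
    if det.zone in excluded_zones:            # resident opt-out exclusion
        return False
    start, end = restricted_hours             # e.g. (22, 6) = 10 PM to 6 AM
    if start > end:                           # window wraps midnight
        in_window = det.when.hour >= start or det.when.hour < end
    else:
        in_window = start <= det.when.hour < end
    if not in_window:                         # only alert inside the window
        return False
    for prev in recent:                       # 30 s dedupe per camera/zone/class
        if (prev.camera, prev.zone, prev.cls) == (det.camera, det.zone, det.cls) \
                and abs((det.when - prev.when).total_seconds()) < 30:
            return False
    return True
```

Under these rules a person in the compactor zone at 2 AM delivers, a second firing ten seconds later is deduped, and the same silhouette at noon never leaves the device.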

A narrated sequence of the same event, compared to the cloud path

The difference between the two architectures is clearest when you watch one frame move through each of them. On the edge path the frame stops at the property. On the Ellison path the frame and every frame before and after it leave.

One frame, two architectures


Frame lands at the DVR

The DVR renders a 1920 x 1080 composite of 25 camera tiles and sends it out HDMI for the guard monitor.

The brands and input types where this architectural refusal holds

The monitor-first refusal is not tied to a specific camera or DVR brand. It works on any unit that emits a standard HDMI multiview to a monitor. The list below is every DVR family verified on a deployed property.

Hikvision
Dahua
Lorex
Swann
Uniview
Amcrest
Reolink
Q-See
Annke
ZOSI
Night Owl
Defender
LaView
SV3C
ONWOTE
Samsung SDH
Flir DNR
Hikvision rebadges (LTS / Hikua)
Dahua rebadges (Honeywell / FLIR)
Generic NVR with HDMI out

What orbits the edge unit, and what does not

Monitor-first is not a minimalist architecture for its own sake. It is deliberately missing the components that would let it participate in an Ellison-style index. Here is what is actually in the system, and what is deliberately outside it.

Cyrano edge unit (on-device, property-scoped)

HDMI in: DVR multiview
HDMI out: 1:1 passthrough
On-device detector: person / vehicle
Zone engine: property-local
Event encoder: thumb + clip
Cellular eSIM: event-only uplink

Not orbiting this unit: a face recognition model, a plate OCR model, a continuous cloud upload channel, a cross-property index, a LAN presence, any property credentials, any audio capture, any resident identity store. Each of those is a component you would need to implement Ellison primitive 1, 2, 3, or 4. None of them are here. Adding any of them would require a different product, not a firmware update.

240 KB

Total outbound bytes per delivered HIGH event on a Cyrano unit: one 18 KB JPEG thumbnail, one 220 KB H.264 clip of roughly six seconds, and one 612-byte JSON metadata object, uplinked over a cellular eSIM. No continuous property video ever leaves the building, on any day. The raw 1920 by 1080 multiview is discarded from device memory on each tick. That figure is what an 'AI surveillance system' egresses when the architecture is the opposite of the one Larry Ellison described.

Cyrano deployment telemetry
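The ~240 KB figure is just the sum of the three itemized payload parts; the sizes below are the article's own:

```python
# Sum of the per-event payload parts. Sizes are the article's figures;
# nothing else leaves the property per event.
THUMB_JPEG_B = 18 * 1024       # 18 KB JPEG thumbnail
CLIP_H264_B = 220 * 1024       # ~6 s low-bitrate H.264 clip
METADATA_JSON_B = 612          # JSON: camera, zone, class, confidence, timestamps

total_b = THUMB_JPEG_B + CLIP_H264_B + METADATA_JSON_B
print(f"{total_b / 1024:.1f} KB per delivered event")   # ~238.6 KB
```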

Four procurement questions that reveal which architecture you are buying

The Ellison outcome is not sold with that label. It is sold as “AI-powered surveillance,” which overlaps word for word with what a local edge system calls itself. These four questions separate them at the architectural level.

Ask every AI surveillance vendor before signing

  • What exactly leaves my building per hour of normal operation, in bytes and in kind? A monitor-first vendor answers in kilobytes per event. An Ellison-style vendor answers in megabits per second per camera. If the answer is ambiguous, it is almost always the second.
  • What models run on my premises versus in a central cluster? If the face-ID model is in the cloud, your cameras are streaming to the cloud. That is primitive 1 and 2. A local edge system runs the detector on the property and the cloud never receives the video.
  • Can my property be correlated with other properties in this vendor's index, and what is the opt-out? An Ellison-style vendor is usually built around the shared index. A local edge system does not have one, so there is nothing to opt out of. Either answer is fine if it is disclosed.
  • What is the physical input to the AI and what is its pixel budget? An HDMI monitor-first input caps at 384 x 216 per tile on a 5x5 multiview, which is architecturally below face identification. An RTSP-per-camera input does not cap identification. The ceiling is what you are paying for.
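
Question 1 can be made concrete with the article's own rates. A sketch for a 25-camera site, using the midpoint of the 40-60 Mbps continuous range and an assumed four delivered events per hour (the event rate is an assumption):

```python
# Bytes leaving a 25-camera site per hour, under each architecture.
CLOUD_MBPS = 50                      # midpoint of the 40-60 Mbps range
EVENT_KB, EVENTS_PER_HOUR = 240, 4   # per-event egress; assumed event rate

cloud_gb_per_hour = CLOUD_MBPS * 1e6 * 3600 / 8 / 1e9
edge_mb_per_hour = EVENT_KB * EVENTS_PER_HOUR / 1000

print(f"cloud: {cloud_gb_per_hour:.1f} GB/hour")   # 22.5 GB/hour
print(f"edge:  {edge_mb_per_hour:.2f} MB/hour")    # 0.96 MB/hour
```

A vendor whose honest answer is measured in gigabytes per hour is selling the Ellison stack, whatever the deck calls it.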

Three ways a property manager encounters this choice in the wild

Ellison talked about cities. Most buying decisions for AI surveillance happen at property granularity, where the person signing the contract is balancing resident incidents, insurance, and privacy complaints. These are three live situations where the architectural choice shows up as a direct operational question.

Situation 1

The privacy-sensitive property

A luxury rental or a corporate campus rejects any vendor that streams camera video off-property. An edge system passes the review in minutes because the egress contract is ~240 KB per event and the raw multiview never leaves the building. An Ellison-style system fails immediately on continuous cloud upload.

Situation 2

The “catch after-hours trespass” workload

The actual job is “flag a person in this zone after 10 PM.” That does not require identification. Class detection at the tile resolution does it. Paying for an Ellison-style stack here is overkill and introduces privacy liability the property does not need.

Situation 3

The cross-property investigation

A regional chain wants to correlate an incident across 40 properties. This is the workload that does want a centralized index and is where Ellison-style architecture earns its overhead. An edge system is the wrong tool here, and we say so to buyers with that workload.

See the architecture Larry Ellison did not describe.

On a 15-minute demo we bring a Cyrano unit, intercept the monitor HDMI on an existing DVR, and show the ~240 KB event egress, the 384 x 216 per-tile pixel budget, and the cellular-only uplink in real time. No RTSP credentials. No cloud video. No centralized index. The opposite of the stack Ellison described, running live.

Book the edge demo

The honest read on Ellison

Ellison was not wrong that AI on cameras is going to be everywhere. He was right about the direction. The part the coverage missed is that “AI on cameras” is not a single architecture. It is at least two. Cloud, continuous, identification-grade, cross-camera. Or local, event-only, class-level, property-terminated. Both exist. Both are sold under the same keyword. Both get called “AI surveillance.” The difference is the architectural ceiling, and a property manager buying one is choosing the ceiling whether the sales deck spells it out or not.

The reason to be specific about the primitives is that the ceiling, not the vibes, is what the privacy posture rests on. A vendor that can swap its face-ID model server-side has the capability even if its current policy is not to use it. A vendor whose input pipe is a 384 x 216 tile cannot use face ID on that pipe no matter what the current policy says. The second vendor is giving you a guarantee that is independent of management turnover and funding changes. That is the architectural argument. The policy argument is downstream of it.

Frequently asked questions

What did Larry Ellison actually say about AI surveillance?

At Oracle's September 2024 financial analyst meeting, Ellison described a future where 'citizens will be on their best behavior, because we are constantly recording and reporting everything that is going on.' He framed it as AI continuously analyzing data from security cameras, police body cameras, doorbell cameras, and vehicle dashboard cameras, with a central system flagging problems in real time. The phrase 'best behavior' is the one that went viral. The architecture behind the phrase is what nobody in the coverage picked apart.

How does Ellison's vision differ architecturally from an edge AI surveillance deployment?

Ellison's pitch requires four primitives to work the way he described it. One, continuous video upload from every camera to a central analysis cluster, because reporting 'everything' means the cloud has to see everything. Two, identification models running at scale, because naming who is on camera is what separates 'surveillance' from 'detection.' Three, a centralized cross-camera index, because a single view of a citizen across city, highway, and storefront only exists if the data all lands in one place. Four, network access to each camera feed, because cloud inference cannot run on signals it cannot reach. A monitor-first edge system like Cyrano implements none of these: it reads a composite HDMI frame at the DVR, runs inference on the property, sends about 240 KB per delivered event over cellular, and does not cross-correlate between properties.

Why does a 384 by 216 pixel tile preclude Ellison-style identification?

On a 1920 by 1080 HDMI multiview rendered as a 5 by 5 grid of 25 camera tiles, each tile is 384 by 216 pixels. A person standing 30 feet from a camera occupies roughly 36 pixels of vertical height in that tile. That is enough to reliably detect the silhouette class 'person,' and not enough to run a face recognition model with useful accuracy. The empirical floor for reliable face ID is around 80 to 100 pixels between the eyes. A 36-pixel-tall person has roughly a 6 to 8 pixel interocular distance. The pixel budget is not a setting you can change. It is a property of the input architecture (read the monitor, not the cameras). An edge system built on this input cannot identify, only classify. That is an architectural guarantee, not a policy promise.

What actually leaves a property that runs a monitor-first edge system?

Per delivered HIGH event: an 18 KB JPEG thumbnail, a 220 KB H.264 clip of roughly six seconds bracketing the detection, and a 612 byte JSON metadata object. Total around 240 KB per event. Uplink is a 4G cellular modem with an eSIM. No continuous upload. No property Wi-Fi or Ethernet involvement. No face embedding. No license plate string. No audio. The raw 1920 by 1080 multiview never leaves the building. That egress contract is the thing a property manager can sign that an Ellison-style cloud pipe cannot sign, because continuous cross-camera analysis requires continuous upload.

Can the same hardware be retargeted to do Ellison-style cloud surveillance later?

No. And this is the part most critique of AI surveillance misses. A system's capability ceiling is set by its input pipe, not its software. A monitor-first unit's input pipe is a single HDMI signal from the DVR; its uplink is a cellular modem billing by the megabyte. Those two facts together cap the system at event egress and class-level detection forever. To do Ellison-style identification and centralized analysis you would have to swap the input (HDMI to RTSP per camera), swap the uplink (cellular to building fiber or fixed circuit), and add a cloud fleet. That is not a firmware update. That is a new product. If the question is 'is the privacy posture load-bearing,' the answer is yes: the architecture is what enforces it.

Does this mean edge AI surveillance is automatically 'good' and cloud AI surveillance is automatically 'bad'?

No. Edge is a tradeoff, not a moral position. Ellison-style centralized analysis is the right architecture for some workloads. Cross-jurisdiction cold case investigation, a citywide missing-person search, and federated threat detection across tens of thousands of cameras are workloads that require a centralized index, and an edge system cannot do them. The honest framing is: pick the architecture whose capability ceiling matches the workload, because the ceiling is what you are signing up for on day one. A property that wants 'catch people running across a compactor zone at 2 AM' does not need Ellison's stack. A state-level criminal investigation across millions of cameras does, and that is a different sort of accountability problem.

Is the HDMI input architecture actually a privacy feature, or is that marketing language?

It is a load-bearing property of the device. Here is the testable version: unplug the cellular modem and the Cyrano unit can no longer communicate outbound at all. No event leaves. Nothing crosses the property boundary, ever. The same device with RTSP per-camera input and a cloud uplink would still be reading every camera over the LAN and could be reconfigured to upload continuously. The input pipe bounds what is possible, not just what is intended. 'Monitor-first' is not a branding choice. It is an architectural constraint that decides the ceiling of the device's possible behavior.

If Ellison is right that AI surveillance is coming anyway, why does the edge architecture matter?

Because 'AI surveillance is coming' and 'AI surveillance is centralized' are two different claims, and only one of them is inevitable. Class-level detection on cameras (person in zone, vehicle in zone, package on doorstep) is going to be everywhere and is genuinely useful. A panopticon that cross-correlates identity across every camera in a city requires someone to actually build and operate the centralized index, and the decision to build it is a policy decision that happens per vendor. Edge systems make that policy decision visible by refusing to participate in the index. Choosing one vendor over another is how property managers register that preference at the procurement level, which is where most 'AI surveillance' decisions actually happen in the real economy.

What should a property manager evaluating an AI surveillance system actually ask?

Four architectural questions. One, what exactly leaves my building per hour of normal operation, in bytes and in kind? If the answer includes continuous video upload, you are buying an Ellison-style system. Two, what models run on my premises versus in a central cluster? Edge inference stays at the property; cloud inference does not. Three, can my property be correlated with other properties in this vendor's index, and if so, what is the opt-out? If there is no opt-out, you have signed a centralization decision. Four, what is the physical input to the AI, and what is its pixel budget? A 384 by 216 tile cannot identify, no matter what the marketing says. Those four answers tell you which architecture you bought, independent of the vendor's stated capabilities.

Is Cyrano positioning itself as the anti-Ellison?

Not as a political statement. As an engineering one. Ellison described a specific architecture (continuous recording, centralized analysis, identification-grade cameras) and a specific outcome ('citizens on their best behavior'). An edge-based monitor-first system occupies a different point on the same design space: deterministic alerting on class-level events, no continuous upload, no centralized cross-property index, no identification. Both architectures have legitimate use cases. The only dishonest thing would be to claim either architecture is the other. A property manager choosing between them is choosing an egress contract, a pixel budget, and a trust model, all three at once.

The primitives are the conversation

Ellison described an outcome. The outcome requires four primitives. An edge system refuses all four at the input pipe, not at the software layer. That refusal is checkable with bytes-on-the-wire, pixel counts, and an unplugged cellular modem. It is not a branding claim. It is the shape of the hardware you bought.

If the workload is “city-scale cross-camera index,” Ellison's architecture is the right tool and the price of that tool is the primitives. If the workload is “property-scoped person-in-zone detection with zero continuous upload,” edge is the right tool and the price is the identification ceiling. Choosing between them is the actual AI surveillance decision, long before the model names and the dashboard screenshots.

🛡️CyranoEdge AI Security for Apartments
© 2026 Cyrano. All rights reserved.
