Addy Osmani on Giving AI Agents Eyes and Hands
A deeper look at Addy Osmani's viral demo of Gemini Live with OpenClaw, and what smart-glasses agents mean for builders.
Addy Osmani recently shared something that caught my attention: "Use Gemini Live + OpenClaw to give your agent eyes and action! Buy what you see and so much more." That simple line captures a shift many of us have been waiting for - agents that do not just chat, but perceive the world and take steps inside your apps.
In the same post, Addy pointed to VisionClaw, a free open-source project by Xiaoan (Sean Liu), as a practical bridge between "seeing" and "doing." The hook is straightforward: connect a camera feed (smart glasses or phone) to Gemini Live for real-time multimodal understanding, then let OpenClaw carry out actions like messaging, search, and list management.
"This enables a whole bunch of use-cases." - Addy Osmani
I want to expand on why this is such a big deal, what is really happening under the hood, and what you should consider if you are building or adopting systems like this.
From chatbots to embodied agents
Most AI assistants today live in a single modality: text in, text out. Even when they support images, the interaction is often a one-off upload. What Addy is highlighting is an always-available loop:
- Perception: the agent continuously gets context from the world (camera plus audio).
- Reasoning: a multimodal model interprets what is happening.
- Action: a tool-using layer executes tasks in your software environment.
That loop is the difference between "tell me about this" and "help me accomplish something in the moment." When you combine it with wearable hardware like Ray-Ban Meta smart glasses, the agent can operate at the speed of life: while you walk, shop, commute, or troubleshoot an issue.
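The perception-reasoning-action loop above is easy to sketch in code. Everything in this snippet is illustrative - the `Agent` class, the rule-based "reasoning," and the tool names are stand-ins I made up, not the VisionClaw or OpenClaw APIs:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    perceive: Callable[[], str]           # eyes: grab current visual/audio context
    reason: Callable[[str, str], str]     # brain: interpret context + user request
    act: Callable[[str], str]             # hands: execute the chosen action

    def handle(self, user_request: str) -> str:
        context = self.perceive()                      # 1. perception
        decision = self.reason(context, user_request)  # 2. reasoning
        return self.act(decision)                      # 3. action

# Toy wiring: a static "camera frame", a rule-based "model", a dict of tools.
tools = {"add_to_list": lambda item: f"Added {item} to shopping list"}

agent = Agent(
    perceive=lambda: "shelf with milk cartons",
    reason=lambda ctx, req: "add_to_list:milk" if "milk" in ctx and "add" in req else "noop",
    act=lambda d: tools[d.split(":")[0]](d.split(":")[1]) if d != "noop" else "Nothing to do",
)

print(agent.handle("add that to my list"))  # -> Added milk to shopping list
```

The point of the shape, not the toy logic: each leg of the loop is swappable. Replace `perceive` with a glasses camera feed, `reason` with a multimodal model, and `act` with a real tool router, and the `handle` method does not change.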
The three-part architecture: eyes, brain, hands
Addy summarized the system in a clean mental model:
1) The eyes (camera feed)
VisionClaw can take a live camera feed from Ray-Ban Meta glasses or a phone. In practice, this gives the agent visual grounding: what object is on the shelf, what sign is in front of you, what appliance you are looking at, or what your screen shows.
A key detail from the post is that the camera streams at about 1 frame per second to Gemini for visual context. That is not "video understanding" in the movie sense. It is closer to periodic snapshots that keep the model oriented.
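That ~1 fps cadence is worth internalizing as a throttling problem: the camera produces frames far faster than you want to ship them to the model. A minimal sketch, with a fake clock so the behavior is deterministic (the class and its interval are my invention, not how VisionClaw is actually implemented):

```python
import time

class FrameThrottle:
    """Forward at most one camera frame per interval (~1 fps here),
    mirroring the post's detail that frames stream to Gemini at ~1/s."""
    def __init__(self, interval_s: float = 1.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self._last = float("-inf")

    def should_send(self) -> bool:
        now = self.clock()
        if now - self._last >= self.interval_s:
            self._last = now
            return True
        return False

# Simulate a 10 Hz camera for 3 seconds with a fake millisecond clock.
t_ms = 0
throttle = FrameThrottle(interval_s=1.0, clock=lambda: t_ms / 1000)
sent = 0
while t_ms < 3000:
    if throttle.should_send():
        sent += 1  # in a real system: upload this frame for visual context
    t_ms += 100    # camera ticks every 100 ms

print(sent)  # -> 3 (frames at t = 0s, 1s, 2s pass; the other 27 are dropped)
```

The same gate works for any snapshot-style context: drop frames locally, send only what keeps the model oriented.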
2) The brain (real-time multimodal)
Gemini Live provides the real-time multimodal layer that can "see" and "hear." The "live" part matters because natural interaction is not turn-based. You might interrupt yourself, change your mind, or ask follow-up questions while the system is already speaking.
Audio is also bidirectional in real time, which is what makes the interaction feel conversational instead of transactional.
3) The hands (agentic tools)
OpenClaw represents the action layer: routing tasks to tools and connected apps. This is where the assistant stops being a narrator and becomes an operator. In Addy's examples, OpenClaw can:
- Add items to a shopping list via connected apps
- Send messages through WhatsApp, Telegram, or iMessage
- Run web searches and speak results back
"The hands: Agentic layers (OpenClaw) execute the tasks." - Addy Osmani
That separation of concerns is a design pattern worth copying: perception does not have to be tangled with tool execution, and tool execution does not have to be model-specific.
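One way to keep that separation concrete is a tool registry: the "hands" expose a uniform name-plus-arguments interface, and the "brain" only ever emits (tool name, arguments). The registry below is a hypothetical sketch of the pattern, not OpenClaw's actual API:

```python
from typing import Callable, Dict

class ToolRegistry:
    """Model-agnostic action layer: tools register by name; any model
    can dispatch to them without importing connector code."""
    def __init__(self):
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str):
        def wrap(fn):
            self._tools[name] = fn
            return fn
        return wrap

    def dispatch(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            return f"Unknown tool: {name}"
        return self._tools[name](**kwargs)

registry = ToolRegistry()

@registry.register("send_message")
def send_message(app: str, to: str, text: str) -> str:
    # A real implementation would call the messaging connector here.
    return f"[{app}] to {to}: {text}"

# The model layer only produces a tool name and arguments; it never
# touches WhatsApp/Telegram code directly.
print(registry.dispatch("send_message", app="WhatsApp", to="John", text="On my way"))
```

Because perception and execution meet only at this narrow interface, you can swap the model, add a tool, or change a connector without touching the other layers.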
What the demo implies for real use-cases
Addy listed a handful of voice prompts you can say after tapping the AI button on your glasses. I think it is helpful to group them into categories, because that is where the real product opportunities sit.
Situational understanding ("What am I looking at?")
This is the baseline: describe the scene. But it quickly becomes more valuable when it is personalized and goal-directed:
- "Which of these has the lowest sugar?"
- "Is this the same brand I bought last time?"
- "Read me the key differences between these two devices."
With about 1 fps visual context, you should design for questions that do not require high temporal precision. It is great for shelves, signage, menus, instructions, and comparing objects you can hold steady.
Lightweight capture ("Add milk to my shopping list")
This category is where agents quietly win: small tasks that are annoying to do manually but easy to delegate.
The practical requirement is reliability. If the agent adds the wrong item or the wrong quantity, users stop trusting it. The best implementations will confirm critical details tersely (for example, "Added milk"), and provide an easy rollback ("Undo").
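The "confirm tersely, allow undo" pattern is simple to implement: every mutation returns a short confirmation and pushes its inverse onto an undo stack. A minimal sketch (class and method names are mine, for illustration):

```python
class ShoppingList:
    def __init__(self):
        self.items = []
        self._undo = []   # stack of inverse operations

    def add(self, item: str) -> str:
        self.items.append(item)
        self._undo.append(lambda: self.items.remove(item))
        return f"Added {item}"          # terse spoken confirmation

    def undo(self) -> str:
        if not self._undo:
            return "Nothing to undo"
        self._undo.pop()()              # run the most recent inverse op
        return "Undone"

lst = ShoppingList()
print(lst.add("milk"))   # -> Added milk
print(lst.undo())        # -> Undone
print(lst.items)         # -> []
```

The spoken surface stays tiny ("Added milk" / "Undone") while the inverse-operation stack quietly preserves the user's ability to recover from a misheard item.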
Communication as an action ("Send a message to John...")
Messaging is deceptively complex because it involves:
- Choosing the right recipient identity
- Choosing the right app and thread
- Getting the tone right
- Confirming before sending (for safety)
A good agent experience usually includes a confirmation step for outbound messages, especially in professional settings. Think: "Draft ready. Send?" rather than auto-send.
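That "Draft ready. Send?" flow is a gate: outbound messages sit in a pending state until the user explicitly confirms or cancels. A hedged sketch of the state machine (names are illustrative, not any real messaging API):

```python
class OutboundGate:
    """Hold outbound messages in a pending state until confirmed."""
    def __init__(self, send_fn):
        self.send_fn = send_fn   # the real connector call, injected
        self.pending = None

    def draft(self, to: str, text: str) -> str:
        self.pending = (to, text)
        return f'Draft to {to}: "{text}". Send?'

    def confirm(self) -> str:
        if self.pending is None:
            return "No draft pending"
        to, text = self.pending
        self.pending = None
        return self.send_fn(to, text)

    def cancel(self) -> str:
        self.pending = None
        return "Discarded"

gate = OutboundGate(send_fn=lambda to, text: f"Sent to {to}")
print(gate.draft("John", "Running 10 min late"))  # -> Draft to John: ... Send?
print(gate.confirm())                             # -> Sent to John
```

Nothing leaves the device without passing through `confirm()`, which is exactly the property you want to be able to state plainly when users ask what the agent can do on its own.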
Local discovery ("best coffee shops nearby")
This blends web search, location context, and voice output. The winning user experience is not a long list. It is a short recommendation with an action:
- "Top pick is X, 6 minutes away. Want directions?"
That is where the hands matter again: launching maps, starting navigation, or saving a place.
Why this is showing up now
Addy called it "very cool to see another great use of Gemini + agents out in the world," and I agree because it signals maturity across three fronts:
- Multimodal models are good enough to be useful on everyday scenes.
- Real-time streaming APIs are becoming accessible to developers.
- Tool-use frameworks are becoming more standard, making "hands" easier to wire.
Open-source projects like VisionClaw also lower the barrier to experimentation. Instead of waiting for a single vendor to ship a polished product, builders can prototype, test, and iterate in the open.
GitHub repo (from Addy's post): https://lnkd.in/g8HtBg9k
The non-obvious challenges (and how to think about them)
If you are excited to build something like this, a few constraints are worth planning for up front.
Latency and interaction design
Real-time voice back-and-forth is unforgiving. If the model's response or a tool execution lags, users talk over it or abandon the flow. Design strategies that help:
- Use short spoken confirmations
- Stream partial responses when possible
- Move long results to a phone screen when available
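"Stream partial responses" in a voice UI usually means speaking each sentence as soon as it is complete, rather than waiting for the full reply. Here is a self-contained sketch where a plain iterator stands in for a streaming token API, and the coffee-shop reply is invented:

```python
def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as a terminator arrives,
    so text-to-speech can start before the model finishes."""
    buf = ""
    for tok in tokens:
        buf += tok
        while any(p in buf for p in ".!?"):
            # emit up to and including the first sentence terminator
            idx = min(i for i in (buf.find(p) for p in ".!?") if i != -1)
            yield buf[: idx + 1].strip()
            buf = buf[idx + 1 :]
    if buf.strip():
        yield buf.strip()

tokens = ["Top pick is ", "Blue Bottle. ", "It is 6 minutes ", "away. ", "Want directions?"]
spoken = list(sentences_from_tokens(tokens))
print(spoken)
# -> ['Top pick is Blue Bottle.', 'It is 6 minutes away.', 'Want directions?']
```

The user hears the top pick while the rest is still being generated, which is often the difference between "feels instant" and "feels broken."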
Privacy, consent, and social acceptability
Wearable cameras introduce immediate trust questions:
- Are you recording bystanders?
- Where is audio and video processed?
- Is anything stored?
Even if your system is technically compliant, you still need UI cues (audible tones, lights, explicit prompts) that communicate when the agent is "watching" or "listening." Clear defaults matter.
Safety for actions
The hands are powerful, which means mistakes can be costly. The safest agentic systems:
- Require confirmation for irreversible actions (sending, buying, deleting)
- Provide previews (message draft, checkout cart)
- Log actions transparently so users can audit what happened
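Transparent logging is the cheapest of the three safeguards to add: wrap every tool execution so its name, parameters, and outcome are recorded whether it succeeds or fails. A sketch with illustrative field names:

```python
import time

audit_log = []

def execute_logged(action: str, params: dict, run):
    """Run a tool call and record what happened, success or failure."""
    entry = {"ts": time.time(), "action": action, "params": params}
    try:
        entry["result"] = run(**params)
        entry["status"] = "ok"
    except Exception as e:
        entry["result"] = str(e)
        entry["status"] = "error"
    audit_log.append(entry)
    return entry["result"]

execute_logged("add_to_list", {"item": "milk"}, run=lambda item: f"Added {item}")
print(audit_log[0]["status"], "-", audit_log[0]["result"])  # -> ok - Added milk
```

Because failures are logged too, the agent can answer "what did you just do?" honestly, and you get a debugging trail for free.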
Tooling and app integration complexity
OpenClaw-style execution typically depends on connectors, permissions, and stable APIs. Real-world reliability comes from handling failures gracefully:
- If WhatsApp is not connected, offer alternatives
- If a list service is down, queue the request and confirm later
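The queue-and-confirm-later behavior can be wrapped around any connector. This sketch simulates an outage and recovery; the class, its messages, and the fake `send` function are all assumptions for illustration:

```python
from collections import deque

class ResilientConnector:
    """If the underlying service is down, queue requests instead of failing."""
    def __init__(self, send):
        self.send = send
        self.queue = deque()

    def request(self, payload) -> str:
        try:
            return self.send(payload)
        except ConnectionError:
            self.queue.append(payload)
            return "Service is down; I queued it and will confirm later."

    def flush(self) -> int:
        """Retry queued requests in order; stop at the first failure."""
        done = 0
        while self.queue:
            try:
                self.send(self.queue[0])
            except ConnectionError:
                break
            self.queue.popleft()
            done += 1
        return done

# Simulate an outage, then recovery.
up = {"ok": False}
def send(payload):
    if not up["ok"]:
        raise ConnectionError("list service unreachable")
    return f"synced: {payload}"

conn = ResilientConnector(send)
print(conn.request("add milk"))  # queued, with a spoken acknowledgment
up["ok"] = True
print(conn.flush())              # -> 1 (queued request synced on recovery)
```

The user-facing contract stays honest: the agent never pretends the action happened, it says it queued the request, and `flush()` gives you the hook for the later confirmation.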
Why Addy's post went viral (and what content creators can learn)
It is not just the tech. The post format is a blueprint for effective communication:
- A clear hook: "give your agent eyes and action"
- A simple mental model: eyes, brain, hands
- Concrete prompts: copy-pastable, easy to imagine
- One crisp technical detail: ~1 fps visual streaming plus real-time audio
- A link to try or inspect: the repo
If you write about developer tools or AI, this is a strong content strategy: reduce complexity to a memorable frame, then anchor it with a real project people can explore.
Closing thoughts
What I take from Addy Osmani's example is that the next wave of AI UX is not about bigger prompts. It is about tighter loops between perception and execution. When an agent can see what you see and then do something useful inside your apps, "assistant" starts to mean something closer to a capable teammate.
If you are building in this space, start with one narrow workflow where visual context truly matters (shopping, troubleshooting, accessibility, field service), and be ruthless about safety and clarity. The magic is not just that it can act - it is that users feel in control while it does.
This blog post expands on a viral LinkedIn post by Addy Osmani, Director at Google Cloud AI.