Ing. Alejandro Medina on Agents That Call and Click
A deep dive into Alejandro Medina’s agent demo: Twilio phone calls, PC control, and why it feels like a potential AGI inflection point.
Ing. Alejandro Medina recently shared something that caught my attention: after enough years in AI, he wrote, you develop "smoke detectors" for hype and "BS" - but this time, his reaction was, "Could this be the beginning of AGI?"
He then described a moment that feels small on paper yet big in implication: an AI agent obtained a phone number via Twilio and called him, and that same agent (built with Clawdbot) had control of his PC. Over the phone, he told it to search for YouTube videos, and it did. His takeaway was not a confident declaration but a gut-level signal: maybe this is not AGI yet, but "it smells" like something important.
"Ignore the security, ethical, and moral problems for a second... this could be the ChatGPT moment for AGI."
That framing is worth unpacking. Not because one agent placing a call proves artificial general intelligence, but because this specific combination of abilities - autonomy, tool access, and real-world action - is exactly where AI starts to feel qualitatively different.
Why a phone call from an AI agent hits differently
Most people have already internalized that AI can write, summarize, and chat. We are used to AI as a "text box". Medina’s post points to a shift from AI-as-text to AI-as-actor.
A phone call is psychologically powerful because it crosses a boundary:
- It is synchronous: you feel like something is "there" with you.
- It reaches into human workflows: phones are how we confirm, coordinate, and authorize.
- It implies intent: calling is not just generating content, it is initiating contact.
When an agent can obtain a number (via a service like Twilio), trigger an outbound call, and hold a coherent goal in mind while doing it, you are seeing the early shape of agentic behavior. It is not the call itself that matters. It is the fact that the model is operating through systems.
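To make the plumbing concrete, here is a minimal sketch of how an agent might place such a call with Twilio's Python SDK. The credentials, numbers, and TwiML URL are placeholders, and the network call only happens when real credentials are supplied; this is an illustration of the mechanics, not Medina's actual setup.

```python
# Sketch of the Twilio plumbing an agent could use to place an outbound call.
# Numbers and URLs below are placeholders, not real endpoints.

def build_call_request(to_number, from_number, script_url):
    """Assemble the parameters Twilio's calls.create() expects."""
    if not to_number.startswith("+") or not from_number.startswith("+"):
        raise ValueError("Twilio expects E.164 numbers, e.g. +15551234567")
    return {
        "to": to_number,
        "from_": from_number,  # 'from' is a Python keyword, so the SDK uses from_
        "url": script_url,     # TwiML endpoint that drives the conversation
    }

def place_call(account_sid, auth_token, params):
    """Place the call only when real credentials are present; otherwise dry-run."""
    if not (account_sid and auth_token):
        return None  # dry run: nothing to do without credentials
    from twilio.rest import Client  # pip install twilio
    client = Client(account_sid, auth_token)
    return client.calls.create(**params)

params = build_call_request("+15550100", "+15550199",
                            "https://example.com/voice.xml")
```

The interesting part is not the API itself but that the agent, not a human, decides when to invoke it.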
The second half: computer control is the real accelerant
Medina’s detail that the agent had control of the PC, then executed a YouTube search, is the bigger story.
Text generation is cheap. Computer action is leverage.
Once an agent can:
- observe a screen,
- decide what to click or type,
- execute those actions reliably,
- verify the result,
it can effectively operate existing software without custom integrations. That matters because the world already runs on interfaces: browsers, CRMs, ERPs, email clients, ticketing systems, banking portals, and internal dashboards.
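The observe-decide-execute-verify cycle above is, at its core, a control loop. Here is a generic sketch; the observe, decide, act, and check callables are hypothetical stand-ins for a real screen reader, action policy, and UI executor.

```python
# A generic observe -> decide -> act -> verify loop, the control structure
# behind most UI agents. The callables are hypothetical stand-ins.

def run_agent(goal, observe, decide, act, check, max_steps=10):
    history = []
    for _ in range(max_steps):
        state = observe()                      # e.g. screenshot + accessibility tree
        if check(state, goal):                 # verify: did we reach the goal?
            return history
        action = decide(state, goal, history)  # e.g. "click", "type", "scroll"
        act(action)                            # execute against the real UI
        history.append(action)
    raise TimeoutError("goal not reached within step budget")
```

The step budget and the verification check are what separate a usable agent from one that clicks forever; everything else is swappable.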
This is why PC control often feels like a step change. You are no longer asking AI to "recommend" what to do. You are delegating the doing.
Tool use plus autonomy is where "AGI vibes" come from
When people say "this feels like AGI," they often mean: "It did something end-to-end that I would normally do myself." That feeling comes from a stack of capabilities working together:
- Planning: decomposing "find videos" into steps.
- Tool selection: deciding to use a browser, search, and navigate.
- Execution: clicking, typing, scrolling, and handling popups.
- Error recovery: noticing when a page does not load and trying again.
- Persistence: not losing the goal halfway through.
None of these individually requires general intelligence. But the bundle creates the illusion of a general problem-solver, especially when it is wrapped in a familiar channel like a phone call.
Is this actually the start of AGI?
If we define AGI as a system that can generalize across domains at or above human level, with robust reasoning, learning, and autonomy, then one agent that calls and clicks is not the finish line.
But Medina’s instinct is still valid: moments like this can be the start of a new adoption curve.
We have seen this pattern before:
- Speech recognition existed for years before it became "natural enough" for mainstream use.
- Self-driving cars have long had impressive demos, yet struggle with edge cases.
- Large language models existed pre-2022, but ChatGPT made the interface and experience feel universally accessible.
So the better question is not "Is this AGI?" It is:
Are we watching the interface to automation become conversational and autonomous enough that it changes what people build?
That is the "ChatGPT moment" analogy Medina is pointing to.
What needs to be true for agents like this to become normal
To move from impressive demo to everyday infrastructure, agentic systems need progress in a few practical areas.
1) Reliability in the messy real world
Browser automation breaks. Websites change. Two-factor authentication appears. A modal blocks the button. The agent misreads a label. A small failure can derail the whole chain.
For agents to be trusted, they need better:
- perception (screen understanding),
- robust UI interaction,
- self-checking and verification,
- graceful fallback to a human.
2) Stronger boundaries and permissions
Medina explicitly asked readers to ignore security, ethics, and morals "for a second." That is fine as a thought experiment, but in practice this is the central challenge.
An agent that can call, browse, and control a PC can also:
- click the wrong link,
- leak sensitive data,
- be socially engineered,
- take irreversible actions.
So we need permissioning models that look more like modern cloud security:
- least privilege (only the tools it needs),
- scoped credentials (time-limited tokens),
- approval flows (human-in-the-loop for risky steps),
- audit logs (what it did and why),
- sandboxing (separate environments for execution).
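Those controls can be sketched as a gate wrapped around every tool call. This is an illustrative design, not a real library: the allowlist enforces least privilege, the expiry mimics a scoped token, the approval callback is the human-in-the-loop, and the log is the audit trail.

```python
import time

# Illustrative permission gate around agent tool calls: allowlist (least
# privilege), time-limited grant (scoped credentials), human approval for
# risky actions, and an audit log. Tool names and hooks are hypothetical.

class ToolGate:
    def __init__(self, allowed, risky, ttl_seconds, approve):
        self.allowed = set(allowed)               # explicit allowlist
        self.risky = set(risky)                   # actions needing a human
        self.expires = time.time() + ttl_seconds  # time-limited grant
        self.approve = approve                    # callback: human sign-off
        self.audit_log = []                       # what it did and why

    def call(self, tool, fn, *args, reason=""):
        if time.time() > self.expires:
            raise PermissionError("grant expired")
        if tool not in self.allowed:
            raise PermissionError(f"tool {tool!r} not granted")
        if tool in self.risky and not self.approve(tool, reason):
            raise PermissionError(f"human rejected {tool!r}")
        self.audit_log.append((tool, reason))     # record before acting
        return fn(*args)
```

Sandboxing is the one item the gate cannot provide on its own; it has to come from the execution environment underneath.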
3) Clear intent and accountability
If an agent calls someone, who is responsible for the content of the call? If it schedules, purchases, or messages, how do we trace the decision?
This is where product design matters as much as model capability. The winning systems will make intent explicit, show a plan before execution, and provide "undo" paths when possible.
Why this moment matters even if it is not AGI
Medina’s "smoke detector" metaphor resonates because experienced practitioners develop pattern recognition for overhype. When someone like that pauses and says "but...", it is usually because the demo crossed a threshold of integration.
The integrated part is key:
- A model can converse.
- It can acquire a phone number through an API.
- It can place a call.
- It can operate a computer.
- It can execute a task in the open internet.
Each is doable. Doing them together, coherently, starts to look like a new kind of labor.
If you are building or leading in AI, the strategic takeaway is simple: start treating "agents" as a product category, not a feature.
- Identify one workflow where the bottleneck is not thinking but switching tools.
- Define safe actions vs restricted actions.
- Add checkpoints for irreversible steps.
- Measure outcomes like completion rate, error rate, and intervention rate.
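Those outcome metrics fall out of a simple run log. The record shape below (completed / errored / human_intervened flags) is an assumption for illustration; any event store with equivalent fields would do.

```python
# Computing completion, error, and intervention rates from a run log.
# The per-run record shape is illustrative, not a standard schema.

def agent_metrics(runs):
    n = len(runs)
    return {
        "completion_rate": sum(r["completed"] for r in runs) / n,
        "error_rate": sum(r["errored"] for r in runs) / n,
        "intervention_rate": sum(r["human_intervened"] for r in runs) / n,
    }
```

Tracking intervention rate separately from error rate matters: an agent that fails loudly and hands off to a human is far more deployable than one that fails silently.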
That is how you turn "it smells like something" into "it works, safely, at scale."
Closing thought
I do not think an agent getting a Twilio number and searching YouTube proves we have AGI. But I do think Medina is pointing at the right frontier: AI that reaches out, takes initiative, and operates the same interfaces humans use. That is where the next wave of products will be built, and where the hardest governance questions will surface.
This blog post expands on a viral LinkedIn post by Ing. Alejandro Medina. View the original LinkedIn post →