The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.
palashawas•May 5, 2026
Yep, everyone knows computer use is more expensive. This is about quantifying the gap
sudb•May 5, 2026
I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. vercel's agent-browser, the relatively new dev-browser[1], etc.)
There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.
Interesting! I'll play around with agent-browser and update this article if anything comes up
cjbarber•May 5, 2026
I think of computer use as like last mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.
svnt•May 5, 2026
> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.
> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume many times the tokens. Could you come up with an alternative here?
Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.
palashawas•May 5, 2026
This is a fair point.
The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares
Pretty hard to guess what Anthropic trained Sonnet on, but general multimodals are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now
aurareturn•May 5, 2026
In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.
I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.
mtoner23•May 5, 2026
OpenAI should not design a phone... They should try making money first
sophacles•May 5, 2026
Nonsense. Don't you know how bubbles work? Everyone does massive rushes for all the low-hanging and medium-hanging fruit. Then the bubble pops, and the randomized carnage of companies big and small being destroyed is sifted through by the next wave of companies actually intended to make money.
The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. It's random, and any notion of "this was a good business model and that was bad" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.
joshstrange•May 5, 2026
> I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.
This is not going to happen, or if it does it will just be Android (like Samsung reskins/modifies it) and it will certainly use Google Play Services.
reorder9695•May 5, 2026
Presumably on Linux at least apps could just expose a DBus API? The machinery for this is already in place as far as I can tell.
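For a sense of what that could look like, here's a rough sketch using the third-party dbus-next Python library (the service name, path, and method are all made up):

```python
import asyncio
from dbus_next.aio import MessageBus
from dbus_next.service import ServiceInterface, method


class NotesInterface(ServiceInterface):
    """Hypothetical agent-facing surface for a note-taking app."""

    def __init__(self):
        super().__init__('org.example.Notes1')

    @method()
    def CreateNote(self, title: 's', body: 's') -> 's':
        # Call into whatever the app already does internally,
        # then hand back an identifier the agent can reuse.
        return 'note-42'


async def main():
    bus = await MessageBus().connect()           # session bus by default
    bus.export('/org/example/Notes', NotesInterface())
    await bus.request_name('org.example.Notes')  # claim a well-known name
    await asyncio.get_running_loop().create_future()  # serve forever

asyncio.run(main())
```

An agent (or busctl/d-feet) could then discover and call CreateNote without ever touching the GUI.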
lazide•May 5, 2026
This is like insisting - after the problem turns out to be harder than thought - that the world's roads need to be completely redone to make them self-driving friendly, so self-driving can work.
Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?
tikhonj•May 5, 2026
Everything exposed programmatically would have been great even without agents—the NixOSes and Emacses of the world show just how amazing a fully flexible and programmable world would be—but I'm glad that the advent of AI is getting people invested in this vision :P
QuercusMax•May 5, 2026
Lots of apps actually do have all their functionality exposable via an API - but it's an internal API that's hidden from the user.
planb•May 5, 2026
This will not happen. None of the existing apps people use daily on their phones have any incentive to support this. Social media wants people to doomscroll, shopping apps and booking sites want to use their own dark patterns to make people believe they get a special discount if they buy _now_, and everything else just wants users to see the ads. Why on earth would they offer convenient hooks for AI chatbots?
input_sh•May 5, 2026
It's even more fascinatingly dumb to have this discussion like 2 or so years after every major platform decided to kill any notion of 3rd party clients they used to support.
Yes, in an ideal world, that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden, everyone will just decide to make their bread and butter that is data easier to explore via an LLM.
jackphilson•May 5, 2026
Because the social media sites that do will outcompete the ones that don't, once people get personal AI coaches that tell them to use technology that is better for them.
donaldjbiden•May 5, 2026
How is an AI posting on your social media better for you?
kaashif•May 5, 2026
It's not, but token peddlers will say it is. It's good to interact with everything through buying tokens.
charcircuit•May 5, 2026
And how will a token peddler's social media company survive after the hype runs out?
tdeck•May 6, 2026
People on LinkedIn who are trying to build their "personal brand" seem to favor it. In fact, that's basically all the platform is these days.
ai_fry_ur_brain•May 5, 2026
These people are delusional and want to build a world that's convenient for them to accomplish things lazily with LLMs.
There are no shortcuts in life and it's just expensive text autocomplete.
"Let's spin up $750k in GPUs full throttle to scrape a web page with my $200.00 CC subscription."
Everyone is delusional.
aurareturn•May 5, 2026
> Why on earth would they offer convenient hooks for AI chatbots?
Competition. If I ask my OS-level AI assistant to find a social media reel about an elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more.
Watch how fast Meta adds this if a new hot shot social media app succeeds by designing for AI agents controlled by users.
swiftcoder•May 5, 2026
Having used a chatbot to find a reel Meta was censoring from search in the past... I'm not sure how well the incentives align
JambalayaJimbo•May 5, 2026
>Competition. If I ask my OS-level AI assistant to find a social media reel about an elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more.
This is the exact opposite of what will happen (and in fact what has happened). Reddit is suing Perplexity right now for scraping.
Meta will not serve content to some other app for free - for what benefit? They will not see advertising data.
aurareturn•May 6, 2026
Scraping and asking an agent is different.
jasomill•May 6, 2026
Who said anything about free?
Advertising isn't the only possible business model.
And profit isn't the only possible motive to provide a service.
tdeck•May 6, 2026
Actually this more or less describes how accessibility APIs work.
jasomill•May 6, 2026
Not really. For the most part, accessibility APIs provide programmatic interfaces to user interfaces, while application APIs provide semantically meaningful interfaces to application functionality.
A closer analogue would be AppleScript, or rather the underlying Apple Event and Open Scripting Architecture functionality supplied by the OS to support AppleScript. It allowed applications to expose these interfaces along with metadata documenting them, and allowed external tools to record manually performed tasks across applications as programs expressed in terms of these interfaces, to make them easier to use (this last bit, while not strictly required, is convenient, and especially useful for less technical users).
If you're familiar with VBA in Microsoft Office applications, it was sort of like that, except with support provided by OS APIs that could be used by any application that chose to implement scripting support, official guidance from Apple suggesting that all well-designed applications should be scriptable and recordable, and application design patterns and frameworks designed with scriptability and recordability in mind.
Note that I use the past tense here, despite AppleScript still being available in macOS, because it is not well-supported by modern applications.
We have a much better chance of an AI-addressable Harmony OS version than of OpenAI making a serious competitor.
dummydummy1234•May 5, 2026
Why not use the same accessibility features used for disability support?
CodingJeebus•May 5, 2026
One of the most seductive (and destructive) forces in software is the desire to rewrite from scratch because rewrites never, ever, ever go as planned. With AI, we're now thinking it's a good idea to rewrite the entire platform from the ground-up. Wild.
convolvatron•May 5, 2026
except every single piece of progress that we have is the result of trying to do things a different way. so unless you really think we've reached the pinnacle of operating system design, there has to be some room for this?
CodingJeebus•May 5, 2026
There's a very big difference between building onto an existing system and rewriting from the ground up. I'm not opposed to making progress and trying things differently, but saying things like "we need to completely rethink the operating system" is like saying "we need to completely redesign New York City". The most effective progress is incremental, not throwing the old system away wholesale.
The modern javascript ecosystem is a perfect example of what happens when everyone tries to rebuild from scratch and it's a nightmare.
dist-epoch•May 5, 2026
The future is "dark OSs" - OSes with no human users.
wartywhoa23•May 5, 2026
Launched to nuclear fanfare on August 29th.
pmontra•May 5, 2026
I still have to understand what my AI agents could do that I don't want to do myself. Buy stuff? No thanks, I want to see what I buy. I think that they are 99% a solution in search of a problem.
sbrother•May 5, 2026
Same. Well the biggest thing I don't want to do that they could help with is work. But in the cases where it can do that for me, there's no world where that benefit goes to me rather than my employer.
pmontra•May 5, 2026
Well, that's the very nature of the employer / employee relationship. In my case I write software for my customers and I trade time for money. If I use an AI to write code two times faster my daily rate doesn't double. However I can keep my customers.
That's only another step in the path I experienced since the 80s, when I had to type every single character because there was no auto complete, no command line history, very few libraries. I was very good at writing trees, hash tables, linked lists and so was everybody else. Nobody would hire me if I were that slow at writing code today.
sciencejerk•May 6, 2026
My family (unfortunately) uses InstaCart and probably 15% of items are a shitty "replacement" that's not what I wanted. For time-sensitive items, having the shitty replacement item NOW is better than having to wait for the "item I actually wanted", so we often just accept the inferior product. This is a dark pattern that I could see AI adopting -- it buys tons of cheap crap you didn't want, some of it was right, and you're left with a mess of returns to sort out, esp. if those returns require you to take some sort of physical action like returning the item to the store.
switchbak•May 5, 2026
"In an agentic world, the OS needs to be completely rethought" - if AI is progressing as fast as we think it is, I don't think we'll be interested in waiting for the world to rebuild all the legacy tooling from the OS up. For new stuff, that'd be great.
I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.
bnyhil31-afk•May 6, 2026
Maybe not the same approach, but start with a kernel that acts as a governed membrane for everything else?
Ah yes. The trains everywhere approach to self driving cars.
donaldjbiden•May 5, 2026
We used to have this. It was called OLE Automation.
andrekandre•May 6, 2026
yep, and applescript...
i'm really not sure companies will allow their apps to be automated so easily, and the reason is api abuse (think of a saas where you can upload file attachments for example); you'd either end up banned or throttled pretty fast, and in the end the company will be like "cost > opportunity" and just close it off (and its like this already, llms just make this worse)
awongh•May 5, 2026
At the beginning of the internet we were promised the free flow of digital information between computers, peer-to-peer. What we got was silos of content each fighting each other to make sure that the silos stay intact with DRM.
I could imagine an AI future where agentic shopping companies who promise me the best deal are pitted against Walmart and Amazon trying to algorithmically squeeze me for $2 more: just two bots playing a cat and mouse game to save me a few bucks.
For some reason a lot of tech ends up in these antagonistic monopolies: Apple wants to sell privacy-aware devices as a product feature, Google wants to give you mail and maps, but sell your data. Despite any appearances neither give a shit about you, even if you benefit from the dynamic.
FirestarAlpha•May 5, 2026
That’s actually what the Reflex plugin behind the APIs in the benchmark does. It creates APIs from your app’s event handlers, thereby providing a stateful way for agents to navigate apps.
It’s why we did this benchmark :) - reflex team member
pier25•May 5, 2026
And when the agent fucks up badly (as we've seen over and over again) who will be held accountable? The user?
airstrike•May 5, 2026
It doesn't need to be mobile. The AI-first OS will be headless, undoubtedly.
Humans would be the second-class users of said OS, which can generate UIs on demand as needed.
I've thought about this quite a bit. Started implementing as a side project, but I have too many side projects at the moment...
> In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.
So, like a Unix system?
ssl-3•May 5, 2026
We'll just close the loop with a systemd MCP, set the shell to /usr/bin/codex, and find some other way to pay the bills.
Perfect.
titzer•May 6, 2026
The GUI was a mistake. Long live the shell!
1659447091•May 6, 2026
> In an agentic world, the OS needs to be completely rethought
Isn't that what Apple is doing with its Foundation Models Framework?[0] Developers can integrate Apple's on-device LLM, which includes things like tool calling. I don't write Apple-specific apps so I'm not sure what can actually be done with it, but it looks promising, and it seems to be where Apple already thinks things are headed.
> I think OpenAI designing their own phone is the next logical step
ChatGPT is already integrated into Apple Intelligence for those that want to use that instead of Apple's model -- I don't see OpenAI trying to change lanes into phone making when they can focus on doing what they know while collecting a large check from Apple
This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.
palashawas•May 5, 2026
Right - we did this benchmark because we launched a plugin that makes APIs programmatically from an app's human-oriented UI (from the event handlers, to be specific). So any app that has a human-oriented UI now has an API.
The benchmark is a more generally interesting part of the launch materials, so I figured it had its own separate home here.
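To give a flavor of the idea (an illustrative sketch, not our plugin's actual code; all names below are made up), think of taking a UI event handler and exposing it directly as an endpoint:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


# The app's existing human-oriented event handler (hypothetical).
def on_submit_review(product_id: str, rating: int, text: str) -> dict:
    ...  # whatever the "Submit review" button already does
    return {"status": "ok", "product_id": product_id}


class ReviewIn(BaseModel):
    product_id: str
    rating: int
    text: str


# Generated wrapper: the same handler, now directly callable by an agent.
@app.post("/api/reviews")
def submit_review(review: ReviewIn) -> dict:
    return on_submit_review(review.product_id, review.rating, review.text)
```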
moralestapia•May 5, 2026
That is actually great, I'll definitely check it out. Thanks!
Havoc•May 5, 2026
Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.
I can see the appeal of the pixel route given universality, but wow, that seems ugly on efficiency
QuercusMax•May 5, 2026
imagine, if you will, that we had a windowing system that's built on Postscript... lots of folks thought it was a super awesome idea, and built NeXTSTEP around it.
https://en.wikipedia.org/wiki/Display_PostScript
Wayland only has pixels. It was designed to get rid of all the X11 cruft.
lelanthran•May 5, 2026
> Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.
Not possible on Wayland; maybe with the X11 protocol?
_boffin_•May 5, 2026
What i don't understand about "computer use" is why they're not just grabbing the window handles and storing them to determine what should be clicked after the first few iterations of using a specific application. if a new case / path / whatever is found, drop back to screen grabbing and bounding boxes, then figure out the handles that are there and store them after.
idk.. not really thought out too much, but has to be better
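roughly what I mean, as a sketch (Playwright-style `page.click`; `locate_by_vision` is a stand-in for the screenshot-plus-bounding-boxes path):

```python
import json
from pathlib import Path

CACHE = Path("element_cache.json")


def cached_click(page, app: str, action: str, locate_by_vision) -> None:
    """Try a previously stored selector first; only fall back to vision."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    selector = cache.get(app, {}).get(action)

    if selector:
        try:
            page.click(selector, timeout=3_000)  # cheap, deterministic path
            return
        except Exception:
            pass                                 # UI changed; re-discover below

    selector = locate_by_vision(page, action)    # expensive screenshot path
    page.click(selector)
    cache.setdefault(app, {})[action] = selector # remember for next time
    CACHE.write_text(json.dumps(cache, indent=2))
```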
faangguyindia•May 5, 2026
I saw Codex was screenshotting, then clicking around. I just stopped it and never used that again.
Using CLI tools is much faster and token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.
I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc...
I developed all this on $20 codex sub.
ceejayoz•May 5, 2026
Claude does this too, with the Chrome extension.
It breaks like 80% of the time for me, and it's incredibly slow. Having it use Playwright (bonus: can test in FF/Saf too) was a big improvement.
embedding-shape•May 5, 2026
I think it's the third or fourth time I've seen you bragging on HN about how many apps you're able to develop with AI now. Care to link any of them, especially where we can see the actual code that you've produced here? Without being able to see actual results, I'm not sure what you want people to take away from your repeated comments.
faangguyindia•May 5, 2026
I only write here because people are spreading doomerism here with AI and I am excited about the future.
Well, I am competing with geoip providers like MaxMind.
I developed custom traceroute and ping service to geolocate IPs with very high accuracy beating products like digital element, maxmind, ipinfo
These companies have huge teams. But my 3 people company already beat them.
Code doesn't matter much, it's not an opensource project.
My free app is http://macrocodex.app which I've developed along with a fitness coach.
I am currently beating companies with 20-30 developers and closing more deals while having 1/10th of the staff.
I am simply very excited about all this.
Nobody cares how you solve the problem, or if your code is ugly. As long as it's reliable and without downtime, you aren't breaking things and causing your customer headaches, you are winning.
Even before AI, bad code existed. Not every company had 10x developer writing beautiful idiomatic rust code.
AI is just a tool, people who are trying to generate a whole codebase with it are doing something very wrong. You can write code faster with AI provided you understand its strengths and weaknesses
embedding-shape•May 5, 2026
> Code doesn't matter much, it's not an opensource project.
Heh, you're in for a rude awakening, sometime in the future :) But I won't spoil the surprise, you clearly have made up your mind about what to focus on.
> My free app is http://macrocodex.app which I've developed along with a fitness coach.
Crazy, this app you've run for ~1-2 months has 10K active users already, even though there is zero info about who runs it, zero reviews, and says "Download on the App Store" on the landing page even though you then ask people to use the web app, impressive.
I don't think anyone said using AI can't produce a ton of code really quickly, and no one is finding that difficult to manage either. But most of us software engineers are trying to build long-lasting codebases with AI too, where "less === better" typically, so it's not about being able to spit out features as fast as possible, but about keeping the ever-growing codebase from collapsing on top of itself, and keeping each prompt from getting slower and slower, staying as fast as on a greenfield project.
Sounds like you've found the holy grail of being able to avoid that, kudos if so. Judging by you giving zero care to how the design and architecture actually is, I kind of find that hard to believe. But, if it works for you, it works for you, not up to me or others to dictate how you build stuff, hope you enjoy it, however you build stuff :)
faangguyindia•May 6, 2026
>Heh, you're in for a rude awakening, sometime in the future :) But I won't spoil the surprise; you clearly have made up your mind about what to focus on.
>Even though there is zero info about who runs it.
People in the community already know who runs it; most others don't care. You won't get 10K users without people getting results. It's a free app, so not like I am spending bucks to advertise it on social networks.
The app is completely free, doesn't upload data to any server (other than Sentry crash reporting), doesn't ask for any email or phone number. When people get results, they share them with their friends. That's how it's growing.
>Says "Download on the App Store" on the landing page even though you then ask people to use the web app.
On iOS, we have a PWA. I am well aware of it.
nonameiguess•May 5, 2026
Why even bother asking a guy with the statistical acumen to think he can make a reliable estimate of a monthly average from some span of time shorter than two months? He's probably just going to say it doesn't matter and unfortunately he's probably right. If you sound excited enough, you can convince other people and close deals, so who gives a shit if there's really a there there? We'll see how he's doing in another decade. Reminds me of my sister always trying to get into real estate and mortgage brokerage speculation, glowing whenever there's a market spike about people pulling in 200 grand a month, yet 25 years later she's still broke, doesn't own her own house, and her daughter is constantly asking me for money instead of her.
faangguyindia•May 6, 2026
> statistical acumen to think he can make a reliable estimate of a monthly average from some span of time shorter than two months
Perhaps because those numbers are provided on the Play Store dashboard? You should question Google's acumen in providing those statistics to developers?
And people have been estimating ARR through projections for a long time.
Electron uses 10x more RAM than regular apps. But it's so convenient.
Python is 100x slower than C. It's in the top 3 of languages now.
Worse but more convenient always wins.
password4321•May 6, 2026
This is probably why MCP "code mode" (generating code once to call the MCP going forward) hasn't caught on yet... no need until the financial costs reflect reality.
gowld•May 5, 2026
Confusing title? "Computer Use" is actually "Browser vision"?
antves•May 5, 2026
I think one main point is that not all "computer use" is the same, the harness and agentic experience matters a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience
In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)
At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context about as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier and that's why small models can do it, which is another dimension that must be considered
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be
overgard•May 5, 2026
I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.)
The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:
- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.
- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.
- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.
Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.
peyton•May 5, 2026
It’s great at
1. things you wouldn’t otherwise bother doing
2. things where it otherwise would get stuck iterating on hacky workarounds doomed to fail
“Reverse engineer this app/site so we can do $common_task in one click”, “by the way, I’m logged in to $developer_portal, so try @Browser Use if you’re stuck”, etc.
I just had Codex pull user flows out of a site I’m working on and organize them on a single page. It found 116. I went in and annotated where I wanted changes, and now it’s crunching away fixing them all. Then it’ll give me an updated contact sheet and I can do a second pass.
I’d never do this sort of quality pass manually and instead would’ve just fixed issues as they came up, but this just runs in the background and requires 15 minutes of my time for a lot of polish.
overgard•May 5, 2026
I guess the problem I see here is that if the use case is "things I otherwise wouldn't bother doing", that's fine, but it's pretty niche. I dunno, if you're talking about a human "Agent" (like say in sports or entertainment), they'd be a trusted person to handle business matters outside of your competency (contract negotiations, etc.). I don't see AI "agents" being at all like that, they're more like an intern you need to supervise constantly.
rootcage•May 5, 2026
The best use cases I've seen for computer/browser use is for legacy SaaS/Software. For example, hotels use archaic Property Management Systems (PMS) and they're required by corporate to use it and pay for it. These companies can barely keep the product alive, they definitely aren't incentivized to maintain an API. In such a case browser use agent seems to be the best (only) way.
noprocrasted•May 5, 2026
Wouldn't using a coding agent to build a screenscraper be better?
merlindru•May 5, 2026
I'm building something that fixes this exact problem[1].
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`
Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
gbriel•May 5, 2026
This is a good solution, instead of everyone blowing tokens on repeating the same computer use task, come up with a way to share the workflows. I think you'd need to make sure there aren't workflows shared that extract user information (passwords).
merlindru•May 5, 2026
this is protected against at the OS level, provided the applications declare the input correctly as a SecureTextField.
i so far haven't found any application that doesn't.
all you're able to get out, as far as i can tell, is the length of the entered password.
jasomill•May 6, 2026
From applications that capture the screen or use accessibility APIs, perhaps, but what about, e.g., Windows applications that capture window messages?
Obviously, if you can inject code into a process that receives sensitive data, you're already running in a context where all security bets are off.
But with processes you yourself create, you probably can, even without elevated privileges, unless the application takes measures to prevent injection (akin to game anticheat mechanisms), so it seems worth pointing out that there are simple mechanisms to subvert such "protected" fields that don't require application-specific reverse engineering.
teej•May 5, 2026
You should call it Braille
merlindru•May 5, 2026
shit, why didn't i think of that
i tend to think of invoke as "an API over macOS apps" tho...
doesn't `invoke finder shareAndCopyLink` read very nicely? :P
ctoth•May 5, 2026
If agents is what it finally takes to get good a11y I'll take it. I'll bitch about it, but I'll take it.
merlindru•May 5, 2026
i think this goes both ways too :) agents have been a boon for everyone with disabilities, carpal tunnel, RSI, ADHD, anything
and now the fact that interfaces need to be accessible to agents, not just humans, ironically improves accessibility for humans in return
lopis•May 5, 2026
And lets not forget that not all disabilities are chronic. Many disabilities are situational or temporary. AI is a great assist for a hangover day for example...
tomjakubowski•May 5, 2026
Playwright, the end-to-end testing framework for the web, provides a strong incentive to give sites good a11y: Playwright tests are an absolute delight to read, write and maintain on properly accessible sites, when using the accessibility locators. Somewhat less so when using a soup of CSS selector and getByText()-style locators.
One thing I am curious about is a hybrid approach where LLMs work in conjunction with vision models (and probes which can query/manipulate the DOM) to generate Playwright code which wraps browser access to the site in a local, programmable API. Then you'd have agents use that API to access the site rather than going through the vision agents for everything.
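E.g., with the Python bindings, such a generated wrapper might boil a whole flow down to something like this (sketch; the site-specific names are invented):

```python
from playwright.sync_api import sync_playwright


def submit_review(url: str, rating: str, text: str) -> None:
    """Generated 'local API': drives the site via accessibility locators only."""
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        page.get_by_role("tab", name="Reviews").click()
        page.get_by_label("Rating").select_option(rating)
        page.get_by_label("Your review").fill(text)
        page.get_by_role("button", name="Submit").click()
```

An agent then calls submit_review(...) instead of re-deriving the clicks from pixels on every run.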
lsaferite•May 5, 2026
Using playwright-cli with Claude code is highly effective for debugging locally deployed web apps with essentially zero setup.
tyingq•May 5, 2026
Was looking for this comment. I'd like to see this approach in the comparison...having the LLM build a playwright script and use it. I suspect it would beat time-to-market for the api, and be close-ish in elapsed time per transaction.
Harder to scale if it's doing a lot of them, I suppose.
giancarlostoro•May 5, 2026
This is precisely how the Playwright MCP works, which lets something like Claude directly test a website.
I've mentioned several times, and gotten snarky remarks for it, that rewriting your code so it fits in your head, and in the LLM's context, helps the LLM code better. People complain about rewriting code just for an LLM, not realizing that the suggestion is to follow better coding principles to let the LLM code better, which has the net benefit of letting humans code better! Well, it looks like if you support accessibility in your web apps correctly, Playwright MCP will work correctly for you.
Amazing.
pjc50•May 5, 2026
Very real risk of this going in reverse: people building inaccessible websites to prevent AI use.
blurbleblurble•May 5, 2026
"AI" is a made up hype thing. It's just computers and computer programs. For real!
solenoid0937•May 5, 2026
Those people probably aren't working on anything useful anyways, so its no big deal.
20k•May 5, 2026
I've found that by far the most useful websites as a programmer are also the ones most resistant to AI. This would be a huge loss for anyone vision impaired
claytonjy•May 5, 2026
What sorts of sites are you thinking of? To me, “most useful to a programmer” evokes docs and blogs and github issues and forum posts. I suppose some forums might be AI-resistant (login wall), but the others are trivially AI accessible.
irishcoffee•May 5, 2026
GitHub is naturally LLM resistant via its new uptime feature… I’ll show myself out.
Rebelgecko•May 6, 2026
Plenty of Linux-y websites use Anubis. Arch Wiki and IIRC some other distros too.
fc417fc802•May 6, 2026
That's less a value judgment, more a necessary evil due to the plethora of bad actors out there. I doubt it will get in the way of a local model used in a reasonable manner.
Most wikis you can mirror locally if you really need to hammer them.
stingraycharles•May 5, 2026
Examples, please.
stingraycharles•May 5, 2026
That’s such an extremely small niche of people it’s not a real risk.
sciencejerk•May 6, 2026
Or human engineers limiting AI-consumable documentation to improve job security!
linkjuice4all•May 5, 2026
I mean…I guess. But this is ridiculous - how many layers does our technology need to bash through to update two records on remote systems? I get that value is being added at some point - but just charge some micropayment for transactions. This is just too much.
lazide•May 5, 2026
Ever read Vernor Vinge’s a deepness in the sky? Digital archeologist, coming right up.
hellojimbo•May 5, 2026
Isn't that basically what Browserbase does? I've found the hardest part of browser use to be stealth first, then client change management, then browser comprehension (which gets better with every new model).
merlindru•May 5, 2026
i'm not too familiar with browserbase, but invoke works with any macOS app (or at least the accessible ones), i think browserbase is only for browser usage.
in the context of this blog post, the conclusion looks similar though!
"use the whole web like it's an API"
works much better than
"figure out similar or identical tasks from a clean slate every single time you do them"
Not really IMO, webmcp has devs change their apps. invoke just works with existing apps, especially ones that are accessible
invoke rather has overlap with Claude's and Codex' computer-use, except the steps are stored/scripted.
webmcp is bottom-up. computer-use & invoke are top-down
btown•May 5, 2026
If you're on macOS and interested in this space, I highly recommend you open up the system-provided Accessibility Inspector.app and play around with apps and browsers. See how the green cells might guide an LLM to only need to read/OCR specific parts of a screen, how much text is already natively available to the accessibility engine, and how this could lead to really effective hybrid systems - not just MCPs, but code generators that can build and run their own scripts to crawl your accessibility hierarchy for your workflow!
I think this is very fertile ground - big labs need to use approaches that can work on multiple platforms and arbitrary workflows, and full-page vision is the lowest common denominator. Platform-specific approaches are a really exciting open space!
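As a rough sketch of what crawling that hierarchy looks like from Python, assuming the pyobjc ApplicationServices bindings (untested, and the process needs accessibility permission):

```python
from ApplicationServices import (
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
    kAXChildrenAttribute,
    kAXRoleAttribute,
    kAXTitleAttribute,
)


def dump(element, depth=0):
    # pyobjc returns (error, value) tuples for the C out-parameters.
    _, role = AXUIElementCopyAttributeValue(element, kAXRoleAttribute, None)
    _, title = AXUIElementCopyAttributeValue(element, kAXTitleAttribute, None)
    print("  " * depth + f"{role} {title or ''}".rstrip())

    _, children = AXUIElementCopyAttributeValue(element, kAXChildrenAttribute, None)
    for child in children or []:
        dump(child, depth + 1)


pid = 12345  # hypothetical: the target app's process id
dump(AXUIElementCreateApplication(pid))
```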
merlindru•May 6, 2026
That's how I got into this thing in the first place, hah. Golden advice. It's incredibly cool to see what some apps offer. More of them have great accessibility support than you think (or at least than I thought!)
drob518•May 6, 2026
Great idea.
willwade•May 6, 2026
take a peek at https://github.com/willwade/app-automate?tab=readme-ov-file#... - it's early and needs some work - but this is the idea behind this.. (my use case is not agents but actual real disabled people.. who need tooling to provide better access to the desktop)
Interesting! I started something - nowhere near as complete as that and quite different, but again using accessibility UI elements. The BIG problem I've found is that SOOOO much stuff does a really poor job of exposing these elements. Here was my approach https://github.com/willwade/app-automate?tab=readme-ov-file#... - What I do here is build UI templates, either using UIAccess OR doing one pass with a vision model.
"my experience is the opposite actually. UIA looks uniform on paper but WPF, WinForms, and Win32 all expose different control patterns and you end up writing per-toolkit handlers anyway. Qt only exposes anything if QAccessible was compiled in and the accessibility plugin is loaded at runtime, which on shipped binaries is basically never. Electron is just as opaque on Windows as on macOS because it's the same chromium underneath drawing into a canvas. the real split isn't OS vs OS, it's native toolkit vs everything else."
janalsncm•May 5, 2026
Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds.
The only reason you wouldn’t choose an API is if it wasn’t viable.
jacktu•May 5, 2026
Totally agree. I’ve been building an AI visual tool recently and experimented with both approaches. The latency and cost of generic "agentic" browser use are absolute dealbreakers for real-time consumer apps right now. Structured APIs (even just chained LLM calls with strict JSON schemas) are not only 40x cheaper, but more importantly, they are deterministic enough to actually build a stable product on top of. Computer use is an amazing demo, but structured APIs are what pay the server bills.
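By "strict JSON schemas" I mean something along the lines of structured outputs (sketch with the OpenAI Python SDK; the model name and schema are just examples):

```python
from openai import OpenAI

client = OpenAI()

cart_schema = {
    "type": "object",
    "properties": {
        "product_id": {"type": "string"},
        "quantity": {"type": "integer"},
    },
    "required": ["product_id", "quantity"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Add two units of SKU-123 to the cart."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "cart_action", "strict": True, "schema": cart_schema},
    },
)
print(resp.choices[0].message.content)  # parses against the schema, every time
```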
ai_fry_ur_brain•May 5, 2026
"Agentic engineering" were always just FADs to bring in more revenue for token providers.
If I think an LLM is good for something I create well defined, very deterministic "middleware" for that purpose on top of Openrouter.
wahnfrieden•May 5, 2026
It’s not a fad or without value.
ai_fry_ur_brain•May 5, 2026
It's very much valuable to lazy people who don't care about quality or doing hard things. I totally see the appeal for those people.
wahnfrieden•May 5, 2026
Sounds like you are more interested in performativity / aesthetics of production if you think writing software in a harder way is an indisputable virtue just because it requires more effort. On top of that you are an elitist about it
Agent use can be used to improve quality and maintainability
k__•May 5, 2026
Agentic engineers can build well defined, very deterministic middleware on top of OpenRouter.
Anthropic even says that an agent-based solution should only be your last resort and that most problems are well served with a one-shot.
Written 1.5 years ago. Anthropic would not advertise this stance today.
I'm much more agreeable with that type of LLM workflow. Running "agents" with a monolithic "harness" for long-time-horizon tasks seems wasteful and unnecessary, but probably super appealing to lazy people.
zephen•May 5, 2026
I find this extremely surprising.
When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
2001zhaozhao•May 5, 2026
I have only found Computer Use useful for GUI app local debugging. Presumably it will also be useful for getting around protections for external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind.
I don't think any new app should ever be specifically designed for AI to interact with them through computer use
ai_fry_ur_brain•May 5, 2026
It's funny watching the slow mean reversion back to more deterministic tooling.
sanderjd•May 5, 2026
Only 45x?
sheepscreek•May 5, 2026
This tracks - has been my experience exactly. Not to mention there isn't a particularly significant lift in accuracy or speed. As things stand, to me it is the worst of both worlds. Expensive and inaccurate.
orliesaurus•May 5, 2026
Computer Use? Or Browser Use? IMHO big diff
The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1] - I would just run that and get all the API calls in a nice format and then replay them over and over to do things in succession.
In the new world, we have access to OpenAPI.json and whatnot, but in the world where things were built in the days pre-OpenAPI and pre-specs and best practices...I am not so sure! (and a lot of world lives then)
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
Is it possible to ask the vision agent to "map" the UI and expose it to another agent as a set of interfaces that resemble an API better? From what I understand the vision agent now should both know that "next page" shows more results and that they need to get more results in the first place.
If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and another agent was then given that description, would the other agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?
With an example UI I made up, the description (API-like interface definition) could be something like:
Get all reviews:
To get all the reviews you need to go to each page and click "show full review" for every review summary in that page.
Go to each page:
Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).
So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment.
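Put differently, the exploring agent would emit something machine-readable for the second agent, roughly like this (shape entirely made up):

```python
# Hypothetical output of the "explorer" agent, handed to the "worker" agent.
ui_map = {
    "get_all_reviews": {
        "start": "Reviews tab, page 1 by default",
        "per_page": "click 'show full review' on every review summary",
        "pagination": "click 'next' until the 'next' button disappears",
    },
    "elements": {
        "next_button": {"role": "button", "name": "next"},
        "show_full_review": {"role": "button", "name": "show full review"},
    },
}
```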
Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
angry_octet•May 5, 2026
I think you're right, you can get agents to do what we do -- learn how a website works. Then expose that model as a simple API. There will still be some vision tasks for navigation but they will be just vision tasks, no thinking required.
nijave•May 6, 2026
That was my first thought as well. A lot of current web development relies heavily on code generation, then has obfuscation and compression slapped on top, leading to complicated structures. Then on top of that, more code (client-side JavaScript) reconfigures everything again. You end up with fairly complicated HTML/CSS/JavaScript to wade through.
For better and worse, 5-10 MiB isn't uncommon for a web app.
Instead of trying to go "bottom up" and, effectively, do what a browser engine is doing in reverse, it seems easier to go "top down" like a human does and go off the visual representation.
faangguyindia•May 6, 2026
>Is it possible to ask the vision agent to "map"
No, most vision models focus on a subset of an image at a time when doing image -> text.
Image -> image uses the whole image.
esperent•May 6, 2026
> No most vision models focus on subset of an image at a time when using image -> text
Is this true? Where can I read more about it?
RobRivera•May 5, 2026
UX feedback
Me: hmm, this title confuses and infuriates Rob.
[Clicks link]
Me: Sees same title, repeat feelings of confusion and infuriation
[Scrolls article down on my smartphone]
Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.
[Closes tab]
[Continues living rest of my life]
I hope this feedback is well received and understood.
arjunchint•May 5, 2026
The hard part about the web is that APIs often just aren't available, even if the website owner wants them exposed (big if).
I embedded a Google Calendar widget on my "Book a demo" page; I don't know of an API for it, and Google doesn't expose/maintain one either.
What we are doing at Retriever AI is to instead reverse engineer the website APIs on the fly and call them directly from within the webpage so that auth/session tokens propagate for free:
https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
rahulyc•May 5, 2026
All the websites currently blocking Claude Code or other AI agents are fighting a losing battle. Computer use is in the early stages, and the thing preventing mass adoption seems to be the number of tokens it takes. Agents can fumble around trying 10 CLI commands that don't work before finding the right one, and we barely notice. Visual agents (browser use / computer use, etc.) eventually fumble onto the right thing too, but we don't have the patience to wait 20 minutes for them to click a button. As tokens get cheaper and faster, we'll probably get models that can use a UI just as natively as a CLI.
boringg•May 5, 2026
Tokens cheaper? I don't think that seems to be the case ... VC-funded tokens were there to build a user base, and token prices will go up as providers eventually switch from growth to profitability.
bheadmaster•May 5, 2026
It will take a few years until scheduled data center construction finishes, and together with software optimizations that may come up in the meantime, it may cause a significant decrease in token price.
Aurornis•May 5, 2026
I wish I could place a lot of money on the opposite side of this bet.
I don't think many realize how good the cheap alternative models are becoming. I prefer SOTA models for key work, but I can also spend 10X as many tokens on an open model hosted by a non-VC-subsidized provider (who is selling at a profit) for tasks that can tolerate slightly less quality.
The situation is only getting better as models improve and data centers get built out.
boringg•May 5, 2026
Fair - there are bets both ways though I wouldn't consider it to be a certainty. That revenue drive on this AI build out is going to be real and multifold.
caughtinthought•May 5, 2026
What open source model and what non-subsidized provider specifically?
nijave•May 6, 2026
GLM 4.7 Flash is 0.07/1m tokens in, 0.40/1m tokens out on AWS Bedrock us-east-1. That's less than 1/10 the price of Haiku 4.5
Bedrock isn't the cheapest either although I'm fairly sure they aren't being VC subsidized
There are definitely cheap tokens out there. The big gotcha is "for tasks that can tolerate slightly less quality"
EduardoBautista•May 5, 2026
Yes, but how cheap is it to run four at the same time? It’s tough to run one good model locally, but running four at the same time which I commonly do with Claude and Codex just doesn’t seem to be happening anytime soon.
Aurornis•May 5, 2026
I'm referring to hosted models such as via OpenRouter or from the model providers' own services.
I think everyone making claims that inference is getting more expensive are unaware that there are more LLM providers than Google, Anthropic, and OpenAI.
johnsmith1840•May 5, 2026
And the lethal trifecta, but I suppose that applies to all agents as of now anyhow. Every AI provider has major warnings about letting AI have access to PII in the browser.
einpoklum•May 5, 2026
> the thing preventing mass-adoption seems to be the number of tokens it takes.
Try the exorbitant expenses and ballooning waste of generated electricity and usable water.
ls612•May 5, 2026
They don’t need to be 100% effective they just need to make you afraid enough of being banned to not bother trying.
octoberfranklin•May 6, 2026
How do they know that the "you" accessing the site is the same "you" they previously banned?
Face-scanning? Iris patterns?
ls612•May 6, 2026
You used your credit card to buy whatever service or product they sell.
octoberfranklin•May 6, 2026
I hate to break it to you but it is really easy to get anonymous visa/mastercard cards.
Nobody can block actual LLM providers; they use spoofed requests to scan the web for content, sometimes even using residential proxies.
nijave•May 6, 2026
Sure they can, proof of work seems to be effective. Anubis has become pretty popular
johnsmith1840•May 5, 2026
Text-based web browsing? Would love the comparison there. Tons of systems have a DOM translation layer. I'm building around this, with the concept of turning a webpage into text for an agent to use directly. I actually had to move away from Haiku not because of accuracy problems but because it operated the browser too fast for a human to follow what it was doing. The real loss here is bespoke webapps like a Figma or Google Docs, where it's near impossible to see what they are doing via the DOM.
To me the browser is a translation layer. Working on the browser directly, while hard, enables big advantages in compatibility. The only thing I miss as of now, which is on the todo list, is OCR of the images in the browser into text. But an API would need to do that anyway to work.
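The bare-bones version of that translation layer looks something like this (sketch with BeautifulSoup; a real one needs far more handling for frames, shadow DOM, visibility, etc.):

```python
from bs4 import BeautifulSoup


def page_to_text(html: str) -> str:
    """Flatten a page into a numbered list of interactive elements for the agent."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for i, el in enumerate(soup.find_all(["a", "button", "input", "select", "textarea"])):
        label = (
            el.get("aria-label")
            or el.get("placeholder")
            or el.get_text(strip=True)
            or el.get("name", "")
        )
        lines.append(f"[{i}] <{el.name}> {label}")
    return "\n".join(lines)
```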
The main loss in my view of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI, that's it. Computer use to me is the promise of being able to replicate end-to-end the actions a human does. APIs can do that in theory, but the data to do that is also near impossible to collect properly.
etothet•May 5, 2026
Vision has a long way to go. I remember trying an early version of AWS's Nova Act and laughed at how slow it was. And a few months later it hadn't really seemed to improve that much.
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach.
A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up in adding a lot of artificial waits for certain elements to exist on the page.
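With Playwright that ends up looking something like this (sketch; the URL and element names are placeholders):

```python
from playwright.sync_api import expect, sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/reviews")  # placeholder URL

    # Wait for a lazily rendered section before letting the agent act on it.
    reviews = page.get_by_role("list", name="Reviews")
    expect(reviews).to_be_visible(timeout=10_000)

    # Or scroll and wait for the network to settle after an infinite-scroll trigger.
    page.mouse.wheel(0, 5_000)
    page.wait_for_load_state("networkidle")
```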
nijave•May 6, 2026
Would a lightweight motion detection algorithm work there?
Thinking of Frigate NVR that does motion > object detection > scene description
Where you build up to progressively slower and more expensive algorithms i.e. there's motion > it's a person > here's what the person is doing
angry_octet•May 5, 2026
Great guidance hidden in here for making it expensive for agents to navigate your website. Move elements on screen as the mouse moves, force natural mouse movement to make the UI work, change the button labels in the JS to be randomly named every visit, force scrolling to the bottom of the screen to check for hidden extra tasks...
Hang on, that sounds like common corporate SaaS apps.
notjustanymike•May 5, 2026
Ah damn it, we invented Jira
fooker•May 6, 2026
Jira from first principles
Almost sounds like an O'Reilly book
QuantumNomad_•May 6, 2026
The O’Reilly animal for Jira is apparently some kind of duck or goose.
Matthew B. Doar (2011). Practical JIRA Plugins. O’Reilly.
In case anyone was wondering. Which they probably weren’t :p
fooker•May 6, 2026
I'm more interested in the next volume: impractical Jira plugins
drob518•May 6, 2026
Real LOL!
zmmmmm•May 6, 2026
It's really weird, I'm seeing across the board that people who never believed in them before are suddenly all into good software eng practices (starting with writing a spec) because of AI.
It's kind of fascinating that we never were willing to do these things for humans but now that AI needs it ... we are all in. A bit depressing in the sense that I think mostly the reason we're happy to do it for AI is that we perceive it will benefit us personally rather than some abstract future human.
taneq•May 6, 2026
It’s an interesting psychological phenomenon. It’s like the way I keep my house way tidier since I got a robot vacuum. Pick things up off the floor for aesthetics’ sake? Nah. Pick them up because the vacuum will attempt to eat them and might get sick? Of course!
GarnetFloride•May 6, 2026
My manager just told me that after 12 years of trying to get one of the founders to understand the difference between dev docs and user docs, they tried getting Claude to do it and he finally got it that they are different. He'd been saying this whole time that customers could just read the dev docs. If they could, they wouldn't need our software.
zenoprax•May 6, 2026
How firm is the boundary between a dev doc and a user doc in your opinion? I have found that the overlap can be quite large if the users are also technically proficient. Right now I'm trying to balance "how X works so you can use the app better" with "how X works so you can contribute or build your own plugin". DeepWiki really helps as a backstop for anything not already covered though it's not without its own caveats of course.
PetitPrince•May 6, 2026
Not OP but I think you have the right intuition in making a difference between using the app / contributing to the app. You may want to read https://diataxis.fr/ which elaborates on this idea and adds another dimension (action / cognition) to it.
zenoprax•May 6, 2026
I appreciate the suggestion but that's what I've been using! :D
In fact, the only area I've been struggling with are "Concepts" because they have less clear boundaries for the right amount of detail.
My friend at a faang was talking about the "massive overhauls to make everything ready for ai". I asked for an example. He said "basically just documenting the shit out of everything"
I guess that just never occurred to anybody before.
programmarchy•May 6, 2026
AI might actually RTFM
Cthulhu_•May 6, 2026
It would / should / can, but there's big efforts in reducing token consumption now, so AI will likely try to skim and pick documentation just like real humans.
akoboldfrying•May 6, 2026
There was a recent effort at work to make it possible for agents to provide up-to-date help on how to do various admin/setup tasks. A very sensible goal: We already have lots of documentation, the problem is that it's scattered everywhere and mostly out of date. Turns out the new solution amounted to someone manually going through it all and painstakingly preparing some Markdown files for consumption by said agent.
Somebody pointed out that those Markdown files might be helpful for people to read directly. Bit of an Emperor's new clothes moment. (I wanted to slap a :rolling_on_the_floor_laughing: reaction on it, but sadly it turns out I'm actually too chickenshit to do that in today's job market.)
_heimdall•May 6, 2026
The CEO of Uber made the same comment on Diary of a CEO recently. I think it was for their customer service team if I'm not mistaken, they threw their existing docs at an LLM and it was all over the place because policies were poorly documented and defined. The team is now documenting everything from scratch, focusing on outcomes rather than process - TBD if it works out.
noduerme•May 6, 2026
Yeah, someone made the point in a popular post here recently that all the firings are reducing institutional knowledge. IMHO, replacing that knowledge with LLM-written documentation is even more potentially catastrophic. Just from organizations I've worked in, a lot of the useful human knowledge is in knowing how to handle either undocumented edge cases or situations where the documents are outdated or wrong. Working with LLMs and reminding them to update those docs every time? Good luck. And if it's something where the docs touch actual real world operations, that's an area where only human operators with hands-on experience are going to recognize the potential conflicts or cognitive dissonance.
majormajor•May 6, 2026
Having the humans document the code seems backward (maybe that's not what they're doing, but "make everything ready for ai" sound manual). And hopefully there aren't that many scary surprises that humans need to manually document.
One of the best parts of LLMs is that you can use them to bootstrap your documentation, or scan for outdated things, etc, far more quickly than ever before.
Don't just throw a mountain at it and ask it to get it right, but use a targeted process to identify inconsistencies, duplicates, etc, and then resolve those.
And then you have better onboarding material for the next human OR llm...
palmotea•May 6, 2026
> Having the humans document the code seems backward (maybe that's not what they're doing, but "make everything ready for ai" sounds manual).
No, that's forward. Any documentation an AI can make, another AI can regenerate. If an LLM didn't write the code, it shouldn't document it either. You don't want to bake in slop to throw off the next LLM (or person).
majormajor•May 6, 2026
> It's really weird, I'm seeing across the board that people who never believed in them before are suddenly all into good software eng practices (starting with writing a spec) because of AI.
> It's kind of fascinating that we never were willing to do these things for humans but now that AI needs it ... we are all in. A bit depressing in the sense that I think mostly the reason we happy to do it for AI is that we perceive it will benefit us personally rather than some abstract future human.
I don't think that's the reason.
I think it's because they take time, and few people were willing to put in time for "maybe it'll make writing the actual code faster" gains when the code was going to take a few times longer to write itself.
You also can get faster feedback to iterate on your spec now, which improves the probability of it helping future-you.
So combine that with the fact that the llms are more likely to get lost if you don't spec stuff in advance, and the value of up-front work is higher (whereas a human is more likely to land on the right track, just more slowly than otherwise, making the value harder to quantify).
Cthulhu_•May 6, 2026
Yeah I think a lot of pushback to best practices is basic cost/benefit; I like writing documentation, but I often feel a bit depressed that nobody will actually read it in as much detail as I wrote it. But LLMs do / can.
Actually there's a lot of projection there too; I don't read documentation in detail. And nowadays, I point an LLM at documentation so that it can find the details I would otherwise skip over.
The destruction of the millennial attention span is real, and it's worse in the younger generations, lmao.
noduerme•May 6, 2026
Well it's also just that you have a list of 20 features to add, and if it works, you want to ship it, and someone might even get mad if you spend a day dawdling on best practices and documentation and so on. Corporate cultures generally don't have the same long term thinking about reusability and legibility and fault-tolerance that an individual coder may have about the code they want to write once and forget. (Neither do LLMs, for that matter).
DrewADesign•May 6, 2026
I always knew the dev world leaned more toward interesting technical challenges and interoperability than maximizing the benefit to humanity- it’s why I switched to design. However, I didn’t realize the intensity of that preference until the entire industry got ridiculously AI-pilled.
cheriot•May 6, 2026
Better commit messages, better and more up to date docs, etc. It's not all slop!
MereInterest•May 6, 2026
The trick is that you make it something that humans want to do. Using [0] as an example, the interactive elements move, with context-dependent environment interactions.
[0] https://www.cs.unm.edu/~dlchao/papers/p152-chao.pdf
We built isagent.dev for exactly this reason: serve human content to humans, serve agent-optimized content to agents.
jasomill•May 6, 2026
I had one project where a desktop application deliberately hid the contents of all grid controls from Windows accessibility APIs, took measures to ensure checkbox and radio button selections made through accessibility APIs did not register, and all functions that allowed data to be exported were protected by CAPTCHAs.
Generative AI wasn't a thing at the time, but I had to resort to a combination of OCR, simulated user input, and print capture to drive the application and export data.
Had the developers been aware of the Windows DRM APIs that block screen capture, or the fact that text is easily recoverable from PostScript files with minimal formatting, I don't know what I would have done.
The irony is that the process this replaced involved giving cheap offshore labor full read-only remote access to all data in the system, which was by any measure a far more serious security risk than otherwise authorized employees using tools running locally with no network access provided by established, trustworthy vendors to automate their work.
tdeck•May 6, 2026
So ASP WebForms was the technology we needed all along?
theabhinavdas•May 5, 2026
For now.
deafpolygon•May 5, 2026
This is missing the point that AI training probably cost boatloads more to get here.
ipunchghosts•May 5, 2026
I have a similar finding for a website I made that collates college town bar specials and live music. Using agents with vision models works, but it's not as straightforward as one would initially think. You can check out the results here: https://www.nittanynights.com
creatonez•May 5, 2026
Browser agents / vision agents are a menace and ISPs should outright ban subscribers who run them on the public internet.
bottlepalm•May 5, 2026
There's no way this is true. I would argue in some cases computer use is less expensive. First, for APIs that don't even exist, it's a non-starter. Second, most APIs are not designed for agents and are verbose as hell - returning the entire DTO and tons of unnecessary properties burns tokens. Third, computer use is not as token-hungry as you think it is - a single screenshot may be just 1000 tokens, so it's actually competitive and beats API workflows in many cases.
brikym•May 5, 2026
It would be great if institutions like banks provided proper APIs.
mrcwinn•May 5, 2026
We need a superset of HTML that is designed for agents. I'm not sure it's quite as simple as "just make everything an API."
0xWTF•May 5, 2026
So, to make this concrete, Akasa uses computer vision to read medical records to replace medical coders because there aren't enough medical coders to get all the billing right and medical systems leave like $1T a year on the table.
The EHRs could give companies like Akasa API access so Akasa could then just run NLP, but the EHR vendors don't grant various third parties API access for various reasons. So instead Akasa gets a seat license for each medical system it services, uses computer vision to read the screen (a cadre of Akasa medical coders review errors to stay up to date with unannounced changes from the EHR vendors), and then runs the NLP to figure out which CPT codes to assign, to actually put in a bill and send it to the payer so the hospitals can stay afloat.
So this 45x delta is how much more the medical systems pay Akasa because Epic won't work with Akasa.
This is but one example of why US medical bills are outrageously high.
sarmike31•May 5, 2026
Just wondering: RPA companies like UiPath are dead in the water, right?
bnyhil31-afk•May 6, 2026
I certainly would be curious how their Agentic AI compares. On another note, if RPA has taught me anything, it's 'don't rely on the UI'.
zhxiaoliang•May 5, 2026
I'm always skeptical of the whole "computer use" concept. It's like hiring someone and inviting him to your house and telling him to go ahead, feel free to sleep on the bed, use the toilet, eat whatever is in the fridge, watch the TV, and oh here are the combinations for the safe... and that someone you hire is a monkey.
eddythompson80•May 5, 2026
But think of how comfortable and productive the monkey will feel. It might not be that hard to just build temp housing for it while you have monkey business to do.
andrekandre•May 6, 2026
> build temp housing for it
everyone knows the real trouble starts when the monkey asks for the vote
nijave•May 6, 2026
In fairness, you're hoping the monkey does all the monkey tasks you'd rather not do yourself
titzer•May 6, 2026
I feel like I am taking crazy pills. Are we really having an AI fart around with a mouse and clicking on things to accomplish stuff because we're not capable of making one kind of software query and command another piece of software? It kind of boggles my mind.
hnav•May 6, 2026
The writing was on the wall with the MCP->CLI jump. The promise to investors is that you're replacing people. People don't make API calls.
ex-aws-dude•May 6, 2026
You are because that requires you to expose that API for every single piece of software ever
IMO, this is the argument for doing work in the first place.
game_the0ry•May 5, 2026
My "best practice" is to use as little "visual" (computer use) tooling and as much api + cli tooling as possible specifically to save on tokens.
Tokens are a resource and should be managed as such.
zmmmmm•May 6, 2026
And structured APIs are still about 1e9x more expensive than not invoking an LLM in the first place and just using deterministic code to do the thing ... it's not like any of this is rational based on compute.
hnav•May 6, 2026
It simply doesn't fit in the token/time budget to be useful. I don't think the purveyors of these technologies care about how expensive it is as long as it's "cheap enough"
theptip•May 6, 2026
I’m missing the premise. For internal apps why would you ever reach for Computer Use vs just having your agent whip up a cli or MCP?
_of course_ computer use is worse. It is your last resort. Do not use it on state that lives in a DB that you own.
If anything I am impressed that it’s only 50x worse.
euphetar•May 6, 2026
I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents
Try playing fruit ninja via text and llm toolcalls though
danpalmer•May 6, 2026
Metadata and structure beats AI every time.
j45•May 6, 2026
Sounds like some efficiency gains will still arrive.
_heimdall•May 6, 2026
We gave up on structured APIs 20+ years ago when JSON RPC largely replaced XML REST. You can do REST in many different formats, it mainly just needs to be structured data and self-discoverable.
Had we not made that wrong turn, LLMs and humans would have a much easier time reasoning about APIs they don't directly control.
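To make "self-discoverable" concrete, here is a minimal sketch of the kind of hypermedia-style response being described; the resource, fields, and link relations are invented for illustration, not taken from any real API:

```python
# Hypothetical hypermedia-style ("HATEOAS") response. The payload itself
# advertises the next legal actions, so a client (human tool or LLM agent)
# can discover what to do next without out-of-band documentation.
order = {
    "id": "order-1234",
    "status": "shipped",
    "_links": {
        "self":    {"method": "GET",  "href": "/orders/order-1234"},
        "cancel":  {"method": "POST", "href": "/orders/order-1234/cancel"},
        "invoice": {"method": "GET",  "href": "/orders/order-1234/invoice"},
    },
}

# Discovery is just reading the payload.
for rel, link in order["_links"].items():
    print(f"{rel}: {link['method']} {link['href']}")
```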
ex-aws-dude•May 6, 2026
Yeah but why would we keep that around for 20 years with no good use case
_heimdall•May 6, 2026
Why do you assume there's no good use case?
trpc, grpc, etc. are all attempts to add schemas back into JSON. Swagger, OpenAPI, etc. are attempts to add discoverability back into JSON-based RPC APIs.
MCPs fall in here as well; they attempt to add schemas and discoverability back in where our APIs aren't actually RESTful.
zepolen•May 6, 2026
> You can do REST in many different formats, it mainly just needs to be structured data and self-discoverable.
REST has nothing to do with structured data or discoverability.
chrismarlow9•May 6, 2026
Blackhat SEO spamming knew this 20 years ago
mbgerring•May 6, 2026
Hello from the distant past, when being able to easily consume a website via API was an exciting and fresh idea for humans, before robots could effectively use the computer
Does anyone remember the conference talk in the early days of React that was titled something like “best practices considered harmful,” or something? Or maybe that was a joke someone made about it. Anyway, the Semantic Web people have been right this whole time, and it’s very funny that we can now quantify the cost of building websites upside down and backwards for more than a decade.
hamasho•May 6, 2026
I'm trying to use computer use and browser use (via playwright MCP) in my work.
Computer use is hit-and-miss (mostly miss), but playwright MCP often works very well.
The downside is it takes a lot of time to complete even easy tasks.
For example, to automate processing emails, it needs to
1. go to Gmail
2. log in to Google if necessary (this often requires two-step verification, so it's hard to automate completely, but possible)
3. read the latest mail
4. check the content and choose the action
- if needed, reply to the email
- if it mentions tasks, add them to the todo list
- if it mentions schedules, add them to the calendar
5. repeat for all emails based on specified conditions.
And each step requires dozens of DOM (a11y tree) analyses and actions (fill the username/password inputs, check "keep me logged in", click the submit button, etc).
Based on the model used, each step can take ~100s.
So easy tasks can easily add up to tens of minutes or even hours.
For frequently used tasks, I write skills like /logging-in, /read-latest-emails, using playwright scripts and let the agent choose them
And based on the email content, the agent chooses other tools like /write-reply, /add-todo, /add-event, etc, so that the model can only focus on the core tasks requiring thinking.
It reduces the execution time drastically.
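As a rough illustration, a skill like /read-latest-emails could be little more than a plain playwright script the agent shells out to; the selectors, session file, and function below are made up for the sketch, not taken from the setup described here:

```python
# Hypothetical /read-latest-emails skill: a deterministic playwright script the
# agent can call instead of re-deriving the navigation each run. Selectors and
# the saved storage-state file are invented for illustration.
from playwright.sync_api import sync_playwright

def read_latest_emails(max_emails: int = 5) -> list[dict]:
    """Return sender/subject pairs for the newest messages in the inbox."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Reuse a previously saved login session so the skill can skip 2FA.
        context = browser.new_context(storage_state="gmail-session.json")
        page = context.new_page()
        page.goto("https://mail.google.com/mail/u/0/#inbox")
        page.wait_for_selector("tr.zA")  # hypothetical inbox-row selector
        rows = page.locator("tr.zA")
        emails = []
        for i in range(min(max_emails, rows.count())):
            row = rows.nth(i)
            emails.append({
                "sender": row.locator(".yX").first.inner_text(),
                "subject": row.locator(".bog").first.inner_text(),
            })
        browser.close()
        return emails

if __name__ == "__main__":
    for mail in read_latest_emails():
        print(mail["sender"], "-", mail["subject"])
```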
But it can bury important business logic in the playwright scripts, not the agent's instructions.
For example, simplified steps to add TODO items look like:
1. read the email
2. check if it's about todos, then decide to add them to Asana
3. extract and summarize the title, content, priority, due date, tags, etc.
4. access Asana (log in if necessary)
5. check if there are similar tasks
6. if not, add the tasks
This can take tens of minutes, and each step can have important business logic, like:
- how to decide the priority and due date
- how to choose tags based on the content
- how to decide if two tasks are similar
This information should be read and updated not only by developers, but by managers and other teams.
And if I write those steps in skills with playwright scripts, it improves the speed, but all that business logic is buried in the code, so it's not accessible to non-technical people.
It's also error-prone because web sites often tweak the UI and scripts can stop working.
So it's very convenient if the agent processes these steps once, then decides it's worth writing the playwright script so that the next time, those mundane processes can be executed instantly.
With automatic skill generation, the agent decides by itself if there are workflows worth writing skills with playwright scripts, like /log-in, /extract-information, /check-similar-tasks, /add-tasks.
Like a just-in-time compiler, the skills are a byproduct of the agent's instructions: all business logic is written in the agent's instructions, and the skills don't need to be updated manually or tracked in a version control system.
This can reduce a lot of execution time and API cost, and it can be applied beyond browser automation, to computer use or any other agentic tasks where it's possible to write automation scripts for the steps that don't require thinking.
doctorpcgum•May 6, 2026
Bh
jasomill•May 6, 2026
In what world would a vision agent be the default, when whatever HTTP-based mechanism a site uses to communicate with the server can usually be reverse-engineered and easily emulated with widely available HTTP request libraries, HTML parsers, and JavaScript engines, and at worst you can use something like Puppeteer to navigate and control applications at a significantly higher level than image scraping and simulating user input?
It seems like you'd need a deliberately hostile app before a vision agent would even be considered as an option.
morpheos137•May 6, 2026
Who would have thunk? You know what is a great LLM agent API? bash. Vast corpus, text based, already trained into the model.
Frannky•May 6, 2026
I want to just talk to the Mac and have it do things. I tried computer use and other alternatives, but the latency made it unusable.
I want to be able to control both Mac, apps and the browser. I also need it to figure out things by itself given a goal.
Claude Code with the --chrome flag is kind of good, but it's too slow. I wanted to try faster APIs, like the one hosted on Cerebras, but it's too expensive.
Any solution I might be missing?
jasomill•May 6, 2026
Do you want to do something that can't be done through AppleScript, macOS accessibility APIs, and something like Puppeteer to control the browser?
Or something you don't understand how to do manually?
Because I guess I don't understand the attraction of using an LLM for system automation where existing interfaces exist, other than as a form of documentation, or to write code using these interfaces.
m3kw9•May 6, 2026
I did a simple computer-use task to search for something, and it used up 50% of my 5-hour plan limit from Codex.
oleg2025•May 6, 2026
Couple of months ago I was inspired by kubectl, and built a desktopctl CLI to control GUI apps. It uses a combination of OCR and the Accessibility API on Mac, represents the UI as markdown, and exposes actions for mouse and keyboard.
My core idea was that the "fast" perception loop is fully local, GPU-optimised for UI tokenisation and change detection. The "slow" control loop requires an LLM roundtrip, and uses a token-efficient markdown interface in the CLI output.
It uses relatively stable identifiers for controls, so agents can script common actions, e.g. `desktopctl pointer click --id btn_save` doesn't require the UI tokenisation loop.
https://github.com/yaroshevych/desktopctl/tree/main
I've learned that compared to APIs, human interfaces are slow and messy, but there is actually a lot of science behind them. The good apps expose information well, and are optimised for clicks, typing, etc.
The best GUIs make great use of muscle memory, which makes them perfect candidates for scripting via CLI. eg a simple sequence "open Notes app, hit Cmd+F, enter search term, read list of results" can be one Bash command invoked by AI agent.
RadiozRadioz•May 6, 2026
> The alternative, writing an MCP or REST surface per app, is its own engineering project
Well, if your backend was sufficiently decoupled from your frontend, and the server-side operations were designed thoughtfully and generically, it need not be an engineering project.
1. https://github.com/SawyerHood/dev-browser
Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?
Yes, in an ideal world that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden everyone will just decide to make the data that is their bread and butter easier to explore via an LLM.
There are no shortcuts in life and it's just expensive text autocomplete.
"Let's spin up $750k in GPUs full throttle to scrape a web page with my $200.00 CC subscription."
Everyone is delusional.
Watch how fast Meta adds this if a new hot shot social media app succeeds by designing for AI agents controlled by users.
This is the exact opposite of what will happen (and in fact what has happened). Reddit is suing Perplexity right now for scraping.
Meta will not serve content to some other app for free - for what benefit? They will not see advertising data.
Advertising isn't the only possible business model.
And profit isn't the only possible motive to provide a service.
A closer analogue would be AppleScript, or rather, the underlying Apple Event and Open Scripting Architecture functionality supplied by the OS to support AppleScript, that allowed applications to expose these interfaces along with metadata documenting them, and for external tools to record manually performed tasks across applications as programs expressed in terms of these interfaces to make them easier to use (this last bit, while not strictly required, is convenient, and especially useful for less technical users).
If you're familiar with VBA in Microsoft Office applications, sort of like that, except with support provided by OS APIs that could be used by any application that chose to implement scripting support, official guidance from Apple suggesting that all well-designed applications should be scriptable and recordable, and application design patterns and frameworks designed with scriptability and recordability in mind.
Note that I use the past tense here, despite AppleScript still being available in macOS, because it is not well-supported by modern applications.
https://dl.acm.org/doi/epdf/10.1145/1238844.1238845
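For a concrete taste of that model, the OSA plumbing is still reachable from ordinary scripts today; a minimal sketch (the Safari snippet is purely illustrative, and many modern apps expose far less than this):

```python
# Minimal sketch: driving a scriptable macOS app via AppleScript / Apple Events
# by shelling out to osascript. Safari is used only as an example of an app
# that still exposes a scripting dictionary.
import subprocess

script = '''
tell application "Safari"
    if (count of windows) is 0 then make new document
    set URL of front document to "https://example.com"
    return name of front document
end tell
'''

result = subprocess.run(
    ["osascript", "-e", script],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```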
The modern javascript ecosystem is a perfect example of what happens when everyone tries to rebuild from scratch and it's a nightmare.
That's only another step in the path I experienced since the 80s, when I had to type every single character because there was no auto complete, no command line history, very few libraries. I was very good at writing trees, hash tables, linked lists and so was everybody else. Nobody would hire me if I were that slow at writing code today.
I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.
Open source research/project I have been exploring on the topic: https://aevum.build/learn/architecture/
i'm really not sure companies will allow their apps to be automated so easily, and the reason is api abuse (think of a saas where you can upload file attachments for example); you'd either end up banned or throttled pretty fast, and in the end the company will be like "cost > opportunity" and just close it off (and its like this already, llms just make this worse)
I could imagine an AI future where agentic shopping companies who promise me the best deal are pitted against Walmart and Amazon, trying to algorithmically squeeze me for $2 more- just two bots playing a cat and mouse game to save me a few bucks.
For some reason a lot of tech ends up in these antagonistic monopolies- Apple wants to sell privacy aware devices as a product feature, Google wants give you mail and maps, but sell your data. Despite any appearances neither give a shit about you, even if you benefit from the dynamic.
It’s why we did this benchmark :) - reflex team member
Humans would be the second-class users of said OS, which can generate UIs on demand as needed.
I've thought about this quite a bit. Started implementing as a side project, but I have too many side projects at the moment...
https://developer.android.com/ai/appfunctions
So, like a Unix system?
Perfect.
Isn't that what Apple is doing with its Foundation Models framework? [0] Developers can integrate Apple's on-device LLM, which includes things like tool calling. I don't write Apple-specific apps, so I'm not sure what can actually be done with it, but it looks promising and suggests where Apple already thinks things are headed.
> I think OpenAI designing their own phone is the next logical step
ChatGPT is already integrated into Apple Intelligence for those that want to use that instead of Apple's model -- I don't see OpenAI trying to change lanes into phone making when they can focus on doing what they know while collecting a large check from Apple.
[0] https://developer.apple.com/documentation/foundationmodels
The benchmark is a more generally interesting part of the launch materials, so I figured it had its own separate home here.
I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency.
or even one based on PDF like OSX: https://en.wikipedia.org/wiki/Quartz_2D
Not possible on wayland, maybe on X11 protocol?
idk.. not really thought out too much, but has to be better
Using CLI tools is much faster and token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.
I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc...
I developed all this on $20 codex sub.
It breaks like 80% of the time for me, and it's incredibly slow. Having it use Playwright (bonus: can test in FF/Saf too) was a big improvement.
Well I am competing with geoip provider like maxmind.
I developed a custom traceroute and ping service to geolocate IPs with very high accuracy, beating products like Digital Element, MaxMind, and IPinfo.
These companies have huge teams. But my 3-person company has already beaten them.
Code doesn't matter much, it's not an opensource project.
My free app is http://macrocodex.app which I've developed along with a fitness coach.
I am currently beating companies with 20-30 developers and closing more deals while having 1/10th of the staff.
I am simply very excited about all this.
Nobody cares how you solve the problem, or if your code is ugly. As long as it's reliable and without downtime, and you aren't breaking things and causing your customers headaches, you are winning.
Even before AI, bad code existed. Not every company had a 10x developer writing beautiful, idiomatic Rust code.
AI is just a tool; people who are trying to generate a whole codebase with it are doing something very wrong. You can write code faster with AI provided you understand its strengths and weaknesses.
Heh, you're in for a rude awakening, sometime in the future :) But I won't spoil the surprise, you clearly have made up your mind about what to focus on.
> My free app is http://macrocodex.app which I've developed along with a fitness coach.
Crazy, this app you've run for ~1-2 months has 10K active users already, even though there is zero info about who runs it, zero reviews, and says "Download on the App Store" on the landing page even though you then ask people to use the web app, impressive.
I don't think anyone said using AI can't produce a ton of code really quickly, and no one is finding that difficult to manage either. But most of us software engineers are trying to build long-lasting codebases with AI too, then "less === better" typically, so it's not about being able to spit out features as fast as possible, but avoid the evergrowing codebase from collapsing on top of itself, and each prompt not getting slower and slower, but as fast as on a greenfield project.
Sounds like you've found the holy grail of being able to avoid that, kudos if so. Judging by you giving zero care to how the design and architecture actually is, I kind of find that hard to believe. But, if it works for you, it works for you, not up to me or others to dictate how you build stuff, hope you enjoy it, however you build stuff :)
Already running for a decade+ in production, recently talked about my stack here: https://news.ycombinator.com/threads?id=faangguyindia&next=4...
> Even though there is zero info about who runs it.
People in the community already know who runs it; most others don't care. You won't get 10K users without people getting results. It's a free app, so not like I am spending bucks to advertise it on social networks.
The app is completely free, doesn't upload data to any server (other than Sentry crash reporting), and doesn't ask for any email or phone number. When people get results, they share them with their friends. That's how it's growing.
>Says "Download on the App Store" on the landing page even though you then ask people to use the web app.
On iOS, we’ve a PWA app. I am well aware of it.
Perhaps because those numbers are provided on the Playstore dashboard? You should question Google's acumen in providing those statistics to developers?
And people have been estimating ARR through projections for a long time.
I already have services running for a decade+ in a product which I posted here: https://news.ycombinator.com/threads?id=faangguyindia&next=4...
In the end, simplicity wins.
Electron uses 10x more RAM than regular apps. But it's so convenient.
Python is 100x slower than C. It's in the top 3 of languages now.
Worse but more convenient always wins.
In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)
At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and to supply all and only the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier, and that's why small models can do it, which is another dimension that must be considered.
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be
The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:
- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.
- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.
- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.
Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.
1. things you wouldn’t otherwise bother doing
2. things where it otherwise would get stuck iterating on hacky workarounds doomed to fail
“Reverse engineer this app/site so we can do $common_task in one click”, “by the way, I’m logged in to $developer_portal, so try @Browser Use if you’re stuck”, etc.
I just had Codex pull user flows out of a site I’m working on and organize them on a single page. It found 116. I went in and annotated where I wanted changes, and now it’s crunching away fixing them all. Then it’ll give me an updated contact sheet and I can do a second pass.
I’d never do this sort of quality pass manually and instead would’ve just fixed issues as they came up, but this just runs in the background and requires 15 minutes of my time for a lot of polish.
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`
Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
i so far haven't found any application that doesn't.
all you're able to get out, as far as i can tell, is the length of the entered password.
https://devblogs.microsoft.com/cppblog/spy-internals/
Obviously, if you can inject code into a process that receives sensitive data, you're already running in a context where all security bets are off.
But with processes you yourself create, you probably can, even without elevated privileges, unless the application takes measures to prevent injection (akin to game anticheat mechanisms), so it seems worth pointing out that there are simple mechanisms to subvert such "protected" fields that don't require application-specific reverse engineering.
i tend to think of invoke as "an API over macOS apps" tho...
doesn't `invoke finder shareAndCopyLink` read very nicely? :P
and now the fact that interfaces need to be accessible to agents, not just humans, ironically increases it for humans in return
One thing I am curious about is a hybrid approach where LLMs work in conjunction with vision models (and probes which can query/manipulate the DOM) to generate Playwright code which wraps browser access to the site in a local, programmable API. Then you'd have agents use that API to access the site rather than going through the vision agents for everything.
Harder to scale if it's doing a lot of them, I suppose.
https://playwright.dev/docs/getting-started-mcp#accessibilit...
I've mentioned several times, and gotten snarky remarks for it, that rewriting your code so it fits in your head, and in the LLM's context, helps the LLM code better. People complain about rewriting code just for an LLM, not realizing that the suggestion is to follow better coding principles to let the LLM code better, which has the net benefit of letting humans code better! Well, looks like if you support accessibility in your web apps correctly, Playwright MCP will work correctly for you.
Amazing.
Most wikis you can mirror locally if you really need to hammer them.
in the context of this blog post, the conclusion looks similar though!
"use the whole web like it's an API"
works much better than
"figure out similar or identical tasks from a clean slate every single time you do them"
invoke rather has overlap with Claude's and Codex' computer-use, except the steps are stored/scripted.
webmcp is bottom-up. computer-use & invoke are top-down
I think this is very fertile ground - big labs need to use approaches that can work on multiple platforms and arbitrary workflows, and full-page vision is the lowest common denominator. Platform-specific approaches are a really exciting open space!
https://accessibilityinsights.io/
https://learn.microsoft.com/en-us/windows/win32/winauto/insp...
https://github.com/FlaUI/FlaUInspect
and for WPF applications specifically,
https://github.com/snoopwpf/snoopwpf
Now the argument against this on [reddit](https://www.reddit.com/r/openclaw/comments/1s1dzxq/comment/o...)
"my experience is the opposite actually. UIA looks uniform on paper but WPF, WinForms, and Win32 all expose different control patterns and you end up writing per-toolkit handlers anyway. Qt only exposes anything if QAccessible was compiled in and the accessibility plugin is loaded at runtime, which on shipped binaries is basically never. Electron is just as opaque on Windows as on macOS because it's the same chromium underneath drawing into a canvas. the real split isn't OS vs OS, it's native toolkit vs everything else."
The only reason you wouldn’t choose an API is if it wasn’t viable.
If I think an LLM is good for something I create well defined, very deterministic "middleware" for that purpose on top of Openrouter.
Agent use can be used to improve quality and maintainability
Anthropic even says that an agent-based solution should only be your last resort and that most problems are well served with a one-shot.
https://www.anthropic.com/engineering/building-effective-age...
I'm much more agreeable with that type of LLM workflow. Running "agents" with a monolithic "harness" for long-time-horizon tasks seems wasteful and unnecessary, but probably super appealing to lazy people.
When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
I don't think any new app should ever be specifically designed for AI to interact with them through computer use
The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1] - I would just run that and get all the API calls in a nice format and then replay them over and over to do things in succession.
In the new world, we have access to OpenAPI.json and whatnot, but in the world where things were built pre-OpenAPI, pre-specs, and pre-best-practices... I am not so sure! (and a lot of the world still lives there)
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
[1] https://stoplight.io/open-source/prism
If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and then another agent is given that description, would that second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?
With an example UI I made up, the description (API-like interface definition) could be something like:
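Say, something along these lines, where every screen, element, and action name is invented purely to show the shape:

```python
# Invented example of an API-like description the exploring agent could hand
# off. Each stub names a UI affordance, and its docstring records how to reach
# it, so the task-running agent doesn't have to re-discover the navigation.
def open_invoices_tab() -> None:
    """Sidebar > "Billing" > "Invoices" tab (scroll the sidebar if it's hidden)."""
    ...

def filter_by_customer(name: str) -> None:
    """Type into the "Customer" search box above the invoice table."""
    ...

def export_selected(fmt: str = "csv") -> str:
    """Click "Export", pick a format in the modal, and return the download path."""
    ...
```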
So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment.
Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
For better and worse, 5-10Mi isn't uncommon for a web app.
Instead of trying to go "bottom up" and, effectively, do what a browser engine is doing in reverse, it seems easier to go "top down" like a human does and go off the visual representation.
No, most vision models focus on a subset of an image at a time when doing image -> text
image -> image uses whole image.
Is this true? Where can I read more about it?
Me: hmm, this title confuses and infuriates Rob.
[Clicks link]
Me: Sees same title, repeat feelings of confusion and infuriation
[Scrolls article down on my smartphone]
Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.
[Closes tab]
[Continues living rest of my life]
I hope this feedback is well received and understood.
I embedded a Google Calendar widget on my "Book a demo" page; I don't know the API, and Google doesn't expose/maintain one either.
What we are doing at Retriever AI is instead to reverse-engineer the website APIs on the fly and call them directly from within the webpage so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
I don't think many realize how good the cheap, alternative models are becoming. I prefer SOTA models for key work, but I can also spend 10X as many tokens on an open model hosted by a non-VC-subsidized provider (who is selling at a profit) for tasks that can tolerate slightly less quality.
The situation is only getting better as models improve and data centers get built out.
Bedrock isn't the cheapest either although I'm fairly sure they aren't being VC subsidized
There are definitely cheap tokens out there. The big gotcha is "for tasks that can tolerate slightly less quality"
I think everyone making claims that inference is getting more expensive are unaware that there are more LLM providers than Google, Anthropic, and OpenAI.
Try the exorbitant expenses and ballooning waste of generated electricity and usable water.
Face-scanning? Iris patterns?
https://www.google.com/search?q=identify+anonymous+visa+mast...
To me the browser is a translation layer. Working on the browser directly, while hard, enables big advantages on compatibility. The only thing I miss as of now, which is on the todo list, is OCR of the images in the browser into text output. But an API would need to do that anyway to work.
The main loss, in my view, of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing that work done. Humans work in the UI, that's it. Computer use, to me, is the promise of being able to replicate end to end the actions a human does. An API can do that in theory, but the data to do that is also near impossible to collect properly.
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach.
A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
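Playwright's own waiting helpers are usually the cleaner fix: wait on a condition rather than sleeping for a fixed interval. A small sketch, with the URL and selector invented:

```python
# Wait for a condition instead of a fixed sleep: the expect() assertion retries
# until the element is visible or the timeout expires. URL/selector are invented.
from playwright.sync_api import sync_playwright, expect

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://app.example.com/dashboard")

    # Brittle: page.wait_for_timeout(5000)  (fixed sleep, still races)
    # Better: block until a row actually exists, up to a 10s cap.
    expect(page.locator("table#orders tbody tr").first).to_be_visible(timeout=10_000)

    print(page.locator("table#orders tbody tr").count(), "orders loaded")
    browser.close()
```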
Thinking of Frigate NVR that does motion > object detection > scene description
Where you build up to progressively slower and more expensive algorithms i.e. there's motion > it's a person > here's what the person is doing
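The same cheapest-first idea carries over to agent pipelines: run the inexpensive deterministic check and only escalate to the costly model when the earlier stage finds something. A toy sketch, where every stage is a placeholder:

```python
# Toy escalation pipeline in the spirit of motion -> object -> description.
# Stages are ordered by cost; a cheap stage returning None stops the run early
# so the expensive stages never execute.
from typing import Callable, Optional

Stage = Callable[[bytes], Optional[str]]

def run_pipeline(frame: bytes, stages: list[Stage]) -> Optional[str]:
    result: Optional[str] = None
    for stage in stages:
        result = stage(frame)
        if result is None:
            return None
    return result

def detect_motion(frame: bytes) -> Optional[str]:
    return "motion"              # stand-in for a cheap pixel-diff check

def detect_object(frame: bytes) -> Optional[str]:
    return "person"              # stand-in for an object detector

def describe_scene(frame: bytes) -> Optional[str]:
    return "person at the door"  # stand-in for an expensive vision-language model

print(run_pipeline(b"frame-bytes", [detect_motion, detect_object, describe_scene]))
```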
Hang on, that sounds like common corporate SaaS apps.
Almost sounds like an O'Reilly book
Matthew B. Doar (2011). Practical JIRA Plugins. O’Reilly.
https://www.oreilly.com/library/view/practical-jira-plugins/...
In case anyone was wondering. Which they probably weren’t :p