That's why LLMs will eventually be used only for the initial interaction with the user in their own language, preparing the data for a specialized model.
Imagine face recognition working like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".
FeepingCreature•May 11, 2026
That's actually how vision language models already work, pretty much.
stingraycharles•May 11, 2026
Huh? The images are tokenized in the same way language is, and it’s all fed into one single model, not multiple smaller expert models.
The image gets split into small patches (e.g. 4x4 pixels), and each patch is assigned a token, similar to how text is broken up into tokens. And the whole thing is fed into a single model.
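A minimal numpy sketch of that patch-tokenization step (the 4x4 patch size follows the comment above; real VLMs typically use larger patches and a learned embedding rather than raw pixel vectors):

    import numpy as np

    def patchify(image, patch=4):
        # Split an (H, W, 3) RGB frame into flattened patch vectors.
        # Each row is one patch -- the "token" a vision encoder would
        # embed and feed to the model alongside the text tokens.
        h, w, c = image.shape
        assert h % patch == 0 and w % patch == 0, "pad the image first"
        return (image.reshape(h // patch, patch, w // patch, patch, c)
                     .transpose(0, 2, 1, 3, 4)
                     .reshape(-1, patch * patch * c))

    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # one camera frame
    print(patchify(frame).shape)                     # (19200, 48)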
FeepingCreature•May 11, 2026
Yes, I'm saying
> Imagine face recognition working like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".
that's pretty much how it works.
stingraycharles•May 11, 2026
But that isn’t the specialized model the grandparent was describing; it’s a single, multi-modal model.
Dylan16807•May 11, 2026
Yes, the "imagine" was showcasing the opposite of a specialized model to call it a bad idea.
wongarsu•May 11, 2026
And there's a reason nobody uses them for face recognition.
Vision language models are an incredible achievement in generality and usability, but they pay a hefty price in fidelity and speed.
stingraycharles•May 11, 2026
Do you know that MoE is a thing?
jampekka•May 11, 2026
The experts in MoEs aren't specialized in any meaningful task sense. At the level of what we would think of as tasks, MoE experts are selected essentially arbitrarily, per token and per block.
stingraycharles•May 11, 2026
It’s unsupervised, yes, but “not specialized in any meaningful task sense” is incorrect; that’s the whole point. It’s just not specialization in the sense of “this is a legal expert, this is a software developer”.
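For concreteness, a toy sketch of the routing both comments are describing: a top-k gate picks experts per token at each MoE layer, and there is no notion of "task" anywhere in the mechanism (numpy, toy sizes, everything here is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def moe_layer(x, w_gate, experts, k=2):
        # x: (tokens, d); w_gate: (d, n_experts); experts: list of (d, d).
        # Routing is decided per token and per layer from the gate logits.
        logits = x @ w_gate                        # (tokens, n_experts)
        top = np.argsort(logits, axis=-1)[:, -k:]  # top-k expert ids
        sel = np.take_along_axis(logits, top, axis=-1)
        gate = np.exp(sel) / np.exp(sel).sum(-1, keepdims=True)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):                # dispatch per token
            for j, e in enumerate(top[t]):
                out[t] += gate[t, j] * (x[t] @ experts[e])
        return out

    d, n = 16, 8
    x = rng.standard_normal((5, d))
    experts = [rng.standard_normal((d, d)) for _ in range(n)]
    print(moe_layer(x, rng.standard_normal((d, n)), experts).shape)  # (5, 16)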
brcmthrowaway•May 11, 2026
Next up: Claude replacement to handle simdjson processing.
westurner•May 11, 2026
Wouldn't this be faster with an agent skill that has code?
/skill-creator [or /create-skill] Write an agent skill with code script(s) that use an existing user space IP library that works with your agent runtime, to [...]
ComposioHQ/awesome-claude-skills: https://github.com/ComposioHQ/awesome-claude-skills
anthropics/skills/skill-creator/SKILL.md: https://github.com/anthropics/skills/blob/main/skills/skill-...
/.agents/skills/skill-name/SKILL.md, scripts/{script_name.py,__init__.py}
https://agentskills.io/what-are-skills
Even faster would be to just use code in the first place!
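As a sketch of what such a skill script might contain, here is a user-space ICMP echo responder using scapy (one existing user-space packet library; the script is hypothetical, needs root, and the kernel will normally answer pings too unless you suppress that):

    # Hypothetical skill script: answer pings in user space with scapy.
    from scapy.all import ICMP, IP, Raw, send, sniff

    def pong(pkt):
        req = pkt[ICMP]
        reply = (IP(src=pkt[IP].dst, dst=pkt[IP].src)
                 / ICMP(type=0, id=req.id, seq=req.seq))  # type 0 = echo reply
        if Raw in pkt:                  # echo the original payload back
            reply /= Raw(load=pkt[Raw].load)
        send(reply, verbose=False)

    # BPF filter limits the callback to echo requests (ICMP type 8).
    sniff(filter="icmp[icmptype] == icmp-echo", prn=pong)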
jeremyjh•May 11, 2026
Perhaps one day, all network services will be provided by LLMs natively. Truly, that would be a day in the future.
codezero•May 11, 2026
I mean, we did decades of JavaScript, so... I mean... anything is possible, right? :)
vrighter•May 11, 2026
Why? We already have more efficient specialized hardware.
pastage•May 11, 2026
You could read about that in Vernor Vinge's 1992 novel "A Fire Upon the Deep". There is prompt injection in communication: in the book, certain communication protocols cannot be deterministic, so if someone is too smart, you get hacked.
lionkor•May 11, 2026
"Perhaps" doing enough lifting to participate in a bodybuilder contest, in that sentence
fouc•May 11, 2026
Think about how much faster it would've been with a small local model!
twoodfin•May 11, 2026
Modulo Anthropic messing with the model for load mitigation, I wonder how stable this result is.
1,000 pings, how many correctly ponged?
ShinyLeftPad•May 11, 2026
How quickly does Claude respond when it acts like a user-space LLM chatbot?
bot403•May 11, 2026
Now do the equivalent of just-in-time compilation: Claude sees that we need to respond to a lot of pings and writes a program to compute the replies instead of thinking about each one.
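That program is tiny: an echo reply is the request with the type byte flipped from 8 to 0 and the checksum redone. A sketch of the per-packet arithmetic the model was doing "by hand" (pure Python, operating on raw ICMP bytes):

    def inet_checksum(data: bytes) -> int:
        # RFC 1071: one's-complement sum of 16-bit big-endian words.
        if len(data) % 2:
            data += b"\x00"
        s = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
        while s >> 16:
            s = (s & 0xFFFF) + (s >> 16)
        return ~s & 0xFFFF

    def pong_from_ping(icmp: bytes) -> bytes:
        # Echo reply = echo request with type 8 -> 0, checksum recomputed.
        reply = bytearray(icmp)
        reply[0] = 0                      # ICMP type: echo reply
        reply[2:4] = b"\x00\x00"          # zero the checksum field first
        reply[2:4] = inet_checksum(bytes(reply)).to_bytes(2, "big")
        return bytes(reply)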
mintflow•May 11, 2026
This is cool. Token usage aside, perhaps it could help analyze TCP throughput if you redirect Wireshark/tcpdump output to it.
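A rough sketch of that idea with scapy's pcap reader (hypothetical file name; rdpcap loads the whole capture into memory, so this is only for small dumps):

    from collections import Counter
    from scapy.all import TCP, rdpcap

    bytes_per_sec = Counter()
    for p in rdpcap("capture.pcap"):     # e.g. written by tcpdump -w
        if TCP in p:
            bytes_per_sec[int(p.time)] += len(p)

    for t in sorted(bytes_per_sec):
        print(t, f"{bytes_per_sec[t] * 8 / 1e6:.2f} Mbit/s")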
fl7305•May 11, 2026
Opus 4.6 is already very good at troubleshooting all kinds of network problems if it has access to the command-line tshark tool and the pcap files.
ForHackernews•May 11, 2026
>Fun? Oh yeah!
I think this author and I have different definitions of fun.
fl7305•May 11, 2026
Do some people still claim "LLMs are just dumb autocompleters"?
Because this seems to disprove that claim pretty convincingly?