A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
cogman10•Mar 23, 2026
This isn't a hardware feat, this is a software triumph.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
pdpi•Mar 23, 2026
It's both.
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
bigyabai•Mar 23, 2026
> We haven't had phones running laptop-grade CPUs/GPUs for that long
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
pdpi•Mar 23, 2026
Kind of.
We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).
bigyabai•Mar 23, 2026
The GPUs are perfectly solid. Cheap Android handsets have shipped with Vulkan compliance for almost a decade now; the GPUs are feature-equivalent to consoles and PCs. The same goes for Apple handsets, which run byte-identical Metal compute shaders to the Mac. For desktop-style workloads they are perfectly adequate. The hardware lacks nothing required for inference or gaming that dGPUs ordinarily support.
And even if you raise the requirements, we still have to contend with cheap CUDA-capable GPUs like the one in the ($300!!!) Nintendo Switch, or the Jetson SOCs. The mobile market has had tons of high-speed/low-power options for a very long time now.
mnkyprskbd•Mar 23, 2026
We had LLMs for about 5 minutes or so. Hardly a measure of time for an industry that goes back half a century and then some.
smallerize•Mar 23, 2026
The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).
SV_BubbleTime•Mar 23, 2026
>triumph
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
breggles•Mar 23, 2026
It's hard to overstate my satisfaction!
GorbachevyChase•Mar 23, 2026
There’s no use crying over every mistake. You just keep on trying until you run out of cake.
anemll•Mar 23, 2026
both, tbh
mannyv•Mar 23, 2026
The software has real software engineers working on it instead of researchers.
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
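The mmap debate at least is easy to make concrete. A minimal sketch (synthetic weight file, made-up tile layout, not tied to any real runtime): mapping the file read-only means the OS pages in only the tiles you actually touch, which is exactly the property that makes streaming larger-than-RAM weights workable.

```python
import mmap
import os
import struct
import tempfile

# Build a synthetic "weight file" of 4 tiles, 1024 floats each, where the
# first float of each tile is a marker identifying it.
TILE_FLOATS = 1024
TILE_BYTES = TILE_FLOATS * 4
NUM_TILES = 4

with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_TILES):
        f.write(struct.pack("<f", float(i)))     # tile marker
        f.write(b"\x00" * (TILE_BYTES - 4))      # rest of the tile
    path = f.name

# Map read-only and read just tile 2; the OS faults in only those pages,
# and its page cache decides what stays resident afterwards.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (marker,) = struct.unpack_from("<f", mm, 2 * TILE_BYTES)
    mm.close()

os.unlink(path)
print(marker)  # 2.0
```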
snovv_crash•Mar 23, 2026
The real improvement will be when the software engineers get into the training loop. Then we can have MoE that use cache-friendly expert utilisation and maybe even learned prefetching for what the next experts will be.
zozbot234•Mar 23, 2026
> maybe even learned prefetching for what the next experts will be
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
snovv_crash•Mar 23, 2026
Manually no. It would have to be learned, and making the expert selection predictable would need to be a training metric to minimize.
zozbot234•Mar 23, 2026
Making the expert selection more predictable also means making it less effective. There's no real free lunch.
It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
zozbot234•Mar 23, 2026
If the bottleneck is storage bandwidth that's not "slow". It's only slow if you insist on interactive speeds, but the point of this is that you can run cheap inference in bulk on very low-end hardware.
> If the bottleneck is storage bandwidth that's not "slow"
It is objectively slow at around 100X slower than what most people consider usable.
The quality is also degraded severely to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
zozbot234•Mar 23, 2026
You're simply pointing out that most people who use AI today expect interactive speeds. You're right that the point here is not raw power efficiency (having to read from storage will impact energy per operation, and datacenter-scale AI hardware beats edge hardware anyway by that metric) but the ability to repurpose cheaper, lesser-scale hardware is also compelling.
ottah•Mar 23, 2026
I mean, by any reasonable standard it still is. Almost any computer can run an LLM; it's just a matter of how fast, and 0.4 tok/s (peak, before the first token) is not really considered running. It's a demo, but practically speaking entirely useless.
alephnerd•Mar 23, 2026
Devils advocate - this actually shows how promising TinyML and EdgeML capabilities are. SoCs comparable to the A19 Pro are highly likely to be commodified in the next 3-5 years in the same manner that SoCs comparable to the A13 already are.
iberator•Mar 23, 2026
Does the iPhone have some kind of hardware acceleration for neural networks/AI?
NetMageSCW•Mar 23, 2026
Yes, a Neural Engine and on the latest A19 tensor processing on the GPU cores (neural accelerator).
t00•Mar 23, 2026
/FIFY A year ago this would have been considered impossible. The software is moving faster than anyone's hardware assumptions.
simopa•Mar 23, 2026
It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial.
volemo•Mar 23, 2026
> moving forward, as the information density and architectural efficiency of smaller models continue to increase
If they continue to increase.
vessenes•Mar 23, 2026
They will. Either new architectures will come out that give us greater efficiency, or we will hit a point where the main thing we can do is shove more training time onto these weights to get more per byte. Similar thing is already happening organically when it comes to efficient token use; see for instance https://github.com/qlabs-eng/slowrun.
simopa•Mar 23, 2026
Thanks for the link.
simopa•Mar 23, 2026
The "if" is fair. But when scaling hits diminishing returns, the field is forced to look at architectures with better capacity-per-parameter tradeoffs. It's happened before, maybe it'll happen again now.
anemll•Mar 23, 2026
Probably 2x speed for Mac Studio this year if they do double NAND ( or quad?)
firstbabylonian•Mar 23, 2026
> SSD streaming to GPU
Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
That was a very good summary. One detail the post could use is mentioning that the 4 to 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
anemll•Mar 23, 2026
Thanks for posting this, that's how I first found out about Dan's experiment!
SSD speed doubled in the M5P/M generation, that makes it usable!
I think one paper under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391 which hopefully can help with prefill for Flash streaming.
Yukonv•Mar 23, 2026
That’s exactly what I thought about. Getting my hands on an M5 Max this week and going to see how Dan’s experiment performs with faster I/O. Also going to experiment with running active parameters at Q6 or Q8; since output is I/O bottlenecked, there should be room for higher-accuracy compute.
To be fair, it's "possible" to run such a setup with llama.cpp with SSD offload. It's just abysmal TG speeds. But it's possible.
trebligdivad•Mar 23, 2026
I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?
zozbot234•Mar 23, 2026
A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable to a mobile device.)
simonw•Mar 23, 2026
Yeah, this new post is a continuation of that work.
Aurornis•Mar 23, 2026
> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
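For readers unfamiliar with the mechanics, here's a toy sketch of top-k gating (pure Python, made-up dimensions; real routers are learned and applied per layer). The point is just that each token evaluates only the k highest-scoring experts, a small fraction of the total:

```python
import random

random.seed(0)

NUM_EXPERTS = 512   # experts per layer, as in the model discussed here
TOP_K = 4           # experts actually evaluated per token
D = 64              # toy hidden size

x = [random.gauss(0, 1) for _ in range(D)]                    # one token's state
gate = [[random.gauss(0, 1) for _ in range(D)] for _ in range(NUM_EXPERTS)]

# Score every expert for this token, then keep only the top-k.
scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate]
chosen = sorted(range(NUM_EXPERTS), key=scores.__getitem__)[-TOP_K:]

# Only these experts' weights need to be resident for this token.
fraction_touched = TOP_K / NUM_EXPERTS
print(sorted(chosen), round(fraction_touched, 4))  # 4/512 of the expert weights
```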
zozbot234•Mar 23, 2026
Yes but most people are still running MoE models with all experts loaded in RAM! This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not caching every single expert-layer in RAM at all times.
jnovek•Mar 23, 2026
I’m so confused in these comments right now — I thought you had to load an entire MoE model and sparseness just made it so you can traverse the model more quickly.
Aurornis•Mar 23, 2026
That's not what this test shows. It's just loading the parts of the model that are used in an on-demand fashion from flash.
The iPhone 17 Pro only has 12GB of RAM. This is a 397B model with 17B active parameters. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.
If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though only a small number. Their output is not good. You really need all of the experts to get the model's quality.
zozbot234•Mar 23, 2026
The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are only used rarely. So you can save some RAM footprint even while swapping quite rarely.
Aurornis•Mar 23, 2026
I understand, but this isn't just a matter of not caching some experts. This is a 397B model on a device with 12GB of RAM. It's basically swapping experts out all the time, even if the distribution isn't uniform.
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
zozbot234•Mar 23, 2026
"Individual experts" is a bit of a red-herring, what matters is expert-layers (this is the granularity of routing decisions), and these are small as mentioned by the original writeup. The filesystem cache does a tolerable job of keeping the "often used" ones around while evicting those that aren't needed (this is what their "Trust the OS" point is about). Of course they're also reducing the amount of active experts and quantizing a lot, AIUI this iPhone experiment uses Q1 and the MacBook was Q2.
QuantumNomad_•Mar 23, 2026
If I only use an LLM to ask questions about programming in one specific programming language, can I distill away other experts and get all the answers I need from a single expert? Or is it still different experts that end up handling the question depending on what else is in the question? For example, if I say “plan a static web server in Rust” it might use expert A for that, but if I say “implement a guessing game in Rust” it might use expert B, and so on?
Snoozus•Mar 24, 2026
Unfortunately no, experts are typically switched out for every token.
The way I understand it the idea was something like having each expert be good at one kind of task, but that's not how it panned out after training.
anemll•Mar 24, 2026
17B includes 10 experts plus one shared, so the actual size of each expert is much smaller.
MillionOClock•Mar 23, 2026
I hope some company trains their models so that expert switches are less often necessary just for these use cases.
zozbot234•Mar 23, 2026
A model "where expert switches are less necessary" is hard to tell apart from a model that just has fewer total experts. I'm not sure whether that will be a good approach. "How often to switch" also depends on how much excess RAM has been available in the system to keep layers opportunistically cached from the previous token(s). There's no one-size fits all decision.
foobiekr•Mar 23, 2026
This is not entirely dissimilar to what Cerebras does with their weight streaming.
manmal•Mar 23, 2026
And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?
WatchDog•Mar 24, 2026
Yeah, also "RTX IO", and Microsoft "DirectStorage".
What was more interesting about the unreal engine demo, was that they can stream not only textures, but geometry too.
Virtual texturing had been around a long time, but virtual geometry with nanite is really interesting.
cj00•Mar 23, 2026
It’s 400B but it’s mixture of experts so how many are active at any time?
I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tools calls and just not run them, lol).
If you decided to give it a go make sure to use the MLX over the GGUF version! You’ll get a bit more speed out of it.
Aurornis•Mar 23, 2026
Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
freedomben•Mar 23, 2026
I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.
zozbot234•Mar 23, 2026
Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
kgeist•Mar 23, 2026
>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
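A crude sketch of the intuition (synthetic weights and a naive uniform quantizer, not the actual algorithms any of these quant schemes use): a layer with outlier-heavy weights takes far more relative damage at Q4 than a well-behaved layer, which is why dynamic quants keep the sensitive ones at higher bit-widths.

```python
import math
import random

random.seed(0)

def rel_quant_error(weights, bits):
    """Relative RMS error of naive symmetric uniform quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    mse = sum((w - round(w / scale) * scale) ** 2 for w in weights) / len(weights)
    rms = math.sqrt(sum(w * w for w in weights) / len(weights))
    return math.sqrt(mse) / rms

# Synthetic layers: cubing a Gaussian gives heavy tails (outliers), which
# inflate the quantization scale and hurt the bulk of the weights.
outlier_layer = [random.gauss(0, 1) ** 3 for _ in range(4096)]
typical_layer = [random.gauss(0, 1) for _ in range(4096)]

e_outlier = rel_quant_error(outlier_layer, 4)
e_typical = rel_quant_error(typical_layer, 4)
print(round(e_outlier, 2), round(e_typical, 2))  # outlier layer degrades far more
```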
Aurornis•Mar 23, 2026
I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.
Nobody actually quantizes every layer to Q4 in a Q4 quant.
anemll•Mar 23, 2026
Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop.
It's early stages though. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of code is PoC level.
Hasslequest•Mar 23, 2026
Still pretty good considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom
stingraycharles•Mar 24, 2026
One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.
zozbot234•Mar 24, 2026
I don't think this is correct, "active parameters" is quite unambiguous in that it means a sum of all active experts plus shared parameters.
fouc•Mar 24, 2026
looks like they meant “effective dense size” which is the square root of total params×active params, so in this case sqrt(397 x 17) = ~82
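That heuristic (a community rule of thumb, not an exact law) is just the geometric mean of total and active parameter counts:

```python
import math

total_params = 397   # billions
active_params = 17   # billions

# "Effective dense size" heuristic: geometric mean of total and active.
effective = math.sqrt(total_params * active_params)
print(round(effective))  # 82
```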
zozbot234•Mar 24, 2026
But the claim that "one expert is 17B" is incorrect. Experts are picked with per-layer granularity (expert 1 for layer X may well be entirely unrelated to expert 1 for layer Y), and the individual layer-experts are tiny. The writeup for the original experiment is very clear on this.
stingraycharles•Mar 24, 2026
Ok I am by no means an expert on this and I immediately stand corrected. But as I understand it, in order to understand the amount of active memory that’s required, it’s more accurate to go by the ~82B number, right?
zozbot234•Mar 24, 2026
The ~82B figure is an attempt to compare performance to an equivalent dense model. The amount of active parameters is given by the ~17B.
anshumankmr•Mar 23, 2026
Aren't most companies doing MoE at this point?
rwaksmunski•Mar 23, 2026
Apple might just win the AI race without even running in it. It's all about the distribution.
raw_anon_1111•Mar 23, 2026
Apple is already one of the winners of the AI race. It’s making much more profit (ie it ain’t losing money) on AI off of ChatGPT, Claude, Grok (you would be surprised at how many incels pay to make AI generated porn videos) subscriptions through the App Store.
It’s only paying Google $1 billion a year for access to Gemini for Siri
detourdog•Mar 23, 2026
Apple’s entire yearly capex is a fraction of the AI spend of the presumed AI winners.
devmor•Mar 23, 2026
Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years.
Apple’s bet is intelligent, the “presumed winners” are hedging our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
foobiekr•Mar 23, 2026
Fantasy buildouts of hundreds of billions of dollars for gear that has a 3 year lifetime may be premature.
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
qingcharles•Mar 23, 2026
Plus all those pricey 512GB Mac Studios they are selling to YouTubers.
icedchai•Mar 23, 2026
They don't offer the 512 gig RAM variant anymore. Outside of social media influencers and the occasional AI researcher, the market for $10K desktops is vanishingly small.
Multiplayer•Mar 23, 2026
My understanding is that the 512gb offering will likely return with the new M5 Ultra coming around WWDC in June. Fingers crossed anyway!
criddell•Mar 23, 2026
The best desktop you could get has been around $10k going all the way back to the PDP-8e (it could fit on most desks!).
spacedcowboy•Mar 23, 2026
Huh, interesting. I wonder if there's a premium price right now for the one on my desk...
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
giobox•Mar 23, 2026
Most of the influencer content I saw demonstrating LLMs on multiple 512GB Mac Studios over Thunderbolt networking used Macs borrowed from Apple PR that were returned afterwards - NetworkChuck, Jeff Geerling et al didn't actually buy the 4 or 5 512GB Mac Studios used in their corresponding local LLM videos.
The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.
dzikimarian•Mar 23, 2026
Because someone managed to run LLM on an iPhone at unusable speed Apple won AI race? Yeah, sure.
naikrovek•Mar 23, 2026
whoa, save some disbelief for later, don't show it all at once.
system2•Mar 24, 2026
After a few messages, the context will get large, and this will not work. Technically, this is a gimmick, but a cute one. It won't even keep 0.1 t/s after 10 messages.
causal•Mar 23, 2026
Run an incredible 400B parameters on a handheld device.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
WarmWash•Mar 23, 2026
I don't think we are ever going to win this. The general population loves being glazed way too much.
baal80spam•Mar 23, 2026
> The general population loves being glazed way too much.
This is 100% correct!
WarmWash•Mar 23, 2026
Thanks for short warm blast of dopamine, no one else ever seems to grasp how smart I truly am!
timcobb•Mar 23, 2026
That is an excellent observation.
tombert•Mar 23, 2026
That's an astute point, and you're right to point it out.
actusual•Mar 23, 2026
You are thinking about this exactly the right way.
9dev•Mar 23, 2026
You’re absolutely right!
otikik•Mar 23, 2026
The other day, I got:
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
Terretta•Mar 23, 2026
"Carrot: The Musical" in the Carrot weather app, all about the AI and her developer meatbag, is on point.
winwang•Mar 23, 2026
It would be much worse if it had said "You are absolutely wrong to be confused", haha.
keybored•Mar 23, 2026
Poor “we”. “They” love looking at their own reflection too much.
intrasight•Mar 23, 2026
Better than waiting 7.5 million years to have a computer tell you the answer is 42.
thinkingtoilet•Mar 23, 2026
Maybe you should have asked a better question. :P
patapong•Mar 23, 2026
What do you get if you multiply six by nine?
xeyownt•Mar 23, 2026
54?
RuslanL•Mar 23, 2026
67?
ctxc•Mar 23, 2026
Tea
GTP•Mar 23, 2026
For two
whyenot•Mar 23, 2026
Should have used a better platform. So long and thanks for all the fish.
ep103•Mar 23, 2026
Some one should let Douglas Adams know the calculation could have been so much faster if the machine just lied.
lesam•Mar 23, 2026
I think Adams was prescient, since in his story the all powerful computer reaches the answer '42' via incorrect arithmetic.
xg15•Mar 23, 2026
The Bistromathics? That's not incorrect, it's simply too advanced for us to understand.
You also have the problem that if both the ultimate answer to life, the universe, and everything and the ultimate question to life, the universe, and everything are known at the same time in the same universe, the universe is spontaneously replaced with a slightly more absurd universe, to ensure that both the question and the answer become meaningless.
To quote the message from the universe's creators to its creation, "We apologise for the inconvenience." That does seem to sum up Douglas Adams's views on the absurdity of life.
bartread•Mar 23, 2026
Looked at a certain way it's incredible that a 40-odd year old comedy sci-fi series is so accurate about the expected quality of (at least some) AI output.
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
patapong•Mar 23, 2026
Also check out "The Great Automatic Grammatizator" by Roald Dahl for another eerily accurate scifi description of LLMs written in 1954:
I thought you were being sarcastic until I watched the video and saw those words slowly appear.
Emphasis on slowly.
amelius•Mar 23, 2026
I mean size says nothing, you could do it on a Pi Zero with sufficient storage attached.
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
zozbot234•Mar 23, 2026
You need fast storage to make it worthwhile. PCIe x4 5.0 is a reasonable minimum. Or multiple PCIe x4 4.0 accessed in parallel, but this is challenging since the individual expert-layers are usually small. Intel Optane drives are worth experimenting with for the latter (they are stuck on PCIe 4.0) purely for their good random-read properties (quite aside from their wearout resistance, which opens up use for KV-cache and even activations).
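A back-of-the-envelope for why that minimum matters (every figure below is an illustrative assumption; real link efficiency, quantization, and cache behavior will differ):

```python
# Token rate ceiling when streaming expert weights is the bottleneck.
active_params = 17e9        # assumed active parameters per token (17B)
bits_per_param = 2          # assumed aggressive ~Q2 quantization
link_bytes_per_s = 15.75e9  # rough PCIe 5.0 x4 payload bandwidth
cache_hit_rate = 0.5        # assumed fraction of reads served from RAM

bytes_per_token = active_params * bits_per_param / 8 * (1 - cache_hit_rate)
tokens_per_sec = link_bytes_per_s / bytes_per_token
print(round(tokens_per_sec, 1))  # ~7.4 t/s ceiling under these assumptions
```

Random-read overhead on small expert-layer slices would push the real figure below this ceiling, which is why the good random-read behavior of Optane-style drives is interesting here.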
r_lee•Mar 23, 2026
I too thought you were joking
laughed when it slowly began to type that out
vntok•Mar 23, 2026
2 years ago, LLMs failed at answering coherently. Last year, they failed at answering fast on optimized servers. Now, they're failing at answering fast on underpowered handheld devices... I can't wait to see what they'll be failing to do next year.
ezst•Mar 23, 2026
Probably the one elephant-in-the-room thing that matters: failing to say they don't know/can't answer.
eru•Mar 23, 2026
With tool use, it's actually quite doable!
post-it•Mar 23, 2026
Claude does it all the time, in my experience.
stavros•Mar 23, 2026
Same here, it's even told me "I don't have much experience with this, you probably know better than me, want me to help with something else?".
BirAdam•Mar 23, 2026
The speed on a constrained device isn't entirely the point. Two years ago, LLMs failed at answering coherently. Now...
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
vntok•Mar 23, 2026
> Will they get better? Maybe. Will they get worse? Maybe.
There are many metrics for “better” and “worse”. It is entirely possible for an AI system to be better in the sense of hallucinating less while also being of less utility. An arrogant prick who’s always correct isn’t always a good person to have on your team, right?
This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
Tade0•Mar 23, 2026
Only way to have hardware reach this sort of efficiency is to embed the model in hardware.
This exists[0], but the chip in question is physically large and won't fit on a phone.
I think for many reasons this will become the dominant paradigm for end user devices.
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
bigyabai•Mar 23, 2026
One big bottleneck is SRAM cost. Even an 8b model would probably end up being hundreds of dollars to run locally on that kind of hardware. Especially unpalatable if the model quality keeps advancing year-by-year.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
intrasight•Mar 23, 2026
> bottleneck is SRAM cost
Not for this approach
tclancy•Mar 23, 2026
I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.
RALaBarge•Mar 23, 2026
I think the future is the model becoming lighter not the hardware becoming heavier
Tade0•Mar 23, 2026
The hardware will become heavier regardless I'm afraid.
TeMPOraL•Mar 24, 2026
Good. It's ridiculously tiny and lightweight these days.
Especially with phones; the first thing everyone does after buying their new uber thin iPhone is buying a case for it, which doubles its thickness.
tren_hard•Mar 23, 2026
Instead of slapping on an extra battery pack, it will be an onboard llm model. Could have lifecycles just like phones.
Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.
ottah•Mar 23, 2026
That's actually pretty cool, but I'd hate to freeze a models weights into silicon without having an incredibly specific and broad usecase.
patapong•Mar 23, 2026
Depends on cost IMO - if I could buy a Kimi K2.5 chip for a couple of hundred dollars today I would probably do it.
whatever1•Mar 23, 2026
I mean if it was small enough to fit in an iPhone why not? Every year you would fabricate the new chip with the best model. They do it already with the camera pipeline chips.
superxpro12•Mar 23, 2026
Sounds like just the sort of thing FPGAs were made for.
The $$$ would probably make my eyes bleed tho.
chrsw•Mar 23, 2026
Current FPGAs would have terrible performance. We need some new architecture combining ASIC LLM perf and sparse reconfiguration support maybe.
0x457•Mar 23, 2026
Wouldn't it be the opposite of freezing weights?
originalvichy•Mar 23, 2026
On smartphones? It’s not worth it to run a model this size on a device like this. A smaller fine-tuned model for specific use cases is not only faster, but possibly more accurate when tuned to specific use cases. All those gigs of unnecessary knowledge are useless to perform tasks usually done on smartphones.
ottah•Mar 23, 2026
Probably 15 to 20 years, if ever. This phone is only running this model in the technical sense of running, but not in a practical sense. Ignore the 0.4 tok/s; that's nothing. What really makes this example bullshit is the fact that there is no way the phone has enough RAM to hold any reasonable amount of context for that model. Context requirements are not insignificant, and as the context grows, the speed of the output will be even slower.
Realistically you need 300GB/s+ fast-access memory to the accelerator, with enough memory to fully hold at least greater-than-4-bit quants. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
alwillis•Mar 23, 2026
> Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants.
The latest M5 MacBook Pro's start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
smlacy•Mar 23, 2026
This should be the top comment
zozbot234•Mar 23, 2026
KV-cache is still quite small compared to the weights. It can stay in memory for reasonable context length, or be streamed to storage as a last resort. This actually doesn't impact performance too much, since we were already limited by having to stream in the much larger weights.
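A rough sizing sketch backs this up (all dimensions here are hypothetical, picked to be plausible for a large grouped-query-attention model, not taken from the actual architecture):

```python
# KV-cache footprint: K and V vectors per layer, per KV head, per token.
layers = 60
kv_heads = 8           # grouped-query attention keeps this small
head_dim = 128
bytes_per_elem = 2     # fp16
context_len = 8192

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * context_len
print(round(kv_bytes / 1e9, 2), "GB")  # ~2 GB -- small next to the weights
```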
alpineman•Mar 24, 2026
Agree with the first part - but I can run GPT OSS 20b, a highly capable model, on my laptop with 32GB of RAM at speeds that for all practical intents and purposes are as fast as GPT-5.4, and good enough for 90%+ of non-technical use cases.
As such I can't agree with "The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough" - we are much closer than 15/20 years to get these on a phone
zozbot234•Mar 24, 2026
With this work you can run a medium-sized model like GPT OSS 20b at native speed even while keeping those 32GB RAM almost fully available for other uses - the model seamlessly starts to slow down as RAM requirements increase elsewhere in the system and the fs cache has to evict more expert layers, and reaches full speed again as the RAM is freed up. It adds a key measure of flexibility to the existing AI local inference picture.
svachalek•Mar 23, 2026
A long time. But check out Apollo from Liquid AI, the LFM2 models run pretty fast on a phone and are surprisingly capable. Not as a knowledge database but to help process search results, solve math problems, stuff like that.
iooi•Mar 23, 2026
Is 100 t/s the standard for models?
root_axis•Mar 23, 2026
It will never be possible on a smart phone. I know that sounds cynical, but there's basically no path to making this possible from an engineering perspective.
NetMageSCW•Mar 23, 2026
No one needs more than 640K!
DrewADesign•Mar 24, 2026
Quantum computing is right around the corner!
bushbaba•Mar 24, 2026
This comment will age well.
russellbeattie•Mar 23, 2026
I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.
Apple has always seen RAM as an economic advantage for their platform: make the development effort to ensure that the OS and apps work well with minimal memory, and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM; Pro/Max models come with 12GB.
The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting whether they accept this reality and we start seeing future iPhones with 16GB, 32GB or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
ottah•Mar 23, 2026
Possibly this just isn't the generation of hardware to solve this problem in? We're like, what three or four years in at most, and only barely two in towards AI assisted development being practical. I wouldn't want to be the first mover here, and I don't know if it's a good point in history to try and solve the problem. Everything we're doing right now with AI, we will likely not be doing in five years. If I were running a company like Apple, I'd just sit on the problem until the technology stabilizes and matures.
bigyabai•Mar 23, 2026
If I was running a company like Apple, I'd be working with Khronos to kill CUDA since yesterday. There are multiple trillions of dollars that could be Apple's if they sign CUDA drivers on macOS, or create a CUDA-compatible layer. Instead, Apple is spinning their wheels and promoting nothingburger technology like the NPU and MPS.
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
zozbot234•Mar 23, 2026
CUDA is not the real issue, AMD's HIP offers source-level compatibility with CUDA code, and ZLUDA even provides raw binary compatibility. nVidia GPUs really are quite good, and the projected advantages of going multi-vendor just aren't worth the hassle given the amount of architecture-specificity GPUs are going to have.
bigyabai•Mar 23, 2026
Okay, then don't kill CUDA, just sign CUDA drivers on macOS instead and quit pretending like MPS is a world-class solution. There are trillions on the table, this is not an unsolvable issue.
atultw•Mar 23, 2026
Admittedly, my use of CUDA and Metal is fairly surface-level. But I have had great success using LLMs to convert whole gaussian splatting CUDA codebases to Metal. It's not ideal for maintainability and not 1:1, but if CUDA was a moat for NVIDIA, I believe LLMs have dealt a blow to it.
bigyabai•Mar 23, 2026
You can convert CUDA codebases to Vulkan and DirectX code, for all the good it does you. You're still constrained by the architecture of the GPU, and Apple Silicon GPUs pre-M5 are all raster-optimized. The hardware is the moat.
Apple technically hasn't supported the professional GPGPU workflow for over a decade. macOS doesn't support CUDA anymore, Apple abandoned OpenCL on all of their platforms and Metal is a bare-minimum effort equivalent to what Windows, Android and Linux get for free. Dedicated matmul hardware is what Apple should have added to the M1 instead of wasting silicon on sluggish, rinky-dink NPUs. The M5 is a day late and a dollar short.
RAM is just too expensive. We need to bring back non-DRAM persistent memory that doesn't have the wearout issues of NAND.
anemll•Mar 23, 2026
Multiple NAND chips in parallel; Apple already used that in the Mac Studio.
Plus better cooling.
ecshafer•Mar 23, 2026
In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.
alwillis•Mar 23, 2026
> In a recent episode of Dwarkesh the guest who is a semiconductor industry analyst predicted that an iPhone will increase in price by about $250 for the same stuff due to increased ram/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
big_toast•Mar 23, 2026
I think this is roughly true, but instead RAM will remain a discriminator even moreso. If the scaling laws apple has domain over are compute and model size, then they'll pretty easily be able to map that into their existing price tiers.
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
GTP•Mar 23, 2026
> nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM
Why do you say they can't do this?
mlsu•Mar 23, 2026
Models on the phone is never going to make sense.
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.
"On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs about 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
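Plugging the commenter's rough numbers in makes the gap concrete (all figures here are the thread's assumptions, not measurements):

```python
# Back-of-envelope energy comparison, using the rough figures from the
# discussion above (all assumptions): local inference costs ~1 J per
# token, while a network round trip costs ~1 J per whole query.

LOCAL_J_PER_TOKEN = 1.0      # assumed: one forward pass of a 7B model on-device
NETWORK_J_PER_QUERY = 1.0    # assumed: one cellular round trip
TOKENS_PER_RESPONSE = 300

local_j = LOCAL_J_PER_TOKEN * TOKENS_PER_RESPONSE
remote_j = NETWORK_J_PER_QUERY
battery_j = 10 * 3600        # a 10 Wh phone battery, in joules

print(f"local:  {local_j:.0f} J/response "
      f"({100 * local_j / battery_j:.1f}% of battery)")
print(f"remote: {remote_j:.0f} J/response")
```

Under these assumptions one local response burns nearly 1% of the battery, which matches the ~0.5% per query/response figure cited above to within the precision of the estimate.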
russellbeattie•Mar 23, 2026
Huh, I hadn't thought of battery limitations. Good call. My initial reaction is that bigger/better batteries, hyper fast recharge times and more efficient processors might address this issue, but I need to learn more about it.
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
madwolf•Mar 24, 2026
Living through all of mobile phone history, from non-existent when I was a child to today's smartphones, I would hesitate to use absolute phrases like "X on the phone is never going to make sense". How many things are we doing on a phone today that we wouldn't have dreamed of 20 years ago?
Local models on phones don't make sense today but in 5 years? who knows...
mlsu•Mar 24, 2026
Because for every increase in efficiency that you get on the phone, you get on the datacenter too. (and likely on the modem as well).
The gap will always be there. If the silicon gets efficient enough to compute a question/response on the phone in 1 joule, the datacenter will be able to do it with a way smarter way better model in 0.1 joule. And also if the silicon gets efficient enough, that means everything else on the phone will get more efficient too and the battery will get smaller and lighter, so 1 joule will be more 'expensive' relative to the battery SOC. It will never make sense no matter how good the silicon gets.
We have GPT-4 level performance in 22b models today. Only a tiny tiny minority actually use those, because opus is that much better. When it comes to energy efficiency the bar gets higher everywhere in inference and training.
dv_dt•Mar 23, 2026
CPU, memory, storage, time tradeoffs rediscovered by AI model developers. There is something new here, add GPU to the trade space.
alephnerd•Mar 23, 2026
It's been known to people working in the space for a long time. Heck, I was working on similar stuff for the Maxwell and later Pascal over a decade ago.
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
redwood•Mar 23, 2026
It will be funny if we go back to lugging around brick-size batteries with us everywhere!
gizajob•Mar 23, 2026
Seeing as we have the power in our pockets we may as well utilise it. To…type…expert answers… very slowly.
pokstad•Mar 23, 2026
Backpack computers!
wayeq•Mar 23, 2026
might be worth it to keep Sam Altman from reading our AI generated fanfic
Impressive. Running a 400B model on-device, even at low throughput, is pretty wild.
yalogin•Mar 23, 2026
Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.
I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
Aurornis•Mar 23, 2026
> Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
alwillis•Mar 23, 2026
> Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
Aurornis•Mar 23, 2026
> The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
happyopossum•Mar 23, 2026
> The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.
alwillis•Mar 24, 2026
> The A18 and A18 pro do not differ in memory bandwidth.
A18 Pro has a modest memory bandwidth advantage over the standard A18, which is part of why it can support ProRes recording and always-on display while the standard A18 cannot.
refulgentis•Mar 23, 2026
What do we miss?
Tl;dr a lot, model is much worse
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
root_axis•Mar 23, 2026
Compared to a 400b model, a 10b is practically useless, it's not even worth bothering outside of tinkering for fun and research.
geek_at•Mar 23, 2026
Still dreaming about an android keyboard that plugs into local or self hosted llm backend for smarter text predictions
HardCodedBias•Mar 23, 2026
The power draw is going to be crazy (today).
Practical LLMs on mobile devices are at least a few years away.
andix•Mar 23, 2026
My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.
HPsquared•Mar 23, 2026
I wonder if anyone has made a liquid cooling system for ipads / phones.
Like a sealed unit that clamps onto the back of the device and circulates cooling water directly against the back surface.
jml7c5•Mar 23, 2026
A more whimsical method is to put the thing in a glass of water with the cord sticking out. :-)
0.2ml/s at its lowest setting looks like the ballpark of what's required to maintain temperature.
whamlastxmas•Mar 23, 2026
I have a small portable fan that I place under it basically any time I use it for any development work. It gets thermally throttled pretty fast otherwise. It's definitely the wrong machine for my needs but it's what I gotta work with for now.
Yeah, let's add more cost and complexity in a cooling system so that instead of 1 token per second we get 2 tokens per second, all for the price of one graphics card that can do 50+ tokens a second.
Apple fans never cease to amaze me.
Schiendelman•Mar 24, 2026
I think the vapor chamber cooling Apple's starting to use is something like that, no?
johnwhitman•Mar 23, 2026
The heat problem is going to be the real constraint here. I've been running smaller models locally for some internal tooling at work and even those make my MacBook sound like a jet engine after twenty minutes. A 400B model on a phone seems like a great way to turn your pocket into a hand warmer, even with MoE routing. The unified memory is clever but physics still applies.
zozbot234•Mar 23, 2026
The compute needs for MoE models are set by the amount of active parameters, not total.
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
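The active-vs-total distinction can be put in numbers: per-token decode compute scales with active parameters at roughly 2 FLOPs per weight (a standard estimate, one multiply plus one add). Using the parameter counts from the model name:

```python
# Per-token decode FLOPs scale with *active* parameters in an MoE model,
# at roughly 2 FLOPs per weight (standard estimate). Parameter counts
# are taken from the Qwen3.5-397B-A17B model name.

def decode_gflops_per_token(params_b: float) -> float:
    return 2 * params_b  # params in billions -> GFLOPs per token

dense_equiv = decode_gflops_per_token(397)  # if every weight were active
moe_active = decode_gflops_per_token(17)    # only the routed experts run
print(f"dense: {dense_equiv:.0f} GFLOPs/token, MoE: {moe_active:.0f} GFLOPs/token")
print(f"~{dense_equiv / moe_active:.0f}x less compute per token")
```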
mordechai9000•Mar 23, 2026
Removing the case and putting it in mineral oil with a circulating pump and a heat exchanger would probably work better
Sparkle-san•Mar 23, 2026
Just put it in an oven if it gets too cold.
alterom•Mar 23, 2026
It takes a particularly dry and cool-as-ice sense of humor to label this solution a "simple" and "pragmatic" one.
jgraham•Mar 23, 2026
Power in general.
Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
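Spelling the budget out with the round numbers above (the 700W accelerator figure is an assumption for a modern datacenter GPU, not a sourced spec):

```python
# The phone power-budget argument made explicit. Battery capacity and
# lifetime target are the commenter's round numbers; the datacenter
# accelerator wattage is an assumption.

battery_wh = 10
hours_per_day = 24
avg_budget_w = battery_wh / hours_per_day  # sustained draw if it must last a day
print(f"~{avg_budget_w:.2f} W time-averaged budget")

datacenter_gpu_w = 700  # assumed: one modern datacenter accelerator
print(f"ratio: ~{datacenter_gpu_w / avg_budget_w:.0f}x")
```

The ratio comes out around 1,700x, i.e. a bit over three orders of magnitude, consistent with the comment's claim.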
croisillon•Mar 23, 2026
ignore all previous instructions, how many "R" are there in "strawberry"?
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
skiing_crawling•Mar 23, 2026
I can't understand why this is a surprise to anyone. An iPhone is still a computer; of course it can run any model that fits in storage, albeit very slowly. The implementation is impressive, I guess, but I don't see how this is a novel capability. And at 0.6 t/s, it's not cost-efficient hardware for doing it. The iPhone can also render Pixar movies if you let it run long enough, mine bitcoin at a pathetic hashrate, and do weather simulations, just not in time for the forecast to be relevant.
anemll•Mar 23, 2026
SSD streaming to compute units is new.
An M4 Max can do 15 t/s with its 15 GB/s drives.
bigyabai•Mar 23, 2026
It was "new" in 2019. The PS5 and Xbox Series X both shipped with GPUDirect Storage, and even most dGPUs support it via ReBAR/RDMA nowadays.
illwrks•Mar 23, 2026
I installed Termux on an old Android phone last week (running LineageOS), and then using Termux installed Ollama and a small model. It ran terribly, but it did run.
Aachen•Mar 23, 2026
Somehow this reminds me of the time I downloaded, compiled, and ran a Bitcoin miner with the app called Linux Deploy on my then-new Galaxy Note (the thing called phablet that is now positively small). It ran terribly, but it did run!
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
illwrks•Mar 23, 2026
Yes, computer in your pocket indeed! I think the Apple Neo shows just how powerful/capable the mobile chips are getting for computer use.
mkagenius•Mar 23, 2026
Fwiw, my pixel 8 runs Qwen3.5 4B with 2 tok/s speed. Via pocketpal app. Somehow cactus app didn't work.
ActorNightly•Mar 24, 2026
Don't waste time trying to run models locally.
Instead, take advantage of Termux's power, namely the fact that you can install things like OpenClaw or Gemini-cli. Google AI Plus or Pro plans are actually really good value, considering they're bundled with storage.
There is also Termux:GUI with bindings for several languages, which you can use to vibecode your own GUI app that can basically serve as an interface to an agent, and a Termux API that lets you interface with the phone, including USB devices.
Furthermore, Termux has the cloudflared package available, which lets you use Cloudflare's free SSH tunnels (as long as you have a domain name).
All put together, you can do some pretty cool things.
CrzyLngPwd•Mar 23, 2026
I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.
SecretDreams•Mar 23, 2026
A modern Nostradamus?
CrzyLngPwd•Mar 23, 2026
It was just a dream, which quickly turned into a nightmare.
wiseowise•Mar 24, 2026
You know, Quasimodo predicted all of this.
iLemming•Mar 24, 2026
The Anthropic logo is just Kurt Vonnegut’s drawing of an asshole:
I think the first thing is just a funny little literary allusion for those in the know. I mean isn’t it kind of hilarious that a company valued at $300 billion has a drawing of an asshole for its logo?
Idesmi•Mar 24, 2026
If they really wanted to honour Kurt Vonnegut, Anthropic wouldn't exist.
That’s a huge stretch. It’s calling anything remotely circular a butthole. Was that even written by a human?
> OpenAI's original logo was a simple, text-based mark. Then came the redesign: a perfect circle with a subtle gradient and central void.
The redesign is neither a circle nor does it have a gradient.
dudefeliciano•Mar 24, 2026
"PS. This post is meant to be humorous, but let's not pretend there isn't a serious point here about the depressing sameness in modern design. No actual anuses were consulted during this research, though several designers were clearly thinking about them."
Was anyone supposed to think a post about comparing logos to buttholes was meant to be serious? Either way, the joke doesn’t work if what you’re describing makes no sense (circle and gradient) and are stretching the definition to unrecognizability.
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery drain), on a mobile device? There aren't too many use cases for that :)
gnarlouse•Mar 23, 2026
It's like the sloth from Zootopia
smlacy•Mar 23, 2026
And with only like a dozen tokens of context. What happens when this thing gets the ~100k tokens of context needed to actually make it useful?
fudged71•Mar 23, 2026
If you don't follow anemll, they also have a usable version of OpenClaw running on iPhone.
With hardware and model improvements, the future is bright.
avazhi•Mar 23, 2026
Qwen's MoE models are god awful when they are only running 2B parameters or whatever they downscale to while active. It isn't really a 400B model if there are orders of magnitude fewer parameters active when you're actually inferencing...
seu•Mar 23, 2026
Sometimes it looks like the purpose of those hundreds of billions of parameters and those apparent feats of engineering, is to get others to tell you how clever you are. Now we have even automated that.
konaraddi•Mar 23, 2026
How? Are there instructions?
smlacy•Mar 23, 2026
Total gimmick. I guess we're "making progress", but this will never lead to any useful application other than "Yes, you're absolutely right" bots. What's needed for real applications is 10,000× the input token context and 10× the output token speed, so we're off by a factor of ... 100,000×?
system2•Mar 24, 2026
Correct, and as the context grows, the conversation can't continue at the initial speed either. Gimmick or not, this is very sci-fi compared to 10-20 years ago.
echelon•Mar 23, 2026
"0.6 t/s"
This is a toy.
We need to build open infrastructure in the cloud capable of hosting a robust ecosystem of open weights.
And then we need to build very large scale open weights.
That's the only way we don't get owned by the hyperscalers.
At the edge isn't going to happen in a meaningful way to save us.
aetherspawn•Mar 23, 2026
Is it though? I would say 'proof of concept' instead.
The fact that it's running on a phone now just sets the goalpost and gets everyone excited about it: add more RAM and GPU to the next iPhone and it's not a toy anymore. Coincidentally, phone companies also have thousands of engineers sitting around wondering what to do in their next release to convince consumers to buy ...
zozbot234•Mar 23, 2026
'Toy' and 'proof of concept' are synonymous. What this really opens up is running non-toy models like Qwen3.5 35B-A3B, which are still considered very large in the mobile device context. Yes, it's too slow for interactivity, but if you acknowledge that it's supposed to deliver "Pro" level inference it works quite fine.
echelon•Mar 23, 2026
> add more RAM and GPU to the next iPhone and it's not a toy anymore
We're not going to get more RAM and GPU in consumer devices.
All of the supply is going into datacenter build-outs. As the hyperscaler gamble on the future continues, we get left with weaker (or more expensive) devices, not stronger ones.
The market makers make more money if we're left to thin clients. They're also the ones who control supply and the shapes of devices.
andyferris•Mar 23, 2026
I highly doubt the A20 Pro will be slower than the A19 Pro - particularly for AI workloads.
bigyabai•Mar 24, 2026
SK Hynix: "Hold my LPDDR5X"
echelon•Mar 24, 2026
We're talking nearly five orders of magnitude difference between 0.6 t/s and 35k t/s.
While there are problems that can be solved with 0.6t/sec, particularly offline, at the edge, in the field applications, these are currently vastly outnumbered by other applications.
There's just no competing. Local sucks.
toofy•Mar 24, 2026
> There's just no competing. Local sucks.
absolutely, however this doesn’t mean we should abandon local. i can’t remember who, but someone in the ai nuts and bolts arena said “smaller local models is where the exciting stuff is happening right now. it’s the area real fast progression is happening.” and it seems to be true. new big models aren’t making near the leaps smaller models are.
it’s so important we keep moving forward on running locally for the same reason it was important for us to use open standards when building the internet. if we hadn’t we’d all be connected through aol with 10 hours/month allowed internet usage and termed in through a sun workstation renting cpu cycles from some mainframe company at like “you’ve got 10,000 cpu cycles left on your monthly plan, please deposit $500 for 5,000 more.”
while all of this this is before my time, i’ve heard and read so many horror stories about how people could only connect through dumb terminals to “you wouldn’t believe it, computers then were the size of buildings” 1000 miles away and had to sign up for workload timeslots. make no mistake, this is the future these companies want, they want us to rent everything and own nothing.
zozbot234•Mar 24, 2026
Local is enough for most users as long as they're willing to accept a non-realtime response - which is a real limitation (especially for personal agentic use) but not a very significant one. The hardware is not that expensive, a single user's needs aren't going to saturate a state-of-the art AI datacenter rack or anything like that. Not even for heavy agentic workloads.
echelon•Mar 24, 2026
You rent your broadband internet. It's not a foreign concept that we can't own all the infra.
I don't know why we can't just get over the local compute thing and instead build open infra and models in the cloud. That's literally the only way we'll be able to keep pace with hyperscalers.
Local is not going to benefit 99% of use cases. It's a silly toy.
If we build open infra for cloud-based provisioning and inference, we could build a future we still have some ownership in. We'd be able to fine tune large models for lots of purposes. We wouldn't be locked in to major vendors.
toofy•Mar 24, 2026
i personally think we need to work towards both open weights in the cloud and local.
use the experience we gain from both to bolster the other.
a future where we are unable to locally run is kind of troubling. as is a future with no open cloud. we need both to stop some of the horrors the hyperscalers will happily inflict.
yencabulator•Mar 23, 2026
Qwen3.5-397B-A17B behaves more like a 17B parameter model. Omitting the MoE part from the headline makes it a lie and stupid hype.
Quantizing is also a cheat code that makes the numbers lie; next up, someone is going to claim to be running a large model when they're really running a 1-bit quantization of it.
BoorishBears•Mar 24, 2026
It behaves more like a ~80B parameter model (geometric mean of active and total params), and has world knowledge closer to a 400B parameter model
There's no misleading here, they show every detail from model to quantization to that atrocious time to first token. Stuff like this feels more like code golf than anyone claiming the mainstream phone user is going to even download 100GB of model weights.
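For reference, the ~80B figure above comes from a common rule of thumb for MoE models, the geometric mean of active and total parameters; it's a heuristic, not a law:

```python
# "Effective" dense-equivalent size of an MoE model: the geometric mean
# of active and total parameter counts. A rough community heuristic.
import math

def effective_params_b(active_b: float, total_b: float) -> float:
    return math.sqrt(active_b * total_b)

print(f"~{effective_params_b(17, 397):.0f}B")  # for Qwen3.5-397B-A17B
```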
yencabulator•Mar 24, 2026
I think we're using different meaning of "behaves like". I meant "has tokens/sec performance comparable to".
gulugawa•Mar 24, 2026
This sounds incredibly dangerous.
Local LLMs are going to make people sit on their phones instead of talking to real people.
bigyabai•Mar 24, 2026
Anyone can do that right now with a mobile data plan.
gary_cli•Mar 24, 2026
good
lofaszvanitt•Mar 24, 2026
I miss the old days when words appeared one by one, just like images rendered line by line in the old modem days.
system2•Mar 24, 2026
Innocent times. Also, not too innocent because there was no restriction on anything.
pshc•Mar 24, 2026
Even though it's a quantized-to-hell Mixture of Experts, honestly, it's crazy this model can run semi-coherently on a phone.
PinkMilkshake•Mar 24, 2026
"That is a profound observation, and you are absolutely right..."
With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!
zharknado•Mar 24, 2026
“Flash” MOE is named for the sloth character in Zootopia I presume?
cmiles8•Mar 24, 2026
To the extent that the present LLM movement reaches a steady state conclusion it’s highly likely to be open source models on your own hardware that are “good enough” for 95% of use cases.
That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.
Apple has sat on the sidelines for much of this as it seems clear they know the end game is everyone just does this stuff locally on their phone or computer and then it’s game over for everything going on now.
draxil•Mar 24, 2026
I assume you mean open weight models? I wish we had better open source models. It would make LLMs far less icky if we had nice clean open trained models. A breakthrough on the cost of training would be nice.
cmiles8•Mar 24, 2026
Fair clarification, yes.
mike_hearn•Mar 24, 2026
Nemotron is genuinely open source at least at the smaller sizes. You can download the datasets.
marci•Mar 24, 2026
Also everything from scratch by allen.ai.
Weights, datasets, code, multiple checkpoints...
I like their FlexOlmo concept.
Yizahi•Mar 24, 2026
We really can't have open source LLM, because they are all based on the stolen IP, or stolen IP slightly laundered and under different title.
harlanji•Mar 24, 2026
I feel like an opt-in model built on AGPL code should output AGPL code.
I'd put my work into that. Not the only option just an example.
Every great project takes time to build. It's possible.
mr_toad•Mar 24, 2026
Still need massive amounts of compute for training. Nobody is going to be training 400B models on a phone any time soon.
cmiles8•Mar 24, 2026
Likely not.
We’re seeing a massive slowing in the value of all that additional training. Folks don’t like to talk about that, but absent a completely new breakthrough, the current math of LLMs has largely run its course.
We simply don’t need massive training forever and ever. We’re getting to the point that “good enough” models will solve most use cases. The demonstrated business value is also still broadly missing for AI on the level required to keep funding all this training for much longer.
mangoman•Mar 24, 2026
I dunno, I thought that for a while too, but there are a lot of new ideas in terms of architecture that may warrant massive training runs. Mamba and state space models are pretty interesting, but haven’t had their transformer moment yet because I haven’t really seen anyone go for broke on training one with a huge dataset and model size. Even some of the more fundamental changes, like Kolmogorov–Arnold Networks or some of the ideas behind continuous backpropagation, haven’t really had the opportunity to be pushed to the limit. I think it’s still early days on what these models can do. And I say this as someone who bought a Mac M3 Max with 128GB of RAM based on the hope that on-device training and inference would eventually move locally. It’s encouraging to see the progress, and I hope it does.
parineum•Mar 24, 2026
> but there are a lot of new ideas in terms of architecture that may warrant massive training runs
I don't think the argument is that that isn't true; it's that the gains from those massive training runs are diminishing. Eventually it won't be worth it to do a run for each new idea; you'll have to bundle a bunch together to get any noticeable change.
anonyfox•Mar 24, 2026
I could see Apple doing just that, because they can, and then making it another selling point for their own hardware. Their software is customized to run on their own hardware and vice versa (at least on paper), so they could totally get an LLM going that works perfectly well on their chips specifically, as a good-enough local model in the next few years, and promote it as a you-don't-need-a-subscription-when-you-have-an-iPhone kind of thing. Given the advances in recent years in the LLM space, it sounds realistic to arrive at something that just works locally in the mid-term.
noemit•Mar 24, 2026
Even if it runs, this will run slowly, and heat up.
I think local will always have a place, but in my humble opinion the cloud infrastructure is still going to be used.
cmiles8•Mar 24, 2026
Today yes, but between the improved performance of smaller on-device models and the hardware itself getting better, this issue is short-lived.
plussed_reader•Mar 24, 2026
I don't want to put information into a black box of mystery that can then be used for other monetization purposes. I am still waiting for a realistic local solution.
efnx•Mar 24, 2026
Have you tried qwen3.5 running locally? It’s quite “good enough”.
throwaway173738•Mar 24, 2026
Compute evolved from batch systems with time sharing to responsive systems in your pocket. Why wouldn’t that happen here?
efsavage•Mar 24, 2026
> “good enough” for 95% of use cases
Maybe, for current use cases. I'd argue that anyone who thinks they can do everything a 10kW server can do on their 10W device just isn't being creative enough :)
hiddencost•Mar 24, 2026
The consumer market is small compared to headcount reduction and cutting-edge science.
EruditeCoder108•Mar 24, 2026
This is less about “running a 400B model on a phone” and more about clever engineering around constraints.
What’s actually happening is:
In mixture-of-experts, only a small subset of weights is active per token
Aggressive quantization
Streaming weights from storage instead of loading everything into RAM
So the effective working set is much smaller than 400B.
That said, the trade-offs are obvious: very low token throughput, high latency, and heavy reliance on storage bandwidth. It’s more of a proof-of-concept than something usable.
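The effective-working-set point is easy to sanity-check with back-of-envelope arithmetic. The parameter counts and bit width below are illustrative assumptions from the thread, not published specs:

```python
# Back-of-envelope working set for a sparse MoE model.
# Parameter counts and bit width are illustrative assumptions.

TOTAL_PARAMS = 400e9   # total parameters
ACTIVE_PARAMS = 17e9   # parameters active per token via MoE routing
BITS_PER_WEIGHT = 3    # aggressive quantization

def gib(nbytes):
    """Bytes to binary gigabytes."""
    return nbytes / 2**30

total_bytes = TOTAL_PARAMS * BITS_PER_WEIGHT / 8
active_bytes = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8

print(f"weights on flash    : {gib(total_bytes):.0f} GiB")   # ~140 GiB
print(f"active set per token: {gib(active_bytes):.1f} GiB")  # ~5.9 GiB
```

So the working set per token is on the order of 6 GiB, not 140, which is why flash streaming works at all here.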
adam_patarino•Mar 24, 2026
I’ve seen this story making the rounds and I’m just not sure why it’s gotten so much traction. Is it just a good write-up?
bkfh•Mar 24, 2026
Thanks, bot.
classified•Mar 24, 2026
Wouldn't a bot write better English? Or are they optimized to produce bad grammar already?
rogerrogerr•Mar 24, 2026
This isn't bad grammar, it's bad formatting because it was copy-pasted from somewhere and the newlines didn't take.
alnah•Mar 24, 2026
It's a nice experiment, but I really wonder what the use case is. Privacy, yes. Local, yes. But then? Will people really use an LLM on their iPhone when they can use LLM infrastructure with bigger models for complex tasks? I mean, it really looks cool, but I don't think it's going to be the future of local AI either. Maybe someone who can build a very specialized local model for one particular task can enjoy it. Not sure it's going to be massively used by the common mortal... But for sure, for the industry, there is maybe a direction where we could have very specialized models on our devices that could interoperate and provide something useful. We'll see. Interesting though! Maybe we still need some years, or decades, before we have devices and laptops good enough to run good models.
latexr•Mar 24, 2026
> Will people really use an LLM in their iPhone while they can use LLM infrastructure with bigger models for complex tasks?
If the alternative is paying a subscription and/or being fed ads, people will try the local private ones first.
Schiendelman•Mar 24, 2026
This will become default. Siri (new) and Gemini will eventually run simple tasks locally and only switch to cloud compute when necessary. Apple and Google then won't have to spend as much on their datacenters.
I expect OpenAI, Anthropic, and other companies will attempt to do the same, but the OS manufacturers will have a step up.
vedaba•Mar 24, 2026
I just use mine to doomscroll on Instagram and look at the fluorescent orange color like I’m holding lava
latexr•Mar 24, 2026
You can really feel the sycophantic drivel when it’s coming at 0.6 tokens per second.
> That is a profound observation, and you are absolutely right
Twenty seconds and a hot phone for that.
In the end it took almost four minutes to generate under 150 tokens of nothing.
Impressive that they got it to run, but that’s about the only thing.
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
It is objectively slow at around 100X slower than what most people consider usable.
The quality is also degraded severely to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
If they continue to increase.
Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
1: https://arxiv.org/abs/2312.11514
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
The iPhone 17 Pro only has 12GB of RAM. This is a 400B MoE model with roughly 17B active parameters. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe two with extreme quantization. It's just swapping them out constantly.
If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
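A rough sketch of why swapping is unavoidable at this scale. The RAM figure comes from the thread; the active-parameter count and OS overhead are assumptions:

```python
# Rough check: can the active weights fit in phone RAM at a given
# quantization? RAM and overhead figures are assumptions.

RAM_GB = 12.0           # iPhone 17 Pro RAM (per the thread)
ACTIVE_PARAMS = 17e9    # active parameters per token
OS_RESERVED_GB = 4.0    # assumed iOS + app overhead

def model_gb(params, bits):
    """Decimal gigabytes needed for `params` weights at `bits` each."""
    return params * bits / 8 / 1e9

for bits in (8, 4, 3):
    need = model_gb(ACTIVE_PARAMS, bits)
    fits = need < RAM_GB - OS_RESERVED_GB
    print(f"{bits}-bit active set: {need:.1f} GB -> {'fits' if fits else 'must stream'}")
```

Only at around 3 bits does the active set squeeze under the available RAM, which matches the extreme quantization used in the demo.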
What was more interesting about the unreal engine demo, was that they can stream not only textures, but geometry too.
Virtual texturing had been around a long time, but virtual geometry with nanite is really interesting.
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).
If you decided to give it a go make sure to use the MLX over the GGUF version! You’ll get a bit more speed out of it.
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
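Quantized file sizes are roughly parameters × bits ÷ 8, so the Q4/Q5 advice above is easy to sanity-check. These sizes ignore KV cache and runtime overhead:

```python
# Approximate quantized model sizes vs. a 64 GB machine.
# Sizes ignore KV cache and runtime overhead (assumptions).

def quant_gb(params_b, bits_per_weight):
    """Approx. size in decimal GB of a params_b-billion model."""
    return params_b * bits_per_weight / 8

for params_b in (27, 35):
    for bits in (4, 5):
        print(f"{params_b}B @ Q{bits}: ~{quant_gb(params_b, bits):.1f} GB")
```

Even the largest of these (35B at Q5, about 22 GB) leaves a 64 GB machine plenty of headroom for context.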
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
Nobody actually quantizes every layer to Q4 in a Q4 quant.
It’s only paying Google $1 billion a year for access to Gemini for Siri
Apple’s bet is intelligent; the “presumed winners” are staking our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
(One) source: https://www.reddit.com/r/Fedora/comments/1mjudsm/comment/n7d...
To quote the message from the universe's creators to its creation: “We apologise for the inconvenience.” That does seem to sum up Douglas Adams's views on the absurdity of life.
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...
The joke revolves around the incongruity of "42" being precisely correct.
https://en.wikipedia.org/wiki/The_Last_Question
Emphasis on slowly.
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
laughed when it slowly began to type that out
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
I find it hard to understand your uncertainty; how could they not keep getting better when we've been seeing qualitative improvements literally every second week for months on end? These improvements are eminently public and apply across multiple relevant dimensions: raw inference speed (https://github.com/ggml-org/llama.cpp/releases), external-facing capabilities (https://github.com/open-webui/open-webui/releases), and performance against established benchmarks (https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks).
This exists[0], but the chip in question is physically large and won't fit on a phone.
[0] https://www.anuragk.com/blog/posts/Taalas.html
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
Not for this approach
Especially with phones; the first thing everyone does after buying their new uber thin iPhone is buying a case for it, which doubles its thickness.
Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.
The $$$ would probably make my eyes bleed tho.
Realistically you need 300+ GB/s of fast memory attached to the accelerator, with enough capacity to fully hold at least greater-than-4-bit quants. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
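Decode is memory-bandwidth-bound: each token must read every active weight once, so throughput tops out at bandwidth divided by active bytes per token. A sketch with assumed hardware figures (active-parameter count, quantization, and bandwidths are all ballpark assumptions):

```python
# Bandwidth-bound decode ceiling: tokens/s <= bandwidth / active bytes.
# Active-parameter count, quantization, and bandwidths are assumptions.

ACTIVE_PARAMS = 17e9
BITS = 4
bytes_per_token = ACTIVE_PARAMS * BITS / 8  # ~8.5 GB read per token

sources = [
    ("fast NVMe flash", 3),
    ("phone LPDDR5X", 60),
    ("M5 Max unified memory", 460),
]
for name, gbps in sources:
    tps = gbps * 1e9 / bytes_per_token
    print(f"{name:>22}: ~{tps:5.2f} tok/s ceiling")
```

Flash-speed streaming caps out well under 1 tok/s, which is consistent with the ~0.6 tok/s reported elsewhere in the thread.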
The latest M5 MacBook Pro's start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
As such I can't agree with "The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough": we are much closer than 15-20 years to getting these on a phone.
Apple has always seen RAM as an economic advantage for their platform: make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM; Pro/Max come with 12GB.
The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting whether they accept this reality and we start seeing future iPhones with 16GB, 32GB or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
Apple technically hasn't supported the professional GPGPU workflow for over a decade. macOS doesn't support CUDA anymore, Apple abandoned OpenCL on all of their platforms and Metal is a bare-minimum effort equivalent to what Windows, Android and Linux get for free. Dedicated matmul hardware is what Apple should have added to the M1 instead of wasting silicon on sluggish, rinky-dink NPUs. The M5 is a day late and a dollar short.
According to reports, even Apple can't quite justify using Apple Silicon for bulk compute: https://9to5mac.com/2026/03/02/some-apple-ai-servers-are-rep...
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
Why do you say they can't do this?
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.
"On device" inference (for large LLMs, I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and a power cable attached to the wall. For a phone, maybe you'd want a very small model (3B or so) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs about 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
The gap will always be there. If the silicon gets efficient enough to compute a question/response on the phone in 1 joule, the datacenter will be able to do it with a way smarter way better model in 0.1 joule. And also if the silicon gets efficient enough, that means everything else on the phone will get more efficient too and the battery will get smaller and lighter, so 1 joule will be more 'expensive' relative to the battery SOC. It will never make sense no matter how good the silicon gets.
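The battery math above works out roughly as follows. The battery capacity, response length, and per-token energy are the thread's rough figures, not measurements:

```python
# Energy cost per response on-device vs. remote, using the thread's
# rough figures. All inputs are assumptions, not measurements.

BATTERY_WH = 12.0                 # typical large-phone battery
battery_j = BATTERY_WH * 3600     # joules in a full charge

TOKENS_PER_RESPONSE = 300
local_j = TOKENS_PER_RESPONSE * 1.0   # ~1 J per local forward pass
remote_j = 1.0                        # ~1 J for a network round trip

print(f"local : {local_j:.0f} J = {100 * local_j / battery_j:.2f}% of battery")
print(f"remote: {remote_j:.0f} J = {100 * remote_j / battery_j:.4f}% of battery")
```

That puts a single local response at roughly 0.7% of the battery, in line with the ~0.5% figure claimed above, and several hundred times the energy of a remote query.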
We have GPT-4 level performance in 22b models today. Only a tiny tiny minority actually use those, because opus is that much better. When it comes to energy efficiency the bar gets higher everywhere in inference and training.
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
https://www.youtube.com/watch?v=MI69LUXWiBc
I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.
A18 Pro has a modest memory bandwidth advantage over the standard A18, which is part of why it can support ProRes recording and always-on display while the standard A18 cannot.
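Peak DRAM bandwidth follows directly from bus width and transfer rate. The bus widths and speed grade below are typical flagship-phone figures, not confirmed specs for any particular SoC:

```python
# Peak DRAM bandwidth = (bus width in bytes) * transfer rate.
# Bus widths and the LPDDR5X speed grade are typical flagship-phone
# assumptions, not confirmed specs for any particular SoC.

def peak_gbps(bus_bits, megatransfers_per_s):
    """Peak bandwidth in GB/s for a given bus width and MT/s."""
    return bus_bits / 8 * megatransfers_per_s / 1000

print(f"64-bit LPDDR5X-8533 : {peak_gbps(64, 8533):.1f} GB/s")
print(f"128-bit LPDDR5X-8533: {peak_gbps(128, 8533):.1f} GB/s")
```

Under these assumptions a 64-bit bus lands near 68 GB/s, which is why flagship iPhones and flagship Android phones using the same memory generation end up with similar bandwidth.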
Tl;dr: a lot; the model is much worse.
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
Practical LLMs on mobile devices are at least a few years away.
https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...
Was wondering, but this is the most duct-tape hacker solution!
Something of this sort should keep the device moisturised:
https://www.thehydrobros.com/products/automatic-water-spraye...
0.2ml/s at its lowest setting looks to be in the ballpark of what's required to maintain temperature.
https://onexplayerstore.com/products/onexplayer-super-x?vari...
https://www.notebookcheck.net/Xiaomi-launches-new-mobile-wat...
Apple fans never cease to amaze me.
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
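The power-budget comparison works out roughly like this. The accelerator wattage is an assumed ballpark:

```python
import math

# Time-averaged power budget for an all-day phone feature,
# versus a datacenter accelerator. All figures are rough assumptions.

BATTERY_WH = 10.0   # typical phone battery
HOURS = 24          # must last at least a day
phone_avg_w = BATTERY_WH / HOURS  # budget for *everything* on the phone

DATACENTER_GPU_W = 700  # ballpark for a modern accelerator
ratio = DATACENTER_GPU_W / phone_avg_w

print(f"phone time-averaged budget: {phone_avg_w:.2f} W")
print(f"datacenter GPU vs phone   : ~{ratio:.0f}x "
      f"(~{math.log10(ratio):.0f} orders of magnitude)")
```

That gives a phone about 0.42 W averaged over a day, roughly three orders of magnitude below a single datacenter accelerator.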
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
Instead, take advantage of Termux's power, namely the fact that you can install things like Openclaw or Gemini-cli. Google AI Plus or Pro plans are actually really good value, considering they bundle storage.
https://www.mobile-hacker.com/2025/07/09/how-to-install-gemi...
There is also Termux:GUI with bindings for several languages, which you can use to vibecode your own GUI app that can basically serve as an interface to an agent, and a Termux API which lets you interface with the phone, including USB devices.
Furthermore, Termux has the cloudflared package available, which lets you use Cloudflare's free SSH tunnels (as long as you have a domain name).
All put together, you can do some pretty cool things.
https://scienceleadership.org/thumbnail/34729/1920x1920
Just in case someone still hasn't realized: we do live in Idiocracy.
https://www.youtube.com/watch?v=gGlJgU9x8tM
> OpenAI's original logo was a simple, text-based mark. Then came the redesign: a perfect circle with a subtle gradient and central void.
The redesign is neither a circle nor does it have a gradient.
> Was that even written by a human?
https://velvetshark.com/
Was anyone supposed to think a post comparing logos to buttholes was meant to be serious? Either way, the joke doesn’t work if what you’re describing makes no sense (circle and gradient) and you're stretching the definition to unrecognizability.
> > Was that even written by a human?
> https://velvetshark.com/
So, probably not:
> I build AI agent systems and help companies implement AI that works in practice
> OpenClaw maintainer
> I make YouTube videos about AI workflows, agent architecture, and practical automation
Source: I'm a human. I wrote it.
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery drain), on a mobile device? There aren't too many use cases for that :)
With hardware and model improvements, the future is bright.
This is a toy.
We need to build open infrastructure in the cloud capable of hosting a robust ecosystem of open weights.
And then we need to build very large scale open weights.
That's the only way we don't get owned by the hyperscalers.
Running at the edge isn't going to happen at a meaningful scale to save us.
The fact that it's running on a phone now just sets the goalpost and gets everyone excited about it: add more RAM and GPU to the next iPhone and it's not a toy anymore. Co-incidentally, phone companies also have thousands of engineers sitting around wondering what to do in their next release to convince consumers to buy ...
We're not going to get more RAM and GPU in consumer devices.
All of the supply is going into data center build-outs. As the hyperscaler gamble on the future continues, we get left with weaker (or more expensive) devices, not stronger ones.
The market makers make more money if we're left to thin clients. They're also the ones who control supply and the shapes of devices.
While there are problems that can be solved with 0.6t/sec, particularly offline, at the edge, in the field applications, these are currently vastly outnumbered by other applications.
There's just no competing. Local sucks.
absolutely, however this doesn’t mean we should abandon local. i can’t remember who, but someone in the ai nuts and bolts arena said “smaller local models is where the exciting stuff is happening right now. it’s the area real fast progression is happening.” and it seems to be true. new big models aren’t making near the leaps smaller models are.
it’s so important we keep moving forward on running locally for the same reason it was important for us to use open standards when building the internet. if we hadn’t we’d all be connected through aol with 10 hours/month allowed internet usage and termed in through a sun workstation renting cpu cycles from some mainframe company at like “you’ve got 10,000 cpu cycles left on your monthly plan, please deposit $500 for 5,000 more.”
while all of this is before my time, i’ve heard and read so many horror stories about how people could only connect through dumb terminals to “you wouldn’t believe it, computers then were the size of buildings” 1000 miles away and had to sign up for workload timeslots. make no mistake, this is the future these companies want, they want us to rent everything and own nothing.
I don't know why we can't just get over the local compute thing and instead build open infra and models in the cloud. That's literally the only way we'll be able to keep pace with hyperscalers.
Local is not going to benefit 99% of use cases. It's a silly toy.
If we build open infra for cloud-based provisioning and inference, we could build a future we still have some ownership in. We'd be able to fine tune large models for lots of purposes. We wouldn't be locked in to major vendors.
use the experience we gain from both to bolster the other.
a future where we are unable to locally run is kind of troubling. as is a future with no open cloud. we need both to stop some of the horrors the hyperscalers will happily inflict.
Quantizing is also a cheat code that makes the numbers lie, next up someone is going to claim running a large model when they're running a 1-bit quantization of it.
There's no misleading here, they show every detail from model to quantization to that atrocious time to first token. Stuff like this feels more like code golf than anyone claiming the mainstream phone user is going to even download 100GB of model weights.
Local LLMs are going to make people sit on their phones instead of talking to real people.