506 pointsby ricardbejaranoMar 13, 2026

67 Comments

John23832Mar 13, 2026
RTX Pro 6000 is a glaring omission.
schaeferMar 13, 2026
No Nvidia Spark workstation is another omission.
embedding-shapeMar 13, 2026
Yeah, that's weird, seems it has later models, and earlier, but specifically not Pro 6000? Also, based on my experience, the given numbers seems to be at least one magnitude off, which seems like a lot, when I use the approx values for a Pro 6000 (96GB VRAM + 1792 GB/s)
sxatesMar 13, 2026
Cool thing!

A couple suggestions:

1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB. 2. It'd be great if I could flip this around and choose a model, and then see the performance for all the different processors. Would help making buying decisions!

utopcellMar 13, 2026
Unfortunately, Apple retired the 512GiB models.
ProllyInfamousMar 13, 2026
Sure, but those already sold still exist.
GrayShadeMar 13, 2026
This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s tg with llama.cpp (mmap enabled) on my Radeon 6800 XT.
AurornisMar 13, 2026
At what quantization and with what size context window?
GrayShadeMar 13, 2026
Looks like it's a bit slower today. Running llama.cpp b8192 Vulkan.

$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 65536 -p "Hello"

[snip 73 lines]

[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]

$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 262144 -p "Hello"

[snip 128 lines]

[ Prompt: 78,3 t/s | Generation: 30,9 t/s ]

I suspect the ROCm build will be faster, but it doesn't work out of the box for me.

phelmMar 13, 2026
This is awesome, it would be great to cross reference some intelligence benchmarks so that I can understand the trade off between RAM consumption, token rate and how good the model is
S4phyreMar 13, 2026
Oh how cool. Always wanted to have a tool like this.
adithyassekharMar 13, 2026
This just reminded me of this https://www.systemrequirementslab.com/cyri.

Not sure if it still works.

twampssMar 13, 2026
Is this just llmfit but a web version of it?

https://github.com/AlexsJones/llmfit

deancMar 13, 2026
Yes. But llmfit is far more useful as it detects your system resources.
dgrin91Mar 13, 2026
Honestly I was surprised about this. It accurately got my GPU and specs without asking for any permissions. I didnt realize I was exposing this info.
dekhnMar 13, 2026
How could it not? That information is always available to userspace.
bityardMar 13, 2026
"Available to userspace" is a much different thing than "available to every website that wants it, even in private mode".

I too was a little surprised by this. My browser (Vivladi) makes a big deal about how privacy-conscious they are, but apparently browser fingerprinting is not on their radar.

swiftcoderMar 13, 2026
It's pretty hard to avoid GPU fingerprinting if you have webgl/webgpu enabled
dekhnMar 13, 2026
We switched to talking about llmfit in this subthread, it runs as native code.
rithdmcMar 13, 2026
Do you mean the OPs website? Mine's way off.

> Estimates based on browser APIs. Actual specs may vary

spudlyoMar 13, 2026
I run LibreWolf, which is configured to ask me before a site can use WebGL, which is commonly used for fingerprinting. I got the popup on this site, so I assume that's how they're doing it.
johnisgoodMar 13, 2026
Why were you surprised?

You can check out here how it does that: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...

To detect NVIDIA GPUs, for example: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...

In this case it just runs the command "nvidia-smi".

Note: llmfit is not web-based.

Someone1234Mar 13, 2026
I feel like they both solve different issues well:

- If you already HAVE a computer and are looking for models: LLMFit

- If you are looking to BUY a computer/hardware, and want to compare/contrast for local LLM usage: This

You cannot exactly run LLMFit on hardware you don't have.

rootusrootusMar 13, 2026
That's super handy, thanks for sharing the link. Way more useful than the web site this post is about, to be honest.

It looks like I can run more local LLMs than I thought, I'll have to give some of those a try. I have decent memory (96GB) but my M2 Max MBP is a few years old now and I figured it would be getting inadequate for the latest models. But llmfit thinks it's a really good fit for the vast majority of them. Interesting!

hrmtst93837Mar 13, 2026
Your hardware can run a good range of local models, but keep an eye on quantization since 4-bit models trade off some accuracy, especially with longer context or tougher tasks. Thermal throttling is also an issue, since even Apple silicon can slow down when all cores are pushed for a while, so sustained performance might not match benchmark numbers.
mrdependableMar 13, 2026
This is great, I've been trying to figure this stuff out recently.

One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.

cortesoftMar 13, 2026
Ollama runs a web server that you use to interact with the models: https://docs.ollama.com/quickstart

You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/

rebolekMar 13, 2026
ssh?
g_br_lMar 13, 2026
could you add raspi to the list to see which ridiculously small models it can run?
vova_hn2Mar 13, 2026
It says "RAM - unknown", but doesn't give me an option to specify how much RAM I have. Why?
charcircuitMar 13, 2026
On mobile it does not show the name of the model in favor of the other stats.
debatem1Mar 13, 2026
For me the "can run" filter says "S/A/B" but lists S, A, B, and C and the "tight fit" filter says "C/D" but lists F.

Just FYI.

metalliqazMar 13, 2026
Hugging Face can already do this for you (with much more up-to-date list of available models). Also LM Studio. However they don't attempt to estimate tok/sec, so that's a cool feature. However I don't really trust those numbers that much because it is not incorporating information about the CPU, etc. True GPU offload isn't often possible on consumer PC hardware. Also there are different quants available that make a big difference.
havalocMar 13, 2026
Missing the A18 Neo! :)
arjieMar 13, 2026
Cool website. The one that I'd really like to see there is the RTX 6000 Pro Blackwell 96 GB, though.
ge96Mar 13, 2026
Raspberry pi? Say 4B with 4GB of ram.

I also want to run vision like Yocto and basic LLM with TTS/STT

boutellMar 13, 2026
I've been trying to get speech to text to work with a reasonable vocabulary on pis for a while. It's tough. All the modern models just need more GPU than is available
ge96Mar 13, 2026
Whispr?

For wakewords I have used pico rhino voice

I want to use these I2S breakout mics

meatmanekMar 13, 2026
For ASR/STT on a budget, you want https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 - it works great on CPU.

I haven't tried on a raspberry pi, but on Intel it uses a little less than 1s of CPU time per second of audio. Using https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/a... for chunked streaming inference, it takes 6 cores to process audio ~5x faster than realtime. I expect with all cores on a Pi 4 or 5, you'd probably be able to at least keep up with realtime.

(Batch inference, where you give it the whole audio file up front, is slightly more efficient, since chunked streaming inference is basically running batch inference on overlapping windows of audio.)

EDIT: there are also the multitalker-parakeet-streaming-0.6b-v1 and nemotron-speech-streaming-en-0.6b models, which have similar resource requirements but are built for true streaming inference instead of chunked inference. In my tests, these are slightly less accurate. In particular, they seem to completely omit any sentence at the beginning or end of a stream that was partially cut off.

LeifCarrotsonMar 13, 2026
This lacks a whole lot of mobile GPUs. It also does not understand that you can share CPU memory with the GPU, or perform various KV cache offloading strategies to work around memory limits.

It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser...but I actually have an RTX1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.

It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.

But I don't run out of tokens.

FelixbotMar 13, 2026
The RAM/VRAM cutoff matters more than the parameter count alone. A 13B model in Q4_K_M quantization fits in 8GB VRAM with reasonable throughput, but the same model in fp16 needs 26GB. Most calculators treat quantization as a footnote when it is actually the primary variable. The question is not "can I run 13B" but "what quantization level gives acceptable quality at my hardware ceiling".
itigges22Mar 13, 2026
This is the right framing. I'd add that quantization is only the first dimension -- the second is what you build around the model. A Q4_K_M 14B model running raw inference vs. the same model with structured constraint extraction, diverse candidate sampling, and iterative self-repair are essentially different systems despite identical VRAM footprint.

The real question isn't "what quantization gives acceptable quality at my hardware ceiling" -- it's "what inference pipeline gives acceptable quality at my hardware ceiling." A single-shot Q4_K_M 14B will disappoint you. The same model generating 3 candidates, scoring them with self-embeddings, and self-repairing failures will surprise you. Same GPU, same VRAM, just smarter infrastructure.

sshagentMar 13, 2026
I don't see my beloved 5060ti. looks great though
carraMar 13, 2026
Having the rating of how well the model will run for you is cool. I miss to also have some rating of the model capabilities (even if this is tricky). There are way too many to choose. And just looking at the parameter number or the used memory is not always a good indication of actual performance.
jrmgMar 13, 2026
Is there a reliable guide somewhere to setting up local AI for coding (please don’t say ‘just Google it’ - that just results in a morass of AI slop/SEO pages with out of date, non-self-consistent, incorrect or impossible instructions).

I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.

I suspect this is actually fairly easy to set up - if you know how.

AstroBenMar 13, 2026
Ollama or LM Studio are very simple to setup.

You're probably not going to get anything working well as an agent on an M2 MacBook, but smaller models do surprisingly well for focused autocomplete. Maybe the Qwen3.5 9B model would run decently on your system?

jrmgMar 13, 2026
Right - setting up LM studio is not hard. But how do I connect LM Studio to Copilot, or set up an agent?
brcmthrowawayMar 13, 2026
Basically LM Studio has a server that serves models over HTTP (localhost). Configure/enable the server and connect OpenCode to it.

Try this article https://advanced-stack.com/fields-notes/qwen35-opencode-lm-s...

I'm looking for an alternative to OpenCode though, I can barely see the UI.

AstroBenMar 13, 2026
Codex also supports configuring an alternative API for the model, you could try that: https://unsloth.ai/docs/basics/codex#openai-codex-cli-tutori...
AstroBenMar 13, 2026
It looks like Copilot has direct support for Ollama if you're willing to set that up: https://docs.ollama.com/integrations/vscode

For LM Studio under server settings you can start a local server that has an OpenAI-compatible API. You'd need to point Copilot to that. I don't use Copilot so not sure of the exact steps there

NortySpockMar 13, 2026
I tried the Zed editor and it picked up Ollama with almost no fiddling, so that has allowed me to run Qwen3.5:9B just by tweaking the ollama settings (which had a few dumb defaults, I thought, like assuming I wanted to run 3 LLMs in parallel, initially disabling Flash Attention, and having a very short context window...).

Having a second pair of "eyes" to read a log error and dig into relevant code is super handy for getting ideas flowing.

chatmastaMar 13, 2026
Any time I google something on this topic, the results are useful but also out of date, because this space is moving so absurdly fast.
AstroBenMar 13, 2026
This doesn't look accurate to me. I have an RX9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32tok/s ^.-
misnomeMar 13, 2026
It seems to be missing a whole load of the quantized Qwen models, Qwen3.5:122b works fine in the 96GB GH200 (a machine that is also missing here....)
unfirehoseMar 13, 2026
if you do, would you still want to collect data in a single pane of glass? see my open source repo for aggregating harness data from multiple machine learning model harnesses & models into a single place to discover what you are working on & spending time & money. there is plans for a scrobble feature like last.fm but for agent research & code development & execution.

https://github.com/russellballestrini/unfirehose-nextjs-logg...

thanks, I'll check for comments, feel free to fork but if you want to contribute you'll have to find me off of github, I develop privately on my own self hosted gitlab server. good luck & God bless.

varispeedMar 13, 2026
Does it make any sense? I tried few models at 128GB and it's all pretty much rubbish. Yes they do give coherent answers, sometimes they are even correct, but most of the time it is just plain wrong. I find it massive waste of time.
boutellMar 13, 2026
I'm not sure how long ago you tried it, but look at Qwen 3.5 32b on a fast machine. Usually best to shut off thinking if you're not doing tool use.
orthoxeroxMar 13, 2026
For some reason it doesn't react to changing the RAM amount in the combo box at the top. If I open this on my Ryzen AI Max 395+ with 32 GB of unified memory, it thinks nothing will fit because I've set it up to reserve 512MB of RAM for the GPU.
bityardMar 13, 2026
Yeah, this site is iffy at best. I didn't even see Strix Halo on the list, but I selected 128GB and bumped up the memory bandwidth. It says gpt-oss-120b "barely runs" at ~2 t/s.

In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.

kylehotchkissMar 13, 2026
My Mac mini rocks qwen2.5 14b at a lightning fast 11/tokens a second. Which is actually good enough for the long term data processing I make it spend all day doing. It doesn’t lock up the machine or prevent its primary purpose as webserver from being fulfilled.
freediddyMar 13, 2026
i think the perplexity is more important than tokens per second. tokens per second is relatively useless in my opinion. there is nothing worse than getting bad results returned to you very quickly and confidently.

ive been working with quite a few open weight models for the last year and especially for things like images, models from 6 months would return garbage data quickly, but these days qwen 3.5 is incredible, even the 9b model.

srousseyMar 13, 2026
No, getting bad results slowly is much worse. Bad results quickly and you can make adjustments.

But yes, if there is a choice I want quality over speed. At same quality, I definitely want speed.

meatmanekMar 13, 2026
This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20b don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)

lambdaMar 13, 2026
Yeah, I looked up some models I have actually run locally on my Strix Halo laptop, and its saying I should have much lower performance than I actually have on models I've tested.

For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.

littlestymaarMar 13, 2026
While your remark is valid, there's two small inaccuracies here:

> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

First, the token generation speed is going to be comparable, but not the prefil speed (context processing is going to be much slower on a big MoE than on a small dense model).

Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same amount of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to x3 the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same “experts”, so you'd need to load more weight to the compute units, using more bandwidth).

lambdaMar 13, 2026
So, this is all true, but this calculation isn't that nuanced. It's trying to get you into a ballpark range, and based on my usage on my real hardware (if I put in my specs, since it's not in their hardware list), the results are fairly close to my real experience if I compensate for the issue where it's calculating based on total params instead of active.

So by doing so, this calculator is telling you that you should be running entirely dense models, and sparse MoE models that maybe both faster and perform better are not recommended.

littlestymaarMar 13, 2026
I agree, and I even started my response expressing my agreement with the whole point.

But since this is a tech forum, I assumed some people would be interested by the correction on the details that were wrong.

pbronezMar 13, 2026
The docs page addresses this:

> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.

> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.

https://www.canirun.ai/docs

lambdaMar 13, 2026
It discusses it, and they have data showing that they know the number of active parameters on an MoE model, but they don't seem to use that in their calculation. It gives me answers far lower than my real-world usage on my setup; its calculation lines up fairly well for if I were trying to run a dense model of that size. Or, if I increase my memory bandwidth in the calculator by a factor of 10 or so which is the ratio between active and total parameters in the model, I get results that are much closer to real world usage.
tommy_axleMar 13, 2026
I'm guessing this is also calculating based on the full context size that the model supports but depending on your use case it will be misleading. Even on a small consumer card with Qwen 3 30B-A3B you probably don't need 128K context depending on what you're doing so a smaller context and some tensor overrides will help. llama.cpp's llama-fit-params is helpful in those cases.
nilslindemannMar 13, 2026
1. More title attributes please ("S 16 A 7 B 7 C 0 D 4 F 34", huh?)

2. Add a 150% size bonus to your site.

Otherwise, cool site, bookmarked.

ameliusMar 13, 2026
Why isn't there some kind of benchmark score in the list?
ameliusMar 13, 2026
What is this S/A/B/C/etc. ranking? Is anyone else using it?
vikramkrMar 13, 2026
Just a tier list I think
relaxingMar 13, 2026
Apparently S being a level above A comes from Japanese grading. I’ve been confused by that, too.
swiftcoderMar 13, 2026
It's very common in Japanese-developed video games as well
tcbrahMar 13, 2026
tbh i stopped caring about "can i run X locally" a while ago. for anything where quality matters (scripting, code, complex reasoning) the local models are just not there yet compared to API. where local shines is specific narrow tasks - TTS, embeddings, whisper for STT, stuff like that. trying to run a 70b model at 3 tok/s on your gaming GPU when you could just hit an API for like $0.002/req feels like a weird flex IMO
itigges22Mar 13, 2026
The "local models aren't there yet" take was accurate 12 months ago, but things have moved fast. A frozen Qwen3-14B at Q4_K_M on a single 16GB consumer GPU can clear 70%+ on LiveCodeBench pass@1 if you wrap it in the right inference pipeline -- structured generation, best-of-k candidate sampling, self-verified iterative repair. That puts it in the ballpark of Claude 4 Sonnet's single-shot score.

The insight most people miss is that "running locally" doesn't have to mean "single-shot raw inference and hope for the best." The model is the engine, not the car. You can build constraint extraction, budget-controlled thinking, and self-repair loops around a frozen model and get results that would have seemed impossible at that parameter count a year ago. Cost works out to fractions of a cent per task in electricity.

For narrow tasks like embeddings and TTS, sure, local has always been fine. But for coding and reasoning, the gap has closed dramatically -- you just have to stop treating local inference as "discount API" and start treating it as a compute substrate you control.

hatthewMar 13, 2026
For me and probably many other people, local has nothing to do with cost and everything to do with privacy
sdingiMar 13, 2026
When running models on my phone - either through the web browser or via an app - is there any chance it uses the phone's NPU, or will these be GPU only?

I don't really understand how the interface to the NPU chip looks from the perspective of a non-system caller, if it exists at all. This is a Samsung device but I am wondering about the general principle.

ameliusMar 13, 2026
It would be great if something like this was built into ollama, so you could easily list available models based on your current hardware setup, from the CLI.
rootusrootusMar 13, 2026
Someone linked to llmfit. That would be a great tool to integrate with ollama. Just highlight the one you want and tell it to install.

Quick, someone go vibe code that.

dugidugoutMar 13, 2026
The latest level of abstraction! You just release your ideas half baked in some internet connected box and wake up with products! Yahoo! Onwards into the Gestell!
am17anMar 13, 2026
You can still run larger MoE models using expert weight off-loading to the CPU for token generation. They are by and large useable, I get ~50 toks/second on a kimi linear 48B (3B active) model on a potato PC + a 3090
brcmthrowawayMar 13, 2026
If anyone hasn't tried Qwen3.5 on Apple Silicon, I highly suggest you to! Claude level performance on local hardware. If the Qwen team didn't get fired, I would be bullish on Local LLM.
golem14Mar 13, 2026
Has anyone actually built anything with this tool?

The website says that code export is not working yet.

That’s a very strange way to advertise yourself.

cafed00dMar 13, 2026
Open with multiple browsers (safari vs chrome) to get more "accurate + glanceable" rankings.

Its using WebGPU as a proxy to estimate system resource. Chrome tends to leverage as much resources (Compute + Memory) as the OS makes available. Safari tends to be more efficient.

Maybe this was obvious to everyone else. But its worth re-iterating for those of us skimmers of HN :)

ryandrakeMar 13, 2026
Missing RTX A4000 20GB from the GPU list.
mark_l_watsonMar 13, 2026
I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...

Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lot's of efforts trying to maximize local only results. I don't recommend it for others.

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

nine_kMar 13, 2026
What kind of hardware did you use? I suppose that a 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.
fzzzyMar 13, 2026
A Mac Pro with 512 gb unified ram does not exist.
nine_kMar 13, 2026
Mac Studio Ultra, my bad. The 512 GB option existed up until March 2026: https://macdailynews.com/2026/03/06/apple-drops-512gb-m3-ult...
manmalMar 13, 2026
What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?
kylehotchkissMar 13, 2026
I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality of capability?
johnmaguireMar 13, 2026
I'd love to know how you fit smaller models into your workflow. I have an M4 Macbook Pro w/ 128GB RAM and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.
philipkglassMar 13, 2026
It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.

If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be an exercise in frustration if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to describing and categorizing unstructured data.

saltwoundsMar 13, 2026
I use Raycast and connect it to LM Studio to run text clean up and summaries often. The models are small enough I keep them in memory more often than not
BluecobraMar 13, 2026
I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!
AzN1337c0d3rMar 13, 2026
Most workstation class laptops (i.e. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least, before the current RAM shortages).

There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.

Only Apple has the unique dynamic allocation though.

the_pwner224Mar 13, 2026
Yep, I have a 13" gaming tablet with the 128 GB AMD Strix Halo chip (Ryzen AI Max+ 395, what a name). Asus ROG Flow Z13. It's a beast; the performance is totally disproportionate to its size & form factor.

I'm not sure what exactly you're referring to with "Only Apple has the unique dynamic allocation though." On Strix Halo you set the fixed VRAM size to 512 MB in the BIOS, and you set a few Linux kernel params that enable dynamic allocation to whatever limit you want (I'm using 110 GB max at the moment). LLMs can use up to that much when loaded, but it's shared fully dynamically with regular RAM and is instantly available for regular system use when you unload the LLM.

lambdaMar 13, 2026
> Only Apple has the unique dynamic allocation though.

What do you mean? On Linux I can dynamically allocate memory between CPU and GPU. Just have to set a few kernel parameters to set the max allowable allocation to the GPU, and set the BIOS to the minimum amount of dedicated graphics memory.

lambdaMar 13, 2026
I've got a 128 GiB unified memory Ryzen Ai Max+ 395 (aka Strix Halo) laptop.

Trying to run LLM models somehow makes 128 GiB of memory feel incredibly tight. I'm frequently getting OOMs when I'm running models that are pushing the limits of what this can fit, I need to leave more memory free for system memory than I was expecting. I was expecting to be able to run models of up to ~100 GiB quantized, leaving 28 GiB for system memory, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better max limit when trying not running on a headless system so I'm running a desktop environment, browser, IDE, compilers, etc in addition to the model.

And memory bandwidth limitations for running the models is real! 10B active parameters at 4-6 bit quants feels usable but slow, much more than that and it really starts to feel sluggish.

So this can fit models like Qwen3.5-122B-A10B but it's not the speediest and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) feels quite on speed, though not quite as smart. Still trying out models, Nemotron-3-Super-120B-A12B just came out, but looks like it'll be a bit slower than Qwen3.5 while not offering up any more performance, though I do really like that they have been transparent in releasing most of its training data.

echelonMar 13, 2026
Shouldn't we prioritize large scale open weights and open source cloud infra?

An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need infra to run it. Distilling them down to desktop is a fool's errand. They're meant to run on DC compute.

I'm fine with running everything in the cloud as long as we own the software infra and the weights.

This is conceivably the only way we could catch up to Claude Code is to have the Chinese start releasing their best coding models and for them to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a take off scenario with no bridge.

andy_pppMar 13, 2026
Is it correct that there's zero improvement in performance between M4 (+Pro/Max) and M5 (+Pro/Max) the data looks identical. Also the memory does not seem to improve performance on larger models when I thought it would have?

Love the idea though!

EDIT: Okay the whole thing is nonsense and just some rough guesswork or asking an LLM to estimate the values. You should have real data (I'm sure people here can help) and put ESTIMATE next to any of the combinations you are guessing.

GeekyBearMar 13, 2026
> Is it correct that there's zero improvement in performance between M4 (+Pro/Max) and M5 (+Pro/Max)

Preliminary testing did not come to that conclusion.

> Apple’s New M5 Max Changes the Local AI Story

https://www.youtube.com/watch?v=XGe7ldwFLSE

lostmsuMar 13, 2026
From the video: 4.4k is "almost" 4x times 1.8k because 4.4k has "number 4" in the beginning, and the other one - number 1.

For the lazy: that's less then 3x: 1.8 * 3 = 5.4

mkageniusMar 13, 2026
Literally made the same app, 2 weeks back - https://news.ycombinator.com/item?id=47171499
zitterbewegungMar 13, 2026
The M4 Ultra doesn't exist and there is more credible rumors for an M5 Ultra. I wouldn't put a projection like that without highlighting that this processor doesn't exist yet.
rcarmoMar 13, 2026
This is kind of bogus since some of the S and A tier models are pretty useless for reasoning or tool calls and can’t run with any sizable system prompt… it seems to be solely based on tokens per second?
polyterativeMar 13, 2026
awesome, needed this
tristorMar 13, 2026
This does not seem accurate based on my recently received M5 Max 128GB MBP. I think there's some estimates/guesswork involved, and it's also discounting that you can move the memory divider on Unified Memory devices like Apple Silicon and AMD AI Max 395+.
bheadmasterMar 13, 2026
Missing 5060 Ti 16GB
tencentshillMar 13, 2026
Missing laptop versions of all these chips.
mopierottiMar 13, 2026
This (+ llmfit) are great attempts, but I've been generally frustrated by how it feels so hard to find any sort of guidance about what I would expect to be the most straightforward/common question:

"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"

(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.

I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.

J_Shelby_JMar 13, 2026
It’s a hard problem. I’ve been working on it for the better part of a year.

Well, granted my project is trying to do this in a way that works across multiple devices and supports multiple models to find the best “quality” and the best allocation. And this puts an exponential over the project.

But “quality” is the hard part. In this case I’m just choosing the largest quants.

downrightmikeMar 13, 2026
LLMs are just special purpose calculators, as opposed to normal calculators which just do numbers and MUST be accurate. There aren't very good ways of knowing what you want because the people making the models can't read your mind and have different goals
A7OMMar 13, 2026
Great tool for local inference. The flip side question is always 'should I run it locally or use a cloud API?' The answer depends heavily on volume and current vendor pricing. Cloud inference costs have been surprisingly volatile lately — we tracked 30 price changes across 615 models just this week.
JulianPembrokeMar 13, 2026
Tools like this are crucial for the local AI movement. What I've found in practice is that the 7-8B parameter models with Q4_K_M quantization hit a sweet spot for most developer machines, giving you 90%+ of the capability at a fraction of the memory footprint. The bigger unlock here isn't just cost savings though, it's data sovereignty. When you can run inference without your prompts leaving your machine, you can actually use LLMs for sensitive code reviews, proprietary data analysis, and internal tooling that you'd never trust to a cloud API. Would love to see this tool also flag which models have good tool-calling support since that's increasingly what separates "neat demo" from "production-ready."
A7OMMar 13, 2026
Great tool for local inference. The flip side question is always 'should I run it locally or use a cloud API?' The answer depends heavily on volume and current vendor pricing. Cloud inference costs have been surprisingly volatile lately. We tracked 30 price changes across 615 models just this week.
JulianPembrokeMar 13, 2026

  Tools like this are crucial for the local AI movement. What I've found in practice is that the 7-8B parameter models with Q4_K_M quantization hit a sweet spot for most developer machines, giving you 90%+ of the capability at a fraction of the memory footprint. The bigger unlock here isn't just cost savings though, it's data sovereignty. When you can run inference without your prompts leaving your machine, you can actually use LLMs for sensitive code reviews, proprietary data analysis, and internal tooling that you'd never trust to a cloud API. Would love to see this tool also flag which models have good tool-calling support since that's increasingly what separates "neat demo" from "production-ready."
kennywinkerMar 13, 2026
Are you using 7/8b models for coding? I keep getting the impression from what i read that 8b is only good for autocomplete. Also, it seems like an 8b model will run on a $100 2nd hand gpu (e.g. an 8gb gtx 1050/1060/1070 kind of thing) - why would you need to quantize?
SXXMar 13, 2026
Sorry if already been answered, but will there be a metric for latency aka time to first token?

Since I considered buying M3 Ultra and feel like it the most often discussed regarding using Apple hardware for runninh local LLMs. Where speed might be okay, but prompt processing can take ages.

teaearlgraycoldMar 13, 2026
Wait for the M5 Ultra. It will get the 4x prompt processing speeds from the rest of the M5 product line. I hear rumors it will be released this year.
tkfossMar 13, 2026
Nice UI, but crap data, probably llm generated.
anigbrowlMar 13, 2026
Useful tool, although some of the dark grey text is dark that I had to squint to make it out against the background.
lagrange77Mar 13, 2026
Finally! I've been waiting for something like this.
mmaunderMar 13, 2026
OP can you please make it not as dark and slightly larger. Super useful otherwise. Qwen 3.5 9B is going to get a lot of love out of this.
ProllyInfamousMar 13, 2026
I'm not usually one to whine, but agreed; additionally, add contrast to the modifiers (e.g. processor select). First thing I did when I visited was scale the website to 150%

Super impressive comparisons, and correlates with my perception having three seperate generations of GPU (from your list pulldown). Thanks for including the "old AMD" Polaris chipsets, which are actually still much faster than lower-spec Apple silicon. I have Ollama3.1 on a VEGA64 and it really is twice as fast as an M2Pro...

----

For anybody that thinks installing a local LLM is complicated: it's not (so long as you have more than one computer, don't tinker on your primary workhorse). I am a blue collar electrician (admittedly: geeky); no more difficult than installing linux.

reactordevMar 13, 2026
This shows no models work with my hardware but that’s furthest from the truth as I’m running Qwen3.5…

This isn’t nearly complete.

kennywinkerMar 13, 2026
Well… don’t keep us guessing -what hardware? And which size qwen3.5?
azmenakMar 13, 2026
From my personal testing, running various agentic tasks with a bunch of tool calls on an M4 Max 128GB, I've found that running quantized versions of larger models to produce the best results which this site completely ignores.

Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)

bearjawsMar 13, 2026
So many people have vibe coded these websites, they are posted to Reddit near daily.
kuonMar 13, 2026
I have amd 9700 and it is not listed while it is great llm hardware because it has 32Gb for a reasonable price. I tried doing "custom" but it didn't seem to work.

The tool is very nice though.

ipunchghostsMar 13, 2026
What is S? Also, NVIDIA RTX 4500 Ada is missing.
fraywingMar 13, 2026
This is amazing. Still waiting for the "Medusa" class AMD chips to build my own AI machine.
kpw94Mar 13, 2026
People complaining about how hard to get simple answer is don't appreciate the complexity in figuring out optimal models...

There's so many knobs to tweak, it's a non trivial problem

- Average/median length of your Prompts

- prompt eval speed (tok/s)

- token generation speed (tok/s)

- Image/media encoding speed for vision tasks

- Total amount of RAM

- Max bandwidth of ram (ddr4, ddr5, etc.?)

- Total amount of VRAM

- "-ngl" (amount of layers offloaded to GPU)

- Context size needed (you may need sub 16k for OCR tasks for instance)

- Size of billion parameters

- Size of active billion parameters for MoE

- Acceptable level of Perplexity for your use case(s)

- How aggressive Quantization you're willing to accept (to maintain low enough perplexity)

- even finer grain knobs: temperature, penalties etc.

Also, Tok/s as a metric isn't enough then because there's:

- thinking vs non-thinking: which mode do you need?

- models that are much more "chatty" than others in the same area (i remember testing few models that max out my modest desktop specs, qwen 2.5 non-thinking was so much faster than equivalent ministral non-thinking even though they had equivalent tok/s... Qwen would respond to the point quickly)

At the end, final questions are: are you satisfied with how long getting an answer took? and was the answer good enough?

The same exercise with paid APIs exists too, obviously less knobs but depending on your use case, there's still differences between providers and models. You can abstract away a lot of the knobs , just add "are you satisfied with how much it cost" on top of the other 2 questions

paxysMar 13, 2026
I wish creators of local model inference tools (LM Studio, Ollama etc.) would release these numbers publicly, because you can be sure they are sitting on a large dataset of real-world performance.
gopalvMar 13, 2026
Chrome runs Gemini Nano if you flip a few feature flags on [1].

The model is not great, but it was the "least amount of setup" LLM I could run on someone else's machine.

Including structured output, but has a tiny context window I could use.

[1] - https://notmysock.org/code/voice-gemini-prompt.html