I am somewhere in the middle, where I want something with more than 48GB/$2k of VRAM, but less than 384GB/$40k.
I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.
sampullman•Jul 3, 2026
I picked up the 128gb version when it was $2,199 and it runs Qwen 3.6 reasonably well with a 128kb context. Not very useful for complex tasks but it can handle some web stuff.
mft_•Jul 3, 2026
It has lower memory bandwidth than most comparable Macs.
verdverm•Jul 3, 2026
I've been happy with an OEM Spark (128G), enough so that I picked up a second one. Have 2x qwen and 1x gemma (both at 8bit and full context), plus embedding, Re-Ranker, and a 1.7B for little things. Running 6x models, probably going to add STT here soon, want to try talking more than typing.
The caveat is that if you try to use multiple models on the same device at the same time, you thrash and destroy tok/s
datadrivenangel•Jul 3, 2026
"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."
Just want to note that for $3k you can get an M5 macbook pro with 48gb of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper if not significantly cheaper. It is awesome being able to run models locally though.
jbellis•Jul 3, 2026
That's a reasonable option, just be aware that you get about 1/3 as much memory bandwidth with the M5 Pro, or 2/3 with the M5 Max [now you're at $4100 for the lowest-end]. So both your prefill (flops-bound, M5 has a lot less) and decode (bw-bound) will be slower.
LeBit•Jul 3, 2026
I’m an idiot who is unable to project itself in situations I’ve never experienced before.
So, I always thought local LLMs were toys not worth pursuing.
Only once have I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.
You stop fearing you are sharing sensitive information.
You stop fearing you will run out of tokens.
You stop fearing about the availability of the remote AI.
Local LLMs are extremely valuable.
bityard•Jul 3, 2026
*for many tasks
Aurornis•Jul 3, 2026
I have an M5 MacBook Pro and I also have a separate GPU setup for running models. The difference in speed is significant. It's not just token generation speed, but time to first token (prompt processing).
The M5 hardware is amazing for what it is, but GPUs are still so much faster.
Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.
amelius•Jul 3, 2026
What is your GPU setup?
brcmthrowaway•Jul 3, 2026
tinygpu kernel driver
boredatoms•Jul 3, 2026
The standalone mini/studio is better if you dont want to have a constantly hot laptop
Get a regular laptop and use the network to access the LLM
amelius•Jul 3, 2026
You can also buy a Jetson Orin with 64GB of unified memory.
WithinReason•Jul 3, 2026
I'm running Qwen3.6-27B on a single 24GB GPU at 80 tok/s, you don't even need 2 of them
npodbielski•Jul 3, 2026
Yeah but 4 bits very often loops needlessly. Which is not that bad because you do not pay for tokens. But you paid for hardware and you want use it for something useful. Q6 is better but then you have like 40t/s prefill. Which is really tiring. But at least it says sorry when you ask it what is wrong! I heard there is some extension for PI preventing that. I need to look into it.
Otherwise I am quite happy.
Der_Einzige•Jul 3, 2026
You can fix looping with proper repetition penalties. Turn on the one called “DRY” that PeW invented and got merged into llama cpp
npodbielski•Jul 4, 2026
I added repetition penalty of two. I do not know. Maybe it is not applied correctly somehow by llama swap, that I am using, but I do not consider it that much of a nuisance to so I did not tried to fix it yet.
Do you have this DRY docs?
Zambyte•Jul 3, 2026
"Very often" sounds like a lot more than I would say. I've been using Qwen 3.6 27b Q4 in Pi (with out any anti-looping extension) daily for weeks now, and I've had it get stuck in an infinite loop maybe 3 or 4 times.
mips_avatar•Jul 3, 2026
The cool thing about the 3090s is the RAM bandwidth. Token generation is mostly bottlenecked on memory bandwidth. Dual 3090s have 1.87 TB/s memory bandwidth (0.936 TB/s each), vs the M5 Macbook pro with only 0.3 TB/s (max chip has up to 0.63 TB/s but it's a $10k machine at that config).
This translates to qwen 27b actually working fast enough for useful work on dual 3090s and being painfully slow on Macbook Pros. Also if you're running a big model on a macbook pro the UI gets laggy and the keyboard gets hot. Much better to run dual 3090s in your basement and connect to them from your Macbook.
CobaltFire•Jul 3, 2026
$4.8k for 48GB Max (what the parent said). Half of your quote.
Even a 128GB is $6.8k today. Still only 2/3 your quote.
Bandwidth is relevant (I have both a 5090 and an M4 Max 128GB Studio, so have direct comparison right here), but quote the cost appropriately!
mips_avatar•Jul 3, 2026
You need the 128gb ram config to get the 614 GB/s bandwidth (which is $6999), you could skip out on upgrading the storage to save money but at that point I think most people upgrade the storage too at which point it's $8-10k + tax.
CobaltFire•Jul 3, 2026
No? Any M5 Max with the upgraded GPU has the full bandwidth, which includes the 48GB model the original poster mentioned. Same as the M4 Max, where only the trimmed part had a lower bandwidth.
Why are you throwing in extra cost for something thats not necessary? I know multiple people with 128GB Macs and none of us upgraded the storage. Especially not on a Studio (which isn't currently available).
I will say that their $3k number is off. I somehow missed that, and its too low.
mips_avatar•Jul 3, 2026
I made a mistake, there is a $5k config with high memory bandwidth. The Max chip has two tiers (I incorrectly thought the tiers were based on memory capacity), you need the higher tier Max GPU upgrade (+$300) to get the 614 GB/s memory bandwidth but you don't need to upgrade the RAM to 128 GB to get the full memory bandwidth. So to get the 614 GB/s you need to upgrade to the max chip + upgraded GPU, but you can spec it at only 48gb if you want. So the total for an m5 max with 614 GB/s memory bandwidth is $4999-$9999 depending on config.
Still 3x lower memory Bandwidth than a dual 3090 setup which you can build for $3k with parts from facebook marketplace and run in your basement.
titanomachy•Jul 3, 2026
The bandwidth argument is compelling, do we have benchmarks for these models? I’m curious what it translates to in tokens per second
mips_avatar•Jul 3, 2026
I benchmarked mine for a deep research workload I was running. Concurrency 1 is the speed you'd get if you're chatting with one agent,
2x3090 (has an nvlink bridge though it didn't seem to matter hugely for inference)
Macbook Pro m3 36gb RAM:
Qwen 3.6 27b int4:
Concurrency 1: 18 tok/s output didn't measure the other metrics and it was a slightly different benchmark.
titanomachy•Jul 4, 2026
Yeah that’s a huge difference. I don’t think I want to interactively use any model with 18 tps.
satvikpendem•Jul 3, 2026
To summarize a video I saw recently [0] rebutting your arguments, MacBooks can get crazy slow when running local models or even just Claude Code and Codex due to their poor implementation, to the point that the laptop itself becomes unusable.
There are other arguments for running an ssh-able box in a closet somewhere too as with KVMs you can give an agent remote control over the machine itself such that it has vastly more capabilities than if it were controlling its own machine it's running on, as well as not needing to keep the MacBook open all the time just to have the agent finish running.
No, he’s running GLM 5.2, which is closer to SOTA.
verdverm•Jul 3, 2026
It can be considered SOTA within is size category. Very useful for many things. You still want access to big models, I recommend OpenCode Go if you want to stay with open models.
zackify•Jul 3, 2026
You can get amazing local STT using parakeet which can use as little as 600mb of vram. Better or as good as whisper v3 large
mrgaro•Jul 4, 2026
Have we reached the capability of a local STT+LLM system being constantly listening for normal speech in a room and being able to understand when the human is addressing the system instead of talking to another human?
kgeist•Jul 3, 2026
>$40k gets you almost-Opus
GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).
They suggest using this modified model:
>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.
I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8 bit for coding.
Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context
P.S. found in the repos, 240k context
amelius•Jul 3, 2026
How does this work with scaling?
I assume you can then somehow run several hundreds of prompts concurrently?
CamperBob2•Jul 3, 2026
You can get 1M context with the lukealonso NVFP4 quant on 8x RTX6000s, which remains coherent and useful through at least 400k. No real need to run 8x H200s unless you just want to. Or unless you need to serve many concurrent users or agents on a regular basis.
rsync•Jul 3, 2026
"GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference ..."
What is the behavior if one were to run GLM 5.2 with only a single H200 ?
Would it fail to run at all, or would it just run so slowly as to be unusable ?
I would like to prove out the build, and concept, of a SOTA model locally, but then backfill the rest of the GPUs in 18-24 months when they cost significantly less ...
BoorishBears•Jul 3, 2026
> in 18-24 months when they cost significantly less ...
going to need you to sit down for this one...
sanderjd•Jul 3, 2026
Say more. My expectation is that the current gen of gpus will start being replaced by the next gen, and then it may be possible to get used ones that are still within their useful life at lower prices. My expectation is also that memory vendors are likely to increase production, which will drive those prices down eventually. Maybe not over the next 18-24 months though.
scheme271•Jul 4, 2026
Given the prices and shortages, I'd think people would keep and use the current gen stuff till it drops dead. It may not be as good but it's paid for and given the prices for next gen stuff, it's probably worth using for another cycle or two.
BoorishBears•Jul 4, 2026
The only thing that diminishes the value of a GPU right now is unsupported features with outsized value during inference and/or training (like FP4 support) and it takes time for those features to actually take off
And labs are fully leaning into pricing for intelligence, so their margins are improving very quickly (which allows them to pay even more for existing compute)
I'd be shocked if current prices aren't the bottom for the next 18-24 months.
bradfa•Jul 4, 2026
Many newer Chinese lab models are releasing with int4 native weights. Latest NVIDIA generation GPUs have a hard time with this and can actually be slower than previous generations. This may make Blackwell depreciate faster than other recent generations.
noodletheworld•Jul 4, 2026
I sat in a meeting 7 weeks ago where senior leaders said they expect token prices to drop significantly over the next 6 months, and we should all be using as much AI as possible; our team goal was set to use more tokens.
This week, we are banned from using anything more expensive than opus 4.6 and encouraged to use sonnet (but not sonnet 5! Thats expensive!) or lower for daily tasks to help manage costs.
Weeks ago, they gave exactly the same justification as you just gave; and it makes sense!
…but maybe not over the next 6 months.
> Maybe not over the next 18-24 months
Maybe not. Probably not, I guess.
A lot of money has been invested on the expectation that the current gen of hardware is going to reap a colossal profit, and the capex to replace it, is vanishing into investor skepticism as we speak.
It seems like most people have a very very low ability to forecast long horizon change in the current environment, but, in general… it seems like until demand drops, the chances of prices dropping is dubious; at best we get a price war with chinese models or a bubble pop; and even then, there are plenty of startups lurking to snap up cheap hardware.
For individuals, the horizon for buying cheap AI capable compute doesn’t seem close, at all, to me.
Der_Einzige•Jul 3, 2026
Looping, like most other phenomenons related to LLMs, is a sampling problem and can be easily solved with the DRY penalty. It’s in llamacpp. The same guy who wrote heretic invented the SOTA antilooping and diversification strategies.
api•Jul 3, 2026
Apple M series chips deserve a mention as another option, especially since you get a whole Mac laptop or desktop workstation too.
They have unified memory and respectable inference performance, and for some variations can be cheaper than video cards, especially if you get an older-gen high-end M series with a lot of RAM used or refurbished.
I've read that Apple has plans once the RAM bottleneck passes to offer more RAM in all their models, and that future M series GPUs and NPUs will be even better for local inference, so in the future I expect Apple to be a serious offering for local inference and AI research workstations.
And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.
At this point though, I think we may be in a "renters market" for LLM compute. If you want privacy it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it cause it's cool.
mwcampbell•Jul 3, 2026
> once the RAM bottleneck passes
Do we have evidence that this will actually happen? Maybe the belief that it won't pass is what requires evidence, but I think there's a widespread feeling right now that things are just getting permanently worse and this is one example.
justincormack•Jul 3, 2026
Micron have sold RAM for the next 4 years at current prices, so there are buyers expecting this to stay the same.
api•Jul 3, 2026
That means buyers have basically purchased options. If the price falls, they're underwater a little, but if the price spikes it protects them.
People do that all the time, and sometimes it doesn't pay off.
api•Jul 3, 2026
It'll probably take a few years. There's many fabs under construction.
One thing holding back capacity expansion is that a lot of people are concerned this is a bubble. They're worried it'll pop and leave them with orphaned assets if they over-invest in production.
Of course maybe they're right and that will happen. If the data center construction boom ends, RAM prices will fall.
maxxxml•Jul 3, 2026
MLX is super underrated right now, tons of performance unlocked as of recent. Love to see it!
wxw•Jul 3, 2026
I agree that local LLMs are the likely future and worth investing in… but at $40k for possible-SOTA right now, this isn’t worth it for the average consumer.
I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.
Aurornis•Jul 3, 2026
I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.
The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long context coding tasks and the quality will be noticeably less. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and even some times the full 16-bit source.
This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.
The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...
CamperBob2•Jul 3, 2026
All very true. Right now, running GLM 5.2 at its full BF16 quantization level needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.
The best NV4FP quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.
It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.
Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.
Aurornis•Jul 3, 2026
> It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.
The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.
You will almost certainly never break even compared to paying per token.
Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.
jobeirne•Jul 3, 2026
Or if you want to hedge against the various tail risks of third-party providers raising prices or denying you service or somehow abusing your data...
Aurornis•Jul 3, 2026
> hedge against the various tail risks of third-party providers raising prices
They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.
> or denying you service
I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.
> or somehow abusing your data...
If data security is your concern then you’re better renting a server as needed still.
If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
gizajob•Jul 4, 2026
People seem to miss that with local models you can have them burning their wee digital brains out 24/7, which is a different class of AI usage than that from online models even at a few dollars per million tokens.
CamperBob2•Jul 4, 2026
There's a definite psychological branch point. With a remote provider, no matter how readily you can afford it, your mindset is always going to be, "I should think twice about what I'm doing. I hate to waste tokens." With your own hardware, your mindset is more like, "I should try to get more done. I hate to see this thing just sitting there idle."
incrudible•Jul 3, 2026
Raising prices is not a tail risk, anything a local LLM setup can do for you can be done by any cloud provider, with the same capex as yours (or less), there is no moat here, so it is highy price competitive and will remain so. If you want to speculate on hardware shortages, that is a different business altogether and you need no janky garage setup to profit.
CamperBob2•Jul 3, 2026
Also agreed, it's definitely a sucker's game to run a high-end model locally, by any objective measure.
Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.
gizajob•Jul 4, 2026
Never say never. When the free money party stops, then those token costs are going to have to go up and up. The fact there’s such a glaring disparity between the cost of running AI locally and the pennies it costs to use an online model shows how heavily funded those platforms are right now. This is not and cannot be sustainable.
sho•Jul 4, 2026
> When the free money party stops
The Openrouter providers the GP referenced were never at the "free money party". The actual cost of running something like GLM5.2 is well understood and tokens from those providers are not sold at a loss.
Obviously running things locally is more expensive but that all comes down to economies of scale. GLM5.2 is as expensive as it will ever be, barring an increase in demand that forces/allows providers to realise windfall gains disconnected from their underlying costs (always possible, but not the point).
thinkmassive•Jul 3, 2026
Another option is renting cloud GPUs only when you need them. A server with 8x B200 is around $32/hr.
Obviously depends on the use case and threat model, but that hardware is publicly available at far less than $500k upfront.
KronisLV•Jul 4, 2026
> $100K USD
With z.AI GLM Coding Subscription for 1344 USD per year, that buys you 74 years.
Maybe if you want to host the model for a group of people or really need no artificial token limits, or maybe cannot use cloud models, then it makes more sense.
gizajob•Jul 4, 2026
Nah we’re in like desktop PCs in the 90s type days - bit clunky and maybe occasionally having to work out what an IRQ number is, but a long long way from hand-toggling switches just to get a “hello world” punched out onto paper tape. You can go to an Apple Store today and an hour later have your AI agent talking to LM Studio, and it configuring your MacBook to code and do useful work while also running a diffusion model in the background. Slowly, but not “hand toggling hex switches” slow.
ttoinou•Jul 3, 2026
Well you could make a REAP with better input prompts on longer context then. It’ll improve the REAP quality
FuckButtons•Jul 3, 2026
I’ve found ds4 on my mbp to be very useful, bought before ram prices became insane. It’s not writing entire applications on it’s own, it has resolved annoying networking issues on my tailnet that I had neither the time nor inclination to figure out on my own and I often find myself reaching for it for simple but annoyingly research intensive tasks that I wouldn’t have otherwise gotten to. Is it opus? No, but is it useful? absolutely and I don’t have to worry about whether or not I’m getting value out of a subscription or the api cost of using it.
zozbot234•Jul 3, 2026
> The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.
This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes this will be slow and usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck) but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.
Aurornis•Jul 3, 2026
SSD streaming throughput is too slow to be usable.
GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Every expert load would take more than a second.
If you had enough RAM and enough SSDs in parallel you might get a couple tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to get 200,000 tokens generated.
So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.
You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.
CuriouslyC•Jul 3, 2026
You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.
rsalus•Jul 3, 2026
surely the supply of unified memory will rise to meet demand before this is needed
searealist•Jul 3, 2026
It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.
odo1242•Jul 3, 2026
This is similar to my experience with (8-bit quantized, non-MOE, 26b) Qwen locally on my computer. It’s really good for small tasks, but the first time I tried to do a major task with it it straight up forgot what agent harness it was in and started using the wrong format for tool calls lol
(If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)
aka-rider•Jul 4, 2026
Model+harness combination means a lot. That's why all major labs are making their own. All models have quircks harnesses know about "you are reading the same file 3rd time you are in the loop, step back"
I tried all frontier Chinese models, and Qwen is the one running the best in ClaudeCode, my personal theory, it's because Qwen was distilled from Opus.
bloat•Jul 3, 2026
They do say the cards were purchased when they were cheaper. They debuted at less than nine grand apparently.
vient•Jul 3, 2026
Wonder if AMD MI350P release will affect setups like this. From what I've heard, the price will be pretty similar to RTX PRO 6000 while having 50% more VRAM which is additionally an HBM3E instead of GDDR7.
bradfa•Jul 4, 2026
I’m also watching Intel Celestial with 160GB of LPDDR. Noticed lower memory throughput than AMD or NVIDIA, but potentially significantly lower cost per card. Two of them would likely run deepseek-v4-flash sized models pretty decently.
Der_Einzige•Jul 3, 2026
Everything in this post is spot on and it is a rare example of a HN person not saying BS about LLMs!
That said, modern LLM sampling algorithms like min_p, top_n sigma , etc heavily mitigate the performance penalty you get from doing long context tasks. Problems with long context come from accumulation of small sampling errors over time.
My qwen 3.6 27b (the dense one) runs perfectly well on coding tasks at the edge of its context window because I run it using modern LLM sampling stack, namely top N sigma of one, using DRY to stop repetitions and XTC as a superior alternative to temperature for diversification.
Yes there will be a paper soon on arxiv and hopefully NeurIPS proceedings talking about this phenomenon because it’s not well appreciated by the academic AI community yet.
pulse7•Jul 3, 2026
Can you please share you llama.cpp server parameters to turn on modern LLM sampling stack?
Docs [1] say that the top_n_sigma is already in the default sampler list:
"(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)"
Yeah, I really wish articles and comments about "<model> running locally" also reran the same common benchmarks published to compare the results.
nullc•Jul 3, 2026
> The big build in article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
Just two months ago you could get RTX PRO 6000's for about $8500 on ebay, which is the MSRP.
Aurornis•Jul 3, 2026
> Just two months ago you could get RTX PRO 6000's for about $8500 on ebay, which is the MSRP.
The MSRP was raised to $13,250.
Warranty is very important for expensive cards like this. I don't recommend buying on eBay unless they come with a very big discount.
stingraycharles•Jul 3, 2026
I would very much recommend first using a cloud vendor and setting up an LLM running on there to get a taste of what it’s like before buying the full hardware.
Abishek_Muthian•Jul 4, 2026
Absolutely true. All this craze about running coding LLMs locally has been detrimental to local AI where purpose built SLMs could actually be beneficial.
Little tools for NLP, TTS, image processing, audio engineering, signal processing, diffusion plugin for Krita etc. are all great for local setup. I wrote a small piece on it few days back[1].
I run Qwen3.6 on RTX4090, and it does amazing job for the most parts.
For coding task, one needs to break the session among multiple calls
I made https://github.com/aka-rider/orqestra but it's possible to do the same in almost any modern harness directly.
The main idea is:
- separate session that burns context on reading code and calling tools (context7, etc) -> markdown report "here are relevant patrts of code, docs" "with evidence" to prevent hallucinations
- separate session for planning (architect)
- (critic <-> architect) 1-3 times because small model skip over details
- worker <-> validator, again, the same reason
Qwen3.6 can run for hours looking for a complex bugs in read-only mode, and usually it gets it. Proposed fix would probably be hacky, but so as Sonnet's
Qwen3.6 can mechanically write code by Opus-made plan. You would have to prompt afterwards:
"Review your own changes. Any bugs? Cross-validate against the original plan - any gaps? Any violations of CLAUDE.md"
But again, I need to do this for Sonnet.
But also I use local llms for reindexing knowledge base.
Grooming tickets: I can leave a caveman note "single panel for errors rendering, move all error messages" and come back to 90% ready specs with the end goal and context.
NBJack•Jul 4, 2026
I'm afraid prompts and clever arrangements of data don't really negate the parent post warnings. It's great if it works for you and your projects. Unfortunately, I can almost guarantee your approach will break down once you get a project large enough or switch to a less popular language.
My favorite example is Godot; most local models just can't get it through their thick AI skull that code alone won't be enough to generate working solutions. They must accept a more complex harness, or you must provide much more info that eats the precious available context on every run.
aka-rider•Jul 4, 2026
There is no replacement for large models, indeed. And this is not the point I'm trying to make. There are numerous applications for self-hosted models.
As a simplest example, when you ask "explain what this code does" advantage of large models is negligible.
I tried Fable, "look at this repo, find all bugs" — yeah, neither Qwen nor Opus can do this.
> I can almost guarantee your approach will break down once you get a project large enough or switch to a less popular language.
I can guarantee you it is not, I used my Qwen on 10-15 years of PHP — I just know how and where it will break; what to ask for, what not. Orqestra was/is self-hosted, being developed by, well, orchestra of Qwen agents.
Moreover, Opus and GPT-5.5 break similarly, yeah they will withstand much more pressure, but they will hallucinate and loop nevertheless. My Qwen experience translates seamlessly.
I learned so much about agentic engineering, harnesses, tooling, building custom MCPs...
TacticalCoder•Jul 4, 2026
> Qwen3.6 can run for hours looking for a complex bugs in read-only mode, and usually it gets it. Proposed fix would probably be hacky, but so as Sonnet's
I'll go on a tangent but to me that's what we're all seeing. It's the "record number of CVEs found by AIs" thing: these tools are extremely good at searching inside code. And that is a godsend.
We' got people (claiming they're from Anthropic) posting comment saying: "Yes GLM 5.2 found that security bug in library xxx, but we just tried with Fable and it found it too".
More code-searching, more bugs finding. Dick-measuring contests on bug finding abilities.
But the headlines we don't see at all are: "1000 CVEs found by AI, 1000 CVEs fixed by code written by AI". These are nowhere to be found.
We don't see "GLM 5.2 suggested an elegant fix to CVE-2027-xxxxxx" to then have a paid Anthropic shill posting "Fable suggested an eleganter fix than GLM 5.2".
These headlines are, as of 2026, nowhere.
You wrote the result would be "hacky". Here's why I saw from a top, paid for, SOTA model from the top company of the moment: instead of doing two integers comparison (literally one line of code) to verify that a value is between a range, the thing somehow noticed a "pattern" in the hexadecimal representation of the two values and went insane. It started converting the value to its hexadecimal string representation and then started doing substring string matching on that.
"Hacky" is too nice of a word.
This is pure garbage.
Those who go hiking "while their agents ship features" don't realize the level of underperforming, buggy, insecure crap that their LLMs are generating.
I found it very interesting the schism between those who use LLMs to find issues but who verify/modify or even don't use at all the fix they suggest and those who vibe-code while on a yoga retreat.
It's 2026: LLMs do find bugs. But can they fix them?
And do we even care: isn't finding a bug 99% of the job?
aka-rider•Jul 4, 2026
The best metaphor I heard about LLMs so far - it's a search engine. The bigger the model the bigger the search space. Small models tend to have a "tunnel vision" or fall into "rabbit holes" - they have less visible options to choose from.
> underperforming, buggy, insecure crap that their LLMs are generating
The biggest challenges with AI-generated code are: models actively destroy security features, Opus explained to me once that authorization mechanism is "bad development experience" all while making a backdoor (he made a skeleton key if token=="test" then all permissions granted).
Also models actively destroy QA gates. I don't even complain when they delete tests - at least it's visible, they can flip condition to make a test pass, and with vast code changes these are hard to spot.
I myself, and some people I know "vibe-code" professionally though, but then we often assess not the code but it's behaviour. For instance, whether hand-made tests are all pass, p95 is under 50ms, and so on, I may not care about the implementation details.
On the other hand, my friend told me about garage owner he visited, 60 yrs old auto-mechanic, CRM, parts inventory management, payments processing terminal, passwords in txt, people's personal data God knows where, could be unprotected MySQL looking into the Internet bare for all we know.
2026 onwards will be wild.
aayush0325•Jul 4, 2026
my experience has been similar, qwen is very good at ALMOST getting the job done for large tasks and does fine on smaller/medium tasks.
nullbio•Jul 4, 2026
The models will improve and the hardware will remain useful. It's likely a good investment regardless, if you have the money to spend. Plus your business won't be stolen by Anthropic.
turova•Jul 3, 2026
For qwen3.6-27b you can also run the q4 variant with full ~250K context on one 3090. It's fast enough to not be frustrating so the speed gains with 2x 3090s wouldn't be worth it to me. Running a q6 on 2x 3090s at half the speed with a smaller context is an option, but you're really not going to compete with SOTA models there anyway so unless you already have 2x 3090s, I would say 1 is the best investment given current prices. It's good enough to do a lot, especially with a well-configured harness.
hypfer•Jul 3, 2026
That math (250k context, Q4 model, 24GB VRAM) only checks out at q4 quant for the K/V cache, which is probably not the best idea.
nabakin•Jul 3, 2026
Are you running qwen3.6-27b on one 3090 with your KV cache at q4? Ime there is significant long-context recall accuracy degradation at that precision. I prefer putting the KV cache at q8 and working with the 120k context
Der_Einzige•Jul 3, 2026
Use modern samplers and you don’t need to limit yourself to 8bit at half the context window. I could push it down to 1.58 bits and get decently good output easily by simply not using the garbage default top_p and top_k that vendors continue to wrongly recommend.
anon373839•Jul 3, 2026
Where do you find optimal samplers and sampler settings for these models? Very interested in this as I, too, use Q8.
chompychop•Jul 3, 2026
Is Whisper still considered SOTA for STT? Since it came out years ago, I'd have assumed there are better models by now.
randomblock1•Jul 3, 2026
No, there are quite a few models which are smaller, more accurate, and faster. For example Parakeet TDT v3 is half the size, way faster, and lower WER. There's also Voxstral, which is much larger but also even more accurate.
But the ecosystem isn't as mature, so Whisper is still a valid option, even now. For example Parakeet uses Nemotron framework (made by Nvdia), normally you need CUDA, so you need to use an ONNX version instead on AMD. Meanwhile Whisper has VLLM and desktop apps like Buzz.
There aren't many benchmarks and they often don't have all the models, since STT doesn't get nearly enough attention as normal LLMs, but this is one of the more complete ones:
https://artificialanalysis.ai/speech-to-text/non-streaming
venusenvy47•Jul 3, 2026
I don't have anything to compare against, since I have just started using it. But I was fairly happy with it on my personal recordings from my phone. Also, I ran it on my CPU (Core i7) and it was perfectly usable, as something to run when not using the machine for anything else.
simonw•Jul 3, 2026
I'm a big fan of Parakeet v3 - I run it using the MacWhisper app, it's a 494MB model and the quality is excellent.
3eb7988a1663•Jul 3, 2026
Related - what is the best isolation system available? Do I have to go full, fat VMs or can I get by with a Firecracker-like thing?
Seemingly every available option has some subtle-gotchas about how easy it is to blow off your foot and effectively have no security at all. I use VMs because I actually trust that security is a foundational principle of the technology, not a well-if-you-use-these-20-flags-and-squint kind of deal.
ZiiS•Jul 3, 2026
Full fat VMs with GPU passthough I trust a lot less then CPU ones.
elsombrero•Jul 3, 2026
from my understanding, you can run the inference server (llama.cpp/vllm/whatever) and the agent/harness in different contexts, event different machines.
The risky part is in the agent/harness and what tools it has access to.
You don't need to give GPU passthrough to the VM running the agent/harness.
There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.
dofm•Jul 3, 2026
Right. All my experiments are naïve, I am sure, but I run the LLM on the host and expose it via OpenAI API to the VMs.
This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.
I suppose in principle there is the risk of a prompt exploit corrupting the inference server.
Catloafdev•Jul 3, 2026
It depends - for what? If your security model is sandboxing an agent to ensure they don't nuke your PC, then there are a lot of options, you can use something like bubblewrap[1] or a microVM like libkrun[2] if your goal is light-weight, up to full Docker if you want the tooling that comes with that.
im not sure that there is a plug n play set up that will work for everyone, because as with any security boundary, each layer of hardening has a usability trade-off. i definitely feel you about the uncertainty of it all, how do you actually know everything is tight?
personally, i think either a VM or microVM is the way to go. these things are actually designed as security boundaries, as opposed to containers. and as compared to bubblewrap, you can just give the agent a whole FS to work with and run it in yolo mode, whereas with bubblewrap you have to manually bootstrap the availability of each individual dev tool and make sure its config dirs and package caches and etc are mounted in a secure way and still will probably hit perm errors all the time. and there's just way less isolation.
also, something that has limited support in harnesses but IMO would make a lot of sense is running the harness process in the host, but having all the tool calls and file system interactions delegated to the VM. that way you keep all your session data and auth keys on the main machine where it can never get into context. otoh it makes your harness part of the security boundary, so that's the trade-off.
there's also all the usability questions around how to actually get data in/out of the VM. i have a script which can push local git repos into the VM and then pull from them as a remote, so the VM can't initiate any connection with the host doesn't need to hold git credentials. but ig for someone who wants their agent to push straight to GitHub that's a waste of effort.
options i've tried or seen for the VM itself:
- qemu + libvirt: takes some doing to wrangle it together, but very battle tested and configurable
- crun-vm is a PoC of higher level integration layer between podman and qemu, which is a really cool way to go about it. seems maybe abandoned but i just think it's neat and very existing tools/standards oriented rather than starting a new project and brand so i mention
- libkrun is a newer entrant, and several ppl have built wrappers around it:
- microsandbox
- smolvm (posted/discussed on here recently)
- krunvm
this is all Linux oriented, it's all i know.
aka-rider•Jul 4, 2026
On MacOS you have a seatbelt sandbox built-in.
On Linux - docker with SELinux or similar utility over namespaces.
You need to model attack vector first.
`rm -rf` - restricted write
`curl malware.sh | sh` restrict execution from writeable dirs (seatbelt/SELinux)
Restricted write to sensitive directories would most likely neuter most malware.
Credentials leak - cleanup environment, deny reading .ssh, .aws, other, and don't allow LLMs anywhere near production systems.
For £4000 you were likely looking at RTX 6000 Ada listing.
jacobgold•Jul 3, 2026
> "~$40k At this price level, you get the next step up in model intelligence. Something pretty close to Claude Opus."
That is equivalent to 16.8 years of Claude Opus 4.8 or Codex GPT 5.5 at $200/mo.
I'm a huge fan of running local models, but they're still wildly expensive, lower quality, and possibly dangerous (if backdoored). I sincerely wish this wasn't the case.
simonw•Jul 3, 2026
That $200/month is already more like $4,000/month if you have to pay full API pricing - "enterprise" companies for example. That drops the equivalent to 10 months.
(I'd be surprised if that local rig really can drive the equivalent of $4,000/month of API spend though, given that a local rig can run prompts in parallel a lot less effectively than Anthropic's many data centers.)
fweimer•Jul 4, 2026
I think the decode phase of inference typically uses local compute resources poorly due to the very small batch size. If you can run many inference tasks in parallel, this will make local inference more competitive to centralized inference, not less.
echelon•Jul 3, 2026
Stop trying to run them locally, folks.
You don't own your fiber connection. So why try to own another rapidly depreciating, expensive, and annoying asset?
Rent cloud GPUs!
You get to participate in the ownership, data control, price control, and hacking culture without having to Frankenstein some hobbyist box that costs a ton, is distilled down to functional uselessness, and is a PITA to maintain.
satvikpendem•Jul 3, 2026
If I'm gonna rent cloud GPUs I might as well just use a subsidized cloud agent like Claude or Codex. As for depreciation, that is true, but the bet is that models get better for a certain parameter count faster than your hardware becomes obsolete, such as Gemma models for example at the same 30 billion parameter count being much better than some years ago.
irishcoffee•Jul 4, 2026
> You don't own your fiber connection. So why try to own another rapidly depreciating, expensive, and annoying asset?
Like a car? Because I don’t want to depend on uber or a taxi service.
What does fiber have to do with anything? I don’t need the internet to run my local models.
pbgcp2026•Jul 4, 2026
Yes, "Like a car". LOL. You realise that many people in Europe and Asia do not own a car at all? Public transport, eBike / scooter, Tuk-Tuk, walk.
The local LLM "privacy" war had been already lost.
gizajob•Jul 4, 2026
You realise how many people in Europe do own a car?
And how exactly has the privacy war been lost?
pbgcp2026•Jul 4, 2026
The privacy war has been lost in two ways (at least) 1) Running locally lobotomised models makes no sense; 2) as someone said here, the Gov will declare local AI a felony. And they will enforce it.
So those "many people" will buy V8 cars limited to V4 and declared illegal to drive without registration and license, even locally in your own yard, and they may go to jail if they attempt to activate other 4 cylinders.
Oh, wait ... isn't it how car laws work now? ;-)
gizajob•Jul 4, 2026
Ludicrously paranoid take.
pbgcp2026•Jul 4, 2026
I'm glad that you saw my point and I apologise for stepping on your ego. :-)
AdieuToLogic•Jul 4, 2026
> Stop trying to run them locally, folks.
"Locally" is a relative qualifier if one defines locality as not being reliant on a SaaS vendor. IOW, locality does not necessarily imply execution on machines specifically owned/operated by an organization.
> Rent cloud GPUs!
This would qualify as "locally" in the above definition. There is also a case to be made that h/w ownership (GPUs included) and operation can result in a net cost reduction for some use-cases.
However, where exposing intellectual property results in regulatory violations and/or undue legal exposure, running models "locally" is not only a good option, it is the only option.
gizajob•Jul 4, 2026
This comment is like the antithesis of hacker culture. Can’t tell if being ironic or not.
echelon•Jul 4, 2026
Lobotomized RTX models are playthings.
People building this stuff are "year of linux on desktop"ing open weights AI. It's a huge opportunity cost - not just for you, but for the open source community at large.
You need to double down on big fat honking models that take multiple H200s to run. That's where the real power lies, and that's where our entire community needs to focus our efforts if we want to keep the delta between frontier and the proletariat small.
The more we build for people and enterprises to run big weights in private clouds, the better. That's the real treat to Google, Anthropic, and OpenAI. Your RTX cards don't make a dent in the death star.
gizajob•Jul 4, 2026
You could be an IBM executive writing about the Apple I.
broadsidepicnic•Jul 4, 2026
> You don't own your fiber connection. So why try to own another rapidly depreciating, expensive, and annoying asset?
Single mode fiber can serve for tens of years without problems and push the fastest speeds available today. I do not understand this comparison.
echelon•Jul 4, 2026
We don't need to own the hardware.
We need to own the software and the models.
Playing around with local models is like playing around with Ubuntu and Arch in the 00's. It's a fun toy, but it doesn't make a big economic dent, and it doesn't ensure we retain our rights and a slim capability gap against the frontier.
Developing software that works with big models, showing up with economic demand - that ensures that capability gets built and that open whittles away at closed at the very frontier.
More customers going to tiny hobbyist models also sucks oxygen out of the room for more large scale open models. We need to put economic demand on the larger open weights.
verdverm•Jul 3, 2026
You can use a lot more tokens on hardware than you can spend on a $200/m plan.
Inwrnt through 1B tokens my first month with an OEM spark. That's more than $1k of opus. Not a fair comparison, because token patterns are different, but since that time I have also seen a 2-3x improvement in then speeds.from improvements in vllm (mainly MTP). DiffusionGemma is around 4x regular gemma.
neverm0r3•Jul 3, 2026
I agree with your point, but it should be noted that this assumes consistent prices for LLMs. The OpenAIs and Anthropics of this world are still selling the plans at a subsidised prices with the power of VCs, who are going to want that return some time.
downrightmike•Jul 4, 2026
VCs need to sit on ice for a few more years if they dont want it to pop
nullbio•Jul 4, 2026
None of the leading models are backdoored, that's nonsense. I've still never heard of a single backdoored model, and if one was found, it would be quickly eradicated from HF. This is a non issue.
Avicebron•Jul 3, 2026
Does anyone know any good data center to home conversion kits for gear?
bcjdjsndon•Jul 3, 2026
If you can run sota on a 40k setup, why do openai etc spend maybe 100x that?
dwroberts•Jul 3, 2026
Obvious one: Because they are serving it to millions of people at the same time, not just one local user
c4pt0r•Jul 3, 2026
Local open weight models will definitely be a future trend. Imagine if an Opus-level model could run locally: many more latent use cases would likely emerge, since Opus is priced so high. Perhaps the future will be a multi-model architecture, where frontier models handle planning and local models carry out the concrete execution.
maxxxml•Jul 3, 2026
What harness is the best for local LLMs? I've been researching optimizing local LLM agent harness performance with context/ tools. Quite the endeavor and would love to learn what users prefer for this type of workflow.
npodbielski•Jul 3, 2026
I like vibe and pi. Vibe just looks nice and is good enough. But pi extensibility is just another level. There is also Dirac that is quite OK but seems like full of bugs. Zerostack is the simplest harness I saw. OpenCode is OK too. Rest I did not try.
jzer0cool•Jul 4, 2026
What's the technical reason we call call these a harness? Seems right but want to understand better.
maxxxml•Jul 4, 2026
The model represents intelligence and the harness is toolset which allows the model to create more informed decisions with context. Specifically, loops, subagents, tools, connectors, prompts, skills, and much more. This is why Cursor performs so well.
GTP•Jul 3, 2026
There also exists an in-between possibility, that is, if you get 128GB of vram (there are now multiple options in the market to get that amount with a unified memory architecture) you can run DeepSeek V4 flash at good speed via DwarfStar. I'm not going to spend money on this, but my gut feeling is that this would be the right compromise for a lot of people.
jonaustin•Jul 3, 2026
I just started using it on an m4 max 128 and it's the first time since buying the machine a year ago that it feels like local llm "just works" for reasonably decent coding.
Use pi though; claude code has way too much bootstrap context; slows everything way down.
LoganDark•Jul 4, 2026
Definitely seconding pi. Also avoid opencode, it doesn't support caching (mutates the system prompt constantly)
rishabhaiover•Jul 3, 2026
This is a great guide. However, the economics just do not work in my favor at all. Even if I were to spend $2k, I get much more flexibility of model intelligence and choice from a provider for $20/month.
QuantumNoodle•Jul 3, 2026
$2k or $40k? One of those is not "self host."
gizajob•Jul 4, 2026
Depends how much money you made going long on AI stocks.
maxignol•Jul 3, 2026
Did not seem to find how much tokens per second he achieved with this setup ?
aetherspawn•Jul 3, 2026
80 tok/s which is kind of a lot for GLM. My experience running 80 tok/s on other LLM is that it ~seems faster than cloud inference, but that obviously depends what you use, in my case ChatGPT.
SwellJoe•Jul 3, 2026
I recently wrote up how I run local LLMs, because several folks had asked (https://swelljoe.com/post/how-i-run-local-llms/) and I think even my setup, which I spent maybe $4200 on, half on a Strix Halo and half on upgrades for my desktop, would be too expensive to justify today. I bought before prices went through the roof, and only did so because I like to tinker with hardware...not because I expected it to ever pay for itself vs. buying subsidized tokens from the big guys or the cheap tokens from efficient providers like DeepSeek.
Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and I can buy billions of DeepSeek, MiMo, and GLM tokens, and use $100 or $200 a month subscriptions for the big guys in the meantime for the difference in price once that happens. And, you can't even run the full-sized GLM on that hardware, it is quantized and so is your KV cache; the degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.
My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.
And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens for a 1T+ model in the game, though DeepSeek and GLM are better models, MiMo is fine.
When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law.
CamperBob2•Jul 4, 2026
Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy
Especially when you realize you really want 8 of them. But...
You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.
... to be perfectly clear: you have no earthly idea what you're getting when you buy GLM tokens from Z.ai. Your options are to run locally, rent cloud hardware, or hope for the best.
SwellJoe•Jul 4, 2026
OK, that's true, too, but they have a vested interest in GLM being as good as possible. They're nipping at the heels of the big guys, they don't want to ruin that by hobbling their best model with a lossy quantization.
weystrom•Jul 3, 2026
While I think that local LLMs are the future, i think these setups are insane. You shouldn't be trying to push the SOTA, most people underestimate how much you can get out of small LLMs.
Why ask FABLE 5000 to "summarize this email thread" when a tiny model can do the job?
Sure Codex3000 can oneshot your backlog, but why not use a subsidized subscription to do it for now? We're clearly not at the peak of these model's capabilities yet.
saltamimi•Jul 3, 2026
Could someone give me an actual guide for spending as little as possible to get as maximal gains with either SOTA or cheap models as a systems administrator and not someone like a full-stack developer?
The models are so powerful and consequently so expensive and confusing to use, I don't get all of it.
ineptech•Jul 3, 2026
Might as well add my own experience since I just set up a local llm this week. I went with a 32GB card made by Intel called Arc B70, which is cheaper than a 3090 and more has ram, at the cost of a slower memory bus. edited to remove something incorrect, thanks diablod3
I went with this because a) the models I wanted to use are a little too big to fit comfortably in 24gb, plus I need room for a few additional small models for autocomplete and speech recognition, and b) I already had a cheap server to use and dual gpus would've required upgrading the mobo and power supply and probably the case as well.
It was definitely a little tricky to set up. The Intel line requires a driver package called "level zero" to support something called SYCL (Intel's version of CUDA basically, AFAICT) that was tricky to get working. I am running llama.cpp in a docker container, which also required some fiddling to get the container to see the card. You also need a kernel from the last few months.
Once I got it working though, the results are very impressive for a $1k investment. Qwen 3.6 35B at q4 quantization takes about 3/4 of the ram and delivers like 88 tokens/sec. So, if you want a decent-sized model for cheap, this is one way to go.
DiabloD3•Jul 3, 2026
That is incorrect.
They both have GDDR6.
The B70 has 256 bit it bus at a clock speed of 2375mhz (608 GB/s), the 3090 has a 384 bit bus at a clock speed of 2438mhz (936 GB/s).
It isn't slower, it just has less channels, ie, it is less wide.
ineptech•Jul 3, 2026
Whoops thanks, was going from memory. At any rate, the effect is that it's somewhat slower than the 3090, when using a model small enough to fit entirely in nvram, but can fit models the 3090 can't.
charcircuit•Jul 3, 2026
If you want to host SotA models you need multiple machines. 384 GiB is nowhere near enough for SotA where models are terabytes big.
misiti3780•Jul 3, 2026
Doesnt an NVIDA Spark solve most of these problems? (at 5K)
pulse7•Jul 3, 2026
NVidia Spark is much slower (low memory bandwidth)!
mateenah•Jul 3, 2026
This is extremely useful. Thank you so much!
nullc•Jul 3, 2026
Those cards would really prefer you use a pcie-5 switch, but I guess they're sold out.
throwrioawfo•Jul 3, 2026
If you're going to fork out 40k, why not get an actual rack rather than fashioning one yourself out of plywood...
gehsty•Jul 3, 2026
Are they SOTA? I’m not sure
gchamonlive•Jul 3, 2026
There's a sub 2k tier with a single 3090 that's also serviceable. Run https://github.com/noonghunna/club-3090 with beellama, fast inference at the cost of a reduced 102k context window
ursuscamp•Jul 3, 2026
Bitcoin is so dead that jamesob is posting about AI.
brcmthrowaway•Jul 3, 2026
Bitcoin booster -> AI slopper pipeline
luciana1u•Jul 4, 2026
Spending $40K to run a quantized model that's worse than the $200/mo API is like buying a cow because milk is expensive — and then finding out your cow only produces skim.
nnevatie•Jul 4, 2026
> SOTA LLMs locally
Shouldn't the headline be about running SOTA _local_ LLMs, as GLM 5.2 is nowhere near a SOTA LLM?
luciana1u•Jul 4, 2026
now the electric bill is the real subscription fee
luciana1u•Jul 4, 2026
FP4 is basically lossless
rldjbpin•Jul 4, 2026
no matter your luck with hardware or your sysadmin skills, doing local inference for just yourself and/or to emulate typical usage (e.g. your coding workflow and deep research, etc.) is just very inefficient in current model architecture.
to me, this is a "truck" approach to city driving as a single person who does not do furniture hauling every weekend. the sense of privacy and freedom is nice but online inference is more "economical" as multi-user load is more effectively served than going solo.
maybe new architectures would make it effective to do text inference locally [1], till then great on you if you can spend car money on your setup. hope it is a great learning experience as well.
in my experience running models that have been heavily quantized(q4) or altered to some extent has never made me say “wow, this is so amazing”. On the contrary, the model ended up in the thrash bin after a few prompts.
I have an RTX 6000 PRO with 96GB, and what I can run comfortably is Qwen 3.6 27B or MoE, Gemma 4 31B. This is as far as it goes when you run the model at full precision and maximum context length.
They perform well and you can use them for coding, doing research on the internet and what have you. So if you do the math and you see yourself spending more than the $2400/year to Anthropic, then it might make sense to get one of these cards but accept the quality drop. Otherwise, will humans even be coding in 5 years from now?
broadsidepicnic•Jul 4, 2026
what you maybe forget here is the use case for people and businesses who can not send the data to 3rd party due to privacy/contractual reasons. This is what I'm looking at, we're bound by strict policies for data sharing outside of our premises.
crymeth0t•Jul 4, 2026
It's nutty to me that anyone would go to such great lengths to use LLMs -- especially chasing the bleeding edge like this. If Claude and co. disappeared tomorrow, I wouldn't flinch.
I don't understand why people are exchanging their brain wrinkles for access to a slop machine. I wonder if a good analogue would be a skilled carpenter being offered access to a machine which excretes furniture (one or two levels of quality beneath Ikea). Does it do the job? Most of the time. Does the carpenter enjoy the process? No.
41 Comments
I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.
The caveat is that if you try to use multiple models on the same device at the same time, you thrash and destroy tok/s
Just want to note that for $3k you can get an M5 macbook pro with 48gb of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper if not significantly cheaper. It is awesome being able to run models locally though.
So, I always thought local LLMs were toys not worth pursuing.
Only once have I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.
You stop fearing you are sharing sensitive information.
You stop fearing you will run out of tokens.
You stop fearing about the availability of the remote AI.
Local LLMs are extremely valuable.
The M5 hardware is amazing for what it is, but GPUs are still so much faster.
Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.
Get a regular laptop and use the network to access the LLM
Do you have this DRY docs?
This translates to qwen 27b actually working fast enough for useful work on dual 3090s and being painfully slow on Macbook Pros. Also if you're running a big model on a macbook pro the UI gets laggy and the keyboard gets hot. Much better to run dual 3090s in your basement and connect to them from your Macbook.
Even a 128GB is $6.8k today. Still only 2/3 your quote.
Bandwidth is relevant (I have both a 5090 and an M4 Max 128GB Studio, so have direct comparison right here), but quote the cost appropriately!
Why are you throwing in extra cost for something thats not necessary? I know multiple people with 128GB Macs and none of us upgraded the storage. Especially not on a Studio (which isn't currently available).
I will say that their $3k number is off. I somehow missed that, and its too low.
Still 3x lower memory Bandwidth than a dual 3090 setup which you can build for $3k with parts from facebook marketplace and run in your basement.
2x3090 (has an nvlink bridge though it didn't seem to matter hugely for inference)
Qwen 3.6 27b int4: Concurrency 1: 68 tok/s output Concurrency 32: 363 tok/s output Prompt processing speed: 1520 tok/s
Qwen 3.6 35ba3b int4: Concurrency 1: 150 tok/s output Concurrency 32: 1083 tok/s output Prompt processing speed: 4324 tok/s
Macbook Pro m3 36gb RAM: Qwen 3.6 27b int4: Concurrency 1: 18 tok/s output didn't measure the other metrics and it was a slightly different benchmark.
There are other arguments for running an ssh-able box in a closet somewhere too as with KVMs you can give an agent remote control over the machine itself such that it has vastly more capabilities than if it were controlling its own machine it's running on, as well as not needing to keep the MacBook open all the time just to have the agent finish running.
[0] https://youtu.be/9tGrhrVKCrE
GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).
They suggest using this modified model:
>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.
I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8 bit for coding.
Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context P.S. found in the repos, 240k context
I assume you can then somehow run several hundreds of prompts concurrently?
What is the behavior if one were to run GLM 5.2 with only a single H200 ?
Would it fail to run at all, or would it just run so slowly as to be unusable ?
I would like to prove out the build, and concept, of a SOTA model locally, but then backfill the rest of the GPUs in 18-24 months when they cost significantly less ...
going to need you to sit down for this one...
And labs are fully leaning into pricing for intelligence, so their margins are improving very quickly (which allows them to pay even more for existing compute)
I'd be shocked if current prices aren't the bottom for the next 18-24 months.
This week, we are banned from using anything more expensive than opus 4.6 and encouraged to use sonnet (but not sonnet 5! Thats expensive!) or lower for daily tasks to help manage costs.
Weeks ago, they gave exactly the same justification as you just gave; and it makes sense!
…but maybe not over the next 6 months.
> Maybe not over the next 18-24 months
Maybe not. Probably not, I guess.
A lot of money has been invested on the expectation that the current gen of hardware is going to reap a colossal profit, and the capex to replace it, is vanishing into investor skepticism as we speak.
It seems like most people have a very very low ability to forecast long horizon change in the current environment, but, in general… it seems like until demand drops, the chances of prices dropping is dubious; at best we get a price war with chinese models or a bubble pop; and even then, there are plenty of startups lurking to snap up cheap hardware.
For individuals, the horizon for buying cheap AI capable compute doesn’t seem close, at all, to me.
They have unified memory and respectable inference performance, and for some variations can be cheaper than video cards, especially if you get an older-gen high-end M series with a lot of RAM used or refurbished.
I've read that Apple has plans once the RAM bottleneck passes to offer more RAM in all their models, and that future M series GPUs and NPUs will be even better for local inference, so in the future I expect Apple to be a serious offering for local inference and AI research workstations.
And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.
At this point though, I think we may be in a "renters market" for LLM compute. If you want privacy it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it cause it's cool.
Do we have evidence that this will actually happen? Maybe the belief that it won't pass is what requires evidence, but I think there's a widespread feeling right now that things are just getting permanently worse and this is one example.
People do that all the time, and sometimes it doesn't pay off.
One thing holding back capacity expansion is that a lot of people are concerned this is a bubble. They're worried it'll pop and leave them with orphaned assets if they over-invest in production.
Of course maybe they're right and that will happen. If the data center construction boom ends, RAM prices will fall.
I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.
The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.
Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long context coding tasks and the quality will be noticeably less. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and even some times the full 16-bit source.
This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.
The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...
The best NV4FP quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.
It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.
Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.
The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.
You will almost certainly never break even compared to paying per token.
Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.
They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.
> or denying you service
I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.
> or somehow abusing your data...
If data security is your concern then you’re better renting a server as needed still.
If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.
The Openrouter providers the GP referenced were never at the "free money party". The actual cost of running something like GLM5.2 is well understood and tokens from those providers are not sold at a loss.
Obviously running things locally is more expensive but that all comes down to economies of scale. GLM5.2 is as expensive as it will ever be, barring an increase in demand that forces/allows providers to realise windfall gains disconnected from their underlying costs (always possible, but not the point).
Obviously depends on the use case and threat model, but that hardware is publicly available at far less than $500k upfront.
With z.AI GLM Coding Subscription for 1344 USD per year, that buys you 74 years.
Maybe if you want to host the model for a group of people or really need no artificial token limits, or maybe cannot use cloud models, then it makes more sense.
> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.
This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes this will be slow and usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck) but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.
GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Every expert load would take more than a second.
If you had enough RAM and enough SSDs in parallel you might get a couple tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to get 200,000 tokens generated.
So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.
You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.
(If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)
I tried all frontier Chinese models, and Qwen is the one running the best in ClaudeCode, my personal theory, it's because Qwen was distilled from Opus.
That said, modern LLM sampling algorithms like min_p, top_n sigma , etc heavily mitigate the performance penalty you get from doing long context tasks. Problems with long context come from accumulation of small sampling errors over time.
My qwen 3.6 27b (the dense one) runs perfectly well on coding tasks at the edge of its context window because I run it using modern LLM sampling stack, namely top N sigma of one, using DRY to stop repetitions and XTC as a superior alternative to temperature for diversification.
Yes there will be a paper soon on arxiv and hopefully NeurIPS proceedings talking about this phenomenon because it’s not well appreciated by the academic AI community yet.
Docs [1] say that the top_n_sigma is already in the default sampler list: "(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)"
[1] https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
Just two months ago you could get RTX PRO 6000's for about $8500 on ebay, which is the MSRP.
The MSRP was raised to $13,250.
Warranty is very important for expensive cards like this. I don't recommend buying on eBay unless they come with a very big discount.
Little tools for NLP, TTS, image processing, audio engineering, signal processing, diffusion plugin for Krita etc. are all great for local setup. I wrote a small piece on it few days back[1].
[1] https://abishekmuthian.com/multiple-20-ai-plans-are-better-t...
For coding task, one needs to break the session among multiple calls I made https://github.com/aka-rider/orqestra but it's possible to do the same in almost any modern harness directly.
The main idea is: - separate session that burns context on reading code and calling tools (context7, etc) -> markdown report "here are relevant patrts of code, docs" "with evidence" to prevent hallucinations
- separate session for planning (architect) - (critic <-> architect) 1-3 times because small model skip over details - worker <-> validator, again, the same reason
Qwen3.6 can run for hours looking for a complex bugs in read-only mode, and usually it gets it. Proposed fix would probably be hacky, but so as Sonnet's
Qwen3.6 can mechanically write code by Opus-made plan. You would have to prompt afterwards:
"Review your own changes. Any bugs? Cross-validate against the original plan - any gaps? Any violations of CLAUDE.md"
But again, I need to do this for Sonnet. But also I use local llms for reindexing knowledge base.
Grooming tickets: I can leave a caveman note "single panel for errors rendering, move all error messages" and come back to 90% ready specs with the end goal and context.
My favorite example is Godot; most local models just can't get it through their thick AI skull that code alone won't be enough to generate working solutions. They must accept a more complex harness, or you must provide much more info that eats the precious available context on every run.
As a simplest example, when you ask "explain what this code does" advantage of large models is negligible.
I tried Fable, "look at this repo, find all bugs" — yeah, neither Qwen nor Opus can do this.
> I can almost guarantee your approach will break down once you get a project large enough or switch to a less popular language.
I can guarantee you it is not, I used my Qwen on 10-15 years of PHP — I just know how and where it will break; what to ask for, what not. Orqestra was/is self-hosted, being developed by, well, orchestra of Qwen agents.
Moreover, Opus and GPT-5.5 break similarly, yeah they will withstand much more pressure, but they will hallucinate and loop nevertheless. My Qwen experience translates seamlessly. I learned so much about agentic engineering, harnesses, tooling, building custom MCPs...
I'll go on a tangent but to me that's what we're all seeing. It's the "record number of CVEs found by AIs" thing: these tools are extremely good at searching inside code. And that is a godsend.
We' got people (claiming they're from Anthropic) posting comment saying: "Yes GLM 5.2 found that security bug in library xxx, but we just tried with Fable and it found it too".
More code-searching, more bugs finding. Dick-measuring contests on bug finding abilities.
But the headlines we don't see at all are: "1000 CVEs found by AI, 1000 CVEs fixed by code written by AI". These are nowhere to be found.
We don't see "GLM 5.2 suggested an elegant fix to CVE-2027-xxxxxx" to then have a paid Anthropic shill posting "Fable suggested an eleganter fix than GLM 5.2".
These headlines are, as of 2026, nowhere.
You wrote the result would be "hacky". Here's why I saw from a top, paid for, SOTA model from the top company of the moment: instead of doing two integers comparison (literally one line of code) to verify that a value is between a range, the thing somehow noticed a "pattern" in the hexadecimal representation of the two values and went insane. It started converting the value to its hexadecimal string representation and then started doing substring string matching on that.
"Hacky" is too nice of a word.
This is pure garbage.
Those who go hiking "while their agents ship features" don't realize the level of underperforming, buggy, insecure crap that their LLMs are generating.
I found it very interesting the schism between those who use LLMs to find issues but who verify/modify or even don't use at all the fix they suggest and those who vibe-code while on a yoga retreat.
It's 2026: LLMs do find bugs. But can they fix them?
And do we even care: isn't finding a bug 99% of the job?
> underperforming, buggy, insecure crap that their LLMs are generating
The biggest challenges with AI-generated code are: models actively destroy security features, Opus explained to me once that authorization mechanism is "bad development experience" all while making a backdoor (he made a skeleton key if token=="test" then all permissions granted). Also models actively destroy QA gates. I don't even complain when they delete tests - at least it's visible, they can flip condition to make a test pass, and with vast code changes these are hard to spot.
I myself, and some people I know "vibe-code" professionally though, but then we often assess not the code but it's behaviour. For instance, whether hand-made tests are all pass, p95 is under 50ms, and so on, I may not care about the implementation details.
On the other hand, my friend told me about garage owner he visited, 60 yrs old auto-mechanic, CRM, parts inventory management, payments processing terminal, passwords in txt, people's personal data God knows where, could be unprotected MySQL looking into the Internet bare for all we know.
2026 onwards will be wild.
But the ecosystem isn't as mature, so Whisper is still a valid option, even now. For example Parakeet uses Nemotron framework (made by Nvdia), normally you need CUDA, so you need to use an ONNX version instead on AMD. Meanwhile Whisper has VLLM and desktop apps like Buzz.
There aren't many benchmarks and they often don't have all the models, since STT doesn't get nearly enough attention as normal LLMs, but this is one of the more complete ones: https://artificialanalysis.ai/speech-to-text/non-streaming
Seemingly every available option has some subtle-gotchas about how easy it is to blow off your foot and effectively have no security at all. I use VMs because I actually trust that security is a foundational principle of the technology, not a well-if-you-use-these-20-flags-and-squint kind of deal.
The risky part is in the agent/harness and what tools it has access to.
You don't need to give GPU passthrough to the VM running the agent/harness.
There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.
This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.
I suppose in principle there is the risk of a prompt exploit corrupting the inference server.
[1] https://github.com/containers/bubblewrap
[2] https://github.com/libkrun/libkrun
personally, i think either a VM or microVM is the way to go. these things are actually designed as security boundaries, as opposed to containers. and as compared to bubblewrap, you can just give the agent a whole FS to work with and run it in yolo mode, whereas with bubblewrap you have to manually bootstrap the availability of each individual dev tool and make sure its config dirs and package caches and etc are mounted in a secure way and still will probably hit perm errors all the time. and there's just way less isolation.
also, something that has limited support in harnesses but IMO would make a lot of sense is running the harness process in the host, but having all the tool calls and file system interactions delegated to the VM. that way you keep all your session data and auth keys on the main machine where it can never get into context. otoh it makes your harness part of the security boundary, so that's the trade-off.
there's also all the usability questions around how to actually get data in/out of the VM. i have a script which can push local git repos into the VM and then pull from them as a remote, so the VM can't initiate any connection with the host doesn't need to hold git credentials. but ig for someone who wants their agent to push straight to GitHub that's a waste of effort.
options i've tried or seen for the VM itself: - qemu + libvirt: takes some doing to wrangle it together, but very battle tested and configurable - crun-vm is a PoC of higher level integration layer between podman and qemu, which is a really cool way to go about it. seems maybe abandoned but i just think it's neat and very existing tools/standards oriented rather than starting a new project and brand so i mention - libkrun is a newer entrant, and several ppl have built wrappers around it: - microsandbox - smolvm (posted/discussed on here recently) - krunvm
this is all Linux oriented, it's all i know.
You need to model attack vector first.
`rm -rf` - restricted write
`curl malware.sh | sh` restrict execution from writeable dirs (seatbelt/SELinux)
Restricted write to sensitive directories would most likely neuter most malware.
Credentials leak - cleanup environment, deny reading .ssh, .aws, other, and don't allow LLMs anywhere near production systems.
I made a small utility for MacOS https://github.com/aka-rider/leash
But it may be as well a bash script
That is equivalent to 16.8 years of Claude Opus 4.8 or Codex GPT 5.5 at $200/mo.
I'm a huge fan of running local models, but they're still wildly expensive, lower quality, and possibly dangerous (if backdoored). I sincerely wish this wasn't the case.
(I'd be surprised if that local rig really can drive the equivalent of $4,000/month of API spend though, given that a local rig can run prompts in parallel a lot less effectively than Anthropic's many data centers.)
You don't own your fiber connection. So why try to own another rapidly depreciating, expensive, and annoying asset?
Rent cloud GPUs!
You get to participate in the ownership, data control, price control, and hacking culture without having to Frankenstein some hobbyist box that costs a ton, is distilled down to functional uselessness, and is a PITA to maintain.
Like a car? Because I don’t want to depend on uber or a taxi service.
What does fiber have to do with anything? I don’t need the internet to run my local models.
The local LLM "privacy" war had been already lost.
And how exactly has the privacy war been lost?
"Locally" is a relative qualifier if one defines locality as not being reliant on a SaaS vendor. IOW, locality does not necessarily imply execution on machines specifically owned/operated by an organization.
> Rent cloud GPUs!
This would qualify as "locally" in the above definition. There is also a case to be made that h/w ownership (GPUs included) and operation can result in a net cost reduction for some use-cases.
However, where exposing intellectual property results in regulatory violations and/or undue legal exposure, running models "locally" is not only a good option, it is the only option.
People building this stuff are "year of linux on desktop"ing open weights AI. It's a huge opportunity cost - not just for you, but for the open source community at large.
You need to double down on big fat honking models that take multiple H200s to run. That's where the real power lies, and that's where our entire community needs to focus our efforts if we want to keep the delta between frontier and the proletariat small.
The more we build for people and enterprises to run big weights in private clouds, the better. That's the real treat to Google, Anthropic, and OpenAI. Your RTX cards don't make a dent in the death star.
Single mode fiber can serve for tens of years without problems and push the fastest speeds available today. I do not understand this comparison.
We need to own the software and the models.
Playing around with local models is like playing around with Ubuntu and Arch in the 00's. It's a fun toy, but it doesn't make a big economic dent, and it doesn't ensure we retain our rights and a slim capability gap against the frontier.
Developing software that works with big models, showing up with economic demand - that ensures that capability gets built and that open whittles away at closed at the very frontier.
More customers going to tiny hobbyist models also sucks oxygen out of the room for more large scale open models. We need to put economic demand on the larger open weights.
Inwrnt through 1B tokens my first month with an OEM spark. That's more than $1k of opus. Not a fair comparison, because token patterns are different, but since that time I have also seen a 2-3x improvement in then speeds.from improvements in vllm (mainly MTP). DiffusionGemma is around 4x regular gemma.
Use pi though; claude code has way too much bootstrap context; slows everything way down.
Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and I can buy billions of DeepSeek, MiMo, and GLM tokens, and use $100 or $200 a month subscriptions for the big guys in the meantime for the difference in price once that happens. And, you can't even run the full-sized GLM on that hardware, it is quantized and so is your KV cache; the degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.
My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.
And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens for a 1T+ model in the game, though DeepSeek and GLM are better models, MiMo is fine.
When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law.
Especially when you realize you really want 8 of them. But...
You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.
... to be perfectly clear: you have no earthly idea what you're getting when you buy GLM tokens from Z.ai. Your options are to run locally, rent cloud hardware, or hope for the best.
Why ask FABLE 5000 to "summarize this email thread" when a tiny model can do the job?
Sure Codex3000 can oneshot your backlog, but why not use a subsidized subscription to do it for now? We're clearly not at the peak of these model's capabilities yet.
The models are so powerful and consequently so expensive and confusing to use, I don't get all of it.
I went with this because a) the models I wanted to use are a little too big to fit comfortably in 24gb, plus I need room for a few additional small models for autocomplete and speech recognition, and b) I already had a cheap server to use and dual gpus would've required upgrading the mobo and power supply and probably the case as well.
It was definitely a little tricky to set up. The Intel line requires a driver package called "level zero" to support something called SYCL (Intel's version of CUDA basically, AFAICT) that was tricky to get working. I am running llama.cpp in a docker container, which also required some fiddling to get the container to see the card. You also need a kernel from the last few months.
Once I got it working though, the results are very impressive for a $1k investment. Qwen 3.6 35B at q4 quantization takes about 3/4 of the ram and delivers like 88 tokens/sec. So, if you want a decent-sized model for cheap, this is one way to go.
They both have GDDR6.
The B70 has 256 bit it bus at a clock speed of 2375mhz (608 GB/s), the 3090 has a 384 bit bus at a clock speed of 2438mhz (936 GB/s).
It isn't slower, it just has less channels, ie, it is less wide.
Shouldn't the headline be about running SOTA _local_ LLMs, as GLM 5.2 is nowhere near a SOTA LLM?
to me, this is a "truck" approach to city driving as a single person who does not do furniture hauling every weekend. the sense of privacy and freedom is nice but online inference is more "economical" as multi-user load is more effectively served than going solo.
maybe new architectures would make it effective to do text inference locally [1], till then great on you if you can spend car money on your setup. hope it is a great learning experience as well.
[1] https://deepmind.google/models/gemma/diffusiongemma/
I don't understand why people are exchanging their brain wrinkles for access to a slop machine. I wonder if a good analogue would be a skilled carpenter being offered access to a machine which excretes furniture (one or two levels of quality beneath Ikea). Does it do the job? Most of the time. Does the carpenter enjoy the process? No.