308 points by Alifatisk Apr 20, 2026

10 Comments

OsamaJaber Apr 20, 2026
Good to see this exists. Inference providers quietly swap quant levels, and most users never check. A standard verifier from the model maker is the right move; I'd love to see other labs ship the same.
bobbiechen Apr 20, 2026
If I understand correctly, the threat model here seems to be protecting against accidental issues that would impact performance, but it doesn't cover a malicious actor.

For example, Sketchy Provider tells you they are running the latest and greatest, but actually is knowingly running some cheaper (and worse) model and pocketing the difference. These tests wouldn't help since Sketchy Provider could detect when they're being tested and do the right thing (like the Volkswagen emissions scandal). Right?

j-bos Apr 20, 2026
Seems like a great challenge for all these systems; we've seen frontier labs serving quants when under heavy load.
gpm Apr 20, 2026
Yes and no.

For a truly malicious actor, you're right. But it shifts it from "well we aren't obviously committing fraud by quantizing this model and not telling people" to "we're deliberately committing fraud by verifying our deployment with one model and then serving customer requests with another".

I suspect there are a lot of semi-malicious actors who are happy to do the former but not the latter.

nulltrace Apr 20, 2026
Catching accidental drift is still worth a lot. It's basically the same idea as performance regression tests in CI; nobody writes those because they expect sabotage. They're for the boring stuff, like "oops, we bumped a dep and throughput dropped 15%".

If someone actually goes out of their way to bypass the check, that's a pretty different situation legally compared to just quietly shipping a cheaper quant anyway.
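To make the CI analogy concrete, here's a minimal sketch of a throughput regression gate. All names and numbers are hypothetical, not from any real verifier:

```python
# Hypothetical throughput regression gate, in the spirit of CI perf tests.
# The baseline and threshold below are illustrative placeholders.

BASELINE_TOKENS_PER_S = 1200.0
MAX_REGRESSION = 0.15  # fail the gate if throughput drops more than 15%

def check_throughput(measured: float,
                     baseline: float = BASELINE_TOKENS_PER_S,
                     max_regression: float = MAX_REGRESSION) -> bool:
    """Return True if measured throughput is within the allowed regression."""
    return measured >= baseline * (1.0 - max_regression)

# e.g. after a dependency bump:
assert check_throughput(1100.0)      # ~8% drop: still passes
assert not check_throughput(1000.0)  # ~17% drop: fails the gate
```

The same shape of check works for quality metrics (eval scores, tool-call success rates) as well as raw throughput.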

jychang Apr 21, 2026
Yeah, the threat model is nonexistent. Most people use a dozen or so well-known providers, who have no incentive to cheat so obviously.
KeplerBoy Apr 21, 2026
Also it's not just about running an obviously worse quant.

Running different GPU kernels / inference engines also matters. It's easy to write an implementation that is faster and thus cheaper but numerically much noisier / less accurate.
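A tiny NumPy illustration of that point (not tied to any particular inference engine): the same float32 reduction computed in two different orders, as two different kernels might, need not be bit-identical even though both are "correct".

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

seq = np.float32(0.0)
for v in x:          # strictly sequential accumulation
    seq += v

# Blocked / tree-style reduction, the way a parallel GPU kernel might sum.
blocked = x.reshape(100, 1000).sum(axis=1).sum()

ref = x.astype(np.float64).sum()  # higher-precision reference
print(abs(seq - ref), abs(blocked - ref))  # both tiny, typically not equal
```

Scaled up to billions of matmul accumulations per token, that rounding noise is exactly the kind of drift a verifier has to distinguish from a genuinely different model.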

frogperson Apr 21, 2026
Aggregators like OpenRouter default to the cheapest provider. Those providers are often cheap because they are ridiculously quantized and tuned for throughput, not quality.

This is probably Kimi trying to protect their brand from bargain-basement providers that don't properly represent what the models are capable of.

latchkey Apr 21, 2026
> This is probably Kimi trying to protect their brand from bargain-basement providers that don't properly represent what the models are capable of.

I'm curious what exactly they mean by this...

"because we learned the hard way that open-sourcing a model is only half the battle."

HarHarVeryFunny Apr 21, 2026
I'd take it at face value. Since they release open weights, they would appear to genuinely want other providers to serve this as well as they do themselves, but the benefit of that depends on it being served accurately.
latchkey Apr 21, 2026
I agree, but how about some details?
Onavo Apr 21, 2026
Kimi, GLM, and Minimax are the "Big Three" of open-source Chinese AI startups. There are also Qwen and DeepSeek, but those are subsidized by other lines of business.

The Chinese AI models are generally 5-6 months behind high-end SOTA Western models (as of the time of this comment, that's Opus 4.7 and ChatGPT 5.4 Thinking; it's rumored, however, that the Mythos and Spud codename models are even better).

To gain market share, the Chinese startups use open source as a distribution strategy and have essentially made mid-to-high-end AI a commodity. The best models are still Western, but the open models win for any application that doesn't require the highest performance on the market, or where there's a need for extensive customization or alignment (imagine you are an oil-rich petro state and you don't want your national AI strategy tied to liberal international-order ideology).

It creates a lot of pricing pressure on the low and mid end, and it's also why Anthropic is desperately trying to go full B2B instead.

However, if the third parties hosting the Chinese models at near cost don't perform good quality control, it ruins the strategy, because customers become disinclined to use Chinese models (and first-party hosting on Chinese infrastructure is out of the question for geopolitical reasons, so everybody hides behind the polite fiction of using resellers like OpenRouter, Fal.ai, Wavespeed, Fireworks AI, etc.).

ashirviskas Apr 21, 2026
I've been burned on OpenRouter by getting routed through terrible quants with equally terrible quality, while paying maybe 15% less.

Nearly a year ago it was impossible to avoid, due to OpenRouter's silly routing algorithm and API. You had to set multiple things just right to make it work.

Similar to their other API quirks. You want a valid JSON response? Sure, set response_format to "json", just like our documentation suggests. Oh, it only works some of the time? How silly, why would you expect it to work all of the time? If you want it to work more often, set require_params to true. We may still use other providers that don't offer it, but you want that, right? You don't? Well, then set our "very_require_params" to "very_true". And then switch a few toggles in the frontend. Oh, and also add these 7 lines just so your other config options don't break. Oh wait, they will break, how silly of us. Is there any way to make it work as advertised? Of course not!

Sorry for the semi-offtopic rant. I still use them every day, though, just not for open models anymore.
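For what it's worth, the knobs that address this look roughly like the following. The "provider" field names are per OpenRouter's provider-routing docs as I remember them; double-check against the current docs before relying on this:

```python
import json

# Sketch of an OpenRouter chat request constrained to avoid silent
# downgrades. The "provider" fields are the ones I believe exist; verify
# against OpenRouter's documentation.
payload = {
    "model": "moonshotai/kimi-k2",
    "messages": [{"role": "user", "content": "hello"}],
    "response_format": {"type": "json_object"},
    "provider": {
        "quantizations": ["bf16", "fp8"],  # reject heavier quants
        "require_parameters": True,  # only providers supporting all params
        "allow_fallbacks": False,    # don't silently reroute elsewhere
    },
}
print(json.dumps(payload, indent=2))
```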

stingraycharles Apr 21, 2026
OpenRouter has "exacto" verified models trying to combat this, but it seems like it's not available for most models.
seism Apr 20, 2026
A test that runs for 15 hours on a high-powered rig is going to be hard to reproduce or scale. But I think this addresses a widespread concern that affects all kinds of cloud services: what you ping is not necessarily what you get.
Lalabadie Apr 20, 2026
You can run the whole suite once at the start for each vendor, then roll through each part of it over a two- or four-week cycle, mimicking regular use. That keeps the evaluation up to date over time.
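A minimal sketch of that rolling schedule (names are illustrative, not from the actual verifier): partition the suite so that one full cycle covers every test exactly once.

```python
def daily_slice(test_ids: list[str], day: int, cycle_days: int = 28) -> list[str]:
    """Tests to run on `day` (0-based) of a rolling 28-day cycle."""
    return [t for i, t in enumerate(test_ids) if i % cycle_days == day % cycle_days]

tests = [f"case_{i:04d}" for i in range(2000)]

# One full cycle covers the whole suite exactly once.
covered = [t for d in range(28) for t in daily_slice(tests, d)]
assert sorted(covered) == sorted(tests)
```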
Majromax Apr 21, 2026
My reading of the article is that the first audience for this test is the vendors themselves. The test is long and comprehensive to give the vendor confidence in its own hosting.
curioussquirrel Apr 20, 2026
After Anthropic, Moonshot is another model provider that restricts tweaking of sampling parameters. I do like the idea of the vendor verifier, though.
charcircuit Apr 20, 2026
If the post-training is done with specific sampling parameters, it would make sense to only use the parameters it was trained with.
curioussquirrel Apr 21, 2026
Yes, but post-training cannot possibly account for all possible use cases. Sane defaults are fine; you can't really do much about sampling parameters in chatbots and coding harnesses anyway. And when making an API call, you have to actively change the parameter in your payload. I don't believe there's any real risk.
charcircuit Apr 22, 2026
The risk is that people tweak it, potentially by accident, and then think the model is bad instead of understanding that they are using it wrong. Exposing the control thus invites reputational damage.
throwa356262 Apr 21, 2026
What does "restricts tweaking of sampling parameters" mean?
yorwba Apr 21, 2026
"enforcing Temperature=1.0 and TopP=0.95 in Thinking mode, with mandatory validation that thinking content is correctly passed back."
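That quoted policy would amount to a server-side check along these lines (a sketch; the field names, especially `reasoning_content`, are my assumption, not Moonshot's actual schema):

```python
def validate_thinking_request(req: dict) -> list[str]:
    """Collect policy violations for a thinking-mode request (sketch only)."""
    errors = []
    if req.get("temperature", 1.0) != 1.0:
        errors.append("thinking mode requires temperature=1.0")
    if req.get("top_p", 0.95) != 0.95:
        errors.append("thinking mode requires top_p=0.95")
    # "thinking content correctly passed back": prior assistant turns must
    # still carry their reasoning (field name hypothetical).
    for msg in req.get("messages", []):
        if msg.get("role") == "assistant" and "reasoning_content" not in msg:
            errors.append("assistant turn missing reasoning_content")
    return errors

print(validate_thinking_request({"temperature": 0.7, "messages": []}))
# → ['thinking mode requires temperature=1.0']
```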
foundry27 Apr 20, 2026
I like this idea. This might be one of the more effective social pressures available for getting inference providers to fix long-standing issues. AWS Bedrock, for example, has crippling defects in its serving stack for Kimi’s K2 and K2.5 models that cause 20%-30% of attempts to emit tool calls to instead silently end the conversation (with no token output). That makes AWS effectively irrelevant as a serious inference provider for Kimi, and conveniently pushes users onto Bedrock’s significantly more expensive Anthropic models for comparable performance on agentic tasks.
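That failure mode is straightforward to measure, which is part of why a standard verifier helps. A sketch of the counting, using a generic OpenAI-style response dict as a stand-in for whatever client you use:

```python
def tool_call_failure_rate(responses: list[dict]) -> float:
    """Fraction of responses with neither tool calls nor token output."""
    if not responses:
        return 0.0
    silent = sum(
        1 for r in responses
        if not r.get("tool_calls") and not r.get("content")
    )
    return silent / len(responses)

sampled = [
    {"tool_calls": [{"name": "search"}]},  # good: emitted a tool call
    {"content": "..."},                    # good: emitted tokens
    {},                                    # silent failure
    {},                                    # silent failure
]
print(tool_call_failure_rate(sampled))  # → 0.5
```

Run against a batch of prompts known to require tool use, a rate in the 20-30% range like the one described above would stand out immediately.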
jychang Apr 21, 2026
It's not new; Kimi's been doing this for months now.

https://github.com/MoonshotAI/K2-Vendor-Verifier

https://github.com/MoonshotAI/Kimi-Vendor-Verifier

Note that this is from before K2.5 and K2.6 even launched.

gertlabs Apr 20, 2026
This is a real issue in our benchmarks. Beware of OpenRouter providers that don't specify quantizations, or that use lower ones than you might be expecting. OpenRouter does provide configuration options for this, though using them often limits your options significantly. That said, even with the best providers, Kimi-K2-Thinking was underwhelming and slow on our benchmarks, albeit interesting and useful for temperature/variation.

Kimi K2.6, however, is the new open-source leader so far. Agentic evaluations are still in progress, but one-shot coding reasoning benchmarks are ready at https://gertlabs.com/?mode=oneshot_coding

kristianp Apr 21, 2026
OpenRouter has an "exacto" [1] option to favour higher-quality providers for a given model. Have you found any benefits to using that?

Edit: Kimi K2 uses int4 during its training as well as inference [2]. I wonder whether that affects quality if different GGUF creators don't convert these correctly.

[1] https://openrouter.ai/docs/guides/routing/model-variants/exa...

[2] https://www.reddit.com/r/LocalLLaMA/comments/1pzfuqg/why_kim...

gertlabs Apr 21, 2026
I did not know about this! We've put a lot of effort into probing providers and their offerings and auto-selecting the best options. I wonder how well their exacto option works.

Going to test it out, thanks!

m1keil Apr 21, 2026
A related article from fireworks.ai about running open-weights models and why such a verifier needs to exist in the first place:

https://fireworks.ai/blog/quality-first-with-kimi-k2p5

punkpeye Apr 21, 2026
Now this is brilliant.

I run an AI gateway (Glama), and we had to delist all third-party providers because some of them are obviously lying about their quantization.

Being able to vet providers would be a major improvement to our ability to offer a more diverse set of providers.

cowartc Apr 21, 2026
The verifier isn't just a fraud detector. It's an admission that open weights alone aren't a shippable contract: the same weights can be served faithfully or degraded, and without a standardized verifier a buyer has no way to know which case they're in. The weights are the easy part; the verification isn't.
_alphageek Apr 21, 2026
Once vendors optimize for the 6 KVV benchmarks, they'll be measuring compliance with KVV, not model fidelity. Is there a rotation strategy in place?