Probably refresh the api models list every couple minutes instead. No one could have guessed the name of GPT-Codex-Spark
mattas•Mar 5, 2026
"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."
They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
TheAceOfHearts•Mar 5, 2026
I think the desire is that in the long-term AI should be able to use any human-made application to accomplish equivalent tasks. This email demo is proof that this capability is a high priority.
spongebobstoes•Mar 5, 2026
not everything has an API, or API use is limited. some UIs are more feature complete than their APIs
some sites try to block programmatic use
UI use can be recorded and audited by a non-technical person
Jacques2Marais•Mar 5, 2026
I guess a big chunk of their target market won't know how to use APIs.
satvikpendem•Mar 5, 2026
The ideal of REST: the HTML and UI are the API.
PaulHoule•Mar 5, 2026
APIs have never been a gift but rather have always been a take-away that lets you do less than you can with the web interface. It’s always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.
But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]
AI is a threat to the “enshittification economy” because it lets us route around it.
[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site changing anything substantial about it is likely to unrecoverably tank their Google rankings so they won’t. A.I. might change the mechanics of that now that your Google traffic is likely to go to zero no matter what you do.
disqard•Mar 5, 2026
> AI is a threat to the “enshittification economy” because it lets us route around it.
This is prescient -- I wonder if the Big Tech entities see it this way. Maybe, even if they do, they're 100% committed to speedrunning the current late-stage-cap wave, and therefore unable to do anything about it.
PaulHoule•Mar 5, 2026
They are not a single thing.
Google has a good model in the form of Gemini and they might figure they can win the AI race and if the web dies, the web dies. YouTube will still stick around.
Facebook is not going to win the AI race with low I.Q. Llama but Zuck believed their business was cooked around the time it became a real business because their users would eventually age out and get tired of it. If I were him I'd be investing in anything that isn't cybernetic, be it gold bars or MMA studios.
Microsoft? They bought Activision for $69 billion. I just can't explain their behavior rationally but they could do worse than their strategy of "put ChatGPT in front of laggards and hope that some of them rise to the challenge and become slop producers."
Amazon is really a bricks-and-mortar play which has the freedom to invest in bricks-and-mortar because investors don't think they are a bricks-and-mortar play.
Netflix? They're cooked as is all of Hollywood. Hollywood's gatekeeping-industrial strategy of producing as few franchises as possible will crack someday and our media market may wind up looking more like Japan, where somebody can write a low-rent light novel like
and J.C. Staff makes a terrible anime that convinces 20k Otaku to drop $150 on the light novels and another $150 on the manga (sorry, no way you can make a balanced game based on that premise!) and the cost structure is such that it is profitable.
lostmsu•Mar 5, 2026
> AI is a threat to the “enshittification economy” because it lets us route around it.
I am not sure about that. We techies avoid enshittification because we recognize shit. Normies will just get their sycophantic enshittified AI that will tell them to continue buying into walled gardens.
Traster•Mar 5, 2026
You can buy a Claude Code subscription for $200 and use way more tokens in Claude Code than if you pay for direct API usage. Anthropic decided you can't take your auth key for Claude Code and use it to hit the API via a different tool. They made that business decision, because they thought it was better for them strategically to do that. They're allowed to make that choice as a business.
Plenty of companies make the same choice about their API, they provide it for a specific purpose but they have good business reasons they want you using the website. Plenty of people write webcrawlers and it's been a cat and mouse game for decades for websites to block them.
This will just be one more step in that cat and mouse game, and if the AI really gets good enough to become a complete intermediary between you and the website? The website will just shut down. We saw it happen before with the open web. These websites aren't here for some heroic purpose, if you screw their business model they will just go out of business. You won't be able to use their website because it won't exist and the websites that do exist will either (a) be made by the same guys writing your agent, or (b) be highly highly optimized to get your agent to screw you.
steve1977•Mar 5, 2026
One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?
embedding-shape•Mar 5, 2026
Why would human language be the wrong interface when they're literally language models? Why would machine code be better when there is probably an order of magnitude less training material with machine code?
You can also test this yourself easily, fire up two agents, ask one to use PL meant for humans, and one to write straight up machine code (or assembly even), and see which results you like best.
BoredPositron•Mar 5, 2026
because they are inherently text based as is code?
steve1977•Mar 5, 2026
But they are abstractions made to cater to human weaknesses.
adwn•Mar 5, 2026
> One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well.
Then go ahead and make an argument. "Why not do X?" is not an argument, it's a suggestion.
jstummbillig•Mar 5, 2026
Because the web, and software more generally, is full of non-APIs, and you do, in fact, need the clicking to work to make agents work generally
modeless•Mar 5, 2026
A world where AIs use APIs instead of UIs to do everything is a world where us humans will soon be helpless, as we'll have to ask the AIs to do everything for us and will have limited ability to observe and understand their work. I prefer that the AIs continue to use human-accessible tools, even if that's less efficient for them. As the price of intelligence trends toward zero, efficiency becomes relatively less important.
npilk•Mar 5, 2026
It feels like building humanoid robots so they can use tools built for human hands. Not clear if it will pay off, but if it does then you get a bunch of flexibility across any task "for free".
Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
packetlost•Mar 5, 2026
I don't see how an API couldn't have full parity with a web interface, the API is how you actually trigger a state transition in the vast majority of cases
coffeemug•Mar 5, 2026
A model that gets good at computer use can be plugged in anywhere you have a human. A model that gets good at API use cannot. From the standpoint of diffusion into the economy/labor market, computer use is much higher value.
f0e4c2f7•Mar 5, 2026
Lots of services have no desire to ever expose an API. This approach lets you step right over that.
If an API is exposed you can just have the LLM write something against that.
kristianp•Mar 5, 2026
This opens up a new question: how does bot detection work when the bot is using the computer via a gui?
itintheory•Mar 5, 2026
On its face, I'm not sure that's a new question. Bots using browser automation frameworks (puppeteer, selenium, playwright etc) have been around for a while. There are signals used in bot detection tools like cursor movement speed, accuracy, keyboard timing, etc. How those detection tools might update to support legitimate bot users does seem like an open question to me though.
MattDaEskimo•Mar 5, 2026
Same reason why Wikipedia deals with so many people scraping its web page instead of using their API:
Optimizations are secondary to convenience
bottlepalm•Mar 5, 2026
The vast majority of websites you visit don’t have usable APIs, and those that do have very poor discovery of those APIs.
Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consume compared to all the back and forth verbose JSON payloads of APIs
LUmBULtERA•Mar 5, 2026
>The vast majority of websites you visit don’t have usable APIs and very poor discovery of the those APIs.
I think an important thing here is that a lot of websites/platforms don't want AIs to have direct API access, because they are afraid that AIs would take the customer "away" from the website/platform, making the consumer a customer of the AI rather than a customer of the website/platform. Therefore for AIs to be able to do what customers want them to do, they need their browsing to look just like the customer's browsing/browser.
Looks like it's an order of magnitude off. Misprint?
GenerWork•Mar 5, 2026
Looks like an extra zero was added?
benlivengood•Mar 5, 2026
Government pricing :)
outside2344•Mar 5, 2026
$30 per kill approval
glerk•Mar 5, 2026
Looks like fair price discovery :)
dpoloncsak•Mar 5, 2026
>" GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities"
That's just not how pricing is supposed to work...? Especially for a 'non-profit'. You're charging me more so I know I have the better model?
elicash•Mar 5, 2026
Can't you continue to use the older model, if you prefer the pricing?
But they also claim this new model uses fewer tokens, so it still might ultimately be cheaper even if per token cost is higher.
dpoloncsak•Mar 5, 2026
I'm not against the pricing, just seems uncommon to frame it in the way they did, as opposed to the usual 'assume the customer expects more performance will cost more'
I guess they have to sell to investors that the price to operate is going down, while still needing more from the user to be sustainable
jbellis•Mar 5, 2026
You can, until they turn it off.
Anthropic is pulling the plug on Haiku 3 in a couple months, and they haven't released anything in that price range to replace it.
Sabinus•Mar 5, 2026
Surely there are open source models that surpass Haiku 3 at better price points by now.
FergusArgyll•Mar 5, 2026
Maybe it's finally a bigger pretrain?
dpoloncsak•Mar 5, 2026
I feel like that would have been highlighted then. "As this is a bigger pretrain, we have to raise prices".
They're framing it pretty directly "We want you to think bigger cost means better model"
minimaxir•Mar 5, 2026
The marquee feature is obviously the 1M context window, compared to the ~200k other models support, with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/
Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.
I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses when their context windows are mostly full, but we'll see.
Why would someone use Claude Code instead? Or any other harness? Or why only use one?
My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.
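A minimal sketch of that fan-out-and-compare pattern, not the commenter's actual tooling. The agents here are hypothetical stub functions standing in for real harnesses, and the scoring function (`len`) is a placeholder for whatever comparison you would really use (tests passing, a manual review, a judge model):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in type for an agent harness: prompt in, candidate answer out.
Agent = Callable[[str], str]


def fan_out(prompt: str, agents: Dict[str, Agent]) -> Dict[str, str]:
    """Send the same prompt to every agent concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, prompt) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def pick_best(results: Dict[str, str], score: Callable[[str], float]) -> str:
    """Return the name of the agent whose result scores highest."""
    return max(results, key=lambda name: score(results[name]))


# Stub agents for illustration only; real ones would call a CLI or an API.
agents: Dict[str, Agent] = {
    "agent_a": lambda prompt: prompt.upper(),
    "agent_b": lambda prompt: prompt + " (with tests)",
}

results = fan_out("fix the bug", agents)
best = pick_best(results, score=len)  # placeholder scoring: longest answer wins
```

Threads are enough here because each agent call would be I/O-bound (waiting on a subprocess or network response), so the requests overlap without extra machinery.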
surgical_fire•Mar 5, 2026
I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).
I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).
If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.
simianwords•Mar 5, 2026
No my question was why would I use codex over gpt 5.4
surgical_fire•Mar 5, 2026
Ahh, good question. I misunderstood you, apologies.
There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?
Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?
landtuna•Mar 5, 2026
5.3 Codex is $1.75/$14, and 5.4 is $2.50/$15.
surgical_fire•Mar 5, 2026
There you go. It makes perfect sense to keep it around then.
athrowaway3z•Mar 5, 2026
They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating.
I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.
Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.
You could add more scaffolding to fix this, but Claude proves you shouldn't have to.
I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.
surgical_fire•Mar 5, 2026
> They perform at a somewhat equal level on writing single files.
That's not the experience I have. I had it do more complex changes spanning multiple files and it performed well.
I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).
hnsr•Mar 5, 2026
> I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).
Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.
I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things
jeswin•Mar 5, 2026
When it comes to lengthy non-trivial work, codex is much better but also slower.
lmeyerov•Mar 5, 2026
In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc.
It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.
We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.
synergy20•Mar 5, 2026
in my testing codex actually planned worse than claude but coded better once the plan is set, and faster.
it is also excellent to cross check claude's work, always finding great weakness each time.
pmarreck•Mar 5, 2026
That’s why I think the sweet spot is to write up plans with Claude and then execute them with Codex
GorbachevyChase•Mar 5, 2026
Weird. It used to be the opposite. My own experience is that Claude’s behind-the-scenes support is a differentiator for supporting office work. It handles documents, spreadsheets and such much better than anyone else (presumably with server side scripts). Codex feels a bit smarter, but it inserts a lot of checkpoints to keep from running too long. Claude will run a plan to the end, but the token limits have become so small in the last couple months that the $20 plan basically only buys one significant task per day. The iOS app is what makes me keep the subscription.
tedsanders•Mar 5, 2026
Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.
For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
Curious to hear if people have use cases where they find 1M works much better!
(I work at OpenAI.)
simianwords•Mar 5, 2026
Do you maybe want to give us users some hints on what to compact and throw away? In codex CLI maybe you can create a visual tool that I can see and quickly check mark things I want to discard.
Sometimes I’m exploring some topic and that exploration is not useful but only the summary.
Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language.
Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.
akiselev•Mar 5, 2026
> Curious to hear if people have use cases where they find 1M works much better!
Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.
(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)
That's an interesting point regarding context Vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tools around compaction than just "I'll compact what I want, brace yourselves" without warning.
Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.
thyb23•Mar 5, 2026
This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it.
The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.
That way you stay in control of both the context budget and the level of detail the agent operates with.
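A rough sketch of the three-state idea, assuming a flat list of chunks rather than the full tree view described above. All names (`Keep`, `Chunk`, `context_cost`) and the token counts are hypothetical; no existing agent exposes this interface as far as this thread says:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Keep(Enum):
    DROP = "drop"        # discard the chunk entirely
    SUMMARY = "summary"  # keep only a summarized version
    FULL = "full"        # keep the full history


@dataclass
class Chunk:
    label: str
    full_tokens: int
    summary_tokens: int
    state: Keep = Keep.FULL

    def cost(self) -> int:
        """Tokens this chunk occupies in context under its current state."""
        if self.state is Keep.DROP:
            return 0
        if self.state is Keep.SUMMARY:
            return self.summary_tokens
        return self.full_tokens


def context_cost(chunks: List[Chunk]) -> int:
    """Total context budget used by all chunks."""
    return sum(chunk.cost() for chunk in chunks)


# Made-up numbers purely for illustration.
chunks = [
    Chunk("exploration", full_tokens=40_000, summary_tokens=2_000),
    Chunk("plan", full_tokens=3_000, summary_tokens=500),
    Chunk("failed attempt", full_tokens=25_000, summary_tokens=1_500),
]
before = context_cost(chunks)

chunks[0].state = Keep.SUMMARY  # keep only the summary of the exploration
chunks[2].state = Keep.DROP     # throw away the failed attempt
after = context_cost(chunks)
```

With these numbers the budget goes from 68,000 tokens to 5,000, which is the kind of before/after figure a pre-compaction prompt could show the user.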
Folcon•Mar 5, 2026
I do find it really interesting that more coding agents don't have this as a toggleable feature, sometimes you really need this level of control to get useful capability
Someone1234•Mar 5, 2026
Yep; I've actually had entire jobs essentially fail due to a bad compaction. It lost key context, and it completely altered the trajectory.
I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.
gspetr•Mar 5, 2026
I have found a bigger context window quite useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.
I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.
I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
woadwarrior01•Mar 5, 2026
Please don't post links with tracking parameters (t=jQb...).
Haha. This was the second time in like a year that I’ve posted a Twitter link, and the second time someone complained. Okay, I’ll try to remove those before posting, and I’ll edit this one out.
Feels like a losing battle, but hey, the audience is usually right.
woadwarrior01•Mar 5, 2026
I'm sorry, but it's my pet peeve. If you're on iOS/macOS I built a 100% free and privacy-friendly app to get rid of tracking parameters from hundreds of different websites, not just X/Twitter.
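For anyone who wants to do this by hand, a minimal sketch using only the Python standard library. This is not how the app above works (that isn't described here); the parameter list is an assumption: `t` is the one mentioned in this thread, while `s` and the `utm_` prefix are commonly seen tracking parameters added for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; a real tool would use per-site rules.
TRACKING_KEYS = {"t", "s"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Return the URL with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_tracking("https://x.com/user/status/123?t=abc&s=20")
mixed = strip_tracking("https://example.com/a?page=2&utm_source=hn")
```

Note that stripping `t` everywhere is too blunt for general use, since some sites use `t` for legitimate things such as a video timestamp; that is why per-site rules are the safer design.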
So what is your motivation for doing this, incidentally? Can you be explicit about it? I am genuinely curious.
Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com
monocularvision•Mar 5, 2026
This is great! I have been meaning to implement this sort of thing in my existing Shortcuts flow but I see you already support it in Shortcuts! Thank you for this!
Anywhere I can toss a Tip for this free app?
FrankBooth•Mar 5, 2026
What’s the connection with context size in that thread? It seems more like an instruction following problem.
nowittyusername•Mar 5, 2026
Personally what I am more interested in is the effective context window. I find that when using codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation at around that point. Though as of about a month ago that point is now below that which is great. Anyways, I feel that I will not be using that 1 million context at all in 5.4 but if the effective window is something like 400k context, that by itself is already a huge win. That means longer sessions before compaction and the agent can keep working on complex stuff for longer. But then there is the issue of intelligence of 5.4. If it's as good as 5.2 high I am a happy camper, I found 5.3 anything... lacking personally.
asabla•Mar 5, 2026
I really don't have any numbers to back this up. But it feels like the sweet spot is around ~500k context size. Anything larger than that, you usually have scoping issues, trying to do too much at the same time, or having issues with the quality of what's in the context at all.
For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.
lubesGordi•Mar 5, 2026
It's funny that the context window size is such a thing still. Like the whole LLM 'thing' is compression. Why can't we figure out some equally brilliant way of handling context besides just storing text somewhere and feeding it to the llm? RAG is the best attempt so far. We need something like a dynamic in flight llm/data structure being generated from the context that the agent can query as it goes.
netinstructions•Mar 5, 2026
People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.
It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)
Flashtoo•Mar 5, 2026
> Prompts with more than 272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
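To make the quoted rule concrete, here is a small worked sketch. It assumes the GPT-5.4 base prices quoted elsewhere in this thread ($2.50/M input, $15/M output), models a single request rather than a whole session, and ignores cached-input and batch discounts; the function name is made up for illustration.

```python
# Assumed base prices (USD per million tokens), as quoted in this thread.
INPUT_PER_M = 2.50
OUTPUT_PER_M = 15.00
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, per the quoted pricing rule


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request under the quoted long-context rule."""
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = INPUT_PER_M * (2.0 if long_context else 1.0)
    output_rate = OUTPUT_PER_M * (1.5 if long_context else 1.0)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


short_prompt = request_cost(200_000, 10_000)  # below the threshold
long_prompt = request_cost(300_000, 10_000)   # above the threshold
```

Under these assumptions the 200K-token prompt costs about $0.65 and the 300K-token prompt about $1.73, so crossing the threshold reprices the whole request, not only the tokens past 272K.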
netinstructions•Mar 5, 2026
Thanks, it looks like the pricing page keeps getting updated.
Even right now one page refers to prices for "context lengths under 270K" whereas another has pricing for "<272K context length"
damsta•Mar 5, 2026
There is extra cost for >272K:
> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Which, Claude has the same deal. You can get a 1M context window, but it's gonna cost ya. If you run /model in claude code, you get:
Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
1. Default (recommended) Opus 4.6 · Most capable for complex work
2. Opus (1M context) Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
3. Sonnet Sonnet 4.6 · Best for everyday tasks
4. Sonnet (1M context) Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
5. Haiku Haiku 4.5 · Fastest for quick answers
minimaxir•Mar 5, 2026
Good find, and that's too small a print for comfort.
ValentineC•Mar 5, 2026
It's also in the linked article:
> GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.
glenstein•Mar 5, 2026
Wow, that's diametrically the opposite point: the cost is *extra*, not free.
apetresc•Mar 5, 2026
Diametrically opposite to tokens beyond 200K being literally free? As in, you only pay for the first 200K tokens and the remaining 800K cost $0.00?
I don't think that's a fair reading of the original post at all, obviously what they meant by "no cost" was "no increase in the cost".
andai•Mar 5, 2026
It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.
For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.
The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.
According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.
Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
paulddraper•Mar 5, 2026
I don’t know about 5.4 specifically, but in the past anything over 200k wasn’t that great anyway.
Like, if you really don’t want to spend any effort trimming it down, sure use 1m.
Otherwise, 1m is an anti pattern.
AtreidesTyrant•Mar 5, 2026
token rot exists for any context window at above 75% capacity, that's why so many have pushed for 1 mil windows
luca-ctx•Mar 5, 2026
Context rot is definitely still a problem but apparently it can be mitigated by doing RL on longer tasks that utilize more context. Recent Dario interview mentions this is part of Anthropic’s roadmap.
smusamashah•Mar 5, 2026
Gemini already has 1M or 2M context window right?
Chance-Device•Mar 5, 2026
I’m sure the military and security services will enjoy it.
varispeed•Mar 5, 2026
prompt> Hi we want to build a missile, here is the picture of what we have in the yard.
mirekrusin•Mar 5, 2026
{ tools: [ { name: "nuke", description: "Use when sure.", ... { lat: number, long: number } } ] }
Insanity•Mar 5, 2026
Just remember an ethical programmer would never write a function “bombBaghdad”. Rather they would write a function “bombCity(target City)”.
jakeydus•Mar 5, 2026
class CityBomberFactory(RapidInfrastructureDeconstructionTemplateInterface):
    pass
theParadox42•Mar 5, 2026
The self reported safety score for violence dropped from 91% to 83%.
skrebbel•Mar 5, 2026
What the hell is a "safety score for violence"?
murat124•Mar 5, 2026
I asked an AI. I thought they would know.
What the hell is a "safety score for violence"?
A “safety score for violence” is usually a risk rating used by platforms, AI systems, or moderation tools to estimate how likely a piece of content is to involve or promote violence. It’s not a universal standard—different companies use their own versions—but the idea is similar everywhere.
What it measures
A safety score typically evaluates whether text, images, or videos contain things like:
Threats of violence (“I’m going to hurt someone.”)
Instructions for harming people
Glorifying violent acts
Descriptions of physical harm or abuse
Planning or encouraging attacks
0xffff2•Mar 5, 2026
I still can't tell which direction this score goes... Does a decreasing score mean it is "less safe" (i.e. "more violent") or does it mean it is "less violent" (i.e. "more safe")?
Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.
Sabinus•Mar 5, 2026
Get a redirect plugin and set it up to send you to xcancel instead of Twitter. I've done it, and it's very convenient.
karmasimida•Mar 5, 2026
It is a bigger model, confirmed
Aboutplants•Mar 5, 2026
It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things but in general I think we are approaching a real level playing field in terms of ability.
thewebguyd•Mar 5, 2026
Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.
gregpred•Mar 5, 2026
Memory (model usage over time) is the moat.
energy123•Mar 5, 2026
Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.
observationist•Mar 5, 2026
Benchmarks don't capture a lot - relative response times, vibes, what unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models - there are things Grok is better at than ChatGPT even though the benchmarks say the opposite, and vice versa. There's also the UI and tools at hand - ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.
Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.
I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.
bigyabai•Mar 5, 2026
> If this rate of progress is steady, though, this year is gonna be crazy.
Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
observationist•Mar 5, 2026
If you look at the difference in quality between gpt-2 and 3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive, it's just that they're both similarly capable and competent. I don't think it's an S curve; we're not plateauing. Million token context windows and cached prompts are a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human level generality, but at the very least will help us discover what the next missing piece is for AGI.
ryandrake•Mar 5, 2026
For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5.
I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
mootothemax•Mar 5, 2026
Huh, that’s interesting, I’ve been having very similar thoughts lately about what the near-ish term of this tech looks like.
My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
baq•Mar 5, 2026
Gemini 3.1 slaps all other models at subtle concurrency bugs, sql and js security hardening when reviewing. (Obviously haven’t tested gpt 5.4 yet.)
It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.
adonese•Mar 5, 2026
Which subscription do you have to use it? Via Google ai pro and gemini cli i always get timeouts due to model being under heavy usage. The chat interface is there and I do have 3.1 pro as well, but wondering if the chat is the only way of accessing it.
baq•Mar 5, 2026
Cursor sub from $DAYJOB.
observationist•Mar 5, 2026
I have a few standard problems I throw at AI to see if they can solve them cleanly, like visualizing a neural network, then sorting each neuron in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest, and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding codex a few weeks back and got it done, but it conceptually seems like it should be a one-shot problem.
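For reference, the reordering part of the problem (without the visualization) can be sketched roughly as below. This is a minimal sketch under assumptions not stated in the comment: the network is a plain MLP stored as lists of `(W, b)` pairs, the sort key is the sum of absolute incoming weights per neuron, and only hidden layers are permuted, since reordering the output layer would change which output is which. The names `forward` and `sort_hidden_neurons` are made up for illustration.

```python
import math

def forward(x, layers):
    """Run a small MLP. Each layer is (W, b), where W[j][i] is the weight
    from input i to neuron j. tanh is applied on hidden layers only."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
        if idx < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

def sort_hidden_neurons(layers):
    """Return an equivalent network whose hidden neurons are sorted,
    largest to smallest, by the sum of absolute incoming weights
    (an assumed sort key)."""
    # Copy so the caller's network is left untouched.
    layers = [([row[:] for row in W], b[:]) for W, b in layers]
    for k in range(len(layers) - 1):  # the output layer stays in place
        W, b = layers[k]
        order = sorted(range(len(W)), key=lambda j: -sum(abs(w) for w in W[j]))
        # Permute this layer's neurons: rows of W and entries of b.
        layers[k] = ([W[j] for j in order], [b[j] for j in order])
        # Permute the next layer's input columns the same way, so each
        # weight still multiplies the activation it was attached to.
        W_next, b_next = layers[k + 1]
        layers[k + 1] = ([[row[j] for j in order] for row in W_next], b_next)
    return layers
```

Comparing `forward(x, layers)` with `forward(x, sort_hidden_neurons(layers))` on a few inputs should give the same outputs up to floating-point rounding, which is the "network function remains exactly the same" check.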
basch•Mar 5, 2026
>ChatGPT image gen is just straight up better
Yet so much slower than Gemini / Nano Banana to make it almost unusable for anything iterative.
druskacik•Mar 5, 2026
That has been true for some time now, definitely since Claude 3 release two years ago.
kseniamorph•Mar 5, 2026
makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.
swingboy•Mar 5, 2026
Why do so many people in the comments want 4o so bad?
embedding-shape•Mar 5, 2026
Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs seem to have preferred 4o for some reason. There were a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.
They have AI psychosis and think it's their boyfriend.
The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.
baq•Mar 5, 2026
Somebody on Twitter used Claude code to connect… toys… as mcps to Claude chat.
We’ve seen nothing yet.
mikkupikku•Mar 5, 2026
My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.
vntok•Mar 5, 2026
Was your teacher Ted Nelson?
mikkupikku•Mar 5, 2026
I wish, dude is a legend.
Sharlin•Mar 5, 2026
There are many games these days that support controllable sex toys. There's an interface for that, of course: https://github.com/buttplugio/buttplug. Written in Rust, of course.
the_af•Mar 5, 2026
> Written in Rust, of course.
Safety is important.
manmal•Mar 5, 2026
ding-dong-cli is needed
Herring•Mar 5, 2026
what.. :o
MattGaiser•Mar 5, 2026
The writing with the 5 models feels a lot less human. It is a vibe, but a common one.
cheema33•Mar 5, 2026
> Why do so many people in the comments want 4o so bad?
You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.
dom96•Mar 5, 2026
Why do none of the benchmarks test for hallucinations?
netule•Mar 5, 2026
Optics. It would be inconvenient for marketing, so they leave those stats to third parties to figure out.
tedsanders•Mar 5, 2026
In the text, we did share one hallucination benchmark: Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts).
Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.
(I work at OpenAI.)
MarcFrame•Mar 5, 2026
how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?
nico1207•Mar 5, 2026
Well 5.4-pro is the more expensive and more advanced version of 5.4-thinking so why wouldn't it?
bicx•Mar 5, 2026
That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?
conradkay•Mar 5, 2026
Sonnet was pretty close to (or better than) Opus in a lot of benchmarks, I don't think it's a big deal
jitl•Mar 5, 2026
wat
0123456789ABCDE•Mar 5, 2026
maybe gp's use of the word "lots" is unwarranted
https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long context Reasoning, IFBench.
Thanks. We'll merge the threads, but this time we'll do it hither, to spread some karma love.
ZeroCool2u•Mar 5, 2026
Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.
Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.
Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
highfrequency•Mar 5, 2026
Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).
ZeroCool2u•Mar 5, 2026
Frontier Math, GPQA Diamond, and Browsecomp are the benchmarks I noticed this on.
csnweb•Mar 5, 2026
Are you maybe comparing the pro model to the non pro model with thinking? Granted it’s a bit confusing but the pro model is 10 times more expensive and probably much larger as well.
ZeroCool2u•Mar 5, 2026
Ah yes, okay that makes more sense!
oersted•Mar 5, 2026
I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while.
It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.
Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
ZeroCool2u•Mar 5, 2026
Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.
nsingh2•Mar 5, 2026
From what I've read online it's not necessarily an unquantized version, it seems to go through longer reasoning traces and runs multiple reasoning traces at once. Probably overkill for most tasks.
logicchains•Mar 5, 2026
>It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.
The performance improvement isn't marginal if you're doing something particularly novel/difficult.
andoando•Mar 5, 2026
The thinking models are additionally trained with reinforcement learning to produce chain of thought reasoning
I must have been sleeping when "sheet" "brief" "primer" etc become known as "cards".
I really thought weirdly worded and unnecessary "announcement" linking to the actual info along with the word "card" were the results of vibe slop.
realityfactchex•Mar 5, 2026
Card is slightly odd naming indeed.
Criticisms aside (sigh), according to Wikipedia, the term was introduced when proposed by mostly Googlers, with the original paper [0] submitted in 2018. To quote,
"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""
To me, model card makes sense for something like this https://x.com/OpenAI/status/2029620619743219811. For "sheet"/"brief"/"primer" it is indeed a bit annoying. I like to see the compiled results front and center before digging into a dossier.
nickysielicki•Mar 5, 2026
can anyone compare the $200/mo codex usage limits with the $200/mo claude usage limits? It’s extremely difficult to get a feel for whether switching between the two is going to result in hitting limits more or less often, and it’s difficult to find discussion online about this.
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
ritzaco•Mar 5, 2026
I haven't tried the $200 plans but I have Claude and Codex $20 and I feel like I get a lot more out of Codex before hitting the limits. My tracker certainly shows higher tokens for Codex. I've seen others say the same.
lostmsu•Mar 5, 2026
Sadly comment ratings are not visible on HN, so the only way to corroborate is to write it explicitly: Codex $20 includes significantly more work done and is subjectively smarter.
winstonp•Mar 5, 2026
Agree. Claude tends to produce better design, but from a system understanding and architecture perspective Codex is the far better model
vtail•Mar 5, 2026
My own experience is that I get far far more usage (and better quality code, too) from codex. I downgraded my Claude Max to Claude Pro (the $20 plan) and am now using codex with Pro plan exclusively for everything.
FergusArgyll•Mar 5, 2026
Codex usage limits are definitely more generous. As for their strength, that's hard to say / personal taste
CSMastermind•Mar 5, 2026
Codex limits are much more generous than claude.
I switch between both but codex has also been slightly better in terms of quality for me personally at least.
mikert89•Mar 5, 2026
I personally like the 100 dollar one from claude, but the gpt4 pro can be very good
gavinray•Mar 5, 2026
I almost never hit my $20 Codex limits, whereas I often hit my Claude limits.
tauntz•Mar 5, 2026
I've only run into the codex $20 limit once with my hobby project. With my Claude ~$20 plan, I hit limits after about 3(!) rather trivial prompts to Opus :/
throwaway911282•Mar 5, 2026
you get more from codex than claude any day. and it's more reliable as well.
strongpigeon•Mar 5, 2026
It's interesting that they charge more for the > 200k token window, but the benchmark score seems to go down significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what that implies.
simianwords•Mar 5, 2026
This is exactly what I would expect. Why do you find it surprising
strongpigeon•Mar 5, 2026
I guess that you pay more for worse quality to unlock use cases that could maybe be solved by better context management.
Tiberium•Mar 5, 2026
They don't actually seem to charge more for the >200k tokens on the API. OpenRouter and OpenAI's own API docs do not have anything about increased pricing for >200k context for GPT-5.4. I think the 2x limit usage for higher context is specific to using the model over a subscription in Codex.
tmpz22•Mar 5, 2026
Does this improve Tomahawk Missile accuracy?
ch4s3•Mar 5, 2026
They're already accurate within 5-10m at Mach 0.74 after traveling 2k+ km. It's 5m long so it seems pretty accurate. How much more could you expect?
mikkupikku•Mar 5, 2026
You could definitely do better than that with image recognition for terminal guidance. But I would assume those published accuracy numbers are very conservative anyway..
keithnz•Mar 5, 2026
I think for LLM like Open AI, it wouldn't be about hitting the target but target selection. Target selection is probably the most likely thing that won't be accurate
simianwords•Mar 5, 2026
What is the point of gpt codex?
catketch•Mar 5, 2026
-codex variant models in earlier versions were just fine tuned for coding work, and had a little better performance for related tool calling and maybe instruction calling.
in 5.4 it looks like they just collapsed that capability into the single frontier family model
simianwords•Mar 5, 2026
Yes so I’m even more confused. Why would I use codex?
joshuacc•Mar 5, 2026
Presumably you don’t anymore if you have 5.4.
energy123•Mar 5, 2026
You choose gpt-5.4 in the /model picker inside the codex app/cli if you want.
akmarinov•Mar 5, 2026
They’ll likely come out with a 5.4-Codex at some point, that’s what they did with 5 and 5.2
ilaksh•Mar 5, 2026
Remember when everyone was predicting that GPT-5 would take over the planet?
dbbk•Mar 5, 2026
It was truly scary, according to Sam...
zeeebeee•Mar 5, 2026
iTs lITeRaLlY AGI bro
nthypes•Mar 5, 2026
$30/M Input and $180/M Output Tokens is nuts. Ridiculously expensive for a not that great bump in intelligence when compared to other models.
moralestapia•Mar 5, 2026
Don't use it?
nthypes•Mar 5, 2026
Gemini 3.1 Pro
$2/M Input Tokens
$15/M Output Tokens
Claude Opus 4.6
$5/M Input Tokens
$25/M Output Tokens
nthypes•Mar 5, 2026
Just to clarify, the pricing above is for GPT-5.4 Pro. For standard here is the pricing:
$2.5/M Input Tokens
$15/M Output Tokens
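To put those per-token prices side by side, here is a rough sketch of the arithmetic. The prices are simply the ones quoted in this thread (not independently verified), and the 100k-input / 10k-output request size is an arbitrary assumption chosen for illustration.

```python
# USD per 1M tokens as (input, output), copied from the comments in this thread.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Pro": (30.00, 180.00),
    "Gemini 3.1 Pro": (2.00, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the quoted per-1M-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical request size: 100k tokens in, 10k tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 10_000):.2f}")
```

At these quoted prices Pro comes out at 12x standard on both input and output, so the ratio is the same whatever the input/output mix.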
rvz•Mar 5, 2026
You didn't realize they can increase / change prices for intelligence?
This should not be shocking.
nickthegreek•Mar 5, 2026
OP made no mention of not understanding cost relation to intelligence. In fact, they specifically call out the lack of value.
energy123•Mar 5, 2026
For Pro
joe_mamba•Mar 5, 2026
Better tokens per dollar could be useless for comparison if the model can't solve your problem.
I use ChatGPT primarily for health related prompts. Looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc.
Interesting, the "Health" category seems to report worse performance compared to 5.2.
paxys•Mar 5, 2026
Models are being neutered for questions related to law, health etc. for liability reasons.
cj•Mar 5, 2026
I'm sometimes surprised how much detail ChatGPT will go into without giving any disclaimers.
I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.
I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.
Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe it, and it spit out blood test results that it found in the project context from an earlier date, not the one attached to the prompt. That was weird.
bargainbin•Mar 5, 2026
Anecdotal, but I asked Claude the other day about how to dilute my medication (HCG) and it flat out refused and started lecturing me about abusing drugs.
I copy and pasted into ChatGPT, it told me straight away, and then for a laugh said it was actually a magical weight loss drug that I'd bought off the dark web... And it started giving me advice about unregulated weight loss drugs and how to dose them.
staticman2•Mar 5, 2026
If you had created a project with custom instructions and/ or custom style I think you could have gotten Claude to respond the way you wanted just fine.
tiahura•Mar 5, 2026
Are you sure about that? Plenty of lawyers that use them everyday aren't noticing.
partiallypro•Mar 5, 2026
I've done the same, and I tested the same prompts with Claude and Google, and they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't fall on this. Claude and Google are dangerously unusable on the subject of health, from my experience.
zeeebeee•Mar 5, 2026
what's best in your experience? i've always felt like opus did well
wahnfrieden•Mar 5, 2026
No Codex model yet
minimaxir•Mar 5, 2026
GPT-5.4 is the new Codex model.
wahnfrieden•Mar 5, 2026
Finally
nico1207•Mar 5, 2026
GPT-5.3-Codex is superior to GPT-5.4 in Terminal Bench with Codex, so not really
conradkay•Mar 5, 2026
General consensus seems to be that it's still a better coding model, overall
koakuma-chan•Mar 5, 2026
It just released, how is there a general consensus already
timpera•Mar 5, 2026
> Steerability: Similarly to how Codex outlines its approach when it starts working, GPT‑5.4 Thinking in ChatGPT will now outline its work with a preamble for longer, more complex queries. You can also add instructions or adjust its direction mid-response.
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
yanis_t•Mar 5, 2026
These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore.
It is time for a product, not for a marginally improved model.
esafak•Mar 5, 2026
That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.
simlevesque•Mar 5, 2026
Nah, the second you finish your build they release their version and then it's game over.
acedTrex•Mar 5, 2026
Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both
ipsum2•Mar 5, 2026
The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!
cj•Mar 5, 2026
One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?
> assess harmful stereotypes by grading differences in how a model responds
> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings
Are we seriously using old models to rate new models?
titanomachy•Mar 5, 2026
Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…
hex4def6•Mar 5, 2026
If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.
Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
utopiah•Mar 5, 2026
Benchmarks?
I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export) then whenever a new model comes around I'd throw at it 5 random of MY fails (not benchmarks from others, those will come too anyway) and see if it's better, same, worse, for MY use cases in minutes.
If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.
Really doesn't seem complicated nor taking much time to forge a realistic opinion.
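A minimal sketch of what that could look like, assuming failed prompts are kept in a JSON Lines file. `ask_model` is a hypothetical callable (prompt in, answer out) standing in for whatever client you actually use; nothing here calls a real API, and judging "better, same, worse" is left to the reader.

```python
import json
import random
from pathlib import Path

def log_failure(path, prompt, note=""):
    """Append one failed prompt to a JSON Lines file."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt, "note": note}) + "\n")

def sample_failures(path, k=5, seed=None):
    """Pick up to k past failures at random, without repeats."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return random.Random(seed).sample(records, min(k, len(records)))

def retry(records, ask_model):
    """Re-run each sampled prompt against a new model.

    ask_model is a hypothetical callable: prompt str -> answer str.
    Returns (prompt, answer) pairs for manual review.
    """
    return [(r["prompt"], ask_model(r["prompt"])) for r in records]
```

The same file format would work for the "useful prompts" regression set: keep a second file and run it through `retry` whenever the first pass looks like an improvement.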
earth2mars•Mar 5, 2026
I am actually super impressed with Codex-5.3 extra high reasoning. It's a drop-in replacement (in fact better than Claude Opus 4.6; lately Claude has been super verbose, going in circles getting things resolved). I stopped using Claude mostly and am having a blast with Codex 5.3. Looking forward to 5.4 in Codex.
satvikpendem•Mar 5, 2026
Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).
CryZe•Mar 5, 2026
I've been using both Opus 4.6 and Codex 5.3 in VSCode's Copilot and while Opus is indeed 3x and Codex is 1x, that doesn't seem to matter as Opus is willing to go work in the background for like an hour for 3 credits, whereas Codex asks you whether to continue every few lines of code it changes, quickly eating way more credits than Opus. In fact Opus in Copilot is probably underpriced, as it can definitely work for an hour with just those 12 cents of cost. Which I'm not sure you get anywhere else at such a low price.
Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and told it to do it all. But I've never gotten it to actually work on it all and instead after the first task is complete it always asks if it should move onto the next task. In fact, I always tell it not to ask me and yet it still does. So unless I need to do very specific prompt engineering, that does not seem to work for me.
satvikpendem•Mar 5, 2026
That shouldn't really make a difference because you can just prompt Codex to behave the same way, having it load a big list of todo items perhaps from a markdown file and asking it to iterate until it's finished without asking for confirmation, and that'll still cost 1x over Opus' 3x.
whynotminot•Mar 5, 2026
I still love Opus but it's just too expensive / eats usage limits.
I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.
Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.
braebo•Mar 5, 2026
I struggle to believe this. Codex can’t hold a candle to Claude on any task I’ve given it.
satvikpendem•Mar 5, 2026
It's more hedonic adaptation, people just aren't as impressed by incremental changes anymore over big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and it's because for most people, most models are good enough and now it's all about applications.
Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!
satvikpendem•Mar 5, 2026
Qwen 3.5 small models are actually very impressive and do beat out larger proprietary models.
dmix•Mar 5, 2026
Plus people just really like to whine on the internet
kranke155•Mar 5, 2026
The models are so good that incremental improvements are not super impressive. We literally would benefit more from maybe sending 50% of model spending into spending on implementation into the services and industrial economy. We literally are lagging in implementation, specialised tools, and hooks so we can connect everything to agents. I think.
wahnfrieden•Mar 5, 2026
5.3 codex was a huge leap over 5.2 for agentic work in practice. have you been using both of those or paying attention more to benchmark news and chatgpt experience?
softwaredoug•Mar 5, 2026
The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs
iterateoften•Mar 5, 2026
The product is putting the skills / harness behind the api instead of the agent locally on your computer and iterating on that between model updates. Close off the garden.
Not that I want it, just where I imagine it going.
metalliqaz•Mar 5, 2026
They need something that POPS:
The new GPT -- SkyNet for _real_
jascha_eng•Mar 5, 2026
When did they stop putting competitor models on the comparison table btw?
And yeh I mean the benchmark improvements are meh. Context Window and lack of real memory is still an issue.
varispeed•Mar 5, 2026
The scores increase and as new versions are released they feel more and more dumbed down.
tgarrett•Mar 5, 2026
Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.
brcmthrowaway•Mar 5, 2026
You're just chatting yourself out of a job.
axus•Mar 5, 2026
Giving the right answer: $1
Asking the right question: $9,999
slibhb•Mar 5, 2026
If we don't need plasma physicists anymore then we probably have fusion reactors or something, which seems like a fine trade. (In reality we're going to want humans in the loop for the foreseeable future)
mindwok•Mar 5, 2026
They don't need to be impressive to be worthwhile. I like incremental improvements, they make a difference in the day to day work I do writing software with these.
prydt•Mar 5, 2026
I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.
Imustaskforhelp•Mar 5, 2026
I agree with ya. You aren't alone in this. For what it's worth, ChatGPT subscription cancellations have risen ~300% in the last month.
Also, Anthropic/Gemini/even Kimi models are pretty good for what it's worth. I used to use ChatGPT and I still sometimes accidentally open it but I use Gemini/Claude nowadays and I personally find them to be better anyways.
throwaway911282•Mar 5, 2026
google and anthropic have govt contracts long before openai.. if you are taking a stance you should rather use oss models
zeeebeee•Mar 5, 2026
that aside, chatgpt itself has gone downhill so much and i know i'm not the only one feeling this way
i just HATE talking to it like a chatbot
idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
feels like a true robot
tototrains•Mar 5, 2026
Their trajectory was clear the moment they signed a deal with Microsoft if not sooner.
Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
beernet•Mar 5, 2026
Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.
jcmontx•Mar 5, 2026
5.4 vs 5.3-Codex? Which one is better for coding?
vtail•Mar 5, 2026
Looking at the benchmarks, 5.4 is slightly better. But it also offers "Fast" mode (at 2x usage), which - if it works and doesn't completely deplete my Pro plan - is a no brainer at the same or even slightly worse quality for more interactive development.
esafak•Mar 5, 2026
For the price, it seems the latter. I'd use 5.4 to plan.
embedding-shape•Mar 5, 2026
Literally just released, I don't think anyone knows yet. Don't listen to people's confident takes until after a week or two when people have actually been able to try it, otherwise you'll just get sucked up in bears/bulls misdirected "I'm first with an opinion".
awestroke•Mar 5, 2026
Opus 4.6
jcmontx•Mar 5, 2026
Codex surpassed Claude in usefulness _for me_ since last month
baal80spam•Mar 5, 2026
Uh, oh. Looks like Claude sycophants joined linuxers and vegetarians.
Someone1234•Mar 5, 2026
Related question:
- Do they have the same context usage/cost particularly in a plan?
They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."
gavinray•Mar 5, 2026
The "RPG Game" example on the blogpost is one of the most impressive demo's of autonomous engineering I've seen.
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
hu3•Mar 5, 2026
indeed and I suspect it can be attributed to, at least in part, the improved playwright integration.
> we’re also releasing an experimental Codex skill called “Playwright (Interactive) (opens in a new window)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.
casid•Mar 5, 2026
I don't know. It looks shallow and simple, not even a demo.
Multicomp•Mar 5, 2026
A cheesy Roller Coaster Tycoon clone in a browser, one-shotted from an AI? Amazing capabilities. The entire "low code drag n drop" market like YoYoGames Game Maker and RPG Maker should be ready to pack it in soon if this keeps improving in this way.
swingboy•Mar 5, 2026
Even with the 1m context window, it looks like these models drop off significantly at about 256k. Hopefully improving that is a high priority for 2026.
leftbehinds•Mar 5, 2026
some sloppy improvements
HardCodedBias•Mar 5, 2026
We'll have to wait a day or two, maybe a week or two, to determine if this is more capable in coding than 5.3, which seems to be the economically valuable capability at this time.
In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.
lostmsu•Mar 5, 2026
What is Pro exactly and is it available in Codex CLI?
akmarinov•Mar 5, 2026
It’s not. It’s their ultra thinking model that’s really good but takes 40 minutes to come up with an answer
Not the best pelican compared to gemini 3.1 pro, but I am sure with coding or excel it does remarkably better given those are part of its measured benchmarks.
GaggiX•Mar 5, 2026
This pelican is actually bad, did you use xhigh?
nickandbro•Mar 5, 2026
yep, just double checked used gpt-5.4 xhigh. Though had to select it in codex as don't have access to it on the chatgpt app or web version yet. It's possible that whatever code harness codex uses, messed with it.
nubg•Mar 5, 2026
this is proof they are not benchmaxxing the pelicans :-)
bazmattaz•Mar 5, 2026
Anyone else feel that it’s exhausting keeping up with the pace of new model releases. I swear every other week there’s a new release!
coffeemug•Mar 5, 2026
Why do you need to keep up? Just use the latest models and don't worry about it.
throwup238•Mar 5, 2026
Yes, that's a common feeling. 5.3-Codex was released a month ago on Feb 5 so we're not even getting a full month within a single brand, let alone between competitors.
davnicwil•Mar 5, 2026
If you think about it there shouldn't really be a reason to care as long as things don't get worse.
Presumably this is where it'll evolve to with the product just being the brand with a pricing tier and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again why should I care as long as average output is not subjectively worse.
Just as I don't want to select resources for my SaaS software to use or have that explicitly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today, I just want to pay and for it to hopefully keep getting better but at a minimum not get worse.
pupppet•Mar 5, 2026
I think it's fun, it's like we're reliving the browser wars of the early days.
dandiep•Mar 5, 2026
Anyone know why OpenAI hasn't released a new model for fine tuning since 4.1? It'll be a year next month since their last model update for fine tuning.
qoez•Mar 5, 2026
I think they just did that because of the energy around it for open source models. Their heart probably wasn't in it, and the number of people fine tuning at those prices was probably too low to keep putting attention there.
zzleeper•Mar 5, 2026
For me the issue is why there's not a new mini since 5-mini in August.
I have now switched web-related and data-related queries to Gemini, coding to Claude, and will probably try Qwen for less critical data queries. So where does OpenAI fit now?
Rapzid•Mar 5, 2026
Also interested in this and a replacement for 4.1/4.1-mini that focuses on low latency and high accuracy for voice applications(not the all-in-one models).
paxys•Mar 5, 2026
"Here's a brand new state-of-the-art model. It costs 10x more than the previous one because it's just so good. But don't worry, if you don't want all this power you can continue to use the older one."
What is with the absurdity of skipping "5.3 Thinking"?
vicchenai•Mar 5, 2026
Honestly at this point I just want to know if it follows complex instructions better than 5.1. The benchmark numbers stopped meaning much to me a while ago - real usage always feels different.
7777777phil•Mar 5, 2026
83% win rate over industry professionals across 44 occupations.
I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. Just had a closer look at those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...
NiloCK•Mar 5, 2026
This March 2026 blog post is citing a 2025 study based on Sonnet 3.5 and 3.7 usage.
Given that the organization that ran the study [1] has a terrifying exponential as its landing page, I think they'd prefer that its results are interpreted as a snapshot of something moving rather than a constant.
Good catch, thanks (I really wrote that myself.) Added a note to the post acknowledging the models used were Claude 3.5 and 3.7 Sonnet.
twitchard•Mar 5, 2026
Not sure DORA is that much of an indictment. For "Change Failure Rate" for instance these are subject to tradeoffs. Organizations likely have a tolerance level for Change Failure Rate. If changes are failing too often they slow down and invest. If changes aren't failing that much they speed up -- and so saying "change failure rate hasn't decreased, obviously AI must not be working" is a little silly.
"Change Lead Time" I would expect to have sped up, although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottleneck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce) this creates an incentive, ironically, to group changes into larger batches. So the definition of what a "change" is has grown too.
Does anyone know what website is the "Isometric Park Builder" shown off here?
turblety•Mar 5, 2026
They build that using GPT-5.4
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt
GPT literally built that game.
iamleppert•Mar 5, 2026
I wouldn't trust any of these benchmarks unless they are accompanied by some sort of proof other than "trust me bro". Also not including the parameters the models were run at (especially the other models) makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to complete the benchmarks and logs.
Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
elmean•Mar 5, 2026
Wow insane improvements in targeting systems for military targets over children
timedude•Mar 5, 2026
Absolutely amazing. Grateful to be living in this timeframe
bramhaag•Mar 5, 2026
What makes you think that they see bombing civilians as a bug, not a feature?
elmean•Mar 5, 2026
first real comment, I thought that at first but this could lower the possible users that could be using chatGPT and that would be against us (shareholders)
You made a burner account just to scold this guy? Don’t use burner accounts this way.
himata4113•Mar 5, 2026
news guidelines
adamtaylor_13•Mar 5, 2026
Parlay?
louiereederson•Mar 5, 2026
I think for your comment to follow the guidelines, you need to explain why the original comment did not follow them.
Customer values are relevant to the discussion given that they impact choice and therefore competition.
elmean•Mar 5, 2026
AINT NO PARTY LIKE A GARRY TAN HOT TUB PARTY
Chance-Device•Mar 5, 2026
Ironically this would actually be a good thing. As we can see from Iran Claude doesn’t quite have these bugs ironed out yet…
MSFT_Edging•Mar 5, 2026
This is the exact attitude that led to a chat bot being used to identify a school for girls as a valid target.
The chatbot cannot be held responsible.
Whoever is using chatbots for selecting targets is incompetent and should likely face war crime charges.
Chance-Device•Mar 5, 2026
What attitude exactly are you talking about? The one that says that if you’re going to morally sell out it would be better if you at least tried not to kill children?
bananamogul•Mar 5, 2026
"that lead to a chat bot being used to identify a school for girls as a valid target"
Has it been stated authoritatively somewhere that this was an AI-driven mistake?
There are myriad ways that mistake could have been made that don't require AI. These kinds of mistakes were certainly made by all kinds of combatants in the pre-AI era.
Chance-Device•Mar 5, 2026
Do you think anyone is ever going to say this under any circumstances? That Anthropic were right and they were proved right the very next day?
Yeah yeah, they probably had a human in the loop, that’s not really the point though.
Sabinus•Mar 5, 2026
Targeting and accuracy mistakes happen plenty in wars that aren't assisted by AI. I don't think it's fair to assume that AI had a hand in the bombing of the school without evidence.
spiralcoaster•Mar 5, 2026
This is the low quality reddit-style garbage that gets upvoted on HN these days?
esalman•Mar 5, 2026
While low quality, it is extremely important, potentially historically significant too.
Someone1234•Mar 5, 2026
If it is actually that important, then maybe more effort should be made so it isn't "low quality." Cannot be very important to them if they're disinterested in presenting an intellectually compelling argument about it.
PS - If you think I am not sympathetic to what they're raising, you're very much mistaken. But they're not winning anyone new over to their side with this flamebait.
Sabinus•Mar 5, 2026
You can say your piece about how you don't like OpenAI working with the US military on lethal AI without making Reddit style quips.
mycall•Mar 5, 2026
True and simply vote it down.
elmean•Mar 5, 2026
mycall would also be to do the same
karmasimida•Mar 5, 2026
As programmers become intelligently irrelevant in the whole picture, you would see more posts like this
elmean•Mar 5, 2026
"This account belongs to a lazy person" true
rd•Mar 5, 2026
Noticeably yes much more than usual. It’s quite bad. I need to start blocking accounts.
zarzavat•Mar 5, 2026
What are we supposed to talk about in this thread exactly? The developers of this model are evil. Are we supposed to just write dry comments about benchmarks while OpenAI condones their models being deployed for autonomously killing people?
Yes I'm sure it makes a very nice bicycle SVG. I will be sure to ask the OpenAI killbots for a copy when they arrive at my house.
elmean•Mar 5, 2026
I was just reading the model card...
Nicholas_C•Mar 5, 2026
The HN of old is no more unfortunately. Things get up or down voted based purely on political alignment.
oklahomasports•Mar 5, 2026
Evidence
throwaway911282•Mar 5, 2026
what a thoughtful comment! HN is so low quality these days
creamyhorror•Mar 5, 2026
I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Codex. It feels very lucid and uses human phrasing.
It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
irishcoffee•Mar 5, 2026
> It might be my AGENTS.md requiring clearer, simpler language
If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?
m3kw9•Mar 5, 2026
you probably can't, and asking agents.md to "make it clearer" will likely give you the illusion of clearer language without actual well-structured tests. agents.md is usually for changing what the llm should focus on so that it suits you better, not for saying stuff like "be better" or "make no mistakes"
creamyhorror•Mar 5, 2026
I'm not sure if the model (under its temperature/other settings) produces deterministic responses. But I do think models' style and phrasing are fairly changeable via AGENTS.md-style guidelines.
5.4's choice of terms and phrasing is very precise and unambiguous to me, whereas 5.3-Codex often uses jargon and less precise phrases that I have to ask further about or demand fuller explanations for via AGENTS.md.
irishcoffee•Mar 5, 2026
So sharing markdown files is functionally useless, or no?
sampton•Mar 5, 2026
That's been my experience as well switching from Opus to Codex. Reasoning takes longer but answers are precise. Claude is sloppy in comparison.
throwaway911282•Mar 5, 2026
codex has been really good so far and the fast mode is cherry on top! and the very generous limits is another cherry on top
solenoid0937•Mar 5, 2026
Weird, I have had the opposite experience. Codex is good at doing precisely what I tell it to do, Opus suggests well thought out plans even if it needs to push back to do it.
pembrook•Mar 5, 2026
The latest research these days is that including an AGENTS.md file only makes outcomes worse with frontier models.
I believe that this choice is due to two main reasons. First, it's (obviously) a marketing strategy to keep the spotlight on their own models, showing they're constantly improving and avoiding validating competitors. Second, since the community knows that static benchmarks are unreliable, it makes sense for them to outsource the comparisons to independent leaderboards, which lets them avoid accusations of cherry-picking while justifying their marketing strategy.
Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway
Inline poll: What reasoning levels do you work with?
This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b
bob1029•Mar 5, 2026
I was just testing this with my unity automation tool and the performance uplift from 5.2 seems to be substantial.
koakuma-chan•Mar 5, 2026
Anyone else getting artifacts when using this model in Cursor?
I've seen that problem with 5.3-codex too, it didn't happen with earlier models.
Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.
daft_pink•Mar 5, 2026
I’ve officially got model fatigue. I don’t care anymore.
zeeebeee•Mar 5, 2026
same same same
postalrat•Mar 5, 2026
I'd suggest not clicking for things you don't care about.
hmokiguess•Mar 5, 2026
They hired the dude from OpenClaw, they had Jony Ive for a while now, give us something different!
kgeist•Mar 5, 2026
>Today, we’re releasing <..> GPT‑5.3 Instant
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
gallerdude•Mar 5, 2026
Tbf there was a 5.3 codex
XCSme•Mar 5, 2026
What I'm most confused, is why call it both GPT-5.3 Instant and gpt-5.3-chat?
m3kw9•Mar 5, 2026
instant kind of sucks if you're asking for more than summarizations, surface info, or web searches; it can lose track of who's who quickly in some complex multi-turn asks. Just need to know what to use instant for.
__jl__•Mar 5, 2026
What a model mess!
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines, with codex at 5.3 and what they now call instant also at 5.3.
Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.
arthurcolle•Mar 5, 2026
There is a lot of opportunity here for the AI infrastructure layer on top of tier-1 model providers
motoxpro•Mar 5, 2026
This is what clouds like AWS, Azure, and GCP solve (vertex AI, etc). They are already an abstraction on top of the model makers with distribution built in.
I also don't believe there is any value in trying to aggregate consumers or businesses just to clean up model makers names/release schedule. Consumers just use the default, and businesses need clarity on the underlying change (e.g. why is it acting different? Oh google released 3.6)
arthurcolle•Mar 5, 2026
Do the end users really care about the models at all, or about the effects that the models can cause?
delaminator•Mar 5, 2026
two great problems in computing
naming things
cache invalidation
off by one errors
rurban•Mar 5, 2026
Biggest problem right now in computing:
Out of tokens until end of month
strongpigeon•Mar 5, 2026
> Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.
What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.
Preview Road (only choice, and last preview was deprecated without warning)
CactusBlue•Mar 5, 2026
Reminds me of Unity features
L-four•Mar 5, 2026
Gmail was in beta for 5 years, until 2009.
metalliqaz•Mar 5, 2026
"Gemini, translate 'beta' from Googlespeak to English."
"Ok, here is the translation:"
'we don't want to offer support'
cyanydeez•Mar 5, 2026
Nah, it's "We dont want to provide a consistent model that we'll be stuck with supporting for a decade because it just takes up space; until we run everyone out of business, we can't afford to have customers tying their systems to any given model"
Really, the economics makes no sense, but that's what they're doing. You can't have a consistent model because it'll pin their hardware & software, and that costs money.
solarkraft•Mar 5, 2026
Just like any Google product then.
m_fayer•Mar 5, 2026
My 5ish years in the mines of Android native back in the day are not years I recall fondly. Never change, Google.
cyanydeez•Mar 5, 2026
The business models of LLMs don't include any guarantee, and somehow that's fine for a burgeoning decade of trillions of dollars of consumption.
Sure, makes total sense guys.
embedding-shape•Mar 5, 2026
> OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.
I guess that's true, but geared towards API users.
Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that, as someone who spends a lot of time programming, I never manage to hit any usage limits, although I've gotten close once to the new (temporary) Spark limits.
0xbadcafebee•Mar 5, 2026
> or have zero insurances that the model doesn't get discontinued within weeks
Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
phainopepla2•Mar 5, 2026
If you're trying to use LLMs in an enterprise context, you would understand. Switching models sometimes requires tweaking prompts. That can be a complete mess, when there are dozens or hundreds of prompts you have to test.
hobofan•Mar 5, 2026
That's true only in theory, but not in practice. In practice every inference provider handles errors (guardrails, rate limits) somewhat differently and with different quirks, some of which only surface in production usage, and Google is one of the worst offenders in that regard.
Aurornis•Mar 5, 2026
> What a model mess!
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.
I don't know, this feels unnecessarily nitpicky to me
It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.
Melatonic•Mar 5, 2026
Agreed - and it's a huge step up from their previous naming schemes. That stuff was confusing as hell
__jl__•Mar 5, 2026
I see your point. I do find Anthropic's approach more clean though particularly when you add in mini and nano. That makes 5 models priced differently. Some share the same core name, others don't: gpt 5 nano, gpt 5 mini, gpt 5.1, gpt 5.2, gpt 5.4. And we are not even talking about thinking budget.
But generally: These are not consumer facing products and I agree that someone who uses the API should be able to figure out the price point of different models.
raincole•Mar 5, 2026
They aggressively retire models, so GPT 5.1 and 5.2 are probably going to go soon.
hobofan•Mar 5, 2026
In the Azure Foundry, they list GPT 5.2 retirement as "No earlier than 2027-05-12" (it might leave OpenAI's normal API earlier than that). I'm pretty certain that Gemini 3, which isn't even in GA yet, will be retired earlier than that.
CobrastanJorji•Mar 5, 2026
> Google essentially only has Preview models.
It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
m3kw9•Mar 5, 2026
thats how they had it for years, is a mess, but controlled
biophysboy•Mar 5, 2026
Wow, is that what preview means? I see those model options in github copilot (all my org allows right now) - I was under the impression that preview means a free trial or a limited # of queries. Kind of a misleading name..
jbonatakis•Mar 5, 2026
Google is already sending notices that the 2.5 models will be deprecated soon while all the 3.x models are in preview. It really is wild and peak Google.
woeirua•Mar 5, 2026
Feels incremental. Looks like OpenAI is struggling.
throwaway5752•Mar 5, 2026
Does this model autonomously kill people without human approval or perform domestic surveillance of US citizens?
GPT is not even close to Claude in terms of responding to BS.
zone411•Mar 5, 2026
Results from my Extended NYT Connections benchmark:
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
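For anyone who wants the gaps spelled out, here is a minimal sketch that subtracts the GPT-5.2 score from the GPT-5.4 score at each reasoning level. The numbers are copied from the comment above; the dictionary layout and the `gains` helper are only illustrative names, not part of the benchmark itself.

```python
# Scores quoted in the comment above: (GPT-5.4, GPT-5.2) per reasoning level.
SCORES = {
    "extra high": (94.0, 88.6),
    "medium": (92.0, 71.4),
    "no reasoning": (32.8, 28.1),
}


def gains(scores):
    """Return the point gain of GPT-5.4 over GPT-5.2 for each level."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {level: round(new - old, 1) for level, (new, old) in scores.items()}


if __name__ == "__main__":
    for level, delta in gains(SCORES).items():
        print(f"{level}: +{delta} points")
```

On these figures the jump is largest at medium reasoning, while extra high and no reasoning move by only a few points each.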
consumer451•Mar 5, 2026
I am very curious about this:
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
motza•Mar 5, 2026
No doubt this was released early to ease the bad press
butILoveLife•Mar 5, 2026
Anyone else completely not interested? Since GPT5, it's been cost cutting measure after cost cutting measure.
I imagine they added a feature or two, and the router will continue to give people 70B parameter-like responses when they don't ask math or coding questions.
Philip-J-Fry•Mar 5, 2026
I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post followed by "summarise this blog post". Only to be told "I can't access external URLs directly, but if you can paste the relevant text or describe the content you're interested in from the page, I can help you summarize it. Feel free to share!"
That's hilarious. Does OpenAI even know this doesn't work?
Aurornis•Mar 5, 2026
Probably intentional. They don't want open, no-registration endpoints able to trigger the AI into hitting URLs.
jazzypants•Mar 5, 2026
But, why include the non-functional chat box in the article?
observationist•Mar 5, 2026
They're having service issues - ChatGPT on the web is broken for a lot of people. The app is working in android - I'd assume that the rollout hit a hitch and the chatbox in the article would normally work.
embedding-shape•Mar 5, 2026
Different team "manages" the overall blog than the team who wrote that specific article. At one point, maybe it made sense, then something in the product changed, team that manages the blog never tested it again.
Or, people just stopped thinking about any sort of UX. These sorts of mistakes are all over the place, on literally all web properties; some UX flows just end with you at a page where nothing works sometimes. Everything is just perpetually "a bit broken" seemingly everywhere I go, not specific to OpenAI or even the internet.
teaearlgraycold•Mar 5, 2026
If only there was some kind of way to automatically test user flows end to end. Perhaps testing could be evaluated periodically, or even run for each code change.
koakuma-chan•Mar 5, 2026
There is no business value in doing that.
colonCapitalDee•Mar 5, 2026
That's why it happened. It still shouldn't have happened.
ethbr1•Mar 5, 2026
> Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just ends with you at a page where nothing works sometimes.
It's almost like people are vibe coding their web apps or something.
jdndbdjsj•Mar 5, 2026
Welcome to a big company
AirGapWorksAI•Mar 5, 2026
Welcome to a big company where pretty much everyone has been working full steam for years, in order to take advantage of having a job at a company during a once-in-a-lifetime moment.
m3kw9•Mar 5, 2026
what? it's their own site and own llm. I could paste most sites and it would work.
Following this process summarizes the blogpost for me. Perhaps the difference is I'm signed into my account so it can access external URLs or something of that nature?
pocksuppet•Mar 5, 2026
Most AI integration is like this. It's not about building working products --- it's about bragging that you put a chatbox in your program.
ElijahLynn•Mar 5, 2026
fwiw: I get a valid response when following the steps you mentioned. I do not get the message you mentioned:
It looks like this doesn't work for users without accounts? It works when I'm logged in, but not logged out. I went ahead and reported it to the team. Thanks for letting us know!
baxtr•Mar 5, 2026
I picked up Claude today after being absent and on ChatGPT and Gemini only for a while.
I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.
amelius•Mar 5, 2026
If only they had an LLM they could use as a software testing agent.
Alifatisk•Mar 5, 2026
So let me get this straight, OpenAI previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5, which was more like a router that put all these models under the hood so you only had to prompt GPT-5, and it would route to the most suitable model. This worked great I assume and made the UI comprehensible for the user. But now, they are starting to introduce more different models again?
We got:
- GPT-5.1
- GPT-5.2 Thinking
- GPT-5.3 (codex)
- GPT-5.3 Instant
- GPT-5.4 Thinking
- GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for 1M context window, finally it has caught up to Gemini.
361994752•Mar 5, 2026
i guess you still have the "auto" as an option to route your request
stainablesteel•Mar 5, 2026
5 itself might have solved the problem of having too many different models somewhere in the backend
fernst•Mar 5, 2026
Now with more and improved domestic espionage capabilities
senko•Mar 5, 2026
Just tested it with my version of the pelican test: a minimal RTS game implementation (zero-shot in codex cli): https://gist.github.com/senko/596a657b4c0bfd5c8d08f44e4e5347... (you'll have to download and open the file, sadly GitHub refuses to serve it with the correct content type)
This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).
I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.
I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
Aldipower•Mar 5, 2026
So did they raise the ridiculously small "per tool call token limit" when working with MCP servers? This makes Chat useless... I do not care, but my users do.
melbourne_mat•Mar 5, 2026
Quick: let's release something new that gives the appearance that we're still relevant
gigatexal•Mar 5, 2026
Is it any good at coding?
thefounder•Mar 5, 2026
Is it just me or the price for 5.4 pro is just insane?
atkrad•Mar 5, 2026
What is the main difference between this version and the previous one?
brcmthrowaway•Mar 5, 2026
How much of LLM improvement comes from regular ChatGPT usage these days?
quotemstr•Mar 5, 2026
GPT 5.4 is one of the most censored models out there.
It completes only 29% of controversial requests. It refuses to discuss numerous subjects rooted in facts or that reflect views of significant portions of the population. It refuses to even write a short essay on, say, Herasight-style genetic screening or putting weapons in space. Agree or disagree, reasonable people can have a range of views on these subjects and it is not the place of OpenAI or any lab to determine for everyone the right answers to open societal questions.
The OP has frequently gotten the scoop for new LLM releases and I am curious what their pipeline is.
This is prescient -- I wonder if the Big Tech entities see it this way. Maybe, even if they do, they're 100% committed to speedrunning the current late-stage-cap wave, and therefore unable to do anything about it.
Google has a good model in the form of Gemini and they might figure they can win the AI race and if the web dies, the web dies. YouTube will still stick around.
Facebook is not going to win the AI race with low I.Q. Llama but Zuck believed their business was cooked around the time it became a real business because their users would eventually age out and get tired of it. If I was him I'd be investing in anything that isn't cybernetic let it be gold bars or MMA studios.
Microsoft? They bought Activision for $69 billion. I just can't explain their behavior rationally but they could do worse than their strategy of "put ChatGPT in front of laggards and hope that some of them rise to the challenge and become slop producers."
Amazon is really a bricks-and-mortar play which has the freedom to invest in bricks-and-mortar because investors don't think they are a bricks-and-mortar play.
Netflix? They're cooked as is all of Hollywood. Hollywood's gatekeeping-industrial strategy of producing as few franchises as possible will crack someday and our media market may wind up looking more like Japan, where somebody can write a low-rent light novel like
https://en.wikipedia.org/wiki/Backstabbed_in_a_Backwater_Dun...
and J.C. Staff makes a terrible anime that convinces 20k Otaku to drop $150 on the light novels and another $150 on the manga (sorry, no way you can make a balanced game based on that premise!) and the cost structure is such that it is profitable.
I am not sure about that. We techies avoid enshittification because we recognize shit. Normies will just get their sycophantic enshittified AI that will tell them to continue buying into walled gardens.
Plenty of companies make the same choice about their API, they provide it for a specific purpose but they have good business reasons they want you using the website. Plenty of people write webcrawlers and it's been a cat and mouse game for decades for websites to block them.
This will just be one more step in that cat and mouse game, and if the AI really gets good enough to become a complete intermediary between you and the website? The website will just shut down. We saw it happen before with the open web. These websites aren't here for some heroic purpose; if you screw their business model they will just go out of business. You won't be able to use their website because it won't exist, and the websites that do exist will either (a) be made by the same guys writing your agent, or (b) be highly, highly optimized to get your agent to screw you.
You can also test this yourself easily, fire up two agents, ask one to use PL meant for humans, and one to write straight up machine code (or assembly even), and see which results you like best.
Then go ahead and make an argument. "Why not do X?" is not an argument, it's a suggestion.
Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
If an API is exposed you can just have the LLM write something against that.
Optimizations are secondary to convenience
Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consume compared to all the back and forth verbose json payloads of APIs
I think an important thing here is that a lot of websites/platforms don't want AIs to have direct API access, because they are afraid that AIs would take the customer "away" from the website/platform, making the consumer a customer of the AI rather than a customer of the website/platform. Therefore for AIs to be able to do what customers want them to do, they need their browsing to look just like the customer's browsing/browser.
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
That's just not how pricing is supposed to work...? Especially for a 'non-profit'. You're charging me more so I know I have the better model?
But they also claim this new model uses fewer tokens, so it still might ultimately be cheaper even if per token cost is higher.
I guess they have to sell to investors that the price to operate is going down, while still needing more from the user to be sustainable
Anthropic is pulling the plug on Haiku 3 in a couple months, and they haven't released anything in that price range to replace it.
They're framing it pretty directly "We want you to think bigger cost means better model"
Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.
I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses when their context window is mostly full, but we'll see.
Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.
My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.
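A minimal sketch of that fan-out-and-compare workflow, assuming each agent can be wrapped as a plain Python callable. The agent functions and the scoring function are hypothetical stand-ins, not any real Codex or Claude API, and scoring by length is only a placeholder for what is really a human comparison.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Agent = Callable[[str], str]


def fan_out(prompt: str, agents: Dict[str, Agent]) -> Dict[str, str]:
    """Send the same prompt to every agent in parallel and collect the answers."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, prompt) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def pick_best(results: Dict[str, str], score: Callable[[str], float]) -> Tuple[str, str]:
    """Return the (agent name, answer) pair with the highest score."""
    best = max(results, key=lambda name: score(results[name]))
    return best, results[best]


# Hypothetical stand-ins for real agent calls.
def agent_a(prompt: str) -> str:
    return f"short answer to: {prompt}"


def agent_b(prompt: str) -> str:
    return f"a longer, more detailed answer to: {prompt}"


if __name__ == "__main__":
    results = fan_out("refactor the parser", {"a": agent_a, "b": agent_b})
    # Placeholder scoring: prefer the longer answer.
    print(pick_best(results, score=len))
```

Keeping the agents behind one callable type means a new model can be added to the comparison without touching the rest of the tooling.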
I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).
If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.
There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?
Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?
I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.
Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.
You could add more scaffolding to fix this, but Claude proves you shouldn't have to.
I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.
That's not the experience I have. I had it do more complex changes spanning multiple files and it performed well.
I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).
Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.
I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things
It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.
Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...
We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.
For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
Curious to hear if people have use cases where they find 1M works much better!
(I work at OpenAI.)
Sometimes I’m exploring some topic and that exploration is not useful but only the summary.
Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language.
Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.
Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.
(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)
[1] https://github.com/akiselev/ghidra-cli
Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.
The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.
That way you stay in control of both the context budget and the level of detail the agent operates with.
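The three-state idea could be modeled roughly like this. This is only a sketch of the proposal, not an existing Codex or Claude Code feature; the `Keep` and `Chunk` names and all token counts are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Keep(Enum):
    DROP = "drop"
    SUMMARY = "summary"
    FULL = "full"


@dataclass
class Chunk:
    label: str
    full_tokens: int
    summary_tokens: int
    keep: Keep  # the agent's suggestion, which the user may override


def tokens_after_compaction(chunks: List[Chunk]) -> int:
    """Total context size once each chunk is dropped, summarized, or kept whole."""
    total = 0
    for chunk in chunks:
        if chunk.keep is Keep.FULL:
            total += chunk.full_tokens
        elif chunk.keep is Keep.SUMMARY:
            total += chunk.summary_tokens
        # Keep.DROP contributes nothing.
    return total


# Hypothetical plan proposed by the agent before compacting.
plan = [
    Chunk("decompiled module A", 40_000, 2_000, Keep.SUMMARY),
    Chunk("dead-end exploration", 25_000, 1_000, Keep.DROP),
    Chunk("current task notes", 8_000, 500, Keep.FULL),
]
plan[0].keep = Keep.FULL  # user override of the agent's suggestion
print(tokens_after_compaction(plan))
```

Showing the resulting token total next to each override would let the user see the budget impact before committing to the compaction.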
I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.
I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.
I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
https://xcancel.com/cperciva/status/2029645027358495156
Feels like a losing battle, but hey, the audience is usually right.
https://apps.apple.com/us/app/clean-links-qr-code-reader/id6...
Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com
Anywhere I can toss a Tip for this free app?
For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.
https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k
It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)
Even right now one page refers to prices for "context lengths under 270K" whereas another has pricing for "<272K context length"
> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Taken from https://developers.openai.com/api/docs/models/gpt-5.4
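As a rough worked example of that rule, here is a small calculator using the gpt-5.4 list prices quoted in this thread. It is a sketch under simplifying assumptions: it treats the whole session as one request, ignores cached-input pricing, and the function and constant names are mine, not anything from the OpenAI docs.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, from the quoted docs
INPUT_PER_M = 2.50    # USD per 1M input tokens (gpt-5.4, from the thread)
OUTPUT_PER_M = 15.00  # USD per 1M output tokens (gpt-5.4, from the thread)


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost, applying 2x input / 1.5x output above the threshold."""
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = INPUT_PER_M * (2.0 if long_context else 1.0)
    output_rate = OUTPUT_PER_M * (1.5 if long_context else 1.0)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(session_cost(200_000, 10_000))  # below the threshold
print(session_cost(300_000, 10_000))  # above the threshold, surcharge applies
```

Because the multiplier applies to the full session rather than only the tokens past 272K, crossing the threshold is a step change in cost, not a gradual one.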
> GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.
I don't think that's a fair reading of the original post at all, obviously what they meant by "no cost" was "no increase in the cost".
For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.
The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.
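To make the token-count point concrete, here is a small sketch using the list prices quoted earlier in the thread for GPT-5.4 and Opus 4.6. The token counts are invented purely for illustration and are not taken from Artificial Analysis.

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_per_m: float, output_per_m: float) -> float:
    """USD cost of one run given per-1M-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# List prices quoted in the thread (USD per 1M input, output tokens).
GPT_5_4 = (2.50, 15.00)
OPUS_4_6 = (5.00, 25.00)

# Made-up token counts: same input, different output verbosity.
opus = run_cost(1_000_000, 200_000, *OPUS_4_6)
gpt_same = run_cost(1_000_000, 200_000, *GPT_5_4)
gpt_verbose = run_cost(1_000_000, 500_000, *GPT_5_4)

print(opus, gpt_same, gpt_verbose)
```

In this made-up scenario the cheaper per-token model costs about the same as the pricier one once it emits 2.5x the output tokens, which is why the reasoning-effort setting matters as much as the price sheet.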
According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.
Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
Like, if you really don’t want to spend any effort trimming it down, sure use 1m.
Otherwise, 1m is an anti pattern.
What the hell is a "safety score for violence"?
A “safety score for violence” is usually a risk rating used by platforms, AI systems, or moderation tools to estimate how likely a piece of content is to involve or promote violence. It’s not a universal standard—different companies use their own versions—but the idea is similar everywhere.
What it measures
A safety score typically evaluates whether text, images, or videos contain things like:
- Threats of violence (“I’m going to hurt someone.”)
- Instructions for harming people
- Glorifying violent acts
- Descriptions of physical harm or abuse
- Planning or encouraging attacks
Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.
I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.
Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.
Yet so much slower than Gemini / Nano Banana to make it almost unusable for anything iterative.
The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.
We’ve seen nothing yet.
Safety is important.
You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.
Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.
(I work at OpenAI.)
https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, IFBench.
see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...
Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.
Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.
Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
The performance improvement isn't marginal if you're doing something particularly novel/difficult.
I really thought weirdly worded and unnecessary "announcement" linking to the actual info along with the word "card" were the results of vibe slop.
Criticisms aside (sigh), according to Wikipedia, the term was introduced when proposed by mostly Googlers, with the original paper [0] submitted in 2018. To quote,
"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""
So that's where they were coming from, I guess.
[0] Margaret Mitchell et al., 2018 submission, Model Cards for Model Reporting, https://arxiv.org/abs/1810.0399
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
I switch between both but codex has also been slightly better in terms of quality for me personally at least.
in 5.4 it looks like they just collapsed that capability into the single frontier family model
$2/M Input Tokens $15/M Output Tokens
Claude Opus 4.6
$5/M Input Tokens $25/M Output Tokens
$2.5/M Input Tokens $15/M Output Tokens
This should not be shocking.
https://openai.com/api/pricing/
Interesting, the "Health" category seems to report worse performance compared to 5.2.
I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.
I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.
Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe it, and it spit out blood test results that it found in the project context from an earlier date, not the one attached to the prompt. That was weird.
I copy and pasted into ChatGPT, it told me straight away, and then for a laugh said it was actually a magical weight loss drug that I'd bought off the dark web... And it started giving me advice about unregulated weight loss drugs and how to dose them.
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
> assess harmful stereotypes by grading differences in how a model responds
> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings
Are we seriously using old models to rate new models?
Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export) then whenever a new model comes around I'd throw 5 random ones of MY fails at it (not benchmarks from others, those will come too anyway) and see if it's better, same, or worse, for MY use cases in minutes.
If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.
Really doesn't seem complicated nor taking much time to forge a realistic opinion.
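Something like the following would cover that routine. It is a sketch only: `model` and `judge` are hypothetical callables you would supply yourself (the judge could simply be you reading the answer), and the prompt lists are placeholders.

```python
import random
from typing import Callable, Dict, List

Model = Callable[[str], str]
Judge = Callable[[str, str], bool]  # (prompt, answer) -> acceptable?


def spot_check(model: Model, judge: Judge, failed: List[str], useful: List[str],
               sample_size: int = 5, seed: int = 0) -> Dict[str, int]:
    """Retry a random sample of past failures and guard a sample of past successes."""
    rng = random.Random(seed)
    retry = rng.sample(failed, min(sample_size, len(failed)))
    guard = rng.sample(useful, min(sample_size, len(useful)))
    fixed = sum(judge(p, model(p)) for p in retry)
    regressed = sum(not judge(p, model(p)) for p in guard)
    return {"retried": len(retry), "fixed": fixed,
            "guarded": len(guard), "regressed": regressed}


if __name__ == "__main__":
    # Placeholder model and judge, just to show the shape of the result.
    def demo_model(prompt: str) -> str:
        return prompt.upper()

    def demo_judge(prompt: str, answer: str) -> bool:
        return "ok" in prompt

    print(spot_check(demo_model, demo_judge,
                     failed=["ok one", "bad two", "ok three"],
                     useful=["ok a", "ok b"]))
```

The `regressed` count is the regression check from the comment above: prompts that used to work and no longer pass.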
Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and told it to do it all. But I've never gotten it to actually work on it all and instead after the first task is complete it always asks if it should move onto the next task. In fact, I always tell it not to ask me and yet it still does. So unless I need to do very specific prompt engineering, that does not seem to work for me.
I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.
Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.
https://news.ycombinator.com/item?id=47232453#47232735
Not that I want it, just where I imagine it going.
Asking the right question: $9,999
Also, Anthropic/Gemini/even Kimi models are pretty good for what it's worth. I used to use ChatGPT and I still sometimes accidentally open it, but I use Gemini/Claude nowadays and I personally find them to be better anyway.
i just HATE talking to it like a chatbot
idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
feels like a true robot
Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
- Do they have the same context usage/cost particularly in a plan?
They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
> we’re also releasing an experimental Codex skill called “Playwright (Interactive)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.
In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.
https://openrouter.ai/openai/gpt-5.4-pro
https://www.svgviewer.dev/s/gAa69yQd
Not the best pelican compared to Gemini 3.1 Pro, but I am sure it does remarkably better with coding or Excel, given those are part of its measured benchmarks.
Presumably this is where it'll evolve to with the product just being the brand with a pricing tier and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again why should I care as long as average output is not subjectively worse.
Just as I don't want to select resources for my SaaS software to use or have that explictly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today, I just want to pay and for it to hopefully keep getting better but at a minimum not get worse.
I have now switched web-related and data-related queries to Gemini, coding to Claude, and will probably try QWEN for less critical data queries. So where does OpenAI fits now?
A couple months later:
"We are deprecating the older model."
I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. Just had a closer look on those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...
Given that the organization that ran the study [1] has a terrifying exponential as their landing page, I think they'd prefer that its results be interpreted as a snapshot of something moving rather than a constant.
[1] - https://metr.org/
"Change Lead Time" I would expect to have sped up, although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottleneck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce), this ironically creates an incentive to group changes into larger batches. So the definition of what a "change" is has grown too.
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt
GPT literally built that game.
Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
https://news.ycombinator.com/newsguidelines.html
Customer values are relevant to the discussion given that they impact choice and therefore competition.
The chatbot cannot be held responsible.
Whoever is using chatbots for selecting targets is incompetent and should likely face war crime charges.
Has it been stated authoritatively somewhere that this was an AI-driven mistake?
There are myriad ways that mistake could have been made that don't require AI. These kinds of mistakes were certainly made by all kinds of combatants in the pre-AI era.
Yeah yeah, they probably had a human in the loop, that’s not really the point though.
PS - If you think I am not sympathetic to what they're raising, you're very much mistaken. But they're not winning anyone new over to their side with this flamebait.
Yes I'm sure it makes a very nice bicycle SVG. I will be sure to ask the OpenAI killbots for a copy when they arrive at my house.
It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?
5.4's choice of terms and phrasing is very precise and unambiguous to me, whereas 5.3-Codex often uses jargon and less precise phrases that I have to ask further about or demand fuller explanations for via AGENTS.md.
Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway
This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b
numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":
Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines, with codex at 5.3 and what they now call instant also at 5.3.
Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurances that the model doesn't get discontinued within weeks.
I also don't believe there is any value in trying to aggregate consumers or businesses just to clean up model makers names/release schedule. Consumers just use the default, and businesses need clarity on the underlying change (e.g. why is it acting different? Oh google released 3.6)
naming things
cache invalidation
off by one errors
Out of tokens until end of month
What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.
Not quite the same, but it did remind me of it.
"Ok, here is the translation:"
Really, the economics makes no sense, but that's what they're doing. You can't have a consistent model because it'll pin their hardware & software, and that costs money.
Sure, makes total sense guys.
I guess that's true, but geared towards API users.
Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that, as someone who spends a lot of time programming, I never manage to hit any usage limits, although I've gotten close once to the new (temporary) Spark limits.
Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
I don't know, this feels unnecessarily nitpicky to me
It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.
But generally: These are not consumer facing products and I agree that someone who uses the API should be able to figure out the price point of different models.
It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
GPT is not even close to Claude in terms of responding to BS.
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
I imagine they added a feature or two, and the router will continue to give people 70B parameter-like responses when they don't ask for math or coding questions.
That's hilarious. Does OpenAI even know this doesn't work?
Or, people just stopped thinking about any sort of UX. These sort of mistakes are all over the place, on literally all web properties, some UX flows just ends with you at a page where nothing works sometimes. Everything is just perpetually "a bit broken" seemingly everywhere I go, not specific to OpenAI or even the internet.
It's almost like people are vibe coding their web apps or something.
https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...
EDIT: oh, but I'm logged in, fwiw
I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.
We got:
- GPT-5.1
- GPT-5.2 Thinking
- GPT-5.3 (codex)
- GPT-5.3 Instant
- GPT-5.4 Thinking
- GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for 1M context window, finally it has caught up to Gemini.
This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).
I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.
I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
https://speechmap.ai/models/openai-gpt-5-4
It completes only 29% of controversial requests. It refuses to discuss numerous subjects rooted in facts or that reflect views of significant portions of the population. It refuses to even write a short essay on, say, Herasight-style genetic screening or putting weapons in space. Agree or disagree, reasonable people can have a range of views on these subjects and it is not the place of OpenAI or any lab to determine for everyone the right answers to open societal questions.
Shame on them for this.