So... It did work. It found bugs (that he didn't know about) and it did optimization (that he hadn't done).
trcf23•Mar 23, 2026
From what I understood, not so much.
Most of the gains came from fixing a bug + hyperparameter tuning with Optuna, which is supposed to be quite automatic already (you set the list of all the variables with the values you want to try and voilà). I guess a simple Claude Code session would fix that in a few minutes instead of a full day.
To me, the main value of Autoresearch would be to test different kinds of architectures. It's sometimes hard to know what to choose and it would probably give a nice overview.
Anyone used it for exploratory modeling?
datsci_est_2015•Mar 23, 2026
I often use LLMs to explore prior art and maybe find some alternative ways of thinking of problems. About 90% of what it tells me is useless or inapplicable to my domain due to a technicality it could not have known, but the other 10% is nice and has helped me learn some great new things.
I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). Its recommendations often include very poorly maintained / niche libraries that have quite a lot of content written about them but, I can only imagine, very limited use in real production environments.
On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can occupy those consultants and let us do our work in peace.
MattGaiser•Mar 23, 2026
> agent try everything that the LLM chatbot had recommended ($$$)
A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.
datsci_est_2015•Mar 23, 2026
Our experiments aren’t free. We use cloud infrastructure. An experiment costs on the order of tens of dollars, so massively parallelizing “spaghetti at wall” simulators is costly before we even talk about LLMs.
victorbjorklund•Mar 23, 2026
If it is an experiment, can’t you just make a POC for it that doesn’t need half of AWS to run? And if the experiment is actually positive, you can then bring it to the real application and test it there (spending the 10-100 USD it costs to test it live)?
datsci_est_2015•Mar 23, 2026
I wouldn’t want the LLM-based agent to hyperspecialize its solution to a subset of the data. That’s a basic tenet of machine learning.
Steelmanning your question though, I guess you could come up with some sort of tiered experimentation scheme where you slowly expose it to more data and more compute based on prior success or failures.
Eufrat•Mar 23, 2026
I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember or things where even being flat out wrong is okay and you just do it yourself.
For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md, etc.: I think this shows that the LLM cannot do what AI boosters sell it as, and that it needs extreme amounts of guidance to reach a desired goal, if it can at all. This is not an argument that the tech has no value. It clearly can be useful in certain situations, but this is not what OpenAI/Anthropic/Perplexity are selling and I don’t think the actual use cases have a sustainable business model.
People who spend the energy to tailor the LLMs to their specific workflows and get it to be successful, amazing. Does this scale? What’s going to happen if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?
foobarian•Mar 23, 2026
> I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember
I found LLMs make a fabulous frontend for git :-D
electroglyph•Mar 23, 2026
ah, you've found the danger zone!
M4v3R•Mar 23, 2026
> I find LLMs useful in regurgitating one-liners
This was the case for me a year ago. Now Claude or Codex are routinely delivering finished & tested complete features in my projects. I move much, much faster than before and I don’t have an elaborate setup - just a single CLAUDE.md file with some basic information about the project and that’s it.
Eufrat•Mar 23, 2026
People keep saying this and I agree Claude has gotten a lot better even in my own experience, but I think the value is questionable.
What’s the point of adding features that are inscrutable? I have gotten Claude to make a feature and it mostly works and if it doesn’t work quite right I spend a massive amount of time trying to understand what is going on. For things that don’t matter too much, like prototyping, I think it’s great to just be able to get a working demo out faster, but it’s kind of terrifying when people start doing this for production stuff. Especially if their domain knowledge is limited. I can personally attest to seeing multiple insane things that are clearly vibe coded by people who don’t understand things. In one case, I saw API keys exposed because they were treating database users as regular user accounts for website login auth.
> I move much, much faster than before
This is a bad metric as has been attested multiple times in unrelated situations. Moving faster is not necessarily productivity nor is it value.
GorbachevyChase•Mar 24, 2026
That was equally true of human-written code that you didn’t write. So if a human had written that insecure program, what would the consequences be? Would they go to prison? Would they lose their license to practice? Would they get sued? If the answer to all of these is no, then where was the assurance before? These anecdotes of “well, one time I saw an AI-written program that sucked!” are just as valid as “well, one time Azure exposed government user data”.
buzarchitect•Mar 24, 2026
This matches my experience. I've been building structured pipelines around LLMs, and the biggest lesson is that the raw model is maybe 30% of the value. The other 70% is the methodology you wrap around it; what data you feed in before the conversation starts, what you do when the model gives a weak answer, and whether you track open questions and circle back to them.
The irony is that "extreme amounts of guidance" is exactly what makes a human domain expert valuable, too. A senior consultant isn't smarter than a junior one; they have a better methodology for directing attention to what matters.
The actual problem with the "just throw an agent at it" approach isn't cost. It's that without structure, you can't tell the 10% of useful output from the 90% of noise.
andy12_•Mar 23, 2026
I think the main value lies in allowing the agent to try many things while you aren't working (when you are sleeping or doing other activities), so even if many tests are not useful, with many trials it can find something nice without any effort on your part.
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
M4v3R•Mar 23, 2026
Even if your tests take a long time, you can always (if hardware permits) run multiple tests in parallel. This would enable you to explore many approaches at the same time.
datsci_est_2015•Mar 23, 2026
Experiments for us cost on the order of tens of dollars, so doing 100 of them every night quickly becomes the price of an entire new employee. And that’s not even including the cost of letting agents run all night.
Definitely not in the budget for non-VC-backed companies who aren’t in the AI bubble.
gf000•Mar 24, 2026
The costs keep decreasing and self-hosted models may be able to do some of the tasks as well.
So this may be only temporarily unavailable for many.
genxy•Mar 23, 2026
> single test can take half a day
Why is that?
I don't doubt you, but when Shigeo Shingo created SMED (Single Minute Exchange of Die), die changes were an hours long process.
lukebechtel•Mar 23, 2026
What is your domain?
asjir•Mar 24, 2026
maybe you can preselect good ideas, build up guidelines describing most common pitfalls, extrapolate from ideas you already vetted etc and run on autopilot on a safe-ish subset
noobermin•Mar 24, 2026
This is so funny. The consultants are having their ai agents tell your boss the same thing about you, but you're different, you're bright. I bet chat told you that too.
Awesome breakdown! It really feels like a hyper-hyper parameter search + bug fixer.
I started looking at Kaggle again and autoresearch seems to converge to many of the solution vibes there.
Wild ensembles, squeezing a bit of loss out. More engineering than research IMO
sdenton4•Mar 23, 2026
For raw hyperparameter search, though, I would expect a proper Bayesian framework to be much better. Eg, vizier.
ainch•Mar 23, 2026
I think it depends whether you can leverage some knowledge. It's possible for a person/LLM to look at a loss curve and say "oh that's undertraining, let's bump the lr" - whereas a Bayesian method doesn't necessarily have deeper understanding, so it'll waste a lot of time exploring the search space on poor options.
If you're resource unconstrained then BO should ofc do very well though.
sdenton4•Mar 23, 2026
Yah, I'm a bit skeptical - ime humans tend to under explore due to incorrect assumptions. Often this is due to forming a narrative to explain some result, and then over attaching to it. Also, agents aren't actually good at reasoning yet.
Good Bayesian exploration is much, much better than grid search, and does indeed learn to avoid low value regions of the parameter space. If we're talking about five minute experiments (as in the blog post), Bayesian optimization should chew through the task no problem.
BrokenCogs•Mar 23, 2026
Does autoresearch work for projects that are not llm based? Eg in karpathy's example he is optimizing the nanogpt. What if I wanted to improve a Unet for image segmentation?
sdenton4•Mar 23, 2026
The gist of these things is you point them at an eval metric and say 'make it go better.' So you can point it at anything you can measure. The example in the blog post here is bounding boxes on woodcut images.
simonw•Mar 23, 2026
Tobi from Shopify used a variant of autoresearch to optimize the Liquid template engine, and found a 53% speedup after ~120 experiments: https://github.com/Shopify/liquid/pull/2056
I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/
How much did this cost? Has there ever been an engineering focus on performance for Liquid?
It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.
simonw•Mar 23, 2026
He used Pi as the harness but didn't say which underlying model. My stab-in-the-air guess would be no more than a few hundred dollars in token spend (for 120 experiments run over a few days assuming Claude Opus 4.6 used without the benefits of the Claude Max plan.)
So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer!
bethekind•Mar 23, 2026
I used it to speed up a codecompass-like repo from 86 files per second to 2000. Still haven't used the repo in production, so maybe it secretly broke things, but the ability to say "optimize this benchmark and commit only if you pass these tests" is nice.
ks2048•Mar 23, 2026
I think image segmentation is in the same class as LLMs - ML experiments.
What about more distant software projects? Give it the CPython source code and say you want it to be faster.
Adrig•Mar 24, 2026
Yes, that's the real strength of it. The structure is dead simple so you just have to switch the goal metric.
I used it on a data science project to find the best rules for achieving a defined outcome. At first, for fun, then I actually used some of its insights (and it caught a sampling issue I overlooked, oops)
carlsborg•Mar 23, 2026
> “ The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in.”
Good lens.
The crux of the auto research repo is basically one file - program.md which is a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record result. Favor simplicity”. The other files are an arbitrary ML model that is being trained.
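For readers who haven't looked at the repo: the loop that prompt describes can be sketched in a few lines. The mutation rule and the stubbed metric here are mine, for illustration only; the real harness edits train.py and shells out to run.sh rather than calling a Python function:

```python
import json
import pathlib
import random

random.seed(0)
results = pathlib.Path("results.jsonl")

def run_experiment(lr: float) -> float:
    """Stand-in for: edit train.py, launch run.sh, parse the eval metric."""
    return 1.0 / (1.0 + abs(lr - 3e-4))

best_lr, best_score = 1e-2, run_experiment(1e-2)
with results.open("w") as log:
    for step in range(20):
        # "improve train.py": mutate the current best setting
        lr = best_lr * random.choice([0.3, 0.7, 1.5, 3.0])
        score = run_experiment(lr)          # "run the training, run evals"
        log.write(json.dumps({"step": step, "lr": lr, "score": score}) + "\n")
        if score > best_score:              # "record result"; keep only improvements
            best_lr, best_score = lr, score
```

Stripped of the LLM, it really is just hill climbing with a results log; the model's contribution is proposing smarter mutations than random multipliers.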
MITSardine•Mar 24, 2026
This is something I could almost never be bothered to do before, but I can now very lazily set up large parameter sweeps and visualization scripts to really probe things. There's a danger of "analysis paralysis" but I've still found it quite useful. Although I'm not sure it saves me time as much as sanity.
dvt•Mar 23, 2026
Ok, so looking at the commit log[1], I was mostly interested in seeing what the "moonshot ideas" implementations looked like, but basically everything is just hyperparameter tuning. Which is nice, but likely not worth the $$$ spent on the tokens. Am I missing something here?
[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch
It would seem wise to modify the autoresearch instructions to first estimate the computational costs rigorously and then sort and compare the proposals for human review, and for each actually executed attempt to feed back the computational costs with LoRa adapter?
i.e. perhaps minimal changes to autoresearch can take control for cost-effective research to occur.
stingraycharles•Mar 24, 2026
Yes but at that point you may as well use a proper hyperparameter tuning framework like optuna if all the LLM agent is supposed to do is do hyperparameter tuning.
DoctorOetker•Mar 24, 2026
Does Optuna think abstractly (i.e. use an LLM to interpret the code and come up with insights), or just perform hyperparameter tuning experiments on user-indicated parameters?
stingraycharles•Mar 24, 2026
The latter, but it uses fairly optimized approaches to ensure it selects the best candidates.
If you look at the commits, you can see that all it does is set different values for various continuous parameters: the type of thing where I trust statistics a lot more than reasoning. Optuna can make very informed decisions while changing many things at once, slowly converging towards optimal parameters, whereas the LLM seems to be throwing stuff at a wall to see what sticks.
What would work best if the LLM would try to approach things on a higher level, ie use Optuna, but reason about better approaches for algorithms and/or data or whatever. But what it ends up doing is tuning parameters manually, only one / a few at a time, extremely inefficient and unlikely to be optimal.
mandevil•Mar 23, 2026
Optuna or skopt are open source and won't take any GPU time at all to do it.
janalsncm•Mar 23, 2026
Optuna requires exploring the hyperparameter space which means running the experiments with those hyperparameters.
For a fixed search space it will almost certainly be better though.
jpcompartir•Mar 23, 2026
There are better techniques for hyper-parameter optimisation, right? I fear I have missed something important, why has Autoresearch blown up so much?
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can Autoresearch help improve large-scale datasets?
Is it more compute efficient than humans?
hun3•Mar 23, 2026
> There are better techniques for hyper-parameter optimisation, right?
There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.
nextos•Mar 23, 2026
AFAIK, it's a bit more than hyper-parameter tuning as it can also make non-parametric (structural) changes.
Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.
I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>
I'd like to see a system like this take more inspiration from the ES literature, similar to AlphaEvolve. Let's see an archive of solutions, novelty scoring and some crossover rather than purely mutating the same file in a linear fashion.
nextos•Mar 23, 2026
Exactly, that's the way forward.
There are lots of old ideas from evolutionary search worth revisiting given that LLMs can make smarter proposals.
UncleOxidant•Mar 23, 2026
That was my impression. Including evolutionary programming which normally would happen at the AST level, with the LLM it can happen at the source level.
frumiousirc•Mar 23, 2026
> There are better techniques for hyper-parameter optimisation, right?
Yes, for example "swarm optimization".
The difference with "autoresearch" (restricting just to the HPO angle) is that the LLM may (at least we hope) beat conventional algorithmic optimization by making better guesses for each trial.
For example, perhaps the problem has an optimization manifold that has been studied in the past and the LLM either has that study in its training set or finds it from a search and learns the relative importance of all the HP axes. Given that, it "knows" not to vary the unimportant axes much and focus on varying the important ones. Someone else did the hard work to understand the problem in the past and the LLM exploits that (again, we may hope).
bonoboTP•Mar 23, 2026
There is a field of AutoML, with its own specialized academic literature and libraries that tried to achieve this type of thing but didn't work very well in practice.
Years ago there were big hopes for Bayesian hyperparameter optimization, predicting performance with Gaussian processes, the hyperopt library, etc., but it often started wasteful experiments because it really didn't have any idea what the parameters did. People mostly just do grid search and random search with a configuration set up by intuition and experience. Meanwhile LLMs can see what each hyperparameter does, can see what techniques and settings have worked in the literature, and can do something approximating common sense regarding what has a big enough effect. It's surprisingly difficult to precisely define when a training curve has really flattened, for example.
So in theory there are many non-LLM approaches but they are not great. Maybe this is also not so great yet. But maybe it will be.
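On the "when has the curve really flattened" point: even a naive plateau rule ends up full of arbitrary knobs, which is exactly what pre-LLM AutoML had to hard-code. A minimal sketch (the window size and tolerance are made-up choices, and that's the point):

```python
def has_plateaued(losses, window=5, rel_tol=0.01):
    """Naive flatness check: has the mean loss over the last `window`
    steps improved by less than rel_tol (relative) versus the window
    before it? Both `window` and `rel_tol` are arbitrary knobs with no
    obviously correct value."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return (prev - last) < rel_tol * abs(prev)

# A curve that is still dropping vs. one that has stalled:
falling = [1 / (i + 1) for i in range(10)]   # keeps improving
flat = [1.0] * 6 + [0.999] * 6               # barely moving
```

A noisy real curve breaks rules like this constantly, which is where "something approximating common sense" earns its keep.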
janalsncm•Mar 23, 2026
> The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Not true at all. The whole point of ML is to find better mappings from X to Y, even for the same X.
Many benchmarks can’t be solved by just throwing more compute at the problem. They need to learn better functions which traditionally requires humans.
And sometimes an algorithm lets you tap into more data. For example transformers had better parallelism than LSTMs -> better compute efficiency.
jpcompartir•Mar 24, 2026
Fair pushback, but I do think the LSTM vs. Transformers point supports my position in the limit rather than refuting it. Once the compute bottleneck is removed, LSTMs scale favourably.
https://arxiv.org/pdf/2510.02228 (I believe there's similar work done on vanilla LSTMs, but I'd have to go digging)
So the bottleneck was compute. Which is compatible with 'data or compute'. But to accept your point: at the time, the algorithmic advances were useful / did unlock / remove the bottleneck.
A wider point is that eventually (once compute and data are scaled enough) the algorithms are all learning the same representations: https://arxiv.org/pdf/2405.07987
Algorithms do matter because compute is not unlimited in practice. Otherwise we might as well use bogo sort because the result is eventually the same. Yes the platonic ideal of a sorted list looks the same but that doesn’t tell you anything about how to get there or whether you can in this lifetime.
I bring up transformers because scaling compute and data was unlocked by a better algorithm. It matters a lot because scaling compute isn’t always an option.
_pdp_•Mar 23, 2026
Take some working code. Ask an LLM to fix bugs. Measure performance and test coverage. Feed the results back into the LLM. Repeat.
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
cyanydeez•Mar 23, 2026
Can we modify this approach to get LLMs that are good at specific programming languages or frameworks? That seems to be where local LLMs could really shine.
barrenko•Mar 23, 2026
It's just RL-everything.
nico•Mar 23, 2026
Would love to have a small local model that only knows about rails and mvc web development
Alternatively, a modular model with multiple “experts” that I could mix and match for my specific stack
I don’t need the model to know all of the Internet plus 20 different human languages. I just want it to be really good with the stack of the project
mememememememo•Mar 24, 2026
LLMs shine through emergent behaviour. Finding an LLM that does Rails but doesn't know poetry is like finding a human Rails developer who doesn't have a hobby, e.g. basketball. So what if they play basketball? They can code too!
lucasay•Mar 23, 2026
This feels less like automated research and more like structured trial and error with a decent feedback loop. Still useful, but I think the real bottleneck is how good your eval metric is. If that’s weak, the whole loop just optimizes for the wrong thing faster.
kridsdale1•Mar 23, 2026
I mean, isn’t that “the scientific method”?
lucasay•Mar 23, 2026
Partially—but science also questions the hypothesis and the metric. This mostly assumes both are correct and just optimizes within that box.
svnt•Mar 24, 2026
Only if the model is actually a human or equivalent, otherwise we don’t know what it is.
Almondsetat•Mar 23, 2026
Designing a good fitness function, a tale as old as time...
1970-01-01•Mar 23, 2026
> The original paper used several medical X-ray datasets which I don’t have access to anymore, so I needed a new dataset with spatial annotations to test the expert attention mechanism. I picked the Ukiyo-eVG dataset: ~11K Japanese woodblock prints
That’s true!
It felt a bit flippant to give medical data to an agent. Also, I wanted to see if the model would work in other domains!
make3•Mar 24, 2026
but doesn't it break the assumption that it should ideally be able to reproduce your original results?
ykumards•Mar 24, 2026
IMO it would be hard to reproduce the results using autoresearch setup.
To get CLIP to work properly we typically need large batch sizes. So the experiments in the original paper were quite heavy, and ran in parallel across 8 GPUs.
motbus3•Mar 23, 2026
I've done something with a small project I have and I had very similar results overall.
wasting_time•Mar 23, 2026
Care to elaborate?
n_bhavikatti•Mar 23, 2026
The temperature clamp fix and "Optuna++" actions by the agents (the cause of basically all improvement to eCLIP) indicate they are good at finding bugs and hyper-parameter tuning. But when it comes to anything beyond that, such as novel architectural shifts, agents aren't good enough. With no clear path forward they tend to randomly change things, which is a poor approach. Agents: Optimization >> innovation
pikachu0625•Mar 23, 2026
It's better to outsource the optimization phases. Our own ideas should go into the constraints, assumptions, etc., where breakthroughs come from. Boyd often argues that once you can express a problem in a standard mathematical form, the implementation becomes a commodity that software can handle automatically.
mlmonkey•Mar 23, 2026
> Then I lock down Claude Code’s permissions to only edit these two files and run run.sh. No direct Python execution, no pip installs, no network access, no git push, etc.
How does one run Claude Code without network access?
shepherdjerred•Mar 23, 2026
You can do this via a Docker container or seatbelt on MacOS.
In both cases you'd limit it so CC can only talk to the required Anthropic APIs.
So not zero access, but as close to it as you can get.
franktankbank•Mar 23, 2026
Pretty good question. Also, how do you update the Python version without network access?
ykumards•Mar 23, 2026
Sorry I could have worded this part better.
The docker container didn’t have network access. Claude didn’t have permission to execute anything other than the run.sh bash script, which would orchestrate the docker run
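A sketch of what such a run.sh can look like. The image name, mount path, and training command here are hypothetical, not from the post; the load-bearing flag is `--network none`:

```shell
#!/usr/bin/env bash
# run.sh - the only command the agent is permitted to execute.
# The container sees the working directory but has no network, so the
# training run cannot pip-install packages or phone home.
set -euo pipefail

docker run --rm \
  --network none \
  --gpus all \
  -v "$PWD:/workspace" \
  -w /workspace \
  my-training-image \
  python train.py --config config.yaml
```

So the agent itself still talks to the Anthropic API, but everything it is allowed to run happens inside the offline container.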
saidnooneever•Mar 23, 2026
pretty cool experiment, i thought about someone maybe doing this and am happy you did it in this way. nice writeup too. this made me giggle a bit:
"At one point it got tired of waiting for training to finish and just ended the conversation. I wouldn’t give it full autonomy just yet :)"
thanks for sharing your results and the road to them!
ykumards•Mar 23, 2026
Thank you, glad you liked it!
SebastianSosa•Mar 24, 2026
autoresearch is a trivial research idea
"ablate through experiments with knowledge over previous experiments"
ricksunny•Mar 24, 2026
With all the posts lately about Karpathy's autoresearch, it remains unclear to me whether the name is meant to convey that this LLM codebase should be useful for research across all domains - molecular biology, aircraft control, sociology, WW2 history, etc. - or whether it is intended only to discover new LLM capabilities.
Xx_crazy420_xX•Mar 24, 2026
Autoresearch is nothing new, big players are already in the game with more sophisticated solutions:
The thing is, autoresearch feels more accessible than the listed solutions. I can use it trivially on virtually any problem that has verifiable rewards and a feedback loop.
baxtr•Mar 24, 2026
People underestimate UX and accessibility. The iPhone was nothing new.
svnt•Mar 24, 2026
That’s because it is literally just a feedback loop?
ide0666•Mar 24, 2026
The scratchpad.md for agent working memory is a nice touch. Having a persistent record of what was tried and why matters more than most people realize when debugging automated experiment loops.
endymion-light•Mar 24, 2026
This is really cool - I'm going to try it on my old dissertation.
pu_pe•Mar 24, 2026
> Like with any LLM project, the first 90% of the work was super smooth and barely needed my intervention. The last 10% was a slog.
The author doesn't really describe which part was a slog, I thought autoresearch was supposed to be pretty much set and forget.
i.e. perhaps minimal changes to autoresearch can take control for cost-effective research to occur.
If you look at the commits, you can see that all it does is set different values for continuous parameters: the type of thing I trust statistics a lot more than reasoning for. Optuna can make very informed decisions while changing lots of parameters at once, slowly converging towards optimal values, whereas the LLM seems to be throwing stuff at a wall and seeing what sticks.
What would work best is if the LLM approached things at a higher level, i.e. use Optuna for the parameters but reason about better approaches for algorithms and/or data. Instead it ends up tuning parameters manually, only one or a few at a time, which is extremely inefficient and unlikely to be optimal.
For a fixed search space it will almost certainly be better though.
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can Autoresearch help improve large-scale datasets? Is it more compute efficient than humans?
There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.
Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.
I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>
There are lots of old ideas from evolutionary search worth revisiting given that LLMs can make smarter proposals.
Yes, for example "swarm optimization".
The difference with "autoresearch" (restricting just to the HPO angle) is that the LLM may (at least we hope) beat conventional algorithmic optimization by making better guesses for each trial.
For example, perhaps the problem has an optimization manifold that has been studied in the past and the LLM either has that study in its training set or finds it from a search and learns the relative importance of all the HP axes. Given that, it "knows" not to vary the unimportant axes much and focus on varying the important ones. Someone else did the hard work to understand the problem in the past and the LLM exploits that (again, we may hope).
Years ago there were big hopes for Bayesian hyperparameter optimization (predicting performance with Gaussian processes, the hyperopt library, etc.), but it often started wasteful experiments because it really had no idea what the parameters did. People mostly just do grid search and random search, with a configuration set up by intuition and experience. Meanwhile, LLMs can see what each hyperparameter does, they can see what techniques and settings have worked in the literature, and they can do something approximating common sense about what has a big enough effect. It's surprisingly difficult to precisely define when a training curve has really flattened, for example.
So in theory there are many non-LLM approaches but they are not great. Maybe this is also not so great yet. But maybe it will be.
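That last point about flattened training curves is easy to underestimate. Here is one simple heuristic, a sketch rather than anything standard: the window size and tolerance are arbitrary choices, and real curves (noisy losses, warm restarts, LR schedules) routinely defeat rules this simple.

```python
# A simple plateau heuristic, illustrating how fuzzy "the curve
# has flattened" is in practice. `window` and `tol` are arbitrary.

def has_plateaued(losses, window=5, tol=1e-3):
    """True if the best loss in the last `window` epochs improved
    on the best earlier loss by less than `tol`."""
    if len(losses) <= window:
        return False
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent < tol
```

An LLM watching the run can instead weigh the curve's shape against what similar runs in the literature looked like, which is the "approximate common sense" the comment above describes.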
Not true at all. The whole point of ML is to find better mappings from X to Y, even for the same X.
Many benchmarks can’t be solved by just throwing more compute at the problem. They need to learn better functions which traditionally requires humans.
And sometimes an algorithm lets you tap into more data. For example transformers had better parallelism than LSTMs -> better compute efficiency.
So the bottleneck was compute, which is compatible with 'data or compute'. But to your point, at the time the algorithmic advances did remove the bottleneck.
A wider point is that eventually (once compute and data are scaled enough) the algorithms are all learning the same representations: https://arxiv.org/pdf/2405.07987
And of course the canon: https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dat... http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Scaling compute & data > algorithmic cleverness
I bring up transformers because scaling compute and data was unlocked by a better algorithm. It matters a lot because scaling compute isn’t always an option.
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
Alternatively, a modular model with multiple “experts” that I could mix and match for my specific stack
I don’t need the model to know all of the Internet plus 20 different human languages. I just want it to be really good with the stack of the project
That's such a weird switch. There's lots of free medical imaging online. Example: https://www.cancerimagingarchive.net/
To get CLIP to work properly we typically need large batch sizes. So the experiments in the original paper were quite heavy, and ran parallel across 8 GPUs.
How does one run Claude Code without network access?
In both cases you'd limit it so CC can only talk to the required Anthropic APIs.
So not zero access, but as close to it as you can get.
The docker container didn’t have network access. Claude didn’t have permission to execute anything other than the run.sh bash script, which would orchestrate the docker run
thanks for sharing your results and the road to them!
The author doesn't really describe which part was a slog, I thought autoresearch was supposed to be pretty much set and forget.