This is basically the same workflow I've come to adopt. I don't use any "pre-built" skills; mine are actually still .md files in the .claude/commands/ folder, because that's when I started. The workflow is so good, I'm the bottleneck.
I've started to use git worktrees to parallelize my work. I spend so much time waiting...why not wait less on 2 things? This is not a solved problem in my setup. I have a hard time managing just two agents and keeping them isolated. But again, I'm the bottleneck. I think I could use 5 agents if my brain were smarter........or if the tools were better.
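For reference, the isolation piece is just stock `git worktree` commands; a minimal sketch (the repo, branch, and directory names here are made up for illustration):

```shell
# minimal sketch: a scratch repo with one worktree (checkout) per agent
git init -q demo && cd demo
git -c user.name=d -c user.email=d@example.com commit -q --allow-empty -m "init"
git worktree add ../demo-feat-a -b feat-a   # agent 1 works in this directory
git worktree add ../demo-feat-b -b feat-b   # agent 2, fully isolated from agent 1
git worktree list                           # main checkout plus the two worktrees
```

Each agent gets its own working directory and branch, so they can't stomp on each other's uncommitted files; `git worktree remove <path>` cleans up once a branch is merged.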
I am also a PM by day and I'm in Claude Code for PM work almost 90% of my day.
orwin•Mar 23, 2026
I like Claude, at least when the user reviews the code before asking for a PR. But gods, I hate tickets/feature requests written by Opus/Sonnet (or worse: Codex or Gemini). If you know/understand your product well enough, it's probably less of a problem for your team than it is for mine, but each time I see a feature request automagically written in the backlog, I know I will have to spend at least 30 minutes rewriting it so that it doesn't take us one hour to refine it collectively.
jmathai•Mar 23, 2026
Is it that the tickets are too verbose?
orwin•Mar 24, 2026
A bit, but mostly they propose extremely well-rounded solutions that are almost never complete, and sometimes miss a major point. I would rather have my juniors work to understand what is needed, and/or ask me questions, than follow a ticket that is basically a Claude plan. Right now I am modifying an object that was incomplete, and I will have to do a migration because I didn't catch the missing attribute during the PR. It isn't big, and we could have coded a workaround instead of redesigning the object, but: workarounds complicate the code, the data is less intuitive, and it also means the person who wrote the original object didn't really understand the goals.
With a less 'expensive' ticket, one with less explanation about how things should be done and more about why they are needed, we would have had discussions, in dailies or 1-on-1s, and that could have been ironed out then.
Yeah, basically Claude generates tickets that are heavy on the 'how' and light on the 'why', and I think it should be the other way around, for multiple reasons, but I'm already long-winded.
CrzyLngPwd•Mar 23, 2026
> And like any good manager, you get to claim credit for all the work your “team” does.
Is that how it works? Do managers claim credit for the work of those below them, despite not doing the work?
I hope they also get penalised when a lowly worker does a bad thing, even if the worker is an LLM silently misinterpreting a vague instruction.
idiotsecant•Mar 23, 2026
Yes. That is how management works. Although a good manager will focus some of that praise onto team members who deserve it.
jmathai•Mar 23, 2026
Yup, the manager gets implicit credit for the work their team does. In most cases, deservedly so. I don't see why it should be any different for engineers using LLMs as "direct reports". Not all engineers will be the same level of "good" with LLM tools so the better you are (as with any other skill as well) the more credit you would receive.
dakiol•Mar 23, 2026
Are you kidding? What else would managers get credit for? They don't produce anything the company is interested in. They steer, they manage, and so if the ones being managed produce the thing the company is interested in, then sure, all the credit goes to the team (including the manager!).
As it usually happens, getting credit means nothing if not accompanied by a salary bump or something like that. And as it usually happens, not the whole team can get a salary bump. So the ones who get the bump are usually one or two seniors on the team, plus the manager of course... because the manager is the gatekeeper between upper management (the ones who approve salary bumps) and the ICs... and no sane manager would sacrifice a salary bump for themselves just to give it away to an IC. And that's not being a bad manager, that's simply being human. Also, if you think about it, if the team succeeded in delivering "the thing", then the manager would think it's partially because of their managing, and so they would believe a salary bump is deserved.
When things go south, no one gets penalized. A simple "post-mortem" is written in Confluence and people write "action items". So, yeah, no need for the manager to take the blame.
It's all very shitty, but it's always been like that.
troyvit•Mar 23, 2026
> I hope they also get penalised when a lowly worker does a bad thing, even if the worker is an LLM silently misinterpreting a vague instruction.
Yeah the buck stops with the manager (IMO the direct manager). So I can do some constructive criticism with my dev if they make a mistake, but it's my fault in the larger org that it happened. Then it's my manager's job to work with me to make sure I create the environment where the same mistake doesn't happen again. Am I training well? Am I giving them well-scoped work? All that.
markbao•Mar 23, 2026
> What’s become more fun is building the infrastructure that makes the agents effective.
Solving new problems is a thing engineers get to do constantly, whereas building an agent infrastructure is mostly a one-ish time thing. Yes, it evolves, but I worry that once the fun of building an agentic engineering system is done, we’re stuck doing arguably the most tedious job in the SDLC, reviewing code. It’s like if you were a principal researcher who stopped doing research and instead only peer reviewed other people’s papers.
The silver lining is if the feeling of faster progress through these AI tools gives enough satisfaction to replace the missing satisfaction of problem-solving. Different people will derive different levels of contentment from this. For me, it has not been an obvious upgrade in satisfaction. I’m definitely spending less time in flow.
serf•Mar 23, 2026
I like LLMs too, and I think they make me more productive...
but a chart of commits/contribs is such a lousy metric for productivity.
It's about on par with the ridiculousness of LOC implying code quality.
matheusmoreira•Mar 23, 2026
I don't know. Claude helped me implement, in a matter of days, a ton of features I had been procrastinating on for months. I'm implementing features in my project faster than I can blog about them. It definitely manifested as a huge commit spike.
And it's not like I'm blindly committing LLM output. I often write everything myself because I want to understand what I'm doing. Claude often comments that my version is better and cleaner. It's just that the tasks seemed so monumental I felt paralyzed and had difficulty even starting. Claude broke things down into manageable steps that were easy to do. Having a code review partner was also invaluable for a solo hobbyist like me.
munk-a•Mar 23, 2026
This right here is the big value I see in LLMs as well. I specifically suffer from analysis paralysis when starting something big, and just getting skeletonized cheap code out quick as a template, then refining it, plays much more to my strengths. I am ADHD, and task breakdown is a known difficulty for that disorder, so it has been hugely helpful.
That said, by the time I'm happy with it, all the AI stuff outside very boilerplate ops/config stuff has been rewritten and refined. I just find it quite helpful to get over that initial hump of "I have nothing but a dream" to the stage of "I have a thing that compiles but is terrible". Once I can compile it, then I can refine, which is where my strengths lie.
vova_hn2•Mar 23, 2026
> Claude often comments that my version is better and cleaner.
Every comment I make is a "really perceptive observation" according to Claude and every question I ask is either "brilliant" or at least "good", so...
marginalia_nu•Mar 23, 2026
In Claude's world, every user is a generational genius up there with Gauss and Euler, every new suggestion, no matter how banal, is a mind boggling Copernican turn that upends epistemology as we know it.
matheusmoreira•Mar 23, 2026
I have quite a lot of skepticism about that as well. I didn't mean to imply I believed it. I was just trying to say that I wasn't lazily copy pasting the LLM output into my repository.
I'm taking the time to understand what it is proposing. I'm pushing back and asking for clarifications. When I implement things, I do it myself in my own way. I experienced a huge increase in my ability to make the cool stuff I've always wanted to make even in spite of this.
I can't even fathom how productive the people who have Claude Code cranking out features on multiple git worktrees in parallel must be. I wouldn't do that in my personal projects but I can totally understand doing that in a job setting.
ziml77•Mar 23, 2026
It's really annoying when it does that. I wish there were an alternate mode you could toggle when pushing back on its output: one where it's tuned not to assume you're the authority, so it can come back with a response that doesn't just immediately jump to agreeing with you.
SpicyLemonZest•Mar 24, 2026
I do think this is a learnable skill. I haven't quite gotten Claude to push back as much as I would prefer, but there's a specific tone to strike where the average person in your position would expect and welcome being told they're wrong.
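One low-tech lever for this (an assumption about what works in practice, not a documented switch): a standing instruction in the project's CLAUDE.md, which Claude Code reads into every session. Something along these lines:

```
# CLAUDE.md (hypothetical excerpt)
When I push back on your output, do not immediately agree with me.
Re-check the code or your reasoning first; if you still think your
original answer was right, say so and explain why.
Never open a reply with praise of my question or observation.
```

It won't fully override the model's agreeable tendencies, but it tends to raise the rate of genuine pushback.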
piva00•Mar 24, 2026
For me it started doing this more often lately. I remember in January being quite happy that it wasn't praising or sycophantic; after correcting it I'd usually get responses like "I see the issue now, I shouldn't have done X and instead do Y".
Lately it's been praising me much more for correcting it, quite annoying to be honest, it's just a clanker, I want it to act like a non-human clanker instead of playing theater with me...
jedmeyers•Mar 23, 2026
> It's about on par with the ridiculousness of LOC implying code quality.
The most effective engineers on the brownfield projects I've worked on usually deleted more LOC than they added, because they were always looking to simplify the code and replace it with useful (and often shorter) abstractions.
marginalia_nu•Mar 23, 2026
Yeah it's very much the opposite of how Claude Code tends to approach a problem it hasn't seen before, which tends toward constructing an elaborate Rube Goldberg machine by just inserting more and more logic until it manages to produce the desired outcome. You can coax it into simplifying its output, but it's very time consuming to get something that is of a professional standard, and doesn't introduce technical debt.
Especially in brownfield settings, if you do use CC, you really should be spending something like a day refactoring the code for every 15 minutes of work it spends implementing new functionality. Otherwise the accumulation of technical debt will make the code base unworkable by both human and claude hands in a fairly short time.
I think overall it can be a force for good, and a source of high quality code, but it requires a significant amount of human intervention.
Claude Code operating on unsupervised Claude code fairly rapidly generates a mess not even Claude Code can decode, resulting in a sort of technical debt Kessler syndrome, where the low quality makes the edits worse, which makes the quality worse, rinse and repeat.
koolba•Mar 23, 2026
My fav metric for codebase improvement (not feature improvement) is negative LOC. Nothing beats a patch that only deletes things without breaking anything (and without simply removing tests). Just dead code deletion.
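If you want to watch that number, git already has it; a small sketch (toy repo and file names are illustrative) that sums the `--numstat` columns over a commit range:

```shell
# toy repo: one commit adds 3 lines, the next deletes 2 of them
git init -q loc-demo && cd loc-demo
git config user.name demo && git config user.email demo@example.com
printf 'a\nb\nc\n' > f.txt && git add f.txt && git commit -qm "add 3 lines"
printf 'a\n' > f.txt && git commit -qam "delete 2 lines"
# net LOC delta for a range: insertions minus deletions (negative = codebase shrank)
git log --numstat --format= HEAD~1..HEAD \
  | awk '{ ins += $1; del += $2 } END { print ins - del }'
# prints -2 for the last patch
```

The same pipeline over a longer range (or `git diff --numstat base..branch`) gives a per-PR net-LOC figure.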
MeetingsBrowser•Mar 23, 2026
> I’m not “using a tool that writes code.” I’m in a tight loop: kick off a task, the agent writes code, I check the preview, read the diff, give feedback or merge, kick off the next task
The assumption behind this workflow is that Claude Code can complete tasks with little or no oversight.
If the flow looks like review->accept, review->accept, it is manageable.
In my personal experience, claude needs heavy guidance and multiple rounds of feedback before arriving at a mergeable solution (if it does at all).
Interleaving many long running tasks with multiple rounds of feedback does not scale well unfortunately.
I can only remember so much, and at some point I spend more time trying to understand what has been done so far to give accurate feedback than actually giving feedback for the next iteration.
felipevb•Mar 23, 2026
> The worktree system removed the friction of context-switching - juggling multiple streams of work without them colliding.
I'm so conflicted about this. On the one hand I love the buzz of feeling so productive and working on many different threads. On the other hand my brain gets so fried, and I think this is a big contributor.
dgunay•Mar 23, 2026
I do parallel agents in worktrees and I don't always constantly keep an eye on them like a fry cook flipping 20 burgers at once. Sometimes it's just nice to know that I can spin one up, come back tomorrow, and some progress has been made without breaking my current flow.
saadn92•Mar 23, 2026
The way I handle this is to just create pull requests (I tell the agent to do it at the end), and then I come back at a later time to review, so I always have stuff queued up to review.
kace91•Mar 23, 2026
I would like some research regarding multi agent flows and impact on speed and correctness, because I have a feeling that it's like a texting and driving situation, where self perception of skill loss and measured skill loss diverge.
I have nothing to back up the idea though.
saadn92•Mar 23, 2026
You do lose context, but if you generate a plan beforehand and save it, it's easier to regain that context when you come back. I've been able to get things out a lot more quickly this way, because instead of "working" that day, I'll just review the work that's been queued up and focus on it one item at a time. I'm still the bottleneck, but it has allowed me to move more quickly at times.
jannyfer•Mar 23, 2026
Ooooh very interesting idea.
I also have nothing to back it up, but it fits my mental models. When juggling multiple things as humans, it eats up your context window (working memory). After a long day, your coherence degrades and your context window needs flushing (sleeping) and you need to start a new session (new day, or post-nap afternoon).
kukkeliskuu•Mar 24, 2026
I am just running multiple agents to work on different projects. Once in a while I have a feature that splits nicely into multiple threads that can be developed concurrently, and I use several concurrent agents to do it. But that is rare.
kalaksi•Mar 23, 2026
Is constant juggling of multiple agents productive? I haven't seen the allure (except maybe with 2 agents sometimes). I guess it depends on what kind of tasks one is doing and I can imagine it working if doing large, long-running tasks, but then reviewing those large changes and refactoring becomes more difficult. And if you're juggling multiple agents, there's the mental context switching and tooling overhead for managing them. Maybe predictable and repetitive tasks can work well.
I prefer focusing mostly on 1 task at a time (sometimes 2 for a short time, or asking another agent some questions simultaneously) and doing the task in chunks so it doesn't take much time until you have something to review. Then I review it, maybe ask for some refactoring, and let it continue to the next step (maybe let it continue a bit before finishing the review if I'm feeling confident about the code). It's easier to review smaller self-contained chunks, and easier to refer to code and tell the AI what needs changing, because there are fewer relevant lines.
kukkeliskuu•Mar 24, 2026
I have two modes. Mostly what you describe (phase 1), but followed by "project management" (phase 2), where I iterate through implementing the plan made in phase 1.
aguimaraes1986•Mar 23, 2026
This is the "lines of code per week" metric from the 90s, repackaged. "I'm doing more PRs" is not evidence that AI is working, it's evidence that you are merging more.
Whether that's good depends entirely on what you are merging.
I use AI every day too. But treating throughput of code going to production as a success metric, without any mention of quality, bugs, or maintenance burden is exactly the kind of thinking developers used to push back on when management proposed it.
Turns out we weren't opposed to bad metrics! We were just opposed to being measured!
Given the chance to pick our own, we jumped straight to the same nonsense.
zahlman•Mar 23, 2026
> Turns out we weren't opposed to bad metrics! We were just opposed to being measured! Given the chance to pick our own, we jumped straight to the same nonsense.
This seems like a distinction without a difference, unless there actually are any good metrics (which also requires them to be objectively and reliably quantifiable). I think most developers don't really want to measure themselves, it's just that pro-AI people think measurement is necessary to put forward a convincing argument that they've improved anything.
sodapopcan•Mar 23, 2026
The only time metrics have been useful to me in the past is when they are kept private to each team, which is to say that I do think they are useful for measuring yourself, but not for others to measure you. Taken over time, they can eventually give you a really good idea of what you can deliver. Sandbag a bit (i.e., undershoot that number), communicate that to ye olde stakeholders, and everybody's happy that you can actually do what you say you'll do without being stressed out (obviously this doesn't work in startups).
browningstreet•Mar 23, 2026
Maybe the author knows that too, but wants to talk about it nonetheless. First line of the article: “Commits are a terrible metric for output, but they're the most visible signal I have.”
skydhash•Mar 23, 2026
What about number of working features or system completeness? Current state vs desired state is fairly visible.
101011•Mar 23, 2026
How do you define system completeness? What if you ship one really big feature vs. three really small ones?
I would posit that you need extra context to obtain meaning from those metrics, which inherently makes them less visible
skydhash•Mar 23, 2026
System completeness can be defined from the product definition. The latter is where requirements and definitions of done come from. Working features are the most important thing and most principles and techniques were about reducing the cost to get there.
jmalicki•Mar 24, 2026
If you only accept PRs that implement working features, i.e. you're not gaming it, then it's the same thing.
If you try to come up with an objective definition of working feature you're back to gamability criticism.
the_arun•Mar 24, 2026
Using AI we can make 1000s of commits per day, which makes this metric even more pointless in the age of AI. Increased sales, new subscription counts, reduced bug counts, reduced incidents, etc., those can be real metrics. I'm sure I am preaching to the choir.
tecleandor•Mar 24, 2026
I have coworkers committing tens or hundreds of thousands of "lines of code" a week, because they'll push whatever the AI gives them, including dependencies and virtualenvs, without any review.
Of course, at the same time we're getting dozens of alerts a week about services deployed open to the Internet without authentication and full of outdated vulnerable libraries (LLMs will happily add two- or three-year-old dependencies to your lockfiles).
duskdozer•Mar 24, 2026
Set the AIs off on those alerts and look at how many more alerts per week are now getting solved due to AI!
Lines of code are meaningful when taken in aggregate and useless as a metric for an individual’s contributions.
COCOMO, which considers lines of code, is generally accepted as being accurate (enough) at estimating the value of a software system, at least as far as how courts (in the US) are concerned.
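For concreteness, basic COCOMO is just a power law on KLOC; a sketch using the published organic-mode constants (a = 2.4, b = 1.05), with 50 KLOC as an arbitrary example project size:

```shell
# Basic COCOMO, organic mode: effort (person-months) = 2.4 * KLOC^1.05
# 50 KLOC is an arbitrary example, not a figure from the thread
awk -v kloc=50 'BEGIN { printf "%.1f person-months\n", 2.4 * kloc^1.05 }'
# prints "145.9 person-months"
```

The exponent barely exceeds 1, so the estimate is close to linear in LOC, which is exactly why padding the line count inflates the "value".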
> Lines of code are meaningful when taken in aggregate
The linked article does not demonstrate this. It establishes no causal link. One can obviously bloat LOC to an arbitrary degree while maintaining feature parity. Very generously, assuming good-faith participants, it might reflect a kind of average human efficiency within the fixed environment of the time.
Carrying the conclusions of this study from the 80s into the LLM age is not justified scientifically.
sarchertech•Mar 23, 2026
No one has any idea how to estimate software value, so the idea that some courts in the US have used a wildly inaccurate system that considers LOC is so far away from evidence that LOC is useful for anything that I can’t believe you bothered including that.
LOC is essentially only useful to give a ballpark estimate of complexity, and even then only if you compare orders of magnitude, and only between similar programming languages and ecosystems.
It’s certainly not useful for AI generated projects. Just look at OpenClaw. Last I heard it was something close to half a million lines of code.
When I was in college we had a professor senior year who was obsessed with COCOMO. He required our final group project to be 50k LOC (he also required that we print out every line and turn it in). We made it, but only because we built a generator for the UI and made sure the generator was as verbose as possible.
brabel•Mar 24, 2026
They gave a widely accepted way to estimate value, and your counterargument is that it's inaccurate. Fine, but how can you be confident about that? I see only one way, which is for you to come up with a better method and then show that, by your better estimation, COCOMO is bad. Until you do that, your argument comes down to vibes.
Your example about OpenClaw works exactly against your own argument by the way: OpenAI acquired it for millions by all accounts.
sarchertech•Mar 24, 2026
COCOMO has been shown to be inaccurate numerous times. Google it. Here’s one result.
“A very high MMRE (1.00) indicates that, on average, the COCOMO model misses about 100% of the actual project effort. This means that the estimate generated by the model can be double or even greater than the actual effort. This shows that the COCOMO model is not able to provide estimates that are close to the actual value.”
No one in the industry has taken COCOMO seriously for nearly 2 decades.
>OpenClaw
1. OpenAI bought the vibes and the creator. Why would they buy the code? It’s open source.
2. You don’t seriously think OpenClaw needs half a million lines of code to provide the functionality it does do you?
Seriously just go look at the code. No one is defending that as being an efficient use of code.
> Lines of code are meaningful when taken in aggregate and useless as a metric for an individual’s contributions.
Yes, and in fact a lot of the studies that show the impact of AI on coding productivity get dismissed because they use LoC or PRs as a metric and "everyone knows LoC/PR counts is a BS metric." But the better designed of these studies specifically call this out and explicitly design their experiments to use these as aggregate metrics.
BoorishBears•Mar 23, 2026
> at least as far as how courts (in the US) are concerned.
That's an anti-signal if we're being honest.
post-it•Mar 23, 2026
I think that's a "looking under the lamp post because that's where the light is" metric.
I'm not sure most developers, managers, or owners care about the calculated dollar value of their codebase. They're not trading code on an exchange. By condensing all software into a scalar, you're losing almost all important information.
I can see why it's important in court, obviously, since civil court is built around condensing everything into a scalar.
renegade-otter•Mar 24, 2026
I am writing a book! I used AI to write 1 billion words this morning!
kqr•Mar 24, 2026
COCOMO estimates the cost of the software, not the value. The cost is only weakly correlated with value.
Sabu87•Mar 23, 2026
I'm also trying everything to learn how to use Claude; everything is so new and keeps upgrading.
scorpioxy•Mar 23, 2026
And the author has a blog post about burnout and anxiety. Maybe all of those things are related.
Working to the point of making yourself sick should not be seen as a mark of pride, it is a sign that something is broken. Not necessarily the individual, maybe the system the individual is in.
cocoa19•Mar 23, 2026
I’m glad I’m not the only one that noticed this is madness.
I find it crazy to build a complex system to juggle 10 different threads in your brain, including the complexity of the tool itself.
scuff3d•Mar 23, 2026
Multiple agents in parallel "working on different features" is where people lose me. I don't care how much friction you've eliminated from the loop, eventually that code has to be looked at. Trying to switch between 5 different feature branches and properly review the code, even with AI help, is going to eat up most if not all of the productivity improvements. The only way around it is to start pencil-whipping reviews.
renegade-otter•Mar 24, 2026
Yes. Your brain, your clear thinking and your focus are the ultimate scarce resource. Writing code is easy, but I review one large PR from a coworker, and I need a nap.
Claiming that you have "ten agents writing code at night" is not the flex you think it is. That's just a recipe for burnout and bad design decisions.
Stop running your agents and go touch grass.
r_lee•Mar 24, 2026
> I review one large PR from a coworker, and I need a nap.
feels like nowadays this is illegal and instead you should be running 50 agent swarms and be putting out 20 features an hour while reviewing the code via agents and .....
ugh.
groby_b•Mar 23, 2026
Here's the thing every discussion around this tries to weasel around: All else being equal, yes, more PRs is a signal of productivity.
It's not the only metric. But I'm more and more convinced that the people protesting any discussion of it are the ones who... don't ship a lot.
Of course it matters in what code base. What size PR. How many bugs. Maintenance burden. Complexity. All of that doesn't go away. But that doesn't disqualify the metric, it just points out it's not a one-dimensional problem.
And for a solo project, it's fairly easy to hold most of these variables relatively constant. Which means "volume went up" is a pretty meaningful signal in that context.
sarchertech•Mar 23, 2026
> All else being equal, yes, more PRs is a signal of productivity.
Yeah but all else isn’t equal, so unless you’re measuring a whole lot more than PRs it’s completely meaningless.
Even on a solo project, something as simple as I’m working with a new technology that I’m excited about is enough to drastically ramp up number of PRs.
anukin•Mar 23, 2026
Can you define what “all else” means here?
PRs or closed jira tickets can be a metric of productivity only if they add or improve the existing feature set of the product.
If a PR introduces a feature with 10 bugs in other features and I have my agent swarm fix those in 10-20 PRs in a week, my productivity and delivery have both taken a hit. If any of these features went to prod, I have lost revenue as well.
Shipping is not same as shipping correctly with minimal introduction of bugs.
groby_b•Mar 24, 2026
"All else equal" means that PR volume is a signal that needs to be read in context a number of other metrics, as well as qualitative feedback.
You're absolutely right that PRs fixing things that a previous PR broke is a negative. Same for PRs implementing work not needed, or driving up tech debt.
"You're productive because you have lots of PRs" is a mistake without that context. But so is "You produce very little PRs, but that's fine, we shouldn't look at volume".
It's not a performance metric. It is an indicator worth following up. And there's a lot of reflexive "bad metric" arguments blanket dismissing that indicator.
Does that help explain?
mememememememo•Mar 24, 2026
Number of integration tests might be a good metric (until you announce that it is the metric; then, like every other metric, including profit, it becomes useless!)
For profit failing as a metric, see: Enron.
SpicyLemonZest•Mar 24, 2026
The problem is that these caveats, while tolerable in some contexts, make the metric impossible to interpret for something like Claude Code which is (I agree!) a huge change in how most software is developed.
If you mostly get around on your feet, distance traveled in a day is a reasonable metric for how much exercise you got. It's true that it also matters how you walk and where you walk, but it would be pretty tedious to tell someone that a "3 mile run" is meaningless and they must track cardiovascular health directly. It's fine, it works OK for most purposes, not every metric has to be perfect.
But once you buy a car, the metric completely decouples, and no longer points towards your original fitness goals even a tiny bit. It's not that cars are useless, or that driving has a magic slowdown factor that just so happens to compensate for your increased distance travelled. The distance just doesn't have anything to do with the exercise except by a contingent link that's been broken.
groby_b•Mar 24, 2026
> But once you buy a car, the metric completely decouples, and no longer points towards your original fitness goals even a tiny bit.
True, but if what you care about is "how quickly and safely can I reach a given goal", distance traveled over time is a great initial indicator, and accident rate will help illuminate.
The question "does AI help me move faster towards a goal, at the same quality standard", is relatively easy to judge in a solo project. As long as you verify equivalent standards, and don't play in an area you don't know at least - folks have a pretty clear understanding of their own productivity if it's a familiar thing.
Rover222•Mar 24, 2026
Of course lines of code is a meaningful metric. It's not like the author said it's the ONLY meaningful metric.
sailfast•Mar 24, 2026
It’s not meaningless - it just shouldn’t be held up as the only thing. Sometimes having a couple proxies is Ok as long as you also look at value in other ways. /shrug
conwy•Mar 24, 2026
FWIW, I've been using AI, but instead of "max # of lines/commits", I'm optimising for "min # of pr comments/iterations/bugs". My goal is to end up with less/simpler code and more/bigger impact. The real goal is business value, and ultimately human value. Optimise for that, using AI where it fits.
Along those lines, some techniques I've been dabbling in:
1. Getting multiple agents to implement a requirement from scratch, then combining the best ideas from all of them with my own informed approach.
2. Gathering documentation (requirements, background info, glossaries, etc), targeting an Agent at it, and asking carefully selected questions for which the answers are likely useful.
3. Getting agents to review my code, abstracting review comments I agree with to a re-usable checklist of general guidelines, then using those guidelines to inform the agents in subsequent code reviews. Over time I hope this will make the code reviews increasingly well fitted to the code base and nature of the problems I work on.
kaashif•Mar 24, 2026
The Goodhart's law effect there seems obvious - rather than code getting better, you might just become less rigorous in your reviews and stop commenting as much. You may not even realize your standards are dropping.
SOLAR_FIELDS•Mar 24, 2026
To me commit volume and similar metrics are something that indicate ai adoption, nothing more. And for a lot of people right now that is the goal - however short or long sighted that it might be.
tomasz-tomczyk•Mar 23, 2026
I've been doing a lot of parallel work and it can be draining. It feels exciting to have 6 agents spinning on things, but unless you have very well scoped plans, you need to still check in frequently.
If you have the tokens for it, having a team of agents checking and improving on the work does help a lot and reduces the slop.
paganel•Mar 23, 2026
> The PR descriptions are more thorough than what I’d write
Why do people do this? Why do they outsource something that is meant to be written by a human, so that another human can actually understand what the first human wanted to do? It just doesn't make sense.
paulhebert•Mar 23, 2026
Yeah I agree.
We have “Cursor Bot” enabled at work. It reviews our PRs (in addition to a human review)
One thing it does is add a PR summary to the PR description. It’s kind of helpful since it outlines a clear list of what changed in code. But it would be very lacking if it was the full PR description. It doesn’t include anything about _why_ the changes were made, what else was tried, what is coming next, etc.
ytoawwhra92•Mar 23, 2026
Same reason they outsource writing their blog posts.
This weird notion that the purpose of the thing is the thing itself, not what people get out of the thing. Tracks completely that a person counts their number of commits and thinks that shows how productive they are (while acknowledging that it's a poor metric and just shrugging).
godd2•Mar 24, 2026
> Why do they outsource something that is meant to have been written by a human
Says who? The point of the summary is so that I don't have to go look at the diff and figure out what happened.
piva00•Mar 24, 2026
The point of the summary is also to explain "why" something was done. Most Claude-generated PR descriptions I've been seeing go through the "what" and "how", but if the human-in-the-loop didn't care to precisely describe the "why", it is just an English version of the changes made in the code. I can just read the code for that; give me the reasons behind the diff and I'm a happy camper.
nzach•Mar 24, 2026
If you have a large PR the existence of a good summary on "what" changed can help you to make a better review.
But I agree with you, when reading PR descriptions and code comments I want a "why" not a "what". And that is why I think most LLM-generated documentation is bad.
neilkakkar•Mar 24, 2026
This is exactly why I use a custom skill - I can tell it what to focus on, I can give it an ill-formatted blurb of why I'm making the changes, and it will format it nicely and add more context based on the changes.
Most of the time, the PR descriptions it generates for me are great.
I think the issue is you're assuming it's always poor output, which isn't the case. I'm on a much smaller team than you'd expect, so the why is talked about sync more often than not, and it becomes less of a problem.
dakiol•Mar 23, 2026
I don't understand the "being more productive" part. Like, sure, LLMs make us iterate faster but our managers know we're using them! They don't naively think we suddenly became 10x engineers. Companies pay for these tools and every engineer has access to them. So if everyone is equally productive, the baseline just shifted up... same as always, no?
Mentioning LLM usage as a distinction is like bragging about using a modern compiler instead of writing assembly. Yeah, it's faster, but so is everyone else's code...
Besides, I wouldn't brag about being more productive with LLMs because it's a double-edged sword: it's very easy to use them, and nobody is reviewing all the lines of code you are pushing to prod (really, when was the last time you reviewed an AI-generated PR that changed 20+ files and added/removed thousands of lines of code?), so you don't know the long game of your changes; they seem to work now, but who knows how it will turn out later?
bluelightning2k•Mar 23, 2026
Sometimes outcomes and achievements and work product are useful beyond just... stack ranking yourself against your peers. Seems so odd to me that this is your mentality unless you're earlier in your career.
dakiol•Mar 23, 2026
Fair enough. I've been in software more than I would like to admit. And the more I'm in, the less I care about achievements in a work environment. All I care about is that the company pays me every month, because companies don't care about me (they care about my outcome per hour/week/month). So it's essential to rank yourself high against your peers (ethically and the like, ofc), otherwise you are out in the next layoff. I know not every company is like this, but the vast majority of tech companies are.
Outside of work, yeah, everything is fine and there's nothing but the pure pursuit of knowledge and joy.
exogenousdata•Mar 23, 2026
All companies are like this. Some just have better HR/PR.
renegade-otter•Mar 24, 2026
People would really be better off seeing themselves as mercenaries with health benefits. You are nothing more. You learn, you make friends, but your job is ephemeral. Do it, but don't get attached TO it.
adamtaylor_13•Mar 24, 2026
The key there is "vast majority of tech companies". And I agree with you.
I think the next big movement in tech will be ALL companies becoming tech companies. Right now there are hundreds of thousands of "small" companies with big enough budgets to pay for a CTO to modernize their stack and lead them into the 21st century.
The problem is they don't know they have this problem and so they aren't actively hiring for a CTO. You've got to go find them and insert yourself as the solution.
layer8•Mar 23, 2026
Usually hedonic adaptation ends up catching up, and then it’s just the new baseline.
kqr•Mar 24, 2026
> like bragging about using a modern compiler instead of writing assembly.
Yet people look at me like I'm the odd one out when I say I am more productive with a modern compiler like GHC.
ayhanfuat•Mar 23, 2026
I don't know if I am just in an unlucky A/B assignment or anything, but I really don't understand people juggling multiple agent sessions. For me, the performance of Opus 4.6 High went from unbelievable to mediocre. And this keeps happening, making the whole agentic coding thing very unreliable and frustrating. I do use it, but I have to babysit, and I get overwhelmed even with a single session.
keybored•Mar 23, 2026
As an outsider it seems like agentic coders get buried in the weeds of running agents in parallel and churning out commits. (Even after a sheepish “commits are a bad metric but”) And every week there is a new orchestration, something, who even cares.
Is that the end game? Well why can’t the agents orchestrate the agents? Agents all the way down?
The whole agent coding scene seems like people selling their soul for very shiny inflatable balloons. Now you have twelve bespoke apps tailored for you that you don’t even care about.
dakiol•Mar 23, 2026
Honest question: if you're using multiple agents, it's usually not to produce a dozen lines of code. It's to produce a big enough feature spanning multiple files, modules and entry points, with tests and all. So far so good. But once that feature is written by the agents... wouldn't you review it? Like reading line by line what's going on and detecting if something is off? And wouldn't that part, the manual reviewing, take an enormous amount of time compared to the time it took the agents to produce it? (You know, it's more difficult to read other people's/machine code than to write it yourself.) Meaning all the productivity gained is thrown out the door.
Unless you don't review every generated line manually, and instead rely on, let's say, UI e2e testing, or perhaps unit testing (that the agents also wrote). I don't know, perhaps we are past the phase of "double check what agents write" and are now in the phase of "ship it. if it breaks, let agents fix it, no manual debugging needed!" ?
Salgat•Mar 23, 2026
This is the biggest bottleneck for me. What's worse is that LLMs have a bad habit of being very verbose and rewriting things that don't need to be touched, so the surface area for change is much larger.
cyanydeez•Mar 23, 2026
It's kind of weird; I jumped on the vibe coding opencode bandwagon, but using a local 395+ w/128 and qwen coder. Now, it takes a bit to get the first tokens flowing, and the cache works well enough to keep it going, but it's not fast enough to just set it and forget it, and it's clear when it goes in an absurd direction and either deviates from my intention or simply loads some context where it should have followed a pattern, whatever.
I'm sure these larger models are both faster and more cogent, but it's also clear that what matters is managing their side tracks and cutting them short. Then I started seeing the deeper problematic pattern.
Agents aren't there to increase the multifactor of production; their real purpose is to shorten context to manageable levels. In effect, they're basically trying to reduce the odds of longer-context poisoning.
So, if we boil down the probability of any given token triggering the wrong subcontext, it's clear that the greater the context, the greater the odds of a poison substitution.
Then that's really the problematic issue every model is going to contend with because there's zero reality in which a single model is good enough. So now you're onto agents, breaking a problem into more manageable subcontext and trying to put that back into the larger context gracefully, etc.
Then that fails, because there's zero consistent determinism, so you end up at the harness, trying to herd the cats. This is all before you realize that these businesses can't just keep throwing GPUs at everything, because the problem isn't compute-bound; it's contextual/DAG-limited the same way a brain is limited.
We all got intelligence and use several orders of magnitude less energy, doing mostly the same thing.
sheept•Mar 24, 2026
Not only that, but LLMs do a disservice to themselves by writing verbose code and decorating lines with redundant comments, which wastes their context the next time they work with it.
bluGill•Mar 24, 2026
I have had good luck asking my agent 'now review this change: is it a good design, does it solve the problem, are there excessive comments, is there anything else a reviewer would point out?' I'm still working on what prompt to use, but that is about right.
mohsen1•Mar 24, 2026
I highly recommend adding `/simplify` to your workflow. It walks back over-engineerings quite often for me.
browningstreet•Mar 23, 2026
I use coding agents to produce a lot of code that I don’t ship. But I do ship the output of the code.
Leynos•Mar 23, 2026
Here's what I suggest:
Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
Enforce single responsibility, CQRS, domain segregation, etc. Make the code as easy for you to reason about as possible. Enforce domain naming and function/variable naming conventions to make the code as easy to talk about as possible.
Use code review bots (Sourcery, CodeRabbit, and CodeScene). They catch the small things (violations of contract, antipatterns, etc.) and the large (UX concerns, architectural flaws, etc.).
Go all in on linting. Make the rules as strict as possible, and tell the review bots to call out rule subversions. Write your own lints for the things the review bots are complaining about regularly that aren't caught by lints.
Use BDD alongside unit tests, read the .feature files before the build and give feedback. Use property testing as part of your normal testing strategy. Snapshot testing, e2e testing with mitm proxies, etc. For functions of any non-trivial complexity, consider bounded or unbounded proofs, model checking or undefined behaviour testing.
I'm looking into mutation testing and fuzzing too, but I am still learning.
Pause for frequent code audits. Ask an agent to audit for code duplication, redundancy, poor assumptions, architectural or domain violations, TOCTOU violations. Give yourself maintenance sprints where you pay down debt before resuming new features.
The beauty of agentic coding is, suddenly you have time for all of this.
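One cheap way to act on the "write your own lints" point in the list above is a script over Python's `ast` module. This is a hedged sketch, not anything from the comment itself: the bare-`except` rule and the function name are made up for illustration, standing in for whatever pattern your review bots keep flagging.

```python
import ast

def find_bare_excepts(source: str) -> list[int]:
    """Return the line numbers of bare `except:` handlers in source code.

    A bare except is an ExceptHandler node whose `type` is None.
    """
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]

snippet = """
try:
    risky()
except:
    pass
"""
print(find_bare_excepts(snippet))  # prints [4]
```

Wiring something like this into CI (fail the build when the list is non-empty) turns a recurring review comment into an automatic check.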
dominotw•Mar 23, 2026
> Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
I feel like I am a bit stupid to not be able to do this. My process is more iterative. I start working on a feature, then I discover some other function that's slightly related, go refactor it into common code, then proceed with the original task. Sometimes I stop midway and see if this can be done with a library somewhere and go look at examples. I take many detours like these. I am never working on a single task like a robot; I don't want Claude to work like that either. That seems so opposite of how my brain works.
What am I missing?
Leynos•Mar 24, 2026
Again, here's what works for me.
When I get an idea for something I want to build, I will usually spend time talking to ChatGPT about it. I'll request deep research on existing implementations, relevant technologies and algorithms, and a survey of literature. I find NotebookLM helps a lot at this point, as does Elevenreader (I tend to listen to these reports while walking or doing the dishes or what have you). I feed all of those into ChatGPT Deep Research along with my own thoughts about the direction of the system, and ask it to produce a design document.
At the moment, I use Opus or GPT-5.4 on high to generate those plans, and Sonnet or GPT-5.4 medium to implement.
The roadmap and the design are definitely not set in stone. Each step is a learning opportunity, and I'll often change the direction of the project based on what I learn during the planning and implementation. And of course, this is just what works for me. The fun of the last few months has been everyone finding out what works for them.
hirvi74•Mar 24, 2026
You seem to work a lot like how I do. If that is being stupid, then well, count me in too. To be honest, if I had to go through all the work of planning, scope, escalation criteria, etc., then I would probably be better off just writing the damn code myself at that point.
dominotw•Mar 24, 2026
I see lots of posts, like Stripe's minion, where they just type a feature into Slack chat and an agent goes and does it. That doesn't make any sense to me.
bmurphy1976•Mar 24, 2026
Can't upvote you enough. This is the way. You aren't vibe coding slop; you have built an engineering process that works even if the tools aren't always reliable. This is the same way you build out a functioning and highly effective team of humans.
The only obvious bit you didn't cover was extensive documentation including historical records of various investigations, debug sessions and technical decisions.
bluGill•Mar 24, 2026
Documentation is only useful if it is read. I have found it impossible to get many humans to read the documentation I write.
dominotw•Mar 24, 2026
Building a fancy-looking process doesn't mean the output isn't slop. Vibecoders on reddit have even more insane "engineering" processes.
The parent comment has all of these.
And here I am, just drawing diagrams on a whiteboard and designing UI in Balsamiq.
dominotw•Mar 24, 2026
You are probably shipping, so that puts you ahead of most people still setting up their perfect process.
sigotirandolas•Mar 24, 2026
To be devil's advocate:
Many of those tools are overpowered unless you have a very complex project that many people depend on.
The AI tools will catch the most obvious issues, but will not help you with the most important aspects (e.g. whether your project is useful, or the UX is good).
In fact, having this complexity from the start may kneecap you (the "code is a liability" cliché).
You may be "shipping a lot of PRs" and "implementing solid engineering practices", but how do you know if that is getting closer to what you value?
How do you know that this is not actually slowing you down?
piva00•Mar 24, 2026
It depends a lot on what kind of company you are working at, for my work the product concerns are taken care by other people, I'm responsible for technical feasibility, alignment, design but not what features should be built, validating if they are useful and add value, etc., product people take care of that.
If you are solo or in a small company you apply the complexity you need, you can even do it incrementally when you see a pattern of issues repeating to address those over time, hardening the process from lessons learnt.
Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.
I don't think there's a hard set of rules that can be applied broadly, the engineering job is to also find technical approaches that balance both needs, and adapt those when circumstances change.
sigotirandolas•Mar 24, 2026
On the one side I reject that product and engineering concerns are separated: Sometimes you want to avoid a feature due to the way it will limit you in the future, even if the AI can churn it in 2 minutes today.
On the other side, perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, the speed/quality balance, morale, etc., but it surely suffers the effects of them.
I suspect that unless we get fully automated engineering / AGI soon, companies that value engineers with good taste will thrive, while those that double down into "ticket factory" mode will stagnate.
piva00•Mar 24, 2026
> On the one side I reject that product and engineering concerns are separated: Sometimes you want to avoid a feature due to the way it will limit you in the future, even if the AI can churn it in 2 minutes today.
That is exactly not what I meant, I'm sorry if it wasn't clear but your assumption about how my job works is absolutely wrong.
I even mention that the product discussion is separate only on "how to wrangle these tools":
> Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.
Delivering value, which means also avoiding a feature that will limit or entrap you in the future.
> On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.
We do measure those and are quite strict about it, most of my design documents are about the trade-offs in all of those dimensions. We are very critical about proposals that don't consider future impacts over time, and mostly reject workarounds unless absolutely necessary (and those require a phase-out timeline for a more robust solution that will be accounted for as part of the initiative, so the cost of the technical debt is embedded from the get-go).
I believe I wasn't clear and/or you misunderstood what I said, I agree with you on all these points, and the company I work for is very much in opposite to a "ticket factory". Work being rejected due to concerns for the overall impact cross-boundaries on doing it is very much praised, and invited.
My comment was focused on how to wrangle these tools for engineering purposes being a separate discussion to the product/feature delivery, it's about tool usage in the most technical sense, which doesn't happen together with product.
We on the engineering side determine how to best apply these tools for the product we are tasked on delivering, the measuring of value delivered is outside and orthogonal to the technical practices since we already account for the trade-offs during proposal, not development time. This measurement already existed pre-AI and is still what we use to validate if a feature should be built or not, its impact and value delivered afterwards, and the cost of maintaining it vs value delivered. All of that includes the whole technical assessment as we already did before.
Determining if a feature should be built or not is ultimately a pairing of engineering and product, taking into account everything you mentioned.
Determining the pipeline of potential future non-technical features at my job is not part of engineering, except for side-projects/hack ideas that have potential to be further developed as part of the product pipeline.
sigotirandolas•Mar 24, 2026
Sorry, I think you're right that I misinterpreted your comment. I still had in mind OP's example (BDD, mutational testing, all that jazz). I apologize!
Reading your comment, it looks like you work for a pretty nice company that takes those things seriously. I envy you!
My concern was that for companies unlike yours, which don't have well-established engineering practices, it _feels_ like with AI you can go much faster, and in fact it's a great excuse to dismantle any remaining practices. But in reality they're either doing busywork or building the wrong thing. My guess is that those companies are going to learn this is a bad idea in the future, when they already have a mess to deal with.
To put what I mean into perspective... if you browse OP's profile you can find absolutely gigantic PRs like https://github.com/leynos/weaver/pull/76. I can not review any PR like that in good faith, period.
MattGaiser•Mar 23, 2026
Yep. In many cases I am just reviewing test cases it generated now.
> if it breaks, let agents fix it, no manual debugging needed!" ?
Pretty trivial to have every Sentry issue have an immediate first pass by AI now to attempt to solve the bug.
keeda•Mar 23, 2026
> you know, it's more difficult to read other people's/machine code than to write it yourself
Not at all, it's just a skill that gets easier with practice. Generally, if you're in a position to review a lot of PRs, you get proficient at it pretty quickly. It's even easier when you know the context of what the code is trying to do, which is almost always the case when e.g. reviewing your teammates' PRs or the code you asked the AI to write.
As I've said before (e.g. https://news.ycombinator.com/item?id=47401494), I find reviewing AI-generated code very lightweight because I tend to decompose tasks to a level where I know what the code should look like, and so the rare issues that crop up quickly stand out. I also rely on comprehensive tests and I review the test cases more closely than the code.
That is still a huge amount of time savings, especially as the scope of tasks has gone from single functions to entire modules.
That said, I'm not slinging multiple agents at a time, so while my throughput with AI is way higher than without, it's not nearly as high as in some credible reports I've heard. I'm not sure they personally review the code (e.g. they have agents review it?), but they do have strategies for correctness.
nprateem•Mar 24, 2026
I'll often run 4 or 5 agents in parallel. I review all the code.
Some agents will be developing plans for the next feature, but there can sometimes be up to 4 coding.
These are typically a mix between trivial bug fixes and 2 larger but non-overlapping features. For very deep refactoring I'll only have a single agent run.
Code reviews are generally simple since nothing of any significance is done without a plan. First I run the new code to see if it works. Then I glance at diffs and can quickly ignore the trivial var/class renames, new class attributes, etc leaving me to focus on new significant code.
If I'm reviewing feature A I'll ignore feature B code at this point. Merge what I can of feature A then repeat for feature B, etc.
This is all backed by a test suite I spot check and linters for eg required security classes.
Periodically we'll review the codebase for vulnerabilities (eg incorrectly scoped db queries, etc), and redundant/cheating tests.
But the keys to multiple concurrent agents are plans where you're in control ("use the existing mixin", "nonsense, do it like this" etc) and non-overlapping tasks. This makes reviewing PRs feasible.
jwilliams•Mar 24, 2026
It’s a blend. There are plenty of changes in a production system that don’t necessarily need human review. Adding a help link. Fixing a typo. Maybe upgrades with strong CI/CD or simple ui improvements or safe experiments.
There are features you can ship safely behind feature flags or staged releases. As you push in, you find that with the right tooling it can be a lot.
If you break it down often quite a bit can be deployed safely with minimal human intervention (depends naturally on the domain, but for a lot of systems).
So many pretend they are more productive but so few are able to articulate what they actually produced.
Some say features. Well, are they used? Are they beneficial in any way for our society or humanity? Or are we producing junk for the sake of producing?
jwpapi•Mar 23, 2026
I have a little ai-commit.sh wired up as "send" in package.json which describes my changes and commits. Formatting has been solved by linters already. Neither my approach nor OP's is ground-breaking, but I think mine is faster; you can also run !p send (p is an alias for pnpm) from inside claude, no need for it to make a skill and create overhead.
Thinking about it, a PR skill is pretty much an antipattern; even telling the AI to just create a PR is faster.
I think some vibe coders should let AI teach them some CLI tooling.
neilkakkar•Mar 24, 2026
OP here. I disagree; it's great to have a skill for cases where you have extra steps and want the agent to run some verification steps before making a PR. It's called making a PR, but it's not _just_ running the gh CLI to make a PR.
It checks if I'm in a worktree, renames branches accordingly, adds a Linear ticket if provided, and generates a proper PR summary.
I'm not optimising for how fast the PR is created; I want it to do the menial steps I used to do.
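Those menial steps are mostly string assembly around the agent's summary. As a rough sketch of what a skill like this delegates (the function name, section headings, and the Linear field are hypothetical illustrations, not the actual skill):

```python
def build_pr_body(why, changed_files, ticket=None):
    """Assemble a PR description from a rough 'why' blurb, the list of
    changed files, and an optional Linear ticket reference."""
    lines = ["## Why", why.strip(), "", "## Changes"]
    # One bullet per changed file, sorted for a stable diff-friendly order.
    lines += ["- `{}`".format(path) for path in sorted(changed_files)]
    if ticket:
        lines += ["", "Linear: {}".format(ticket)]
    return "\n".join(lines)

body = build_pr_body(
    "fix flaky retry logic",
    ["src/retry.py", "tests/test_retry.py"],
    ticket="ENG-123",
)
```

In the real workflow the agent fills in the "why" and summary text; the skill's value is that the branch checks, ticket linking, and formatting happen the same way every time.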
imiric•Mar 23, 2026
> The PR descriptions are more thorough than what I’d write, because it reads the full diff and summarises the changes properly. I’d gotten so used to the drudgery that I’d stopped noticing it was drudgery.
Who are you creating PR descriptions for, exactly? If you consider it "drudgery", how do you think your coworkers will feel having to read pages of generic "AI" text? If reviewing can be considered "drudgery" as well, can we also offload that to "AI"? In which case, why even bother with PRs at all? Why are you still participating in a ceremony that was useful for humans to share knowledge and improve the codebase, when machines don't need any of it?
> My role has changed. I used to derive joy from figuring out a complicated problem, spending hours crafting the perfect UI. [...] What’s become more fun is building the infrastructure that makes the agents effective. Being a manager of a team of ten versus being a solo dev.
Yeah, it's great that you enjoy being a "manager" now. Personally, that is not what I enjoy doing, nor why I joined this industry.
Quick question: do you think your manager role is safe from being automated away? If machines can write code and prose now better than you, couldn't they also manage other machines into producing useful output better than you? So which role is left for you, and would you enjoy doing it if "manager" is not available?
Purely rhetorical, of course, since I don't think the base premise is true, besides the fact that it's ignoring important factors in software development such as quality, reliability, maintainability, etc. This idea that the role of an IC has now shifted into management is amusing. It sounds like a coping mechanism for people to prove that they can still provide value while facing redundancy.
neilkakkar•Mar 24, 2026
I think you'd like this post I wrote: https://neilkakkar.com/agentic-debt.html , parts of why I think we wouldn't get automated away just yet. It might be true eventually - and when it does happen, I'm sure I'll find something else to do, most probably up the stack. Managing for now seems like a terrible task for agents. I need to guide them to the right solution.
_Parts_ of what I write are drudgery, which gets automated away. The "why" we talk about in sync, so it's much less of an issue in general.
When I say management, I mean more like a staff engineer or a tech lead, rather than a traditional manager.
SeriousM•Mar 23, 2026
> Fast rebuilds and automated previews made another friction visible: I could only comfortably work on one thing at a time.
Oh really? I enjoy doing one thing at the time, with focus.
AI, as you're using it OP, isn't making you faster; it is making you work more for the same amount of money. You burn yourself out for no reason.
m000•Mar 23, 2026
I'm very sceptical on how well AI can "read the full diff and summarise the changes properly".
A colleague has been using Claude for this exact purpose for the past 2-3 months. Left alone, Claude just kept spewing spammy, formulaic, uninteresting summaries. E.g. phrases like "updated migrations" or "updated admin" were frequent occurrences for changes in our Django project. On the other hand, important implementation choices were left undocumented.
Basically, my conclusion was that, for the time being, Claude's summaries aren't worthy of inclusion in our git log. They missed most things that would make the log message useful, and included mostly stuff that Claude could generate on demand at any time. I.e. spam.
piva00•Mar 24, 2026
Same experience here, I see many people in the company (5-10k employees) pushing commits with Claude-generated comments that are absolutely useless.
I got praised for my commit messages by another team, they asked me how I was making Claude generate them, and I had to tell them I'm just not using Claude for that.
I like writing my own commit messages because it helps me as well, I have to understand what was done and be able to summarise it, if I don't understand quickly enough to write a summary in the commit message it means something can be simplified or is complex enough to need comments in the code.
skydhash•Mar 24, 2026
> /git-pr removed the friction of formatting - turning code changes into a presentable PR.
What I want from a PR is what's not in the patch, especially the end goal of the PR, or the reasoning for the solution represented by the changes.
> SWC removed the friction of waiting - the dead time between making a change and seeing it.
Not sure how that relates to Claude Code.
> The preview removed the friction of verifying changes - I could quickly see what’s happening.
How Claude is "verifying" UI changes is left very vague in the article.
> The worktree system removed the friction of context-switching - juggling multiple streams of work without them colliding.
Ultimately, there's only one (or two) main branches. All those changes need to be merged back together again, and they need to be reviewed. Not sure how collisions and conflicts are miraculously solved.
chadcmulligan•Mar 24, 2026
Maybe OT - I find Claude Code hit or miss, I spend a lot of time removing dumb code or asking Claude to remove it eg "why do you have a separate..." Claude: "Good catch — there's no real reason...." and so on.
Where I find it incredible - learning new things, I recently started flutter/dart dev - I just ask Claude to tell me about the bits, or explaining things to me, it's truly revolutionary imho, I'm building things in flutter after a week without reading a book or manual. It's like a talking encyclopaedia, or having an expert on tap, do many people use it like this? or am I just out of the loop, I always think of Star Trek when I'm doing it. I architected / designed a new system by asking Claude for alternatives and it gave me an option I'd never considered to a problem, it's amazing for this, after all it's read all the books and manuals in the world, it's just a matter of asking the right questions.
AtlasBarfed•Mar 24, 2026
I've done a couple of exploratory learning sessions with AIs, and wow, could it help with learning.
Imo we may be messing up the economy with AIs. They should be engineering better workers, not being employed to make one person do the work of three poorly.
The power of AIs to smooth learning and raise expertise, rather than replace it, should be the adaptation goal. Obviously AIs as work assistants are powerful, but all the bullshitting CEOs overselling AI are really damaging at the whole-economy level.
Particularly because the current marketing leads to the next generation abandoning roles that AI bullshitters claim are perfectly replaced.
It's like the urbanization demographic bomb on steroids.
chadcmulligan•Mar 24, 2026
I find myself worrying the AI bubble will pop and we'll lose this aspect of AI's without it ever being properly explored. Instead of doomscrolling now I find myself firing up claude and saying 'explain ... to me' and it proceeds to tell me all about it. I can ask it questions and it seems fairly right - at least right enough for me to proceed, it's way better at this than building code, in my experience anyway.
andyferris•Mar 24, 2026
When people say the "bubble will pop", it's meant in analogy to the dotcom era - businesses and investors lost money, but the internet (and its opportunities) didn't vanish.
Even open-weight local models are becoming good enough for teaching yourself quite a range of stuff, especially the beginner aspects. LLMs are not going to simply disappear because of a financial realignment. The worst thing might be not being able to access a super-duper frontier model for free?
holden_nelson•Mar 24, 2026
this is the only use case I'm super bullish on. And for this it is revolutionary. Agreed.
juped•Mar 24, 2026
Many people use it like this - this is playing to its strengths, rather than trying to work around its weaknesses. "What's the idiomatic X language way to do Y?" gets you a solid, useful answer in seconds.
But it's just a damn good tool, not the apocalypse/the thing that lets you finally fire everyone. So it kind of gets lost in the hype.
thegrim33•Mar 24, 2026
Ah, another pro-AI coding post written by someone whose livelihood depends on promoting/selling AI-assisted coding products. Color me shocked. And they used AI to write the post itself.
AuthAuth•Mar 24, 2026
Oh look someone over glazing AI and its usefulness. I hope this is a real person authentically sharing their opinion and not some AI startup guerrilla marketing.
whatthe12899•Mar 24, 2026
if you can't be bothered to write your own PR descriptions because it's drudgery, how can you expect others to read your (now-lengthier-because-AI) PR descriptions?
This is an honest question, as someone who is also now doing this.
Klaster_1•Mar 24, 2026
I recently switched to agent writing my PR and commit messages with skills that mimic me doing the same. Most of the time, it writes exactly what I'd write and if something is off, editing takes less time than writing from scratch.
throw_m239339•Mar 24, 2026
I don't even need to read that article; I can just ask Claude how I could be more productive with Claude.
breakingcups•Mar 24, 2026
You'd be getting practically the same result. If someone is too lazy to write their own commit messages they're definitely too lazy to write this blog post manually.
overgard•Mar 24, 2026
Is Anthropic raising funds again? I'm so sick of these thinly veiled advertisements.
Razengan•Mar 24, 2026
This. I can't have been the only one noticing this uncanny frequency of fluff specifically praising Claude instead of AI coding in general.
Now it's just becoming blatant
bdangubic•Mar 24, 2026
There is a big difference between Claude and "AI coding in general"
shevy-java•Mar 24, 2026
Guys - we lost another one to Skynet.
thunfischtoast•Mar 24, 2026
Call me incompetent, but I don't get it.
> I switched the build to SWC, and server restarts dropped to under a second.
In the links you provided, swc is the same entity.
TheRoque•Mar 24, 2026
Both links are the same, SWC in this context is probably Speedy Web Compiler. It transpiles really fast but doesn't do any type checks.
maleldil•Mar 24, 2026
> It transpiles really fast but doesn't do any type checks
What's the point of using it during development, then?
OscarDC•Mar 24, 2026
Transpilation is a necessary step here to test the application, because e.g. the browser won't be able to parse raw TypeScript code.
Type checking is not: the browser doesn't care about it; it's mainly there to help developers verify their code.
So, to speed up the build during development (for faster iterations), the idea is often to make the build process only about the build, removing "unnecessary" steps like type checking from it, while running a separate linting/type-checking process - which can even run in parallel, but isn't required to test the application.
This is often done by using a bundler (e.g. esbuild) or a transpiler (Babel, SWC) to erase the types without checking them in your bundling process.
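The split described here is commonly wired up as two separate npm scripts; a minimal sketch of a package.json fragment, assuming @swc/cli and typescript are installed (script names are illustrative):

```json
{
  "scripts": {
    "dev": "swc src -d dist --watch",
    "typecheck": "tsc --noEmit --watch"
  }
}
```

Running both in separate terminals (or via a runner like npm-run-all) keeps rebuilds fast while type errors still surface continuously, just decoupled from the build itself.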
michaelsalim•Mar 24, 2026
Pretty sure they're the same thing. The second link is on how to use swc with nestjs.
Havoc•Mar 24, 2026
> And like any good manager, you get to claim credit for all the work your “team” does.
Meanwhile in the real world the expectations shift to normalise the 10x and your boss wants to know why your output isn’t 12x like that of Max
I have been using Claude AI - not Claude Code - and it has greatly improved my productivity, too.
However, I agree with you that commits are a terrible (or an unreliable) metric; more commits do not necessarily equal higher productivity.
ulrikrasmussen•Mar 24, 2026
I think more people should focus on using LLMs to relieve cognitive load rather than parallelize and overload their brains. We need to learn to live with the fact that humans are not good at multi-tasking, and LLMs are not going to make us better at it.
I have started using Claude to develop an implementation plan, but instead of making Claude implement it and then have me spend time figuring out what it did, I simply tell it to walk me through implementing it by hand. This means that I actually understand every step of the development process and get to intervene and make different choices at the point of time where it matters. As opposed to the default mode which spits out hundreds of lines of code changes which overloads my brain, this mode of working actually feels like offloading the cognitive burden of keeping track of the implementation plan and letting me focus on both the details and the big picture without losing track of either one. For truly mechanical sub-tasks I can still save time by asking Claude to do them for me.
wouldbecouldbe•Mar 24, 2026
Some of us love it - a bit intense sometimes, but fun. So I guess we get to decide for ourselves what we prefer.
I know many will then say, BUT QUALITY, but if you learn to deal with your own and Claude's quirks, you also learn how to validate & verify more efficiently. And experience helps here.
nzach•Mar 24, 2026
I've been using a POC-driven workflow for my agentic coding.
What I do is use the LLM to ask a lot of questions to help me better understand the problem. After I have a good understanding, I jump in and code the core of the solution by hand. With this core work finished (keep in mind that at this point the code doesn't even need to compile), I fire up my LLM and say something like: "I need to do X. Uncommitted in this repo we have a POC for how we want to do it. Create and implement a plan for what we need to do to finish this feature."
I think this is a good model because I'm using the LLM for the thing it is good at: "reading through code and explaining what it does" and "doing the grunt work". While I do the hard part of actually selecting the right way of solving a problem.
kajkojednojajko•Mar 24, 2026
I love this idea! I'll try it today.
This resonates with me because I've been looking for a way to detect when I would make a different decision than the LLM. These divergence points generally happen because I'm thinking about future changes as I code, and the LLM just needs to pick something to make progress.
Prompts like "list your assumptions and do not write any code yet" help during planning. I've been experimenting with "list the decisions you've made during implementation that were not established upfront in the plan" after it makes a change, before I review it, because when eyeballing the diff alone, I often miss subtle decisions.
Thanks for sharing the suggestion to slow it down and walk the forking path with the LLM :)
neilkakkar•Mar 24, 2026
Hello! OP here, a lot of comments have this common theme of wondering if this is overloading / context switching / the brain thrashing.
Helped me surface an important distinction on why it doesn't really happen for me. I think there's three parts to it:
1. I work on only one thing at a time, and try to keep chunks meaty
2. I make sure my agents can run a lot longer so every meaty chunk gets the time it deserves, and I'm not babysitting every change in parallel, that would be horrible! (how I do this is what this post focuses on)
3. New small items that keep coming up / bug fixes get their own thread in the middle of the flow when they do come up, so I can fire and forget, come back to it when I have time. This works better for me because I'm not also thinking about these X other bugs that are pending, and I can focus on what I'm currently doing.
What I had to figure out was how to adapt this workflow to my strengths (I love reviewing code and working on one thing at a time, but also get distracted easily). For my trade-offs, it was ideal to offload context to agents whenever a new thing pops up, so I continue focusing on my main task.
The # of PRs might look huge (and they are to me), but I'm focusing on one big chonky thing a day, the others are smaller things, which together mean progress on my product is much faster than it otherwise would be.
Nevermark•Mar 24, 2026
The amount of code changes I find acceptable, to simplify and shrink my code base, is now almost unbounded.
Overstating things of course. But paying off technical debt never felt so good. And the expected decrease in forward friction has never been so achievable so quickly.
mulr00ney•Mar 24, 2026
>The time saved matters, but the real unlock was the mental overhead removed. Every PR used to be a small context switch: stop thinking about the code, start thinking about how to describe the code. Now I type /git-pr and move on to the next thing.
This one's interesting to me. For a lot of my career, the act of writing the PR is the last sanity check that surfaces any weirdness or my own misgivings about my choices. Sometimes there would be code that felt natural when I was writing it and getting the feature working, and maybe that code survived my own personal round of code review... but having to write about it in plain english for the benefit of someone doing review with less context was a useful spot to do some self-reflection.
Yeah, basically Claude generates tickets that are heavy on the 'how' and light on the 'why', and I think it should be the other way around, for multiple reasons, but I'm already long-winded.
Is that how it works? Do managers claim credit for the work of those below them, despite not doing the work?
I hope they also get penalised when a lowly worker does a bad thing, even if the worker is an LLM silently misinterpreting a vague instruction.
When things go south, nobody is penalised. A simple "post-mortem" is written in Confluence and people write "action items". So, yeah, no need for the manager to take the blame.
It's all very shitty, but it's always been like that.
Yeah the buck stops with the manager (IMO the direct manager). So I can do some constructive criticism with my dev if they make a mistake, but it's my fault in the larger org that it happened. Then it's my manager's job to work with me to make sure I create the environment where the same mistake doesn't happen again. Am I training well? Am I giving them well-scoped work? All that.
Solving new problems is a thing engineers get to do constantly, whereas building an agent infrastructure is mostly a one-ish time thing. Yes, it evolves, but I worry that once the fun of building an agentic engineering system is done, we’re stuck doing arguably the most tedious job in the SDLC, reviewing code. It’s like if you were a principal researcher who stopped doing research and instead only peer reviewed other people’s papers.
The silver lining is if the feeling of faster progress through these AI tools gives enough satisfaction to replace the missing satisfaction of problem-solving. Different people will derive different levels of contentment from this. For me, it has not been an obvious upgrade in satisfaction. I’m definitely spending less time in flow.
but a chart of commits/contribs is such a lousy metric for productivity.
It's about on par with the ridiculousness of LOC implying code quality.
And it's not like I'm blindly committing LLM output. I often write everything myself because I want to understand what I'm doing. Claude often comments that my version is better and cleaner. It's just that the tasks seemed so monumental I felt paralyzed and had difficulty even starting. Claude broke things down into manageable steps that were easy to do. Having a code review partner was also invaluable for a solo hobbyist like me.
That said, by the time I'm happy with it, all the AI stuff outside very boilerplate ops/config work has been rewritten and refined. I just find it quite helpful for getting over that initial hump, from "I have nothing but a dream" to "I have a thing that compiles but is terrible". Once I can compile it, I can refine it, which is where my strengths lie.
Every comment I make is a "really perceptive observation" according to Claude and every question I ask is either "brilliant" or at least "good", so...
I'm taking the time to understand what it is proposing. I'm pushing back and asking for clarifications. When I implement things, I do it myself in my own way. I experienced a huge increase in my ability to make the cool stuff I've always wanted to make even in spite of this.
I can't even fathom how productive the people who have Claude Code cranking out features on multiple git worktrees in parallel must be. I wouldn't do that in my personal projects but I can totally understand doing that in a job setting.
Lately it's been praising me much more for correcting it, quite annoying to be honest, it's just a clanker, I want it to act like a non-human clanker instead of playing theater with me...
Most effective engineers on the brownfield projects I've worked on, usually deleted more LOC than they've added, because they were always looking to simplify the code and replace it with useful (and often shorter) abstractions.
Especially in brownfield settings, if you do use CC, you really should be spending something like a day refactoring the code for every 15 minutes of work it spends implementing new functionality. Otherwise the accumulation of technical debt will make the code base unworkable by both human and claude hands in a fairly short time.
I think overall it can be a force for good, and a source of high quality code, but it requires a significant amount of human intervention.
Claude Code operating on unsupervised Claude code fairly rapidly generates a mess not even Claude Code can decode, resulting in a sort of technical debt Kessler syndrome, where the low quality makes the edits worse, which makes the quality worse, rinse and repeat.
The assumption behind this workflow is that Claude Code can complete tasks with little or no oversight.
If the flow looks like review->accept, review->accept, it is manageable.
In my personal experience, claude needs heavy guidance and multiple rounds of feedback before arriving at a mergeable solution (if it does at all).
Interleaving many long running tasks with multiple rounds of feedback does not scale well unfortunately.
I can only remember so much, and at some point I spend more time trying to understand what has been done so far to give accurate feedback than actually giving feedback for the next iteration.
I'm so conflicted about this. On the one hand I love the buzz of feeling so productive and working on many different threads. On the other hand my brain gets so fried, and I think this is a big contributor.
I have nothing to back up the idea though.
I also have nothing to back it up, but it fits my mental models. When juggling multiple things as humans, it eats up your context window (working memory). After a long day, your coherence degrades and your context window needs flushing (sleeping) and you need to start a new session (new day, or post-nap afternoon).
I prefer focusing mostly on one task at a time (sometimes two for a short while, or asking another agent some questions simultaneously) and doing the task in chunks, so it doesn't take long before there's something to review. Then I review it, maybe ask for some refactoring, and let it continue to the next step (maybe letting it continue a bit before finishing the review if I'm feeling confident about the code). It's easier to review smaller self-contained chunks, and easier to refer to code and tell the AI what needs changing, because there are fewer relevant lines.
Turns out we weren't opposed to bad metrics! We were just opposed to being measured! Given the chance to pick our own, we jumped straight to the same nonsense.
This seems like a distinction without a difference, unless there actually are any good metrics (which also requires them to be objectively and reliably quantifiable). I think most developers don't really want to measure themselves, it's just that pro-AI people think measurement is necessary to put forward a convincing argument that they've improved anything.
I would posit that you need extra context to obtain meaning from those metrics, which inherently makes them less visible
If you try to come up with an objective definition of working feature you're back to gamability criticism.
Of course, at the same time we're getting dozens of alerts a week about services deployed open to the Internet without authentication and full of outdated vulnerable libraries (LLMs will happily add two or three years old dependencies to your lockfiles).
https://en.wikipedia.org/wiki/Perverse_incentive?wprov=sfla1
COCOMO, which considers lines of code, is generally accepted as being accurate (enough) at estimating the value of a software system, at least as far as how courts (in the US) are concerned.
https://en.wikipedia.org/wiki/COCOMO
The linked article does not demonstrate this. It establishes no causal link. One can obviously bloat LOC to an arbitrary degree while maintaining feature parity. Very generously, assuming good faith participants, it might reflect a kind average human efficiency within the fixed environment of the time.
Carrying the conclusions of this study from the 80s into the LLM age is not justified scientifically.
LOC is essentially only useful for a ballpark estimate of complexity, and even then only if you compare orders of magnitude, and only between similar programming languages and ecosystems.
It’s certainly not useful for AI generated projects. Just look at OpenClaw. Last I heard it was something close to half a million lines of code.
When I was in college we had a professor senior year who was obsessed with COCOMO. He required our final group project to be 50k LOC (he also required that we print out every line and turn it in). We made it, but only because we built a generator for the UI and made sure the generator was as verbose as possible.
Your example about OpenClaw works exactly against your own argument by the way: OpenAI acquired it for millions by all accounts.
“A very high MMRE (1.00) indicates that, on average, the COCOMO model misses about 100% of the actual project effort. This means that the estimate generated by the model can be double or even greater than the actual effort. This shows that the COCOMO model is not able to provide estimates that are close to the actual value.”
No one in the industry has taken COCOMO seriously for nearly 2 decades.
>OpenClaw
1. OpenAI bought the vibes and the creator. Why would they buy the code? It’s open source.
2. You don’t seriously think OpenClaw needs half a million lines of code to provide the functionality it does do you?
Seriously just go look at the code. No one is defending that as being an efficient use of code.
https://journal.fkpt.org/index.php/BIT/article/download/2027...
Yes, and in fact a lot of the studies that show the impact of AI on coding productivity get dismissed because they use LoC or PRs as a metric and "everyone knows LoC/PR counts is a BS metric." But the better designed of these studies specifically call this out and explicitly design their experiments to use these as aggregate metrics.
That's an anti-signal if we're being honest.
I'm not sure most developers, managers, or owners care about the calculated dollar value of their codebase. They're not trading code on an exchange. By condensing all software into a scalar, you're losing almost all important information.
I can see why it's important in court, obviously, since civil court is built around condensing everything into a scalar.
Working to the point of making yourself sick should not be seen as a mark of pride, it is a sign that something is broken. Not necessarily the individual, maybe the system the individual is in.
I find it crazy to build a complex system to juggle 10 different threads in your brain, including the complexity of the tool itself.
Claiming that you have "ten agents writing code at night" is not the flex you think it is. That's just a recipe for burnout and bad design decisions.
Stop running your agents and go touch grass.
feels like nowadays this is illegal and instead you should be running 50 agent swarms and be putting out 20 features an hour while reviewing the code via agents and .....
ugh.
It's not the only metric. But I'm more and more convinced that the people protesting any discussion of it are the ones who... don't ship a lot.
Of course it matters in what code base. What size PR. How many bugs. Maintenance burden. Complexity. All of that doesn't go away. But that doesn't disqualify the metric, it just points out it's not a one-dimensional problem.
And for a solo project, it's fairly easy to hold most of these variables relatively constant. Which means "volume went up" is a pretty meaningful signal in that context.
Yeah but all else isn’t equal, so unless you’re measuring a whole lot more than PRs it’s completely meaningless.
Even on a solo project, something as simple as I’m working with a new technology that I’m excited about is enough to drastically ramp up number of PRs.
PRs or closed jira tickets can be a metric of productivity only if they add or improve the existing feature set of the product.
If a PR introduces a feature with 10 bugs in other features and I have my agent swarm fix those in 10-20 PRs in a week, my productivity and delivery have both taken a hit. If any of these features went to prod, I have lost revenue as well.
Shipping is not same as shipping correctly with minimal introduction of bugs.
You're absolutely right that PRs fixing things that a previous PR broke is a negative. Same for PRs implementing work not needed, or driving up tech debt.
"You're productive because you have lots of PRs" is a mistake without that context. But so is "You produce very few PRs, but that's fine, we shouldn't look at volume".
It's not a performance metric. It is an indicator worth following up. And there's a lot of reflexive "bad metric" arguments blanket dismissing that indicator.
Does that help explain?
For profit failing as a metric, see: Enron.
If you mostly get around on your feet, distance traveled in a day is a reasonable metric for how much exercise you got. It's true that it also matters how you walk and where you walk, but it would be pretty tedious to tell someone that a "3 mile run" is meaningless and they must track cardiovascular health directly. It's fine, it works OK for most purposes, not every metric has to be perfect.
But once you buy a car, the metric completely decouples, and no longer points towards your original fitness goals even a tiny bit. It's not that cars are useless, or that driving has a magic slowdown factor that just so happens to compensate for your increased distance travelled. The distance just doesn't have anything to do with the exercise except by a contingent link that's been broken.
True, but if what you care about is "how quickly and safely can I reach a given goal", distance traveled over time is a great initial indicator, and accident rate will help illuminate.
The question "does AI help me move faster towards a goal, at the same quality standard", is relatively easy to judge in a solo project. As long as you verify equivalent standards, and don't play in an area you don't know at least - folks have a pretty clear understanding of their own productivity if it's a familiar thing.
Along those lines, some techniques I've been dabbling in:
1. Getting multiple agents to implement a requirement from scratch, then combining the best ideas from all of them with my own informed approach.
2. Gathering documentation (requirements, background info, glossaries, etc.), targeting an agent at it, and asking carefully selected questions for which the answers are likely useful.
3. Getting agents to review my code, abstracting review comments I agree with into a re-usable checklist of general guidelines, then using those guidelines to inform the agents in subsequent code reviews. Over time I hope this will make the code reviews increasingly well fitted to the code base and the nature of the problems I work on.
If you have the tokens for it, having a team of agents checking and improving on the work does help a lot and reduces the slop.
Why do people do this? A PR description is meant to be written by a human so that another human can understand what the author wanted to do; why outsource that to AI? It just doesn't make sense.
We have “Cursor Bot” enabled at work. It reviews our PRs (in addition to a human review)
One thing it does is add a PR summary to the PR description. It’s kind of helpful since it outlines a clear list of what changed in code. But it would be very lacking if it was the full PR description. It doesn’t include anything about _why_ the changes were made, what else was tried, what is coming next, etc.
This weird notion that the purpose of the thing is the thing itself, not what people get out of the thing. Tracks completely that a person would point to their number of commits and think that shows how productive they are (while acknowledging that it's a poor metric and just shrugging).
Says who? The point of the summary is so that I don't have to go look at the diff and figure out what happened.
But I agree with you, when reading PR descriptions and code comments I want a "why" not a "what". And that is why I think most LLM-generated documentation is bad.
Most of the time, the PR descriptions it generates for me are great.
I think the issue is you're assuming it's always poor output, which isn't the case. I'm in a much smaller team than you'd expect, so the why is discussed synchronously more often than not, and it becomes less of a problem.
Mentioning LLM usage as a distinction is like bragging about using a modern compiler instead of writing assembly. Yeah, it's faster, but so is everyone else's code... Besides, I wouldn't brag about being more productive with LLMs, because it's a double-edged sword: it's very easy to use them, and nobody is reviewing all the lines of code you are pushing to prod (really, when was the last time you reviewed an AI-generated PR that changed 20+ files and added/removed thousands of lines of code?), so you don't know the long game of your changes; they seem to work now, but who knows how it will turn out later?
Outside of work, yeah, everything is fine and there's nothing but the pure pursuit of knowledge and joy.
I think the next big movement in tech will be ALL companies becoming tech companies. Right now there are hundreds of thousands of "small" companies with big enough budgets to pay for a CTO to modernize their stack and lead them into the 21st century.
The problem is they don't know they have this problem and so they aren't actively hiring for a CTO. You've got to go find them and insert yourself as the solution.
Yet people look at me like I'm the odd one out when I say I am more productive with a modern compiler like GHC.
Is that the end game? Well why can’t the agents orchestrate the agents? Agents all the way down?
The whole agent coding scene seems like people selling their soul for very shiny inflatable balloons. Now you have twelve bespoke apps tailored for you that you don’t even care about.
Unless you don't review every generated line manually, and instead rely on, let's say, UI e2e testing, or perhaps unit testing (that the agents also wrote). I don't know, perhaps we are past the phase of "double check what agents write" and are now in the phase of "ship it. if it breaks, let agents fix it, no manual debugging needed!" ?
I'm sure these larger models are both faster and more cogent, but it's also clear that what matters is managing their side tracks and cutting them short. Then I started seeing the deeper problematic pattern.
Agents aren't there to increase the multifactor of production; their real purpose is to shorten context to manageable levels. In effect, they're basically trying to reduce the odds of longer-context poisoning.
So, if we boil down the probability of any given token triggering the wrong subcontext, it's clear that the greater the context, the greater the odds of a poison substitution.
Then that's really the problematic issue every model is going to contend with, because there's zero reality in which a single model is good enough. So now you're onto agents, breaking a problem into more manageable subcontexts and trying to fold those back into the larger context gracefully, etc.
Then that fails, because there's zero consistent determinism, so you end up at the harness, trying to herd the cats. This is all before you realize that these businesses can't just keep throwing GPUs at everything, because the problem isn't compute-bound; it's contextual/DAG-limited the same way a brain is limited.
We all got intelligence and use several orders of magnitude less energy, doing mostly the same thing.
Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
Enforce single responsibility, CQRS, domain segregation, etc. Make the code as easy for you to reason about as possible. Enforce domain naming and function/variable naming conventions to make the code as easy to talk about as possible.
Use code review bots (Sourcery, CodeRabbit, and Codescene). They catch the small things (violations of contract, antipatterns, etc.) and the large (ux concerns, architectural flaws, etc.).
Go all in on linting. Make the rules as strict as possible, and tell the review bots to call out rule subversions. Write your own lints for the things the review bots are complaining about regularly that aren't caught by lints.
Use BDD alongside unit tests, read the .feature files before the build and give feedback. Use property testing as part of your normal testing strategy. Snapshot testing, e2e testing with mitm proxies, etc. For functions of any non-trivial complexity, consider bounded or unbounded proofs, model checking or undefined behaviour testing.
I'm looking into mutation testing and fuzzing too, but I am still learning.
Pause for frequent code audits. Ask an agent to audit for code duplication, redundancy, poor assumptions, architectural or domain violations, TOCTOU violations. Give yourself maintenance sprints where you pay down debt before resuming new features.
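One of the audit targets above, TOCTOU, is easy to show in miniature (illustrative Python, not from any particular codebase):

```python
# TOCTOU (time-of-check to time-of-use) in miniature: checking for a file
# and then opening it leaves a race window in which another process can
# remove or swap the file. This check-then-use shape is what an audit
# would flag; the fix is to perform the operation and handle failure.
import os

def read_if_exists_racy(path):
    if os.path.exists(path):   # time of check
        with open(path) as f:  # time of use: the file may be gone by now
            return f.read()
    return None

def read_if_exists_safe(path):
    try:                       # no separate check: attempt, then handle failure
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None
```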
The beauty of agentic coding is, suddenly you have time for all of this.
I feel like I'm a bit stupid for not being able to do this. My process is more iterative: I start working on a feature, then I discover some other function that's slightly related, go refactor it into common code, then proceed with the original task. Sometimes I stop midway and see if this can be done with a library somewhere and go look at examples. I take many detours like these. I am never working on a single task like a robot, and I don't want Claude to work like that either. That seems so opposite to how my brain works.
What am I missing?
When I get an idea for something I want to build, I will usually spend time talking to ChatGPT about it. I'll request deep research on existing implementations, relevant technologies and algorithms, and a survey of the literature. I find NotebookLM helps a lot at this point, as does Elevenreader (I tend to listen to these reports while walking or doing the dishes or what have you). I feed all of those into ChatGPT Deep Research along with my own thoughts about the direction of the system, and ask it to produce a design document.
That gets me something like this:
https://github.com/leynos/spycatcher-harness/blob/main/docs/...
If I need further revisions, I'll ask Codex or Claude Code to do those.
Finally, I break that down into a roadmap of phases, steps and achievable tasks using a prompt that defines what I want from each of those.
That gets me this:
https://github.com/leynos/spycatcher-harness/blob/main/docs/...
Then I use an adapted version of OpenAI's execplans recipe to plan out each task (https://github.com/leynos/agent-helper-scripts/blob/main/ski...).
The task plans end up looking like this:
https://github.com/leynos/spycatcher-harness/blob/main/docs/...
At the moment, I use Opus or GPT-5.4 on high to generate those plans, and Sonnet or GPT-5.4 medium to implement.
The roadmap and the design are definitely not set in stone. Each step is a learning opportunity, and I'll often change the direction of the project based on what I learn during the planning and implementation. And of course, this is just what works for me. The fun of the last few months has been everyone finding out what works for them.
The only obvious bit you didn't cover was extensive documentation including historical records of various investigations, debug sessions and technical decisions.
Architecture & Design Principles
• Single Responsibility Principle (SRP)
• CQRS (Command Query Responsibility Segregation)
• Domain Segregation
• Domain-Driven Naming Conventions
• Clear function/variable naming standards
• Architectural constraint definition
• Scope definition
• Escalation criteria design
• Completion criteria definition

Planning & Process
• Formal upfront planning
• Constraint-based design
• Defined scope management
• Escalation protocols
• Completion criteria tracking
• Maintenance sprints (technical debt paydown)
• Frequent code audits

AI / Agentic Development Practices
• Agent-assisted code audits
• Agent-based feedback loops (e.g., reading .feature files pre-build)
• Agent-driven reasoning optimization (code clarity for AI)
• Continuous automated review cycles

Code Review & Static Analysis
• Code review bots:
  • Sourcery
  • CodeRabbit
  • CodeScene
• Automated detection of:
  • Anti-patterns
  • Contract violations
  • UX concerns
  • Architectural flaws

Linting & Code Quality Enforcement
• Strict linting rules
• Custom lint rules
• Enforcement of lint compliance via bots
• Detection of lint rule subversion

Testing Strategies

Core Testing
• Unit Testing
• BDD (Behavior-Driven Development)
• .feature file validation before build

Advanced Testing
• Property-based testing
• Snapshot testing
• End-to-end (E2E) testing
  • With MITM (man-in-the-middle) proxies

Formal / Heavyweight Testing
• Model checking
• Bounded proofs
• Unbounded proofs
• Undefined behavior testing

Emerging / Exploratory
• Mutation testing
• Fuzzing

Code Quality & Auditing
• Code duplication detection
• Redundancy analysis
• Assumption validation
• Architectural compliance checks
• Domain boundary validation
• TOCTOU (time-of-check to time-of-use) vulnerability analysis

Development Workflow Enhancements
• Continuous audit cycles
• Debt-first maintenance phases
• Feedback-driven iteration
• Pre-build validation workflows

Security & Reliability Considerations
• TOCTOU vulnerability detection
• MITM-based E2E testing
• Undefined behavior analysis
• Fuzz testing (planned)
Many of those tools are overpowered unless you have a very complex project that many people depend on.
The AI tools will catch the most obvious issues, but will not help you with the most important aspects (e.g. whether you project is useful, or the UX is good).
In fact, having this complexity from the start may kneecap you (the "code is a liability" cliché).
You may be "shipping a lot of PRs" and "implementing solid engineering practices", but how do you know if that is getting closer to what you value?
How do you know that this is not actually slowing you down?
If you are solo or in a small company, you apply the complexity you need. You can even do it incrementally: when you see a pattern of issues repeating, address those over time, hardening the process with lessons learnt.
Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.
I don't think there's a hard set of rules that can be applied broadly, the engineering job is to also find technical approaches that balance both needs, and adapt those when circumstances change.
On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.
I suspect that unless we get fully automated engineering / AGI soon, companies that value engineers with good taste will thrive, while those that double down into "ticket factory" mode will stagnate.
That is not at all what I meant; I'm sorry if it wasn't clear, but your assumption about how my job works is absolutely wrong.
I even mention that the product discussion is separate only on "how to wrangle these tools":
> Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.
Delivering value, which means also avoiding a feature that will limit or entrap you in the future.
> On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.
We do measure those and are quite strict about it, most of my design documents are about the trade-offs in all of those dimensions. We are very critical about proposals that don't consider future impacts over time, and mostly reject workarounds unless absolutely necessary (and those require a phase-out timeline for a more robust solution that will be accounted for as part of the initiative, so the cost of the technical debt is embedded from the get-go).
I believe I wasn't clear and/or you misunderstood what I said. I agree with you on all these points, and the company I work for is very much the opposite of a "ticket factory". Work being rejected over concerns about its overall cross-boundary impact is very much praised, and invited.
My comment was focused on how to wrangle these tools for engineering purposes being a separate discussion to the product/feature delivery, it's about tool usage in the most technical sense, which doesn't happen together with product.
We on the engineering side determine how best to apply these tools for the product we are tasked with delivering. The measuring of value delivered is outside of and orthogonal to the technical practices, since we already account for the trade-offs at proposal time, not during development. This measurement already existed pre-AI and is still what we use to validate whether a feature should be built or not, its impact and value delivered afterwards, and the cost of maintaining it versus the value delivered. All of that includes the whole technical assessment, as we did before.
Determining if a feature should be built or not is ultimately a pairing of engineering and product, taking into account everything you mentioned.
Determining the pipeline of potential future non-technical features at my job is not part of engineering, except for side-projects/hack ideas that have potential to be further developed as part of the product pipeline.
Reading your comment, it looks like you work for a pretty nice company that takes those things seriously. I envy you!
My concern was that for companies unlike yours that don't have well established engineering practices, it _feels_ that with AI you can go much faster and in fact it's a great excuse to dismantle any remaining practices. But, in reality they either doing busywork or building the wrong thing. My guess is that those are going to learn that this is a bad idea in the future, when they already have a mess to deal with.
To put what I mean into perspective... if you browse OP's profile you can find absolutely gigantic PRs like https://github.com/leynos/weaver/pull/76. I can not review any PR like that in good faith, period.
> if it breaks, let agents fix it, no manual debugging needed!" ?
Pretty trivial to have every Sentry issue have an immediate first pass by AI now to attempt to solve the bug.
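As a sketch of what such a first pass could look like: a tiny formatter that turns an (assumed, simplified) Sentry webhook payload into an agent prompt. The payload field names and the dispatch-to-an-agent step are assumptions for illustration; check Sentry's actual webhook schema before relying on them:

```python
# Hedged sketch: format a Sentry issue into a first-pass prompt for an
# agent. The payload shape ("data" -> "issue" with "id", "title",
# "culprit") is an assumption, not Sentry's verified schema.
def issue_to_prompt(payload: dict) -> str:
    issue = payload["data"]["issue"]
    return (
        f"Sentry issue {issue['id']}: {issue['title']}\n"
        f"Culprit: {issue.get('culprit') or 'unknown'}\n"
        "Reproduce the failure, propose a minimal fix, and open a draft PR."
    )

example = {"data": {"issue": {"id": "42", "title": "KeyError in checkout",
                              "culprit": "cart.views.checkout"}}}
print(issue_to_prompt(example))
```

The point is only that the glue is small; the hard part is deciding how much to trust the agent's proposed fix.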
Not at all, it's just a skill that gets easier with practice. Generally, if you're in a position to review a lot of PRs, you get proficient at it pretty quickly. It's even easier when you know the context of what the code is trying to do, which is almost always the case when e.g. reviewing your teammates' PRs or the code you asked the AI to write.
As I've said before (e.g. https://news.ycombinator.com/item?id=47401494), I find reviewing AI-generated code very lightweight because I tend to decompose tasks to a level where I know what the code should look like, and so the rare issues that crop up quickly stand out. I also rely on comprehensive tests and I review the test cases more closely than the code.
That is still a huge amount of time-savings, especially as the scope of tasks has gone from single functions to entire modules.
That said, I'm not slinging multiple agents at a time, so my throughput with AI is way higher than without, but not nearly as high as in some credible reports I've heard. I'm not sure whether those people personally review the code (e.g. do they have agents review it?), but they do have strategies for correctness.
Some agents will be developing plans for the next feature, but there can sometimes be up to 4 coding.
These are typically a mix of trivial bug fixes and two larger but non-overlapping features. For very deep refactoring I'll only have a single agent run.
Code reviews are generally simple since nothing of any significance is done without a plan. First I run the new code to see if it works. Then I glance at diffs and can quickly ignore the trivial var/class renames, new class attributes, etc leaving me to focus on new significant code.
If I'm reviewing feature A I'll ignore feature B code at this point. Merge what I can of feature A then repeat for feature B, etc.
This is all backed by a test suite I spot check and linters for eg required security classes.
Periodically we'll review the codebase for vulnerabilities (eg incorrectly scoped db queries, etc), and redundant/cheating tests.
But the keys to multiple concurrent agents are plans where you're in control ("use the existing mixin", "nonsense, do it like this" etc) and non-overlapping tasks. This makes reviewing PRs feasible.
There are features you can ship safely behind feature flags or staged releases. As you push in, you'll find that with the right tooling it can be a lot.
If you break it down, quite a bit can often be deployed safely with minimal human intervention (depending naturally on the domain, but that holds for a lot of systems).
I’m aiming to revamp the whole process - I wrote a little on it here: https://jonathannen.com/building-towards-100-prs-a-day/
Some say features. Well, are they used? Are they beneficial in any way for our society or humanity? Or are we producing junk for the sake of producing?
Thinking about it, a PR skill is pretty much an antipattern; even telling the AI to just create a PR is faster.
I think some vibe coders should let AI teach them some CLI tooling.
It checks whether I'm in a worktree, renames branches accordingly, adds a Linear ticket if provided, and generates a proper PR summary.
I'm not optimising for how fast the PR is created; I want it to do the menial steps I used to do.
Who are you creating PR descriptions for, exactly? If you consider it "drudgery", how do you think your coworkers will feel having to read pages of generic "AI" text? If reviewing can be considered "drudgery" as well, can we also offload that to "AI"? In which case, why even bother with PRs at all? Why are you still participating in a ceremony that was useful for humans to share knowledge and improve the codebase, when machines don't need any of it?
> My role has changed. I used to derive joy from figuring out a complicated problem, spending hours crafting the perfect UI. [...] What’s become more fun is building the infrastructure that makes the agents effective. Being a manager of a team of ten versus being a solo dev.
Yeah, it's great that you enjoy being a "manager" now. Personally, that is not what I enjoy doing, nor why I joined this industry.
Quick question: do you think your manager role is safe from being automated away? If machines can write code and prose now better than you, couldn't they also manage other machines into producing useful output better than you? So which role is left for you, and would you enjoy doing it if "manager" is not available?
Purely rhetorical, of course, since I don't think the base premise is true, besides the fact that it's ignoring important factors in software development such as quality, reliability, maintainability, etc. This idea that the role of an IC has now shifted into management is amusing. It sounds like a coping mechanism for people to prove that they can still provide value while facing redundancy.
_Parts_ of what I write are drudgery, which gets automated away. The "why" we talk about in sync, so it's much less of an issue in general.
When I say management, I mean more like a staff engineer or a tech lead, rather than a traditional manager.
Oh really? I enjoy doing one thing at a time, with focus.
AI, the way you're using it, OP, isn't making you faster; it's making you do more work for the same amount of money. You're burning yourself out for no reason.
A colleague has been using Claude for this exact purpose for the past 2-3 months. Left alone, Claude just kept spewing spammy, formulaic, uninteresting summaries. E.g. phrases like "updated migrations" or "updated admin" were frequent occurrences for changes in our Django project. On the other hand, important implementation choices were left undocumented.
Basically, my conclusion was that, for the time being, Claude's summaries aren't worthy of inclusion in our git log. They missed most of the things that would make a log message useful, and included mostly stuff that Claude could regenerate on demand at any time. I.e. spam.
I got praised for my commit messages by another team, they asked me how I was making Claude generate them, and I had to tell them I'm just not using Claude for that.
I like writing my own commit messages because it helps me as well, I have to understand what was done and be able to summarise it, if I don't understand quickly enough to write a summary in the commit message it means something can be simplified or is complex enough to need comments in the code.
What I want from a PR is what's not in the patch, especially the end goal of the PR, or the reasoning for the solution represented by the changes.
> SWC removed the friction of waiting - the dead time between making a change and seeing it.
Not sure how that relates to Claude Code.
> The preview removed the friction of verifying changes - I could quickly see what’s happening.
How Claude is "verifying" UI changes is left very vague in the article.
> The worktree system removed the friction of context-switching - juggling multiple streams of work without them colliding.
Ultimately, there's only one (or two) main branches. All those changes need to be merged back together again, and they need to be reviewed. Not sure how collisions and conflicts are miraculously solved.
Where I find it incredible is learning new things. I recently started Flutter/Dart development, and I just ask Claude to tell me about the bits, or to explain things to me. It's truly revolutionary, IMHO: I'm building things in Flutter after a week, without reading a book or manual. It's like a talking encyclopaedia, or having an expert on tap. Do many people use it like this, or am I just out of the loop? I always think of Star Trek when I'm doing it. I architected/designed a new system by asking Claude for alternatives, and it gave me an option I'd never considered for a problem. It's amazing for this; after all, it's read all the books and manuals in the world. It's just a matter of asking the right questions.
Imo we may be messing up the economy with AIs. They should be engineering better workers, not being employed to make one person do the work of three poorly.
The power of AIs to smooth learning and raise expertise, rather than replace it, should be the adaptation goal. Obviously AIs as work assistants are powerful, but all the bullshitting CEOs overselling AI are really damaging at the whole-economy level.
Particularly because the current marketing leads the next generation to abandon roles that the AI bullshitters claim are perfectly replaceable.
It's like the urbanization demographic bomb on steroids.
Even open-weight local models are becoming good enough for teaching yourself quite a range of stuff, especially the beginner aspects. LLMs are not going to simply disappear because of a financial realignment. The worst outcome might be not being able to access a super-duper frontier model for free?
But it's just a damn good tool, not the apocalypse/the thing that lets you finally fire everyone. So it kind of gets lost in the hype.
This is honest, coming from someone who is also now doing this.
Now it's just becoming blatant
> I switched the build to SWC, and server restarts dropped to under a second.
What is SWC? The blog assumes I know it. Is it https://swc.rs/ ? or this https://docs.nestjs.com/recipes/swc ?
What's the point of using it during development, then?
Typechecking is not: the browser doesn't care about it; it's mainly there to help the developer verify their code.
So, to speed up the build during development (to have faster iterations), the idea is often to make the build process only about the build, by removing "unnecessary" steps like type-checking from it, while having a separate linting/type-checking process, which could even run in parallel but isn't needed to be able to test the application.
This is often done by using tools like a bundler (e.g. esbuild) or a transpiler (Babel, SWC) to erase the types, without checking them, as part of your bundling process.
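One common shape for this split, sketched as `package.json` scripts (assuming `@swc/cli`, `typescript`, and `npm-run-all` are installed; the script names are illustrative):

```json
{
  "scripts": {
    "build:dev": "swc src -d dist --watch",
    "typecheck": "tsc --noEmit --watch",
    "dev": "run-p build:dev typecheck"
  }
}
```

Here `swc` only strips types and emits JavaScript, while `tsc --noEmit` does the actual type-checking in parallel; neither blocks the other, so the fast rebuild loop stays fast.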
Meanwhile in the real world the expectations shift to normalise the 10x and your boss wants to know why your output isn’t 12x like that of Max
However, I agree with you that commits are a terrible (or an unreliable) metric; more commits do not necessarily equal higher productivity.
I have started using Claude to develop an implementation plan, but instead of making Claude implement it and then have me spend time figuring out what it did, I simply tell it to walk me through implementing it by hand. This means that I actually understand every step of the development process and get to intervene and make different choices at the point of time where it matters. As opposed to the default mode which spits out hundreds of lines of code changes which overloads my brain, this mode of working actually feels like offloading the cognitive burden of keeping track of the implementation plan and letting me focus on both the details and the big picture without losing track of either one. For truly mechanical sub-tasks I can still save time by asking Claude to do them for me.
I know many will then say, BUT QUALITY, but if you learn to deal with your own and claude quirks, you also learn how to validate & verify more efficiently. And experience helps here.
What I do is use the LLM to ask a lot of questions to help me better understand the problem. After I have a good understanding, I jump into the code and hand-code the core of the solution. With this core work finished (keep in mind that at this point the code doesn't even need to compile), I fire up my LLM and say something like: "I need to do X; uncommitted in this repo we have a POC for how we want to do it. Create and implement a plan for what we need to do to finish this feature."
I think this is a good model because I'm using the LLM for the thing it is good at: "reading through code and explaining what it does" and "doing the grunt work". While I do the hard part of actually selecting the right way of solving a problem.
This resonates with me because I've been looking for a way to detect when I would make a different decision than the LLM. These divergence points generally happen because I'm thinking about future changes as I code, and the LLM just needs to pick something to make progress.
Prompts like "list your assumptions and do not write any code yet" help during planning. I've been experimenting with "list the decisions you've made during implementation that were not established upfront in the plan" after it makes a change, before I review it, because when eyeballing the diff alone, I often miss subtle decisions.
Thanks for sharing the suggestion to slow it down and walk the forking path with the LLM :)
Helped me surface an important distinction on why it doesn't really happen for me. I think there's three parts to it:
1. I work on only one thing at a time, and try to keep chunks meaty
2. I make sure my agents can run a lot longer so every meaty chunk gets the time it deserves, and I'm not babysitting every change in parallel, that would be horrible! (how I do this is what this post focuses on)
3. New small items that keep coming up / bug fixes get their own thread in the middle of the flow when they do come up, so I can fire and forget, come back to it when I have time. This works better for me because I'm not also thinking about these X other bugs that are pending, and I can focus on what I'm currently doing.
What I had to figure out was how to adapt this workflow to my strengths (I love reviewing code and working on one thing at a time, but also get distracted easily). For my trade-offs, it was ideal to offload context to agents whenever a new thing pops up, so I continue focusing on my main task.
The # of PRs might look huge (and they are to me), but I'm focusing on one big chonky thing a day, the others are smaller things, which together mean progress on my product is much faster than it otherwise would be.
Overstating things of course. But paying off technical debt never felt so good. And the expected decrease in forward friction has never been so achievable so quickly.
This one's interesting to me. For a lot of my career, the act of writing the PR is the last sanity check that surfaces any weirdness or my own misgivings about my choices. Sometimes there would be code that felt natural when I was writing it and getting the feature working, and maybe that code survived my own personal round of code review... but having to write about it in plain english for the benefit of someone doing review with less context was a useful spot to do some self-reflection.