The Optimal Amount of Slop is Non-zero

Doug · 2026-06-22

Regretting that code you vibed? Learn when skipping human review is and isn't a smart move.

A 2D chart. Horizontal axis label is risk, vertical axis label is rigor. A line goes up and to the right. Above the line, it says 'Too much rigor.' Below the line, it says, 'Too little rigor.'

Rigor should be proportional to risk.

My regular readers might be shocked at the title of this post. If you've read my other posts, such as AI: Accelerated Incompetence or LLMs are not Bicycles for the Mind, you might expect that I would more readily miss my son's birthday than ship unreviewed LLM code. You would not be far from wrong: there are just a few narrow situations where I have. Today you'll learn about those and, along the way, my decision criteria for skipping code review.

Definitions

agentic coding: An LLM edits, runs, and tests code for you in a loop
vibe coding: Accepting LLM-generated code without reading it
slop: Low-quality, high-quantity AI-generated content

Looks can be deceiving

Month by month, I encounter more people who have discovered agentic coding and have come to trust it so much that they are untroubled by outsourcing not just software implementation but also verification to it. Just yesterday I chatted with a dev who says he's stopped reviewing code. He lets a team of LLM agents do it for him. I felt disappointed because he should understand this vexatious property of software: externally observable appearance and behavior give very little signal about internal quality. A program that does everything expected of it can still be riddled with quality issues. It works today but will break when it is revised or the world around it changes.

As a daily user of Claude Code, I can attest that when given clear requirements and context, it regularly generates software that actually does what I asked. However, across hundreds of sessions, the code has not once been what I would call good, even after adversarial LLM review.

Closed-source software is both an experience good and a credence good¹. We've all bought some downloadable software or subscribed to a SaaS. Before you bought, you evaluated whether the software works well for your needs, but there was no way for you, a prospective customer, to evaluate the quality of its implementation. You can only evaluate on externals. If there's a security flaw, you can't discover it. Certifications like SOC 2 exist to rebalance this information asymmetry between the developer and the customer.

If you, a developer, outsource reading the code to an LLM, then you discard your information advantage and bring no more value than a nonpractitioner.

Here's how we know software is a credence good: give an exec a slick-looking prototype, and they'll be ready to write a check for millions. Really all you've done is given them a poster for a movie that doesn't even have a premiere date yet. This is why good prototypes deliberately look unfinished.

A screenshot of Balsamiq mockup

Prudent mockups and prototypes look like pencil sketches on purpose so that no exec says "ship it now."

Programmers have a capability the general population doesn't: to review the code their LLM generates. That's a valuable advantage, but internal code quality is paid for in the scarce currencies of time and attention. When is the effort worthwhile?

What we're looking for is the right risk-rigor ratio.

Matching rigor to risk

In any situation, when deciding how much rigor to exercise, we have to consider the possible costs of things going wrong. If they're low enough, we don't bother to exercise the rigor, but if they are high enough, we should. Let me tell you two stories that demonstrate getting this wrong.

Too much rigor

Imagine a dystopian future where hamburgers are extremely valuable. Crime rings regularly steal, launder, and resell them. When you walk into McDonald's, you pass through a metal detector and are subjected to a brisk frisking. When you order a hamburger, the cashier sternly asks to see your government-issued photo ID. In this dire world, such extreme measures are necessary to maximize McDonald's profits by protecting from loss. In the real world, this story is a laughable fiction because the rigor far outweighs the risk. Burgers would cost ten times more, and McDonald's wouldn't sell very many.

At a sufficient level of risk, such drastic security measures land fully inside the Overton window: they are routine at every commercial airport in the world.

Too little rigor

Let's flip over to a story that demonstrates the exercise of too little rigor. The movie The Invention of Lying (2009)² takes place in a world where nobody has ever told a lie. The main character Mark Bellison, played by Ricky Gervais, is down on his luck: he's about to be evicted because he can't afford his rent. Defeated and expecting to become homeless, he saunters into his local bank branch and asks to close his account. The teller replies that unfortunately, the computer system is down and she can't close the account, but if Mark will kindly tell her what his balance is, she can make a withdrawal right away. The account holds a balance of $300, but an epiphany hits Mark, and he tells the world's first lie: "I have $800 in my account." Just that moment, the computer system comes back up and correctly reports a balance of $300. Since lying is inconceivable to her, the teller assumes the computer is wrong and happily hands Mark $800. She even apologizes for the inconvenience.

A guilty-looking Mark Bellison realizing he just stole $500

In the real world, Mark would have asked for $8 billion. The bank would have failed, and the effects would have rippled through the US financial system. I think that would have been a more interesting movie, but in a world where nobody lies, I don't think banks would even exist. One purpose of a real bank is to keep your dirty, greedy hands off my money. In that world, a bank would be more like an office refrigerator that contains a styrofoam take-home box into which your fingernail etched your initials.

The right amount of rigor

The movie is not entirely a fiction. To an extent, banks pretend people don't lie. The title of this post is a snowclone of Patrick McKenzie's classic essay The optimal amount of fraud is non-zero³ which explains that banks allow some fraud as a policy decision that maximizes the overall value of commerce. The banking industry is not stupid or gullible. Smart people have converged on this arrangement after centuries of facilitating commerce and handling fraud.

Enforcing zero fraud would be very expensive. Similarly, enforcing a human review of all code is expensive.

An important difference between a bank and your business is that for banks, the risk of fraud is distributed. Most card fraud is absorbed by retailers as the cost of doing business. Beyond that, the card network absorbs the cost. Fraud is also policy. Banks are deputies of the state, regulated and backstopped by the full faith and assurance of the US Federal Reserve.

What kind of software is it?

Well before LLMs, it was clear that some software needs more stringent verification. The Python script that backs up your spouse's photos merits less scrutiny than your employer's authentication platform which in turn needs less care than the software running your dad's pacemaker.

Bertrand Meyer, an expert on software verification, uses a three-bucket "ABC" taxonomy: Acute, Business, and Casual⁴.

Casual software has limited distribution and loose quality constraints. Examples include an app for your personal use, a spreadsheet macro, or an internal proof-of-concept. Most software falls into this category.

if sometimes they crash, sometimes produce not-quite-right results, cannot be easily understood or maintained by anyone other than their original developers, target just one platform, run too slowly, eat up too much memory, are not easy to change, include duplicated code — it is not the end of the world

Business software is what most professional developers work on every day. If the software doesn't work, your organization suffers loss.

Acute software is mission-critical and merits the highest levels of scrutiny.

if it does not work exactly right — someone will get killed, someone will lose huge amounts of money, or something else will go terribly wrong.

When deciding to which bucket software belongs, consider these factors:

Longevity: How long does the software have to keep working?
Potential harm:
- Reach: How many people or organizations can defects harm?
- Severity: How badly can defects harm someone or your org?

Examples:

Banking infrastructure still runs COBOL written in the 1960s.
A disruption of flight scheduling can delay thousands of itineraries and inflict costly second-order economic loss.
A malfunctioning medical device can kill someone.
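The factors above can be turned into a rough sorting rule. What follows is only a sketch: the `Software` record, the `bucket` function, and both cutoff constants are hypothetical names and numbers invented for illustration, not part of Meyer's taxonomy. Real decisions involve judgment that a few comparisons can't capture.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only for illustration. Pick your own.
CASUAL_MAX_REACH = 10  # people or orgs a defect could harm
CASUAL_MAX_YEARS = 5   # years the software must keep working


@dataclass(frozen=True)
class Software:
    name: str
    longevity_years: float  # Longevity
    reach: int              # Reach
    catastrophic: bool      # Severity: could a defect kill someone or lose huge sums?


def bucket(sw: Software) -> str:
    """Sort software into Acute, Business, or Casual using the assumed cutoffs."""
    if sw.catastrophic:
        return "Acute"
    if sw.reach > CASUAL_MAX_REACH or sw.longevity_years > CASUAL_MAX_YEARS:
        return "Business"
    return "Casual"


# Made-up numbers for the three examples mentioned earlier in this section.
examples = [
    Software("photo backup script", longevity_years=2, reach=2, catastrophic=False),
    Software("authentication platform", longevity_years=10, reach=5000, catastrophic=False),
    Software("pacemaker firmware", longevity_years=10, reach=1, catastrophic=True),
]

for sw in examples:
    print(f"{sw.name}: {bucket(sw)}")
```

Severity is checked first on purpose: in this sketch, a defect that can kill someone makes the software Acute no matter how few people use it.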

The grey zone

For the two extremes of acute and casual software, appropriate LLM use is pretty evident. A biologist who vibe codes a Python script that produces incorrect data may publish a bad paper. Deploying unverified cancer treatment planning software will in the best case earn your business an FDA audit and in the worst case mistreat or kill a patient.

For line of business software, the right amount of rigor is more elusive. It depends on what you ultimately want to achieve, your time horizon, and your appetite for risk.

What are you optimizing for?

Speed

If you're trying to ship as fast as possible right now, all else be jammed, the optimal amount of unreviewed code you should ship is close to 100%. On the other hand, if you're trying to keep a reasonably fast velocity for the long haul, you'll want to slow down so you can understand the code and invest more in its maintainability.

Business value

Businesses want to maximize profits and minimize costs, but getting greedy today can cost later. Shipping unreviewed code can land a quick lucrative sale. This same decision can also make it expensive to iterate or pivot later. This has been the case long before LLMs.

Learning

It's been well established for over 50 years that producing information makes you retain it better than just reading it.⁵ If you're training junior software engineers, the optimal amount of vibe coding is probably zero. If you're an expert dev and learning a new language or domain, the optimal amount is still probably close to zero.

Ethics

There are some serious ethical problems with LLMs: stolen training data, violation of copyright, energy, water, and land use, suppression of wages, and devaluing of human labor are just a few. If any of these bother your conscience, optimal use of LLMs might be zero.

Slop I have shipped

Here's a sample of the software I have created without human review:

A macOS app that turns the screen black after 5 idle minutes to spare my OLED monitors
A macOS app that rearranges my windows to preset locations
A macOS app that shows Claude usage in the menu bar
A private fork of Wezterm with vertical, draggable tabs
A VIM clone for the AlphaSmart Dana
A CLI that automatically says "yes" to Claude after a countdown
A CLI that watches a folder for receipt images and OCRs them
A web app for tracking prayers
A web and Android app for sending text messages from my browser
An iOS app for tracking baby routines like sleep and feeding
An Android and iOS app replacing the awful one that our smart thermostat uses

What do all of these apps have in common? They have limited distribution. They're just for me or my family.

Here is software I have shipped with either wider distribution or elevated risk posture:

At work, I clauded up an app that hits internal APIs and emails the team about problems like HTTP 4xx or 5xx. I didn't even look at the code. Later, I delivered an internal desktop app to our customer success team that clones user settings, which makes reproducing customer issues easier. I glanced at the code and decided it was fine.

Beanscrape is the most vibed code I have ever shipped to the public. It's a line-of-business app that lives in that grey zone of rigor. As one who holds strong internal convictions about code quality, I don't feel amazing about it, but it does everything it needs to. It's no credence good: I gave it away as FOSS, so anyone can audit its small codebase. I think the world is better off for it. Without LLMs, Beanscrape would not exist. The utility justifies the means.

One could argue that Beanscrape isn't vibe coded. It's somewhere in between. The first proof-of-concept was a single, 3000-line ball of JavaScript. From there, I had Claude start over, break it up into web components, use a well-known web framework, and rewrite it in TypeScript. I've scanned the code to vet its shape, but I have not scrutinized every line, since TypeScript is not a skill you'll find on my resume, and learning it is not one of my aspirations. On the other hand, my eyeballs have closely vetted the parts that run outside the browser. I can do that since they're in C#, a language I know well, and it's worthwhile for a desktop app that handles bank data. The security stakes are higher.

Surebeans is a budgeting app I sell that carries forward the spirit of YNAB4. Like Beanscrape, it is also slightly vibed, but less so. It began as a Christmas vacation project in December 2025. Up through February, it was completely vibe coded. I was thinking it would be just for me and my wife. Once I decided to turn it into a product, I hit the brakes, since I now expected to be working on this code for a long time. I reviewed all existing code and did some significant refactoring. Now I read all LLM-generated code before merging. This is time-consuming and makes me go slower. I commit to new features more selectively. I think this is a good thing.

Objections

Vibe coding doesn't imply slop

I'll admit I'm using the term slop loosely. It's true that the code in a small app may be of decent quality. That won't be the case for anything larger, but for software with limited impact, that might not matter. I think this is where the ABC taxonomy is helpful. To explain, I'll invoke the shed, house, and skyscraper metaphor.

A handy person can improvise a shed in their backyard with spare lumber and hand tools. Execution is barely distinguishable from planning. If the shed collapses under a heavy snow load, you might have to buy some new tools.

To build a house, you need an architect, structural engineer, and building approvals, followed by construction: concrete, framing, roofing, electricity, plumbing, finishing, and finally inspections. The whole process can exceed a year and cost hundreds of thousands of dollars.

To build a skyscraper, you need thousands of people, tens of millions of dollars, and five to ten years of planning and execution.

The larger the project, the more expensive and consequential failure is.

LLMs can review the code, so I don't have to

Adversarial code reviews are all the rage right now. A host of prompts, plugins, and skills stand ready to let a virtual team review your code: a principal engineer, product engineer, UX designer, domain expert, and so on. For added diversity, they'll even invoke different providers, like a mix of Claude, Gemini, and Codex.

As an augmentation to human review, I have no issue with this practice except that it is expensive, and it produces more text for me to evaluate when I could be looking at the code.

As a replacement for human review, there are two showstoppers.

Jury, judge, and executioner

The review is not independent. The same model that wrote your code is now reviewing it. Even across providers, the models are trained on similar data and with similar methods, and they all have access to the same world wide web. Moreover, LLM roleplay is a fiction. It dresses copies of the same model in different cosplay. Asking an LLM to be an expert QA does not cause it to know more about software verification.

We've had this problem with human devs for decades. There's a reason software engineer-in-test is a separate role from software developer. Devs make for terrible QAs. We only test happy paths. Our systemic incentive is to merge ASAP and claim the next Jira ticket.

You don't really know

LLM code review as a replacement for human review isn't epistemologically sound.

Episte-what, you say?

Epistemology is a branch of philosophy which asks, "What is knowledge, where does it come from, and what are its limits?"

If we ask, "How do we know this code is good?" the field names three sources of knowledge:

One source is testimony. We trust what a knowledgeable source tells us. An infallible source of knowledge is known as an oracle. Real oracles don't exist, but when only LLMs review our code, we're treating them as one, trusting testimony we don't independently verify.

Another source is reason. We work things out by thinking: tracing the logic, applying definitions, evaluating legibility. If we skip code review, we skip applying reason to it. "But LLMs can reason!" you reply. I disagree. LLM reasoning is a matter of belief. An LLM doesn't know that a ball I am holding cannot pass through my hand. It merely makes it seem like it does with a specific sequence of words. It's an illusion that lives in my mind, not in the machine. Let's grant though that LLMs can actually reason. It's not your reasoning. If you rely on the LLM to do all of the thinking, then you've regressed back to testimonial knowledge.

A third source is experience. This produces a posteriori knowledge, also called empirical knowledge. We try things and see what happens. The scientific method formalizes this. We pose a hypothesis, control for all other factors, and test for the one open factor. In software, when we have a failing test, make a single change to the subject under test, and the test turns green, we have produced empirical knowledge about the code. It's possible your LLM followed TDD or red-green-refactor, but if you don't read the tests, then again you've regressed back to testimonial knowledge.

What to do now

Consider your appetite for risk and for rigor, and whether those are aligned.

Right now, get out a pen and some paper. Draw the chart at the top of this page. Then, think of the code you have shipped lately. For each piece, draw a dot indicating where it sits relative to the risk-rigor line. If it's above the line, could you have gotten away with less review? If it's below, should you have reviewed more?

Think carefully before shipping unreviewed code. It can come back to bite you later. Consider how many people it could affect, and how severely. Consider how long you have to maintain what you are shipping.

If you don't review the code, you are trusting the LLM to get things right. I certainly don't.

References
