Don't believe "software never fails"
The claim "software never fails" is cavalier and dangerous. It creates a mindset that itself leads to software failures.
On March 26, 2024, a container ship collided with one of the piers of the Francis Scott Key Bridge in Maryland. Within 30 seconds, the entire bridge had collapsed into the Patapsco River, killing six people [1].
The collapsed Francis Scott Key Bridge
I'm going to make a bold claim: the bridge didn't fail.
That's right. Six people dead. $15M economic loss per day. The bridge didn't fail.
After all, it did exactly what it was designed to do. The bridge was doomed the moment its blueprints were approved, before a single truss was machined, before a single pier was erected. From the moment of impact, its collapse was mechanically inevitable. A computer simulation could have easily predicted it.
How does my claim make you feel? How do you think the families of the dead would feel? They'd be ready to throw me off that bridge! Of course the bridge failed. Look at it!
Savor your righteous indignation, because that is exactly what I felt when I read the post Software Never Fails [2].
It claims that:
The brief reason we can say software never fails is that software always does precisely what we told it to do. The outcome may be undesirable, and it may be unexpected, but the software itself never deviates from the instructions it was compiled into.
The post offers a definition of failure:
A failure is the event that occurs when a component first does what we designed it to, then something happens, and then the component no longer does what it was designed to do.
If we are to consistently apply this claim and definition, then the Francis Scott Key Bridge didn't fail. It did precisely what it was designed to do.
Software developers: Now imagine this kind of thinking applied to your codebase, or the medical device your grandmother relies on.
The failure of "software never fails"
The syllogism presented in Software Never Fails goes:
- Premise 1: Failure means some component changed from working to non-working.
- Premise 2: A program's compiled instructions never change.
- Conclusion: "Software is design-stuff. It doesn't fail ever."
Both premises are false, and so is the conclusion.
A faulty definition
Premise 1 is false by the very definition of failure it relies on.
The article cherry-picks a definition from hardware reliability engineering which says that failure is a state transition from working to non-working: a capacitor dries out, a bearing wears down. While this definition is suitable for hardware, it doesn't apply to programs. If a program "never deviates from the instructions it was compiled into", how can it ever make that transition? Software is definitionally excluded.
In When the Machine Metaphor Breaks Down, I compared and contrasted software failures to hardware failures. I provided precise definitions of failure that apply to both:
- Sins of commission: The thing did something it was required not to.
- Sins of omission: The thing didn't do something it was required to.
- Latent faults: Bugs that are present but not causing observable failures.
- Unmet expectations: The user was surprised the thing did or didn't do something.
- Compromised safety: A hazard or loss occurred and the thing was involved, even if it performed as required and expected.
I also showed that hardware and software fail for different causes:
- Hardware fails as its internal physical condition deteriorates.
- A program fails as its environment changes, invalidating its assumptions and preconditions (see the sketch after this list).
- A codebase fails as its design integrity dissolves and defect count rises.
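To make the second point concrete, here is a minimal, hypothetical sketch: a program whose instructions never change, yet which starts failing the day its environment does. The upstream service, field name, and date format are invented for illustration.

```python
from datetime import datetime

def parse_ship_date(payload: dict) -> datetime:
    # Assumption baked in at design time: the upstream service
    # always sends dates as "YYYY-MM-DD".
    return datetime.strptime(payload["ship_date"], "%Y-%m-%d")

# Day 1: the assumption holds, so the unchanged program "works".
print(parse_ship_date({"ship_date": "2024-03-26"}))

# Day 400: the upstream team switches to ISO 8601 timestamps.
# Not one instruction in this program changed, yet every call now fails.
try:
    parse_ship_date({"ship_date": "2024-03-26T01:28:00Z"})
except ValueError as error:
    print(f"Failure without a single changed instruction: {error}")
```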
Software is surprising
Premise 2 is also false.
The claim "software never changes" wants us to believe that programs are deterministic, and if deterministic, then predictable, and therefore incapable of surprising us with failure.
Deterministic means the same inputs produce the same outputs.
But programs are effectively nondeterministic. Yes, it's narrowly correct that compiled bytecode doesn't spontaneously mutate, but the argument conflates instructions with execution. A program's behavior depends not only on its bytecode but on its entire input space. Proving a program correct requires proving it correct for all possible inputs and states. As Dijkstra wrote, tests can show the presence of bugs, but never show their absence.
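As a hypothetical illustration of Dijkstra's point: the function below is perfectly deterministic and its tests pass, yet a bug waits in the part of the input space we never exercised. The function and its tests are invented for this example.

```python
def average_latency(samples_ms: list[float]) -> float:
    # Deterministic: the same input always produces the same output.
    return sum(samples_ms) / len(samples_ms)

# The tests we happened to write all pass...
assert average_latency([10.0, 20.0, 30.0]) == 20.0
assert average_latency([5.0]) == 5.0

# ...but they only demonstrate correct behavior for the inputs we tried,
# not the absence of bugs. The first quiet hour in production, when no
# samples arrive, the same unchanged instructions divide by zero.
try:
    average_latency([])
except ZeroDivisionError:
    print("Deterministic, tested, and still surprising.")
```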
The danger of believing "software never fails"
Design and implementation conflated
In building architecture, the distinction between design and construction is clear. Design is clay models and blueprints. Construction is pouring concrete and milling steel. Many structural failures happened because the construction did not follow the design [3].
Software also has distinct design and implementation activities. Sometimes this is easy to see: it's common industry practice for a software architect to hand over a design specification to another team, perhaps even an offshore one. The distinction is harder to observe if the same person does the designing and the implementing.
When the Therac-25 radiation therapy machine killed patients in the 1980s, investigators found that the software implementation failed to properly sequence safety interlocks that the design had specified. The design was sound; the implementation was not. If we say "software is pure design," we have no vocabulary to disambiguate these failure modes, and we have no framework for preventing them.
Implementation errors are mitigated with verification, i.e. asking the question, "did we build the thing correctly?" Design errors are mitigated with validation, i.e. asking "are we building the correct thing?"
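A hypothetical sketch of that distinction: the test below verifies that the implementation matches its specification, but only validation can reveal that the specification itself is wrong. The pricing rule and function name are invented for illustration.

```python
def monthly_price_cents(annual_price_cents: int) -> int:
    # The spec (the design) says: "divide the annual price by 12 and truncate."
    return annual_price_cents // 12

# Verification: did we build the thing correctly?
# Yes -- the implementation matches its spec, and this test proves it.
assert monthly_price_cents(1000) == 83

# Validation: are we building the correct thing?
# No test of the code can answer that. If the business actually needed to
# round up (84 cents) so it never undercharges, the spec itself is the
# error, and a perfectly verified implementation still fails its users.
```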
In A Design is a Mold for Code, I delve into the distinction between design and code.
Presbyopic causal chains
The benevolent intent of the phrase "software never fails" is to encourage us to look beyond proximal causes of failure and seek distal causes, but it goes too far. A high-quality report on a commercial airliner crash will include both: the plane had a problem, and a long causal chain of lax training, spotty maintenance, regulatory capture, and eroded safety culture from the very top allowed the problem to cascade into a catastrophe [4].
Presbyopia is the inability of the eye to focus clearly on close objects [5], and I'm concerned that people who believe "software never fails" will look for distant causes of failure while allowing the immediate ones to blur into invisibility.
What is a causal chain? Many of us have had or heard a conversation with a young child that went something like this:
"Why can't I play on the grass?" "Because it's wet, sweetie" "Why?" "Because I turned the valve on." "Why?" "Because the grass is turning brown" "Why?" "Because we live in New Mexico" "Why?" "Because your great-grandfather moved here in the '40s" "Why?" "Because the Wright Brothers invented the airplane. This made the Japanese attack on Pearl Harbor possible, which brought the USA into World War 2, which resulted in the founding of Los Alamos in 1943.
As you can see, the more removed a cause is from its effect, the less plausible its association and the less actionable any mitigation or prevention. A proximal cause of my grass being wet is that I turned the valve on. A distal cause is that the Wright Brothers invented the airplane. Removing the middle links from the chain of causality reveals the absurdity: "You can't play outside because airplanes exist."
If we believe a claim like "software never fails", we sever the proximal links in the causal chain. We'll cast a wide net looking for monsters in other castles but not the ones under our beds.
Fix: The 4-part filter for relevant causes
Use this framework to discover relevant distal causes of failure without ignoring proximal causes; a sketch of applying it follows the list. When engineers analyze why a failure happened or could happen, they privilege causes that:
- are proximal, physically or temporally. My hand turned the valve, and a few seconds later the grass got wet. The Pearl Harbor attack and invention of the airplane happened long ago and far away.
- exhibit agency. I chose to turn the valve. I didn't choose where to found Los Alamos.
- are controllable. Turning a valve is an event I can control (separately from choosing to). There's no action I can take about historical wars.
- are normal and expected. Turning a valve is a common event. Wars aren't.
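Here is one hypothetical way to apply the filter during a postmortem, as a simple checklist over candidate causes. The `Cause` structure and the example causes are invented for illustration, not a prescribed tool.

```python
from dataclasses import dataclass

@dataclass
class Cause:
    description: str
    proximal: bool      # close to the failure in space and time
    agency: bool        # someone chose to do (or not do) it
    controllable: bool  # we could act on it next time
    normal: bool        # a routine, expected kind of event

    def relevance(self) -> int:
        # Crude score: one point per criterion the cause satisfies.
        return sum([self.proximal, self.agency, self.controllable, self.normal])

causes = [
    Cause("I turned the irrigation valve on", True, True, True, True),
    Cause("The Wright Brothers invented the airplane", False, False, False, False),
]

# Rank candidate causes so the proximal, actionable ones stay in focus,
# while the distal ones are still recorded.
for cause in sorted(causes, key=Cause.relevance, reverse=True):
    print(f"{cause.relevance()}/4  {cause.description}")
```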
Latent defects ignored
Bad metaphors breed bad decisions. What might actually happen if a software team internalizes the idea that "software never fails"?
If a team thinks software never fails, they might not allow themselves to believe their software could contain latent defects. They might think, "It works today, so it will always work." In their false confidence, they might skip testing, breeze through code review, dismiss bug reports, or skip error handling. This may let them ship quickly at first, only to get bogged down fixing bugs later. In the meantime, hopefully those bugs don't cause harm to anyone.
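As a hypothetical example of the kind of latent defect such a team waves through: the cache below works in every demo and review, but the missing eviction is a fault that stays silent until scale turns it into a failure. The function and cache are invented for illustration.

```python
from typing import Callable

# A latent defect: nothing here misbehaves today.
_user_cache: dict[int, dict] = {}

def get_user(user_id: int, fetch: Callable[[int], dict]) -> dict:
    # Looks fine in review and passes every test that uses a handful of users.
    if user_id not in _user_cache:
        _user_cache[user_id] = fetch(user_id)
    return _user_cache[user_id]

# The fault: entries are never evicted. With 50 users it is invisible;
# with 50 million it exhausts memory. "It works today" says nothing about
# the defect already sitting in the code, waiting for its trigger.
print(get_user(1, lambda uid: {"id": uid, "name": "example"}))
```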
Fix: Use better metaphors
In It's Not Tech Debt. It's Tech Risk, I show that the metaphors we use shape how seriously we take software problems.
Eroded alignment with nonexperts
The phrase "software never fails" communicates the wrong thing to non-practitioners. When a politician or mid-level manager hears "X never fails", they hear, "We never have to think about that." They don't share your nuanced understanding. This mindset could lead to poor business decisions, for example when leadership says "ship it" or downsizes the QA team. More broadly, it could lead to public harm when an elected or appointed official makes a misguided public policy decision.
Fix: Use different words
Let your word choices build, not erode, alignment between you and decision makers. For guidance on translating technical risks into terms leadership will prioritize, see my post Tech Risk is Business Risk.
A professional structural engineer would never say "bridges never fail". Instead of propagating incorrect absolute statements like "software never fails", society should train and recognize professional software engineers. My post It's Time to License Software Engineering argues the details.
What to do instead
Change your mindset
When analyzing software failures:
- Use precise definitions of failure that are appropriate for software.
- Distinguish between design and implementation. Remember validation and verification.
- Use my 4-part filter for relevant causes.
- Use metaphors that discourage hubris and language that communicates the right things to leaders.
Concrete actions
- Pretend your next commit message or PR description will be printed on the front page of the New York Times. How will your word choices come across to your manager, CEO, elected government officials, or to the public?
- In your next code review this week, identify one assumption about the environment that could become invalid. Are there places where the code assumes "this will always work"?
- Bookmark this page. The next time a postmortem goes off the rails, share it.
- Start a discussion. In Slack, Teams, or another collaboration tool, ask, "How do we distinguish design errors from implementation errors?"
Summary
Saying a program can't fail because its deployed bytecode doesn't change is like saying the Francis Scott Key Bridge didn't fail because its steel molecules obeyed the laws of physics. One might object, "the bridge was destroyed by an external force, not internal failure!", and this is exactly my point about software. Programs eventually experience external forces that their designers didn't anticipate. If we excuse the bridge because it wasn't designed to withstand a collision with a modern container ship, then we must equally excuse software that fails when its environment changes. The question is how designs can anticipate and accommodate these stresses.
References
1. Francis Scott Key Bridge Collapse
2. Software Never Fails
3. Hyatt Regency walkway collapse
4. Trial by Fire: The crash of Aeroflot flight 1492
5. Presbyopia
Related reading
- When the Machine Metaphor Breaks Down. How software failures differ from hardware failures
- A Design is a Mold for Code. Why design precedes and shapes code, not the reverse
- It's Not Tech Debt. It's Tech Risk. How the metaphors we use shape how seriously we take software problems
- Tech Risk is Business Risk. Translating technical risks into terms leadership will prioritize
- It's Time to License Software Engineering. Why society should train and recognize professional software engineers