When the Machine Metaphor Breaks Down

Doug Slater

While it's rare to see a sixty-year-old car still running, COBOL is 65 years old and still processes 95% of today's ATM transactions1. The world is full of critical infrastructure running on ancient software, so it's important to understand how society can keep software working for a long time.

Computer people have an old tradition of calling computers machines:

  • The ACM is the Association for Computing Machinery
  • Alan Turing's paper "Computing Machinery and Intelligence" introduced the Turing test
  • We all recognize phrases like "machine code", "machine learning", "virtual machine", "bare metal".

Edsger Dijkstra was an influential computer scientist and highly active in the ACM, but he took issue with this organization's name. In 1988 he wrote2 that computers represent a "radical novelty" and "sharp discontinuity" compared to anything before. Computers are not better typewriters.

It's human to try to understand new things in terms of things we already understand, but programs are very different from machines. When we try to apply models of hardware failure to software, we risk not addressing software's unique failure modes.

How software and hardware fail differently

Entropy spares nothing. A ball bearing wears out over time, and though a program never deteriorates, it can become obsolete in a changing world: just think of Adobe Flash or Java applets. Software developers are often asked to keep old programs working, but even skillful and careful maintenance can make code messy and hard to understand.

A chart showing failure rates over time for hardware and software

Failure rates over time for hardware and software. Software looks very different than hardware.

A model of hardware failure

The blue line in the chart above looks like a bathtub, which is why it's called a bathtub curve3. It shows the lifespan of a typical machine. In the early days, a lot of devices experience "infant mortality". Perhaps a batch has a manufacturing defect or is damaged in transit. For those that survive, there is a long steady period of useful life. Then, the devices fail more frequently as their physical condition deteriorates.
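
If you want to see that shape emerge from a formula, here is a minimal sketch (my own illustration, not the source of the chart above) that models a bathtub-shaped failure rate as the sum of three Weibull hazard terms: a falling infant-mortality term, a constant useful-life term, and a rising wear-out term.

    # Bathtub-shaped hazard rate as a sum of three Weibull hazards.
    # shape < 1 gives a falling rate (infant mortality), shape = 1 a
    # constant rate (useful life), shape > 1 a rising rate (wear-out).
    import numpy as np

    def weibull_hazard(t, shape, scale):
        return (shape / scale) * (t / scale) ** (shape - 1)

    t = np.linspace(0.1, 10, 100)  # time in arbitrary units, e.g. years
    bathtub = (
        weibull_hazard(t, 0.5, 1.0)    # infant mortality: falling rate
        + weibull_hazard(t, 1.0, 5.0)  # useful life: constant rate
        + weibull_hazard(t, 6.0, 9.0)  # wear-out: rising rate
    )

    for year in (0.1, 1.0, 3.0, 6.0, 9.0):
        i = int(np.argmin(np.abs(t - year)))
        print(f"t = {year:4}: failure rate = {bathtub[i]:.2f}")

The parameters here are arbitrary; the point is only that the combined rate starts high, flattens out, and rises again, just like the lifespan of a typical machine.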

For example, consider hard disks. Not solid state drives, but the old kind with spinning platters:

A hard disk drive

A hard disk drive. The top platter and actuator arm are visible.

A hard disk drive has software running on it called "firmware", a term of art for software that interacts with hardware and will rarely be updated in the field. Hard disks also have lots of physical parts that can break: motors, bearings, actuator arms, and platters. If any of these fail, the disk has failed and you can't read or write data. Most disk manufacturers publish a number called MTBF4 (mean time between failures), a statistical estimate of how long you can expect a disk to keep working. It's reasonable to expect a hard disk to survive continuous service for about three years. If you keep using it after that, you're on borrowed time because you're on the right side of the bathtub curve.
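
To make the MTBF idea concrete, here is a rough sketch of what such a figure implies if you assume a constant failure rate (an exponential lifetime model, which only describes the flat bottom of the bathtub). The 1,200,000-hour MTBF below is a made-up illustrative value, not a quote from any manufacturer's datasheet.

    # What a published MTBF implies under a constant-failure-rate
    # (exponential) model. The MTBF value here is hypothetical.
    import math

    mtbf_hours = 1_200_000         # hypothetical manufacturer MTBF
    hours_per_year = 24 * 365
    failure_rate = 1 / mtbf_hours  # failures per drive-hour

    annualized_failure_rate = 1 - math.exp(-failure_rate * hours_per_year)
    p_survive_3_years = math.exp(-failure_rate * 3 * hours_per_year)

    print(f"Annualized failure rate: {annualized_failure_rate:.2%}")           # ~0.73%
    print(f"Chance of surviving 3 years of service: {p_survive_3_years:.1%}")  # ~97.8%

The exponential model says nothing about wear-out, which is exactly why a drive past its third year of continuous service has drifted onto the right side of the bathtub curve.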

A model of software failure

Looking back at the chart, the red line tells a story of software failure rates over time.

Note

The term "software" is ambiguous. It can refer to a program or to its source code. A program doesn't wear out or deteriorate, but its code can. Here, we're looking at a program as it evolves: deployed repeatedly over time from a changing codebase.

Stabilizing Phase

As with hardware, there is an "infant mortality" phase. New programs tend to be buggy. Customers complain and the team scrambles to fix the bugs. Once the bugs are fixed, the program enters a stable phase with relatively few failures.

Maintenance Phase

It continues that way until the boss walks in and says, "We need to add Feature X by Monday." The software team rushes to ship Feature X by Monday. They succeed, but in their rush they add some new bugs. (This is why I avoid revision-zero software; e.g., I will skip macOS 15.7.0 and wait for 15.7.1.) Customers complain, the team scrambles and fixes the bugs, and the waters calm down.

A while later, NIST announces a critical security vulnerability in one of the code's dependencies. The team scrambles to update or replace the dependency. In doing so, they create new bugs. Customers complain, the team scrambles to fix them, and the waters settle down again.

This happens many more times, and all the while something ominous is happening: the "floor" of failures is rising. The team can never get the number of failures down to what it was in the beginning, before all of the new features and security patches. The code also doesn't feel as clean. The internal quality of the code is deteriorating. A pile of quick fixes has obscured the elegance of the original design. The code is messy and hard to understand. Engineers leave, and new ones arrive who haven't internalized the design.

End-of-life Phase

Eventually maintenance becomes too expensive, and the program doesn't receive any more updates. It is declared deprecated or out of support and is eventually removed from service.

Entropy affects software differently than hardware

Notice that while hardware fails due to changes in its internal state, a program succumbs to entropy even if it doesn't change. Its shifting environment does the job. Maintaining software to accommodate its shifting environment postpones its demise but does not prevent it. As one of the greats in computing wrote:

Program maintenance is an entropy-increasing process, and even its most skillful execution only delays the subsidence of the system into unfixable obsolescence.

-- Fred Brooks, The Mythical Man-Month, 1975

How software and hardware fail alike

Software and hardware share some common notions of failure.

The tree swing cartoon

We've all seen this cartoon before. It takes outstanding communication to keep the spec, implementation, and user expectations in sync. This is exacerbated in situations where the user is not the customer or the designer is not the implementer.

Failure to meet requirements

A requirements-centric view of failure says that a system fails when it deviates from specified requirements.

  • The IEEE defines failure as "the inability of a system or component to perform its required functions within specified performance requirements."5 It distinguishes between a fault (what most of us would call a bug or defect) and a failure, which is the externally observable consequence.
  • If you have trouble distinguishing between fault and failure, consider latent faults: bugs that are present in the software but not yet causing any observable failures. (A small code sketch follows this list.)
    • Latent faults often exist from the beginning, but inspection (code review, for software) and testing don't reveal them.
    • Examples for hardware:
      • The O-rings of the Challenger solid rocket boosters had a latent fault that caused them to deviate from their required elasticity at low temperatures, resulting in the failure of the seal to contain hot combustion gases.
      • The rivets on the Titanic had a latent fault: they exceeded the required maximum level of slag impurity. This made them brittle in the freezing North Atlantic waters. When ice contacted the hull plates, the rivets sheared rather than deformed, a failure which caused ingress of seawater.
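
A software analogue, as a minimal hypothetical sketch (not drawn from any real codebase): the fault is present from the first release, but no failure is observed until an input arrives that testing never exercised.

    # Hypothetical latent fault: the defect exists from day one, but it only
    # becomes a visible failure when an empty batch finally shows up.
    def average(values):
        return sum(values) / len(values)   # fault: crashes when values is empty

    print(average([3, 4, 5]))   # every input seen in testing: works fine
    print(average([]))          # months later, an empty batch arrives:
                                # ZeroDivisionError -- the fault becomes a failure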

Failure to meet expectations

A user-centric view of failure says that a system fails if it does not meet user expectations.

  • This broader view pits the end-user's expectations against the design and implementation.
  • This kind of failure is mitigated with early and iterative validation, by asking users to try it and give feedback.
  • For software:
    • The design may have said "make the button green", and the button background may factually be #00FF00, but if the user was expecting blue, the software has failed.
  • For hardware:
    • The original Segway was a well-functioning self-balancing vehicle. People expected it to improve their urban transit. Instead, they got expensive scooters that attract ridicule. The device failed to meet customer expectations. Segway as a brand now mostly sells conventional scooters, go-karts, and other e-vehicles.
    • The Concorde jet met its spec: it achieved supersonic commercial flight. Passengers expected fast, worldwide travel. Instead they got limited routes and extremely expensive fares. The jet failed to meet flyer expectations. The last one was made in 1979.

The processes for identifying these two kinds of failures are often called verification ("Did we build the thing right?") and validation ("Did we build the right thing?").

Failure to avoid hazard or loss

There is a third, safety-centric view which says that a system fails if it contributes to a hazard or loss.

  • This perspective is more common in safety-critical systems and public infrastructure.
  • For software:
    • In 2009, software defects were blamed for sudden unintended acceleration in Toyota vehicles. The software contributed to a dangerous safety situation and to an expensive, reputation-damaging recall. It's a semantic squabble whether the software failed or functioned as designed. The salient event is the hazard: the vehicles accelerated unintentionally.
  • For hardware:
    • We say a bridge failed if it collapses, even if the cause was a container ship colliding with it, a load the bridge was never required or expected to withstand.
    • The towers of the World Trade Center collapsed after airliners flew into them and the ensuing inferno weakened their steel structures. The doomed occupants may have hoped the buildings would stay up, but nobody expected it, and no building code required it. Still, nobody challenges a statement like, "the structure failed."
    • For space shuttle Challenger, the failure from this perspective is not the seal but the loss of crew and vehicle.
    • For Titanic, the failure from this perspective is not the rivets or hull plating but the hazard of the ship flooding with seawater and the resulting tragic loss of life.

Summary

At any instant in time, software and hardware fail for the same reasons:

  • Sins of omission and commission: The system was required to do something it didn't, or it did something it was required not to.
  • Unmet expectations: The user was surprised the system did or didn't do something.
  • Compromised safety: A hazard or loss occurred and the system was involved, even if it performed as required and expected.

Over time, software and hardware fail for different reasons:

  • Hardware fails as its internal physical condition deteriorates.
  • A program fails as its environment changes, invalidating its assumptions and preconditions.
  • A codebase fails as its design integrity dissolves and defect count rises.

In a future post, I plan to write more about software failures. See you soon.

References

  1. IBM: What is COBOL?
  2. Dijkstra: On the cruelty of really teaching computing science
  3. Wikipedia: Bathtub curve
  4. Wikipedia: Mean Time Between Failures
  5. IEEE 610.12 Standard Glossary of Software Engineering Terminology
