When the Machine Metaphor Breaks Down

Doug Slater

While it's rare to see a sixty-year-old car still running, COBOL is 65 years old and still processes 95% of today's ATM transactions1. The world is full of critical infrastructure running on ancient software, so it's important to understand how society can keep software working for a long time.

Computer people have an old tradition of calling computers machines:

  • The ACM is the Association for Computing Machinery
  • Alan Turing's paper "Computing Machinery and Intelligence" introduced the Turing test
  • We all recognize phrases like "machine code", "machine learning", "virtual machine", "bare metal".

Edsger Dijkstra was an influential computer scientist and highly active in the ACM, but he took issue with this organization's name. In 1988 he wrote2 that computers represent a "radical novelty" and "sharp discontinuity" compared to anything before. Computers are not better typewriters.

It's human to try to understand new things in terms of things we already understand, but programs are very different from machines. When we try to apply models of hardware failure to software, we risk not addressing software's unique failure modes.

How software and hardware fail differently

Entropy spares nothing. A ball bearing wears out over time, and though a program never deteriorates, it can become obsolete in a changing world: just think of Adobe Flash or Java applets. Software developers are often asked to keep old programs working, but even skillful and careful maintenance can make code messy and hard to understand.

A chart showing failure rates over time for hardware and software

Failure rates over time for hardware and software. Software looks very different than hardware.

A model of hardware failure

The blue line in the chart above looks like a bathtub, which is why it's called a bathtub curve3. It shows the lifespan of a typical machine. In the early days, a lot of devices experience "infant mortality". Perhaps a batch has a manufacturing defect or is damaged in transit. For those that survive, there is a long steady period of useful life. Then, the devices fail more frequently as their physical condition deteriorates.
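
If you want to see that shape emerge from a formula, here is a minimal sketch (my own illustration, not the source of the chart above) that models a bathtub-shaped failure rate as the sum of three Weibull hazard terms: a falling infant-mortality term, a constant useful-life term, and a rising wear-out term.

    # Bathtub-shaped hazard rate as a sum of three Weibull hazards.
    # shape < 1 gives a falling rate (infant mortality), shape = 1 a
    # constant rate (useful life), shape > 1 a rising rate (wear-out).
    import numpy as np

    def weibull_hazard(t, shape, scale):
        return (shape / scale) * (t / scale) ** (shape - 1)

    t = np.linspace(0.1, 10, 100)  # time in arbitrary units, e.g. years
    bathtub = (
        weibull_hazard(t, 0.5, 1.0)    # infant mortality: falling rate
        + weibull_hazard(t, 1.0, 5.0)  # useful life: constant rate
        + weibull_hazard(t, 6.0, 9.0)  # wear-out: rising rate
    )

    for year in (0.1, 1.0, 3.0, 6.0, 9.0):
        i = int(np.argmin(np.abs(t - year)))
        print(f"t = {year:4}: failure rate = {bathtub[i]:.2f}")

The parameters here are arbitrary; the point is only that the combined rate starts high, flattens out, and rises again, just like the lifespan of a typical machine.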

For example, consider hard disks. Not solid state drives, but the old kind with spinning platters:

A hard disk drive

A hard disk drive. The top platter and actuator arm are visible.

A hard disk drive has software running on it called "firmware", a term of art for software that interacts with hardware and will rarely be updated in the field. Hard disks also have lots of physical parts that can break: motors, bearings, actuator arms, and platters. If any of these fail, the disk has failed and you can't read or write data. Most disk manufacturers publish a number called MTBF4 (mean time between failures), a statistical estimate of how long you can expect a disk to keep working. It's reasonable to expect a hard disk to survive continuous service for about three years. If you keep using it after that, you're on borrowed time because you're on the right side of the bathtub curve.
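
To make the MTBF idea concrete, here is a rough sketch of what such a figure implies if you assume a constant failure rate (an exponential lifetime model, which only describes the flat bottom of the bathtub). The 1,200,000-hour MTBF below is a made-up illustrative value, not a quote from any manufacturer's datasheet.

    # What a published MTBF implies under a constant-failure-rate
    # (exponential) model. The MTBF value here is hypothetical.
    import math

    mtbf_hours = 1_200_000         # hypothetical manufacturer MTBF
    hours_per_year = 24 * 365
    failure_rate = 1 / mtbf_hours  # failures per drive-hour

    annualized_failure_rate = 1 - math.exp(-failure_rate * hours_per_year)
    p_survive_3_years = math.exp(-failure_rate * 3 * hours_per_year)

    print(f"Annualized failure rate: {annualized_failure_rate:.2%}")           # ~0.73%
    print(f"Chance of surviving 3 years of service: {p_survive_3_years:.1%}")  # ~97.8%

The exponential model says nothing about wear-out, which is exactly why a drive past its third year of continuous service has drifted onto the right side of the bathtub curve.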

A model of software failure

Looking back at the chart, the red line tells a story of software failure rates over time.

Note

The term "software" is ambiguous. It can refer to a program or to its source code. A program doesn't wear out or deteriorate, but its code can. Here, we're looking at a program as it evolves: deployed repeatedly over time from a changing codebase.

Stabilizing Phase

As with hardware, there is an "infant mortality" phase. New programs tend to be buggy. Customers complain and the team scrambles to fix the bugs. Once the bugs are fixed, the program enters a stable phase with relatively few failures.

Maintenance Phase

It continues that way until the boss walks in and says, "We need to add Feature X by Monday." The software team rushes to ship Feature X by Monday. They succeed, but in their rush they add some new bugs. (This is why I avoid revision-zero software; e.g., I will skip macOS 15.7.0 and wait for 15.7.1.) Customers complain, the team scrambles and fixes the bugs, and the waters calm down.

A while later, NIST announces a critical security vulnerability in one of the code's dependencies. The team scrambles to update or replace the dependency. In doing so, they create new bugs. Customers complain, the team scrambles to fix them, and the waters settle down again.

This happens many more times, and all the while something ominous is happening: the "floor" of failures is rising. The team can never get the number of failures down to what it was in the beginning, before all of the new features and security patches. The code also doesn't feel as clean. The internal quality of the code is deteriorating. A pile of quick fixes has obscured the elegance of the original design. The code is messy and hard to understand. Engineers leave, and new ones arrive who haven't internalized the design.

End-of-life Phase

Eventually maintenance becomes too expensive, and the program doesn't receive any more updates. It is declared deprecated or out of support and is eventually removed from service.

Entropy affects software differently than hardware

Notice that while hardware fails due to changes in its internal state, a program succumbs to entropy even if it doesn't change. Its shifting environment does the job. Maintaining software to accommodate its shifting environment postpones its demise but does not prevent it. As one of the greats in computing wrote:

Program maintenance is an entropy-increasing process, and even its most skillful execution only delays the subsidence of the system into unfixable obsolescence.

-- Fred Brooks, The Mythical Man-Month, 1975

How software and hardware fail alike

Software and hardware share some common notions of failure.

The tree swing cartoon

We've all seen this cartoon before. It takes outstanding communication to keep the spec, implementation, and user expectations in sync. This is exacerbated in situations where the user is not the customer or the designer is not the implementer.

Failure to meet requirements

A requirements-centric view of failure says that a system fails when it deviates from specified requirements.

  • The IEEE defines failure as "the inability of a system or component to perform its required functions within specified performance requirements."5 It distinguishes between a fault (what most of us would call a bug or defect) and a failure, which is the externally observable consequence.
  • If you have trouble distinguishing between fault and failure, consider latent faults: bugs that are present in the software but not yet causing any observable failures. (A small code sketch follows this list.)
    • Latent faults often exist from the beginning, but inspection (code review, for software) and testing don't reveal them.
    • Examples for hardware:
      • The O-rings of the Challenger solid rocket boosters had a latent fault that caused them to deviate from their required elasticity at low temperatures, resulting in the failure of the seal to contain hot combustion gases.
      • The rivets on the Titanic had a latent fault: they exceeded the required maximum level of slag impurity. This made them brittle in the freezing North Atlantic waters. When ice contacted the hull plates, the rivets sheared rather than deformed, a failure which caused ingress of seawater.
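
A software analogue, as a minimal hypothetical sketch (not drawn from any real codebase): the fault is present from the first release, but no failure is observed until an input arrives that testing never exercised.

    # Hypothetical latent fault: the defect exists from day one, but it only
    # becomes a visible failure when an empty batch finally shows up.
    def average(values):
        return sum(values) / len(values)   # fault: crashes when values is empty

    print(average([3, 4, 5]))   # every input seen in testing: works fine
    print(average([]))          # months later, an empty batch arrives:
                                # ZeroDivisionError -- the fault becomes a failure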

Failure to meet expectations

A user-centric view of failure says that a system fails if it does not meet user expectations.

  • This broader view pits the end-user's expectations against the design and implementation.
  • This kind of failure is mitigated with early and iterative validation, by asking users to try it and give feedback.
  • For software:
    • The design may have said "make the button green", and the button background may factually be #00FF00, but if the user was expecting blue, the software has failed.
  • For hardware:
    • The original Segway was a well-functioning self-balancing vehicle. People expected it to improve their urban transit. Instead, they got expensive scooters that attract ridicule. The device failed to meet customer expectations. Segway as a brand now mostly sells conventional scooters, go-karts, and other e-vehicles.
    • The Concorde jet met its spec: it achieved supersonic commercial flight. Passengers expected fast, worldwide travel. Instead they got limited routes and extremely expensive fares. The jet failed to meet flyer expectations. The last one was made in 1979.

The processes for identifying these two kinds of failures are often called verification ("Did we build the thing right?") and validation ("Did we build the right thing?").

Failure to avoid hazard or loss

There is a third, safety-centric view which says that a system fails if it contributes to a hazard or loss.

  • This perspective is more common in safety-critical systems and public infrastructure.
  • For software:
    • In 2009, software defects were blamed for sudden unintended acceleration in Toyota vehicles. The software contributed to a dangerous safety situation and to an expensive, reputation-damaging recall. It's a semantic squabble whether the software failed or functioned as designed. The salient event is the hazard: the vehicles accelerated unintentionally.
  • For hardware:
    • We say a bridge failed if it collapses, even if the cause was a container ship colliding with it, a load the bridge was never required or expected to withstand.
    • The towers of the World Trade Center collapsed after airliners flew into them and the ensuing inferno weakened their steel structures. The doomed occupants may have hoped the buildings would stay up, but nobody expected it, and no building code required it. Still, nobody challenges a statement like, "the structure failed."
    • For space shuttle Challenger, the failure from this perspective is not the seal but the loss of crew and vehicle.
    • For Titanic, the failure from this perspective is not the rivets or hull plating but the hazard of the ship flooding with seawater and the resulting tragic loss of life.

Summary

At any instant in time, software and hardware fail for the same reasons:

  • Sins of omission and commission: The system was required to do something it didn't, or it did something it was required not to.
  • Unmet expectations: The user was surprised the system did or didn't do something.
  • Compromised safety: A hazard or loss occurred and the system was involved, even if it performed as required and expected.

Over time, software and hardware fail for different reasons:

  • Hardware fails as its internal physical condition deteriorates.
  • A program fails as its environment changes, invalidating its assumptions and preconditions.
  • A codebase fails as its design integrity dissolves and defect count rises.

In a future post, I plan to write more about software failures. See you soon.

References

  1. IBM: What is COBOL?
  2. Dijkstra: On the cruelty of really teaching computing science
  3. Wikipedia: Bathtub curve
  4. Wikipedia: Mean Time Between Failures
  5. IEEE 610.12 Standard Glossary of Software Engineering Terminology
