Trains, Elevators, and Computer Science

February 8, 2010

A possible method for making systems safe

George Westinghouse was not a theoretician, but he was one of the great inventors of the 1800s. He is perhaps most famous for the invention of the train air-brake in 1869. More on this in a moment.

Today I plan on talking about generalizations of Westinghouse’s ideas, and a role that they might play in computer science.

Train Brakes

The first trains, called wagonways, were used in Germany as early as 1550. The power to move them was supplied by horses until 1804, when Richard Trevithick, funded by Samuel Homfray, used a steam engine to haul 10 tons of iron and 70 men for 9 miles. This was the beginning of modern trains.

Almost immediately a critical problem developed: how to stop a train? Stopping the engine alone was insufficient once trains became long and moved fast; there was too much momentum for the engine, by itself, to stop a train. The first method used was simple: have an operator assigned to each car—they would pull on a hand brake when signaled to do so by the engineer. This quickly gave way to the direct air-brake system. In this system when the engineer wished to stop the train, he opened a valve that sent compressed air to each car—the air forced a brake against the car’s wheels. The train then stopped. Pretty neat.

Almost immediately another critical problem developed: how to stop a train reliably? The direct air-brakes worked well, when they worked. But if the compressed air tank was empty, or the pressure was too low, or there was a leak in the lines, the brakes would fail. Then the train would not stop. This was a problem.

Westinghouse’s genius was to solve this by inventing the reverse air-brake system. His brilliant idea was to use compressed air to make the train go, not to stop the train. Here is how his system worked. Each car had brakes that were held against the car’s wheels by a strong spring. In this position the train could not move. If the engineer wanted the train to move, he released compressed air that forced the brakes away from the wheels of all the cars. This allowed the train to move.

This, I think, is extremely clever. Notice that if the compressed air tank was empty, or the pressure was too low, or there was a leak in the lines, the train would stop. The brakes would not fail, since all the brakes would be forced against the wheels by the springs.

This system is still in use today. I was once on an Amtrak train going from D.C. to Princeton when it simply stopped in the middle of nowhere. It was late at night and we all wanted to get home, so someone asked the conductor, as he walked by us, what had happened. He answered in technical jargon:

The choo-choo she no go.

We later found out that the train had broken an air hose and all the brakes were pressed on. We sat there for about an hour until they got a new hose and restored the integrity of the air system.

Elevators

Elevators are even older than trains; they may date back to ancient times. Only in the middle of the 1800s, as tall buildings became possible, did elevator safety become a critical issue. As long as buildings were six stories or less, safety was not a major issue, although I would not want to be in an unsafe elevator even at this low height. However, in order to build tall buildings, elevators needed not only to be safe, but also to appear safe. Otherwise, people would be afraid to use them.

In 1853 Elisha Otis solved the safety problem of elevators. He invented a mechanism that would stop a falling elevator, even if its supporting ropes broke. The key was that as long as the rope was taut, the elevator’s brakes were held in; if the rope went slack, the brakes were released and would stop the fall.

Otis’ brilliant insight was not an instant success. He eventually realized that a live demonstration of the braking mechanism would be needed to make the public feel safe. At the first American world’s fair, in 1854, Otis built an open elevator shaft. Several times a day he would get on the elevator, be hoisted up, and then cut the rope. Since the shaft was open in front, all the spectators could see him do this. As the elevator began to fall, his mechanism would bring him to a safe, if sudden, stop. He is reported to have said each time:

All safe, gentlemen, all safe.

This live demonstration, seen by thousands, eventually made his company a success and made tall buildings possible.

The General Principle

I think there is a powerful principle at work in both the train and the elevator safety systems. Both are designed so that no positive action is required to reach safety. Instead, the safety is built into the physics of the system:

  • In the train case: no air in the line, then springs stop the train.
  • In the elevator case: no rope holding the elevator up, then brakes spring out and stop the elevator from falling.

The key principle seems to be: do not rely on an action, but on the structure of the system. Make the default, passive state a safe state, so that when the system fails, it reaches the safe state by default.
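A minimal sketch of the principle in code, in Python for illustration only: the function name and the pressure threshold below are hypothetical, not taken from any real brake controller. The point is that the "go" answer requires a positive, valid signal, and every failure mode (no reading, a garbage reading, a low reading) falls through to "stop" without any dedicated error-handling code.

```python
from typing import Optional

# Hypothetical threshold: the minimum line pressure (psi) that counts as a
# valid "release the brakes" signal. The number is illustrative only.
RELEASE_THRESHOLD_PSI = 70.0


def brakes_released(line_pressure_psi: Optional[float]) -> bool:
    """Return True only when a valid, sufficiently high reading is present.

    Every failure mode falls through to False (brakes applied):
    no reading at all, a NaN from a broken sensor, or a low reading
    from a leak. The "go" state has to be actively held.
    """
    if line_pressure_psi is None:
        return False
    # NaN compares False against everything, so garbage also keeps the brakes on.
    return line_pressure_psi >= RELEASE_THRESHOLD_PSI
```

Nothing in the function detects a failure; the safe answer is simply what is left when the enabling condition is absent.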

I have always liked these systems, and have often wondered if we could use the same type of passive methods to build better computing systems. Could we, for example, build a system that relies on a passive mechanism to stay safe from worm attacks? Is there a formal model of passive vs active systems that we could use to reason about whether such systems are even possible?

Open Problems

Is there any way to exploit the power of the passive methods used by Westinghouse and Otis to solve computer questions?

58 Comments
  1. mazenharake
    February 9, 2010 2:48 am

    I like the idea… very interesting thought! Definitely worth pondering. I think some would argue that they have already done it, though I guess it is a matter of how you define it all. I know I have made systems where the “default” is the “safe” choice… but as usual it is really hard to map reality to software 🙂

  2. February 9, 2010 5:34 am

    What the examples you’re pointing to here do is turn the no-energy state into a stable state. Without the particular inventions you’ve mentioned, the objects mentioned would be subject to free-body physics and all sorts of gravity fun times.

    I think atomicity in database updates is a good analogy in the computer world; the system is trying to ensure it is always in a stable state, rather than caught in between which could corrupt and/or invalidate data (i.e., run over a cow or crush a human with g-force).
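    As a file-level illustration of this kind of atomicity, here is a minimal Python sketch (the function name is hypothetical) of the common write-to-a-temporary-file-then-rename pattern. Readers see either the complete old state or the complete new state; an exception part-way through leaves the old file in place. It relies on `os.replace`, which the Python documentation describes as atomic on POSIX systems when source and destination are on the same filesystem.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Replace the file at `path` with `data` so that readers see either the
    old contents or the new contents, never a half-written mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory (same filesystem),
    # so the final rename can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            # Push the new contents to disk before they become visible.
            os.fsync(tmp.fileno())
        # Atomic on POSIX: the old file is swapped for the new one in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure: discard the temp file; the original file is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```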

  3. Jack
    February 9, 2010 5:55 am

    From the user perspective, Westinghouse and Otis created the train’s and elevator’s Blue Screen of Death?

  4. Hilco
    February 9, 2010 6:18 am

    Interesting post about what seems to be related to self-stabilizing systems.

    Whitelisting seems to be an example of what you’re thinking of. To take dispatch tables as an example, by requiring code – such as the name of a command – to be validated before it can be run, there is always a transition from a default passive state (no code validated and running) to an active state (code validated and (possibly) running) with a proper check in place.

    All this is somewhat similar to my preference for systems that simply have no error states and where every state is a valid and acceptable one, producing a valid result for any input.
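    The whitelisting idea can be sketched as a dispatch table in Python (the command names and handlers are made up for illustration). Nothing runs unless the lookup succeeds, so an unknown or mistyped command lands in the refusal branch by default, instead of relying on a blacklist that anticipates every bad input.

```python
from typing import Callable, Dict, List


# Hypothetical handlers; only names registered in ALLOWED_COMMANDS can ever run.
def cmd_status(args: List[str]) -> str:
    return "ok"


def cmd_echo(args: List[str]) -> str:
    return " ".join(args)


ALLOWED_COMMANDS: Dict[str, Callable[[List[str]], str]] = {
    "status": cmd_status,
    "echo": cmd_echo,
}


def dispatch(name: str, args: List[str]) -> str:
    """Run a command only if it is whitelisted; anything else is refused.

    The passive outcome is refusal: a handler runs only after a successful
    lookup in the whitelist.
    """
    handler = ALLOWED_COMMANDS.get(name)
    if handler is None:
        raise PermissionError(f"command not allowed: {name!r}")
    return handler(args)
```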

  5. February 9, 2010 6:27 am

    The problem with passive is easy denial of service – you just cut a simple air line to immobilize entire train.

    And in computer security, especially in the case of worm infections, the human is the weakest link.

    – Will you please pump compressed air in the system? I’ll give you nice smileys for this!
    – Sure do!

  6. Kit
    February 9, 2010 7:46 am

    Good post.

    The problem is that in software systems, what you describe as passive states are represented fundamentally by the content of memory locations. In this respect, there is not really much difference between a passive state and any other type of state. All are what information is currently held about the system.

    One option is correct implementation of pre-/post-conditions and invariants, to prevent memory achieving an invalid state. I had an idea of a system once where if constraint checking failed, a h/w mechanism (held off in normal operation) would kick in to load a default safe passive system state pre-defined by the designer.

    Back in the real world, 2 things that can help here:
    1. the tools (languages, compilers, etc) can guide us in controlling system state correctly, eg by improving immutability, thread safety, etc. Though there is the side effect of us needing to learn how to use them properly.

    2. Obviously, where software interacts with the physical world, the physical systems can be built passively safe, with the software enabling the active state.

  7. Jacques Chester
    February 9, 2010 7:53 am

    I think that we software engineers sometimes fetishise bridge makers, railroad engineers, builders and the like without remembering that these professions too have dark and dangerous histories filled with examples of lackadaisical, “it’s good enough” engineering.

    For thousands of years it was not possible to predict if a building would stay up at all except by rules of thumb. We’re at about that stage in software engineering — still waiting for our Newton.

    • rjlipton
      February 9, 2010 8:46 am

      Good point. Even today bridges still fail and buildings do collapse. At GIT we had a parking garage pancake down on itself, for no apparent reason. Luckily no one was hurt, although many cars were destroyed.

  8. Jacques Chester
    February 9, 2010 7:55 am

    Incidentally, other engineering disciplines call the air-brake and Otis brake “fail safe” systems. When they fail, they default to a “safe” state.

  9. February 9, 2010 8:12 am

    Nice article… Fundamentally, the problem we have with building computer systems is that the entire field spans hardly a century. Heck, we haven’t even agreed upon what programming language to use uniformly.
    If this discovery had happened in today’s world, you would have Otis and Westinghouse filing patents for each component. No one else could ever repeat this and would be relegated to alternate approaches. We have succumbed to an economy driven more by capitalists than by scientists or inventors. You cannot foster individuality or creativity by assigning a bottom-line target.
    Lastly, we haven’t even agreed upon what programming language to use uniformly, let alone standardization. We do have similar models in financial institutions, ATMs for example. A transaction is initially flagged with a failure flag. Only the target account (assuming you are withdrawing, your account would be the target) has the option to change this flag to success. This happens only when the transaction succeeds (that is, when money has been removed from the bank and deposited into your account).
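    The ATM scheme described above can be sketched as a record that is born in the failure state (Python; the class and function names are hypothetical, and real banking systems are far more involved). Success has to be written explicitly by the receiving side; if anything goes wrong before that, the record simply stays failed.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    FAILED = "failed"
    SUCCEEDED = "succeeded"


@dataclass
class Withdrawal:
    amount: int
    # Born in the failure state: nothing has to go right for it to stay there.
    status: Status = Status.FAILED


def confirm_received(w: Withdrawal, amount_received: int) -> None:
    """Called only by the receiving (target) side once the money has arrived.

    Only an exact, confirmed receipt flips the flag; any other outcome,
    including this function never being called, leaves it FAILED.
    """
    if amount_received == w.amount:
        w.status = Status.SUCCEEDED
```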

  10. Dude375
    February 9, 2010 8:18 am

    1. Embedded systems often have a “heartbeat” function. In normal operation, firmware causes the CPU to output a slow sequence of alternating 1’s and 0’s (a square wave) on one of its pins. The “heartbeat”. External hardware detects this square wave. If/when the heartbeat stops (because the software has gone haywire / frozen / infinite loop / etc), the external hardware asserts the CPU’s Reset pin. When the heart stops, activate the defibrillator.

    2. Computer instruction sets are frequently designed so that all-zeroes is a harmless operation, often “No-Op” or “return to supervisor (the OS)”. Then if a program erroneously jumps to somewhere it shouldn’t, it will most often be a big block of all-zeroes memory. “Executing” this is rendered harmless, or at least, less harmful.
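    The heartbeat in point 1 is essentially a watchdog timer. Here is a minimal Python model of its logic, assuming timestamps are passed in explicitly so that it is easy to test. A real system would read a monotonic clock, and the reset would be performed by external hardware, not by this code.

```python
class Watchdog:
    """Software model of a hardware watchdog (a sketch, not a driver).

    Healthy firmware must call kick() at least once every `timeout` seconds.
    If the kicks stop for any reason, expired() becomes True and the
    supervising hardware would assert the CPU's reset line.
    """

    def __init__(self, timeout: float, now: float) -> None:
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now: float) -> None:
        # The "heartbeat": proof of life from the monitored software.
        self.last_kick = now

    def expired(self, now: float) -> bool:
        # No news is bad news: silence alone is enough to trigger the reset.
        return now - self.last_kick > self.timeout
```

In use, the firmware's main loop would call `kick(time.monotonic())` on each healthy pass, and the supervisor would poll `expired(time.monotonic())`. No code path has to notice that something went wrong; the absence of the heartbeat is the signal.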

    • February 9, 2010 1:10 pm

      I’m not sure that #2 is an example of fail-safe here. Wouldn’t fail-safe be more like software-level ECC where the program has explicitly confirmed that each memory location is correct?

      The real question is “what are we trying to prevent?”. In the case of transportation devices such as elevators and trains, the answer is clear: we wish to prevent catastrophic mechanical failure resulting in damage to passengers and cargo. In the case of a general software program, there is not a universal “failure” to prevent. For some programs it might be better to misbehave for a little while, but keep running (safety sensitive) and for others it’s better to shutdown than to risk performing a single bad instruction (security sensitive). The notions of fail safe are useful, but it seems that they must be deeply tied to the application.

  11. February 9, 2010 9:12 am

    I thought passive safety systems were common knowledge among engineers. To add an extra example, that’s what keeps nuclear power plants safe nowadays. Dead man’s switch, anyone?

    The question is, how exactly are we supposed to apply this principle to software? I wish I knew.

    • rjlipton
      February 9, 2010 10:00 am

      I think this is the key question. How can we use this type of safety more? Perhaps it will be impossible to make a completely passive system, but I think we have gone in a completely different direction.

    • Mark
      February 9, 2010 10:40 am

      Keeping with trains: all modern diesels also have a dead man’s switch. The driver must acknowledge an alert within 10 seconds or the engine will slow and the brakes will apply.

  12. February 9, 2010 10:01 am

    Oh wow, I had absolutely no idea it worked that way.

  13. Victor Papa
    February 9, 2010 10:17 am

    Prof. Lipton, I study AE at Tech. We aircraft/system designers are always designing triple/quad redundant systems and trying to make systems whose probability of failure is remote (<10^-9). I'm pretty sure that an aircraft designer would have put in three or four air brake lines instead of the system that Westinghouse came up with. I had never before imagined that the failure of a key component could only make the system safer. This post was a fresh perspective for me, and I will keep it in mind for any future design work that I do. Thanks for the great post.

  14. Mark
    February 9, 2010 10:37 am

    Just to clarify- train brakes do not use a spring. Truck and bus brakes do- but trains do not.

    The brake line on a train is used to charge a reservoir tank on each car. When the line pressure drops- pressure from that reservoir tank is used to apply the brakes.

    The system is still mostly fail-safe. There are a couple of exceptions. Ice in the air line can block the pressure drop used to signal brake application to the cars (avoided with the use of an EOT- End of Train device- which can apply the brakes from the back of the train in addition to application from the engine).

    The other problem is if the brakes are applied and released repeatedly. In that case the pressure in the reservoir can fall so low (without sufficient time to recharge) that it is not high enough to supply enough pressure to stop the train. There are several other related issues- but it’s still not a perfectly fail safe system.

  15. February 9, 2010 10:57 am

    Another example that I like is the TRIGA nuclear reactor, which uses a fuel whose temperature coefficient of reactivity is large and negative. So if it heats up, it shuts down. Much nicer than relying on elaborate safety systems. I especially appreciate this design since one of these reactors sits in my building.

  16. Challenge
    February 9, 2010 11:26 am

    This is already a well understood design principle among software engineers and quality assurance specialists. It’s referred to as ‘graceful degradation,’ and the typical example given is an escalator. When an escalator breaks, it becomes stairs.

    There’s even a Wikipedia entry on it. See:

  17. Sam
    February 9, 2010 12:00 pm

    In quantum computing, this stability question is something we think about a lot. Classically, memory is stable. For example, you can think of magnetic memory as an array of tiny bits each of which can be 0 or 1, and an energy penalty for every pair of neighboring bits that disagree. In two dimensions, at low enough temperatures, such a system will be stable for a time that is exponential in the number of bits. (In one dimension, this no longer holds!)

    Quantumly, we want to encode a qubit instead of a bit. A self-stabilizing memory based on local interactions can be constructed in four dimensions. There are conjectured stable memories in three dimensions. There are partial impossibility results in two dimensions. The goal is to find a pattern of interactions that gives a low-dimensional stable memory, and then hopefully to find or engineer a physical material that implements those interactions.

  18. Jeff
    February 9, 2010 12:46 pm

    This is basically the idea of Fail-Safe design rather than Fail-Fire design. The Deadman’s Switch and Watchdog Timers are long used examples. The Cold War movie “Fail Safe” also illustrates the idea (and how even it can have a risk thanks to human frailties) in nuclear war.

    Other ideas in reliability design such as “diversity design” and ideas from cryptography such as zero-knowledge proofs and anonymous voting touch on these ideas as well.

    From a control systems theory view, removing an input puts the system into a convergent, dissipative attraction basin rather than a divergent or non-attraction-basin.

    You are right that “classical engineering”, or just what all engineering professions (EE, ME, CE, ChemE, etc.) normally teach and practice, can be useful for software and computer science.

    This gets much more fundamental to apply to software – even basic questions of synchronous vs. asynchronous software design have to be examined and analyzed in fundamental terms. I don’t think you can really do much without considering most software in a large asynchronous sense because nearly any threat involves the asynchronous world intruding on the naively simplistic world of synchronous software operation. This also gets into the hardware implementation as well as the software implementation – probably why there are such problems with language selection/implementation only approaches to this. I think this is part of the cause of so much churn and religious wars in software: the problem is usually improperly scoped or framed.

    As Alan Kay said: “Those who are truly serious about software also do hardware.”

  19. Koray
    February 9, 2010 2:00 pm

    In CS distributed algorithms that address certain failure types work in a self-stabilizing fashion (e.g. leader election, etc.)

    I was just reading

    It’s amazing for how long we (engineers) didn’t exactly know how a bike really worked…

  20. February 9, 2010 2:32 pm

    I think that a key word used in the original post is “physics”. Real-world engineering can always fall back to physics to prove that a system fails to a safe state. The engineer knows that when pressure in a line lessens, springs force brake to wheel because of physical properties that are fundamentally part of the real world. (Although as others pointed out, even then implementation details can gum up the works.) The same physical laws that prevent us from doing the things we want to do (e.g., hyper-light speed travel, useful cold fusion) also allow us to limit what can go wrong.

    Of course, software programs do not benefit from any such laws of physics. The closest thing we have to a law is the Turing Machine, and that only tells us that, in general, we can’t verify safety (or any other non-trivial property). The best that we can ever do is layer checks on top of checks, tests on top of tests, and *hope* that we did not miss anything.

    Occasionally, we get lucky in a specific domain and are able to specify a sufficiently simple system (e.g., a regular expression) that can be verified, but this kind of luck is pretty rare.

    I think systems like SPARK Ada are the best we ever can do, and ultimately the winning approach is to select critical systems, simplify them as much as possible by fighting creeping featuritis, and then implement them using these sorts of techniques. Even this won’t ensure fail safety, and it will be extremely expensive, but for critical systems, it’s the closest you will get to safe.

    Unfortunately, I think that the safety of general-purpose computation (e.g., desktop computation, the web, information systems) can’t ever be significantly better than it is now (in a silver bullet-sense), and it is, in general, a waste of time to try for anything other than minor improvements (e.g., static checkers for specific kinds of buffer overflows, domain-specific checks, firewalls).

  21. Bob Foster
    February 9, 2010 3:14 pm

    Two main sources of creativity are pleasure in noticing that things are unexpectedly alike and annoyance when you find they are unexpectedly different. Your question hits on both cylinders.

    Fail-safe systems don’t fall back to a default safe state when they fail so much as they are kept from the default safe state by what we consider normal operation. The default state is the initial state. Losing pressure doesn’t apply the brakes; pressure forces the brakes off the wheels.

    From that point of view, the previous comment about designing systems so that they don’t have error states is pertinent and productive. For example, in many programming languages the alternative to unsafe uninitialized variables is variables initialized to unsafe values, e.g., null. Safe systems do not have null pointers. Likewise, numeric overflow is an algorithmic failure (or not), but it shouldn’t be able to cause a system failure because the overflow raises an uncaught exception. Likewise, our programs typically index arrays with signed integers. Our software is not merely fragile, we are using eggs for carpets.

  22. Mark
    February 9, 2010 9:34 pm

    An example of this may be leasing systems, such as Tuple Space or DHCP. Take the DHCP example, the worst case is a total failure of the DHCP server. The result is no IPs are allocated. It is simply not possible to “run off the tracks.”

    A trivial example, but makes more sense when expanding to systems that, for example, spend money. If a lease must first be obtained for any amount to spend, the failure of any budgeting system doesn’t cause the system to wildly spend without limits — it causes the system to stop spending.
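    A minimal Python sketch of the spending-lease idea (the class, its fields, and the time units are hypothetical). Spending is possible only while an unexpired lease with remaining budget exists; if the service that renews leases dies, the lease runs out and spending stops on its own.

```python
class SpendingLease:
    """Hypothetical lease: permission to spend up to `limit` until `expires_at`.

    Nothing has to detect a failure of the budgeting service: if it stops
    issuing fresh leases, the current one expires and try_spend() refuses.
    """

    def __init__(self, limit: int, expires_at: float) -> None:
        self.remaining = limit
        self.expires_at = expires_at

    def try_spend(self, amount: int, now: float) -> bool:
        # Refuse by default: expired lease, over-budget, or nonsensical amount.
        if amount <= 0 or now >= self.expires_at or amount > self.remaining:
            return False
        self.remaining -= amount
        return True
```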

  23. Hagit Attiya
    February 10, 2010 12:57 am

    Very nice and educational post, as always.

    The engineering principle of preferring safety (nothing bad happens) over liveness (something good happens eventually) is often practiced in designing computer systems.

    In distributed systems, a good example is provided by consensus algorithms preferring not to terminate (decide on an outcome) when the system is asynchronous, but always guaranteeing agreement if they do terminate.
    This idea has several manifestations, like Lamport’s Paxos algorithm, DLS’s partial synchrony consensus, or Chandra and Toueg’s unreliable failure detectors.
    To some extent, this is one principle underlying Castro and Liskov’s BFT.

  24. Benjamin Smith
    February 10, 2010 1:02 am

    It’s pretty straightforward to set up code so that it’s inherently resilient. Note that I didn’t say it was EASY, just straightforward. Pay attention to “best practices”. For example, when writing a database app:

    1) Use constraints and foreign keys everywhere possible, in a fully normalized database. Know the definitions of terms like “atomic value”, “primary key”, and “check constraint”.

    2) Sanitize your inputs.

    3) Use transactions, especially when performing multiple updates.

    4) Find out what prepared statements are. Then only use prepared statements.

    5) Check for error states in every single database connection, and rollback the database transaction at the first sign of error.

    6) Have a comprehensive, well-tested, AUTOMATED backup system. Combine your backup and development environments so that, in providing data for developers and QA, you also confirm that your backup system is actually working, day in and day out.

    With all the above, you have a system that defaults to a safe point. You know your backups are working because they are tested every single work day. Uncommitted changes are rolled back in the event of any kind of verifiable error. Database constraints and foreign keys eliminate huge swaths of possible errors.

    It *seems* simple to have spring-loaded brakes that apply if/when the air hose breaks, but it introduces additional complexity. Software “best practices” is no different.
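    Several of the practices above (constraints, parameterized “prepared” statements, transactions, rollback at the first error) can be seen together in a small Python sketch using the standard-library `sqlite3` module. The schema and the `transfer` function are illustrative assumptions. Used as a context manager, a `sqlite3` connection commits if the block succeeds and rolls back if it raises, so a failed transfer leaves both balances untouched.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a toy in-memory accounts table (the schema is illustrative)."""
    conn = sqlite3.connect(":memory:")
    # CHECK constraint: a negative balance is an impossible state.
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts (name, balance) VALUES (?, ?)",
            [("alice", 100), ("bob", 0)],
        )
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst as one transaction.

    Returns True if committed, False if anything failed and the whole
    transaction was rolled back.
    """
    try:
        # Commits on success, rolls back on any exception raised inside.
        with conn:
            for name, delta in ((dst, amount), (src, -amount)):
                # "?" placeholders: values are bound, never pasted into the SQL text.
                cur = conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (delta, name),
                )
                if cur.rowcount != 1:
                    # Unknown account: abort instead of silently "succeeding".
                    raise LookupError(f"no such account: {name!r}")
        return True
    except (sqlite3.IntegrityError, LookupError):
        return False
```

The credit to `dst` is applied before the debit from `src`; when the debit then violates the CHECK constraint, the rollback also undoes the credit, so the database returns to its last safe point.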

    • February 10, 2010 11:53 am

      Except that’s so much effort, many of those measures never get implemented for various reasons, and even if you do implement them all, failure is still possible. Not even close to fail-safe. Unfortunately.

  25. February 10, 2010 2:09 am

    A passive fail system caused a lot of headaches during our rocket launch:

    responsive.pdf

  26. February 10, 2010 9:40 am

    I wonder if the recent Toyota case is relevant with this post.

  27. Robert
    February 10, 2010 8:07 pm

    In the software engineering area there are lots of examples. A journaling filesystem is supposed to be safe no matter when you turn the power off. Snapshot technology for databases, and even for full running systems, has been built into production systems since as early as the 1970s. Even at the level of detail there are lots of “failsafe” constructs used by programmers, e.g. guard constructs or exceptions. I think this also applies to theory, where a known upper bound might serve as a failsafe against which to check a new bound. It would be interesting to know what formal rules a model has to satisfy for some aspect of it to earn the intuitively appealing label “failsafe”. Is anyone here courageous enough to sketch one?

  28. Josh
    February 11, 2010 7:54 pm

    In the theoryCS world, this idea also arises in complexity theory when studying nondeterministic or conondeterministic machines. Amusingly, I can argue both ways:

    1) Nondeterministic machines default to “no,” because some computation path has to actually accept in order to get a “yes” answer.

    2) Nondeterministic machines default to “yes.” If even one computation path accepts, then any other computation path that wants to REJECT does not have the power to do so. It cannot accept, but that doesn’t cause the computation to reject.

    For complexity, I think the second point of view is more interesting, and is the cause of some difficulties when, say, constructing various oracles.
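    Both viewpoints can be seen in a toy Python model of nondeterministic acceptance. The `ComputationPath` type is a simplification made up for illustration: each path is just a function returning accept (True) or reject (False). With no accepting path the answer defaults to “no”, and a rejecting path has no way to veto an accepting one.

```python
from typing import Callable, Iterable

# Hypothetical simplification: one computation path, True means "accept".
ComputationPath = Callable[[], bool]


def nd_accepts(paths: Iterable[ComputationPath]) -> bool:
    """Nondeterministic acceptance: accept iff at least one path accepts.

    any() over no accepting paths is False, so the default answer is "no"
    (view 1). A rejecting path contributes nothing and cannot cancel an
    accepting path elsewhere (view 2).
    """
    return any(path() for path in paths)
```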

  29. February 11, 2010 11:58 pm

    As far as I know, many computer peripherals (like printers) worked on NOT (active-low) logic, i.e. the line would normally sit HIGH, and the presence of a LOW was the signal for action. This is because static can create noise that looks like a HIGH when there was none, but no noise can create a LOW when the state is HIGH. I don’t know if this is still true with the advent of USB technology.

  30. February 12, 2010 1:11 pm

    Is there any way to exploit the power of the passive methods used by Westinghouse and Otis to solve computer questions?
    I think I have an example where this power is not used. When I start the Xorg server it accepts connections from the Internet. Usually X server is used only locally, so it is insecure by default. The “-nolisten tcp” parameter makes it secure. To exploit the power of the passive methods they can replace that parameter with the new parameter “-listen tcp” having obvious meaning. 🙂
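    The opt-in version of such a flag might look like this in a Python command-line parser. The program name and flag are hypothetical and are not the actual Xorg command line. With `action="store_true"`, the option is `False` unless the operator explicitly passes it, so forgetting the flag leaves the server in the safe, non-listening state.

```python
import argparse
from typing import List


def parse_server_args(argv: List[str]) -> argparse.Namespace:
    # Hypothetical program name; this is not Xorg's real option parser.
    parser = argparse.ArgumentParser(prog="display-server")
    # Opt-in: TCP listening stays off unless explicitly requested.
    parser.add_argument(
        "--listen-tcp",
        action="store_true",
        help="accept connections over TCP (off by default)",
    )
    return parser.parse_args(argv)
```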

  31. Shoshi Loeb
    February 13, 2010 7:28 pm

    Very nice post.
    It reminds me of my fascination, years back, with the way photoreceptor cells in the retina respond to light. These cells can accurately detect a single photon, and the way they do it is counter to what one expects. They release a certain amount of neurotransmitter in the dark, and when exposed to light, the amount of neurotransmitter goes down, not up. They do it by running a “dark current” of ions through their membranes, and when a photon hits the cell, the excitation triggers a sequence of events that reduces the dark current. Some background on this can be found in

  32. February 23, 2010 3:02 pm

    Looks like great info)))) Add to bookmarks. Thanks.

  33. February 28, 2010 8:58 am

    Nice Nice Nice…

  34. March 2, 2010 7:13 am

    i like your style rjlipton…

  36. October 9, 2010 12:17 pm

    Very informative. Thanks for posting this.

  37. October 16, 2010 7:22 am

    George.w was science man and good clothes man!.

  38. September 16, 2011 6:22 am

    Thanks a ton for sharing “Trains, Elevators, and Computer Science Gödel’s Lost Letter and P=NP”.
