One of the computer groups I moderate is the Yahoo "linux" group. I recently began a thread about ksplice, a method of updating the kernel while it's running, without requiring a reboot.
The following copy'n'pasted lines are from an article that appeared in that thread this morning. I thought it worth sharing since I learned something new from it; the author's name is at the end:
During normal operation, when a fault is detected, things are a little more complex than that. Normally the system is running a hot backup with two systems running side by side, in lockstep, with a "matcher" between them comparing the results of their calculations. In the event that a mismatch (indicating a fault) occurs, a "fault isolation" program fires up and determines which unit has failed. The system is then reconfigured so that the faulty unit is cut out and the two systems share the good one. A diagnostic program is then run on the faulty unit, the failing circuit board is identified, and a message is printed on the maintenance console to replace that board. Once a new board has been inserted the diagnostic is rerun and, if it passes, the system is reconfigured to use the repaired unit for one side of the system (going back to the normal configuration).
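The matcher, fault isolation, and repair cycle described above can be sketched in a few lines of Python. This is only a toy model, not how the #5 ESS actually implements it: the names `DuplexSystem`, `self_test`, `healthy`, and `faulty` are all made up for illustration, a "unit" is just a function, and the fault isolation step assumes that exactly one side fails a known-answer test.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Unit = Callable[[int], int]  # a processing unit: takes an input, returns a result


def healthy(x: int) -> int:
    return x * 2


def faulty(x: int) -> int:
    return x * 2 + 1  # simulated hardware fault: always off by one


def self_test(unit: Unit) -> bool:
    """Hypothetical diagnostic: check the unit against a known answer."""
    return unit(21) == 42


@dataclass
class DuplexSystem:
    side_a: Unit
    side_b: Unit
    isolated: Optional[str] = None  # side currently cut out ("a" or "b"), if any

    def run(self, x: int) -> int:
        # Simplex mode: one side is cut out, so the good unit serves both.
        if self.isolated == "a":
            return self.side_b(x)
        if self.isolated == "b":
            return self.side_a(x)
        # Duplex mode: both sides compute in lockstep, the matcher compares.
        result_a, result_b = self.side_a(x), self.side_b(x)
        if result_a == result_b:
            return result_a
        # Mismatch: fault isolation decides which side failed (assumes only one did).
        self.isolated = "b" if self_test(self.side_a) else "a"
        print(f"maintenance console: replace board on side {self.isolated}")
        return self.run(x)  # redo the work on the good side

    def board_replaced(self, new_unit: Unit) -> bool:
        """Rerun the diagnostic on the repaired unit; rejoin only if it passes."""
        if self.isolated is None or not self_test(new_unit):
            return False
        if self.isolated == "a":
            self.side_a = new_unit
        else:
            self.side_b = new_unit
        self.isolated = None  # back to the normal duplex configuration
        return True
```

Starting with `DuplexSystem(healthy, faulty)`, the first call to `run` trips the matcher, cuts out side b, and still returns the good side's answer; `board_replaced(healthy)` then restores duplex operation.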
How effective is this strategy? The specifications for the #5 ESS (Electronic Switching System) require no more than 2 seconds of down time over the design lifetime of the system -- 40 years. The last statistics I saw indicated that the systems were actually accumulating less than 1 second of down time over that 40-year period. As I recall, diagnostic and fault recovery software comprise about 2/3 of the code in the system.
In essence, that is what the retrofit process does. An important part of the process, though, is translation of any data in memory from the format in the old generic to that in the new generic, and not losing any in-progress information during the switch over. Upgrading the other processor is then a normal memory mirroring operation (unless the upgrade included hardware changes as well).
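To make the retrofit idea a little more concrete, here is a rough sketch of the translate, switch over, then mirror sequence. Everything in it is an assumption made for illustration: the two memory layouts (a "v1" that stores each call as `caller>callee` text and a "v2" that stores separate fields), and the function names `translate_v1_to_v2` and `retrofit`. The real generics and their data formats are not described in the post.

```python
import copy
from typing import Any, Callable, Dict, Tuple

Memory = Dict[str, Any]


def translate_v1_to_v2(old: Memory) -> Memory:
    """Hypothetical translation from the old generic's layout to the new one."""
    calls = []
    for entry in old["calls"]:
        caller, callee = entry.split(">")
        calls.append({"from": caller, "to": callee})
    return {"version": 2, "calls": calls}


def retrofit(active_memory: Memory,
             translate: Callable[[Memory], Memory]) -> Tuple[Memory, Memory]:
    """Return (new active memory, mirrored copy for the other processor)."""
    # 1. The off-line processor, running the new generic, receives a
    #    translated copy of the live data.
    upgraded = translate(active_memory)

    # 2. Refuse to switch if any in-progress call went missing in translation.
    if len(upgraded["calls"]) != len(active_memory["calls"]):
        raise RuntimeError("translation lost in-progress calls; aborting retrofit")

    # 3. Switch over: the upgraded processor becomes the active one.
    new_active = upgraded

    # 4. The other processor is brought up by ordinary memory mirroring.
    mirror = copy.deepcopy(new_active)
    return new_active, mirror


old_generic_memory = {"version": 1, "calls": ["5551000>5552000", "5553000>5554000"]}
active, mirror = retrofit(old_generic_memory, translate_v1_to_v2)
```

The check in step 2 stands in for the "not losing any in-progress information" requirement; a real system would need far more than a count comparison.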
A good while ago the Bell System Technical Journal devoted an issue to the #5 ESS that you might find interesting.
Rich Strebendt