
Catastrophic failures

Things that can break will break!
By NearEDGE | January 13, 2022 | Read time 3 min

In a perfect world, there would be no errors, no unforeseen situations and no cause for failure. Everything would run smoothly and in perfect order. The reality is different. An update breaks something. An operational error is made. Or some component slowly degrades until operation stops. This is life.

Not all failures are catastrophic. Whether a failure is catastrophic depends on the context: what is a mere annoyance in one case may be life-critical in another; just imagine a critical machine rebooting in the middle of a surgery. For the purpose of this blog, we need to define what makes a failure catastrophic, and what it is that fails. So here it goes:

  • The system whose failure concerns us is a computer operating as a server in a remote location, such as at the edge.
  • The failure is not self-healing.
  • Connectivity to the computer is permanently lost.
  • No corrective action is possible.

This definition of a catastrophic failure is pretty broad. It ranges from hardware failures, such as a failed power supply, to simple operator errors, such as IP address configuration mistakes. But whatever their nature, these failures prevent a server from performing its task or tasks, leaving it inoperative.

Failure modes

For the purpose of this blog, a failure mode is how a given failure manifests itself. In the power supply example, the manifestation is simply that the computer ceases to operate; all lights are out! On the other hand, the computer may appear to operate when a configuration mistake, such as the bad IP address above, takes place. However, from the point of view of a central operator, this failure looks the same as the power supply failure: in both cases, the system becomes unreachable.

Another failure mode is for the computer to reboot, either explicitly triggered by some monitoring process or happening unexpectedly. A reboot may heal the failure, in which case the situation is not catastrophic. It becomes catastrophic if the failure immediately reappears after the reboot; a bad IP address permanently written to a file or configuration system will again cause loss of connectivity. The worst scenario is that the original trigger, explicit or not, manifests itself again and the system reboots again. And again. And again, in an endless loop.
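
A standard way to break such a loop is a boot counter that survives reboots: count the boot attempts, and divert to a different path once a threshold is crossed. Below is a minimal sketch in Python, run early at boot, assuming the counter lives in a small file; the path, threshold and messages are illustrative only, not taken from any real product.

    from pathlib import Path

    COUNTER_FILE = Path("/var/lib/bootguard/boot_count")
    MAX_ATTEMPTS = 3  # consecutive boots allowed before diverting

    COUNTER_FILE.parent.mkdir(parents=True, exist_ok=True)
    attempts = int(COUNTER_FILE.read_text()) if COUNTER_FILE.exists() else 0
    attempts += 1
    COUNTER_FILE.write_text(str(attempts))

    if attempts > MAX_ATTEMPTS:
        # Too many boots without ever being declared healthy: assume an
        # endless loop and divert to a recovery path instead of retrying.
        print("reboot loop detected; selecting the recovery boot path")
    else:
        print(f"boot attempt {attempts}; continuing normal startup")

Once the system is confirmed healthy, for example once the central operator can reach it again, the counter is reset to zero so that the next incident starts a fresh count.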

The first failure mode, where the server goes silent, must be avoided. Mechanisms must be present in the system to prevent it. A good example of such a mechanism is a watchdog, which reboots the system if it is not serviced in time. But that alone will not cover every failure cause. Some monitoring or supervision solution must also be implemented to reboot the system when required.
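
As an illustration, here is a minimal sketch of a process servicing the Linux watchdog device. It assumes a watchdog driver is loaded and exposed as /dev/watchdog; the interval is illustrative and the health check is a placeholder.

    import time

    WATCHDOG_DEV = "/dev/watchdog"
    PING_INTERVAL = 10  # seconds; must be shorter than the hardware timeout

    def system_healthy() -> bool:
        # Placeholder for real checks: connectivity, critical services, ...
        return True

    with open(WATCHDOG_DEV, "wb", buffering=0) as wd:
        while system_healthy():
            wd.write(b"\0")          # any write resets the watchdog timer
            time.sleep(PING_INTERVAL)
        # Stop servicing the device: on drivers supporting "magic close",
        # exiting without writing 'V' leaves the timer armed, so the
        # watchdog expires and the hardware reboots the system.

The same effect is obtained if the servicing process crashes or hangs: the timer is no longer reset and the system reboots on its own.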

The reboot mode is better than the silent mode. But if it ends up in an endless reboot loop, we are not much better off. We will see below what we should do about this.

Someone at the keyboard

The original PC was conceived as a machine with someone always attending it. It was a personal device after all, and there was no need for any form of remote recovery capability. This is still largely the case today in every kind of computer, from a simple NUC all the way to full-fledged, high-end rack-mount servers. Some designs, usually targeting large public or private data centers, do provide remote-control capabilities independent of the main CPU and main OS (a baseboard management controller, for example). However, such capabilities rely on local services that a typical edge or remote site does not have. Consequently, a server installed at such a site still requires the presence of a human should something go wrong. Needless to say, this quickly becomes costly and untenable for many business use cases.

Clearly, recovering a remote server must not involve having to dispatch someone on site.

Recovery principles

If the goal is to remotely recover a server from a catastrophic failure, without dispatching someone, some implementation principles must be followed. At the very least, the following components must be provided:

  • A supervision agent must monitor remote accessibility by communicating with a central site (a minimal sketch follows this list).
  • A watchdog must protect the supervision agent; failure of the agent must cause a reboot of the system.
  • The boot solution (boot loader, etc.) must be able to take a different path. The system should not always boot with the same configuration, or even the same OS, which could be broken after all.
  • The default boot path should not be the normal production path; the default path becomes the recovery mechanism.
  • The default path must be able to communicate with the central site.
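
To make the first principle concrete, here is a minimal sketch of a supervision agent, assuming a central site reachable at a known host and port; the host name, port, message format and thresholds are all hypothetical and not part of NearEDGE's actual agent.

    import socket
    import time

    CENTRAL_HOST = "central.example.com"   # hypothetical central site
    CENTRAL_PORT = 7700                    # hypothetical port
    HEARTBEAT_INTERVAL = 30  # seconds between heartbeats
    MAX_MISSES = 4           # consecutive failures tolerated

    misses = 0
    while misses < MAX_MISSES:
        try:
            with socket.create_connection((CENTRAL_HOST, CENTRAL_PORT),
                                          timeout=10) as s:
                s.sendall(b"HEARTBEAT\n")  # replies ignored in this sketch
            misses = 0
            # Reaching the central proves remote accessibility; this is
            # also the moment to service the watchdog (see earlier sketch).
        except OSError:
            misses += 1
        time.sleep(HEARTBEAT_INTERVAL)

    # The central is unreachable: stop servicing the watchdog and let the
    # hardware force a reboot, ideally into the recovery boot path.
    raise SystemExit("central site unreachable; allowing watchdog to fire")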

Once the server is running in the default path, recovery can take place. The scope and extent of that recovery are beyond this blog, but it should aim at bringing the server back into production.

Wrap it up

NearEDGE's solution consists of a dedicated boot OS, which executes only when needed, mostly for recovery purposes. During normal operation, the boot system selects the OS installed as the production OS. This production OS is a customer OS and is not provided by NearEDGE.

The solution also includes an optional supervision agent running as a daemon in the production OS. Its supervisory features can be implemented differently in scenarios where a daemon is not desirable.
